
UNIVERSIDADE DE LISBOA

INSTITUTO SUPERIOR TÉCNICO

Light Field Based Biometric Recognition and Presentation Attack Detection

Alireza Sepasmoghaddam

Supervisor: Doctor Paulo Luís Serras Lobato Correia

Co-supervisor: Doctor Fernando Manuel Bernardo Pereira

Thesis approved in public session to obtain the PhD Degree in

Electrical and Computer Engineering

Jury final classification: Pass with Distinction and Honour

2019

Jury

Chairperson: Doctor Mário Alexandre Teles de Figueiredo, Instituto Superior Técnico, Universidade de Lisboa

Members of the Committee:

Doctor Luís Filipe Barbosa de Almeida Alexandre, Faculdade de Engenharia, Universidade da Beira Interior

Doctor Alexandre José Malheiro Bernardino, Instituto Superior Técnico, Universidade de Lisboa

Doctor Luís Eduardo de Pinho Ducla Soares, Escola de Tecnologias e Arquitectura, ISCTE - Instituto Universitário de Lisboa

Doctor Paulo Luís Serras Lobato Correia, Instituto Superior Técnico, Universidade de Lisboa

Funding Institutions

This research has been made possible with funding from the Fundação para a Ciência e a Tecnologia, Instituto de Telecomunicações, Instituto Superior Técnico, European Cooperation in Science and Technology, IEEE Signal Processing Society, and European Association for Signal

Processing.



To the memory of my beloved cousin,

Ali (1983-2019)

a beautiful noble human being who we will miss terribly;

a brother who will live intimately inside my heart.


Abstract

In a world where security issues have been gaining explosive importance, face and ear recognition

systems have attracted increasing attention in multiple application areas, ranging from forensics

and surveillance to commerce and entertainment. While the recognition performance has been

steadily improving, there are still challenging recognition scenarios and conditions, notably when

facing large variations in the biometric data characteristics. Additionally, the widespread use of

face and ear recognition solutions raises new security concerns, making the robustness against

presentation attacks a very active field of research. Lenslet light field cameras have recently come

into prominence as they are able to also capture the intensity of the light rays coming from multiple

directions, thus offering a richer representation of the visual scene, notably spatio-angular

information. To benefit from this richer representation, light field cameras have recently been

successfully applied, not only to biometric recognition, but also to biometric Presentation Attack

Detection (PAD).

This Thesis focuses on exploiting the advances in light field imaging technology towards

developing more advanced biometric recognition and PAD systems with improved performance.

In this context, new taxonomies have been developed for face and ear recognition and PAD, to

facilitate the organization and categorization of face and ear recognition and PAD solutions.

Following the proposed taxonomies, a comprehensive review of recent, representative and

relevant face and ear recognition solutions has been made.

In the context of this Thesis, two light field face and ear databases have been created, towards

allowing more powerful benchmarking for testing and validating face and ear recognition solutions

while exploiting the full light field data. Additionally, two light field face and ear artefact databases

have been created consisting of bona fide images and artefact images using different types of

presentation attack instruments, such as printed papers and digital displays.

Concerning recognition and PAD solutions, two hand-crafted light field based face and ear

recognition descriptors and five deep learning light field based face recognition descriptors have

been developed, evolving through progressive levels of functionality and performance.

Concerning PAD, this Thesis has developed two solutions for light field based face and ear PAD,

exploiting the variations associated with different directions of light captured in the light field images.

A comprehensive evaluation of the proposed and benchmarking face and ear recognition and PAD

solutions has been performed. The obtained results have shown the added value of light field

information for face and ear recognition and PAD purposes as the proposed solutions have

achieved superior recognition and PAD performance when compared to the state-of-the-art

benchmarking solutions.

Keywords: Biometric Recognition, Biometric Presentation Attack Detection, Biometric Database,

Light Field Imaging, Deep Learning.


Resumo

In a world where security issues have gained enormous importance, face and ear recognition systems are attracting ever more attention in application areas ranging from forensic analysis and surveillance to commerce and entertainment. Although the performance of recognition systems has been improving steadily, there are still challenging scenarios for current systems, notably when the biometric data are captured under less controlled conditions and exhibit significant variations. Additionally, the widespread use of automatic recognition solutions raises new security concerns, notably robustness against attacks in which the recognition system is presented not with an individual but with something intended to simulate that individual's presence. In this context, presentation attack detection is a very active research area. Recently, lenslet light field cameras have been gaining importance, as they capture the intensity of light rays coming from multiple directions, thus offering a richer representation of the visual scene, notably with spatio-angular variations. This type of camera has recently been used successfully in biometric recognition and also in biometric presentation attack detection.

This Thesis focuses on exploiting advances in lenslet light field imaging technology to develop more efficient biometric recognition and presentation attack detection systems. In this context, new taxonomies are proposed to facilitate the organization and categorization of face and ear recognition solutions and of presentation attack detection solutions. Following the proposed taxonomies, a comprehensive review of the most representative and relevant solutions has been made.

In the context of the Thesis, two new databases of lenslet light field images have been created for face and ear recognition, allowing for the first time the comparison of different recognition solutions that also exploit the angular information of the light field. Two face and ear artefact databases have also been created, containing, besides the original (bona fide) images, images of several types of attacks using artefacts such as paper prints and digital displays, in order to test presentation attack detection solutions that exploit more light field information.

This Thesis presents two new descriptors for face and ear recognition using conventional (hand-crafted) techniques and five new descriptors for face recognition using deep learning neural networks, all exploiting the additional light field information. Two new solutions are also presented for detecting presentation attacks on face and ear recognition systems, exploiting the variations associated with the different directions of light captured in light field images.

A comprehensive evaluation of the proposed face and ear recognition and presentation attack detection solutions has been carried out, in comparison with representative state-of-the-art solutions. The results obtained show the significant advantage of using the additional light field information. Indeed, the performance of the proposed recognition and presentation attack detection solutions surpasses that obtained with the representative state-of-the-art solutions.

Keywords: Biometric Recognition, Biometric Presentation Attacks, Biometric Databases, Light Field Images, Deep Learning Neural Networks.


Acknowledgments

Undertaking this PhD has indeed been a brilliant life-changing experience for me which would not

have been realized without the support and guidance that I received from many people.

First and foremost, I would like to express my deepest gratitude to my PhD supervisors, Prof.

Paulo Correia and Prof. Fernando Pereira, for their phenomenal supervision and support which not

only fostered my professional talents, but also uplifted my characteristics as a human being. Their

immense knowledge, remarkable guidance, enthusiasm for research, patience, and trust have been

a great inspiration to me during all these years. Being so, I am entirely delighted for having had

them as my PhD mentors and advisors.

My sincere thanks also go to Prof. Kamal Nasrollahi and Prof. Thomas Moeslund, who provided

me with the opportunity of joining their team at Aalborg University as a visiting researcher.

I also thank Prof. Mohammad-Shahram Moin, Dr. Maedeh Arvani, and Dr. Mohammad Rouhani

for their support in taking my career abroad.

I am grateful to the members of the doctoral committee, Prof. Mário Figueiredo, Prof. Luís

Alexandre, Prof. Alexandre Bernardino, and Prof. Luís Soares for their insightful detailed analysis

of this work, and for their interesting questions and remarks during my defense.

I am heartily grateful to my colleagues and friends at Instituto de Telecomunicações for all the

unforgettable moments we created together during my PhD years, especially Alireza

Javaheri, Alireza Tavanfar, André Guarda, Falah Jabar, Ivo Sousa, Miguel Simões, Milad

Niknejad, and Tanmay Verlekar. I have also spent many great moments with Iranian friends during

my stay in Lisbon. In particular, I would like to thank Ahmad Nadali, Arash Abbasnia, Faezeh

Rastgari, Hamdireza Yeganeh, Mohammad Farzamian, Nahal Mojarad, Niloofar Dehghani, and

Sana Hashemi Nasl.

Words cannot express how grateful I am to my mother, Zohreh, my father, Mahmoud, and my

brother, Amir, for all of the loving sacrifices they have unconditionally made for me.

Most significantly, I wish to thank from the bottom of my heart my beloved wife, Maryam, for all

her unconditional continual love, encouragements and support. It is definitely with her truly

unparalleled love that my PhD has been completed successfully.

This work would not have been possible without the financial support of Instituto de

Telecomunicações (IT), Instituto Superior Técnico (IST), Fundação para a Ciência e a Tecnologia

(FCT), European Cooperation in Science and Technology (COST), IEEE Signal Processing

Society (IEEE SPS), and European Association for Signal Processing (EURASIP). In addition, I

am thankful to all the IT staff for the assistance and facilities they provided me to conduct my PhD

research.

Alireza Sepas-Moghaddam

Lisbon, January 2019


Table of Contents

Abstract

Resumo

Acknowledgments

Table of Contents

List of Figures

List of Tables

List of Acronyms

Part I. Objectives and Basics

Chapter 1: Introduction

1.1 Context and Motivation

1.2 Objectives

1.3 Contributions

1.4 Thesis Structure

Chapter 2: Light Field Imaging: Basic Concepts and Tools

2.1 Introduction

2.2 Plenoptic Function

2.3 Light Field Acquisition

2.3.1 Multi-Camera Arrays

2.3.2 Lenslet Light Field Cameras

2.4 Lenslet Light Field Imaging: From Micro-images to Sub-Aperture Images

2.5 Added Value for Biometric Recognition and PAD

Part II. Light Field Based Face and Ear Recognition

Chapter 3: State-of-the-Art on Face and Ear Recognition


3.1 Introduction

3.2 Face/Ear Recognition Taxonomy

3.2.1 Reviewing Existing Face Recognition Taxonomies

3.2.2 Reviewing Existing Ear Recognition Taxonomy

3.2.3 Proposing a Novel Multi-Level Face/Ear Recognition Taxonomy

3.3 Face Recognition

3.3.1 Face Databases: Status Quo

3.3.2 Non-Light Field Based Face Recognition Solutions

3.3.2.1 Appearance Based Solutions

3.3.2.2 Model Based Solutions

3.3.2.3 Learning Based Solutions (excluding Deep Learning)

3.3.2.4 Deep Learning Based Solutions

3.3.2.5 Hand-Crafted Based Solutions

3.3.2.6 Hybrid Solutions

3.3.2.7 Fusion of Solutions

3.3.2.8 Non-Light Field Based Face Recognition: the Status Quo

3.3.3 Light Field Based Face Recognition Solutions

3.3.3.1 Appearance Based Solution

3.3.3.2 Hand-Crafted Based Solutions

3.3.3.3 Light Field Based Face Recognition: the Status Quo

3.4 Ear Recognition

3.4.1 Ear Databases: Status Quo

3.4.2 Ear Recognition Solutions

3.4.2.1 Appearance Based Solutions

3.4.2.2 Geometric Based Solutions

3.4.2.3 Learning Based Solutions


3.4.2.4 Hand-Crafted Based Solutions

3.4.2.5 Hybrid Solutions

3.4.2.6 Ear Recognition: the Status Quo

Chapter 4: Proposing Novel Light Field Face and Ear Recognition Databases

4.1 Introduction

4.2 Lenslet Light Field Face Recognition Database

4.2.1 Acquisition Setup and Statistics

4.2.2 Database Variations

4.2.3 Database Elements

4.2.4 Database Structure and Naming Convention

4.2.5 Database Access and Usage Conditions

4.3 Lenslet Light Field Ear Recognition Database

4.3.1 Acquisition Setup and Statistics

4.3.2 Database Variations

4.3.3 Database Elements

4.3.4 Database Structure and Naming Convention

4.3.5 Database Access and Usage Conditions

Chapter 5: Proposing Novel Light Field Face and Ear Recognition Solutions

5.1 Introduction

5.2 Face and Ear Recognition Based on Light Field Local Binary Pattern Descriptor

5.2.1 Architecture and Walkthrough

5.2.2 Light Field Local Binary Patterns Feature Description

5.3 Face and Ear Recognition Based on Light Field Histogram of Gradients Descriptor

5.3.2 Light Field Histogram of Disparity Gradients Feature Description


5.4 Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor

5.4.2 VGG-Face Feature Description

5.4.3 Fusion Strategies Comparison

5.5 Face Recognition Based on a VGG + Conventional LSTM Double-Deep Descriptor

5.5.2 SA Images Selection and Scanning

5.5.3 LSTM Angular Description

5.5.4 Softmax Classification

5.6 Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors

5.6.2 Light Field LSTM Angular Descriptors

5.6.2.1 Gate-Level Fusion LSTM Cell Architecture

5.6.2.2 State-Level Fusion LSTM Cell Architecture

5.6.2.3 Sequential Learning LSTM Cell Architecture

5.7 Summary of the Proposed Face/Ear Recognition Solutions

Chapter 6: Light Field Face and Ear Recognition Performance

6.1 Introduction

6.2 Performance Assessment Frameworks

6.2.1 Face Recognition Performance Assessment Framework

6.2.1.1 Face Recognition Test Material

6.2.1.2 Face Recognition Evaluation Protocols

6.2.1.3 Face Recognition Performance Assessment Metrics

6.2.1.4 Face Recognition Benchmarking Solutions

6.2.2 Ear Recognition Performance Assessment Framework


6.2.2.1 Ear Recognition Test Material

6.2.2.2 Ear Recognition Evaluation Protocol and Metrics

6.2.2.3 Ear Recognition Benchmarking Solutions

6.3 LFLBP Descriptor Parameter Setting

6.3.1 LFLBP Descriptor Parameter Setting: View Radius

6.3.2 LFLBP Descriptor Parameter Setting: Number of Views and Starting Angle

6.4 Light Field Histogram of Disparity Gradients Descriptor Parameters

6.5 LSTM Hyper-Parameter Setting

6.5.1 Hyper-Parameter Evaluation: Hidden Layer Size

6.5.2 Hyper-Parameter Evaluation: Batch Size

6.5.3 Hyper-Parameter Evaluation: Number of Training Epochs

6.5.4 SA Images Selection Evaluation

6.6 Face Recognition Accuracy

6.7 Ear Recognition Accuracy

Part III. Light Field Based Face and Ear Presentation Attack Detection

Chapter 7: State-of-the-Art on Face Presentation Attack Detection

7.1 Introduction

7.2 Proposed Face Presentation Attack Detection Taxonomy

7.3 Face Artefact Databases

7.3.1 Non-Light Field Face Artefact Databases

7.3.2 Light Field Face Artefact Databases

7.4 Non-Light Field Based Face PAD Solutions

7.4.1 Texture Based Methods

7.4.2 Quality Based Methods

7.4.3 Learning Based Methods

7.4.4 Focus/Depth Based Methods


7.4.5 Motion Based Methods

7.5 Light Field Based Face PAD Solutions

7.5.1 Texture Based Methods

7.5.2 Focus/Depth Based Methods

7.6 Adaptation of Face Presentation Attack Detection Solutions for Ear Biometrics

Chapter 8: Proposing Novel Light Field Face and Ear Artefact Databases

8.1 Introduction

8.2 Light Field Based Face Artefact Database

8.2.1 Acquisition Setup and Statistics

8.2.2 Presentation Attack Instruments

8.2.3 Database Elements

8.2.4 Database Access and Usage Conditions

8.3 Light Field Based Ear Artefact Database

8.3.1 Acquisition Setup and Statistics

8.3.2 Presentation Attack Instruments

8.3.3 Database Elements

8.3.4 Database Access and Usage Conditions

Chapter 9: Proposing Novel Light Field Face and Ear Presentation Attack Detection Solutions

9.1 Introduction

9.2 PAD Based on Light Field Angular Local Binary Pattern Descriptor

9.3 PAD Based on Light Field Histogram of Disparity Gradient Descriptor

Chapter 10: Light Field Face and Ear Presentation Attack Detection Performance

10.1 Introduction

10.2 Performance Assessment Framework

10.2.1 Test Material


10.2.2 Evaluation Metrics

10.2.3 Benchmarking Methods

10.3 Face PAD Performance

10.3.1 Face PAD Accuracy Evaluation

10.3.2 Face PAD Color Features Accuracy Evaluation

10.3.3 Face PAD Generalization Accuracy Evaluation

10.3.4 Face PAD Computational Complexity

10.4 Ear PAD Performance

10.4.1 Ear PAD Accuracy Evaluation

10.4.2 Ear PAD Generalization Accuracy Evaluation

10.4.3 Ear PAD Computational Complexity

Part IV. Conclusion

Chapter 11: Summary of Contributions

11.1 Introduction

11.2 Light Field Face and Ear Recognition

11.3 Light Field Biometric Presentation Attack Detection

Chapter 12: Future Research Directions

12.1 Introduction

12.2 Future Research Directions for Light Field Face and Ear Recognition

12.3 Future Research Directions for Light Field Based Face and Ear Presentation Attack Detection

References


List of Figures

Figure 1.1: Possible attack points in a generic biometric system [14].

Figure 1.2: Structured representation of the main contributions of this Thesis.

Figure 1.4: Thesis organization (highlighting in gray the Thesis contributions).

Figure 2.1: Visualization of the plenoptic function.

Figure 2.2: Parameterization of light rays in lenslet light field cameras using two planes.

Figure 2.3: Multi-camera arrays arrangements: (a) Regular, rectangular arrangement of cameras: Stanford multi-camera array [51]; (b) Regular, circular arrangement; (c) Irregular arrangement of cameras on Light L16 [50].

Figure 2.4: Lenslet light field imaging based on micro-lens array.

Figure 2.5: Lenslet light field cameras: (a) unfocused and (b) focused architectures [54].

Figure 2.6: Lenslet light field cameras: (a) Lytro first generation camera [26]; (b) Lytro ILLUM lenslet camera [26]; and (c) Raytrix R11 camera [57].

Figure 2.7: The lenslet light field pre-processing architecture.

Figure 2.8: Raw lenslet light field representation, before color demosaicing (each position corresponds to a R, G or B intensity).

Figure 2.9: Raw lenslet light field representation, after color demosaicing.

Figure 2.10: Light field multi-view array of SA images and central rendered 2D image.

Figure 2.11: A light field SA image (a) before and (b) after color and gamma corrections.

Figure 3.1: Multi-level taxonomy for face recognition solutions [64].

Figure 3.2: Taxonomy of ear recognition solutions [5].


Figure 3.3: Proposed multi-level face/ear recognition taxonomy. .............................................. 27

Figure 3.4: Face/ear structure level: (a) global; (b) component + structure; and (c) component

representation face structures. ................................................................................................... 28

Figure 3.5: Ear structure [5]. ..................................................................................................... 28

Figure 3.6: Feature support level: Global feature support with (a) global and (b) component face

structures; Local feature support with (c) global and (d) component face structures. ................. 29

Figure 3.7: Overview of the evolution of face recognition solutions over time, grouped based on

feature extraction approaches; performance values for the LFW database. ................................ 40

Figure 3.8: Overview of the evolution of ear recognition solutions over time, grouped based on

feature extraction approaches; performance values for the AWE database. ............................... 49

Figure 4.1: Acquisition setup at (a) IST; and (b) EURECOM. .................................................. 52

Figure 4.2: A sketch of the LLFFD acquisition setup. ............................................................... 53

Figure 4.3: Age distribution for the subjects in IST-EURECOM LLFFD. ................................. 53

Figure 4.4: Illustration of 2D rendered images for the facial variations in the IST-EURECOM

LLFFD. .................................................................................................................................... 54

Figure 4.5: Sample depth map. ................................................................................................. 55

Figure 4.6: Illustration of facial landmarks. ............................................................................. 55

Figure 4.7: Metadata associated to each subject. ....................................................................... 56

Figure 4.8: IST-EURECOM LLFFD file structure. ................................................................... 56

Figure 4.9: Illustration of IST-EURECOM LLFEDB 2D rendered ear images for the four profiles

of a specific subject in two separate acquisition sessions........................................................... 58

Figure 4.10: Examples of partially occluded ear images: (a) ear piercing; (b) earring; (c) hair; and

(d) combination of occlusions. .................................................................................................. 59


Figure 4.11: IST-EURECOM LLFEDB file structure. .............................................................. 60

Figure 1.3: Summary of the proposed recognition solutions. ..................................................... 63

Figure 5.1: Architecture of the proposed face and ear recognition solution based on LFLBP hand-

crafted descriptor. ..................................................................................................................... 64

Figure 5.2: Examples of selected SA images (red). The SA images highlighted in dark grey do

not contain usable image information due to the micro-lens shape. ........................................... 65

Figure 5.3: LFALBP descriptor extraction example. ................................................................. 67

Figure 5.4: Proposed spatial and angular descriptors combination framework. .......................... 68

Figure 5.5: Architecture of the proposed face and ear recognition solution based on the fused

LFHG hand-crafted descriptor. ................................................................................................. 69

Figure 5.6: Division of an ear sample disparity magnitude map into 8×8 sample cells and

overlapping 2×2 cell blocks. ..................................................................................................... 70

Figure 5.7: Architecture of the proposed face recognition solution based on a

2D+disparity+depth fused deep descriptor. ............................................................................... 73

Figure 5.8: Architecture of the proposed face recognition solution based on VGG + Conv-LSTM

double-deep descriptor. ............................................................................................................ 75

Figure 5.9: (a) High-density SA images selection topology; (b) row-major scanning order; (c)

snake-like scanning order; (d) max-disparity SA images selection topology; (e) mid-density

horizontal SA images selection topology; (f) mid-density vertical SA images selection. ........... 77

Figure 5.10: Score-level fusion for combining the horizontal and vertical angular information. 77

Figure 5.11: Architecture of a Conv-LSTM cell with peephole connections (indicated by a

dashed line). ............................................................................................................................. 79

Figure 5.12: Architecture of the proposed face recognition solution based on VGG + Light Field

LSTM double-deep descriptors. ................................................................................................ 82


Figure 5.13: Architecture of a GLF-LSTM cell. ........................................................................ 84

Figure 5.14: Architecture of a SLF-LSTM cell. ........................................................................ 86

Figure 5.15: Architecture of a SeqL-LSTM cell. ....................................................................... 87

Figure 6.1: IST-EURECOM LFFD (non-cropped) database division into training, validation and

testing sets for (a) Protocol 1; (b) Protocol 2; and (c) Protocol 3. .............................................. 93

Figure 6.2: CRR5 versus R for LFALBP_{R,45º,4}. ........................................................................... 97

Figure 6.3: Rank-1 recognition results versus hidden layer size considering all proposed SA

image selection methods for: (a) Protocol 1 and (b) Protocol 3. ................................................ 99

Figure 6.4: Rank-1 recognition results versus the batch size considering all proposed SA image

selection methods for: (a) Protocol 1, and (b) Protocol 3. .......................................................... 99

Figure 6.5: Rank-1 recognition results versus number of training epochs considering all proposed

SA image selection methods for: (a) Protocol 1 and (b) Protocol 3. ........................................ 100

Figure 6.6: Ear recognition cumulative recognition rank curves (up to CRR50) for the proposed

recognition and best performing benchmarking solutions........................................................ 105

Figure 7.1: Proposed taxonomy for face PAD solutions. ......................................................... 110

Figure 7.2: Illustration of different types of PAIs. ................................................................... 112

Figure 7.3: Illustration of GUC-LiFFAD face artefact acquisition [33]. .................................. 114

Figure 8.1: IST LLFFSB face artefact acquisition pipeline. .................................................... 123

Figure 8.2: IST LLFFSD example: Illustration of 2D central view rendered images for: (a) bona

fide face; (b) print paper attack; (c) wrapped print paper attack; (d) laptop attack; (e) tablet

attack; (f) mobile attack 1; and (g) mobile attack 2. ................................................................ 124

Figure 8.3: Illustration of IST LLFEADB images for a bona fide sample and corresponding

artefact samples for four different PAIs. ................................................................................. 126


Figure 8.4: IST LLFEADB ear artefact acquisition pipeline. ................................................... 126

Figure 8.5: Multi-view sub-aperture image array for an artefact ear image. ............................. 127

Figure 9.1: Architecture of the proposed face and ear PAD solution based on LFALBP hand-

crafted descriptor. ................................................................................................................... 130

Figure 9.2: Architecture of the proposed face and ear PAD solution based on LFHDG descriptor.

............................................................................................................................................... 132

Figure 10.1: DET face PAD performance for the proposed and benchmarking solutions using

IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f) wrapped paper

PAIs. ...................................................................................................................................... 138

Figure 10.2: ACER face PAD performance for the proposed and benchmarking solutions with n-

fold cross-validation. .............................................................................................................. 139

Figure 10.3: DET face PAD generalization performance for the proposed and benchmarking

solutions using IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f)

wrapped paper PAIs. .............................................................................................................. 141

Figure 12.1: Mobile phones equipped with multiple cameras. .................................................... 154


List of Tables

Table 3.1: Overview of selected, prominent face databases with different (Diff.) characteristics.

................................................................................................................................................. 31

Table 3.2: Classification of a selection of non-light field based face recognition solutions based

on the proposed taxonomy. ....................................................................................................... 32

Table 3.3: Classification of the prior and proposed (Prop.) light field based face recognition

solutions, based on the proposed taxonomy. ............................................................................. 41

Table 3.4: Overview of prior and the proposed light field based face recognition solutions. ...... 42

Table 3.5: Overview of ear databases with different (Diff.) characteristics. ............................... 44

Table 3.6: Classification of a selection of ear recognition solutions based on the developed

taxonomy. ................................................................................................................................ 45

Table 4.1: List of acronyms used in IST-EURECOM LLFFD along with their definitions. .. 57

Table 4.2: Metadata associated to each subject in each acquisition session. .............................. 60

Table 5.1: Face rank-1 recognition performance for the 2D baseline descriptor and three

alternative fusion strategies (best results in bold). ..................................................................... 74

Table 5.2: Overview of the recognition solutions proposed in this Thesis. ................................ 89

Table 6.1: Overview of the face recognition benchmarking solutions. ....................................... 95

Table 6.2: Overview of the ear recognition benchmarking solutions. ........................................ 96

Table 6.3: RR1 and CRR5 for LFLBP for different values of A and N (best results in bold). ....... 97

Table 6.4: Selected configuration for the Conv-LSTM and the proposed GLF-LSTM, SLF-LSTM and SeqL-LSTM architectures for face recognition...................................................... 101


Table 6.5: Protocol 1 assessment: Face recognition rank-1 for the proposed and benchmarking

recognition solutions (best results in bold). ............................................................................. 101





Table 6.8: Protocol 1 average rank-1 recognition results for some 2D baseline solutions against

their light field based variants. ................................................................................................ 103





Table 6.11: Ear recognition CRR1 up to CRR3 for the proposed recognition and benchmarking

solutions (best results in bold). ............................................................................................... 105

Table 7.1: Overview of publicly available non-light field face artefact databases. ................... 112

Table 7.2: Overview of publicly available light field artefact databases. ................................. 113

Table 7.3: Overview of non-light field face PAD solutions. .................................................... 116

Table 7.4: Overview of light field face PAD solutions. ........................................................... 119

Table 10.1: Overview of PAD benchmarking solutions........................................................... 137

Table 10.2: ACER face PAD performance for the proposed and benchmarking solutions using

IST LLFFSD (minimum errors in bold). ................................................................................. 138


color or gray information (minimum errors in bold). ............................................................... 140


Table 10.4: ACER face PAD generalization performance for the proposed and benchmarking

solutions using IST LLFFSD (minimum errors in bold). ......................................................... 141

Table 10.5: Average extraction and classification times, and feature vector size for the proposed

and benchmarking face PAD solutions (minimum values in bold). ......................................... 142

Table 10.6: ACER ear PAD performance for the proposed and benchmarking solutions using IST

LLFEADB baseline set (minimum errors in bold)................................................................... 143

Table 10.7: ACER ear PAD performance for the proposed and benchmarking solutions using IST

LLFEADB extended set (minimum errors in bold). ................................................................ 144

Table 10.8: ACER ear PAD generalization performance for the proposed and benchmarking

solutions using IST LLFEADB baseline set (minimum errors in bold). ................................... 144


solutions using IST LLFEADB extended set (minimum errors in bold). ................................. 144


and benchmarking ear PAD solutions (minimum values in bold). ........................................... 145


List of Acronyms

2D 2 Dimensional

3D 3 Dimensional

3DMAD 3D Mask Attack Database

3DMM 3D Morphable Model

4D 4 Dimensional

ACER Average Classification Error Rate

AHFSVD Adaptive High-Frequency Singular Value Decomposition

ALTP Adaptive Local Ternary Pattern

APCER Attack Presentation Classification Error Rate

ASVDF Adaptive Singular Value Decomposition Face

BPCER Bona Fide Presentation Classification Error Rate

BPR Bayesian Patch Representation

BSIF Binarized Statistical Image Features


BU-3DFE Binghamton University 3D Facial Expression


CASIA Chinese Academy of Sciences, Institute of Automation

GLF-LSTM Gate-Level Fusion Long Short-Term Memory

CNN Convolutional Neural Network

Conv-LSTM VGG + Conventional Long Short-Term Memory

CRR Cumulative Recognition Rate


CS Centre-symmetric

CSLBP Center-Symmetric Local Binary Patterns

DB Database

DBN Deep Belief Network

DCP Dual-Cross Pattern

DET Detection Error Tradeoff

DFW Disguised Faces in the Wild

DLBP Depth Local Binary Pattern

DLM Dynamic Link Matching

DM Depth Map

DOG Difference of Gaussians

DP Decision Pyramid

DPC Decision Pyramid Classifier

EBGM Elastic Bunch Graph Matching

ELBP Extended Local Binary Patterns

FAR False Acceptance Rate

FPGA Field Programmable Gate Array

FERET FacE REcognition Technology

FRR False Rejection Rate

GGZ Global-Gabor-Zernike

GLCM Grey-Level Co-occurrence Matrices


G-NSRC Gabor scale feature based non-Negative Sparse Representation Classification


GPCA Generalized PCA

GPU Graphics Processing Unit

GRBP Global RBP

GW Gabor Wavelet

HDCA High Density Camera Array

HOG Histogram of Oriented Gradients

HSV Hue-Saturation-Value

ICA Independent Component Analysis

IJB-A IARPA Janus Benchmark A

KED Kernel Extended Dictionary

KLT Kanade-Lucas-Tomasi

KNN k-Nearest Neighbour

KPCA MM Kernel PCA Mixture Model

LBP Local Binary Pattern

LBPNET Local Binary Pattern Network

LCCP Local Contourlet Combined Patterns

LDA Linear Discriminant Analysis

LDF Local Difference Feature

LF Light Field

LFALBP Light Field Angular Local Binary Pattern

LFHDG Light Field Histogram of Disparity Gradients

LFHG Light Field Histogram of Gradients

LFHOG Light Field Histogram of Oriented Gradients


LFLBP Light Field Local Binary Patterns

LFR Light Field Raw

LFW Labeled Faces in the Wild

LGPDP Local Gabor Phase Difference Pattern

LiFFAD Light Field Face Artefact Database

LiFFID Light Field Face and Iris Database

LLFEADB Lenslet Light Field Ear Artefact Database

LLFEDB Lenslet Light Field Ear DataBase

LLFFD Lenslet Light Field Face Database

LLFFSD Lenslet Light Field Face Spoofing Database

LPQ Local Phase Quantization

LPS Local Pattern Selection

LR Logistic Regression

LRBP Local RBP

LRT Local Radon Transform

LSM Local Shape Map

LSTM Long Short-Term Memory

MB-LBP Multi-scale Block Local Binary Patterns

MCP Mean based Contrast Patterns

MDML-DCP Multi-Directional Multi-Level Dual-Cross Pattern

ME-CS-LDP Multi-resolution Elongated Centre-Symmetric Local Derivative Pattern

MFSD Mobile Face Spoofing Database

MLBP Multi-scale Local Binary Pattern


MLFP Multispectral Latex Mask based Video Face Presentation Attack

MLP Multi-Layer Perceptron

MOBIO MObile BIOmetry

MPCA Multilinear Principal Component Analysis

MS-PCANET Multi-Scaled PCA Network

NIR Near Infra-Red

NNC Nearest Neighbor Classifier

NUAA Nanjing University of Aeronautics and Astronautics

PAD Presentation Attack Detection

PAI Presentation Attack Instrument

PCA Principal Component Analysis

PHOW Pyramidal Histogram Of visual Words

PIN Personal Identification Number

PIPA People In Photo Albums

PLS Partial Least Squares

POEM Patterns of Oriented Edge Magnitudes

PPR Probabilistic Patch Representation

RBF Radial Basis Function

RBP Riesz Binary Pattern

RNN Recurrent Neural Networks

RR Recognition Rate

SA Sub-Aperture

Scface Surveillance Cameras face


SeqL-LSTM Sequential Learning Long Short-Term Memory

SIFT Scale Invariant Feature Transform

S-KDA Specific Kernel Discriminant Analysis

SLBP Spatial Local Binary Pattern

SLF-LSTM State-Level Fusion Long Short-Term Memory

SMAD Silicone Mask Attack Database

SML Sum-Modified-Laplacian

SoF Specs on Faces

SRC Sparse Reconstruction Classifier

SVM Support Vector Machine

TDSIFT Texture and Depth Scale Invariant Feature Transform

TRIVET TransfeR Nir-Vis heterogeneous facE recognition neTwork

U-3DMM Unified 3D Morphable Model

VGG Visual Geometry Group

VGG-D3 2D+Disparity+Depth VGG

VIS VISible


Part I. Objectives and Basics


Chapter 1

Introduction

1.1 Context and Motivation

Nowadays, automatically recognizing the identity of a person is of paramount importance in

various application domains, from forensics and surveillance to commerce and entertainment.

Biometric recognition, referring to the automated recognition of individuals based on their

biological and behavioral traits, appears as a viable alternative to more traditional approaches such

as Personal Identification Numbers (PINs) or passwords [1] [2]. There are multiple types of

biometric modalities available, such as fingerprint, iris, face, ear, and gait, and they are in use in

multiple types of applications. Each biometric modality has its strengths and weaknesses, and the

choice mainly depends on the application scenario [1]. Depending on the application context, the

generic term recognition may refer to either verification or identification. In a verification

system, a person claims a specific identity and the system either confirms or denies that claim. In

an identification system, the one considered in this Thesis, recognition of an individual happens

by searching the templates of all the users in the database for a match, without that individual first

claiming an identity. In the following, identification is simply called recognition.
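The distinction between verification (one-to-one) and identification (one-to-many) can be illustrated with a minimal sketch. The cosine similarity measure, the 0.8 threshold, the toy three-dimensional templates and the subject names are illustrative assumptions, not the matching scheme used in this Thesis.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe, claimed_id, gallery, threshold=0.8):
    """Verification (1:1): confirm or deny a claimed identity.
    The threshold value is an arbitrary, illustrative choice."""
    return cosine_similarity(probe, gallery[claimed_id]) >= threshold

def identify(probe, gallery):
    """Identification (1:N): rank all enrolled identities by similarity,
    without any identity being claimed beforehand."""
    scores = {sid: cosine_similarity(probe, tmpl) for sid, tmpl in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical enrolled templates (toy feature vectors).
gallery = {
    "subject_a": np.array([1.0, 0.0, 0.0]),
    "subject_b": np.array([0.0, 1.0, 0.0]),
    "subject_c": np.array([0.0, 0.0, 1.0]),
}
probe = np.array([0.1, 0.9, 0.1])

ranking = identify(probe, gallery)            # best match first
claim_b = verify(probe, "subject_b", gallery)  # genuine claim
claim_a = verify(probe, "subject_a", gallery)  # impostor claim
```

In this sketch, the first entry of the ranking corresponds to the rank-1 identification decision, while verification only compares the probe against the single claimed template.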

This Thesis focuses on the face and ear biometric modalities. Face recognition is a nonintrusive

method, and facial images are probably the most common biometric characteristic used by humans

to perform personal recognition. Face recognition systems have been successfully used in various

application areas with high acceptability, collectability and universality [1] [3]. After the first

automatic face recognition algorithms emerged more than four decades ago [4], this area has

attracted much research and has seen substantial progress. Ear recognition has

also evolved as a reliable biometric modality for human identification over recent years, with its

potential stemming from the specific ear structure, which significantly varies across different

people, while remaining stable over time for the same person and without changing significantly across

different facial emotions and actions [5] [6]. It has also proved useful for facial profile based

recognition [7] [8] and in combination with other modalities in multimodal biometric systems [9]

[10].


Despite the significant progress in biometric recognition performance, the widespread use of

biometric recognition applications raises new security concerns, making the robustness against

presentation attacks a very active field of research [11] [12] [13]. The security of a biometric

recognition system can be compromised at different architectural points, as illustrated in Figure

1.1, all the way from the biometric trait presentation to the final recognition decision. The attacks

to a biometric recognition system can be broadly divided into indirect and direct attacks [14].

Indirect attacks are performed inside the recognition system to bypass the feature extractor,

matcher, or to tamper with the template database. Direct attacks, also referred to as spoofing or presentation

attacks, are performed outside the biometric system by presenting falsified data, or artefacts, in

front of the acquisition sensors, for instance using printed photos or electronic devices displaying

a face or an ear. While the recognition system robustness against indirect attacks can be increased

using conventional protection mechanisms, such as data encryption and intrusion prevention and

detection [15], it is also critical to incorporate in the recognition systems efficient Presentation

Attack Detection (PAD) solutions. According to ISO/IEC JTC1 SC37, a Presentation Attack

Instrument (PAI) is "an artificial object or representation presenting a copy of biometric

characteristics or synthetic biometric patterns" [11]. The most common types of attacks produced

by different PAIs involve: i) a printed face on a paper or a wrapped paper, simulating the human

face curvature; ii) a displayed face image or video on the screen of a portable device such as a

laptop, tablet, or mobile phone; and iii) 3D masks of various types [13].

Figure 1.1: Possible attack points in a generic biometric system [14].

The availability of richer imaging sensors is opening a new range of possibilities, not only for

biometric recognition but also for PAD solutions [2] [16]. Besides conventional 2D sensors, depth

sensors, as used by Microsoft Kinect, and Near Infra-Red (NIR) cameras, have been used for face

and ear biometric recognition and PAD [17] [18] [19] [20] [21] [22] [23]. Additionally, light field

imaging technologies [24] [25] have recently come into prominence with commercial lenslet light

field cameras, such as Lytro [26], available in the market. These cameras capture not only the

intensity of light on a specific 2D plane position but also the intensity of the light rays arriving

from multiple directions in space. Light field cameras are receiving increasing interest from the


biometrics and forensics communities, for both biometric recognition [27] [28] [29] [30] [31] and

PAD [32] [33] [34].

The key advantage of the light field imaging sensors for biometric recognition and PAD comes

from the richer scene representation, allowing a posteriori refocusing, disparity exploitation and

depth map exploitation. Preliminary works [27], [28], [29], [30], [31], [32], [33], [34] have shown

the effectiveness of the supplementary information captured by light field cameras for biometric

recognition and PAD applications, even when considering one single shot. Most of these works

have explored the possibility of creating multiple focus images, rendered from the same light field

image acquisition, for instance using super-resolution and fusion schemes for the biometric

recognition and PAD tasks. The results demonstrate the added value of light field imaging in terms

of post-capture refocusing capability, and improved biometric recognition and PAD accuracy,

when compared with conventional 2D images.
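As an illustration of a posteriori refocusing, the sketch below applies the generic shift-and-sum idea to a light field stored as a 4D array of sub-aperture (SA) images. The array layout `lf[u, v, y, x]`, the integer-pixel shifts and the synthetic single-point scene are assumptions made for this example; it is not the rendering pipeline of any specific camera nor the processing adopted in this Thesis.

```python
import numpy as np

def central_view(lf):
    """Return the central SA image of a light field laid out as lf[u, v, y, x]."""
    n_u, n_v = lf.shape[:2]
    return lf[n_u // 2, n_v // 2]

def refocus(lf, slope):
    """Shift-and-sum refocusing: shift each SA image in proportion to its
    angular offset from the centre, then average all views.
    np.roll wraps around at the borders, which is acceptable for a toy example."""
    n_u, n_v = lf.shape[:2]
    c_u, c_v = n_u // 2, n_v // 2
    acc = np.zeros(lf.shape[2:], dtype=np.float64)
    for u in range(n_u):
        for v in range(n_v):
            dy = int(round(slope * (u - c_u)))
            dx = int(round(slope * (v - c_v)))
            acc += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return acc / (n_u * n_v)

# Synthetic 3x3-view light field: one bright point with a disparity of
# one pixel between adjacent views.
lf = np.zeros((3, 3, 9, 9))
for u in range(3):
    for v in range(3):
        lf[u, v, 4 + (u - 1), 4 + (v - 1)] = 1.0

sharp = refocus(lf, slope=-1.0)   # shifts cancel the disparity: point in focus
blurred = refocus(lf, slope=0.0)  # no compensation: point spread over 9 pixels
```

Varying the slope changes which disparity (and hence which depth) is brought into focus, which is how a stack of differently focused 2D images can be rendered from a single light field acquisition.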

While the preliminary works processed multiple 2D images at different focus or depth, biometric

recognition and PAD systems based on light field imaging can be further extended in other

directions. More precisely, by processing the rich information associated with a light field in its native

form, there is potential to improve the performance of current biometric recognition and PAD

systems, which is the direction considered in this Thesis.

1.2 Objectives

This Thesis focuses on exploring the advances in light field imaging technology and applying them

to develop advanced face and ear recognition and PAD systems with improved performance. The

main research question being addressed in this work is: how to exploit the additional information

available in a light field image to improve the performance of face and ear recognition and PAD

systems?

In the context described above, the Thesis targets four main objectives:

1. Review recent advances in light field based face and ear recognition and PAD databases and

solutions.

2. Create and publicly provide to the research community new light field databases for designing,

testing and validating light field based face and ear recognition and PAD solutions.

3. Design new light field based face and ear recognition and PAD solutions to exploit the richer

information available in light field images.

4. Perform appropriate performance assessment, including benchmarking with the state-of-the-

art in face and ear recognition and PAD solutions, to assess the performance of the proposed

solutions, in terms of accuracy, generalization and complexity, while ensuring the

reproducibility of results.

1.3 Contributions

Following the main objectives defined above, the main contributions of this Thesis are illustrated

in Figure 1.2. The contributions are organized based on two main tasks, i.e., biometric recognition


and PAD, for the face and ear biometric modalities. These contributions will be presented in Part

II (Chapters 3, 4, 5, and 6) and Part III (Chapters 7, 8, 9, and 10), respectively.

Figure 1.2: Structured representation of the main contributions of this Thesis.

Part II – Light Field Based Biometric Recognition (Chapters 3, 4, 5, and 6)

In the context of light field based biometric recognition, this Thesis proposes the following main

contributions: i) a new taxonomy for face and ear recognition systems; ii) light field face and ear

databases; iii) two hand-crafted light field based descriptors for light field face and ear recognition;

and iv) five deep learning light field based solutions, evolving through progressive levels of

functionality and performance, for light field face recognition.

The contributions are summarized in the following.

1. Multi-Level Face and Ear Recognition Taxonomies

To better understand the technological landscape in the area of face and ear recognition systems,

this work proposes a new, more encompassing multi-level taxonomy for face/ear recognition

solutions, thus facilitating the organization and categorization of face and ear recognition solutions. The

proposed multi-level taxonomy considers four levels including: i) face/ear structure; ii) feature

support; iii) feature extraction approach; and iv) feature extraction sub-approach. Following the

proposed taxonomy, a comprehensive review of recent, representative and relevant face and ear

recognition solutions has been conducted. As a result of this work, a research paper is under preparation

to be submitted to an international journal.
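The four taxonomy levels listed above can be pictured as a simple record attached to each reviewed solution. The sketch below is only an illustration: the class and field names are hypothetical, the values of the first two levels are taken from the captions of Figures 3.4 and 3.6, and the last two levels are left as free text because their categories are defined later in the Thesis.

```python
from dataclasses import dataclass, fields
from enum import Enum

class Structure(Enum):
    """Level 1: face/ear structure (values as in the Figure 3.4 caption)."""
    GLOBAL = "global"
    COMPONENT_PLUS_STRUCTURE = "component + structure"
    COMPONENT = "component"

class FeatureSupport(Enum):
    """Level 2: feature support (values as in the Figure 3.6 caption)."""
    GLOBAL = "global"
    LOCAL = "local"

@dataclass(frozen=True)
class TaxonomyLabel:
    """Hypothetical record classifying one recognition solution."""
    structure: Structure
    feature_support: FeatureSupport
    extraction_approach: str      # Level 3, free text in this sketch
    extraction_sub_approach: str  # Level 4, free text in this sketch

# Placeholder classification of a made-up solution.
example = TaxonomyLabel(
    Structure.GLOBAL,
    FeatureSupport.LOCAL,
    "hypothetical approach",
    "hypothetical sub-approach",
)
LEVELS = [f.name for f in fields(TaxonomyLabel)]
```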

2. The IST-EURECOM Lenslet Light Field Face Database

A new lenslet light field face database has been proposed, the so-called IST-EURECOM Lenslet

Light Field Face Database (IST-EURECOM LLFFD), including data from 100 subjects, with 20


samples per person, captured by a Lytro ILLUM lenslet camera. This database was captured

in cooperation with EURECOM, with the images of 50 persons being captured in each of the

collaborating institutions, with the IST acquisition setup being replicated in the EURECOM lab.

The images are captured in a controlled acquisition setup with different facial variations, including

emotions, actions, poses, illuminations, and occlusions thus exposing the non-intrusive nature of

face recognition. The database includes the raw light field images, sample 2D rendered images

and the associated depth maps, along with a rich set of metadata. This research work led to the

following publication [35]:

A. Sepas-Moghaddam, V. Chiesa, P. Correia, F. Pereira, and J. Dugelay, "The IST-EURECOM

light field face database," International Workshop on Biometrics and Forensics, Coventry,

UK, Apr. 2017.

3. IST-EURECOM Lenslet Light Field Ear Database

To connect light field imaging with ear recognition research, the IST-EURECOM Lenslet Light

Field Ear DataBase (LLFEDB) has been created with a Lytro ILLUM lenslet camera and made

publicly available as a basis for testing and validating light field based ear recognition systems.

The database consists of 536 light field ear images from 67 subjects, with 8 shots per person,

captured over two separate sessions with four different poses per session. This research work led

to the following publication [36]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear recognition in a light field imaging

framework: a new perspective", IET Biometrics, Vol. 7, No. 3, pp. 224-231, May 2018.

4. Face and Ear Recognition Based on Light Field Local Binary Patterns Descriptor

Face and ear recognition solutions have been proposed based on a novel simple, yet effective hand-

crafted descriptor, named Light Field Local Binary Patterns (LFLBP), able to exploit the richer

information available in light field images. LFLBP is a combined descriptor with two main

components, the spatial Local Binary Pattern (LBP) and the angular LBP, able to capture not only

the usual spatial information but also the light field angular information associated to the set of

sub-aperture images, corresponding to different viewpoints. When compared with alternative

methods, the proposed descriptor has shown good face and ear recognition performance under

varied and challenging acquisition conditions. This research work led to the following publication

[37]:

A. Sepas-Moghaddam, P. Correia, and F. Pereira, "Light field local binary patterns description

for face recognition", IEEE International Conference on Image Processing, Beijing, China,

Sep. 2017.

5. Face and Ear Recognition Based on Light Field Histogram of Gradients Descriptor

Another light field based recognition solution has been proposed, able to exploit the spatio-angular

information available in light field images. This novel recognition solution is based on a new light

field hand-crafted descriptor, named Light Field Histogram of Gradients (LFHG), fusing a non-

light field based descriptor, the so-called Histogram of Oriented Gradients (HOG), with a light

field based descriptor, called Light Field Histogram of Disparity Gradients (LFHDG). The LFHG

descriptor considers both the orientation and magnitude for the spatial and angular information,

while the solution described in 4. only captures the magnitude for the spatial and angular

information. Thus, it is expected that this fused descriptor improves face and ear recognition

performance. This research work led to the following publication [36]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear recognition in a light field imaging

framework: a new perspective", IET Biometrics, Vol. 7, No. 3, pp. 224-231, May 2018.

6. Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor

Recognizing the importance of deep learning in biometric recognition, a light field face recognition

solution has been proposed, based on a VGG 2D+Disparity+Depth (VGG-D3) fused deep

descriptor. The VGG-D3 description is formed by concatenating descriptions extracted from 2D

images as well as disparity and depth maps using VGG-Face descriptor [38]. This solution is the

first adopting a fused deep CNN representation to exploit the complementary information available

in the light field for face description and then recognition. The VGG-Face descriptor, trained over

2.6 million face images, is computed based on a VGG- 16 network, ignoring the last fully connected

layer in the architecture, to extract a feature vector with 4096 elements. The exploitation of disparity

maps together with 2D images and depth maps, in the context of a fusion scheme, is a novel

approach never tried in the literature, acknowledging that disparity and depth maps may bring some

complementary information to the recognition task. This research work led to the following

publication [39]:

A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund, and F. Pereira, “Light field

based face recognition via a fused deep representation”, IEEE International Workshop on

Machine Learning for Signal Processing, Aalborg, Denmark, Sep. 2018.

7. Face Recognition Based on a VGG + Conventional LSTM Double-Deep Descriptor

A face recognition solution based on a double-deep descriptor, so-called VGG + Conventional

Long Short Term Memory (Conv-LSTM), has been proposed, exploiting the multi-perspective

information available in a light field image. The fused deep representation solution described in 6.

processes only light field central view data, notably using its rendered texture and corresponding

disparity and depth maps. On the contrary, the proposed double-deep descriptor adopts a Conv-

LSTM recurrent network to extract higher dimensional angular dependencies from VGG deep

descriptions associated to different face viewpoints rendered from a full light field image, thus

offering a more powerful description for light field face recognition; a softmax layer is used for

classification. This research work led to the following submission [40]:

A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund, and F. Pereira "A double-deep

spatio-angular learning framework for light field based face recognition", Submitted to IEEE

Transactions on Circuits and Systems for Video Technology, Oct. 2018.

8. Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors

The solution described in 7. proposes to organize the light field views’ spatial features as a

sequence to be input to a conventional LSTM network, thus ignoring the additional angular

information/dependencies, notably in terms of parallax, that could be further exploited during the

learning process to increase recognition accuracy. This work proposes three novel light field

LSTM cell architectures able to jointly learn the light field horizontal and vertical parallaxes. The

three proposed LSTM cell architectures are: i) Gate-Level Fusion LSTM (GLF-LSTM), ii)

State-Level Fusion LSTM (SLF-LSTM) and iii) Sequential Learning LSTM (SeqL-LSTM); these

architectures create richer spatio-angular light field descriptions for visual recognition tasks. The

proposed cell architectures have been integrated into a spatio-angular deep learning framework for

double-deep description, where an LSTM network adopting the proposed light field LSTM cell

architectures receives its inputs from a VGG-Face deep descriptor applied to a set of horizontal

and vertical 2D face viewpoint images, derived from a light field image. This research work led to

the following submission [41]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia "Light field long short-term memory: novel

cell architectures with application to face recognition", Submitted to Pattern Recognition

Letters, Oct. 2018.

Part III – Light Field Based Presentation Attack Detection (Chapters 7, 8, 9 and 10)

In the context of light field based biometric PAD, this Thesis proposes the following main

contributions: i) a new taxonomy for biometric PAD systems; ii) two light field artefact face and

ear databases; and iii) two solutions for light field based face and ear PAD. A summary of these

contributions is presented in the following.

1. Encompassing Taxonomy for Face PAD Solutions

To better understand the technological landscape in the area of PAD, this work proposes a

taxonomy to organize the face PAD solutions according to four main dimensions, notably user

interaction support, imaging sensor, contextual information and feature extraction. Following the

proposed taxonomy, a comprehensive review of recent, representative and relevant non-light field

based and light field based face PAD solutions has been developed. This research work led to the

following publication [42]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Light field based face presentation attack

detection: reviewing, benchmarking and one step further", IEEE Transactions on Information

Forensics and Security, Vol. 13, No. 7, pp. 1696-1709, Jul. 2018.

2. IST Lenslet Light Field Face Artefact Database

In the context of light field based face PAD, the IST Lenslet Light Field Face Spoofing Database

(IST LLFFSD) has been proposed, consisting of 100 bona fide images, from 50 subjects, captured

with a Lytro ILLUM lenslet camera, and a set of 600 face artefact images, captured using the same

camera. The IST LLFFSD simulates six different types of presentation attacks, including printed

paper, wrapped printed paper, laptop, tablet and two different mobile phones. This research work

led to the following publication [43]:

A. Sepas-Moghaddam, L. Malhadas, P. Correia, and F. Pereira, "Face spoofing detection using

a light field imaging framework", IET Biometrics, Vol. 7, No. 1, pp. 39-48, Jan. 2018.

3. IST Lenslet Light Field Ear Artefact Database

The IST Lenslet Light Field Ear Artefact Database (LLFEADB) has been proposed, including both

2D and light field ear artefact images. The database contains two sets: The first set, named baseline

LLFEADB, includes 268 bona fide ear samples derived from the publicly available IST-

EURECOM LLFEDB ear database, which includes ears from 67 subjects, with 4 shots per person,

captured with a Lytro ILLUM lenslet camera. The extended LLFEADB includes an additional set

of high resolution bona fide samples, captured with the same camera from 15 subjects, with 4

image shots per person. For both sets, four types of PAI were used to create the artefact samples:

a laptop, a tablet and two different mobile phones. This research work led to the following

publication [44]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear presentation attack detection:

benchmarking study with first lenslet light field database", European Signal Processing

Conference, Rome, Italy, Sep. 2018.

4. Face and Ear PAD Based on Light Field Angular Local Binary Patterns Descriptor

A novel PAD solution has been proposed based on a hand-crafted descriptor exploiting the color

and texture variations associated to the different directions of the light captured in light field

images. The proposed PAD solution is based on the Light Field Angular Local Binary Patterns

(LFALBP) descriptor, which captures the disparity information present in light field images. The

proposed PAD solution exploits the LFALBP in two different color spaces, and when applied to

face and ear PAD the resulting performance compares favorably with the alternative solutions in

the literature. This research work led to the following publications [43] [44]:

A. Sepas-Moghaddam, L. Malhadas, P. Correia, and F. Pereira, "Face spoofing detection using

a light field imaging framework", IET Biometrics, Vol. 7, No. 1, pp. 39-48, Jan. 2018.




5. Face and Ear PAD Based on Light Field Histogram of Disparity Gradients Descriptor

This work proposes a new light field based PAD solution based on a hand-crafted LFHDG

descriptor, computed in the Hue-Saturation-Value (HSV) color space, able to express the light

variations associated to the multiple light capturing directions in light field images. As the LFHDG

descriptor considers both the orientation and magnitude variations for the angular information, it

offers a more comprehensive angular description compared to the LFALBP solution described in

4. The performance of the proposed PAD solution compares favorably with state-of-the-art

solutions. This research work led to the following publications [42] [44]:

A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Light field based face presentation attack

detection: reviewing, benchmarking and one step further", IEEE Transactions on Information

Forensics and Security, Vol. 13, No. 7, pp. 1696-1709, Jul. 2018.




1.4 Thesis Structure

This Thesis proposes novel taxonomies and light field based databases and solutions for face and

ear biometric recognition and PAD. The organization of the Thesis is illustrated in Figure 1.3.

Figure 1.3: Thesis organization (highlighting in gray the Thesis contributions).

The Thesis starts with Part I, which includes two chapters. First, this chapter provides a brief

introduction to the context, motivation, objectives, and contributions; then, Chapter 2

briefly reviews the basic light field imaging concepts and the added value of light field cameras

for biometric recognition and PAD.

Next, Part II, including Chapters 3, 4, 5, and 6, presents the light field based biometric

recognition related contributions. Chapter 3 reviews the state-of-the-art on ear and face

recognition databases and solutions, guided by new, more encompassing taxonomies. Chapter 4

proposes two lenslet light field based face and ear databases to allow more powerful benchmarking

for testing and validating face and ear recognition solutions, exploiting the full light field data.

Chapter 5 proposes seven light field based face and ear recognition solutions, exploiting the

additional information available in a light field image. An extensive performance evaluation has

been conducted in Chapter 6 with the proposed light field databases, using a common,

representative evaluation framework for varied and challenging recognition tasks.

Part III, including Chapters 7, 8, 9, and 10, presents the light field based biometric PAD related

contributions. First, Chapter 7 proposes a taxonomy to organize the face PAD solutions according

to four main dimensions. Then, available face PAD solutions are reviewed according to the

proposed taxonomy. Next, Chapter 8 proposes two light field face and ear artefact databases for

testing and validating face and ear PAD solutions. Chapter 9 proposes two light field based face

and ear PAD solutions, exploiting the disparity information available in a light field image. Finally,

Chapter 10 assesses the proposed solutions along with state-of-the-art solutions in terms of

accuracy, generalization and complexity, using a common, representative evaluation framework.

Part IV, featuring Chapters 11 and 12, closes this Thesis with a summary of the

achievements and some relevant future work directions.

Chapter 2

Light Field Imaging: Basic Concepts and Tools

2.1 Introduction

This Thesis is devoted to light field biometrics. The main purpose of this chapter is to review the

light field basic concepts, light field acquisition approaches, and highlight the added value of light

field imaging for visual recognition tasks, notably biometric recognition and PAD.

Light field imaging technology has emerged as one of the most promising visual representation

formats, enabling a richer and more faithful representation of a visual scene. Light field cameras

acquire more information about the light, namely information about its direction, providing richer

content to immersively experience visual scenes and to accurately perform visual recognition

tasks.

Light field acquisition shall consider light variations, both in terms of position and direction, as

expressed by the plenoptic function introduced in Section 2.2. There are currently two main

practical ways of capturing light fields, introduced in Section 2.3: using an array of cameras or

using a lenslet light field camera. Since this Thesis adopts the second approach, the main focus of

this chapter is on lenslet light field imaging.

2.2 Plenoptic Function

Light field imaging has been considered by researchers for more than a century [45]. Already

in 1908, Lippmann discussed how to use small and closely spaced circular-shaped lenses to record

many slightly different perspectives of a scene. An observer could then view an image from a

selected perspective through an array of lenses, thus selecting small portions of each captured

image to create the so-called integral image [45]. The term light field was suggested by Gershun

in 1936, referring to the amount of light traveling in all directions through every point in space

[46]. In 1991, Adelson [47] proposed using the so-called plenoptic function P(x,y,z,t,λ,θ,φ) to

describe what was called the luminous environment. As illustrated in Figure 2.1, the 7D plenoptic

function describes the information carried by the light rays at every point in the 3D space (x,y,z),

towards every possible direction (θ, φ), over any wavelength (λ), and at any time (t).

Figure 2.1: Visualization of the plenoptic function.

The plenoptic function provides a complete modeling of the light involved in a visual scene;

however, using it in practice is certainly a challenge due to the large amount of data involved and

the associated complexity.

In 1996, two simplifications of the plenoptic function were proposed: the so-called static 4D

light field L(u, v, x, y), by Levoy et al. [24], and the so-called static 4D Lumigraph by Gortler et al.

[48]. These 4D representations are more compact and easier to process as they adopt a set of

simplifications resulting from the following observations [24] [25] [49]:

1. The wavelength dimension in the plenoptic function can be simplified by considering only

three components, the red, green and blue color channels, typically used by existing capture

and display devices; in this case, each channel should integrate the plenoptic function over a

certain wavelength range. This would imply using three independent light field functions, one

for each color channel, without the wavelength dimension.

2. The radiance along a light ray crossing empty space remains constant, implying that it is

not necessary to consider different positions along its path, thus removing one spatial dimension.

3. For a static scene, the temporal dimension can be skipped.

4. The angular coordinates (θ,φ) can be replaced with Cartesian coordinates (u,v).

For a static 4D light field representation, L(u, v, x, y), the light rays can be described by their points

of intersection with two parallel planes, the X-Y plane, by convention closer to the camera, and the

U-V plane, closer to the captured scene, as illustrated in Figure 2.2. In this two-plane

parameterization, the static 4D light field, L(u, v, x, y), describes all light rays passing through the

U-V and X-Y planes. A light ray emanates from a specific point (u, v) on the U-V plane to a specific

point (x, y) in the X-Y plane, with each ray keeping its RGB radiance [49]. A multi-camera array

could then be used as a simple way to acquire the light field, including a set of cameras, with

appropriately set apertures, placed in the X-Y plane, depicted in Figure 2.2 as grey cubes.

Given the two-plane parameterization, the light information can be described in terms of position

and direction, and so the terms spatial and angular can be employed to describe these dimensions.

One interpretation is that x and y fix the position of a ray, while u and v fix its direction. In this

interpretation, X and Y are spatial, and U and V are angular dimensions – this convention is

followed throughout the Thesis.
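
As a minimal illustration of the two-plane parameterization, the sketch below stores a toy light field as a NumPy array indexed as L[u, v, x, y, c], following the convention that U and V are the angular dimensions and X and Y the spatial ones. The array sizes, the random content and the function names are assumptions made only for this illustration; they do not correspond to any real camera or acquisition.

```python
import numpy as np


def make_toy_light_field(n_u=5, n_v=5, n_x=8, n_y=6, seed=0):
    """Return a random toy light field with shape (U, V, X, Y, 3)."""
    rng = np.random.default_rng(seed)
    return rng.random((n_u, n_v, n_x, n_y, 3))


def ray_radiance(lf, u, v, x, y):
    """RGB radiance of the single ray crossing (u, v) and (x, y)."""
    return lf[u, v, x, y]


def view(lf, u, v):
    """2D image for one direction: fix (u, v), vary (x, y)."""
    return lf[u, v]


def angular_patch(lf, x, y):
    """All directions reaching one position: fix (x, y), vary (u, v)."""
    return lf[:, :, x, y]


lf = make_toy_light_field()
print(ray_radiance(lf, 2, 2, 4, 3).shape)  # (3,)
print(view(lf, 2, 2).shape)                # (8, 6, 3)
print(angular_patch(lf, 4, 3).shape)       # (5, 5, 3)
```

Fixing the angular indices yields one viewpoint image, while fixing the spatial indices yields the set of directions observed at one position; both are slices of the same 4D structure.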

Figure 2.2: Parameterization of light rays in lenslet light field cameras using two planes.

In summary, the plenoptic function is proposed to provide a comprehensive description of the light

in any scene. However, sampling the full plenoptic function to obtain a scene representation

requires considerable data and computational complexity. The light field imaging representation

has been proposed as a way to sample the plenoptic function assuming some simplifications.

2.3 Light Field Acquisition

Light field acquisition shall consider light variations, both in terms of position and direction, as

proposed by the plenoptic function. There are currently two main practical ways of understanding

and capturing light fields: the first considers an (ideally high density) array of regular (or even

irregular) cameras, which acquires different perspectives of the same scene; the second considers

a so-called lenslet light field camera, which includes an array of microlenses, each one playing the

role of a small camera and acquiring a so-called micro-image. In practice, the two acquisition

approaches are rather equivalent with the main difference being the ‘camera’ baseline and all the

implications in size and cost that derive from that.

2.3.1 Multi-Camera Arrays

The most important parameters for a 2D array of cameras are the number of cameras, the resolution

of each camera and their arrangement. A 2D array of cameras can be arranged either in a regular

way, e.g. rectangular (Figure 2.3.a) or circular (Figure 2.3.b), or in an irregular way, such as the

irregular array cameras on the recent handheld Light L16 camera [50] (Figure 2.3.c). In a 2D array

of cameras, each camera acquires a 2D slice of the 4D light field from a different perspective; by

arranging these slices, a multi-view array of 2D images can be obtained.

Figure 2.3: Multi-camera arrays arrangements: (a) Regular, rectangular arrangement of cameras:

Stanford multi-camera array [51]; (b) Regular, circular arrangement [51]; (c) Irregular

arrangement of cameras on Light L16 [50].

One of the first multi-camera arrays was designed at Stanford University in 2004, and consisted

of 128 (8×16) cameras, each with a spatial resolution of 640×480 pixels [51], as illustrated in

Figure 2.3.a. Recently, more dense and higher resolution camera acquisition systems have

emerged, such as the one used to acquire the JPEG Pleno High Density Camera Array (HDCA)

content [52], which used a single camera in a rig with a 7956×5304 spatial resolution. This spatial

multiposition acquisition system considers 101 horizontal steps with a gap of 4 mm (distance

between adjacent cameras in the horizontal direction) and 21 vertical steps with a gap of 6 mm

(distance between adjacent cameras in the vertical direction).

2.3.2 Lenslet Light Field Cameras

A lenslet light field camera includes a digital sensor, main optics and an aperture control similar

to normal cameras. The main difference regarding regular cameras comes from placing a micro-

lens array on the focal plane of the main lens at a given distance from the sensor, as shown in

Figure 2.4. The main lens aims to focus the light rays from the object into the microlens array.

Then, the micro-lenses split the incoming light cone based on the direction of the incoming rays

onto the sensor area of the corresponding micro-lens. A microlens array is usually composed of

thousands of tiny lenses that are arranged in a rectangular, hexagonal or custom grid. The photo

sensors have their sensing positions masked by a color filter array, with the most popular being

the Bayer pattern filter [53].

Figure 2.4: Lenslet light field imaging based on micro-lens array.

Concerning the placement of the microlens array, there are two main types of lenslet light field

cameras, so-called focused and unfocused [54]. In the so-called plenoptic 1.0 cameras, also known

as unfocused light field cameras, the main lens is focused on the microlens plane while the micro-

lenses are focused at infinity, as shown in Figure 2.5.a. Differently from a conventional 2D camera

capturing an image by integrating the intensities of all rays (from all directions) impinging on each

sensor element, each pixel in this light field camera collects the light of a single ray (or a thin

bundle of rays) from a given angular direction (θ,φ) that converges on a specific microlens at

position (x,y) in the array. The research advances made by Ng [25] have led to the development of

the commercially available plenoptic 1.0 set of Lytro lenslet light field cameras [26], see Figure

2.6.a and Figure 2.6.b.

Figure 2.5: Lenslet light field cameras: (a) unfocused and (b) focused architectures [54].

On the contrary, in the plenoptic 2.0 cameras, also known as focused light field cameras [55], [56],

the micro-lenses are no longer focused at infinity, but they are rather focused on the main lens

image plane, as shown in Figure 2.5.b; this justifies the name focused as each microlens is now

focused on the same subject as the main lens. Raytrix cameras [57], which target industrial

applications, follow the plenoptic 2.0 paradigm, see Figure 2.6.c. Adjusting the focal distance by

moving the main lens allows changing the depth of field, thus capturing multiple pixels per

microlens in different focus planes, raising the possibility to render 2D images with increased spatial

resolution.

In practice, the two lenslet light field camera setups, i.e., plenoptic 1.0 and plenoptic 2.0

cameras, allow different trade-offs between the spatial and angular resolutions in the captured light

field image.

Figure 2.6: Lenslet light field cameras: (a) Lytro first generation camera [26]; (b) Lytro ILLUM

lenslet camera [26]; and (c) Raytrix R11 camera [57].

In this Thesis, light field images acquired with the plenoptic 1.0 Lytro ILLUM lenslet camera,

using a 40 Megaray sensor and a 30-250 mm lens with 8.3× optical zoom and f/2.0 aperture, are

processed. In the following, the plenoptic 1.0 lenslet light field camera will be simply called light

field camera.

2.4 Lenslet Light Field Imaging: From Micro-images to Sub-Aperture

Images

This section reviews the lenslet light field pre-processing operations, relevant for different visual

analysis tasks, including biometric recognition, PAD and, eventually, coding. These operations aim to

transform the acquired lenslet light field, represented as a set of micro-images, into a lenslet light

field represented as a set of views/perspectives, usually called Sub-Aperture (SA) images. The

architecture of the light field pre-processing adopted in the Light Field Toolbox software [58] is

represented in Figure 2.7.

Figure 2.7: The lenslet light field pre-processing architecture.

The main modules in the architecture have the following tasks:

Acquisition – This module has the task to acquire the light field data from the scene, as

described in Section 2.3.2. After acquisition, a light field image is stored in a raw lenslet

format, corresponding to a set of micro-images, using a single precision floating point format,

and a resolution of 7728×5368 pixels in GRGB Bayer format, as illustrated in Figure 2.8.

RGB color demosaicing – This module has the task to recover a full RGB lenslet light field

image from the raw lenslet image, which has been obtained with a Bayer-pattern filter. To

achieve this goal, a conventional linear demosaicing technique is applied over the raw image

[59]. A sample demosaiced lenslet light field image is shown in Figure 2.9.

Multi-view array creation – This module re-arranges the demosaiced raw light field image

into a multi-view array of SA images. A SA image results from putting together the pixels in

the same position within each micro-image to create a rendered image for a specific

viewpoint/perspective; the full set of these SA images corresponds to the light field multi-view

array of SA images. A Lytro ILLUM multi-view array corresponds to 15×15 SA images

(Figure 2.10-left), each with a resolution of 434×625 pixels (Figure 2.10-right). The SA images

in black in Figure 2.10 do not contain usable image information due to the vignetting effect,

resulting from the circular microlens shape, which implies that some sensor positions receive almost

no incident light.

Figure 2.8: Raw lenslet light field representation, before color demosaicing (each position

corresponds to a R, G or B intensity).

Figure 2.9: Raw lenslet light field representation, after color demosaicing.

Figure 2.10: Light field multi-view array of SA images and central rendered 2D image.

Multi-view array color and gamma corrections – By exploiting the available light field

metadata associated with each light field image, including color balance matrix and white

balance level, this module applies: i) histogram equalization to adjust the contrast using the

image's histogram; ii) color saturation adjustment to control the intensity of RGB color

channels; iii) white balance adjustment, to adjust the so-called color temperature

corresponding to the relative warmth or coolness of light; and iv) gamma correction, a

nonlinear operation to adjust the overall brightness of the image. It should be noted that the

color and gamma corrections are applied to each SA image, to enhance the quality of the multi-

view array of SA images [59]. Figure 2.11 shows a light field SA image before (left side) and

after (right side) color and gamma corrections. The output of this module is the color-gamma

corrected multi-view array of SA images which can then be used as input to the feature

extraction stage of a biometric recognition or PAD solution.

Figure 2.11: A light field SA image (a) before and (b) after color and gamma corrections.
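
The multi-view array creation step described above can be sketched as a simple re-indexing. The sketch assumes an idealized, already demosaiced lenslet image in which the micro-images lie on a perfectly aligned square grid with an integer number of pixels per micro-image; real lenslet data typically also requires grid alignment and resampling, which are ignored here. The function name and the toy sizes are hypothetical and do not reproduce the Light Field Toolbox implementation.

```python
import numpy as np


def lenslet_to_subapertures(lenslet, n_u, n_v):
    """Rearrange an idealized lenslet image into sub-aperture (SA) images.

    lenslet: array of shape (n_rows * n_u, n_cols * n_v, channels), where
             each micro-image covers n_u x n_v pixels on a square grid.
    Returns an array of shape (n_u, n_v, n_rows, n_cols, channels).
    """
    height, width, channels = lenslet.shape
    if height % n_u or width % n_v:
        raise ValueError("image size must be a multiple of the micro-image size")
    n_rows, n_cols = height // n_u, width // n_v
    # Split each image axis into (micro-image index, pixel index inside it).
    blocks = lenslet.reshape(n_rows, n_u, n_cols, n_v, channels)
    # Collect pixels sharing the same position (u, v) inside every micro-image.
    return blocks.transpose(1, 3, 0, 2, 4)


# Toy lenslet image: channel 0 stores u and channel 1 stores v for each pixel.
n_u = n_v = 3
rows, cols = np.indices((4 * n_u, 5 * n_v))
toy = np.stack([rows % n_u, cols % n_v, np.zeros_like(rows)], axis=-1)
sa = lenslet_to_subapertures(toy, n_u, n_v)
print(sa.shape)  # (3, 3, 4, 5, 3)
```

In the toy example, every pixel of the SA image at index (u, v) carries the same (u, v) label, confirming that each SA image gathers the pixels located at the same position within each micro-image.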

It should be noted that other types of rendering solutions may be used to extract/render 2D images

from a raw light field image, depending on the specific needs, e.g. focus view rendering.

Nevertheless, the simple rendering solution described above is more convenient for the biometric

recognition and PAD solutions proposed in this Thesis, which aim to exploit the spatio-angular

information available in the multi-view array of SA images.

In this Thesis, the pre-processing architecture illustrated in Figure 2.7 corresponds to the initial

part of the full architecture designed for the proposed light field biometric recognition and PAD

solutions.

2.5 Added Value for Biometric Recognition and PAD

Images acquired with a light field camera include rich spatio-angular information about different

viewpoints, thus supporting characteristics/functionalities such as a posteriori refocusing, disparity

exploitation, and depth exploitation. These light field distinctive characteristics can be useful for

addressing many imaging analysis tasks, notably face and ear biometric recognition and PAD:

1. A posteriori refocusing: A posteriori refocusing on a given selected plane can be performed

with a rendering solution controlled by a single focal shift parameter. This capability can be

very useful to improve the quality of a previously out-of-focus region of interest for the

subsequent recognition of either a single face/ear or multiple faces/ears, positioned at different

distances or focus planes [27] [28] [29]. In addition, a presentation attack image may have

different surface geometry than bona fide samples, thus exhibiting limited differences for the

attack images rendered at different depth planes, which facilitates detecting presentation attacks

[33].

2. Disparity exploitation: Disparity refers to the distance between the corresponding points in

different viewpoints. Given a captured light field image, it is possible to render a set of SA

images, each one corresponding to a specific viewpoint, which show some disparity between

the objects, which also depends on the distance to the camera. Disparity information is

instrumental for different analysis tasks including image registration, as it represents relevant

information for biometric recognition and PAD, such as the position and shape of shadows,

changes in contrast and contrast gradient among observation viewpoints, and defocus blur.

Disparity information can be exploited to improve the performance of biometric recognition

solutions [36] [37] [39] [40] [41] and PAD solutions [42] [43] [44].

3. Depth exploitation: Depth information expresses the distance from the scene objects to the

camera, thus providing geometric information about the position and shape of the various

objects, e.g., face components, which may not be equally expressed by disparity information.

The depth map of a light field image can provide key information for biometric recognition

[31] [39] and PAD [34]. In addition, depth information can be exploited to decide whether an

image is being captured from a flat surface or not. For example, face presentation attacks using

2D supports exhibit limited depth differences for facial landmarks, which can be exploited to

detect face presentation attacks [34]. While disparity and depth are, in theory, the same

information and may be mutually converted, the independent extraction of these two types of

information may bring additional information, notably compensating for the weaknesses of each

individual extraction process.
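
As a hedged illustration of the mutual conversion mentioned above, the following sketch uses the standard pinhole relation Z = f·B/d between two viewpoints. The relation, the function names and the numeric values are illustrative assumptions and do not correspond to the parameters of any camera used in this Thesis.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth (metres) from disparity (pixels), assuming Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Inverse mapping: disparity (pixels) from depth (metres)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical values: 500-pixel focal length, 1 mm baseline between viewpoints.
depth = disparity_to_depth(2.0, focal_px=500.0, baseline_m=0.001)
round_trip = depth_to_disparity(depth, focal_px=500.0, baseline_m=0.001)
```

Since the two quantities are inversely proportional, an estimation error that is small in disparity can become large in depth for distant objects, which is one reason why extracting them independently may be complementary.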

The face and ear recognition and PAD solutions proposed in this Thesis are mostly focused on

exploiting disparity information [36] [37] [39] [40] [41] [42] [43] [44], thus capturing the richer

spatio-angular information available in a light field image. There is only one proposed face

recognition solution [39] that exploits the disparity together with depth maps, acknowledging that

disparity and depth information may bring some complementary information to the recognition

task.

Part II. Light Field Based

Face and Ear Recognition

Chapter 3

State-of-the-Art on Face and Ear Recognition

3.1 Introduction

Biometric recognition, referring to the automated recognition of individuals based on their

biological and behavioral traits [1], has been successfully used in multiple application domains,

ranging from forensics and surveillance to commerce and entertainment [2]. There are different

types of biometric modalities available, such as fingerprint, iris, face, and ear, with each modality

having its strengths and weaknesses, and the choice clearly depending on the target application

[1].

This Thesis has been focused on face and ear biometric modalities. The main objective of this

chapter is to review the state-of-the-art on face and ear recognition databases and solutions. To

better understand the technological landscape in terms of recognition systems, this Thesis proposes

a new, more encompassing multi-level taxonomy for face and ear recognition solutions, to

facilitate their organization and categorization. Following the

proposed taxonomy, a comprehensive review on recent, representative and relevant face and ear

recognition solutions is presented. Additionally, this chapter reviews face and ear databases that

are instrumental for designing, testing and validating face and ear recognition solutions.

3.2 Face/Ear Recognition Taxonomy

To help understand the structure and abstraction levels that can be considered in the

face and ear recognition problems, a number of face recognition [60] [61] [62] [63] [64] and ear

recognition [5] taxonomies have been developed so far, allowing the dissection and comparison of

face and ear recognition solutions. These taxonomies may also guide researchers in the

development of more efficient face and ear recognition solutions. This Thesis also attempts to

support an informed comparison of face recognition solutions, as well as ear recognition solutions,

by proposing a comprehensive multi-level face/ear recognition taxonomy for categorization of

such solutions.

3.2.1 Reviewing Existing Face Recognition Taxonomies

The existing face recognition taxonomies have been reviewed to understand their benefits and

drawbacks. In [60], a taxonomy is proposed dividing the face recognition solutions into appearance

based (holistic), feature based, and hybrid approaches. This taxonomy is widely used in the

literature and has been used for categorization of face detection and recognition solutions [61].

The face recognition taxonomy proposed in [62] gives an overview of various face recognition

solutions by classifying them into geometric vs. template based, piecemeal vs. holistic, appearance

based vs. model based, and statistical vs. neural network approaches. In [63], the face recognition

solutions are classified based on the sensing modalities, i.e., 2D conventional image, 3D and infra-

red data. Depending on the main purpose, the reviewed taxonomies classify the face recognition

solutions based on a specific abstraction level, e.g., feature extraction or sensing modalities, while

ignoring other dimensions.

To structure the face recognition solutions based on different levels of abstraction, a multi-level

taxonomy may be adopted. In [64], a multi-level face recognition taxonomy is proposed, providing

an overview of face recognition solutions based on three different abstraction levels, notably pose-

dependency, face representation, and features used for matching, as illustrated in Figure 3.1.

However, this taxonomy ignores some relevant abstraction levels, such as face structure and

feature support, which may be considered for a more complete characterization of face recognition

solutions.

Figure 3.1: Multi-level taxonomy for face recognition solutions [64].

3.2.2 Reviewing Existing Ear Recognition Taxonomy

To help understand the relations between the various ear recognition solutions, an ear

recognition taxonomy has been proposed in [5] to divide the ear recognition solutions into holistic,

geometric, local, and hybrid approaches as illustrated in Figure 3.2. However, this taxonomy also

ignores some relevant abstraction levels, such as the ear structure and feature extraction support,

which may be considered for a more complete characterization of ear recognition solutions.

Figure 3.2: Taxonomy of ear recognition solutions [5].

3.2.3 Proposing a Novel Multi-Level Face/Ear Recognition Taxonomy

This Thesis proposes a new, more encompassing multi-level taxonomy, which can be applied to

the face recognition and the ear recognition problems, helping to better understand the

technological landscape in the area of face and ear recognition, thus facilitating the

characterization and organization of face and ear recognition methods. The proposed multi-level

face/ear recognition taxonomy, illustrated in Figure 3.3, has four levels including:

Figure 3.3: Proposed multi-level face/ear recognition taxonomy.

1. Face/ear structure – This level describes the way a recognition solution deals with the

structure of a face or ear image. It includes three classes: i) global representation, dealing with

face/ear as a whole (see Figure 3.4.a); ii) component + structure representation, relying on the

characteristics of some face components, such as eyes, nose, mouth, etc. or some ear

components, such as tragus, helix, lobe, etc. (as illustrated in Figure 3.5), along with their

relations (Figure 3.4.b); and iii) component representation, dealing independently with a

selection of face/ear components, without any consideration about the relations between them

(Figure 3.4.c).

Figure 3.4: Face/ear structure level: (a) global; (b) component + structure; and (c) component

representation face structures.

Figure 3.5: Ear structure [5].

2. Feature support – This level is related to the region of support considered for feature

extraction, which can be either global or local. Global feature support implies that the region

of support is the whole image, either a face/ear (Figure 3.6.a) or a face/ear component (Figure

3.6.b), depending on the face/ear structure class considered. Local feature support implies that

the region of support is a local region of either a face/ear (Figure 3.6.c) or a face/ear component

(Figure 3.6.d). A local region of support can have different characteristics, for instance in what

concerns topology, size, overlapping, among others.

Figure 3.6: Feature support level: Global feature support with (a) global and (b) component face

structures; Local feature support with (c) global and (d) component face structures.

3. Feature extraction approach – This level is related to the approaches used for feature

extraction, which may be classified as: i) appearance based, deriving features by using

statistical transformations from the intensity data; ii) model based, deriving features based on

geometrical characteristics of the face/ear; iii) learning based, deriving features by modelling

and learning relationships from the input data; and iv) hand-crafted based, deriving features by

describing pre-selected elementary characteristics computed over a local image area.

4. Feature extraction sub-approach – The last level considered in the proposed taxonomy is a

sub-category of the previous one, identifying the family of techniques used by the selected

feature extraction approach.
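
The four levels listed above can be encoded as a small data structure, which makes it easy to group solutions by any level. The following is a minimal sketch: the class and field names are illustrative assumptions, and the two example entries follow the classification given for PCA and LBP in Table 3.2.

```python
from dataclasses import dataclass
from enum import Enum


class Structure(Enum):
    GLOBAL = "global"
    COMPONENT_PLUS_STRUCTURE = "component + structure"
    COMPONENT = "component"


class Support(Enum):
    GLOBAL = "global"
    LOCAL = "local"


class Approach(Enum):
    APPEARANCE = "appearance based"
    MODEL = "model based"
    LEARNING = "learning based"
    HAND_CRAFTED = "hand-crafted based"


@dataclass(frozen=True)
class TaxonomyEntry:
    name: str
    structure: Structure
    support: Support
    approach: Approach
    sub_approach: str  # family of techniques, kept as free text


def group_by_approach(entries):
    """Group solution names by their feature extraction approach."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry.approach, []).append(entry.name)
    return groups


entries = [
    TaxonomyEntry("PCA", Structure.GLOBAL, Support.GLOBAL, Approach.APPEARANCE, "linear"),
    TaxonomyEntry("LBP", Structure.GLOBAL, Support.LOCAL, Approach.HAND_CRAFTED, "texture"),
]
groups = group_by_approach(entries)
```

The sub-approach is kept as free text because its admissible values depend on the selected approach; a stricter encoding could use one enumeration per approach.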

Appearance based feature extraction can be divided into: i) linear solutions, such as Principal

Component Analysis (PCA) [65] and Independent Component Analysis (ICA) [66], performing

an optimal linear mapping to a lower dimensional space to extract the representative features; ii)

non-linear solutions, such as kernel PCA [67], exploiting the non-linear structure of face/ear

patterns to compute a non-linear mapping; and iii) multi-linear solutions, such as generalized PCA [68],

extracting information from high dimensional data while retaining its natural structure, providing

more compact representations than linear methods.

Model based feature extraction can be divided into: i) graph based solutions, such as Elastic Bunch

Graph Matching (EBGM) [69], representing face/ear features as a graph, where nodes store local

information about face/ear landmarks and edges represent relations, such as distances between

nodes, and a graph similarity function is used for matching; and ii) shape based solutions, such as

the 3D Morphable Model (3DMM) [70], using landmarks to represent key face/ear components,

controlled by the model, and using shape similarity functions for matching.

The third feature extraction approach, learning based solutions, can be categorized into five

families of techniques, including: i) deep neural networks, such as the VGG-Face descriptor

[38], modelling the input data with high abstraction levels by using a deep graph with multiple

processing layers to automatically learn features from the input data; ii) dictionary learning

solutions, such as Kernel Extended Dictionary (KED) [71], finding a sparse feature representation

of the input data in the form of a linear combination of basic elements; iii) decision tree solutions,

such as Decision Pyramid (DP) [72], representing features as the result of a series of decisions; iv)

regression solutions, such as Logistic Regression (LR) [73], with the relationship between

variables being iteratively refined using a measure of error for the predictions made by the

considered model; and v) Bayesian solutions, such as Bayesian Patch Representation (BPR) [74],

applying Bayes’ theorem to extract features and using a probabilistic measure of similarity.

Finally, the hand-crafted based feature extraction approach includes: i) local shape based solutions,

such as Local Shape Map (LSM) [75], defining feature vectors using local shape descriptors; ii)

texture based solutions, such as Local Binary Patterns (LBP) [76], exploring the structure of local

spatial neighborhoods; and iii) frequency based solutions, such as Local Phase Quantization (LPQ)

[77], exploring the local structure in the frequency domain.
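
To make the notion of a texture based hand-crafted descriptor concrete, the sketch below computes the basic 3×3 LBP code for each interior pixel and collects the codes into a histogram. It is a simplified illustration, assuming a grey-level image given as a list of rows and the common convention that a neighbour sets its bit when it is greater than or equal to the centre; the neighbour ordering and the function names are illustrative choices rather than the exact formulation of [76].

```python
def lbp_codes(image):
    """Return 8-bit LBP codes for the interior pixels of a 2D grey-level image."""
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(codes):
    """256-bin histogram of LBP codes, usable as a texture feature vector."""
    hist = [0] * 256
    for row in codes:
        for code in row:
            hist[code] += 1
    return hist


# Toy patch whose centre is brighter than all of its neighbours.
image = [
    [5, 5, 5],
    [5, 9, 5],
    [5, 5, 5],
]
codes = lbp_codes(image)
histogram = lbp_histogram(codes)
```

In a recognition setting, such histograms are typically computed over several local regions and concatenated, which corresponds to the local feature support class of the taxonomy.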

It is not uncommon to find hybrid face/ear recognition solutions, such as LBP Net [78] and Mesh-

LBP [79], combining elements from two or more feature extractors to improve the recognition

performance. Additionally, for face/ear recognition solutions combining multiple features,

extracted using different feature extraction methods, fusion can be done at several levels, among

which feature level and score level fusion are the most often employed [80].
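
Score level fusion can be sketched as follows, assuming two matchers that each return one similarity score per gallery identity. Min-max normalization followed by a weighted sum is only one common choice; the weights, the score values and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a list of matching scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(score_lists, weights):
    """Weighted-sum fusion of per-matcher scores over the same gallery."""
    if len(score_lists) != len(weights):
        raise ValueError("one weight per matcher is required")
    normalized = [min_max_normalize(scores) for scores in score_lists]
    n_candidates = len(normalized[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized))
        for i in range(n_candidates)
    ]


# Hypothetical similarity scores of one probe against three gallery identities.
texture_scores = [0.2, 0.8, 0.5]
shape_scores = [10.0, 30.0, 40.0]
fused = fuse_scores([texture_scores, shape_scores], weights=[0.5, 0.5])
best_identity = max(range(len(fused)), key=fused.__getitem__)
```

Normalization is needed because the two matchers produce scores on different scales; feature level fusion would instead concatenate or combine the feature vectors before matching.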

3.3 Face Recognition

Face recognition systems have been successfully used in multiple application areas with high

acceptability, collectability and universality [1] [3]. Since the first automatic face recognition

algorithms emerged more than four decades ago [4], this area has attracted much research and

has progressed considerably. Following the multi-level face/ear recognition

taxonomy proposed in Section 3.2.3, this section reviews the state-of-the-art on face recognition

solutions. Since this Thesis focuses on the added value of light field images for biometric

recognition, the reviewed face recognition solutions are organized around the exploitation or not

of light field data. This section also provides an overview of the main characteristics of a set of

selected prominent existing face databases and the face variations addressed in these databases.

3.3.1 Face Databases: Status Quo

Face databases play a very important role for designing, testing and validating face recognition

solutions, while ensuring the reproducibility of performance results and their fair comparison. A

set of selected face biometric databases are briefly reviewed in the following.

Currently, there are over 100 publicly available face databases. Table 3.1 overviews the main

characteristics of a set of selected prominent existing face databases and the face variations

addressed in these databases, notably in terms of acquisition date, lighting, poses, expression, and

occlusions, sorted according to their release date (a more complete list can be found in [81]). For

comparison, the characteristics of the IST-EURECOM Lenslet Light Field Face Database

(IST-EURECOM LLFFD) [35] proposed in this Thesis are also included in Table 3.1.

Among the selected databases, several consider the usage of sensors that had not been considered

previously, thus motivating their creation. For instance, the MObile BIOmetry (MOBIO) database

[82] was recorded using two mobile devices, a mobile phone and a laptop computer, to boost the

research on face recognition techniques for mobile devices. The Surveillance Cameras face

(SCface) database [83] was collected to provide VISible (VIS) and Near Infra-Red (NIR) spectrum

images in an uncontrolled indoor environment. The Binghamton University 3D Facial Expression

(BU-3DFE) [19] database was developed for analyzing facial expressions in dynamic 3D spaces.

The Kinect Face database [18] provides RGB-D face images, captured by a Kinect sensor, to

evaluate how face recognition technology can benefit from this specific imaging sensor.

Table 3.1: Overview of selected, prominent face databases with different (Diff.) characteristics.

Face variation columns (Diff. dates, Diff. lighting, Diff. poses, Diff. expressions, Diff. occlusions): per-database entries not preserved.

Database Name | Year | No. of subjects | Image type | Image modality | Spatial resolution
ORL [84] | 1994 | 40 | Gray | 2D | 92×112
AR [85] | 1998 | 126 | Color | 2D | 380×285
Yale B [86] | 2001 | 28 | Gray | 2D | 640×480
FERET [87] | 2003 | 1199 | Gray / color | 2D | 256×384
FEI [88] | 2006 | 200 | Color | 3D | 640×480
FRAV3D [89] | 2007 | 106 | Color | 2D; 3D | N/A
LFW [90] | 2007 | 5749 | Color | 2D | 250×250
Bosphorus [91] | 2008 | 105 | Color | 2D; 3D | 1600×1200
Multi-PIE [92] | 2009 | 337 | Color | 2D | 3072×2048
MOBIO [82] | 2010 | 150 | Color | 2D | up to 2048×1536
Texas 3D [93] | 2010 | 118 | Color | 2D; 3D | 751×501
YouTube Faces [93] | 2011 | 1595 | Color | 2D video | Different sizes
SCface [83] | 2011 | 130 | Color / infrared | 2D | 100×75; 144×108; 224×168; 426×320
BU-3DFE [19] | 2013 | 100 | Color | 3D; 3D video | 1040×1329
Kinect Face DB [18] | 2014 | 52 | Color | 2D; depth map | 640×480
FaceWarehouse [95] | 2014 | 150 | Color | 2D; 3D | 640×480
IJB-A [96] | 2015 | 500 | Color | 2D; 2D video | Different sizes
PIPA [97] | 2015 | 2000 | Color | 2D | Different sizes
LiFFID [98] | 2016 | 112 | Gray | 2D; 2D rendered | 1054×1054; 120×120
Prop. IST-EURECOM LLFFD [35] | 2016 | 100 | Color | 4D light field; 2D rendered; 2D depth map | 15×15×434×625; 2022×1404; 2022×1404
SoF [99] | 2017 | 112 | Color | 2D | N/A
DFW [100] | 2018 | 1000 | Color | 2D | Different sizes

As the emergence of novel imaging sensors motivates the research community to work with

associated new and richer imaging formats, gathering a powerful light field face database was

becoming a pressing need. Light field imaging is a relatively new topic and, thus, only a few

databases have been made available. The Light Field Face and Iris Database (LiFFID) [98] is the

first face database where the importance of light field imaging sensors for facial recognition tasks

has been acknowledged. It includes a set of 2D greyscale images, focused at different depths,

rendered from the light field content acquired using a first generation Lytro lenslet camera, but

does not include the raw light field images.

3.3.2 Non-Light Field Based Face Recognition Solutions

Existing non-light field based face recognition solutions are briefly reviewed according to the

proposed multi-level face recognition taxonomy. Table 3.2 summarizes the main characteristics of

a selection of recent, representative and relevant solutions, sorted based on the feature extraction

approach, feature extraction sub-approach and, finally, publication date. Apart from the

information about taxonomy levels considered in the reviewed solutions, this table includes

information about the face databases considered in the publications reporting these solutions. The

solutions summarized in Table 3.2 are briefly reviewed in the following, grouped based on the

feature extraction approaches considered in the taxonomy.

Table 3.2: Classification of a selection of non-light field based face recognition solutions based

on the proposed taxonomy.

Solution Name | Year | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Database
PCA [65] | 1991 | Global | Global | Appearance | Linear | Private
ICA [66] | 2002 | Global | Global | Appearance | Linear | FERET
ASVDF [101] | 2016 | Global | Global | Appearance | Linear | PIE; FEI; FERET
KPCA MM [67] | 2016 | Global | Global | Appearance | Non-Linear | Yale; ORL
AHFSVD-Face [102] | 2017 | Global | Global | Appearance | Non-Linear | CMU PIE; LFW
GPCA [68] | 2004 | Global | Global | Appearance | Multi-Linear | AR; ORL
EBGM [69] | 1997 | Comp.+ Struct. | Global | Model | Graph | FERET
Homography Based [103] | 2017 | Comp.+ Struct. | Global; Local | Model | Graph | FERET; CMU-PIE; Multi-PIE
3DMM [70] | 2003 | Comp.+ Struct. | Global | Model | Shape | CMU-PIE; FERET
U-3DMM [104] | 2016 | Comp.+ Struct. | Global | Model | Shape | Multi-PIE; AR
Face Hallucination [105] | 2016 | Comp.+ Struct. | Global | Learning | Dictionary Learning | Yale B
Orthonormal Dic. Lear. [106] | 2016 | Global | Global | Learning | Dictionary Learning | AR
LKED [71] | 2017 | Global | Global | Learning | Dictionary Learning | AR; FERET; CAS-PEAL
Decision Pyramid [72] | 2017 | Global | Local | Learning | Decision Tree | AR; Yale B
Logistic Regression [73] | 2014 | Global | Global | Learning | Regression | ORL; Yale B
BPR [74] | 2016 | Global | Local | Learning | Bayesian | AR
AlexNet [107] | 2014 | Global | Global | Learning | Deep Neural Network | LFW; YTF
VGG Face [38] | 2015 | Global | Global | Learning | Deep Neural Network | LFW; YouTube
GoogLeNet [108] | 2015 | Global | Global | Learning | Deep Neural Network | LFW
Deep RGB-D [109] | 2016 | Global | Global | Learning | Deep Neural Network | Kinect Face DB
TRIVET [110] | 2016 | Global | Global | Learning | Deep Neural Network | CASIA
Deep HFR [111] | 2016 | Global | Global | Learning | Deep Neural Network | CASIA
Deep RGB-D [112] | 2016 | Global | Global | Learning | Deep Neural Network | Kinect Face DB
CDL [113] | 2017 | Global | Global | Learning | Deep Neural Network | CASIA
Deep NIR-VIS [114] | 2017 | Global | Global | Learning | Deep Neural Network | CASIA
Deep CSH [115] | 2017 | Global | Global | Learning | Deep Neural Network | CASIA
AlexNet [116] | 2018 | Global | Global | Learning | Deep Neural Network | IJB-A; PIPA
Lightened CNN [117] | 2018 | Global | Global | Learning | Deep Neural Network | LFW; YTF
SqueezeNet [118] | 2018 | Global | Global | Learning | Deep Neural Network | LFW
LSM [75] | 2004 | Global | Local | Hand-Crafted | Shape | Private
LBP [76] | 2006 | Global | Local | Hand-Crafted | Texture | FERET
HOG [119] | 2011 | Global | Local | Hand-Crafted | Texture | FERET
DLBP [120] | 2014 | Global | Local | Hand-Crafted | Texture | TEXAS; FRGC; BOSPHORUS
E-LBP [121] | 2016 | Global | Local | Hand-Crafted | Texture | Yale B; FERET; CAS-PEAL
MB-LBP [122] | 2016 | Global | Local | Hand-Crafted | Texture | Yale B; FERET
MR CS-LDP [123] | 2016 | Component | Local | Hand-Crafted | Texture | PIE; Yale B; VALID
ALTP [124] | 2016 | Global | Local | Hand-Crafted | Texture | FERET; ORL
LPQ [77] | 2008 | Global | Local | Hand-Crafted | Frequency | CMU PIE
Hybrid Solution: Mesh-LBP [79] | 2015 | Comp.+ Struct. | Local; Global | Hand-Crafted; Model | Texture; Graph | MIT CSAIL; BU-3DFE
Hybrid Solution: LBP Net [78] | 2016 | Global | Local | Hand-Crafted; Learning | Texture; Deep Neural Network | LFW; FERET
Hybrid Solution: PCA Net [125] | 2016 | Global | Global | Appearance; Learning | Linear; Deep Neural Network | LFW
Hybrid Solution: Aging FR [126] | 2016 | Component | Local | Hand-Crafted; Learning | Texture; Decision Tree | MORPH
Hybrid Solution: MSB LBP + WPCA [127] | 2016 | Global | Local; Global | Hand-Crafted; Appearance | Texture; Linear | ORL
Hybrid Solution: LFD+PCA [128] | 2016 | Comp.+ Struct. | Local; Global | Hand-Crafted; Appearance | Texture; Linear | SGIDCDL; FERET
Hybrid Solution: Deep BeliefNet + CSLBP [129] | 2016 | Global | Local; Global | Hand-Crafted; Learning | Texture; Deep Neural Network | ORL
Hybrid Solution: Discriminative Dic. Lear. [130] | 2016 | Global | Local; Global | Hand-Crafted; Learning | Texture; Dictionary Learning | AR; Yale B
Hybrid Solution: Nonlinear 3DMM [131] | 2018 | Comp.+ Struct. | Global | Appearance; Model; Learning | Non-Linear; Shape; Deep Neural Network | FaceWarehouse
Fusion Scheme: RGB-D-T [132] | 2014 | Global | Local | Hand-Crafted | Texture | Private
Fusion Scheme: Binocular Stereo [133] | 2015 | Global | Local | Hand-Crafted | Texture | Private
Fusion Scheme: RBP [134] | 2016 | Global | Local | Hand-Crafted | Texture | AR; Yale B; UMIST
Fusion Scheme: LCCP [135] | 2016 | Global | Local | Hand-Crafted | Frequency; Texture | FERET
Fusion Scheme: Gabor-Zernike Descriptor [136] | 2016 | Global | Local | Hand-Crafted | Texture | ORL; Yale; AR
Fusion Scheme: MDML-DCP [137] | 2016 | Comp.+ Struct. | Local; Global | Hand-Crafted; Appearance | Texture; Linear | FRGC; CAS; FERET
Fusion Scheme: RGB-D-IR [138] | 2016 | Global | Local; Global | Hand-Crafted; Learning | Texture; Deep Neural Network | Private
Fusion Scheme: Thermal Fus. [139] | 2016 | Global | Local | Hand-Crafted | Texture | Thermal/Visible Face

3.3.2.1 Appearance Based Solutions

Appearance based face recognition solutions map the input data into a lower dimensional space,

while retaining the most relevant information. Appearance based solutions are generally sensitive

to face variations, such as occlusion, scale, pose, and expression, as they do not consider any

knowledge about the face structure.

The most popular appearance based solutions for face recognition are PCA, also known as

eigenfaces [65], and ICA [66]. PCA is an appearance based face recognition solution that finds

useful representations by projecting face images onto an orthogonal representation space where

each basis image captures the highest variance possible, thus decomposing an image into an

uncorrelated linear combination of the basis images. ICA is proposed to find the independent

components that are linearly mixed by maximizing the statistical independence of the estimated

components [66]. Linear methods work based on the vectorization of intensity data; in order to

work directly with 2D images in their native state, a multi-linear appearance based solution, so-

called Generalized PCA (GPCA) [68], has been proposed. By projecting the images to a vector

space that is the tensor product of two lower-dimensional vector spaces, GPCA aims to preserve

spatial locality to improve the effectiveness of the feature extraction method. More recent

appearance based solutions include, for instance, Kernel PCA Mixture Model (KPCA MM) [67],

a supervised version of the probabilistic kernel principal component analysis mixture model,

which captures the local non-linear structure of facial patterns and can be used for dimensionality

reduction in recognition tasks. Adaptive Singular Value Decomposition Face (ASVDF) [101] is an

illumination compensation method in the two-dimensional discrete Fourier domain, for reducing

the influence of side light on a color face image when there is insufficient light, improving the

performance of recognition systems. As a final example, Adaptive High-Frequency Singular Value

Decomposition face (AHFSVD-face) [102] adaptively selects a nonlinear parameter to generate

features according to the face image illumination level.
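
The eigenfaces projection described above for PCA can be illustrated with a minimal, dependency-free sketch. It assumes face images already vectorized into equal-length lists of intensities, and it swaps in power iteration to estimate only the leading basis image instead of performing a full eigen-decomposition; the function names and the toy data are illustrative assumptions and are not taken from [65].

```python
def mean_vector(samples):
    """Pixel-wise mean of the vectorized training images."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def leading_component(samples, iterations=200):
    """Estimate the first principal component by power iteration."""
    mean = mean_vector(samples)
    centred = [[x - m for x, m in zip(s, mean)] for s in samples]
    dim = len(mean)
    v = [1.0 / (i + 1) for i in range(dim)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Apply the unnormalized covariance matrix: w = X^T (X v).
        proj = [sum(a * b for a, b in zip(row, v)) for row in centred]
        w = [sum(p * row[j] for p, row in zip(proj, centred)) for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def project(sample, mean, component):
    """Coefficient of a sample along one basis image."""
    return sum((x - m) * c for x, m, c in zip(sample, mean, component))


# Toy "images" of three pixels whose variance lies along the first pixel only.
samples = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0], [6.0, 0.0, 1.0], [8.0, 0.0, 1.0]]
mean, component = leading_component(samples)
coefficient = project(samples[3], mean, component)
```

In practice, a linear algebra library would compute several basis images at once, and recognition would compare the resulting coefficient vectors of probe and gallery images.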

3.3.2.2 Model Based Solutions

Model based face recognition solutions derive features based on geometrical characteristics of the

face. These solutions are generally less sensitive to face variations as they consider structural

information of the face, for which they require accurate landmark localization prior to feature

extraction.

Graph based solutions represent face features as a graph, with the nodes collecting local

information about each facial landmark and edges representing their relations, e.g., geometric

distances between the nodes. Model matching can be performed using a graph similarity function.

EBGM [69] is a graph based solution where the local texture of fiducial points on the face (eyes,

mouth, etc.) is described by a set of wavelet components, so-called jets, and the edges represent

distances between the node locations on an image; recognition is done based on a Dynamic Link

Matching (DLM) method. In the homography based normalization solution [103], an efficient

pose-invariant face recognition solution is proposed, projecting a dense grid of 3D facial

landmarks to each 2D face image, to enable pose-invariant feature extraction. Then, an optimal

warp is estimated for each landmark in order to correct the texture deformation caused by pose

variations. The reconstructed frontal-view features are then utilized for recognition.

Shape model face recognition solutions represent features for a set of points controlled by the

model to find the best matching position between the prior models and the input face image.

Landmark points represent the positions of key facial features used for facial alignment. As an

example, 3DMM [70] captures the class-specific properties of faces by learning from a data set of

3D scans. The morphable model represents face shape and texture as vectors in a high-dimensional

face space, and involves a probability density function of faces within the face space. Matching is

achieved by fitting a statistical, morphable model of 3D faces to images. The Unified 3D

Morphable Model (U-3DMM) solution [104] proposes an improved approach to learn the face

subspace, by modelling the difference in the texture map of the 3D aligned input and reference

images, resulting in an improved fit of the 3D face model.

3.3.2.3 Learning Based Solutions (excluding Deep Learning)

Learning based face recognition solutions derive features by modelling and learning relationships

from the input data. These solutions can present some robustness against facial variations,

depending on the considered training data; however, they can be computationally more complex

than solutions based on other feature extraction approaches, as they require initialization, training,

and tuning of (hyper)parameters. As the majority of recent learning based face recognition solutions are

based on deep learning, those solutions are reviewed in the next sub-section.

Dictionary learning based solutions aim to find a sparse feature representation of the input data in

the form of a linear combination of basic elements. A two-step supervised face hallucination

framework based on class-specific dictionary learning is proposed in [105] to learn a set of class-

specific dictionaries. The learned dictionaries can fit the global and local characteristics of an input

face image. Then, a maximum a posteriori estimator is used to infer the global characteristics. An

orthonormal dictionary learning solution is presented in [106], obtaining low-rank face

representations with fast computation. The solution enhances the ability of the class-specific

dictionary to represent samples from the associated class and suppresses its ability to represent

samples from other classes. In [71], several kernel principal components of occlusion variations

are learned, representing the possible occlusion variations efficiently. Then, the occlusion model

is projected by kernel discriminant analysis to get the kernel extended dictionary; finally, a

structured sparse representation classifier is used for classification.
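
The idea of representing an input as a sparse linear combination of dictionary elements can be sketched with a greedy matching pursuit over a fixed dictionary. This illustration only covers the sparse coding step: the dictionary is a hypothetical, hand-picked set of unit-norm atoms, and the dictionary learning procedures of [105], [106] and [71] are not reproduced.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, atoms, n_nonzero):
    """Greedy sparse code of `signal` over unit-norm `atoms`."""
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        correlations = [dot(residual, atom) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(correlations[i]))
        if correlations[k] == 0.0:
            break
        coeffs[k] += correlations[k]
        # Remove the contribution of the selected atom from the residual.
        residual = [r - correlations[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual


# Hypothetical dictionary of three unit-norm atoms and a signal built from two of them.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
signal = [3.0, 0.0, -2.0]
coeffs, residual = matching_pursuit(signal, atoms, n_nonzero=2)
```

With class-specific dictionaries, a probe can then be assigned to the class whose atoms leave the smallest residual, which is the intuition behind sparse representation classifiers.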

Decision tree based solutions represent features as a model of decisions that is constructed based

on values of attributes in the input image. The Decision Pyramid Classifier (DPC) face recognition

solution [72] solves the single sample per person problem by considering large appearance

variations. DPC divides each training image into multiple non-overlapping local blocks and

extracts features from each block to generate the training feature set. By following the constructed

decision pyramid, the person identity is predicted.
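
The block partitioning step used by DPC can be sketched as below. The per-block feature is a placeholder (mean intensity) chosen only for illustration, and the construction of the decision pyramid itself is not reproduced; the function names and the toy image are assumptions.

```python
def split_into_blocks(image, block_rows, block_cols):
    """Split a 2D image (list of rows) into non-overlapping blocks."""
    rows, cols = len(image), len(image[0])
    if rows % block_rows or cols % block_cols:
        raise ValueError("image size must be a multiple of the block size")
    blocks = []
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            blocks.append([row[c:c + block_cols] for row in image[r:r + block_rows]])
    return blocks


def block_mean(block):
    """Placeholder per-block feature: mean intensity."""
    values = [v for row in block for v in row]
    return sum(values) / len(values)


image = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
features = [block_mean(b) for b in split_into_blocks(image, 2, 2)]
```

Extracting one feature per block is what gives this family of solutions a local feature support in the proposed taxonomy.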

Bayesian learning solutions apply Bayes’ theorem to extract features using a probabilistic measure

of similarity. In [74], a simple yet effective framework is proposed, generating, interpreting, and

aggregating the partial representations in a Bayesian manner. First, linear representations are

computed on randomly generated face patches. Second, each patch representation is considered as

a probability vector, with each element corresponding to a certain individual. The interpretation is

obtained by applying the Bayes theorem on a basic distribution assumption and, thus, is referred

to as Probabilistic Patch Representation (PPR). Finally, a linear combination of the obtained PPRs

is learned to achieve higher recognition performance.

3.3.2.4 Deep Learning Based Solutions

With the development of deep learning architectures and the increase in computational power,

rapid advances in a variety of visual recognition tasks, including face recognition, have been

observed [140]. In recent years, deep learning architectures have been increasingly adopted for face

recognition tasks and, not surprisingly, the current state-of-the-art on face recognition is dominated

by deep neural networks, notably Convolutional Neural Networks (CNNs). Deep CNN

architectures take raw data as their input and extract features using convolutional filters in multiple

levels of abstraction, followed by a few fully connected layers. However, optimizing tens of

millions of parameters to learn deep learning weights requires a huge amount of labeled/learning

samples along with powerful computational resources. Hence, deep learning architectures have

been trained over millions of face images with different variations, obtaining the so-called pre-

trained face models for face recognition that can then be used for deep feature

extraction/description; at this stage, a conventional classifier such as Support Vector Machine

(SVM) can be employed to classify the extracted features. The adaptation of the pre-trained face

model to a specific face recognition problem can also be done using a so-called transfer learning

process, meaning that the pre-trained model is fine-tuned using a part of the newly available

datasets, notably when the type of face data is different from the images used for training the

model, by changing the last (classification) layer(s) of the architecture to learn the new classes

[141]. Nowadays, the most efficient and commonly used CNN architectures for face recognition

are AlexNet [142] [116] [107], Lightened CNN [143] [117], SqueezeNet [144] [118], GoogLeNet

[145] [108], the VGG-16 network [146] [38], and ResNet [147] [148].

Several deep learning based face recognition solutions exploiting richer imaging representation

formats have recently been proposed and are discussed in the following, excluding light field

solutions, which are addressed in a later section. Coupled Deep Learning (CDL) [113] is proposed

to address the VIS and NIR heterogeneous matching problem. It transfers the deep representation

learned on a large-scale VIS dataset and adapts it to the NIR domain by introducing a VIS-NIR

objective function for convolutional neural networks. It seeks a deep feature space in which the

heterogeneous face matching problem can be approximately treated as a homogeneous face

matching problem. A deep TransfeR nIr-Vis heterogeneous facE recognition neTwork (TRIVET)

for NIR-VIS face recognition is proposed in [110], employing a CNN with ordinal measures to

learn discriminative models. The ordinal activation function, so-called max-feature-map, is used

to select discriminative features and make the models robust and light. Then, the models are

transformed to the NIR-VIS domain by fine-tuning with a NIR-VIS triplet loss function. A method

using the GoogLeNet to learn global features for heterogeneous face recognition is presented in

[111], which is able to learn coupled deep convolutional neural networks to map visible and NIR

faces into a domain independent latent feature space where they can be compared directly. Another

deep convolutional network is proposed in [114], using only one network to map both NIR and VIS

images to a compact Euclidean space. Each convolutional layer implements the maxout operator

and the layers are divided into two orthogonal subspaces that contain modality-invariant identity

information. The solution proposed in [115] extends the deep learning breakthrough for VIS face

recognition to the NIR spectrum, without retraining the underlying deep models trained for VIS

faces. The solution consists of two core components, cross-spectral hallucination and low-rank

embedding, optimizing the input and output of a VIS deep model for cross-spectral face

recognition, respectively. A face recognition system is proposed in [109] to recognize faces with

color and depth information including three parts: i) depth image recovery; ii) deep learning for

feature extraction with a 12-layer deep architecture; and iii) joint classification. To alleviate the

problem of the limited size of available RGB-D data for deep learning, the network is firstly trained

with a color face dataset, and later fine-tuned on depth face images exploiting transfer learning.

Finally, a deep face recognition solution is proposed in [112] to learn effective color and depth

feature transformation, containing 3 parts: i) depth data enhancement, recovery, and augmentation;

ii) deep CNN transfer learning, efficiently transferring the knowledge of color images to depth

images for feature extraction; and iii) joint classification of color and depth features.
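
As an illustration of the max-feature-map activation mentioned above for [110], the following Python sketch assumes the common formulation in which the feature maps of a layer are split into two halves and the element-wise maximum of each pair is kept. The function name `max_feature_map` and the toy values are hypothetical and are not taken from [110].

```python
def max_feature_map(channels):
    """Max-feature-map (MFM) over a list of equally sized feature maps.

    Assumed formulation: the 2N input maps are split into two halves and the
    element-wise maximum of map k and map k + N is kept, giving N output maps.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of input channels")
    half = len(channels) // 2
    output = []
    for first, second in zip(channels[:half], channels[half:]):
        output.append([[max(a, b) for a, b in zip(row_a, row_b)]
                       for row_a, row_b in zip(first, second)])
    return output


# Four 2x2 feature maps (hypothetical values) are reduced to two.
feature_maps = [
    [[1.0, -2.0], [0.5, 3.0]],
    [[0.0, 0.0], [4.0, -1.0]],
    [[-1.0, 5.0], [0.5, 2.0]],
    [[2.0, -3.0], [1.0, 0.0]],
]
selected = max_feature_map(feature_maps)
```

Since only the stronger response of each pair survives, the layer halves the number of feature maps, which is one way to read the "robust and light" property mentioned above.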

3.3.2.5 Hand-Crafted Based Solutions

Hand-crafted based face recognition solutions derive features by describing elementary

characteristics of the visual information selected a priori. These solutions are not usually very

sensitive to face variations, e.g. pose, expression, occlusion, aging, illumination changes, as they

can consider different/multiple scales, orientations, and frequency bands. These solutions require

tuning one/several parameters such as region size, scale, or topology. However, they are not

computationally expensive as there is no need for training at feature extraction level.

Texture based solutions can encode the local structures in spatial neighbourhoods within an image.

For instance, LBP [76] is among the most widely used local texture descriptors. This solution

divides the face image into several regions from which the LBP feature distributions are computed,

encoded and concatenated into a feature vector to be used as a face description. Sample encoding

is performed based on a comparison of the center pixel’s gray value and the corresponding values

for neighbour pixels, while using zero as the threshold value. Another widely used local texture

descriptor is the Histogram of Oriented Gradients (HOG) [149], which is able to represent spatial

gradient variations for face recognition [119]. HOG divides a face image into small connected

regions, called cells, and for each cell a histogram of edge orientations is computed. The histogram

channels are evenly spread over the range 0–180° or 0–360°, depending on whether the gradient

is ‘unsigned’ or ‘signed’, respectively. The histogram counts are normalized to compensate for illumination.

The combination of these histograms corresponds to the final HOG description. In [120], a depth

image descriptor called DLBP (Depth Local Binary Pattern) is proposed, capturing features from

texture and depth values of neighbourhood patterns. As it takes a similar form to conventional

LBP, patterns can be readily combined to form joint histograms to represent depth faces. The

Extended Local Binary Patterns (ELBP) solution [121] decomposes angular and radial differences

into complementary components of sign and magnitude, learning the most frequently occurring

patterns and their labels to capture discriminative textural information. Histogram features are

obtained from each given face image by concatenating spatial histograms extracted from non-

overlapping sub-regions which are then used for face classification. Multi-scale Block LBP (MB-

LBP) [122] processes average pixel values of block sub-regions instead of single pixels. Then,

binarized histograms obtained from MB-LBP are used for a rapid comparison of face images.

Multi-resolution Elongated Centre-Symmetric Local Derivative Pattern (ME-CS-LDP) is

proposed in [123], capturing more information from important elliptical parts of the face,

such as the eyes and mouth. An adaptive local feature descriptor, Adaptive Local Ternary

Pattern (ALTP) [124], is proposed based on an adaptive sampling threshold, exploiting positive

and negative channel patterns for extracting more discriminative information.
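
The basic LBP encoding described above can be sketched as follows. This is a minimal illustration, assuming a 3×3 neighbourhood, a clockwise bit ordering and a 2×2 grid of regions, none of which is prescribed by the text; the function names are hypothetical.

```python
def lbp_code(image, row, col):
    """8-bit LBP code of the pixel at (row, col) using its 3x3 neighbourhood."""
    center = image[row][col]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # The difference to the centre pixel is thresholded at zero.
        if image[row + dr][col + dc] - center >= 0:
            code |= 1 << bit
    return code


def lbp_histogram(region):
    """256-bin histogram of LBP codes over the interior pixels of a region."""
    histogram = [0] * 256
    for row in range(1, len(region) - 1):
        for col in range(1, len(region[0]) - 1):
            histogram[lbp_code(region, row, col)] += 1
    return histogram


def describe_face(image, grid=2):
    """Concatenate the LBP histograms of grid x grid non-overlapping regions."""
    height, width = len(image), len(image[0])
    descriptor = []
    for i in range(grid):
        for j in range(grid):
            rows = image[i * height // grid:(i + 1) * height // grid]
            region = [r[j * width // grid:(j + 1) * width // grid] for r in rows]
            descriptor.extend(lbp_histogram(region))
    return descriptor


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch, 1, 1)  # neighbours >= 6 set bits 0, 4, 5, 6 and 7
```

The concatenated vector returned by `describe_face` plays the role of the face description that is passed to a classifier.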

Frequency descriptors encode the local structures in frequency neighbourhoods within images. A

well-known example is the LPQ [77] hand-crafted feature descriptor, based on quantizing the

Fourier transform phase in local neighbourhoods. Histograms of LPQ labels computed within local

regions are used for face image description, similarly to LBP.
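
The phase quantization behind LPQ can be illustrated with a simplified sketch. It assumes a square window, four low frequencies, and no decorrelation of the coefficients; the bit ordering and the function names are hypothetical choices and may differ from the implementation in [77].

```python
import cmath
import math


def lpq_codes(image, window=3):
    """8-bit LPQ-style code for every pixel whose window fits inside the image."""
    radius = window // 2
    a = 1.0 / window
    # Four low frequencies (assumed): (a, 0), (0, a), (a, a), (a, -a).
    frequencies = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    codes = []
    for row in range(radius, len(image) - radius):
        code_row = []
        for col in range(radius, len(image[0]) - radius):
            code = 0
            for k, (u, v) in enumerate(frequencies):
                # Local Fourier coefficient at frequency (u, v).
                coeff = sum(
                    image[row + dy][col + dx]
                    * cmath.exp(-2j * math.pi * (u * dx + v * dy))
                    for dy in range(-radius, radius + 1)
                    for dx in range(-radius, radius + 1))
                # Only the signs of the real and imaginary parts are kept,
                # which quantizes the phase into four quadrants.
                if coeff.real >= 0:
                    code |= 1 << (2 * k)
                if coeff.imag >= 0:
                    code |= 1 << (2 * k + 1)
            code_row.append(code)
        codes.append(code_row)
    return codes


def lpq_histogram(codes):
    """256-bin histogram of the LPQ labels of a region."""
    histogram = [0] * 256
    for code_row in codes:
        for code in code_row:
            histogram[code] += 1
    return histogram
```

As with LBP, the histograms of several local regions would be concatenated to form the face description.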

3.3.2.6 Hybrid Solutions

Hybrid face recognition solutions combine elements from two or more feature extraction solutions,

taking advantage of their strengths to offer a more discriminative representation. It is worth noting

that hybrid solutions often do not refer to simply combining multiple features/classification scores

extracted by different feature extraction approaches. Hybrid face recognition solutions, especially

those using deep learning, are often quite competitive, but depending on their building blocks may

also be computationally more complex than other approaches.

One example is the Local Binary Pattern Network (LBPNet) [78], an unsupervised deep learning

based solution that efficiently extracts and compares high-level features in a multilayer hierarchy.

LBPNet retains the same CNN topology, whereas the trainable kernels are replaced by the off-the-

shelf LBP descriptor. In Multi-Scaled PCA Network (MS-PCANet) [125], a multiple scale

combined deep learning network is proposed to learn a set of high-level feature representations

through each stage of the convolutional neural network for face recognition. The network obtains

the filter kernels by learning the principal components of images using PCA, then nonlinearly

processes the convolutional results by using simple binary hashing, and pools them using a spatial

pyramid pooling method. Finally, the output features of several stages are fed to the classifier, thus

classifying multi-scale features. In [126], a hierarchical model based on two-level

learning is proposed. At the first level, effective features are learned from low-level

microstructures, based on a Local Pattern Selection (LPS) descriptor, selecting low-level

discriminant patterns to minimize intra-user dissimilarity. At the second level, higher level visual

information is further refined based on the output from the first level. The nonlinear 3DMM [131]

solution contains a deep network, encoding the projection, shape and texture parameters. Two

decoders nonlinearly map from the shape and texture parameters to the 3D shape and texture,

respectively. With the projection parameter, 3D shape and texture, an analytically-differentiable

rendering layer is designed to recognize the input face. Another hybrid solution is proposed in

[127], combining Centre-Symmetric (CS)-LBP based on Gaussian pyramids and weighted PCA

for face recognition; different classifiers are used to select the optimal classification approach. In

[128], a hybrid solution is proposed, applying dense sampling around each detected feature point,

extracting Local Difference Feature (LDF) for face representation, and then utilizing PCA and

linear discriminant analysis to reduce feature dimension; finally, cosine similarity evaluation is

used for classification. In [129], another hybrid solution is proposed, combining Center-

Symmetric Local Binary Patterns (CSLBP) and a Deep Belief Network (DBN). CSLBP is applied to

extract local texture features of face images and the extracted features are used as the input to a

DBN. A face recognition solution based on discriminative dictionary learning, LBP, and

regularized robust coding, is proposed in [130] to obtain the Gabor amplitude images of a face

image using a Gabor filter, extract the uniform LBP histograms to form a new dictionary, and,

finally, classify the test image via sparse representation coding. The challenge of LBP computation

on a mesh manifold is addressed in [79] by proposing a computational framework, called mesh-

LBP, allowing the extraction of LBP-like patterns directly from a triangular mesh manifold,

without the need for any intermediate representation in the form of depth images.

3.3.2.7 Fusion of Solutions

Face recognition fusion can be performed at four levels to combine the relevant information

[150]: i) feature level, usually concatenating features obtained by different feature extractors into

a single vector for classification; ii) score level, combining the different classifier output scores,

usually using the ‘sum rule’; iii) rank level, combining the ranking of the enrolled identities to

consolidate the ranks output from multiple biometric systems; and iv) decision level, combining

different decisions by those biometric matchers that provide access only to the final recognition

decision, usually adopting a ‘majority vote’ scheme. Fusion at feature and score levels are the

most commonly used approaches in the biometric literature. Generally, feature level fusion

contains richer information than score level; however, it is not always possible to apply it due to:

i) incompatibility of the features extracted in different feature spaces, in terms of data precision,

scale, structure, and size; and ii) large dimensionality of the concatenated features thus leading to

a higher complexity in the matching stage [80]. If one of these difficulties exists, fusion can be

performed at the rank, score or decision levels.
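
Three of the fusion levels listed above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names and toy values; it assumes that all classifiers score the same ordered list of enrolled identities and that their scores are already on comparable scales.

```python
from collections import Counter


def feature_level_fusion(feature_vectors):
    """Concatenate the feature vectors of several extractors into one vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def score_level_fusion(score_lists):
    """Sum rule: add the per-identity scores produced by each classifier."""
    return [sum(scores) for scores in zip(*score_lists)]


def decision_level_fusion(decisions):
    """Majority vote over the identities decided by each matcher.

    Ties are resolved in favour of the identity that was seen first.
    """
    return Counter(decisions).most_common(1)[0][0]


# Toy example with three enrolled identities (hypothetical scores).
scores_a = [0.25, 0.625, 0.125]
scores_b = [0.5, 0.25, 0.25]
fused_scores = score_level_fusion([scores_a, scores_b])
best_identity = max(range(len(fused_scores)), key=fused_scores.__getitem__)
voted_identity = decision_level_fusion(["id2", "id1", "id2"])
fused_features = feature_level_fusion([[1, 2, 3], [4, 5]])
```

The sketch also makes the trade-off visible: feature level fusion increases the dimensionality of the vector to be matched, whereas score and decision level fusion leave each matcher untouched.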

A fused representation named Riesz Binary Pattern (RBP) is proposed in [134], consisting of two

complementary components: a Local RBP (LRBP) and a Global RBP (GRBP). LRBP is obtained

by applying a local binary coding operator on each Riesz transform response to extract image

intrinsic two-dimensional structure features [151], while GRBP is the global binary coding of joint

information of multi-scale image analysis and multi-order Riesz transform. Histograms of LRBP

and GRBP are concatenated at feature level to form the face RBP hand-crafted description. In

[135], a fused solution called Local Contourlet Combined Patterns (LCCP) is proposed, combining

local descriptions at multiple scales, orientations, and frequency bands at feature level. LBP and

Mean based Contrast Patterns (MCP) have been applied to different levels of Contourlet transform

coefficients and then concatenated at feature level. Then, a block based kernel Fisher linear

discriminant is used to select the most discriminative feature sets. In [136], a feature level fusion

scheme is proposed, combining a multi-scale and rotation invariant global feature descriptor called

Global-Gabor-Zernike (GGZ) with HOG for face recognition. Another feature level fusion

solution extracts Multi-Directional Multi-Level Dual-Cross Patterns (MDML-DCPs) [137],

encoding the invariant characteristics of a face image at multiple levels into patterns based on

Dual-Cross Pattern (DCP) descriptions at both component and global face representation levels.

Several fused face recognition solutions based on emerging, non-light field sensors have recently

been proposed. In [132], a feature level fusion solution is proposed, concatenating extracted

features from RGB, depth, and thermal data using LBP, HOG, and HAAR methods. A score level

fusion solution is proposed in [133], adding classification scores obtained by horizontal gradient

ordinal relationship patterns and steerable filters to perform face recognition on rectified stereo

images. The fused representation solution proposed in [138] combines shallow Pyramidal

Histogram Of visual Words (PHOW) and VGG-Face deep descriptions at feature and score levels

to perform face recognition on a new RGB-depth-infrared database. A feature level fusion scheme

proposed in [139] combines LBP, Gabor jet, Weber local and down-sampling local descriptors for

thermal face recognition.

3.3.2.8 Non-Light Field Based Face Recognition: the Status Quo

Figure 3.7 illustrates the evolution of face recognition solutions over time, grouped based on their

feature extraction approaches. Figure 3.7 also includes the typical performance in terms of

Recognition Rate at rank 1 (RR1) obtained for each group of techniques on the LFW [88] database.

The appearance based solutions dominated the face recognition landscape from the early 1990s

until around 1997. Then, model based solutions appeared and remained the state-of-the-art until

approximately 2006. Hand-crafted based solutions were introduced afterwards, providing a moderate

improvement in the accuracy of face recognition solutions. In 2014, DeepFace [107] dramatically

improved the state-of-the-art accuracy, from around 80% to above 97.5%. From that date, the face

recognition research focus has shifted to deep learning based solutions and the current state-of-

the-art on face recognition is dominated by deep neural networks, offering more than 99%

accuracy for the LFW database [152].
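
For reference, the rank-1 recognition rate used as the performance measure can be computed as in the following sketch. It assumes that higher scores mean better matches and that every probe is compared against all gallery identities; the function name and the toy scores are hypothetical.

```python
def rank1_recognition_rate(score_matrix, true_ids, gallery_ids):
    """Fraction of probes whose best-scoring gallery identity is the true one."""
    correct = 0
    for scores, true_id in zip(score_matrix, true_ids):
        # Index of the gallery identity with the highest similarity score.
        best = max(range(len(scores)), key=scores.__getitem__)
        if gallery_ids[best] == true_id:
            correct += 1
    return correct / len(true_ids)


gallery_ids = ["a", "b", "c"]
score_matrix = [
    [0.9, 0.1, 0.0],  # probe of "a", ranked correctly
    [0.2, 0.3, 0.5],  # probe of "b", wrongly matched to "c"
    [0.1, 0.8, 0.1],  # probe of "b", ranked correctly
    [0.6, 0.3, 0.1],  # probe of "a", ranked correctly
]
true_ids = ["a", "b", "b", "a"]
rr1 = rank1_recognition_rate(score_matrix, true_ids, gallery_ids)
```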

Figure 3.7: Overview of the evolution of face recognition solutions over time, grouped based on

feature extraction approaches; performance values for the LFW database.

In [141], a comprehensive evaluation of deep learning models for face recognition using different

CNN architectures and different databases, under various facial variations, is presented.

Additionally, the impact of different covariates, such as compression artefacts, occlusions, noise,

and color information, on the face recognition performance for different architectures has been

studied [118]. The results have shown that the VGG-Face descriptor [38], computed using a VGG-

16 network, achieves superior recognition performance under various facial variations, and is more

robust to different covariates, when compared to relevant alternatives. The VGG-Face descriptor

has been trained over 2.6 million face images, covering rich variations in expression, pose,

occlusion, and illumination, obtaining a so-called pre-trained VGG-Face model for face

recognition, containing 144 million parameters. The VGG-Face description is obtained by running

the VGG-16 network [146] without the last two fully connected layers [38] using the pre-trained

VGG-Face model, thus including 13 convolutional layers, followed by one fully connected layer,

resulting in a feature vector of size 4096.
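
The size of the truncated network can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (3×3 kernels, the usual channel widths and five 2×2 pooling stages) and a 224×224 RGB input, which are not stated in the text; `descriptor_network_size` is a hypothetical helper.

```python
# Standard VGG-16 layer widths (assumed); "M" marks a 2x2 max pooling stage.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def descriptor_network_size(input_size=224, input_channels=3, fc_units=4096):
    """Count layers and parameters of the conv stack plus the first FC layer."""
    channels, size = input_channels, input_size
    conv_layers, parameters = 0, 0
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2  # pooling halves the spatial resolution
        else:
            parameters += 3 * 3 * channels * item + item  # weights + biases
            channels = item
            conv_layers += 1
    flattened = size * size * channels
    parameters += flattened * fc_units + fc_units  # first fully connected layer
    return conv_layers, flattened, parameters


conv_layers, flattened, parameters = descriptor_network_size()
```

Under these assumptions, the 13 convolutional layers and the first fully connected layer account for about 117 million parameters, and the 4096 outputs of that layer form the descriptor.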

3.3.3 Light Field Based Face Recognition Solutions

This section reviews existing face recognition solutions exploiting light field sensors. Several face

recognition solutions exploiting the richer light field imaging information have recently been

proposed. Following the multi-level face recognition taxonomy developed, Table 3.3 classifies the

available light field based face recognition solutions, along with the databases used for reporting

results. The characteristics of the light field based face recognition solutions being proposed in this

Thesis are also listed in Table 3.3 for comparison purposes.

Table 3.3: Classification of the prior and proposed (Prop.) light field based face recognition

solutions, based on the proposed taxonomy.

| Solution Name | Year | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Database |
|---|---|---|---|---|---|---|
| MPCA Tensor [153] | 2008 | Global | Global | Appearance | Multi-Linear | N/A |
| LF Face [28] | 2013 | Global | Local | Hand-Crafted | Texture | Private |
| Multi-Face LF [29] | 2013 | Global | Local | Hand-Crafted | Texture | LiFFID |
| Super Res. LF [30] | 2013 | Global | Local | Hand-Crafted | Texture | LiFFID |
| Face-Iris MF LF [27] | 2016 | Global | Local | Hand-Crafted | Texture | LiFFID |
| DM LF [31] | 2016 | Global | Local | Hand-Crafted | Texture | Private |
| Prop. LFLBP [37] | 2017 | Global | Local | Hand-Crafted | Texture | LLFFD |
| Prop. LFHG [36] | 2018 | Global | Local | Hand-Crafted | Texture | LLFFD |
| Prop. VGG-D3 [39] | 2018 | Global | Local | Learning | Deep Neural Nets | LLFFD |
| Prop. VGG+ Conv-LSTM [40] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD |
| Prop. VGG+ GLF-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD |
| Prop. VGG+ SLF-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD |
| Prop. VGG+ SeqL-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD |

Additionally, Table 3.4 summarizes the main characteristics, including feature extraction method,

classifier, light field capability exploited, and light field format, of prior and proposed light field

based face recognition solutions. The genesis of these solutions is associated with three distinctive

light field capabilities, i.e., a posteriori refocusing, disparity exploitation and depth map

exploitation (see Section 2.5). The solutions summarized in Table 3.3, excluding the proposed

solutions, are briefly reviewed in the following, grouped based on the feature extraction

approaches considered in the taxonomy.

Table 3.4: Overview of prior and the proposed light field based face recognition solutions.

| Solution Name | Year | Feature Extraction Method | Classifier | Light Field Capability | Format |
|---|---|---|---|---|---|
| MPCA Tensor [153] | 2008 | MPCA | NNC | Disparity exploitation | M-V SA array |
| LF Face [28] | 2013 | LBP | NNC | Depth computation | 2D rendered from LF |
| Multi-Face LF [29] | 2013 | LBP; LG filter | SRC | A posteriori refocusing | 2D rendered from LF |
| Super Res. LF [30] | 2013 | LBP | SRC | A posteriori refocusing | 2D rendered from LF |
| Face-Iris MF [27] | 2016 | HOG; LBP; CSLBP; BSIF | SRC | A posteriori refocusing | 2D rendered from LF |
| DM LF [31] | 2016 | LFHOG | SVM | Depth computation | M-V SA array |
| Prop. LFLBP [37] | 2017 | LFLBP | SVM | Disparity exploitation | M-V SA array |
| Prop. LFHG [36] | 2018 | HOG; LFHDG | SVM | Disparity exploitation | M-V SA array |
| Prop. VGG-D3 [39] | 2018 | VGG | SVM | Disparity exploitation; Depth computation | M-V SA array |
| Prop. VGG+ Conv-LSTM [40] | 2018 | VGG; Conv-LSTM | Softmax | Disparity exploitation | M-V SA array |
| Prop. VGG+GLF-LSTM [41] | 2018 | VGG; GLF-LSTM | Softmax | Disparity exploitation | M-V SA array |
| Prop. VGG+SLF-LSTM [41] | 2018 | VGG; SLF-LSTM | Softmax | Disparity exploitation | M-V SA array |
| Prop. VGG+SeqL-LSTM [41] | 2018 | VGG; SeqL-LSTM | Softmax | Disparity exploitation | M-V SA array |

3.3.3.1 Appearance Based Solution

There are a few multilinear appearance based solutions able to analyse the high dimensional light

field image information in its native form, thus exploiting the disparity information available in a

light field image; it is worth mentioning that none of the multilinear appearance based solutions

were originally designed for face recognition. Multilinear Principal Component Analysis (MPCA)

[153] is one such solution using tensors for feature extraction, and is able to decompose the original

problem into a series of multiple projection sub-problems to capture most of the tensorial input

variations. As a light field image represented in the form of a multi-view array of SA images can

be interpreted as a 4D tensor, MPCA has been considered for light field based face recognition in

this Thesis for the first time.
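
The 4D tensor view of a light field can be made concrete with a small sketch of the mode-n unfolding on which multilinear methods such as MPCA operate. It assumes a grayscale light field indexed as `[u][v][y][x]` (two angular and two spatial dimensions) and a lexicographic column ordering; these conventions and the function names are hypothetical and are not taken from [153].

```python
from itertools import product


def tensor_shape(tensor, order=4):
    """Shape of a nested-list tensor of the given order."""
    shape = []
    for _ in range(order):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


def mode_unfold(tensor, mode):
    """Mode-n unfolding of a 4D nested-list tensor into a matrix.

    Row i gathers every entry whose index along `mode` equals i.
    """
    shape = tensor_shape(tensor)
    other_ranges = [range(n) for axis, n in enumerate(shape) if axis != mode]
    matrix = []
    for i in range(shape[mode]):
        row = []
        for rest in product(*other_ranges):
            index = list(rest)
            index.insert(mode, i)
            u, v, y, x = index
            row.append(tensor[u][v][y][x])
        matrix.append(row)
    return matrix


# Toy light field: 2 x 3 sub-aperture views of 4 x 5 pixels each.
light_field = [[[[u * 1000 + v * 100 + y * 10 + x for x in range(5)]
                 for y in range(4)]
                for v in range(3)]
               for u in range(2)]
unfolded = mode_unfold(light_field, mode=1)  # 3 rows, one per horizontal view index
```

A multilinear method would estimate one projection per mode from such unfoldings, instead of vectorizing the whole light field into a single long vector.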

3.3.3.2 Hand-Crafted Based Solutions

The first group of hand-crafted based solutions relies on the a posteriori refocusing capability

when using light field imaging. This can improve the image quality of a previously out-of-focus

region of interest for the subsequent recognition of either a single face or multiple faces, positioned

at different distances. The solution presented in [28] proposes a wavelet energy based approach to

select the best focused face image from a set of refocused images, rendered from a light field

image. Then, the LBP descriptor is applied to extract features and a Nearest Neighbor Classifier

(NNC) is used for classification. Another solution is based on a resolution enhancement scheme

[29], using the discrete wavelet transform, to capture high frequency components from different

focused 2D images to create an all-in-focus face image to be input to an LBP descriptor. The

identification of multiple faces at different distances is investigated in [30] by exploring an all-in-

focus image created from a light field image. An LBP descriptor is applied to extract features from

the all-in-focus image and a Sparse Reconstruction Classifier (SRC) is used to perform the

recognition task. In [27], a face recognition solution is presented, relying on rendering a light field

image in different focus planes in two different ways: i) selecting the best focus image; and ii)

combining focus images to create a super-resolved image; both approaches have been considered

in this research. Different local descriptors including HOG, LBP, CSLBP, and Binarized Statistical

Image Features (BSIF) are used for feature extraction.
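
The idea of selecting the best focused image from a set of refocused images by means of wavelet energy can be sketched as follows. This is only an illustration, assuming a one-level Haar decomposition and the sum of squared detail coefficients as the focus measure; the wavelet, the measure and the function names are assumptions rather than the exact choices of [28].

```python
def haar_detail_energy(image):
    """Energy of the one-level Haar detail sub-bands of a grayscale image."""
    energy = 0.0
    for row in range(0, len(image) - 1, 2):
        for col in range(0, len(image[0]) - 1, 2):
            a, b = image[row][col], image[row][col + 1]
            c, d = image[row + 1][col], image[row + 1][col + 1]
            # Three detail coefficients of the 2x2 block (sub-band names are conventional).
            horizontal = (a - b + c - d) / 2.0
            vertical = (a + b - c - d) / 2.0
            diagonal = (a - b - c + d) / 2.0
            energy += horizontal ** 2 + vertical ** 2 + diagonal ** 2
    return energy


def best_focused(focal_stack):
    """Index of the refocused image with the largest detail energy."""
    return max(range(len(focal_stack)),
               key=lambda i: haar_detail_energy(focal_stack[i]))


# Toy focal stack: a flat image, a low-contrast pattern and a sharp pattern.
flat = [[128] * 4 for _ in range(4)]
blurred = [[120, 135, 120, 135], [135, 120, 135, 120],
           [120, 135, 120, 135], [135, 120, 135, 120]]
sharp = [[0, 255, 0, 255], [255, 0, 255, 0],
         [0, 255, 0, 255], [255, 0, 255, 0]]
index = best_focused([flat, blurred, sharp])
```

In practice the measure would be evaluated on the detected face region of each refocused image rather than on the whole image.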

The second group of hand-crafted based solutions relies on exploiting the depth information that

can be estimated from a light field image, thus providing geometric information about the position

and shape of facial components. In [31], a depth map computed from a light field image is analyzed

using a HOG descriptor for extracting discriminative features, which are then fed into a linear

SVM classifier to perform the recognition task.
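
To illustrate how a HOG-style description can be computed on a depth map, the sketch below builds the orientation histogram of a single cell. It assumes central-difference gradients, hard assignment to 9 bins and L2 normalization; these are common simplifications chosen here for brevity and are not claimed to match the configuration used in [31].

```python
import math


def cell_orientation_histogram(cell, bins=9, signed=False):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    span = 360.0 if signed else 180.0  # 'signed' vs 'unsigned' gradients
    histogram = [0.0] * bins
    for row in range(1, len(cell) - 1):
        for col in range(1, len(cell[0]) - 1):
            gx = cell[row][col + 1] - cell[row][col - 1]
            gy = cell[row + 1][col] - cell[row - 1][col]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % span
            histogram[int(angle / span * bins) % bins] += magnitude
    return histogram


def l2_normalize(histogram, eps=1e-6):
    """Normalize the histogram counts to reduce the effect of contrast changes."""
    norm = math.sqrt(sum(v * v for v in histogram) + eps)
    return [v / norm for v in histogram]


# Toy depth cell: depth grows linearly from left to right.
depth_cell = [[float(c) for c in range(4)] for _ in range(4)]
histogram = cell_orientation_histogram(depth_cell)
descriptor = l2_normalize(histogram)
```

The normalized histograms of all cells would then be concatenated and passed to the classifier.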

3.3.3.3 Light Field Based Face Recognition: the Status Quo

Table 3.3 summarizes the six available light field based face recognition solutions, along with the

databases used for reporting results. As it can be observed, the experiments performed in [28] and

[31] were conducted on private databases, so there is no way to compare their performance with

the other light field based face recognition solutions. MPCA has been considered for light field

based face recognition in this Thesis for the first time, meaning that no previous recognition

performance results had been reported in the literature. Concerning [29] [30] [27], although they

are tested on the same database, LiFFID, there is no comparative study available to analyze the

level of performance achieved by these solutions. In conclusion, given the nature of the existing

solutions and the way they were tested, it is difficult to provide precise information on the

performance to expect from light field based face recognition. The benchmarking study performed

in this Thesis will address this shortcoming.

3.4 Ear Recognition

Since the face is not the only relevant biometric, an increasing amount of research work has recently been

developed in the area of ear recognition. As the human ear structure remains stable over time for

the same person, and it does not present significant changes for different facial emotions and

actions, ear recognition has evolved as a reliable biometric modality for human identification in

recent years [6]. Since there is no research activity or publicly available databases addressing ear

recognition using light field sensors, excluding the ear recognition solutions proposed in this

Thesis, this section is dedicated to the review of non-light field based ear recognition databases

and solutions, following the multi-level face/ear recognition taxonomy proposed in Section 3.2.3.

Considering this taxonomy, see Figure 3.3, it should be noted that, although the relations between

the ear components contain critical information for ear recognition, no ear recognition solution

adopting the component representation paradigm, thus dealing independently with a selection of

ear components, has yet been proposed in the literature. Additionally, while all

the feature extraction sub-approaches in the taxonomy can potentially be used also for ear

recognition, some of them, notably multi-linear, dictionary learning, decision tree, regression,

Bayesian and shape descriptor feature extraction sub-approaches, have not yet been considered in

the literature for ear recognition.

3.4.1 Ear Databases: Status Quo

Ear databases play a very important role for designing, testing and validating ear recognition

solutions, while ensuring the reproducibility of performance results and their fair comparison. A

set of selected ear biometric databases are briefly reviewed in the following.

Currently, there are several publicly available ear image databases, none of them light field based.

An overview of the publicly available ear databases, including their main characteristics and

variations, is provided in Table 3.5. For comparison, the characteristics of the IST-

EURECOM Lenslet Light Field Ear DataBase (LLFEDB) [36] proposed in this Thesis are also included

in Table 3.5. The databases consider different characteristics and the corresponding ear images

exhibit different levels of variability. The variation ‘sides’ indicates whether images were captured

from one or both ears, ‘poses’ refers to yaw and pitch ear rotations, and ‘occlusions’ and

‘accessories’ indicate whether ears are (partly) occluded and whether accessories, such as earrings, are

visible in the ear images.

Table 3.5: Overview of ear databases with different (Diff.) characteristics.

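| Database Name | Year | Number of Subjects | Number of Images | Ear Variation: Diff. Sides | Ear Variation: Diff. Poses | Ear Variation: Diff. Occlusions | Ear Variation: Diff. Accessories |
|---|---|---|---|---|---|---|---|
| USTB I [154] | 2002 | 60 | 185 | Right | | | |
| USTB II [154] | 2004 | 77 | 308 | Right | | | |
| IITD I [155] | 2007 | 125 | 493 | Right | | | |
| AMI [156] | 2009 | 100 | 700 | Both | | | |
| WPUT [157] | 2010 | 501 | 2071 | Both | | | |
| IITD II [155] | 2014 | 221 | 793 | Right | | | |
| AWE [5] | 2016 | 100 | 1000 | Both | | | |
| Prop. LLFEDB [36] | 2018 | 67 | 536 | Both | | | |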

3.4.2 Ear Recognition Solutions

Ear recognition has recently attracted a considerable amount of research interest, thus resulting in

a large number of papers, although none of them exploring light field images. Following the

proposed taxonomy in Section 3.2.3, the main ear recognition solutions available are briefly

described in this section. Table 3.6 summarizes the main characteristics of a selection of

representative and relevant ear recognition non-light field solutions. The solutions listed in Table

3.6 are sorted based on feature extraction approach, feature extraction sub-approach and,

finally, publication date. The characteristics of the light field based ear recognition solutions

being proposed in this Thesis are also listed in Table 3.6 for comparison purposes. The solutions

summarized in Table 3.6 are briefly reviewed in the following, grouped based on the feature

extraction approach considered in the taxonomy.

Table 3.6: Classification of a selection of ear recognition solutions based on the developed

taxonomy.

| Ref. | Year | Ear Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Feature Extractor | Classifier |
|---|---|---|---|---|---|---|---|
| [7] | 2003 | Global | Global | Appearance | Linear | PCA | NNC |
| [158] | 2007 | Global | Global | Appearance | Linear | LDA | NNC |
| [8] | 2013 | Global | Global | Appearance | Linear | Sparse Coding Error Ratio | NNC |
| [159] | 2014 | Global | Global | Appearance | Linear | G-NSRC | NNC |
| [160] | 2015 | Global | Local | Appearance | Linear | PCA | NNC |
| [161] | 2016 | Global | Global | Appearance | Linear | Dictionary Based Sparse Representation | NNC |
| [162] | 2013 | Global | Global | Appearance | Linear; Non-Linear | ICA; LDA | NNC |
| [163] | 1999 | Comp.+ struct. | Global | Geometric | Graph | Distance Model | NNC |
| [164] | 2004 | Comp.+ struct. | Global | Geometric | Graph | Distance Model | NNC |
| [165] | 1997 | Comp.+ struct. | Global | Geometric | Shape | Contour Model | NNC |
| [166] | 2008 | Comp.+ struct. | Global | Geometric | Shape | Deformable Model | NNC |
| [167] | 2008 | Comp.+ | | | | | |
| [168] | 2016 | Comp.+ | | | | | |
| [169] | 2017 | Comp.+ struct. | Global | Geometric | Shape | Deformable models | NNC |
| [170] | 2017 | Global | Global | Learning | Deep Neural Network | SqueezeNet | Softmax |
| [171] | 2017 | Global | Global | Learning | Deep Neural Network | VGG-16 Network | Softmax |
| [172] | 2018 | Global | Global | Learning | Deep Neural Network | AlexNet; VGG-16 Network; GoogLeNet | Softmax |
| [173] | 2009 | Global | Local | Hand-crafted | Texture | LGPDP | NNC |
| [174] | 2012 | Global | Local | Hand-crafted | Texture | Multi-Scale Dense HOG | NNC |
| [175] | 2013 | Global | Local | Hand-crafted | Texture | SIFT | NNC |
| [176] | 2014 | Global | Local | Hand-crafted | Texture | LBP; LPQ; HOG; BSIF | NNC |
| [177] | 2016 | Global | Local | Hand-crafted | Texture | MLBP | NNC |
| [178] | 2016 | Global | Local | Hand-crafted | Texture | TDSIFT | NNC |
| [5] | 2017 | Global | Local | Hand-crafted | Texture; Frequency | LPQ; BSIF; SIFT; POEM; Gabor; HOG | NNC |
| Prop. LFLBP [43] | 2018 | Global | Local | Hand-crafted | Texture | LFLBP | SVM |
| Prop. LFHG [42] | 2018 | Global | Local | Hand-crafted | Texture | LFHG | SVM |
| [179] | 2005 | Global | Global | Hybrid | Linear; Deep Neural Network | ICA, RBF | Softmax |
| [180] | 2007 | Global | Global | Hybrid | Linear | Wavelet Transform, PCA | NNC |
| [181] | 2008 | Global | Local | Hybrid | Linear; Texture | Haar Wavelet Transform, LBP | NNC |
| [182] | 2013 | Global | Local | Hybrid | Linear; Texture | Sparse Representation; LRT | NNC |
| [183] | 2014 | Global | Global | Hybrid | Non-linear; Deep Neural Network; Texture | LDA; Neural Network; SURF | Softmax |
| [184] | 2014 | Global | Local | Hybrid | Texture | GLCM; LBP; Gabor filters | NNC |

3.4.2.1 Appearance Based Solutions

Appearance based ear recognition solutions exploit the global appearance of the input image, either

the whole ear or its components, to compute representations encoding the ear structure. A PCA

based solution, called eigen-ears, is proposed, combining uncorrelated linear combinations of the

basis images for ear recognition [7]. An ear recognition solution is proposed in [158], applying

Linear Discriminant Analysis (LDA) to determine a set of projection vectors maximizing inter-

class and minimizing intra-class variabilities. An extensive experimental comparison is conducted

for ear recognition using the ICA and LDA feature extraction approaches [162]. In [8], an adaptive

feature weighting scheme based on a sparse representation method, called sparse coding error

ratio, is proposed for ear recognition. A new feature extraction approach is investigated for ear

recognition, using the scale information of Gabor wavelets. Then, Gabor scale feature based non-

Negative Sparse Representation Classification (G-NSRC) is proposed for ear recognition under

occlusion [159]. The ear recognition solution proposed in [160] divides an ear image into smaller

blocks, then applies PCA to each block separately, and finally adds the outputs of the

classifiers applied to each block to perform ear recognition. Finally, the ear recognition solution

proposed in [161] uses a sparse representation framework without requiring any preprocessing or

normalization of the ear region.

3.4.2.2 Geometric Based Solutions

Geometric solutions use features representing the geometrical characteristics of the ear. Ear

recognition solutions from this category are, in general, computationally simple and often rely on

edge detection as a pre-processing step. The geometric based solutions to recognize user identity

from ear images focus on analyzing either simple ear geometrical features, such as height, width,

size, and distances between ear components, or more advanced shape models, e.g., graph,

deformable, and contour models, to provide more comprehensive geometrical descriptions. The

ear recognition solution in [165] localizes the ear using deformable contours on a Gaussian

pyramid representation of the image gradient. Then, the ear features are computed as a number of

scale, rotation and translation invariant geometrical factors. An ear identification solution is

proposed based on the extraction of an ear feature combining outer ear points, ear shape and

wrinkles information [163]. The ear recognition solution proposed in [164] consists of: i) ear edge

detection using a Sobel operator; and ii) ear feature extraction by forming a shape feature vector

of the outer ear and the structural feature vector of the inner ear. In [166], an ear deformable model

is constructed and then converted to a geometry image. Afterwards, a wavelet transform is applied to the

geometry image and the wavelet coefficients form the ear feature vector. Another feature

extraction approach computes geometrical parameters of ear contours extracted from ear images

[167]. The feature extraction is based on a concentric circles centered method to obtain an ear

centroid point and contour features, including contour starting points, ending points, bifurcations,

and intersections, computed with respect to the centroid point. A geometric feature extraction

method is proposed in [168], which finds the contours of the ear based on a Canny edge detector,

and then extracts shape features from the outer ear images with respect to the ear height line.

Finally, an extensive experimental comparison of ear recognition using state-of-the-art

methodologies for training and fitting statistical deformable models is presented in [169].

3.4.2.3 Learning Based Solutions

Thanks to the popularity of deep learning based solutions and their significant impact on different

computer vision tasks, three deep CNN solutions have recently been adopted also for ear

recognition [170] [171] [172]. In [170], the problem of training CNNs with limited training data

for the ear recognition task is addressed by considering data augmentation techniques including:

i) geometric and color perturbations to the available training data; and ii) synthetic data samples

generation. Then, the SqueezeNet [144], AlexNet [142], and VGG-16 network [146] generic

models are fine-tuned for deep ear description and a softmax classifier is used to perform ear

recognition. In [171], two fully-connected layers are added on top of the VGG-16 network seventh

layer and the pre-trained VGG-16 model optimized in [170] is used for model initialization. Then, the

pre-trained weights of the early layers are frozen and kept unchanged, while the newly added fully-

connected layers are trained from scratch. The output of the second fully connected layer of the

modified architecture is used as input to a softmax classifier to perform ear recognition. In [172],

deep CNN networks, notably AlexNet [142], VGG-16 network [146], and GoogLeNet [145] are

used considering two different training approaches, notably full model learning and selective

model learning. Data augmentation has also been applied to increase the amount of data for deep

CNN model training.

3.4.2.4 Hand-crafted Based Solutions

Several hand-crafted based solutions use local texture descriptors for ear recognition [5], [176],

[185]. A local hand-crafted based ear recognition solution is proposed in [173], deriving features

using Local Gabor Phase Difference Pattern (LGPDP) to represent images by exploiting

relationships of Gabor phase between a pixel and its neighbours. A robust ear recognition solution

using multi-scale dense HOG features as a descriptor of 2D ear images, capturing different and

complicated structures of ear images, is proposed in [174]. The Scale Invariant Feature Transform

(SIFT) is applied for ear feature description in [175]. An ear recognition solution is proposed in

[177], extracting features based on Multi-scale Local Binary Pattern (MLBP) descriptor to be used

as input to a classifier. A Texture and Depth Scale Invariant Feature Transform (TDSIFT)

descriptor, encoding 2D and 3D local features for ear recognition is proposed in [178]. A

comparative study of ear recognition performance is presented in [176], using LBP, LPQ, HOG,

and BSIF local descriptors.

3.4.2.5 Hybrid Solutions

Hybrid solutions combine elements from several categories to improve ear recognition

performance. A hybrid system for classifying ear images is proposed in [179], combining ICA and

a Radial Basis Function (RBF) network. The ear image is decomposed into linear combinations of

several basic images and the corresponding coefficients of these combinations are fed into an

RBF network to perform recognition. The ear recognition solution proposed in [180] decomposes

the ear image into three horizontal, vertical and diagonal images using wavelet transform and then

PCA is applied for feature dimension reduction. The proposed ear recognition solution in [181]

decomposes ear images using the Haar wavelet transform to provide input to a uniform LBP

descriptor, thus describing the ear sub-images texture features in the Haar wavelet domain. A

solution based on sparse representation of local gray-level orientation described by a Local Radon

Transform (LRT) is proposed for ear recognition in [182]. In [183], SURF features are computed

from the ear images and their dimensionality is then reduced using LDA before being input to two neural

networks. Finally, the extracted features from Grey-Level Co-occurrence Matrices (GLCM), LBP

and Gabor filters are combined for ear recognition in [184].

3.4.2.6 Ear Recognition: the Status Quo

Figure 3.8 illustrates an overview of the evolution of ear recognition solutions over time, grouped

based on feature extraction approaches. Figure 3.8 also includes the range of RR1 results obtained

for each group of techniques on the AWE dataset [5]. The model based solutions were the first

solutions to appear for ear recognition. Then, appearance based and hand-crafted based solutions

appeared successively, providing a moderate improvement in the accuracy of ear recognition

solutions. The state-of-the-art model based, appearance based, and local hand-crafted based

solutions provide, respectively, a RR1 recognition performance of 63.80%, 61.10%, and 65.20%

for the AWE dataset, thus showing the slight superiority of local hand-crafted based solutions at

the cost of a lower computational complexity. In 2017, deep learning based solutions started to

appear, although they have not yet led to considerably superior performance over local hand-

crafted based solutions; this is most probably due to the lack of enough training samples,

which has a larger impact on the recognition performance of deep architectures. In

fact, the number of samples in the available datasets for ear recognition is rather limited (see Table

3.5), thus deep learning based ear recognition solutions mainly use already trained

classification models, e.g., for generic object classification or for face recognition, for model

initialization. The adaptation to the specific ear recognition problem is done using transfer learning,

fine-tuning the models, using a part of the available ear dataset, which typically is not large enough

to result in an appropriate training. This reveals a pressing need to gather large-scale ear databases

in order to obtain better deep classification models for ear recognition.

Figure 3.8: Overview of the evolution of ear recognition solutions over time, grouped based on

feature extraction approaches; performance values for the AWE database.

Chapter 4

Proposing Novel Light Field Face and Ear

Recognition Databases

4.1 Introduction

As stated in Section 3.3.1, it was difficult to fully assess how face recognition technology can

benefit from light field data, as the only available light field face database, LiFFID [98], does not

include the raw light field images. In fact, LiFFID only includes a number of 2D images focused

at different depths for each person, rendered from light field images acquired by an old generation

of lenslet light field cameras; thus, it is only useful for testing and validating those face

recognition solutions that exploit the a posteriori refocusing capability supported by light field

imaging. Concerning ear recognition, no ear database captured by lenslet light field cameras was

available at the time of writing this Thesis.

To be able to test any light field face recognition solution, including those proposed in this Thesis,

it is necessary to have access to databases including light field face images in the light field raw

format, thus providing the flexibility to exploit different types of light field data for biometric

recognition. It should again be noted that the available databases did not include the light field

images, but rather only specific sets of 2D images rendered from the light fields.

To overcome these limitations, two light field based face and ear databases have been developed

in the context of this Thesis, allowing more powerful benchmarking for testing and validating face

and ear recognition solutions exploiting the full light field data; these databases have been made

publicly available to the research community. This chapter describes the proposed light field face

and ear databases.

4.2 Lenslet Light Field Face Recognition Database

A new database, the so-called IST-EURECOM Lenslet Light Field Face Database (IST-

EURECOM LLFFD), is introduced in this section. The proposed database includes data from 100

subjects, with 20 samples per person in each of two sessions, captured by a Lytro ILLUM lenslet camera. The images

are captured in a controlled acquisition setup with different facial variations, including emotions,

actions, poses, illuminations, and occlusions in order to benefit from the non-intrusive nature of

face recognition. This database refers to application scenarios where the subjects present

themselves to a fixed camera with a controlled background, but significant flexibility is allowed

in terms of pose, expression and occlusions. This is a rather common and realistic scenario in

business and industrial environments where the facial images to be recognized are captured in, at

least partly, constrained conditions. The database includes the raw light field images, sample 2D

rendered images and the associated depth maps, along with a rich set of metadata.

4.2.1 Acquisition Setup and Statistics

The proposed IST-EURECOM LLFFD was acquired in the context of a cooperation between the

Multimedia Signal Processing laboratory at Instituto de Telecomunicações, Instituto Superior

Técnico, Lisbon, Portugal and the Imaging Security Lab at EURECOM, SophiaTech, Nice,

France. Image acquisition was performed in an indoor environment, using the Lytro ILLUM

lenslet camera [26]. The acquisition setup, illustrated in Figure 4.1, included a white backdrop

background behind a chair at a fixed distance of 1.25 m to the camera. The scene was illuminated

with a three-point lighting kit, including a key light, a fill light and a back light, placed to limit

shadows and allow easy segmentation of the subject from the background; a sketch of the database

acquisition setup is included in Figure 4.2. The image acquisition process has been repeated in the

two labs with the same predefined setup. Each volunteer participated in two separate acquisition

sessions, with a time interval between 1 and 6 months. The database includes 20 shots per person

in each session, with different facial variations including facial emotions, actions, poses,

illuminations and occlusions. Before the acquisition process, volunteers were asked to fill in and sign

consent and metadata forms.

Figure 4.1: Acquisition setup at (a) IST; and (b) EURECOM.

The IST-EURECOM LLFFD includes data from 100 volunteers, 66 males and 34 females, with a

total number of 4000 light field face images in the database, corresponding to a total disk space of

about 270 GB. The participants were born between 1957 and 1998, and are from 19 different

countries. Figure 4.3 illustrates the distribution of subjects by age.

Figure 4.2: A sketch of the LLFFD acquisition setup.

Figure 4.3: Age distribution for the subjects in IST-EURECOM LLFFD.

4.2.2 Database Variations

To fully benefit from the non-intrusive nature of face recognition, a face recognition system may

be required to recognize a face in arbitrary situations, without the explicit cooperation of the

subject. This flexibility is of great interest in many face recognition applications, notably many

video surveillance environments.

To consider less controlled acquisition conditions, the IST-EURECOM LLFFD includes a total of

20 face variations per person, categorized into 6 dimensions:

1. Neutral image (1 image): image captured with standard illumination, frontal pose, neutral

emotion, no action, and no occlusion;

2. Emotions (3 images): images with three different emotions, notably happy, angry and surprised;

3. Actions (2 images): images with two different actions, notably closed eyes and open mouth;

4. Poses (6 images): images with different poses, notably looking up, looking down, right half-

profile, right profile, left half-profile, left profile;

5. Illumination (2 images): images with different illumination intensities, notably low and high

illumination levels;

6. Occlusions (6 images): images with occlusions, notably eye occluded by hand, mouth occluded

by hand, with glasses, with sunglasses, with surgical mask and with hat.

Examples of the various face variations considered in the IST-EURECOM LLFFD are illustrated

in Figure 4.4. All images were taken under controlled conditions, but there were no restrictions

imposed on clothing, make-up and hair style.
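
The variation counts listed above can be cross-checked against the database statistics given in Section 4.2.1. The following is a minimal sketch; the dictionary keys and function names are hypothetical and only the counts come from the text.

```python
# Number of images per variation category, per subject and per session,
# as listed in the six-item enumeration above.
VARIATIONS_PER_CATEGORY = {
    "neutral": 1,
    "emotions": 3,
    "actions": 2,
    "poses": 6,
    "illumination": 2,
    "occlusions": 6,
}

SUBJECTS = 100  # volunteers in the database (Section 4.2.1)
SESSIONS = 2    # acquisition sessions per volunteer (Section 4.2.1)


def shots_per_session() -> int:
    """Total number of face variations captured for one subject in one session."""
    return sum(VARIATIONS_PER_CATEGORY.values())


def total_light_fields() -> int:
    """Total number of light field images implied by the counts above."""
    return SUBJECTS * SESSIONS * shots_per_session()
```

Under these counts, `shots_per_session()` gives the 20 variations per person and `total_light_fields()` gives the 4000 light field images reported for the database.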

Figure 4.4: Illustration of 2D rendered images for the facial variations in the IST-EURECOM

LLFFD.

4.2.3 Database Elements

The IST-EURECOM LLFFD is the first biometric database to include raw light field imaging files.

It also includes additional information that can be useful for developing and testing face

recognition systems. The database is composed of the following elements:

1. Raw Light Field Images: Raw light field images are stored in the Lytro ILLUM native file

format, so-called Light Field Raw (LFR) files, with a size of about 50 MB/image. LFR files

can be used as initial input either for the Lytro camera software, i.e., Lytro Desktop Software

[186], or for any other processing library/toolbox, such as the Matlab Light Field Toolbox V0.4

[58].

2. 2D Images: Since light field images are not directly viewable in conventional 2D displays, the

proposed database also includes 2D rendered images for the central view of each light field

image variation, generated using the Lytro Desktop Software [186]. It is worth noting that this

software automatically performs a number of processing steps, including up-sampling and

color correction, to enhance the quality of the output images. As the raw light field images are

made available, any other rendering solution may also be used. The 2D rendered face images

can be viewed using conventional 2D displays or be further processed.

3. Depth Map: A depth map for each central view 2D rendered image is available in the database.

The depth map can be used to bridge the gap between 2D and 3D face recognition. Depth maps

(see example in Figure 4.5) can provide geometric information about the position and shape of

objects, to be explored by recognition systems. The supplied depth maps were generated using

the Lytro Desktop Software [186].

Figure 4.5: Sample depth map.

4. Landmark Information: Facial landmarks are relevant for facial region extraction and

normalization in face recognition systems. In the IST-EURECOM LLFFD, the facial

landmarks information includes the location of the face, left eye, right eye, nose and mouth

bounding boxes, as illustrated in Figure 4.6. The landmark information is extracted for the

central view 2D rendered images.

Figure 4.6: Illustration of facial landmarks.

5. Subjects Metadata Information: Metadata information can be used for the evaluation of face

recognition, facial expression recognition, gender classification, and age estimation automated

results. The IST-EURECOM LLFFD rich metadata includes the image acquisition date, as

well as information on the subject gender, age, facial hair, makeup, haircut and usage of

accessories; the range of values/labels for each of these metadata fields is listed in Figure 4.7.

Figure 4.7: Metadata associated to each subject.

6. Calibration Information: Calibration data is essential to compensate for the specific

properties of each camera’s sensor. For example, it is a required input for some light field

image processing software products, such as the Lytro Desktop Software [186] and the Matlab

Light Field Toolbox [58].

4.2.4 Database Structure and Naming Convention

The files composing the database are organized according to a hierarchical structure, as illustrated

in Figure 4.8. The root level of the hierarchy includes the metadata information and facial

landmarks for all the subjects and the camera calibration files. The root level also includes a folder

for each of the N subjects in the database, labelled using a 3 digit identifier, xxx. Each of these

folders contains 3 sub-folders: “LFR files”, “2D images” and “Depth map images”.

Figure 4.8: IST-EURECOM LLFFD file structure.

The naming convention for the database light field images is type_xxx_s_vv_variation where:

“type” refers to the type of image, notably “LF” (Light Field), “2D” (2 Dimensional) or “DM”

(Depth Map);

“xxx” is a three digit integer uniquely identifying the subject, starting from 001; the first 50

subjects have been recorded at IST and the second 50 subjects at EURECOM;

“s” is a digit indicating the acquisition session number, notably “1” or “2”;

“vv” is a two digit integer indicating the variation number, ranging from 01 to 20,

corresponding to the variations illustrated in Figure 4.4;

“variation” is a three letter acronym identifying the variation in a format more suitable for

human reading, as defined in Table 4.1.

Table 4.1: List of acronyms used in IST-EURECOM LLFFD along with their definitions.

Acronym Definition

NFF Neutral Frontal Face

EHF Emotion Happy Face

EAF Emotion Angry Face

ESF Emotion Surprised Face

AEC Action Eyes Closed

AMO Action Mouth Open

PUL Pose Up Looking

PDL Pose Down Looking

PHL Pose Half-profile Left

PHR Pose Half-profile Right

PPL Pose Profile Left

PPR Pose Profile Right

ILI Illumination Low Intensity

IHI Illumination High Intensity

OMH Occlusion Mouth by Hand

OEH Occlusion Eye by Hand

OFG Occlusion Face by Glasses

OFS Occlusion Face by Sunglasses

OFM Occlusion Face by Mask

OFH Occlusion Face by Hat
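
The naming convention above lends itself to automatic parsing. The following is a minimal sketch of a parser for the `type_xxx_s_vv_variation` pattern; the function and class names are hypothetical, the file extension is assumed to have been stripped beforehand, and the mapping between variation numbers and acronyms is not checked because it is only given graphically in the Thesis.

```python
import re
from typing import NamedTuple

# Three-letter variation acronyms listed in Table 4.1.
VARIATION_ACRONYMS = frozenset({
    "NFF", "EHF", "EAF", "ESF", "AEC", "AMO", "PUL", "PDL", "PHL", "PHR",
    "PPL", "PPR", "ILI", "IHI", "OMH", "OEH", "OFG", "OFS", "OFM", "OFH",
})

# type _ subject (3 digits) _ session (1 or 2) _ variation number (2 digits) _ acronym
_NAME_RE = re.compile(r"^(LF|2D|DM)_(\d{3})_([12])_(\d{2})_([A-Z]{3})$")


class LlffdName(NamedTuple):
    image_type: str        # "LF", "2D" or "DM"
    subject: int           # 1..100
    session: int           # 1 or 2
    variation_number: int  # 1..20
    variation: str         # acronym from Table 4.1
    site: str              # acquisition site derived from the subject number


def parse_llffd_name(stem: str) -> LlffdName:
    """Split a file name (without extension) into its naming-convention fields."""
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"not a valid LLFFD name: {stem!r}")
    image_type, subject, session, number, acronym = match.groups()
    subject_id, variation_number = int(subject), int(number)
    if not 1 <= subject_id <= 100:
        raise ValueError(f"subject identifier out of range: {subject}")
    if not 1 <= variation_number <= 20:
        raise ValueError(f"variation number out of range: {number}")
    if acronym not in VARIATION_ACRONYMS:
        raise ValueError(f"unknown variation acronym: {acronym}")
    # The first 50 subjects were recorded at IST, the second 50 at EURECOM.
    site = "IST" if subject_id <= 50 else "EURECOM"
    return LlffdName(image_type, subject_id, int(session),
                     variation_number, acronym, site)
```

For instance, `parse_llffd_name("LF_051_2_07_PUL")` would return a record for subject 51, session 2, acquired at EURECOM; this name is made up for illustration and may not correspond to an actual file.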

4.2.5 Database Access and Usage Conditions

The IST-EURECOM LLFFD is freely distributed for standardization and academic research purposes.

The first part of the database, captured at Instituto de Telecomunicações – Instituto Superior

Técnico, Lisbon, Portugal can be accessed at http://www.img.lx.it.pt/LFFD/. The second part,

captured at EURECOM, SophiaTech Campus, Nice, France can be accessed at

http://lffd.eurecom.fr/.

4.3 Lenslet Light Field Ear Recognition Database

Since no light field ear database was available, the IST-EURECOM Lenslet Light Field Ear

DataBase (LLFEDB) has been created, to make publicly available content allowing testing and

validating light field imaging based ear recognition systems. The proposed ear database consists

of 536 light field ear images from 67 subjects, with 8 image shots per person, captured with a Lytro

ILLUM lenslet camera, over two separate sessions, with four different poses per session.


This Thesis proposes the IST-EURECOM Lenslet Light Field Ear Database (LLFEDB),

containing only the ear region from a relevant subset of IST-EURECOM LLFFD images. Out of

the 100 LLFFD subjects, only 67 were selected, as for the remaining subjects the ears were

completely occluded with hair. The interval between acquisition sessions is in the range of 1-6

months. The IST-EURECOM LLFEDB includes data from volunteers from both genders, with a

total number of 536 light field ear images in the database, corresponding to a total disk space of

about 30 GB.

The participants were born between 1957 and 1998, originating from 15 different countries. The

ear portion of the facial images has been manually cropped. Since the facial images were acquired

at slightly different distances/camera’s zoom levels, the ear size in the database for each view

varies from 75×35 up to 107×86 pixels, with an average aspect ratio of 1.49. If necessary, some

normalization may have to be applied when using this content.

4.3.2 Database Variations

Among the available IST-EURECOM LLFFD facial images corresponding to multiple poses,

there are four of interest for ear recognition, notably the right and left half and full profile images.

LLFEDB consists of 536 light field images from 67 subjects, considering the four poses

mentioned, taken in two sessions, in a total of 8 images per subject – see Figure 4.9.

Figure 4.9: Illustration of IST-EURECOM LLFEDB 2D rendered ear images for the four

profiles of a specific subject in two separate acquisition sessions.
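
The composition of the ear database described above can be written down as a short enumeration. This is a minimal sketch; the pose acronyms are those of Table 4.1, while the constant and function names are hypothetical.

```python
from itertools import product

# Poses of interest for ear recognition, using the acronyms of Table 4.1:
# half-profile left/right and full profile left/right.
EAR_POSES = ("PHL", "PHR", "PPL", "PPR")
SESSIONS = (1, 2)
EAR_SUBJECTS = 67  # subjects whose ears are not completely occluded by hair


def ear_shots_for_subject():
    """Return the (session, pose) pairs available for one subject."""
    return list(product(SESSIONS, EAR_POSES))


def total_ear_light_fields() -> int:
    """Total number of light field ear images implied by the counts above."""
    return EAR_SUBJECTS * len(ear_shots_for_subject())
```

The enumeration yields 8 shots per subject and 536 light field ear images in total, matching the figures reported for the database.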

As an ear recognition system is often required to operate in unconstrained situations, i.e. without

the explicit cooperation of the subject, it is important to include less ‘recognition friendly’ content.

To investigate the effects of occlusions on the ear recognition performance, the IST-EURECOM

LLFEDB includes ear images partly occluded by ear piercing, earring, hair and combinations of

these occlusions – see examples in Figure 4.10.

Figure 4.10: Examples of partially occluded ear images: (a) ear piercing; (b) earring; (c) hair; and

(d) combination of occlusions.

4.3.3 Database Elements

The IST-EURECOM LLFEDB is composed of: i) the raw light field ear images; ii) their

corresponding representation as a multi-view SA images array; iii) central view 2D images (for

easy access); iv) metadata; v) ear landmark information; and vi) camera calibration file, as

described in the following:

1. Raw Light Field Images: The raw light field ear images are the most important component of

the database; they are stored in the Lytro ILLUM native format, the so-called Light Field Raw

(LFR) files. As landmark information for the central view rendered image is made available in

IST-EURECOM LLFEDB, the ear region can be easily cropped from the original LLFFD

facial images.

2. Multi-View Array: Ear recognition systems working on light field images may not process

directly the raw light field images and may instead process some conversion of the LFR files,

e.g. 2D rendered images such as SA images. The multi-view SA arrays available in the IST-

EURECOM LLFEDB database, extracted using the Matlab Light Field Toolbox [58], contain

only the ear region, cropped using the landmark information provided in the LLFFD database.

3. 2D Images: Since light field images are not directly viewable in conventional 2D displays, the

proposed database also includes 2D rendered ear images for the central view of each light field

image, extracted using the Matlab Light Field Toolbox [58] – see Figure 4.9. The available 2D

rendered images contain only the ear region.

4. Subjects Metadata Information: The IST-EURECOM LLFEDB metadata includes the

image acquisition date, subject gender, subject age, and information about occlusions by hair,

earrings or piercings. The set of labels for each metadata field is listed in Table 4.2.

5. Ear Landmark Information: In the IST-EURECOM LLFEDB, the landmark information is

defined by the corner coordinates of the ear bounding boxes in the facial image. The landmark

coordinates refer to the central view 2D rendered images.

6. Calibration Information: Calibration data is essential to compensate for the specific

properties of each camera’s sensor. For example, it is a required input for some light field

image processing software products, such as the Lytro Desktop Software [186] and the Matlab

Light Field Toolbox [58].

Table 4.2: Metadata associated to each subject in each acquisition session.

Field Range

Date taken Date defined as YYYY/MM/DD

Gender Male, Female

Age Integer number

Hair Occlusion Yes, No

Earing Occlusion Yes, No

Ear Piercing Occlusion Yes, No

4.3.4 Database Structure and Naming Convention

The files composing the database are organized according to the hierarchical structure illustrated

in Figure 4.11. The root level includes files containing the metadata information for all the subjects

and the ear landmark information for all the images. The root level also includes a folder for each

of the 67 subjects in the database, labelled using a 3 digit identifier, xxx, corresponding to their

identifications in IST-EURECOM LLFFD. Each of these folders contains 2 sub-folders: “2D

images” and “Multi-view arrays”.

Figure 4.11: IST-EURECOM LLFEDB file structure.

The naming convention for the database files follows the same protocol as defined for IST-

EURECOM LLFFD (Section 4.2.4).

4.3.5 Database Access and Usage Conditions

The IST-EURECOM LLFEDB is freely distributed for standardization and academic research purposes.

The database can be downloaded from: http://www.img.lx.it.pt/LLFEDB/.

Chapter 5


Recognition Solutions

5.1 Introduction

This Thesis proposes seven light field based face and ear recognition solutions, evolving through

progressive levels of functionality and performance, exploiting the additional information

available in a light field image. The first two solutions are proposed based on light field hand-

crafted descriptors, describing the disparity information available in light field images for both

face and ear recognition. The other five recognition solutions are based on fused deep/double-deep

descriptors, learning convolutional representations and angular dynamics from a light field image

for face recognition. The proposed recognition solutions are summarized in Figure 5.1.

Figure 5.1: Summary of the proposed recognition solutions.

5.2 Face and Ear Recognition Based on Light Field Local Binary Pattern

Descriptor

This section proposes a face/ear recognition solution based on a new hand-crafted light field

descriptor, so-called Light Field Local Binary Patterns (LFLBP) descriptor, exploiting the spatial

and disparity information available in light field images for face and ear recognition tasks.

5.2.1 Architecture and Walkthrough

The generic architecture of the proposed recognition solution based on LFLBP hand-crafted

descriptor for both face and ear recognition tasks is illustrated in Figure 5.2. By exploiting the

multiple SA images, available from the light field multi-view representation, the proposed LFLBP

descriptor is expected to improve the recognition system performance.

The proposed recognition architecture includes the following main steps:

1. Pre-processing: The Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-

view array of SA images (Section 2.4) from the raw input light field. Then, the face and ear

in all SA images are cropped and resized to 128×128 and 192×128 pixels, respectively.

Additionally, for ear images, the images of left side ears are flipped horizontally, making all

ear images to be further processed to have the same orientation. There are three reasons for

these image size selections: i) A study with the IST-EURECOM LLFFD and IST-EURECOM

LLFEDB databases has shown that the average aspect ratios of the cropped faces and ears are

1.08 and 1.51, respectively, thus justifying the aspect ratio of the resized faces and ears; ii) A

preliminary study conducted during the Thesis work has shown that increasing the image

resolution does not significantly impact the recognition performance, while increasing the

computational complexity; and iii) Although the IST-EURECOM LLFFD database considers

larger image sizes, the face area is only a portion of that size, with 128×128 being a size

appropriate for the cropped face image; for ear images, 192×128 is a size using the same

horizontal resolution and a vertical resolution growing to adjust to the ears aspect ratio

measured from IST-EURECOM LLFEDB.

2. LFLBP feature description: The proposed LFLBP descriptions are extracted from the

normalized multi-view array, as detailed in Section 5.2.2.

3. Offline training: The LFLBP descriptions extracted from the training samples, highlighted in

red in Figure 5.2, are fed to a linear SVM classifier (implemented using the LIBSVM library

[187]) to define the classification model. The training data should be selected based on the test

protocol considered.

4. Classification: LFLBP descriptions extracted from testing samples are fed to the previously

trained SVM classifier, thus determining the subject identity.
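
The pre-processing step above (crop, flip left ears, resize) can be sketched for a single SA image as follows. This is only an illustration under stated assumptions: images are plain row-major lists of gray levels, the bounding box layout `(top, left, height, width)` is hypothetical, sizes are read as (height, width), and nearest-neighbour resampling is used because the Thesis does not specify the interpolation method.

```python
from typing import List, Tuple

Image = List[List[float]]  # row-major grayscale image (assumed layout)

# Target sizes as (height, width); the ordering is an assumption.
FACE_SIZE = (128, 128)
EAR_SIZE = (192, 128)


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut the bounding box out of the image."""
    return [row[left:left + width] for row in image[top:top + height]]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image so that left ears get the orientation of right ears."""
    return [row[::-1] for row in image]


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize (assumed; the interpolation is not specified)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[(r * in_h) // out_h][(c * in_w) // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


def preprocess_view(image: Image, box: Tuple[int, int, int, int],
                    modality: str, is_left_ear: bool = False) -> Image:
    """Crop, optionally flip, and resize one SA image."""
    if modality not in ("face", "ear"):
        raise ValueError(f"unknown modality: {modality!r}")
    region = crop(image, *box)
    if modality == "ear" and is_left_ear:
        region = flip_horizontal(region)
    out_h, out_w = FACE_SIZE if modality == "face" else EAR_SIZE
    return resize_nearest(region, out_h, out_w)
```

In a full pipeline the same bounding box would be applied to every SA image of the multi-view array, so that all views remain spatially aligned before the LFLBP description is computed.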

Figure 5.2: Architecture of the proposed face and ear recognition solution based on LFLBP

hand-crafted descriptor.

5.2.2 Light Field Local Binary Patterns Feature Description

LBP [76] and its variants are among the best performing hand-crafted feature descriptors for face

recognition. The LFLBP combined descriptor is an extension of LBP, which is able to exploit the

richer information available in light field images. Thus, the recognition solution can exploit both

spatial and angular information that may boost the final recognition performance. Similarly to the

original LBP [76], the novel LFLBP processes the gray level intensities of the captured light fields.

The input to a LFLBP descriptor is a multi-view array, i.e. L(u,v,x,y), where u and v identify the

viewpoint, and x and y the pixel position within a SA image. In a Cartesian representation, for the

used Lytro ILLUM lenslet camera [26], u and v take integer values in the range {-7, …, 7}, and the

size of each SA image is 625×434 pixels. The central SA image is the reference view position, denoted

as L(0,0,x,y), as highlighted in yellow in Figure 5.3. Each SA image can also be identified using

polar coordinates using two parameters: i) the radius, R, expressing the Euclidean distance to the

reference view, with a direct relation with the observed disparity; and ii) the angle, A, measured

counter-clockwise from the positive part of the real axis. A third parameter, N, defines the number

of SA images, or views, to consider in the descriptor. Figure 5.3 shows three examples of SA

images, highlighted in red, with different parameter values.

Figure 5.3: Examples of selected SA images (red). The SA images highlighted in dark grey do

not contain usable image information due to the micro-lens shape.

The proposed LFLBP descriptor combines two components: i) the Spatial Local Binary Patterns

(SLBP) descriptor, which corresponds to LBP applied to the central, reference view; and ii) the

novel Light Field Angular Local Binary Patterns (LFALBP) descriptor, which captures the multi-

view information available in the light field image.

1) Spatial Local Binary Pattern (SLBP): For a selected set of p samples in the spatial

neighborhood, at distance r from the reference sample (x, y), with starting angle a, the sample level

SLBP pattern value ($\mathrm{SLBP}_{SL}$), for position (x, y), is defined by Equation 5.1:

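\[
\mathrm{SLBP}_{SL}(r, a, p, x, y) = \sum_{i=1}^{p} \mathrm{sign}\left(L_{0,0,(x+k),(y+l)} - L_{0,0,x,y}\right) \times 2^{i-1} \tag{5.1}
\]

where

\[
\begin{cases}
k = \left\lceil r \sin\left(a + \dfrac{360^\circ}{p} \times (i-1)\right) \right\rceil \\[6pt]
l = \left\lceil r \cos\left(a + \dfrac{360^\circ}{p} \times (i-1)\right) \right\rceil
\end{cases} \tag{5.2}
\]

and sign(x) is the sign function defined as:

\[
\mathrm{sign}(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise.} \end{cases} \tag{5.3}
\]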

Equation 5.2 transforms the central-view ith sample coordinates from polar to Cartesian, as

Equation 5.1 works with Cartesian coordinates. The central-view sample intensity at position x,y

is used as threshold value; whenever the selected sample in the spatial neighbourhood is below the

threshold, then the sign(x) function takes value 0; otherwise, it takes value 1, according to Equation

5.3. The binary thresholding result is multiplied by the binomial factor, $2^{i-1}$, and the resulting

values are summed, to obtain a value in the range [0, $2^{p}-1$], which is the SLBP_SL pattern value for

each sample position x,y.

Finally, the SLBP descriptor corresponds to the histogram of SLBP_SL pattern values computed for

all x,y samples, according to Equation 5.4:

$$\mathrm{SLBP}(r,a,p)=\operatorname{Histogram}\left(\mathrm{SLBP}_{SL}(r,a,p,x,y);\ \forall x\in(1,\dots,X);\ \forall y\in(1,\dots,Y)\right) \tag{5.4}$$

where X and Y indicate the number of samples in the central view. The SLBP descriptor expresses

elementary characteristics of spatial information in the form of a magnitude sign histogram.
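Equations 5.1 to 5.4 can be illustrated with a minimal Python sketch. It is not the implementation used in this Thesis: the function names, the list-of-lists image layout and the choice to skip border positions whose neighbours fall outside the view are assumptions made for illustration. The rounding applied before the ceiling is also an assumption, added so that floating-point noise (for example cos 90° evaluating to about 6e-17) is not lifted to 1.

```python
import math

def sign(value):
    """Thresholding function of Equation 5.3."""
    return 1 if value >= 0 else 0

def polar_offset(radius, start_deg, count, i):
    """Offsets (k, l) of Equation 5.2 for sample i = 1..count."""
    angle = math.radians(start_deg + 360.0 / count * (i - 1))
    # Round first so that tiny floating-point residues become exactly 0.
    k = math.ceil(round(radius * math.sin(angle), 9))
    l = math.ceil(round(radius * math.cos(angle), 9))
    return k, l

def slbp_sample(view, r, a, p, x, y):
    """Sample-level SLBP pattern value of Equation 5.1 at position (x, y)."""
    value = 0
    for i in range(1, p + 1):
        k, l = polar_offset(r, a, p, i)
        value += sign(view[x + k][y + l] - view[x][y]) * 2 ** (i - 1)
    return value

def slbp_histogram(view, r, a, p):
    """Histogram of Equation 5.4 over positions whose neighbours all exist."""
    hist = [0] * (2 ** p)
    for x in range(r, len(view) - r):
        for y in range(r, len(view[0]) - r):
            hist[slbp_sample(view, r, a, p, x, y)] += 1
    return hist

# Hypothetical 3x3 central view; the centre sample (value 50) is the threshold.
view = [[10, 20, 30],
        [40, 50, 60],
        [70, 80, 90]]
centre_pattern = slbp_sample(view, r=1, a=0, p=4, x=1, y=1)
```

For this toy view, only the first two neighbours (values 60 and 80) are not below the threshold, so the pattern value is 1 + 2 = 3.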

2) Light Field Angular Local Binary Pattern (LFALBP): The LFALBP hand-crafted descriptor

is here proposed to exploit the information corresponding to the variations observed for light rays

travelling in different directions, as captured by light field images. For a selected set of S SA

images, at distance R from the reference view, with starting angle A, the sample level LFALBP

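pattern value (LFALBP_SL), for position x,y, is defined by Equation 5.5:

$$\mathrm{LFALBP}_{SL}(R,A,S,x,y)=\sum_{i=1}^{S}\operatorname{sign}\left(L_{u,v,x,y}-L_{0,0,x,y}\right)\times 2^{i-1} \tag{5.5}$$

where

$$\begin{cases}u=\left\lceil R\sin\left(A+\frac{360^{\circ}}{S}\times(i-1)\right)\right\rceil\\ v=\left\lceil R\cos\left(A+\frac{360^{\circ}}{S}\times(i-1)\right)\right\rceil\end{cases} \tag{5.6}$$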

Equation 5.6 transforms the ith selected SA image coordinates from polar to Cartesian, as Equation

5.5 works with Cartesian coordinates. As defined in Section 5.2.2, and illustrated in Figure 5.3, A

indicates the starting angle for the first SA image to consider, at radius R, and S indicates the

number of views to consider for the descriptor. The reference view sample intensity at position x,y

is used as threshold value; whenever another view’s component intensity value at position x,y is

below the threshold, then the sign(x) function takes value 0; otherwise, it takes value 1, according

to Equation 5.3. As for the conventional LBP descriptor, the binary thresholding result is

multiplied by the binomial factor, $2^{i-1}$, and the resulting values are summed, to obtain a value in

the range [0, $2^{S}-1$], which is the LFALBP_SL pattern value for each sample position x,y.

Finally, the LFALBP description is the histogram of LFALBP_SL pattern values computed for all

x,y samples, according to Equation 5.7:

$$\mathrm{LFALBP}(R,A,S)=\operatorname{Histogram}\left(\mathrm{LFALBP}_{SL}(R,A,S,x,y);\ \forall x\in(1,\dots,X);\ \forall y\in(1,\dots,Y)\right) \tag{5.7}$$

where X×Y indicates the number of luminance samples (or pixels) in each view. The LFALBP

descriptor expresses the magnitude sign histogram of the disparity information present in a light

field image.

The computation of LFALBP_SL for a face sample x1,y1, is illustrated in Figure 5.4, considering the

parameter values R=3, A=0° and S=4. The LFALBP pattern value is obtained by taking the gray

values of pixel x1,y1 from the reference view, 225 in this example, and from the other four views.

The reference view value is used for thresholding; whenever the other view’s value in position

x1,y1 is below the threshold it takes value 0, otherwise it takes value 1. Finally, the thresholded

values are multiplied by the corresponding binomial factors, as shown in Figure 5.4, and summed

to obtain the LFALBP pattern value for pixel x1,y1 – value 12 in the example.

Figure 5.4: LFALBP descriptor extraction example.
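The worked example can be reproduced with a short Python sketch of Equations 5.5 and 5.6. Only the parameter values (R=3, A=0°, S=4), the reference value 225 and the resulting pattern value 12 come from the text; the four neighbouring view values below are hypothetical, chosen so that the result matches, and the dictionary-based light field layout and function names are assumptions made for illustration.

```python
import math

def view_offsets(R, A_deg, S):
    """(u, v) view coordinates of Equation 5.6 for the S selected SA images."""
    offsets = []
    for i in range(1, S + 1):
        angle = math.radians(A_deg + 360.0 / S * (i - 1))
        # Round before the ceiling so that values such as cos(90 deg) ~ 6e-17
        # are treated as 0 instead of being lifted to 1.
        u = math.ceil(round(R * math.sin(angle), 9))
        v = math.ceil(round(R * math.cos(angle), 9))
        offsets.append((u, v))
    return offsets

def lfalbp_sample(light_field, R, A_deg, S, x, y):
    """Sample-level LFALBP pattern value of Equation 5.5.

    light_field maps a (u, v) view coordinate to a 2D list of gray levels.
    """
    reference = light_field[(0, 0)][x][y]
    value = 0
    for i, (u, v) in enumerate(view_offsets(R, A_deg, S), start=1):
        bit = 1 if light_field[(u, v)][x][y] - reference >= 0 else 0
        value += bit * 2 ** (i - 1)
    return value

# Hypothetical one-pixel views: only the reference value 225 comes from the text.
light_field = {
    (0, 0): [[225]],
    (0, 3): [[210]],   # i = 1, below the threshold, bit 0
    (3, 0): [[190]],   # i = 2, below the threshold, bit 0
    (0, -3): [[230]],  # i = 3, not below the threshold, bit 1 (weight 4)
    (-3, 0): [[240]],  # i = 4, not below the threshold, bit 1 (weight 8)
}
pattern = lfalbp_sample(light_field, R=3, A_deg=0, S=4, x=0, y=0)
```

With these values, the third and fourth views set their bits, giving 4 + 8 = 12.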

3) LFLBP as a LFALBP and SLBP Combination: LFALBP may be combined not only with

SLBP, to build LFLBP, but also with any other existing local hand-crafted descriptor, to derive enhanced

descriptions for light field based recognition. In fact, the combination of angular and spatial

descriptions ‘fuses’ complementary information, thus improving the final recognition

performance. This combination flexibility is expressed by the generic combination framework

presented in Figure 5.5 where any spatial descriptor can be used to replace SLBP while still

benefiting from the complementary angular information captured by the novel LFALBP

descriptor.

This Thesis proposes a specific combination, the LFLBP descriptor, computed according to

Equation 5.8.

$$\mathrm{LFLBP}_{r,a,p,R,A,S}(x,y)=\left(\mathrm{LFALBP}_{R,A,S}(x,y)\times 2^{p}\right)+\mathrm{SLBP}_{r,a,p}(x,y) \tag{5.8}$$

where the multiplication $\mathrm{LFALBP}_{R,A,S}(x,y)\times 2^{p}$ corresponds to a binary left shift by p bits. Equation 5.8 concatenates

the spatial SLBP with the angular light field descriptions to form the combined LFLBP description.

The resulting description has p + S bits for each sample. The final description is the histogram of

LFLBP pattern values computed for all 𝑥, 𝑦 samples, expressing the variations for the spatial and

angular information available in a light field image.
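The bit-level concatenation of Equation 5.8 can be sketched in Python as follows. The function names and the example pattern values are hypothetical; the sketch only assumes that the spatial pattern fits in p bits and that the angular pattern fits in as many bits as there are selected views (S in Equation 5.5).

```python
def lflbp_sample(lfalbp_value, slbp_value, p):
    """Combine the two sample-level patterns as in Equation 5.8."""
    if not 0 <= slbp_value < 2 ** p:
        raise ValueError("SLBP pattern must fit in p bits")
    # Multiplying by 2**p is a left shift by p bits; the p low bits are then
    # free to hold the spatial pattern.
    return (lfalbp_value << p) + slbp_value

def lflbp_histogram(lfalbp_map, slbp_map, p, S):
    """Histogram of combined patterns, with one bin per (p + S)-bit value."""
    hist = [0] * (2 ** (p + S))
    for lf_row, sp_row in zip(lfalbp_map, slbp_map):
        for lf, sp in zip(lf_row, sp_row):
            hist[lflbp_sample(lf, sp, p)] += 1
    return hist

# Hypothetical patterns: angular value 12 (4 views) and spatial value 200 (p = 8).
combined = lflbp_sample(lfalbp_value=12, slbp_value=200, p=8)
```

Here 12 is shifted to the upper four bits and 200 occupies the lower eight, so the combined value is 12 × 256 + 200 = 3272.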

Figure 5.5: Proposed spatial and angular descriptors combination framework.

5.3 Face and Ear Recognition Based on Light Field Histogram of Gradients Descriptor

This novel recognition solution is based on a new hand-crafted light field descriptor, named Light

Field Histogram of Gradients (LFHG), fusing a non-light field based descriptor, the so-called

Histogram of Oriented Gradients (HOG), with a light field based descriptor, called Light Field

Histogram of Disparity Gradients (LFHDG), each having a specific and complementary function

in the resulting fused descriptor. The fused hand-crafted descriptor considers both the orientation

and magnitude variations for the spatial and angular information. Compared to the LFLBP [37]

descriptor proposed in Section 5.2, which only captures the magnitude sign, the descriptor proposed

in this section offers a more comprehensive spatio-angular description. As expected, it boosts the

final recognition performance, as described in detail in this section.


The architecture of the proposed solution, based on the fusion of HOG and LFHDG hand-crafted

descriptors, for face and ear recognition is shown in Figure 5.6. By exploiting both the orientation

and magnitude variations for the spatial and angular information available in a light field image,

the proposed solution is expected to improve the recognition performance over conventional

spatial-only face and ear recognition solutions.

The proposed face and ear recognition solution includes the following main steps:

1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view

array of SA images (Section 2.4). Then, the face and ear regions in all SA images are cropped and

resized to 128×128 and 192×128 pixels, respectively. Additionally, left ear images are flipped

horizontally, so that all ear images to be further processed have the same orientation.

2. HOG and LFHDG feature descriptions: The HOG and LFHDG (Section 5.3.2) descriptions

are respectively extracted from the central SA image and the normalized multi-view SA array,

to capture the multi-view information available in the light field image.

3. Fusion of descriptions: The extracted HOG and LFHDG descriptors are concatenated,

resulting in the fused LFHG descriptor.

4. Offline training: The fused LFHG descriptions obtained in the previous step, extracted from

the training samples, highlighted in red in Figure 5.6, are fed to a linear SVM classifier

(implemented using the LIBSVM library [187]) to define the classification model. The training

data should be selected based on the test protocol considered.

5. Classification: The fused LFHG descriptions extracted from testing samples are fed to the

previously trained SVM classifier to be compared to the classification model, thus determining

the subject identity.

Figure 5.6: Architecture of the proposed face and ear recognition solution based on the

fused LFHG hand-crafted descriptor.

5.3.2 Light Field Histogram of Disparity Gradients Feature Description

The Histogram of Oriented Gradients (HOG) hand-crafted descriptor [149] is a widely used, non-

light field based local texture descriptor, able to represent spatial orientation and gradient

variations. It has been successfully applied in several computer vision problems, such as pedestrian

detection [149], face recognition [119] and ear recognition [5], [176], [174]. The Light Field

Histogram of Disparity Gradients (LFHDG) descriptor, an extension of HOG, targets the

description of the light field disparity variations. This Thesis proposes to fuse the HOG and

LFHDG descriptors, forming Light Field Histogram of Gradients (LFHG), for exploiting the light

field variations, both in terms of position and direction, thus obtaining an improved face and ear

recognition descriptor.

The HOG descriptor computation follows the implementation in [149] and the tuned parameter

settings for face and ear recognition proposed in [119] and [5]. It is applied to the central SA image

to capture the texture information available in that view.

The proposed LFHDG descriptor processing steps are:

1. Gradient computation: Horizontal and vertical disparity gradients, Gx (x,y) and Gy (x,y), for

a given (x,y) sample are computed as:

$$\begin{cases}G_x(x,y)=L(u_1,v_1,x,y)-L(u_2,v_2,x,y)\\ G_y(x,y)=L(u_3,v_3,x,y)-L(u_4,v_4,x,y)\end{cases} \tag{5.9}$$

where (u1, v1), (u2, v2), (u3, v3) and (u4, v4) correspond to the specific selected SA images.

2. Disparity gradient magnitude and orientation computation: The disparity gradient

magnitude, |∇I (x, y)| and orientation, θ(x,y), for each (x,y) sample, are computed according to

Equations 5.10 and 5.11:

$$|\nabla I(x,y)|=\sqrt{G_x(x,y)^2+G_y(x,y)^2} \tag{5.10}$$

$$\theta(x,y)=\arctan\left(\frac{G_x(x,y)}{G_y(x,y)}\right) \tag{5.11}$$

3. Cell histogram computation: The computed disparity gradient magnitude and orientation

maps are divided into non-overlapping 8×8 cells. Gradient orientation values for all (x,y)

samples (in the range 0°-180°), in each cell, are quantized into 9 bins; instead of counting how

many times a quantized orientation occurs in the cell, the gradient magnitude of each sample

is added to the bin closest to its orientation, forming a local histogram for the cell.

4. Block normalization: To make the descriptor image contrast independent [149], cells are

grouped into blocks of 2×2 cells and the histograms of the 4 cells concatenated. Adjacent

blocks are made to overlap, with each cell being shared by four blocks (see Figure 5.7 for an

ear sample), meaning that each local cell histogram contributes more than once to the final

LFHDG description. Finally, each block histogram is normalized with respect to its Euclidean

norm [149].

5. Block histogram concatenation: All normalized block histograms are concatenated to create

the LFHDG description.

Figure 5.7: Division of an ear sample disparity magnitude map into 8×8 sample cells and

overlapping 2×2 cell blocks.
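The five LFHDG processing steps can be summarized by the following Python sketch. It is not the implementation used in this Thesis: the function name, the hard assignment of each sample to one of 9 equal 20° bins, the use of atan2 to avoid divisions by zero, and the small constant added to the norm are assumptions made for illustration. The four input views stand for the SA images (u1,v1) to (u4,v4) of Equation 5.9.

```python
import math

def lfhdg(view_a, view_b, view_c, view_d, cell=8, bins=9, eps=1e-9):
    """Sketch of the five LFHDG steps for four equally sized gray-level views."""
    rows, cols = len(view_a), len(view_a[0])
    n_cy, n_cx = rows // cell, cols // cell
    # Steps 1-3: disparity gradients, magnitude/orientation, per-cell histograms.
    cells = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            gx = view_a[y][x] - view_b[y][x]        # Equation 5.9
            gy = view_c[y][x] - view_d[y][x]
            magnitude = math.hypot(gx, gy)          # Equation 5.10
            # Equation 5.11 with the argument order used in the text (Gx over
            # Gy), folded into [0, 180) degrees.
            theta = math.degrees(math.atan2(gx, gy)) % 180.0
            b = min(int(theta / (180.0 / bins)), bins - 1)
            cells[y // cell][x // cell][b] += magnitude
    # Steps 4-5: overlapping 2x2-cell blocks, L2 normalization, concatenation.
    descriptor = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            block = (cells[cy][cx] + cells[cy][cx + 1]
                     + cells[cy + 1][cx] + cells[cy + 1][cx + 1])
            norm = math.sqrt(sum(v * v for v in block)) + eps
            descriptor.extend(v / norm for v in block)
    return descriptor

def flat(value, size=24):
    """Hypothetical constant-valued view used only to exercise the sketch."""
    return [[value] * size for _ in range(size)]

example = lfhdg(flat(5), flat(2), flat(4), flat(0))
```

Under these assumptions, a 24×24 input yields 3×3 cells and 2×2 overlapping blocks, so the description has 4 × 36 = 144 elements.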


In summary, the fused LFHG descriptor expresses both the orientation and magnitude variations

for the spatial and angular information.

5.4 Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor

As discussed in Section 3.4.2.6, contrary to face recognition, the current CNN networks such as

SqueezeNet [144], AlexNet [142], and VGG-16 network [146] may not achieve superior

performance over conventional solutions for the ear recognition task. This is probably due to the

lack of a sufficient number of available training samples to let the deep networks learn good ear

representations, having a large impact on the recognition performance. Hence, the deep learning

based solutions proposed in the Thesis, presented in this and the next two sections, are optimized

for the face recognition task, although they might also be applied to the ear recognition task once

large-scale ear databases become available, allowing better deep ear classification models to be

obtained.

The previous two sections proposed hand-crafted light field description based recognition solutions;

this section proposes, for the first time, a light field face recognition solution based on deep

learning, using the so-called VGG 2D+Disparity+Depth (VGG-D3) fused deep descriptor. The VGG-D3

description is obtained by the feature level fusion of deep descriptions extracted from 2D images

as well as the corresponding disparity and depth maps, using a VGG-Face descriptor [38]. The

VGG-Face descriptor, pre-trained over 2.6 million face images, is computed based on a VGG-16

network, ignoring the last two fully connected layers in the architecture to extract a description with 4096

elements.

The exploitation of disparity maps together with 2D images and depth maps, in the context of a

fusion scheme, is a novel approach never tried in the literature, acknowledging that disparity and

depth maps may bring some complementary information to the recognition task. It is well-known

that a depth map may be computed from disparity information and the camera intrinsic parameters,

thus being rather equivalent information, even if they visually express different features.

Moreover, if disparity and depth are extracted with independent algorithms and not directly

computed from each other, it is very likely that they partly compensate for each other's algorithmic

weaknesses. The implication is that disparity and depth maps may not necessarily provide exactly

the same visual information for face recognition. A disparity map can represent relevant facial

information such as the position and shape of shadows, changes in contrast and contrast gradient

among observation viewpoints, and defocus blur, which may not be equally expressed by a depth

map. On the other hand, geometric information about the position and shape of face components

may be better represented by a depth map. Hence, disparity and depth maps may express visually

complementary information, and jointly exploiting them may contribute to improve face

recognition performance.
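The statement that depth may be computed from disparity and the camera intrinsic parameters can be illustrated with the standard two-view pinhole relation Z = f·B/d. This relation and the parameter names below are assumptions made for illustration; the text does not specify the camera model, and the extraction methods in [188]-[190] are not reproduced here.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Pinhole two-view relation Z = f * B / d (assumed model, for illustration)."""
    if disparity_px == 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_length_px * baseline_m / disparity_px

# Hypothetical values: 2-pixel disparity, 1000-pixel focal length, 1 mm baseline.
depth_m = depth_from_disparity(2.0, 1000.0, 0.001)
```

Because depth is inversely proportional to disparity, the two maps encode the same geometry on different scales, which is consistent with the remark that they are rather equivalent yet visually different.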


The architecture of the proposed face recognition solution based on the VGG-D3 fused deep

descriptor, is presented in Figure 5.8. It takes as input a raw light field face image to create a multi-

view SA array. The face region is cropped and then resized to 224×224 pixels in all SA images,


as this is the input size expected by the VGG-Face descriptor [38]. This work uses the VGG-Face

descriptor that can be directly used to extract descriptions from the 2D central view. However, for

the disparity and depth maps extracted from the light field multi-view SA array, the VGG-16

network [146] needs to be retrained, to fine-tune the pre-trained model to perform well when

disparity or depth maps are taken as input, instead of regular 2D images. Once all models are

available, VGG-Face descriptions are extracted from the three types of data inputs, then these

descriptions are concatenated to form the VGG-D3 description, which is then passed to an SVM

classifier. By fusing the descriptions extracted from the 2D central SA view as well as the

corresponding disparity and depth maps, the proposed solution exploits the complementary

information available in the light field image. The VGG-D3 fused deep description is expected to

improve the recognition performance over 2D and 2D+depth face recognition solutions.

The proposed face recognition solution includes the following steps:


1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view

array of SA images (Section 2.4). Then, the face region is cropped in all SA images, based on

the landmarks provided in the database, and the cropped SA images are resized to 224×224

pixels as this is the input size expected by the VGG-Face descriptor [38].

2. Disparity map extraction: A disparity map is extracted from the cropped multi-view SA

array, capturing the angular information available in the light field image. The light field

disparity map is extracted using the method proposed in [188] and [189], which computes the

disparity map as gradients of epipolar plane images.

3. Depth map extraction: A depth map is extracted from the cropped multi-view SA array,

providing geometric information about the position and shape of the facial components. The

depth map has been extracted using the method proposed in [190], which estimates multi-view

stereo correspondences and then optimizes them using graph cuts.

4. VGG-Face feature description: The pre-trained VGG-Face model, which is originally

trained for 2D face recognition, is independently fine-tuned using the training disparity and

depth map samples. Then, the 2D central view as well as the disparity and depth maps are fed

into three VGG-Face descriptor based deep learning networks to extract texture, disparity and

depth descriptions [38] – see details in Section 5.4.2.

5. Description level fusion: Description level fusion is adopted, concatenating the descriptions

extracted for each input into a single VGG-D3 fused deep description.

6. Offline training: The VGG-D3 fused deep descriptions extracted from the training samples,

highlighted in red in Figure 5.8, are fed to a linear SVM classifier (implemented using the

LIBSVM library [187]) to define the classification model. The training data should be selected

based on the test protocol considered.

7. Classification: The VGG-D3 fused deep description extracted from testing samples is fed to

the previously trained SVM classifier, to be compared to the classification model, thus returning

a subject identification. This work also tested the performance of a softmax classifier; SVM

performed slightly better than softmax (less than 1% improvement), which justifies the choice

of SVM as the final classifier.
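The description-level fusion of step 5 amounts to concatenating the three 4096-element descriptions. The following Python sketch illustrates this; the function name and the placeholder vectors are hypothetical, and the subsequent SVM training with LIBSVM is not shown.

```python
def fuse_descriptions(texture, disparity, depth):
    """Description-level fusion: plain concatenation of the three descriptions."""
    for name, vec in (("texture", texture), ("disparity", disparity), ("depth", depth)):
        if len(vec) != 4096:
            raise ValueError(f"{name} description must have 4096 elements")
    return list(texture) + list(disparity) + list(depth)

# Placeholder vectors standing in for the three extracted deep descriptions.
texture = [0.1] * 4096
disparity = [0.2] * 4096
depth = [0.3] * 4096
vgg_d3 = fuse_descriptions(texture, disparity, depth)
```

Under this reading, the fused VGG-D3 description passed to the classifier has 3 × 4096 = 12288 elements.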


Figure 5.8: Architecture of the proposed face recognition solution based on a

2D+disparity+depth fused deep descriptor.

5.4.2 VGG-Face Feature Description

The VGG-16 network is one of the top performing convolutional network architectures for several

visual recognition tasks [146]. The VGG-Face descriptor [38], running the VGG-16 network

without the last two fully connected layers, has been trained over 2.6 million face images, covering

rich variations in expression, pose, occlusion, and illumination to obtain a so-called pre-trained

VGG-Face model [38]. The pre-trained model can then be used to extract descriptions from 2D

face images for face recognition. As the VGG-Face descriptor is originally trained with 2D images,

it may not be suitable for describing disparity and depth information for face recognition. For the

purposes of this Thesis, the pre-trained VGG-Face model is fine-tuned considering disparity and

depth maps at the input and back-propagating the loss function results through the VGG-16

network layers. During the fine-tuning, the pre-trained weights of the convolutional layers are

frozen and kept unchanged, while the fully-connected layers are re-trained. Considering some

empirical studies and memory constraints, the fine-tuning for both disparity and depth maps has

been done using a learning rate of 0.005, a batch size of 32, and a total of 30 epochs. The pre-

trained VGG-Face model has been used for the 2D images, and the fine-tuned models have been

used for the disparity and depth maps, resulting in a so-called fully connected layer 6 (FC6)

description, with a total of 4096 elements for each input.

5.4.3 Fusion Strategies Comparison

The proposed face recognition solution based on the VGG-D3 fused deep descriptor processes the

2D central view as well as the corresponding disparity and depth maps. In order to study the other

possible descriptor combinations and the effectiveness of the proposed fusion strategy, Table 5.1

reports the rank 1 recognition rate performance (RR1) when considering only the 2D VGG

descriptor and several alternative fusion strategies: i) 2D + disparity; ii) 2D + depth; and iii) 2D +

disparity + depth. The recognition results are presented for a cross-session face recognition

scenario, this means training and testing the recognition system using the samples captured in

different sessions.


Table 5.1: Face rank-1 recognition performance for the 2D baseline descriptor and three

alternative fusion strategies (best results in bold).

| Solution | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average |
|---|---|---|---|---|---|---|
| 2D | 99.5% | 100% | 94.6% | 98.5% | 94.6% | 96.8% |
| 2D+disparity | 99.5% | 99.0% | 94.8% | 99.0% | 94.8% | 97.0% |
| 2D+depth | 99.5% | 100% | 95.6% | 99.5% | 95.6% | 97.7% |
| 2D+disparity+depth | 99.5% | 100% | 95.8% | 100% | 95.8% | 98.1% |

The obtained recognition results show that the proposed 2D+disparity+depth fusion strategy

achieves the best (or tied-best) performance for all the considered recognition tasks. Since

the fusion of the VGG descriptions for the 2D image with those for the disparity and depth maps

best exploits the complementary information available in the light field, thus increasing the

discriminative power of the fused descriptor, this is the VGG-16 based recognition solution

adopted in this section.
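The comparison in Table 5.1 can be checked with a few lines of Python. The values are copied from the table; the variable names are hypothetical and the sketch only verifies, column by column, which solutions reach the highest rank-1 recognition rate.

```python
columns = ["Neutral & Emotion", "Action", "Pose", "Illumination", "Occlusion", "Average"]
rr1 = {
    "2D": [99.5, 100.0, 94.6, 98.5, 94.6, 96.8],
    "2D+disparity": [99.5, 99.0, 94.8, 99.0, 94.8, 97.0],
    "2D+depth": [99.5, 100.0, 95.6, 99.5, 95.6, 97.7],
    "2D+disparity+depth": [99.5, 100.0, 95.8, 100.0, 95.8, 98.1],
}
proposed = "2D+disparity+depth"

# For each column, list every solution reaching the top value (ties included).
best = {}
for i, col in enumerate(columns):
    top = max(values[i] for values in rr1.values())
    best[col] = [name for name, values in rr1.items() if values[i] == top]

proposed_always_best = all(proposed in names for names in best.values())
average_gain = round(rr1[proposed][-1] - rr1["2D"][-1], 1)
```

The check shows that the three-input fusion is among the best solutions in every column, being the single best for Pose, Illumination, Occlusion and Average, with a gain of 1.3 percentage points in average RR1 over the 2D baseline.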

5.5 Face Recognition Based on a VGG + Conventional LSTM Double-Deep Descriptor

The proposed solution described in the previous section processes only light field central view

data, notably using its rendered texture, and corresponding disparity and depth maps, using a CNN

network. This Thesis also proposes a double-deep spatio-angular learning framework/descriptor

adopting a conventional long short-term memory (LSTM) recurrent network to extract higher

dimensional angular dependencies from different viewpoints rendered from a full light field image,

thus offering a more powerful double-deep spatio-angular description for light field face

recognition. The double-deep descriptor for light field based face recognition proposed in this

Thesis is based on the combination of a VGG-Face descriptor with a Conventional LSTM (Conv-

LSTM) recurrent deep network [191] [192]. This novel descriptor combines the spatial information

learned using a VGG-Face descriptor with the angular dynamics available in a light field image

that are learned using a Conv-LSTM deep recurrent neural network.

While the combination of VGG and Conv-LSTM has recently been used to learn spatio-temporal

information for visual classification and description tasks, including action recognition [193],

facial expression classification [194], or image captioning and video description [195], this

combination has never been proposed to exploit the multi-view information from a single temporal

instant, as performed by the double-deep spatio-angular learning description proposed in this

Thesis; this novel approach of successively processing views within a light field image instead of

a sequence of frames along time has never been tried before for face recognition or any other visual

recognition task.

In the proposed framework, a VGG-Face descriptor is employed to capture 2D information from

multiple SA images, thus extracting high-level spatial/textural descriptions. Next, a Conv-LSTM

network exploits the angular dynamics by learning from the spatial descriptions previously

extracted for slightly different viewpoints. Hence, the proposed double-deep VGG + Conv-LSTM

combination can be very powerful to jointly exploit the spatio-angular information available in

light field images to boost face recognition performance.



The proposed double-deep VGG + Conv-LSTM framework architecture is presented in Figure 5.9.

Figure 5.9: Architecture of the proposed face recognition solution based on VGG + Conv-LSTM

double-deep descriptor.

The proposed face recognition solution/framework based on double-deep VGG + Conv-LSTM

descriptor includes the following main modules:





2. SA images selection and scanning: This module successively scans a selected sub-set of the

SA images into a SA image pseudo-video sequence, as described in Section 5.5.2.

3. VGG-Face spatial description: Each selected SA image is fed into a pre-trained VGG-Face

descriptor, trained with totally different content from the test material used in this Thesis, to

extract a spatial description containing 4096 elements, as described in Section 5.4.2. Since a

pre-trained model has been used, no additional learning/fine-tuning has been performed for

this specific purpose.

4. LSTM angular description: The extracted spatial deep descriptions are passed to an LSTM

network composed of conventional LSTM (Conv-LSTM) cells with peephole connections, to

learn angular dependencies across the selected SA viewpoints and then extract double-deep

descriptions for classification, as described in Section 5.5.3.

5. Offline training: The set of double-deep description outputs from the LSTM gates, extracted

from the training samples, highlighted in red in Figure 5.9 to distinguish them from the testing

description outputs, is used as input to a softmax classifier to create a classification model. The

training data should be selected based on the test protocol considered.

6. Classification: The set of double-deep descriptor outputs from the LSTM gates, extracted from

testing samples is used as input to the previously trained softmax classifier to be compared to

the classification model. Then, the average of the classification probabilities across the

rendered SA images, selected from the light field image, is used to predict the most probable

label and, thus, the final output, as described in Section 5.5.4.
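The averaging of classification probabilities in the last module can be sketched in Python as follows. The function names and the example probabilities are hypothetical; each inner list stands for the softmax output obtained for one scanned SA image.

```python
def average_probabilities(per_view_probs):
    """Average the softmax outputs obtained for each scanned SA image."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [sum(view[c] for view in per_view_probs) / n_views
            for c in range(n_classes)]

def predict_label(per_view_probs):
    """Return the index of the class with the highest average probability."""
    mean = average_probabilities(per_view_probs)
    return max(range(len(mean)), key=mean.__getitem__)

# Hypothetical softmax outputs for three SA images and three enrolled subjects.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
]
label = predict_label(probs)
```

In this example, the first view alone would favour subject 0, but the average over the three views favours subject 1, showing how the averaging smooths out single-view errors.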

5.5.2 SA Images Selection and Scanning

The pre-processed multi-view SA array contains 15×15 rendered 2D SA images. A representative

subset of SA images is selected for processing by the VGG-Face descriptor, and then scanned as

a pseudo-video sequence, so that their angular dynamics can be learned by the conventional LSTM

network. Different methods can be considered to select and scan the sequence of representative

SA images, notably varying in their number, position and scanning order. It is again worth

mentioning that since the Lytro ILLUM lenslet camera microlens shape is hexagonal, the SA

image positions highlighted in dark grey in Figure 5.10 do not contain usable information, thus

being ignored in the selection process. To consider different solutions in terms of number of views,

thus impacting complexity, and positioning, thus impacting the amount of disparity, the following

SA images selection topologies have been defined:

1. High-density SA images selection: This SA topology considers a rather large number of SA

images from the multi-view array, as illustrated in Figure 5.10.a, where the selected SA images

are highlighted in red. To arrange the selected SA images into a sequence, two different

scanning orders are proposed: i) row-major scanning, which concatenates SA images one row

after another, from left to right, as illustrated in Figure 5.10.b; and ii) snake-like scanning,

which also progresses row-wise, but the rows are alternately scanned from left to right and

right to left, as illustrated in Figure 5.10.c.

2. Max-disparity SA images selection: This selection topology considers those SA images

corresponding to the multi-view array's borders, thus considering the SA images for which the

viewpoint changes the most and, thus, have the maximum disparity, as illustrated in Figure

5.10.d. Some of the selected SA images may not be of the highest quality, due to the vignetting

effect.

3. Mid-density SA images selection: In this case, the selected SA images capture horizontal,

vertical, and both horizontal and vertical parallaxes. The SA images considered are: i) middle

row – see Figure 5.10.e; ii) middle column – see Figure 5.10.f; and iii) combination of middle

row and middle column – see Figure 5.10.g. There are two possible ways to combine the

horizontal and vertical angular information for the topology in Figure 5.10.g: i) scanning the

horizontal SA images followed by the vertical ones; or ii) processing each direction separately

and then applying score-level fusion, by adding the LSTM softmax classifier outputs obtained

for the horizontal and vertical SA images, as illustrated in Figure 5.11. As will be seen later,

the performances for these two approaches may be rather different.


Figure 5.10: (a) High-density SA images selection topology; (b) row-major scanning order; (c)

snake-like scanning order; (d) max-disparity SA images selection topology; (e) mid-density

horizontal SA images selection topology; (f) mid-density vertical SA images selection.

Figure 5.11: Score-level fusion for combining the horizontal and vertical angular information.


4. Low-density SA images selection: Exploiting spatio-angular dynamics for a considerable

number of SA images may not always be the best option, as this requires considerable

computational power and memory resources. Thus, a low-density sampling of the SA images

is also considered. Since results in [36] and [37] show a clear performance improvement for

light field based face recognition and presentation attack detection as the SA images’ disparity

increases, the central view SA image along with two SA images at maximum horizontal and

vertical disparities from the central view are selected, as illustrated in Figure 5.10.h and Figure

5.10.i. Figure 5.10.j shows the selection of both these horizontal and vertical SA images, for

which the two combination approaches described above may be applied.
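The two scanning orders and the score-level fusion described above can be sketched in Python as follows. The (row, column) indexing of SA image positions, the function names and the example scores are assumptions made for illustration; the exact sets of selected positions are those shown in Figure 5.10 and are not reproduced here.

```python
def row_major_scan(positions):
    """Row after row, each row scanned from left to right."""
    return sorted(positions, key=lambda rc: (rc[0], rc[1]))

def snake_scan(positions):
    """Row-wise scan alternating left-to-right and right-to-left."""
    rows = sorted({r for r, _ in positions})
    order = []
    for n, r in enumerate(rows):
        cols = sorted(c for rr, c in positions if rr == r)
        if n % 2 == 1:
            cols.reverse()  # odd rows are traversed right to left
        order.extend((r, c) for c in cols)
    return order

def score_level_fusion(scores_horizontal, scores_vertical):
    """Add the softmax scores obtained for the two scanning directions."""
    return [h + v for h, v in zip(scores_horizontal, scores_vertical)]

# Hypothetical 3x3 selection of SA image positions.
positions = [(r, c) for r in range(3) for c in range(3)]
snake = snake_scan(positions)

# Hypothetical two-class scores for the horizontal and vertical sequences.
fused = score_level_fusion([0.2, 0.8], [0.6, 0.4])
```

Snake-like scanning keeps consecutive SA images spatially adjacent at the end of each row, whereas row-major scanning jumps back to the left border, which changes the viewpoint sequence seen by the LSTM network.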

5.5.3 LSTM Angular Description

The VGG-Face descriptor only deals with spatial information within a 2D image. However, for a

multi-view array of rendered 2D SA images, it is possible to additionally exploit the angular

information available in the light field image to improve the face recognition performance.

Recurrent neural networks (RNN) can be used to extract higher dimensional dependencies from

sequential data. The RNN units, called cells, have connections not only between the subsequent

layers, but also into themselves, to keep information from previous inputs. To train a RNN, the so-

called back-propagation through time algorithm can be used [196]. Traditional RNNs can easily

learn short-term dependencies; however, they have difficulty learning long-term dynamics due to

the vanishing and exploding gradient problems [197].

The Long Short-Term Memory (LSTM) is a type of RNN addressing the vanishing and exploding

gradient problems by learning both long- and short-term dependencies [191] [192]. LSTM has

recently achieved impressive results on many large-scale learning tasks, such as speech recognition

[198], language translation [199], activity recognition [193], facial expression classification [194],

and image captioning and video description [195]. Therefore, LSTM based networks are now

widely used in many cutting-edge applications, notably Google Translate, Facebook, Siri or

Amazon's Alexa.

A LSTM network is composed of cells whose outputs evolve through the network based on past

memory content. Since the introduction of LSTM in 1997 [191], the conventional LSTM (Conv-LSTM)

with peephole connections has been the most commonly used cell architecture for visual

analysis tasks [195]. Figure 5.12 illustrates the architecture of a Conv-LSTM cell with peephole

connections, which are connections from the previous cell state to the gates, denoted by a dashed

line in Figure 5.12. The cells have a common cell state, which keeps long-term dependencies along

the entire chain of LSTM cells, controlled by two gates, the so-called input and forget gates, thus

allowing the network to decide when to forget the previous state or update the current state given

new information. The output of each cell, the hidden state, is controlled by an output gate, allowing

the cell to compute its output given the updated cell state.

Figure 5.12: Architecture of a Conv-LSTM cell with peephole connections (indicated by a

dashed line).

A Conv-LSTM cell can be mathematically formulated as follows:

For a descriptor sample Si, belonging to a descriptor sequence, derived from an image of the multi-

view sequence, the output of the input gate, Ii, is computed as in Equation 5.12, based on the

sample value, the previous hidden state hi-1, and the previous cell state Ci-1 (for the peephole LSTM

cell architecture):

$$ I_i = \sigma\left( W_I [S_i + h_{i-1} + C_{i-1}] + b_I \right) \tag{5.12} $$

where 𝑊𝐼 is the input gate weight and bI is the input gate bias. Each gate is controlled by a sigmoid

activation function, defining the output of the gate, as formulated by Equation 5.13, bounding its

output to a [0,1] range:

$$ \sigma(x) = \left( 1 + e^{-x} \right)^{-1} \tag{5.13} $$

Equation 5.14 creates a vector of new cell state candidate values, $\tilde{C}_i$, that may be added to the

cell state later:

$$ \tilde{C}_i = \tanh\left( W_C [S_i + h_{i-1} + C_{i-1}] + b_C \right) \tag{5.14} $$

where $W_C$ is the vector of new candidate values weights and $b_C$ is the vector of new candidate

values biases. The hyperbolic tangent activation function, tanh, is used to create the vector of

candidate values, $\tilde{C}_i$, and is defined as:

$$ \tanh(x) = 2\sigma(2x) - 1 \tag{5.15} $$
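
The identity in Equation 5.15 follows directly from the sigmoid definition in Equation 5.13. A short derivation, added here only as a worked check, multiplies the numerator and denominator by $e^{x}$ in the last step:

```latex
\begin{aligned}
2\sigma(2x) - 1 &= \frac{2}{1 + e^{-2x}} - 1 = \frac{1 - e^{-2x}}{1 + e^{-2x}} \\
                &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x)
\end{aligned}
```

Since $\sigma$ is bounded to [0,1], this also shows that the candidate values lie in [-1,1].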

The output of the forget gate, Fi, is defined as in Equation 5.16, and defines what information

should be removed from the cell state:

$$ F_i = \sigma\left( W_F [S_i + h_{i-1} + C_{i-1}] + b_F \right) \tag{5.16} $$

where 𝑊𝐹 is the forget gate weight and bF is the forget gate bias.

Then, based on $I_i$, $F_i$, and $\tilde{C}_i$, the previous cell state, $C_{i-1}$, is updated to obtain $C_i$ as follows:

$$ C_i = F_i \odot C_{i-1} + I_i \odot \tilde{C}_i \tag{5.17} $$

where $\odot$ denotes the vector element-wise product. As the output values for $I_i$ and $F_i$ lie in the range

[0,1], the LSTM selectively learns to consider or forget the current input and the previous state.

The current cell state, Ci, can then be used for predicting the current cell’s hidden state, hi,

according to Equations 5.18 and 5.19, thus allowing the LSTM to learn how much from the cell

memory should be included into the hidden state.

$$ O_i = \sigma\left( W_O [S_i + h_{i-1} + C_{i-1}] + b_O \right) \tag{5.18} $$

$$ h_i = O_i \odot \tanh(C_i) \tag{5.19} $$

where 𝑊𝑂 is the output gate weight and bO is the output gate bias. The hidden state, hi, is the cell

output for the descriptor sample Si which is passed as input to the next LSTM cell in the LSTM

network architecture, which is composed of LSTM cells sequentially connected together. As

shown in Figure 5.9, the adopted LSTM network includes one Conv-LSTM cell for each selected

SA image in the pseudo-video sequence. Based on the selected scanning order, the deep spatial

descriptions extracted from the SA images are passed to the corresponding LSTM cell. The output

of each LSTM cell, corresponding to its hidden state, describes the short-term and long-term

angular dependencies captured so far.
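
The cell update in Equations 5.12 to 5.19 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this Thesis: the bracket [S_i + h_{i-1} + C_{i-1}] is read literally as an element-wise sum, which forces the descriptor and the states to share one size (practical implementations usually concatenate the inputs or use separate weight matrices); the function names, the parameter dictionary keys, and the zero initial states are all hypothetical.

```python
import math

def sigmoid(x):
    # Eq. 5.13: squashes a gate pre-activation into [0, 1].
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    # Computes W x + b for a weight matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def conv_lstm_step(S, h_prev, C_prev, p):
    """One peephole LSTM step; p uses hypothetical keys W_I, b_I, W_C, ..."""
    # Gate input read literally from the bracket [S_i + h_{i-1} + C_{i-1}].
    x = [s + h + c for s, h, c in zip(S, h_prev, C_prev)]
    I = [sigmoid(v) for v in affine(p["W_I"], x, p["b_I"])]          # Eq. 5.12
    C_cand = [math.tanh(v) for v in affine(p["W_C"], x, p["b_C"])]   # Eq. 5.14
    F = [sigmoid(v) for v in affine(p["W_F"], x, p["b_F"])]          # Eq. 5.16
    C = [f * c + i * g for f, c, i, g in zip(F, C_prev, I, C_cand)]  # Eq. 5.17
    O = [sigmoid(v) for v in affine(p["W_O"], x, p["b_O"])]          # Eq. 5.18
    h = [o * math.tanh(c) for o, c in zip(O, C)]                     # Eq. 5.19
    return h, C

def run_sequence(descriptors, p):
    """Feeds one descriptor per SA image and collects every hidden state."""
    n = len(descriptors[0])
    h, C = [0.0] * n, [0.0] * n  # zero initial states (an assumption)
    hidden_states = []
    for S in descriptors:
        h, C = conv_lstm_step(S, h, C, p)
        hidden_states.append(h)
    return hidden_states

def zero_params(n):
    # All-zero parameters, only useful for checking the data flow by hand.
    p = {}
    for gate in ("I", "C", "F", "O"):
        p["W_" + gate] = [[0.0] * n for _ in range(n)]
        p["b_" + gate] = [0.0] * n
    return p
```

With all-zero parameters every sigmoid gate evaluates to 0.5 and the candidate values to 0, so one step simply halves the previous cell state; this makes the data flow easy to verify by hand before plugging in trained weights.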

The LSTM network has been trained with the MSE loss function and batch normalization [200] is

used to control the distributions of feedforward network activations. The model obtained from the

LSTM learning process can then be used for descriptor extraction during testing. Applying the

above structure to a SA image pseudo-video sequence (and not a sequence of images along time)

enables the LSTM to learn long short-term angular dynamics when using light field images for

face recognition; it offers a novel approach, never tried before for any visual recognition task.

LSTM has a number of hyper-parameters for network training whose optimization is of major

importance for the final recognition performance, notably

LSTM hidden layer size: This hyper-parameter controls the size of the hidden layer in the

LSTM units, which is also the size of each LSTM cell's output. A small hidden layer size

requires setting fewer parameters, but it may lead to underfitting. A larger hidden layer size

gives the network more capacity for convergence, while increasing the required training time.

However, a too large hidden layer size may result in overfitting, thus highlighting the

importance of appropriately adjusting the hidden layer size.

Batch size: The input data can be divided into a number of batches, each used for one round

of network weights update. There are two main advantages of training a deep learning network

using batches instead of the whole input data at once: i) decreasing the computational

complexity, increasing the parallelization ability and needing less memory; and ii) performing

a better training with stronger generalization ability as the network can escape from local

minima [201] [202]. Nevertheless, it should be noted that a high number of batches, i.e., small

batches, may lead to less accurate gradient estimation during the training/learning process.

Number of training epochs: One epoch is a full forward-backward pass of all training

samples through the network. Each epoch may consider a number of iterations, in case the

whole data is divided into batches. The number of epochs should be selected in such a way that

it guarantees network convergence within a reasonable training time.

The impact of the hyper-parameter settings on face recognition performance will be evaluated in

the experiments reported in Section 6.5.
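
The relation between batch size, iterations and epochs described above can be made concrete with a small sketch. The sample counts and batch sizes in the usage note are purely illustrative, not values used in this Thesis, and the helper names are hypothetical.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    # One iteration processes one batch; the last batch may be smaller.
    return math.ceil(num_samples / batch_size)

def total_weight_updates(num_samples, batch_size, num_epochs):
    # Each batch triggers one round of network weight updates.
    return iterations_per_epoch(num_samples, batch_size) * num_epochs

def batches(samples, batch_size):
    # Yields consecutive slices of the training data.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

For example, 100 training samples with a batch size of 32 give 4 iterations per epoch (three full batches and one of 4 samples), so 10 epochs correspond to 40 weight updates. Halving the batch size doubles the number of updates per epoch, at the price of noisier gradient estimates.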

5.5.4 Softmax Classification

The output (hidden state) of each Conv-LSTM cell is used as input to a softmax classifier and

includes: i) the short-term dependencies, corresponding to the recently observed viewpoint

changes; and ii) the long-term dependencies, corresponding to all the viewpoint changes observed so

far. Then, the average of the classification probabilities

across the rendered SA images, selected from the light field image under consideration, predicts

the most probable label and, thus, the final output. The averaging mechanism, which has been

widely used in the literature in the context of spatio-temporal frameworks for visual recognition

tasks [195], considers all LSTM hidden states, thus exploiting both the full short-term and long-

term angular dependencies; this approach offers a comprehensive angular description for visual

recognition. The alternative mechanism of only using the output of the last LSTM cell [194], thus

considering long-term dependencies and short-term dependency corresponding to only the last

LSTM cell, ignoring the other hidden states, may not exploit the full angular dependencies, thus

offering a slightly lower performance than the former mechanism.
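
The two decision mechanisms compared above, averaging over all cells versus using only the last cell, can be sketched as follows. This is a minimal illustration assuming that each hidden state has already been mapped to one score per class by the linear layer of the softmax classifier; the function names and the toy scores are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one vector of class scores.
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_fusion(per_view_scores):
    """Averages the class probabilities over all LSTM cells (SA images)."""
    probs = [softmax(s) for s in per_view_scores]
    n_views, n_classes = len(probs), len(probs[0])
    mean = [sum(p[k] for p in probs) / n_views for k in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean

def last_cell_only(per_view_scores):
    """Alternative mechanism: uses only the output of the last LSTM cell."""
    p = softmax(per_view_scores[-1])
    return max(range(len(p)), key=p.__getitem__), p
```

With the toy scores `[[2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]`, averaging selects class 0 because two of the three views support it strongly, whereas the last-cell mechanism selects class 1. The example only shows that the two mechanisms can disagree; it says nothing about which one is more accurate on real data.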

5.6 Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors

As discussed above, a conventional LSTM network can be used to learn the available angular

information from the multiple viewpoints included in a light field image to provide richer

descriptions for visual analysis tasks. In order to capture both the horizontal and vertical angular

information, one possibility is to scan the horizontal SA images followed by the vertical ones, thus

creating a single descriptor sequence to be used as LSTM input, as this can represent the

viewpoint changes along different directions. However, this sequential descriptor concatenation

implies a viewpoint descriptor discontinuity when moving from the last horizontal SA image

position to the first vertical one, which may lead to a degraded learning performance. Additionally,

dealing with angular information as a single pseudo-video sequence ignores the additional angular

information/dependencies, such as parallax, that could be further exploited during the

training/learning process to increase recognition accuracy.

This Thesis also proposes three light field LSTM cell architectures which have been integrated

(naturally, one at a time) in a double-deep learning framework for face recognition, whose inputs

come from a VGG-Face descriptor applied to the set of horizontal and vertical 2D face SA image

sequences derived from a light field image.


This Thesis proposes three novel light field LSTM cell architectures able to jointly learn light field

horizontal and vertical dynamics and, thus, to provide highly discriminative double-deep

descriptions for spatio-angular based face recognition tasks. The differences between this VGG +

light field LSTM framework represented in Figure 5.13 and the VGG + conventional LSTM

framework, presented in Section 5.5.1 (see Figure 5.9), are twofold: i) the Conv-LSTM cell

architecture in the basic framework is replaced by the new light field LSTM cell architectures

proposed here; and ii) in the VGG + conventional LSTM learning framework, different methods

to select the sequence of representative SA images were considered, notably varying in their

number, position and scanning order, whereas in the present solution only the middle row and the

middle column SA images are considered, as they can represent the essential light field information

coming from multiple directions. The proposed VGG + light field LSTM learning framework

architecture, adopting the proposed light field LSTM cell architectures, is represented in Figure

5.13.

Figure 5.13: Architecture of the proposed face recognition solution based on VGG + Light Field

LSTM double-deep descriptors.

The proposed face recognition solution/framework based on double-deep VGG + light field LSTM

descriptors includes the following main modules:





2. Horizontal and vertical SA image selection: This module independently scans the middle

row and the middle column SA images into two SA image pseudo-video sequences, each

including fifteen SA images (for the Lytro ILLUM lenslet camera used). These images

represent viewpoint changes along the horizontal and vertical directions and are thus expected to

capture light field information coming from multiple directions.

3. VGG-Face spatial description: Each selected SA image is fed to a VGG-Face descriptor, to

extract descriptions with a fixed length of 4096 elements (see Section 5.4.2). This work uses

the available VGG-Face model, with no additional training performed at this stage. It should

be noted that the training of the VGG-Face model has been done with totally different content

from the test material used in this Thesis.

4. Light Field LSTM angular description: The extracted spatial descriptions are provided to an

LSTM network including one of the newly proposed LSTM cell architectures (see Section

5.6.2), which jointly learn horizontal and vertical angular dependencies across the selected SA

viewpoints, extracting double-deep spatio-angular descriptions for classification. Naturally,

the number of LSTM cells in an LSTM network equals the number of samples in the input

sequence. It should be noted that the SeqL-LSTM cell architecture has been used in Figure

5.13 for illustration purposes.

5. Offline training: The set of double-deep description outputs from the light field LSTM gates,

extracted from training samples and highlighted in red in Figure 5.13, is used as input

to a softmax classifier to create a classification model. The training data should be selected

based on the test protocol considered.

6. Softmax classification: The set of double-deep description outputs from the light field LSTM

gates, extracted from testing samples, is used as input to the previously trained softmax

classifier to be compared to the classification model. Then, the average of the classification

probabilities across the rendered SA images, selected from the light field image, is used to

predict the most probable label and, thus, the final output; more details about the softmax

classification stage are provided in Section 5.5.4.
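
The horizontal and vertical SA image selection of module 2 can be sketched as below. This is a minimal sketch assuming the SA images are indexed as (row, column) positions in a 15 × 15 array, consistent with the fifteen images per sequence mentioned above; the left-to-right and top-to-bottom scanning orders and the function name are assumptions.

```python
def middle_row_and_column(grid_size=15):
    """Returns the (row, col) positions of the two SA image sequences."""
    mid = grid_size // 2
    # Horizontal sequence: middle row, scanned left to right (assumed order).
    horizontal = [(mid, col) for col in range(grid_size)]
    # Vertical sequence: middle column, scanned top to bottom (assumed order).
    vertical = [(row, mid) for row in range(grid_size)]
    return horizontal, vertical

# The i-th light field LSTM cell would receive the VGG-Face descriptors of
# horizontal[i] and vertical[i] as its H_i and V_i inputs.
```

Both sequences pass through the central view at index 7, so under this indexing the central SA image is the only one shared by the two sequences.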

In summary, the adoption of the proposed light field LSTM cell architectures in the context of a

double-deep spatio-angular based recognition framework can offer very powerful recognition

solutions, by exploiting both the spatial and combined horizontal and vertical angular information

available in light field images, leading to a boost in face recognition performance.

5.6.2 Light Field LSTM Angular Descriptors

This Thesis proposes three novel light field LSTM cell architectures able to jointly learn light field

horizontal and vertical dynamics and, thus, to provide highly discriminative double-deep

descriptions for spatio-angular based visual recognition tasks. The proposed architectures express

gate-level fusion, state-level fusion, and sequential learning schemes, as described in the

following.

5.6.2.1 Gate-Level Fusion LSTM Cell Architecture

The first proposed light field LSTM cell architecture, called Gate-Level Fusion LSTM (GLF-

LSTM), adopts a gate-level fusion scheme, separately learning horizontal and vertical forget, input

and output gates and then merging the horizontal and vertical gates’ outputs to compute the fused

gates’ output. As illustrated in Figure 5.14, the horizontal and vertical gates are, respectively,

computed based on the description samples Hi and Vi, respectively extracted from the horizontal

and vertical multi-view description sequences, the previous hidden state, hi-1, and the previous cell

state, Ci-1. Then, the fused gates are computed by adding the horizontal and vertical gates together.

The cell and hidden state outputs are controlled by the fused gates, thus providing a richer joint

information to learn light field angular dynamics.

Figure 5.14: Architecture of a GLF-LSTM cell.

Given inputs Hi, Vi, hi-1, and Ci-1, the GLF-LSTM cell architecture for view number i can be

formulated as:

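$$ HI_i = \sigma\left( W_{HI} [H_i + h_{i-1} + C_{i-1}] + b_{HI} \right) \tag{5.20} $$

$$ VI_i = \sigma\left( W_{VI} [V_i + h_{i-1} + C_{i-1}] + b_{VI} \right) \tag{5.21} $$

$$ I_i = [HI_i + VI_i] \tag{5.22} $$

$$ H\tilde{C}_i = \tanh\left( W_{H\tilde{C}} [H_i + h_{i-1} + C_{i-1}] + b_{H\tilde{C}} \right) \tag{5.23} $$

$$ V\tilde{C}_i = \tanh\left( W_{V\tilde{C}} [V_i + h_{i-1} + C_{i-1}] + b_{V\tilde{C}} \right) \tag{5.24} $$

$$ \tilde{C}_i = [H\tilde{C}_i + V\tilde{C}_i] \tag{5.25} $$

$$ HF_i = \sigma\left( W_{HF} [H_i + h_{i-1} + C_{i-1}] + b_{HF} \right) \tag{5.26} $$

$$ VF_i = \sigma\left( W_{VF} [V_i + h_{i-1} + C_{i-1}] + b_{VF} \right) \tag{5.27} $$

$$ F_i = [HF_i + VF_i] \tag{5.28} $$

$$ C_i = F_i \odot C_{i-1} + I_i \odot \tilde{C}_i \tag{5.29} $$

$$ HO_i = \sigma\left( W_{HO} [H_i + h_{i-1} + C_{i-1}] + b_{HO} \right) \tag{5.30} $$

$$ VO_i = \sigma\left( W_{VO} [V_i + h_{i-1} + C_{i-1}] + b_{VO} \right) \tag{5.31} $$

$$ O_i = [HO_i + VO_i] \tag{5.32} $$

$$ h_i = O_i \odot \tanh(C_i) \tag{5.33} $$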

where: i) $HI_i$, $HF_i$, $H\tilde{C}_i$, and $HO_i$ are, respectively, the horizontal input gate, forget gate, candidate

values and output gate; ii) $VI_i$, $VF_i$, $V\tilde{C}_i$, and $VO_i$ are, respectively, the vertical input gate, forget

gate, candidate values and output gate; iii) $W_{HI}$, $W_{HF}$, $W_{H\tilde{C}}$, and $W_{HO}$ are, respectively, the

horizontal input, forget, candidate values and output weights; iv) $W_{VI}$, $W_{VF}$, $W_{V\tilde{C}}$, and $W_{VO}$ are,

respectively, the vertical input, forget, candidate values and output weights; v) $b_{HI}$, $b_{HF}$, $b_{H\tilde{C}}$, and

$b_{HO}$ are, respectively, the horizontal input, forget, candidate values and output biases; and vi) $b_{VI}$,

$b_{VF}$, $b_{V\tilde{C}}$, and $b_{VO}$ are, respectively, the vertical input, forget, candidate values and output biases.

The GLF-LSTM cell architecture jointly learns light field horizontal and vertical dynamics, in the

form of fused gates composed by independent horizontal and vertical gates. The computation of

the horizontal and vertical input, forget, and output gates can be done in parallel, as the learning

of the $W_{HI}$, $W_{HF}$, $W_{H\tilde{C}}$, and $W_{HO}$ horizontal weights is independent from that of the $W_{VI}$, $W_{VF}$, $W_{V\tilde{C}}$, and

$W_{VO}$ vertical weights. Although this independence increases parallelism and, thus, may reduce the

computational time, it implies that the vertical and horizontal gates cannot establish a learning

interaction between themselves when optimizing learning weights for updating the cell state.
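
The gate-level fusion of Equations 5.20 to 5.33 can be sketched as follows. As before, this is a hedged illustration and not the implementation used in this Thesis: the bracketed gate inputs are read literally as element-wise sums, the vertical forget gate is assumed to use its own weight matrix, and the function names and parameter dictionary keys (`W_HI`, `b_VO`, and so on) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    # Computes W x + b for a weight matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def glf_lstm_step(H, V, h_prev, C_prev, p):
    """One GLF-LSTM step: horizontal and vertical gates are summed."""
    xh = [a + h + c for a, h, c in zip(H, h_prev, C_prev)]
    xv = [a + h + c for a, h, c in zip(V, h_prev, C_prev)]

    def fused(gate, act):
        # Independent horizontal and vertical gates, then gate-level fusion.
        hg = [act(v) for v in affine(p["W_H" + gate], xh, p["b_H" + gate])]
        vg = [act(v) for v in affine(p["W_V" + gate], xv, p["b_V" + gate])]
        return [a + b for a, b in zip(hg, vg)]

    I = fused("I", sigmoid)                                          # Eqs. 5.20-5.22
    C_cand = fused("C", math.tanh)                                   # Eqs. 5.23-5.25
    F = fused("F", sigmoid)                                          # Eqs. 5.26-5.28
    C = [f * c + i * g for f, c, i, g in zip(F, C_prev, I, C_cand)]  # Eq. 5.29
    O = fused("O", sigmoid)                                          # Eqs. 5.30-5.32
    h = [o * math.tanh(c) for o, c in zip(O, C)]                     # Eq. 5.33
    return h, C

def zero_params(n):
    # All-zero parameters, only useful for checking the data flow by hand.
    p = {}
    for direction in ("H", "V"):
        for gate in ("I", "C", "F", "O"):
            p["W_" + direction + gate] = [[0.0] * n for _ in range(n)]
            p["b_" + direction + gate] = [0.0] * n
    return p
```

One consequence of the formulation is visible in the sketch: because each fused gate is the sum of two sigmoid outputs, its values lie in [0,2] rather than [0,1]. With all-zero parameters the fused gates equal 1, so the cell state is passed through unchanged.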

5.6.2.2 State-Level Fusion LSTM Cell Architecture

The second proposed light field LSTM cell architecture, called State-Level Fusion LSTM (SLF-

LSTM), provides a state-level fusion scheme, separately learning the horizontal and vertical cell

and hidden states and then merging the horizontal and vertical state outputs to compute the fused

cell and hidden state outputs. As illustrated in Figure 5.15, the horizontal and vertical gates are,

respectively, computed based on the descriptor samples Hi and Vi, the previous hidden state hi-1,

and the previous cell state Ci-1. Then, the horizontal and vertical cell and hidden state outputs are

independently computed. The final cell and hidden state outputs are computed by adding the

horizontal and vertical state outputs together.

Figure 5.15: Architecture of a SLF-LSTM cell.

Given inputs Hi, Vi, hi-1, and Ci-1, the SLF-LSTM cell architecture for view number i can be

formulated as:




$$ HC_i = HF_i \odot HC_{i-1} + HI_i \odot H\tilde{C}_i \tag{5.37} $$


$$ Hh_i = HO_i \odot \tanh(HC_i) \tag{5.39} $$



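$$ VF_i = \sigma\left( W_{VF} [V_i + h_{i-1} + C_{i-1}] + b_{VF} \right) \tag{5.42} $$

$$ VC_i = VF_i \odot VC_{i-1} + VI_i \odot V\tilde{C}_i \tag{5.43} $$

$$ VO_i = \sigma\left( W_{VO} [V_i + h_{i-1} + C_{i-1}] + b_{VO} \right) \tag{5.44} $$

$$ Vh_i = VO_i \odot \tanh(VC_i) \tag{5.45} $$

$$ C_i = [HC_i + VC_i] \tag{5.46} $$

$$ h_i = [Hh_i + Vh_i] \tag{5.47} $$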

where 𝐻𝐶𝑖 and 𝑉𝐶𝑖 are, respectively, the horizontal and vertical cell states and 𝐻ℎ𝑖 and 𝑉ℎ𝑖 are,

respectively, the horizontal and vertical hidden states (other variables were defined in Section

5.6.2.1).

The SLF-LSTM cell architecture jointly learns light field horizontal and vertical dynamics, in the

form of fused states composed by independent horizontal and vertical states. The parallelism

capability of SLF-LSTM is the same as that of GLF-LSTM, as all the horizontal and vertical learning

weights are independently computed. The SLF-LSTM architecture implies not only that vertical

and horizontal gates cannot establish a learning interaction between themselves, but the fused

horizontal and vertical gates cannot do so when optimizing learning weights for updating the cell

and hidden states either, which may decrease the learning efficiency.
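
The state-level fusion scheme can be sketched in the same style. This is only an illustration under explicit assumptions: the gate equations not shown above are assumed to take the same form as the corresponding GLF-LSTM gates in Section 5.6.2.1, the bracketed inputs are read as element-wise sums, and the separate horizontal and vertical cell states are carried from one cell to the next alongside the fused states. All function names and dictionary keys are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    # Computes W x + b for a weight matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def slf_lstm_step(H, V, h_prev, C_prev, HC_prev, VC_prev, p):
    """One SLF-LSTM step: per-direction states are computed, then summed."""

    def branch(X, own_C_prev, d):
        # Gates of one direction (d is "H" or "V"); form assumed as in GLF-LSTM.
        x = [a + h + c for a, h, c in zip(X, h_prev, C_prev)]
        I = [sigmoid(v) for v in affine(p["W_" + d + "I"], x, p["b_" + d + "I"])]
        G = [math.tanh(v) for v in affine(p["W_" + d + "C"], x, p["b_" + d + "C"])]
        F = [sigmoid(v) for v in affine(p["W_" + d + "F"], x, p["b_" + d + "F"])]
        O = [sigmoid(v) for v in affine(p["W_" + d + "O"], x, p["b_" + d + "O"])]
        # Direction-specific cell state (Eq. 5.37 or 5.43).
        C_d = [f * c + i * g for f, c, i, g in zip(F, own_C_prev, I, G)]
        # Direction-specific hidden state (Eq. 5.39 or 5.45).
        h_d = [o * math.tanh(c) for o, c in zip(O, C_d)]
        return h_d, C_d

    Hh, HC = branch(H, HC_prev, "H")
    Vh, VC = branch(V, VC_prev, "V")
    C = [a + b for a, b in zip(HC, VC)]   # Eq. 5.46
    h = [a + b for a, b in zip(Hh, Vh)]   # Eq. 5.47
    return h, C, HC, VC

def zero_params(n):
    # All-zero parameters, only useful for checking the data flow by hand.
    p = {}
    for direction in ("H", "V"):
        for gate in ("I", "C", "F", "O"):
            p["W_" + direction + gate] = [[0.0] * n for _ in range(n)]
            p["b_" + direction + gate] = [0.0] * n
    return p
```

Compared with the gate-level sketch, the fusion here happens only after each direction has produced its own cell and hidden state, which is why the two branches can be evaluated fully in parallel.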

5.6.2.3 Sequential Learning LSTM Cell Architecture

The last proposed light field cell architecture performs sequential learning (SeqL) by modeling in

sequence the angular dynamics available in the horizontal and vertical parallaxes. As illustrated in

Figure 5.16, the proposed SeqL-LSTM cell architecture updates first the cell state using horizontal

information, thus creating an updated cell state expressing all previous horizontal and vertical

viewpoint changes observed so far, as well as the horizontal changes observed in the current

viewpoint. Then the cell state is again updated using vertical information. Unlike the previous

proposals, in this approach, the cell state is not updated based on a fusion scheme. In this cell

architecture, the horizontal and vertical hidden states are independently computed based on the

sequentially learned cell states and only then combined to compute the final cell output, i.e. the

hidden state.

Figure 5.16: Architecture of a SeqL-LSTM cell.

Given inputs Hi, Vi, hi-1, and Ci-1, the SeqL-LSTM cell architecture for view number i can be

formulated as:




$$ HC_i = HF_i \odot C_{i-1} + HI_i \odot H\tilde{C}_i \tag{5.51} $$


$$ VF_i = \sigma\left( W_{VF} [V_i + h_{i-1} + C_{i-1}] + b_{VF} \right) \tag{5.53} $$


$$ C_i = VF_i \odot HC_i + VI_i \odot V\tilde{C}_i \tag{5.55} $$


$$ VO_i = \sigma\left( W_{VO} [V_i + h_{i-1} + C_{i-1}] + b_{VO} \right) \tag{5.57} $$

$$ Vh_i = HO_i \odot \tanh(C_i) \tag{5.58} $$

$$ Hh_i = VO_i \odot \tanh(C_i) \tag{5.59} $$

$$ h_i = [Hh_i + Vh_i] \tag{5.60} $$

The variables above have been defined in Sections 5.6.2.1 and 5.6.2.2.

The SeqL-LSTM cell architecture establishes a learning interaction between horizontal and

vertical input, forget and candidate value weights when updating the cell state, which is expected

to provide better learning and, thus, a better angular description than the SLF-LSTM and

GLF-LSTM cell architectures. Indeed, the vertical weights are optimized considering the fact that the

horizontal information for the current input has already been observed, and thus the horizontal

weights for updating the cell state are already optimized, which is not the case for the other two

proposed light field LSTM cell architectures. However, the expected superior performance of

the SeqL-LSTM cell architecture comes at the cost of reduced parallelism, as updating the

SeqL-LSTM cell state must be done in sequence.
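
The sequential update can be sketched as follows. This is an illustration under explicit assumptions rather than the implementation used in this Thesis: the gate equations not shown above are assumed to take the same form as in Section 5.6.2.1 (including the previous cell state in every gate input), the vertical gates are assumed to use their own weights, and the bracketed inputs are read as element-wise sums. All function names and dictionary keys are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    # Computes W x + b for a weight matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def seql_lstm_step(H, V, h_prev, C_prev, p):
    """One SeqL-LSTM step: horizontal update first, then vertical update."""
    xh = [a + h + c for a, h, c in zip(H, h_prev, C_prev)]
    xv = [a + h + c for a, h, c in zip(V, h_prev, C_prev)]

    def gate(name, x, act):
        return [act(v) for v in affine(p["W_" + name], x, p["b_" + name])]

    # Step 1: update the cell state with horizontal information (Eq. 5.51).
    HI, HG, HF = gate("HI", xh, sigmoid), gate("HC", xh, math.tanh), gate("HF", xh, sigmoid)
    HC = [f * c + i * g for f, c, i, g in zip(HF, C_prev, HI, HG)]

    # Step 2: update the result again with vertical information (Eq. 5.55).
    VI, VG, VF = gate("VI", xv, sigmoid), gate("VC", xv, math.tanh), gate("VF", xv, sigmoid)
    C = [f * c + i * g for f, c, i, g in zip(VF, HC, VI, VG)]

    # Step 3: two hidden states from the final cell state, then their sum
    # (Eqs. 5.58-5.60); the sum does not depend on how the two are labeled.
    HO, VO = gate("HO", xh, sigmoid), gate("VO", xv, sigmoid)
    h = [(ho + vo) * math.tanh(c) for ho, vo, c in zip(HO, VO, C)]
    return h, C

def zero_params(n):
    # All-zero parameters, only useful for checking the data flow by hand.
    p = {}
    for direction in ("H", "V"):
        for gate_name in ("I", "C", "F", "O"):
            p["W_" + direction + gate_name] = [[0.0] * n for _ in range(n)]
            p["b_" + direction + gate_name] = [0.0] * n
    return p
```

The data dependency that limits parallelism is explicit here: the vertical update in step 2 consumes `HC`, so it cannot start before the horizontal update in step 1 has finished.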

5.7 Summary of the Proposed Face/Ear Recognition Solutions

Table 5.2 summarizes the main characteristics of the face/ear recognition solutions proposed in

this chapter, sorted (from low to high) based on the level of performance provided, as will be

shown in Chapter 6. This table includes information about the biometric modalities considered,

the levels of the taxonomy proposed in Section 3.2.3, the classifiers used and the light field

capabilities exploited.

Table 5.2: Overview of the recognition solutions proposed in this Thesis.

| Proposed Solution | Biometric Modality | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Classifier | Light Field Capability |
|---|---|---|---|---|---|---|---|
| LFLBP | Face/Ear | Global | Local | Hand-Crafted | Texture | SVM | Disparity exploitation |
| LFHG | Face/Ear | Global | Local | Hand-Crafted | Texture | SVM | Disparity exploitation |
| VGG-D3 | Face | Global | Global | Learning | Deep Neural Nets | SVM | Disparity exploitation; Depth exploitation |
| VGG + Conv-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + GLF-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + SLF-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + SeqL-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |

Chapter 6

Light Field Face and Ear Recognition Performance

6.1 Introduction

In this chapter, an extensive performance evaluation is reported for the proposed light field face

and ear recognition solutions, as well as for several benchmarking solutions, using common,

representative performance evaluation frameworks for varied and challenging face and ear

recognition tasks.

This Thesis proposed two recognition solutions based on hand-crafted light field descriptors,

LFLBP (Section 5.2) and LFHG (Section 5.3), which can be applied to the face and ear recognition

problems, describing the disparity information available in light field images and thus exploiting

the angular variations together with the available spatial information. Additionally, five deep

learning based fused deep/double-deep descriptors have been proposed for face recognition,

learning convolutional representations and angular dynamics from a light field image. To

successfully apply the proposed deep learning solutions for ear recognition a larger light field

dataset would be necessary for fine-tuning the considered neural networks.

In order to analyze the sensitivity of the proposed face recognition solutions to the available

training data, different evaluation protocols with practical meaningfulness have been proposed,

offering different trade-offs in terms of initial setting complexity and later recognition

performance. Concerning ear recognition, the performance of the proposed solutions is evaluated

based on a cross-session scenario (IST-EURECOM LLFEDB has been captured in two separate

acquisition sessions), training the classifier using the ear images captured in the first session while

testing with the second acquisition session’s images. It should be noted that the proposed ear

recognition database does not include many variations (it only includes four images cropped from

right and left half and full profile face images), thus more complex evaluation protocols are not

considered for ear recognition.

6.2 Performance Assessment Frameworks

This section presents the frameworks considered for evaluating the performance of the proposed

face (Section 6.2.1) and ear (Section 6.2.2) recognition solutions.

6.2.1 Face Recognition Performance Assessment Framework

The test material, evaluation protocols and metrics to assess the performance of the proposed face

recognition solutions and other relevant recognition solutions used for benchmarking are described

in the following.

6.2.1.1 Face Recognition Test Material

A comprehensive set of experiments using a common, representative evaluation framework has

been conducted with the IST-EURECOM LLFFD face database (presented in Section 4.2), for

varied and challenging recognition tasks. In the experiments, all the images from the IST-

EURECOM LLFFD are used to assess the performance of the proposed face recognition solutions

and other relevant recognition methods used for benchmarking.

6.2.1.2 Face Recognition Evaluation Protocols

To analyse the sensitivity of the proposed face recognition solutions to the available training data,

both in terms of number of training samples and facial variations, three evaluation protocols with

practical meaningfulness are proposed. The protocols are defined as follows:

Protocol 1: The training set contains only the neutral light field images from the first

acquisition session (1 image per subject), while the validation set includes left and right half-

profile images from the same acquisition session (2 images per subject), thus corresponding to

a low-complexity enrolment and training scenario; the testing set includes all the light field

images from the second acquisition session, as illustrated in Figure 6.1-a. This 'single training

image per person' protocol is the simplest protocol considered, but it is the most challenging

in terms of recognition performance.

Protocol 2: The training set contains the neutral plus the left and right full-profile light field

images from the first acquisition session (3 images per subject), while the testing set includes

all the light field images from the second acquisition session, as illustrated in Figure 6.1-b. The

validation study is omitted for this protocol, as the training set is not very different from that of

Protocol 1. This protocol assumes a rather simple and quick enrolment phase, thus

corresponding to a low-complexity enrolment and training scenario and is slightly less

challenging than the first protocol in terms of recognition performance.

Protocol 3: The training set contains all twenty database face variations captured during the

first acquisition session, while the validation and testing sets each consider half of the second

session images, as illustrated in Figure 6.1-c, thus corresponding to a higher-complexity

enrolment and training scenario. This scenario is less challenging in terms of recognition

performance as the system learns more in the training phase.

The first and second protocols (Protocol 1 and Protocol 2) correspond to an application scenario

where each person registers/enrolls into the system by quickly taking just one or three photos in a

controlled setup, similar to the famous police station paradigm. Testing is done by considering all

facial variations captured in the second acquisition session, assuming that the recognition should

be robust to real-life conditions where the face images to be used for recognition may have been

captured in less constrained conditions, notably including facial expressions or partial occlusions,

for instance. With these protocols, the recognition system has not been exposed to (trained with) many

of the facial variations with which it will be tested.

The third protocol (Protocol 3) assumes a more complex acquisition phase, considering more

training images, under the assumption that the increased complexity will result in a better trained

and thus more knowledgeable model, which should offer a better recognition performance. This

protocol divides the available database material into disjoint training (50%), validation (25%), and

testing (25%) sets where the first session images are all used for training. In this case, the

recognition system has been initially exposed/trained to more facial variations, increasing the

initial complexity to get a better deep model, and thus achieve a better recognition performance.

Figure 6.1: IST-EURECOM LLFFD (non-cropped) database division into training, validation and

testing sets for (a) Protocol 1; (b) Protocol 2; and (c) Protocol 3.

The three protocols correspond to cooperative user scenarios, offering different trade-offs in terms

of the initial enrolment and training complexity versus the expected recognition performance. The

first and second protocols have multiple practical applications, such as in access control systems,

where the users can be registered/enrolled into the system by quickly taking a mugshot, including

frontal-view and side-view photos in a controlled setup. Then, the goal is to recognize a person

from an image captured at a different time, and possibly in non-ideal conditions, e.g. exhibiting

unpredictable facial variations. The third protocol corresponds to a very cooperative user scenario,

for usage in applications with increased security requirements, where the users are willing to

cooperate during the registration phase, simulating different facial variations, over a range of

expressions, actions, poses, illuminations, and occlusions, to capture as many variations as possible

during the enrollment phase so that the proposed system can more effectively recognize users

during the daily operation of the system.

For all protocols, the training set is used to obtain the classification model weights, the validation

set is used to tune the training model hyper-parameters in case of deep learning based solutions,

and the testing set is used for the system performance assessment. Since face recognition is treated as a multi-class

classification task, at least one image from each subject (class) with whom the

system will be validated/tested must be available during the training stage. If a new subject is to

be recognized, the database has to be extended with corresponding images and the classification

model has to be re-trained (fine-tuned), as the new subject is an unseen label in the previous model.

As the performance of the classification model being trained depends on a set of hyper-parameters,

a disjoint set of validation samples is used to select the hyper-parameter values for deep learning

based solutions leading to the best performance.

6.2.1.3 Face Recognition Performance Assessment Metrics

After performing classification, similarity scores between the test and all the enrolment samples

are sorted, thus every test sample has a best match with one of the enrolment samples. A test

sample has rank k if the correct match has the kth largest similarity score, where k can vary between

1 and the number of samples enrolled in the database. To evaluate the identification performance

of the face recognition solutions, the Recognition Rate at rank n (RRn) and Cumulative Recognition

Rate at rank n (CRRn) metrics are used. RRn can be calculated according to Equation 6.1:

$$ RR_n = \frac{N_n}{|T|} \tag{6.1} $$

where $N_n$ is the number of samples that have rank n, and |T| is the total number of test images

considered. $CRR_n$ can then be computed using Equation 6.2:

$$ CRR_n = \sum_{i=1}^{n} RR_i \tag{6.2} $$
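
The two metrics can be computed directly from the rank of each test sample. The sketch below is a minimal illustration with hypothetical function names and toy similarity scores; it assumes that a larger score means a better match, as stated above.

```python
def rank_of_correct_match(scores, correct_label):
    """scores maps each enrolled label to its similarity with the test sample."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    # Rank 1 means the correct label has the largest similarity score.
    return ordered.index(correct_label) + 1

def recognition_rates(ranks, max_rank):
    """Returns the lists [RR_1..RR_max] and [CRR_1..CRR_max]."""
    total = len(ranks)
    # Eq. 6.1: fraction of test samples whose correct match has rank n.
    rr = [sum(1 for r in ranks if r == n) / total for n in range(1, max_rank + 1)]
    # Eq. 6.2: running sum of the recognition rates up to rank n.
    crr, running = [], 0.0
    for value in rr:
        running += value
        crr.append(running)
    return rr, crr
```

For instance, four test samples with ranks 1, 1, 2 and 3 give recognition rates of 0.5, 0.25 and 0.25 at ranks 1 to 3, and cumulative recognition rates of 0.5, 0.75 and 1.0.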

6.2.1.4 Face Recognition Benchmarking Solutions

The competing recognition solutions considered for benchmarking purposes are grouped into two

categories:

1. Conventional 2D solutions, notably PCA [65], LBP [76], HOG [119] and VGG-Face [38],

which are applied to the central view 2D rendered SA images;

2. Light field solutions, which fully use the light field data, notably DLBP [120], fusing features

extracted from the central view 2D rendered SA image with a disparity map computed from

the light field, and MPCA [153], adopted for the first time for light field based face recognition

in this Thesis.

A summary of the characteristics of each considered benchmarking solution, following the

taxonomy illustrated in Figure 3.3, is available in Table 6.1 for ease of reference. The central view

2D rendered SA images and the full light field images are used to test the non-light field based 2D

and the light field based solutions, respectively. All tested face recognition solutions were

re-implemented by the author of this Thesis, and performance results were obtained considering the

best parameter settings reported in the relevant original papers.

Table 6.1: Overview of the face recognition benchmarking solutions.

| Solution Name | Type | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach |
|---|---|---|---|---|---|
| PCA [65] | 2D | Global | Global | Appearance | Linear |
| VGG-Face [38] | 2D | Global | Global | Learning | Deep Neural Network |
| LBP [76] | 2D | Global | Local | Hand-Crafted | Texture |
| HOG [119] | 2D | Global | Local | Hand-Crafted | Texture |
| MPCA [153] | LF | Global | Global | Appearance | Multi-Linear |
| DLBP [120] | LF | Global | Local | Hand-Crafted | Texture |

6.2.2 Ear Recognition Performance Assessment Framework

Only the IST-EURECOM LLFEDB database (see Section 4.3) is available to test the proposed light field ear recognition solutions. The database has been proposed in this Thesis, and made publicly available, to facilitate the testing, validation and comparison of light field ear recognition solutions. This section presents the experimental assessment setup, the benchmarking ear recognition solutions, and the analysis of the ear recognition performance results obtained for the two proposed and the benchmarking solutions.

6.2.2.1 Ear Recognition Test Material

A comprehensive set of experiments using a common, representative evaluation framework has

been conducted with the novel IST-EURECOM LLFEDB ear database. In the experiments, all

the images from the IST-EURECOM LLFEDB are used to assess the performance of the proposed

ear recognition and other relevant recognition solutions used for benchmarking.

6.2.2.2 Ear Recognition Evaluation Protocol and Metrics

This Thesis proposes an ear recognition evaluation protocol based on a cross-session scenario. The

training phase uses the four ear images per user of the IST-EURECOM LLFEDB first acquisition

session, applying the proposed light field descriptors, whose outputs are then used to train a

classifier and create a classification model. The testing phase uses the four IST-EURECOM

LLFEDB images from the second acquisition session. The training and testing steps are repeated

using the second session images as enrolment data and the first session images as test data; the

average results of these two runs are reported.
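The two-run, cross-session protocol described above can be sketched as follows. This is only an illustration under stated assumptions: a plain nearest-neighbour matcher stands in for the trained classifier, the feature vectors are made-up two-dimensional points rather than light field descriptors, and the names `nearest_neighbour_rr1` and `cross_session_rr1` are hypothetical.

```python
def nearest_neighbour_rr1(enrol, test):
    """Rank-1 recognition rate of a 1-NN matcher (stand-in for the classifier).

    `enrol` and `test` are lists of (user_label, feature_vector) pairs.
    """
    correct = 0
    for label, feature in test:
        best = min(
            enrol,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[1], feature)),
        )
        correct += best[0] == label
    return correct / len(test)


def cross_session_rr1(session_1, session_2, evaluate=nearest_neighbour_rr1):
    """Average of the two runs: enrol on one session, test on the other."""
    run_a = evaluate(session_1, session_2)  # session 1 as enrolment data
    run_b = evaluate(session_2, session_1)  # roles swapped
    return (run_a + run_b) / 2


# Made-up descriptors for two users and two acquisition sessions.
session_1 = [("u1", (0.0, 0.0)), ("u2", (1.0, 1.0))]
session_2 = [("u1", (0.1, 0.0)), ("u2", (0.4, 0.4))]
average_rr1 = cross_session_rr1(session_1, session_2)
print(average_rr1)
```

Swapping the roles of the two sessions and averaging keeps the reported figure from depending on which session happens to be used for enrolment.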


Similarly to face recognition, to evaluate the identification performance of the tested ear

recognition solutions the RRn (Equation 6.1) and CRRn (Equation 6.2) metrics are used.

6.2.2.3 Ear Recognition Benchmarking Solutions

A set of representative and promising ear recognition solutions available in the literature was selected for benchmarking purposes. The selection includes hand-crafted solutions, notably the Local Gabor (LG) descriptor [5], [203], [173], LBP [176], [177], [204], [205], [181],

LPQ and rotation invariant LPQ [5], [176], [206], HOG [5], [176], [174], POEM (Patterns of

Oriented Edge Magnitudes) [5], and BSIF [5], [176]. The performance of the tested 2D recognition

solutions was evaluated considering the best parameter settings reported in [5]. It should be noted

that apart from the solutions proposed in this Thesis, there is no published research activity

addressing ear recognition using light field sensors at the time of the writing of the Thesis, thus

the benchmarking solutions do not contain any light field based ear recognition solution. A

summary of the characteristics of each considered ear benchmarking solution, following the

taxonomy illustrated in Figure 3.3, is available in Table 6.2 for ease of reference.

Table 6.2: Overview of the ear recognition benchmarking solutions.

| Ref. | Ear Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Feature Extractor |
|---|---|---|---|---|---|
| [173] | Global | Local | Hand-Crafted | Texture | LG |
| [176] | Global | Local | Hand-Crafted | Texture | LBP; LPQ; HOG; BSIF |
| [5] | Global | Local | Hand-Crafted | Texture; Frequency | LPQ; BSIF; SIFT; POEM; Gabor; HOG |
| [184] | Global | Local | Hand-Crafted | Texture | GLCM; LBP; Gabor filters |

The central view 2D rendered images and the multi-view SA image arrays of the IST-EURECOM

LLFEDB are the input to the 2D benchmarking solutions and the proposed light field based ear

recognition solutions, respectively.

6.3 LFLBP Descriptor Parameter Setting

The proposed Light Field Local Binary Pattern (LFLBP) hand-crafted descriptor has a

number of parameters whose optimization is of major importance for the final recognition

performance. In this context, parameter setting experiments are performed to study the influence

of the key parameters on the light field based face recognition performance.

As discussed in Section 5.2.2, LFLBP has three parameters, including: i) the radius, R, expressing

the Euclidean distance to the reference view, with a direct relation with the observed disparity; ii)

the angle, A, identifying the starting angle; and iii) N, defining the number of SA images to consider

in the descriptor. The experimental work starts by analyzing the influence of the radius parameter,

R. Once the optimal value of R is fixed, the impact of the number of angular views (N) and of the starting angle (A) is investigated.
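To make the role of R, A and N more concrete, the sketch below lists the SA image positions that a given parameter triple would select. It rests on assumptions that are not stated in this section: a 15×15 SA array with the reference view at (8, 8), views evenly spaced over 360º starting at angle A, and positions rounded to the nearest integer view index. The function name `view_positions` and the axis orientation are hypothetical.

```python
import math


def view_positions(radius, start_angle_deg, num_views, centre=(8, 8)):
    """Return the (u, v) indices of the SA images selected around `centre`.

    Assumption: views are evenly spaced in angle, and each position is
    rounded to the nearest view of the array, so diagonal views lie only
    approximately at distance `radius` from the reference view.
    """
    positions = []
    for k in range(num_views):
        theta = math.radians(start_angle_deg + k * 360.0 / num_views)
        u = centre[0] + round(radius * math.cos(theta))
        v = centre[1] + round(radius * math.sin(theta))
        positions.append((u, v))
    return positions


print(view_positions(7, 0, 4))   # four views on the array borders
print(view_positions(3, 45, 4))  # four diagonal views closer to the centre
```

Under these assumptions, R=7 with A=0º and N=4 selects the four border views in the central row and column of the array, which are the views at maximum distance from the central view.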


6.3.1 LFLBP Descriptor Parameter Setting: View radius

As mentioned before, light field images allow the recognition technique to benefit from the

captured disparity, therefore the amount of disparity to consider is the first aspect to be analysed.

For this purpose, the value of R is increased from 3 up to 7, with A=45º and N=4. The CRR5 values

for the emotion, action and occlusion recognition tasks corresponding to the IST-EURECOM

LFFD database dimensions are illustrated in Figure 6.2. The results show a clear increase in the

recognition rate as the disparity increases. By considering a larger radius, more distinctive angular

information is captured by the proposed LFALBP descriptor, and therefore the matching accuracy

between the test and enrolment samples is considerably improved.

Figure 6.2: CRR5 versus R for LFALBPR,45º,4.

6.3.2 LFLBP Descriptor Parameter Setting: Number of Views and Starting Angle

After finding the optimal value of R, the second set of experiments aims to select the ideal starting

angle (A) and number of angular views (N) to use. Table 6.3 shows RR1 and CRR5 (in percentage)

for the proposed recognition system based on using the LFLBP descriptor with three different

parameter settings for A and N, covering N values of 4 and 8. Results show that considering 4 views captures the essential angular variations, leading to the best

recognition performance. Concerning parameter A, there is no significant difference between the

results obtained when using 0° or 45°. Thus, 0° and 4 are selected as final values for the A and N

parameters.

Table 6.3: RR1 and CRR5 for LFLBP for different values of A and N (best results in bold).

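| Method | Emotion RR1 | Emotion CRR5 | Action RR1 | Action CRR5 | Occlusion RR1 | Occlusion CRR5 | Average RR1 | Average CRR5 |
|---|---|---|---|---|---|---|---|---|
| LFLBP4,0°,16,7,0°,4 | 97 | 98 | 93.5 | 97 | 86 | 96.5 | 92.1 | 97.1 |
| LFLBP4,0°,16,7,45°,4 | 96.6 | 97.3 | 93 | 97 | 86.5 | 96.5 | 92 | 96.9 |
| LFLBP4,0°,16,7,0°,8 | 86.3 | 94.3 | 84.5 | 95 | 80.5 | 91 | 83.7 | 93.4 |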


6.4 Light Field Histogram of Disparity Gradients Descriptor Parameters

The proposed LFHG descriptor, presented in Section 5.3.2, aims to exploit the light field

variations, both in terms of position and direction. It uses the central SA image to compute the

HOG descriptor, and four SA images, referred to by their position in the SA multi-view array by

(u1, v1), (u2, v2), (u3, v3) and (u4, v4), to compute the LFHDG descriptor. Results in Section 6.3.1

show a clear performance improvement for light field based face recognition as the SA images’

disparity increases. Thus, it is proposed here that the SA images used to compute the LFHDG descriptor be those at maximum distance from the central view, i.e., u1=15, v1=8, u2=1, v2=8, u3=8, v3=15, u4=8, v4=1.

6.5 LSTM Hyper-Parameter Setting

The proposed recognition solutions using deep learning models combine the usage of VGG-Face

descriptor and LSTM. The LSTM network has a number of parameters whose optimization is of

major importance for the final recognition performance; this is not the case for VGG-Face

descriptor as the pre-trained VGG-Face model can be directly used to extract descriptions from 2D

face images for face recognition. This section evaluates the impact of the LSTM hyper-parameters

setting, notably analyzing the influence of the LSTM hidden layer size, the batch size and the

number of epochs to consider for network convergence. Then the impact of the various proposed

SA image selection topologies and scanning methods is evaluated in terms of recognition accuracy.

The optimal recognition framework configuration will be used to test the proposed solutions based

on combination of VGG-Face descriptor with the conventional and the proposed light field LSTM

architectures. In the following subsections LSTM hyper-parameters are evaluated considering

protocols 1 and 3, given the similarities between protocols 1 and 2.
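The one-parameter-at-a-time tuning strategy followed in the next subsections can be sketched as below. This is a simplified illustration, not the procedure actually implemented: `validate` is a hypothetical callback returning a validation score for a configuration, `toy_validate` is a synthetic stand-in that does not reproduce any result of this Thesis, and the sketch always picks the highest score, whereas the number of epochs is chosen in Section 6.5.3 by looking for stable performance.

```python
def sequential_sweep(validate, grids, start):
    """Tune one hyper-parameter at a time, freezing each choice before the next.

    `grids` is an ordered list of (name, candidate_values) pairs; the order
    matters because earlier choices are kept fixed in later sweeps.
    """
    config = dict(start)
    for name, candidates in grids:
        scores = {value: validate({**config, name: value}) for value in candidates}
        config[name] = max(scores, key=scores.get)
    return config


def toy_validate(config):
    """Synthetic stand-in for a validation-set RR1 measurement (not thesis data)."""
    return (
        1.0
        - abs(config["hidden_size"] - 256) / 512
        - abs(config["num_batches"] - 3) / 10
    )


grids = [
    ("hidden_size", [32, 64, 128, 256, 512]),  # sizes studied in Section 6.5.1
    ("num_batches", [2, 3, 4, 5, 6]),          # batch counts studied in Section 6.5.2
]
best = sequential_sweep(toy_validate, grids, start={"hidden_size": 32, "num_batches": 3})
print(best)
```

A sequential sweep needs far fewer training runs than a full grid search, at the cost of ignoring interactions between hyper-parameters.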

6.5.1 Hyper-Parameter Evaluation: Hidden Layer Size

The study of recognition performance sensitivity to the size of the LSTM hidden layers is reported

first. Figure 6.3 illustrates the recognition performance at rank-1, RR1, for hidden layer sizes of 32,

64, 128, 256, and 512 for Protocol 1 (Figure 6.3-a) and Protocol 3 (Figure 6.3-b) validation sets,

after training with all the considered SA image selection methods. These results are reported

considering a batch size of 34 and 667 (1/3 of the input data), respectively for protocols 1 and 3,

and 50 epochs. These values were selected after some initial experimentation, which showed the

suitability of these values for network initialization.

The results show a clear improvement in the recognition performance as the hidden layer size is increased up to 256. Larger hidden layers do not increase the recognition accuracy further, which even decreases slightly for a size of 512, possibly due to overfitting. This shows that the LSTM tends to converge to a complex model that is not well captured by too small a hidden layer.


Figure 6.3: Rank-1 recognition results versus hidden layer size considering all proposed SA

image selection methods for: (a) Protocol 1 and (b) Protocol 3.

6.5.2 Hyper-Parameter Evaluation: Batch Size

In theory, the batch size should be adjusted to have an accurate gradient estimation while avoiding

overfitting. Figure 6.4 illustrates the recognition performance for protocols 1 and 3 validation sets,

when considering between 2 and 6 batches, resulting in batch sizes of 50, 34, 25, 20 and 17 for

Protocol 1, and 1000, 667, 500, 400 and 333 for Protocol 3. Results are reported for 50 epochs,

after setting the hidden layer size to 256, the best size obtained in Section 6.5.1.

Figure 6.4: Rank-1 recognition results versus the batch size considering all proposed SA image

selection methods for: (a) Protocol 1, and (b) Protocol 3.

The results presented in Figure 6.4 show that using three batches, i.e., batch sizes of 34 and 667,

respectively for protocols 1 and 3, allows a good gradient estimation, leading to the best

recognition performance for almost all cases. It should be noted that since the LSTM inputs are

VGG face descriptions, the input dimension is very small, i.e., 4096, thus justifying the better

performance obtained by the large batch size selected for Protocol 3. It is also possible to observe

that mid-density SA image selection methods are more robust to changes in the number of batches,

when compared to the other SA image selection methods.


6.5.3 Hyper-Parameter Evaluation: Number of Training Epochs

The number of training epochs, which directly impacts the required training time, should be

minimized while guaranteeing network convergence. Figure 6.5 shows the recognition

performance for Protocol 1 (Figure 6.5-a) and Protocol 3 (Figure 6.5-b) validation sets when

varying the number of training epochs, after training with all the considered SA image selection

methods. Results are reported by setting the hidden layer size to 256 and the number of batches to

3, based on the conclusions from the previous sections.

The experimental results show that considering 40 and 130 training epochs, respectively for

protocols 1 and 3, leads to a stable performance for almost all the cases. The recognition

performance remains almost constant for higher numbers of epochs. The network converges much

faster in Protocol 1 as the validation data is smaller. Hence, to keep a good trade-off between

accuracy and training time and also to keep the same framework configuration for both evaluation

protocols, the number of training epochs selected is 130.

Figure 6.5: Rank-1 recognition results versus number of training epochs considering all

proposed SA image selection methods for: (a) Protocol 1 and (b) Protocol 3.

6.5.4 SA Images Selection Evaluation

As discussed in Section 5.5.2, there are different options for selecting a SA image subset to be

processed by the VGG-Face descriptor and then scanned as a pseudo-video sequence so that their

angular dynamics can be learned by the LSTM network.

The results for the different SA image selection methods presented in Figure 6.5 show that, for the

high density SA images selection strategy, the snake-like scanning offers superior performance

over the row-major scanning, as it avoids the significant viewpoint feature discontinuities resulting

from moving from the right-most SA image in a row to the left-most SA image in the next row.
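The difference between the two scanning orders can be illustrated with the following sketch, which generates both orders for a grid of SA images and measures the largest jump between consecutive views. The 15×15 grid size is an assumption based on the view indices used in Section 6.4, and the function names are hypothetical.

```python
def row_major_scan(rows, cols):
    """Scan every row from left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def snake_scan(rows, cols):
    """Scan even rows left to right and odd rows right to left."""
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order


def largest_step(order):
    """Largest jump (in view positions) between two consecutive views."""
    return max(
        max(abs(r1 - r0), abs(c1 - c0))
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


print(largest_step(row_major_scan(15, 15)))  # jump at every row change
print(largest_step(snake_scan(15, 15)))      # consecutive views are always neighbours
```

In the snake-like order every pair of consecutive views is adjacent in the array, so the pseudo-video sequence fed to the LSTM contains no abrupt viewpoint change.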

It is also clear from Figure 6.5 that the mid-density SA image selection solutions, capturing full

angular information along the horizontal and vertical directions, achieve better average

performance when compared to the high- and low-density selection methods. Among the proposed

mid-density selection alternatives, the score-level fusion of horizontal and vertical angular


information leads to the best performance. The alternative of performing a single combined scan

implies a viewpoint feature discontinuity when moving from the last horizontal SA image (middle

row) to the first vertical SA image (top row) which leads to a worse performance.
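Score-level fusion of the horizontal and vertical scans can be sketched as follows. The fusion rule is an assumption made only for illustration (an equally weighted sum of the per-identity scores produced by the two networks), since the exact rule is not restated in this section; the scores and the names `fuse_scores` and `predict` are hypothetical.

```python
def fuse_scores(horizontal, vertical, weight=0.5):
    """Weighted sum of two per-identity score dictionaries (assumed fusion rule)."""
    return {
        identity: weight * horizontal[identity] + (1 - weight) * vertical[identity]
        for identity in horizontal
    }


def predict(scores):
    """Return the identity with the largest score."""
    return max(scores, key=scores.get)


# Made-up scores from the horizontal and vertical LSTM networks for one test sample.
horizontal_scores = {"A": 0.55, "B": 0.45}
vertical_scores = {"A": 0.30, "B": 0.70}

fused = fuse_scores(horizontal_scores, vertical_scores)
print(predict(horizontal_scores), predict(vertical_scores), predict(fused))
```

Because each scan is processed as its own sequence, fusing at score level avoids the viewpoint discontinuity that a single combined scan introduces.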

Based on the validation experiments described so far, the best configuration, to be used from this

point on for the system performance assessment using the Conv-LSTM and the proposed GLF-LSTM, SLF-LSTM and SeqL-LSTM architectures, is summarized in Table 6.4.

Table 6.4: Selected configuration for the Conv-LSTM and the proposed GLF-LSTM, SLF-LSTM and SeqL-LSTM architectures for face recognition.

| Hyper-parameter / Image selection method | Setting |
|---|---|
| Hidden layer size | 256 |
| Batch size | 1/3 of the input data |
| Number of training epochs | 130 |
| SA image selection method | Mid-density horizontal and vertical, with score-level fusion |

6.6 Face Recognition Accuracy

Once the values of the parameters and hyper-parameters of the proposed solutions have been

selected, their face recognition performance can be evaluated and compared to that of the selected

benchmarking solutions. The rank-1 recognition results obtained are reported in Table 6.5, Table

6.6 and Table 6.7, respectively for test protocols 1, 2 and 3 (see Section 6.2.1.2), using the best

configurations obtained in Sections 6.3, 6.4, and 6.5, for the seven proposed and the six

benchmarking recognition solutions introduced in Section 6.2.1.4. Keep in mind that each protocol

has a different classification model as each protocol uses a different training set. The results in

these tables are presented for five recognition tasks corresponding to the IST-EURECOM LFFD

database dimensions, and the best results are highlighted in bold.


Table 6.5: Protocol 1 rank-1 recognition results for the proposed and benchmarking recognition solutions (best results in bold).

| Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average |
|---|---|---|---|---|---|---|---|
| PCA [65] | 2D | 28.50% | 28.00% | 06.67% | 12.50% | 16.33% | 17.40% |
| LBP [76] | 2D | 16.75% | 18.50% | 06.67% | 12.00% | 09.33% | 11.20% |
| HOG [119] | 2D | 57.50% | 58.50% | 09.83% | 48.00% | 38.33% | 36.60% |
| VGG-Face [38] | 2D | 99.50% | 99.00% | 56.33% | 99.00% | 74.67% | 79.00% |
| DLBP [120] | LF | 59.25% | 64.50% | 30.33% | 24.50% | 22.33% | 36.55% |
| MPCA [153] | LF | 36.75% | 33.50% | 07.50% | 14.50% | 19.67% | 20.30% |
| Prop. LFLBP | LF | 34.25% | 31.00% | 10.17% | 17.00% | 13.17% | 18.65% |
| Prop. LFHG | LF | 62.25% | 62.50% | 12.00% | 62.00% | 41.33% | 40.90% |
| Prop. VGG-D3 | LF | 99.50% | 99.00% | 56.50% | 99.00% | 75.50% | 79.50% |
| Prop. VGG + Conv-LSTM | LF | 99.25% | 99.50% | 71.67% | 99.00% | 91.17% | 88.55% |
| Prop. VGG + GLF-LSTM | LF | 99.50% | 100% | 71.33% | 99.00% | 92.00% | 88.75% |
| Prop. VGG + SLF-LSTM | LF | 99.75% | 100% | 72.33% | 99.00% | 92.00% | 89.10% |
| Prop. VGG + SeqL-LSTM | LF | 99.75% | 100% | 73.17% | 99.50% | 92.33% | 89.55% |





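Table 6.6: Protocol 2 rank-1 recognition results for the proposed and benchmarking recognition solutions.

| Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average |
|---|---|---|---|---|---|---|---|
| PCA [65] | 2D | 66.00% | 63.50% | 16.50% | 46.50% | 36.00% | 40.70% |
| LBP [76] | 2D | 71.00% | 73.00% | 21.50% | 43.00% | 36.50% | 43.85% |
| HOG [119] | 2D | 81.50% | 77.50% | 21.00% | 70.50% | 64.00% | 58.25% |
| VGG-Face [38] | 2D | 91.75% | 86.00% | 82.00% | 90.00% | 63.67% | 79.65% |
| DLBP [120] | LF | 89.50% | 89.00% | 72.50% | 65.00% | 63.33% | 73.60% |
| MPCA [153] | LF | 68.50% | 68.50% | 20.50% | 32.00% | 41.00% | 42.85% |
| Prop. LFLBP | LF | 67.00% | 70.50% | 38.50% | 46.00% | 55.67% | 53.75% |
| Prop. LFHG | LF | 80.00% | 79.00% | 21.34% | 67.50% | 65.00% | 59.20% |
| Prop. VGG-D3 | LF | 97.25% | 93.00% | 86.34% | 96.00% | 72.33% | 85.95% |
| Prop. VGG + Conv-LSTM | LF | 98.50% | 99.00% | 92.00% | 98.00% | 83.00% | 91.95% |
| Prop. VGG + GLF-LSTM | LF | 98.75% | 99.50% | 93.17% | 98.50% | 84.00% | 92.80% |
| Prop. VGG + SLF-LSTM | LF | 98.75% | 99.50% | 93.50% | 98.50% | 85.17% | 93.15% |
| Prop. VGG + SeqL-LSTM | LF | 99.25% | 99.00% | 94.00% | 98.50% | 85.17% | 93.35% |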




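Table 6.7: Protocol 3 rank-1 recognition results for the proposed and benchmarking recognition solutions.

| Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average |
|---|---|---|---|---|---|---|---|
| PCA [65] | 2D | 53.00% | 65.00% | 56.66% | 65.00% | 56.66% | 49.80% |
| LBP [76] | 2D | 48.50% | 83.00% | 67.66% | 64.00% | 67.66% | 52.20% |
| HOG [119] | 2D | 51.50% | 96.00% | 84.66% | 75.00% | 84.66% | 64.60% |
| VGG-Face [38] | 2D | 93.50% | 97.00% | 97.00% | 95.00% | 97.00% | 92.90% |
| DLBP [120] | LF | 56.50% | 64.00% | 69.66% | 75.00% | 69.66% | 63.70% |
| MPCA [153] | LF | 48.00% | 89.00% | 65.00% | 63.00% | 64.66% | 50.30% |
| Prop. LFLBP | LF | 52.50% | 96.00% | 87.66% | 76.00% | 87.66% | 65.80% |
| Prop. LFHG | LF | 61.00% | 93.00% | 83.33% | 80.00% | 83.33% | 67.10% |
| Prop. VGG-D3 | LF | 94.00% | 98.00% | 98.00% | 97.00% | 98.33% | 97.40% |
| Prop. VGG + Conv-LSTM | LF | 100% | 100% | 96.33% | 100% | 98.66% | 98.60% |
| Prop. VGG + GLF-LSTM | LF | 100% | 100% | 97.33% | 100% | 98.66% | 98.80% |
| Prop. VGG + SLF-LSTM | LF | 100% | 100% | 98.00% | 100% | 98.33% | 98.90% |
| Prop. VGG + SeqL-LSTM | LF | 100% | 100% | 98.33% | 100% | 98.33% | 99.00% |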

Comparison of benchmarking face recognition solutions:

The results for the benchmarking face recognition solutions clearly show that the VGG-Face

descriptor [38] performs considerably better than all the other tested solutions, including PCA [65],

LBP [76], HOG [119], DLBP [120], and MPCA [153]. These results were expected as the current

state-of-the-art on face recognition is dominated by deep neural networks, and the VGG-16 descriptor has proved to be one of the most efficient CNN architectures for face recognition. The next best performing conventional recognition solution is DLBP [120], due to the exploitation

of 2D texture as well as depth information for face recognition.

Comparison with benchmarking face recognition solutions:


The light field face recognition solutions are compared against their 2D baseline counterparts in Table 6.8, Table 6.9 and Table 6.10, respectively for test protocols 1, 2 and 3, in terms of rank-1 recognition results. These results demonstrate the superiority of the light field based solutions when compared to the corresponding 2D baseline solutions, including: i) PCA [65] against MPCA [153], ii) LBP [76] against LFLBP, iii) HOG [119] against HOG+HDG (the proposed LFHG), and iv) VGG [38] against VGG + SeqL-LSTM.


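Table 6.8: Protocol 1 average rank-1 recognition results for the 2D baseline solutions and their light field based variants.

| 2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain |
|---|---|---|---|---|
| PCA [65] | MPCA [153] | 17.40% | 20.30% | 2.90% |
| LBP [76] | Proposed LFLBP | 11.20% | 18.65% | 7.45% |
| HOG [119] | Proposed LFHG | 36.60% | 40.90% | 4.30% |
| VGG [38] | Proposed VGG + SeqL-LSTM | 79.00% | 89.55% | 10.55% |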





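Table 6.9: Protocol 2 average rank-1 recognition results for the 2D baseline solutions and their light field based variants.

| 2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain |
|---|---|---|---|---|
| PCA [65] | MPCA [153] | 40.70% | 42.85% | 2.15% |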








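Table 6.10: Protocol 3 average rank-1 recognition results for the 2D baseline solutions and their light field based variants.

| 2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain |
|---|---|---|---|---|
| PCA [65] | MPCA [153] | 49.80% | 50.30% | 0.50% |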




The average recognition gain clearly shows the added value of light field information for face

recognition purposes. The considerable gains obtained are, to a large extent, due to the

consideration of the angular information as expressed by the proposed light field based solutions,

which provides complementary information and discriminative power to the baseline solutions, as

shown by the improved performances.
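The gain reported for each 2D baseline and its light field counterpart is simply the difference between the two average rank-1 results. The sketch below recomputes it from the Average columns of Tables 6.5 to 6.7; the dictionary layout and the function name `gains` are hypothetical, and the values for pairs other than those explicitly listed in the gain tables are derived here rather than quoted.

```python
# Average rank-1 results (in %) copied from the Average columns of
# Table 6.5 (protocol 1), Table 6.6 (protocol 2) and Table 6.7 (protocol 3).
averages = {
    1: {"PCA": 17.40, "MPCA": 20.30, "LBP": 11.20, "LFLBP": 18.65,
        "HOG": 36.60, "LFHG": 40.90, "VGG-Face": 79.00, "VGG + SeqL-LSTM": 89.55},
    2: {"PCA": 40.70, "MPCA": 42.85, "LBP": 43.85, "LFLBP": 53.75,
        "HOG": 58.25, "LFHG": 59.20, "VGG-Face": 79.65, "VGG + SeqL-LSTM": 93.35},
    3: {"PCA": 49.80, "MPCA": 50.30, "LBP": 52.20, "LFLBP": 65.80,
        "HOG": 64.60, "LFHG": 67.10, "VGG-Face": 92.90, "VGG + SeqL-LSTM": 99.00},
}

# Each 2D baseline is paired with its light field based variant.
pairs = [("PCA", "MPCA"), ("LBP", "LFLBP"), ("HOG", "LFHG"),
         ("VGG-Face", "VGG + SeqL-LSTM")]


def gains(protocol):
    """Gain (in percentage points) of each light field variant over its 2D baseline."""
    table = averages[protocol]
    return {f"{base} -> {lf}": round(table[lf] - table[base], 2) for base, lf in pairs}


for protocol in (1, 2, 3):
    print(protocol, gains(protocol))
```

For protocol 1 this reproduces the Gain column of Table 6.8, and for protocols 2 and 3 it reproduces the PCA versus MPCA gains of 2.15% and 0.50%.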

Comparison of the proposed solutions:

The obtained rank-1 recognition results demonstrate the superiority of the proposed LFLBP and

LFHG solutions when compared against their corresponding 2D variants, i.e., LBP and HOG. The

results show that, as the LFLBP hand-crafted descriptor captures only the magnitude sign for the spatial and angular information, its performance is lower (9.67% on average) than the proposed


LFHG descriptor that considers both the orientation and magnitude variations for the spatial and

angular information. This superiority is more evident for the most challenging protocol 1.

The proposed VGG-D3 fused deep descriptor achieves superior performance over the proposed

hand-crafted descriptors, i.e., LFLBP and LFHG, due to: i) adoption of a CNN for light field based

face description; and ii) fusion of the CNN descriptions for the 2D texture with disparity and depth

maps, which allows exploring the complementary information available in the light field. The

average performance gain regarding the baseline 2D VGG-Face descriptor is more than 3.76%,

showing the additional discriminative power of the proposed VGG-D3 fused deep descriptor.

However, the proposed VGG-D3 fused deep descriptor processes only light field central view data,

notably using its rendered texture, and corresponding disparity and depth maps, using a CNN

network. The results clearly show that the proposed VGG + Conv-LSTM double-deep descriptor

performs considerably better than the proposed VGG-D3 fused deep descriptor for all face

recognition tasks/protocols considered. This is due to: i) adoption of a double-deep learning

descriptor for light field face recognition; and ii) exploitation of the full spatio-angular information

available in a light field image. The proposed VGG + Conv-LSTM double-deep descriptor

achieved average performance gains of 9.18% and 5.42%, when compared to the baseline 2D

VGG-face descriptor, and the proposed VGG-D3 fused deep descriptor, respectively.
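The average gains quoted in this comparison can be checked by averaging, over the three protocols, the differences between the Average columns of Tables 6.5, 6.6 and 6.7. The sketch below does this for the pairs of solutions discussed above; the data layout and the function name `mean_gain` are hypothetical.

```python
# Average rank-1 results (in %) for protocols 1, 2 and 3, copied from the
# Average columns of Tables 6.5, 6.6 and 6.7.
average_rr1 = {
    "VGG-Face": [79.00, 79.65, 92.90],
    "LFLBP": [18.65, 53.75, 65.80],
    "LFHG": [40.90, 59.20, 67.10],
    "VGG-D3": [79.50, 85.95, 97.40],
    "VGG + Conv-LSTM": [88.55, 91.95, 98.60],
}


def mean_gain(better, baseline):
    """Mean difference (in percentage points) across the three protocols."""
    diffs = [b - a for b, a in zip(average_rr1[better], average_rr1[baseline])]
    return sum(diffs) / len(diffs)


print(f"LFHG over LFLBP:           {mean_gain('LFHG', 'LFLBP'):.2f}")
print(f"VGG-D3 over VGG-Face:      {mean_gain('VGG-D3', 'VGG-Face'):.2f}")
print(f"Conv-LSTM over VGG-Face:   {mean_gain('VGG + Conv-LSTM', 'VGG-Face'):.2f}")
print(f"Conv-LSTM over VGG-D3:     {mean_gain('VGG + Conv-LSTM', 'VGG-D3'):.2f}")
```

These averages agree, to within rounding, with the 9.67%, 3.76%, 9.18% and 5.42% figures given in the text.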

The obtained results also show the superiority of the face recognition solutions based on three

VGG + light field LSTM double-deep descriptors, notably GLF-LSTM, SLF-LSTM and SeqL-LSTM, over the VGG + Conv-LSTM double-deep descriptor. The added value is more evident for the more challenging protocols and tasks, notably protocols 1 and 2 and the pose variation and occlusion tasks, where the new joint learning of the light field horizontal and vertical parallaxes,

leading to richer descriptions, contributes to improve the final recognition performance.

Additionally, the average rank-1 recognition performance obtained for the three evaluation

protocols shows that the proposed solutions based on VGG + light field LSTM double-deep

descriptors are less sensitive to the number of training samples and the presence of facial

variations, when compared to the other solutions. The much improved face recognition results

under illumination variations illustrate the robustness of the proposed solutions based on light field

LSTM double-deep descriptors to illumination changes, highlighting the importance of exploiting

the angular information, which is invariant to the intensity changes resulting from different

illumination levels during the data acquisition process.

Finally, the performance results show that the proposed recognition solution based on SeqL-LSTM

works slightly better than the other solutions based on the other proposed LSTM cell architectures,

i.e., GLF-LSTM and SLF-LSTM, due to establishing a learning interaction between vertical and horizontal weights when updating the cell state, thus providing a better angular description.

6.7 Ear Recognition Accuracy

Ear recognition performance assessment is performed using the IST-EURECOM LLFEDB

database and considering the cross-session scenario discussed in Section 6.2.2.2, to compare the

proposed hand-crafted based ear recognition solutions with the state-of-the-art benchmarking

solutions. Table 6.11 reports the obtained recognition results in terms of CRR1 up to CRR3 (in


percentage). Additionally, to have a more precise performance analysis, Figure 6.6 includes the

cumulative recognition rank curves (up to CRR50) for the proposed and the four best performing

benchmarking solutions reported in Table 6.11.

Table 6.11: Ear recognition CRR1 up to CRR3 for the proposed recognition and benchmarking

solutions (best results in bold).

| Solution | CRR1 | CRR2 | CRR3 |
|---|---|---|---|
| LPQ | 80.4% | 86.0% | 87.9% |
| BSIF | 78.0% | 85.6% | 87.7% |
| LBP | 76.7% | 83.4% | 85.6% |
| ULBP | 75.6% | 82.8% | 86.2% |
| POEM | 75.6% | 79.9% | 83.0% |
| RILPQ | 71.6% | 80.2% | 83.8% |
| Gabor | 66.6% | 72.2% | 76.3% |
| HOG | 82.3% | 88.4% | 90.7% |
| Proposed LFLBP | 81.9% | 87.1% | 90.1% |
| Proposed LFHG | 88.2% | 90.9% | 92.9% |

Figure 6.6: Ear recognition cumulative recognition rank curves (up to CRR50) for the proposed

recognition and best performing benchmarking solutions.

Comparison of benchmarking ear recognition solutions:

Among the 2D benchmarking ear recognition solutions, the HOG descriptor shows the best

performance due to: i) consideration of both orientation and magnitude variations, providing a

more comprehensive description of the ear; and ii) use of overlapping blocks. Exploiting

overlapping blocks is beneficial as the ears may be cropped from slightly different positions,

leading to slight misalignments between the ear images registered in the database and the newly

acquired ear images. Consideration of overlapping blocks helps to reduce the misalignment

impact.


Comparison with benchmarking ear recognition solutions:

The proposed ear recognition solution based on LFHG descriptor achieves the best recognition

performance, thanks to: i) the exploitation of the spatial and angular information available in light

field images, providing complementary information for ear recognition; ii) consideration of both

orientation and magnitude variations for the spatial and angular information; and iii) exploitation of overlapping blocks to compensate for misalignment impacts. The angular/disparity information

represents the changes in light distribution bringing additional information for ear recognition; by

expressing changes in the ear surface, it captures more information about the ear structure and

geometrical information about the position, depth and shape of the ear components. In summary,

the good results obtained are due to the joint exploitation of the spatial and angular information,

as expressed by the proposed fused descriptor.

Comparison of the proposed solutions:

Concerning the proposed LFLBP descriptor, as it captures only the magnitude sign for the spatial

and angular information, its performance is lower than the proposed LFHG descriptor.

Nevertheless, the results show that LFLBP performance is superior to its 2D variants, LBP and

ULBP, due to the exploitation of light field angular information on top of spatial information.

It should be noted again that, at this point when the amount of training data available is insufficient,

deep learning based ear recognition solutions may not offer superior performance over local

description based solutions, thus justifying the absence of deep learning based solutions in the

benchmarking study.


Part III. Light Field Based Face and Ear Presentation Attack Detection

Chapter 7

State-of-the-Art on Face Presentation Attack Detection

7.1 Introduction

The widespread use of biometric recognition applications raises new security concerns [15],

making the robustness against presentation attacks a very active field of research [11]. Presentation

Attack Detection (PAD) solutions aim to automatically detect the presentation of artefacts to

acquisition sensors. The presentation of attack samples can be done using a Presentation Attack

Instrument (PAI) which is defined as an artificial object or representation presenting a copy of

biometric characteristics or synthetic biometric patterns, for instance printed photos, electronic

devices displaying a face or ear, or silicone masks [11]. PAIs can be classified in terms of their

attack potential, an attribute expressing the effort expended in the preparation and execution of the

attack in terms of elapsed time, expertise, knowledge about the capture device being attacked, and

equipment. The attack potential can be graded as “minimal”, “basic”, “enhanced-basic”, “moderate” or “high” [11]. Among the different PAIs, mask attacks, and especially attacks using thin silicone masks, have higher attack potential than other types, but these masks are not easy to obtain.

Several PAD solutions have been presented for face [12] [13] [14], fingerprint [207], and iris

recognition [208]. To better understand the technological landscape in this area, this chapter

proposes a new, more encompassing, taxonomy of face PAD solutions, which is then used to guide

a review of existing PAD solutions for face biometrics. Additionally, this chapter reviews existing

face artefact databases, which are instrumental for designing, testing and validating face PAD

solutions.

As for other biometric modalities, there are challenges for ear PAD which had not yet been

addressed at the time of writing of this Thesis. Hence, there are no ear artefact databases or ear PAD solutions to be reviewed in this chapter, which is why its title mentions only ‘face’ and not also ‘ear’. Nevertheless, some face PAD solutions, notably those

that do not rely on specific facial characteristics, can potentially also be applied to detect ear

presentation attacks.

7.2 Proposed Face Presentation Attack Detection Taxonomy

This Thesis proposes a taxonomy to organize the face PAD solutions according to four main

dimensions, notably user interaction support, imaging sensor, contextual information and feature

extraction - see Figure 7.1. The different types of presentation attacks are not considered as a

dimension in the taxonomy, as the available PAD solutions are typically not developed to address

a specific type of attack, but rather try to efficiently detect all of them, since it would be unwise to

assume a single specific type of attack.

Figure 7.1: Proposed taxonomy for face PAD solutions.

The four selected taxonomy dimensions are:

User interaction support – This dimension is related to the level of interaction supported with

the user in the relevant application scenario. When face recognition is used to grant access to

sensitive information or facilities, the user may be willing to undergo a more thorough identity

check. In such scenarios, face PAD solutions often employ the so-called challenge response

strategy for liveness detection, e.g. by analyzing the user’s response to external commands or

stimuli. Such responses can be voluntary, e.g., when the user is asked to look left or close the

eyes, or involuntary, e.g., as a reaction to unexpected luminous or acoustic stimuli.

Imaging sensor – This dimension is related to the selected sensor and, thus, the type of

information that can be exploited for PAD. Typically, 2D RGB cameras are used,

but the recent availability of richer imaging sensors is opening new possibilities for designing

improved face PAD solutions. Light field cameras [32] [33] [34], near infra-red (NIR) cameras [22] [23], thermal cameras [209] [210] [211], stereo cameras [212] [213] and depth sensors [20] [21] have recently been used for detecting presentation attacks.

Contextual information – This dimension is related to the possibility of using contextual

information, for instance including background and scenic cues, to detect the presentation

attacks [214] [215] [216]. This is possible in application scenarios where the image acquisition

is not performed with a very limited field of view and the PAD solutions do not have to

concentrate only on the (cropped) face region.

Feature extraction – This dimension is related to the feature extraction methods adopted for

PAD. A first key distinction is between dynamic methods, which use video, and static methods,

based on image analysis. Feature extraction can then use cues derived from texture, quality,

depth/focus or learning methods. Dynamic methods can additionally explore motion. Texture

based methods can exploit the textural patterns to detect presentation attacks. Quality based

methods typically explore changes in the attack images’ quality characteristics to distinguish

bona fide faces from those captured from PAIs. Learning based methods derive features by

modelling and learning relationships from images in view of distinguishing bona fide from

artefact samples. Depth/focus based methods explore changes either in depth or focus

information between the images captured/rendered at different focus planes. Finally, motion

based methods can explore the voluntary or involuntary movements of the head, mouth, or

eyes to detect presentation attacks. It is also not uncommon to find face PAD methods

combining two or more feature extraction methods [12] [13] [14].

In addition to the dimensions considered in the proposed PAD taxonomy, PAD solutions are

expected to work in combination, and eventually in synergy, with some existing face recognition

system [11] [12]. If the face PAD and face recognition systems are running in parallel,

independently of each other, then any samples flagged as suspicious by the face PAD system

should be further investigated; this can even be done manually by a human operator. The automatic

integration of both systems will typically follow a sequential approach, with face PAD only

passing bona fide images as input to the recognition system. Alternatively, when the two systems

run in parallel, some fusion strategy may be used to integrate the obtained results. If an integrated

system for face PAD and face recognition is being developed, then it is expected that some

modules will be shared by both sub-systems. For instance, it may be possible to use contextual

cues or share feature extraction methods, notably if the implementation targets a platform with

limited computational resources.
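
The sequential and parallel integration strategies described above can be sketched as follows. This is only a minimal illustration, assuming that both sub-systems output scores in [0, 1], where higher values mean "more likely bona fide" for PAD and "more likely a genuine match" for recognition. The function names, the thresholds and the weighted-sum fusion rule are hypothetical choices made for this sketch and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    accepted: bool
    reason: str


def sequential_decision(pad_score, match_score, pad_threshold=0.5, match_threshold=0.5):
    """Sequential integration: only samples passed by PAD reach the recognition step."""
    if pad_score < pad_threshold:
        return Decision(False, "rejected by PAD")
    if match_score < match_threshold:
        return Decision(False, "rejected by recognition")
    return Decision(True, "accepted")


def parallel_decision(pad_score, match_score, pad_weight=0.5, fused_threshold=0.5):
    """Parallel integration: both systems run independently and their scores are fused.

    The weighted sum used here is an assumption; other fusion rules are possible.
    """
    fused = pad_weight * pad_score + (1.0 - pad_weight) * match_score
    if fused >= fused_threshold:
        return Decision(True, "accepted")
    return Decision(False, "rejected by fused score")
```

With these hypothetical settings, a sample with a low PAD score and a high matching score is rejected by the sequential scheme but may still be accepted by the fused scheme, which illustrates why the fusion rule and its weights need careful tuning.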

7.3 Face Artefact Databases

The artefact databases play a very important role for designing, testing and validating face PAD

solutions, while ensuring the reproducibility of performance results and their fair comparison.

Since this Thesis is focusing on the added value of light field images for PAD, the remainder of this

section will review face artefact databases organized around the exploitation or not of light field

cameras and data.

7.3.1 Non-Light Field Face Artefact Databases

The main characteristics of the publicly available non-light field face artefact databases described

in the literature are summarized in Table 7.1, including the types of PAIs addressed, as shown in

Figure 7.2, with the databases sorted according to their release date.

Table 7.1: Overview of publicly available non-light field face artefact databases.

Database Name | Release Year | No. of Subjects | No. of Images/Videos | Type of Content
NUAA [217] | 2010 | 15 | 58 | 2D
Print-Attack [218] | 2011 | 50 | 1200 | 2D
REPLAY-ATTACK [219] | 2012 | 50 | 1200 | 2D
CASIA [220] | 2012 | 50 | 650 | 2D
3DMAD [20] | 2013 | 17 | 255 | 2D+depth
Multi-Spectral DB [22] | 2014 | 100 | 200 | 2D (NIR)
MSU MFSD [221] | 2015 | 55 | 440 | 2D
MS-Face [23] | 2016 | 21 | 450 | NIR
Oulu-NPU [222] | 2017 | 55 | 4940 | 2D
SMAD [223] | 2017 | N/A | 130 | 2D
MLFP [224] | 2017 | 10 | 1350 | VIS, NIR, Thermal

Type of attack columns: Paper, Wrapped Paper, Mobile, Tablet, Laptop, 3D Mask (per-database marks not preserved).

Figure 7.2: Illustration of different types of PAIs.

The genesis of some of the artefact databases was partly motivated by the availability of new

imaging sensors. For instance, the 3D Mask Attack Database (3DMAD) [20] was recorded using

the Microsoft Kinect sensor, and the Multi-Spectral [22] and MS-Face [23] databases were

recorded with Near Infra-Red (NIR) cameras, to support the study of presentation attacks on face

recognition systems using those sensors and associated content.

Among the available face artefact databases there are a few that consider the importance of 3D

masks (hard/latex/silicone) for face PAD. 3DMAD [20] contains hard mask samples from 17

subjects, with masks provided by thatsmyface.com [225]; however, these masks are not of

very high quality. Two other face artefact databases have recently been proposed considering the

usage of silicone and latex mask attack samples for face PAD. The Silicone Mask Attack Database

(SMAD) [223] contains samples of a person wearing a silicone mask, which can be used to perform attacks in

highly sensitive security scenarios, such as semi-supervised border control scenarios. The

Multispectral Latex Mask based Video Face Presentation Attack (MLFP) database [224] contains

latex mask attack samples that are captured in different scenarios; the acquisition has been done in

three different spectrum bands: visible, NIR and thermal.

7.3.2 Light Field Face Artefact Databases

Table 7.2 provides details of the only publicly available light field face artefact database,

describing its main characteristics, including the attack types addressed. For comparison, also the

characteristics of the IST Lenslet Light Field Face Spoofing Database (IST LLFFSD) [43]

proposed in this Thesis, see Section 8.2, are included in Table 7.2.

Table 7.2: Overview of publicly available light field artefact databases.

Database Name | Year | No. of Subjects | No. of Images | Type of Content | Type of attack
GUC-LiFFAD [33] | 2015 | 80 | 4826 | 2D rendered from LF | Paper, Tablet
Prop. IST LLFFSD | 2018 | 50 | 700 | Raw LF image; 4D MV SA array; 2D rendered | Paper, Wrapped Paper, Mobile, Tablet, Laptop

The GUC Light Field Face Artefact Database (GUC-LiFFAD) [33] was the first available face

artefact database acknowledging the importance of light field imaging sensors for face PAD. It

includes a set of 2D greyscale images, using printed paper and tablet PAIs, as illustrated in Figure

7.3, focused at different depths, rendered from the light field data acquired using a first generation

Lytro lenslet camera; however, the database does not include the raw light field images, which is

a limitation. This database can be useful for testing and validating face PAD solutions, notably

exploiting the a posteriori refocusing supported by light field imaging.

Figure 7.3: Illustration of GUC-LiFFAD face artefact acquisition [33].

7.4 Non-Light Field Based Face PAD Solutions

Existing non-light field based face PAD solutions are here briefly reviewed according to the

proposed taxonomy. Table 7.3 summarizes the main characteristics of a selection of recent,

representative and relevant PAD solutions, sorted according to their publication date. Additionally,

this table includes information about the used color space, classifier, fusion level, test databases,

and types of attack considered in these solutions. For face PAD solutions combining multiple

features or multiple feature extraction methods, fusion can be done at: i) feature level, usually

concatenating features into a single vector for classification; and ii) score level, combining the

classifier outputs [80]. The solutions summarized in Table 7.3 are briefly reviewed in the

following, grouped based on the feature extraction types considered in the taxonomy.
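
The difference between the two fusion levels mentioned above can be illustrated with a short sketch. It assumes that each descriptor yields a plain list of numbers and that each classifier yields one score; the function names and the weighted-mean rule are hypothetical, chosen only to make the distinction concrete.

```python
def feature_level_fusion(feature_vectors):
    """Concatenate per-descriptor feature vectors into a single vector for one classifier."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def score_level_fusion(scores, weights=None):
    """Combine per-classifier scores with a weighted mean (one possible combination rule)."""
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

In feature level fusion a single classifier sees all descriptors at once, whereas in score level fusion each descriptor has its own classifier and only the outputs are merged.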

7.4.1 Texture Based Methods

Static texture based methods exploit the textural patterns in images, usually using hand-crafted

descriptors, to detect presentation attacks [226]. Määttä et al. used multi-scale Local Binary

Patterns (LBP) to form a feature vector by concatenating local histograms extracted from

overlapping micro-textures, classified using a Support Vector Machine (SVM) classifier [227].

The same authors added Gabor wavelet (GW) features and Histogram of Oriented Gradients (HOG) hand-crafted descriptors to the

multi-scale LBP descriptor, using score level fusion to combine individual SVM outputs [228].

Kose and Dugelay used rotation invariant LBP descriptor together with a pre-processing step of

Difference of Gaussians (DoG) filtering, followed by a classifier using a chi-square dissimilarity

metric [229]. Waris et al. fused the features extracted by Rotation invariant uniform LBP

descriptor, GW, and Grey-Level Co-occurrence Matrices (GLCM) and used SVM and Partial

Least Square (PLS) regression for classification [230]. Raghavendra and Busch proposed a PAD

scheme exploring both global face structure and face component regions using LBP and Binarized

Statistical Image Features (BSIF) descriptors; score level fusion of two SVM classifiers computed

over two feature vectors is used [231]. Erdogmus and Marcel evaluated the performance of

different LBP based descriptions including conventional LBP, modified LBP, transitional LBP

and direction-coded LBP, using Linear Discriminant Analysis (LDA), Chi-square, and SVM

classifiers, showing the superiority of LBP with LDA classifier to detect 3D mask attacks [21]. Yi

et al. proposed a multi-spectral face PAD system utilizing GW descriptions extracted on 76 facial

landmarks together with a linear SVM classifier, operating in the visible and NIR spectra [22].

Hadid et al. analysed facial image textures using LBP and GLCM descriptors, using logistic

regression classifiers and score-level fusion [232]. Arashloo et al. proposed a solution based on

the fusion of multiscale BSIF and multiscale Local Phase Quantization (LPQ) descriptors, using a

Specific Kernel Discriminant Analysis (S-KDA) [233]. In one of the more recent and promising

works, Boulkenafet et al. exploited the joint color texture information extracted by LPQ and the

co-occurrence of adjacent local binary patterns descriptors, concluding that using color rather than

greyscale is advantageous for PAD [234]. The same authors proposed a solution to describe the

facial appearance by speeded-up robust descriptions extracted over the HSV and YCbCr color

spaces using a softmax classifier [235]. Peng et al. proposed a solution based on guided scale LBP

and Local Guided Binary Pattern descriptors, concatenating features and using a linear SVM

classifier [236].
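
To make the texture descriptors reviewed above more concrete, the following sketch computes the basic 3×3 LBP code and its normalised histogram for a greyscale image stored as a list of rows. It is a minimal illustration only: it does not implement the multi-scale, uniform or rotation invariant variants used in the cited works, and the function names are hypothetical. In a PAD solution the resulting histogram would typically be the feature vector passed to a classifier such as an SVM.

```python
def lbp_code(image, row, col):
    """Basic 8-neighbour LBP code of one interior pixel.

    Each neighbour contributes one bit, set when the neighbour is >= the centre value.
    """
    centre = image[row][col]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            hist[lbp_code(image, row, col)] += 1
    count = sum(hist)
    return [h / count for h in hist] if count else hist
```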

Dynamic texture based methods exploit the textural patterns in videos to detect presentation attacks

[237] [238]. Pereira et al. proposed a spatio-temporal texture based solution using a LBP from

three orthogonal planes descriptor to consider both spatial and temporal information, followed by

a SVM classifier [239]. Bharadwaj et al. proposed two dynamic texture based solutions using

Dynamic Multi-scale LBP descriptor together with SVM classifiers, and histogram of oriented

optical flows with a LDA classifier, respectively [240]. Pinto et al. proposed a solution based on

low level time-spectral descriptors to exploit spectral and temporal information, testing several

classifiers, with SVM achieving the best performance [241]. Phan et al. considered a local

derivative pattern from three orthogonal planes descriptor to exploit temporal and spatial

information in different directions of face movements with a SVM classifier [242].

7.4.2 Quality Based Methods

Quality based methods typically explore changes in the attack images’ quality characteristics to

distinguish bona fide faces from those captured from 2D reproductions; examples include the loss

of sharpness and detail, blurriness, and differences in light distribution. This means the quality of

the samples captured by the PAD system can be exploited to detect attacks. Zhang et al. proposed

a solution able to learn multispectral reflectance distributions and analyse the faces based on a

Lambertian model to select the two more discriminative wavelengths for attack detection; finally,

a SVM classifier is trained to learn the multispectral distribution [243]. Kose and Dugelay analysed

the reflectance characteristics of masks and real faces [244]. The input image is decomposed into

illumination and reflectance components and the gradient of reflectance components is considered

to define a feature vector. Finally, a linear SVM classifier is applied to detect mask attacks.

Galbally et al. proposed to use 14 full-reference quality assessment metrics classified into three

different classes, notably: i) pixel difference; ii) correlation based; and iii) edge based measures.

The metrics are then combined to form a feature vector to be fed into SVM classifiers [245]. In

[246], the same authors added some full-reference and no-reference image quality measures, such as

spectral distance measures, gradient based measures, structural similarity measures and

information theoretic measures, and the detection method was extended to presentation attack

detection in iris, fingerprint and face recognition. Wen et al. proposed an image distortion analysis

based solution, exploiting specular reflection, blurriness, chromatic moments, and color diversity,

together with multiple SVM classifiers, and trained for different face presentation attacks [221].

Agarwal et al. proposed to use 13 Haralick features [247], computed over non-overlapping patches

for each color channel, to be fed into a SVM classifier for detecting face presentation attacks [248].

Finally, Bhogal et al. used no-reference image quality assessment measures for face PAD [249].

The feature vector includes a natural image quality metric, blind image integrity notator using

DCT statistics, blind image quality assessment through anisotropy, blind/reference-less image

spatial quality metric, distortion identification based image verity and integrity metric, and blind

image quality index metric, using SVM classifiers.

Table 7.3: Overview of non-light field face PAD solutions.

Ref. | Release Year | User Inter. Support | Imaging Sensor | Contextual Inf. | Feature Extraction Type | Feature Extraction Sub-Type | Color Space | Class. | Fusion level | DB | Type of attack
[227] | 2011 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Feature | NUAA | Paper
[243] | 2011 | No inter. | RGB | Full | Static | Quality based | Gray | SVM | --- | Private | Paper, laptop
[250] | 2012 | No inter. | RGB | Cropped | Dynamic | Motion based | Gray | Logistic | Score | Print-Att. | Paper
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | Gray | SVM | Score | NUAA, Print-Att. | Paper, monitor
[229] | 2012 | No inter. | RGB | Full | Static | Texture based | Gray | Chi-square | --- | NUAA | Paper
[239] | 2012 | No inter. | RGB | Cropped | Dynamic | Texture based | Gray | SVM | Feature | REPLAY | Paper, tablet
[230] | 2013 | No inter. | RGB | Full | Static | Texture based | Gray | SVM, PLS | Feature | REPLAY | Paper, tablet
[251] | 2013 | No inter. | RGB | Cropped | Dynamic | Motion based | Gray | MLP | Score | REPLAY | Paper, tablet
[252] | 2013 | No inter. | RGB | Full | Static | Depth/focus based | RGB | Plane equation | --- | Private | Paper
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | RGB | SVM, LDA | Feature | Print-Att., REPLAY | Paper, tablet
[244] | 2013 | No inter. | RGB+Dep. | Cropped | Static | Quality based | Gray | SVM | --- | Private | Mask
[253] | 2013 | Voluntary | RGB | Cropped | Dynamic | Motion based | Gray | kNN | Feature | Private | Paper, tablet
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | Gray | LDA, QDA | Feature | REPLAY | Paper, tablet, mobile
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | Gray | LDA | Feature | CASIA, REPLAY | Paper, tablet, mobile
[254] | 2014 | No inter. | RGB | Full | Static | Depth/focus based | Gray | Naïve Bayes | --- | Private | N/A
[255] | 2014 | Voluntary | [...] | [...] | [...] | Motion based | Gray | SVM | Feature | Print-Att. | Paper
[231] | 2014 | No inter. | RGB, RGB+Dep. | Cropped | Static | Texture based | Gray | SVM | Score | CASIA, 3DMAD | Paper, mask
[21] | 2014 | No inter. | RGB+Dep. | Cropped | Static | Texture based | RGB | LDA | Score | 3DMAD | Mask
[22] | 2014 | No inter. | RGB, NIR | Full | Static | Texture based | RGB, Gray | SVM | Feature | Multi-Spectral DB | Paper
[256] | 2014 | Voluntary | [...] | [...] | [...] | Motion based | Gray | kNN | --- | Private | N/A
[257] | 2014 | No inter. | RGB | Full | Static | Learning based | RGB | SVM | --- | CASIA, REPLAY | Paper, tablet, mobile
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | RGB | Regression | Score | REPLAY | Paper, tablet
[233] | 2015 | No inter. | RGB | Full | Static | Texture based | Gray | S-KDA | Score | REPLAY, CASIA, NUAA | Paper, wrapped paper, tablet, [...]
[258] | 2015 | No inter. | RGB | Cropped | Static | Depth/focus based | Gray | SVM | Feature | Private | Paper
[221] | 2015 | No inter. | RGB | Cropped | Static | Quality based | RGB | SVM | Feature | NUAA, REPLAY, CASIA | Paper, tablet, mobile
[259] | 2015 | No inter. | RGB | Cropped | Dynamic | Learning based | Gray | Softmax layer | --- | CASIA | Paper, wrapped paper, tablet
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | Gray | SVM | --- | REPLAY, 3DMAD | Paper, tablet, mobile, mask
[241] | 2015 | No inter. | RGB, RGB+Dep. | Full | Dynamic | Texture based | [...] | [...] | [...] | CASIA, REPLAY, 3DMAD | Paper, tablet, mobile, mask
[248] | 2016 | No inter. | RGB, RGB+Dep. | Cropped | Static | Quality based | [...] | [...] | [...] | 3DMAD, CASIA, MSU | Paper, wrapped paper, tablet, mobile, mask
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | HSV, YCbCr | SVM | Feature, score | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | HSV, YCbCr | Softmax Regression | Feature | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[23] | 2016 | No inter. | RGB, NIR | Full | Static | Texture based | RGB, Gray | SVM | Feature | MS-Face | Paper
[...] | [...] | [...] | [...] | [...] | [...] | [...] | [...] | [...] | [...] | MSU, REPLAY, CASIA | Paper, tablet, mobile
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | RGB | SVM | --- | CASIA, REPLAY | Paper, tablet, mobile
[262] | 2016 | No inter. | RGB | Full | Dynamic | Learning based | RGB | Softmax | --- | CASIA, 3DMAD | Paper, tablet, mask
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | HSV, YCbCr | SVM | Feature | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[263] | 2017 | No inter. | RGB | Cropped | Static | Learning based | Gray | Softmax | --- | NUAA, REPLAY | Paper, tablet, mobile
[264] | 2017 | Involuntary | RGB | Full | Dynamic | Motion based | YCbCr | Voting | --- | Private | N/A
[...] | [...] | [...] | [...] | [...] | [...] | [...] based | RGB | SVM | Feature | REPLAY | Paper, tablet, mobile

7.4.3 Learning Based Methods

Learning based methods derive features by modelling and learning relationships from images in

view of distinguishing bona fide samples from attack attempts. In particular, the usage of CNNs

has been growing very fast for face PAD since 2014 [257]. CNNs can support both feature

extraction and classification, or they can be used only for feature extraction and combined with

different classifiers, such as SVM. Examples of CNN supporting only feature extraction include:

i) the canonical CNN structure proposed by Yang et al., which includes five convolutional and

three fully connected layers, followed by a SVM classifier [257]; ii) a deep learning solution

including a conventional CNN with back-propagation approach, proposed by Menotti et al. to

learn the weights of the network and using SVM classifiers [260]; and iii) a CNN proposed by Li

et al. to learn features based on the pre-trained VGG-Face CNN [38], followed by principal

component analysis for dimensionality reduction and a linear SVM for classification [261].

CNN examples supporting both feature extraction and classification include: i) the solution by Xu et

al., which extends a CNN with two convolutional layers, one fully connected layer and a softmax layer by adding a

long short-term memory (LSTM) layer after the fully connected layer of the CNN architecture, for

learning and extracting the temporal structure across the video [259]; ii) the solution by Feng et al.,

which combines a shearlet based image quality feature and two types of motion features using a

hierarchical neural network containing an input layer, two hidden layers and a softmax layer for

classification [262]; and iii) the solution by Alotaibi et al., which uses nonlinear diffusion to preserve depth

and edge information, together with a deep CNN with five convolutional and subsampling layers, trained

using stochastic gradient descent to extract discriminative high-level features for face PAD,

and a softmax layer for classification [263].

7.4.4 Focus/Depth Based Methods

Focus/depth based methods explore changes either in depth or focus information between the

images captured/rendered at different focus planes. Kim et al. used the power histogram and

gradient location features to compare images focused at two different planes, i.e., nose and ears,

classifying the resulting degree of blurriness with a Sum-Modified-Laplacian (SML) method

[252]. Yang et al. proposed a face PAD method by investigating the focus distance between the

face and the background [254]. In this context, the degree of blurriness of the face and the

background are considered to detect attacks. Kim et al. concatenated three different features

extracted from two images sequentially captured at different focuses to be fed into a SVM classifier

[258].
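
As an illustration of a focus measure of the kind mentioned above, the following sketch computes the Sum-Modified-Laplacian (SML) of a greyscale image stored as a list of rows. It assumes the usual definition, in which the modified Laplacian adds the absolute second differences in the horizontal and vertical directions and the SML sums those values that reach a threshold. Summing over the whole image, the default step and the default threshold are simplifying assumptions of this sketch; it is not the pipeline of any cited solution.

```python
def modified_laplacian(image, row, col, step=1):
    """Absolute horizontal plus absolute vertical second difference at one pixel."""
    centre = 2 * image[row][col]
    horizontal = abs(centre - image[row][col - step] - image[row][col + step])
    vertical = abs(centre - image[row - step][col] - image[row + step][col])
    return horizontal + vertical


def sum_modified_laplacian(image, step=1, threshold=0.0):
    """Sum of modified Laplacian values >= threshold over all interior pixels."""
    total = 0.0
    for row in range(step, len(image) - step):
        for col in range(step, len(image[0]) - step):
            value = modified_laplacian(image, row, col, step)
            if value >= threshold:
                total += value
    return total
```

A sharply focused region yields a larger SML value than a blurred one, so comparing the values obtained for images focused at two different planes gives a simple cue about the depth structure of the captured scene.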

7.4.5 Motion Based Methods

Motion patterns of the face or facial landmarks can reveal valuable information for face PAD,

allowing for instance to analyze responses when the user interaction dimension is being explored,

or to better analyze contextual information when exploring that taxonomy dimension. Dynamic

motion based methods can explore the voluntary or involuntary movements of the head [265]

[266] [267], mouth [268], or eyes [251] [255] [269] [270] to detect presentation attacks. Examples

exploring contextual information include: i) Yan et al. performed a foreground–background

consistency analysis in both spatial and temporal domains using a logistic classifier to detect

presentation attacks [250]; ii) Komulainen et al. measured the temporal correlations between the

background and user’s head, using a multilayer perceptron (MLP) classifier [251]; and iii) Anjos

et al. performed foreground/background motion correlation using optical flow, with a binary SVM

classifier [251].

Some examples where motion is explored to evaluate the challenge responses include: i) Ali et al.

proposed a gaze tracking solution in response to an external challenge, testing various fusion

schemes and using a k-nearest neighbour (kNN) classifier [253]; ii) Cai et al. exploited gaze

estimation in a challenge-response framework, using a kNN classifier [256]; and iii) Killioğlu et

al. tracked the eyes with a Kanade-Lucas-Tomasi (KLT) tracker, in response to an external

challenge, and used a voting scheme to distinguish bona fide users from attack attempts [264].

7.5 Light Field Based Face PAD Solutions

Excluding the solutions proposed in this Thesis, three face PAD solutions exploiting the richer

light field information have been proposed in the literature. Following the taxonomy illustrated in

Figure 7.1, the main characteristics of those three light field based PAD solutions [32] [33] [34]

are summarized in Table 7.4. Additionally, this table includes some information about the

distinctive characteristics of the light field sensors (see Section 2.5) and imaging representations

considered, such as format, color space, classifier, fusion level, test databases, and type of attacks.

For comparison, the characteristics of the two light field based face PAD solutions [42]

[43] being proposed in this Thesis are also included in Table 7.4. The solutions summarized in Table

7.4 are briefly reviewed in the following, grouped based on the feature extraction types considered

in the taxonomy.

Table 7.4: Overview of light field face PAD solutions.

Ref. | Release Year | User Inter. Support | LF Imaging Sensor | Conte. Inf. | Feat. Ext. Type | Feature Extrac. Sub-Type | LF Analysis | Format | Col. Space | Class. | Fusion level | DB | Type of attack
[32] | 2014 | No inter. | Lytro 1st Gen. | Crop. | Static | Depth/focus based | Disparity exploit. | LF microlens image | Gray | SVM | --- | Private | Paper
[33] | [...] | [...] | [...] | [...] | [...] | Depth/focus based | A posteriori refocusing | 2D rendered from LF | Gray | SVM | Feature | GUC-LiFFAD | Paper; tablet
[34] | [...] | [...] | [...] | [...] | [...] | Texture based | Depth comp. | 2D rendered from LF | Gray | SVM | Feature | Private | Paper
Prop. [43] | 2018 | No inter. | Lytro ILLUM | Crop. | Static | Texture based | Disparity exploit. | LF MV SA array | YCbCr, HSV | SVM | Feature; score | LLFFSD | Wrap. paper; paper; tablet; laptop; mobile
Prop. [42] | 2018 | No inter. | Lytro ILLUM | Crop. | Static | Texture based | Disparity exploit. | LF MV SA array | HSV | SVM | Feature | LLFFSD | Wrap. paper; paper; tablet; laptop; mobile

7.5.1 Texture Based Methods

Kim et al. exploited disparity information for face PAD by considering edge and ray difference

information, which cannot be obtained from the images captured by a conventional camera [32].

The edge information expresses the microlens image properties, which have different light

distributions for different focal planes. The ray difference information expresses the different

incident rays for the multiple SA images. An LBP descriptor is employed to extract features from

edge and ray difference information and a decision rule is used to distinguish bona fide users from

attacks. Performance evaluation considered a private light field database, including printed paper,

wrapped printed paper and tablet attacks. Li et al. [34] proposed a solution relying on a light field

histogram of gradients (LFHoG) descriptor, considering both spatial and depth information. The

rendered image texture and the distribution of the scene depth are combined, providing a more

comprehensive description of the face, which is exploited by a linear SVM classifier. A private

light field database was used to demonstrate the LFHoG descriptor effectiveness.

7.5.2 Focus/Depth Based Methods

Raghavendra et al. proposed a solution relying on a posteriori refocusing, exploiting the variation

of focus between images rendered at different depths to detect presentation attacks [33].

Experiments using the GUC-LiFFAD database included paper and tablet attacks. The best results,

after comparing 26 different focus measures, are reported when using gradient based focus

measurement operators.
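
The idea of exploiting focus variation across images rendered at different depths can be sketched as follows. The example assumes that a focal stack is available as a list of greyscale images (each a list of rows) and uses a simple gradient energy as the focus measure; the function names and the choice of max minus min as the variation statistic are hypothetical and are not claimed to reproduce the operators evaluated in [33].

```python
def gradient_energy(image):
    """Sum of squared horizontal and vertical finite differences (a gradient based focus measure)."""
    energy = 0.0
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                energy += (image[r][c + 1] - image[r][c]) ** 2
            if r + 1 < rows:
                energy += (image[r + 1][c] - image[r][c]) ** 2
    return energy


def focus_variation(focal_stack):
    """Return the focus measure of each rendered image and its spread across the stack."""
    measures = [gradient_energy(image) for image in focal_stack]
    return measures, max(measures) - min(measures)
```

The per-image measures, or a statistic of how they change across the stack, could then serve as features for a classifier that separates bona fide samples from artefacts.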

7.6 Adaptation of Face Presentation Attack Detection Solutions for Ear

Biometrics

All the reviewed artefact databases and solutions in this chapter have been proposed for face

biometrics. In spite of the ear PAD challenges, currently there are no available artefact databases

or PAD solutions for ear biometrics. Nevertheless, most of the face PAD solutions can potentially

be applied to detect ear presentation attacks. The face PAD solutions that may not be used in the

context of ear PAD are those relying on specific characteristics of the face, for example those

solutions with user interaction support for analyzing the user’s face reaction to external commands

or stimuli.

This Thesis proposes one ear artefact database along with two PAD solutions that can be applied

to both face and ear biometrics.

Chapter 8


Artefact Databases

8.1 Introduction

As the emergence of novel imaging sensors motivates the research community to work with new

and richer imaging formats to detect presentation attacks, gathering extensive lenslet light field

artefact databases was a pressing need during this Thesis. As stated in Section 3.3.10, it was difficult

to fully assess how face PAD systems could benefit from light field data, as the only available light

field face artefact database, GUC-LiFFAD [33], does not include the raw light field images. In

fact, GUC-LiFFAD only includes a number of 2D images focused at different depths for each

person, rendered from light field images acquired by an old generation of lenslet light field

cameras; thus, it can be only useful for testing and validating those face recognition solutions that

exploit the a posteriori refocusing capability supported by light field imaging. Concerning ear

PAD, no ear artefact database, captured by lenslet light field cameras, was available when this

Thesis started.

To be able to test light field face and ear PAD solutions, including those proposed in this Thesis,

it is necessary to have access to artefact databases including light field images in the LFR format,

thus providing the flexibility to exploit different types of light field data for biometric PAD. To

overcome these limitations, light field based face and ear artefact databases have been created in

the context of this Thesis, allowing more powerful benchmarking for testing and validating face

and ear PAD solutions, exploiting the full light field data. The proposed IST Lenslet Light Field

Face Spoofing Database (IST LLFFSD) consists of 100 bona fide images, from 50 subjects,

captured with a Lytro ILLUM lenslet camera, and a set of 600 face presentation attack images,

using several presentation attack types, including printed paper, wrapped printed paper,

laptop, tablet and two different mobile phones, captured with the same camera. This Thesis also

proposes the first ear PAD database, the IST Lenslet Light Field Ear Artefact Database (IST

LLFEADB), captured with a Lytro ILLUM lenslet camera, including both 2D and light field

contents, using several types of presentation attack instruments, including laptop, tablet and two

different mobile phones. By including the raw light field images in the proposed databases, the

potential of these databases is significantly boosted as it allows more powerful benchmarking for

testing and validating face and ear PAD solutions exploiting the full light field data. These

databases have been made publicly available to the research community. This chapter reviews the

proposed light field face and ear artefact databases.

8.2 Light Field Based Face Artefact Database

The IST LLFFSD being proposed in this Thesis is the first face artefact database to include the

raw lenslet light field images, along with 2D rendered images and the corresponding depth maps.

It contains 100 bona fide samples and six types of presentation attacks: printed paper, wrapped

printed paper, laptop, tablet and two different mobile phones. This database has been made

publicly available to the research community.

In comparison with GUC-LiFFAD, the only other available light field face artefact database, the

proposed IST LLFFSD has the following main advantages: i) uses the higher resolution Lytro

ILLUM lenslet camera; ii) includes 2D RGB rendered face images, instead of greyscale; iii)

includes a depth map for the rendered images; and more importantly, iv) includes the raw light

field imaging information itself, in LFR format, boosting the potential of this database to allow

more powerful benchmarking for testing and validating face PAD solutions, exploiting the full

light field data.


Since face presentation attacks mostly happen in face verification systems [11] [13], where image

acquisition is done in controlled conditions, this is the scenario considered for the proposed IST

LLFFSD database. The IST LLFFSD artefact acquisition was performed indoors, using a lenslet

light field camera, the Lytro ILLUM [26], for capturing images from the attack attempts. The bona

fide IST LLFFSD samples are derived from the publicly available IST-EURECOM LLFFD [35],

captured with the same camera. It includes data from 50 volunteers, 33 males and 17 females, who

were born between 1957 and 1998, originating from 10 different countries. Each volunteer

participated in two separate acquisition sessions, with a time interval between 1 and 6 months,

resulting in a total of 100 bona fide face images. During acquisition, a uniform background was

used, and volunteers were asked to look frontally to the camera, with a neutral expression – see

example in Figure 8.2.a.

8.2.2 Presentation Attack Instruments

The artefact acquisition pipeline is illustrated in Figure 8.1. A 2D central view rendered image,

with a resolution of 2022×1404 pixels, corresponding to each bona fide light field, is used to

generate the six types of PAIs considered in the proposed database:

1. Printed paper attack: 2D images were printed on A4 paper using a Canon i-SENSYS

MF8300 color laser printer. The printed paper is placed on a flat surface and the attack image

is captured with the Lytro ILLUM lenslet camera – see Figure 8.2.b.

2. Wrapped printed paper attack: The printed paper is wrapped over an object simulating the

human face curvature, resulting in different depths for different face areas. A more challenging

attack is expected to result, especially for the methods exploiting depth to detect presentation

attacks – see Figure 8.2.c.

3. Laptop attack: A 2D bona fide rendered image is displayed using a MacBook Pro 13’’ – see

Figure 8.2.d.

4. Tablet attack: A 2D bona fide rendered image is displayed using an iPad Air 2, 9.7’’ – see

Figure 8.2.e.

5. Mobile attack 1: A 2D bona fide rendered image is displayed using an iPhone 6S – see Figure

8.2.f.

6. Mobile attack 2: A 2D bona fide rendered image is displayed using a Sony Xperia z2 – see

Figure 8.2.g.

Figure 8.1: IST LLFFSD face artefact acquisition pipeline.

Figure 8.2: IST LLFFSD example: Illustration of 2D central view rendered images for: (a) bona

fide face; (b) print paper attack; (c) wrapped print paper attack; (d) laptop attack; (e) tablet

attack; (f) mobile attack 1; and (g) mobile attack 2.


The IST LLFFSD database is the first face artefact database to include the raw lenslet light field

imaging files. It is composed of the following elements:

1. Raw light field images: Light field images in the Lytro ILLUM native file format, LFR, with

about 50 MB/image. LFR files can be used as initial input either for the Lytro camera software,

i.e., Lytro Desktop Software [186], or for any other processing library/toolbox, such as the

Matlab Light Field Toolbox V0.4 [58].

2. 2D rendered images: 2D rendered images for the light field central view, created using the

Lytro Desktop Software [186]. This software performs up-sampling and color correction, to

enhance the rendered image quality, as described in Section 2.4.

3. Depth maps: Depth map for each 2D rendered image, generated with the Lytro Desktop

Software [186].

4. Camera calibration file: Calibration data is provided, as this information is essential to

compensate for the specific properties of each camera sensor.


The IST LLFFSD database is publicly available for standardization and academic research

purposes and can be downloaded from http://www.img.lx.it.pt/LLFFSD/.

8.3 Light Field Based Ear Artefact Database

This Thesis proposes the first ear PAD database, the IST Lenslet Light Field Ear Artefact Database

(IST LLFEADB), including both 2D and light field ear artefact images. The database contains two

sets which are reviewed in the following.


IST LLFEADB consists of a baseline set and an extended set, with the difference between the two

sets being related to the settings used for bona fide image acquisition.

Baseline set: The bona fide samples in the baseline IST LLFEADB have been derived from

the LLFEDB [36], consisting of 268 ear samples from 67 subjects, with 4 image shots per

person, notably the right and left half and full profile images, captured with a Lytro ILLUM

lenslet camera. They include ear images partly occluded by ear piercings, earrings, hair and

combinations of multiple occlusions. A 2D central view image (see Figure 8.3.a), rendered by

the Lytro Desktop Software [186], corresponding to each bona fide light field, was used to

generate the artefacts. The size of the rendered ear images varies, with an average size of

213×143 pixels and aspect ratio of 1.49. All bona fide images in the baseline set were first

rescaled to 192×128 pixels, with an aspect ratio of 1.5, so that all have the same size when

displayed on the PAIs.

Extended high resolution set: The bona fide baseline samples do not have a very high

resolution, which can affect the quality of the samples displayed on PAIs, thus facilitating

distinguishing the attack from bona fide samples. To consider a more challenging condition,

additional high resolution bona fide images have been captured with the same camera, from

15 subjects, with 4 image shots per person. The size of the rendered ear images varies, with an

average size of 1162×760 pixels and aspect ratio of 1.53. All the bona fide images were

rescaled to 1152×768 pixels, with an aspect ratio of 1.5, so that all have the same size when

displayed on the PAIs.

8.3.2 Presentation Attack Instruments

The artefact acquisition pipeline is illustrated in Figure 8.4; this acquisition of images from PAIs

was performed using the Lytro ILLUM lenslet camera, thus creating one LFR file per sample. It

should be noted that an ear does not have a curved shape, therefore wrapped paper attacks are not

relevant for ear recognition systems. Additionally, due to the low quality of bona fide ear samples

available from the baseline IST LLFEADB set, printing those low resolution and low quality

artefacts would not result in challenging attacks, so paper attacks were not considered.

Four types of PAIs were considered for the LLFEADB:

1. Laptop attack: A 2D bona fide rendered image is displayed using a MacBook Pro 13’’ – see

Figure 8.3.b.

2. Tablet attack: A 2D bona fide rendered image is displayed using an iPad Air 2, 9.7’’ – see

Figure 8.3.c.

3. Mobile attack 1: A 2D bona fide rendered image is displayed using an iPhone 6S – see Figure

8.3.d.

4. Mobile attack 2: A 2D bona fide rendered image is displayed using a Sony Xperia z2 – see

Figure 8.3.e.

Figure 8.3: Illustration of IST LLFEADB images for a bona fide sample and corresponding

artefact samples for four different PAIs.

Figure 8.4: IST LLFEADB ear artefact acquisition pipeline.


The IST LLFEADB is the first ear artefact database for PAD, including both 2D and raw light

field images. It is composed of the following elements:

1. Raw light field images: Light field images in the Lytro ILLUM native file format, LFR, that

can be used as initial input either for the Lytro camera software, i.e., Lytro Desktop Software

[186], or for any other processing library/toolbox, such as the Matlab Light Field Toolbox V0.4

[58].

2. 2D rendered images: 2D rendered images for the light field central view, containing only the

ear region, and created using the Matlab Light Field Toolbox V0.4 [58] – see Figure 8.3.a.

3. Multi-view SA image array: Sub-aperture images corresponding to different viewpoints,

forming a multi-view array, and created using the Matlab Light Field Toolbox V0.4 [58] – see

Figure 8.5; these multi-view arrays contain only the ear region.

4. Camera calibration file: Calibration data is provided, as this information is essential to

compensate for the specific properties of each camera sensor.

Figure 8.5: Multi-view sub-aperture image array for an artefact ear image.


The IST LLFEADB database is publicly available for standardization and academic research

purposes and can be downloaded from http://www.img.lx.it.pt/LLFEADB/.

Chapter 9


Presentation Attack Detection Solutions

9.1 Introduction

This Thesis proposes two PAD solutions based on two light field angular hand-crafted descriptors

for the disparity information available in light field images, for both face and ear. Exploiting the

disparity information acquired in a light field image in the form of an array of SA images can be

very useful to improve PAD performance. The motivation behind exploiting disparity information

for detecting presentation attacks comes from disparity differences between bona fide and attack

images due to:

1. Differences in surface geometry, leading to considerable differences in the disparity

information. Differences in face and ear components’ depth influence the image texture,

leading to shifts in shadows’ location and shape, and changes in contrast gradients. All flat

attack types (laptop, tablet, mobile and printed paper) exhibit limited differences in the

disparity/depth at the positions of the various face and ear components. Wrapped printed papers

are more challenging attacks for face PAD as they simulate the face’s approximately

cylindrical shape. Nevertheless, the resulting disparity changes are smoother and different from

those observed in bona fide faces. It should be noted that it is not logical to use wrapped printed

paper for ear PAD, as this PAI cannot simulate the ear shape. Concerning 3D PAIs, including

face masks and silicone ears, as the surface geometry of these PAIs is very similar to a bona

fide face or ear, fewer differences relative to the bona fide disparities are expected, thus

leading to the most challenging attacks; however, the light reflection pattern is also different,

which is an advantage for light field approaches. Differences in face and ear components’ depth

can also cause changes in defocus blur between the different views obtained from a light field

image, with attack images exhibiting an almost constant amount of defocus blur across views,

while different views of a bona fide image are expected to exhibit an unequal amount of

defocus blur.

2. Reproduction materials used for the attacks, such as electronic displays and paper, lead to

changes in transmission, scattering, reflection and absorption of light, thus introducing changes

in the light distribution, as well as some additional acquisition noise types such as reflection

and sharpness loss, which can be expressed by disparity information.

The above mentioned effects are effectively explored by the solutions proposed in this Thesis to

detect presentation attacks.

9.2 PAD Based on Light Field Angular Local Binary Pattern Descriptor

This section proposes a novel PAD solution based on a light field angular hand-crafted descriptor

exploiting the color and texture variations associated to the different directions of light captured

in light field images. The proposed PAD solution is based on the Light Field Angular Local Binary

Patterns (LFALBP) descriptor presented in Section 5.2.2, here used to capture the disparity

information present in light field images in two different color spaces. The proposed LFALBP

based PAD architecture is represented in Figure 9.1.

Figure 9.1: Architecture of the proposed face and ear PAD solution based on LFALBP hand-

crafted descriptor.

This PAD solution includes the following main modules:


array of SA images (see Section 2.5). Then, each face/ear in all SA images is cropped and faces

are resized to 128×128 and ears to 192×128 pixels. There are three reasons for these cropping

sizes: i) A study on IST LLFFSD and IST LLFEADB databases showed that the average aspect

ratios of the cropped faces and ears are 1.08 and 1.51, thus justifying the aspect ratio of the

resized faces and ears; ii) A preliminary study conducted during the Thesis work has shown that

increasing the image resolution does not significantly affect the PAD performance, while

increasing computational complexity. This is due to the fact that although the IST LLFFSD

database considers larger image sizes, the face area is only a portion of that size, with 128×128

being a size close to the cropped face image. This is also the case for ear images from the

extended high resolution set of IST LLFEADB, and 192×128 is a size that is adjusted to the

aspect ratio of ears present in the database.

2. HSV/YCbCr color conversion: RGB may not always be the best color space to work with, since:

i) there is a strong correlation between the RGB components; and ii) the luminance and

chrominance information are not separately represented in RGB. This module converts the

RGB cropped SA face/ear images to the HSV and YCbCr color spaces, as they proved to be

efficient in detecting presentation attacks [234]. The combination of the descriptors computed

over HSV and YCbCr is able to express the color information in different, complementary

ways.

3. LFALBP description: The Light Field Angular Local Binary Patterns (LFALBP) descriptor

(Equation 5.7) is the angular part of the LFLBP combined descriptor proposed in Section 5.2.2,

which is a compact and efficient LBP extension, able to exploit the light field disparity

information. The LFALBP descriptions are computed over each color component from the

cropped SA images. Results in Section 6.3.1 show a clear performance improvement for light

field biometric recognition tasks as the SA images’ disparity increases. Thus, it is proposed

here that the SA images selected for computing the disparity description are at maximum

distance from the central view. More details about the LFALBP descriptor along with its

parameters, including radius, starting angle, and the number of SA images, have already been

presented in Section 5.2.2.

4. Component description concatenation: The extracted descriptions for each color space

component are concatenated, resulting in a 3-component description for each considered color

space.

5. Offline training: The LFALBP concatenated descriptions extracted from the training samples

for each color space are fed to a SVM classifier (implemented using the LIBSVM library

[190]), to define the classification model. The training data should be selected based on the test


6. SVM classification: The LFALBP descriptions extracted from the test samples for each color

space are fed to a SVM classifier (implemented using the LIBSVM library [271]), returning a

bona fide versus attack classification score for each color space. The decision to adopt the

SVM classifier was made after an extensive performance comparison of several classifiers,

including k-nearest neighbour (kNN) with L1 and L2 norms and logistic regression, as well as

different SVM kernels, including polynomial, radial basis, and sigmoid tanh functions. Linear

SVM performed slightly better than logistic regression and considerably better than kNN.

Additionally, a linear kernel led to the best results, compared to the other tested kernels, thus

justifying the choice of linear SVM as the final classifier.

7. Score level fusion: The integration of the individual SVM classifier outputs for the two color

spaces is done using score level fusion, applying the sum rule to compensate for the small errors

obtained by each individual color space. The fused score finally determines whether the input

image should be considered to contain a bona fide face/ear or to be an attack attempt.
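As a rough illustration of modules 3 and 4, the sketch below forms a sign-based angular pattern from sub-aperture (SA) views and concatenates the per-component histograms. It is not Equation 5.7: it assumes a simplified scheme in which each pixel of the central view is compared with the co-located pixel in a few surrounding views, with the sign of each difference contributing one bit. The function names and the nested-list image representation are hypothetical.

```python
from typing import List, Sequence

Image = List[List[float]]  # one color component of one SA view


def angular_sign_codes(center: Image, neighbors: Sequence[Image]) -> List[List[int]]:
    """Per-pixel binary code: bit k is set when neighbor view k >= central view."""
    rows, cols = len(center), len(center[0])
    codes = [[0] * cols for _ in range(rows)]
    for k, view in enumerate(neighbors):
        for r in range(rows):
            for c in range(cols):
                if view[r][c] >= center[r][c]:
                    codes[r][c] |= 1 << k
    return codes


def code_histogram(codes: List[List[int]], n_views: int) -> List[float]:
    """Normalized histogram over the 2**n_views possible codes."""
    hist = [0.0] * (2 ** n_views)
    total = 0
    for row in codes:
        for code in row:
            hist[code] += 1.0
            total += 1
    return [h / total for h in hist]


def concatenated_description(center_by_comp: Sequence[Image],
                             neighbors_by_comp: Sequence[Sequence[Image]]) -> List[float]:
    """Concatenate the histograms of all color components (assumed scheme)."""
    description: List[float] = []
    for center, neighbors in zip(center_by_comp, neighbors_by_comp):
        codes = angular_sign_codes(center, neighbors)
        description.extend(code_histogram(codes, len(neighbors)))
    return description
```

With three color components and K neighbor views, this toy description has 3 × 2^K elements; the actual LFALBP parameters (radius, starting angle, number of SA images) are those given in Section 5.2.2.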

In summary, the novel PAD solution is able to benefit from the variations associated with the

different directions in the captured light field, using the angular texture information represented in

two different color spaces.
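The score level fusion step can be sketched as follows. The sketch assumes that each color space classifier returns a signed decision value where positive means bona fide, and that the fused score is compared against a threshold of zero; both the sign convention and the threshold are assumptions made for illustration only.

```python
def sum_rule_fusion(score_hsv: float, score_ycbcr: float) -> float:
    """Sum rule: the fused score is the sum of the per-color-space scores."""
    return score_hsv + score_ycbcr


def is_bona_fide(score_hsv: float, score_ycbcr: float, threshold: float = 0.0) -> bool:
    """Accept the sample as bona fide when the fused score exceeds the threshold.

    The threshold value is hypothetical; in practice it sets the operating point.
    """
    return sum_rule_fusion(score_hsv, score_ycbcr) > threshold
```

For example, a confident HSV score of 0.8 outweighs a weakly negative YCbCr score of -0.3, so the fused decision is bona fide, which is the kind of error compensation the sum rule is meant to provide.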

9.3 PAD Based on Light Field Histogram of Disparity Gradient Descriptor

Another novel light field based PAD solution has been proposed based on the Light Field

Histogram of Disparity Gradients (LFHDG) hand-crafted descriptor presented in Section 5.3.2,

which is able to express the light variations associated to the multiple light capturing directions in

light field images. The LFHDG descriptor considers both the orientation and magnitude variations

for the angular information. Compared to the proposed solution above that only captures the

magnitude sign, the PAD solution proposed in this section offers a more comprehensive angular

description. As expected, it boosts the final recognition performance, as described in detail in this

section. The proposed LFHDG based PAD architecture is represented in Figure 9.2.

Figure 9.2: Architecture of the proposed face and ear PAD solution based on LFHDG

descriptor.

This PAD solution includes the following main modules:


array of SA images (Section 2.5). Then, as for the previous PAD solution, each face/ear in all

SA images is cropped and faces are resized to 128×128 and ears to 192×128 pixels.

2. HSV color conversion: Since in RGB there is a strong correlation between the color

components and the luminance is not separately represented from chrominance, RGB is not

necessarily the best color space to work with. This module converts the RGB cropped sub-

aperture face/ear images to the HSV color space, which can be beneficial to distinguish

between bona fide and attack samples as shown for instance in [234].

3. LFHDG extraction: The LFHDG descriptions are computed for each color component from

the cropped SA images, thus expressing orientation and magnitude for angular information.

Similar to the proposed PAD solution based on the LFALBP descriptor, it is proposed here

that the SA images selected for computing the disparity gradients are at maximum distance

from the central view. More details about the LFHDG descriptor have already been provided

in Section 5.3.2.

4. Components descriptor concatenation: The extracted descriptions for each color space

component are concatenated, resulting in the final 3-component description.

5. Offline training: The LFHDG concatenated descriptions extracted from the training samples

for each color space are fed to a SVM classifier (implemented using the LIBSVM library

[190]), to define the classification model. The training data should be selected based on the test


6. SVM classification: The concatenated LFHDG description extracted from the test samples is

fed to the previously trained SVM classifier to be compared to the classification model, thus

determining whether the input image should be considered to contain a bona fide face/ear or

an attack attempt. The experiments made with other classifiers and SVM kernels led to the

same conclusion of using a linear SVM classifier as for the previous PAD solution based on

the LFALBP descriptor.
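The color conversion module used by the PAD solutions can be illustrated per pixel as below. HSV conversion uses Python's standard `colorsys` module. The YCbCr conversion (relevant only for the LFALBP based solution) assumes full-range ITU-R BT.601 coefficients as used in JFIF; the text does not state which YCbCr variant was used, and other implementations may apply limited-range scaling instead.

```python
import colorsys
from typing import Tuple


def rgb_to_hsv(r: int, g: int, b: int) -> Tuple[float, float, float]:
    """8-bit RGB to (H, S, V), each component in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def rgb_to_ycbcr(r: int, g: int, b: int) -> Tuple[float, float, float]:
    """8-bit RGB to full-range YCbCr (assumed BT.601/JFIF variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

In both spaces the brightness-related component (V or Y) is separated from the color-related components, which is the property the text gives as the reason for moving away from RGB.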

In summary, the LFHDG based light field PAD solution exploits the orientation and magnitude

for the light variations associated to the multiple directions in the captured light field in the HSV

color space.
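To make the idea of describing both orientation and magnitude more concrete, the following sketch builds a magnitude-weighted orientation histogram from differences between opposite SA views. It is a simplified stand-in, not the LFHDG definition of Section 5.3.2: the use of exactly four views (left, right, up, down), the way the two differences are combined, and the number of bins are all assumptions.

```python
import math
from typing import List

Image = List[List[float]]  # one color component of one SA view


def gradient_histogram(left: Image, right: Image, up: Image, down: Image,
                       n_bins: int = 8) -> List[float]:
    """Histogram of angular-difference orientations, weighted by magnitude."""
    hist = [0.0] * n_bins
    for r in range(len(left)):
        for c in range(len(left[0])):
            gx = right[r][c] - left[r][c]   # horizontal angular difference
            gy = down[r][c] - up[r][c]      # vertical angular difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue                     # no variation across views
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

A flat artefact would be expected to yield small and fairly uniform differences across views, whereas a sign-only pattern would discard the magnitude information that this kind of histogram retains.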

Chapter 10

Light Field Face and Ear Presentation Attack

Detection Performance

10.1 Introduction

In this chapter, an extensive performance evaluation is reported for the proposed PAD solutions

and several benchmarking methods, using a common, representative performance evaluation

framework for varied and challenging presentation attacks, notably the proposed light field artefact

databases. As the proposed IST LLFEADB is the first ear PAD light field database and no previous

light field ear PAD solutions were available, this Thesis considers a set of representative and

promising face PAD solutions applied to the ear PAD problem for benchmarking purposes.

10.2 Performance Assessment Framework

This section presents the test material and metrics used for the assessment of the proposed face

and ear PAD solutions and also several solutions from the literature. The non-light field and

light field based face and ear PAD solutions considered for benchmarking are also introduced.

10.2.1 Test Material

The IST LLFFSD database (Section 8.2) is here used for the assessment of the proposed face PAD

solutions and also several benchmarking solutions from the literature. Similarly, IST LLFEADB

(Section 8.3), the first database to consider light field imaging of ear presentation attacks, has been

used for the assessment of the proposed and benchmarking ear PAD solutions.

10.2.2 Evaluation Metrics

The metrics used for evaluating the performance of PAD solutions are described in the following:

Bona Fide Presentation Classification Error Rate (BPCER), also known as False Rejection

Rate (FRR), showing the proportion of bona fide presentations incorrectly classified as attack

presentations.

Attack Presentation Classification Error Rate (APCER), also known as False Acceptance Rate

(FAR), showing the proportion of attack presentations incorrectly classified as bona fide

presentations.

ACER (Average Classification Error Rate), defined as half of the sum of the BPCER and

APCER, summarizing the overall system performance.
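The three metrics follow directly from the definitions above and can be computed as in this minimal sketch. The boolean encoding (True for bona fide, False for attack) and the function name are illustrative choices.

```python
from typing import Sequence, Tuple


def pad_error_rates(labels: Sequence[bool],
                    decisions: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (BPCER, APCER, ACER).

    labels[i] is True for a bona fide presentation, False for an attack;
    decisions[i] is True when the system classified it as bona fide.
    """
    bona = [d for l, d in zip(labels, decisions) if l]
    attack = [d for l, d in zip(labels, decisions) if not l]
    bpcer = sum(1 for d in bona if not d) / len(bona)    # bona fide rejected
    apcer = sum(1 for d in attack if d) / len(attack)    # attacks accepted
    acer = (bpcer + apcer) / 2
    return bpcer, apcer, acer
```

For instance, with 25 bona fide and 25 attack test images, one rejected bona fide image and two accepted attacks give BPCER = 4%, APCER = 8% and ACER = 6%.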

It is also recommended that the operational systems be configured at a defined security level, see

for instance the FRONTEX guidelines for automated border control systems in Europe [273]. In

this context, the classification performance of a PAD solution can be shown as a Detection Error

Tradeoff (DET) curve, plotting the BPCER versus the APCER.
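A DET curve and the BPCER at a fixed APCER can be obtained by sweeping the decision threshold over the classification scores, as sketched below. The sketch assumes that higher scores indicate bona fide presentations and that a sample is accepted when its score is at least the threshold; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def det_points(bona_scores: Sequence[float],
               attack_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(APCER, BPCER) pairs obtained by sweeping the acceptance threshold."""
    thresholds = sorted(set(bona_scores) | set(attack_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    points = []
    for t in thresholds:
        apcer = sum(s >= t for s in attack_scores) / len(attack_scores)
        bpcer = sum(s < t for s in bona_scores) / len(bona_scores)
        points.append((apcer, bpcer))
    return points


def bpcer_at_apcer(bona_scores: Sequence[float],
                   attack_scores: Sequence[float],
                   max_apcer: float = 0.01) -> float:
    """Lowest BPCER among operating points whose APCER does not exceed max_apcer."""
    return min(b for a, b in det_points(bona_scores, attack_scores) if a <= max_apcer)
```

Reading the curve at a fixed APCER in this way corresponds to configuring the system at a predefined security level and reporting the resulting inconvenience to genuine users.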

10.2.3 Benchmarking Methods

Apart from the proposed PAD solutions, this Thesis has selected a set of representative and

promising non-light field and light field based benchmarking solutions available in the literature.

For face PAD, the benchmarking solutions considered are:

1. Non-light field based 2D solutions, including two baseline description based solutions [227]

[228], two state-of-the-art description based solutions [232] [234], and one state-of-the-art

quality based solution [221].

2. Light field based solutions, including the solutions presented in [32] [33] [34] (see Section

7.5).

None of the above face PAD solutions relies on specific facial characteristics, e.g., analyzing the

blinking of the eyes or the user’s face reaction to external commands or stimuli, to detect attacks,

thus they can potentially be used also for ear recognition. For ear PAD, the three best performing

benchmarking solutions for face PAD [232] [234] and [34] are considered to detect ear

presentation attacks.

A summary of the characteristics of each considered benchmarking solution, following the

taxonomy illustrated in Figure 7.1, is available in Table 10.1.

The central view 2D rendered SA images and the full light field images are used to test the non-

light field based 2D and the light field based solutions, respectively. All tested PAD solutions were

re-implemented by the author of this Thesis and performance results were obtained considering

the best parameter settings reported in the relevant original papers.

10.3 Face PAD Performance

The experimental work started by analyzing the performance of the proposed and benchmarking

face PAD solutions. The usage of different color spaces is analyzed, the generalization of face

PAD solutions to different operation contexts and different attacks is studied and finally the

computational efficiency of the face PAD solutions is analyzed.

Table 10.1: Overview of PAD benchmarking solutions.

| Ref. | Release Year | User Inter. Support | Imaging Sensor | Contextual Info. | Feature Extraction Type | Feature Extraction Sub-Type | Color Space | Classifier | Fusion level |
|---|---|---|---|---|---|---|---|---|---|
| [227] | 2011 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Feature |
| [228] | 2012 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Score |
| [221] | 2015 | No inter. | RGB | Cropped | Static | Quality based | RGB | SVM | Feature |
| [232] | 2015 | No inter. | RGB | Cropped | Static | Texture based | RGB | Regression | Score |
| [234] | 2016 | No inter. | RGB | Cropped | Static | Texture based | HSV, YCbCr | SVM | Feature, score |
| [32] | 2014 | No inter. | Lytro 1st Gen. | Cropped | Static | Focus based | Gray | SVM | --- |
| [33] | 2015 | No inter. | Lytro 1st Gen. | Cropped | Static | Focus based | Gray | SVM | Feature |
| [34] | 2016 | No inter. | Lytro 1st Gen. | Cropped | Static | Texture based | Gray | SVM | Feature |
| Prop. LFALBP | | No inter. | Lytro ILLUM | Cropped | Static | Texture based | YCbCr; HSV | SVM | Feature; score |
| Prop. LFHDG | | No inter. | Lytro ILLUM | Cropped | Static | Texture based | HSV | SVM | Feature |

10.3.1 Face PAD Accuracy Evaluation

Since in real-world situations there is no way to know what type of attack will be performed, the

face PAD solutions were trained with a mix of all artefact types available in the IST LLFFSD

database. 4-fold cross-validation experiments were performed, meaning that, for each experiment,

the face PAD systems are trained with ¾ of the database and tested with the remaining ¼. The

cross-validation strategy leads to a more accurate assessment on the model detection power for

unseen data, compared to the simpler strategy of dividing the data into rigid training and testing

sets. Regarding the number of available IST LLFFSD samples, the experiments used the 100

bona fide and 100 attack images for each attack type, in total 600 attack images. To have a balanced

training, 75 bona fide and 75 attack images (randomly selecting 12 images from each attack type)

were considered to train the classifiers. Testing was performed with a non-overlapping set of 25

bona fide and 25 attack images. As the attack images are randomly selected for each attack type,

the experiments have been repeated 50 times and the average results of these 50 runs are reported

to provide a more stable performance estimation. Table 10.2 reports the average ACER results and

Figure 10.1 the DET curves. The red vertical dashed lines show BPCERs at a fixed 1% APCER, as

the operational systems are usually configured at a predefined security level. Naturally, BPCERs

at any APCER level can be observed from Figure 10.1.
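One randomized run of the balanced protocol described above can be sketched as follows. Several details are assumptions: samples are identified by plain indices, subject identity is ignored, and, since 75 attack training images over six attack types cannot be exactly 12 per type, the sketch spreads them as evenly as possible (12 or 13 per type). The names `PAI_TYPES`, `even_counts` and `draw_run` are hypothetical.

```python
import random
from typing import List, Tuple

PAI_TYPES = ["laptop", "tablet", "mobile1", "mobile2", "paper", "wrapped_paper"]


def even_counts(total: int, n_groups: int, rng: random.Random) -> List[int]:
    """Split `total` over `n_groups` as evenly as possible (assumed policy)."""
    base, extra = divmod(total, n_groups)
    counts = [base] * n_groups
    for i in rng.sample(range(n_groups), extra):
        counts[i] += 1
    return counts


def draw_run(rng: random.Random, n_per_class: int = 100,
             n_train: int = 75, n_test: int = 25):
    """Draw non-overlapping balanced training and test sets for one run."""
    bona = list(range(n_per_class))
    rng.shuffle(bona)
    bona_train, bona_test = bona[:n_train], bona[n_train:n_train + n_test]

    attack_train: List[Tuple[str, int]] = []
    remaining: List[Tuple[str, int]] = []
    for pai, count in zip(PAI_TYPES, even_counts(n_train, len(PAI_TYPES), rng)):
        ids = list(range(n_per_class))
        rng.shuffle(ids)
        attack_train += [(pai, i) for i in ids[:count]]
        remaining += [(pai, i) for i in ids[count:]]
    attack_test = rng.sample(remaining, n_test)  # never overlaps the training set
    return bona_train, bona_test, attack_train, attack_test
```

Repeating `draw_run` with 50 different random states and averaging the resulting error rates mirrors the repetition strategy used to stabilize the reported estimates.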

The obtained results show that the proposed LFALBP based and LFHDG based face PAD

solutions always achieve the best performance when compared to the benchmarking non-light field

and light field face PAD solutions. It is worth noticing that the very good accuracy achieved by

the proposed PAD solutions is associated to the considered scenario, with images acquired under

controlled conditions, which corresponds to the most relevant presentation attack scenario when

considering a fixed camera setup. The experiments performed in [33] and [34] were conducted on

images captured in less controlled environments, at various distances. The exploitation of the

focus/depth variation for those images brings more information when compared to images

acquired under controlled conditions, which are almost all-in-focus. This may justify the reduced

performance of those solutions when testing with the IST LLFFSD database.

Due to considering both the orientation and magnitude for the angular information, the proposed

LFHDG based solution performs slightly better than the proposed LFALBP based solution, which

only captures the magnitude sign for the angular information.


Table 10.2: ACER face PAD performance for the proposed and benchmarking solutions using IST LLFFSD (minimum errors in bold).

| Ref. | Year | Type | Laptop | Tablet | Mobile 1 | Mobile 2 | Paper | Wrapped paper |
|---|---|---|---|---|---|---|---|---|
| [227] | 2011 | 2D | 42.62% | 23.03% | 39.33% | 46.31% | 33.87% | 20.32% |
| [228] | 2012 | 2D | 37.67% | 14.00% | 41.25% | 45.95% | 35.5% | 15.95% |
| [221] | 2015 | 2D | 12.06% | 13.01% | 9.23% | 15.83% | 14.82% | 15.27% |
| [232] | 2015 | 2D | 2.76% | 4.39% | 2.10% | 2.80% | 3.03% | 4.17% |
| [234] | 2016 | 2D | 4.32% | 2.65% | 2.52% | 5.81% | 2.75% | 4.95% |
| [32] | 2014 | LF | 10.12% | 12.39% | 12.79% | 13.86% | 12.91% | 16.14% |
| [33] | 2015 | LF | 19.78% | 26.36% | 29.98% | 22.46% | 32.43% | 38.03% |
| [34] | 2016 | LF | 11.00% | 10.77% | 8.12% | 18.50% | 7.27% | 22.05% |
| Proposed LFALBP | | LF | 0.88% | 2.14% | 0.73% | 0.79% | 0.75% | 2.85% |
| Proposed LFHDG | | LF | 0.01% | 0.29% | 0% | 0% | 0.28% | 0.45% |

Figure 10.1: DET face PAD performance for the proposed and benchmarking solutions using

IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f) wrapped paper

PAIs.

It is well-known that the PAD performance is sensitive to the amount of training data. To

investigate the robustness of the proposed and benchmarking solutions to the number of training

samples, the value of n considered for the adopted n-fold cross-validation strategy was tested with

values n=2,…,7, meaning that the number of training samples increases with n. Figure 10.2

illustrates the average ACER results for all artefact types obtained with 50 runs. The results clearly

show that the proposed LFALBP and LFHDG based face PAD solutions are less sensitive to the

number of training samples than the benchmarking solutions. As expected, the detection

performance increases when training uses more samples. The value of n is fixed to 4 (4-fold cross-

validation) in the next reported experiments as it shows stable performances for the proposed and

benchmarking solutions.
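The relation between n and the amount of training data can be made explicit with a small calculation. The sketch assumes equally sized folds obtained by integer division, which is exact for n = 2 and n = 4 with 100 bona fide images and only approximate otherwise.

```python
from typing import Dict, Tuple


def fold_sizes(n_samples: int, n_folds: int) -> Tuple[int, int]:
    """Return (training size, test size) for one fold of n-fold cross-validation."""
    test = n_samples // n_folds   # assumed: remainder samples stay in training
    return n_samples - test, test


# Training and test sizes for the 100 bona fide images, for n = 2, ..., 7.
SIZES: Dict[int, Tuple[int, int]] = {n: fold_sizes(100, n) for n in range(2, 8)}
```

For n = 4 this gives 75 training and 25 test images, matching the setup of the previous experiments, while n = 2 halves the training set and n = 7 raises it to 86 images.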

Figure 10.2: ACER face PAD performance for the proposed and benchmarking solutions with n-

fold cross-validation.

10.3.2 Face PAD Color Features Accuracy Evaluation

The importance of considering color information and not only luminance information for face

PAD has been studied. These tests were conducted for the proposed and the two best performing

benchmarking solutions [232] [234]. The ACER results are reported in Table 10.3, notably when

using color information from the color spaces depicted in Table 10.1 and when considering only

the luminance/gray channel. The results highlight PAD performance gains when using color in

comparison to only using gray level information. The advantage of using color information stems

from: i) as printed paper and display PAIs used for face presentation attacks have a limited color

gamut, not reproducing color perfectly, PAD solutions can, therefore, benefit from extracting

discriminative color features; and ii) computing features over several color components can

provide complementary information.


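Table 10.3: ACER face PAD performance using color or gray information (minimum errors in bold).

| Ref. | Type | Color | Laptop | Tablet | Mobile 1 | Mobile 2 | Paper | Wrapped paper |
|---|---|---|---|---|---|---|---|---|
| [232] | 2D | No | 20.85% | 26.67% | 18.34% | 39.89% | 19.23% | 19.94% |
| [232] | 2D | Yes | 2.76% | 4.39% | 2.10% | 2.80% | 3.03% | 4.17% |
| [234] | 2D | No | 5.41% | 9.72% | 8.91% | 18.18% | 17.40% | 7.93% |
| [234] | 2D | Yes | 4.32% | 2.65% | 2.52% | 5.81% | 2.75% | 4.95% |
| Prop. LFALBP | LF | No | 15.53% | 9.75% | 6.92% | 9.60% | 15.70% | 13.15% |
| Prop. LFALBP | LF | Yes | 0.88% | 2.14% | 0.73% | 0.79% | 0.75% | 2.85% |
| Prop. LFHDG | LF | No | 10.65% | 11.62% | 4.80% | 2.52% | 20.57% | 16.92% |
| Prop. LFHDG | LF | Yes | 0.01% | 0.29% | 0% | 0% | 0.28% | 0.45% |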

10.3.3 Face PAD Generalization Accuracy Evaluation

When deploying a face PAD solution, there is no way to know if the system will be attacked and

what type of attacks will be performed. A good way to test the ability to sustain unknown attacks

is to test the system in conditions not considered in the training stage. For this purpose, cross-

database evaluation has been considered, as performed in [234], training with one database and

testing with another. However, as there are only two publicly available light field face artefact

databases, GUC-LiFFAD [33] and the proposed IST LLFFSD, with the first including only 2D

rendered images, it is impossible at this stage to conduct a cross-database evaluation for light field

face PAD solutions.

As an alternative study, this Thesis investigates face PAD generalization from a different

perspective, notably considering an unforeseen artefact type, by training the PAD solutions with

all attack types available in IST LLFFSD excluding one, which is then used for testing. Table 10.4

reports the ACER results for the generalization tests obtained with 50 runs. Additionally, Figure

10.3 shows the DET generalization curves for the proposed methods and the benchmarking PAD

solutions, where the red vertical dashed lines show BPCERs at a fixed 1% APCER.
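The unforeseen-attack protocol amounts to a leave-one-PAI-out loop, sketched below. The attack type labels and the generator name are illustrative; the sketch only enumerates which attack types are used for training and which one is held out for testing.

```python
from typing import Iterator, List, Sequence, Tuple

PAI_TYPES = ["laptop", "tablet", "mobile1", "mobile2", "paper", "wrapped_paper"]


def leave_one_pai_out(pai_types: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training attack types, held-out attack type) for each PAI."""
    for held_out in pai_types:
        train = [p for p in pai_types if p != held_out]
        yield train, held_out
```

Each of the six iterations trains on five attack types and tests on the remaining one, so the held-out column of a results table reflects performance on an attack type never seen during training.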

The results show that the light field based PAD solutions generalize very well to unforeseen flat

presentation attack instruments (either laptop, tablet, mobile or printed paper), as the training also

includes other types of flat attacks. However, the performance drops significantly when testing for

the wrapped paper attack type when it was not used for training. This is not surprising as wrapped

paper is the only 3D attack type considered in IST LLFFSD, thus exhibiting more differences in the

various facial landmarks positions than the flat attack types. Therefore, the light field based

solutions using a classification model trained only with flat attacks experience some ACER

performance degradation in the presence of wrapped paper attacks. This observation highlights the

need for training face PAD systems using attack samples with different surface geometries.

Concerning the proposed LFALBP and LFHDG based solutions, even though their performance

decreases compared to the performance reported in Section 10.3.1, their generalization ability is

superior to that of the state-of-the-art solutions for most of the considered attack types.

Table 10.4: ACER face PAD generalization performance for the proposed and benchmarking

solutions using IST LLFFSD (minimum errors in bold).

Detection solution          Presentation attack instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Paper    Wrapped paper
[227]         2011  2D      44.93%   34.27%   41.60%    47.05%    42.02%   32.87%
[228]         2012  2D      27.90%   24.13%   19.30%    25.60%    17.70%   28.30%
[221]         2015  2D      17.61%   16.93%   12.02%    22.88%    17.12%   24.27%
[232]         2015  2D      3.99%    12.45%   4.14%     7.13%     4.97%    45.95%
[234]         2016  2D      36.10%   7.28%    4.64%     15.15%    7.47%    33.57%
[32]          2014  LF      30.60%   33.60%   17.40%    37.40%    35.40%   42.40%
[33]          2015  LF      24.33%   29.85%   32.17%    23.26%    19.93%   47.16%
[34]          2016  LF      12.23%   27.95%   9.80%     33.48%    8.78%    37.75%
Prop. LFALBP  -     LF      0.95%    6.04%    0.83%     0.98%     0.95%    38.75%
Prop. LFHDG   -     LF      0.05%    9.20%    0%        0.02%     0.50%    45.10%

Figure 10.3: DET face PAD generalization performance for the proposed and benchmarking

solutions using IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f)

wrapped paper PAIs.
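The operating point marked on the DET curves, the BPCER at a fixed 1% APCER, can be obtained from classifier scores roughly as follows. This is a minimal sketch that assumes higher scores mean "more attack-like" and that a sample is classified as an attack when its score reaches the threshold; the function name `bpcer_at_apcer` and the score convention are assumptions, not the implementation used to produce Figure 10.3.

```python
def bpcer_at_apcer(bona_fide_scores, attack_scores, target_apcer=0.01):
    """Lowest BPCER over all thresholds whose APCER does not exceed target_apcer.

    Assumes higher score = more attack-like; a sample is flagged as an
    attack when score >= threshold.
    """
    thresholds = sorted(set(bona_fide_scores) | set(attack_scores))
    thresholds.append(float("inf"))  # threshold that flags nothing as an attack
    best_bpcer = 1.0  # the lowest threshold flags everything, so BPCER <= 1
    for thr in thresholds:
        # Attacks scoring below the threshold are accepted as bona fide.
        apcer = sum(s < thr for s in attack_scores) / len(attack_scores)
        # Bona fide samples scoring at or above the threshold are rejected.
        bpcer = sum(s >= thr for s in bona_fide_scores) / len(bona_fide_scores)
        if apcer <= target_apcer:
            best_bpcer = min(best_bpcer, bpcer)
    return best_bpcer
```

With only a few hundred attack samples, a 1% APCER corresponds to a handful of misclassified attacks, so the resulting BPCER can change noticeably from run to run; averaging over repeated runs mitigates this.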

10.3.4 Face PAD Computational Complexity

Quantifying the amount of resources needed to perform PAD, such as time and storage, is of

prominent importance for operational biometric systems. Table 10.5 shows the average extraction

and classification times per light field image (in seconds) for the proposed and benchmarking

solutions. Table 10.5 also summarizes the feature vector sizes extracted by the various solutions.

Time measurements were performed on a standard 64-bit Intel PC with a 3.40 GHz processor and

16 GB RAM, running MATLAB R2015b.


Table 10.5: Average extraction and classification times, and feature vector size for the proposed

and benchmarking face PAD solutions (minimum values in bold).

Solution                  Feature extraction  Classification  Total     No. of vector  Feature size
Ref.          Year  Type  time (s)            time (s)        time (s)  elements/bins  (bytes)
[227]         2011  2D    0.9375              0.0020          0.9395    833            687
[228]         2012  2D    4.2508              0.0570          4.3078    45,669         339,071
[221]         2015  2D    0.1978              0.0013          0.1991    121            394
[232]         2015  2D    0.3148              0.0016          0.3164    369            474
[234]         2016  2D    0.6658              0.0068          0.6726    7,680          21,599
[32]          2014  LF    2.326               0.0009          2.3269    64             473
[33]          2015  LF    391.021             0.0004          391.02    2              16
[34]          2016  LF    19.0422             0.0105          19.052    8,100          60,860
Prop. LFALBP  -     LF    0.2314              0.0015          0.2329    96             168
Prop. LFHDG   -     LF    0.2286              0.0215          0.2501    24,300         18,122

Feature extraction typically has the largest impact on the overall presentation attack detection algorithm

complexity. The total processing duration for the proposed LFALBP and LFHDG solutions is,

respectively, around 0.23 and 0.25 seconds per image, thus exhibiting the second and third lowest

computational complexity over all tested solutions. This represents a step forward in making light

field based solutions viable to detect face presentation attacks, notably considering their detection

performance gains.
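Average per-image times of the kind reported in Table 10.5 can be collected with a simple harness such as the one sketched below. The measurements in this Thesis were made in MATLAB; this Python version is only an illustration, and the `extract` and `classify` callables are hypothetical placeholders for a feature extractor and a trained classifier.

```python
import time


def average_times(images, extract, classify):
    """Mean feature extraction, classification and total time per image (seconds).

    extract and classify are placeholder callables: extract(image) returns
    a feature vector and classify(features) returns a decision.
    """
    t_extract = 0.0
    t_classify = 0.0
    for image in images:
        start = time.perf_counter()
        features = extract(image)
        middle = time.perf_counter()
        classify(features)
        end = time.perf_counter()
        t_extract += middle - start
        t_classify += end - middle
    n = len(images)
    return {
        "extraction": t_extract / n,
        "classification": t_classify / n,
        "total": (t_extract + t_classify) / n,
    }
```

Wall-clock measurements of this kind depend on the machine and on background load, so they are best compared only across solutions timed on the same platform.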

The feature size results highlight that the proposed LFALBP based solution offers a really compact

representation, simplifying its storage, retrieval, and transmission. Concerning the LFHDG based

solution, capturing both orientation and magnitude variations for the angular information comes

at the cost of increasing the feature size, although it is not as large as the feature sizes obtained by

two of the benchmarking solutions.

10.4 Ear PAD Performance

This section reports the experimental work conducted to assess the proposed ear PAD solutions as

well as the selected benchmarking solutions, which were originally proposed as face PAD

solutions and are here used as ear PAD solutions, notably [232], [234] and [34]. The assessment

also includes the generalization power of the ear PAD solutions to unknown attacks as well as their

computational complexity.

10.4.1 Ear PAD Accuracy Evaluation

The ear PAD performance evaluation considers the same 4-fold cross-validation strategy and the

same metrics as for face PAD, as discussed in Section 10.3.1. Therefore, for each experiment, the

ear PAD system is trained with ¾ of the IST LLFEADB database and tested with the remaining

¼. Regarding the number of available IST LLFEADB baseline set samples, the experiments used

the 266 bona fide and 266 attack images for each attack type, in a total of 1064 attack images. To

have a balanced training, 200 bona fide and 200 attack images (randomly selecting 50 images from

each attack type) were considered to train the classifiers. Testing was performed with a non-

overlapping set of 66 bona fide and 66 attack images. As the attack images are randomly selected

from each attack type, the experiments have been repeated 50 times and the average results for

50 runs are reported to provide a more stable performance estimation. As the number of bona fide

and attack samples in the IST LLFEADB extended dataset are, respectively, 60 and 240, 45 bona

fide and 45 attack images (randomly selecting 12 images from each attack type) were considered

to train the classifiers and the average ACER results, obtained after 50 runs, are reported.
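The balanced random selection described above can be sketched as follows. This is a minimal illustration that assumes samples are held in plain Python lists; the function name `balanced_training_set` and the `seed` parameter are hypothetical, and the sketch ignores the 4-fold partition that would first set aside the test samples.

```python
import random


def balanced_training_set(bona_fide, attacks_by_type, per_type, seed=None):
    """Draw per_type attack samples from each attack type, plus as many bona fide samples.

    Returns (bona_fide_train, attack_train) with equal lengths, so that the
    classifier sees a balanced number of bona fide and attack samples.
    """
    rng = random.Random(seed)
    attack_train = []
    for samples in attacks_by_type.values():
        # Sampling without replacement within each attack type.
        attack_train.extend(rng.sample(samples, per_type))
    bona_train = rng.sample(bona_fide, len(attack_train))
    return bona_train, attack_train
```

For the baseline set, calling this with `per_type=50` and four attack types yields 200 attack and 200 bona fide training samples; repeating the draw with different seeds and averaging the resulting ACER values mirrors the repeated-run averaging described above.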

The results obtained for the proposed IST LLFEADB low resolution baseline set are presented in

Table 10.6. These results show that the two proposed LFALBP based and LFHDG based light field

PAD solutions, as well as one of the conventional 2D solutions [234], exploiting the joint color

texture information extracted by LPQ and the co-occurrence of adjacent local binary patterns,

achieve perfect or near perfect classification accuracy for all PAIs considered.

IST LLFEADB consists of a baseline and an extended set, with the difference between the two

sets being related to the settings used for bona fide image acquisition and thus the quality of the

samples displayed on PAIs. Results for the IST LLFEADB high resolution extended set are

presented in Table 10.7. In this case, the proposed LFHDG based PAD solution still achieves

perfect classification accuracy, while the proposed LFALBP based PAD solution and the

benchmarking solution [234] show a slight reduction in PAD performance, when compared to the

baseline set. In practice, the IST LLFEADB extended set provides a more challenging task than

the baseline set for ear PAD for two main reasons: i) the higher resolution of the extended set

images improves the quality of the samples displayed on the PAIs, thus making it harder to

distinguish artefacts from bona fide samples, as expected;

ii) as the detection performance decreases when training uses fewer samples, a slight performance

degradation may also be explained by the smaller size of the extended set.

Table 10.6: ACER ear PAD performance for the proposed and benchmarking solutions using

IST LLFEADB baseline set (minimum errors in bold).


Detection solution          Presentation attack instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      5.12%    9.29%    5.34%     4.80%     6.13%
[234]         2016  2D      0%       0.15%    0%        0%        0.04%
[34]          2016  LF      3.11%    2.09%    1.66%     0.57%     1.85%
Prop. LFALBP  -     LF      0.01%    0.02%    0%        0%        0.01%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%

Table 10.7: ACER ear PAD performance for the proposed and benchmarking solutions using

IST LLFEADB extended set (minimum errors in bold).



Detection solution          Presentation attack instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      5.84%    6.77%    4.62%     5.62%     5.71%
[234]         2016  2D      1.05%    2.42%    0.65%     0.29%     1.10%
[34]          2016  LF      0.45%    5.75%    4.55%     5.92%     2.74%
Prop. LFALBP  -     LF      0.20%    0.39%    0.18%     1.22%     0.49%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%

10.4.2 Ear PAD Generalization Accuracy Evaluation

When deploying an ear PAD solution, there is naturally no way to know with absolute certainty

what type of attacks will be performed. This Thesis investigates ear PAD generalization to consider

an unforeseen artefact type, by training the solutions with all attack types available in the IST

LLFEADB excluding one, which is then used for testing, thus playing the role of an unknown

attack. Table 10.8 and Table 10.9 report the ACER generalization performance results obtained

with 50 runs for the IST LLFEADB baseline and extended sets, respectively.


Table 10.8: ACER ear PAD generalization performance for the proposed and benchmarking

solutions using IST LLFEADB baseline set (minimum errors in bold).

Detection solution          Unknown presentation attack instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      7.98%    10.67%   10.64%    7.93%     9.30%
[234]         2016  2D      0%       1.23%    0%        0%        0.31%
[34]          2016  LF      16.85%   4.26%    8.03%     2.09%     7.80%
Prop. LFALBP  -     LF      0.01%    0.16%    0%        0%        0.04%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%


Table 10.9: ACER ear PAD generalization performance for the proposed and benchmarking

solutions using IST LLFEADB extended set (minimum errors in bold).

Detection solution          Unknown presentation attack instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      7.88%    13.49%   4.15%     10.11%    10.03%
[234]         2016  2D      1.84%    6.25%    0.67%     0.53%     2.32%
[34]          2016  LF      0.28%    6.72%    5.84%     6.15%     4.74%
Prop. LFALBP  -     LF      0.27%    0.43%    0.21%     2.22%     0.78%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%

The results show that the proposed LFHDG based ear PAD solution generalizes perfectly to the

unforeseen PAIs considered, for both the IST LLFEADB baseline and extended sets. Concerning

the proposed LFALBP based ear PAD solution, even though its performance slightly decreases

compared to the performance reported in Section 10.4.1, its generalization ability is superior to

that of the benchmarking solutions for most of the considered unknown PAIs. It should be noted that, in the

absence of ear artefact samples captured from 3D PAIs, for instance wrapped paper or silicone ears,

no significant performance degradation is expected in this generalization scenario, unlike what

happened for face PAD due to the inclusion of wrapped paper attacks.

This highlights the need for a more complete ear artefact database, notably including artefact

samples captured from 3D PAIs, to more exhaustively study ear PAD technology.

10.4.3 Ear PAD Computational Complexity

As PAD performance may increase at the cost of additional computational complexity, it is

important to assess this trade-off. Table 10.10 shows the average extraction and classification times

per image (in seconds) for the two proposed and the three benchmarking ear PAD solutions

considered. This table also summarizes the descriptor sizes for the various solutions. Time

measurements were performed on a standard 64-bit Intel PC with a 3.40 GHz processor and 16

GB RAM, running MATLAB R2015b. The proposed LFALBP and LFHDG based ear PAD

solutions exhibit the lowest computational complexity over all tested solutions, with a total

processing time around 49 and 218 milliseconds per image, respectively. Concerning the feature

vector size, the proposed LFALBP based solution offers a very compact representation; this is not

the case for the LFHDG based solution, as it needs a larger feature vector to capture both the

orientation and magnitude variations of the angular information.

Table 10.10: Average extraction and classification times, and feature vector size for the

proposed and benchmarking ear PAD solutions (minimum values in bold).

Solution                  Feature extraction  Classification  Total processing  No. of vector  Descriptor
Ref.          Year  Type  time (s)            time (s)        time (s)          elements/bins  size (bytes)
[232]         2015  2D    0.4188              0.1188          0.5376            369            339
[234]         2016  2D    0.3598              0.0095          0.3693            7,680          12,173
[34]          2016  LF    19.0351             0.0640          19.0991           12,420         93,494
Prop. LFALBP  -     LF    0.0489              0.0002          0.0492            96             158
Prop. LFHDG   -     LF    0.1016              0.1162          0.2178            37,260         274,475

In summary, the performance results and the computational complexity show that light field based

solutions not only achieve very effective and stable PAD performance, but can also offer lower

complexity, thus representing one step forward in biometric and forensic applications when

compared to regular 2D imaging PAD solutions.

Part IV. Conclusion

Chapter 11

Summary of Contributions

11.1 Introduction

Exploiting light field imaging sensors and the associated data for face and ear biometric

recognition and Presentation Attack Detection (PAD) tasks has been the main focus of this Thesis. Following the research work developed, this Thesis has extensively reviewed the state-of-the-art

on face and ear recognition and PAD and has proposed several novel databases and solutions to

exploit the additional visual information available in a light field image. This chapter presents a

summary of the contributions, separately for recognition and PAD.

11.2 Light Field Face and Ear Recognition

To better understand the technological landscape in terms of recognition systems, this Thesis has

proposed a new multi-level taxonomy for face and ear recognition solutions, aiming to facilitate

their organization and categorization. The proposed multi-level

taxonomy considers four levels, notably face/ear structure, feature support, feature extraction

approach, and feature extraction sub-approach. Following the proposed taxonomy, a

comprehensive review of recent, representative and relevant face and ear recognition solutions has

been made. This Thesis has also reviewed the available face and ear databases, which are

instrumental for designing, testing and validating face and ear recognition solutions.

Next, two light field face and ear databases were developed in the context of this Thesis, thus

allowing more extensive benchmarking for face and ear recognition solutions exploiting light field

data:

1. IST-EURECOM LLFFD - The IST-EURECOM Lenslet Light Field Face Database (IST-

EURECOM LLFFD) has been created and made publicly available, including data from 100

subjects, with 20 samples per person, captured by a Lytro ILLUM lenslet camera. The

images were captured in a controlled acquisition setup with different facial variations,

including emotions, actions, poses, illuminations, and occlusions in order to benefit from the

non-intrusive nature of face recognition.

2. IST LLFEDB - Additionally, the IST-EURECOM Lenslet Light Field Ear DataBase (IST

LLFEDB) has been created, to make publicly available the first content allowing testing and

validating light field based ear recognition systems. The proposed ear database consists of 536

light field ear images from 67 subjects, with 8 image shots per person, captured with a Lytro

ILLUM lenslet camera, over two separate sessions, with four different poses per session. The

proposed database includes ear images partly occluded by ear piercings, earrings, hair and

combinations of these occlusions.

Subsequently, two light field face and ear recognition solutions based on hand-crafted descriptors

and five face recognition solutions based on fused deep/double-deep descriptors were developed,

evolving through progressive levels of functionality and recognition performance:

1. Face and Ear Recognition Based on Light Field Local Binary Patterns (LFLBP)

Descriptor - The first light field solution was proposed based on a novel, simple, yet effective,

hand-crafted descriptor, named Light Field Local Binary Patterns (LFLBP), able to exploit the

richer information available in light field images for face and ear recognition tasks.

2. Face and Ear Recognition Based on Light Field Histogram of Gradients (LFHG)

Descriptor - Another light field solution was proposed based on the fusion of a non-light field

based hand-crafted descriptor, the so-called Histogram of Oriented Gradients (HOG), with a

new light field based descriptor, called Light Field Histogram of Disparity Gradients

(LFHDG). The fused descriptor, named Light Field Histogram of Gradients (LFHG),

considered both the orientation and magnitude for the spatial and angular information, while

the LFLBP solution only captured the magnitude for the spatial and angular information.

3. Face Recognition Based on VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor -

Recognizing the importance of deep learning in biometric recognition, the first deep learning

based solution was proposed for light field face recognition, based on a VGG

2D+Disparity+Depth (VGG-D3) fused deep descriptor. The fused deep VGG-D3 description is

obtained by the feature level fusion of deep descriptions extracted from 2D images as well as

the corresponding disparity and depth maps, using a VGG-Face descriptor, acknowledging that

disparity and depth maps may bring some complementary information to the recognition task.

4. Face Recognition Based on VGG + Conv-LSTM Double-Deep Descriptor - As the VGG-

D3 fused deep descriptor only processes light field central view data, notably using its rendered

texture and corresponding disparity and depth maps, a double-deep descriptor, based on VGG-

Face and conventional LSTM (Conv-LSTM) descriptors, was proposed, exploiting the multi-

perspective information available in a light field image. The proposed VGG + Conv-LSTM

double-deep descriptor extracts higher dimensional angular dependencies from different face

viewpoints rendered from a light field image.

5. Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors - Finally,

three face recognition solutions based on three novel light field LSTM cell architectures, the

so-called Gate-Level Fusion (GLF-LSTM), State-Level Fusion (SLF-LSTM) and Sequential

Learning (SeqL-LSTM), were proposed, adopting joint learning of the light field horizontal

and vertical parallaxes. The proposed cell architectures have been integrated into a spatio-

angular learning framework for double-deep description, where an LSTM network adopting the

proposed light field LSTM cell architectures receives its inputs from a VGG-Face deep

descriptor applied to a set of horizontal and vertical 2D face viewpoint images, derived from a

light field image. These recognition solutions, named VGG + GLF-LSTM, VGG + SLF-

LSTM, and VGG + SeqL-LSTM double-deep descriptors, lead to richer spatio-angular

descriptions, compared to the proposed VGG + Conv-LSTM double-deep descriptor, for face

recognition.
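The feature-level fusion used for the VGG-D3 fused deep descriptor (item 3 above) amounts to combining the descriptions extracted from the 2D image, the disparity map and the depth map into a single vector. A minimal sketch of such a fusion is given below; it assumes simple concatenation after L2 normalization of each descriptor, which is a common choice but is not confirmed here as the exact scheme adopted in this Thesis, and the function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def fuse_descriptors(*descriptors):
    """Feature-level fusion by concatenating L2-normalized descriptors.

    Each argument is one description (e.g. from the 2D image, the disparity
    map and the depth map); normalizing first keeps one descriptor from
    dominating the fused vector purely because of its scale.
    """
    fused = []
    for descriptor in descriptors:
        fused.extend(l2_normalize(descriptor))
    return fused
```

The fused vector has a length equal to the sum of the individual descriptor lengths, which is the price paid for retaining the complementary information of each input.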

A comprehensive evaluation of state-of-the-art face and ear recognition solutions was

conducted, including analysing the sensitivity of the proposed recognition solutions to the

available training data, both in terms of number of training samples and variations. The extensive

performance assessment showed the superiority of the proposed light field based solutions for both

face and ear recognition tasks, compared to appropriate state-of-the-art benchmarking recognition

solutions. In particular, the average recognition gains of the light field based variants over their

corresponding 2D baseline solutions showed the added value of light field information for face

and ear recognition purposes. Among the proposed solutions, those based on the

VGG + SeqL-LSTM double-deep descriptor and the LFHG hand-crafted descriptor achieved the best

recognition performance for the face and ear recognition tasks, respectively. The average SeqL-LSTM

double-deep descriptor performance obtained for the three challenging face recognition

evaluation protocols on IST-EURECOM LLFFD was 93.96%, showing a 10.11% gain over the

best performing benchmarking solution, the 2D VGG-Face descriptor. The proposed LFHG descriptor

achieved an average ear recognition performance of 88.2%, showing a gain of 5.9% on IST LLFEDB

when compared against its 2D baseline, HOG, which was the best performing benchmarking

solution.

11.3 Light Field Biometric Presentation Attack Detection

This Thesis has also provided a comprehensive review of the recent advances in light field based

face PAD solutions and artefact databases, following a new, encompassing taxonomy for PAD

solutions. Reviewing ear PAD was not considered, as there were no ear artefact databases or ear

PAD solutions to be considered at the time of the writing of the Thesis. The proposed multi-level

taxonomy organized the face PAD solutions according to four main dimensions, notably user

interaction support, imaging sensor, contextual information and feature extraction.

Next, two light field face and ear artefact databases were developed in the context of this Thesis,

thus allowing more extensive benchmarking for face and ear PAD solutions exploiting light field

data:

1. IST LLFFSD - The IST Lenslet Light Field Face Spoofing Database (IST LLFFSD), captured

with a Lytro ILLUM lenslet camera, has been created and made publicly available. It consists

of 100 bona fide light field images and a set of artefact images, simulating six different types

of presentation attacks, including printed paper, wrapped printed paper, laptop, tablet and two

different mobile phones.

2. IST LLFEADB - Additionally, the first ear PAD database, the IST Lenslet Light Field Ear

Artefact Database (IST LLFEADB) has been created, and made publicly available. This is the

first database allowing testing and validating ear PAD systems, including both 2D and light

field ear artefact images, captured with a Lytro ILLUM lenslet camera. IST LLFEADB

contains two sets of light field images, differing on the settings used for bona fide image

acquisition, notably in the imaging resolution and number of samples. For both sets, four types

of PAIs, including a laptop, a tablet and two different mobile phones were used to create the

artefact samples.

Subsequently, two PAD solutions based on two novel light field hand-crafted descriptors were

proposed for both face and ear PAD, exploiting the variations associated with different directions of

light captured in the light field images:

1. PAD Based on Light Field Angular Local Binary Patterns (LFALBP) Descriptor - The

first proposed PAD solution is based on a Light Field Angular Local Binary Patterns

(LFALBP) hand-crafted descriptor, capturing magnitude sign for the angular information.

2. PAD Based on Light Field Histogram of Disparity Gradients (LFHDG) Descriptor - The

second PAD solution is based on Light Field Histogram of Disparity Gradients (LFHDG)

hand-crafted descriptor, capturing both the orientation and magnitude variations for the angular

information available in a light field image.
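To illustrate what capturing "both the orientation and magnitude variations" means in practice, the sketch below computes a conventional histogram of gradient orientations weighted by gradient magnitude, of the kind used by HOG-style descriptors. It is not the LFHDG descriptor itself: the input is a generic 2D grid standing in for whatever map the descriptor would operate on, and the bin count, the central-difference gradients and the function name are all assumptions made for illustration.

```python
import math


def gradient_orientation_histogram(grid, bins=9):
    """Histogram of unsigned gradient orientations in [0, 180) degrees.

    Each interior cell votes for the bin of its gradient orientation with a
    weight equal to its gradient magnitude, so the histogram reflects both
    the orientation and the magnitude of local variations.
    """
    rows, cols = len(grid), len(grid[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences along the horizontal and vertical axes.
            gx = grid[r][c + 1] - grid[r][c - 1]
            gy = grid[r + 1][c] - grid[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle // bin_width), bins - 1)
            hist[index] += magnitude
    return hist
```

A descriptor that keeps only the sign of local differences, as binary pattern descriptors do, would discard the magnitude weighting shown here; this is one way to read the contrast drawn between the two proposed descriptors.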

A comprehensive evaluation of the proposed and benchmarking light field face and ear PAD

solutions has been performed, in terms of accuracy, generalization and complexity. The extensive

assessment showed that the proposed light field based PAD solutions not only achieve superior

PAD performance with higher robustness and generalization ability, but can also exhibit low

computational complexity, thus representing one step forward towards using light field

imaging solutions to effectively detect face and ear presentation attacks. The evaluation results

showed that the proposed light field based solutions achieved perfect or near perfect classification

accuracy for both face and ear PAD and all PAIs considered, whereas the best performing

benchmarking solution led to more than 2.5% average classification error rate.

Chapter 12

Future Research Directions

12.1 Introduction

The experimental work conducted in this Thesis has confirmed the added value of the richer

information available in light field images for face and ear recognition and PAD purposes.

However, there is still room for further developments and improvements, a selection of which is

discussed in this chapter.

12.2 Future Research Directions for Light Field Face and Ear Recognition

Some future research directions in terms of face and ear recognition include:

Unconstrained light field face and ear databases – The face and ear light field databases

proposed in the context of this Thesis were captured with a controlled acquisition setup, which

is a rather common and realistic scenario in business and industrial environments where the

images to be recognized are captured in, at least partly, constrained conditions. An important

future research direction may be to extend the light field face and ear database acquisitions,

considering not only constrained environments, but also the acquisition of images in

unconstrained conditions, thus introducing a high degree of variability and presenting more

challenging recognition conditions to the existing technologies. Examples include face and ear

images captured at different distances to the camera, with uncontrolled backgrounds, and

uncontrolled poses. The more difficult recognition scenarios represented by the images

contained in such unconstrained databases will provide challenges that can lead to significant

improvements in light field based face and ear recognition technologies.

Large-scale ear database – Deep learning based ear recognition solutions have not yet led to

a considerably superior performance over traditional solutions due to the lack of sufficient

training samples in the available datasets. This reveals a pressing need to gather large-scale

ear databases in order to obtain better deep classification models and thus more

accurate ear recognition.

Fusion of multiple imaging sensors – Although light field based recognition solutions

achieve the best performance for the tested databases, different imaging sensors, such as depth

and NIR cameras, may provide a valuable contribution when operating in unconstrained situations,

e.g., with varying illumination conditions. Therefore, recognition solutions considering

multiple sensors may contribute to enhancing performance.

Face and ear recognition on mobile phones equipped with light field/multiple cameras –

Following the wide deployment of mobile authentication scenarios using facial images and the

availability of mobile devices equipped with light field/multiple cameras, as illustrated in

Figure 12.1, there is a strong need to design efficient recognition solutions, based on the

characteristics of the emerging light field/multiple cameras available in mobile phones.

Figure 12.1: Mobile phones equipped with multiple cameras.

12.3 Future Research Directions for Light Field Based Face and Ear

Presentation Attack Detection

While this Thesis demonstrates the efficacy of light field based face and ear PAD, some future

research directions include:

Light field artefact face mask and silicone ear databases – The current light field based face

and ear artefact databases cover all artefact types considered in the literature, except those

involving wearing 3D face masks and silicone ears, due to the high cost associated with

preparing good quality face mask and silicone ear PAIs. To fully explore the potential of light

field solutions to detect 3D presentation attacks, especially flexible thin-layered silicone face

masks that can be used in highly sensitive security scenarios such as a semi-supervised border

control scenario, more complete light field artefact databases should be created.

Deep learning based light field PAD – Deep learning PAD solutions based on conventional

cameras are among the most recent and promising approaches to detect presentation attacks

[261], [262], and [263]. However, the current publicly available light field artefact databases

do not provide enough information to train a deep network. The application of neural network

based solutions for light field images needs to be explored after the availability of more

comprehensive light field artefact databases.

Hardware implementation of PAD solutions – Recent works introduce hardware platforms

to implement hand-crafted descriptors, such as HOG [274] and LBP [275], and also classifiers

such as SVM [276], on FPGA and GPU. The hardware implementation of PAD systems, either

based on light field cameras or others, is a step forward towards designing fast solutions

operating in real-time.

PAD on mobile phones equipped with light field/multiple cameras – Similar to recognition

systems, there is a strong need to design efficient PAD solutions running on mobile platforms

due to the wide deployment of mobile authentication scenarios. As commercial light field

cameras, notably camera arrays, are getting cheaper and there are some efforts to equip mobile

devices with light field/multiple cameras (see Figure 12.1), designing efficient PAD solutions,

based on the characteristics of light field/multiple cameras for mobile phones, is becoming a

hot topic.

In summary, light field imaging technology may represent one step forward in biometric and

forensic applications when compared to conventional imaging sensors.

References

[1] A. Jain and A. Ross, "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems

for Video Technology, vol. 14, no. 1, pp. 4-20, Jan. 2004.

[2] A. Jain, K. Nandakumar and A. Ross, "50 years of biometric research: Accomplishments, challenges, and

opportunities," Pattern Recognition Letters, vol. 79, no. 1, pp. 80-105, Aug. 2016.

[3] M. Günther, L. El Shafey and S. Marcel, "2D face recognition: An experimental and reproducible research

survey," Technical Report Idiap-RR-13, Martigny, Switzerland, Apr. 2017.

[4] A. Goldstein, L. Harmon and A. Lesk, "Identification of human faces," Proceedings of the IEEE, vol. 59, no.

5, pp. 748-760, May 1971.

[5] Z. Emeršič, V. Štruc and P. Peer, "Ear recognition: More than a survey," Neurocomputing, vol. 255, no. 1, pp.

26-39, Sep. 2017.

[6] A. Pflug, "Ear recognition: Biometric identification using 2- and 3-dimensional images of human ears," PhD

thesis in Information Security, Faculty of Computer Science and Media Technology, Gjøvik University

College, Gjøvik, Norway, Jun. 2016.

[7] K. Chang, K. Bowyer, S. Sarkar and B. Victor, "Comparison and combination of ear and face images in

appearance-based biometrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no.

9, pp. 1160-1165, Sep. 2003.

[8] Z. Huang, Y. Liu, C. Li, M. Yang and L. Chen, "A robust face and ear based multimodal biometric system

using sparse representation," Pattern Recognition, vol. 46, no. 8, pp. 2156-2168, Aug. 2013.

[9] N. Hezil and A. Boukrouche, "Multimodal biometric recognition using human ear and palmprint," IET

Biometrics, vol. 6, no. 5, pp. 351-359, Aug. 2017.

[10] M. Monwar and M. Gavrilova, "FES: A system for combining face, ear and signature biometrics using rank

level fusion," in International Conference on Information Technology: New Generations, Las Vegas, NV,

USA, Apr. 2008.

[11] ISO/IEC 30107, "Information technology—Presentation attack detection—Part 1: Framework," International

Organization for Standardization, Jan. 2016.

[12] L. Li, P. Correia and A. Hadid, "Face recognition under spoofing attacks: Countermeasures and research

directions," IET Biometrics, vol. 7, no. 1, pp. 3-14, Jan. 2018.

[13] R. Ramachandra and C. Busch, "Presentation attack detection methods for face recognition systems: A

comprehensive survey," ACM Computing Surveys, vol. 50, no. 1, pp. 801-837, Apr. 2017.

[14] J. Komulainen, "Software-based countermeasures to 2D facial spoofing attacks," PhD thesis in Department of

Computer Science and Engineering, Infotech Oulu, University of Oulu, Oulu, Finland, Aug. 2015.

[15] D. Ngo, A. Teoh and J. Hu, Biometric security, Newcastle, UK: Cambridge Scholars Publishing, Feb. 2015.

[16] G. Goudelis, A. Tefas and I. Pitas, "Emerging biometric modalities: A survey," Journal on Multimodal User

Interfaces, vol. 2, no. 4, pp. 217-235, Dec. 2008.

[17] G. Goswami, M. Vatsa and R. Singh, "RGB-D face recognition with texture and attribute features," IEEE

Transactions on Information Forensics and Security, vol. 9, no. 10, pp. 1629-1640, Jul. 2014.

[18] R. Min, N. Kose and J.-L. Dugelay, "KinectFaceDB: A Kinect database," IEEE Transactions on Systems, Man,

and Cybernetics: Systems, vol. 44, no. 11, pp. 1534-1548, Jul. 2014.

[19] X. Zhang, L. Yin and F. Cohn, "BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial," Image

and Vision Computing, vol. 32, no. 1, pp. 692-706, Oct. 2014.

[20] N. Erdogmus and S. Marcel, "Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect,"

in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington,

VA, USA, Sep. 2013.

[21] N. Erdogmus and S. Marcel, "Spoofing face recognition with 3D masks," IEEE Transactions on Information

Forensics and Security, vol. 9, no. 7, pp. 1084-1097, Jul. 2014.

[22] D. Yi, Z. Lei, Z. Zhang and S. Li, "Face anti-spoofing: Multi-spectral approach," in Handbook of Biometric

Anti-Spoofing, London, Springer-Verlag, Jul. 2014, pp. 83-102.

[23] I. Chingovska, N. Erdogmus, A. Anjos and S. Marcel, "Face recognition systems under spoofing attacks," in

Face Recognition Across the Imaging Spectrum, NY, Springer International Publishing, Feb. 2016, pp. 165-

194.

[24] M. Levoy and P. Hanrahan, "Light field rendering," in 23rd Annual Conference on Computer Graphics and

Interactive Techniques, New York, NY, USA, Aug. 1996.

[25] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz and P. Hanrahan, "Light field photography with a hand-

held plenoptic camera," Tech Report CSTR 2005-02, Stanford, CA, USA, Feb. 2005.

[26] "Lytro website," Lytro Inc, [Online]. Available: https://www.lytro.com. [Accessed Nov. 2018].

[27] R. Raghavendra, K. Raja and C. Busch, "Exploring the usefulness of light field cameras for biometrics: An

empirical study on face and iris recognition," IEEE Transactions on Information Forensics and Security, vol.

11, no. 5, pp. 922-936, May 2016.

[28] R. Raghavendra, B. Yang, K. Raja and C. Busch, "A new perspective - Face recognition with light-field

camera," in International Conference on Biometrics, Madrid, Spain, Jun. 2013.

[29] R. Raghavendra, K. Raja, B. Yang and C. Busch, "Multi-face recognition at a distance using light-field

camera," in International Conference on Intelligent Information Hiding and Multimedia Signal Processing,

Beijing, China, Jul. 2013.

[30] R. Raghavendra, K. Raja, B. Yang and C. Busch, "Comparative evaluation of super-resolution techniques for

multi-face recognition using light-field camera," in IEEE International Conference on Digital Signal

Processing, Santorini, Greece, Jul. 2013.

[31] T. Shen, H. Fu and J. Chen, "Facial expression recognition using depth map estimation of light field camera,"

in IEEE International Conference on Signal Processing, Communications and Computing, Hong Kong, China,

Aug. 2016.

[32] S. Kim, Y. Ban and S. Lee, "Face liveness detection using a light field camera," Sensors, vol. 14, no. 12, pp.

22471-22499, Nov. 2014.

[33] R. Raghavendra, K. Raja and C. Busch, "Presentation attack detection for face recognition using light field

camera," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1060-1075, Mar. 2015.

[34] Z. Ji, H. Zhu and Q. Wang, "LFHOG: A discriminative descriptor for live face detection from light field

image," in IEEE International Conference on Image Processing, Phoenix, AZ, USA, Sep. 2016.

[35] A. Sepas-Moghaddam, V. Chiesa, P. Correia, F. Pereira and J. Dugelay, "The IST-EURECOM light field face

database," in International Workshop on Biometrics and Forensics, Coventry, UK, Apr. 2017.

[36] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Ear recognition in a light field imaging framework: A new

perspective," IET Biometrics, vol. 7, no. 3, pp. 224-231, May 2018.

[37] A. Sepas-Moghaddam, P. Correia and F. Pereira, "Light field local binary patterns description for face

recognition," in IEEE International Conference on Image Processing, Beijing, China, Sep. 2017.

[38] O. Parkhi, A. Vedaldi and A. Zisserman, "Deep face recognition," in British Machine Vision Conference,

Swansea, UK, Sep. 2015.

[39] A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund and F. Pereira, "Light field based face

recognition via a fused deep representation," in IEEE International Workshop on Machine Learning for Signal

Processing, Aalborg, Denmark, Sep. 2018.

[40] A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund and F. Pereira, "A double-deep spatio-angular

learning framework for light field based face recognition," Submitted to IEEE Transactions on Circuits and

Systems for Video Technology, Oct. 2018.

[41] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Light field long short-term memory: Novel cell architectures

with application to face recognition," Submitted to Pattern Recognition Letters, Oct. 2018.

[42] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Light field based face presentation attack detection:

Reviewing, benchmarking and one step further," IEEE Transactions on Information Forensics and Security,

vol. 13, no. 7, pp. 1696-1709, Jul. 2018.

[43] A. Sepas-Moghaddam, L. Malhadas, P. Correia and F. Pereira, "Face spoofing detection using a light field

imaging framework," IET Biometrics, vol. 7, no. 1, pp. 39-48, Jan. 2018.

[44] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Ear presentation attack detection: Benchmarking study with

first lenslet light field database," in European Signal Processing Conference, Rome, Italy, Sep. 2018.

[45] G. Lippmann, "Épreuves réversibles. Photographies intégrales," Comptes Rendus de l'Académie des Sciences,

vol. 13, no. 9, pp. 245-254, Jan. 1908.

[46] A. Gershun, "The light field," Journal of Mathematics and Physics, vol. 18, no. 1, pp. 51-151, Apr. 1939.

[47] E. Adelson and J. Bergen, "The plenoptic function and the elements of early vision," in Computational Models

of Visual Processing, MA, USA, MIT Press, Oct. 1991, pp. 3-20.

[48] S. Gortler, R. Grzeszczuk, R. Szeliski and M. Cohen, "The lumigraph," in Annual Conference on Computer

Graphics and Interactive Techniques, New Orleans, LA, USA, Aug. 1996.

[49] D. Dansereau, "Plenoptic signal processing for robust vision in field robotics," PhD Thesis in Mechatronic

Engineering, Queensland University of Technology, Queensland, Australia, 2014.

[50] Light, "The Light L16 Camera," [Online]. Available: https://light.co/camera. [Accessed Nov. 2018].

[51] B. Wilburn, "High performance imaging using arrays of inexpensive cameras," PhD Thesis in Department of

Electrical Engineering, Stanford University, Stanford, CA, USA, Dec. 2004.

[52] ISO/IEC JTC 1/SC 29/WG 1, "JPEG Pleno call for proposals on light field coding," ISO/IEC, Geneva,

Switzerland, Jan. 2017.

[53] Z. Yu, J. Yu, A. Lumsdaine and T. Georgiev, "An analysis of color demosaicing in plenoptic cameras," in

IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012.

[54] T. Georgiev and A. Lumsdaine, "The multi-focus plenoptic camera," in SPIE Electronic Imaging, Burlingame,

CA, USA, Jan. 2012.

[55] A. Lumsdaine and T. Georgiev, "The focused plenoptic camera," in IEEE International Conference on

Computational Photography, San Francisco, CA, USA, Aug. 2010.

[56] C. Perwass and L. Wietzke, "Single lens 3D-camera with extended depth-of-field," in Human Vision and

Electronic Imaging, Burlingame, CA, USA, Jan. 2012.

[57] "Raytrix," Raytrix GmbH, [Online]. Available: https://www.raytrix.de/. [Accessed Nov. 2018].

[58] D. Dansereau, "Light Field Toolbox V. 0.4," [Online]. Available:

http://www.mathworks.com/matlabcentral/fileexchange/49683-light-field-toolbox-v0-4. [Accessed Nov.

2018].

[59] D. Dansereau, "Plenoptic Signal Processing for Robust Vision in Field Robotics," PhD Thesis in Mechatronic

Engineering, Queensland University of Technology, Queensland, Australia, Jan. 2014.

[60] W. Zhao, R. Chellappa, J. Phillips and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing

Surveys, vol. 35, no. 4, pp. 399-458, Dec. 2003.

[61] E. Hjelmas and B. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no.

3, pp. 236-274, Sep. 2001.

[62] I. Marqués, "Face recognition algorithms," Thesis in Computer Engineering, University of the Basque

Country, Vizcaya, Spain, Jun. 2010.

[63] R. Jafri and R. Arabnia, "A survey of face recognition techniques," Journal of Information Processing Systems,

vol. 5, no. 2, pp. 41-68, Jun. 2009.

[64] S. Li and A. Jain, Handbook of face recognition, London, UK: Springer-Verlag London, 2011.

[65] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp.

71-86, Jan. 1991.

[66] M. Bartlett, J. Movellan and T. Sejnowski, "Face recognition by independent component analysis," IEEE Transactions on

Neural Networks, vol. 13, no. 6, pp. 1450-1464, Nov. 2002.

[67] S. Ahmadkhani and P. Adibi, "Face recognition using supervised probabilistic principal component analysis

mixture model in dimensionality reduction without loss framework," IET Computer Vision, vol. 10, no. 3, pp.

193-201, Mar. 2016.

[68] J. Ye, R. Janardan and Q. Li, "GPCA: an efficient dimension reduction scheme for image compression and

retrieval," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA,

USA, Aug. 2004.

[69] L. Wiskott, J. Fellous, N. Kruger and C. Von der Malsburg, "Face recognition by elastic bunch graph

matching," in International Conference on Image Processing, Santa Barbara, CA, USA, Oct. 1997.

[70] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Transactions on

Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sep. 2003.

[71] K. Huang, D. Dai, C. Ren and Z. Lai, "Learning kernel extended dictionary for face recognition," IEEE

Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1082-1094, May 2017.

[72] T. Zhang, B. Wang, F. Li and Z. Zhang, "Decision pyramid classifier for face recognition under complex

variations using single sample per person," Pattern Recognition, vol. 64, no. 1, pp. 305-313, Apr. 2017.

[73] C. Zhou, L. Wang, Q. Zhang and X. Wei, "Face recognition based on PCA and logistic regression analysis,"

Optik, vol. 125, no. 20, pp. 5916-5919, Oct. 2014.

[74] H. Li, F. Shen, C. Shen, Y. Yang and Y. Gao, "Face recognition using linear representation ensembles," Pattern

Recognition, vol. 59, no. 1, pp. 72-87, Nov. 2016.

[75] Z. Wu, Y. Wang and G. Pan, "3D face recognition using local shape map," International Conference on Image

Processing, Singapore, Singapore, Oct. 2004.

[76] T. Ahonen, A. Hadid and M. Pietikainen, "Face description with local binary patterns: Application to face

recognition," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 28, no. 12, pp. 2037-2041,

Dec. 2006.

[77] T. Ahonen, E. Rahtu, V. Ojansivu and J. Heikkila, "Recognition of blurred faces using local phase

quantization," in International Conference on Pattern Recognition, Tampa, FL, USA, Dec. 2008.

[78] M. Xi, L. Chen, D. Polajnar and W. Tong, "Local binary pattern network: A deep learning approach for face

recognition," in International Conference on Image Processing, Phoenix, AZ, USA, Sep. 2016.

[79] N. Werghi, S. Berretti and A. Bimbo, "The Mesh-LBP: A framework for extracting local binary patterns from

discrete manifolds," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 220-235, Jan. 2015.

[80] A. Ross and A. Jain, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13,

pp. 2115-2125, Sep. 2003.

[81] M. Grgic and K. Delac, "Face Recognition Homepage," [Online]. Available: http://www.face-rec.org/databases/.

[Accessed Nov. 2018].

[82] C. McCool, "Bi-Modal Person Recognition on a Mobile Phone: using mobile phone data," in IEEE

International Conference on Multimedia and Expo Workshops, Melbourne, Australia, 2012.

[83] M. Grgic, K. Delac and S. Grgic, "SCface – surveillance cameras face database," Multimedia Tools and

Applications, vol. 51, no. 1, pp. 863-879, Feb. 2011.

[84] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in IEEE

Workshop on Applications of Computer Vision, Sarasota, FL, USA, Dec. 1994.

[85] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report, Columbus, OH, USA, Jun. 1998.

[86] A. Georghiades, P. Belhumeur and D. Kriegman, "From few to many: Illumination cone models for face

recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine

Intelligence, vol. 23, no. 6, pp. 643-660, Aug. 2001.

[87] J. Phillips, H. Moon, S. Rizvi and P. Rauss, "The FERET evaluation methodology for face-recognition

algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090-1104,

Aug. 2000.

[88] C. Thomaz and G. Giraldi, "A new ranking method for principal components analysis and its application to

face image analysis," Image and Vision Computing, vol. 28, no. 6, pp. 902-913, Jun. 2010.

[89] C. Conde, "Multimodal 2D, 2.5D & 3D face verification," in International Conference on Image Processing,

Atlanta, GA, USA, Oct. 2006.

[90] G. Huang, M. Ramesh, T. Berg and E. Learned-Miller, "Labeled faces in the wild: A database for studying

face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report 07-

49, Amherst, MA, USA, Oct. 2007.

[91] A. Savran, N. Alyüz and H. Dibeklioğlu, "Bosphorus database for 3D face analysis," in BIOID 2008, LNCS

5372, Berlin, Germany, Springer-Verlag, 2008, pp. 47-56.

[92] R. Gross, I. Matthews, J. Cohn, T. Kanade and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28,

no. 1, pp. 807-813, May 2010.

[93] S. Gupta et al., "Texas 3D Face Recognition Database," in IEEE Southwest Symposium on Image Analysis

& Interpretation, Austin, TX, USA, 2010.

[94] L. Wolf, T. Hassner and I. Maoz, "Face recognition in unconstrained videos with matched background

similarity," in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA,

Aug. 2011.

[95] C. Cao, Y. Weng, S. Zhou and Y. Tong, "FaceWarehouse: A 3D facial expression database for visual

computing," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 3, pp. 413-425, Mar.

2014.

[96] B. Klare, B. Klein, E. Taborsky, A. Blanton and A. Jain, "Pushing the frontiers of unconstrained face detection

and recognition: IARPA Janus Benchmark A," in IEEE Conference on Computer Vision and Pattern

Recognition, Boston, MA, USA, Jun. 2015.

[97] N. Zhang, M. Paluri, Y. Taigman, R. Fergus and L. Bourdev, "Beyond frontal faces: Improving person

recognition using multiple cues," arXiv:1501.05703, Jan. 2015.

[98] R. Raghavendra, K. Raja and C. Busch, "Exploring the usefulness of light field cameras," IEEE Transactions

on Information Forensics and Security, vol. 11, no. 5, pp. 922-936, May 2016.

[99] M. Afifi, "The specs on face dataset," York University, Toronto, Ontario, Canada, Jan. 2017.

[100] V. Kushwaha, M. Singh, R. Singh, M. Vatsa, N. Ratha and R. Chellappa, "Disguised faces in the Wild,"

International Conference on Computer Vision and Pattern Recognition Workshop, Salt Lake City, UT, USA,

Jun. 2018.

[101] J. Wang, N. Le, J. Lee and C. Wang, "Color face image enhancement using adaptive singular value

decomposition in Fourier domain for face recognition," Pattern Recognition, vol. 57, no. 1, pp. 31-49, Sep.

2016.

[102] S. Hu, X. Lu, M. Ye and W. Zeng, "Singular value decomposition and local near neighbors for face

recognition," Pattern Recognition, vol. 64, no. 1, pp. 60-83, Apr. 2017.

[103] C. Ding and D. Tao, "Pose-invariant face recognition with homography-based normalization," Pattern

Recognition, vol. 66, no. 1, pp. 144-152, Jun. 2017.

[104] G. Hu, F. Yan, C. Chan, W. Deng, W. Christmas, J. Kittler and N. Robertson, "Face recognition using a unified

3D morphable model," in European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.

[105] W. Su, C. Hsu, C. Lin and W. Lin, "Supervised-learning based face hallucination for enhancing face

recognition," in International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar.

2016.

[106] Z. Dong, M. Pei and Y. Jia, "Orthonormal dictionary learning and its application to face recognition," Image

and Vision Computing, vol. 51, no. 1, pp. 13-21, Jul. 2016.

[107] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: closing the gap to human-level performance in

face verification," in Computer Vision and Pattern Recognition, Columbus, OH, USA, Jun. 2014.

[108] Y. Sun, D. Liang, X. Wang and X. Tang, "DeepID3: Face recognition with very deep neural networks,"

arXiv:1502.00873, Feb. 2015.

[109] Y. Lee, G. Chen, C. Tseng and S. Lai, "Accurate and robust face recognition from RGB-D images with a deep

learning approach," in British Machine Vision Conference, York, UK, Sep. 2016.

[110] X. Liu, L. Song, X. Wu and T. Tan, "Transferring deep representation for NIR-VIS heterogeneous face

recognition," in International Conference on Biometrics, Halmstad, Sweden, Aug. 2016.

[111] C. Reale, N. Nasrabadi, H. Kwon and R. Chellappa, "Seeing the forest from the trees: A holistic approach to

near-infrared heterogeneous face recognition," in IEEE Conference on Computer Vision and Pattern

Recognition Workshops, Las Vegas, NV, USA, Jul. 2016.

[112] Y. Lee, J. Chen, C. Tseng and S. Lai, "Accurate and robust face recognition from RGB-D images with a deep

learning approach," in British Machine Vision Conference, York, UK, Sep. 2016.

[113] X. Wu, L. Song, R. He and T. Tan, "Coupled deep learning for heterogeneous face recognition,"

arXiv:1704.02450, Apr. 2017.

[114] R. He, X. Wu, Z. Sun and T. Tan, "Learning invariant deep representation for NIR-VIS face recognition," in

AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, Feb. 2017.

[115] J. Lezama, Q. Qiu and G. Sapiro, "Not afraid of the dark: NIR-VIS face recognition via cross-spectral

hallucination and low-rank embedding," in IEEE Conference on Computer Vision and Pattern Recognition,

Honolulu, HI, USA, Jul. 2017.

[116] I. Masi, F. Chang, P. Natarajan and R. Nevatia, "Learning Pose-Aware Models for Pose-Invariant Face

Recognition in the Wild," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, Jan.

2018.

[117] X. Wu, R. He, Z. Sun and T. Tan, "A lightened CNN for deep face representation," IEEE Transactions on

Information Forensics and Security, in press, Jan. 2018.

[118] K. Grm, V. Struc, A. Artiges, M. Caron and H. Ekenel, "Strengths and weaknesses of deep learning models

for face recognition against image degradations," IET Biometrics, vol. 7, no. 1, pp. 81-89, Jan. 2018.

[119] O. Deniz, G. Bueno, J. Salido and F. De la Torre, "Face recognition using histograms of oriented gradients,"

Pattern Recognition Letters, vol. 32, no. 12, pp. 1598-1603, Sep. 2011.

[120] A. Aissaoui, J. Martinet and C. Djeraba, "DLBP: A novel descriptor for depth image based face recognition,"

in IEEE International Conference on Image Processing, Paris, France, Oct. 2014.

[121] L. Liu, P. Fieguth, G. Zhao, M. Pietikäinen and D. Hu, "Extended local binary patterns for face recognition,"

Information Sciences, vol. 358, no. 1, pp. 56-72, Sep. 2016.

[122] T. Schlett, C. Rathgeb and C. Busch, "A binarization scheme for face recognition based on multi-scale block

local binary patterns," in International Conference of the Biometrics Special Interest Group, Darmstadt,

Germany, Nov. 2016.

[123] X. Chen, F. Hu, Z. Liu, Q. Huang and J. Zhang, "Multi-resolution elongated CS-LDP with Gabor feature for

face recognition," International Journal of Biometrics, vol. 8, no. 1, pp. 19-32, Jan. 2016.

[124] W. Yang, Z. Wang and B. Zhang, "Face recognition using adaptive local ternary patterns method,"

Neurocomputing, vol. 213, no. 1, pp. 183-190, Nov. 2016.

[125] L. Tian, C. Fan and Y. Ming, "Multiple scales combined principle component analysis," Journal of Electronic

Imaging, vol. 25, no. 2, pp. 3025-3041, Apr. 2016.

[126] Z. Li, D. Gong, X. Li and D. Tao, "Aging face recognition: A hierarchical learning model based on Local

patterns selection," IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2146-2154, May 2016.

[127] J. Li, Z. Chen and C. Liu, "Low-resolution face recognition of multi-scale blocking CS-LBP and weighted

PCA," International Journal of Pattern Recognition and Artificial Intelligence, vol. 30, no. 9, pp. 6005-6018,

Sep. 2016.

[128] J. Zhang, Y. Deng, Z. Guo and Y. Chen, "Face recognition using part-based dense sampling local features,"

Neurocomputing, vol. 184, no. 1, pp. 176-187, Apr. 2016.

[129] C. Li, W. Wei, J. Wang, W. Tang and S. Zhao, "Face recognition based on deep belief network combined with

center-symmetric local binary pattern," in Advanced Multimedia and Ubiquitous Engineering, Singapore,

Singapore, Springer, Aug. 2016, pp. 277-283.

[130] Z. Lu and L. Zhang, "Face recognition algorithm based on discriminative dictionary learning and sparse

representation," Neurocomputing, vol. 174, no. 2, pp. 749-755, Jan. 2016.

[131] L. Tran and X. Liu, "Nonlinear 3D face morphable model," arXiv:1804.03786, Apr. 2018.

[132] O. Nikisins, K. Nasrollahi, M. Greitans and T. Moeslund, "RGB-D-T based face recognition," in International

Conference on Pattern Recognition, Stockholm, Sweden, Dec. 2014.

[133] A. Nigam, G. Chhalotre and P. Gupta, "Pose and illumination invariant face recognition using binocular stereo

3D reconstruction," in Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics,

Patna, India, Dec. 2015.

[134] J. Li, N. Sang and C. Gao, "Face recognition with Riesz binary pattern," Digital Signal Processing, vol. 51,

no. 1, pp. 196-201, Apr. 2016.

[135] Y. Wang, S. Yu, W. Li, L. Wang and Q. Liao, "Face recognition with local contourlet combined patterns," in

International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, May. 2016.

[136] A. Fathi, P. Alirezazadeh and F. Abdali-Mohammadi, "A new Global-Gabor-Zernike feature descriptor and its

application to face recognition," Journal of Visual Communication and Image Representation, vol. 38, no. 1,

pp. 65-72, Jul. 2016.

[137] C. Ding, J. Choi, D. Tao and L. Davis, "Multi-directional multi-level dual-cross patterns for robust face

recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 518-531,

Mar. 2016.

[138] T. Freitas, P. Alves, J. Monteiro and J. Cardoso, "A comparative analysis of deep and shallow features for

multimodal face recognition in a novel RGB-D-IR dataset," in International Symposium on Visual Computing,

Las Vegas, NV, USA, Dec. 2016.

[139] Y. Bi, M. Lv, Y. Wei, N. Guan and W. Yi, "Multi-feature fusion for thermal face recognition," Infrared Physics

& Technology, vol. 77, no. 1, pp. 366-374, Jul. 2016.

[140] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Li and T. Hospedales, "When face recognition meets with

deep learning: An Evaluation of convolutional neural networks for face recognition," in IEEE International

Conference on Computer Vision Workshop, Santiago, Chile, Dec. 2015.

[141] M. Mehdipour Ghazi and H. Ekenel, "A comprehensive analysis of deep learning based representation for face

recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV,

USA, Jul. 2016.

[142] A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet classification with deep convolutional neural networks,"

in International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, Dec. 2012.

[143] X. Wu, R. He, Z. Sun and T. Tan, "A light CNN for deep face representation with noisy labels,"

arXiv:1511.02683, Ithaca, NY, USA, Apr. 2017.

[144] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy

with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, Ithaca, NY, USA, Nov. 2016.

[145] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the inception architecture for

computer vision," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,

Jun. 2016.

[146] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv

preprint arXiv:1409.1556, Ithaca, NY, USA, Apr. 2015.

[147] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, Dec.

2015.

[148] F. Wang, X. Xiang, J. Cheng and A. Yuille, "NormFace: L2 hypersphere embedding for face verification,"

arXiv:1704.06369, Apr. 2017.

[149] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on

Computer Vision and Pattern Recognition, San Diego, CA, USA, Jul. 2005.

[150] A. Ross, "An introduction to multibiometrics," in European Signal Processing Conference, Poznan, Poland,

Sep. 2007.

[151] P. Auscher, T. Coulhon, X. Duong and S. Hofmann, "Riesz transform on manifolds and heat kernel regularity,"

Annales Scientifiques de l’École Normale Supérieure, vol. 37, no. 6, pp. 911-957, Dec. 2004.

[152] M. Wang and W. Deng, "Deep Face Recognition: A Survey," arXiv:1804.06655, Apr. 2018.

[153] H. Lu, K. Plataniotis and A. Venetsanopoulos, "MPCA: Multilinear principal component analysis of tensor

objects," IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 18-39, Jan. 2008.

[154] Z. Mu and L. Yuan, "Introduction to USTB ear image databases," Ear recognition Lab at University of Science

& Technology Beijing, 2004. [Online]. Available: http://www1.ustb.edu.cn/resb/en/visit/visit.htm.

[155] A. Kumar and C. Wu, "Automated human identification using ear imaging," Pattern Recognition, vol. 45, no.

3, pp. 956-968, Mar. 2012.

[156] E. González Sánchez, "Biometric analysis of the ears," Ph.D. thesis, Universidad de Las Palmas, Gran Canaria,

Spain, Sep. 2008.

[157] D. Frejlichowski and N. Tyszkiewicz, "The west pomeranian university of technology ear database – A tool

for testing biometric algorithms," in International Conference Image Analysis and Recognition, Póvoa de

Varzim, Portugal, Jun. 2010.

[158] Z. Xiaoxun and J. Yunde, "Symmetrical null space LDA for face and ear recognition," Neurocomputing, vol.

70, no. 4, pp. 842-848, Jan. 2007.

[159] B. Zhang, Z. Mu, C. Li and H. Zeng, "Robust classification for occluded ear via Gabor scale feature-based

non-negative sparse representation," Optical Engineering, vol. 53, no. 6, pp. 1-11, Jun. 2014.

[160] A. Tharwat, A. Ibrahim, A. Hassanien and G. Schaefer, "Ear recognition using block-based principal

component analysis and decision fusion," in International Conference on Pattern Recognition and Machine

Intelligence, Warsaw, Poland, Jun. 2015.

[161] I. Naseem, R. Togneri and M. Bennamoun, "Sparse representation for ear biometrics," in International

Symposium on Visual Computing, Las Vegas, NV, USA, Dec. 2016.

[162] D. Watabe, H. Sai, T. Ueda and K. Sakai, "ICA, LDA, and Gabor jets for robust ear recognition, and jet space

similarity for ear detection," International Journal of Intelligent Computing in Medical Sciences & Image

Processing, vol. 3, no. 1, pp. 9-29, Feb. 2013.

[163] B. Moreno, A. Sanchez and J. Velez, "On the use of outer ear images for personal identification in security

applications," in International Carnahan Conference on Security Technology, Madrid, Spain, Oct. 1999.

[164] Z. Mu, L. Yuan, Z. Xu, D. Xi and S. Qi, "Shape and structural feature based ear recognition," in 5th Chinese

Conference on Advances in Biometric Person Authentication, Guangzhou, China, Dec. 2004.

[165] M. Burge and W. Burger, "Ear Biometrics for Machine Vision," in Workshop of the Austrian Association for

Pattern Recognition, New York, Springer, Sep. 1997, pp. 273-285.

[166] T. Theoharis, G. Passalis and G. Toderici, "Unified 3D face and ear recognition using wavelets on geometry

images," Pattern Recognition, vol. 41, no. 3, pp. 796-804, Mar. 2008.

[167] M. Choraś, "Perspective methods of human identification: Ear biometrics," Opto-Electronics Review, vol. 16,

no. 1, pp. 85-96, Mar. 2008.

[168] I. Omara, F. Li, H. Zhang and W. Zuo, "A novel geometric feature extraction method for ear recognition,"

Expert Systems With Applications, vol. 65, no. 1, pp. 127-135, Dec. 2016.

[169] Y. Zhou and S. Zafeiriou, "Deformable models of ears in-the-wild for alignment and recognition," in IEEE

International Conference on Automatic Face & Gesture Recognition, Washington, DC, USA, Jun. 2017.

[170] Z. Emeršič, D. Štepec, V. Štruc and P. Peer, "Training convolutional neural networks with limited training data

for ear recognition in the wild," in IEEE International Conference on Automatic Face & Gesture Recognition,

Washington, DC, USA, Jun. 2017.

[171] Z. Emeršič, D. Štepec, V. Štruc and H. Ekenel, "The unconstrained ear recognition challenge," in IEEE

International Joint Conference on Biometrics, Denver, CO, USA, Oct. 2017.

[172] F. Eyiokur, D. Yaman and H. Ekenel, "Domain adaptation for ear recognition using deep convolutional neural

networks," IET Biometrics, vol. 7, no. 3, pp. 199-206, May 2018.

[173] Y. Guo, G. Zhao, M. Pietikäinen and Z. Xu, "A new Gabor phase difference pattern for face and ear

recognition," in International Conference on Computer Analysis of Images and Patterns, Münster, Germany,

Sep. 2009.

[174] N. Damer and B. Führer, "Ear recognition using multi-scale histogram of oriented gradients," in International

Conference on Intelligent Information Hiding and Multimedia Signal Processing, Piraeus, Greece, Jul. 2012.

[175] A. Morales, M. Ferrer, M. Diaz-Cabrera and E. González, "Analysis of local descriptors features and its

robustness applied to ear recognition," in International Carnahan Conference on Security Technology,

Medellin, Colombia, Oct. 2013.

[176] A. Pflug, P. Paul and C. Busch, "A comparative study on texture and surface descriptors for ear biometrics,"

in International Carnahan Conference on Security Technology, Rome, Italy, Dec. 2014.

[177] Z. Youbi, L. Boubchir, M. Bounneche, A. Ali-Chérif and A. Boukrouche, "Human ear recognition based on

multi-scale local binary pattern descriptor and KL divergence," in International Conference on

Telecommunications and Signal Processing, Vienna, Austria, Jun. 2016.

[178] C. Long, M. Zhichun, N. Bingfei, Z. Yi and Y. Ruyin, "TDSIFT: A new descriptor for 2D and 3D ear

recognition," in International Conference on Graphic and Image Processing, Tokyo, Japan, Oct. 2016.

[179] H. Zhang, Z. Mu, W. Qu, L. Liu and C. Zhang, "A novel approach for ear recognition based on ICA and RBF

network," in International Conference on Machine Learning and Cybernetics, Guangzhou, China, Aug. 2005.

[180] M. Nosrati, K. Faez and F. Faradji, "Using 2D wavelet and principal component analysis for personal

identification based On 2D ear structure," in International Conference on Intelligent and Advanced Systems,

Kuala Lumpur, Malaysia, Nov. 2007.

[181] Y. Wang, Z. Mu and H. Zeng, "Block-based and multi-resolution methods for ear recognition using wavelet

transform and uniform local binary patterns," in International Conference on Pattern Recognition, Tampa, FL,

USA, Dec. 2008.

[182] A. Kumar and T. Chan, "Robust ear identification using sparse representation of local texture descriptors,"

Pattern Recognition, vol. 85, no. 1, pp. 73-85, Jan. 2013.

[183] P. Galdámez, A. Arrieta and M. Ramon, "Ear recognition using a hybrid approach based on neural networks,"

in International Conference on Information Fusion, Salamanca, Spain, Jul. 2014.

[184] L. Jacob and G. Raju, "Ear recognition using texture features - A novel approach," in Advances in Intelligent

Systems and Computing, Singapore, Singapore, Springer, Jul. 2014, pp. 1-12.

[185] A. Benzaoui, A. Hadid and A. Boukrouche, "Ear biometric recognition using local texture descriptors," Journal

of Electronic Imaging, vol. 23, no. 5, pp. 1-12, Oct. 2014.

[186] "Lytro Desktop 4," Lytro, Inc., [Online]. Available:

https://support.lytro.com/hc/en-us/articles/202590364-Lytro-Desktop-4-Main-Overview. [Accessed Nov. 2018].

[187] C. Chang and C. Lin, "LIBSVM -- A library for support vector machines," National Taiwan University,

[Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. [Accessed Nov. 2018].

[188] S. Marto, N. Monteiro, J. Barreto and J. Gaspar, "Structure from plenoptic imaging," in IEEE International

Conference on Development and Learning and on Epigenetic Robotics, Lisbon, Portugal, Sep. 2017.

[189] N. Monteiro, S. Marto, J. Barreto and J. Gaspar, "Depth range accuracy for plenoptic cameras,"

Computer Vision and Image Understanding, vol. 168, no. 1, pp. 104-117, Mar. 2018.

[190] H. Jeon, J. Park, G. Choe and G. Park, "Accurate depth map estimation from a lenslet light field camera," in

IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 2015.

[191] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-

1780, Nov. 1997.

[192] K. Greff, R. Srivastava, J. Koutník, B. Steunebrink and J. Schmidhuber, "LSTM: A search space odyssey,"

IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222-2232, Oct. 2017.

[193] J. Liu, A. Shahroudy, D. Xu, A. Chichung and G. Wang, "Skeleton-based action recognition using spatio-

temporal LSTM network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence,

in press, Nov. 2017.

[194] P. Rodriguez, G. Cucurull, J. Gonzalez, J. Gonfaus, K. Nasrollahi, T. Moeslund and J. Xavier Roca, "Deep

pain: Exploiting long short-term memory networks for facial expression classification," IEEE Transactions on

Cybernetics, vol. 99, no. 1, pp. 1-11, Feb. 2017.

[195] J. Donahue, L. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko and T. Darrell, "Long-

term recurrent convolutional networks for visual recognition and description," IEEE Transactions on Pattern

Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677-691, Apr. 2017.

[196] P. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, vol. 78,

no. 10, pp. 1550-1560, Oct. 1990.

[197] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions,"

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116,

Apr. 1998.

[198] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter and H. Ney, "A comprehensive study of deep bidirectional

LSTM RNNs for acoustic modeling in speech recognition," in IEEE International Conference on Acoustics,

Speech and Signal Processing, New Orleans, LA, USA, Jun. 2017.

[199] S. Merity, N. Keskar and R. Socher, "Regularizing and optimizing LSTM language models,"

arXiv:1708.02182, Ithaca, NY, USA, Aug. 2017.

[200] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in

International Conference on Neural Information Processing Systems, Barcelona, Spain, Dec. 2016.

[201] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy and P. Tang, "On large-batch training for deep learning:

Generalization gap and sharp minima," in International Conference on Learning Representations, Toulon,

France, Apr. 2017.

[202] V. Patel, "The impact of local geometry and batch size on convergence and divergence of stochastic gradient

descent," arXiv:1709.04718, Ithaca, NY, USA, Sep. 2017.

[203] A. Meraoumia, S. Chitroub and A. Bouridane, "An automated ear identification system using Gabor filter

responses," in International New Circuits and Systems Conference, Grenoble, France, Jun. 2015.

[204] Z. Hui, Z. Rui, M. Zhichun and W. Xiuqing, "Local feature descriptor based rapid 3D ear recognition," in

Chinese Control Conference, Nanjing, China, Jul. 2014.

[205] Y. Guo and Z. Xu, "Ear recognition using a new local matching approach," in International Conference on

Image Processing, San Diego, CA, USA, Oct. 2008.

[206] V. Ojansivu, E. Rahtu and J. Heikkilä, "Rotation invariant local phase quantization for blur insensitive texture

analysis," in International Conference on Pattern Recognition, Tampa, FL, USA, Dec. 2008.

[207] C. Sousedik and C. Busch, "Presentation attack detection methods for fingerprint recognition systems: a

survey," IET Biometrics, vol. 3, no. 4, pp. 219-233, Dec. 2014.

[208] A. Czajka and K. Bowyer, "Presentation attack detection for iris recognition: an assessment of the state-of-the-

art," ACM Computing Surveys, vol. 51, no. 4, pp. 1-35, Sep. 2018.

[209] D. Yi, Z. Lei, Z. Zhang and S. Li, "Face anti-spoofing: Multi-spectral approach," in Handbook of Biometric

Anti-Spoofing, London, Springer-Verlag, Jul. 2014, pp. 83-102.

[210] C. Kant and N. Sharma, "Fake face recognition using fusion of thermal imaging and skin elasticity,"

International Journal of Computer Science and Communication, vol. 4, no. 1, pp. 65-72, Mar. 2013.

[211] L. Sun, W. Huang and M. Wu, "TIR/VIS correlation for liveness detection in face recognition," in International

Conference on Computer Analysis of Images and Patterns, Seville, Spain, Aug. 2011.

[212] G. Tian, T. Mori and Y. Okuda, "Spoofing detection for embedded face recognition system using a low cost

stereo camera," in International Conference on Pattern Recognition, Cancun, Mexico, Dec. 2016.

[213] X. Sun, L. Huang and C. Liu, "Dual camera based feature for face spoofing detection," in Chinese Conference

on Pattern Recognition, Chengdu, China, Nov. 2016.

[214] J. Komulainen, A. Hadid and M. Pietikäinen, "Context based face anti-spoofing," in International Conference

on Biometrics: Theory, Applications and Systems, Arlington, VA, USA, Jan. 2014.

[215] K. Patel, H. Han and A. Jain, "Secure face unlock: spoof detection on smartphones," IEEE Transactions on

Information Forensics and Security, vol. 11, no. 10, pp. 2268-2283, Jun. 2016.

[216] G. Pan, L. Sun, Z. Wu and Y. Wang, "Monocular camera-based face liveness detection by combining eyeblink

and scene context," Telecommunication Systems, vol. 3, no. 1, pp. 215-225, Aug. 2011.

[217] X. Tan, Y. Li, J. Liu and L. Jiang, "Face liveness detection from a single image with sparse low rank bilinear

discriminative model," in European Conference on Computer Vision, Crete, Greece, Sep. 2010.

[218] A. Anjos and S. Marcel, "Counter-measures to photo attacks in face recognition: A public database and a

baseline," in International Joint Conference on Biometrics, Washington, DC, USA, Oct. 2011.

[219] I. Chingovska, A. Anjos and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing,"

in International Conference of Biometrics Special Interest Group, Darmstadt, Germany, Sep. 2012.

[220] Z. Zhang, J. Yan and S. Liu, "A face antispoofing database with diverse attacks," in IAPR International

Conference on Biometrics, New Delhi, India, Apr. 2012.

[221] D. Wen, H. Han and A. Jain, "Face spoof detection with image distortion analysis," IEEE Transactions on

Information Forensics and Security, vol. 10, no. 4, pp. 746-761, Apr. 2015.

[222] A. Hadid, "Physics-based face database," University of Oulu, [Online]. Available:

http://www.cse.oulu.fi/CMV/Downloads/Pbfd. [Accessed Nov. 2018].

[223] I. Manjani, S. Tariyal, M. Vatsa, R. Singh and A. Majumdar, "Detecting silicone mask-based presentation

attack via deep dictionary learning," IEEE Transactions on Information Forensics and Security, vol. 12, no. 7,

pp. 1713-1723, Mar. 2017.

[224] A. Agarwal, D. Yadav, N. Kohli, R. Singh, M. Vatsa and A. Noore, "Face presentation attack with latex masks

in multispectral videos," in Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, Jul.

2017.

[225] "Thats my face," [Online]. Available: http://faces.thatsmyface.com/. [Accessed Nov. 2018].

[226] D. Gragnaniello, G. Poggi, C. Sansone and L. Verdoliva, "An investigation of local descriptors for biometric

spoofing detection," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 849-863,

Feb. 2015.

[227] J. Määttä, A. Hadid and M. Pietikäinen, "Face spoofing detection from single images using micro-texture

analysis," in International Joint Conference on Biometrics, Washington, DC, USA, Oct. 2011.

[228] J. Määttä, A. Hadid and M. Pietikäinen, "Face spoofing detection from single images using texture and local

shape analysis," IET Biometrics, vol. 1, no. 1, pp. 3-10, Mar. 2012.

[229] N. Kose and J. Dugelay, "Classification of captured and recaptured images to detect photograph spoofing," in

International Conference on Informatics, Electronics & Vision, Dhaka, Bangladesh, May 2012.

[230] M. Waris, H. Zhang, I. Ahmad, S. Kiranyaz and M. Gabbouj, "Analysis of textural features for face biometric

anti-spoofing," in European Signal Processing Conference, Marrakech, Morocco, Sep. 2013.

[231] R. Raghavendra and C. Busch, "Robust 2D/3D face mask presentation attack detection scheme by exploring

multiple features and comparison score level fusion," in International Conference on Information Fusion,

Salamanca, Spain, Oct. 2014.

[232] A. Hadid, N. Evans, S. Marcel and J. Fierrez, "Biometrics systems under spoofing attack: An evaluation

methodology and lessons learned," IEEE Signal Processing Magazine, vol. 32, no. 5, pp. 20-30, Sep. 2015.

[233] S. Arashloo, J. Kittler and W. Christmas, "Face spoofing detection based on multiple descriptor fusion using

multiscale dynamic binarized statistical image features," IEEE Transactions on Information Forensics and

Security, vol. 10, no. 11, pp. 2396-2407, Jul. 2015.

[234] Z. Boulkenafet, J. Komulainen and A. Hadid, "Face spoofing detection using colour texture analysis," IEEE

Transactions on Information Forensics and Security, vol. 11, no. 8, pp. 1818-1830, Aug. 2016.

[235] Z. Boulkenafet, J. Komulainen and A. Hadid, "Face antispoofing using speeded-up robust features and fisher

vector encoding," IEEE Signal Processing Letters, vol. 24, no. 2, pp. 141-145, Feb. 2017.

[236] F. Peng, L. Qin and M. Long, "Face presentation attack detection using guided scale texture," Multimedia

Tools and Applications, vol. 77, no. 7, pp. 8883-8909, May 2017.

[237] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki and A. Ho, "Detection of face spoofing using visual

dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762-777, Feb. 2015.

[238] J. Komulainen, A. Hadid and M. Pietikäinen, "Face spoofing detection using dynamic texture," in Asian

Conference on Computer Vision, Daejeon, Korea, Nov. 2012.

[239] T. Pereira, A. Anjos, J. Martino and S. Marcel, "LBP-TOP based countermeasure against face spoofing

attacks," in Asian Conference on Computer Vision, Daejeon, Korea, Nov. 2012.

[240] S. Bharadwaj, T. Dhamecha, M. Vatsa and R. Singh, "Computationally efficient face spoofing detection with

motion magnification," in The IEEE Conference on Computer Vision and Pattern Recognition Workshops,

Portland, OR, USA, Jun. 2013.

[241] A. Pinto, H. Pedrini, W. Schwartz and A. Rocha, "Face spoofing detection through visual codebooks of spectral

temporal cubes," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4726-4740, Dec. 2015.

[242] Q. Phan, D. Dang-Nguyen, G. Boato and F. De Natale, "Face spoofing detection using LDP-TOP," in IEEE

International Conference on Image Processing, Phoenix, AZ, USA, Aug. 2016.

[243] Z. Zhang, D. Yi, Z. Lei and S. Li, "Face liveness detection by learning multispectral reflectance distributions,"

in International Conference on Automatic Face & Gesture Recognition and Workshops, Santa Barbara, CA,

USA, Mar. 2011.

[244] N. Kose and J. Dugelay, "Reflectance analysis based countermeasure technique to detect face mask attacks,"

in International Conference on Digital Signal Processing, Fira, Greece, Oct. 2013.

[245] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in International

Conference on Pattern Recognition, Stockholm, Sweden, Aug. 2014.

[246] J. Galbally, S. Marcel and J. Fierrez, "Image quality assessment for fake biometric detection: Application to

iris, fingerprint, and face recognition," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 710-724,

Nov. 2013.

[247] R. Haralick, K. Shanmugam and I. Dinstein, "Textural features for image classification," IEEE Transactions

on Systems, Man, and Cybernetics, vol. 3, no. 6, pp. 610-621, Nov. 1973.

[248] A. Agarwal, R. Singh and M. Vatsa, "Face anti-spoofing using Haralick features," in International Conference

on Biometrics Theory, Applications and Systems, Niagara Falls, NY, USA, Sep. 2016.

[249] A. Bhogal, D. Söllinger, P. Trung and A. Uhl, "Non-reference image quality assessment for biometric

presentation attack detection," in International Workshop on Biometrics and Forensics, Coventry, UK, Apr.

2017.

[250] J. Yan, Z. Zhang, Z. Lei, D. Yi and S. Li, "Face liveness detection by exploring multiple scenic clues," in

International Conference on Control Automation Robotics & Vision, Guangzhou, China, Dec. 2012.

[251] J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos and S. Marcel, "Complementary countermeasures for

detecting scenic face spoofing attacks," in International Conference on Biometrics, Madrid, Spain, Sep. 2013.

[252] S. Kim, S. Yu, K. Kim, Y. Ban and S. Lee, "Face liveness detection using variable focusing," in International

Conference on Biometrics, Madrid, Spain, Jun. 2013.

[253] A. Ali, F. Deravi and S. Hoque, "Directional sensitivity of gaze-collinearity features in liveness detection," in

International Conference on Emerging Security Technologies, Cambridge, UK, Sep. 2013.

[254] L. Yang, "Face liveness detection by focusing on frontal faces and image backgrounds," in International

Conference on Wavelet Analysis and Pattern Recognition, Lanzhou, China, Jul. 2014.

[255] A. Anjos, M. Chakka and S. Marcel, "Motion-based counter-measures to photo attacks in face recognition,"

IET Biometrics, vol. 3, no. 3, pp. 147-158, Sep. 2014.

[256] L. Cai, C. Xiong, L. Huang and C. Liu, "A novel face spoofing detection method based on gaze estimation,"

in Asian Conference on Computer Vision, Singapore, Singapore, Nov. 2014.

[257] J. Yang, Z. Lei and S. Li, "Learn convolutional neural network for face anti-spoofing," arXiv preprint

arXiv:1408.5601, Ithaca, NY, USA, Aug. 2014.

[258] S. Kim, Y. Ban and S. Lee, "Face liveness detection using defocus," Sensors, vol. 15, no. 1, pp. 1537-1563,

Jan. 2015.

[259] Z. Xu, S. Li and W. Deng, "Learning temporal features using LSTM-CNN architecture for face anti-spoofing,"

in IAPR Asian Conference on Pattern Recognition, Kuala Lumpur, Malaysia, Nov. 2015.

[260] D. Menotti, G. Chiachia and A. Pinto, "Deep representations for iris, face, and fingerprint spoofing detection,"

IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 864-879, Apr. 2015.

[261] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li and A. Hadid, "An original face anti-spoofing approach using

partial convolutional neural network," in International Conference on Image Processing Theory Tools and

Applications, Oulu, Finland, Dec. 2016.

[262] L. Feng, L. Po and Y. Li, "Integration of image quality and motion cues for face anti-spoofing: A neural

network approach," Journal of Visual Communication and Image Representation, vol. 38, no. 1, pp. 451-460,

Jul. 2016.

[263] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution

neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713-720, May 2017.

[264] K. Killioğlu, M. Taşkiran and N. Kahraman, "Anti-spoofing in face recognition with liveness detection using

pupil tracking," in International Symposium on Applied Machine Intelligence and Informatics, Herl'any,

Slovakia, Jan. 2017.

[265] M. De Marsico, M. Nappi, D. Riccio and J. Dugelay, "Moving face spoofing detection via 3D projective

invariants," in IAPR International Conference on Biometrics, New Delhi, India, Apr. 2012.

[266] A. Saad, "Anti-spoofing using challenge-response user interaction," Thesis in Dept. of Computer Science and

Engineering, American University in Cairo, Cairo, Egypt, Jan. 2015.

[267] A. Munalih, "Challenge response interaction for biometric liveness establishment and template protection," in

Annual Conference on Privacy, Security and Trust, Auckland, New Zealand, Dec. 2016.

[268] A. Singh, P. Joshi and G. Nandi, "Face recognition with liveness detection using eye and mouth movement,"

in International Conference on Signal Propagation and Computer Technology, Ajmer, India, Jul. 2014.

[269] O. Komogortsev, A. Karpov and C. Holland, "Attack of mechanical replicas: Liveness detection with eye

movement," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 716-725, Apr. 2015.

[270] K. Kollreider, H. Fronthaler and J. Bigun, "Verifying liveness by multiple experts in face biometrics," in IEEE

Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, Jul. 2008.

[271] C. Chang and C. Lin, "LIBSVM -- A library for support vector machines," National Taiwan University,

[Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. [Accessed Nov. 2018].

[272] ISO/IEC 30107-3, "Information Technology—Presentation Attack Detection—Part 3: Testing, Reporting and

Classification of Attacks," International Organization for Standardization, Sep. 2017.

[273] FRONTEX, "Best practice operational guidelines for automated border control (ABC) systems," European

Agency for the Management, Research and Development Unit, Warsaw, Poland, Sep. 2015.

[274] P. Chen, C. Huang, C. Lien and Y. Tsai, "An efficient hardware implementation of HOG feature extraction for

human detection," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 656-662, Oct.

2013.

[275] Y. Zhang, W. Cao and L. Wang, "Implementation of high performance hardware architecture of face

recognition algorithm based on local binary pattern on FPGA," in International Conference on ASIC, Chengdu,

China, Jul. 2016.

[276] K. Irick, M. DeBole, V. Narayanan and A. Gayasen, "A hardware efficient support vector machine architecture

for FPGA," in International Symposium on Field-Programmable Custom Computing Machines, Stanford, CA,

USA, Apr. 2008.
