UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Light Field Based Biometric Recognition and Presentation Attack Detection
Alireza Sepasmoghaddam
Supervisor: Doctor Paulo Luís Serras Lobato Correia
Co-supervisor: Doctor Fernando Manuel Bernardo Pereira
Thesis approved in public session to obtain the PhD Degree in
Electrical and Computer Engineering
Jury final classification: Pass with Distinction and Honour
2019
Jury
Chairperson: Doctor Mário Alexandre Teles de Figueiredo, Instituto Superior Técnico, Universidade de Lisboa
Members of the Committee:
Doctor Luís Filipe Barbosa de Almeida Alexandre, Faculdade de Engenharia, Universidade da Beira Interior
Doctor Alexandre José Malheiro Bernardino, Instituto Superior Técnico, Universidade de Lisboa
Doctor Luís Eduardo de Pinho Ducla Soares, Escola de Tecnologias e Arquitectura, ISCTE - Instituto Universitário de Lisboa
Doctor Paulo Luís Serras Lobato Correia, Instituto Superior Técnico, Universidade de Lisboa
Funding Institutions
This research has been made possible with funding from the Fundação para a Ciência e a Tecnologia, Instituto de Telecomunicações, Instituto Superior Técnico, European Cooperation in Science and Technology, IEEE Signal Processing Society, and European Association for Signal
Processing.
To the memory of my beloved cousin,
Ali (1983-2019)
a beautiful noble human being who we will miss terribly;
a brother who will live intimately inside my heart.
Abstract
In a world where security issues have been gaining explosive importance, face and ear recognition
systems have attracted increasing attention in multiple application areas, ranging from forensics
and surveillance to commerce and entertainment. While the recognition performance has been
steadily improving, there are still challenging recognition scenarios and conditions, notably when
facing large variations in the biometric data characteristics. Additionally, the widespread use of
face and ear recognition solutions raises new security concerns, making the robustness against
presentation attacks a very active field of research. Lenslet light field cameras have recently come
into prominence as they are able to also capture the intensity of the light rays coming from multiple
directions, thus offering a richer representation of the visual scene, notably spatio-angular
information. To take advantage of this richer representation, light field cameras have recently been
successfully applied, not only to biometric recognition, but also to biometric Presentation Attack
Detection (PAD).
This Thesis focuses on exploiting the advances in light field imaging technology towards
developing more advanced biometric recognition and PAD systems with improved performance.
In this context, new taxonomies have been developed for face and ear recognition and PAD, to
facilitate the organization and categorization of face and ear recognition and PAD solutions.
Following the proposed taxonomies, a comprehensive review of recent, representative and
relevant face and ear recognition solutions has been carried out.
In the context of this Thesis, two light field face and ear databases have been created, towards
allowing more powerful benchmarking for testing and validating face and ear recognition solutions
while exploiting the full light field data. Additionally, two light field face and ear artefact databases
have been created consisting of bona fide images and artefact images using different types of
presentation attack instruments, such as printed papers and digital displays.
Concerning recognition and PAD solutions, two hand-crafted light field based face and ear
recognition descriptors and five deep learning light field based face recognition descriptors have
been developed, evolving through progressive levels of functionality and performance.
Concerning PAD, this Thesis has developed two solutions for light field based face and ear PAD,
exploiting the variations associated with different directions of light captured in the light field images.
A comprehensive evaluation of the proposed and benchmarking face and ear recognition and PAD
solutions has been performed. The obtained results have shown the added value of light field
information for face and ear recognition and PAD purposes as the proposed solutions have
achieved superior recognition and PAD performance when compared to the state-of-the-art
benchmarking solutions.
Keywords: Biometric Recognition, Biometric Presentation Attack Detection, Biometric Database,
Light Field Imaging, Deep Learning.
Resumo
In a world where security issues have gained enormous importance, face and ear recognition
systems attract ever more attention in application areas ranging from forensic analysis and
surveillance to commerce and entertainment. Although the performance of recognition systems has
been improving steadily, there are still challenging scenarios for current systems, notably when
the biometric data are captured under less controlled conditions and exhibit significant
variations. Additionally, the widespread use of automatic recognition solutions raises new
security concerns, notably robustness against attacks in which the recognition system is presented
not with an individual but with something intended to simulate the presence of that individual.
In this context, presentation attack detection is a very active research area. Recently, lenslet
light field cameras have gained importance, as they capture the intensity of the light rays coming
from multiple directions, thus offering a richer representation of the visual scene, notably with
spatio-angular variations. This type of camera has recently been used successfully in biometric
recognition and also in biometric presentation attack detection.
This Thesis focuses on exploiting advances in lenslet light field imaging technology to develop
more efficient biometric recognition and presentation attack detection systems. In this context,
new taxonomies are proposed with the aim of facilitating the organization and categorization of
face and ear recognition solutions and of presentation attack detection solutions. Following the
proposed taxonomies, a comprehensive review of the most representative and relevant solutions has
been carried out.
In the context of the Thesis, two new databases of lenslet light field images were created for
face and ear recognition, allowing for the first time the comparison of different recognition
solutions that also exploit the angular information of the light field. Two face and ear artefact
databases were also created, containing, besides the original (bona fide) images, images of
several types of attacks using artefacts such as paper prints and digital displays, so as to test
presentation attack detection solutions exploiting more light field information.
This Thesis presents two new descriptors for face and ear recognition using conventional
(hand-crafted) techniques and five new descriptors for face recognition using deep learning neural
networks, all exploiting the additional light field information. Two new solutions are also
presented for detecting presentation attacks on face and ear recognition systems, exploiting the
variations associated with the different directions of light captured in light field images.
A comprehensive evaluation of the proposed face and ear recognition and presentation attack
detection solutions was carried out, in comparison with representative state-of-the-art
solutions. The results obtained show the significant advantage of using the additional light field
information. Indeed, the performance of the proposed recognition and presentation attack detection
solutions surpasses that obtained with the representative state-of-the-art solutions.
Keywords: Biometric Recognition, Biometric Presentation Attacks, Biometric Databases,
Light Field Images, Deep Learning Neural Networks.
Acknowledgments
Undertaking this PhD has indeed been a brilliant life-changing experience for me which would not
have been realized without the support and guidance that I received from many people.
First and foremost, I would like to express my deepest gratitude to my Ph.D. supervisors, Prof.
Paulo Correia and Prof. Fernando Pereira, for their phenomenal supervision and support which not
only fostered my professional talents, but also uplifted my characteristics as a human being. Their
immense knowledge, remarkable guidance, enthusiasm for research, patience, and trust have been
a great inspiration to me during all these years. I am truly delighted to have had
them as my PhD mentors and advisors.
My sincere thanks also go to Prof. Kamal Nasrollahi and Prof. Thomas Moeslund who provided
me the beneficial opportunity of joining their team at Aalborg University as a visiting researcher.
I also thank Prof. Mohammad-Shahram Moin, Dr. Maedeh Arvani, and Dr. Mohammad Rouhani
for their support in taking my career abroad.
I am grateful to the members of the doctoral committee, Prof. Mário Figueiredo, Prof. Luís
Alexandre, Prof. Alexandre Bernardino, and Prof. Luís Soares for their insightful detailed analysis
of this work, and for their interesting questions and remarks during my defense.
I am heartily grateful to my colleagues and friends at Instituto de Telecomunicações for all the
unforgettable moments we created together during my PhD years, especially to be named Alireza
Javaheri, Alireza Tavanfar, André Guarda, Falah Jabar, Ivo Sousa, Miguel Simões, Milad
Niknejad, and Tanmay Verlekar. I have also spent many great moments with Iranian friends during
my stay in Lisbon. In particular, I would like to thank Ahmad Nadali, Arash Abbasnia, Faezeh
Rastgari, Hamdireza Yeganeh, Mohammad Farzamian, Nahal Mojarad, Niloofar Dehghani, and
Sana Hashemi Nasl.
Words cannot express how grateful I am to my mother, Zohreh, my father, Mahmoud, and my
brother, Amir, for all of the loving sacrifices they have unconditionally made for me.
Most significantly, I wish to thank from the bottom of my heart my beloved wife, Maryam, for all
her unconditional continual love, encouragement and support. It is definitely with her truly
unparalleled love that my PhD has been completed successfully.
This work would not have been possible without the financial support of Instituto de
Telecomunicações (IT), Instituto Superior Técnico (IST), Fundação para a Ciência e a Tecnologia
(FCT), European Cooperation in Science and Technology (COST), IEEE Signal Processing
Society (IEEE SPS), and European Association for Signal Processing (EURASIP). In addition, I
am thankful to all the IT staff for the assistance and facilities they provided me to conduct my PhD
research.
Alireza Sepas-Moghaddam
Lisbon, January 2019
Table of Contents
Abstract
Resumo
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Acronyms
Part I. Objectives and Basics
Chapter 1: Introduction
1.1 Context and Motivation
1.2 Objectives
1.3 Contributions
1.4 Thesis Structure
Chapter 2: Light Field Imaging: Basic Concepts and Tools
2.1 Introduction
2.2 Plenoptic Function
2.3 Light Field Acquisition
2.3.1 Multi-Camera Arrays
2.3.2 Lenslet Light Field Cameras
2.4 Lenslet Light Field Imaging: From Micro-images to Sub-Aperture Images
2.5 Added Value for Biometric Recognition and PAD
Part II. Light Field Based Face and Ear Recognition
Chapter 3: State-of-the-Art on Face and Ear Recognition
3.1 Introduction
3.2 Face/Ear Recognition Taxonomy
3.2.1 Reviewing Existing Face Recognition Taxonomies
3.2.2 Reviewing Existing Ear Recognition Taxonomy
3.2.3 Proposing a Novel Multi-Level Face/Ear Recognition Taxonomy
3.3 Face Recognition
3.3.1 Face Databases: Status Quo
3.3.2 Non-Light Field Based Face Recognition Solutions
3.3.2.1 Appearance Based Solutions
3.3.2.2 Model Based Solutions
3.3.2.3 Learning Based Solutions (excluding Deep Learning)
3.3.2.4 Deep Learning Based Solutions
3.3.2.5 Hand-Crafted Based Solutions
3.3.2.6 Hybrid Solutions
3.3.2.7 Fusion of Solutions
3.3.2.8 Non-Light Field Based Face Recognition: the Status Quo
3.3.3 Light Field Based Face Recognition Solutions
3.3.3.1 Appearance Based Solution
3.3.3.2 Hand-Crafted Based Solutions
3.3.3.3 Light Field Based Face Recognition: the Status Quo
3.4 Ear Recognition
3.4.1 Ear Databases: Status Quo
3.4.2 Ear Recognition Solutions
3.4.2.1 Appearance Based Solutions
3.4.2.2 Geometric Based Solutions
3.4.2.3 Learning Based Solutions
3.4.2.4 Hand-Crafted Based Solutions
3.4.2.5 Hybrid Solutions
3.4.2.6 Ear Recognition: the Status Quo
Chapter 4: Proposing Novel Light Field Face and Ear Recognition Databases
4.1 Introduction
4.2 Lenslet Light Field Face Recognition Database
4.2.1 Acquisition Setup and Statistics
4.2.2 Database Variations
4.2.3 Database Elements
4.2.4 Database Structure and Naming Convention
4.2.5 Database Access and Usage Conditions
4.3 Lenslet Light Field Ear Recognition Database
4.3.1 Acquisition Setup and Statistics
4.3.2 Database Variations
4.3.3 Database Elements
4.3.4 Database Structure and Naming Convention
4.3.5 Database Access and Usage Conditions
Chapter 5: Proposing Novel Light Field Face and Ear Recognition Solutions
5.1 Introduction
5.2 Face and Ear Recognition Based on Light Field Local Binary Pattern Descriptor
5.2.1 Architecture and Walkthrough
5.2.2 Light Field Local Binary Patterns Feature Description
5.3 Face and Ear Recognition Based on Light Field Histogram of Gradients Descriptor
5.3.1 Architecture and Walkthrough
5.3.2 Light Field Histogram of Disparity Gradients Feature Description
5.4 Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor
5.4.1 Architecture and Walkthrough
5.4.2 VGG-Face Feature Description
5.4.3 Fusion Strategies Comparison
5.5 Face Recognition Based on a VGG + Conventional LSTM Double-Deep Descriptor
5.5.1 Architecture and Walkthrough
5.5.2 SA Images Selection and Scanning
5.5.3 LSTM Angular Description
5.5.4 Softmax Classification
5.6 Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors
5.6.1 Architecture and Walkthrough
5.6.2 Light Field LSTM Angular Descriptors
5.6.2.1 Gate-Level Fusion LSTM Cell Architecture
5.6.2.2 State-Level Fusion LSTM Cell Architecture
5.6.2.3 Sequential Learning LSTM Cell Architecture
5.7 Summary of the Proposed Face/Ear Recognition Solutions
Chapter 6: Light Field Face and Ear Recognition Performance
6.1 Introduction
6.2 Performance Assessment Frameworks
6.2.1 Face Recognition Performance Assessment Framework
6.2.1.1 Face Recognition Test Material
6.2.1.2 Face Recognition Evaluation Protocols
6.2.1.3 Face Recognition Performance Assessment Metrics
6.2.1.4 Face Recognition Benchmarking Solutions
6.2.2 Ear Recognition Performance Assessment Framework
6.2.2.1 Ear Recognition Test Material
6.2.2.2 Ear Recognition Evaluation Protocol and Metrics
6.2.2.3 Ear Recognition Benchmarking Solutions
6.3 LFLBP Descriptor Parameter Setting
6.3.1 LFLBP Descriptor Parameter Setting: View Radius
6.3.2 LFLBP Descriptor Parameter Setting: Number of Views and Starting Angle
6.4 Light Field Histogram of Disparity Gradients Descriptor Parameters
6.5 LSTM Hyper-Parameter Setting
6.5.1 Hyper-Parameter Evaluation: Hidden Layer Size
6.5.2 Hyper-Parameter Evaluation: Batch Size
6.5.3 Hyper-Parameter Evaluation: Number of Training Epochs
6.5.4 SA Images Selection Evaluation
6.6 Face Recognition Accuracy
6.7 Ear Recognition Accuracy
Part III. Light Field Based Face and Ear Presentation Attack Detection
Chapter 7: State-of-the-Art on Face Presentation Attack Detection
7.1 Introduction
7.2 Proposed Face Presentation Attack Detection Taxonomy
7.3 Face Artefact Databases
7.3.1 Non-Light Field Face Artefact Databases
7.3.2 Light Field Face Artefact Databases
7.4 Non-Light Field Based Face PAD Solutions
7.4.1 Texture Based Methods
7.4.2 Quality Based Methods
7.4.3 Learning Based Methods
7.4.4 Focus/Depth Based Methods
7.4.5 Motion Based Methods
7.5 Light Field Based Face PAD Solutions
7.5.1 Texture Based Methods
7.5.2 Focus/Depth Based Methods
7.6 Adaptation of Face Presentation Attack Detection Solutions for Ear Biometrics
Chapter 8: Proposing Novel Light Field Face and Ear Artefact Databases
8.1 Introduction
8.2 Light Field Based Face Artefact Database
8.2.1 Acquisition Setup and Statistics
8.2.2 Presentation Attack Instruments
8.2.3 Database Elements
8.2.4 Database Access and Usage Conditions
8.3 Light Field Based Ear Artefact Database
8.3.1 Acquisition Setup and Statistics
8.3.2 Presentation Attack Instruments
8.3.3 Database Elements
8.3.4 Database Access and Usage Conditions
Chapter 9: Proposing Novel Light Field Face and Ear Presentation Attack Detection Solutions
9.1 Introduction
9.2 PAD Based on Light Field Angular Local Binary Pattern Descriptor
9.3 PAD Based on Light Field Histogram of Disparity Gradient Descriptor
Chapter 10: Light Field Face and Ear Presentation Attack Detection Performance
10.1 Introduction
10.2 Performance Assessment Framework
10.2.1 Test Material
10.2.2 Evaluation Metrics
10.2.3 Benchmarking Methods
10.3 Face PAD Performance
10.3.1 Face PAD Accuracy Evaluation
10.3.2 Face PAD Color Features Accuracy Evaluation
10.3.3 Face PAD Generalization Accuracy Evaluation
10.3.4 Face PAD Computational Complexity
10.4 Ear PAD Performance
10.4.1 Ear PAD Accuracy Evaluation
10.4.2 Ear PAD Generalization Accuracy Evaluation
10.4.3 Ear PAD Computational Complexity
Part IV. Conclusion
Chapter 11: Summary of Contributions
11.1 Introduction
11.2 Light Field Face and Ear Recognition
11.3 Light Field Biometric Presentation Attack Detection
Chapter 12: Future Research Directions
12.1 Introduction
12.2 Future Research Directions for Light Field Face and Ear Recognition
12.3 Future Research Directions for Light Field Based Face and Ear Presentation Attack Detection
References
List of Figures
Figure 1.1: Possible attack points in a generic biometric system [14]. ........................................ 4
Figure 1.2: Structured representation of the main contributions of this Thesis. ............................ 6
Figure 1.4: Thesis organization (highlighting in gray the Thesis contributions). ........................ 11
Figure 2.1: Visualization of the plenoptic function. ................................................................... 14
Figure 2.2: Parameterization of light rays in lenslet light field cameras using two planes. ......... 15
Figure 2.3: Multi-camera arrays arrangements: (a) Regular, rectangular arrangement of cameras:
Stanford multi-camera array [51]; (b) Regular, circular arrangement; (c) Irregular arrangement of
cameras on Light L16 [50]. ....................................................................................................... 16
Figure 2.4: Lenslet light field imaging based on micro-lens array. ............................................ 17
Figure 2.5: Lenslet light field cameras: (a) unfocused and (b) focused architectures [54]. ......... 17
Figure 2.6: Lenslet light field cameras: (a) Lytro first generation camera [26]; (b) Lytro ILLUM
lenslet camera [26]; and (c) Raytrix R11 camera [57]. .............................................................. 18
Figure 2.7: The lenslet light field pre-processing architecture. .................................................. 18
Figure 2.8: Raw lenslet light field representation, before color demosaicing (each position
corresponds to a R, G or B intensity). ....................................................................................... 19
Figure 2.9: Raw lenslet light field representation, after color demosaicing. ............................... 19
Figure 2.10: Light field multi-view array of SA images and central rendered 2D image. ........... 20
Figure 2.11: A light field SA image (a) before and (b) after color and gamma corrections. ....... 20
Figure 3.1: Multi-level taxonomy for face recognition solutions [64]. ....................................... 26
Figure 3.2: Taxonomy of ear recognition solutions [5]. ............................................................. 27
Figure 3.3: Proposed multi-level face/ear recognition taxonomy. .............................................. 27
Figure 3.4: Face/ear structure level: (a) global; (b) component + structure; and (c) component
representation face structures. ................................................................................................... 28
Figure 3.5: Ear structure [5]. ..................................................................................................... 28
Figure 3.6: Feature support level: Global feature support with (a) global and (b) component face
structures; Local feature support with (c) global and (d) component face structures. ................. 29
Figure 3.7: Overview of the evolution of face recognition solutions over time, grouped based on
feature extraction approaches; performance values for the LFW database. ................................ 40
Figure 3.8: Overview of the evolution of ear recognition solutions over time, grouped based on
feature extraction approaches; performance values for the AWE database. ............................... 49
Figure 4.1: Acquisition setup at (a) IST; and (b) EURECOM. .................................................. 52
Figure 4.2: A sketch of the LLFFD acquisition setup. ............................................................... 53
Figure 4.3: Age distribution for the subjects in IST-EURECOM LLFFD. ................................. 53
Figure 4.4: Illustration of 2D rendered images for the facial variations in the IST-EURECOM
LLFFD. .................................................................................................................................... 54
Figure 4.5: Sample depth map. ................................................................................................. 55
Figure 4.6: Illustration of facial landmarks. ............................................................................. 55
Figure 4.7: Metadata associated to each subject. ....................................................................... 56
Figure 4.8: IST-EURECOM LLFFD file structure. ................................................................... 56
Figure 4.9: Illustration of IST-EURECOM LLFEDB 2D rendered ear images for the four profiles
of a specific subject in two separate acquisition sessions........................................................... 58
Figure 4.10: Examples of partially occluded ear images: (a) ear piercing; (b) earring; (c) hair; and
(d) combination of occlusions. .................................................................................................. 59
Figure 4.11: IST-EURECOM LLFEDB file structure. .............................................................. 60
Figure 1.3: Summary of the proposed recognition solutions. ..................................................... 63
Figure 5.1: Architecture of the proposed face and ear recognition solution based on LFLBP hand-
crafted descriptor. ..................................................................................................................... 64
Figure 5.2: Examples of selected SA images (red). The SA images highlighted in dark grey do
not contain usable image information due to the micro-lens shape. ........................................... 65
Figure 5.3: LFALBP descriptor extraction example. ................................................................. 67
Figure 5.4: Proposed spatial and angular descriptors combination framework. .......................... 68
Figure 5.5: Architecture of the proposed face and ear recognition solution based on the fused
LFHG hand-crafted descriptor. ................................................................................................. 69
Figure 5.6: Division of an ear sample disparity magnitude map into 8×8 sample cells and
overlapping 2×2 cell blocks. ..................................................................................................... 70
Figure 5.7: Architecture of the proposed face recognition solution based on a
2D+disparity+depth fused deep descriptor. ............................................................................... 73
Figure 5.8: Architecture of the proposed face recognition solution based on VGG + Conv-LSTM
double-deep descriptor. ............................................................................................................ 75
Figure 5.9: (a) High-density SA images selection topology; (b) row-major scanning order; (c)
snake-like scanning order; (d) max-disparity SA images selection topology; (e) mid-density
horizontal SA images selection topology; (f) mid-density vertical SA images selection. ........... 77
Figure 5.10: Score-level fusion for combining the horizontal and vertical angular information. 77
Figure 5.11: Architecture of a Conv-LSTM cell with peephole connections (indicated by a
dashed line). ............................................................................................................................. 79
Figure 5.12: Architecture of the proposed face recognition solution based on VGG + Light Field
LSTM double-deep descriptors. ................................................................................................ 82
Figure 5.13: Architecture of a GLF-LSTM cell. ........................................................................ 84
Figure 5.14: Architecture of a SLF-LSTM cell. ........................................................................ 86
Figure 5.15: Architecture of a SeqL-LSTM cell. ....................................................................... 87
Figure 6.1: IST-EURECOM LFFD (non-cropped) database division into training, validation and
testing sets for (a) Protocol 1; (b) Protocol 2; and (c) Protocol 3. .............................................. 93
Figure 6.2: CRR5 versus R for LFALBP_{R,45º,4}. ........................................................................ 97
Figure 6.3: Rank-1 recognition results versus hidden layer size considering all proposed SA
image selection methods for: (a) Protocol 1 and (b) Protocol 3. ................................................ 99
Figure 6.4: Rank-1 recognition results versus the batch size considering all proposed SA image
selection methods for: (a) Protocol 1, and (b) Protocol 3. .......................................................... 99
Figure 6.5: Rank-1 recognition results versus number of training epochs considering all proposed
SA image selection methods for: (a) Protocol 1 and (b) Protocol 3. ........................................ 100
Figure 6.6: Ear recognition cumulative recognition rank curves (up to CRR50) for the proposed
recognition and best performing benchmarking solutions........................................................ 105
Figure 7.1: Proposed taxonomy for face PAD solutions. ......................................................... 110
Figure 7.2: Illustration of different types of PAIs. ................................................................... 112
Figure 7.3: Illustration of GUC-LiFFAD face artefact acquisition [33]. .................................. 114
Figure 8.1: IST LLFFSB face artefact acquisition pipeline. .................................................... 123
Figure 8.2: IST LLFFSD example: Illustration of 2D central view rendered images for: (a) bona
fide face; (b) print paper attack; (c) wrapped print paper attack; (d) laptop attack; (e) tablet
attack; (f) mobile attack 1; and (g) mobile attack 2. ................................................................ 124
Figure 8.3: Illustration of IST LLFEADB images for a bona fide sample and corresponding
artefact samples for four different PAIs. ................................................................................. 126
Figure 8.4: IST LLFEADB ear artefact acquisition pipeline. ................................................... 126
Figure 8.5: Multi-view sub-aperture image array for an artefact ear image. ............................. 127
Figure 9.1: Architecture of the proposed face and ear PAD solution based on LFALBP hand-
crafted descriptor. ................................................................................................................... 130
Figure 9.2: Architecture of the proposed face and ear PAD solution based on LFHDG descriptor.
............................................................................................................................................... 132
Figure 10.1: DET face PAD performance for the proposed and benchmarking solutions using
IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f) wrapped paper
PAIs. ...................................................................................................................................... 138
Figure 10.2: ACER face PAD performance for the proposed and benchmarking solutions with n-
fold cross-validation. .............................................................................................................. 139
Figure 10.3: DET face PAD generalization performance for the proposed and benchmarking
solutions using IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f)
wrapped paper PAIs. .............................................................................................................. 141
Figure 12.1: Mobile phones equipped with multiple cameras. .................................................. 154
List of Tables
Table 3.1: Overview of selected, prominent face databases with different (Diff.) characteristics.
................................................................................................................................................. 31
Table 3.2: Classification of a selection of non-light field based face recognition solutions based
on the proposed taxonomy. ....................................................................................................... 32
Table 3.3: Classification of the prior and proposed (Prop.) light field based face recognition
solutions, based on the proposed taxonomy. ............................................................................. 41
Table 3.4: Overview of prior and the proposed light field based face recognition solutions. ...... 42
Table 3.5: Overview of ear databases with different (Diff.) characteristics. ............................... 44
Table 3.6: Classification of a selection of ear recognition solutions based on the developed
taxonomy. ................................................................................................................................ 45
Table 4.1: List of acronyms used in IST-EURECOM LLFFD along with their definitions. .. 57
Table 4.2: Metadata associated to each subject in each acquisition session. .............................. 60
Table 5.1: Face rank-1 recognition performance for the 2D baseline descriptor and three
alternative fusion strategies (best results in bold). ..................................................................... 74
Table 5.2: Overview of the recognition solutions proposed in this Thesis. ................................ 89
Table 6.1: Overview of the face recognition benchmarking solutions. ....................................... 95
Table 6.2: Overview of the ear recognition benchmarking solutions. ........................................ 96
Table 6.3: RR1 and CRR5 for LFLBP for different values of A and N (best results in bold). ....... 97
Table 6.4: Selected configuration for the Conv-LSTM and the proposed GLF-LSTM, SLF-
LSTM and SeqL-LSTM architectures for face recognition...................................................... 101
Table 6.5: Protocol 1 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold). ............................................................................. 101
Table 6.6: Protocol 2 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold). ............................................................................. 102
Table 6.7: Protocol 3 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold). ............................................................................. 102
Table 6.8: Protocol 1 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants. ................................................................................................ 103
Table 6.9: Protocol 2 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants. ................................................................................................ 103
Table 6.10: Protocol 3 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants. ................................................................................................ 103
Table 6.11: Ear recognition CRR1 up to CRR3 for the proposed recognition and benchmarking
solutions (best results in bold). ............................................................................................... 105
Table 7.1: Overview of publicly available non-light field face artefact databases. ................... 112
Table 7.2: Overview of publicly available light field artefact databases. ................................. 113
Table 7.3: Overview of non-light field face PAD solutions. .................................................... 116
Table 7.4: Overview of light field face PAD solutions. ........................................................... 119
Table 10.1: Overview of PAD benchmarking solutions........................................................... 137
Table 10.2: ACER face PAD performance for the proposed and benchmarking solutions using
IST LLFFSD (minimum errors in bold). ................................................................................. 138
Table 10.3: ACER face PAD performance for the proposed and benchmarking solutions using
color or gray information (minimum errors in bold). ............................................................... 140
Table 10.4: ACER face PAD generalization performance for the proposed and benchmarking
solutions using IST LLFFSD (minimum errors in bold). ......................................................... 141
Table 10.5: Average extraction and classification times, and feature vector size for the proposed
and benchmarking face PAD solutions (minimum values in bold). ......................................... 142
Table 10.6: ACER ear PAD performance for the proposed and benchmarking solutions using IST
LLFEADB baseline set (minimum errors in bold)................................................................... 143
Table 10.7: ACER ear PAD performance for the proposed and benchmarking solutions using IST
LLFEADB extended set (minimum errors in bold). ................................................................ 144
Table 10.8: ACER ear PAD generalization performance for the proposed and benchmarking
solutions using IST LLFEADB baseline set (minimum errors in bold). ................................... 144
Table 10.9: ACER ear PAD generalization performance for the proposed and benchmarking
solutions using IST LLFEADB extended set (minimum errors in bold). ................................. 144
Table 10.10: Average extraction and classification times, and feature vector size for the proposed
and benchmarking ear PAD solutions (minimum values in bold). ........................................... 145
List of Acronyms
2D 2 Dimensional
3D 3 Dimensional
3DMAD 3D Mask Attack Database
3DMM 3D Morphable Model
4D 4 Dimensional
ACER Average Classification Error Rate
AHFSVD Adaptive High-Frequency Singular Value Decomposition
ALTP Adaptive Local Ternary Pattern
APCER Attack Presentation Classification Error Rate
ASVDF Adaptive Singular Value Decomposition Face
BPCER Bona Fide Presentation Classification Error Rate
BPR Bayesian Patch Representation
BSIF Binarized Statistical Image Features
BU-3DFE Binghamton University 3D Facial Expression
CASIA Chinese Academy of Sciences, Institute of Automation
GLF-LSTM Gate-Level Fusion Long Short-Term Memory
CNN Convolutional Neural Network
Conv-LSTM VGG + Conventional Long Short-Term Memory
CRR Cumulative Recognition Rate
CS Centre-symmetric
CSLBP Center-Symmetric Local Binary Patterns
DB Database
DBN Deep Belief Network
DCP Dual-Cross Pattern
DET Detection Error Tradeoff
DFW Disguised Faces in the Wild
DLBP Depth Local Binary Pattern
DLM Dynamic Link Matching
DM Depth Map
DOG Difference of Gaussians
DP Decision Pyramid
DPC Decision Pyramid Classifier
EBGM Elastic Bunch Graph Matching
ELBP Extended Local Binary Patterns
FAR False Acceptance Rate
FPGA Field Programmable Gate Array
FERET FacE REcognition Technology
FRR False Rejection Rate
GGZ Global-Gabor-Zernike
GLCM Grey-Level Co-occurrence Matrices
G-NSRC Gabor scale feature based non-Negative Sparse Representation Classification
GPCA Generalized PCA
GPU Graphics Processing Unit
GRBP Global RBP
GW Gabor Wavelet
HDCA High Density Camera Array
HOG Histogram of Oriented Gradients
HSV Hue-Saturation-Value
ICA Independent Component Analysis
IJB-A IARPA Janus Benchmark
KED Kernel Extended Dictionary
KLT Kanade-Lucas-Tomasi
KNN k-Nearest Neighbour
KPCA MM Kernel PCA Mixture Model
LBP Local Binary Pattern
LBPNET Local Binary Pattern Network
LCCP Local Contourlet Combined Patterns
LDA Linear Discriminant Analysis
LDF Local Difference Feature
LF Light Field
LFALBP Light Field Angular Local Binary Pattern
LFHDG Light Field Histogram of Disparity Gradients
LFHG Light Field Histogram of Gradients
LFHOG Light Field Histogram of Oriented Gradients
LFLBP Light Field Local Binary Patterns
LFR Light Field Raw
LFW Labeled Faces in the Wild
LGPDP Local Gabor Phase Difference Pattern
LiFFAD Light Field Face Artefact Database
LiFFID Light Field Face and Iris Database
LLFEADB Lenslet Light Field Ear Artefact Database
LLFEDB Lenslet Light Field Ear DataBase
LLFFD Lenslet Light Field Face Database
LLFFSD Lenslet Light Field Face Spoofing Database
LPQ Local Phase Quantization
LPS Local Pattern Selection
LR Logistic Regression
LRBP Local RBP
LRT Local Radon Transform
LSM Local Shape Map
LSTM Long Short-Term Memory
MB-LBP Multi-scale Block Local Binary Patterns
MCP Mean based Contrast Patterns
MDML-DCP Multi-Directional Multi-Level Dual-Cross Pattern
ME-CS-LDP Multi-resolution Elongated Centre-Symmetric Local Derivative Pattern
MFSD Mobile Face Spoofing Database
MLBP Multi-scale Local Binary Pattern
MLFP Mask based Video Face Presentation Attack
MLP Multi-Layer Perceptron
MOBIO MObile BIOmetry
MPCA Multilinear Principal Component Analysis
MS-PCANET Multi-Scaled PCA Network
NIR Near Infra-Red
NNC Nearest Neighbor Classifier
NUAA Nanjing University of Aeronautics and Astronautics
PAD Presentation Attack Detection
PAI Presentation Attack Instrument
PCA Principal Component Analysis
PHOW Pyramidal Histogram Of visual Words
PIN Personal Identification Number
PIPA People In Photo Albums
PLS Partial Least Square
POEM Patterns of Oriented Edge Magnitudes
PPR Probabilistic Patch Representation
RBF Radial Basis Function
RBP Riesz Binary Pattern
RNN Recurrent Neural Networks
RR Recognition Rate
SA Sub-Aperture
Scface Surveillance Cameras face
SeqL-LSTM Sequential Learning Long Short-Term Memory
SIFT Scale Invariant Feature Transform
S-KDA Specific Kernel Discriminant Analysis
SLBP Spatial Local Binary Pattern
SLF-LSTM State-Level Fusion Long Short-Term Memory
SMAD Silicone Mask Attack Database
SML Sum-Modified-Laplacian
SoF Specs on Faces
SRC Sparse Reconstruction Classifier
SVM Support Vector Machine
TDSIFT Texture and Depth Scale Invariant Feature Transform
TRIVET TransfeR Nir-Vis heterogeneous facE recognition neTwork
U-3DMM Unified 3D Morphable Model
VGG Visual Geometry Group
VGG-D3 2D+Disparity+Depth VGG
VIS VISible
Part I. Objectives and Basics
Chapter 1
Introduction
1.1 Context and Motivation
Nowadays, automatically recognizing the identity of a person is of paramount importance in
various application domains, from forensics and surveillance to commerce and entertainment.
Biometric recognition, referring to the automated recognition of individuals based on their
biological and behavioral traits, appears as a viable alternative to more traditional approaches such
as Personal Identification Numbers (PINs) or passwords [1] [2]. There are multiple types of
biometric modalities available, such as fingerprint, iris, face, ear, and gait, and they are in use in
multiple types of applications. Each biometric modality has its strengths and weaknesses, and the
choice mainly depends on the application scenario [1]. Depending on the application context, the
generic term recognition may refer to either verification or identification. In a verification
system, a person claims a specific identity and the system either confirms or denies that claim. In
an identification system, the one considered in this Thesis, recognition of an individual happens
by searching the templates of all the users in the database for a match, without that individual first
claiming an identity. In the following, identification is simply called recognition.
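The distinction between verification and identification can be illustrated with a minimal sketch. It assumes that each enrolled template and each probe is a small feature vector compared using the Euclidean distance; the function names, the toy gallery and the threshold value are hypothetical and are not taken from this Thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(probe, claimed_id, gallery, threshold):
    """Verification (1:1): confirm or deny a claimed identity.

    The threshold is an illustrative assumption; in practice it would be
    tuned on validation data.
    """
    return euclidean(probe, gallery[claimed_id]) <= threshold


def identify(probe, gallery):
    """Identification (1:N): rank every enrolled identity by distance."""
    return sorted(gallery, key=lambda uid: euclidean(probe, gallery[uid]))


# Toy gallery of enrolled templates (hypothetical 2D feature vectors).
gallery = {"alice": [0.0, 0.0], "bob": [1.0, 1.0], "carol": [4.0, 0.0]}
probe = [0.9, 1.2]

ranking = identify(probe, gallery)   # closest identity first
rank1_identity = ranking[0]          # the rank-1 identification decision
```

In this sketch, identification returns a ranked list of identities without any claim being made, whereas verification only compares the probe against the single template of the claimed identity.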
This Thesis focuses on the face and ear biometric modalities. Face recognition is a nonintrusive
method, and facial images are probably the most common biometric characteristic used by humans
to perform personal recognition. Face recognition systems have been successfully used in various
application areas with high acceptability, collectability and universality [1] [3]. Since the first
automatic face recognition algorithms emerged more than four decades ago [4], this area has
attracted much research and has seen remarkable progress. Ear recognition has
also evolved as a reliable biometric modality for human identification over recent years, with its
potential stemming from the specific ear structure, which varies significantly across different
people while remaining stable over time for the same person and not changing significantly with
different facial emotions and actions [5] [6]. It has also proved useful for facial profile based
recognition [7] [8] and in combination with other modalities in multimodal biometric systems [9]
[10].
Despite the significant progress in biometric recognition performance, the widespread use of
biometric recognition applications raises new security concerns, making the robustness against
presentation attacks a very active field of research [11] [12] [13]. The security of a biometric
recognition system can be compromised at different architectural points, as illustrated in Figure
1.1, all the way from the biometric trait presentation to the final recognition decision. The attacks
to a biometric recognition system can be broadly divided into indirect and direct attacks [14].
Indirect attacks are performed inside the recognition system to bypass the feature extractor or
matcher, or to tamper with the template database. Direct attacks, also referred to as spoofing or
presentation attacks, are performed outside the biometric system by presenting falsified data, or artefacts, in
front of the acquisition sensors, for instance using printed photos or electronic devices displaying
a face or an ear. While the recognition system robustness against indirect attacks can be increased
using conventional protection mechanisms, such as data encryption and intrusion prevention and
detection [15], it is also critical to incorporate efficient Presentation
Attack Detection (PAD) solutions into recognition systems. According to ISO/IEC JTC1 SC37, a Presentation Attack
Instrument (PAI) is "an artificial object or representation presenting a copy of biometric
characteristics or synthetic biometric patterns" [11]. The most common types of attacks produced
by different PAIs involve: i) a printed face on a paper or a wrapped paper, simulating the human
face curvature; ii) a displayed face image or video on the screen of a portable device such as a
laptop, tablet, or mobile phone; and iii) 3D masks of various types [13].
Figure 1.1: Possible attack points in a generic biometric system [14].
The availability of richer imaging sensors is opening a new range of possibilities, not only for
biometric recognition but also for PAD solutions [2] [16]. Besides conventional 2D sensors, depth
sensors, such as those used by Microsoft Kinect, and Near Infra-Red (NIR) cameras have been used for face
and ear biometric recognition and PAD [17] [18] [19] [20] [21] [22] [23]. Additionally, light field
imaging technologies [24] [25] have recently come into prominence with commercial lenslet light
field cameras, such as Lytro [26], available in the market. These cameras capture not only the
intensity of light on a specific 2D plane position but also the intensity of the light rays arriving
from multiple directions in space. Light field cameras are receiving increasing interest from the
biometrics and forensics communities, for both biometric recognition [27] [28] [29] [30] [31] and
PAD [32] [33] [34].
The key advantage of the light field imaging sensors for biometric recognition and PAD comes
from the richer scene representation, allowing a posteriori refocusing, disparity exploitation and
depth map exploitation. Preliminary works [27], [28], [29], [30], [31], [32], [33], [34] have shown
the effectiveness of the supplementary information captured by light field cameras for biometric
recognition and PAD applications, even when considering one single shot. Most of these works
have explored the possibility of creating multiple focus images, rendered from the same light field
image acquisition, for instance using super-resolution and fusion schemes for the biometric
recognition and PAD tasks. The results demonstrate the added value of light field imaging in terms
of post-capture refocusing capability, and improved biometric recognition and PAD accuracy,
when compared with conventional 2D images.
While the preliminary works processed multiple 2D images at different focus or depth, biometric
recognition and PAD systems based on light field imaging can be further extended in other
directions. More precisely, by processing the rich information associated with a light field in its native
form there is potential to improve the performance of current biometric recognition and PAD
systems, which is the direction considered in this Thesis.
1.2 Objectives
This Thesis focuses on exploring the advances in light field imaging technology and applying them
to develop advanced face and ear recognition and PAD systems with improved performance. The
main research question being addressed in this work is: how to exploit the additional information
available in a light field image to improve the performance of face and ear recognition and PAD
systems?
In the context described above, the Thesis targets four main objectives:
1. Review recent advances in light field based face and ear recognition and PAD databases and
solutions.
2. Create and publicly provide to the research community new light field databases for designing,
testing and validating light field based face and ear recognition and PAD solutions.
3. Design new light field based face and ear recognition and PAD solutions to exploit the richer
information available in light field images.
4. Perform appropriate performance assessment, including benchmarking with the state-of-the-
art in face and ear recognition and PAD solutions, to assess the performance of the proposed
solutions, in terms of accuracy, generalization and complexity, while ensuring the
reproducibility of results.
1.3 Contributions
Following the main objectives defined above, the main contributions of this Thesis are illustrated
in Figure 1.2. The contributions are organized based on two main tasks, i.e., biometric recognition
and PAD, for the face and ear biometric modalities. These contributions will be presented in Part
II (Chapters 3, 4, 5, and 6) and Part III (Chapters 7, 8, 9, and 10), respectively.
Figure 1.2: Structured representation of the main contributions of this Thesis.
Part II – Light Field Based Biometric Recognition (Chapters 3, 4, 5, and 6)
In the context of light field based biometric recognition, this Thesis proposes the following main
contributions: i) a new taxonomy for face and ear recognition systems; ii) light field face and ear
databases; iii) two hand-crafted light field based descriptors for light field face and ear recognition;
and iv) five deep learning light field based solutions, evolving through progressive levels of
functionality and performance, for light field face recognition.
The contributions are summarized in the following.
1. Multi-Level Face and Ear Recognition Taxonomies
To better understand the technological landscape in the area of face and ear recognition systems,
this work proposes a new, more encompassing multi-level taxonomy for face/ear recognition
solutions, thus facilitating the organization and categorization of such solutions. The
proposed multi-level taxonomy considers four levels: i) face/ear structure; ii) feature
support; iii) feature extraction approach; and iv) feature extraction sub-approach. Following the
proposed taxonomy, a comprehensive review of recent, representative and relevant face and ear
recognition solutions has been conducted. As a result of this work, a research paper is under preparation
to be submitted to an international journal.
2. The IST-EURECOM Lenslet Light Field Face Database
A new lenslet light field face database has been proposed, the so-called IST-EURECOM Lenslet
Light Field Face Database (IST-EURECOM LLFFD), including data from 100 subjects, with 20
samples per person, captured by a Lytro ILLUM lenslet camera. This database was captured
in cooperation with EURECOM: the images of 50 subjects were captured at each of the
collaborating institutions, with the IST acquisition setup replicated in the EURECOM lab.
The images were captured in a controlled acquisition setup with different facial variations, including
emotions, actions, poses, illuminations, and occlusions, thus exposing the non-intrusive nature of
face recognition. The database includes the raw light field images, sample 2D rendered images
and the associated depth maps, along with a rich set of metadata. This research work led to the
following publication [35]:
A. Sepas-Moghaddam, V. Chiesa, P. Correia, F. Pereira, and J. Dugelay, "The IST-EURECOM
light field face database," International Workshop on Biometrics and Forensics, Coventry,
UK, Apr. 2017.
3. IST-EURECOM Lenslet Light Field Ear Database
To establish the connection between light field cameras and ear recognition research, the IST-
EURECOM Lenslet Light Field Ear DataBase (LLFEDB) has been created with a Lytro ILLUM
lenslet camera, and publicly made available to be used as a basis for testing and validating light
field based ear recognition systems. The proposed ear database consists of 536 light field ear
images from 67 subjects, with 8 image shots per person, captured with a Lytro ILLUM lenslet
camera, over two separate sessions, with four different poses per session. This research work led
to the following publication [36]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear recognition in a light field imaging
framework: a new perspective", IET Biometrics, Vol. 7, No. 3, pp. 224-231, May 2018.
4. Face and Ear Recognition Based on Light Field Local Binary Patterns Descriptor
Face and ear recognition solutions have been proposed based on a novel, simple yet effective
hand-crafted descriptor, named Light Field Local Binary Patterns (LFLBP), able to exploit the richer
information available in light field images. LFLBP is a combined descriptor with two main
components, the spatial Local Binary Pattern (LBP) and the angular LBP, able to capture not only
the usual spatial information but also the light field angular information associated to the set of
sub-aperture images, corresponding to different viewpoints. When compared with alternative
methods, the proposed descriptor has shown good face and ear recognition performance under
varied and challenging acquisition conditions. This research work led to the following publication
[37]:
A. Sepas-Moghaddam, P. Correia, and F. Pereira, "Light field local binary patterns description
for face recognition", IEEE International Conference on Image Processing, Beijing, China,
Sep. 2017.
5. Face and Ear Recognition Based on Light Field Histogram of Gradients Descriptor
Another light field based recognition solution has been proposed, able to exploit the spatio-angular
information available in light field images. This novel recognition solution is based on a new light
field hand-crafted descriptor, named Light Field Histogram of Gradients (LFHG), fusing a non-
light field based descriptor, the so-called Histogram of Oriented Gradients (HOG), with a light
field based descriptor, called Light Field Histogram of Disparity Gradients (LFHDG). The LFHG
descriptor considers both the orientation and magnitude for the spatial and angular information,
while the solution described in 4. only captures the magnitude for the spatial and angular
information. Thus, it is expected that this fused descriptor improves face and ear recognition
performance. This research work led to the following publication [36]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear recognition in a light field imaging
framework: a new perspective", IET Biometrics, Vol. 7, No. 3, pp. 224-231, May 2018.
6. Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor
Recognizing the importance of deep learning in biometric recognition, a light field face recognition
solution has been proposed, based on a VGG 2D+Disparity+Depth (VGG-D3) fused deep
descriptor. The VGG-D3 description is formed by concatenating descriptions extracted from 2D
images as well as disparity and depth maps using VGG-Face descriptor [38]. This solution is the
first adopting a fused deep CNN representation to exploit the complementary information available
in the light field for face description and then recognition. The VGG-Face descriptor, trained over
2.6 million face images, is computed based on a VGG-16 network, ignoring the last fully connected
layer in the architecture, to extract a feature vector with 4096 elements. The exploitation of disparity
maps together with 2D images and depth maps, in the context of a fusion scheme, is a novel
approach never tried in the literature, acknowledging that disparity and depth maps may bring some
complementary information to the recognition task. This research work led to the following
publication [39]:
A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund, and F. Pereira, “Light field
based face recognition via a fused deep representation”, IEEE International Workshop on
Machine Learning for Signal Processing, Aalborg, Denmark, Sep. 2018.
7. Face Recognition Based on a VGG + Conventional LSTM Double-Deep Descriptor
A face recognition solution based on a double-deep descriptor, so-called VGG + Conventional
Long Short Term Memory (Conv-LSTM), has been proposed, exploiting the multi-perspective
information available in a light field image. The fused deep representation solution described in 6.
processes only light field central view data, notably using its rendered texture and corresponding
disparity and depth maps. On the contrary, the proposed double-deep descriptor adopts a Conv-
LSTM recurrent network to extract higher dimensional angular dependencies from VGG deep
descriptions associated to different face viewpoints rendered from a full light field image, thus
offering a more powerful description for light field face recognition; a softmax layer is used for
classification. This research work led to the following submission [40]:
A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund, and F. Pereira, "A double-deep
spatio-angular learning framework for light field based face recognition", Submitted to IEEE
Transactions on Circuits and Systems for Video Technology, Oct. 2018.
8. Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors
The solution described in 7. proposes to organize the light field views’ spatial features as a
sequence to be input to a conventional LSTM network, thus ignoring the additional angular
information/dependencies, notably in terms of parallax, that could be further exploited during the
learning process to increase recognition accuracy. This work proposes three novel light field
LSTM cell architectures able to jointly learn the light field horizontal and vertical parallaxes. The
three proposed LSTM cell architectures are: i) Gate-Level Fusion LSTM (GLF-LSTM), ii)
State-Level Fusion LSTM (SLF-LSTM) and iii) Sequential Learning LSTM (SeqL-LSTM); these
architectures create richer spatio-angular light field descriptions for visual recognition tasks. The
proposed cell architectures have been integrated into a spatio-angular deep learning framework for
double-deep description, where a LSTM network adopting the proposed light field LSTM cell
architectures receives its inputs from a VGG-Face deep descriptor applied to a set of horizontal
and vertical 2D face viewpoint images, derived from a light field image. This research work led to
the following submission [41]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Light field long short-term memory: novel
cell architectures with application to face recognition", Submitted to Pattern Recognition
Letters, Oct. 2018.
Part III – Light Field Based Presentation Attack Detection (Chapters 7, 8, 9 and 10)
In the context of light field based biometric PAD, this Thesis proposes the following main
contributions: i) a new taxonomy for biometric PAD systems; ii) two light field artefact face and
ear databases; and iii) two solutions for light field based face and ear PAD. A summary of these
contributions is presented in the following.
1. Encompassing Taxonomy for Face PAD Solutions
To better understand the technological landscape in the area of PAD, this work proposes a
taxonomy to organize the face PAD solutions according to four main dimensions, notably user
interaction support, imaging sensor, contextual information and feature extraction. Following the
proposed taxonomy, a comprehensive review of recent, representative and relevant non-light field
based and light field based face PAD solutions has been developed. This research work led to the
following publication [42]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Light field based face presentation attack
detection: reviewing, benchmarking and one step further", IEEE Transactions on Information
Forensics and Security, Vol. 13, No. 7, pp. 1696-1709, Jul. 2018.
2. IST Lenslet Light Field Face Artefact Database
In the context of light field based face PAD, the IST Lenslet Light Field Face Spoofing Database
(IST LLFFSD) has been proposed, consisting of 100 bona fide images, from 50 subjects, captured
with a Lytro ILLUM lenslet camera, and a set of 600 face artefact images, captured using the same
camera. The IST LLFFSD simulates six different types of presentation attacks, including printed
paper, wrapped printed paper, laptop, tablet and two different mobile phones. This research work
led to the following publication [43]:
A. Sepas-Moghaddam, L. Malhadas, P. Correia, and F. Pereira, "Face spoofing detection using
a light field imaging framework", IET Biometrics, Vol. 7, No. 1, pp. 39-48, Jan. 2018.
3. IST Lenslet Light Field Ear Artefact Database
The IST Lenslet Light Field Ear Artefact Database (LLFEADB) has been proposed, including both
2D and light field ear artefact images. The database contains two sets: The first set, named baseline
LLFEADB, includes 268 bona fide ear samples derived from the publicly available IST-
EURECOM LLFEDB ear database, which includes ears from 67 subjects, with 4 shots per person,
captured with a Lytro ILLUM lenslet camera. The extended LLFEADB includes an additional set
of high resolution bona fide samples, captured with the same camera from 15 subjects, with 4
image shots per person. For both sets, four types of PAI were used to create the artefact samples:
a laptop, a tablet and two different mobile phones. This research work led to the following
publication [44]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear presentation attack detection:
benchmarking study with first lenslet light field database", European Signal Processing
Conference, Rome, Italy, Sep. 2018.
4. Face and Ear PAD Based on Light Field Angular Local Binary Patterns Descriptor
A novel PAD solution has been proposed based on a hand-crafted descriptor exploiting the color
and texture variations associated to the different directions of the light captured in light field
images. The proposed PAD solution is based on the Light Field Angular Local Binary Patterns
(LFALBP) descriptor, which captures the disparity information present in light field images. The
proposed PAD solution exploits the LFALBP in two different color spaces, and when applied to
face and ear PAD the resulting performance compares favorably with the alternative solutions in
the literature. This research work led to the following publications [43] [44]:
A. Sepas-Moghaddam, L. Malhadas, P. Correia, and F. Pereira, "Face spoofing detection using
a light field imaging framework", IET Biometrics, Vol. 7, No. 1, pp. 39-48, Jan. 2018.
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear presentation attack detection:
benchmarking study with first lenslet light field database", European Signal Processing
Conference, Rome, Italy, Sep. 2018.
5. Face and Ear PAD Based on Light Field Histogram of Disparity Gradients Descriptor
This work proposes a new light field based PAD solution based on a hand-crafted LFHDG
descriptor, computed in the Hue-Saturation-Value (HSV) color space, able to express the light
variations associated to the multiple light capturing directions in light field images. As the LFHDG
descriptor considers both the orientation and magnitude variations for the angular information, it
offers a more comprehensive angular description compared to the LFALBP solution described in
4. The performance of the proposed PAD solution compares favorably with state-of-the-art
solutions. This research work led to the following publications [42] [44]:
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Light field based face presentation attack
detection: reviewing, benchmarking and one step further", IEEE Transactions on Information
Forensics and Security, Vol. 13, No. 7, pp. 1696-1709, Jul. 2018.
A. Sepas-Moghaddam, F. Pereira, and P. Correia, "Ear presentation attack detection:
benchmarking study with first lenslet light field database", European Signal Processing
Conference, Rome, Italy, Sep. 2018.
1.4 Thesis Structure
This Thesis proposes novel taxonomies and light field based databases and solutions for face and
ear biometric recognition and PAD. The organization of the Thesis is illustrated in Figure 1.3.
Figure 1.3: Thesis organization (highlighting in gray the Thesis contributions).
The Thesis starts with Part I, which includes two chapters: this chapter, featuring a brief
introduction to the context, motivation, objectives, and contributions; and Chapter 2, which
briefly reviews the basic light field imaging concepts and the added value of light field cameras
for biometric recognition and PAD.
Next, Part II, including Chapters 3, 4, 5, and 6, presents the light field based biometric
recognition related contributions. Chapter 3 reviews the state-of-the-art on ear and face
recognition databases and solutions, guided by new, more encompassing taxonomies. Chapter 4
proposes two lenslet light field based face and ear databases to allow more powerful benchmarking
for testing and validating face and ear recognition solutions, exploiting the full light field data.
Chapter 5 proposes seven light field based face and ear recognition solutions, exploiting the
additional information available in a light field image. An extensive performance evaluation has
been conducted in Chapter 6 with the proposed light field databases, using a common,
representative evaluation framework for varied and challenging recognition tasks.
Part III, including Chapters 7, 8, 9, and 10, presents the light field based biometric PAD related
contributions. First, Chapter 7 proposes a taxonomy to organize the face PAD solutions according
to four main dimensions. Then, available face PAD solutions are reviewed according to the
proposed taxonomy. Next, Chapter 8 proposes two light field face and ear artefact databases for
testing and validating face and ear PAD solutions. Chapter 9 proposes two light field based face
and ear PAD solutions, exploiting the disparity information available in a light field image. Finally,
Chapter 10 assesses the proposed solutions along with state-of-the-art solutions in terms of
accuracy, generalization and complexity, using a common, representative evaluation framework.
Lastly, Part IV, featuring Chapters 11 and 12, closes this Thesis with a summary of the
achievements and some relevant future work directions.
Chapter 2
Light Field Imaging: Basic Concepts and Tools
2.1 Introduction
This Thesis is devoted to light field biometrics. The main purpose of this chapter is to review the
light field basic concepts, light field acquisition approaches, and highlight the added value of light
field imaging for visual recognition tasks, notably biometric recognition and PAD.
Light field imaging technology has emerged as one of the most promising visual representation
formats, enabling a richer and more faithful representation of a visual scene. Light field cameras
acquire more information about the light, namely information about its direction, providing richer
content to immersively experience visual scenes and to accurately perform visual recognition
tasks.
Light field acquisition shall consider light variations, both in terms of position and direction, as
expressed by the plenoptic function introduced in Section 2.2. There are currently two main
practical ways of capturing light fields, introduced in Section 2.3: using an array of cameras or
using a lenslet light field camera. Since this Thesis adopts the second approach, the main focus of
this chapter is on lenslet light field imaging.
2.2 Plenoptic Function
Light field imaging has been considered by researchers for more than a century [45]. Already
in 1908, Lippmann discussed how to use small and closely spaced circular-shaped lenses to record
many slightly different perspectives of a scene. An observer could then view an image from a
selected perspective through an array of lenses, thus selecting small portions of each captured
image to create the so-called integral image [45]. The term light field was suggested by Gershun
in 1936, referring to the amount of light traveling in all directions through every point in space
[46]. In 1991, Adelson [47] proposed using the so-called plenoptic function P(x,y,z,t,λ,θ,φ) to
describe what was called the luminous environment. As illustrated in Figure 2.1, the 7D plenoptic
function describes the information carried by the light rays at every point in the 3D space (x,y,z),
towards every possible direction (θ, φ), over any wavelength (λ), and at any time (t).
Figure 2.1: Visualization of the plenoptic function.
The plenoptic function provides a complete modeling of the light involved in a visual scene;
however, using it in practice is certainly a challenge due to the large amount of data involved and
the associated complexity.
In 1996, two simplifications of the plenoptic function were proposed: the so-called static 4D
light field L(u,v,x,y), by Levoy et al. [24], and the so-called static 4D Lumigraph by Gortler et al.
[48]. These 4D representations are more compact and easier to process as they adopt a set of
simplifications resulting from the following observations [24] [25] [49]:
1. The wavelength dimension in the plenoptic function can be simplified by considering only
three components, the red, green and blue color channels, typically used by existing capture
and display devices; in this case, each channel should integrate the plenoptic function over a
certain wavelength range. This would imply using three independent light field functions, one
for each color channel, without the wavelength dimension.
2. The radiance along a light ray crossing empty space remains constant, implying that there is
no need to consider different positions along its path, thus removing one spatial dimension.
3. For a static scene, the temporal dimension can be skipped.
4. The angular coordinates (θ,φ) can be replaced with Cartesian coordinates (u,v).
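The four observations above can be read as a chain of reductions from the 7D plenoptic function to a 4D light field per color channel. The following is only an illustrative sketch: the intermediate symbols P_c, the wavelength range Λ_c, the fixed time t_0 and the reference plane z_0 are notation introduced here, not taken from the cited works.

```latex
\begin{align*}
P_c(x,y,z,t,\theta,\varphi)
  &= \int_{\Lambda_c} P(x,y,z,t,\lambda,\theta,\varphi)\,\mathrm{d}\lambda,
  \quad c \in \{R,G,B\}
  && \text{(1) 7D} \to \text{6D per channel} \\
P_c(x,y,t,\theta,\varphi)
  &= P_c(x,y,z_0,t,\theta,\varphi)
  && \text{(2) constant radiance along a ray} \\
P_c(x,y,\theta,\varphi)
  &= P_c(x,y,t_0,\theta,\varphi)
  && \text{(3) static scene} \\
L_c(u,v,x,y)
  &= P_c\bigl(x,y,\theta(u,v,x,y),\varphi(u,v,x,y)\bigr)
  && \text{(4) Cartesian direction coordinates}
\end{align*}
```

Under this reading, a color light field is simply the triple (L_R, L_G, L_B), each component being a function of four variables.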
For a static 4D light field representation, L(u, v, x, y), the light rays can be described by their points
of intersection with two parallel planes, the X-Y plane, by convention closer to the camera, and the
U-V plane, closer to the captured scene, as illustrated in Figure 2.2. In this two-plane
parameterization, the static 4D light field, L(u, v, x, y), describes all light rays passing through the
U-V and X-Y planes. A light ray emanates from a specific point (u, v) on the U-V plane to a specific
point (x, y) in the X-Y plane, with each ray keeping its RGB radiance [49]. A multi-camera array
could then be used as a simple way to acquire the light field, including a set of cameras, with
appropriately set apertures, placed in the X-Y plane, depicted in Figure 2.2 as grey cubes.
Given the two-plane parameterization, the light information can be described in terms of position
and direction, and so the terms spatial and angular can be employed to describe these dimensions.
One interpretation is that x and y fix the position of a ray, while u and v fix its direction. In this
interpretation, X and Y are spatial, and U and V are angular dimensions – this convention is
followed throughout the Thesis.
Figure 2.2: Parameterization of light rays in lenslet light field cameras using two planes.
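As an illustration of the two-plane parameterization, the sketch below converts one light field sample (u, v, x, y) into a 3D ray. It assumes, purely for illustration, that the U-V plane lies at z = 0, that the X-Y plane lies at z = plane_distance, and that both planes share the same axes; the function name and the plane_distance parameter are hypothetical.

```python
import math


def ray_from_two_planes(u, v, x, y, plane_distance=1.0):
    """Return (origin, unit_direction) of the ray going from (u, v) to (x, y).

    Assumptions (illustrative only): the U-V plane is z = 0, the X-Y plane
    is z = plane_distance, and both planes are axis-aligned.
    """
    if plane_distance <= 0:
        raise ValueError("plane_distance must be positive")
    # Vector joining the two intersection points.
    dx, dy, dz = x - u, y - v, plane_distance
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    origin = (float(u), float(v), 0.0)
    direction = (dx / norm, dy / norm, dz / norm)
    return origin, direction


origin, direction = ray_from_two_planes(0.0, 0.0, 3.0, 4.0, plane_distance=12.0)
print(origin, direction)
```

The four scalars are enough to fix the ray because any line that is not parallel to the planes crosses each of them exactly once.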
In summary, the plenoptic function is proposed to provide a comprehensive description of the light
in any scene. However, sampling the full plenoptic function to obtain a scene representation
requires considerable data and computational complexity. The light field imaging representation
has been proposed as a way to sample the plenoptic function assuming some simplifications.
2.3 Light Field Acquisition
Light field acquisition shall consider light variations, both in terms of position and direction, as
proposed by the plenoptic function. There are currently two main practical ways of understanding
and capturing light fields: the first considers an (ideally high density) array of regular (or even
irregular) cameras, which acquires different perspectives of the same scene; the second considers
a so-called lenslet light field camera, which includes an array of microlenses, each one playing the
role of a small camera and acquiring a so-called micro-image. In practice, the two acquisition
approaches are rather equivalent with the main difference being the ‘camera’ baseline and all the
implications in size and cost that derive from that.
2.3.1 Multi-Camera Arrays
The most important parameters for a 2D array of cameras are the number of cameras, the resolution
of each camera and their arrangement. A 2D array of cameras can be arranged either in a regular
way, e.g. rectangular (Figure 2.3.a) or circular (Figure 2.3.b), or in an irregular way, such as the
irregular camera array of the recent handheld Light L16 camera [50] (Figure 2.3.c). In a 2D array
of cameras, each camera acquires a 2D slice of the 4D light field from a different perspective; by
arranging these slices, a multi-view array of 2D images can be obtained.
Figure 2.3: Multi-camera arrays arrangements: (a) Regular, rectangular arrangement of cameras:
Stanford multi-camera array [51]; (b) Regular, circular arrangement [51]; (c) Irregular
arrangement of cameras on Light L16 [50].
One of the first multi-camera arrays was designed at Stanford University in 2004, and consisted
of 128 (8×16) cameras, each with a spatial resolution of 640×480 pixels [51], as illustrated in
Figure 2.3.a. Recently, denser and higher resolution camera acquisition systems have
emerged, such as the one used to acquire the JPEG Pleno High Density Camera Array (HDCA)
content [52], which used a single camera in a rig with a 7956×5304 spatial resolution. This spatial
multiposition acquisition system considers 101 horizontal steps with a gap of 4 mm (distance
between adjacent cameras in the horizontal direction) and 21 vertical steps with a gap of 6 mm
(distance between adjacent cameras in the vertical direction).
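To give a feeling for the scale of such a dense acquisition, the following sketch derives the number of views, the physical extent of the camera grid and the total number of pixels from the figures quoted above. It assumes that the "steps" are camera positions, so that n positions leave n - 1 gaps; the function and key names are hypothetical.

```python
def camera_grid_summary(n_cols, n_rows, gap_x_mm, gap_y_mm, width_px, height_px):
    """Summarize a regular grid of camera positions.

    Assumption: n positions along one axis are separated by n - 1 gaps,
    so the grid extent is (n - 1) * gap.
    """
    n_views = n_cols * n_rows
    return {
        "views": n_views,
        "span_x_mm": (n_cols - 1) * gap_x_mm,
        "span_y_mm": (n_rows - 1) * gap_y_mm,
        "pixels_per_view": width_px * height_px,
        "total_pixels": n_views * width_px * height_px,
    }


# Figures quoted in the text for the JPEG Pleno HDCA acquisition.
hdca = camera_grid_summary(101, 21, 4, 6, 7956, 5304)
print(hdca)
```

Under this reading, the rig covers 2121 viewpoints spread over a 400 mm by 120 mm area.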
2.3.2 Lenslet Light Field Cameras
A lenslet light field camera includes a digital sensor, main optics and an aperture control, similar
to regular cameras. The main difference with respect to regular cameras comes from placing a
micro-lens array on the focal plane of the main lens, at a given distance from the sensor, as shown in
Figure 2.4. The main lens aims to focus the light rays from the object into the microlens array.
Then, the micro-lenses split the incoming light cone based on the direction of the incoming rays
onto the sensor area of the corresponding micro-lens. A microlens array is usually composed of
thousands of tiny lenses that are arranged in a rectangular, hexagonal or custom grid. The photo
sensors have their sensing positions masked by a color filter array, with the most popular being
the Bayer pattern filter [53].
Figure 2.4: Lenslet light field imaging based on micro-lens array.
Concerning the placement of the microlens array, there are two main types of lenslet light field
cameras, so-called focused and unfocused [54]. In plenoptic 1.0 cameras, also known as unfocused
light field cameras, the main lens is focused on the microlens plane while the micro-lenses are
focused at infinity, as shown in Figure 2.5.a. Unlike a conventional 2D camera, which captures an
image by integrating the intensities of all rays (from all directions) impinging on each sensor
element, each pixel in this type of light field camera collects the light of a single ray (or a thin
bundle of rays) from a given angular direction (θ,φ) that converges on a specific microlens at
position (x,y) in the array. The research advances made by Ng [25] have led to the development of
the commercially available plenoptic 1.0 set of Lytro lenslet light field cameras [26], see Figure
2.6.a and Figure 2.6.b.
Figure 2.5: Lenslet light field cameras: (a) unfocused and (b) focused architectures [54].
On the contrary, in the plenoptic 2.0 cameras, also known as focused light field cameras [55], [56],
the micro-lenses are no longer focused at infinity, but they are rather focused on the main lens
image plane, as shown in Figure 2.5.b; this justifies the name focused as each microlens is now
focused on the same subject as the main lens. Raytrix cameras [57], which target industrial
applications, follow the plenoptic 2.0 paradigm, see Figure 2.6.c. Adjusting the focal distance by
moving the main lens allows changing the depth of field, thus capturing multiple pixels per
microlens in different focus planes, raising the possibility to render 2D images with increased
spatial resolution.
In practice, the two lenslet light field camera setups, i.e., plenoptic 1.0 and plenoptic 2.0
cameras, allow different trade-offs between the spatial and angular resolutions in the captured light
field image.
Figure 2.6: Lenslet light field cameras: (a) Lytro first generation camera [26]; (b) Lytro ILLUM
lenslet camera [26]; and (c) Raytrix R11 camera [57].
In this Thesis, light field images acquired with the plenoptic 1.0 Lytro ILLUM lenslet camera,
using a 40 Megaray sensor and a 30-250 mm lens with 8.3× optical zoom and f/2.0 aperture, are
processed. In the following, the plenoptic 1.0 lenslet light field camera will be simply called light
field camera.
2.4 Lenslet Light Field Imaging: From Micro-images to Sub-Aperture
Images
This section reviews the lenslet light field pre-processing operations, relevant for different visual
analysis tasks, including biometric recognition and PAD, and possibly also coding. These operations
aim to transform the acquired lenslet light field, represented as a set of micro-images, into a
lenslet light field represented as a set of views/perspectives, usually called Sub-Aperture (SA) images. The
architecture of the light field pre-processing adopted in the Light Field Toolbox software [58] is
represented in Figure 2.7.
Figure 2.7: The lenslet light field pre-processing architecture.
The main modules in the architecture have the following tasks:
Acquisition – This module has the task to acquire the light field data from the scene, as
described in Section 2.3.2. After acquisition, a light field image is stored in a raw lenslet
format, corresponding to a set of micro-images, using a single precision floating point format,
and a resolution of 7728×5368 pixels in GRGB Bayer format, as illustrated in Figure 2.8.
RGB color demosaicing – This module has the task to recover a full RGB lenslet light field
image from the raw lenslet image, which has been obtained with a Bayer-pattern filter. To
achieve this goal, a conventional linear demosaicing technique is applied over the raw image
[59]. A sample demosaiced lenslet light field image is shown in Figure 2.9.
Multi-view array creation – This module re-arranges the demosaiced raw light field image
into a multi-view array of SA images. An SA image results from putting together the pixels in
the same position within each micro-image to create a rendered image for a specific
viewpoint/perspective; the full set of these SA images corresponds to the light field multi-view
array of SA images. A Lytro ILLUM multi-view array corresponds to 15×15 SA images
(Figure 2.10-left), each with a resolution of 434×625 pixels (Figure 2.10-right). The SA images
in black in Figure 2.10 do not contain usable image information due to the vignetting effect,
resulting from the circular microlens shape, implying that some sensor positions receive almost
no incident light.
Figure 2.8: Raw lenslet light field representation, before color demosaicing (each position
corresponds to a R, G or B intensity).
Figure 2.9: Raw lenslet light field representation, after color demosaicing.
Figure 2.10: Light field multi-view array of SA images and central rendered 2D image.
Multi-view array color and gamma corrections – By exploiting the available light field
metadata associated with each light field image, including color balance matrix and white
balance level, this module applies: i) histogram equalization to adjust the contrast using the
image's histogram; ii) color saturation adjustment to control the intensity of RGB color
channels; iii) white balance adjustment, to adjust the so-called color temperature
corresponding to the relative warmth or coolness of light; and iv) gamma correction, a
nonlinear operation to adjust the overall brightness of the image. It should be noted that the
color and gamma corrections are applied to each SA image, to enhance the quality of the multi-
view array of SA images [59]. Figure 2.11 shows a light field SA image before (left side) and
after (right side) color and gamma corrections. The output of this module is the color-gamma
corrected multi-view array of SA images which can then be used as input to the feature
extraction stage of a biometric recognition or PAD solution.
Figure 2.11: A light field SA image (a) before and (b) after color and gamma corrections.
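Two of the corrections listed above, white balance and gamma correction, can be sketched per pixel as follows. This is a minimal illustration and not the Light Field Toolbox implementation: the function names, the per-channel gains and the gamma value are assumptions, pixel values are assumed to be linear RGB in [0, 1], and histogram equalization and saturation adjustment are left out. Plain Python lists are used to keep the sketch dependency-free; an array library would be the usual choice in practice.

```python
def white_balance_and_gamma(pixel_rgb, gains=(1.0, 1.0, 1.0), gamma=2.2):
    """Apply per-channel gains, clip to [0, 1], then gamma-encode one RGB pixel.

    Assumptions (illustrative only): linear RGB input in [0, 1], gains and
    gamma provided by the caller, e.g. read from the image metadata.
    """
    corrected = []
    for value, gain in zip(pixel_rgb, gains):
        balanced = min(max(value * gain, 0.0), 1.0)  # white balance + clipping
        corrected.append(balanced ** (1.0 / gamma))  # nonlinear gamma encoding
    return tuple(corrected)


def correct_sa_image(image, gains, gamma):
    """Apply the same correction to every pixel of one SA image (list of rows)."""
    return [[white_balance_and_gamma(p, gains, gamma) for p in row] for row in image]


sa_image = [[(0.25, 0.5, 1.0), (0.0, 0.0, 0.0)]]
print(correct_sa_image(sa_image, gains=(2.0, 1.0, 1.0), gamma=2.0))
```

Since the same gains and gamma are applied to every SA image, the correction does not by itself alter the relative differences between viewpoints.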
It should be noted that other types of rendering solutions may be used to extract/render 2D images
from a raw light field image, depending on the specific needs, e.g. focus view rendering.
Nevertheless, the simple rendering solution described above is more convenient for the biometric
recognition and PAD solutions proposed in this Thesis, which aim to exploit the spatio-angular
information available in the multi-view array of SA images.
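The multi-view array creation described above can be sketched under idealized conditions. The code below assumes a rectangular microlens grid that is perfectly aligned with the sensor, with each micro-image covering exactly angular_res × angular_res pixels. A real Lytro ILLUM image requires calibration and resampling before this step, which the sketch does not cover; the function names are hypothetical.

```python
def extract_sub_aperture(lenslet, angular_res, u, v):
    """Gather pixel (u, v) of every micro-image into one sub-aperture image.

    Assumption (illustrative only): `lenslet` is a 2D list whose micro-images
    are aligned angular_res x angular_res blocks on a rectangular grid.
    """
    if not (0 <= u < angular_res and 0 <= v < angular_res):
        raise ValueError("(u, v) must lie inside the micro-image")
    n_rows = len(lenslet) // angular_res     # number of microlenses vertically
    n_cols = len(lenslet[0]) // angular_res  # number of microlenses horizontally
    return [
        [lenslet[r * angular_res + v][c * angular_res + u] for c in range(n_cols)]
        for r in range(n_rows)
    ]


def multi_view_array(lenslet, angular_res):
    """Return views[v][u], the full array of sub-aperture images."""
    return [
        [extract_sub_aperture(lenslet, angular_res, u, v) for u in range(angular_res)]
        for v in range(angular_res)
    ]


# Toy lenslet image: 2x2 microlenses, each covering 2x2 pixels.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
views = multi_view_array(toy, angular_res=2)
print(views[0][0], views[0][1])
```

In this toy case each view has as many pixels as there are microlenses, which mirrors the trade-off between spatial and angular resolution in a lenslet camera.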
In this Thesis, the pre-processing architecture illustrated in Figure 2.7 corresponds to the initial
part of the full architecture designed for the proposed light field biometric recognition and PAD
solutions.
2.5 Added Value for Biometric Recognition and PAD
Images acquired with a light field camera include rich spatio-angular information about different
viewpoints, thus supporting characteristics/functionalities such as a posteriori refocusing, disparity
exploitation, and depth exploitation. These distinctive light field characteristics can be useful for
addressing many image analysis tasks, notably face and ear biometric recognition and PAD:
1. A posteriori refocusing: A posteriori refocusing on a given selected plane can be performed
with a rendering solution controlled by a single focal shift parameter. This capability can be
very useful to improve the quality of a previously out-of-focus region of interest for the
subsequent recognition of either a single face/ear or multiple faces/ears, positioned at different
distances or focus planes [27] [28] [29]. In addition, a presentation attack image may have a
different surface geometry than bona fide samples, thus exhibiting limited differences between the
attack images rendered at different depth planes, which facilitates the detection of presentation
attacks [33].
2. Disparity exploitation: Disparity refers to the distance between the corresponding points in
different viewpoints. Given a captured light field image, it is possible to render a set of SA
images, each one corresponding to a specific viewpoint, which show some disparity between
the objects, which also depends on the distance to the camera. Disparity information is
instrumental for different analysis tasks including image registration, as it represents relevant
information for biometric recognition and PAD, such as the position and shape of shadows,
changes in contrast and contrast gradient among observation viewpoints, and defocus blur.
Disparity information can be exploited to improve the performance of biometric recognition
solutions [36] [37] [39] [40] [41] and PAD solutions [42] [43] [44].
3. Depth exploitation: Depth information expresses the distance from the scene objects to the
camera, thus providing geometric information about the position and shape of the various
objects, e.g., face components, which may not be equally expressed by disparity information.
The depth map of a light field image can provide key information for biometric recognition
[31] [39] and PAD [34]. In addition, depth information can be exploited to decide whether an
image is being captured from a flat surface or not. For example, face presentation attacks using
2D supports exhibit limited depth differences for facial landmarks, which can be exploited to
detect face presentation attacks [34]. While disparity and depth are, in theory, the same
information and may be mutually converted, the independent extraction of these two types of
information may bring additional information, notably compensating the weaknesses of each
individual extraction process.
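The single-parameter refocusing mentioned in item 1 can be illustrated with a shift-and-sum sketch. This is a minimal illustration and not the rendering solution used in this Thesis: it assumes that the light field is available as a NumPy array of SA images indexed as `lf[u, v, y, x]` and that integer pixel shifts are sufficient; the names `refocus` and `focal_shift` are hypothetical.

```python
import numpy as np

def refocus(lf, focal_shift):
    """Shift-and-sum refocusing of a 4D light field lf[u, v, y, x].

    Each SA image is shifted proportionally to its angular offset from the
    central view and the shifted images are averaged. `focal_shift` is the
    shift, in pixels, applied per unit of angular offset.
    """
    n_u, n_v, height, width = lf.shape
    cu, cv = (n_u - 1) / 2.0, (n_v - 1) / 2.0
    out = np.zeros((height, width), dtype=np.float64)
    for u in range(n_u):
        for v in range(n_v):
            dy = int(round(focal_shift * (u - cu)))
            dx = int(round(focal_shift * (v - cv)))
            # np.roll wraps around at the borders; a real renderer would pad or crop.
            out += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (n_u * n_v)
```

Under these assumptions, a scene point whose disparity between adjacent views matches `focal_shift` is aligned across all SA images and appears sharp, while points at other depths are averaged over several positions and appear blurred.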
The face and ear recognition and PAD solutions proposed in this Thesis are mostly focused on
exploiting disparity information [36] [37] [39] [40] [41] [42] [43] [44], thus capturing the richer
spatio-angular information available in a light field image. There is only one proposed face
recognition solution [39] that exploits the disparity together with depth maps, acknowledging that
disparity and depth information may bring some complementary information to the recognition
task.
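The mutual conversion between disparity and depth can be sketched for the simplest case. The example below assumes an idealized pair of rectified pinhole viewpoints, for which depth equals focal length times baseline divided by disparity; a lenslet light field camera requires calibrated parameters and a reference focus plane, which are not modelled here. The function and parameter names are hypothetical.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth (meters) from disparity (pixels) for two rectified pinhole views."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive in this simplified model")
    return focal_px * baseline_m / disparity_px

def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Disparity (pixels) from depth (meters); inverse of disparity_to_depth."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m
```

Because the relation is inversely proportional, a fixed disparity estimation error translates into a depth error that grows with distance, which is one reason why independently extracted disparity and depth maps may carry complementary information in practice.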
Part II. Light Field Based
Face and Ear Recognition
Chapter 3
State-of-the-Art on Face and Ear Recognition
3.1 Introduction
Biometric recognition, referring to the automated recognition of individuals based on their
biological and behavioral traits [1], has been successfully used in multiple application domains,
ranging from forensics and surveillance to commerce and entertainment [2]. There are different
types of biometric modalities available, such as fingerprint, iris, face, and ear, with each modality
having its strengths and weaknesses, and the choice clearly depending on the target application
[1].
This Thesis has been focused on face and ear biometric modalities. The main objective of this
chapter is to review the state-of-the-art on face and ear recognition databases and solutions. To
better understand the technological landscape in terms of recognition systems, this Thesis proposes
a new, more encompassing multi-level taxonomy that facilitates the organization and
categorization of face and ear recognition solutions. Following the
proposed taxonomy, a comprehensive review of recent, representative and relevant face and ear
recognition solutions is presented. Additionally, this chapter reviews face and ear databases that
are instrumental for designing, testing and validating face and ear recognition solutions.
3.2 Face/Ear Recognition Taxonomy
To help understand the structure and abstraction levels that can be considered in the
face and ear recognition problems, a number of face recognition [60] [61] [62] [63] [64] and ear
recognition [5] taxonomies have been developed so far, allowing the dissection and comparison of
face and ear recognition solutions. These taxonomies may also guide researchers in the
development of more efficient face and ear recognition solutions. This Thesis also attempts to
support an informed comparison of face recognition solutions, as well as ear recognition solutions,
by proposing a comprehensive multi-level face/ear recognition taxonomy for categorization of
such solutions.
3.2.1 Reviewing Existing Face Recognition Taxonomies
The existing face recognition taxonomies have been reviewed to understand their benefits and
drawbacks. In [60], a taxonomy is proposed dividing the face recognition solutions into appearance
based (holistic), feature based, and hybrid approaches. This taxonomy is widely used in the
literature and has been used for categorization of face detection and recognition solutions [61].
The face recognition taxonomy proposed in [62] gives an overview of various face recognition
solutions by classifying them into geometric vs. template based, piecemeal vs. holistic, appearance
based vs. model based, and statistical vs. neural network approaches. In [63], the face recognition
solutions are classified based on the sensing modalities, i.e., 2D conventional image, 3D and infra-
red data. Depending on the main purpose, the reviewed taxonomies classify the face recognition
solutions based on a specific abstraction level, e.g., feature extraction or sensing modalities, while
ignoring other dimensions.
To structure the face recognition solutions based on different levels of abstraction, a multi-level
taxonomy may be adopted. In [64], a multi-level face recognition taxonomy is proposed, providing
an overview of face recognition solutions based on three different abstraction levels, notably pose-
dependency, face representation, and features used for matching, as illustrated in Figure 3.1.
However, this taxonomy ignores some relevant abstraction levels, such as face structure and
feature support, which may be considered for a more complete characterization of face recognition
solutions.
Figure 3.1: Multi-level taxonomy for face recognition solutions [64].
3.2.2 Reviewing Existing Ear Recognition Taxonomy
To help understand the relations between the various ear recognition solutions, an ear
recognition taxonomy has been proposed in [5] to divide the ear recognition solutions into holistic,
geometric, local, and hybrid approaches, as illustrated in Figure 3.2. However, this taxonomy also
ignores some relevant abstraction levels, such as the ear structure and feature extraction support,
which may be considered for a more complete characterization of ear recognition solutions.
Figure 3.2: Taxonomy of ear recognition solutions [5].
3.2.3 Proposing a Novel Multi-Level Face/Ear Recognition Taxonomy
This Thesis proposes a new, more encompassing multi-level taxonomy, which can be applied to
both the face recognition and the ear recognition problems. It helps to better understand the
technological landscape in the area of face and ear recognition, thus facilitating the
characterization and organization of face and ear recognition methods. The proposed multi-level
face/ear recognition taxonomy, illustrated in Figure 3.3, has four levels:
Figure 3.3: Proposed multi-level face/ear recognition taxonomy.
1. Face/ear structure – This level describes the way a recognition solution deals with the
structure of a face or ear image. It includes three classes: i) global representation, dealing with
the face/ear as a whole (see Figure 3.4.a); ii) component + structure representation, relying on the
characteristics of some face components, such as the eyes, nose, and mouth, or some ear
components, such as the tragus, helix, and lobe (as illustrated in Figure 3.5), along with their
relations (Figure 3.4.b); and iii) component representation, dealing independently with a
selection of face/ear components, without any consideration about the relations between them
(Figure 3.4.c).
Figure 3.4: Face/ear structure level: (a) global; (b) component + structure; and (c) component
representation face structures.
Figure 3.5: Ear structure [5].
2. Feature support – This level is related to the region of support considered for feature
extraction, which can be either global or local. Global feature support implies that the region
of support is the whole image, either a face/ear (Figure 3.6.a) or a face/ear component (Figure
3.6.b), depending on the face/ear structure class considered. Local feature support implies that
the region of support is a local region of either a face/ear (Figure 3.6.c) or a face/ear component
(Figure 3.6.d). A local region of support can have different characteristics, for instance regarding
topology, size, and overlap.
Figure 3.6: Feature support level: Global feature support with (a) global and (b) component face
structures; Local feature support with (c) global and (d) component face structures.
3. Feature extraction approach – This level is related to the approaches used for feature
extraction, which may be classified as: i) appearance based, deriving features by using
statistical transformations from the intensity data; ii) model based, deriving features based on
geometrical characteristics of the face/ear; iii) learning based, deriving features by modelling
and learning relationships from the input data; and iv) hand-crafted based, deriving features by
describing pre-selected elementary characteristics computed over a local image area.
4. Feature extraction sub-approach – The last level considered in the proposed taxonomy is a
sub-category of the previous one, identifying the family of techniques used by the selected
feature extraction approach.
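The four taxonomy levels can also be expressed as a small data structure, which makes the categorization of a given solution explicit. The sketch below is only an illustration: the class and field names are hypothetical, and the three example entries follow the classifications listed for PCA, EBGM and LBP in Table 3.2.

```python
from dataclasses import dataclass
from enum import Enum

class Structure(Enum):            # Level 1: face/ear structure
    GLOBAL = "global"
    COMPONENT_PLUS_STRUCTURE = "component + structure"
    COMPONENT = "component"

class Support(Enum):              # Level 2: feature support
    GLOBAL = "global"
    LOCAL = "local"

class Approach(Enum):             # Level 3: feature extraction approach
    APPEARANCE = "appearance based"
    MODEL = "model based"
    LEARNING = "learning based"
    HAND_CRAFTED = "hand-crafted based"

@dataclass(frozen=True)
class TaxonomyEntry:
    name: str
    structure: Structure
    support: Support
    approach: Approach
    sub_approach: str             # Level 4: family of techniques (free text)

PCA = TaxonomyEntry("PCA", Structure.GLOBAL, Support.GLOBAL, Approach.APPEARANCE, "linear")
EBGM = TaxonomyEntry("EBGM", Structure.COMPONENT_PLUS_STRUCTURE, Support.GLOBAL, Approach.MODEL, "graph")
LBP = TaxonomyEntry("LBP", Structure.GLOBAL, Support.LOCAL, Approach.HAND_CRAFTED, "texture")

def group_by_approach(entries):
    """Group solution names by their level-3 feature extraction approach."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry.approach, []).append(entry.name)
    return groups
```

Keeping the fourth level as free text is a design choice of this sketch; a hybrid solution would need one entry per combined feature extractor or list-valued fields.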
Appearance based feature extraction can be divided into: i) linear solutions, such as Principal
Component Analysis (PCA) [65] and Independent Component Analysis (ICA) [66], performing
an optimal linear mapping to a lower dimensional space to extract the representative features; ii)
non-linear solutions, such as kernel PCA [67], exploiting the non-linear structure of face/ear
patterns to compute a non-linear mapping; and iii) multi-linear solutions, such as generalized PCA [68],
extracting information from high dimensional data while retaining its natural structure, providing
more compact representations than linear methods.
Model based feature extraction can be divided into: i) graph based solutions, such as Elastic Bunch
Graph Matching (EBGM) [69], representing face/ear features as a graph, where nodes store local
information about face/ear landmarks and edges represent relations, such as distances between
nodes, and a graph similarity function is used for matching; and ii) shape based solutions, such as
the 3D Morphable Model (3DMM) [70], using landmarks to represent key face/ear components,
controlled by the model, and using shape similarity functions for matching.
The third feature extraction approach, learning based solutions, can be categorized into five
families of techniques, including: i) deep neural networks, such as the VGG-Face descriptor
[38], modelling the input data with high abstraction levels by using a deep graph with multiple
processing layers to automatically learn features from the input data; ii) dictionary learning
solutions, such as Kernel Extended Dictionary (KED) [71], finding a sparse feature representation
of the input data in the form of a linear combination of basic elements; iii) decision tree solutions,
such as Decision Pyramid (DP) [72], representing features as the result of a series of decisions; iv)
regression solutions, such as Logistic Regression (LR) [73], with the relationship between
variables being iteratively refined using a measure of error for the predictions made by the
considered model; and v) Bayesian solutions, such as Bayesian Patch Representation (BPR) [74],
applying Bayes’ theorem to extract features and using a probabilistic measure of similarity.
Finally, the hand-crafted based feature extraction approach includes: i) local shape based solutions,
such as Local Shape Map (LSM) [75], defining feature vectors using local shape descriptors; ii)
texture based solutions, such as Local Binary Patterns (LBP) [76], exploring the structure of local
spatial neighborhoods; and iii) frequency based solutions, such as Local Phase Quantization (LPQ)
[77], exploring the local structure in the frequency domain.
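As an illustration of a texture based hand-crafted descriptor, the basic 3×3 LBP operator can be sketched as follows. This is a minimal version, assuming a grayscale image stored as a list of rows; the bit ordering of the neighbors is a convention chosen for this sketch, and the uniform or multi-scale LBP variants used in the literature are not covered.

```python
def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x), computed from its 3x3 neighborhood."""
    center = img[y][x]
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        # A neighbor contributes a 1 when it is at least as bright as the center.
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist
```

With local feature support, such a histogram would typically be computed per image block and the block histograms concatenated into the final feature vector.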
It is not uncommon to find hybrid face/ear recognition solutions, such as LBP Net [78] and Mesh-
LBP [79], combining elements from two or more feature extractors to improve the recognition
performance. Additionally, for face/ear recognition solutions combining multiple features
extracted using different feature extraction methods, fusion can be done at several levels, among
which feature level and score level fusion are the most often employed [80].
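The difference between the two fusion levels can be made concrete with a short sketch. The example assumes that feature level fusion is a simple concatenation of feature vectors and that score level fusion is a weighted sum of min-max normalized matching scores; both are common choices but only two of many possible rules, and all function names are hypothetical.

```python
def fuse_features(feature_vectors):
    """Feature level fusion: concatenate the feature vectors of one sample."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused

def min_max_normalize(scores):
    """Map scores to [0, 1] so that different matchers become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(score_lists, weights):
    """Score level fusion: weighted sum of normalized per-matcher scores.

    score_lists[k][i] is the score given by matcher k to gallery candidate i.
    """
    normalized = [min_max_normalize(scores) for scores in score_lists]
    n_candidates = len(normalized[0])
    return [sum(w * scores[i] for w, scores in zip(weights, normalized))
            for i in range(n_candidates)]
```

Feature level fusion requires a single classifier trained on the concatenated representation, whereas score level fusion keeps one matcher per feature type and only combines their outputs.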
3.3 Face Recognition
Face recognition systems have been successfully used in multiple application areas with high
acceptability, collectability and universality [1] [3]. Since the first automatic face recognition
algorithms emerged more than four decades ago [4], this area has attracted much research and
has seen remarkable progress. Following the multi-level face/ear recognition
taxonomy proposed in Section 3.2.3, this section reviews the state-of-the-art on face recognition
solutions. Since this Thesis focuses on the added value of light field images for biometric
recognition, the reviewed face recognition solutions are organized around the exploitation or not
of light field data. This section also provides an overview of the main characteristics of a set of
selected prominent existing face databases and the face variations addressed in these databases.
3.3.1 Face Databases: Status Quo
Face databases play a very important role for designing, testing and validating face recognition
solutions, while ensuring the reproducibility of performance results and their fair comparison. A
set of selected face biometric databases are briefly reviewed in the following.
Currently, there are over 100 publicly available face databases. Table 3.1 overviews the main
characteristics of a set of selected prominent existing face databases and the face variations
addressed in these databases, notably in terms of acquisition date, lighting, poses, expression, and
occlusions, sorted according to their release date (a more complete list can be found in [81]). For
comparison, the characteristics of the IST-EURECOM Lenslet Light Field Face Database
(IST-EURECOM LLFFD) [35] proposed in this Thesis are also included in Table 3.1.
Among the selected databases, several consider the usage of sensors that had not been considered
previously, thus motivating their creation. For instance, the MObile BIOmetry (MOBIO) database
[82] was recorded using two mobile devices, a mobile phone and a laptop computer, to boost the
research on face recognition techniques for mobile devices. The Surveillance Cameras face
(SCface) database [83] was collected to provide VISible (VIS) and Near Infra-Red (NIR) spectrum
images in an uncontrolled indoor environment. The Binghamton University 3D Facial Expression
(BU-3DFE) [19] database was developed for analyzing facial expressions in dynamic 3D spaces.
The Kinect Face database [18] provides RGB-D face images, captured by a Kinect sensor, to
evaluate how face recognition technology can benefit from this specific imaging sensor.
Table 3.1: Overview of selected, prominent face databases with different (Diff.) characteristics.
Database                      Year  No. of subjects  Image type      Image modality                             Spatial resolution                   Face variation (Diff. dates; Diff. lighting; Diff. poses; Diff. expres.; Diff. occlusion)
ORL [84]                      1994  40               Gray            2D                                         92×112
AR [85]                       1998  126              Color           2D                                         380×285
Yale B [86]                   2001  28               Gray            2D                                         640×480
FERET [87]                    2003  1199             Gray/color      2D                                         256×384
FEI [88]                      2006  200              Color           3D                                         640×480
FRAV3D [89]                   2007  106              Color           2D; 3D                                     N/A
LFW [90]                      2007  5749             Color           2D                                         250×250
Bosphorus [91]                2008  105              Color           2D; 3D                                     1600×1200
Multi-PIE [92]                2009  337              Color           2D                                         3072×2048
MOBIO [82]                    2010  150              Color           2D                                         up to 2048×1536
Texas 3D [93]                 2010  118              Color           2D; 3D                                     751×501
YouTube Faces [93]            2011  1595             Color           2D video                                   Different sizes
SCface [83]                   2011  130              Color/infrared  2D                                         100×75; 144×108; 224×168; 426×320
BU-3DFE [19]                  2013  100              Color           3D; 3D video                               1040×1329
Kinect Face DB [18]           2014  52               Color           2D; depth map                              640×480
FaceWarehouse [95]            2014  150              Color           2D; 3D                                     640×480
IJB-A [96]                    2015  500              Color           2D; 2D video                               Different sizes
PIPA [97]                     2015  2000             Color           2D                                         Different sizes
LiFFID [98]                   2016  112              Gray            2D; 2D rendered                            1054×1054; 120×120
Prop. IST-EURECOM LLFFD [35]  2016  100              Color           4D light field; 2D rendered; 2D depth map  15×15×434×625; 2022×1404; 2022×1404
SoF [99]                      2017  112              Color           2D                                         N/A
DFW [100]                     2018  1000             Color           2D                                         Different sizes
As the emergence of novel imaging sensors motivates the research community to work with
associated new and richer imaging formats, gathering a powerful light field face database was
becoming a pressing need. Light field imaging is a relatively new topic and, thus, only a few
databases have been made available. The Light Field Face and Iris Database (LiFFID) [98] is the
first face database where the importance of light field imaging sensors for facial recognition tasks
has been acknowledged. It includes a set of 2D greyscale images, focused at different depths,
rendered from the light field content acquired using a first generation Lytro lenslet camera, but
does not include the raw light field images.
3.3.2 Non-Light Field Based Face Recognition Solutions
Existing non-light field based face recognition solutions are briefly reviewed according to the
proposed multi-level face recognition taxonomy. Table 3.2 summarizes the main characteristics of
a selection of recent, representative and relevant solutions, sorted based on the feature extraction
approach, feature extraction sub-approach and, finally, publication date. Apart from the
information about taxonomy levels considered in the reviewed solutions, this table includes
information about the face databases considered in the publications reporting these solutions. The
solutions summarized in Table 3.2 are briefly reviewed in the following, grouped based on the
feature extraction approaches considered in the taxonomy.
Table 3.2: Classification of a selection of non-light field based face recognition solutions based
on the proposed taxonomy.
Solution Name                                     Year  Face Structure   Feature Support  Feature Extraction Approach  Feature Extraction Sub-Approach         Database
PCA [65]                                          1991  Global           Global           Appearance                   Linear                                  Private
ICA [66]                                          2002  Global           Global           Appearance                   Linear                                  FERET
ASVDF [101]                                       2016  Global           Global           Appearance                   Linear                                  PIE; FEI; FERET
KPCA MM [67]                                      2016  Global           Global           Appearance                   Non-Linear                              Yale; ORL
AHFSVD-Face [102]                                 2017  Global           Global           Appearance                   Non-Linear                              CMU PIE; LFW
GPCA [68]                                         2004  Global           Global           Appearance                   Multi-Linear                            AR; ORL
EBGM [69]                                         1997  Comp. + Struct.  Global           Model                        Graph                                   FERET
Homography Based [103]                            2017  Comp. + Struct.  Global; Local    Model                        Graph                                   FERET; CMU-PIE; Multi-PIE
3DMM [70]                                         2003  Comp. + Struct.  Global           Model                        Shape                                   CMU-PIE; FERET
U-3DMM [104]                                      2016  Comp. + Struct.  Global           Model                        Shape                                   Multi-PIE; AR
Face Hallucination [105]                          2016  Comp. + Struct.  Global           Learning                     Dictionary Learning                     Yale B
Orthonormal Dic. Lear. [106]                      2016  Global           Global           Learning                     Dictionary Learning                     AR
LKED [71]                                         2017  Global           Global           Learning                     Dictionary Learning                     AR; FERET; CAS-PEAL
Decision Pyramid [72]                             2017  Global           Local            Learning                     Decision Tree                           AR; Yale B
Logistic Regression [73]                          2014  Global           Global           Learning                     Regression                              ORL; Yale B
BPR [74]                                          2016  Global           Local            Learning                     Bayesian                                AR
AlexNet [107]                                     2014  Global           Global           Learning                     Deep Neural Network                     LFW; YTF
VGG Face [38]                                     2015  Global           Global           Learning                     Deep Neural Network                     LFW; YouTube
GoogLeNet [108]                                   2015  Global           Global           Learning                     Deep Neural Network                     LFW
Deep RGB-D [109]                                  2016  Global           Global           Learning                     Deep Neural Network                     Kinect Face DB
TRIVET [110]                                      2016  Global           Global           Learning                     Deep Neural Network                     CASIA
Deep HFR [111]                                    2016  Global           Global           Learning                     Deep Neural Network                     CASIA
Deep RGB-D [112]                                  2016  Global           Global           Learning                     Deep Neural Network                     Kinect Face DB
CDL [113]                                         2017  Global           Global           Learning                     Deep Neural Network                     CASIA
Deep NIR-VIS [114]                                2017  Global           Global           Learning                     Deep Neural Network                     CASIA
Deep CSH [115]                                    2017  Global           Global           Learning                     Deep Neural Network                     CASIA
AlexNet [116]                                     2018  Global           Global           Learning                     Deep Neural Network                     IJB-A; PIPA
Lightened CNN [117]                               2018  Global           Global           Learning                     Deep Neural Network                     LFW; YTF
SqueezeNet [118]                                  2018  Global           Global           Learning                     Deep Neural Network                     LFW
LSM [75]                                          2004  Global           Local            Hand-Crafted                 Shape                                   Private
LBP [76]                                          2006  Global           Local            Hand-Crafted                 Texture                                 FERET
HOG [119]                                         2011  Global           Local            Hand-Crafted                 Texture                                 FERET
DLBP [120]                                        2014  Global           Local            Hand-Crafted                 Texture                                 TEXAS; FRGC; BOSPHORUS
E-LBP [121]                                       2016  Global           Local            Hand-Crafted                 Texture                                 Yale B; FERET; CAS-PEAL
MB-LBP [122]                                      2016  Global           Local            Hand-Crafted                 Texture                                 Yale B; FERET
MR CS-LDP [123]                                   2016  Component        Local            Hand-Crafted                 Texture                                 PIE; Yale B; VALID
ALTP [124]                                        2016  Global           Local            Hand-Crafted                 Texture                                 FERET; ORL
LPQ [77]                                          2008  Global           Local            Hand-Crafted                 Frequency                               CMU PIE
Hybrid Solution: Mesh-LBP [79]                    2015  Comp. + Struct.  Local; Global    Hand-Crafted; Model          Texture; Graph                          MIT CSAIL; BU-3DFE
Hybrid Solution: LBP Net [78]                     2016  Global           Local            Hand-Crafted; Learning       Texture; Deep Neural Network            LFW; FERET
Hybrid Solution: PCA Net [125]                    2016  Global           Global           Appearance; Learning         Linear; Deep Neural Network             LFW
Hybrid Solution: Aging FR [126]                   2016  Component        Local            Hand-Crafted; Learning       Texture; Decision Tree                  MORPH
Hybrid Solution: MSB LBP + WPCA [127]             2016  Global           Local; Global    Hand-Crafted; Appearance     Texture; Linear                         ORL
Hybrid Solution: LFD+PCA [128]                    2016  Comp. + Struct.  Local; Global    Hand-Crafted; Appearance     Texture; Linear                         SGIDCDL; FERET
Hybrid Solution: Deep BeliefNet + CSLBP [129]     2016  Global           Local; Global    Hand-Crafted; Learning       Texture; Deep Neural Network            ORL
Hybrid Solution: Discriminative Dic. Lear. [130]  2016  Global           Local; Global    Local; Global                Texture; Dictionary Learning            AR; Yale B
Hybrid Solution: Nonlinear 3DMM [131]             2018  Comp. + Struct.  Global           Appearance; Model; Learning  Non-Linear; Shape; Deep Neural Network  FaceWarehouse
Fusion Scheme: RGB-D-T [132]                      2014  Global           Local            Hand-Crafted                 Texture                                 Private
Fusion Scheme: Binocular Stereo [133]             2015  Global           Local            Hand-Crafted                 Texture                                 Private
Fusion Scheme: RBP [134]                          2016  Global           Local            Hand-Crafted                 Texture                                 AR; Yale B; UMIST
Fusion Scheme: LCCP [135]                         2016  Global           Local            Hand-Crafted                 Frequency; Texture                      FERET
Fusion Scheme: Gabor-Zernike Descriptor [136]     2016  Global           Local            Hand-Crafted                 Texture                                 ORL; Yale; AR
Fusion Scheme: MDML-DCP [137]                     2016  Comp. + Struct.  Local; Global    Hand-Crafted; Appearance     Texture; Linear                         FRGC; CAS; FERET
Fusion Scheme: RGB-D-IR [138]                     2016  Global           Local; Global    Hand-Crafted; Learning       Texture; Deep Neural Network            Private
Fusion Scheme: Thermal Fus. [139]                 2016  Global           Local            Hand-Crafted                 Texture                                 Thermal/Visible Face
3.3.2.1 Appearance Based Solutions
Appearance based face recognition solutions map the input data into a lower dimensional space,
while retaining the most relevant information. Appearance based solutions are generally sensitive
to face variations, such as occlusion, scale, pose, and expression, as they do not consider any
knowledge about the face structure.
The most popular appearance based solutions for face recognition are PCA, also known as
eigenfaces [65], and ICA [66]. PCA is an appearance based face recognition solution that finds
useful representations by projecting face images onto an orthogonal representation space where
each basis image captures the highest variance possible, thus decomposing an image into an
uncorrelated linear combination of the basis images. ICA is proposed to find the independent
components that are linearly mixed by maximizing the statistical independence of the estimated
components [66]. Linear methods work based on the vectorization of intensity data; in order to
work directly with 2D images in their native state, a multi-linear appearance based solution, so-
called Generalized PCA (GPCA) [68], has been proposed. By projecting the images to a vector
space that is the tensor product of two lower-dimensional vector spaces, GPCA aims to preserve
spatial locality to improve the effectiveness of the feature extraction method. More recent
appearance based solutions include, for instance, the Kernel PCA Mixture Model (KPCA MM) [67],
a supervised version of the probabilistic kernel principal component analysis mixture model, which
captures the local non-linear structure of facial patterns and can be used for dimensionality
reduction in recognition tasks. Adaptive Singular Value Decomposition Face (ASVDF) [101] is an
illumination compensation method in the two-dimensional discrete Fourier domain, for reducing
the influence of side light on a color face image when there is insufficient light, improving the
performance of recognition systems. As a final example, Adaptive High-Frequency Singular Value
Decomposition face (AHFSVD-face) [102] adaptively selects a nonlinear parameter to generate
features according to the face image illumination level.
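The PCA (eigenfaces) projection described above can be sketched in a few lines. This is a minimal illustration and not the implementation of [65]: it assumes that face images are already aligned and vectorized into the rows of a NumPy array, it computes the basis via a singular value decomposition, and it uses nearest neighbor matching with the Euclidean distance; all function names are hypothetical.

```python
import numpy as np

def fit_eigenfaces(images, n_components):
    """images: (n_samples, n_pixels) array of vectorized face images."""
    mean_face = images.mean(axis=0)
    centered = images - mean_face
    # Rows of vt are orthonormal basis images, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:n_components]

def project(images, mean_face, eigenfaces):
    """Coefficients of the image(s) in the eigenface basis."""
    return (images - mean_face) @ eigenfaces.T

def nearest_neighbor(gallery_feats, gallery_ids, probe_feat):
    """Identity of the gallery sample closest to the probe in feature space."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return gallery_ids[int(np.argmin(dists))]
```

The projected coefficients of the training images are mutually uncorrelated, which corresponds to the decomposition into an uncorrelated linear combination of basis images mentioned in the text.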
3.3.2.2 Model Based Solutions
Model based face recognition solutions derive features based on geometrical characteristics of the
face. These solutions are generally less sensitive to face variations as they consider structural
information of the face, for which they require accurate landmark localization prior to feature
extraction.
Graph based solutions represent face features as a graph, with the nodes collecting local
information about each facial landmark and edges representing their relations, e.g. geometric
distances between the nodes. Model matching can be performed using a graph similarity function.
EBGM [69] is a graph based solution where the local texture of fiducial points on the face (eyes,
mouth, etc.) is described by a set of wavelet components, so-called jets, and the edges represent
distances between the node locations on an image; recognition is done based on a Dynamic Link
Matching (DLM) method. In the homography based normalization solution [103], an efficient
pose-invariant face recognition solution is proposed, projecting a dense grid of 3D facial
landmarks to each 2D face image, to enable pose-invariant feature extraction. Then, an optimal
warp is estimated for each landmark in order to correct the texture deformation caused by pose
variations. The reconstructed frontal-view features are then utilized for recognition.
Shape based face recognition solutions represent features for a set of points controlled by the
model to find the best matching position between the prior models and the input face image.
Landmark points represent the positions of key facial features used for facial alignment. As an
example, 3DMM [70] captures the class-specific properties of faces by learning from a data set of
3D scans. The morphable model represents face shape and texture as vectors in a high-dimensional
face space, and involves a probability density function of faces within the face space. Matching is
achieved by fitting a statistical, morphable model of 3D faces to images. The Unified 3D
Morphable Model (U-3DMM) solution [104] proposes an improved approach to learn the face
subspace, by modelling the difference in the texture map of the 3D aligned input and reference
images, resulting in an improved fit of the 3D face model.
3.3.2.3 Learning Based Solutions (excluding Deep Learning)
Learning based face recognition solutions derive features by modelling and learning relationships
from the input data. These solutions can present some robustness against facial variations,
depending on the considered training data; however, they can be computationally more complex
than solutions based on other feature extraction approaches, as they require initialization, training,
and tuning of (hyper)parameters. As the majority of recent learning based face recognition solutions are
based on deep learning, those solutions are reviewed in the next sub-section.
Dictionary learning based solutions aim to find a sparse feature representation of the input data in
the form of a linear combination of basic elements. A two-step supervised face hallucination
framework based on class-specific dictionary learning is proposed in [105] to learn a set of class-
specific dictionaries. The learned dictionaries can fit the global and local characteristics of an input
face image. Then, a maximum a posteriori estimator is used to infer the global characteristics. An
orthonormal dictionary learning solution is presented in [106], obtaining low-rank face
representations with fast computation. The solution enhances the ability of the class-specific
dictionary to represent samples from the associated class and suppresses its ability to represent
samples from other classes. In [71], several kernel principal components of occlusion variations
are learned, representing the possible occlusion variations efficiently. Then, the occlusion model
is projected by kernel discriminant analysis to get the kernel extended dictionary; finally, a
structured sparse representation classifier is used for classification.
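The notion of a sparse representation as a linear combination of a few dictionary elements can be illustrated with Orthogonal Matching Pursuit (OMP), a standard greedy sparse coding algorithm. This sketch is not the method of [105], [106] or [71]: it assumes a fixed dictionary with unit-norm columns and covers only the sparse coding step, whereas dictionary learning additionally updates the dictionary itself. The function name `omp` is hypothetical.

```python
import numpy as np

def omp(dictionary, signal, n_nonzero):
    """Greedy sparse coding of `signal` over `dictionary` (n_features, n_atoms)."""
    residual = signal.astype(float)
    support = []
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = -1.0  # never select the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms jointly by least squares.
        sub = dictionary[:, support]
        sol, *_ = np.linalg.lstsq(sub, signal, rcond=None)
        residual = signal - sub @ sol
        coeffs[:] = 0.0
        coeffs[support] = sol
    return coeffs
```

In a sparse representation classifier, the probe would typically be assigned to the class whose atoms yield the smallest reconstruction residual.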
Decision tree based solutions represent features as a model of decisions that is constructed based
on values of attributes in the input image. The Decision Pyramid Classifier (DPC) face recognition
solution [72] solves the single sample per person problem by considering large appearance
variations. DPC divides each training image into multiple non-overlapping local blocks and
extracts features from each block to generate the training feature set. By following the constructed
decision pyramid, the person identity is predicted.
Bayesian learning solutions apply Bayes’ theorem to extract features using a probabilistic measure
of similarity. In [74], a simple yet effective framework is proposed, generating, interpreting, and
aggregating the partial representations in a Bayesian manner. First, linear representations are
computed on randomly generated face patches. Second, each patch representation is considered as
a probability vector, with each element corresponding to a certain individual. The interpretation is
obtained by applying the Bayes theorem on a basic distribution assumption and, thus, is referred
to as Probabilistic Patch Representation (PPR). Finally, a linear combination of the obtained PPRs
is learned to achieve higher recognition performance.
3.3.2.4 Deep Learning Based Solutions
With the development of deep learning architectures and the increase in computational power,
rapid advances in a variety of visual recognition tasks, including face recognition, have been
observed [140]. In recent years, deep learning architectures have been increasingly adopted for face
recognition tasks and, not surprisingly, the current state-of-the-art on face recognition is dominated
by deep neural networks, notably Convolutional Neural Networks (CNNs). Deep CNN
architectures take raw data as their input and extract features using convolutional filters in multiple
levels of abstraction, followed by a few fully connected layers. However, optimizing the tens of
millions of parameters of a deep network requires a huge amount of labeled training
samples along with powerful computational resources. Hence, deep learning architectures have
been trained over millions of face images with different variations, obtaining the so-called pre-
trained face models for face recognition that can then be used for deep feature
extraction/description; at this stage, a conventional classifier such as a Support Vector Machine
(SVM) can be employed to classify the extracted features. The adaptation of the pre-trained face
model to a specific face recognition problem can also be done using a so-called transfer learning
process, meaning that the pre-trained model is fine-tuned using a part of the newly available
datasets, notably when the type of face data is different from the images used for training the
model, by changing the last (classification) layer(s) of the architecture to learn the new classes
[141]. Nowadays, the most efficient and commonly used CNN architectures for face recognition
are AlexNet [142] [116] [107], Lightened CNN [143] [117], SqueezeNet [144] [118], GoogLeNet
[145] [108], VGG-16 [146] [38], and ResNet [147] [148].
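The idea of adapting a pre-trained model by replacing only its last (classification) layer can be sketched as follows. The example assumes that deep features have already been extracted by a frozen pre-trained network and are given as the rows of a NumPy array; it trains a new softmax classification layer on those features with plain gradient descent. An SVM could be used on the same features instead, as noted in the text. The function names and hyper-parameter values are hypothetical.

```python
import numpy as np

def train_last_layer(features, labels, n_classes, lr=0.1, n_steps=200):
    """Train a softmax classification layer on frozen deep features."""
    n_samples, n_dims = features.shape
    weights = np.zeros((n_dims, n_classes))
    bias = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[labels]
    for _ in range(n_steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to the logits.
        grad = (probs - one_hot) / n_samples
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    """Predicted class index for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)
```

Only the new layer is updated in this sketch; full fine-tuning would also propagate gradients into some of the earlier layers of the pre-trained network.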
Several deep learning based face recognition solutions exploiting richer imaging representation
formats have recently been proposed and are discussed in the following, excluding light field
solutions, which are addressed in a later section. Coupled Deep Learning (CDL) [113] is proposed
to address the VIS and NIR heterogeneous matching problem. It transfers the deep representation
learned on a large-scale VIS dataset and adapts it to the NIR domain by introducing a VIS-NIR
objective function for convolutional neural networks. It seeks a deep feature space in which the
heterogeneous face matching problem can be approximately treated as a homogeneous face
matching problem. A deep TransfeR nIr-Vis heterogeneous facE recognition neTwork (TRIVET)
for NIR-VIS face recognition is proposed in [110], employing a CNN with ordinal measures to
learn discriminative models. The ordinal activation function, so-called max-feature-map, is used
to select discriminative features and make the models robust and light. Then, the models are
transformed to the NIR-VIS domain by fine-tuning with a NIR-VIS triplet loss function. A method
using the GoogLeNet to learn global features for heterogeneous face recognition is presented in
[111], which is able to learn coupled deep convolutional neural networks to map visible and NIR
faces into a domain independent latent feature space where they can be compared directly. Another
deep convolutional network is proposed in [114], using only one network to map both NIR and VIS
images to a compact Euclidean space. Each convolutional layer implements the maxout operator
and the layers are divided into two orthogonal subspaces that contain modality-invariant identity
information. The solution proposed in [115] extends the deep learning breakthrough for VIS face
recognition to the NIR spectrum, without retraining the underlying deep models trained for VIS
faces. The solution consists of two core components, cross-spectral hallucination and low-rank
embedding, optimizing the input and output of a VIS deep model for cross-spectral face
recognition, respectively. A face recognition system is proposed in [109] to recognize faces with
color and depth information including three parts: i) depth image recovery; ii) deep learning for
feature extraction with a 12-layer deep architecture; and iii) joint classification. To alleviate the
problem of the limited size of available RGB-D data for deep learning, the network is firstly trained
with a color face dataset, and later fine-tuned on depth face images exploiting transfer learning.
Finally, a deep face recognition solution is proposed in [112] to learn effective color and depth
feature transformation, containing three parts: i) depth data enhancement, recovery, and augmentation;
ii) deep CNN transfer learning, efficiently transferring the knowledge of color images to depth
images for feature extraction; and iii) joint classification of color and depth features.
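Two building blocks mentioned for TRIVET [110], the max-feature-map activation and a triplet loss, can be illustrated with a minimal sketch. The code assumes that each channel is a flat list of activations, uses a generic hinge triplet loss with a hypothetical `margin` value, and is not the exact formulation of [110].

```python
import math


def max_feature_map(channels):
    """Split the channels into two halves and keep the element-wise maximum,
    which halves the number of channels."""
    if len(channels) % 2 != 0:
        raise ValueError("max-feature-map needs an even number of channels")
    half = len(channels) // 2
    first, second = channels[:half], channels[half:]
    return [[max(a, b) for a, b in zip(c1, c2)] for c1, c2 in zip(first, second)]


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Generic hinge triplet loss: the anchor should be closer to the positive
    (same identity) than to the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Illustrative values only: four channels with two activations each.
activations = [[1.0, -2.0], [0.5, 3.0], [0.0, 4.0], [2.0, -1.0]]
print(max_feature_map(activations))
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

In a NIR-VIS setting, the anchor and positive would typically be features of the same identity captured in different spectra, so that minimizing the loss pulls the two modalities together in the feature space.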
3.3.2.5 Hand-Crafted Based Solutions
Hand-crafted based face recognition solutions derive features by describing elementary
characteristics of the visual information selected a priori. These solutions are not usually very
sensitive to face variations, e.g. pose, expression, occlusion, aging, illumination changes, as they
can consider different/multiple scales, orientations, and frequency bands. These solutions require
tuning one/several parameters such as region size, scale, or topology. However, they are not
computationally expensive as there is no need for training at feature extraction level.
Texture based solutions can encode the local structures in spatial neighbourhoods within an image.
For instance, LBP [76] is among the most widely used local texture descriptors. This solution
divides the face image into several regions from which the LBP feature distributions are computed,
encoded and concatenated into a feature vector to be used as a face description. Sample encoding
is performed by comparing the gray value of each neighbour pixel with that of the center pixel,
i.e., by thresholding their difference at zero. Another widely used local texture
descriptor is the Histogram of Oriented Gradients (HOG) [149], which is able to represent spatial
gradient variations for face recognition [119]. HOG divides a face image into small connected
regions, called cells, and for each cell a histogram of edge orientations is computed. The histogram
channels are evenly spread over the range 0–180° or 0–360°, depending on whether the gradient
is ‘unsigned’ or ‘signed’, respectively. The histogram counts are normalized to compensate for illumination.
The combination of these histograms corresponds to the final HOG description. In [120], a depth
image descriptor called DLBP (Depth Local Binary Pattern) is proposed, capturing features from
texture and depth values of neighbourhood patterns. As it takes a similar form to conventional
LBP, patterns can be readily combined to form joint histograms to represent depth faces. The
Extended Local Binary Patterns (ELBP) solution [121] decomposes angular and radial differences
into complementary components of sign and magnitude, learning the most frequently occurring
patterns and their labels to capture discriminative textural information. Histogram features are
obtained from each given face image by concatenating spatial histograms extracted from
non-overlapping sub-regions, which are then used for face classification. Multi-scale Block LBP
(MB-LBP) [122] processes average pixel values of block sub-regions instead of single pixels. Then,
binarized histograms obtained from MB-LBP are used for a rapid comparison of face images.
Multi-resolution Elongated Centre-Symmetric Local Derivative Pattern (ME-CS-LDP) is
proposed in [123], capturing more information from important elliptical parts of the face,
such as the eyes and mouth. An adaptive local feature descriptor, Adaptive Local Ternary
Pattern (ALTP) [124], is proposed based on an adaptive sampling threshold, exploiting positive
and negative channel patterns for extracting more discriminative information.
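The LBP encoding and the region-wise histogram concatenation described above can be sketched as follows. This is a minimal illustration assuming a grayscale image stored as a list of rows, a 3×3 neighbourhood, and a hypothetical 2×2 region grid; the neighbour ordering is a convention chosen here, not one taken from [76].

```python
def lbp_code(image, r, c):
    """8-neighbour LBP code of pixel (r, c): a neighbour contributes a 1 bit
    when (neighbour - center) >= 0, i.e. zero is the threshold."""
    center = image[r][c]
    # Clockwise neighbour offsets starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] - center >= 0:
            code |= 1 << bit
    return code


def lbp_histogram(image, top, left, height, width):
    """256-bin histogram of LBP codes over the non-border pixels of a region."""
    hist = [0] * 256
    for r in range(max(top, 1), min(top + height, len(image) - 1)):
        for c in range(max(left, 1), min(left + width, len(image[0]) - 1)):
            hist[lbp_code(image, r, c)] += 1
    return hist


def lbp_descriptor(image, grid=(2, 2)):
    """Concatenate the per-region histograms into a single feature vector."""
    region_h, region_w = len(image) // grid[0], len(image[0]) // grid[1]
    feature = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            feature.extend(lbp_histogram(image, i * region_h, j * region_w, region_h, region_w))
    return feature


patch = [[5, 9, 1], [4, 6, 7], [2, 6, 3]]
print(lbp_code(patch, 1, 1))
```

Two face images can then be compared through a histogram distance, e.g. chi-square, between their concatenated descriptors.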
Frequency descriptors encode the local structures in frequency neighbourhoods within images. A
well-known example is the LPQ [77] hand-crafted feature descriptor, based on quantizing the
Fourier transform phase in local neighbourhoods. Histograms of LPQ labels computed within local
regions are used for face image description, similarly to LBP.
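The phase quantization behind LPQ can be sketched as follows: a local Fourier transform is evaluated at four low frequencies, and the signs of the real and imaginary parts form an 8-bit label. The sketch assumes a grayscale image stored as a list of rows and a base frequency of `1 / win`, and it omits the coefficient decorrelation step of the full descriptor.

```python
import cmath
import math


def lpq_code(image, r, c, win=3):
    """8-bit LPQ label at pixel (r, c) computed from a win x win neighbourhood."""
    half = win // 2
    a = 1.0 / win  # assumed base frequency
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    coeffs = []
    for u, v in freqs:
        acc = 0j
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                phase = -2j * math.pi * (u * dx + v * dy)
                acc += image[r + dy][c + dx] * cmath.exp(phase)
        coeffs.append(acc)
    # Quantize the sign of the four real parts, then of the four imaginary parts.
    parts = [z.real for z in coeffs] + [z.imag for z in coeffs]
    code = 0
    for bit, value in enumerate(parts):
        if value >= 0:
            code |= 1 << bit
    return code


def lpq_histogram(image, win=3):
    """256-bin histogram of LPQ labels over all pixels with a full window."""
    half = win // 2
    hist = [0] * 256
    for r in range(half, len(image) - half):
        for c in range(half, len(image[0]) - half):
            hist[lpq_code(image, r, c, win)] += 1
    return hist
```

As with LBP, the histograms computed within local regions would be concatenated to describe the whole face image.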
3.3.2.6 Hybrid Solutions
Hybrid face recognition solutions combine elements from two or more feature extraction solutions,
taking advantage of their strengths to offer a more discriminative representation. It is worth noting
that hybrid solutions usually do not refer to simply combining multiple features/classification scores
extracted by different feature extraction approaches. Hybrid face recognition solutions, especially
those using deep learning, are often quite competitive, but depending on their building blocks may
also be computationally more complex than other approaches.
One example is the Local Binary Pattern Network (LBPNet) [78], an unsupervised deep learning
based solution that efficiently extracts and compares high-level features in a multilayer hierarchy.
LBPNet retains the same CNN topology, whereas the trainable kernels are replaced by the off-the-
shelf LBP descriptor. In Multi-Scaled PCA Network (MS-PCANet) [125], a multiple scale
combined deep learning network is proposed to learn a set of high-level feature representations
through each stage of the convolutional neural network for face recognition. The network obtains
the filter kernels by learning the principal components of images using PCA, then nonlinearly
processes the convolutional results using simple binary hashing, and pools them using a spatial
pyramid pooling method. Finally, the output features of several stages are fed to the classifier, thus
classifying multi-scaled features. In [126], a hierarchical model based on two-level
learning is proposed. At the first level, effective features are learned from low-level
microstructures, based on a Local Pattern Selection (LPS) descriptor, selecting low-level
discriminant patterns to minimize intra-user dissimilarity. At the second level, higher level visual
information is further refined based on the output from the first level. The nonlinear 3DMM [131]
solution contains a deep network, encoding the projection, shape and texture parameters. Two
decoders nonlinearly map from the shape and texture parameters to the 3D shape and texture,
respectively. With the projection parameter, 3D shape and texture, an analytically-differentiable
rendering layer is designed to reconstruct the input face. Another hybrid solution is proposed in
[127], combining Centre-Symmetric (CS)-LBP based on Gaussian pyramids and weighted PCA
for face recognition; different classifiers are used to select the optimal classification approach. In
[128], a hybrid solution is proposed, applying dense sampling around each detected feature point,
extracting Local Difference Feature (LDF) for face representation, and then utilizing PCA and
linear discriminant analysis to reduce feature dimension; finally, cosine similarity evaluation is
used for classification. In [129], another hybrid solution is proposed, combining Center-Symmetric
Local Binary Patterns (CSLBP) and a Deep Belief Network (DBN). CSLBP is applied to
extract local texture features of face images and the extracted features are used as the input to a
DBN. A face recognition solution based on discriminative dictionary learning, LBP, and
regularized robust coding is proposed in [130]; it obtains the Gabor amplitude images of a face
image using a Gabor filter, extracts the uniform LBP histograms to form a new dictionary, and,
finally, classifies the test image via sparse representation coding. The challenge of LBP computation
on a mesh manifold is addressed in [79] by proposing a computational framework, called mesh-
LBP, allowing the extraction of LBP-like patterns directly from a triangular mesh manifold,
without the need for any intermediate representation in the form of depth images.
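Since two of the hybrid solutions above, [127] and [129], build on the center-symmetric LBP variant, a minimal sketch of that encoding may help. It assumes a 3×3 neighbourhood on a grayscale image stored as a list of rows and a hypothetical `threshold` parameter; the neighbourhoods and thresholds actually used in [127] and [129] may differ.

```python
def cs_lbp_code(image, r, c, threshold=0):
    """4-bit CS-LBP code: compare the four center-symmetric neighbour pairs
    around (r, c) instead of comparing each neighbour with the center."""
    ring = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit in range(4):
        (r1, c1), (r2, c2) = ring[bit], ring[bit + 4]  # opposite positions
        if image[r + r1][c + c1] - image[r + r2][c + c2] > threshold:
            code |= 1 << bit
    return code


def cs_lbp_histogram(image, threshold=0):
    """16-bin histogram of CS-LBP codes over the non-border pixels."""
    hist = [0] * 16
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[cs_lbp_code(image, r, c, threshold)] += 1
    return hist


patch = [[5, 9, 1], [4, 6, 7], [2, 6, 3]]
print(cs_lbp_code(patch, 1, 1))
```

Compared with the 256 possible labels of 8-neighbour LBP, the 16 labels of CS-LBP yield a much shorter histogram, which is convenient when the description is fed to a subsequent learning stage.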
3.3.2.7 Fusion of Solutions
Face recognition fusion can be performed at four levels to combine the relevant information
[150]: i) feature level, usually concatenating features obtained by different feature extractors into
a single vector for classification; ii) score level, combining the different classifier output scores,
usually using the ‘sum rule’; iii) rank level, combining the ranking of the enrolled identities to
consolidate the ranks output from multiple biometric systems; and iv) decision level, combining
different decisions by those biometric matchers that provide access only to the final recognition
decision, usually adopting a ‘majority vote’ scheme. Feature level and score level fusion are the
most commonly used approaches in the biometric literature. Generally, feature level fusion
contains richer information than score level; however, it is not always possible to apply it due to:
i) incompatibility of the features extracted in different feature spaces, in terms of data precision,
scale, structure, and size; and ii) large dimensionality of the concatenated features thus leading to
a higher complexity in the matching stage [80]. If one of these difficulties exists, fusion can be
performed at the rank, score or decision levels.
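The four fusion levels can be contrasted with a small sketch. It assumes that each matcher outputs a dictionary of similarity scores per enrolled identity (higher is more similar), applies min-max normalization before the sum rule, and uses a Borda count as one possible rank level rule; all function names and score values are hypothetical.

```python
from collections import Counter


def feature_level_fusion(*feature_vectors):
    """Concatenate feature vectors from different extractors into one vector."""
    return [value for vector in feature_vectors for value in vector]


def min_max_normalize(scores):
    """Map one matcher's scores to [0, 1] so that matchers become comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {identity: 0.0 for identity in scores}
    return {identity: (s - low) / (high - low) for identity, s in scores.items()}


def score_level_fusion(*matcher_scores):
    """Sum rule: add the normalized scores each matcher gives to an identity."""
    fused = {}
    for scores in matcher_scores:
        for identity, s in min_max_normalize(scores).items():
            fused[identity] = fused.get(identity, 0.0) + s
    return fused  # higher is better


def rank_level_fusion(*matcher_scores):
    """Borda count: sum the rank position of each identity over all matchers."""
    fused = {}
    for scores in matcher_scores:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, identity in enumerate(ordered):
            fused[identity] = fused.get(identity, 0) + rank
    return fused  # lower is better


def decision_level_fusion(decisions):
    """Majority vote over the final decisions of the individual matchers."""
    return Counter(decisions).most_common(1)[0][0]


texture_scores = {"A": 0.9, "B": 0.4, "C": 0.1}
deep_scores = {"A": 10.0, "B": 30.0, "C": 20.0}
fused_scores = score_level_fusion(texture_scores, deep_scores)
fused_ranks = rank_level_fusion(texture_scores, deep_scores)
print(max(fused_scores, key=fused_scores.get))
print(min(fused_ranks, key=fused_ranks.get))
print(decision_level_fusion(["A", "B", "A"]))
```

The normalization step reflects the incompatibility issue mentioned above: scores produced in different ranges cannot be summed directly, just as features with different scales cannot be concatenated without care.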
A fused representation named Riesz Binary Pattern (RBP) is proposed in [134], consisting of two
complementary components: a Local RBP (LRBP) and a Global RBP (GRBP). LRBP is obtained
by applying a local binary coding operator on each Riesz transform response to extract image
intrinsic two-dimensional structure features [151], while GRBP is the global binary coding of joint
information of multi-scale image analysis and multi-order Riesz transform. Histograms of LRBP
and GRBP are concatenated at feature level to form the face RBP hand-crafted description. In
[135], a fused solution called Local Contourlet Combined Patterns (LCCP) is proposed, combining
local descriptions at multiple scales, orientations, and frequency bands at feature level. LBP and
Mean based Contrast Patterns (MCP) have been applied to different levels of Contourlet transform
coefficients and then concatenated at feature level. Then, a block based kernel Fisher linear
discriminant is used to select the most discriminative feature sets. In [136], a feature level fusion
scheme is proposed, combining a multi-scale and rotation invariant global feature descriptor called
Global-Gabor-Zernike (GGZ) with HOG for face recognition. Another feature level fusion
solution extracts Multi-Directional Multi-Level Dual-Cross Patterns (MDML-DCPs) [137],
encoding the invariant characteristics of a face image at multiple levels into patterns based on
Dual-Cross Pattern (DCP) descriptions at both component and global face representation levels.
Several fused face recognition solutions based on emerging, non-light field sensors have recently
been proposed. In [132], a feature level fusion solution is proposed, concatenating extracted
features from RGB, depth, and thermal data using LBP, HOG, and HAAR methods. A score level
fusion solution is proposed in [133], adding classification scores obtained by horizontal gradient
ordinal relationship patterns and steerable filters to perform face recognition on rectified stereo
images. The fused representation solution proposed in [138] combines shallow Pyramidal
Histogram Of visual Words (PHOW) and VGG-Face deep descriptions at feature and score levels
to perform face recognition on a new RGB-depth-infrared database. A feature level fusion scheme
proposed in [139] combines LBP, Gabor jet, Weber local and down-sampling local descriptors for
thermal face recognition.
3.3.2.8 Non-Light Field Based Face Recognition: the Status Quo
Figure 3.7 illustrates the evolution of face recognition solutions over time, grouped based on their
feature extraction approaches. Figure 3.7 also includes the typical performance in terms of
Recognition Rate at rank 1 (RR1) obtained for each group of techniques on the LFW [88] database.
Appearance based solutions dominated the face recognition landscape from the early 1990s
until around 1997. Then, model based solutions appeared and remained the state-of-the-art until
approximately 2006. Hand-crafted based solutions were introduced afterwards, providing a moderate
improvement in the accuracy of face recognition solutions. In 2014, DeepFace [107] dramatically
improved the state-of-the-art accuracy, from around 80% to above 97.5%. From that date, the face
recognition research focus has shifted to deep learning based solutions and the current state-of-
the-art on face recognition is dominated by deep neural networks, offering more than 99%
accuracy for the LFW database [152].
Figure 3.7: Overview of the evolution of face recognition solutions over time, grouped based on
feature extraction approaches; performance values for the LFW database.
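The RR1 metric reported in Figure 3.7 can be computed as in the following sketch, which assumes a closed-set identification scenario, a probe-by-gallery matrix of similarity scores (higher is more similar), and hypothetical labels and scores.

```python
def rank_k_recognition_rate(score_matrix, probe_labels, gallery_labels, k=1):
    """Fraction of probes whose true identity is among the k best gallery matches."""
    hits = 0
    for scores, true_label in zip(score_matrix, probe_labels):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top_labels = [gallery_labels[i] for i in order[:k]]
        if true_label in top_labels:
            hits += 1
    return hits / len(probe_labels)


gallery = ["a", "b", "c"]
probes = ["a", "b", "a"]
scores = [
    [0.9, 0.1, 0.3],  # probe 0: best match is "a" (correct)
    [0.2, 0.3, 0.8],  # probe 1: best match is "c" (wrong, true label is "b")
    [0.6, 0.5, 0.1],  # probe 2: best match is "a" (correct)
]
print(rank_k_recognition_rate(scores, probes, gallery, k=1))
print(rank_k_recognition_rate(scores, probes, gallery, k=2))
```

Note that LFW results are often reported as verification accuracy on image pairs, so the values in the literature may follow a different protocol than this identification sketch.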
In [141], a comprehensive evaluation of deep learning models for face recognition using different
CNN architectures and different databases, under various facial variations, is presented.
Additionally, the impact of different covariates, such as compression artefacts, occlusions, noise,
and color information, on the face recognition performance for different architectures has been
studied [118]. The results have shown that the VGG-Face descriptor [38], computed using a VGG-
16 network, achieves superior recognition performance under various facial variations, and is more
robust to different covariates, when compared to relevant alternatives. The VGG-Face descriptor
has been trained over 2.6 million face images, covering rich variations in expression, pose,
occlusion, and illumination, obtaining a so-called pre-trained VGG-Face model for face
recognition, containing 144 million parameters. The VGG-Face description is obtained by running
the VGG-16 network [146] without the last two fully connected layers [38] using the pre-trained
VGG-Face model, thus including 13 convolutional layers, followed by one fully connected layer,
resulting in a feature vector of size 4096.
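The structure of the truncated network used for the VGG-Face description can be made concrete by counting its layers and parameters. The sketch assumes the standard VGG-16 configuration, which is not detailed in the text above: 3×3 kernels, five 2×2 max-pooling stages, and a 224×224 RGB input.

```python
# (input channels, output channels) of the 13 convolutional layers of VGG-16.
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
KERNEL_SIZE = 3    # assumed 3x3 kernels
NUM_POOLINGS = 5   # assumed 2x2 max-pooling after each convolutional block
INPUT_SIZE = 224   # assumed input resolution
DESCRIPTOR_SIZE = 4096


def descriptor_network_summary():
    """Layer and parameter counts of VGG-16 truncated after the first
    fully connected layer (weights plus biases)."""
    conv_params = sum(
        c_in * c_out * KERNEL_SIZE * KERNEL_SIZE + c_out for c_in, c_out in CONV_LAYERS
    )
    spatial = INPUT_SIZE // (2 ** NUM_POOLINGS)           # 7
    flattened = spatial * spatial * CONV_LAYERS[-1][1]     # 7 * 7 * 512
    fc_params = flattened * DESCRIPTOR_SIZE + DESCRIPTOR_SIZE
    return {
        "conv_layers": len(CONV_LAYERS),
        "conv_params": conv_params,
        "fc_params": fc_params,
        "descriptor_size": DESCRIPTOR_SIZE,
    }


print(descriptor_network_summary())
```

Under these assumptions, most of the parameters retained for feature extraction sit in the single fully connected layer rather than in the 13 convolutional layers.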
3.3.3 Light Field Based Face Recognition Solutions
This section reviews existing face recognition solutions exploiting light field sensors. Several face
recognition solutions exploiting the richer light field imaging information have recently been
proposed. Following the multi-level face recognition taxonomy developed, Table 3.3 classifies the
available light field based face recognition solutions, along with the databases used for reporting
results. The characteristics of the light field based face recognition solutions being proposed in this
Thesis are also listed in Table 3.3 for comparison purposes.
Table 3.3: Classification of the prior and proposed (Prop.) light field based face recognition
solutions, based on the proposed taxonomy.
Solution Name | Year | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Database
--- | --- | --- | --- | --- | --- | ---
MPCA Tensor [153] | 2008 | Global | Global | Appearance | Multi-Linear | N/A
LF Face [28] | 2013 | Global | Local | Hand-Crafted | Texture | Private
Multi-Face LF [29] | 2013 | Global | Local | Hand-Crafted | Texture | LiFFID
Super Res. LF [30] | 2013 | Global | Local | Hand-Crafted | Texture | LiFFID
Face-Iris MF LF [27] | 2016 | Global | Local | Hand-Crafted | Texture | LiFFID
DM LF [31] | 2016 | Global | Local | Hand-Crafted | Texture | Private
Prop. LFLBP [37] | 2017 | Global | Local | Hand-Crafted | Texture | LLFFD
Prop. LFHG [36] | 2018 | Global | Local | Hand-Crafted | Texture | LLFFD
Prop. VGG-D3 [39] | 2018 | Global | Local | Learning | Deep Neural Nets | LLFFD
Prop. VGG+ Conv-LSTM [40] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD
Prop. VGG+ GLF-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD
Prop. VGG+ SLF-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD
Prop. VGG+ SeqL-LSTM [41] | 2018 | Global | Global | Learning | Deep Neural Nets | LLFFD
Additionally, Table 3.4 summarizes the main characteristics, including feature extraction method,
classifier, light field capability exploited, and light field format, of prior and proposed light field
based face recognition solutions. The genesis of these solutions is associated with three distinct
light field capabilities, i.e., a posteriori refocusing, disparity exploitation and depth map
exploitation (see Section 2.5). The solutions summarized in Table 3.3, excluding the proposed
solutions, are briefly reviewed in the following, grouped based on the feature extraction
approaches considered in the taxonomy.
Table 3.4: Overview of prior and the proposed light field based face recognition solutions.
Solution Name | Year | Feature Extraction Method | Classifier | Light Field Capability | Format
--- | --- | --- | --- | --- | ---
MPCA Tensor [153] | 2008 | MPCA | NNC | Disparity exploitation | M-V SA array
LF Face [28] | 2013 | LBP | NNC | Depth computation | 2D rendered from LF
Multi-Face LF [29] | 2013 | LBP; LG filter | SRC | A posteriori refocusing | 2D rendered from LF
Super Res. LF [30] | 2013 | LBP | SRC | A posteriori refocusing | 2D rendered from LF
Face-Iris MF [27] | 2016 | HOG; LBP; CSLBP; BSIF | SRC | A posteriori refocusing | 2D rendered from LF
DM LF [31] | 2016 | LFHOG | SVM | Depth computation | M-V SA array
Prop. LFLBP [37] | 2017 | LFLBP | SVM | Disparity exploitation | M-V SA array
Prop. LFHG [36] | 2018 | HOG; LFHDG | SVM | Disparity exploitation | M-V SA array
Prop. VGG-D3 [39] | 2018 | VGG | SVM | Disparity exploitation; Depth computation | M-V SA array
Prop. VGG+ Conv-LSTM [40] | 2018 | VGG; Conv-LSTM | Softmax | Disparity exploitation | M-V SA array
Prop. VGG+GLF-LSTM [41] | 2018 | VGG; GLF-LSTM | Softmax | Disparity exploitation | M-V SA array
Prop. VGG+SLF-LSTM [41] | 2018 | VGG; SLF-LSTM | Softmax | Disparity exploitation | M-V SA array
Prop. VGG+SeqL-LSTM [41] | 2018 | VGG; SeqL-LSTM | Softmax | Disparity exploitation | M-V SA array
3.3.3.1 Appearance Based Solution
There are a few multilinear appearance based solutions able to analyse the high dimensional light
field image information in its native form, thus exploiting the disparity information available in a
light field image; it is worth mentioning that none of the multilinear appearance based solutions
was originally designed for face recognition. Multilinear Principal Component Analysis (MPCA)
[153] is one such solution using tensors for feature extraction, and is able to decompose the original
problem into a series of multiple projection sub-problems to capture most of the tensorial input
variations. As a light field image represented in the form of a multi-view array of SA images can
be interpreted as a 4D tensor, MPCA has been considered for light field based face recognition in
this Thesis for the first time.
3.3.3.2 Hand-Crafted Based Solutions
The first group of hand-crafted based solutions relies on the a posteriori refocusing capability
when using light field imaging. This can improve the image quality of a previously out-of-focus
region of interest for the subsequent recognition of either a single face or multiple faces, positioned
at different distances. The solution presented in [28] proposes a wavelet energy based approach to
select the best focused face image from a set of refocused images, rendered from a light field
image. Then, the LBP descriptor is applied to extract features and a Nearest Neighbor Classifier
(NNC) is used for classification. Another solution is based on a resolution enhancement scheme
[29], using the discrete wavelet transform, to capture high frequency components from differently
focused 2D images to create an all-in-focus face image to be input to an LBP descriptor. The
identification of multiple faces at different distances is investigated in [30] by exploring an
all-in-focus image created from a light field image. An LBP descriptor is applied to extract features from
the all-in-focus image and a Sparse Reconstruction Classifier (SRC) is used to perform the
recognition task. In [27], a face recognition solution is presented, relying on rendering a light field
image in different focus planes in two different ways: i) selecting the best focus image; and ii)
combining focus images to create a super-resolved image; both approaches have been considered
in this research. Different local descriptors including HOG, LBP, CSLBP, and Binarized Statistical
Image Features (BSIF) are used for feature extraction.
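The best-focus selection step used by refocusing based solutions such as [28] can be illustrated with a simple focus measure. The sketch uses the energy of one-level 2D Haar detail coefficients as a stand-in for the wavelet energy criterion; the wavelet and decomposition level actually used in [28] are not specified here, and the focal stack is a hypothetical list of grayscale images of equal size.

```python
def haar_detail_energy(image):
    """Energy of the one-level 2D Haar detail subbands, computed on
    non-overlapping 2x2 blocks; sharper images yield higher values."""
    energy = 0.0
    for r in range(0, len(image) - 1, 2):
        for c in range(0, len(image[0]) - 1, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            horizontal = (a - b + d - e) / 2.0
            vertical = (a + b - d - e) / 2.0
            diagonal = (a - b - d + e) / 2.0
            energy += horizontal ** 2 + vertical ** 2 + diagonal ** 2
    return energy


def select_best_focused(focal_stack):
    """Return the index of the refocused image with the highest detail energy,
    together with the energies of all images in the stack."""
    energies = [haar_detail_energy(image) for image in focal_stack]
    best = max(range(len(focal_stack)), key=lambda i: energies[i])
    return best, energies


blurred = [[5, 5, 5, 5] for _ in range(4)]
sharp = [[0, 10, 0, 10], [10, 0, 10, 0], [0, 10, 0, 10], [10, 0, 10, 0]]
print(select_best_focused([blurred, sharp]))
```

In practice, the measure would be evaluated on the detected face region of each refocused image, so that the selected focal plane is the one where the face, rather than the background, is sharpest.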
The second group of hand-crafted based solutions relies on exploiting the depth information that
can be estimated from a light field image, thus providing geometric information about the position
and shape of facial components. In [31], a depth map computed from a light field image is analyzed
using a HOG descriptor for extracting discriminative features, which are then fed into a linear
SVM classifier to perform the recognition task.
3.3.3.3 Light Field Based Face Recognition: the Status Quo
Table 3.3 summarizes the six available light field based face recognition solutions, along with the
databases used for reporting results. As can be observed, the experiments performed in [28] and
[31] were conducted on private databases, so there is no way to compare their performance with
the other light field based face recognition solutions. MPCA has been considered for light field
based face recognition in this Thesis for the first time, meaning that no previous recognition
performance results had been reported in the literature. Concerning [29] [30] [27], although they
are tested on the same database, LiFFID, there is no comparative study available to analyze the
level of performance achieved by these solutions. In conclusion, given the nature of the existing
solutions and the way they were tested, it is difficult to provide precise information on the
performance to expect from light field based face recognition. The benchmarking study performed
in this Thesis will address this shortcoming.
3.4 Ear Recognition
Since the face is not the only relevant biometric trait, an increasing amount of research has recently
been devoted to ear recognition. As the human ear structure remains stable over time for
the same person, and it does not present significant changes for different facial emotions and
actions, ear recognition has evolved as a reliable biometric modality for human identification in
recent years [6]. Since there is no research activity or publicly available databases addressing ear
recognition using light field sensors, excluding the ear recognition solutions proposed in this
Thesis, this section is dedicated to the review of non-light field based ear recognition databases
and solutions, following the multi-level face/ear recognition taxonomy proposed in Section 3.2.3.
Considering this taxonomy, see Figure 3.3, it should be noted that, although the relations between
the ear components contain critical information for ear recognition, no ear recognition solution
adopting the component representation paradigm, thus dealing independently with a selection of
ear components, has yet been proposed in the literature. Additionally, while all
the feature extraction sub-approaches in the taxonomy can potentially be used also for ear
recognition, some of them, notably multi-linear, dictionary learning, decision tree, regression,
Bayesian and shape descriptor feature extraction sub-approaches, have not yet been considered in
the literature for ear recognition.
3.4.1 Ear Databases: Status Quo
Ear databases play a very important role for designing, testing and validating ear recognition
solutions, while ensuring the reproducibility of performance results and their fair comparison. A
set of selected ear biometric databases is briefly reviewed in the following.
Currently, there are several publicly available ear image databases, none of them light field based.
An overview of the publicly available ear databases, including their main characteristics and
variations, is provided in Table 3.5. For comparison, the characteristics of the IST-EURECOM
Lenslet Light Field Ear DataBase (LLFEDB) [36] proposed in this Thesis are also included
in Table 3.5. The databases consider different characteristics and the corresponding ear images
exhibit different levels of variability. The variation ‘sides’ indicates whether images were captured
from one or both ears, ‘poses’ refers to yaw and pitch ear rotations, and ‘occlusions’ and
‘accessories’ indicate whether ears are (partly) occluded and whether accessories, such as earrings, are
visible in the ear images.
Table 3.5: Overview of ear databases with different (Diff.) characteristics.
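Database Name | Year | Number of Subjects | Number of Images | Ear Variation: Diff. Sides | Ear Variation: Diff. Poses | Ear Variation: Diff. Occlusions | Ear Variation: Diff. Accessories
--- | --- | --- | --- | --- | --- | --- | ---
USTB I [154] | 2002 | 60 | 185 | Right | | |
USTB II [154] | 2004 | 77 | 308 | Right | | |
IITD I [155] | 2007 | 125 | 493 | Right | | |
AMI [156] | 2009 | 100 | 700 | Both | | |
WPUT [157] | 2010 | 501 | 2071 | Both | | |
IITD II [155] | 2014 | 221 | 793 | Right | | |
AWE [5] | 2016 | 100 | 1000 | Both | | |
Prop. LLFEDB [36] | 2018 | 67 | 536 | Both | | |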
3.4.2 Ear Recognition Solutions
Ear recognition has recently attracted a considerable amount of research interest, thus resulting in
a large number of papers, although none of them explores light field images. Following the
proposed taxonomy in Section 3.2.3, the main ear recognition solutions available are briefly
described in this section. Table 3.6 summarizes the main characteristics of a selection of
representative and relevant non-light field ear recognition solutions. The solutions listed in Table
3.6 are sorted based on feature extraction approach, feature extraction sub-approach and,
finally, publication date. The characteristics of the light field based ear recognition solutions
being proposed in this Thesis are also listed in Table 3.6 for comparison purposes. The solutions
summarized in Table 3.6 are briefly reviewed in the following, grouped based on the feature
extraction approach considered in the taxonomy.
Table 3.6: Classification of a selection of ear recognition solutions based on the developed
taxonomy.
Ref. | Year | Ear Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Feature Extractor | Classifier
--- | --- | --- | --- | --- | --- | --- | ---
[7] | 2003 | Global | Global | Appearance | Linear | PCA | NNC
[158] | 2007 | Global | Global | Appearance | Linear | LDA | NNC
[8] | 2013 | Global | Global | Appearance | Linear | Sparse Coding Error Ratio | NNC
[159] | 2014 | Global | Global | Appearance | Linear | G-NSRC | NNC
[160] | 2015 | Global | Local | Appearance | Linear | PCA | NNC
[161] | 2016 | Global | Global | Appearance | Linear | Dictionary Based Sparse Representation | NNC
[162] | 2013 | Global | Global | Appearance | Linear; Non-Linear | ICA; LDA | NNC
[163] | 1999 | Comp.+ struct. | Global | Geometric | Graph | Distance Model | NNC
[164] | 2004 | Comp.+ struct. | Global | Geometric | Graph | Distance Model | NNC
[165] | 1997 | Comp.+ struct. | Global | Geometric | Shape | Contour Model | NNC
[166] | 2008 | Comp.+ struct. | Global | Geometric | Shape | Deformable Model | NNC
[167] | 2008 | Comp.+ struct. | Global | Geometric | Shape | Contour Model | NNC
[168] | 2016 | Comp.+ struct. | Global | Geometric | Shape | Contour Model | NNC
[169] | 2017 | Comp.+ struct. | Global | Geometric | Shape | Deformable Models | NNC
[170] | 2017 | Global | Global | Learning | Deep Neural Network | SqueezeNet | Softmax
[171] | 2017 | Global | Global | Learning | Deep Neural Network | VGG-16 Network | Softmax
[172] | 2018 | Global | Global | Learning | Deep Neural Network | AlexNet; VGG-16 Network; GoogLeNet | Softmax
[173] | 2009 | Global | Local | Hand-crafted | Texture | LGPDP | NNC
[174] | 2012 | Global | Local | Hand-crafted | Texture | Multi-Scale Dense HOG | NNC
[175] | 2013 | Global | Local | Hand-crafted | Texture | SIFT | NNC
[176] | 2014 | Global | Local | Hand-crafted | Texture | LBP; LPQ; HOG; BSIF | NNC
[177] | 2016 | Global | Local | Hand-crafted | Texture | MLBP | NNC
[178] | 2016 | Global | Local | Hand-crafted | Texture | TDSIFT | NNC
[5] | 2017 | Global | Local | Hand-crafted | Texture; Frequency | LPQ; BSIF; SIFT; POEM; Gabor; HOG | NNC
Prop. LFLBP [43] | 2018 | Global | Local | Hand-crafted | Texture | LFLBP | SVM
Prop. LFHG [42] | 2018 | Global | Local | Hand-crafted | Texture | LFHG | SVM
[179] | 2005 | Global | Global | Hybrid | Linear; Deep Neural Network | ICA; RBF | Softmax
[180] | 2007 | Global | Global | Hybrid | Linear | Wavelet Transform; PCA | NNC
[181] | 2008 | Global | Local | Hybrid | Linear; Texture | Haar Wavelet Transform; LBP | NNC
[182] | 2013 | Global | Local | Hybrid | Linear; Texture | Sparse Representation; LRT | NNC
[183] | 2014 | Global | Global | Hybrid | Non-linear; Deep Neural Network; Texture | LDA; Neural Network; SURF | Softmax
[184] | 2014 | Global | Local | Hybrid | Texture | GLCM; LBP; Gabor filters | NNC
3.4.2.1 Appearance Based Solutions
Appearance based ear recognition solutions exploit the global appearance of the input image, either
the whole ear or its components, to compute representations encoding the ear structure. A PCA
based solution, called eigen-ears, is proposed in [7], combining uncorrelated linear combinations of the
basis images for ear recognition. An ear recognition solution is proposed in [158], applying
Linear Discriminant Analysis (LDA) to determine a set of projection vectors maximizing inter-
class and minimizing intra-class variabilities. An extensive experimental comparison is conducted
for ear recognition using the ICA and LDA feature extraction approaches [162]. In [8], an adaptive
feature weighting scheme based on a sparse representation method, called sparse coding error
ratio, is proposed for ear recognition. In [159], a feature extraction approach using the scale
information of Gabor wavelets is investigated, and a Gabor scale feature based Non-Negative
Sparse Representation Classification (G-NSRC) is then proposed for ear recognition under
occlusion. The ear recognition solution proposed in [160] divides an ear image into smaller
blocks, applies PCA to each block separately, and then adds the outputs of the
classifiers applied to each block to perform ear recognition. Finally, the ear recognition solution
proposed in [161] uses a sparse representation framework without requiring any preprocessing or
normalization of the ear region.
3.4.2.2 Geometric Based Solutions
Geometric solutions use features representing the geometrical characteristics of the ear. Ear
recognition solutions from this category are, in general, computationally simple and often rely on
edge detection as a pre-processing step. The geometric based solutions to recognize user identity
from ear images focus on analyzing either simple ear geometrical features, such as height, width,
size, and distances between ear components, or more advanced shape models, e.g., graph,
deformable, and contour models, to provide more comprehensive geometrical descriptions. The
ear recognition solution in [165] localizes the ear using deformable contours on a Gaussian
pyramid representation of the image gradient. Then, the ear features are computed as a number of
scale, rotation and translation invariant geometrical factors. An ear identification solution is
proposed based on the extraction of an ear feature combining outer ear points, ear shape and
wrinkles information [163]. The ear recognition solution proposed in [164] consists of: i) ear edge
detection using a Sobel operator; and ii) ear feature extraction by forming a shape feature vector
of the outer ear and the structural feature vector of the inner ear. In [166], an ear deformable model
is constructed and then converted to a geometry image. Afterwards, a wavelet transform is applied to the
geometry image and the wavelet coefficients form the ear feature vector. Another feature
extraction approach computes geometrical parameters of ear contours extracted from ear images
[167]. The feature extraction is based on a concentric circles centered method to obtain an ear
centroid point and contour features, including contour starting points, ending points, bifurcations,
and intersections, computed with respect to the centroid point. A geometric feature extraction
method is proposed in [168], which finds the contours of the ear based on a Canny edge detector,
and then extracts shape features from the outer ear images with respect to the ear height line.
Finally, an extensive experimental comparison of ear recognition using state-of-the-art
methodologies for training and fitting statistical deformable models is presented in [169].
3.4.2.3 Learning Based Solutions
Thanks to the popularity of deep learning based solutions and their significant impact on different
computer vision tasks, three deep CNN solutions have recently been adopted also for ear
recognition [170] [171] [172]. In [170], the problem of training CNNs with limited training data
for the ear recognition task is addressed by considering data augmentation techniques including:
i) geometric and color perturbations to the available training data; and ii) synthetic data samples
generation. Then, the SqueezeNet [144], AlexNet [142], and VGG-16 network [146] generic
models are fine-tuned for deep ear description and a softmax classifier is used to perform ear
recognition. In [171], two fully-connected layers are added on top of the VGG-16 network seventh
layer and the pre-trained VGG-16 model optimized in [170] is used for model initialization. Then, the
pre-trained weights of the early layers are frozen and kept unchanged, while the newly added fully-
connected layers are trained from scratch. The output of the second fully connected layer of the
modified architecture is used as input to a softmax classifier to perform ear recognition. In [172],
deep CNN networks, notably AlexNet [142], VGG-16 network [146], and GoogLeNet [145] are
used considering two different training approaches, notably full model learning and selective
model learning. Data augmentation has also been applied to increase the amount of data for deep
CNN model training.
3.4.2.4 Hand-crafted Based Solutions
Several hand-crafted based solutions use local texture descriptors for ear recognition [5], [176],
[185]. A local hand-crafted based ear recognition solution is proposed in [173], deriving features
using Local Gabor Phase Difference Pattern (LGPDP) to represent images by exploiting
relationships of Gabor phase between a pixel and its neighbours. A robust ear recognition solution
using multi-scale dense HOG features as a descriptor of 2D ear images, capturing different and
complicated structures of ear images, is proposed in [174]. The Scale Invariant Feature Transform
(SIFT) is applied for ear feature description in [175]. An ear recognition solution is proposed in
[177], extracting features based on Multi-scale Local Binary Pattern (MLBP) descriptor to be used
as input to a classifier. A Texture and Depth Scale Invariant Feature Transform (TDSIFT)
descriptor, encoding 2D and 3D local features for ear recognition is proposed in [178]. A
comparative study of ear recognition performance is presented in [176], using LBP, LPQ, HOG,
and BSIF local descriptors.
3.4.2.5 Hybrid Solutions
Hybrid solutions combine elements from several categories to improve ear recognition
performance. A hybrid system for classifying ear images is proposed in [179], combining ICA and
a Radial Basis Function (RBF) network. The ear image is decomposed into linear combinations of
several basic images and the corresponding coefficients of these combinations are fed into an
RBF network to perform recognition. The ear recognition solution proposed in [180] decomposes
the ear image into three horizontal, vertical and diagonal images using wavelet transform and then
PCA is applied for feature dimension reduction. The proposed ear recognition solution in [181]
decomposes ear images using the Haar wavelet transform to provide input to a uniform LBP
descriptor, thus describing the ear sub-images texture features in the Haar wavelet domain. A
solution based on sparse representation of local gray-level orientation described by a Local Radon
Transform (LRT) is proposed for ear recognition in [182]. In [183], SURF features are computed
from the ear images and their dimensionality is then reduced using LDA before being input to two neural
networks. Finally, the extracted features from Grey-Level Co-occurrence Matrices (GLCM), LBP
and Gabor filters are combined for ear recognition in [184].
3.4.2.6 Ear Recognition: the Status Quo
Figure 3.8 illustrates an overview of the evolution of ear recognition solutions over time, grouped
based on feature extraction approaches. Figure 3.8 also includes the range of RR1 results obtained
for each group of techniques on the AWE dataset [5]. The model based solutions were the first
to appear for ear recognition. Then, appearance based and hand-crafted based solutions
appeared successively, providing a moderate improvement in the accuracy of ear recognition
solutions. The state-of-the-art model based, appearance based, and local hand-crafted based
solutions provide, respectively, a RR1 recognition performance of 63.80%, 61.10%, and 65.20%
for the AWE dataset, thus showing the slight superiority of local hand-crafted based solutions at
the cost of a lower computational complexity. In 2017, deep learning based solutions started to
appear, although they have not yet led to a considerably superior performance over local
hand-crafted based solutions; this is most probably due to the lack of enough available training samples,
which has a larger impact on the recognition performance on the deep solution architectures. In
fact, the amount of samples in the available datasets for ear recognition is rather limited (see Table
3.5), thus deep learning based ear recognition solutions mainly use already trained
classification models, e.g., for generic object classification or for face recognition, for model
initialization. The adaptation to the specific ear recognition problem is done using transfer learning,
fine-tuning the models, using a part of the available ear dataset, which typically is not large enough
to result in an appropriate training. This reveals a pressing need to gather large-scale ear databases
in order to obtain better deep classification models for ear recognition.
Figure 3.8: Overview of the evolution of ear recognition solutions over time, grouped based on
feature extraction approaches; performance values for the AWE database.
Chapter 4
Proposing Novel Light Field Face and Ear
Recognition Databases
4.1 Introduction
As stated in Section 3.3.1, it was difficult to fully assess how face recognition technology can
benefit from light field data, as the only available light field face database, LiFFID [98], does not
include the raw light field images. In fact, LiFFID only includes a number of 2D images focused
at different depths for each person, rendered from light field images acquired by an old generation
of lenslet light field cameras; thus, it is only useful for testing and validating those face
recognition solutions that exploit the a posteriori refocusing capability supported by light field
imaging. Concerning ear recognition, no ear database captured by lenslet light field cameras was
available at the time of writing this Thesis.
To be able to test any light field face recognition solution, including those proposed in this Thesis,
it is necessary to have access to databases including light field face images in the light field raw
format, thus providing the flexibility to exploit different types of light field data for biometric
recognition. It should again be noted that the available databases did not include the light field
images, but rather only specific sets of 2D images rendered from the light fields.
To overcome these limitations, two light field based face and ear databases have been developed
in the context of this Thesis, allowing more powerful benchmarking for testing and validating face
and ear recognition solutions exploiting the full light field data; these databases have been made
publicly available to the research community. This chapter presents the proposed light field face
and ear databases.
4.2 Lenslet Light Field Face Recognition Database
A new database, the so-called IST-EURECOM Lenslet Light Field Face Database (IST-
EURECOM LLFFD), is introduced in this section. The proposed database includes data from 100
subjects, with 20 samples per person in each of two acquisition sessions, captured by a Lytro ILLUM lenslet camera. The images
are captured in a controlled acquisition setup with different facial variations, including emotions,
actions, poses, illuminations, and occlusions in order to benefit from the non-intrusive nature of
face recognition. This database refers to application scenarios where the subjects present
themselves to a fixed camera with a controlled background, but significant flexibility is allowed
in terms of pose, expression and occlusions. This is a rather common and realistic scenario in
business and industrial environments where the facial images to be recognized are captured in, at
least partly, constrained conditions. The database includes the raw light field images, sample 2D
rendered images and the associated depth maps, along with a rich set of metadata.
4.2.1 Acquisition Setup and Statistics
The proposed IST-EURECOM LLFFD was acquired in the context of a cooperation between the
Multimedia Signal Processing laboratory at Instituto de Telecomunicações, Instituto Superior
Técnico, Lisbon, Portugal and the Imaging Security Lab at EURECOM, SophiaTech, Nice,
France. Image acquisition was performed in an indoor environment, using the Lytro ILLUM
lenslet camera [26]. The acquisition setup, illustrated in Figure 4.1, included a white backdrop
background behind a chair at a fixed distance of 1.25 m to the camera. The scene was illuminated
with a three-point lighting kit, including a key light, a fill light and a back light, placed to limit
shadows and allow easy segmentation of the subject from the background; a sketch of the database
acquisition setup is included in Figure 4.2. The image acquisition process has been repeated in the
two labs with the same predefined setup. Each volunteer participated in two separate acquisition
sessions, with a time interval between 1 and 6 months. The database includes 20 shots per person
in each session, with different facial variations including facial emotions, actions, poses,
illuminations and occlusions. Before the acquisition process, volunteers were asked to fill and sign
consent and metadata forms.
Figure 4.1: Acquisition setup at (a) IST; and (b) EURECOM.
The IST-EURECOM LLFFD includes data from 100 volunteers, 66 males and 34 females, with a
total number of 4000 light field face images in the database, corresponding to a total disk space of
about 270 GB. The participants were born between 1957 and 1998, and are from 19 different
countries. Figure 4.3 illustrates the distribution of subjects by age.
Figure 4.2: A sketch of the LLFFD acquisition setup.
Figure 4.3: Age distribution for the subjects in IST-EURECOM LLFFD.
4.2.2 Database Variations
To fully benefit from the non-intrusive nature of face recognition, a face recognition system may
be required to recognize a face in arbitrary situations, without the explicit cooperation of the
subject. This flexibility is of great interest in many face recognition applications, notably many
video surveillance environments.
To consider less controlled acquisition conditions, the IST-EURECOM LLFFD includes a total of
20 face variations per person, categorized into 6 dimensions:
1. Neutral image (1 image): image captured with standard illumination, frontal pose, neutral
emotion, no action, and no occlusion;
2. Emotions (3 images): images with three different emotions, notably happy, angry and surprise;
3. Actions (2 images): images with two different actions, notably closed eyes and open mouth;
4. Poses (6 images): images with different poses, notably looking up, looking down, right half-
profile, right profile, left half-profile, left profile;
5. Illumination (2 images): images with different illumination intensities, notably low and high
illumination levels;
6. Occlusions (6 images): images with occlusions, notably eye occluded by hand, mouth occluded
by hand, with glasses, with sunglasses, with surgical mask and with hat.
Examples of the various face variations considered in the IST-EURECOM LLFFD are illustrated
in Figure 4.4. All images were taken under controlled conditions, but there were no restrictions
imposed on clothing, make-up and hair style.
Figure 4.4: Illustration of 2D rendered images for the facial variations in the IST-EURECOM
LLFFD.
4.2.3 Database Elements
The IST-EURECOM LLFFD is the first biometric database to include raw light field imaging files.
It also includes additional information that can be useful for developing and testing face
recognition systems. The database is composed of the following elements:
1. Raw Light Field Images: Raw light field images are stored in the Lytro ILLUM native file
format, so-called Light Field Raw (LFR) files, with a size of about 50 MB/image. LFR files
can be used as initial input either to the Lytro camera software, i.e., the Lytro Desktop Software
[186], or to any other processing library/toolbox, such as the Matlab Light Field Toolbox V0.4
[58].
2. 2D Images: Since light field images are not directly viewable in conventional 2D displays, the
proposed database also includes 2D rendered images for the central view of each light field
image variation, generated using the Lytro Desktop Software [186]. It is worth noting that this
software automatically performs a number of processing steps, including up-sampling and
color correction, to enhance the quality of the output images. As the raw light field images are
made available, any other rendering solution may also be used. The 2D rendered face images
can be viewed using conventional 2D displays or be further processed.
3. Depth Map: A depth map for each central view 2D rendered image is available in the database.
The depth map can be used to bridge the gap between 2D and 3D face recognition. Depth maps
(see example in Figure 4.5) can provide geometric information about the position and shape of
objects, to be explored by recognition systems. The supplied depth maps were generated using
the Lytro Desktop Software [186].
Figure 4.5: Sample depth map.
4. Landmark Information: Facial landmarks are relevant for facial region extraction and
normalization in face recognition systems. In the IST-EURECOM LLFFD, the facial
landmarks information includes the location of the face, left eye, right eye, nose and mouth
bounding boxes, as illustrated in Figure 4.6. The landmark information is extracted for the
central view 2D rendered images.
Figure 4.6: Illustration of facial landmarks.
5. Subjects Metadata Information: Metadata information can be used for the evaluation of face
recognition, facial expression recognition, gender classification, and age estimation automated
results. The IST-EURECOM LLFFD rich metadata includes the image acquisition date, as
well as information on the subject gender, age, facial hair, makeup, haircut and usage of
accessories; the range of values/labels for each of these metadata fields is listed in Figure 4.7.
Figure 4.7: Metadata associated to each subject.
6. Calibration Information: Calibration data is essential to compensate for the specific
properties of each camera’s sensor. For example, it is a required input for some light field
image processing software products, such as the Lytro Desktop Software [186] and the Matlab
Light Field Toolbox [58].
4.2.4 Database Structure and Naming Convention
The files composing the database are organized according to a hierarchical structure, as illustrated
in Figure 4.8. The root level of the hierarchy includes the metadata information and facial
landmarks for all the subjects and the camera calibration files. The root level also includes a folder
for each of the N subjects in the database, labelled using a 3 digit identifier, xxx. Each of these
folders contains 3 sub-folders: “LFR files”, “2D images” and “Depth map images”.
Figure 4.8: IST-EURECOM LLFFD file structure.
The naming convention for the database light field images is type_xxx_s_vv_variation where:
“type” refers to the type of image, notably “LF” (Light Field), “2D” (2 Dimensional) or “DM”
(Depth Map);
“xxx” is a three digit integer uniquely identifying the subject, starting from 001; the first 50
subjects have been recorded at IST and the second 50 subjects at EURECOM;
“s” is a digit indicating the acquisition session number, notably “1” or “2”;
“vv” is a two digit integer indicating the variation number, ranging from 01 to 20,
corresponding to the variations illustrated in Figure 4.4;
“variation” is a three letter acronym identifying the variation in a format more suitable for
human reading, as defined in Table 4.1.
Table 4.1: List of acronyms used in IST-EURECOM LLFFD along with their definitions.
Acronym Definition
NFF Neutral Frontal Face
EHF Emotion Happy Face
EAF Emotion Angry Face
ESF Emotion Surprised Face
AEC Action Eyes Closed
AMO Action Mouth Open
PUL Pose Up Looking
PDL Pose Down Looking
PHL Pose Half-profile Left
PHR Pose Half-profile Right
PPL Pose Profile Left
PPR Pose Profile Right
ILI Illumination Low Intensity
IHI Illumination High Intensity
OMH Occlusion Mouth by Hand
OEH Occlusion Eye by Hand
OFG Occlusion Face by Glasses
OFS Occlusion Face by Sunglasses
OFM Occlusion Face by Mask
OFH Occlusion Face by Hat
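The naming convention and folder structure described above can be sketched in Python as follows. This is only an illustrative sketch: the helper names (`parse_image_name`, `acquisition_site`, `subfolder`), the example root path, the example file stem, and the mapping from image type codes to sub-folders are assumptions, and the file extension is deliberately left out because it is not specified here.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple

# Three letter acronyms listed in Table 4.1.
ACRONYMS = {
    "NFF", "EHF", "EAF", "ESF", "AEC", "AMO", "PUL", "PDL", "PHL", "PHR",
    "PPL", "PPR", "ILI", "IHI", "OMH", "OEH", "OFG", "OFS", "OFM", "OFH",
}

# Assumed mapping from the image type code to the sub-folder holding it.
SUBFOLDERS = {"LF": "LFR files", "2D": "2D images", "DM": "Depth map images"}

NAME_PATTERN = re.compile(r"(LF|2D|DM)_(\d{3})_([12])_(\d{2})_([A-Z]{3})")


class ImageName(NamedTuple):
    image_type: str
    subject: int
    session: int
    variation_number: int
    variation: str


def parse_image_name(stem):
    """Split a type_xxx_s_vv_variation stem (no extension) into its fields."""
    match = NAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a valid image name: {stem!r}")
    image_type, subject, session, number, variation = match.groups()
    if int(subject) < 1:
        raise ValueError("subject identifiers start from 001")
    if not 1 <= int(number) <= 20 or variation not in ACRONYMS:
        raise ValueError(f"unknown variation in: {stem!r}")
    return ImageName(image_type, int(subject), int(session), int(number), variation)


def acquisition_site(subject):
    """The first 50 subjects were recorded at IST, the second 50 at EURECOM."""
    return "IST" if subject <= 50 else "EURECOM"


def subfolder(root, name):
    """Return <root>/<xxx>/<sub-folder> for a parsed image name."""
    return PurePosixPath(root) / f"{name.subject:03d}" / SUBFOLDERS[name.image_type]


# Hypothetical stem and root; pairing number 01 with NFF is an assumption.
name = parse_image_name("LF_051_2_01_NFF")
print(name, acquisition_site(name.subject), subfolder("/data/LLFFD", name))
```

Working on file stems rather than full file names keeps the sketch independent of the storage format used for each image type.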
4.2.5 Database Access and Usage Conditions
The IST-EURECOM LLFFD is freely distributed for standardization and academic research purposes.
The first part of the database, captured at Instituto de Telecomunicações – Instituto Superior
Técnico, Lisbon, Portugal can be accessed at http://www.img.lx.it.pt/LFFD/. The second part,
captured at EURECOM, SophiaTech Campus, Nice, France can be accessed at
http://lffd.eurecom.fr/.
4.3 Lenslet Light Field Ear Recognition Database
Since no light field ear database was available, the IST-EURECOM Lenslet Light Field Ear
DataBase (LLFEDB) has been created, to make publicly available content allowing testing and
validating light field imaging based ear recognition systems. The proposed ear database consists
of 536 light field ear images from 67 subjects, with 8 image shots per person, captured with a Lytro
ILLUM lenslet camera, over two separate sessions, with four different poses per session.
4.3.1 Acquisition Setup and Statistics
This Thesis proposes the IST-EURECOM Lenslet Light Field Ear Database (LLFEDB),
containing only the ear region from a relevant subset of IST-EURECOM LLFFD images. Out of
the 100 LLFFD subjects, only 67 were selected, as for the remaining subjects the ears were
completely occluded with hair. The interval between acquisition sessions is in the range of 1-6
months. The IST-EURECOM LLFEDB includes data from volunteers from both genders, with a
total number of 536 light field ear images in the database, corresponding to a total disk space of
about 30 GB.
The participants were born between 1957 and 1998, originating from 15 different countries. The
ear portion of the facial images has been manually cropped. Since the facial images were acquired
at slightly different distances/camera’s zoom levels, the ear size in the database for each view
varies from 75×35 up to 107×86 pixels, with an average aspect ratio of 1.49. If necessary, some
normalization may have to be applied when using this content.
4.3.2 Database Variations
Among the available IST-EURECOM LLFFD facial images corresponding to multiple poses,
there are four of interest for ear recognition, notably the right and left half and full profile images.
LLFEDB consists of 536 light field images from 67 subjects, considering the four poses
mentioned, taken in two sessions, in a total of 8 images per subject – see Figure 4.9.
Figure 4.9: Illustration of IST-EURECOM LLFEDB 2D rendered ear images for the four
profiles of a specific subject in two separate acquisition sessions.
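The database size follows directly from the number of subjects, poses and sessions. As a quick sanity check, the short Python sketch below enumerates every (subject, session, pose) combination; the pose acronyms are those of Table 4.1, while the subject indices 1 to 67 are placeholders for this sketch and not the original LLFFD identifiers.

```python
from itertools import product

SUBJECTS = 67                          # subjects whose ears are not fully occluded
SESSIONS = (1, 2)                      # two separate acquisition sessions
POSES = ("PHL", "PHR", "PPL", "PPR")   # left/right half-profile and profile

# One entry per (subject index, session, pose) combination.
ear_images = list(product(range(1, SUBJECTS + 1), SESSIONS, POSES))

print(len(ear_images))                 # total number of light field ear images
print(len(ear_images) // SUBJECTS)     # images per subject
```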
As an ear recognition system is often required to operate in unconstrained situations, i.e. without
the explicit cooperation of the subject, it is important to include less ‘recognition friendly’ content.
To investigate the effects of occlusions on the ear recognition performance, the IST-EURECOM
LLFEDB includes ear images partly occluded by ear piercings, earrings, hair and combinations of
these occlusions – see examples in Figure 4.10.
Figure 4.10: Examples of partially occluded ear images: (a) ear piercing; (b) earring; (c) hair; and
(d) combination of occlusions.
4.3.3 Database Elements
The IST-EURECOM LLFEDB is composed of: i) the raw light field ear images; ii) their
corresponding representation as a multi-view SA images array; iii) central view 2D images (for
easy access); iv) metadata; v) ear landmark information; and vi) camera calibration file, as
described in the following:
1. Raw Light Field Images: The raw light field ear images are the most important component of
the database; they are stored in the Lytro ILLUM native format, the so-called Light Field Raw
(LFR) files. As landmark information for the central view rendered image is made available in
IST-EURECOM LLFEDB, the ear region can be easily cropped from the original LLFFD
facial images.
2. Multi-View Array: Ear recognition systems working on light field images may not process
directly the raw light field images and may instead process some conversion of the LFR files,
e.g. 2D rendered images such as SA images. The multi-view SA arrays available in the IST-
EURECOM LLFEDB database, extracted using the Matlab Light Field Toolbox [58], contain
only the ear region, cropped using the landmark information provided in the LLFFD database.
3. 2D Images: Since light field images are not directly viewable in conventional 2D displays, the
proposed database also includes 2D rendered ear images for the central view of each light field
image, extracted using the Matlab Light Field Toolbox [58] – see Figure 4.9. The available 2D
rendered images contain only the ear region.
4. Subjects Metadata Information: The IST-EURECOM LLFEDB metadata includes the
image acquisition date, subject gender, subject age, and information about occlusions by hair,
earrings or piercings. The set of labels for each metadata field is listed in Table 4.2.
5. Ear Landmark Information: In the IST-EURECOM LLFEDB, the landmark information is
defined by the corner coordinates of the ear bounding boxes in the facial image. The landmark
coordinates refer to the central view 2D rendered images.
6. Calibration Information: Calibration data is essential to compensate for the specific
properties of each camera’s sensor. For example, it is a required input for some light field
image processing software products, such as the Lytro Desktop Software [186] and the Matlab
Light Field Toolbox [58].
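To illustrate how the ear landmark information (element 5 above) can be used, the following Python sketch crops a bounding box from a central view image stored as a list of rows. The function name, the corner order (left, top, right, bottom) and the exclusive right/bottom convention are assumptions of this sketch, not the documented format of the landmark files.

```python
def crop_bounding_box(image, box):
    """Crop a row-major image (list of rows) to box = (left, top, right, bottom).

    The corner order and the exclusive right/bottom convention are assumed
    here for illustration only.
    """
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("bounding box lies outside the image")
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x5 "central view" whose pixel values encode their own position.
view = [[10 * y + x for x in range(5)] for y in range(4)]
ear = crop_bounding_box(view, (1, 1, 4, 3))
print(ear)
```

The same box can be applied to every SA image of the multi-view array, since the landmark coordinates refer to the central view.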
Table 4.2: Metadata associated to each subject in each acquisition session.
Field Range
Date taken Date defined as YYYY/MM/DD
Gender Male, Female
Age Integer number
Hair Occlusion Yes, No
Earing Occlusion Yes, No
Ear Piercing Occlusion Yes, No
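The metadata fields of Table 4.2 map naturally onto a small typed record. The Python sketch below is one possible representation, assuming that the metadata has already been read into a dictionary keyed by the field names of Table 4.2; the class name, the dictionary layout and the example values are illustrative assumptions, not part of the database.

```python
from dataclasses import dataclass
from datetime import date, datetime


def _yes_no(value):
    """Convert the Yes/No labels of Table 4.2 to a boolean."""
    if value not in ("Yes", "No"):
        raise ValueError(f"expected Yes or No, got {value!r}")
    return value == "Yes"


@dataclass(frozen=True)
class EarMetadata:
    date_taken: date
    gender: str
    age: int
    hair_occlusion: bool
    earring_occlusion: bool
    piercing_occlusion: bool

    @classmethod
    def from_fields(cls, fields):
        """Build a record from a dict keyed by the Table 4.2 field names."""
        gender = fields["Gender"]
        if gender not in ("Male", "Female"):
            raise ValueError(f"unexpected gender label: {gender!r}")
        return cls(
            date_taken=datetime.strptime(fields["Date taken"], "%Y/%m/%d").date(),
            gender=gender,
            age=int(fields["Age"]),
            hair_occlusion=_yes_no(fields["Hair Occlusion"]),
            # Key spelled as the field label appears in Table 4.2.
            earring_occlusion=_yes_no(fields["Earing Occlusion"]),
            piercing_occlusion=_yes_no(fields["Ear Piercing Occlusion"]),
        )


# Invented example values, for illustration only.
record = EarMetadata.from_fields({
    "Date taken": "2017/03/15", "Gender": "Female", "Age": "29",
    "Hair Occlusion": "Yes", "Earing Occlusion": "No",
    "Ear Piercing Occlusion": "No",
})
print(record)
```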
4.3.4 Database Structure and Naming Convention
The files composing the database are organized according to the hierarchical structure illustrated
in Figure 4.11. The root level includes files containing the metadata information for all the subjects
and the ear landmark information for all the images. The root level also includes a folder for each
of the 67 subjects in the database, labelled using a 3 digit identifier, xxx, corresponding to their
identifications in IST-EURECOM LLFFD. Each of these folders contains 2 sub-folders: “2D
images” and “Multi-view arrays”.
Figure 4.11: IST-EURECOM LLFEDB file structure.
The naming convention for the database files follows the same protocol as defined for IST-
EURECOM LLFFD (Section 4.2.4).
4.3.5 Database Access and Usage Conditions
The IST-EURECOM LLFEDB is freely distributed for standardization and academic research purposes.
The database can be downloaded from: http://www.img.lx.it.pt/LLFEDB/.
Chapter 5
Proposing Novel Light Field Face and Ear
Recognition Solutions
5.1 Introduction
This Thesis proposes seven light field based face and ear recognition solutions, evolving through
progressive levels of functionality and performance, exploiting the additional information
available in a light field image. The first two solutions are proposed based on light field hand-
crafted descriptors, describing the disparity information available in light field images for both
face and ear recognition. The other five recognition solutions are based on fused deep/double-deep
descriptors, learning convolutional representations and angular dynamics from a light field image
for face recognition. The proposed recognition solutions are summarized in Figure 5.1.
Figure 5.1: Summary of the proposed recognition solutions.
5.2 Face and Ear Recognition Based on Light Field Local Binary Pattern
Descriptor
This section proposes a face/ear recognition solution based on a new hand-crafted light field
descriptor, so-called Light Field Local Binary Patterns (LFLBP) descriptor, exploiting the spatial
and disparity information available in light field images for face and ear recognition tasks.
5.2.1 Architecture and Walkthrough
The generic architecture of the proposed recognition solution based on LFLBP hand-crafted
descriptor for both face and ear recognition tasks is illustrated in Figure 5.2. By exploiting the
multiple SA images, available from the light field multi-view representation, the proposed LFLBP
descriptor is expected to improve the recognition system performance.
The proposed recognition architecture includes the following main steps:
1. Pre-processing: The Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-
view array of SA images (Section 2.4) from the input raw light field. Then, the face and ear
in all SA images are cropped and resized to 128×128 and 192×128 pixels, respectively.
Additionally, the images of left side ears are flipped horizontally, so that all ear images
to be further processed have the same orientation. There are three reasons for
these image size selections: i) A study with the IST-EURECOM LLFFD and IST-EURECOM
LLFEDB databases has shown that the average aspect ratios of the cropped faces and ears are
1.08 and 1.51, respectively, thus justifying the aspect ratio of the resized faces and ears; ii) A
preliminary study conducted during the Thesis work has shown that increasing the image
resolution does not significantly impact the recognition performance, while increasing the
computational complexity; and iii) Although the IST-EURECOM LLFFD database considers
larger image sizes, the face area is only a portion of that size, with 128×128 being a size
appropriate for the cropped face image; for ear images, 192×128 is a size using the same
horizontal resolution and a vertical resolution increased to match the ear aspect ratio
measured from IST-EURECOM LLFEDB.
2. LFLBP feature description: The proposed LFLBP descriptions are extracted from the
normalized multi-view array, as detailed in Section 5.2.2.
3. Offline training: The LFLBP descriptions extracted from the training samples, highlighted by
red in Figure 5.2, are fed to a linear SVM classifier (implemented using the LIBSVM library
[187]) to define the classification model. The training data should be selected based on the test
protocol considered.
4. Classification: LFLBP descriptions extracted from testing samples are fed to the previously
trained SVM classifier, thus determining the subject identity.
Figure 5.2: Architecture of the proposed face and ear recognition solution based on LFLBP
hand-crafted descriptor.
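The size normalization and ear flipping of the pre-processing step can be sketched in plain Python as follows, for a single already cropped SA image stored as a list of rows. The function names are hypothetical and nearest-neighbour resampling is an assumption made to keep the sketch dependency-free; the interpolation method actually used is not stated here.

```python
FACE_SIZE = (128, 128)   # (height, width) of a normalized face image
EAR_SIZE = (192, 128)    # (height, width): same width, taller to fit the ear


def flip_horizontal(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def resize_nearest(image, out_height, out_width):
    """Resize by nearest-neighbour sampling (an assumption of this sketch)."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[(r * in_height) // out_height][(c * in_width) // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


def normalize_view(image, modality, is_left_ear=False):
    """Normalize one cropped SA image; left ears are flipped before resizing."""
    if modality == "ear":
        if is_left_ear:
            image = flip_horizontal(image)
        return resize_nearest(image, *EAR_SIZE)
    if modality == "face":
        return resize_nearest(image, *FACE_SIZE)
    raise ValueError(f"unknown modality: {modality!r}")


# Toy 2x2 crop standing in for a cropped left ear.
ear = normalize_view([[1, 2], [3, 4]], "ear", is_left_ear=True)
print(len(ear), len(ear[0]))
```

In the full pipeline the same normalization would be applied to every SA image of the multi-view array before the LFLBP description is computed.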
5.2.2 Light Field Local Binary Patterns Feature Description
LBP [76] and its variants are among the best performing hand-crafted feature descriptors for face
recognition. The LFLBP combined descriptor is an extension of LBP, which is able to exploit the
richer information available in light field images. Thus, the recognition solution can exploit both
spatial and angular information that may boost the final recognition performance. Similarly to the
original LBP [76], the novel LFLBP processes the gray level intensities of the captured light fields.
The input to a LFLBP descriptor is a multi-view array, i.e. L(u,v,x,y), where u and v identify the
viewpoint, and x and y the pixel position within a SA image. In a Cartesian representation, for the
used Lytro Illum lenslet camera [26], u and v take integer values in the range {-7, …, 7}, and the
size of each SA is 625×434 pixels. The central SA image is the reference view position, denoted
as L(0,0,x,y), as highlighted in yellow in Figure 5.3. Each SA image can also be identified using
polar coordinates using two parameters: i) the radius, R, expressing the Euclidean distance to the
reference view, with a direct relation with the observed disparity; and ii) the angle, A, measured
counter-clockwise from the positive part of the real axis. A third parameter, S, defines the number
of SA images, or views, to consider in the descriptor. Figure 5.3 shows three examples of SA
images, highlighted in red, with different parameter values.
Figure 5.3: Examples of selected SA images (red). The SA images highlighted in dark grey do
not contain usable image information due to the micro-lens shape.
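As a minimal illustration of how a set of SA images can be selected from the radius, starting angle and number of views, the Python sketch below converts the polar parameters to (u, v) view indices using the ceiling-based conversion that this section gives in Equation 5.6. The function name, the parameter names and the small rounding step that absorbs floating point noise are choices of this sketch.

```python
import math

MAX_VIEW_INDEX = 7  # u and v take integer values in {-7, ..., 7}


def select_views(radius, start_angle_deg, num_views):
    """Return the (u, v) indices of num_views SA images at a given radius."""
    views = []
    for i in range(1, num_views + 1):
        theta = math.radians(start_angle_deg + 360.0 / num_views * (i - 1))
        # Rounding before the ceiling absorbs floating point noise such as
        # cos(90 deg) ~ 6e-17, which the ceiling would otherwise turn into 1.
        u = math.ceil(round(radius * math.sin(theta), 9))
        v = math.ceil(round(radius * math.cos(theta), 9))
        if max(abs(u), abs(v)) > MAX_VIEW_INDEX:
            raise ValueError("selected view lies outside the multi-view array")
        views.append((u, v))
    return views


print(select_views(3, 0, 4))
```

A practical implementation would additionally discard the views that carry no usable image information, i.e., those highlighted in dark grey in Figure 5.3.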
The proposed LFLBP descriptor combines two components: i) the Spatial Local Binary Patterns
(SLBP) descriptor, which corresponds to LBP applied to the central, reference view; and ii) the
novel Light Field Angular Local Binary Patterns (LFALBP) descriptor, which captures the multi-
view information available in the light field image.
1) Spatial Local Binary Pattern (SLBP): For a selected set of p samples in the spatial
neighborhood, at distance r from the reference sample x,y, with starting angle a, the sample level
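SLBP pattern value ($\mathrm{SLBP}_{SL}$), for position x,y, is defined by Equation 5.1:

$$\mathrm{SLBP}_{SL}(r, a, p, x, y) = \sum_{i=1}^{p} \operatorname{sign}\left(L_{0,0,(x+k),(y+l)} - L_{0,0,x,y}\right) \times 2^{i-1} \tag{5.1}$$

where

$$\begin{cases} k = \left\lceil r \sin\left(a + \dfrac{360^\circ}{p} \times (i-1)\right) \right\rceil \\[4pt] l = \left\lceil r \cos\left(a + \dfrac{360^\circ}{p} \times (i-1)\right) \right\rceil \end{cases} \tag{5.2}$$

and sign(x) is the sign function defined as:

$$\operatorname{sign}(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise.} \end{cases} \tag{5.3}$$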
Equation 5.2 transforms the central-view ith sample coordinates from polar to Cartesian, as
Equation 5.1 works with Cartesian coordinates. The central-view sample intensity at position x,y
is used as threshold value; whenever the selected sample in the spatial neighbourhood is below the
threshold, then the sign(x) function takes value 0; otherwise, it takes value 1, according to Equation
5.3. The binary thresholding result is multiplied by the binomial factor, $2^{i-1}$, and the resulting
values are summed, to obtain a value in the range $[0, 2^p-1]$, which is the $\mathrm{SLBP}_{SL}$ pattern value for
each sample position x,y.
Finally, the SLBP descriptor corresponds to the histogram of $\mathrm{SLBP}_{SL}$ pattern values computed for
all x,y samples, according to Equation 5.4:
$$\mathrm{SLBP}(r,a,p)=\mathrm{Histogram}\left(\mathrm{SLBP}_{SL}(r,a,p,x,y);\ \forall x\in(1,\dots,X);\ \forall y\in(1,\dots,Y)\right) \tag{5.4}$$
where X and Y indicate the number of samples in the central view. The SLBP descriptor expresses
elementary characteristics of spatial information in the form of a magnitude sign histogram.
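The SLBP computation of Equations 5.1 to 5.4 can be illustrated with a minimal Python sketch. It is not the implementation used in this Thesis: the function names, the view[x][y] indexing convention, the assumption of an integer radius r, the rounding applied before the ceiling (to avoid floating-point noise), and the choice of skipping border samples are all assumptions made here for illustration.

```python
import math

def polar_offsets(radius, start_deg, count):
    """Offsets (k, l) of Eq. 5.2 for i = 1..count."""
    offsets = []
    for i in range(1, count + 1):
        angle = math.radians(start_deg + 360.0 / count * (i - 1))
        # Round first so float noise such as 1e-16 is not ceiled up to 1 (assumption).
        k = math.ceil(round(radius * math.sin(angle), 9))
        l = math.ceil(round(radius * math.cos(angle), 9))
        offsets.append((k, l))
    return offsets

def sign(value):
    """Thresholding function of Eq. 5.3."""
    return 1 if value >= 0 else 0

def slbp_sample(view, x, y, r, a, p):
    """Sample-level SLBP pattern value (Eq. 5.1) at position (x, y)."""
    centre = view[x][y]
    pattern = 0
    for i, (k, l) in enumerate(polar_offsets(r, a, p), start=1):
        pattern += sign(view[x + k][y + l] - centre) * 2 ** (i - 1)
    return pattern

def slbp_histogram(view, r, a, p):
    """SLBP descriptor (Eq. 5.4); border samples are skipped (assumption)."""
    hist = [0] * (2 ** p)
    for x in range(r, len(view) - r):
        for y in range(r, len(view[0]) - r):
            hist[slbp_sample(view, x, y, r, a, p)] += 1
    return hist

# Hypothetical 3x3 central view with a single interior sample.
view = [[5, 9, 5],
        [1, 5, 7],
        [5, 3, 5]]
pattern = slbp_sample(view, 1, 1, r=1, a=0, p=4)
histogram = slbp_histogram(view, r=1, a=0, p=4)
```

With these hypothetical values, the first and fourth neighbours are not below the centre value, so the pattern is 1 + 8 = 9 and the histogram has a single count in bin 9.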
2) Light Field Angular Local Binary Pattern (LFALBP): The LFALBP hand-crafted descriptor
is here proposed to exploit the information corresponding to the variations observed for light rays
travelling in different directions, as captured by light field images. For a selected set of S SA
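images, at distance R from the reference view, with starting angle A, the sample level LFALBP
pattern value ($\mathrm{LFALBP}_{SL}$), for position x,y, is defined by Equation 5.5:
$$\mathrm{LFALBP}_{SL}(R,A,S,x,y)=\sum_{i=1}^{S}\operatorname{sign}\left(L_{u,v,x,y}-L_{0,0,x,y}\right)\times 2^{i-1} \tag{5.5}$$
where
$$\begin{cases} u=\left\lceil R\sin\left(A+\dfrac{360^\circ}{S}\times(i-1)\right)\right\rceil \\[4pt] v=\left\lceil R\cos\left(A+\dfrac{360^\circ}{S}\times(i-1)\right)\right\rceil \end{cases} \tag{5.6}$$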
Equation 5.6 transforms the ith selected SA image coordinates from polar to Cartesian, as Equation
5.5 works with Cartesian coordinates. As defined in Section 5.2.2, and illustrated in Figure 5.3, A
indicates the starting angle for the first SA image to consider, at radius R, and S indicates the
number of views to consider for the descriptor. The reference view sample intensity at position x,y
is used as threshold value; whenever another view’s component intensity value at position x,y is
below the threshold, then the sign(x) function takes value 0; otherwise, it takes value 1, according
to Equation 5.3. As for the conventional LBP descriptor, the binary thresholding result is
multiplied by the binomial factor, $2^{i-1}$, and the resulting values are summed, to obtain a value in
the range $[0, 2^S-1]$, which is the $\mathrm{LFALBP}_{SL}$ pattern value for each sample position x,y.
Finally, the LFALBP description is the histogram of $\mathrm{LFALBP}_{SL}$ pattern values computed for all
x,y samples, according to Equation 5.7:
$$\mathrm{LFALBP}(R,A,S)=\mathrm{Histogram}\left(\mathrm{LFALBP}_{SL}(R,A,S,x,y);\ \forall x\in(1,\dots,X);\ \forall y\in(1,\dots,Y)\right) \tag{5.7}$$
where X×Y indicates the number of luminance samples (or pixels) in each view. The LFALBP
descriptor expresses the magnitude sign histogram of the disparity information present in a light
field image.
The computation of LFALBPSL for a face sample x1,y1, is illustrated in Figure 5.4, considering the
parameter values R=3, A=0° and S=4. The LFALBP pattern value is obtained by taking the gray
values of pixel x1,y1 from the reference view, 225 in this example, and from the other four views.
The reference view value is used for thresholding; whenever the other view’s value in position
x1,y1 is below the threshold it takes value 0, otherwise it takes value 1. Finally, the thresholded
values are multiplied by the corresponding binomial factors, as shown in Figure 5.4, and summed
to obtain the LFALBP pattern value for pixel x1,y1 – value 12 in the example.
Figure 5.4: LFALBP descriptor extraction example.
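The example of Figure 5.4 can be reproduced with a short Python sketch of Equations 5.5 and 5.6. The reference value 225 and the resulting pattern value 12 come from the text; the four neighbouring view values (220, 210, 230, 225) are hypothetical, chosen only so that the result matches, and the dictionary-based light field layout, the function names and the rounding before the ceiling are assumptions of this sketch.

```python
import math

def view_offsets(R, A, S):
    """View coordinates (u, v) of Eq. 5.6 for i = 1..S."""
    offsets = []
    for i in range(1, S + 1):
        angle = math.radians(A + 360.0 / S * (i - 1))
        # Round before the ceiling so float noise does not shift the view (assumption).
        u = math.ceil(round(R * math.sin(angle), 9))
        v = math.ceil(round(R * math.cos(angle), 9))
        offsets.append((u, v))
    return offsets

def lfalbp_sample(light_field, x, y, R, A, S):
    """Sample-level LFALBP pattern value (Eq. 5.5) at position (x, y).

    light_field maps a view coordinate (u, v) to a 2D list indexed [x][y].
    """
    reference = light_field[(0, 0)][x][y]
    pattern = 0
    for i, (u, v) in enumerate(view_offsets(R, A, S), start=1):
        bit = 1 if light_field[(u, v)][x][y] - reference >= 0 else 0  # Eq. 5.3
        pattern += bit * 2 ** (i - 1)
    return pattern

# One-sample views; only the reference value 225 is taken from the text.
light_field = {
    (0, 0): [[225]],   # reference (central) view
    (0, 3): [[220]],   # i = 1, below the threshold -> 0
    (3, 0): [[210]],   # i = 2, below the threshold -> 0
    (0, -3): [[230]],  # i = 3, not below the threshold -> 1
    (-3, 0): [[225]],  # i = 4, equal to the threshold -> 1
}
value = lfalbp_sample(light_field, 0, 0, R=3, A=0, S=4)
```

Here the third and fourth views contribute 4 and 8, giving the pattern value 12 mentioned above.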
3) LFLBP as a LFALBP and SLBP Combination: LFALBP may be combined not only with
SLBP, to build LFLBP, but also with any existing local hand-crafted descriptor, to derive enhanced
descriptions for light field based recognition. In fact, the combination of angular and spatial
descriptions ‘fuses’ complementary information, thus improving the final recognition
performance. This combination flexibility is expressed by the generic combination framework
presented in Figure 5.5 where any spatial descriptor can be used to replace SLBP while still
benefiting from the complementary angular information captured by the novel LFALBP
descriptor.
This Thesis proposes a specific combination, the LFLBP descriptor, computed according to
Equation 5.8.
$$\mathrm{LFLBP}_{r,a,p,R,A,S}(x,y)=\left(\mathrm{LFALBP}_{R,A,S}(x,y)\times 2^{p}\right)+\mathrm{SLBP}_{r,a,p}(x,y) \tag{5.8}$$
where $\left(\mathrm{LFALBP}_{R,A,S}(x,y)\times 2^{p}\right)$ means binary left shift by p bits. Equation 5.8 concatenates
the spatial SLBP with the angular light field descriptions to form the combined LFLBP description.
The resulting description has p + S bits for each sample. The final description is the histogram of
LFLBP pattern values computed for all 𝑥, 𝑦 samples, expressing the variations for the spatial and
angular information available in a light field image.
Figure 5.5: Proposed spatial and angular descriptors combination framework.
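The bit-level combination of Equation 5.8 can be sketched in a few lines of Python. The function names and the example pattern values (an angular pattern of 12 with 4 views and a spatial pattern of 9 with 4 samples) are hypothetical and serve only to show how the two patterns share one code word.

```python
def lflbp_value(lfalbp_value, slbp_value, p):
    """Eq. 5.8: shift the angular pattern left by p bits, then add the spatial one."""
    return (lfalbp_value << p) + slbp_value

def lflbp_histogram(lfalbp_map, slbp_map, p, s):
    """Histogram of combined patterns over two equally sized pattern maps."""
    hist = [0] * (2 ** (p + s))
    for angular_row, spatial_row in zip(lfalbp_map, slbp_map):
        for angular, spatial in zip(angular_row, spatial_row):
            hist[lflbp_value(angular, spatial, p)] += 1
    return hist

# 0b1100 (angular) and 0b1001 (spatial) become 0b11001001.
combined = lflbp_value(12, 9, p=4)
hist = lflbp_histogram([[12, 0]], [[9, 3]], p=4, s=4)
```

Both original patterns remain recoverable from the combined value: the upper bits hold the angular pattern and the lower p bits hold the spatial one.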
5.3 Face and Ear Recognition Based on Light Field Histogram of Gradients
Descriptor
This novel recognition solution is based on a new hand-crafted light field descriptor, named Light
Field Histogram of Gradients (LFHG), fusing a non-light field based descriptor, the so-called
Histogram of Oriented Gradients (HOG), with a light field based descriptor, called Light Field
Histogram of Disparity Gradients (LFHDG), each having a specific and complementary function
in the resulting fused descriptor. The fused hand-crafted descriptor considers both the orientation
and magnitude variations for the spatial and angular information. Compared to the LFLBP [37]
descriptor proposed in Section 5.2, which only captures the magnitude sign, the descriptor proposed
in this section offers a more comprehensive spatio-angular description. As expected, it boosts the
final recognition performance, as described in detail in this section.
5.3.1 Architecture and Walkthrough
The architecture of the proposed solution, based on the fusion of HOG and LFHDG hand-crafted
descriptors, for face and ear recognition is shown in Figure 5.6. By exploiting both the orientation
and magnitude variations for the spatial and angular information available in a light field image,
the proposed solution is expected to improve the recognition performance over conventional
spatial-only face and ear recognition solutions.
The proposed face and ear recognition solution includes the following main steps:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (Section 2.4). Then, the face and ear regions in all SA images are cropped and
resized to 128×128 and 192×128 pixels, respectively. Additionally, images of left side ears are
flipped horizontally, so that all ear images to be further processed have the same orientation.
2. HOG and LFHDG feature descriptions: The HOG and LFHDG (Section 5.3.2) descriptions
are respectively extracted from the central SA image and the normalized multi-view SA array,
to capture the multi-view information available in the light field image.
3. Fusion of descriptions: The extracted HOG and LFHDG descriptors are concatenated,
resulting in the fused LFHG descriptor.
4. Offline training: The fused LFHG descriptions obtained in the previous step, extracted from
the training samples, highlighted in red in Figure 5.6, are fed to a linear SVM classifier
(implemented using the LIBSVM library [187]) to define the classification model. The training
data should be selected based on the test protocol considered.
5. Classification: The fused LFHG descriptions extracted from testing samples are fed to the
previously trained SVM classifier to be compared to the classification model, thus determining
the subject identity.
Figure 5.6: Architecture of the proposed face and ear recognition solution based on the
fused LFHG hand-crafted descriptor.
5.3.2 Light Field Histogram of Disparity Gradients Feature Description
The Histogram of Oriented Gradients (HOG) hand-crafted descriptor [149] is a widely used, non-
light field based local texture descriptor, able to represent spatial orientation and gradient
variations. It has been successfully applied in several computer vision problems, such as pedestrian
detection [149], face recognition [119] and ear recognition [5], [176], [174]. The Light Field
Histogram of Disparity Gradients (LFHDG) descriptor, an extension of HOG, targets the
description of the light field disparity variations. This Thesis proposes to fuse the HOG and
LFHDG descriptors, forming Light Field Histogram of Gradients (LFHG), for exploiting the light
field variations, both in terms of position and direction, thus obtaining an improved face and ear
recognition descriptor.
The HOG descriptor computation follows the implementation in [149] and the tuned parameter
settings for face and ear recognition proposed in [119] and [5]. It is applied to the central SA image
to capture the texture information available in that view.
The proposed LFHDG descriptor processing steps are:
1. Gradient computation: Horizontal and vertical disparity gradients, Gx (x,y) and Gy (x,y), for
a given (x,y) sample are computed as:
$$\begin{cases} G_x(x,y)=L(u_1,v_1,x,y)-L(u_2,v_2,x,y)\\ G_y(x,y)=L(u_3,v_3,x,y)-L(u_4,v_4,x,y) \end{cases} \tag{5.9}$$
where (u1, v1), (u2, v2), (u3, v3) and (u4, v4) correspond to the specific selected SA images.
2. Disparity gradient magnitude and orientation computation: The disparity gradient
magnitude, |∇I (x, y)| and orientation, θ(x,y), for each (x,y) sample, are computed according to
Equations 5.10 and 5.11:
$$|\nabla I(x,y)|=\sqrt{G_x(x,y)^2+G_y(x,y)^2} \tag{5.10}$$
$$\theta(x,y)=\arctan\left(\frac{G_x(x,y)}{G_y(x,y)}\right) \tag{5.11}$$
3. Cell histogram computation: The computed disparity gradient magnitude and orientation
maps are divided into non-overlapping 8×8 cells. Gradient orientation values for all (x,y)
samples (in the range 0°-180°), in each cell, are quantized into 9 bins; instead of counting how
many times a quantized orientation occurs in the cell, the gradient magnitude of each sample is
added to the bin closest to its orientation, forming a local histogram for the cell.
4. Block normalization: To make the descriptor image contrast independent [149], cells are
grouped into blocks of 2×2 cells and the histograms of the 4 cells concatenated. Adjacent
blocks are made to overlap, with each cell being shared by four blocks (see Figure 5.7 for an
ear sample), meaning that each local cell histogram contributes more than once to the final
LFHDG description. Finally, each block histogram is normalized with respect to its Euclidean
norm [149].
5. Block histogram concatenation: All normalized block histograms are concatenated to create
the LFHDG description.
Figure 5.7: Division of an ear sample disparity magnitude map into 8×8 sample cells and
overlapping 2×2 cell blocks.
In summary, the fused LFHG descriptor expresses both the orientation and magnitude variations
for the spatial and angular information.
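The five LFHDG processing steps of Section 5.3.2 can be summarised in a compact Python sketch. It is an illustration under stated assumptions, not the implementation used in this Thesis: the four SA images are passed as plain 2D lists indexed [row][column], which views are selected is left to the caller, the orientation follows Equation 5.11 as written and is folded into the 0°-180° range, bins are 20° wide with nearest-bin voting, blocks advance by one cell, and a small constant is added to the norm to avoid division by zero.

```python
import math

CELL, BINS = 8, 9  # 8x8 sample cells, 9 orientation bins

def lfhdg(view1, view2, view3, view4):
    """LFHDG description from four SA images given as 2D lists [row][col]."""
    rows, cols = len(view1), len(view1[0])
    cell_rows, cell_cols = rows // CELL, cols // CELL
    hists = [[[0.0] * BINS for _ in range(cell_cols)] for _ in range(cell_rows)]
    for y in range(cell_rows * CELL):
        for x in range(cell_cols * CELL):
            gx = view1[y][x] - view2[y][x]                    # Eq. 5.9
            gy = view3[y][x] - view4[y][x]
            magnitude = math.hypot(gx, gy)                    # Eq. 5.10
            theta = math.degrees(math.atan2(gx, gy)) % 180.0  # Eq. 5.11, folded
            b = min(int(theta // (180.0 / BINS)), BINS - 1)
            hists[y // CELL][x // CELL][b] += magnitude       # magnitude-weighted vote
    descriptor = []
    for by in range(cell_rows - 1):        # overlapping blocks of 2x2 cells
        for bx in range(cell_cols - 1):
            block = (hists[by][bx] + hists[by][bx + 1]
                     + hists[by + 1][bx] + hists[by + 1][bx + 1])
            norm = math.sqrt(sum(v * v for v in block)) + 1e-12  # Euclidean norm
            descriptor.extend(v / norm for v in block)
    return descriptor

# Hypothetical 16x16 views: 2x2 cells, hence a single block of 4 x 9 values.
size = 16
view1 = [[x for x in range(size)] for _ in range(size)]
view2 = [[0] * size for _ in range(size)]
view3 = [[y] * size for y in range(size)]
view4 = [[0] * size for _ in range(size)]
descriptor = lfhdg(view1, view2, view3, view4)
```

Under these assumptions, a 128×128 face crop would give 16×16 cells and 15×15 blocks, that is, 15 × 15 × 36 = 8100 elements; this figure follows from the sketch and is not a value reported in the text.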
5.4 Face Recognition Based on a VGG 2D+Disparity+Depth (VGG-D3) Fused
Deep Descriptor
As discussed in Section 3.4.2.6, contrary to face recognition, the current CNN networks such as
SqueezeNet [144], AlexNet [142], and VGG-16 network [146] may not achieve superior
performance over conventional solutions for the ear recognition task. This is probably due to the
lack of a sufficient number of available training samples to let the deep networks learn good ear
representations, having a large impact on the recognition performance. Hence, the deep learning
based solutions proposed in the Thesis, presented in this and the next two sections, are optimized
for the face recognition task, although they might also be applied to the ear recognition task once
large-scale ear databases become available, allowing better deep ear classification models to be
obtained.
The previous two sections proposed hand-crafted light field description based recognition solutions;
this section proposes, for the first time, a light field face recognition solution based on deep
learning, named the VGG 2D+Disparity+Depth (VGG-D3) fused deep descriptor. The VGG-D3
description is obtained by the feature level fusion of deep descriptions extracted from 2D images
as well as the corresponding disparity and depth maps, using a VGG-Face descriptor [38]. The
VGG-Face descriptor, pre-trained over 2.6 million face images, is computed based on a VGG-16
network, ignoring the last two fully connected layers in the architecture to extract a description with 4096
elements.
The exploitation of disparity maps together with 2D images and depth maps, in the context of a
fusion scheme, is a novel approach never tried in the literature, acknowledging that disparity and
depth maps may bring some complementary information to the recognition task. It is well-known
that a depth map may be computed from disparity information and the camera intrinsic parameters,
thus being rather equivalent information, even if they visually express different features.
Moreover, if disparity and depth are extracted with independent algorithms and not directly
computed from each other, it is very likely that they partly compensate for each other's algorithmic
weaknesses. The implication is that disparity and depth maps may not necessarily provide exactly
the same visual information for face recognition. A disparity map can represent relevant facial
information such as the position and shape of shadows, changes in contrast and contrast gradient
among observation viewpoints, and defocus blur, which may not be equally expressed by a depth
map. On the other hand, geometric information about the position and shape of face components
may be better represented by a depth map. Hence, disparity and depth maps may express visually
complementary information, and jointly exploiting them may contribute to improving face
recognition performance.
5.4.1 Architecture and Walkthrough
The architecture of the proposed face recognition solution based on the VGG-D3 fused deep
descriptor is presented in Figure 5.8. It takes as input a raw light field face image to create a
multi-view SA array. The face region is cropped and then resized to 224×224 pixels in all SA images,
as this is the input size expected by the VGG-Face descriptor [38]. This work uses the VGG-Face
descriptor that can be directly used to extract descriptions from the 2D central view. However, for
the disparity and depth maps extracted from the light field multi-view SA array, the VGG-16
network [146] needs to be retrained, to fine-tune the pre-trained model to perform well when
disparity or depth maps are taken as input, instead of regular 2D images. Once all models are
available, VGG-Face descriptions are extracted from the three types of data inputs, then these
descriptions are concatenated to form the VGG-D3 description which is then passed to a SVM
classifier. By fusing the descriptions extracted from the 2D central SA view as well as the
corresponding disparity and depth maps, the proposed solution exploits the complementary
information available in the light field image. The VGG-D3 fused deep description is expected to
improve the recognition performance over 2D and 2D +depth face recognition solutions.
The proposed face recognition solution includes the following steps:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (Section 2.4). Then, the face region is cropped in all SA images, based on
the landmarks provided in the database, and the cropped SA images are resized to 224×224
pixels as this is the input size expected by the VGG-Face descriptor [38].
2. Disparity map extraction: A disparity map is extracted from the cropped multi-view SA
array, capturing the angular information available in the light field image. The light field
disparity map is extracted using the method proposed in [188] and [189], which computes the
disparity map as gradients of epipolar plane images.
3. Depth map extraction: A depth map is extracted from the cropped multi-view SA array,
providing geometric information about the position and shape of the facial components. The
depth map has been extracted using the method proposed in [190], which estimates multi-view
stereo correspondences and then optimizes them using graph cuts.
4. VGG-Face feature description: The pre-trained VGG-Face model, which is originally
trained for 2D face recognition, is independently fine-tuned using the training disparity and
depth map samples. Then, the 2D central view as well as the disparity and depth maps are fed
into three VGG-Face descriptor based deep learning networks to extract texture, disparity and
depth descriptions [38] – see details in Section 5.4.2.
5. Description level fusion: Description level fusion is adopted, concatenating the descriptions
extracted for each input into a single VGG-D3 fused deep description.
6. Offline training: The VGG-D3 fused deep descriptions extracted from the training samples,
highlighted in red in Figure 5.8, are fed to a linear SVM classifier (implemented using the
LIBSVM library [187]) to define the classification model. The training data should be selected
based on the test protocol considered.
7. Classification: The VGG-D3 fused deep description extracted from testing samples is fed to
the previously trained SVM classifier, to be compared to the classification model, thus returning
a subject identification. This work also tested the performance of a softmax classifier, with SVM
performing slightly better than softmax (less than 1% improvement), thus justifying the choice of
SVM as the final classifier.
Figure 5.8: Architecture of the proposed face recognition solution based on a
2D+disparity+depth fused deep descriptor.
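The description level fusion step amounts to a simple concatenation, sketched below in Python. The function name and the constant placeholder values are hypothetical; only the size of 4096 elements per input comes from the text, and the SVM training and classification steps are not shown.

```python
def fuse_descriptions(texture, disparity, depth):
    """Description-level fusion: concatenate the three deep descriptions."""
    return list(texture) + list(disparity) + list(depth)

FC6_SIZE = 4096  # elements per VGG-Face description, as stated in the text

# Placeholder descriptions standing in for the outputs of the three networks.
texture = [0.1] * FC6_SIZE
disparity = [0.2] * FC6_SIZE
depth = [0.3] * FC6_SIZE

vgg_d3 = fuse_descriptions(texture, disparity, depth)
```

The fused vector therefore has 3 × 4096 = 12288 elements, which is the input dimension the classifier would see under this sketch.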
5.4.2 VGG-Face Feature Description
The VGG-16 network is one of the top performing convolutional network architectures for several
visual recognition tasks [146]. The VGG-Face descriptor [38], running the VGG-16 network
without the last two fully connected layers, has been trained over 2.6 million face images, covering
rich variations in expression, pose, occlusion, and illumination to obtain a so-called pre-trained
VGG-Face model [38]. The pre-trained model can then be used to extract descriptions from 2D
face images for face recognition. As the VGG-Face descriptor is originally trained with 2D images,
it may not be suitable for describing disparity and depth information for face recognition. For the
purposes of this Thesis, the pre-trained VGG-Face model is fine-tuned considering disparity and
depth maps at the input and back-propagating the loss function results through the VGG-16
network layers. During the fine-tuning, the pre-trained weights of the convolutional layers are
frozen and kept unchanged, while the fully-connected layers are re-trained. Considering some
empirical studies and memory constraints, the fine-tuning for both disparity and depth maps has
been done using a learning rate of 0.005, a batch size of 32, and a total of 30 epochs. The pre-
trained VGG-Face model has been used for the 2D images, and the fine-tuned models have been
used for the disparity and depth maps, resulting in a so-called fully connected layer 6 (FC6)
description, with a total of 4096 elements for each input.
5.4.3 Fusion Strategies Comparison
The proposed face recognition solution based on the VGG-D3 fused deep descriptor processes the
2D central view as well as the corresponding disparity and depth maps. In order to study the other
possible descriptor combinations and the effectiveness of the proposed fusion strategy, Table 5.1
reports the rank 1 recognition rate performance (RR1) when considering only the 2D VGG
descriptor and several alternative fusion strategies: i) 2D + disparity; ii) 2D + depth; and iii) 2D +
disparity + depth. The recognition results are presented for a cross-session face recognition
scenario, which means training and testing the recognition system using samples captured in
different sessions.
Table 5.1: Face rank-1 recognition performance for the 2D baseline descriptor and three
alternative fusion strategies (best results in bold).
Solution             | Recognition Tasks
                     | Neutral & Emotion | Action | Pose  | Illumination | Occlusion | Average
2D                   | 99.5%             | 100%   | 94.6% | 98.5%        | 94.6%     | 96.8%
2D +disparity        | 99.5%             | 99.0%  | 94.8% | 99.0%        | 94.8%     | 97.0%
2D +depth            | 99.5%             | 100%   | 95.6% | 99.5%        | 95.6%     | 97.7%
2D +disparity+depth  | 99.5%             | 100%   | 95.8% | 100%         | 95.8%     | 98.1%
The obtained recognition results show that the proposed 2D+disparity+depth fusion strategy
achieves the best performance, alone or tied, for all the considered recognition tasks. Since
fusing the VGG descriptors for the 2D image with those for the disparity and depth maps best
exploits the complementary information available in the light field, thus increasing the
discriminative power of the fused descriptor, this is the VGG-16 based recognition solution
proposed in this section.
5.5 Face Recognition Based on a VGG + Conventional LSTM Double-Deep
Descriptor
The proposed solution described in the previous section processes only light field central view
data, notably using its rendered texture, and corresponding disparity and depth maps, using a CNN
network. This Thesis also proposes a double-deep spatio-angular learning framework/descriptor
adopting a conventional long short-term memory (LSTM) recurrent network to extract higher
dimensional angular dependencies from different viewpoints rendered from a full light field image,
thus offering a more powerful double-deep spatio-angular description for light field face
recognition. The double-deep descriptor for light field based face recognition proposed in this
Thesis is based on the combination of a VGG-Face descriptor with a Conventional LSTM (Conv-
LSTM) recurrent deep network [191] [192]. This novel descriptor combines the spatial information
learned using a VGG-Face descriptor with the angular dynamics available in a light field image
that are learned using a Conv-LSTM deep recurrent neural network.
While the combination of VGG and Conv-LSTM has recently been used to learn spatio-temporal
information for visual classification and description tasks, including action recognition [193],
facial expression classification [194], or image captioning and video description [195], this
combination has never been proposed to exploit the multi-view information from a single temporal
instant, as performed by the double-deep spatio-angular learning description proposed in this
Thesis; this novel approach of successively processing views within a light field image instead of
a sequence of frames along time has never been tried before for face recognition or any other visual
recognition task.
In the proposed framework, a VGG-Face descriptor is employed to capture 2D information from
multiple SA images, thus extracting high-level spatial/textural descriptions. Next, a Conv-LSTM
network exploits the angular dynamics by learning from the spatial descriptions previously
extracted for slightly different viewpoints. Hence, the proposed double-deep VGG + Conv-LSTM
combination can be very powerful to jointly exploit the spatio-angular information available in
light field images to boost face recognition performance.
5.5.1 Architecture and Walkthrough
The proposed double-deep VGG + Conv-LSTM framework architecture is presented in Figure 5.9.
Figure 5.9: Architecture of the proposed face recognition solution based on VGG + Conv-LSTM
double-deep descriptor.
The proposed face recognition solution/framework based on double-deep VGG + Conv-LSTM
descriptor includes the following main modules:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (Section 2.4). Then, the face region is cropped in all SA images, based on
the landmarks provided in the database, and the cropped SA images are resized to 224×224
pixels as this is the input size expected by the VGG-Face descriptor [38].
2. SA images selection and scanning: This module successively scans a selected sub-set of the
SA images into a SA image pseudo-video sequence, as described in Section 5.5.2.
3. VGG-Face spatial description: Each selected SA image is fed into a pre-trained VGG-Face
descriptor, trained with totally different content from the test material used in this Thesis, to
extract a spatial description containing 4096 elements, as described in Section 5.4.2. Since a
pre-trained model has been used, no additional learning/fine-tuning has been performed for
this specific purpose.
4. LSTM angular description: The extracted spatial deep descriptions are passed to a LSTM
network composed of conventional LSTM (Conv-LSTM) cells with peephole connections, to
learn angular dependencies across the selected SA viewpoints and then extract double-deep
descriptions for classification, as described in Section 5.5.3.
5. Offline training: The set of double-deep description outputs from the LSTM gates, extracted
from the training samples, is used as input to a softmax classifier to create a classification
model. The training description outputs are highlighted in red in Figure 5.9, to distinguish
them from the testing description outputs. The training data should be selected based on the
test protocol considered.
6. Classification: The set of double-deep descriptor outputs from the LSTM gates, extracted from
testing samples is used as input to the previously trained softmax classifier to be compared to
the classification model. Then, the average of the classification probabilities across the
rendered SA images, selected from the light field image, is used to predict the most probable
label and, thus, the final output, as described in Section 5.5.4.
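The decision rule of the classification module, averaging the classification probabilities across the selected SA images and then picking the most probable label, can be sketched as follows. The function names and the probability values are hypothetical; each inner list stands for the softmax output obtained for one SA image.

```python
def average_probabilities(per_view_probabilities):
    """Average the per-class probabilities across all selected SA images."""
    count = len(per_view_probabilities)
    return [sum(column) / count for column in zip(*per_view_probabilities)]

def predict_label(per_view_probabilities):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_probabilities(per_view_probabilities)
    return max(range(len(averaged)), key=averaged.__getitem__)

# Hypothetical softmax outputs for three SA images and three enrolled subjects.
probabilities = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label = predict_label(probabilities)
```

In this hypothetical case, the second SA image alone would favour subject 1, while the average across views favours subject 0.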
5.5.2 SA Images Selection and Scanning
The pre-processed multi-view SA array contains 15×15 rendered 2D SA images. A representative
subset of SA images is selected for processing by the VGG-Face descriptor, and then scanned as
a pseudo-video sequence, so that their angular dynamics can be learned by the conventional LSTM
network. Different methods can be considered to select and scan the sequence of representative
SA images, notably varying in their number, position and scanning order. It is again worth
mentioning that since the Lytro ILLUM lenslet camera microlens shape is hexagonal, the SA
image positions highlighted in dark grey in Figure 5.10 do not contain usable information, thus
being ignored in the selection process. To consider different solutions in terms of number of views,
thus impacting complexity, and positioning, thus impacting the amount of disparity, the following
SA images selection topologies have been defined:
1. High-density SA images selection: This SA topology considers a rather large number of SA
images from the multi-view array, as illustrated in Figure 5.10.a, where the selected SA images
are highlighted in red. To arrange the selected SA images into a sequence, two different
scanning orders are proposed: i) row-major scanning, which concatenates SA images one row
after another, from left to right, as illustrated in Figure 5.10.b; and ii) snake-like scanning,
which also progresses row-wise, but the rows are alternately scanned from left to right and
right to left, as illustrated in Figure 5.10.c.
2. Max-disparity SA images selection: This selection topology considers those SA images
corresponding to the multi-view array's borders, thus considering the SA images for which the
viewpoint changes the most and, thus, have the maximum disparity, as illustrated in Figure
5.10.d. Some of the selected SA images may not be of the highest quality, due to the vignetting
effect.
3. Mid-density SA images selection: In this case, the selected SA images capture horizontal,
vertical, and both horizontal and vertical parallaxes. The SA images considered are: i) middle
row – see Figure 5.10.e; ii) middle column – see Figure 5.10.f; and iii) combination of middle
row and middle column – see Figure 5.10.g. There are two possible ways to combine the
horizontal and vertical angular information for the topology in Figure 5.10.g: i) scanning the
horizontal SA images followed by the vertical ones; or ii) processing each direction separately
and then applying score-level fusion, by adding the LSTM softmax classifier outputs obtained
for the horizontal and vertical SA images, as illustrated in Figure 5.11. As will be seen later,
the performances for these two approaches may be rather different.
Figure 5.10: (a) High-density SA images selection topology; (b) row-major scanning order; (c)
snake-like scanning order; (d) max-disparity SA images selection topology; (e) mid-density
horizontal SA images selection topology; (f) mid-density vertical SA images selection.
Figure 5.11: Score-level fusion for combining the horizontal and vertical angular information.
4. Low-density SA images selection: Exploiting spatio-angular dynamics for a considerable
number of SA images may not always be the best option, as this requires considerable
computational power and memory resources. Thus, a low-density sampling of the SA images
is also considered. Since results in [36] and [37] show a clear performance improvement for
light field based face recognition and presentation attack detection as the SA images’ disparity
increases, the central view SA image along with two SA images at maximum horizontal and
vertical disparities from the central view are selected, as illustrated in Figure 5.10.h and Figure
5.10.i. Figure 5.10.j shows the selection of both these horizontal and vertical SA images, for
which the two combination approaches described above may be applied.
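The two scanning orders and the score-level fusion described in this section can be sketched in Python as follows. The (row, column) position tuples, the function names, the 3×3 selection and the score values are hypothetical; the actual selected SA images are those shown in Figure 5.10.

```python
def row_major_scan(selected):
    """Row-major order: one row after another, each from left to right."""
    return sorted(selected)

def snake_scan(selected):
    """Snake-like order: rows alternately scanned left-to-right and right-to-left."""
    ordered = []
    for n, row in enumerate(sorted({r for r, _ in selected})):
        columns = sorted(c for r, c in selected if r == row)
        if n % 2 == 1:
            columns.reverse()
        ordered.extend((row, c) for c in columns)
    return ordered

def score_level_fusion(horizontal_scores, vertical_scores):
    """Add the softmax outputs obtained for the horizontal and vertical sequences."""
    return [h + v for h, v in zip(horizontal_scores, vertical_scores)]

# Hypothetical 3x3 selection of SA image positions around the central view.
selected = [(row, col) for row in (-1, 0, 1) for col in (-1, 0, 1)]
row_major = row_major_scan(selected)
snake = snake_scan(selected)

# Hypothetical softmax outputs for three subjects.
fused = score_level_fusion([0.7, 0.2, 0.1], [0.3, 0.6, 0.1])
best_label = max(range(len(fused)), key=fused.__getitem__)
```

The only difference between the two orders is the direction of every second row, which keeps consecutive SA images in the snake-like sequence adjacent in the multi-view array.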
5.5.3 LSTM Angular Description
The VGG-Face descriptor only deals with spatial information within a 2D image. However, for a
multi-view array of rendered 2D SA images, it is possible to additionally exploit the angular
information available in the light field image to improve the face recognition performance.
Recurrent neural networks (RNN) can be used to extract higher dimensional dependencies from
sequential data. The RNN units, called cells, have connections not only between the subsequent
layers, but also into themselves, to keep information from previous inputs. To train a RNN, the so-
called back-propagation through time algorithm can be used [196]. Traditional RNNs can easily
learn short-term dependencies; however, they have difficulty learning long-term dynamics due to
the vanishing and exploding gradient problems [197].
The Long Short-Term Memory (LSTM) is a type of RNN addressing the vanishing and exploding
gradient problems by learning both long- and short-term dependencies [191] [192]. LSTM has
recently achieved impressive results on many large-scale learning tasks, such as speech recognition
[198], language translation [199], activity recognition [193], facial expression classification [194],
and image captioning and video description [195]. Therefore, LSTM based networks are now
widely used in many cutting-edge applications, notably Google Translate, Facebook, Siri or
Amazon's Alexa.
A LSTM network is composed of cells whose outputs evolve through the network based on past
memory content. Since the introduction of LSTM in 1997 [191], the conventional LSTM
(Conv-LSTM) with peephole connections has been the most commonly used cell architecture for visual
analysis tasks [195]. Figure 5.12 illustrates the architecture of a Conv-LSTM cell with peephole
connections, which are connections from the previous cell state to the gates, denoted by a dashed
line in Figure 5.12. The cells have a common cell state, which keeps long-term dependencies along
the entire LSTM cells chain, controlled by two gates, the so-called input and forget gates, thus
allowing the network to decide when to forget the previous state or update the current state given
new information. The output of each cell, the hidden state, is controlled by an output gate, allowing
the cell to compute its output given the updated cell state.
Figure 5.12: Architecture of a Conv-LSTM cell with peephole connections (indicated by a
dashed line).
A Conv-LSTM cell can be mathematically formulated as follows:
For a descriptor sample $S_i$, belonging to a descriptor sequence, derived from an image of the
multi-view sequence, the output of the input gate, $I_i$, is computed as in Equation 5.12, based on
the sample value, the previous hidden state $h_{i-1}$, and the previous cell state $C_{i-1}$ (for the
peephole LSTM cell architecture):
$I_i = \sigma(W_I[S_i + h_{i-1} + C_{i-1}] + b_I)$ (5.12)
where $W_I$ is the input gate weight and $b_I$ is the input gate bias. Each gate is controlled by a
sigmoid activation function, defined in Equation 5.13, which bounds the gate output to the [0,1]
range:
$\sigma(x) = (1 + e^{-x})^{-1}$ (5.13)
Equation 5.14 creates a vector of new cell state candidate values, $\tilde{C}_i$, that may later be
added to the cell state:
$\tilde{C}_i = \tanh(W_C[S_i + h_{i-1} + C_{i-1}] + b_C)$ (5.14)
where $W_C$ holds the weights for the new candidate values and $b_C$ holds the corresponding biases.
The hyperbolic tangent activation function, $\tanh$, used to create the vector of candidate values,
$\tilde{C}_i$, is defined as:
$\tanh(x) = 2\sigma(2x) - 1$ (5.15)
The output of the forget gate, $F_i$, is computed as in Equation 5.16 and defines what information
should be removed from the cell state:
$F_i = \sigma(W_F[S_i + h_{i-1} + C_{i-1}] + b_F)$ (5.16)
where $W_F$ is the forget gate weight and $b_F$ is the forget gate bias.
Then, based on $I_i$, $F_i$, and $\tilde{C}_i$, the previous cell state, $C_{i-1}$, is updated to obtain
$C_i$ as follows:
$C_i = F_i \odot C_{i-1} + I_i \odot \tilde{C}_i$ (5.17)
where $\odot$ denotes the element-wise vector product. As the output values of $I_i$ and $F_i$ lie in
the range [0,1], the LSTM selectively learns to consider or forget the current input and the
previous state.
The current cell state, $C_i$, can then be used for predicting the current cell's hidden state, $h_i$,
according to Equations 5.18 and 5.19, thus allowing the LSTM to learn how much of the cell
memory should be included in the hidden state:
$O_i = \sigma(W_O[S_i + h_{i-1} + C_{i-1}] + b_O)$ (5.18)
$h_i = O_i \odot \tanh(C_i)$ (5.19)
where $W_O$ is the output gate weight and $b_O$ is the output gate bias. The hidden state, $h_i$, is the
cell output for the descriptor sample $S_i$, which is passed as input to the next LSTM cell in the
LSTM network architecture, which is composed of LSTM cells sequentially connected together. As
shown in Figure 5.9, the adopted LSTM network includes one Conv-LSTM cell for each selected
SA image in the pseudo-video sequence. Based on the selected scanning order, the deep spatial
descriptions extracted from the SA images are passed to the corresponding LSTM cell. The output
of each LSTM cell, corresponding to its hidden state, describes the short-term and long-term
angular dependencies captured so far.
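As an illustration, Equations 5.12 to 5.19 can be sketched in NumPy as shown below. This is a minimal sketch and not the implementation used in this Thesis: it assumes that the bracket applied to the sample, the previous hidden state and the previous cell state denotes vector concatenation, and the names init_params, lstm_cell and run_sequence, as well as the random weight initialization, are hypothetical choices made only for illustration.

```python
import numpy as np


def sigmoid(x):
    # Equation 5.13: bounds the gate output to the [0, 1] range.
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    # Hypothetical initialization: small random weights, zero biases.
    rng = np.random.default_rng(seed)
    z_size = input_size + 2 * hidden_size  # length of [S_i, h_{i-1}, C_{i-1}]
    params = {}
    for gate in ("I", "C", "F", "O"):
        params["W_" + gate] = rng.normal(0.0, 0.1, size=(hidden_size, z_size))
        params["b_" + gate] = np.zeros(hidden_size)
    return params


def lstm_cell(S_i, h_prev, C_prev, p):
    # Assumption: the bracket in Equations 5.12-5.18 is a concatenation.
    z = np.concatenate([S_i, h_prev, C_prev])
    I_i = sigmoid(p["W_I"] @ z + p["b_I"])       # input gate, Eq. 5.12
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # candidate values, Eq. 5.14
    F_i = sigmoid(p["W_F"] @ z + p["b_F"])       # forget gate, Eq. 5.16
    C_i = F_i * C_prev + I_i * C_tilde           # cell state update, Eq. 5.17
    O_i = sigmoid(p["W_O"] @ z + p["b_O"])       # output gate, Eq. 5.18
    h_i = O_i * np.tanh(C_i)                     # hidden state, Eq. 5.19
    return h_i, C_i


def run_sequence(descriptors, p, hidden_size):
    # One cell per descriptor sample; every hidden state is kept.
    h = np.zeros(hidden_size)
    C = np.zeros(hidden_size)
    hidden_states = []
    for S_i in descriptors:
        h, C = lstm_cell(S_i, h, C, p)
        hidden_states.append(h)
    return np.stack(hidden_states)
```

Calling run_sequence on a sequence of descriptors returns one hidden state per selected SA image, which is the quantity later passed to the classifier.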
The LSTM network has been trained with the MSE loss function and batch normalization [200] is
used to control the distributions of feedforward network activations. The model obtained from the
LSTM learning process can then be used for descriptor extraction during testing. Applying the
above structure to a SA image pseudo-video sequence (and not a sequence of images along time)
enables the LSTM to learn long short-term angular dynamics when using light field images for
face recognition; it offers a novel approach, never tried before for any visual recognition task.
LSTM has a number of hyper-parameters for network training whose optimization is of major
importance for the final recognition performance, notably:
LSTM hidden layer size: This hyper-parameter controls the size of the hidden layer in the
LSTM units, which is also the size of each LSTM cell's output. A small hidden layer size
requires setting fewer parameters, but it may lead to underfitting. A larger hidden layer size
gives the network more learning capacity, while increasing the required training time.
However, too large a hidden layer size may result in overfitting, thus highlighting the
importance of appropriately adjusting the hidden layer size.
Batch size: The input data can be divided into a number of batches, each used for one round
of network weights update. There are two main advantages of training a deep learning network
using batches instead of the whole input data at once: i) decreasing the computational
complexity, increasing the parallelization ability and needing less memory; and ii) performing
a better training with stronger generalization ability as the network can escape from local
minima [201] [202]. Nevertheless, it should be noted that a high number of batches, i.e., small
batches, may lead to less accurate gradient estimation during the training/learning process.
Number of training epochs: One epoch is a full forward-backward pass of all training
samples through the network. Each epoch may consider a number of iterations, in case the
whole data is divided into batches. The number of epochs should be selected in such a way that
it guarantees network convergence within a reasonable training time.
The impact of the hyper-parameter settings on face recognition performance will be evaluated in
the experiments reported in Section 6.5.
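The relation between batch size, iterations and epochs described above can be made concrete with a short arithmetic sketch. The function names and the example values are hypothetical and do not correspond to the settings used in this Thesis.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    # One iteration processes one batch; the last batch may be smaller.
    return math.ceil(num_samples / batch_size)


def total_weight_updates(num_samples, batch_size, num_epochs):
    # Each batch triggers one round of network weight updates.
    return iterations_per_epoch(num_samples, batch_size) * num_epochs
```

For instance, 100 training samples with a batch size of 32 give 4 iterations per epoch, so 10 epochs correspond to 40 weight updates.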
5.5.4 Softmax Classification
The output (hidden state) of each Conv-LSTM cell is used as input to a softmax classifier and
includes: i) the short-term dependencies, corresponding to the recently observed viewpoint
changes; and ii) the long-term dependencies, corresponding to all the viewpoint changes observed
so far. Then, the average of the classification probabilities
across the rendered SA images, selected from the light field image under consideration, predicts
the most probable label and, thus, the final output. The averaging mechanism, which has been
widely used in the literature in the context of spatio-temporal frameworks for visual recognition
tasks [195], considers all LSTM hidden states, thus exploiting both the full short-term and
long-term angular dependencies; this approach offers a comprehensive angular description for visual
recognition. The alternative mechanism of using only the output of the last LSTM cell [194]
considers the long-term dependencies and only the short-term dependency corresponding to the last
LSTM cell; by ignoring the other hidden states, it may not exploit the full angular dependencies,
thus offering a slightly lower performance than the averaging mechanism.
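The two decision mechanisms discussed above can be contrasted with a minimal NumPy sketch. The classifier weights W and b, and the function names, are hypothetical; the sketch only assumes a linear layer followed by a softmax applied to each hidden state.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def predict_by_averaging(hidden_states, W, b):
    # hidden_states: (num_cells, hidden_size), one row per LSTM cell.
    probs = softmax(hidden_states @ W + b)   # (num_cells, num_classes)
    mean_probs = probs.mean(axis=0)          # average over all cells
    return int(np.argmax(mean_probs)), mean_probs


def predict_from_last_cell(hidden_states, W, b):
    # Alternative mechanism: only the last hidden state is classified.
    probs = softmax(hidden_states[-1] @ W + b)
    return int(np.argmax(probs)), probs
```

When the last cell disagrees with the majority of the previous cells, the two mechanisms may return different labels, which is the situation the averaging mechanism is meant to smooth out.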
5.6 Face Recognition Based on VGG + Light Field LSTM Double-Deep
Descriptors
As discussed above, a conventional LSTM network can be used to learn the available angular
information from the multiple viewpoints included in a light field image to provide richer
descriptions for visual analysis tasks. In order to capture both the horizontal and vertical angular
information, one possibility is to scan the horizontal SA images followed by the vertical ones, thus
creating a single descriptor sequence to be used as LSTM input, as this can represent the
viewpoint changes along different directions. However, this sequential descriptor concatenation
implies a viewpoint descriptor discontinuity when moving from the last horizontal SA image
position to the first vertical one, which may lead to a degraded learning performance. Additionally,
dealing with angular information as a single pseudo-video sequence ignores the additional angular
information/dependencies, such as parallax, that could be further exploited during the
training/learning process to increase recognition accuracy.
This Thesis also proposes three light field LSTM cell architectures which have been integrated
(naturally, one at a time) in a double-deep learning framework for face recognition, whose inputs
come from a VGG-Face descriptor applied to the horizontal and vertical sequences of 2D face SA
images derived from a light field image.
5.6.1 Architecture and Walkthrough
This Thesis proposes three novel light field LSTM cell architectures able to jointly learn light field
horizontal and vertical dynamics, thus providing highly discriminative double-deep
descriptions for spatio-angular based face recognition tasks. The differences between this VGG +
light field LSTM framework, represented in Figure 5.13, and the VGG + conventional LSTM
framework, presented in Section 5.5.1 (see Figure 5.9), are twofold: i) the Conv-LSTM cell
architecture in the basic framework is replaced by the new light field LSTM cell architectures
proposed here; and ii) in the VGG + conventional LSTM learning framework, different methods
to select the sequence of representative SA images were considered, notably varying in their
number, position and scanning order, whereas in the present solution only the middle row and the
middle column SA images are considered, as they can represent the essential light field information
coming from multiple directions. The proposed VGG + light field LSTM learning framework
architecture, adopting the proposed light field LSTM cell architectures, is represented in Figure
5.13.
Figure 5.13: Architecture of the proposed face recognition solution based on VGG + Light Field
LSTM double-deep descriptors.
The proposed face recognition solution/framework based on double-deep VGG + light field LSTM
descriptors includes the following main modules:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (Section 2.4). Then, the face region is cropped in all SA images, based on
the landmarks provided in the database, and the cropped SA images are resized to 224×224
pixels as this is the input size expected by the VGG-Face descriptor [38].
2. Horizontal and vertical SA image selection: This module independently scans the middle
row and the middle column SA images into two SA image pseudo-video sequences, each
including fifteen SA images (for the used Lytro ILLUM lenslet camera). These images
represent viewpoint changes along the horizontal and vertical directions and are thus expected
to capture light field information coming from multiple directions.
3. VGG-Face spatial description: Each selected SA image is fed to a VGG-Face descriptor, to
extract descriptions with a fixed length of 4096 elements (see Section 5.4.2). This work uses
the available VGG-Face model, with no additional training performed at this stage. It should
be noted that the training of the VGG-Face model has been done with totally different content
from the test material used in this Thesis.
4. Light Field LSTM angular description: The extracted spatial descriptions are provided to an
LSTM network including one of the newly proposed LSTM cell architectures (see Section
5.6.2), which jointly learn horizontal and vertical angular dependencies across the selected SA
viewpoints, extracting double-deep spatio-angular descriptions for classification. Naturally,
the number of LSTM cells in an LSTM network equals the number of samples in the input
sequence. It should be noted that the SeqL-LSTM cell architecture has been used in Figure
5.13 for illustration purposes.
5. Offline training: The set of double-deep description outputs from the light field LSTM gates,
extracted from training samples and highlighted in red in Figure 5.13, is used as input
to a softmax classifier to create a classification model. The training data should be selected
based on the test protocol considered.
6. Softmax classification: The set of double-deep description outputs from the light field LSTM
gates, extracted from testing samples, is used as input to the previously trained softmax
classifier to be compared to the classification model. Then, the average of the classification
probabilities across the rendered SA images, selected from the light field image, is used to
predict the most probable label and, thus, the final output; more details about the softmax
classification stage are provided in Section 5.5.4.
In summary, the adoption of the proposed light field LSTM cell architectures in the context of a
double-deep spatio-angular based recognition framework can offer very powerful recognition
solutions, by exploiting both the spatial and combined horizontal and vertical angular information
available in light field images, leading to a boost in face recognition performance.
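The horizontal and vertical SA image selection module can be sketched as follows. The sketch assumes that the multi-view array is stored as a NumPy array indexed as (row, column, height, width, channel), that the array has 15×15 views (consistent with the fifteen SA images per sequence mentioned above), and that the sequences are scanned left to right and top to bottom; the function name is hypothetical.

```python
import numpy as np


def select_middle_row_and_column(sa_array):
    # sa_array: (rows, cols, height, width, channels) multi-view SA array.
    rows, cols = sa_array.shape[:2]
    horizontal = sa_array[rows // 2, :]  # middle row, scanned left to right
    vertical = sa_array[:, cols // 2]    # middle column, scanned top to bottom
    return horizontal, vertical
```

With a 15×15 array, each returned sequence contains fifteen SA images, and the central view appears in both sequences.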
5.6.2 Light Field LSTM Angular Descriptors
This Thesis proposes three novel light field LSTM cell architectures able to jointly learn light field
horizontal and vertical dynamics, thus providing highly discriminative double-deep
descriptions for spatio-angular based visual recognition tasks. The proposed architectures adopt
gate-level fusion, state-level fusion, and sequential learning schemes, as described in the
following.
5.6.2.1 Gate-Level Fusion LSTM Cell Architecture
The first proposed light field LSTM cell architecture, called Gate-Level Fusion LSTM (GLF-LSTM),
adopts a gate-level fusion scheme, separately learning horizontal and vertical forget, input
and output gates and then merging the horizontal and vertical gates' outputs to compute the fused
gates' output. As illustrated in Figure 5.14, the horizontal and vertical gates are computed based
on the description samples $H_i$ and $V_i$, respectively extracted from the horizontal
and vertical multi-view description sequences, the previous hidden state, $h_{i-1}$, and the previous
cell state, $C_{i-1}$. Then, the fused gates are computed by adding the horizontal and vertical gates
together. The cell and hidden state outputs are controlled by the fused gates, thus providing richer
joint information to learn light field angular dynamics.
Figure 5.14: Architecture of a GLF-LSTM cell.
Given inputs $H_i$, $V_i$, $h_{i-1}$, and $C_{i-1}$, the GLF-LSTM cell architecture for view number $i$
can be formulated as:
$HI_i = \sigma(W_{HI}[H_i + h_{i-1} + C_{i-1}] + b_{HI})$ (5.20)
$VI_i = \sigma(W_{VI}[V_i + h_{i-1} + C_{i-1}] + b_{VI})$ (5.21)
$I_i = [HI_i + VI_i]$ (5.22)
$H\tilde{C}_i = \tanh(W_{H\tilde{C}}[H_i + h_{i-1} + C_{i-1}] + b_{H\tilde{C}})$ (5.23)
$V\tilde{C}_i = \tanh(W_{V\tilde{C}}[V_i + h_{i-1} + C_{i-1}] + b_{V\tilde{C}})$ (5.24)
$\tilde{C}_i = [H\tilde{C}_i + V\tilde{C}_i]$ (5.25)
$HF_i = \sigma(W_{HF}[H_i + h_{i-1} + C_{i-1}] + b_{HF})$ (5.26)
$VF_i = \sigma(W_{VF}[V_i + h_{i-1} + C_{i-1}] + b_{VF})$ (5.27)
$F_i = [HF_i + VF_i]$ (5.28)
$C_i = F_i \odot C_{i-1} + I_i \odot \tilde{C}_i$ (5.29)
$HO_i = \sigma(W_{HO}[H_i + h_{i-1} + C_{i-1}] + b_{HO})$ (5.30)
$VO_i = \sigma(W_{VO}[V_i + h_{i-1} + C_{i-1}] + b_{VO})$ (5.31)
$O_i = [HO_i + VO_i]$ (5.32)
$h_i = O_i \odot \tanh(C_i)$ (5.33)
where: i) $HI_i$, $HF_i$, $H\tilde{C}_i$, and $HO_i$ are, respectively, the horizontal input gate, forget
gate, candidate values and output gate; ii) $VI_i$, $VF_i$, $V\tilde{C}_i$, and $VO_i$ are, respectively,
the vertical input gate, forget gate, candidate values and output gate; iii) $W_{HI}$, $W_{HF}$,
$W_{H\tilde{C}}$, and $W_{HO}$ are, respectively, the horizontal input, forget, candidate values and
output weights; iv) $W_{VI}$, $W_{VF}$, $W_{V\tilde{C}}$, and $W_{VO}$ are, respectively, the vertical
input, forget, candidate values and output weights; v) $b_{HI}$, $b_{HF}$, $b_{H\tilde{C}}$, and $b_{HO}$
are, respectively, the horizontal input, forget, candidate values and output biases; and vi) $b_{VI}$,
$b_{VF}$, $b_{V\tilde{C}}$, and $b_{VO}$ are, respectively, the vertical input, forget, candidate values
and output biases.
The GLF-LSTM cell architecture jointly learns light field horizontal and vertical dynamics, in the
form of fused gates composed of independent horizontal and vertical gates. The computation of
the horizontal and vertical input, forget, and output gates can be done in parallel, as the learning
of the $W_{HI}$, $W_{HF}$, $W_{H\tilde{C}}$, and $W_{HO}$ horizontal weights is independent from that of
the $W_{VI}$, $W_{VF}$, $W_{V\tilde{C}}$, and $W_{VO}$ vertical weights. Although this independence
increases parallelism and, thus, may reduce the computational time, it implies that the vertical and
horizontal gates cannot establish a learning interaction between themselves when optimizing
learning weights for updating the cell state.
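A minimal NumPy sketch of one GLF-LSTM cell step, following Equations 5.20 to 5.33, is given below for illustration. It is not the implementation used in this Thesis: the bracket in the gate equations is assumed to denote concatenation, and the parameter dictionary, the key names and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    # Hypothetical initialization with separate horizontal (H) and vertical (V) weights.
    rng = np.random.default_rng(seed)
    z_size = input_size + 2 * hidden_size
    p = {}
    for direction in ("H", "V"):
        for gate in ("I", "C", "F", "O"):
            name = direction + gate
            p["W_" + name] = rng.normal(0.0, 0.1, size=(hidden_size, z_size))
            p["b_" + name] = np.zeros(hidden_size)
    return p


def glf_lstm_cell(H_i, V_i, h_prev, C_prev, p):
    # Assumption: the bracket in Equations 5.20-5.31 is a concatenation.
    zh = np.concatenate([H_i, h_prev, C_prev])
    zv = np.concatenate([V_i, h_prev, C_prev])

    def gate(name, z, activation):
        return activation(p["W_" + name] @ z + p["b_" + name])

    # Gate-level fusion: horizontal and vertical gates are added together.
    I_i = gate("HI", zh, sigmoid) + gate("VI", zv, sigmoid)       # Eq. 5.20-5.22
    C_tilde = gate("HC", zh, np.tanh) + gate("VC", zv, np.tanh)   # Eq. 5.23-5.25
    F_i = gate("HF", zh, sigmoid) + gate("VF", zv, sigmoid)       # Eq. 5.26-5.28
    C_i = F_i * C_prev + I_i * C_tilde                            # Eq. 5.29
    O_i = gate("HO", zh, sigmoid) + gate("VO", zv, sigmoid)       # Eq. 5.30-5.32
    h_i = O_i * np.tanh(C_i)                                      # Eq. 5.33
    return h_i, C_i
```

Read literally, each fused gate is the sum of two sigmoid outputs and therefore takes values in the [0, 2] range rather than [0, 1]; the sketch keeps this behaviour unchanged.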
5.6.2.2 State-Level Fusion LSTM Cell Architecture
The second proposed light field LSTM cell architecture, called State-Level Fusion LSTM (SLF-LSTM),
provides a state-level fusion scheme, separately learning the horizontal and vertical cell
and hidden states and then merging the horizontal and vertical state outputs to compute the fused
cell and hidden state outputs. As illustrated in Figure 5.15, the horizontal and vertical gates are,
respectively, computed based on the descriptor samples $H_i$ and $V_i$, the previous hidden state
$h_{i-1}$, and the previous cell state $C_{i-1}$. Then, the horizontal and vertical cell and hidden state
outputs are independently computed. The final cell and hidden state outputs are computed by adding
the horizontal and vertical state outputs together.
Figure 5.15: Architecture of an SLF-LSTM cell.
Given inputs $H_i$, $V_i$, $h_{i-1}$, and $C_{i-1}$, the SLF-LSTM cell architecture for view number $i$
can be formulated as:
$HI_i = \sigma(W_{HI}[H_i + h_{i-1} + C_{i-1}] + b_{HI})$ (5.34)
$H\tilde{C}_i = \tanh(W_{H\tilde{C}}[H_i + h_{i-1} + C_{i-1}] + b_{H\tilde{C}})$ (5.35)
$HF_i = \sigma(W_{HF}[H_i + h_{i-1} + C_{i-1}] + b_{HF})$ (5.36)
$HC_i = HF_i \odot HC_{i-1} + HI_i \odot H\tilde{C}_i$ (5.37)
$HO_i = \sigma(W_{HO}[H_i + h_{i-1} + C_{i-1}] + b_{HO})$ (5.38)
$Hh_i = HO_i \odot \tanh(HC_i)$ (5.39)
$VI_i = \sigma(W_{VI}[V_i + h_{i-1} + C_{i-1}] + b_{VI})$ (5.40)
$V\tilde{C}_i = \tanh(W_{V\tilde{C}}[V_i + h_{i-1} + C_{i-1}] + b_{V\tilde{C}})$ (5.41)
$VF_i = \sigma(W_{VF}[V_i + h_{i-1} + C_{i-1}] + b_{VF})$ (5.42)
$VC_i = VF_i \odot VC_{i-1} + VI_i \odot V\tilde{C}_i$ (5.43)
$VO_i = \sigma(W_{VO}[V_i + h_{i-1} + C_{i-1}] + b_{VO})$ (5.44)
$Vh_i = VO_i \odot \tanh(VC_i)$ (5.45)
$C_i = [HC_i + VC_i]$ (5.46)
$h_i = [Hh_i + Vh_i]$ (5.47)
where $HC_i$ and $VC_i$ are, respectively, the horizontal and vertical cell states and $Hh_i$ and $Vh_i$
are, respectively, the horizontal and vertical hidden states (other variables were defined in Section
5.6.2.1).
The SLF-LSTM cell architecture jointly learns light field horizontal and vertical dynamics, in the
form of fused states composed of independent horizontal and vertical states. The parallelism
capability of SLF-LSTM is the same as that of GLF-LSTM, as all the horizontal and vertical learning
weights are independently computed. The SLF-LSTM architecture implies that neither the vertical
and horizontal gates nor the fused horizontal and vertical gates can establish a learning
interaction between themselves when optimizing learning weights for updating the cell
and hidden states, which may decrease the learning efficiency.
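For illustration, one SLF-LSTM cell step following Equations 5.34 to 5.47 can be sketched in NumPy as shown below. This is a sketch under stated assumptions rather than the implementation used in this Thesis: the bracket in the gate equations is taken as a concatenation, the previous fused cell state is obtained from the previous horizontal and vertical cell states as in Equation 5.46, and all function and key names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    # Hypothetical initialization with separate horizontal (H) and vertical (V) weights.
    rng = np.random.default_rng(seed)
    z_size = input_size + 2 * hidden_size
    p = {}
    for direction in ("H", "V"):
        for gate in ("I", "C", "F", "O"):
            name = direction + gate
            p["W_" + name] = rng.normal(0.0, 0.1, size=(hidden_size, z_size))
            p["b_" + name] = np.zeros(hidden_size)
    return p


def slf_lstm_cell(H_i, V_i, h_prev, HC_prev, VC_prev, p):
    # Previous fused cell state, as in Equation 5.46 applied to the previous view.
    C_prev = HC_prev + VC_prev

    def branch(d, x, dC_prev):
        # One independent direction (d is "H" or "V"); concatenation is assumed.
        z = np.concatenate([x, h_prev, C_prev])
        I = sigmoid(p["W_" + d + "I"] @ z + p["b_" + d + "I"])
        C_tilde = np.tanh(p["W_" + d + "C"] @ z + p["b_" + d + "C"])
        F = sigmoid(p["W_" + d + "F"] @ z + p["b_" + d + "F"])
        dC = F * dC_prev + I * C_tilde               # Eq. 5.37 / 5.43
        O = sigmoid(p["W_" + d + "O"] @ z + p["b_" + d + "O"])
        dh = O * np.tanh(dC)                         # Eq. 5.39 / 5.45
        return dh, dC

    Hh_i, HC_i = branch("H", H_i, HC_prev)
    Vh_i, VC_i = branch("V", V_i, VC_prev)
    h_i = Hh_i + Vh_i                                # state-level fusion, Eq. 5.47
    return h_i, HC_i, VC_i                           # fused C_i = HC_i + VC_i (Eq. 5.46)
```

The two branches share no weights and can therefore be evaluated in parallel; only the final additions combine them.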
5.6.2.3 Sequential Learning LSTM Cell Architecture
The last proposed light field cell architecture performs sequential learning (SeqL) by modeling in
sequence the angular dynamics available in the horizontal and vertical parallaxes. As illustrated in
Figure 5.16, the proposed SeqL-LSTM cell architecture updates first the cell state using horizontal
information, thus creating an updated cell state expressing all previous horizontal and vertical
viewpoint changes observed so far, as well as the horizontal changes observed in the current
viewpoint. Then the cell state is again updated using vertical information. Unlike the previous
proposals, in this approach, the cell state is not updated based on a fusion scheme. In this cell
architecture, the horizontal and vertical hidden states are independently computed based on the
sequentially learned cell states and only then combined to compute the final cell output, i.e. the
hidden state.
Figure 5.16: Architecture of a SeqL-LSTM cell.
Given inputs $H_i$, $V_i$, $h_{i-1}$, and $C_{i-1}$, the SeqL-LSTM cell architecture for view number $i$
can be formulated as:
$HI_i = \sigma(W_{HI}[H_i + h_{i-1} + C_{i-1}] + b_{HI})$ (5.48)
$HF_i = \sigma(W_{HF}[H_i + h_{i-1} + C_{i-1}] + b_{HF})$ (5.49)
$H\tilde{C}_i = \tanh(W_{H\tilde{C}}[H_i + h_{i-1} + C_{i-1}] + b_{H\tilde{C}})$ (5.50)
$HC_i = HF_i \odot C_{i-1} + HI_i \odot H\tilde{C}_i$ (5.51)
$VI_i = \sigma(W_{VI}[V_i + h_{i-1} + C_{i-1}] + b_{VI})$ (5.52)
$VF_i = \sigma(W_{VF}[V_i + h_{i-1} + C_{i-1}] + b_{VF})$ (5.53)
$V\tilde{C}_i = \tanh(W_{V\tilde{C}}[V_i + h_{i-1} + C_{i-1}] + b_{V\tilde{C}})$ (5.54)
$C_i = VF_i \odot HC_i + VI_i \odot V\tilde{C}_i$ (5.55)
$HO_i = \sigma(W_{HO}[H_i + h_{i-1} + C_{i-1}] + b_{HO})$ (5.56)
$VO_i = \sigma(W_{VO}[V_i + h_{i-1} + C_{i-1}] + b_{VO})$ (5.57)
$Hh_i = HO_i \odot \tanh(C_i)$ (5.58)
$Vh_i = VO_i \odot \tanh(C_i)$ (5.59)
$h_i = [Hh_i + Vh_i]$ (5.60)
The variables above have been defined in Sections 5.6.2.1 and 5.6.2.2.
The SeqL-LSTM cell architecture establishes a learning interaction between the horizontal and
vertical input, forget and candidate value weights when updating the cell state, which is expected
to provide better learning and, thus, a better angular description than the SLF-LSTM and GLF-LSTM
cell architectures. Indeed, the vertical weights are optimized considering that the
horizontal information for the current input has already been observed, and thus the horizontal
weights for updating the cell state are already optimized, which is not the case for the other two
proposed light field LSTM cell architectures. However, the expected superior performance of the
SeqL-LSTM cell architecture comes at the cost of reduced parallelism, as updating the
SeqL-LSTM cell state must be done in sequence.
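The sequential update of the SeqL-LSTM cell state can be sketched in NumPy as follows. The sketch follows the order of Equations 5.48 to 5.60 (horizontal update first, vertical update second), assumes that the bracket in the gate equations denotes concatenation, and uses hypothetical function and key names; it is not the implementation used in this Thesis.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    # Hypothetical initialization with separate horizontal (H) and vertical (V) weights.
    rng = np.random.default_rng(seed)
    z_size = input_size + 2 * hidden_size
    p = {}
    for direction in ("H", "V"):
        for gate in ("I", "C", "F", "O"):
            name = direction + gate
            p["W_" + name] = rng.normal(0.0, 0.1, size=(hidden_size, z_size))
            p["b_" + name] = np.zeros(hidden_size)
    return p


def seql_lstm_cell(H_i, V_i, h_prev, C_prev, p):
    zh = np.concatenate([H_i, h_prev, C_prev])
    zv = np.concatenate([V_i, h_prev, C_prev])

    def gate(name, z, activation):
        return activation(p["W_" + name] @ z + p["b_" + name])

    # Step 1: update the cell state with horizontal information (Eq. 5.48-5.51).
    HC_i = gate("HF", zh, sigmoid) * C_prev + gate("HI", zh, sigmoid) * gate("HC", zh, np.tanh)
    # Step 2: update the result with vertical information (Eq. 5.52-5.55).
    C_i = gate("VF", zv, sigmoid) * HC_i + gate("VI", zv, sigmoid) * gate("VC", zv, np.tanh)
    # Hidden states from both output gates, then their sum (Eq. 5.56-5.60).
    Hh_i = gate("HO", zh, sigmoid) * np.tanh(C_i)
    Vh_i = gate("VO", zv, sigmoid) * np.tanh(C_i)
    return Hh_i + Vh_i, C_i
```

Step 2 cannot start before step 1 has produced the intermediate horizontal cell state, which is the source of the reduced parallelism noted above.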
5.7 Summary of the Proposed Face/Ear Recognition Solutions
Table 5.2 summarizes the main characteristics of the face/ear recognition solutions proposed in
this chapter, sorted (from low to high) by the level of performance provided, as will be
shown in Chapter 6. This table includes information about the biometric modalities considered,
the levels of the taxonomy proposed in Section 3.2.3, the classifiers used and the light field
capabilities exploited.
Table 5.2: Overview of the recognition solutions proposed in this Thesis.
| Proposed Solution | Biometric Modality | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Classifier | Light Field Capability |
|---|---|---|---|---|---|---|---|
| LFLBP | Face/Ear | Global | Local | Hand-Crafted | Texture | SVM | Disparity exploitation |
| LFHG | Face/Ear | Global | Local | Hand-Crafted | Texture | SVM | Disparity exploitation |
| VGG-D3 | Face | Global | Global | Learning | Deep Neural Nets | SVM | Disparity exploitation; Depth exploitation |
| VGG + Conv-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + GLF-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + SLF-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
| VGG + SeqL-LSTM | Face | Global | Global | Learning | Deep Neural Nets | Softmax | Disparity exploitation |
Chapter 6
Light Field Face and Ear Recognition Performance
6.1 Introduction
In this chapter, an extensive performance evaluation is reported for the proposed light field face
and ear recognition solutions, as well as for several benchmarking solutions, using common,
representative performance evaluation frameworks for varied and challenging face and ear
recognition tasks.
This Thesis proposed two recognition solutions based on hand-crafted light field descriptors,
LFLBP (Section 5.2) and LFHG (Section 5.3), which can be applied to the face and ear recognition
problems, describing the disparity information available in light field images and thus exploiting
the angular variations together with the available spatial information. Additionally, five deep
learning based fused deep/double-deep descriptors have been proposed for face recognition,
learning convolutional representations and angular dynamics from a light field image. To
successfully apply the proposed deep learning solutions for ear recognition a larger light field
dataset would be necessary for fine-tuning the considered neural networks.
In order to analyze the sensitivity of the proposed face recognition solutions to the available
training data, different evaluation protocols with practical meaningfulness have been proposed,
offering different trade-offs in terms of initial setting complexity and later recognition
performance. Concerning ear recognition, the performance of the proposed solutions is evaluated
based on a cross-session scenario (IST-EURECOM LLFEDB has been captured in two separate
acquisition sessions), training the classifier using the ear images captured in the first session while
testing with the images from the second acquisition session. It should be noted that the proposed
ear recognition database does not include many variations (it only includes four images cropped
from right and left half and full profile face images), thus more complex evaluation protocols are
not considered for ear recognition.
6.2 Performance Assessment Frameworks
This section presents the frameworks considered for evaluating the performance of the proposed
face (Section 6.2.1) and ear (Section 6.2.2) recognition solutions.
6.2.1 Face Recognition Performance Assessment Framework
The test material, evaluation protocols and metrics to assess the performance of the proposed face
recognition solutions and other relevant recognition solutions used for benchmarking are described
in the following.
6.2.1.1 Face Recognition Test Material
A comprehensive set of experiments using a common, representative evaluation framework has
been conducted with the IST-EURECOM LLFFD face database (presented in Section 4.2), for
varied and challenging recognition tasks. In the experiments, all the images from the IST-
EURECOM LLFFD are used to assess the performance of the proposed face recognition solutions
and other relevant recognition methods used for benchmarking.
6.2.1.2 Face Recognition Evaluation Protocols
To analyse the sensitivity of the proposed face recognition solutions to the available training data,
both in terms of number of training samples and facial variations, three evaluation protocols with
practical meaningfulness are proposed. The protocols are defined as follows:
Protocol 1: The training set contains only the neutral light field images from the first
acquisition session (1 image per subject), while the validation set includes left and right half-
profile images from the same acquisition session (2 images per subject), thus corresponding to
a low-complexity enrolment and training scenario; the testing set includes all the light field
images from the second acquisition session, as illustrated in Figure 6.1-a. This 'single training
image per person' protocol is the simplest protocol considered, but it is the most challenging
in terms of recognition performance.
Protocol 2: The training set contains the neutral plus the left and right full-profile light field
images from the first acquisition session (3 images per subject), while the testing set includes
all the light field images from the second acquisition session, as illustrated in Figure 6.1-b. The
validation study is omitted for this protocol, as the training set is not very different from
that of Protocol 1. This protocol assumes a rather simple and quick enrolment phase, thus
corresponding to a low-complexity enrolment and training scenario, and is slightly less
challenging than the first protocol in terms of recognition performance.
Protocol 3: The training set contains all twenty database face variations captured during the
first acquisition session, while the validation and testing sets each consider half of the second
session images, as illustrated in Figure 6.1-c, thus corresponding to a higher-complexity
enrolment and training scenario. This scenario is less challenging in terms of recognition
performance as the system learns more in the training phase.
The first and second protocols (Protocol 1 and Protocol 2) correspond to an application scenario
where each person registers/enrolls into the system by quickly taking just one or three photos in a
controlled setup, similar to the famous police station paradigm. Testing is done by considering all
facial variations captured in the second acquisition session, assuming that the recognition should
be robust to real-life conditions where the face images to be used for recognition may have been
captured in less constrained conditions, notably including facial expressions or partial occlusions.
With these protocols, the recognition system has not been exposed to/trained with many
of the facial variations with which it will be tested.
The third protocol (Protocol 3) assumes a more complex acquisition phase, considering more
training images, under the assumption that the increased complexity will result in a better trained
and thus more knowledgeable model, which should offer a better recognition performance. This
protocol divides the available database material into disjoint training (50%), validation (25%), and
testing (25%) sets where the first session images are all used for training. In this case, the
recognition system has been initially exposed/trained to more facial variations, increasing the
initial complexity to get a better deep model, and thus achieve a better recognition performance.
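The three protocols can be summarized by the following sketch, which splits a list of sample records into training, validation and testing sets. The record fields ("session", "variation"), the variation labels and the function name are hypothetical, and the way the second session images are halved for Protocol 3 (alternating samples) is an assumption made only for illustration; the actual division is the one illustrated in Figure 6.1.

```python
def split_protocol(samples, protocol):
    # samples: list of dicts with hypothetical keys "session" (1 or 2) and "variation".
    session1 = [s for s in samples if s["session"] == 1]
    session2 = [s for s in samples if s["session"] == 2]
    if protocol == 1:
        train = [s for s in session1 if s["variation"] == "neutral"]
        val = [s for s in session1
               if s["variation"] in ("left_half_profile", "right_half_profile")]
        test = session2
    elif protocol == 2:
        train = [s for s in session1
                 if s["variation"] in ("neutral", "left_full_profile", "right_full_profile")]
        val = []  # no validation study for Protocol 2
        test = session2
    elif protocol == 3:
        train = session1
        # Assumption: alternate second session samples between validation and testing.
        val = session2[0::2]
        test = session2[1::2]
    else:
        raise ValueError("protocol must be 1, 2 or 3")
    return train, val, test
```

With twenty variations per subject and session, Protocol 3 yields the 50%, 25% and 25% proportions mentioned above.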
Figure 6.1: IST-EURECOM LFFD (non-cropped) database division into training, validation and
testing sets for (a) Protocol 1; (b) Protocol 2; and (c) Protocol 3.
The three protocols correspond to cooperative user scenarios, offering different trade-offs in terms
of the initial enrolment and training complexity versus the expected recognition performance. The
first and second protocols have multiple practical applications, such as in access control systems,
where the users can be registered/enrolled into the system by quickly taking a mugshot, including
frontal-view and side-view photos in a controlled setup. Then, the goal is to recognize a person
from an image captured at a different time, and possibly in non-ideal conditions, e.g. exhibiting
unpredictable facial variations. The third protocol corresponds to a very cooperative user scenario,
for usage in applications with increased security requirements, where the users are willing to
cooperate during the registration phase, simulating different facial variations, over a range of
expressions, actions, poses, illuminations, and occlusions, to capture as many variations as possible
during the enrollment phase so that the proposed system can more effectively recognize users
during its daily operation.
For all protocols, the training set is used to obtain the classification model weights, the validation
set is used to tune the training model hyper-parameters in the case of deep learning based solutions,
and the testing set is used for the system performance assessment. As face recognition is treated
here as a multi-class classification task, at least one image from each subject (class) with whom the
system will be validated/tested must be available during the training stage. If a new subject is to
be recognized, the database has to be extended with corresponding images and the classification
model has to be re-trained (fine-tuned), as the new subject is an unseen label in the previous model.
As the performance of the classification model being trained depends on a set of hyper-parameters,
a disjoint set of validation samples is used to select, for deep learning based solutions, the
hyper-parameter values leading to the best performance.
6.2.1.3 Face Recognition Performance Assessment Metrics
After performing classification, similarity scores between the test and all the enrolment samples
are sorted, thus every test sample has a best match with one of the enrolment samples. A test
sample has rank k if the correct match has the kth largest similarity score, where k can vary between
1 and the number of samples enrolled in the database. To evaluate the identification performance
of the face recognition solutions, the Recognition Rate at rank n (RRn) and Cumulative Recognition
Rate at rank n (CRRn) metrics are used. RRn can be calculated according to Equation 6.1:
$RR_n = \frac{N_n}{|T|}$ (6.1)
where $N_n$ is the number of samples that have rank $n$, and $|T|$ is the total number of test images
considered. CRRn can then be computed using Equation 6.2:
$CRR_n = \sum_{i=1}^{n} RR_i$ (6.2)
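Equations 6.1 and 6.2 can be illustrated with a short NumPy sketch. The function names are hypothetical, and the sketch assumes one similarity score per enrolled identity for each test sample, with higher scores denoting better matches.

```python
import numpy as np


def ranks_of_correct_match(scores, true_labels):
    # scores: (num_test, num_enrolled) similarity scores.
    # true_labels: index of the correct enrolled identity for each test sample.
    order = np.argsort(-scores, axis=1)  # enrolled identities sorted by decreasing score
    return np.argmax(order == true_labels[:, None], axis=1) + 1  # 1-based rank


def recognition_rate(ranks, n):
    # Equation 6.1: fraction of test samples whose correct match has rank n.
    return np.count_nonzero(ranks == n) / len(ranks)


def cumulative_recognition_rate(ranks, n):
    # Equation 6.2: sum of the recognition rates from rank 1 to rank n.
    return sum(recognition_rate(ranks, i) for i in range(1, n + 1))
```

For three test samples whose correct matches have ranks 1, 2 and 3, the rank-1 recognition rate is 1/3 and the cumulative recognition rate at rank 2 is 2/3.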
6.2.1.4 Face Recognition Benchmarking Solutions
The competing recognition solutions considered for benchmarking purposes are grouped into two
categories:
1. Conventional 2D solutions, notably PCA [65], LBP [76], HOG [119] and VGG-Face [38],
which are applied to the central view 2D rendered SA images;
2. Light field solutions, which fully use the light field data, notably DLBP [120], fusing features
extracted from the central view 2D rendered SA image with a disparity map computed from
the light field, and MPCA [153], adopted for the first time for light field based face recognition
in this Thesis.
A summary of the characteristics of each considered benchmarking solution, following the
taxonomy illustrated in Figure 3.3, is available in Table 6.1 for ease of reference. The central view
2D rendered SA images and the full light field images are used to test the non-light field based 2D
and the light field based solutions, respectively. All tested face recognition solutions are re-
implemented by the author of this Thesis and performance results were obtained considering the
best parameter settings reported in the relevant original papers.
Table 6.1: Overview of the face recognition benchmarking solutions.
Solution Name | Type | Face Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach
PCA [65] | 2D | Global | Global | Appearance | Linear
VGG-Face [38] | 2D | Global | Global | Learning | Deep Neural Network
LBP [76] | 2D | Global | Local | Hand-Crafted | Texture
HOG [119] | 2D | Global | Local | Hand-Crafted | Texture
MPCA [153] | LF | Global | Global | Appearance | Multi-Linear
DLBP [120] | LF | Global | Local | Hand-Crafted | Texture
6.2.2 Ear Recognition Performance Assessment Framework
The only database available for testing the proposed light field ear recognition solutions is the
IST-EURECOM LLFEDB database (see Section 4.3). The database has been proposed in
this Thesis, and made publicly available, to facilitate the testing, validation and comparison of light
field ear recognition solutions. This section presents the experimental assessment setup, the
benchmarking ear recognition solutions, and the obtained performance results and analysis for
the two proposed ear recognition solutions and the benchmarking solutions.
6.2.2.1 Ear Recognition Test Material
A comprehensive set of experiments using a common, representative evaluation framework has
been conducted with the novel IST-EURECOM LLFEDB ear database. In the experiments, all
the images from the IST-EURECOM LLFEDB are used to assess the performance of the proposed
ear recognition solutions and of the other relevant recognition solutions used for benchmarking.
6.2.2.2 Ear Recognition Evaluation Protocol and Metrics
This Thesis proposes an ear recognition evaluation protocol based on a cross-session scenario. The
training phase uses the four ear images per user of the IST-EURECOM LLFEDB first acquisition
session, applying the proposed light field descriptors, whose outputs are then used to train a
classifier and create a classification model. The testing phase uses the four IST-EURECOM
LLFEDB images from the second acquisition session. The training and testing steps are repeated
using the second session images as enrolment data and the first session images as test data; the
average results of these two runs are reported.
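The two-run structure of this cross-session protocol can be sketched as follows. This is only an illustration under stated assumptions: a nearest-neighbour matcher on toy feature vectors stands in for the descriptor extraction and trained classifier actually used in this Thesis, and all names and values are hypothetical.

```python
import math


def rank1_rate(enrol, test):
    """Fraction of test samples whose nearest enrolment sample has the same label."""
    correct = 0
    for label, feature in test:
        nearest = min(enrol, key=lambda item: math.dist(item[1], feature))
        correct += nearest[0] == label
    return correct / len(test)


def cross_session_rr1(session1, session2):
    """Average the rank-1 rate over the two enrolment/test assignments."""
    run_a = rank1_rate(enrol=session1, test=session2)   # session 1 enrolled
    run_b = rank1_rate(enrol=session2, test=session1)   # roles swapped
    return (run_a + run_b) / 2


# Hypothetical toy descriptors: (subject label, feature vector).
session1 = [("s1", (0.0, 0.1)), ("s2", (1.0, 0.9)), ("s3", (2.0, 2.1))]
session2 = [("s1", (0.1, 0.0)), ("s2", (0.9, 1.1)), ("s3", (1.9, 2.0))]
average_rr1 = cross_session_rr1(session1, session2)
```

Swapping the roles of the two sessions ensures that every image is used once for enrolment and once for testing, so the reported average does not depend on which session happened to be acquired first.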
Similarly to face recognition, to evaluate the identification performance of the tested ear
recognition solutions the RRn (Equation 6.1) and CRRn (Equation 6.2) metrics are used.
6.2.2.3 Ear Recognition Benchmarking Solutions
A set of representative and promising ear recognition solutions available in the literature were
selected for benchmarking purposes. The selection includes hand-crafted based solutions,
including Local Gabor (LG) descriptor [5], [203], [173], LBP [176], [177], [204], [205], [181],
LPQ and rotation invariant LPQ [5], [176], [206], HOG [5], [176], [174], POEM (Patterns of
Oriented Edge Magnitudes) [5], and BSIF [5], [176]. The performance of the tested 2D recognition
solutions was evaluated considering the best parameter settings reported in [5]. It should be noted
that apart from the solutions proposed in this Thesis, there is no published research activity
addressing ear recognition using light field sensors at the time of the writing of the Thesis, thus
the benchmarking solutions do not contain any light field based ear recognition solution. A
summary of the characteristics of each considered ear benchmarking solution, following the
taxonomy illustrated in Figure 3.3, is available in Table 6.2 for ease of reference.
Table 6.2: Overview of the ear recognition benchmarking solutions.
Ref. | Ear Structure | Feature Support | Feature Extraction Approach | Feature Extraction Sub-Approach | Feature Extractor
[173] | Global | Local | Hand-Crafted | Texture | LG
[176] | Global | Local | Hand-Crafted | Texture | LBP; LPQ; HOG; BSIF
[5] | Global | Local | Hand-Crafted | Texture; Frequency | LPQ; BSIF; SIFT; POEM; Gabor; HOG
[184] | Global | Local | Hand-Crafted | Texture | GLCM; LBP; Gabor filters
The central view 2D rendered images and the multi-view SA image arrays of the IST-EURECOM
LLFEDB are the input to the 2D benchmarking solutions and the proposed light field based ear
recognition solutions, respectively.
6.3 LFLBP Descriptor Parameter Setting
The proposed Light Field Local Binary Pattern (LFLBP) hand-crafted descriptor has a
number of parameters whose optimization is of major importance for the final recognition
performance. In this context, parameter setting experiments are performed to study the influence
of the key parameters on the light field based face recognition performance.
As discussed in Section 5.2.2, LFLBP has three parameters: i) the radius, R, expressing
the Euclidean distance to the reference view, with a direct relation with the observed disparity; ii)
the angle, A, identifying the starting angle; and iii) N, defining the number of SA images to consider
in the descriptor. The experimental work starts by analyzing the influence of the radius parameter,
R. Once the optimal value of R is fixed, the impact of the number of angular views (N) and of the
starting angle (A) is investigated.
6.3.1 LFLBP Descriptor Parameter Setting: View Radius
As mentioned before, light field images allow the recognition technique to benefit from the
captured disparity; therefore, the amount of disparity to consider is the first aspect to be analysed.
For this purpose, the value of R is increased from 3 up to 7, with A=45° and N=4. The CRR5 values
for the emotion, action and occlusion recognition tasks, corresponding to the IST-EURECOM
LFFD database dimensions, are illustrated in Figure 6.2. The results show a clear increase in the
recognition rate as the disparity increases. By considering a larger radius, more distinctive angular
information is captured by the proposed LFALBP descriptor, and therefore the matching accuracy
between the test and enrolment samples is considerably improved.
Figure 6.2: CRR5 versus R for LFALBP_{R,45°,4}.
6.3.2 LFLBP Descriptor Parameter Setting: Number of Views and Starting Angle
After finding the optimal value of R, the second set of experiments aims to select the ideal starting
angle (A) and number of angular views (N). Table 6.3 shows RR1 and CRR5 (in percentage)
for the proposed recognition system using the LFLBP descriptor with three different
parameter settings for A and N, covering N values of 4 and 8. The results show that
considering 4 views is enough to capture the essential angular variations, leading to the best
recognition performance. Concerning parameter A, there is no significant difference between the
results obtained when using 0° or 45°. Thus, 0° and 4 are selected as final values for the A and N
parameters, respectively.
Table 6.3: RR1 and CRR5 for LFLBP for different values of A and N (best results in bold).
Method | Emotion RR1 | Emotion CRR5 | Action RR1 | Action CRR5 | Occlusion RR1 | Occlusion CRR5 | Average RR1 | Average CRR5
LFLBP_{4,0°,16,7,0°,4} | 97 | 98 | 93.5 | 97 | 86 | 96.5 | 92.1 | 97.1
LFLBP_{4,0°,16,7,45°,4} | 96.6 | 97.3 | 93 | 97 | 86.5 | 96.5 | 92 | 96.9
LFLBP_{4,0°,16,7,0°,8} | 86.3 | 94.3 | 84.5 | 95 | 80.5 | 91 | 83.7 | 93.4
6.4 Light Field Histogram of Disparity Gradients Descriptor Parameters
The proposed LFHG descriptor, presented in Section 5.3.2, aims to exploit the light field
variations, both in terms of position and direction. It uses the central SA image to compute the
HOG descriptor, and four SA images, referred to by their positions in the SA multi-view array,
(u1, v1), (u2, v2), (u3, v3) and (u4, v4), to compute the LFHDG descriptor. Results in Section 6.3.1
show a clear performance improvement for light field based face recognition as the SA images’
disparity increases. Thus, the SA images selected for computing the LFHDG descriptor are those
at maximum distance from the central view, i.e., u1=15, v1=8, u2=1, v2=8,
u3=8, v3=15, u4=8, v4=1.
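The four positions listed above can be reproduced with a simple geometric sketch. It assumes a 15×15 SA multi-view array indexed from 1 with the central view at (8, 8), and assumes that views are placed on a circle of radius R around the central view, starting at angle A and spaced evenly; the function name and the rounding to the nearest view are illustrative choices, not the implementation used in this Thesis.

```python
import math


def select_views(center, radius, start_angle_deg, n_views):
    """Return n_views (u, v) positions at the given radius around the central view."""
    cu, cv = center
    views = []
    for k in range(n_views):
        # Evenly spaced angles, starting from the chosen starting angle.
        theta = math.radians(start_angle_deg + k * 360.0 / n_views)
        u = cu + round(radius * math.cos(theta))   # snap to the nearest SA view
        v = cv + round(radius * math.sin(theta))
        views.append((u, v))
    return views


# Assumed setting: central view (8, 8), maximum radius 7, A = 0 degrees, N = 4.
lfhdg_views = select_views(center=(8, 8), radius=7, start_angle_deg=0, n_views=4)
```

With these assumed settings the sketch returns the same four positions as listed above, in a different order. For starting angles that are not multiples of 90°, the rounding step means the selected views only approximate the requested radius.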
6.5 LSTM Hyper-Parameter Setting
The proposed recognition solutions using deep learning models combine the VGG-Face
descriptor with an LSTM network. The LSTM network has a number of hyper-parameters whose optimization is of
major importance for the final recognition performance; this is not the case for the VGG-Face
descriptor, as the pre-trained VGG-Face model can be directly used to extract descriptions from 2D
face images for face recognition. This section evaluates the impact of the LSTM hyper-parameter
setting, notably analyzing the influence of the LSTM hidden layer size, the batch size and the
number of epochs to consider for network convergence. Then the impact of the various proposed
SA image selection topologies and scanning methods is evaluated in terms of recognition accuracy.
The optimal recognition framework configuration will be used to test the proposed solutions based
on the combination of the VGG-Face descriptor with the conventional and the proposed light field LSTM
architectures. In the following subsections, the LSTM hyper-parameters are evaluated considering only
protocols 1 and 3, given the similarities between protocols 1 and 2.
6.5.1 Hyper-Parameter Evaluation: Hidden Layer Size
The study of recognition performance sensitivity to the size of the LSTM hidden layers is reported
first. Figure 6.3 illustrates the recognition performance at rank-1, RR1, for hidden layer sizes of 32,
64, 128, 256, and 512 for Protocol 1 (Figure 6.3-a) and Protocol 3 (Figure 6.3-b) validation sets,
after training with all the considered SA image selection methods. These results are reported
considering a batch size of 34 and 667 (1/3 of the input data), respectively for protocols 1 and 3,
and 50 epochs. These values were selected after some initial experimentation, which showed the
suitability of these values for network initialization.
The results show a clear improvement in the recognition performance as the hidden layer size is
increased up to 256. Larger LSTM hidden layer sizes do not further increase the recognition
accuracy, which even decreases slightly for a size of 512, possibly due to overfitting. This
shows that the LSTM tends to converge to a complex model that is not well captured when the
hidden layer size is too small.
Figure 6.3: Rank-1 recognition results versus hidden layer size considering all proposed SA
image selection methods for: (a) Protocol 1 and (b) Protocol 3.
6.5.2 Hyper-Parameter Evaluation: Batch Size
In theory, the batch size should be adjusted to have an accurate gradient estimation while avoiding
overfitting. Figure 6.4 illustrates the recognition performance for protocols 1 and 3 validation sets,
when considering between 2 and 6 batches, resulting in batch sizes of 50, 34, 25, 20 and 17 for
Protocol 1, and 1000, 667, 500, 400 and 333 for Protocol 3. Results are reported for 50 epochs,
after setting the hidden layer size to 256, the best size obtained in Section 6.5.1.
Figure 6.4: Rank-1 recognition results versus the batch size considering all proposed SA image
selection methods for: (a) Protocol 1, and (b) Protocol 3.
The results presented in Figure 6.4 show that using three batches, i.e., batch sizes of 34 and 667,
respectively for protocols 1 and 3, allows a good gradient estimation, leading to the best
recognition performance for almost all cases. It should be noted that since the LSTM inputs are
VGG face descriptions, the input dimension is very small, i.e., 4096, thus justifying the better
performance obtained by the large batch size selected for Protocol 3. It is also possible to observe
that mid-density SA image selection methods are more robust to changes in the number of batches,
when compared to the other SA image selection methods.
6.5.3 Hyper-Parameter Evaluation: Number of Training Epochs
The number of training epochs, which directly impacts the required training time, should be
minimized while guaranteeing network convergence. Figure 6.5 shows the recognition
performance for Protocol 1 (Figure 6.5-a) and Protocol 3 (Figure 6.5-b) validation sets when
varying the number of training epochs, after training with all the considered SA image selection
methods. Results are reported by setting the hidden layer size to 256 and the number of batches to
3, based on the conclusions from the previous sections.
The experimental results show that considering 40 and 130 training epochs, respectively for
protocols 1 and 3, leads to a stable performance in almost all cases. The recognition
performance remains almost constant for higher numbers of epochs. The network converges much
faster in Protocol 1 as the validation data is smaller. Hence, to keep a good trade-off between
accuracy and training time and also to keep the same framework configuration for both evaluation
protocols, the number of training epochs selected is 130.
Figure 6.5: Rank-1 recognition results versus number of training epochs considering all
proposed SA image selection methods for: (a) Protocol 1 and (b) Protocol 3.
6.5.4 SA Images Selection Evaluation
As discussed in Section 5.5.2, there are different options for selecting a SA image subset to be
processed by the VGG-Face descriptor and then scanned as a pseudo-video sequence so that their
angular dynamics can be learned by the LSTM network.
The results for the different SA image selection methods presented in Figure 6.5 show that, for the
high density SA images selection strategy, the snake-like scanning offers superior performance
over the row-major scanning, as it avoids the significant viewpoint feature discontinuities resulting
from moving from the right-most SA image in a row to the left-most SA image in the next row.
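The difference between the two scanning orders can be made concrete with a small sketch that lists the SA image positions visited by each scan and measures the largest jump between consecutive views. The function names are hypothetical and the 15×15 array size is an assumption used only for illustration.

```python
def row_major_scan(rows, cols):
    """Visit every row from left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def snake_scan(rows, cols):
    """Visit even rows left to right and odd rows right to left."""
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order


def max_step(order):
    """Largest Chebyshev distance between consecutive views in a scan."""
    return max(
        max(abs(r1 - r0), abs(c1 - c0))
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


# Assumed 15x15 SA multi-view array.
row_major_jump = max_step(row_major_scan(15, 15))
snake_jump = max_step(snake_scan(15, 15))
```

In the snake-like scan every consecutive pair of views is adjacent in the array, whereas the row-major scan jumps across the full array width at the end of each row, which corresponds to the viewpoint discontinuity mentioned above.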
It is also clear from Figure 6.5 that the mid-density SA image selection solutions, capturing full
angular information along the horizontal and vertical directions, achieve better average
performance when compared to the high- and low-density selection methods. Among the proposed
mid-density selection alternatives, the score-level fusion of horizontal and vertical angular
information leads to the best performance. The alternative of performing a single combined scan
implies a viewpoint feature discontinuity when moving from the last horizontal SA image (middle
row) to the first vertical SA image (top row) which leads to a worse performance.
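Score-level fusion of the horizontal and vertical scans can be sketched as below. The fusion rule is an assumption: a weighted average of the per-subject scores is used here for illustration, and the rule actually adopted in this Thesis may differ. The scores and subject labels are hypothetical.

```python
def fuse_scores(horizontal, vertical, weight=0.5):
    """Weighted average of two per-subject score dictionaries (assumed fusion rule)."""
    return {
        label: weight * horizontal[label] + (1.0 - weight) * vertical[label]
        for label in horizontal
    }


# Hypothetical classifier scores for one test sample and three enrolled subjects.
horizontal_scores = {"s1": 0.60, "s2": 0.30, "s3": 0.10}
vertical_scores = {"s1": 0.20, "s2": 0.70, "s3": 0.10}

fused = fuse_scores(horizontal_scores, vertical_scores)
predicted = max(fused, key=fused.get)   # subject with the highest fused score
```

Because each scan is processed as its own sequence and only the scores are combined, no artificial transition between the horizontal and the vertical views has to be learned.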
Based on the validation experiments described so far, the best configuration, to be used from this
point on for the system performance assessment using the Conv-LSTM and the proposed GLF-
LSTM, SLF-LSTM and SeqL-LSTM architectures, is summarized in Table 6.4.
Table 6.4: Selected configuration for the Conv-LSTM and the proposed GLF-LSTM, SLF-LSTM
and SeqL-LSTM architectures for face recognition.
Hyper-parameter / Image selection method | Setting
Hidden layer size | 256
Batch size | 1/3 of the input data
Number of training epochs | 130
SA image selection method | Mid-density horizontal and vertical, with score-level fusion
6.6 Face Recognition Accuracy
Once the values of the parameters and hyper-parameters of the proposed solutions have been
selected, their face recognition performance can be evaluated and compared to that of the selected
benchmarking solutions. The rank-1 recognition results obtained are reported in Table 6.5, Table
6.6 and Table 6.7, respectively for test protocols 1, 2 and 3 (see Section 6.2.1.2), using the best
configurations obtained in Sections 6.3, 6.4, and 6.5, for the seven proposed and the six
benchmarking recognition solutions introduced in Section 6.2.1.4. Keep in mind that each protocol
has a different classification model as each protocol uses a different training set. The results in
these tables are presented for five recognition tasks corresponding to the IST-EURECOM LFFD
database dimensions, and the best results are highlighted in bold.
Table 6.5: Protocol 1 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold).
Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average
PCA [65] | 2D | 28.50% | 28.00% | 6.67% | 12.50% | 16.33% | 17.40%
LBP [76] | 2D | 16.75% | 18.50% | 6.67% | 12.00% | 9.33% | 11.20%
HOG [119] | 2D | 57.50% | 58.50% | 9.83% | 48.00% | 38.33% | 36.60%
VGG-Face [38] | 2D | 99.50% | 99.00% | 56.33% | 99.00% | 74.67% | 79.00%
DLBP [120] | LF | 59.25% | 64.50% | 30.33% | 24.50% | 22.33% | 36.55%
MPCA [153] | LF | 36.75% | 33.50% | 7.50% | 14.50% | 19.67% | 20.30%
Prop. LFLBP | LF | 34.25% | 31.00% | 10.17% | 17.00% | 13.17% | 18.65%
Prop. LFHG | LF | 62.25% | 62.50% | 12.00% | 62.00% | 41.33% | 40.90%
Prop. VGG-D3 | LF | 99.50% | 99.00% | 56.50% | 99.00% | 75.50% | 79.50%
Prop. VGG + Conv-LSTM | LF | 99.25% | 99.50% | 71.67% | 99.00% | 91.17% | 88.55%
Prop. VGG + GLF-LSTM | LF | 99.50% | 100% | 71.33% | 99.00% | 92.00% | 88.75%
Prop. VGG + SLF-LSTM | LF | 99.75% | 100% | 72.33% | 99.00% | 92.00% | 89.10%
Prop. VGG + SeqL-LSTM | LF | 99.75% | 100% | 73.17% | 99.50% | 92.33% | 89.55%
Table 6.6: Protocol 2 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold).
Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average
PCA [65] | 2D | 66.00% | 63.50% | 16.50% | 46.50% | 36.00% | 40.70%
LBP [76] | 2D | 71.00% | 73.00% | 21.50% | 43.00% | 36.50% | 43.85%
HOG [119] | 2D | 81.50% | 77.50% | 21.00% | 70.50% | 64.00% | 58.25%
VGG-Face [38] | 2D | 91.75% | 86.00% | 82.00% | 90.00% | 63.67% | 79.65%
DLBP [120] | LF | 89.50% | 89.00% | 72.50% | 65.00% | 63.33% | 73.60%
MPCA [153] | LF | 68.50% | 68.50% | 20.50% | 32.00% | 41.00% | 42.85%
Prop. LFLBP | LF | 67.00% | 70.50% | 38.50% | 46.00% | 55.67% | 53.75%
Prop. LFHG | LF | 80.00% | 79.00% | 21.34% | 67.50% | 65.00% | 59.20%
Prop. VGG-D3 | LF | 97.25% | 93.00% | 86.34% | 96.00% | 72.33% | 85.95%
Prop. VGG + Conv-LSTM | LF | 98.50% | 99.00% | 92.00% | 98.00% | 83.00% | 91.95%
Prop. VGG + GLF-LSTM | LF | 98.75% | 99.50% | 93.17% | 98.50% | 84.00% | 92.80%
Prop. VGG + SLF-LSTM | LF | 98.75% | 99.50% | 93.50% | 98.50% | 85.17% | 93.15%
Prop. VGG + SeqL-LSTM | LF | 99.25% | 99.00% | 94.00% | 98.50% | 85.17% | 93.35%
Table 6.7: Protocol 3 assessment: Face recognition rank-1 for the proposed and benchmarking
recognition solutions (best results in bold).
Solution Name | Type | Neutral & Emotion | Action | Pose | Illumination | Occlusion | Average
PCA [65] | 2D | 53.00% | 65.00% | 56.66% | 65.00% | 56.66% | 49.80%
LBP [76] | 2D | 48.50% | 83.00% | 67.66% | 64.00% | 67.66% | 52.20%
HOG [119] | 2D | 51.50% | 96.00% | 84.66% | 75.00% | 84.66% | 64.60%
VGG-Face [38] | 2D | 93.50% | 97.00% | 97.00% | 95.00% | 97.00% | 92.90%
DLBP [120] | LF | 56.50% | 64.00% | 69.66% | 75.00% | 69.66% | 63.70%
MPCA [153] | LF | 48.00% | 89.00% | 65.00% | 63.00% | 64.66% | 50.30%
Prop. LFLBP | LF | 52.50% | 96.00% | 87.66% | 76.00% | 87.66% | 65.80%
Prop. LFHG | LF | 61.00% | 93.00% | 83.33% | 80.00% | 83.33% | 67.10%
Prop. VGG-D3 | LF | 94.00% | 98.00% | 98.00% | 97.00% | 98.33% | 97.40%
Prop. VGG + Conv-LSTM | LF | 100% | 100% | 96.33% | 100% | 98.66% | 98.60%
Prop. VGG + GLF-LSTM | LF | 100% | 100% | 97.33% | 100% | 98.66% | 98.80%
Prop. VGG + SLF-LSTM | LF | 100% | 100% | 98.00% | 100% | 98.33% | 98.90%
Prop. VGG + SeqL-LSTM | LF | 100% | 100% | 98.33% | 100% | 98.33% | 99.00%
Comparison of benchmarking face recognition solutions:
The results for the benchmarking face recognition solutions clearly show that the VGG-Face
descriptor [38] performs considerably better than all the other tested solutions, including PCA [65],
LBP [76], HOG [119], DLBP [120], and MPCA [153]. These results were expected as the current
state-of-the-art on face recognition is dominated by deep neural networks, and the VGG-16
descriptor has proved to be one of the most efficient CNN architectures for face recognition.
The next best performing benchmarking recognition solution is DLBP [120], due to the exploitation
of 2D texture as well as depth information for face recognition.
Comparison with benchmarking face recognition solutions:
Table 6.8, Table 6.9 and Table 6.10 compare the average rank-1 recognition results of the light
field face recognition solutions against those of the 2D benchmarking solutions, respectively
for test protocols 1, 2 and 3. These results demonstrate the superiority of the
light field based solutions when compared to the corresponding 2D baseline solutions, including:
i) PCA [65] against MPCA [153], ii) LBP [76] against LFLBP, iii) HOG [119] against
LFHG (HOG+HDG), and iv) VGG [38] against VGG + SeqL-LSTM.
Table 6.8: Protocol 1 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants.
2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain
PCA [65] | MPCA [153] | 17.40% | 20.30% | 2.90%
LBP [76] | Proposed LFLBP | 11.20% | 18.65% | 7.45%
HOG [119] | Proposed LFHG | 36.60% | 40.90% | 4.30%
VGG [38] | Proposed VGG + SeqL-LSTM | 79.00% | 89.55% | 10.55%
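The gain column in these tables is the difference, in percentage points, between the light field based average and the 2D average. A minimal sketch using the Table 6.8 values (the dictionary layout and variable names are illustrative):

```python
# (2D average, light field based average) in percent, copied from Table 6.8.
table_6_8 = {
    "PCA -> MPCA": (17.40, 20.30),
    "LBP -> LFLBP": (11.20, 18.65),
    "HOG -> LFHG": (36.60, 40.90),
    "VGG -> VGG + SeqL-LSTM": (79.00, 89.55),
}

# Gain in percentage points, rounded to two decimals as in the table.
gains = {
    pair: round(lf_average - base_average, 2)
    for pair, (base_average, lf_average) in table_6_8.items()
}
```

Note that these are absolute differences between recognition rates, not relative improvements.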
Table 6.9: Protocol 2 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants.
2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain
PCA [65] | MPCA [153] | 40.70% | 42.85% | 2.15%
LBP [76] | Proposed LFLBP | 43.85% | 53.75% | 9.90%
HOG [119] | Proposed LFHG | 58.52% | 59.20% | 0.68%
VGG [38] | Proposed VGG + SeqL-LSTM | 79.65% | 93.35% | 13.70%
Table 6.10: Protocol 3 average rank-1 recognition results for some 2D baseline solutions against
their light field based variants.
2D Solution | LF Based Solution | 2D Average | LF Based Average | Gain
PCA [65] | MPCA [153] | 49.80% | 50.30% | 0.50%
LBP [76] | Proposed LFLBP | 52.20% | 65.80% | 13.60%
HOG [119] | Proposed LFHG | 64.00% | 67.10% | 3.10%
VGG [38] | Proposed VGG + SeqL-LSTM | 92.90% | 99.00% | 6.10%
The average recognition gain clearly shows the added value of light field information for face
recognition purposes. The considerable gains obtained are, to a large extent, due to the
consideration of the angular information as expressed by the proposed light field based solutions,
which provides complementary information and discriminative power to the baseline solutions, as
shown by the improved performances.
Comparison of the proposed solutions:
The obtained rank-1 recognition results demonstrate the superiority of the proposed LFLBP and
LFHG solutions when compared against their corresponding 2D variants, i.e., LBP and HOG. The
results show that, as the LFLBP hand-crafted descriptor captures only the magnitude sign for the
spatial and angular information, its performance is lower (9.67% on average) than that of the proposed
LFHG descriptor that considers both the orientation and magnitude variations for the spatial and
angular information. This superiority is more evident for the most challenging protocol 1.
The proposed VGG-D3 fused deep descriptor achieves superior performance over the proposed
hand-crafted descriptors, i.e., LFLBP and LFHG, due to: i) adoption of a CNN for light field based
face description; and ii) fusion of the CNN descriptions for the 2D texture with disparity and depth
maps, which allows exploring the complementary information available in the light field. The
average performance gain regarding the baseline 2D VGG-Face descriptor is more than 3.76%,
showing the additional discriminative power of the proposed VGG-D3 fused deep descriptor.
However, the proposed VGG-D3 fused deep descriptor processes only light field central view data,
notably using its rendered texture, and corresponding disparity and depth maps, using a CNN
network. The results clearly show that the proposed VGG + Conv-LSTM double-deep descriptor
performs considerably better than the proposed VGG-D3 fused deep descriptor for all face
recognition tasks/protocols considered. This is due to: i) adoption of a double-deep learning
descriptor for light field face recognition; and ii) exploitation of the full spatio-angular information
available in a light field image. The proposed VGG + Conv-LSTM double-deep descriptor
achieved average performance gains of 9.18% and 5.42%, when compared to the baseline 2D
VGG-Face descriptor and the proposed VGG-D3 fused deep descriptor, respectively.
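The average gains quoted in this discussion can be reproduced from the "Average" columns of Tables 6.5, 6.6 and 6.7 by averaging the per-protocol differences over the three protocols. The sketch below assumes this is how the figures were obtained; the function and variable names are illustrative.

```python
# Average rank-1 rates (percent) for protocols 1, 2 and 3, from Tables 6.5 to 6.7.
averages = {
    "VGG-Face": (79.00, 79.65, 92.90),
    "VGG-D3": (79.50, 85.95, 97.40),
    "VGG + Conv-LSTM": (88.55, 91.95, 98.60),
    "LFLBP": (18.65, 53.75, 65.80),
    "LFHG": (40.90, 59.20, 67.10),
}


def mean_gain(better, baseline):
    """Mean per-protocol difference, in percentage points."""
    diffs = [b - a for b, a in zip(averages[better], averages[baseline])]
    return sum(diffs) / len(diffs)


lfhg_over_lflbp = mean_gain("LFHG", "LFLBP")
vggd3_over_vgg = mean_gain("VGG-D3", "VGG-Face")
conv_over_vgg = mean_gain("VGG + Conv-LSTM", "VGG-Face")
conv_over_vggd3 = mean_gain("VGG + Conv-LSTM", "VGG-D3")
```

Under this assumption the four values agree, to within rounding, with the 9.67%, 3.76%, 9.18% and 5.42% figures given in the text.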
The obtained results also show the superiority of the face recognition solutions based on the three
VGG + light field LSTM double-deep descriptors, notably GLF-LSTM, SLF-LSTM and SeqL-LSTM,
over the VGG + Conv-LSTM double-deep descriptor. The added value is more evident for
the more challenging protocols/tasks, including protocols 1 and 2 and the pose variation and
occlusion tasks, where the new joint learning of the light field horizontal and vertical parallaxes,
leading to richer descriptions, contributes to improving the final recognition performance.
Additionally, the average rank-1 recognition performance obtained for the three evaluation
protocols shows that the proposed solutions based on VGG + light field LSTM double-deep
descriptors are less sensitive to the number of training samples and the presence of facial
variations, when compared to the other solutions. The much improved face recognition results
under illumination variations illustrate the robustness of the proposed solutions based on light field
LSTM double-deep descriptors to illumination changes, highlighting the importance of exploiting
the angular information, which is invariant to the intensity changes resulting from different
illumination levels during the data acquisition process.
Finally, the performance results show that the proposed recognition solution based on SeqL-LSTM
works slightly better than the solutions based on the other proposed LSTM cell architectures,
i.e., GLF-LSTM and SLF-LSTM, as it establishes a learning interaction between the vertical and
horizontal weights when updating the cell state, thus providing a better angular description.
6.7 Ear Recognition Accuracy
Ear recognition performance assessment is performed using the IST-EURECOM LLFEDB
database and considering the cross-session scenario discussed in Section 6.2.2.2, to compare the
proposed hand-crafted based ear recognition solutions with the state-of-the-art benchmarking
solutions. Table 6.11 reports the obtained recognition results in terms of CRR1 up to CRR3 (in
percentage). Additionally, to have a more precise performance analysis, Figure 6.6 includes the
cumulative recognition rank curves (up to CRR50) for the proposed and the four best performing
benchmarking solutions reported in Table 6.11.
Table 6.11: Ear recognition CRR1 up to CRR3 for the proposed recognition and benchmarking
solutions (best results in bold).
Solution | CRR1 | CRR2 | CRR3
LPQ | 80.4% | 86.0% | 87.9%
BSIF | 78.0% | 85.6% | 87.7%
LBP | 76.7% | 83.4% | 85.6%
ULBP | 75.6% | 82.8% | 86.2%
POEM | 75.6% | 79.9% | 83.0%
RILPQ | 71.6% | 80.2% | 83.8%
Gabor | 66.6% | 72.2% | 76.3%
HOG | 82.3% | 88.4% | 90.7%
Proposed LFLBP | 81.9% | 87.1% | 90.1%
Proposed LFHG | 88.2% | 90.9% | 92.9%
Figure 6.6: Ear recognition cumulative recognition rank curves (up to CRR50) for the proposed
recognition and best performing benchmarking solutions.
Comparison of benchmarking ear recognition solutions:
Among the 2D benchmarking ear recognition solutions, the HOG descriptor shows the best
performance due to: i) consideration of both orientation and magnitude variations, providing a
more comprehensive description of the ear; and ii) use of overlapping blocks. Exploiting
overlapping blocks is beneficial as the ears may be cropped from slightly different positions,
leading to slight misalignments between the ear images registered in the database and the newly
acquired ear images. Consideration of overlapping blocks helps to reduce the misalignment
impact.
Comparison with benchmarking ear recognition solutions:
The proposed ear recognition solution based on LFHG descriptor achieves the best recognition
performance, thanks to: i) the exploitation of the spatial and angular information available in light
field images, providing complementary information for ear recognition; ii) consideration of both
orientation and magnitude variations for the spatial and angular information; and iii) exploitation
of overlapping blocks to compensate for misalignment. The angular/disparity information
represents the changes in light distribution bringing additional information for ear recognition; by
expressing changes in the ear surface, it captures more information about the ear structure and
geometrical information about the position, depth and shape of the ear components. In summary,
the good results obtained are due to the joint exploitation of the spatial and angular information,
as expressed by the proposed fused descriptor.
Comparison of the proposed solutions:
Concerning the proposed LFLBP descriptor, as it captures only the magnitude sign for the spatial
and angular information, its performance is lower than the proposed LFHG descriptor.
Nevertheless, the results show that LFLBP performance is superior to its 2D variants, LBP and
ULBP, due to the exploitation of light field angular information on top of the spatial information.
It should be noted again that, while the amount of training data available remains insufficient,
deep learning based ear recognition solutions may not offer superior performance over local
description based solutions, thus justifying the absence of deep learning based solutions in the
benchmarking study.
Part III. Light Field Based Face and Ear Presentation Attack Detection
Chapter 7
State-of-the-Art on Face Presentation Attack Detection
7.1 Introduction
The widespread use of biometric recognition applications raises new security concerns [15],
making the robustness against presentation attacks a very active field of research [11]. Presentation
Attack Detection (PAD) solutions aim to automatically detect the presentation of artefacts to
acquisition sensors. The presentation of attack samples can be done using a Presentation Attack
Instrument (PAI) which is defined as an artificial object or representation presenting a copy of
biometric characteristics or synthetic biometric patterns, for instance printed photos, electronic
devices displaying a face or ear, or silicone masks [11]. PAIs can be classified in terms of their
attack potential, an attribute expressing the effort expended in the preparation and execution of the
attack in terms of elapsed time, expertise, knowledge about the capture device being attacked, and
equipment. The attack potential can be graded as “minimal”, “basic”, “enhanced-basic”,
“moderate” or “high” [11]. Among the different PAIs, mask attacks, and especially attacks using
thin silicone masks, have a higher attack potential than other types, but these masks are not easy to
obtain.
Several PAD solutions have been presented for face [12] [13] [14], fingerprint [207], and iris
recognition [208]. To better understand the technological landscape in this area, this chapter
proposes a new, more encompassing taxonomy of face PAD solutions, which is then used to guide
a review of existing PAD solutions for face biometrics. Additionally, this chapter reviews existing
face artefact databases, which are instrumental for designing, testing and validating face PAD
solutions.
As for other biometric modalities, ear PAD poses challenges, which had not yet been
addressed at the time of the writing of the Thesis. Hence, there are no ear artefact databases or ear
PAD solutions to be reviewed in this chapter, which is why its title
mentions only ‘face’ and not also ‘ear’. Nevertheless, some face PAD solutions, notably those
that do not rely on specific facial characteristics, can potentially also be applied to detect ear
presentation attacks.
7.2 Proposed Face Presentation Attack Detection Taxonomy
This Thesis proposes a taxonomy to organize the face PAD solutions according to four main
dimensions, notably user interaction support, imaging sensor, contextual information and feature
extraction - see Figure 7.1. The different types of presentation attacks are not considered as a
dimension in the taxonomy, as the available PAD solutions are typically not developed to address
a specific type of attack, but rather try to efficiently detect all of them, since it would be unwise to
assume a single specific type of attack.
Figure 7.1: Proposed taxonomy for face PAD solutions.
The four selected taxonomy dimensions are:
User interaction support – This dimension is related to the level of interaction supported with
the user in the relevant application scenario. When face recognition is used to grant access to
sensitive information or facilities, the user may be willing to undergo a more thorough identity
check. In such scenarios, face PAD solutions often employ the so-called challenge response
strategy for liveness detection, e.g. by analyzing the user’s response to external commands or
stimuli. Such responses can be voluntary, e.g., when the user is asked to look left or close the
eyes, or involuntary, e.g., as a reaction to unexpected luminous or acoustic stimuli.
Imaging sensor – This dimension is related to the selected sensor and, thus, the type of
information that can be exploited for PAD. Typically, 2D RGB cameras are used,
but the recent availability of richer imaging sensors is opening new possibilities for designing
improved face PAD solutions. Light field cameras [32] [33] [34], near infra-red (NIR) cameras
[22] [23], thermal cameras [209] [210] [211], stereo cameras [212] [213] and depth sensors
[20] [21] have recently been used for detecting presentation attacks.
Contextual information – This dimension is related to the possibility to use contextual
information, for instance including background and scenic cues, to detect the presentation
attacks [214] [215] [216]. This is possible in application scenarios where the image acquisition
is not performed with a very limited field of view and the PAD solutions do not have to
concentrate only on the (cropped) face region.
Feature extraction – This dimension is related to the feature extraction methods adopted for
PAD. A first key distinction is between dynamic methods, which use video, and static methods,
based on image analysis. Feature extraction can then use cues derived from texture, quality,
depth/focus or learning methods. Dynamic methods can additionally explore motion. Texture
based methods can exploit the textural patterns to detect presentation attacks. Quality based
methods typically explore changes in the attack images’ quality characteristics to distinguish
bona fide faces from those captured from PAIs. Learning based methods derive features by
modelling and learning relationships from images in view of distinguishing bona fide from
artefact samples. Depth/focus based methods explore changes either in depth or focus
information between the images captured/rendered at different focus planes. Finally, motion
based methods can explore the voluntary or involuntary movements of the head, mouth, or
eyes to detect presentation attacks. It is also not uncommon to find face PAD methods
combining two or more feature extraction methods [12] [13] [14].
In addition to the dimensions considered in the proposed PAD taxonomy, PAD solutions are
expected to work in combination, and possibly in synergy, with some existing face recognition
system [11] [12]. If the face PAD and face recognition systems are running in parallel,
independently of each other, then any samples flagged as suspicious by the face PAD system
should be further investigated; this can even be done manually by a human operator. The automatic
integration of both systems will typically follow a sequential approach, with face PAD only
passing bona fide images as input to the recognition system. Alternatively, when the two systems
run in parallel, some fusion strategy may be used to integrate the obtained results. If an integrated
system for face PAD and face recognition is being developed, then it is expected that some
modules will be shared by both sub-systems. For instance, it may be possible to use contextual
cues or share feature extraction methods, notably if the implementation targets a platform with
limited computational resources.
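The two integration strategies described above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed systems: the function names, the thresholds, the convention that higher scores indicate a bona fide or matching sample, and the minimum rule used for the parallel case are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical scorer type: maps a sample to a score in [0, 1].
# Assumption: higher PAD score = more likely bona fide,
# higher match score = more likely the claimed identity.
Scorer = Callable[[object], float]


def sequential_decision(sample, pad_score: Scorer, match_score: Scorer,
                        pad_threshold: float = 0.5,
                        match_threshold: float = 0.5) -> str:
    """Sequential integration: PAD acts as a gate before recognition."""
    if pad_score(sample) < pad_threshold:
        return "attack"  # the recognition system never sees this sample
    if match_score(sample) >= match_threshold:
        return "accept"
    return "reject"


def parallel_decision(sample, pad_score: Scorer, match_score: Scorer,
                      threshold: float = 0.5) -> str:
    """Parallel integration: both systems run independently and their
    scores are fused; here with a minimum rule (one possible choice)."""
    fused = min(pad_score(sample), match_score(sample))
    return "accept" if fused >= threshold else "reject"
```

In the sequential variant the recognition module is only invoked for samples that pass PAD, which may matter on platforms with limited computational resources; the parallel variant always evaluates both scores.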
7.3 Face Artefact Databases
The artefact databases play a very important role for designing, testing and validating face PAD
solutions, while ensuring the reproducibility of performance results and their fair comparison.
Since this Thesis is focusing on the added value of light field images for PAD, the remainder of this
section will review face artefact databases organized around the exploitation or not of light field
cameras and data.
7.3.1 Non-Light Field Face Artefact Databases
The main characteristics of the publicly available non-light field face artefact databases described
in the literature are summarized in Table 7.1, including the types of PAIs addressed, as shown in
Figure 7.2, with the databases sorted according to their release date.
Table 7.1: Overview of publicly available non-light field face artefact databases.
Database Name | Release Year | No. of Subjects | No. of Images/Videos | Type of Content | Type of attack (Paper, Wrapped Paper, Mobile, Tablet, Laptop, 3D Mask)
NUAA [217] | 2010 | 15 | 58 | 2D |
Print-Attack [218] | 2011 | 50 | 1200 | 2D |
REPLAY-ATTACK [219] | 2012 | 50 | 1200 | 2D |
CASIA [220] | 2012 | 50 | 650 | 2D |
3DMAD [20] | 2013 | 17 | 255 | 2D+depth |
Multi-Spectral DB [22] | 2014 | 100 | 200 | 2D (NIR) |
MSU MFSD [221] | 2015 | 55 | 440 | 2D |
MS-Face [23] | 2016 | 21 | 450 | NIR |
Oulu-NPU [222] | 2017 | 55 | 4940 | 2D |
SMAD [223] | 2017 | N/A | 130 | 2D |
MLFP [224] | 2017 | 10 | 1350 | VIS, NIR, Thermal |
Figure 7.2: Illustration of different types of PAIs.
The genesis of some of the artefact databases was partly motivated by the availability of new
imaging sensors. For instance, the 3D Mask Attack Database (3DMAD) [20] was recorded using
the Microsoft Kinect sensor, and the Multi-Spectral [22] and MS Face [23] databases were
recorded with Near Infra-Red (NIR) cameras, to support the study of presentation attacks on face
recognition systems using those sensors and associated content.
Among the available face artefact databases, a few consider the importance of 3D
masks (hard/latex/silicone) for face PAD. 3DMAD [20] contains hard mask samples from 17
subjects, with masks provided by thatsmyface.com [225]; the masks used are not of
very high quality. Two other face artefact databases have recently been proposed considering the
usage of silicone and latex mask attack samples for face PAD. The Silicone Mask Attack Database
(SMAD) [223] consists of samples of a person wearing a silicone mask, which can be used to perform attacks
in highly sensitive security scenarios, such as semi-supervised border control scenarios. The
Multispectral Latex Mask based Video Face Presentation Attack (MLFP) database [224] contains
latex mask attack samples that are captured in different scenarios; the acquisition has been done in
three different spectral bands: visible, NIR and thermal.
7.3.2 Light Field Face Artefact Databases
Table 7.2 provides details of the only publicly available light field face artefact database,
describing its main characteristics, including the attack types addressed. For comparison, also the
characteristics of the IST Lenslet Light Field Face Spoofing Database (IST LLFFSD) [43]
proposed in this Thesis, see Section 8.2, are included in Table 7.2.
Table 7.2: Overview of publicly available light field artefact databases.
Database Name | Year | No. of Subjects | No. of Images | Type of Content | Type of attack (Paper, Wrapped Paper, Mobile, Tablet, Laptop, 3D Mask)
GUC-LiFFAD [33] | 2015 | 80 | 4826 | 2D rendered from LF |
Prop. IST LLFFSD | 2018 | 50 | 700 | Raw LF image; 4D MV SA array; 2D rendered |
The GUC Light Field Face Artefact Database (GUC-LiFFAD) [33] was the first available face
artefact database acknowledging the importance of light field imaging sensors for face PAD. It
includes a set of 2D greyscale images, using printed paper and tablet PAIs, as illustrated in Figure
7.3, focused at different depths, rendered from the light field data acquired using a first generation
Lytro lenslet camera; however, the database does not include the raw light field images, which is
a limitation. This database can be useful for testing and validating face PAD solutions, notably
exploiting the a posteriori refocusing supported by light field imaging.
Figure 7.3: Illustration of GUC-LiFFAD face artefact acquisition [33].
7.4 Non-Light Field Based Face PAD Solutions
Existing non-light field based face PAD solutions are here briefly reviewed according to the
proposed taxonomy. Table 7.3 summarizes the main characteristics of a selection of recent,
representative and relevant PAD solutions, sorted according to their publication date. Additionally,
this table includes information about the used color space, classifier, fusion level, test databases,
and types of attack considered in these solutions. For face PAD solutions combining multiple
features or multiple feature extraction methods, fusion can be done at: i) feature level, usually
concatenating features into a single vector for classification; and ii) score level, combining the
classifier outputs [80]. The solutions summarized in Table 7.3 are briefly reviewed in the
following, grouped based on the feature extraction types considered in the taxonomy.
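The difference between the two fusion levels can be sketched as follows. The function names and the equal default weights are assumptions made for illustration; the reviewed solutions use their own descriptors, classifiers and fusion rules.

```python
def feature_level_fusion(feature_vectors):
    """Feature level fusion: concatenate the feature vectors produced by
    several descriptors into a single vector for one classifier."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def score_level_fusion(scores, weights=None):
    """Score level fusion: combine the outputs of several classifiers,
    here with a weighted sum (equal weights unless given)."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```

With feature level fusion a single classifier is trained on the concatenated vector, whereas with score level fusion one classifier is trained per feature vector and only their outputs are combined.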
7.4.1 Texture Based Methods
Static texture based methods exploit the textural patterns in images, usually using hand-crafted
descriptors, to detect presentation attacks [226]. Määttä et al. used multi-scale Local Binary
Patterns (LBP) to form a feature vector by concatenating local histograms extracted from
overlapping micro-textures, classified using a Support Vector Machine (SVM) classifier [227].
The same authors added Gabor wavelet (GW) features and Histogram of Oriented Gradients (HOG)
hand-crafted descriptors to the multi-scale LBP descriptor, using score level fusion to combine
individual SVM outputs [228].
Kose and Dugelay used a rotation invariant LBP descriptor together with a pre-processing step of
Difference of Gaussians (DoG) filtering, followed by a classifier using a chi-square dissimilarity
metric [229]. Waris et al. fused the features extracted by a rotation invariant uniform LBP
descriptor, GW, and Grey-Level Co-occurrence Matrices (GLCM) and used SVM and Partial
Least Square (PLS) regression for classification [230]. Raghavendra and Busch proposed a PAD
scheme exploring both global face structure and face component regions using LBP and Binarized
Statistical Image Features (BSIF) descriptors; score level fusion of two SVM classifiers computed
over two feature vectors is used [231]. Erdogmus and Marcel evaluated the performance of
different LBP based descriptions including conventional LBP, modified LBP, transitional LBP
and direction-coded LBP, using Linear Discriminant Analysis (LDA), Chi-square, and SVM
classifiers, showing the superiority of LBP with LDA classifier to detect 3D mask attacks [21]. Yi
et al. proposed a multi-spectral face PAD system utilizing GW descriptions extracted on 76 facial
landmarks together with a linear SVM classifier, operating in the visible and NIR spectra [22].
Hadid et al. analysed facial image textures using LBP and GLCM descriptors, using logistic
regression classifiers and score-level fusion [232]. Arashloo et al. proposed a solution based on
the fusion of multiscale BSIF and multiscale Local Phase Quantization (LPQ) descriptors, using a
Specific Kernel Discriminant Analysis (S-KDA) [233]. In one of the more recent and promising
works, Boulkenafet et al. exploited the joint color texture information extracted by LPQ and the
co-occurrence of adjacent local binary patterns descriptors, concluding that using color rather than
greyscale is advantageous for PAD [234]. The same authors proposed a solution to describe the
facial appearance by speeded-up robust descriptions extracted over the HSV and YCbCr color
spaces using a softmax classifier [235]. Peng et al. proposed a solution based on guided scale LBP
and Local Guided Binary Pattern descriptors, concatenating features and using a linear SVM
classifier [236].
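Since most of the static texture based solutions above build on LBP histograms, a minimal sketch of the basic 8-neighbour LBP operator may help. It is a simplified illustration, not the multi-scale, rotation invariant or color variants used in the cited works; the function names are hypothetical, images are assumed to be plain lists of rows of grey levels, and the chi-square formula shown is one common form of that dissimilarity.

```python
def lbp_codes(image):
    """Basic 3x3 LBP: each interior pixel gets an 8-bit code with one bit
    per neighbour, set when the neighbour is >= the centre pixel."""
    # Neighbour offsets, clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """Normalized 256-bin histogram of the LBP codes of an image."""
    hist = [0] * 256
    total = 0
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
            total += 1
    return [h / total for h in hist] if total else hist


def chi_square_distance(hist_a, hist_b):
    """Chi-square dissimilarity between two histograms."""
    return sum((a - b) ** 2 / (a + b)
               for a, b in zip(hist_a, hist_b) if a + b > 0)
```

A feature vector in the spirit of the reviewed methods would concatenate such histograms computed over several image blocks or scales before classification.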
Dynamic texture based methods exploit the textural patterns in videos to detect presentation attacks
[237] [238]. Pereira et al. proposed a spatio-temporal texture based solution using an LBP from
three orthogonal planes descriptor to consider both spatial and temporal information, followed by
an SVM classifier [239]. Bharadwaj et al. proposed two dynamic texture based solutions using
a Dynamic Multi-scale LBP descriptor together with SVM classifiers, and a histogram of oriented
optical flows with an LDA classifier, respectively [240]. Pinto et al. proposed a solution based on
low level time-spectral descriptors to exploit spectral and temporal information, testing several
classifiers, with SVM achieving the best performance [241]. Phan et al. considered a local
derivative pattern from three orthogonal planes descriptor to exploit temporal and spatial
information in different directions of face movements with an SVM classifier [242].
7.4.2 Quality Based Methods
Quality based methods typically explore changes in the attack images’ quality characteristics to
distinguish bona fide faces from those captured from 2D reproductions; examples include the loss
of sharpness and detail, blurriness, and differences in light distribution. This means the quality of
the samples captured by the PAD system can be exploited to detect attacks. Zhang et al. proposed
a solution able to learn multispectral reflectance distributions and analyse the faces based on a
Lambertian model to select the two most discriminative wavelengths for attack detection; finally,
an SVM classifier is trained to learn the multispectral distribution [243]. Kose and Dugelay analysed
the reflectance characteristics of masks and real faces [244]. The input image is decomposed into
illumination and reflectance components and the gradient of reflectance components is considered
to define a feature vector. Finally, a linear SVM classifier is applied to detect mask attacks.
Galbally et al. proposed to use 14 full-reference quality assessment metrics classified into three
different classes, notably: i) pixel difference; ii) correlation based; and iii) edge based measures.
The metrics are then combined to form a feature vector to be fed into SVM classifiers [245]. In
[246], the same authors added some full-reference and no-reference image quality measures, such as
spectral distance measures, gradient based measures, structural similarity measures and
information theoretic measures, and the detection method was extended to presentation attack
detection in iris, fingerprint and face recognition. Wen et al. proposed an image distortion analysis
based solution, exploiting specular reflection, blurriness, chromatic moments, and color diversity,
together with multiple SVM classifiers, and trained for different face presentation attacks [221].
Agarwal et al. proposed to use 13 Haralick features [247], computed over non-overlapping patches
for each color channel, to be fed into an SVM classifier for detecting face presentation attacks [248].
Finally, Bhogal et al. used no-reference image quality assessment measures for face PAD [249].
The feature vector includes a natural image quality metric, blind image integrity notator using
DCT statistics, blind image quality assessment through anisotropy, blind/reference-less image
spatial quality metric, distortion identification based image verity and integrity metric, and blind
image quality index metric, using SVM classifiers.
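As an illustration of the pixel difference class of full-reference measures mentioned above, the following sketch computes the mean squared error and the peak signal-to-noise ratio between an image and a reference. It is only a sketch under stated assumptions: the function names are hypothetical, images are plain lists of rows of grey levels with a peak value of 255, and how the reference image is obtained in the cited works is not covered here, so it is simply passed in by the caller.

```python
import math


def mse(image, reference):
    """Mean squared error between two equally sized grey-level images."""
    total = 0.0
    count = 0
    for row_img, row_ref in zip(image, reference):
        for a, b in zip(row_img, row_ref):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(image, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(image, reference)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / error)
```

Several such measures can be stacked into one feature vector, which is then passed to a classifier as described above.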
Table 7.3: Overview of non-light field face PAD solutions.
Ref. | Release Year | User Inter. Support | Imaging Sensor | Contextual Inf. | Feature Extraction Type | Feature Extraction Sub-Type | Color Space | Class. | Fusion level | DB | Type of attack
[227] | 2011 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Feature | NUAA | Paper
[243] | 2011 | No inter. | RGB | Full | Static | Quality based | Gray | SVM | --- | Private | Paper, laptop
[250] | 2012 | No inter. | RGB | Cropped | Dynamic | Motion based | Gray | Logistic | Score | Print-Att. | Paper
[228] | 2012 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Score | NUAA, Print-Att. | Paper, monitor
[229] | 2012 | No inter. | RGB | Full | Static | Texture based | Gray | Chi-square | --- | NUAA | Paper
[239] | 2012 | No inter. | RGB | Cropped | Dynamic | Texture based | Gray | SVM | Feature | REPLAY | Paper, tablet
[230] | 2013 | No inter. | RGB | Full | Static | Texture based | Gray | SVM, PLS | Feature | REPLAY | Paper, tablet
[251] | 2013 | No inter. | RGB | Cropped | Dynamic | Motion based | Gray | MLP | Score | REPLAY | Paper, tablet
[252] | 2013 | No inter. | RGB | Full | Static | Depth/focus based | RGB | Plane equation | --- | Private | Paper
[240] | 2013 | No inter. | RGB | Cropped | Dynamic | Texture based | RGB | SVM, LDA | Feature | Print-Att., REPLAY | Paper, tablet
[244] | 2013 | No inter. | RGB+Dep. | Cropped | Static | Quality based | Gray | SVM | --- | Private | Mask
[253] | 2013 | Voluntary | RGB | Cropped | Dynamic | Motion based | Gray | kNN | Feature | Private | Paper, tablet
[245] | 2014 | No inter. | RGB | Full | Static | Quality based | Gray | LDA, QDA | Feature | REPLAY | Paper, tablet, mobile
[246] | 2014 | No inter. | RGB | Full | Static | Quality based | Gray | LDA | Feature | CASIA, REPLAY | Paper, tablet, mobile
[254] | 2014 | No inter. | RGB | Full | Static | Depth/focus based | Gray | Naïve Bayes | --- | Private | N/A
[255] | 2014 | Voluntary | RGB | Cropped | Dynamic | Motion based | Gray | SVM | Feature | Print-Att. | Paper
[231] | 2014 | No inter. | RGB, RGB+Dep. | Cropped | Static | Texture based | Gray | SVM | Score | CASIA, 3DMAD | Paper, mask
[21] | 2014 | No inter. | RGB+Dep. | Cropped | Static | Texture based | RGB | LDA | Score | 3DMAD | Mask
[22] | 2014 | No inter. | RGB, NIR | Full | Static | Texture based | RGB, Gray | SVM | Feature | Multi-Spectral DB | Paper
[256] | 2014 | Voluntary | RGB | Cropped | Dynamic | Motion based | Gray | kNN | --- | Private | N/A
[257] | 2014 | No inter. | RGB | Full | Static | Learning based | RGB | SVM | --- | CASIA, REPLAY | Paper, tablet, mobile
[232] | 2015 | No inter. | RGB | Cropped | Static | Texture based | RGB | Regression | Score | REPLAY | Paper, tablet
[233] | 2015 | No inter. | RGB | Full | Static | Texture based | Gray | S-KDA | Score | REPLAY, CASIA, NUAA | Paper, wrapped paper, tablet
[258] | 2015 | No inter. | RGB | Cropped | Static | Depth/focus based | Gray | SVM | Feature | Private | Paper
[221] | 2015 | No inter. | RGB | Cropped | Static | Quality based | RGB | SVM | Feature | NUAA, REPLAY, CASIA | Paper, tablet, mobile
[259] | 2015 | No inter. | RGB | Cropped | Dynamic | Learning based | Gray | Softmax layer | --- | CASIA | Paper, wrapped paper, tablet
[260] | 2015 | No inter. | RGB | Full | Static | Learning based | Gray | SVM | --- | REPLAY, 3DMAD | Paper, tablet, mobile, mask
[241] | 2015 | No inter. | RGB, RGB+Dep. | Full | Dynamic | Texture based | RGB | SVM | Feature | CASIA, REPLAY, 3DMAD | Paper, tablet, mobile, mask
[248] | 2016 | No inter. | RGB, RGB+Dep. | Cropped | Static | Quality based | RGB | SVM | Feature | 3DMAD, CASIA, MSU | Paper, wrapped paper, tablet, mobile, mask
[234] | 2016 | No inter. | RGB | Cropped | Static | Texture based | HSV, YCbCr | SVM | Feature, score | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[235] | 2016 | No inter. | RGB | Cropped | Static | Texture based | HSV, YCbCr | Softmax Regression | Feature | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[23] | 2016 | No inter. | RGB, NIR | Full | Static | Texture based | RGB, Gray | SVM | Feature | MS-Face | Paper
[242] | 2016 | No inter. | RGB | Cropped | Dynamic | Texture based | RGB | SVM | Feature | MSU, REPLAY, CASIA | Paper, tablet, mobile
[261] | 2016 | No inter. | RGB | Full | Static | Learning based | RGB | SVM | --- | CASIA, REPLAY | Paper, tablet, mobile
[262] | 2016 | No inter. | RGB | Full | Dynamic | Learning based | RGB | Softmax | --- | CASIA, 3DMAD | Paper, tablet, mask
[236] | 2017 | No inter. | RGB | Cropped | Static | Texture based | HSV, YCbCr | SVM | Feature | REPLAY, CASIA, MSU | Paper, wrapped paper, tablet, mobile
[263] | 2017 | No inter. | RGB | Cropped | Static | Learning based | Gray | Softmax | --- | NUAA, REPLAY | Paper, tablet, mobile
[264] | 2017 | Involuntary | RGB | Full | Dynamic | Motion based | YCbCr | Voting | --- | Private | N/A
[249] | 2017 | No inter. | RGB | Full | Static | Quality based | RGB | SVM | Feature | REPLAY | Paper, tablet, mobile
7.4.3 Learning Based Methods
Learning based methods derive features by modelling and learning relationships from images in
view of distinguishing bona fide samples from attack attempts. In particular, the usage of CNNs
has been growing very fast for face PAD since 2014 [257]. CNNs can support both feature
extraction and classification, or they can be used only for feature extraction and combined with
different classifiers, such as SVM. Examples of CNNs supporting only feature extraction include:
i) the canonical CNN structure proposed by Yang et al., which includes five convolutional and
three fully connected layers, followed by an SVM classifier [257]; ii) a deep learning solution
proposed by Menotti et al., including a conventional CNN trained with back-propagation to
learn the weights of the network and using SVM classifiers [260]; and iii) a CNN proposed by Li
et al. to learn features based on the pre-trained VGG-Face CNN [38], followed by principal
component analysis for dimensionality reduction and a linear SVM for classification [261].
CNN examples supporting both feature extraction and classification include: i) the solution by Xu et
al., which extends a CNN with two convolutional layers, one fully connected layer and a softmax layer
with a new layer, called long short term memory, placed after the fully connected layer of the CNN
architecture, for learning and extracting the temporal structure across the video [259]; ii) the solution
by Feng et al., which combines a shearlet based image quality feature and two types of motion features
using a hierarchical neural network containing an input layer, two hidden layers and a softmax layer for
classification [262]; and iii) the solution by Alotaibi et al., which uses nonlinear diffusion to preserve
depth and edge information, together with a deep CNN with five convolutional and subsampling layers,
trained using stochastic gradient descent to extract the discriminative high-level features for face PAD,
and a softmax layer for classification [263].
7.4.4 Focus/Depth Based Methods
Focus/depth based methods explore changes either in depth or focus information between the
images captured/rendered at different focus planes. Kim et al. used the power histogram and
gradient location features to compare images focused at two different planes, i.e., nose and ears,
classifying the resulting degree of blurriness with a Sum-Modified-Laplacian (SML) method
[252]. Yang et al. proposed a face PAD method by investigating the focus distance between the
face and the background [254]. In this context, the degree of blurriness of the face and the
background are considered to detect attacks. Kim et al. concatenated three different features
extracted from two images sequentially captured at different focuses to be fed into an SVM classifier
[258].
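The degree of blurriness used by focus based methods can be quantified with a focus measure such as the Sum-Modified-Laplacian mentioned above. The sketch below follows the usual definition of that measure with a step of one pixel; the function name, the list-of-rows image representation and the optional threshold are assumptions for illustration, and it is not the implementation used in the cited works.

```python
def sum_modified_laplacian(image, threshold=0.0):
    """Sum-Modified-Laplacian focus measure of a grey-level image.

    For every interior pixel, the absolute second differences along the
    horizontal and vertical directions are added; values reaching the
    threshold are accumulated. Sharper images give larger values.
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            horizontal = abs(2 * image[r][c] - image[r][c - 1] - image[r][c + 1])
            vertical = abs(2 * image[r][c] - image[r - 1][c] - image[r + 1][c])
            modified_laplacian = horizontal + vertical
            if modified_laplacian >= threshold:
                total += modified_laplacian
    return total
```

Applied to the same face region in two images focused at different planes, the two values indicate how much sharpness changes with focus.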
7.4.5 Motion Based Methods
Motion patterns of the face or facial landmarks can reveal valuable information for face PAD,
allowing for instance to analyze responses when the user interaction dimension is being explored,
or to better analyze contextual information when exploring that taxonomy dimension. Dynamic
motion based methods can explore the voluntary or involuntary movements of the head [265]
[266] [267], mouth [268], or eyes [251] [255] [269] [270] to detect presentation attacks. Examples
exploring contextual information include: i) Yan et al. performed a foreground–background
consistency analysis in both spatial and temporal domains using a logistic classifier to detect
presentation attacks [250]; ii) Komulainen et al. measured the temporal correlations between the
background and the user’s head, using a multilayer perceptron (MLP) classifier [251]; and iii) Anjos
et al. performed foreground/background motion correlation using optical flow, with a binary SVM
classifier [251].
Some examples where motion is explored to evaluate the challenge responses include: i) Ali et al.
proposed a gaze tracking solution in response to an external challenge, testing various fusion
schemes and using a k-nearest neighbour (kNN) classifier [253]; ii) Cai et al. exploited gaze
estimation in a challenge-response framework, using a kNN classifier [256]; and iii) Killioğlu et
al. tracked the eyes with a Kanade-Lucas-Tomasi (KLT) tracker, in response to an external
challenge, and used a voting scheme to distinguish bona fide users from attack attempts [264].
7.5 Light Field Based Face PAD Solutions
Excluding the solutions proposed in this Thesis, three face PAD solutions exploiting the richer
light field information have been proposed in the literature. Following the taxonomy illustrated in
Figure 7.1, the main characteristics of those three light field based PAD solutions [32] [33] [34]
are summarized in Table 7.4. Additionally, this table includes some information about the
distinctive characteristics of the light field sensors (see Section 2.5) and imaging representations
considered, such as format, color space, classifier, fusion level, test databases, and type of attacks.
For comparison, the characteristics of the two light field based face PAD solutions [42]
[43] proposed in this Thesis are also included in Table 7.4. The solutions summarized in Table
7.4 are briefly reviewed in the following, grouped based on the feature extraction types considered
in the taxonomy.
Table 7.4: Overview of light field face PAD solutions.
Ref. | Release Year | User Inter. Support | LF Imaging Sensor | Conte. Inf. | Feat. Ext. Type | Feature Extrac. Sub-Type | LF Analysis | Format | Col. Space | Class. | Fusion level | DB | Type of attack
[32] | 2014 | No inter. | Lytro 1st Gen. | Crop. | Static | Depth/focus based | Disparity exploit. | LF microlens image | Gray | SVM | --- | Private | Paper
[33] | 2015 | No inter. | Lytro 1st Gen. | Crop. | Static | Depth/focus based | A posteriori refocusing | 2D rendered from LF | Gray | SVM | Feature | GUC-LiFFAD | Paper; tablet
[34] | 2016 | No inter. | Lytro 1st Gen. | Crop. | Static | Texture based | Depth comp. | 2D rendered from LF | Gray | SVM | Feature | Private | Paper
Prop. [43] | 2018 | No inter. | Lytro ILLUM | Crop. | Static | Texture based | Disparity exploit. | LF MV SA array | YCbCr, HSV | SVM | Feature; score | LLFFSD | Wrap. paper; paper; tablet; laptop; mobile
Prop. [42] | 2018 | No inter. | Lytro ILLUM | Crop. | Static | Texture based | Disparity exploit. | LF MV SA array | HSV | SVM | Feature | LLFFSD | Wrap. paper; paper; tablet; laptop; mobile
7.5.1 Texture Based Methods
Kim et al. exploited disparity information for face PAD by considering edge and ray difference
information, which cannot be obtained from the images captured by a conventional camera [32].
The edge information expresses the microlens image properties, which have different light
distributions for different focal planes. The ray difference information expresses the different
incident rays for the multiple SA images. An LBP descriptor is employed to extract features from
edge and ray difference information and a decision rule is used to distinguish bona fide users from
attacks. Performance evaluation considered a private light field database, including printed paper,
wrapped printed paper and tablet attacks. Li et al. [34] proposed a solution relying on a light field
histogram of gradients (LFHoG) descriptor, considering both spatial and depth information. The
rendered image texture and the distribution of the scene depth are combined, providing a more
comprehensive description of the face, which is exploited by a linear SVM classifier. A private
light field database was used to demonstrate the LFHoG descriptor effectiveness.
7.5.2 Focus/Depth Based Methods
Raghavendra et al. proposed a solution relying on a posteriori refocusing, exploiting the variation
of focus between images rendered at different depths to detect presentation attacks [33].
Experiments using the GUC-LiFFAD database included paper and tablet attacks. The best results,
after comparing 26 different focus measures, are reported when using gradient based focus
measurement operators.
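The idea of measuring how focus varies across images rendered at different depths can be sketched with a simple gradient based focus operator. This is a hedged illustration only: the energy-of-gradient operator stands in for the gradient based measures compared in [33], and the function names and the use of the range of the focus profile as a single summary value are assumptions, not the cited method.

```python
def gradient_energy(image):
    """Energy of gradient: sum of squared horizontal and vertical
    finite differences of a grey-level image (list of rows)."""
    rows, cols = len(image), len(image[0])
    energy = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = image[r][c + 1] - image[r][c]
            dy = image[r + 1][c] - image[r][c]
            energy += dx * dx + dy * dy
    return energy


def focus_variation(focal_stack):
    """Focus profile over a stack of images rendered at different depths,
    together with its range (maximum minus minimum focus value)."""
    profile = [gradient_energy(image) for image in focal_stack]
    return profile, max(profile) - min(profile)
```

The resulting profile, or statistics derived from it, could then serve as features for a classifier separating bona fide from attack samples.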
7.6 Adaptation of Face Presentation Attack Detection Solutions for Ear
Biometrics
All the reviewed artefact databases and solutions in this chapter have been proposed for face
biometrics. Despite the challenges associated with ear PAD, there are currently no available artefact
databases or PAD solutions for ear biometrics. Nevertheless, most of the face PAD solutions can potentially
be applied to detect ear presentation attacks. The face PAD solutions that may not be used in the
context of ear PAD are those relying on specific characteristics of the face, for example those
solutions with user interaction support for analyzing the user’s face reaction to external commands
or stimuli.
This Thesis proposes one ear artefact database along with two PAD solutions that can be applied
to both face and ear biometrics.
Chapter 8
Proposing Novel Light Field Face and Ear Artefact Databases
8.1 Introduction
As the emergence of novel imaging sensors motivates the research community to work with new
and richer imaging formats to detect presentation attacks, gathering extensive lenslet light field
artefact databases was a pressing need during this Thesis. As stated in Section 3.3.10, it was difficult
to fully assess how face PAD systems could benefit from light field data, as the only available light
field face artefact database, GUC-LiFFAD [33], does not include the raw light field images. In
fact, GUC-LiFFAD only includes a number of 2D images focused at different depths for each
person, rendered from light field images acquired by an old generation of lenslet light field
cameras; thus, it can only be useful for testing and validating those face PAD solutions that
exploit the a posteriori refocusing capability supported by light field imaging. Concerning ear
PAD, no ear artefact database, captured by lenslet light field cameras, was available when this
Thesis started.
To be able to test light field face and ear PAD solutions, including those proposed in this Thesis,
it is necessary to have access to artefact databases including light field images in the LFR format,
thus providing the flexibility to exploit different types of light field data for biometric PAD. To
overcome these limitations, light field based face and ear artefact databases have been created in
the context of this Thesis, allowing more powerful benchmarking for testing and validating face
and ear PAD solutions, exploiting the full light field data. The proposed IST Lenslet Light Field
Face Spoofing Database (IST LLFFSD) consists of 100 bona fide images, from 50 subjects,
captured with a Lytro ILLUM lenslet camera, and a set of 600 face presentation attack images,
using several types of presentation attack instruments, including printed paper, wrapped printed paper,
laptop, tablet and two different mobile phones, captured with the same camera. This Thesis also
proposes the first ear PAD database, the IST Lenslet Light Field Ear Artefact Database (IST
LLFEADB), captured with a Lytro ILLUM lenslet camera, including both 2D and light field
contents, using several types of presentation attack instruments, including laptop, tablet and two
different mobile phones. By including the raw light field images in the proposed databases, the
potential of these databases is significantly boosted as it allows more powerful benchmarking for
testing and validating face and ear PAD solutions exploiting the full light field data. These
databases have been made publicly available to the research community. This chapter reviews the
proposed light field face and ear artefact databases.
8.2 Light Field Based Face Artefact Database
The IST LLFFSD being proposed in this Thesis is the first face artefact database to include the
raw lenslet light field images, along with 2D rendered images and the corresponding depth maps.
It contains 100 bona fide samples and six types of presentation attacks: printed paper, wrapped
printed paper, laptop, tablet and two different mobile phones. This database has been made
publicly available to the research community.
In comparison with GUC-LiFFAD, the only other available light field face artefact database, the
proposed IST LLFFSD has the following main advantages: i) uses the higher resolution Lytro
ILLUM lenslet camera; ii) includes 2D RGB rendered face images, instead of greyscale; iii)
includes a depth map for the rendered images; and more importantly, iv) includes the raw light
field imaging information itself, in LFR format, boosting the potential of this database to allow
more powerful benchmarking for testing and validating face PAD solutions, exploiting the full
light field data.
8.2.1 Acquisition Setup and Statistics
Since face presentation attacks mostly happen in face verification systems [11] [13], where image
acquisition is done in controlled conditions, this is the scenario considered for the proposed IST
LLFFSD database. The IST LLFFSD artefact acquisition was performed indoors, using a lenslet
light field camera, the Lytro ILLUM [26], for capturing images from the attack attempts. The bona
fide IST LLFFSD samples are derived from the publicly available IST-EURECOM LLFFD [35],
captured with the same camera. It includes data from 50 volunteers, 33 males and 17 females, who
were born between 1957 and 1998, originating from 10 different countries. Each volunteer
participated in two separate acquisition sessions, with a time interval between 1 and 6 months,
resulting in a total of 100 bona fide face images. During acquisition, a uniform background was
used, and volunteers were asked to look straight at the camera, with a neutral expression – see
example in Figure 8.2.a.
8.2.2 Presentation Attack Instruments
The artefact acquisition pipeline is illustrated in Figure 8.1. A 2D central view rendered image,
with a resolution of 2022×1404 pixels, corresponding to each bona fide light field, is used to
generate the six types of PAIs considered in the proposed database:
1. Printed paper attack: 2D images were printed on A4 paper using a Canon i-SENSYS
MF8300 color laser printer. The printed paper is placed on a flat surface and the attack image
is captured with the Lytro ILLUM lenslet camera – see Figure 8.2.b.
2. Wrapped printed paper attack: The printed paper is wrapped over an object simulating the
human face curvature, resulting in different depths for different face areas. A more challenging
attack is expected to result, especially for the methods exploiting depth to detect presentation
attacks – see Figure 8.2.c.
3. Laptop attack: A 2D bona fide rendered image is displayed using a MacBook Pro 13’’ – see
Figure 8.2.d.
4. Tablet attack: A 2D bona fide rendered image is displayed using an iPad Air 2, 9.7’’ – see
Figure 8.2.e.
5. Mobile attack 1: A 2D bona fide rendered image is displayed using an iPhone 6S – see Figure
8.2.f.
6. Mobile attack 2: A 2D bona fide rendered image is displayed using a Sony Xperia z2 – see
Figure 8.2.g.
Figure 8.1: IST LLFFSD face artefact acquisition pipeline.
Figure 8.2: IST LLFFSD example: Illustration of 2D central view rendered images for: (a) bona
fide face; (b) print paper attack; (c) wrapped print paper attack; (d) laptop attack; (e) tablet
attack; (f) mobile attack 1; and (g) mobile attack 2.
8.2.3 Database Elements
The IST LLFFSD database is the first face artefact database to include the raw lenslet light field
imaging files. It is composed of the following elements:
1. Raw light field images: Light field images in the Lytro ILLUM native file format, LFR, with
about 50 MB/image. LFR files can be used as initial input either for the Lytro camera software,
i.e., Lytro Desktop Software [186], or for any other processing library/toolbox, such as the
Matlab Light Field Toolbox V0.4 [58].
2. 2D rendered images: 2D rendered images for the light field central view, created using the
Lytro Desktop Software [186]. This software performs up-sampling and color correction, to
enhance the rendered image quality, as described in Section 2.4.
3. Depth maps: Depth map for each 2D rendered image, generated with the Lytro Desktop
Software [186].
4. Camera calibration file: Calibration data is provided, as this information is essential to
compensate for the specific properties of each camera sensor.
8.2.4 Database Access and Usage Conditions
The IST LLFFSD database is publicly available for standardization and academic research
purposes and can be downloaded from http://www.img.lx.it.pt/LLFFSD/.
8.3 Light Field Based Ear Artefact Database
This Thesis proposes the first ear PAD database, the IST Lenslet Light Field Ear Artefact Database
(IST LLFEADB), including both 2D and light field ear artefact images. The database contains two
sets which are reviewed in the following.
8.3.1 Acquisition Setup and Statistics
IST LLFEADB consists of a baseline set and an extended set, with the difference between the two
sets being related to the settings used for bona fide image acquisition.
Baseline set: The bona fide samples in the baseline IST LLFEADB have been derived from
the LLFEDB [36], consisting of 268 ear samples from 67 subjects, with 4 image shots per
person, notably the right and left half and full profile images, captured with a Lytro ILLUM
lenslet camera. They include ear images partly occluded by ear piercings, earrings, hair and
combinations of multiple occlusions. A 2D central view image (see Figure 8.3.a), rendered by
the Lytro Desktop Software [186], corresponding to each bona fide light field, was used to
generate the artefacts. The size of the rendered ear images varies, with an average size of
213×143 pixels and aspect ratio of 1.49. All bona fide images in the baseline set were first
rescaled to 192×128 pixels, with an aspect ratio of 1.5, so that they all have the same size when
displayed on the PAIs.
Extended high resolution set: The bona fide baseline samples do not have a very high
resolution, which can affect the quality of the samples displayed on PAIs, thus facilitating
distinguishing the attack from bona fide samples. To consider a more challenging condition,
additional high resolution bona fide images have been captured with the same camera, from
15 subjects, with 4 image shots per person. The size of the rendered ear images varies, with an
average size of 1162×760 pixels and aspect ratio of 1.53. All the bona fide images were
rescaled to 1152×768 pixels, with an aspect ratio of 1.5, so that they all have the same size when
displayed on the PAIs.
8.3.2 Presentation Attack Instruments
The artefact acquisition pipeline is illustrated in Figure 8.4; this acquisition of images from PAIs
was performed using the Lytro ILLUM lenslet camera, thus creating one LFR file per sample. It
should be noted that an ear does not have a curved shape, therefore wrapped paper attacks are not
relevant for ear recognition systems. Additionally, due to the low quality of bona fide ear samples
available from the baseline IST LLFEADB set, printing those low resolution and low quality
artefacts would not result in challenging attacks, so paper attacks were not considered.
Four types of PAIs were considered for the LLFEADB:
1. Laptop attack: A 2D bona fide rendered image is displayed using a MacBook Pro 13’’ – see
Figure 8.3.b.
2. Tablet attack: A 2D bona fide rendered image is displayed using an iPad Air 2, 9.7’’ – see
Figure 8.3.c.
3. Mobile attack 1: A 2D bona fide rendered image is displayed using an iPhone 6S – see Figure
8.3.d.
4. Mobile attack 2: A 2D bona fide rendered image is displayed using a Sony Xperia z2 – see
Figure 8.3.e.
Figure 8.3: Illustration of IST LLFEADB images for a bona fide sample and corresponding
artefact samples for four different PAIs.
Figure 8.4: IST LLFEADB ear artefact acquisition pipeline.
8.3.3 Database Elements
The IST LLFEADB is the first ear artefact database for PAD, including both 2D and raw light
field images. It is composed of the following elements:
1. Raw light field images: Light field images in the Lytro ILLUM native file format, LFR, that
can be used as initial input either for the Lytro camera software, i.e., Lytro Desktop Software
[186], or for any other processing library/toolbox, such as the Matlab Light Field Toolbox V0.4
[58].
2. 2D rendered images: 2D rendered images for the light field central view, containing only the
ear region, and created using the Matlab Light Field Toolbox V0.4 [58] – see Figure 8.3.a.
3. Multi-view SA image array: Sub-aperture images corresponding to different viewpoints,
forming a multi-view array, and created using the Matlab Light Field Toolbox V0.4 [58] – see
Figure 8.5; these multi-view arrays contain only the ear region.
4. Camera calibration file: Calibration data is provided, as this information is essential to
compensate for the specific properties of each camera sensor.
Figure 8.5: Multi-view sub-aperture image array for an artefact ear image.
8.3.4 Database Access and Usage Conditions
The IST LLFEADB database is publicly available for standardization and academic research
purposes and can be downloaded from http://www.img.lx.it.pt/LLFEADB/.
Chapter 9
Proposing Novel Light Field Face and Ear
Presentation Attack Detection Solutions
9.1 Introduction
This Thesis proposes two PAD solutions based on two light field angular hand-crafted descriptors
for the disparity information available in light field images, for both face and ear. Exploiting the
disparity information acquired in a light field image in the form of an array of SA images can be
very useful to improve PAD performance. The motivation behind exploiting disparity information
for detecting presentation attacks comes from disparity differences between bona fide and attack
images due to:
1. Differences in surface geometry, leading to considerable differences in the disparity
information. Differences in face and ear components’ depth influence the image texture,
leading to shifts in shadows’ location and shape, and changes in contrast gradients. All flat
attack types (laptop, tablet, mobile and printed paper) exhibit limited differences in the
disparity/depth at the positions of the various face and ear components. Wrapped printed papers
are more challenging attacks for face PAD as they simulate the face’s approximately
cylindrical shape. Nevertheless, the resulting disparity changes are smoother and different from
those observed in bona fide faces. It should be noted that it is not logical to use wrapped printed
paper for ear PAD, as this PAI cannot simulate the ear shape. Concerning 3D PAIs, including
face masks and silicone ears, as the surface geometry of these PAIs is very similar to a bona
fide face or ear, fewer changes relative to the bona fide disparities are expected, thus
leading to the most challenging attacks; however, the light reflection pattern is also different,
which is an advantage for light field approaches. Differences in face and ear components’ depth
can also cause changes in defocus blur between the different views obtained from a light field
image, with attack images exhibiting an almost constant amount of defocus blur across views,
while different views of a bona fide image are expected to exhibit an unequal amount of
defocus blur.
2. Reproduction materials used for the attacks, such as electronic displays and paper, lead to
changes in transmission, scattering, reflection and absorption of light, thus introducing changes
in the light distribution, as well as some additional acquisition noise types such as reflection
and sharpness loss, which can be expressed by disparity information.
The above-mentioned effects are exploited by the solutions proposed in this Thesis to
detect presentation attacks.
9.2 PAD Based on Light Field Angular Local Binary Pattern Descriptor
This section proposes a novel PAD solution based on a light field angular hand-crafted descriptor
exploiting the color and texture variations associated to the different directions of light captured
in light field images. The proposed PAD solution is based on the Light Field Angular Local Binary
Patterns (LFALBP) descriptor presented in Section 5.2.2, here used to capture the disparity
information present in light field images in two different color spaces. The proposed LFALBP
based PAD architecture is represented in Figure 9.1.
Figure 9.1: Architecture of the proposed face and ear PAD solution based on LFALBP hand-
crafted descriptor.
This PAD solution includes the following main modules:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (see Section 2.5). Then, each face/ear in all SA images is cropped and faces
are resized to 128×128 and ears to 192×128 pixels. These sizes were chosen for the following
reasons: i) A study on the IST LLFFSD and IST LLFEADB databases showed that the average aspect
ratios of the cropped faces and ears are 1.08 and 1.51, thus justifying the aspect ratio of the
resized faces and ears; ii) A preliminary study conducted during the Thesis work has shown that
increasing the image resolution does not significantly affect the PAD performance, while
increasing computational complexity. This is because, although the IST LLFFSD
database considers larger image sizes, the face area is only a portion of that size, with 128×128
being a size close to the cropped face image. This is also the case for ear images from the
extended high resolution set of IST LLFEADB, and 192×128 is a size that is adjusted to the
aspect ratio of ears present in the database.
2. HSV/YCbCr color conversion: RGB may not always be the best color space to work with, since:
i) there is a strong correlation between the RGB components; and ii) the luminance and
chrominance information are not separately represented in RGB. This module converts the
RGB cropped SA face/ear images to the HSV and YCbCr color spaces, as they proved to be
efficient in detecting presentation attacks [234]. The combination of the descriptors computed
over HSV and YCbCr is able to express the color information in different, complementary
ways.
3. LFALBP description: The Light Field Angular Local Binary Patterns (LFALBP) descriptor
(Equation 5.7) is the angular part of the LFLBP combined descriptor proposed in Section 5.2.2,
which is a compact and efficient LBP extension, able to exploit the light field disparity
information. The LFALBP descriptions are computed over each color component from the
cropped SA images. Results in Section 6.3.1 show a clear performance improvement for light
field biometric recognition tasks as the SA images’ disparity increases. Thus, it is proposed
here that the SA images selected for computing the disparity description are at maximum
distance from the central view. More details about the LFALBP descriptor along with its
parameters, including radius, starting angle, and the number of SA images, have already been
presented in Section 5.2.2.
4. Component description concatenation: The extracted descriptions for each color space
component are concatenated, resulting in a 3-component description for each considered color
space.
5. Offline training: The LFALBP concatenated descriptions extracted from the training samples
for each color space are fed to an SVM classifier (implemented using the LIBSVM library
[190]), to define the classification model. The training data should be selected based on the test
protocol considered.
6. SVM classification: The LFALBP descriptions extracted from the test samples for each color
space are fed to an SVM classifier (implemented using the LIBSVM library [271]), returning a
bona fide versus attack classification score for each color space. The decision to adopt the
SVM classifier was made after an extensive performance comparison of several classifiers,
including k-nearest neighbour (kNN) with L1 and L2 norms and logistic regression, as well as
different SVM kernels, including polynomial, radial basis, and sigmoid tanh functions. Linear
SVM performed slightly better than logistic regression and considerably better than kNN.
Additionally, a linear kernel led to the best results, compared to the other tested kernels, thus
justifying the choice of linear SVM as the final classifier.
7. Score level fusion: The integration of the individual SVM classifier outputs for the two color
spaces is done using score level fusion, applying the sum rule to compensate for the small errors
obtained by each individual color space. The fused score finally determines whether the input
image should be considered to contain a bona fide face/ear or to be an attack attempt.
In summary, the novel PAD solution is able to benefit from the variations associated to the
different directions in the captured light field, using the angular texture information represented in
two different color spaces.
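The score level fusion step (module 7) can be illustrated with a minimal Python sketch. It is not the implementation used in this Thesis: the function name, the convention that positive scores favour the bona fide class (as with a signed SVM decision value), and the decision threshold of 0.0 are all assumptions made for illustration.

```python
def sum_rule_fusion(score_hsv, score_ycbcr, threshold=0.0):
    """Fuse the two per-color-space classifier scores with the sum rule.

    Assumption: a larger score means "more likely bona fide"; the default
    threshold of 0.0 is illustrative and not taken from the Thesis.
    """
    fused = score_hsv + score_ycbcr  # sum rule: add the individual scores
    label = "bona fide" if fused > threshold else "attack"
    return fused, label


# A confident HSV score outweighs a weak, opposite YCbCr score.
fused_a, label_a = sum_rule_fusion(0.75, -0.25)
# Both classifiers lean towards an attack.
fused_b, label_b = sum_rule_fusion(-1.0, -0.5)
```

With the sum rule, a small error made by one color space classifier can be corrected when the other classifier returns a sufficiently confident score of the opposite sign, as in the first call above.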
9.3 PAD Based on Light Field Histogram of Disparity Gradient Descriptor
Another novel light field based PAD solution has been proposed based on the Light Field
Histogram of Disparity Gradients (LFHDG) hand-crafted descriptor presented in Section 5.3.2,
which is able to express the light variations associated to the multiple light capturing directions in
light field images. The LFHDG descriptor considers both the orientation and magnitude variations
for the angular information. Compared to the proposed solution above that only captures the
magnitude sign, the PAD solution proposed in this section offers a more comprehensive angular
description. As expected, it boosts the final recognition performance, as described in detail in this
section. The proposed LFHDG based PAD architecture is represented in Figure 9.2.
Figure 9.2: Architecture of the proposed face and ear PAD solution based on LFHDG
descriptor.
This PAD solution includes the following main modules:
1. Pre-processing: Matlab Light Field Toolbox v0.4 [58] has been used to create the multi-view
array of SA images (Section 2.5). Then, as for the previous PAD solution, each face/ear in all
SA images is cropped and faces are resized to 128×128 and ears to 192×128 pixels.
2. HSV color conversion: Since in RGB there is a strong correlation between the color
components and the luminance is not separately represented from chrominance, RGB is not
necessarily the best color space to work with. This module converts the RGB cropped sub-
aperture face/ear images to the HSV color space, which can be beneficial to distinguish
between bona fide and attack samples as shown for instance in [234].
3. LFHDG extraction: The LFHDG descriptions are computed for each color component from
the cropped SA images, thus expressing orientation and magnitude for angular information.
Similar to the proposed PAD solution based on the LFALBP descriptor, it is proposed here
that the SA images selected for computing the disparity gradients are at maximum distance
from the central view. More details about the LFHDG descriptor have already been provided
in Section 5.3.2.
4. Components descriptor concatenation: The extracted descriptions for each color space
component are concatenated, resulting in the final 3-component description.
5. Offline training: The LFHDG concatenated descriptions extracted from the training samples
for each color space are fed to an SVM classifier (implemented using the LIBSVM library
[190]), to define the classification model. The training data should be selected based on the test
protocol considered.
6. SVM classification: The concatenated LFHDG description extracted from the test samples is
fed to the previously trained SVM classifier to be compared to the classification model, thus
determining whether the input image should be considered to contain a bona fide face/ear or
an attack attempt. The experiments made with other classifiers and SVM kernels led to the
same conclusion of using a linear SVM classifier as for the previous PAD solution based on
the LFALBP descriptor.
In summary, the LFHDG based light field PAD solution exploits the orientation and magnitude
for the light variations associated to the multiple directions in the captured light field in the HSV
color space.
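The HSV color conversion and the per-component concatenation (modules 2 and 4) can be sketched as follows. This is only an illustration under stated assumptions: it uses the Python standard library `colorsys` module instead of the Matlab tools used in this Thesis, represents an image as nested lists of 8-bit RGB tuples, and the helper names are hypothetical. The descriptor computation itself (Section 5.3.2) is not reproduced; any per-component description is simply treated as a list of numbers.

```python
import colorsys


def rgb_image_to_hsv_planes(image):
    """Split an RGB image into H, S and V planes with values in [0, 1].

    `image` is a list of rows, each row a list of (r, g, b) tuples in 0..255
    (an illustrative representation, not the one used in the Thesis).
    """
    h_plane, s_plane, v_plane = [], [], []
    for row in image:
        hsv_row = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
                   for r, g, b in row]
        h_plane.append([p[0] for p in hsv_row])
        s_plane.append([p[1] for p in hsv_row])
        v_plane.append([p[2] for p in hsv_row])
    return h_plane, s_plane, v_plane


def concatenate_descriptions(per_component_descriptions):
    """Concatenate the per-component descriptions into one feature vector."""
    concatenated = []
    for description in per_component_descriptions:
        concatenated.extend(description)
    return concatenated


# One-row toy image: a pure red pixel and a mid-gray pixel.
h, s, v = rgb_image_to_hsv_planes([[(255, 0, 0), (128, 128, 128)]])
# Toy per-component descriptions standing in for real descriptor outputs.
feature_vector = concatenate_descriptions([[1, 2], [3, 4], [5, 6]])
```

Separating the image into three planes before description is what makes the final description a 3-component one: each plane is described independently and the results are joined in a fixed order.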
Chapter 10
Light Field Face and Ear Presentation Attack
Detection Performance
10.1 Introduction
In this chapter, an extensive performance evaluation is reported for the proposed PAD solutions
and several benchmarking methods, using a common, representative performance evaluation
framework for varied and challenging presentation attacks, notably the proposed light field artefact
databases. As the proposed IST LLFEADB is the first ear PAD light field database and no previous
light field ear PAD solutions were available, this Thesis considers a set of representative and
promising face PAD solutions applied to the ear PAD problem for benchmarking purposes.
10.2 Performance Assessment Framework
This section presents the test material and metrics used for the assessment of the proposed face
and ear PAD solutions and also several solutions from the literature. The non-light field and
light field based face and ear PAD solutions considered for benchmarking are also introduced.
10.2.1 Test Material
The IST LLFFSD database (Section 8.2) is here used for the assessment of the proposed face PAD
solutions and also several benchmarking solutions from the literature. Similarly, IST LLFEADB
(Section 8.3), the first database to consider light field imaging of ear presentation attacks, has been
used for the assessment of the proposed and benchmarking ear PAD solutions.
10.2.2 Evaluation Metrics
The metrics used for evaluating the performance of PAD solutions are described in the following:
Bona Fide Presentation Classification Error Rate (BPCER), also known as False Rejection
Rate (FRR), showing the proportion of bona fide presentations incorrectly classified as attack
presentations.
Attack Presentation Classification Error Rate (APCER), also known as False Acceptance Rate
(FAR), showing the proportion of attack presentations incorrectly classified as bona fide
presentations.
Average Classification Error Rate (ACER), defined as half of the sum of the BPCER and
APCER, summarizing the overall system performance.
It is also recommended that the operational systems be configured at a defined security level, see
for instance the FRONTEX guidelines for automated border control systems in Europe [273]. In
this context, the classification performance of a PAD solution can be shown as a Detection Error
Tradeoff (DET) curve, plotting the BPCER versus the APCER.
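The three metrics can be computed directly from their definitions, as in the minimal Python sketch below. The function name and the boolean encoding (True for bona fide, False for attack) are assumptions made for illustration.

```python
def pad_error_rates(is_bona_fide, classified_bona_fide):
    """Return (BPCER, APCER, ACER) for paired ground truth and decisions.

    Both arguments are equal-length sequences of booleans, where True means
    "bona fide" and False means "attack" (an illustrative encoding).
    """
    bona_decisions = [d for t, d in zip(is_bona_fide, classified_bona_fide) if t]
    attack_decisions = [d for t, d in zip(is_bona_fide, classified_bona_fide) if not t]
    # BPCER: proportion of bona fide presentations classified as attacks.
    bpcer = sum(1 for d in bona_decisions if not d) / len(bona_decisions)
    # APCER: proportion of attack presentations classified as bona fide.
    apcer = sum(1 for d in attack_decisions if d) / len(attack_decisions)
    # ACER: half of the sum of BPCER and APCER.
    acer = (bpcer + apcer) / 2
    return bpcer, apcer, acer


# Four bona fide and four attack presentations: one bona fide presentation
# is rejected and two attacks are accepted.
truth = [True, True, True, True, False, False, False, False]
decisions = [True, True, True, False, True, True, False, False]
bpcer, apcer, acer = pad_error_rates(truth, decisions)
```

In this toy example the BPCER is 1/4, the APCER is 2/4 and the ACER is their average, 0.375.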
10.2.3 Benchmarking Methods
Apart from the proposed PAD solutions, this Thesis has selected a set of representative and
promising non-light field and light field based benchmarking solutions available in the literature.
For face PAD, the benchmarking solutions considered are:
1. Non-light field based 2D solutions, including two baseline description based solutions [227]
[228], two state-of-the-art description based solutions [232] [234], and one state-of-the-art
quality based solution [221].
2. Light field based solutions, including the solutions presented in [32] [33] [34] (see Section
7.5).
None of the above face PAD solutions relies on specific facial characteristics, e.g., analyzing the
blinking of the eyes or the user’s face reaction to external commands or stimuli, to detect attacks,
thus they can potentially be used also for ear recognition. For ear PAD, the three best performing
benchmarking solutions for face PAD [232] [234] and [34] are considered to detect ear
presentation attacks.
A summary of the characteristics of each considered benchmarking solution, following the
taxonomy illustrated in Figure 7.1, is available in Table 10.1.
The central view 2D rendered SA images and the full light field images are used to test the non-
light field based 2D and the light field based solutions, respectively. All tested PAD solutions were
re-implemented by the author of this Thesis and performance results were obtained considering
the best parameter settings reported in the relevant original papers.
10.3 Face PAD Performance
The experimental work started by analyzing the performance of the proposed and benchmarking
face PAD solutions. The usage of different color spaces is analyzed, the generalization of face
PAD solutions to different operation contexts and different attacks is studied and finally the
computational efficiency of the face PAD solutions is analyzed.
Table 10.1: Overview of PAD benchmarking solutions.
Ref. | Release Year | User Inter. Support | Imaging Sensor | Contextual Info. | Feature Extraction Type | Feature Extraction Sub-Type | Color Space | Classifier | Fusion Level
[227] | 2011 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Feature
[228] | 2012 | No inter. | RGB | Cropped | Static | Texture based | Gray | SVM | Score
[221] | 2015 | No inter. | RGB | Cropped | Static | Quality based | RGB | SVM | Feature
[232] | 2015 | No inter. | RGB | Cropped | Static | Texture based | RGB | Regression | Score
[234] | 2016 | No inter. | RGB | Cropped | Static | Texture based | HSV, YCbCr | SVM | Feature, score
[32] | 2014 | No inter. | Lytro 1st Gen. | Cropped | Static | Focus based | Gray | SVM | ---
[33] | 2015 | No inter. | Lytro 1st Gen. | Cropped | Static | Focus based | Gray | SVM | Feature
[34] | 2016 | No inter. | Lytro 1st Gen. | Cropped | Static | Texture based | Gray | SVM | Feature
Prop. LFALBP | --- | No inter. | Lytro ILLUM | Cropped | Static | Texture based | YCbCr; HSV | SVM | Feature; score
Prop. LFHDG | --- | No inter. | Lytro ILLUM | Cropped | Static | Texture based | HSV | SVM | Feature
10.3.1 Face PAD Accuracy Evaluation
Since in real-world situations there is no way to know what type of attack will be performed, the
face PAD solutions were trained with a mix of all artefact types available in the IST LLFFSD
database. 4-fold cross-validation experiments were performed, meaning that, for each experiment,
the face PAD systems are trained with ¾ of the database and tested with the remaining ¼. The
cross-validation strategy leads to a more accurate assessment of the model’s detection power for
unseen data, compared to the simpler strategy of dividing the data into rigid training and testing
sets. Regarding the number of available IST LLFFSD samples, the experiments used the 100
bona fide and 100 attack images for each attack type, in total 600 attack images. To have a balanced
training, 75 bona fide and 75 attack images (randomly selecting 12 images from each attack type)
were considered to train the classifiers. Testing was performed with a non-overlapping set of 25
bona fide and 25 attack images. As the attack images are randomly selected for each attack type,
the experiments have been repeated 50 times and the average results of these 50 runs are reported
to provide a more stable performance estimation. Table 10.2 reports the average ACER results and
Figure 10.1 the DET curves. The red vertical dashed lines show BPCERs at a fixed 1% APCER, as
operational systems are usually configured at a predefined security level. Naturally, BPCERs
at any APCER level can be observed from Figure 10.1.
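Reading the BPCER at a fixed APCER amounts to choosing the decision threshold from the attack scores and then counting the rejected bona fide samples. The Python sketch below illustrates this under stated assumptions: higher scores mean "more likely bona fide", a sample is accepted when its score is strictly above the threshold, and the function name and threshold selection rule are illustrative rather than taken from this Thesis.

```python
def bpcer_at_apcer(bona_scores, attack_scores, target_apcer=0.01):
    """Return (BPCER, achieved APCER, threshold) at a fixed APCER level.

    Assumption: a sample is accepted as bona fide when score > threshold.
    The threshold is the strictest one whose APCER does not exceed the target.
    """
    ranked = sorted(attack_scores, reverse=True)
    # Number of attack samples that may be accepted (rounded down, so the
    # achieved APCER never exceeds the target).
    allowed = int(target_apcer * len(ranked))
    if allowed >= len(ranked):
        threshold = float("-inf")  # every attack may be accepted
    else:
        threshold = ranked[allowed]
    apcer = sum(1 for s in attack_scores if s > threshold) / len(attack_scores)
    bpcer = sum(1 for s in bona_scores if s <= threshold) / len(bona_scores)
    return bpcer, apcer, threshold


bona = [0.35, 0.5, 0.6, 0.7]
attacks = [0.1, 0.2, 0.3, 0.4]
loose = bpcer_at_apcer(bona, attacks, target_apcer=0.25)
strict = bpcer_at_apcer(bona, attacks, target_apcer=0.0)
```

In the toy example, tolerating one accepted attack out of four lets all bona fide samples pass, while tolerating none forces a higher threshold that rejects one bona fide sample, which is the trade-off a DET curve displays.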
The obtained results show that the proposed LFALBP based and LFHDG based face PAD
solutions always achieve the best performance when compared to the benchmarking non-light field
and light field face PAD solutions. It is worth noticing that the very good accuracy achieved by
the proposed PAD solutions is associated to the considered scenario, with images acquired under
controlled conditions, which corresponds to the most relevant presentation attack scenario when
considering a fixed camera setup. The experiments performed in [33] and [34] were conducted on
images captured in less controlled environments, at various distances. The exploitation of the
focus/depth variation for those images brings more information when compared to images
acquired under controlled conditions, which are almost all-in-focus. This may justify the reduced
performance of those solutions when testing with the IST LLFFSD database.
Due to considering both the orientation and magnitude for the angular information, the proposed
LFHDG based solution performs slightly better than the proposed LFALBP based solution, which
only captures the magnitude sign for the angular information.
Table 10.2: ACER face PAD performance for the proposed and benchmarking solutions using IST LLFFSD (minimum errors in bold).
Detection solution | | | Presentation Attack Instrument | | | | |
Ref. | Year | Type | Laptop | Tablet | Mobile 1 | Mobile 2 | Paper | Wrapped paper
[227] | 2011 | 2D | 42.62% | 23.03% | 39.33% | 46.31% | 33.87% | 20.32%
[228] | 2012 | 2D | 37.67% | 14.00% | 41.25% | 45.95% | 35.5% | 15.95%
[221] | 2015 | 2D | 12.06% | 13.01% | 9.23% | 15.83% | 14.82% | 15.27%
[232] | 2015 | 2D | 2.76% | 4.39% | 2.10% | 2.80% | 3.03% | 4.17%
[234] | 2016 | 2D | 4.32% | 2.65% | 2.52% | 5.81% | 2.75% | 4.95%
[32] | 2014 | LF | 10.12% | 12.39% | 12.79% | 13.86% | 12.91% | 16.14%
[33] | 2015 | LF | 19.78% | 26.36% | 29.98% | 22.46% | 32.43% | 38.03%
[34] | 2016 | LF | 11.00% | 10.77% | 8.12% | 18.50% | 7.27% | 22.05%
Proposed LFALBP | --- | LF | 0.88% | 2.14% | 0.73% | 0.79% | 0.75% | 2.85%
Proposed LFHDG | --- | LF | 0.01% | 0.29% | 0% | 0% | 0.28% | 0.45%
Figure 10.1: DET face PAD performance for the proposed and benchmarking solutions using
IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f) wrapped paper
PAIs.
It is well-known that the PAD performance is sensitive to the amount of training data. To
investigate the robustness of the proposed and benchmarking solutions to the number of training
samples, the value of n considered for the adopted n-fold cross-validation strategy was tested with
values n=2,…,7, meaning that the number of training samples increases with n. Figure 10.2
illustrates the average ACER results for all artefact types obtained with 50 runs. The results clearly
show that the proposed LFALBP and LFHDG based face PAD solutions are less sensitive to the
number of training samples than the benchmarking solutions. As expected, the detection
performance increases when training uses more samples. The value of n is fixed to 4 (4-fold cross-
validation) in the next reported experiments as it shows stable performances for the proposed and
benchmarking solutions.
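The relation between n and the amount of training data can be made explicit with a short Python sketch. It assumes that one fold is used for testing and the remaining n-1 folds for training, and that the test fold size is rounded down when the number of samples is not divisible by n; the rounding rule and the function name are assumptions, not details given in this Thesis.

```python
def fold_sizes(num_samples, n_folds):
    """Return (training size, test size) for one round of n-fold cross-validation."""
    test_size = num_samples // n_folds      # one fold held out for testing
    train_size = num_samples - test_size    # remaining n-1 folds for training
    return train_size, test_size


# Training/test sizes for the 100 bona fide samples, for n = 2, ..., 7.
sizes = {n: fold_sizes(100, n) for n in range(2, 8)}
```

For n=4 this gives 75 training and 25 test samples, matching the split described in Section 10.3.1, and the training size grows monotonically as n goes from 2 to 7.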
Figure 10.2: ACER face PAD performance for the proposed and benchmarking solutions with n-
fold cross-validation.
10.3.2 Face PAD Color Features Accuracy Evaluation
The importance of considering color information and not only luminance information for face
PAD has been studied. These tests were conducted for the proposed and the two best performing
benchmarking solutions [232] [234]. The ACER results are reported in Table 10.3, notably when
using color information from the color spaces depicted in Table 10.1 and when considering only
the luminance/gray channel. The results highlight PAD performance gains when using color in
comparison to only using gray level information. The advantage of using color information stems
from: i) as printed paper and display PAIs used for face presentation attacks have a limited color
gamut, not reproducing color perfectly, PAD solutions can, therefore, benefit from extracting
discriminative color features; and ii) computing features over several color components can
provide complementary information.
Table 10.3: ACER face PAD performance for the proposed and benchmarking solutions using color or gray information (minimum errors in bold).
Detection solution | | | Presentation Attack Instrument | | | | |
Ref. | Type | Color | Laptop | Tablet | Mobile 1 | Mobile 2 | Paper | Wrapped paper
[232] | 2D | No | 20.85% | 26.67% | 18.34% | 39.89% | 19.23% | 19.94%
[232] | 2D | Yes | 2.76% | 4.39% | 2.10% | 2.80% | 3.03% | 4.17%
[234] | 2D | No | 5.41% | 9.72% | 8.91% | 18.18% | 17.40% | 7.93%
[234] | 2D | Yes | 4.32% | 2.65% | 2.52% | 5.81% | 2.75% | 4.95%
Prop. LFALBP | LF | No | 15.53% | 9.75% | 6.92% | 9.60% | 15.70% | 13.15%
Prop. LFALBP | LF | Yes | 0.88% | 2.14% | 0.73% | 0.79% | 0.75% | 2.85%
Prop. LFHDG | LF | No | 10.65% | 11.62% | 4.80% | 2.52% | 20.57% | 16.92%
Prop. LFHDG | LF | Yes | 0.01% | 0.29% | 0% | 0% | 0.28% | 0.45%
10.3.3 Face PAD Generalization Accuracy Evaluation
When deploying a face PAD solution, there is no way to know if the system will be attacked and
what type of attacks will be performed. A good way to test the ability to sustain unknown attacks
is to test the system in conditions not considered in the training stage. For this purpose, cross-
database evaluation has been considered, as performed in [234], training with one database and
testing with another. However, as there are only two publicly available light field face artefact
databases, GUC-LiFFAD [33] and the proposed IST LLFFSD, with the first including only 2D
rendered images, it is impossible at this stage to conduct a cross-database evaluation for light field
face PAD solutions.
As an alternative study, this Thesis investigates face PAD generalization from a different
perspective, notably considering an unforeseen artefact type, by training the PAD solutions with
all attack types available in IST LLFFSD excluding one, which is then used for testing. Table 10.4
reports the ACER results for the generalization tests obtained with 50 runs. Additionally, Figure
10.3 shows the DET generalization curves for the proposed methods and the benchmarking PAD
solutions, where the red vertical dashed lines mark the BPCER at a fixed 1% APCER.
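As an illustration of the metrics used in this evaluation, the following minimal sketch computes APCER, BPCER and ACER at a given threshold, and the BPCER at a fixed APCER operating point. It assumes that ACER is the plain average of APCER and BPCER for a single attack type, and that higher scores mean "more likely bona fide"; the function names and the score convention are hypothetical and are not taken from the solutions evaluated in this Thesis.

```python
import numpy as np

def apcer_bpcer_acer(scores, is_attack, threshold):
    """Error rates at one threshold; higher score = more likely bona fide (assumed convention)."""
    scores = np.asarray(scores, dtype=float)
    is_attack = np.asarray(is_attack, dtype=bool)
    accepted = scores >= threshold            # samples classified as bona fide
    apcer = np.mean(accepted[is_attack])      # attacks wrongly accepted as bona fide
    bpcer = np.mean(~accepted[~is_attack])    # bona fide samples wrongly rejected
    return apcer, bpcer, (apcer + bpcer) / 2.0

def bpcer_at_apcer(scores, is_attack, target_apcer=0.01):
    """Smallest BPCER over all thresholds whose APCER does not exceed target_apcer."""
    scores = np.asarray(scores, dtype=float)
    best = 1.0
    # Candidate thresholds: every observed score, plus one above the maximum (rejects everything).
    for t in np.append(np.unique(scores), scores.max() + 1.0):
        apcer, bpcer, _ = apcer_bpcer_acer(scores, is_attack, t)
        if apcer <= target_apcer:
            best = min(best, bpcer)
    return best
```

Sweeping the threshold over all scores and plotting BPCER against APCER would give the points of a DET curve; the function `bpcer_at_apcer` reads off a single point of that curve.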
The results show that the light field based PAD solutions generalize very well to unforeseen flat
presentation attack instruments (either laptop, tablet, mobile or printed paper), as the training also
includes other types of flat attacks. However, the performance drops significantly when testing for
the wrapped paper attack type when it was not used for training. This is not surprising as wrapped
paper is the only 3D attack type considered in IST LLFFSD, thus exhibiting more differences in the
various facial landmark positions than the flat attack types. Therefore, the light field based
solutions using a classification model trained only with flat attacks experience some ACER
performance degradation in the presence of wrapped paper attacks. This observation highlights the
need for training face PAD systems using attack samples with different surface geometries.
Concerning the proposed LFALBP and LFHDG based solutions, even though their performance
decreases compared to the performance reported in Section 10.3.1, their generalization ability is
superior to that of the state-of-the-art solutions for most of the considered attack types.
Table 10.4: ACER face PAD generalization performance for the proposed and benchmarking
solutions using IST LLFFSD (minimum errors in bold).
Detection solution          Presentation Attack Instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Paper    Wrapped paper
[227]         2011  2D      44.93%   34.27%   41.60%    47.05%    42.02%   32.87%
[228]         2012  2D      27.90%   24.13%   19.30%    25.60%    17.70%   28.30%
[221]         2015  2D      17.61%   16.93%   12.02%    22.88%    17.12%   24.27%
[232]         2015  2D      3.99%    12.45%   4.14%     7.13%     4.97%    45.95%
[234]         2016  2D      36.10%   7.28%    4.64%     15.15%    7.47%    33.57%
[32]          2014  LF      30.60%   33.60%   17.40%    37.40%    35.40%   42.40%
[33]          2015  LF      24.33%   29.85%   32.17%    23.26%    19.93%   47.16%
[34]          2016  LF      12.23%   27.95%   9.80%     33.48%    8.78%    37.75%
Prop. LFALBP  -     LF      0.95%    6.04%    0.83%     0.98%     0.95%    38.75%
Prop. LFHDG   -     LF      0.05%    9.20%    0%        0.02%     0.50%    45.10%
Figure 10.3: DET face PAD generalization performance for the proposed and benchmarking
solutions using IST LLFFSD for: (a) monitor; (b) tablet; (c) mobile 1; (d) mobile 2; (e) paper; (f)
wrapped paper PAIs.
10.3.4 Face PAD Computational Complexity
Quantifying the amount of resources needed to perform PAD, such as time and storage, is of
prominent importance for operational biometric systems. Table 10.5 shows the average extraction
and classification times per light field image (in seconds) for the proposed and benchmarking
solutions. Table 10.5 also summarizes the feature vector sizes extracted by the various solutions.
Time measurements were performed on a standard 64-bit Intel PC with a 3.40 GHz processor and
16 GB RAM, running MATLAB R2015b.
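The way such per-image averages can be obtained is sketched below. This is only an illustrative Python harness, not the MATLAB code used for the measurements reported here; `extract` and `classify` are hypothetical callables standing in for any feature extractor and classifier.

```python
import time

def average_times(samples, extract, classify):
    """Return mean (extraction, classification, total) time in seconds per sample."""
    t_extract = 0.0
    t_classify = 0.0
    for sample in samples:
        start = time.perf_counter()
        features = extract(sample)      # feature extraction stage
        middle = time.perf_counter()
        classify(features)              # classification stage
        end = time.perf_counter()
        t_extract += middle - start
        t_classify += end - middle
    n = len(samples)
    return t_extract / n, t_classify / n, (t_extract + t_classify) / n
```

Timing the two stages separately makes it visible which stage dominates the total processing time.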
Table 10.5: Average extraction and classification times, and feature vector size for the proposed
and benchmarking face PAD solutions (minimum values in bold).
Solution                  Feature extraction  Classification  Total     No. of vector  Feature size
Ref.          Year  Type  time (s)            time (s)        time (s)  elements/bins  (bytes)
[227]         2011  2D    0.9375              0.0020          0.9395    833            687
[228]         2012  2D    4.2508              0.0570          4.3078    45,669         339,071
[221]         2015  2D    0.1978              0.0013          0.1991    121            394
[232]         2015  2D    0.3148              0.0016          0.3164    369            474
[234]         2016  2D    0.6658              0.0068          0.6726    7,680          21,599
[32]          2014  LF    2.326               0.0009          2.3269    64             473
[33]          2015  LF    391.021             0.0004          391.02    2              16
[34]          2016  LF    19.0422             0.0105          19.052    8,100          60,860
Prop. LFALBP  -     LF    0.2314              0.0015          0.2329    96             168
Prop. LFHDG   -     LF    0.2286              0.0215          0.2501    24,300         18,122
Feature extraction typically has the largest impact on the overall presentation attack detection
complexity. The total processing time for the proposed LFALBP and LFHDG solutions is,
respectively, around 0.23 and 0.25 seconds per image, thus exhibiting the second and third lowest
computational complexity over all tested solutions. This represents a step forward in making light
field based solutions viable to detect face presentation attacks, notably considering their detection
performance gains.
The feature size results highlight that the proposed LFALBP based solution offers a very compact
representation, simplifying its storage, retrieval, and transmission. Concerning the LFHDG based
solution, capturing both orientation and magnitude variations for the angular information comes
at the cost of increasing the feature size, although it is not as large as the feature sizes obtained by
two of the benchmarking solutions.
10.4 Ear PAD Performance
This section reports the experimental work conducted to assess the proposed ear PAD solutions as
well as the selected benchmarking solutions, notably [232], [234] and [34], which were originally
proposed as face PAD solutions and are here used as ear PAD solutions. The assessment also
covers the generalization power of the ear PAD solutions to unknown attacks as well as their
computational complexity.
10.4.1 Ear PAD Accuracy Evaluation
The ear PAD performance evaluation considers the same 4-fold cross-validation strategy and the
same metrics as for face PAD, as discussed in Section 10.3.1. Therefore, for each experiment, the
ear PAD system is trained with ¾ of the IST LLFEADB database and tested with the remaining
¼. Regarding the number of available IST LLFEADB baseline set samples, the experiments used
the 266 bona fide images and 266 attack images for each attack type, for a total of 1064 attack
images. To have a balanced training, 200 bona fide and 200 attack images (randomly selecting 50
images from each attack type) were considered to train the classifiers. Testing was performed with
a non-overlapping set of 66 bona fide and 66 attack images. As the attack images are randomly
selected from each attack type, the experiments have been repeated 50 times and the average
results over the 50 runs are reported to provide a more stable performance estimation. As the
numbers of bona fide and attack samples in the IST LLFEADB extended set are, respectively, 60
and 240, 45 bona fide and 45 attack images (randomly selecting 12 images from each attack type)
were considered to train the classifiers and the average ACER results, obtained after 50 runs, are
reported.
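The balanced sampling described above for the baseline set can be sketched as follows. This is a minimal illustration under stated assumptions: samples are represented by arbitrary identifiers, the function name and its parameters are hypothetical, and only the training draw and the held-out bona fide samples are shown (how the 66 test attack images are drawn is not specified here).

```python
import random

def balanced_training_split(bona_fide_ids, attack_ids_by_type, n_bona_fide, n_per_type, seed):
    """Draw a balanced training set: n_bona_fide bona fide samples and
    n_per_type attack samples from each attack type (hypothetical helper)."""
    rng = random.Random(seed)
    train_bona = rng.sample(bona_fide_ids, n_bona_fide)
    train_attack = []
    for pai in sorted(attack_ids_by_type):          # fixed order for reproducibility
        train_attack.extend(rng.sample(attack_ids_by_type[pai], n_per_type))
    chosen = set(train_bona)
    heldout_bona = [b for b in bona_fide_ids if b not in chosen]
    return train_bona, train_attack, heldout_bona
```

With 266 bona fide samples, four attack types, `n_bona_fide=200` and `n_per_type=50`, each run yields 200 bona fide and 200 attack training samples and leaves 66 bona fide samples for testing; repeating the call with 50 different seeds and averaging the resulting error rates mirrors the 50-run protocol.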
The results obtained for the proposed IST LLFEADB low resolution baseline set are presented in
Table 10.6. These results show that the two proposed light field PAD solutions, based on LFALBP
and LFHDG, as well as one of the conventional 2D solutions [234], which exploits the joint color
texture information extracted by LPQ and the co-occurrence of adjacent local binary patterns,
achieve perfect or near perfect classification accuracy for all PAIs considered.
IST LLFEADB consists of a baseline and an extended set, with the difference between the two
sets being related to the settings used for bona fide image acquisition and thus the quality of the
samples displayed on PAIs. Results for the IST LLFEADB high resolution extended set are
presented in Table 10.7. In this case, the proposed LFHDG based PAD solution still achieves
perfect classification accuracy, while the proposed LFALBP based PAD solution and the
benchmarking solution [234] show a slight reduction in PAD performance, when compared to the
baseline set. In practice, the IST LLFEADB extended set provides a more challenging task than
the baseline set for ear PAD for two main reasons: i) the higher resolution of the extended set
images improves the quality of the samples displayed on the PAIs, thus making it more difficult
to distinguish artefacts from bona fide samples; ii) as the detection performance decreases when
training uses fewer samples, a slight performance degradation may also be explained by the
smaller size of the extended set.
Table 10.6: ACER ear PAD performance for the proposed and benchmarking solutions using
IST LLFEADB baseline set (minimum errors in bold).
Detection solution          Presentation Attack Instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      5.12%    9.29%    5.34%     4.80%     6.13%
[234]         2016  2D      0%       0.15%    0%        0%        0.04%
[34]          2016  LF      3.11%    2.09%    1.66%     0.57%     1.85%
Prop. LFALBP  -     LF      0.01%    0.02%    0%        0%        0.01%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%
Table 10.7: ACER ear PAD performance for the proposed and benchmarking solutions using
IST LLFEADB extended set (minimum errors in bold).
Detection solution          Presentation Attack Instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      5.84%    6.77%    4.62%     5.62%     5.71%
[234]         2016  2D      1.05%    2.42%    0.65%     0.29%     1.10%
[34]          2016  LF      0.45%    5.75%    4.55%     5.92%     2.74%
Prop. LFALBP  -     LF      0.20%    0.39%    0.18%     1.22%     0.49%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%
10.4.2 Ear PAD Generalization Accuracy Evaluation
When deploying an ear PAD solution, there is naturally no way to know with absolute certainty
what type of attacks will be performed. This Thesis investigates ear PAD generalization to
an unforeseen artefact type by training the solutions with all attack types available in the IST
LLFEADB excluding one, which is then used for testing, thus playing the role of an unknown
attack. Table 10.8 and Table 10.9 report the ACER generalization performance results obtained
with 50 runs for the IST LLFEADB baseline and extended sets, respectively.
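The leave-one-attack-type-out protocol described above can be sketched as a simple generator. The function name and the dictionary layout (attack type mapped to a list of samples) are assumptions made for illustration only.

```python
def leave_one_pai_out(attack_samples_by_type):
    """Yield (unknown_pai, train_attacks, test_attacks), holding out each attack type in turn."""
    for unknown in sorted(attack_samples_by_type):
        # Training attacks: every sample whose attack type is not the held-out one.
        train = [s
                 for pai, samples in sorted(attack_samples_by_type.items())
                 if pai != unknown
                 for s in samples]
        # Test attacks: only samples of the held-out (unknown) attack type.
        test = list(attack_samples_by_type[unknown])
        yield unknown, train, test
```

Each fold trains the classifier on the bona fide samples plus `train`, and evaluates it on bona fide samples plus `test`, so that the tested attack type has never been seen during training.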
Table 10.8: ACER ear PAD generalization performance for the proposed and benchmarking
solutions using IST LLFEADB baseline set (minimum errors in bold).
Detection solution          Unknown Presentation Attack Instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      7.98%    10.67%   10.64%    7.93%     9.30%
[234]         2016  2D      0%       1.23%    0%        0%        0.31%
[34]          2016  LF      16.85%   4.26%    8.03%     2.09%     7.80%
Prop. LFALBP  -     LF      0.01%    0.16%    0%        0%        0.04%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%
Table 10.9: ACER ear PAD generalization performance for the proposed and benchmarking
solutions using IST LLFEADB extended set (minimum errors in bold).
Detection solution          Unknown Presentation Attack Instrument
Ref.          Year  Type    Laptop   Tablet   Mobile 1  Mobile 2  Average
[232]         2015  2D      7.88%    13.49%   4.15%     10.11%    10.03%
[234]         2016  2D      1.84%    6.25%    0.67%     0.53%     2.32%
[34]          2016  LF      0.28%    6.72%    5.84%     6.15%     4.74%
Prop. LFALBP  -     LF      0.27%    0.43%    0.21%     2.22%     0.78%
Prop. LFHDG   -     LF      0%       0%       0%        0%        0%
The results show that the proposed LFHDG based ear PAD solution generalizes perfectly to the
unforeseen PAIs considered, for both the IST LLFEADB baseline and extended sets. Concerning
the proposed LFALBP based ear PAD solution, even though its performance slightly decreases
compared to the performance reported in Section 10.4.1, its generalization ability is superior to
that of the benchmarking solutions for most of the considered unknown PAIs. It should be noted
that, in the absence of ear artefact samples captured from 3D PAIs, for instance wrapped paper or
silicone ears, no significant performance degradation is expected in this generalization scenario,
unlike what happened for face PAD due to the inclusion of wrapped paper attacks.
This highlights the need for a more complete ear artefact database, notably including artefact
samples captured from 3D PAIs, to more exhaustively study ear PAD technology.
10.4.3 Ear PAD Computational Complexity
As PAD performance may increase at the cost of additional computational complexity, it is
important to assess this trade-off. Table 10.10 shows the average extraction and classification times
per image (in seconds) for the two proposed and the three benchmarking ear PAD solutions
considered. This table also summarizes the descriptor sizes for the various solutions. Time
measurements were performed on a standard 64-bit Intel PC with a 3.40 GHz processor and 16
GB RAM, running MATLAB R2015b. The proposed LFALBP and LFHDG based ear PAD
solutions exhibit the lowest computational complexity over all tested solutions, with a total
processing time around 49 and 218 milliseconds per image, respectively. Concerning the feature
vector size, the proposed LFALBP based solution offers a very compact representation; this is not
the case for the LFHDG based solution, as it needs a larger feature vector to capture both the
orientation and magnitude variations of the angular information.
Table 10.10: Average extraction and classification times, and feature vector size for the
proposed and benchmarking ear PAD solutions (minimum values in bold).
Solution                  Feature extraction  Classification  Total processing  No. of vector  Descriptor
Ref.          Year  Type  time (s)            time (s)        time (s)          elements/bins  size (bytes)
[232]         2015  2D    0.4188              0.1188          0.5376            369            339
[234]         2016  2D    0.3598              0.0095          0.3693            7,680          12,173
[34]          2016  LF    19.0351             0.0640          19.0991           12,420         93,494
Prop. LFALBP  -     LF    0.0489              0.0002          0.0492            96             158
Prop. LFHDG   -     LF    0.1016              0.1162          0.2178            37,260         274,475
In summary, the performance results and the computational complexity show that light field based
solutions not only achieve very effective and stable PAD performance, but can also offer lower
complexity, thus representing one step forward in biometric and forensic applications when
compared to regular 2D imaging PAD solutions.
Part IV. Conclusion
Chapter 11
Summary of Contributions
11.1 Introduction
Exploiting light field imaging sensors and the associated data for face and ear biometric
recognition and Presentation Attack Detection (PAD) tasks has been the main focus of this Thesis.
Following the research work developed, this Thesis has extensively reviewed the state-of-the-art
on face and ear recognition and PAD and has proposed several novel databases and solutions to
exploit the additional visual information available in a light field image. This chapter presents a
summary of the contributions, separately for recognition and PAD.
11.2 Light Field Face and Ear Recognition
To better understand the technological landscape in terms of recognition systems, this Thesis has
proposed a new multi-level taxonomy for face and ear recognition solutions, aiming to facilitate
their organization and categorization. The proposed multi-level
taxonomy considers four levels, notably face/ear structure, feature support, feature extraction
approach, and feature extraction sub-approach. Following the proposed taxonomy, a
comprehensive review of recent, representative and relevant face and ear recognition solutions has
been made. This Thesis has also reviewed the available face and ear databases, which are
instrumental for designing, testing and validating face and ear recognition solutions.
Next, two light field face and ear databases were developed in the context of this Thesis, thus
allowing more extensive benchmarking for face and ear recognition solutions exploiting light field
data:
1. IST-EURECOM LLFFD - The IST-EURECOM Lenslet Light Field Face Database
(IST-EURECOM LLFFD) has been created and made publicly available, including data from 100
subjects, with 20 samples per person, captured by a Lytro ILLUM lenslet camera. The
images were captured in a controlled acquisition setup with different facial variations,
including emotions, actions, poses, illuminations, and occlusions in order to benefit from the
non-intrusive nature of face recognition.
2. IST LLFEDB - Additionally, the IST-EURECOM Lenslet Light Field Ear DataBase (IST
LLFEDB) has been created, to make publicly available the first content allowing testing and
validating light field based ear recognition systems. The proposed ear database consists of 536
light field ear images from 67 subjects, with 8 image shots per person, captured with a Lytro
ILLUM lenslet camera, over two separate sessions, with four different poses per session. The
proposed database includes ear images partly occluded by ear piercings, earrings, hair and
combinations of these occlusions.
Subsequently, two light field face and ear recognition solutions based on hand-crafted descriptors
and five face recognition solutions based on fused deep/double-deep descriptors were developed,
evolving through progressive levels of functionality and recognition performance:
1. Face and Ear Recognition Based on Light Field Local Binary Patterns (LFLBP)
Descriptor - The first light field solution was proposed based on a novel, simple, yet effective,
hand-crafted descriptor, named Light Field Local Binary Patterns (LFLBP), able to exploit the
richer information available in light field images for face and ear recognition tasks.
2. Face and Ear Recognition Based on Light Field Histogram of Gradients (LFHG)
Descriptor - Another light field solution was proposed based on the fusion of a non-light field
based hand-crafted descriptor, the so-called Histogram of Oriented Gradients (HOG), with a
new light field based descriptor, called Light Field Histogram of Disparity Gradients
(LFHDG). The fused descriptor, named Light Field Histogram of Gradients (LFHG),
considered both the orientation and magnitude for the spatial and angular information, while
the LFLBP solution only captured the magnitude for the spatial and angular information.
3. Face Recognition Based on VGG 2D+Disparity+Depth (VGG-D3) Fused Deep Descriptor -
Recognizing the importance of deep learning in biometric recognition, the first deep learning
based solution was proposed for light field face recognition, based on a VGG
2D+Disparity+Depth (VGG-D3) fused deep descriptor. The fused deep VGG-D3 description is
obtained by the feature level fusion of deep descriptions extracted from 2D images as well as
the corresponding disparity and depth maps, using a VGG-Face descriptor, acknowledging that
disparity and depth maps may bring some complementary information to the recognition task.
4. Face Recognition Based on VGG + Conv-LSTM Double-Deep Descriptor - As the VGG-
D3 fused deep descriptor only processes light field central view data, notably using its rendered
texture and corresponding disparity and depth maps, a double-deep descriptor, based on VGG-
Face and conventional LSTM (Conv-LSTM) descriptors, was proposed, exploiting the multi-
perspective information available in a light field image. The proposed VGG + Conv-LSTM
double-deep descriptor extracts higher dimensional angular dependencies from different face
viewpoints rendered from a light field image.
5. Face Recognition Based on VGG + Light Field LSTM Double-Deep Descriptors - Finally,
three face recognition solutions based on three novel light field LSTM cell architectures, the
so-called Gate-Level Fusion (GLF-LSTM), State-Level Fusion (SLF-LSTM) and Sequential
Learning (SeqL-LSTM), were proposed, adopting joint learning of the light field horizontal
and vertical parallaxes. The proposed cell architectures have been integrated into a
spatio-angular learning framework for double-deep description, where an LSTM network adopting the
proposed light field LSTM cell architectures receives its inputs from a VGG-Face deep
descriptor applied to a set of horizontal and vertical 2D face viewpoint images, derived from a
light field image. These recognition solutions, named VGG + GLF-LSTM, VGG + SLF-
LSTM, and VGG + SeqL-LSTM double-deep descriptors, lead to richer spatio-angular
descriptions, compared to the proposed VGG + Conv-LSTM double-deep descriptor, for face
recognition.
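For readers unfamiliar with the local binary pattern idea underlying the LFLBP descriptor listed above, the following minimal sketch computes the histogram of the conventional 3x3 spatial LBP codes of a single grayscale image. It is the standard 2D operator only, shown as background; it is not the light field extension proposed in this Thesis, and the function name is hypothetical.

```python
import numpy as np

def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes for a 2D grayscale array."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    center = img[1:-1, 1:-1]                      # pixels that have all 8 neighbours
    # 8 neighbours, clockwise starting from the top-left one.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre pixel.
        codes |= (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

In a typical pipeline such histograms are computed over image blocks and concatenated into the final feature vector; applying the function to one rendered 2D view gives only the spatial part of the information available in a light field image.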
A comprehensive evaluation of state-of-the-art face and ear recognition solutions has been
conducted, including an analysis of the sensitivity of the proposed recognition solutions to the
available training data, both in terms of number of training samples and variations. The extensive
performance assessment showed the superiority of the proposed light field based solutions for both
face and ear recognition tasks, compared to appropriate state-of-the-art benchmarking recognition
solutions. In particular, the average recognition gains of the light field based solutions over their
corresponding 2D baseline solutions showed the added value of light field information for face
and ear recognition purposes. Among the proposed solutions, those based on the
VGG + SeqL-LSTM double-deep descriptor and the LFHG hand-crafted descriptor achieved the best
recognition performance for face and ear recognition tasks, respectively. The average
VGG + SeqL-LSTM double-deep descriptor performance obtained for the three challenging face
recognition evaluation protocols on IST-EURECOM LLFFD was 93.96%, a 10.11% gain over the
best performing benchmarking solution, the 2D VGG-Face descriptor. The proposed LFHG descriptor
achieved an average ear recognition performance of 88.2% on IST LLFEDB, a gain of 5.9%
over its 2D baseline, HOG, which was the best performing benchmarking
solution.
11.3 Light Field Biometric Presentation Attack Detection
This Thesis has also provided a comprehensive review of the recent advances in light field based
face PAD solutions and artefact databases, following a new, encompassing taxonomy for PAD
solutions. Reviewing ear PAD was not considered, as there were no ear artefact databases or ear
PAD solutions to be considered at the time of the writing of the Thesis. The proposed multi-level
taxonomy organized the face PAD solutions according to four main dimensions, notably user
interaction support, imaging sensor, contextual information and feature extraction.
Next, two light field face and ear artefact databases were developed in the context of this Thesis,
thus allowing more extensive benchmarking for face and ear PAD solutions exploiting light field
data:
1. IST LLFFSD - The IST Lenslet Light Field Face Spoofing Database (IST LLFFSD), captured
with a Lytro ILLUM lenslet camera, has been created and made publicly available. It consists
of 100 bona fide light field images and a set of artefact images, simulating six different types
of presentation attacks, including printed paper, wrapped printed paper, laptop, tablet and two
different mobile phones.
2. IST LLFEADB - Additionally, the first ear PAD database, the IST Lenslet Light Field Ear
Artefact Database (IST LLFEADB) has been created, and made publicly available. This is the
first database allowing testing and validating ear PAD systems, including both 2D and light
field ear artefact images, captured with a Lytro ILLUM lenslet camera. IST LLFEADB
contains two sets of light field images, differing in the settings used for bona fide image
acquisition, notably in the imaging resolution and number of samples. For both sets, four types
of PAIs, including a laptop, a tablet and two different mobile phones were used to create the
artefact samples.
Subsequently, two PAD solutions based on two novel light field hand-crafted descriptors were
proposed for both face and ear PAD, exploiting the variations associated with the different
directions of light captured in the light field images:
1. PAD Based on Light Field Angular Local Binary Patterns (LFALBP) Descriptor - The
first proposed PAD solution is based on a Light Field Angular Local Binary Patterns
(LFALBP) hand-crafted descriptor, capturing magnitude sign for the angular information.
2. PAD Based on Light Field Histogram of Disparity Gradients (LFHDG) Descriptor - The
second PAD solution is based on Light Field Histogram of Disparity Gradients (LFHDG)
hand-crafted descriptor, capturing both the orientation and magnitude variations for the angular
information available in a light field image.
A comprehensive evaluation of the proposed and benchmarking light field face and ear PAD
solutions has been performed, in terms of accuracy, generalization and complexity. The extensive
assessment showed that the proposed light field based PAD solutions not only achieve superior
PAD performance with higher robustness and generalization ability, but can also exhibit low
computational complexity, thus representing one step forward towards using light field
imaging solutions to effectively detect face and ear presentation attacks. The evaluation results
showed that the proposed light field based solutions achieved perfect or near perfect classification
accuracy for both face and ear PAD and all PAIs considered, whereas the best performing
benchmarking solution led to an average classification error rate above 2.5%.
Chapter 12
Future Research Directions
12.1 Introduction
The experimental work conducted in this Thesis has confirmed the added value of the richer
information available in light field images for face and ear recognition and PAD purposes.
However, there is still room for further developments and improvements, a selection of which is
discussed in this chapter.
12.2 Future Research Directions for Light Field Face and Ear Recognition
Some future research directions in terms of face and ear recognition include:
Unconstrained light field face and ear databases – The face and ear light field databases
proposed in the context of this Thesis were captured with a controlled acquisition setup, which
is a rather common and realistic scenario in business and industrial environments where the
images to be recognized are captured in, at least partly, constrained conditions. An important
future research direction may be to extend the light field face and ear database acquisitions,
considering not only constrained environments, but also the acquisition of images in
unconstrained conditions, thus introducing a high degree of variability and presenting more
challenging recognition conditions to the existing technologies. Examples include face and ear
images captured at different distances to the camera, with uncontrolled backgrounds, and
uncontrolled poses. The more difficult recognition scenarios represented by the images
contained in such unconstrained databases will provide challenges that can lead to significant
improvements in light field based face and ear recognition technologies.
Large-scale ear database – Deep learning based ear recognition solutions have not yet led to
a considerably superior performance over traditional solutions due to the lack of sufficient
training samples in the available datasets. This reveals a pressing need to gather large-scale
ear databases to obtain better deep classification models and thus more
accurate ear recognition.
Fusion of multiple imaging sensors – Although light field based recognition solutions
achieve the best performance for the tested databases, different imaging sensors, such as depth
and NIR cameras, may provide a valuable contribution when operating in unconstrained situations,
e.g., with varying illumination conditions. Therefore, recognition solutions considering
multiple sensors may contribute to enhance performance.
Face and ear recognition on mobile phones equipped with light field/multiple cameras –
Following the wide deployment of mobile authentication scenarios using facial images and the
availability of mobile devices equipped with light field/multiple cameras, as illustrated in
Figure 12.1, there is a strong need to design efficient recognition solutions, based on the
characteristics of the emerging light field/multiple cameras available in mobile phones.
Figure 12.1: Mobile phones equipped with multiple cameras.
12.3 Future Research Directions for Light Field Based Face and Ear
Presentation Attack Detection
While this Thesis demonstrates the efficacy of light field based face and ear PAD, some future
research directions include:
Light field artefact face mask and silicone ear databases – The current light field based face
and ear artefact databases cover all artefact types considered in the literature, except those
involving wearing 3D face masks and silicone ears, due to the high cost associated with
preparing good quality face mask and silicone ear PAIs. To fully explore the potential of light
field solutions to detect 3D presentation attacks, especially flexible thin-layered silicone face
masks that can be used in highly sensitive security scenarios such as a semi-supervised border
control scenario, more complete light field artefact databases should be created.
Deep learning based light field PAD – Deep learning PAD solutions based on conventional
cameras are among the most recent and promising approaches to detect presentation attacks
[261], [262], and [263]. However, the current publicly available light field artefact databases
do not provide enough information to train a deep network. The application of neural network
based solutions for light field images needs to be explored after the availability of more
comprehensive light field artefact databases.
Hardware implementation of PAD solutions – Recent works introduce hardware platforms
to implement hand-crafted descriptors, such as HOG [274] and LBP [275] and also classifiers
such as SVM [276], on FPGA and GPU. The hardware implementation of PAD systems, either
based on light field cameras or others, is a step forward towards designing fast solutions
operating in real-time.
PAD on mobile phones equipped with light field/multiple cameras – Similar to recognition
systems, there is a strong need to design efficient PAD solutions running on mobile platforms
due to the wide deployment of mobile authentication scenarios. As commercial light field
cameras, notably camera arrays, are getting cheaper and there are some efforts to equip mobile
devices with light field/multiple cameras (see Figure 12.1), designing efficient PAD solutions,
based on the characteristics of light field/multiple cameras for mobile phones, is becoming a
hot topic.
In summary, light field imaging technology may represent one step forward in biometric and
forensic applications when compared to conventional imaging sensors.
References
[1] A. Jain and A. Ross, "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems
for Video Technology, vol. 14, no. 1, pp. 4-20, Jan. 2004.
[2] A. Jain, K. Nandakumar and A. Ross, "50 years of biometric research: Accomplishments, challenges, and
opportunities," Pattern Recognition Letters, vol. 79, no. 1, pp. 80-105, Aug. 2016.
[3] M. Günther, L. El Shafey and S. Marcel, "2D face recognition: An experimental and reproducible research
survey," Technical Report Idiap-RR-13, Martigny, Switzerland, Apr. 2017.
[4] A. Goldstein, L. Harmon and A. Lesk, "Identification of human faces," Proceedings of the IEEE, vol. 59, no.
5, pp. 748-760, May 1971.
[5] Z. Emeršič, V. Štruc and P. Peer, "Ear recognition: More than a survey," Neurocomputing, vol. 255, no. 1, pp.
26-39, Sep. 2017.
[6] A. Pflug, "Ear recognition: Biometric identification using 2- and 3-dimensional images of human ears," PhD
thesis in Information Security, Faculty of Computer Science and Media Technology Gjøvik University
College, Gjøvik, Norway, Jun. 2016.
[7] K. Chang, K. Bowyer, S. Sarkar and B. Victor, "Comparison and combination of ear and face images in
appearance-based biometrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no.
9, pp. 1160-1165, Sep. 2003.
[8] Z. Huang, Y. Liu, C. Li, M. Yang and L. Chen, "A robust face and ear based multimodal biometric system
using sparse representation," Pattern Recognition, vol. 46, no. 8, pp. 2156-2168, Aug. 2013.
[9] N. Hezil and A. Boukrouche, "Multimodal biometric recognition using human ear and palmprint," IET
Biometrics, vol. 6, no. 5, pp. 351-359, Aug. 2017.
[10] M. Monwar and M. Gavrilova, "FES: A system for combining face, ear and signature biometrics using rank
level fusion," in International Conference on Information Technology: New Generations, Las Vegas, NV,
USA, Apr. 2008.
[11] ISO/IEC 30107, "Information technology—Presentation attack detection—Part 1: Framework," International
Organization for Standardization, Jan. 2016.
[12] L. Li, P. Correia and A. Hadid, "Face recognition under spoofing attacks: Countermeasures and research
directions," IET Biometrics, vol. 7, no. 1, pp. 3-14, Jan. 2018.
[13] R. Ramachandra and C. Busch, "Presentation attack detection methods for face recognition systems: A
comprehensive survey," ACM Computing Surveys, vol. 50, no. 1, pp. 801-837, Apr. 2017.
[14] J. Komulainen, "Software-based countermeasures to 2D facial spoofing attacks," PhD thesis in Department of
Computer Science and Engineering, Infotech Oulu, University of Oulu, Oulu, Finland, Aug. 2015.
[15] D. Ngo, A. Teoh and J. Hu, Biometric security, Newcastle, UK: Cambridge Scholars Publishing, Feb. 2015.
[16] G. Goudelis, A. Tefas and I. Pitas, "Emerging biometric modalities: A survey," Journal on Multimodal User
Interfaces, vol. 2, no. 4, pp. 217-235, Dec. 2008.
[17] G. Goswami, M. Vatsa and R. Singh, "RGB-D face recognition with texture and attribute features," IEEE
Transactions on Information Forensics and Security, vol. 9, no. 10, pp. 1629-1640, Jul. 2014.
[18] R. Min, N. Kose and J. Dugelay, "KinectFaceDB: A Kinect database," IEEE Transactions on Systems, Man,
and Cybernetics: Systems, vol. 44, no. 11, pp. 1534-1548, Jul. 2014.
[19] X. Zhang, L. Yin and J. Cohn, "BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database," Image
and Vision Computing, vol. 32, no. 1, pp. 692-706, Oct. 2014.
[20] N. Erdogmus and S. Marcel, "Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect,"
in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington,
VA, USA, Sep. 2013.
[21] N. Erdogmus and S. Marcel, "Spoofing face recognition with 3D masks," IEEE Transactions on Information
Forensics and Security, vol. 9, no. 7, pp. 1084-1097, Jul. 2014.
[22] D. Yi, Z. Lei, Z. Zhang and S. Li, "Face anti-spoofing: Multi-spectral approach," in Handbook of Biometric
Anti-Spoofing, London, Springer-Verlag, Jul. 2014, pp. 83-102.
[23] I. Chingovska, N. Erdogmus, A. Anjos and S. Marcel, "Face recognition systems under spoofing attacks," in
Face Recognition Across the Imaging Spectrum, NY, Springer International Publishing, Feb. 2016, pp. 165-
194.
[24] M. Levoy and P. Hanrahan, "Light field rendering," in 23rd annual conference on Computer graphics and
interactive techniques, New York, NY, USA, Aug. 1996.
[25] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz and P. Hanrahan, "Light field photography with a hand-held
plenoptic camera," Tech Report CSTR 2005-02, Stanford, CA, USA, Feb. 2005.
[26] "Lytro website," Lytro Inc, [Online]. Available: https://www.lytro.com. [Accessed Nov. 2018].
[27] R. Raghavendra, K. Raja and C. Busch, "Exploring the usefulness of light field cameras for biometrics: An
empirical study on face and iris recognition," IEEE Transactions on Information Forensics and Security, vol.
11, no. 5, pp. 922-936, May 2016.
[28] R. Raghavendra, B. Yang, K. Raja and C. Busch, "A new perspective - Face recognition with light-field
camera," in International Conference on Biometrics, Madrid, Spain, Jun. 2013.
[29] R. Raghavendra, K. Raja, B. Yang and C. Busch, "Multi-face recognition at a distance using light-field
camera," in International Conference on Intelligent Information Hiding and Multimedia Signal Processing,
Beijing, China, Jul. 2013.
[30] R. Raghavendra, K. Raja, B. Yang and C. Busch, "Comparative evaluation of super-resolution techniques for
multi-face recognition using light-field camera," in IEEE International Conference on Digital Signal
Processing, Santorini, Greece, Jul. 2013.
[31] T. Shen, H. Fu and J. Chen, "Facial expression recognition using depth map estimation of light field camera,"
in IEEE International Conference on Signal Processing, Communications and Computing, Hong Kong, China,
Aug. 2016.
[32] S. Kim, Y. Ban and S. Lee, "Face liveness detection using a light field camera," Sensors, vol. 14, no. 12, pp.
71-99, Nov. 2014.
[33] R. Raghavendra, K. Raja and C. Busch, "Presentation attack detection for face recognition using light field
camera," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1060-1075, Mar. 2015.
[34] Z. Ji, H. Zhu and Q. Wang, "LFHOG: A discriminative descriptor for live face detection from light field
image," in IEEE International Conference on Image Processing, Phoenix, AZ, USA, Sep. 2016.
[35] A. Sepas-Moghaddam, V. Chiesa, P. Correia, F. Pereira and J. Dugelay, "The IST-EURECOM light field face
database," in International Workshop on Biometrics and Forensics, Coventry, UK, Apr. 2017.
[36] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Ear recognition in a light field imaging framework: A new
perspective," IET Biometrics, vol. 7, no. 3, pp. 224-231, May 2018.
[37] A. Sepas-Moghaddam, P. Correia and F. Pereira, "Light field local binary patterns description for face
recognition," in IEEE International Conference on Image Processing, Beijing, China, Sep. 2017.
[38] O. Parkhi, A. Vedaldi and A. Zisserman, "Deep face recognition," in British Machine Vision Conference,
Swansea, UK, Sep. 2015.
[39] A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund and F. Pereira, "Light field based face
recognition via a fused deep representation," in IEEE International Workshop on Machine Learning for Signal
Processing, Aalborg, Denmark, Sep. 2018.
[40] A. Sepas-Moghaddam, P. Correia, K. Nasrollahi, T. Moeslund and F. Pereira, "A double-deep spatio-angular
learning framework for light field based face recognition," Submitted to IEEE Transactions on Circuits and
Systems for Video Technology, Oct. 2018.
[41] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Light field long short-term memory: Novel cell architectures
with application to face recognition," Submitted to Pattern Recognition Letters, Oct. 2018.
[42] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Light field based face presentation attack detection:
Reviewing, benchmarking and one step further," IEEE Transactions on Information Forensics and Security,
vol. 13, no. 7, pp. 1696-1709, Jul. 2018.
[43] A. Sepas-Moghaddam, L. Malhadas, P. Correia and F. Pereira, "Face spoofing detection using a light field
imaging framework," IET Biometrics, vol. 7, no. 1, pp. 39-48, Jan. 2018.
[44] A. Sepas-Moghaddam, F. Pereira and P. Correia, "Ear presentation attack detection: Benchmarking study with
first lenslet light field database," in European Signal Processing Conference, Rome, Italy, Sep. 2018.
[45] G. Lippmann, "Épreuves réversibles. Photographies intégrales," Comptes Rendus de l'Académie des Sciences,
vol. 13, no. 9, pp. 245-254, Jan. 1908.
[46] A. Gershun, "The light field," Journal of Mathematics and Physics, vol. 18, no. 1, pp. 51-151, Apr. 1939.
[47] E. Adelson and J. Bergen, "The plenoptic function and the elements of early vision," in Computational Models
of Visual Processing, MA, USA, MIT Press, Oct. 1991, pp. 3-20.
[48] S. Gortler, R. Grzeszczuk, R. Szeliski and M. Cohen, "The lumigraph," in Annual Conference on Computer
Graphics and Interactive Techniques, New Orleans, LA, USA, Aug. 1996.
[49] D. Dansereau, "Plenoptic signal processing for robust vision in field robotics," PhD Thesis in Mechatronic
Engineering, Queensland University of Technology, Queensland, Australia, 2014.
[50] Light, "The Light L16 Camera," [Online]. Available: https://light.co/camera. [Accessed Nov. 2018].
[51] B. Wilburn, "High performance imaging using arrays of inexpensive cameras," PhD Thesis in Department of
Electrical Engineering, Stanford University, Stanford, CA, USA, Dec. 2004.
[52] ISO/IEC JTC 1/SC 29/WG 1, "JPEG Pleno call for proposals on light field coding," ISO/IEC, Geneva,
Switzerland, Jan. 2017.
[53] Z. Yu, J. Yu, A. Lumsdaine and T. Georgiev, "An analysis of color demosaicing in plenoptic cameras," in
IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012.
[54] T. Georgiev and A. Lumsdaine, "The multi-focus plenoptic camera," in SPIE Electronic Imaging, Burlingame,
CA, USA, Jan. 2012.
[55] A. Lumsdaine and T. Georgiev, "The focused plenoptic camera," in IEEE International Conference on
Computational Photography, San Francisco, CA, USA, Aug. 2010.
[56] C. Perwass and L. Wietzke, "Single lens 3D-camera with extended depth-of-field," in Human Vision and
Electronic Imaging, Burlingame, CA, USA, Jan. 2012.
[57] "Raytrix," Raytrix GmbH, [Online]. Available: https://www.raytrix.de/. [Accessed Nov. 2018].
[58] D. Dansereau, "Light Field Toolbox V. 0.4," [Online]. Available:
http://www.mathworks.com/matlabcentral/fileexchange/49683-light-field-toolbox-v0-4. [Accessed Nov.
2018].
[59] D. Dansereau, "Plenoptic Signal Processing for Robust Vision in Field Robotics," PhD Thesis in Mechatronic
Engineering, Queensland University of Technology, Queensland, Australia, Jan. 2014.
[60] W. Zhao, R. Chellappa, J. Phillips and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing
Surveys, vol. 35, no. 4, pp. 399-458, Dec. 2003.
[61] E. Hjelmas and B. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no.
3, pp. 236-274, Sep. 2001.
[62] I. Marqués, "Face recognition algorithms," Thesis in Computer Engineering, University of the Basque
Country, Vizcaya, Spain, Jun. 2010.
[63] R. Jafri and R. Arabnia, "A survey of face recognition techniques," Journal of Information Processing Systems,
vol. 5, no. 2, pp. 41-68, Jun. 2009.
[64] S. Li and A. Jain, Handbook of face recognition, London, UK: Springer-Verlag London, 2011.
[65] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp.
71-86, Jan. 1991.
[66] M. Bartlett, J. Movellan and T. Sejnowski, "Face recognition by independent component analysis," IEEE Transactions
on Neural Networks, vol. 13, no. 6, pp. 1450-1464, Nov. 2002.
[67] S. Ahmadkhani and P. Adibi, "Face recognition using supervised probabilistic principal component analysis
mixture model in dimensionality reduction without loss framework," IET Computer Vision, vol. 10, no. 3, pp.
193-201, Mar. 2016.
[68] J. Ye, R. Janardan and Q. Li, "GPCA: an efficient dimension reduction scheme for image compression and
retrieval," in ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA,
USA, Aug. 2004.
[69] L. Wiskott, J. Fellous, N. Kruger and C. Von der Malsburg, "Face recognition by elastic bunch graph
matching," in International Conference on Image Processing, Santa Barbara, CA, USA, Oct. 1997.
[70] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063-1074, Sep. 2003.
[71] K. Huang, D. Dai, C. Ren and Z. Lai, "Learning kernel extended dictionary for face recognition," IEEE
Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1082-1094, May 2017.
[72] T. Zhang, B. Wang, F. Li and Z. Zhang, "Decision pyramid classifier for face recognition under complex
variations using single sample per person," Pattern Recognition, vol. 64, no. 1, pp. 305-313, Apr. 2017.
[73] C. Zhou, L. Wang, Q. Zhang and X. Wei, "Face recognition based on PCA and logistic regression analysis,"
Optik, vol. 125, no. 20, pp. 5916-5919, Oct. 2014.
[74] H. Li, F. Shen, C. Shen, Y. Yang and Y. Gao, "Face recognition using linear representation ensembles," Pattern
Recognition, vol. 59, no. 1, pp. 72-87, Nov. 2016.
[75] Z. Wu, Y. Wang and G. Pan, "3D face recognition using local shape map," in International Conference on Image
Processing, Singapore, Singapore, Oct. 2004.
[76] T. Ahonen, A. Hadid and M. Pietikainen, "Face description with local binary patterns: Application to face
recognition," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 28, no. 12, pp. 2037-2041,
Dec. 2006.
[77] T. Ahonen, E. Rahtu, V. Ojansivu and J. Heikkila, "Recognition of blurred faces using local phase
quantization," in International Conference on Pattern Recognition, Tampa, FL, USA, Dec. 2008.
[78] M. Xi, L. Chen, D. Polajnar and W. Tong, "Local binary pattern network: A deep learning approach for face
recognition," in International Conference on Image Processing, Phoenix, AZ, USA, Sep. 2016.
[79] N. Werghi, S. Berretti and A. Bimbo, "The Mesh-LBP: A framework for extracting local binary patterns from
discrete manifolds," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 220-235, Jan. 2015.
[80] A. Ross and A. Jain, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13, pp. 2115-2125,
Sep. 2003.
[81] M. Grgic and K. Delac, "Face Recognition Homepage," [Online]. Available: http://www.face-rec.org/databases/.
[Accessed Nov. 2018].
[82] C. McCool, "Bi-Modal Person Recognition on a Mobile Phone: using mobile phone data," in IEEE
International Conference on Multimedia and Expo Workshops, Melbourne, Australia, 2012.
[83] M. Grgic, K. Delac and S. Grgic, "SCface – surveillance cameras face database," Multimedia Tools and
Applications, vol. 51, no. 1, pp. 863-879, Feb. 2011.
[84] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in IEEE
Workshop on Applications of Computer Vision, Sarasota, FL, USA, Dec. 1994.
[85] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report, Columbus, OH, USA, Jun. 1998.
[86] A. Georghiades, P. Belhumeur and D. Kriegman, "From few to many: Illumination cone models for face
recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 23, no. 6, pp. 643-660, Aug. 2001.
[87] J. Phillips, H. Moon, S. Rizvi and P. Rauss, "The FERET evaluation methodology for face-recognition
algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090-1104,
Aug. 2000.
[88] C. Thomaz and G. Giraldi, "A new ranking method for principal components analysis and its application to
face image analysis," Image and Vision Computing, vol. 28, no. 6, pp. 902-913, Jun. 2010.
[89] C. Conde, "Multimodal 2D, 2.5D & 3D face verification," in International Conference on Image Processing,
Atlanta, GA, USA, Oct. 2006.
[90] G. Huang, M. Ramesh, T. Berg and E. Learned-Miller, "Labeled faces in the wild: A database for studying
face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report 07-
49, Amherst, MA, USA, Oct. 2007.
[91] A. Savran, N. Alyüz and H. Dibeklioğlu, "Bosphorus database for 3D face analysis," in BIOID 2008, LNCS
5372, Berlin, Germany, Springer-Verlag, 2008, pp. 47-56.
[92] R. Gross, I. Matthews, J. Cohn, T. Kanade and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28,
no. 1, pp. 807-813, May 2010.
[93] S. Gupta et al., "Texas 3D Face Recognition Database," in IEEE Southwest Symposium on Image Analysis
& Interpretation, Austin, TX, USA, 2010.
[94] L. Wolf, T. Hassner and I. Maoz, "Face recognition in unconstrained videos with matched background
similarity," in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA,
Aug. 2011.
[95] C. Cao, Y. Weng, S. Zhou and Y. Tong, "FaceWarehouse: A 3D facial expression database for visual
computing," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 3, pp. 413-425, Mar.
2014.
[96] B. Klare, B. Klein, E. Taborsky, A. Blanton and A. Jain, "Pushing the frontiers of unconstrained face detection
and recognition: IARPA Janus Benchmark A," in IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, Jun. 2015.
[97] N. Zhang, M. Paluri, Y. Taigman, R. Fergus and L. Bourdev, "Beyond frontal faces: Improving person
recognition using multiple cues," arXiv:1501.05703, Jan. 2015.
[98] R. Raghavendra, K. Raja and C. Busch, "Exploring the usefulness of light field cameras," IEEE Transactions
on Information Forensics and Security, vol. 11, no. 5, pp. 922-936, May 2016.
[99] A. M, "The specs on face dataset," York University, Toronto, Ontario, Canada, Jan. 2017.
[100] V. Kushwaha, M. Singh, R. Singh, M. Vatsa, N. Ratha and R. Chellappa, "Disguised faces in the wild,"
in International Conference on Computer Vision and Pattern Recognition Workshop, Salt Lake City, UT, USA,
Jun. 2018.
[101] J. Wang, N. Le, J. Lee and C. Wang, "Color face image enhancement using adaptive singular value
decomposition in Fourier domain for face recognition," Pattern Recognition, vol. 57, no. 1, pp. 31-49, Sep.
2016.
[102] S. Hu, X. Lu, M. Ye and W. Zeng, "Singular value decomposition and local near neighbors for face
recognition," Pattern Recognition, vol. 64, no. 1, pp. 60-83, Apr. 2017.
[103] C. Ding and D. Tao, "Pose-invariant face recognition with homography-based normalization," Pattern
Recognition, vol. 66, no. 1, pp. 144-152, Jun. 2017.
[104] G. Hu, F. Yan, C. Chan, W. Deng, W. Christmas, J. Kittler and N. Robertson, "Face recognition using a unified
3D morphable model," in European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.
[105] W. Su, C. Hsu, C. Lin and W. Lin, "Supervised-learning based face hallucination for enhancing face
recognition," in International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar.
2016.
[106] Z. Dong, M. Pei and Y. Jia, "Orthonormal dictionary learning and its application to face recognition," Image
and Vision Computing, vol. 51, no. 1, pp. 13-21, Jul. 2016.
[107] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: closing the gap to human-level performance in
face verification," in Computer Vision and Pattern Recognition, Columbus, OH, USA, Jun. 2014.
[108] Y. Sun, D. Liang, X. Wang and X. Tang, "DeepID3: Face recognition with very deep neural networks,"
arXiv:1502.00873, Feb. 2015.
[109] Y. Lee, G. Chen, C. Tseng and S. Lai, "Accurate and robust face recognition from RGB-D images with a deep
learning approach," in British Machine Vision Conference, York, UK, Sep. 2016.
[110] X. Liu, L. Song, X. Wu and T. Tan, "Transferring deep representation for NIR-VIS heterogeneous face
recognition," in International Conference on Biometrics, Halmstad, Sweden, Aug. 2016.
[111] C. Reale, N. Nasrabadi, H. Kwon and R. Chellappa, "Seeing the forest from the trees: A holistic approach to
near-infrared heterogeneous face recognition," in IEEE Conference on Computer Vision and Pattern
Recognition Workshops, Las Vegas, NV, USA, Jul. 2016.
[112] Y. Lee, J. Chen, C. Tseng and S. Lai, "Accurate and robust face recognition from RGB-D images with a deep
learning approach," in British Machine Vision Conference, York, UK, Sep. 2016.
[113] X. Wu, L. Song, R. He and T. Tan, "Coupled deep learning for heterogeneous face recognition,"
arXiv:1704.02450, Apr. 2017.
[114] R. He, X. Wu, Z. Sun and T. Tan, "Learning invariant deep representation for NIR-VIS face recognition," in
AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, Feb. 2017.
[115] J. Lezama, Q. Qiu and G. Sapiro, "Not afraid of the dark: NIR-VIS face recognition via cross-spectral
hallucination and low-rank embedding," in IEEE Conference on Computer Vision and Pattern Recognition,
Honolulu, HI, USA, Jul. 2017.
[116] I. Masi, F. Chang, P. Natarajan and R. Nevatia, "Learning Pose-Aware Models for Pose-Invariant Face
Recognition in the Wild," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, Jan.
2018.
[117] X. Wu, R. He, Z. Sun and T. Tan, "A lightened CNN for deep face representation," IEEE Transactions on
Information Forensics and Security, in press, Jan. 2018.
[118] K. Grm, V. Struc, A. Artiges, M. Caron and H. Ekenel, "Strengths and weaknesses of deep learning models
for face recognition against image degradations," IET Biometrics, vol. 7, no. 1, pp. 81-89, Jan. 2018.
[119] O. Deniz, G. Bueno, J. Salido and F. De la Torre, "Face recognition using histograms of oriented gradients,"
Pattern Recognition Letters, vol. 32, no. 12, pp. 1598-1603, Sep. 2011.
[120] A. Aissaoui, J. Martinet and C. Djeraba, "DLBP: A novel descriptor for depth image based face recognition,"
in IEEE International Conference on Image Processing, Paris, France, Oct. 2014.
[121] L. Liu, P. Fieguth, G. Zhao, M. Pietikäinen and D. Hu, "Extended local binary patterns for face recognition,"
Information Sciences, vol. 358, no. 1, pp. 56-72, Sep. 2016.
[122] T. Schlett, C. Rathgeb and C. Busch, "A binarization scheme for face recognition based on multi-scale block
local binary patterns," in International Conference of the Biometrics Special Interest Group, Darmstadt,
Germany, Nov. 2016.
[123] X. Chen, F. Hu, Z. Liu, Q. Huang and J. Zhang, "Multi-resolution elongated CS-LDP with Gabor feature for
face recognition," International Journal of Biometrics, vol. 8, no. 1, pp. 19-32, Jan. 2016.
[124] W. Yang, Z. Wang and B. Zhang, "Face recognition using adaptive local ternary patterns method,"
Neurocomputing, vol. 213, no. 1, pp. 183-190, Nov. 2016.
[125] L. Tian, C. Fan and Y. Ming, "Multiple scales combined principle component analysis," Journal of Electronic
Imaging, vol. 25, no. 2, pp. 3025-3041, Apr. 2016.
[126] Z. Li, D. Gong, X. Li and D. Tao, "Aging face recognition: A hierarchical learning model based on local
patterns selection," IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2146-2154, May 2016.
[127] J. Li, Z. Chen and C. Liu, "Low-resolution face recognition of multi-scale blocking CS-LBP and weighted
PCA," International Journal of Pattern Recognition and Artificial Intelligence, vol. 30, no. 9, pp. 6005-6018,
Sep. 2016.
[128] J. Zhang, Y. Deng, Z. Guo and Y. Chen, "Face recognition using part-based dense sampling local features,"
Neurocomputing, vol. 184, no. 1, pp. 176-187, Apr. 2016.
[129] C. Li, W. Wei, J. Wang, W. Tang and S. Zhao, "Face recognition based on deep belief network combined with
center-symmetric local binary pattern," in Advanced Multimedia and Ubiquitous Engineering, Singapore,
Singapore, Springer, Aug. 2016, pp. 277-283.
[130] Z. Lu and L. Zhang, "Face recognition algorithm based on discriminative dictionary learning and sparse
representation," Neurocomputing, vol. 174, no. 2, pp. 749-755, Jan. 2016.
[131] L. Tran and X. Liu, "Nonlinear 3D face morphable model," arXiv:1804.03786, Apr. 2018.
[132] O. Nikisins, K. Nasrollahi, M. Greitans and T. Moeslund, "RGB-D-T based face recognition," in International
Conference on Pattern Recognition, Stockholm, Sweden, Dec. 2014.
[133] A. Nigam, G. Chhalotre and P. Gupta, "Pose and illumination invariant face recognition using binocular stereo
3D reconstruction," in Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics,
Patna, India, Dec. 2015.
[134] J. Li, N. Sang and C. Gao, "Face recognition with Riesz binary pattern," Digital Signal Processing, vol. 51,
no. 1, pp. 196-201, Apr. 2016.
[135] Y. Wang, S. Yu, W. Li, L. Wang and Q. Liao, "Face recognition with local contourlet combined patterns," in
International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, May 2016.
[136] A. Fathi, P. Alirezazadeh and F. Abdali-Mohammadi, "A new Global-Gabor-Zernike feature descriptor and its
application to face recognition," Journal of Visual Communication and Image Representation, vol. 38, no. 1,
pp. 65-72, Jul. 2016.
[137] C. Ding, J. Choi, D. Tao and L. Davis, "Multi-directional multi-level dual-cross patterns for robust face
recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 518-531,
Mar. 2016.
[138] T. Freitas, P. Alves, J. Monteiro and J. Cardoso, "A comparative analysis of deep and shallow features for
multimodal face recognition in a novel RGB-D-IR dataset," in International Symposium on Visual Computing,
Las Vegas, NV, USA, Dec. 2016.
[139] Y. Bi, M. Lv, Y. Wei, N. Guan and W. Yi, "Multi-feature fusion for thermal face recognition," Infrared Physics
& Technology, vol. 77, no. 1, pp. 366-374, Jul. 2016.
[140] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Li and T. Hospedales, "When face recognition meets with
deep learning: An evaluation of convolutional neural networks for face recognition," in IEEE International
Conference on Computer Vision Workshop, Santiago, Chile, Dec. 2015.
[141] M. Mehdipour Ghazi and H. Ekenel, "A comprehensive analysis of deep learning based representation for face
recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV,
USA, Jul. 2016.
[142] A. Krizhevsky, I. Sutskever and G. Hinton, "Imagenet classification with deep convolutional neural networks,"
in International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, Dec. 2012.
[143] X. Wu, R. He, Z. Sun and T. Tan, "A light CNN for deep face representation with noisy labels,"
arXiv:1511.02683, Ithaca, NY, USA, Apr. 2017.
[144] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy
with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, Ithaca, NY, USA, Nov. 2016.
[145] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the inception architecture for
computer vision," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
Jun. 2016.
[146] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv
preprint arXiv:1409.1556, Ithaca, NY, USA, Apr. 2015.
[147] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, Dec.
2015.
[148] F. Wang, X. Xiang, J. Cheng and A. Yuille, "NormFace: L2 hypersphere embedding for face verification,"
arXiv:1704.06369, Apr. 2017.
[149] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on
Computer Vision and Pattern Recognition, San Diego, CA, USA, Jul. 2005.
[150] A. Ross, "An introduction to multibiometrics," in European Signal Processing Conference, Poznan, Poland,
Sep. 2007.
[151] P. Auscher, T. Coulhon, X. Duong and S. Hofmann, "Riesz transform on manifolds and heat kernel regularity,"
Annales Scientifiques de l’École Normale Supérieure, vol. 37, no. 6, pp. 911-957, Dec. 2004.
[152] M. Wang and W. Deng, "Deep Face Recognition: A Survey," arXiv:1804.06655, Apr. 2018.
[153] H. Lu, K. Plataniotis and A. Venetsanopoulos, "MPCA: Multilinear principal component analysis of tensor
objects," IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 18-39, Jan. 2008.
[154] Z. Mu and L. Yuan, "Introduction to USTB ear image databases," Ear recognition Lab at University of Science
& Technology Beijing, 2004. [Online]. Available: http://www1.ustb.edu.cn/resb/en/visit/visit.htm.
[155] A. Kumar and C. Wu, "Automated human identification using ear imaging," Pattern Recognition, vol. 45, no.
3, pp. 956-968, Mar. 2012.
[156] E. González Sánchez, "Biometric analysis of the ears," Ph.D. thesis, Universidad de Las Palmas, Gran Canaria,
Spain, Sep. 2008.
[157] D. Frejlichowski and N. Tyszkiewicz, "The west pomeranian university of technology ear database – A tool
for testing biometric algorithms," in International Conference Image Analysis and Recognition, Póvoa de
Varzim, Portugal, Jun. 2010.
[158] Z. Xiaoxun and J. Yunde, "Symmetrical null space LDA for face and ear recognition," Neurocomputing, vol.
70, no. 4, pp. 842-848, Jan. 2007.
[159] B. Zhang, Z. Mu, C. Li and H. Zeng, "Robust classification for occluded ear via Gabor scale feature-based
non-negative sparse representation," Optical Engineering, vol. 53, no. 6, pp. 1-11, Jun. 2014.
[160] A. Tharwat, A. Ibrahim, A. Hassanien and G. Schaefer, "Ear recognition using block-based principal
component analysis and decision fusion," in International Conference on Pattern Recognition and Machine
Intelligence, Warsaw, Poland, Jun. 2015.
[161] I. Naseem, R. Togneri and M. Bennamoun, "Sparse representation for ear biometrics," in International
Symposium on Visual Computing, Las Vegas, NV, USA, Dec. 2016.
[162] D. Watabe, H. Sai, T. Ueda and K. Sakai, "ICA, LDA, and Gabor jets for robust ear recognition, and jet space
similarity for ear detection," International Journal of Intelligent Computing in Medical Sciences & Image
Processing, vol. 3, no. 1, pp. 9-29, Feb. 2013.
[163] B. Moreno, A. Sanchez and J. Velez, "On the use of outer ear images for personal identification in security
applications," in International Carnahan Conference on Security Technology, Madrid, Spain, Oct. 1999.
[164] Z. Mu, L. Yuan, Z. Xu, D. Xi and S. Qi, "Shape and structural feature based ear recognition," in 5th Chinese
conference on Advances in Biometric Person Authentication, Guangzhou, China, Dec. 2004.
[165] M. Burge and W. Burger, "Ear Biometrics for Machine Vision," in Workshop of the Austrian Association for
Pattern Recognition, New York, Springer, Sep. 1997, pp. 273-285.
[166] T. Theoharis, G. Passalis and G. Toderici, "Unified 3D face and ear recognition using wavelets on geometry
images," Pattern Recognition, vol. 41, no. 3, pp. 796-804, Mar. 2008.
[167] M. Choraś, "Perspective methods of human identification: Ear biometrics," Opto-Electronics Review, vol. 16,
no. 1, pp. 85-96, Mar. 2008.
[168] I. Omara, F. Li, H. Zhang and W. Zuo, "A novel geometric feature extraction method for ear recognition,"
Expert Systems With Applications, vol. 65, no. 1, pp. 127-135, Dec. 2016.
[169] Y. Zhou and S. Zafeiriou, "Deformable models of ears in-the-wild for alignment and recognition," in IEEE
International Conference on Automatic Face & Gesture Recognition, Washington, DC, USA, Jun. 2017.
[170] Z. Emeršič, D. Štepec, V. Štruc and P. Peer, "Training convolutional neural networks with limited training data
for ear recognition in the wild," in IEEE International Conference on Automatic Face & Gesture Recognition,
Washington, DC, USA, Jun. 2017.
[171] Z. Emeršič, D. Štepec, V. Štruc and H. Ekenel, "The unconstrained ear recognition challenge," in IEEE
International Joint Conference on Biometrics, Denver, CO, USA, Oct. 2017.
[172] F. Eyiokur, D. Yaman and H. Ekenel, "Domain adaptation for ear recognition using deep convolutional neural
networks," IET Biometrics, vol. 7, no. 3, pp. 199-206, May 2018.
[173] Y. Guo, G. Zhao, M. Pietikäinen and Z. Xu, "A new Gabor phase difference pattern for face and ear
recognition," in International Conference on Computer Analysis of Images and Patterns, Münster, Germany,
Sep. 2009.
[174] N. Damer and B. Führer, "Ear recognition using multi-scale histogram of oriented gradients," in International
Conference on Intelligent Information Hiding and Multimedia Signal Processing, Piraeus, Greece, Jul. 2012.
[175] A. Morales, M. Ferrer, M. Diaz-Cabrera and E. González, "Analysis of local descriptors features and its
robustness applied to ear recognition," in International Carnahan Conference on Security Technology,
Medellin, Colombia, Oct. 2013.
[176] A. Pflug, P. Paul and C. Busch, "A comparative study on texture and surface descriptors for ear biometrics,"
in International Carnahan Conference on Security Technology, Rome, Italy, Dec. 2014.
[177] Z. Youbi, L. Boubchir, M. Bounneche, A. Ali-Chérif and A. Boukrouche, "Human ear recognition based on
multi-scale local binary pattern descriptor and KL divergence," in International Conference on
Telecommunications and Signal Processing, Vienna, Austria, Jun. 2016.
[178] C. Long, M. Zhichun, N. Bingfei, Z. Yi and Y. Ruyin, "TDSIFT: A new descriptor for 2D and 3D ear
recognition," in International Conference on Graphic and Image Processing, Tokyo, Japan, Oct. 2016.
[179] H. Zhang, Z. Mu, W. Qu, L. Liu and C. Zhang, "A novel approach for ear recognition based on ICA and RBF
network," in International Conference on Machine Learning and Cybernetics, Guangzhou, China, Aug. 2005.
[180] M. Nosrati, K. Faez and F. Faradji, "Using 2D wavelet and principal component analysis for personal
identification based on 2D ear structure," in International Conference on Intelligent and Advanced Systems,
Kuala Lumpur, Malaysia, Nov. 2007.
[181] Y. Wang, Z. Mu and H. Zeng, "Block-based and multi-resolution methods for ear recognition using wavelet
transform and uniform local binary patterns," in International Conference on Pattern Recognition, Tampa, FL,
USA, Dec. 2008.
[182] A. Kumar and T. Chan, "Robust ear identification using sparse representation of local texture descriptors,"
Pattern Recognition, vol. 85, no. 1, pp. 73-85, Jan. 2013.
[183] P. Galdámez, A. Arrieta and M. Ramon, "Ear recognition using a hybrid approach based on neural networks,"
in International Conference on Information Fusion, Salamanca, Spain, Jul. 2014.
[184] L. Jacob and G. Raju, "Ear recognition using texture features - A novel approach," in Advances in Intelligent
Systems and Computing, Singapore, Singapore, Springer, Jul. 2014, pp. 1-12.
[185] A. Benzaoui, A. Hadid and A. Boukrouche, "Ear biometric recognition using local texture descriptors," Journal
of Electronic Imaging, vol. 23, no. 5, pp. 1-12, Oct. 2014.
[186] "Lytro Desktop 4," Lytro, Inc., [Online]. Available:
https://support.lytro.com/hc/en-us/articles/202590364-Lytro-Desktop-4-Main-Overview. [Accessed Nov. 2018].
[187] C. Chang and C. Lin, "LIBSVM -- A library for support vector machines," National Taiwan University,
[Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. [Accessed Nov. 2018].
[188] S. Marto, N. Monteiro, J. Barreto and J. Gaspar, "Structure from plenoptic imaging," in IEEE International
Conference on Development and Learning and on Epigenetic Robotics, Lisbon, Portugal, Sep. 2017.
[189] N. Monteiro, S. Marto, J. Barreto and J. Gaspar, "Depth range accuracy for plenoptic cameras," Computer
Vision and Image Understanding, vol. 168, no. 1, pp. 104-117, Mar. 2018.
[190] H. Jeon, J. Park, G. Choe and G. Park, "Accurate depth map estimation from a lenslet light field camera," in
IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 2015.
[191] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8,
pp. 1735-1780, Nov. 1997.
[192] K. Greff, R. Srivastava, J. Koutník, B. Steunebrink and J. Schmidhuber, "LSTM: A search space odyssey,"
IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222-2232, Oct. 2017.
[193] J. Liu, A. Shahroudy, D. Xu, A. Chichung and G. Wang, "Skeleton-based action recognition using
spatio-temporal LSTM network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence,
in press, Nov. 2017.
[194] P. Rodriguez, G. Cucurull, J. Gonzalez, J. Gonfaus, K. Nasrollahi, T. Moeslund and J. Xavier Roca, "Deep
pain: Exploiting long short-term memory networks for facial expression classification," IEEE Transactions on
Cybernetics, vol. 99, no. 1, pp. 1-11, Feb. 2017.
[195] J. Donahue, L. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko and T. Darrell,
"Long-term recurrent convolutional networks for visual recognition and description," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677-691, Apr. 2017.
[196] P. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, vol. 78,
no. 10, pp. 1550-1560, Oct. 1990.
[197] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions,"
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116,
Apr. 1998.
[198] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter and H. Ney, "A comprehensive study of deep bidirectional
LSTM RNNs for acoustic modeling in speech recognition," in IEEE International Conference on Acoustics,
Speech and Signal Processing, New Orleans, LA, USA, Jun. 2017.
[199] S. Merity, N. Keskar and R. Socher, "Regularizing and optimizing LSTM language models,"
arXiv:1708.02182, Ithaca, NY, USA, Aug. 2017.
[200] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in
International Conference on Neural Information Processing Systems, Barcelona, Spain, Dec. 2016.
[201] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy and P. Tang, "On large-batch training for deep learning:
Generalization gap and sharp minima," in International Conference on Learning Representations, Toulon,
France, Apr. 2017.
[202] V. Patel, "The impact of local geometry and batch size on convergence and divergence of stochastic gradient
descent," arXiv:1709.04718, Ithaca, NY, USA, Sep. 2017.
[203] A. Meraoumia, S. Chitroub and A. Bouridane, "An automated ear identification system using Gabor filter
responses," in International New Circuits and Systems Conference, Grenoble, France, Jun. 2015.
[204] Z. Hui, Z. Rui, M. Zhichun and W. Xiuqing, "Local feature descriptor based rapid 3D ear recognition," in
Chinese Control Conference, Nanjing, China, Jul. 2014.
[205] Y. Guo and Z. Xu, "Ear recognition using a new local matching approach," in International Conference on
Image Processing, San Diego, CA, USA, Oct. 2008.
[206] V. Ojansivu, E. Rahtu and J. Heikkilä, "Rotation invariant local phase quantization for blur insensitive texture
analysis," in International Conference on Pattern Recognition, Tampa, FL, USA, Dec. 2008.
[207] C. Sousedik and C. Busch, "Presentation attack detection methods for fingerprint recognition systems: a
survey," IET Biometrics, vol. 3, no. 4, pp. 219-233, Dec. 2014.
[208] A. Czajka and K. Bowyer, "Presentation attack detection for iris recognition: an assessment of the
state-of-the-art," ACM Computing Surveys, vol. 51, no. 4, pp. 1-35, Sep. 2018.
[209] D. Yi, Z. Lei, Z. Zhang and S. Li, "Face anti-spoofing: Multi-spectral approach," in Handbook of Biometric
Anti-Spoofing, London, Springer-Verlag, Jul. 2014, pp. 83-102.
[210] C. Kant and N. Sharma, "Fake face recognition using fusion of thermal imaging and skin elasticity,"
International Journal of Computer Science and Communication, vol. 4, no. 1, pp. 65-72, Mar. 2013.
[211] L. Sun, W. Huang and M. Wu, "TIR/VIS correlation for liveness detection in face recognition," in International
Conference on Computer Analysis of Images and Patterns, Seville, Spain, Aug. 2011.
[212] G. Tian, T. Mori and Y. Okuda, "Spoofing detection for embedded face recognition system using a low cost
stereo camera," in International Conference on Pattern Recognition, Cancun, Mexico, Dec. 2016.
[213] X. Sun, L. Huang and C. Liu, "Dual camera based feature for face spoofing detection," in Chinese Conference
on Pattern Recognition, Chengdu, China, Nov. 2016.
[214] J. Komulainen, A. Hadid and M. Pietikäinen, "Context based face anti-spoofing," in International Conference
on Biometrics: Theory, Applications and Systems, Arlington, VA, USA, Jan. 2014.
[215] K. Patel, H. Han and A. Jain, "Secure face unlock: spoof detection on smartphones," IEEE Transactions on
Information Forensics and Security, vol. 11, no. 10, pp. 2268-2283, Jun. 2016.
[216] G. Pan, L. Sun, Z. Wu and Y. Wang, "Monocular camera-based face liveness detection by combining eyeblink
and scene context," Telecommunication Systems, vol. 3, no. 1, pp. 215-225, Aug. 2011.
[217] X. Tan, Y. Li, J. Liu and L. Jiang, "Face liveness detection from a single image with sparse low rank bilinear
discriminative model," in European Conference on Computer Vision, Crete, Greece, Sep. 2010.
[218] A. Anjos and S. Marcel, "Counter-measures to photo attacks in face recognition: A public database and a
baseline," in International Joint Conference on Biometrics, Washington, DC, USA, Oct. 2011.
[219] I. Chingovska, A. Anjos and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing,"
in International Conference of Biometrics Special Interest Group, Darmstadt, Germany, Sep. 2012.
[220] Z. Zhang, J. Yan and S. Liu, "A face antispoofing database with diverse attacks," in IAPR International
Conference on Biometrics, New Delhi, India, Apr. 2012.
[221] D. Wen, H. Han and A. Jain, "Face spoof detection with image distortion analysis," IEEE Transactions on
Information Forensics and Security, vol. 10, no. 4, pp. 746-761, Apr. 2015.
[222] A. Hadid, "Physics-based face database," University of Oulu, [Online]. Available:
http://www.cse.oulu.fi/CMV/Downloads/Pbfd. [Accessed Nov. 2018].
[223] I. Manjani, S. Tariyal, M. Vatsa, R. Singh and A. Majumdar, "Detecting silicone mask-based presentation
attack via deep dictionary learning," IEEE Transactions on Information Forensics and Security, vol. 12, no. 7,
pp. 1713-1723, Mar. 2017.
[224] A. Agarwal, D. Yadav, N. Kohli, R. Singh, M. Vatsa and A. Noore, "Face presentation attack with latex masks
in multispectral videos," in Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, Jul.
2017.
[225] "Thats my face," [Online]. Available: http://faces.thatsmyface.com/. [Accessed Nov. 2018].
[226] D. Gragnaniello, G. Poggi, C. Sansone and L. Verdoliva, "An investigation of local descriptors for biometric
spoofing detection," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 849-863,
Feb. 2015.
[227] J. Määttä, A. Hadid and M. Pietikäinen, "Face spoofing detection from single images using micro-texture
analysis," in International Joint Conference on Biometrics, Washington, DC, USA, Oct. 2011.
[228] J. Määttä, A. Hadid and M. Pietikäinen, "Face spoofing detection from single images using texture and local
shape analysis," IET Biometrics, vol. 1, no. 1, pp. 3-10, Mar. 2012.
[229] N. Kose and J. Dugelay, "Classification of captured and recaptured images to detect photograph spoofing," in
International Conference on Informatics, Electronics & Vision, Dhaka, Bangladesh, May 2012.
[230] M. Waris, H. Zhang, I. Ahmad, S. Kiranyaz and M. Gabbouj, "Analysis of textural features for face biometric
anti-spoofing," in European Signal Processing Conference, Marrakech, Morocco, Sep. 2013.
[231] R. Raghavendra and C. Busch, "Robust 2D/3D face mask presentation attack detection scheme by exploring
multiple features and comparison score level fusion," in International Conference on Information Fusion,
Salamanca, Spain, Oct. 2014.
[232] A. Hadid, N. Evans, S. Marcel and J. Fierrez, "Biometrics systems under spoofing attack: An evaluation
methodology and lessons learned," IEEE Signal Processing Magazine, vol. 32, no. 5, pp. 20-30, Sep. 2015.
[233] S. Arashloo, J. Kittler and W. Christmas, "Face spoofing detection based on multiple descriptor fusion using
multiscale dynamic binarized statistical image features," IEEE Transactions on Information Forensics and
Security, vol. 10, no. 11, pp. 2396-2407, Jul. 2015.
[234] Z. Boulkenafet, J. Komulainen and A. Hadid, "Face spoofing detection using colour texture analysis," IEEE
Transactions on Information Forensics and Security, vol. 11, no. 8, pp. 1818-1830, Aug. 2016.
[235] Z. Boulkenafet, J. Komulainen and A. Hadid, "Face antispoofing using speeded-up robust features and fisher
vector encoding," IEEE Signal Processing Letters, vol. 24, no. 2, pp. 141-145, Feb. 2017.
[236] F. Peng, L. Qin and M. Long, "Face presentation attack detection using guided scale texture," Multimedia
Tools and Applications, vol. 77, no. 7, pp. 8883-8909, May 2017.
[237] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki and A. Ho, "Detection of face spoofing using visual
dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762-777, Feb. 2015.
[238] J. Komulainen, A. Hadid and M. Pietikäinen, "Face spoofing detection using dynamic texture," in Asian
Conference on Computer Vision, Daejeon, Korea, Nov. 2012.
[239] T. Pereira, A. Anjos, J. Martino and S. Marcel, "LBP-TOP based countermeasure against face spoofing
attacks," in Asian Conference on Computer Vision, Daejeon, Korea, Nov. 2012.
[240] S. Bharadwaj, T. Dhamecha, M. Vatsa and R. Singh, "Computationally efficient face spoofing detection with
motion magnification," in The IEEE Conference on Computer Vision and Pattern Recognition Workshops,
Portland, OR, USA, Jun. 2013.
[241] A. Pinto, H. Pedrini, W. Schwartz and A. Rocha, "Face spoofing detection through visual codebooks of spectral
temporal cubes," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4726-4740, Dec. 2015.
[242] Q. Phan, D. Dang-Nguyen, G. Boato and F. De Natale, "Face spoofing detection using LDP-TOP," in IEEE
International Conference on Image Processing, Phoenix, AZ, USA, Aug. 2016.
[243] Z. Zhang, D. Yi, Z. Lei and S. Li, "Face liveness detection by learning multispectral reflectance distributions,"
in International Conference on Automatic Face & Gesture Recognition and Workshops, Santa Barbara, CA,
USA, Mar. 2011.
[244] N. Kose and J. Dugelay, "Reflectance analysis based countermeasure technique to detect face mask attacks,"
in International Conference on Digital Signal Processing, Fira, Greece, Oct. 2013.
[245] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in International
Conference on Pattern Recognition, Stockholm, Sweden, Aug. 2014.
[246] J. Galbally, S. Marcel and J. Fierrez, "Image quality assessment for fake biometric detection: Application to
iris, fingerprint, and face recognition," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 710-724,
Nov. 2013.
[247] R. Haralick, K. Shanmugam and I. Dinstein, "Textural features for image classification," IEEE Transactions
on Systems, Man, and Cybernetics, vol. 3, no. 6, pp. 610-621, Nov. 1973.
[248] A. Agarwal, R. Singh and M. Vatsa, "Face anti-spoofing using Haralick features," in International Conference
on Biometrics Theory, Applications and Systems, Niagara Falls, NY, USA, Sep. 2016.
[249] A. Bhogal, D. Söllinger, P. Trung and A. Uhl, "Non-reference image quality assessment for biometric
presentation attack detection," in International Workshop on Biometrics and Forensics, Coventry, UK, Apr.
2017.
[250] J. Yan, Z. Zhang, Z. Lei, D. Yi and S. Li, "Face liveness detection by exploring multiple scenic clues," in
International Conference on Control Automation Robotics & Vision, Guangzhou, China, Dec. 2012.
[251] J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos and S. Marcel, "Complementary countermeasures for
detecting scenic face spoofing attacks," in International Conference on Biometrics, Madrid, Spain, Sep. 2013.
[252] S. Kim, S. Yu, K. Kim, Y. Ban and S. Lee, "Face liveness detection using variable focusing," in International
Conference on Biometrics, Madrid, Spain, Jun. 2013.
[253] A. Ali, F. Deravi and S. Hoque, "Directional sensitivity of gaze-collinearity features in liveness detection," in
International Conference on Emerging Security Technologies, Cambridge, UK, Sep. 2013.
[254] L. Yang, "Face liveness detection by focusing on frontal faces and image backgrounds," in International
Conference on Wavelet Analysis and Pattern Recognition, Lanzhou, China, Jul. 2014.
[255] A. Anjos, M. Chakka and S. Marcel, "Motion-based counter-measures to photo attacks in face recognition,"
IET Biometrics, vol. 3, no. 3, pp. 147-158, Sep. 2014.
[256] L. Cai, C. Xiong, L. Huang and C. Liu, "A novel face spoofing detection method based on gaze estimation,"
in Asian Conference on Computer Vision, Singapore, Singapore, Nov. 2014.
[257] J. Yang, Z. Lei and S. Li, "Learn convolutional neural network for face anti-spoofing," arXiv preprint
arXiv:1408.5601, Ithaca, NY, USA, Aug. 2014.
[258] S. Kim, Y. Ban and S. Lee, "Face liveness detection using defocus," Sensors, vol. 15, no. 1, pp. 1537-1563,
Jan. 2015.
[259] Z. Xu, S. Li and W. Deng, "Learning temporal features using LSTM-CNN architecture for face anti-spoofing,"
in IAPR Asian Conference on Pattern Recognition, Kuala Lumpur, Malaysia, Nov. 2015.
[260] D. Menotti, G. Chiachia and A. Pinto, "Deep representations for iris, face, and fingerprint spoofing detection,"
IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 864-879, Apr. 2015.
[261] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li and A. Hadid, "An original face anti-spoofing approach using
partial convolutional neural network," in International Conference on Image Processing Theory Tools and
Applications, Oulu, Finland, Dec. 2016.
[262] L. Feng, L. Po and Y. Li, "Integration of image quality and motion cues for face anti-spoofing: A neural
network approach," Journal of Visual Communication and Image Representation, vol. 38, no. 1, pp. 451-460,
Jul. 2016.
[263] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution
neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713-720, May 2017.
[264] K. Killioğlu, M. Taşkiran and N. Kahraman, "Anti-spoofing in face recognition with liveness detection using
pupil tracking," in International Symposium on Applied Machine Intelligence and Informatics, Herl'any,
Slovakia, Jan. 2017.
[265] M. De Marsico, M. Nappi, D. Riccio and J. Dugelay, "Moving face spoofing detection via 3D projective
invariants," in IAPR International Conference on Biometrics, New Delhi, India, Apr. 2012.
[266] A. Saad, "Anti-spoofing using challenge-response user interaction," Thesis in Dept. of Computer Science and
Engineering, American University in Cairo, Cairo, Egypt, Jan. 2015.
[267] A. Munalih, "Challenge response interaction for biometric liveness establishment and template protection," in
Annual Conference on Privacy, Security and Trust, Auckland, New Zealand, Dec. 2016.
[268] A. Singh, P. Joshi and G. Nandi, "Face recognition with liveness detection using eye and mouth movement,"
in International Conference on Signal Propagation and Computer Technology, Ajmer, India, Jul. 2014.
[269] O. Komogortsev, A. Karpov and C. Holland, "Attack of mechanical replicas: Liveness detection with eye
movement," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 716-725, Apr. 2015.
[270] K. Kollreider, H. Fronthaler and J. Bigun, "Verifying liveness by multiple experts in face biometrics," in IEEE
Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, Jul. 2008.
[271] C. Chang and C. Lin, "LIBSVM -- A library for support vector machines," National Taiwan University,
[Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. [Accessed Nov. 2018].
[272] ISO/IEC 30107-3, "Information Technology - Presentation Attack Detection - Part 3: Testing, Reporting and
Classification of Attacks," International Organization for Standardization, Sep. 2017.
[273] FRONTEX, "Best practice operational guidelines for automated border control (ABC) systems," European
Agency for the Management, Research and Development Unit, Warsaw, Poland, Sep. 2015.
[274] P. Chen, C. Huang, C. Lien and Y. Tsai, "An efficient hardware implementation of HOG feature extraction for
human detection," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 656-662, Oct.
2013.
[275] Y. Zhang, W. Cao and L. Wang, "Implementation of high performance hardware architecture of face
recognition algorithm based on local binary pattern on FPGA," in International Conference on ASIC, Chengdu,
China, Jul. 2016.
[276] K. Irick, M. DeBole, V. Narayanan and A. Gayasen, "A hardware efficient support vector machine architecture
for FPGA," in International Symposium on Field-Programmable Custom Computing Machines, Stanford, CA,
USA, Apr. 2008.