Deep Face Recognition


Description

Deep Face Recognition. Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman. Visual Geometry Group, Department of Engineering Science, University of Oxford. https://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf

Transcript of Deep Face Recognition

Page 1: Deep Face Recognition

Deep Face Recognition
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman

Visual Geometry Group, Department of Engineering Science, University of Oxford

0. Objective

Results  

(ii)  Correspondences  

• Build a large-scale face dataset with minimal human intervention

• Train a convolutional neural network for face recognition that can compete with those of Internet giants like Google and Facebook

 Results  

Incorporating Attributes

Contributions

Representation Learning

Comparison with the state of the art

• New face dataset of 2.6 million faces with minimal manual labeling

• State-of-the-art results on the YouTube Faces in the Wild dataset (EER 3%) [Better than Google’s FaceNet and Facebook’s DeepFace]

• Comparable to the state-of-the-art results on the Labeled Faces in the Wild dataset (EER 0.87%) [Comparable to Google’s FaceNet and better than Facebook’s DeepFace]

Dataset construction

1. Candidate list generation: 5000 identities
• Tap the knowledge on the web
• Collect representative images for each celebrity: 200 images per identity

2. Candidate list filtering
• Remove people with low representation in Google image search
• Remove overlap with public benchmarks
• 2622 celebrities in the final dataset

3. Rank image sets (a sketch of this ranking step follows the list)
• Download 2000 images per identity
• Learn a classifier using the data obtained in step 1
• Rank the 2000 images and select the top 1000

4. Near duplicate removal
• VLAD-descriptor-based near duplicate removal

5. Manual filtering
• Curating the dataset further using manual checks
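The ranking step (3) can be illustrated with a short sketch. This is not the authors' pipeline (the paper uses Fisher Vector faces with a one-vs-rest linear SVM); it is a minimal stand-in that assumes a feature vector has already been extracted for every image, and the function rank_identity is a hypothetical helper introduced only for illustration.

```python
# Minimal sketch of step 3: rank the 2000 downloaded images of one identity with a
# one-vs-rest linear classifier trained on the step-1 images, then keep the top 1000.
# Illustrative only; the paper's actual features and classifier differ.
import numpy as np
from sklearn.svm import LinearSVC

def rank_identity(seed_feats, other_feats, candidate_feats, top_k=1000):
    """Train identity-vs-rest, score the candidate downloads, return indices of the best."""
    X = np.vstack([seed_feats, other_feats])
    y = np.concatenate([np.ones(len(seed_feats)), np.zeros(len(other_feats))])
    clf = LinearSVC(C=1.0).fit(X, y)
    scores = clf.decision_function(candidate_feats)  # higher = more likely this identity
    return np.argsort(-scores)[:top_k]               # best-scoring images first
```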

       

   

       

No. | Aim | Mode | # Persons | # images / person | Total # images | Anno. effort
1 | Candidate list generation | Auto | 5000 | 200 | 1,000,000 | -
2 | Candidate list filtering | Manual | 2622 | - | - | 4 days
3 | Rank image sets | Auto | 2622 | 1000 | 2,622,000 | -
4 | Near duplicate removal | Auto | 2622 | 623 | 1,635,159 | -
5 | Manual filtering | Manual | 2622 | 375 | 982,803 | 10 days
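Step 4 (near duplicate removal) relies on VLAD descriptors. The sketch below is a heavily simplified illustration rather than the paper's implementation: it assumes local descriptors (e.g. dense SIFT) have already been extracted for each image, builds a small k-means vocabulary, aggregates residuals into one VLAD vector per image, and flags images whose VLAD vectors fall within a chosen distance threshold. The function names, vocabulary size and threshold are assumptions for the example.

```python
# Simplified VLAD-based near-duplicate flagging (illustrative sketch only).
# `local_descs[i]` is assumed to be an (n_i, d) array of local descriptors for image i.
import numpy as np
from sklearn.cluster import KMeans

def vlad(desc, kmeans):
    """Aggregate local descriptors into a single L2-normalised VLAD vector."""
    k, d = kmeans.n_clusters, desc.shape[1]
    assign = kmeans.predict(desc)
    v = np.zeros((k, d))
    for c in range(k):
        members = desc[assign == c]
        if len(members):
            v[c] = (members - kmeans.cluster_centers_[c]).sum(axis=0)  # residual sum per word
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def near_duplicates(local_descs, k=64, thresh=0.1):
    """Return indices of images considered near duplicates of an earlier image."""
    kmeans = KMeans(n_clusters=k, n_init=4).fit(np.vstack(local_descs))
    vlads = np.stack([vlad(d, kmeans) for d in local_descs])
    dup = set()
    for i in range(len(vlads)):
        for j in range(i + 1, len(vlads)):
            if np.linalg.norm(vlads[i] - vlads[j]) < thresh:  # very similar VLAD vectors
                dup.add(j)                                    # keep i, drop j
    return dup
```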

No. | Dataset | # Persons | Total # images
1 | Labeled Faces in the Wild | 5,749 | 13,233
2 | WDRef | 2,995 | 99,773
3 | CelebFaces | 10,177 | 202,599
4 | Ours | 2,622 | 2.6M
5 | Facebook | 4,030 | 4.4M
6 | Google | 8M | 200M

Dataset  Comparison  

Example  images  from  our  dataset  

Dataset Statistics

CNN architecture (figure): image → Conv-64 → Conv-64 → maxpool → Conv-128 → Conv-128 → maxpool → Conv-256 → Conv-256 → Conv-256 → maxpool → Conv-512 → Conv-512 → Conv-512 → maxpool → Conv-512 → Conv-512 → Conv-512 → maxpool → fc-4096 → fc-4096 → fc-2622 → Softmax

Objective | 100%-EER
Softmax classification | 97.30
Embedding Learning | 99.10

No. | Method | # Training Images | # Networks | Accuracy
1 | Fisher Vector Faces | - | - | 93.10
2 | DeepFace (Facebook) | 4 M | 3 | 97.35
3 | DeepFace Fusion (Facebook) | 500 M | 5 | 98.37
4 | DeepID-2,3 | Full | 200 | 99.47
5 | FaceNet (Google) | 200 M | 1 | 98.87
6 | FaceNet + Alignment (Google) | 200 M | 1 | 99.63
7 | Ours (VGG Face) | 2.6 M | 1 | 98.78

No. | Method | # Training Images | # Networks | 100%-EER | Accuracy
1 | Video Fisher Vector Faces | - | - | 87.7 | 83.8
2 | DeepFace (Facebook) | 4 M | 1 | 91.4 | 91.4
4 | DeepID-2,2+,3 | - | 200 | - | 93.2
5 | FaceNet (Google) | 200 M | 1 | - | 95.1
7 | Ours (VGG Face) | 2.6 M | 1 | 97.60 | 97.4

[Figure: ROC curves on the LFW dataset (true positive rate vs. false positive rate) comparing Our Method, DeepID3, DeepFace and Fisher Vector Faces.]

Dataset

YouTube Faces in the Wild: Unrestricted Protocol

Labeled Faces in the Wild: Unrestricted Protocol

• “Very Deep” Architecture
  • Different from previous architectures proposed for face recognition:
    • locally connected layers (Facebook)
    • inception (Google)

• Learning a multi-way classifier
  • Softmax Objective
  • 2622-way classification
  • 4096-d descriptor

• Learning a Task-Specific Embedding
  • The embedding is learnt by minimizing the triplet loss
  • Learning a projection layer from 4096 to 1024 dimensions
  • Online triplet formation at the beginning of each iteration (see the sampling sketch after this list)
  • Fine-tuned on target datasets

• Experiments: Labeled Faces in the Wild – Unrestricted Protocol
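The online triplet formation bullet can be made concrete with a generic sketch. The exact selection procedure is described in Section 4.4 of the paper and may differ; the version below simply keeps, for each (anchor, positive) pair in a batch, the negatives that still violate the margin, so every retained triplet contributes a non-zero loss. The function name, margin value and batch layout are assumptions for illustration.

```python
# Generic sketch of online triplet formation (not the authors' exact procedure):
# keep only triplets that currently violate the margin, so each yields a positive loss.
import torch

def form_triplets(x, labels, alpha=0.2):
    """x: (N, L) embeddings for the current batch; labels: (N,) identity ids."""
    dist = torch.cdist(x, x) ** 2                     # squared Euclidean distances
    triplets = []
    for a in range(len(x)):
        pos = (labels == labels[a]).nonzero().flatten()
        neg = (labels != labels[a]).nonzero().flatten()
        for p in pos:
            if p == a:
                continue
            violating = neg[dist[a, neg] < dist[a, p] + alpha]   # margin still violated
            triplets += [(a, int(p), int(n)) for n in violating]
    return triplets
```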

 

6 PARKHI et al.: DEEP FACE RECOGNITION

layer | type | name | support | filt dim | num filts | stride | pad
0 | input | - | - | - | - | - | -
1 | conv | conv1_1 | 3 | 3 | 64 | 1 | 1
2 | relu | relu1_1 | 1 | - | - | 1 | 0
3 | conv | conv1_2 | 3 | 64 | 64 | 1 | 1
4 | relu | relu1_2 | 1 | - | - | 1 | 0
5 | mpool | pool1 | 2 | - | - | 2 | 0
6 | conv | conv2_1 | 3 | 64 | 128 | 1 | 1
7 | relu | relu2_1 | 1 | - | - | 1 | 0
8 | conv | conv2_2 | 3 | 128 | 128 | 1 | 1
9 | relu | relu2_2 | 1 | - | - | 1 | 0
10 | mpool | pool2 | 2 | - | - | 2 | 0
11 | conv | conv3_1 | 3 | 128 | 256 | 1 | 1
12 | relu | relu3_1 | 1 | - | - | 1 | 0
13 | conv | conv3_2 | 3 | 256 | 256 | 1 | 1
14 | relu | relu3_2 | 1 | - | - | 1 | 0
15 | conv | conv3_3 | 3 | 256 | 256 | 1 | 1
16 | relu | relu3_3 | 1 | - | - | 1 | 0
17 | mpool | pool3 | 2 | - | - | 2 | 0
18 | conv | conv4_1 | 3 | 256 | 512 | 1 | 1
19 | relu | relu4_1 | 1 | - | - | 1 | 0
20 | conv | conv4_2 | 3 | 512 | 512 | 1 | 1
21 | relu | relu4_2 | 1 | - | - | 1 | 0
22 | conv | conv4_3 | 3 | 512 | 512 | 1 | 1
23 | relu | relu4_3 | 1 | - | - | 1 | 0
24 | mpool | pool4 | 2 | - | - | 2 | 0
25 | conv | conv5_1 | 3 | 512 | 512 | 1 | 1
26 | relu | relu5_1 | 1 | - | - | 1 | 0
27 | conv | conv5_2 | 3 | 512 | 512 | 1 | 1
28 | relu | relu5_2 | 1 | - | - | 1 | 0
29 | conv | conv5_3 | 3 | 512 | 512 | 1 | 1
30 | relu | relu5_3 | 1 | - | - | 1 | 0
31 | mpool | pool5 | 2 | - | - | 2 | 0
32 | conv | fc6 | 7 | 512 | 4096 | 1 | 0
33 | relu | relu6 | 1 | - | - | 1 | 0
34 | conv | fc7 | 1 | 4096 | 4096 | 1 | 0
35 | relu | relu7 | 1 | - | - | 1 | 0
36 | conv | fc8 | 1 | 4096 | 2622 | 1 | 0
37 | softmx | prob | 1 | - | - | 1 | 0

Table 3: Network configuration. Details of the face CNN configuration A. The FC layers are listed as “convolution” as they are a special case of convolution (see Section 4.3). For each convolution layer, the filter size, number of filters, stride and padding are indicated.
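For readers who prefer code, the configuration in Table 3 can be written out as follows. This is a sketch in PyTorch rather than the authors' original implementation; the helper names are my own, but the kernel sizes, channel counts, strides and padding follow the table, with the FC layers expressed as convolutions exactly as the table lists them.

```python
import torch
import torch.nn as nn

def make_vgg_face_A():
    """Face CNN configuration A from Table 3, as a PyTorch sketch.
    Input: a 224x224 RGB face crop; output: probabilities over the 2622 identities."""
    def block(cin, cout, n):  # n conv3x3 + ReLU layers, then a 2x2 max pool
        layers = []
        for i in range(n):
            layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        return layers + [nn.MaxPool2d(2, stride=2)]

    return nn.Sequential(
        *block(3, 64, 2),        # conv1_1, conv1_2, pool1
        *block(64, 128, 2),      # conv2_1, conv2_2, pool2
        *block(128, 256, 3),     # conv3_1..conv3_3, pool3
        *block(256, 512, 3),     # conv4_1..conv4_3, pool4
        *block(512, 512, 3),     # conv5_1..conv5_3, pool5
        nn.Conv2d(512, 4096, 7), nn.ReLU(inplace=True),   # fc6 as a 7x7 convolution
        nn.Conv2d(4096, 4096, 1), nn.ReLU(inplace=True),  # fc7
        nn.Conv2d(4096, 2622, 1),                         # fc8: one score per identity
        nn.Flatten(),
        nn.Softmax(dim=1),                                # prob
    )

# probs = make_vgg_face_A()(torch.randn(1, 3, 224, 224))  # -> shape (1, 2622)
```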

using a “triplet loss” training scheme, illustrated in the next section. While the latter is essential to obtain a good overall performance, bootstrapping the network as a classifier, as explained in this section, was found to make training significantly easier and faster.

4.2 Learning a face embedding using a triplet loss

Triplet-loss training aims at learning score vectors that perform well in the final application, i.e. identity verification by comparing face descriptors in Euclidean space. This is similar in spirit to “metric learning”, and, like many metric learning approaches, is used to learn a projection that is at the same time distinctive and compact, achieving dimensionality reduction at the same time.

Our triplet-loss training scheme is similar in spirit to that of [17]. The output $\phi(\ell_t) \in \mathbb{R}^D$ of the CNN, pre-trained as explained in Section 4.1, is $\ell_2$-normalised and projected to an $L \ll D$ dimensional space using an affine projection $x_t = W' \phi(\ell_t) / \|\phi(\ell_t)\|_2$, $W' \in \mathbb{R}^{L \times D}$. While this formula is similar to the linear predictor learned above, there are two key differences. The first one is that $L \neq D$ is not equal to the number of class identities, but it is the (arbitrary) size of the descriptor embedding (we set $L = 1{,}024$). The second one is that the projection $W'$ is trained to minimise the empirical triplet loss

$$E(W') = \sum_{(a,p,n) \in T} \max\{0,\ \alpha - \|x_a - x_n\|_2^2 + \|x_a - x_p\|_2^2\}, \qquad x_i = \frac{W' \phi(\ell_i)}{\|\phi(\ell_i)\|_2}. \tag{1}$$

Note that, differently from the previous section, there is no bias being learned here as the differences in (1) would cancel it. Here $\alpha \ge 0$ is a fixed scalar representing a learning margin and $T$ is a collection of training triplets. A triplet $(a, p, n)$ contains an anchor face image $a$ as well as a positive $p \ne a$ and negative $n$ examples of the anchor's identity. The projection $W'$ is learned on target datasets such as LFW and YTF honouring their guidelines. The construction of the triplet training set $T$ is discussed in Section 4.4.
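Equation (1) maps directly onto code. The PyTorch sketch below is my own illustration, not the original implementation: phi stands for the pre-trained CNN output, which is L2-normalised, projected by W' without a bias, and compared with squared Euclidean distances under a margin alpha (the margin value used here is an assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletEmbedding(nn.Module):
    """Affine projection W' (D=4096 -> L=1024, no bias) trained with the triplet loss of Eq. (1)."""
    def __init__(self, D=4096, L=1024, alpha=0.2):   # alpha: learning margin (value is a guess)
        super().__init__()
        self.W = nn.Linear(D, L, bias=False)          # no bias: it would cancel in Eq. (1)
        self.alpha = alpha

    def embed(self, phi):
        return self.W(F.normalize(phi, dim=1))        # x = W' * phi / ||phi||_2

    def forward(self, phi_a, phi_p, phi_n):
        xa, xp, xn = self.embed(phi_a), self.embed(phi_p), self.embed(phi_n)
        loss = F.relu(self.alpha
                      - (xa - xn).pow(2).sum(dim=1)   # - ||x_a - x_n||^2
                      + (xa - xp).pow(2).sum(dim=1))  # + ||x_a - x_p||^2
        return loss.sum()                             # empirical loss over the batch of triplets
```

At test time, verification then reduces to thresholding the Euclidean distance between the 1024-dimensional embeddings of two face images.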

4.3 Architecture

We consider three architectures based on the A, B, and D architectures of [19]. The CNN architecture A is given in full detail in Table 3. It comprises 11 blocks, each containing a linear operator followed by one or more non-linearities such as ReLU and max pooling. The first eight such blocks are said to be convolutional as the linear operator is a bank of linear filters (linear convolution). The last three blocks are instead called Fully Connected (FC);
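The remark that the FC layers are a special case of convolution can be checked with a small example. The sketch below (my own, in PyTorch) applies fc6 to a 7x7x512 pool5 output once as a Linear layer and once as a 7x7 convolution carrying the same weights, and verifies that the outputs match.

```python
import torch
import torch.nn as nn

# fc6 as a fully connected layer vs. as a 7x7 convolution with the same weights.
fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)   # share the weights, reshaped
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)                 # pool5 output for a 224x224 input
out_fc = fc(x.flatten(1))                     # shape (1, 4096)
out_conv = conv(x).flatten(1)                 # shape (1, 4096)
print(torch.allclose(out_fc, out_conv, atol=1e-4))   # True: the two layers are equivalent
```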

Download  

Effect of dataset curation

Dataset | # images | 100%-EER
Curated | 982,803 | 92.83
Full | 2,622,000 | 95.80

Effect  of  embedding  learning  

More experiments in the paper

Example correct classification: positive pairs

Example correct classification: negative pairs

Example incorrect classification

Acknowledgements  

False positive

CNN model and face detector available online!! Visit: http://www.robots.ox.ac.uk/~vgg/software/vgg_face/

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA).

False negative