A Compact and Discriminative Face Track Descriptorvgg/publications/2014/... · Very simple yet...

Omkar M Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

A Compact and Discriminative!Face Track Descriptor

Recognising and verifying faces in videos 2

Recognition Verification

same

different

VF2: a new compact face track descriptor

▶Discriminative!▶Useful for different tasks (Recognition, Verification)!▶Extremely compact

3

Face track: sequence of face detections in consecutive frames.

face track descriptor

Large scale face retrieval

▶ Example of a typical target dataset!▶ 5 years of evening news programs!▶ 10,000 hrs of broadcast !▶ 20 Million frames, !▶ 2.1 Million face tracks!▶ Real time performance

4

http://www.robots.ox.ac.uk/~vgg/research/on-the-fly/

▶ 30 frames per track on average!▶ Typical 4000D descriptor → 1 TB!▶ Our descriptor → 270 MB!

http://www.robots.ox.ac.uk/~vgg/rese

d2W (x, y)

[011001010]

Outline

1. Dense feature computation!

2. Fisher Vector encoding!

3. Video and jittered pooling!

4. Compression by metric learning!

5. Binarisation!

6. Results

5

1. Dense feature computation

▶ Input: a face track!▶ Aligned or unaligned!▶ No facial landmarks required (eyes, nose, etc.)!

▶ Output: a set of local features !▶ Extracted from all frames!▶ Dense RootSIFT at multiple scales!▶ 64-D PCA

6

d2W (x, y)

[011001010]

Outline





5. Binarisation!

6. Results

7

GMM

[Perronnin et al. ECCV 2012]

2. Fisher Vector encoding 8

first and second order statistics

FV encoding+ sqrt-L2

normalisation

� =

2

6666666664

v1u1v2u2...vKuK

3

7777777775 uk =1

Mp2⇡k

MX

i=1

�k(xi )

✓xi � µk

�i� 1

◆2

vk =1

Mp⇡k

MX

i=1

�k(xi )xi � µk

�i

xi

Gaussians!(μk,Σk)

µk

xi�k(xi )

AssignmentHard

Dense SIFT

Gaussian components as part detectors

2. Fisher Vector Encoding 9

2

66664 xW � 1

2yH � 1

2

3

77775

Spatial (x,y) Augmentation

d2W (x, y)

[011001010]

Outline





5. Binarisation!

6. Results

10

3. Video and jittered pooling

▶ Typically each frame is pooled independently!▶ Complex inference procedures combining multiple descriptors!▶ Large memory footprint

11

[Sivic et al. CVPR 09, Everingham et. al IVC 09,, Wolf et al. CVPR 2011]


▶ Single descriptor per track!▶ Smaller memory footprint!▶ Easy to use!▶ Improved performance

12

[Application to Action Recognition: Oneata, Verbeek, Schmid ICCV 2013]


▶ Data augmentation!▶ Data augmentation without training set increase!▶ Improvement in the performance

13

[Paulin et al. CVPR 2014]

d2W (x, y)

[011001010]

Outline





5. Binarisation!

6. Results

14

Learn to discriminate faces

4. Metric Learning 15

d2W (x, y) = kW x�W yk2 < b

same person

d2W (u, v) = kWu�W vk2 > b

x y uv

different people

d2W (x, y) = kW x�W yk2

[Simonyan, Parkhi, Vedaldi, Zisserman BMVC 2013]

W

x

=

learnt projectionFisherVector

z

d2W (x, y)

[011001010]

Outline





5. Binarisation!

6. Results

16

Parseval Tight Frame

5. Binarisation

▶ Low-dimensional real-valued descriptor → high dimensional binary !▶ 4x decrease in memory footprint (128D real → 1024D binary)!▶ Fast distance computation!▶ Alternative binarisation methods could be used

17

q ⨉ m!Columns from a

random rotation matrix

sign= q

q bits only

0101010

[Jégou et al. ICASSP 2012, Simonyan et al. PAMI 2014]

real-valued descriptor

m⨉

U z sign(Uz)zU

q

d2W (x, y)

[011001010]

Outline





5. Binarisation!

6. Results

18

Face Verification

YouTube Faces Dataset 19

▶ Face verification in videos!▶ 3,425 videos of 1,595 celebrities!▶ Videos collected from internet!▶ Wide pose, expression and illumination variation!▶ 10 splits of 600 pairs of videos!

▶ Restricted setting: Use provided pairs!▶ Unrestricted setting: Free to form own pairs.

differentsame

[Wolf, Hassner, Moaz CVPR 2011]

Image Pool (Soft assignment FV)

Video Pool (Soft assignment FV)

Video Pool hard asignment fv

Video Pool + Jittered Pool

Video Pool. + Binar. 1024 bit + jitt.

Video Pool. + Joint sim. + jitt.

Error0 4.5 9 13.5 18

12.3

13.4

14.2

16.2

15

17.3

Face Verification


Face Verification


MGBS & SVM-APEM FUSION

STFRD & PMMLVSOF & OSS (Adaboost)

DDML (Combined)VF 1024D (binary)

VF 256DDeep Face (facebook.com)

Error0 5.5 11 16.5 22

8.612.3

13.418.5

2019.9

21.421.2

Requires additional training data.

2

2

Weakly supervised face classification

Oxford Buffy Dataset 22

▶ “Buffy The Vampire Slayer”!▶ Face tracks from 7 episodes of season 5.!▶ Both frontal and profile detections!▶ Weak supervision from transcript and subtitles!▶ Multi Class classification for every episode

[Everingham et al. IVC 2009, Sivic et al. CVPR 2009]

Weakly supervised classification

Oxford Buffy Dataset 23

Sivic et al. (HOG RBF MKL)

VF ( GMMs trained on Buffy )

VF ( GMMs trained on YTF )

VF ( GMMs trained on YTF ) + Jitt. Pool 1024D

VF ( GMMs trained on YTF 2048b)

Avg. AP0.79 0.808 0.825 0.843 0.86

0.82

0.86

0.8

0.81

0.812

2

2

2

Very simple yet powerful face track descriptor!

Recap 24

▶Track descriptor in 128 bytes!▶Face landmarks and alignment not required!▶One descriptor per track!!

▶State of the art/comparable results on multiple tasks!▶YouTube Faces Dataset!▶Oxford Buffy Dataset!

!▶ Can be trained with very small amount of data!▶ Extremely easy to compute!

!▶ Code online soon.

Questions?

A Compact and Discriminative Face Track Descriptorvgg/publications/2014/... · Very simple yet...

Documents

Transcript of A Compact and Discriminative Face Track Descriptorvgg/publications/2014/... · Very simple yet...