Video search by deep-learning

Video Search by Deep Learning Cees Snoek

Transcript of Video search by deep-learning

Page 1: Video search by deep-learning

Video Search by Deep Learning

Cees Snoek

Page 2: Video search by deep-learning

2

Which one is the plane?

Page 3: Video search by deep-learning

3

Which one is the plane?

Page 4: Video search by deep-learning

4

Which one is the bird?

Page 5: Video search by deep-learning

5

Which one is the bird?

Page 6: Video search by deep-learning

6

Which one is the Kentucky Warbler?

Page 7: Video search by deep-learning

7

Which one is the Kentucky Warbler?

Page 8: Video search by deep-learning

8

How difficult is the problem?

Human vision consumes 50% brain power…

Van Essen, Science 1992

Page 9: Video search by deep-learning

9

Video recognition in a nutshell

Visualization by Jasper Schulte

Page 10: Video search by deep-learning

10

NIST TRECVID Benchmark

Promote progress in video retrieval research

Big data, standardized tasks, independent evaluation and open innovation

International video search competition

http://trecvid.nist.gov/

Page 11: Video search by deep-learning

11

Concept detection task

http://trecvid.nist.gov/

Aircraft

Beach

Mountain

People marching

Police/Security

Flower

Page 12: Video search by deep-learning

12

From University-lab to spin-off and your mobile phone

• = 1000+ others

* = UvA / Euvision / Qualcomm

Universities win

Start-ups win

Snoek et al., TRECVID 2004-2015

Page 13: Video search by deep-learning

13

Progress in video recognition

Latest jump due to deep learning

[Figure: mean average precision over time; axis ticks at 2006, 2009 and 2015]
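The vertical axis of the progress chart appears to be mean average precision. As a minimal sketch of how that measure can be computed, assuming every relevant item appears somewhere in each ranked list (the benchmark's own evaluation may use a different, sampled variant), and with the function names `average_precision` and `mean_average_precision` chosen here for illustration:

```python
def average_precision(ranked_relevance):
    """Average precision for one concept.

    ranked_relevance: list of 0/1 flags in rank order (1 = relevant shot).
    Assumption: all relevant items are present in the ranking.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-concept average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Two toy concepts with hand-made relevance judgments.
    print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

A ranking that places relevant shots higher yields a higher score, which is why the measure suits search-style evaluation.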

Page 14: Video search by deep-learning

14

The more features the better

Typical shallow learning architecture

Local Feature Extraction: e.g. SIFT, dense sampling

Feature Encoding: BoW, sparse coding, Fisher, VLAD

Feature Pooling: avg/sum pooling, max pooling

Classification: Linear / Non-linear SVM
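To make the encoding and pooling stages of the shallow pipeline concrete, here is a minimal sketch using hard-assignment bag-of-words. The codebook and descriptors are toy 2-D values invented for illustration (real local descriptors such as SIFT are much higher dimensional), and the function names are hypothetical:

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bow_encode(descriptors, codebook):
    """Feature encoding: one-hot code per local descriptor."""
    codes = []
    for d in descriptors:
        code = [0.0] * len(codebook)
        code[nearest_codeword(d, codebook)] = 1.0
        codes.append(code)
    return codes


def pool(codes, mode="avg"):
    """Feature pooling: collapse all codes into one fixed-length vector."""
    columns = list(zip(*codes))
    if mode == "max":
        return [max(c) for c in columns]
    return [sum(c) / len(c) for c in columns]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]             # toy visual vocabulary
    descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]  # toy local features
    codes = bow_encode(descriptors, codebook)
    print(pool(codes, "avg"))  # normalized histogram of codeword counts
    print(pool(codes, "max"))  # which codewords occurred at all
```

The pooled vector would then be passed to a classifier such as a linear SVM, which is left out of this sketch.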

Page 15: Video search by deep-learning

15

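The deeper the better

Typical deep learning architecture

[Figure: convolutional network with a 224 × 224 input; convolution filter sizes 11×11, 5×5, 3×3, 3×3, 3×3; max pooling; layers 6 and 7 with 4,096 units each, both with dropout; loss]

Building blocks: convolution, non-linearity, pooling

Krizhevsky et al., NIPS 2012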
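The three building blocks named on this slide (convolution, non-linearity, pooling) can be sketched in a few lines. This is a toy illustration on a 6 × 6 single-channel image with one hand-picked filter, not the network from the slide; ReLU is assumed as the non-linearity, and the function names are chosen here:

```python
def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation form, as used in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Non-linearity: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


if __name__ == "__main__":
    # Toy image whose intensity grows from left to right.
    image = [[float(j) for j in range(6)] for _ in range(6)]
    # Hand-picked 3x3 filter that responds to left-to-right increases.
    kernel = [[-1.0, 0.0, 1.0]] * 3
    print(max_pool(relu(conv2d(image, kernel))))
```

In a deep network this convolution, non-linearity, pooling pattern is stacked several times, with the filter values learned from data rather than chosen by hand.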

Page 16: Video search by deep-learning

16

Video search demos

Social media, forensics, cultural heritage

Page 17: Video search by deep-learning

17

Tomorrow: The Internet of things that video

Page 18: Video search by deep-learning

18

Need to understand what is happening where and when?

Page 19: Video search by deep-learning

19

Examples

Shaking hands

Kissing

Page 20: Video search by deep-learning

20

Goal: obtain the red tube around the action

Jain et al., IJCV 2017

Page 21: Video search by deep-learning

21

Method: Super-voxel segmentation of the video

Jain et al., IJCV 2017

Page 22: Video search by deep-learning

22

Group voxels to generate action proposals

Jain et al., IJCV 2017

Unsupervised and class-agnostic

Page 23: Video search by deep-learning

23

Example proposals

Page 24: Video search by deep-learning

24

Encode video proposals as 15,000 object scores

Jain et al., CVPR 2015

[Figure: the convolutional network from the earlier architecture slide, applied to the video proposals]

Page 25: Video search by deep-learning

25

Actions have object preference, relation is generic

Typing

Playing Cello

Bodyweight squats

Jain et al., CVPR 2015

Page 26: Video search by deep-learning

26

Where do objects aid actions the most?

We consider three object encodings

− Whole video

− Outside of tube only

− Inside of tube only

Page 27: Video search by deep-learning

27

Objects aid most close to the action

[Figure: bar chart comparing the three object encodings (whole video, outside tube, inside tube); vertical axis from 0 to 100]

Jain et al., CVPR 2015

Page 28: Video search by deep-learning

28

Objects2action: Translate objects to an action

Simple convex combination of known classifiers

[Figure labels: test video, object representation, object/action affinities]

where s() = word2vec

Mikolov et al., NIPS 2013

Jain et al., ICCV 2015
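The formula on this slide did not survive extraction, so the following is only a sketch of what a convex combination of object classifiers weighted by word-embedding affinities could look like, not the authors' exact formulation. The 2-D embeddings stand in for word2vec vectors and are invented for illustration; keeping only the top-k objects and clamping negative similarities to zero are assumptions made here so that the weights are non-negative and sum to one:

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def action_weights(action_vec, object_vecs, top_k):
    """Object/action affinities turned into convex weights (assumed scheme)."""
    sims = {name: max(0.0, cosine(action_vec, vec))
            for name, vec in object_vecs.items()}
    kept = sorted(sims, key=sims.get, reverse=True)[:top_k]
    total = sum(sims[name] for name in kept)
    if total == 0.0:
        return {}
    return {name: sims[name] / total for name in kept}


def action_score(object_scores, weights):
    """Score an unseen action as a weighted sum of object classifier scores."""
    return sum(w * object_scores[name] for name, w in weights.items())


if __name__ == "__main__":
    # Hypothetical embeddings; real ones would come from a word2vec model.
    object_vecs = {"cello": [1.0, 0.0], "bow": [1.0, 1.0], "keyboard": [0.0, 1.0]}
    playing_cello = [1.0, 0.0]
    weights = action_weights(playing_cello, object_vecs, top_k=2)
    # Hypothetical object classifier outputs for one test video.
    object_scores = {"cello": 0.9, "bow": 0.8, "keyboard": 0.1}
    print(weights, action_score(object_scores, weights))
```

Because the weights depend only on word embeddings, no video examples of the action are needed to score it.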

Page 29: Video search by deep-learning

29

Objects2action localizes actions without examples

Retrieval results from action query only

Jain et al., ICCV 2015

Prediction Ground truth

Page 30: Video search by deep-learning

30

Matching sentences to videos

So far we have considered video search from text only. What about text search from video?

That is: given a video, can we find the best matching sentence?

Page 31: Video search by deep-learning

31

Word2VisualVec: Predicting the visual representation of text

Training time

Dong et al., arXiv 2017

Page 32: Video search by deep-learning

32

Word2VisualVec: Predicting the visual representation of text

Testing time

Dong et al., arXiv 2017
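Once text has been mapped into the visual feature space, matching a video to a sentence reduces to a nearest-neighbour search. The sketch below assumes the predicted visual vectors for each candidate sentence are already available from some trained text-to-visual model (not implemented here); the 3-D vectors are invented for illustration, and cosine similarity is an assumed choice of matching function:

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_sentences(video_feature, predicted_features):
    """Sort sentences by similarity of their predicted visual vector to the video.

    predicted_features: dict mapping sentence -> vector in visual feature space.
    """
    return sorted(predicted_features,
                  key=lambda s: cosine(video_feature, predicted_features[s]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical visual feature of a test video.
    video_feature = [0.9, 0.1, 0.0]
    # Hypothetical outputs of a text-to-visual model for three sentences.
    predicted = {
        "a man plays the cello": [1.0, 0.0, 0.0],
        "two people shake hands": [0.0, 1.0, 0.0],
        "a bird sits on a branch": [0.0, 0.0, 1.0],
    }
    print(rank_sentences(video_feature, predicted)[0])
```

The first element of the returned list is the best matching sentence for the video.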

Page 33: Video search by deep-learning

33

Results

Dong et al., arXiv 2017

Page 34: Video search by deep-learning

34

‘Arithmetic’ with visual and textual query

Page 35: Video search by deep-learning

35

Conclusion

Video search by deep learning is powerful, even without examples

The field is progressing rapidly

Precise spatiotemporal video understanding is next

www.ceessnoek.info