Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Xiaodong Yang, Pavlo Molchanov, Jan Kautz

Source: on-demand.gputechconf.com/gtc/2017/presentation/s7497...

INTELLIGENT VIDEO ANALYTICS: Video Classification

• Surveillance event detection

• Human-computer interaction

• Multimedia search and indexing

INTELLIGENT VIDEO ANALYTICS: Related Work

Local feature extraction:

• Dense trajectories, H. Wang et al., ICCV 2013

Global feature representation:

• Bag-of-visual-words, J. Gemert et al., TPAMI 2009

• Fisher vector, F. Perronnin et al., ECCV 2010

Temporal modeling:

• Spatio-temporal pyramid, X. Yang et al., ECCV 2014

INTELLIGENT VIDEO ANALYTICS: Related Work

• 2D-CNN, A. Karpathy et al., CVPR 2014

• C3D, D. Tran et al., ICCV 2015

• Two-stream networks, K. Simonyan et al., NIPS 2014

• LSTM, J. Ng et al., CVPR 2015

OUR CONTRIBUTIONS

Overview of multilayer and multimodal fusion for video classification

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

MULTILAYER REPRESENTATIONS

Dense image prediction

FCN by Long et al.; FlowNet by Fischer et al.

MULTILAYER REPRESENTATIONS

Features of conv layers

Poses, parts, articulations, objects, etc.

Visualization by Zeiler et al.

MULTILAYER REPRESENTATIONS

Convert feature maps to feature descriptors

Feature maps of dimension 28×28×5

28×28 feature descriptors of dimension 5
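
The conversion on this slide can be sketched in a few lines of plain Python. This is only an illustration: the channel-first nested-list layout and the function name `maps_to_descriptors` are assumptions, not something the slides specify. Each spatial position of the feature maps becomes one descriptor whose length equals the number of channels.

```python
def maps_to_descriptors(feature_maps):
    """Turn C feature maps of size H x W into H*W descriptors of length C.

    feature_maps[c][y][x] is the activation of channel c at position (y, x).
    The channel-first layout is an assumption made for this sketch.
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [feature_maps[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy input with the sizes quoted on the slide: 5 channels of 28 x 28 values.
toy_maps = [
    [[c * 1000 + y * 28 + x for x in range(28)] for y in range(28)]
    for c in range(5)
]
descriptors = maps_to_descriptors(toy_maps)
print(len(descriptors), len(descriptors[0]))
```

With the toy input, the result is 28 × 28 = 784 descriptors of dimension 5, one per spatial location.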

MULTILAYER REPRESENTATIONS

Learn spatial discriminative weights of conv layers

Spatial information of conv layers to enhance representations

[Figure: video frames; feature maps of a conv layer over time; spatial weights of a conv layer, labeled by importance]

MULTILAYER REPRESENTATIONS

Aggregate feature descriptors by Fisher vector (FV)

[Figure: Gaussian mixture model; feature maps of a conv layer over time]
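
As a rough illustration of Fisher vector aggregation, the sketch below encodes a set of descriptors under a diagonal-covariance Gaussian mixture model using the standard gradient formulas with respect to the means and standard deviations. The function name, the toy GMM parameters, and the final signed-square-root and L2 normalization are assumptions following common practice; the slides do not show which exact variant the authors use.

```python
import math


def fisher_vector(descriptors, weights, means, sigmas):
    """Fisher vector of descriptors under a diagonal-covariance GMM.

    descriptors: N lists of length D
    weights: K mixture weights; means and sigmas: K lists of length D
    Returns 2*K*D values: mean gradients first, then sigma gradients.
    """
    n, k, d = len(descriptors), len(weights), len(means[0])
    grad_mu = [[0.0] * d for _ in range(k)]
    grad_sigma = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Log of (mixture weight * Gaussian density) for each component.
        log_p = []
        for j in range(k):
            s = math.log(weights[j])
            for t in range(d):
                z = (x[t] - means[j][t]) / sigmas[j][t]
                s -= 0.5 * z * z + math.log(sigmas[j][t]) + 0.5 * math.log(2 * math.pi)
            log_p.append(s)
        # Posterior of each component, computed with a stable softmax.
        top = max(log_p)
        exp_p = [math.exp(v - top) for v in log_p]
        total = sum(exp_p)
        for j in range(k):
            gamma = exp_p[j] / total
            for t in range(d):
                z = (x[t] - means[j][t]) / sigmas[j][t]
                grad_mu[j][t] += gamma * z
                grad_sigma[j][t] += gamma * (z * z - 1.0)
    fv = []
    for j in range(k):
        fv += [g / (n * math.sqrt(weights[j])) for g in grad_mu[j]]
    for j in range(k):
        fv += [g / (n * math.sqrt(2.0 * weights[j])) for g in grad_sigma[j]]
    # Assumed post-processing: signed square root, then L2 normalization.
    fv = [math.copysign(math.sqrt(abs(v)), v) for v in fv]
    norm = math.sqrt(sum(v * v for v in fv))
    return [v / norm for v in fv] if norm > 0 else fv


# Hypothetical toy data: four 2-D descriptors and a 2-component GMM.
toy_descriptors = [[0.1, 0.2], [0.9, 1.1], [1.0, 0.8], [-0.2, 0.1]]
toy_fv = fisher_vector(
    toy_descriptors,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    sigmas=[[0.5, 0.5], [0.5, 0.5]],
)
print(len(toy_fv))
```

The output length is 2 × K × D, so a conv layer with many channels and a larger mixture yields a high-dimensional video-level representation regardless of how many descriptors were pooled.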

MULTILAYER REPRESENTATIONS

Represent conv layers by improved Fisher vector (iFV)

[Figure: Gaussian mixture model; feature maps of a conv layer over time; spatial weights of a conv layer, labeled by importance]

MULTILAYER REPRESENTATIONS

Represent conv layers by improved Fisher vector (iFV)

Represent fc layers by temporal max pooling

Overview of multilayer representation
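
Temporal max pooling of fc-layer activations can be sketched as follows. The per-frame activation values and the function name `temporal_max_pool` are made up for illustration; the idea is simply to keep, for each fc dimension, the maximum over all frames or snippets of the video.

```python
def temporal_max_pool(fc_activations):
    """Element-wise maximum over time.

    fc_activations: T lists (one per frame or snippet), each of length D.
    Returns one list of length D describing the whole video.
    """
    return [max(column) for column in zip(*fc_activations)]


# Hypothetical fc activations for a 3-frame video with D = 3.
frames = [
    [0.1, 0.7, 0.0],
    [0.4, 0.2, 0.3],
    [0.3, 0.9, 0.1],
]
video_descriptor = temporal_max_pool(frames)
print(video_descriptor)
```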

FC-RNN STRUCTURE: Modeling Temporal Dynamics

Don’t be a hero: use pre-trained models

Many pre-trained models from ImageNet and Sports1M

[Diagram: Images/Snippets: VGG/C3D. Videos, standard RNN: VGG/C3D -> fc layer -> RNN. Videos, FC-RNN: VGG/C3D -> FC-RNN]

FC-RNN STRUCTURE: Modeling Temporal Dynamics

RNN

FC-RNN

Pre-trained CNN, fc layer:

Transfer to recurrent layers

Comparison of standard RNN and FC-RNN
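
The equations from this slide did not survive in the transcript, so the following is only a sketch of one plausible reading of "transfer to recurrent layers": the pre-trained fc weights are reused unchanged as the input-to-hidden transform, and a new hidden-to-hidden matrix adds the recurrence. The names `w_io`, `w_hh`, `b`, the ReLU activation, and the toy numbers are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(values):
    return [max(0.0, v) for v in values]


def fc_layer(x, w_io, b):
    """Pre-trained fc layer on one input: y = relu(W_io x + b)."""
    return relu([a + c for a, c in zip(matvec(w_io, x), b)])


def fc_rnn(inputs, w_io, b, w_hh):
    """Assumed FC-RNN recurrence: h_t = relu(W_io x_t + W_hh h_{t-1} + b).

    w_io and b come from the pre-trained fc layer; only w_hh is new.
    """
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        feed = matvec(w_io, x)
        recur = matvec(w_hh, h)
        h = relu([f + r + c for f, r, c in zip(feed, recur, b)])
        states.append(h)
    return states


# Hypothetical pre-trained fc parameters and a two-step input sequence.
w_io = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
sequence = [[1.0, 2.0], [0.5, -1.0]]

no_recurrence = fc_rnn(sequence, w_io, b, [[0.0, 0.0], [0.0, 0.0]])
with_recurrence = fc_rnn(sequence, w_io, b, [[0.5, 0.0], [0.0, 0.5]])
print(no_recurrence)
print(with_recurrence)
```

With a zero hidden-to-hidden matrix the recurrence reduces exactly to applying the pre-trained fc layer to every time step, which is one way to see why this construction preserves the pre-trained behavior at the start of fine-tuning.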

MULTIMODAL REPRESENTATIONS

Static and dynamic information

2D-CNN/3D-CNN with video frames/optical flow maps

• A single frame (2D-CNN-SF)

• A single flow map (2D-CNN-OF)

• A buffer of frames (3D-CNN-SF)

• A buffer of flow maps (3D-CNN-OF)

FUSION BY BOOSTING

Optimize a linear combination of predictions of multiple layers from multiple modalities

LPBoost:

boost-u: learn uniform weights for all classes

boost-c: learn class specific weights

4 layers and 4 modalities: M = 4 × 4 = 16
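
The linear combination itself can be sketched as below. The LPBoost optimization that learns the weights is not shown; the weights are assumed to be given, and the function names, toy scores, and three-class setup are hypothetical. The sketch only contrasts one weight per source (boost-u) with one weight per source and class (boost-c).

```python
def fuse_uniform(scores, source_weights):
    """boost-u style fusion: one weight per source, shared by all classes.

    scores[m][c] is the score that source m assigns to class c.
    """
    num_classes = len(scores[0])
    return [
        sum(w * s[c] for w, s in zip(source_weights, scores))
        for c in range(num_classes)
    ]


def fuse_class_specific(scores, class_weights):
    """boost-c style fusion: class_weights[m][c] weights source m for class c."""
    num_classes = len(scores[0])
    return [
        sum(cw[c] * s[c] for cw, s in zip(class_weights, scores))
        for c in range(num_classes)
    ]


# 4 layers x 4 modalities give M = 16 prediction sources.
sources = [(layer, modality) for layer in range(4) for modality in range(4)]

# Hypothetical one-hot votes over 3 classes from each source.
scores = [
    [1.0 if c == (layer + modality) % 3 else 0.0 for c in range(3)]
    for layer, modality in sources
]

# Placeholder weights; in the talk they would be learned by LPBoost.
uniform = [1.0 / len(sources)] * len(sources)
fused = fuse_uniform(scores, uniform)
predicted = max(range(3), key=lambda c: fused[c])
print(fused, predicted)
```

Giving every class the same weight within each source makes `fuse_class_specific` coincide with `fuse_uniform`, so boost-c can be read as a strictly more flexible version of boost-u.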

EXPERIMENTS

Benchmark datasets

UCF101: 13,320 videos in 101 classes

HMDB51: 6,766 videos in 51 classes

Example classes: Skiing, Kissing

EXPERIMENTS: FC-RNN

Outperforms RNN and LSTM by 3.0% and 2.9%

Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101

[Plot: error rate vs. epochs]

Up to 3% improvement

EXPERIMENTS: Feature Aggregation

Comparison of FV and iFV to represent conv layers of different modalities

[Figure: spatial weights of a conv layer, labeled by importance, for a single frame, a single flow map, a buffer of frames, and a buffer of flow maps]

Up to 2.5% improvement

EXPERIMENTS: Multilayer Fusion

Classification accuracy of single layers over different modalities and multilayer fusion results

Up to 8% improvement

EXPERIMENTS: Multimodal Fusion

Classification accuracy of different modalities and various combinations

Comparison to the state-of-the-art results

Up to 6% improvement

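EXPERIMENTS: LPBoost

[Figure: two pie charts of fusion weights, one over modalities and one over layers (fc7, conv5, fc6, conv4). Extracted values: 17%, 31%, 23%, 29% and 0%, 38%, 12%, 50%]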

EXPERIMENTS: Effect of Multimodal Fusion

Example videos: SKIING vs. SKIJET

• 2D-CNN-SF: skijet (incorrect)

• Multimodal fusion: skiing (correct)

EXPERIMENTS: Effect of Multimodal Fusion

Example videos: BOXING PUNCHING BAG vs. BOXING SPEED BAG

• 2D-CNN-OF: boxing speed bag (incorrect)

• Multimodal fusion: boxing punching bag (correct)

OUR CONTRIBUTIONS

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

Overview of multilayer and multimodal fusion for video classification
