Towards efficient end-to-end architectures for action recognition and detection in videos

Limin Wang, Computer Vision Laboratory, ETH Zurich
Workshop on frontiers of video technology -- 2017 (17/7/27)


Page 2: Action recognition in videos

§ 1. Action recognition “in the lab”: KTH, Weizmann, etc.
§ 2. Action recognition “in TV, Movies”: UCF Sports, Hollywood, etc.
§ 3. Action recognition “in Web Videos”: HMDB, UCF101, THUMOS, ActivityNet, etc.

Haroon Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in Computer Vision and Image Understanding (CVIU), 2017.

Page 3: Action Understanding Tasks

§ Action Recognition: classify a short clip or untrimmed video into a pre-defined class.
§ Action Temporal Localization: detect the starting and ending times of action instances in untrimmed video.
§ Action Spatial Detection: detect the bounding boxes of actors in trimmed videos.
§ Action Spatial-Temporal Detection: combine temporal and spatial localization in untrimmed videos.

Page 4: Action recognition -- deep networks (2014)

Andrej Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks, in CVPR, 2014.

Page 5: Action recognition -- two stream CNN (2014)

Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in NIPS, 2014.

Page 6: Action recognition -- 3D CNN (2015)

Du Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.

Page 7: Opportunities and Challenges

§ Opportunities
  § Videos provide huge and rich data for visual learning
  § Action is important in motion perception and has many applications
§ Challenges
  § Temporal models and representations
  § High computational and memory cost
  § Noisy and weak labels

Page 8: Overview

Action Recognition (AR)
• Temporal Segment Net (TSN)

Action Detection (AD)
• Structured Segment Net (SSN)

Weakly Supervised AR & AD
• UntrimmedNet

§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR, 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV, 2017.

Page 9: Motivation of TSN

§ Towards end-to-end and video-level architecture.
§ Modeling issue: mainstream CNN frameworks focus on appearance and short-term motion.

Page 10: Modeling Long-Range Structure

Stacking multiple frames: dense and local

[1] Joe Yue-Hei Ng et al., Beyond Short Snippets: Deep Networks for Video Classification, in CVPR, 2015.
[2] C. Feichtenhofer et al., Convolutional Two-Stream Network Fusion for Video Action Recognition, in CVPR, 2016.
[3] Gul Varol et al., Long-term Temporal Convolutions for Action Recognition, in PAMI, 2017.

Page 11: Segment Based Sampling

§ There is high data redundancy in video.
§ High-level semantics vary slowly (slowness).
§ Our segment sampling has two properties:
  § Sparse: processing efficiency
  § Global: duration invariant and modeling the entire video content.
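
The sparse and global sampling idea can be illustrated with a minimal sketch. It assumes the video is split into equal-duration segments and that one frame index is drawn from each; the function name `segment_sample`, its parameters, and the choice of the segment centre at test time are illustrative assumptions, not taken from the slides.

```python
import random

def segment_sample(num_frames, num_segments, training=True, rng=None):
    """Pick one frame index from each of `num_segments` equal-duration segments.

    Hypothetical helper: sparse (one frame per segment) and global
    (segments cover the whole video regardless of its duration).
    """
    if num_frames < 1 or num_segments < 1:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers frame indices [start, end).
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        end = max(end, start + 1)  # guard for videos shorter than num_segments
        if training:
            indices.append(rng.randrange(start, end))  # random position in the segment
        else:
            indices.append((start + end - 1) // 2)     # centre frame (assumed test-time choice)
    return indices

print(segment_sample(90, 3, training=False))  # [14, 44, 74]
```

The number of sampled frames depends only on `num_segments`, so the cost per video stays fixed whether the video has 90 frames or 9,000.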

Page 12: Modeling Long-Range Structure

Segment based sampling: sparse and global

Page 13: Overview of TSN

TSN is a video-level framework based on simple strategies of segment sampling and consensus aggregation.

Page 14: Aggregation Function

§ The aggregation function aims to summarize the predictions of different snippets to yield the video-level prediction.
§ Simple aggregation functions:
  § Mean pooling, max pooling, weighted average
§ Advanced aggregation functions:
  § Top-k pooling, attention weighting
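
The five aggregation functions named above can be sketched in a few lines of plain Python. This is an illustration under assumptions: snippet predictions are given as a list of per-class score lists, the weights and attention logits are supplied by the caller rather than learned, and all function names are hypothetical.

```python
import math

def mean_pool(scores):
    """Average each class score over all snippets."""
    return [sum(col) / len(scores) for col in zip(*scores)]

def max_pool(scores):
    """Take the highest snippet score for each class."""
    return [max(col) for col in zip(*scores)]

def weighted_average(scores, weights):
    """Weight each snippet by a fixed (or learned) scalar, then normalise."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, col)) / total for col in zip(*scores)]

def top_k_pool(scores, k):
    """For each class, average the k highest snippet scores."""
    k = min(k, len(scores))
    return [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*scores)]

def attention_pool(scores, attention_logits):
    """Softmax the per-snippet attention logits and use them as weights."""
    m = max(attention_logits)
    exps = [math.exp(a - m) for a in attention_logits]
    return weighted_average(scores, exps)

# Three snippets, two classes.
snippet_scores = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
print(mean_pool(snippet_scores))      # [2.0, 2.0]
print(max_pool(snippet_scores))       # [3.0, 4.0]
print(top_k_pool(snippet_scores, 2))  # [2.5, 3.0]
```

Max pooling is top-k pooling with k = 1, and mean pooling is top-k pooling with k equal to the number of snippets, so top-k interpolates between the two simple choices.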

Page 15: Input modalities

§ Stacking RGB difference
§ Stacking warped optical flow field
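
As a rough illustration of the RGB difference modality, the sketch below assumes it means the pixel-wise difference between consecutive frames, stacked over a short window. Frames are represented as flat lists of pixel values to keep the example dependency-free; the function name is hypothetical.

```python
def stacked_rgb_difference(frames):
    """Return the pixel-wise differences between each pair of consecutive frames.

    `frames` is a list of frames, each a flat list of pixel values
    (an assumed representation; real inputs would be image tensors).
    """
    return [
        [cur_px - prev_px for prev_px, cur_px in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]

frames = [[10, 20, 30], [12, 18, 30], [15, 18, 25]]
print(stacked_rgb_difference(frames))  # [[2, -2, 0], [3, 0, -5]]
```

Unlike optical flow, this needs no separate motion estimation step, which is why it is attractive when efficiency matters.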

Page 16: Experiment result -- input modality

S. Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, in ICML, 2015.

Page 17: Exploration on TSN

Accuracy on UCF101 (%) by number of segments K

                  K=1    K=3    K=5    K=7    K=9
Two stream        92.4   94.2   94.7   94.9   94.9
Spatial stream    85     86.5   86.7   86.4   86.2
Temporal stream   88.3   89.8   90.1   90.1   89.7

Page 18: Evaluation on TSN

mAP on ActivityNet (%) by aggregation function

                  Max pooling  Average pooling  Weighted average  Top-k pooling  Attention weighting
Spatial stream    81.8         84               83.1              84.7           84.1
Temporal stream   62           72.8             70.5              73.6           71.8
Two stream        85.4         87.8             86.4              88.1           88.2

Accuracy on UCF101 (%) by aggregation function

                  Max pooling  Average pooling  Weighted average  Top-k pooling  Attention weighting
Spatial stream    84.9         86.4             86.4              85.5           86.1
Temporal stream   83.5         90.1             89.7              88.8           89.1
Two stream        92.4         94.9             94.8              94.2           94.6

Page 19: Comparison on Trimmed Datasets

[Bar chart] Accuracy on HMDB51 (%), bars in chart order (method labels not preserved): 57.2, 61.7, 59.4, 56.8, 63.2, 64.8, 71
[Bar chart] Accuracy on UCF101 (%), bars in chart order (method labels not preserved): 85.9, 88.3, 88, 85.2, 90.3, 91.7, 94.9

[1] K. Soomro et al., UCF101: A dataset of 101 human action classes from videos in the wild, arXiv:1212.0402, 2012.
[2] H. Kuehne et al., HMDB: A large video database for human motion recognition, in ICCV, 2011.

Page 20: Comparison on Untrimmed Datasets

[Bar chart] mAP on THUMOS (%), bars in chart order (method labels not preserved): 63.1, 66.1, 71.6, 61.5, 80.1
[Bar chart] mAP on ActivityNet (%), bars in chart order (method labels not preserved): 66.5, 71.9, 74.1, 78.1, 89.6

[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.
[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

Page 21: CVPR ActivityNet Challenge -- 2016

Page 22: CVPR ActivityNet Challenge -- 2016

Page 23: CVPR ActivityNet Challenge -- 2016

Page 24: Kinetics dataset

Page 25: Results on Kinetics Dataset

RGB, Pretrain on ImageNet, TSN: top-1 70.28%, 89.13%
RGB, Train from scratch, TSN: top-1 69.55%, 88.68%

Page 26: Overview

Action Recognition (AR)
• Temporal Segment Net (TSN)

Action Detection (AD)
• Structured Segment Net (SSN)

Weakly Supervised AR & AD
• UntrimmedNet


Page 27: Motivation of Structured Segment Network

1. Action detection in untrimmed video is an important problem.
2. It is difficult for a snippet-level classifier to accurately localize the temporal extent of an action instance.

Context and Structure Modeling!

Page 28: Temporal Region Proposal

Bottom-up proposal generation based on actionness map
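
A deliberately simplified sketch of bottom-up proposal generation follows. It assumes a 1-D actionness score per snippet and groups consecutive snippets whose score clears a threshold into one temporal proposal. The threshold value and the function name are hypothetical, and the grouping scheme actually used in SSN may differ.

```python
def group_actionness(actionness, threshold=0.5):
    """Group consecutive snippets with actionness >= threshold into proposals.

    Returns a list of (start, end) snippet index pairs, end exclusive.
    """
    proposals = []
    start = None
    for i, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = i                      # a new region begins
        elif score < threshold and start is not None:
            proposals.append((start, i))   # the region ends before snippet i
            start = None
    if start is not None:
        proposals.append((start, len(actionness)))  # region runs to the end
    return proposals

scores = [0.1, 0.7, 0.9, 0.2, 0.6, 0.8, 0.8]
print(group_actionness(scores, threshold=0.5))  # [(1, 3), (4, 7)]
```

Running the same grouping with several thresholds gives candidates of different lengths, one plausible way to cover actions of varying duration.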

Page 29: Structured Segment Network (SSN)

Page 30: Two Classifier Design

§ To model the action classes and the completeness of instances, we design a two-classifier loss:
  P(c,b|p) = P(c|p) P(b|c,p)
§ The action class classifier measures the likelihood of the action class distribution: P(c|p)
§ The completeness classifier measures the likelihood of instance completeness: P(b|c,p)
§ A joint loss to optimize these two classifiers:
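
The joint loss formula itself did not survive in this transcript. As a hedged sketch only: if the loss is taken to be the negative log-likelihood of the factorized probability P(c,b|p) = P(c|p) P(b|c,p), it splits into one term per classifier. The function names and the `use_completeness` switch (for proposals where completeness is not defined, such as background) are assumptions for illustration.

```python
import math

def joint_probability(p_class, p_complete):
    """P(c, b | p) = P(c | p) * P(b | c, p)."""
    return p_class * p_complete

def joint_nll(p_class, p_complete, use_completeness=True):
    """Negative log-likelihood of the factorized probability (assumed loss form)."""
    loss = -math.log(p_class)            # action class classifier term
    if use_completeness:
        loss -= math.log(p_complete)     # completeness classifier term
    return loss

p_class, p_complete = 0.8, 0.5
print(joint_probability(p_class, p_complete))
print(joint_nll(p_class, p_complete))
```

Because the log of a product is a sum of logs, `joint_nll(0.8, 0.5)` equals `-log(0.8 * 0.5)`, so both classifiers can be trained through a single scalar loss.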

Page 31: Experiment result -- Action Proposal

[1] V. Escorcia, F. Caba Heilbron, J. C. Niebles, and B. Ghanem, Daps: Deep action proposals for action understanding, in ECCV, pages 768–784, 2016.
[2] Fabian Caba Heilbron et al., Fast temporal activity proposals for efficient detection of human actions in untrimmed videos, in CVPR, 2016.

Page 32: Experiment result -- Component Analysis

Page 33: Experiment result -- Comparison

[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.
[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

Page 34: Detection example (1)

Green: correct detection
Red: bad localization
Yellow: multiple detections

Page 35: Detection example (2)

Green: correct detection
Red: bad localization
Yellow: multiple detections

Page 36: Overview of temporal modeling

Action Recognition (AR)
• Temporal Segment Net (TSN)

Action Detection (AD)
• Structured Segment Net (SSN)

Weakly Supervised AR & AD
• UntrimmedNet


Page 37: Motivation of UntrimmedNet

1. Labeling untrimmed video is expensive and time consuming.
2. Temporal annotation is subjective and not consistent across persons and datasets.

Page 38: Overview of UntrimmedNet

Page 39: Clip Proposal

§ Uniform Sampling
  § Uniform sampling of fixed duration
§ Shot based Sampling
  § First, shot detection based on HOG difference
  § For each shot, perform uniform sampling.
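
Both clip proposal strategies can be sketched as follows. The sketch makes several assumptions: clips are half-open frame ranges, a shot boundary is declared when the L1 distance between consecutive per-frame feature vectors exceeds a threshold, and those feature vectors stand in for HOG descriptors. All names and the threshold are hypothetical.

```python
def uniform_clips(start, end, clip_length):
    """Split [start, end) into consecutive clips of fixed duration.

    The last clip may be shorter if the range does not divide evenly.
    """
    return [(s, min(s + clip_length, end)) for s in range(start, end, clip_length)]

def shot_boundaries(frame_features, threshold):
    """Frame indices where the feature change from the previous frame exceeds threshold."""
    boundaries = []
    for i in range(1, len(frame_features)):
        diff = sum(abs(a - b) for a, b in zip(frame_features[i - 1], frame_features[i]))
        if diff > threshold:
            boundaries.append(i)
    return boundaries

def shot_based_clips(frame_features, threshold, clip_length):
    """Detect shots first, then sample clips uniformly inside each shot."""
    cuts = [0] + shot_boundaries(frame_features, threshold) + [len(frame_features)]
    clips = []
    for shot_start, shot_end in zip(cuts, cuts[1:]):
        clips.extend(uniform_clips(shot_start, shot_end, clip_length))
    return clips

print(uniform_clips(0, 10, 4))  # [(0, 4), (4, 8), (8, 10)]

# Toy 1-D features with an abrupt change at frame 3.
features = [[0.0], [0.1], [0.0], [5.0], [5.1], [5.0], [5.2]]
print(shot_based_clips(features, threshold=1.0, clip_length=2))
# [(0, 2), (2, 3), (3, 5), (5, 7)]
```

With shot-based sampling no clip straddles a shot change, which is the practical difference from plain uniform sampling.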

Page 40: Clip Classification

§ Following the TSN framework:
  § Sampling a few snippets from each clip.
  § Aggregating snippet-level predictions with average pooling
§ In practice, we use two stream input: RGB and Optical Flow

Page 41: Clip Selection

§ Selection aims to select discriminative clips or rank them with attention weights.
§ Two selection methods:
  § Hard selection: top-k pooling over clip-level prediction
  § Soft selection: learning attention weights for different clips

Page 42: Top-k Pooling

[Figure] Example video: Washing Dishes

Page 43: Attention weighting

[Figure] Five clips with their attention values:
Attention scores:                       -1.3   0.8   -0.9   1.5   0.4
Attention weights (softmax of scores):  0.03   0.25  0.05   0.50  0.17
Each weight multiplies (x) the prediction of the corresponding clip.
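
The two rows of numbers on this slide are consistent with a softmax: applying it to the five scores reproduces the five weights after rounding to two decimals. The short check below uses only the values from the slide; the helper name `softmax` is ours.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

attention_scores = [-1.3, 0.8, -0.9, 1.5, 0.4]   # first row on the slide
attention_weights = softmax(attention_scores)
print([round(w, 2) for w in attention_weights])  # [0.03, 0.25, 0.05, 0.5, 0.17]
```

The weights sum to 1, so the clip with score 1.5 contributes about half of the video-level prediction while the two negative-score clips are nearly suppressed.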

Page 44: UntrimmedNet

§ UntrimmedNet is an end-to-end learning architecture, combining three modules: feature extraction, classification module, selection module.
§ Video-level prediction: a bilinear model over classification scores and selection weights.
§ The whole pipeline can be optimized with the standard backpropagation algorithm.
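
One way to read "a bilinear model over classification scores and selection weights" is that the video-level score of each class is the sum, over clips, of the clip's selection weight times its class score. The sketch below assumes exactly that reading; the function name and the toy numbers are illustrative.

```python
def video_prediction(clip_scores, selection_weights):
    """Video-level score per class: sum over clips of weight * clip class score.

    `clip_scores[i][c]` is the score of class c for clip i;
    `selection_weights[i]` is the selection (attention) weight of clip i.
    """
    num_classes = len(clip_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(selection_weights, clip_scores))
        for c in range(num_classes)
    ]

clip_scores = [[0.1, 0.9], [0.8, 0.2]]   # two clips, two classes
selection_weights = [0.25, 0.75]         # e.g. softmax attention weights
print(video_prediction(clip_scores, selection_weights))
```

The output is linear in the scores for fixed weights and linear in the weights for fixed scores, and every step is differentiable, which is what allows the classification and selection modules to be trained jointly by backpropagation.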

Page 45: Weakly supervised AR and AD

§ Action Recognition:
  § In practice, we sample a single frame (or a stack of 5 optical flow frames) every 30 frames.
  § The recognition scores from the sampled frames are aggregated with top-k pooling (k set to 20) to yield the final video-level prediction.
§ Action Detection:
  § We sample frames every 15 frames and, for each frame, we get both prediction scores and attention weights.
  § We remove background by thresholding (set to 0.0001) on the attention weights.
  § We produce the final detection results by thresholding (set to 0.5) on the classification scores.
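
The inference procedure above can be sketched for a single action class. The two thresholds and k = 20 come from the slide; the function names, the data layout, and the choice of strict versus non-strict comparisons are assumptions.

```python
ATTENTION_THRESHOLD = 0.0001  # background removal threshold from the slide
SCORE_THRESHOLD = 0.5         # detection threshold from the slide

def top_k_recognition(class_scores, k=20):
    """Video-level score for one class: mean of the k highest frame scores."""
    if not class_scores:
        raise ValueError("class_scores must not be empty")
    k = min(k, len(class_scores))
    return sum(sorted(class_scores, reverse=True)[:k]) / k

def detect_frames(frame_indices, class_scores, attention_weights):
    """Return the sampled frame indices kept as detections for one class."""
    kept = []
    for idx, score, attn in zip(frame_indices, class_scores, attention_weights):
        if attn <= ATTENTION_THRESHOLD:
            continue              # low attention: treated as background
        if score >= SCORE_THRESHOLD:
            kept.append(idx)      # confident and attended: part of a detection
    return kept

frames = [0, 15, 30, 45]                 # one sample every 15 frames
scores = [0.9, 0.7, 0.2, 0.8]
attention = [0.3, 0.00005, 0.4, 0.2]
print(detect_frames(frames, scores, attention))  # [0, 45]
```

Frame 15 has a high class score but is dropped for low attention, and frame 30 is attended but scores below 0.5, so each threshold filters a different kind of false positive.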

Page 46: Exploration Study

Page 47: Exploration Study

Page 48: Exploration Study

Page 49: Experiment Results -- Action recognition

[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.
[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

Page 50: Experiment Results -- Action recognition

[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.
[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

Page 51: Experiment Results -- Action detection

[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.

Page 52: Examples of Attention

Page 53: Summary

§ Temporal modeling is important for action understanding.
§ Segment based sampling has two properties: global and sparse.
§ TSN is a general and flexible framework for action modeling.
§ SSN extends TSN for action detection with context and structure modeling.
§ UntrimmedNet extends TSN for the weakly supervised setting with attention modeling.

Page 54: Code and References

§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR, 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV, 2017.

§ Temporal segment network: https://github.com/yjxiong/temporal-segment-network
§ Structured segment network: https://github.com/yjxiong/action-detection
§ UntrimmedNet: https://github.com/wanglimin/UntrimmedNet
§ Video Caffe: https://github.com/yjxiong/caffe

Page 55: Collaborators