Motionlet: a middle level part for action...


Page 1: Motionlet: a middle level part for action classificationice.dlut.edu.cn/valse2018/ppt/09-2018.pdfVideo Benchmarks UCF101 (13,320 videos,101 actions ) HMDB51 (6,849 videos, 51 actions

Yu QIAO 2018.4

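Action Recognition and Detection: Annual Progress 2018

Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

April 22, 2018

VALSE 2018 - Dalian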

Page 2:

SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS

Outline

Video action datasets

Action recognition methods

Action detection methods

Future research directions

Page 3:

Video Benchmarks

UCF101 (13,320 videos, 101 actions)

HMDB51 (6,849 videos, 51 actions)

1. The widely used datasets are small-scale

2. It is hard to investigate the spatial-temporal representations of deep neural networks on them

Page 4:

Large-Scale Video Sets

ActivityNet

• 200 classes

• 100 untrimmed videos per class

• 1.54 activity instances per video

• 648 video hours

Youtube8M

Page 5:

Large-Scale Video Sets

AVA

• 80 atomic actions

• 192 clips (15 mins per clip)

• 740k annotations

Kinetics

• 306,245 videos in total

• 400 action classes

• Each clip lasts around 10s

Moments in Time

• over 1,000,000 videos

• 339 Moment classes

• 3-second videos

Page 6:

Large-Scale Video Sets

Benchmarks | Year | Team | Task

ActivityNet (http://activity-net.org/index.html) | 2015 | Universidad del Norte & KAUST | Untrimmed Action Recognition; Temporal Action Proposals; Temporal Action Localization; Dense-Captioning Events in Videos

Youtube8M (https://research.google.com/youtube8m/index.html) | 2016 | Google | Video Classification

Kinetics (https://deepmind.com/research/open-source/open-source-datasets/kinetics/) | 2017 | Google (DeepMind) | Trimmed Activity Recognition

AVA (https://research.google.com/ava/index.html) | 2017 | Google | Spatio-temporal Action Localization

Moments in Time (http://moments.csail.mit.edu/) | 2018 | MIT | Trimmed Event Recognition

Page 7:

Action Recognition

2D CNNs

Recurrent Modeling

3D CNNs

Page 8:

Temporal Linear Encoding (CVPR 17)

Ali Diba et al., Deep Temporal Linear Encoding Networks, CVPR 2017

Deep Temporal Linear Encoding (TLE) Networks:

1. Aggregating K segments into a video representation

2. Bilinear encoding for feature interactions
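
The two TLE steps can be sketched roughly as follows. This is a minimal illustration, assuming element-wise multiplication as the aggregation operator and plain (non-compact) bilinear pooling as the encoding; the function name `tle_bilinear` and the array shapes are hypothetical and not taken from the slides.

```python
import numpy as np

def tle_bilinear(segment_maps):
    """Sketch of temporal linear encoding.

    segment_maps: array of shape (K, C, H, W), one CNN feature map per
    video segment (shape convention is an assumption).
    """
    # Step 1: aggregate the K segment maps into one map.
    # Element-wise product is assumed here as the aggregation operator.
    agg = np.prod(segment_maps, axis=0)            # (C, H, W)
    c = agg.shape[0]
    x = agg.reshape(c, -1)                         # (C, H*W)

    # Step 2: bilinear encoding, i.e. the sum over locations of the
    # outer product of each local feature with itself.
    bilinear = x @ x.T                             # (C, C)
    v = bilinear.reshape(-1)

    # Signed square root and L2 normalisation, as is common for
    # bilinear features.
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The resulting vector has C*C entries, which is why compact approximations of bilinear pooling are often preferred in practice.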

Page 9:

UntrimmedNets (CVPR 17)

Limin Wang et al., UntrimmedNets for Weakly Supervised Action Recognition and Detection, CVPR 2017

UntrimmedNet:

1. Attention for proposal selection

2. Weakly-supervised detection
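
A rough sketch of attention-based (soft) proposal selection is given below. It assumes that each clip proposal already has a vector of class scores and a scalar attention logit; the names `soft_selection`, `clip_scores` and `attention_logits` are hypothetical, and the exact formulation in the paper may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def soft_selection(clip_scores, attention_logits):
    """clip_scores: (N, C) class scores for N clip proposals.
    attention_logits: (N,) one attention logit per proposal.
    Returns video-level class probabilities and the attention weights.
    """
    # Attention weights are normalised across the proposals of one video.
    weights = softmax(attention_logits, axis=0)    # (N,)
    # Video-level score is the attention-weighted sum of clip scores.
    video_scores = weights @ clip_scores           # (C,)
    return softmax(video_scores), weights
```

Only a video-level label is needed to train such a model, and the learned attention weights can then be reused to rank clips for detection.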

Page 10:

Recurrent Modeling

Wenbin Du et al., Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos, IEEE TIP 2018

Recurrent Spatial-Temporal Attention Network (ours):

1. Spatial-temporal attention from global video context

2. Attention-driven two-stream fusion

3. Actor-attention regularization to highlight action regions

Page 11:

Recurrent Pose Attention Network (ICCV 17 Oral)

Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017

Recurrent Pose Attention Network (ours, ICCV oral):

1. Pose attention as dynamic guidance for LSTM

2. Byproduct: pose estimation in videos

Page 12:

Demo of RPAN

Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017

Page 13:

Applications of our RSTAN & RPAN

Video Surveillance

Home Service Robot

Human-Computer Interactions

Page 14:

Inflated 3D (I3D) ConvNets (CVPR 17)

Inflated 3D (I3D) ConvNets:

1. Inflating 2D ConvNets into 3D

2. Bootstrapping 3D filters from 2D filters

3. Proposes the Kinetics dataset
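
Inflation and bootstrapping can be illustrated with a few lines of NumPy. In this sketch a pretrained 2D kernel is repeated along a new time axis and rescaled by the temporal length, so that a video made of identical frames produces the same response as the original 2D filter on a single frame. The function name and the weight layout (out, in, kh, kw) are assumptions for illustration.

```python
import numpy as np

def inflate_2d_filter(w2d, time_dim):
    """Inflate 2D conv weights (out, in, kh, kw) to 3D weights
    (out, in, time_dim, kh, kw)."""
    # Copy the 2D kernel time_dim times along a new temporal axis.
    w3d = np.repeat(w2d[:, :, None, :, :], time_dim, axis=2)
    # Rescale so the temporal sum of the 3D kernel equals the 2D kernel.
    return w3d / time_dim
```

For example, `inflate_2d_filter(w, 3)` turns a 3x3 kernel into a 3x3x3 kernel whose three temporal slices each hold one third of the original weights.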

Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017

Page 15:

Pseudo-3D Residual Networks (ICCV2017)

Pseudo-3D (P3D) Residual Net:

• 3 types of P3D blocks

• Interleaving Design for ResNet

Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017

ActivityNet

Page 16:

Spatiotemporal Separable 3D (CVPR 2018)

Spatiotemporal Separable 3D (S3D) Convolutions:

• 3D Conv (Kt x K x K) → Spatial Conv (1 x K x K) + Temporal Conv (Kt x 1 x 1)
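
To see why the factorisation is attractive, the weight counts of the two designs can be compared directly. The sketch below ignores biases and assumes the intermediate channel width equals the output width unless `c_mid` is given; both function names are hypothetical.

```python
def full_3d_params(c_in, c_out, kt, k):
    # One Kt x K x K kernel per (input, output) channel pair.
    return c_in * c_out * kt * k * k

def separable_3d_params(c_in, c_out, kt, k, c_mid=None):
    # Spatial 1 x K x K conv followed by temporal Kt x 1 x 1 conv.
    # Assumption: the intermediate width defaults to c_out.
    if c_mid is None:
        c_mid = c_out
    spatial = c_in * c_mid * k * k
    temporal = c_mid * c_out * kt
    return spatial + temporal

print(full_3d_params(64, 64, 3, 3))       # 110592
print(separable_3d_params(64, 64, 3, 3))  # 49152
```

Under these assumptions a 3 x 3 x 3 layer with 64 input and 64 output channels needs fewer than half the weights once it is separated.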

Saining Xie et al., Rethinking Spatiotemporal Feature Learning For Video Understanding, arxiv, 2017

Kinetics

Page 17:

ResNet (2+1)D (CVPR 18)

ResNet (2+1)D:

• Spatial-temporal decomposition of 3D convolutions, but with a similar number of parameters
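
The parameter matching can be made concrete. In the cited paper the number of intermediate channels of a (2+1)D block is chosen so that the spatial and temporal convolutions together have roughly as many weights as the 3D convolution they replace. The sketch below implements that rule for a kernel of temporal size `t` and spatial size `d`; the function names are hypothetical and biases are ignored.

```python
def mid_channels(n_in, n_out, t=3, d=3):
    # Largest integer M such that
    #   n_in * M * d*d + M * n_out * t <= n_in * n_out * t * d*d
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(n_in, n_out, t=3, d=3):
    return n_in * n_out * t * d * d

def params_2plus1d(n_in, n_out, t=3, d=3):
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d      # 1 x d x d convolution
    temporal = m * n_out * t        # t x 1 x 1 convolution
    return spatial + temporal

print(mid_channels(64, 64))    # 144
print(params_3d(64, 64))       # 110592
print(params_2plus1d(64, 64))  # 110592
```

With matched weight counts, any accuracy difference between the two blocks comes from the decomposition itself (for instance the extra nonlinearity between the two convolutions) rather than from model size.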

Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018

Kinetics

Page 18:

ARTNet (CVPR 2018)

Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018

ARTNet: Modeling multiplicative interactions between two patches of consecutive frames

[Figure: appearance branch and relation branch; results on Kinetics, measured as 0.5 × (top-1 error + top-5 error)]

Page 19:

Non-local Neural Networks (CVPR 2018)

Non-local Neural Networks: spatial-temporal attention among frames
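
A non-local operation can be sketched as self-attention over all space-time positions of a feature map. The NumPy code below follows the embedded-Gaussian form with a residual connection; the weight matrices stand in for 1x1x1 convolutions, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """x: features of shape (T, H, W, C).
    w_theta, w_phi, w_g: (C, C_mid) projection matrices.
    w_out: (C_mid, C) output projection.
    """
    t, h, w, c = x.shape
    flat = x.reshape(-1, c)                  # (T*H*W, C)

    theta = flat @ w_theta                   # queries
    phi = flat @ w_phi                       # keys
    g = flat @ w_g                           # values

    # Pairwise affinities between every pair of space-time positions,
    # normalised with a softmax over the key positions.
    logits = theta @ phi.T                   # (T*H*W, T*H*W)
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)

    y = attn @ g                             # aggregate values
    z = flat + y @ w_out                     # residual connection
    return z.reshape(t, h, w, c)
```

If `w_out` is initialised to zero the block is an identity mapping, which makes it easy to insert into a pretrained network.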

Xiaolong Wang et al., Non-local Neural Networks, CVPR2018

Kinetics

Page 20:

Action Detection

Tianwei Lin et al., ActivityNet2017 (Temporal Action Localization Winner) & ACM MM 2017

Single Shot Action Detector (SSAD):

• Snippet-level feature extraction from two-stream CNN

• Temporal convolution with anchor mechanism (Inspired by SSD & YOLO)
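
The anchor mechanism carries over from 2D object detection to the time axis: every cell of a temporal feature map is associated with a few default segments of different lengths. The sketch below generates such anchors in normalised time [0, 1]; the base scale and the ratios are made-up values for illustration, not the settings used by SSAD.

```python
def temporal_anchors(num_cells, base_scale, ratios):
    """Return (start, end) anchors in normalised time for a temporal
    feature map with num_cells cells."""
    anchors = []
    for i in range(num_cells):
        center = (i + 0.5) / num_cells       # center of cell i
        for r in ratios:
            width = base_scale * r
            # Clip anchors to the video extent.
            start = max(0.0, center - width / 2)
            end = min(1.0, center + width / 2)
            anchors.append((start, end))
    return anchors

anchors = temporal_anchors(num_cells=4, base_scale=0.25, ratios=[1.0, 2.0])
print(len(anchors))   # 8
print(anchors[0])     # (0.0, 0.25)
```

A detector of this kind then predicts, for each anchor, class scores together with offsets to its center and width.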

Page 21:

Action Detection

Yue Zhao et al., Temporal Action Detection with Structured Segment Networks,ICCV2017

Structured Segment Networks (SSN):

• Temporal actionness grouping for proposal generation

• Structured temporal pyramid pooling with the contextual proposal

• Activity + completeness classifier to produce the final probability of proposal
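
The last step can be read as a product of two probabilities: how likely the proposal shows a given activity, and how likely it covers a complete instance of that activity. A minimal sketch, assuming both classifiers output per-class probabilities stored in dictionaries (the names are hypothetical):

```python
def proposal_scores(activity_probs, completeness_probs):
    """Combine per-class activity and completeness probabilities
    into a final score for one proposal."""
    return {
        cls: activity_probs[cls] * completeness_probs[cls]
        for cls in activity_probs
        if cls in completeness_probs
    }

scores = proposal_scores(
    {"high_jump": 0.8, "long_jump": 0.2},
    {"high_jump": 0.5, "long_jump": 0.9},
)
best = max(scores, key=scores.get)   # "high_jump"
```

A proposal that clearly shows the activity but only covers part of it is therefore down-weighted by its low completeness probability.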

Page 22:

Action Detection

Convolutional-De-Convolutional (CDC) Networks

• CDC: jointly performing spatial downsampling and temporal upsampling

• Dense score prediction at the frame-level for proposal segments

Zheng Shou et al., CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos, CVPR2017, oral

Page 23:

Future: 3D CNNs

Designing effective modules in 3D CNNs can be crucial for large-scale video classification

To name a few:

• Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017

• Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017

• Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018

• Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018

• Xiaolong Wang et al., Non-local Neural Networks, CVPR2018

I3D

Page 24:

Future: Pose & Action

Pose provides discriminative guidance for recognizing human actions in videos

To name a few:

• Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017, oral (ours)

• Mohammadreza Zolfaghari et al., Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection, ICCV2017

• Sijie Yan et al., Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, AAAI2018

• Mengyuan Liu et al., Recognizing Human Actions as Evolution of Pose Estimation Maps, CVPR2018

• Diogo Luvizon et al., 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning, CVPR2018

• Vasileios Choutas et al., PoTion: Pose MoTion Representation for Action Recognition, CVPR2018

RPAN (ours)

Page 25:

Future: Motion Prediction

To name a few:

• Eddy Ilg et al., FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks, CVPR2017

• Zelun Luo et al., Unsupervised Learning of Long-Term Motion Dynamics for Videos, CVPR2017

• Xiaodan Liang et al., Dual Motion GAN for Future-Flow Embedded Video Prediction, ICCV2017

• Shuyang Sun et al., Optical Flow Guided Feature: A Motion Representation for Video Action Recognition, CVPR2018

• Lijie Fan et al., End-to-End Learning of Motion Representation for Video Understanding, CVPR2018

• Ruohan Gao et al., Im2Flow: Motion Hallucination from Static Images for Action Recognition, CVPR2018

• Lei Zhou et al., Temporal Hallucinating for Action Recognition with Few Still Images, CVPR2018 (ours)

Learning flow in the videos

FlowNet2.0

Learning flow in the images?!

HVM (ours)

Page 26:

Future: Video Understanding

Video Caption

Video Summarization

Ranjay Krishna et al., Dense-Captioning Events in Videos, ICCV2017

Kaiyang Zhou et al., Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward, AAAI2018, oral (ours)

DSN (ours)

Spatial-Temporal Localization

Chunhui Gu et al., AVA: A video dataset of spatio-temporally localized atomic visual actions, arXiv, 2017

Page 27:

Yu QIAO 2018.4

Thanks!

Q&A