
Yu QIAO 2018.4

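Action Recognition and Detection: Progress in 2018

Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

April 22, 2018

VALSE 2018 - Dalian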

SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS

Outline

Video action datasets

Action recognition methods

Action detection methods

Future research directions


Video Benchmarks

UCF101 (13,320 videos, 101 actions)

HMDB51 (6,849 videos, 51 actions)

1. These widely used datasets are small-scale

2. This makes it hard to investigate the spatial-temporal representations of deep neural networks


Large-Scale Video Sets

Youtube8M

ActivityNet

• 200 classes

• 100 untrimmed videos per class

• 1.54 activity instances per video

• 648 video hours

AVA

• 80 atomic actions

• 192 clips (15 mins per clip)

• 740k annotations

Kinetics

• 306,245 videos in total

• 400 action classes

• Each clip lasts around 10s

Moments in Time

• over 1,000,000 videos

• 339 Moment classes

• 3-second videos



Benchmarks (with URL, year, team, and tasks)

ActivityNet (2015, Universidad del Norte & KAUST)

http://activity-net.org/index.html

• Untrimmed Action Recognition

• Temporal Action Proposals

• Temporal Action Localization

• Dense-Captioning Events in Videos

Youtube8M (2016, Google)

https://research.google.com/youtube8m/index.html

• Video Classification

Kinetics (2017, Google (DeepMind))

https://deepmind.com/research/open-source/open-source-datasets/kinetics/

• Trimmed Activity Recognition

AVA (2017, Google)

https://research.google.com/ava/index.html

• Spatio-temporal Action Localization

Moments in Time (2018, MIT)

http://moments.csail.mit.edu/

• Trimmed Event Recognition


Action Recognition

2D CNNs

Recurrent Modeling

3D CNNs


Temporal Linear Encoding (CVPR 17)

Ali Diba et al., Deep Temporal Linear Encoding Networks, CVPR 2017

Deep Temporal Linear Encoding (TLE) Networks:

1. Aggregating K segments into a video representation

2. Bilinear encoding for feature interactions
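
The two TLE steps above can be illustrated with a minimal NumPy sketch. The slide does not say how the K segments are aggregated, so the element-wise product used here is an assumption, as are the function name `tle_encode`, the signed square root, and the L2 normalization.

```python
import numpy as np

def tle_encode(segment_maps):
    """Encode K segment feature maps of shape (K, C, H, W) into one video vector."""
    # Step 1: aggregate the K segment maps into a single (C, H, W) map.
    # Element-wise multiplication is an assumed aggregation choice.
    aggregated = np.prod(segment_maps, axis=0)
    # Step 2: bilinear encoding, i.e. the sum over spatial locations of the
    # outer product of each C-dimensional feature with itself.
    channels = aggregated.shape[0]
    flat = aggregated.reshape(channels, -1)      # (C, H*W)
    bilinear = flat @ flat.T                     # (C, C) channel interactions
    vector = bilinear.reshape(-1)
    # Signed square root and L2 normalization (assumed post-processing).
    vector = np.sign(vector) * np.sqrt(np.abs(vector))
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
```

The output has C x C entries, so a practical system would likely need a compact approximation of the bilinear map for large C.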


UntrimmedNets (CVPR 17)

Limin Wang et al., UntrimmedNets for Weakly Supervised Action Recognition and Detection, CVPR 2017

UntrimmedNet:

1. Attention for proposal selection

2. Weakly-supervised detection
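
Attention-based proposal selection can be sketched as attention-weighted pooling of per-proposal class scores. This is only an illustration under stated assumptions: the function names, the pooling of logits before the final softmax, and the input shapes are hypothetical and not taken from the slide.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def video_prediction(class_logits, attention_logits):
    """class_logits: (N, C) scores for N proposals; attention_logits: (N,)."""
    # Attention weights over proposals: they sum to 1 across the video.
    attention = softmax(attention_logits, axis=0)
    # Video-level logits are the attention-weighted sum of proposal logits.
    video_logits = attention @ class_logits
    # Only a video-level label is needed to train on these probabilities.
    return softmax(video_logits, axis=0), attention
```

At test time the same attention weights can rank proposals, which is one way to obtain detections without temporal annotations.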


Recurrent Modeling

Wenbin Du et al., Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos, IEEE TIP 2018

Recurrent Spatial-Temporal Attention Network (ours):

1. Spatial-temporal attention from global video context

2. Attention-driven two-stream fusion

3. Actor-attention regularization to highlight action regions


Recurrent Pose Attention Network (ICCV 17 Oral)

Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017

Recurrent Pose Attention Network (ours, ICCV oral):

1. Pose attention as dynamical guidance for LSTM

2. Byproduct: pose estimation in videos


Demo of RPAN

Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017


Applications of our RSTAN & RPAN

Video Surveillance

Home Service Robot

Human-Computer Interactions


Inflated 3D (I3D) ConvNets (CVPR 17)

Inflated 3D (I3D) ConvNets:

1. Inflating 2D ConvNets into 3D

2. Bootstrapping 3D filters from 2D Filters

3. Proposing the Kinetics dataset

Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017
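
Points 1 and 2 above can be sketched in a few lines of NumPy: a 2D filter is repeated along a new time axis and rescaled, so a video made of one repeated frame gives the same response as the 2D filter gives on that frame. The function name `inflate_2d_kernel` and the weight layout are assumptions for illustration.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, time_dim):
    """Inflate (C_out, C_in, K, K) weights to (C_out, C_in, time_dim, K, K)."""
    # Repeat the 2D filter time_dim times along a new temporal axis.
    kernel_3d = np.repeat(kernel_2d[:, :, None, :, :], time_dim, axis=2)
    # Divide by time_dim so that summing over time recovers the 2D filter.
    return kernel_3d / time_dim
```

Summing the inflated kernel over its temporal axis returns the original 2D weights, which is what lets pretrained 2D models bootstrap the 3D network.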


Pseudo-3D Residual Networks (ICCV2017)

Pseudo-3D (P3D) Residual Net:

• 3 types of P3D blocks

• Interleaving Design for ResNet

Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017
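
A toy sketch of the three block types and the interleaving idea. The slide does not spell out the wiring, so the serial, parallel, and hybrid arrangements below are an assumption based on a reading of the cited paper. `S` and `T` stand for any spatial (1 x 3 x 3) and temporal (3 x 1 x 1) operation, and all function names are hypothetical.

```python
def p3d_a(x, S, T):
    """Serial: temporal op applied to the spatial output, plus the shortcut."""
    return x + T(S(x))

def p3d_b(x, S, T):
    """Parallel: spatial and temporal ops both see the input."""
    return x + S(x) + T(x)

def p3d_c(x, S, T):
    """Hybrid: spatial output is used directly and also fed to the temporal op."""
    s = S(x)
    return x + s + T(s)

def interleave(num_blocks):
    """Cycle through the block types A, B, C along the network depth."""
    order = ["A", "B", "C"]
    return [order[i % len(order)] for i in range(num_blocks)]
```

For example, `interleave(5)` gives the block order A, B, C, A, B.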

[Results table: ActivityNet]


Spatiotemporal Separable 3D (arXiv 2017)

Spatiotemporal Separable 3D (S3D) Convolutions:

• 3D Conv (Kt x K x K) → Spatial Conv (1 x K x K) + Temporal Conv (Kt x 1 x 1)

Saining Xie et al., Rethinking Spatiotemporal Feature Learning For Video Understanding, arxiv, 2017
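
To see what the factorization buys, the weight counts of the two forms can be compared directly. This is a small arithmetic sketch: biases are ignored, and the function names and the intermediate channel count `c_mid` are assumptions.

```python
def params_full_3d(kt, k, c_in, c_out):
    """Weights in one Kt x K x K convolution."""
    return kt * k * k * c_in * c_out

def params_separable(kt, k, c_in, c_mid, c_out):
    """Weights in a 1 x K x K conv followed by a Kt x 1 x 1 conv."""
    spatial = k * k * c_in * c_mid
    temporal = kt * c_mid * c_out
    return spatial + temporal

# Example: 3 x 3 x 3 kernel with 64 channels throughout.
full = params_full_3d(3, 3, 64, 64)
separable = params_separable(3, 3, 64, 64, 64)
```

With these settings the full kernel has 110,592 weights and the separable pair has 49,152.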

[Results table: Kinetics]


ResNet (2+1)D (CVPR 18)

ResNet (2+1)D:

• Spatial-temporal decomposition of 3D convolution, with a similar number of parameters

Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018
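
The "similar number of parameters" constraint can be met by choosing the number of intermediate channels between the spatial and temporal convolutions. A sketch, assuming a t x d x d kernel and the channel rule from the cited paper as recalled here. The function names are hypothetical.

```python
def mid_channels(t, d, n_in, n_out):
    """Intermediate channels that roughly match a full t x d x d convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t, d, n_in, n_out):
    """Weights in the full 3D convolution (biases ignored)."""
    return t * d * d * n_in * n_out

def params_2plus1d(t, d, n_in, n_out):
    """Weights in the 1 x d x d conv plus the t x 1 x 1 conv."""
    m = mid_channels(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out
```

For a 3 x 3 x 3 kernel with 64 input and 64 output channels, this gives 144 intermediate channels and the same 110,592 weights as the full 3D convolution.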

[Results table: Kinetics]


ARTNet (CVPR 2018)

Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018

ARTNet: Modeling multiplicative interactions between two patches of consecutive frames

[Figure: appearance and relation branches]

[Results table: Kinetics, metric 0.5 x (top-1 error + top-5 error)]


Non-local Neural Networks (CVPR 2018)

Non-local Neural Networks: spatial-temporal attention among frames

Xiaolong Wang et al., Non-local Neural Networks, CVPR2018
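
A minimal NumPy sketch of a non-local operation in the softmax-attention form: every position attends to every other position across space and time, and the result is added back to the input. The projection matrices are hypothetical stand-ins for learned 1 x 1 x 1 convolutions, and features are assumed to be flattened to (T*H*W, C).

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x: (N, C) features at N = T*H*W positions; returns an (N, C) array."""
    theta = x @ w_theta                      # (N, D) queries
    phi = x @ w_phi                          # (N, D) keys
    g = x @ w_g                              # (N, D) values
    # Pairwise affinities between all positions, normalized per row.
    logits = theta @ phi.T                   # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    attention = np.exp(logits)
    attention = attention / attention.sum(axis=1, keepdims=True)
    y = attention @ g                        # (N, D) aggregated responses
    # Residual connection: with w_z set to zero the block is an identity.
    return x + y @ w_z
```

Because of the residual form, such a block can be inserted into a pretrained network without changing its initial behavior when `w_z` starts at zero.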

[Results table: Kinetics]


Action Detection

Tianwei Lin et al., ActivityNet2017 (Temporal Action Localization Winner) & ACM MM 2017

Single Shot Action Detector (SSAD):

• Snippet-level feature extraction from two-stream CNN

• Temporal convolution with anchor mechanism (Inspired by SSD & YOLO)
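
The anchor mechanism can be pictured as a set of default temporal segments placed at every location of a temporal feature map. The sketch below works in normalized time [0, 1]. The function name, the base scale, and the scale ratios are assumptions for illustration, not the values used in SSAD.

```python
def default_anchors(num_locations, base_scale, scale_ratios):
    """Return (start, end) default segments in normalized time [0, 1]."""
    anchors = []
    for j in range(num_locations):
        # Each temporal cell contributes one anchor center.
        center = (j + 0.5) / num_locations
        for ratio in scale_ratios:
            width = base_scale * ratio
            # Clip anchors to the extent of the video.
            start = max(0.0, center - width / 2)
            end = min(1.0, center + width / 2)
            anchors.append((start, end))
    return anchors

anchors = default_anchors(num_locations=4, base_scale=0.25, scale_ratios=[0.5, 1.0, 2.0])
```

A detector of this kind would then predict, for each anchor, class scores plus offsets to its center and width.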


Action Detection

Yue Zhao et al., Temporal Action Detection with Structured Segment Networks, ICCV2017

Structured Segment Networks (SSN):

• Temporal actionness grouping for proposal generation

• Structured temporal pyramid pooling with the contextual proposal

• Activity + completeness classifiers to produce the final probability of each proposal
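
The first bullet, grouping snippets by actionness into proposals, can be illustrated with a deliberately simplified version that merges consecutive snippets whose score passes a threshold. This single-threshold rule is an assumption and is not the full grouping scheme of the paper. The function name is hypothetical.

```python
def group_actionness(scores, threshold):
    """Group consecutive snippets with score >= threshold into [start, end) spans."""
    proposals = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                        # a new candidate region begins
        elif score < threshold and start is not None:
            proposals.append((start, i))     # region ends before snippet i
            start = None
    if start is not None:
        proposals.append((start, len(scores)))
    return proposals

proposals = group_actionness([0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.1], threshold=0.5)
```

Running the grouping at several thresholds would give proposals at several granularities, which the later stages can then score for activity and completeness.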


Action Detection

Convolutional-De-Convolutional (CDC) Networks

• CDC: jointly performing spatial downsampling and temporal upsampling

• Dense score prediction at the frame-level for proposal segments

Zheng Shou et al., CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos, CVPR2017, oral


Future: 3D CNNs

Designing effective modules in 3D CNNs can be crucial for large-scale video classification

To name a few:

• Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017

• Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017

• Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018

• Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018

• Xiaolong Wang et al., Non-local Neural Networks, CVPR2018

[Figure: I3D]


Future: Pose & Action

Pose is a discriminative cue for human actions in videos

To name a few:

• Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017, oral (ours)

• Mohammadreza Zolfaghari et al., Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection, ICCV2017

• Sijie Yan et al., Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, AAAI2018

• Mengyuan Liu et al., Recognizing Human Actions as Evolution of Pose Estimation Maps, CVPR2018

• Diogo Luvizon et al., 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning, CVPR2018

• Vasileios Choutas et al., PoTion: Pose MoTion Representation for Action Recognition, CVPR2018

[Figure: RPAN (ours)]


Future: Motion Prediction

To name a few:

• Eddy Ilg et al., FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks, CVPR2017

• Zelun Luo et al., Unsupervised Learning of Long-Term Motion Dynamics for Videos, CVPR2017

• Xiaodan Liang et al., Dual Motion GAN for Future-Flow Embedded Video Prediction, ICCV2017

• Shuyang Sun et al., Optical Flow Guided Feature: A Motion Representation for Video Action Recognition, CVPR2018

• Lijie Fan et al., End-to-End Learning of Motion Representation for Video Understanding, CVPR2018

• Ruohan Gao et al., Im2Flow: Motion Hallucination from Static Images for Action Recognition, CVPR2018

• Lei Zhou et al., Temporal Hallucinating for Action Recognition with Few Still Images, CVPR2018 (ours)

Learning flow in videos: FlowNet2.0

Learning flow in images?! HVM (ours)


Future: Video Understanding

Video Captioning

Ranjay Krishna et al., Dense-Captioning Events in Videos, ICCV2017

Video Summarization

Kaiyang Zhou et al., Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward, AAAI2018, oral (ours)

DSN (ours)

Spatial-Temporal Localization

Chunhui Gu et al., AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions, arXiv, 2017


Thanks!

Q&A