Yu QIAO 2018.4
行为识别与检测2018年度进展
乔宇中国科学院深圳先进技术研究院
2018年4月22日
VALSE 2018 - 大连
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Outline
视频行为数据库
行为识别方法
行为检测方法
未来研究方向
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Video Benchmarks
UCF101 (13,320 videos,101 actions ) HMDB51 (6,849 videos, 51 actions )
1. The widely-used data sets are small-scale
2. It is hard to investigate spatial-temporal representations
of deep neural networks
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Large-Scale Video Sets
Youtube8M
• 200 classes
• 100 untrimmed videos per class
• 1.54 activity instances per video
• 648 video hours
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Large-Scale Video Sets
• 80 atomic actions
• 192 clips (15 mins per clip)
• 740k annotations
• 306,245 videos in total
• 400 action classes
• Each clip lasts around 10s
• over 1,000,000 videos
• 339 Moment classes
• 3-second video
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Large-Scale Video Sets
Benchmarks Year Team Task
ActivityNet
http://activity-net.org/index.html
2015
Universidad del Norte
&
KAUST
• Untrimmed Action Recognition
• Temporal Action Proposals
• Temporal Action Localization
• Dense-Captioning Events in Videos
Youtube8M
https://research.google.com/you
tube8m/index.html
2016 Google • Video Classification
Kinetics
https://deepmind.com/research/
open-source/open-source-
datasets/kinetics/
2017Google
(DeepMind)• Trimmed Activity Recognition
AVA
https://research.google.com/ava
/index.html
2017 Google • Spatio-temporal Action Localization
Moments in Time
http://moments.csail.mit.edu/
2018 MIT • Trimmed Event Recognition
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Action Recognition
2D CNNs
Recurrent Modeling
3D CNNs
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Temporal Linear Encoding (CVPR 17)
Ali Diba et al., Deep Temporal Linear Encoding Networks, CVPR 2017
Deep Temporal Linear Encoding (TLE) Networks:
1. Aggregating K segments into a video representation
2. Bilinear encoding for feature interactions
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
UntrimmedNets (CVPR 17)
Limin Wang et al., UntrimmedNets for Weakly Supervised Action Recognition and Detection, CVPR 2017
UntrimmedNet:
1. Attention for proposal selection
2. Weakly-supervised detection
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Recurrent Modeling
Wenbin Du et al., Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos, IEEE TIP 2018
Recurrent Spatial-Temporal Attention Network (ours):
1. Spatial-temporal attention from global video context
2. Attention-driven two-steam fusion
3. Actor-attention regularization to highlight action regions
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Recurrent Pose Attention Network (ICCV 17 Oral)
Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017
Recurrent Pose Attention Network (ours, ICCV oral):
1. Pose attention as dynamical guidance for LSTM
2. Byproduct: pose estimation in videos
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Demo of RPAN
Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Applications of our RSTAN & RPAN
Video Surveillance
Home Service Robot
Human-Computer Interactions
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Inflated 3D (I3D) ConvNets (CVPR 17)
Inflated 3D (I3D) ConvNets:
1. Inflating 2D ConvNets into 3D
2. Bootstrapping 3D filters from 2D Filters
3. Propose Kinetics dataset
Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Pseudo-3D Residual Networks (ICCV2017)
Pseudo-3D (P3D) Residual Net:
• 3 types of P3D blocks
• Interleaving Design for ResNet
Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017
ActivityNet
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Spatiotemporal Separable 3D (CVPR 2018)
Spatiotemporal Separable 3D (S3D) Convolutions:
• 3D Conv (Kt x K x K) Spatial Conv (1 x K x K) + Temporal Conv (Kt x 1 x 1)
Saining Xie et al., Rethinking Spatiotemporal Feature Learning For Video Understanding, arxiv, 2017
Kinetics
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
ResNet (2+1)D (CVPR 18)
ResNet (2+1)D:
• Spatial-Temporal Decomposition but with similar number of parameters
Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018
Kinetics
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
ARTNet (CVPR 2018)
Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018
ARTNet: Modeling multiplicative interactions between two patches of
consecutive frames
appearance
relationKinetics
0.5(top1+ top5 error)
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Non-local Neural Networks (CVPR 2018)
Non-local Neural Networks: spatial-temporal attention among frames
Xiaolong Wang et al., Non-local Neural Networks, CVPR2018
Kinetics
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Action Detection
Tianwei Lin et al., ActivityNet2017 (Temporal Action Localization Winter) & ACMM2017
Single Shot Action Detector (SSAD):
• Snippet-level feature extraction from two-stream CNN
• Temporal convolution with anchor mechanism (Inspired by SSD & YOLO)
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Action Detection
Yue Zhao et al., Temporal Action Detection with Structured Segment Networks,ICCV2017
Structured Segment Networks (SSN):
• Temporal actionness grouping for proposal generation
• Structured temporal pyramid pooling with the contextual proposal
• Activity + completeness classifier to produce the final probability of proposal
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Action Detection
Convolutional-De-Convolutional (CDC) Networks
• CDC: jointly performing spatial downsampling and temporal upsampling
• Dense score prediction at the frame-level for proposal segments
Zheng Shou et al., CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed
Videos, CVPR2017, oral
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Future: 3D CNNs
Designing effective modules in 3D CNNs can be crucial for large-scale
video classification
To name a few:
•Joao Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR2017
•Zhaofan Qiu et al., Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV2017
•Du Tran et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, CVPR2018
•Limin Wang et al., Appearance-and-Relation Networks for Video Classification, CVPR2018
•Xiaolong Wang et al., Non-local Neural Networks, CVPR2018
I3D
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Future: Pose & Action
Pose is a discriminative guidance for human actions in videos
To name a few:
• Wenbin Du et al., RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos, ICCV2017, oral (ours)
•Mohammadreza Zolfaghari et al., Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection, ICCV2017
•Sijie Yan et al., Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, AAAI2018
•Mengyuan Liu et al., Recognizing Human Actions as Evolution of Pose Estimation Maps, CVPR2018
•Diogo Luvizon et al., 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning, CVPR2018
•Vasileios Choutas et al., PoTion: Pose MoTion Representation for Action Recognition, CVPR2018
RPAN (ours)
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Future: Motion Prediction
To name a few:
•Eddy Ilg et al., FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks, CVPR2017
•Zelun Luo,et al., Unsupervised Learning of Long-Term Motion Dynamics for Videos, CVPR2017
•Xiaodan Liang et al., Dual Motion GAN for Future-Flow Embedded Video Prediction , ICCV2017
•Shuyang Sun et al., Optical Flow Guided Feature: A Motion Representation for Video Action Recognition, CVPR2018
•Lijie Fan et al., End-to-End Learning of Motion Representation for Video Understanding, CVPR2018
•Ruohan Gao et al., Im2Flow: Motion Hallucination from Static Images for Action Recognition, CVPR2018
•Lei Zhou et al., Temporal Hallucinating for Action Recognition with Few Still Images, CVPR2018 (ours)
Learning flow in the videos
FlowNet2.0
Learning flow in the images?!
HVM (ours)
SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CAS
Future: Video Understanding
Video Caption
Video Summarization
Ranjay Krishna et al., Dense-Captioning Events in Videos , ICCV2017
Kaiyang Zhou et al., Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward, AAAI2018, oral (ours)
DSN (ours)
Spatial-Temporal Localization
Chunhui Gu et al., AVA: A video dataset of spatio-temporally localized atomic
visual actions, arxiv,2017
Yu QIAO 2018.4
Thanks!
Q&A
Top Related