【ISVC2015】Evaluation of Vision-based Human Activity Recognition in Dense Trajectory Framework


Evaluation of Vision-based Human Activity Recognition in Dense Trajectory Framework

Hirokatsu Kataoka, Yoshimitsu Aoki†, Kenji Iwata, Yutaka Satoh

National Institute of Advanced Industrial Science and Technology (AIST) † Keio University

http://www.hirokatsukataoka.net/

Background: computer vision for human sensing

- Detection, tracking, trajectory analysis
- Posture estimation, activity recognition
- Action recognition can extend human sensing applications

[Figure: human-sensing pipeline, "look at people": detection, tracking, trajectory extraction, posture estimation, face recognition, gaze estimation, and action recognition, supporting analysis of mental state, body situation, attention, and activities such as shaking hands]

Activity Recognition

"Activity" is a low-level primitive with semantic meaning, e.g. walking, running, sitting

- Activity recognition: classification only, e.g. "this image contains a man walking" (the location is given)
- Activity detection: classification and localization, e.g. "walking" at a specific position

Dense Trajectories (DT) [Wang+, IJCV2013]
• State-of-the-art space-time recognition approach
  - State of the art: DT + deep learning [THUMOS2015]
  - Usable motion analyzer
  - Simple pipeline: (i) flow tracking, (ii) feature vectorization

Large number of optical flows

[THUMOS2015] http://www.thumos.info/results.html

History of keypoint/trajectory-based approaches: from STIP to DT

- STIP: Space-time interest points [Laptev et al., IJCV2005]
- Cuboid Features [Dollar et al., PETS2005]
- HOG + HOF on STIP [Laptev et al., CVPR2008]
- STR: Spatio-Temporal Relationship Match [Ryoo et al., ICCV2009]
- Tracklet Descriptors [Raptis et al., ECCV2010]
- Feature Mining for Activity Recognition [Gilbert et al., PAMI2011]
- Dense Trajectories [Wang et al., CVPR2011]

STIP & DT: Sampling, from STIP to DT

- STIP: Space-time interest points [Laptev et al., IJCV2005]
- Cuboid Features [Dollar et al., PETS2005]
- HOG + HOF on STIP [Laptev et al., CVPR2008]
- STR: Spatio-Temporal Relationship Match [Ryoo et al., ICCV2009]
- Tracklet Descriptors [Raptis et al., ECCV2010]
- Feature Mining for Activity Recognition [Gilbert et al., PAMI2011]
- Dense Trajectories [Wang et al., CVPR2011]
- Action Bank [Sadanand et al., CVPR2012]

Co-occurrence features in DT
• Extended co-occurrence feature (ECoHOG)
  - CoHOG [Watanabe, PSIVT2009]: pair counting; ECoHOG: edge-magnitude accumulation
  - PCA for codeword generation
  - DT + co-occurrence features (62.4%) > DT (59.2%) on MPII Cooking

H. Kataoka+, “Extended Co-occurrence HOG with Dense Trajectories for Fine-grained Activity Recognition”, in ACCV2014.

Need for more features!

Pose-based approach

Holistic approach

Proposal • Feature evaluation for better performance
  - Evaluation of 13 features under fair settings
  - 5 categories
    • Trajectory: trajectory feature (original in DT)
    • Shape: HOG, SIFT
    • Motion: HOF, MBHx, MBHy, MIP
    • Texture: HLAC, LBP, iLBP, LTP
    • Co-occurrence: CoHOG, ECoHOG

  - 4 different datasets
    • NTSEL (traffic)
    • INRIA surgery (surgery)
    • MSR Daily Activity 3D (daily living)
    • UCF50 (sports)

Simple algorithm • (i) Flow tracking –  Pyramidal images & sampling –  Farneback optical flow & flow tracking

• (ii) Feature vectorization – HOG, HOF, MBH, Trajectory, SIFT, LBP….. – Bag-of-words (BoW) representation
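The vectorization step above, BoW over local descriptors, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the toy codebook here is hypothetical (in practice it would come from k-means over training descriptors):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors to their nearest codeword and
    return an L1-normalized bag-of-words histogram."""
    # Pairwise squared distances between descriptors and codewords: (N, K)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy example: 6 two-dimensional descriptors, 2 codewords
desc = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
h = bow_histogram(desc, codebook)
```

Each trajectory-aligned feature (HOG, HOF, MBH, ...) gets its own codebook and histogram before classification.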

Pyramidal images & sampling • Scaling and dense sampling

–  Pyramidal images •  Scales *= 1/√2

- Sampling at each scale
  • Grid: 5x5 [px] (experimentally decided)
  • Corner detection: keep points whose minimum eigenvalue λ of the autocorrelation matrix exceeds a threshold T

Scale invariance and detailed description
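The pyramid and grid described above can be sketched as follows; the function names and the minimum-side stopping rule are my own illustrative choices, not from the paper:

```python
import numpy as np

def pyramid_scales(side, min_side=32):
    """Scale factors 1, 1/sqrt(2), 1/2, ... kept while the rescaled
    image side stays at least `min_side` pixels."""
    scales, level = [], 0
    while side * (2 ** -0.5) ** level >= min_side - 1e-6:
        scales.append((2 ** -0.5) ** level)
        level += 1
    return scales

def dense_grid(h, w, step=5):
    """Sampling points on a regular 5x5-pixel grid (as in the slides)."""
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

pts = dense_grid(20, 20)       # 4 x 4 = 16 grid points
scales = pyramid_scales(128)   # 1, 1/sqrt(2), 1/2, ...
```

The corner-response threshold on the minimum eigenvalue would then prune grid points in textureless regions; that filtering step is omitted here.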

Farneback Optical Flow
• Dense optical flow + ST-patch
  - The Farneback optical flow is included in OpenCV
  - Compared with the KLT tracker and SIFT matching
  - Local space-time patch around tracked sampling points

[Figure: noise and tracking errors in the raw flow field]
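Given a dense flow field (e.g. the output of OpenCV's cv2.calcOpticalFlowFarneback), the per-frame tracking step amounts to advancing each sampled point by the flow at its location. A minimal NumPy sketch, assuming the flow array is precomputed (DT additionally median-filters the flow, which is omitted here):

```python
import numpy as np

def track_points(points, flow):
    """Advance sampled points one frame using a dense flow field.
    `flow` has shape (H, W, 2) holding (dx, dy) per pixel."""
    h, w = flow.shape[:2]
    tracked = []
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:      # drop points leaving the frame
            dx, dy = flow[yi, xi]
            tracked.append((x + dx, y + dy))
    return np.array(tracked)

# Toy flow: every pixel moves 1 px right and 2 px down
flow = np.zeros((10, 10, 2))
flow[..., 0], flow[..., 1] = 1.0, 2.0
tracked = track_points([(3.0, 4.0)], flow)
```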

Trajectory-based feature
• Trajectory shape
  - Calculating flow between frames
  - Scale normalization

ΔP_t = P_{t+1} − P_t = (x_{t+1} − x_t, y_{t+1} − y_t)

[Wang+, IJCV2013]
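The trajectory-shape feature above, displacements normalized by the total flow magnitude, is short enough to sketch directly:

```python
import numpy as np

def trajectory_shape(points):
    """Displacement vectors dP_t = P_{t+1} - P_t, normalized by the
    total displacement magnitude (the scale normalization above)."""
    disp = np.diff(np.asarray(points, dtype=float), axis=0)
    total = np.linalg.norm(disp, axis=1).sum()
    return (disp / total).ravel()

# Two steps of length 5 each -> total magnitude 10
feat = trajectory_shape([(0, 0), (3, 4), (6, 8)])
```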

Shape-based features
• HOG [Dalal+, CVPR2005]: edge orientation and magnitude from a block representation with overlapping and normalization; captures edge shape against the background
• SIFT [Lowe, IJCV2004]: simple division into 4x4 blocks
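The HOG building block, an orientation histogram over one cell, can be sketched as follows. This is a simplified single-cell version under my own parameter choices (8 unsigned bins), not the descriptor used in the paper:

```python
import numpy as np

def hog_cell(patch, bins=8):
    """Orientation histogram of one cell: quantize gradient orientation
    and accumulate gradient magnitude per bin, then L2-normalize."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation [0, pi)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    hist = np.zeros(bins)
    np.add.at(hist, idx.ravel(), mag.ravel())        # magnitude-weighted vote
    return hist / (np.linalg.norm(hist) + 1e-6)

patch = np.tile(np.arange(8.0), (8, 1))              # pure horizontal gradient
h = hog_cell(patch)
```

A full HOG concatenates such cell histograms over overlapping blocks with block-level normalization.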

Motion features
• HOF [Laptev+, CVPR2008]: block optical-flow extraction and quantization
• MBHx, MBHy [Dalal+, ECCV2006]: motion boundaries of dense optical flow
• MIP [Kliper-Gross+, ECCV2012]: trinary codes (-1, 0, +1) from block flow direction
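MBH treats each flow component as an image and builds a gradient-orientation histogram on it, so constant (e.g. camera) motion cancels out and only motion boundaries vote. A minimal sketch with my own bin count, not the paper's exact layout:

```python
import numpy as np

def mbh(flow, bins=8):
    """Motion Boundary Histograms: gradient-orientation histograms of
    the horizontal and vertical flow components (MBHx, MBHy)."""
    feats = []
    for c in range(2):                       # 0: horizontal flow, 1: vertical
        gy, gx = np.gradient(flow[..., c])
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
        hist = np.zeros(bins)
        np.add.at(hist, idx.ravel(), mag.ravel())
        feats.append(hist / (hist.sum() + 1e-6))
    return feats                             # [MBHx, MBHy]

# Toy flow: right half moves right, creating one vertical motion boundary
flow = np.zeros((10, 10, 2))
flow[:, 5:, 0] = 2.0
mbhx, mbhy = mbh(flow)
```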

Texture features
• HLAC [Otsu+, IAIP1988]: higher-order local auto-correlation with 0th-, 1st-, and 2nd-order patterns
• LBP [Ojala+, TPAMI2002]: texture binarization in a 3x3 patch
• iLBP [Kobayashi+, ICPR2004], LTP: improved and ternary LBP variants
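The basic LBP binarization in a 3x3 patch works as follows; the clockwise bit ordering here is one common convention, chosen for illustration:

```python
import numpy as np

def lbp_code(patch3x3):
    """Basic LBP of a 3x3 patch: threshold the 8 neighbors against the
    center and read the resulting bits as one byte (clockwise from top-left)."""
    p = np.asarray(patch3x3)
    center = p[1, 1]
    neighbors = [p[0, 0], p[0, 1], p[0, 2], p[1, 2],
                 p[2, 2], p[2, 1], p[2, 0], p[1, 0]]
    return sum(int(v >= center) << i for i, v in enumerate(neighbors))

patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
code = lbp_code(patch)   # only the top three neighbors exceed the center
```

The texture descriptor is then a histogram of these codes over the patch.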

Co-occurrence features
• Extended co-occurrence feature (ECoHOG)
  - CoHOG [Watanabe, PSIVT2009]: pair counting; ECoHOG: edge-magnitude accumulation
  - PCA for codeword generation
  - DT + co-occurrence features (62.4%) > DT (59.2%) on MPII Cooking

H. Kataoka+, “Extended Co-occurrence HOG with Dense Trajectories for Fine-grained Activity Recognition”, in ACCV2014.
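The CoHOG pair-counting idea can be sketched as a co-occurrence matrix over quantized edge orientations at a fixed pixel offset. A minimal illustration under my own simplifications (one offset, orientations already quantized); ECoHOG would accumulate edge magnitudes instead of counting pairs:

```python
import numpy as np

def cohog(orient, offset=(0, 1), bins=8):
    """Co-occurrence matrix of quantized edge orientations: count how
    often orientation pair (i, j) occurs at the given pixel offset."""
    dy, dx = offset
    h, w = orient.shape
    mat = np.zeros((bins, bins), dtype=int)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            mat[orient[y, x], orient[y + dy, x + dx]] += 1
    return mat

orient = np.zeros((4, 4), dtype=int)   # all pixels in orientation bin 0
m = cohog(orient)                      # 4 rows x 3 valid pairs = 12 pairs
```

The full descriptor concatenates such matrices over several offsets, which is why PCA is used for codeword generation.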

Experiments • Evaluation of 13 features in the dense trajectory framework
  - 4 different datasets
    • Traffic scene (NTSEL): 4 classes
    • Surgery (INRIA surgery): 4 classes
    • Daily living (MSR Daily Activity 3D): 12 classes
    • Sports (UCF50): 50 classes

Results on the 4 datasets
• High-performance features
  - Top three features at each dataset, over 4 different scenes
  - CoHOG, SIFT, MBH
  - CoHOG gives stable accuracy across all datasets

Detailed performance rates
• Performance depends on the recognition task
  - Several features need to be concatenated experimentally
  - Feature concatenation evaluated on NTSEL and INRIA surgery

Rates of feature concatenation
• Baseline, 5 categories, and concatenated vectors
  - Baseline: DT + BoW model
  - Motion and co-occurrence features
  - No need to apply all features

Conclusion
• We evaluated 13 features in the DT framework
  - Toward more effective activity recognition
  - 4 datasets covering different scenes
  - Detailed evaluation and concatenated vectors
  - Top-N ranked concatenation is needed for activity recognition

Feature extraction around trajectories

- Extraction of the 13 features in the ST-patch
- 2 (x dir.) x 2 (y dir.) x 3 (t dir.) regions
- Features calculated with a bag-of-words (BoW) representation

[Figure: ST-patch and xyt block extraction; 13-feature extraction]
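The 2 x 2 x 3 division of the space-time patch can be sketched as follows; the patch size used here is a hypothetical example, not the paper's setting:

```python
import numpy as np

def split_st_patch(patch, nx=2, ny=2, nt=3):
    """Divide a space-time patch (H, W, T) into nx * ny * nt sub-blocks;
    each feature is histogrammed per block and the blocks concatenated."""
    blocks = []
    for rows in np.array_split(patch, ny, axis=0):       # split in y
        for block in np.array_split(rows, nx, axis=1):   # split in x
            blocks.extend(np.array_split(block, nt, axis=2))  # split in t
    return blocks

patch = np.zeros((32, 32, 15))       # toy ST-patch
blocks = split_st_patch(patch)       # 2 * 2 * 3 = 12 sub-blocks
```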

Trajectory feature
• Trajectory shape
  - Calculate the flow between frames
  - Normalize by the overall flow magnitude

ΔP_t = P_{t+1} − P_t = (x_{t+1} − x_t, y_{t+1} − y_t)

HOG feature
• Histograms of Oriented Gradients (HOG)
  - Can represent the rough shape of an object
  - Features obtained by dividing the local region into blocks
  - A quantized histogram is built from the edge gradient (g(x, y) below)
  - The edge magnitude per orientation (m(x, y) below) is accumulated

[Figure: shape extracted from a pedestrian image vs. shape extracted from the background]

HOF feature
• Histograms of Optical Flow (HOF)
  - The local region is divided into blocks
  - The flow between consecutive frames (t and t+1) is described per block
  - Flow direction and magnitude (length)

[Figure: flow computed from two consecutive frames to obtain a motion-based feature vector]
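The per-block HOF histogram described above can be sketched as follows; the bin count and the extra static-pixel bin are common conventions chosen for illustration, not necessarily the paper's exact configuration:

```python
import numpy as np

def hof(flow, bins=9):
    """Histogram of Optical Flow: quantize each flow vector's direction
    and accumulate its magnitude; the last bin counts near-static pixels."""
    fx, fy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx) % (2 * np.pi)
    idx = np.minimum((ang / (2 * np.pi) * (bins - 1)).astype(int), bins - 2)
    hist = np.zeros(bins)
    moving = mag > 1e-3
    np.add.at(hist, idx[moving], mag[moving])   # magnitude-weighted votes
    hist[bins - 1] = (~moving).sum()            # static-pixel bin
    return hist / (hist.sum() + 1e-6)

flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                  # every pixel moves right
h = hof(flow)
```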