ECCV 2010 tutorial: Statistical and structural recognition of human actions, part II

Structural Methods

•  Pose seems sufficient for certain action categories
•  Remove effects of clothing, lighting variation from representation

G. Johansson, Moving Light Displays, 1973

•  Two broad classes of approaches
–  Match templates (exemplar-based): database of exemplars
   Mori & Malik PAMI 2005; Shakhnarovich, Viola & Darrell ICCV 2003; Bourdev & Malik ICCV 2009
–  Fit a human body model: pictorial structures model
   Felzenszwalb & Huttenlocher CVPR 2000; Ramanan NIPS 2006; Ferrari, Marin & Zisserman CVPR 2008

•  Detect corners in images/video
•  Assess likelihood under action-specific pose model
•  Discriminate between walking directions, bicycle riding

Song, Goncalves & Perona NIPS 2001, PAMI 2003
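The "assess likelihood under an action-specific pose model" step can be read as maximum-likelihood classification. The sketch below is not the Song, Goncalves & Perona model; it assumes, purely for illustration, one independent (diagonal) Gaussian per action over a small feature vector. The action names, feature dimensions, and numbers in `MODELS` are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-density of x under an independent (diagonal) Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def classify(x, models):
    """Return the action whose model assigns x the highest log-likelihood."""
    return max(models, key=lambda a: gaussian_log_likelihood(x, *models[a]))


# Hypothetical per-action (mean, variance) pairs over a 2-D feature vector.
MODELS = {
    "walk_left": ([-1.0, 0.0], [0.25, 0.25]),
    "walk_right": ([1.0, 0.0], [0.25, 0.25]),
    "bicycle": ([0.0, 2.0], [0.25, 0.25]),
}
```

With equal variances this reduces to picking the nearest class mean; a real pose model would also have to cope with missing and spurious detections.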

•  Match key frames to test sequence to find similar poses
–  Shape matching on edge maps using order structure

Sullivan & Carlsson ECCV 2002

[Figure labels: test sequence, key frame]

•  Automatic person detection-tracking
•  Compare quantized pose to labeled training poses
–  Smooth over time

Ramanan & Forsyth NIPS 2003

•  M is quantized 3D pose
•  T is root orientation
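The two bullets "compare quantized pose to labeled training poses" and "smooth over time" can be sketched as a per-frame nearest-neighbour lookup followed by a sliding-window majority vote. This is a simplified stand-in, not the Ramanan & Forsyth model; the pose vectors, the squared Euclidean distance, and the window radius are assumptions.

```python
from collections import Counter


def nearest_label(pose, training):
    """Label of the training pose closest to `pose` (squared Euclidean distance).

    `training` is a list of (pose_vector, label) pairs.
    """
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(pose, item[0]))

    return min(training, key=dist)[1]


def smooth_labels(labels, radius=1):
    """Replace each label by the majority label within +/- radius frames."""
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - radius): i + radius + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed
```

The majority vote removes isolated single-frame label flips, which is the simplest form of temporal smoothing.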

Ferrari, Marin & Zisserman CVPR 2009

•  Video shot retrieval from pose
–  Either query-by-example or classification
–  Focus on upper body pose
•  Pictorial structures model

[Figure label: Action Label]

•  SVM on descriptors of absolute & relative part locations, segmentations
–  Include short tracks for robustness

Code available online (figure label: query)
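A descriptor built from absolute and relative part locations might look like the sketch below. It is an illustration only, not the descriptor from the Ferrari et al. code: the part names, the 2-D coordinates, and the choice of pairwise offsets are assumptions. The resulting vector is the kind of input an SVM could be trained on.

```python
def pose_descriptor(parts):
    """Concatenate absolute part positions with pairwise relative offsets.

    `parts` maps a part name to an (x, y) position, assumed to be expressed
    in a person-normalized coordinate frame.
    """
    names = sorted(parts)  # fixed ordering so descriptors are comparable
    absolute = [c for n in names for c in parts[n]]
    relative = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            relative.extend((parts[b][0] - parts[a][0],
                             parts[b][1] - parts[a][1]))
    return absolute + relative
```

For N parts the descriptor has 2N absolute entries plus N(N-1) relative entries.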

•  Focus on discriminative elements of pose for action classification
•  Use exemplar-based “poselet” representation

Yang, Wang & Mori CVPR 2010

[Figure: example poses labeled "Golfing?" and "Walking?"]

[Figure: successful classification examples and unsuccessful classification examples]

Yao & Fei-Fei CVPR 2010

•  A: Activity, e.g. croquet shot, volleyball smash, tennis forehand
•  O: Object, e.g. croquet mallet, volleyball, tennis racket
•  H: Human pose, capturing intra-class variations
–  More than one H for each A
–  Unobserved during training
•  P: Body parts P1, P2, …, PN
–  lP: location; θP: orientation; sP: scale
•  f: Image evidence (fO for the object; f1, f2, …, fN for the body parts)
–  Shape context [Belongie et al., 2002]

[Figure: examples of cricket defensive shot, cricket bowling, croquet shot]

•  Build action models from web search results

Ikizler-Cinbis, Cinbis, Sclaroff ICCV 2009

•  Find repeated poses in a dataset

SLAG

Wang, Jiang, Drew, Li, Mori CVPR 2006

•  Person location given
•  Classify into one of 9 categories
–  Riding horse, Reading book, Taking photo, Riding bike, Play instrument, Running, Phoning, Use computer, Walking

•  Pose as representation for action recognition
–  Captures much information about action
–  Invariance to clothing / lighting effects
–  Model and exemplar based representations
•  New direction: Action recognition from still images
–  Image retrieval and analysis
–  An important cue for video-based action recognition
–  Pose seems essential

•  Describe low-level components
–  Actions of individual people
–  Movement of pixels
•  Identify key objects or locations in scene
–  Buildings, roads, etc.
•  Model interactions between people, objects, and locations

•  Detect and track moving objects
•  Manually identify key regions in scene
–  E.g. road, checkpoint
•  Scenarios describe relative arrangements of objects in scene
–  E.g. proximity of car to checkpoint
–  Notions of scene context

Medioni, Cohen, Bremond, Hongeng, Nevatia PAMI 2001
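A scenario such as "proximity of car to checkpoint" can be expressed as a predicate over a tracked object's trajectory and a manually marked region. The sketch below assumes image-plane coordinates, a circular region, and a hand-picked radius; none of these details come from the Medioni et al. system.

```python
import math


def near(point, region_center, radius):
    """True if `point` lies within `radius` of the region centre."""
    return math.dist(point, region_center) <= radius


def frames_near_region(track, region_center, radius):
    """Indices of frames in which the tracked object is near the region.

    `track` is a list of (x, y) positions, one per frame.
    """
    return [t for t, p in enumerate(track) if near(p, region_center, radius)]
```

A scenario detector could then fire when, for instance, the returned list contains a long enough run of consecutive frames.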

•  Detect and track players, ball
•  Low-level action detectors for individual players
•  Hand-constructed Bayes net for each activity
–  Spatial and temporal relations between low-level actions

Intille & Bobick CVPR 1999

Zhong, Shi & Visontai CVPR 2004

•  Global, frame-level feature
–  Bag-of-words representation
•  Detect unusual events by clustering
–  Isolated, varied clusters are unusual
•  Real-world highway dataset
–  Cars pulling off road, backing up, U-turns
•  Cheating detection in simulated card game
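The idea that isolated clusters are unusual can be illustrated with a very simple clustering scheme. The sketch below swaps in greedy "leader" clustering with an L1 distance on bag-of-words histograms, which is not the method of Zhong, Shi & Visontai; the threshold and minimum cluster size are assumed parameters.

```python
def l1(a, b):
    """L1 distance between two histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leader_clusters(histograms, threshold):
    """Greedy clustering: join the first cluster whose leader is close enough."""
    clusters = []  # each cluster is a list of indices; the first is the leader
    for i, h in enumerate(histograms):
        for c in clusters:
            if l1(h, histograms[c[0]]) <= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def unusual_segments(histograms, threshold, min_size=2):
    """Indices of segments that fall in clusters smaller than `min_size`."""
    return sorted(
        i
        for c in leader_clusters(histograms, threshold)
        if len(c) < min_size
        for i in c
    )
```

Segments that resemble many others end up in large clusters; a segment that starts its own cluster and attracts no neighbours is flagged.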

•  Describe moving pixels by location and motion direction
–  No object detection
•  Use as visual words in Latent Dirichlet Allocation (LDA) type model
–  Infer low-level actions from words

Wang, Ma, Grimson PAMI 2009; Blei, Ng, Jordan JMLR 2003
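Turning a moving pixel into a visual word amounts to quantizing its location and its motion direction. The sketch below assumes a grid of square cells and a small number of direction bins; the default cell size, grid width, and bin count are illustrative choices rather than values taken from the paper.

```python
import math


def visual_word(x, y, dx, dy, cell=10, grid_w=32, n_dirs=4):
    """Map a moving pixel to a word index from its grid cell and motion direction.

    (x, y) is the pixel position, (dx, dy) its motion vector. Direction bins
    are centred on the axes, so bin 0 covers motion roughly along +x.
    """
    cx, cy = x // cell, y // cell
    angle = math.atan2(dy, dx) % (2 * math.pi)
    bin_width = 2 * math.pi / n_dirs
    d = int((angle + bin_width / 2) / bin_width) % n_dirs
    return (cy * grid_w + cx) * n_dirs + d
```

Counting these word indices over a short clip gives the "document" that an LDA-type model takes as input.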

•  Higher-level activity analysis
–  Distribution of low-level actions over entire scene
•  Applications
–  Temporal segmentation by activity
–  Abnormality detection
•  Hierarchical Dirichlet Process model
–  Learn number of activities automatically

Kuettel, Breitenstein, van Gool & Ferrari CVPR 2010

[Figure: graph over nodes A, B, C with edge weights 0.8, 0.5, 0.3]

Traffic light controlled scene
•  Continuous video
•  Annotated with states and history
•  3x speed

Loy, Xiang & Gong CVPR, ICCV 2009

•  Consider time-delayed correlations between regions
–  Applications to irregularity detection
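A time-delayed correlation between two regions can be illustrated by correlating one region's activity signal with a shifted copy of the other's. The sketch below uses plain Pearson correlation over a range of lags as a stand-in; it is not the probabilistic graphical model of Loy, Xiang & Gong, and the per-frame activity signals are assumed inputs.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_delay(a, b, max_lag):
    """Lag (in frames) at which activity in b best follows activity in a.

    Returns the best lag and the full lag -> correlation mapping.
    """
    scores = {
        lag: pearson(a[: len(a) - lag], b[lag:]) for lag in range(max_lag + 1)
    }
    return max(scores, key=scores.get), scores
```

Once a typical delay between two regions is known, activity in the second region that arrives without the expected earlier activity in the first is a candidate irregularity.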

Choi, Shahid, & Savarese VS 2009; Lan, Wang, Yang, & Mori SGA 2010, NIPS 2010

[Figure: graphical model linking image features x0, x1, x2, …, xn to per-person action class labels h1, h2, … and the activity class label y]

•  Captioned baseball videos in training
•  Build AND-OR graph representation of activities
–  AND specifies elements of an activity that must occur
–  OR allows variation in how an element appears
•  Describe low-level tracks using STIPs
•  Match tracks to actions in AND-OR graph

Gupta, Srinivasan, Shi, Davis CVPR 2009
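The AND/OR semantics described above can be sketched as a small recursive check. This toy version ignores temporal ordering and the matching of tracks to nodes, both of which the storyline model handles; the node encoding and the baseball example are hypothetical.

```python
def satisfied(node, observed):
    """Check whether a set of observed low-level actions satisfies a node.

    A node is either a string (a leaf action), ("AND", children), where every
    child must be satisfied, or ("OR", children), where at least one must be.
    """
    if isinstance(node, str):
        return node in observed
    kind, children = node
    results = (satisfied(c, observed) for c in children)
    return all(results) if kind == "AND" else any(results)


# Hypothetical play: a pitch, a hit or a miss, and a catch or a run
# (ordering is not checked in this sketch).
PLAY = ("AND", ["pitch", ("OR", ["hit", "miss"]), ("OR", ["catch", "run"])])
```

For example, `satisfied(PLAY, {"pitch", "hit", "run"})` holds, while dropping either the pitch or both of catch and run makes it fail.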

•  Scene modeling to look at the big picture
•  Feature representations
–  Holistic: describe entire scene, irrespective of individuals
–  Local: describe actions of individuals
•  Structure of activities
–  Model free: clustering-type approaches
–  Strong models: grammars, probabilistic models

Objects: cars, glasses, people, etc.
Scene categories: indoors, outdoors, street scene, etc.
Actions: drinking, running, door exit, car enter
Geometry: street, wall, field, stair, etc.

[Figure: objects, scene categories, actions, and geometry linked by constraints]


•  Standardization of datasets for field
–  Allow comparison of algorithms (e.g. KTH for low-level features, atomic actions)
–  Fair tuning of model parameters
•  New algorithms compare to baselines
–  Bag-of-words on densely sampled STIPs
–  Pose estimation (Ferrari et al. code)
–  HOG SVM (Dalal & Triggs code, Ramanan code)
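The core of a bag-of-words baseline is assigning each local descriptor to its nearest codeword and histogramming the assignments. The sketch below assumes the descriptors (for example, computed at STIPs) and the codebook already exist; how they are extracted and learned is outside its scope.

```python
def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of nearest-codeword assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

The resulting fixed-length histogram is what a classifier such as an SVM would be trained on in this kind of baseline.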

•  Standardization of datasets for field
–  Don’t feel constrained by the existing problem definitions
–  Do make your new dataset available (should clearly specify separate training and test sets)
•  New algorithms compare to baselines
–  Do use reasonable variant of standard baselines for your new problem

•  Even atomic low-level actions are very difficult to detect reliably
–  Far more work needed on representations for the action of a single person
–  Features
–  Temporal representation, smoothing
–  Tracking
–  …

1.  Cameras and bandwidth are cheap
2.  Lots of training data is potentially available

[Figure: items 1 and 2 combining into training data]

Potential for huge progress … if we can get the data

Supervision source                            | Aligned with video | Describes visual content | Source
Subtitles                                     | Yes                | No                       | DVD, Internet
Scripts for TV series, movies and sport games | No                 | Yes                      | Internet, e.g. www.dailyscript.com
Plot summaries and synopses                   | No                 | Yes, sparsely            | Internet (e.g. IMDB)
Instruction videos                            | No                 | Yes                      | Internet, e.g. www.videojug.com
Descriptive Video Service                     | Yes                | Yes                      | DVD, rare
Word tags                                     | No                 | Yes, sparsely            | Internet (e.g. YouTube)
Manual labelling, Human Computation           | ??                 | ??                       | Mechanical Turk, ESP Game, grad/undergrad students

Open questions:

•  How to benefit from the structure of the human body in complex situations, e.g. heavy occlusions, uniformly colored clothing?
•  Will action classification generalize over different video domains: Movies, TV, YouTube, Surveillance video?
•  What is the useful action vocabulary? Are we trying to solve the right problem? How can we visualize/display the results?

Interesting novel directions:

•  Use actions for recognizing functional and physical object properties, e.g. “sittable”, “eatable”, “heavy”, “solid” objects
•  Action prediction, i.e. what can happen in the given situation: e.g. is it dangerous to cross this road?
•  Explore more sources of strong and weak supervision: manual surveillance, Descriptive Video Service (DVS), YouTube tags, transcripts of sports games, instruction videos.

•  P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proc. 9th Int. Conf. Computer Vision, pages 734–741, 2003.
•  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2005.
•  Bo Wu and Ram Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proc. 10th Int. Conf. Computer Vision, 2005.
•  Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
•  Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1999.
•  Kentaro Toyama, John Krumm, Barry Brumitt, and Brian Meyers. Wallflower: Principles and practice of background maintenance. In Proc. 7th Int. Conf. Computer Vision, 1999.
•  J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. Int. Journal of Computer Vision, 12(1):43–77, 1994.
•  T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
•  M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. Int. Journal of Computer Vision, 29(1):5–28, 1998.
•  Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
•  W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. In IEEE 2nd Intl. Conf. on Automatic Face and Gesture Recognition, 1996.

•  J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV 2002.
•  A.A. Efros, A.C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV 2003.
•  A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, 23(3):257–267, 2001.
•  L. Zelnik-Manor and M. Irani. Event-based video analysis. In CVPR 2001.
•  E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR 2005.
•  O. Boiman and M. Irani. Detecting irregularities in images and in video. In Proc. ICCV, 2005.
•  M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proc. ICCV, 2005.
•  Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proc. ICCV 2005.
•  Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proc. ICCV, 2007.
•  I. Laptev and P. Pérez. Retrieving actions in movies. In Proc. ICCV 2007.
•  D. Weinland and E. Boyer. Action recognition using exemplar-based embedding. In Proc. CVPR, 2008.
•  Z. Lin, Z. Jiang, and L. S. Davis. Recognizing actions by shape-motion prototype trees. In Proc. ICCV, 2009.

•  I. Laptev and T. Lindeberg. Space-time interest points. In Proc. ICCV 2003.
•  C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proc. ICPR, 2004.
•  P. Dollar, V. Rabaud, G. Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
•  H. Jhuang, T. Serre, L. Wolf and T. Poggio. A biologically inspired system for action recognition. In Proc. ICCV 2007.
•  P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. ACM MM 2007.
•  J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In IJCV 2008.
•  I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR 2008.
•  A. Klaeser, M. Marszałek and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In Proc. BMVC 2008.
•  G. Willems, T. Tuytelaars and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Proc. ECCV 2008.
•  H. Wang, M. M. Ullah, A. Kläser, I. Laptev and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC 2009.
•  L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In Proc. ICCV 2009.
•  A. Gilbert, J. Illingworth, R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proc. ICCV 2009.
•  P. Matikainen, M. Hebert, R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. ICCV workshop on Video-oriented Object and Event Classification, 2009.
•  M. M. Ullah, S. N. Parizi, I. Laptev. Improving bag-of-features action recognition with non-local cues. In Proc. BMVC 2010.

•  Y. Song, L. Goncalves, and P. Perona. Unsupervised learning of human motion. IEEE Trans. PAMI, 25(7):814–827, 2003.
•  D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In Advances in Neural Information Processing Systems 16, 2003.
•  V. Ferrari, M. Marin, and A. Zisserman. Pose search: retrieving people using their pose. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
•  Yang Wang, Hao Jiang, Mark S. Drew, Ze-Nian Li, and Greg Mori. Unsupervised discovery of action classes. In CVPR, 2006.
•  Nazli Ikizler-Cinbis, R. Gokberk Cinbis, and Stan Sclaroff. Learning actions from the web. In IEEE International Conference on Computer Vision, 2009.
•  Weilong Yang, Yang Wang, and Greg Mori. Recognizing human actions from still images with latent poses. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010.
•  Bangpeng Yao and Li Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010.

•  R. Polana and R.C. Nelson. Detection and recognition of periodic, nonrigid motion. In IJCV 1997.
•  S.M. Seitz and C.R. Dyer. View invariant analysis of cyclic motion. In IJCV 1997.
•  A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In IEEE Workshop on Motion and Video Computing, 2005.
•  I. Laptev, S.J. Belongie, P. Pérez and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In Proc. ICCV 2005.
•  P. Wang, G.D. Abowd and J.M. Rehg. Quasi-periodic event analysis for social game retrieval. In Proc. ICCV 2009.
•  D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In Proc. ICCV 2007.
•  A. Farhadi and M. Tabrizi. Learning to recognize activities from the wrong view point. In Proc. ECCV 2008.
•  I. Junejo, E. Dexter, I. Laptev and Patrick Pérez. Cross-view action recognition from temporal self-similarities. In Proc. ECCV 2008.
•  A. Farhadi, M. Kamali, I. Endres, D. Forsyth. A latent model of discriminative aspect. In Proc. ICCV 2009.

•  X. Wang, X. Ma, and E. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Trans. PAMI, 31(3):539–555, 2009.
•  Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
•  T. Xiang and S. Gong. Beyond tracking: Modelling activity and understanding behaviour. Int. Journal of Computer Vision, 67(1):21–51, 2006.
•  G. Medioni, I. Cohen, F. Brémond, S. Hongeng, and R. Nevatia. Event detection and analysis from video streams. IEEE Trans. PAMI, 23(8):873–889, 2001.
•  Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. PAMI, 22(8):852–872, 2000.
•  D. Moore and I. Essa. Recognizing multitasked activities from video using stochastic context-free grammar. In AAAI, 2002.
•  Chen Change Loy, Tao Xiang, and Shaogang Gong. Modelling activity global temporal dependencies using time delayed probabilistic graphical model. In ICCV, 2009.
•  Xiaogang Wang, Keng Teck Ma, Gee Wah Ng, and W. Eric L. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2008.
•  W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 9th International Workshop on Visual Surveillance, 2009.
•  Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.

Workshop materials available: https://sites.google.com/site/humanactionstutorialeccv10/
