ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification

Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification
Juan Carlos Niebles, Chih-Wei Chen, Li Fei-Fei. Computer Science Dept., Stanford University.

Transcript of ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification

  • 1. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. Juan Carlos Niebles, Chih-Wei Chen, Li Fei-Fei. Computer Science Dept., Stanford University.

2. Recognizing Human Activities. Applications: motion analysis, interactions with objects, detecting unusual behavior, temporal structure & causality, judging sports automatically, providing cooking assistance, smart surveillance, biomechanics, psychology studies, video game interfaces.

3. Activity landscape. [Figure: activities laid out along a temporal scale axis (seconds): snapshot (10^-1), atomic action such as a catch (10^0), activities such as run or high jump (10^1), events such as a football game (10^3), long-term events such as the construction of a building (10^7-8).]

4. Activity landscape, with prior work placed along the same temporal scale: Thurau & Hlavac 2008; Bobick & Davis 2001; Ramanan & Forsyth 2003; Sridhar et al. 2010; Gupta et al. 2009; Efros et al. 2003; Kuettel 2010; Ikizler & Duygulu 2009; Schuldt et al. 2004; Laxton et al. 2007; Ikizler-Cinbis et al. 2009; Alper & Shah 2005; Ikizler & Forsyth 2008; Yao & Fei-Fei 2010a,b; Dollar et al. 2005; Yang, Wang & Mori 2010; Blank et al. 2005; Choi & Savarese 2009; Niebles et al. 2006; Laptev et al. 2008; Wang & Mori 2008, 2009; Rodriguez et al. 2008; Liu et al. 2009; Marszalek et al. 2009.

5. Activity landscape. Our target regime on that axis: activities that are compositions of simple motions, non-periodic, and of longer duration than atomic actions.

6. Activity landscape: related datasets along the same temporal scale. Actions in still images: [Ikizler 2009], PPMI [Yao & Fei-Fei 2010], UIUC Sports [Li & Fei-Fei 2007]. Videos: KTH [Schuldt et al. 2004], Hollywood [Laptev et al. 2008], UCF Sports [Rodriguez et al. 2008], Ballet [Yang et al. 2009]. New: the Olympic Sports Dataset.
7. Possible approaches. Pose-based recognition [Ferrari et al. 2008; Ramanan & Forsyth 2003; Sminchisescu 2006; Nazli & Forsyth 2008]: computationally intensive. HMMs, CRFs. Bag of features / simple action recognition [Laptev et al. 2008; Niebles et al. 2006; Blank et al. 2005; Liu et al. 2009; Efros et al. 2003]: fails when actions are complex.

8. Our proposal: decompose activities into simpler motion segments. 1. Simple motions are easier to describe computationally. 2. We can leverage temporal context. 3. The human visual system seems to rely on decomposition for understanding [Zacks et al., Nature Neuroscience 2001; Tversky et al., JEP 2006].

9.-10. Outline: a discriminative model for activities (representation, recognition, learning); experiments; conclusions.

11. A model for activities. [Figure: activity model.]

12. A model for complex activities. Model property: use a standard time range [0, 1].

13.-14. The model is formed by a few simple motions.

15. Each motion segment has a local motion appearance (Motion Segment 1).

16. Encode temporal order: each segment has an anchor location on the [0, 1] time axis.
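The representation built up on these slides reduces to a small data structure: an activity is a handful of motion segments, each with an appearance template, an anchor location on the standard [0, 1] time axis, a location uncertainty, and a temporal scale. A minimal sketch, with all field names and example values illustrative rather than taken from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSegment:
    """One simple motion in the decomposed activity model (illustrative)."""
    appearance: np.ndarray   # histogram template over video words
    anchor: float            # preferred temporal location in [0, 1]
    uncertainty: float       # tolerated deviation around the anchor
    scale: float             # segment duration as a fraction of the video

@dataclass
class ActivityModel:
    """An activity: a few motion segments laid out on the [0, 1] axis."""
    name: str
    segments: list  # list of MotionSegment

# A hypothetical two-segment model (values are made up for illustration).
high_jump = ActivityModel("high-jump", [
    MotionSegment(np.zeros(1000), anchor=0.2, uncertainty=0.10, scale=0.3),
    MotionSegment(np.zeros(1000), anchor=0.6, uncertainty=0.05, scale=0.2),
])
```

The point of the structure is that recognition only needs, per segment, an appearance comparison plus a penalty on how far the matched location drifts from the anchor.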
17. Temporal location uncertainty: each segment tolerates some flexibility around its anchor location.

18. Multiple temporal scales: segments can be shorter or longer.

19. Outline (next: recognition).

20.-21. Recognition. Map the query video onto the standard [0, 1] time range and match it against the activity model [0, 1].

22.-25. Match Motion Segment 1: consider a candidate location in the query video and compute a matching score for this segment.

26. Low-level features: spatio-temporal interest points with HOG/HOF descriptors [Laptev et al. 2005].

27. Descriptors are vector-quantized into a codebook of 1000 spatio-temporal words.

28. Appearance feature: histogram of video words inside the candidate segment.
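The appearance feature just described (HOG/HOF descriptors quantized into a 1000-word codebook, then histogrammed inside the candidate segment) can be sketched as follows; the function name, arguments, and the brute-force nearest-codeword assignment are my own, not the paper's implementation:

```python
import numpy as np

def appearance_feature(descriptors, times, codebook, t_start, t_end):
    """Normalized histogram of video words inside a candidate window.

    descriptors: (N, D) HOG/HOF descriptors of interest points
    times:       (N,) temporal locations of the points, normalized to [0, 1]
    codebook:    (K, D) cluster centers (e.g. K = 1000 spatio-temporal words)
    """
    mask = (times >= t_start) & (times < t_end)
    hist = np.zeros(len(codebook))
    if not mask.any():
        return hist
    # assign each descriptor inside the window to its nearest codeword
    d = descriptors[mask]
    dists = ((d[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = dists.argmin(axis=1)
    for w in words:
        hist[w] += 1
    return hist / hist.sum()
```

In practice the codebook would come from k-means over training descriptors; here it is simply an input array.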
29. Appearance similarity score: chi-square kernel SVM.

30.-31. (Matching continued.)

32. Temporal location feature: the distance between the candidate location h_1 and the segment's anchor location.

33. Temporal location disagreement score: a 2nd-order polynomial of that distance.

34.-36. (Matching continued.)

37. Matching score for the whole query video: combine the matching scores of all segments.

38. Outline (next: learning).

39. Learning from weakly labeled data: positive and negative examples are YouTube videos with one class label per video, collected on Amazon Mechanical Turk; there is no annotation of temporal segments.

40. (Positive and negative examples against the activity model [0, 1].)

41. Learning goal: learn the motion segment appearances and their temporal arrangement, in a max-margin framework optimizing a discriminative loss, solved by coordinate descent [Felzenszwalb et al. 2008].
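The matching score assembled on these slides combines an appearance term with the 2nd-order temporal disagreement term, summed over segments at their best candidate locations. A toy version: the paper scores appearance with a chi-square kernel SVM, which I replace here with a plain chi-square similarity, and the polynomial coefficients `a`, `b` are illustrative placeholders:

```python
import numpy as np

def chi2_similarity(h1, h2, eps=1e-10):
    """Negative chi-square distance between two normalized histograms."""
    return -0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def segment_score(model_hist, cand_hist, anchor, h, a, b):
    """Appearance similarity plus a 2nd-order polynomial penalty on the
    displacement between the candidate location h and the anchor."""
    d = h - anchor
    return chi2_similarity(model_hist, cand_hist) + a * d + b * d ** 2

def match(model, query_segments):
    """Total matching score: best candidate location per segment, summed.

    model: list of dicts with keys "hist", "anchor", "a", "b"
    query_segments: per segment, a list of (location, histogram) candidates
    """
    total = 0.0
    for seg, candidates in zip(model, query_segments):
        total += max(
            segment_score(seg["hist"], c_hist, seg["anchor"], h,
                          seg["a"], seg["b"])
            for h, c_hist in candidates)
    return total
```

With `b < 0`, candidates far from the anchor are penalized, which is what makes the matching tolerant to, but not indifferent to, temporal displacement.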
42. Coordinate descent: initialize the model parameters.

43. Step 1: find the best matching locations on the positive examples.

44.-45. Step 2: update the model parameters.

46. Repeat until convergence (or a maximum number of iterations).

47. Outline (next: experiments).

48. Experiment I: simple actions on the KTH dataset [Schuldt et al. 2004]. Per-class accuracy of our model: walking 94.4%, running 79.5%, jogging 78.2%, hand-waving 99.9%, hand-clapping 96.5%, boxing 99.2%. [Bar chart: overall accuracy of our method vs. Wang et al. 2009, Laptev et al. 2008, Wong et al. 2007, Schuldt et al. 2004.]

49. Experiment II: proof of concept. Activities synthesized from 6 Weizmann classes [Blank et al. 2005]. Ours: 100%; bag-of-features: 17%. [Figure: learned model for a synthesized activity (wave, jump, jumping jacks), including a transition from jump to jumping jacks, at shorter and longer scales.]

50. Experiment III: the Olympic Sports Dataset. YouTube videos with one class label per video from Amazon Mechanical Turk; 16 classes, ~100 videos each. http://vision.stanford.edu/Datasets/OlympicSports. Classes: high-jump, long-jump, triple-jump, pole-vault, discus, hammer, javelin, shot put, basketball lay-up, bowling, tennis-serve, platform, springboard, snatch, clean-jerk, vault.

51. Learned model: high jump. Segments: start running, run, take off, landing & stand up, at shorter and longer scales.
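The coordinate-descent procedure on slides 42-46 alternates a latent step (find the best matching locations) with a max-margin parameter update, in the spirit of Felzenszwalb et al. 2008. Below is a toy, runnable stand-in rather than the paper's solver: each positive example is a list of candidate feature vectors (one per candidate segment placement), the latent variable picks one, and the update is a simple hinge-loss subgradient step:

```python
import numpy as np

def learn(positives, negatives, dim, n_iters=20, lr=0.1):
    """Toy latent max-margin training by coordinate descent (illustrative).

    positives: list of examples, each a list of candidate feature vectors
    negatives: list of plain feature vectors
    """
    w = np.zeros(dim)
    for _ in range(n_iters):
        # Step 1 (latent): pick the best matching placement per positive
        # under the current model parameters.
        latents = [cands[int(np.argmax([w @ c for c in cands]))]
                   for cands in positives]
        # Step 2 (update): hinge-loss subgradient steps on the chosen
        # placements and on the negatives.
        for x in latents:
            if w @ x < 1:
                w += lr * x
        for x in negatives:
            if w @ x > -1:
                w -= lr * x
    return w
```

The real training additionally regularizes `w` and mines hard negatives, but the alternation structure is the same: fix locations, update parameters, repeat until convergence.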
52. Learned model: high jump. A shorter segment has larger location uncertainty.

53. A long segment has small location uncertainty.

54. Learned model: clean and jerk. Segments: hold weight while crouching, lift weight to shoulders, hold weight on shoulders, hold weight while crouching, transition to upright position.

55. A short segment with low location uncertainty: it had high location consistency in training.

56. Some segments encode similar appearance, and their possible locations overlap.

57. Matched sequences: long jump (run, take off, stand up), two sequences. Remarks: matching is tolerant to variations in the exact temporal location of motion segments, and query videos can have different time lengths.

58. Matched sequences: vault (run, up in the air, landing), two sequences. One has a low matching score: good temporal alignment but bad appearance.

59. Classifying Olympic Sports: our method 72.1%, Laptev et al. 2008 62.0%.

60. Outline (last: conclusions).

61. Conclusions: temporal context and structure are useful for activity recognition. We introduce the Olympic Sports Dataset (16 classes, ~100 videos/class). Future directions: explore richer temporal structures; introduce semantics for a more meaningful decomposition.

62. Thank you!
Juan Carlos Niebles, graduate student, Princeton/Stanford. Thanks to Bangpeng Yao, Barry Chai, Jia Deng, Hao Su, Olga Russakovsky, and all Stanford Vision Lab members.