
Detecting Engagement in Egocentric Video

[Figure: Precision-Recall curves on the UT EE (left) and UT Ego (right) datasets. Methods compared: GBVS (Harel ‘06), Self-resemblance (Seo ‘09), Bayesian Surprise (Itti ‘09), Video Attention (Ejaz ‘13), Video Saliency (Rudoy ‘13), Salient Object (Rahtu ‘10), Important Region (Lee ‘12), CNN Appearance, Motion Mag. (Rallapalli ‘14), Ours – frame, Ours – interval.]

Yu-Chuan Su and Kristen Grauman, The University of Texas at Austin

1. Engagement in Egocentric Video

Motivation: People do not always engage with what they see, and they pay different levels of attention to the environment.

Goal: Given an egocentric video, we want to predict when the camera wearer is engaged with what he sees.

Engagement is different from saliency: previous work on visual attention [Harel ‘06, Itti ‘09, Rudoy ‘13, …] focuses on where people look but ignores when people are engaged.

[Figure: engagement level over time; detections mark the intervals where the recorder is engaged.]

Definition of Engagement: The recorder is attracted by some object(s), and he interrupts his ongoing flow of activity to purposefully gather more information about the object(s).

2. UT Egocentric Engagement (UT EE) Dataset

Video Statistics: We collect videos in three browsing scenarios (Shopping in a Market, Window Shopping in a Mall, Touring in a Museum): 27 videos from 9 recorders, 14 hours total length.

Engagement Annotation: Frame-level annotation with MTurk. Each video is labeled by 10 Turkers, and ground truth is determined by majority vote (see the sketch below).
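A minimal sketch of the majority-vote aggregation, assuming each Turker gives a binary engaged/not-engaged label per frame; the tie-breaking rule is an assumption, since the poster does not specify one:

```python
import numpy as np

def majority_vote(turker_labels):
    """Aggregate per-frame binary labels from multiple Turkers.

    turker_labels: (n_turkers, n_frames) array, 1 = engaged.
    Returns the per-frame majority label; ties (possible with an
    even number of annotators) fall to "not engaged" here, an
    assumption since the poster does not state a tie-breaking rule.
    """
    votes = turker_labels.sum(axis=0)
    return (2 * votes > turker_labels.shape[0]).astype(int)

# Example: 10 Turkers labeling a 5-frame clip.
labels = np.random.randint(0, 2, size=(10, 5))
ground_truth = majority_vote(labels)
```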

3. Data Analysis

We collect 3 hours of recorder self-annotation to verify the third-person annotation.

                       Frame F1   Interval F1   Boundary Presence
Turker vs. Consensus     0.818       0.837            0.914
Turker vs. Recorder      0.589       0.626            0.813
Random vs. Consensus     0.426       0.339            0.481
Random vs. Recorder      0.399       0.344            0.478
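Frame F1 here is the standard F1 score over per-frame binary labels; a minimal sketch of that metric (Interval F1 and the Boundary Presence score involve interval matching and are omitted):

```python
import numpy as np

def frame_f1(pred, gt):
    # Standard F1 over per-frame binary engagement labels.
    pred, gt = np.asarray(pred), np.asarray(gt)
    tp = np.sum((pred == 1) & (gt == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```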

Engagement is predictable from egocentric video!

4. Predict Engagement from Motion

Step 1: Estimate frame-wise engagement. A frame-level engagement estimator maps a frame-level motion descriptor to an engagement score for every frame.
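The descriptor is not spelled out on the poster; a plausible sketch builds it from dense optical flow pooled over a spatial grid (the grid size and the pooled statistics are assumptions):

```python
import cv2
import numpy as np

def frame_motion_descriptor(prev_gray, gray, grid=4):
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    desc = []
    # Pool flow statistics over a grid x grid spatial layout.
    for i in range(grid):
        for j in range(grid):
            cell = flow[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            mag, ang = cv2.cartToPolar(cell[..., 0], cell[..., 1])
            desc += [mag.mean(), mag.std(),
                     np.cos(ang).mean(), np.sin(ang).mean()]
    return np.asarray(desc)
```

A frame-level estimator (for example, a classifier producing an engagement probability per frame, trained on these descriptors against the majority-vote labels) then supplies the frame-wise scores used in the next two steps; the specific estimator is not detailed on this panel.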

Step 2: Generate interval hypotheses from the frame-wise engagement curve.
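The poster does not detail the hypothesis generator; one plausible sketch thresholds the smoothed frame-wise curve at several levels and takes each contiguous run above a threshold as a candidate interval (the smoothing width and threshold set are assumptions):

```python
import numpy as np

def interval_hypotheses(frame_scores, thresholds=(0.3, 0.5, 0.7), width=15):
    # Smooth the frame-wise engagement curve with a moving average.
    kernel = np.ones(width) / width
    scores = np.convolve(frame_scores, kernel, mode="same")
    hypotheses = set()
    for t in thresholds:
        above = scores > t
        edges = np.diff(above.astype(int))
        starts = list(np.where(edges == 1)[0] + 1)
        ends = list(np.where(edges == -1)[0] + 1)
        if above[0]:
            starts = [0] + starts
        if above[-1]:
            ends = ends + [len(above)]
        # Each contiguous run above a threshold is one hypothesis.
        hypotheses.update(zip(starts, ends))
    return sorted(hypotheses)
```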

Step 3: Estimate engagement per interval. An interval-level engagement estimator scores each hypothesis using a motion temporal pyramid; the scored intervals give the final prediction.
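A minimal sketch of a motion temporal pyramid over an interval's per-frame descriptors, assuming average pooling and a 3-level pyramid (both assumptions):

```python
import numpy as np

def motion_temporal_pyramid(frame_descs, start, end, levels=3):
    # frame_descs: (n_frames, d) per-frame motion descriptors.
    feats = []
    for level in range(levels):
        n_bins = 2 ** level
        bounds = np.linspace(start, end, n_bins + 1).astype(int)
        for b in range(n_bins):
            lo, hi = bounds[b], max(bounds[b + 1], bounds[b] + 1)
            # Average-pool the descriptors within this temporal bin.
            feats.append(frame_descs[lo:hi].mean(axis=0))
    return np.concatenate(feats)  # length d * (2**levels - 1)
```

The pyramid keeps coarse-to-fine temporal structure, so intervals of very different durations map to a fixed-length feature that the interval-level estimator can score.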

5. Experiments

Project page: http://vision.cs.utexas.edu/projects/ego-engagement

Challenges of engagement detection:
- Diverse visual content
- Being engaged ≠ being static
- Duration of engagement varies significantly

Findings:
- Our method performs the best in all settings.
- The interval hypothesis has a clear positive impact.
- The appearance feature does not generalize well (UT Ego).
- Saliency and Motion Mag. baselines perform poorly.
- We outperform Important Region without training on UT Ego.
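The PR curves in the figure at the top compare methods at the frame level; a minimal sketch of tracing such a curve from per-frame scores and binary ground truth (the exact evaluation protocol is an assumption):

```python
import numpy as np

def pr_curve(frame_scores, gt, n_points=100):
    # Sweep a score threshold; report (recall, precision) pairs.
    points = []
    for t in np.linspace(frame_scores.min(), frame_scores.max(), n_points):
        pred = frame_scores >= t
        tp = np.sum(pred & (gt == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((gt == 1).sum(), 1)
        points.append((recall, precision))
    return points
```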

Applications: behavior analysis, VR display, camera control.