Serena Yeung, PhD, Stanford, at MLconf Seattle 2017

Transcript

Towards Scaling Video Understanding

Serena Yeung

YouTube, TV, GoPro, smart spaces

State-of-the-art in video understanding

• Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
• Detection: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)
• Captioning: just getting started, with short clips and niche domains (Yu et al. 2016)

Comparing video with image understanding

• Classification. Videos: 4,800 categories, 15.2% Top-5 error. Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012; Xie 2016).
• Detection. Videos: tens of categories, ~10-20 mAP at 0.5 overlap. Images: hundreds of categories*, ~60 mAP at 0.5 overlap, plus pixel-level segmentation (He 2017).
• Captioning. Videos: just getting started, with short clips and niche domains. Images: dense captioning and coherent paragraphs (Johnson 2016; Krause 2017).
• Beyond. Images: significant work on question-answering (Yang 2016).

*Transfer learning widespread

The challenge of scale

• Inference: video processing is computationally expensive
• Training labels: video annotation is labor-intensive
• Models: temporal dimension adds complexity

The challenge of scale: Inference

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Task: Temporal action detection

(Figure: the input is a video from t = 0 to t = T; the output is the temporal extent of each action instance, e.g., "Running", "Talking".)

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Efficient video processing

(Figure: a video timeline from t = 0 to t = T, with detection outputs.)

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Our model for efficient action detection

• Frame model. Input: a frame. Output: an optional detection instance [start, end] and the next frame to glimpse.
• A convolutional neural network extracts frame information; a recurrent neural network carries time information across glimpses.

(Figure: the model glimpses a sparse set of frames between t = 0 and t = T, emitting detection outputs along the way.)

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Our model for efficient action detection

• Train differentiable outputs (detection output class and bounds) using standard backpropagation
• Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning with the REINFORCE algorithm; see the sketch below
• Achieves detection performance on par with dense sliding-window approaches while observing only 2% of frames
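To make the two training signals concrete, here is a minimal sketch of the glimpse agent, assuming PyTorch; the layer names, sizes, and reward shaping are my assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the frame-glimpse agent; all names and sizes are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)  # time information
        self.detect = nn.Linear(hidden_dim, 2)  # [start, end]: differentiable
        self.emit = nn.Linear(hidden_dim, 1)    # emit a prediction? REINFORCE
        self.where = nn.Linear(hidden_dim, 1)   # next frame to glimpse: REINFORCE

    def step(self, frame_feat, state):
        # frame_feat: CNN features of the glimpsed frame (frame information)
        h, c = self.rnn(frame_feat, state)
        bounds = torch.sigmoid(self.detect(h))  # normalized [start, end]
        p_emit = torch.sigmoid(self.emit(h))    # Bernoulli parameter
        loc = torch.sigmoid(self.where(h))      # next glimpse location in [0, 1]
        return bounds, p_emit, loc, (h, c)

# Differentiable outputs (bounds) train by backpropagating a regression loss;
# the sampled emit/where actions train with REINFORCE: scale the summed action
# log-probabilities by (reward - baseline) and minimize the negative.
def reinforce_loss(action_log_probs, reward, baseline):
    return -(reward - baseline) * action_log_probs.sum()
```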

Learned policy in action

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

The challenge of scale: Models

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Dense action labeling

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

MultiTHUMOS

• Extends the THUMOS '14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                          THUMOS   MultiTHUMOS
Annotations                6,365        38,690
Classes                       20            65
Density (labels / frame)     0.3           1.5
Classes per video            1.1          10.5
Max actions per frame          2             9
Max actions per video          3            25

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Modeling dense, multilabel actions

• Need to reason about multiple potential actions simultaneously
• High degree of temporal dependency
• In standard recurrent models for action recognition, all state lives in the hidden layer representation
• At each time step, the model predicts labels for the current frame from the current frame and the previous hidden representation

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

MultiLSTM

• An extension of the LSTM that expands the temporal receptive field of its input and output connections
• Key idea: giving the model more freedom in both reading input and writing output reduces the burden placed on the hidden layer representation

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

MultiLSTM

(Figure: a standard LSTM, as in Donahue 2014, mapping input video frames to frame class predictions over time.)

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

(Figure: a standard LSTM reads a single input and writes a single output per time step; MultiLSTM reads multiple inputs and writes multiple outputs per time step.)

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

MultiLSTM

• Multiple inputs (soft attention)
• Multiple outputs (weighted average)
• Multilabel loss

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
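As a concrete reading of "multiple inputs (soft attention)", here is a minimal sketch, assuming PyTorch and a window over the most recent frames; the names and sizes are mine, and the output-side weighted averaging and multilabel loss are reduced to comments.

```python
# Minimal sketch of input-side soft attention in a MultiLSTM-style cell;
# illustrative assumptions only, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=65, window=5):
        super().__init__()
        self.window = window
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)  # attention scorer
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):  # feats: (T, feat_dim) CNN features per frame
        h = feats.new_zeros(1, self.cell.hidden_size)
        c = feats.new_zeros(1, self.cell.hidden_size)
        logits = []
        for t in range(feats.size(0)):
            win = feats[max(0, t - self.window + 1): t + 1]  # recent frames
            scores = self.attn(torch.cat([h.expand(win.size(0), -1), win], 1))
            alpha = F.softmax(scores, dim=0)          # soft attention weights
            x = (alpha * win).sum(0, keepdim=True)    # attended input
            h, c = self.cell(x, (h, c))
            logits.append(self.classify(h))           # per-frame class scores
        # The paper also averages multiple outputs per frame over a window;
        # training uses a multilabel (per-class binary cross-entropy) loss:
        # F.binary_cross_entropy_with_logits(torch.cat(logits), targets)
        return torch.cat(logits)  # (T, num_classes)
```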


Retrieving sequential actions

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Retrieving co-occurring actions

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

The challenge of scale: Training labels

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Labeling videos is expensive

• It takes significantly longer to label a video than an image
• If spatial or temporal bounds are desired, it is even worse
• How can we practically learn about new concepts in video?

Web queries are a source of noisy video labels

Image search is much cleaner!

Can we effectively learn from these noisy web queries?

• Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes
• Use a reinforcement learning-based formulation to learn a data labeling policy that achieves strong performance on a small, manually labeled dataset of classes
• Then use this policy to automatically label noisy web data for new classes

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Balancing diversity vs. semantic drift

• Want diverse training examples to improve the classifier
• But too much diversity can also lead to semantic drift
• Our approach: balance diversity and drift by training labeling policies against an annotated reward set that the policy must successfully classify

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Overview of approach

• Candidate web queries (from YouTube autocomplete): "Boomerang", "Boomerang on a beach", "Boomerang music video", ...
• The agent selects a query and labels its videos as new positives (e.g., + "Boomerang on a beach"), adding them to the current positive set
• The classifier is updated from the current positive set and a fixed negative set
• The state is updated, and the training reward comes from evaluating the classifier on the reward set (sketched in code below)

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
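Put together, the loop can be sketched end to end. The toy below is self-contained and illustrative only: queries and videos are random vectors, the "classifier" is a nearest-centroid score, and the selection policy is a linear softmax trained with REINFORCE against accuracy on the reward set. Every name and number is an assumption, not the paper's setup.

```python
# Toy REINFORCE loop for learning a query-selection policy; all data and
# model choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, NQ = 8, 6                                  # feature dim, candidate queries
query_feats = rng.normal(size=(NQ, D))        # one vector per candidate query
query_videos = rng.normal(size=(NQ, 20, D)) + query_feats[:, None]
reward_pos = rng.normal(size=(30, D)) + query_feats[0]  # reward set positives
reward_neg = rng.normal(size=(30, D)) - 3.0             # reward set negatives

w = np.zeros(D)                               # linear policy parameters
for episode in range(200):
    scores = query_feats @ w                  # policy scores per query
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    a = rng.choice(NQ, p=probs)               # sample a query to add
    centroid = query_videos[a].mean(axis=0)   # "update classifier" (toy)
    acc = np.mean([*(reward_pos @ centroid > 0),    # eval on reward set
                   *(reward_neg @ centroid <= 0)])
    grad = query_feats[a] - probs @ query_feats     # grad of log pi(a)
    w += 0.1 * (acc - 0.5) * grad             # REINFORCE, 0.5 as baseline
```

A policy trained this way can then be applied, frozen, to label noisy web results for classes it has never seen.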

Sports1M

(Figure: results for the greedy classifier baseline vs. ours.)

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Novel classes

(Figure: results for the greedy classifier baseline vs. ours.)

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

The challenge of scale

• Inference: video processing is computationally expensive (Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016)
• Models: temporal dimension adds complexity (Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017)
• Training labels: video annotation is labor-intensive (Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017)
• Looking ahead: learning to learn, unsupervised learning

Towards Knowledge

Videos → knowledge of the dynamic visual world

Collaborators

Olga Russakovsky, Mykhaylo Andriluka, Ning Jin, Vignesh Ramanathan, Liyue Shen, Greg Mori, Fei-Fei Li

Thank You