Serena Yeung, PhD, Stanford, at MLconf Seattle 2017

Transcript of Serena Yeung, PhD, Stanford, at MLconf Seattle 2017

  • Towards Scaling Video Understanding

    Serena Yeung

  • YouTube TV

    GoPro

    Smart spaces

  • State-of-the-art in video understanding

    Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)

    Detection: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017, Sigurdsson et al. 2016)

    Captioning: just getting started; short clips, niche domains (Yu et al. 2016)

  • Comparing video with image understanding

    Classification
    Videos: 4,800 categories, 15.2% Top-5 error
    Images: 1,000 categories*, 3.1% Top-5 error* (Krizhevsky 2012, Xie 2016)

    Detection
    Videos: tens of categories, ~10-20 mAP at 0.5 overlap
    Images: hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation (He 2017)

    Captioning
    Videos: just getting started; short clips, niche domains
    Images: dense captioning, coherent paragraphs (Johnson 2016, Krause 2017)

    Beyond
    Images: significant work on question-answering (Yang 2016)

    *Transfer learning widespread

  • The challenge of scale

    Training labels: video annotation is labor-intensive

    Models: the temporal dimension adds complexity

    Inference: video processing is computationally expensive

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

  • Task: Temporal action detection

    Input: a video spanning t = 0 to t = T

    Output: temporal intervals of action instances (e.g., "Running", "Talking")

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
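    As an aside on the metric: the detection numbers quoted earlier (mAP at 0.5 overlap) count a predicted action interval as correct when its temporal intersection-over-union with a ground-truth interval reaches 0.5. A minimal sketch of that measure in Python, as a generic illustration rather than the paper's evaluation code:

    def temporal_iou(pred, gt):
        """Temporal intersection-over-union between two [start, end] intervals."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # A predicted "Running" instance vs. its ground truth; correct at the 0.5-overlap threshold.
    print(temporal_iou([12.0, 20.0], [10.0, 18.0]))  # 0.6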

  • Efficient video processing

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

  • Our model for efficient action detection

    Frame model: the input at each step is a single frame, glimpsed at a chosen point between t = 0 and t = T

    A convolutional neural network (frame information) feeds a recurrent neural network (time information)

    Output at each step: the next frame to glimpse, and optionally a detection instance [start, end]

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
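    A minimal sketch of one such glimpse step, with hypothetical layer sizes and a stand-in for the CNN feature extractor; it illustrates the CNN-into-RNN structure described above, not the paper's released code:

    import torch
    import torch.nn as nn

    class GlimpseStep(nn.Module):
        """One recurrent glimpse step: CNN frame features -> RNN state -> outputs."""
        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
            self.next_glimpse = nn.Linear(hidden_dim, 1)  # normalized position of the next frame to observe
            self.detection = nn.Linear(hidden_dim, 2)     # candidate [start, end] of an action instance
            self.emit = nn.Linear(hidden_dim, 1)          # whether to emit the detection at this step

        def forward(self, frame_feat, state):
            h, c = self.rnn(frame_feat, state)
            loc = torch.sigmoid(self.next_glimpse(h))     # where to look next, in [0, 1] of the video
            det = self.detection(h)                       # differentiable output
            emit_prob = torch.sigmoid(self.emit(h))       # discrete decision, trained with REINFORCE (next slide)
            return loc, det, emit_prob, (h, c)

    # One step on a batch of 4 pre-extracted frame features.
    step = GlimpseStep()
    feats = torch.randn(4, 512)
    state = (torch.zeros(4, 256), torch.zeros(4, 256))
    loc, det, emit_prob, state = step(feats, state)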

  • Our model for efficient action detection

    Train differentiable outputs (detection class and bounds) using standard backpropagation

    Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning (the REINFORCE algorithm)

    Achieves detection performance on par with dense sliding-window approaches while observing only 2% of frames

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
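    The REINFORCE part can be sketched as a standard policy-gradient loss over the sampled, non-differentiable decisions; the reward values and baseline below are placeholders for whatever detection-correctness signal training uses, so treat this as a generic illustration rather than the paper's implementation:

    import torch

    def reinforce_loss(log_probs, rewards, baseline=0.0):
        """Policy-gradient loss for sampled decisions (where to look next, when to emit).

        log_probs: log-probabilities of the sampled actions, one per decision.
        rewards:   reward per decision, e.g. whether the resulting detections were correct.
        """
        advantages = rewards - baseline                   # a baseline reduces gradient variance
        return -(log_probs * advantages.detach()).sum()   # minimizing this ascends expected reward

    # The differentiable detection outputs use an ordinary regression/classification loss,
    # and the two losses are summed before backpropagation.
    log_probs = torch.log(torch.tensor([0.7, 0.4, 0.9]))
    loss = reinforce_loss(log_probs, torch.tensor([1.0, 1.0, 0.0]), baseline=0.5)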

  • Learned policy in action

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

  • The challenge of scale

    Training labels: video annotation is labor-intensive

    Models: the temporal dimension adds complexity

    Inference: video processing is computationally expensive

    Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

    Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

  • Dense action labeling

    Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

  • MultiTHUMOS: extends the THUMOS'14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                                  THUMOS     MultiTHUMOS
    Annotations                   6,365      38,690
    Classes                       20         65
    Density (labels / frame)      0.3        1.5
    Classes per video             1.1        10.5
    Max actions per frame         2          9
    Max actions per video         3          25
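    Dense, multilevel annotation means a single frame can carry several simultaneous action labels. As a hypothetical toy illustration of how the density and max-actions-per-frame statistics above are derived (this is not MultiTHUMOS's actual annotation format), per-frame labels can be viewed as a frames-by-classes multi-hot matrix:

    import numpy as np

    # Toy example: 6 frames, 4 action classes, multi-hot labels per frame.
    labels = np.array([
        [1, 0, 0, 0],   # frame 0: one action
        [1, 1, 0, 0],   # frame 1: two simultaneous actions
        [1, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ])

    density = labels.sum() / len(labels)        # mean labels per frame (MultiTHUMOS reports 1.5)
    max_per_frame = labels.sum(axis=1).max()    # busiest frame (MultiTHUMOS reports 9)
    print(density, max_per_frame)               # ~1.83 and 3 for this toy matrix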

    Yeung, Russakovsky, Jin,