Fei-Fei Li & Andrej Karpathy & Justin Johnson
Lecture 14, 29 Feb 2016

  • Administrative
    ● Everyone should be done with Assignment 3 now
    ● Milestone grades will go out soon

  • Last class
    ● Segmentation
    ● Spatial Transformer
    ● Soft Attention

  • Videos

  • ConvNets for images

  • Feature-based approaches to Activity Recognition
    ● Dense trajectories and motion boundary descriptors for action recognition, Wang et al., 2013
    ● Action Recognition with Improved Trajectories, Wang and Schmid, 2013 (code available!)

  • Dense trajectories and motion boundary descriptors for action recognition, Wang et al., 2013
    ● Pipeline: detect feature points, track them with optical flow, then extract HOG/HOF/MBH features in the (stabilized) coordinate system of each tracklet

  • Dense trajectories and motion boundary descriptors for action recognition, Wang et al., 2013
    ● Detect feature points [J. Shi and C. Tomasi, “Good features to track,” CVPR 1994; Ivan Laptev, 2005]

  • Dense trajectories and motion boundary descriptors for action recognition, Wang et al., 2013
    ● Track each keypoint using optical flow [G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” 2003; T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]

  • Dense trajectories and motion boundary descriptors for action recognition, Wang et al., 2013
    ● Extract features in the local coordinate system of each tracklet
    ● Accumulate into histograms, separately according to multiple spatio-temporal layouts (a rough detect-and-track sketch follows below)
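
    The detect-and-track stage of this pipeline can be approximated with stock OpenCV routines. The sketch below is not the authors' released code; it detects corners with goodFeaturesToTrack and advances them frame by frame with Farnebäck dense optical flow, which is roughly what dense trajectories do before extracting HOG/HOF/MBH descriptors. The input filename is hypothetical.

    # Rough sketch of the detect-and-track stage, using stock OpenCV routines.
    import cv2
    import numpy as np

    def advance_points(prev_gray, next_gray, points):
        """Move each (x, y) point by the dense Farneback flow sampled at its location."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = prev_gray.shape
        moved = []
        for x, y in points:
            xi = int(np.clip(round(x), 0, w - 1))
            yi = int(np.clip(round(y), 0, h - 1))
            dx, dy = flow[yi, xi]
            moved.append((float(x + dx), float(y + dy)))
        return moved

    cap = cv2.VideoCapture("video.mp4")                  # hypothetical input video
    ok, frame = cap.read()
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=5)
    trajectories = [[tuple(pt.ravel())] for pt in corners]

    for _ in range(15):                                  # trajectory length of 15 frames
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        heads = [traj[-1] for traj in trajectories]
        for traj, new_pt in zip(trajectories, advance_points(prev_gray, gray, heads)):
            traj.append(new_pt)
        prev_gray = gray
    # HOG/HOF/MBH descriptors would then be computed in a tube around each trajectory.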

  • Case Study: AlexNet [Krizhevsky et al., 2012]
    ● Input: 227x227x3 images
    ● First layer (CONV1): 96 11x11 filters applied at stride 4 => output volume [55x55x96], since (227 - 11)/4 + 1 = 55
    ● Q: What if the input is now a small chunk of video, e.g. [227x227x3x15]?

  • Case Study: AlexNet [Krizhevsky et al., 2012], continued
    ● A: Extend the convolutional filters in time and perform spatio-temporal convolutions, e.g. 11x11xT filters with T = 2..15 (a minimal sketch follows below)
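
    In PyTorch terms (used here purely for illustration, not the framework from the course), extending CONV1 in time amounts to swapping a 2D convolution for a 3D one whose kernel also spans T frames. A minimal sketch, assuming a 15-frame clip and T = 3:

    # Sketch: extending AlexNet's CONV1 to a spatio-temporal convolution.
    import torch
    import torch.nn as nn

    # 2D CONV1: 96 filters of 11x11, stride 4, on a single 227x227x3 image
    conv1_2d = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
    x_img = torch.randn(1, 3, 227, 227)
    print(conv1_2d(x_img).shape)          # -> [1, 96, 55, 55]

    # 3D CONV1: same spatial filter, extended over T frames in time
    T = 3                                 # temporal extent of each filter (T = 2..15)
    conv1_3d = nn.Conv3d(in_channels=3, out_channels=96,
                         kernel_size=(T, 11, 11), stride=(1, 4, 4))
    x_clip = torch.randn(1, 3, 15, 227, 227)   # a [227x227x3x15] clip
    print(conv1_3d(x_clip).shape)         # -> [1, 96, 13, 55, 55]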

  • Spatio-Temporal ConvNets
    ● [3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010]

  • Spatio-Temporal ConvNets
    ● [Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011]

  • Spatio-Temporal ConvNets
    ● [Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]
    ● Spatio-temporal convolutions worked best

  • Spatio-Temporal ConvNets
    ● [Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]
    ● Learned filters on the first layer

  • Spatio-Temporal ConvNets
    ● [Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]
    ● Dataset: 1 million videos, 487 sports classes

  • Spatio-Temporal ConvNets
    ● [Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]
    ● The motion information didn’t add all that much

  • Spatio-Temporal ConvNets

  • Spatio-Temporal ConvNets
    ● [Learning Spatiotemporal Features with 3D Convolutional Networks, Tran et al., 2015]
    ● 3D VGGNet, basically (see the sketch below)
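
    C3D keeps the VGG recipe but in 3D: homogeneous stacks of small 3x3x3 convolutions with 3D pooling, applied to short 16-frame, 112x112 clips. A hedged sketch of one such block (channel widths and pooling here are illustrative, not the exact published configuration):

    # Sketch of a C3D-style block: VGG-like 3x3x3 convolutions plus 3D pooling.
    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=3, padding=1),   # 3x3x3 conv, "same" padding
        nn.ReLU(inplace=True),
        nn.Conv3d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=2, stride=2),        # halves T, H and W
    )

    clip = torch.randn(1, 3, 16, 112, 112)            # a 16-frame 112x112 clip
    print(block(clip).shape)                          # -> [1, 64, 8, 56, 56]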

  • Spatio-Temporal ConvNets
    ● [Two-Stream Convolutional Networks for Action Recognition in Videos, Simonyan and Zisserman, 2014] (of VGGNet fame)
    ● Optical flow from [T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]

  • Spatio-Temporal ConvNets
    ● [Two-Stream Convolutional Networks for Action Recognition in Videos, Simonyan and Zisserman, 2014]
    ● Optical flow from [T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]
    ● The two-stream version works much better than either stream alone (see the fusion sketch below)
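
    A minimal sketch of the two-stream idea: one ConvNet sees a single RGB frame, the other sees a stack of optical-flow fields (two channels per flow frame), and their class scores are fused late by averaging the softmaxes. The tiny backbones and the class count below are placeholders, not the published architecture:

    # Sketch of two-stream late fusion with placeholder backbones.
    import torch
    import torch.nn as nn

    num_classes = 101          # e.g. UCF-101
    L = 10                     # number of stacked optical-flow frames

    def tiny_backbone(in_channels):
        return nn.Sequential(nn.Conv2d(in_channels, 64, 7, stride=2), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(64, num_classes))

    spatial_net = tiny_backbone(3)          # appearance stream: one RGB frame
    temporal_net = tiny_backbone(2 * L)     # motion stream: x/y flow for L frames

    rgb_frame = torch.randn(1, 3, 224, 224)
    flow_stack = torch.randn(1, 2 * L, 224, 224)

    # Late fusion: average the per-stream softmax scores.
    scores = (spatial_net(rgb_frame).softmax(dim=1) +
              temporal_net(flow_stack).softmax(dim=1)) / 2
    print(scores.argmax(dim=1))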

  • Long-time Spatio-Temporal ConvNets
    ● All 3D ConvNets so far used local motion cues to get extra accuracy (e.g. half a second or so)
    ● Q: What if the temporal dependencies of interest are much, much longer, e.g. several seconds between event 1 and event 2?

  • Long-time Spatio-Temporal ConvNets
    ● [Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011]
    ● LSTM way before it was cool
    ● (This paper was way ahead of its time. Cited 65 times.)

  • Long-time Spatio-Temporal ConvNets
    ● [Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015] (a sketch of the CNN + LSTM pattern follows below)
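
    The LRCN pattern is essentially: run a 2D ConvNet on every frame, then run an LSTM over the resulting feature sequence and classify from its hidden state. A minimal sketch with a toy frame encoder standing in for a pretrained ConvNet (shapes and sizes are illustrative):

    # Sketch of the CNN-then-LSTM pattern (LRCN-style) with a toy frame encoder.
    import torch
    import torch.nn as nn

    num_classes, feat_dim, hidden = 101, 256, 256

    frame_encoder = nn.Sequential(                # stand-in for a pretrained 2D ConvNet
        nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, feat_dim))
    lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)
    classifier = nn.Linear(hidden, num_classes)

    clip = torch.randn(1, 16, 3, 224, 224)        # (batch, time, channels, H, W)
    B, T = clip.shape[:2]
    feats = frame_encoder(clip.reshape(B * T, 3, 224, 224)).reshape(B, T, feat_dim)
    outputs, _ = lstm(feats)                      # one hidden state per frame
    scores = classifier(outputs[:, -1])           # classify from the final time step
    print(scores.shape)                           # -> [1, 101]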

  • Long-time Spatio-Temporal ConvNets
    ● [Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015]

  • Summary so far: we looked at two types of architectural patterns:
    1. Model temporal motion locally (3D CONV)
    2. Model temporal motion globally (LSTM / RNN)
    + Fusions of both approaches at the same time
    ● There is another (cleaner) way!

  • 3D CONVNET vs RNN on video
    ● 3D ConvNet: finite temporal extent (neurons are a function of only finitely many video frames in the past)
    ● RNN: infinite (in theory) temporal extent (neurons are a function of all video frames in the past)

  • Long-time Spatio-Temporal ConvNets
    ● [Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]
    ● Beautiful: all neurons in the ConvNet are recurrent
    ● Only requires (existing) 2D CONV routines; no need for 3D spatio-temporal CONV (see the convolutional GRU sketch below)
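
    In the GRU-RCN layer of Ballas et al., the hidden state is itself a feature map and the GRU gates are computed with 2D convolutions instead of fully connected products, so only existing 2D CONV routines are needed. A hedged sketch of a convolutional GRU cell following the standard GRU equations (not the published code):

    # Sketch of a convolutional GRU cell: GRU gate equations with every matrix
    # multiply replaced by a 2D convolution, so the hidden state stays a feature map.
    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            p = k // 2
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update z, reset r
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state

        def forward(self, x, h):
            z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    cell = ConvGRUCell(in_ch=64, hid_ch=64)
    frames = torch.randn(16, 1, 64, 28, 28)        # conv feature maps for 16 time steps
    h = torch.zeros(1, 64, 28, 28)
    for x in frames:                               # recurrence over time, 2D convs only
        h = cell(x, h)
    print(h.shape)                                 # -> [1, 64, 28, 28]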

  • Long-time Spatio-Temporal ConvNets
    ● [Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]
    ● Normal ConvNet: convolution layer (figure)

  • Long-time Spatio-Temporal ConvNets
    ● [Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]