
Transcript of Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 7 - April 22, 2019

  • Slide 1

  • Slide 2

    Administrative: Project Proposal

    Due tomorrow, 4/24 on GradeScope

    1 person per group needs to submit, but tag all group members

  • Slide 3

    Administrative: Alternate Midterm

    See Piazza for form to request alternate midterm time or other midterm accommodations

    Alternate midterm requests due Thursday!

  • Slide 4

    Administrative: A2

    A2 is out, due Wednesday 5/1

    We recommend using Google Cloud for the assignment, especially if your local machine uses Windows

  • Slide 5

    Where we are now...

    Computational graphs

    (Diagram: input x and weights W are multiplied to produce scores s; the scores feed a hinge loss, which is added to a regularization term R to give the total loss L.)

  • Slide 6

    Where we are now... Neural Networks

    Linear score function: f = W x

    2-layer Neural Network: f = W2 max(0, W1 x)

    (Diagram: input x (3072) → hidden layer h (100) → scores s (10).)
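
    A minimal sketch (not from the slides) of this 2-layer network's forward pass in NumPy, using the slide's dimensions (3072 → 100 → 10); the random initialization is only for illustration:

    import numpy as np

    x = np.random.randn(3072)               # flattened 32x32x3 image
    W1 = 0.01 * np.random.randn(100, 3072)  # first-layer weights (illustrative init)
    W2 = 0.01 * np.random.randn(10, 100)    # second-layer weights

    h = np.maximum(0, W1 @ x)               # hidden layer: max(0, W1 x)
    s = W2 @ h                              # class scores: W2 h
    print(h.shape, s.shape)                 # (100,) (10,)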

  • Slide 7

    Where we are now...

    Convolutional Neural Networks

    Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1

  • Slide 8

    Where we are now... Convolutional Layer

    (Diagram: a 32x32x3 image is convolved with a 5x5x3 filter, sliding over all spatial locations, producing a 28x28x1 activation map.)

  • Slide 9

    Where we are now... Convolutional Layer

    For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

    We stack these up to get a “new image” of size 28x28x6!

    (Diagram: the 32x32x3 input passes through the convolution layer to produce 6 stacked 28x28 activation maps.)
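
    A small NumPy sketch (not from the slides) that checks these shapes with naive loops: a 32x32x3 input convolved with 6 filters of size 5x5x3, stride 1 and no padding, gives a 28x28x6 output volume.

    import numpy as np

    image = np.random.randn(32, 32, 3)
    filters = np.random.randn(6, 5, 5, 3)

    out_size = 32 - 5 + 1                    # (32 - 5)/1 + 1 = 28
    out = np.zeros((out_size, out_size, 6))
    for k in range(6):                       # one activation map per filter
        for i in range(out_size):
            for j in range(out_size):
                out[i, j, k] = np.sum(image[i:i+5, j:j+5, :] * filters[k])
    print(out.shape)                         # (28, 28, 6)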

  • Slide 10

    Where we are now...

    Learning network parameters through optimization

    Landscape image is CC0 1.0 public domain: http://maxpixel.freegreatpicture.com/Mountains-Valleys-Landscape-Hills-Grass-Green-699369
    Walking man image is CC0 1.0 public domain: http://www.publicdomainpictures.net/view-image.php?image=139314&picture=walking-man
    CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/

  • Slide 11

    Where we are now...

    Mini-batch SGD

    Loop:
    1. Sample a batch of data
    2. Forward prop it through the graph (network), get loss
    3. Backprop to calculate the gradients
    4. Update the parameters using the gradient
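
    A runnable toy sketch of this loop (not from the slides), using a simple least-squares model in place of the network so the four steps can be seen end to end:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))            # toy data
    y = X @ rng.standard_normal(20)                # toy targets

    w = np.zeros(20)                               # parameters
    learning_rate = 1e-2
    for step in range(500):
        idx = rng.integers(0, 1000, size=64)       # 1. sample a batch of data
        xb, yb = X[idx], y[idx]
        pred = xb @ w                              # 2. forward prop, get loss
        loss = np.mean((pred - yb) ** 2)
        grad = 2 * xb.T @ (pred - yb) / len(idx)   # 3. backprop (here: analytic gradient)
        w -= learning_rate * grad                  # 4. update the parameters using the gradient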

  • Slide 12

    Where we are now...

    Hardware + Software: PyTorch, TensorFlow

  • Slide 13

    Next: Training Neural Networks

  • Slide 14

    Overview

    1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking

    2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization

    3. Evaluation: model ensembles, test-time augmentation

  • Slide 15

    Part 1
    - Activation Functions
    - Data Preprocessing
    - Weight Initialization
    - Batch Normalization
    - Babysitting the Learning Process
    - Hyperparameter Optimization

  • Slide 16

    Activation Functions

  • Slide 17

    Activation Functions

  • Slide 18

    Activation Functions

    Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

    (Plots of each activation function.)
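
    As a reference, minimal NumPy definitions of these activations (not from the slides; the Leaky ReLU slope and ELU alpha are common default choices, and Maxout is shown as the max over two affine maps):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return np.tanh(x)

    def relu(x):
        return np.maximum(0, x)

    def leaky_relu(x, slope=0.01):
        return np.where(x > 0, x, slope * x)

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))

    def maxout(x, W1, b1, W2, b2):
        # max over two learned affine maps
        return np.maximum(W1 @ x + b1, W2 @ x + b2)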

  • Slide 19

    Activation Functions

    Sigmoid: σ(x) = 1 / (1 + e^(-x))

    - Squashes numbers to range [0,1]
    - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron

  • Slide 20

    Activation Functions

    Sigmoid

    - Squashes numbers to range [0,1]
    - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron

    3 problems:

    1. Saturated neurons “kill” the gradients

  • Slide 21

    (Diagram: a sigmoid gate with input x inside a computational graph.)

    What happens when x = -10? What happens when x = 0? What happens when x = 10?
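
    A quick numeric check (not from the slides): the sigmoid gate's local gradient is sigmoid(x) * (1 - sigmoid(x)), which is essentially zero at x = -10 and x = 10 and at most 0.25 at x = 0; this is how saturated neurons “kill” the gradient.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in (-10.0, 0.0, 10.0):
        s = sigmoid(x)
        print(x, s * (1 - s))    # ~4.5e-05, 0.25, ~4.5e-05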

  • Slide 22

    Activation Functions

    Sigmoid

    - Squashes numbers to range [0,1]
    - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron

    3 problems:

    1. Saturated neurons “kill” the gradients
    2. Sigmoid outputs are not zero-centered

  • Slide 23

    Consider what happens when the input to a neuron is always positive...

    What can we say about the gradients on w?

  • Slide 24

    Consider what happens when the input to a neuron is always positive...

    What can we say about the gradients on w? Always all positive or all negative :(

    (Diagram: the two allowed gradient update directions, the hypothetical optimal w vector, and the resulting zig zag path.)

  • Slide 25

    Consider what happens when the input to a neuron is always positive...

    What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help)

    (Diagram: the two allowed gradient update directions, the hypothetical optimal w vector, and the resulting zig zag path.)
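
    A tiny sketch of why this happens (not from the slides): for a single neuron f = w·x, the gradient on w is (dL/df) · x, so if every component of x is positive, every component of the gradient shares the sign of the scalar upstream gradient dL/df.

    import numpy as np

    x = np.array([0.3, 1.2, 0.7])       # all-positive inputs (e.g. sigmoid outputs)
    for upstream in (+2.0, -2.0):       # dL/df arriving from further up the graph
        grad_w = upstream * x           # dL/dw
        print(grad_w)                   # all components positive, then all negative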

  • Slide 26

    Activation Functions

    Sigmoid

    - Squashes numbers to range [0,1]
    - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron

    3 problems:

    1. Saturated neurons “kill” the gradients
    2. Sigmoid outputs are not zero-centered
    3. exp() is a bit compute expensive

  • Slide 27

    Activation Functions

    tanh(x)

    - Squashes numbers to range [-1,1]
    - zero centered (nice)
    - still kills gradients when saturated :(

    [LeCun et al., 1991]
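
    A quick check (not from the slides) that tanh is zero-centered but still saturates: its local gradient 1 - tanh(x)^2 is 1 at x = 0 and essentially zero at |x| = 10.

    import numpy as np

    for x in (-10.0, 0.0, 10.0):
        t = np.tanh(x)
        print(x, t, 1 - t**2)    # outputs in [-1, 1]; gradient ~8.2e-09 at |x| = 10, 1.0 at 0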

  • Slide 28

    Activation Functions

    ReLU

    - Computes f(x) = max(0, x)
    - Does not saturate (in + region)
    - Very computationally efficient
    - Converges much faster than sigmoid/tanh
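
    And a matching check (not from the slides) for ReLU: the output is max(0, x) and the local gradient is 1 for any positive input, so it does not saturate in the + region (by convention it is 0 for non-positive inputs).

    import numpy as np

    for x in (-10.0, 0.0, 10.0, 100.0):
        print(x, np.maximum(0, x), float(x > 0))   # output value and local gradient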