CS 4803 / 7643: Deep Learning


Dhruv Batra

Georgia Tech

Topics:
– Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Patterns in backprop

Administrivia
• HW1 Reminder
  – Due: 09/09, 11:59pm
• HW2 out on 9/10
• Schedule: https://www.cc.gatech.edu/classes/AY2021/cs7643_fall/

• Project discussion next class

(C) Dhruv Batra 2

Recap from last time

(C) Dhruv Batra 3

How do we compute gradients?
• Analytic or "Manual" Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD
    • aka "backprop"

(C) Dhruv Batra 4
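To make the contrast concrete, here is a minimal sketch (illustrative, not from the slides) that checks a hand-derived analytic gradient against numerical differentiation by central differences:

    import numpy as np

    def f(x):
        # Example scalar function of a vector: f(x) = sum(x^2)
        return np.sum(x ** 2)

    def analytic_grad(x):
        # Hand-derived gradient of f: df/dx = 2x
        return 2 * x

    def numerical_grad(f, x, eps=1e-6):
        # Central finite differences: one pair of evaluations per input dimension
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return grad

    x = np.random.randn(5)
    print(np.allclose(analytic_grad(x), numerical_grad(f, x)))  # True, up to numerical error

Numerical differentiation costs two function evaluations per input dimension, which is why it is used as a correctness check rather than for training.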

Vector/Matrix Derivatives Notation

(C) Dhruv Batra 5

Vector/Matrix Derivatives Notation

(C) Dhruv Batra 6

Vector Derivative Example

(C) Dhruv Batra 7
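The worked example itself lives in the slide figure; a representative example of this kind of vector derivative (illustrative, not necessarily the slide's), with gradients written as column vectors:

  $\frac{\partial}{\partial x}\,(w^\top x) = w$, and $\frac{\partial}{\partial x}\,(x^\top A x) = (A + A^\top)\,x$, which reduces to $2Ax$ when $A$ is symmetric.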

Extension to Tensors

(C) Dhruv Batra 8

Chain Rule: Composite Functions

(C) Dhruv Batra 9

Chain Rule: Scalar Case

(C) Dhruv Batra 10

Chain Rule: Vector Case

(C) Dhruv Batra 11

Chain Rule: Jacobian view

(C) Dhruv Batra 12
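The chain-rule slides themselves are figures; for reference, the standard statements they build on are the scalar case and the vector (Jacobian) case:

  Scalar: if $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}$.

  Vector: if $x \in \mathbb{R}^n$, $y = g(x) \in \mathbb{R}^m$, and $z = f(y) \in \mathbb{R}^k$, then $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$, with $\frac{\partial z}{\partial y} \in \mathbb{R}^{k \times m}$ and $\frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}$,

so the composite Jacobian is the $k \times n$ matrix product of the two local Jacobians.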

Chain Rule: Graphical view

(C) Dhruv Batra 13

Chain Rule: Cascaded

(C) Dhruv Batra 14

Deep Learning = Differentiable Programming

• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing chain-rule on computation graphs

(C) Dhruv Batra 15

Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
  – Directed edges
  – No (directed) cycles
  – Underlying undirected cycles okay

(C) Dhruv Batra 16

Directed Acyclic Graphs (DAGs)
• Concept
  – Topological Ordering

(C) Dhruv Batra 17
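Since scheduling a computation graph means visiting nodes in a topological order, here is a minimal sketch of topological ordering by depth-first search (the dictionary graph representation and the node names are illustrative, not from the slides):

    def topological_order(graph):
        # graph: dict mapping node -> list of child nodes (directed edges), assumed acyclic
        order, visited = [], set()

        def visit(node):
            if node in visited:
                return
            visited.add(node)
            for child in graph.get(node, []):
                visit(child)
            order.append(node)  # append a node only after all of its descendants

        for node in graph:
            visit(node)
        return order[::-1]  # reversed post-order is a topological order

    # Illustrative graph: x1, x2 feed a multiply; x1 also feeds a sin; both feed an add
    g = {"x1": ["mul", "sin"], "x2": ["mul"], "mul": ["add"], "sin": ["add"], "add": []}
    print(topological_order(g))  # inputs come first, 'add' comes last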

Computational Graphs
• Notation

(C) Dhruv Batra 18

Deep Learning = Differentiable Programming

• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing chain-rule on computation graphs

(C) Dhruv Batra 19

20


Forward mode AD

21


Reverse mode AD

Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop

(C) Dhruv Batra 22

Example: Forward mode AD

(C) Dhruv Batra 23

[Figure: computation graph with inputs x1 and x2 and nodes *, sin( ), and +]


Example: Forward mode AD

Q: What happens if there’s another input variable x3?

(C) Dhruv Batra 28


Example: Forward mode AD

Q: What happens if there's another input variable x3?
A: more sophisticated graph; d+1 "passes" for d+1 input variables

(C) Dhruv Batra 29


Example: Forward mode AD

Q: What happens if there’s another output variable f2?

(C) Dhruv Batra 30


Example: Forward mode AD

Q: What happens if there's another output variable f2?
A: more sophisticated graph; still d "passes" for d input variables (forward mode pays per input, not per output)
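A minimal sketch of forward mode AD using (value, derivative) pairs; the expression is illustrative (the slide's exact function is only in the figure), but the mechanics are the point: one forward pass propagates the derivative with respect to one chosen input alongside the values.

    import math

    # Each intermediate quantity carries (value, d_value / d_chosen_input)
    def add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def mul(a, b):
        return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

    def sin(a):
        return (math.sin(a[0]), math.cos(a[0]) * a[1])

    def f_and_dfdx1(x1, x2):
        # Seed the pass: d x1/d x1 = 1, d x2/d x1 = 0
        v1, v2 = (x1, 1.0), (x2, 0.0)
        return add(mul(v1, v2), sin(v1))   # illustrative expression: f = x1*x2 + sin(x1)

    val, dfdx1 = f_and_dfdx1(2.0, 3.0)
    print(val, dfdx1)   # df/dx1 = x2 + cos(x1) = 3 + cos(2)

Getting df/dx2 as well would need a second pass with the seed on x2 instead, which is exactly the "one pass per input variable" cost noted above.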

Example: Reverse mode AD

(C) Dhruv Batra 31

[Figure: the same computation graph (inputs x1, x2; nodes *, sin( ), +), now annotated for the backward pass]



Gradients add at branches

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Duality in Fprop and Bprop

(C) Dhruv Batra 34

[Figure: FPROP/BPROP duality: a SUM node in the forward pass becomes a COPY in the backward pass, and a COPY (fan-out) in the forward pass becomes a SUM of gradients in the backward pass]


(C) Dhruv Batra 36

Example: Reverse mode AD


Q: What happens if there’s another input variable x3?

(C) Dhruv Batra 37

Example: Reverse mode AD


Q: What happens if there's another input variable x3?
A: more sophisticated graph; single "backward prop"

(C) Dhruv Batra 38

Example: Reverse mode AD


Q: What happens if there’s another output variable f2?

(C) Dhruv Batra 39

Example: Reverse mode AD


Q: What happens if there's another output variable f2?
A: more sophisticated graph; c "backward props" for c output variables
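For contrast, a minimal reverse mode sketch on the same illustrative expression: one forward pass stores the intermediates, then a single backward pass produces the gradient with respect to every input at once.

    import math

    def f_and_grad(x1, x2):
        # Forward pass: compute and remember intermediates (illustrative f = x1*x2 + sin(x1))
        a = x1 * x2
        b = math.sin(x1)
        f = a + b

        # Backward pass: walk the graph in reverse topological order
        df_df = 1.0
        df_da = df_df * 1.0                           # + distributes the gradient
        df_db = df_df * 1.0
        df_dx1 = df_da * x2 + df_db * math.cos(x1)    # both uses of x1 contribute, gradients add
        df_dx2 = df_da * x1                           # * passes the other operand's value
        return f, (df_dx1, df_dx2)

    print(f_and_grad(2.0, 3.0))   # gradient = (x2 + cos(x1), x1)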

Forward Pass vs Forward mode AD vs Reverse Mode AD

(C) Dhruv Batra 40

[Figure: the same computation graph drawn three times, annotated for the plain forward pass, forward mode AD, and reverse mode AD]

Forward mode vs Reverse Mode
• What are the differences?

(C) Dhruv Batra 41


Forward mode vs Reverse Mode
• What are the differences?

• Which one is faster to compute? – Forward or backward?

(C) Dhruv Batra 42

Forward mode vs Reverse Mode
• x → Graph → L
• Intuition of Jacobian

(C) Dhruv Batra 43
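Written out, the "x → Graph → L" picture is the standard Jacobian-shape argument (not transcribed from the slide, but it is the point being made). For $x \in \mathbb{R}^n$ mapped through an intermediate $y \in \mathbb{R}^m$ to a scalar loss $L$,

  $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x}$, with $\frac{\partial L}{\partial y} \in \mathbb{R}^{1 \times m}$ and $\frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}$.

Forward mode computes Jacobian-vector products, recovering one column (one input direction) per pass, so the full gradient with respect to all n inputs takes n passes; reverse mode computes vector-Jacobian products, recovering one row (one output) per pass, so a scalar loss needs only a single backward pass. With one scalar loss and millions of parameters, reverse mode is the cheap one, at the price of storing the forward intermediates.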

Forward mode vs Reverse Mode
• What are the differences?

• Which one is faster to compute? – Forward or backward?

• Which one is more memory efficient (less storage)? – Forward or backward?

(C) Dhruv Batra 44

Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop

(C) Dhruv Batra 45

(C) Dhruv Batra 46 | Figure Credit: Andrea Vedaldi

Neural Network Computation Graph

(C) Dhruv Batra 47 | Figure Credit: Andrea Vedaldi

Backprop

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato | (C) Dhruv Batra 48

Computational Graph

Key Computation: Forward-Prop

(C) Dhruv Batra 49 | Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Key Computation: Back-Prop

(C) Dhruv Batra 50 | Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
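To make the per-module picture concrete, here is a minimal sketch (the Linear class and all names are illustrative, not the course's code): forward maps inputs to outputs and caches what the backward pass will need; backward takes dL/d(output) and returns dL/d(input), while also filling in the parameter gradients.

    import numpy as np

    class Linear:
        def __init__(self, in_dim, out_dim):
            self.W = 0.01 * np.random.randn(out_dim, in_dim)
            self.b = np.zeros(out_dim)

        def forward(self, x):
            self.x = x                      # cache the input for the backward pass
            return self.W @ x + self.b

        def backward(self, dL_dout):
            # Gradients w.r.t. the parameters ...
            self.dW = np.outer(dL_dout, self.x)
            self.db = dL_dout
            # ... and w.r.t. the input, to pass further upstream
            return self.W.T @ dL_dout

    layer = Linear(3, 2)
    out = layer.forward(np.ones(3))
    din = layer.backward(np.ones(2))   # shape (3,), ready for the previous module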

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]

(C) Dhruv Batra 51 | Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]

(C) Dhruv Batra 54 | Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]
• Step 3: Use gradient to update parameters

(C) Dhruv Batra 57 | Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
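Put together, the three steps look like the following minimal sketch (a toy linear model on synthetic data; every name and number here is illustrative, not from the course code):

    import numpy as np

    # Toy data and a single linear model, just to make the three steps concrete
    X = np.random.randn(100, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
    w, lr, batch_size = np.zeros(3), 0.1, 10

    for step in range(100):
        idx = np.random.choice(len(X), batch_size, replace=False)
        xb, yb = X[idx], y[idx]

        # Step 1: compute the loss on the mini-batch (forward pass)
        pred = xb @ w
        loss = np.mean((pred - yb) ** 2)

        # Step 2: compute gradients w.r.t. the parameters (backward pass)
        dw = 2 * xb.T @ (pred - yb) / batch_size

        # Step 3: use the gradient to update the parameters (SGD step)
        w -= lr * dw

    print(w)   # approaches the true weights [1, -2, 0.5]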

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Backpropagation: a simple example

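The worked numbers live in the slide figure; a representative example of the same kind (the values here are chosen for illustration) is f(x, y, z) = (x + y) * z:

    # Forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y          # q = 3
    f = q * z          # f = -12

    # Backward pass, applying the chain rule one node at a time
    df_df = 1.0
    df_dq = z * df_df          # mul gate: gradient takes the other operand's value -> -4
    df_dz = q * df_df          #                                                   ->  3
    df_dx = 1.0 * df_dq        # add gate: distributes the gradient unchanged -> -4
    df_dy = 1.0 * df_dq        #                                              -> -4
    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0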

Patterns in backward flow

Q: What is an add gate?

add gate: gradient distributor

Q: What is a max gate?

max gate: gradient router

Q: What is a mul gate?

mul gate: gradient switcher

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
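These three patterns are easy to verify with any autodiff library; a small sketch assuming PyTorch is available (any autodiff tool would do):

    import torch

    x = torch.tensor([2.0, 3.0], requires_grad=True)
    y = torch.tensor([5.0, 1.0], requires_grad=True)

    # add gate: distributes the upstream gradient unchanged to both inputs
    (x + y).sum().backward()
    print(x.grad, y.grad)        # both all ones

    x.grad = None; y.grad = None
    # max gate: routes the gradient only to the larger input
    torch.maximum(x, y).sum().backward()
    print(x.grad, y.grad)        # tensor([0., 1.]) and tensor([1., 0.])

    x.grad = None; y.grad = None
    # mul gate: switches the gradient, scaling by the *other* operand
    (x * y).sum().backward()
    print(x.grad, y.grad)        # y's values land in x.grad and vice versa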


Gradients add at branches

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Duality in Fprop and Bprop

(C) Dhruv Batra 68

[Figure: FPROP/BPROP duality: a SUM node in the forward pass becomes a COPY in the backward pass, and a COPY (fan-out) in the forward pass becomes a SUM of gradients in the backward pass]