
Lecture 4: Backpropagation and Automatic Differentiation

CSE599W: Spring 2018

Announcement

• Assignment 1 is out today, due in 2 weeks (Apr 19th, 5pm)

Model Training Overview

(Figure: a model as a stack of feature extractors (layer 1 extractor, layer 2 extractor) feeding a predictor; training optimizes an objective over this pipeline.)

Symbolic Differentiation

• Input formula is a symbolic expression tree (computation graph).
• Implement differentiation rules, e.g., the sum rule, product rule, and chain rule:

$$\frac{\partial (f+g)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}$$

$$\frac{\partial (f \cdot g)}{\partial x} = \frac{\partial f}{\partial x}\, g + f\, \frac{\partial g}{\partial x}$$

$$\frac{\partial h(x)}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x} \quad \text{where } h(x) = f(g(x))$$

✘ For complicated functions, the resultant expression can be exponentially large.
✘ Wasteful to keep intermediate symbolic expressions around if we only need a numeric value of the gradient in the end.
✘ Prone to error.
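To see the expression-swell problem concretely, here is a tiny sketch using sympy (an assumption; the lecture itself does not use it), with an arbitrary composed function chosen only for illustration:

import sympy as sp

x = sp.symbols('x')
f = x
for _ in range(5):
    f = sp.sin(f) * sp.cos(f)               # compose a few simple operations
df = sp.diff(f, x)                          # symbolic derivative via product/chain rules
print(sp.count_ops(f), sp.count_ops(df))    # the derivative expression is much larger than f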

Numerical Differentiation

• We can approximate the gradient using

$$\frac{\partial f(\theta)}{\partial \theta_i} = \lim_{h \to 0} \frac{f(\theta + h e_i) - f(\theta)}{h} \approx \frac{f(\theta + h e_i) - f(\theta)}{h} \quad \text{for a small } h$$

(Figure: example $f(w, x) = w \cdot x$; one element of $w$ is perturbed by $\epsilon$, and the change in the output gives a numerical estimate of that element's partial derivative.)

• Reduce the truncation error by using the center difference:

$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta - h e_i)}{2h}$$

✘ Bad: rounding error, and slow to compute.
✓ A powerful tool to check the correctness of an implementation; usually use h = 1e-6.
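As a concrete illustration, here is a minimal center-difference gradient checker in numpy; the test function and tolerance below are illustrative choices, not part of the lecture:

import numpy as np

def numerical_grad(f, theta, h=1e-6):
    # center-difference approximation of df/dtheta_i for every i
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = h
        grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

# example: f(theta) = sum(theta**2) has exact gradient 2*theta
theta = np.random.randn(5)
approx = numerical_grad(lambda t: np.sum(t ** 2), theta)
assert np.allclose(approx, 2 * theta, atol=1e-4)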

Backpropagation

(Figure: a single operator $f$ with inputs $x_1$, $x_2$ and output $y = f(x_1, x_2)$.)

Given the gradient $\frac{\partial z}{\partial y}$ of the final output $z$ with respect to $y = f(x_1, x_2)$, the operator can compute the gradients of its own inputs:

$$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x_1}, \qquad \frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x_2}$$

Computing the gradient becomes a local computation.
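For instance, a multiply operator needs only its own saved inputs and the incoming gradient $\frac{\partial z}{\partial y}$ to produce $\frac{\partial z}{\partial x_1}$ and $\frac{\partial z}{\partial x_2}$. A minimal sketch (the class and method names are hypothetical, not Assignment 1's API):

class MulGate:
    # y = f(x1, x2) = x1 * x2
    def forward(self, x1, x2):
        self.x1, self.x2 = x1, x2        # stash the inputs for the backward pass
        return x1 * x2

    def backward(self, dz_dy):
        # dz/dx1 = dz/dy * dy/dx1 = dz/dy * x2, and symmetrically for x2
        return dz_dy * self.x2, dz_dy * self.x1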

Backpropagation simple example

$$y = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

(Figure: the computation graph for $y$: two $*$ nodes compute $w_0 x_0$ and $w_1 x_1$, a $+$ node sums them, a second $+$ node adds $w_2$, followed by $*(-1)$, $\exp$, $+1$, and $1/x$ nodes. The forward pass annotates every edge with its value; with the slide's inputs the two products are $-4.0$ and $3.0$, the two sums give $-1.0$ and $1.0$, and the last four nodes produce $-1.0$, $0.37$, $1.37$, and $0.73$.)

The backward pass starts from $\frac{\partial y}{\partial y} = 1.0$ at the output and pushes the gradient backward through one node at a time, multiplying by each node's local derivative.

$1/x$ node: $y(x) = 1/x \Rightarrow \frac{\partial y}{\partial x} = -\frac{1}{x^2}$, so the incoming gradient $1.0$ becomes $1.0 \cdot \left(-\frac{1}{1.37^2}\right) = -0.53$.

$+1$ node: $y(x) = x + 1 \Rightarrow \frac{\partial y}{\partial x} = 1$, so the gradient $-0.53$ passes through unchanged.

$\exp$ node: $y(x) = e^x \Rightarrow \frac{\partial y}{\partial x} = e^x$, so the gradient is multiplied by the node's own output: $-0.53 \cdot 0.37 = -0.20$.

$*(-1)$ node: the local derivative is $-1$, so the gradient flips sign to $0.20$. The $+$ nodes have local derivative $1$ for every input, so $w_2$ and both product nodes each receive $0.20$.

$*$ nodes: $y(x_1, x_2) = x_1 x_2 \Rightarrow \frac{\partial y}{\partial x_1} = x_2$ and $\frac{\partial y}{\partial x_2} = x_1$, so each input's gradient is $0.20$ times the value of the other input.

(Figure: the completed backward pass, with the gradient annotated for every input: $0.20$ for $w_2$, and $-0.40$, $0.40$, $0.60$, $0.20$ for the inputs of the two multiply nodes.)
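The whole example can be written out node by node in a few lines of Python. This is a minimal sketch; the concrete input values are only one assignment consistent with the values annotated in the figure:

import math

# y = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0, w1, x1, w2 = 2.0, -2.0, 1.0, 3.0, 2.0   # illustrative values

# forward pass, one node at a time
a = w0 * x0          # -4.0
b = w1 * x1          #  3.0
c = a + b            # -1.0
d = c + w2           #  1.0
e = -1.0 * d         # -1.0
f = math.exp(e)      #  0.37
g = f + 1.0          #  1.37
y = 1.0 / g          #  0.73

# backward pass, multiplying by each node's local derivative
dy = 1.0
dg = dy * (-1.0 / g ** 2)      # 1/x node: -0.53
df = dg * 1.0                  # +1  node: -0.53
de = df * f                    # exp node: -0.20
dd = de * -1.0                 # *-1 node:  0.20
dc, dw2 = dd, dd               # +   node passes the gradient to both inputs
da, db = dc, dc                # +   node again
dw0, dx0 = da * x0, da * w0    # *   node: -0.40, 0.40
dw1, dx1 = db * x1, db * w1    # *   node:  0.60, 0.20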

Any problem? Can we do better?

Problems of backpropagation

• You always need to keep intermediate data in memory during the forward pass in case it is needed during backpropagation.

• Lack of flexibility, e.g., computing the gradient of a gradient.

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

(Figure: the same graph for $y = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, now extended step by step with new nodes that compute the gradient.)

Instead of propagating numeric gradient values, autodiff adds gradient nodes to the graph, one per operator, using the same local rules:

• $1/x$ node: $y(x) = 1/x \Rightarrow \frac{\partial y}{\partial x} = -\frac{1}{x^2}$, so add a node computing $-\frac{1}{x^2}$.
• $+1$ node: $y(x) = x + 1 \Rightarrow \frac{\partial y}{\partial x} = 1$, so add a $*1$ (pass-through) node.
• $\exp$ node: $y(x) = e^x \Rightarrow \frac{\partial y}{\partial x} = e^x$, so add a $*$ node that reuses the existing $\exp$ node's output.
• $*$ nodes: $y(x_1, x_2) = x_1 x_2 \Rightarrow \frac{\partial y}{\partial x_1} = x_2$ and $\frac{\partial y}{\partial x_2} = x_1$, so add $*$ nodes that multiply by the other input, producing the gradient node for each input of the graph.

AutoDiff Algorithm

(Figure: the forward graph for softmax-regression training: W and x feed a matmult node, then softmax, log, an element-wise mul with the labels y_, and mean, giving the cross_entropy loss. Autodiff extends this graph with log-grad, softmax-grad, a mul by 1/batch_size, and a matmult-transpose node that outputs W_grad.)

def gradient(out):
    node_to_grad[out] = 1
    nodes = get_node_list(out)
    for node in reverse_topo_order(nodes):
        grad ← sum partial adjoints from output edges
        input_grads ← node.op.gradient(input, grad) for input in node.inputs
        add input_grads to node_to_grad
    return node_to_grad

!"

1!$

!%

!&

exp

+

×

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&

!& !% !$ !"

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&

!& !% !$ !"

!$" !%××

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$"

!& !% !$ !"

!$" !%××

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$"

!& !% !$ !"

!$" !%××

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$"

!& !% !$ !"

!$" !%××

!$$id

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$", !$$

!& !% !$ !"

!$" !%××

!$$id

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$", !$$

!& !% !$ !"

!$" !%××

!$$id

!$+

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&node_to_grad:!&: !&!%: !%!$: !$", !$$

!& !% !$ !"

!$" !%××

!$$id

!$+

!"×

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&

!$" !%

!$

!$$

!"

××

+

×

id

node_to_grad:!&: !&!%: !%!$: !$", !$$!": !"

!& !% !$ !"

AutoDiff Algorithmdef gradient(out):

node_to_grad[out] = 1

nodes = get_node_list(out)

for node in reverse_topo_order(nodes):

grad ß sum partial adjoints from output edges

input_grads ß node.op.gradient(input, grad) for input in node.inputs

add input_grads to node_to_grad

return node_to_grad

!"

1!$

!%

!&

exp

+

×

!&

!$" !%

!$

!$$

!"

××

+

×

id

node_to_grad:!&: !&!%: !%!$: !$", !$$!": !"

!& !% !$ !"
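A minimal, self-contained Python sketch of this algorithm; the Node and Op classes and every name below are illustrative, not Assignment 1's actual API:

class Node:
    """A node in the computation graph: an op applied to input nodes."""
    def __init__(self, op=None, inputs=(), name=""):
        self.op, self.inputs, self.name = op, tuple(inputs), name
    def __add__(self, other): return Node(AddOp(), (self, other))
    def __mul__(self, other): return Node(MulOp(), (self, other))

class AddOp:
    def gradient(self, node, out_grad):
        # d(a+b)/da = d(a+b)/db = 1: pass the adjoint through to both inputs
        return [out_grad, out_grad]

class MulOp:
    def gradient(self, node, out_grad):
        # d(a*b)/da = b, d(a*b)/db = a: multiply the adjoint by the other input
        a, b = node.inputs
        return [out_grad * b, out_grad * a]

class ExpOp:
    def gradient(self, node, out_grad):
        # d(exp(x))/dx = exp(x): reuse the forward node itself
        return [out_grad * node]

def exp(x):
    return Node(ExpOp(), (x,))

def topo_order(out):
    order, visited = [], set()
    def visit(n):
        if id(n) in visited:
            return
        visited.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(out)
    return order

def gradient(out):
    # map each node to the list of partial adjoints contributed by its consumers
    node_to_grad = {id(out): [Node(name="1")]}
    result = {}
    for node in reversed(topo_order(out)):
        partials = node_to_grad.get(id(node), [])
        grad = partials[0]
        for p in partials[1:]:
            grad = grad + p              # sum partial adjoints (a new + node)
        result[id(node)] = grad
        if node.op is None:              # leaf input: nothing to propagate
            continue
        for inp, g in zip(node.inputs, node.op.gradient(node, grad)):
            node_to_grad.setdefault(id(inp), []).append(g)
    return result

# The worked example: x2 = exp(x1), x3 = x2 + 1, x4 = x2 * x3
x1 = Node(name="x1")
x2 = exp(x1)
x3 = x2 + Node(name="1")
x4 = x2 * x3
grads = gradient(x4)
x1_bar = grads[id(x1)]   # a new graph computing dx4/dx1; nothing numeric has been evaluated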

Backpropagation vs AutoDiff

(Figure: the same example graph shown twice. Backpropagation (left) computes gradient values on the original graph during a backward pass; autodiff (right) adds explicit gradient nodes $\bar{x}_4, \bar{x}_3, \bar{x}_2', \bar{x}_2'', \bar{x}_2, \bar{x}_1$ to the graph.)

Recap

• Numerical differentiation
  • A tool to check the correctness of an implementation
• Backpropagation
  • Easy to understand and implement
  • Bad for memory use and schedule optimization
• Automatic differentiation
  • Generates the gradient computation for the entire computation graph
  • Better for system optimization

References

• Automatic differentiation in machine learning: a survey. https://arxiv.org/abs/1502.05767
• CS231n backpropagation: http://cs231n.github.io/optimization-2/