Vector Calculus

Marc Deisenroth
Centre for Artificial Intelligence, Department of Computer Science, University College London

@mpd37
https://deisenroth.cc

AIMS Rwanda and AIMS Ghana
March/April, 2020


Reading Material

Mathematics for Machine Learning

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book’s web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.

Cover image courtesy of Daniel Bosma / Moment / Getty Images

Cover design by Holly Johnson


Chapter 5, https://mml-book.com

Marc Deisenroth (UCL) Vector Calculus March/April, 2020 2

Regression in Machine Learning (1)

[Figure: polynomial of degree 5 fitted to data; axes $x$ and $f(x)$; legend: data, maximum likelihood estimate]

Setting: Inputs $x \in \mathbb{R}^D$, outputs/targets $y \in \mathbb{R}$

Goal: Find a function that models the relationship between $x$ and $y$ (regression/curve fitting)

Model $f$ that depends on parameters $\theta$. Examples:

Linear model: $f(x, \theta) = \theta^\top x$, $\quad x, \theta \in \mathbb{R}^D$

Neural network: $f(x, \theta) = \mathrm{NN}(x, \theta)$

Regression in Machine Learning (2)

Training data, e.g., $N$ pairs $(x_i, y_i)$ of inputs $x_i$ and observations $y_i$

Training the model means finding parameters $\theta^*$, such that $f(x_i, \theta^*) \approx y_i$

[Figure: polynomial of degree 5 fitted to data; axes $x$ and $f(x)$; legend: data, maximum likelihood estimate]

Define a loss function, e.g., $\sum_{i=1}^{N} (y_i - f(x_i, \theta))^2$, which we want to optimize

Typically: Optimization based on some form of gradient descent

Differentiation required
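The training recipe on this slide can be sketched in a few lines. The block below is a minimal illustration, assuming the linear model $f(x, \theta) = \theta^\top x$ and plain gradient descent; the toy data, step size and number of steps are assumptions chosen for the example, not values from the slides.

```python
# Minimal sketch: gradient descent on the sum-of-squares loss for a linear
# model f(x, theta) = theta^T x. Data, step size and iteration count are
# illustrative assumptions, not values from the slides.

def predict(x, theta):
    """Linear model: inner product theta^T x."""
    return sum(t * xd for t, xd in zip(theta, x))

def loss(xs, ys, theta):
    """Sum of squared errors over all training pairs."""
    return sum((y - predict(x, theta)) ** 2 for x, y in zip(xs, ys))

def loss_gradient(xs, ys, theta):
    """Gradient of the loss w.r.t. theta: -2 * sum_i (y_i - theta^T x_i) x_i."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - predict(x, theta)
        for d, xd in enumerate(x):
            grad[d] += -2.0 * residual * xd
    return grad

def gradient_descent(xs, ys, theta, step_size=0.01, num_steps=2000):
    """Repeatedly step against the gradient of the loss."""
    for _ in range(num_steps):
        grad = loss_gradient(xs, ys, theta)
        theta = [t - step_size * g for t, g in zip(theta, grad)]
    return theta

# Toy data generated from theta = [2, -1]; the first feature is a constant 1.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [4.0, 3.0, 2.0, 1.0, 0.0]
theta_star = gradient_descent(xs, ys, theta=[0.0, 0.0])
```

The gradient is written out by hand here, which is exactly the step the rest of the deck develops tools for.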


Types of Differentiation

1 Scalar differentiation: $f : \mathbb{R} \to \mathbb{R}$

$y \in \mathbb{R}$ w.r.t. $x \in \mathbb{R}$

2 Scalar differentiation of a vector: $f : \mathbb{R} \to \mathbb{R}^N$

$y \in \mathbb{R}^N$ w.r.t. $x \in \mathbb{R}$

3 Multivariate case: $f : \mathbb{R}^N \to \mathbb{R}$

$y \in \mathbb{R}$ w.r.t. vector $x \in \mathbb{R}^N$

4 Vector fields: $f : \mathbb{R}^N \to \mathbb{R}^M$

vector $y \in \mathbb{R}^M$ w.r.t. vector $x \in \mathbb{R}^N$

5 General derivatives: $f : \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$

matrix $y \in \mathbb{R}^{P \times Q}$ w.r.t. matrix $X \in \mathbb{R}^{M \times N}$

Scalar Differentiation $f : \mathbb{R} \to \mathbb{R}$

Derivative defined as the limit of the difference quotient

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

Slope of the secant line through $f(x)$ and $f(x + h)$
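The limit above can be watched numerically. This is a small sketch, assuming $f = \sin$ and the evaluation point $x = 1$ as illustrative choices: the secant slope is computed for shrinking $h$ and compared with the known derivative $\cos(1)$.

```python
import math

def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

# As h shrinks, the secant slope of sin at x0 approaches cos(x0).
x0 = 1.0
slopes = [difference_quotient(math.sin, x0, h) for h in (1e-1, 1e-3, 1e-5)]
errors = [abs(slope - math.cos(x0)) for slope in slopes]
```

For very small $h$ floating-point rounding eventually dominates, so the sketch stops at $h = 10^{-5}$.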

Some Examples

$$\begin{aligned}
f(x) &= x^n & f'(x) &= n x^{n-1} \\
f(x) &= \sin(x) & f'(x) &= \cos(x) \\
f(x) &= \tanh(x) & f'(x) &= 1 - \tanh^2(x) \\
f(x) &= \exp(x) & f'(x) &= \exp(x) \\
f(x) &= \log(x) & f'(x) &= \frac{1}{x}
\end{aligned}$$

Differentiation Rules

Sum Rule

$$\big(f(x) + g(x)\big)' = f'(x) + g'(x) = \frac{df(x)}{dx} + \frac{dg(x)}{dx}$$

Product Rule

$$\big(f(x) g(x)\big)' = f'(x) g(x) + f(x) g'(x) = \frac{df(x)}{dx} g(x) + f(x) \frac{dg(x)}{dx}$$

Chain Rule

$$(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x)) f'(x) = \frac{dg(f(x))}{df} \frac{df(x)}{dx}$$

Quotient Rule

$$\left(\frac{f(x)}{g(x)}\right)' = \frac{f(x)' g(x) - f(x) g(x)'}{(g(x))^2} = \frac{\frac{df}{dx} g(x) - f(x) \frac{dg}{dx}}{(g(x))^2}$$
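A quick numerical sanity check of the product, quotient and chain rules, offered as a sketch rather than a proof. The choices $f = \sin$, $g = \exp$, the point $x_0 = 0.7$ and the helper `central_difference` are assumptions made for this example.

```python
import math

def central_difference(func, x, h=1e-6):
    """Symmetric finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

f, df = math.sin, math.cos   # f and its known derivative
g, dg = math.exp, math.exp   # g and its known derivative

x0 = 0.7
product_rule = df(x0) * g(x0) + f(x0) * dg(x0)
quotient_rule = (df(x0) * g(x0) - f(x0) * dg(x0)) / g(x0) ** 2
chain_rule = dg(f(x0)) * df(x0)  # derivative of g(f(x))

product_numeric = central_difference(lambda x: f(x) * g(x), x0)
quotient_numeric = central_difference(lambda x: f(x) / g(x), x0)
chain_numeric = central_difference(lambda x: g(f(x)), x0)
```

Each rule-based value should agree with its finite-difference counterpart up to the discretization and rounding error of the estimate.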


Example: Scalar Chain Rule

$$(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x)) f'(x) = \frac{dg}{df} \frac{df}{dx}$$

Beginner

$$g(z) = 6z + 3, \qquad z = f(x) = -2x + 5$$

$$(g \circ f)'(x) = \underbrace{(6)}_{dg/df} \underbrace{(-2)}_{df/dx} = -12$$

Advanced

$$g(z) = \tanh(z), \qquad z = f(x) = x^n$$

$$(g \circ f)'(x) = \underbrace{\big(1 - \tanh^2(x^n)\big)}_{dg/df} \underbrace{n x^{n-1}}_{df/dx}$$

Work it out with your neighbors
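Both chain-rule results can be checked against finite differences. The sketch below assumes the evaluation point $x_0 = 0.8$ and exponent $n = 3$ purely for illustration; the function names are hypothetical helpers, not part of the slides.

```python
import math

def beginner_derivative():
    """Chain rule for g(z) = 6z + 3 with z = f(x) = -2x + 5."""
    dg_df = 6.0    # derivative of g w.r.t. z
    df_dx = -2.0   # derivative of f w.r.t. x
    return dg_df * df_dx

def advanced(x, n):
    """Composite g(f(x)) = tanh(x^n)."""
    return math.tanh(x ** n)

def advanced_derivative(x, n):
    """Chain rule: (1 - tanh^2(x^n)) * n * x^(n - 1)."""
    dg_df = 1.0 - math.tanh(x ** n) ** 2
    df_dx = n * x ** (n - 1)
    return dg_df * df_dx

def central_difference(func, x, h=1e-6):
    """Symmetric finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

x0, n = 0.8, 3
numeric_beginner = central_difference(lambda x: 6.0 * (-2.0 * x + 5.0) + 3.0, x0)
numeric_advanced = central_difference(lambda x: advanced(x, n), x0)
```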


Scalar Differentiation $f : \mathbb{R} \to \mathbb{R}^N$

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_N(x) \end{bmatrix} \in \mathbb{R}^N, \quad x \in \mathbb{R}$$

Here, $f_n$ are different functions, e.g.,

$$f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ \sin(x) \end{bmatrix}$$

Differentiation: Compute derivatives of each $f_n$:

$$\frac{df}{dx} = \begin{bmatrix} \frac{df_1}{dx} \\ \vdots \\ \frac{df_N}{dx} \end{bmatrix} \in \mathbb{R}^N$$

Derivative of a (column) vector w.r.t. a scalar input is a (column) vector
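For the two-component example on this slide, differentiating componentwise gives $[2x, \cos(x)]^\top$. A minimal sketch, assuming Python lists stand in for column vectors and $x_0 = 1.2$ as an arbitrary test point:

```python
import math

def f(x):
    """Vector-valued function of a scalar: [x^2, sin(x)]."""
    return [x ** 2, math.sin(x)]

def df_dx(x):
    """Componentwise derivative: [2x, cos(x)], same length as f(x)."""
    return [2.0 * x, math.cos(x)]

def numeric_df_dx(x, h=1e-6):
    """Central difference applied to every component of f."""
    plus, minus = f(x + h), f(x - h)
    return [(p - m) / (2.0 * h) for p, m in zip(plus, minus)]

x0 = 1.2
analytic = df_dx(x0)
numeric = numeric_df_dx(x0)
```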

Multivariate Differentiation $f : \mathbb{R}^N \to \mathbb{R}$

$$y = f(x), \quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \in \mathbb{R}^N$$

Partial derivative (change one coordinate at a time):

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_N) - f(x)}{h}$$

Jacobian vector (gradient) collects all partial derivatives:

$$\frac{df}{dx} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_N} \end{bmatrix} \in \mathbb{R}^{1 \times N}$$

Note: This is a row vector.


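Example: Multivariate Differentiation

Partial derivatives? Gradient? Work it out with your neighbors

Beginner

$$f : \mathbb{R}^2 \to \mathbb{R}, \qquad f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$$

Partial derivatives:

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2$$

Gradient:

$$\frac{df}{dx} = \begin{bmatrix} \frac{\partial f(x_1, x_2)}{\partial x_1} & \frac{\partial f(x_1, x_2)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2 x_1 x_2 + x_2^3 & x_1^2 + 3 x_1 x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 2}$$

Advanced

$$f : \mathbb{R}^2 \to \mathbb{R}, \qquad f(x_1, x_2) = (x_1 + 2 x_2^3)^2 \in \mathbb{R}$$

Partial derivatives:

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 (x_1 + 2 x_2^3) \overbrace{(1)}^{\frac{\partial}{\partial x_1}(x_1 + 2 x_2^3)}, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = 2 (x_1 + 2 x_2^3) \underbrace{(6 x_2^2)}_{\frac{\partial}{\partial x_2}(x_1 + 2 x_2^3)}$$

Gradient:

$$\frac{df}{dx} = \begin{bmatrix} \frac{\partial f(x_1, x_2)}{\partial x_1} & \frac{\partial f(x_1, x_2)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2 (x_1 + 2 x_2^3) & 12 (x_1 + 2 x_2^3) x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 2}$$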
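The two gradients can be verified by perturbing one coordinate at a time, mirroring the definition of the partial derivative. This is a sketch under assumptions: the test point $[1.5, -0.5]$ and the helper `numerical_gradient` are illustrative, and a list stands in for the $1 \times 2$ row vector.

```python
def numerical_gradient(f, x, h=1e-6):
    """Row vector of central-difference partial derivatives of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h     # change one coordinate at a time
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

def beginner(x):
    x1, x2 = x
    return x1 ** 2 * x2 + x1 * x2 ** 3

def beginner_gradient(x):
    x1, x2 = x
    return [2.0 * x1 * x2 + x2 ** 3, x1 ** 2 + 3.0 * x1 * x2 ** 2]

def advanced(x):
    x1, x2 = x
    return (x1 + 2.0 * x2 ** 3) ** 2

def advanced_gradient(x):
    x1, x2 = x
    inner = x1 + 2.0 * x2 ** 3
    return [2.0 * inner, 12.0 * inner * x2 ** 2]

x0 = [1.5, -0.5]
beginner_error = max(abs(a - b) for a, b in
                     zip(beginner_gradient(x0), numerical_gradient(beginner, x0)))
advanced_error = max(abs(a - b) for a, b in
                     zip(advanced_gradient(x0), numerical_gradient(advanced, x0)))
```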


Example: Multivariate Chain Rule

Consider the function

$$L(e) = \tfrac{1}{2} \|e\|^2 = \tfrac{1}{2} e^\top e$$

$$e = y - Ax, \quad x \in \mathbb{R}^N, \quad A \in \mathbb{R}^{M \times N}, \quad e, y \in \mathbb{R}^M$$

Compute the gradient $\frac{dL}{dx}$. What is the dimension/size of $\frac{dL}{dx}$?

Work it out with your neighbors

$$\frac{dL}{dx} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial x}$$

$$\frac{\partial L}{\partial e} = e^\top \in \mathbb{R}^{1 \times M} \tag{1}$$

$$\frac{\partial e}{\partial x} = -A \in \mathbb{R}^{M \times N} \tag{2}$$

$$\implies \frac{dL}{dx} = e^\top (-A) = -(y - Ax)^\top A \in \mathbb{R}^{1 \times N}$$
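The result $\frac{dL}{dx} = -(y - Ax)^\top A$ can be checked on a small instance. The sketch below assumes $M = 3$, $N = 2$ and made-up values for $A$, $x$ and $y$; nested lists stand in for matrices.

```python
def mat_vec(A, x):
    """Matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * xn for a, xn in zip(row, x)) for row in A]

def loss(A, x, y):
    """L = 0.5 * ||y - A x||^2."""
    e = [ym - zm for ym, zm in zip(y, mat_vec(A, x))]
    return 0.5 * sum(em ** 2 for em in e)

def loss_gradient(A, x, y):
    """Row vector dL/dx = -(y - A x)^T A with N entries."""
    e = [ym - zm for ym, zm in zip(y, mat_vec(A, x))]
    return [-sum(e[m] * A[m][n] for m in range(len(A))) for n in range(len(x))]

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = []
    for n in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[n] += h
        x_minus[n] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

A = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # M = 3, N = 2
x0 = [0.5, -1.0]
y = [1.0, 2.0, 3.0]
analytic = loss_gradient(A, x0, y)
numeric = numerical_gradient(lambda x: loss(A, x, y), x0)
```

The gradient has $N = 2$ entries, matching the $1 \times N$ dimension derived on the slide.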


Vector Field Differentiation $f : \mathbb{R}^N \to \mathbb{R}^M$

$$y = f(x) \in \mathbb{R}^M, \quad x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} f_1(x_1, \dots, x_N) \\ \vdots \\ f_M(x_1, \dots, x_N) \end{bmatrix}$$

Jacobian matrix (collection of all partial derivatives)

$$\begin{bmatrix} \frac{dy_1}{dx} \\ \vdots \\ \frac{dy_M}{dx} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} \in \mathbb{R}^{M \times N}$$


Example: Vector Field Differentiation

$$f(x) = Ax, \quad f(x) \in \mathbb{R}^M, \quad A \in \mathbb{R}^{M \times N}, \quad x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11} x_1 + A_{12} x_2 + \cdots + A_{1N} x_N \\ \vdots \\ A_{M1} x_1 + A_{M2} x_2 + \cdots + A_{MN} x_N \end{bmatrix}$$

Compute the gradient $\frac{df}{dx}$

Dimension of $\frac{df}{dx}$: Since $f : \mathbb{R}^N \to \mathbb{R}^M$ it follows that $\frac{df}{dx} \in \mathbb{R}^{M \times N}$

Gradient:

$$f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij}$$

$$\implies \frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in \mathbb{R}^{M \times N}$$
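That the Jacobian of $f(x) = Ax$ is $A$ itself can be confirmed with a finite-difference Jacobian. A minimal sketch, assuming a made-up $3 \times 2$ matrix and a hypothetical helper `numerical_jacobian`; rows index outputs and columns index inputs, as on the slide.

```python
def linear_map(A, x):
    """f(x) = A x for a list-of-rows matrix A."""
    return [sum(a * xn for a, xn in zip(row, x)) for row in A]

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian; entry [m][n] approximates d f_m / d x_n."""
    num_outputs = len(f(x))
    jacobian = [[0.0] * len(x) for _ in range(num_outputs)]
    for n in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[n] += h
        x_minus[n] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for m in range(num_outputs):
            jacobian[m][n] = (f_plus[m] - f_minus[m]) / (2.0 * h)
    return jacobian

A = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # M = 3 outputs, N = 2 inputs
x0 = [0.3, -0.7]
J = numerical_jacobian(lambda x: linear_map(A, x), x0)
```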


Dimensionality of the Gradient

In general: A function $f : \mathbb{R}^N \to \mathbb{R}^M$ has a gradient that is an $M \times N$ matrix with

$$\frac{df}{dx} \in \mathbb{R}^{M \times N}, \qquad df[m, n] = \frac{\partial f_m}{\partial x_n}$$

Gradient dimension: # target dimensions × # input dimensions

Chain Rule

$$\frac{\partial}{\partial x} (g \circ f)(x) = \frac{\partial}{\partial x} \big(g(f(x))\big) = \frac{\partial g(f)}{\partial f} \frac{\partial f(x)}{\partial x}$$

Example: Chain Rule

Consider $f : \mathbb{R}^2 \to \mathbb{R}$, $x : \mathbb{R} \to \mathbb{R}^2$

$$f(x) = f(x_1, x_2) = x_1^2 + 2 x_2, \qquad x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} = \begin{bmatrix} \sin(t) \\ \cos(t) \end{bmatrix}$$

What are the dimensions of $\frac{df}{dx}$ and $\frac{dx}{dt}$?

Work it out with your neighbors

$1 \times 2$ and $2 \times 1$

Compute the gradient $\frac{df}{dt}$ using the chain rule:

$$\frac{df}{dt} = \frac{df}{dx} \frac{dx}{dt} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix} = \begin{bmatrix} 2 \sin t & 2 \end{bmatrix} \begin{bmatrix} \cos t \\ -\sin t \end{bmatrix}$$

$$= 2 \sin t \cos t - 2 \sin t = 2 \sin t (\cos t - 1)$$
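The row-times-column product in this example can be reproduced directly. The sketch below assumes $t_0 = 0.3$ as an arbitrary evaluation point and compares the chain-rule product with the closed form and with a finite-difference estimate.

```python
import math

def f(x):
    """f(x1, x2) = x1^2 + 2 x2."""
    return x[0] ** 2 + 2.0 * x[1]

def x_of_t(t):
    """x(t) = [sin(t), cos(t)]."""
    return [math.sin(t), math.cos(t)]

def df_dx(x):
    """1 x 2 row vector of partial derivatives of f."""
    return [2.0 * x[0], 2.0]

def dx_dt(t):
    """2 x 1 column vector of derivatives of x(t)."""
    return [math.cos(t), -math.sin(t)]

def df_dt(t):
    """Chain rule: (1 x 2) times (2 x 1) gives a scalar."""
    return sum(r * c for r, c in zip(df_dx(x_of_t(t)), dx_dt(t)))

t0 = 0.3
h = 1e-6
closed_form = 2.0 * math.sin(t0) * (math.cos(t0) - 1.0)
numeric = (f(x_of_t(t0 + h)) - f(x_of_t(t0 - h))) / (2.0 * h)
```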


Derivatives with Respect to Matrices

Recall: A function $f : \mathbb{R}^N \to \mathbb{R}^M$ has a gradient that is an $M \times N$ matrix with

$$\frac{df}{dx} \in \mathbb{R}^{M \times N}, \qquad df[m, n] = \frac{\partial f_m}{\partial x_n}$$

Gradient dimension: # target dimensions × # input dimensions

This generalizes to when the inputs ($N$) or targets ($M$) are matrices

Function $f : \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$ has a gradient that is a $(P \times Q) \times (M \times N)$ object (tensor)

$$\frac{df}{dX} \in \mathbb{R}^{(P \times Q) \times (M \times N)}, \qquad df[p, q, m, n] = \frac{\partial f_{pq}}{\partial X_{mn}}$$


Example 1: Derivatives with Respect to Matrices

$$f = Ax, \quad f \in \mathbb{R}^M, \quad A \in \mathbb{R}^{M \times N}, \quad x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11} x_1 + A_{12} x_2 + \cdots + A_{1N} x_N \\ \vdots \\ A_{M1} x_1 + A_{M2} x_2 + \cdots + A_{MN} x_N \end{bmatrix}$$

$$\frac{df}{dA} \in \mathbb{R}^{?}$$

# target dim × # input dim $= M \times (M \times N)$

$$\frac{df}{dA} = \begin{bmatrix} \frac{\partial f_1}{\partial A} \\ \vdots \\ \frac{\partial f_M}{\partial A} \end{bmatrix}, \qquad \frac{\partial f_m}{\partial A} \in \mathbb{R}^{1 \times (M \times N)}$$


Example 2: Derivatives w.r.t. Matrices

$$f_m = \sum_{n=1}^{N} A_{mn} x_n, \quad m = 1, \dots, M$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_i(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11} x_1 + A_{12} x_2 + \cdots + A_{1N} x_N \\ \vdots \\ A_{i1} x_1 + A_{i2} x_2 + \cdots + A_{iN} x_N \\ \vdots \\ A_{M1} x_1 + A_{M2} x_2 + \cdots + A_{MN} x_N \end{bmatrix}$$

$$\frac{\partial f_m}{\partial A_{mq}} = x_q \in \mathbb{R}$$

$$\frac{\partial f_m}{\partial A_{m,:}} = x^\top \in \mathbb{R}^{1 \times 1 \times N}$$

$$\frac{\partial f_m}{\partial A_{k \neq m,:}} = 0^\top \in \mathbb{R}^{1 \times 1 \times N}$$

$$\frac{\partial f_m}{\partial A} = \begin{bmatrix} 0^\top \\ \vdots \\ x^\top \\ \vdots \\ 0^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)}$$
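The block structure of $\frac{\partial f_m}{\partial A}$ (the row $x^\top$ in position $m$, zeros elsewhere) can be built explicitly and compared with finite differences. This is a sketch with assumed sizes $M = 2$, $N = 3$ and made-up values; nested lists stand in for the $M \times (M \times N)$ tensor.

```python
def mat_vec(A, x):
    """f = A x for a list-of-rows matrix A."""
    return [sum(a * xn for a, xn in zip(row, x)) for row in A]

def df_dA(x, num_rows):
    """Tensor with entry [m][k][n] = d f_m / d A_kn: x_n if k == m, else 0."""
    return [[[xn if k == m else 0.0 for xn in x] for k in range(num_rows)]
            for m in range(num_rows)]

def numeric_df_dA(A, x, h=1e-6):
    """Central differences, perturbing one matrix entry A_kn at a time."""
    M, N = len(A), len(x)
    grad = [[[0.0] * N for _ in range(M)] for _ in range(M)]
    for k in range(M):
        for n in range(N):
            A_plus = [row[:] for row in A]
            A_minus = [row[:] for row in A]
            A_plus[k][n] += h
            A_minus[k][n] -= h
            f_plus, f_minus = mat_vec(A_plus, x), mat_vec(A_minus, x)
            for m in range(M):
                grad[m][k][n] = (f_plus[m] - f_minus[m]) / (2.0 * h)
    return grad

A = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0]]   # M = 2, N = 3
x = [0.2, -0.4, 1.5]
analytic = df_dA(x, num_rows=2)
numeric = numeric_df_dA(A, x)
max_error = max(abs(analytic[m][k][n] - numeric[m][k][n])
                for m in range(2) for k in range(2) for n in range(3))
```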


Gradient Computation: Two Alternatives

Consider $f : \mathbb{R}^3 \to \mathbb{R}^{4 \times 2}$, $f(x) = A \in \mathbb{R}^{4 \times 2}$ where the entries $A_{ij}$ depend on a vector $x \in \mathbb{R}^3$

We can compute $\frac{dA(x)}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$ in two equivalent ways:

Partial derivatives: compute $\frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \in \mathbb{R}^{4 \times 2}$ and collate them into $\frac{dA}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$

Re-shaping: re-shape $A \in \mathbb{R}^{4 \times 2}$ into a vector $A \in \mathbb{R}^8$, compute the gradient $\frac{dA}{dx} \in \mathbb{R}^{8 \times 3}$, and re-shape it into $\frac{dA}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$
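The equivalence of the two routes comes down to index bookkeeping, which a small sketch can make concrete. Everything specific here is an assumption for illustration: the matrix-valued function `A_of_x`, the row-major flattening order, the test point, and the use of central differences in place of analytic derivatives.

```python
def A_of_x(x):
    """Hypothetical 4 x 2 matrix whose entries depend on x in R^3."""
    return [[(i + 1) * x[0] + (j + 1) * x[1] * x[2] for j in range(2)]
            for i in range(4)]

def perturbed(x, k, delta):
    """Copy of x with delta added to coordinate k."""
    shifted = list(x)
    shifted[k] += delta
    return shifted

def gradient_by_collating(x, h=1e-6):
    """Approach 1: one 4 x 2 partial-derivative matrix per input, collated."""
    partials = []
    for k in range(3):
        A_plus, A_minus = A_of_x(perturbed(x, k, h)), A_of_x(perturbed(x, k, -h))
        partials.append([[(A_plus[i][j] - A_minus[i][j]) / (2.0 * h)
                          for j in range(2)] for i in range(4)])
    return [[[partials[k][i][j] for k in range(3)] for j in range(2)]
            for i in range(4)]

def flatten(A):
    """Re-shape a 4 x 2 matrix into a vector with 8 entries (row-major)."""
    return [entry for row in A for entry in row]

def gradient_by_reshaping(x, h=1e-6):
    """Approach 2: flatten A, build the 8 x 3 Jacobian, re-shape to 4 x 2 x 3."""
    jacobian = [[0.0] * 3 for _ in range(8)]
    for k in range(3):
        a_plus = flatten(A_of_x(perturbed(x, k, h)))
        a_minus = flatten(A_of_x(perturbed(x, k, -h)))
        for r in range(8):
            jacobian[r][k] = (a_plus[r] - a_minus[r]) / (2.0 * h)
    return [[jacobian[2 * i + j] for j in range(2)] for i in range(4)]

x0 = [0.5, -1.0, 2.0]
tensor_a = gradient_by_collating(x0)
tensor_b = gradient_by_reshaping(x0)
```

Both functions return a nested list indexed as `[i][j][k]`, i.e. a $4 \times 2 \times 3$ object; the second route is often convenient because ordinary Jacobian code for vector-valued functions can be reused.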


Page 47: Vector Calculus - deisenroth.cc · vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science

Gradients of a Single-Layer Neural Network

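[Diagram: $\boldsymbol x \to \boldsymbol z \to \boldsymbol f$, where the parameters $\boldsymbol A, \boldsymbol b$ enter at $\boldsymbol z$ and the activation $\sigma$ maps $\boldsymbol z$ to $\boldsymbol f$]

$$
\boldsymbol f = \tanh(\underbrace{\boldsymbol A\boldsymbol x + \boldsymbol b}_{=:\,\boldsymbol z \in \mathbb{R}^M}) \in \mathbb{R}^M, \quad \boldsymbol x \in \mathbb{R}^N, \quad \boldsymbol A \in \mathbb{R}^{M\times N}, \quad \boldsymbol b \in \mathbb{R}^M
$$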


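$$
\frac{\partial \boldsymbol f}{\partial \boldsymbol b}
= \underbrace{\frac{\partial \boldsymbol f}{\partial \boldsymbol z}}_{M\times M}\,
  \underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol b}}_{M\times M}
\in \mathbb{R}^{M\times M},
\qquad
\frac{\partial \boldsymbol f}{\partial \boldsymbol b}[i, j]
= \sum_{l=1}^{M} \frac{\partial \boldsymbol f}{\partial \boldsymbol z}[i, l]\,
  \frac{\partial \boldsymbol z}{\partial \boldsymbol b}[l, j]
$$

$$
\frac{\partial \boldsymbol f}{\partial \boldsymbol A}
= \underbrace{\frac{\partial \boldsymbol f}{\partial \boldsymbol z}}_{M\times M}\,
  \underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol A}}_{M\times (M\times N)}
\in \mathbb{R}^{M\times (M\times N)},
\qquad
\frac{\partial \boldsymbol f}{\partial \boldsymbol A}[i, j, k]
= \sum_{l=1}^{M} \frac{\partial \boldsymbol f}{\partial \boldsymbol z}[i, l]\,
  \frac{\partial \boldsymbol z}{\partial \boldsymbol A}[l, j, k]
$$

$$
\underbrace{\frac{\partial \boldsymbol f}{\partial \boldsymbol z} = \operatorname{diag}\big(1 - \tanh^2(\boldsymbol z)\big)}_{\in\, \mathbb{R}^{M\times M}},
\qquad
\underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol b} = \boldsymbol I}_{\in\, \mathbb{R}^{M\times M}}
$$

$$
\underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol A} =
\begin{bmatrix}
\boldsymbol x^\top & \cdots & \boldsymbol 0^\top & \cdots & \boldsymbol 0^\top \\
\vdots & \ddots & \vdots & & \vdots \\
\boldsymbol 0^\top & \cdots & \boldsymbol x^\top & \cdots & \boldsymbol 0^\top \\
\vdots & & \vdots & \ddots & \vdots \\
\boldsymbol 0^\top & \cdots & \boldsymbol 0^\top & \cdots & \boldsymbol x^\top
\end{bmatrix}}_{\in\, \mathbb{R}^{M\times (M\times N)}}
$$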
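The index sums above map directly onto tensor contractions. A minimal NumPy sketch follows, with sizes `M = 3`, `N = 2` and random values chosen here purely as assumptions for illustration; `np.einsum` is used to spell out the sum over `l`.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 2
A = rng.normal(size=(M, N))
b = rng.normal(size=M)
x = rng.normal(size=N)

z = A @ x + b
f = np.tanh(z)

# Local derivatives with the dimensions stated on the slide.
df_dz = np.diag(1.0 - np.tanh(z) ** 2)  # (M, M)
dz_db = np.eye(M)                        # (M, M)
dz_dA = np.zeros((M, M, N))              # M x (M x N)
for l in range(M):
    dz_dA[l, l, :] = x                   # row l of slice l is x^T, zeros elsewhere

# Chain rule: sum over the shared index l.
df_db = df_dz @ dz_db                            # (M, M)
df_dA = np.einsum("il,ljk->ijk", df_dz, dz_dA)   # (M, M, N)

print(df_db.shape, df_dA.shape)
```

Only the slices `df_dA[i, i, :]` are non-zero, and each equals $(1 - \tanh^2(z_i))\,\boldsymbol x^\top$, which is why practical implementations rarely materialise the full tensor.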


Putting Things Together

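Inputs $\boldsymbol x \in \mathbb{R}^N$

Observed outputs $\boldsymbol y = f_{\boldsymbol\theta}(\boldsymbol z) + \boldsymbol\varepsilon \in \mathbb{R}^M$, $\boldsymbol\varepsilon \sim \mathcal{N}(\boldsymbol 0, \boldsymbol\Sigma)$

Train single-layer neural network with

$$
f_{\boldsymbol\theta}(\boldsymbol z) = \tanh(\boldsymbol z) \in \mathbb{R}^M, \quad \boldsymbol z = \boldsymbol A\boldsymbol x + \boldsymbol b \in \mathbb{R}^M, \quad \boldsymbol\theta = \{\boldsymbol A, \boldsymbol b\}
$$

Find $\boldsymbol A, \boldsymbol b$ such that the squared loss

$$
L(\boldsymbol\theta) = \tfrac{1}{2}\|\boldsymbol e\|^2 \in \mathbb{R}, \quad \boldsymbol e = \boldsymbol y - f_{\boldsymbol\theta}(\boldsymbol z) \in \mathbb{R}^M
$$

is minimized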


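Partial derivatives:

$$
\frac{\partial L}{\partial \boldsymbol A}
= \frac{\partial L}{\partial \boldsymbol e}\,
  \frac{\partial \boldsymbol e}{\partial \boldsymbol f}\,
  \frac{\partial \boldsymbol f}{\partial \boldsymbol z}\,
  \frac{\partial \boldsymbol z}{\partial \boldsymbol A},
\qquad
\frac{\partial L}{\partial \boldsymbol b}
= \frac{\partial L}{\partial \boldsymbol e}\,
  \frac{\partial \boldsymbol e}{\partial \boldsymbol f}\,
  \frac{\partial \boldsymbol f}{\partial \boldsymbol z}\,
  \frac{\partial \boldsymbol z}{\partial \boldsymbol b}
$$

$$
\underbrace{\frac{\partial L}{\partial \boldsymbol e} = \boldsymbol e^\top}_{\in\, \mathbb{R}^{1\times M}},
\qquad
\underbrace{\frac{\partial \boldsymbol e}{\partial \boldsymbol f} = -\boldsymbol I}_{\in\, \mathbb{R}^{M\times M}},
\qquad
\underbrace{\frac{\partial \boldsymbol f}{\partial \boldsymbol z} = \operatorname{diag}\big(1 - \tanh^2(\boldsymbol z)\big)}_{\in\, \mathbb{R}^{M\times M}}
$$

$$
\underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol A} =
\begin{bmatrix}
\boldsymbol x^\top & \cdots & \boldsymbol 0^\top & \cdots & \boldsymbol 0^\top \\
\vdots & \ddots & \vdots & & \vdots \\
\boldsymbol 0^\top & \cdots & \boldsymbol x^\top & \cdots & \boldsymbol 0^\top \\
\vdots & & \vdots & \ddots & \vdots \\
\boldsymbol 0^\top & \cdots & \boldsymbol 0^\top & \cdots & \boldsymbol x^\top
\end{bmatrix}}_{\in\, \mathbb{R}^{M\times (M\times N)}},
\qquad
\underbrace{\frac{\partial \boldsymbol z}{\partial \boldsymbol b} = \boldsymbol I}_{\in\, \mathbb{R}^{M\times M}}
$$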
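Multiplying these factors out gives the loss gradients in closed form. The sketch below is one possible NumPy rendering under stated assumptions: the function names `loss` and `loss_gradients`, the sizes, and the random data are all invented for illustration, and the contraction with the sparse tensor $\partial \boldsymbol z/\partial \boldsymbol A$ is written as an outer product instead of building the tensor. A central finite-difference check is included.

```python
import numpy as np

def loss(A, b, x, y):
    """Squared loss L = 0.5 * ||y - tanh(A x + b)||^2."""
    e = y - np.tanh(A @ x + b)
    return 0.5 * e @ e

def loss_gradients(A, b, x, y):
    """Chain-rule gradients of L with respect to A and b (hypothetical helper)."""
    f = np.tanh(A @ x + b)
    e = y - f
    dL_de = e[None, :]               # e^T, shape (1, M)
    de_df = -np.eye(f.size)          # -I, shape (M, M)
    df_dz = np.diag(1.0 - f ** 2)    # diag(1 - tanh^2(z)), shape (M, M)
    dL_dz = dL_de @ de_df @ df_dz    # shape (1, M)
    dL_db = dL_dz @ np.eye(f.size)   # dz/db = I, shape (1, M)
    # Contracting dL/dz with the sparse tensor dz/dA leaves an outer product.
    dL_dA = np.outer(dL_dz[0], x)    # shape (M, N), i.e. 1 x (M x N)
    return dL_dA, dL_db[0]

rng = np.random.default_rng(2)
M, N = 3, 2
A, b = rng.normal(size=(M, N)), rng.normal(size=M)
x, y = rng.normal(size=N), rng.normal(size=M)
dL_dA, dL_db = loss_gradients(A, b, x, y)

# Central finite-difference check of both gradients.
h = 1e-6
num_dA = np.zeros_like(A)
for j in range(M):
    for k in range(N):
        step = np.zeros_like(A)
        step[j, k] = h
        num_dA[j, k] = (loss(A + step, b, x, y) - loss(A - step, b, x, y)) / (2 * h)
num_db = np.zeros_like(b)
for j in range(M):
    step = np.zeros_like(b)
    step[j] = h
    num_db[j] = (loss(A, b + step, x, y) - loss(A, b - step, x, y)) / (2 * h)

print(np.max(np.abs(dL_dA - num_dA)), np.max(np.abs(dL_db - num_db)))
```

The shapes follow the dimension bookkeeping on the slide: $\partial L/\partial \boldsymbol b$ is $1\times M$ and $\partial L/\partial \boldsymbol A$ is $1\times(M\times N)$, stored here as an `(M, N)` array.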


Gradients of a Multi-Layer Neural Network

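[Diagram: $\boldsymbol x \to \boldsymbol f_1 \to \cdots \to \boldsymbol f_{K-1} \to \boldsymbol f_K \to L$, with parameters $(\boldsymbol A_0, \boldsymbol b_0), (\boldsymbol A_1, \boldsymbol b_1), \dots, (\boldsymbol A_{K-2}, \boldsymbol b_{K-2}), (\boldsymbol A_{K-1}, \boldsymbol b_{K-1})$ entering at successive layers]

Inputs $\boldsymbol x$, observed outputs $\boldsymbol y$

Train multi-layer neural network with

$$
\boldsymbol f_0 = \boldsymbol x, \qquad
\boldsymbol f_i = \sigma_i(\boldsymbol A_{i-1}\boldsymbol f_{i-1} + \boldsymbol b_{i-1}), \quad i = 1, \dots, K
$$

Find $\boldsymbol A_j, \boldsymbol b_j$ for $j = 0, \dots, K-1$, such that the squared loss

$$
L(\boldsymbol\theta) = \|\boldsymbol y - \boldsymbol f_{K,\boldsymbol\theta}(\boldsymbol x)\|^2
$$

is minimized, where $\boldsymbol\theta = \{\boldsymbol A_j, \boldsymbol b_j\}$, $j = 0, \dots, K-1$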


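$$
\begin{aligned}
\frac{\partial L}{\partial \boldsymbol\theta_{K-1}} &= \frac{\partial L}{\partial \boldsymbol f_K}\,\frac{\partial \boldsymbol f_K}{\partial \boldsymbol\theta_{K-1}} \\
\frac{\partial L}{\partial \boldsymbol\theta_{K-2}} &= \frac{\partial L}{\partial \boldsymbol f_K}\,\frac{\partial \boldsymbol f_K}{\partial \boldsymbol f_{K-1}}\,\frac{\partial \boldsymbol f_{K-1}}{\partial \boldsymbol\theta_{K-2}} \\
\frac{\partial L}{\partial \boldsymbol\theta_{K-3}} &= \frac{\partial L}{\partial \boldsymbol f_K}\,\frac{\partial \boldsymbol f_K}{\partial \boldsymbol f_{K-1}}\,\frac{\partial \boldsymbol f_{K-1}}{\partial \boldsymbol f_{K-2}}\,\frac{\partial \boldsymbol f_{K-2}}{\partial \boldsymbol\theta_{K-3}} \\
\frac{\partial L}{\partial \boldsymbol\theta_{i}} &= \frac{\partial L}{\partial \boldsymbol f_K}\,\frac{\partial \boldsymbol f_K}{\partial \boldsymbol f_{K-1}} \cdots \frac{\partial \boldsymbol f_{i+2}}{\partial \boldsymbol f_{i+1}}\,\frac{\partial \boldsymbol f_{i+1}}{\partial \boldsymbol\theta_{i}}
\end{aligned}
$$

Intermediate derivatives are stored during the forward pass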
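The shared prefix $\frac{\partial L}{\partial \boldsymbol f_K}\frac{\partial \boldsymbol f_K}{\partial \boldsymbol f_{K-1}}\cdots$ is what makes backpropagation cheap: it is computed once and extended layer by layer. The sketch below is one way to write this in NumPy, under assumptions that are not in the slides: every activation $\sigma_i$ is taken to be $\tanh$, the layer sizes are arbitrary, and the function names are hypothetical. The forward pass stores the activations $\boldsymbol f_i$, from which the local derivatives are formed during the backward sweep.

```python
import numpy as np

def forward(params, x):
    """Return [f_0, f_1, ..., f_K] with f_0 = x and f_i = tanh(A_{i-1} f_{i-1} + b_{i-1})."""
    fs = [x]
    for A, b in params:
        fs.append(np.tanh(A @ fs[-1] + b))
    return fs

def loss(params, x, y):
    """Squared loss L = ||y - f_K||^2."""
    return np.sum((y - forward(params, x)[-1]) ** 2)

def backward(params, fs, y):
    """Return a list of (dL/dA_i, dL/db_i), reusing the running product dL/df."""
    grads = [None] * len(params)
    dL_df = -2.0 * (y - fs[-1])                 # dL/df_K
    for i in reversed(range(len(params))):
        A, _ = params[i]
        dL_dz = dL_df * (1.0 - fs[i + 1] ** 2)  # through tanh: tanh' = 1 - tanh^2
        grads[i] = (np.outer(dL_dz, fs[i]), dL_dz)
        dL_df = A.T @ dL_dz                     # dL/df_i, reused by the earlier layer
    return grads

rng = np.random.default_rng(4)
sizes = [3, 4, 4, 2]  # assumed layer widths, so K = 3
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[-1])
grads = backward(params, forward(params, x), y)

# Central finite-difference check on a single entry of A_0.
h = 1e-6
def loss_with_shifted_A0(shift):
    A0 = params[0][0].copy()
    A0[0, 0] += shift
    return loss([(A0, params[0][1])] + params[1:], x, y)

numeric = (loss_with_shifted_A0(h) - loss_with_shifted_A0(-h)) / (2 * h)
analytic = grads[0][0][0, 0]
print(numeric, analytic)
```

Each loop iteration appends one more factor $\partial \boldsymbol f_{i+1}/\partial \boldsymbol f_i$ to the running product, so the cost of all $K$ gradients is linear in the number of layers.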


Example: Regression with Neural Networks

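Regression with a neural network parametrized by $\boldsymbol\theta$:

$$
y = f_{\boldsymbol\theta}(\boldsymbol x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)
$$

Given inputs $\boldsymbol x_n$ and corresponding (noisy) observations $y_n$, $n = 1, \dots, N$, find parameters $\boldsymbol\theta^*$ that minimize the squared loss

$$
L(\boldsymbol\theta) = \sum_{n=1}^{N} \big(y_n - f_{\boldsymbol\theta}(\boldsymbol x_n)\big)^2 = \|\boldsymbol y - f_{\boldsymbol\theta}(\boldsymbol X)\|^2
$$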


Training NNs as Maximum Likelihood Estimation

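Training a neural network in the above way corresponds to maximum likelihood estimation:

If $\boldsymbol y = \mathrm{NN}(\boldsymbol x, \boldsymbol\theta) + \boldsymbol\varepsilon$, $\boldsymbol\varepsilon \sim \mathcal{N}(\boldsymbol 0, \boldsymbol I)$, then the log-likelihood is

$$
\log p(\boldsymbol y \mid \boldsymbol x, \boldsymbol\theta) = -\tfrac{1}{2}\|\boldsymbol y - \mathrm{NN}(\boldsymbol x, \boldsymbol\theta)\|^2 + \text{const}
$$

Find $\boldsymbol\theta^*$ by minimizing the negative log-likelihood:

$$
\begin{aligned}
\boldsymbol\theta^* &= \arg\min_{\boldsymbol\theta}\, -\log p(\boldsymbol y \mid \boldsymbol x, \boldsymbol\theta) \\
&= \arg\min_{\boldsymbol\theta}\, \tfrac{1}{2}\|\boldsymbol y - \mathrm{NN}(\boldsymbol x, \boldsymbol\theta)\|^2 \\
&= \arg\min_{\boldsymbol\theta}\, L(\boldsymbol\theta)
\end{aligned}
$$

Maximum likelihood estimation can lead to overfitting (interpreting noise as signal)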
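The step from the Gaussian noise model to the squared loss can be written out explicitly. The short derivation below assumes $\boldsymbol y \in \mathbb{R}^M$, as in the single-layer setting (the dimension is not stated on this slide):

```latex
\begin{align*}
\log p(\boldsymbol y \mid \boldsymbol x, \boldsymbol\theta)
  &= \log \mathcal{N}\big(\boldsymbol y \mid \mathrm{NN}(\boldsymbol x, \boldsymbol\theta),\, \boldsymbol I\big) \\
  &= -\tfrac{M}{2}\log(2\pi)
     - \tfrac{1}{2}\,\|\boldsymbol y - \mathrm{NN}(\boldsymbol x, \boldsymbol\theta)\|^2 .
\end{align*}
```

The first term does not depend on $\boldsymbol\theta$, so it does not affect the minimizer, and neither does the positive factor $\tfrac{1}{2}$; this is why minimizing the negative log-likelihood and minimizing the squared loss give the same $\boldsymbol\theta^*$.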


Example: Linear Regression (1)

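Linear regression with a polynomial of order $M$:

$$
y = f(x, \boldsymbol\theta) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)
$$

$$
f(x, \boldsymbol\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_M x^M = \sum_{i=0}^{M} \theta_i x^i
$$

Given inputs $x_i$ and corresponding (noisy) observations $y_i$, $i = 1, \dots, N$, find parameters $\boldsymbol\theta = [\theta_0, \dots, \theta_M]^\top$ that minimize the squared loss (equivalently: maximize the likelihood)

$$
L(\boldsymbol\theta) = \sum_{i=1}^{N} \big(y_i - f(x_i, \boldsymbol\theta)\big)^2
$$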
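Because $f(x, \boldsymbol\theta)$ is linear in $\boldsymbol\theta$, the minimizer can be found with an ordinary least-squares solve. The following NumPy sketch is an illustration only: the helper name `fit_polynomial`, the synthetic data, and the true coefficients are assumptions, and `np.linalg.lstsq` is just one of several ways to solve the problem.

```python
import numpy as np

def fit_polynomial(x, y, M):
    """Least-squares estimate of theta for a polynomial of order M (hypothetical helper)."""
    Phi = np.vander(x, M + 1, increasing=True)  # columns x^0, x^1, ..., x^M
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

# Synthetic data from an assumed quadratic with small Gaussian noise.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 20)
theta_true = np.array([0.5, -1.0, 2.0])
y = np.vander(x, 3, increasing=True) @ theta_true + 0.01 * rng.normal(size=x.size)

theta_hat = fit_polynomial(x, y, M=2)
print(theta_hat)
```

With low noise and a model of the correct order, the estimate lands close to the coefficients used to generate the data.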


Example: Linear Regression (2)

[Figure: "Polynomial of degree 16". Horizontal axis $x$ from $-5$ to $5$, vertical axis $f(x)$ from $-3$ to $3$; legend: Data, Maximum likelihood estimate]

Regularization, model selection etc. can address overfitting

Alternative approach based on integration
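As one concrete instance of regularization, the sketch below compares the maximum likelihood (least-squares) fit with an L2-regularized (ridge) fit. Ridge regression is a technique chosen here for illustration and is not named on the slide; the data, the degree `8`, and the penalty `lam = 1e-2` are likewise assumptions, and the inputs are kept in $[-1, 1]$ so that the polynomial features stay well scaled.

```python
import numpy as np

# Synthetic data: a few noisy samples of a smooth function (assumed for illustration).
rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=x.size)

degree = 8
Phi = np.vander(x, degree + 1, increasing=True)  # polynomial features x^0..x^degree

# Maximum likelihood estimate: plain least squares, free to chase the noise.
theta_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Ridge estimate: minimizes ||y - Phi theta||^2 + lam * ||theta||^2.
lam = 1e-2
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

def train_loss(theta):
    """Squared loss on the training data."""
    return np.sum((y - Phi @ theta) ** 2)

print(np.linalg.norm(theta_mle), np.linalg.norm(theta_ridge))
print(train_loss(theta_mle), train_loss(theta_ridge))
```

The ridge solution accepts a slightly larger training loss in exchange for smaller coefficients, which typically yields a smoother fitted curve.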


Summary

[Figures: collating the partial derivatives $\partial \boldsymbol A/\partial x_1, \partial \boldsymbol A/\partial x_2, \partial \boldsymbol A/\partial x_3 \in \mathbb{R}^{4\times 2}$ into $\mathrm{d}\boldsymbol A/\mathrm{d}\boldsymbol x \in \mathbb{R}^{4\times 2\times 3}$; maximum likelihood fit of a polynomial of degree 16; multi-layer neural network diagram]

Vector-valued differentiation

Chain rule

Check the dimension of the gradients
