Deep Learning primer: Overview and Tutorial
Jim O’Donoghue
Deep Learning Meetup @Intercom, Stephens Green, 7th April 2016

Transcript of the slides follows.

Agenda: my background; machine learning; the elements of a learning function.

Function elements:
input types
hypothesis functions (NN architectures)
objective functions + optimisation
output types

Worked examples: linear regression and the multi-layer perceptron.

Variable types: Continuous; Discrete / Categorical (Nominal or Ordinal).

Learning settings: Supervised, Unsupervised, Semi-Supervised, Feature learning.

We observe inputs x and targets y, where y = f(x) + ε for some unknown function f and noise ε.
A hypothesis h_θ(x) approximates f and produces an output ŷ, calculated from the parameters θ.
An objective function J(θ) measures how far ŷ is from y, and optimisation (guided by hyper-parameters) adjusts θ to reduce that error.

Hypothesis Functions h(x)

Outputs are calculated via the parameters θ = {Weights, bias} and n activation functions.
The interim functions can be linear or non-linear: Linear, Tanh, Cosh, Logistic Sigmoid, Rectified Linear.
Composing a linear map f with a non-linearity g gives h(x) = g(f(x)); stacking another layer gives h(x) = g(f(g(f(x)))).
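A minimal NumPy sketch of these interim functions and the composed hypothesis (the function names are illustrative, not from the slides):

```python
import numpy as np

# The interim (activation) functions listed above
def linear(z):
    return z

def tanh(z):
    return np.tanh(z)

def cosh(z):
    return np.cosh(z)

def logistic_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rectified_linear(z):
    return np.maximum(0.0, z)

# One layer of the hypothesis: a linear map f followed by a non-linearity g
def layer(x, W, b, g=logistic_sigmoid):
    return g(W @ x + b)

# Stacking two layers gives h(x) = g(f(g(f(x))))
def h(x, W1, b1, W2, b2):
    return layer(layer(x, W1, b1), W2, b2)
```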

loss / cost / error

The output ŷ differs from the target y by an error ε: ŷ − y = ε. The cost function J(θ) aggregates this error over the training samples.

gradient descent

Plot the cost J(θ) against the parameters θ. To minimise it, repeat two steps:
A. get the partial derivative ∂J(θ)/∂θ at the current θ
B. update the parameters: θ := θ − α ∂J(θ)/∂θ
where α is the learning rate. Each update moves θ downhill on J(θ). Depending on where the parameters start and on the shape of J(θ), gradient descent may reach the global optimum or settle in a local optimum.
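A minimal sketch of this loop in NumPy, assuming a generic gradient function is available (the names gradient_descent and grad_J are illustrative):

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, n_steps=100):
    """Repeat: A. get the partial derivative, B. update the parameters."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        grad = grad_J(theta)          # A. dJ(theta)/dtheta at the current theta
        theta = theta - alpha * grad  # B. theta := theta - alpha * dJ/dtheta
    return theta

# Example: J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
print(theta_star)  # close to [3.]
```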

first... the activation

A single unit connects the input features to the output class through connection weights.
Starting from a straight line, y = f(x) = ax + b, we rename the slope and intercept as weights and bias, θ = {Weights, bias}:
ŷ = f_θ(x) = wx + b = θᵀx = Σ_{i=1}^{n} w_i x_i + b
For what follows we call this linear activation z:
z = f_θ(x) = θᵀx = wᵀx + b
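A one-line NumPy sketch of this linear activation (the variable names are illustrative):

```python
import numpy as np

def linear_activation(x, w, b):
    """z = f_theta(x) = w^T x + b, with theta = {weights, bias}."""
    return np.dot(w, x) + b

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(linear_activation(x, w, b=0.1))  # 0.5*1 + (-0.25)*2 + 0.1 = 0.1
```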

the objective for linear regression

For one sample the error is ŷ_s − y_s; over all m samples the cost is the mean squared error:
J(θ) = (1/2m) Σ_{s=1}^{m} (ŷ_s − y_s)²
Its partial derivative with respect to θ is
∂J(θ)/∂θ = (1/m) Σ_{s=1}^{m} (ŷ_s − y_s) x_s
i.e. the error δ times the input x, averaged over the samples. The update is then
θ := θ − α (1/m) Σ_{s=1}^{m} (ŷ_s − y_s) x_s
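Putting the activation, cost, gradient, and update together, a minimal NumPy sketch (assuming a bias column appended to X; names are illustrative):

```python
import numpy as np

def predict(X, theta):
    """y_hat = X @ theta; the last column of X is all ones (the bias term)."""
    return X @ theta

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum_s (y_hat_s - y_s)^2"""
    m = len(y)
    err = predict(X, theta) - y
    return (err @ err) / (2 * m)

def gradient(X, y, theta):
    """dJ/dtheta = 1/m * sum_s (y_hat_s - y_s) * x_s"""
    m = len(y)
    return X.T @ (predict(X, theta) - y) / m

# Gradient-descent updates: theta := theta - alpha * dJ/dtheta
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # feature plus bias column
y = np.array([2.0, 4.0, 6.0])
theta, alpha = np.zeros(2), 0.1
for _ in range(1000):
    theta = theta - alpha * gradient(X, y, theta)
print(theta)  # approaches [2, 0], since y = 2x in this toy data
```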

the activation

Passing the linear activation z = f_θ(x) through the logistic sigmoid gives
a = g(z) = 1 / (1 + e^(−z))
which can be read as a probability: p(a = 1 | x, θ) = 1 / (1 + e^(−z)).
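A short NumPy sketch of the sigmoid activation (variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# a = g(z) for the linear activation z = w^T x + b,
# read as p(a = 1 | x, theta)
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
a = sigmoid(w @ x + 0.1)
print(a)  # roughly 0.52
```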

hypothesis

For a multi-layer perceptron the input x feeds a hidden activation a1, which feeds the output ŷ:
z1 = f_θ1(x) = w1ᵀx + b
a1 = g(z1) = 1 / (1 + e^(−z1))
ŷ = h(x) = f_θ2(a1)
so the whole hypothesis is the composition h(x) = f(g(f(x))).
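A minimal sketch of this forward pass in NumPy (layer sizes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of the two-layer hypothesis h(x) = f(g(f(x)))."""
    z1 = W1 @ x + b1       # z1 = f_theta1(x) = W1 x + b1
    a1 = sigmoid(z1)       # a1 = g(z1), hidden activation
    y_hat = W2 @ a1 + b2   # y_hat = f_theta2(a1), output layer
    return y_hat, a1

# Example: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
y_hat, a1 = forward(np.array([1.0, 2.0]), W1, b1, W2, b2)
```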

the error function

With x → a1 → ŷ, the cost is the same squared error as before:
J(θ) = (1/2m) Σ_{s=1}^{m} (ŷ_s − y_s)²

the partial derivative

For the output-layer parameters θ2, the chain rule gives
∂J(θ2)/∂θ2 = ∂J(θ2)/∂ŷ · ∂ŷ/∂θ2 = (1/m) Σ_{s=1}^{m} (ŷ_s − y_s) a1_s
i.e. an output error δ2 = ŷ − y multiplied by the hidden activation a1.
For the hidden-layer parameters θ1 the chain continues back through a1 and z1:
∂J(θ1)/∂θ1 = ∂J(θ1)/∂a1 · ∂a1/∂z1 · ∂z1/∂θ1
           = θ2 δ2 · a1(1 − a1) · x
           = δ1 x, where δ1 = θ2 δ2 a1(1 − a1)
(a1(1 − a1) is the derivative of the logistic sigmoid.)
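A minimal NumPy sketch of these deltas and updates for a single sample (shapes and names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, alpha=0.1):
    """One forward + backward pass for a single sample, updating all parameters."""
    # forward
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2
    # backward: the deltas from the slides
    delta2 = y_hat - y                          # output error, y_hat - y
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)  # theta2 * delta2 * a1(1 - a1)
    # gradients: dJ/dtheta2 = delta2 * a1, dJ/dtheta1 = delta1 * x
    W2 -= alpha * np.outer(delta2, a1)
    b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x)
    b1 -= alpha * delta1
    return W1, b1, W2, b2
```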

[Network diagrams: input features connect through weights directly to the output class, and, in the deeper version, through an intermediate layer of learned features.]

Further reading:
Learning Deep Architectures for AI (Bengio)
https://deeplearning.net
https://github.com/jimod/deeplearning-meetup-dublin
http://colah.github.io/
https://www.coursera.org/learn/machine-learning
https://www.coursera.org/course/neuralnets