Neural Network Part 1: Multiple Layer Neural Networks (source: pages.cs.wisc.edu/.../slides/lecture10-neural-networks-1.pdf)

Transcript
Page 1

Neural Network Part 1: Multiple Layer Neural Networks

CS 760 @ UW-Madison

Page 2

Goals for the lecture

you should understand the following concepts

• perceptrons

• the perceptron training rule

• linear separability

• multilayer neural networks

• stochastic gradient descent

• backpropagation


Page 3

Goals for the lecture

you should understand the following concepts

• regularization

• different views of regularization

• norm constraint

• dropout

• data augmentation

• early stopping


Page 4

Neural networks

• a.k.a. artificial neural networks, connectionist models

• inspired by interconnected neurons in biological systems

• simple processing units

• each unit receives a number of real-valued inputs

• each unit produces a single real-valued output


Page 5

Perceptron

Page 6

Perceptrons

[McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]

input units:

represent given x

output unit:

represents binary classification

[figure: input units $x_1, \ldots, x_n$ connect to the output unit with weights $w_1, \ldots, w_n$; a constant input of 1 carries the bias weight $w_0$]

$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
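As a concrete, non-slide illustration, a minimal Python sketch of this thresholded output; the function name and example weights are made up for illustration:

# Minimal sketch of a perceptron's output rule: o = 1 if w0 + sum_i wi*xi > 0, else 0.
# The weights below are illustrative, not from the lecture.

def perceptron_output(w0, w, x):
    """Return 1 if w0 + sum_i w[i]*x[i] > 0, else 0."""
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else 0

# example: two inputs with weights (1.0, -1.0) and bias weight w0 = 0.5
print(perceptron_output(0.5, [1.0, -1.0], [2.0, 1.0]))  # net = 1.5 > 0 -> 1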

Page 7

The perceptron training rule

1. randomly initialize weights

2. iterate through training instances until convergence

2a. calculate the output for the given instance:

$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

2b. update each weight:

$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta \, (y - o) \, x_i$

$\eta$ is the learning rate; set it to a value $\ll 1$
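A minimal sketch of this training rule in Python, assuming the perceptron_output function sketched earlier; the data set and learning rate are illustrative:

# Sketch of the perceptron training rule: w_i <- w_i + eta*(y - o)*x_i.
# Assumes perceptron_output as defined above; data and eta are illustrative.

def train_perceptron(data, n_features, eta=0.1, max_epochs=100):
    w0, w = 0.0, [0.0] * n_features          # 1. initialize the weights (here: zeros)
    for _ in range(max_epochs):               # 2. iterate until convergence
        converged = True
        for x, y in data:
            o = perceptron_output(w0, w, x)   # 2a. compute output for this instance
            if o != y:
                converged = False
            w0 += eta * (y - o)               # 2b. update each weight (x0 = 1 for the bias)
            for i in range(n_features):
                w[i] += eta * (y - o) * x[i]
        if converged:
            break
    return w0, w

# example: learn AND on binary inputs (linearly separable, so this converges)
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(and_data, n_features=2))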

Page 8

Representational power of perceptrons

perceptrons can represent only linearly separable concepts

[figure: scatter of + and − training examples in the $(x_1, x_2)$ plane, with a line separating the two classes]

the output is 1 if $w_0 + w_1 x_1 + w_2 x_2 > 0$, so the decision boundary is given by:

$w_1 x_1 + w_2 x_2 = -w_0$, i.e., $x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$

(also written as $\mathbf{w} \cdot \mathbf{x} > 0$)

$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

Page 9

Representational power of perceptrons

• in previous example, feature space was 2D so decision boundary was a line

• in higher dimensions, decision boundary is a hyperplane


Page 10

Some linearly separable functions

AND

    x1  x2 | y
a   0   0 | 0
b   0   1 | 0
c   1   0 | 0
d   1   1 | 1

[figure: points a, b, c, d at the corners of the unit square in the $(x_1, x_2)$ plane; only d is positive, and a line separates it from a, b, c]

OR

    x1  x2 | y
a   0   0 | 0
b   0   1 | 1
c   1   0 | 1
d   1   1 | 1

[figure: points a, b, c, d at the corners of the unit square; only a is negative, and a line separates it from b, c, d]

Page 11

XOR is not linearly separable

    x1  x2 | y
a   0   0 | 0
b   0   1 | 1
c   1   0 | 1
d   1   1 | 0

[figure: points a, b, c, d at the corners of the unit square; the positive points b and c cannot be separated from a and d by a single line]

a multilayer perceptron can represent XOR

[figure: network with inputs $x_1, x_2$, two hidden threshold units, and one output unit; the edge weights shown are 1, -1, 1, 1, 1, -1]

assume $w_0 = 0$ for all nodes
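A small Python sketch checking that a two-layer network of threshold units with $w_0 = 0$ can compute XOR; the specific weight assignment below (hidden weights (1, -1) and (-1, 1), output weights (1, 1)) is one reading consistent with the figure, not a verbatim copy of it:

# Two threshold hidden units plus a threshold output unit computing XOR (all w0 = 0).
# Hidden weights (1, -1) and (-1, 1) and output weights (1, 1) are one assignment
# consistent with the slide's figure.

def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(1 * x1 + (-1) * x2)   # fires when x1 > x2
    h2 = step((-1) * x1 + 1 * x2)   # fires when x2 > x1
    return step(1 * h1 + 1 * h2)    # fires when exactly one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))   # matches the XOR truth table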

Page 12

Multiple Layer Neural Networks

Page 13

Example multilayer neural network

input: two features from spectral analysis of a spoken sound

output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988

input units

hidden units

output units


Page 14

Decision regions of a multilayer neural network

input: two features from spectral analysis of a spoken sound

output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988


Page 15

Components

• Representations:

  • Input

  • Hidden variables

• Layers/weights:

  • Hidden layers

  • Output layer

Page 16

Components

[figure: network schematic; the input $x$ feeds the first layer, hidden variables $h_1, h_2, \ldots, h_L$ follow layer by layer, and the output layer produces $y$]

Page 17

Input

• Represented as a vector

• Sometimes requires some preprocessing, e.g.,

  • Subtract mean

  • Normalize to [-1, 1]

  • Expand

Page 18

Input (feature) encoding for neural networks

nominal features are usually represented using a 1-of-k encoding

$A = \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \quad C = \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \quad G = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \quad T = \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$

ordinal features can be represented using a thermometer encoding

small $= \begin{bmatrix}1\\0\\0\end{bmatrix} \quad$ medium $= \begin{bmatrix}1\\1\\0\end{bmatrix} \quad$ large $= \begin{bmatrix}1\\1\\1\end{bmatrix}$

real-valued features can be represented using individual input units (we may want to scale/normalize them first though)

precipitation $= \begin{bmatrix}0.68\end{bmatrix}$
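A brief sketch of these three encodings (1-of-k, thermometer, raw real value); the category orderings and dictionaries below are illustrative choices:

# Sketches of the three input encodings from this slide.
# The category orderings are illustrative choices.

def one_of_k(value, categories):
    """Nominal feature -> 1-of-k vector, e.g. 'C' over ACGT -> [0, 1, 0, 0]."""
    return [1 if value == c else 0 for c in categories]

def thermometer(value, ordered_levels):
    """Ordinal feature -> thermometer code, e.g. 'medium' -> [1, 1, 0]."""
    rank = ordered_levels.index(value) + 1
    return [1 if i < rank else 0 for i in range(len(ordered_levels))]

print(one_of_k("C", ["A", "C", "G", "T"]))                   # [0, 1, 0, 0]
print(thermometer("medium", ["small", "medium", "large"]))   # [1, 1, 0]
print([0.68])   # real-valued feature (e.g. precipitation) as a single input unit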

Page 19

Output layers

• Regression: $y = w^T h + b$

• Linear units: no nonlinearity

𝑦

Output layer

Page 20

Output layers

• Multi-dimensional regression: $y = W^T h + b$

• Linear units: no nonlinearity

𝑦

Output layer

Page 21

Output layers

• Binary classification: $y = \sigma(w^T h + b)$

• Corresponds to using logistic regression on ℎ

𝑦

Output layer

Page 22

Output layers

• Multi-class classification:

• $y = \mathrm{softmax}(z)$ where $z = W^T h + b$

• Corresponds to using multi-class

logistic regression on ℎ

𝑦

Output layer

𝑧
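A short sketch of this output layer ($z = W^T h + b$ followed by softmax); the dimensions and values below are made up for illustration:

# Sketch of a multi-class output layer: z = W^T h + b, y = softmax(z).
# Shapes and values are illustrative.
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.array([0.2, -1.0, 0.5])     # hidden representation (3 units)
W = np.random.randn(3, 4)          # 3 hidden units -> 4 classes
b = np.zeros(4)

z = W.T @ h + b
y = softmax(z)
print(y, y.sum())                  # class probabilities summing to 1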

Page 23

Hidden layers

• Neuron takes weighted linear combination of the previous layer

• outputs one value for the next layer

[figure: units in layer $h_i$ feed into units in layer $h_{i+1}$]

Page 24

Hidden layers

• $y = r(w^T x + b)$

• Typical activation functions $r$:

  • Threshold: $t(z) = \mathbb{I}[z \ge 0]$

  • Sigmoid: $\sigma(z) = 1 / (1 + \exp(-z))$

  • Tanh: $\tanh(z) = 2\sigma(2z) - 1$

[figure: unit mapping input $x$ to output $y$ through $r(\cdot)$]
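A quick sketch of the three activation functions listed above:

# The three activations from this slide: threshold, sigmoid, and tanh written as 2*sigma(2z) - 1.
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)          # indicator I[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_via_sigmoid(z):
    return 2 * sigmoid(2 * z) - 1          # equals np.tanh(z)

z = np.linspace(-3, 3, 7)
print(threshold(z))
print(sigmoid(z))
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))   # True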

Page 25

Hidden layers

• Problem: saturation

[figure: sigmoid activation curve, borrowed from Pattern Recognition and Machine Learning, Bishop; in the saturated (flat) regions the gradient is too small]

Page 26

Hidden layers

• Activation function ReLU (rectified linear unit)

  • $\mathrm{ReLU}(z) = \max\{z, 0\}$

Figure from Deep learning, by Goodfellow, Bengio, Courville.

Page 27

Hidden layers

• Activation function ReLU (rectified linear unit)

  • $\mathrm{ReLU}(z) = \max\{z, 0\}$

[figure: ReLU curve; the gradient is 0 for $z < 0$ and 1 for $z > 0$]

Page 28

Hidden layers

• Generalizations of ReLU: $\mathrm{gReLU}(z) = \max\{z, 0\} + \alpha \min\{z, 0\}$

  • Leaky-ReLU: $\mathrm{LeakyReLU}(z) = \max\{z, 0\} + 0.01 \min\{z, 0\}$

  • Parametric-ReLU: $\alpha$ learnable

[figure: plot of $\mathrm{gReLU}(z)$ versus $z$]
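A short sketch of the generalized ReLU family, with $\alpha$ as a parameter ($\alpha = 0.01$ gives Leaky-ReLU; making $\alpha$ learnable gives Parametric-ReLU):

# gReLU(z) = max{z, 0} + alpha * min{z, 0}; alpha = 0 is plain ReLU, 0.01 is Leaky-ReLU.
import numpy as np

def g_relu(z, alpha=0.0):
    return np.maximum(z, 0) + alpha * np.minimum(z, 0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(g_relu(z))            # plain ReLU:  [0.    0.    0.  1.5]
print(g_relu(z, 0.01))      # Leaky-ReLU:  [-0.02 -0.005 0.  1.5]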

Page 29

Backpropagation

Page 30

Learning in multilayer networks

• work on neural nets fizzled in the 1960s

• single layer networks had representational limitations (linear separability)

• no effective methods for training multilayer networks

• revived again with the invention of backpropagation method

[Rumelhart & McClelland, 1986; also Werbos, 1975]

• key insight: require neural network to be differentiable;

use gradient descent

[figure: multilayer network with inputs $x_1$ and $x_2$; how do we determine the error signal for the hidden units?]

Page 31

Learning in multilayer networks

• learning techniques nowadays

• random initialization of the weights

• stochastic gradient descent (can add momentum)

• regularization techniques

• norm constraint

• dropout

• batch normalization

• data augmentation

• early stopping

• pretraining

• …

Page 32

Gradient descent in weight space

figure from Cho & Chow, Neurocomputing 1999

Given a training set we can specify an error

measure that is a function of our weight vector w

This error measure defines a surface over the hypothesis (i.e. weight) space

[figure: error surface plotted over weights $w_1$ and $w_2$]

$D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)})\}$

$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$

Page 33

Gradient descent in weight space

[figure: error surface over weights $w_1$ and $w_2$]

on each iteration

• current weights define a point in this space

• find direction in which error surface descends most steeply

• take a step (i.e. update weights) in that direction

gradient descent is an iterative process aimed at finding a minimum in the

error surface


Page 34

Gradient descent in weight space

[figure: error surface over $w_1$ and $w_2$, with components $-\frac{\partial E}{\partial w_1}$ and $-\frac{\partial E}{\partial w_2}$ pointing in the direction of steepest descent]

calculate the gradient of $E$:

$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

take a step in the opposite direction:

$\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w}) \qquad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$

Page 35

Batch neural network training

given: network structure and a training set $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)})\}$

initialize all weights in $\mathbf{w}$ to small random numbers

until stopping criteria met do

  initialize the error: $E(\mathbf{w}) = 0$

  for each $(\mathbf{x}^{(d)}, y^{(d)})$ in the training set

    input $\mathbf{x}^{(d)}$ to the network and compute output $o^{(d)}$

    increment the error: $E(\mathbf{w}) = E(\mathbf{w}) + \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$

  calculate the gradient: $\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

  update the weights: $\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})$
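A minimal sketch of this batch procedure, specialized to the single linear unit analyzed later in the lecture (so the gradient is $-\sum_d (y^{(d)} - o^{(d)})\, x_i^{(d)}$); the data set and learning rate are illustrative:

# Sketch of batch gradient descent for a single linear unit o = w0 + sum_i wi*xi,
# using dE/dwi = -sum_d (y_d - o_d) * x_{d,i}. Data and eta are illustrative.
import numpy as np

def batch_train(X, y, eta=0.05, epochs=200):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for the bias w0
    w = 0.01 * np.random.randn(X.shape[1])          # small random initial weights
    E = 0.0
    for _ in range(epochs):                         # until stopping criteria met
        o = X @ w                                   # compute outputs for all instances
        E = 0.5 * np.sum((y - o) ** 2)              # accumulate the error
        grad = -X.T @ (y - o)                       # gradient of E over the whole set
        w += -eta * grad                            # take a step opposite the gradient
    return w, E

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])                  # y = 1 + 2x
w, E = batch_train(X, y)
print(w, E)                                          # w close to [1, 2], E small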

Page 36

Online vs. batch training

• Standard gradient descent (batch training): calculates error gradient for the entire training set, before taking a step in weight space

• Stochastic gradient descent (online training): calculates error gradient for a single instance, then takes a step in weight space

• much faster convergence

• less susceptible to local minima


Page 37

Online neural network training (stochastic gradient descent)

given: network structure and a training set $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)})\}$

initialize all weights in $\mathbf{w}$ to small random numbers

until stopping criteria met do

  for each $(\mathbf{x}^{(d)}, y^{(d)})$ in the training set

    input $\mathbf{x}^{(d)}$ to the network and compute output $o^{(d)}$

    calculate the error: $E(\mathbf{w}) = \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$

    calculate the gradient: $\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

    update the weights: $\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})$

Page 38

Taking derivatives in neural nets

recall the chain rule from calculus

we’ll make use of this as follows

38

𝑦 = 𝑓 𝑢

𝑢 = 𝑔 𝑥

𝜕𝑦

𝜕𝑥=𝜕𝑦

𝜕𝑢

𝜕𝑢

𝜕𝑥

𝜕𝐸

𝜕𝑤𝑖=𝜕𝐸

𝜕𝑜

𝜕𝑜

𝜕𝑛𝑒𝑡

𝜕𝑛𝑒𝑡

𝜕𝑤𝑖𝑛𝑒𝑡 = 𝑤0 +

𝑖=1

𝑛

𝑤𝑖𝑥𝑖

Page 39

Gradient descent: simple case

Consider a simple case of a network with one linear output unit and no

hidden units:

let’s learn wi’s that minimize squared error

[figure: single unit with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, and a bias weight $w_0$ on a constant input of 1]

$o^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}$

$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$

batch case: $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$

online case: $\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$

Page 40

Stochastic gradient descent: simple case

let’s focus on the online case (stochastic gradient descent):

$\begin{aligned}
\frac{\partial E^{(d)}}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2 \\
&= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right) \\
&= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right) \\
&= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \frac{\partial net^{(d)}}{\partial w_i} = -\left( y^{(d)} - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) x_i^{(d)}
\end{aligned}$
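Putting the derived gradient to work, a minimal sketch of the online update $\Delta w_i = \eta \left( y^{(d)} - o^{(d)} \right) x_i^{(d)}$ for a single linear unit; the data set and learning rate are illustrative:

# Sketch of stochastic gradient descent for a single linear unit, using the
# derived gradient dE(d)/dwi = -(y(d) - o(d)) * xi(d). Data and eta illustrative.
import numpy as np

def sgd_train(X, y, eta=0.05, epochs=100):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1 handles the bias w0
    w = 0.01 * np.random.randn(X.shape[1])
    for _ in range(epochs):
        for x_d, y_d in zip(X, y):                  # one instance at a time
            o_d = w @ x_d
            w += eta * (y_d - o_d) * x_d            # step opposite the gradient
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])                  # y = 1 + 2x
print(sgd_train(X, y))                               # close to [1, 2]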

Page 41

Gradient descent with a sigmoid

Now let’s consider the case in which we have a sigmoid output unit

and no hidden units:

[figure: single sigmoid unit with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, and a bias weight $w_0$ on a constant input of 1]

$net^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)} \qquad o^{(d)} = \frac{1}{1 + e^{-net^{(d)}}}$

useful property: $\frac{\partial o^{(d)}}{\partial net^{(d)}} = o^{(d)} \left( 1 - o^{(d)} \right)$

Page 42

Stochastic GD with sigmoid output unit

$\begin{aligned}
\frac{\partial E^{(d)}}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2 \\
&= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right) \\
&= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right) \\
&= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) x_i^{(d)}
\end{aligned}$
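The same online loop with a sigmoid output unit just picks up the extra $o^{(d)}(1 - o^{(d)})$ factor derived above; a brief sketch (the binary labels and learning rate are illustrative):

# Sketch of stochastic GD for a single sigmoid unit, using the derived gradient
# dE(d)/dwi = -(y(d) - o(d)) * o(d) * (1 - o(d)) * xi(d). Data/eta illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_sigmoid_train(X, y, eta=0.5, epochs=2000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = 1 for the bias w0
    w = 0.01 * np.random.randn(X.shape[1])
    for _ in range(epochs):
        for x_d, y_d in zip(X, y):
            o_d = sigmoid(w @ x_d)
            w += eta * (y_d - o_d) * o_d * (1 - o_d) * x_d
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])                   # decision threshold near x = 1.5
w = sgd_sigmoid_train(X, y)
print(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w))  # outputs move toward the labels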

Page 43

Backpropagation

• now we've covered how to do gradient descent for single-layer networks with

  • linear output units

  • sigmoid output units

• how can we calculate $\frac{\partial E}{\partial w_i}$ for every weight in a multilayer network?

➔ backpropagate errors from the output units to the hidden units

Page 44

Backpropagation notation

let’s consider the online case, but drop the (d) superscripts for simplicity

we’ll use

• subscripts on y, o, net to indicate which unit they refer to

• subscripts to indicate the unit a weight emanates from and goes to

[figure: weight $w_{ji}$ goes from unit $i$ to unit $j$, which produces output $o_j$]

Page 45

Backpropagation

each weight is changed by

$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \eta \, \delta_j \, o_i$

where $\delta_j = -\frac{\partial E}{\partial net_j}$, and $o_i = x_i$ if $i$ is an input unit

Page 46

Backpropagation

each weight is changed by $\Delta w_{ji} = \eta \, \delta_j \, o_i$, where $\delta_j = -\frac{\partial E}{\partial net_j}$

if $j$ is an output unit: $\delta_j = o_j (1 - o_j)(y_j - o_j)$ (same as a single-layer net with sigmoid output)

if $j$ is a hidden unit: $\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}$ (the sum of backpropagated contributions to the error)
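A compact sketch of these two $\delta$ rules and the weight update for one sigmoid output unit and its hidden layer; all arrays and the learning rate below are illustrative values:

# Sketch of the two delta rules above for a network with one sigmoid hidden layer
# and one sigmoid output unit j. All values below are illustrative.
import numpy as np

eta = 0.1
o_hidden = np.array([0.6, 0.3, 0.8])    # hidden-unit outputs o_i
w_out = np.array([0.2, -0.5, 0.1])      # weights w_ji from hidden units i to output j
o_j, y_j = 0.55, 1.0                     # output-unit activation and target

# output unit:  delta_j = o_j * (1 - o_j) * (y_j - o_j)
delta_out = o_j * (1 - o_j) * (y_j - o_j)

# hidden units: delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj  (here one output k)
delta_hidden = o_hidden * (1 - o_hidden) * (delta_out * w_out)

# weight updates: Delta w_ji = eta * delta_j * o_i
w_out += eta * delta_out * o_hidden
print(delta_out, delta_hidden, w_out)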

Page 47

Backpropagation illustrated

1. calculate the error of the output units: $\delta_j = o_j (1 - o_j)(y_j - o_j)$

2. determine updates for weights going to the output units: $\Delta w_{ji} = \eta \, \delta_j \, o_i$

[figure: the error $\delta_j$ is computed at each output unit $j$, then used to update the weights $w_{ji}$ from units $i$ into $j$]

Page 48

Backpropagation illustrated

3. calculate the error for the hidden units: $\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}$

4. determine updates for weights to the hidden units using the hidden-unit errors: $\Delta w_{ji} = \eta \, \delta_j \, o_i$

[figure: hidden unit $j$ collects backpropagated errors $\delta_k$ from the units $k$ it feeds, then the weights $w_{ji}$ into $j$ are updated]

Page 49

More Illustration of Backpropagation

Page 50

Learning in neural network

• Again we will minimize the error (𝐾 outputs):

• 𝑥: one training point in the training set 𝐷

• 𝑎𝑐: the 𝑐-th output for the training point 𝑥

• 𝑦𝑐: the 𝑐-th element of the label indicator vector for 𝑥

$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \| y - a \|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

[figure: network with inputs $x_1, x_2$ and outputs $a_1, \ldots, a_K$; the label indicator vector for $x$ is $y = (1, 0, \ldots, 0, 0)$]

Page 51

Learning in neural network

• Again we will minimize the error (𝐾 outputs):

• 𝑥: one training point in the training set 𝐷

• 𝑎𝑐: the 𝑐-th output for the training point 𝑥

• 𝑦𝑐: the 𝑐-th element of the label indicator vector for 𝑥

• Our variables are all the weights 𝑤 on all the edges

▪ Apparent difficulty: we don’t know the ‘correct’

output of hidden units

$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \| y - a \|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

Page 52

Learning in neural network

• Again we will minimize the error (𝐾 outputs):

• 𝑥: one training point in the training set 𝐷

• 𝑎𝑐: the 𝑐-th output for the training point 𝑥

• 𝑦𝑐: the 𝑐-th element of the label indicator vector for 𝑥

• Our variables are all the weights 𝑤 on all the edges

▪ Apparent difficulty: we don’t know the ‘correct’

output of hidden units

▪ It turns out to be OK: we can still do gradient

descent. The trick you need is the chain rule

▪ The algorithm is known as back-propagation

$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \| y - a \|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

Page 53

Gradient (on one data point)

[figure: four-layer network with inputs $x_1, x_2$ in Layer (1), hidden Layers (2) and (3), and an output Layer (4) feeding the error $E_x$; $w_{11}^{(4)}$ is a weight into the first unit of Layer (4)]

want to compute $\frac{\partial E_x}{\partial w_{11}^{(4)}}$

Page 54

Gradient (on one data point)

[figure: the same four-layer network; the outputs $a_1, a_2$ feed the error $E_x = \| y - a \|^2$, and $w_{11}^{(4)}$ is the weight from the first unit of Layer (3) to the first unit of Layer (4)]

Page 55

Gradient (on one data point)

[figure: same network; the first unit of Layer (4) computes $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$ and outputs $a_1 = g\!\left(z_1^{(4)}\right)$, which feeds $E_x = \| y - a \|^2$]

Page 56

Gradient (on one data point)

[figure: same network; the two edges into the first unit of Layer (4) carry $w_{11}^{(4)} a_1^{(3)}$ and $w_{12}^{(4)} a_2^{(3)}$, so $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$ and $a_1 = g\!\left(z_1^{(4)}\right)$]

Page 57

Gradient (on one data point)

[figure: same network, with $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, and $E_x = \| y - a \|^2$]

By Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \frac{\partial E_x}{\partial a_1} \frac{\partial a_1}{\partial z_1^{(4)}} \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 58

Gradient (on one data point)

[figure: same network as on the previous slide]

By Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right) \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$

using $\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1)$ and $\frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 59

Gradient (on one data point)

[figure: same network as on the previous slide]

By Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right) a_1^{(3)}$

since $\frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}} = a_1^{(3)}$

Page 60

Gradient (on one data point)

[figure: same network as on the previous slide]

By Chain Rule, with a sigmoid activation (so $g'(z) = g(z)\,(1 - g(z))$):

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, g\!\left(z_1^{(4)}\right)\left(1 - g\!\left(z_1^{(4)}\right)\right) a_1^{(3)}$

Page 61

Gradient (on one data point)

[figure: same network as on the previous slide]

By Chain Rule, since $a_1 = g\!\left(z_1^{(4)}\right)$:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, a_1 (1 - a_1) \, a_1^{(3)}$

Page 62

Gradient (on one data point)

[figure: same network as on the previous slide]

By Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, a_1 (1 - a_1) \, a_1^{(3)}$

every factor here can be computed from the network activations

Page 63

Backpropagation

[figure: same network; the first unit of Layer (4) computes $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, and $E_x = \| y - a \|^2$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, a_1 (1 - a_1) \, a_1^{(3)}$

By Chain Rule:

$\frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right)$

Page 64

Backpropagation

[figure: same network as on the previous slide]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1) \, a_1 (1 - a_1) \, a_1^{(3)}$

By Chain Rule, define:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right)$

Page 65

Backpropagation

[figure: same network; $\delta_1^{(4)}$ is attached to the first unit of Layer (4), whose input from Layer (3) along $w_{11}^{(4)}$ is $a_1^{(3)}$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)} a_1^{(3)}$

By Chain Rule:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right)$

Page 66

Backpropagation

[figure: same network; $w_{11}^{(4)}$ and $w_{12}^{(4)}$ carry $a_1^{(3)}$ and $a_2^{(3)}$ into the first unit of Layer (4)]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)} a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{12}^{(4)}} = \delta_1^{(4)} a_2^{(3)}$

By Chain Rule:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1) \, g'\!\left(z_1^{(4)}\right)$

Page 67

Backpropagation

[figure: same network; $w_{21}^{(4)}$ and $w_{22}^{(4)}$ carry $a_1^{(3)}$ and $a_2^{(3)}$ into the second unit of Layer (4)]

$\frac{\partial E_x}{\partial w_{21}^{(4)}} = \delta_2^{(4)} a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{22}^{(4)}} = \delta_2^{(4)} a_2^{(3)}$

By Chain Rule:

$\delta_2^{(4)} = \frac{\partial E_x}{\partial z_2^{(4)}} = 2(a_2 - y_2) \, g'\!\left(z_2^{(4)}\right)$

Page 68

Backpropagation

[figure: same network with a $\delta_j^{(l)}$ attached to each unit: $\delta_1^{(2)}, \delta_2^{(2)}, \delta_1^{(3)}, \delta_2^{(3)}, \delta_1^{(4)}, \delta_2^{(4)}$]

Thus, for any weight in the network:

$\frac{\partial E_x}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$

$\delta_j^{(l)}$: $\delta$ of the $j$th neuron in Layer $l$

$a_k^{(l-1)}$: activation of the $k$th neuron in Layer $l-1$

$w_{jk}^{(l)}$: weight from the $k$th neuron in Layer $l-1$ to the $j$th neuron in Layer $l$

Page 69

Exercise

[figure: same network with a $\delta_j^{(l)}$ attached to each unit]

Show that for any bias in the network:

$\frac{\partial E_x}{\partial b_j^{(l)}} = \delta_j^{(l)}$

$\delta_j^{(l)}$: $\delta$ of the $j$th neuron in Layer $l$

$b_j^{(l)}$: bias for the $j$th neuron in Layer $l$, i.e., $z_j^{(l)} = \sum_k w_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$

Page 70

Backpropagation of $\delta$

[figure: same network with a $\delta_j^{(l)}$ attached to each unit]

Thus, for any neuron in the network:

$\delta_j^{(l)} = \left( \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)} \right) g'\!\left(z_j^{(l)}\right)$

$\delta_j^{(l)}$: $\delta$ of the $j$th neuron in Layer $l$

$\delta_k^{(l+1)}$: $\delta$ of the $k$th neuron in Layer $l+1$

$g'\!\left(z_j^{(l)}\right)$: derivative of the $j$th neuron in Layer $l$ w.r.t. its linear combination input

$w_{kj}^{(l+1)}$: weight from the $j$th neuron in Layer $l$ to the $k$th neuron in Layer $l+1$

Page 71

Gradient descent with Backpropagation

1. Initialize the network with random weights and biases

2. For each training image:

   a. Compute activations for the entire network

   b. Compute $\delta$ for neurons in the output layer using the network activation and the desired activation:
      $\delta_j^{(L)} = 2\,(a_j - y_j)\, a_j (1 - a_j)$

   c. Compute $\delta$ for all neurons in the previous layers:
      $\delta_j^{(l)} = \left( \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)} \right) a_j^{(l)} \left( 1 - a_j^{(l)} \right)$

   d. Compute the gradient of the cost w.r.t. each weight and bias for the training image using $\delta$:
      $\frac{\partial E_x}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}, \qquad \frac{\partial E_x}{\partial b_j^{(l)}} = \delta_j^{(l)}$

Page 72

Gradient descent with Backpropagation

3. Average the gradient w.r.t. each weight and bias over the entire training set:

   $\frac{\partial E}{\partial w_{jk}^{(l)}} = \frac{1}{n} \sum_{x} \frac{\partial E_x}{\partial w_{jk}^{(l)}}, \qquad \frac{\partial E}{\partial b_j^{(l)}} = \frac{1}{n} \sum_{x} \frac{\partial E_x}{\partial b_j^{(l)}}$

4. Update the weights and biases using gradient descent:

   $w_{jk}^{(l)} \leftarrow w_{jk}^{(l)} - \eta \, \frac{\partial E}{\partial w_{jk}^{(l)}}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta \, \frac{\partial E}{\partial b_j^{(l)}}$

5. Repeat steps 2-4 till the cost reduces below an acceptable level

Page 73

THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, Pedro Domingos, and Mohit Gupta.