
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 4b: Convolutional Network

October 30, 2015


Table of contents

1. Objectives of Lecture 4b

2. Convolution kernel
   2.1. Convolution

3. Convolutional network
   3.1. 2D convolution
   3.2. Analysis of LeCun's example
   3.3. Another example
   3.4. Classification
   3.5. Training convolutional network


1. Objectives of Lecture 4b

Objective 1

Learn the basic formalism of convolutional networks

Objective 2

Go through LeCun’s examples

Objective 3

Learn about the training of convolutional networks


2. Convolution kernel
2.1. Convolution

f(x): function
K(x): convolution kernel (filter)

(f ∗ K)(x) = ∫ f(y) K(x − y) dy = ∫ f(x − y) K(y) dy

Discrete convolution

x(n): data
K(n): convolution kernel (filter)

(x ∗ K)(n) = ∑_m x(m) K(n − m) = ∑_m x(n − m) K(m)


Example (1D Convolution)

With kernel values K(1) = 1, K(0) = 2, K(−1) = −1:

(x ∗ K)(5) = x(5 − 1)K(1) + x(5 − 0)K(0) + x(5 + 1)K(−1)
           = x(4)K(1) + x(5)K(0) + x(6)K(−1)
           = x(4) + 2x(5) − x(6)


(x ∗K)(5) = x(4) + 2x(5) − x(6)


(x ∗K)(6) = x(5) + 2x(6) − x(7)
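For concreteness, here is a minimal Python sketch of this discrete convolution; the kernel values K(1) = 1, K(0) = 2, K(−1) = −1 are the ones assumed in the example above.

```python
# Minimal sketch of (x * K)(n) = sum_m x(n - m) K(m), assuming the kernel
# values from the example above: K(1) = 1, K(0) = 2, K(-1) = -1.
x = list(range(10))                 # toy data x(0), ..., x(9)
K = {1: 1, 0: 2, -1: -1}            # kernel as a map m -> K(m)

def conv1d_at(x, K, n):
    return sum(x[n - m] * k for m, k in K.items())

# (x * K)(5) = x(4) + 2 x(5) - x(6) = 4 + 10 - 6
print(conv1d_at(x, K, 5))           # -> 8
```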


Example (2D Convolution)

x(m, n): data
K(p, q): convolution kernel

(x ∗ K)(m, n) = ∑_{p,q} x(m − p, n − q) K(p, q)


(x ∗ K)(3, 4) = 2x(2, 3) + 4x(2, 4) − 2x(2, 5)
             + 3x(3, 3) + 6x(3, 4) − 3x(3, 5)
             + x(4, 3) + 2x(4, 4) − x(4, 5)


(x ∗ K)(3, 5) = 2x(2, 4) + 4x(2, 5) − 2x(2, 6)
             + 3x(3, 4) + 6x(3, 5) − 3x(3, 6)
             + x(4, 4) + 2x(4, 5) − x(4, 6)
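As a quick check, here is a small Python sketch of the 2D formula; the 3 × 3 kernel values are read off from the two expansions above and are otherwise an assumption.

```python
import numpy as np

# Sketch of (x * K)(m, n) = sum_{p,q} x(m - p, n - q) K(p, q), with the 3x3
# kernel values inferred from the expansions above.
K = {(1, 1): 2, (1, 0): 4, (1, -1): -2,
     (0, 1): 3, (0, 0): 6, (0, -1): -3,
     (-1, 1): 1, (-1, 0): 2, (-1, -1): -1}

def conv2d_at(x, K, m, n):
    return sum(x[m - p, n - q] * k for (p, q), k in K.items())

x = np.arange(64, dtype=float).reshape(8, 8)    # toy 8x8 "image"
print(conv2d_at(x, K, 3, 4))                    # matches the expansion of (x * K)(3, 4)
```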


Boundary effect

Example: convolution at the boundary of a finite array


There is no x(−1), so (x ∗K)(1) is not defined

One may pad 0’s around boundaries

But the “valid” part of x ∗K is shorter than x itself

In the above example, the valid part of x ∗ K is an array of size 3


In general, if K is a (2p + 1) × (2q + 1) matrix and x is an M × N matrix, then the valid part of x ∗ K is an (M − 2p) × (N − 2q) matrix
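A quick numerical check of this size rule, assuming SciPy is available (the lecture itself does not prescribe any library):

```python
import numpy as np
from scipy.signal import convolve2d

# An M x N input convolved with a (2p+1) x (2q+1) kernel, keeping only the
# fully overlapping ("valid") positions, yields an (M - 2p) x (N - 2q) output.
M, N, p, q = 7, 9, 1, 2
x = np.random.rand(M, N)
K = np.random.rand(2 * p + 1, 2 * q + 1)
print(convolve2d(x, K, mode="valid").shape)     # -> (5, 5) = (M - 2p, N - 2q)
```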


3. Convolutional network
3.1. 2D convolution

The same convolution kernel K is applied at every position



3.2. Analysis of LeCun’s example



Pooling

Moving a 10 × 10 window over a 75 × 75 image results in a 66 × 66 matrix

Pooling is taken as one of the following:

Maximum (max pooling)
L^P sum (P = 1, 2, ⋯)
Average
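A rough Python sketch of this pooling step; the stride-1 sliding of the 10 × 10 window follows the description above, while the implementation details are only illustrative.

```python
import numpy as np

def pool(x, win, mode="max", p=2):
    """Slide a win x win window over x (stride 1) and pool each patch."""
    H, W = x.shape
    out = np.empty((H - win + 1, W - win + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + win, j:j + win]
            if mode == "max":
                out[i, j] = patch.max()            # max pooling
            elif mode == "avg":
                out[i, j] = patch.mean()           # average pooling
            else:
                out[i, j] = (np.abs(patch) ** p).sum() ** (1.0 / p)   # L^p sum
    return out

x = np.random.rand(75, 75)
print(pool(x, 10).shape)                           # -> (66, 66)
```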


Subsampling

Example: 5 × 5 subsampling (i.e., column stride = 5, row stride = 5)

14 = (66 − 1)/5 + 1

Sampling at (1, 1), (1, 6), ⋯, (1, 66), (6, 1), (6, 6), ⋯, (6, 66), ⋯, (66, 1), (66, 6), ⋯, (66, 66)
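The subsampling itself is just strided indexing; a small sketch, where the 66 × 66 input stands in for the pooled map from the previous step:

```python
import numpy as np

# Stride-5 subsampling of a 66 x 66 map: positions 1, 6, ..., 66 in each
# direction (1-based), i.e. 14 = (66 - 1)/5 + 1 samples per axis.
pooled = np.random.rand(66, 66)
sub = pooled[::5, ::5]              # every 5th row and column
print(sub.shape)                    # -> (14, 14)
```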


Layer 3

There are 256 feature maps in Layer 3. Each of these 256 feature maps is obtained as follows:

Randomly select 16 feature maps out of the 64 feature maps in Layer 2


Convolution is done with a 16 × 9 × 9 3D pipe moving in the 16 × 14 × 14 volume. For each of the 16 selected feature maps, this defines a 2D convolution kernel, i.e., 16 kernels per output feature map. Thus there are 256 × 16 = 4096 2D kernels
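A shape-level sketch of this layer, with random kernels and random map selections standing in for the trained ones, and SciPy's convolve2d assumed as tooling:

```python
import numpy as np
from scipy.signal import convolve2d

# Each of the 256 output maps combines 16 of the 64 Layer-2 maps (14 x 14)
# through its own 16 kernels of size 9 x 9, so there are 256 * 16 = 4096
# 2D kernels, and each output map has the "valid" size 14 - 8 = 6.
rng = np.random.default_rng(0)
layer2 = rng.random((64, 14, 14))
out_maps = []
for j in range(256):
    sel = rng.choice(64, size=16, replace=False)   # randomly selected 16 maps
    kernels = rng.random((16, 9, 9))               # 16 of the 4096 kernels
    out_maps.append(sum(convolve2d(layer2[s], k, mode="valid")
                        for s, k in zip(sel, kernels)))
print(np.stack(out_maps).shape)                    # -> (256, 6, 6)
```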

Augmentation

The step from convolution to pooling and subsampling can be augmented with rectification and Local Contrast Normalization (LCN)

x_i: ith feature map
x_ijk: (j, k)th pixel value of x_i

Rectification (R_abs): x_ijk → |x_ijk|


Subtractive normalization

x_ijk → v_ijk = x_ijk − ∑_{i,p,q} ω_pq x_{i,j+p,k+q},

where ω_pq is a Gaussian-like filter such that ∑_{i,p,q} ω_pq = 1

Divisive normalization

v_ijk → y_ijk = v_ijk / max(c, σ_jk),

where σ_jk = (∑_{i,p,q} ω_pq v²_{i,j+p,k+q})^{1/2}
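A rough sketch of these two normalization steps; the Gaussian window width, the boundary handling, and the constant c are assumptions, only the formulas above come from the lecture.

```python
import numpy as np

def lcn(x, radius=4, sigma=2.0, c=1e-2):
    """x: feature maps of shape (n_maps, H, W); returns the normalized maps."""
    n_maps, H, W = x.shape
    # Gaussian-like weights w_pq, scaled so that the sum over (i, p, q) is 1.
    ax = np.arange(-radius, radius + 1)
    w = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    w /= w.sum() * n_maps

    def local_avg(z, j, k):
        # sum_{i,p,q} w_pq z[i, j+p, k+q], treating out-of-range pixels as 0
        s = 0.0
        for pi, p in enumerate(ax):
            for qi, q in enumerate(ax):
                if 0 <= j + p < H and 0 <= k + q < W:
                    s += w[pi, qi] * z[:, j + p, k + q].sum()
        return s

    v = np.empty_like(x)
    for j in range(H):
        for k in range(W):
            v[:, j, k] = x[:, j, k] - local_avg(x, j, k)                 # subtractive step
    v2, y = v ** 2, np.empty_like(x)
    for j in range(H):
        for k in range(W):
            y[:, j, k] = v[:, j, k] / max(c, np.sqrt(local_avg(v2, j, k)))   # divisive step
    return y

print(lcn(np.random.rand(3, 16, 16)).shape)        # -> (3, 16, 16)
```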


Summary: Model architecture

There are n_1 input feature maps (images), each of size n_2 × n_3


x_i: ith image (input feature map)

k_ij: convolution kernel of size ℓ_1 × ℓ_2 operating on x_i to produce y_j, j = 1, ⋯, m_1, where m_1 is the number of output feature maps

y_j: jth output feature map

y_j = g_j tanh(∑_{i=1}^{n_1} k_ij ∗ x_i)   or   y_j = g_j sigm(∑_{i=1}^{n_1} k_ij ∗ x_i),   for j = 1, ⋯, m_1

[Hence g_j is called the gain coefficient]
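A minimal sketch of one such layer under the tanh form of the formula, with random kernels and gains standing in for trained ones and SciPy's convolve2d assumed:

```python
import numpy as np
from scipy.signal import convolve2d

# y_j = g_j * tanh( sum_{i=1}^{n1} k_ij * x_i ),  j = 1, ..., m1
rng = np.random.default_rng(1)
n1, m1, l1, l2 = 3, 4, 5, 5                  # input maps, output maps, kernel size
x = rng.random((n1, 20, 20))                 # n1 input feature maps
k = rng.standard_normal((n1, m1, l1, l2))    # kernels k_ij
g = rng.random(m1)                           # gain coefficients g_j
y = np.stack([g[j] * np.tanh(sum(convolve2d(x[i], k[i, j], mode="valid")
                                 for i in range(n1)))
              for j in range(m1)])
print(y.shape)                               # -> (4, 16, 16) "valid" output maps
```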


Notations

(a) C = convolution, S = sigm/tanh, G = gain ⇒ F_CSG

In LeCun's example above, Layer 1 is denoted by 64F^{9×9}_CSG

[64 = number of kernels, 9 × 9 = convolution kernel size]

(b) R_abs: rectification (= taking the absolute value)

(c) N: local contrast normalization (LCN)

(d) P_A: average pooling and subsampling
    P_M: max pooling and subsampling


3.3. Another example


The above process is denoted by

64F^{9×9}_CSG → R/N/P^{5×5}

The whole process is denoted by

64F^{9×9}_CSG → R/N/P^{5×5} → 256F^{9×9}_CSG → R/N/P^{4×4}


3.4. Classification

The final layer is fed into a classification layer, such as a softmax layer
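For instance, a minimal softmax classification layer might look as follows; the feature size and the number of classes are made-up values for illustration only.

```python
import numpy as np

# Fully connected map from the flattened final feature maps to class scores,
# followed by softmax; all sizes here are illustrative, not LeCun's.
rng = np.random.default_rng(2)
features = rng.random(256 * 4 * 4)                    # flattened final feature maps
W = rng.standard_normal((10, features.size)) * 0.01   # 10 classes (assumed)
b = np.zeros(10)
scores = W @ features + b
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.sum())                                    # -> 1.0 (a probability vector)
```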


These two layers are fully connected

Train the entire network in a supervised manner

Only the filters (kernels) are trained

The error derivative backpropagation has to be worked out across the R/N/P layers


3.5. Training convolutional network

Weight training (learning)

Convolution weights

Training is done just like for the usual neural network

To enforce convolution, one needs to maintain an equality constraint on the shared weights

Example

Suppose the weights satisfy ω_1 = ω_2 = ⋯ = ω_N due to the convolution constraint

During training, one gets updated values ω̃_1(new), ω̃_2(new), ⋯, ω̃_N(new)

To enforce the equality constraint, define

ω_i(new) = (1/N) ∑_{j=1}^{N} ω̃_j(new),   for i = 1, ⋯, N
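In code, enforcing the constraint amounts to resetting all tied weights to their common average after the unconstrained update (toy numbers below):

```python
import numpy as np

# Unconstrained update gives slightly different values for the tied weights...
omega_tilde_new = np.array([0.90, 1.10, 1.05, 0.95])
# ...so all of them are reset to the common average to restore the constraint.
omega_new = np.full_like(omega_tilde_new, omega_tilde_new.mean())
print(omega_new)    # -> [1. 1. 1. 1.]
```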


R/N/P

The computations in the R/N steps do not involve weights, so there is no need to worry about these steps during training

For the pooling step:

1D example: pooling by 3, subsampling by 2 (stride 2)


Combine the weights affecting the subsampling neurons to come up with an effective network


Derivative of max function

max(x_1, x_2) = (1/2){ |x_1 − x_2| + x_1 + x_2 }

∂_{x_1} max(x_1, x_2) = 1 if x_1 > x_2, and 0 otherwise

Similarly,

∂_{x_1} max(x_1, x_2, x_3) = 1 if x_1 > x_2 and x_1 > x_3, and 0 otherwise
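In backpropagation, this derivative means the upstream gradient is routed only to the input that attained the maximum in each pooling window; a small sketch with a made-up helper name:

```python
import numpy as np

def max_pool_backward(patch, upstream_grad):
    """Send the upstream gradient to the entry that attained the max, 0 elsewhere."""
    grad = np.zeros_like(patch)
    grad[np.unravel_index(np.argmax(patch), patch.shape)] = upstream_grad
    return grad

patch = np.array([[1.0, 3.0], [2.0, 0.5]])
print(max_pool_backward(patch, 1.0))   # gradient lands at the position of 3.0
```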

If the pooling is average or another L^p norm, the derivatives can easily be computed

Once the derivatives of the pooling layers are computed, the backpropagation algorithm can be applied