Statistical Learning Theory Part I – 2. Neural Network
Sumio Watanabe, Tokyo Institute of Technology

Page 1:

Statistical Learning Theory

Part I – 2.

Neural Network

Sumio Watanabe, Tokyo Institute of Technology

Page 2:

Review 1: The Purpose of this Lecture

The purpose of this lecture is not to introduce software products and applications, but to teach the mathematics and theoretical physics that are the foundations of machine learning and statistics.

Page 3:

Review 2: Framework of Supervised Learning

Training data: (X1,Y1), (X2,Y2), …, (Xn,Yn). Test data: (X,Y).

[Figure: an unknown information source q(x,y) generates both the training and test data; the learning machine p(y|x,w) is trained on the training data.]

Page 4:

I-2-1 True Distribution and Learning Machine

Page 5:

True distribution

Definition. Let q(x,y) be a joint probability density function to which independent random variables (X1,Y1), (X2,Y2), …, (Xn,Yn) are subject. Then q(x,y) is sometimes called the true density or true distribution. Note that q(x) and q(y|x) are called the true information source and the true conditional density (or teacher distribution), respectively.

Example. In machine learning and statistics, it is often assumed that training input data are independent and generated from some unknown information source, and that training output data (teaching data) are given by some unknown conditional density (sometimes a human).

Page 6:

Learning Machine = Statistical Model

Definition. Let p(y|x,w) be a conditional probability density function of Y for a given X and a parameter w. Then p(y|x,w) is said to be a learning machine (which estimates the conditional probability density function).

Example. Let x and y be 2-dimensional real vectors and w be a real 2×2 matrix. Then

p(y|x,w) = (1/(2π)) exp( −‖y − wx‖²/2 )

is a learning machine.
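As a concrete illustration (a minimal Python/NumPy sketch, not part of the original slides), the density above can be evaluated directly:

```python
import numpy as np

def p(y, x, w):
    """Density p(y|x,w) = (1/2pi) exp(-||y - w x||^2 / 2) for 2-dim x, y and a 2x2 matrix w."""
    r = y - w @ x
    return float(np.exp(-0.5 * r @ r) / (2 * np.pi))

w = np.array([[1.0, 0.5], [0.0, 1.0]])
print(p(np.array([1.0, 0.0]), np.array([0.5, 0.5]), w))  # ≈ 0.136
```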

Page 7:

Realizability

Definition. Let q(y|x) and p(y|x,w) be the true conditional probability density and a learning machine, respectively. If there exists w0 such that q(y|x) = p(y|x,w0) for an arbitrary (x,y), then q(y|x) is said to be realizable by p(y|x,w). Otherwise, it is called unrealizable.

Example. Let x,y,w be real values and q(x) be a standard normal distribution on R. Assume that

q(y|x) = (1/(2π)^{1/2}) exp( −(y − x − a)²/2 ),
p(y|x,w) = (1/(2π)^{1/2}) exp( −(y − wx)²/2 ).

Then q(y|x) is realizable by p(y|x,w) if and only if a=0.

Page 8:

Identifiability

Definition. Let p(y|x,w) be a learning machine. If "p(y|x,w1) = p(y|x,w2) for an arbitrary (x,y) implies w1 = w2", then p(y|x,w) is called identifiable; otherwise it is called nonidentifiable (unidentifiable).

Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that

p1(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − ax − b)²/2 ),
p2(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − a tanh(bx))²/2 ).

Then p1(y|x,(a,b)) is identifiable, but p2(y|x,(a,b)) is nonidentifiable, because tanh is odd: a tanh(bx) = (−a) tanh(−bx), so (a,b) and (−a,−b) give the same density.
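This symmetry can be checked numerically; a small sketch (Python/NumPy, an illustration rather than the lecture's code):

```python
import numpy as np

def p2(y, x, a, b):
    """p2(y|x,(a,b)): Gaussian density of y with mean a*tanh(b*x) and variance 1."""
    return np.exp(-0.5 * (y - a * np.tanh(b * x))**2) / np.sqrt(2 * np.pi)

# tanh is odd, so a*tanh(b*x) = (-a)*tanh(-b*x): the parameters (a,b) and (-a,-b)
# define the same conditional density, hence p2 is nonidentifiable.
print(p2(-0.3, 0.7, 1.5, 2.0), p2(-0.3, 0.7, -1.5, -2.0))  # equal values
```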

Page 9:

Fisher Information Matrix

Definition. Let p(y|x,w) be a learning machine, where w is in R^d (d is a positive integer). The Fisher information matrix I(w) = (Iij(w); i,j = 1,2,…,d) is defined by

Iij(w) = ∫∫ fi(x,y,w) fj(x,y,w) p(y|x,w) q(x) dx dy,

fi(x,y,w) = (∂/∂wi) log p(y|x,w).

Note. By definition, the Fisher information matrix is a real, symmetric, positive-semidefinite matrix (its eigenvalues are all real and nonnegative).

Page 10:

Example

Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that

p(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − a(x − b))²/2 ).

Then log p(y|x,(a,b)) = −(y − a(x − b))²/2 + const. Hence

f1(x,y,(a,b)) = −(y − a(x − b))(−x + b) = (y − a(x − b))(x − b),
f2(x,y,(a,b)) = −(y − a(x − b)) a.

Thus

I(a,b) = [ 1+b²   ab ]
         [  ab    a² ].
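The entries of I(a,b) can be verified by Monte Carlo integration over q(x) and p(y|x,(a,b)); a sketch (Python/NumPy; an assumption, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher(a, b, n=500_000):
    """Monte-Carlo estimate of I(a,b) for p(y|x,(a,b)) = N(y | a(x-b), 1), q(x) = N(0,1)."""
    x = rng.standard_normal(n)                # x ~ q(x)
    y = a * (x - b) + rng.standard_normal(n)  # y ~ p(y|x,(a,b))
    f1 = (y - a * (x - b)) * (x - b)          # (d/da) log p
    f2 = -(y - a * (x - b)) * a               # (d/db) log p
    f = np.stack([f1, f2])
    return f @ f.T / n                        # E[f_i f_j]

print(fisher(1.0, 0.5))  # ≈ [[1+b², ab], [ab, a²]] = [[1.25, 0.5], [0.5, 1.0]]
```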

Page 11:

Regularity

Definition. Let p(y|x,w) be a learning machine. If the Fisher information matrix is positive definite (i.e., all eigenvalues are positive), then the learning machine is called regular. Otherwise it is called nonregular or singular.

Example. Let x,y be real values and q(x) be a standard normal distribution on R. A learning machine

p(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − a(x − b))²/2 )

is not regular, because the Fisher information matrix is not positive definite at a = 0: there I(0,b) has eigenvalues 1+b² and 0.
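The zero eigenvalue at a = 0 can be confirmed directly from the matrix computed on the previous page (a small sketch, assuming Python/NumPy):

```python
import numpy as np

def I(a, b):
    """Fisher information matrix of p(y|x,(a,b)) = N(y | a(x-b), 1) from the previous slide."""
    return np.array([[1 + b**2, a * b], [a * b, a**2]])

print(np.linalg.eigvalsh(I(0.0, 1.0)))  # [0. 2.] -> a zero eigenvalue, so not positive definite
```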

Page 12:

Identifiable, Regular and Realizable Case

Definition. Let q(y|x) and p*(y|x) be the true distribution and a trained learning machine. The Kullback-Leibler distance is defined by

D(q||p*) = ∫∫ q(x) q(y|x) log( q(y|x) / p*(y|x) ) dx dy.

Assume that the learning machine is identifiable and regular, and that the true distribution is realizable by the learning machine. Also assume that p*(y|x) is given by one of maximum likelihood, maximum a posteriori, or Bayes estimation. Then the average generalization error E[ D(q||p*) ] is given by

E[ D(q||p*) ] = d/(2n) + o(1/n),

where d is the dimension of the parameter.
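The formula E[D] = d/(2n) + o(1/n) can be checked by simulation in the simplest realizable, identifiable, regular case. A sketch (assumptions: Python/NumPy, the linear-Gaussian model p(y|x,(a,b)) = N(y | ax + b, 1) with true parameters (0.5, 0.5), maximum likelihood estimation, so d = 2):

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0, n, trials = 0.5, 0.5, 100, 2000
D = []
for _ in range(trials):
    x = rng.standard_normal(n)
    y = a0 * x + b0 + rng.standard_normal(n)
    A = np.stack([x, np.ones(n)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # MLE = least squares here
    # KL divergence between the two unit-variance Gaussian regressions, x ~ N(0,1):
    D.append(((a - a0) ** 2 + (b - b0) ** 2) / 2)
print(np.mean(D), "vs d/(2n) =", 2 / (2 * n))       # both ≈ 0.01
```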

Page 13:

Understanding Learning Machines

If a learning machine p(y|x,w) is identifiable and regular, then its statistical learning theory can be derived from classical mathematical statistics.

However, hierarchical learning machines such as neural networks are

(1) nonidentifiable,
(2) singular.

Whether or not the true distribution is realizable by the machine depends on the true distribution.

Page 14:

I-2-2 Neural Network

Page 15:

Neural Network

An artificial neural network was devised by simulating biological neural networks, but it is different from a real biological one.

[Figure: a network mapping an input x to an output f(x,w), with parameter w.]

Page 16:

Sigmoid function

A sigmoid function is defined by

σ(u) = 1/(1 + e^{−u}).

Then it satisfies the differential equation

σ′ = σ(1 − σ).

[Figure: graph of σ(u), increasing from 0 to 1, with σ(0) = 1/2.]
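The identity σ′ = σ(1 − σ) is easy to confirm numerically; a small sketch (Python/NumPy, not part of the slides):

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

u, h = 0.3, 1e-6
numeric = (sigmoid(u + h) - sigmoid(u - h)) / (2 * h)  # central finite difference
print(numeric, sigmoid(u) * (1 - sigmoid(u)))          # both ≈ 0.24446
```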

Page 17:

Rectified Linear Unit (ReLU)

A ReLU function is defined by

σ(u) = max(0, u).

Then

σ′(u) = 1 (u > 0), 0 (u < 0).

[Figure: graph of σ(u) = max(0, u).]
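For completeness, the same check for ReLU (a sketch; the value at u = 0 is a convention, since the derivative is undefined there):

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def relu_prime(u):
    return (u > 0).astype(float)  # 1 for u > 0, 0 for u < 0; we pick 0 at u = 0

u = np.array([-1.0, 0.0, 2.0])
print(relu(u), relu_prime(u))  # [0. 0. 2.] [0. 0. 1.]
```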

Page 18:

One Neuron

A neuron is defined by its output

σ( Σ_{i=1}^{M} wi xi + θ ),

where (x1,…,xM) is the input vector and (w1,…,wM, θ) is the parameter (the wi are the synapse weights, θ the threshold).

[Figure: a neuron receiving inputs x1,…,xM through synapse weights w1,…,wM.]
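A single neuron is one line of arithmetic; a sketch (Python/NumPy, an illustration rather than the lecture's code):

```python
import numpy as np

def neuron(x, w, theta):
    """Output sigma(sum_i w_i x_i + theta) of one neuron with sigmoid activation."""
    return 1 / (1 + np.exp(-(w @ x + theta)))

x = np.array([1.0, -2.0, 0.5])   # input vector (M = 3)
w = np.array([0.3, 0.1, -0.4])   # synapse weights
print(neuron(x, w, theta=0.2))   # a value in (0, 1)
```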

Page 19:

Structure of Layered Neural Network

Output of hidden unit j:

oj = σ( Σ_{k=1}^{M} wjk xk + θj ).

Output of output unit i:

fi = σ( Σ_{j=1}^{H} uij oj + φi ).

Parameter: w = { wjk, θj, uij, φi }.

[Figure: a three-layer network with input units xk, hidden units oj, and output units fi.]
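The whole network is two such layers composed; a sketch (Python/NumPy; the sizes M = 25, H = 6, 2 outputs echo the digit example on the following slides, as an assumption):

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def forward(x, W, theta, U, phi):
    """o_j = sigma(sum_k W_jk x_k + theta_j), then f_i = sigma(sum_j U_ij o_j + phi_i)."""
    o = sigmoid(W @ x + theta)   # hidden units
    return sigmoid(U @ o + phi)  # output units

M, H, K = 25, 6, 2
rng = np.random.default_rng(0)
W, theta = rng.standard_normal((H, M)), rng.standard_normal(H)
U, phi = rng.standard_normal((K, H)), rng.standard_normal(K)
print(forward(rng.random(M), W, theta, U, phi))  # two outputs in (0, 1)
```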

Page 20:

Example of q(x)q(y|x)

[Figure: sample images of the digits '0' and '6' as examples of q(x), with their labels as examples of q(y|x).]

Page 21:

Neural Network = Learning Machine

[Figure: a neural network p(y|x,w) with 25 input units (one per image pixel), 6 hidden units, and 2 output units, classifying images of '0' and '6'.]

Page 22:

Classification

[Figure: a network with input, hidden, and output layers; each training input is mapped to an output, which is compared with the desired output. Training data: n = 100.]

Page 23:

Learning in a Neural Network

[Figure: panels showing the training data, the true distribution, and the trained neural network.]

Page 24:

I-2-3 Function Approximation by Neural Networks

Page 25:

Function Approximation Theorem

Theorem. Let A be an arbitrary compact set in R^M. For an arbitrary continuous function g(x) and an arbitrary positive value ε > 0, there exists a function f(x,w), realized by a neural network, such that

sup_{x in A} | g(x) − f(x,w) | < ε.

Remark. Based on this theorem, people sometimes say that neural networks are universal approximators. However, polynomials and trigonometric functions are also universal approximators. There are many universal approximators.

Page 26:

Function Approximation Problem

F(x) = Σ_{j=1}^{H} aj f( bj, x ).

Classical function expansions: the parameter is {aj}.
Neural networks: the parameter is {aj, bj}.

[Figure: a network with inputs x1, x2, x3, inner parameters bj, output weights aj, and output y.]

Page 27:

Conventional functions and Neural Networks

Conventional functions (M = dim x): for arbitrary fixed {bj}, there exist g and H such that

min_{aj} ‖ g(x) − Σ_{j=1}^{H} aj f( bj, x ) ‖² ≥ C1(g) / H^{2/M}.

Neural networks: for arbitrary g(x) and H,

min_{aj,bj} ‖ g(x) − Σ_{j=1}^{H} aj f( bj, x ) ‖² ≤ C2(g) / H.

Page 28:

Curse of Dimensionality

Let M be the dimension of the input x. For the conventional functions, the function approximation error is larger than order 1/H^{2/M}, which cannot be made small if M is very large. This fact was called the "curse of dimensionality".

On the other hand, neural networks make it smaller than order 1/H, for any M. Nowadays, researchers say, "By neural networks, the curse of dimensionality was resolved."

However, it should be emphasized that a small function approximation error does not always mean a small generalization error.

Page 29:

I-2-4 Likelihood Function of Neural Networks

Page 30:

Likelihood Function

Definition. Let q(x)q(y|x) be the true density function and p(y|x,w) be a learning machine. Assume that {(Xi,Yi) ; i = 1,2,…,n} is a set of independent random variables whose probability density function is q(x)q(y|x). Then the likelihood function L(w) is defined by

L(w) = Π_{i=1}^{n} p(Yi|Xi,w).
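In practice the product Π p(Yi|Xi,w) underflows for large n, so one computes log L(w) instead. A sketch for the Gaussian regression model used on the next slide (an assumption, not the lecture's code):

```python
import numpy as np

def log_likelihood(a, b, X, Y):
    """log L(w) = sum_i log p(Y_i|X_i,w) for p(y|x,(a,b)) = N(y | aX + b, 1)."""
    r = Y - (a * X + b)
    return -0.5 * np.sum(r**2) - 0.5 * len(X) * np.log(2 * np.pi)
```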

Page 31:

Example: Regular Model

Example. Assume that q(x) is a standard normal distribution,

p(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − ax − b)²/2 ),

q(y|x) = p(y|x, 0.5, 0.5).

In this case, q(y|x) is realizable by p(y|x,a,b), and p(y|x,a,b) is identifiable and regular.

Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if (a,b) = (0.5, 0.5).

Page 32:

Example of Likelihood (n=100)

If a learning machine is regular, then the likelihood function becomes localized as n tends to infinity.

[Figure: likelihood surface over (a,b), with a single sharp peak.]
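This localization can be seen numerically; a sketch under the stated example (n = 100, true (a,b) = (0.5, 0.5); not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n)
y = 0.5 * x + 0.5 + rng.standard_normal(n)

def log_L(a, b):                       # log-likelihood up to an additive constant
    return -0.5 * np.sum((y - (a * x + b))**2)

grid = np.linspace(-1.0, 2.0, 61)
L = np.array([[log_L(a, b) for b in grid] for a in grid])
i, j = np.unravel_index(L.argmax(), L.shape)
print("peak near (a, b) =", (grid[i], grid[j]))  # a single peak near (0.5, 0.5)
```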

Page 33:

Example: Neural Networks

Example. Assume that q(x) is a standard normal distribution,

p(y|x,(a,b)) = (1/(2π)^{1/2}) exp( −(y − a tanh(bx))²/2 ),

q(y|x) = p(y|x, 0.5, 0.5).

In this case, q(y|x) is realizable by p(y|x,a,b). However, p(y|x,a,b) is nonidentifiable and singular.

Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if (a,b) = (0.5, 0.5) or (a,b) = (−0.5, −0.5), but there are many parameters which make the likelihood function almost maximal.

Page 34:

Example of Likelihood (n=100)

If a learning machine is singular, then the likelihood function is not localized even when n tends to infinity.

[Figure: likelihood surface over (a,b), with a ridge of nearly maximal parameters.]
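The contrast with the regular case can be seen in a few lines; a sketch under the tanh example (n = 100, true (a,b) = (0.5, 0.5); not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.standard_normal(n)
y = 0.5 * np.tanh(0.5 * x) + rng.standard_normal(n)

def log_L(a, b):                      # log-likelihood up to an additive constant
    return -0.5 * np.sum((y - a * np.tanh(b * x))**2)

print(log_L(0.5, 0.5), log_L(-0.5, -0.5))  # exactly equal: two symmetric peaks
# Near-flat ridge: for small b*x, a*tanh(b*x) is approximately (a*b)*x, so
# parameters with a*b close to 0.25 fit the data almost equally well.
for a in [0.5, 1.0, 2.5]:
    print((a, round(0.25 / a, 3)), round(log_L(a, 0.25 / a), 2))
```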

Page 35:

Neural Networks are Singular.

If a learning machine is identifiable and regular, and if the true distribution is realizable by the learning machine, then the maximum likelihood estimator converges to the true parameter and the Bayes posterior distribution concentrates on the true parameter.

However, if a learning machine is nonidentifiable and singular, then such properties do not hold. In neural networks, there are infinitely many parameters which make the likelihood function almost maximal.

We need a new mathematical learning theory.

Page 36:

Contents of Part I

1. Basic Concepts in Statistical Learning
2. Neural Network
3. Learning in Neural Network, Report Writing (1)
4. Boltzmann Machine
5. Deep Learning
6. Information and Entropy, Report Writing (2)
7. Prediction Accuracy
8. Knowledge Discovery, Report Writing (3)