Statistical Learning Theory
Part I – 2.
Neural Network
Sumio Watanabe, Tokyo Institute of Technology
Review 1: The Purpose of this Lecture
The purpose of this lecture is not to introduce software products and applications, but to teach the mathematics and theoretical physics that are the foundations of machine learning and statistics.
Review 2: Framework of Supervised Learning
[Diagram] Training data (X1,Y1), (X2,Y2), …, (Xn,Yn) are generated by an unknown information source q(x,y); the learning machine p(y|x,w) is trained on them and predicts the test output Y from a test input X.
I-2-1 True Distribution and Learning Machine
True distribution
Definition. Let q(x,y) be a simultaneous (joint) probability density function to which the independent random variables (X1,Y1), (X2,Y2), …, (Xn,Yn) are subject. Then q(x,y) is sometimes called the true density or true distribution. Note that q(x) and q(y|x) are called the true information source and the true conditional density (or teacher distribution), respectively.
Example. In machine learning and statistics, it is often assumed that the training input data are independent and generated from some unknown information source, and that the training output data (teaching data) are given by some unknown conditional density (sometimes it is a human).
Learning Machine = Statistical Model
Definition. Let p(y|x,w) be a conditional probability density function of Y for a given X and a parameter w. Then p(y|x,w) is said to be a learning machine (which estimates the conditional probability density function).
Example. Let x and y be 2-dimensional real vectors and w be a real 2×2 matrix. Then
$$p(y\mid x,w) = \frac{1}{2\pi} \exp\!\left( -\frac{\|y - wx\|^2}{2} \right)$$
is a learning machine.
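As a concrete illustration (a sketch of mine, not from the lecture; the function names are assumptions), this machine can be evaluated numerically in Python:

```python
import numpy as np

def p(y, x, w):
    """Density p(y|x,w) = (1/(2*pi)) * exp(-||y - w x||^2 / 2),
    where x, y are 2-dimensional vectors and w is a 2x2 matrix."""
    r = y - w @ x
    return np.exp(-0.5 * (r @ r)) / (2 * np.pi)

w = np.array([[1.0, 0.2],
              [0.0, 1.0]])
x = np.array([0.5, -1.0])
y = np.array([0.4, -0.9])
print(p(y, x, w))  # density of this y given x under parameter w
```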
Realizability
Definition. Let q(y|x) and p(y|x,w) be the true conditional probability density and a learning machine, respectively. If there exists w0 such that q(y|x) = p(y|x,w0) for arbitrary (x,y), then q(y|x) is said to be realizable by p(y|x,w). Otherwise, it is called unrealizable.
Example. Let x,y,w be real values and q(x) be a standard normal distribution on R. Assume that
$$q(y\mid x) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-x-a)^2}{2} \right), \qquad p(y\mid x,w) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-wx)^2}{2} \right).$$
Then q(y|x) is realizable by p(y|x,w) if and only if a = 0: matching the means requires wx = x + a for all x, which forces w = 1 and a = 0.
Identifiability
Definition. Let p(y|x,w) be a learning machine. If “p(y|x,w1) = p(y|x,w2) for arbitrary (x,y) implies w1 = w2”, then p(y|x,w) is called identifiable; otherwise it is called nonidentifiable (unidentifiable).
Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that
$$p_1(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-ax-b)^2}{2} \right), \qquad p_2(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-a\tanh(bx))^2}{2} \right).$$
Then p1(y|x,(a,b)) is identifiable, but p2(y|x,(a,b)) is nonidentifiable: since tanh is odd, (a,b) and (−a,−b) define the same density, and when a = 0 every value of b gives the same density.
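A quick numerical check of the nonidentifiability (again a sketch of mine): the parameters (a,b) and (−a,−b) give exactly the same density, and at a = 0 the value of b has no effect.

```python
import numpy as np

def p2(y, x, a, b):
    """p2(y|x,(a,b)) = (1/sqrt(2*pi)) * exp(-(y - a*tanh(b*x))^2 / 2)."""
    return np.exp(-0.5 * (y - a * np.tanh(b * x)) ** 2) / np.sqrt(2 * np.pi)

x, y = 0.7, 0.3
print(p2(y, x, 0.5, 0.5))    # parameter (0.5, 0.5)
print(p2(y, x, -0.5, -0.5))  # parameter (-0.5, -0.5): identical value
print(p2(y, x, 0.0, 1.0), p2(y, x, 0.0, -3.0))  # a = 0: any b gives the same value
```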
Fisher Information Matrix
Definition. Let p(y|x,w) be a learning machine, where w is in R^d (d is a positive integer). The Fisher information matrix I(w) = (I_ij(w); i, j = 1, 2, …, d) is defined by
$$I_{ij}(w) = \int\!\!\int f_i(x,y,w)\, f_j(x,y,w)\, p(y\mid x,w)\, q(x)\, dx\, dy, \qquad f_i(x,y,w) = \frac{\partial}{\partial w_i} \log p(y\mid x,w).$$
Note. By definition, the Fisher information matrix is a real, symmetric, positive-semidefinite matrix (its eigenvalues are all real and nonnegative).
Example
Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a(x-b))^2}{2} \right).$$
Then $\log p(y\mid x,(a,b)) = -(y - a(x-b))^2/2 + \mathrm{const}$. Hence
$$f_1(x,y,(a,b)) = (y - a(x-b))(x-b), \qquad f_2(x,y,(a,b)) = -a\,(y - a(x-b)).$$
Thus
$$I(a,b) = \begin{pmatrix} 1+b^2 & ab \\ ab & a^2 \end{pmatrix}.$$
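This closed form can be verified by Monte Carlo integration; below is a sketch under the stated assumptions (x standard normal, unit noise variance; names mine):

```python
import numpy as np

def fisher_mc(a, b, n=200_000, seed=0):
    """Monte Carlo estimate of I(a,b) for p(y|x,(a,b)) = N(y; a(x-b), 1), x ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = a * (x - b) + rng.standard_normal(n)  # y drawn from p(y|x,(a,b))
    r = y - a * (x - b)                       # residual
    f = np.stack([r * (x - b), -a * r])       # score components f1, f2
    return f @ f.T / n

print(fisher_mc(0.5, 0.5))                     # Monte Carlo estimate
print(np.array([[1.25, 0.25], [0.25, 0.25]]))  # closed form [[1+b^2, ab], [ab, a^2]]
```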
Regularity
Definition. Let p(y|x,w) be a learning machine. If the Fisher information matrix is positive definite (i.e., all its eigenvalues are positive) for every parameter w, then the learning machine is called regular. Otherwise it is called nonregular or singular.
Example. Let x, y be real values and q(x) be a standard normal distribution on R. The learning machine
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a(x-b))^2}{2} \right)$$
is not regular, because the Fisher information matrix is not positive definite at a = 0.
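From the closed form above, I(0,b) = diag(1+b², 0), which has a zero eigenvalue. A short numerical check (a sketch of mine):

```python
import numpy as np

def I(a, b):
    # closed-form Fisher information from the previous example
    return np.array([[1 + b**2, a * b], [a * b, a**2]])

print(np.linalg.eigvalsh(I(0.5, 0.5)))  # both eigenvalues positive: regular point
print(np.linalg.eigvalsh(I(0.0, 0.5)))  # one eigenvalue is zero: singular point
```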
Identifiable, Regular and Realizable Case
Definition. Let q(y|x) and p*(y|x) be the true distribution and a trained learning machine. The Kullback-Leibler distance is defined by
$$D(q\|p^*) = \int\!\!\int q(x)\, q(y\mid x) \log \frac{q(y\mid x)}{p^*(y\mid x)}\, dx\, dy.$$
Assume that the learning machine is identifiable and regular, and that the true distribution is realizable by the learning machine. Also assume that p*(y|x) is given by one of maximum likelihood, maximum a posteriori, or Bayes estimation. Then the average generalization error E[D(q||p*)] is given by
$$E[\,D(q\|p^*)\,] = \frac{d}{2n} + o\!\left(\frac{1}{n}\right),$$
where d is the dimension of the parameter.
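For the regular model p(y|x,(a,b)) = N(y; ax+b, 1) with x ~ N(0,1), the KL distance reduces to D = ((â−a₀)² + (b̂−b₀)²)/2, so the theorem can be checked by simulation. A sketch (the true parameters and trial counts are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0, n, trials = 0.5, 0.5, 100, 2000
D = []
for _ in range(trials):
    x = rng.standard_normal(n)
    y = a0 * x + b0 + rng.standard_normal(n)
    a, b = np.polyfit(x, y, 1)  # maximum likelihood = least squares for this model
    D.append(((a - a0) ** 2 + (b - b0) ** 2) / 2)  # KL distance for x ~ N(0,1)

print(np.mean(D))   # approximately d/(2n) = 2/200 = 0.01
```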
Understanding Learning Machines
If a learning machine p(y|x,w) is identifiable and regular, then its statistical learning theory can be derived from classical mathematical statistics.
However, hierarchical learning machines such as neural networks are
(1) nonidentifiable, and
(2) singular.
Whether or not the true distribution is realizable depends on the true distribution.
I-2-2 Neural Network
Neural Network
The artificial neural network was devised by simulating the biological neural network, but it is different from the real biological one.
x : input, f(x,w) : output, w : parameter.
Sigmoid Function
A sigmoid function is defined by
$$\sigma(u) = \frac{1}{1+e^{-u}}.$$
Then it satisfies the differential equation
$$\sigma'(u) = \sigma(u)\,(1-\sigma(u)).$$
[Figure: graph of σ(u), increasing from 0 to 1, with σ(0) = 1/2.]
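The identity can be checked by direct differentiation:
$$\sigma'(u) = \frac{e^{-u}}{(1+e^{-u})^2} = \frac{1}{1+e^{-u}} \cdot \frac{e^{-u}}{1+e^{-u}} = \sigma(u)\,\bigl(1-\sigma(u)\bigr).$$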
Rectified Linear Unit (ReLU)
A ReLU function is defined by
$$\sigma(u) = \max(0, u).$$
Then
$$\sigma'(u) = \begin{cases} 1 & (u > 0) \\ 0 & (u < 0). \end{cases}$$
[Figure: graph of σ(u) = max(0, u).]
One Neuron
A neuron is defined by
$$\text{Output} = \sigma\!\left( \sum_{i=1}^{M} w_i x_i + \theta \right),$$
where (x1, …, xM) is the input vector, w1, …, wM are the synapse weights, and θ is the bias; the parameter is (w1, …, wM, θ).
[Figure: a single neuron receiving inputs x1, …, xM through synapse weights w1, …, wM.]
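A minimal sketch of this neuron in Python (names mine):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neuron(x, w, theta):
    """Output of one neuron: sigma(sum_i w_i * x_i + theta)."""
    return sigmoid(np.dot(w, x) + theta)

x = np.array([1.0, -0.5, 2.0])   # input vector (M = 3)
w = np.array([0.3, 0.8, -0.1])   # synapse weights
print(neuron(x, w, theta=0.2))
```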
Structure of Layered Neural Network
Output of hidden unit:
$$o_j = \sigma\!\left( \sum_{k=1}^{M} w_{jk} x_k + \theta_j \right).$$
Output of output unit:
$$f_i = \sigma\!\left( \sum_{j=1}^{H} u_{ij} o_j + \varphi_i \right).$$
The parameter is w = {w_jk, θ_j, u_ij, φ_i}.
[Figure: a three-layer network with input units x_k, hidden units o_j, and output units f_i.]
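The complete forward pass of this three-layer network, as a sketch (the layer sizes and random parameters are my choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, w, theta, u, phi):
    """Three-layer network: input x (M,) -> hidden o (H,) -> output f (N,)."""
    o = sigmoid(w @ x + theta)   # o_j = sigma(sum_k w_jk x_k + theta_j)
    f = sigmoid(u @ o + phi)     # f_i = sigma(sum_j u_ij o_j + phi_i)
    return f

M, H, N = 4, 3, 2
rng = np.random.default_rng(0)
w, theta = rng.standard_normal((H, M)), rng.standard_normal(H)
u, phi = rng.standard_normal((N, H)), rng.standard_normal(N)
print(forward(rng.standard_normal(M), w, theta, u, phi))
```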
Example of q(x)q(y|x)
[Figure: an example of q(x) and q(y|x). The inputs are 25-pixel images of the digits 0 and 6, and the learning machine p(y|x,w) is a neural network with 25 input units, 6 hidden units, and 2 output units.]
Neural Network = Learning Machine
[Figure: a neural network with an input layer, a hidden layer, and an output layer; the input, the output, and the desired output are indicated.]
Learning in a Neural Network
[Figure: a classification task with training data, n = 100; panels show the data, the true distribution, and the trained neural network.]
I-2-3 Function Approximation by Neural Networks
Function Approximation Theorem
Theorem. Let A be an arbitrary compact set in R^M. For an arbitrary continuous function g(x) and an arbitrary positive value ε > 0, there exists a function f(x,w) realized by a neural network such that
$$\sup_{x \in A} |\, g(x) - f(x,w) \,| < \varepsilon.$$
Remark. Based on this theorem, people sometimes say that neural networks are universal approximators. However, polynomials and trigonometric functions are also universal approximators; there are many universal approximators.
Function Approximation Problem
Consider approximating a function y = g(x) by
$$F(x) = \sum_{j=1}^{H} a_j\, f(b_j, x).$$
Classical functions: the parameter is {a_j} (the basis functions f(b_j, x) are fixed).
Neural networks: the parameter is {a_j, b_j} (the basis functions themselves are adapted).
[Figure: a network computing F(x) from inputs x1, x2, x3 through units f(b_j, x) with coefficients a_j.]
Conventional functions and Neural Networks
Conventional functions (M = dim x): for an arbitrary fixed {b_j}, there exist g and H such that
$$\min_{\{a_j\}} \left\| g(x) - \sum_{j=1}^{H} a_j f(b_j, x) \right\|^2 \;\geq\; \frac{C_1(g)}{H^{2/M}}.$$
Neural networks: for arbitrary g(x) and H,
$$\min_{\{a_j, b_j\}} \left\| g(x) - \sum_{j=1}^{H} a_j f(b_j, x) \right\|^2 \;\leq\; \frac{C_2(g)}{H}.$$
Curse of Dimensionality
Let M be the dimension of the input x. For the conventional functions, the function approximation error is larger than C1(g)/H^{2/M}, which cannot be made small if M is very large. This fact was called the “curse of dimensionality.”
On the other hand, neural networks make the error smaller than C2(g)/H for any M. Nowadays researchers say, “The curse of dimensionality was resolved by neural networks.”
However, it should be emphasized that a small function approximation error does not always mean a small generalization error.
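A quick sense of scale (the target error 0.01 and M = 10 are my example numbers): to push the bound down to ε, a conventional basis needs about ε^{−M/2} terms, while a network needs only about 1/ε hidden units.

```python
# To achieve error bound eps with a conventional basis: 1/H^(2/M) <= eps,
# i.e. H >= eps^(-M/2). With a neural network: 1/H <= eps, i.e. H >= 1/eps.
M, eps = 10, 0.01
print(eps ** (-M / 2))   # 1e10 basis functions in the conventional case
print(1 / eps)           # 100 hidden units for the neural network, for any M
```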
I-2-4 Likelihood Function of Neural Networks
Likelihood Function
Definition. Let q(x)q(y|x) be the true density function and p(y|x,w) be a learning machine. Assume that {(Xi,Yi); i = 1, 2, …, n} is a set of independent random variables whose probability density function is q(x)q(y|x). Then the likelihood function L(w) is defined by
$$L(w) = \prod_{i=1}^{n} p(Y_i \mid X_i, w).$$
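In practice one works with the log-likelihood to avoid numerical underflow of the product. A sketch for the Gaussian regression machine used in the next example (the data generation is assumed, not from the slides):

```python
import numpy as np

def log_likelihood(a, b, X, Y):
    """log L(a,b) = sum_i log p(Y_i|X_i,(a,b)) for p(y|x,(a,b)) = N(y; a*x + b, 1)."""
    r = Y - (a * X + b)
    return -0.5 * np.sum(r ** 2) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
Y = 0.5 * X + 0.5 + rng.standard_normal(100)  # true parameter (0.5, 0.5)
print(log_likelihood(0.5, 0.5, X, Y))   # high at the true parameter
print(log_likelihood(0.0, 0.0, X, Y))   # lower at a wrong parameter
```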
Example: Regular Model
Example. Assume that q(x) is a standard normal distribution,
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - ax - b)^2}{2} \right), \qquad q(y\mid x) = p(y\mid x, 0.5, 0.5).$$
In this case, q(y|x) is realizable by p(y|x,a,b), and p(y|x,a,b) is identifiable and regular.
Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if a = b = 0.5.
Example of Likelihood (n=100)
If a learning machine is regular, then the likelihood function becomes localized when n tends to infinity.
[Figure: the likelihood surface over (a,b) is concentrated around a single peak.]
Example: Neural Networks
Example. Assume that q(x) is a standard normal distribution,
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a\tanh(bx))^2}{2} \right), \qquad q(y\mid x) = p(y\mid x, 0.5, 0.5).$$
In this case, q(y|x) is realizable by p(y|x,a,b). However, p(y|x,a,b) is nonidentifiable and singular.
Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if a = b = 0.5 or a = b = −0.5, but there are many parameters which make the likelihood function almost maximal.
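The non-localized likelihood can be seen by evaluating log L(a,b) on a grid; a sketch (the grid range and sample size are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
Y = 0.5 * np.tanh(0.5 * X) + rng.standard_normal(100)  # true (a,b) = (0.5, 0.5)

def log_L(a, b):
    r = Y - a * np.tanh(b * X)
    return -0.5 * np.sum(r ** 2)   # log-likelihood up to an additive constant

A, B = np.meshgrid(np.linspace(-2, 2, 81), np.linspace(-2, 2, 81))
L = np.vectorize(log_L)(A, B)
# Many grid points come within 1 nat of the maximum: a ridge, not a single peak.
print((L > L.max() - 1.0).sum())
```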
Example of Likelihood (n=100)
If a learning machine is singular, then the likelihood function is not localized even when n tends to infinity.
[Figure: the likelihood surface over (a,b) shows an extended ridge rather than a single peak.]
Neural Networks are Singular.
If a learning machine is identifiable and regular, and if the true distribution is realizable by the learning machine, then the maximum likelihood estimator converges to the true parameter and the Bayes posterior distribution concentrates on the true parameter.
However, if a learning machine is nonidentifiable and singular, then such properties do not hold. In neural networks, there are infinitely many parameters which make the likelihood function almost maximal.
We need a new mathematical learning theory.
Contents of Part I
1. Basic Concepts in Statistical Learning
2. Neural Network
3. Learning in Neural Network, Report Writing (1)
4. Boltzmann Machine
5. Deep Learning
6. Information and Entropy, Report Writing (2)
7. Prediction Accuracy
8. Knowledge Discovery, Report Writing (3)