Statistical Learning Theory
Part I – 2.
Neural Network
Sumio Watanabe, Tokyo Institute of Technology
Review 1: The Purpose of this Lecture
The purpose of this lecture is not to introduce software products and applications, but to teach the mathematics and theoretical physics that are the foundations of machine learning and statistics.
Review 2: Framework of Supervised Learning
[Diagram] Training data (X1,Y1), (X2,Y2), …, (Xn,Yn) are generated by an unknown information source q(x,y); the learning machine p(y|x,w) is trained on them and predicts the test output Y from a test input X.
I-2-1 True Distribution and Learning Machine
True distribution
Definition. Let q(x,y) be a simultaneous (joint) probability density function to which the independent random variables (X1,Y1), (X2,Y2), …, (Xn,Yn) are subject. Then q(x,y) is sometimes called the true density or true distribution. Note that q(x) and q(y|x) are called the true information source and the true conditional density (or teacher distribution), respectively.
Example. In machine learning and statistics, it is often assumed that the training input data are independent and generated from some unknown information source, and that the training output data (teaching data) are given by some unknown conditional density (sometimes it is a human).
Learning Machine = Statistical Model
Definition. Let p(y|x,w) be a conditional probability density function of Y for a given X and a parameter w. Then p(y|x,w) is said to be a learning machine (which estimates the conditional probability density function).
Example. Let x and y be 2-dimensional real vectors and w be a real 2×2 matrix. Then
$$p(y\mid x,w) = \frac{1}{2\pi} \exp\!\left( -\frac{\|y - wx\|^2}{2} \right)$$
is a learning machine.
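As a concrete illustration (a sketch of mine, not from the lecture; the function names are assumptions), this machine can be evaluated numerically in Python:

```python
import numpy as np

def p(y, x, w):
    """Density p(y|x,w) = (1/(2*pi)) * exp(-||y - w x||^2 / 2),
    where x, y are 2-dimensional vectors and w is a 2x2 matrix."""
    r = y - w @ x
    return np.exp(-0.5 * (r @ r)) / (2 * np.pi)

w = np.array([[1.0, 0.2],
              [0.0, 1.0]])
x = np.array([0.5, -1.0])
y = np.array([0.4, -0.9])
print(p(y, x, w))  # density of this y given x under parameter w
```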
Realizability
Definition. Let q(y|x) and p(y|x,w) be the true conditional probability density and a learning machine, respectively. If there exists w0 such that q(y|x) = p(y|x,w0) for arbitrary (x,y), then q(y|x) is said to be realizable by p(y|x,w). Otherwise, it is called unrealizable.
Example. Let x,y,w be real values and q(x) be a standard normal distribution on R. Assume that
$$q(y\mid x) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-x-a)^2}{2} \right), \qquad p(y\mid x,w) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-wx)^2}{2} \right).$$
Then q(y|x) is realizable by p(y|x,w) if and only if a = 0: matching the means requires wx = x + a for all x, which forces w = 1 and a = 0.
Identifiability
Definition. Let p(y|x,w) be a learning machine. If “p(y|x,w1) = p(y|x,w2) for arbitrary (x,y) implies w1 = w2”, then p(y|x,w) is called identifiable; otherwise it is called nonidentifiable (unidentifiable).
Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that
$$p_1(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-ax-b)^2}{2} \right), \qquad p_2(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y-a\tanh(bx))^2}{2} \right).$$
Then p1(y|x,(a,b)) is identifiable, but p2(y|x,(a,b)) is nonidentifiable: since tanh is odd, (a,b) and (−a,−b) define the same density, and when a = 0 every value of b gives the same density.
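A quick numerical check of the nonidentifiability (again a sketch of mine): the parameters (a,b) and (−a,−b) give exactly the same density, and at a = 0 the value of b has no effect.

```python
import numpy as np

def p2(y, x, a, b):
    """p2(y|x,(a,b)) = (1/sqrt(2*pi)) * exp(-(y - a*tanh(b*x))^2 / 2)."""
    return np.exp(-0.5 * (y - a * np.tanh(b * x)) ** 2) / np.sqrt(2 * np.pi)

x, y = 0.7, 0.3
print(p2(y, x, 0.5, 0.5))    # parameter (0.5, 0.5)
print(p2(y, x, -0.5, -0.5))  # parameter (-0.5, -0.5): identical value
print(p2(y, x, 0.0, 1.0), p2(y, x, 0.0, -3.0))  # a = 0: any b gives the same value
```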
Fisher Information Matrix
Definition. Let p(y|x,w) be a learning machine, where w is in R^d (d is a positive integer). The Fisher information matrix I(w) = (I_ij(w); i, j = 1, 2, …, d) is defined by
$$I_{ij}(w) = \int\!\!\int f_i(x,y,w)\, f_j(x,y,w)\, p(y\mid x,w)\, q(x)\, dx\, dy, \qquad f_i(x,y,w) = \frac{\partial}{\partial w_i} \log p(y\mid x,w).$$
Note. By definition, the Fisher information matrix is a real, symmetric, positive-semidefinite matrix (its eigenvalues are all real and nonnegative).
Example
Example. Let x,y be real values and q(x) be a standard normal distribution on R. Assume that
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a(x-b))^2}{2} \right).$$
Then $\log p(y\mid x,(a,b)) = -(y - a(x-b))^2/2 + \mathrm{const}$. Hence
$$f_1(x,y,(a,b)) = (y - a(x-b))(x-b), \qquad f_2(x,y,(a,b)) = -a\,(y - a(x-b)).$$
Thus
$$I(a,b) = \begin{pmatrix} 1+b^2 & ab \\ ab & a^2 \end{pmatrix}.$$
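This closed form can be verified by Monte Carlo integration; below is a sketch under the stated assumptions (x standard normal, unit noise variance; names mine):

```python
import numpy as np

def fisher_mc(a, b, n=200_000, seed=0):
    """Monte Carlo estimate of I(a,b) for p(y|x,(a,b)) = N(y; a(x-b), 1), x ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = a * (x - b) + rng.standard_normal(n)  # y drawn from p(y|x,(a,b))
    r = y - a * (x - b)                       # residual
    f = np.stack([r * (x - b), -a * r])       # score components f1, f2
    return f @ f.T / n

print(fisher_mc(0.5, 0.5))                     # Monte Carlo estimate
print(np.array([[1.25, 0.25], [0.25, 0.25]]))  # closed form [[1+b^2, ab], [ab, a^2]]
```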
Regularity
Definition. Let p(y|x,w) be a learning machine. If the Fisher information matrix is positive definite (i.e., all its eigenvalues are positive) for every parameter w, then the learning machine is called regular. Otherwise it is called nonregular or singular.
Example. Let x, y be real values and q(x) be a standard normal distribution on R. The learning machine
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a(x-b))^2}{2} \right)$$
is not regular, because the Fisher information matrix is not positive definite at a = 0.
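From the closed form above, I(0,b) = diag(1+b², 0), which has a zero eigenvalue. A short numerical check (a sketch of mine):

```python
import numpy as np

def I(a, b):
    # closed-form Fisher information from the previous example
    return np.array([[1 + b**2, a * b], [a * b, a**2]])

print(np.linalg.eigvalsh(I(0.5, 0.5)))  # both eigenvalues positive: regular point
print(np.linalg.eigvalsh(I(0.0, 0.5)))  # one eigenvalue is zero: singular point
```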
Identifiable, Regular and Realizable Case
Definition. Let q(y|x) and p*(y|x) be the true distribution and a trained learning machine. The Kullback-Leibler distance is defined by
$$D(q\|p^*) = \int\!\!\int q(x)\, q(y\mid x) \log \frac{q(y\mid x)}{p^*(y\mid x)}\, dx\, dy.$$
Assume that the learning machine is identifiable and regular, and that the true distribution is realizable by the learning machine. Also assume that p*(y|x) is given by one of maximum likelihood, maximum a posteriori, or Bayes estimation. Then the average generalization error E[D(q||p*)] is given by
$$E[\,D(q\|p^*)\,] = \frac{d}{2n} + o\!\left(\frac{1}{n}\right),$$
where d is the dimension of the parameter.
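For the regular model p(y|x,(a,b)) = N(y; ax+b, 1) with x ~ N(0,1), the KL distance reduces to D = ((â−a₀)² + (b̂−b₀)²)/2, so the theorem can be checked by simulation. A sketch (the true parameters and trial counts are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0, n, trials = 0.5, 0.5, 100, 2000
D = []
for _ in range(trials):
    x = rng.standard_normal(n)
    y = a0 * x + b0 + rng.standard_normal(n)
    a, b = np.polyfit(x, y, 1)  # maximum likelihood = least squares for this model
    D.append(((a - a0) ** 2 + (b - b0) ** 2) / 2)  # KL distance for x ~ N(0,1)

print(np.mean(D))   # approximately d/(2n) = 2/200 = 0.01
```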
Understanding Learning Machines
If a learning machine p(y|x,w) is identifiable and regular, then its statistical learning theory can be derived from classical mathematical statistics.
However, hierarchical learning machines such as neural networks are
(1) nonidentifiable, and
(2) singular.
Whether or not the true distribution is realizable depends on the true distribution.
I-2-2 Neural Network
Neural Network
The artificial neural network was devised by simulating the biological neural network, but it is different from the real biological one.
x : input, f(x,w) : output, w : parameter.
Sigmoid Function
A sigmoid function is defined by
$$\sigma(u) = \frac{1}{1+e^{-u}}.$$
Then it satisfies the differential equation
$$\sigma'(u) = \sigma(u)\,(1-\sigma(u)).$$
[Figure: graph of σ(u), increasing from 0 to 1, with σ(0) = 1/2.]
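The identity can be checked by direct differentiation:
$$\sigma'(u) = \frac{e^{-u}}{(1+e^{-u})^2} = \frac{1}{1+e^{-u}} \cdot \frac{e^{-u}}{1+e^{-u}} = \sigma(u)\,\bigl(1-\sigma(u)\bigr).$$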
Rectified Linear Unit (ReLU)
A ReLU function is defined by
$$\sigma(u) = \max(0, u).$$
Then
$$\sigma'(u) = \begin{cases} 1 & (u > 0) \\ 0 & (u < 0). \end{cases}$$
[Figure: graph of σ(u) = max(0, u).]
One Neuron
A neuron is defined by
$$\text{Output} = \sigma\!\left( \sum_{i=1}^{M} w_i x_i + \theta \right),$$
where (x1, …, xM) is the input vector, w1, …, wM are the synapse weights, and θ is the bias; the parameter is (w1, …, wM, θ).
[Figure: a single neuron receiving inputs x1, …, xM through synapse weights w1, …, wM.]
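A minimal sketch of this neuron in Python (names mine):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neuron(x, w, theta):
    """Output of one neuron: sigma(sum_i w_i * x_i + theta)."""
    return sigmoid(np.dot(w, x) + theta)

x = np.array([1.0, -0.5, 2.0])   # input vector (M = 3)
w = np.array([0.3, 0.8, -0.1])   # synapse weights
print(neuron(x, w, theta=0.2))
```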
Structure of Layered Neural Network
Output of hidden unit:
$$o_j = \sigma\!\left( \sum_{k=1}^{M} w_{jk} x_k + \theta_j \right).$$
Output of output unit:
$$f_i = \sigma\!\left( \sum_{j=1}^{H} u_{ij} o_j + \varphi_i \right).$$
The parameter is w = {w_jk, θ_j, u_ij, φ_i}.
[Figure: a three-layer network with input units x_k, hidden units o_j, and output units f_i.]
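The complete forward pass of this three-layer network, as a sketch (the layer sizes and random parameters are my choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, w, theta, u, phi):
    """Three-layer network: input x (M,) -> hidden o (H,) -> output f (N,)."""
    o = sigmoid(w @ x + theta)   # o_j = sigma(sum_k w_jk x_k + theta_j)
    f = sigmoid(u @ o + phi)     # f_i = sigma(sum_j u_ij o_j + phi_i)
    return f

M, H, N = 4, 3, 2
rng = np.random.default_rng(0)
w, theta = rng.standard_normal((H, M)), rng.standard_normal(H)
u, phi = rng.standard_normal((N, H)), rng.standard_normal(N)
print(forward(rng.standard_normal(M), w, theta, u, phi))
```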
Example of q(x)q(y|x)
[Figure: an example of q(x) and q(y|x). The inputs are 25-pixel images of the digits 0 and 6, and the learning machine p(y|x,w) is a neural network with 25 input units, 6 hidden units, and 2 output units.]
Neural Network = Learning Machine
[Figure: a neural network with an input layer, a hidden layer, and an output layer; the input, the output, and the desired output are indicated.]
Learning in a Neural Network
[Figure: a classification task with training data, n = 100; panels show the data, the true distribution, and the trained neural network.]
I-2-3 Function Approximation by Neural Networks
Function Approximation Theorem
Theorem. Let A be an arbitrary compact set in R^M. For an arbitrary continuous function g(x) and an arbitrary positive value ε > 0, there exists a function f(x,w) realized by a neural network such that
$$\sup_{x \in A} |\, g(x) - f(x,w) \,| < \varepsilon.$$
Remark. Based on this theorem, people sometimes say that neural networks are universal approximators. However, polynomials and trigonometric functions are also universal approximators; there are many universal approximators.
Function Approximation Problem
Consider approximating a function y = g(x) by
$$F(x) = \sum_{j=1}^{H} a_j\, f(b_j, x).$$
Classical functions: the parameter is {a_j} (the basis functions f(b_j, x) are fixed).
Neural networks: the parameter is {a_j, b_j} (the basis functions themselves are adapted).
[Figure: a network computing F(x) from inputs x1, x2, x3 through units f(b_j, x) with coefficients a_j.]
Conventional functions and Neural Networks
Conventional functions (M = dim x): for an arbitrary fixed {b_j}, there exist g and H such that
$$\min_{\{a_j\}} \left\| g(x) - \sum_{j=1}^{H} a_j f(b_j, x) \right\|^2 \;\geq\; \frac{C_1(g)}{H^{2/M}}.$$
Neural networks: for arbitrary g(x) and H,
$$\min_{\{a_j, b_j\}} \left\| g(x) - \sum_{j=1}^{H} a_j f(b_j, x) \right\|^2 \;\leq\; \frac{C_2(g)}{H}.$$
Curse of Dimensionality
Let M be the dimension of the input x. For the conventional functions, the function approximation error is larger than C1(g)/H^{2/M}, which cannot be made small if M is very large. This fact was called the “curse of dimensionality.”
On the other hand, neural networks make the error smaller than C2(g)/H for any M. Nowadays researchers say, “The curse of dimensionality was resolved by neural networks.”
However, it should be emphasized that a small function approximation error does not always mean a small generalization error.
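A quick sense of scale (the target error 0.01 and M = 10 are my example numbers): to push the bound down to ε, a conventional basis needs about ε^{−M/2} terms, while a network needs only about 1/ε hidden units.

```python
# To achieve error bound eps with a conventional basis: 1/H^(2/M) <= eps,
# i.e. H >= eps^(-M/2). With a neural network: 1/H <= eps, i.e. H >= 1/eps.
M, eps = 10, 0.01
print(eps ** (-M / 2))   # 1e10 basis functions in the conventional case
print(1 / eps)           # 100 hidden units for the neural network, for any M
```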
I-2-4 Likelihood Function of Neural Networks
Likelihood Function
Definition. Let q(x)q(y|x) be the true density function and p(y|x,w) be a learning machine. Assume that {(Xi,Yi); i = 1, 2, …, n} is a set of independent random variables whose probability density function is q(x)q(y|x). Then the likelihood function L(w) is defined by
$$L(w) = \prod_{i=1}^{n} p(Y_i \mid X_i, w).$$
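In practice one works with the log-likelihood to avoid numerical underflow of the product. A sketch for the Gaussian regression machine used in the next example (the data generation is assumed, not from the slides):

```python
import numpy as np

def log_likelihood(a, b, X, Y):
    """log L(a,b) = sum_i log p(Y_i|X_i,(a,b)) for p(y|x,(a,b)) = N(y; a*x + b, 1)."""
    r = Y - (a * X + b)
    return -0.5 * np.sum(r ** 2) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
Y = 0.5 * X + 0.5 + rng.standard_normal(100)  # true parameter (0.5, 0.5)
print(log_likelihood(0.5, 0.5, X, Y))   # high at the true parameter
print(log_likelihood(0.0, 0.0, X, Y))   # lower at a wrong parameter
```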
Example: Regular Model
Example. Assume that q(x) is a standard normal distribution,
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - ax - b)^2}{2} \right), \qquad q(y\mid x) = p(y\mid x, 0.5, 0.5).$$
In this case, q(y|x) is realizable by p(y|x,a,b), and p(y|x,a,b) is identifiable and regular.
Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if a = b = 0.5.
Example of Likelihood (n=100)
If a learning machine is regular, then the likelihood function becomes localized when n tends to infinity.
[Figure: the likelihood surface over (a,b) is concentrated around a single peak.]
Example: Neural Networks
Example. Assume that q(x) is a standard normal distribution,
$$p(y\mid x,(a,b)) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(y - a\tanh(bx))^2}{2} \right), \qquad q(y\mid x) = p(y\mid x, 0.5, 0.5).$$
In this case, q(y|x) is realizable by p(y|x,a,b). However, p(y|x,a,b) is nonidentifiable and singular.
Remark. The equation q(y|x) = p(y|x,a,b) holds if and only if a = b = 0.5 or a = b = −0.5, but there are many parameters which make the likelihood function almost maximal.
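The non-localized likelihood can be seen by evaluating log L(a,b) on a grid; a sketch (the grid range and sample size are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100)
Y = 0.5 * np.tanh(0.5 * X) + rng.standard_normal(100)  # true (a,b) = (0.5, 0.5)

def log_L(a, b):
    r = Y - a * np.tanh(b * X)
    return -0.5 * np.sum(r ** 2)   # log-likelihood up to an additive constant

A, B = np.meshgrid(np.linspace(-2, 2, 81), np.linspace(-2, 2, 81))
L = np.vectorize(log_L)(A, B)
# Many grid points come within 1 nat of the maximum: a ridge, not a single peak.
print((L > L.max() - 1.0).sum())
```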
Example of Likelihood (n=100)
If a learning machine is singular, then the likelihood function is not localized even when n tends to infinity.
[Figure: the likelihood surface over (a,b) shows an extended ridge rather than a single peak.]
Neural Networks are Singular.
If a learning machine is identifiable and regular, and if the true distribution is realizable by the learning machine, then the maximum likelihood estimator converges to the true parameter and the Bayes posterior distribution concentrates on the true parameter.
However, if a learning machine is nonidentifiable and singular, then such properties do not hold. In neural networks, there are infinitely many parameters which make the likelihood function almost maximal.
We need a new mathematical learning theory.
Contents of Part I
1. Basic Concepts in Statistical Learning
2. Neural Network
3. Learning in Neural Network, Report Writing (1)
4. Boltzmann Machine
5. Deep Learning
6. Information and Entropy, Report Writing (2)
7. Prediction Accuracy
8. Knowledge Discovery, Report Writing (3)