11. 10. 2001. NIMIA Crema, Italy 1
Identification and Neural Networks
I S R G
G. Horváth
Department of Measurement and Information Systems
11. 10. 2001. NIMIA Crema, Italy 2
Modular networks: Why a modular approach?
Motivations:
Biological
Learning
Computational
Implementation
11. 10. 2001. NIMIA Crema, Italy 3
Motivations: Biological
Biological systems are not homogeneous
Functional specialization
Fault tolerance
Cooperation, competition
Scalability
Extendibility
11. 10. 2001. NIMIA Crema, Italy 4
Motivations: Complexity of learning (divide and conquer)
Training of complex networks (many layers):
layer-by-layer learning
Speed of learning
Catastrophic interference, incremental learning
Mixing supervised and unsupervised learning
Hierarchical knowledge structure
11. 10. 2001. NIMIA Crema, Italy 5
Motivations: Computational
The capacity of a network
The size of the network
Catastrophic interference
Generalization capability vs network complexity
11. 10. 2001. NIMIA Crema, Italy 6
Motivations: Implementation (hardware)
The degree of parallelism
Number of connections
The length of physical connections
Fan out
11. 10. 2001. NIMIA Crema, Italy 7
Modular networks: What modules?
Every module solves the same, whole problem; the modules disagree on some inputs
(different ways of solution, i.e. different modules)
Every module solves a different task (sub-task): task decomposition (input space, output space)
11. 10. 2001. NIMIA Crema, Italy 8
Modular networks: How to combine modules
Cooperative modules: simple average; weighted average (fixed weights);
optimal linear combination (OLC) of networks
Competitive modules: majority vote; winner-takes-all
Competitive/cooperative modules: weighted average (input-dependent weights);
mixture of experts (MOE)
11. 10. 2001. NIMIA Crema, Italy 9
Modular networks: Construction of modular networks
Task decomposition, subtask definition
Training modules for solving subtasks
Integration of the results
(cooperation and/or competition)
11. 10. 2001. NIMIA Crema, Italy 11
Cooperative networks: Ensemble of cooperating networks
(classification/regression)
The motivation:
Heuristic explanation: different experts together can solve a problem better;
complementary knowledge
Mathematical justification: accurate and diverse modules
11. 10. 2001. NIMIA Crema, Italy 12
Ensemble of networks: Mathematical justification
Ensemble output: $\bar{y}(\mathbf{x}) = \sum_{j} \alpha_j y_j(\mathbf{x})$
Ambiguity (diversity): $a_j(\mathbf{x}) = \left(y_j(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2$
Individual error: $e_j(\mathbf{x}) = \left(d(\mathbf{x}) - y_j(\mathbf{x})\right)^2$
Ensemble error: $e(\mathbf{x}) = \left(d(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2$
Constraint: $\sum_j \alpha_j = 1$
11. 10. 2001. NIMIA Crema, Italy 13
Ensemble of networks: Mathematical justification (cont'd)
Weighted error: $\bar{e}(\mathbf{x}) = \sum_{j} \alpha_j e_j(\mathbf{x})$
Weighted diversity: $\bar{a}(\mathbf{x}) = \sum_{j} \alpha_j a_j(\mathbf{x})$
Ensemble error: $e(\mathbf{x}) = \left(d(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2 = \bar{e}(\mathbf{x}) - \bar{a}(\mathbf{x})$
Averaging over the input distribution $f(\mathbf{x})$:
$E = \int e(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x}$, $\bar{E} = \int \bar{e}(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x}$, $\bar{A} = \int \bar{a}(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x}$
$E = \bar{E} - \bar{A}$
Solution: an ensemble of accurate and diverse networks
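The decomposition $e(\mathbf{x}) = \bar{e}(\mathbf{x}) - \bar{a}(\mathbf{x})$ can be checked numerically at a single input point. A minimal numpy sketch with made-up member outputs (not from the slides), assuming weights that satisfy the constraint:

```python
# Checking the ambiguity decomposition e(x) = e_bar(x) - a_bar(x)
# at one input point; the numbers are purely illustrative.
import numpy as np

y = np.array([0.9, 1.2, 0.7])        # outputs y_j(x) of M = 3 member networks
alpha = np.array([0.5, 0.3, 0.2])    # combination weights, sum to 1 (constraint)
d = 1.0                              # desired output d(x)

y_bar = alpha @ y                    # ensemble output
e = (d - y_bar) ** 2                 # ensemble error
e_bar = alpha @ (d - y) ** 2         # weighted individual error
a_bar = alpha @ (y - y_bar) ** 2     # weighted ambiguity (diversity)

print(e, e_bar - a_bar)              # the two values coincide
```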
11. 10. 2001. NIMIA Crema, Italy 14
Ensemble of networks: How to get accurate and diverse networks
different structures: more than one network structure (e.g. MLP, RBF, CCN, etc.)
different size, different complexity networks (number of hidden units, number of layers, nonlinear function, etc.)
different learning strategies (BP, CG, random search, etc.); batch learning vs. sequential learning
different training algorithms, sample order, learning samples
different training parameters
different starting parameter values
different stopping criteria
11. 10. 2001. NIMIA Crema, Italy 15
Linear combination of networks
[Figure: block diagram. The input x feeds networks NN1, NN2, ..., NNM with outputs y1, y2, ..., yM; a summing unit Σ combines them with weights α1, α2, ..., αM plus a bias weight α0 on the constant input y0 = 1.]
$\bar{y}(\mathbf{x}) = \sum_{j=0}^{M} \alpha_j y_j(\mathbf{x})$, with $y_0 = 1$
11. 10. 2001. NIMIA Crema, Italy 16
Linear combination of networks
Computation of optimal coefficients:
simple average: $\alpha_k = 1/M$, $k = 1, \dots, M$
$\alpha_k$ depends on the input: for different input domains a different network alone gives the output ($\alpha_k = 1$, $\alpha_j = 0$ for $j \neq k$)
optimal values using the constraint $\sum_k \alpha_k = 1$
optimal values without any constraint (Wiener-Hopf equation):
$\boldsymbol{\alpha}^* = \mathbf{R}^{-1}\mathbf{P}$, where $\mathbf{R} = E\{\mathbf{y}(\mathbf{x})\,\mathbf{y}(\mathbf{x})^T\}$ and $\mathbf{P} = E\{d(\mathbf{x})\,\mathbf{y}(\mathbf{x})\}$
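For the unconstrained optimum, the expectations can be replaced by sample averages and the Wiener-Hopf equation solved directly. A minimal sketch; the data are synthetic and only illustrative:

```python
# Optimal linear combination from sample estimates: alpha = R^-1 P.
import numpy as np

rng = np.random.default_rng(0)
L, M = 500, 3
Y = rng.normal(size=(L, M))                  # member outputs y(x_l), one row per sample
d = Y @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.normal(size=L)  # targets d(x_l)

R = Y.T @ Y / L                              # sample estimate of E{y y^T}
P = Y.T @ d / L                              # sample estimate of E{d y}
alpha = np.linalg.solve(R, P)                # optimal combination weights
print(alpha)                                 # close to the generating weights
```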
11. 10. 2001. NIMIA Crema, Italy 17
Task decomposition: Decomposition related to learning
before learning (subtask definition)
during learning (automatic task decomposition)
Problem space decomposition:
input space (input space clustering, definition of different input regions)
output space (desired response)
11. 10. 2001. NIMIA Crema, Italy 18
Task decomposition: Decomposition into separate subproblems
K-class classification → K two-class problems (coarse decomposition)
Complex two-class problems → smaller two-class problems (fine decomposition)
Integration (module combination)
11. 10. 2001. NIMIA Crema, Italy 19
Task decomposition: A 3-class problem [figure]
11. 10. 2001. NIMIA Crema, Italy 20
Task decomposition: 3 classes
[Figure: decomposition into 2 small classes / 2 small classes]
11. 10. 2001. NIMIA Crema, Italy 21
Task decomposition: 3 classes
[Figure: 2 classes, each split into 2 small classes / 2 small classes]
11. 10. 2001. NIMIA Crema, Italy 22
Task decomposition: 3 classes
[Figure: 2 small classes / 2 small classes]
11. 10. 2001. NIMIA Crema, Italy 23
Task decomposition
[Figure: the input feeds the pairwise modules M12, M13, M23; their outputs, together with inverted copies (INV), are combined by MIN units to form the class outputs C1, C2, C3.]
11. 10. 2001. NIMIA Crema, Italy 24
Task decomposition: A two-class problem decomposed into subtasks [figure]
11. 10. 2001. NIMIA Crema, Italy 25
Task decomposition
[Figure: modules M11 and M12 are combined by an AND gate, modules M21 and M22 by another AND gate, and the two AND outputs are combined by an OR gate.]
11. 10. 2001. NIMIA Crema, Italy 26
Task decomposition
[Figure: the input feeds modules M11, M12, M21, M22; MIN units combine M11 with M12 and M21 with M22, and a MAX unit merges the two MIN outputs into the class output C1.]
11. 10. 2001. NIMIA Crema, Italy 27
Task decomposition: Training set decomposition
Original training set: $T = \{(\mathbf{x}_l, y_l)\}_{l=1}^{L}$
Training set for each of the $K$ two-class problems:
$T_i = \{(\mathbf{x}_l, y_l^{(i)})\}_{l=1}^{L}$, $i = 1, \dots, K$, where $y_l^{(i)} = 1$ if $\mathbf{x}_l \in C_i$ and $y_l^{(i)} = -1$ if $\mathbf{x}_l$ belongs to any class except $C_i$
Each two-class problem is divided into $K-1$ smaller two-class problems [using an inverter module, really $(K-1)/2$ are enough]:
$T_{ij} = \{(\mathbf{x}_l^{(i)}, 1)\}_{l=1}^{L_i} \cup \{(\mathbf{x}_l^{(j)}, -1)\}_{l=1}^{L_j}$, $i, j = 1, \dots, K$ and $i \neq j$
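A minimal sketch of how these training sets can be built in code; the function names and the (X, y) array representation are my own, not from the slides:

```python
# Coarse and fine decomposition of a K-class training set.
import numpy as np

def one_vs_rest(X, y, i):
    """T_i: label +1 for class i, -1 for every other class."""
    return X, np.where(y == i, 1, -1)

def pairwise(X, y, i, j):
    """T_ij: only samples of classes i and j; +1 for class i, -1 for class j."""
    mask = (y == i) | (y == j)
    return X[mask], np.where(y[mask] == i, 1, -1)
```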
11. 10. 2001. NIMIA Crema, Italy 28
Task decomposition: A practical example: zip code recognition
[Figure: preprocessing pipeline. The input digit (16 x 16) is normalized, then edge detection with Kirsch masks (horizontal, diagonal \, diagonal /, vertical) produces 4 feature maps of 16 x 16, which are subsampled to 4 matrices of 8 x 8.]
11. 10. 2001. NIMIA Crema, Italy 29
Task decomposition: Zip code recognition (handwritten character recognition), modular solution
45 = K(K-1)/2 pairwise neurons (K = 10 classes)
10 AND gates (MIN operator)
256 + 1 inputs
11. 10. 2001. NIMIA Crema, Italy 30
Mixture of Experts (MOE)
[Figure: the input x feeds Expert 1, Expert 2, ..., Expert M (outputs μ1, ..., μM) and a gating network (outputs g1, ..., gM); a summing unit Σ forms the weighted output μ.]
11. 10. 2001. NIMIA Crema, Italy 31
Mixture of Experts (MOE): The output is the weighted sum of the outputs of the experts:
$\boldsymbol{\mu} = \sum_{i=1}^{M} g_i \boldsymbol{\mu}_i$, with $\boldsymbol{\mu}_i = f(\mathbf{x}, \boldsymbol{\Theta}_i)$, where $\boldsymbol{\Theta}_i$ is the parameter of the i-th expert
The output of the gating network is the "softmax" function:
$g_i = \dfrac{e^{u_i}}{\sum_{j=1}^{M} e^{u_j}}$, with $u_i = \mathbf{v}_i^T \mathbf{x}$, so that $0 \le g_i \le 1$ and $\sum_{i=1}^{M} g_i = 1$
$\mathbf{v}_i$ is the parameter of the gating network
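A minimal sketch of the MOE forward pass, assuming scalar linear experts and a linear softmax gate (the parameter shapes are my assumptions, not given on the slide):

```python
# MOE forward pass: experts mu_i = Theta_i^T x, softmax gate g_i.
import numpy as np

def moe_output(x, Theta, V):
    """Theta: (M, n) expert weights; V: (M, n) gating weights; x: (n,)."""
    mu_i = Theta @ x                       # expert outputs mu_i
    u = V @ x                              # gate activations u_i = v_i^T x
    g = np.exp(u - u.max())                # softmax (shifted for stability)
    g /= g.sum()                           # 0 <= g_i <= 1, sum g_i = 1
    return g @ mu_i, mu_i, g               # weighted sum of expert outputs
```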
11. 10. 2001. NIMIA Crema, Italy 32
Mixture of Experts (MOE): Probabilistic interpretation
The probabilistic model with true parameters:
$P(\mathbf{y}|\mathbf{x}, \boldsymbol{\Theta}^0) = \sum_i g_i(\mathbf{x}, \mathbf{v}^0)\, P(\mathbf{y}|\mathbf{x}, \boldsymbol{\Theta}_i^0)$
a priori probability: $g_i(\mathbf{x}, \mathbf{v}^0) = P(i|\mathbf{x}, \mathbf{v}^0)$
11. 10. 2001. NIMIA Crema, Italy 33
Mixture of Experts (MOE): Training
Training data: $X = \{(\mathbf{x}^{(l)}, \mathbf{y}^{(l)})\}_{l=1}^{L}$
Probability of generating the output from the input:
$P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}) = \sum_i P(i|\mathbf{x}^{(l)}, \mathbf{v})\, P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$
Over the whole training set:
$P(Y|X, \boldsymbol{\Theta}) = \prod_{l=1}^{L} P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}) = \prod_{l=1}^{L} \sum_i P(i|\mathbf{x}^{(l)}, \mathbf{v})\, P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$
The log likelihood function (maximum likelihood estimation):
$L(\boldsymbol{\Theta}, X) = \sum_l \log \sum_i P(i|\mathbf{x}^{(l)}, \mathbf{v})\, P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$
11. 10. 2001. NIMIA Crema, Italy 34
Mixture of Experts (MOE): Training (cont'd)
Gradient method: compute $\partial L(\boldsymbol{\Theta}, X)/\partial \boldsymbol{\Theta}_i$ and $\partial L(\boldsymbol{\Theta}, X)/\partial \mathbf{v}_i$
The parameter of the expert network:
$\boldsymbol{\Theta}_i(k+1) = \boldsymbol{\Theta}_i(k) + \eta \sum_{l=1}^{L} h_i^{(l)} \left(\mathbf{y}^{(l)} - \boldsymbol{\mu}_i^{(l)}\right)$
The parameter of the gating network:
$\mathbf{v}_i(k+1) = \mathbf{v}_i(k) + \eta \sum_{l=1}^{L} \left(h_i^{(l)} - g_i^{(l)}\right) \mathbf{x}^{(l)}$
11. 10. 2001. NIMIA Crema, Italy 35
Mixture of Experts (MOE): Training (cont'd)
A priori probability: $g_i^{(l)} = g_i(\mathbf{x}^{(l)}, \mathbf{v}_i) = P(i|\mathbf{x}^{(l)}, \mathbf{v}_i)$
A posteriori probability:
$h_i^{(l)} = \dfrac{g_i^{(l)}\, P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)}{\sum_j g_j^{(l)}\, P(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}, \boldsymbol{\Theta}_j)}$
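A minimal sketch of one training epoch combining the posterior computation with the gradient updates above, assuming scalar linear experts with unit-variance Gaussian likelihoods; the learning rate and parameter shapes are my assumptions:

```python
# One epoch of MOE gradient training (scalar linear experts, linear gate).
import numpy as np

def moe_train_step(X, y, Theta, V, eta=0.01):
    """X: (L, n) inputs; y: (L,) targets; Theta, V: (M, n) parameters."""
    for x_l, y_l in zip(X, y):
        mu = Theta @ x_l                       # expert outputs
        u = V @ x_l
        g = np.exp(u - u.max()); g /= g.sum()  # a priori probabilities g_i
        lik = np.exp(-0.5 * (y_l - mu) ** 2)   # P(y|x, Theta_i), Gaussian
        h = g * lik / (g * lik).sum()          # a posteriori probabilities h_i
        Theta += eta * np.outer(h * (y_l - mu), x_l)   # expert update
        V += eta * np.outer(h - g, x_l)                # gating update
    return Theta, V
```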
11. 10. 2001. NIMIA Crema, Italy 36
Mixture of Experts (MOE): Training (cont'd)
EM (Expectation-Maximization) algorithm
A general iterative technique for maximum likelihood estimation:
introduce hidden variables, then define a log likelihood function
Two steps:
Expectation of the hidden variables
Maximization of the log likelihood function
11. 10. 2001. NIMIA Crema, Italy 37
EM (Expectation Maximization) algorithm: A simple example: estimating the means of k (= 2) Gaussians
[Figure: measurement samples drawn from two overlapping densities $f(y|\mu_1)$ and $f(y|\mu_2)$.]
11. 10. 2001. NIMIA Crema, Italy 38
EM (Expectation Maximization) algorithm: A simple example: estimating the means of k (= 2) Gaussians
Hidden variables for every observation: $(x^{(l)}, z_1^{(l)}, z_2^{(l)})$, where
$z_1^{(l)} = 1$ and $z_2^{(l)} = 0$ if $x^{(l)}$ is drawn from $f(\cdot|\mu_1)$
$z_1^{(l)} = 0$ and $z_2^{(l)} = 1$ if $x^{(l)}$ is drawn from $f(\cdot|\mu_2)$
Likelihood function: $f(x^{(l)}, \mathbf{z}^{(l)}) = \prod_{i=1}^{k} \left[f_i(x^{(l)})\right]^{z_i^{(l)}}$
Log likelihood function: $\log f(x^{(l)}, \mathbf{z}^{(l)}) = \sum_{i=1}^{k} z_i^{(l)} \log f_i(x^{(l)})$
Expected value of $z_i^{(l)}$ with $\mu_1$ and $\mu_2$ given:
$E[z_1^{(l)}] = \dfrac{f(x^{(l)}|\mu_1)}{\sum_{j=1}^{2} f(x^{(l)}|\mu_j)}$, $E[z_2^{(l)}] = \dfrac{f(x^{(l)}|\mu_2)}{\sum_{j=1}^{2} f(x^{(l)}|\mu_j)}$
11. 10. 2001. NIMIA Crema, Italy 39
EM (Expectation Maximization) algorithm: A simple example (cont'd)
Expected log likelihood function:
$E[L] = \sum_{l} \sum_{i=1}^{k} E[z_i^{(l)}] \log f(x^{(l)}|\mu_i)$, with $E[z_i^{(l)}] = \dfrac{f(x^{(l)}|\mu_i)}{\sum_{j=1}^{k} f(x^{(l)}|\mu_j)}$
where
$f(x^{(l)}|\mu_i) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\dfrac{1}{2\sigma^2}\left(x^{(l)} - \mu_i\right)^2\right]$
$\log f(x^{(l)}|\mu_i) = \log\dfrac{1}{\sqrt{2\pi\sigma^2}} - \dfrac{1}{2\sigma^2}\left(x^{(l)} - \mu_i\right)^2$
The estimate of the means (maximizing the expected log likelihood):
$\hat{\mu}_i = \dfrac{\sum_{l=1}^{L} E[z_i^{(l)}]\, x^{(l)}}{\sum_{l=1}^{L} E[z_i^{(l)}]}$
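A minimal numpy sketch of this EM iteration for two 1-D Gaussians with known, equal variance; the synthetic data and starting values are only illustrative:

```python
# EM for the means of two 1-D Gaussians (sigma = 1 assumed known).
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
mu = np.array([0.0, 1.0])                     # initial guesses for mu_1, mu_2

for _ in range(50):
    # E step: expected hidden variables E[z_i] (responsibilities)
    f = np.exp(-0.5 * (x[:, None] - mu) ** 2)  # f(x | mu_i), up to a constant
    Ez = f / f.sum(axis=1, keepdims=True)
    # M step: re-estimate the means
    mu = (Ez * x[:, None]).sum(axis=0) / Ez.sum(axis=0)

print(mu)                                      # close to (-2, 3)
```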
11. 10. 2001. NIMIA Crema, Italy 40
Mixture of Experts (MOE): Applications
Simple experts: linear experts
ECG diagnostics
Mixture of Kalman filters
Discussion: comparison to non-modular architecture
11. 10. 2001. NIMIA Crema, Italy 41
Support vector machines: A new approach
Gives answers to questions not solved by the classical approach:
The size of the network
The generalization capability
11. 10. 2001. NIMIA Crema, Italy 42
Classical neural learning vs. the Support Vector Machine
Support vector machines:
Optimal hyperplane
Classification
11. 10. 2001. NIMIA Crema, Italy 43
$\mathbf{w} = [w_0, w_1, \dots, w_M]^T$, $\boldsymbol{\varphi}(\mathbf{x}) = [\varphi_0(\mathbf{x}), \varphi_1(\mathbf{x}), \dots, \varphi_M(\mathbf{x})]^T$
$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}) = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x})$
VC dimension
11. 10. 2001. NIMIA Crema, Italy 44
Guaranteed generalization: $v_{\text{guaranteed}} = v_{\text{teach}} + f(VC, \dots)$; the guaranteed error is the training error plus a confidence term that depends on the VC dimension.
Structural risk (error) minimization
11. 10. 2001. NIMIA Crema, Italy 45
Support vector machines: Linearly separable two-class problem
Training set: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{P}$, with $y_i = +1$ if $\mathbf{x}_i \in X_1$ and $y_i = -1$ if $\mathbf{x}_i \in X_2$
Separating hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$
$\mathbf{w}^T\mathbf{x}_i + b \ge 1$ if $\mathbf{x}_i \in X_1$, and $\mathbf{w}^T\mathbf{x}_i + b \le -1$ if $\mathbf{x}_i \in X_2$
Combined: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ for all $i$
Optimal hyperplane
11. 10. 2001. NIMIA Crema, Italy 46
Support vector machines: Geometric interpretation
Distance of a point from the hyperplane: $d(\mathbf{x}, \mathbf{w}, b) = \dfrac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$
Margin of separation:
$\rho(\mathbf{w}, b) = \min_{\{\mathbf{x}_i;\; y_i = +1\}} d(\mathbf{x}_i, \mathbf{w}, b) + \min_{\{\mathbf{x}_i;\; y_i = -1\}} |d(\mathbf{x}_i, \mathbf{w}, b)| = \dfrac{1}{\|\mathbf{w}\|} + \dfrac{1}{\|\mathbf{w}\|} = \dfrac{2}{\|\mathbf{w}\|}$
[Figure: the separating hyperplane d(x) = 0 in the (x1, x2) plane with the margin 2/||w||.]
11. 10. 2001. NIMIA Crema, Italy 47
Support vector machines: Criterion function, Lagrange function
A constrained optimization problem: minimize $\Phi(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$
Lagrange function:
$J(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{P} \alpha_i \left\{y_i\left[\mathbf{w}^T\mathbf{x}_i + b\right] - 1\right\}$
Saddle point: $\max_{\boldsymbol{\alpha}} \min_{\mathbf{w}, b} J(\mathbf{w}, b, \boldsymbol{\alpha})$
Conditions:
$\partial J/\partial \mathbf{w} = \mathbf{0} \Rightarrow \mathbf{w} = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i$, $\partial J/\partial b = 0 \Rightarrow \sum_{i=1}^{P} \alpha_i y_i = 0$
Dual problem:
$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \max_{\boldsymbol{\alpha}} \left\{ \sum_{i=1}^{P} \alpha_i - \frac{1}{2} \sum_{i=1}^{P} \sum_{j=1}^{P} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \right\}$, subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{P} \alpha_i y_i = 0$
Support vectors: the $\mathbf{x}_i$ with $\alpha_i > 0$
Optimal hyperplane: $\mathbf{w}_0 = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i$
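In practice the dual quadratic program is solved by a library. As one illustration (scikit-learn is a modern tool, not part of this lecture), a linear SVC exposes exactly the quantities above: the support vectors and the products $\alpha_i y_i$. The data here are synthetic:

```python
# Hard-margin-like SVM via scikit-learn; inspect support vectors and alphas.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin
print(clf.support_vectors_)                    # the x_i with alpha_i > 0
print(clf.dual_coef_)                          # alpha_i * y_i for the support vectors
```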
11. 10. 2001. NIMIA Crema, Italy 48
Support vector machines: Linearly nonseparable case
Separating hyperplane (with slack variables): $y_i\left[\mathbf{w}^T\mathbf{x}_i + b\right] \ge 1 - \xi_i$, $i = 1, \dots, P$
Criterion function: $\Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{P} \xi_i$
Lagrange function:
$J(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P} \xi_i - \sum_{i=1}^{P} \alpha_i \left\{y_i\left[\mathbf{w}^T\mathbf{x}_i + b\right] - 1 + \xi_i\right\} - \sum_{i=1}^{P} \beta_i \xi_i$
with $0 \le \alpha_i \le C$
Support vectors: the $\mathbf{x}_i$ with $\alpha_i > 0$
Optimal hyperplane: $\mathbf{w}_0 = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i$
11. 10. 2001. NIMIA Crema, Italy 49
Support vector machines: Nonlinear separation
Separating hyperplane (in feature space): $\mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}) + b = 0$, i.e. $\sum_{j=0}^{M} w_j \varphi_j(\mathbf{x}) = 0$
Decision surface: $\sum_{i=1}^{P} \alpha_i y_i \sum_{j=0}^{M} \varphi_j(\mathbf{x}) \varphi_j(\mathbf{x}_i) = \sum_{i=1}^{P} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) = 0$
Kernel function: $K(\mathbf{x}, \mathbf{x}_i) = \boldsymbol{\varphi}^T(\mathbf{x})\, \boldsymbol{\varphi}(\mathbf{x}_i)$
Criterion function (dual): $W(\boldsymbol{\alpha}) = \sum_{i=1}^{P} \alpha_i - \frac{1}{2} \sum_{i=1}^{P} \sum_{j=1}^{P} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
11. 10. 2001. NIMIA Crema, Italy 50
Support vector machines: Examples of SVM kernels
Polynomial: $K(\mathbf{x}, \mathbf{x}_i) = \left(\mathbf{x}^T\mathbf{x}_i + 1\right)^d$, $d = 1, 2, \dots$
RBF: $K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\dfrac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}_i\|^2\right)$
MLP: $K(\mathbf{x}, \mathbf{x}_i) = \tanh\left(\beta_0\, \mathbf{x}^T\mathbf{x}_i + \beta_1\right)$
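The three kernels written out as plain functions; a minimal sketch whose parameter names follow the formulas above:

```python
# The polynomial, RBF and MLP (tanh) kernels as functions of two vectors.
import numpy as np

def k_poly(x, xi, d=2):
    return (x @ xi + 1) ** d

def k_rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def k_mlp(x, xi, beta0=1.0, beta1=-1.0):
    # Note: the tanh kernel satisfies Mercer's condition only for
    # some parameter values.
    return np.tanh(beta0 * (x @ xi) + beta1)
```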
11. 10. 2001. NIMIA Crema, Italy 51
Support vector machines: Example: polynomial kernel (d = 2, two-dimensional input)
Kernel function:
$K(\mathbf{x}, \mathbf{x}_i) = \left(1 + \mathbf{x}^T\mathbf{x}_i\right)^2 = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
Basis functions:
$\boldsymbol{\varphi}(\mathbf{x}_i) = \left[1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2}\right]^T$
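A quick numeric check (illustrative values) that the kernel indeed equals the inner product of the basis-function vectors:

```python
# Verify (1 + x.xi)^2 == phi(x) . phi(xi) for the d = 2 polynomial kernel.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

x, xi = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print((1 + x @ xi) ** 2, phi(x) @ phi(xi))   # the two values coincide
```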
11. 10. 2001. NIMIA Crema, Italy 52
SVM (classification): summary
Separable samples:
Minimize: $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$
Constraint: $d_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$, $i = 1, 2, \dots, P$
Not separable samples:
Minimize: $\Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}\xi_i$
Constraint: $d_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, 2, \dots, P$
By minimizing $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ we maximize the distance between the classes, while we also control the VC dimension.
11. 10. 2001. NIMIA Crema, Italy 53
SVR (regression)
The $\varepsilon$-insensitive cost function $C(\cdot)$:
$C(y, f(\mathbf{x})) = \begin{cases} |y - f(\mathbf{x})| - \varepsilon, & \text{if } |y - f(\mathbf{x})| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases}$
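The cost function as a direct one-line transcription of the definition above:

```python
# Epsilon-insensitive loss: zero inside the tube, linear outside it.
import numpy as np

def eps_insensitive(y, f_x, eps=0.1):
    return np.maximum(np.abs(y - f_x) - eps, 0.0)
```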
11. 10. 2001. NIMIA Crema, Italy 54
SVR (regression)
Model: $y = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x})$
Minimize: $F(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}') = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}\left(\xi_i + \xi_i'\right)$
Constraints: $d_i - \mathbf{w}^T\mathbf{x}_i \le \varepsilon + \xi_i$, $\mathbf{w}^T\mathbf{x}_i - d_i \le \varepsilon + \xi_i'$, $\xi_i \ge 0$, $\xi_i' \ge 0$, $i = 1, 2, \dots, P$
11. 10. 2001. NIMIA Crema, Italy 55
SVR (regression): Lagrange function
$J(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}', \boldsymbol{\alpha}, \boldsymbol{\alpha}', \boldsymbol{\gamma}, \boldsymbol{\gamma}') = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}(\xi_i + \xi_i') + \sum_{i=1}^{P}\alpha_i\left[d_i - \mathbf{w}^T\mathbf{x}_i - \varepsilon - \xi_i\right] + \sum_{i=1}^{P}\alpha_i'\left[\mathbf{w}^T\mathbf{x}_i - d_i - \varepsilon - \xi_i'\right] - \sum_{i=1}^{P}(\gamma_i\xi_i + \gamma_i'\xi_i')$
Solution: $\mathbf{w} = \sum_{i=1}^{P}(\alpha_i - \alpha_i')\,\mathbf{x}_i$, with $\gamma_i = C - \alpha_i$ and $\gamma_i' = C - \alpha_i'$
Support vectors: the $\mathbf{x}_i$ with $\alpha_i \neq \alpha_i'$
Dual problem:
$W(\boldsymbol{\alpha}, \boldsymbol{\alpha}') = \sum_{i=1}^{P} d_i(\alpha_i - \alpha_i') - \varepsilon\sum_{i=1}^{P}(\alpha_i + \alpha_i') - \frac{1}{2}\sum_{i=1}^{P}\sum_{j=1}^{P}(\alpha_i - \alpha_i')(\alpha_j - \alpha_j')\,K(\mathbf{x}_i, \mathbf{x}_j)$
Constraints: $\sum_{i=1}^{P}(\alpha_i - \alpha_i') = 0$, $0 \le \alpha_i \le C$, $0 \le \alpha_i' \le C$
11. 10. 2001. NIMIA Crema, Italy 56
SVR (regression)
11. 10. 2001. NIMIA Crema, Italy 57
SVR (regression)
11. 10. 2001. NIMIA Crema, Italy 58
SVR (regression)
11. 10. 2001. NIMIA Crema, Italy 59
SVR (regression)
11. 10. 2001. NIMIA Crema, Italy 60
Support vector machines: Main advantages
generalization
size of the network
centre parameters for RBF
linear-in-the-parameter structure
noise immunity
11. 10. 2001. NIMIA Crema, Italy 61
Support vector machines: Main disadvantages
computationally intensive (quadratic optimization)
hyperparameter selection; VC dimension (classification)
batch processing
11. 10. 2001. NIMIA Crema, Italy 62
Support vector machines: Variants
LS-SVM (least squares SVM)
basic criterion function
Advantages: easier to compute; adaptivity
11. 10. 2001. NIMIA Crema, Italy 63
Mixture of SVMs: Problem of hyper-parameter selection for SVMs
Different SVMs, with different hyper-parameters
Soft separation of the input space
11. 10. 2001. NIMIA Crema, Italy 64
Mixture of SVMs
11. 10. 2001. NIMIA Crema, Italy 65
Boosting techniques:
Boosting by filtering
Boosting by subsampling
Boosting by reweighting (a sketch follows this list)
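The slides give no algorithm for reweighting; the sketch below follows the standard AdaBoost-style scheme, with `fit_weak` a hypothetical weak learner that accepts sample weights and returns a +/-1 predictor:

```python
# Boosting by reweighting (AdaBoost-style): misclassified samples get
# larger weights, and the weak models are combined by a weighted vote.
import numpy as np

def boost_by_reweighting(X, y, fit_weak, T=10):
    P = len(y)
    w = np.full(P, 1.0 / P)                   # uniform initial sample weights
    models, betas = [], []
    for _ in range(T):
        h = fit_weak(X, y, w)                 # train on the weighted sample
        pred = h(X)
        err = w[pred != y].sum()              # weighted training error
        if err >= 0.5 or err == 0:
            break
        beta = 0.5 * np.log((1 - err) / err)  # model weight
        w *= np.exp(-beta * y * pred)         # upweight misclassified samples
        w /= w.sum()
        models.append(h); betas.append(beta)
    return lambda Xn: np.sign(sum(b * m(Xn) for b, m in zip(betas, models)))
```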
11. 10. 2001. NIMIA Crema, Italy 66
Boosting techniques: Boosting by filtering
11. 10. 2001. NIMIA Crema, Italy 67
Boosting techniques: Boosting by subsampling
11. 10. 2001. NIMIA Crema, Italy 68
Boosting techniques: Boosting by reweighting
11. 10. 2001. NIMIA Crema, Italy 69
Other modular architectures
11. 10. 2001. NIMIA Crema, Italy 70
Other modular architectures
11. 10. 2001. NIMIA Crema, Italy 71
Other modular architectures: Modular classifiers
Decoupled modules
Hierarchical modules
Network ensemble (linear combination)
Network ensemble (decision, voting)
11. 10. 2001. NIMIA Crema, Italy 72
Modular architectures