Self-organizing systems
Self-organizing
• No supervisor available.
• Instead, try to find order/structure in the environment:
  – Clusters (i.e., data is not homogeneously distributed)
  – Directions (i.e., projections that carry more information than others)
Finding projections
Auto-encoder MLP
[Diagram: MLP with a narrow hidden layer; output = input.]
The auto-encoder is trained to reproduce the input at the output through a "bottleneck" - it must try to find an efficient coding. Train using standard training algorithms.
Will lead to ~ principal components.
A 2-1-2 MLP with linear output but nonlinear input, trained to reproduce the input data. The line shows the direction of w, the weight vector for the hidden unit.
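Below is a minimal numpy sketch of such a 2-1-2 auto-encoder, simplified to a fully linear network trained by gradient descent on the reconstruction error. The toy data, learning rate, and epoch count are illustrative assumptions, not from the lecture.

# Minimal sketch: a 2-1-2 auto-encoder with one hidden unit (fully
# linear for simplicity), trained to reproduce its input.
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data with one dominant direction, made zero-mean.
X = rng.normal(size=(500, 1)) * np.array([2.0, 1.0]) \
    + rng.normal(scale=0.2, size=(500, 2))
X -= X.mean(axis=0)

w_in = rng.normal(scale=0.1, size=2)   # input -> hidden weights
w_out = rng.normal(scale=0.1, size=2)  # hidden -> output weights
eta = 0.01

for epoch in range(200):
    z = X @ w_in                 # hidden activation (the bottleneck)
    X_hat = np.outer(z, w_out)   # reconstruction of the input
    err = X_hat - X
    # Gradients of E = 0.5 * sum(err**2), averaged over the data.
    w_out -= eta * (err.T @ z) / len(X)
    w_in -= eta * ((err @ w_out) @ X) / len(X)

# The hidden unit's weight vector should align with the leading
# principal direction (up to sign and scale).
print(w_in / np.linalg.norm(w_in))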
Principal components (Karhunen-Loève transform)
Task: Find a linear recoding of the data that preserves as much information as possible ("information" = variance in the signal)
⇒ Principal components
Principal components
Any D-dimensional data vector can be written in the standard basis:

$$
\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{pmatrix}
\equiv x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + \cdots + x_D \mathbf{e}_D
= x_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
+ x_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}
+ \cdots + x_D \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}
$$
Find a new ON basis Q with M < D
$$
\mathbf{x} \approx \hat{\mathbf{x}}
= z_1 \mathbf{q}_1 + z_2 \mathbf{q}_2 + \cdots + z_M \mathbf{q}_M
= \begin{pmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_M \end{pmatrix}
\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_M \end{pmatrix}
= \mathbf{Q}\,\mathbf{z}
$$

such that $\sum_n \big\|\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big\|^2$ is minimized.
Reminder on change of basis
$$
z_i = \mathbf{q}_i^T \mathbf{x}
$$

The coefficient $z_i$ is given by the scalar product of $\mathbf{x}$ and the basis vector $\mathbf{q}_i$.
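As a quick worked example (numbers chosen for illustration):

$$
\mathbf{q}_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix},\quad
\mathbf{x} = \begin{pmatrix}3\\1\end{pmatrix}
\;\Rightarrow\;
z_1 = \mathbf{q}_1^T\mathbf{x} = \tfrac{3+1}{\sqrt{2}} = 2\sqrt{2},
\qquad
\hat{\mathbf{x}} = z_1\mathbf{q}_1 = \begin{pmatrix}2\\2\end{pmatrix}
$$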
Suppose M = D-1
Then only the coefficient along $\mathbf{q}_D$ is dropped, so

$$
\mathbf{x}(n) - \hat{\mathbf{x}}(n) = z_D(n)\,\mathbf{q}_D
$$

and, since $\|\mathbf{q}_D\| = 1$ and $z_D(n) = \mathbf{q}_D^T\mathbf{x}(n)$,

$$
\sum_n \big\|\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big\|^2
= \sum_n z_D(n)^2
= \sum_n \mathbf{q}_D^T\,\mathbf{x}(n)\,\mathbf{x}(n)^T\,\mathbf{q}_D
= \mathbf{q}_D^T \left[\sum_n
\begin{pmatrix}
x_1(n)x_1(n) & x_1(n)x_2(n) & \cdots & x_1(n)x_D(n) \\
x_2(n)x_1(n) & x_2(n)x_2(n) & \cdots & x_2(n)x_D(n) \\
\vdots & \vdots & \ddots & \vdots \\
x_D(n)x_1(n) & x_D(n)x_2(n) & \cdots & x_D(n)x_D(n)
\end{pmatrix}
\right] \mathbf{q}_D
$$
$$
\sum_n \big\|\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big\|^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D
$$

where
$$
\mathbf{R} = \frac{1}{N}\sum_n
\begin{pmatrix}
x_1(n)x_1(n) & x_1(n)x_2(n) & \cdots & x_1(n)x_D(n) \\
x_2(n)x_1(n) & x_2(n)x_2(n) & \cdots & x_2(n)x_D(n) \\
\vdots & \vdots & \ddots & \vdots \\
x_D(n)x_1(n) & x_D(n)x_2(n) & \cdots & x_D(n)x_D(n)
\end{pmatrix}
$$

$\mathbf{R}$ is the correlation matrix.
We can guarantee minimum loss if we choose $\mathbf{q}_D$ to be the eigenvector of $\mathbf{R}$ with minimum eigenvalue:

$$
\mathbf{R}\,\mathbf{q}_D = \lambda_D\,\mathbf{q}_D
\qquad\Rightarrow\qquad
\sum_n \big\|\mathbf{x}(n) - \hat{\mathbf{x}}(n)\big\|^2 = N\,\mathbf{q}_D^T\,\mathbf{R}\,\mathbf{q}_D = N\lambda_D
$$
A zero eigenvalue means no loss.
Principal components
• Choose new basis Q of ON eigenvectors of the correlation matrix R.
• Discard basis vectors in increasing order of their eigenvalues (i.e. throw away smallest eigenvalues first)
• Can also be done with the eigenvectors of the covariance matrix Σ. (Identical to the correlation matrix if data has zero mean.)
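A minimal numpy sketch of this recipe (function and variable names are our own, not from the lecture):

# Minimal sketch of PCA by eigendecomposition of the covariance matrix.
import numpy as np

def pca_basis(X, M):
    """X: (N, D) data matrix. Returns the mean, the M leading
    eigenvectors (as columns) and their eigenvalues, sorted by
    decreasing eigenvalue."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)      # (D, D) covariance matrix
    lam, Q = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    order = np.argsort(lam)[::-1]        # largest eigenvalues first
    return mu, Q[:, order[:M]], lam[order[:M]]

def encode(X, mu, Q):
    return (X - mu) @ Q                  # coefficients z = Q^T (x - mu)

def decode(Z, mu, Q):
    return mu + Z @ Q.T                  # reconstruction x_hat = mu + Q z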
Covariance matrix
$$
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1D} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{DD}
\end{pmatrix},
\qquad
\sigma_{ij} = \frac{1}{N-1}\sum_n \big[x_i(n) - \mu_i\big]\big[x_j(n) - \mu_j\big]
$$

$$
\mu_i = \frac{1}{N}\sum_n x_i(n),
\qquad
\sigma_{ii} = \sigma_i^2 = \frac{1}{N-1}\sum_n \big[x_i(n) - \mu_i\big]^2
$$
Covariance matrix
$$
\begin{aligned}
\sigma_{ij}
&= \frac{1}{N-1}\sum_n \big[x_i(n) - \mu_i\big]\big[x_j(n) - \mu_j\big] \\
&= \frac{1}{N-1}\sum_n x_i(n)\,x_j(n) - \frac{2N}{N-1}\,\mu_i\mu_j + \frac{N}{N-1}\,\mu_i\mu_j \\
&= \frac{1}{N-1}\sum_n x_i(n)\,x_j(n) - \frac{N}{N-1}\,\mu_i\mu_j
\end{aligned}
$$

so that, for large $N$,

$$
\boldsymbol{\Sigma} \approx \mathbf{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T
$$
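A quick numerical check of this identity on toy data (the distribution and sample size are made up for illustration):

# Quick numerical check that Sigma ≈ R - mu mu^T for large N.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=[1.0, -2.0], scale=[1.0, 0.5], size=(100_000, 2))

mu = X.mean(axis=0)
R = (X.T @ X) / len(X)              # correlation matrix (1/N) sum x x^T
Sigma = np.cov(X, rowvar=False)     # covariance, 1/(N-1) normalization

# Should be ~0 up to O(1/N) effects from the N-1 vs N normalization.
print(np.max(np.abs(Sigma - (R - np.outer(mu, mu)))))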
Principal components
$$
\boldsymbol{\Sigma}\,\mathbf{q}_i = \lambda_i\,\mathbf{q}_i
$$

Express the data in the new basis Q, with basis vectors that are eigenvectors of the data covariance matrix.

[Figure: 2D scatter of data in the (x1, x2) plane. The red lines show the directions of the two eigenvectors.]
Variance along an eigendirection (zero-mean data)

$$
\sigma_k^2
= \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n) - \mu\big]^2
= \frac{1}{N-1}\sum_n \big[\mathbf{q}_k^T\mathbf{x}(n)\big]^2
= \frac{N}{N-1}\,\lambda_k
$$

(With zero-mean data $\mu = 0$, and $\sum_n [\mathbf{q}_k^T\mathbf{x}(n)]^2 = N\,\mathbf{q}_k^T\mathbf{R}\,\mathbf{q}_k = N\lambda_k$.)
PCA example: NIR spectra of meat
[Figure: four panels over the 100 wavelength channels: "20 NIR spectra", "Same 20 demeaned", "...and rescaled", "Grouped in fat%".]
Each curve is a point in 100-dimensional space.
NIR: The 9 leading eigenvectors
[Figure: the nine leading eigenvectors plotted over the 100 channels.]
λi = 2.4308, 0.9372, 0.0489, 0.0256, 0.0108, 0.0023, 0.0014, 0.0002, 0.0001
NIR reconstruction with PCA
[Figure: reconstructions of one spectrum using more and more leading principal components; panels: original, 1st PC, 1,2 PC, 1,2,3 PC, 1,2,3,4 PC, 1-5 PC, 1-6 PC, 1-7 PC, 1-8 PC.]
z = (2.64, 0.01, 2.35, -9.24, 0.66, -0.23, 0.71, -0.34, 0.06)
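A sketch of this kind of truncated reconstruction; the data matrix here is a random stand-in for the real 20 x 100 spectra:

# Sketch: reconstruct spectra from the M leading principal components.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 100))          # stand-in for the real spectra

mu = X.mean(axis=0)
lam, Q = np.linalg.eigh(np.cov(X, rowvar=False))
Q = Q[:, np.argsort(lam)[::-1]]         # columns sorted by eigenvalue

def reconstruct(x, M):
    z = Q[:, :M].T @ (x - mu)           # coefficients on M leading PCs
    return mu + Q[:, :M] @ z            # truncated reconstruction

x_hat = reconstruct(X[0], M=4)          # e.g. the "1,2,3,4 PC" panel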
The first eigenvector for the Lego data covariance matrix. The line shows the direction of the eigenvector = the first principal direction. In this case, the first principal direction is good for doing the classification.
Compare: a 2-1-2 MLP autoencoder with linear output but nonlinear input, trained to reproduce the input data. The line shows the direction of w, the weight vector for the hidden unit.
PCA application: image compression
[Figure: original image; the PCA (KL) basis estimated from 12x12 patches (144 dim); the image recoded using 10% of the PCA basis for each 12x12 patch; the image recoded using 50% of the PCA basis for each 12x12 patch; original image for comparison.]
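A sketch of this patch-based scheme; the image is a random placeholder and the 10% kept fraction mirrors one of the panels above:

# Sketch: PCA-based image compression with 12x12 patches (144 dims).
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((120, 120))            # placeholder grayscale image
P = 12

# Cut the image into non-overlapping P x P patches, one patch per row.
patches = img.reshape(img.shape[0] // P, P, img.shape[1] // P, P)
patches = patches.transpose(0, 2, 1, 3).reshape(-1, P * P)

mu = patches.mean(axis=0)
lam, Q = np.linalg.eigh(np.cov(patches, rowvar=False))
Q = Q[:, np.argsort(lam)[::-1]]

keep = int(0.10 * P * P)                # keep 10% of the PCA basis
Z = (patches - mu) @ Q[:, :keep]        # compressed representation
recoded = mu + Z @ Q[:, :keep].T        # per-patch reconstruction

# Reassemble the patches into an image.
h, w = img.shape[0] // P, img.shape[1] // P
recoded_img = (recoded.reshape(h, w, P, P)
                      .transpose(0, 2, 1, 3)
                      .reshape(img.shape))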
PCA application: Eigenfaces
• Images are high-dimensional data with high correlation (faces look quite similar after all... the eyes are located above the nose, the mouth below the nose, hair on top... etc.)
• Reduce the dimensionality of the face image database by using PCA.
• Requires that the face is centered in the image and that the individual is looking into the camera (i.e., the same pose all the time).
M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
Images from the ORL database (http://www.cam-orl.co.uk/facedatabase.html)
[Figure: eigenvectors ("eigenfaces") when different subsets of 200 face images are used to compute PCA, arranged from large λ through medium λ to small λ.]
• You need only 10-20 eigenfaces to do a reliable identification.
• Compare with dimension of original image.
http://cnx.org/
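A minimal sketch of the identification step in the eigenface approach of Turk & Pentland (projection onto the leading eigenfaces, then nearest neighbor in coefficient space); all images, sizes, and counts here are random placeholders:

# Sketch: identification with eigenfaces. Project faces onto the
# leading eigenfaces and match by nearest neighbor in coefficient space.
import numpy as np

rng = np.random.default_rng(4)
gallery = rng.random((200, 32 * 32))    # 200 known faces, flattened
labels = np.arange(200)                 # one identity per gallery image
probe = gallery[17] + 0.01 * rng.normal(size=32 * 32)  # noisy query

mu = gallery.mean(axis=0)
lam, Q = np.linalg.eigh(np.cov(gallery, rowvar=False))
eigenfaces = Q[:, np.argsort(lam)[::-1][:15]]  # 10-20 eigenfaces suffice

z_gallery = (gallery - mu) @ eigenfaces        # (200, 15) coefficients
z_probe = (probe - mu) @ eigenfaces

best = np.argmin(np.linalg.norm(z_gallery - z_probe, axis=1))
print("identified as:", labels[best])          # should print 17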
PCA not always an optimal projection for classification
Auto-encoder applications
[Diagram: the same bottleneck auto-encoder; output = input.]
• Induction motor failure detection (Siemens). Input: Power spectrum of electrical current.
• Failure prediction in helicopter gear boxes (US Navy). Input: Vibration spectrum of gear box.
• Bank note rejection (and acceptance) at automatic vending machines (U. Firenze). Input: Reflected and transmitted light along the bank note.
PCA ≠ Autoencoder
The PCA basis can represent data in a subspace that extends infinitely.
The MLP autoencoder reliably represents data in a lower-dimensional subspace and in a limited region. This is due to sigmoid functions that saturate.
Nonlinear autoencoder
[Diagram: a deep auto-encoder with several nonlinear hidden layers; output = input.]
Has been very difficult to train. Now "solved" by using smart "pretraining" (a "Boltzmann machine").
Matlab code available at http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006
Nonlinear autoencoder
[Figure: reconstructions compared: original, 6-dim nonlinear autoencoder, 6-dim linear autoencoder, 6-dim linear PCA; and original, 30-dim nonlinear autoencoder, 30-dim PCA.]
Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006
Visualization of newswire stories
[Figure: 2D nonlinear autoencoder vs. 2D latent semantic analysis.]
Hinton & Salakhutdinov, Science, 313, pp. 504-507, 2006
PCA with kernels (cf. SVM)
Map to a high-dimensional space and compute PCA there. Can be done with kernels.
Figure from ftp://ftp.research.microsoft.com/users/mtipping/skpca_nips.ps.gz
ICA
• ICA = Independent Component Analysis
• PCA computes eigenvectors of the covariance matrix (2nd-order statistics).
• ICA looks at higher-order statistics and finds "independent" components.
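For contrast, a short sketch using scikit-learn's FastICA (one ICA algorithm among several) next to PCA on a toy source-separation problem; the sources and mixing matrix are made-up assumptions:

# Sketch: PCA vs. ICA on a toy blind source separation problem.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])            # assumed mixing matrix
X = S @ A.T                                       # observed mixtures

X_pca = PCA(n_components=2).fit_transform(X)      # decorrelates only
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)
# X_ica should recover the sources up to order, sign, and scale.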
Clustering
k-means clustering
For K “cluster vectors” minimize
$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} \Lambda_k[\mathbf{x}(n)]\,\big\|\mathbf{x}(n) - \mathbf{w}_k\big\|^2
$$

This is the "distortion".

$$
\Lambda_k[\mathbf{x}(n)] = \begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n) \\
0 & \text{otherwise}
\end{cases}
$$

$\Lambda$ is an "assignment function".
k-means update
Batch mode:

$$
\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + \eta \sum_{n=1}^{N} \Lambda_k[\mathbf{x}(n)]\,\big[\mathbf{x}(n) - \mathbf{w}_k(t)\big]
$$

On-line mode:

$$
\mathbf{w}_k(t+1) = \begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for } \mathbf{w}_k \text{ closest} \\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$

k-means can be done in batch and on-line mode. Often on-line.
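A minimal sketch of the on-line mode (the step size and epoch count are illustrative; initial centers are picked randomly from the data, as the next slides suggest):

# Minimal sketch of on-line k-means with the update above.
import numpy as np

def online_kmeans(X, K, eta=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=K, replace=False)].copy()  # init from data
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(W - x, axis=1))     # closest center
            W[k] = (1 - eta) * W[k] + eta * x                # on-line update
    return W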
Train E = 0.56%, Test E = 0.80%. But the algorithm wasn't told about red & green. It takes a long time to get the vectors w to converge into the region of interest. "Better" to pick the initial points randomly from the data.
Train E = 0.55%, Test E = 0.52%. But the algorithm needs to know about red & green.
k-means problem
• How to select the number of centers?
• Common to minimize the Schwarz criterion:

$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} \Lambda_k[\mathbf{x}(n)]\; d\big(\mathbf{x}(n), \mathbf{w}_k\big) \;+\; \lambda\, D\, K \log(N)
$$

The first term is the distortion; the second is a complexity cost.
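A sketch of using this criterion to pick K (the weight lambda and the choice of squared Euclidean distance as the distortion measure are assumptions):

# Sketch: score a set of centers W by distortion + complexity cost.
import numpy as np

def schwarz_score(X, W, lam=1.0):
    N, D = X.shape
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    distortion = 0.5 * d2.min(axis=1).sum()  # each point to closest center
    return distortion + lam * D * len(W) * np.log(N)

# Usage idea: run k-means for K = 1, 2, 3, ... and keep the K with the
# smallest score; larger K lowers distortion but pays a complexity cost.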
Learning vector quantization
For correctly classified patterns – move closer:
$$
\mathbf{w}_k(t+1) = \begin{cases}
(1-\eta)\,\mathbf{w}_k(t) + \eta\,\mathbf{x}(n) & \text{for } \mathbf{w}_k \text{ closest} \\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$
For incorrectly classified patterns – move away:
$$
\mathbf{w}_k(t+1) = \begin{cases}
(1+\eta)\,\mathbf{w}_k(t) - \eta\,\mathbf{x}(n) & \text{for } \mathbf{w}_k \text{ closest} \\
\mathbf{w}_k(t) & \text{otherwise}
\end{cases}
$$
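A minimal sketch of this rule (often called LVQ1); prototype initialization and labeling are left to the caller:

# Minimal sketch of LVQ: each prototype in W carries a class label;
# the winner moves toward correctly classified points and away from
# misclassified ones.
import numpy as np

def lvq1(X, y, W, w_labels, eta=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(epochs):
        for n in rng.permutation(len(X)):
            k = np.argmin(np.linalg.norm(W - X[n], axis=1))  # winner
            if w_labels[k] == y[n]:
                W[k] = (1 - eta) * W[k] + eta * X[n]   # move closer
            else:
                W[k] = (1 + eta) * W[k] - eta * X[n]   # move away
    return W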
Self-organizing maps
• Impose a topology among the “neurons”, i.e. define neighborhood relationships.
• Update neighbors along with closest unit.
• Will encode the data in a 2D or 3D submanifold.
A 2D square lattice topology: every neuron has 4 near neighbors.
A 2D hexagonal lattice topology: every neuron has 6 near neighbors.
SOM maps
For K “cluster vectors” (neurons) minimize
$$
E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} \Lambda_k[\mathbf{x}(n)]\,\big\|\mathbf{x}(n) - \mathbf{w}_k\big\|^2
$$

Example of switch (assignment function):

$$
\Lambda_k[\mathbf{x}(n)] = \begin{cases}
1 & \text{if } \mathbf{w}_k \text{ is closest to } \mathbf{x}(n), \text{ or if unit } k \text{ is a neighbor of the closest unit} \\
0 & \text{otherwise}
\end{cases}
$$
SOM update
Let the closest unit to x(n) be called unit j.
$$
\mathbf{w}_k(t+1) = (1 - \eta\,\Lambda_{jk})\,\mathbf{w}_k(t) + \eta\,\Lambda_{jk}\,\mathbf{x}(n)
$$

$$
\Lambda_{jk} = \exp\left[-\frac{d_{jk}^2}{2\sigma^2}\right]
$$

where $d_{jk}$ is the distance in the lattice and $\sigma$ is decreased with time:
first a big neighborhood, then a smaller neighborhood, then no neighborhood.
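A minimal sketch of SOM training with this Gaussian neighborhood and a shrinking sigma (the lattice size, rates, and schedule are illustrative):

# Minimal sketch of SOM training on a 2D square lattice.
import numpy as np

def train_som(X, rows=10, cols=10, eta=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, X.shape[1]))          # one weight per neuron
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for epoch in range(epochs):
        sigma = 3.0 * (0.1 ** (epoch / epochs))        # shrinking neighborhood
        for x in X[rng.permutation(len(X))]:
            j = np.argmin(np.linalg.norm(W - x, axis=1))  # closest unit j
            d2 = ((grid - grid[j]) ** 2).sum(axis=1)      # lattice distances^2
            Lam = np.exp(-d2 / (2 * sigma ** 2))          # neighborhood function
            W += eta * Lam[:, None] * (x - W)             # update all neighbors
    return W.reshape(rows, cols, -1)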
[Figure: SOM training snapshots: initial, 5 epochs, 10 epochs, 15 epochs, 20 epochs (SOM only).]
Hierarchical clustering
• Agglomerative: Start out with all points as individual clusters. Join the closest clusters until you're satisfied. (A code sketch follows the dendrogram below.)
[Figure: dendrogram showing the clustering order and distances.]
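A sketch of agglomerative clustering and a dendrogram using scipy (single linkage and the toy data are chosen for illustration):

# Sketch: agglomerative clustering and a dendrogram with scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(2, 0.3, (20, 2))])

Z = linkage(X, method='single')       # repeatedly join closest clusters
labels = fcluster(Z, t=2, criterion='maxclust')  # cut tree into 2 clusters

dendrogram(Z)                         # clustering order and distances
plt.show()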
[Figures: k-means results on the same data, for comparison.]
Metrics
Euclidean: $d(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_2 = \left[\sum_k (x_k - w_k)^2\right]^{1/2}$

Minkowski: $d(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_p = \left[\sum_k |x_k - w_k|^p\right]^{1/p}$

Manhattan: $d(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_1 = \sum_k (x_k - w_k)\,\mathrm{sgn}(x_k - w_k)$

Mahalanobis: $d(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_{\boldsymbol{\Sigma}} = (\mathbf{x} - \mathbf{w})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{w})$

Kernel: $d(\mathbf{x}, \mathbf{w}) = K(\mathbf{x}, \mathbf{w})$ (a kernel-defined distance)
etc...mutations, alignments,...whatever...
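The metrics above as numpy functions (a minimal sketch; Sigma and the kernel K must be supplied by the application, and the kernel-induced distance shown is one common construction, an assumption rather than the slide's definition):

# Sketch of the listed metrics as plain numpy functions.
import numpy as np

def euclidean(x, w):
    return np.sqrt(((x - w) ** 2).sum())

def minkowski(x, w, p):
    return (np.abs(x - w) ** p).sum() ** (1.0 / p)

def manhattan(x, w):
    return np.abs(x - w).sum()

def mahalanobis(x, w, Sigma):
    d = x - w
    return d @ np.linalg.inv(Sigma) @ d   # squared form, as in the slide

# Kernel-induced distance: a common construction (assumption) is
# d^2(x, w) = K(x, x) - 2 K(x, w) + K(w, w) for a kernel K.
def kernel_dist(x, w, K):
    return np.sqrt(K(x, x) - 2 * K(x, w) + K(w, w))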