Data Analysis and Dimension Reduction - PCA, LDA and LSA
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. Introduction to Machine Learning, Chapter 6
2. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 3
Introduction (1/3)
• Goal: discover significant patterns or features from the input data
  – Salient feature selection or dimensionality reduction
  – Compute an input-output mapping based on some desirable properties
[Figure: a network with weights W maps the input space (x) to the feature space (y)]
Introduction (2/3)
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Latent Semantic Analysis (LSA)
Bivariate Random Variables
• If the random variables X and Y have a certain joint distribution that describes a bivariate random variable:
$$\mathbf{Z} = \begin{bmatrix} X \\ Y \end{bmatrix} \quad \text{(bivariate random variable)}$$
$$\Rightarrow \boldsymbol{\mu}_Z = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix} \quad \text{(mean vector)}$$
$$\Rightarrow \boldsymbol{\Sigma}_Z = \begin{bmatrix} \sigma_{X,X} & \sigma_{X,Y} \\ \sigma_{Y,X} & \sigma_{Y,Y} \end{bmatrix} \quad \text{(covariance matrix)}$$
variance: $\sigma_{X,X} = \sigma_X^2 = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$
covariance: $\sigma_{X,Y} = \sigma_{Y,X} = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$
Multivariate Random Variables
• If the random variables X1, X2, …, Xn have a certain joint distribution that describes a multivariate random variable:
$$\mathbf{X} = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \quad \text{(multivariate random variable)}$$
$$\Rightarrow \boldsymbol{\mu}_X = \begin{bmatrix} \mu_{X_1} \\ \vdots \\ \mu_{X_n} \end{bmatrix} \quad \text{(mean vector)}$$
$$\Rightarrow \boldsymbol{\Sigma}_X = \begin{bmatrix} \sigma_{X_1,X_1} & \cdots & \sigma_{X_1,X_n} \\ \vdots & \ddots & \vdots \\ \sigma_{X_n,X_1} & \cdots & \sigma_{X_n,X_n} \end{bmatrix} \quad \text{(covariance matrix)}$$
variance: $\sigma_{X_i,X_i} = \sigma_{X_i}^2 = E\big[(X_i - E[X_i])^2\big] = E[X_i^2] - (E[X_i])^2$
covariance: $\sigma_{X_i,X_j} = \sigma_{X_j,X_i} = E\big[(X_i - E[X_i])(X_j - E[X_j])\big] = E[X_i X_j] - E[X_i]\,E[X_j]$
Introduction (3/3)
• Formulation for feature extraction and dimension reduction
  – Model-free (nonparametric)
    • Without prior information: e.g., PCA
    • With prior information: e.g., LDA
  – Model-dependent (parametric), e.g.,
    • HLDA with Gaussian cluster distributions
    • PLSA with multinomial latent cluster distributions
Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  – Or the Hotelling Transform (1933); the idea dates back to Pearson, 1901
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features and still retain the most intrinsic information content
  – A small set of features is to be found to represent the data samples accurately
• Also called "Subspace Decomposition", "Factor Analysis", etc.
Principal Component Analysis (PCA) (2/2)
The patterns show a significant difference from each other in one of the transformed axes
PCA Derivations (1/13)
• Suppose x is an n-dimensional zero-mean random vector, $E\{\mathbf{x}\} = \boldsymbol{\mu} = \mathbf{0}$
  – If x is not zero-mean, we can subtract the mean before proceeding with the following analysis
  – x can be represented without error by the summation of n linearly independent vectors:
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\varphi}_i = \boldsymbol{\Phi}\mathbf{y}, \quad \text{where } \mathbf{y} = [y_1\ y_2\ \cdots\ y_n]^T \text{ and } \boldsymbol{\Phi} = [\boldsymbol{\varphi}_1\ \cdots\ \boldsymbol{\varphi}_n]$$
    ($y_i$ is the i-th component in the feature (mapped) space; the columns of $\boldsymbol{\Phi}$ are the basis vectors)
PCA Derivations (2/13)
Subspace Decomposition: the same point (2,3) expressed with respect to two different orthogonal basis sets
$$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{\sqrt{2}}\begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} + \frac{1}{\sqrt{2}}\begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$$
[Figure: the point (2,3) plotted against the standard basis (1,0), (0,1) and against the rotated orthonormal basis formed from (1,1) and (-1,1)]
PCA Derivations (3/13)
  – Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set:
$$\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
    • Such that $y_i$ is equal to the projection of x on $\boldsymbol{\varphi}_i$:
$$\forall i,\quad y_i = \boldsymbol{\varphi}_i^T\mathbf{x}, \qquad \text{e.g., } y_1 = \frac{\boldsymbol{\varphi}_1^T\mathbf{x}}{\|\boldsymbol{\varphi}_1\|} = \boldsymbol{\varphi}_1^T\mathbf{x} = \|\mathbf{x}\|\cos\theta, \text{ where } \|\boldsymbol{\varphi}_1\| = 1$$
[Figure: the vector x projected onto the orthonormal basis vectors $\boldsymbol{\varphi}_1$ and $\boldsymbol{\varphi}_2$, giving the components $y_1$ and $y_2$]
PCA Derivations (4/13)
  – Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set; then each projection $y_i$ also has the following properties
    • Its mean is zero, too:
$$E\{y_i\} = E\{\boldsymbol{\varphi}_i^T\mathbf{x}\} = \boldsymbol{\varphi}_i^T E\{\mathbf{x}\} = \boldsymbol{\varphi}_i^T\mathbf{0} = 0$$
    • Its variance is:
$$\sigma_{y_i}^2 = E\{y_i^2\} - \big(E\{y_i\}\big)^2 = E\{y_i^2\} = E\{\boldsymbol{\varphi}_i^T\mathbf{x}\mathbf{x}^T\boldsymbol{\varphi}_i\} = \boldsymbol{\varphi}_i^T E\{\mathbf{x}\mathbf{x}^T\}\boldsymbol{\varphi}_i = \boldsymbol{\varphi}_i^T\mathbf{R}\boldsymbol{\varphi}_i$$
      where $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^T$ is the (auto-)correlation matrix of x; since x is zero-mean, it coincides with the covariance matrix $\boldsymbol{\Sigma} = E\{(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\} \approx \frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^T - \boldsymbol{\mu}\boldsymbol{\mu}^T$
    • The correlation between two projections $y_i$ and $y_j$ is:
$$E\{y_i y_j\} = E\{(\boldsymbol{\varphi}_i^T\mathbf{x})(\boldsymbol{\varphi}_j^T\mathbf{x})^T\} = E\{\boldsymbol{\varphi}_i^T\mathbf{x}\mathbf{x}^T\boldsymbol{\varphi}_j\} = \boldsymbol{\varphi}_i^T\mathbf{R}\boldsymbol{\varphi}_j$$
PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only m of the $\boldsymbol{\varphi}_i$'s such that we can still approximate x well under the mean-squared error criterion:
$$\mathbf{x} = \sum_{i=1}^{n} y_i\boldsymbol{\varphi}_i = \sum_{i=1}^{m} y_i\boldsymbol{\varphi}_i + \sum_{j=m+1}^{n} y_j\boldsymbol{\varphi}_j \quad \text{(original vector)}, \qquad \hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i\boldsymbol{\varphi}_i \quad \text{(reconstructed vector)}$$
  – The resulting mean-squared error is:
$$\bar{\varepsilon}^2(m) = E\big\{\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\big\} = E\Big\{\Big(\sum_{j=m+1}^{n} y_j\boldsymbol{\varphi}_j\Big)^T\Big(\sum_{k=m+1}^{n} y_k\boldsymbol{\varphi}_k\Big)\Big\} = \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} E\{y_j y_k\}\,\boldsymbol{\varphi}_j^T\boldsymbol{\varphi}_k = \sum_{j=m+1}^{n} E\{y_j^2\} = \sum_{j=m+1}^{n} \sigma_{y_j}^2 = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T\mathbf{R}\boldsymbol{\varphi}_j$$
    (using $\boldsymbol{\varphi}_j^T\boldsymbol{\varphi}_k = 1$ if $j = k$ and $0$ otherwise, and $\sigma_{y_j}^2 = E\{y_j^2\} - \big(E\{y_j\}\big)^2 = E\{y_j^2\}$ since $E\{y_j\} = 0$)
  – We should therefore discard the bases where the projections have lower variances
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set is selected to be the eigenvectors $\boldsymbol{\varphi}_i$ of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_i$, they will have the property that:
$$\mathbf{R}\boldsymbol{\varphi}_j = \lambda_j\boldsymbol{\varphi}_j$$
    • R is real and symmetric, therefore its eigenvectors form an orthonormal set
    • R is positive definite ($\mathbf{x}^T\mathbf{R}\mathbf{x} > 0$) => all eigenvalues are positive
  – Such that the mean-squared error mentioned above will be:
$$\bar{\varepsilon}^2(m) = \sum_{j=m+1}^{n} \sigma_{y_j}^2 = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T\mathbf{R}\boldsymbol{\varphi}_j = \sum_{j=m+1}^{n} \lambda_j\boldsymbol{\varphi}_j^T\boldsymbol{\varphi}_j = \sum_{j=m+1}^{n} \lambda_j$$
PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be:
$$\bar{\varepsilon}^2_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \geq \cdots \geq \lambda_m \geq \cdots \geq \lambda_n \geq 0$$
  – Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated:
$$E\{y_i y_j\} = E\{(\boldsymbol{\varphi}_i^T\mathbf{x})(\boldsymbol{\varphi}_j^T\mathbf{x})^T\} = \boldsymbol{\varphi}_i^T E\{\mathbf{x}\mathbf{x}^T\}\boldsymbol{\varphi}_j = \boldsymbol{\varphi}_i^T\mathbf{R}\boldsymbol{\varphi}_j = \lambda_j\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_j = 0$$
  • Good news for most statistical modeling approaches
    – Gaussians and diagonal covariance matrices: the full covariance matrix $[\sigma_{ij}]$ of the original features becomes diagonal in the transformed space, since the projections are uncorrelated
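To make the derivation above concrete, here is a minimal MATLAB/Octave sketch of the whole PCA procedure on hypothetical data; the data matrix X and the choice m = 2 are illustrative assumptions, not taken from the slides.

X  = randn(500, 4) * diag([3 2 1 0.2]);   % hypothetical data: N samples x n features
mu = mean(X, 1);
Xc = X - repmat(mu, size(X, 1), 1);       % subtract the mean (zero-mean assumption)
R  = (Xc' * Xc) / size(Xc, 1);            % sample (auto-)correlation matrix of the centered data
[Phi, Lambda] = eig(R);                   % R * phi_j = lambda_j * phi_j
[lambda, idx] = sort(diag(Lambda), 'descend');
Phi = Phi(:, idx);                        % basis vectors ordered by decreasing eigenvalue
m = 2;                                    % number of principal components kept
Y = Xc * Phi(:, 1:m);                     % projections y_i = phi_i' * x for every sample
mse = sum(lambda(m+1:end));               % mean-squared truncation error = sum of discarded eigenvalues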
PCA Derivations (8/13)
• A Two-dimensional Example of Principal Component Analysis
PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\bar{\varepsilon}^2_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  – Define the objective function to be minimized, with the orthonormality constraints $\boldsymbol{\varphi}_j^T\boldsymbol{\varphi}_k = \delta_{jk}$ ($\delta_{jk} = 1$ if $j = k$, $0$ if $j \neq k$) enforced by Lagrange multipliers $u_{jk}$:
$$J = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T\mathbf{R}\boldsymbol{\varphi}_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} u_{jk}\big(\boldsymbol{\varphi}_j^T\boldsymbol{\varphi}_k - \delta_{jk}\big)$$
  – Partial differentiation (using $\frac{\partial}{\partial\boldsymbol{\varphi}}\boldsymbol{\varphi}^T\mathbf{R}\boldsymbol{\varphi} = 2\mathbf{R}\boldsymbol{\varphi}$):
$$\frac{\partial J}{\partial\boldsymbol{\varphi}_j} = 2\mathbf{R}\boldsymbol{\varphi}_j - 2\sum_{k=m+1}^{n} u_{jk}\boldsymbol{\varphi}_k = \mathbf{0}, \quad \forall\, m+1 \leq j \leq n$$
$$\Rightarrow\ \mathbf{R}\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m}\mathbf{U}_{n-m}, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\varphi}_{m+1} \cdots \boldsymbol{\varphi}_n] \text{ and } \mathbf{U}_{n-m} = [u_{jk}]$$
  – This has a particular solution if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $\mathbf{R}$ and $\boldsymbol{\varphi}_{m+1}, \ldots, \boldsymbol{\varphi}_n$ are their corresponding eigenvectors
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an n×m matrix, m < n) such that the truncation result, $\boldsymbol{\Phi}'^T\mathbf{x}$, is optimal under the mean-squared error criterion
Encoder: $\mathbf{y} = \boldsymbol{\Phi}'^T\mathbf{x}$, where $\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T$, $\mathbf{y} = [y_1\ y_2\ \cdots\ y_m]^T$, and $\boldsymbol{\Phi}' = [\boldsymbol{\varphi}_1\ \boldsymbol{\varphi}_2\ \cdots\ \boldsymbol{\varphi}_m]$
Decoder: $\hat{\mathbf{x}} = \boldsymbol{\Phi}'\mathbf{y}$, where $\hat{\mathbf{x}} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
Criterion: minimize $E\big\{(\mathbf{x}-\hat{\mathbf{x}})^T(\mathbf{x}-\hat{\mathbf{x}})\big\}$
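A minimal sketch of this encoder/decoder pair, reusing Xc, Phi and m from the PCA sketch a few slides back (all variable names are illustrative):

PhiP = Phi(:, 1:m);              % the n x m truncated basis Phi'
x    = Xc(1, :)';                % one (centered) input vector as an example
y    = PhiP' * x;                % encoder: y = Phi'^T x
xhat = PhiP * y;                 % decoder: x_hat = Phi' y
sqErr = sum((x - xhat).^2);      % squared reconstruction error for this vector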
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks, such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \geq \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue):
$$\lambda_m \geq \bar{\lambda} = \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
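Both selection rules can be read directly off the sorted eigenvalue vector; a minimal sketch, reusing lambda from the earlier PCA sketch (the 0.95 threshold is an illustrative assumption):

ratio = cumsum(lambda) / sum(lambda);   % proportion of total variance retained by the first m components
m1 = find(ratio >= 0.95, 1);            % smallest m whose retained variance reaches the threshold
m2 = sum(lambda > mean(lambda));        % number of eigenvalues above the average eigenvalue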
PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal:
$$J(\mathbf{W}) = \tilde{\mathbf{S}} = \tilde{\mathbf{S}}_b + \tilde{\mathbf{S}}_w = \mathbf{W}^T\mathbf{S}_b\mathbf{W} + \mathbf{W}^T\mathbf{S}_w\mathbf{W}$$
where (i: sample index, j: class index, $N_j$: number of samples in class j):
$$\mathbf{S} = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T, \qquad \mathbf{S}_w = \frac{1}{N}\sum_j N_j\boldsymbol{\Sigma}_j, \qquad \mathbf{S}_b = \frac{1}{N}\sum_j N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$
$$\tilde{\mathbf{S}}_w = \mathbf{W}^T\mathbf{S}_w\mathbf{W}, \qquad \tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}$$
Try to show that: $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$
PCA Examples: Data Analysis
• Example 1: principal components of some data points
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions and the new feature dimensions after the transform; a threshold controls the amount of information content retained]
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256×256 image is divided into 8×8 blocks for PCA-based coding]
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure: reconstructed images under feature reduction and value reduction]
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating point numbers representing the light intensity in each pixel:
$$\mathbf{x}_1 = \begin{bmatrix} x_{1,1} \\ x_{1,2} \\ \vdots \\ x_{1,n} \end{bmatrix},\ \mathbf{x}_2 = \begin{bmatrix} x_{2,1} \\ x_{2,2} \\ \vdots \\ x_{2,n} \end{bmatrix},\ \ldots,\ \mathbf{x}_L = \begin{bmatrix} x_{L,1} \\ x_{L,2} \\ \vdots \\ x_{L,n} \end{bmatrix}$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
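A minimal MATLAB/Octave sketch of the steps above; the matrix Faces (one n-pixel reference image per column), its size, and the choice of 7 eigenfaces are all illustrative assumptions.

Faces   = rand(1024, 20);                          % hypothetical: 20 reference images of 32x32 = 1024 pixels
avgFace = mean(Faces, 2);                          % "eigenface 0": the average face
D = Faces - repmat(avgFace, 1, size(Faces, 2));    % variation of each image from the average face
C = (D * D') / size(Faces, 2);                     % covariance matrix over the pixel dimensions
[E, Lam] = eigs(C, 7);                             % the leading eigenvectors are the eigenfaces
w = E' * (Faces(:, 1) - avgFace);                  % weights representing the first face by its eigenfaces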
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces:
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K \quad \Rightarrow \quad \mathbf{y}_i = [1, w_{i,1}, w_{i,2}, \ldots, w_{i,K}] \quad \text{(feature vector of a person } i)$$
  – The Eigenface approach preserves the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker r, e.g., the SD HMM model mean parameters (μ), to form a huge vector a(r) (a supervector) of dimension D = (M·n)×1
    • Apply Principal Component Analysis to the speaker supervectors to construct the eigenvoice space
    • Each new speaker S is represented by a point P in K-space:
$$\mathbf{P} = \mathbf{e}_0 + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K$$
[Figure: Speaker 1 ... Speaker R data are used for SI/SD HMM model training; the resulting speaker-dependent supervectors feed PCA to build the eigenvoice space]
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement
• Note that Eigenface operates on the feature space while Eigenvoice operates on the model space
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936): introduced it for two-class classification
• Rao (1965): extended it to handle multiple-class classification
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each belonging to one of the J classes, $g(\mathbf{x}_i) \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is:
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_i \mathbf{x}_i$$
  – The class sample means are:
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i) = j} \mathbf{x}_i$$
  – The class sample covariances are:
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
  – The average within-class variation before the transform is:
$$\mathbf{S}_w = \frac{1}{N}\sum_j N_j\boldsymbol{\Sigma}_j$$
  – The average between-class variation before the transform is:
$$\mathbf{S}_b = \frac{1}{N}\sum_j N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$
LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – The sample vectors will be: $\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$
  – The sample mean will be:
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_i \mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\Big(\frac{1}{N}\sum_i \mathbf{x}_i\Big) = \mathbf{W}^T\bar{\mathbf{x}}$$
  – The class sample means will be:
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i) = j} \mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\bar{\mathbf{x}}_j$$
  – The average within-class variation will be:
$$\tilde{\mathbf{S}}_w = \frac{1}{N}\sum_j N_j\cdot\frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i) = j} \big(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\big)\big(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\big)^T = \mathbf{W}^T\Big(\frac{1}{N}\sum_j N_j\boldsymbol{\Sigma}_j\Big)\mathbf{W} = \mathbf{W}^T\mathbf{S}_w\mathbf{W}$$
LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – Similarly, the average between-class variation will be $\tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}$
  – Try to find the optimal W such that the following objective function is maximized:
$$J(\mathbf{W}) = \frac{\big|\tilde{\mathbf{S}}_b\big|}{\big|\tilde{\mathbf{S}}_w\big|} = \frac{\big|\mathbf{W}^T\mathbf{S}_b\mathbf{W}\big|}{\big|\mathbf{W}^T\mathbf{S}_w\mathbf{W}\big|}$$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$:
$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$
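A minimal MATLAB/Octave sketch of this closed-form solution for two hypothetical classes; the data, labels, and the single retained direction are illustrative assumptions.

X = [randn(100, 3); randn(100, 3) + 2];        % hypothetical samples, one per row
g = [ones(100, 1); 2 * ones(100, 1)];          % class labels
N = size(X, 1);  xbar = mean(X, 1);
Sw = zeros(3);  Sb = zeros(3);
for j = 1:2
  Xj = X(g == j, :);  Nj = size(Xj, 1);  xbarj = mean(Xj, 1);
  Sw = Sw + (Nj / N) * cov(Xj, 1);                         % average within-class variation
  Sb = Sb + (Nj / N) * (xbarj - xbar)' * (xbarj - xbar);   % average between-class variation
end
[W, Lam] = eig(Sb, Sw);                        % generalized eigenproblem: Sb*w = lambda*Sw*w
[lamSorted, idx] = sort(diag(Lam), 'descend');
w1 = W(:, idx(1));                             % direction with the largest generalized eigenvalue
Y = X * w1;                                    % projected (1-D) samples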
LDA Derivations (4/4)
• Proof:
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{\big|\tilde{\mathbf{S}}_b\big|}{\big|\tilde{\mathbf{S}}_w\big|} = \arg\max_{\mathbf{W}} \frac{\big|\mathbf{W}^T\mathbf{S}_b\mathbf{W}\big|}{\big|\mathbf{W}^T\mathbf{S}_w\mathbf{W}\big|} \qquad (|\cdot|: \text{determinant})$$
Or, equivalently, for each column vector $\mathbf{w}_i$ of W we want to find:
$$\lambda_i = \max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}$$
The quotient of quadratic forms has an optimal solution where its derivative vanishes, using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d}{d\mathbf{x}}\big(\mathbf{x}^T\mathbf{C}\mathbf{x}\big) = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$:
$$\frac{\partial\lambda_i}{\partial\mathbf{w}_i} = \frac{2\mathbf{S}_b\mathbf{w}_i\big(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\big) - 2\mathbf{S}_w\mathbf{w}_i\big(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\big)}{\big(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\big)^2} = \mathbf{0}$$
$$\Rightarrow\ \mathbf{S}_b\mathbf{w}_i - \left(\frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}\right)\mathbf{S}_w\mathbf{w}_i = \mathbf{0}\ \Rightarrow\ \mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i\ \Rightarrow\ \mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors vs. covariance matrix of the 18-cepstral vectors (after the cosine transform), both calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T, \qquad \boldsymbol{\Sigma}' = \frac{1}{N}\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) vs. covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), both calculated using Year-99's 5471 files]
Character Error Rate:
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
PCA vs. LDA (1/2)
[Figure: the projection directions found by PCA vs. those found by LDA on the same data]
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
  – The difference in the projections obtained from LDA and HDA for the 2-class case
• Clearly, HDA provides a much lower classification error than LDA, theoretically
  – However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
HW: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also, find/plot the covariance matrices for the two transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of the samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
http://berlin.csie.ntnu.edu.tw/PastCourses/2004S-MachineLearningandDataMining/Homework/HW-1/MaleData.txt
HW: Feature Transformation (2/4)
HW: Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar=[
  3.0 0.5 0.4;
  0.9 6.3 0.2;
  0.4 0.4 4.2;
];
colormap('default');
surf(CoVar);

• Eigen Decomposition
BE=[
  3.0 3.5 1.4;
  1.9 6.3 2.2;
  2.4 0.4 4.2;
];
WI=[
  4.0 4.1 2.1;
  2.9 8.7 3.5;
  4.4 3.2 4.3;
];

% LDA
IWI=inv(WI);
A=IWI*BE;
% PCA
A=BE+WI; % why ?? ( Prove it! )

[V,D]=eig(A);
[V,D]=eigs(A,3);

fid=fopen('Basis','w');
for i=1:3   % feature vector length
  for j=1:3 % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);
HW: Feature Transformation (4/4)
• Examples
[Figure: scatter plots (feature 1 vs. feature 2) of the 2000 original samples after the PCA transform and after the LDA transform]
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have a high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: project onto an orthonormal basis in the feature space
$$y_i = \boldsymbol{\varphi}_i^T\mathbf{X}, \qquad \hat{\mathbf{X}} = \sum_{i=1}^{k} y_i\boldsymbol{\varphi}_i \quad (k \leq n), \qquad \min \|\mathbf{X} - \hat{\mathbf{X}}\|^2 \text{ for a given } k$$
  – SVD (in LSA): approximate the m×n matrix A by a rank-k matrix A′ in the latent semantic space
$$\mathbf{A}_{m\times n} \approx \mathbf{A}'_{m\times n} = \mathbf{U}'_{m\times k}\,\boldsymbol{\Sigma}'_{k\times k}\,\mathbf{V}'^T_{k\times n}, \qquad k \leq r \leq \min(m, n), \qquad \min \|\mathbf{A} - \mathbf{A}'\|_F^2 \text{ for a given } k$$
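A minimal MATLAB/Octave sketch of the rank-k SVD approximation above; the small 4×3 term-doc matrix reuses the example values from the SVDLIBC slides near the end of this section, and k = 2 is an illustrative choice.

A = [2.3 0.0 4.2; 0.0 1.3 2.2; 3.8 0.0 0.5; 0.0 0.0 0.0];   % 4 terms x 3 docs
[U, S, V] = svd(A);                    % full SVD: A = U * S * V'
k  = 2;                                % number of latent semantic dimensions kept
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);
Ak = Uk * Sk * Vk';                    % rank-k approximation A'
err = norm(A - Ak, 'fro')^2;           % squared Frobenius-norm error ||A - A'||_F^2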
LSA (3/7)
  – Singular Value Decomposition (SVD) is used for the word-document matrix
    • A least-squares method for dimension reduction
    • Recall the projection of a vector x onto a basis vector, e.g.:
$$y_1 = \frac{\boldsymbol{\varphi}_1^T\mathbf{x}}{\|\boldsymbol{\varphi}_1\|} = \boldsymbol{\varphi}_1^T\mathbf{x} = \|\mathbf{x}\|\cos\theta, \quad \text{where } \|\boldsymbol{\varphi}_1\| = 1$$
LSA (4/7)
• Frameworks to circumvent the vocabulary mismatch problem
[Figure: a doc and a query can each be represented by their terms or by a structure model; doc expansion and query expansion connect the two representations, with matching performed either as literal term matching or as latent semantic structure retrieval]
LSA (5/7)
LSA (6/7)
Query: “human computer interaction”
[Figure: example retrieval in the latent semantic space; one of the query terms is an OOV word]
LSA (7/7)
• Singular Value Decomposition (SVD) of the m×n word-document matrix A (rows w1, w2, …, wm: words; columns d1, d2, …, dn: docs):
$$\mathbf{A}_{m\times n} = \mathbf{U}_{m\times r}\,\boldsymbol{\Sigma}_{r\times r}\,\mathbf{V}^T_{r\times n}, \qquad r \leq \min(m, n)$$
  – Both U and V have orthonormal column vectors: $\mathbf{U}^T\mathbf{U} = \mathbf{I}_{r\times r}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{r\times r}$ (Row A ⊂ R^n, Col A ⊂ R^m)
• Keeping only the k (k ≤ r) largest singular values gives the reduced decomposition:
$$\mathbf{A}'_{m\times n} = \mathbf{U}'_{m\times k}\,\boldsymbol{\Sigma}_{k\times k}\,\mathbf{V}'^T_{k\times n}, \qquad \|\mathbf{A}\|_F^2 \geq \|\mathbf{A}'\|_F^2, \qquad \|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
• Docs and queries are represented in a k-dimensional space; the quantities on the axes can be properly weighted according to the associated diagonal values of $\boldsymbol{\Sigma}_k$
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $\mathbf{A}^T\mathbf{A}$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$, and $\boldsymbol{\Sigma}^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
    • All eigenvectors $\mathbf{v}_j$ are orthonormal (∈ R^n): $\mathbf{V}_{n\times n} = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n]$, $\mathbf{v}_j^T\mathbf{v}_j = 1$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{n\times n}$
  – Define the singular values as $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, sorted in decreasing order
    • They are the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$
    • They are also the lengths of the vectors $\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_n$, i.e., $\sigma_1 = \|\mathbf{A}\mathbf{v}_1\|$, $\sigma_2 = \|\mathbf{A}\mathbf{v}_2\|$, …, since
$$\|\mathbf{A}\mathbf{v}_i\|^2 = \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_i = \lambda_i\mathbf{v}_i^T\mathbf{v}_i = \lambda_i \ \Rightarrow\ \|\mathbf{A}\mathbf{v}_i\| = \sigma_i$$
  – For $\lambda_i \neq 0$, $i = 1, \ldots, r$, $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A
LSA Derivations (2/7)
• $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A:
$$(\mathbf{A}\mathbf{v}_i)\cdot(\mathbf{A}\mathbf{v}_j) = \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j = 0 \quad (i \neq j)$$
  – Suppose that A (or $\mathbf{A}^T\mathbf{A}$) has rank r ≤ n: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$ and $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A:
$$\mathbf{u}_i = \frac{\mathbf{A}\mathbf{v}_i}{\|\mathbf{A}\mathbf{v}_i\|} = \frac{1}{\sigma_i}\mathbf{A}\mathbf{v}_i \ \Rightarrow\ \sigma_i\mathbf{u}_i = \mathbf{A}\mathbf{v}_i \ \Rightarrow\ [\mathbf{u}_1\ \cdots\ \mathbf{u}_r]\,\boldsymbol{\Sigma}_{r\times r} = \mathbf{A}\,[\mathbf{v}_1\ \cdots\ \mathbf{v}_r]$$
    (V is an orthonormal n×r matrix, known in advance; U is also an orthonormal m×r matrix)
  – Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of R^m:
$$[\mathbf{u}_1\ \cdots\ \mathbf{u}_m]\,\boldsymbol{\Sigma}_{m\times n} = \mathbf{A}\,[\mathbf{v}_1\ \cdots\ \mathbf{v}_n] \ \Rightarrow\ \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V} \ \Rightarrow\ \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{A}\mathbf{V}\mathbf{V}^T = \mathbf{A}$$
$$\text{where } \boldsymbol{\Sigma}_{m\times n} = \begin{pmatrix} \boldsymbol{\Sigma}_{r\times r} & \mathbf{0}_{r\times(n-r)} \\ \mathbf{0}_{(m-r)\times r} & \mathbf{0}_{(m-r)\times(n-r)} \end{pmatrix}$$
$$\Rightarrow\ \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \qquad \|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$
LSA Derivations (3/7)
[Figure: the subspaces in the SVD of the m×n matrix A. The columns of V1 span the row space of A (in R^n), while the columns of U1 span the row space of Aᵀ, i.e., the column space of A (in R^m); V2 and U2 span the complementary subspaces, and AX = 0 for any X in the span of V2]
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \begin{pmatrix} \mathbf{U}_1 & \mathbf{U}_2 \end{pmatrix}\begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}\begin{pmatrix} \mathbf{V}_1^T \\ \mathbf{V}_2^T \end{pmatrix} = \mathbf{U}_1\boldsymbol{\Sigma}_1\mathbf{V}_1^T, \qquad \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}$$
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of the corresponding row of A onto the basis formed by the columns of V:
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \ \Rightarrow\ \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}$$
    • The i-th entry of a row of U is related to the projection of the corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of the corresponding row of $\mathbf{A}^T$ (i.e., column of A) onto the basis formed by the columns of U:
$$\mathbf{A}^T = \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T \ \Rightarrow\ \mathbf{A}^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma}^T = \mathbf{V}\boldsymbol{\Sigma}$$
    • The i-th entry of a row of V is related to the projection of the corresponding row of $\mathbf{A}^T$ onto the i-th column of U
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A)
    • compare two terms → the dot product of two rows of A, i.e., an entry in $\mathbf{A}\mathbf{A}^T$
    • compare two docs → the dot product of two columns of A, i.e., an entry in $\mathbf{A}^T\mathbf{A}$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′), with $\mathbf{U}' = \mathbf{U}_{m\times k}$, $\boldsymbol{\Sigma}' = \boldsymbol{\Sigma}_k$, $\mathbf{V}' = \mathbf{V}_{n\times k}$ ($\boldsymbol{\Sigma}'$ stretches or shrinks the separate dimensions)
    • compare two terms → the dot product of two rows of $\mathbf{U}'\boldsymbol{\Sigma}'$, since
$$\mathbf{A}'\mathbf{A}'^T = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T = \mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T\mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T = (\mathbf{U}'\boldsymbol{\Sigma}')(\mathbf{U}'\boldsymbol{\Sigma}')^T$$
    • compare two docs → the dot product of two rows of $\mathbf{V}'\boldsymbol{\Sigma}'$, since
$$\mathbf{A}'^T\mathbf{A}' = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T) = \mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T = (\mathbf{V}'\boldsymbol{\Sigma}')(\mathbf{V}'\boldsymbol{\Sigma}')^T$$
    • compare a query word and a doc → each individual entry of A′
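A minimal sketch of these comparisons, reusing Uk, Sk and Vk from the earlier SVD sketch (the term and doc indices are illustrative):

termVecs = Uk * Sk;                          % each row: a term in the latent semantic space
docVecs  = Vk * Sk;                          % each row: a doc in the latent semantic space
termSim  = termVecs(1, :) * termVecs(3, :)'; % compare two terms: an entry of A'A'^T
docSim   = docVecs(1, :) * docVecs(2, :)';   % compare two docs: an entry of A'^T A'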
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector:
$$\hat{\mathbf{q}}_{1\times k} = \mathbf{q}^T_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
      (the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V)
  – Cosine measure between the query and doc row vectors in the latent semantic space:
$$\mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma}) = \frac{\hat{\mathbf{q}}\boldsymbol{\Sigma}^2\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\boldsymbol{\Sigma}\|\,\|\hat{\mathbf{d}}\boldsymbol{\Sigma}\|}$$
LSA Derivations (7/7)
• Fold-in a new 1×n term vector:
$$\hat{\mathbf{t}}_{1\times k} = \mathbf{t}_{1\times n}\,\mathbf{V}_{n\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
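A minimal sketch of the fold-in and cosine-similarity steps, again reusing Uk, Sk and Vk from the earlier SVD sketch; the raw query vector is hypothetical and would normally be TFxIDF-weighted first.

q    = [1; 0; 1; 0];                         % hypothetical m x 1 query vector over the 4 example terms
qhat = q' * Uk / Sk;                         % fold-in: q_hat = q' * U_k * inv(Sigma_k), like a row of V
dhat = Vk(1, :);                             % representation of doc 1 (a row of V_k)
qs = qhat * Sk;  ds = dhat * Sk;             % weight the axes by the diagonal values of Sigma_k
sim = (qs * ds') / (norm(qs) * norm(ds));    % cosine similarity in the latent semantic space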
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
LSA: Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall searches
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (each entry is weighted by a TFxIDF score):
    2.3  0.0  4.2
    0.0  1.3  2.2
    3.8  0.0  0.5
    0.0  0.0  0.0
  – The sparse-matrix file first gives the number of rows (terms), columns (docs), and nonzero entries; then, for each column, the number of nonzero entries in that column followed by its (row, value) pairs:
    4 3 6
    2        (2 nonzero entries at Col 0)
    0 2.3    (Col 0, Row 0)
    2 3.8    (Col 0, Row 2)
    1        (1 nonzero entry at Col 1)
    1 1.3    (Col 1, Row 1)
    3        (3 nonzero entries at Col 2)
    0 4.2    (Col 2, Row 0)
    1 2.2    (Col 2, Row 1)
    2 0.5    (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, …, 600, etc.) of LSA dimensionality
LSA Toolkit: SVDLIBC (3/5)
• Example: term-doc matrix
  – Indexing: 51253 terms (rows) and 2265 docs (columns)
  – [Figure: excerpt of the sparse term-doc matrix file, listing nonzero (row, value) entries such as 508 7.725771, 596 16.213399, 612 13.080868, 709 7.725771, 713 7.725771, 744 7.725771, 1190 7.725771, 1200 16.213399, 1259 7.725771, …]
• SVD command (IR_svd.bat):
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input format; -o LSA100: prefix of the output files; -d 100: number of reserved eigenvectors; Term-Doc-Matrix: name of the sparse matrix input
  – Output files: LSA100-Ut, LSA100-S, LSA100-Vt
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 word vectors, each $\mathbf{u}^T$: 1×100):
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
• LSA100-S (100 eigenvalues):
    100
    2686.18
    829.941
    559.59
    ….
• LSA100-Vt (2265 doc vectors, each $\mathbf{v}^T$: 1×100):
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
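Assuming the plain-text layouts shown above (this layout is an assumption drawn from the listings, not stated elsewhere in the slides), a minimal MATLAB/Octave sketch for reading the singular values back in:

fid = fopen('LSA100-S', 'r');     % header: the number of values, then one singular value per line
nS  = fscanf(fid, '%d', 1);
S   = fscanf(fid, '%f', nS);
fclose(fid);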
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (the query vector is TFxIDF-weighted beforehand):
$$\hat{\mathbf{q}}_{1\times k} = \mathbf{q}^T_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space:
$$\mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma}) = \frac{\hat{\mathbf{q}}\boldsymbol{\Sigma}^2\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\boldsymbol{\Sigma}\|\,\|\hat{\mathbf{d}}\boldsymbol{\Sigma}\|}$$
Statistics-67