PCA and Mixtures of PCA: Improving the robustness to outliers
M. Verleysen (UCL)
26 October 2007
Joint work with Nicolas Delannay (UCL machine learning group) and Cédric Archambeau (University College London)
Overview
- Principal Component Analysis: a reminder
- Probabilistic PCA
- Robust probabilistic PCA
- Mixtures of (robust) probabilistic PCA
- Experiments
Maximum variance direction [1/2]
- We look for an orthogonal coordinate system such that the elements of X in the new system are uncorrelated
- We look for axes that maximize variance after projection
[Figure: a correlated 2-D data cloud is rotated into a new coordinate system in which its components are uncorrelated]
Maximum variance direction [2/2]
- We look for axes which maximise the variance after projection, i.e. along an axis u with $\|u\|^2 = u^T u = 1$
- In the new coordinate system: $X_1 = Y u$
- Therefore:
$$\sigma^2_{X_1} = \mathrm{Var}(X_1) = E[X_1^T X_1] = u^T E[Y^T Y]\, u$$
- Maximum variance = min. distortion (LMS)
Maximum variance = minimum distortion
[Figure: data cloud in the (y1, y2) plane; the projection axis x1 simultaneously maximizes the projected variance and minimizes the orthogonal distortion]
Choice of direction [1/3]
- Choice of u?
$$e = \arg\max_u E[u^T Y^T Y\, u] = \arg\max_u u^T C_{YY}\, u = \arg\max_u u^T \underbrace{\Theta \Lambda \Theta^T}_{C_{YY}\ \mathrm{(EVD)}} u$$
$$\sigma^2_{X_1,\max} = e^T \underbrace{\Theta \Lambda \Theta^T}_{C_{YY}\ \mathrm{(EVD)}}\, e$$
Choice of direction [2/3]
- Choice of u?
$$e = \arg\max_u E[u^T Y^T Y\, u] \;\Rightarrow\; e = \Theta_1$$
where $\Theta_1$ is the eigenvector corresponding to the largest eigenvalue of $C_{YY}$.
$$\sigma^2_{X_1,\max} = [1\ 0\ \dots\ 0]\; \Theta^T \underbrace{C_{YY}}_{\Theta \Lambda \Theta^T\ \mathrm{(EVD)}} \Theta\; [1\ 0\ \dots\ 0]^T = [1\ 0\ \dots\ 0] \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix} [1\ 0\ \dots\ 0]^T = \lambda_1$$
$$\sigma^2_{X_1,\max} = \lambda_1, \qquad \text{where } \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$$
Choice of direction [3/3]
- Classical result: the best choice for u1 is the eigenvector Θ1 associated with the largest eigenvalue λ1 of the matrix C_YY.
- In the space orthogonal to u1: the best choice for u2 is the eigenvector Θ2 associated with the second-largest eigenvalue λ2 of C_YY.
- And so on ... (see the NumPy sketch below)
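In code, this classical result is a short linear-algebra recipe: center the data, form the covariance matrix C_YY, and keep the eigenvectors with the largest eigenvalues. A minimal NumPy sketch (ours, not from the slides; names are illustrative):

```python
import numpy as np

def pca_eig(Y, J):
    """PCA of data matrix Y (N x D) via eigendecomposition of the
    covariance matrix; returns the top-J axes and projected data."""
    Yc = Y - Y.mean(axis=0)            # center the data
    C = Yc.T @ Yc / len(Y)             # covariance matrix C_YY (D x D)
    lam, Theta = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]      # sort descending: lam_1 >= lam_2 >= ...
    Theta, lam = Theta[:, order], lam[order]
    U = Theta[:, :J]                   # principal axes u_1 = Theta_1, u_2 = Theta_2, ...
    return U, lam[:J], Yc @ U          # axes, projected variances, projections X = Y U

# toy usage: a correlated 2-D cloud
rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.5], [0.0, 0.5]])
U, lam, X = pca_eig(Y, J=1)
print(U, lam)  # lam[0] is the maximum projected variance sigma^2_{X_1}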
Probabilistic PCA
- Idea #1: generative probabilistic model with latent variables x
- Idea #2: linear dependency between y_n and x_n
- Idea #3: Gaussian distribution of latent variables
- Idea #4: Gaussian additive noise
$$\{y_n\}_{n=1}^N \subset \mathbb{R}^D, \qquad \{x_n\}_{n=1}^N \subset \mathbb{R}^J, \qquad J < D$$
$$P(x) = \mathcal{N}(x \mid 0, I_J), \qquad P(y \mid x) = \mathcal{N}(y \mid Wx + \mu,\ \tau^{-1} I_D)$$
Probabilistic PCA
$$P(x) = \mathcal{N}(x \mid 0, I_J), \qquad P(y \mid x) = \mathcal{N}(y \mid Wx + \mu,\ \tau^{-1} I_D)$$
- Isotropic noise: natural (what else?)
- Arbitrary center and variance of the latent distribution: no problem, because they are compensated by μ and W
- Gaussians: natural, and mathematical convenience!
- Marginal distribution (checked numerically below):
$$P(y) = \int_X P(y \mid x)\, P(x)\, dx = \mathcal{N}(y \mid \mu, \Sigma), \qquad \text{where}\quad \Sigma = W W^T + \tau^{-1} I_D$$
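The identity Σ = W Wᵀ + τ⁻¹ I_D is easy to verify by simulation: sample from the generative model and compare the empirical covariance of y with the formula. A small sketch (dimensions and parameter values are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
D, J, N, tau = 5, 2, 200_000, 4.0
W = rng.normal(size=(D, J))
mu = rng.normal(size=D)

x = rng.normal(size=(N, J))                       # x ~ N(0, I_J)
noise = rng.normal(scale=tau**-0.5, size=(N, D))  # eps ~ N(0, tau^-1 I_D)
y = x @ W.T + mu + noise                          # y | x ~ N(Wx + mu, tau^-1 I_D)

Sigma_empirical = np.cov(y, rowvar=False)
Sigma_theory = W @ W.T + np.eye(D) / tau          # Sigma = W W^T + tau^-1 I_D
print(np.max(np.abs(Sigma_empirical - Sigma_theory)))  # small, shrinks as N grows
```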
Probabilistic PCA training
- Finding the parameters θ = {W, μ, τ} in
$$P(x) = \mathcal{N}(x \mid 0, I_J), \qquad P(y \mid x) = \mathcal{N}(y \mid Wx + \mu,\ \tau^{-1} I_D)$$
- How? By finding the parameters that lead to maximum likelihood of the observations y_n:
$$\theta_{\mathrm{ML}} = \arg\max_\theta P(\{y_n\}) = \arg\max_\theta \log\Big(\prod_n P(y_n)\Big) = \arg\min_\theta \Big(-\sum_n \log P(y_n)\Big) = \arg\min_\theta \Big(\tfrac{1}{2} \sum_n (y_n - \mu)^T \Sigma^{-1} (y_n - \mu) + \tfrac{N}{2} \log\lvert 2\pi\Sigma \rvert\Big)$$
Probabilistic PCA training
- How to find θ_ML?
$$\theta_{\mathrm{ML}} = \arg\min_\theta \Big(\tfrac{1}{2} \sum_n (y_n - \mu)^T \Sigma^{-1} (y_n - \mu) + \tfrac{N}{2} \log\lvert 2\pi\Sigma \rvert\Big), \qquad \Sigma = W W^T + \tau^{-1} I_D$$
- ≡ fitting a multivariate Gaussian distribution to the observations, with a constrained covariance matrix
- How: EM algorithm (a minimal sketch follows below)
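For concreteness, here is a minimal sketch of the standard EM iteration for PPCA (following the usual Tipping & Bishop updates, with σ² playing the role of τ⁻¹; this is our illustration, not the authors' code):

```python
import numpy as np

def ppca_em(Y, J, n_iter=100, seed=0):
    """EM for probabilistic PCA; Y is (N, D). Returns W, mu, sigma2 = 1/tau."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    mu = Y.mean(axis=0)                     # ML estimate of mu is the sample mean
    Yc = Y - mu
    W = rng.normal(size=(D, J))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent x_n
        M = W.T @ W + sigma2 * np.eye(J)    # (J, J)
        Minv = np.linalg.inv(M)
        Ex = Yc @ W @ Minv                  # rows are E[x_n], shape (N, J)
        Sxx = N * sigma2 * Minv + Ex.T @ Ex # sum_n E[x_n x_n^T]
        # M-step: update W and sigma2
        W = Yc.T @ Ex @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Yc**2)
                  - 2 * np.sum((Yc @ W) * Ex)
                  + np.trace(Sxx @ W.T @ W)) / (N * D)
    return W, mu, sigma2
```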
Probabilistic PCA
- Is it useful? (Does the probabilistic formulation add something to the deterministic one?)
- Answer: no
  - Tipping and Bishop showed that the axes found by PPCA span the same subspace as those found by PCA (the standard equivalence between maximum likelihood with a Gaussian noise hypothesis and the LMS criterion)
- Answer: yes ☺
  - The probabilistic formulation makes it possible to introduce new, more realistic hypotheses
  - Ex: the LMS criterion (≡ Gaussian hypothesis) does not correspond to natural goals!
PCA and outliers
- With "easy" data
[Figure: well-behaved data cloud in the (y1, y2) plane and the fitted principal axis x1]
PCA and outliers
- With a few outliers
[Figure: the same cloud plus a few outliers (labelled "new data") in the (y1, y2) plane; the fitted axis x1 is pulled toward them]
PCA and outliers
- Why is it the case?
[Figure: with the axis x1 aligned to the main cloud, the new data (outliers) are not likely under the model (low posterior)]
[Figure: with the axis rotated toward the outliers, all data become more likely, so maximum likelihood prefers the rotated solution]
PCA robust to outliers
- Increasing the probability of outliers: Gaussian
[Figure: widening the Gaussian makes the new data more probable, but then all data close to the x1 axis become equally probable]
- Increasing the probability of outliers: Student
[Figure: with a Student-t model, for an identical outlier posterior there is better discrimination around the axis]
Student-t distribution
[Figure: Gaussian vs. Student-t densities P(x) (left) and negative log-densities −log P(x) (right); the Student-t has heavier tails, so large deviations are penalized far less (see the numerical check below)]
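The difference in tail behaviour is easy to quantify: the Gaussian negative log-density grows quadratically with the deviation, the Student-t only logarithmically. A tiny check with scipy (ν = 3 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm, t

x = np.array([0.0, 2.0, 5.0, 10.0])
print(-norm.logpdf(x))     # quadratic growth: outliers become extremely unlikely
print(-t.logpdf(x, df=3))  # logarithmic growth: outliers stay plausible
```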
Robust PPCA
- Probabilistic PCA:
$$P(x) = \mathcal{N}(x \mid 0, I_J), \qquad P(y \mid x) = \mathcal{N}(y \mid Wx + \mu,\ \tau^{-1} I_D)$$
- Robust probabilistic PCA:
$$P(x) = \mathrm{St}(x \mid 0, I_J, \nu), \qquad P(y \mid x) = \mathrm{St}(y \mid Wx + \mu,\ \tau^{-1} I_D, \nu)$$
(identical ν in both distributions, for simplicity!)
Robust PPCA
- Model:
$$P(x) = \mathrm{St}(x \mid 0, I_J, \nu), \qquad P(y \mid x) = \mathrm{St}(y \mid Wx + \mu,\ \tau^{-1} I_D, \nu)$$
- But a Student-t is an infinite mixture of Gaussians:
$$\mathrm{St}(y \mid \mu, \Sigma, \nu) = \int_0^\infty \mathcal{N}\Big(y \,\Big|\, \mu, \tfrac{1}{u}\Sigma\Big)\, \mathrm{Ga}\Big(u \,\Big|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\Big)\, du, \qquad \nu > 0,$$
where Ga(u | ·, ·) is a Gamma distribution.
- Therefore the generative model is (the sketch below samples from this process):
$$P(u) = \mathrm{Ga}\Big(u \,\Big|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\Big), \qquad P(x \mid u) = \mathcal{N}\Big(x \,\Big|\, 0, \tfrac{1}{u} I_J\Big), \qquad P(y \mid x, u) = \mathcal{N}\Big(y \,\Big|\, Wx + \mu, \tfrac{1}{u\tau} I_D\Big)$$
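This infinite-mixture representation doubles as a sampler: draw the scale u from the Gamma, then draw x and the noise from Gaussians whose covariances are divided by u. A hedged sketch of the generative process (parameter values are arbitrary; note that NumPy's Gamma is parameterized by shape and scale, so rate ν/2 becomes scale 2/ν):

```python
import numpy as np

rng = np.random.default_rng(2)
D, J, N, tau, nu = 5, 2, 100_000, 4.0, 3.0
W, mu = rng.normal(size=(D, J)), rng.normal(size=D)

u = rng.gamma(shape=nu / 2, scale=2 / nu, size=N)          # u ~ Ga(nu/2, nu/2)
x = rng.normal(size=(N, J)) / np.sqrt(u)[:, None]          # x | u ~ N(0, I_J / u)
eps = rng.normal(size=(N, D)) / np.sqrt(u * tau)[:, None]  # noise | u ~ N(0, I_D / (u tau))
y = x @ W.T + mu + eps                                     # marginally, y ~ St(mu, Sigma, nu)
```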
Robust PPCA
- Model:
$$P(u) = \mathrm{Ga}\Big(u \,\Big|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\Big), \qquad P(x \mid u) = \mathcal{N}\Big(x \,\Big|\, 0, \tfrac{1}{u} I_J\Big), \qquad P(y \mid x, u) = \mathcal{N}\Big(y \,\Big|\, Wx + \mu, \tfrac{1}{u\tau} I_D\Big)$$
- Good news: the posterior is tractable:
$$P(y) = \int_0^\infty \!\! \int_X P(y \mid x, u)\, P(x \mid u)\, P(u)\, dx\, du = \mathrm{St}(y \mid \mu, \Sigma, \nu)$$
where again $\Sigma = W W^T + \tau^{-1} I_D$.
Robust PPCA training
- Finding the parameters θ = {W, μ, τ, ν} in
$$P(u) = \mathrm{Ga}\Big(u \,\Big|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\Big), \qquad P(x \mid u) = \mathcal{N}\Big(x \,\Big|\, 0, \tfrac{1}{u} I_J\Big), \qquad P(y \mid x, u) = \mathcal{N}\Big(y \,\Big|\, Wx + \mu, \tfrac{1}{u\tau} I_D\Big)$$
- How? By finding the parameters that lead to maximum likelihood of the observations y_n:
$$\theta_{\mathrm{ML}} = \arg\min_\theta \Big(-\sum_n \log P(y_n)\Big)$$
Robust PPCA advantages
1. Only one parameter to fix in advance (the dimension of the latent, i.e. projection, space)
2. Natural framework for an extension to mixtures
Mixtures of (robust) PPCA
- (P)PCA: linear dependencies in data only
- Mixtures of (P)PCA: nonlinear dependencies, through mixtures of linear manifolds
- Latent variable model, no common projection space
[Figure: data in the (y1, y2) plane approximated by three local linear components with axes x11, x12, x13]
Mixtures of (robust) PPCA
- Concept of (hidden) cluster membership:
$$P(y) = \sum_k \pi_k P_k(y)$$
where each $P_k(y)$ is a single PCA model and the mixture proportions satisfy $\pi_k > 0$, $\sum_k \pi_k = 1$ (see the sketch below).
[Figure: the same three-component cloud; each point is generated by one hidden component]
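In code, the mixture density is just a π-weighted sum of per-component densities; for the robust variant, each P_k(y) is the Student-t marginal St(y | μ_k, Σ_k, ν_k) with Σ_k = W_k W_kᵀ + τ_k⁻¹ I_D. A sketch assuming scipy.stats.multivariate_t is available (SciPy ≥ 1.6):

```python
from scipy.stats import multivariate_t

def mixture_density(y, pis, mus, Sigmas, nus):
    """P(y) = sum_k pi_k * St(y | mu_k, Sigma_k, nu_k) for a robust PPCA mixture,
    where Sigma_k = W_k W_k^T + tau_k^{-1} I_D has been precomputed."""
    return sum(pi * multivariate_t(mu, Sigma, df=nu).pdf(y)
               for pi, mu, Sigma, nu in zip(pis, mus, Sigmas, nus))
```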
Mixtures of (robust) PPCA
- z = [z1, z2, ..., zK] with z_k = 1 if y_n was generated by component k, z_k = 0 otherwise
- Latent variable model:
$$P(z) = \prod_k \pi_k^{z_k}$$
$$P(u \mid z) = \prod_k \mathrm{Ga}\Big(u \,\Big|\, \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\Big)^{z_k}$$
$$P(x \mid u, z) = \prod_k \mathcal{N}\Big(x_k \,\Big|\, 0, \tfrac{1}{u} I_{J_k}\Big)^{z_k}$$
$$P(y \mid x, u, z) = \prod_k \mathcal{N}\Big(y \,\Big|\, W_k x_k + \mu_k, \tfrac{1}{u\tau_k} I_D\Big)^{z_k}$$
- Possibly different dimensionalities J_k across components
Mixtures of (robust) PPCA training
- Finding the parameters θ = {(W_k, μ_k, τ_k, ν_k, π_k)}_{k=1...K} in the model above
- How? By finding the parameters that lead to maximum likelihood of the observations y_n:
$$\theta_{\mathrm{ML}} = \arg\min_\theta \Big(-\sum_n \log P(y_n)\Big)$$
Mixtures of (robust) PPCA training
- In practice: EM algorithm
  - E-step: fix the model parameters, and compute expectations (over an approximate distribution) of the latent variables (a minimal sketch of this step follows below)
  - M-step: update the model parameters to maximize the likelihood
- Two hyper-parameters:
  - the number of components
  - the dimensionality of the latent representations
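As an illustration of the E-step, the expected cluster memberships (responsibilities) follow from Bayes' rule; the sketch below shows only this membership part, not the per-component expectations of u_n and x_n that the full E-step also requires (again, our illustration using the Student-t marginals of the robust model):

```python
import numpy as np
from scipy.stats import multivariate_t

def responsibilities(Y, pis, mus, Sigmas, nus):
    """E-step for the cluster labels on data Y (N, D):
    r_nk = pi_k St(y_n | mu_k, Sigma_k, nu_k) / sum_j pi_j St(y_n | mu_j, Sigma_j, nu_j)."""
    R = np.column_stack([pi * multivariate_t(mu, Sigma, df=nu).pdf(Y)
                         for pi, mu, Sigma, nu in zip(pis, mus, Sigmas, nus)])
    return R / R.sum(axis=1, keepdims=True)  # rows sum to 1
```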
Experiments
- PCA and robust PCA
[Figure: 2-D dataset in the (y1, y2) plane used to compare PCA and robust PCA]
Experiments
[Figure: 2x2 grid of fits on the same 2-D data in the (y1, y2) plane; columns: non-robust vs. robust, rows: single PCA vs. mixtures]
Experiments
- Three 3-D Gaussian clusters
  - Diagonal covariance matrix: diag(5, 1, 0.2) before rotation
  - Intrinsic dimensionality: 2
[Figures: left, the three 3-D Gaussian clusters in (y1, y2, y3); right, model selection without outliers: test −log P(y) of the robust model versus the number of components K, for latent dimensionalities J=1 and J=2]
Experiments
- True model (K=3, J=2), with outliers added
[Figures: left, performances on the test set: −log P(y) versus the number of outliers; right, degree-of-freedom parameters ν_k versus the number of outliers: some components stay almost Gaussian (large ν_k) while another takes the outliers into account (small ν_k)]
Experiments
- USPS handwritten digit dataset:
  - around 700 images of "2"
  - around 700 images of "3"
  - around 100 images of "0" (as outliers)
- Experiment: two 1-D components (two clusters)
- Illustrations: images close to the 1-D subspaces
[Figure: images close to the learned 1-D subspaces, standard mixtures vs. robust mixtures]
Conclusions
- Probabilistic formulation of PCA:
  - identical to PCA
  - but possible to introduce other hypotheses
- Robust PCA:
  - replacing Gaussians by Student-t
  - makes outliers more likely → lower influence on the model
- Mixtures of PCA:
  - latent variable model
  - better than mixtures of Gaussians through regularization (dimensionality of latent space)
- Mixtures of Robust PCA:
  - replacing Gaussians by Student-t