Principal component analysis (PCA): Principles, Biplots, and Modern Extensions for Sparse Data

Steffen Unkel
Department of Medical Statistics
University Medical Center Göttingen
Summer term 2017
Outline
1 Principles of PCA
2 PCA biplots
3 Sparse PCA
1 Principles of PCA
Setting the scene
- The basic aim of PCA is to describe variation in a set of correlated variables x1, x2, . . . , xp in terms of a new set of uncorrelated variables y1, y2, . . . , yp.
- Each of y1, y2, . . . , yp is a linear combination of the x variables (e.g. y1 = a11x1 + a12x2 + · · · + a1pxp).
- The new variables are derived in decreasing order of "importance", in the sense that
  - y1 accounts for as much as possible of the variation (variance) in the original data amongst all linear combinations of x1, x2, . . . , xp;
  - then, y2 is chosen to account for as much as possible of the remaining variation, subject to being uncorrelated with y1, and so on (a small numerical illustration follows this list).
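As a small numerical illustration (a sketch with simulated data, not part of the slides), no unit-norm linear combination has a larger sample variance than the first PC:

## The first PC maximizes sample variance among unit-norm combinations (sketch)
set.seed(1)
X  <- matrix(rnorm(300), 100, 3) %*% matrix(c(1, 0.8, 0.5, 0, 0.6, 0.4, 0, 0, 0.3), 3, 3)
a1 <- eigen(cov(X))$vectors[, 1]  # coefficients of the first PC
a  <- replicate(1000, { v <- rnorm(3); v / sqrt(sum(v^2)) })  # random unit vectors
max(apply(X %*% a, 2, var)) <= var(drop(X %*% a1))  # TRUE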
Principal components and dimensionality reduction
- The new variables defined by this process, y1, y2, . . . , yp, are the principal components (PCs).
- The hope is that the first few PCs will account for a substantial proportion of the variation in the original variables x1, x2, . . . , xp.
- If so, the first few PCs can be used to provide a lower-dimensional summary of the data.
- The PCs form an orthogonal coordinate system.
The Olympic heptathlon data
- In the 1988 Olympics held in Seoul, the heptathlon was won by one of the stars of women's athletics in the USA, Jackie Joyner-Kersee.
- The heptathlon data set in the R package HSAUR3 contains the results for all 25 competitors in all seven disciplines.

library(HSAUR3)
data(heptathlon)

- We are using PCA with a view to exploring the structure of these data and assessing how the PCs relate to the scores assigned by the official scoring system.
Score all seven events in the same direction
heptathlon[c(14,25),]

##             hurdles highjump  shot run200m longjump javelin run800m
## Braun (FRG)   13.71     1.83 13.16   24.78     6.12   44.58   142.8
## Launa (PNG)   16.42     1.50 11.78   26.16     4.88   46.38   163.4
##             score
## Braun (FRG)  6109
## Launa (PNG)  4566
heptathlon$hurdles <- with(heptathlon, max(hurdles)-hurdles)
heptathlon$run200m <- with(heptathlon, max(run200m)-run200m)
heptathlon$run800m <- with(heptathlon, max(run800m)-run800m)
heptathlon[c(14,25),]
##             hurdles highjump  shot run200m longjump javelin run800m
## Braun (FRG)    2.71     1.83 13.16    1.83     6.12   44.58   20.61
## Launa (PNG)    0.00     1.50 11.78    0.45     4.88   46.38    0.00
##             score
## Braun (FRG)  6109
## Launa (PNG)  4566
Scatterplot matrix
score <- which(colnames(heptathlon) == "score")
plot(heptathlon[, -score])

[Figure: scatterplot matrix of the seven heptathlon events]
Correlation matrix
round(cor(heptathlon[,-score]), 2)
##          hurdles highjump shot run200m longjump javelin run800m
## hurdles     1.00     0.81 0.65    0.77     0.91    0.01    0.78
## highjump    0.81     1.00 0.44    0.49     0.78    0.00    0.59
## shot        0.65     0.44 1.00    0.68     0.74    0.27    0.42
## run200m     0.77     0.49 0.68    1.00     0.82    0.33    0.62
## longjump    0.91     0.78 0.74    0.82     1.00    0.07    0.70
## javelin     0.01     0.00 0.27    0.33     0.07    1.00   -0.02
## run800m     0.78     0.59 0.42    0.62     0.70   -0.02    1.00
Removing the outlier
heptathlon <- heptathlon[-grep("PNG", rownames(heptathlon)), ]
round(cor(heptathlon[,-score]), 2)

##          hurdles highjump shot run200m longjump javelin run800m
## hurdles     1.00     0.58 0.77    0.83     0.89    0.33    0.56
## highjump    0.58     1.00 0.46    0.39     0.66    0.35    0.15
## shot        0.77     0.46 1.00    0.67     0.78    0.34    0.41
## run200m     0.83     0.39 0.67    1.00     0.81    0.47    0.57
## longjump    0.89     0.66 0.78    0.81     1.00    0.29    0.52
## javelin     0.33     0.35 0.34    0.47     0.29    1.00    0.26
## run800m     0.56     0.15 0.41    0.57     0.52    0.26    1.00
Finding the sample principal components
- The first PC of the observations is the linear combination
    y1 = a11x1 + a12x2 + · · · + a1pxp
  whose sample variance is greatest among all such linear combinations.
- Since the variance of y1 could be increased without limit simply by increasing the coefficients a11, a12, . . . , a1p, a restriction must be placed on these coefficients.
- A sensible constraint is to require that the sum of squares of the coefficients for each PC should take the value one.
Eigendecomposition of the sample covariance matrix
- Let S be the positive semi-definite covariance matrix of a mean-centered data matrix X ∈ ℝ^{n×p} with rank(S) = r (r ≤ p).
- The eigenvalue decomposition (or spectral decomposition) of S can be written as
    S = AΛAᵀ = ∑_{i=1}^{r} λ_i a_i a_iᵀ,
  where Λ = diag(λ1, . . . , λr) is an r × r diagonal matrix containing the positive eigenvalues of S, λ1 ≥ · · · ≥ λr > 0, on its main diagonal, and A ∈ ℝ^{p×r} is a column-wise orthonormal matrix whose columns a1, . . . , ar are the unit-norm eigenvectors corresponding to λ1, . . . , λr.
PCA via the eigendecomposition
- PCA looks for r vectors a_j ∈ ℝ^{p×1} (j = 1, . . . , r) which
    maximize a_jᵀ S a_j
  subject to a_jᵀ a_j = 1 for j = 1, . . . , r and a_iᵀ a_j = 0 for i = 1, . . . , j − 1 (j ≥ 2).
- It turns out that y_j = X a_j is the j-th sample PC with zero mean and variance λ_j, where a_j is an eigenvector of S corresponding to its j-th largest eigenvalue λ_j (j = 1, . . . , r).
- The total variance of the r PCs will equal the total variance of the original variables, so that ∑_{j=1}^{r} λ_j = tr(S) (these properties are checked in the sketch below).
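These properties can be verified by hand in R; a minimal sketch (not part of the slides), using the heptathlon data prepared above:

## Hand-rolled PCA via the eigendecomposition of S (illustrative sketch)
X  <- scale(heptathlon[, -score], center = TRUE, scale = FALSE)  # mean-center
S  <- cov(X)            # sample covariance matrix
ed <- eigen(S)          # eigenvalues are returned in decreasing order
Y  <- X %*% ed$vectors  # component scores y_j = X a_j

## Var(y_j) = lambda_j, and the total variance is preserved
all.equal(as.numeric(apply(Y, 2, var)), ed$values)
all.equal(sum(ed$values), sum(diag(S)))  # sum of lambda_j = tr(S)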
Singular value decomposition of the data matrix
- The sample PCs can also be found using the singular value decomposition (SVD) of X.
- Expressing X with rank r (r ≤ min{n, p}) by its SVD gives
    X = VDAᵀ = ∑_{j=1}^{r} σ_j v_j a_jᵀ,
  where V = (v1, . . . , vr) ∈ ℝ^{n×r} and A = (a1, . . . , ar) ∈ ℝ^{p×r} are orthonormal matrices such that VᵀV = AᵀA = I_r, and D ∈ ℝ^{r×r} is a diagonal matrix with the singular values of X sorted in decreasing order, σ1 ≥ σ2 ≥ · · · ≥ σr > 0, on its main diagonal.
PCA via the SVD
- The matrix A is composed of coefficients or loadings, and the matrix of component scores Y ∈ ℝ^{n×r} is given by Y = VD.
- Since AᵀA = I_r and YᵀY/(n − 1) = D²/(n − 1), the loadings are orthogonal and the sample PCs are uncorrelated.
- The variance of the j-th sample PC is σ_j²/(n − 1), which is equal to the j-th largest eigenvalue, λ_j, of S (j = 1, . . . , r); see the sketch below.
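The same computation via the SVD, again as a sketch:

## PCA via the SVD of the mean-centered data matrix (illustrative sketch)
X  <- scale(heptathlon[, -score], center = TRUE, scale = FALSE)
sv <- svd(X)                      # X = V D A^T, with sv$u = V and sv$v = A
Y  <- sv$u %*% diag(sv$d)         # component scores Y = VD
lambda <- sv$d^2 / (nrow(X) - 1)  # variances sigma_j^2 / (n - 1)

all.equal(lambda, eigen(cov(X))$values)  # they match the eigenvalues of S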
PCA via the SVD
- In practice, the leading k components with k ≪ r usually account for a substantial proportion
    (λ1 + · · · + λk) / tr(S)
  of the total variance in the data, and the sum in the SVD of X is therefore truncated after the first k terms.
- If so, PCA comes down to finding a matrix Y = (y1, . . . , yk) ∈ ℝ^{n×k} of component scores of the n samples on the k components and a matrix A = (a1, . . . , ak) ∈ ℝ^{p×k} of coefficients whose k-th column is the vector of loadings for the k-th component.
Finding the sample principal components in R
- In R, PCA can be done using the functions princomp() and prcomp() (both contained in the R package stats).
- The princomp() function carries out PCA via an eigendecomposition of the sample covariance matrix S.
- When the variables are on very different scales, PCA is usually carried out on the correlation matrix R.
- The components extracted from R are not equal to those derived from S.
Correlations and covariances of variables and components
- The covariance of variable i with component j is given by
    Cov(x_i, y_j) = λ_j a_ji.
- The correlation of variable i with component j is therefore
    r_{x_i, y_j} = √λ_j a_ji / s_i,
  where s_i is the standard deviation of variable i.
- If the components are extracted from the correlation matrix, then
    r_{x_i, y_j} = √λ_j a_ji (verified numerically in the sketch below).
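The last identity is easy to verify numerically; a sketch, refitting the correlation-matrix PCA that is introduced on the next slide:

## For correlation-matrix PCA, sqrt(lambda_j) * a_ji equals the
## correlation between variable i and component j (illustrative sketch)
hp <- princomp(heptathlon[, -score], cor = TRUE)
r_theory <- sweep(unclass(hp$loadings), 2, hp$sdev, "*")  # sqrt(lambda_j) a_ji
r_empir  <- cor(heptathlon[, -score], hp$scores)
all.equal(unname(r_theory), unname(r_empir))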
PCA using the function princomp()
- For PCA we assume that each of the variables in the n × p data matrix X has been centered to have mean zero.
- Because the results for the seven heptathlon events are on different scales, we shall extract the PCs from the p × p correlation matrix R.

heptathlon_pca <- princomp(heptathlon[, -score], cor=TRUE)

- The result is a list containing the coefficients defining each component, the PC scores, et cetera.
Coefficients
The coefficients (also called loadings) for the first PC are obtained as

a1 <- heptathlon_pca$loadings[,1]
a1

## hurdles highjump    shot run200m longjump javelin run800m
## -0.4504  -0.3145 -0.4025 -0.4271  -0.4510 -0.2423 -0.3029

a1%*%a1

##      [,1]
## [1,]    1

a2 <- heptathlon_pca$loadings[,2]
a1%*%a2

##          [,1]
## [1,] 2.22e-16

Each loading vector is unique, up to a sign flip.
Rescaled coefficients
The loadings can be rescaled so that coefficients for the most important components are larger than those for less important components (a*_j = √λ_j a_j, for which a*_jᵀ a*_j = λ_j).

The rescaled loadings for the 1st PC are calculated as

rescaleda1 <- a1 * heptathlon_pca$sdev[1]
rescaleda1

## hurdles highjump    shot run200m longjump javelin run800m
## -0.9365  -0.6540 -0.8369 -0.8881  -0.9377 -0.5038 -0.6298

When the correlation matrix is analyzed, this rescaling leads to loadings that are the correlations between the 1st PC and the original variables.

rescaleda1%*%rescaleda1

##       [,1]
## [1,] 4.324
The variance explained by the principal components
- The total variance of the p PCs will equal the total variance of the original variables, so that
    ∑_{j=1}^{p} λ_j = s1² + s2² + · · · + sp²,
  where λ_j is the variance of the jth PC and s_j² is the sample variance of x_j.
- Consequently, the jth PC accounts for a proportion
    λ_j / ∑_{j=1}^{p} λ_j
  of the total variance, and the first k PCs account for a proportion
    ∑_{j=1}^{k} λ_j / ∑_{j=1}^{p} λ_j.
The summary() function
summary(heptathlon_pca)
## Importance of components:
##                        Comp.1 Comp.2 Comp.3  Comp.4  Comp.5  Comp.6
## Standard deviation     2.0793 0.9482 0.9109 0.68320 0.54619 0.33745
## Proportion of Variance 0.6177 0.1284 0.1185 0.06668 0.04262 0.01627
## Cumulative Proportion  0.6177 0.7461 0.8646 0.93131 0.97392 0.99019
##                          Comp.7
## Standard deviation     0.262042
## Proportion of Variance 0.009809
## Cumulative Proportion  1.000000
Criteria for choosing the number of components
1. Retain the first k components which explain a large proportion of the total variation, say 70-80%.
2. If the correlation matrix is analyzed, retain only those components with variances greater than one (criteria 1 and 2 are illustrated in the sketch after this list).
3. Examine a scree plot. This is a plot of the component variances versus the component number. The idea is to look for an "elbow", which corresponds to the point after which the eigenvalues decrease more slowly.
4. Consider whether the component has a sensible and useful interpretation.
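As a small illustration (a sketch, not from the slides), the first two criteria can be applied directly to the fitted object:

## Applying the retention criteria to the heptathlon PCA (sketch)
lambda <- heptathlon_pca$sdev^2  # component variances
prop   <- lambda / sum(lambda)   # proportion of variance explained
which(cumsum(prop) >= 0.80)[1]   # criterion 1: smallest k explaining >= 80%
sum(lambda > 1)                  # criterion 2: variances greater than one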
Scree plot
plot(heptathlon_pca$sdev^2, xlab="Component number",
     ylab="Component variance", type="l")

[Figure: scree plot of the component variances against the component number]
Principal component scores
PC scores can be obtained either via heptathlon_pca$scores or using the predict() function.
Scores on the 1st PC
heptathlon_pca$scores[,1]
or
predict(heptathlon_pca)[,1]
The uncorrelatedness of the PC scores
t(heptathlon_pca$scores)%*%heptathlon_pca$scores/(24)
##            Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
## Comp.1  4.324e+00  7.587e-16 -1.850e-16 -7.772e-16 -6.846e-16
## Comp.2  7.587e-16  8.990e-01  2.423e-15 -3.886e-16 -5.551e-16
## Comp.3 -1.850e-16  2.423e-15  8.297e-01 -1.824e-16 -1.746e-16
## Comp.4 -7.772e-16 -3.886e-16 -1.824e-16  4.668e-01 -9.946e-17
## Comp.5 -6.846e-16 -5.551e-16 -1.746e-16 -9.946e-17  2.983e-01
## Comp.6  1.230e-15 -7.517e-17 -1.214e-16 -4.077e-17  5.204e-17
## Comp.7 -1.943e-16 -6.823e-17  7.286e-17 -2.481e-16  9.483e-17
##            Comp.6     Comp.7
## Comp.1  1.230e-15 -1.943e-16
## Comp.2 -7.517e-17 -6.823e-17
## Comp.3 -1.214e-16  7.286e-17
## Comp.4 -4.077e-17 -2.481e-16
## Comp.5  5.204e-17  9.483e-17
## Comp.6  1.139e-01  3.955e-16
## Comp.7  3.955e-16  6.867e-02
The scores assigned to the athletes and the 1st PC
cor(heptathlon$score, heptathlon_pca$scores[,1])

## [1] -0.9931

plot(heptathlon$score, heptathlon_pca$scores[,1])

[Figure: scatterplot of the official heptathlon score against the scores on the 1st PC]
The USArrests data
- We now perform PCA on the USArrests data set, which is contained in the R package datasets.
- For each of the 50 US states, the data set contains the number of arrests per 100,000 residents in 1973 for each of three crimes: Assault, Murder, and Rape.
- We also record UrbanPop, which measures the percentage of the population in each state living in urban areas.
The USArrests data
The rows of the data set contain the 50 states in alphabetical order and the columns contain the four variables.
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
Examining the USArrests data
apply(USArrests, 2, mean)
##  Murder Assault UrbanPop    Rape
##   7.788 170.760   65.540  21.232
apply(USArrests, 2, var)
##  Murder Assault UrbanPop   Rape
##   18.97 6945.17   209.52  87.73
PCA on a given data matrix
- The princomp() function performs PCA on a covariance matrix S.
- We can also perform PCA directly on the n × p data matrix X using the function prcomp().
- We assume that the variables in X have been centered to have mean zero.
- Instead of performing PCA via an eigendecomposition of the covariance matrix as in princomp(), the computation in prcomp() is done by a singular value decomposition of the (centered and possibly scaled) data matrix.
PCA using the function prcomp()
- Next, we perform PCA on the USArrests data using the prcomp() function.

pr.out <- prcomp(USArrests, scale=TRUE)

- The calculation is done by a singular value decomposition of the centered and scaled data matrix X.
- By default, the prcomp() function centers the variables to have mean zero.
- By using the option scale=TRUE, we scale the variables to have standard deviation one.
The output of prcomp()
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
pr.out
## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
##
## Rotation:
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
Principal component scores
- Matrix-multiplying the (centered and scaled) data matrix X by pr.out$rotation gives the PC scores.
- Alternatively, in the object returned by prcomp(), the 50 × 4 matrix x has as its columns the PC score vectors.

dim(pr.out$x)

## [1] 50 4

- That is, the kth column of x is the kth PC score vector (see the check below).
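A quick check of this equivalence (a sketch, not from the slides):

## The scores in pr.out$x equal the scaled data times the rotation matrix
Xs <- scale(USArrests)  # center and scale, as in prcomp(..., scale=TRUE)
all.equal(Xs %*% pr.out$rotation, pr.out$x, check.attributes = FALSE)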
Proportion of variance explained by the components
summary(pr.out)

## Importance of components:
##                         PC1   PC2    PC3    PC4
## Standard deviation     1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion  0.62 0.868 0.9566 1.0000

pr.out$sdev

## [1] 1.5749 0.9949 0.5971 0.4164

pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
pve
## [1] 0.62006 0.24744 0.08914 0.04336
Plot of the proportion of variance explained
plot(pve, xlab="Principal Component",
     ylab="Proportion of Variance Explained", ylim=c(0,1), type='b')

[Figure: proportion of variance explained by each principal component]
Plot of the cumulative proportion of variance explained
plot(cumsum(pve), xlab="Principal Component",
     ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')

[Figure: cumulative proportion of variance explained by the principal components]
2 PCA biplots
Motivation
- Biplots are a graphical method for simultaneously displaying the variables and sample units described by a multivariate data matrix.
- A PCA biplot displays the component scores and the variable loadings obtained by PCA in two or three dimensions.
- The computations are based on the singular value decomposition of the (centered and possibly scaled) data matrix X.
- Two versions of PCA biplots exist in the literature and are implemented in software packages.
Example of the traditional form of a PCA biplot

[Figure 1: The Gabriel form of a PCA biplot for aircraft data]
PCA biplot for USArrests data
biplot(pr.out, scale=0)

[Figure: PCA biplot of the USArrests data on the first two PCs, showing the 50 states and the four variables]
The effect of scaling the variables
pr.noscale <- prcomp(USArrests, scale=FALSE)
par(mfrow = c(1,2))
biplot(pr.out, scale=0); biplot(pr.noscale, scale=0)

[Figure: PCA biplots of the USArrests data with scaled variables (left) and unscaled variables (right)]
Calibrated axes
- The arrows representing the variables can be converted into calibrated axes, analogous to ordinary scatterplots.
- Calibrated axes: the p variables are represented by p non-orthogonal axes, known as biplot axes.
- The biplot axes are used in precisely the same way as the Cartesian axes they approximate.
- This will give approximate values that do not in general agree precisely with those in the data matrix X, but reproduce the entries in the matrix YAᵀ (illustrated below).
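The matrix YAᵀ that the biplot reproduces is the rank-2 reconstruction of the data; a small sketch of the idea:

## Rank-2 reconstruction Y A^T underlying a two-dimensional biplot (sketch)
Y2 <- pr.out$x[, 1:2]         # scores on the first two PCs
A2 <- pr.out$rotation[, 1:2]  # loadings of the first two PCs
X_hat <- Y2 %*% t(A2)         # approximates the centered, scaled data

## Compare the approximation with the actual values for one state
round(cbind(approx = X_hat[1, ], actual = scale(USArrests)[1, ]), 2)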
PCA biplot with calibrated axes

[Figure 2: PCA biplot with calibrated axes for aircraft data]
The R package BiplotGUI
PCA biplots with calibrated axes can be obtained using the function PCAbipl() from the R package UBbipl, which is available from http://www.wiley.com//legacy/wileychi/gower/material.html. Alternatively, the BiplotGUI package provides a graphical user interface (GUI) for the construction of, interaction with, and manipulation of PCA biplots with calibrated axes in R.

library(BiplotGUI)

Biplots() is the sole function in the BiplotGUI package and initialises the GUI for a given set of data.

Biplots(USArrests)
Application to quality control data
- Throughout the period of a calendar month, a manufacturing company is monitoring 15 different variables in a production process.
- In an effort to quantify the overall product quality, this company devised a quality index value.
- At the end of the month, the means and standard deviations of the 15 selected variables were somehow transformed into a single quality index value in the interval [0, 100].
- The index values give no indication of what the causes of a poor index value could be.
- We perform a PCA on the monthly mean values of the 15 variables for January 2000 to March 2001.
PCA biplot of the (scaled) quality monitoring data

[Figure 3: PCA biplot of the scaled process quality data with a multidimensional target interpolated]
PCA biplot with quality regions

[Figure 4: PCA biplot of process quality data with a target, smooth trend line and quality regions added]
Quality of fit attained with PCA
Table 1: Explained variation by the first four principal components of the quality control data (cumulative proportion in percent).

1 dimension   2 dimensions   3 dimensions   4 dimensions
   37.8%         59.8%          74.9%          82.7%
3 Sparse PCA
Motivation
- A sparse statistical model is one having only a small number of nonzero parameters.
- In this section, we discuss how PCA can be sparsified.
- That is, how we can derive principal components with sparse loadings to yield more interpretable solutions.
- Sparse PCA is a natural extension of PCA well suited to high-dimensional data (p ≫ n).
Jeffers’ pitprops data
- Jeffers' pitprops data is a classical example showing the difficulty of interpreting principal components.
- The pitprops data is a correlation matrix of 13 physical measurements made on a sample of 180 pitprops cut from Corsican pine timber.

library(elasticnet)
data(pitprops)
dim(pitprops)
## [1] 13 13
The variables in Jeffers’ pitprops data
topdiam  Top diameter in inches
length   Length in inches
moist    Moisture content, % of dry weight
testsg   Specific gravity at time of test
ovensg   Oven-dry specific gravity
ringtop  Number of annual rings at top
ringbut  Number of annual rings at bottom
bowmax   Maximum bow in inches
bowdist  Distance of point of maximum bow from top in inches
whorls   Number of knot whorls
clear    Length of clear prop from top in inches
knots    Average number of knots per whorl
diaknot  Average diameter of the knots in inches
PCA of pitprops data
pitprop.pca <- princomp(covmat = pitprops)
summary(pitprop.pca)

## Importance of components:
##                        Comp.1 Comp.2 Comp.3  Comp.4 Comp.5  Comp.6
## Standard deviation     2.0539 1.5421 1.3705 1.05328 0.9540 0.90300
## Proportion of Variance 0.3245 0.1829 0.1445 0.08534 0.0700 0.06272
## Cumulative Proportion  0.3245 0.5074 0.6519 0.73726 0.8073 0.86999
##                         Comp.7  Comp.8  Comp.9 Comp.10 Comp.11
## Standard deviation     0.75917 0.66300 0.59387 0.43685 0.22487
## Proportion of Variance 0.04433 0.03381 0.02713 0.01468 0.00389
## Cumulative Proportion  0.91432 0.94813 0.97526 0.98994 0.99383
##                        Comp.12  Comp.13
## Standard deviation     0.20363 0.196785
## Proportion of Variance 0.00319 0.002979
## Cumulative Proportion  0.99702 1.000000
Loadings of the first six components
pitprop.pca$loadings[,1:6]
##           Comp.1   Comp.2   Comp.3   Comp.4   Comp.5    Comp.6
## topdiam -0.40379 -0.21785  0.20729  0.09121 -0.08263  0.119803
## length  -0.40554 -0.18613  0.23504  0.10272 -0.11279  0.162888
## moist   -0.12440 -0.54064 -0.14149 -0.07844  0.34977 -0.275901
## testsg  -0.17322 -0.45564 -0.35242 -0.05477  0.35576 -0.054017
## ovensg  -0.05717  0.17007 -0.48121 -0.04911  0.17610  0.625557
## ringtop -0.28443  0.01420 -0.47526  0.06343 -0.31583  0.052301
## ringbut -0.39984  0.18964 -0.25310  0.06498 -0.21507  0.002658
## bowmax  -0.29356  0.18915  0.24305 -0.28554  0.18533 -0.055119
## bowdist -0.35663 -0.01712  0.20764 -0.09672 -0.10611  0.034222
## whorls  -0.37892  0.24845  0.11877  0.20504  0.15639 -0.173148
## clear    0.01109 -0.20530  0.07045 -0.80366 -0.34299  0.175312
## knots    0.11508 -0.34317 -0.09200  0.30080 -0.60037 -0.169783
## diaknot  0.11251 -0.30853  0.32611  0.30338  0.07990  0.626307
Rotation
- A traditional way to simplify loadings is by rotation.
- The method of rotation emerged in Factor Analysis and was motivated both by solving the rotational indeterminacy problem and by facilitating the factors' interpretation.
- Rotation can be performed either in an orthogonal or an oblique (non-orthogonal) fashion.
- Several analytic orthogonal and oblique rotation criteria exist in the literature.
- All criteria attempt to create a loading matrix whose elements are close to zero or far from zero, with few intermediate values.
Rotation
- If A is the loading matrix, then A is post-multiplied by a matrix T to give rotated loadings B = AT.
- The rotation matrix T is chosen so as to optimize some simplicity criterion.
- We would also need an algorithm that optimizes the chosen rotation criterion and finds the "best" T.
- However, after rotation, either one or both of the properties possessed by PCA, that is, orthogonality of the loadings and uncorrelatedness of the component scores, is lost.
The Varimax rotation criterion
- Each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance.
- Varimax is the most widely used rotation criterion.
- Varimax tends to drive at least some of the loadings in each component towards zero.
- A component whose loadings are all roughly equal will be avoided by most standard rotation criteria (see the sketch below).
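As a sketch, the varimax() function in base R's stats package applies this criterion directly to a loading matrix; the GPArotation route used below for the pitprops data is the more general approach:

## Varimax rotation of the first two heptathlon loading vectors (sketch)
A  <- unclass(heptathlon_pca$loadings)[, 1:2]  # unrotated loadings
vm <- varimax(A)                               # stats::varimax
B  <- A %*% vm$rotmat                          # rotated loadings B = AT
round(B, 2)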
Gradient projection algorithm
- Problems in multivariate statistics are often concerned with the optimization of matrix functions of structured (e.g. orthogonal) matrix unknowns.
- Gradient projection algorithms are natural ways of solving such optimization problems, as they are especially designed to follow the geometry of the matrix parameters.
- They are based on the classical gradient approach and modified for analyzing and solving constrained optimization problems.
- The idea is to follow the steepest descent direction and to keep the gradient flow "nailed" to the manifold of permissible matrices.
Gradient projection algorithm for orthogonal rotation
- Here, the gradient projection algorithm for orthogonal rotation is used to find T that minimizes f(V) over all orthogonal matrices V.
- Let M be the manifold of all orthogonal matrices.
- Given a current value of V, this algorithm computes the gradient of f at V and moves α units in the negative gradient direction from V.
- The result is projected onto M (see the sketch below).
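The projection onto M has a closed form via the SVD: the nearest orthogonal matrix to a matrix Z with SVD Z = UDVᵀ is UVᵀ. A minimal sketch of one iteration, with the gradient function and the step size α as placeholders:

## One gradient projection step for orthogonal rotation (illustrative sketch)
proj_orth <- function(Z) {          # projection onto the orthogonal manifold M
  s <- svd(Z)
  s$u %*% t(s$v)                    # nearest orthogonal matrix to Z
}
gp_step <- function(V, grad_f, alpha = 0.1) {
  proj_orth(V - alpha * grad_f(V))  # steepest descent step, then project back
}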
The gradient projection algorithm visualized

[Figure 5: Projection on a manifold of permissible matrices]
Iterative scheme
- The algorithm proceeds iteratively; it is monotonically descending and converges from any starting point to a stationary point.
- At a stationary point of f restricted to M, the Frobenius norm of the gradient after projection onto the plane tangent to M at the current value of V is zero.
- The algorithm stops when the norm is less than some prescribed precision, say 10⁻⁵.
- Once the optimal rotation matrix T has been found, the rotated loading matrix is obtained as B = AT.
Using the Varimax criterion for Jeffers’ pitprops data
library(GPArotation)
A <- pitprop.pca$loadings[,1:6]
B <- GPForth(A, method="varimax")$loadings
B

##             Comp.1    Comp.2    Comp.3    Comp.4    Comp.5   Comp.6
## topdiam -0.4732810 -0.093071  0.066309 -0.035053 -0.047369  0.20604
## length  -0.4913803 -0.034979  0.053369 -0.039620 -0.047577  0.23409
## moist    0.0003155 -0.713627  0.148739 -0.001023  0.013383 -0.02893
## testsg  -0.0154043 -0.681406 -0.170698  0.018187  0.005942 -0.01546
## ovensg   0.0493026  0.003042 -0.807205 -0.018262  0.138311  0.12255
## ringtop -0.2388960 -0.028431 -0.391257 -0.014618 -0.358794 -0.27123
## ringbut -0.3638911  0.092181 -0.259272  0.063757 -0.130426 -0.28490
## bowmax  -0.2472791  0.033791  0.113722 -0.127971  0.439568 -0.12301
## bowdist -0.3980574  0.039518  0.086691 -0.138574  0.073745  0.01454
## whorls  -0.3446344  0.052830  0.087417  0.339108  0.212771 -0.16254
## clear   -0.0185273  0.009334 -0.008107 -0.916145  0.012019 -0.03890
## knots   -0.0318758  0.025179  0.177287 -0.011326 -0.765165  0.02324
## diaknot -0.0516283  0.031716 -0.100118  0.047261 -0.029195  0.82952
Sparse PCA based on the “elastic net”
- The lasso approach in PCA: perform PCA under the extra constraints ∑_{j=1}^{p} |a_kj| ≤ t for some tuning parameter t (k = 1, . . . , p).
- The above-mentioned approach has several limitations.
- The so-called elastic net generalizes the lasso to overcome its drawbacks.
- Elastic net approach in PCA: formulate PCA as a regression-type optimization problem; obtain sparse loadings by integrating a lasso penalty (via the elastic net) into the regression criterion.
Sparse PCA (SPCA) criterion based on the “elastic net”
- Optimization problem:
    (Â, B̂) = argmin_{A,B} ∑_{i=1}^{n} ‖x_i − ABᵀx_i‖² + λ ∑_{j=1}^{k} ‖β_j‖² + ∑_{j=1}^{k} λ_{1,j} ‖β_j‖₁
  subject to AᵀA = I_k.
- In the SPCA criterion above, A = (α1, . . . , αk) and B = (β1, . . . , βk) are p × k matrices, ‖ · ‖ denotes the Euclidean (ℓ2) norm, and ‖ · ‖₁ denotes the ℓ1 norm.
- Whereas the same λ is used for all k components, different λ_{1,j}'s are allowed for penalizing the loadings of different principal components.
Alternating algorithm to minimize the SPCA criterion
- B given A: For each j, let Y*_j = Xα_j. Each β_j in B = (β1, . . . , βk) is an elastic net estimate
    β̂_j = argmin_{β_j} ‖Y*_j − Xβ_j‖² + λ‖β_j‖² + λ_{1,j}‖β_j‖₁.
- A given B: If B is fixed, then we can ignore the penalty part of the SPCA criterion and only try to minimize
    ∑_{i=1}^{n} ‖x_i − ABᵀx_i‖² = ‖X − XBAᵀ‖²_F,
  subject to AᵀA = I_k. The solution is found via the SVD of
    (XᵀX)B = UDVᵀ,
  and we set Â = UVᵀ (a sketch of this step follows).
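The A-update is a Procrustes-type step with a closed-form solution; a minimal sketch (X a data matrix, B a p × k matrix of current loadings):

## A given B: minimize ||X - X B A^T||_F^2 subject to A^T A = I_k (sketch)
update_A <- function(X, B) {
  s <- svd(crossprod(X) %*% B)  # SVD of (X^T X) B = U D V^T
  s$u %*% t(s$v)                # set A = U V^T
}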
Some remarks about SPCA
- Empirical evidence suggests that the output of the above algorithm does not change much as λ is varied.
- Practically, λ is chosen to be a small positive number.
- Usually several combinations of λ_{1,j} are tried to figure out a good choice of the tuning parameters.
- Hence, we can pick a λ_{1,j} that gives a good compromise between explained variance and sparsity (variance-sparsity trade-off).
Implementation of sparse PCA in R
- Efficient algorithms do exist to fit the elastic net approach in PCA to multivariate data.
- Sparse PCA is implemented by the function spca() in the R package elasticnet.

?spca

- The function arrayspc() in the R package elasticnet is specifically designed for the case p ≫ n, as is typically the case in microarrays.

?arrayspc
Sparse PCA of Jeffers' pitprops data
pitprop.spcap <- spca(pitprops, K = 6, type = "Gram", sparse = "penalty",
                      para = c(0.06, 0.16, 0.1, 0.5, 0.5, 0.5))
pitprop.spcav <- spca(pitprops, K = 6, type = "Gram", sparse = "varnum",
                      para = c(7, 4, 4, 1, 1, 1))
pitprop.spcap$loadings

##             PC1      PC2      PC3 PC4 PC5 PC6
## topdiam -0.4774  0.00000  0.00000   0   0   0
## length  -0.4759  0.00000  0.00000   0   0   0
## moist    0.0000  0.78471  0.00000   0   0   0
## testsg   0.0000  0.61936  0.00000   0   0   0
## ovensg   0.1766  0.00000  0.64065   0   0   0
## ringtop  0.0000  0.00000  0.58901   0   0   0
## ringbut -0.2505  0.00000  0.49233   0   0   0
## bowmax  -0.3440 -0.02100  0.00000   0   0   0
## bowdist -0.4164  0.00000  0.00000   0   0   0
## whorls  -0.4000  0.00000  0.00000   0   0   0
## clear    0.0000  0.00000  0.00000  -1   0   0
## knots    0.0000  0.01333  0.00000   0  -1   0
## diaknot  0.0000  0.00000 -0.01557   0   0   1
pitprop.spcap$pev
## [1] 0.28035 0.13966 0.13298 0.07445 0.06802 0.06227