Principal component analysis (PCA): Principles, Biplots, and Modern Extensions for Sparse Data

Steffen Unkel
Department of Medical Statistics
University Medical Center Göttingen
Summer term 2017
Outline
1 Principles of PCA
2 PCA biplots
3 Sparse PCA
1 Principles of PCA
Setting the scene
- The basic aim of PCA is to describe variation in a set of correlated variables x1, x2, . . . , xp in terms of a new set of uncorrelated variables y1, y2, . . . , yp.
- Each of y1, y2, . . . , yp is a linear combination of the x variables (e.g. y1 = a11x1 + a12x2 + · · · + a1pxp).
- The new variables are derived in decreasing order of "importance", in the sense that
  - y1 accounts for as much as possible of the variation (variance) in the original data amongst all linear combinations of x1, x2, . . . , xp;
  - then, y2 is chosen to account for as much as possible of the remaining variation, subject to being uncorrelated with y1, and so on (a small numerical illustration follows this list).
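As a small numerical illustration (a sketch with simulated data, not part of the slides), no unit-norm linear combination has a larger sample variance than the first PC:

## The first PC maximizes sample variance among unit-norm combinations (sketch)
set.seed(1)
X  <- matrix(rnorm(300), 100, 3) %*% matrix(c(1, 0.8, 0.5, 0, 0.6, 0.4, 0, 0, 0.3), 3, 3)
a1 <- eigen(cov(X))$vectors[, 1]  # coefficients of the first PC
a  <- replicate(1000, { v <- rnorm(3); v / sqrt(sum(v^2)) })  # random unit vectors
max(apply(X %*% a, 2, var)) <= var(drop(X %*% a1))  # TRUE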
Principal components and dimensionality reduction
- The new variables defined by this process, y1, y2, . . . , yp, are the principal components (PCs).
- The hope is that the first few PCs will account for a substantial proportion of the variation in the original variables x1, x2, . . . , xp.
- If so, the first few PCs can be used to provide a lower-dimensional summary of the data.
- The PCs form an orthogonal coordinate system.
The Olympic heptathlon data
- In the 1988 Olympics held in Seoul, the heptathlon was won by one of the stars of women's athletics in the USA, Jackie Joyner-Kersee.
- The heptathlon data set in the R package HSAUR3 contains the results for all 25 competitors in all seven disciplines.

library(HSAUR3)
data(heptathlon)

- We are using PCA with a view to exploring the structure of these data and assessing how the PCs relate to the scores assigned by the official scoring system.
Score all seven events in the same direction
heptathlon[c(14,25),]

##             hurdles highjump  shot run200m longjump javelin run800m
## Braun (FRG)   13.71     1.83 13.16   24.78     6.12   44.58   142.8
## Launa (PNG)   16.42     1.50 11.78   26.16     4.88   46.38   163.4
##             score
## Braun (FRG)  6109
## Launa (PNG)  4566
heptathlon$hurdles <- with(heptathlon, max(hurdles)-hurdles)
heptathlon$run200m <- with(heptathlon, max(run200m)-run200m)
heptathlon$run800m <- with(heptathlon, max(run800m)-run800m)
heptathlon[c(14,25),]
##             hurdles highjump  shot run200m longjump javelin run800m
## Braun (FRG)    2.71     1.83 13.16    1.83     6.12   44.58   20.61
## Launa (PNG)    0.00     1.50 11.78    0.45     4.88   46.38    0.00
##             score
## Braun (FRG)  6109
## Launa (PNG)  4566
Scatterplot matrix
score <- which(colnames(heptathlon) == "score")
plot(heptathlon[, -score])

[Figure: scatterplot matrix of the seven heptathlon events]
Correlation matrix
round(cor(heptathlon[,-score]), 2)
##          hurdles highjump shot run200m longjump javelin run800m
## hurdles     1.00     0.81 0.65    0.77     0.91    0.01    0.78
## highjump    0.81     1.00 0.44    0.49     0.78    0.00    0.59
## shot        0.65     0.44 1.00    0.68     0.74    0.27    0.42
## run200m     0.77     0.49 0.68    1.00     0.82    0.33    0.62
## longjump    0.91     0.78 0.74    0.82     1.00    0.07    0.70
## javelin     0.01     0.00 0.27    0.33     0.07    1.00   -0.02
## run800m     0.78     0.59 0.42    0.62     0.70   -0.02    1.00
Removing the outlier
heptathlon <- heptathlon[-grep("PNG", rownames(heptathlon)), ]
round(cor(heptathlon[,-score]), 2)

##          hurdles highjump shot run200m longjump javelin run800m
## hurdles     1.00     0.58 0.77    0.83     0.89    0.33    0.56
## highjump    0.58     1.00 0.46    0.39     0.66    0.35    0.15
## shot        0.77     0.46 1.00    0.67     0.78    0.34    0.41
## run200m     0.83     0.39 0.67    1.00     0.81    0.47    0.57
## longjump    0.89     0.66 0.78    0.81     1.00    0.29    0.52
## javelin     0.33     0.35 0.34    0.47     0.29    1.00    0.26
## run800m     0.56     0.15 0.41    0.57     0.52    0.26    1.00
Finding the sample principal components
- The first PC of the observations is the linear combination
    y1 = a11x1 + a12x2 + · · · + a1pxp
  whose sample variance is greatest among all such linear combinations.
- Since the variance of y1 could be increased without limit simply by increasing the coefficients a11, a12, . . . , a1p, a restriction must be placed on these coefficients.
- A sensible constraint is to require that the sum of squares of the coefficients for each PC should take the value one.
Eigendecomposition of the sample covariance matrix
- Let S be the positive semi-definite covariance matrix of a mean-centered data matrix X ∈ ℝ^{n×p} with rank(S) = r (r ≤ p).
- The eigenvalue decomposition (or spectral decomposition) of S can be written as
    S = AΛAᵀ = ∑_{i=1}^{r} λ_i a_i a_iᵀ,
  where Λ = diag(λ1, . . . , λr) is an r × r diagonal matrix containing the positive eigenvalues of S, λ1 ≥ · · · ≥ λr > 0, on its main diagonal, and A ∈ ℝ^{p×r} is a column-wise orthonormal matrix whose columns a1, . . . , ar are the unit-norm eigenvectors corresponding to λ1, . . . , λr.
PCA via the eigendecomposition
- PCA looks for r vectors a_j ∈ ℝ^{p×1} (j = 1, . . . , r) which
    maximize a_jᵀ S a_j
  subject to a_jᵀ a_j = 1 for j = 1, . . . , r and a_iᵀ a_j = 0 for i = 1, . . . , j − 1 (j ≥ 2).
- It turns out that y_j = X a_j is the j-th sample PC with zero mean and variance λ_j, where a_j is an eigenvector of S corresponding to its j-th largest eigenvalue λ_j (j = 1, . . . , r).
- The total variance of the r PCs will equal the total variance of the original variables, so that ∑_{j=1}^{r} λ_j = tr(S) (these properties are checked in the sketch below).
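These properties can be verified by hand in R; a minimal sketch (not part of the slides), using the heptathlon data prepared above:

## Hand-rolled PCA via the eigendecomposition of S (illustrative sketch)
X  <- scale(heptathlon[, -score], center = TRUE, scale = FALSE)  # mean-center
S  <- cov(X)            # sample covariance matrix
ed <- eigen(S)          # eigenvalues are returned in decreasing order
Y  <- X %*% ed$vectors  # component scores y_j = X a_j

## Var(y_j) = lambda_j, and the total variance is preserved
all.equal(as.numeric(apply(Y, 2, var)), ed$values)
all.equal(sum(ed$values), sum(diag(S)))  # sum of lambda_j = tr(S)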
Singular value decomposition of the data matrix
- The sample PCs can also be found using the singular value decomposition (SVD) of X.
- Expressing X with rank r (r ≤ min{n, p}) by its SVD gives
    X = VDAᵀ = ∑_{j=1}^{r} σ_j v_j a_jᵀ,
  where V = (v1, . . . , vr) ∈ ℝ^{n×r} and A = (a1, . . . , ar) ∈ ℝ^{p×r} are orthonormal matrices such that VᵀV = AᵀA = I_r, and D ∈ ℝ^{r×r} is a diagonal matrix with the singular values of X sorted in decreasing order, σ1 ≥ σ2 ≥ · · · ≥ σr > 0, on its main diagonal.
PCA via the SVD
- The matrix A is composed of coefficients or loadings, and the matrix of component scores Y ∈ ℝ^{n×r} is given by Y = VD.
- Since AᵀA = I_r and YᵀY/(n − 1) = D²/(n − 1), the loadings are orthogonal and the sample PCs are uncorrelated.
- The variance of the j-th sample PC is σ_j²/(n − 1), which is equal to the j-th largest eigenvalue, λ_j, of S (j = 1, . . . , r); see the sketch below.
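The same computation via the SVD, again as a sketch:

## PCA via the SVD of the mean-centered data matrix (illustrative sketch)
X  <- scale(heptathlon[, -score], center = TRUE, scale = FALSE)
sv <- svd(X)                      # X = V D A^T, with sv$u = V and sv$v = A
Y  <- sv$u %*% diag(sv$d)         # component scores Y = VD
lambda <- sv$d^2 / (nrow(X) - 1)  # variances sigma_j^2 / (n - 1)

all.equal(lambda, eigen(cov(X))$values)  # they match the eigenvalues of S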
PCA via the SVD
- In practice, the leading k components with k ≪ r usually account for a substantial proportion
    (λ1 + · · · + λk) / tr(S)
  of the total variance in the data, and the sum in the SVD of X is therefore truncated after the first k terms.
- If so, PCA comes down to finding a matrix Y = (y1, . . . , yk) ∈ ℝ^{n×k} of component scores of the n samples on the k components and a matrix A = (a1, . . . , ak) ∈ ℝ^{p×k} of coefficients whose k-th column is the vector of loadings for the k-th component.
Finding the sample principal components in R
- In R, PCA can be done using the functions princomp() and prcomp() (both contained in the R package stats).
- The princomp() function carries out PCA via an eigendecomposition of the sample covariance matrix S.
- When the variables are on very different scales, PCA is usually carried out on the correlation matrix R.
- The components extracted from R are not equal to those derived from S.
Correlations and covariances of variables and components
- The covariance of variable i with component j is given by
    Cov(x_i, y_j) = λ_j a_ji.
- The correlation of variable i with component j is therefore
    r_{x_i, y_j} = √λ_j a_ji / s_i,
  where s_i is the standard deviation of variable i.
- If the components are extracted from the correlation matrix, then
    r_{x_i, y_j} = √λ_j a_ji (verified numerically in the sketch below).
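The last identity is easy to verify numerically; a sketch, refitting the correlation-matrix PCA that is introduced on the next slide:

## For correlation-matrix PCA, sqrt(lambda_j) * a_ji equals the
## correlation between variable i and component j (illustrative sketch)
hp <- princomp(heptathlon[, -score], cor = TRUE)
r_theory <- sweep(unclass(hp$loadings), 2, hp$sdev, "*")  # sqrt(lambda_j) a_ji
r_empir  <- cor(heptathlon[, -score], hp$scores)
all.equal(unname(r_theory), unname(r_empir))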
PCA using the function princomp()
- For PCA we assume that each of the variables in the n × p data matrix X has been centered to have mean zero.
- Because the results for the seven heptathlon events are on different scales, we shall extract the PCs from the p × p correlation matrix R.

heptathlon_pca <- princomp(heptathlon[, -score], cor=TRUE)

- The result is a list containing the coefficients defining each component, the PC scores, et cetera.
Coefficients
The coefficients (also called loadings) for the first PC are obtained as

a1 <- heptathlon_pca$loadings[,1]
a1

## hurdles highjump    shot run200m longjump javelin run800m
## -0.4504  -0.3145 -0.4025 -0.4271  -0.4510 -0.2423 -0.3029

a1%*%a1

##      [,1]
## [1,]    1

a2 <- heptathlon_pca$loadings[,2]
a1%*%a2

##          [,1]
## [1,] 2.22e-16

Each loading vector is unique, up to a sign flip.
Rescaled coefficients
The loadings can be rescaled so that coefficients for the most important components are larger than those for less important components (a*_j = √λ_j a_j, for which a*_jᵀ a*_j = λ_j).

The rescaled loadings for the 1st PC are calculated as

rescaleda1 <- a1 * heptathlon_pca$sdev[1]
rescaleda1

## hurdles highjump    shot run200m longjump javelin run800m
## -0.9365  -0.6540 -0.8369 -0.8881  -0.9377 -0.5038 -0.6298

When the correlation matrix is analyzed, this rescaling leads to loadings that are the correlations between the 1st PC and the original variables.

rescaleda1%*%rescaleda1

##       [,1]
## [1,] 4.324
The variance explained by the principal components
- The total variance of the p PCs will equal the total variance of the original variables, so that
    ∑_{j=1}^{p} λ_j = s1² + s2² + · · · + sp²,
  where λ_j is the variance of the jth PC and s_j² is the sample variance of x_j.
- Consequently, the jth PC accounts for a proportion
    λ_j / ∑_{j=1}^{p} λ_j
  of the total variance, and the first k PCs account for a proportion
    ∑_{j=1}^{k} λ_j / ∑_{j=1}^{p} λ_j.
The summary() function
summary(heptathlon_pca)
## Importance of components:
##                        Comp.1 Comp.2 Comp.3  Comp.4  Comp.5  Comp.6
## Standard deviation     2.0793 0.9482 0.9109 0.68320 0.54619 0.33745
## Proportion of Variance 0.6177 0.1284 0.1185 0.06668 0.04262 0.01627
## Cumulative Proportion  0.6177 0.7461 0.8646 0.93131 0.97392 0.99019
##                          Comp.7
## Standard deviation     0.262042
## Proportion of Variance 0.009809
## Cumulative Proportion  1.000000
Criteria for choosing the number of components
1. Retain the first k components which explain a large proportion of the total variation, say 70-80%.
2. If the correlation matrix is analyzed, retain only those components with variances greater than one (criteria 1 and 2 are illustrated in the sketch after this list).
3. Examine a scree plot. This is a plot of the component variances versus the component number. The idea is to look for an "elbow", which corresponds to the point after which the eigenvalues decrease more slowly.
4. Consider whether the component has a sensible and useful interpretation.
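As a small illustration (a sketch, not from the slides), the first two criteria can be applied directly to the fitted object:

## Applying the retention criteria to the heptathlon PCA (sketch)
lambda <- heptathlon_pca$sdev^2  # component variances
prop   <- lambda / sum(lambda)   # proportion of variance explained
which(cumsum(prop) >= 0.80)[1]   # criterion 1: smallest k explaining >= 80%
sum(lambda > 1)                  # criterion 2: variances greater than one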
Scree plot
plot(heptathlon_pca$sdev^2, xlab="Component number",
     ylab="Component variance", type="l")

[Figure: scree plot of the component variances against the component number]
Principal component scores
PC scores can be obtained either via heptathlon_pca$scores or using the predict() function.
Scores on the 1st PC
heptathlon_pca$scores[,1]
or
predict(heptathlon_pca)[,1]
The uncorrelatedness of the PC scores
t(heptathlon_pca$scores)%*%heptathlon_pca$scores/(24)
##            Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
## Comp.1  4.324e+00  7.587e-16 -1.850e-16 -7.772e-16 -6.846e-16
## Comp.2  7.587e-16  8.990e-01  2.423e-15 -3.886e-16 -5.551e-16
## Comp.3 -1.850e-16  2.423e-15  8.297e-01 -1.824e-16 -1.746e-16
## Comp.4 -7.772e-16 -3.886e-16 -1.824e-16  4.668e-01 -9.946e-17
## Comp.5 -6.846e-16 -5.551e-16 -1.746e-16 -9.946e-17  2.983e-01
## Comp.6  1.230e-15 -7.517e-17 -1.214e-16 -4.077e-17  5.204e-17
## Comp.7 -1.943e-16 -6.823e-17  7.286e-17 -2.481e-16  9.483e-17
##            Comp.6     Comp.7
## Comp.1  1.230e-15 -1.943e-16
## Comp.2 -7.517e-17 -6.823e-17
## Comp.3 -1.214e-16  7.286e-17
## Comp.4 -4.077e-17 -2.481e-16
## Comp.5  5.204e-17  9.483e-17
## Comp.6  1.139e-01  3.955e-16
## Comp.7  3.955e-16  6.867e-02
The scores assigned to the athletes and the 1st PC
cor(heptathlon$score, heptathlon_pca$scores[,1])

## [1] -0.9931

plot(heptathlon$score, heptathlon_pca$scores[,1])

[Figure: scatterplot of the official heptathlon score against the scores on the 1st PC]
The USArrests data
- We now perform PCA on the USArrests data set, which is contained in the R package datasets.
- For each of the 50 US states, the data set contains the number of arrests per 100,000 residents in 1973 for each of three crimes: Assault, Murder, and Rape.
- We also record UrbanPop, which measures the percentage of the population in each state living in urban areas.
The USArrests data
The rows of the data set contain the 50 states in alphabetical order and the columns contain the four variables.
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
Examining the USArrests data
apply(USArrests, 2, mean)
##  Murder Assault UrbanPop    Rape
##   7.788 170.760   65.540  21.232
apply(USArrests, 2, var)
##  Murder Assault UrbanPop   Rape
##   18.97 6945.17   209.52  87.73
PCA on a given data matrix
- The princomp() function performs PCA on a covariance matrix S.
- We can also perform PCA directly on the n × p data matrix X using the function prcomp().
- We assume that the variables in X have been centered to have mean zero.
- Instead of performing PCA via an eigendecomposition of the covariance matrix as in princomp(), the computation in prcomp() is done by a singular value decomposition of the (centered and possibly scaled) data matrix.
PCA using the function prcomp()
- Next, we perform PCA on the USArrests data using the prcomp() function.

pr.out <- prcomp(USArrests, scale=TRUE)

- The calculation is done by a singular value decomposition of the centered and scaled data matrix X.
- By default, the prcomp() function centers the variables to have mean zero.
- By using the option scale=TRUE, we scale the variables to have standard deviation one.
The output of prcomp()
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
pr.out
## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
##
## Rotation:
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
Principal component scores
- Matrix-multiplying the (centered and scaled) data matrix X by pr.out$rotation gives the PC scores.
- Alternatively, in the object returned by prcomp(), the 50 × 4 matrix x has as its columns the PC score vectors.

dim(pr.out$x)

## [1] 50 4

- That is, the kth column of x is the kth PC score vector (see the check below).
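A quick check of this equivalence (a sketch, not from the slides):

## The scores in pr.out$x equal the scaled data times the rotation matrix
Xs <- scale(USArrests)  # center and scale, as in prcomp(..., scale=TRUE)
all.equal(Xs %*% pr.out$rotation, pr.out$x, check.attributes = FALSE)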
Proportion of variance explained by the components
summary(pr.out)

## Importance of components:
##                         PC1   PC2    PC3    PC4
## Standard deviation     1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion  0.62 0.868 0.9566 1.0000

pr.out$sdev

## [1] 1.5749 0.9949 0.5971 0.4164

pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
pve
## [1] 0.62006 0.24744 0.08914 0.04336
Plot of the proportion of variance explained
plot(pve, xlab="Principal Component",
     ylab="Proportion of Variance Explained", ylim=c(0,1), type='b')

[Figure: proportion of variance explained by each principal component]
Plot of the cumulative proportion of variance explained
plot(cumsum(pve), xlab="Principal Component",
     ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')

[Figure: cumulative proportion of variance explained by the principal components]
2 PCA biplots
Motivation
- Biplots are a graphical method for simultaneously displaying the variables and sample units described by a multivariate data matrix.
- A PCA biplot displays the component scores and the variable loadings obtained by PCA in two or three dimensions.
- The computations are based on the singular value decomposition of the (centered and possibly scaled) data matrix X.
- Two versions of PCA biplots exist in the literature and are implemented in software packages.
Example of the traditional form of a PCA biplot

[Figure 1: The Gabriel form of a PCA biplot for aircraft data]
PCA biplot for USArrests data
biplot(pr.out, scale=0)

[Figure: PCA biplot of the USArrests data on the first two PCs, showing the 50 states and the four variables]
The effect of scaling the variables
pr.noscale <- prcomp(USArrests, scale=FALSE)
par(mfrow = c(1,2))
biplot(pr.out, scale=0); biplot(pr.noscale, scale=0)

[Figure: PCA biplots of the USArrests data with scaled variables (left) and unscaled variables (right)]
Calibrated axes
- The arrows representing the variables can be converted into calibrated axes, analogous to ordinary scatterplots.
- Calibrated axes: the p variables are represented by p non-orthogonal axes, known as biplot axes.
- The biplot axes are used in precisely the same way as the Cartesian axes they approximate.
- This will give approximate values that do not in general agree precisely with those in the data matrix X, but reproduce the entries in the matrix YAᵀ (illustrated below).
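The matrix YAᵀ that the biplot reproduces is the rank-2 reconstruction of the data; a small sketch of the idea:

## Rank-2 reconstruction Y A^T underlying a two-dimensional biplot (sketch)
Y2 <- pr.out$x[, 1:2]         # scores on the first two PCs
A2 <- pr.out$rotation[, 1:2]  # loadings of the first two PCs
X_hat <- Y2 %*% t(A2)         # approximates the centered, scaled data

## Compare the approximation with the actual values for one state
round(cbind(approx = X_hat[1, ], actual = scale(USArrests)[1, ]), 2)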
PCA biplot with calibrated axes

[Figure 2: PCA biplot with calibrated axes for aircraft data]
The R package BiplotGUI
PCA biplots with calibrated axes can be obtained using the function PCAbipl() from the R package UBbipl, which is available from http://www.wiley.com//legacy/wileychi/gower/material.html. Alternatively, the BiplotGUI package provides a graphical user interface (GUI) for the construction of, interaction with, and manipulation of PCA biplots with calibrated axes in R.

library(BiplotGUI)

Biplots() is the sole function in the BiplotGUI package and initialises the GUI for a given set of data.

Biplots(USArrests)
Application to quality control data
- Throughout the period of a calendar month, a manufacturing company is monitoring 15 different variables in a production process.
- In an effort to quantify the overall product quality, this company devised a quality index value.
- At the end of the month, the means and standard deviations of the 15 selected variables were somehow transformed into a single quality index value in the interval [0, 100].
- The index values give no indication of what the causes of a poor index value could be.
- We perform a PCA on the monthly mean values of the 15 variables for January 2000 to March 2001.
PCA biplot of the (scaled) quality monitoring data

[Figure 3: PCA biplot of the scaled process quality data with a multidimensional target interpolated]
PCA biplot with quality regions

[Figure 4: PCA biplot of process quality data with a target, smooth trend line and quality regions added]
Quality of fit attained with PCA
Table 1: Explained variation by the first four principal components of the quality control data (cumulative proportion in percent).

1 dimension   2 dimensions   3 dimensions   4 dimensions
   37.8%         59.8%          74.9%          82.7%
3 Sparse PCA
Motivation
- A sparse statistical model is one having only a small number of nonzero parameters.
- In this section, we discuss how PCA can be sparsified.
- That is, how we can derive principal components with sparse loadings to yield more interpretable solutions.
- Sparse PCA is a natural extension of PCA well suited to high-dimensional data (p ≫ n).
Jeffers’ pitprops data
- Jeffers' pitprops data is a classical example showing the difficulty of interpreting principal components.
- The pitprops data is a correlation matrix of 13 physical measurements made on a sample of 180 pitprops cut from Corsican pine timber.

library(elasticnet)
data(pitprops)
dim(pitprops)
## [1] 13 13
The variables in Jeffers’ pitprops data
topdiam  Top diameter in inches
length   Length in inches
moist    Moisture content, % of dry weight
testsg   Specific gravity at time of test
ovensg   Oven-dry specific gravity
ringtop  Number of annual rings at top
ringbut  Number of annual rings at bottom
bowmax   Maximum bow in inches
bowdist  Distance of point of maximum bow from top in inches
whorls   Number of knot whorls
clear    Length of clear prop from top in inches
knots    Average number of knots per whorl
diaknot  Average diameter of the knots in inches
PCA of pitprops data
pitprop.pca <- princomp(covmat = pitprops)
summary(pitprop.pca)

## Importance of components:
##                        Comp.1 Comp.2 Comp.3  Comp.4 Comp.5  Comp.6
## Standard deviation     2.0539 1.5421 1.3705 1.05328 0.9540 0.90300
## Proportion of Variance 0.3245 0.1829 0.1445 0.08534 0.0700 0.06272
## Cumulative Proportion  0.3245 0.5074 0.6519 0.73726 0.8073 0.86999
##                         Comp.7  Comp.8  Comp.9 Comp.10 Comp.11
## Standard deviation     0.75917 0.66300 0.59387 0.43685 0.22487
## Proportion of Variance 0.04433 0.03381 0.02713 0.01468 0.00389
## Cumulative Proportion  0.91432 0.94813 0.97526 0.98994 0.99383
##                        Comp.12  Comp.13
## Standard deviation     0.20363 0.196785
## Proportion of Variance 0.00319 0.002979
## Cumulative Proportion  0.99702 1.000000
Loadings of the first six components
pitprop.pca$loadings[,1:6]
##           Comp.1   Comp.2   Comp.3   Comp.4   Comp.5    Comp.6
## topdiam -0.40379 -0.21785  0.20729  0.09121 -0.08263  0.119803
## length  -0.40554 -0.18613  0.23504  0.10272 -0.11279  0.162888
## moist   -0.12440 -0.54064 -0.14149 -0.07844  0.34977 -0.275901
## testsg  -0.17322 -0.45564 -0.35242 -0.05477  0.35576 -0.054017
## ovensg  -0.05717  0.17007 -0.48121 -0.04911  0.17610  0.625557
## ringtop -0.28443  0.01420 -0.47526  0.06343 -0.31583  0.052301
## ringbut -0.39984  0.18964 -0.25310  0.06498 -0.21507  0.002658
## bowmax  -0.29356  0.18915  0.24305 -0.28554  0.18533 -0.055119
## bowdist -0.35663 -0.01712  0.20764 -0.09672 -0.10611  0.034222
## whorls  -0.37892  0.24845  0.11877  0.20504  0.15639 -0.173148
## clear    0.01109 -0.20530  0.07045 -0.80366 -0.34299  0.175312
## knots    0.11508 -0.34317 -0.09200  0.30080 -0.60037 -0.169783
## diaknot  0.11251 -0.30853  0.32611  0.30338  0.07990  0.626307
Rotation
- A traditional way to simplify loadings is by rotation.
- The method of rotation emerged in Factor Analysis and was motivated both by solving the rotational indeterminacy problem and by facilitating the factors' interpretation.
- Rotation can be performed either in an orthogonal or an oblique (non-orthogonal) fashion.
- Several analytic orthogonal and oblique rotation criteria exist in the literature.
- All criteria attempt to create a loading matrix whose elements are close to zero or far from zero, with few intermediate values.
Rotation
- If A is the loading matrix, then A is post-multiplied by a matrix T to give rotated loadings B = AT.
- The rotation matrix T is chosen so as to optimize some simplicity criterion.
- We would also need an algorithm that optimizes the chosen rotation criterion and finds the "best" T.
- However, after rotation, either one or both of the properties possessed by PCA, that is, orthogonality of the loadings and uncorrelatedness of the component scores, is lost.
The Varimax rotation criterion
- Each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance.
- Varimax is the most widely used rotation criterion.
- Varimax tends to drive at least some of the loadings in each component towards zero.
- A component whose loadings are all roughly equal will be avoided by most standard rotation criteria (see the sketch below).
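As a sketch, the varimax() function in base R's stats package applies this criterion directly to a loading matrix; the GPArotation route used below for the pitprops data is the more general approach:

## Varimax rotation of the first two heptathlon loading vectors (sketch)
A  <- unclass(heptathlon_pca$loadings)[, 1:2]  # unrotated loadings
vm <- varimax(A)                               # stats::varimax
B  <- A %*% vm$rotmat                          # rotated loadings B = AT
round(B, 2)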
Gradient projection algorithm
- Problems in multivariate statistics are often concerned with the optimization of matrix functions of structured (e.g. orthogonal) matrix unknowns.
- Gradient projection algorithms are natural ways of solving such optimization problems, as they are especially designed to follow the geometry of the matrix parameters.
- They are based on the classical gradient approach and modified for analyzing and solving constrained optimization problems.
- The idea is to follow the steepest descent direction and to keep the gradient flow "nailed" to the manifold of permissible matrices.
Gradient projection algorithm for orthogonal rotation
- Here, the gradient projection algorithm for orthogonal rotation is used to find T that minimizes f(V) over all orthogonal matrices V.
- Let M be the manifold of all orthogonal matrices.
- Given a current value of V, this algorithm computes the gradient of f at V and moves α units in the negative gradient direction from V.
- The result is projected onto M (see the sketch below).
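The projection onto M has a closed form via the SVD: the nearest orthogonal matrix to a matrix Z with SVD Z = UDVᵀ is UVᵀ. A minimal sketch of one iteration, with the gradient function and the step size α as placeholders:

## One gradient projection step for orthogonal rotation (illustrative sketch)
proj_orth <- function(Z) {          # projection onto the orthogonal manifold M
  s <- svd(Z)
  s$u %*% t(s$v)                    # nearest orthogonal matrix to Z
}
gp_step <- function(V, grad_f, alpha = 0.1) {
  proj_orth(V - alpha * grad_f(V))  # steepest descent step, then project back
}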
The gradient projection algorithm visualized

[Figure 5: Projection on a manifold of permissible matrices]
Iterative scheme
- The algorithm proceeds iteratively; it is monotonically descending and converges from any starting point to a stationary point.
- At a stationary point of f restricted to M, the Frobenius norm of the gradient after projection onto the plane tangent to M at the current value of V is zero.
- The algorithm stops when the norm is less than some prescribed precision, say 10⁻⁵.
- Once the optimal rotation matrix T has been found, the rotated loading matrix is obtained as B = AT.
Using the Varimax criterion for Jeffers’ pitprops data
library(GPArotation)
A <- pitprop.pca$loadings[,1:6]
B <- GPForth(A, method="varimax")$loadings
B

##             Comp.1    Comp.2    Comp.3    Comp.4    Comp.5   Comp.6
## topdiam -0.4732810 -0.093071  0.066309 -0.035053 -0.047369  0.20604
## length  -0.4913803 -0.034979  0.053369 -0.039620 -0.047577  0.23409
## moist    0.0003155 -0.713627  0.148739 -0.001023  0.013383 -0.02893
## testsg  -0.0154043 -0.681406 -0.170698  0.018187  0.005942 -0.01546
## ovensg   0.0493026  0.003042 -0.807205 -0.018262  0.138311  0.12255
## ringtop -0.2388960 -0.028431 -0.391257 -0.014618 -0.358794 -0.27123
## ringbut -0.3638911  0.092181 -0.259272  0.063757 -0.130426 -0.28490
## bowmax  -0.2472791  0.033791  0.113722 -0.127971  0.439568 -0.12301
## bowdist -0.3980574  0.039518  0.086691 -0.138574  0.073745  0.01454
## whorls  -0.3446344  0.052830  0.087417  0.339108  0.212771 -0.16254
## clear   -0.0185273  0.009334 -0.008107 -0.916145  0.012019 -0.03890
## knots   -0.0318758  0.025179  0.177287 -0.011326 -0.765165  0.02324
## diaknot -0.0516283  0.031716 -0.100118  0.047261 -0.029195  0.82952
Sparse PCA based on the “elastic net”
- The lasso approach in PCA: perform PCA under the extra constraints ∑_{j=1}^{p} |a_kj| ≤ t for some tuning parameter t (k = 1, . . . , p).
- The above-mentioned approach has several limitations.
- The so-called elastic net generalizes the lasso to overcome its drawbacks.
- Elastic net approach in PCA: formulate PCA as a regression-type optimization problem; obtain sparse loadings by integrating a lasso penalty (via the elastic net) into the regression criterion.
Sparse PCA (SPCA) criterion based on the “elastic net”
- Optimization problem:
    (Â, B̂) = argmin_{A,B} ∑_{i=1}^{n} ‖x_i − ABᵀx_i‖² + λ ∑_{j=1}^{k} ‖β_j‖² + ∑_{j=1}^{k} λ_{1,j} ‖β_j‖₁
  subject to AᵀA = I_k.
- In the SPCA criterion above, A = (α1, . . . , αk) and B = (β1, . . . , βk) are p × k matrices, ‖ · ‖ denotes the Euclidean (ℓ2) norm, and ‖ · ‖₁ denotes the ℓ1 norm.
- Whereas the same λ is used for all k components, different λ_{1,j}'s are allowed for penalizing the loadings of different principal components.
Alternating algorithm to minimize the SPCA criterion
- B given A: For each j, let Y*_j = Xα_j. Each β_j in B = (β1, . . . , βk) is an elastic net estimate
    β̂_j = argmin_{β_j} ‖Y*_j − Xβ_j‖² + λ‖β_j‖² + λ_{1,j}‖β_j‖₁.
- A given B: If B is fixed, then we can ignore the penalty part of the SPCA criterion and only try to minimize
    ∑_{i=1}^{n} ‖x_i − ABᵀx_i‖² = ‖X − XBAᵀ‖²_F,
  subject to AᵀA = I_k. The solution is found via the SVD of
    (XᵀX)B = UDVᵀ,
  and we set Â = UVᵀ (a sketch of this step follows).
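The A-update is a Procrustes-type step with a closed-form solution; a minimal sketch (X a data matrix, B a p × k matrix of current loadings):

## A given B: minimize ||X - X B A^T||_F^2 subject to A^T A = I_k (sketch)
update_A <- function(X, B) {
  s <- svd(crossprod(X) %*% B)  # SVD of (X^T X) B = U D V^T
  s$u %*% t(s$v)                # set A = U V^T
}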
Some remarks about SPCA
- Empirical evidence suggests that the output of the above algorithm does not change much as λ is varied.
- Practically, λ is chosen to be a small positive number.
- Usually several combinations of λ_{1,j} are tried to figure out a good choice of the tuning parameters.
- Hence, we can pick a λ_{1,j} that gives a good compromise between explained variance and sparsity (variance-sparsity trade-off).
Implementation of sparse PCA in R
- Efficient algorithms do exist to fit the elastic net approach in PCA to multivariate data.
- Sparse PCA is implemented by the function spca() in the R package elasticnet.

?spca

- The function arrayspc() in the R package elasticnet is specifically designed for the case p ≫ n, as is typically the case in microarrays.

?arrayspc
Sparse PCA of Jeffers' pitprops data
pitprop.spcap <- spca(pitprops, K = 6, type = "Gram", sparse = "penalty",
                      para = c(0.06, 0.16, 0.1, 0.5, 0.5, 0.5))
pitprop.spcav <- spca(pitprops, K = 6, type = "Gram", sparse = "varnum",
                      para = c(7, 4, 4, 1, 1, 1))
pitprop.spcap$loadings

##             PC1      PC2      PC3 PC4 PC5 PC6
## topdiam -0.4774  0.00000  0.00000   0   0   0
## length  -0.4759  0.00000  0.00000   0   0   0
## moist    0.0000  0.78471  0.00000   0   0   0
## testsg   0.0000  0.61936  0.00000   0   0   0
## ovensg   0.1766  0.00000  0.64065   0   0   0
## ringtop  0.0000  0.00000  0.58901   0   0   0
## ringbut -0.2505  0.00000  0.49233   0   0   0
## bowmax  -0.3440 -0.02100  0.00000   0   0   0
## bowdist -0.4164  0.00000  0.00000   0   0   0
## whorls  -0.4000  0.00000  0.00000   0   0   0
## clear    0.0000  0.00000  0.00000  -1   0   0
## knots    0.0000  0.01333  0.00000   0  -1   0
## diaknot  0.0000  0.00000 -0.01557   0   0   1
pitprop.spcap$pev
## [1] 0.28035 0.13966 0.13298 0.07445 0.06802 0.06227