Advanced Applied Multivariate Analysis
STAT 2221, Spring 2015

Sungkyu Jung
Department of Statistics
University of Pittsburgh
E-mail: [email protected]
Web: http://www.stat.pitt.edu/sungkyu/
General Information
• Course Webpage: http://www.stat.pitt.edu/sungkyu/course/2221Spring15/
• Prerequisite: None (officially), but the following are useful:
  • probability and inference theory
  • linear algebra
  • R, SAS, or Matlab programming
What is multivariate analysis?
1 First course of statistics: numbers – random variables
2 Second course of statistics: vectors of numbers – random vectors
• Basis for the analysis of more complex objects, e.g. functions, matrices, tensors, images, networks.
3 Data Exploration: visualization of relationships between observations.
4 Discovering and modeling patterns from a dataset: visualization, clustering, multivariate distributions.
5 Confirming patterns: inference.
6 Dimension Reduction: PCA, CCA, SVD.
7 Predictions: regression, classification.
What is a multivariate dataset?
Multivariate statistical analysis concerns multivariate data, in which each observation consists of many measurements on the same subject. We suppose the dataset X = {X1, . . . , Xn} has n observations (here, n is called the sample size), and each observation Xi = (xi1, . . . , xip) is a vector in R^p (here, p is called the dimension). These are often recorded in a p × n matrix:

X = (X1, · · · , Xn) = [ x11 · · · xn1 ]
                      [  ⋮   ⋱    ⋮  ]
                      [ x1p · · · xnp ]
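As an aside, the p × n arrangement above translates directly into code. The course uses R, SAS, or Matlab; the following is a minimal Python/NumPy illustration (variable names are my own):

```python
import numpy as np

# A tiny dataset: n = 3 observations, each with p = 2 measurements.
X1 = np.array([1.0, 4.0])
X2 = np.array([2.0, 5.0])
X3 = np.array([3.0, 6.0])

# Stack observations as columns to form the p x n data matrix X.
X = np.column_stack([X1, X2, X3])
print(X.shape)  # (2, 3), i.e. (p, n)
```

Each column of X is one observation; each row collects one variable across all observations.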
Data Exploration - Visualization
1D example – Hidalgo Stamp Data
• n = 485 observed values of thickness for Mexico stamps
• spanning more than 70 years
• studied during the 1980s
• Were the stamp papers produced in several factories?
• There are no records. Can we guess by looking at the data?
Izenman and Sommer (1988); here we use the data set “stamp” in the R package BSDA.
Data Exploration - 1D Hidalgo Stamp
Histograms with different bin widths. How many factories?
[Figure: four histograms of stamp thickness (x-axis 0.06–0.12, y-axis frequency) with bin widths h = 0.015, 0.010, 0.005, and 0.002.]
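The bin-width sensitivity shown in the figure can be reproduced numerically. A hedged sketch with simulated data (not the actual stamp measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated thickness-like data: a mixture of two normal components, n = 485.
data = np.concatenate([rng.normal(0.08, 0.004, 300),
                       rng.normal(0.10, 0.004, 185)])

# Same data, different bin widths: the histogram's resolution changes.
for h in [0.015, 0.005, 0.002]:
    nbins = int(np.ceil((data.max() - data.min()) / h))
    counts, edges = np.histogram(data, bins=nbins)
    print(f"h = {h}: {nbins} bins")
```

With a wide bin, the two components can merge into one apparent mode; with a narrow bin, spurious modes appear.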
Data Exploration - 1D Hidalgo Stamp
Histograms with different bin locations (fixed bin width 0.012).
[Figure: four histograms of thickness with the same bin width but shifted bin locations; the apparent number of modes changes.]
Data Exploration - Kernel density estimate
• Histograms depend on the bin width and the bin locations.
• A smaller bin width is preferred for exploration.
• Different bin locations can obscure important underlying patterns; one solution is to average out the effect of the bin location (an “averaged histogram”).
• A more elegant solution is the kernel density estimate.
• For X1, . . . , Xn ~ i.i.d. f(x) (a continuous pdf), a kernel density estimator of f is obtained as

  f̂_h(x) = (1/n) Σ_{i=1}^{n} (1/h) K((x − Xi)/h),

where the kernel K(·) is a function satisfying ∫ K(x) dx = 1.
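The estimator above translates directly into code. A minimal sketch with a Gaussian kernel (function names are my own, not course code):

```python
import numpy as np

def gaussian_kernel(t):
    """Standard normal density, a kernel with integral 1."""
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """f_hat_h(x) = (1/n) * sum_i (1/h) * K((x - X_i) / h)."""
    x = np.asarray(x, dtype=float)
    # One scaled kernel bump per observation, then average over i.
    bumps = gaussian_kernel((x[..., None] - data) / h) / h
    return bumps.mean(axis=-1)

data = np.array([3.0, 5.0, 9.0, 11.0, 12.0, 14.0])
grid = np.linspace(-5, 25, 2001)
f = kde(grid, data, h=1.0)
# The estimate is itself a density: nonnegative and integrating to about 1.
print(f.sum() * (grid[1] - grid[0]))
```

Because each kernel bump integrates to 1, so does their average; the Riemann sum above confirms this numerically.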
Kernel density estimate - illustration
• Toy data: xi = 3, 5, 9, 11, 12, 14.
• Consider a kernel density estimate with the Gaussian kernel

  K(t) = (1/√(2π)) e^{−t²/2},

with bandwidth h = 1.
• For each i,

  (1/h) K((x − xi)/h) = (1/√(2πh²)) e^{−(x−xi)²/(2h²)}.
Kernel density estimate - illustration
[Figure: the toy data xi = 3, 5, 9, 11, 12, 14 plotted as points on the observation axis.]
Kernel density estimate - illustration
[Figure: the toy data with one Gaussian kernel bump overlaid.]
• Overlaid is, for h = 1,

  (1/h) K((x − x1)/h) = (1/√(2πh²)) e^{−(x−x1)²/(2h²)}.
Kernel density estimate - illustration
[Figure: the toy data with all six kernel bumps overlaid.]
• For all i = 1, . . . , 6,

  (1/h) K((x − xi)/h) = (1/√(2πh²)) e^{−(x−xi)²/(2h²)}.
Kernel density estimate - illustration
[Figure: the toy data with the six kernel bumps and their average overlaid.]
• Kernel Density Estimate (KDE) of the pdf:

  f̂_h(x) = (1/n) Σ_{i=1}^{n} (1/h) K((x − Xi)/h).
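The point of the figure, that f̂_h is exactly the average of the n kernel bumps, can be verified at any single point x. An illustrative sketch using the toy data:

```python
import numpy as np

data = np.array([3.0, 5.0, 9.0, 11.0, 12.0, 14.0])  # toy data
h, x = 1.0, 10.0  # bandwidth and one evaluation point

def bump(x, xi, h):
    """(1/h) K((x - xi)/h) with the Gaussian kernel."""
    return np.exp(-(x - xi)**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)

# The KDE at x is the average of the six individual bump values.
fhat = np.mean([bump(x, xi, h) for xi in data])
print(fhat)
```

At x = 10 the nearby observations 9, 11, and 12 dominate the average; distant observations contribute almost nothing.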
KDE for Hidalgo Stamp
[Figure: KDEs of the stamp thickness (N = 485) with bandwidths 0.01, 0.005, 0.0025, and 0.001; more modes appear as the bandwidth decreases.]
KDE for Hidalgo Stamp
• A small bandwidth undersmooths; a large bandwidth oversmooths.
• Is the density unimodal? Bimodal? Or does it have several modes? Which modes are really there?
• Choice of bandwidth:
  • Important in practice.
  • A controversial issue.
  • Many recommendations (Silverman’s rule of thumb, Sheather–Jones plug-in).
  • The consensus is that there is never a consensus.
KDE for Hidalgo Stamp
The bandwidth, by Silverman’s rule of thumb, is (sample size)^(−1/5) × 90% of the minimum of two estimates of the population standard deviation: i) the sample standard deviation, and ii) IQR/1.34.
[Figure: KDE of the stamp thickness (N = 485) with the Silverman bandwidth 0.00391.]
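Silverman's rule as stated above is simple to compute (to my knowledge it is essentially R's default bandwidth rule, bw.nrd0, in density()). A Python sketch with simulated data:

```python
import numpy as np

def silverman_bw(x):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sd = x.std(ddof=1)  # sample standard deviation
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * n ** (-1 / 5)

rng = np.random.default_rng(1)
x = rng.normal(size=485)  # simulated sample, same size as the stamp data
print(silverman_bw(x))
```

For roughly normal data the two scale estimates nearly agree; for heavy-tailed or contaminated data, IQR/1.34 protects against an inflated sample standard deviation.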
Data Exploration - Multivariate data
Dimension p = 6 example – Swiss bank notes
• n = 200 Swiss bank notes (see Fig. 1.1 of Härdle and Simar)
• Each note (observation) has p = 6 measurements (variables).
• Additional information: the first half are genuine; the other half are counterfeit.
• How to visualize 6-dimensional data?
• One can use 6 KDEs, overlaid with jitter plots, one for each measurement (variable).
Swiss bank notes - Marginal KDEs
Marginal KDEs overlaid with a jitter plot for each of the 6 variables.
• Informative, and realistic when p is small.
• No information about associations between variables.
[Figure: six marginal KDE panels with jitter plots, one for each variable x1–x6.]
Swiss bank notes - Scatterplot
• Pairs of variables are best visualized by a scatterplot, e.g. X4 vs. X6 below.
• Understood as point clouds (representing the empirical distribution).
• There are (6 choose 2) = 15 pairs to choose from.
[Figure: scatterplot of X4 vs. X6. Swiss bank notes: blue o = genuine, green x = fake.]
Swiss bank notes - Scatterplot
• Scatterplots of three variables can also be informative, if the software allows rotating the axes.
• Otherwise, a 3D scatterplot is a scatterplot of linear combinations of the three variables.
[Figure: 3D scatterplot of X4, X6, and X1. Swiss bank notes: blue o = genuine, green x = fake.]
Swiss bank notes - scatterplot matrix
• A traditional, yet powerful, tool is to construct a matrix of scatterplots, though it becomes too busy with p = 6.
[Figure: 6 × 6 scatterplot matrix of the variables X1–X6.]
Swiss bank notes - scatterplot matrix
• Better to visualize with principal component scores.
[Figure: 6 × 6 scatterplot matrix of the principal component scores (Directions 1–6).]
Swiss bank notes - scatterplot matrix
• With principal component scores, we can focus on fewer combinations: Dimension Reduction.
[Figure: 3 × 3 scatterplot matrix of the first three principal component directions.]
Swiss bank notes - related methods
1 If obtaining a succinct representation of the data is of interest, then using the first two principal component scores, appearing in the first 2 × 2 block of the scatterplot matrix, would do the job (Principal Component Analysis).
2 Parametric modeling: are the distributions of the Swiss bank note measurements normal (Gaussian)? (Multivariate Normal Distribution)
3 Without the information on genuine and counterfeit notes, can we classify the n = 200 notes into two distinct groups? (Clustering)
4 With the information on genuine and counterfeit notes,
  • Are the means and covariances of genuine and counterfeit notes different? (Statistical inference)
  • When there is a new bank note, how do we predict whether it is genuine? (Classification)
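For item 1, principal component scores can be computed from the centered data matrix via an SVD. A hedged sketch with placeholder data (the real bank-note measurements are not reproduced here):

```python
import numpy as np

def pc_scores(X):
    """Principal component scores of an n x p data matrix (rows = observations)."""
    Xc = X - X.mean(axis=0)  # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * s  # n x p scores; column j is the j-th PC score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))  # placeholder for the 200 x 6 bank-note data
S = pc_scores(X)

# PC scores are uncorrelated, so their sample covariance matrix is diagonal,
# with variances in decreasing order.
C = np.cov(S, rowvar=False)
print(np.round(np.diag(C), 3))
```

Plotting the first two columns of S gives exactly the 2 × 2 block of the PC scatterplot matrix mentioned above.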
Modern challenges
High dimensional data
1 Gene expression data in Golub et al. (1999)
2 p = 7129 gene expression levels (numeric) for n1 = 47 subjects with acute lymphoblastic leukemia (ALL) and n2 = 25 subjects with acute myeloid leukemia (AML).
3 The scientific task is to identify new cancer classes and/or to assign tumors to known classes (ALL or AML).
Golub data
Taking a subgroup of the data with 27 ALLs and 11 AMLs.
A visualization of the p × n matrix X. Is it helpful?
[Figure: “Processed gene expressions (Golub)” — heat map of the matrix; subject index 1–27 = ALL, 28–37 = AML; genes on the vertical axis.]
Golub data
Scatterplot matrix for the first three genes (variables).
[Figure: 3 × 3 scatterplot matrix of the first three genes X1–X3, with ALL and AML marked.]
Golub data
Scatterplot matrix using the first three principal component scores. Is there a pattern?
[Figure: 3 × 3 scatterplot matrix of the PC1–PC3 directions, with ALL and AML marked.]
Golub data
Scatterplot matrix using a linear combination of all variables (one that leads to a good separation of the two groups) and two principal component scores. A better pattern?
[Figure: 3 × 3 scatterplot matrix of the discriminant direction, PC1 direction, and PC2 direction, with ALL and AML marked.]
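The slides do not say which discriminant direction was used; one standard choice is Fisher's linear discriminant, w ∝ S_pooled⁻¹(x̄1 − x̄2). A sketch on simulated stand-in data (note that with p ≫ n, as in the full Golub data, S_pooled is singular, and regularization or prior dimension reduction would be needed):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's LDA direction: w proportional to S_pooled^{-1} (mean1 - mean2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(Sp, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(3)
# Simulated stand-ins for the 27 ALL and 11 AML subjects, with p = 5 variables.
A = rng.normal(0.0, 1.0, size=(27, 5))
B = rng.normal(1.0, 1.0, size=(11, 5))
w = fisher_direction(A, B)

# Projecting each group onto w separates the group means along one axis.
print((A @ w).mean() - (B @ w).mean())
```

Plotting the projections A @ w and B @ w against one or two PC scores gives a panel of the kind shown in the figure.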
Next: a review of matrix algebra.