Discrete Multivariate Analysis Analysis of Multivariate Categorical Data.
Description of Multivariate Data
Description of Multivariate Data
Multivariate Analysis
The analysis of many variables
Multivariate Analysis: The analysis of many variables
More precisely, and also more traditionally, this term stands for the study of a random sample of n objects (units or cases) such that on each object we measure p variables or characteristics.
So that for each object there is a vector:
$$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$$
Each with p components:
$$\mathbf{x}_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{ip})'$$
The variables will be correlated as they are measured on the same object.
A common practice is to treat each variable separately by applying methods of univariate analysis.
Because it ignores these correlations, this may lead to incorrect and inadequate analysis.
The challenge of multivariate analysis is to untangle the overlapping information provided by a set of correlated variables and to reveal the underlying structure.
This is done by a variety of methods,
some of which are
• generalizations of univariate methods
and some of which are
• multivariate methods without univariate counterparts
The purpose of this course is
• to describe and perhaps justify these methods, and also
• provide some guidance about how to select an appropriate method for a given multivariate data set.
Example
Randomly select n = 5 students as objects and for each student measure:
x1 = age (in years) at entry to university,
x2 = mark out of 100 in an exam at the end of the first year,
x3 = sex (0 = female, 1 = male)
The result may look something like this:

| Object | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 18.5 | 91 | 0 |
| 2 | 18.0 | 73 | 1 |
| 3 | 18.9 | 64 | 1 |
| 4 | 18.5 | 71 | 0 |
| 5 | 18.4 | 85 | 1 |

It is of interest to note that the variables in the example are not of the same type:
– x1 is a continuous variable,
– x2 is a discrete variable and
– x3 is a binary variable
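As a minimal sketch, the five-student example can be held as a NumPy array with one row per object and one column per variable. The use of NumPy and the names `X`, `n` and `p` in code are assumptions made for illustration; the slides do not prescribe any software.

```python
import numpy as np

# Data matrix for the example: rows are the 5 students (objects),
# columns are x1 = age, x2 = exam mark, x3 = sex (0 = female, 1 = male).
X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

n, p = X.shape  # n objects, p variables
print(n, p)
```

Storing the mixed variable types in one floating-point array is a convenience for the matrix formulas that follow; it does not change the fact that x3 is binary.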
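The Data Matrix
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
The rows correspond to the objects $o_1, \ldots, o_i, \ldots, o_n$ and the columns to the variables $x_1, \ldots, x_j, \ldots, x_p$.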
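We can write
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_i' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}$$
where
$$\mathbf{x}_i' = (x_{i1}, \ldots, x_{ij}, \ldots, x_{ip})$$
= the ith row of X.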
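We can also write
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix} = \left(\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(j)}, \ldots, \mathbf{x}_{(p)}\right)$$
where
$$\mathbf{x}_{(j)} = (x_{1j}, \ldots, x_{ij}, \ldots, x_{nj})'$$
= the jth column of X.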
In this notation
$\mathbf{x}_1$ is the p-vector denoting the p observations on the first object, while
$\mathbf{x}_{(1)}$ is the n-vector denoting the observations on the first variable.
The rows
$$\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_n$$
form a random sample while the columns
$$\mathbf{x}_{(1)}, \mathbf{x}_{(2)}, \mathbf{x}_{(3)}, \ldots, \mathbf{x}_{(p)}$$
do not (this is emphasized in the notation by the use of parentheses).
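The row and column notation maps directly onto array slicing. The sketch below reuses the five-student example; the variable names `x_1` and `x_col1` are hypothetical labels for the first row vector and the first column vector.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

x_1 = X[0, :]     # p-vector: the p observations on the first object (first row)
x_col1 = X[:, 0]  # n-vector: the n observations on the first variable (first column)

print(x_1.shape, x_col1.shape)
```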
Sometimes the objective of multivariate analysis will be an attempt to find some feature of the variables (i.e. the columns of the data matrix).
At other times, the objective of multivariate analysis will be an attempt to find some feature of the individuals (i.e. the rows of the data matrix).
The feature that we often look for is grouping of the individuals or of the variables.
We will give a classification of multivariate methods later
Summarization of the data
Even when n and p are moderately large, the amount of information (np elements of the data matrix) can be overwhelming and it is necessary to find ways of summarizing data.
Later on we will discuss ways of representing the data graphically.
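Definitions:
1. The sample mean for the ith variable
$$\bar{x}_i = \frac{1}{n}\sum_{r=1}^{n} x_{ri} \quad \text{for } i = 1, 2, \ldots, p$$
2. The sample variance for the ith variable
$$s_i^2 = \frac{1}{n-1}\sum_{r=1}^{n} (x_{ri} - \bar{x}_i)^2 = s_{ii} \quad \text{for } i = 1, 2, \ldots, p$$
3. The sample covariance between the ith variable and the jth variable
$$s_{ij} = \frac{1}{n-1}\sum_{r=1}^{n} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j) \quad \text{for } i, j = 1, 2, \ldots, p$$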
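The three definitions can be checked numerically. The following is a sketch that evaluates the sums literally on the five-student example; NumPy and the names `xbar`, `s` and `variances` are illustrative assumptions rather than anything given in the slides.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

# 1. Sample mean of each variable: (1/n) * sum over objects r of x_ri.
xbar = X.sum(axis=0) / n

# 3. Sample covariance s_ij, written out as the double loop over variables.
s = np.empty((p, p))
for i in range(p):
    for j in range(p):
        s[i, j] = np.sum((X[:, i] - xbar[i]) * (X[:, j] - xbar[j])) / (n - 1)

# 2. Sample variances are the diagonal entries s_ii.
variances = np.diag(s)

print(xbar)
print(s)
```

The explicit loops mirror the formulas term by term; in practice `np.cov(X, rowvar=False)` computes the same matrix with the same divisor n - 1.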
Putting the definitions together we are led to the following definitions:
Defn: The sample mean vector
$$\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_i, \ldots, \bar{x}_p)'$$
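Defn: The sample covariance matrix
$$S_{p \times p} = (s_{ij}) = \begin{pmatrix} s_{11} & \cdots & s_{1i} & \cdots & s_{1p} \\ \vdots & & \vdots & & \vdots \\ s_{i1} & \cdots & s_{ii} & \cdots & s_{ip} \\ \vdots & & \vdots & & \vdots \\ s_{p1} & \cdots & s_{pi} & \cdots & s_{pp} \end{pmatrix}$$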
Expressing the sample mean vector and the sample covariance matrix in
terms of the data matrix
The sample mean vector
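Note
$$\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_i \\ \vdots \\ \bar{x}_p \end{pmatrix} = \frac{1}{n}\sum_{r=1}^{n} \mathbf{x}_r = \frac{1}{n} X'\mathbf{1}$$
where
$$\mathbf{1} = (1, 1, \ldots, 1)'$$
is the n-vector whose components are all equal to 1.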
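A short sketch of this identity on the five-student example, assuming NumPy; the names `ones` and `xbar` are illustrative.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

ones = np.ones(n)      # the n-vector whose components are all equal to 1
xbar = X.T @ ones / n  # sample mean vector written as (1/n) X' 1

print(xbar)
```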
The sample covariance matrix
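We can write the centered data matrix as
$$\tilde{X} = \begin{pmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1j} - \bar{x}_j & \cdots & x_{1p} - \bar{x}_p \\ \vdots & & \vdots & & \vdots \\ x_{i1} - \bar{x}_1 & \cdots & x_{ij} - \bar{x}_j & \cdots & x_{ip} - \bar{x}_p \\ \vdots & & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{nj} - \bar{x}_j & \cdots & x_{np} - \bar{x}_p \end{pmatrix} = X - \mathbf{1}\bar{\mathbf{x}}'$$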
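because
$$\mathbf{1}\bar{\mathbf{x}}' = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p) = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \vdots & \vdots & & \vdots \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{pmatrix}$$
then, with $\tilde{X}$ denoting the centered data matrix,
$$\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}' = X - \frac{1}{n}\mathbf{1}\mathbf{1}'X = \left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X$$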
n
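It is easy to check that
$$\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)^2 = \left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right) = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'$$
The final step is to realize that, with $\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}'$ the centered data matrix and $\tilde{x}_{ri} = x_{ri} - \bar{x}_i$ its entries,
$$s_{ij} = \frac{1}{n-1}\sum_{r=1}^{n} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j) = \frac{1}{n-1}\sum_{r=1}^{n} \tilde{x}_{ri}\tilde{x}_{rj} = \frac{1}{n-1}\left(\tilde{X}'\tilde{X}\right)_{ij}$$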
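So that, writing $\tilde{X}$ for the centered data matrix and using the symmetry of $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'$,
$$(n-1)\,S = \tilde{X}'\tilde{X} = \left[\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X\right]'\left[\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X\right] = X'\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)^2 X = X'\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X$$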
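The derivation can be verified numerically. The sketch below builds the centering matrix for the five-student example and checks each step; the name `H` for the matrix I_n - (1/n)11' is an assumption for illustration (the slides do not name it), as is the use of NumPy.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

# Centering matrix I_n - (1/n) 1 1' (called H here; the name is hypothetical).
H = np.eye(n) - np.ones((n, n)) / n

X_centered = H @ X         # same as subtracting the column means from X
S = X.T @ H @ X / (n - 1)  # (n - 1) S = X' (I_n - (1/n) 1 1') X

print(np.allclose(H @ H, H))  # H is idempotent
print(S)
```

Forming the n by n matrix explicitly is only sensible for small n; for larger data sets subtracting the column means directly gives the same centered matrix at lower cost.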
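In the text book
$$J_{n \times n} = \mathbf{1}\mathbf{1}' = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}$$
And then
$$(n-1)\,S_{p \times p} = X'\left(I - \frac{1}{n}J\right)X$$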
Another Expression for S
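Note:
$$(n-1)\,S = \tilde{X}'\tilde{X}$$
and, with $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$, the centered data matrix is
$$\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}' = \begin{pmatrix} \tilde{\mathbf{x}}_1' \\ \vdots \\ \tilde{\mathbf{x}}_i' \\ \vdots \\ \tilde{\mathbf{x}}_n' \end{pmatrix} = \begin{pmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}})' \\ \vdots \\ (\mathbf{x}_i - \bar{\mathbf{x}})' \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}})' \end{pmatrix}$$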
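Thus, with $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$,
$$(n-1)\,S = (\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n)\begin{pmatrix} \tilde{\mathbf{x}}_1' \\ \vdots \\ \tilde{\mathbf{x}}_n' \end{pmatrix} = \sum_{i=1}^{n} \tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i' = \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$$
Hence
$$S = \frac{1}{n-1}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$$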
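This expression says S is an average of outer products of the centered rows. A minimal sketch on the five-student example, assuming NumPy; the variable names are illustrative.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape
xbar = X.mean(axis=0)

# S = 1/(n-1) * sum over objects i of (x_i - xbar)(x_i - xbar)',
# where each term is a p x p outer product built from one row of X.
S = sum(np.outer(x_i - xbar, x_i - xbar) for x_i in X) / (n - 1)

print(S)
```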
Data are frequently scaled as well as centered. The scaling is done by introducing:
Defn: the sample correlation coefficient for (between) the ith and the jth variables
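$$r_{ij} = \frac{s_{ij}}{s_i s_j} = \frac{s_{ij}}{\sqrt{s_{ii}}\sqrt{s_{jj}}}$$
the sample correlation matrix
$$R_{p \times p} = (r_{ij}) = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}$$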
Obviously
$$r_{ii} = \frac{s_{ii}}{s_i s_i} = \frac{s_{ii}}{\sqrt{s_{ii}}\sqrt{s_{ii}}} = 1$$
and using the Cauchy-Schwarz inequality
$$|r_{ij}| \le 1$$
If R = I then we say the variables are uncorrelated
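Note: if we denote
$$D = \begin{pmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_p \end{pmatrix} = \operatorname{diag}(s_1, s_2, \ldots, s_p)$$
Then it can be checked that
$$R = D^{-1} S D^{-1}$$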
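The scaling identity can be sketched as follows on the five-student example. NumPy and the names `D_inv` and `R` are illustrative assumptions; the code takes D to be the diagonal matrix of sample standard deviations, as defined above.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

S = np.cov(X, rowvar=False)  # sample covariance matrix (divisor n - 1)

# D = diag(s_1, ..., s_p) holds the sample standard deviations,
# so D^{-1} has 1 / s_i on its diagonal.
D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))

R = D_inv @ S @ D_inv        # R = D^{-1} S D^{-1}

print(R)
```

The result should agree with `np.corrcoef(X, rowvar=False)`, which is the more usual way to obtain R directly.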
Measures of Multivariate Scatter
The sample variance-covariance matrix S is an obvious generalization of the univariate concept of variance, which measures scatter about the mean.
Sometimes it is convenient to have a single number to measure the overall multivariate scatter.
There are two common measures of this type:
Defn: The generalized sample variance
$$|S| = \det S$$
Defn: The total sample variance
$$\operatorname{tr} S = \sum_{i=1}^{p} s_{ii} = \sum_{i=1}^{p} s_i^2$$
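In both cases, large values indicate a high degree of scatter about the centroid $\bar{\mathbf{x}}$;
low values indicate concentration about the centroid $\bar{\mathbf{x}}$.
Using the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of the matrix S, it can be shown that
$$|S| = \lambda_1 \lambda_2 \cdots \lambda_p$$
$$\operatorname{tr} S = \lambda_1 + \lambda_2 + \cdots + \lambda_p$$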
If $\lambda_p = 0$ then
$$|S| = 0$$
This says that there is a linear dependence amongst the variables.
Normally, S is positive definite and all the eigenvalues are positive.
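Both scatter measures and their eigenvalue expressions can be checked on the five-student example. This is a sketch assuming NumPy; `gen_var`, `tot_var` and `eigvals` are hypothetical names.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

S = np.cov(X, rowvar=False)

gen_var = np.linalg.det(S)  # generalized sample variance |S|
tot_var = np.trace(S)       # total sample variance tr S

# S is symmetric, so eigvalsh applies and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(S)

print(gen_var, np.prod(eigvals))  # |S| equals the product of the eigenvalues
print(tot_var, np.sum(eigvals))   # tr S equals the sum of the eigenvalues
```

For this data set all eigenvalues are positive, so S is positive definite and there is no exact linear dependence among the three variables.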
Linear combinations
Taking linear combinations of variables is one of the most important tools of multivariate analysis.
This is for basically two reasons:
1. A few appropriately chosen combinations may provide more of the information than a lot of the original variables. (this is called dimension reduction.)
2. Linear combinations can simplify the structure of the variance-covariance matrix, which can help in the interpretation of the data.
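For a given vector of constants:
$$\mathbf{a} = (a_1, \ldots, a_p)'$$
We consider a linear combination
$$Y_i = a_1 x_{i1} + a_2 x_{i2} + \cdots + a_p x_{ip} = \mathbf{a}'\mathbf{x}_i$$
For i = 1, 2, … , n. Then
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} \mathbf{a}'\mathbf{x}_i = \mathbf{a}'\bar{\mathbf{x}}$$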
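And the variance of the Y’s is
$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \mathbf{a}'\left[\frac{1}{n-1}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'\right]\mathbf{a} = \mathbf{a}'S\mathbf{a}$$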
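The mean and variance formulas for a linear combination can be confirmed numerically. In the sketch below the coefficient vector `a` is an arbitrary, hypothetical choice made only for illustration, and the data are the five-student example.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

a = np.array([1.0, 0.5, -2.0])  # hypothetical vector of constants

Y = X @ a                       # Y_i = a' x_i for each object i

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

mean_Y = a @ xbar               # mean of the Y's: a' xbar
var_Y = a @ S @ a               # variance of the Y's: a' S a

print(mean_Y, Y.mean())
print(var_Y, Y.var(ddof=1))
```

Computing a'Sa avoids forming the Y values at all, which is what makes it convenient when many candidate vectors a are being compared.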