Multivariate Statistics
1. Simple Linear and Locally Weighted Regression
2. Multiple Linear Regression
3. Multivariate data and multivariate analysis
4. Equivalence of ANOVA and Multiple Linear Regression, and the Generalized Linear
Model
5. Logistic Regression
6. Principal Components Analysis (PCA)
7. Grouped Multivariate Data
[[ Simple Linear Regression ]]
What do I have?
I have sample data of one explanatory variable x and a response variable y.
What do I want?
I want to find a model that explains how the response variable varies with the explanatory variable.
How do I get that? The model I am trying to find is the line that minimizes the sum of squared distances of each point to the line.
We get the parameters through the least-squares estimates.
This line is the model that best explains the variation of the data and is expressed by the equation:
yi = β0 + β1xi + εi
What to look after:
β1 is the slope. It measures how the response variable changes if the explanatory variable is changed by one unit.
Note: linear regression means the model is linear in the parameters; the x variable, for example, could enter exponentially!
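A minimal sketch of the least-squares estimates in Python (the data here is made up purely for illustration):

```python
import numpy as np

# toy data: one explanatory variable x, one response y (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# least-squares estimates for the line y = b0 + b1*x
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x  # fitted values on the regression line
print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")
```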
[[ Variability of the response variable ]]
The variability of the response variable can be partitioned into 2 parts
One due to the regression on the explanatory variable (the regression mean square, RGMS)
And a residual mean square (RSMS)
regression mean square (RGMS): RGMS = Σi (ŷi − ȳ)², where ŷi is the estimated value of y (the y-value on the regression line) and ȳ is the mean y-value
residual mean square (RSMS): RSMS = Σi (yi − ŷi)² / (n − 2), where yi is the actual value of y and ŷi is the estimated value of y (the y-value on the regression line)
The RSMS gives an estimate of σ², which is the population variance
If I want to test whether the correlation of the 2 variables is 0, I can use the F-statistic
F = RGMS/RSMS with 1 and n-2 degrees of freedom (tests if β1 = 0)
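A self-contained sketch of this F-test on the toy data from above (SciPy supplies the F distribution):

```python
import numpy as np
from scipy.stats import f

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1, b0 = np.polyfit(x, y, 1)               # least-squares line
y_hat, y_bar, n = b0 + b1 * x, y.mean(), len(x)

rgms = np.sum((y_hat - y_bar) ** 2)        # regression mean square (1 df)
rsms = np.sum((y - y_hat) ** 2) / (n - 2)  # residual mean square, estimates sigma^2
F = rgms / rsms
p_value = f.sf(F, 1, n - 2)                # tests H0: beta1 = 0
print(f"F = {F:.2f}, p = {p_value:.4g}")
```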
[[ Regression Diagnostics ]]
What do I have?
I have fitted a linear regression model and estimated the coefficients: Ŷi = β0 + β1xi
What do I want?
I want to assess whether the model really fits our data
We need to check the assumption that the variance of the response does not change with the values of the explanatory variables (the constant-variance assumption)
Also, we have to check whether we may have to use a more complex model
How do I get that?
Examine the residuals: residuals = observed data − predicted values
What to look after:
Residuals are either positive or negative
I can put them in a boxplot to see what the pattern looks like
Plotting against the corresponding values of x: any sign of curvature suggests an additional term in the explanatory variable (e.g., a quadratic term)
Plotting against the fitted response values ŷ: if the variability of the residuals appears to increase with size, a transformation of the response variable is suggested
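A sketch of the two diagnostic plots on the same toy data as above (matplotlib assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
residuals = y - y_hat                 # observed - predicted

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, residuals)         # curvature here -> extra term in x
axes[0].set(xlabel="x", ylabel="residual")
axes[1].scatter(y_hat, residuals)     # funnel shape -> transform the response
axes[1].set(xlabel="fitted values", ylabel="residual")
plt.show()
```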
[[ Locally weighted regression ]]
What do I have?
I have sample data of one explanatory variable x and a response variable y.
What do I want?
I have no idea a priori what model to use
Therefore I want the data itself to suggest what function to use
How do I get that?
Replace the global estimates Ŷi with local estimates computed from just a range of the explanatory variable x instead of all values
Scatterplot smoothers summarize the relationship between 2 variables just with a line drawing (no parameters)
the simplest is lowess fit
The result is a lowess curve with 2 parameters: the span (a smoothing parameter) and γ, the degree of the local polynomials (= 1 or 2)
both parameters are found by trial and error and common sense
Spline smoothers are polynomial pieces connected across the intervals; the simplest is a piecewise linear function
What to look after: The spline function is simple and can approximate some relationships, but it is not smooth. A cubic spline is a smoother spline function because f′′ = 0 at the knots, which makes the fit curvy instead of sharp. The advantage of cubic splines is that they provide the best mean square error and avoid overfitting (they don't display unimportant variation between x and y).
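A minimal lowess sketch using statsmodels (the data is simulated; the frac argument plays the role of the span):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)

# frac is the span: the fraction of the data used for each local fit
smoothed = lowess(y, x, frac=0.3)  # returns sorted (x, smoothed y) pairs
```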
[[ Multiple Linear Regression ]]
What do I have?
I have more than just 1 explanatory variable x
We have q explanatory variables x1, …, xq
What do I want?
I want to determine whether the response variable and one or more explanatory variables are associated in some way.
How do I get it?
The multiple linear regression function looks like this:
yi = β0 + β1xi1 + β2xi2 + … + βqxiq + εi
So each explanatory variable has its own coefficient β that measures how much the response variable changes when the explanatory variable changes by 1, IF all the other explanatory variables don’t change
Now I have to estimate the parameters again with the least-squares estimation process; it follows that
β̂ = (X′X)⁻¹X′y
If there are linear relationships between the explanatory variables, the matrix X’X will be singular and its inverse will not exist
Singular means its determinant is zero, a singular matrix always has no inverse
A matrix is singular if and only if its rows (columns) are linearly dependent
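A sketch of the least-squares estimate β̂ with NumPy; lstsq is used rather than explicitly inverting X′X, which fails exactly in the singular case just described (data simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # intercept + q x's
beta = np.array([2.0, 1.0, -0.5, 0.3])                      # "true" parameters
y = X @ beta + rng.normal(scale=0.5, size=n)

# least-squares estimates; avoids forming (X'X)^-1 directly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```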
What to look after:
To test whether the explanatory variables have any effect at all (i.e., all the slope parameters are 0) we can use the F-statistic again:
F = RGMS/RSMS with q and n-q-1 degrees of freedom
RGMS = Σi (ŷi − ȳ)² / q
RSMS = Σi (yi − ŷi)² / (n − q − 1)
[[ Multiple Linear Regression – Example ]]
Explanatory variables: age, extroversion, gender (0/1). Observed variable: time spent looking after the car
Scatterplot matrix
Men spend more time (male:1)
Age less strongly related with time than extroversion
Age and extroversion also related, 2 outliers
„Raw“ coefficients should be standardized by multiplying them with the standard deviation of the appropriate explanatory variable and dividing by the standard deviation of the response variable
Age might be dropped
fitted model: time = 15.68 + 19.18 x gender + 0.51 x extroversion
[[ Measuring model fit in multiple linear regression ]]
What do I have?
I have the regression model with its estimated coefficients β
What do I want?
I want to assess statistical significance of the regression coefficients
Also I want to see if our model is good
How do I get it?
Compute estimated covariance matrix of the β parameter
The diagonal elements of the matrix give estimates of the variances of the coefficients
The off-diagonal elements give the estimated covariances
R is the multiple correlation coefficient
It is defined as the correlation between the observed variable yi and the values predicted by the fitted model, ŷi:
R = corr(yi, ŷi)
The value of R² gives the percentage of the variability in the response that is accounted for by the chosen regression model.
What to look after:
A problem that can occur in multiple linear regression is multicollinearity
That is when there is correlation among some or all explanatory variables
It is bad because it makes determining the importance of a given explanatory variable difficult (due to intercorrelation)
Also it increases the variances of the regression coefficients. The parameter estimates become unreliable.
First step is to look at scatterplot matrix and check for correlations
If we have a suspicion, we can compute the variance inflation factor (VIF) for variable j: VIFj = 1 / (1 − Rj²), where Rj² is the R² from regressing xj on the other explanatory variables
Rule of thumb: VIF > 10 gives some cause for concern
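A sketch of the VIF computation straight from this definition (X holds the explanatory variables as columns, without an intercept column; the function name is mine):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()   # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)  # values > 10 give some cause for concern
```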
[[ Choosing the most parsimonious model in multiple linear regression]]
What do I have?
We have a few explanatory variables x and a response variable y
We have estimated the parameters β
What do I want?
we want to choose the best model for them
So we want to include the explanatory variables that are relevant for y
And we want to exclude the explanatory variables that are irrelevant for y
How do I get it? Automatic method: Backward elimination
Compute the AIC value for the full model (lower values are preferred)
Compute all AIC values where one explanatory variable is left out
The variable whose removal gives the lowest AIC is dropped, provided that value is lower than the first AIC value
Repeat until nothing changes (see the sketch below)
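A sketch of this backward-elimination loop with statsmodels (the function and variable names are mine, not from the source):

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(y, X, names):
    """Greedy backward elimination: drop variables while the AIC improves."""
    current = list(range(X.shape[1]))
    best_aic = sm.OLS(y, sm.add_constant(X[:, current])).fit().aic
    while len(current) > 1:
        # AIC of every model with one explanatory variable left out
        trials = {j: sm.OLS(y, sm.add_constant(
                      X[:, [k for k in current if k != j]])).fit().aic
                  for j in current}
        j_best = min(trials, key=trials.get)
        if trials[j_best] >= best_aic:
            break                      # no removal lowers the AIC: stop
        best_aic = trials[j_best]
        current.remove(j_best)
    return [names[k] for k in current], best_aic
```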
What to look after:
Compute the F-statistic
Look at the p-value: is it very low? If yes, that's fine (it means not all parameters of the regression model are 0)
Look at the t-values of each single explanatory variable; the ones with the lowest values could probably be kicked out
[[ Regression diagnostics in multiple linear regression ]]
What do I have?
We have selected the most parsimonious regression model
What do I want?
Check the assumptions on which the model is based
We have to identify and understand the differences between the model and the data to which it is fitted
How do I get it? In multiple linear regression the raw residuals are not independent of one another, nor do they have the same variance. Therefore different types of residuals can be computed for the diagnosis
Standardized residual: ri = (yi − ŷi) / (s √(1 − hii)), where hii is the ith diagonal element of the hat matrix H = X(X′X)⁻¹X′
Or, when the ith observation is dropped, the residual mean square estimate s²(i) can be used instead of s², giving the deletion residual ri,del
Another useful tool is Cook's distance
What to look after:
Both types of residuals should look like draws from a standard normal distribution
ri,del is often helpful for identification of outliers
Cook's distance measures the influence of observation i on the estimation of all the parameters in the model
Values > 1 suggest that the observation has undue influence on the estimation process
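A sketch computing the standardized residuals and Cook's distances directly from the hat matrix (X is a design matrix including the intercept column; the function name is mine):

```python
import numpy as np

def influence_measures(X, y):
    """Standardized residuals and Cook's distances for an OLS fit."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages h_ii
    resid = y - H @ y                          # raw residuals
    p = X.shape[1]                             # number of parameters
    s2 = resid @ resid / (len(y) - p)          # residual mean square
    r_std = resid / np.sqrt(s2 * (1 - h))      # standardized residuals
    cooks_d = r_std**2 * h / (p * (1 - h))     # values > 1: undue influence
    return r_std, cooks_d
```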
[[ Multivariate data and Multivariate analysis ]]
What do I have?
I have multivariate data which means that there is no division into response and explanatory variables
Also it means that all variables x are random variables
What do I want?
I want to display or „extract“ any signal in the data from the noise
In order to produce a numerical summary for a multivariate data set we need to produce summary statistics for each of the variables separately (means + variances)
Calculate appropriate statistics that summarizes the relationship between the variables (e.g. covariances and correlations)
How do I get it?
Take pairs of variables at a time and look at their covariances or correlations
Population variance: σi² = E[(xi − μi)²]
Covariance between 2 variables: σij = E[(xi − μi)(xj − μj)]
Estimate of the variance-covariance matrix: S = (1/(n − 1)) Σi (xi − x̄)(xi − x̄)′
Correlation coefficient: rij = sij / (si sj) (S standardized by the product of the standard deviations of the 2 variables)
What to look after:
Note: the covariance of the variable with itself is simply its variance
The diagonal of S contains the variances of each variable, the covariance in the off-diagonal
The correlation coefficient lies between -1 and +1 and gives a measure of the linear relationship between the variables xi and xj
It's like a normalized covariance matrix and says about the same thing!
So now I can see what variables correlate with one another!
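A sketch with NumPy (the data is random, standing in for a real multivariate sample):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(20, 3))      # n = 20 subjects, q = 3 variables

S = np.cov(data, rowvar=False)       # covariance matrix (divisor n - 1)
R = np.corrcoef(data, rowvar=False)  # correlation matrix, entries in [-1, 1]
# diagonal of S: variances; off-diagonals: covariances
```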
[[ Multivariate Analysis – Example ]]
n = 20 subjects
q = 3 variables
The vector xi − x̄ contains the deviation from the mean for each variable for subject i
covariance matrix:
variances sii = si², i = 1, 2, 3, on the diagonal
covariances in the off diagonal cells
Sample correlation coefficient:
[[ Test for multivariate normality ]]
What do I have?
I have multivariate data
I have a linear combination of the variables
I have the covariance matrix for the data
What do I want?
I want to test whether my data is multivariate normally distributed
It is not critical, but in some cases it makes sense to have the assumption of multivariate normality
That is described by the multivariate normal probability density function.
How do I get it? Convert the multivariate observations x1, x2, …, xn into a set of generalized distances
di² = (xi − x̄)′ S⁻¹ (xi − x̄)
where S is the sample covariance matrix
What to look after:
If the generalized distances have approximately a chi-squared distribution with q degrees of freedom we have a multivariate normal distribution.
So, plotting the ordered distances against the corresponding quantiles of the appropriate chi-square distribution should lead to a straight line through the origin.
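A sketch of the generalized distances and the chi-squared quantiles to plot them against (SciPy assumed; the data is simulated):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                 # n observations of q variables
n, q = X.shape

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
# d_i^2 = (x_i - xbar)' S^-1 (x_i - xbar)
d2 = np.einsum("ij,jk,ik->i", X - xbar, S_inv, X - xbar)

# ordered distances vs. chi-squared(q) quantiles: a straight line
# through the origin indicates multivariate normality
quantiles = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=q)
pairs = np.column_stack([quantiles, np.sort(d2)])
```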
[[ Equivalence of ANOVA and multiple regression ]]
What do I have?
The ANOVA model is yij = μ + αi + εij, with the constraint α1 + α2 + α3 = 0 (for 3 groups)
What do I want?
I want to show that ANOVA = multiple linear regression
How do I get it?
If we have 3 groups (factor levels) we can introduce 2 dummy variables x1 and x2, coded as follows:
Group:  1   2   3
x1:     1   0  −1
x2:     0   1  −1
Then, using α1 + α2 + α3 = 0, the model can be rewritten in terms of the variables x1 and x2 as
yij = μ + α1xi1 + α2xi2 + εij
And this is exactly the same form as a multiple linear regression model with 2 explanatory variables.
What to look after:
ANOVA provides a statistical test of whether or not the means of several groups are all equal
ANOVA is a procedure in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.
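A sketch demonstrating the equivalence with effect-coded dummy variables in statsmodels (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
groups = np.repeat([0, 1, 2], 10)                 # 3 groups of 10
y = np.array([0.0, 1.0, 2.0])[groups] + rng.normal(size=30)

# effect coding: group 3 is (-1, -1), so alpha_1 + alpha_2 + alpha_3 = 0
x1 = np.where(groups == 0, 1, np.where(groups == 2, -1, 0))
x2 = np.where(groups == 1, 1, np.where(groups == 2, -1, 0))
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# the regression F-test equals the one-way ANOVA F-test for equal group means
print(fit.fvalue, fit.f_pvalue)
```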
[[ Generalized Linear model ]]
What do I have?
I have more than just 1 explanatory variable x
But instead of a continuous response variable which is normally distributed, I can now have binary responses or counts or reaction times
What do I want?
I want a generalization of the multiple regression model that holds for such response variables
How do I get it?
A generalized linear model consists of
a linear predictor η = β0 + β1x1 + … + βqxq
a link function g(μ) = η (it's like „tricking“ the linear model into thinking that it is still acting upon normally distributed outcome variables)
and an error distribution (given its mean μ), chosen according to the link function:
identity link with the normal distribution,
log link with the Poisson distribution,
logit link with the binomial distribution
What to look after:
In multiple linear regression, the link function is the identity function.
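A sketch of a GLM fit with statsmodels; here a Poisson family, whose default link is the log link listed above (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(100, 2)))    # linear predictor inputs
counts = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))

# log link with the Poisson distribution
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.params)
```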
[[ Confidence interval for the odds ratio ]]
Example:
Is being judged as a „case“ related to gender and GHQ score?
If I want to know the odds for an event, I need its probability p; then I can compute:
odds = p / (1 − p)
The odds ratio ψ can only be estimated. For a 2×2 table with cell counts a, b, c, d it is ψ̂ = (a·d)/(b·c).
So, an approximate 95% CI for log ψ is given by log ψ̂ ± 1.96 × √Var(log ψ̂)
Estimated variance: Var(log ψ̂) = 1/a + 1/b + 1/c + 1/d = 1/43 + 1/131 + 1/25 + 1/79 = 0.084
95% CI for log ψ:
[−0.037 − 1.96 × 0.290, −0.037 + 1.96 × 0.290]
= [−0.604, 0.531]
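A sketch reproducing this computation; the original table layout is not shown here, so the cell arrangement below is an assumption chosen to reproduce the example's log ψ̂ ≈ −0.037:

```python
import math

# cell counts a, b, c, d (arrangement assumed to match the example)
a, b, c, d = 131, 43, 79, 25
log_or = math.log((a * d) / (b * c))              # log psi-hat ~ -0.037
se = math.sqrt(1/a + 1/b + 1/c + 1/d)             # sqrt(0.084) ~ 0.290
ci = (log_or - 1.96 * se, log_or + 1.96 * se)
print(f"log OR = {log_or:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```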
[[ Logistic Regression ]]
What do I have?
I have a binary response variable [0, 1] ( we cannot use multiple linear regression)
What do I want?
I want to investigate the effects of a set of explanatory variables x on a binary response y
How do I get it?
The mean of the response in this case is of course P(y = 1) = π
Use the multiple linear regression approach and consider this model: π = β0 + β1x1 + … + βqxq
There are 2 problems:
1. π must be between 0 and 1
2. the response variable y is Bernoulli distributed, not normally distributed
To overcome these problems we must transform π with the logit function: logit(π) = log(π / (1 − π)) = β0 + β1x1 + … + βqxq
Now it is flexible and can take values from −∞ to +∞
An observed value is expressed by
What to look after: To estimate the parameters in the logistic regression model, maximum likelihood is used. Intuitively, what the estimation procedure tries to do is to find estimates of the regression parameters that make the predicted probabilities as close as possible (in some sense) to the observed probabilities.
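A sketch of a maximum-likelihood logistic fit with statsmodels (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
pi = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.8]))))  # inverse logit
y = rng.binomial(1, pi)                                   # binary responses

fit = sm.Logit(y, X).fit()     # maximum likelihood estimation
print(fit.params)              # estimates on the logit scale
```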
[[ Measure of fit for logistic regression ]]
What do I have?
I have a logistic regression model
What do I want?
I want to measure the lack of fit of the model
Compute the differences in the deviance to compare alternative nested logistic regression models
How do I get it?
For example: let D1 be the deviance of a model containing only x1, and D2 the deviance of a model containing x1, x2, and x3.
The difference D1 − D2 reflects the combined effect of the explanatory variables x2 and x3.
What to look after:
The deviance is essentially −2 times the log of the ratio of the likelihoods of the model of interest to the saturated model that fits the data perfectly.
Under the hypothesis that these variables have no effect (i.e., β2 and β3 are both 0), the
difference has an approximate χ2 distribution with degrees of freedom equal to the difference in the number of parameters in the 2 models.
To choose the most parsimonious model we can use backwards elimination algorithm with AIC!
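A sketch of the deviance comparison for nested logistic models, using the binomial GLM so the deviance attribute is available (data simulated):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(7)
x = rng.normal(size=(200, 3))
pi = 1 / (1 + np.exp(-(0.5 * x[:, 0] + 0.8 * x[:, 1])))
y = rng.binomial(1, pi)

binom = sm.families.Binomial()
d1 = sm.GLM(y, sm.add_constant(x[:, :1]), family=binom).fit().deviance  # x1 only
d2 = sm.GLM(y, sm.add_constant(x), family=binom).fit().deviance         # x1, x2, x3
# df = difference in the number of parameters between the 2 models (here 2)
p_value = chi2.sf(d1 - d2, df=2)
```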
[[ Principal Component Analysis ]]
What do I have?
We have too many variables in a multivariate data set („curse of dimensionality“). The graphical presentation techniques may no longer be useful, and statistical techniques like multiple linear regression may no longer be feasible.
What do I want?
Reduce the dimensionality of the data set to get rid of problems
But still retain as much as possible of the variation in the set
How do I get it? In principal component analysis I compute a vector a defining the axis that maximizes the variance of the cloud of points projected onto it.
Equivalently, that axis minimizes the sum of squared perpendicular distances from the points to the axis.
All the points are projected onto the axis, so instead of a point P(4, 5) I keep only its coordinate along the axis, e.g. 3: I have reduced the dimension! The principal components are orthogonal, uncorrelated linear combinations of the standardized variables xi.
What to look after:
This reduction is only possible if the original variables are correlated; otherwise no simplification is possible
In essence, PCA is simply a rotation of the axes of multivariate data scatter.
The first few of these variables should account for most of the variation in all the original variables
The principal components then serve as a simpler basis for graphing and summarizing the data and for further multivariate analysis
The first few principal component scores can often be used to provide a convenient summary of a multivariate data set, particularly for looking at the data via simple scatterplots.
[[ Extract principal components ]]
What do I have?
too many variables in a multivariate data set („curse of dimensionality“)
What do I want?
Extract the principal components to transform the data set
How do I get it? The 1st principal component is the linear combination y1 = a1′x = a11x1 + … + a1qxq
To find the coefficients defining the first component we need to choose the elements of the vector a1 so as to maximize the variance of y1, subject to a1′a1 = 1 (sum-of-squares constraint)
The sample variance of y1 is given by Var(y1) = a1′Sa1, where S is the q × q sample covariance matrix of the x variables.
This leads to the solution that a1 is what is called an eigenvector of the sample covariance matrix S, and that it is the eigenvector corresponding to the largest eigenvalue of S.
The second principal component y2 is defined to be the linear combination y2 = a2′x which has the greatest variance subject to the following 2 conditions: a2′a2 = 1 and a2′a1 = 0 (so y2 is uncorrelated with y1)
What to look after:
Maximization of the variance is done by the method of Lagrange multipliers (not explained here)
Similarly, the jth principal component is the linear combination yj = aj′x which has the greatest variance subject to the conditions aj′aj = 1 and aj′ai = 0 for i < j
Application of the Lagrange multiplier technique demonstrates that the vector of coefficients defining the jth principal component, aj, is the eigenvector of S associated with its jth largest eigenvalue.
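A sketch extracting the components by eigendecomposition of S (data simulated; for the correlation-matrix variant, standardize the columns first):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                  # center the variables

S = np.cov(Xc, rowvar=False)             # q x q sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]        # decreasing eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                    # principal component scores y_j
explained = eigvals / eigvals.sum()      # variance share per component
```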
[[ Principal Component analysis – Example ]]
Example:
Different types of crime per 100,000 residents of 50 states in the US
First Step:
Check what to use for PCA. The covariance or correlation matrix.
Variables are all on the same scale, which suggests the covariance matrix. BUT:
Look at the variances of each crime rate; if they are very different, the result from PCA would be swamped by them. ERGO:
Use the correlation matrix for PCA
Results:
Scree plot shows variances of the components
If the variance (eigenvalue) of a component is > 1 (or, less strictly, > 0.7) it is adequate for describing the data
We see that the first 2 components are adequate
Examining the components: the first shows the overall crime rate; the second is a bit difficult, maybe a „property crimes vs. person crimes“ contrast
Rather than trying to interpret the PCs, one can use the scores of the individuals on the first few PCs as coordinates in a low-dimensional map of the observations.
The Euclidean distances between the points (representing the individuals) best approximate the Euclidean distances based on the original variables
Observed correlation matrix vs. the matrix constructed from the two-component solution:
They are quite similar, which shows that the components give an accurate description
[[ Grouped multivariate data – Hotelling’s T² Test ]]
What do I have?
I have multivariate data (no division into response and explanatory variables)
I know that this data was gathered from 2 groups
What do I want?
We want to test whether there are any differences among the groups
Apply Hotelling’s T² Test
How do I get it?
So H0: the means of the variables in the first population are equal to the means in the second population.
Under H0 (and when the assumptions given below hold), the statistic F given by
F = [(n1 + n2 − q − 1) / ((n1 + n2 − 2) q)] · T²
has a Fisher’s F-distribution with q and n1 + n2 − q − 1 degrees of freedom.
What to look after: The T² test is based on the following assumptions:
In each population, the variables have a multivariate normal distribution
The two populations have the same covariance matrix
The observations are independent
Compute the F-value, check the p-value, reject or accept H0. Test statistic of T²:
T² = [n1 n2 / (n1 + n2)] · D²
where n1 and n2 are the sample sizes in each group, and D² is the generalized distance D² = (x̄1 − x̄2)′ S⁻¹ (x̄1 − x̄2), where x̄1 and x̄2 are the 2 sample mean vectors
S is the estimate of the assumed common covariance matrix of the 2 populations, calculated from the 2 sample covariance matrices S1 and S2 as S = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2 − 2)
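A sketch of the test assembled from the formulas above (SciPy assumed; the function name is mine):

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(X1, X2):
    """Two-sample Hotelling T^2 test under the assumptions listed above."""
    n1, q = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # pooled estimate of the common covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    d2 = diff @ np.linalg.inv(S) @ diff            # generalized distance D^2
    t2 = (n1 * n2) / (n1 + n2) * d2
    F = (n1 + n2 - q - 1) / ((n1 + n2 - 2) * q) * t2
    return t2, F, f.sf(F, q, n1 + n2 - q - 1)      # T^2, F, p-value
```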
[[ Discriminant function analysis ]]
What do I have?
multivariate data
I know that this data was gathered from 2 groups
What do I want?
construct a classification rule from the measurements so I can put a new individual (from the data I measured) in one of the groups
How do I get it?
We have to find a linear function of the variables, z = a1x1 + a2x2 + … + aqxq,
such that the ratio of the between-group variance of z to its within-group variance is maximized
The allocation rule can be shown to be: allocate an individual with discriminant score z to group 1 if z > (z̄1 + z̄2)/2 (assuming that the groups are labeled such that z̄1 > z̄2)
Example:
The discriminant function is
Assuming equal prior probabilities (unrealistic), the allocation rule for a new infant becomes „allocate to group 1 (low SID risk) if z > 0.506“ (the cutoff value halfway between the discriminant function means of each group)
The discriminant function can be shown on the scatterplot of the birth weight and factor68 simply by plotting the line z – 0.506 = 0.
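A sketch of the two-group discriminant function (the function name is mine; the example's coefficients and its 0.506 cutoff came from its own data, which is not reproduced here):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Coefficients a and cutoff for the two-group allocation rule."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group covariance matrix
    W = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(W, m1 - m2)          # discriminant coefficients
    cutoff = (a @ m1 + a @ m2) / 2           # halfway between group means of z
    return a, cutoff

# allocate a new individual x_new to group 1 if a @ x_new > cutoff
```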
[[ Discriminant Analysis for 3 groups ]]
What do I have?
multivariate data
I know that this data was gathered from 3 groups
What do I want?
construct a classification rule from the measurements so I can put a new individual (from the data I measured) in one of the groups
How do I get it?
What to look after:
multivariate analysis of variance (MANOVA)