Multivariate Statistics
1. Simple Linear and Locally Weighted Regression
2. Multiple Linear Regression
3. Multivariate data and multivariate analysis
4. Equivalence of ANOVA and Multiple Linear Regression, and the Generalized Linear
Model
5. Logistic Regression
6. Principal Components Analysis (PCA)
7. Grouped Multivariate Data
[[ Simple Linear Regression ]]
What do I have?
I have sample data of one explanatory variable x and a response variable y.
What do I want?
I want to find a model that explains how the response variable varies with the explanatory variable.
How do I get that? The model I am trying to find is the line that minimizes the sum of squared distances of each point to the line.
We get the parameters through the least-squares estimates.
This line is the model that best explains the variation of the data and is expressed by the equation:
yi = β0 + β1xi + εi
What to look after:
β1 is the slope. It measures how the response variable changes if the explanatory variable is changed by one unit.
Note: linear regression means the model is linear in the parameters; the x variable, for example, could enter exponentially!
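A minimal sketch of the least-squares estimates in Python (the data here is made up purely for illustration):

```python
import numpy as np

# toy data: one explanatory variable x, one response y (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# least-squares estimates for the line y = b0 + b1*x
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x  # fitted values on the regression line
print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")
```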
[[ Variability of the response variable ]]
The variability of the response variable can be partitioned into 2 parts
One due to the regression on the explanatory variable (the regression mean square, RGMS)
And a residual mean square (RSMS)
regression mean square (RGMS): RGMS = Σi (ŷi − ȳ)², where ŷi is the estimated value of y (the y-value on the regression line) and ȳ is the mean y-value
residual mean square (RSMS): RSMS = Σi (yi − ŷi)² / (n − 2), where yi is the actual value of y and ŷi is the estimated value of y (the y-value on the regression line)
The RSMS gives an estimate of σ², which is the population variance
If I want to test whether the correlation of the 2 variables is 0, I can use the F-statistic
F = RGMS/RSMS with 1 and n-2 degrees of freedom (tests if β1 = 0)
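A self-contained sketch of this F-test on the toy data from above (SciPy supplies the F distribution):

```python
import numpy as np
from scipy.stats import f

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1, b0 = np.polyfit(x, y, 1)               # least-squares line
y_hat, y_bar, n = b0 + b1 * x, y.mean(), len(x)

rgms = np.sum((y_hat - y_bar) ** 2)        # regression mean square (1 df)
rsms = np.sum((y - y_hat) ** 2) / (n - 2)  # residual mean square, estimates sigma^2
F = rgms / rsms
p_value = f.sf(F, 1, n - 2)                # tests H0: beta1 = 0
print(f"F = {F:.2f}, p = {p_value:.4g}")
```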
[[ Regression Diagnostics ]]
What do I have?
I have fitted a linear regression model and estimated the coefficients: Ŷi = β0 + β1xi
What do I want?
I want to assess whether the model really fits our data
We need to check the assumption that the variance of the response does not change with the values of the explanatory variables (the constant-variance assumption)
Also, we have to check whether we may have to use a more complex model
How do I get that?
Examine the residuals: residuals = observed data − predicted values
What to look after:
Residuals are either positive or negative
I can put them in a boxplot to see what the pattern looks like
Plotting against the corresponding values of x: any sign of curvature suggests an additional term in the explanatory variable (e.g., a quadratic term)
Plotting against the fitted response values ŷ: if the variability of the residuals appears to increase with size, a transformation of the response variable is suggested
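A sketch of the two diagnostic plots on the same toy data as above (matplotlib assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
residuals = y - y_hat                 # observed - predicted

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, residuals)         # curvature here -> extra term in x
axes[0].set(xlabel="x", ylabel="residual")
axes[1].scatter(y_hat, residuals)     # funnel shape -> transform the response
axes[1].set(xlabel="fitted values", ylabel="residual")
plt.show()
```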
[[ Locally weighted regression ]]
What do I have?
I have sample data of one explanatory variable x and a response variable y.
What do I want?
I have no idea a priori what model to use
Therefore I want the data itself to suggest what function to use
How do I get that?
Replace the global estimates Ŷi with local estimates computed from just a range of the explanatory variable x instead of all values
Scatterplot smoothers summarize the relationship between 2 variables just with a line drawing (no parameters)
the simplest is lowess fit
The result is a lowess curve with 2 parameters: the span (a smoothing parameter) and γ, the degree of the local polynomials (= 1 or 2)
both parameters are found by trial and error and common sense
Spline smoothers are polynomial pieces connected across the intervals; the simplest is a piecewise linear function
What to look after: The spline function is simple and can approximate some relationships, but it is not smooth. A cubic spline is a smoother spline function because f′′ = 0 at the knots, which makes the fit curvy instead of sharp. The advantage of cubic splines is that they provide the best mean square error and avoid overfitting (they don't display unimportant variation between x and y).
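A minimal lowess sketch using statsmodels (the data is simulated; the frac argument plays the role of the span):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)

# frac is the span: the fraction of the data used for each local fit
smoothed = lowess(y, x, frac=0.3)  # returns sorted (x, smoothed y) pairs
```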
[[ Multiple Linear Regression ]]
What do I have?
I have more than just 1 explanatory variable x
We have q explanatory variables x1, …, xq
What do I want?
I want to determine whether the response variable and one or more explanatory variables are associated in some way.
How do I get it?
The multiple linear regression function looks like this:
yi = β0 + β1xi1 + β2xi2 + … + βqxiq + εi
So each explanatory variable has its own coefficient β that measures how much the response variable changes when the explanatory variable changes by 1, IF all the other explanatory variables don’t change
Now I have to estimate the parameters again with the least-squares estimation process; it follows that
β̂ = (X′X)⁻¹X′y
If there are linear relationships between the explanatory variables, the matrix X’X will be singular and its inverse will not exist
Singular means its determinant is zero, a singular matrix always has no inverse
A matrix is singular if and only if its rows (columns) are linearly dependent
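A sketch of the least-squares estimate β̂ with NumPy; lstsq is used rather than explicitly inverting X′X, which fails exactly in the singular case just described (data simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # intercept + q x's
beta = np.array([2.0, 1.0, -0.5, 0.3])                      # "true" parameters
y = X @ beta + rng.normal(scale=0.5, size=n)

# least-squares estimates; avoids forming (X'X)^-1 directly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```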
What to look after:
To test whether the explanatory variables have any effect at all (i.e., all the slope parameters are 0) we can use the F-statistic again:
F = RGMS/RSMS with q and n-q-1 degrees of freedom
RGMS = Σi (ŷi − ȳ)² / q
RSMS = Σi (yi − ŷi)² / (n − q − 1)
[[ Multiple Linear Regression – Example ]]
Explanatory variables: age, extroversion, gender (0/1). Observed variable: time spent looking after the car
Scatterplot matrix
Men spend more time (male:1)
Age less strongly related with time than extroversion
Age and extroversion also related, 2 outliers
„Raw“ coefficients should be standardized by multiplying them with the standard deviation of the appropriate explanatory variable and dividing by the standard deviation of the response variable
Age might be dropped
fitted model: time = 15.68 + 19.18 x gender + 0.51 x extroversion
[[ Measuring model fit in multiple linear regression ]]
What do I have?
I have the regression model with its estimated coefficients β
What do I want?
I want to assess statistical significance of the regression coefficients
Also I want to see if our model is good
How do I get it?
Compute estimated covariance matrix of the β parameter
The diagonal elements of the matrix give estimates of the variances of the coefficients
The off-diagonal elements give the estimated covariances
R is the multiple correlation coefficient
It is defined as the correlation between the observed variable yi and the values predicted by the fitted model, ŷi:
R = corr(yi, ŷi)
The value of R² gives the percentage of the variability in the response that is accounted for by the chosen regression model.
What to look after:
A problem that can occur in multiple linear regression is multicollinearity
That is when there is correlation among some or all explanatory variables
It is bad because it makes determining the importance of a given explanatory variable difficult (due to intercorrelation)
Also it increases the variances of the regression coefficients. The parameter estimates become unreliable.
First step is to look at scatterplot matrix and check for correlations
If we have a suspicion, we can compute the variance inflation factor (VIF) for variable j: VIFj = 1 / (1 − Rj²), where Rj² is the R² from regressing xj on the other explanatory variables
Rule of thumb: VIF > 10 gives some cause for concern
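A sketch of the VIF computation straight from this definition (X holds the explanatory variables as columns, without an intercept column; the function name is mine):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()   # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)  # values > 10 give some cause for concern
```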
[[ Choosing the most parsimonious model in multiple linear regression]]
What do I have?
We have a few explanatory variables x and a response variable y
We have estimated the parameters β
What do I want?
we want to choose the best model for them
So we want to include the explanatory variables that are relevant for y
And we want to exclude the explanatory variables that are irrelevant for y
How do I get it? Automatic method: Backward elimination
Compute the AIC value for the full model (lower values are preferred)
Compute all AIC values where one explanatory variable is left out
The variable whose removal gives the lowest AIC is dropped, provided that value is lower than the first AIC value
Repeat until nothing changes (see the sketch below)
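A sketch of this backward-elimination loop with statsmodels (the function and variable names are mine, not from the source):

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(y, X, names):
    """Greedy backward elimination: drop variables while the AIC improves."""
    current = list(range(X.shape[1]))
    best_aic = sm.OLS(y, sm.add_constant(X[:, current])).fit().aic
    while len(current) > 1:
        # AIC of every model with one explanatory variable left out
        trials = {j: sm.OLS(y, sm.add_constant(
                      X[:, [k for k in current if k != j]])).fit().aic
                  for j in current}
        j_best = min(trials, key=trials.get)
        if trials[j_best] >= best_aic:
            break                      # no removal lowers the AIC: stop
        best_aic = trials[j_best]
        current.remove(j_best)
    return [names[k] for k in current], best_aic
```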
What to look after:
Compute the F-statistic
Look at the p-value: is it very low? If yes, that's fine (it means not all parameters of the regression model are 0)
Look at the t-values of each single explanatory variable; the ones with the lowest values could probably be kicked out
[[ Regression diagnostics in multiple linear regression ]]
What do I have?
We have selected the most parsimonious regression model
What do I want?
Check the assumptions on which the model is based
We have to identify and understand the differences between the model and the data to which it is fitted
How do I get it? In multiple linear regression the raw residuals are not independent of one another, nor do they have the same variance. Therefore different types of residuals can be computed for the diagnosis
Standardized residual: ri = (yi − ŷi) / (s √(1 − hii)), where hii is the ith diagonal element of the hat matrix H = X(X′X)⁻¹X′
Or, when the ith observation is dropped, the residual mean square estimate s²(i) can be used instead of s², giving the deletion residual ri,del
Another useful tool is Cook's distance
What to look after:
Both types of residuals should look like draws from a standard normal distribution
ri,del is often helpful for identification of outliers
Cook's distance measures the influence of observation i on the estimation of all the parameters in the model
Values > 1 suggest that the observation has undue influence on the estimation process
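A sketch computing the standardized residuals and Cook's distances directly from the hat matrix (X is a design matrix including the intercept column; the function name is mine):

```python
import numpy as np

def influence_measures(X, y):
    """Standardized residuals and Cook's distances for an OLS fit."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages h_ii
    resid = y - H @ y                          # raw residuals
    p = X.shape[1]                             # number of parameters
    s2 = resid @ resid / (len(y) - p)          # residual mean square
    r_std = resid / np.sqrt(s2 * (1 - h))      # standardized residuals
    cooks_d = r_std**2 * h / (p * (1 - h))     # values > 1: undue influence
    return r_std, cooks_d
```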
[[ Multivariate data and Multivariate analysis ]]
What do I have?
I have multivariate data which means that there is no division into response and explanatory variables
Also it means that all variables x are random variables
What do I want?
I want to display or „extract“ any signal in the data from the noise
In order to produce a numerical summary for a multivariate data set we need to produce summary statistics for each of the variables separately (means + variances)
Calculate appropriate statistics that summarizes the relationship between the variables (e.g. covariances and correlations)
How do I get it?
Take pairs of variables at a time and look at their covariances or correlations
Population variance: σi² = E[(xi − μi)²]
Covariance between 2 variables: σij = E[(xi − μi)(xj − μj)]
Estimate of the variance-covariance matrix: S = (1/(n − 1)) Σi (xi − x̄)(xi − x̄)′
Correlation coefficient: rij = sij / (si sj) (S standardized by the product of the standard deviations of the 2 variables)
What to look after:
Note: the covariance of the variable with itself is simply its variance
The diagonal of S contains the variances of each variable, the covariance in the off-diagonal
The correlation coefficient lies between -1 and +1 and gives a measure of the linear relationship between the variables xi and xj
It's like a normalized covariance matrix and says about the same thing!
So now I can see what variables correlate with one another!
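A sketch with NumPy (the data is random, standing in for a real multivariate sample):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(20, 3))      # n = 20 subjects, q = 3 variables

S = np.cov(data, rowvar=False)       # covariance matrix (divisor n - 1)
R = np.corrcoef(data, rowvar=False)  # correlation matrix, entries in [-1, 1]
# diagonal of S: variances; off-diagonals: covariances
```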
[[ Multivariate Analysis – Example ]]
n = 20 subjects
q = 3 variables
The vector xi − x̄ contains the deviation from the mean for each variable for subject i
covariance matrix:
variances sii = si², i = 1, 2, 3, on the diagonal
covariances in the off diagonal cells
Sample correlation coefficient:
[[ Test for multivariate normality ]]
What do I have?
I have multivariate data
I have a linear combination of the variables
I have the covariance matrix for the data
What do I want?
I want to test whether my data is multivariate normally distributed
It is not critical, but in some cases it makes sense to have the assumption of multivariate normality
That is described by the multivariate normal probability density function.
How do I get it? Convert the multivariate observations x1, x2, …, xn into a set of generalized distances
di² = (xi − x̄)′ S⁻¹ (xi − x̄)
where S is the sample covariance matrix
What to look after:
If the generalized distances have approximately a chi-squared distribution with q degrees of freedom we have a multivariate normal distribution.
So, plotting the ordered distances against the corresponding quantiles of the appropriate chi-square distribution should lead to a straight line through the origin.
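A sketch of the generalized distances and the chi-squared quantiles to plot them against (SciPy assumed; the data is simulated):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                 # n observations of q variables
n, q = X.shape

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
# d_i^2 = (x_i - xbar)' S^-1 (x_i - xbar)
d2 = np.einsum("ij,jk,ik->i", X - xbar, S_inv, X - xbar)

# ordered distances vs. chi-squared(q) quantiles: a straight line
# through the origin indicates multivariate normality
quantiles = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=q)
pairs = np.column_stack([quantiles, np.sort(d2)])
```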
[[ Equivalence of ANOVA and multiple regression ]]
What do I have?
The ANOVA model is yij = μ + αi + εij, with the constraint α1 + α2 + α3 = 0 (for 3 groups)
What do I want?
I want to show that ANOVA = multiple linear regression
How do I get it?
If we have 3 groups (factor levels) we can introduce 2 dummy variables x1 and x2, coded as follows:
Group:  1   2   3
x1:     1   0  −1
x2:     0   1  −1
Then, using α1 + α2 + α3 = 0, the model can be rewritten in terms of the variables x1 and x2 as
yij = μ + α1xi1 + α2xi2 + εij
And this is exactly the same form as a multiple linear regression model with 2 explanatory variables.
What to look after:
ANOVA provides a statistical test of whether or not the means of several groups are all equal
ANOVA is a procedure in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.
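A sketch demonstrating the equivalence with effect-coded dummy variables in statsmodels (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
groups = np.repeat([0, 1, 2], 10)                 # 3 groups of 10
y = np.array([0.0, 1.0, 2.0])[groups] + rng.normal(size=30)

# effect coding: group 3 is (-1, -1), so alpha_1 + alpha_2 + alpha_3 = 0
x1 = np.where(groups == 0, 1, np.where(groups == 2, -1, 0))
x2 = np.where(groups == 1, 1, np.where(groups == 2, -1, 0))
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# the regression F-test equals the one-way ANOVA F-test for equal group means
print(fit.fvalue, fit.f_pvalue)
```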
[[ Generalized Linear model ]]
What do I have?
I have more than just 1 explanatory variable x
But instead of a continuous response variable which is normally distributed, I can now have binary responses or counts or reaction times
What do I want?
I want a generalization of the multiple regression model that holds for such response variables
How do I get it?
A generalized linear model consists of
a linear predictor η = β0 + β1x1 + … + βqxq
a link function g(μ) = η (it's like „tricking“ the linear model into thinking that it is still acting upon normally distributed outcome variables)
and an error distribution (given its mean μ), chosen according to the link function:
identity link with the normal distribution,
log link with the Poisson distribution,
logit link with the binomial distribution
What to look after:
In multiple linear regression, the link function is the identity function.
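A sketch of a GLM fit with statsmodels; here a Poisson family, whose default link is the log link listed above (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(100, 2)))    # linear predictor inputs
counts = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))

# log link with the Poisson distribution
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.params)
```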
[[ Confidence interval for the odds ratio ]]
Example:
Is being judged as a „case“ related to gender and GHQ score?
If I want to know the odds for an event, I need its probability p; then I can compute:
odds = p / (1 − p)
The odds ratio ψ can only be estimated. For a 2×2 table with cell counts a, b, c, d it is ψ̂ = (a·d)/(b·c).
So, an approximate 95% CI for log ψ is given by log ψ̂ ± 1.96 × √Var(log ψ̂)
Estimated variance: Var(log ψ̂) = 1/a + 1/b + 1/c + 1/d = 1/43 + 1/131 + 1/25 + 1/79 = 0.084
95% CI for log ψ:
[−0.037 − 1.96 × 0.290, −0.037 + 1.96 × 0.290]
= [−0.604, 0.531]
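A sketch reproducing this computation; the original table layout is not shown here, so the cell arrangement below is an assumption chosen to reproduce the example's log ψ̂ ≈ −0.037:

```python
import math

# cell counts a, b, c, d (arrangement assumed to match the example)
a, b, c, d = 131, 43, 79, 25
log_or = math.log((a * d) / (b * c))              # log psi-hat ~ -0.037
se = math.sqrt(1/a + 1/b + 1/c + 1/d)             # sqrt(0.084) ~ 0.290
ci = (log_or - 1.96 * se, log_or + 1.96 * se)
print(f"log OR = {log_or:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```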
[[ Logistic Regression ]]
What do I have?
I have a binary response variable [0, 1] ( we cannot use multiple linear regression)
What do I want?
I want to investigate the effects of a set of explanatory variables x on a binary response y
How do I get it?
The mean of the response in this case is of course P(y = 1) = π
Use the multiple linear regression approach and consider this model: π = β0 + β1x1 + … + βqxq
There are 2 problems:
1. π must be between 0 and 1
2. the response variable y is Bernoulli distributed, not normally distributed
To overcome these problems we must transform π with the logit function: logit(π) = log(π / (1 − π)) = β0 + β1x1 + … + βqxq
Now it is flexible and can take values from −∞ to +∞
An observed value is expressed by
What to look after: To estimate the parameters in the logistic regression model, maximum likelihood is used. Intuitively, what the estimation procedure tries to do is to find estimates of the regression parameters that make the predicted probabilities as close as possible (in some sense) to the observed probabilities.
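A sketch of a maximum-likelihood logistic fit with statsmodels (data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
pi = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.8]))))  # inverse logit
y = rng.binomial(1, pi)                                   # binary responses

fit = sm.Logit(y, X).fit()     # maximum likelihood estimation
print(fit.params)              # estimates on the logit scale
```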
[[ Measure of fit for logistic regression ]]
What do I have?
I have a logistic regression model
What do I want?
I want to measure the lack of fit of the model
Compute the differences in the deviance to compare alternative nested logistic regression models
How do I get it?
For example: let D1 be the deviance of a model containing only x1, and D2 the deviance of a model containing x1, x2, and x3.
The difference D1 − D2 reflects the combined effect of the explanatory variables x2 and x3.
What to look after:
The deviance is essentially −2 times the log of the ratio of the likelihoods of the model of interest to the saturated model that fits the data perfectly.
Under the hypothesis that these variables have no effect (i.e., β2 and β3 are both 0), the
difference has an approximate χ2 distribution with degrees of freedom equal to the difference in the number of parameters in the 2 models.
To choose the most parsimonious model we can use backwards elimination algorithm with AIC!
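A sketch of the deviance comparison for nested logistic models, using the binomial GLM so the deviance attribute is available (data simulated):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(7)
x = rng.normal(size=(200, 3))
pi = 1 / (1 + np.exp(-(0.5 * x[:, 0] + 0.8 * x[:, 1])))
y = rng.binomial(1, pi)

binom = sm.families.Binomial()
d1 = sm.GLM(y, sm.add_constant(x[:, :1]), family=binom).fit().deviance  # x1 only
d2 = sm.GLM(y, sm.add_constant(x), family=binom).fit().deviance         # x1, x2, x3
# df = difference in the number of parameters between the 2 models (here 2)
p_value = chi2.sf(d1 - d2, df=2)
```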
[[ Principal Component Analysis ]]
What do I have?
We have too many variables in a multivariate data set („curse of dimensionality“). The graphical presentation techniques may no longer be useful, and statistical techniques like multiple linear regression may no longer be feasible.
What do I want?
Reduce the dimensionality of the data set to get rid of problems
But still retain as much as possible of the variation in the set
How do I get it? In principal component analysis I compute a vector a defining the axis that maximizes the variance of the cloud of points projected onto it.
Equivalently, that axis minimizes the sum of squared perpendicular distances from the points to the axis.
All the points are projected onto the axis, so instead of a point P(4, 5) I keep only its coordinate along the axis, e.g. 3: I have reduced the dimension! The principal components are orthogonal, uncorrelated linear combinations of the standardized variables xi.
What to look after:
This reduction is only possible if the original variables are correlated; otherwise no simplification is possible
In essence, PCA is simply a rotation of the axes of multivariate data scatter.
The first few of these variables should account for most of the variation in all the original variables
The principal components then serve as a simpler basis for graphing and summarizing the data and for further multivariate analysis
The first few principal component scores can often be used to provide a convenient summary of a multivariate data set, particularly for looking at the data via simple scatterplots.
[[ Extract principal components ]]
What do I have?
too many variables in a multivariate data set („curse of dimensionality“)
What do I want?
Extract the principal components to transform the data set
How do I get it? The 1st principal component is the linear combination y1 = a1′x = a11x1 + … + a1qxq
To find the coefficients defining the first component we need to choose the elements of the vector a1 so as to maximize the variance of y1, subject to a1′a1 = 1 (sum-of-squares constraint)
The sample variance of y1 is given by Var(y1) = a1′Sa1, where S is the q × q sample covariance matrix of the x variables.
This leads to the solution that a1 is what is called an eigenvector of the sample covariance matrix S, and that it is the eigenvector corresponding to the largest eigenvalue of S.
The second principal component y2 is defined to be the linear combination y2 = a2′x which has the greatest variance subject to the following 2 conditions: a2′a2 = 1 and a2′a1 = 0 (so y2 is uncorrelated with y1)
What to look after:
Maximization of the variance is done by the method of Lagrange multipliers (not explained here)
Similarly, the jth principal component is the linear combination yj = aj′x which has the greatest variance subject to the conditions aj′aj = 1 and aj′ai = 0 for i < j
Application of the Lagrange multiplier technique demonstrates that the vector of coefficients defining the jth principal component, aj, is the eigenvector of S associated with its jth largest eigenvalue.
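A sketch extracting the components by eigendecomposition of S (data simulated; for the correlation-matrix variant, standardize the columns first):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                  # center the variables

S = np.cov(Xc, rowvar=False)             # q x q sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]        # decreasing eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                    # principal component scores y_j
explained = eigvals / eigvals.sum()      # variance share per component
```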
[[ Principal Component analysis – Example ]]
Example:
Different types of crime per 100,000 residents of 50 states in the US
First Step:
Check what to use for PCA. The covariance or correlation matrix.
Variables are all on the same scale, which suggests the covariance matrix. BUT:
Look at the variances of each crime rate; if they are very different, the result from PCA would be swamped by them. ERGO:
Use the correlation matrix for PCA
Results:
Scree plot shows variances of the components
If the variance (eigenvalue) of a component is > 1 (or, less strictly, > 0.7) it is adequate for describing the data
We see that the first 2 components are adequate
Examining the components: the first shows the overall crime rate; the second is a bit difficult, maybe a „property crimes vs. person crimes“ contrast
Rather than trying to interpret the PCs, one can use the scores of the individuals on the first few PCs as coordinates in a low-dimensional map of the observations.
The Euclidean distances between the points (representing the individuals) best approximate the Euclidean distances based on the original variables
Observed correlation matrix vs. the matrix constructed from the two-component solution:
They are quite similar, which shows that the components give an accurate description
[[ Grouped multivariate data – Hotelling’s T² Test ]]
What do I have?
I have multivariate data (no division into response and explanatory variables)
I know that this data was gathered from 2 groups
What do I want?
We want to test whether there are any differences among the groups
Apply Hotelling’s T² Test
How do I get it?
So H0: the means of the variables in the first population are equal to the means in the second population.
Under H0 (and when the assumptions given below hold), the statistic F given by
F = [(n1 + n2 − q − 1) / ((n1 + n2 − 2) q)] · T²
has a Fisher’s F-distribution with q and n1 + n2 − q − 1 degrees of freedom.
What to look after: The T² test is based on the following assumptions:
In each population, the variables have a multivariate normal distribution
The two populations have the same covariance matrix
The observations are independent
Compute the F-value, check the p-value, reject or accept H0. Test statistic of T²:
T² = [n1 n2 / (n1 + n2)] · D²
where n1 and n2 are the sample sizes in each group, and D² is the generalized distance D² = (x̄1 − x̄2)′ S⁻¹ (x̄1 − x̄2), where x̄1 and x̄2 are the 2 sample mean vectors
S is the estimate of the assumed common covariance matrix of the 2 populations, calculated from the 2 sample covariance matrices S1 and S2 as S = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2 − 2)
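A sketch of the test assembled from the formulas above (SciPy assumed; the function name is mine):

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(X1, X2):
    """Two-sample Hotelling T^2 test under the assumptions listed above."""
    n1, q = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # pooled estimate of the common covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    d2 = diff @ np.linalg.inv(S) @ diff            # generalized distance D^2
    t2 = (n1 * n2) / (n1 + n2) * d2
    F = (n1 + n2 - q - 1) / ((n1 + n2 - 2) * q) * t2
    return t2, F, f.sf(F, q, n1 + n2 - q - 1)      # T^2, F, p-value
```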
[[ Discriminant function analysis ]]
What do I have?
multivariate data
I know that this data was gathered from 2 groups
What do I want?
construct a classification rule from the measurements so I can put a new individual (from the data I measured) in one of the groups
How do I get it?
We have to find a linear function of the variables, z = a1x1 + a2x2 + … + aqxq,
such that the ratio of the between-group variance of z to its within-group variance is maximized
The allocation rule can be shown to be: allocate an individual with discriminant score z to group 1 if z > (z̄1 + z̄2)/2 (assuming that the groups are labeled such that z̄1 > z̄2)
Example:
The discriminant function is
Assuming equal prior probabilities (unrealistic), the allocation rule for a new infant becomes „allocate to group 1 (low SID risk) if z > 0.506“ (the cutoff value halfway between the discriminant function means of each group)
The discriminant function can be shown on the scatterplot of the birth weight and factor68 simply by plotting the line z – 0.506 = 0.
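A sketch of the two-group discriminant function (the function name is mine; the example's coefficients and its 0.506 cutoff came from its own data, which is not reproduced here):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Coefficients a and cutoff for the two-group allocation rule."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group covariance matrix
    W = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(W, m1 - m2)          # discriminant coefficients
    cutoff = (a @ m1 + a @ m2) / 2           # halfway between group means of z
    return a, cutoff

# allocate a new individual x_new to group 1 if a @ x_new > cutoff
```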
[[ Discriminant Analysis for 3 groups ]]
What do I have?
multivariate data
I know that this data was gathered from 3 groups
What do I want?
construct a classification rule from the measurements so I can put a new individual (from the data I measured) in one of the groups
How do I get it?
What to look after:
multivariate analysis of variance (MANOVA)