
STAT 3130 Guest Speaker:

Ashok Krishnamurthy, Ph.D.
Department of Mathematical and Statistical Sciences

24 January 2011

Correspondence: Ashok.Krishnamurthy@ucdenver.edu

Outline

• A brief overview of STAT 3120

• One-way Analysis of Variance (ANOVA)

• ANOVA example

• Implementing ANOVA in R statistical programming language

A quick review of STAT 3120

• Catalog description

– A SAS/SPSS-based course aimed at providing students with a foundation in statistical methods, including a review of descriptive statistics, confidence intervals, hypothesis testing, t-tests, basic regression, and chi-square tests.

Statistical Inference

• Statistical inference is the process of drawing conclusions from data that are subject to random variation.

• The conclusion of a statistical inference is a statistical proposition.

– Estimating the mean and variance of a distribution.

– Confidence interval estimation for the mean and variance of a distribution.

– Hypothesis tests on the mean and variance of a distribution.

Theory of point estimation

• There is at least one parameter $\theta$ whose value is to be approximated on the basis of a sample.

• The approximation is done using an appropriate statistic $\hat{\theta}$.

• This statistic $\hat{\theta}$ is called a point estimator for $\theta$.

CI for population mean, $\mu$

CI for population variance, $\sigma^2$

CI for difference between two normal population means

• $\sigma_1, \sigma_2$ known but unequal:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

• $\sigma_1, \sigma_2$ known and equal ($\sigma_1 = \sigma_2 = \sigma$):

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

• $\sigma_1, \sigma_2$ unknown but equal:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,df}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad df = n_1 + n_2 - 2, \qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

CI for difference between two normal population means

• $\sigma_1, \sigma_2$ unknown and unequal:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,df}\,\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},
\qquad df = \;?$$
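As a quick illustration, here is a minimal R sketch of the pooled-variance interval computed directly from the formulas above; the two samples are made-up values, not data from the course.

# Hypothetical samples from two normal populations (illustrative values only)
x1 <- c(5.1, 4.8, 6.2, 5.5, 5.9)
x2 <- c(4.2, 4.9, 3.8, 4.5, 5.0, 4.4)
n1 <- length(x1); n2 <- length(x2)

# Pooled variance estimate (sigma_1, sigma_2 unknown but assumed equal)
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# 95% CI for mu_1 - mu_2 with df = n1 + n2 - 2
moe <- qt(0.975, df = n1 + n2 - 2) * sqrt(sp2 * (1/n1 + 1/n2))
(mean(x1) - mean(x2)) + c(-1, 1) * moe

# t.test(x1, x2, var.equal = TRUE)$conf.int returns the same interval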

Hypothesis Testing

• In the estimation problem there is no preconceived notion concerning the actual value of the parameter $\theta$.

• In contrast, when testing a hypothesis on $\theta$, there is a preconceived notion concerning its value.

• There are two competing hypotheses:

– The hypothesis proposed by the experimenter, denoted $H_1$

– The negation of $H_1$, denoted $H_0$

Tests concerning the mean of one normal population

$$H_0: \mu = \mu_0 \qquad H_0: \mu = \mu_0 \qquad H_0: \mu = \mu_0$$
$$H_1: \mu > \mu_0 \qquad H_1: \mu < \mu_0 \qquad H_1: \mu \neq \mu_0$$

Test statistic ($\sigma$ known):

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

(or)

Test statistic ($\sigma$ unknown):

$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
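For concreteness, a minimal R sketch of the unknown-$\sigma$ case; the data values here are made up for illustration.

x <- c(5.3, 4.9, 6.1, 5.8, 5.2, 4.7, 5.5)   # hypothetical sample
mu0 <- 5                                     # hypothesized mean under H0

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # T = (Xbar - mu0) / (s / sqrt(n))
2 * pt(-abs(t_stat), df = length(x) - 1)                # two-sided p-value

t.test(x, mu = mu0)   # built-in equivalent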

Tests concerning the difference between two normal population means

$$H_0: \mu_1 = \mu_2 \qquad H_0: \mu_1 = \mu_2 \qquad H_0: \mu_1 = \mu_2$$
$$H_1: \mu_1 > \mu_2 \qquad H_1: \mu_1 < \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$

Independent samples of sizes $n_1$ and $n_2$.

Test statistic ($\sigma_1, \sigma_2$ known):

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

(or) Test statistic ($\sigma_1, \sigma_2$ unknown but equal):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

(or) Test statistic ($\sigma_1, \sigma_2$ unknown and unequal):

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
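In R, both two-sample versions are available through t.test(); a short sketch with made-up data:

x1 <- c(12.1, 11.4, 13.0, 12.6, 11.9)         # hypothetical sample 1
x2 <- c(10.8, 11.2, 10.5, 11.6, 10.9, 11.1)   # hypothetical sample 2

t.test(x1, x2, var.equal = TRUE)   # pooled-variance t-test (variances assumed equal)
t.test(x1, x2)                     # Welch t-test (default; variances not assumed equal)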

Tests concerning variance of one normal population

$$H_0: \sigma^2 = \sigma_0^2 \qquad H_0: \sigma^2 = \sigma_0^2 \qquad H_0: \sigma^2 = \sigma_0^2$$
$$H_1: \sigma^2 > \sigma_0^2 \qquad H_1: \sigma^2 < \sigma_0^2 \qquad H_1: \sigma^2 \neq \sigma_0^2$$

Test statistic:

$$\chi^2 = \frac{(n - 1)S^2}{\sigma_0^2}$$
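Base R has no ready-made function for this one-sample variance test, so a minimal sketch computes the statistic directly; the sample and the hypothesized variance below are made up.

x <- c(7.2, 5.9, 8.4, 6.5, 7.8, 6.1, 7.0, 8.8)   # hypothetical sample
sigma0_sq <- 4                                    # hypothesized variance under H0
n <- length(x)

chi_sq <- (n - 1) * var(x) / sigma0_sq            # (n - 1) S^2 / sigma_0^2
pchisq(chi_sq, df = n - 1, lower.tail = FALSE)    # p-value for H1: sigma^2 > sigma_0^2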

Tests concerning ratio of variances of two normal populations

$$H_0: \sigma_x^2 = \sigma_y^2 \qquad H_0: \sigma_x^2 = \sigma_y^2 \qquad H_0: \sigma_x^2 = \sigma_y^2$$
$$H_1: \sigma_x^2 > \sigma_y^2 \qquad H_1: \sigma_x^2 < \sigma_y^2 \qquad H_1: \sigma_x^2 \neq \sigma_y^2$$

Test statistic:

$$F_{n_x - 1,\; n_y - 1} = \frac{S_x^2}{S_y^2}$$
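In R, var.test() carries out this F-test; a short sketch with made-up samples:

x <- c(21.3, 19.8, 22.5, 20.1, 21.9, 20.6)   # hypothetical sample x
y <- c(18.4, 19.1, 18.8, 19.5, 18.2)         # hypothetical sample y

var(x) / var(y)   # F = S_x^2 / S_y^2 on (n_x - 1, n_y - 1) degrees of freedom
var.test(x, y)    # built-in version; also gives the p-value and a CI for the ratio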

[Flowchart: decision tree for choosing an analysis tool. The branches ask: How many dependent variables? What type of outcome (continuous, categorical, or both)? How many predictors? What type of predictors? If a categorical predictor, how many categories, and are the participants the same or different in each category? Does the data meet parametric assumptions? The leaves are the analysis tools: Independent t-test, Mann-Whitney test, Paired t-test, Wilcoxon Rank Sum, One-way ANOVA, Kruskal-Wallis test, Repeated Measures ANOVA, Friedman's ANOVA, Pearson correlation or regression, Spearman correlation or Kendall's tau, Independent factorial ANOVA or regression, Factorial Repeated Measures ANOVA, Factorial Mixed ANOVA, Multiple Regression, Multiple Regression/ANCOVA, Pearson chi-square or likelihood ratio, Logistic Regression/Discriminant, Loglinear Analysis, MANOVA, Factorial MANOVA, MANCOVA.]

Comparing several means

• It is often necessary to compare many populations for a quantitative variable.

• That is, we may want to compare the mean outcome over several populations to determine whether they have the same mean outcome and, if not, where differences exist.

• The standard method of analysis for these types of problems is the one-way Analysis of Variance, often abbreviated ANOVA.

Can we just use several pairwise t-tests?

• You might be tempted to use t-tests to make such comparisons. Why would this be difficult?

# groups    # pairwise tests
   3               3
   4               6
   5              10
   6              15
   7              21

and so on….
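The counts above are just "k choose 2", and each extra test inflates the overall chance of a false positive. A small R sketch, assuming each individual test is run at alpha = 0.05:

k <- 3:7
m <- choose(k, 2)          # number of pairwise comparisons for k groups
fwer <- 1 - (1 - 0.05)^m   # familywise error rate if each test uses alpha = 0.05
data.frame(groups = k, pairwise_tests = m, familywise_error = round(fwer, 3))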

One-way ANOVA Contd.

• The method of ANOVA allows for comparison of the means over more than two independent groups.

• In particular, it tests the following hypotheses for comparing $\mu$ over $k$ groups:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$
$$H_1: \text{at least two means are different}$$

Assumptions of ANOVA

• Populations have normal distributions

• Population standard deviations are equal

• Observations are independent, both within and between samples
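One way to screen these assumptions in R, assuming a data frame like the elk data used later in the talk (columns Age and Region):

fit <- aov(Age ~ Region, data = elk)

shapiro.test(residuals(fit))              # rough check of normality of the residuals
bartlett.test(Age ~ Region, data = elk)   # check of equal variances across groups
plot(fit, which = 1)                      # residuals vs fitted values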

One-way ANOVA Contd.

• A rejection of the null hypothesis tells us that there is at least one group with a differing mean (though there could be more than one group that is different).

• If we do not reject the null hypothesis, then we can only conclude that there is no significant difference among the groups.

One-way ANOVA procedure

• Total variation in a measured response is partitioned into components that can be attributed to recognizable sources of variation.

• For example, suppose we wish to investigate the sulfur content of 5 coal seams in a certain geographical region. Then we would test

$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$$
$$H_1: \mu_i \neq \mu_j \text{ for some } i \text{ and } j$$

ANOVA Table

Sources of variation          df    Sum of Squares (SS)    Mean Sum of Squares (MS)    F
Between groups (Treatment)
Within groups (Error)
Total

Computational Shortcuts

$$SS_{\text{Total}} = TSS = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - \frac{T^2}{N}$$

$$SS_{\text{Treatment}} = SSB = \sum_{i=1}^{k}\frac{T_i^2}{n_i} - \frac{T^2}{N}$$

$$SS_{\text{Error}} = SSW = TSS - SSB$$

where $T_i$ is the total for group $i$, $T$ is the grand total, and $N = n_1 + \cdots + n_k$.

ANOVA Table

Sources of variation    df       Sum of Squares (SS)    Mean Sum of Squares (MS)    F
Between groups          k - 1    SSB                    MSB                         MSB/MSW
Within groups           N - k    SSW                    MSW
Total                   N - 1    TSS

* See page 682 for a general format of a One-Way ANOVA Table

$$MSB = \frac{SSB}{k - 1} \qquad\qquad MSW = \frac{SSW}{N - k}$$

ANOVA Example

A biologist is doing research on elk in their natural Colorado habitat. Three regions are under study, each region having about the same amount of forage and natural cover.

To determine if there is a difference in elk life spans between the three regions, samples of 6, 5, and 6 mature elk from the three regions are tranquilized and have a tooth removed.

A laboratory examination of the teeth reveals the ages of the elk. Results for each sample are given in the table below.

ANOVA Example Contd.

Region    Ages
A         4, 10, 11, 9, 8, 6
B         7, 3, 8, 4, 8
C         5, 6, 4, 2, 4, 3

Are there differences in age (elk life spans) over the different regions?

If so, where are such differences occurring?

ANOVA Example Contd.

$$TSS = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - \frac{T^2}{N} = \left(4^2 + 10^2 + 11^2 + \cdots + 3^2\right) - \frac{102^2}{17} = 726 - 612 = 114$$

$$SSB = \sum_{i=1}^{k}\frac{T_i^2}{n_i} - \frac{T^2}{N} = \frac{48^2}{6} + \frac{30^2}{5} + \frac{24^2}{6} - \frac{102^2}{17} = 660 - 612 = 48$$

$$SSW = TSS - SSB = 114 - 48 = 66$$
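These hand computations can be checked in R with a few lines; the ages are taken from the data slide above.

A <- c(4, 10, 11, 9, 8, 6)
B <- c(7, 3, 8, 4, 8)
C <- c(5, 6, 4, 2, 4, 3)
y <- c(A, B, C); T <- sum(y); N <- length(y)   # grand total T = 102, N = 17

TSS <- sum(y^2) - T^2 / N                                # 114
SSB <- sum(A)^2/6 + sum(B)^2/5 + sum(C)^2/6 - T^2 / N    # 48
SSW <- TSS - SSB                                         # 66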

R Programming Language

• Free software for statistical computing and graphics: http://www.r-project.org/

• An implementation of the S language, which was developed at Bell Laboratories

• Considered a baby version of S/S+

• S+ sells for about $2000/year subscription

R code to run an ANOVA (elk data)

> elk <- read.csv("elk.csv", sep=",", header=T)

> boxplot(elk$Age ~ elk$Region, ylab = "Age", xlab = "Region", main = "Boxplot for Elk data")

> Elk.ANOVA <- aov(elk$Age ~ elk$Region)

> summary(Elk.ANOVA)

R output for ANOVA (elk data)

Source        Df  Sum Sq  Mean Sq  F value  Pr(>F)
elk$Region     2  48.000   24.000   5.0909  0.0218 *
Residuals     14  66.000    4.714
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next class: Post-hoc tests

• Bonferroni correction

• Tukey’s HSD test

• Fisher’s LSD

• Newman-Keuls test

• Scheffé method
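As a short preview, two of these are directly available in R, reusing the Elk.ANOVA fit from the earlier slides:

TukeyHSD(Elk.ANOVA)                                                     # Tukey's HSD confidence intervals
pairwise.t.test(elk$Age, elk$Region, p.adjust.method = "bonferroni")    # Bonferroni-adjusted pairwise t-tests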

Fixed versus random effects

• When we consider the effect of a factor, it can be either fixed or random. If we are interested in the particular levels of a factor, then it is fixed, e.g., gender, socio-economic class, fertilizer, drug. If we are not interested in the particular levels, but rather have selected the levels to make inference about the factor, then the factor is random.

• For example, what if there was an effect of hospital on a person’s recovery? A random sample of hospitals would allow us to study this relationship. Here we are interested in whether there is a relationship rather than describing an effect for each individual hospital. These types of models are called random effects models.
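A minimal sketch of how the hospital example could be fit as a random effects model, using the lme4 package (not part of this talk); recovery_data is a hypothetical data frame with columns recovery and hospital.

library(lme4)   # not covered in STAT 3120/3130; shown only as an illustration
fit <- lmer(recovery ~ 1 + (1 | hospital), data = recovery_data)
summary(fit)    # the variance component for hospital quantifies the hospital-to-hospital effect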