STAT 3130 Guest Speaker: Ashok Krishnamurthy, Ph.D. Department of Mathematical and Statistical...
STAT 3130 Guest Speaker:
Ashok Krishnamurthy, Ph.D.
Department of Mathematical and Statistical Sciences
24 January 2011
Correspondence: [email protected]
Outline
• A brief overview of STAT 3120
• One-way Analysis of Variance (ANOVA)
• ANOVA example
• Implementing ANOVA in R statistical programming language
A quick review of STAT 3120
• Catalog description
– A SAS/SPSS based course aimed at providing students with a foundation in statistical methods, including review of descriptive statistics, confidence intervals, hypothesis testing, t-tests, basic Regression and Chi-Square tests.
Statistical Inference
• Statistical inference is the process of drawing conclusions from data that are subject to random variation.
• The conclusion of a statistical inference is a statistical proposition.
– Estimating the mean and variance of a distribution.
– Confidence interval estimation of the mean and variance of a distribution.
– Hypothesis tests on the mean and variance of a distribution.
Theory of point estimation
• There is at least one parameter $\theta$ whose value is to be approximated on the basis of a sample.
• The approximation is done using an appropriate statistic.
• This statistic is called a point estimator for $\theta$.
CI for population mean, $\mu$
CI for population variance, $\sigma^2$
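As an illustration of the two intervals named above, the following R sketch computes a t-based interval for the mean and a chi-square-based interval for the variance of a single normal sample. The sample values (the six region A ages from the elk example later in these slides), the 95% level, and the names `x`, `ci_mean`, and `ci_var` are assumptions made here for illustration; the slide's own formulas did not survive extraction.

```r
# Sample assumed to come from a normal population (region A elk ages).
x <- c(4, 10, 11, 9, 8, 6)
n <- length(x)
alpha <- 0.05  # assumed: 95% confidence level

# CI for the mean (sigma unknown): xbar +/- t_{alpha/2, n-1} * s / sqrt(n)
ci_mean <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)

# CI for the variance:
# ((n-1) s^2 / chi2_{upper alpha/2}, (n-1) s^2 / chi2_{lower alpha/2})
ci_var <- (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)

print(ci_mean)
print(ci_var)
```

The interval for the mean should agree with the one reported by `t.test(x)`.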
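CI for difference between two normal population means
$\sigma_1, \sigma_2$ known but unequal ($\sigma_1^2 \neq \sigma_2^2$):
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$\sigma_1, \sigma_2$ known and equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$):
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$\sigma_1, \sigma_2$ unknown but equal ($\sigma_1^2 = \sigma_2^2$):
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,df}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where
$$df = n_1 + n_2 - 2, \qquad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$
CI for difference between two normal population means
$\sigma_1, \sigma_2$ unknown and unequal ($\sigma_1^2 \neq \sigma_2^2$):
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,df} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where $df = ?$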
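A minimal R sketch of the two unknown-variance cases, assuming two independent normal samples. The data (regions A and B of the elk example later in these slides) and the 95% level are illustrative choices. For the unequal-variance case, R's `t.test` with `var.equal = FALSE` uses the Welch (Satterthwaite) approximation for the degrees of freedom, which is one common answer to the open question about df.

```r
# Two independent samples, assumed normal (elk ages from regions A and B).
x1 <- c(4, 10, 11, 9, 8, 6)
x2 <- c(7, 3, 8, 4, 8)
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance and 95% CI (variances unknown but assumed equal)
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
ci_pooled <- (mean(x1) - mean(x2)) +
  c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)

# Built-in equivalents; var.equal = FALSE applies the Welch approximation
fit_pooled <- t.test(x1, x2, var.equal = TRUE)
fit_welch  <- t.test(x1, x2, var.equal = FALSE)

print(ci_pooled)
print(fit_welch$parameter)  # approximate df for the unequal-variance case
print(fit_welch$conf.int)
```

The Welch degrees of freedom are generally not an integer and never exceed n1 + n2 - 2.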
Hypothesis Testing
• In the estimation problem there is no preconceived notion concerning the actual value of the parameter $\theta$.
• In contrast, when testing a hypothesis on $\theta$, there is a preconceived notion concerning its value.
• There are two theories:
– The hypothesis proposed by the experimenter, denoted $H_1$
– The negation of $H_1$, denoted $H_0$
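Tests concerning the mean of one normal population
$$H_0: \mu = \mu_0 \qquad H_0: \mu = \mu_0 \qquad H_0: \mu = \mu_0$$
$$H_1: \mu > \mu_0 \qquad H_1: \mu < \mu_0 \qquad H_1: \mu \neq \mu_0$$
Test Statistic ($\sigma$ known)
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
(or)
Test Statistic ($\sigma$ unknown)
$$T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$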
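The T statistic can be computed directly and checked against R's built-in `t.test`. This is a sketch only: the sample (region A elk ages from the example later in these slides) and the null value `mu0 = 5` are hypothetical choices for illustration, and a right-tailed alternative is assumed.

```r
x   <- c(4, 10, 11, 9, 8, 6)  # illustrative sample, assumed normal
mu0 <- 5                      # hypothetical null value

# T = (xbar - mu0) / (s / sqrt(n)), referred to a t distribution on n - 1 df
t_stat  <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_right <- pt(t_stat, df = length(x) - 1, lower.tail = FALSE)  # H1: mu > mu0

# The same right-tailed test with the built-in function
fit <- t.test(x, mu = mu0, alternative = "greater")

print(c(T = t_stat, p = p_right))
print(fit)
```

For a left-tailed or two-sided alternative, `alternative = "less"` or `"two.sided"` would be used instead.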
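Tests concerning the difference between two normal population means
$$H_0: \mu_1 = \mu_2 \qquad H_0: \mu_1 = \mu_2 \qquad H_0: \mu_1 = \mu_2$$
$$H_1: \mu_1 > \mu_2 \qquad H_1: \mu_1 < \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
Independent samples of sizes $n_1$ and $n_2$.
Test Statistic (variances known)
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
(or)
Test Statistic (variances unknown but equal)
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
(or)
Test Statistic (variances unknown and unequal)
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$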
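Tests concerning variance of one normal population
$$H_0: \sigma^2 = \sigma_0^2 \qquad H_0: \sigma^2 = \sigma_0^2 \qquad H_0: \sigma^2 = \sigma_0^2$$
$$H_1: \sigma^2 > \sigma_0^2 \qquad H_1: \sigma^2 < \sigma_0^2 \qquad H_1: \sigma^2 \neq \sigma_0^2$$
Test Statistic
$$\chi^2 = \frac{(n - 1) S^2}{\sigma_0^2}$$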
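Base R has no dedicated function for this one-sample variance test, so a sketch computes the statistic by hand and refers it to a chi-square distribution with n - 1 degrees of freedom. The sample and the null value `sigma0 = 2` are hypothetical, chosen only for illustration.

```r
x      <- c(4, 10, 11, 9, 8, 6)  # illustrative sample, assumed normal
sigma0 <- 2                      # hypothetical null standard deviation
n      <- length(x)

# chi^2 = (n - 1) S^2 / sigma0^2
chi2_stat <- (n - 1) * var(x) / sigma0^2

# p-values for the three alternatives
p_greater <- pchisq(chi2_stat, df = n - 1, lower.tail = FALSE)  # H1: sigma^2 > sigma0^2
p_less    <- pchisq(chi2_stat, df = n - 1)                      # H1: sigma^2 < sigma0^2
p_two     <- 2 * min(p_greater, p_less)                         # H1: sigma^2 != sigma0^2

print(c(chi2 = chi2_stat, p_greater = p_greater, p_less = p_less, p_two = p_two))
```

Doubling the smaller tail is one common convention for the two-sided p-value; other conventions exist.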
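Tests concerning ratio of variances of two normal populations
$$H_0: \sigma_x^2 = \sigma_y^2 \qquad H_0: \sigma_x^2 = \sigma_y^2 \qquad H_0: \sigma_x^2 = \sigma_y^2$$
$$H_1: \sigma_x^2 > \sigma_y^2 \qquad H_1: \sigma_x^2 < \sigma_y^2 \qquad H_1: \sigma_x^2 \neq \sigma_y^2$$
Test Statistic
$$F = \frac{\dfrac{(n - 1) S_x^2}{\sigma_x^2} \Big/ (n - 1)}{\dfrac{(m - 1) S_y^2}{\sigma_y^2} \Big/ (m - 1)} = \frac{S_x^2 / \sigma_x^2}{S_y^2 / \sigma_y^2} = \frac{S_x^2}{S_y^2} \quad \text{under } H_0$$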
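A short R sketch of the variance-ratio test, computed by hand and with the built-in `var.test`. The two samples (elk ages from regions A and B in the example later in these slides) are illustrative, and both are assumed to come from normal populations.

```r
x <- c(4, 10, 11, 9, 8, 6)  # sample of size n
y <- c(7, 3, 8, 4, 8)       # sample of size m
n <- length(x)
m <- length(y)

# Under H0 the ratio of sample variances follows F with (n - 1, m - 1) df
f_stat <- var(x) / var(y)
p_two  <- 2 * min(pf(f_stat, n - 1, m - 1),
                  pf(f_stat, n - 1, m - 1, lower.tail = FALSE))

# Built-in two-sided test of equal variances
fit <- var.test(x, y)

print(c(F = f_stat, p = p_two))
print(fit)
```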
[Figure: decision chart for choosing an analysis tool. Column headings: How many dependent variables? | What type of outcome? | How many predictors? | What type of predictors? | If categorical predictor, how many categories? | If categorical predictor, same participants or different in each category? | Does data meet parametric assumptions? | Analysis tool. Analysis tools shown: Independent t-test, Mann-Whitney test, Paired t-test, Wilcoxon Rank Sum, One-way ANOVA, Kruskal-Wallis test, Repeated measures ANOVA, Friedman's ANOVA, Pearson correlation or regression, Spearman correlation or Kendall's tau, Independent factorial ANOVA or regression, Factorial repeated measures ANOVA, Factorial mixed ANOVA, Multiple regression, Multiple regression/ANCOVA, Pearson chi-square or likelihood ratio, Logistic regression, Logistic regression/discriminant, Loglinear analysis, MANOVA, Factorial MANOVA, MANCOVA.]
Comparing several means
• It is often necessary to compare many populations for a quantitative variable.
• That is, we may want to compare the mean outcome over several populations to determine whether they have the same mean outcome and if not, where differences exist.
• The standard method of analysis for these types of problems is the one-way Analysis of Variance, often abbreviated ANOVA.
Can we just use several pairwise t-tests?
• You might be tempted to use t-tests to make such comparisons. Why would this be difficult?
# groups    # pair-wise tests
3           3
4           6
5           10
6           15
7           21
and so on….
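The counts in the table are the number of ways to choose 2 groups out of k, and they can be reproduced in R. The sketch below also shows one reason many t-tests are problematic: the chance of at least one false rejection grows with the number of tests. The 5% level and the independence of the tests are simplifying assumptions made here, so the error rates are only a rough guide.

```r
k       <- 3:7            # number of groups
n_tests <- choose(k, 2)   # number of pairwise comparisons: k(k - 1) / 2

alpha <- 0.05             # assumed per-test significance level
# Chance of at least one false rejection if the tests were independent
familywise <- 1 - (1 - alpha)^n_tests

print(data.frame(groups = k,
                 pairwise_tests = n_tests,
                 familywise_error = round(familywise, 3)))
```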
One-way ANOVA Contd.
• The method of ANOVA allows for comparison of the mean over more than two independent groups.
• In particular, it tests the following hypotheses for comparing means over k groups:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$
$$H_1: \text{At least two means are different}$$
Assumptions of ANOVA
• Populations have normal distributions
• Population standard deviations are equal
• Observations are independent, both within and between samples
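The first two assumptions can be examined informally in R. The sketch below uses the elk data from the example later in these slides, typed in directly, and two commonly used checks: a Shapiro-Wilk test on the residuals and Bartlett's test for equal variances. These particular tests are a choice made here, not something the slides prescribe. Independence depends on how the data were collected and cannot be verified from the numbers alone.

```r
# Elk ages by region (17 observations: 6, 5, and 6 per region)
elk <- data.frame(
  Region = factor(rep(c("A", "B", "C"), times = c(6, 5, 6))),
  Age    = c(4, 10, 11, 9, 8, 6, 7, 3, 8, 4, 8, 5, 6, 4, 2, 4, 3)
)

fit <- aov(Age ~ Region, data = elk)

# Normality: Shapiro-Wilk test on the model residuals
normality <- shapiro.test(residuals(fit))

# Equal standard deviations: Bartlett's test across the regions
equal_var <- bartlett.test(Age ~ Region, data = elk)

print(normality)
print(equal_var)
```

With samples this small, both tests have little power, so plots such as boxplots or a normal Q-Q plot of the residuals are a useful complement.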
One-way ANOVA Contd.
• A rejection of the null hypothesis tells us that there is at least one group with a differing mean (though there could be more than one group that is different).
• If we do not reject the null hypothesis, then we can only conclude that there is no significant difference among the groups.
One-way ANOVA procedure
• Total variation in a measured response is partitioned into components that can be attributed to recognizable sources of variation.
• For example, suppose we wish to investigate the sulfur content of 5 coal seams in a certain geographical region. Then we would test,
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$$
$$H_1: \mu_i \neq \mu_j \text{ for some } i \text{ and } j$$
ANOVA Table
Sources of variation          df    Sum of Squares (SS)    Mean Sum of Squares (MS)    F
Between groups (Treatment)
Within groups (Error)                                                                  *
Total                                                      *                           *
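Computational Shortcuts
$$SS_{\text{Total}} = SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}^2 - \frac{T^2}{N}$$
$$SS_{\text{Treatment}} = SSB = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{T^2}{N}$$
$$SS_{\text{Error}} = SSW = SST - SSB$$
Here $T_i$ is the total of the $n_i$ observations in group $i$, $T$ is the grand total of all observations, and $N = n_1 + \cdots + n_k$. SST is also written TSS.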
ANOVA Table
Sources of variation    df       Sum of Squares (SS)    Mean Sum of Squares (MS)    F
Between groups          k - 1    SSB                    MSB = SSB / (k - 1)         MSB / MSW
Within groups           N - k    SSW                    MSW = SSW / (N - k)         *
Total                   N - 1    TSS                    *                           *
* See page 682 for a general format of a One-Way ANOVA Table
ANOVA Example
A biologist is doing research on elk in their natural Colorado habitat. Three regions are under study, each region having about the same amount of forage and natural cover.
To determine whether elk life spans differ among the three regions, samples of 6, 5, and 6 mature elk (one sample per region) are tranquilized and each animal has a tooth removed.
A laboratory examination of the teeth reveals the ages of the elk. Results for each sample are given in the table below.
ANOVA Example Contd.
Region    Age
A         4
A         10
A         11
A         9
A         8
A         6
B         7
B         3
B         8
B         4
B         8
C         5
C         6
C         4
C         2
C         4
C         3
Are there differences in age (elk life spans) over the different regions?
If so, where are such differences occurring?
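ANOVA Example Contd.
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}^2 - \frac{T^2}{N} = \left(4^2 + 10^2 + 11^2 + \cdots + 3^2\right) - \frac{102^2}{17} = 114$$
$$SSB = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{T^2}{N} = \left(\frac{48^2}{6} + \frac{30^2}{5} + \frac{24^2}{6}\right) - \frac{102^2}{17} = 48$$
$$SSW = SST - SSB = 114 - 48 = 66$$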
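The hand computation above can be reproduced step by step in R using the computational shortcuts. This is a sketch with the elk data typed in directly; the variable names (`T_all`, `T_i`, `n_i`, and so on) are chosen here to mirror the symbols in the formulas.

```r
# Elk ages by region (6, 5, and 6 observations)
elk <- data.frame(
  Region = factor(rep(c("A", "B", "C"), times = c(6, 5, 6))),
  Age    = c(4, 10, 11, 9, 8, 6, 7, 3, 8, 4, 8, 5, 6, 4, 2, 4, 3)
)

N     <- nrow(elk)                            # total sample size (17)
k     <- nlevels(elk$Region)                  # number of groups (3)
T_all <- sum(elk$Age)                         # grand total T (102)
T_i   <- tapply(elk$Age, elk$Region, sum)     # group totals (48, 30, 24)
n_i   <- tapply(elk$Age, elk$Region, length)  # group sizes

SST <- sum(elk$Age^2) - T_all^2 / N    # total sum of squares: 114
SSB <- sum(T_i^2 / n_i) - T_all^2 / N  # between-groups sum of squares: 48
SSW <- SST - SSB                       # within-groups sum of squares: 66

MSB <- SSB / (k - 1)
MSW <- SSW / (N - k)
F_stat  <- MSB / MSW
p_value <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)

print(c(SST = SST, SSB = SSB, SSW = SSW, MSB = MSB, MSW = MSW,
        F = F_stat, p = p_value))
```

The resulting F statistic and p-value can be compared with the `summary()` output of `aov` for the same data.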
R Programming Language
• Free software for statistical computing and graphics: http://www.r-project.org/
• An implementation of the S language, which was developed at Bell Laboratories
• Considered a baby version of S/S+
• An S+ subscription sells for about $2000/year
R code to run an ANOVA (elk data)
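> # Read the data; elk.csv is expected to have the columns Region and Age
> elk <- read.csv("elk.csv", sep = ",", header = TRUE)
> # Side-by-side boxplots of age by region
> boxplot(elk$Age ~ elk$Region, ylab = "Age", xlab = "Region", main = "Boxplot for Elk data")
> # Fit the one-way ANOVA and print the ANOVA table
> Elk.ANOVA <- aov(elk$Age ~ elk$Region)
> summary(Elk.ANOVA)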
R output for ANOVA (elk data)
Source       Df  Sum Sq  Mean Sq  F value  Pr(>F)
elk$Region    2  48.000   24.000   5.0909  0.0218 *
Residuals    14  66.000    4.714
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next class: Post-hoc tests
• Bonferroni correction
• Tukey’s HSD test
• Fisher’s LSD
• Newman-Keuls test
• Scheffe method
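Two of the listed procedures are available in base R and can be previewed on the elk data. The sketch below runs Tukey's HSD with `TukeyHSD` and Bonferroni-adjusted pairwise t-tests with `pairwise.t.test`; the data are typed in directly, and the choice of these two methods is only for illustration.

```r
# Elk ages by region (6, 5, and 6 observations)
elk <- data.frame(
  Region = factor(rep(c("A", "B", "C"), times = c(6, 5, 6))),
  Age    = c(4, 10, 11, 9, 8, 6, 7, 3, 8, 4, 8, 5, 6, 4, 2, 4, 3)
)

fit <- aov(Age ~ Region, data = elk)

# Tukey's HSD: simultaneous confidence intervals for all pairwise differences
tukey <- TukeyHSD(fit, conf.level = 0.95)

# Pairwise t-tests (pooled SD) with Bonferroni-adjusted p-values
bonf <- pairwise.t.test(elk$Age, elk$Region, p.adjust.method = "bonferroni")

print(tukey)
print(bonf)
```

Both outputs list one comparison per pair of regions (B-A, C-A, C-B), which addresses the question of where the differences occur.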
Fixed versus random effects
• When we consider the effect of a factor, it can be either fixed or random. If we are interested in the particular levels of a factor, then it is fixed, e.g., gender, socio-economic class, fertilizer, drug. If we are not interested in the particular levels, but rather have selected the levels to make inference about the factor, then the factor is random.
• For example, what if there was an effect of hospital on a person’s recovery? A random sample of hospitals would allow us to study this relationship. Here we are interested in whether there is a relationship rather than describing an effect for each individual hospital. These types of models are called random effects models.