Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini, PhD
Department of Environmental Agronomy and Crop Production, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Session 2
Lecture: Introduction to statistical hypothesis testing
Null and alternative hypothesis. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Model simplification
Inference
A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question "how well do the findings fit the possibility that chance factors alone might be responsible?".
[Diagram: inference links Population, Sample and Statistical Model through sampling, estimation (uncertainty!!!) and testing]
Statistical testing in five steps:
1. Construct a null hypothesis (H0)
2. Choose a statistical analysis (assumptions!!!)
3. Collect the data (sampling)
4. Calculate P-value and test statistic
5. Reject H0 if P is small; fail to reject (retain) H0 if P is large
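The five steps can be sketched in R as follows. This is only an illustration: the data are simulated (hypothetical values), the two-sample t test is one possible choice of analysis, and the 0.05 threshold is an assumption.

```r
# Step 1: H0 = the two group means do not differ
# Step 2: analysis = two-sample t test (assumes independence and normality)
# Step 3: collect the data (here simulated, hypothetical values)
set.seed(42)
control   <- rnorm(20, mean = 10, sd = 2)
treatment <- rnorm(20, mean = 12, sd = 2)

# Step 4: calculate the test statistic and the P-value
res <- t.test(control, treatment)
res$statistic
res$p.value

# Step 5: decision against an assumed threshold of 0.05
alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```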
Key concepts: Session 1
Concept of replication vs. pseudoreplication
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)
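Key quantities
$\text{mean} = \frac{\sum y_i}{n}$
$\text{deviance} = SS = \sum (y_i - \text{mean})^2$
$\text{var} = \frac{\sum (y_i - \text{mean})^2}{n - 1}$
Remember the order!!!
[Figure: plot of y against x (n = 6) showing the mean and the residual of an observation y_i]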
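A minimal sketch of the key quantities, computed step by step in R (mean, then residuals, then sum of squares, then variance). The six values of y are hypothetical, chosen only to give n = 6.

```r
y <- c(2, 3, 5, 4, 6, 7)     # hypothetical sample, n = 6
n <- length(y)

y_mean   <- sum(y) / n        # mean
resid    <- y - y_mean        # residuals: deviation of each y_i from the mean
SS       <- sum(resid^2)      # deviance (sum of squares)
variance <- SS / (n - 1)      # variance, with n - 1 degrees of freedom

# The hand calculation agrees with the built-in function
all.equal(variance, var(y))
```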
Hypothesis testing
• 1 – Hypothesis formulation (Null hypothesis H0 vs. alternative hypothesis H1)
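• 2 – Compute the P-value: the probability of obtaining data at least as extreme as those observed if H0 were true (it is not the probability that H0 is false);
• 3 – If this probability is lower than a defined threshold (e.g. 0.05) we can reject the null hypothesis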
Hypothesis testing: Types of error
Wrong conclusions: Type 1 and Type 2 errors
Actual situation: Effect
- Reject H0: Correct, effect detected (POWER)
- Retain H0: Type 2 error (β), effect not detected
Actual situation: No effect
- Reject H0: Type 1 error (α), effect detected, none exists (P-value)
- Retain H0: Correct, no effect detected, none exists
As power increases, the chance of a Type II error decreases.
Statistical power depends on:
• the statistical significance criterion used in the test
• the size of the difference or the strength of the similarity (effect size) in the population
• the sensitivity of the data.
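The dependence of power on these factors can be explored with power.t.test() from the stats package. The sample sizes, effect sizes and thresholds below are hypothetical values chosen only to show the direction of each effect.

```r
# Baseline: 10 observations per group, difference of 1 SD, alpha = 0.05
p_base   <- power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)$power

# More data (higher sensitivity) -> more power
p_large  <- power.t.test(n = 40, delta = 1, sd = 1, sig.level = 0.05)$power

# Stricter significance criterion -> less power
p_strict <- power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.01)$power

# Larger effect size -> more power
p_effect <- power.t.test(n = 10, delta = 2, sd = 1, sig.level = 0.05)$power

# Sample size per group needed to reach a power of 0.8
n_needed <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)$n

c(base = p_base, large_n = p_large, strict = p_strict, big_effect = p_effect)
n_needed
```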
Statistical analyses
Mean comparisons for 2 populations
Test the difference between the means of two samples drawn from the two populations
Correlation
In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence.
Analysis of count or proportion data
Whole (integer) numbers, which are not continuous and have different distributional properties, or proportions
Mean comparisons for 2 samples
Assumptions
• Independence of cases (work with true replications!!!) - this is a requirement of the design.
• Normality - the distributions in each of the groups are normal
• Homogeneity of variances - the variance of data in groups should be the same (use Fisher test or Fligner's test for homogeneity of variances).
• These together form the common assumption that the errors are independently, identically, and normally distributed
H0: means do not differ
H1: means differ
The t test
[Figure: histogram of mass (x-axis: mass; y-axis: Frequency)]
Normality
Before we can carry out a test assuming normality of the data we need to test our distribution (not always before!!!)
Graphical analysis
Test for normality: Shapiro-Wilk Normality Test shapiro.test(); Skew + kurtosis (t test)
In many cases we must check this assumption after having fitted the model (e.g. regression or multifactorial ANOVA)
[Figure: normal Q-Q plot (x-axis: norm quantiles; y-axis: observed quantiles)]
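hist(y, freq = FALSE)   # density scale, so that the density curve overlays correctly
lines(density(y))
library(car)
qqPlot(y)               # called qq.plot() in older versions of car; or qqnorm(y) in base R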
RESIDUALS MUST BE NORMAL
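A minimal sketch of checking normality on the residuals after fitting a model, using base R only. The regression data are simulated (hypothetical), and lm() stands in for whichever model is actually fitted.

```r
set.seed(1)
x <- runif(50, 0, 10)                 # hypothetical explanatory variable
y <- 2 + 0.5 * x + rnorm(50)          # hypothetical response with normal errors

fit <- lm(y ~ x)                      # fit the model first
res <- residuals(fit)                 # then check the residuals, not the raw y

hist(res, freq = FALSE)               # graphical check 1: histogram
lines(density(res))
qqnorm(res)                           # graphical check 2: normal Q-Q plot
qqline(res)

sw <- shapiro.test(res)               # formal check: Shapiro-Wilk test
sw$p.value                            # a small P-value suggests non-normal residuals
```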
Normality: Histogram and Q-Q Plot
[Figure: normal Q-Q plots (x-axis: norm quantiles) and histograms (y-axis: Frequency) of fishes$mass and log(fishes$mass)]
Normality: Histogram
library(animation)
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq = quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table
barplot(freq, space = 0)
[Figure: bar plot of the quincunx frequency table]
Normal distribution must be symmetrical around the mean
Normality: Quantile-Quantile Plot
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets
Normality
In case of non-normality: 2 possible approaches
1. Change the distribution (use GLMs)
E.g. Poisson (count data)
E.g. Binomial (proportion)
2. Data transformation
Logarithmic (skewed data)
Square-root
Arcsin (percentage)
Probit (proportion)
Box-Cox transformation
[Figure: histograms of mass and of fishes$logmass (y-axis: Frequency)]
Advanced statistics
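A small sketch of the data-transformation approach for right-skewed data. The masses are simulated from a log-normal distribution (a hypothetical stand-in for the fishes data), and the skew() helper is defined here only for illustration.

```r
set.seed(2)
mass    <- rlnorm(200, meanlog = 1, sdlog = 0.6)  # hypothetical right-skewed masses
logmass <- log(mass)                              # logarithmic transformation

# Simple moment-based skewness (illustrative helper, not a package function)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(mass)       # clearly positive: long right tail
skew(logmass)    # close to zero after the transformation

hist(mass)
hist(logmass)
```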
Homogeneity of variance: two samples
Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler. It is called Fisher’s F.
To compare two variances, all you do is divide the larger variance by the smaller variance.
The test can be carried out with the var.test() function.
F <- var(A)/var(B)       # F calculated (A is the sample with the larger variance)
qf(0.975, nA-1, nB-1)    # F critical
If the calculated F is larger than the critical value, we reject the null hypothesis
E.g. Students from TESAF vs. Students from DAAPV
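A sketch of the F test done by hand and with var.test(). The two samples are simulated (hypothetical values, not the TESAF and DAAPV students), and the larger variance is placed in the numerator together with its degrees of freedom.

```r
set.seed(3)
A <- rnorm(15, mean = 10, sd = 1)   # hypothetical sample A
B <- rnorm(12, mean = 10, sd = 3)   # hypothetical sample B

vA <- var(A); vB <- var(B)
if (vA >= vB) {
  F_calc <- vA / vB; df1 <- length(A) - 1; df2 <- length(B) - 1
} else {
  F_calc <- vB / vA; df1 <- length(B) - 1; df2 <- length(A) - 1
}
F_crit <- qf(0.975, df1, df2)       # two-tailed test at alpha = 0.05

reject_by_hand <- F_calc > F_crit
vt <- var.test(A, B)                # same test with the built-in function
c(F_calc = F_calc, F_crit = F_crit, p_value = vt$p.value)
```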
Homogeneity of variance : > two samples
It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test and the Fligner–Killeen test.
bartlett.test(response, factor)
fligner.test(response, factor)
There are differences between the tests: Fisher and Bartlett are very sensitive to outliers, whereas Fligner–Killeen is not
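Both tests can be tried on a small simulated example. The three groups below are hypothetical; the third is given a larger spread on purpose. Note that R function names are case-sensitive, so the calls are bartlett.test() and fligner.test().

```r
set.seed(4)
response <- c(rnorm(20, mean = 10, sd = 1),
              rnorm(20, mean = 10, sd = 1),
              rnorm(20, mean = 10, sd = 3))      # third group is more variable
group <- factor(rep(c("a", "b", "c"), each = 20))

tapply(response, group, var)                     # variances per group

bt <- bartlett.test(response, group)             # sensitive to outliers
ft <- fligner.test(response, group)              # rank-based, more robust
c(bartlett_p = bt$p.value, fligner_p = ft$p.value)
```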
Mean comparison
- Some assumptions not met? Non-parametric wilcox.test() - The Wilcoxon rank-sum test is a non-parametric alternative to Student's t-test for two independent samples (the Wilcoxon signed-rank test is the equivalent for paired samples).
- All assumptions met? Parametric t.test()
- t test with independent or paired samples
In many cases, a researcher is interested in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters.
H0: the two means are the same
H1: the two means differ
Mean comparison: 2 independent samples
Two independent samples
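$\text{mean}_a = \frac{\sum y_i}{n_a}$
$\text{mean}_b = \frac{\sum y_i}{n_b}$
$SE_{diff} = \sqrt{\frac{\text{var}_a}{n_a} + \frac{\text{var}_b}{n_b}}$
$t = \frac{\text{mean}_a - \text{mean}_b}{SE_{diff}}$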
The two samples are statistically independent
Test can be carried out with the t.test() function
Students on the left Students on the right
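The t statistic for two independent samples can be computed by hand and compared with t.test(). The two samples are simulated (hypothetical measurements for students on the left and on the right); t.test() uses the Welch version by default, which has the same standard error as the formula for the difference between two means.

```r
set.seed(5)
a <- rnorm(12, mean = 170, sd = 8)   # hypothetical: students on the left
b <- rnorm(15, mean = 175, sd = 8)   # hypothetical: students on the right

se_diff <- sqrt(var(a) / length(a) + var(b) / length(b))
t_hand  <- (mean(a) - mean(b)) / se_diff

res <- t.test(a, b)                  # independent samples, Welch by default
all.equal(t_hand, unname(res$statistic))
res$p.value
```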
Mean comparison: t test for paired samples
Paired sampling in time or in space
Time 1 a: 1, 2, 3, 2, 3, 2 ,2
Time 2 b: 1, 2, 1, 1, 5, 1, 2
$t = \frac{\sum (a_i - b_i) / n}{SD_{diff} / \sqrt{n}}$
If we have information about dependence, we have to use this!!!
Test can be carried out with the t.test() function (argument paired = TRUE)
E.g. Test your performance before or after the course. I measure twice on the same student
We can deal with dependence
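A short sketch of the paired t test using the Time 1 and Time 2 values listed above. The statistic is computed from the pairwise differences and then compared with t.test(..., paired = TRUE).

```r
a <- c(1, 2, 3, 2, 3, 2, 2)   # Time 1
b <- c(1, 2, 1, 1, 5, 1, 2)   # Time 2

d <- a - b                    # pairwise differences
n <- length(d)
t_hand <- mean(d) / (sd(d) / sqrt(n))

res <- t.test(a, b, paired = TRUE)
all.equal(t_hand, unname(res$statistic))
res$p.value
```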
Mean comparison: Wilcoxon
Rank procedure
A   B
3   5
4   5
4   6
3   7
2   4
3   4
1   3
3   5
5   6
2   5

      ozone   label   ranks
 1    1       A       1
 2    2       A       2.5
 3    2       A       2.5
 4    3       A       6
 5    3       A       6
 6    3       A       6
 7    3       A       6
 8    3       B       6
 9    4       A       10.5
10    4       A       10.5
11    4       B       10.5
12    4       B       10.5
13    5       A       15
14    5       B       15
15    5       B       15
16    5       B       15
17    5       B       15
18    6       B       18.5
19    6       B       18.5
20    7       B       20
- NB Tied ranks correction
Test can be carried out with the wilcox.test() function
$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$
n1 and n2 are the numbers of observations in samples 1 and 2
R1 is the sum of the ranks in sample 1
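The rank procedure can be reproduced in R with the A and B ozone values from the example above. The statistic W reported by wilcox.test() is defined as R1 - n1(n1+1)/2, so it is the complement of the U given by the formula here (U + W = n1*n2); exact = FALSE is set because the data contain ties.

```r
A <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
B <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

ozone <- c(A, B)
label <- rep(c("A", "B"), each = 10)
ranks <- rank(ozone)                      # tied values get the average rank

R1 <- sum(ranks[label == "A"])            # sum of the ranks in sample 1
n1 <- length(A); n2 <- length(B)
U  <- n1 * n2 + n1 * (n1 + 1) / 2 - R1    # U as defined in the formula above

res <- wilcox.test(A, B, exact = FALSE)   # normal approximation because of ties
W   <- unname(res$statistic)
c(R1 = R1, U = U, W = W)
```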
Correlation
Correlation, (often measured as a correlation coefficient), indicates the strength and direction of a linear relationship between two random variables
Three alternative approaches
1. Parametric - cor()
2. Nonparametric - cor()
3. Bootstrapping - replicate(), boot()
Sampling unit   Plant species richness   Bird species richness
1               x1                       l1
2               x2                       l2
3               x3                       l3
4               x4                       l4
…               …                        …
458             x458                     l458
Correlation: causal relationship?
Which is the response variable in a correlation analysis?
NONE
Correlation
A correlation of +1 means that there is a perfect positive LINEAR relationship between variables. A correlation of -1 means that there is a perfect negative LINEAR relationship between variables. A correlation of 0 means there is no LINEAR relationship between the two variables.
Plot the two variables in a Cartesian space
Correlation
Same correlation coefficient!
r= 0.816
Assumptions
- Two random variables from a random population
- cor() detects ONLY linear relationships
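The value r = 0.816 matches Anscombe's quartet, which ships with R as the anscombe data set. Assuming this is what the figure showed, the sketch below confirms that four very differently shaped data sets share the same correlation coefficient, which is why plotting comes first.

```r
data(anscombe)   # built-in data set with columns x1..x4 and y1..y4

# Pearson correlation for each of the four x/y pairs
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r, 3)

# Plot the four data sets to see how different they are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```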
Pearson product-moment correlation coefficient
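Correlation coefficient:
$cor = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$ (x and y expressed as deviations from their means)
Parametric correlation: when is it significant?
Hypothesis testing using the t distribution:
H0: cor = 0
H1: cor ≠ 0
$SE_{cor} = \sqrt{\frac{1 - cor^2}{n - 2}}$
$t = \frac{cor}{SE_{cor}}$
t critical value for d.f. = n-2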
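The significance test for the Pearson coefficient can be worked through by hand and compared with cor.test(). The data are simulated (hypothetical), and x and y are centred on their means before the sums are taken.

```r
set.seed(6)
x <- rnorm(30)                      # hypothetical data
y <- 0.5 * x + rnorm(30)
n <- length(x)

dx <- x - mean(x)                   # deviations from the means
dy <- y - mean(y)
r  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

se_r   <- sqrt((1 - r^2) / (n - 2)) # standard error of the coefficient
t_val  <- r / se_r
t_crit <- qt(0.975, df = n - 2)     # two-tailed critical value, alpha = 0.05

ct <- cor.test(x, y)                # built-in test
c(r = r, t = t_val, t_crit = t_crit, p = ct$p.value)
```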
Rank procedures
Nonparametric correlation
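Spearman correlation index
$cor_{spearman} = \frac{\sum rank_x \, rank_y}{\sqrt{\sum rank_x^2 \sum rank_y^2}}$ (ranks expressed as deviations from their mean)
The Kendall tau rank correlation coefficient
$cor_{kendall} = \frac{4P}{n(n-1)} - 1$
P is the number of concordant pairs
n is the number of paired observations
Distribution-free but less power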
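Both rank coefficients are available through the method argument of cor(). The sketch below also recomputes them by hand on simulated, tie-free data (hypothetical values); the Kendall formula in this form assumes there are no ties.

```r
set.seed(7)
x <- rnorm(15)                       # hypothetical continuous data (no ties)
y <- x + rnorm(15)
n <- length(x)

# Spearman: Pearson correlation computed on the ranks
rs_hand <- cor(rank(x), rank(y))

# Kendall: count the concordant pairs P, then apply 4P / (n(n-1)) - 1
P <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    if ((x[i] - x[j]) * (y[i] - y[j]) > 0) P <- P + 1
  }
}
tau_hand <- 4 * P / (n * (n - 1)) - 1

c(spearman = cor(x, y, method = "spearman"),
  kendall  = cor(x, y, method = "kendall"))
```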
NB Don’t use grouped data to compute overall correlation!!!
Scale-dependent correlation
7 sites
Issues related to correlation
1. Temporal autocorrelation: values in close years are more similar (dependence of the data)
2. Spatial autocorrelation: values in close sites are more similar (dependence of the data)
Moran's I or Geary’s C: measures of global spatial autocorrelation
[Figure: example patterns with Moran's I = 0 and Moran's I = 1]
Three issues related to correlation
2. Temporal autocorrelation
Values in close years are more similar
Dependence of the data
Time series are likely to show temporal patterns in the data
E.g. Ring width series
Autoregressive models (not covered!)
Three issues related to correlation
3. Spatial autocorrelation
Values in close sites are more similar
Dependence of the data
Moran's I or Geary’s C (univariate response): measures of global spatial autocorrelation
Hint: If you find spatial autocorrelation in your residuals, you should start worrying
ISSUE: can we explain the spatial autocorrelation with our models?
Raw response
Residuals after model fitting
> a <- c(1:5)
> a
[1] 1 2 3 4 5
> replicate(10, sample(a, replace=TRUE))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    3    2    1    4    2    1    2    1     3
[2,]    1    5    2    3    5    3    1    1    3     2
[3,]    4    4    4    5    4    4    5    1    1     5
[4,]    4    1    1    3    3    2    3    1    5     2
[5,]    5    5    1    3    5    2    4    1    5     4
Estimate correlation with bootstrap
BOOTSTRAP
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of SEs and CIs of a population parameter
Sampling with replacement
1 original sample
10 bootstrapped samples
Estimate correlation with bootstrap
Why bootstrap?
It doesn’t depend on the normal distribution assumption
It allows the computation of unbiased SE and CIs
[Diagram: Sample -> N samples with replacement (Bootstrap) -> Statistic distribution -> Quantiles]
Estimate correlation with bootstrap
CIs are asymmetric because our distribution reflects the structure of the data and not a defined probability distribution
If we repeat the sampling n times, we expect about 0.95*n of the estimates to fall within the 95% CIs
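A minimal sketch of a bootstrapped correlation using replicate() and sample(). The data are simulated (hypothetical), 2000 resamples is an arbitrary choice, and the interval is the simple percentile interval taken from the quantiles of the bootstrap distribution.

```r
set.seed(8)
n <- 50
x <- rnorm(n)                          # hypothetical data
y <- 0.6 * x + rnorm(n)
r_obs <- cor(x, y)                     # correlation in the original sample

# Resample the sampling units (rows) with replacement, keeping x-y pairs together
boot_r <- replicate(2000, {
  idx <- sample(n, replace = TRUE)
  cor(x[idx], y[idx])
})

se_boot <- sd(boot_r)                        # bootstrap standard error
ci <- quantile(boot_r, c(0.025, 0.975))      # percentile 95% CI (may be asymmetric)

r_obs
se_boot
ci
hist(boot_r)                                 # distribution of the statistic
```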
Frequency data
Properties of frequency data:
- Count data
- Proportion data
Proportion data: where we know the number doing a particular thing, but also the number not doing that thing (e.g. ‘mortality’ of the students who attend the first lesson, but not the second)
Count data: where we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. number of students coming to the first lesson)
Straightforward linear methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons:
• The linear model might lead to the prediction of negative counts.
• The variance of the response variable is likely to increase with the mean.
• The errors will not be normally distributed.
• Many zeros are difficult to handle in transformations.
Count data
- Classical tests with contingency tables:
  - Pearson’s chi-squared (χ2)
  - G test
  - Fisher’s exact test
- Generalized linear models with Poisson distribution and log-link function (extremely powerful and flexible!!!)
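A short sketch of the GLM route for count data, on simulated counts (hypothetical values). It fits a Poisson model with a log link using glm() and contrasts it with an ordinary linear model.

```r
set.seed(9)
x      <- runif(200, 0, 2)                            # hypothetical predictor
counts <- rpois(200, lambda = exp(0.5 + 0.8 * x))     # counts: variance grows with the mean

fit_lm  <- lm(counts ~ x)                             # linear model, for comparison only
fit_glm <- glm(counts ~ x, family = poisson(link = "log"))

coef(fit_glm)                   # estimates on the log scale
range(fitted(fit_glm))          # fitted counts are always positive under the log link
summary(fit_glm)$coefficients
```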
Count data: contingency tables
               Group 1   Group 2   Row total
Trait 1        a         b         a+b
Trait 2        c         d         c+d
Column total   a+c       b+d       a+b+c+d
We can assess the significance of the differences between observed and expected frequencies in a variety of ways:
H0: frequencies found in rows are independent from frequencies in columns
- Pearson’s chi-squared (χ2)
We need a model to define the expected frequencies (E) (many possibilities) – E.g. perfect independence
Count data: contingency tables
                    Oak   Beech   Row total (Ri)
With ants           22    30      52
Without ants        31    18      49
Column total (Ci)   53    48      101 (G)

Pearson’s chi-squared (χ2):
$E_i = \frac{R_i \times C_i}{G}$
$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
$df = (r - 1) \times (c - 1)$
Critical value from the χ2 distribution

- G test
1. We need a model to define the expected frequencies (E) (many possibilities) – E.g. perfect independence
Count data: contingency tables
$E_i = \frac{R_i \times C_i}{G}$ (here G is the grand total)
$G = 2 \sum O_i \ln \frac{O_i}{E_i}$
If expected values are less than 4 or 5:
- Fisher’s exact test fisher.test()
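The three classical tests can be run on the oak and beech table above. The sketch computes the expected frequencies, the chi-squared statistic and the G statistic by hand, then calls chisq.test() (without continuity correction, to match the hand calculation) and fisher.test().

```r
tab <- matrix(c(22, 31, 30, 18), nrow = 2,
              dimnames = list(ants = c("With ants", "Without ants"),
                              tree = c("Oak", "Beech")))

# Expected frequencies under independence: (row total x column total) / grand total
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)

chi2 <- sum((tab - E)^2 / E)                  # Pearson's chi-squared
df   <- (nrow(tab) - 1) * (ncol(tab) - 1)
crit <- qchisq(0.95, df)                      # critical value at alpha = 0.05

G_stat <- 2 * sum(tab * log(tab / E))         # G test statistic
p_G    <- pchisq(G_stat, df, lower.tail = FALSE)

ct <- chisq.test(tab, correct = FALSE)        # built-in chi-squared test
ft <- fisher.test(tab)                        # Fisher's exact test

c(chi2 = chi2, crit = crit, G = G_stat, p_chi2 = ct$p.value,
  p_G = p_G, p_fisher = ft$p.value)
```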
Proportion data
Proportion data have three important properties that affect the way the data should be analyzed:
• the data are strictly bounded (0-1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.
- Classical test with probit or arcsin transformation
- Generalized linear models with binomial distribution and logit-link function (extremely powerful and flexible!!!)
Proportion data: traditional approach
Transform the data!
Arcsine transformation
The arcsine transformation takes care of the error distribution
$p' = \arcsin \sqrt{p}$, where p are percentages (0-100%), rescaled to 0-1 before taking the square root
Probit transformation
The probit transformation takes care of the non-linearity
p are proportions (0-1)
An important class of problems involves data on proportions such as:
• studies on percentage mortality (LD50),
• infection rates of diseases,
• proportion responding to clinical treatment (bioassay),
• sex ratios, or in general
• data on proportional response to an experimental treatment
Proportion data: modern analysis
2 approaches
1. Transform both the response and the explanatory variables (often needed)
or
2. Use Generalized Linear Models (GLM) with different error distributions
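The two approaches can be compared on a small simulated bioassay (hypothetical doses and mortality counts). The traditional route applies the arcsine transformation to the proportions; the modern route fits a binomial GLM with a logit link to the counts of dead and alive individuals.

```r
set.seed(10)
dose    <- rep(1:5, each = 4)                       # hypothetical doses
n_total <- rep(20, length(dose))                    # individuals per batch
dead    <- rbinom(length(dose), size = n_total,
                  prob = plogis(-3 + 1 * dose))     # simulated mortality

p <- dead / n_total                                 # proportions (0-1)

# Approach 1: arcsine transformation followed by a linear model
p_arcsin <- asin(sqrt(p))
fit_trad <- lm(p_arcsin ~ dose)

# Approach 2: GLM with binomial errors and logit link
fit_glm <- glm(cbind(dead, n_total - dead) ~ dose,
               family = binomial(link = "logit"))

coef(fit_trad)
coef(fit_glm)                                       # estimates on the logit scale
```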
Statistical modelling
MODEL
Generally speaking, a statistical model is a function of your explanatory variables to explain the variation in your response variable (y)
The best model is the model that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (many ways to reach this!)
The object is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data
$\text{deviance} = SS = \sum (y_i - \text{mean})^2$
E.g. $Y = a + b x_1 + c x_2 + d x_3$
Y= response variable (performance of the students)
xi= explanatory variables (ability of the teacher, background, age)
Statistical modelling
Getting started with complex statistical modeling
It is essential that you can answer the following questions:
• Which of your variables is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous or categorical, or a mixture of both?
• What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category?
1. Multicollinearity
Correlation between predictors in non-orthogonal multiple linear models
Confounding effects difficult to separate
Variables are not independent
This makes an important difference to our statistical modelling because, in orthogonal designs, the variation that is attributed to a given factor is constant, and does not depend upon the order in which factors are removed from the model.
In contrast, with non-orthogonal data, we find that the variation attributable to a given factor does depend upon the order in which factors are removed from the model
Statistical modelling: multicollinearity
The order of variable selection makes a huge difference (please wait for session 4!!!)
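The order dependence can be seen with two correlated predictors. In this sketch the data are simulated (hypothetical), and the sequential sums of squares from anova() are used to show that the variation attributed to x1 changes with its position in the model formula.

```r
set.seed(11)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)    # x2 is correlated with x1 (non-orthogonal)
y  <- 1 + x1 + x2 + rnorm(n)

cor(x1, x2)

# Sum of squares attributed to x1 when it enters first vs. second
ss_x1_first  <- anova(lm(y ~ x1 + x2))["x1", "Sum Sq"]
ss_x1_second <- anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]

c(first = ss_x1_first, second = ss_x1_second)
```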
Statistical modelling
Getting started with complex statistical modeling
The explanatory variables
(a) All explanatory variables continuous - Regression
(b) All explanatory variables categorical - Analysis of variance (ANOVA)
(c) Explanatory variables both continuous and categorical - Analysis of covariance (ANCOVA)
The response variable
(a) Continuous - Normal regression, ANOVA or ANCOVA
(b) Proportion - Logistic regression, GLM logit-linear models
(c) Count - GLM Log-linear models
(d) Binary - GLM binary logistic analysis
(e) Time at death - Survival analysis
You want the model to be minimal (parsimony), and adequate (must describe a significant fraction of the variation in the data)
It is very important to understand that there is not just one model.
Model building: estimate of parameters (slopes and levels of factors)
• given the data,
• and given our choice of model,
• what values of the parameters of that model make the observed data most likely?
Occam’s Razor
Statistical modelling
Each analysis estimates a MODEL
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.
The process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model.
Occam’s Razor
MODEL SIMPLIFICATION
Statistical modelling
Statistical modelling: model simplification
Parsimony requires that the model should be as simple as possible. This means that the model should not contain any redundant parameters or factor levels.
• remove non-significant interaction terms;
• remove non-significant quadratic or other non-linear terms;
• remove non-significant explanatory variables;
• group together factor levels that do not differ from one another;
• in ANCOVA, set non-significant slopes of continuous explanatory variables to zero.
Model simplification
Statistical modelling: model simplification
Step 1
Procedure: Fit the maximal model
Interpretation: Fit all the factors, interactions and covariates of interest. Note the residual deviance. If you are using Poisson or binomial errors, check for overdispersion and rescale if necessary.
Step 2
Procedure: Begin model simplification
Interpretation: Inspect the parameter estimates (e.g. using the R function summary()). Remove the least significant terms first (using update -), starting with the highest-order interactions.
Step 3
Procedure: If the deletion causes an insignificant increase in deviance
Interpretation: Leave that term out of the model. Inspect the parameter values again. Remove the least significant term remaining.
Step 4
Procedure: If the deletion causes a significant increase in deviance
Interpretation: Put the term back in the model (using update +). These are the statistically significant terms as assessed by deletion from the maximal model.
Step 5
Procedure: Keep removing terms from the model
Interpretation: Repeat steps 3 or 4 until the model contains nothing but significant terms. This is the minimal adequate model (MAM). If none of the parameters is significant, then the minimal adequate model is the null model.
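One deletion step of this procedure can be sketched with update() and anova(). The data are simulated (hypothetical), with a response that depends on x1 only; a linear model is assumed, and the 0.05 threshold for keeping a term is also an assumption.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)                         # hypothetical explanatory variables
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)            # only x1 has a real effect

# Step 1: fit the maximal model (main effects and interaction)
m_max <- lm(y ~ x1 * x2)
summary(m_max)$coefficients            # Step 2: inspect the parameter estimates

# Step 2: remove the highest-order interaction with update()
m1  <- update(m_max, . ~ . - x1:x2)
cmp <- anova(m1, m_max)                # does the deletion increase the deviance significantly?
p_deletion <- cmp[2, "Pr(>F)"]

# Steps 3 and 4: keep the simpler model unless the increase is significant
m_current <- if (p_deletion < 0.05) m_max else m1
formula(m_current)
```

The same pattern is repeated term by term until only significant terms remain.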