Statistical Data Analysis 2011/2012 M. de Gunst Lecture 7.
Statistical Data Analysis
2011/2012
M. de Gunst
Lecture 7
Statistical Data Analysis: Introduction
Topics
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests (continued)
- Analysis of categorical data
- Multiple linear regression
Today’s topic: Nonparametric methods for two-sample problems (Chapter 6: 6.3, 6.4)
6. Nonparametric methods (continued)
6.1. One sample: two nonparametric tests for location
6.2. Asymptotic efficiency
6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests
Nonparametric methods: Introduction-recap
Nonparametric tests
- No assumption of a parametric family for the underlying distribution of the data
- For problems with a large class of distributions belonging to H0
- Distribution of the test statistic is the same under every distribution that belongs to H0
Why these tests?
- Robust w.r.t. significance level: level α holds for a large class of distributions
- More efficient than tests with more assumptions when these assumptions do not hold: fewer observations necessary for the same power
6.3. Two-sample problem: equality of distributions (1)
Example
Data: Thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
          x       z
mean      61.56   75.11
median    54.25   62.5
sd        28.82   48.51
length    32      23
Is the distribution of x in some way smaller than that of z?
Two-sample problem: equality of distributions (2) (Continued)
Example
Is the distribution of x the same as that of z? How to investigate with a plot?
- Boxplots in one figure
- Better: empirical QQ-plot of x and z
(In)equality of distributions not clear (see also Chapter 3), so investigate further with test(s)
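The two plots can be produced along the following lines. This is only a sketch: the thromboglobulin measurements are not reproduced in these slides, so the vectors `x` and `z` below are simulated stand-ins of the same sizes (an assumption for illustration only).

```r
# Hypothetical stand-in data: two skewed samples of sizes 32 and 23.
set.seed(1)
x <- rgamma(32, shape = 4, scale = 15)
z <- rgamma(23, shape = 3, scale = 25)

# Boxplots of both samples in one figure.
boxplot(list(x = x, z = z), main = "Boxplots of x and z")

# Empirical QQ-plot: sample quantiles of z against those of x.
# Points close to the line y = x suggest equal distributions;
# a parallel line suggests a shift in location.
qq <- qqplot(x, z, main = "Empirical QQ-plot of x and z")
abline(0, 1, lty = 2)
```

Because the samples have different sizes, `qqplot` interpolates the quantiles of the larger sample, so both returned vectors have the length of the smaller one.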
Two-sample problem: equality of distributions (3)
Situation
- x1,…,xm realizations of X1,…,Xm, independent, unknown distribution F
- y1,…,yn realizations of Y1,…,Yn, independent, unknown distribution G
Are F and G the same? Which aspect? Location, spread, general shape, …
Case 1. Paired observations (x1,y1),…,(xn,yn), m = n
Case 2. Unpaired observations x1,…,xm and y1,…,yn: two independent groups of random variables
Paired-samples: equality of distributions
X1,…,Xn ~ F; Y1,…,Yn ~ G
Case 1. Paired observations (x1,y1),…,(xn,yn)
Main interest: difference in location of F and G
Consider the differences Xi - Yi, i = 1,…,n → one sample
Now investigate the location of the distribution of the differences with one-sample test(s).
Paired-samples: 3 one-sample tests
X1,…,Xn ~ F; Y1,…,Yn ~ G; F = G?
Case 1. Paired observations (x1,y1),…,(xn,yn)
Test whether the location of the distribution of the differences Xi - Yi equals 0 with one-sample test(s):
i) differences normal: t-test
ii) Xi and Yi dependent, pairs (Xi, Yi) independent: sign test, Wilcoxon’s signed rank test
iii) Xi and Yi independent, pairs (Xi, Yi) independent: Wilcoxon’s signed rank test, because then under H0 symmetry around 0 is automatic
Unpaired samples: equality of distributions
X1,…,Xm ~ F; Y1,…,Yn ~ G
Case 2. Unpaired observations x1,…,xm and y1,…,yn: two independent groups of random variables
m and n may be different
Unpaired samples : t-test
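Two-sample t-test
Assumptions: F normal, mean μ; G normal, mean ν; equal variances
Test statistic: $T = \dfrac{\bar{X}_m - \bar{Y}_n}{S_p \sqrt{1/m + 1/n}} \sim t_{m+n-2}$ under H0: μ = ν, with pooled variance $S_p^2 = \dfrac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$
If variances not equal: adjusted denominator. Note: this is the default in R (var.equal = FALSE):
?t.test
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
       paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)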
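As a check on the role of `var.equal`, the pooled statistic can be computed by hand and compared with the output of `t.test`. The data below are simulated and purely hypothetical; only the function `t.test` and its arguments come from the slide.

```r
# Hypothetical normal samples of sizes m = 32 and n = 23.
set.seed(2)
x <- rnorm(32, mean = 60, sd = 30)
y <- rnorm(23, mean = 75, sd = 30)
m <- length(x)
n <- length(y)

# Pooled variance and the classical two-sample t statistic.
s2_pooled <- ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)
t_by_hand <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / m + 1 / n))

# The same test via t.test; var.equal = TRUE has to be requested explicitly.
t_pooled <- t.test(x, y, var.equal = TRUE)

# R's default (var.equal = FALSE) is the Welch test, with an adjusted
# denominator and, in general, non-integer degrees of freedom.
t_welch <- t.test(x, y)

print(c(by_hand = t_by_hand, pooled = unname(t_pooled$statistic),
        welch = unname(t_welch$statistic)))
```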
6.3.1. Unpaired samples: median test
Median test
Assumptions: F, G continuous distributions
Test statistic: T = number of Xi on one side of the median of the combined sample; under H0, T ~ Hyp(m+n, m, p), with p = “half” of the total number of observations → nonparametric
Suited (efficient) for shift alternatives: H1: $G = F(\cdot - \theta)$
Does not use much information of the data (NB. This is not Mood’s median test)
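A minimal sketch of how such a p-value could be computed with `phyper`. Two things here are assumptions, not taken from the slide: the statistic is taken to be the number of x-values among the p smallest observations of the combined sample, and "half" is rounded up, p = ceiling(N/2). The data are simulated stand-ins.

```r
# Hypothetical continuous samples (no ties), m = 32 and n = 23.
set.seed(3)
x <- rexp(32, rate = 1 / 55)
z <- rexp(23, rate = 1 / 70)
m <- length(x)
n <- length(z)
N <- m + n

p <- ceiling(N / 2)                       # "half" of N (assumed rounding)
is_x <- rep(c(TRUE, FALSE), c(m, n))      # sample labels of the combined data
t_obs <- sum(is_x[order(c(x, z))][1:p])   # number of x-values among the p smallest

# Under H0 the labels are a random draw, so t_obs is hypergeometric:
# p draws from an urn with m "x" labels and n "z" labels.
# Many x-values in the lower half point towards x being smaller,
# hence the right tail P(T >= t_obs).
p_value <- 1 - phyper(t_obs - 1, m, n, p)
print(c(t_obs = t_obs, p_value = p_value))
```

The subtraction `t_obs - 1` is needed because `phyper(q, ...)` returns P(T ≤ q), so `1 - phyper(t_obs - 1, ...)` equals P(T ≥ t_obs).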
6.3.2. Unpaired samples: Wilcoxon two-sample test (1)
Wilcoxon two-sample test, also called Wilcoxon rank sum test or Mann-Whitney test
Assumptions: F, G continuous distribution functions
Test statistic: W = sum of the ranks of the observations of one sample in the combined sample of size N = m+n → nonparametric
With ties or for large n, m: normal approximation
Especially suited (efficient) for shift alternatives: H1: $G = F(\cdot - \theta)$
Uses more information from the data than the median test
Unpaired samples: Wilcoxon two-sample test (2)
Equivalent test statistics used under the same name:
- W with the roles of first and second sample switched
- $\tilde{U}$ = (sum of ranks of first sample) $- \tfrac{1}{2} m(m+1)$, used by R
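The relation between the rank sum and the statistic that R reports can be verified numerically. The sketch below uses simulated, hypothetical data without ties; the variable names are chosen for illustration.

```r
# Hypothetical continuous samples (no ties).
set.seed(4)
x <- rnorm(12, mean = 0)     # first sample, m = 12
y <- rnorm(15, mean = 0.5)   # second sample, n = 15
m <- length(x)

# Ranks in the combined sample; the first m ranks belong to x.
ranks <- rank(c(x, y))
rank_sum_x <- sum(ranks[1:m])

# Rank sum of the first sample minus its smallest possible value m(m+1)/2.
u_tilde <- rank_sum_x - m * (m + 1) / 2

# Without ties this equals the Mann-Whitney count of pairs with x_i > y_j.
u_pairs <- sum(outer(x, y, ">"))

# wilcox.test reports this shifted rank sum as its statistic "W".
wt <- wilcox.test(x, y)
print(c(u_tilde = u_tilde, u_pairs = u_pairs, R = unname(wt$statistic)))
```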
6.3.3. Unpaired samples: Kolmogorov-Smirnov test
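Kolmogorov-Smirnov two-sample test
Assumptions: F, G have continuous distribution functions
Test statistic: $D = \sup_t |\hat{F}_m(t) - \hat{G}_n(t)|$, with $\hat{F}_m$ and $\hat{G}_n$ the empirical distribution functions of the two samples; it depends on the data only through the ranks of the observations in the combined sample of size N = m+n → nonparametric
Especially suited (efficient) for general alternatives: F and G need not have the same shape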
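The statistic is the largest vertical distance between the two empirical distribution functions, which can be computed directly with `ecdf`. A small sketch on simulated, hypothetical data without ties:

```r
# Hypothetical samples from distributions with different shapes.
set.seed(5)
x <- rnorm(30)
y <- rexp(25)

# Empirical distribution functions of both samples.
Fm <- ecdf(x)
Gn <- ecdf(y)

# Both are step functions that jump only at data points, so the supremum
# of |Fm - Gn| is attained at one of the combined observations.
pts <- sort(c(x, y))
D_by_hand <- max(abs(Fm(pts) - Gn(pts)))

ks <- ks.test(x, y)
print(c(by_hand = D_by_hand, ks.test = unname(ks$statistic)))
```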
6.3.4. Unpaired samples: permutation tests
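Permutation tests for unpaired data
Assumptions: F, G have continuous distribution functions
Test statistic: T = T(X1,…,Xm,Y1,…,Yn) that gives info about the difference between F and G, e.g. $\bar{X}_m - \bar{Y}_n$, Med(X)m – Med(Y)n, etc.
Test conditionally on the ordered combined sample: the (right-sided) p-value is the fraction of all (m+n)! permutations of the combined sample, each split again into samples of sizes m and n, for which T is at least the observed value t → nonparametric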
Unpaired samples: illustration (1)
Example
Data: Thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G (what does this mean?)
Note: For different tests this H1 becomes in R:
- t.test: difference in expectations is less than 0
  > t.test(x, z, alternative="less")
- median test: difference in location less than 0; compute yourself with
  > 1-phyper(18, 32, 23, 56/2)   # check where the numbers come from!
- Mann-Whitney/Wilcoxon: difference in location less than 0
  > wilcox.test(x, z, alternative="less")
- Kolmogorov-Smirnov: CDF of X above that of Y
  > ks.test(x, z, alternative="greater")
Unpaired samples: illustration (2) (Continued)
Example
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
p-values:
- t.test: 0.12
- median test: 0.11
- Mann-Whitney/Wilcoxon: 0.20 (normal approximation was used, due to ties)
- Kolmogorov-Smirnov: 0.31 (R-warning)
H0 not rejected for these tests.
Note: we have performed all tests, but
- t.test is not a good candidate, because the data are not likely to be normal based on the plots;
- whether the distributions have the same shape, i.e. whether a shift alternative is a good choice, is not clear: shapes look similar, but the sd’s are quite different. If it is, then the median and Mann-Whitney tests are good in terms of power;
- Kolmogorov-Smirnov is a good test also for general types of alternatives. There are ties here and R does not know how to adjust for this, so consider the p-value as an approximation.
6.4. Paired samples: correlation
X1,…,Xn ~ F; Y1,…,Yn ~ G
Only for paired observations (x1,y1),…,(xn,yn): are Xi and Yi correlated?
How to start the investigation? Make a scatter plot
Measures of correlation?
- (Pearson’s) sample correlation
- Kendall’s rank correlation
- Spearman’s rank correlation
Can all be used for testing
6.4.1/6.4.2. Paired samples: tests for correlation
Only for paired observations (x1,y1),…,(xn,yn):
H0: Xi and Yi independent, for all i
(Pearson’s) (linear) correlation test
Assumptions: F normal, G normal
Test statistic: $T = r\sqrt{n-2}/\sqrt{1-r^2} \sim t_{n-2}$ under H0, with $r$ the sample correlation
Kendall’s rank correlation test, Spearman’s rank correlation test
Assumptions: F, G continuous distribution functions
Test statistic: Kendall’s $\tau$ and Spearman’s $r_s$, resp. Both based on ranks: nonparametric
R: cor.test(x, y, method = "pearson" / "kendall" / "spearman", …)
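The three tests can be run side by side with `cor.test`. The sketch below uses simulated, hypothetical paired data and also recomputes Pearson's t statistic from the sample correlation as a check.

```r
# Hypothetical paired data with a moderate positive dependence.
set.seed(6)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)
n <- length(x)

pearson  <- cor.test(x, y, method = "pearson")
kendall  <- cor.test(x, y, method = "kendall")
spearman <- cor.test(x, y, method = "spearman")

# Pearson's test statistic computed by hand from the sample correlation.
r <- cor(x, y)
t_by_hand <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Estimated correlations and p-values of the three tests.
print(rbind(
  pearson  = c(estimate = unname(pearson$estimate),  p = pearson$p.value),
  kendall  = c(estimate = unname(kendall$estimate),  p = kendall$p.value),
  spearman = c(estimate = unname(spearman$estimate), p = spearman$p.value)
))
```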
6.4.3. Paired samples: permutation test for correlation
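Only for paired observations (x1,y1),…,(xn,yn):
H0: Xi and Yi independent, for all i
Permutation tests
Assumptions: F, G continuous distribution functions
Test statistic: T that gives info about the dependence between Xi and Yi, e.g. Kendall’s $\tau$, Spearman’s $r_s$
Test conditionally on the first sample and the ordered second sample: the (right-sided) p-value is the fraction of all n! permutations of the second sample, paired anew with the unchanged first sample, for which T is at least the observed value t
Conditional, so different results from the former tests with the same statistics
→ nonparametric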
Permutation tests (1)
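1. Unpaired observations x1,…,xm and y1,…,yn
2. Paired observations (x1,y1),…,(xn,yn), m = n
Bootstrap if the p-value is not computable exactly: generate a large number B of randomly chosen permutations π, and approximate the p-value by a fraction:
1. Replace the fraction over all (m+n)! permutations of the combined sample by the fraction over the B random permutations.
2. Replace the fraction over all n! permutations of the second sample by the fraction over the B random permutations.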
Permutation tests (2)
1. Unpaired observations x1,…,xm and y1,…,yn
2. Paired observations (x1,y1),…,(xn,yn), m = n
How to permute in both cases?
Instead of a permutation π of 1,…,m+n and 1,…,n, resp., it is easier to permute the data:
1. Permute (X1,..,Xm,Y1,…,Yn) and make new division in two samples of m and n observations.
2. Permute (Y1,..,Yn), leave (X1,..,Xn) as it is, and make new pairs.
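For the paired case, permuting only the second sample is a single call to `sample`. The sketch below approximates a right-sided permutation p-value for Spearman's rank correlation; the data, the choice B = 2000 and the use of ≥ (rather than a strict inequality) when counting are assumptions made for illustration.

```r
# Hypothetical paired data.
set.seed(7)
n <- 25
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

B <- 2000                                   # number of random permutations
t_obs <- cor(x, y, method = "spearman")     # statistic on the original pairs

# Leave x as it is, permute y, and recompute the statistic on the new pairs.
t_perm <- replicate(B, cor(x, sample(y), method = "spearman"))

# Right-sided p-value: fraction of permuted statistics at least as large.
p_right <- mean(t_perm >= t_obs)
print(c(t_obs = t_obs, p_right = p_right))
```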
Permutation test: illustration for unpaired samples (1)
Example
Data: Thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
If interested in specific characteristics: permutation test
Unpaired data: conditional test, given the sorted values of the combined sample:
- Choose T; suppose H0 is rejected for small (large) values of T
- Then B times, do:
  - generate a random permutation of (x1,..,x32,z1,…,z23) with the R-function 'sample' and make a new division in two samples of 32 and 23 observations:
    xperm = first 32 elements of permuted data
    zperm = last 23 elements of permuted data
  - determine T(xperm, zperm)
- Count the fraction of the B values T(xperm, zperm) smaller (larger) than T(x,z) of the thromboglobulin data: this is the p-value
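The recipe above can be sketched as a small R function. The helper name `perm_test`, the default B = 2000 and the simulated stand-in data are all hypothetical; the original measurements are not reproduced in these slides, so the resulting p-values will not match the ones reported in the lecture. Ties with the observed value are counted (≤ and ≥), which is a choice made here.

```r
# Approximate permutation test for two unpaired samples.
# stat is a function of two samples, e.g. a difference in means.
perm_test <- function(x, z, stat, B = 2000) {
  m <- length(x)
  combined <- c(x, z)
  t_obs <- stat(x, z)
  t_perm <- replicate(B, {
    perm <- sample(combined)            # random permutation of combined data
    stat(perm[1:m], perm[-(1:m)])       # first m = xperm, remainder = zperm
  })
  list(t_obs = t_obs,
       p_left = mean(t_perm <= t_obs),  # H0 rejected for small values of T
       p_right = mean(t_perm >= t_obs)) # H0 rejected for large values of T
}

# Hypothetical stand-in data of sizes 32 and 23.
set.seed(8)
x <- rgamma(32, shape = 4, scale = 15)
z <- rgamma(23, shape = 3, scale = 25)

res_mean <- perm_test(x, z, function(a, b) mean(a) - mean(b))
res_median <- perm_test(x, z, function(a, b) median(a) - median(b))
res_sd <- perm_test(x, z, function(a, b) sd(a) - sd(b))
print(c(mean = res_mean$p_left, median = res_median$p_left, sd = res_sd$p_left))
```

Since the permutations are random, repeated runs with different seeds give slightly different p-values.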
Permutation test: illustration for unpaired samples (2) (Continued)
Example
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
Results of permutation tests
- Permutation test for difference in mean: T=mean(X)-mean(Y)
  Left p-value: 0.096 (bootstrap approximation) (several times: 0.107, 0.091, 0.114, …)
- Permutation test for difference in median: T=median(X)-median(Y)
  Left p-value: 0.165 (bootstrap approximation) (several times: 0.163, 0.19, 0.18, …)
- Permutation test for Mann-Whitney: T=U-tilde
  Left p-value: 0.183 (bootstrap approximation) (several times: 0.195, 0.214, 0.201, …) (around value for unconditional Mann-Whitney)
- Permutation test for difference in sd: T=sd(X)-sd(Y)
  Left p-value: 0.045 (bootstrap approximation) (several times: 0.043, 0.053, 0.059, …)
Recap
6. Nonparametric methods (continued)
6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests
Nonparametric methods for one sample problems
The end