Statistical Data Analysis 2011/2012 M. de Gunst Lecture 7.




Page 2:

Statistical Data Analysis: Introduction

Topics:
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests (continued)
- Analysis of categorical data
- Multiple linear regression

Page 3:

Today’s topic: Nonparametric methods for two-sample problems (Chapter 6: 6.3, 6.4)

6. Nonparametric methods (continued)
6.1. One sample: two nonparametric tests for location
6.2. Asymptotic efficiency

6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)

6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests

Page 4:

Nonparametric methods: Introduction-recap

Nonparametric tests
- No assumption of a parametric family for the underlying distribution of the data
- For problems in which a large class of distributions belongs to H0
- Distribution of the test statistic is the same under every distribution that belongs to H0

Why these tests?
- Robust with respect to the significance level: the test has level α for a large class of distributions
- More efficient than tests with more assumptions when these assumptions do not hold: fewer observations are necessary for the same power (Dutch: onderscheidend vermogen)

Page 5:

6.3. Two-sample problem: equality of distributions (1)

Example.
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)

Summary statistics in R:

          x       z
mean      61.56   75.11
median    54.25   62.5
sd        28.82   48.51
length    32      23

Is the distribution of x in some way smaller than that of z?

Page 6:

Two-sample problem: equality of distributions (2) (Continued)

Example (continued).
Is the distribution of x the same as that of z? How to investigate with a plot?
- Boxplots in one figure
- Better: empirical QQ-plot of x and z

(In)equality of the distributions is not clear (see also Chapter 3), so investigate further with test(s)
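The two plots mentioned on this slide can be produced in R roughly as follows. This is a minimal sketch on simulated stand-in data: the vectors `x` and `z` below are hypothetical exponential samples with the same sizes as the thrombo data, not the real measurements.

```r
# Simulated stand-ins for the two samples (NOT the real thrombo data).
set.seed(1)
x <- rexp(32, rate = 1 / 60)   # hypothetical first sample, size 32
z <- rexp(23, rate = 1 / 75)   # hypothetical second sample, size 23

# Boxplots of both samples in one figure.
boxplot(list(x = x, z = z), main = "Boxplots of x and z")

# Empirical QQ-plot: sample quantiles of x against sample quantiles of z.
# qqplot() also returns the plotted quantile pairs invisibly.
qq <- qqplot(x, z, main = "Empirical QQ-plot of x and z")
abline(0, 1, lty = 2)          # reference line y = x
```

If the two distributions were equal, the points of the QQ-plot would scatter around the dashed line y = x; a straight line with another slope or intercept suggests a difference in scale or location.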

Page 7:

Two-sample problem: equality of distributions (3)

Situation:
- x1,…,xm realizations of X1,…,Xm, independent, unknown distribution F
- y1,…,yn realizations of Y1,…,Yn, independent, unknown distribution G

Are F and G the same? Which aspect? Location, spread, general shape, …

Case 1. Paired observations (X1,Y1),…,(Xn,Yn), m = n

Case 2. Unpaired observations X1,…,Xm and Y1,…,Yn: two independent groups of random variables

Page 8:

Paired-samples: equality of distributions

X1,…,Xn ~ F; Y1,…,Yn ~ G

Case 1. Paired observations (X1,Y1),…,(Xn,Yn)

Main interest: difference in location of F and G

Consider the differences Xi − Yi, i = 1,…,n → one sample

Now investigate the location of the distribution of the differences with one-sample test(s).

Page 9:

Paired-samples: 3 one-sample tests

X1,…,Xn ~ F; Y1,…,Yn ~ G; F = G?

Case 1. Paired observations (X1,Y1),…,(Xn,Yn)

Test whether the location of the distribution of the differences Xi − Yi equals 0 with one-sample test(s):

i) differences normal: t-test

ii) Xi and Yi dependent, pairs independent: sign test, Wilcoxon’s signed rank test

iii) Xi and Yi independent, pairs independent: Wilcoxon’s signed rank test, because then under H0 symmetry of the differences around 0 is automatic
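The three one-sample tests on the differences can be run in R along the following lines. This is a sketch on simulated paired data; the vectors `x`, `y` and the effect size are hypothetical, and the sign test is carried out with `binom.test`, which is one common way to do it in base R.

```r
set.seed(2)
n <- 20
x <- rnorm(n, mean = 10, sd = 2)        # hypothetical first measurements
y <- x + rnorm(n, mean = 0.5, sd = 1)   # hypothetical paired second measurements
d <- x - y                              # differences: a single sample

# i) t-test on the differences (assumes the differences are normal).
p_t <- t.test(d, mu = 0)$p.value

# ii) Sign test: under H0 the number of positive differences is
#     Binomial(number of nonzero differences, 1/2).
p_sign <- binom.test(sum(d > 0), sum(d != 0), p = 0.5)$p.value

# iii) Wilcoxon signed rank test (assumes symmetry around 0 under H0).
p_signed_rank <- wilcox.test(d, mu = 0)$p.value

c(t = p_t, sign = p_sign, signed_rank = p_signed_rank)
```

Applying `t.test(x, y, paired = TRUE)` gives the same result as the t-test on `d`.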

Page 10:

Unpaired samples: equality of distributions

X1,…,Xm ~ F; Y1,…,Yn ~ G

Case 2. Unpaired observations X1,…,Xm and Y1,…,Yn: two independent groups of random variables

m and n may be different

Page 11:

Unpaired samples: t-test

Two-sample t-test
Assumptions: F normal, mean μ; G normal, mean ν; equal variances

Test statistic: $T = \dfrac{\bar X_m - \bar Y_n}{S_{X,Y}\sqrt{1/m + 1/n}} \sim t_{m+n-2}$ under H0: μ = ν, with $S_{X,Y}^2$ the pooled sample variance

If the variances are not equal: adjusted denominator. Note: this is the default in R (var.equal = FALSE), see ?t.test:

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
       paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
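The difference between the equal-variance test and R's default can be made concrete with a small check. The sketch below uses simulated normal samples (hypothetical data, chosen only for illustration), computes the pooled-variance statistic by hand, and compares it with `t.test(..., var.equal = TRUE)`.

```r
set.seed(3)
x <- rnorm(32, mean = 60, sd = 30)   # hypothetical sample from F
y <- rnorm(23, mean = 75, sd = 30)   # hypothetical sample from G
m <- length(x); n <- length(y)

# Pooled variance and the classical two-sample t statistic.
s2_pooled <- ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)
t_manual  <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / m + 1 / n))
p_manual  <- 2 * pt(-abs(t_manual), df = m + n - 2)   # two-sided p-value

# Same test via t.test; var.equal = TRUE has to be set explicitly.
pooled <- t.test(x, y, var.equal = TRUE)

# Default call (var.equal = FALSE): adjusted denominator and
# adjusted, usually non-integer, degrees of freedom.
welch <- t.test(x, y)

c(manual = t_manual, pooled = unname(pooled$statistic),
  df_pooled = unname(pooled$parameter), df_default = unname(welch$parameter))
```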

Page 12:

6.3.1. Unpaired samples: median test

Median test
Assumptions: F, G continuous distributions

Test statistic: T, a count of first-sample observations on one side of the median of the combined sample; under H0, T ~ Hyp(m+n, m, p), where p is “half” of the total number of observations; nonparametric

Suited (efficient) for shift alternatives: H1: G = F(· − θ)

Does not use much information from the data (NB. This is not Mood’s median test)
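A hypergeometric p-value of this kind can be computed in R as sketched below. The exact definition of the statistic is an assumption here: T is taken to be the number of first-sample observations among the p smallest values of the combined sample, with p = ceiling((m+n)/2), and the p-value is right-sided (many small x-values point towards "x smaller than z"). The data are simulated and hypothetical.

```r
set.seed(4)
x <- rnorm(32, mean = 0)     # hypothetical first sample (size m)
z <- rnorm(23, mean = 0.5)   # hypothetical second sample (size n)
m <- length(x); n <- length(z); N <- m + n

p_half  <- ceiling(N / 2)               # assumed meaning of "half" of N
ranks_x <- rank(c(x, z))[seq_len(m)]    # ranks of x in the combined sample
T_obs   <- sum(ranks_x <= p_half)       # x-values among the p_half smallest

# Under H0, T ~ Hyp(N, m, p_half): drawing p_half values from a pool
# with m "x" labels and n "z" labels. Right tail: P(T >= T_obs).
p_value <- 1 - phyper(T_obs - 1, m, n, p_half)
p_value
```

With this convention, a call of the form `1 - phyper(t - 1, m, n, p)` gives P(T ≥ t), which has the same shape as the `phyper` expression used in the illustration later in the lecture.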

Page 13:

6.3.2. Unpaired samples: Wilcoxon two-sample test (1)

Wilcoxon two-sample test, also called Wilcoxon rank sum test or Mann-Whitney test
Assumptions: F, G continuous distribution functions

Test statistic: W = R1 + … + Rm, with R1,…,Rm the ranks of X1,…,Xm in the combined sample of size N = m+n; nonparametric

With ties or for large n, m: normal approximation

Especially suited (efficient) for shift alternatives: H1: G = F(· − θ)
Uses more information from the data than the median test

Page 14:

Unpaired samples: Wilcoxon two-sample test (2)

Wilcoxon rank sum test
Assumptions: F, G continuous distributions

Test statistic: W = R1 + … + Rm, with R1,…,Rm the ranks of X1,…,Xm in the combined sample of size N = m+n

Equivalent test statistics used under the same name:
- W = sum of ranks of the first sample
- W − m(m+1)/2 (used by R)
- the same statistics with the roles of the first and second sample switched
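The relation between the rank sum of the first sample and the statistic that R reports can be checked numerically. The sketch below uses small simulated samples without ties (hypothetical data) and shows that the value printed by `wilcox.test` equals the rank sum minus m(m+1)/2, which in turn equals the number of pairs with x_i > z_j.

```r
set.seed(5)
x <- rnorm(8)             # hypothetical first sample (size m), no ties
z <- rnorm(6, mean = 1)   # hypothetical second sample (size n)
m <- length(x)

rank_sum   <- sum(rank(c(x, z))[seq_len(m)])   # sum of ranks of first sample
shifted    <- rank_sum - m * (m + 1) / 2       # rank sum minus its minimum
pair_count <- sum(outer(x, z, ">"))            # pairs (i, j) with x_i > z_j

W_R <- unname(wilcox.test(x, z)$statistic)     # statistic reported by R

c(rank_sum = rank_sum, shifted = shifted, pairs = pair_count, R = W_R)
```

Because these statistics differ only by a constant, they lead to the same test; only the tabulated null distribution is shifted.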

Page 15:

6.3.3. Unpaired samples: Kolmogorov-Smirnov test

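Kolmogorov-Smirnov two-sample test
Assumptions: F, G have continuous distribution functions

Test statistic: $D_{m,n} = \sup_x |\hat F_m(x) - \hat G_n(x)|$, with $\hat F_m$ and $\hat G_n$ the empirical distribution functions of the two samples; it depends on the data only through the ranks of X1,…,Xm in the combined sample of size N = m+n, hence nonparametric

Especially suited (efficient) for general alternatives: F and G need not have the same shape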
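The two-sample Kolmogorov-Smirnov statistic is usually defined as the largest vertical distance between the two empirical distribution functions. The sketch below, on simulated hypothetical data, computes that distance by hand with `ecdf` and compares it with the statistic returned by `ks.test`.

```r
set.seed(6)
x <- rnorm(32)            # hypothetical first sample
z <- rnorm(23, sd = 2)    # hypothetical second sample, different spread

Fm <- ecdf(x)             # empirical distribution function of x
Gn <- ecdf(z)             # empirical distribution function of z

# Both step functions only jump at data points, so the supremum of
# |Fm - Gn| is attained at one of the combined sample values.
grid     <- sort(c(x, z))
D_manual <- max(abs(Fm(grid) - Gn(grid)))

D_R <- unname(ks.test(x, z)$statistic)
c(manual = D_manual, R = D_R)
```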

Page 16:

6.3.4. Unpaired samples: permutation tests

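Permutation tests for unpaired data
Assumptions: F, G have continuous distribution functions

Test statistic: T = T(X1,…,Xm,Y1,…,Yn) that gives information about the difference between F and G, e.g. the difference of the sample means, Med(X)m − Med(Y)n, etc.

Test conditionally on the ordered combined sample; right-sided p-value: the fraction of all (m+n)! permutations of the combined sample for which T of the permuted data is at least the observed value of T

nonparametric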

Page 17:

Unpaired samples: illustration (1)

Example.
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)

F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G (what does this mean??)

Note: for the different tests this H1 becomes in R:
- t.test: difference in expectations is less than 0;
  > t.test(x, z, alternative="less")
- median test: difference in location less than 0; compute yourself with
  > 1-phyper(18, 32, 23, 56/2)   # check where the numbers come from!!
- Mann-Whitney/Wilcoxon: difference in location less than 0;
  > wilcox.test(x, z, alternative="less")
- Kolmogorov-Smirnov: CDF of x lies above that of z;
  > ks.test(x, z, alternative="greater")
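One common way to make "F stochastically smaller than G" precise is given below. This is the standard textbook definition and is assumed here, since the slide leaves the question open; it agrees with the Kolmogorov-Smirnov formulation that the CDF of the first sample lies above that of the second.

```latex
F \text{ is stochastically smaller than } G
\iff F(t) \ge G(t) \ \text{for all } t
\iff P(X > t) \le P(Y > t) \ \text{for all } t,
\qquad X \sim F,\ Y \sim G .
```

In words: for every threshold t, an observation from F is at least as likely to fall at or below t as an observation from G.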

Page 18:

Unpaired samples: illustration (2) (Continued)

Example (continued).
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

p-values:
- t.test: 0.12
- median test: 0.11
- Mann-Whitney/Wilcoxon: 0.20 (normal approximation was used, due to ties)
- Kolmogorov-Smirnov: 0.31 (R-warning)
H0 is not rejected by any of these tests.

Note: we have performed all tests, but
- t.test is not a good candidate, because the data are not likely to be normal based on the plots;
- whether the distributions have the same shape, i.e. whether a shift alternative is a good choice, is not clear: the shapes look similar, but the sd’s are quite different. If it is, then the median and Mann-Whitney tests are good in terms of power;
- Kolmogorov-Smirnov is a good test also for general types of alternatives. There are ties here and R does not know how to adjust for this, so consider the p-value as an approximation.

Page 19:

6.4. Paired samples: correlation

X1,…,Xn ~ F; Y1,…,Yn ~ G

Only for paired observations (X1,Y1),…,(Xn,Yn): are Xi and Yi correlated?

How to start the investigation? Make a scatter plot

Measures of correlation?
- (Pearson’s) sample correlation
- Kendall’s rank correlation
- Spearman’s rank correlation

Can all be used for testing

Page 20:

6.4.1/6.4.2. Paired samples: tests for correlation

Only for paired observations (X1,Y1),…,(Xn,Yn):

for all i

(Pearson’s) (linear) correlation test
Assumptions: F normal, G normal

Test statistic: $T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$ under H0, with r the sample correlation

Kendall’s rank correlation test, Spearman’s rank correlation test
Assumptions: F, G continuous distribution functions

Test statistics: Kendall’s τ and Spearman’s rank correlation coefficient, resp. Both are based on ranks: nonparametric

R: cor.test(x, y, method = "pearson" / "kendall" / "spearman", …)
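The three calls to `cor.test` can be tried out as sketched below on simulated paired data (the vectors `x` and `y` are hypothetical). The sketch also recomputes the Pearson t statistic by hand from the sample correlation and shows that Spearman's coefficient is the Pearson correlation of the ranks.

```r
set.seed(8)
n <- 25
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # hypothetical paired sample with some dependence

pearson  <- cor.test(x, y, method = "pearson")
kendall  <- cor.test(x, y, method = "kendall")
spearman <- cor.test(x, y, method = "spearman")

# Pearson: t statistic with n - 2 degrees of freedom, computed by hand.
r        <- cor(x, y)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Spearman: ordinary correlation applied to the ranks of the data.
rho_manual <- cor(rank(x), rank(y))

c(pearson = pearson$p.value, kendall = kendall$p.value,
  spearman = spearman$p.value)
```

Because the rank correlations use the data only through their ranks, they do not change when x or y is transformed by a strictly increasing function such as `exp`.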

Page 21:

6.4.3. Paired samples: permutation test for correlation

Only for paired observations (X1,Y1),…,(Xn,Yn):

for all i

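Permutation tests
Assumptions: F, G continuous distribution functions

Test statistic: T that gives information about the dependence of Xi and Yi, e.g. Kendall’s τ, Spearman’s rank correlation coefficient

Test conditionally on the first sample and the ordered second sample; right-sided p-value: the fraction of all n! permutations π for which T of the re-paired data (X1,Yπ(1)),…,(Xn,Yπ(n)) is at least the observed value of T

Conditional, so different results from the former tests with the same statistics

nonparametric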

Page 22:

Permutation tests (1)

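1. Unpaired observations X1,…,Xm and Y1,…,Yn
2. Paired observations (X1,Y1),…,(Xn,Yn), m = n

Bootstrap if the p-value is not computable exactly: generate a large number B of randomly chosen permutations π, and approximate the p-value by a fraction:

1. Replace the fraction of all (m+n)! permutations of the combined sample for which T is at least the observed value by the same fraction among the B random permutations

2. Replace the fraction of all n! permutations of the second sample for which T is at least the observed value by the same fraction among the B random permutations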

Page 23:

Permutation tests (2)

1. Unpaired observations X1,…,Xm and Y1,…,Yn
2. Paired observations (X1,Y1),…,(Xn,Yn), m = n

How to permute in both cases?
Instead of a permutation π of 1,…,m+n, and of 1,…,n, resp., it is easier to permute the data:

1. Permute (X1,…,Xm,Y1,…,Yn) and make a new division in two samples of m and n observations.

2. Permute (Y1,…,Yn), leave (X1,…,Xn) as it is, and make new pairs.
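For the paired case (item 2), permuting the second sample while keeping the first one fixed can be sketched in R as follows. The function name `perm_cor_test`, the choice of Spearman's coefficient as statistic, the value of B and the simulated data are all assumptions made for illustration; the p-value is right-sided and approximate.

```r
set.seed(9)
n <- 30
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)   # hypothetical paired sample

# Hypothetical helper: right-sided permutation test for correlation.
perm_cor_test <- function(x, y, B = 2000, method = "spearman") {
  t_obs <- cor(x, y, method = method)
  # Permute y only and keep x as it is: this forms new pairs.
  t_perm <- replicate(B, cor(x, sample(y), method = method))
  mean(t_perm >= t_obs)   # fraction of permutations with T >= observed T
}

p_right <- perm_cor_test(x, y)
p_right
```

Since the permutations are drawn at random, repeated runs without a fixed seed give slightly different p-values.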

Page 24:

Permutation test: illustration for unpaired samples (1)

Example.
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)

F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

If interested in specific characteristics: permutation test
Unpaired data: conditional test, given the sorted values of the combined sample:

Choose T; suppose H0 is rejected for small (large) values of T
Then B times, do:
- generate a random permutation of (x1,…,x32,z1,…,z23) and make a new division in two samples of 32 and 23 observations with the R-function 'sample'
- xperm = first 32 elements of the permuted data
- zperm = last 23 elements of the permuted data
- determine T(xperm, zperm)

Count the fraction of the B values T(xperm, zperm) smaller (larger) than T(x,z) of the thrombo data: this is the p-value
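The recipe on this slide can be sketched in R as follows. The function name `perm_test_left`, the value of B and the simulated stand-in data are assumptions (the real thrombo data are not reproduced here), and the sketch counts permuted values smaller than or equal to the observed one, which is one possible convention for the left p-value.

```r
set.seed(10)
x <- rexp(32, rate = 1 / 60)   # hypothetical stand-in for sample x
z <- rexp(23, rate = 1 / 75)   # hypothetical stand-in for sample z

# Hypothetical helper: left-sided permutation test for unpaired samples.
perm_test_left <- function(x, z, stat, B = 2000) {
  m <- length(x)
  combined <- c(x, z)
  t_obs <- stat(x, z)
  t_perm <- replicate(B, {
    perm <- sample(combined)                    # random permutation
    stat(perm[seq_len(m)], perm[-seq_len(m)])   # first m: xperm, rest: zperm
  })
  mean(t_perm <= t_obs)                         # approximate left p-value
}

diff_mean   <- function(a, b) mean(a) - mean(b)
diff_median <- function(a, b) median(a) - median(b)
diff_sd     <- function(a, b) sd(a) - sd(b)

c(mean   = perm_test_left(x, z, diff_mean),
  median = perm_test_left(x, z, diff_median),
  sd     = perm_test_left(x, z, diff_sd))
```

Any statistic that takes the two samples as arguments can be plugged in for `stat`, which is what makes the permutation approach suitable for specific characteristics such as the spread.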

Page 25:

Permutation test: illustration for unpaired samples (2) (Continued)

Example (continued).
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

Results of permutation tests
- Permutation test for difference in mean: T = mean(X) − mean(Y)
  Left p-value: 0.096 (bootstrap approximation) (several times: 0.107, 0.091, 0.114, …)
- Permutation test for difference in median: T = median(X) − median(Y)
  Left p-value: 0.165 (bootstrap approximation) (several times: 0.163, 0.19, 0.18, …)
- Permutation test for Mann-Whitney: T = U-tilde
  Left p-value: 0.183 (bootstrap approximation) (several times: 0.195, 0.214, 0.201, …) (around the value for the unconditional Mann-Whitney test)
- Permutation test for difference in sd: T = sd(X) − sd(Y)
  Left p-value: 0.045 (bootstrap approximation) (several times: 0.043, 0.053, 0.059, …)

Page 26:

Recap

6. Nonparametric methods (continued)

6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)

6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests

Page 27:

Nonparametric methods for one sample problems

The end