Statistical Data Analysis 2011/2012 M. de Gunst Lecture 7.

Statistical Data Analysis: Introduction

Topics
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests (continued)
- Analysis of categorical data
- Multiple linear regression

Today’s topic: Nonparametric methods for two sample problems (Chapter 6: 6.3, 6.4)

6. Nonparametric methods (continued)
6.1. One sample: two nonparametric tests for location
6.2. Asymptotic efficiency

6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)

6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests

Nonparametric methods: Introduction-recap

Nonparametric tests
- No assumption of a parametric family for the underlying distribution of the data
- For problems in which a large class of distributions belongs to H0
- The distribution of the test statistic is the same under every distribution that belongs to H0

Why these tests?
- Robust with respect to the level: the test has level α for a large class of distributions
- More efficient than tests with more assumptions when these assumptions do not hold: fewer observations are needed for the same power (Dutch: onderscheidend vermogen)

6.3. Two-sample problem: equality of distributions (1)

Data: thromboglobulin measurements of Raynaud patients without organ defects (x) and of patients with other auto-immune diseases (z)

          x        z
mean      61.56    75.11
median    54.25    62.5
sd        28.82    48.51
length    32       23

Is the distribution of x in some way smaller than that of z?

Example

Two-sample problem: equality of distributions (2) (Continued)

Is the distribution of x the same as that of z? How to investigate with a plot?
- Boxplots in one figure
- Better: empirical QQ-plot of x and z

(In)equality of the distributions is not clear from the plots (see also Chapter 3), so investigate further with test(s)
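A minimal sketch of the two plots in R. The thrombo measurements are not reproduced in these slides, so the vectors `x` and `z` below are simulated stand-ins (an assumption, chosen only to have the same sample sizes 32 and 23).

```r
# Hypothetical stand-in data; NOT the thrombo data from the lecture.
set.seed(1)
x <- rgamma(32, shape = 4, scale = 15)
z <- rgamma(23, shape = 3, scale = 25)

pdf(NULL)  # discard graphics output; drop this line to see the plots

# Boxplots of both samples in one figure
boxplot(list(x = x, z = z), main = "Boxplots of x and z")

# Empirical QQ-plot: sorted x against sorted z (the longer sample is
# interpolated down to the length of the shorter one).
# Points near the line y = x suggest equal distributions; points near
# another straight line suggest equal shape with different location/scale.
qq <- qqplot(x, z, xlab = "quantiles of x", ylab = "quantiles of z")
abline(0, 1, lty = 2)

invisible(dev.off())
```

The object `qq` holds the plotted quantile pairs, which can be handy if the plot is to be redrawn with other graphics tools.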

Example

Two-sample problem: equality of distributions (3)

Situation
- $x_1,\dots,x_m$ realizations of $X_1,\dots,X_m$, independent, unknown distribution F
- $y_1,\dots,y_n$ realizations of $Y_1,\dots,Y_n$, independent, unknown distribution G

Are F and G the same? Which aspect? Location, spread, general shape, …

Case 1. Paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$, m = n

Case 2. Unpaired observations $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$: two independent groups of random variables

Paired-samples: equality of distributions

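$X_1,\dots,X_n$ ~ F; $Y_1,\dots,Y_n$ ~ G

Case 1. Paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$

Main interest: difference in location of F and G

Consider the differences $X_i - Y_i$, $i = 1,\dots,n$ → one sample

Now investigate the location of the distribution of the differences with one-sample test(s).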

Paired-samples: 3 one-sample tests

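$X_1,\dots,X_n$ ~ F; $Y_1,\dots,Y_n$ ~ G; F = G ?

Case 1. Paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$

Test whether the location of the distribution of the differences $X_i - Y_i$ equals 0 with one-sample test(s):

i) $X_i - Y_i$ normal: t-test

ii) $X_i$ and $Y_i$ dependent, pairs $(X_i,Y_i)$ independent: sign test, Wilcoxon’s signed rank test

iii) $X_i$ and $Y_i$ independent, pairs $(X_i,Y_i)$ independent: Wilcoxon’s signed rank test, because then under H0 symmetry of the differences around 0 is automatic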
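The three one-sample tests on the differences could be run in R roughly as follows. This is a sketch on simulated paired data (the vectors `x` and `y` and their distribution are assumptions, not course data); the sign test is carried out with `binom.test` on the number of positive differences.

```r
# Simulated paired data (hypothetical): y is x plus a noisy shift
set.seed(1)
n <- 20
x <- rnorm(n, mean = 10, sd = 2)
y <- x + rnorm(n, mean = 0.5, sd = 1)

d <- x - y  # differences: the paired problem becomes a one-sample problem

# i) t-test on the differences (assumes normal differences)
p_t <- t.test(d, mu = 0)$p.value

# ii) sign test: under H0 the number of positive differences is
#     Binomial(number of nonzero differences, 1/2)
n_pos <- sum(d > 0)
n_nonzero <- sum(d != 0)
p_sign <- binom.test(n_pos, n_nonzero, p = 0.5)$p.value

# iii) Wilcoxon's signed rank test (assumes symmetric differences under H0)
p_wsr <- wilcox.test(d, mu = 0)$p.value

c(t = p_t, sign = p_sign, signed_rank = p_wsr)
```

Calling `t.test(x, y, paired = TRUE)` and `wilcox.test(x, y, paired = TRUE)` gives the same two tests without forming `d` by hand.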

Unpaired samples: equality of distributions

$X_1,\dots,X_m$ ~ F; $Y_1,\dots,Y_n$ ~ G

Case 2. Unpaired observations $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$: two independent groups of random variables

m and n may be different

Unpaired samples : t-test

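Two-sample t-test
Assumptions: F normal, mean μ; G normal, mean ν; equal variances

Test statistic:
$$T = \frac{\bar X_m - \bar Y_n}{S_{X,Y}\sqrt{1/m + 1/n}} \sim t_{m+n-2} \quad \text{under } H_0: \mu = \nu, \qquad S_{X,Y}^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$$

If variances not equal: adjusted denominator (Welch). Note: this is the default in R, see ?t.test:

    t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
           paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)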
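To see what `var.equal = TRUE` computes, the pooled-variance statistic can be worked out by hand and compared with `t.test`. The data below are simulated and purely illustrative.

```r
# Simulated unpaired samples (hypothetical), equal variances by construction
set.seed(2)
x <- rnorm(12, mean = 5, sd = 2)
y <- rnorm(15, mean = 6, sd = 2)
m <- length(x)
n <- length(y)

# Pooled variance and two-sample t statistic
s2_pooled <- ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / m + 1 / n))

# Two-sided p-value from the t distribution with m + n - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = m + n - 2)

# Same test via t.test; note that the default var.equal = FALSE
# would give the Welch version instead
fit <- t.test(x, y, var.equal = TRUE)
```

Here `fit$statistic` and `fit$p.value` should agree with `t_manual` and `p_manual` up to rounding.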

6.3.1. Unpaired samples: median test

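Median test
Assumptions: F, G continuous distributions

Test statistic:
$$T = \#\{i : X_i \le \operatorname{med}(X_1,\dots,X_m,Y_1,\dots,Y_n)\} \sim \mathrm{Hyp}(m+n,\ m,\ p) \quad \text{under } H_0$$
with p the number of observations in the combined sample that are at most the combined median, i.e. “half” of the total number of observations; nonparametric

Suited (efficient) for shift alternatives: H1: $G = F(\cdot - \theta)$

Does not use much of the information in the data (NB. This is not Mood’s median test)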
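R has no built-in function for this version of the median test, so it has to be computed with `phyper`. The helper `median_test_right` below is a hypothetical name, and it assumes the statistic counts the first-sample observations at or below the combined median, so that large values point to the first sample being the smaller one.

```r
# Hypothetical helper: median test with a right-sided p-value.
# T = number of x-values at or below the median of the combined sample.
median_test_right <- function(x, y) {
  combined <- c(x, y)
  med <- median(combined)
  k <- sum(combined <= med)   # "half" of the total number of observations
  t_obs <- sum(x <= med)      # observed value of T
  # Under H0, T ~ hypergeometric: k draws from length(x) "white" and
  # length(y) "black" balls. P(T >= t_obs) = 1 - P(T <= t_obs - 1).
  p_right <- phyper(t_obs - 1, length(x), length(y), k, lower.tail = FALSE)
  list(statistic = t_obs, k = k, p.value = p_right)
}

# Tiny example: all x-values lie below all y-values
res <- median_test_right(c(1, 2, 3, 4), c(5, 6, 7))
res$p.value  # equals 1 / choose(7, 4)
```

Under this form of the statistic, the call `1 - phyper(18, 32, 23, 56/2)` used later in the lecture would correspond to an observed count of 19 and 28 observations at or below the combined median.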

6.3.2. Unpaired samples: Wilcoxon two-sample test (1)

Wilcoxon two-sample test, also called Wilcoxon rank sum test or Mann-Whitney test
Assumptions: F, G continuous distribution functions

Test statistic:
$$W = \sum_{i=1}^{m} R_i, \qquad R_1,\dots,R_m \text{ the ranks of } X_1,\dots,X_m \text{ in the combined sample of size } N = m+n$$

Nonparametric

With ties or for large n, m: normal approximation

Especially suited (efficient) for shift alternatives: H1: $G = F(\cdot - \theta)$
Uses more information from the data than the median test

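Unpaired samples: Wilcoxon two-sample test (2)

Wilcoxon rank sum test
Assumptions: F, G continuous distributions

Test statistic: based on $R_1,\dots,R_m$, the ranks of $X_1,\dots,X_m$ in the combined sample of size N = m+n

Equivalent test statistics used under the same name:

- $U = \#\{(i,j) : X_i < Y_j\}$ and $\tilde U = \#\{(i,j) : Y_j < X_i\}$: switched roles of first and second sample
- $W = \sum_{i=1}^{m} R_i$ = sum of ranks of first sample
- $\tilde U = W - \tfrac{1}{2} m(m+1)$: used by R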
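The relation between the rank sum and the statistic reported by `wilcox.test` can be checked numerically. The sketch below uses simulated data without ties; the variable names are illustrative.

```r
# Simulated unpaired samples (hypothetical), continuous so no ties
set.seed(3)
x <- rnorm(8)
y <- rnorm(10, mean = 0.5)
m <- length(x)

# Rank sum of the first sample in the combined sample
r <- rank(c(x, y))
W <- sum(r[1:m])

# Statistic as reported by R: rank sum minus its minimum possible value
U_R <- W - m * (m + 1) / 2

# The same number counted directly: pairs (i, j) with y_j < x_i
U_pairs <- sum(outer(x, y, ">"))

fit <- wilcox.test(x, y)
c(rank_sum = W, reported_by_R = unname(fit$statistic), pair_count = U_pairs)
```

Because these statistics differ only by a constant or by switching the samples, they lead to the same test; only the direction of the rejection region has to be matched to the statistic used.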

6.3.3. Unpaired samples: Kolmogorov-Smirnov test

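Kolmogorov-Smirnov two-sample test
Assumptions: F, G have continuous distribution functions

Test statistic:
$$D_{m,n} = \sup_{x} \left| \hat F_m(x) - \hat G_n(x) \right|$$
with $\hat F_m$ and $\hat G_n$ the empirical distribution functions of the two samples; it depends on the data only through the ranks of $X_1,\dots,X_m$ in the combined sample of size N = m+n, hence nonparametric

Especially suited (efficient) for general alternatives: F and G need not have the same shape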
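As an illustration, the two-sample Kolmogorov-Smirnov distance can be computed from the two empirical distribution functions and compared with `ks.test`. The data are simulated with deliberately different shapes; this is a sketch, not course code.

```r
# Simulated samples (hypothetical) with different shapes
set.seed(4)
x <- rnorm(15)
y <- rexp(20)

# Empirical distribution functions of both samples
Fm <- ecdf(x)
Gn <- ecdf(y)

# Both are step functions that jump only at observed values, so the
# supremum of |Fm - Gn| is attained at a point of the combined sample
pts <- sort(c(x, y))
D_manual <- max(abs(Fm(pts) - Gn(pts)))

fit <- ks.test(x, y)
c(manual = D_manual, ks.test = unname(fit$statistic))
```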

6.3.4. Unpaired samples: permutation tests

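Permutation tests for unpaired data
Assumptions: F, G have continuous distribution functions

Test statistic: $T = T(X_1,\dots,X_m,Y_1,\dots,Y_n)$ that gives information about the difference between F and G, e.g. $\bar X_m - \bar Y_n$, $\operatorname{med}(X_1,\dots,X_m) - \operatorname{med}(Y_1,\dots,Y_n)$, etc.

Test conditionally on the ordered combined sample. Write $T^{\pi}$ for the statistic computed after permuting the combined sample with π and dividing it again into a first sample of size m and a second sample of size n, and t for the observed value. The right-sided p-value is
$$p = \frac{1}{(m+n)!} \sum_{\pi} 1\{ T^{\pi} \ge t \}$$

Nonparametric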

Unpaired samples: illustration (1)


F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G (what does this mean??)

Note: for the different tests this H1 becomes in R:
- t.test: difference in expectations is less than 0;
    > t.test(x, z, alternative="less")
- median test: difference in location is less than 0; compute yourself with
    > 1 - phyper(18, 32, 23, 56/2)   # check where the numbers come from!!
- Mann-Whitney/Wilcoxon: difference in location is less than 0;
    > wilcox.test(x, z, alternative="less")
- Kolmogorov-Smirnov: CDF of x lies above that of z;
    > ks.test(x, z, alternative="greater")
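One common way to make "stochastically smaller" precise is given below. It is the standard textbook definition, stated here as background and not as the wording of the course notes; X and Z denote random variables with distributions F and G.

```latex
F \text{ is stochastically smaller than } G
\iff F(x) \ge G(x) \ \text{for all } x
\iff P(X > x) \le P(Z > x) \ \text{for all } x,
\qquad X \sim F,\ Z \sim G .
```

So X tends to take smaller values than Z, and the distribution function of X lies above that of Z, which matches `alternative="greater"` in the `ks.test` call. A shift alternative $G = F(\cdot - \theta)$ with $\theta > 0$ is a special case.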

Example

Unpaired samples: illustration (2) (Continued)

F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

p-values:
- t.test: 0.12
- median test: 0.11
- Mann-Whitney/Wilcoxon: 0.20 (normal approximation was used, due to ties)
- Kolmogorov-Smirnov: 0.31 (R-warning)
H0 is not rejected by any of these tests.

Note: we have performed all tests, but
- the t-test is not a good candidate, because based on the plots the data are not likely to be normal;
- whether the distributions have the same shape, i.e. whether a shift alternative is a good choice, is not clear: the shapes look similar, but the sd’s are quite different. If it is, then the median and Mann-Whitney tests are good in terms of power;
- Kolmogorov-Smirnov is a good test also for general types of alternatives. There are ties here and R does not know how to adjust for this, so consider the p-value as an approximation.

Example

6.4. Paired samples: correlation

$X_1,\dots,X_n$ ~ F; $Y_1,\dots,Y_n$ ~ G

Only for paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$: are $X_i$ and $Y_i$ correlated?

How to start the investigation? Make a scatter plot

Measures of correlation?
- (Pearson’s) sample correlation
- Kendall’s rank correlation
- Spearman’s rank correlation

Can all be used for testing

6.4.1/6.4.2. Paired samples: tests for correlation

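Only for paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$:

H0: $X_i$ and $Y_i$ independent for all i

(Pearson’s) (linear) correlation test
Assumptions: F normal, G normal

Test statistic:
$$T = \frac{r_{xy}\sqrt{n-2}}{\sqrt{1 - r_{xy}^2}} \sim t_{n-2} \quad \text{under } H_0$$
with $r_{xy}$ the sample correlation

Kendall’s rank correlation test, Spearman’s rank correlation test
Assumptions: F, G continuous distribution functions

Test statistic: Kendall’s $\tau$ and Spearman’s $r_s$, resp. Both based on ranks: nonparametric

R: cor.test(x, y, method = "pearson" / "kendall" / "spearman", …)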
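A short sketch of the three `cor.test` variants on simulated paired data (the data-generating model is an assumption made for illustration). It also checks two facts by hand: the Pearson test statistic is a function of the sample correlation, and Spearman's coefficient is the Pearson correlation of the ranks.

```r
# Simulated paired data (hypothetical) with positive dependence
set.seed(5)
n <- 25
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

pearson  <- cor.test(x, y, method = "pearson")
kendall  <- cor.test(x, y, method = "kendall")
spearman <- cor.test(x, y, method = "spearman")

# Pearson: t statistic with n - 2 degrees of freedom from the sample correlation
r <- cor(x, y)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Spearman: Pearson correlation applied to the ranks
rho_manual <- cor(rank(x), rank(y))

c(pearson = pearson$p.value, kendall = kendall$p.value,
  spearman = spearman$p.value)
```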

6.4.3. Paired samples: permutation test for correlation

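Only for paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$:

H0: $X_i$ and $Y_i$ independent for all i

Permutation tests
Assumptions: F, G continuous distribution functions

Test statistic: $T = T((X_1,Y_1),\dots,(X_n,Y_n))$ that gives information about the dependence between $X_i$ and $Y_i$, e.g. Kendall’s $\tau$, Spearman’s $r_s$

Test conditionally on the first sample combined with the ordered second sample. With t the observed value and π running over all n! permutations of 1,…,n, the right-sided p-value is
$$p = \frac{1}{n!} \sum_{\pi} 1\{ T((X_1,Y_{\pi(1)}),\dots,(X_n,Y_{\pi(n)})) \ge t \}$$

Conditional, so results differ from the earlier tests based on the same statistics

Nonparametric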

Permutation tests (1)

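1. Unpaired observations $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$
2. Paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$, m = n

Bootstrap if the p-value is not computable exactly: generate a large number B of randomly chosen permutations $\pi_1,\dots,\pi_B$, and approximate the p-value by a fraction. Write $T^{\pi}$ for the test statistic computed from the data permuted with π, and t for its observed value (right-sided case):

1. Replace
$$\frac{1}{(m+n)!} \sum_{\pi} 1\{ T^{\pi} \ge t \}$$
by
$$\frac{1}{B} \sum_{b=1}^{B} 1\{ T^{\pi_b} \ge t \}$$

2. Replace
$$\frac{1}{n!} \sum_{\pi} 1\{ T^{\pi} \ge t \}$$
by
$$\frac{1}{B} \sum_{b=1}^{B} 1\{ T^{\pi_b} \ge t \}$$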

Permutation tests (2)

1. Unpaired observations $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$
2. Paired observations $(X_1,Y_1),\dots,(X_n,Y_n)$, m = n

How to permute in both cases?
Instead of a permutation π of 1,…,m+n and of 1,…,n, resp., it is easier to permute the data:

1. Permute $(X_1,\dots,X_m,Y_1,\dots,Y_n)$ and make a new division into two samples of m and n observations.

2. Permute $(Y_1,\dots,Y_n)$, leave $(X_1,\dots,X_n)$ as it is, and make new pairs.
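The second recipe, for paired data, might look as follows in R. The function name `perm_cor_test`, the default `B = 2000` and the simulated data are assumptions made for this sketch; the first recipe works the same way with `sample` applied to the combined sample.

```r
# Hypothetical helper: right-sided permutation test for dependence.
# Permutes the second sample only, keeps the first as it is, and so
# forms new pairs in every replication.
perm_cor_test <- function(x, y, B = 2000, method = "spearman") {
  t_obs <- cor(x, y, method = method)
  t_perm <- replicate(B, cor(x, sample(y), method = method))
  mean(t_perm >= t_obs)  # fraction of permuted values at least as large
}

# Simulated paired data (hypothetical) with positive dependence
set.seed(6)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

p_spearman <- perm_cor_test(x, y, method = "spearman")
p_kendall  <- perm_cor_test(x, y, method = "kendall")
```

Since the permutations are drawn at random, repeated runs without a fixed seed give slightly different p-values; a larger B reduces this variation.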

Permutation test: illustration for unpaired samples (1)


F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

If interested in specific characteristics: permutation test
Unpaired data: conditional test, given the sorted values of the combined sample:

1. Choose T; suppose H0 is rejected for small (large) values of T.
2. Then B times, do:
   - generate a random permutation of (x1,…,x32,z1,…,z23) with the R-function 'sample' and make a new division into two samples of 32 and 23 observations:
     xperm = first 32 elements of permuted data
     zperm = last 23 elements of permuted data
   - determine T(xperm, zperm)
3. Count the fraction of the B values T(xperm, zperm) that are smaller (larger) than T(x,z) of the thrombo data: this is the p-value
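The steps above can be sketched in R as follows. The function name `perm_test_left`, the default `B = 2000` and the simulated stand-in data are assumptions; the thrombo data themselves are not reproduced here. The sketch counts permuted values that are smaller than or equal to the observed one, which coincides with "smaller" when T has no ties.

```r
# Hypothetical helper: left-sided permutation test for unpaired samples.
# 'stat' is any function T(first sample, second sample).
perm_test_left <- function(x, z, stat, B = 2000) {
  m <- length(x)
  n <- length(z)
  combined <- c(x, z)
  t_obs <- stat(x, z)
  t_perm <- replicate(B, {
    perm <- sample(combined)                 # random permutation of all data
    stat(perm[1:m], perm[(m + 1):(m + n)])   # new division into m and n
  })
  mean(t_perm <= t_obs)  # fraction of permuted values not above observed T
}

# Simulated stand-in data (hypothetical), same sizes as in the example
set.seed(7)
x <- rgamma(32, shape = 4, scale = 15)
z <- rgamma(23, shape = 3, scale = 25)

p_mean   <- perm_test_left(x, z, function(a, b) mean(a) - mean(b))
p_median <- perm_test_left(x, z, function(a, b) median(a) - median(b))
p_sd     <- perm_test_left(x, z, function(a, b) sd(a) - sd(b))
```

Passing the statistic as a function keeps the permutation loop identical for means, medians, standard deviations or a rank statistic.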

Example

Permutation test: illustration for unpaired samples (2) (Continued)

F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G

Results of permutation tests
- Permutation test for difference in mean: T = mean(X) - mean(Y)
  Left p-value: 0.096 (bootstrap approximation) (several times: 0.107, 0.091, 0.114, …)
- Permutation test for difference in median: T = median(X) - median(Y)
  Left p-value: 0.165 (bootstrap approximation) (several times: 0.163, 0.19, 0.18, …)
- Permutation test for Mann-Whitney: T = U-tilde
  Left p-value: 0.183 (bootstrap approximation) (several times: 0.195, 0.214, 0.201, …) (around the value for the unconditional Mann-Whitney test)
- Permutation test for difference in sd: T = sd(X) - sd(Y)
  Left p-value: 0.045 (bootstrap approximation) (several times: 0.043, 0.053, 0.059, …)

Example

Recap

6. Nonparametric methods (continued)

6.3. Two samples: nonparametric tests for equality of distributions
6.3.1. Median test
6.3.2. Wilcoxon two-sample test
6.3.3. Kolmogorov-Smirnov two-sample test
6.3.4. Permutation tests
6.3.5. Asymptotic efficiency (read yourself)

6.4. Two samples: nonparametric tests for correlation
6.4.1. Rank correlation test of Spearman
6.4.2. Rank correlation test of Kendall
6.4.3. Permutation tests

Nonparametric methods for two sample problems

The end
