Discrete Markov Chain Monte Carlo

Chapter 2: Randomization tests

There is a name for the method used by Connor and Simberloff to compare the observed number of checkerboards (10 for the finch data) with what you could expect to get if species had distributed themselves purely at random. The method belongs to the general area of statistics called hypothesis testing, and more specifically, the method is an instance of a randomization test. (In ecology, randomization tests are often named for the models they are based on, called null models.) Until the 1980s, randomization tests tended to be limited to comparatively simple kinds of data sets, because there was no general and well-understood method for generating random data sets in more complicated situations. Computers and related theory have been changing that during the last two decades, so that randomization tests have grown in importance, and are now used much more often than in the past. Much of the mathematics in this book is tied to the theory of how to create random data sets in order to carry out randomization tests. The following example, which comes from a court case, illustrates the use of randomization tests in one of the simpler situations where it is easy to see how to generate random data sets.

2.1 Martin vs. Westvaco: An Introduction to Randomization Tests

Robert Martin turned 55 during 1991. Earlier in that same year the Westvaco Corporation, which makes paper products, decided to downsize. They laid off several members of their engineering department, where Bob Martin worked, and he was one of those who lost their jobs. Later that year, he hired a lawyer to sue Westvaco, claiming he had been laid off because of his age. A major piece of Martin's case was based on a statistical analysis of the ages of the employees at Westvaco.

At the time the layoffs began, Bob Martin was one of 50 people working at various jobs in the engineering department of Westvaco's envelope division. Some were paid by the hour; others, like Martin, who had more education and greater responsibility, were salaried. Over the course of the spring, Westvaco's management went through five rounds of planning for a reduction in force. In Round 1, they decided to eliminate 11 positions. In Round 2 they added 9 more to the list. By the time the layoffs ended, after all 5 rounds, only 22 of the 50 workers had kept their jobs, and the average age in the department had fallen from 48 to 46.

Display 2.1 shows the data provided by Westvaco to Martin's lawyers. Each row corresponds to one worker, and each column corresponds to some feature: job title, whether hourly or salaried, the dates of birth and hire (month and year), and age as of the first of January, 1991 (shortly before the layoffs). The last column tells how the worker fared in the downsizing: a 1 means chosen for layoff in Round 1 of planning for the reduction in force, a 2 means Round 2, and similarly for 3, 4 or 5; however, 0 means "not chosen for layoff."

Row  Job title                 Pay  Birth  Hire   RIF  Age 1/1/91
                                    mo/yr  mo/yr
 1   Engineering Clerk          H    9/66   7/89   0      25
 2   Engineering Tech II        H    4/53   8/78   0      38
 3   Engineering Tech II        H   10/35   7/65   0      56
 4   Secretary to Engin Manag   H    2/43   9/66   0      48
 5   Engineering Tech II        H    8/38   9/74   1      53
 6   Engineering Tech II        H    8/36   3/60   1      55
 7   Engineering Tech II        H    1/32   2/63   1      59
 8   Parts Crib Attendant       H   11/69  10/89   1      22
 9   Engineering Tech II        H    5/36   4/77   2      55
10   Engineering Tech II        H    8/27  12/51   2      64
11   Technical Secretary        H    5/36  11/73   2      55
12   Engineering Tech II        H    2/36   4/62   3      55
13   Engineering Tech II        H    9/58  11/76   4      33
14   Engineering Tech II        H    7/56   5/77   4      35
15   Customer Serv Engineer     S    4/30   9/66   0      61
16   Customer Serv Engr Assoc   S    2/62   5/88   0      29
17   Design Engineer            S   12/43   9/67   0      48
18   Design Engineer            S    3/37   6/74   0      54
19   Design Engineer            S    3/36   2/78   0      55
20   Design Engineer            S    1/31   3/67   0      60
21   Engineering Assistant      S    6/60   7/86   0      31
22   Engineering Associate      S    2/57   4/85   0      34
23   Engineering Manager        S    2/32  11/63   0      59
24   Machine Designer           S    9/59   3/90   0      32
25   Packaging Engineer         S    3/38  11/83   0      53
26   Prod Spec - Printing       S   12/44  11/74   0      47
27   Proj Eng-Elec              S    9/43   4/71   0      48
28   Project Engineer           S    7/49   9/73   0      42
29   Project Engineer           S    8/43   4/64   0      48
30   Project Engineer           S    6/34   8/81   0      57
31   Supv Engineering Serv      S    4/54   6/72   0      37
32   Supv Machine Shop          S   11/37   3/64   0      54
33   Chemist                    S    8/22   4/54   1      69
34   Design Engineer            S    9/38  12/87   1      53
35   Engineering Associate      S    2/61   9/85   1      30
36   Machine Designer           S    2/39   4/85   1      52
37   Machine Parts Cont-Supv    S   10/28   8/53   1      63
38   Prod Specialist            S    9/27  10/43   1      64
39   Project Engineer           S    7/25   9/59   1      66
40   Chemist                    S   12/30  10/52   2      61
41   Design Engineer            S    4/60   5/89   2      31
42   Electrical Engineer        S   11/49   3/86   2      42
43   Machine Designer           S    3/35  12/68   2      56
44   Machine Parts Cont Coor    S    9/37  10/67   2      54
45   VH Prod Specialist         S    5/35   9/55   2      56
46   Printing Coordinator       S    2/41   1/62   3      50
47   Prod Dev Engineer          S    6/59  11/85   3      32
48   Prod Specialist            S    7/32   1/55   4      59
49   VH Prod Specialist         S    3/42   4/62   4      49
50   Engineering Associate      S    8/68   5/89   5      23

Pay: H = hourly, S = salaried. RIF: round in which the worker was chosen for layoff (0 = not chosen).

Display 2.1 The data in Martin versus Westvaco.

On balance, the patterns in the Martin data show that the percentage of people laid off was higher for older workers than for younger ones. One of the main arguments in the case was about what those patterns mean: are the patterns "real," or could they be due just to natural variation? There's no way to repeat Westvaco's actual decision process, which means there's no way to measure the variability in that process. In fact, it's hard to say precisely what "natural variability" really means. It is possible, however, first to define a simple, artificial, age-neutral decision process (a null model), then to repeat that process, and use the results to ask whether that process is variable enough to give results as extreme as Westvaco's.

A comprehensive analysis to answer that question would be quite involved. For now, though, you can get a pretty good idea of how the analysis goes by working with just a subset of the data. Here are the ages and Row IDs of the ten hourly workers involved in the second of the five rounds of layoffs, arranged from youngest to oldest. The three that were laid off are marked with an asterisk:

Age:     25  33  35  38  48  55  55*  55*  56  64*
Row ID:   1  13  14   2   4  12   9    11   3  10

What to make of the data requires balancing two points of view. On one hand, the pattern in the data is pretty striking. Of the five people under age 50, all kept their jobs. Of the five who were 55 or older, only two kept their jobs. On the other hand, the numbers of people involved are pretty small: just three out of ten. Should you take seriously a pattern involving so few people? The two viewpoints correspond to two sides of an argument that was at the center of the statistical part of the Martin case. Here's a simplified version.

Martin: Look at the pattern in the data: All three of the workers laid off were much older than average. That's evidence of age bias.

Westvaco: Not so fast! You're only looking at ten people total, and only three jobs were eliminated. Just one small change and the picture would be entirely different. For example, suppose it had been the 25-year-old instead of the 64-year-old who was laid off. Switch the 25 and the 64, and you get a totally different set of averages (* marks those laid off):

Actual data:   25   33  35  38  48  55  55*  55*  56  64*
Altered data:  25*  33  35  38  48  55  55*  55*  56  64

Average ages:

               Laid off   Kept
Actual data      58.0     41.4
Altered data     45.0     47.0

See! Just one small change and the average age of the three who were laid off is actually lower than the average age of the others.

Martin: Not so fast, yourself! Of all the possible changes, you picked the one that is most favorable to your side. If you'd switched one of the 55-year-olds who got fired with the 55-year-old who kept his job, the averages wouldn't change at all.

Why not compare what actually happened with all the possibilities that might have happened? Start with the ten workers, and pick three at random. Do this over and over, to see what typically happens, and compare the actual data with these results.

Westvaco: But you'd be ignoring relevant information, things like worker qualifications, and which positions were easiest to do without.

Martin: I agree. But you're changing the subject. Remember our question: "Is the sample large enough to support a conclusion?" That's a pretty narrow question. It doesn't say anything about why the workers were chosen. At this point, we're just asking "If you treat all ten workers alike, and pick three at random without regard to age, how likely is it that their average age will be 58 or more?"

You can use simulation to estimate the probability p that if you draw three workers at random, just by chance you will get an average age of 58 years or more.

Randomization tests by simulation: generate, compare, estimate

Generate a large number (NRep) of random data sets. Here, each "data set" is a random subset of 3 workers' ages chosen from the 10.

Compare each random data set with the actual data. Is the average age for the random data set greater than or equal to 58? (Yes/No)

Estimate the probability p by the observed proportion of Yes answers:

p̂ = (# Yes)/(# Repetitions).

If p̂ is tiny, you know that an average age of 58 is too extreme to occur just by chance. Some other explanation is needed.

For most applications, it will be necessary to carry out the three steps on a computer, but this deliberately simplified example is one you can do by drawing marbles out of a bucket or the equivalent.

Activity 2.1 (Physical simulation): Did Westvaco Discriminate?

Step 1. Generate random data sets

Write each of the ten ages on identical squares cut from 3x5 cards, and put them in a box: 25, 33, 35, 38, 48, 55, 55, 55, 56, 64.

Mix the squares thoroughly and draw out three at random without replacement.

Step 2. Compare each random data set with the actual data (55, 55, 64)

Compute the average age for the sample. Is the value ≥ 58? (Record Yes or No.)

Step 3. Estimate the value of p using the observed proportion.

Repeat Steps 1 and 2 ten times. Combine your results with those from the rest of the class before you compute the proportion

p̂ = (# Yes) / (# Repetitions).

Your chance model in the physical simulation is completely age neutral: All sets of three workers have exactly the same chance of being selected for layoff, regardless of age. The simulation tells you what sort of results are reasonable to expect from that sort of age-blind process. Here are the first four of 1000 repetitions from such a model:

Simulation (the three ages drawn, the "laid off," were underlined in the original)      Average age

25 33 35 38 48 55 55 55 56 64      42.67
25 33 35 38 48 55 55 55 56 64      48.00
25 33 35 38 48 55 55 55 56 64      42.67
25 33 35 38 48 55 55 55 56 64      37.00

Display 2.2 is a plot that shows the distribution of average ages for 1000 repetitions of the sampling process.

[Display 2.2 is a histogram of the 1000 averages: the horizontal axis shows "Average age of those chosen," from 30 to 60; the vertical axis shows "Number of times," from 0 to 50.]

Display 2.2 Results of 1000 repetitions

The distribution of average age of those chosen for layoff by the chance model

Out of 1000 repetitions, only 49, or about 5%, gave an average age of 58 or older. So it is not at all likely that just by chance you'd pick workers as old as the three Westvaco picked. Did the company discriminate? There's no way to tell just from the numbers alone. However, if your simulations had told you that an average of 58 or older is easy to get by chance alone, then the data would provide no evidence of discrimination. If, on the other hand, it turns out to be very unlikely to get a value this big just by chance, statistical logic says to conclude that the pattern is "real," that is, more than just coincidence. It is then up to the company to explain why their decision-making process led to such a large average age for those laid off.

The logic of the last paragraph may take some time to get used to, but it can help to recast the logic in the form of a real argument between two people. Here's an imaginary version of such an argument.

Martin: Look at the pattern in the data: All three of the workers laid off were much older than average.

Westvaco: So what? I claim you could get a result like that just by chance. If chance alone can account for the pattern, there's no reason to look for any other explanation.

Martin: OK, let's test your claim. If it's easy to get an average as big as 58 by drawing at random, I'll agree that we can't rule out chance as one possible explanation. But if an average that big is really hard to get from random draws, we agree that chance alone can't account for the pattern. Right?

Westvaco: Right.

Martin: Here are the results of my simulations. If you look at the three hourly workers laid off in round two, the probability is only 5% that you could get an average age of 58 or more. And if you do the same computations for the entire engineering department, the probability is a lot less, about 0.01, or one out of 100. What do you say to that?

Westvaco: Well ... I'll agree that it's really hard to get patterns that extreme just by chance, but that by itself still doesn't prove discrimination.

In principle we can apply the same three steps to the finch data, using the number of checkerboards in place of average age for comparing data sets in Step 2. Our estimate in Step 3 would then give an answer to the question Connor and Simberloff asked: "If you generate data sets purely at random, so that each data set has the same chance as each of the others, how likely are you to get 10 or more checkerboards?" Although in principle this three-step approach will answer the question, in practice it is hard to carry out Step 1, because there is no quick and simple way to generate random data sets. In a sense, much of the rest of this book deals with the mathematics of solving this problem, along with related questions that have yet to be answered.

2.2 An informal introduction to S-Plus

A useful reference: http://lib.stat.cmu.edu/S/cheatsheet

Opening a new script file in S-Plus:

Click on the S-Plus icon

Click OK to use existing data

Start a new script file (File > New > Script file)

Warm-up

There is a standard statistical vocabulary to describe choosing a random subset from some larger set: The larger set that you choose from is called a population; the random subset that you choose is called a sample. In the Martin example, the population is the set of ten ages {25, 33, 35, 38, 48, 55, 55, 55, 56, 64}. The set of three chosen (e.g., {55, 55, 64}) is the sample.

Several sets of lines of S-Plus code are shown below. For each set of lines, first make a guess about what the code will do. Then type the code into the top part of the split window of the script file. This is where you can enter and edit code. (Commands and keystrokes for editing are pretty much the same as in Microsoft Word.) Finally, click on the “run” button, the solid triangle in the left margin of the second toolbar, in the column below File. This will execute your code in the bottom half of the window, and let you check whether your guess was correct.

Populations as vectors

1a

Pop <- c(0,0,0,0,0,1,1,1,1,1)

Pop

1b

zeros <- rep(0,5)

zeros

1c

ones <- rep(1,5)

Pop <- c(zeros, ones)

Pop

1d

Pop <- rep(c(0,1),5)

Pop

1e

sort(Pop)

Pop

Populations and samples

2a

sample(Pop,3,replace=F)

2b

sum(sample(Pop,3,replace=F))

2c

Pop2 <- c(25,33,35,38,48,55,55,55,56,64)

sample(Pop2,3,replace=F)

2d

fired <- sample(Pop2,3,replace=F)

fired

mean(fired)

mean(fired) >= 58

2e

mean(sample(Pop2,3,replace=F)) >= 58

Using a programming loop to create many samples

Read through the following S-Plus code to see what it does. Note that a # separates comments from the executable code.

3

# Draw random samples of size 3, without replacement, from a given population,

# and determine whether the average is >= 58.

# Repeat this process NRep times, and find the proportion of samples that

# have an average age of 58 or more.

#

NRep <- 1000 # NRep is the number of repetitions (= number of samples)

#

NYes <- 0 # NYes will keep track of how many samples have a mean

# of 58 or more.

#

for (i in 1:NRep) # This is the S-Plus language for a loop. The commands

# enclosed between { and } will be executed NRep times,

# once for each value of i

#

{ # Begin the body of the loop

#

#

NYes <- NYes + (mean(sample(Pop2,3,replace=F)) >= 58)

#

} # End the loop

#

pHat <- NYes/NRep # Compute the observed proportion

pHat # Print the value of pHat

Exercises: the Martin case

Here are the ages of the hourly workers at the time of each of the first four rounds of layoffs. Those chosen in the given round are marked with an asterisk; those already chosen in a previous round are shown in brackets:

Round 1:  22*  25   33   35   38  48  53*   55*   55   55   55   56  59*   64
Round 2: [22]  25   33   35   38  48  [53] [55]  55*  55*   55   56  [59]  64*
Round 3: [22]  25   33   35   38  48  [53] [55] [55] [55]  55*   56  [59] [64]
Round 4: [22]  25  33*  35*   38  48  [53] [55] [55] [55] [55]   56  [59] [64]

4. Guess the p-value for each of Rounds 1, 3, and 4. (Note that for Round 3, you don't need to guess: you can use logic to figure out the p-value.)

5. Use S-Plus to estimate the p-values for Rounds 1, 3, and 4.

Preliminary Investigation: How many replications do you need?

6. Go back to the data for Round 2 of the reduction in force, and use the S-Plus code to get values of p̂ for each of the following values of NRep:

1, 5, 25, 100, 500, 2500, 10000

(The last one may take 30 seconds or so.) Then put your values of p̂ versus NRep in a table, along with the values from the others in the class:

NRep = # samples     Values of p̂
     1
     5
    25
   100
   500
  2500
 10000

Based on all the data, what is your best estimate for the value of p? Make a rough plot by hand of p̂ versus NRep. Describe, as quantitatively as you can, the pattern that relates the variability in the values of the estimates to the number of repetitions. Roughly how many repetitions are needed to be confident that any given estimate will be within .01 of the true value? (This is your first look at a question that you will study more systematically in Chapter 3.)

2.3 Randomization tests, I: The two-sample permutation test

The Martin example is typical of a large class of situations. Here is another instance:

Example 1. Calcium and blood pressure.

To test whether taking calcium supplements can reduce blood pressure, investigators used a chance device to divide 21 male subjects into two groups. One group of 10 men, the treatment group, were given calcium supplements and told to take them every day for 12 weeks. The other 11 men, the control group, were given pills that looked the same as the supplements (a placebo), and given the same instructions: take one every day. Neither the subjects themselves nor the people giving out the pills and taking blood pressure readings knew which pills contained the calcium. (The experiment was double blind.) Subjects had their blood pressure read at the beginning of the study and again at the end. The numbers below tell the reduction in systolic blood pressure (when the heart is contracted), in millimeters of mercury. (Positive values are good; negative values mean that the blood pressure went up.)

Calcium: 7, -4, 18, 17, -3, -5, 1, 10, 11, -2

Placebo: -1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3

Here are the same numbers, arranged in order, with the values in the treatment (calcium) group marked with an asterisk:

-11  -5*  -5  -4*  -3*  -3  -3  -2*  -1  -1  -1  1*  2  3  5  7*  10*  11*  12  17*  18

Notice that for this example, as for Martin, there are two groups to compare, in this instance those assigned to the treatment group, and those assigned to the placebo group. Here also, as in the Martin example, the information we have available for comparing the two groups is quantitative, and we can judge the results using the average reduction in blood pressure for the calcium group, which was 5 millimeters of mercury.

Was the calcium supplement effective in lowering blood pressure? Here's how the logic goes: The only differences between the two groups were (1) the calcium, and (2) differences created by the random assignment. Assume for the moment that the calcium had no effect. Then the observed reduction of 5 mm Hg in the calcium group was due purely to chance, that is, to the random assignment. To see whether chance is a believable way to account for the average of 5, we ask, "If you take the 21 blood pressure values, and choose 10 of them at random, how likely is it that you'll get an average of 5 or more?" If this probability, the p-value, is tiny, we conclude that chance is not a believable explanation; the reduction must be due to the calcium treatment.
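Exercise 7 below asks you to carry out this simulation. Here is one way the loop from Section 2.2 might be adapted (a sketch, not code from the original; the vector names Calcium, Placebo, and Pop3 are mine):

Calcium <- c(7,-4,18,17,-3,-5,1,10,11,-2)     # treatment group
Placebo <- c(-1,12,-1,-3,3,-5,5,2,-11,-1,-3)  # control group
Pop3 <- c(Calcium, Placebo)                   # combine all 21 values into one population
NRep <- 1000
NYes <- 0
for (i in 1:NRep)
{
NYes <- NYes + (mean(sample(Pop3,10,replace=F)) >= mean(Calcium))
}
NYes/NRep                                     # estimated p-value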

Exercise:

7. (a) I used 10,000 repetitions to estimate this probability, and got 0.0813. What do you conclude? (b) Use the S-Plus code from before to compute the p-value, with NRep = 1000. How far is your estimate from mine? Which value is more reliable?

Example 2. Hospital carpets.

In a hospital, noise can be an irritation that interferes with a patient’s recovery. Putting down carpeting in the rooms would cut down on noise, but the carpeting might tend to harbor bacteria. To study this possibility, doctors at a Montana hospital conducted an experiment to see whether rooms with carpeting had higher levels of airborne bacteria than rooms with bare floors. They began with 16 rooms and randomly chose eight to have carpeting installed. The other eight were left bare. At the end of their test period, they pumped air from each room over a culture medium (agar in a petri dish), allowed enough time for the bacterial colonies to grow, and recorded the number of colonies per cubic foot of air. Here are the results:

Carpeted floors              Bare floors
Room #   Colonies/cu.ft.     Room #   Colonies/cu.ft.
 212        11.8              210        12.1
 216         8.2              214         8.3
 220         7.1              215         3.8
 223        13.0              217         7.2
 225        10.8              221        12.0
 226        10.1              222        11.2
 227        14.6              224        10.1
 228        14.0              229        13.7
Average     11.2             Average      9.8

Display 2.3 Levels of airborne bacteria for 16 hospital rooms

Exercise:

8. Estimate the p-value for testing the hypothesis that carpeting had no effect on the levels of airborne bacteria. (Find the chance that if you choose 8 values at random from the 16 bacteria levels, you’ll get an average of 11.2 or more. Use 10,000 repetitions.)

The three examples, Martin, calcium, and carpets, all have the same abstract structure:

Summary: Two-sample permutation tests

Data: Two groups (samples) of numerical values, n1 in Group 1 and n2 in Group 2.

Test statistic: Average (mean) of the values in Group 1.

Observed value: Group 1 average for the actual data

Null model: All possible ways to choose n1 values (a random sample) from the combined set of n1 + n2 values (the population) are equally likely.

p-value: The chance that the average for a random sample is at least as large as the observed value.
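The recipe above is easy to package as a single S-Plus function. The sketch below is mine, not part of the original text (the name permTest and its arguments are invented), but it follows the summary step by step:

permTest <- function(Group1, Group2, NRep)
{
Pop <- c(Group1, Group2)       # the combined population of n1 + n2 values
Obs <- mean(Group1)            # observed value of the test statistic
NYes <- 0
for (i in 1:NRep)
{
NYes <- NYes + (mean(sample(Pop,length(Group1),replace=F)) >= Obs)
}
NYes/NRep                      # estimated p-value
}

For the calcium data of Example 1, permTest(Calcium, Placebo, 10000) should give an estimate near the 0.0813 reported in Exercise 7.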

Example 3. Speed limits and traffic deaths.

The year 1996 offered an unusual opportunity to scientists who study traffic safety. Until that year, states had to keep highway speeds at 55 miles per hour or below in order to receive federal money. Then, toward the end of 1995, a new federal law took effect, one that allowed states to raise their speed limits. Thirty-two states did just that, either at the beginning of 1996, or at some point during the year. The other 18 states, and the District of Columbia, kept the 55 mph limit. Conventional wisdom had it that increasing the speed limit would lead to more highway deaths. The change in the law gave scientists a chance to test this hypothesis. The numbers in Display 2.4 show the percentage change in numbers of highway traffic deaths between 1995 and 1996, for all 50 states and DC.

AK  -29.0    NH  -20.0    AL   24.5    KS  -13.3    OH    1.6
CT   -4.4    NJ   44.1    AR   41.3    MA   33.3    OK   34.1
DC  -80.0    NY   -9.7    AZ    0.0    MD   -1.8    PA   -7.0
HI  -25.0    OR  -16.4    CA    4.4    MI   -7.9    RI   18.2
IN  -13.2    SC   32.1    CO  -19.1    MO   50.7    SD   22.2
KY    3.4    VA   -9.1    DE   30.0    MS   17.6    TN    4.0
LA   -5.4    VT  -41.2    FL    8.2    MT   17.9    TX   14.8
ME  -14.3    WI   41.4    GA   32.1    NC    5.4    UT    9.4
MN   10.8    WV   23.2    IA   41.4    NE   62.5    WA   34.3
ND  -50.0    ID  -17.9    NM    3.4    WY  -31.5    IL    9.4
NV   17.9

(The original display separated the states that kept 55 mph from the states that raised the speed limit.)

Display 2.4 Percentage change in traffic deaths

Here is how the features of this example correspond to the elements of the abstract summary. The two groups are the states that raised the 55 mph speed limit (Group 1) and those that kept it (Group 2). The test statistic is the average percent change in highway deaths for Group 1. Its observed value is the actual average for those states, which works out to 13.2. The null model is that all ways to choose 32 numbers from the set of 51 listed in the table are equally likely. The p-value is the chance of getting a group average of 13.2 or more; this works out to about 0.005.

Discussion question

9. What would change, and what would be the same, if you defined Group 1 to be the states that didn’t raise their speed limits?

Example 4. O-rings.

The explosion of the Challenger space shuttle has received a lot of attention from statisticians because the disaster and loss of the astronauts' lives could have been prevented by fairly simple data analysis. The explosion was caused by failure of O-ring seals that allowed rocket fuel to leak and explode, and an investigation concluded that the O-ring failures were themselves caused by the low temperature at the time of the launch. The summary below shows the relationship between air temperature at launch time and the number of "O-ring incidents" per launch for 24 launches.

Launch temperature

Below 65°:   1 1 1 3
Above 65°:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2
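For the S-Plus work in Exercises 11 and 12 below, the two groups can be entered directly; a small sketch (the names Below65 and Above65 are mine):

Below65 <- c(1,1,1,3)              # incidents on the 4 launches below 65 degrees
Above65 <- c(rep(0,17),1,1,2)      # incidents on the 20 launches above 65 degrees

The test statistic is the average number of incidents for the cold launches, mean(Below65), which works out to 1.5.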

Exercises:

10. Identify the components (samples, null model, test statistic, observed value, p-value) for the Martin and calcium examples.

11. Guess: Will the p-value for Example 4 turn out to be closest to .5, .1, .05, .01, or .001? After you guess, use S-Plus to estimate the p-value.

12. The faulty analysis before the launch ignored all the 0s in the data and looked only at the temperature on the days for which the launches had problems with the O-rings. Ignoring 0s gives the following summary:

Launch temperature

Below 65°:   1 1 1 3
Above 65°:   1 1 2

Guess: Will the p-value turn out to be closest to .5, .1, .05, .01, or .001? Then use S-Plus to estimate the p-value.

2.4 Randomization tests, II: Fisher's exact test

So far, the data values in the examples have been quantitative: ages, reduction in blood pressure, levels of airborne bacteria. What if the data values are categorical? One of the simplest randomization tests is Fisher’s exact test, which is used to test hypotheses about data that can be summarized in a 2x2 table of counts.

Example 5. The Salem witchcraft hysteria

The year 1692 saw nineteen convicted witches hanged in Salem Village (now Danvers) Massachusetts. Almost three centuries later, historians examining documents related to the trials discovered a striking pattern relating trial testimony and geography. Those who testified against the accused witches tended to live in the western part of Salem Village; those who testified in defense of the accused tended to live in the eastern part, which was wealthier, more commercial, more cosmopolitan, and closer to the town of Salem, at the time the second busiest port in the colonies. A total of 61 residents testified in the trials, of whom 35 lived in the western part of the village, and 26 in the eastern part. Of the 35 “westerners,” 30 were “accusers” and only 5 were “defenders;” of the 26 easterners, only 2 were accusers; the remaining 24 were defenders:

Testimony:       Accuser   Defender   Total   % accuser
Geography
West               30          5        35      85.7%
East                2         24        26       7.7%
Total              32         29        61

Display 2.5 Geography and testimony in the Salem witch trials of 1692

Is it possible to get a pattern as extreme as this just by chance? If the relationship between geography and testimony were purely random, how likely would a pattern this extreme be? Represent the residents who testified by poker chips, 35 marked “West” and 26 marked “East.” Put all 61 chips in a bag, mix thoroughly, and draw out 32 at random. Call these Accusers; count how many of the accuser chips say West, how many say East, and record the results in a table. I just did this, and got:

Testimony:       Accuser   Defender   Total   % accuser
Geography
West               19         16        35      54.3%
East               13         13        26      50.0%
Total              32         29        61

Display 2.6 Results for a sample of accusers drawn at random

My random data set is not nearly as extreme as the actual one.

I repeated the whole process -- random draws, count, compare – 10,000 times, using the S-Plus code in Display 2.7, and not once did I get a table as extreme as the actual data. Conclusion: If you draw at random, it is all but impossible to get a data table like the observed one. In other words, “It’s just a chance relationship” is not a believable explanation for the data.

For the S-Plus simulation, I used 0s and 1s to represent East and West. There were 26 people from the East who testified, and 35 from the West, so my population has 26 0s and 35 1s.

pop <- c(rep(0,26),rep(1,35))

phat <- 0

NRep <- 10000

for (i in 1:NRep){

phat <- phat + (sum(sample(pop,32,replace=F))>=30)/NRep

}

phat

Display 2.7 S-Plus code for drawing random samples of accusers and estimating p

Here’s an abstract version of the same analysis:

Step 1: Generate random data sets

Population: 61 individuals, 35 of them 1s (West) and 26 of them 0s (East).

Sample: A subset of 32.

Null model: The sample is chosen completely randomly; all subsets of 32 are equally likely.

Step 2. Compare random data sets with the actual data.

Test statistic: Number of 1s in the sample (= number of “West” chips among the randomly chosen “accusers.”)

Actual data value: There were, in fact, 30 residents of the western part of Salem Village among the accusers.

Compare: Record a Yes if there are 30 or more 1s in the sample.

Step 3. Estimate.

Out of 10,000 data sets, none had as many as 30 1s.

Because the p-value is so very tiny, we reject the null model. It is not a believable explanation for the actual data.
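For a 2x2 table you do not have to settle for a simulated estimate: under the null model the number of 1s in the sample follows a hypergeometric distribution, so the p-value can be computed exactly. In R's parameterization of dhyper (S-Plus versions may differ, so treat this line as a sketch), drawing 32 accusers from 35 West chips and 26 East chips gives

sum(dhyper(30:32, 35, 26, 32))    # P(30 or more West chips among the 32 drawn)

a probability so far below 1/10,000 that it is no surprise the simulation never produced such a table.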

Drill Exercises:

13. A small version of the witch data.

Assume that only 10 people had testified, of whom 4 lived in the west, 6 in the east. Assume also that 3 were accusers, and that all 3 came from the west. Set up the population, null model, test statistic and observed value. Then find the p-value and state your conclusion.

14. There is more than one way to define the population and null model for S-Plus. Two easy variations are (a) to reverse the labels for 1s and 0s, so that 1 represents East and 0 represents West, and/or (b) to reverse the labels for "in the sample" and "not in the sample," so that the sample represents Defenders, and those not in the sample are the Accusers. For each of these variations, tell what the test statistic would be, and how to compute the p-value. Verify that the p-value is the same for all these variations.

15. A more substantive variation reverses the roles of population and sample: Let the 1s and 0s in the population tell whether an individual was an Accuser or Defender. Let "in the sample" correspond to "West," and "not in the sample" to "East." Define a test statistic, and tell how to modify the S-Plus code in Display 2.7 to compute the p-value. (Optional: Run your modified S-Plus code, and verify the (non-obvious) fact that, apart from random variation, the p-value it gives is equal to the p-value from the original code.)

Example 6. US v. Gilbert

For several years in the 1990s, Kristen Gilbert worked as a nurse in the intensive care unit (ICU) of the Veterans Administration hospital in Northampton, Massachusetts. Over the course of her time there, other nurses came to suspect that she was killing patients by injecting them with the heart stimulant epinephrine. Part of the evidence against Gilbert was a statistical analysis of more than one thousand 8-hour shifts during the time Gilbert worked in the ICU. Was there an association between Gilbert's presence in the ICU and whether or not someone died on the shift? Here are the data:

                           Death on shift?
                         Yes      No     Total    % Yes
K.G. present     Yes      40     217      257     15.6%
on shift?        No       34    1350     1384      2.5%
                 Total    74    1567     1641

Display 2.8 Data on possible association between nurse Gilbert’s presence in the ICU of the Northampton VA hospital and deaths on a shift.

Drill Exercises:

16. Define the population of 0s and 1s: What does a 0 represent? A 1? How many 0s, and how many 1s, are in the population? (Note: There is more than one right way to do this.)

17. Define the null model: If you think of drawing a random sample from the population, what does "drawn out" (that is, in the sample) represent? What does "not drawn out" represent?

18. Define the test statistic: What does the number of 1s in the sample represent? What is the observed value of the test statistic?

19. p-value. Modify the S-Plus code in Display 2.7 so that it would compute the p-value. (Optional: Compute the p-value using your code.)

Example 7. Anthrax.

After Senator Tom Daschle received a letter containing anthrax, the Hart Senate Office Building was fumigated in an attempt to kill the spores. After the first fumigation, public health officials conducted a multi-part test to see whether the building was safe to work in. In the first phase of the test, 17 strips capable of detecting live anthrax spores were placed throughout the test area, and later were checked for anthrax. Five of the 17 were positive. In the second phase of the test, another 17 strips were placed in the same locations, but this time suitably protected technicians walked around on the carpet, moving the room air in the process, to simulate normal office traffic. This time 16 of 17 strips were positive. The results of these tests led to a second, more vigorous, and successful fumigation.

Exercises

20. Summarize the test results in a 2x2 table.

21. Define a suitable population, null model, and test statistic for Fisher’s exact test.

22. Use S-plus to compute an appropriate p-value.

Discussion question:

23. Just as each strip in the test for anthrax can show a false positive (indicating anthrax when none is present) or a false negative (indicating no anthrax when it is in fact present), statistical tests can also show false positives and false negatives. For the study of calcium supplements, what would a false positive be? A false negative?

Summary. Fisher’s exact test is appropriate when (1) you want to compare two randomly chosen groups of individuals, and (2) the feature of the individuals that you use to make the comparison is dichotomous – reducible to yes/no. Think of randomly drawing marbles from a bucket. This gives two randomly chosen groups: those drawn out, and those left in. The marbles are of two colors; color is the feature used to compare the two groups. For the Salem witch data (Example 5), we asked, “What if the ‘accusers’ and ‘defenders’ had been chosen at random?” In that example, the actual accusers and defenders were not chosen at random, but we wanted to test whether the observed data was consistent with random selection. So under our null model, the ‘accusers’ and ‘defenders’ were the randomly chosen groups. The feature used for the comparison was geography, east or west. For the Gilbert data (Example 6) we asked, “What if the deaths had occurred on randomly chosen shifts?” Here, also, the actual shifts were not chosen that way, but we wanted to compare the actual data with what we would be likely to get if the shifts had been chosen randomly. Thus shifts with and without deaths were the randomly chosen groups. The feature used for comparing groups was whether or not Gilbert was present on the shift.

2.5 Randomization tests, III: Variations

Variation 1: Dichotomizing a numerical variable.

Although Fisher’s exact test is designed for dichotomous populations – those with just two kinds of individuals – it is possible to use the test when the feature you use to compare groups is quantitative. To turn a quantitative variable into a dichotomous one, pick a threshold value of the variable, and replace its actual value with a Yes or No answer to the question “Is the value of the variable greater than or equal to the threshold?” Once you’ve replaced the numbers with Yes/No answers, you can carry out Fisher’s exact test.
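In S-Plus the recoding takes one comparison. A small sketch for the Martin ages, reusing the Pop2 vector from Section 2.2:

Pop2 <- c(25,33,35,38,48,55,55,55,56,64)
YN <- as.numeric(Pop2 >= 40)   # 1 = Yes (at or above the threshold), 0 = No
YN                             # 0 0 0 0 1 1 1 1 1 1

A random set of three layoffs is then sample(YN,3,replace=F), and its sum counts the Yes answers.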

Example 8. Martin vs. Westvaco

Here, once again, are the ages of the ten hourly workers involved in the second round of layoffs at Westvaco, with the ages of those laid off marked with an asterisk.

25 33 35 38 48 55 55* 55* 56 64*

a. One way to choose a threshold is to go by the law. According to federal employment law, the “protected class” (of workers who cannot be fired because of their age) begins at age 40. If we use 40 as the threshold, then the set of ten ages becomes

N N N N Y Y Y Y Y Y

We can summarize the new, dichotomized version of what actually happened in a 2x2 table:

                        Laid off?
                      No    Yes    Total
40 or older?   No      4      0       4
               Yes     3      3       6
               Total   7      3      10

If we use as our test statistic the number of workers aged 40 or more among those laid off, we get a p-value of about .17:

p = P(3 or more Y in a random sample of 3 chosen from 4 N, 6 Y) ≈ .17.

b. Notice that using Fisher's test when you have a quantitative variable ignores relevant information. Here, for example, all three workers laid off were very much older than the threshold age of 40, but the test in (a) didn't use that information. In general, it is better not to use Fisher's test in a situation like this; the permutation test would ordinarily be preferred. However, you can sometimes get a better version of Fisher's test by changing the threshold. For example, you could choose as your threshold the median (half-way point) of the set of observed ages. The resulting test is called the median test. For the Martin data we have an even number of values, so there are two middle values, 48 and 55, and the median is the number half-way between them, 51.5. Using 51.5 as a threshold, and replacing ages by Yes/No answers to "Is the age 51.5 or older?" gives a population of 5 Ns, 5 Ys.

Drill exercise: (24) Summarize the data in a 2x2 table.

For this threshold, the p-value is

p = P(3 or more Y in a random sample of 3 chosen from 5 N, 5 Y) ≈ .06.

c. Drill exercise. (25) Repeat the test using 55 as the threshold. Summarize the data in a 2x2 table, and explain why the p-value should be less than the two previous p-values. (Its actual value is 1/30 ≈ .03.)

Example 9. Calcium and blood pressure.

If we want to apply Fisher’s test to the data in Example 1, we have to reduce the quantitative data to dichotomous data by choosing a threshold value and asking, “Is the change in blood pressure greater than or equal to the threshold?”

Drill exercises.

26. One natural choice for the threshold is 0. Values 0 or greater indicate that the blood pressure did not go up. Carry out Fisher's test using 0 as your threshold.

27. Carry out the median test.

28. Compare p-values for the two tests. Why do you think the p-values differ in the way that they do?

Drill exercises

29. Tell how to conduct a median test:

Null model.

a. What is the population?

b. What constitutes a random sample?

c. What are the objects that are equally likely according to the null model?

Test statistic

d. Tell what test statistic to use.

Variation 2: Transforming to ranks

Back before the days of cheap computers, it was often not practical to estimate p-values by simulation. Statisticians who wanted to do randomization tests found a clever way around the problem. Their solution was based on the fact that if your population consists of consecutive integers, like {1, 2, 3, …, n} there is a theoretical analysis that gives a workable approximation to p-values. Of course most populations don’t consist of consecutive integers, but you can force them to if you replace the actual data values with their ranks: order the values from smallest to largest, assign rank 1 to the smallest, rank 2 to the next smallest, etc. Once you’ve assigned ranks, you can do a two-sample permutation test on the ranks. The resulting test is called the Wilcoxon rank sum test.

Here’s how the ranking works for the Martin data.

Example 10: A rank test for the Martin data

Age:    25  33  35  38  48  55  55  55  56  64
Rank:    1   2   3   4   5   7   7   7   9  10

Notice how the ranking handles ties: the three 55s have ranks 6, 7, and 8, so we assign each of the 55s the average of those ranks, (6+7+8)/3 = 7.
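S-Plus's built-in rank function uses this same average-rank convention for ties, so the transformation takes one line:

Pop2 <- c(25,33,35,38,48,55,55,55,56,64)
rank(Pop2)    # 1 2 3 4 5 7 7 7 9 10 -- tied values share the average of their ranks

The Wilcoxon test is then just the permutation test of Section 2.3 run on rank(Pop2) instead of on Pop2.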

Exercises

30. Carry out the Wilcoxon test on the Martin data, but first, guess whether the p-value will be larger or smaller than for the permutation test using the actual age. (It will not be the same.) What do you think is the reason for the difference in p-values?

31. Carry out the Wilcoxon test on the calcium data of Example 1. As in (30), before you run the simulation, guess the p-value.

Variation 3: Paired data

Until now, all the data sets have had the same structure: two groups of values. To generate random data sets with the same structure, you combined the two groups into a single population, then randomly chose exactly enough values for Group 1, leaving the rest for Group 2. This structure is just one of a great many that are possible. Often data come in the form of pairs:

Example 11. Beestings.

When a bee stings and leaves his stinger behind in his victim, does he also leave with it some odor that tells other bees, "Drill here!"? To answer this question, J.B. Free designed a randomized experiment. First, he took a square board and from it suspended 16 cotton balls on threads in a 4x4 arrangement. Half the cotton balls had been previously stung, the other half were brand new, and the positions of the two kinds were randomized. Apparatus completed, Free went to a beehive, opened the top and jerked the array of cotton balls up and down, inviting stings. Later, he counted the numbers of new stingers his provocation had garnered. He repeated all this eight more times, with the results shown in Display 2.9.

Occasion     I    II   III    IV     V    VI   VII  VIII    IX    Ave
Stung       27     9    33    33     4    22    21    33    70   28.0
Fresh       33     9    21    15     6    16    19    15    10   16.0

Display 2.9 Numbers of new stingers left by bees in previously stung and fresh cotton balls

On average, there were 28 stingers left in the cotton balls that had been previously stung, only 16 in those that had not. Given the grand total of 396 new stingers, is the average of 28 for Stung too big to be due just to the random assignment? Suppose for the moment that the presence of stingers in the cotton balls had no effect. Then within each pair, it would be just a matter of chance as to which number got assigned to Stung, and which to Fresh. We can create random data sets by regarding each occasion as a tiny population of just two values, and randomly choosing the value that gets assigned to Stung, leaving the other for Fresh. Equivalently, we can toss a coin for each occasion, choosing the first value if the coin lands heads, the second if tails. Display 2.10 shows an instance of this:

Occasion     I    II   III    IV     V    VI   VII  VIII    IX    Ave
Stung       27     9    33    33     4    22    21    33    70   28.0
Fresh       33     9    21    15     6    16    19    15    10   16.0
Coin toss    1     1     0     0     0     1     0     0     1
"Stung"     27     9    21    15     6    22    19    15    70   22.7
"Fresh"     33     9    33    33     4    16    21    33    10   21.3

Display 2.10 Generating a random data set for the bee sting data

If the coin toss lands heads (1), the first value in a pair is assigned to "Stung"; if tails (0), the second value is assigned to "Stung."

Once we have a way to generate random data sets, we’re in business. We can carry out the randomization test by the usual 3-step algorithm: generate, compare, estimate. Display 2.11 shows S-plus code for doing this. The p-value turns out to be about .04. Apparently, bees are more likely to sting where others have stung before.

Stung <- c(27,9,33,33,4,22,21,33,70)

Fresh <- c(33,9,21,15,6,16,19,15,10)

StungMean <- mean(Stung)

nPairs <- length(Stung)

NRep <- 1000

NYes <- 0

for (i in 1:NRep){

tosses <- sample(c(0,1),nPairs,replace=T)

ave <- sum(tosses*Stung + (1-tosses)*Fresh)/nPairs

NYes <- NYes + (ave >= StungMean)

}

pHat <- NYes/NRep

pHat

Display 2.11. S-plus code for a permutation test for paired data

Exercise:

32. Explain how the line of code

ave <- sum(tosses*Stung + (1-tosses)*Fresh)/nPairs

works to give the average of the nine values randomly assigned to “Stung”.

Example 12. Radioactive twins.

Would you agree to inhale an aerosol of radioactive Teflon particles? Seven pairs of identical twins once did. They were part of a study of the effect of environment on the health of lungs. One twin in each pair had been living in a rural environment, the other in an urban environment. The numbers in Display 2.12 tell the percent of radioactivity remaining one hour after inhaling the aerosol. Lower values are better: they indicate that a larger percentage of the particles had been cleared from the lungs.

Twin Pair     I     II    III    IV     V     VI    VII    Ave
Rural       10.1  51.8   33.5  32.8  69.0   38.8   54.6   41.5
Urban       28.1  36.2   40.7  38.8  71.0   47.0   57.0   45.5

Display 2.12 Percentage of radioactivity remaining in the lungs, for seven pairs of twins living in two environments

Exercises:

33. Modify the S-plus code in Display 2.11 to carry out a permutation test. What do you conclude about the effect of environment on health?

34. For the bee data, the justification for a permutation test comes from the fact that conditions (stung or fresh) were randomly assigned. The twin study, however, is an observational study, with no randomization of the conditions (rural or urban) possible. What is the justification for using a permutation test?

Additional exercises involving variations on the permutation test

35. Dichotomize the O-ring data of Example 4, replacing the number of incidents with Yes (1 or more incidents) or No (0 incidents), and summarize the results in a 2x2 table. Then carry out Fisher's exact test. How does your p-value here compare with the one based directly on the numbers of incidents?

36. Dichotomize the data on traffic deaths (Example 3) using the sign of the change, i.e., whether the deaths went up or down. Summarize the results in a 2x2 table, and carry out Fisher’s exact test.

37. Use the bee sting data of Example 11. Order all 18 data values and assign ranks. Then carry out a permutation test on the pairs of ranks, using suitably modified code from Display 2.11. This test is called the signed rank test.

38. Use the bee sting data one more time. This time assign ranks separately for each pair: assign a 1 to the larger value and a 0 to the smaller value. If the two values are equal, simply omit that pair from the analysis. Carry out a permutation test on the pairs of ranks. This test is called the sign test.

39. Compare the p-values from the three tests using the bee sting data: the permutation test using the numbers of stings, the signed rank test, and the sign test. The first test uses the actual data, the second gives up some of that information by converting to ranks, and the third test gives up still more information by looking only at which value in a pair was the larger one. Based on the bee sting data, how does giving up information appear to affect p-values?

2.6 Randomization tests, IV: Chi-square tests.

Introduction. Fisher’s exact test applies to data sets you can summarize in a 2x2 table. Such data sets have the same structure as the results of drawing a sample from a bucket containing red and blue marbles: there are two groups (drawn out, left in) and two kinds of individuals (the two colors). What if there are more than two kinds of individuals, or more than two samples?

Example 13. Victoria’s descendants

Some people claim there is an association between a person’s birthday and the day of the year on which they die. According to the theory, people who are dying tend to “hang on” until their birthday. Display 2.13 shows counts for 82 descendants of Queen Victoria, classified by month of birth (row) and month of death (column). Those who died in the same month as they were born in appear on the main diagonal. Those who died in a month just before or just after their birth month appear just below or just above the main diagonal.

         Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Total
Jan        1    0    0    0    1    2    0    0    1    0    1    0      6
Feb        1    0    0    1    0    0    0    0    0    1    0    2      5
Mar        1    0    0    0    2    1    0    0    0    0    0    1      5
Apr        3    0    2    0    0    0    1    0    1    3    1    1     12
May        2    1    1    1    1    1    1    1    1    1    1    0     12
Jun        2    0    0    0    1    0    0    0    0    0    0    0      3
Jul        2    0    2    1    0    0    0    0    1    1    1    2     10
Aug        0    0    0    3    0    0    1    0    0    1    0    2      7
Sep        0    0    0    1    1    0    0    0    0    0    1    0      3
Oct        1    1    0    2    0    0    1    0    0    1    1    0      7
Nov        0    1    1    1    2    0    0    2    0    1    1    0      9
Dec        0    1    1    0    0    0    1    0    0    0    0    0      3
Total     13    4    7   10    8    4    5    3    4    9    7    8     82

Display 2.13. Month of birth (row) and month of death (column)

for 82 descendants of Queen Victoria

If the claim of association is true, we would expect to find a tendency for counts to be higher on or near the main diagonal, lower near the southwest and northeast corners. If, on the other hand, there is no association, we would expect the count in a cell to be equal to the product of the cell's column total times its row fraction. For example, consider the upper left cell, which corresponds to those born in a January who also died in a January. The row and column totals tell us that 6 of 82 descendants, or 7.32%, were born in a January; 13 died in a January. If there is no association between month of birth and month of death, we would expect the fraction of January births to be the same in each column. In particular, we would expect 7.32% of the 13 January deaths, or 13(0.0732) = 0.95, to be January births. The goal of this section is to apply the same kind of thinking to all the cells of the table, and somehow combine the results to define a test statistic that can serve as the basis of a randomization test.
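In S-Plus the entire table of expected counts comes from one line; a sketch, assuming the matrix of counts in Display 2.13 is stored as ActualData (the name used later in this section):

Expected <- outer(rowSums(ActualData), colSums(ActualData)) / sum(ActualData)
Expected[1,1]    # 13 * (6/82) = 0.95, the expected count for the Jan-Jan cell

The function outer forms the matrix of all products (row total)(column total); dividing by the grand total 82 converts each product to (row fraction)(column total).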

The chi-square test statistic. One of the most common methods in all of statistics is the chi-square test. This test is so flexible and broadly applicable that almost any data set based on sorting and counting can be studied using it. The chi-square test has two parts, which are often presented as a single package, without noting or distinguishing between one part and the other. This is unfortunate, because one part – a measure of distance that serves as a test statistic -- is much more useful than the other – a shortcut method for approximating p-values. In what follows, I’ll describe and illustrate the more useful part, which gives a general-purpose method for carrying out Step 2 (the comparison step) of the p-value algorithm.

For a concrete illustration, consider the summary table for the Martin example. For 2x2 tables, the chi-square distance isn’t something you really need, because its value is completely determined by the entry in the upper left cell of the table. However, you do need chi-square (or some alternative) for tables larger than 2x2, and the 2x2 case is a simple starting point for learning.

                       Column label
Row label            1      2    Total
1 (under 55)         5      0      5
2 (55 or older)      2      3      5
Total                7      3     10

(2 = laid off, 1 = retained)

Display 2.14. Summary table for the Martin example

The null model corresponds to choosing a random subset of size 3 -- those laid off -- from {25, 33, 35, 38, 48, 55, 55, 55, 56, 64} and counting the number of people who are 55 or older. Since 5 of the 10, or 50%, of those in our population are 55 or older, we would expect that, on the average over the long run, 50% of those in a randomly chosen subset would be 55 or older. For subsets of size 3, this long run average -- the expected value -- would be 50% of 3, or 1.5. Because the table entries have to add to give the same row and column totals as for the actual data, we can fill in the entire table:

                       Laid off?
                     Yes     No    Total
55 or         Yes    1.5    3.5      5
older?        No     1.5    3.5      5
              Total   3      7      10

Display 2.15. Expected values for the Martin example

We can now use these expected values to compare tables. In effect, we invent a way to measure the “distance” between two tables, and compare tables according to how far they are from the table of expected values.

Step 2a   Observed  -  Expected  =  Obs - Exp

            3  2       1.5  3.5       1.5  -1.5
            0  5       1.5  3.5      -1.5   1.5

Step 2b   (Obs - Exp)^2 / Expected:

           2.25  2.25  /  1.5  3.5  =  1.5  0.64
           2.25  2.25  /  1.5  3.5  =  1.5  0.64

Step 2c   Chi-square = sum of (O-E)^2/E = 4.29

Display 2.16. The chi-square distance between observed and expected counts

By looking closely at the way the chi-square value is defined, you can convince yourself that it does behave like a distance.

· Observed close to expected ⇒ chi-square near zero. Consider first a table whose observed counts are exactly equal to the expected counts. All of the entries in the table of differences will be zeros, and the chi-square distance will be zero as well. In other words, the chi-square distance from a table to itself is zero, just as it should be.

· Observed far from expected ⇒ chi-square large. Now consider a table of observed counts that are far from their expected values. At least some of the differences (Obs - Exp) in Step 2a will be far from 0. When these differences are squared, in Step 2b, the resulting values will be large, and that will make the chi-square value large.

· Why divide (O-E)^2 by E? Dividing by E is a technical adjustment, designed to give all the cells an equal chance to contribute to the chi-square total. To see how this works, consider two extreme cases. First, suppose that the expected value is 1. If 1 is the expected value, we might get observed values of 2, or 5, or even 10, but 10 would be a very major departure from expectation. On the low side, the observed count can never be less than 0, so -1 is the lowest possible value for O - E. Next, suppose that instead of 1, the expected value is 101. An observed count of 102, or 105, or even 110 is, in percentage terms, still quite close to the expected value, even though the differences (O-E) are the same as in the first case. On the low side, (O-E) can easily go far below -1.

Now compare the two cases. In the first, a value of (O-E) = 4 indicates major departure from expectation: 5 instead of 1. In the second case, a value of (O-E) = 4 indicates a departure of less than 4% from the expectation. Dividing (O-E)^2 by E puts these departures in perspective. In the first case, (O-E)^2/E = 16; in the second case, (O-E)^2/E = 0.16.

Once you have defined the chi-square distance, you can use it to compare data sets. A random data set is more extreme than the actual Martin data, for example, if and only if its chi-square distance from the table of expected values is greater than or equal to 4.29. You calculate the p-value in the usual way, as the fraction of random data sets that are at least as extreme as the actual data.
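As a quick check of the arithmetic, here is the 4.29 recomputed in S-Plus (a sketch; matrix fills its entries column by column):

Obs <- matrix(c(3,0,2,5), nrow=2)          # observed: rows 55+/under 55, cols laid off/retained
Exp <- matrix(c(1.5,1.5,3.5,3.5), nrow=2)  # expected counts from Display 2.15
sum((Obs - Exp)^2 / Exp)                   # 4.29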

Testing for association in two-way tables of counts

For tables larger than 2x2, the arithmetic is messier, but the logic is the same as in the Martin example. Here’s a version of the randomization algorithm suitable for larger tables.

Step 0. For each cell, compute the expected value = (row fraction)(column total). Then find the observed chi-square by following Steps 2a–2c below for the actual data.

Step 1. Generate random data sets with the same row and column totals as the actual data.

Step 2. Compare values of the chi-square statistic. For each random data set, compute:

a. Observed – Expected

b. (Obs – Exp)^2 / Exp

c. Chi-square = sum of (O – E)^2 / E

d. Compare: is chi-square for the random data set at least as big as for the actual data?

Step 3. Estimate:

p-hat = (number of Yes answers in Step 2d) / (number of random data sets)

For the data in Display 2.13, the value of chi-square is 115.6. To see how this value compares with the values we’d get from random data sets, we need a method for generating 12x12 tables of counts with the same margins as in Display 2.13.

Generating random tables of counts with given margins.

You can generate data sets by physical simulation, much as in the Martin example. Here’s how it works for Victoria’s descendants.

Step 1. Label by rows. Put 82 chips in a bucket, with labels determined by the row (birth month) totals. Thus 6 of the chips say “Jan”, 5 say “Feb”, 5 “Mar”, 12 “Apr”, and so on.

Step 2. Draw out by columns. Mix the chips thoroughly. Then draw them out in stages determined by the column totals: The first 13 chips you draw out correspond to deaths in January; the next 4 correspond to deaths in February, …, the last 8 correspond to deaths in December.

In S-plus, you can carry out the same two steps:

Step 1. Label by rows. Just as rep(1, 3) creates a vector (1, 1, 1), rep(1:3, c(4,3,1)) creates a vector (1, 1, 1, 1, 2, 2, 2, 3). For the Victoria data, we want rep(1:12, c(6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3)). Rather than type in all the row totals, we use the command rowSums to tell S-plus to compute the totals for us.

Pop <- rep(1:12, rowSums(ActualData))

More generally, the number of rows won't necessarily be 12, but it will be given by the first element of the vector that gives the dimension of the data, dim(ActualData)[1]. Thus we create the bucket of labeled chips with the command

Pop <- rep(1:dim(ActualData)[1], rowSums(ActualData))

Step 2. Draw out by columns. In the same way, we create a vector ColGroups of column labels. Following the column totals, this vector will have thirteen 1s for January, four 2s for February, etc. For our particular example, we could use rep(1:12, c(13, 4, 7, 10, 8, 4, 5, 3, 4, 9, 7, 8)). The S-plus code uses the more general

ColGroups <- rep(1:dim(ActualData)[2], colSums(ActualData))

To create a random data set, first permute the row labels using

Permutation <- sample(Pop, length(Pop))

Then line up the vector of permuted row labels next to the vector of column labels, and count. Here’s how it works for the Martin example:

Row labels:            1 1 1 1 1 2 2 2 2 2

Permuted row labels:   1 2 2 1 2 1 1 1 2 2

Column labels:         1 1 1 2 2 2 2 2 2 2

The last two rows give 10 vertical pairs (Permuted row label, column label). Sorting and counting these gives a 2x2 summary table:

                  Laid off   Retained   Total
55 or older (1)       1          4        5
Under 55 (2)          2          3        5
Total                 3          7       10

Display 2.17. Summary table from sorting and counting

S-plus does the counting for us, in response to the command “table”:

RandomTable <- table(Permutation, ColGroups)
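To see the counting in action for the Martin labels shown above, you can type the two vectors in directly (a small sketch; the printed layout of the table may vary from one S-plus version to another):

Permutation <- c(1, 2, 2, 1, 2, 1, 1, 1, 2, 2)   # permuted row labels
ColGroups <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2)     # column labels
table(Permutation, ColGroups)                    # reproduces the counts in Display 2.17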

Display 2.18 shows the S-plus code for (1) computing expected values, (2) computing the chi-square distance, and (3) estimating the p-value by generating random data sets and finding the fraction whose chi-square value is at least as large as for the actual data. Out of 10,000 random data sets, xxx gave a chi-square value of 115.6 or more. Conclusion: it isn't at all unusual to get data as extreme as the actual data. Queen Victoria's descendants provide no evidence for the claim that birth month and death month are associated.

##############################################################
#
# Chi-square tests by randomization
#
##############################################################

##############
#
# expected. This function takes a matrix of non-negative entries
# and returns a matrix of expected values computed assuming there
# is no association between rows and columns: the (i,j) element of
# the matrix equals the total for row i times the proportion for
# column j.
#
##############

expected <- function(A){
    RowTotals <- matrix(rowSums(A), dim(A)[1], 1)
    GrandTotal <- sum(A)
    ColFractions <- matrix(colSums(A), 1, dim(A)[2]) / GrandTotal
    Expected <- RowTotals %*% ColFractions
    return(Expected)
}

##############
#
# ChiSqDistance. This function takes matrices of observed and
# expected counts and returns the usual chi-square statistic,
# with value equal to the sum of
# (observed - expected)^2 / expected.
#
##############

ChiSqDistance <- function(Observed, Expected){
    ChiSq <- sum((Observed - Expected)^2 / Expected)
    return(ChiSq)
}

# Example usage, for a matrix A of counts:
# ChiSqDistance(A, expected(A))

##############
#
# ChiSqSim. Carries out a randomization chi-square test for independence
# by creating random data sets, computing the chi-square distance from
# expected values computed assuming no association, and estimating the
# p-value as the fraction of random data sets whose chi-square values
# are at least as large as the value for the actual data.
# Input: a matrix of counts and the number of repetitions.
#
##############

ChiSqSim <- function(ActualData, NReps){
    Exp <- expected(ActualData)                 # Expected values
    ActChiSq <- ChiSqDistance(ActualData, Exp)  # Observed value of the
                                                #   chi-square distance
    NYes <- 0                                   # NYes counts the number of
                                                #   random data sets with a
                                                #   chi-square value at least
                                                #   as large as the actual one
    # Pop = the contents of the bucket of labeled chips
    Pop <- rep(1:dim(ActualData)[1], rowSums(ActualData))
    PopSize <- length(Pop)                      # Number of chips in the bucket
    # ColGroups tells which draws go with which columns
    ColGroups <- rep(1:dim(ActualData)[2], colSums(ActualData))
    for (i in 1:NReps){                         # One pass per random data set
        # Create a random permutation of the chips in the bucket
        Permutation <- sample(Pop, PopSize)
        # Summarize the results in an r x c table of counts
        RandomData <- table(Permutation, ColGroups)
        # Add 1 to NYes if the random data set has a chi-square value
        # at least as large as for the actual data
        NYes <- NYes + (ChiSqDistance(RandomData, Exp) >= ActChiSq)
    }
    p.hat <- NYes / NReps
    return(p.hat)
}

#
# The Martin data
#
Martin <- matrix(c(3, 2, 0, 5), 2, 2, byrow = T)
Martin
expected(Martin)
ChiSqDistance(Martin, expected(Martin))
ChiSqSim(Martin, 1000)

#
# Victoria's descendants
#
Victoria
expected(Victoria)
ChiSqDistance(Victoria, expected(Victoria))
ChiSqSim(Victoria, 1000)

Display 2.18. S-plus code for a randomization chi-square test for two-way tables

Appendix 2.1 Randomization tests: A summary

GIVEN:

1. A null model (or model and null hypothesis):

a. A set, called the population. (Subsets of the population are samples; the number of elements in the sample is the sample size, n.)

b. A finite (though often very large) collection of equally likely samples.

2. A test statistic (or metric):

a. A function or rule that assigns a real number to each sample. (The function itself is the test statistic.)

b. A way to tell which of two values of the test statistic is more extreme. (Often, larger values are more extreme.)

3. Observed data

A particular sample (and so, automatically, a particular value of the test statistic).

COMPUTE:

1. The p-value (more formally, the observed significance level):

The p-value is the probability, computed using the null model, of getting a value of the test statistic at least as extreme as the observed value.

According to the null model, all the samples are equally likely, so the p-value is just the fraction of samples that have values of the test statistic at least as extreme as the observed value. There are three general approaches to computing p-values: brute force, mathematical theory, and simulation.

a. Brute force: List all the samples, compute values of the test statistic for each sample, and count. (A worked example follows this list.)

b. Mathematical theory: Find a shortcut by applying mathematical ideas (e.g., the theory of permutations and combinations) to the structure of the set of samples in the null model.

c. Simulation: Use physical apparatus or a computer to generate a large number of random samples, and estimate the p-value using the fraction of samples that give a value of the test statistic at least as extreme as the observed value.
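To illustrate the brute-force approach: it is entirely practical for the Martin hourly data, because there are only 120 possible samples of size 3. Here is a minimal sketch (it uses combn, available in R; older S-plus versions may require listing the samples some other way):

Ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)
AvgAges <- combn(Ages, 3, mean)   # average age for each of the 120 samples
mean(AvgAges >= 58)               # exact p-value: the fraction 6/120 = 0.05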

INTERPRETATION:

The p-value measures how surprising the observed data would be if the null model were true. A moderate sized p-value means that the observed value of the test statistic is pretty much what you would expect to get for data generated by the null model. A tiny p-value raises doubts about the null model: If the model is correct (or approximately correct), then it would be very unusual to get such an extreme value of the test statistic for data generated by the model. In other words, either the null model is wrong, or else a very unlikely outcome has occurred.

More on the logic of inference

Contrast the time sequence for what actually happens when you generate statistical data with the time sequence for how you reason about it after the fact. Here's what actually happens when you produce data:

1. The setting: There is a particular given chance mechanism for producing the data, such as tossing a coin 100 times.

2. Before producing data, you can use probability calculations to determine which groups of outcomes are likely, which are unlikely. (In probability, which is a branch of mathematics, the model is known, and you use it to deduce the chances for various events that have not yet happened.)

3. The chance mechanism produces the data, e.g., 91 heads in 100 tosses of a fair coin.

Now consider the analysis using the logic of inference. What was once assumed fixed (the chance mechanism) is now regarded as unknown. In the coin example, you don't know the value p for the probability of heads. To make matters worse, what was initially unknown and variable (the number of heads you would get if you were to toss 100 times) is now fixed: you got 91 heads in 100 tosses. It may appear to make no sense to compute the probability of something that has already happened. However, according to the logic of classical inference, probabilities apply only to the data. There is no way to answer the question we really want answered: "How likely is it that p = .5?" No wonder people find this hard!

Here's the time sequence for the steps in the inference.

1. The setting: You have fixed, known data, and you want to test whether a particular model is reasonably consistent with the data.

2. You spell out the (tentative) model you want to test.

3. Now, in your imagination, you "go back in time," ignoring for the moment the fact that you already have the data, and ask "Which outcomes are likely, and which are unlikely?" In particular, you ask, "How likely is it to get the data we actually got?" In formal inference, this probability is the p-value.

4. The p-value is used not for prediction, the way probabilities are ordinarily used, but as a measure of surprise: "If I believe the model, how surprised should I be to get data like what I actually got?" If the p-value is small enough, the model is rejected. This is typical of the way probability is used in statistics: after the fact. (Though statistics is a mathematical science, it is not a branch of mathematics the way probability is.)

Drill and practice with the abstract structure: Null models, test statistics, p-values

40. Samples of size 1. Find p-values for the following situations.

a. Null model: All elements of S1 = {1, 2, …, 20} are equally likely.

Test statistic: t(x) = x, for x ∈ S1; larger values are more extreme.

Observed data: x0 = 3.

b. Null model: All elements of S2 = {1, 2, …, N} are equally likely.

Test statistic: t(x) = x, for x ∈ S2; larger values are more extreme.

Observed data: x0 = 3.

c. Same as (b), but x0 = N-3.

d. Null model: All 26 letters of the English alphabet are equally likely.

Test statistic: t=1 if the letter drawn is a vowel, 0 otherwise.

Observed data: x0 = e.

41. Samples of size 2 or more. Find p-values for the following situations.

a. Null model: All subsets of size two drawn from {1, 2, …, 6} are equally likely.

Test statistic: t({x,y}) = x+y; larger values are more extreme.

Observed data: {x0, y0}= {3,6}.

b. Null model: All pairs (i, j) with i < j, drawn from {1, 2, …, 6}, are equally likely.

Test statistic: t1(i,j) = j – i ; larger values are more extreme.

Observed data: (1,3).

c. Same as (b), but: t2(i,j) = (j – i)^2.

d. Same as (b), but: t3(i,j) = least common multiple of i and j.

e. Null model: All samples of size three chosen with replacement from {1, 2, …, 6} are equally likely. (Note that each sample is like a roll of three fair dice.)

Test statistic: Sum of the sample values; larger values are more extreme.

Observed data: (5,5,4).

42. More complicated structures. Find p-values for the following situations.

a. Null model: All 3 x 3 matrices of 0s and 1s with row totals 2, 1, 2 and column totals 2, 1, 2 are equally likely.

Test statistic: Number of checkerboard units (see bottom of page 3).

Observed data: The table on page 1.

b. Null model: All 2x2 matrices of 0s, 1s and 2s are equally likely.

Test statistic: Sum of squares of the elements of the matrix.

Observed data: Sum of squares = 10

S-Plus exercises: very basic drill

43 – 45. Write S-Plus code to do the following; then run your code as a check.

43. Sample. Take a simple random sample of size 3 from {1, 2, …, 20}.

44. Test statistic. Take a simple random sample of size 3 from {1, 2, …, 20} and find the number of elements in the sample with values of 18 or more.

45. p-hat. Take NReps random samples of size 3 from the same population, and find p-hat, the proportion of samples with two or more elements with values of 18 or more.

Martin v. Westvaco

46. The table below classifies salaried workers using two Yes/No questions: Under 40? and Laid off? (In employment law, 40 is a special age, because only those 40 or older belong to what is called the "protected class," the group covered by the law against age discrimination.)

                     Laid off?
Under 40?      Yes      No     Total    % Yes
Yes             4        5       9      44.4%
No             14       13      27      51.9%
Total          18       18      36      50.0%

Display 2.19. Martin data for salaried workers

a. Set up a null model: Tell the population (how many of what kinds of items); tell the sample size, and describe the set of equally likely samples.

b. What is the test statistic?

c. What is its observed value?

d. Use S-Plus to find the p-value. (One possible setup is sketched below.)
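For part (d), one possible setup reuses the functions from Display 2.18 with the counts from Display 2.19 (a sketch only: it takes the chi-square distance as the test statistic, which may or may not match your answer to (b)):

Salaried <- matrix(c(4, 5, 14, 13), 2, 2, byrow = T)  # rows: under 40 Yes/No; columns: laid off Yes/No
ChiSqSim(Salaried, 10000)                             # estimated p-value from 10,000 random data sets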

47. 50 or older. The salaried workforce in Westvaco's engineering department was older, on average, than in many companies: ¾ of the 36 employees were 40 or older. Here are data like those in Exercise 46, except that ages are divided at 50 instead of 40; repeat parts (a)–(d):

                     Laid off?
Under 50?      Yes      No     Total    % Yes
Yes             5       10      15      33.3%
No             13        8      21      61.9%
Total          18       18      36      50.0%

Verbal/interpretive practice

48. Martin, continued. Does the evidence in the second table (Exercise 47) provide stronger or weaker support for Martin's case? Explain. How do you account for the different messages from the two tables? Both provide evidence; how do you judge the evidence from the two tables taken together?

49. P-values. Write a short paragraph explaining the logic of p-values and significance testing in your own words.

50. Number of samples. Write a short paragraph summarizing your current understanding of the relationship between the number of repetitions in a simulation and the stability and reliability of the estimated p-value.

51. A trustworthy friend? A friend wants to bet with you on the outcome of a coin toss. The coin looks fair, but you decide to do a little checking. You flip the coin: it lands heads. You flip again: also heads. A third flip: heads. Flip: heads. Flip: heads. You continue to flip, and the coin lands heads nineteen times in twenty tosses. Don't try any calculations, but explain why the evidence (19 heads in 20 tosses) makes it hard to believe the coin is fair.

52. The logic in the preceding exercise relies on the fact that a certain probability is small. Describe in words what this probability is, and tell how you could use simulation to estimate it.

53. Snow in July? A friendly tornado puts you and your dog Toto down in Kansas, and a booming voice from behind a screen tells you that the date is July 4 (hypothesis). However, you see snow in the air (data), and make an inference that it is not really July 4. Describe in words what probability your inference is based on.

54. Which test statistic? The Physical Simulation asked you to use the average age to summarize the set of three ages of the workers chosen for layoff. How different would your conclusions have been if you had chosen some other summary? (There is a well-developed mathematical theory for deciding which summaries work well, but that theory would take us off on a tangent. All the same, you can still think about the issues even without the theory.) Some other possible summaries are listed at the end of this question. Are any of them equivalent to the average age? Which summary do you like best, and why?

Sum of the ages of the three who were laid off

Average age difference

(= average age of those laid off - average age of those retained)

Number of employees 55 or older who were laid off

Age of the youngest worker who was laid off

Age of the oldest worker who was laid off

Middle of the ages of the three who were laid off

55. How unlikely is "too unlikely"? The probability you estimated in the Physical Simulation is in fact exactly equal to 0.05. What if it had been 0.01 instead? Or 0.10? How would that have changed your conclusions? (In a typical court case, a probability of 0.025 or less is required to serve as evidence of discrimination. Some scientific publications use a cut-off value of 0.05, or sometimes 0.01.)

56. At the end of Round 3, there were only six hourly workers left. Their ages were 25, 33, 34, 38, 48, and 56. The 33- and 34-year-olds were chosen for layoff. Think about how you would repeat the Physical Simulation using the data from Round 4.

a. What is the population? (Give a list.)

b. How big is the sample?

c. Define, in words, the probability you would estimate if you were to do the simulation.

d. Write out the rule for estimating the probability, using the format from Step A1 as a guide.

e. Give your best estimate for the probability by choosing from

1%, 5%, 20%, 50%, 80%, 95%, and 99%.

f. Is the actual outcome easy to get just by chance, or hard?

g. Does this one part of the data (Round 4, hourly) provide evidence in Martin's favor?

57. After the first three, the next hourly worker laid off by Westvaco was the other 55-year-old. What's wrong with the following argument?

Lawyer for Westvaco: "I grant you that if you choose three workers at random, the probability of getting an average age of 58 or older is only .05. But if you extend the analysis to include the fourth person laid off, the average age is lower, only 57.25 (= [55+55+55+64] / 4). An average of 57.25 or older is more likely than an average of 58 or older. So in fact, if you look at all four who were laid off instead of just the first three, the evidence of age bias is weaker than you claim."

58. Use the data from Rounds 2 and 3 combined. Tell how to simulate the chance of getting an average age of 57.25 or more using the methods of the Physical Simulation: What is the population? the sample size? Tell how to estimate the probability, following the Physical Simulation as a guide. Give your best estimate of the probability. Then tell how to use this probability in judging the evidence from Rounds 2 and 3 combined.

59. Sketch a dot graph like Display 1.4 to illustrate what you think simulations would look like for the following scenario:

Three workers were laid off from a set of ten whose ages were the same as in the Martin case. The ages of those laid off were 48, 55, and 55. If you choose three workers at random, the probability of getting an average of 52.66 or older is .166.

60. For some situations, it is possible to find probabilities by counting equally likely outcomes instead of by simulating. Suppose only two workers had been laid off, with an average age of 59.5 years. It is straightforward, though tedious, to list all possible pairs of workers who might have been chosen. Here's the beginning of a systematic listing. The first nine outcomes all include the 25-year-old and one other. The next eight outcomes all include the 33-year-old and one other, but not the 25-year-old, since the pair (25, 33) was already counted.

Count   Pair chosen (brackets = laid off)             Average age
  1     [25][33] 35  38  48  55  55  55  56  64          29.0
  2     [25] 33 [35] 38  48  55  55  55  56  64          30.0
  3     [25] 33  35 [38] 48  55  55  55  56  64          31.5
  ...
  9     [25] 33  35  38  48  55  55  55  56 [64]         44.5
 10      25 [33][35] 38  48  55  55  55  56  64          34.0
 11      25 [33] 35 [38] 48  55  55  55  56  64          35.5

etc.

How many possible pairs are there? (Don't list them all!) How many give an average age of 59.5 years or older? (Do list them.) If the pair is chosen completely at random, then all possibilities are equally likely, and the probability of getting an average age of 59.5 or older equals the number of possibilities with an average of 59.5 or more divided by the total number of possibilities. What is the probability for this situation? Is the evidence of age bias stronger or weaker than in the example?

61. It is possible to use the same approach of listing and counting possibilities to find the probability of getting an average of 58 or more when drawing three at random. It turns out there are 120 possibilities. List the ones that give an average of 58 or more, and compute the probability. How does this number compare with the results of the class simulation in the Physical Simulation? Why do the two probabilities differ (if they do)?

62. How would your reasoning and conclusions change if the five oldest workers among the entire group of ten were all age 55 (so that the ages of the ten were 25, 33, 35, 38, 48, 55, 55, 55, 55, 55), and the three chosen for layoff were all 55? Is the evidence of age bias stronger or weaker than in the actual case?

63. The law on age discrimination applies only to people 40 or older. Suppose that instead of looking at actual ages, you look only at whether each worker is less than 40, or 40 or older. Tell what summary statistic you would use, and tell how you would set up the model for simulating an age-neutral process for choosing three workers to be laid off. Conclude by discussing whether you think it is better to use actual ages, or just the information about whether a person is 40 or older.

Applied problems: creating your own null models and test statistics

64. More Martin.

a. Use the 2x2 summary tables for salaried workers (in Exercises 46 and 47) to create a 3x2 summary table with three age groups: under 40, 40 to 49, and 50 or older.

b. Describe a null model that corresponds to the hypothesis of no discrimination.

c. Invent/define a test statistic that will be larger if older workers are more likely to be chosen for layoff.

d. Compute the observed value of your test statistic.

e. Find the p-value for your combination of null model, test statistic, and observed value.

65. Horse racing

The data set below shows the starting position of winning horses in 144 races. All races took place in the US, and each race had eight horses. Position 1, nearest the inside rail, is hypothesized to be advantageous.

Starting position    1    2    3    4    5    6    7    8
Number of wins      29   19   18   25   17   10   15   11

a. Describe a null model that corresponds to the hypothesis that starting position has no effect.

b. Invent/describe a test statistic that will have larger values if lower numbered starting positions are more advantageous.

c. For the data given here, tell whether the p-value will be < 0.01, ≥ 0.01 but ≤ 0.1, or > 0.1.

70. Spatial data.

One way to record the spatial distribution of a plant species is to subdivide a larger area into a grid of small squares (quadrats), and record whether the plant (Carex arenaria in Display 2.20) is presen