BIOSTATISTICS Lab Manual - | Department of Zoology at UBC

BIOL 300 BIOSTATISTICS Lab Manual FALL 2005


BIOL 300

BIOSTATISTICS

Lab Manual

FALL 2005


1. INTRODUCTION

2. EXPLORING AND DESCRIBING DATA

3. ANALYZING FREQUENCY TABLES

4. DISCRETE PROBABILITY DISTRIBUTIONS

5. INTRODUCTION TO THE NORMAL DISTRIBUTION

6. ONE-SAMPLE INFERENCE

7. COMPARISON OF TWO MEANS: PAIRED-SAMPLES

8. TWO-SAMPLE INFERENCE

9. SINGLE FACTOR ANALYSIS OF VARIANCE (ANOVA)

10. LINEAR REGRESSION AND CORRELATION

11. REVIEW PROBLEMS


1. INTRODUCTION

The purpose of these computer lab exercises is to provide exposure to data analysis using a computer system, including especially the use of graphics.

In the lab, you are provided access to a computer network containing the necessary programs, especially JMP from the SAS Institute. JMP is one of the most versatile and easiest to use statistical programs and is widely used in academic, government and corporate settings. Although a number of features have been included in this package of programs that are beyond the scope of most introductory biostatistics courses, this package is easy to use and designed to introduce novice users to statistical analysis. The program is designed to emphasize the graphical and exploratory requirements of statistics.

The package of programs is entirely menu driven and runs in Windows and Macintosh environments. The program is installed on all the computers in the Biostatistics lab in room 2434, as well as on the computers in Zoolab, the undergraduate computer lab on the second floor of the Biosciences building. The program is relatively inexpensive and can be purchased in the Bookstore (check the computer system requirements before you hand over any money).

This manual will provide some instructions on working with JMP. The best way to learn the program, however, is to try different strategies yourselves. The computer won’t blow up if you accidentally push the wrong key or make a bad menu selection, so don’t hesitate to try out its capabilities. JMP is capable of much more than we require in Biology 300, but don’t hesitate to explore when time allows.

Using The Program In Class: General Start-up Instructions

Our computer network and server require passwords to allow you access. You will be assigned a password and user-id during your first lab. The user-id will be valid for the duration of the course and will allow access to the network, the Internet, assorted applications and a home directory where you can store several megabytes of files. The password you will be given will be temporary. You should change it at once to protect yourself from hackers, etc. Follow the instructions you will be given in class to change your password using the telnet program. This is the only way that you can change your password. Write your password and user-id down in a secure location. You will need them for the rest of term to access the system.

To access the system, type your user-id and password into the Windows networking dialog box that will be displayed on the screen of your computer. [If a further dialog box appears asking if you want to use this password for Windows, hit cancel.] Once Windows has booted up, click on the START button in the bottom left corner of the screen. Use the mouse to move to the PROGRAMS option, then to JMP.

Data Entry and Editing

Data must be put into the computer's memory before you can use any of the statistics programs. Data may be entered directly from the keyboard or it may be stored in a file from a previous session. When you first open JMP, you will see the JMPIN Starter, which will be on the File tab.


Click “New Data Table” to begin. A window should open with a table labelled Untitled 1. The table is currently blank, with 1 column and 0 rows. On the left of the table is a column with three bars labelled Untitled 1, Columns, and Rows.

To enter data you will need to add some rows. Clicking on the red symbol on the Rows bar in the left column, and selecting “Add Rows”, is one way to do this. Another is to double-click directly on the table at any point to add rows down to the cell in which you have placed the mouse cursor. Columns may be added to the table in a similar way. A small dot will appear in any new cell that you have created in this way. Rows may be deleted by holding down the first button on the mouse and dragging the mouse over the rows you wish to remove. Then let go of the first mouse button and press the second. Select “Delete Rows”. Columns may be removed in the same way. Try adding some rows and columns to the table and then deleting them again. Editing values is as simple as selecting and changing them. Click any cell and replace the value by typing in a new one.

In general, each row represents an individual, and each column on the screen represents a single variable. A variable is simply the trait of interest. Each cell on the screen represents a single data point.

Problems

1. Ten randomly chosen sections of a river showed the following number of spawning Coho salmon: 22, 18, 40, 16, 12, 17, 23, 41, 29, 33.

a) What is a “variable”? How many variables are in this data set?

b) Enter these data and save them in a file named salmon.

c) Change the third value to 19 and the eighth value to 27.

d) Insert a value of 16 after the second record.

e) Delete the fourth and fifth values.

f) Add the following values to the data set: 17, 15, 11, 21, 23, 26. Your data set should now include the following values: 22, 18, 16, 12, 17, 23, 27, 29, 33, 17, 15, 11, 21, 23, 26.


2. EXPLORING AND DESCRIBING DATA

The first thing to do with any set of data is to plot and inspect it visually. Inspection affords an opportunity to determine the shape of a distribution. This information is of interest on its own, but will also help to determine the type of analysis to carry out on the data. There are a number of useful tools for this, including descriptive statistics, histograms and boxplots. JMP offers all of these plus a number of additional exploratory tools. For now we will limit ourselves to these three methods, all widely used.

Histograms

A histogram plots the frequency of observations falling into different intervals of a continuous variable Y. The number and width of Y intervals should be chosen carefully, although there are no set rules for determining how many classes to use. If the range of Y values is divided into too many intervals, many of the intervals will contain no observations, and the histogram will resemble the skyline of a city dominated by skyscrapers. Someone viewing such a histogram will have difficulty determining the shape of the true distribution. Using fewer, larger classes can alleviate this problem. Holes in the distribution are smoothed over, giving a better picture of the shape of the distribution. It is possible to go overboard with smoothing. Histograms consisting of a few very wide classes may hide significant features of a distribution.
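As a rough illustration of how the number of intervals changes the picture, the following Python sketch (not part of JMP) tallies the ten Coho salmon counts from the Introduction into equal-width intervals. The function name `bin_counts` and the choice of 3 and 15 intervals are assumptions made only for this example.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last interval rather than beyond it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Coho salmon counts from Problem 1 of the Introduction.
salmon = [22, 18, 40, 16, 12, 17, 23, 41, 29, 33]

few_bins = bin_counts(salmon, 3)    # a few wide intervals
many_bins = bin_counts(salmon, 15)  # many narrow intervals

# Print a crude text histogram for each choice.
for label, counts in (("3 intervals", few_bins), ("15 intervals", many_bins)):
    print(label)
    for c in counts:
        print("  |" + "*" * c)
```

With many narrow intervals, several rows of the text histogram are empty, which is the "skyline" effect described above.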

Boxplots

Histograms indicate the whole frequency distribution of a variable, whereas the boxplot summarises its most prominent features. These features include median and spread as well as the extent and nature of departures from symmetry, and the possible presence of observations having extreme values (outliers). The ends of the box represent the lower and upper quartiles of the data, and the line across the middle of the box is the median. The median is the middle observation of a set of data. The lower quartile marks the median of the lower half of the observations, and the upper quartile is the median of the upper half. If the distribution of the data is symmetric (e.g., from a uniform or normally distributed population), then the box will appear to be divided equally into two halves by the overall median. The lines protruding from the box are the “whiskers”. Each whisker can be up to 1.5 times the length of the box (the whisker extends only to the last data point within this 1.5 limit). Beyond the whiskers are the outliers.

Outliers are extreme observations, those that lie unusually far from the main body of the data. These unusual observations may simply reflect the tail end of a highly skewed distribution, but sometimes they are errors of measurement or transcription, or represent individuals from a population other than the one under study. You might be tempted to delete outliers, but this is justified only when there are errors. If an outlier is deleted that is not an error, valuable information is lost and a bias is introduced into later analyses. Yet including an erroneous measurement also has harmful consequences, so data entry methods should be reviewed and specimens re-measured if possible. One strategy is to repeat every analysis with and without a suspicious observation and compare the results. If the conclusions from the two analyses are different then this should be reported.
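The boxplot quantities described above can be sketched in a few lines of Python. This is only an illustration of the definitions in the text (quartiles as medians of the lower and upper halves, whiskers limited to 1.5 box lengths); the function name `box_summary` is an assumption, leaving the middle observation out of both halves when n is odd is one of several conventions, and JMP may use a different quartile rule and so report slightly different values.

```python
from statistics import median


def box_summary(values):
    """Return the median, quartiles, whisker ends and outliers of a data set."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Assumption: with an odd number of observations the middle one
    # is left out of both halves.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    q1, q3 = median(lower_half), median(upper_half)
    box_length = q3 - q1
    low_limit = q1 - 1.5 * box_length
    high_limit = q3 + 1.5 * box_length
    inside = [v for v in data if low_limit <= v <= high_limit]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        # Each whisker stops at the last data point within the 1.5 limit.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in data if v < low_limit or v > high_limit],
    }


# Coho salmon counts from Problem 1 of the Introduction.
salmon = [22, 18, 40, 16, 12, 17, 23, 41, 29, 33]
summary = box_summary(salmon)
print(summary)
```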


Descriptive Statistics

Several descriptive statistics measured on the variable of interest will also appear next to the histogram. These include the mean and standard deviation. (Also included are the standard error of the mean and the 95% confidence interval.) More descriptions of the data can be found by clicking More Moments under Display Options.
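For readers who want to see how these summary numbers relate to one another, here is a small Python sketch using the final salmon data set from the Introduction. It is not JMP's own calculation; the critical value 2.145 is the two-tailed 5% value of Student's t for 14 degrees of freedom, taken from a standard table, and the variable names are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Final salmon data set from Problem 1(f) of the Introduction.
salmon = [22, 18, 16, 12, 17, 23, 27, 29, 33, 17, 15, 11, 21, 23, 26]

n = len(salmon)
sample_mean = mean(salmon)
sample_sd = stdev(salmon)             # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / sqrt(n)  # standard error of the mean

# Two-tailed 5% critical value of t with n - 1 = 14 degrees of freedom,
# copied from a t table (an assumption of this sketch, not computed here).
T_CRITICAL_DF14 = 2.145
ci_low = sample_mean - T_CRITICAL_DF14 * standard_error
ci_high = sample_mean + T_CRITICAL_DF14 * standard_error

print(f"n = {n}")
print(f"mean = {sample_mean:.3f}, sd = {sample_sd:.3f}, se = {standard_error:.3f}")
print(f"95% confidence interval: {ci_low:.3f} to {ci_high:.3f}")
```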

Accessing Files from the Server

The procedure to access files from the shared drive is to choose Open from the File menu at the top of the JMP window or click Open Data Table on the JMP Starter. Choose the shared drive S to grab files from the server. From the shared directory choose the file that you want.

Using the Program

After opening or creating a data table, select the Distribution option from the Analyze pull-down menu. Or, click “Distribution” on the Basic Stats tab of the JMP Starter. Select a column in the window that pops up, and then click “Y, columns”. Finally, click OK to generate the histogram. The graph that appears provides an estimate of the distribution of the Y variable you chose.

You can modify the style of the graph and the information provided in various ways. Click the red symbol beside the variable name above the histogram. For example, Display Options -> Horizontal Layout changes the orientation. Select Histogram Options -> Count Axis to have the actual counts plotted along the side of the graph. Experiment with other options.

The same actions generate a boxplot next to the histogram. It includes a number of features besides those mentioned above. To read about these extra features, select Index from the Help pull-down menu. Type “outlier box plot” as your keyword and press Enter on your keyboard. The description will appear in the right side of the help window.

Problems

1. Open the data file poverty from the shared directory (most files we will use will be located there). For 97 countries in the world, data are given for birth rates, death rates, infant death rates, life expectancies for males and females, and Gross National Product. These data were collected from The Annual Register 1992 (data for 1990) and the U.N.E.S.C.O. 1990 Demographic Year Book by M. Rouncefield. The variables are:

• Live birth rate per 1,000 of population
• Death rate per 1,000 of population
• Infant deaths per 1,000 of population under 1 year old
• Life expectancy at birth for males
• Life expectancy at birth for females
• Gross National Product per capita in U.S. dollars
• Country Group
    1 = Eastern Europe
    2 = South America and Mexico
    3 = Western Europe, North America, Japan, Australia, New Zealand
    4 = Middle East
    5 = Asia
    6 = Africa
• Country

a) Create a histogram for the variable “death rate per 1000”. Describe the general shape of the data distribution: normal (bell-shaped), uniform, skewed with a long tail to the left or right, middle-heavy (platykurtic) or tail-heavy (leptokurtic), or bimodal.

b) Choose the hand tool (click the open hand just below the top pull-down menus) and move it over the histogram. Hold the first mouse button down and describe what happens as you slide the mouse left to right. What happens when you slide the mouse up and down instead?

c) How strongly is the histogram affected by changes in interval start points?

d) What are the consequences of too few intervals in a histogram? Too many?

e) Does the box plot change when you manipulate the histogram? Why?

f) Return to the pointer tool (click the arrow button on the tool bar just below the pull-down menus). Try highlighting one bar of the histogram by clicking on it. Now examine the original data table. What effect do you notice? Use this method to identify the countries having the highest death rates (in 1990).

g) Try the reverse: select several rows in the data table using the pointer or selection tool and inspect the effect on the histogram. Use this method to determine how Canada’s death rate compares with that of the rest of the world.

h) Display a normal (bell-shaped) curve over your histogram (click the red symbol beside the variable name, and select Fit Distribution -> normal). The normal curve is one of the most useful distributions for statistical analyses. This is the shape that we hope to find in a plot of our data. How well does your histogram approximate a normal curve? (In future sessions we will learn about more powerful tools for testing normality.)

i) Examine the outlier boxplot included with the histogram. Do the data include any outliers? Use the selection tool (arrow) to click the outlying observation(s) and highlight the corresponding value(s) in the data set.

j) Are the quartiles (the ends of the box) symmetric about the overall median? Does the range of the data set extend equally on either side of the box? Can you tell from the boxplot whether the data are well described by a normal distribution?

k) How similar are the values for the mean and median of the death rate data? Which is larger, and why? Under what distributions would you expect the mean and median to be more similar?


2. Keep the histogram for death rate, but go back to the JMP menus to generate a second histogram, this time for “live birth rate per 1000”. Put the two histogram windows side by side (you may need to look behind some open windows to find the first histogram again; use the Window pull-down menu to bring hidden windows forward).

a) Describe the general shape of the birth rate distribution: normal (bell-shaped), uniform, skewed with a long tail to the left or right, middle-heavy (platykurtic) or tail-heavy (leptokurtic), or bimodal. Compare its shape with that for death rate. Are there any outliers in the birth rate distribution? Can you think of an explanation for why birth and death rates have such different distributions?

b) With the pointer tool, click some of the bars in the death rates histogram corresponding to high death rate and observe the effect on the birth rates histogram. Do countries with high death rates tend also to have high birth rates?

c) Repeat the above but click some of the bars corresponding to low death rates, and observe the effect on the birth rates histogram. Do countries with low death rates tend to have low birth rates? Contrast the result here with your finding in (b). Can you think of a reason for the difference?

3. Return to the data table. The variable “country group” is erroneously listed as a continuous variable in the columns list on the left of the data table window. It should be a nominal variable instead. Click the first mouse button over the c at the left of the variable name and change the variable type. DO NOT SAVE THE DATA.

Using the same steps you used above to produce a histogram, produce a distribution of the nominal variable, “country group.” Instead of a histogram you will see a bar graph (unfortunately, without a gap between the classes). The counts in this case refer to the number of countries falling in each group. Instead of a boxplot you will see a “mosaic plot”, which vertically stacks the same bars to provide easy comparison of relative frequency. We will work more with this type of plot in a later lab.

4. If time permits, experiment with the program to produce other kinds of graphs. For example, try producing a bivariate “scatter plot” (a plot of one variable against another) for pairs of variables such as GNP and infant death rate, birth and death rates, etc. Can you produce a separate boxplot of GNP for each country group?


3. ANALYZING FREQUENCY TABLES

Categorical (nominal) data are usually summarized in frequency tables. Continuous numerical data may also be grouped into intervals and the frequency of observations in each interval may also be summarized in a frequency table (or in a histogram; see earlier lab on “Exploring and Describing Data”). In this lab we will explore two kinds of frequency tables and the ideas they may be used to test.

One-way Frequency Table

The first type of frequency table lists the number of observations in different categories of a single list. An example is the following table evaluating how good humans are at choosing random numbers. The data are from the early years of a US State Lottery, in which players would buy a ticket and choose any number they wanted between 000 and 999. Winnings would be divided between all holders of the winning number, which was chosen randomly. The following data are based on a random sample of 100 players of the Lottery (these are not the winning numbers, but rather they are the numbers selected by players). Listed are the frequencies of numbers chosen that have 0 to 9 as the first digit:

First digit of chosen number    Frequency (fi)
0                               4
1                               16
2                               14
3                               15
4                               13
5                               8
6                               9
7                               7
8                               8
9                               6

Total                           100

These frequencies may be compared to those predicted by different hypotheses. For example, when players pick a number between 000 and 999, are some first digits more popular than others? It looks like numbers beginning with “0” are unpopular, and those beginning with 1 through 4 are excessively popular. A comparison with a uniform distribution using a goodness of fit test would be the appropriate way to determine whether the observed frequencies of first digits are sufficiently variable to rule out simple chance.

Two-way (Contingency) Table

The second kind of frequency table is the two-way table, or contingency table. Here, two or more sets of frequencies are compared. The usual goal is to test whether true (population) relative frequencies are the same in the different sets. An example given below lists the number of survivors and non-survivors in two categories of mountaineers descending from the peak of Mount Everest between 1978 and 1999: those using supplemental oxygen, and those descending without supplemental oxygen. (Most deaths on Mount Everest occur during the descent, not the ascent.)

Survival                    Used supplemental oxygen    Did not use supplemental oxygen
Survived descent            1045                        88
Did not survive descent     32                          8

Total                       1077                        96

(data from Huey and Eguskitza 2000, JAMA 284: 181)

In this case we are comparing two (or more) sets of observed frequencies with each other rather than with an expected frequency distribution. Does the use of supplemental oxygen influence the probability of survival? (This is not an experimental study, so we are really only able to test whether there is an association between use of supplemental oxygen and survival, not whether the oxygen itself is the cause of any change.) A test of differing survival frequencies between the two categories of mountaineers is carried out using a contingency test.
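In a contingency test the expected frequency for each cell is obtained from the table margins: (row total × column total) / grand total. The Python sketch below applies this standard rule to the Mount Everest table above. It is an illustration only, not JMP output, and the variable names are assumptions.

```python
# Observed frequencies from the Mount Everest table.
# Rows: survived descent, did not survive descent.
# Columns: used supplemental oxygen, did not use supplemental oxygen.
observed = [
    [1045, 88],
    [32, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under the null hypothesis of no association:
# (row total * column total) / grand total for each cell.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for obs_row, exp_row in zip(observed, expected):
    for o, e in zip(obs_row, exp_row):
        print(f"observed {o:5d}   expected {e:8.2f}")
```

The expected frequencies have the same row and column totals as the observed table, which is a quick check that they were computed correctly.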

Hypothesis Testing

Forming and testing hypotheses is one of the most basic endeavors in statistical analysis of biological data. With your notes and the course textbook, review your knowledge of the following concepts:

• null hypothesis (Ho) and alternate hypothesis (Ha)
• Type I errors and Type II errors
• significance level
• degrees of freedom

Test Statistics for Goodness of Fit and Contingency Tests

The chi-squared statistic is a measure of discrepancy between observed and expected frequencies, where expected frequencies are those expected under the null hypothesis.

Using the program

In the case of one-way tables, only a single categorical variable is required (e.g., “First digit of chosen number”). Two categorical variables are needed for a two-way (contingency) table (e.g., “Use of supplemental oxygen” and “Survival”). Make sure that after entering the data, the category variable(s) have the nominal attribute (this can be reset in the columns section of the left frame, or by selecting Column Info in the Cols pull-down menu). The observed frequencies may be entered directly to a new column (call it “observed frequency” or “number of observations”).

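The chi-squared test statistic used for goodness of fit and contingency tests is:

$$\chi^2 = \sum_i \frac{(\mathrm{Observed}_i - \mathrm{Expected}_i)^2}{\mathrm{Expected}_i}$$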
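To make the arithmetic of the chi-squared statistic concrete, here is a minimal Python sketch that applies the formula to the lottery first-digit frequencies given earlier, with equal expected frequencies under a uniform null hypothesis. The function name `chi_squared_statistic` is an assumption for illustration; the sketch computes only the test statistic, not the P-value that JMP reports.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Frequencies of first digits 0 to 9 from the lottery table.
observed = [4, 16, 14, 15, 13, 8, 9, 7, 8, 6]

# Under a uniform null hypothesis every digit is expected equally often.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

statistic = chi_squared_statistic(observed, expected)
print(f"chi-squared = {statistic:.2f}")
```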


To produce a bar graph of frequencies from a one-way table, use the Distribution menu option and select the categorical variable as your Y column in the pop-up window. In the same window, select the observed frequency column as your “Freq” variable. To carry out a goodness of fit test, click the red symbol next to the categorical variable name above the bar graph and select Test Probabilities. This action will open a new display box below the frequency table in the Distribution output window. Here you will need to enter the expected frequencies for your test. Click on each row and enter either the expected frequency or the expected proportion for that row (it doesn’t matter which, as long as you are consistent; the goodness of fit test will be carried out using the expected frequencies in either case). JMP displays the expected frequencies it uses to calculate the test statistic if you click on Expected next to Contingency Table. Check that the assumptions of the χ² test are met.

To produce a mosaic plot for a two-way (contingency) table, use the Fit Y by X menu option. In the pop-up window, select one of the categorical variables as your Y column and the other as your X column. Once again, select the observed frequency column as your “Freq” variable. A two-way table will also appear beneath the mosaic plot, giving the observed frequencies (the program will also display the expected frequencies, but you need to select this option by pressing the red symbol next to the Contingency Table title). JMP does include the Fisher exact test, which you can use to validate the results of the chi-square tests.

One-way and two-way frequency tables can be constructed from raw data on individual subjects using the Tables -> Summary option in the pull-down menu or by selecting Summary in the Tables tab on the JMP Starter. In the pop-up menu choose one (in one-way tables) or two (for two-way tables) categorical variables and click the Group button. Then click the Statistics button in the same window and select N. When you click “OK” a new data table will appear that tallies the frequency of observations corresponding to each category or combination of categories.
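The tallying step itself is simple, and the following Python sketch shows the idea outside JMP. The six records are hypothetical values invented for this illustration (they are not taken from any course data file), and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical raw data: one (student hand, mother hand) pair per individual.
records = [
    ("right", "right"),
    ("left", "right"),
    ("right", "right"),
    ("right", "left"),
    ("left", "left"),
    ("right", "right"),
]

# One-way table: frequency of each category of a single variable.
one_way = Counter(student for student, _mother in records)

# Two-way table: frequency of each combination of the two variables.
two_way = Counter(records)

print("One-way table:", dict(one_way))
for (student, mother), count in sorted(two_way.items()):
    print(f"student {student:5s}  mother {mother:5s}  N = {count}")
```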

Problems

1. Enter the Lottery data given above and generate the corresponding bar graph.

a) Examine the bar graph. Do the frequencies appear to vary greatly between classes?

b) Carry out a statistical test of the hypothesis that players favor some first digits over others when choosing a number between 000 and 999. In your work, present all steps (i.e., state hypotheses, give the P-value, the significance level for the test, and state your conclusion). Since the computer provides the P-value directly, there is no need to provide the critical value from the tables in Zar.

c) Compare the results from your visual appraisal of the data to the goodness of fit tests. Which approach provides qualitative information and which one provides quantitative information? What level of uncertainty is associated with those quantitative probabilities?

d) What are the degrees of freedom for these tests? Why?


2. In a breeding experiment, disease resistant, early maturing apple trees were crossed to each other to produce 190 saplings of the types shown below:

Type                                   Number of Offspring
disease resistant, early maturing      111
disease resistant, slow maturing       37
disease susceptible, early maturing    34
disease susceptible, slow maturing     8

Total                                  190

a) Test whether these data are consistent with an expected Mendelian ratio of 9:3:3:1. Show all steps taken in testing the null hypothesis. (You can enter the expected ratios directly to the column of expected probabilities, because the computer will re-scale everything before analysis. Recall, however, that the test itself is carried out using expected frequencies, which must add up to 190).

b) Do the expected frequencies satisfy the assumptions of the chi-square approximation?

3. Enter the Mount Everest mountaineer survival and supplemental oxygen data.

a) Inspect the mosaic plot for these data. Describe the pattern in words. Are the relative frequencies of individuals surviving similar or different in the two oxygen groups?

b) Test whether survival of mountaineers descending from Mount Everest is significantly associated with use of supplemental oxygen. Show all steps in your work (a good habit, as always).

c) Do the expected frequencies satisfy the assumptions of the chi-square test? What strategy do you recommend?

d) The authors who compiled the Everest data also presented results from the teams of mountaineers descending Everest (climbers tend not to go alone). These data are given below. Which data set is the most appropriate to test an association between survival and supplemental oxygen? Why?

Survival                         Used supplemental oxygen    Did not use supplemental oxygen
All team members survived        85                          24
At least one team member died    8                           4

Total                            93                          28

e) Should we conclude from the test in (f) that supplemental oxygen has no effect on survival?


f) EXTRA, for later practice: The same authors also compiled similar data for K2, a nearby summit in the Himalayas. Analyze these data in the same way as for Mount Everest. Are the results the same as those in (f)?

Survival                         Used supplemental oxygen    Did not use supplemental oxygen
All team members survived        12                          24
At least one team member died    0                           12

Total                            12                          36

4. Open the data file student_data.jmp from the shared directory. This file records the data taken from Biology 300 students on the first day of class, January 2001. The variables are:

• height, Student height in cm
• hand, Student handedness (left or right; “both” was classified as left)
• parent.first, Parent listed first by student when giving their heights (mom or dad)
• mom.height, Student’s mother’s height, in cm
• dad.height, Student’s father’s height, in cm
• mom.hand, Whether mother is left or right-handed
• sex, Whether student is male or female

a) Use Distribution to test whether male and female students occur with equal frequency in the Bio 300 class. Note that in the pop-up window you will not need to specify a column for the Freq button because you are working now with the raw data instead of the frequency table.

b) Use an appropriate method to test whether there is a statistical association between handedness of student (left or right-handed) and that of his/her mother.

c) Use Tables to generate a two-way (contingency) table for handedness of student and mother. This method shows how JMP may be used to construct frequency tables from raw data.

d) Some students listed their dad first when giving their parents' height, whereas some students listed their mother first. Does this depend on the sex of the student?


4. DISCRETE PROBABILITY DISTRIBUTIONS

A probability distribution can be a useful model of a biological process. Often we wish to know whether a set of data matches some particular probability distribution. Probability distributions may be discrete or continuous. This week we examine two discrete distributions commonly used in biology: the binomial and Poisson distributions. We will use JMPin to generate random samples from these distributions and explore their characteristics.

The Binomial Distribution

The binomial distribution is one of the most commonly encountered discrete probability distributions in biology. It is based on nominal scale data that come from a population with only two categories. One of the two categories is arbitrarily referred to as a “success” and the other a “failure”. These categories are mutually exclusive events (e.g., female [success] vs. male [failure]; black vs. white; left vs. right). A proportion p of the individuals in the population are in the success category, and a proportion q are in the failure category. The process of randomly selecting an individual from the population is called a trial. The probability of a success p remains constant from trial to trial. Since success and failure are mutually exclusive events, and represent the universe of possibilities, the probability of failure in any one trial is q = 1 - p. The outcome of any particular trial is not affected by the outcome of any other trial (i.e., trials are independent). Under these conditions, the number of successes X in an independent sequence of n trials has a binomial distribution.

The mean of X is µ = np. The variance of X is σ² = npq (the standard deviation is the square root of this).
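The relationships µ = np and σ² = npq can be checked numerically. The Python sketch below evaluates the standard binomial probability n!/(X!(n-X)!) p^X q^(n-X) for every possible X and then computes the mean and variance directly from those probabilities. The function name `binomial_pmf` and the parameter values n = 10 and p = 0.5 are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    # comb(n, x) is n! / (x! (n - x)!)
    return comb(n, x) * p ** x * q ** (n - x)


n, p = 10, 0.5  # illustrative parameter values
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * pr for x, pr in enumerate(probabilities))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))

print(f"sum of probabilities = {sum(probabilities):.6f}")
print(f"mean = {mean:.4f}      (np  = {n * p})")
print(f"variance = {variance:.4f}  (npq = {n * p * (1 - p)})")
```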

The Poisson Distribution

Another discrete probability distribution commonly encountered in biology is the Poisson distribution. This distribution is important in describing random occurrences of events in space or in time. For example, imagine that you would scatter seeds over a vast field from an airplane. Imagine also that you have divided the field up into blocks of equal size, say 10×10 metres in area. If the probability that a given square millimetre of soil receives a seed is low (you haven't dropped a trillion seeds, just a few thousand), and if this probability is the same everywhere across the entire field, and if seeds are independent of each other, then the number of seeds per block, X, should follow a Poisson distribution:

In this equation, P(X) is the probability of seeing X successes in a given block and µ is the mean number of occurrences. The constant e ≈ 2.718 is the base of the natural logarithm.

P(X) =e−µµX

X!

XnX qpXnX

nXP −

−=

)!(!

!)(
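As a rough check on the two formulas, the sketch below evaluates them directly in Python. The function names and the parameter values (n = 10, p = 0.5, µ = 0.5) are illustrative assumptions, not part of the JMP exercises.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    q = 1.0 - p
    return math.comb(n, x) * p**x * q**(n - x)

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

# Illustrative parameter values (assumptions, not taken from a data set)
n, p = 10, 0.5
binom_probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the probabilities; they should match np and npq
binom_mean = sum(x * pr for x, pr in enumerate(binom_probs))
binom_var = sum((x - binom_mean) ** 2 * pr for x, pr in enumerate(binom_probs))

print(f"P(X = 5 | n = 10, p = 0.5) = {binom_probs[5]:.4f}")
print(f"binomial mean = {binom_mean:.3f}, variance = {binom_var:.3f}")
print(f"P(X = 0 | mu = 0.5) = {poisson_pmf(0, 0.5):.4f}")
```

Summing the probabilities over all possible values of X should give 1, which is a quick way to catch a mistyped formula.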


A useful property of the Poisson distribution is that the mean and variance of the number of events (X) in a block are equal: σ² = µ. Thus when sampling from a Poisson distribution, the sample variance to sample mean ratio, often called the coefficient of dispersion ($s^2/\bar{X}$), should be close to 1. Sampling from a distribution other than the Poisson should lead to a value of $s^2/\bar{X}$ less than 1 or greater than 1. The variance to mean ratio is therefore a useful index of “randomness”. For example, if events were evenly spaced among blocks, then the variance/mean ratio would be less than 1. More commonly, events are clumped (e.g., seeds are sticky and land in your field in groups), producing a variance to mean ratio greater than 1.
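The coefficient of dispersion is simple to compute by hand. The following sketch uses two sets of seed counts that were invented for illustration (they are not real data): one nearly even across blocks and one strongly clumped.

```python
import statistics

def dispersion(counts):
    """Sample variance divided by sample mean (coefficient of dispersion)."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical seed counts per block, invented for illustration
even_counts = [2, 2, 3, 2, 2, 3, 2, 2, 3, 2]      # nearly the same count everywhere
clumped_counts = [0, 0, 9, 0, 1, 0, 12, 0, 0, 1]  # most seeds land in a few blocks

print(f"even:    s^2/mean = {dispersion(even_counts):.2f}")
print(f"clumped: s^2/mean = {dispersion(clumped_counts):.2f}")
```

Both sets have the same mean count per block, so the difference in the ratio comes entirely from how the counts are spread among blocks.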

Using the Random Number Generator

To experiment with these distributions you will need to use the calculator functions of JMP. Start by opening a new data table. To use a column to illustrate a probability distribution, create a column and then add the desired number of rows (e.g., 20). Add rows by clicking the red triangle symbol to the left of the Rows label on the left side of the data table window. Then select the column (by double clicking at the top of the column, or clicking once at the top of the column and then choosing Column Info from the Cols pull-down menu). Click New Property -> Formula in the window that pops up. Finally, click the Edit Formula button. This will open the JMP calculator window. This platform allows you to create complex formulas to produce the data for a random variable.

In the calculator window, click the Random option in the Functions (grouped) panel. This will produce a set of options that will allow us to generate random samples from a wide variety of probability distributions. To generate random numbers from the binomial distribution choose Random Binomial. This will generate a formula in the formula box. The two parameters of the binomial distribution will need to be specified here: number of trials (n) and the probability of success in any one trial (p). Click to highlight the first box and type in the number of trials (e.g., 10). Select the second box and type in the probability of success, p (e.g., 0.5). Now click the Apply button of the calculator window (don’t close the window). Examine the data table to see the results. Each time you click the “Apply” button of the calculator window a new random sample is generated. Adding rows also generates new values.

The process for generating random numbers from the Poisson distribution is similar, except that the Poisson distribution requires only one parameter, the mean, µ (the calculator window will refer to this parameter as “lambda” instead).
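For readers who want to see what such a random column contains outside JMP, here is a minimal Python sketch. It is not JMP's generator: the function names are hypothetical, the binomial draw simply counts successes over n simulated trials, and the Poisson draw uses Knuth's multiplication method. The seed and the value lambda = 0.5 are arbitrary choices.

```python
import math
import random

def random_binomial(n, p, rng):
    """One binomial draw: the number of successes in n independent trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

def random_poisson(lam, rng):
    """One Poisson draw with mean lam, using Knuth's multiplication method."""
    limit = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(1)  # fixed seed so the run is repeatable; any seed would do

# Twenty rows each, mirroring a 20-row data table column
binomial_column = [random_binomial(10, 0.5, rng) for _ in range(20)]
poisson_column = [random_poisson(0.5, rng) for _ in range(20)]

print("binomial (n = 10, p = 0.5):", binomial_column)
print("Poisson (lambda = 0.5):    ", poisson_column)
```

Changing the seed plays the same role as clicking “Apply” again: the parameters stay the same but a new random sample appears.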

Using the Calculator

The calculator window may be used to generate new variables (columns) that are functions of variables (columns) already present in the data set. For example, create a new data table having a single column (label it as X) with the following numbers entered in consecutive rows: 0, 1, 2, 3, 4, 5. Now create a new column containing the values of e^X, where e is the base of the natural logarithm. To accomplish this, create a second column and give it the property New Property -> Formula as before. Edit the formula in the calculator window. Choose Functions -> Transcendental -> Exp to start the formula. Select the box between the parentheses in the formula and then click on the column corresponding to the variable X in the Table Columns box of the calculator window. Click Apply and return to your data table. The new column will contain the values for e^X. Repeat these steps but calculate X! instead (“!” refers to the factorial function) using the Functions -> Transcendental -> Factorial function in the calculator window.

More complicated formulas involving more terms and more variables may be carried out within the calculator window. For example, using the same values for X used to calculate X! in the previous example, try to calculate P(X) for the binomial distribution using the formula provided earlier in this section of the lab manual. You can control the form of the equation by selecting appropriate boxes. This may take some trial and error (use <Ctrl> <z> on the keyboard to undo a previous step [hold down the <Ctrl> key and then press <z>]).
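The two derived columns described above can be cross-checked outside JMP. This short sketch, with hypothetical variable names, builds the same e^X and X! columns from the values 0 to 5.

```python
import math

X = [0, 1, 2, 3, 4, 5]                         # the column entered by hand
exp_X = [math.exp(x) for x in X]               # the e^X column
factorial_X = [math.factorial(x) for x in X]   # the X! column

for x, e_x, f_x in zip(X, exp_X, factorial_X):
    print(f"X = {x}   e^X = {e_x:8.3f}   X! = {f_x}")
```

If the values in your JMP columns differ from these, the most likely cause is a formula box left unfilled or pointing at the wrong column.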

Problems

1. Let’s examine the distribution of the number of male offspring in families of rats with litter sizes of 10 (i.e., the number of trials, n=10) in a hypothetical population. Assume that males and females occur with the same probability, so set p=0.5. Generate a random sample of 50 such families using the random number generator.

a) Plot a histogram and boxplots of the number of males and describe the distribution (skewed, bimodal, uniform, normal etc.). Does this sample appear to be symmetric? Does it have any outliers?

b) Note the mean and standard deviation of the random sample. How do they compare with the mean and variance of the population from which you obtained your sample?

c) Click the “Apply” button in the calculator window to generate a new random sample of families. Plot a new histogram and compare it with the previous one. Are the two distributions identical? Why or why not?

d) In mammals, meiotic drive occurs when the Y chromosome is more successful than the X chromosome during sperm formation, with the result that more sons than daughters are produced after mating. Create a second column in the data table, and using the binomial random generator sample 50 families of n=10 offspring in which the probability of a male offspring in any one trial is p=0.90. Plot the histogram and boxplot for the new column of data. Describe the changes to the distribution of the sample from this new binomial population. How are the mean and variance of the distribution changed? Are these changes expected (i.e., calculate the new mean and standard deviation of the number of males in litters of size 10 in the population of families)?

e) Add 950 more families to the two columns. What effect does this have on the shapes of the distributions? On the sample mean and variance? How much do your new sample estimates of mean and standard deviation (in the number of males in families of 10 offspring) differ from population values? Which sample size provides a more reliable estimate of population parameters?

f) Change the number of trials in the second column to 100 (keep p=0.90). How does this change the shape of the distribution?


2. Assume that the number of Asian Gypsy moths captured in traps set for several nights across the lower mainland follows a Poisson distribution with mean 0.5. Set up a new column in the data table (label it “moths”) and use the random number generator to randomly sample trap counts.

a) Produce a histogram and boxplots for the sample and describe its shape. Note the mean and standard deviation for this set of values.

b) Edit the formula for your Poisson distribution under Column Info, and change the mean to 0.1. Produce a new histogram and boxplots, and describe how the shape of the distribution has changed.

c) Increase the mean number of insects found per trap to 15 and describe what happens to the shape of the distribution. What distribution is this starting to resemble?

d) Refer back to the means and standard deviations from parts (a) through (c). What is the approximate relationship between these values? What is the relationship between mean and standard deviation in the populations from which the samples were obtained?

3. During the process of sperm and egg formation in most metazoans, deleterious mutations may occur that will be passed on to the next generation. Thus, offspring individuals of each new generation may carry zero, one, two, or more new mutations. In an experimental study of mutation accumulation in Arabidopsis, 60 offspring of a cross within an inbred line were screened for new mutations. The following results were obtained:

Number of new mutations    Number of individuals
0                          25
1                          22
2                          9
3                          3
4                          1
>4                         0
Total                      60

a) If separate mutations are independent and the probability of a particular gene acquiring a new mutation is small and equal between all genes and all individuals, what distribution should we expect to see for the number of new mutations per individual?

b) Using the formula for this distribution and the JMP calculator, calculate the expected number of individuals possessing 0, 1, 2, 3, 4, and >4 new mutations. Assume that the mean number of new mutations per individual is exactly 1.0.

c) Test whether the data conform to the expected frequencies. Assume that the mean number of new mutations per individual is exactly 1.0. Show all steps in your work.

d) Calculate the observed mean number of mutations per individual and the corresponding standard error of the mean.


5. INTRODUCTION TO THE NORMAL DISTRIBUTION

The Normal Curve

The normal distribution, the theoretical “bell-shaped curve”, is one of the most important continuous distributions in statistics. This is because many types of data, especially biological data, have a distribution that is approximately normal in the population. Even when a variable is not normally distributed, the distribution of sample means is approximately normal if sample size is sufficiently large (Central Limit Theorem). These facts have been used to great advantage in the development of methods for analysing biological data.

This week we will use JMP to take random samples from normal and non-normal distributions, calculate area (probability) under the normal curve, and test the goodness of fit between real data and the theoretical normal curve. The equation for the normal curve is (don’t memorize this, please):

$$P(X) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X-\mu)^{2}}{2\sigma^{2}}}$$

X is a continuous variable with mean µ and standard deviation σ. P(X) is “probability density” rather than frequency, and probability is given by area under the curve (rather than height of the curve). Here is an example of the normal distribution:

[Figure: a normal curve, with X on the horizontal axis]

The normal distribution is symmetric, centred over the mean, with tails that extend to positive and negative infinity.

The existence of an infinite number of normal distributions, each one with a different standard deviation and mean, would make it difficult to calculate areas under the normal curve. We solve this problem by converting X for any normal distribution to the standard normal distribution Z having a mean of 0 and standard deviation of 1. If we convert a value into a standard normal, then Z tells us how many standard deviations from the mean the value is. The formula for converting is:

$$Z = \frac{X - \mu}{\sigma}$$

Areas under the standard normal curve are provided in statistical tables in the back of your textbook and are the same as those given by JMP.
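To make the two formulas concrete, the sketch below writes out the normal density and the Z conversion and compares the density with Python's built-in `statistics.NormalDist`. The function names and the values (mean 50, standard deviation 10, X = 65) are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve at x, written out from the equation."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Illustrative values (assumptions): a variable with mean 50 and sd 10
mu, sigma, x = 50.0, 10.0, 65.0
z = z_score(x, mu, sigma)

print(f"Z = {z:.2f}")
print(f"density by formula = {normal_density(x, mu, sigma):.5f}")
print(f"density by library = {NormalDist(mu, sigma).pdf(x):.5f}")
```

The two density values should agree, which confirms that the written-out equation and the library describe the same curve.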


Obtaining Probabilities Under the Normal Curve

JMP can calculate exact probabilities under the normal curve. Open a new data set and create a single row. Then choose Formula from the columns menu to open the calculator window. Under Functions choose Probability -> Normal Distribution. Enter a number between the brackets in the formula (e.g., 1.96) and click “Apply”. The row you created in the data table will contain Prob(Z<1.96). Note that JMP gives the probability of a value less than the number you enter, whereas many statistical tables give the probability of a value greater than the number.

With a continuous probability distribution like the normal, the probability of getting exactly any one value is effectively zero. Therefore we use the normal distribution to find the probability of getting a value within some range.
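The relationship between lower-tail, upper-tail and range probabilities can be sketched with Python's `statistics.NormalDist`, used here as a stand-in for the JMP Normal Distribution function. The cut-off values are arbitrary examples.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

lower_tail = Z.cdf(1.96)             # Prob(Z < 1.96), the form JMP reports
upper_tail = 1.0 - lower_tail        # Prob(Z > 1.96), the form many tables report
between = Z.cdf(1.0) - Z.cdf(-1.0)   # Prob(-1 < Z < 1), an area over a range

print(f"Prob(Z < 1.96)   = {lower_tail:.4f}")
print(f"Prob(Z > 1.96)   = {upper_tail:.4f}")
print(f"Prob(-1 < Z < 1) = {between:.4f}")
```

Any range probability is a difference of two lower-tail areas, so a lower-tail function is all that is needed.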

Evaluating Fit to the Normal Distribution

Because the normal distribution is an assumption of so many methods for analysing data, a way to evaluate the assumption is needed. JMP has two visual tools for assessing the goodness of fit between the data and a normal distribution.

The first tool is the simplest, and involves comparing the histogram of the data with the normal curve having the same mean and variance as the data. Click the red triangle symbol next to the variable name above the histogram and choose Fit Distribution -> Normal. This will result in a normal curve superimposed on the histogram, to allow a visual comparison of shape. To supplement the visual impression, get JMP to calculate skewness and kurtosis of the data by clicking the red triangle symbol again and selecting Display Options -> More Moments. The normal curve has zero skew and kurtosis, and departures from zero in the data inform us about departures from normality. A skewed distribution is asymmetrical, and a distribution with non-zero kurtosis is either more peaked or flatter than the normal distribution.
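As a rough illustration of what skewness and kurtosis measure, the sketch below uses simple moment-based definitions. This is an assumption made for clarity: JMP may apply small-sample corrections, so its reported values can differ somewhat. The two data sets are invented.

```python
import statistics

def skewness(values):
    """Mean cubed deviation divided by the cube of the standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def excess_kurtosis(values):
    """Mean fourth-power deviation over sd**4, minus 3 so a normal curve scores 0."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 4 for v in values) / (len(values) * sd ** 4) - 3.0

# Hypothetical measurements, invented for illustration
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    print(f"{name:13s} skew = {skewness(data):6.2f}  kurtosis = {excess_kurtosis(data):6.2f}")
```

A long right tail gives positive skewness, and a flat, evenly spread sample gives negative kurtosis on this scale.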

The second tool is the normal quantile plot. This plot is easier to explain if you have chosen the horizontal layout (click the red triangle symbol next to the variable name above the histogram and choose Display Options -> Horizontal Layout). Then click the red triangle symbol again and choose Normal Quantile Plot. This will generate a new plot next to the histogram that compares quantiles of the data on the X-axis with Z-values corresponding to each quantile of the normal distribution on the Y-axis. The plot is basically a cumulative relative frequency distribution, with which you are already familiar, but here cumulative relative frequency (given on the Y-axis; see the numbers ranging from 0.01 to 0.99 along the inside edge of this axis) is plotted on a normal probability scale. The numbers on the outside edge of the Y-axis are Z-values corresponding to successive values for cumulative relative frequency. If the data are normally distributed, then the points in the figure will lie on a straight line. Departures from a straight line indicate departures from normality.
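The coordinates behind a normal quantile plot can be sketched as follows. The cumulative relative frequency assigned to each sorted observation, (i - 0.5)/n, is one common convention and is an assumption here; JMP may use a slightly different one. The lifespans are invented for illustration.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each sorted observation with the Z-value of its cumulative frequency."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered, start=1):
        cum_freq = (i - 0.5) / n            # assumed plotting position
        z = NormalDist().inv_cdf(cum_freq)  # Z-value with that area to its left
        points.append((value, z))
    return points

# Hypothetical lifespans in days, invented for illustration
lifespans = [38, 45, 52, 56, 58, 61, 63, 67, 72, 81]
for value, z in normal_quantile_points(lifespans):
    print(f"{value:5d}  {z:6.2f}")
```

Plotting the first column against the second gives the quantile plot; points from a normal sample should fall close to a straight line.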

JMP will also carry out a goodness of fit test of normality. To carry out this test you will first need to Fit Distribution -> Normal as explained above. Then go to the results and click the red triangle symbol next to the Fitted Normal heading and select Goodness of Fit. For very large samples (>2000), JMP carries out the Kolmogorov-Smirnov-Lilliefors test, which compares the observed and expected cumulative relative frequency distributions. For smaller samples JMP provides the Shapiro-Wilk test, which tests the adequacy of a linear fit in the normal quantile plot. Both tests evaluate the null hypothesis that the data come from a normal distribution.

Problems

1. Determine the following probabilities under the normal curve.

a) What is the probability of obtaining a Z value less than or equal to -1.00?

b) What is the probability of obtaining a Z value less than or equal to -1.96?

c) What is the probability of obtaining a Z value greater than or equal to 2.50? What is the probability of obtaining a Z value greater than 2.50?

d) What is the probability of obtaining a Z value greater than -0.65?

e) What is the probability of obtaining a Z value between -2.3 and 0.7?

f) What is the probability of obtaining a Z value less than –1.2 or greater than 0.2?

g) What is the probability of obtaining a Z value less than –1.2 and greater than 0.2?

h) Using Probability -> Normal Quantile, the normal quantile function, what value of Z corresponds to an area of 0.05 on the left tail of the standard normal distribution?

i) What value of Z corresponds to an area 0.01 in the upper (right) tail of the standard normal distribution?

j) What values of Z correspond to a total area of 0.25 spread evenly between both tails?

2. Open the fruitflies.jmp file in the shared directory. These data were collected by Partridge (1981, Nature 294: 580-581) to test whether male flies suffered a survival cost from mating. The life spans of individual males supplied with 1 or 8 receptive virgin females per day were compared with life spans of three types of control males. The first two types of control males were supplied with either 1 or 8 newly inseminated females (newly inseminated females will not re-mate for at least two days, but they control for other effects females have on males, e.g., competition for food). The third type of control males were kept alone (i.e., 0 females were added). The four variables are:

Number of female partners supplied daily to males (0, 1 or 8)
Treatment (0, 1 or 8 virgin females, 1 or 8 newly inseminated females)
Male lifespan, in days
Male thorax length, in mm

a) For now, ignore the fact that males are in different treatment groups and consider all of them together. Visually compare the histogram of male lifespan with a normal distribution. Is the fit reasonably close? Add the normal quantile plot and reassess the fit of the data to a normal distribution. Is the fit reasonably good?

b) Test whether male lifespan fits a normal distribution. Show all steps. If the null hypothesis is not rejected, does this mean the data are from a population having a normal distribution?

c) Repeat steps (2a) and (2b) for the variable male thorax. How do the results compare with those for male lifespan?

3. In fact the males in the data set fruitflies.jmp come from different treatment groups, and treating them as though they constitute a single sample is not valid. We will deal with analysis of multiple treatment groups later in the course. Here we briefly explore the distributions of multiple groups. Use Distribution to plot a histogram of lifespan separately for each treatment group (this can be done all at once by selecting the variable Treatment in the By box of the Distribution popup window). Click the red triangle symbol next to the label Distributions at the top of the results window and select Stack for easier comparison of histograms.

a) Produce a normal quantile plot for lifespan separately for each treatment group. Does each sample conform reasonably well with a normal distribution? Test each fit using the Shapiro-Wilk test. Do any of the groups depart significantly from the normal distribution?

b) Return to the fly data table. This time, use Fit Y by X to plot male lifespan (Y) against treatment (X). This will produce a plot in which the lifespans of males are plotted separately for each treatment group. Click the red triangle symbol at the top of the result window and select Display Options -> Mean Error Bars. This illustrates the mean ± 1 standard error for each treatment group. Which treatment group appears to have the shortest mean lifespan (for now, you can refrain from testing differences)?

c) Click the red triangle symbol again at the top of the result window and select Normal Quantile Plot. This produces a normal quantile plot for each treatment group separately.

4. Open the file cntrlmt.jmp. This file has a set of 5 columns, each of which uses random numbers between 0 and 1 that have been raised to the fourth power. This distribution is highly skewed, with most observations lying near 0. The first column of the data table is simply a random sample from this distribution (each value is a random number between 0 and 1 that has been raised to the fourth power). Each row of the second column is a mean of a random sample of n=5 observations from this distribution. The third column reports means of random samples of n=10 observations. The fourth and fifth columns are means of samples of n=50 and n=100 observations. In reality, you will rarely have the opportunity to take multiple samples from a population to examine the distribution of sample means; the idea that you would get a different value for the sample mean each time represents a “thought experiment”. Here, the computer is used to illustrate the outcome of such a “thought experiment”.

a) Add 500 rows to each of the 5 columns. Display histograms for each of the 5 columns and compare their general shapes. Which ones appear to have the best fit to a normal distribution?

b) Compare skew and kurtosis of each of the 5 columns. What happens to these values as sample size increases?

c) Carry out a goodness of fit test to the normal distribution on each of the columns. Which ones are significantly non-normal according to the test?

d) What principle is illustrated by the fact that successive columns provide increasingly better fits to the normal distribution even though the underlying distribution is not normal?


6. ONE-SAMPLE INFERENCE

Confidence interval for the mean

Unfortunately, the mean calculated from a sample, $\bar{X}$, will generally differ from the population mean µ. The expected discrepancy between $\bar{X}$ and µ depends on the size of the sample and the variability of X. If sample size is small and X has high variance, then $\bar{X}$ may be quite far from the population mean. In contrast, if sample size is large and X has low variance, $\bar{X}$ will probably be close to µ. The confidence interval (CI) combines information on sample size and variability to put probabilistic bounds on estimates of the population mean. CIs can be calculated for any desired degree of confidence, but 95% confidence intervals are most common. If your sample is random and the population has a normal distribution, you can be “95% confident” that your confidence interval includes the population mean. More accurately, if you sample repeatedly and generate a 95% CI each time, you can expect the CI to include the population mean in 95% of the cases, and not in the other 5% of cases. Since you usually don't know the population mean, you'll never know when this happens. If the data are not from a normal distribution, then the 95% CI will include the true mean in approximately 95% of cases only if sample size is large (follows from the Central Limit Theorem).
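The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is a thought experiment under stated assumptions: the population is normal with µ = 50 and σ = 10, each sample has n = 10 observations, and the critical value 2.262 is t0.05(2),9 read from a standard t table. The function name and seed are arbitrary.

```python
import math
import random
import statistics

T_CRIT_DF9 = 2.262  # t0.05(2),9 from a standard t table (sample size n = 10)

def confidence_interval(sample, t_crit):
    """95% confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - t_crit * se, mean + t_crit * se

# Assumed population for the thought experiment: normal, mu = 50, sigma = 10
MU, SIGMA, N, REPS = 50.0, 10.0, 10, 4000
rng = random.Random(2005)  # fixed seed so the run is repeatable

hits = 0
for _ in range(REPS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    lower, upper = confidence_interval(sample, T_CRIT_DF9)
    if lower <= MU <= upper:
        hits += 1  # this interval captured the true mean

coverage = hits / REPS
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The proportion should land near 95%, although any single interval either contains µ or does not.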

Hypothesis testing for the mean

Samples from a population may also be taken to test hypotheses about the population mean. For example, the sample may be the result of an experiment designed to test for a proposed treatment effect against the null hypothesis of “no effect”:

Ho: µ = 0
Ha: µ ≠ 0

If the random sample is from a normally distributed population, then the (two-tailed) one-sample t-test may be used: reject Ho if $t \ge t_{0.05(2),\nu}$ or $t \le -t_{0.05(2),\nu}$, where

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

and ν is degrees of freedom (n - 1). Here µ is the value of the population mean stated in Ho, s is the sample standard deviation and n is the sample size. If the data are not from a normal distribution, the same procedure may be used only if n is large (follows from the Central Limit Theorem). If the data are not from a normal distribution and the sample size is not large, the Wilcoxon signed-rank test may be used instead. We will learn more about the Wilcoxon and other “non-parametric” tests in future lab exercises. These tests are based on ranks and do not require the assumption that the population is normally distributed. However, rank tests are generally less powerful than tests based on the normal distribution, and the latter are therefore preferred if the assumption of normality can be met.

Using the Program

JMP calculates the 95% confidence interval for the mean automatically when you examine the Distribution of a variable. The values for the upper and lower limits to the interval are shown in the Moments table. The 95% confidence interval for the mean is also displayed in the outlier boxplot as a diamond shape, with the mean being the midpoint of the diamond. To obtain 99% or other confidence intervals click the red triangle symbol next to the variable name in the results window and select Confidence Interval → .99. To carry out a one-sample t-test or Wilcoxon test click the red triangle symbol next to the variable name in the results window and select Test Mean. The results will be displayed under the Test Mean=value heading. If you click the red triangle symbol at the bottom of this display you will see a visualization of the P-value calculation (use the buttons at the bottom of the visualization to toggle between one- and two-tailed P-values). This visualization is especially helpful if you are carrying out a one-tailed test.

The P-value for a two-tailed test is given by the line that reads: "Prob >|t|".
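The t statistic that Test Mean reports can be checked by hand. The sketch below computes it for a set of values invented for illustration, and compares it with the critical value t0.05(2),9 = 2.262 taken from a standard t table. It does not compute the P-value, and the function name is hypothetical.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for Ho: the population mean equals mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return (statistics.fmean(sample) - mu0) / se

# Hypothetical treatment effects, invented for illustration (n = 10)
effects = [0.8, -0.3, 1.2, 0.5, 0.9, -0.1, 1.4, 0.7, 0.2, 1.1]
T_CRIT_DF9 = 2.262  # t0.05(2),9 from a standard t table

t = one_sample_t(effects, mu0=0.0)
reject = abs(t) >= T_CRIT_DF9  # two-tailed decision rule
print(f"t = {t:.3f}, df = {len(effects) - 1}, reject Ho: {reject}")
```

Comparing |t| with the critical value gives the same decision as checking whether "Prob >|t|" is below 0.05.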

Problems

1. A meta-analysis involves the combining of information from available published studies that have investigated the same question using “similar” methods. The approach analyzes the magnitude of effects seen in separate experiments, and the unit of observation is the individual published study. The data file metacompetition.jmp (located on the shared drive S) contains a summary of available studies from a recent meta-analysis of plant competition (Gurevitch and Hedges 1993, Meta-analysis: combining the results of independent experiments. Pp 378-398 in S. M. Scheiner and J. Gurevitch, eds, Design and analysis of ecological experiments. Chapman & Hall, New York). The variable Competition effect size measures the difference in plant performance (e.g., growth rate) between two types of experimental plots: those in which the target species occurs alone, and those in which a competing species is also present. Effect size has been scaled relative to variability within studies, and therefore has no units. A positive number means that adding a potentially competing species reduced growth, whereas a negative number means that adding a competitor enhanced growth.

a) Estimate the mean Competition effect size among studies, assuming that the 43 cases represent a random sample of effect sizes in plant competition experiments. Generate a 95% confidence interval for the mean.

b) State the interpretation of the 95% confidence interval.

c) Other than the assumption of a random sample, what assumption did you require when calculating the confidence interval? Is this assumption met in the present data set? Explain.

d) State the Central Limit Theorem. Can we appeal to this theorem in our analysis in (a)?

e) Test whether the mean competition effect size is significantly different from zero. In light of your answer to (c), which test would you recommend?

f) What was the probability of committing a Type I error in your analysis?

g) A number of the measurements of competition effect size come from the same study (as indicated by the Source variable). Is this a problem for your analysis? Explain. What would you recommend as a solution? (If time permits, try out your solution and see if you obtain the same answer as in (e)).


2. The mean specific activity of the enzyme Na+-K+-ATPase in gills of most freshwater teleost fishes at 15°C is known to be 3.33 micromoles of phosphate / milligram of protein / hour. The specific activity of this enzyme in the gills of marine fishes is expected to be higher than in freshwater fishes due to the greater salinity of their environment. To test this, the specific activity of Na+-K+-ATPase was measured in gills of a sample of marine-dwelling hagfish (Eptatretus stouti) (units are micromoles of phosphate / mg of protein / hour). The data are stored in the file called hagfish.jmp.

a) Is the specific activity of Na+-K+-ATPase in gills of the hagfish different from that in gills of most freshwater teleost fishes? Use the most powerful test available. Show all steps taken in testing the hypotheses.

b) In addition to the assumption of a random sample, what assumption did your test in (a) require? Test this assumption. Was your assumption valid?

3. Vertebrates and other animals frequently lose mass during periods of low food supply, but their structural body size generally continues growing or remains the same. The data in iguana.jmp are measurements of changes in body length of male Galápagos marine iguanas from Santa Fe island, Galápagos, during the El-Niño event of 1997/1998 (Wikelski, M. and C. Thom. 2000. Nature. 403:37−38; partial data set kindly provided by M. Wikelski).

a) Test whether mean body length changed during the El-Niño event. Justify the method you used by also testing its assumptions.

b) If you rejected the null hypothesis in (a), do the results imply that conditions during the El-Niño caused shrinkage of marine iguanas?

c) What was the mean % change in length? Comment on the difference between the magnitude of the P-value in your test in (a) and the magnitude of the effect. Are the P-value and the magnitude of effect expected to be related in general?

d) Observe the 95% confidence interval for change in body length. Are the confidence limits likely to be accurate? Explain.

e) Plot the histograms for Length before and for Change in length side-by-side. In the histogram for Length before click on the bars corresponding to individuals of greater than average size in the population. Observe the values of Change in length for these individuals (indicated by dark shading in the corresponding histogram). Repeat for individual iguanas of smaller than average size. Do you notice a pattern? Use the remaining two variables in the data table (Size category and Direction of change) to test whether change in body length during the El-Niño depended upon initial body length.


4. With a computer we can take multiple random samples from a population and compute confidence intervals each time. This will aid the interpretation of the confidence interval based on only a single sample (the usual case, in real life). Click on the link below to visit the confidence limits page of the Rice Virtual Lab in Statistics. Once you are there, click Begin to start the Java Applet and then click the Back button on your browser so that you return to this page: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html

(if this did not work, paste this web address into the location bar of your browser and hit the Enter key on your keyboard). The Java Applet allows you to take multiple random samples from a normal population with mean µ = 50 and standard deviation σ = 10, and compute 95% and 99% confidence intervals for each sample. In the Applet window specify the sample size (n) and click Sample to generate confidence limits for 100 random samples. Observe the number of confidence intervals that contained the true population mean µ = 50 and the number that did not. Click Sample to add another 100 samples; click Clear to begin again.

a) What percent of the 95% confidence intervals would you expect to contain the population mean µ = 50? What percent of the 99% confidence intervals would you expect to contain the population mean?

b) Which is wider, 95% or 99% confidence intervals?

c) How does sample size affect the number of intervals that contain the population mean? Explain.

d) How does sample size affect the width of the intervals?

e) The widths of the intervals vary somewhat even for a given sample size. Why?


7. COMPARISON OF TWO MEANS: PAIRED-SAMPLES

Two-sample vs. Paired-sample Designs

In this lab we consider the problem of estimating and testing differences between two means. For example, we may be interested in comparing the effects of two different medications on patient mean blood pressure. Or, we may wish to compare the effects of different fertilizers on mean plant growth. There are two completely different ways of carrying out such comparisons of means. The first approach is to randomly assign independent observations (e.g., patients, field plots) to different treatments. In this case we have two samples of individuals, each representing separate populations: one sample of individuals given drug #1 and a second sample of individuals given drug #2 (or, one sample of field plots treated with fertilizer #1 and another sample treated with fertilizer #2). This is the two-sample design, and our goal is to compare the two means ($\mu_1$ and $\mu_2$) using two random samples of patients (or, field plots).

The second approach is to apply both treatments to each independent observation in the random sample (treat each patient with both drugs in random order and separated by time; or, divide each field plot into equal halves, and apply one fertilizer to one half and the second fertilizer to the other half). This is the “split plot” or paired-sample design, which is the subject of the present lab exercise.

The difference between these two approaches is crucial, and affects the statistical method used to test for treatment effects. In particular, when a paired design is used, the two measurements made on every individual at the end of the experiment must be reduced to a single number: the change, or difference d, between the two measurements. We then use the familiar one-sample methods to estimate the mean difference $\mu_d$ and/or test hypotheses about the mean difference. Paired-sample inference is a straightforward extension of one-sample methods learned in the previous lab exercise. Methods for dealing with two-sample experiments are covered in the next lab exercise.

Confidence Interval for a Mean Difference

The confidence interval for the mean difference µd between paired measurements is obtained in the same way as that for a single population mean. We simply treat our sample of differences for what it is: a random sample from a single population. Thus, for paired data the 95% confidence interval for the mean difference is:

$$ \bar{d} - t_{0.05(2),\nu}\, s_{\bar{d}} \le \mu_d \le \bar{d} + t_{0.05(2),\nu}\, s_{\bar{d}} $$

where µd is the parameter for the mean difference between measurements, $\bar{d}$ is the sample mean difference, $s_{\bar{d}}$ is the standard error of the sample mean difference, and ν is the degrees of freedom (n − 1). This interval assumes that the differences are from a normally distributed population. If the data are not from a normal population then the computed confidence interval is approximate, and is expected to be accurate only when n is large (by the Central Limit Theorem).
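The interval for a mean difference can be checked by hand or with a short script. The sketch below is illustrative only: the function name `paired_ci` and the five pairs of measurements are made up for the example (they are not from any lab data file), and the critical value 2.776 is the tabled two-tailed t for α = 0.05 with 4 degrees of freedom.

```python
import statistics

def paired_ci(first, second, t_crit):
    """Confidence interval for the mean difference of paired measurements.

    t_crit is the two-tailed critical value of t with n - 1 degrees of
    freedom, looked up in a t table by the caller.
    """
    d = [a - b for a, b in zip(first, second)]  # one difference per individual
    n = len(d)
    d_bar = statistics.mean(d)
    se = statistics.stdev(d) / n ** 0.5         # standard error of the mean difference
    return d_bar - t_crit * se, d_bar + t_crit * se

# Hypothetical paired measurements on five individuals.
before = [7.1, 6.8, 7.4, 7.0, 6.9]
after = [6.9, 6.9, 7.0, 6.7, 6.8]
lower, upper = paired_ci(before, after, t_crit=2.776)
print(round(lower, 4), round(upper, 4))
```

Note that the two columns are reduced to a single column of differences before anything else is computed, exactly as in the one-sample case.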

Hypothesis Testing for a Difference

The paired-sample t-test is appropriate for testing Ho: µd = 0 vs. Ha: µd ≠ 0 when the population of differences d has a normal distribution. Standard methods should therefore be applied to the random sample of d values to test the validity of this assumption. The one-sample t-statistic is our measure of discrepancy between the sample mean $\bar{d}$ and the value of µd stated in the null hypothesis:

$$ t = \frac{\bar{d}}{s_{\bar{d}}} $$

If d has a normal distribution, then t has a t-distribution with n − 1 degrees of freedom, where n is the sample size (number of independent observations).

What if d does not have a normal distribution in the population? If n is large then the distribution of the sample mean $\bar{d}$ is nevertheless approximately normal (by the Central Limit Theorem) and we may still use the one-sample t-test as above. If d is not normally distributed and sample size is not large, then we may need to do a sign test.

The sign test is just an application of the familiar binomial test. We record whether the differences d are positive or negative. Under the null hypothesis of no difference, the number of positive d-values should be roughly equal to the number of negative d-values. Let p be the proportion of differences that are positive. Under the null hypothesis of no difference, Ho: p = 0.5, whereas Ha: p ≠ 0.5 under the alternative hypothesis.
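As a rough illustration of the sign test, the exact two-tailed binomial probability can be computed directly. The function name `sign_test` is hypothetical, and dropping differences that are exactly zero is an assumption made here (a common convention, but check your textbook).

```python
from math import comb

def sign_test(differences):
    """Exact two-tailed sign test of Ho: p = 0.5.

    Differences equal to zero are dropped (an assumption of this sketch).
    Returns (number positive, number negative, P-value).
    """
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    k = min(pos, neg)
    # Binomial probability, with p = 0.5, of a split at least this uneven in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)

# Nine positive differences and one negative difference (made-up values).
print(sign_test([0.3, 0.1, 0.4, 0.2, 0.5, 0.1, 0.2, 0.6, 0.3, -0.2]))
```

Only the signs of the differences enter the calculation; their magnitudes are ignored, which is why the sign test is less powerful than the t-test when the normality assumption holds.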

Using the Program

To carry out the paired-sample t-test or its non-parametric analogue you will need to enter both measurements for each individual in separate columns on the same row. Then create a new variable computed as the difference between the paired measurements. Then proceed as in the earlier lab exercise on one-sample tests.

Problems

1. Before proceeding with further research into the mechanisms regulating erythrocyte pH in toads (Bufo marinus), scientists compared two methods of measuring intracellular pH to determine whether or not the methods give the same results. Arterial blood (0.8 ml) was collected from a random sample of 37 toads. Each sample was equally divided and erythrocyte pH in each aliquot was determined either by a freeze-thaw (FT) method or a method involving C14-labelled 5,5-dimethyl-2,4-oxazolidinedione (DMO). The data are stored in the file toads.jmp on the shared drive. Each row corresponds to a different toad.

a) Test whether the two methods give the same results on average, using the most powerful test available. Show all steps.

b) What assumption is required in (a)? Visually examine the data for departures from this assumption. Is your assumption met? Explain.

c) Carry out a test of your assumption given in (b). Do the results of the test match your visual interpretation? Recommend a strategy for testing differences between the two methods on the basis of your results.

d) Calculate the 95% confidence interval for the difference between means. Based on your evaluation in (b) and (c), is the interval likely to be accurate? Explain.

2. Scientists studying the effect of slash burning examined the diversity of spiders in clear-cut areas of coastal forests. The number of species of spiders was measured at 27 sites of equal size (1.4 ha). The sites were then burned. Four years later the number of spider species at each site was measured again. The results are stored in the file spider.jmp on the shared drive.

a) Was there a significant change in number of species of spiders between the two sampling periods? Explain how you chose your method for testing.

b) Comment on the advantages and disadvantages of a paired sampling design such as the one used here over a two-sample design in which the experimenter simply compares spider diversity in burned plots with that in other plots not burned.

c) Compute the 95% confidence interval for the change in number of species. Is this interval likely to be accurate? Explain.

3. EXTRA QUESTION FOR REVIEW - SAVE UNTIL END: In insect species whose females mate with multiple males (polyandrous), male seminal fluid contains toxins that increase the proportion of fertilizations a male obtains relative to other males mating with the same female. However, these toxins reduce the survival of females. Experiments have shown that over multiple generations females evolve defenses to prevailing male toxins, but that males forever evolve new toxins. The result is a long-term “arms race” between the sexes. Researchers have postulated that this process in polyandrous insect species should speed the rate at which sterility barriers evolve between different populations of that species, increasing the rate at which new species are formed. In contrast, sterility barriers should evolve more slowly in insect species whose females mate only once (monandrous), since no arms race between the sexes occurs, yielding a lower rate at which new species are formed. To test this idea, Arnqvist et al. (2000, Proc. Natl. Acad. Sci. 97:10460–10464) compared the total numbers of species in 25 pairs of insect taxa. Each pair consisted of two closely related “clades” (a clade is a group of species all of which share a common ancestor). One of the clades contained only polyandrous species, whereas all of the species in the other clade of the pair were monandrous. The number of species in each pair of clades is provided in the file conflict & speciation.jmp. These data were taken directly from Table 1 in Arnqvist et al. (2000).

a) Using these data, test whether the number of species in polyandrous clades is significantly different from the number in monandrous clades. Use a two-tailed test. Justify your choice of method by testing appropriate assumptions.

b) Repeat the exercise in (a) using the log number of species in clades instead. How did this affect the best procedure for testing the hypotheses? Does your conclusion differ? [We will be investigating the use of data transformations like the logarithm more thoroughly in a later lab exercise.]

c) Carry out a test of the same hypotheses using the sign test (a.k.a., the binomial test) in JMP. How do your results compare with those from the previous tests?

d) Were the authors justified in concluding that polyandrous taxa of insect species have more species than related monandrous taxa?

8. TWO-SAMPLE INFERENCE

As mentioned in the previous lab, there are two ways of comparing means of two treatments. In the paired-sample design, both treatments are applied to each independent unit (e.g., patient or field plot) in the random sample. In the two-sample design, independent observations are fully assigned to one or the other treatment. The two samples of individuals represent separate populations, and our goal is to compare the means of these two populations, µ1 and µ2.

Distribution of Differences Between Sample Means

The foundation for analysis of means of two populations is the fact that if X has a normal distribution in each of the two populations, with equal variance σ², then the difference between sample means, $\bar{X}_1 - \bar{X}_2$, also has a normal distribution.

You will have only a single estimate of each mean, but keep in mind that if you were to go back and collect two more random samples, the value of $\bar{X}_1 - \bar{X}_2$ obtained the second time would be different from that obtained the first time. The mean of the distribution of possible values for $\bar{X}_1 - \bar{X}_2$ is µ1 − µ2, and its standard deviation is $\sigma_{\bar{X}_1 - \bar{X}_2}$.

In this case, the quantity

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} $$

has a t-distribution with n1 + n2 − 2 degrees of freedom. This fact is the basis of the two-sample t-test for a difference between population means, and of the confidence interval for the difference between two means. The quantity $s_{\bar{X}_1 - \bar{X}_2}$ is computed from the pooled sample variance, $s_p^2$, where

$$ s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} $$
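To make the calculation concrete, here is a minimal sketch of the two-sample t-statistic under Ho: µ1 − µ2 = 0. The function name `pooled_t` and the two small samples are invented for illustration. The manual does not spell out the pooled variance itself, so the code assumes the usual textbook definition: the two sample variances weighted by their degrees of freedom.

```python
import statistics

def pooled_t(x1, x2):
    """Two-sample t statistic for Ho: mu1 - mu2 = 0, using the pooled variance.

    Assumes the usual textbook definition of the pooled sample variance:
    sp2 = ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2).
    Returns (t, degrees of freedom).
    """
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * statistics.variance(x1)
           + (n2 - 1) * statistics.variance(x2)) / df
    se = (sp2 / n1 + sp2 / n2) ** 0.5  # standard error of the difference
    t = (statistics.mean(x1) - statistics.mean(x2)) / se
    return t, df

# Made-up samples: means 3 and 5, both with sample variance 2.5.
print(pooled_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

The resulting t is compared with the critical value of the t-distribution with n1 + n2 − 2 degrees of freedom.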

If X is normal in both populations but with unequal variances, then a modified version of the above equation yields Welch’s t-statistic, which has an approximate t-distribution. Consult your textbook for this calculation, and for the calculation of the appropriate degrees of freedom.
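For orientation only, the sketch below uses the commonly cited Welch-Satterthwaite form of the calculation; your textbook remains the reference, and its notation may differ. The function name `welch_t` and the sample values are assumptions made for the example.

```python
import statistics

def welch_t(x1, x2):
    """Welch's approximate t statistic and its approximate degrees of freedom.

    Uses the commonly cited Welch-Satterthwaite formula for the degrees of
    freedom (an assumption of this sketch; check it against your textbook).
    """
    n1, n2 = len(x1), len(x2)
    a = statistics.variance(x1) / n1  # squared standard error of mean 1
    b = statistics.variance(x2) / n2  # squared standard error of mean 2
    t = (statistics.mean(x1) - statistics.mean(x2)) / (a + b) ** 0.5
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Made-up samples with very different variances and sample sizes.
print(welch_t([1, 2, 3, 4, 5], [0, 10, 20]))
```

Unlike the pooled test, the degrees of freedom here are usually not a whole number, and they shrink toward those of the smaller, more variable sample.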

Comparing Two Population Variances

The assumption of equal variances can be tested using the two-sample F test (JMP computes Bartlett’s test, which is exactly equivalent to the F test). This test is very sensitive to the assumption that the variable has a normal distribution in both populations. More robust, if less powerful, methods also exist, including the Levene test. JMP computes the Levene test along with two other tests, the O’Brien and the Brown-Forsythe tests.

Non-parametric Alternative to the Two-sample t Test

If the populations are not normally distributed, and sample size is not large enough to appeal to the Central Limit Theorem, then an alternative approach is to use a nonparametric test. Nonparametric tests are based on the ranks of the data rather than the data themselves, and they assume only that X is a continuous variable. The nonparametric equivalent of the two-sample t-test is the Wilcoxon rank sum test (equivalent to the Mann-Whitney U test). Under optimal conditions the Wilcoxon rank sum test is about 95% as powerful as a two-sample t-test, although it may be less powerful in specific settings.
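The idea of replacing measurements by their ranks can be sketched as follows. The helper names `midranks` and `rank_sum` are hypothetical; the code only computes the rank sum W for the first sample and the matching Mann-Whitney U, and leaves the P-value to tables or to JMP.

```python
def midranks(values):
    """Ranks 1..N; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum(x1, x2):
    """Rank sum W of sample 1 in the pooled data, and the corresponding U."""
    ranks = midranks(list(x1) + list(x2))
    n1 = len(x1)
    w = sum(ranks[:n1])
    u = w - n1 * (n1 + 1) / 2
    return w, u

# Made-up samples in which every value of the first exceeds every value of the second.
print(rank_sum([4, 5, 6], [1, 2, 3]))
```

Because only the ordering of the observations matters, an extreme outlier has no more influence on W than any other largest value.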

Power Analysis

When researchers carry out an experiment to test the difference between two treatment means, how do they decide on the appropriate sample sizes to take? How confident are they about their abilities to detect a difference if one is present? You haven’t had to worry about this problem because we have provided the data sets and asked you to analyse them using the most appropriate procedures. But many of these data are from published studies that were designed intelligently: researchers decided on an appropriate sample size based in part on the expected power of the test. Power is the probability of correctly rejecting the null hypothesis when it is false (power is 1−β, where β is the probability of making a Type II error). The power of the two-sample t test depends on:

1. The sample size (n1+n2). Greater sample size increases power of a test.

2. The significance level (α). Power decreases with decreasing α. For example, reducing α from 0.05 to 0.01 reduces the probability of making a Type I error but increases the probability of making a Type II error.

3. The within-population variation (σ). Higher variation reduces power.

4. The difference between means, µ1−µ2. The larger the difference between the population means, the greater the probability of rejecting Ho.
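The four dependencies listed above can be explored with a rough calculation. The sketch below is not the method JMP uses: it treats σ as known and uses the normal distribution in place of the t-distribution, so it somewhat overstates power at small sample sizes. The function name `approx_power` and all numerical inputs are assumptions chosen for illustration.

```python
from statistics import NormalDist

def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-sample test of Ho: mu1 = mu2.

    Normal approximation with known sigma and equal group sizes
    (a simplification of the two-sample t test).
    """
    z = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5  # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se                # true difference in standard-error units
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Vary one quantity at a time, leaving the others fixed.
for n in (10, 20, 40, 80):
    print(n, round(approx_power(delta=5, sigma=10, n_per_group=n), 3))
```

When the true difference is zero the function returns α, which is the probability of a Type I error, as it should.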

In this lab we will explore the relationship between the power of the two-sample t test and these quantities.

Using The Program

The data must be entered into the data table as two separate columns. One column is a category (nominal) variable indicating treatment group (this will be “X”). The second column contains the actual measurements (this will be “Y”). Use Fit Y by X to start the analysis, placing the appropriate variables in the X and Y boxes. This will generate a “one-way” plot in which the observations of Y are displayed for each category of X. Click the red triangle next to the “One-way” title bar to select the following actions:

→ Means/ANOVA/T-test: Carries out the two-sample t-test; calculates a 95% confidence interval for the difference between means; presents the Analysis of Variance (ANOVA) table [we will cover this method in a later lab]; adds the means diamonds to the plot (the vertical span of each diamond represents the 95% confidence interval for the mean of each group).

→ UnEqual Variances: Carries out tests of the null hypothesis that variances of the two populations are equal, using several methods including the Bartlett test (equivalent to the F test) and the Levene test; carries out the calculations for Welch’s approximate t-test of differences between means when variances are unequal.

→ Nonparametric → Wilcoxon Test: Carries out the nonparametric analogue of the two-sample t-test. Note that exact P-values are not provided even when sample sizes are small: JMP uses the normal approximation regardless of sample size, and therefore provides only an approximate P-value, which may be inaccurate at small sample sizes.

→ Power…: Power analysis of the two-sample t-test. It is most useful to vary only one of the quantities at a time. For example, select a range of sample sizes and leave the other quantities at their predetermined values. Click Solve for Power and then Done to start computing. At the bottom of the output window there is an option to view the power curve. The only quantity you won’t recognize is “Delta”, which is a scaled measure of the difference between population means (see the Help features if you are interested in details); the preset value is calculated from the observed difference between sample means.

Problems

1. Dr. Jamie Smith, a professor in the Zoology Department at UBC, has studied song sparrows (Melospiza melodia) on the small Gulf island of Mandarte over several years. Mandarte Island is a short distance from Sidney, B.C., near the Victoria ferry terminal. Each summer for four years he captured every young song sparrow born on the island in that year, measured it, and placed a set of color bands on its legs. These bands uniquely identified each individual song sparrow. Each following spring Dr. Smith carried out a census of birds on Mandarte to determine which young had survived their first winter, and which had disappeared (presumed dead). A difference between survivors and dead birds in a trait would represent evidence of natural selection, the main cause of evolution according to Darwin. The data for young female birds are located in the file song.sparrows.jmp. Each line of the file refers to a single female bird. The variables, in order, are:

• Survival - Whether the bird survived or died over her first winter
• mass - Body mass, in g
• wing - Wing length, in mm
• tarsus - Tarsus (“leg”) length, in mm
• beakL - Beak length, in mm
• beakD - Beak depth (height), in mm
• beakW - Beak width, in mm

a) Examine the distributions for beak length of surviving and dead sparrows (in the Distribution pop-up window, put beakL in the Y box and Survival in the By box; in the results window that appears, select Distributions → Stack to display the two histograms one on top of the other). Do you notice a difference in the distributions of dead and surviving birds?

b) Evaluate the fit of these two data sets to the normal distribution.

c) A difference in the means of surviving and dead birds in a trait would reflect natural selection favoring one extreme over the other (“directional” selection). On the basis of your evaluation in (b), choose and carry out a test for a difference in mean beak length (beakL) between surviving and dead birds.

d) What other assumption did your test in (c) require? Test this assumption with the beak length measurements. Was your assumption valid? What alternative methods are available if this assumption is not met?

e) What is the 95% confidence interval for the difference between mean beak lengths of surviving and dead birds?

f) Do surviving and dead birds differ in the means of any other traits? If so, do the larger individuals tend to survive better than the smaller birds in these traits as was the case for beak length?

g) Reduction in variance of survivors compared with dead birds, in the absence of a change in the mean, reflects a tendency for extreme individuals to do worse than individuals in the middle of the distribution (= “stabilizing” natural selection). Conversely, a higher variance among survivors than dead birds reflects a tendency for extreme individuals to do better than individuals in the middle of the distribution (= “disruptive” selection). Do any of the traits show evidence of stabilizing or disruptive selection?

h) Why is caution necessary when using the F-test (or, equivalently, the Bartlett test) for testing differences between populations in variance?

2. Maguire et al. (2000, Proc. Natl. Acad. Sci. USA 97: 4398−4403) used MRI to scan the brains of London taxi cab drivers, who are renowned for feats of spatial memory and navigation (individuals must undergo two years of extensive training and pass a stringent set of examinations known as “The Knowledge” before they can be licensed). MRI scans focussed on the hippocampus, a region of the brain associated with spatial memory (especially the posterior hippocampus). The data in the file hippocampus.jmp record the volume of gray matter (mm³) in the right posterior hippocampus and the right anterior hippocampus of 15 drivers with different numbers of years of experience. Volume of the posterior hippocampus was measured using the VBM method, which provides a relative measure, whereas the anterior hippocampus was measured using a pixel-counting method that estimates absolute volume. All subjects were right-handed males between 32 and 62 years of age. These data were taken from Figure 3 in Maguire et al. (2000). The variables are:

a) Examine the difference in the volume of gray matter in the posterior hippocampus between the two experience groups of taxi drivers (< 15 years on the job vs. > 15 years on the job). Explain how you decided on the best method to use.

b) Repeat the above procedure on the anterior hippocampus measurements. Justify the methods you used.

c) Do the results of (a) and (b) imply that changes in the volume of gray matter in different regions of the hippocampus are influenced by experience as a London cab driver?

d) What does the following statement mean: “the two-sample t-test is more powerful than the Mann-Whitney U-test”?

3. As part of a larger study looking at the influence of productivity on desert plant growth and diversity, a field experiment examined effects of water addition on plant biomass. Each of five desert plots was divided into equal halves during the natural growing season. One half of each plot was watered weekly, whereas the other half was left unwatered as the control. At the end of the experiment, three estimates of total plant biomass were made in every half-plot. The results are contained in the file water.the.plants.jmp.

a) Using these data, test whether watering affected mean plant growth. Justify the method you used.

9. SINGLE FACTOR ANALYSIS OF VARIANCE (ANOVA)

The next step to consider after comparison of means of two treatments, µ1 and µ2, is comparison of means of multiple treatments: µ1, µ2, … µk. The most powerful method available is the analysis of variance (ANOVA). The null hypothesis is that the means are the same for all groups; if any group has a mean different from any other group, that would ideally lead to the rejection of the null hypothesis.

In many respects, ANOVA is just like the two-sample t-test. In fact, when k=2, either ANOVA or the two-sample t-test may be used (the P-values will be identical). The assumptions are the same: data must be randomly sampled from populations having normal distributions with equal variance. Normality and equality of variance can be assessed for multiple samples in the same way as for two samples (e.g. using the Levene test). ANOVA is a robust test, meaning it can tolerate a reasonable amount of deviation from these assumptions, especially when sample sizes are large and nearly equal in the different groups.
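The link between ANOVA and the two-sample t-test can be seen numerically. In the sketch below the function name `anova_f` and the data are made up for illustration; the F ratio is computed in the usual way, as the mean square among groups divided by the mean square within groups. With two groups, F equals the square of the pooled two-sample t-statistic.

```python
import statistics

def anova_f(groups):
    """Single-factor ANOVA F ratio.

    Returns (F, df among groups, df within groups).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_groups = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_groups, df_error = k - 1, n_total - k
    f = (ss_groups / df_groups) / (ss_error / df_error)
    return f, df_groups, df_error

# Two made-up groups: the pooled two-sample t for these data is -2, so F is 4.
print(anova_f([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]]))
# Adding a third group changes both the F ratio and the degrees of freedom.
print(anova_f([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [5, 6, 7, 8, 9]]))
```
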

Fixed and Random Effects ANOVA

There are two main types of single factor ANOVA: fixed effects (Model I) and random effects (Model II). In a fixed effects ANOVA the treatments are specifically chosen (e.g. drug A vs. drug B vs. drug C), treatments are repeatable, and we care about the results for each treatment (e.g. Do drugs A, B and C differ in effectiveness? Which one is best?). In a random effects ANOVA, the treatments are randomly sampled from a distribution of possible treatments. Specific treatments are not repeatable and we won’t usually care about the findings for individual treatments. Instead, our goal is to say something general about the population of possible treatments from which the analysed treatments are drawn. For example, to answer the question “does the mean size of offspring differ between females in a population of mice?” we would obtain a random sample of females (=treatments) from the population, breed them, and measure the sizes of each of their offspring. This will tell us about variation among females, but since the females used are simply a random sample of females from a larger population, we will not be interested in the results for individual females.

The calculations for fixed and random effects are the same for single-factor ANOVA. The calculations will differ when more complicated experiments having more than one factor are analysed.

Multiple Comparisons

Rejecting the null hypothesis of equal means only tells us that at least one of the population means is different from the others. In a fixed effects experiment (e.g. comparison of the effectiveness of drug A vs. drug B vs. drug C), we usually want to know more: which is the most effective? which is the least effective? are all three drugs different from one another, or does one clearly stand out from the other two? Answering these questions will require a comparison of all pairs of population means (A vs. B; A vs. C; B vs. C). Of course, the whole point of using ANOVA is to avoid pairwise comparisons among pairs of means when testing for an overall difference. However, once the ANOVA has rejected Ho, we need to return to the individual means for additional information.

We can’t simply carry out a series of two-sample t tests to compare all possible pairs of treatment means. Doing so will badly inflate the probability of making at least one Type I error. The Tukey test was invented to circumvent this problem. The Tukey test avoids the inflation of Type I error rates by using a critical value, q, that takes account of the number of pairs of means being compared. Use of this critical value ensures that the probability of making at least one Type I error, when carrying out tests between all pairs of means, is 0.05. This “protection” comes at a price, however: the test is not very powerful. Indeed it is possible for ANOVA to reject Ho yet the Tukey test will not find any pairs of means that differ (usually, however, it finds at least one pair of means that differ). The Tukey test is referred to as an a posteriori test (one carried out after getting a specific result from another test).
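A back-of-the-envelope calculation shows how quickly the error rate inflates. The sketch below assumes, purely for illustration, that the pairwise tests are independent (they are not, because each mean appears in several pairs), so the numbers are only a rough guide. The function name `familywise_error` is hypothetical.

```python
from math import comb

def familywise_error(k, alpha=0.05):
    """Number of pairwise tests among k means, and the approximate chance of
    at least one Type I error if each test is run at level alpha.

    Treats the tests as independent, which is an illustrative simplification.
    """
    m = comb(k, 2)  # number of pairs of means
    return m, 1 - (1 - alpha) ** m

for k in (3, 5, 10):
    m, rate = familywise_error(k)
    print(k, m, round(rate, 3))
```

With three treatments there are three pairs and the approximate rate is about 0.14; with five treatments there are ten pairs and it is about 0.40.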

See the interleaf on Snooping in the text, for a further description.

We don’t usually carry out Tukey tests with random effects ANOVAs for the simple reason that in this model our treatments are random and unrepeatable and we are uninterested in specific treatment differences.

Transformations

If the assumptions of ANOVA are not met, don’t give up yet: consider transforming the data to achieve normality and equal variances. Many transformations are possible, but these three are the standards: log, square root, arcsine. These transformations rescale the measurements but don’t otherwise distort them. Transformations can allow you to do the more powerful ANOVA instead of a non-parametric test, by sometimes producing a form of the data that matches the assumptions of ANOVA.

| Name | Calculation | Uses |
|---|---|---|
| Log | $\log_e(X)$ or $\log_e(X + 1)$ | The most frequently used transformation. Works for many types of data, especially data that are measured dimensions (size, length, etc.). In general, consider using when group variances are unequal but group coefficients of variation are equal. Use the natural logarithm (base e). Use X + 1 if the data set includes zeros. |
| Square root | $\sqrt{X + 0.5}$ | Often useful for data in the form of counts, when group variances are unequal but group variance:mean ratios are equal. The addition of 0.5 is optional, but might improve the transformation when there are zeros. |
| Arcsine square root | $\sin^{-1}\sqrt{X}$ | Used only when data are proportions (note: first divide by 100 if data are percentages). Arcsine is the inverse sine function. |

The types of data for which each transformation is often used provide a guide only. Every data set is different, and a log transformation might work better for your count data set than the square root transformation. Use whatever works, but within reason. If simple transformations fail, then move on to a nonparametric alternative.
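Outside JMP, the three standard transformations are one-liners. The function names below are hypothetical and the input values are arbitrary; the sketch only mirrors the formulas in the table above.

```python
import math

def log_transform(x, add_one=False):
    """Natural log of x, or of x + 1 when the data include zeros."""
    return math.log(x + 1) if add_one else math.log(x)

def sqrt_transform(x, add_half=True):
    """Square root of x + 0.5 (the added 0.5 is optional)."""
    return math.sqrt(x + 0.5) if add_half else math.sqrt(x)

def arcsine_sqrt_transform(p):
    """Arcsine of the square root of a proportion p between 0 and 1.

    Percentages must be divided by 100 first. The result is in radians.
    """
    return math.asin(math.sqrt(p))

counts = [0, 3, 12, 40]  # arbitrary count data
print([round(sqrt_transform(c), 3) for c in counts])
print([round(log_transform(c, add_one=True), 3) for c in counts])
print(round(arcsine_sqrt_transform(0.5), 4))
```

After transforming, the assumptions are re-examined on the transformed values, not on the original measurements.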

Nonparametric Alternative to ANOVA

The Kruskal-Wallis test is the best nonparametric alternative to ANOVA, and should be used if the assumptions of ANOVA cannot be met (and transformations don’t solve the problem). The method is based on ranks, and is the multi-sample equivalent of the Mann-Whitney U-test (Wilcoxon rank sum test) used in the case of two samples. Under the null hypothesis, the test statistic H has an approximate chi-square distribution with k−1 degrees of freedom. For small total sample size, however, exact critical values may be obtained from the statistical tables in your textbook.
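For readers who want to see where H comes from, here is a minimal sketch using the standard formula based on the rank sums of the groups. The function names are hypothetical, the data are made up, and no correction for ties is applied (a simplification; statistical packages usually apply one).

```python
def midranks(values):
    """Ranks 1..N; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (without tie correction) and its degrees of freedom."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        r_sum = sum(ranks[start:start + len(g)])  # rank sum of this group
        total += r_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(groups) - 1

# Made-up data with no overlap between groups.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

H is then compared with the chi-square distribution with k − 1 degrees of freedom, or with exact tables when samples are small.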

Using The Program

The required data format is the same as for the two-sample t-test. A category (nominal) variable indicates treatment group (this will be “X”). A second column contains the actual measurements (this will be “Y”). Use Fit Y by X to start the analysis, placing the appropriate variables in the X and Y boxes. This will generate a “one-way” plot in which the observations of Y are displayed for each category of X. Use Display Options → Points Jittered to spread apart overlapping observations. Click the red triangle next to the “One-way” title bar to select the following actions:

→ Means/ANOVA/T-test: Carries out the ANOVA and generates the ANOVA table. Also calculates a 95% confidence interval for each mean (using the individual standard errors rather than the standard error based on the pooled sample variance). Adds the means diamonds to the plot (the vertical span of each diamond represents the 95% confidence interval for the mean of each group).

→ UnEqual Variances: Carries out tests of the null hypothesis that variances of all populations are equal, using several methods including the Bartlett test (equivalent to the F test in the case of two samples) and the Levene test.

→ Compare Means → All Pairs, Tukey HSD: Carries out the Tukey test between all pairs of means. A table with the results is added to the results window, and a diagram of “comparison circles” is positioned next to the one-way plot.

→ Nonparametric → Wilcoxon Test: Carries out the Kruskal-Wallis nonparametric test (the Wilcoxon test in the case of two samples). Note that exact P-values are not provided even when sample sizes are small: JMP uses a large-sample approximation regardless of sample size, and therefore provides only an approximate P-value. These P-values may not be accurate when sample sizes are small.

Transformations: To transform data, you will need to create a new column. Choose column info and label this column to identify it as transformed data. Format the column so that it is based on a formula, then in the calculator window set up the equation for the appropriate transform for your data type. The log function is located in the transcendental group of functions. The arcsine function is located in the trigonometric group of functions.

Problems

1. Open the fruitflies.jmp file in the shared directory. These data were collected by Partridge (1981, Nature 294: 580-581) to test whether male Drosophila melanogaster suffered a survival cost from mating. The life spans of individual males supplied with 1 or 8 receptive virgin females per day were compared with life spans of two types of control males. The first control consisted of two sets of individual males kept with either 1 or 8 newly inseminated females (newly inseminated females will not re-mate for at least two days, so they controlled for any effect of competition with the male for food or space). The second control was a set of individual males kept with 0 females. The four variables are:

• Number of females supplied daily to males (0, 1 or 8)
• Experimental treatment (0, 1 or 8 virgin females, 1 or 8 newly inseminated females)
• Male life span, in days
• Male thorax length, in mm

a) ANOVA requires an assumption in addition to the assumption of normal populations. Test this assumption using the appropriate method. Comment on the validity of the assumption.

b) Examine the histograms of male life span for each group separately (use the By box in the Distribution window to generate histograms for each treatment group all at once). Is an assumption of normality reasonable?

c) Use Fit Y by X to view the one-way plot of male lifespans of different experimental treatment groups. Use Display Options → Points Jittered to spread apart overlapping points. To the eye, do the mean lifespans appear to differ between treatment groups?

d) Use ANOVA to test whether experimental treatment influenced the lifespan of male fruit flies.

e) Is this a fixed effects ANOVA or a random effects ANOVA? Explain.

f) Why is it invalid to test multisample hypotheses by applying two-sample tests to all possible pairs of samples?

g) Assuming that you rejected the null hypothesis in (d), determine which pairs of treatment means were significantly different. Which treatment(s) yielded the lowest mean lifespan? Which treatment(s) yielded the highest mean lifespan?

h) What are the assumptions of the Tukey test?

2. The mimic leatherjacket, Paraluteres prionurus, is a small fish of the Great Barrier Reef that resembles the sharp-nosed puffer, Canthigaster valentini, a fish with a powerful neurotoxin in its skin. It has been suggested that the leatherjacket has evolved to resemble the toby because of the protection from predators gained. A field study tested this idea by constructing plastic models in the body shape of the toby and painting them one of four color patterns: “toby” (the puffer fish pattern), “leatherjacket” (the leatherjacket pattern), “1step”, and “2step” (patterns that were small and medium departures from the leatherjacket pattern, respectively, but using the same colors). Using SCUBA, the researchers tethered the models to fishing line and drew them across sections of reef for a two hour period each. The number of times the model was approached by a predatory fish was recorded. The data are in the file mimicry.jmp. Use these data to test for differences between color patterns in the mean number of approaches by predators.

a) What are the assumptions of ANOVA? Examine the data and judge whether or not these assumptions might be met.

b) Try transforming the data to improve the validity of the assumptions. Given the type of data, which transformation would be your first choice? Carry out the transformation and reexamine the data. Are the assumptions of ANOVA met?

c) Think of a second transformation that might also work, and give it a try. Which transformation worked best?

d) After you decide on the best transformation, test whether color pattern influenced the number of approaches by predators.

e) Which color pattern(s) were most attractive to predators, and which pattern(s) were least attractive? Base your answers to these questions on a formal test.

f) Is this a fixed effects ANOVA or a random effects ANOVA? Explain.

3. Open the data file genotype.jmp from the shared drive. The data are weight gains of young rats separated from their natural mothers at birth and randomly reassigned to other mothers. The variables in the data set are:

• Weight gain of a young rat, in g
• Mother ID, the identity of the young rat’s true mother
• Foster mother ID, the identity of the young rat’s foster mother

a) Test whether mean weight gain of offspring differed between foster mothers. Justify all steps of your analysis.

b) Test whether mean weight gain of offspring differed between true mothers. Justify all steps of your analysis.

c) Suggest a biological explanation for the results in (a) and (b).

d) Is this a fixed effects design or a random effects design?

10. LINEAR REGRESSION AND CORRELATION

Regression and correlation are more powerful methods to describe and test associations between numerical variables.

Regression and correlation are related but have different purposes. Correlation describes the relationship between two numerical variables. The goal of regression is to predict the value of one variable, Y, from measurements of the other variable, X, whereas correlation merely describes their association. When using regression, keep in mind that to predict Y from X in no way implies that X is the cause of Y. Demonstrating a cause and effect relationship requires careful experimental design with appropriate controls to rule out other causes.

Simple Linear Regression

Linear regression assumes that Y’s relationship to X is a straight line:

Y = α + βX

α is the Y-intercept (the mean value of Y when X is zero), and β is the slope of the line (the amount that Y changes per unit change in X). α and β are population parameters that describe the true relationship of Y on X. The quantities a and b are the sample estimates of these two parameters. The usual null hypothesis for regression is that the slope of this line is zero, but the null hypothesis could specify any other slope if necessary.

Individual Y-values will not lie directly on the line, but will be scattered above and below by a random amount. The difference between a Y observation and the predicted Y on the line is called the residual. Under the method of least squares, α and β are estimated from data by finding the values of a and b that minimize the sum of squared residuals.
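The least-squares calculation can also be sketched outside JMP. The short Python example below is an illustration only and is not part of the JMP workflow: the function names and the five data points are hypothetical, and it assumes the standard closed-form estimates b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄.

```python
from statistics import mean


def least_squares(x, y):
    """Return (a, b): least-squares estimates of the intercept and slope."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares of X, both about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope estimate
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


def residuals(x, y, a, b):
    """Observed Y minus the Y predicted by the line a + bX."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
res = residuals(x, y, a, b)
ss_resid = sum(r ** 2 for r in res)
print(f"a = {a:.3f}, b = {b:.3f}, sum of squared residuals = {ss_resid:.4f}")
```

Changing a or b away from the fitted values and recomputing the sum of squared residuals should always give a larger sum, which is what "least squares" means.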

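[Figure: scatterplot of Y (10 to 30) against X (0 to 10) with the fitted line Y = 8.36 + 1.80X; the vertical distance between an observation and the line is labeled “residual”.]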

In this lab, we emphasise graphical tools that help you evaluate the assumptions that underlie regression analysis. These methods rely on visual and statistical inspection of data. Your goal is to try to make the data fit the assumptions as closely as possible, and then decide whether the agreement between fact and assumption is close enough to proceed with the analysis. Be prepared to try several remedies and to choose the best among them.

Assumptions of Linear Regression

Linear regression rests on three special assumptions.

1. For every value of X, there is a distribution of possible Y values whose mean falls on a straight line (i.e., the relationship is linear).

2. The distribution of possible Y values at each X is normally distributed.

3. The variance of the Y values is assumed to be the same at all values of X.

Added to these is the usual assumption that observations are independent of one another. We will also assume that there is no measurement error in X.

When evaluating whether the assumptions are met, several tools are useful and should be considered a part of any regression analysis:

• Plot of residuals. Fit a straight line and then plot the residuals in Y against X.

[Figure: example residual plot, with residuals in Y (−5 to 5) plotted against X (0 to 10).]

Examine the residual plot for indications that the variability of the residuals varies widely across the range of X values (indicating that the assumption of equal variances is violated), or that the residuals are highly unevenly distributed on both sides of the line over the range of X values (indicating that the relationship is probably not linear).

• Distribution of residuals. Examining the residuals with standard methods (e.g., histogram, boxplot) will help assess their fit to the normal distribution. (Note: JMP lets you save the residuals in a new column.)
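The residual checks above are visual, but a rough numerical stand-in can make the idea concrete. The Python sketch below is an assumption-laden illustration, not a replacement for the plots: the function names and data are hypothetical, and splitting the residuals into a lower and an upper half of X is just one crude way to see whether their spread changes across X.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Least-squares intercept and slope."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_summary(x, y):
    """Fit a line, then summarise the residuals in order of increasing X."""
    a, b = fit_line(x, y)
    pairs = sorted(zip(x, y))                      # order observations by X
    res = [yi - (a + b * xi) for xi, yi in pairs]
    half = len(res) // 2
    return {
        "residuals": res,
        # Spread of residuals at low X versus high X (equal-variance check)
        "sd_lower": pstdev(res[:half]),
        "sd_upper": pstdev(res[half:]),
        # Sign pattern along X; long runs of one sign hint at curvature
        "signs": "".join("+" if r > 0 else "-" for r in res),
    }


# Hypothetical data in which the scatter grows with X
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 4.1, 5.9, 8.2, 9.0, 13.5, 12.0, 18.0]

summary = residual_summary(x, y)
print("SD of residuals, lower half of X:", round(summary["sd_lower"], 3))
print("SD of residuals, upper half of X:", round(summary["sd_upper"], 3))
print("Signs of residuals in X order:   ", summary["signs"])
```

A much larger spread in one half than in the other is the numerical counterpart of a funnel-shaped residual plot. The residual plot and histogram remain the better tools for judging the assumptions.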

Transformations in Regression

To meet the assumptions of regression, try transforming X and/or Y. The log transformation (or log(X+1) when there are zeros) is by far the most commonly used transformation and we will use it primarily. If the relationship of Y on X is described by a power function (e.g. Y = 3X^1.4), then transforming both X and Y should yield satisfactory results. For other relationships it may be necessary to transform only one of the variables. Use whatever works best to make the transformed data match the assumptions.

The square root and arcsine transformations are also frequently used (from the overview of these transformations in the previous ANOVA lab, you might be able to guess the types of data for which these other transformations might be suited). Transformations offer no hope, however, if the real curve contains either a distinct peak or a distinct valley. In these and other cases nonlinear regression methods are appropriate (we will not be covering this topic).
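To see why logging both variables straightens a power function, note that if Y = aX^b then log Y = log a + b log X, which is a straight line in log X. The Python sketch below checks this numerically. The values a = 3 and b = 1.4, the X values, and the function name are illustrative assumptions, and the data are generated without any scatter.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Illustrative power function Y = a * X**b with no random scatter
a_true, b_true = 3.0, 1.4
x = [1, 2, 4, 8, 16, 32]
y = [a_true * xi ** b_true for xi in x]

# Log-transform both variables, then fit a straight line
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(log_x, log_y)

# The slope on the log-log scale estimates b; exp(intercept) estimates a
print("estimated b:", round(slope, 4))
print("estimated a:", round(math.exp(intercept), 4))
```

With real data the points will scatter around the line, so the residual checks still need to be repeated on the transformed scale.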

[Figure: four example scatterplots of Y against X, captioned “Transform Y”, “Transform X”, “Transform X and/or Y”, and “Log transformation won’t work”.]

Simple Linear Correlation

In correlation analysis, pairs of X,Y values are assumed to be drawn at random by the investigator from a population. Our goal is to determine whether the two variables are associated. The correlation coefficient, r, measures the strength of the association; r may vary between −1 and 1. Linear correlation assumes that the distribution of X,Y pairs in the population has a bivariate normal distribution. A bivariate normal distribution is a bell-shaped distribution in 3D.

If this assumption of bivariate normality is violated and cannot be corrected by transformation, then a nonparametric method, the Spearman rank correlation, is used instead.
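The relationship between the two coefficients can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than JMP's implementation: the function names and data are hypothetical, and the Spearman coefficient is computed as the linear correlation of the ranks, with tied values given the average of their ranks.

```python
from statistics import mean


def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: Y always increases with X, but not along a straight line
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 200]

print("Pearson r:  ", round(pearson_r(x, y), 3))
print("Spearman rs:", round(spearman_rs(x, y), 3))
```

Because the rank correlation depends only on the ordering of the values, it is unaffected by the one extreme Y value that pulls the linear correlation down.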

Using the Program

Linear Regression - Use Fit Y by X (or Bivariate in the JMPIN Starter) and designate your X and Y variables. The computer will display a scatterplot of the data. Click the red triangle next to the “Bivariate” title bar to select → Fit Line, which fits a linear regression to the data. This also yields the estimated regression equation (slope and intercept), a summary of the fit (r², the fraction of variation in Y that is “explained” by X, an index of the strength of the fit) and the lack of fit (ignore for now). The result window will also display estimates, standard errors and significance tests of intercept and slope (based on the t-distribution), and an F-test of the fit of the whole model to the data (analogous to ANOVA).

Below the scatterplot you’ll find another red triangle next to “Linear Fit”. Click the symbol to select a series of other options:

→ Save residuals: Creates a new column in the data table containing the residuals. Useful if you wish to generate a histogram and box plot for the residuals.

→ Plot Residuals: Generates a plot of Y residuals against X. Useful for checking equality of variance of residuals across the range of X-values.

→ Remove Fit: Deletes the line and associated results.

Spline - Use Fit Y by X (or Bivariate in the JMPIN Starter) and designate your X and Y variables. The computer will display a scatterplot of the data. Click the red triangle next to the “Bivariate” title bar to select → Fit Spline. Start with a lambda of 1, and then go higher or lower if this value yields a spline fit with too little or too much wobble. The goal is to examine the general trend of the data.

Transformations - Use the JMP calculator to create a column with new variables, as you did for ANOVA.

Correlation – Use Analyze → Multivariate (or select the Multivariate tab in the JMPIN Starter). Select the two variables you want to correlate and place them both in the Y box. This will generate a table reporting the linear correlation r between all pairs of selected variables. A scatterplot for all pairs of variables also results, along with a 95% density ellipse for each pair. This ellipse should enclose approximately 95% of the observations if the two variables have a bivariate normal distribution. Click the red triangle next to the “Multivariate” title bar to select other options:

→ Pairwise correlations: Reports the correlation coefficient for each pair of variables, the P-value for a test of the null hypothesis of no correlation, and a bar chart indicating the magnitude of each correlation.

→ Nonparametric Correlations → Spearman’s Rho: Reports the Spearman’s rank correlation coefficient and the P-value for a test of the null hypothesis of no correlation.

Problems

1. Open the data set mammals.jmp, which contains average brain and body masses for 62 species of land mammals. The data are from Allison and Cicchetti (1976, Science 194: 732–734).

a) Use the Multivariate option to calculate the linear correlation between brain size and body size in this sample of mammals.

b) Observe the scatter plot of observations and the 95% density ellipse. This ellipse represents a contour on the bivariate normal distribution that best fits the data (picture a bell in 3D). Do the data appear to conform well to the assumption of bivariate normality?

c) If the data do not fit the assumption of bivariate normality, test the correlation between brain size and body size using a nonparametric method.

d) Explore transformations of these variables, to yield a scatterplot that fits a bivariate normal distribution better than the untransformed data.

e) What are the main assumptions of linear regression? Do they appear to be met in this case? Show how you determined this.

f) With the transformed data, use linear regression to predict (transformed) brain mass from (transformed) body mass. Report also the r² (coefficient of determination, a measure of the strength of fit).

g) Is the slope of the regression significantly different from zero?

h) Which species of mammal has the largest brain, taking into account differences between species in body size? Show how you determined this. You should be pleased by the answer.

i) What other species have relatively large brains, again taking into account differences between species in body size? Which species has the smallest brain?

2. Open the data file mutation.jmp. The file contains estimates of whole-genome, deleterious point mutation rates in a range of animal taxa. Each rate was estimated by comparing two closely related species in the number of differences at “important” sites in shared genes relative to the number of differences at “silent” sites. Mutation rates are measured as the number of new, deleterious mutations expected per individual per generation (column #3). The estimates were obtained from Table 1 of Keightley and Eyre-Walker (2000, Science 290: 331-333).

a) Observe the range of values for different taxa of animals. Note especially the large values for humans and other primates!

b) Test whether whole-genome deleterious mutation rates, measured per generation, are correlated with generation time. Use the most powerful method available. Show how you met the assumptions of the method.

THE FOLLOWING PROBLEM CAN BE SAVED FOR REVIEW PRACTICE:

3. The data file gagurine.jmp contains data on the concentration of the chemical GAG in the urine of 314 children aged from zero to seventeen years. The aim of the study was to produce a chart to help a pediatrician assess whether an individual child’s GAG concentration is “typical”. Variables are age of child (in years) and GAG concentration (the units have been lost). The data were taken from Venables and Ripley’s MASS library (original data from Prosser, cited in Venables and Ripley).

Predict GAG concentration from the age of the child. Explain and justify your methods.

11. REVIEW PROBLEMS

In previous exercises you have used the computer program to answer example problems associated with specific statistical tests. In this exercise you are given an example problem and you must decide which statistical test is appropriate to answer each question. You should explore each data set before you proceed with any tests. Use as many of the components of the program as possible. When you answer a question you should provide a fairly standard set of information. When a test is called for, show all your steps:

1. State your null and alternate hypotheses (mathematically, if possible and appropriate).

2. State your significance level (α).

3. Note any problems with assumptions and any transforms or changes you made to correct those problems.

4. Report the test statistic (F, t, W, z, etc.).

5. State the degrees of freedom (or sample sizes in the case of nonparametric testing).

6. State the P-value (report the critical value when using the tables, whether the corresponding test statistic is greater or less than the critical value, and what this implies about the magnitude of P).

7. Compare the P-value to α (by hand, compare the test statistic to the critical value).

8. Report your conclusion (reject or fail to reject H0). For exams, it’s safest to stop here. In the real world you would restate your conclusion to explain what it means for your study.
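As a sketch of how these steps fit together, the Python example below walks through a one-sample t test done "by hand", comparing the test statistic to a critical value from the tables. Everything specific in it is an assumption made for illustration: the ten measurements and the hypothesized mean of 50 are hypothetical, and the critical value 2.262 is the standard two-tailed table value of t for α = 0.05 with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t test of H0: mu = mu0."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)       # standard error of the mean
    t = (mean(sample) - mu0) / se
    return t, n - 1


# Hypothetical measurements, for illustration only
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 54.2, 47.6, 52.5]

# Step 1: hypotheses.  H0: mu = 50   HA: mu != 50
mu0 = 50.0
# Step 2: significance level
alpha = 0.05
# Step 3: check assumptions (e.g. normality) before this point; none are checked here
# Steps 4 and 5: test statistic and degrees of freedom
t, df = one_sample_t(sample, mu0)
# Steps 6 and 7: compare |t| to the two-tailed critical value from a t table
t_crit = 2.262                          # t for alpha = 0.05 (two-tailed), df = 9
reject = abs(t) > t_crit
# Step 8: conclusion
print(f"t = {t:.3f}, df = {df}, critical value = {t_crit}")
print("Reject H0" if reject else "Fail to reject H0", f"at alpha = {alpha}")
```

When |t| is smaller than the critical value, P is greater than α; when it is larger, P is less than α. JMP reports the P-value directly, so steps 6 and 7 then reduce to comparing that P-value to α.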

Hints for the lab exam:

Read the questions thoroughly. One of the most common ways people lose marks is by providing a thorough answer to a question we didn't ask.

Look for key phrases: causation, prediction or functional relationship implies regression if the Y-variable is continuous. Describe a regression relationship algebraically means tell us the equation of the line (don't forget to include any transformations in the equation). Describe the strength of a relationship implies r² in the regression framework, when X and Y are continuous, or r in the correlation framework. Association or vary together implies correlation when both X and Y are continuous. Independence of variables or do the ratios vary points to a contingency test when both variables are nominal. Summarize the differences between all pairs of means should tell you to do a Tukey test.

Try the following set of problems to review some of the ideas you have learned this year. This exercise is very similar to the lab exam you will write next week, although the lab exam will have fewer questions. You should not consider this set of questions to be all-inclusive, however. Any of the topics that we have dealt with in lab are fair game for the lab exam.

Problems

A study of clutch size (i.e. number of eggs laid per nest) in great blue herons (Ardea herodias) was undertaken in south coastal British Columbia. Nests were randomly sampled from three different colonies (populations) and the number of eggs in each nest was counted. The wing length (mm) of the female heron occupying each nest was also measured, as an index of body size. The data are stored in a file named clutch.JMP in the shared directory.

a) Visually inspect the data from each of the three colonies and describe the distributions of clutch size and wing length. Test whether the distributions are normal.

b) Are there differences in mean clutch size among the colonies? Are there differences in mean female wing length among the three colonies?

c) In your analyses of part (b), what was the power of each test to reject the null hypothesis, based on the default power estimates from your samples? What were the smallest sample sizes that would have allowed you to reject the null hypotheses?

d) The researchers observed that Douglas fir and alder trees were equally abundant at the three colony locations. Subsequently, they determined the frequency with which heron nests were located in different trees:

Colony #    Douglas fir    Alder
1           5              7
2           31             5
3           30             11
Totals      66             23

Are the ratios of herons nesting in Douglas fir to herons nesting in alder trees the same in all colonies?
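The mechanics of a contingency test on a table like this one can be sketched in Python. The example is a hedged illustration, not the JMP procedure: the function name is hypothetical, the counts are copied from the table above, and the P-value line relies on the fact that the chi-square distribution with exactly 2 degrees of freedom has upper-tail probability exp(−χ²/2), so it does not apply to tables of other sizes.

```python
from math import exp


def chi_square_contingency(table):
    """Return (chi2, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp_) ** 2 / exp_
        for obs_row, exp_row in zip(table, expected)
        for obs, exp_ in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Nest counts from the table: rows are colonies 1 to 3, columns are Douglas fir, alder
observed = [[5, 7], [31, 5], [30, 11]]

chi2, df, expected = chi_square_contingency(observed)
smallest_expected = min(min(row) for row in expected)

# Upper-tail probability; this closed form is valid only when df == 2
p_value = exp(-chi2 / 2)

print(f"chi-square = {chi2:.3f}, df = {df}, P = {p_value:.4f}")
print(f"smallest expected count = {smallest_expected:.2f}")
```

The smallest expected count is printed because the chi-square approximation becomes unreliable when expected frequencies are small, so that assumption should be noted alongside the result.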

e) The average wing length of female great blue herons in North America is 500 mm and the average clutch size is 4. For the combined sample (i.e. assuming no differences among colonies in mean or variance for each variable), test whether or not clutch size of great blue herons in south coastal British Columbia is representative of great blue herons in North America. Carry out a similar analysis using female wing length.

f) For the combined sample (assuming no differences among colonies in mean or variance for each variable), decide whether or not it is possible to use female body size to predict clutch size in great blue herons from south coastal British Columbia. If so, describe the relationship algebraically.

Selecting a statistical method (partial review of appropriate methods in Bio 300)

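[…]e a parameter
• Normally distributed population(s): Standard error and confidence interval for the mean and variance
• Non-normal population(s): Standard error of the mean
• Nominal data (categories or names): Standard error and approximate confidence interval for a proportion

[…]re one group to […]atical value
• Normally distributed population(s): One-sample t test
• Non-normal population(s): Wilcoxon signed rank test
• Nominal data (categories or names): Chi-square goodness of fit test, or Binomial test (two outcomes)

[…]re two unpaired […]
• Normally distributed population(s): Two-sample t test (means); Two-sample F test (variances)
• Non-normal population(s): Mann-Whitney U test (rank sums); Levene test (absolute deviations)
• Nominal data (categories or names): Chi-square contingency test; Fisher's exact test (two outcomes)

[…]re two paired groups
• Normally distributed population(s): Paired t test (mean difference)
• Non-normal population(s): Wilcoxon signed rank test

[…]re three or more […]hed groups
• Normally distributed population(s): One-way ANOVA (means); Bartlett's test (variances)
• Non-normal population(s): Kruskal-Wallis test; Levene test (absolute deviations)
• Nominal data (categories or names): Chi-square contingency test

[…]e or test association […] two variables
• Normally distributed population(s): Linear (Pearson) correlation
• Non-normal population(s): Spearman rank correlation
• Nominal data (categories or names): Contingency test

[…] value from another […]ed variable
• Normally distributed population(s): Linear regression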