Chapter 0 Getting Started With Minitab


0.1 Overview

This chapter covers the basic structure and commands of Minitab for Windows Release 14. After reading this chapter you should be able to

1. Start Minitab
2. Identify the Main Menu Bar
3. Enter Data into Minitab
4. Save the Data File
5. Compute Descriptive Statistics
6. Print the Session Window
7. Obtain Online Help
8. Exit Minitab

Minitab commands and software features are introduced where they are appropriate for the specific statistical analysis.

0.2 Starting Minitab

Minitab is a computer software program initially designed as a system to help in the teaching of statistics, and over the years it has evolved into an excellent system for data analysis. To start Minitab, do one of the following:

1. Select Start>Programs>Minitab 14 for Windows>Minitab, or

2. Double click on the blue Minitab for Windows icon as shown in Figure 0.1.

Figure 0.1

0.3 The Main Menu

The main Minitab window contains numerous subwindows, two of which are shown in Figure 0.2: the Worksheet window and the Session window. A third window is the Project Manager. The Project Manager contains folders that allow access to various parts of your project. These folders include Session, History, Graphs, Report Pad, Related Documents, and Worksheet folders. Across the top of the Minitab window is the menu bar, from which menus may be opened and from which you choose commands. The Session and Worksheet windows are the most important and the most frequently used windows.

Figure 0.2

The main menu bar, shown in Figure 0.3, contains selections common to most Windows applications and some selections specific to Minitab. The File command contains options related to opening files, saving files, printing, and exiting Minitab.

Figure 0.3

The Edit command contains options related to deleting, copying, and pasting. The other selections on the menu bar, Manip(ulate), Calc(ulate), Stat(istics), and Graph, are specific to Minitab. The final two selections on the main menu bar, Window and Help, are found in most Windows applications. The Window command enables you to switch among windows, while the Help command enables you to get on-line help from Minitab.

0.4 Entering Data

Minitab’s Worksheet window, as shown in Figure 0.4, is like a spreadsheet in that it works with data in rows and columns. Typically, a column contains the data for one variable, with each individual observation in a row. Columns are designated as C1, C2, C3, ... and rows are numbered 1, 2, 3, ...

Figure 0.4

The size of the worksheet is limited only by the memory available and the size of the hard drive.

There are several ways to enter data into the Minitab Data window. You may read data from a file or type in the data. Let’s look at a problem involving a combination of reading data from a file and adding data to the dataset by typing in the additional data.

Example 1.12, text

The Problem - Graduation Rates

The Chronicle of Higher Education (Almanac Issue, Aug. 31, 2001) reported graduation rates for NCAA Division I schools. The rates reported are the percent of full-time freshmen in fall 1993 who earned a bachelor’s degree by Aug. 1999. Data from the Division I schools in California are contained in file ex1_12a.mtp.

Reading Data from a File

Follow these steps to read data from a file:

1. Start Minitab. Locate and double click on the Minitab program group icon, then double click on the blue Minitab for Windows icon.

2. Open the file. Choose File>Open Worksheet from the menu. A portion of the submenu is shown in Figure 0.5.

Figure 0.5

At the completion of this operation, all data in the current worksheet will be replaced with the data in the file. When you select Open Worksheet, the dialog box shown in Figure 0.6 will open.

Figure 0.6

Minitab allows you to open files from many different software packages. Minitab worksheets use the file extension .mtw and Minitab portable files use the extension .mtp. Choose the location and the type of file you want to open, then select the filename from the list and choose Open to open the file. Select the file ex1_12a.mtp and choose Open.

Typing Data into the Worksheet

Follow these steps to add data to the dataset:

3. Make the Worksheet window the active window. Position the cursor in the Worksheet window in the column and cell where you want the data located. Position the cursor in column 1 row #.


4. Enter the data. Enter the data indicated below beginning in row 18 of column C1. Type in the data value and press {ENTER} after each entry.

66 71 63

5. Correcting errors. If you enter an incorrect value, highlight the cell, retype the data entry and press {ENTER}. Do not delete the error; just type in the correct value. Deleting the data causes the entries below it in the column to move up one row. Change the entry in column 1 row 19 from 71 to 70.
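The difference between retyping a cell and deleting it can be sketched outside Minitab. The Python snippet below is only an illustration, not Minitab behavior reproduced exactly: it models the three typed values as a list, and the helper names `overwrite` and `delete` are hypothetical.

```python
def overwrite(cells, index, value):
    """Return a copy with one cell retyped; every other cell keeps its row."""
    fixed = list(cells)
    fixed[index] = value
    return fixed


def delete(cells, index):
    """Return a copy with one cell removed; later cells shift up one row."""
    return cells[:index] + cells[index + 1:]


typed = [66, 71, 63]   # the values typed into rows 18, 19 and 20 of C1
first_row = 18         # worksheet row that holds typed[0]

corrected = overwrite(typed, 19 - first_row, 70)  # retype row 19
shifted = delete(typed, 19 - first_row)           # delete row 19 instead

print(corrected)
print(shifted)
```

Retyping keeps 63 in row 20, while deleting pulls it up into row 19, which is why the step above recommends typing over the error.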

Naming Columns in the Worksheet

Columns are generally used for different variables within the dataset.

1. Name the column. To name a column (variable) in the worksheet, position the cursor in the box at the top of the column above row 1 and below the C# label. Type in the name you want to assign to the column. In version 14 of Minitab, column names may be longer than 8 characters. Position the cursor in the box at the top of column 1 above row 1 and type in the name California.

0.5 Compute Descriptive Statistics

Minitab offers a variety of basic statistics to analyze data. Let’s begin by obtaining a summary table describing the variable California.

Follow these steps to determine the descriptive statistics for California.

1. Compute descriptive statistics. Begin by using the mouse to click on Stat>Basic Statistics>Display Descriptive Statistics.

2. Enter the information in the dialog box, as shown in Figure 0.7. Select the variable California by highlighting California and double clicking (or choosing Select). Choose OK.

Figure 0.7

0.6 Saving a File

There are three basic components in a Minitab session: the worksheet (contained in the Worksheet window), the Session window, and graphs. Saving graphs will be covered after you create your first graph.

Saving a Worksheet

Follow these steps to save a Minitab worksheet for the first time.

1. Choose File>Save Current Worksheet As...

2. Select the drive. Designate the correct drive and path for saving the file in the dialog box, as shown in Figure 0.8, then position the cursor in the box labeled File Name: and type in the filename.

Figure 0.8

Minitab uses the same file naming conventions as Windows. Minitab worksheets use the file extension .mtw and Minitab portable files use the extension .mtp. Designate the drive as A: and enter the filename ex1_12a. Choose Save.

Saving a Project

When you save your work as a project, you save all the information about your work. The contents of every window are saved, including the columns of data in each Worksheet window, the complete text in the Session window and History window, and each Graph window. You will want to save these results if it is necessary to examine the output at a later time or use the output in a document. Follow these steps to save a project.

a. Choose File>Save Project As... Designate the correct drive and path for saving the file in the dialog box, as shown in Figure 0.9, then position the cursor in the box labeled File Name: and type in the filename for this project file. Select drive A:, enter the filename ex1_12a.mpj and choose Save.

Figure 0.9

0.7 Printing the Session Window

Follow these steps to print a copy of the Session window.

1. Make the Session window the active window. Click on the title bar of the Session window to make it the active window.

2. Select the correct printer. If necessary, choose File>Print Setup... to select the correct printer. After selecting the correct printer, choose OK.

3. Print the Session window. Choose File>Print Session Window... (see Figure 0.10). Choose OK to print the Session window.

Figure 0.10

0.8 Obtaining On-line Help

Follow these steps to obtain on-line help.

1. Click with the mouse on Help>Help to bring up the dialog box shown in Figure 0.11.

Figure 0.11

2. Select the topic. Click on the text "Getting Started" and "Introduction to Minitab". The Help window, as shown in Figure 0.12, displays basic information on using Minitab. Click on the Close option button on the top right of the Help window.

Figure 0.12

3. Using the Index. Click the Index tab to bring up the dialog box shown in Figure 0.13. Type "Graph Menu" in the Type in the keyword to find: text box. Click on the Display option button, as shown in Figure 0.13. The results of selecting Histogram are shown in Figure 0.14.

Figure 0.13


Figure 0.14

4. Exiting Help. To return to the Minitab session, click on the Close option button on the top right of the Help window.

Chapter 1 Data Displays

1.1 Overview

This chapter covers two basic displays for categorical and numerical data. After reading this chapter you should be able to

1. Construct a Histogram (Bar Chart) for Categorical Data (Example 1.11, text)

2. Construct a Dot Plot for Numerical Data (Example 1.12, text)

One of the most useful ways to begin an initial exploration of data is to use techniques that result in a pictorial representation of the data. The graphic representations can visually reveal characteristics of the variable being examined. There are a variety of graphic techniques that may be used to describe the data. The technique used is dictated by the type of data and the circumstances surrounding the problem.

1.2 Bar Charts and Dotplots

New Minitab Commands

1. Graph>Dotplot - Use to view the distribution of individual data. Minitab displays a dot for each observation along a number line. If there are many data, each dot may represent multiple points. Minitab prints a footnote on the graph about the maximum number of observations that the dots represent. Minitab divides the sample values into many small intervals, or bins. The number of dots in a bin represents the number of data values within that interval. The pattern of the dots allows you to view the data distribution. Dotplots are especially useful for comparing distributions of several groups.

2. Graph>Bar Chart - Use to compare categories of data. Each bar can represent a count for a category, a function of a category (such as the mean, sum, or standard deviation), or summary values from a table. Because each bar represents a discrete category, the distances between bars on the x-axis have no true meaning.

Appropriate graphical representations of data often have more impact and convey more information quickly than a numerical summarization.

Bar Charts for Categorical Data

Histograms that summarize categorical data are also referred to as bar graphs or bar charts. These histograms for categorical data show the frequency corresponding to each category as a proportionally sized rectangular area. Minitab calls this type of histogram a chart. Minitab enables you to create simple bar charts, clustered bar charts, stacked bar charts, increasing bar charts, decreasing bar charts, cumulative bar charts, percent bar charts (by category), and bar charts with transposed axes (horizontal bar charts) by choosing the appropriate items in the <Options> dialog box. Let’s look at the following problem to construct a horizontal bar chart (histogram).

Example 1.11, text

The Problem - Why Students Drop Out

The article "So Close, Yet So Far: Predictors of Attrition in College Seniors" (J. College Student Development (1998): 343-348) examined the reasons that college seniors leave their college programs before graduating. Forty-two college seniors at a large public university who dropped out prior to graduation were interviewed and asked the main reason for discontinuing enrollment at the university. Data consistent with that given in the article are contained in file ex1_11.mtp.

Follow these steps to construct a bar chart.

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex1_11.mtp. Choose Open.

2. Construct the histogram. Choose Graph>Bar Chart. From the Bars represent: drop down list box, select Values from a table. Select Simple as the type of bar chart from the One column of values choices. Place Frequency in the Graph variables: text box. Place Reason For Leaving in the Categorical variable: text box. Choose OK.

The Minitab Output

Figure 1.1

The bar chart in Figure 1.1 shows that more students reported leaving the university for economic reasons or to attend another school than for academic reasons.

Dotplots

Dotplots are an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values.

Example 1.12, text

The Problem - Graduation Rates for NCAA Division I Schools in California and Texas

The Chronicle of Higher Education (Almanac Issue, Aug. 31, 2001) reported graduation rates for NCAA Division I schools. The rates reported are the percent of full-time freshmen in fall 1993 who earned a bachelor’s degree by Aug. 1999. Data from the two largest states, the 20 Division I schools in California and the 19 in Texas, are contained in file ex1_12.mtp.

Follow these steps to construct a dotplot for the graduation rates:


2. Construct the dotplot. Choose Graph>Dotplot... Select Simple as the type of dotplot from the Multiple Y’s dialog choices. Place California and Texas in the Graph variables: text box. Choose OK.

The Minitab Output

Figure 1.2

The graphics window, as shown in Figure 1.2, labeled Dotplot for California-Texas, shows separate dotplots for the California and Texas schools. The dotplots are drawn using the same scale to facilitate comparisons. From the two plots, we can see that while both states have a high and low group, there are only six schools in the low group for California and only six schools in the high group for Texas.

Chapter 2 Selecting Random Samples

2.1 Overview

This chapter covers one process of selecting a simple random sample using a discrete uniform distribution. After reading this chapter you should be able to

1. Generate a Simple Random Sample (Example 2.3, text)

2.2 Random Sampling


1. Calc>Random Data>Integer - Generates random data from an integer distribution, which is a discrete uniform distribution that ranges from the minimum to the maximum integer value specified. Each integer in the range has equal probability. In this section, you will generate three random numbers between 1 and 100.

Example 2.3, text

The Problem - Selecting a Random Sample of Glass Soda Bottles

Breaking strength is an important characteristic of glass soda bottles. If the strength is too low, a bottle may burst - not a desirable outcome. Suppose that we want to measure the breaking strength of each bottle in a random sample of size n = 3 selected from four crates containing a total of 100 bottles (the population). Each crate contains five rows of five bottles each. We can identify each bottle with a number from 1 to 100 by numbering across the rows starting with the top row of crate 1 and continuing with each of the crates.

 1  2  3  4  5          76 77 78 79  80
 6  7  8  9 10          81 82 83 84  85
11 12 13 14 15   ...    86 87 88 89  90
16 17 18 19 20          91 92 93 94  95
21 22 23 24 25          96 97 98 99 100

Follow these steps to generate three random numbers between 1 and 100 to determine which bottles would be included in our sample:

1. Generate three random numbers between 1 and 100. Choose Calc>Random Data>Integer. Place 1 in the Generate rows of data: text box. Place c1-c3 in the Store in column(s): text box. Place 1 in the Minimum value: text box. Place 100 in the Maximum value: text box. Choose OK.


The Minitab Output

Figure 2.1

The section of the Worksheet window displayed in Figure 2.1 indicates the bottles selected in this random sample of size n = 3. This particular sample includes bottle 83 (row 2 column 3 of crate 4), bottle 76 (row 1 column 1 of crate 4), and bottle 24 (row 5 column 4 of crate 1). Since these represent a random sample, the sample will be different each time this command is executed.
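As a rough cross-check outside Minitab, the same idea can be sketched in Python. This is an illustration under stated assumptions, not Minitab's own algorithm: `draw_sample` and `bottle_location` are hypothetical helper names, and the crate layout (four crates, five rows of five bottles, numbered across the rows) is taken from the grid in Example 2.3.

```python
import random

BOTTLES_PER_ROW = 5
ROWS_PER_CRATE = 5
BOTTLES_PER_CRATE = BOTTLES_PER_ROW * ROWS_PER_CRATE  # 25 bottles in a crate


def bottle_location(number):
    """Map a bottle number (1-100) to (crate, row, column), all counted from 1."""
    crate, offset = divmod(number - 1, BOTTLES_PER_CRATE)
    row, column = divmod(offset, BOTTLES_PER_ROW)
    return crate + 1, row + 1, column + 1


def draw_sample(size, low=1, high=100, seed=None):
    """Draw `size` independent integers from a discrete uniform on [low, high]."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(size)]


for bottle in draw_sample(3):
    crate, row, column = bottle_location(bottle)
    print(f"bottle {bottle}: row {row} column {column} of crate {crate}")
```

Because the draws are independent, the same bottle number can occasionally appear twice; `random.sample(range(1, 101), 3)` is one way to rule out repeats.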


Chapter 3 Data Displays

3.1 Overview

This chapter covers the basic displays for categorical and numerical data. After reading this chapter you should be able to

1. Construct a Comparative Bar Chart for Categorical Data (Example 3.1, text)

2. Construct a Pie Chart for Categorical Data (Example 3.4, text)

3. Construct a Stem-and-Leaf Display (Example 3.8, text)

4. Construct a Histogram (Example 3.17, text)

5. Construct a Scatter Plot (Example 3.21, text)

One of the most useful ways to begin an initial exploration of data is to use techniques that result in a pictorial representation of the data. The graphic representations can visually reveal characteristics of the variable being examined. There are a variety of graphic techniques that may be used to describe the data. The technique used is dictated by the type of data and the circumstances surrounding the problem.

3.2 Comparative Bar Graphs


1. Graph>Bar Chart - Use to compare categories of data. Each bar can represent a count for a category, a function of a category (such as the mean, sum, or standard deviation), or summary values from a table.

a. Bars represent - By choosing appropriate elements in the Bars represent: drop down list box, you can select columns of categorical data, measurement data or summary data.

• Counts of unique values: Choose if you have one or more columns of categorical data and you want to chart the frequency of each category.

• A function of a variable: Choose if you have one or more columns of measurement data and one or more columns of corresponding categorical data, and you want to chart a function of the measurement data, such as the mean, for each category.

• Values from a table: Choose if you have one or more columns of summary data and one or more columns of corresponding categorical data, and you want to chart the summary value for each category. In this section, you will create a clustered bar chart from summary data contained in a two-way table.

Let’s look at the following problem to construct a comparative bar graph.

Example 3.1, text

The Problem - Perceived Risk of Smoking

The article "Most Smokers Wish They Could Quit" (Gallup Poll Analyses, Nov. 21, 2002) noted that smokers and nonsmokers perceive the risks of smoking differently. The accompanying relative frequency table summarizes responses regarding the perceived harm of smoking for each of three groups - a sample of smokers, a sample of former smokers, and a sample of nonsmokers.

                                      Relative Frequency
Perceived Risk of Smoking    Smokers    Former Smokers    Nonsmokers
Very Harmful                   .60           .78              .86
Somewhat Harmful               .30           .16              .10
Not Too Harmful                .07           .04              .03
Not Harmful at All             .03           .02              .01

Follow these steps to construct a comparative bar graph for the three groups with respect to the perceived risk of smoking:

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex_3_1.mtp. Choose Open. This worksheet is consistent with the above relative frequency table.

2. Construct the comparative bar graph. Choose Graph>Bar Chart. From the Bars represent drop down list box choose Values from a table. Under Two-way table, choose Cluster. Choose OK. Place VeryHarmful, SomewhatHarmful, NotTooHarmful and NotHarmfulAtAll in the Graph variables: text box. Place Status in the Row labels: text box. Choose OK.


The Minitab Output

Figure 3.1

The comparative bar chart is shown in Figure 3.1. It is easy to see the differences among the three groups with respect to the perceived risk of smoking. The proportion believing that smoking is very harmful is noticeably smaller for smokers than for either former smokers or nonsmokers, and the proportion for former smokers is smaller than for nonsmokers.
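For readers who want to see the clustering logic spelled out, the sketch below rebuilds the comparison as a plain-text chart in Python. It is not Minitab output; the function name `clustered_rows` and the bar width of 50 characters are arbitrary choices for this illustration, and the numbers are the relative frequencies from the table in Example 3.1.

```python
categories = ["Very Harmful", "Somewhat Harmful",
              "Not Too Harmful", "Not Harmful at All"]

# Relative frequencies from the table, one list per group,
# in the same order as `categories`.
relative_frequency = {
    "Smokers":        [0.60, 0.30, 0.07, 0.03],
    "Former Smokers": [0.78, 0.16, 0.04, 0.02],
    "Nonsmokers":     [0.86, 0.10, 0.03, 0.01],
}


def clustered_rows(table, labels, width=50):
    """Build text lines with one cluster of bars per category label."""
    lines = []
    for i, label in enumerate(labels):
        lines.append(label)
        for group, values in table.items():
            bar = "#" * round(values[i] * width)
            lines.append(f"  {group:<15}{bar} {values[i]:.2f}")
    return lines


print("\n".join(clustered_rows(relative_frequency, categories)))
```

Grouping the three bars under each response category is what makes the chart comparative: the eye compares bars within a cluster rather than across the whole chart.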

3.3 Pie Charts


1. Graph>Pie Chart - A pie chart shows the proportion of each data category relative to the whole data set.

a. Chart values from a table: Choose when the category names are in one column and the summary data (such as counts, sums, or percentages) are in another column.

• Categorical variable: Enter the column containing the categories.

• Summary variables: Enter one or more columns containing the summary data for each category. Minitab displays a separate pie chart for each summary column you enter.

Let’s look at the following problem to construct a pie chart.


Example 3.4, text

The Problem - Birds that "Fish"

Night herons and cattle egrets are species of birds that feed on aquatic prey in shallow water. These birds stalk submerged prey while wading in shallow water, and then strike rapidly and downward through the water in an attempt to catch the prey. The article "Cattle Egrets are Less Able to Cope with Light Refraction Than are Other Herons" (Animal Behaviour (1999): 687-694) gave data on outcome when 240 cattle egrets attempted to capture submerged prey. The data are summarized in the accompanying frequency distribution.

Outcome                          Frequency    Relative Frequency
prey caught on first attempt        103              .43
prey caught on second attempt        41              .17
prey caught on third attempt          2              .01
prey not caught                      94              .39

Follow these steps to construct a pie chart for the cattle egret data:

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex_3_4.mtp. Choose Open.

2. Construct the pie chart. Choose Graph>Pie Chart. Darken the Chart values from a table: option button. Place Outcome in the Categorical variable: text box. Place Frequency in the Summary variables: text box. Choose Labels. In the Titles/Footnotes Title: text box, enter a title of Cattle Egret. Choose OK.

The Minitab Output

Figure 3.2


The completed pie chart for the cattle egret data is shown in Figure 3.2.
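The size of each slice follows directly from the frequencies. The short Python sketch below (an illustration only, with variable names chosen here) converts the counts from Example 3.4 into relative frequencies and slice angles.

```python
# Frequencies from the cattle egret table in Example 3.4.
counts = {
    "prey caught on first attempt": 103,
    "prey caught on second attempt": 41,
    "prey caught on third attempt": 2,
    "prey not caught": 94,
}

total = sum(counts.values())

# Relative frequency = count / total; slice angle = relative frequency * 360.
shares = {outcome: count / total for outcome, count in counts.items()}
angles = {outcome: share * 360 for outcome, share in shares.items()}

for outcome in counts:
    print(f"{outcome:<30} {shares[outcome]:.2f} {angles[outcome]:6.1f} degrees")
```

Rounded to two decimals, the computed shares agree with the Relative Frequency column of the table.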

3.4 Stem-and-Leaf Displays


1. Graph>Stem-and-Leaf - Produces a character-based stem-and-leaf plot in the Session window.

The stem-and-leaf display provides an opportunity to explore a data set containing numerical data that may be either discrete or continuous. This exploration provides you the opportunity to obtain an intuitive feel for the shape of the data. Such a preliminary organization often reveals useful information and opens up paths of inquiry. Let’s look at the following problem to construct a basic stem-and-leaf plot.

Example 3.8, text

The Problem - Binge Drinking

The use of alcohol by college students is of great concern, not only to those in the academic community, but also, because of potential health and safety consequences, to society at large. The article "Health and Behavioral Consequences of Binge Drinking in College" (J. of the Amer. Med. Assoc. (1994): 1672-1677) reported on a comprehensive study of heavy drinking on campuses across the country. A binge episode was defined as five or more drinks in a row for males and four or more for females. (These values were not given in the cited article, but agree with a picture of the data that did appear.)

Follow these steps to construct a basic stem-and-leaf display for the percentage of binge drinkers:

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex_3_8.mtp (included on the disk). Choose Open.

2. Construct the stem-and-leaf display. Choose Graph>Stem-and-Leaf. Place % of Binge Drinkers in the Variables: text box. Place 10 in the Increment: text box (to duplicate the stem-and-leaf display within the text). Choose OK.


The Minitab Output

Figure 3.3

The section of the Session window labeled Stem-and-Leaf Display: Pages (Figure 3.3) indicates the number of observations (N) and depth information. To the left of the stem, Minitab indicates the cumulative number of observations, counting in from the extremes. This is the depth information. The depth represents how far the observation on the right is from the appropriate end of the data set. For example, the value 18 is represented by the 8s on the 1 stem and accounts for the eighth, ninth and tenth observations from the beginning of the ordered data set. The stem in which the median occurs is indicated in parentheses, and displays the frequency for that stem alone.

The stem-and-leaf display of Figure 3.3 suggests that a typical or representative value is in the stem 4 row, perhaps someplace in the low 40% range. The observations are not highly concentrated about this typical value, as would be the case if all values were between 20% and 49%. The display rises to a single peak as we move downward and then declines, and there are no gaps in the display. The shape of the display is not perfectly symmetric but rather appears to stretch out a bit more in the direction of low stems than in the direction of high stems. The most surprising feature of this data is that at most colleges in the sample, at least one-quarter of the students are binge drinkers.
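The depth column can be easier to follow once it is computed by hand. The Python sketch below is a simplified illustration, not Minitab's implementation: `stem_and_leaf` is a hypothetical helper that assumes non-negative whole numbers and an increment of 10, and the `percentages` list is made-up data, not the binge drinking values from the worksheet.

```python
def stem_and_leaf(values):
    """Return display lines (depth, stem, leaves) using an increment of 10."""
    data = sorted(values)
    n = len(data)
    # 1-based positions of the middle value (n odd) or middle two values (n even).
    lo, hi = (n + 1) // 2, n // 2 + 1
    lines, below = [], 0  # `below` counts observations on earlier stems
    for stem in range(data[0] // 10, data[-1] // 10 + 1):
        leaves = [v % 10 for v in data if v // 10 == stem]
        count = len(leaves)
        if below < lo and hi <= below + count:
            depth = f"({count})"  # median row: frequency of this stem alone
        else:
            # Cumulative count measured from the nearer end of the data.
            depth = str(min(below + count, n - below))
        lines.append(f"{depth:>4} {stem:>2} | {''.join(map(str, leaves))}")
        below += count
    return lines


# Hypothetical percentages, used only to exercise the function.
percentages = [18, 18, 18, 22, 25, 31, 34, 37, 41, 42, 44, 46, 53, 58, 61]
print("\n".join(stem_and_leaf(percentages)))
```

With these 15 made-up values the median is the eighth observation, which sits on the 3 stem, so that row shows its own frequency in parentheses while the other rows count in from the ends.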

3.5 Histograms


1. Graph>Histogram - Histograms are useful for examining the shape and spread of continuous sample data. They divide the sample values into many intervals called bins. Bars represent the number of observations falling within each bin (its frequency). In this section, you will construct a histogram for a data set.

Appropriate graphical representations of data often have more impact and convey more information quickly than a numerical summarization.

Example 3.17, text

The Problem - Mercury Contamination

Mercury contamination is a serious environmental concern. Mercury levels are particularly high in certain types of fish. Citizens of the Republic of Seychelles, a group of islands in the Indian Ocean, are among those who consume the most fish in the world. The article "Mercury Content of Commercially Important Fish of the Seychelles, and Hair Mercury Levels of a Selected Part of the Population" (Environ. Research (1983): 305-312) reported the following observations on mercury content (ppm) in the hair of 40 fishermen:

13.26 32.43 18.10 58.23 64.00 68.20 35.35 33.92 23.94 18.28
22.05 39.14 31.43 18.51 21.03  5.50  6.96  5.19 28.66 26.29
13.89 25.87  9.84 26.88 16.81 37.65 19.63 21.82 31.58 30.13
42.42 16.51 21.16 32.97  9.84 10.64 29.56 40.69 12.86 13.80

Follow these steps to construct a relative frequency histogram for the mercury content (ppm) in the hair of 40 fishermen:

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex_3_17.mtp. Choose Open.

2. Construct the relative frequency histogram. Choose Graph>Histogram... Choose the Simple histogram. Choose OK. Place Mercury (ppm) in the Graph variables: text box. Select Scale. Select Y-Scale Type. Darken the Percent Y-Scale Type. Choose OK.


The Minitab Output

Figure 3.4

The graphics window, as shown in Figure 3.4, indicates that the upper or right end of the histogram is much more stretched out than the lower or left end. Typical mercury content is somewhere between 20 and 30, but the data exhibits a substantial amount of variability about the center.
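The percent scale can be checked by binning the 40 observations by hand. The Python sketch below is an illustration under an assumption: it uses bins of width 10 starting at 0, which is a choice made here and may not match the bins Minitab selects automatically. The helper name `percent_histogram` is hypothetical.

```python
# Mercury content (ppm) for the 40 fishermen, from Example 3.17.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 37.65, 19.63, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]


def percent_histogram(values, width=10.0, start=0.0):
    """Return (lower, upper, percent) per bin; bins are [lower, upper)."""
    counts = {}
    for v in values:
        k = int((v - start) // width)  # index of the bin that holds v
        counts[k] = counts.get(k, 0) + 1
    n = len(values)
    return [(start + k * width, start + (k + 1) * width,
             100 * counts.get(k, 0) / n)
            for k in range(max(counts) + 1)]


for lower, upper, percent in percent_histogram(mercury):
    print(f"{lower:4.0f} to <{upper:3.0f}  {'#' * round(percent)} {percent:.1f}%")
```

Under this binning the tallest bars fall between 10 and 40 ppm and only a few observations lie above 50 ppm, consistent with the stretched-out right end described above.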

3.6 Scatterplots


1. Graph>Scatterplot - Use to illustrate the relationship between two variables by plotting one against the other. In this section, you will construct a scatterplot for a data set.

Appropriate graphical representations of data often have more impact and convey more information quickly than a numerical summarization.

Scatterplots

A scatterplot is a graph of the relationship between the two characteristics of interest. The scatterplot provides a visual means of assessing the relationship between the variables and can assist us in proposing reasonable models.


Example 3.21, text

The Problem - Vermont Sugarbushes

The growth and decline of forests is a matter of great concern in both the public and scientific communities. The paper "Relationships Among Crown Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes" (Canad. J. of Forest Res. (1995): 386-397) indicates the percentage of mean crown dieback (y, the dependent variable), which is one indicator of growth retardation, and soil pH (lower pH indicates a more acidic soil) (x, the independent variable).

Soil pH (x)   3.3  3.4  3.4  3.5  3.6  3.6  3.7  3.7  3.8  3.8
Dieback (y)   7.3 10.8 13.1 10.4  5.8  9.3 12.4 14.9 11.2  8.0

Soil pH (x)   3.9  4.0  4.1  4.2  4.3  4.4  4.5  5.0  5.1
Dieback (y)   6.6 10.0  9.2 12.4  2.3  4.3  3.0  1.6  1.0

Follow these steps to construct a scatterplot of the data.

1. Open the worksheet. Choose File>Open Worksheet. Select the file ex3_21.mtp. Choose Open.

2. Create the scatterplot. Choose Graph>Scatterplot. Choose the Simple scatterplot. Choose OK. Place Dieback (y) in the Y Graph variables: text box. Place Soil pH (x) in the X Graph variables: text box.

3. Add a title. Choose Labels. Place the title: Mean Crown Dieback (%) vs. Soil pH in the Titles/Footnotes Title: text box. Choose OK. Choose OK.

The Minitab Output

Figure 3.5


As shown in Figure 3.5, large values of crown dieback tend to be associated with low soil pH - a negative or inverse relationship. The two variables appear to be approximately linearly related, although the points would spread out quite a bit about any straight line drawn through the plot.
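One way to put a number on the direction seen in the plot is the sample correlation coefficient. It is not part of the Minitab steps in this section; the Python sketch below is offered only as a side check, with `pearson_r` a helper written for this illustration and the data copied from the table in Example 3.21.

```python
from math import sqrt

soil_ph = [3.3, 3.4, 3.4, 3.5, 3.6, 3.6, 3.7, 3.7, 3.8, 3.8,
           3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 5.0, 5.1]
dieback = [7.3, 10.8, 13.1, 10.4, 5.8, 9.3, 12.4, 14.9, 11.2, 8.0,
           6.6, 10.0, 9.2, 12.4, 2.3, 4.3, 3.0, 1.6, 1.0]


def pearson_r(x, y):
    """Sample correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(soil_ph, dieback)
print(f"r = {r:.3f}")  # a negative r matches the downward drift in the plot
```

A value of r below zero agrees with the inverse relationship described above, and a magnitude well short of 1 agrees with the visible scatter about any straight line.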


Chapter 4 Numerical Summaries

4.1 Overview

In the previous chapter you examined graphical methods for displaying data. Although graphical methods provide a visual picture of the data, they do not provide any numerical summary measures of the data. This chapter addresses the issue of providing numerical summaries. After reading this chapter you should be able to

1. Describe the Center of a Data Set (Example 4.3, text)

2. Describe the Variability of a Data Set (Example 4.8, text)

3. Obtain Quartiles and Construct a Boxplot (Example 4.11, text)

4.2 Measures of the Center and Variability of a Data Set

New Minitab Commands

1. Stat>Basic Statistics>Display Descriptive Statistics - Produces descriptive statistics (N, Mean, Median, Standard Deviation, etc.) for each variable or column. In this section, you will produce descriptive statistics for a small data set.

a. Graphs option - Provides the option of displaying a histogram, a histogram with a normal curve, a dotplot, a boxplot, or a graphical summary of the variables.

Numerical summaries that indicate where the center of a data set is located are called measures of central tendency. Measures of the center typically include the mean and median. Recall that the mean of a data set is the sum of the data divided by the number of pieces of data, while the median represents the middle value in an ordered data set and divides the data set into two equal parts.

Numerical summaries that describe the spread of values about the center are called measures of variability or measures of dispersion. Measures of variability typically include the range and standard deviation. The range represents the difference between the largest (maximum) and smallest (minimum) values in a data set. The standard deviation may be the most useful of all the measures of dispersion. The standard deviation is found by taking the square root of the variance, which is computed from the data as

$$s^2 = \frac{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}{n-1}$$
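The variance formula can be traced step by step in a few lines of Python. This is a sketch for illustration, not Minitab's internal routine: `sample_variance` is a name chosen here, and the data are the seven acrylamide levels listed in Example 4.8 of this chapter.

```python
from math import sqrt


def sample_variance(values):
    """Variance via the formula (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


acrylamide = [497, 193, 328, 155, 326, 245, 270]

s2 = sample_variance(acrylamide)
s = sqrt(s2)  # the standard deviation is the square root of the variance
print(f"s^2 = {s2:.1f}, s = {s:.1f}")
```

This form of the formula needs only the running totals of x and x squared, which is convenient for hand calculation; statistical software often uses the algebraically equivalent sum of squared deviations from the mean instead.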

Quartiles are a numerical summary that represent a measure of location. The lower quartile (Q1) represents the point such that 25% of the observations are below the point. The median is the second quartile (Q2) and is the point such that 50% of the observations are below the point. The upper quartile (Q3) represents the point such that 75% of the observations are below the point.

Let’s look at the following problem and determine the numerical summaries for the data set.

Example 4.3, textThe Problem - Number of Visits to a Class WebsiteForty students were enrolled in a section of STAT 130, a general education coursein statistical reasoning, during Fall quarter 2002 at Cal Poly, San Luis Obispo. Theinstructor made course materials, grades and lecture notes available to studentson a class website, and course management software kept track of how often eachstudent accessed any of the web pages on the class site. One month after the coursebegan, the instructor requested a report that indicated how many times each studenthad accessed a web page on the class site. The forty observations were:

20  37   4  20   0  84  14  36   5  331
19   0   0  22   3  13  14  36   4    0
18   8   0  26   4   0   5  23  19    7
12   8  13  16  21   7  13  12   8   42

Follow these steps to calculate the numerical summaries for the data set:
1. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_4_3.mtp. Choose Open.
2. Calculate the numerical summaries.
   Choose Stat>Basic Statistics>Display Descriptive Statistics. Place Visits in the Variables: text box. Choose OK.
3. Construct a dotplot.
   Choose Graph>Dotplot. Choose Simple. Choose OK. Place Visits in the Variables: text box. Choose OK.

The Minitab Output

Descriptive Statistics: Visits

Variable   N  N*   Mean  SE Mean  StDev
Visits    40   0  23.10     8.27  52.33

Variable  Minimum    Q1  Median     Q3  Maximum
Visits       0.00  4.25   13.00  20.75   331.00

Figure 4.1

The Minitab output, shown in Figure 4.1, indicates the number of observations (N), the sample mean (Mean) and other statistics describing the data. The sample mean for this data is x̄ = 23.10.
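The values in Figure 4.1 can be cross-checked outside Minitab. The following is a minimal Python sketch, not Minitab's own code. It assumes that quartiles are located at position (n + 1)p in the ordered data with linear interpolation; that rule is not stated in the text, but it reproduces the Q1, Median and Q3 values shown above. The helper name `quartile` is hypothetical.

```python
import math
import statistics

# Visits data from Example 4.3 (40 students).
visits = [
    20, 37, 4, 20, 0, 84, 14, 36, 5, 331,
    19, 0, 0, 22, 3, 13, 14, 36, 4, 0,
    18, 8, 0, 26, 4, 0, 5, 23, 19, 7,
    12, 8, 13, 16, 21, 7, 13, 12, 8, 42,
]


def quartile(data, p):
    """Value at 1-based position (n + 1) * p, linearly interpolated (assumed rule)."""
    ordered = sorted(data)
    n = len(ordered)
    pos = (n + 1) * p
    lower = math.floor(pos)
    frac = pos - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


n = len(visits)
mean = statistics.mean(visits)
stdev = statistics.stdev(visits)      # sample standard deviation (divisor n - 1)
se_mean = stdev / math.sqrt(n)        # standard error of the mean
q1 = quartile(visits, 0.25)
median = quartile(visits, 0.50)
q3 = quartile(visits, 0.75)

print(f"N={n} Mean={mean:.2f} SE Mean={se_mean:.2f} StDev={stdev:.2f}")
print(f"Min={min(visits)} Q1={q1} Median={median} Q3={q3} Max={max(visits)}")
```

Because the interpolation rule is an assumption, other software may report slightly different quartiles for the same data.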


The Minitab Output

Figure 4.2

The dotplot of the data, as shown in Figure 4.2, suggests that 23.10 is not a very representative value for this sample, since 23.10 is larger than most of the observations in the data set: only 7 of 40, or 17.5%, are larger than 23.10. The two outlying values of 84 and 331 have a substantial impact on the value of x̄.

Example 4.8, text
The Problem - Acrylamide Levels in French Fries

Research by the Food and Drug Administration shows that acrylamide (a possible cancer-causing substance) forms in high carbohydrate foods cooked at high temperatures and that acrylamide level can vary widely even within the same brand of food (Associated Press, Dec. 6, 2002). FDA scientists analyzed McDonald’s French fries purchased at seven different locations and found the following acrylamide levels:

497 193 328 155 326 245 270

Follow these steps to calculate the numerical summaries for the data set:


2. Calculate the numerical summaries.
   Choose Stat>Basic Statistics>Display Descriptive Statistics. Place Acrylamide in the Variables: text box. Choose OK.


The Minitab Output

Descriptive Statistics: Acrylamide

Variable    N  N*   Mean  SE Mean  StDev
Acrylamide  7   0  287.7     42.4  112.3

Variable    Minimum     Q1  Median     Q3  Maximum
Acrylamide    155.0  193.0   270.0  328.0    497.0

Figure 4.3

The Minitab output, as shown in Figure 4.3, indicates the number of observations (N), the number of missing observations (N*), the sample mean (Mean), the standard error of the mean (SE Mean), the sample standard deviation (StDev), the minimum value (Minimum), the first quartile (Q1), the sample median (Median), the third quartile (Q3) and the maximum value (Maximum). You can identify that the acrylamide levels in French fries have a sample mean of 287.7 and a sample standard deviation of 112.3.
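The standard deviation in Figure 4.3 follows from the computational form of the variance, s² = (Σx² − (Σx)²/n)/(n − 1). The Python sketch below applies that formula to the seven acrylamide values; it is an illustration of the arithmetic, not a description of how Minitab computes the result internally.

```python
import math

# Acrylamide levels from Example 4.8 (seven locations).
acrylamide = [497, 193, 328, 155, 326, 245, 270]

n = len(acrylamide)
sum_x = sum(acrylamide)                       # sum of the observations
sum_x2 = sum(x * x for x in acrylamide)       # sum of the squared observations

mean = sum_x / n
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # computational formula for s^2
stdev = math.sqrt(variance)                      # s is the square root of s^2
se_mean = stdev / math.sqrt(n)                   # standard error of the mean

print(f"Mean={mean:.1f} StDev={stdev:.1f} SE Mean={se_mean:.1f}")
```

The computational formula is algebraically equivalent to summing squared deviations from the mean, so either form gives the same s.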

4.3 Quartiles and Boxplots

New Minitab Commands
1. Graph>Boxplot - Produces a boxplot. A default boxplot consists of a box, whiskers, and outliers. Minitab draws a line across the box at the median. In this section, you will construct a boxplot from a data set.
   a. Options - Contains one option specific to Boxplot. You can transpose X and Y. Place a check in the Transpose checkbox to interchange the variables defining the vertical and horizontal axes.

The boxplot provides a quick display of some important features of the data. The boxplot "distills" the data set to its most important features and provides a formal tool for identifying outliers during preliminary data analysis.

Example 4.11, text
The Problem - Golden Rectangles
The accompanying data came from an anthropological study of rectangular shapes (Lowie’s Selected Papers in Anthropology, Cora Dubois, ed., Berkeley, Calif.: Univ. of Calif. Press, 1960: 137-142). Observations were made on the variable x = width/length for a sample of n = 20 beaded rectangles used in Shoshoni Indian leather handicrafts.

.553 .570 .576 .601 .606 .606 .609 .611 .615 .628

.654 .662 .668 .670 .672 .690 .693 .749 .844 .933

Follow these steps to obtain quartiles and a boxplot of the data:
1. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_4_11.mtp. Choose Open.
2. Calculate the numerical summaries and obtain a boxplot.
   Choose Stat>Basic Statistics>Display Descriptive Statistics. Place x = width/length in the Variables: text box. Select Graphs. Place a check in the Boxplot of data: checkbox. Choose OK. Choose OK.

The Minitab Output

Descriptive Statistics: x=width/length

Variable          N  N*    Mean  SE Mean   StDev
x=width/length   20   0  0.6605   0.0207  0.0925

Variable         Minimum      Q1  Median      Q3  Maximum
x=width/length    0.5530  0.6060  0.6410  0.6855   0.9330

Figure 4.4

The Minitab output, as shown in Figure 4.4, indicates the sample median (Median) is 0.641, the first quartile (Q1) is 0.606 and the third quartile (Q3) is 0.6855.

The Minitab Output

Figure 4.5

An examination of the boxplot shown in Figure 4.5 indicates that a typical value is 0.641. The boxplot graphically depicts the position of the quartiles Q1, Q2 and Q3 and indicates that 50% of the data fall between 0.606 (Q1) and 0.6855 (Q3). The whiskers are the lines that extend from the left and right sides of the box to the adjacent values. The adjacent values in Minitab are the lowest and highest observations that are still inside the region defined by the following limits:

Lower Limit = Q1 − 1.5(Q3 − Q1)
Upper Limit = Q3 + 1.5(Q3 − Q1)


The lower whisker ends at the minimum value (0.553). Observe that in this case there are two outliers recognized by Minitab (outliers are identified by a *). The maximum value (0.933) is an extreme outlier and 0.844 is also an outlier. The median line is not at the center of the box, so there is a slight asymmetry in the middle half of the data. However, the most striking feature is the presence of the two outliers. These two x values considerably exceed the "golden ratio" of 0.618, used since antiquity as an aesthetic standard for rectangles.
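The outlier rule can be traced by hand or with a few lines of code. The Python sketch below computes the limits Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1) for the golden rectangle data, then lists the adjacent values and the outliers. It assumes quartiles sit at position (n + 1)p with linear interpolation, a rule chosen because it reproduces the Q1 and Q3 in Figure 4.4; the helper name `quartile` is hypothetical.

```python
import math

# Width/length ratios from Example 4.11 (20 beaded rectangles).
ratios = [
    0.553, 0.570, 0.576, 0.601, 0.606, 0.606, 0.609, 0.611, 0.615, 0.628,
    0.654, 0.662, 0.668, 0.670, 0.672, 0.690, 0.693, 0.749, 0.844, 0.933,
]


def quartile(data, p):
    """Value at 1-based position (n + 1) * p, linearly interpolated (assumed rule)."""
    ordered = sorted(data)
    n = len(ordered)
    pos = (n + 1) * p
    lower = math.floor(pos)
    frac = pos - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


q1 = quartile(ratios, 0.25)
q3 = quartile(ratios, 0.75)
iqr = q3 - q1                          # interquartile range, Q3 - Q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations inside the limits; the whiskers end at the extremes of these.
inside = [x for x in ratios if lower_limit <= x <= upper_limit]
adjacent_values = (min(inside), max(inside))

# Observations outside the limits are flagged as outliers.
outliers = [x for x in ratios if x < lower_limit or x > upper_limit]

print(f"Q1={q1:.4f} Q3={q3:.4f} limits=({lower_limit:.4f}, {upper_limit:.4f})")
print("adjacent values:", adjacent_values)
print("outliers:", outliers)
```

Under these assumptions the two largest ratios fall above the upper limit, which matches the two starred points described for Figure 4.5.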


Chapter 5
Summarizing Bivariate Data

5.1 Overview

This chapter introduces methods for describing relationships among quantitative variables and for using those relationships to predict a characteristic of interest, called the response or dependent variable. The characteristics used to predict the response variable are called the independent or predictor variables. After reading this chapter you should be able to

1. Obtain a Correlation Coefficient Between Two Variables (Example 5.4, text)

2. Fit a Least Squares Line to Bivariate Data (Example 5.6, text)

3. Assess the Fit of a Line Using a Residual Plot (Example 5.10, text)

Relationships Among Data
You can develop models, which express the relationships among various characteristics, to predict a characteristic of interest, called the response or dependent variable. The characteristics used to predict the response variable are called the independent or predictor variables. Minitab will perform simple linear correlation(s), linear regression and multiple regression. Both numerical and graphical presentations are available.

5.2 Pearson’s Sample Correlation Coefficient

New Minitab Commands
1. Stat>Basic Statistics>Correlation - Calculates the Pearson product moment correlation coefficient between each pair of variables you place in the Variables: text box. In this section, you will use this command to determine Pearson’s correlation coefficient.

A scatterplot of bivariate numerical data gives a visual impression of the relationship between two variables. In order to make precise statements and draw conclusions from data, we need to go beyond pictures. A correlation coefficient is a quantitative assessment of the strength of a linear relationship between the ordered pairs of data.

Example 5.4, text
The Problem - Is Foal Weight Related to Mare Weight?
Foal weight at birth is an indicator of health, so it is of interest to breeders of thoroughbred horses. Is foal weight related to the weight of the mare (mother)? The accompanying data are from the article "Suckling Behavior Does Not Measure Milk Intake in Horses" (Animal Behaviour (1999): 673-678).

Observation              1      2      3      4      5
Mare weight (x, in kg)   556    638    588    550    580
Foal weight (y, in kg)   129    119    132    123.5  112

Observation              6      7      8      9      10
Mare weight (x, in kg)   642    568    642    556    616
Foal weight (y, in kg)   113.5  95     104    104    93.5

Observation              11     12     13     14     15
Mare weight (x, in kg)   549    504    515    551    594
Foal weight (y, in kg)   108.5  95     117.5  128    127.5

Follow these steps to determine the correlation coefficient.
a. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_5_4.mtp. Choose Open.
b. Determine the correlation coefficient.
   Choose Stat>Basic Statistics>Correlation. Place Mare weight (x, in kg) and Foal weight (y, in kg) in the Variables: text box. Choose OK.
c. Create a scatterplot.
   Choose Graph>Scatterplot. Choose the Simple scatterplot. Choose OK. Place Foal weight (y, in kg) in the Y Graph variables: text box. Place Mare weight (x, in kg) in the X Graph variables: text box. Choose OK.

The Minitab Output

Correlations: Mare weight (x, in kg), Foal weight (y, in kg)

Pearson correlation of Mare weight (x, in kg) and Foal weight (y, in kg) = 0.001
P-Value = 0.996

Figure 5.1

A correlation coefficient this close to zero, as shown in Figure 5.1, indicates essentially no linear relationship between mare weight and foal weight.


The Minitab Output

Figure 5.2

A scatterplot of the data, as shown in Figure 5.2, supports the conclusion that mare weight and foal weight are unrelated. From the correlation coefficient alone, we can conclude that there is no linear relationship. We cannot rule out a more complicated curved relationship without examining the scatterplot.
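For readers who want to see the calculation behind the correlation command, the following Python sketch computes Pearson's r for the mare and foal weights from the sums of squares and cross products. The function name `pearson_r` is hypothetical, and the sketch shows the standard formula rather than Minitab's internal routine.

```python
import math

# Mare and foal weights (kg) from Example 5.4, observations 1 to 15.
mare = [556, 638, 588, 550, 580, 642, 568, 642, 556, 616, 549, 504, 515, 551, 594]
foal = [129, 119, 132, 123.5, 112, 113.5, 95, 104, 104, 93.5,
        108.5, 95, 117.5, 128, 127.5]


def pearson_r(x, y):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(mare, foal)
print(f"Pearson correlation = {r:.3f}")
```

A value of r near zero says only that a straight line describes the data poorly, which is why the scatterplot is still needed.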

5.3 Fitting a Least Squares Line to Bivariate Data

New Minitab Commands
1. Stat>Regression>Regression - Performs simple, polynomial, and multiple regression using the least squares method.
   a. Options - Permits various options: weighted regression, fit the model with/without an intercept, calculate variance inflation factors and the Durbin-Watson statistic, and calculate and store prediction intervals for new observations. In this section, you will use this command to make predictions using the least squares regression line.

2. Stat>Regression>Fitted Line Plot - Fits a simple linear or polynomial (second or third order) regression model and plots a regression line through the actual data or the log10 of the data. The fitted line plot shows you how closely the actual data lie to the fitted regression line. In this section, you will obtain a fitted line plot to illustrate how the estimated relationship fits the data in a simple linear regression model.

Given two variables x and y, the general objective of regression analysis is to use information about x to make predictions concerning y. The roles played by the two variables are reflected in the terminology: y is referred to as the dependent or response variable, while x is referred to as the independent, predictor, or explanatory variable. We can model the response variable as a linear function of the independent variable. The simple linear regression model is a straight line of the form

$$y = a + bx$$

where
1. a is the y-intercept, the point where the straight line crosses the y-axis,
2. b is the slope, the amount by which y increases when x increases by 1 unit.

Example 5.6, text
The Problem - Time to Defibrillator Shock and Heart Attack Survival Rate
Studies have shown that people who suffer sudden cardiac arrest (SCA) have a better chance of survival if a defibrillator shock is administered very soon after cardiac arrest. How is survival rate related to the time between when cardiac arrest occurs and when the defibrillator shock is delivered? This question is addressed in the paper "Improving Survival from Sudden Cardiac Arrest - The Role of Home Defibrillators" (University of Michigan, Feb. 2002). The accompanying data gives

y = survival rate (percent)

and

x = mean call-to-shock time (minutes)

for a cardiac rehabilitation center (where cardiac arrests occurred while victims were hospitalized and so the call-to-shock time tends to be short) and for four communities of different sizes.

mean call-to-shock time, x    2   6   7   9  12
survival rate, y             90  45  30   5   2

Follow these steps to determine the least squares equation between mean call-to-shock time, x and survival rate, y and to predict the survival rate, y when the mean call-to-shock time is 5 (min).

a. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_5_6.mtp. Choose Open.
b. Obtain the regression equation.
   Choose Stat>Regression>Regression. Place survival rate, y in the Response: text box. Place mean call-to-shock time, x in the Predictors: text box.
c. Make a prediction.
   Choose Options. Place 5 in the Prediction intervals for new observations: text box. Choose OK. Choose OK.


The Minitab Output

Figure 5.3

The Minitab output, as shown in Figure 5.3, indicates the regression equation: survival rate, y = 101 - 9.30 mean call-to-shock time, x. The table just below the equation contains information concerning the y-intercept (Constant) and the slope of the variable mean call-to-shock time, x. The column labeled "coefficient" gives the y-intercept (using more significant figures) as 101.33 and the slope as -9.296. The additional information appearing below will be addressed at a later point, with the exception of the last two lines. The last two lines in the Minitab output indicate that the predicted value (Fit) of survival rate, y for a mean call-to-shock time, x of 5 is 54.85.
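The intercept, slope and fitted value reported by Minitab can be reproduced from the five data pairs. The Python sketch below uses the usual least squares formulas, b = Sxy/Sxx and a = ȳ − b x̄; the function name `least_squares` is hypothetical and the code is an illustration rather than Minitab's algorithm.

```python
# Data from Example 5.6.
call_to_shock = [2, 6, 7, 9, 12]      # x = mean call-to-shock time (minutes)
survival = [90, 45, 30, 5, 2]         # y = survival rate (percent)


def least_squares(x, y):
    """Return (intercept a, slope b) of the least squares line y = a + b x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = least_squares(call_to_shock, survival)
fit_at_5 = a + b * 5                  # predicted survival rate when x = 5 minutes

print(f"survival rate = {a:.2f} + ({b:.3f}) * call-to-shock time")
print(f"Fit at x = 5: {fit_at_5:.2f}")
```

The prediction at x = 5 is an interpolation, since 5 minutes lies inside the observed range of 2 to 12 minutes.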

A plot illustrating how the estimated relationship fits the data is possible in Minitab. This plot is called a fitted line plot.


Follow these steps to obtain a fitted line plot.

a. Construct the fitted line plot.
   Choose Stat>Regression>Fitted Line Plot. Place survival rate, y in the Response (Y): text box. Place mean call-to-shock time, x in the Predictor (X): text box. Choose OK.

The Minitab Output

Figure 5.4

The fitted line plot, as shown in Figure 5.4, shows that the relationship between survival rate and mean call-to-shock time for times in the range 2 to 12 minutes could reasonably be summarized by a straight line.

5.4 Assessing the Fit of a Line

New Minitab Commands
1. Stat>Regression>Regression - Performs simple, polynomial, and multiple regression using the least squares method.
   a. Graphs - Displays residual plots. In this section, you will construct residual plots of the residuals versus the independent values.

Once the least squares line has been obtained, it is appropriate to ask how effectively the line summarizes the relationship between the dependent (y) and independent (x) variables. Specifically, we would like a quantitative indicator of the extent to which y variation can be attributed to the approximate linear relationship between the two variables.

Example 5.10, text
The Problem - Tennis Elbow
One factor in the development of tennis elbow is the impact-induced vibration of the racket and arm at ball contact. Tennis elbow is thought to be related to various properties of the tennis racket used. The accompanying data are a subset of those analyzed in the article "Transfer of Tennis Racket Vibrations into the Human Forearm" (Med. and Sci. in Sports and Exercise (1992): 1134-1140). Measurements on
x = racket resonance frequency (Hz)
and
y = sum of peak-to-peak accelerations (a characteristic of arm vibration in m/sec/sec)
are given for n = 14 different rackets.

Racket             1      2      3      4      5      6      7
Resonance (x)      105    106    110    111    112    113    113
Acceleration (y)   36.0   35.0   34.5   36.8   37.0   34.0   34.2

Racket             8      9      10     11     12     13     14
Resonance (x)      114    114    119    120    121    126    189
Acceleration (y)   33.8   35.0   35.0   33.6   34.2   36.2   30.0

Plotting the Residuals
A residual plot is a plot of the (x, residual) ordered pairs. A desirable plot is one that exhibits no particular pattern such as curvature. An examination of the residual plot after determining the least squares line effectively amounts to examining y after removing any linear dependence on x. Sometimes this examination may reveal the existence of a nonlinear relationship.

Follow these steps to determine the least squares equation between Resonance (x) and Acceleration (y):

a. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_5_10.mtp. Choose Open.
b. Obtain a fitted line plot.
   Choose Stat>Regression>Fitted Line Plot. Place Acceleration (y) in the Response (Y): text box. Place Resonance (x) in the Predictor (X): text box. Choose OK. The Fitted Line Plot is shown in Figure 5.5.
c. Obtain the regression equation.
   Choose Stat>Regression>Regression. Place Acceleration (y) in the Response: text box. Place Resonance (x) in the Predictors: text box.
d. Display the full table of fits and residuals.
   Choose Results. Darken the In addition, the full table of fits and residuals option button. Choose OK.
e. Obtain a plot of the residuals.
   Choose Graphs. Darken the Regular option button for Residuals for Plots. Place Resonance (x) in the Residuals versus the variables: text box. Choose OK. Choose OK. The plot of the residuals is shown in Figure 5.6.

The Minitab Output

Figure 5.5

The fitted line plot, as shown in Figure 5.5, provides a scatterplot for the full sample. Observe that one observation in the data is far to the right of the other points in the scatterplot. This observation corresponds to Racket 14, with a Resonance of 189 and Acceleration of 30.0. Because the least squares line minimizes the sum of the squared residuals, the least squares line is pulled down toward this discrepant point. This single observation plays a big role in determining the slope of the least squares line, and it is therefore called an influential observation. Notice that an influential observation is not necessarily the one with the largest residual, since the least squares line actually passes very near this point. The plot of the residuals, as shown in Figure 5.6, provides a visual assessment of the model's effectiveness in making predictions.


The Minitab Output

Figure 5.6

The residual plot, as shown in Figure 5.6, indicates a point whose x value differs greatly from others in the data set, exerting excessive influence in determining the fitted line. One method for assessing the impact of such an isolated point on the fit is to delete it from the data set, recompute the best-fit line, and evaluate the extent to which the equation of the line has changed.

Follow these steps to recompute the best-fit line between Resonance (x) and Acceleration (y):

a. Delete an observation.
   Delete the observation corresponding to Racket 14, with a Resonance of 189 and Acceleration of 30.0.
b. Obtain a new fitted line plot.
   Choose Stat>Regression>Fitted Line Plot. Place Acceleration (y) in the Response (Y): text box. Place Resonance (x) in the Predictor (X): text box. Choose OK. The new Fitted Line Plot is shown in Figure 5.7.
c. Obtain the new regression equation.
   Choose Stat>Regression>Regression. Place Acceleration (y) in the Response: text box. Place Resonance (x) in the Predictors: text box.
d. Display the full table of fits and residuals.
   Choose Results. Darken the In addition, the full table of fits and residuals option button. Choose OK.
e. Obtain a new plot of the residuals.
   Choose Graphs. Darken the Regular option button for Residuals for Plots. Place Resonance (x) in the Residuals versus the variables: text box. Choose OK. Choose OK. The plot of the residuals is shown in Figure 5.8.


The Minitab Output

Figure 5.7

The fitted line plot, as shown in Figure 5.7, suggests that the deletion of the influential observation corresponding to Racket 14, with a Resonance of 189 and Acceleration of 30.0, changes the slope and intercept of the least squares line. The plot of the residuals, as shown in Figure 5.8, enables you to determine that the residuals have no particular pattern or structure.

The Minitab Output

Figure 5.8
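The delete-and-refit check described in this section can also be sketched in code. The Python example below fits the least squares line to all 14 rackets, computes the residuals, and then refits without Racket 14 so the two slopes can be compared. The function names `least_squares` and `residuals` are hypothetical, and the sketch does not reproduce the Minitab output itself, which appears only in the figures.

```python
# Tennis racket data from Example 5.10, rackets 1 to 14.
resonance = [105, 106, 110, 111, 112, 113, 113, 114, 114, 119, 120, 121, 126, 189]
acceleration = [36.0, 35.0, 34.5, 36.8, 37.0, 34.0, 34.2,
                33.8, 35.0, 35.0, 33.6, 34.2, 36.2, 30.0]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y, intercept, slope):
    """Observed y minus fitted y for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Fit with the full sample.
a_all, b_all = least_squares(resonance, acceleration)
res_all = residuals(resonance, acceleration, a_all, b_all)

# Refit after deleting Racket 14 (the last observation in each list).
a_13, b_13 = least_squares(resonance[:-1], acceleration[:-1])
res_13 = residuals(resonance[:-1], acceleration[:-1], a_13, b_13)

print(f"all 14 rackets:    y = {a_all:.4f} + ({b_all:.4f}) x")
print(f"without Racket 14: y = {a_13:.4f} + ({b_13:.4f}) x")
print(f"residual for Racket 14 in the full fit: {res_all[-1]:.3f}")
```

Comparing the two printed slopes shows how much a single point with an extreme x value can change the fitted line, even when its own residual is small.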


Chapter 6
Probability

6.1 Overview

This chapter introduces the basic concepts of probability that are most widely used in statistics. Real data exhibit variability and that variability means uncertainty. Statistics uses probability to model the random behavior of real data as well as to quantify the uncertainty when inferences are made about a population of interest. Probability experiments via simulation may be performed a large number of times to understand underlying concepts of probability. Random samples with given probability distributions can be generated to examine and illustrate those probability distributions.

The relative frequency interpretation of probability holds that the probability of an event is the long-run proportion of the time that the outcome will occur. Minitab may be used to perform the calculations required in applying the relative frequency interpretation of probability. After reading this chapter you should be able to

1. Estimate Probabilities Using Simulation (Example 6.10 in DP5e, Example 6.32 in POD2e)

6.2 Estimating Probabilities Using Simulation

New Minitab Commands
1. Calc>Random Data>Integer - Generates random data from an integer distribution, which is a discrete uniform distribution that ranges from the minimum to the maximum integer value specified. Each integer in the range has equal probability. In this section, you will use this command to simulate "guessing" on a true-false question.

Simulation provides a means of estimating probabilities that "generates observations" by performing an experiment that is similar in structure to the real situation of interest.

Example 6.10, in DP5e, 6.32 in POD2e, text

The Problem - One-Boy Family Planning
Suppose that couples who wanted children were to continue having children until a boy is born. Assuming that each newborn child is equally likely to be a boy or a girl, would this behavior change the proportion of boys in the population? This question was posed in an article that appeared in The American Statistician (1994: 290-293), and many people answered the question incorrectly. We will use simulation to estimate the long-run proportion of boys in the population if families were to continue to have children until they have a boy. This proportion is an estimate of the probability that a randomly selected child from this population is a boy. Note that every sibling group would have exactly one boy.

We will use a single-digit random number to represent a child. The odd digits (1, 3, 5, 7, 9) will represent a male birth, and the even digits (0, 2, 4, 6, 8) will represent a female birth. An observation will be constructed by selecting a sequence of random digits. If the first random number obtained is odd (a boy), the observation is complete. If the first selected number is even (a girl), another digit will be chosen. We will continue in this way until an odd digit is obtained.

Follow these steps to simulate the experiment.
1. Generate random data.
   Choose Calc>Random Data>Integer. Place 1 in the Generate rows of data text box. Place C1-C20 in the Store in column(s): text box. Place 0 in the Minimum value: text box. Place 9 in the Maximum value: text box. Choose OK.

The Minitab Output

Figure 6.1

The Minitab output, as shown in Figure 6.1, indicates that the first 20 digits are

2 4 8 3 8 2 2 0 5 9 5 8 6 7 6 0 7 7 3 7

Using these numbers to simulate sibling groups, we get

Sibling group 1    2, 4, 8, 3       girl, girl, girl, boy
Sibling group 2    8, 2, 2, 0, 5    girl, girl, girl, girl, boy
Sibling group 3    9                boy
Sibling group 4    5                boy
Sibling group 5    8, 6, 7          girl, girl, boy
Sibling group 6    6, 0, 7          girl, girl, boy
Sibling group 7    7                boy
Sibling group 8    3                boy

After simulating eight sibling groups, we have 8 boys among 19 children. The proportion of boys is 8/19, which is close to 0.5. Continuing the simulation to obtain a large number of observations suggests that the long-run proportion of boys in the population would still be 0.5, which is indeed the case.
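The same experiment can be run for many more families than is practical by hand. The Python sketch below first splits the first 19 digits listed above into sibling groups, then simulates a large number of families with random digits from 0 to 9. The function names, the seed 2002 and the choice of 100,000 families are illustrative assumptions, not part of the original example.

```python
import random


def split_into_sibling_groups(digits):
    """Cut a digit sequence into groups that each end at the first odd digit (a boy)."""
    groups, current = [], []
    for d in digits:
        current.append(d)
        if d % 2 == 1:            # odd digit = boy, so the family stops here
            groups.append(current)
            current = []
    return groups                 # any unfinished trailing group is dropped


def simulate_proportion_of_boys(num_families, rng):
    """Each family has children until the first boy; return boys / children."""
    boys = 0
    children = 0
    for _ in range(num_families):
        while True:
            children += 1
            if rng.randint(0, 9) % 2 == 1:   # random digit 0..9, odd means boy
                boys += 1
                break
    return boys / children


# The first 19 digits of the Minitab output in this example.
first_19 = [2, 4, 8, 3, 8, 2, 2, 0, 5, 9, 5, 8, 6, 7, 6, 0, 7, 7, 3]
groups = split_into_sibling_groups(first_19)
boys_by_hand = len(groups)                       # one boy per completed group
children_by_hand = sum(len(g) for g in groups)
print(f"{boys_by_hand} boys among {children_by_hand} children")

rng = random.Random(2002)                        # fixed seed so the run is repeatable
proportion = simulate_proportion_of_boys(100_000, rng)
print(f"simulated long-run proportion of boys: {proportion:.3f}")
```

A fixed seed is used only so that repeated runs give the same numbers; any seed should give a proportion close to 0.5 with this many families.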


Chapter 7
Population Distributions

7.1 Overview

This chapter begins to link together the basic concepts of probability with the concepts of statistical inference. This chapter introduces probability models that can be used to describe the distribution of characteristics of individuals in a population. Such models are essential if we are to reach conclusions based on a sample from the population. After reading this chapter you should be able to

1. Find Areas Under the Normal Curve (Example 7.13 in DP5e, Example 7.25 in POD2e)

7.2 Normal Distributions

New Minitab Commands
1. Calc>Probability Distributions>Normal - Allows you to calculate the probability densities, cumulative probabilities, and inverse cumulative probabilities for a normal distribution. For the continuous distributions, such as the normal distribution, Minitab calculates the continuous probability density function.
   a. Cumulative probability: In this section, you will darken the Cumulative probability: option button in order to determine the area under the normal probability density function.

Normal distributions are continuous probability distributions. Normal distributions are frequently used as population models since normal distributions provide reasonable approximations to the distributions of many different variables. Different normal distributions are distinguished from one another by their mean, µ, and standard deviation, σ. The mean, µ, describes the center of the distribution, and the standard deviation, σ, describes the spread of the distribution.

The Standard Normal Distribution
The standard normal distribution is a normal distribution with

µ = 0 and σ = 1.

It is customary to let z represent a variable whose distribution is described by the standard normal curve. In working with the normal distribution, we need to be able to find areas under the standard normal distribution.

You can use Minitab to create a table of values from a normal probability distribution with a given mean and standard deviation. Typically, you have to convert a value of x to a z-value and then look up the answer in a table of areas under the normal curve in order to find a probability. Minitab provides a table of probabilities in terms of the original value, x. Minitab will enable you to determine cumulative areas to the left of a particular value of x, or a particular value of z.


You can calculate probabilities and describe values for any normal distribution. To calculate those probabilities, you may (a) use Minitab or (b) standardize the relevant values and use a table of areas under the normal curve.

If x is a variable whose behavior is described by a normal distribution with mean µ and standard deviation σ, then

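$$P(x < b) = P(z < b^*)$$

$$P(a < x) = P(a^* < z) \quad \text{(equivalently, } P(x > a) = P(z > a^*)\text{)}$$

$$P(a < x < b) = P(a^* < z < b^*)$$

where z is a variable whose distribution is standard normal and

$$a^* = \frac{a - \mu}{\sigma} \quad \text{and} \quad b^* = \frac{b - \mu}{\sigma}$$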

Example 7.13 in DP5e, 7.25 in POD2e, text
The Problem - Children’s Heights
In poor countries, the growth of children can be an important indicator of general levels of nutrition and health. Data in the paper "The Osteological Paradox: Problems of Inferring Prehistoric Health from Skeletal Samples" (Current Anthropology (1992): 343-370) suggests that a reasonable model for the population distribution of the continuous numerical variable x = height of a five-year old child is a normal distribution with mean µ = 100 cm and standard deviation σ = 6 cm.

Follow these steps to determine the proportion of the population that has a height between 94 cm and 112 cm.
1. Enter data.
   Enter the heights of 94 and 112 cm in column C1 in the Data window. Name column C1 as Heights.
2. Determine areas.
   Choose Calc>Probability Distributions>Normal. Darken the Cumulative probability: option button. Place 100 in the Mean: text box. Place 6 in the Standard deviation: text box. Darken the Input column: option button. Place Heights in the Input column: text box. Place Area in the Optional storage: text box. Choose OK.

The Minitab Output

Figure 7.1

The resulting Minitab output, as shown in Figure 7.1, indicates the area to the left of x = 94 (and z = -1.00) is 0.1587 and the area to the left of x = 112 (and z = 2.00) is 0.9772. The area between z = -1.00 and z = 2.00 is found by subtracting the two areas: .9772 − .1587 = .8185 or 81.85%. If height were observed for many children from this population, about 82% of the heights would fall between 94 and 112 cm.

What is the probability that a randomly chosen child will be taller than 110 cm? Follow these steps to determine the probability that a randomly chosen child will be taller than 110 cm.
1. Enter data.
   Delete the heights of 94 and 112 cm in column C1 in the Data window, and the corresponding areas under Area. Enter 110 in row 1 of column C1, Heights.
2. Determine areas.
   Choose Calc>Probability Distributions>Normal. Darken the Cumulative probability: option button. Place 100 in the Mean: text box. Place 6 in the Standard deviation: text box. Darken the Input column: option button. Place Heights in the Input column: text box. Place Area in the Optional storage: text box. Choose OK.

The Minitab Output

Figure 7.2

The resulting Minitab output, as shown in Figure 7.2, indicates the area to the left of x = 110 (and z = 1.67) is 0.9522. The area to the right of x = 110 (and z = 1.67) is found by subtracting the area to the left of x = 110 from 1: 1.0000 − .9522 = 0.0478 or 4.78%. If height were observed for many children from this population, about 5% of the heights would be larger than 110 cm.
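Both probabilities in this example come from cumulative areas under a normal curve with µ = 100 and σ = 6. As a cross-check, the Python sketch below evaluates the normal cumulative distribution function through the error function in the standard library. The function name `normal_cdf` is hypothetical; the calculation standardizes x to z = (x − µ)/σ exactly as described above.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Area to the left of x under a normal curve with the given mean and sd."""
    z = (x - mean) / sd                                  # standardize x
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


MEAN, SD = 100.0, 6.0        # model for the height of a five-year-old (cm)

area_left_94 = normal_cdf(94, MEAN, SD)
area_left_112 = normal_cdf(112, MEAN, SD)
p_between = area_left_112 - area_left_94        # P(94 < x < 112)

p_taller_110 = 1.0 - normal_cdf(110, MEAN, SD)  # P(x > 110)

print(f"P(94 < x < 112) = {p_between:.4f}")
print(f"P(x > 110)      = {p_taller_110:.4f}")
```

Small differences in the last decimal place relative to a hand calculation can arise because the code subtracts unrounded areas.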


Chapter 8 There are no examples for chapter 8.

Chapter 9
Estimation Using a Single Sample

9.1 Overview

The objective of inferential statistics is to use sample data to decrease our uncertainty about the corresponding population. Often, data is collected to obtain information that allows the investigator to estimate the value of some population characteristic, such as a population mean, µ, or a population proportion, π. This could be accomplished by using the sample data to arrive at a single number that represents a plausible value for the characteristic of interest. Alternatively, one could report an entire range of plausible values for the characteristic. These two estimation techniques, point estimation and interval estimation, are addressed in this chapter. After reading this chapter you should be able to

1. Make Point Estimates (Proportions, Means) (Example 9.2, text)

2. Construct Large-Sample Confidence Intervals (Means) (Example 9.5, text)

3. Construct a Small-Sample Confidence Interval (Mean) (Example 9.9, text)

9.2 Point Estimation

The usual way of obtaining information regarding the value of a population characteristic, such as a population mean, µ, or a population proportion, π, is by selecting a sample from the population. A point estimate of a population characteristic is a single number that is based on sample data and represents a plausible value of the characteristic. For example, a survey by a public interest group might report that 585 of 1000 individuals favor a proposal to lower the drunk-driving blood alcohol level from 0.10 to 0.08. The sample proportion, p, is a point estimate of π; in this case that is p = 585/1000 = 0.585. As a second example, sample data might suggest that 50 calories (from fat in a serving) is a plausible value for µ, the true mean calorie content (from fat in a serving) in Banana Nut Crunch cereal. This is the value stated on the package. In this example, 50 is a point estimate of µ.

Example 9.2, text
The Problem - Internet Use by College Students
The article "Online Extracurricular Activity" (USA Today, Mar. 13, 2000) reported the results of a study of college students conducted by a polling organization called The Student Monitor. One aspect of computer use examined in this study was the number of hours per week spent on the Internet. Suppose the following observations represent the number of Internet hours per week reported by twenty college students (these data are compatible with summary values given in the article).

4.00  5.00  5.00  5.25   5.50
6.25  6.25  6.50  6.50   7.00
7.25  7.75  8.00  8.00   8.00
8.25  8.50  8.50  9.50  10.50

Follow these steps to produce a summary of the data.
1. Open the worksheet.
   Choose File>Open Worksheet. Select the file ex_9_2.mtp. Choose Open.
2. Construct a dotplot.
   Choose Graph>Dotplot. Choose Simple. Choose OK. Place Internet Time in the Variables: text box. Choose OK.

The Minitab Output

Figure 9.1

The Minitab output, shown in Figure 9.1, is a dotplot of the observations.

Suppose further that a point estimate of µ, the true mean Internet time per week for college students, is desired. An obvious choice of a statistic for estimating µ is the sample mean, x̄. However, there are other possibilities. We might consider using a trimmed mean or even the sample median, since the data set exhibits some symmetry.

3. Calculate the numerical summaries.

Choose Stat>Basic Statistics>Display Descriptive Statistics. Place Internet Time in the Variables: text box. Choose Statistics. Place a check in the Trimmed mean checkbox. Choose OK.


The Minitab Output

Descriptive Statistics: Hours

Variable        N  N*   Mean  SE Mean  TrMean  StDev
Internet Time  20   0  7.075    0.370   7.056  1.655

Variable       Minimum     Q1  Median     Q3  Maximum
Internet Time    4.000  5.688   7.125  8.188   10.500

Figure 9.2

The three statistics and the resulting estimates of µ calculated from the data are the sample mean,

$$\bar{x} = \frac{\sum x}{n} = \frac{141.50}{20} = 7.075,$$

the sample median = 7.125, and the 5% trimmed mean = 7.056. The estimates of the mean Internet time per week differ somewhat from each other. The choice from among them should depend on which statistic tends, on average, to produce an estimate closest to the true value of µ.
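The three point estimates can also be checked outside Minitab. The following is a minimal sketch in Python using only the standard library; the helper name trimmed_mean is hypothetical, and it assumes that a 5% trimmed mean drops 5% of the observations (rounded to the nearest whole number) from each end of the sorted data, which is consistent with the TrMean value shown in Figure 9.2.

```python
import statistics

# Internet hours per week for the twenty students in Example 9.2
hours = [4.00, 5.00, 5.00, 5.25, 5.50, 6.25, 6.25, 6.50, 6.50, 7.00,
         7.25, 7.75, 8.00, 8.00, 8.00, 8.25, 8.50, 8.50, 9.50, 10.50]


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the observations from each end.

    Assumption: the number dropped per end is rounded to the nearest integer.
    """
    ordered = sorted(values)
    k = round(len(ordered) * proportion)  # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


sample_mean = statistics.mean(hours)
sample_median = statistics.median(hours)
trimmed = trimmed_mean(hours)

print(f"sample mean     = {sample_mean:.3f}")
print(f"sample median   = {sample_median:.3f}")
print(f"5% trimmed mean = {trimmed:.3f}")
```

With n = 20, the 5% trimmed mean simply discards the smallest and largest observations (4.00 and 10.50) before averaging the remaining 18 values.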

9.3 A Large-Sample Confidence Interval

New Minitab Commands

1. Stat>Basic Statistics>1 Proportion - Performs a hypothesis test of the proportion and computes a confidence interval. In this section, you will calculate a confidence interval for the proportion of one or more variables.

You have just seen how to use a sample statistic to produce a point estimate of a population characteristic. The value of a point estimate depends on which sample is selected, and different samples usually yield different estimates of the population characteristic because of chance differences. Rarely is the point estimate from the sample exactly equal to the true value of the population characteristic. While a point estimate may represent the best single-number guess for the value of the population characteristic, it is not the only plausible value.

Suppose that, instead of reporting a single point estimate as the single most credible value for the population characteristic, we report an interval of reasonable values based on the sample data. For example, we might be confident that for all calls made from AT&T pay phones, the proportion, π, of calls that are billed to a credit card is in the interval from .53 to .57. The narrowness of this interval implies that we have rather precise information about the value of π. If, with the same degree of confidence, we could state only that π was in the interval from .32 to .74, it would be clear that we had relatively imprecise knowledge of the value of π.

A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval.

The confidence level associated with a confidence interval estimate specifies the success rate of the method used to construct the interval.


Example 9.5, text
The Problem - Violent Behavior in the Workplace
An Associated Press article on potential violent behavior reported the results of a survey of 750 workers who were employed full time (San Luis Obispo Tribune, Sept. 7, 1999). Of those surveyed, 125 indicated that they were so angered by a coworker during the past year that they felt like hitting the person (but didn’t). Assuming that it is reasonable to regard this sample of 750 as a random sample from the population of full-time workers, we can use this information to construct an estimate of π, the true proportion of full-time workers so angered in the last year that they wanted to hit a colleague.

Follow these steps to construct a 90% confidence interval for the population proportion, π, of respondents who responded yes:

1. Start Minitab.

2. Construct the confidence interval.

Choose Stat>Basic Statistics>1 Proportion. Darken the Summarized data option button. Place 750 in the Number of trials: text box. Place 125 in the Number of events: text box. Choose Options. Enter 90.0 in the Confidence level: text box. Accept the (default) value of 0.5 in the Test proportion: text box. Place a check in the Use test and interval based on normal distribution: checkbox. Choose OK. Choose OK.

The Minitab Output

Test and CI for One Proportion

Test of p = 0.5 vs p not = 0.5

Sample    X    N  Sample p  90% CI                Z-Value  P-Value
1       125  750  0.166667  (0.144283, 0.189050)   -18.26    0.000

Figure 9.3

For this sample, p = 125/750 = 0.167. Since np = 125 and n(1 − p) = 625 are both greater than or equal to 10, the sample size is sufficiently large for a large-sample confidence interval. A 90% confidence interval for π is between 0.144283 and 0.189050. Based on the sample data, we can be 90% confident that the true proportion of full-time workers who have been angry enough in the last year to consider hitting a coworker is between 0.144283 and 0.189050. We have used a method to construct this interval estimate that has a 10% error rate.
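As a cross-check on Figure 9.3, the large-sample interval p ± (z critical value)·sqrt(p(1 − p)/n) can be computed directly. The sketch below uses only the Python standard library; the function name proportion_ci is hypothetical and is not part of Minitab.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.90):
    """Large-sample (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided z critical value, e.g. about 1.645 for 90% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Example 9.5: 125 of 750 full-time workers
lower, upper = proportion_ci(125, 750, confidence=0.90)
print(f"90% CI for the proportion: ({lower:.6f}, {upper:.6f})")
```

Changing the confidence argument to 0.95 widens the interval, which illustrates the trade-off between confidence level and precision.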


9.4 A Small-Sample Confidence Interval

New Minitab Commands

1. Stat>Basic Statistics>1-Sample t - Performs a one-sample t-test or t-confidence interval for the mean. In this section, you will darken the Confidence interval: option button to calculate a separate one-sample confidence interval for the mean of one or more variables when the population standard deviation is not known.

The large-sample confidence interval for µ is appropriate whatever the shape of the population distribution. This is because it is based on the Central Limit Theorem, which states that when n is sufficiently large, the x̄ sampling distribution is approximately normal for any population distribution. When n is small, the Central Limit Theorem does not apply. One way to proceed in the small-sample case is to make a specific assumption about the shape of the population distribution and then use an interval that is valid under that assumption.

If the sample size, n, is small, then the shape of the x̄ sampling distribution may not be approximately normal. However, when the population distribution itself is normal, the x̄ sampling distribution is normal even for small sample sizes. Since σ is usually unknown, we must estimate σ² with the sample variance s², resulting in the standardized variable

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

which has a t distribution with n − 1 degrees of freedom.

Example 9.9, text
The Problem - Walking a Straight Line
A study of the ability of individuals to walk in a straight line (’’Can We Really Walk Straight?’’, Amer. J. of Physical Anthropology (1992): 19-27) reported the accompanying data on cadence (strides per second) for a sample of n = 20 randomly selected healthy men.

 .95    .85   .92    .95   .93
 .86   1.00   .92    .85   .81
 .78    .93   .93   1.05   .93
1.06   1.06   .96    .81   .96

Follow these steps to produce a summary of the data.


A normal probability plot of this data appears in Figure 9.4.

Figure 9.4

The plot, as shown in Figure 9.4, is reasonably straight, so it seems plausible that the population distribution is approximately normal.

Follow these steps to construct a 99% confidence interval for the Cadence values:

1. Open the worksheet.

Choose File>Open Worksheet. Select the file ex_9_9.mtp. Choose Open.

2. Construct the confidence interval.

Choose Stat>Basic Statistics>1-Sample t. Darken the Samples in columns: option button. Place Cadence in the Variables: text box. Select Options. Enter 99.0 in the Confidence level: text box. Choose OK. Choose OK.

The Minitab Output

One-Sample T: Cadence

Variable   N      Mean     StDev   SE Mean  99% CI
Cadence   20  0.925500  0.080947  0.018100  (0.873716, 0.977284)

Figure 9.5

With 99% confidence, we estimate the population mean cadence to be between 0.874 and 0.977 strides per second, as indicated in Figure 9.5. Remember that the 99% confidence level implies that if the same formula is used to calculate intervals for sample after sample randomly selected from the population, in the long run 99% of the intervals will capture µ between the lower and upper confidence limits.
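The interval in Figure 9.5 has the form x̄ ± (t critical value)·s/√n. The following minimal Python sketch reproduces it with the standard library. Because the standard library has no t quantile function, the critical value 2.861 (99% confidence, 19 degrees of freedom) is an assumption taken from a printed t table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Cadence (strides per second) for the 20 men in Example 9.9
cadence = [0.95, 0.85, 0.92, 0.95, 0.93,
           0.86, 1.00, 0.92, 0.85, 0.81,
           0.78, 0.93, 0.93, 1.05, 0.93,
           1.06, 1.06, 0.96, 0.81, 0.96]

# Assumption: two-sided 99% t critical value for df = 19, read from a t table
T_CRIT_99_DF19 = 2.861

n = len(cadence)
x_bar = mean(cadence)
s = stdev(cadence)       # sample standard deviation (n - 1 in the denominator)
se_mean = s / sqrt(n)    # estimated standard error of the mean

lower = x_bar - T_CRIT_99_DF19 * se_mean
upper = x_bar + T_CRIT_99_DF19 * se_mean
print(f"mean = {x_bar:.6f}, StDev = {s:.6f}, SE Mean = {se_mean:.6f}")
print(f"99% CI: ({lower:.4f}, {upper:.4f})")
```

Small differences from the Minitab output in the later decimal places come from rounding the table value of the critical t.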


Chapter 10 Hypothesis Testing Using a Single Sample

10.1 Overview

In the previous chapter, we considered situations in which the primary goal was to estimate the unknown value of some population characteristic. Sample data may also be used to decide if some claim or hypothesis about a population characteristic is plausible. This chapter addresses the issue of analyzing sample data to determine if a hypothesis about a population characteristic is plausible. Hypothesis testing methods presented in this chapter can be used to determine whether the sample data provides strong support for rejecting or failing to reject a hypothesis. After reading this chapter you should be able to

1. Perform a Large-Sample Hypothesis Test
a. For a Proportion (Example 10.11, text)

2. Perform a Hypothesis Test
a. For a Population Mean (Example 10.14, text)

Under most conditions it is impossible or unrealistic to study an entire population to obtain the value of the population characteristic of interest. Statisticians have developed techniques that enable us to draw inferences about population parameters from sample statistics. This particular statistical decision-making tool is hypothesis testing. Hypothesis tests are used to investigate theories concerning population characteristics. Minitab may be used to make inferences about the value of a population parameter.

10.2 Hypotheses and Test Procedures

A hypothesis is a claim or statement either about the value of a single population characteristic or about the values of several population characteristics. The following are examples of legitimate hypotheses:

µ = 1000, where µ is the mean number of characters in an email message
π < .01, where π is the proportion of email messages that are undeliverable.

A criminal trial is a familiar situation in which a choice between two competing claims must be made. In the U.S., the person accused of the crime is initially presumed to be innocent. Only strong evidence to the contrary will cause the presumption of innocence to be rejected in favor of a guilty verdict.

A test of hypotheses is a method for using sample data to decide between two competing claims (hypotheses) about a population characteristic. As in a U.S. judicial proceeding, we shall initially assume that a particular hypothesis, called the null hypothesis and designated as H0, is the correct one. We then consider the evidence (the sample data), and we only reject the null hypothesis in favor of the competing hypothesis, called the alternative hypothesis and designated as Ha, if there is convincing evidence against the null hypothesis.

10.3 Errors in Hypothesis Testing

Once hypotheses have been formulated, we need a method for using sample data to determine whether H0 should be rejected. The decision rule that we use for this purpose is called a test procedure. Just as a jury may reach the wrong verdict in a trial, there is some chance that the use of a test procedure on sample data may lead us to the wrong conclusion.

One erroneous conclusion in a criminal trial is for a jury to convict an innocent person, and another is for a guilty person to be set free. Similarly, there are two different types of errors that might be made when making a decision in a hypothesis-testing problem. One type of error involves rejecting H0 even though H0 is true. The second type of error results from failing to reject H0 when it is false.

Type I error: the error of rejecting H0 when H0 is true
Type II error: the error of failing to reject H0 when H0 is false

No reasonable test procedure comes with a guarantee that neither type of error will be made; this is the price paid for basing an inference on a sample. With any procedure, there is some chance that a Type I error will be made, and there is also some chance that a Type II error will result.

10.4 Large Sample Tests for a Population Proportion

New Minitab Commands

1. Stat>Basic Statistics>1 Proportion - Performs a hypothesis test of the proportion and computes a confidence interval. In this section, you will perform a hypothesis test of the proportion.

Now that some general concepts of hypothesis testing have been introduced, we are ready to turn our attention to the development of procedures for using sample information to choose between the null and alternative hypotheses. There are two possibilities: we will either reject H0 or we will fail to reject H0. The fundamental concept behind hypothesis testing is this: We reject the null hypothesis if the observed sample is very unlikely to have occurred when H0 is true. In this section, we consider testing hypotheses about π, the population proportion.

Perhaps the most common inference of all is an inference concerning a proportion. Generally, we will let π represent the proportion of individuals or objects in a specified population that possess a certain property. A random sample of n individuals or objects is to be selected from the population. The sample proportion

$$p = \frac{x}{n}, \quad \text{where } x = \text{number in the sample that possess the property},$$

is the natural statistic for making inferences about π.


When the sample proportion is to be tested against a hypothesized population proportion, π, we will use the test statistic

$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$

Example 10.11, text
The Problem - Credit Card Debt
The article ’’Credit Cards and College Students: Who Pays, Who Benefits?’’ (J. College Student Development (1998): 50-56) described a study of credit card payment practices of college students. According to the authors of the article, the credit card industry asserts that at most 50% of college students carry a credit card balance from month to month. However, the authors of the article report that, in a random sample of 310 college students, 217 carried a balance each month. Does this sample provide sufficient evidence to reject the industry claim? We will answer this question by carrying out a hypothesis test using a .05 significance level.

Population characteristic of interest:

π = true proportion of college students who carry a balance from month to month

The hypotheses to be tested are

H0 : π = 0.5
Ha : π > 0.5

The null hypothesis will be rejected only if there is convincing evidence that π > 0.5 (that is, strong evidence against H0).

Follow these steps to perform the hypothesis test.

1. Perform the hypothesis test.

Choose Stat>Basic Statistics>1 Proportion. Darken the Summarized data option button. Place 310 in the Number of trials: text box. Place 217 in the Number of successes: text box. Choose Options. Accept the default Level: of confidence 95.0. Enter 0.5 in the Test proportion: text box. Select greater than from the Alternative drop down list box. Place a check in the Use test and interval based on normal distribution: checkbox. Choose OK. Choose OK.

The Minitab Output

Test and CI for One Proportion

Test of p = 0.5 vs p > 0.5

                                    95%
Sample    X    N  Sample p  Lower Bound  Z-Value  P-Value
1       217  310  0.700000     0.657189     7.04    0.000

Figure 10.1

The Session window displays the results of the population proportion test, as shown in Figure 10.1. The Minitab output indicates an observed Z value of 7.04 and a P-value of 0.000 (implying P-value < .05). Since the P-value is less than α, the null hypothesis H0 : π = .5 is rejected at the 0.05 level of significance. We conclude that the proportion of students who carry a credit card balance from month to month is greater than .5. That is, the sample provides convincing evidence that the industry claim is not correct.
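The Z value and P-value in Figure 10.1 follow from the test statistic z = (p − π)/sqrt(π(1 − π)/n) with π set to the hypothesized value 0.5. A minimal sketch in Python (standard library only) is shown below; the function name one_proportion_z_test is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, pi0):
    """Upper-tailed large-sample z test of H0: pi = pi0 against Ha: pi > pi0."""
    p_hat = successes / n
    # The standard error uses the hypothesized proportion, not the sample proportion
    z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value


# Example 10.11: 217 of 310 students carry a balance
z, p_value = one_proportion_z_test(217, 310, pi0=0.5)
alpha = 0.05
print(f"z = {z:.2f}, P-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The P-value is not exactly zero; it is simply smaller than the three decimal places that Minitab displays.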

10.5 Hypothesis Tests for a Population Mean

New Minitab Commands

1. Stat>Basic Statistics>1-Sample t - Performs a one-sample t-test or t-confidence interval for the mean. In this section, you will use this command to perform one-sample t-tests.

The procedures for testing hypotheses about a population mean µ are based on the same results that led to the confidence intervals in Chapter 9. Since it is rarely the case that σ, the population standard deviation, is known, we will focus our attention on the procedure for the case where σ is assumed to be unknown. Here we shall restrict consideration to the case where the parent population is assumed to have a normal distribution.

When x1, x2, ..., xn constitute a random sample of size n from a normal distribution, the probability distribution of the standardized variable

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

is the t distribution with n − 1 degrees of freedom.

In most situations the population standard deviation, σ, is unknown. Minitab can use Student’s t test to make inferences about the value of the population parameter µ when σ is unknown. This procedure may be applied to samples of all sizes, provided that the parent population is approximately normally distributed.

Example 10.14, text
The Problem - Personal Use of Company Technology
One concern employers have about the use of technology is the amount of time that employees spend each day making use of company technology, such as personal phone, e-mail, Internet, and computer games. The Associated Press (Sept. 7, 1999) reported that a management consultant believes that, on average, workers spend 75 minutes a day making personal use of company technology. Suppose that the CEO of a large corporation wanted to determine whether the average amount of time spent in personal use of company technology for her employees was greater than the reported value of 75 minutes. Each person in a random sample of 10 employees was contacted and asked about daily personal use of company technology. (Participants would probably have to be guaranteed anonymity to obtain truthful responses.) The resulting data is given below:

Employee   1   2   3   4   5   6   7   8   9  10
Time      66  70  75  88  69  89  71  71  63  86

Does this data provide evidence that the mean for this company is greater than 75 minutes? To answer this question, let’s carry out a hypothesis test with α = 0.05. The hypotheses to be tested are

H0 : µ = 75
Ha : µ > 75

The null hypothesis will be rejected only if there is convincing evidence that µ > 75 (that is, strong evidence against H0).

This test requires a random sample and either a large sample or a normal population distribution. The given sample was a random sample of employees. Since the sample size is small, we must be willing to assume that the population distribution of times is at least approximately normal. The accompanying normal probability plot, as shown in Figure 10.2, appears to be reasonably straight. Although a boxplot of the data, shown in Figure 10.3, reveals some skewness in the sample, it does not reveal any outliers. Based on these observations, it is plausible that the population distribution is approximately normal, so we will proceed with the t test.

Figure 10.2


Figure 10.3

Follow these steps to perform the hypothesis test.

1. Open the worksheet that contains the Time data.

2. Calculate the test statistic.

Choose Stat>Basic Statistics>1-Sample t. Darken the Samples in columns: option button. Place Time in the Variables: text box. Place 75 in the Test mean: text box. Select Options. Accept the 95.0 default value in the Confidence level: text box. Choose the option of greater than in the Alternative: drop down list box. Choose OK. Choose OK.

The Minitab Output

One-Sample T: Time

Test of mu = 75 vs > 75

                                              95%
Variable   N     Mean   StDev  SE Mean  Lower Bound      T      P
Time      10  74.8000  9.4493   2.9881      69.3224  -0.07  0.526

Figure 10.4

Minitab produces a number of descriptive statistics (N, Mean, StDev, SE Mean), as well as the observed t value of -0.07 and the P-value of 0.526 (implying P-value > 0.05), as shown in Figure 10.4. Since the P-value is greater than α, the null hypothesis H0 : µ = 75 is not rejected. We thus do not have sufficient evidence to suggest that the mean time spent in personal use of company technology is significantly greater than 75 minutes per day for this company.
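The T and P values in Figure 10.4 can be reproduced from t = (x̄ − µ0)/(s/√n). The sketch below uses only the Python standard library. Since the standard library has no t distribution, the upper-tail area is approximated by numerically integrating the t density with Simpson's rule; the helper names t_pdf and t_upper_tail are hypothetical, and this is an illustration rather than Minitab's own algorithm.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t): 0.5 minus the integral of the density from 0 to t (Simpson's rule).

    `steps` must be even. A negative t makes the step size negative, so the
    integral is negative and the result correctly exceeds 0.5.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Example 10.14: minutes of personal use of company technology
time = [66, 70, 75, 88, 69, 89, 71, 71, 63, 86]
mu0 = 75

n = len(time)
se_mean = stdev(time) / sqrt(n)
t_stat = (mean(time) - mu0) / se_mean
p_value = t_upper_tail(t_stat, df=n - 1)  # Ha: mu > 75 is upper-tailed

print(f"t = {t_stat:.2f}, P-value = {p_value:.3f}")
```

Because the sample mean (74.8) is below the hypothesized value, the t statistic is negative and the upper-tailed P-value is larger than 0.5.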


Chapter 11 Comparing Two Populations Or Treatments

11.1 Overview

Many investigations are carried out for the purpose of comparing two populations. For example, one study focused on the question of whether there are differences in consumer perceptions of retail price reductions when there is an extremely low sale price and when there is a moderately low sale price. The study included subjects who were presented with ads containing an unusually deep (about 50%) discount from the expected retail price and subjects who were presented with ads containing a discount of about 25% from the product’s expected retail price. After exposure to the ads, both groups (of subjects) reported on their perceptions of the value of the advertised deal, resulting in data that led the researchers to conclude that there were significant differences between consumers’ responses to the ads.

To reach this conclusion, hypothesis tests that compare the means of two different populations were used. This chapter addresses hypothesis tests and confidence intervals that can be used when comparing two populations or treatments on the basis of means. After reading this chapter you should be able to

1. Perform a Small-Sample Hypothesis Test Concerning the Difference Between Two Normal Population Means for Independent Samples (Example 11.2, text)

2. Obtain a Confidence Interval on the Difference Between Two Normal Population Means (Example 11.4, text)

3. Obtain a Confidence Interval on the Difference Between Two Population Means for Paired Samples (Example 11.8, text)

4. Perform a Large-Sample Hypothesis Test Concerning the Difference Between Two Normal Population Proportions (Example 11.10, text)

Under most conditions it is impossible or unrealistic to study an entire population to obtain the value of the population characteristic of interest. Statisticians have developed techniques that enable us to draw inferences about population parameters from sample statistics. This particular statistical decision-making tool is hypothesis testing. Hypothesis tests are used to investigate theories concerning population characteristics. Minitab may be used to make inferences about the value of a population parameter.

11.2 Independent Samples

New Minitab Commands

1. Stat>Basic Statistics>2-Sample t - Performs an independent two-sample t-test and generates a confidence interval.

a. Samples in one column - Choose if the groups are stacked in the same column, differentiated by subscript values (group codes) in a second column.

b. Samples in different columns - Choose if the groups are in two separate columns. In this section, you will use this command to perform a two-sample t-test where the data is in two separate columns.

Hypothesis Tests on µ1 − µ2

An investigator who wishes to compare two populations is often interested either in estimating the difference between two population means or in testing hypotheses about the difference between the two population means. In the small-sample case, the test procedure that is appropriate requires the assumption that the two population distributions are normal. Normal probability plots can be used to check the plausibility of the normality assumptions.

When the two samples are independently selected from normal population distributions, the standardized variable

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

has approximately a t distribution with

$$df = \frac{(V_1 + V_2)^2}{\dfrac{V_1^2}{n_1 - 1} + \dfrac{V_2^2}{n_2 - 1}}$$

where

$$V_1 = \frac{s_1^2}{n_1} \quad \text{and} \quad V_2 = \frac{s_2^2}{n_2}$$

df should be truncated (rounded down) to an integer.


Example 11.2, text
The Problem - Oral Contraceptive Use and Bone Mineral Density
To assess the impact of oral contraceptive use on bone mineral density (BMD), researchers in Canada carried out a study comparing BMD for women who had used oral contraceptives for at least three months to BMD for women who had never used oral contraceptives (’’Oral Contraceptive Use and Bone Mineral Density in Premenopausal Women,’’ Canadian Medical Association Journal (2001): 1023-1029). Data consistent with summary quantities given in the paper appear in the accompanying table (the actual sample sizes for the study were much larger).

Never used oral contraceptives   0.82  0.94  0.96  1.31  0.94
                                 1.21  1.26  1.09  1.13  1.14

Used oral contraceptives         0.94  1.09  0.97  0.98  1.14
                                 0.85  1.30  0.89  0.87  1.01

The authors of the paper believed that it was reasonable to view the samples used in the study as representative of the two populations of interest: women who used oral contraceptives for at least three months and women who never used oral contraceptives. For purposes of this example, we will assume that it is also justifiable to consider the two samples given here as representative of the populations. We will use the given information and a significance level of 0.05 to determine if there is evidence that women who use oral contraceptives have a lower mean bone mineral density than women who have never used oral contraceptives.

Let µ1 = true mean bone mineral density for women who never used oral contraceptives and
µ2 = true mean bone mineral density for women who used oral contraceptives.
Then µ1 − µ2 = difference in mean bone mineral density.

The hypotheses to be tested are

H0 : µ1 − µ2 = 0
Ha : µ1 − µ2 > 0

The null hypothesis will be rejected only if there is convincing evidence that µ1 − µ2 > 0 (that is, strong evidence against H0).

For the two-sample t test to be appropriate, we must be willing to assume that the two samples can be viewed as independently selected random samples from the two populations of interest. As previously noted, we will assume that this is reasonable. Since the sample sizes are both small, it is also necessary to assume that the bone mineral density distribution is approximately normal for each of these two populations. Since the boxplots of the data, shown in Figure 11.1, are reasonably symmetric and there are no outliers, the assumption of normality is plausible.


Figure 11.1

Follow these steps to perform the hypothesis test.

1. Open the worksheet.

Choose File>Open Worksheet. Select the file ex_11_2.mtp. Choose Open.

2. Calculate the test statistic.

Choose Stat>Basic Statistics>2-Sample t. Darken the option button for Samples in different columns. Place Never in the First: text box. Place Used in the Second: text box. Select Options. Accept the 95.0 default value in the Confidence level: text box. Accept the 0.0 default value in the Test difference: text box. Choose the option of greater than in the Alternative: drop down list box. Choose OK. Choose OK.

The Minitab Output

Two Sample T-Test and CI: Never, Used

Two-sample T for Never vs Used

        N   Mean  StDev  SE Mean
Never  10  1.080  0.160    0.051
Used   10  1.004  0.139    0.044

Difference = mu (Never) - mu (Used)
Estimate for difference: 0.076000
95% lower bound for difference: -0.040500
T-Test of difference = 0 (vs >): T-Value = 1.13  P-Value = 0.136  DF = 17

Figure 11.2



Minitab produces a number of descriptive statistics (N, Mean, StDev, SE Mean), as shown in Figure 11.2, as well as the observed t value of 1.13 and the one-tailed P-value of 0.136 (meaning P-value = 0.136 > α = .05). Since the P-value is greater than α, the null hypothesis H0 : µ1 − µ2 = 0 is not rejected. We thus do not have sufficient evidence to support the claim that mean bone mineral density is lower for women who used oral contraceptives.
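The T-Value and DF in Figure 11.2 can be verified from the two-sample t statistic and the degrees-of-freedom formula given in Section 11.2. The following is a minimal sketch in Python using only the standard library; the function name welch_t is hypothetical, and the critical value 1.740 (upper-tail area 0.05, df = 17) is an assumption read from a t table, used here in place of a P-value.

```python
from math import floor, sqrt
from statistics import mean, variance


def welch_t(sample1, sample2, hypothesized_diff=0.0):
    """Two-sample t statistic (unpooled variances) and truncated df."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # V1 = s1^2 / n1
    v2 = variance(sample2) / n2  # V2 = s2^2 / n2
    t = (mean(sample1) - mean(sample2) - hypothesized_diff) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, floor(df)  # df is truncated (rounded down) to an integer


# Example 11.2: bone mineral density
never = [0.82, 0.94, 0.96, 1.31, 0.94, 1.21, 1.26, 1.09, 1.13, 1.14]
used = [0.94, 1.09, 0.97, 0.98, 1.14, 0.85, 1.30, 0.89, 0.87, 1.01]

t_stat, df = welch_t(never, used)

# Assumption: upper-tail 0.05 critical value for df = 17, from a t table
T_CRIT_05_DF17 = 1.740
reject_h0 = t_stat > T_CRIT_05_DF17

print(f"t = {t_stat:.2f}, df = {df}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Comparing t with the critical value leads to the same decision as comparing the P-value with α.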

11.3 Confidence Intervals on µ1 − µ2

A small-sample confidence interval for µ1 − µ2 can easily be obtained by using Minitab. The two-sample t confidence interval for µ1 − µ2 is

$$\bar{x}_1 - \bar{x}_2 \pm (t \text{ critical value})\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

The critical t value is based on

$$df = \frac{(V_1 + V_2)^2}{\dfrac{V_1^2}{n_1 - 1} + \dfrac{V_2^2}{n_2 - 1}}$$

where

$$V_1 = \frac{s_1^2}{n_1} \quad \text{and} \quad V_2 = \frac{s_2^2}{n_2}$$

df should be truncated (rounded down) to an integer. This confidence interval is valid when both population distributions are normal.

Example 11.4, text
The Problem - Effect of Talking on Blood Pressure
Does talking elevate blood pressure, contributing to the tendency for blood pressure to be higher when measured in a doctor’s office than when measured in a less stressful environment (called the ’’white coat’’ effect)? The article ’’The Talking Effect and ’’White Coat’’ Effect in Hypertensive Patients: Physical Effort or Emotional Content’’ (Behavioral Medicine (2001): 149-157) describes a study in which patients with high blood pressure were randomly assigned to one of two groups. Those in the first group (the talking group) were asked questions about their medical history and about the sources of stress in their lives in the minutes prior to measuring blood pressure. Those in the second group (the counting group) were asked to count aloud from 1 to 100 four times prior to having blood pressure measured. The accompanying data values for diastolic blood pressure (mmHg) are consistent with summary quantities appearing in the paper.

Talking   104  110  107  112  108  103  108  118
Counting  110   96  103   98  100  109   97  105

Subjects were randomly assigned to the two treatments. Since the sample sizes are both small, we must first investigate whether it is reasonable to assume that the diastolic blood pressure distributions are approximately normal for the two treatments. Since the boxplots of the data, shown in Figure 11.3, are reasonably symmetric and there are no outliers, the assumption of normality is plausible.

Figure 11.3

To estimate µ1 − µ2, the difference in mean diastolic blood pressure for the two treatments, we will calculate a 95% confidence interval.

Follow these steps to construct the 95% confidence interval.

1. Choose File>Open Worksheet. Select the file ex_11_4.mtp. Choose Open.

2. Calculate the test statistic.

Choose Stat>Basic Statistics>2-Sample t. Darken the option button for Samples in different columns. Place Talking in the First: text box. Place Counting in the Second: text box. Select Options. Accept the 95.0 default value in the Confidence level: text box. Accept the 0.0 default value in the Test difference: text box. Accept the default value of not equal in the Alternative: drop down list box. Choose OK. Choose OK.


The Minitab Output

Two Sample T-Test and CI: Talking, Counting

Two-sample T for Talking vs Counting

          N    Mean  StDev  SE Mean
Talking   8  108.75   4.74      1.7
Counting  8  102.25   5.39      1.9

Difference = mu (Talking) - mu (Counting)
Estimate for difference: 6.50000
95% CI for difference: (1.01486, 11.98514)
T-Test of difference = 0 (vs not =): T-Value = 2.56  P-Value = 0.024  DF = 13

Figure 11.4


Minitab again produces a number of descriptive statistics (N, Mean, StDev, SE Mean), as shown in Figure 11.4. With 95% confidence, the difference in mean diastolic blood pressure for the two treatments is estimated to be between 1.01 and 11.99. The 95% confidence interval is rather wide, because the two sample variances are large and the sample sizes are small. Notice that the interval does not include 0, so 0 is not one of the plausible values for µ1 − µ2. Based on this computed interval, we estimate that the mean diastolic blood pressure when talking is higher than the mean when counting by somewhere between 1.01 and 11.99 mmHg. The 95% confidence level means that we used a method to produce this estimate that correctly captures the true value of µ1 − µ2 95% of the time in repeated sampling.
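The interval in Figure 11.4 can be reproduced from the two-sample t confidence interval formula of this section. The sketch below is a minimal Python illustration using only the standard library; the critical value 2.160 (two-sided 95%, df = 13) is an assumption taken from a printed t table, so the endpoints agree with Minitab only to about two decimal places.

```python
from math import floor, sqrt
from statistics import mean, variance

# Example 11.4: diastolic blood pressure (mmHg)
talking = [104, 110, 107, 112, 108, 103, 108, 118]
counting = [110, 96, 103, 98, 100, 109, 97, 105]

n1, n2 = len(talking), len(counting)
v1 = variance(talking) / n1    # V1 = s1^2 / n1
v2 = variance(counting) / n2   # V2 = s2^2 / n2

# Degrees of freedom, truncated (rounded down) to an integer
df = floor((v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1)))

# Assumption: two-sided 95% t critical value for df = 13, read from a t table
T_CRIT_95_DF13 = 2.160

estimate = mean(talking) - mean(counting)
margin = T_CRIT_95_DF13 * sqrt(v1 + v2)
lower, upper = estimate - margin, estimate + margin

print(f"df = {df}, estimate for difference = {estimate:.2f}")
print(f"95% CI for difference: ({lower:.2f}, {upper:.2f})")
```

Because both endpoints are positive, the interval excludes 0, matching the conclusion drawn from the Minitab output.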

11.4 Paired Samples

New Minitab Commands

1. Stat>Basic Statistics>Paired t - Performs a paired t-test. This is appropriate for testing the difference between two means when the data are paired and the paired differences follow a normal distribution. Use the Paired t command to compute a confidence interval and perform a hypothesis test of the difference between population means when observations are paired. A paired t-procedure matches responses that are dependent or related in a pairwise manner. This matching allows you to account for variability between the pairs, usually resulting in a smaller error term, thus increasing the sensitivity of the hypothesis test or confidence interval. In this section, you will use this command to obtain a confidence interval on the difference between population means when observations are paired.

Hypothesis Tests on µd

Two samples are said to be independent if the selection of the individuals or objects that make up one of the samples has no bearing on the selection of those in the other sample. In some situations, an experiment with independent random samples is not necessarily the best way to obtain information concerning any possible difference between the populations. For example, suppose an investigator wants to determine whether regular aerobic exercise affects blood pressure. A random sample of people who jog regularly and a second random sample of people who do not exercise regularly are selected independently of one another. The researcher then uses the two-sample t test to conclude that a significant difference exists between the average blood pressures for joggers and nonjoggers. Is it reasonable to think that the difference in mean blood pressure is attributable to jogging? It is known that blood pressure is related to both diet and body weight. Might it not be the case that joggers in the sample tend to be leaner and adhere to a healthier diet than the nonjoggers and that this might account for the observed difference? On the basis of this study, the researcher wouldn’t be able to rule out the possibility that the observed difference in blood pressure is explained by weight differences between the people in the two samples and that aerobic exercise itself has no effect.

One way to avoid this difficulty would be to match subjects by weight. The researcher would find pairs of subjects so that the jogger and nonjogger in each pair were similar in weight (although weights for different pairs might vary widely). The factor weight could then be ruled out as a possible explanation for an observed difference in average blood pressure between the two groups. Matching subjects by weight results in two samples for which each observation in the first sample is coupled in a meaningful way with a particular observation in the second sample. Such samples are said to be paired.

Experiments can be designed to yield paired data in a number of different ways. Some studies involve using the same group of individuals with measurements recorded both before and after some intervening treatment. Others use naturally occurring pairs, such as twins or husbands and wives, and some construct pairs by matching on factors with effects that might otherwise obscure differences (or the lack of them) between the two populations of interest (as might weight in the jogging example). Paired samples often provide more information than would independent samples, because extraneous effects are screened out.

When sample observations from the first population are paired in some meaningful way with the sample observations from the second population, inferences can be based on the differences between the two observations within each sampled pair. The n sample differences can then be regarded as having been selected from a large population of differences. Let µd = mean value of the difference population and σd = standard deviation of the difference population. The relationship between µd and the two individual population means is µd = µ1 − µ2. Therefore, when the samples are paired, inferences about µ1 − µ2 are equivalent to inferences about µd. Since inference about µd can be based on the n observed sample differences, the original two-sample problem becomes a familiar one-sample problem.

When the two samples are paired and it is reasonable to assume that the population of differences is normal, the standardized variable

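$$t = \frac{\bar{x}_d - \mu_d}{s_d/\sqrt{n}}$$

has approximately a t distribution with df = n − 1, where $\bar{x}_d$ and $s_d$ are the mean and standard deviation of the n sample differences. The t confidence interval for µd is

$$\bar{x}_d \pm (t\ \text{critical value})\,\frac{s_d}{\sqrt{n}}$$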

Example 11.8, text
The Problem - Lactic Acid in the Blood After Exercise
The effect of exercise on the amount of lactic acid in the blood was examined in the article "A Descriptive Analysis of Elite-Level Racquetball" (Research Quarterly for Exercise and Sport (1991): 109-114). Eight males were selected at random from those attending a weeklong training camp. Blood lactate levels were measured before and after playing three games of racquetball, as shown in the accompanying table. We will use these data to estimate the mean change in blood lactate level using a 95% confidence interval.

Player        1    2    3    4    5    6    7    8
Before       13   20   17   13   13   16   15   16
After        18   37   40   35   30   20   33   19
Difference   -5  -17  -23  -22  -17   -4  -18   -3

The eight men were selected at random from training camp participants. The accompanying boxplot of the eight sample differences, shown in Figure 11.5, is consistent with a difference population that is approximately normal, so the paired t confidence interval is appropriate.

Figure 11.5


Follow these steps to obtain the confidence interval.

1. Choose File>Open Worksheet. Select the file ex_11_8.mtp. Choose Open.

2. Obtain the confidence interval.
Choose Stat>Basic Statistics>Paired t. Darken the Samples in columns: option button. Place Before in the First sample: text box. Place After in the Second sample: text box. Choose Options. Accept the (default) level of 95.0 in the Confidence level: text box. Accept the (default) level of 0.0 in the Test mean: text box. Accept the (default) option of not equal in the Alternative: drop down dialog box. Choose OK.

The Minitab Output

Paired T-Test and CI: Before, After

Paired T for Before - After

             N      Mean    StDev  SE Mean
Before       8   15.3750   2.4458   0.8647
After        8   29.0000   8.7831   3.1053
Difference   8  -13.6250   8.2797   2.9273

95% CI for mean difference: (-20.5470, -6.7030)
T-Test of mean difference = 0 (vs not = 0): T-Value = -4.65  P-Value = 0.002

Figure 11.6

Minitab again produces a number of descriptive statistics (N, Mean, StDev, SE Mean), as well as the 95% confidence interval for µd, as shown in Figure 11.6. Based on the sample data, we can be 95% confident that the difference in mean blood lactate level is between -20.5470 and -6.7030. That is, we are 95% confident that the mean increase in blood lactate level is somewhere between 6.7030 and 20.5470 after three games of racquetball.
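The paired t interval can also be checked by hand outside Minitab. The following is a minimal Python sketch, not part of the Minitab procedure, that applies the paired t formula to the eight Before/After values from the table. The t critical value 2.365 (two-sided 95%, df = 7) is an assumption taken from a standard t table, and the variable names are illustrative only.

```python
import math
import statistics

# Blood lactate levels for the eight players (Example 11.8).
before = [13, 20, 17, 13, 13, 16, 15, 16]
after = [18, 37, 40, 35, 30, 20, 33, 19]

# Paired differences (Before - After), one per player.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_d = statistics.mean(diffs)    # sample mean of the differences
sd_d = statistics.stdev(diffs)     # sample standard deviation (n - 1 divisor)
se_d = sd_d / math.sqrt(n)         # standard error of the mean difference

# Assumption: two-sided 95% t critical value for df = n - 1 = 7,
# read from a t table rather than computed.
T_CRIT_95_DF7 = 2.365

half_width = T_CRIT_95_DF7 * se_d
ci = (mean_d - half_width, mean_d + half_width)
t_stat = mean_d / se_d             # test statistic for H0: mu_d = 0

print(f"Mean = {mean_d:.4f}  StDev = {sd_d:.4f}  SE Mean = {se_d:.4f}")
print(f"95% CI for mean difference: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"T-Value = {t_stat:.2f}")
```

Because the critical value is rounded to three decimals, the endpoints agree with the Minitab interval only to about two decimal places.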

11.5 Two Population Proportions

New Minitab Commands

1. Stat>Basic Statistics>2 Proportions - Performs a test of two binomial proportions. Use the 2 Proportions command to compute a confidence interval and perform a hypothesis test of the difference between two proportions. For example, suppose you wanted to know whether the proportion of consumers who return a survey could be increased by providing an incentive such as a product sample. You might include the product sample with half of your mailings and see if you have more responses from the group that received the sample than from those who did not. In this section, you will use this command to perform a hypothesis test on the difference between two population proportions.

Hypothesis Tests on π1 − π2

Many investigations are carried out to compare the proportion of successes in one population (or resulting from one treatment) to the proportion of successes in a second population (or from a second treatment). When comparing two populations or treatments on the basis of "success" proportions, it is common to focus on the quantity π1 − π2, the difference between the two proportions. Since p1 provides an estimate of π1 and p2 provides an estimate of π2, the obvious choice for an estimate of π1 − π2 is p1 − p2.

When the two random samples are selected independently of one another and both samples are large, we will use the test statistic

$$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}, \quad \text{where } p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}.$$

A large-sample confidence interval for π1 − π2 can easily be obtained by using Minitab. The confidence interval for π1 − π2 is

$$(p_1 - p_2) \pm (z\ \text{critical value})\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}.$$

Example 11.10, text
The Problem - AIDS and Housing Availability
The authors of the article "Accommodating Persons With AIDS: Acceptance and Rejection in Rental Situations" (J. Applied Social Psychology (1999): 261-270) state that even though landlords participating in a telephone survey indicated that they would generally be willing to rent to persons with AIDS, they wondered whether this was true in actual practice. To investigate, two random samples of 80 advertisements for rooms for rent were independently selected from newspaper advertisements in three large cities. An adult male caller responded to each ad in the first sample of 80 and inquired about the availability of the room; he was told that the room was still available in 61 of these calls. The same caller also responded to each ad in the second sample. In these calls, the caller indicated that he was currently receiving some treatment for AIDS and was about to be released from the hospital and would require a place to live. The caller was told that a room was available in 32 of these calls. Based on this information, the authors concluded that "reference to AIDS substantially decreased the likelihood of a room being described as available." Do the data support this conclusion? Let’s carry out a hypothesis test with α = 0.01.

Follow these steps to perform the hypothesis test.

1. Start Minitab.

2. Calculate the test statistic.
Choose Stat>Basic Statistics>2 Proportions. Darken the Summarized data: option button. Place 80 in the First Trials: text box. Place 61 in the First Events: text box. Place 80 in the Second Trials: text box. Place 32 in the Second Events: text box. Choose Options. Accept the (default) level of 95.0 in the Confidence level: text box. Accept the (default) level of 0.0 in the Test difference: text box. Choose greater than from the Alternative: drop down list box. Place a check in the Use pooled estimate of p for the test: checkbox. Choose OK. Choose OK.

The Minitab Output

Test and CI for Two Proportions

Sample   X   N  Sample p
1       61  80  0.762500
2       32  80  0.400000

Difference = p (1) - p (2)
Estimate for difference: 0.3625
95% CI for difference: (0.220302, 0.504698)
Test for difference = 0 (vs not = 0): Z = 5.00  P-Value = 0.000

Figure 11.6

Minitab produces a number of descriptive statistics (X, N, Sample p), as shown in Figure 11.6, as well as the observed z value of 5.00 and the p-value of 0.000 (meaning p < α = 0.01). Since p is less than α, the null hypothesis H0: π1 − π2 = 0 is rejected in favor of Ha: π1 − π2 > 0. We thus have sufficient evidence to suggest that the proportion of rooms reported as available is smaller with the AIDS reference than without it. This supports the claim made by the authors of the article.
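The test statistic and the confidence interval can be reproduced from the four summary counts. The sketch below is a hand check in Python rather than a Minitab feature; the variable names are illustrative. It evaluates the z statistic twice, once with the pooled estimate pc from the formula given earlier and once with the separate sample proportions used in the confidence interval. Under these assumptions the unpooled statistic is the one that rounds to 5.00, while the pooled statistic comes out near 4.65; both lead to the same decision at α = 0.01.

```python
import math
from statistics import NormalDist

# Summary counts from Example 11.10.
x1, n1 = 61, 80   # rooms reported available, no AIDS reference
x2, n2 = 32, 80   # rooms reported available, AIDS reference

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Pooled (combined) estimate of the common proportion under H0.
pc = (n1 * p1 + n2 * p2) / (n1 + n2)
se_pooled = math.sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
z_pooled = diff / se_pooled

# Unpooled standard error, as used in the confidence interval formula.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_unpooled = diff / se_unpooled

std_normal = NormalDist()
# Upper-tail p-value for Ha: pi1 - pi2 > 0, based on the pooled statistic.
p_upper = 1 - std_normal.cdf(z_pooled)

# Large-sample 95% confidence interval for pi1 - pi2.
z_crit = std_normal.inv_cdf(0.975)
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

print(f"Estimate for difference: {diff:.4f}")
print(f"z (pooled) = {z_pooled:.2f}, z (unpooled) = {z_unpooled:.2f}")
print(f"upper-tail p-value (pooled) = {p_upper:.6f}")
print(f"95% CI for difference: ({ci[0]:.6f}, {ci[1]:.6f})")
```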


Chapter 12
The Analysis of Categorical Data And Goodness of Fit Tests

12.1 Overview

Most of the techniques presented in earlier chapters are designed for numerical data. It is often the case, however, that information is collected on categorical variables such as political affiliation, sex, or college major. As with numerical data, categorical data sets can be univariate (consisting of observations on a single categorical variable), bivariate (observations on two categorical variables), or even multivariate. Minitab doesn’t easily do goodness-of-fit tests; hence that test is not addressed in this chapter. After reading this chapter you should be able to

1. Perform a Chi-Squared Test for Homogeneity and Independence in a Two-Way Table (Example 12.7, text)

12.2 Tests for Homogeneity and Independence

New Minitab Commands

1. Stat>Tables>Chisquare Test - Does a chi-square test of association (non-independence) for the table of counts given in the specified columns. In this section, you will use this command to perform a chi-square test to determine if the category proportions are the same for all of the populations.

Data resulting from observations made on two different categorical variables can also be summarized using a tabular format. As an example, suppose that residents of a particular city can watch national news on affiliate stations of ABC, CBS, NBC, or PBS (the public television network). A researcher wishes to know whether there is any relationship between political philosophy (liberal, moderate, or conservative) and preferred news program among those residents who regularly watch the national news. Let x denote the variable political philosophy and y the variable preferred network. A random sample of 300 regular watchers is to be selected, and each one will be asked for his or her x and y values. The data set is bivariate and might initially be displayed as follows:

Observation   x value        y value
1             Liberal        CBS
2             Conservative   ABC
3             Conservative   PBS
...           ...            ...
300           Liberal        PBS


Bivariate categorical data of this sort can most easily be summarized by constructing a two-way frequency or contingency table. This contingency table consists of a row for each possible x category and a column for each possible y category.

Comparing Two or More Populations

When the value of a categorical variable is recorded for members of separate random samples obtained from each population under study, the central issue is whether the category proportions are the same for all of the populations. The test procedure uses a chi-squared statistic that compares the observed counts to those that would be expected if there were no differences between the populations.

Example 12.7, text
The Problem - Oral Contraceptive Use and Bone Mineral Density
The paper "Factors Associated with Sexual Risk-Taking Behaviors Among Adolescents" (J. Marriage and Family (1994): 622-632) examined the relationship between gender and contraceptive use by sexually active teens. Each person in a sample of sexually active teens was classified according to gender and contraceptive use (with three categories: rarely or never use, use sometimes or most of the time, and always use), resulting in a 3×2 contingency table. Data consistent with percentages given in the paper appear in the table.

                         Gender
Contraceptive use        Female   Male
Rarely/never                210    350
Sometimes/most times        190    320
Always                      400    530

The authors were interested in determining whether there is an association between gender and contraceptive use. Using a significance level of 0.05, we will test

H0: Gender and contraceptive use are independent.
Ha: Gender and contraceptive use are not independent.

The null hypothesis will be rejected only if there is convincing evidence that gender and contraceptive use are not independent (that is, strong evidence against H0).

Follow these steps to test the hypothesis H0: Gender and contraceptive use are independent at the α = .05 level of significance.

1. Open the worksheet.
Choose File>Open Worksheet. Select the file ex_12_7.mtp. Choose Open.

2. Calculate χ2.
Choose Stat>Tables>Chisquare Test (Table in Worksheet). Place Female and Male in the Columns containing the table: text box. Choose OK.


The Minitab Output

Chi-Square Test: Female, Male

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

        Female     Male  Total
    1      210      350    560
        224.00   336.00
         0.875    0.583

    2      190      320    510
        204.00   306.00
         0.961    0.641

    3      400      530    930
        372.00   558.00
         2.108    1.405

Total      800     1200   2000

Chi-Sq = 6.572, DF = 2, P-Value = 0.037

Figure 12.1

The Minitab output (Figure 12.1) shows the observed counts, the expected counts, the marginal totals, the calculated chi-squared statistic (χ2 = 6.572), the degrees of freedom (df = 2) and the p-value (p = 0.037). Since the p-value (p = 0.037) ≤ α = 0.05, H0 is rejected. There is sufficient evidence to indicate an association between gender and contraceptive use.
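The expected counts and the chi-squared statistic in Figure 12.1 follow directly from the row and column totals. The following Python sketch is a hand check, not a Minitab command. It computes each expected count as (row total)(column total)/(grand total) and sums the contributions (observed − expected)²/expected. For the p-value it relies on the fact that, for df = 2 only, the upper-tail area of the chi-squared distribution is exp(−χ²/2); a table with other dimensions would need a chi-squared table or a statistics library instead.

```python
import math

# Observed counts from Example 12.7: rows are contraceptive-use
# categories, columns are Female and Male.
observed = [
    [210, 350],   # Rarely/never
    [190, 320],   # Sometimes/most times
    [400, 530],   # Always
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = []
chi_sq = 0.0
for i, row in enumerate(observed):
    expected_row = []
    for j, obs in enumerate(row):
        exp_count = row_totals[i] * col_totals[j] / grand_total
        expected_row.append(exp_count)
        chi_sq += (obs - exp_count) ** 2 / exp_count
    expected.append(expected_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid only for df = 2: P(chi-square > x) = exp(-x / 2).
if df != 2:
    raise ValueError("this shortcut for the p-value requires df = 2")
p_value = math.exp(-chi_sq / 2)

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```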


Chapter 13
Simple Linear Regression And Correlation: Inferential Methods

13.1 Overview

Regression and correlation were introduced in Chapter 4 as techniques for describing and summarizing bivariate data consisting of (x, y) pairs. For example, we considered a regression of the dependent variable y = unemployment expenditure as a percentage of gross domestic product for the 22 OECD countries on the independent variable x = unemployment rate. The equation of the least squares line was ŷ = 0.104 + 0.169x. When x = 7 is substituted into this equation, the number 1.286 results. This number can be interpreted either as a point estimate of the average expenditure for all OECD countries that have an unemployment rate of 7.0 or as a point prediction of the expenditure for a single country that has an unemployment rate of 7.0. In this chapter, we address inferential methods for this type of data, including a confidence interval (interval estimate) for a mean y value, and a prediction interval for a single y value. After reading this chapter you should be able to

1. Obtain the Least Squares Regression Line (Example 13.2, text)

2. Obtain a Confidence Interval Concerning β, the Slope (Example 13.4, text)

3. Check the Model Adequacy (Example 13.6, text)

4. Obtain a Confidence Interval for a Mean y value (Example 13.11, text)

13.2 The Simple Linear Regression Model

New Minitab Commands (and some Minitab commands used previously)

1. Stat>Regression>Regression - Performs simple, polynomial, and multiple regression using the least squares method. In this section, you will use this command to determine the least squares equation between two variables.

a. Options - Permits various options: weighted regression, fit the model with/without an intercept, calculate variance inflation factors and the Durbin-Watson statistic, and calculate and store prediction intervals for new observations. In this section, you will use this command to make predictions using the least squares regression line.

2. Stat>Regression>Fitted Line Plot - Fits a simple linear or polynomial (second or third order) regression model and plots a regression line through the actual data or the log10 of the data. The fitted line plot shows you how closely the actual data lie to the fitted regression line. In this section, you will obtain a fitted line plot to illustrate how the estimated relationship fits the data in a simple linear regression model.

A deterministic relationship is one in which the value of y is completely determined by the value of an independent variable x. Such a relationship can be described using traditional mathematical notation such as y = f(x), where f(x) is a specified function of x. For example, y = 3 + 2x is a deterministic relationship. However, the variables of interest are often not deterministically related. For example, the value of y = unemployment expenditure as a percentage of gross domestic product is certainly not determined solely by x = unemployment rate.

A description of the relationship between two variables x and y that are not deterministically related can be given by specifying a probabilistic model. The general form of an additive probabilistic model is y = f(x) + e, where e = random deviation. The simple linear regression model is a special case of the general probabilistic model in which the deterministic function f(x) is linear. The simple linear regression model assumes that there is a line with slope β and vertical or y intercept α, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

y = α + βx + e.

Example 13.2, text
The Problem - Mother’s Age and Babies’ Birth Weight
Medical researchers have noted that adolescent females are much more likely to deliver low birth weight babies than are adult females. Because low birth weight babies have higher mortality rates, there have been a number of studies examining the relationship between birth weight and mother’s age for babies born to young mothers. One such study is described in the article "The Risk of Teen Mothers Having Low Birth Weight Babies: Implications of Recent Medical Research for School Health Personnel" (J. of School Health (1998): 271-274). The accompanying data on x = maternal age (years) and y = birth weight of baby (grams) are consistent with summary values given in the referenced article and also with data published by the National Center for Health Statistics.

Observation     1     2     3     4     5
x              15    17    18    15    16
y            2289  3393  3271  2648  2897

Observation     6     7     8     9    10
x              19    17    16    18    19
y            3327  2970  2535  3138  3573

Follow these steps to determine the least squares equation between y = birth weight of baby (grams) and x = maternal age (years) and to predict the birth weight of baby (grams) when the maternal age (years) is 18.

1. Open the worksheet.
Choose File>Open Worksheet. Select the file ex_13_2.mtp. Choose Open.

2. Create a scatterplot.
Choose Graph>Scatterplot. Select the Simple scatterplot from the dialog box choices. Choose OK. Place y = baby weight in the Y Graph variables: text box. Place x = mother age in the X Graph variables: text box.

The Minitab Output

Figure 13.1

The scatter plot (Figure 13.1) strongly suggests the appropriateness of the simple linear regression model.

3. Obtain the regression equation.
Choose Stat>Regression>Regression. Place y = baby weight in the Response: text box. Place x = mother age in the Predictors: text box.

4. Make a prediction.
Choose Options. Place 18 in the Prediction intervals for new observations: text box. Choose OK. Choose OK.

The Minitab Output

Figure 13.2

The Minitab output (Figure 13.2) indicates the estimated regression line, y, baby weight = - 1163 + 245 x, mother age (ŷ = −1163 + 245x), as well as a point estimate (3249.3) of the average birth weight of babies born to 18-year-old mothers.
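The least squares line and the point estimate can be verified from the ten (x, y) pairs in the data table. The following Python sketch is a hand check under the usual least squares formulas b = Sxy/Sxx and a = ȳ − b x̄; the function name predict is illustrative and not a Minitab command.

```python
# Maternal age (years) and birth weight (grams) from Example 13.2.
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # slope of the least squares line
a = y_bar - b * x_bar    # intercept of the least squares line


def predict(age):
    """Point estimate of mean birth weight (grams) at the given maternal age."""
    return a + b * age


print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"Predicted birth weight at age 18: {predict(18):.1f} grams")
```

The slope and intercept round to the 245 and −1163 reported by Minitab, and the prediction at x = 18 rounds to the point estimate quoted above.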


13.3 The Slope of the Population Regression Line

The slope β in the simple regression model is the average or expected change in the dependent variable y associated with a one-unit increase in the value of the independent variable x. For example, consider x = storage temperature (°C) and y = shelf-life of the medication. Assuming that the simple linear regression model is appropriate for the population of medications, β would be the average decrease in shelf-life associated with a 1°C increase in temperature.

Since the value of β is almost always unknown, it will have to be estimated from the sample data. The slope b of the least squares line gives us a point estimate of β. As with any point estimate, though, it is desirable to have some indication of how accurately b estimates β. To proceed further, we need some information about the sampling distribution of b.

Properties of the Sampling Distribution of b

1. The mean value of b is β. That is, $\mu_b = \beta$.

2. The standard deviation of the statistic b is $\sigma_b = \dfrac{\sigma}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$.

3. The statistic b has a normal distribution.

The estimated standard deviation of b is

$$s_b = \frac{s_e}{\sqrt{S_{xx}}}.$$

The probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with (n − 2) degrees of freedom. The t variable in the preceding box can be employed to provide a confidence interval (interval estimate) for β.

When the basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

$$b \pm (t\ \text{critical value}) \cdot s_b$$

where the t critical value is based on df = n − 2. Appendix Table III, text, gives critical values corresponding to the most frequently used confidence levels.

Example 13.4, text
The Problem - Athletic Performance and Cardiovascular Fitness
Is cardiovascular fitness (as measured by time to exhaustion running on a treadmill) related to an athlete’s performance in a 20-km ski race? The accompanying data on x = treadmill time to exhaustion (min) and y = 20-km ski time (min) were taken from the article "Physiological Characteristics and Performance of Top U.S. Biathletes" (Medicine and Science in Sports and Exercise (1995): 1302-1310):

x     7.7   8.4   8.7   9.0   9.6   9.6
y    71.0  71.4  65.0  68.7  64.4  69.4

x    10.0  10.2  10.4  11.0  11.7
y    63.0  64.6  66.9  62.6  61.7

Follow these steps to determine the least squares equation between ski time and treadmill time and to calculate a 95% confidence interval for β, the slope of the regression line.

1. Open the worksheet.
Choose File>Open Worksheet. Select the file ex13_4.mtp. Choose Open.

2. Create a scatterplot.
Choose Graph>Scatterplot. Select the Simple scatterplot. Choose OK. Place ski time in the Y Graph variables: text box. Place treadmill time in the X Graph variables: text box. Choose OK.

The Minitab Output

Figure 13.3

The scatter plot, as shown in Figure 13.3, shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, the simple linear regression model seems appropriate.

3. Obtain the regression equation.
Choose Stat>Regression>Regression. Place ski time in the Response: text box. Place treadmill in the Predictors: text box. Choose OK.

The Minitab Output

Figure 13.4

The Minitab output, as shown in Figure 13.4, indicates the estimated regression line, ski time = 88.8 - 2.33 treadmill time (ŷ = 88.8 − 2.33x).

Calculation of the 95% confidence interval for β requires a t critical value based on df = n − 2 = 11 − 2 = 9, which (from Appendix Table III, text) is 2.26. The resulting interval is then

$$b \pm (t\ \text{critical value}) \cdot s_b = -2.3335 \pm (2.26)(0.5911) = (-3.671, -0.999).$$

We interpret this interval as follows: Based on the sample data, we are 95% confident that the true average decrease in ski time associated with a one-minute increase in treadmill time is between 1 and 3.7 minutes.
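The quantities b and sb used in this interval can be recomputed from the eleven data pairs. The following Python sketch is a hand check, not Minitab output. It assumes the t critical value 2.262 (two-sided 95%, df = 9) from a standard t table, which is slightly more precise than the 2.26 quoted above, so the endpoints agree with the interval in the text only to about two decimal places.

```python
import math

# Treadmill time to exhaustion (min) and 20-km ski time (min), Example 13.4.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # estimated slope
a = y_bar - b * x_bar    # estimated intercept

# Residual sum of squares and the estimate s_e of sigma (df = n - 2).
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(ss_resid / (n - 2))

# Estimated standard deviation of the slope.
s_b = s_e / math.sqrt(s_xx)

# Assumption: two-sided 95% t critical value for df = 9 from a t table.
T_CRIT_95_DF9 = 2.262

lower = b - T_CRIT_95_DF9 * s_b
upper = b + T_CRIT_95_DF9 * s_b

print(f"y-hat = {a:.1f} + ({b:.4f}) x")
print(f"s_b = {s_b:.4f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```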

13.4 Checking Model Adequacy



a. Graphs - Displays residual plots. You do not have to store the residuals in order to produce these plots. In this section, you will use this command to construct a standardized residual plot to determine the adequacy of the regression model.

The simple linear regression model equation is

y = α + βx + e,

where e represents the random deviation of an observed y value from the population regression line α + βx. Key assumptions for the inferential methods presented in previous sections are

1. e has a normal distribution.
2. The standard deviation of e is σ, which does not depend on x.

Easily applied methods for checking the validity of the assumptions for the simple linear regression model are very desirable. When all the model assumptions are satisfied, the mean value of any residual is zero. Any observation that produces a very large positive or negative residual needs to be examined carefully for any anomalous circumstances, such as a recording error or exceptional experimental conditions. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.

Example 13.6, text
The Problem - Landslides and Timber Growth
Landslides are common events in tree-growing regions of the Pacific Northwest, so their effect on timber growth is of special concern to foresters. The paper "Effects of Landslide Erosion on Subsequent Douglas Fir Growth and Stocking Levels in the Western Cascades, Oregon" (Soil Science Soc. of Amer. J. (1984): 667-671) reported on the results of a study in which growth in a landslide area was compared with growth in a previously clear-cut area. We present data on clear-cut growth, with x = tree age (years) and y = 5-year height growth (cm).

Tree Age (x)          5    9    9   10   10   11   11   12
Height Growth (y)    70  150  260  230  255  165  225  340

Tree Age (x)         13   13   14   14   15   15   18   18
Height Growth (y)   305  335  290  340  225  300  380  400

Follow these steps to:
(a) construct a scatter plot for this data,
(b) determine the estimated linear regression equation for this data, and
(c) construct a standardized residual plot to determine the adequacy of the simple linear regression model.


2. Create a scatterplot.
Choose Graph>Scatterplot. Select the Simple scatterplot. Choose OK. Place growth in the Y Graph variables: text box. Place age in the X Graph variables: text box. Choose OK.

The Minitab Output

Figure 13.5

The scatter plot, as shown in Figure 13.5, is consistent with the assumptions of the simple linear regression model.

3. Obtain the regression equation.
Choose Stat>Regression>Regression. Place growth in the Response: text box. Place age in the Predictors: text box.

4. Store the fits, residuals and standardized residuals.
Choose Storage. Under Characteristics of Estimated Equation, place a check in the Fits checkbox. Under Diagnostic Measures, place a check in the Standardized residuals checkbox. Choose OK. Choose OK.

The Minitab Output

Figure 13.6

The Minitab output, as shown in Figure 13.6, indicates the estimated regression line, growth = 4.35 + 21.322 age (ŷ = 4.35 + 21.322x).

5. Obtain the normal scores of the standardized residuals.
Choose Calc>Calculator. Type normal scores in the Store result in variable: text box. Select Statistics from the Function drop down list box. Select Normal scores from the Statistics list box. Choose Select. Replace (number) in NSCOR(number) with SRES1. Choose OK.

6. Create a scatterplot of the standardized residuals.
Choose Graph>Scatterplot. Select the Simple scatterplot. Choose OK. Place SRES1 in the Y Graph variables: text box. Place age in the X Graph variables: text box. Choose OK.


The Minitab Output

Figure 13.7

The Minitab output, as shown in Figure 13.7, shows a normal probability plot of the standardized residuals. The plot casts no doubt on the normality assumption.
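To see what a standardized residual is, it can help to compute one set by hand. The following Python sketch fits the least squares line to the sixteen (age, growth) pairs and divides each residual by its estimated standard deviation, se·sqrt(1 − h), where h = 1/n + (x − x̄)²/Sxx is the leverage of the observation. This is the usual textbook definition; that it matches the values Minitab stores in SRES1 is an assumption here, and the cutoff of 2 used to flag large residuals is only a common rule of thumb.

```python
import math

# Tree age (years) and 5-year height growth (cm), Example 13.6.
age = [5, 9, 9, 10, 10, 11, 11, 12, 13, 13, 14, 14, 15, 15, 18, 18]
growth = [70, 150, 260, 230, 255, 165, 225, 340,
          305, 335, 290, 340, 225, 300, 380, 400]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(growth) / n

s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, growth))

b = s_xy / s_xx          # estimated slope
a = y_bar - b * x_bar    # estimated intercept

# Ordinary residuals and the estimate s_e of sigma (df = n - 2).
residuals = [y - (a + b * x) for x, y in zip(age, growth)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each observation and the resulting standardized residuals.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in age]
std_resid = [e / (s_e * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

print(f"growth-hat = {a:.2f} + {b:.3f} age")
for x, r in zip(age, std_resid):
    print(f"age = {x:2d}  standardized residual = {r:6.2f}")

# Rule of thumb (an assumption, not from the text): flag |value| > 2.
large = [(x, round(r, 2)) for x, r in zip(age, std_resid) if abs(r) > 2]
print("Flagged observations:", large)
```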

13.5 Inferences Based on the Estimated Regression Line



a. Options - Allows you to perform weighted regression, fit the model with/without an intercept, calculate variance inflation factors and the Durbin-Watson statistic, and calculate and store confidence and prediction intervals for new observations. In this section, you will use this command to obtain confidence and prediction intervals based on the estimated regression line.

The number obtained by substituting a particular x value x* into the equation of the estimated regression line has two different interpretations. It is a point estimate of the average y value when x = x*, and it is also a point prediction of a single y value to be observed when x = x*. Properties of the sampling distribution of a + bx* are used to obtain both a confidence interval formula for α + βx* and a prediction interval formula for a particular y observation. The width of the corresponding interval conveys information about the precision of the estimate or prediction.

When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the average y value when x has a value x*, is

$$a + bx^* \pm (t\ \text{critical value}) \cdot s_{a+bx^*},$$

where the t critical value is based on df = n − 2.

When the basic assumptions of the simple linear regression model are met, the prediction interval for y*, a single y observation when x has a value x*, is

$$a + bx^* \pm (t\ \text{critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2},$$

where the t critical value is based on df = n − 2.

The prediction interval and confidence interval are centered at exactly the same place, a + bx*. The addition of $s_e^2$ under the square root symbol makes the prediction interval wider - often substantially so - than the confidence interval.

Example 13.11, text
The Problem - Shark Length and Jaw Width
Physical characteristics of sharks are of interest to surfers and scuba divers, as well as marine researchers. The accompanying data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles in the magazines Skin Diver and Scuba News:

x    18.7  12.3  18.6  16.4  15.7  18.3  14.6  15.8  14.9  17.6  12.1
y    17.5  12.3  21.8  17.2  16.2  19.9  13.9  14.7  15.1  18.5  12.0

x    16.4  16.7  17.8  16.2  12.6  17.8  13.8  12.2  15.2  14.7  12.4
y    13.8  15.2  18.2  16.7  11.6  17.4  14.2  14.8  15.9  15.3  11.9

x    13.2  15.8  14.3  16.6   9.4  18.2  13.2  13.6  15.3  16.1  13.5
y    11.6  14.3  13.3  15.8  10.2  19.0  16.8  14.2  16.9  16.0  15.9

x    19.1  16.2  22.8  16.8  13.6  13.2  15.7  19.7  18.7  13.2  16.8
y    17.9  15.7  21.2  16.3  13.0  13.3  14.3  21.3  20.8  12.2  16.9


Follow these steps to:
(a) construct a scatter plot (fitted line plot) for this data,
(b) determine the estimated linear regression equation for this data, and
(c) construct 90% confidence and prediction intervals for 15-foot-long sharks.

2. Create a scatterplot.
Choose Graph>Scatterplot. Select the Simple scatterplot. Choose OK. Place JawWidth(y) in the Y Graph variables: text box. Place Length(x) in the X Graph variables: text box. Choose OK.

The Minitab Output

Figure 13.8The scatter plot, as shown in Figure 13.8, shows a linear pattern and is consistentwith use of the simple linear regression model.3. Obtain the regression equation.

Choose Stat>Regression>Regression. Place JawWidth(y) in theResponse: text box. Place Length(x) in the Predictors: text box. Choose OK.

76

The Minitab Output

Figure 13.9
The Minitab output, as shown in Figure 13.9, indicates the estimated regression line, JawWidth(y) = 0.69 + 0.963 Length(x) (ŷ = 0.69 + 0.963x). The column labeled Coef contains the computed values of a (a = 0.688) and b (b = 0.96345). The estimated standard deviation of b, sb, appears under the column labeled StDev (sb = 0.08228). The value of the test statistic, the t ratio for the model utility test, is 11.71 and is found under the column labeled T. Since the associated p-value is less than .001, H0 is rejected. We conclude that there is a useful linear relationship between JawWidth(y) and Length(x).

The next line in the Minitab output, the line with S, R-Sq and R-Sq(adj), includes se (se = 1.376) and r^2 (r^2 = 76.6%). The Analysis of Variance table includes SSError, or SSResid (SSResid = 79.49), and SSTo (SSTo = 339.02). The coefficient of determination, r^2, has a value of 76.6%: approximately 76.6% of the observed variation in jaw widths can be attributed to the probabilistic linear relationship with length. The magnitude of a typical sample deviation from the least squares line is about 1.376, which is reasonably small in comparison to the y values themselves.

4. Obtain the confidence interval and prediction interval for 15-foot-long sharks.

Choose Stat>Regression>Regression. Place JawWidth(y) in the Response: text box. Place Length(x) in the Predictors: text box. Choose Options. Place 15 in the Prediction intervals for new observations: text box. Place 90 in the Confidence level: text box. Choose OK. Choose OK.


The Minitab Output

Figure 13.10
The Minitab output, as shown in Figure 13.10, contains a point estimate (Fit = 15.140) of the mean jaw width for 15-foot-long sharks, along with the standard error of the fitted value (SE Fit = 0.213).

The 90% confidence interval (90.0% CI) for the mean jaw width for sharks whose length is 15 ft is between 14.710 and 15.569 inches. As with all confidence intervals, the 90% confidence level means that we have used a method to construct this interval estimate that has a 10% error rate.
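The summary quantities quoted from Figure 13.9 are related by simple formulas, so they can be cross-checked by hand. The following is a minimal sketch in Python (not a Minitab command); it assumes only the values quoted in the discussion of Figure 13.9 and the standard definitions r^2 = 1 - SSResid/SSTo, se = sqrt(SSResid/(n - 2)), and t = b/sb. The variable names are our own.

```python
import math

# Values quoted from the discussion of Figure 13.9.
n = 44             # number of sharks in the sample
ss_resid = 79.49   # SSResid (error sum of squares)
ss_total = 339.02  # SSTo (total sum of squares)
b = 0.96345        # estimated slope
s_b = 0.08228      # estimated standard deviation of b

r_squared = 1 - ss_resid / ss_total   # coefficient of determination
s_e = math.sqrt(ss_resid / (n - 2))   # typical deviation from the fitted line
t_ratio = b / s_b                     # model utility test statistic

print(f"r^2 = {r_squared:.3f}")  # prints r^2 = 0.766
print(f"s_e = {s_e:.3f}")        # prints s_e = 1.376
print(f"t   = {t_ratio:.2f}")    # prints t   = 11.71
```

The three results agree with the values of R-Sq, S, and T reported in the text.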


Chapter 14
Multiple Regression Analysis

14.1 Overview

The general objective of regression analysis is to establish a useful relationship between a dependent variable y and one or more independent (predictor) variables. The simple linear regression model y = α + βx + e has been used successfully by many investigators in a wide variety of disciplines to relate y to a single predictor variable x. Most practical applications of regression analysis utilize models that are more complex than the simple linear regression model; in most problems more than one independent variable is needed in the regression model. For example, some variation in house prices may be attributed to the size of the house, but knowledge of house size alone would not enable one to accurately predict a home's value. Price is also determined to some extent by other variables, such as the number of bathrooms, the number of bedrooms, and the age of the home.

In this chapter, we extend the regression methodology developed in previous chapters to multiple regression models. After reading this chapter you should be able to

1. Fit a Model and Assess Its Utility (Example 14.6, text)

2. Select Variables to Include in a Model Using a Best Subsets Procedure (Example 14.17, text) and (Example 14.18, text)

14.2 Multiple Regression Models

A general additive multiple regression model, which relates a dependent variable y to k predictor variables x1, x2, ..., xk, is given by the model equation

y = α + β1x1 + β2x2 + ... + βkxk + e.

The random deviation e is assumed to be normally distributed with mean value 0 and variance σ^2 for any values of x1, x2, ..., xk. This implies that for fixed x1, x2, ..., xk values, y has a normal distribution with variance σ^2 and

(mean y value for fixed x1, x2, ..., xk values) = α + β1x1 + β2x2 + ... + βkxk.

The βi's are called population regression coefficients; each βi can be interpreted as the true average change in y when the predictor xi increases by one unit and the values of all the other predictors remain fixed. The deterministic portion α + β1x1 + β2x2 + ... + βkxk is called the population regression function.

A Special Case: Polynomial Regression
The kth-degree polynomial regression model

y = α + β1x + β2x^2 + ... + βkx^k + e

is a special case of the general multiple regression model with x1 = x, x2 = x^2, ..., xk = x^k. The population regression function (mean value of y for fixed values of the predictors) is α + β1x + β2x^2 + ... + βkx^k. The most important special case other than simple linear regression (k = 1) is the quadratic regression model

y = α + β1x + β2x^2 + e.

This model replaces the line of mean values α + βx in simple linear regression with a parabolic curve of mean values α + β1x + β2x^2. If β2 > 0, the curve opens upward, whereas if β2 < 0, the curve opens downward.

Interaction between Variables
The population regression function

(mean y value for fixed x1, x2, ..., xk values) = α + β1x1 + β2x2 + ... + βkxk

exhibits a characteristic of all first-order models. If you graph the mean value function against one predictor, say x1, for fixed values of the other variables, the graph will be a straight line. If you repeat the process for other values of the fixed independent variables, the lines will be parallel. This indicates that the effect on the mean y value of a change in x1 is independent of the other variables in the model. When this situation occurs, we say that the independent variables in the model do not interact.

If a term involving the cross-product x1x2 is added to the model, the effect on the mean y value of a change in x1 now depends on the value of x2. When this situation occurs, we say x1 and x2 interact. If you now graph the mean value function against x1 for fixed values of the other variables and repeat the process, the lines may no longer be parallel. When the slopes are different, the variables are said to interact.

To create an interaction variable using Minitab, use the Calc>Calculator command, placing x1x2 in the Store result in variable: text box. Place x1 * x2 in the Expression: text box. Choose OK.

If the change in the mean y values associated with a one-unit increase in one independent variable depends on the values of a second independent variable, there is interaction between these two variables. When the variables are denoted by x1 and x2, such interaction can be modeled by including x1x2, the product of the variables that interact, as a predictor variable.

The general equation for a multiple regression model based on two independent variables x1 and x2 that also includes an interaction predictor is

y = α + β1x1 + β2x2 + β3x1x2 + e.
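To see what the interaction term does to the mean value function, the following Python sketch (not a Minitab command) computes the change in mean y for a one-unit increase in x1 at several fixed values of x2, with and without the x1x2 term. The coefficient values are hypothetical and chosen only for illustration.

```python
def mean_y(x1, x2, alpha, beta1, beta2, beta3=0.0):
    """Population regression function alpha + beta1*x1 + beta2*x2 + beta3*x1*x2."""
    return alpha + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2


def slope_in_x1(x2, **coef):
    """Change in mean y when x1 increases by one unit while x2 is held fixed."""
    return mean_y(1.0, x2, **coef) - mean_y(0.0, x2, **coef)


# Hypothetical coefficients (assumptions, not taken from any data set).
no_interaction = dict(alpha=2.0, beta1=3.0, beta2=1.0)
with_interaction = dict(alpha=2.0, beta1=3.0, beta2=1.0, beta3=0.5)

for x2 in (0.0, 10.0, 20.0):
    print(x2, slope_in_x1(x2, **no_interaction), slope_in_x1(x2, **with_interaction))
```

Without the interaction term the slope is 3.0 at every value of x2 (parallel lines); with it the slope is β1 + β3x2, so the lines are no longer parallel.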

14.3 Fitting a Model and Assessing Its Utility


New Minitab Commands
1. Stat>Regression>Regression - Performs multiple regression using the least squares method. In this section, you will use this command to determine the least squares equation relating a dependent variable y to two or more predictor variables.

a. Options - Permits various options: weighted regression, fit the model with/without an intercept, calculate variance inflation factors and the Durbin-Watson statistic, and calculate and store prediction intervals for new observations. In this section, you will use this command to make predictions using the least squares regression line where two or more predictor variables are present.

b. Results - Controls the display of output to the Session window. In this section, you will use this command to display fits and residuals.

Let's suppose a particular set of k predictor variables x1, x2, ..., xk has been selected for inclusion in the model

y = α + β1x1 + β2x2 + ... + βkxk + e.

It is then necessary to estimate the model coefficients α, β1, β2, ..., βk and the regression function

α + β1x1 + β2x2 + ... + βkxk

(mean y value for specified values of the predictors), assess the model's utility, and perhaps use the estimated model to make further inferences.

Example 14.6, text
The Problem - Soil and Sediment Adsorption
Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic because it influences the effectiveness of pesticides and various agricultural chemicals. The paper "Adsorption of Phosphates, Arsenate, Methanearsenate, and Cacodylate by Lake and Stream Sediments: Comparisons with Soils" (J. of Environ. Qual. (1984):499-504) presented the accompanying data consisting of n = 13 triples and proposed the model

y = α + β1x1 + β2x2 + e

for relating

y = phosphate adsorption index
x1 = amount of extractable iron
x2 = amount of extractable aluminum

x1 = iron   x2 = aluminum   y = adsorption index
 61          13              4
175          21             18
111          24             14
124          23             18
130          64             26
173          38             26
169          33             21
169          61             30
160          39             28
244          79             36
257         112             65
333          88             62
199          54             40

Follow these steps to obtain the least squares regression equation.


3. Obtain the regression equation.
Choose Stat>Regression>Regression. Place HPO in the Response: text box. Place FE in the Predictors: text box. Place AL in the Predictors: text box (FE followed by AL).

4. Make a prediction.
Choose Options. Place 150 (FE) and 60 (AL) in the Prediction intervals for new observations: text box. Choose OK.

5. Obtain fits and residuals.
Choose Results. Darken the In addition, the full table of fits and residuals option button. Choose OK. Choose OK.


The Minitab Output

Figure 14.1
Parts of the Minitab output are shown in Figure 14.1. Focus on the column labeled Coef (for coefficient) in Figure 14.1. The three numbers in this column are the estimated model coefficients:

a = −7.351
b1 = 0.11273
b2 = 0.34900

Thus we estimate that the average change in HPO associated with a 1-unit increase in FE while AL remains fixed is 0.11273. A similar interpretation applies to b2.

The estimated regression function is

(estimated mean value of y) = −7.351 + 0.11273x1 + 0.34900x2.

Substituting x1 = 150 and x2 = 60 gives

−7.351 + 0.11273(150) + 0.34900(60) = 30.5,

which can be interpreted as either a point estimate for the mean value of HPO or as a point prediction for a single HPO value.

The Minitab Output

Figure 14.2
The remainder of the Minitab output is shown in Figure 14.2. The utility of the model can be assessed by examining the extent to which the predicted y values based on the estimated regression function are close to the y values actually observed. The first y observation, y1 = 4, was made with x1 = 61 and x2 = 13. The first predicted value is

ŷ1 = −7.351 + 0.11273(61) + 0.34900(13) = 4.06.

The first residual is then y1 − ŷ1 = 4 − 4.06 = −0.06.

The other predicted values and residuals are computed in a similar fashion. The sum of residuals from a least squares fit should, except for rounding effects, be zero.
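The arithmetic behind the point prediction and the first residual can be reproduced directly from the coefficients reported in the Coef column. The sketch below is plain Python rather than Minitab; the function name fitted is our own, and the coefficients are the rounded values quoted in the text, so results agree with Minitab only to the precision shown.

```python
# Estimated coefficients quoted from the Coef column of Figure 14.1.
a, b1, b2 = -7.351, 0.11273, 0.34900


def fitted(x1, x2):
    """Estimated mean HPO for extractable iron x1 (FE) and aluminum x2 (AL)."""
    return a + b1 * x1 + b2 * x2


prediction = fitted(150, 60)        # point prediction for FE = 150, AL = 60
first_fit = fitted(61, 13)          # fitted value for the first observation
first_residual = 4 - first_fit      # observed y1 = 4 minus the fitted value

print(f"{prediction:.1f}")      # prints 30.5
print(f"{first_fit:.2f}")       # prints 4.06
print(f"{first_residual:.2f}")  # prints -0.06
```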


14.4 Variable Selection

New Minitab Commands
1. Stat>Regression>Best Subsets - Best subsets regression uses the maximum R^2 criterion. Suppose you specify m predictors. Minitab first selects the one-predictor regression model giving the largest R^2. Minitab then prints information on this model and the next best one-predictor model. Next, Minitab finds the two-predictor model with the largest R^2 and prints information on it and the next best. The process continues until all m predictors are used. Best Subsets is an efficient way to select a group of "best subsets" for further analysis by selecting the smallest subset that fulfills certain statistical criteria. The subset model may actually estimate the regression coefficients and predict future responses with smaller variance than the full model using all predictors.

Suppose that an investigator has data on a number of predictor variables that might be incorporated into a model. The primary objective is then to select the set of predictors that in some sense specifies a "best" model.

Model selection methods can be divided into two types. There are those based on every possible model, computing one or more summary quantities from each fit, and comparing these quantities to identify the most satisfactory models. Minitab refers to this method as Best subsets regression. A second selection method is referred to as Stepwise regression. A backward stepwise procedure begins with all possible predictors in the model and deletes predictors one by one until all remaining predictors are judged important. A forward stepwise procedure begins with no predictors and then adds predictors until no predictor not in the model seems important. The "best" model contains relatively few predictors but has a large R^2 value and is such that no other model containing more predictors gives much of an improvement in R^2.
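A method based on every possible model has to examine 2^m − 1 candidate models for m predictors, and it needs a summary quantity that does not automatically reward larger models. The Python sketch below (not a Minitab command) enumerates the candidate subsets and evaluates the usual adjusted R^2 formula, 1 − (1 − R^2)(n − 1)/(n − k − 1). The predictor names and the R^2 values passed to the function are hypothetical, used only to illustrate the calculation.

```python
from itertools import combinations


def candidate_models(predictors):
    """Yield every non-empty subset of the predictors, smallest subsets first."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


predictors = ["x1", "x2", "x3", "x4"]   # hypothetical predictor names
models = list(candidate_models(predictors))
print(len(models))                      # 2**4 - 1 = 15 candidate models

# Hypothetical R^2 values: adding a third predictor raises R^2 slightly,
# yet adjusted R^2 falls because of the penalty for the extra term.
print(adjusted_r2(0.52, n=9, k=2))
print(adjusted_r2(0.53, n=9, k=3))
```

With 14 candidate predictors the same enumeration would produce 2^14 − 1 = 16,383 models, which is why a procedure that reports only the best few models of each size is convenient.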

Example 14.17, text
The Problem - Modeling the Price of Industrial Properties
The paper "Using Multiple Regression Analysis in Real Estate Appraisal" (The Appraisal Journal (2001):424-430) reported the accompanying data for a random sample of nine large industrial properties. A primary objective was to relate the price of the property to various other characteristics of the property. The variables shown are

y = price per square foot
x1 = size of building (square feet)
x2 = age of building (years)
x3 = quality of location (measured on a scale of 1 (very poor location) to 4 (very good location))
x4 = land to building ratio

y = Price   x1 = Size    x2 = Age   x3 = Location   x4 = Ratio
4.89        2,166,600    30         4.0             2.10
3.49          751,658    30         2.0             3.54
4.33        2,422,650    28         3.0             3.63
8.24          224,573    25         1.5             4.65
5.10        3,917,800    26         4.0             1.71
2.79        2,866,526    35         4.0             2.27
5.89        1,698,161    28         3.0             3.12
6.38        1,046,260    33         4.0             4.77
5.25        1,108,828    28         4.0             7.56

The data is stored in the file ex14_17.mtp.
Follow these steps to select the most satisfactory model.

1. Obtain the data file.
Choose File>Open Worksheet. Select the file ex14_17.mtp. Observe that the columns are named Price (y), Size (x1), Age (x2), Location (x3), and Ratio (x4).

2. Perform the best subsets regression procedure.
Choose Stat>Regression>Best Subsets. Place Price in the Response: text box. Place Size, Age, Location, and Ratio in the Free predictors: text box. Choose OK.

3. Apply the backward elimination procedure.
Place 10000 in the F to enter: text box. Choose OK. Choose OK.

The Minitab Output

Figure 14.3
The Minitab output, as shown in Figure 14.3, indicates the results of the best subsets procedure. It is clear that the best two-predictor model offers considerable improvement with respect to both R^2 and adjusted R^2 over any model with just a single predictor. The best two-predictor model, containing Size and Age, also has the largest value of adjusted R^2 (36.3), even larger than adjusted R^2 for the three-predictor model and the model that uses all four predictors.

Follow these steps to determine the estimated regression equation.

1. Obtain the regression equation.
Choose Stat>Regression>Regression. Place Price in the Response: text box. Place Size and Age in the Predictors: text box. Choose OK.

The Minitab Output

Figure 14.4
The Minitab output, as shown in Figure 14.4, indicates the estimated regression equation. Note that while this model may be the best choice among those considered here, the R^2 value is not particularly large and the value of se = 1.28297 (in dollars per square foot) is quite large given the range of y values in the data set.


Example 14.18, text
The Problem - Durable Press Rating of Cotton Fabric
An Example Using Best Subsets Regression
The paper "Applying Stepwise Multiple Regression Analysis to the Reaction of Formaldehyde with Cotton Cellulose" (Textile Research J. (1984):157-165) reported the results of an experiment involving wrinkle resistance and a number of independent variables. The dependent variable

y = durable press rating

is a quantitative measure of wrinkle resistance. The four independent variables, and the abbreviations used in the model building process, are

x1 = HCHO (formaldehyde) concentration (con)
x2 = catalyst ratio (cat)
x3 = curing temperature (temp)
x4 = curing time (time)

In addition to the four independent variables, the investigators considered as potential predictors x1^2 (consqd), x2^2 (catsqd), x3^2 (tempsqd), x4^2 (timesqd), and all six interactions x1x2 (con*cat), x1x3 (con*temp), x1x4 (con*time), x2x3 (cat*temp), x2x4 (cat*time), and x3x4 (temp*time). The data is stored in the file ex14_18.mtp.
Follow these steps to select the most satisfactory model.

1. Obtain the data file.
Choose File>Open Worksheet. Select the file ex14_18.mtp. Choose Open. Observe that the columns are named con, cat, temp, time, consqd, catsqd, etc.

2. Perform the best subsets regression procedure.
Choose Stat>Regression>Best Subsets. Place Y in the Response: text box. Place all predictors con-'temp*time' in the Free predictors: text box. Choose Options. Place 4 in the Minimum: Free predictors in each model text box. Choose OK. Choose OK.


The Minitab Output

Figure 14.5
The Minitab output, as shown in Figure 14.5, indicates that the choice of a best model is not clear-cut. We certainly don't see the benefit of including more than k = 8 predictor variables (after that, adjusted R^2 begins to decrease), nor would we suggest a model with fewer than five predictor variables (adjusted R^2 is still increasing and Cp is large). Based on this output, the best six-predictor model is a reasonable choice. That model is indicated by the largest adjusted R^2 within the six-predictor models (85.5). This model also has the smallest Cp statistic (5.9). A small value of Cp indicates that the model is relatively precise (has small variance) in estimating the true regression coefficients and predicting future responses. This precision will not improve much by adding more predictors. Models with considerable lack of fit have values of Cp larger than p (the number of parameters in the model).

Based upon this output, the best six-predictor model contains the variables x2 (cat), x1^2 (consqd), x2^2 (catsqd), x1x3 (con*temp), x1x4 (con*time), and x2x3 (cat*temp).

Follow these steps to determine the estimated regression equation.

1. Obtain the regression equation.
Choose Stat>Regression>Regression. Place y in the Response: text box. Place cat, consqd, catsqd, con*temp, con*time, and cat*temp in the Predictors: text box. Choose OK.


The Minitab Output

Figure 14.6
The Minitab output, as shown in Figure 14.6, indicates the best six-predictor estimated regression equation. This may be written as

ŷ = −1.218 + 0.9599x2 − 0.0373x1^2 − 0.0389x2^2 + 0.0037x1x3 + 0.019x1x4 − 0.0013x2x3.

Another good candidate is the best seven-predictor model. Although it includes one more predictor than the model just suggested, only one of the seven predictors is an interaction term (x1x3), so model interpretation is somewhat easier. (Notice, though, that none of the best three models with seven predictors results simply from adding a single predictor to the best six-predictor model.) Since every good model includes x1, x2, x3, and x4 in some predictor, it appears that HCHO concentration, catalyst ratio, curing time, and curing temperature are all important determinants of durable press rating.


Chapter 15
The Analysis of Variance

15.1 Overview

Methods for testing H0: µ1 − µ2 = 0, where µ1 and µ2 are the means of two different populations, were discussed in Chapter 10. Many investigations involve a comparison of more than two population or treatment means. For example, an investigation carried out to study purchasers of luxury automobiles ("Measuring Values Can Sharpen Segmentation in the Luxury Car Market," J. of Advertising Research (1995): 9-22) reported data on a number of different attributes that might affect purchase decisions, including comfort, safety, styling, durability, and reliability. Here is summary information on the level of importance of speed, rated on a seven-point scale.

Type of Car          American   German   Japanese
Sample size          58         38       59
Sample mean rating   3.05       2.87     2.67

Let µ1, µ2, and µ3 denote the true average (i.e., population mean) rating on this attribute for owners of American, German, and Japanese luxury cars, respectively. Do the data support the claim that µ1 = µ2 = µ3, or does it appear that at least two of the µ's are different from one another? This is an example of a single-factor analysis of variance (ANOVA) problem, in which the objective is to decide whether the means for more than two populations or treatments are identical. In this chapter, we address situations involving analysis of variance (ANOVA). After reading this chapter you should be able to

1. Obtain the Single-Factor Analysis of Variance (ANOVA) Table (Example 15.4, text)

2. Perform the Tukey-Kramer Multiple Comparison Procedure (Example 15.6, text)

3. Perform ANOVA for a Randomized Block Experiment (Example 15.10, text)

4. Perform a Two-Factor ANOVA (Example 15.14, text)

15.2 Single-Factor ANOVA

New Minitab Commands
1. Stat>ANOVA>Oneway (Unstacked) - Performs a one-way analysis of variance, with each group in a separate column. Data contained in separate columns is referred to by Minitab as unstacked data. In this section, you will use this command to perform a one-way analysis of variance on data contained in four columns.

Analysis of variance is used to compare data from several populations. When two or more populations or treatments are compared, the characteristic that distinguishes the populations or treatments from one another is called the factor under investigation. For example, an experiment might be carried out to compare three different methods for teaching reading (three different treatments), in which case the factor of interest would be teaching method. This is a qualitative factor. If growth of fish raised in waters having different salinity levels (0%, 10%, 20%, and 30%) is of interest, the factor salinity level is quantitative.

A single-factor analysis of variance (ANOVA) problem involves a comparison of k population or treatment means µ1, µ2, ..., µk. The objective is to test

H0: µ1 = µ2 = ... = µk

against

Ha: at least two of the µ's are different.

The analysis is based on k independently selected random samples, one from each population or for each treatment. A comparison of treatments based on independently selected experimental units is often referred to as a completely randomized design.

Example 15.4, text
The Problem - Musical Preferences and Reckless Behavior
The article "The Soundtrack of Recklessness: Musical Preferences and Reckless Behavior Among Adolescents" (J. Adolescent Research (1992):313-331) described a study whose purpose was to determine whether adolescents who preferred certain types of music reported higher rates of reckless behaviors, such as speeding, drug use, shoplifting, and unprotected sex. Independently chosen random samples were selected from each of four groups of students with different musical preferences at a large high school: (1) acoustic/pop, (2) mainstream rock, (3) hard rock, and (4) heavy metal. Each student in these samples was asked how many times they had engaged in various reckless activities during the last year. The following table lists data on driving over 80 mph that is consistent with summary quantities given in the article. (The sample sizes in the article were much larger, but for purposes of this example, we use n1 = n2 = n3 = n4 = 20.)

Musical Preference

Acoustic/Pop   Mainstream Rock   Hard Rock   Heavy Metal
2              3                 3           4
3              2                 4           3
4              1                 3           4
1              2                 1           3
3              3                 2           3
3              4                 1           3
3              3                 4           3
3              2                 2           3
2              4                 2           2
2              4                 2           4
1              4                 3           4
3              4                 3           5
2              2                 4           4
2              3                 3           5
2              2                 3           3
3              2                 2           4
2              2                 3           5
2              3                 4           4
3              1                 2           2
4              3                 4           3

Follow these steps to test the hypothesis H0: µ1 = µ2 = µ3 = µ4.

1. Open the worksheet.
Choose File>Open Worksheet. Select the file ex_15_4.mtp. Choose Open.

2. Create a boxplot.
Choose Graph>Boxplot. Select the Multiple Y's Simple boxplot from the dialog box choices. Choose OK. Place Acoustic/Pop, Mainstream Rock, Hard Rock, and Heavy Metal in the Graph variables: text box. Choose OK.


The Minitab Output

Figure 15.1
The boxplot, as shown in Figure 15.1, indicates that the four distributions are roughly symmetric, and there are no outliers. The assumptions of ANOVA are reasonable.

3. Perform the ANOVA.
Choose Stat>ANOVA>Oneway (Unstacked). Place Acoustic/Pop, Mainstream Rock, Hard Rock, and Heavy Metal in the Responses (in separate columns): text box. Choose OK.

The Minitab Output

Figure 15.2
The Minitab output, as shown in Figure 15.2, indicates the One-Way Analysis of Variance table. The output contains the ANOVA summary table with the sources of variation, associated degrees of freedom, mean square, F statistic, and the associated p-value. The F statistic (F = 5.09) and associated p-value (p = 0.003) indicate that there is compelling evidence to reject the null hypothesis H0: µ1 = µ2 = µ3 = µ4 at the .05 level of significance.
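For readers who want to see where the F statistic in the ANOVA table comes from, the following Python sketch (not a Minitab command) computes the treatment and error sums of squares for the four samples of Example 15.4 and forms F = MSTr/MSE. The function name one_way_anova is our own; the p-value is not computed here because it requires the F distribution.

```python
def one_way_anova(groups):
    """Return (SSTr, SSE, df_treatments, df_error, F) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Treatment sum of squares: variation of the sample means about the grand mean.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error sum of squares: variation within each sample about its own mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_tr, df_err = k - 1, n_total - k
    f_stat = (sstr / df_tr) / (sse / df_err)
    return sstr, sse, df_tr, df_err, f_stat


# Number of times driving over 80 mph, by musical preference (Example 15.4).
acoustic_pop = [2, 3, 4, 1, 3, 3, 3, 3, 2, 2, 1, 3, 2, 2, 2, 3, 2, 2, 3, 4]
mainstream_rock = [3, 2, 1, 2, 3, 4, 3, 2, 4, 4, 4, 4, 2, 3, 2, 2, 2, 3, 1, 3]
hard_rock = [3, 4, 3, 1, 2, 1, 4, 2, 2, 2, 3, 3, 4, 3, 3, 2, 3, 4, 2, 4]
heavy_metal = [4, 3, 4, 3, 3, 3, 3, 3, 2, 4, 4, 5, 4, 5, 3, 4, 5, 4, 2, 3]

sstr, sse, df_tr, df_err, f_stat = one_way_anova(
    [acoustic_pop, mainstream_rock, hard_rock, heavy_metal]
)
print(df_tr, df_err)       # prints 3 76
print(f"{f_stat:.2f}")     # prints 5.09
```

The result agrees with the F statistic reported in the Minitab output.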

15.3 Multiple Comparisons

New Minitab Commands
1. Stat>ANOVA>Oneway - Performs a one-way analysis of variance, with the dependent variable in one column and subscripts in another. In this section, you will perform a one-way analysis of variance on four samples, where all of the data is contained in one column and the subscripts in another column.

a. Comparisons - Provides confidence intervals for the differences between means, using four different methods: Tukey's, Fisher's, Dunnett's, and Hsu's MCB. Tukey and Fisher provide confidence intervals for all pairwise differences between level means. In this section, you will place a check in the Tukey's, family error rate: checkbox to perform a multiple comparison procedure.

When H0: µ1 = µ2 = ... = µk is rejected by the F test, we believe that there are differences among the k population means. The question is "Which pairs of means differ?" A multiple comparison procedure is a process for identifying differences among the k population means once the hypothesis of overall equality has been rejected. One of those methods is the Tukey-Kramer multiple comparison method. The Tukey-Kramer multiple comparison method provides a confidence interval for all pairwise differences between means. The null hypothesis of no difference between two means may be rejected if and only if the confidence interval does not include zero.

Example 15.6, text
The Problem - Sleep Time
A biologist wished to study the effects of ethanol on sleep time. A sample of 20 rats, matched for age and other characteristics, was selected, and each rat was given an oral injection having a particular concentration of ethanol per body weight. The rapid eye movement (REM) sleep time for each rat was then recorded for a 24-hour period, with the results shown in the accompanying table.

Treatment
1. 0 (control)   88.6  73.2  91.4  68.0  75.2
2. 1 g/kg        63.0  53.9  69.2  50.1  71.5
3. 2 g/kg        44.9  59.5  40.2  56.3  38.7
4. 4 g/kg        31.0  39.6  45.3  25.2  22.7


Follow these steps to test the hypothesis H0: µ1 = µ2 = µ3 = µ4.

1. Open the worksheet.
Choose File>Open Worksheet. Select the file ex_15_6.mtp. Choose Open. The manner in which this data is entered is referred to as stacked data.

2. Perform the ANOVA.
Choose Stat>ANOVA>Oneway. Place REM in the Response: text box. Place Treatment in the Factor: text box.

3. Make comparisons.
Choose Comparisons. Place a check in the Tukey's, family error rate: checkbox. Choose the default value of 5. Choose OK. Choose OK.

The Minitab Output

Figure 15.3
The Minitab output, as shown in Figure 15.3, indicates the One-Way Analysis of Variance table. The output contains the ANOVA summary table with the sources of variation, associated degrees of freedom, mean square, F statistic, and the associated p-value. The F statistic (F = 21.09) and associated p-value (p = 0.000) indicate that there is compelling evidence to reject the null hypothesis H0: µ1 = µ2 = µ3 = µ4 at the .05 level of significance.

This ANOVA table leads us to the conclusion that the true average REM sleep time depends on the treatment used. Once the null hypothesis has been rejected (Figure 15.3), the question is to determine which pairs of means are significantly different.


The Minitab Output

Figure 15.4
The remainder of the Minitab output, as shown in Figure 15.4, indicates that the only intervals that include zero are those for µ3 − µ2 and µ4 − µ3.
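The Tukey-Kramer intervals can be sketched by hand once the mean square for error is known. The Python sketch below (not a Minitab command) uses the REM sleep data of Example 15.6, keyed by treatment number 1 to 4, and builds each interval as (difference in sample means) ± q·sqrt((MSE/2)(1/ni + 1/nj)). The critical value q = 4.05 is an assumption: it is the value read from a standard Studentized range table for a 5% family error rate, four treatments, and 16 error degrees of freedom, not a number taken from the Minitab output.

```python
import math
from itertools import combinations

# REM sleep times by treatment number (Example 15.6).
samples = {
    1: [88.6, 73.2, 91.4, 68.0, 75.2],
    2: [63.0, 53.9, 69.2, 50.1, 71.5],
    3: [44.9, 59.5, 40.2, 56.3, 38.7],
    4: [31.0, 39.6, 45.3, 25.2, 22.7],
}

Q_CRIT = 4.05  # assumed tabled Studentized range value (5%, k = 4, error df = 16)

k = len(samples)
n_total = sum(len(v) for v in samples.values())
means = {t: sum(v) / len(v) for t, v in samples.items()}
sse = sum((x - means[t]) ** 2 for t, v in samples.items() for x in v)
mse = sse / (n_total - k)  # mean square for error

intervals = {}
for i, j in combinations(samples, 2):
    diff = means[i] - means[j]
    half_width = Q_CRIT * math.sqrt(mse / 2 * (1 / len(samples[i]) + 1 / len(samples[j])))
    intervals[(i, j)] = (diff - half_width, diff + half_width)

# Pairs whose interval contains zero are not judged significantly different.
includes_zero = [pair for pair, (lo, hi) in intervals.items() if lo <= 0 <= hi]
print(includes_zero)  # prints [(2, 3), (3, 4)]
```

Only the pairs (2, 3) and (3, 4) give intervals containing zero, which matches the reading of Figure 15.4.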

15.4 Randomized Block Experiments

New Minitab Commands
1. Stat>ANOVA>Twoway - Performs a two-way analysis of variance for balanced data. Each cell must contain an equal number of observations. You cannot specify whether the effects are fixed or random with TWOWAY. As a result, TWOWAY does not produce F and p-values, as Minitab would have to guess at the type of effects you have. In this section, you will use this command to perform a two-way analysis of variance where four treatments and five blocks are present.

In Chapter 11 of the text, we saw that when two treatments are to be compared, a paired experiment is often more effective than one involving two independent samples. For example, in a study of the effects of different diets on weight loss, subjects are often paired (blocked or grouped) into different initial weight categories. Within each initial weight category, the subjects are as alike as possible. Then within each initial weight category, one subject is randomly selected for diet 1, a second subject is randomly selected to receive diet 2, and so on. These homogeneous groups (initial weight categories) are called blocks, and the random allocation of treatments (diets) within each block as described gives a randomized block experiment.

Let experimental units (individuals or objects to which the treatments are applied) be separated into groups consisting of k units in such a way that the units within each group are as similar as possible. Each unit in a group receives a different treatment. The groups are often called blocks, and the experimental design is referred to as a randomized block design.

Example 15.10, text
The Problem - Comparing Four Stool Designs
In the article "The Effects of a Pneumatic Stool and a One-Legged Stool on Lower Limb Joint Load and Muscular Activity During Sitting and Rising" (Ergonomics (1993):519-535), the following data is given on the effort (measured on the Borg Scale) required by a subject to rise from a sitting position for each of four different stools. Because it was suspected that different people could exhibit large differences in effort, even for the same type of stool, a sample of nine people was selected and each tested on all four stools, with the following results:

                 Subject
Type of Stool    1    2    3    4    5    6    7    8    9
A               12   10    7    7    8    9    8    7    9
B               15   14   14   11   11   11   12   11   13
C               12   13   13   10    8   11   12    8   10
D               10   12    9    9    7   10   11    7    8

For each person, the order in which the stools were tested was randomized. This is a randomized block experiment, with subjects playing the role of blocks.

Follow these steps to test the hypothesis of interest that the mean value does not depend on which treatment is applied:

H0: Mean effort does not depend on type of stool.
Ha: Mean effort does depend on type of stool.


2. Perform the ANOVA.
Choose Stat>ANOVA>Twoway. Place Effort in the Response: text box. Place Stool in the Row factor: text box. Place Block in the Column factor: text box. Choose OK.


The Minitab Output

Figure 15.5
The Minitab output, as shown in Figure 15.5, indicates the Two-Way Analysis of Variance table. The output contains the ANOVA summary table with the sources of variation, associated degrees of freedom, and mean square. The F statistic is F = 22.36, with a P-value of 0.000. Since P-value < α, we reject H0. There is sufficient evidence to conclude that the mean effort required is not the same for all four stool types.
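The randomized block F statistic can be reproduced from the stool data by splitting the total variation into treatment, block, and error parts. The Python sketch below (not a Minitab command) assumes the standard decomposition SSTo = SSTr + SSBl + SSE with (k − 1)(l − 1) error degrees of freedom; the function name and the returned dictionary keys are our own.

```python
def randomized_block_anova(table):
    """table[i][j] is the response for treatment i in block j."""
    k, l = len(table), len(table[0])            # treatments, blocks
    grand_mean = sum(map(sum, table)) / (k * l)
    trt_means = [sum(row) / l for row in table]
    blk_means = [sum(row[j] for row in table) / k for j in range(l)]
    sstr = l * sum((m - grand_mean) ** 2 for m in trt_means)
    ssbl = k * sum((m - grand_mean) ** 2 for m in blk_means)
    ssto = sum((x - grand_mean) ** 2 for row in table for x in row)
    sse = ssto - sstr - ssbl                    # what is left after treatments and blocks
    df_tr, df_err = k - 1, (k - 1) * (l - 1)
    f_stat = (sstr / df_tr) / (sse / df_err)
    return {"SSTr": sstr, "SSBl": ssbl, "SSE": sse,
            "df_tr": df_tr, "df_err": df_err, "F": f_stat}


# Effort ratings: rows are stool types A-D, columns are subjects 1-9 (Example 15.10).
effort = [
    [12, 10, 7, 7, 8, 9, 8, 7, 9],
    [15, 14, 14, 11, 11, 11, 12, 11, 13],
    [12, 13, 13, 10, 8, 11, 12, 8, 10],
    [10, 12, 9, 9, 7, 10, 11, 7, 8],
]

result = randomized_block_anova(effort)
print(result["df_tr"], result["df_err"])  # prints 3 24
print(f"{result['F']:.2f}")               # prints 22.36
```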

15.5 The Two-Factor ANOVA

New Minitab Commands
1. Stat>ANOVA>Balanced ANOVA - Performs univariate and multivariate analysis of variance. Factors may be crossed or nested, fixed or random. Nesting must be balanced, and the subscripts used to indicate levels of B within each level of A must be the same. For a two-way analysis of variance, data must be balanced (all cells have the same number of observations).

An investigator will often be interested in assessing the effects of two different factors on a response variable. For example, an agricultural scientist might be interested in determining how yield of tomatoes is affected by choice of variety planted (a categorical factor, with each category corresponding to a different variety, say variety 1, variety 2, and variety 3) and planting density (a quantitative factor, with a level corresponding to each planting density being considered, say 10, 20, 30, and 40 thousand plants per hectare).

Let's call the two factors under study factor A and factor B. Even when a factor is categorical, it simplifies terminology to refer to the categories as levels. Thus, the categorical factor variety may have a number of levels, one for each variety. The number of levels of factor A is denoted by k, and l denotes the number of levels of factor B. Each cell in the table corresponds to a particular level of factor A in combination with a particular level of factor B. Because there are l cells in each row and k rows, there are kl cells in the table. The kl different combinations of factor A and factor B levels are often referred to as treatments. In this tomato example, there are three tomato varieties and four different planting densities under consideration, providing 3 × 4 = 12 treatments.

An experimenter frequently designs the experiment to have the same number of observations, m, on each treatment.

Example 15.14, text
The Problem - Effect of Soil Type and Pipe Coating on Corrosion
When metal pipe is buried in soil, it is desirable to apply a coating to retard corrosion. Four different coatings are under consideration for use with pipe that will ultimately be buried in three types of soil. An experiment to investigate the effects of these coatings and soils was carried out by first selecting 12 pipe segments and applying each coating to three segments. The segments were then buried in soil for a specified period in such a way that each soil type received one piece with each coating. Assuming that there is no interaction between coating type and soil type, let's test at level 0.05 for the presence of separate factor A (coating) and factor B (soil) effects. The resulting data (depth of corrosion) follows:

                     Factor B (Soil)
Factor A (Coating)   1    2    3
1                    64   49   50
2                    53   51   48
3                    47   45   50
4                    51   43   52

Follow these steps to obtain a table of means.


2. Perform the ANOVA.
Choose Stat>ANOVA>Twoway. Place Corrosion in the Response: text box. Place Factor A in the Row factor: text box. Place Factor B in the Column factor: text box. Choose OK.

The Minitab Output

Figure 15.6
The Minitab output, as shown in Figure 15.6, indicates the Analysis of Variance table, with P-value > 0.10 for both tests. It appears that the true average response (amount of corrosion) depends neither on the coating used nor on the type of soil in which the pipe is buried.
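The table of means and the two F ratios for the corrosion data can also be worked out directly. The Python sketch below (not a Minitab command) assumes the additive model with one observation per cell, so the error degrees of freedom are (k − 1)(l − 1) = 6. The two critical values are assumptions read from a standard F table for an upper-tail area of 0.10; an F ratio below its critical value corresponds to a P-value greater than 0.10.

```python
# Corrosion depths: rows are coatings 1-4 (factor A), columns are soils 1-3 (factor B).
corrosion = [
    [64, 49, 50],
    [53, 51, 48],
    [47, 45, 50],
    [51, 43, 52],
]

k, l = len(corrosion), len(corrosion[0])
grand_mean = sum(map(sum, corrosion)) / (k * l)
coating_means = [sum(row) / l for row in corrosion]
soil_means = [sum(row[j] for row in corrosion) / k for j in range(l)]

ss_a = l * sum((m - grand_mean) ** 2 for m in coating_means)
ss_b = k * sum((m - grand_mean) ** 2 for m in soil_means)
ss_total = sum((x - grand_mean) ** 2 for row in corrosion for x in row)
ss_error = ss_total - ss_a - ss_b
ms_error = ss_error / ((k - 1) * (l - 1))

f_a = (ss_a / (k - 1)) / ms_error  # F ratio for coating
f_b = (ss_b / (l - 1)) / ms_error  # F ratio for soil

# Assumed upper-tail 0.10 critical values from a standard F table.
F_CRIT_A = 3.29  # numerator df 3, denominator df 6
F_CRIT_B = 3.46  # numerator df 2, denominator df 6

print("coating means:", [round(m, 2) for m in coating_means])
print("soil means:", [round(m, 2) for m in soil_means])
print(f"F for coating = {f_a:.2f}, below critical value: {f_a < F_CRIT_A}")
print(f"F for soil    = {f_b:.2f}, below critical value: {f_b < F_CRIT_B}")
```

Both F ratios fall below their critical values, which is consistent with the P-values greater than 0.10 reported in the text.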
