
STATISTICS 252 INTRODUCTION TO APPLIED STATISTICS II

SPSS LAB MANUAL Fall 2017

Science Department Grant MacEwan University

Contributions by: Dr. Kathleen Lawry-Batty, Dr. Christina Anton, Allan Wesley, Dr. Muhammad Islam, Dr. Karen Buro, and Dr. Wanhua Su

Student Name:_________________________________________________ Student I.D:____________________________________________________ Lab Section Day:__________________________________________________________ Time:_________________________________________________________ Room:________________________________________________________

Lab Instructor

Name:________________________________________________________ Office:________________________________________________________ Email:_________________________________________________________ Telephone:_____________________________________________________


Table of Contents Introduction .......................................................................................................................................5

Chapter 1 Introduction to SPSS ............................................................................................................6

1.1 Starting SPSS .......................................................................................................................6

1.2 The SPSS Environment .........................................................................................................6

1.2.1 Title Bar .......................................................................................................................7

1.2.2 Windows in SPSS ..........................................................................................................7

1.2.4 Data Editor Menu Bar ...................................................................................................7

1.2.5 Data Editor Toolbar......................................................................................................7

1.3 Entering Data and Defining Variables ....................................................................................8

1.3.1 Defining Variables.........................................................................................................8

1.3.2 Entering and Editing Data..............................................................................................9

1.4 Saving and Reading Data Files............................................................................................. 11

1.4.1 Saving a Data File ........................................... 11

1.4.2 Reading Existing Data Files .......................................................................................... 11

1.5 Manipulating Data ............................................................................................................ 12

1.5.1 Creating a New Variable.............................................................................................. 12

1.5.2 Recoding Variables ..................................................................................................... 14

1.5.3 Selecting Cases .......................................................................................................... 16

1.5.4 Sorting Data ............................................................................................................... 17

1.6 Drawing a Graph ................................................................................................................ 18

1.7 Computation of Numerical Summaries ................................................................................ 20


1.7.1 Frequencies................................................................................................................ 20

1.7.2 Descriptive Statistics ................................................................................................... 22

1.7.3 Explore further: (for subsets of data of interest) ........................................................... 23

1.8 Saving ..................................................................................................................................... 24

1.8.1 Saving Results............................................................................................................. 24

1.8.2 Opening previously saved data files or output .................................................................... 24

1.9 Transferring Output into Word ........................................................................................... 24

1.10 Printing Your Output.............................................................................................................. 25

Chapter 2: Graphs and Descriptive Statistics....................................................................................... 26

2.1 Data Variables ......................................................................................................................... 26

2.2 Bar Chart................................................................................................................................ 26


2.2.1 Bar Chart of a Categorical Variable ..................................................................................... 26

2.3 Boxplot ................................................................................................................................... 29

2.3.1 Boxplot of a Numerical Variable ......................................................................................... 29

2.3.2 Side by Side Boxplot: Numerical Variable by Levels of Categorical Variable ........... 31

2.4 Histogram ............................................................................................................................... 33

2.4.1 Histogram of a Numerical Variable ..................................................................................... 33

2.4.2 Side by Side Histograms of a Numerical Variable by Levels of Categorical Variable ................ 36

2.5 Descriptive Statistics................................................................................................................ 37

2.5.1 Descriptive Statistics of a Numerical Variable...................................................................... 37

2.5.2 Descriptive Statistics of a Numerical Variable by Levels of a Categorical Variable .................. 38

Chapter 3: Relationships ................................................................................................................... 41

3.1 Crosstabs ................................................................................................................................ 41

3.1.1 Counts and Percents ................................................................................................... 41

3.1.2 Reading and Calculating Percent using the Crosstab table ............................................. 42

3.2 Graphing Relationships ............................................................................................................ 42

3.2.1 Two categorical Variables: Overall count and percent Bar Charts ...................... 43

3.2.2 Two categorical Variables: Bar Charts for Percents of Levels of one Category within another Category ................................................................................................................... 44

3.2.3 Two categorical and one Numerical Variable: Histograms.................................................... 45

3.3 Weighted Cases: Two Categorical Variables .............................................................................. 46

3.3.1 Weighted Cases: Frequency Tables, Bar and Pie Charts for Individual Categories .................. 48

3.3.2 Weighted Cases: Crosstab Tables ....................................................................................... 50

3.3.3 Weighted Cases: Overall Count and Percent Cluster Bars..................................................... 52

3.3.4 Weighted Cases: Percents for Levels of one Category within levels of another Category........ 53

3.4 Scatterplots ............................................................................................................................ 54

3.4 Correlation.............................................................................................................................. 56

Chapter 4 Parametric Procedures ...................................................................................................... 59

4.1 Inference About One Population Mean Using the One Sample t Procedure ........................... 59

4.1.1 Two sided confidence intervals and hypothesis tests:.......................................................... 59

4.1.2 One sided confidence bounds and hypothesis tests:............................................................ 64

4.2 Inference About Two Population Means Using Two Sample t Procedure..................................... 69

4.3 Inferences about Two Population Means Using the Paired t Procedure ...................................... 73

Chapter 5 ANOVA ............................................................................................................................. 78

5.1 The Analysis of Variance F-test for Equality of k Population Means............................................. 78

5.2 Linear Combinations and Multiple Comparisons of Means ......................................................... 87


5.3 Randomized Block Designs ....................................................................................................... 94

5.4 2-Way ANOVA ................................................................................................................. 101

Chapter 6 Non-Parametric Statistics ................................................................................................ 114

6.1 Wilcoxon (Mann-Whitney) Rank Sum Test for 2 Independent Samples................................ 114

6.2 Inferences About Two Population Medians Using Wilcoxon’s Signed Rank Tests for Paired Differences ................................................................................................................................. 119

6.3 The Kruskal-Wallis Test for k independent samples ............................................................ 124

Chapter 7 Simple Linear Regression ................................................................................................. 130

7.1 Linear Regression Model........................................................................................................ 130

7.2 Residual Analysis ............................................................................................................. 135

Chapter 8 Multiple Linear Regression............................................................................................... 139

8.1 The Multiple Regression Model ........................................................................................ 139

8.2 Dummy Variables in Regression Analysis................................................................................. 149

8.3 Selecting the Best Regression Equation ............................................................................. 152

Chapter 8 One Sample and Two Sample Proportion .......................................................................... 158

8.1 Introduction: One Sample Inference about Proportion ............................................................ 158

8.1.1 Examining the Data................................................................................................... 159

8.1.2 Confidence Intervals for 1 Proportion ....................................................................... 160

8.1.3 Hypothesis Tests for 1 Proportion.............................................................................. 163

8.1.4 One Sided Hypotheses Tests for One Proportion........................................................ 166

8.2 Introduction: Two Independent Samples Inference for Proportions.......................................... 167

8.2.1 Examining the Data................................................................................................... 167

8.2.2 Confidence Intervals for 2 Proportions ...................................................................... 169

8.2.3 Hypothesis Tests for 2 Proportions ........................................................................... 172

8.2.4 One Sided Hypotheses Tests for 2 Proportions .......................................................... 175

8.3 Online Proportion Calculators ................................................................................................ 176


Introduction

This manual covers the basic instructions for the computer lab component of Statistics 252. It is to be used in conjunction with SPSS (Version 24.0) for Windows XP (or higher versions) and the corresponding textbook in STAT 252. It is written for those who have no previous computer experience.

The purpose of the computer lab is to familiarize students with the use of a statistical software package, to provide them with extra practice in the interpretation of statistical analyses, and to demonstrate some interesting applications of statistics. SPSS, which stands for Statistical Package for the Social Sciences, is a powerful, easy-to-use statistical software package that provides a wide range of basic and advanced data analysis capabilities. SPSS's straightforward command structure makes it accessible to users with a wide variety of backgrounds and experience.

On one hand, we find that statistics is a highly technical subject, with complex formulas and equations that seem to be written almost entirely in Greek, but on the other hand, we find that statistics is interesting and relevant because it provides the means for using data to gain insight into real-world problems. To understand the material presented in this course, you must get involved. Throughout the manual we will illustrate how to do particular types of analyses step by step. In the lab assignments you will work out similar analyses. Through this process you will learn statistics by doing statistics. To maximize your learning, we recommend that you read the text and simultaneously follow along on your computer.


Chapter 1 Introduction to SPSS In this chapter the reader will be introduced to SPSS. After studying this section you should be able to:

1. Start SPSS 2. Use different SPSS windows 3. Enter, edit, and manipulate variables and cases 4. Display numerical summaries 5. Create and edit graphs 6. Save, export and print your results 7. Exit SPSS

1.1 Starting SPSS After you log onto your computer double click on the SPSS icon on your computer’s desktop. Alternatively, from the taskbar on your desktop:

1. Click on the Start button 2. Click on All Programs 3. Click on the SPSS Inc. Menu 4. Click on PASW Statistics 24.0

1.2 The SPSS Environment After starting SPSS, a window as in Figure 1.2 opens. Pictured is the default configuration.

Figure 1.2: SPSS Data Editor Window


The SPSS environment consists of different windows and bars. As you perform data analysis, you will work with those bars and windows. Here is a brief description of the different parts in the SPSS environment.

1.2.1 Title Bar At the top of the main window is the title bar, which shows the SPSS icon and three window buttons. The SPSS icon allows resizing, minimizing, closing, etc. of SPSS. The window control buttons have similar functions.

1.2.2 Windows in SPSS The four most important windows in SPSS are:

1. Data Editor: opens automatically when you start a SPSS session, and displays the contents of the current data file
2. Viewer: opens automatically the first time you run a procedure that generates output, and displays the results of the statistical procedures
3. Chart Editor: opens only after SPSS produces a plot or diagram, and is used for editing
4. Syntax Editor: is used if you wish to run SPSS commands instead of clicking the pull-down menus

Each window has its own menu and toolbar. The Analyze and Graphs menus are available in all windows, so that new output can be generated without switching windows. To activate a window, click on the edge of the desired window, or select the window from the Window menu.

1.2.4 Data Editor Menu Bar The Menu bar is the second horizontal line from the top. It provides easy access to most SPSS features, and it contains twelve drop-down menus:

Figure 1.3: Data Editor Menu Bar

1.2.5 Data Editor Toolbar Beneath the menu bar is the toolbar which provides shortcuts for several important actions. When you click on a button SPSS performs an action or opens a dialog box corresponding to the menu command. If you place the mouse pointer over a button, without clicking, SPSS displays a brief description of the tool in the Status Bar.

Figure 1.4: Data Editor Toolbar


1.3 Entering Data and Defining Variables Data can be manually entered into the Data Editor window. The main components of this window are displayed in Figure 1.2. The following example illustrates data entry and modification, descriptive statistics and graphical summaries: Example 1. The data you are about to analyze have been collected from an exam given to 40 students in an introductory statistics course. Two variables were measured in the students:

1. Marks (Mark of the student, 40-90) 2. Binary Gender (Sex with which student most identifies: M-Male, F-Female)

The following data are used to illustrate data entry and modification, descriptive statistics and graphical summaries:

Marks of F students: 85, 83, 56, 98, 72, 52, 88, 75, 91, 69, 78, 64, 78, 81, 74, 73, 90, 75, 65, 55
Marks of M students: 40, 47, 50, 52, 58, 61, 62, 63, 64, 67, 70, 72, 74, 75, 78, 80, 81, 82, 90, 92

1.3.1 Defining Variables To define a variable

1. Click on the Variable View tab at the bottom of the Data Editor window (see Figure 1.2) 2. Enter a new variable name in the column Name and press the Enter key on the keyboard.

Variable names must begin with a letter and cannot end with a period. After entering the name, default values (Type, Width, ....) are assigned. To manually select the data type, click on the corresponding cell in the column Type.

Figure 1.5: Selecting the Type


Figure 1.6: Defining the Values

In our example we have two variables: Marks (numeric) and Sex (categorical). For the categorical variable click on the radio button for String in the Variable Type dialog box (see Figure 1.5). To define possible values of the variable Sex (M for male and F for female), click on the Values cell in the row for the variable, enter each value with the corresponding label, and then click on the Add button (see Figure 1.6). To delete a variable, select the corresponding row, and press the Delete key on your keyboard, or click on Edit>Clear. To insert a new variable between existing variables:

1. Click on the row below the place you wish to insert the variable 2. Click on Data>Insert variable

1.3.2 Entering and Editing Data Tabs at the bottom of the screen (see Figure 1.6) can be used to switch between Variable View and Data View. Data are entered in the Data View window, a grid with rows corresponding to subjects (or cases) and columns corresponding to variables (Marks and Sex in our case). The cells in the spreadsheet contain only data values; they cannot contain formulas. Enter the values for all cases from Example 1. Data values are not recorded until you press Enter or select another cell.


Figure 1.7: Entering Data

1. To correct a value in a cell:

(a) Click on the cell, type the correct value, and press Enter. (b) To change a portion of the cell contents, double-click on the cell and use the arrow, Backspace and Delete keys to make the changes. (c) To delete the values in a range, highlight the area and press Delete. (d) To undo a change use Edit> Undo.

2. To insert a new case (row)

(a) Select the row below the row where you wish to insert (b) Click on Edit>Insert case

3. To delete a row

(a) Click on the case number (b) Click on Edit>Clear

4. To copy data using copy and paste commands

(a) Highlight the data to be copied. (b) Choose Edit >Copy (c) Select the place where the cells are to be pasted (change the columns type if necessary) (d) Click on Edit>Paste


5. To change the width of one or more columns

(a) Highlight the column(s). (b) With your mouse, point to the line dividing a selected column from another column. The cursor becomes a two-sided arrow. (c) Drag the border until the columns have the desired width.
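If you prefer typed commands to menus, Example 1 can also be entered through the Syntax Editor mentioned in Section 1.2.2. The lines below are a minimal sketch, not part of the lab instructions; only the first few cases are shown, and the remaining marks would be listed in the same Marks-then-Sex pattern before END DATA.

* Define the two variables and read the values in free format.
DATA LIST FREE / Marks (F3.0) Sex (A1).
BEGIN DATA
85 F 83 F 56 F 98 F
40 M 47 M 50 M 52 M
END DATA.
* Display the cases that were just read.
LIST.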

1.4 Saving and Reading Data Files It is strongly recommended that you save your work regularly. This will prevent you from having to re-enter data should the computer crash.

1.4.1 Saving a Data File To save a new SPSS data file make the Data Editor the active window and follow the steps:

1. Select File>Save As 2. In the Save in box select your network drive or the name of your USB flash drive. 3. In the File name box, type lab1 and click on Save.

The current data file will now be saved as lab1.sav, where sav is the file extension. For future saves (to overwrite the old version of the current file with the new version), simply use File>Save or use the ctrl-s keys. To save a data file as an ASCII file, in the Save As type box you have to choose Tab-delimited (*.dat). To practice, save the data from example 1 as lab1.dat.

Figure 1.8: Saving Data
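The same saves can be issued from a syntax window. This is a hedged sketch: the file names match the example above, but in the lab you would normally type the full path to your network drive or USB flash drive rather than a bare file name.

* Save the active dataset as an SPSS data file.
SAVE OUTFILE='lab1.sav'.
* Save a tab-delimited copy, writing the variable names in the first row.
SAVE TRANSLATE OUTFILE='lab1.dat'
  /TYPE=TAB /FIELDNAMES /REPLACE.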

1.4.2 Reading Existing Data Files SPSS can read different types of data files.

1. To read an SPSS data file

(a) Select File> Open>Data (b) Click on the data file you wish to open

SPSS data files have the extension .sav, and they contain not only the data, but also the variable names and formats.


2. To read a text file

(a) Select File>New>Data (b) Select File>Read Text data (c) In the Files of Type box choose the right extension (usually .dat or .txt) (d) In the file name box choose the appropriate path, and click on the text file you wish to open (e) Transfer the data from the text file to the Data Editor window using the Text Import Wizard

To practice open a new SPSS Data Editor Window and transfer the data from lab1.dat file.
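For reference, files can also be opened with syntax. GET FILE opens a saved .sav file. The GET DATA command below is only a rough sketch of reading the tab-delimited lab1.dat file (the Text Import Wizard pastes a similar but longer command); the delimiter, first-case and format settings are assumptions that you may need to adjust.

* Open a previously saved SPSS data file.
GET FILE='lab1.sav'.

* Read the tab-delimited text file created earlier (adjust the path and formats as needed).
GET DATA
  /TYPE=TXT
  /FILE='lab1.dat'
  /DELIMITERS='\t'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /VARIABLES=Marks F3.0 Sex A1.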

1.5 Manipulating Data

1.5.1 Creating a New Variable New variables can be created using Transform > Compute in the Data Editor menu. A dialog box will appear. Choose/type the target variable (the column where you wish the created entries to be stored) and create the Numeric Expression of interest. Variables in the numeric expression should be selected and moved from the columns on the left. The calculator should be used to insert numbers and symbols. Note that the "If" dialog box can be chosen if you wish to apply data transformations to selected subsets of cases. Click OK when you are done. Example: A private agency is investigating a new procedure to teach typing. The following table gives the scores of eight students before and after they attended this course. Open a new SPSS Data Editor window and enter the data in two columns labelled Before and After. Save the data file as Typing.sav.

Students:  1   2   3   4   5   6   7   8

Before:   81  75  89  91  65  70  90  69

After:    97  72  93 110  78  69 115  75

When investigating paired data like this, where two numerical observations have been made on each individual, the data in the column of numerical differences is usually of greater interest than the numerical data in the individual columns. For example, in a data set that measured typing speed, as ours does, the differences noted by individuals in their typing speed would be of interest. (Another example might be a data set that measured wages before and after an increase in inflation. Here the differences in wages would be of interest to people.) We will create a new variable, diff, containing the differences in the scores after and before the crash course, using the commands below. For example, to create the variable diff using the formula: diff = After - Before:

1. Choose Transform > Compute 2. Type diff in the Target Variable box 3. Highlight the variable After. Use the arrow key to bring it to the Numeric Expression box. 4. Use the calculator to place a minus sign (a - ) in the Numeric Expression box.


5. Highlight the variable Before. Use the arrow key to bring it to the Numeric Expression box. 6. click OK.

Your screen should look as in Figure 1.9.

Figure 1.9: Compute Variable Dialog Box

If you look at your Data Editor, a new column diff will now contain the following values.

diff: 16  -3  4  19  13  -1  25  6

Although this data set of differences is very small (only 8 observations), we can seek some patterns of note in the data. It appears that most of the differences are positive, so their average (mean) would also be positive. And if we were to plot the differences on a number line, they would be relatively spread out, as they vary considerably, with a minimum of -3 and a maximum of 25. Figures 6.11 and 6.13 present some descriptive statistics and graphs that can be created to describe the data in the column of numerical differences in this typing data, and students might wish to take a quick look ahead just to get a flavour of what can be done. Information needed for the creation and interpretation of such material will be offered later in the course.
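The same new variable can be created with one line of syntax; this is a minimal sketch equivalent to the Transform > Compute steps above.

* Create the column of differences (After minus Before).
COMPUTE diff = After - Before.
EXECUTE.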


1.5.2 Recoding Variables You may recode the values of a variable into the same variable, or into a new variable formed with the recoded values. The preferred method is to use a new variable in order to preserve the original values. Recoding into the same variable Suppose that the categorical variable sex needs to be recoded as a numerical variable (some tools will not work with categorical data), by assigning F=0, and M=1. To recode into the same variable:

Figure 1.10: Recoding into the same variable

1. Select Transform > Recode into Same Variable 2. Double click on the variable you want to recode (Sex) 3. Select Old and New Values 4. Enter the old and the new values and click on the Add button 5. Select Continue to close the Old and New Values dialog box, when you have indicated all the recode instructions 6. Click on OK to have SPSS execute your instructions

Figure 1.11: Selecting the Old and the New Values


In the Data Editor you find that sex is now expressed by one of two integers, 0 or 1. Click on the Variable View tab and change its type to numeric. Recoding into a different variable Referring to our example, we will define a new variable Grades with the values in Table 1.1.

Grade   Marks
A+      95-100
A       90-94
A-      85-89
B+      80-84
B       75-79
B-      70-74
C+      65-69
C       60-64
C-      55-59
D+      50-54
D       45-49
F       0-44

Table 1.1: The variable Grades

To recode into the new variable:

1. Select Transform > Recode Into Different Variables
2. Double click on the variable you want to recode (marks) and write the name and the label of the new variable (Grades)
3. Click on Change
4. Select Old and New Values
5. Check the Output variables are strings button, since Grades is categorical.
6. Choose the appropriate range for the old values, enter new values, and click on the Add button
7. Select Continue to close the Old and New Values dialog box, when you have indicated all the recode instructions
8. Click on OK to close the Recode into Different Variables dialog box

Figure 1.12: Recoding into Different Variables


Figure 1.13: Defining the New Values
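Both recodes can also be written as syntax. The sketch below recodes Sex into a new numeric variable instead of overwriting it (the name sexnum is an illustrative choice, not part of the example), and then builds the string variable Grades from marks using the ranges in Table 1.1.

* Recode the string variable Sex into a new numeric variable (F = 0, M = 1).
RECODE Sex ('F'=0) ('M'=1) INTO sexnum.
* Declare a new string variable, then fill it with letter grades.
STRING Grades (A2).
RECODE Marks (95 thru 100='A+') (90 thru 94='A') (85 thru 89='A-')
  (80 thru 84='B+') (75 thru 79='B') (70 thru 74='B-')
  (65 thru 69='C+') (60 thru 64='C') (55 thru 59='C-')
  (50 thru 54='D+') (45 thru 49='D') (0 thru 44='F') INTO Grades.
EXECUTE.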

1.5.3 Selecting Cases Sometimes you might want to select a subset of cases from your data file for a particular analysis. To select a subset of cases:

1. Click on Data > Select Cases 2. Click on the If condition is satisfied radio button 3. Double click on the names of the variables, complete the condition, and click on Continue 4. Click on OK

Figure 1.14: Selecting cases


For practice select only the female-identifying students who got a B. When you return to the Data Editor window, you should notice a new column labelled filter_$, containing 0 for the unselected cases and 1 for the selected cases. The row numbers of the unselected cases have also been marked with an oblique bar. See Figures 1.15 and 1.16.

Figure 1.15: Completing the Condition

Figure 1.16: The Data Editor Window with Selected Cases

Any further analysis of the data will include only the selected cases! To undo the selection you must click on Data > Select Cases > All Cases > OK. It is prudent to copy selected cases to a new dataset when working with selected cases (rather than filtering within the original dataset); subsequent work can then be done in the appropriate dataset.
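When Select Cases is run from the menus, SPSS pastes syntax much like the sketch below. It assumes Sex is still stored as the strings F and M and that the Grades variable from Section 1.5.2 exists.

* Keep only female-identifying students with a grade of B.
USE ALL.
COMPUTE filter_$ = (Sex = 'F' AND Grades = 'B').
FILTER BY filter_$.
EXECUTE.

* Undo the selection so that later analyses use all cases.
FILTER OFF.
USE ALL.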

1.5.4 Sorting Data Suppose that we would like to sort the data according to the students’ marks. To do this:

1. Click on Data>Sort Cases 2. Double click on the variable(s) you want to sort by (marks)


3. Select the Sort Order button Ascending 4. Click on OK

Figure 1.17: Sort Cases Dialog Box

The cases in the Datasheet are now sorted in ascending order according to marks. See Figure 1.18.

Figure 1.18: Partial Worksheet Window with Sorted Cases
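The equivalent syntax is a single command; (A) requests ascending order, and (D) would request descending order.

* Sort the cases by marks, smallest to largest.
SORT CASES BY Marks (A).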

1.6 Drawing a Graph To draw any graph (pie chart, bar graph, boxplot, histogram, etc.):

1. Click on Graph in the menu bar 2. Click on Legacy Dialogs


3. Select the appropriate option. We will do a simple example here, and introduce further options later in the course.

A pie chart is a pictorial representation of a variable with values that are categories, such as Grades (the coded Marks). The size of a slice of the pie represents the count (frequency) of different grades in a proportional manner. The more individuals with a certain grade, the larger the slice of the pie.

To draw a pie chart of Grades (see Figure 1.19 for the dialog box and Figure 1.20 for the pie chart) use the following commands.

1. Select Graph > Legacy Dialogs > Pie
2. Select Summaries for groups of cases and click Define
3. In the Define Pie: Summaries for Groups of Cases box place the cursor in the Define Slices by box and select and move the column Grades to the box
4. Leave the Slices Represent button N of Cases highlighted. Note the other options.
5. Click the Titles button, type the title "Pie Chart of Grades" in Line 1 and click Continue
6. Click OK

Figure 1.19 Dialog box for creating Pie Chart of Grades


Figure 1.20: Pie Chart of Coded Marks (Grades)

While the above pie chart is sumptuously colourful to view, the plethora of slices makes it difficult to read. A graph that illustrates the frequencies of grades in side-by-side bars with the grades proceeding in an increasing manner would be a more appropriate, although prosaic, way of representing this data. We will look at how to create and interpret other graphs later in the course. There is often more than one way to create a graph.
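As a point of comparison, the legacy pie chart dialog pastes syntax roughly like this minimal sketch.

* Pie chart of the counts for each grade.
GRAPH
  /PIE=COUNT BY Grades
  /TITLE='Pie Chart of Grades'.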

1.7 Computation of Numerical Summaries

SPSS offers a wide variety of statistical tools to help you describe your data, such as frequency tables and descriptive statistics.

1.7.1 Frequencies

We are often interested in how often a particular value appears in a set of data. Here we will use SPSS to create a count (frequency) and percent distribution for the Grades.

1. Click on Analyze > Descriptive Statistics > Frequencies 2. Double-click on the variable you want to analyze (Grades) 3. Place a check in the "Display frequency tables" box 4. Click on OK


Results appear in the Viewer window.

Figure 1.21: The Frequencies Dialog Box

Figure 1.22: The Frequencies and Relative Frequencies Distribution Table for Grades

Note that you can use the Statistics button to calculate many descriptive statistics (including minimum, maximum, mean (a measure of centrality) and standard deviation (a measure of spread)). Also notice that you can use the Charts button to draw a pie chart or bar chart. We will further explain and investigate how relevant statistics and charts can be produced this way as the course continues.
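A hedged syntax sketch of the same idea is shown below: the first command produces the frequency table for Grades together with a bar chart of the counts, and the second shows how summary statistics can be requested for a numeric variable such as marks while suppressing its (long) frequency table.

* Frequency and percent table for Grades, plus a bar chart of the counts.
FREQUENCIES VARIABLES=Grades
  /BARCHART FREQ
  /ORDER=ANALYSIS.

* Summary statistics for a numeric variable, without the frequency table.
FREQUENCIES VARIABLES=Marks
  /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
  /FORMAT=NOTABLE.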


1.7.2 Descriptive Statistics Descriptive statistics can be obtained by selecting Analyze > Descriptive Statistics > Descriptives and Options. The Descriptives Dialog box (Figure 1.23) is shown with the Options popup box (Figure 1.24). Here Marks is the variable of interest, and Mean (a measure of data centrality), Standard Deviation (a measure of data spread), Minimum and Maximum have been chosen in the options dialog box.

The Descriptives Dialog Box (Figure 1.23) with the Options Popup box (Figure 1.24)

Figure 1.25 shows how the output is displayed in the Viewer Window. Note that the Viewer window is divided into two regions.

1. The Outline pane: contains the table of contents; 2. The Contents pane: contains the tables, charts, text output.

Clicking on an item in the outline pane makes that item current in the contents pane. In this example, the components (Title, Notes, Descriptive Statistics) were obtained with the Descriptives procedure, and you can see from the output that the variable of interest was marks.


Figure 1.25: Descriptive Statistics in Viewer Window
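The Descriptives procedure pastes syntax along these lines (a sketch matching the options chosen above).

* Mean, standard deviation, minimum and maximum for marks.
DESCRIPTIVES VARIABLES=Marks
  /STATISTICS=MEAN STDDEV MIN MAX.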

1.7.3 Explore further: (for subsets of data of interest) SPSS offers an option that allows you to calculate descriptive statistics and plots for subsets of numerical data of interest. That is, you can ask it to consider a "dependent" quantitative (numerical) variable for subsets of a "factor" variable that is qualitative (categorical). For example, we could calculate descriptive statistics and plots for the variable marks for subsets of the "by" variable sex (M and F). We show the original set up here for exploring marks by sex, but leave exploring the strengths of this particular tool as appropriate for later in the course. Students should feel free to poke about a bit now, of course!

1. Click on Analyze > Descriptive Statistics > Explore...
2. Move the quantitative variable (marks) to the Dependent List and the categorical variable sex to the Factor List using the arrows
3. Choose further options and enjoy.

Figure 1.26: The Explore Dialog Box (set up for exploring a numerical variable in categorical subsets)


Finally, students should note that you can even work with more than one numerical and factor variable at a time in the list boxes, so this little tool is deceptively powerful.
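For reference, the Explore set-up above corresponds roughly to the EXAMINE command; the subcommands shown here are a sketch and can be trimmed or extended as you explore.

* Descriptive statistics, boxplots and histograms for marks, separately for each level of sex.
EXAMINE VARIABLES=Marks BY Sex
  /PLOT=BOXPLOT HISTOGRAM
  /STATISTICS=DESCRIPTIVES
  /NOTOTAL.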

1.8 Saving Each part of the SPSS work you have created will need to be saved as a separate file. Save commands are selected from the File menu. To ensure you save the correct window, make certain that it is selected first. All data files have the extension .sav. All output files have the extension .spv.

1.8.1 Saving Results To save the contents of a Data Editor or a Viewer window

1. Select File>Save As 2. In the Save in box select your network drive or the name of your USB flash drive. 3. In the File name box, type your datasheet name or your output file name 4. In the Save as type box, choose the right extension 5. Click on the Save button.

1.8.2 Opening previously saved data files or output

To open a SPSS file that you have previously saved

1. Select File>Open > Data (if a .sav file) or File>Open>Output (if a .spv file) 2. In the Look in box select your network drive or the name of your USB flash drive. 3. In the File name box, find your file 4. In the Files of type box, choose the right extension 5. Click on the Open button to open the data or output file

1.9 Transferring Output into Word SPSS output can be transferred to virtually any word processor, and edited. Transferring the output allows you to add appropriate comments and to appropriately format your report. After you have saved both the data and the results from SPSS, move SPSS to the left half of the screen by placing your cursor on the bar above the top menu bar and moving it sharply to the left. Doing this allows SPSS to remain active, so it will be easily accessible if you need it. Open Word from the Start menu, and move it to the right half of the screen by placing your cursor on the bar above the top menu bar and moving it sharply to the right. (Alternatively, you can minimize and maximize the SPSS and Word screens as you work, but most people prefer to work on split screens.)

Option 1: Select the output you want to transfer by clicking the item in the outline or the contents pane of the Viewer. Click on Edit > Copy or use the ctrl-c keys. Next go to Word and move the cursor to where you want the output to be inserted, and click on the paste icon located on the toolbar, or select Edit > Paste, or use the ctrl-v keys. Once the output has been inserted, you may add the appropriate comments, titles, etc. If you wish to resize a graph, click on the graph. The border will then contain eight small squares on it. To resize, click on one of the squares and drag the frame to its desired size. Note that your computer can freeze if you try to do such movement of SPSS objects too quickly. Therefore it is recommended that students always use the SNIPPING TOOL (Option 2) instead.

Option 2: The SNIPPING TOOL can be accessed from the start menu accessed from the bottom left of your monitor screen. This tool allows you to take a "snapshot" photo of any desired portion of your monitor screen. After clicking to open the snipping tool, select the portion of the monitor screen you want to copy with the red cross cursor, let go, and then copy and paste your snapshot to your Word document. Snapshots copy more clearly than SPSS objects and are much easier to reshape to fit nicely in your Word document. The extra clicks to get a snapshot to copy are well worth it, as no-one wants to deal with a frozen computer and the loss of valuable work and time. ALWAYS USE THE SNIPPING TOOL. While editing your document you should save it from time to time. This will ensure that you do not lose your work in the event of a computer failure.

1.10 Printing Your Output Generally, you will copy and paste appropriate output from SPSS to a Word document (as you will do for your assignments) and then simply print your word document. Note, though, that although you are unlikely to do so, you can print the content of the Contents pane in the Viewer window, as follows.

1. Select the items you want to print in the outline or the contents pane 2. Select File > Print Preview to preview your printout 3. Select File > Print 4. Check the Selection button from Print range in the Print dialog box 5. Click OK


Chapter 2: Graphs and Descriptive Statistics In this chapter the reader will use SPSS to create material with which to describe statistical data. After studying this section you should be able to:

1. Use SPSS to create and edit bar charts for one categorical variable
2. Use SPSS to create and edit boxplots and histograms for one numerical variable
3. Use SPSS to display descriptive statistics output for one numerical variable
4. Use SPSS to display descriptive statistics output for one numerical variable by levels of a categorical variable
5. Interpret the charts and output above

2.1 Data Variables Example A group of students were randomly surveyed, and asked questions related to these variables. Results can be found in the dataset SMOKEFITNESS.sav, which contains 5 columns, each corresponding to one question (each of which can be considered a random variable). Data coding was as follows.

Variable and values                                         Type of Variable
GENDER (Female – F, Male – M)                               Categorical
MINPA (daily minutes of physical activity)                  Continuous Numerical (discrete values recorded)
SMOKE (Yes – Y, No – N)                                     Categorical
FITLEV (reported fitness level of 1-Poor to 10-Excellent)   Ordinal (ranked) Discrete
RSTHRT (resting heart rate)                                 Continuous Numerical (discrete values recorded)

Figure 2.1: Summary of Data Set Variables

2.2 Bar Chart

2.2.1 Bar Chart of a Categorical Variable A bar chart is a pictorial representation of a categorical or ordinal (ranked) variable. One bar is created for each level (value) of the categorical variable and the height of the bar represents either the number (count) or the percent of units in the data that have that value. FITLEV is an ordinal variable that can be represented with a bar chart. One set of commands to draw a bar chart of the counts for FITLEV follows (see Figure 2.2):

1. Choose Graphs>Legacy Dialogs>Bar 2. Choose the Simple icon


3. Choose Summaries for Groups of Cases and select Define 4. Move FITLEV to the Category Axis by highlighting the variable name and using the select arrow. 5. Be sure that the Bars Represent section has the N of cases button chosen. 6. Choose Ok

Figure 2.2: Dialog Box for a One Categorical Variable Bar Chart

Figure 2.3: Bar Chart of Counts for One Categorical (Ordinal) Variable (FITLEV)


Note that a title could be added to the produced graph by clicking on Titles in the Dialog box and adding a title with wording as one sees fit (say, "Counts for Self-Assessed Levels of Fitness as Reported by Students"). We might decide to change the counts on the Y axis to percents. We could do so by choosing % of cases in the Dialog Box in Figure 2.2.

Figure 2.4 Bar Chart: Options to change count to percent on Y axis

The resulting bar chart is shown in Figure 2.5.

Figure 2.5 Percents of Participants reporting Fitness Levels (1- Very unfit to 10 – Very fit)


There are many other options for modifying a bar chart, and students are encouraged to explore. Students should note that a bar chart can also be created using the Analyze>Descriptive Statistics>Frequencies path and the Chart Builder (to be introduced later).
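For those curious about syntax, the legacy bar chart dialog pastes commands like the following sketch; the second GRAPH command is the percent version described above.

* Bar chart of counts for each fitness level.
GRAPH
  /BAR(SIMPLE)=COUNT BY FITLEV.
* Bar chart of percents instead of counts.
GRAPH
  /BAR(SIMPLE)=PCT BY FITLEV.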

2.3 Boxplot

2.3.1 Boxplot of a Numerical Variable A boxplot is a way of representing data in a pictorial manner. A box with vertical whiskers is created using the data. The lower end of the bottom whisker corresponds to the minimum value of a data set, and the upper end of the top whisker to the maximum value of a data set. 25% of the data lies below the value at the bottom edge of the box, Q1 (75% lies above it), and 75% of the data lies below the top edge of the box, Q3 (25% lies above it). The box shows where the middle 50% of the data lies. A line in the box corresponds to the middle (median) value in the data; 50% of the data lies above it and 50% lies below it. One set of commands to create a boxplot for RSTHRT follows. Figure 2.6 illustrates the dialog box and Figure 2.7 the boxplot itself.

1. Click on Graphs > Legacy Dialogs > Boxplot 2. Choose the Simple Icon and choose the button Summaries of Separate Variables 3. Click define 4. Move RSTHRT to the Boxes Represent box. 5. Click OK

Students should note that boxplots can also be created using the Analyze>Descriptive Statistics>Explore path and the Chart Builder (to be introduced later).

Figure 2.6: Boxplot: One Y, Simple Dialog Box for a One Variable Boxplot


Figure 2.7: Boxplot of RSTHRT (Resting Heart Rate)

To add a title to your boxplot, you would click anywhere on the plot to bring up the chart editor box and then select the circled icon in Figure 2.8. This will bring up your boxplot with a title placeholder labeled "Title". Place your cursor inside the Title placeholder and change it to an appropriate title (say, "Boxplot of Resting Heart Rate"). Close the chart editor, and your graph will appear with a title. This is an example of a situation where a title cannot be added until after you have created a graph, and illustrates why it is a good idea to get used to using the Chart Editor! Take the time to observe all the other icons in the pop-up box. In time you will develop your own way of creating graphs – there are many options for most of the graphs you will learn in the course.

Figure 2.8: The Chart Editor choosing the icon to display a chart to which you can add a title.


This boxplot indicates that the middle 50% (from the first quartile to the third quartile) of resting heart rates are from about 71 to 79 beats per minute, with tails extending down to about 57 beats per minute and up to about 86 beats per minute. That these tails are very long is evident, and the values marked with an asterisk (*) in the lower end are so far away from the middle of the data that they are viewed as "outliers".
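A hedged syntax equivalent uses the EXAMINE command, which is what the Explore path mentioned above runs behind the scenes.

* Boxplot of resting heart rate, with no accompanying statistics table.
EXAMINE VARIABLES=RSTHRT
  /PLOT=BOXPLOT
  /STATISTICS=NONE.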

2.3.2 Side by Side Boxplot: Numerical Variable by Levels of Categorical Variable Sometimes it is of interest to look at boxplots of one variable by (for) subgroups (levels) of another variable. Commands to draw boxplots of RSTHRT by SMOKE (for both nonsmokers and smokers) follow; the corresponding dialog boxes are shown in Figures 2.9 and 2.10, and the boxplot chart can be viewed in Figure 2.11. Two options for creating this side by side boxplot are suggested below. Students should try both sets of commands in order to get used to how they work, and the terminology employed by SPSS.

Option 1:
1. Click on Graphs > Legacy Dialogs > Boxplot
2. Choose the Simple Icon and choose the button Summaries of Groups of Cases
3. Click Define
4. Move RSTHRT to the Boxes Represent box and SMOKE to the Category Axis box.
5. Click OK

Option 2:
1. Click on Analyze > Descriptive Statistics > Explore
2. Choose RSTHRT for your Dependent List and SMOKE for your Factor List.
3. Select Plots in the Display box
4. Click the Plots button
5. In the pop-up, in the Boxplots section, choose Factor levels together.
6. Click Continue and OK

Figure 2.9: Command Windows for 1st Approach


Figure 2.10: Command Windows for 2nd Approach

Figure 2.11: Side by Side Boxplots of RSTHRT for the Two Levels (N and Y) of the Categorical Variable SMOKE

Students should note that the Chart Builder could also be used to create these plots. We will use the chart builder later in the course. Note that the median resting heart rate for the non-smokers is slightly lower than the median resting heart rate for the smokers. However, the long low tail whisker on the non-smokers indicates that there are several non-smokers in the bottom ¼ of the data with low resting heart rates, while the low tail whisker in the smoker group indicates that the individual with the low resting heart rate in that group is an "outlier" (unusual). This is a result we might intuitively expect. Note that the numbers on the axis of a boxplot reference the values that the variable can take on. When side by side boxplots are compared, it is important to remember that unless the sample sizes are relatively equal, the number summaries provided could be misleading, although we do remark that the median is less influenced by outliers than the mean.
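A syntax sketch for the side by side version simply adds the grouping variable after BY.

* Separate boxplots of RSTHRT for non-smokers (N) and smokers (Y).
EXAMINE VARIABLES=RSTHRT BY SMOKE
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.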



2.4 Histogram

2.4.1 Histogram of a Numerical Variable For this section we will return to the Marks dataset from Chapter 1. A histogram is a pictorial representation of a numerical variable. Bars (bins), almost always of equal width, are created; the leftmost bar is made so that its interval covers the minimum data value and the rightmost bar is made so that its interval covers the maximum data value. The height of each bar represents the number (count) of units with numerical values between the lower and upper endpoint values of its respective bar. One way to draw a histogram of Marks follows (see Figure 2.12 for the dialog box set up and Figure 2.13 for the histogram):

1. Choose Graph > Legacy Dialogs > Histogram 2. Bring marks to the Variable box 3. Click on Titles to create a relevant title (say “Histogram of Marks”) 4. Choose Ok

Figure 2.12: Histogram Dialog Box


Figure 2.13: Frequency Histogram of Marks
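Before moving on to editing, note that the same frequency histogram can be produced with a short piece of syntax; this sketch mirrors steps 1 to 3 above.

* Frequency histogram of marks with a title.
GRAPH
  /HISTOGRAM=Marks
  /TITLE='Histogram of Marks'.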

It is time to talk about editing a bit more. Recall that to edit any chart, you can double-click on the chart you wish to edit to get the chart editor. You can then click on the area of the chart that you wish to modify, and this will pop up a Properties box with several tabs, and you can then work on the various tabs to change the display as you wish. You can also edit charts from the menus or from the toolbar. Figure 2.14a shows the Properties box that pops up if you click on a bar in the histogram. It is set on the Binning tab. Figure 2.14b shows the Properties box that pops up if you click on a number on the x axis of a histogram. It is set on the Scale tab. These changes allow the interval width to be 10, which is more familiar for marks, and allow midpoints to show on the scale. Figure 2.15 shows the final result of making these changes.

Figure 2.14a: Edit the bin (interval) width

Figure 2.14b: Edit position of numbers on scale


Figure 2.15: New histogram with modified bins and scale numbers

Note that we have produced a frequency histogram. To produce a percent histogram, you take the following steps. Click on a number on the y scale in the frequency histogram in the chart editor and in the resulting Properties box, go to the Variables tab. Upon clicking in the blank box next to the word percent, a drop down will appear, and the Y Axis option should be selected. Then click Apply and Close in the Properties box. Your histogram will now be in percent.

Figure 2.16a Properties box in which you will select Y Axis for percent when the dropdown box appears.

Figure 2.16b New Percent Histogram of Marks


2.4.2 Side by Side Histograms of a Numerical Variable by Levels of Categorical Variable

Sometimes it is of interest to look at histograms of one numerical variable by (for) subgroups (levels) of another categorical variable. The following commands illustrate an efficient way to generate a side by side histogram (Figure 2.17a) of Marks by Sex and the resulting output (Figure 2.17b).

1. Select Graphs>Legacy Dialogs>Histogram
2. In the Histogram dialog box, highlight marks and use the arrow to bring it to the Variable box
3. Highlight sex and use the arrow to bring it over to the Panel by Rows box
4. Click OK.
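Equivalent syntax, as a sketch (assuming the variables are named marks and sex as above), panels the histogram by rows of sex; the matching scales are handled automatically:

* Histogram of marks panelled by rows of sex.
GRAPH
  /HISTOGRAM=marks
  /PANEL ROWVAR=sex
  /TITLE='Histogram of Marks by Sex'.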

Figure 2.17a: Choosing the Numerical Variable (marks) and Panel By Rows Categorical Variable (sex).

Figure 2.17b: A side-by-side (top over bottom) graph comparing the histograms of marks for students most identifying as female and male. Note the matching scales!

Note that you can do a panel by columns to produce side by side frequency histograms. However, SPSS has a glitch where it will not produce percent histograms if you panel your categorical variable by columns. We see that female grades have a narrower range than male grades (about 52 to 98 versus about 38 to 98), and that they do not have as many individuals with lower grades (below about 50). It also looks like the female grades are higher on average than the male grades (balancing at about 72 versus about 64). As always, it is possible to modify the titles, scales, and bins on these graphs by clicking on appropriate places in the Chart Editor to obtain Properties boxes from which relevant tabs and choices can be made. A frequency histogram is a fine choice here because of the equal numbers of respondents who most identify as female or male. In a situation where the groups (levels) of a categorical variable contain unequal numbers of respondents, a percent histogram would be more appropriate.


To change to a percent histogram on a top over bottom graph, you would bring up the Chart Editor, click on a frequency number on the y axis, and, in the resulting Properties box, go to the Variables tab. Upon clicking in the blank box next to the word Percent, a dropdown will appear, and the Y Axis option should be selected. Then click Apply and Close in the Properties box. Your histogram will now be in percent and the scales on the Y Axis will match for both histograms. Note that the Analyze>Descriptive Statistics>Explore path can also be used to produce two histograms. To take this approach, following the Explore setup, marks would be moved to the Dependent List and sex to the Factor List. Then in the Explore: Plots popup, Histogram would be checked. However, the output produced will be individual female and male histograms that do not have matching scales, and the researcher would have to edit the graphs individually to match scales, so this is not a preferred approach.

2.5 Descriptive Statistics

In Section 1.7, the Descriptive Statistics command was introduced. It allows us to make numerical summaries of data (as well as graphs).

2.5.1 Descriptive Statistics of a Numerical Variable The following commands will calculate SPSS’s default descriptive statistics for the variable MINPA. Figures 2.18a and 2.18b will illustrate the dialog boxes of interest, and Figure 2.19 will present the descriptive statistics output from the Viewer window.

1. Select Analyze > Descriptive Statistics >Explore 2. Put MINPA into the Dependent List box 3. Choose the Statistics button in the Display box. 4. Click on the Statistics button 5. In the Explore: Statistics pop-up box, make sure Descriptives is checked. 6. Click Continue and then OK
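The same default descriptives can also be requested with syntax; a minimal sketch (assuming the SMOKEFITNESS.sav file is the active dataset) is:

* Explore (EXAMINE) descriptives for MINPA, with no plots requested.
EXAMINE VARIABLES=MINPA
  /PLOT NONE
  /STATISTICS DESCRIPTIVES.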

Figure 2.18a: Dialog boxes- descriptive statistics

Figure 2.18b: Obtained Descriptive Statistics


Note that the above commands do not provide the mode or quartiles for the data. We could check Percentiles in the Explore: Statistics box above to get this information (and this approach is taken in Section 2.5.2). For now, though, another way to obtain these details is to use the Analyze>Descriptive Statistics>Frequencies path. The Statistics button in the Frequencies window pops up a Frequencies: Statistics window where we can choose which statistics we wish for our data (see Figure 2.19a). The results of running this approach are in Figure 2.19b.
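For reference, a syntax sketch of this Frequencies approach (again assuming MINPA is in the active dataset) that requests the quartiles and mode while suppressing the long frequency table is:

* Summary statistics, mode and quartiles for MINPA via FREQUENCIES.
FREQUENCIES VARIABLES=MINPA
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN MODE STDDEV MINIMUM MAXIMUM
  /NTILES=4.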

Figure 2.19a: Choosing the Statistics of interest in the Frequencies pop-up box

Figure 2.19b: The resulting statistics.

For the MINPA data, we note that the values range from 0 to 161, and that the middle 50% of the data ranges from 14.5 to 66 in value. The mean (43.71) is higher than the median (34.53), indicating that it is pulled up above the median by folk who get a great deal of exercise. The mode or most common value is 0 (no exercise). The top 25% of the data ranges from 66 to 161, indicating a relatively large top whisker (for this data). The "relatively" large standard deviation (38.22) also indicates the wide variation away from the mean of this data. All values are minutes per day of exercise.

2.5.2 Descriptive Statistics of a Numerical Variable by Levels of a Categorical Variable

Sometimes it is of interest to look at descriptive statistics of one variable by (for) subgroups (levels) of another variable. The following commands illustrate how to create descriptive statistics for the variable MINPA for the two levels of the categorical variable GENDER. Figure 2.20 presents the dialog box that illustrates placement of MINPA (the numerical variable) and GENDER (the categorical variable). Default statistics are used, and the resulting output is shown in Figure 2.21.


1. Select Analyze>Descriptive Statistics>Explore 2. Use the arrows to place MINPA in the Dependent List box and GENDER in the Factor List box 3. Choose Statistics in the Display box. 4. Click on the Statistics button 5. In the Explore:Statistics popup, check Descriptives and Percentiles 6. Click Continue and then Ok
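A syntax sketch of the same Explore run, split by GENDER (variable names assumed as above), is:

* Descriptives and percentiles for MINPA within each level of GENDER.
EXAMINE VARIABLES=MINPA BY GENDER
  /PLOT NONE
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES.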

Figure 2.20: Dialog Box for calculating descriptive statistics for a numerical variable (MINPA) by a

categorical variable (GENDER)


Figure 2.21: Default descriptive statistics and percentiles for a numerical variable (MINPA) for levels of a categorical variable (GENDER – F and M)

(Students should note that the Analyze>Descriptive Statistics>Explore path does not provide a mode option, so if a mode is desired for the female and male data separately, it is necessary to "select cases" to choose and save only the female data and then repeat this to choose and save only the male data, and then to run an Analyze>Descriptive Statistics>Frequencies set of commands on the female data and male data separately. Whew!)

For the MINPA data for females, we note that the values range from 0 to 145, and that the middle 50% of the data ranges from 15.5 to 55.5 in value. The mean (43.70) is higher than the median (37.00), indicating that it is pulled up above the median by folk who get a great deal of exercise. The top 25% of the data ranges from 55.50 to 145, indicating a relatively large top whisker (for this data). The "relatively" large standard deviation (35.18) also indicates the wide variation away from the mean of this data. All values are minutes per day of physical activity.

For the MINPA data for males, we note that the values range from 0 to 161, and that the middle 50% of the data ranges from 10.25 to 73.25 in value. The mean (43.72) is higher than the median (27.5), indicating that it is pulled up above the median by folk who get a great deal of exercise. The top 25% of the data ranges from 73.25 to 161, indicating a relatively large top whisker (for this data). The "relatively" large standard deviation (42.05) also indicates the wide variation away from the mean of this data. All values are minutes per day of physical activity.

On average, minutes of daily physical activity are close for both genders (43.70 and 43.72). The female median (37.00) is larger than the male median (27.50), but in both cases the means are pulled away from the medians by long right tails to their respective maximums (145 for the females and 161 for the males). The top 50% of the female data lies in a smaller range of values (37.00 to 145) than the top 50% of the male data (27.50 to 161.00). Male MINPA is more spread out than female MINPA, as reflected by the standard deviations; they capture the fact that although the minimum is 0 for both genders, the male maximum (161) exceeds the female maximum (145).


Chapter 3: Relationships

In this chapter the reader will use SPSS to create material with which to investigate statistical relationships between variables. After studying this section you should be able to:

1. Use SPSS to display Crosstab tables 2. Use SPSS to create and edit Bar Charts (overall percent and percents within categories), and Histograms for all paired levels of two categorical variables, and Scatterplots (with a line of best fit) 3. Use SPSS to display correlation output 4. Interpret the tables, charts and output above

3.1 Crosstabs

3.1.1 Counts and Percents

Example 2: A group of students were randomly surveyed and asked to answer a fitness survey. Results can be found in the dataset SMOKEFITNESS.sav, which contains 5 columns, each corresponding to one question, each of which can be considered a random variable. Data coding was as follows.

Variable and values                                            Type of Variable
GENDER (Female – F, Male – M)                                  Categorical
MINPA (daily minutes of physical activity)                     Continuous Numerical (discrete values recorded)
SMOKE (Yes – Y, No – N)                                        Categorical
FITLEV (reported fitness level of 1 - Poor to 10 - Excellent)  Ordinal (ranked) Discrete
RSTHRT (resting heart rate)                                    Continuous Numerical (discrete values recorded)

Figure 3.1a: Summary of Data Set Variables

The following commands create crosstab tables to count totals and percentages for each (Gender, Smoke) pair, and calculate the percentages of non-smokers and smokers within each gender group (female and male), and the percentages of males and females within each smoking status group (smoker and non-smoker).

1. Click on Analyze > Descriptive Statistics > Crosstabs
2. In the Crosstabs window, select Row(s): GENDER and Column(s): SMOKE
3. Click on the Cells button
4. In the Crosstabs: Cell Display window, check Counts: Observed, Percentages: Row, Percentages: Column, Percentages: Total
5. Click Continue and OK
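The equivalent Crosstabs syntax (a sketch, assuming GENDER and SMOKE as named above) is:

* Crosstab of GENDER (rows) by SMOKE (columns) with counts and row, column and total percents.
CROSSTABS
  /TABLES=GENDER BY SMOKE
  /CELLS=COUNT ROW COLUMN TOTAL.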


Figure 3.1b: Crosstab table for 2 categorical variables (each with 2 levels)

3.1.2 Reading and Calculating Percent using the Crosstab table We can see from the above table that:

a) The percentage of people who are female and non-smokers is 23/69 = 33.3% b) The percent of females that are non-smokers is 23/37 = 62.2% c) The percent of non-smokers that are female is 23/39 = 59%

Part a) considers the entire group of all survey respondents, part b) only the subgroup of females, and part c) only the subgroup of non-smokers.

3.2 Graphing Relationships

At this point, we will introduce the Chart Builder. It is a moderately powerful tool that allows you to do some drag and drop actions in order to create graphs that cannot be created any other way. Students should follow the path Graphs>Chart Builder, and then click OK in the box that pops up. You will see a screen that looks like the one in Figure 3.2. We will be making Cluster Bars in this section, so students should click on the cluster bar icon in the gallery (second from left, with the side by side blue and green bars).


Figure 3.2: Chart Builder box with Cluster Bar Chart chosen.

3.2.1 Two categorical Variables: Overall count and percent Bar Charts

The following commands produce bars that summarize the counts and percents over the 69 observations in each of the 4 cells in the two by two classification table above. Students can experiment to add titles.

To produce the chart of overall counts shown below in Figure 3.3a:
1. Drag Smoke to X Axis in the chart preview window
2. Drag Gender to Cluster on X: set colour in the chart preview window
3. In Element Properties, leave Edit Properties as Bar 1, and leave Statistic as Count
4. Click on the Set Parameters dropdown and note that "Grand Total" is the default in the Denominator for Computing Percentages box
5. Click OK in the Chart Builder

To produce the chart of overall percents shown in Figure 3.3b:
1. In Element Properties, leave Edit Properties as Bar 1, and change the Statistic dropdown to Percentage
2. Click on the Set Parameters dropdown and note that "Grand Total" remains the default in the Denominator for Computing Percentages box
3. Click Continue in the Set Parameters box
4. Click Apply in the Element Properties box
5. Click OK in the Chart Builder
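Note that if you click Paste instead of OK, the Chart Builder writes its own GGRAPH/GPL syntax for you. A shorter legacy syntax sketch for the two overall charts (assuming the SMOKEFITNESS.sav file is the active dataset) is:

* Clustered bar chart of counts: SMOKE on the x axis, GENDER as the cluster (legend).
GRAPH /BAR(GROUPED)=COUNT BY SMOKE BY GENDER.
* The same chart with overall percents of the grand total instead of counts.
GRAPH /BAR(GROUPED)=PCT BY SMOKE BY GENDER.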


Figure 3.3a OVERALL COUNTS of Smoker Status and Gender Pairings for all data in the data set (with SMOKE on the x axis and GENDER in the clusters (legend) )

Figure 3.3b OVERALL PERCENTS of Smoker Status and Gender Pairings for all data in the data set (with SMOKE on the x axis and GENDER in the clusters (legend) )

Note that students can also do counts and percents that place GENDER on the x axis and SMOKE in the legend. Commands are analogous, except that GENDER is on the x axis and SMOKE on the legend. See Figure 3.3c and Figure 3.3d.

Figure 3.3c OVERALL COUNTS of Smoker Status and Gender Pairings for all data in the data set (with GENDER on the x axis and SMOKE in the clusters (legend) )

Figure 3.3d OVERALL PERCENTS of Smoker Status and Gender Pairings for all data in the data set (with GENDER on the x axis and SMOKE in the clusters (legend) )

3.2.2 Two categorical Variables: Bar Charts for Percents of Levels of one Category within another Category

The following commands will yield bars that illustrate the smoker status percentages within each gender for comparison. The left cluster of bars contains percent bars for female smokers and non-smokers and the right cluster contains percent bars for male smokers and non-smokers. The smoker and non-smoker bars in the left cluster of females will total 100%, as will the smoker and non-smoker bars in the right cluster of males. Students can add appropriate titles. See Figure 3.4.


To cluster on GENDER (GENDER on X axis) (To have SMOKE N and Y percents add to 100% in each GENDER level (F and M) ) 1.Move Gender to X Axis 2.Move Smoke to Cluster on X: set colour 3.Change Denominator for Computing Percentage to “Total for Each X Category”. 4.Click Continue in Set Parameters box 5.Click Apply in Element Properties box 6. Click OK in Chart Builder

Figure 3.4 Percents of Smokers/Non-smokers within Gender clusters

The following commands will yield bars that illustrate the gender percentages within each smoker status for comparison. The left cluster of bars contains percent bars for female and male non-smokers and the right cluster contains percent bars for female and male smokers. The female and male bars in the left cluster of non-smokers will total 100%, as will the female and male bars in the right cluster of smokers. Students can add appropriate titles. See Figure 3.5.

To cluster on SMOKE (SMOKE on X axis) (To have GENDER F and M percents add to 100% within each SMOKE level (N and Y) ) 1.Change Denominator for Computing Percentage to “Total for Each X Category”. 2.Click Continue in Set Parameters box 3.Click Apply in Element Properties box 4.Click OK in Chart Builder

Figure 3.5: Percents of Gender within Smoking Status clusters

3.2.3 Two categorical and one Numerical Variable: Histograms

The following commands yield percent histograms of resting heart rate for the four gender and smoking status combinations. Note that each histogram will have bars totaling 100%, and that SPSS will match the scales for you! This will facilitate comparisons of the resting heart rates for the 4 gender and smoking status combinations. How to change the number of intervals is illustrated. Students can add appropriate titles.


Obtain Frequency Histogram
1. Select Graphs > Legacy Dialogs > Histogram
2. In the Histogram dialog box, choose Variable: RSTHRT, Rows: GENDER, and Columns: SMOKE
3. Select OK

Obtain Percent Histogram
1. Double click on the chart to bring up the Chart Editor
2. Double click on a number on the frequency axis to bring up the Properties box
3. On the Variables tab, in the Percent box, use the dropdown to put Percent on the Y Axis
4. Click Apply

Customizing the Interval Width
1. Double click on a bar to bring up the Properties box
2. On the Binning tab, in the X Axis region, choose Custom and set the interval width to 5
3. Click Apply

Figure 3.6: Percent Histograms of a numerical variable for all combinations of levels of two categorical variables (of RSTHRT for (gender, smoking status) combinations)

3.3 Weighted Cases: Two Categorical Variables Occasionally, data is presented to people with counting already done. For example, the Smoke and Gender data could have been presented to us in another way in SPSS, viz:


GENDER   SMOKING STATUS   COUNT
Female   Non-Smoker       23
Female   Smoker           14
Male     Non-Smoker       16
Male     Smoker           16

Example: The vintage music in a person's home is classified according to Genre (Classical, Jazz, Blues or Rock) and the medium it is held on (Vinyl, Cassette, or CD).

GENRE      MEDIUM     COUNT
Classical  Vinyl      53
Classical  Cassette   25
Classical  CD         57
Jazz       Vinyl      72
Jazz       Cassette   27
Jazz       CD         76
Blues      Vinyl      46
Blues      Cassette   26
Blues      CD         52
Rock       Vinyl      65
Rock       Cassette   34
Rock       CD         55

Note that the data consists of categorical pairs (cases) weighted with appropriate counts (frequencies). The Data>Weight Cases command can be used to tell SPSS to recognize that the data consists of pairs with appropriate counts (frequencies).

To weight cases, the following commands should be entered:
1. Choose the path Data>Weight Cases
2. In the Weight Cases popup, place a dot in the "Weight cases by" button and move the variable COUNT in as the Frequency Variable
3. Select OK

To remove the weights, the following commands should be entered:
1. Choose the path Data>Weight Cases
2. In the Weight Cases popup, place a dot in the "Do not weight cases" button
3. Select OK

SPSS commands can then be used appropriately with data presented this way in order to create frequency tables, bar charts and pie charts for the individual categorical variables. It is also possible to make crosstab tables and cluster bar charts for the pairs of categorical variables. We will do this next.
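In syntax, weighting on and off is a single command each; a minimal sketch (assuming the frequency variable is named COUNT) is:

* Turn frequency weighting on: each row now stands for COUNT cases.
WEIGHT BY COUNT.
* (Run tables and charts here.)
* Turn weighting off again when finished.
WEIGHT OFF.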


3.3.1 Weighted Cases: Frequency Tables, Bar and Pie Charts for Individual Categories

The following commands will use the music data above to create individual frequency tables, bar charts and pie charts for the variables GENRE and MEDIUM. First make sure the cases are weighted by using the Data>Weight Cases command to weight the cases by the frequency variable COUNT (see above!). Then we can proceed.

Analyze>Descriptive Statistics>Frequencies
Move GENRE and MEDIUM into the Variable(s) box
Select the Charts button
In the popup Frequencies: Charts box, select the Bar Charts button
Leave Frequencies as your choice for the Chart Values
Choose Continue
Choose OK

To create pie charts, you can repeat the above and select the Pie Charts option. To create bar charts and pie charts that show percents, you can select Percentages as your Chart Values. Frequencies or percents (according to your setup) will appear on the chart when you create a bar chart. To have frequencies or percents (according to your setup) appear on a pie chart, bring up your pie in the Chart Editor, and double click on the icon with the bars with highlighted top tips on them in the bottom row of the menu bar.
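A syntax sketch of the frequency-table and chart requests above (assuming the weighted GENRE/MEDIUM data are active) is:

* Make sure the cases are weighted by COUNT first.
WEIGHT BY COUNT.
* Frequency tables with frequency bar charts for both variables.
FREQUENCIES VARIABLES=GENRE MEDIUM
  /BARCHART FREQ.
* Repeat with percent pie charts instead.
FREQUENCIES VARIABLES=GENRE MEDIUM
  /PIECHART PERCENT.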

Figure 3.7a: Frequency and Percent Tables for the Individual Variables GENRE and MEDIUM


Figure 3.7b: Percent Bar Charts for the Individual Variables GENRE and MEDIUM

Figure 3.7c: Percent Pie Charts for the Individual Variables GENRE and MEDIUM

Note that an alternative way to create frequency Bar Charts is to use the commands:
Graphs>Legacy Dialogs>Bar
Select the Simple icon
Leave the radio button Summaries of Groups of Cases active
Select Define
Choose N of cases in the Bars Represent area
Place MEDIUM in the Category Axis
Select OK
Repeat the above commands with GENRE as the Category Axis


Similarly, an alternative way to create frequency Pie Charts is to use the commands:
Graphs>Legacy Dialogs>Pie
Select the Simple icon
Leave the radio button Summaries of Groups of Cases active
Select Define
Make sure N of cases is selected in the Slices Represent area
Place MEDIUM in the Define Slices by box
Select OK
Repeat the above commands with GENRE defining the slices

Here you could obtain percent bar and pie charts by choosing % of cases in the original commands.

Comments
GENRE: Jazz is the most popular vintage genre for this person (29.8%), with rock being next most popular (26.2%), classical next (23%), and blues least popular (21%).
MEDIUM: CDs are the most popular vintage medium (40.8%) for this person, with vinyl coming a close second (40.1%), and cassettes coming in as least popular (19%).
Bar charts and pie charts are equally illuminating as pictures. Frequency tables provide exact percentages for our comments.

3.3.2 Weighted Cases: Crosstab Tables

It is generally informative to drill more deeply into data to find more than individual patterns. To obtain a crosstab table (two-way table) to show the genre and medium pairings with counts and percents, perform the following commands:
Analyze>Descriptive Statistics>Crosstabs
Choose GENRE for the Rows and MEDIUM for the Columns
Click "Cells" and choose row percentages, column percentages and total percentages
Click "Continue"
Click "OK"
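The matching Crosstabs syntax, as a sketch (with cases still weighted by COUNT), is:

* Two-way table of GENRE by MEDIUM with counts and all three percentages.
CROSSTABS
  /TABLES=GENRE BY MEDIUM
  /CELLS=COUNT ROW COLUMN TOTAL.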


Figure 3.8 Crosstab Table for Two Categorical Variables

Comments: We will use the crosstab table above to make our comments. This can be a bit brain-crowding, and should be done in concert with the graphs that you will learn to make in the next two sections, as the graphs will help you to see that you are reading the crosstab table correctly. As always, comments should compare information in a relative sense, include values, and consider all that is going on in detail in the interest of completeness.

WITHIN MEDIUM clusters: For Cassettes, classical was the least used genre (22.3%), blues the second least used (23.2%), jazz the third least used (24.1%), and rock the most used (30.4%). For CDs, blues was the least used genre (21.7%), rock the second least used (22.9%), classical the third least used (23.8%), and jazz the most used (31.7%). For Vinyl, blues was the least used genre (19.5%), classical the second least used (22.5%), rock the third least used (27.5%), and jazz the most used (30.5%). In general, blues and classical are less played genres (roughly 19 to 24% each) for all mediums. For cassettes, jazz is also little played (close in percent to blues and classical), with rock standing above at 30.4%, while for CDs, rock is also little played (close in percent to blues and classical), with jazz standing above at 31.7%. For vinyl, both jazz (30.5%) and rock (27.5%) are more played than blues and classical.


WITHIN GENRE clusters: For Blues, cassette was the least used medium (21%), vinyl the next most used (37.1%) and CDs the most used (41.9%). For Classical, cassette was the least used medium (18.5%), vinyl the next most used (39.3%) and CDs the most used (42.2%). For Jazz, cassette was the least used medium (15.4%), vinyl the next most used (41.1%) and CDs the most used (43.4%). For Rock, cassette was the least used medium (22.1%), CD the next most used (35.7%) and vinyl the most used (42.2%). In general, cassettes are the least used vintage medium for all genres, while Vinyl and CDs tend to be used about twice as much as cassettes for all genres. CDs edge out vinyl as the most used vintage medium for Blues, Classical, and Jazz, while Vinyl edges out CDs as the most used vintage medium for Rock.

3.3.3 Weighted Cases: Overall Count and Percent Cluster Bars

It can be difficult to see patterns in crosstab tables, so cluster bars can be very useful. The following commands produce bars that summarize the overall counts and percents over the 588 observations for each of the 12 (GENRE, MEDIUM) cells in the 4x3 crosstab table above. Students can experiment to add titles.

To produce the chart of overall counts shown below in Figure 3.9a:
1. Drag GENRE to X Axis in the chart preview window
2. Drag MEDIUM to Cluster on X: set colour in the chart preview window
3. In Element Properties, leave Edit Properties as Bar 1, and leave Statistic as Count
4. Click OK in the Chart Builder

To produce the chart of overall percents shown in Figure 3.9b:
1. In Element Properties, leave Edit Properties as Bar 1, and change the Statistic dropdown to Percentage
2. Click on the Set Parameters dropdown and note that "Grand Total" remains as the default in the Denominator for Computing Percentages box
3. Click Continue in the Set Parameters box
4. Click Apply in the Element Properties box
5. Click OK in the Chart Builder

Note that the scale change can be a bit disconcerting, but you've done everything right. Students can add appropriate titles. Also note that the only difference in the sets of commands used to produce cluster bar charts with raw data versus with data that has weighted cases is in where you place GENRE and MEDIUM in the original setup.
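As before, a shorter legacy syntax sketch for the two overall cluster bar charts (our assumption: the weighted GENRE/MEDIUM data are still the active, weighted dataset) is:

* Clustered bars of counts, GENRE on the x axis and MEDIUM in the clusters (legend).
GRAPH /BAR(GROUPED)=COUNT BY GENRE BY MEDIUM.
* The same chart with overall percents of the grand total.
GRAPH /BAR(GROUPED)=PCT BY GENRE BY MEDIUM.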


Figure 3.9a OVERALL COUNTS of GENRE and MEDIUM Pairings for all data in the data set (with GENRE on the x axis and MEDIUM in the clusters (legend) )

Figure 3.9b OVERALL PERCENTS of GENRE and MEDIUM Pairings for all data in the data set (with GENRE on the x axis and MEDIUM in the clusters (legend) )

Note that students can also do counts and percents cluster charts that place MEDIUM on the x axis and GENRE in the legend. Commands are analogous, except that MEDIUM is on the x axis and GENRE on the legend.

3.3.4 Weighted Cases: Percents for Levels of one Category within levels of another Category

The following commands will yield bars that illustrate the MEDIUM percentages within each GENRE for comparison. The MEDIUM bars will total 100% within each GENRE. Students can add appropriate titles. See Figure 3.10a.

(To have MEDIUM percents add to 100% within each GENRE level) 1. Move GENRE to X Axis 2. Move MEDIUM to Cluster on X: set colour 3. In the Statistics dropdown, change Count to Percentage 4. In the Set Parameters box, change Denominator for Computing Percentage to "Total for Each X Category" 5. Click Continue in the Set Parameters box 6. Click Apply in the Element Properties box 7. Click OK in the Chart Builder

Figure 3.10a: Percents of Mediums used within Genre clusters


The following commands will yield bars that illustrate the GENRE percentages within each MEDIUM for comparison. The GENRE bars will total 100% within each MEDIUM. Students can add appropriate titles.

(To have GENRE percents add to 100% within each MEDIUM level) 1. Move MEDIUM to X Axis 2. Move GENRE to Cluster on X: set colour 3. In the Statistics dropdown, change Count to Percentage 4. In the Set Parameters box, change Denominator for Computing Percentage to "Total for Each X Category" 5. Click Continue in the Set Parameters box 6. Click Apply in the Element Properties box 7. Click OK in the Chart Builder

Figure 3.10b: Percents of Genres played within Medium clusters The above graphs are particularly useful for students to reference in concert with the actual numbers in the frequency/percent crosstab tables when they are writing up comments to describe what is happening with sets of data that they are examining. It is always appropriate and useful to note general patterns along with actual observed counts/percents.

3.4 Scatterplots A scatterplot is a plot of (X,Y) points for two numerical variables X and Y. SPSS can fit a “line of best fit” through the points. This line is also known as the “least squares” line, and is constructed (using calculus) so as to minimize the sum of the squared distances between the actual y values and the y values on the line. Here are the commands to get a scatterplot and a “line of best fit” for the (X,Y) pairs (MINPA, RSTHRT).

1. Select Graphs>Legacy Dialogs>Scatter/Dot
2. In the Scatter/Dot window, choose Simple Scatter and select Define
3. In the Simple Scatterplot window, choose Y Axis: RSTHRT, and X Axis: MINPA
4. Select the Titles button, and then type the title Scatterplot of RSTHRT versus MINPA
5. Click OK
6. Double click on the scatterplot to bring it up in the Chart Editor
7. Select the Elements menu dropdown, and choose Fit Line at Total (a line of best fit, with its equation, will appear on the scatterplot)
8. Close the Properties box window that popped up
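A syntax sketch for the scatterplot itself (the fit line is still added in the Chart Editor as in steps 6 to 8) is:

* Scatterplot with RSTHRT on the y axis and MINPA on the x axis.
GRAPH
  /SCATTERPLOT(BIVAR)=MINPA WITH RSTHRT
  /TITLE='Scatterplot of RSTHRT versus MINPA'.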

For our example, the line of best fit has the equation y=80.35 – 0.14*x (or RSTHRT = 80.35 -0.14*MINPA). It has a negative slope, and as MINPA increases, RSTHRT decreases.


Note that, in general, if the line of best fit has a positive slope, the Y variable will increase as the X variable increases, while if the line of best fit has a negative slope, the Y variable will decrease as the X variable increases.

Figure 3.11: Scatterplot of 2 numerical variables (RSTHRT and MINPA)

In this case, creating different scatterplots for the two smoking status groups (smokers and non-smokers) offers the opportunity for deeper examination of the data. The commands that create the scatterplots along with the scatterplots are given below. One way is to use the Select Cases command to create, first, a dataset of SMOKER, and secondly, a dataset of NONSMOKER. Then, for each of these datasets, run commands to create a scatterplot, and add the line of best fit. Doing this yields the following. For the NONSMOKER group, the line of best fit has the equation RSTHRT = 80.98 - 0.14*MINPA. For the SMOKER group, the line of best fit has the equation RSTHRT = 81.61 - 0.23*MINPA. Note that the slope for the SMOKER line of best fit is steeper than the slope for the NONSMOKER line of best fit. Also note that in order to get a true comparison of the two scatterplots, it was necessary to make sure that the scales matched on the two graphs.


Figure 3.12a: Scatterplot of 2 numerical variables for one level of a categorical variable

(of RSTHRT and MINPA for smokers)

Figure 3.12b: Scatterplot of 2 numerical variables for one level of a categorical variable

(of RSTHRT and MINPA for non-smokers)

3.5 Correlation

Correlation, r, is a standardized measure between -1 and +1. It indicates the strength of linear relationship (as seen on a scatterplot) between two quantitative (numerical) variables. If r is positive, the slope of a line of best fit is positive, and the two variables are said to have a positive linear relationship. If r is negative, the slope of a line of best fit is negative, and the two variables are said to have a negative linear relationship. When points are close to the line, and r is above roughly 0.75 in absolute value, the fit is said to be strong. If points are further away from the line, and r is roughly between 0.5 and 0.75 in absolute value, the fit is said to be medium. If points are widely scattered from


the line, and r is below about 0.5 in absolute value, the fit is said to be weak. If the points are so widely scattered that a line of best fit cannot be readily found, r will be close to 0, and there will be no linear relationship.

Figure 3.13: Suggested correlation values for various scatter patterns

When a scatterplot (and line of best fit) are created in SPSS (as above), R² (the square of the correlation r for two numerical variables) appears in the upper right corner of the scatterplot. R² is known as the coefficient of determination, and it measures the amount of variation in the data that is explained by fitting the line of best fit. We will return to it later in the course. Because the slopes of the lines above are all negative, r is the negative square root of R² in each case. For all the data, r = -√0.679 = -0.824. For the SMOKER data, r = -√0.611 = -0.782, and for the NONSMOKER data, r = -√0.758 = -0.871. The strength of linear relationship between RSTHRT and MINPA is fairly strongly negative in all three cases, but strongest for the NONSMOKER data alone. Commands for another way to calculate r for the full data set are given below. (Identical commands for the separate SMOKER and NONSMOKER datasets would also yield their respective correlations.) In this output, r is the "Pearson Correlation".

1. Select Analyze > Correlate > Bivariate 2. In the Bivariate Correlations box, place the variables MINPA and RSTHRT 3. Check the Pearson box under Correlation Coefficients 4. Select OK
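The equivalent correlation syntax (a sketch, assuming the variables MINPA and RSTHRT as above) is:

* Pearson correlation between MINPA and RSTHRT with two-tailed significance.
CORRELATIONS
  /VARIABLES=MINPA RSTHRT
  /PRINT=TWOTAIL NOSIG.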


Correlations

                                 RSTHRT    MINPA
RSTHRT    Pearson Correlation    1         -.824**
          Sig. (2-tailed)                  .000
          N                      69        69
MINPA     Pearson Correlation    -.824**   1
          Sig. (2-tailed)        .000
          N                      69        69

**. Correlation is significant at the 0.01 level (2-tailed).

Figure 3.14: Correlation between two numerical variables (RSTHRT and MINPA)


Chapter 4 Parametric Procedures Objective: After studying this chapter, you should be able to:

1. Do inference about one population mean using the one sample t procedure; 2. Do inference about two population means using the two sample t procedure; 3. Do inference about two population means using the paired t procedure.

4.1 Inference About One Population Mean Using the One Sample t Procedure

4.1.1 Two sided confidence intervals and hypothesis tests:

Assumptions

1. Simple Random Sample 2. Normal population or large sample (n >= 30)

The one-sample t-procedure is used.

1. Two sided (1 – α)% confidence interval for µ:

   x̄ ± t_(α/2) · s/√n,  df = n - 1

and

2. Test statistic for testing H0: µ = δ versus Ha: µ ≠ δ:

   t0 = (x̄ - δ) / (s/√n),  df = n - 1

where µ is the population mean, x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t_(α/2) is the t distribution value such that the area to the right of it is α/2.

Example: A math achievement test is given to a random sample of 13 high school girls. The scores are given here: Scores for girls: 87, 91, 78, 81, 72, 95, 89, 93, 83, 74, 75, 85, 95 The data are given in the SPSS data file girlsachieve.sav on Blackboard.

1. Does the mean achievement score for girls differ from 80? Test at a 5% level of significance. 2. Construct a 95% confidence interval for the mean achievement score for girls.


Solution: Let µ denote the mean achievement score for girls. Hypothesis Test: Perform a one sample t-test:

Step 1. Hypotheses: H0 : µ = 80 versus Ha : µ ≠ 80, α = 0.05

Step 2. Assume a simple random sample was taken. Since our sample size is small, we must assume normality of the population. Using the sample data, we draw a boxplot and a normal probability plot (called a Q-Q plot in SPSS) to check the normality of the sample data and to see if it supports this assumption about the population. (Note that you are only establishing the shape of one not so large sample from our population here, so you are seeing whether the normal population assumption is reasonable by checking the shape of our one particular sample. Also, the sample, with only 13 data values, is not very large; ideally you would wish for a larger sample. You should keep both these points in mind.)

(a) Select Graphs>Legacy Dialogs>Boxplot and make sure the Simple box is current. (b) Click Define. In the Define dialog box use the arrows to select GirlScores as Boxes Represent

(c) Click OK

(a) Select Graphs>Legacy Dialogs> Histogram (b) Select GirlScores as the Variable

(c) Click OK

(a) Select Analyze>Descriptive Statistics>Q-Q Plots (b) Select GirlScores as the Variable

(c) Click OK Open the SPSS Viewer to see the output.
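All three plots can also be requested with one Explore (EXAMINE) command; a minimal sketch, assuming girlsachieve.sav is open with the variable GirlScores, is:

* Boxplot, histogram and normal Q-Q plot for GirlScores in one command.
EXAMINE VARIABLES=GirlScores
  /PLOT BOXPLOT HISTOGRAM NPPLOT
  /STATISTICS NONE.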


Figure 4.1: Dialog Boxes for Drawing a Boxplot, Histogram and Probability Plot for GirlScores

Figure 4.2: Boxplot, Histogram and Probability Plot for GirlScores


The boxplot shows that the distribution is approximately symmetric, with no outliers. (Note that the boxplot cannot tell us much more than this. However, the t-test is robust even if a population is not normal and is, for example, more uniform.) The histogram puts the 13 observations into six bins, so it cannot give us as much information as a histogram does when we have more observations. However, its shape suggests that a normal distribution may indeed be possible in the population: there is a peak in the middle, and the tails are evenly spread out and not too long. The Q-Q plot (normal probability plot) compares the actual observed values against the expected values for a normally distributed random variable. If the distribution from which one samples is normal, then the points will fall close to the line. Here the points fall reasonably close to the line, indicating that the shape of the sample data is not too far removed from a normal shape, so the assumption of a normal population is not untoward. Note that it is prudent to examine all three of a histogram, a boxplot and a normal probability plot when checking the normality of your sample data and establishing that the assumption of a normal population is not untoward.

Step 3 - 6: It is necessary to generate computer output to complete the hypothesis test.

Commands for a one sample t-test using SPSS: a. Select Analyze>Compare Means>One-Sample T Test b. Select GirlScores as the test variable(s) (see Figure 4.3) c. Select 80 as your Test Value d. Click Options, type 95% for Confidence Interval, and click Continue e. Click OK
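The matching one-sample t test syntax (a sketch with the same test value and confidence level) is:

* One-sample t test of H0: mu = 80 with a 95% confidence interval.
T-TEST
  /TESTVAL=80
  /VARIABLES=GirlScores
  /CRITERIA=CI(.95).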

Figure 4.3: One-Sample T Test Dialog Box


One-Sample Statistics

             N    Mean    Std. Deviation   Std. Error Mean
GirlScores   13   84.46   8.038            2.229

One-Sample Test (Test Value = 80)

             t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                               Lower    Upper
GirlScores   2.001   12   .068              4.462              -.40     9.32

Figure 4.4: One Sample T Test Output with test statistic and P-value for test of H0: µ = 80 vs Ha: µ ≠ 80

Step 3: From the output in Figure 4.4, the test statistic is t0 = 2.001 with 12 degrees of freedom.
Step 4: From the output in Figure 4.4, the P-value = 0.068.
Step 5: Do not reject H0 at the 5% significance level (since 0.068 > 0.05).
Step 6: At the 5% significance level, there is no significant evidence that the mean of the girl achievement scores differs from 80.

Confidence Interval: The result in Figure 4.4 returns a 95% confidence interval for μ – 80. To obtain the 95% confidence interval for μ, one can simply add 80 to both endpoints of the interval. So the 95% confidence interval for μ is (79.60, 89.32) points. Alternatively, to obtain a confidence interval for μ, the commands above can be rerun using 0 as the test value. The output returned if you choose this approach is in Figure 4.5.

One-Sample Test (Test Value = 0)

             t        df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                                Lower    Upper
GirlScores   37.888   12   .000              84.462             79.60    89.32

Figure 4.5: One Sample T Output to obtain confidence interval for μ

From the SPSS output, the 95% confidence interval for µ is (79.60, 89.32). We are 95% confident that the mean achievement score for girls falls between (79.60, 89.32) points. Since 80 is captured within this confidence interval, we cannot exclude 80 as a possible value for the mean. We do not find sufficient evidence that the mean achievement score for girls differs significantly from 80.


4.1.2 One sided confidence bounds and hypothesis tests: Assumptions

1. The sample is randomly and independently selected from the population. 2. The sampled population is approximately normal.

The one-sample t-procedure is used.

Right Sided Test:

1. Test statistic for testing H0: µ = δ versus Ha: µ > δ:

   t0 = (x̄ - δ) / (s/√n),  df = n - 1

2. One sided lower bound with (1 – α)% confidence:

   x̄ - t_α · s/√n,  df = n - 1

Left Sided Test:

1. Test statistic for testing H0: µ = δ versus Ha: µ < δ:

   t0 = (x̄ - δ) / (s/√n),  df = n - 1

2. One sided upper bound with (1 – α)% confidence:

   x̄ + t_α · s/√n,  df = n - 1

Example: A math achievement test is given to a random sample of 13 high school girls. The scores are given here: Scores for girls: 87, 91, 78, 81, 72, 95, 89, 93, 83, 74, 75, 85, 95 The data are given in the SPSS data file girlsachieve.sav on Blackboard. 1. Does the mean achievement score for girls exceed 78? Test at a 1% level of significance. 2. Construct a 99% confidence one sided right sided lower bound for the mean achievement score for girls. Solution: Let µ denote the mean achievement score for girls. Hypothesis Test: Perform a one sided right sided sample t-test:

Step 1: Hypotheses: H0: µ = 78 versus Ha: µ > 78, α = 0.01


Step 2: Assume a simple random sample was taken from the population. Recall the boxplot, histogram and Q-Q plot that were created earlier to check the normality assumption of the population by examining the distribution of the sample data; bearing in mind the small sample size, and that this is just one sample from the population, they provide an indication of how the population is distributed. The boxplot indicated that the sample data is approximately symmetric, with no outliers; the histogram showed a fairly central peak and no long tails; and the Q-Q plot indicated that it was not untoward to treat the sample data as normal. We proceed.

Step 3 - 6: To generate computer output for the hypothesis test, repeat the command steps shown above using 78 as the Test Value (see Figure 4.6) and 98 as the confidence level (the reason is discussed below). SPSS only provides the P-value for a two sided test, so it is necessary to convert the P-value in the output to one for a one-sided test. For this example, the obtained P-value will be divided by 2. (Details on how to handle all possible cases are provided in Figure 4.8.) Note: A 98% two sided confidence interval for μ – 78 will also be returned, and this will not be of interest. However, when 78 is added to the lower bound, or a rerun of commands is done with 0 as the Test Value, the 98% confidence interval for μ will be of interest. This is because the 99% confidence one sided right sided lower bound is the same as the lower bound of a 98% two sided confidence interval. Figure 4.7 provides the results when a rerun of commands is done with 0 as the test value. (Details on how to handle all possible situations are provided in Figure 4.9.) Commands:

a. Select Analyze>Compare Means>One-Sample T Test b. Select GirlScores as the test variable(s) c. Select 78 as your Test Value d. Click Options, type 98% for Confidence Interval, and click Continue

e. Click OK
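As a syntax sketch, the only changes from the previous run are the test value and the confidence level:

* One-sample t test against 78; the 98% two-sided interval supplies the 99% one-sided lower bound.
T-TEST
  /TESTVAL=78
  /VARIABLES=GirlScores
  /CRITERIA=CI(.98).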

One-Sample Test (Test Value = 78)

             t       df   Sig. (2-tailed)   Mean Difference   98% Confidence Interval of the Difference
                                                               Lower    Upper
GirlScores   2.899   12   .013              6.462              .49      12.44

Figure 4.6 One Sample T Test Output to obtain test statistic and P-value for test of H0 : µ = 78 versus Ha:µ > 78

Step 3: The test statistic is to = 2.899 with 12 degrees of freedom. Step 4: The P-value = 0.013/2 = 0.0065 (since the P-value in Figure 4.6 is for a 2 sided test) Step 5: Reject Ho, at the 1% significance level (since 0.0065 < 0.01). Step 6: There is significant evidence the mean of the girl achievement scores exceeds 78.

Confidence Bound:

A (1 – α)% confidence one sided right sided lower bound is x̄ - t_α · s/√n. This is the same as the lower bound of a (1 – 2α)% two sided confidence interval ( x̄ - t_α · s/√n , x̄ + t_α · s/√n ).


For α = 0.01, a 99% confidence one sided right sided lower bound is x̄ - t_0.01 · s/√n, the same as the lower bound of a 98% two sided confidence interval ( x̄ - t_0.01 · s/√n , x̄ + t_0.01 · s/√n ).

Using the output in Figure 4.6, we calculate the lower bound of 78.49 for a 99% one sided right sided confidence interval by adding 78 to 0.49. The 99% one sided confidence interval can be presented as (78.49, ∞). Alternatively, to obtain 78.49, you can rerun the commands above using 0 as the test value and 98 as the confidence level. The output returned is in Figure 4.7.

One-Sample Test (Test Value = 0)

             t        df   Sig. (2-tailed)   Mean Difference   98% Confidence Interval of the Difference
                                                                Lower    Upper
GirlScores   37.888   12   .000              84.462             78.49    90.44

Figure 4.7: One Sample T Output to obtain lower confidence bound for μ

From Figure 4.7, the 99% one sided right sided lower bound for µ is 78.49. We are 99% confident that the girls' mean achievement score falls above 78.49. Since 78.49 is above 78, we exclude 78 as a possible value for the mean. We find sufficient evidence that the mean achievement score for girls exceeds 78.

P-value Calculation

SPSS only provides the P-value for a two sided test, so it is necessary to convert the P-value obtained in the output to a P-value for a one-sided test. Details on how to handle all possible cases are in Figure 4.8.

Two Sided P-value
Ha: µ ≠ µ0                        TSP-value = 2P(t > |t0|)

One Sided P-value
Ha: µ > µ0, t0 positive           P(t > t0) = OSP-value = TSP-value/2
Ha: µ > µ0, t0 negative           P(t > t0) = OSP-value = 1 - P(t < t0) = 1 - (TSP-value/2)
                                  (unlikely scenario: x̄ would be less than µ0, which is not as expected when the test was set up)
Ha: µ < µ0, t0 negative           P(t < t0) = OSP-value = TSP-value/2
Ha: µ < µ0, t0 positive           P(t < t0) = OSP-value = 1 - P(t > t0) = 1 - (TSP-value/2)
                                  (unlikely scenario: x̄ would be greater than µ0, which is not as expected when the test was set up)

Figure 4.8: P-value Adjustment Table


Calculation of One Sided Confidence Bounds

Finding a (1 – α)% one sided confidence bound using SPSS requires that the confidence level used to perform the test be (1 – 2α)%. Details are included in Figure 4.9.

Right Sided: A (1 – α)% confidence one sided right sided lower bound is x̄ - t_α · s/√n. This is the same as the lower bound of a (1 – 2α)% two sided confidence interval ( x̄ - t_α · s/√n , x̄ + t_α · s/√n ).

Left Sided: A (1 – α)% confidence one sided left sided upper bound is x̄ + t_α · s/√n. This is the same as the upper bound of a (1 – 2α)% two sided confidence interval ( x̄ - t_α · s/√n , x̄ + t_α · s/√n ).

Figure 4.9: Adjustment Table for One Sided Confidence Intervals Example: A math achievement test is given to a random sample of 13 high school girls. The scores are given here: Scores for girls: 87, 91, 78, 81, 72, 95, 89, 93, 83, 74, 75, 85, 95

Figure 4.10: Scores for girls The data are given in the SPSS data file girlsachieve.sav on Blackboard.

1. Is the mean achievement score for girls below 87? Test at a 2% level of significance. 2. Construct a 98% confidence one sided left sided upper bound for the mean achievement score for girls.

Solution: Let µ denote the mean achievement score for girls. Hypothesis Test: Perform a one sided left sided sample t-test:

Step 1: Hypotheses: H0: µ = 87 versus Ha: µ < 87, α = 0.02

Step 2: Assume a simple random sample was taken from the population. Recall the boxplot, histogram and Q-Q plot that were created earlier to check the normality assumption of the population by examining the distribution of the sample data; bearing in mind the small sample size, and that this is just one sample from the population, they provide an indication of how the population is distributed. The boxplot indicated that the sample data is approximately symmetric, with no outliers; the histogram showed a fairly central peak and no long tails; and the Q-Q plot indicated that it was not untoward to treat the sample data as normal. We proceed.

Step 3 - 6: To generate computer output for the hypothesis test, repeat the command steps shown in the previous example, but this time use 87 as a Test Value (see Figure 4.11) and 96 as a confidence level. Note: A 96% two sided confidence interval for μ – 87 will also be returned, and this will not be of interest. However, when a rerun of commands is done with 0 as a Test Value, the 96% confidence


interval for μ will be of interest. This is because the 98% confidence one sided left sided upper confidence bound is the same as the upper bound of a 96% two sided confidence interval. Figure 4.12 provides these results. Conduct a one sample t-test using SPSS:

a. Select Analyze>Compare Means>One-Sample T Test b. Select GirlScores as the test variable(s) c. Select 87 as your Test Value d. Click Options, type 96% for Confidence Interval, and click Continue e. Click OK

One-Sample Test (Test Value = 87)

             t        df   Sig. (2-tailed)   Mean Difference   96% Confidence Interval of the Difference
                                                                Lower    Upper
GirlScores   -1.139   12   .277              -2.538             -7.67    2.59

Figure 4.11: One Sample T Test Output to obtain test statistic and P-value for test of H0: µ = 87 vs Ha: µ < 87

Step 3: The test statistic is to = -1.139 with 12 degrees of freedom. Step 4: The P-value = 0.277/2 = 0.138 (since the P-value in Figure 4.11 is for a 2 sided test) Step 5: Do not reject Ho, at the 2% significance level (since 0.138 > 0.02). Step 6: There is no significant evidence the mean of the girl achievement scores lies below 87.

Confidence Interval:

A (1 – α)% confidence one sided left sided upper bound is x̄ + t_α · s/√n. This is the same as the upper bound of a (1 – 2α)% two sided confidence interval ( x̄ - t_α · s/√n , x̄ + t_α · s/√n ).

For α = 0.02, a 98% one sided left sided confidence interval has an upper bound of x̄ + t_0.02 · s/√n. This is the same as the upper bound of a 96% two sided confidence interval ( x̄ - t_0.02 · s/√n , x̄ + t_0.02 · s/√n ).

Using the output in Figure 4.11, we calculate the upper bound of 89.59 for a 98% one sided left sided confidence interval by adding 87 to 2.59. The 98% one sided confidence interval can be presented as (-∞, 89.59). Alternatively, we can rerun the commands above using 0 as the test value and 96 as the confidence level. The output returned is in Figure 4.12.

One-Sample Test (Test Value = 0)

             t        df   Sig. (2-tailed)   Mean Difference   96% Confidence Interval of the Difference
                                                                Lower    Upper
GirlScores   37.888   12   .000              84.462             79.33    89.59

Figure 4.12: One Sample T Output to obtain upper confidence bound for μ


We are 98% confident that the mean achievement score for girls falls below 89.59. Since 89.59 is not below 87, we cannot exclude 87 as a possible value for the mean. We do not have sufficient evidence that the mean achievement score for girls lies below 87. (The upper bound is not below 87).

4.2 Inference About Two Population Means Using Two Sample t Procedure Confidence intervals and hypothesis tests Assumptions:

1. Simple random samples 2. Independent samples. 3. Normal populations or large samples (n1 and n2 both >= 30)

We will use the two sample non-pooled Welch t-test:

1. Test statistic for testing H0: µ1 - µ2 = δ versus Ha: µ1 - µ2 ≠ δ or Ha: µ1 - µ2 > δ or Ha: µ1 - µ2 < δ

to = [ (x̄1 – x̄2) – δ ] / √( s1²/n1 + s2²/n2 )

with df = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 – 1) + (s2²/n2)²/(n2 – 1) ] , or the approximation df = min(n1 – 1, n2 – 1)

where µ1 is the population 1 mean, x̄1 is the sample 1 mean, s1 is the sample 1 standard deviation, and n1 is the sample 1 size. Similarly, µ2 is the population 2 mean, x̄2 is the sample 2 mean, s2 is the sample 2 standard deviation, and n2 is the sample 2 size. t_{α/2} is the t distribution value such that the area to the right of it is α/2.

2. (1 – α)% Two Sided Confidence interval for µ1 - µ2:

(x̄1 – x̄2) ± t_{α/2} √( s1²/n1 + s2²/n2 ) , df as above

3. (1 – α)% Lower bound for a right sided confidence interval for µ1 - µ2:

(x̄1 – x̄2) – t_α √( s1²/n1 + s2²/n2 ) , df as above

4. (1 – α)% Upper bound for a left sided confidence interval for µ1 - µ2:

(x̄1 – x̄2) + t_α √( s1²/n1 + s2²/n2 ) , df as above

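The formulas above are easy to code directly from summary statistics. The following optional Python sketch (the function name welch_t is ours, not SPSS's) implements the Welch test statistic and the Satterthwaite degrees of freedom exactly as written above:

from math import sqrt

def welch_t(xbar1, s1, n1, xbar2, s2, n2, delta=0.0):
    """Welch (non-pooled) t statistic and Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2                    # the two variance-over-n terms
    t0 = ((xbar1 - xbar2) - delta) / sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t0, df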

Example: A math achievement test is given to a random sample of 25 high school students. The scores are below.
Scores for girls: 87, 91, 78, 81, 72, 95, 89, 93, 83, 74, 75, 85, 95
Scores for boys: 68, 87, 67, 74, 81, 93, 60, 78, 74, 92, 81, 62
The data are given in the SPSS data file achieve.sav on Blackboard.

1. Is there a significant difference between the mean scores for boys and girls? Test at 5% level of significance.

2. Construct a 95% confidence interval for the difference in the mean scores between boys and girls.

Solution: Let µ1 denote the mean score for girls, and µ2 denote the mean score for boys. Hypothesis Test: Perform the following two sample t-test:

Step 1. Hypotheses: H0 : µ1 - µ2 = 0 versus Ha : µ1 - µ2 ≠ 0 (Also can state as H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2)
Step 2. We assume that the samples are randomly and independently selected from the populations. Since our sample sizes are small, we must assume normality of the populations in order to proceed. We draw boxplots to check the normality assumptions of the test:

(a) Select Graphs>Legacy Dialogs>Boxplot and check the Summaries for groups of cases button

(b) Click Define. In the Define dialog box use the arrows to select Scores as Variable and Sex for the Category Axis (see Figure 4.13)

(c) Click OK

Figure 4.13: Define Dialog Box for Drawing Boxplots of Score by Sex

Open the SPSS Viewer to see the output.


Figure 4.14: Boxplot of Math Achievement Scores for Males and Females

The side-by-side boxplot shows that the distributions are approximately symmetric, with no outliers, and that the spreads (i.e. the variances) are almost the same. The boxplot also shows that overall the girls scored higher than the boys (see Figure 4.14). Bearing in mind that this is only one possible set of sample data and that the sample sizes are small, the samples nevertheless suggest that assuming the two populations to be normally distributed is not unreasonable. For practice, students should explore the data further with histograms and probability plots to get a better idea of the shape of the sample distributions. In your submitted work, you will be expected to do all three, and discuss all three.
Steps 3 - 6: To generate computer output for the hypothesis test, perform the steps below. Conduct a two sample t-test using SPSS:

1. Select Analyze>Compare Means>Independent-samples t test
2. Select Scores as the test variable(s) and Sex as the grouping variable (see Figure 4.15)
3. Click Define Groups, type F for Group 1 and M for Group 2, and click Continue (see Figure 4.16)
4. Click Options, type 95% for Confidence Interval, and click Continue
5. Click OK

Figure 4.15: Independent Sample T test Dialog Box


Figure 4.16: Define Groups Dialog Box

Results for the Welch t-test can be found in the SPSS output in Figure 4.17. Steps 3 to 6 for Welch’s t-test follow. You will use the information found in the “equal variances not assumed” row. (Note that, in this course, we will not use the “pooled” t test, a test that assumes that the population variances are equal. This is an unrealistic assumption.)
Step 3: The test statistic is to = 2.078 with 20.086 degrees of freedom.
Step 4: The P-value = 0.051
Step 5: Do not reject Ho at the 5% significance level (since 0.051 > 0.05, i.e. P-value > α).
Step 6: There is no significant evidence that the mean score of boys differs from the mean score of girls.
Note that the P-value is quite close to the level of significance. It is always important to state and consider the P-value when any conclusion about significance is stated.
Confidence Interval: For the Welch test, the 95% confidence interval for µ1 - µ2 is (-0.030, 16.119). We are 95% confident that the difference in mean score between girls and boys falls between -0.030 and 16.119. Since 0 is captured within this confidence interval, we cannot exclude 0 as a possible value for the difference. Therefore, we do not find a significant difference in the mean score between boys and girls. In other words, the data do not provide sufficient evidence that the mean scores for boys and girls are different.

Figure 4.17 SPSS Output: Two Independent Sample T-test and Confidence Interval
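As an optional cross-check of the “equal variances not assumed” row in Figure 4.17, the Welch test can also be run outside SPSS with Python's scipy (assumed installed), using the scores listed earlier:

import numpy as np
from scipy import stats

girls = np.array([87, 91, 78, 81, 72, 95, 89, 93, 83, 74, 75, 85, 95])
boys  = np.array([68, 87, 67, 74, 81, 93, 60, 78, 74, 92, 81, 62])

# equal_var=False requests the non-pooled (Welch) test used in this course
res = stats.ttest_ind(girls, boys, equal_var=False)
print(res.statistic, res.pvalue)   # approx. 2.078 and 0.051, as in the SPSS output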


Note, again, that should you need to do a one sided hypothesis test, it would be necessary to convert the P-value obtained in the SPSS output to a P-value for a one-sided test. Furthermore, if you wish to calculate the upper bound for a (1 – α)% one sided left sided confidence interval, or the lower bound for a (1 – α)% one sided right sided confidence interval, you would need to run the SPSS commands with (1 – 2α)% as your confidence level, in order to get the bounds you need to show up in the SPSS output.

4.3 Inferences about Two Population Means Using the Paired t Procedure
Confidence intervals and hypothesis tests:
Assumptions:

1. The population of differences is normally distributed.
2. The sample of differences represents a random sample from the population of differences.

Paired-t Procedure: Create the sample of differences (d = x1 – x2 for each pair) and then apply the one-sample t-procedure to the parameter µd = µ1 - µ2.
Example: A private agency is investigating a new procedure to teach typing. The following table gives the scores of eight students before and after they attended this course. The data are given in the SPSS data file typing.sav found on Blackboard.

Students 1 2 3 4 5 6 7 8

Before 81 75 89 91 65 70 90 69

After 97 72 93 110 78 69 115 75

1. Find a two sided 90% confidence interval for the mean difference (of “After – Before” pairs) in the typing speed of all students participating in the course.

2. At the 10% level of significance, do the data provide evidence of a significant mean difference in paired “After – Before” typing speeds?

Solution: Prior to creating the confidence interval and doing the hypothesis test, assumptions must be checked. For this problem we will assume that the sample of differences constitutes a random sample from the population of differences, and proceed to check the assumption of normality of the differences. In the SPSS data editor window, define the numerical variables, before and after. Next, type the score data before the course in the variable before, and the score data after the course in the variable after. We need to define a new variable, diff, containing the differences in the scores after and before the crash course:


1. Select Transform>Compute 2. Type diff in the Target Variable box 3. Type After - Before in the Numeric Expression box and click OK (see Figure 4.18).

Figure 4.18: Defining a New Variable for the Differences

It is preferable to examine all of a normal probability plot, a boxplot, and a histogram to help assess whether the sample of differences is normally distributed. If the sample size is small, the normal probability plot is sometimes considered the most useful of the three. How to obtain the normal probability plot for this problem is illustrated below. Students should make the other plots for practice and think about what they show about the shape of the sample distribution. The plot appears in Figure 4.19.

1. Select Analyze>Descriptive Statistics>Q-Q Plots 2. Select diff as Variables 3. The test distribution dropdown should show the “Normal” default 4. Click OK
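Outside SPSS, an equivalent normal probability plot of the differences can be produced with Python's scipy and matplotlib (assumed available); this optional sketch mirrors the variables in typing.sav:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

before = np.array([81, 75, 89, 91, 65, 70, 90, 69])
after  = np.array([97, 72, 93, 110, 78, 69, 115, 75])
diff   = after - before

# Observed differences plotted against the quantiles expected under a normal distribution
stats.probplot(diff, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of diff = After - Before")
plt.show()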


Figure 4.19: Q-Q Plot of the Difference: After - Before in the Typing Scores

The normal probability plot compares the actual observed values against the expected normal values for a normally distributed random variable. If the distribution from which one samples is normal, then the points will fall close to the line. Here most points fall close to the line, so we expect that the assumption of a normal population for the differences is not strongly violated. However, there are only 8 values in our data set of differences, which really is very small, and this should be kept in mind. To obtain the confidence interval (and also the hypothesis test) requested, perform the following commands.

1. Choose Analyze> Compare Means> Paired-samples T test 2. Click after, then click before to select after-before as Paired Variables (see Figure 4.20) 3. Click Options, type 90% for Confidence Interval, and click Continue 4. Click OK.

Figure 4.20: Dialog Box for Paired-t Procedure


Paired Samples Statistics

Mean N Std. Deviation Std. Error Mean

Pair 1 After 88.6250 8 17.73566 6.27050

Before 78.7500 8 10.43004 3.68758

Paired Samples Correlations

N Correlation Sig.

Pair 1 After & Before 8 .877 .004

Paired Samples Test

Pair 1 After - Before:
Paired Differences: Mean = 9.87500, Std. Deviation = 9.94898, Std. Error Mean = 3.51749
90% Confidence Interval of the Difference: Lower = 3.21083, Upper = 16.53917
t = 2.807, df = 7, Sig. (2-tailed) = .026

Figure 4.21: SPSS Output for the Paired-t Procedure
Confidence Interval: In Figure 4.21, a 90% confidence interval for the mean difference µAfter - µBefore is given as 3.21 to 16.54 words per minute. Since the interval does not cover 0, there is significant evidence that the mean difference in the “After – Before” pairs differs from 0. The mean difference in typing speed for the “After – Before” pairs is estimated to fall between 3.21 and 16.54 words per minute.
Hypothesis Test:
Step 1: Hypotheses: H0 : µAfter = µBefore versus Ha : µAfter ≠ µBefore (can state as H0 : µd = 0 versus Ha : µd ≠ 0)

where d = After - Before

Step 2: We assumed a random sample of paired differences. As discussed above, the probability plot of the differences (Figure 4.19) suggests that normality of the population of differences is not an unreasonable assumption. Note that students should examine a boxplot and a histogram in addition to a probability plot for this step. (Also note that because n is small, we must assume normality of the population of differences in order to proceed.)

For Steps 3 to 6, SPSS output (Figure 4.21) from executing the paired-t procedure commands is needed.
Step 3: Test statistic: t = 2.807 with 7 ( = # pairs – 1) degrees of freedom
Step 4: P-value = 0.026
Step 5: Reject Ho at the 10% level of significance, since .026 < .10.
Step 6: The data provide sufficient evidence that the mean difference in typing speed for the “After – Before” pairs differs from 0.
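As an optional cross-check of Figure 4.21, the paired test and the 90% interval can be reproduced outside SPSS with a short Python sketch (numpy and scipy assumed available):

import numpy as np
from scipy import stats

before = np.array([81, 75, 89, 91, 65, 70, 90, 69])
after  = np.array([97, 72, 93, 110, 78, 69, 115, 75])
diff   = after - before

# The paired t-test is the one-sample t-test applied to the differences
res = stats.ttest_rel(after, before)
print(res.statistic, res.pvalue)        # approx. 2.807 and 0.026

# 90% two sided confidence interval for the mean difference
n = len(diff)
half_width = stats.t.ppf(0.95, n - 1) * diff.std(ddof=1) / np.sqrt(n)
print(diff.mean() - half_width, diff.mean() + half_width)   # approx. 3.21 and 16.54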


Note, again, that should you need to do a one sided hypothesis test, it would be necessary to convert the P-value obtained in the SPSS output to a P-value for a one-sided test. Furthermore, if you wish to calculate the upper bound for a (1 – α)% one sided left sided confidence interval, or the lower bound for a (1 – α)% one sided right sided confidence interval, you would need to run the SPSS commands with (1 – 2α)% as your confidence level, in order to get the bounds you need to show up in the SPSS output.


Chapter 5 ANOVA
Objectives: After studying this chapter you should be able to

1. Compare two or more population means using the one-way ANOVA procedure;
2. Conduct a multiple comparison of means;
3. Make inferences about linear combinations of group means;
4. Run and interpret randomized block designs; and
5. Fit a 2-way ANOVA and interpret the results.

5.1 The Analysis of Variance F-test for Equality of k Population Means
In a single factor experiment, measurements are made on a dependent response variable of interest for several levels (treatments) of a factor of interest. ANOVA is a method of parametric inference that allows us to compare the k populations of interest from which samples have been obtained or allocated.
Assumptions:

1. Simple random samples
2. Independent samples
3. Normal populations
4. Equal population standard deviations

Model:

Xij = µ + αi + εij , i = 1, …, k, j = 1, …, ni

Xij is a random dependent variable denoting the jth measurement on treatment i
k is the number of treatments
ni is the number of sample units for treatment i, and n = n1 + … + nk is the total sample size
αi = μi – μ is the treatment effect for treatment i
μi is the mean of the ith treatment
εij is the error in measurement, which can be explained through random effects not included in the model. We can write that we assume εij ~ N(0, σ²).

If all ni are equal, we say the model is balanced. Otherwise, the model is unbalanced.
Null hypothesis: H0 : µ1 = µ2 = ... = µk , where k is the number of treatments
Alternative hypothesis: Ha : At least one pair of the means is different.
x̄i – sample mean of the ith treatment
si – sample standard deviation of the ith treatment
x̄ – overall mean of the sample observations
Test statistic: F = MSTr/MSE, where


MSTr = SSTr/(k – 1) is the Mean Treatment Sum of Squares (SSTr, the Treatment Sum of Squares, measures the variation among the sample means), and MSE = SSE/(n – k) is the Mean Error Sum of Squares (SSE, the Error Sum of Squares, measures the variation within the samples). Under the null hypothesis the test statistic has an F distribution with ν1 = treatment degrees of freedom = k – 1 and ν2 = error degrees of freedom = n – k. It is common to present the test results in an ANOVA table, as follows.

Source           Sum of Squares   Degrees of Freedom   Mean Sum of Squares    Test Statistic   P-value
Between Groups   SSTr             k – 1                MSTr = SSTr/(k – 1)    F = MSTr/MSE     P-value
Within Groups    SSE              n – k                MSE = SSE/(n – k)
Total            SST              n – 1
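For reference, the F statistic defined above can also be computed directly from the raw samples. The optional Python sketch below (the helper name one_way_f is ours) follows the SSTr/SSE definitions; scipy.stats.f_oneway gives the same F together with its P-value:

import numpy as np

def one_way_f(*groups):
    """Return (F, treatment df, error df) for a one-way ANOVA on the given samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_tr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between-groups (treatment) SS
    ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)              # within-groups (error) SS
    ms_tr, ms_e = ss_tr / (k - 1), ss_e / (n - k)
    return ms_tr / ms_e, k - 1, n - k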

Example: Suppose that 44 students were selected and randomly and independently assigned to four groups of 11 students each. Students in each group were taught statistics using a different method of programmed learning. A standard test was administered to the four groups and scored on a 10-point scale. The data are given in the SPSS data file program.sav found on Blackboard.
Method 1: 3, 5, 6, 8, 4, 3, 5, 6, 4, 6, 3 (lecture only)
Method 2: 5, 7, 7, 7, 6, 6, 8, 4, 6, 7, 5 (lecture and computer labs)
Method 3: 7, 5, 6, 8, 7, 6, 9, 8, 7, 7, 8 (lecture and computer labs and assignments)
Method 4: 4, 6, 6, 7, 6, 5, 5, 5, 6, 5, 4 (lecture and assignments)

1. Do the assumptions for a one-way analysis of variance seem to be met? 2. Determine whether there is a significant difference in the mean scores for the four methods.

Solution: The measurement scores in program.sav are organized in 4 different columns (one for each treatment). SPSS requires data to be organized a different way in order to perform the one-way analysis of variance F-test. It requires two variables, one for all the scores (i.e. measurements) and the other for the index. Here is a way to create a new file with the data reorganized in order that the ANOVA test can be done.


Figure 5.1: Restructure Data Wizard

1. Select Data>Restructure to open the Restructure Data Wizard (see Figure 5.1)
2. Check the Restructure selected variables into cases button and click Next
3. Check the One button for the number of variable groups and click Next
4. Select all variables for the four methods in the Variables to be transposed dialog box using the arrow, type Scores as the Target variable, select None as Case Group Identification, and click Next (see Figure 5.2)
5. Check the One button for the number of index variables and click Next
6. Check Sequential numbers for the index values and click Finish

Figure 5.2: Variable to Cases – Select Variables Dialog Box


SPSS will create a new data file containing two variables: index1 (with the values 1 for method 1, 2 for method 2, 3 for method 3, and 4 for method 4), and scores with all the 44 scores. Save this file as programrestructured.sav and use it to solve the problem.
1. CHECK ASSUMPTIONS
We are told in the question that students were assigned randomly and independently to the treatment groups. The following commands will be used to check the normality and equal variances assumptions. Note that the sample sizes are small, and that this is only one possible set of sample data that could be taken from the population of interest. Nevertheless, we investigate in order to get a suggestion of population shapes.

(a) Using Analyze>Descriptive Statistics>Explore (b) Dependent List: Scores (c) Factor List: Index1 (d) Statistics tab: Check Descriptives (e) Continue (f) Plots tab: In Boxplots, Choose Factor Levels together (g) Check Histogram (h) Check Normality Plots with test (i) Continue (j) Ok

Treatment   Mean   Standard Deviation
Method 1    4.82   1.601
Method 2    6.18   1.168
Method 3    7.09   1.136
Method 4    5.36   0.924


Figure 5.3: Descriptives, Histograms, Probability Plots, and Boxplots for the 4 Different Scoring Methods


Note that with the very small sample sizes, none of the plots obtained are particularly informative (Figure 5.3). There is some indication that the normality assumption might be violated: the samples are not symmetric. However, since the sample sizes are equal, the distributions are not extremely skewed or long-tailed, and there are no glaring outliers, we can still use the ANOVA F-test, as it is robust to moderate departures from normality in the populations of interest. To check the equality of the population variances, the following rule of thumb may be used: if the ratio of the largest sample standard deviation to the smallest sample standard deviation is less than 2, the assumption of equal standard deviations is plausible. In this case the ratio = s1/s4 = 1.601/0.924 ≈ 1.73 < 2. Thus, the conditions for one-way ANOVA seem to be met.

2. Perform the ANOVA F-test:

SPSS Commands and Output for a One-way ANOVA test

i. Select Analyze>Compare Means>One-way ANOVA...
ii. Select scores as Dependent list(s) and index1 as Factor (see Figure 5.4)
iii. Click Options and check Descriptive, Homogeneity of variance test, and Means plot buttons, and click Continue (see Figure 5.5)
iv. Click OK

Figure 5.4: Dialog Box for the One-Way ANOVA Procedure

Figure 5.5: Options Dialog box for the One-Way ANOVA Procedure


The Descriptive option provides standard deviations to help check for equal variances. In addition, the standard deviations and means are needed to perform multiple comparisons, if appropriate. The Homogeneity of variance test option performs a test for equal variances called the Levene test. In this case, rejection of the null hypothesis of equal variances provides support for unequal variances, so a finding that the null hypothesis should not be rejected supports the assumption. The Means plot provides a graphical look at how far apart the means are (see Figure 5.7).

Figure 5.6 SPSS Output for a One-Way ANOVA

Figure 5.7: Means Plot

Levene Test: P-value=0.298. There is certainly no strong evidence against the assumption that the population variances are equal. This supports the result found with the rule of thumb approach.
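As an optional cross-check outside SPSS, the Levene test can be run in Python with scipy (assumed installed). SPSS's homogeneity test is the mean-centered Levene test, so center='mean' is requested (scipy's default is the median-centered Brown-Forsythe variant):

from scipy import stats

method1 = [3, 5, 6, 8, 4, 3, 5, 6, 4, 6, 3]
method2 = [5, 7, 7, 7, 6, 6, 8, 4, 6, 7, 5]
method3 = [7, 5, 6, 8, 7, 6, 9, 8, 7, 7, 8]
method4 = [4, 6, 6, 7, 6, 5, 5, 5, 6, 5, 4]

w_stat, p_value = stats.levene(method1, method2, method3, method4, center='mean')
print(w_stat, p_value)   # P-value approx. 0.3, in line with the 0.298 reported by SPSS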


Descriptive Statistics Table: x̄i = 4.82, 6.18, 7.09, 5.36 (i = 1, 2, 3, 4). Note that the 95% confidence intervals for µ1 and µ3 do not overlap, which indicates that students in Method 3 (lecture, computer lab, and assignments) classes obtained, on average, superior scores to students in Method 1 (lecture only) classes. We will look at a more statistically apt way of making multiple comparisons of the treatment means later on.
Means Plot: also illustrates the differences among the means.
Proper Write-Up of the Hypothesis Test, using the information from above.
Step 1: Hypotheses: H0 : µ1 = µ2 = ... = µ4 versus Ha : At least two means are different. We pre-choose a level of significance of α = 0.01.
Step 2: Assumptions: Simple random samples and independent samples were assumed for the treatments. Normality of the populations and equality of the population variances were also checked above. One should always check equal variances with the rule of thumb and the Levene test, and should also check boxplots, histograms, and probability plots of the sample distributions for normality. Results should be reported, and how well they support the ANOVA assumptions should be discussed before proceeding to the test. It must always be noted that this set of treatment samples is just one set of possible samples from the treatment populations, and can only suggest that the assumptions for the populations might hold. For this sample data, it appears that the ANOVA test can be used.

Furthermore, although doing this is not a test of assumptions per se, students should examine the means plot and descriptive statistics for their sample data to check (and report) whether there appears to be an indication of one or more means differing from the others. Noticing such an indication supports performing the test.
Step 3 and Step 4 (test statistic and P-value, taken from the output in Figure 5.6)

The F-statistic = 7.126 with degrees of freedom dfn = 3 and dfd = 40
The P-value = 0.001
Step 5: Decision: Reject Ho since P-value < α (that is, 0.001 < 0.01)

Step 6: Interpretation: At the 1% significance level, there is strong evidence to suggest that not all mean scores of the four methods are equal. (Further investigation will be undertaken later in this chapter.)
If all residuals come from a normal distribution (that is, if the scores are normally distributed within each group), then it is expected that a normal probability plot of the residuals should have points close to the line. The following commands present an alternative way to obtain one-way ANOVA results (the alternative format in the output is not of interest here) while having SPSS store the unstandardized predicted values and the standardized residuals in the original data file (these are of interest). Once we have the columns in the data file, two plots can be made: 1) a normal probability plot of the residuals, and 2) a plot of the residuals versus the (fitted) predicted values (see Figures 5.8a and 5.8b).


(a) Analyze > General Linear Model > Univariate (b) Dependent Variable: Scores (c) Fixed Factor(s): Index1 (d) Save: Predicted Values: Unstandardized (e) Save: Residuals: Standardized (f) Continue (g) OK

(a) Analyze > Descriptive Statistics > Q-Q Plots (b) Variables: ZRE_1 (Residuals for Score) (c) Test Distribution: Normal (d) OK

In Figure 5.8a, we see that most of the residuals are close to the line. We do identify a couple that are outlying, one at each end. An examination of the residual column in the data file reveals these to belong to observation 13 in group 1, with value 8 and a residual of 3.18, and to observation 30 in group 2, with value 4 and a residual of -2.18. Both of these values can be readily located on the histograms above. They are not so far away that we are worried. However, we remain aware of our small sample sizes.

Figure 5.8a: Normal Probability Plot of Residuals

Plotting residuals versus the independent variable, and/or plotting residuals versus the (fitted) predicted values, allows us to check whether the errors are centered at 0 and have equal variances. If these assumptions are met, then a plot of the residuals should show a random pattern, with the points appearing in a horizontal band centered around 0. Recall the Scatterplot commands.

a) Graphs>Legacy Dialogs>Scatterplot b) Put ZRE_1 to the Y axis


c) Put PRE_1 to the X Axis d) Say OK

Figure 5.8b: Plot of the residual values versus the (fitted) predicted values

In Figure 5.8b, the plot can be construed to show what appears to be random points centered in a symmetric band of constant width around 0 for each of the treatment groups. There is one point at (4.82, 3.18) that lies a bit apart from the others (corresponding to the observation of 8 in Group 1, which otherwise consists of values between 3 and 6). It indicates a possible outlier, but it is not an extreme one given the small spacing between possible score values, and for teaching purposes we decide to proceed with caution.

5.2 Linear Combinations and Multiple Comparisons of Means
Contrasts: Linear combinations of the group means have the form γ = C1µ1 + C2µ2 + … + Ckµk. If the coefficients add to zero (C1 + … + Ck = 0), then the linear combination is called a contrast. Some important concepts to note:

1. (1 – α)% Confidence interval for 𝛾:

(C1x̄1 + … + Ckx̄k) ± t_{α/2} sp √( C1²/n1 + … + Ck²/nk )

where sp = √MSE is the pooled estimate for σ, and t_{α/2} is the critical value of the t distribution with d.f. = n1 + ... + nk – k.


2. Test statistic for testing H0 : γ = δ:

to = [ (C1x̄1 + … + Ckx̄k) – δ ] / [ sp √( C1²/n1 + … + Ck²/nk ) ]

Example: Suppose we are interested in whether the method that involves all 3 components (method 3) is clearly superior. We compare it with the average of the other three methods (methods 1,2 and 4).

We define the contrast γ = µ3 - .33µ1 - .33µ2 - .33µ4:

The coefficients for this contrast are C1 = -.33, C2 = -.33, C3 = 1, C4 = -.33.
To test if the extra learning components help:
(a) Select Analyze>Compare Means>One-way ANOVA...
(b) Select scores as Dependent list(s) and index1 as Factor
(c) Click Contrasts
(d) Check the Polynomial button and choose Linear for Degree
(e) Type the coefficients one by one in the Coefficients box, click Add every time, and finally click Continue (see Figure 5.9)
(f) Click OK

Figure 5.9: SPSS Dialog Box for Contrasts

The SPSS output is given in Figure 5.10. The estimated value of the contrast is given as 1.69 (note that it can also be calculated as x̄3 - .33x̄1 - .33x̄2 - .33x̄4, where x̄1, x̄2, x̄3, and x̄4 can be found in the Descriptives box in Figure 5.6), with to = 3.952, df = 40, and the P-value very small (≈ 0.000) when assuming equal variances. Using the information in the table we can also find a 95% confidence interval for γ:


1.69 ± (2.021) √1.518 √( (-.33)²/11 + (-.33)²/11 + (1)²/11 + (-.33)²/11 ) = 1.69 ± .8648

(You can select Transform>Compute and use the SPSS function IDF.T(0.975,40) to return 2.021. Alternatively, use a t calculator such as the one on stattrek.com with df = 40 and a cumulative probability P(T ≤ t) = 0.975 to return 2.021.) Since the 95% confidence interval contains only positive numbers (i.e. adding learning components to the lecture increases the mean result) and the P-value is less than 0.05, we conclude that the extra learning components increase the mean score on the exam.
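The contrast arithmetic above is easy to verify outside SPSS. This optional Python sketch uses the group means from Figure 5.6 together with MSE = 1.518 and 40 error degrees of freedom from the ANOVA table (scipy assumed installed):

import numpy as np
from scipy import stats

means  = np.array([4.82, 6.18, 7.09, 5.36])       # methods 1 to 4, from Figure 5.6
coeffs = np.array([-0.33, -0.33, 1.00, -0.33])
n_per_group, mse, df_error = 11, 1.518, 40

estimate = float(coeffs @ means)                                # approx. 1.69
se = np.sqrt(mse) * np.sqrt(np.sum(coeffs**2) / n_per_group)    # approx. 0.428
t0 = estimate / se                                              # approx. 3.95

t_crit = stats.t.ppf(0.975, df_error)                           # approx. 2.021, the same value as IDF.T(0.975,40)
print(estimate - t_crit * se, estimate + t_crit * se)           # approx. 1.69 - 0.86 and 1.69 + 0.86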

A 6-step one sided right sided hypothesis test could also be performed as follows. We’ll assume a pre-chosen α of 0.01.
Step 1: Ho: µ3 - .33µ2 - .33µ1 - .33µ4 ≤ 0 versus Ha: µ3 - .33µ2 - .33µ1 - .33µ4 > 0
Step 2: Assumptions are the standard ANOVA ones; we have already discussed them for this data and determined that we can proceed with testing.
Step 3: to = 3.952 with 40 degrees of freedom.
Step 4: P-value = P(t > 3.952) ≈ .000/2 ≈ 0
Step 5: Since the P-value ≤ 0.01, we reject Ho.
Step 6: We have evidence that the extra learning components increase the mean score on the exam.

Contrast Coefficients

Contrast

Index1

1 2 3 4

1 -.33 -.33 1 -.33

Contrast Tests

Contrast

Value of Contrast Std. Error t df Sig. (2-tailed)

Scores Assume equal variances 1 1.69a .428 3.952 40 .000

Does not assume equal variances 1 1.69a .406 4.167 18.490 .001

a. The sum of the contrast coefficients is not zero.

Figure 5.10: Contrast: check for superiority of the lecture, computer lab and assignment method
Multiple Comparisons: Sometimes we are interested in more than one contrast. When an ANOVA test gives an overall significant finding for the model, telling us that at least one of the means is different from the others, the creation of individual confidence intervals for all pairwise comparisons is of interest.
Example: Check if the overall model for the test above is significant. If it is appropriate, create pairwise comparison confidence intervals for all possible pairs, so that each confidence interval of interest has an individual level of confidence of 95%. This approach is known as the Fisher LSD approach. Since our F test for overall model significance is significant, it is appropriate to proceed with the calculation of the Fisher LSD intervals. Output is shown in Figure 5.11.


Computer Commands and Output
(a) Select Analyze>Compare Means>One-Way ANOVA...
(b) Use Dependent List: Scores and Factor: Index1
(c) Click Post Hoc and check LSD under Equal Variances Assumed on the Post Hoc Multiple Comparisons screen
(d) Set the significance level to 0.05
(e) Click Continue and then OK

Multiple Comparisons Scores LSD

(I) Index1 (J) Index1

Mean

Difference (I-J) Std. Error Sig.

95% Confidence Interval

Lower Bound Upper Bound

1 2 -1.364* .525 .013 -2.43 -.30

3 -2.273* .525 .000 -3.33 -1.21

4 -.545 .525 .305 -1.61 .52

2 1 1.364* .525 .013 .30 2.43

3 -.909 .525 .091 -1.97 .15

4 .818 .525 .127 -.24 1.88

3 1 2.273* .525 .000 1.21 3.33

2 .909 .525 .091 -.15 1.97

4 1.727* .525 .002 .67 2.79

4 1 .545 .525 .305 -.52 1.61

2 -.818 .525 .127 -1.88 .24

3 -1.727* .525 .002 -2.79 -.67

*. The mean difference is significant at the 0.05 level.

Figure 5.11: Multiple Comparison output for the LSD Approach
The table gives the P-value of the specific test and the 95% confidence interval for each pairwise comparison. Recall that the differences are not considered significant when the confidence intervals contain 0 (or, in our case, when the P-value > 0.05). We see that the differences between μ1 and μ2, μ1 and μ3, and μ3 and μ4 present as significant. The other 3 pairwise intervals do not present as significant. Each of these individual pairwise comparison intervals has 95% confidence. In general, the individual confidence level is the confidence we have that any particular confidence interval of interest contains the difference between the corresponding population means. And, if several confidence intervals are of interest, the family wise (also called experiment wise, simultaneous, or overall) confidence level is the confidence we have that all the confidence intervals of simultaneous interest contain the differences between the corresponding population means. The more comparisons we do, the more likely it becomes that we make a wrong decision (that is, that we have at least one wrong conclusion in our results).

*Suppose we made three comparisons using three independent tests. Then the joint probability of not making a type 1 error on any of the three tests is (.95)³ = .8574, and the probability of making at least one type 1 error is 1 - .8574 = .1426. Now, in our case, our tests are not independent, because MSE is used in each test and the same x̄i's appear in various tests. It can be shown (this is beyond the scope of our course) that the error involved is even greater in this case.


The decision we must make is whether to control for the family wise error rate or for the individual error rates. Several strategies exist to handle this situation. Remember, though, that multiple comparisons should only be made if they are of interest (i.e. we would not make multiple comparisons if the overall F model was not significant). In order to do so, an acceptable individual error rate or an acceptable experiment wise (family wise) error rate, α, must be decided.

Fisher’s LSD: This test is used for pairwise comparisons only. Individual t tests, each at some chosen level, α, are performed. The level of overall significance for all tests performed will be larger (often considerably) than the individual α.

Tukey-Kramer: This test is specifically for pairwise comparisons in an ANOVA setting. It uses a formula that is based on the q distribution (the studentized range distribution), a special type of right-skewed curve. The formula for the CI for µi - µj using the Tukey-Kramer approach is

x̄i – x̄j ± (q_α/√2) √MSE √( 1/ni + 1/nj ) , where q_α = the q value from a q distribution with area α to its right*

*For a balanced design, all these intervals have the same width, and are known as the Tukey HSD intervals. With equal sample sizes, the level of family wise confidence for the Tukey intervals is (1 – α). For an unbalanced design, the Tukey-Kramer intervals do not have the same width. With unequal sample sizes, the level of family wise confidence is “conservative”, and higher than (1 – α).

Bonferroni: This multiple comparison method can be used to test for all possible contrasts of interest in very general situations when we want to control the family wise error rate. Contrasts of interest are decided ahead of time. It uses individual significance levels of α* = α/g, where g is the number of contrasts of interest and α is the overall error rate. When used with pairwise comparisons, Bonferroni intervals have the following formula:

x̄i – x̄j ± t_{α*, n–k} √MSE √( 1/ni + 1/nj ) , where α* = α/m and m is the number of pairwise comparisons

For a balanced design, all Bonferroni intervals will have the same width. In this case, the Tukey HSD approach results in narrower intervals than the Bonferroni approach. The Bonferroni approach is more conservative than the Tukey-Kramer approach.

Scheffe: This multiple comparison method can also be used to test for all possible contrasts of interest, for both equal and unequal sample sizes. It offers an overall α level of protection, regardless of how many contrasts are tested. It is best used post hoc rather than planned. Scheffe intervals are wider than Tukey intervals when used for pairwise comparisons.

Question: For our example above, produce LSD, Bonferroni, Tukey, and Scheffe comparison intervals (see Figure 5.12). Tell SPSS to use α = .05.
Computer Commands and Output

(a) Select Analyze>Compare Means>One-Way ANOVA & Dependent List: Scores and Factor: Index1 (b) Click Post Hoc


(c) Choose LSD, Bonferroni, Tukey, and Scheffe under Equal Variances Assumed on the Post Hoc Multiple Comparisons screen
(d) Set the significance level to 0.05**, and click Continue and OK

**Be careful here. For LSD, SPSS will use an individual error rate of 0.05 for each interval. For the Bonferroni, Tukey, and Scheffe intervals, it will use a family wise error rate of 0.05.

The table gives the P-value of the specific test and the 95% confidence interval for all methods. Recall that the differences are not considered significant when the confidence intervals contain 0 (or, in our case, when the P-value > 0.05). The LSD approach gives the narrowest confidence intervals, but remember, these are intervals where the deciding factor is the individual error rate of 0.05. This method indicates that learning method 3 (lecture, computer lab and assignments) is superior to method 1 (lecture only) and to method 4 (lecture and assignment). It also indicates that learning method 2 (lecture and computer lab) is superior to learning method 1 (lecture only).

The other three approaches use a family wise error rate of 0.05. The Tukey intervals are the narrowest, the Bonferroni the next narrowest, and the Scheffe the widest. The Tukey, Bonferroni, and Scheffe methods all indicate that learning method 3 (lecture, computer lab and assignments) leads to a significantly higher mean score on the standardized test than learning methods 1 (lecture only) and 4 (lecture and assignment). (All other mean scores are not significantly different at an experiment wise error rate of 5%.)

In order to summarize the results of a multiple comparison, it is helpful to create a diagram showing the treatment means, in order from smallest to largest, with their corresponding factor levels listed above them, and to underline the treatment pairs that are NOT significantly different, as shown below. This diagram is for the Bonferroni, Tukey, and Scheffe results. The means for the scores as required for the diagram are returned in a Descriptives box when you run the commands above (see Figure 5.6).

Scores

                                 Subset for alpha = 0.05
              Index1     N          1          2
Tukey HSDa         1    11       4.82
                   4    11       5.36
                   2    11       6.18       6.18
                   3    11                  7.09
              Sig.               .061       .322
Scheffea           1    11       4.82
                   4    11       5.36
                   2    11       6.18       6.18
                   3    11                  7.09
              Sig.               .098       .404

Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 11.000.

Diagram
   1        4        2        3
  4.82     5.36     6.18     7.09
  ---------------------------
                    ----------------


Multiple Comparisons Dependent Variable:Scores

(I) Index1 (J) Index1 Mean

Difference (I-J) Std. Error Sig.

95% Confidence Interval

Lower Bound Upper Bound

Tukey HSD 1 2 -1.364 .525 .061 -2.77 .04

3 -2.273* .525 .001 -3.68 -.86

4 -.545 .525 .728 -1.95 .86

2 1 1.364 .525 .061 -.04 2.77

3 -.909 .525 .322 -2.32 .50

4 .818 .525 .414 -.59 2.23

3 1 2.273* .525 .001 .86 3.68

2 .909 .525 .322 -.50 2.32

4 1.727* .525 .011 .32 3.14

4 1 .545 .525 .728 -.86 1.95

2 -.818 .525 .414 -2.23 .59

3 -1.727* .525 .011 -3.14 -.32

Scheffe 1 2 -1.364 .525 .098 -2.90 .17

3 -2.273* .525 .001 -3.81 -.74

4 -.545 .525 .783 -2.08 .99

2 1 1.364 .525 .098 -.17 2.90

3 -.909 .525 .404 -2.44 .62

4 .818 .525 .497 -.72 2.35

3 1 2.273* .525 .001 .74 3.81

2 .909 .525 .404 -.62 2.44

4 1.727* .525 .021 .19 3.26

4 1 .545 .525 .783 -.99 2.08

2 -.818 .525 .497 -2.35 .72

3 -1.727* .525 .021 -3.26 -.19

LSD 1 2 -1.364* .525 .013 -2.43 -.30

3 -2.273* .525 .000 -3.33 -1.21

4 -.545 .525 .305 -1.61 .52

2 1 1.364* .525 .013 .30 2.43

3 -.909 .525 .091 -1.97 .15

4 .818 .525 .127 -.24 1.88

3 1 2.273* .525 .000 1.21 3.33

2 .909 .525 .091 -.15 1.97

4 1.727* .525 .002 .67 2.79

4 1 .545 .525 .305 -.52 1.61

2 -.818 .525 .127 -1.88 .24

3 -1.727* .525 .002 -2.79 -.67

Bonferroni 1 2 -1.364 .525 .079 -2.82 .09

3 -2.273* .525 .001 -3.73 -.81

4 -.545 .525 1.000 -2.00 .91

2 1 1.364 .525 .079 -.09 2.82

3 -.909 .525 .548 -2.37 .55

4 .818 .525 .764 -.64 2.28

3 1 2.273* .525 .001 .81 3.73

2 .909 .525 .548 -.55 2.37

4 1.727* .525 .013 .27 3.19

4 1 .545 .525 1.000 -.91 2.00

2 -.818 .525 .764 -2.28 .64

3 -1.727* .525 .013 -3.19 -.27

*. The mean difference is significant at the 0.05 level.

Figure 5.12: SPSS Output for the Multiple Comparison of Means Procedure
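For readers who want to see where the interval widths in Figure 5.12 come from, the optional Python sketch below computes the LSD, Bonferroni, and Tukey half-widths from MSE, the error degrees of freedom, and the common group size (scipy version 1.7 or later is assumed for the studentized range distribution):

import numpy as np
from scipy import stats

mse, df_error, n, k, alpha = 1.518, 40, 11, 4, 0.05
se_pair = np.sqrt(mse * (1 / n + 1 / n))          # approx. 0.525, the "Std. Error" column in Figure 5.12

m = k * (k - 1) // 2                              # 6 pairwise comparisons
lsd_half   = stats.t.ppf(1 - alpha / 2, df_error) * se_pair                 # individual 95% intervals
bonf_half  = stats.t.ppf(1 - alpha / (2 * m), df_error) * se_pair           # family wise 95% intervals
tukey_half = stats.studentized_range.ppf(1 - alpha, k, df_error) / np.sqrt(2) * se_pair
print(lsd_half, bonf_half, tukey_half)            # approx. 1.06, 1.46, 1.41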


5.3 Randomized Block Designs
In a single factor ANOVA, once subjects have been chosen, each subject is randomly and independently allocated to the levels (treatments) of the factor. Sometimes subjects exhibit heterogeneity with respect to a factor other than the treatment factor of interest. Therefore, we might “block” on that other factor. Then, within each homogeneous block, the treatments can be randomly assigned. In general, a randomized block design will have b levels of a “blocking” factor and k levels of a treatment factor of interest. Including the block factor allows us to correct for its influence on the dependent treatment variable and to account for the variability it causes. Note that blocks can sometimes be subjects who are tested at each of the k different levels of the treatment factor of interest. When k = 2 (the treatment factor of interest has 2 levels), the randomized block test can be shown to be analogous to the familiar paired t test. When k > 2, an F test is used. The block factor is only of secondary interest, but if blocking is necessary, it should present as significant.
Assumptions:

1. The samples are randomly and independently selected.
2. The populations are approximately normal for each factor/block combination.
3. The population variances for all factor/block combinations are equal.

Model:

Xij = µ + αi + βj + εij , i = 1,…,k, j = 1,…,b

Xij is a random dependent variable denoting the ith treatment measurement in block j
k is the number of treatments
b is the number of blocks
αi is a parameter that indicates the effect for treatment i
βj is a parameter that indicates the effect for block j
μij = µ + αi + βj is the mean of the ijth treatment/block combination
εij is the error in measurement, which can be explained through random effects not included in the model. We can write that we assume εij ~ N(0, σ²).

Example:

Three language tests shall be compared.

1. WAIS vocabulary (linguistic)

2. Willner Unusual Meanings Vocabulary (WUMV) (pragmatic)

3. Willner-Sheerer Analogy Test (WSA) (pragmatic)


It is assumed that subjects should score lower on the WUMV and WSA compared to the WAIS. Twelve subjects took the three tests and their scores are recorded in the table below. Blocking will be done on the 12 subjects. Blocking allows us to look at each treatment mean with “noise” factors such as subject experience and intelligence managed.

Subject   WAIS   WUMV   WSA
1           15     12    11
2           10     11     8
3            6      4     3
4            7      7     5
5            9      6     6
6           16     14    10
7           11     10     7
8           13      9     4
9           12     10     8
10          10      8     7
11          11      9     9
12          14     11    10

It is reasonable to assume that the outcome on a test depends on the difficulty of the test and the linguistic ability of the person taking the test. In order to account for the linguistic ability of the different people a Randomized Block Design is chosen, and the test type is the treatment variable and the person is the block variable for the analysis. The data can be found in languagetest.sav on Blackboard. The file includes three variables: subject (ranging from 1 to 12), test (with k = 3 levels WAIS, WUMV, or WSA), and score (the test results for each subject/test pair). Questions:

1. State the model for the scores on the three language tests.
2. Obtain a line chart showing the means for each block/treatment combination. Does the graph indicate a treatment effect? A block effect? (Note that there is only one observation for each block/treatment combination here.)
3. Test at a significance level of 5% if the mean scores for the three tests are all the same.
4. Should you do a multiple comparison for the mean scores? Explain. If yes, do one.
5. Use the residuals to check if the model assumptions are met.

Solution:

1. Xij = µ + αi + βj + εij , i = 1,…,3, j = 1,…,12

Xij is a random dependent variable denoting the ith score of person j
k is the number of tests
b is the number of persons
αi is a parameter that indicates the effect for test i
βj is a parameter that indicates the effect for person j
μij = µ + αi + βj is the mean of the ijth test/person combination
εij is the error in measurement, which can be explained through random effects not included in the model. We can write that we assume εij ~ N(0, σ²).


2. To obtain the line chart

(a) Choose Graphs>Legacy Dialogs>Line>Multiple>Define (b) In Lines represent box choose Other Statistics (e.g. mean) for variable score (c) Choose subject (the block) for Category Axis (d) Choose test (the treatment) for Define Lines by (see Figure 5.13) (e) Click OK

In the Line Graph (see Figure 5.14) each line represents the results of one test (level of treatment). The lines are more or less parallel to each other, indicating that there is no interaction between subject and test. For each subject, the WAIS test score generally falls above the WUMV test score, and that, in turn, generally falls above the WSA test score, so we should expect to find an effect of the test on the mean score. Since the mean scores for the different subjects vary, we should also expect to find an effect of the subject on the mean score.

Figure 5.13: Define Multiple Line: Summaries for Groups of Cases


Figure 5.14: Line Graph for scores

3. For the test:
Step 1: H0: α1 = α2 = α3 = 0 versus Ha: at least one is not 0, level of significance = 0.05
Step 2: It is given that the samples are randomly and independently selected. It must be assumed that the populations are approximately normal for each factor/block combination, and that the population variances for all factor/block combinations are equal. We will examine the sample residuals below for normality, centering at 0, and equal variance. (We remember that our samples are only the results of one experiment, and that the results we find when examining the sample residuals are suggestive of what is going on in the populations, but not proof.)

(a) To obtain the result from SPSS

a) Go to Analyze>General Linear Model>Univariate... b) Choose score as the Dependent Variable c) Choose test and subject as Fixed Factors d) Click Model, and choose Custom e) For the Built terms choose Main effects from the pull down menu f) Move test and subject into the Model box and click Continue g) Click OK

(b) In the SPSS output (see Figure 5.15), you can see that F = 31.798 with df1 = 2 and df2 = 22, and P-value < 0.001.


(c) Since the P-value is less than α, we conclude, at a level of significance of 0.05, that the mean scores for the tests are not all the same.

Figure 5.15: SPSS Output for RBD Analysis
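As an optional cross-check of Figure 5.15 outside SPSS, the same additive model can be fitted in Python with statsmodels (assumed installed), keying in the scores from the table above; the column names mirror languagetest.sav:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

scores = {"WAIS": [15, 10, 6, 7, 9, 16, 11, 13, 12, 10, 11, 14],
          "WUMV": [12, 11, 4, 7, 6, 14, 10, 9, 10, 8, 9, 11],
          "WSA":  [11, 8, 3, 5, 6, 10, 7, 4, 8, 7, 9, 10]}
rows = [{"subject": s + 1, "test": t, "score": v}
        for t, values in scores.items() for s, v in enumerate(values)]
data = pd.DataFrame(rows)

# Additive (main effects only) model: the randomized block model stated above
model = smf.ols("score ~ C(test) + C(subject)", data=data).fit()
print(anova_lm(model))   # the F for C(test) is approx. 31.8 on (2, 22) degrees of freedom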

4. Since we found that for at least one of the tests the mean score is different from the others, we now ask ourselves “where are the differences?”. In order to control the experiment wise error rate, we should do a multiple comparison of means. We will use the Tukey approach here, as it will result in narrower intervals.

Commands to do the multiple comparison are:

(a) Analyze>General Linear Model>Univariate... and choose Post Hoc (b) Move test over into the Post hoc Test for box and click Tukey (c) Click Continue and then OK.

In Figure 5.16, observe that each pairwise comparison is labelled with a star, indicating that at an experiment wise error rate of 5% all mean test scores are significantly different from each other. From the confidence intervals we can conclude that the WAIS is the easiest test (highest scores), and the WSA is the hardest test (lowest scores). When we make a diagram to order the treatment means from smallest to largest (with their matching treatment levels above), there will be nothing to underline! Note that in order to obtain the treatment means, after setting up the commands to create the output for the multiple comparisons and before you run the commands, add a step where you click on the Options box in the Univariate dialog and move test (the treatment) into the Display Means for box, and then click Continue.


Figure 5.16: SPSS Output for Tukey’s Multiple Comparison for the Variable test

Diagram
    2        3        1
  7.333    9.250   11.167

5. We can use the residuals and make a normal probability plot of them to see if it is reasonable to assume that the errors come from a normal distribution (see Figure 5.17a). We can also plot the residuals versus the fitted (predicted) values to see if the errors are centered at 0 and have equal variances. To save the residuals and the fitted values:

(a) Go to Analyze>General Linear Model>Univariate (b) Dependent Variables: Score (c ) Fixed Factor(s): Subject and Test (d) Save: Predicted Values: Unstandardized and Residuals: Standardized (e) Continue (f) OK

(g) Analyze > Descriptive Statistics > Q-Q Plots (h) Variables: ZRE_1 (Residuals for Score) (i) Test Distribution: Normal (j) OK


Figure 5.17a: Normal Probability Plot of the Residuals

Most of the residuals are close to the line with the exception of 2 outlying values. These belong to subject 8, who obtained a 13 in WAIS, 9 in WUMV, and 4 in WSA. This subject, with the second smallest WSA score, has the widest range in values between WSA and WAIS. Although s/he follows the expected pattern of doing best in WAIS and worst in WSA, s/he obtained a higher score on WAIS than we might have expected. Given the robustness of the randomized block design to slight departures from normality, the test is still an appropriate one to perform. And again, we note our small sample sizes.

Figure 5.17b: Subject 8’s information from the raw data.

We now create and observe a scatterplot of the standardized residuals against the unstandardized predicted values.


a) Graphs>Legacy Dialogs>Scatterplot b) Put ZRE_1 to the Y axis c) Put PRE_1 to the X Axis d) Say OK

Figure 5.17c: Plot of standardized residuals vs fitted (predicted) values

With the exception of the points already noted with subject 8 (which correspond to slightly higher residuals (in absolute value) than expected), the plot of the residuals versus the predicted values shows what appears to be random points centered in a symmetric band of constant width around 0, indicating (potentially) equal variances of the errors in the factor/block combinations.

5.4 2-Way ANOVA
In randomized block designs we allow for two factors to influence the outcome of the dependent variable, but our interest is only in the effect of the treatment variable, and we only include the block variable to account for its influence on the dependent variable. We also assume that the treatment effect is the same for each block. In 2-way ANOVA we include two factor variables (say A and B) in the model, and their combination determines the treatment, e.g. the choice of seed + choice of fertilizer determines the treatment of a certain plot. We no longer assume that the effect of one factor is the same for each level of the other factor. Interaction is possible.


Assumptions:

1. The samples are randomly and independently selected.
2. The populations are approximately normal for each treatment (i.e. factor combination).
3. The population variances for all treatments are equal.

Model:

Xijl = µ + αi + βj + (αβ)ij + εijl , i = 1,…,I, j = 1, …, J, l = 1, …, L

Xijl is a random dependent variable denoting the lth treatment measurement at level i of Factor A and level j of Factor B
I is the number of levels of Factor A
J is the number of levels of Factor B
L is the number of observations at treatment combination ij
αi is the effect for level i in Factor A
βj is the effect for level j in Factor B
(αβ)ij is the interaction effect of level i of factor A and level j of factor B
εijl is the error in measurement, which can be explained through random effects not included in the model. We can write that we assume εijl ~ N(0, σ²).

Note that we will only consider the balanced two-way ANOVA design (with the same number of sample units for each combination of level i of factor A and level j of factor B).
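For reference, the same two-way model with interaction can be fitted outside SPSS with Python's statsmodels (assumed installed). The optional sketch below assumes the golf data used in the example that follows have been exported to a hypothetical file golf.csv with columns club, ball, and distance:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

golf = pd.read_csv("golf.csv")   # hypothetical export of golf.sav with columns club, ball, distance

# C(club)*C(ball) expands to both main effects plus the club-by-ball interaction
model = smf.ols("distance ~ C(club) * C(ball)", data=golf).fit()
print(anova_lm(model, typ=2))    # rows for C(club), C(ball), and the interaction term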

Example: The distance a golf ball is hit depends on the type of club used, and on the brand of the ball used. A golf player investigates this effect by hitting balls of 4 different brands (A, B, C, D) with a driver and a five iron. For each hit the distance the ball travelled is measured. The results can be found in golf.sav on Blackboard.

1. State the model for distance in dependency on the club used and the brand of the ball. Include interaction effects.

2. Obtain two line charts showing the means for each club/brand combination. One should have club on the horizontal axis and ball on the lines, and the other should have ball on the horizontal axis and club on the lines. Do these graphs indicate an interaction effect? a club effect? a brand effect?

3. Obtain clustered boxplots to check the assumptions.
4. Create a two-way ANOVA table with which you will perform inference for your data.
5. Analyze the residuals from this model for normality.
6. Test if the model is useful in describing the distance a ball was hit with a driver or five iron.
7. Test if the two factors club and brand of the ball interact.
8. Test if the brand of the balls affects the mean distance the balls were hit.
9. Test if the mean distances the balls were hit are significantly different for the driver and the five iron.
10. Is it useful to do multiple comparisons for comparing the mean distances for the different brands of balls and/or different clubs? Do any necessary post hoc analysis.


Solution: 1. Xijl = µ + αi + βj + (αβ)ij + εijl , i = 1,…,4, j = 1, …, 2, l = 1, …, 4

Xijl is a random dependent variable denoting the distance a ball of brand i was hit by a club of type j on repetition l
αi is the effect for ball brand i
βj is the effect for club type j
(αβ)ij is the interaction effect of ball brand i and club type j
εijl is the error in measurement, which can be explained through random effects not included in the model. We can write that we assume εijl ~ N(0, σ2).

2. To obtain the line charts with ball on the horizontal axis, do the following.

a. Choose Graphs>Legacy Dialogs>Line>Multiple>Define
b. In the Lines represent box choose Other statistics (e.g. mean) for variable distance
c. Choose ball as the Category Axis
d. Choose club for Define Lines by
e. Click OK

To obtain the line chart with club on the horizontal axis, follow the commands above, but choose club as the Category Axis and ball for Define Lines by.
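If you prefer to work from syntax, clicking Paste instead of OK in the dialogs above produces commands similar to the sketch below (a rough sketch only; the exact subcommands pasted can vary a little between SPSS versions, and it assumes the golf.sav variables are named distance, ball and club as above):

GRAPH
  /LINE(MULTIPLE)=MEAN(distance) BY ball BY club.
GRAPH
  /LINE(MULTIPLE)=MEAN(distance) BY club BY ball.

The first BY variable goes on the category (horizontal) axis and the second BY variable defines the lines.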

Figure 5.18a: Line Chart for mean distance


Figure 5.18b: Line Chart for mean distance

The difference between the average of the means for the driver and the average of the means for the five iron indicates a likely effect of the factor club. The differences among the means for the four brands indicate a likely effect of the brand of ball. Since the lines are not parallel, we should expect to find an interaction effect within the model.

3. To find a clustered boxplot with SPSS
a. Select Graphs>Boxplot...>Clustered>Define
b. Set Variable to distance, Category Axis to club, and Define Clusters by: to ball
c. Click OK


Figure 5.19: Clustered Boxplot for Distance Dependent on Choice of Club and Brand of Ball

All boxes look somewhat symmetric (see Figure 5.19), but the lengths of the boxes do vary. We find no notable evidence against the assumption that the data are normally distributed, but we should look further at the estimated standard deviations to check if the assumption of equal variances is reasonable. One way to find the descriptive statistics split by the factors is as follows.

(a) Select Data>Split File
(b) Select Compare groups
(c) Move ball and club into the Groups based on: box
(d) Click OK
(e) Select Analyze>Descriptive Statistics>Descriptives
(f) Move distance into the Variable(s) box
(g) Click OK
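Equivalently, the split-file descriptives can be requested with syntax along these lines (a sketch, assuming the golf.sav variable names used above; remember to turn the split off afterwards):

SORT CASES BY ball club.
SPLIT FILE LAYERED BY ball club.
DESCRIPTIVES VARIABLES=distance
  /STATISTICS=MEAN STDDEV MIN MAX.
SPLIT FILE OFF.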

In the output (see Figure 5.20) the largest standard deviation is 9.24, which is more than 4 times larger than the smallest (1.96). Therefore we have indication that the variances might not be the same.

Figure 5.20: Descriptive Statistics for the Distance of Each Treatment (club/ball combination)

Note that Figure 5.20 also provides the means for each treatment (club/ball combination).


4. To obtain all the results we need, we conduct a 2-way ANOVA (see Figure 5.21).

(You will need to make sure that you follow the path Data>Split File and change back to “Analyze all cases, do not create groups” prior to following the commands below)

a. Select Analyze>General Linear Model>Univariate
b. Select distance as the dependent variable
c. Select club and ball as Fixed Factors
d. Click OK

Figure 5.21: ANOVA Table for Distance Depending on Club and Ball

5. To obtain residuals, the commands above are used with an additional step to save the residuals, viz.

a. Select Analyze>General Linear Model>Univariate
b. Select distance as the dependent variable
c. Select club and ball as Fixed Factors
d. Save button: under Residuals check Standardized, and under Predicted Values check Unstandardized
e. Click Continue
f. Click OK
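Pasting these choices gives syntax roughly like the following sketch (ZRESID saves the standardized residuals and PRED the unstandardized predicted values; minor pasted subcommands may differ by SPSS version):

UNIANOVA distance BY club ball
  /SAVE=PRED ZRESID
  /DESIGN=club ball club*ball.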

In the data file, columns labeled ZRE_1 (standardized residuals) and PRE_1 (unstandardized predicted values) will have appeared. We will make a normal probability plot of these residuals (see Figure 5.22).

(a) Select Analyze>Descriptive Statistics>Q-Q Plots
(b) Select Standardized Residuals as the Variable
(c) Under Test Distribution, ensure Normal appears
(d) Click OK


Figure 5.22a: Normal Probability Plot of the Residuals

Most of the points are fairly close to the straight line. It is perhaps reasonable to assume that the residuals come from a normal distribution (that is, that the distance variable is normally distributed for each (club type, ball brand) combination).

Figure 5.22b: Plot of the Standardized Residuals vs the Predicted Values

In Figure 5.22b, the plot shows what appears to be random points centered on fairly symmetric bands around 0 for the driver/brand groupings. There is clearly an indication that the assumption of equality of the variances for the groups is violated, as the vertical distances from 0 vary widely. However, for educational purposes, we proceed with the example.

6. Testing for model (see Figure 5.21 for output):

(a) Ho: the overall model is NOT useful Ha: the overall model is useful

(b) We are given that the samples are randomly and independently selected. Investigation of the sample data, to see if it suggests that the populations are approximately normal for each treatment (i.e. factor combination) and that the population variances for all treatments are equal, was undertaken above. For the sample data, the assumption of normality seems to be met, but the variances might not be the same. The reasons for this are detailed above.

(c) Fo= 140.689 with df1 = 7 and df2 = 24


(d) P-value < 0.001
(e) Reject H0, as the P-value < 0.05
(f) The data provide sufficient evidence that the treatment means are not all the same for all club/ball combinations. The 2-way model helps to explain the distance a ball was hit.

7. Testing for interaction of club and ball (see Figure 5.21 for output):

a. Hypotheses: H0: (αβ)11 = ... = (αβ)42 = 0 versus Ha: at least one interaction term is not 0, α=0.05

b. See above for assumptions (only say this if you are sure you are correct in previous work!)

c. Fo = 7.459, with df1 = 3 and df2 = 24
d. P-value = 0.001
e. Reject Ho as P-value < α = 0.05
f. The data provide sufficient evidence that not all interaction terms are zero, i.e. interaction exists: certain club/ball combinations seem to be particularly good or bad in a way that is not explained through the main effects of club and ball. It appears that certain club/ball combinations result in significantly different mean distances that cannot be explained by just adding the main effect means for that club and ball.

8. Testing for main effect of ball (see Figure 5.21 for output):

a. Hypotheses: H0: α1 = ... = α4 = 0 versus Ha: at least one term is not 0, α = 0.05
b. See above for assumptions (only say this if you are sure you are correct in previous work!)
c. Fo = 7.817, with df1 = 3 and df2 = 24
d. P-value = 0.001
e. Reject Ho as P-value < α = 0.05
f. The data provide sufficient evidence that the mean distances are not all the same for the different brands, averaging over all clubs.

9. Testing for main effect of club (see Figure 5.21 for output):

a. Hypotheses: H0: β1 = β2 = 0 versus Ha: at least one term is not 0, α = 0.05
b. See above for assumptions (only say this if you are sure you are correct in previous work!)
c. Fo = 938.996 with df1 = 1 and df2 = 24
d. P-value < 0.001
e. Reject Ho as P-value < α = 0.05
f. The data provide sufficient evidence that the mean distances are not all the same for the different clubs, averaging over all brands.

The 2-way ANOVA analysis confirmed that the distance the balls went depended on the choice of club and the brand of the balls, and in addition an interaction effect was confirmed. Certain combinations of club and brand work particularly well or badly beyond what can be explained through the main effects. All these results have to be treated carefully because the assumption that the variances are equal seems to be violated. Larger sample sizes could help to make a decision on this problem.


10. With the 2-way ANOVA analysis we confirmed the presence of an interaction effect of club and ball on the distance a golf ball was hit. This means that for the different clubs, different balls were best/worst (the mean distance hit fell above/below the expected value), or, equivalently, for the different balls, different clubs were best/worst.

To further analyze which type of ball was best/worst for each club, we should, for each club type, conduct a pairwise multiple comparison of the mean distances attained by all possible pairs of ball brands when that club type was used (that is, for each pair of ball brands, test if the mean distances the balls were hit are significantly different when that club type was used). To further analyze which type of club was best/worst, we should, for each ball brand, conduct a pairwise comparison of the mean distances attained by the five iron and the driver when that ball brand was used (that is, for the five iron/driver pair, test if the mean distance the balls were hit by five irons differs significantly from the mean distance the balls were hit by drivers when that ball brand was used).

We will need to rerun the 2-way ANOVA commands, but this time multiple comparison commands will be included. It is important to note that these multiple comparisons look 1) at pairwise intervals within each level of Club (our first factor of interest) and 2) at pairwise intervals within each level of Ball (our second factor of interest). This is because our model has interaction. Commands for multiple comparisons of ball brands within levels of club type, and for multiple comparisons of club types within levels of ball brand, follow. We will make Bonferroni intervals here; SPSS will not create Tukey intervals in this case, when we wish to look for pairwise comparisons of means of one factor within the levels of another factor. (Note that the commands ADJ(LSD) or ADJ(Scheffe) can be used, though.)

Select Analyze>General Linear Model>Univariate.

a. Click Options
b. Select club*ball
c. Click Continue
d. Click Paste to open the SPSS Syntax Editor
e. Complete the line starting with /EMMEANS = TABLES(club*ball) by adding COMPARE(club) ADJ(Bonferroni)
f. Add another /EMMEANS = TABLES(club*ball) line underneath it, this time adding COMPARE(ball) ADJ(Bonferroni) (see Figure 5.23)
g. Select Run>All
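After the edits in steps e and f, the completed syntax should look approximately like the sketch below (compare with Figure 5.23; minor subcommands pasted by your SPSS version may differ):

UNIANOVA distance BY club ball
  /EMMEANS=TABLES(club*ball) COMPARE(club) ADJ(BONFERRONI)
  /EMMEANS=TABLES(club*ball) COMPARE(ball) ADJ(BONFERRONI)
  /DESIGN=club ball club*ball.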


Figure 5.23: Syntax for Analysis of Treatment Means

SPSS conducts multiple comparisons (Bonferroni procedures) of the mean distances by brand for each club separately, and by club for each brand separately – see Figures 5.24 and 5.25 below. Figure 5.24 shows the results of comparisons of the mean distances attained by driver and five iron for each brand of balls separately. At any reasonable experiment wise error rate the mean distances for driver and five iron are significantly different for all brands of balls. The mean distance made using the driver is always greater than the mean distance using the five iron. Figure 5.25 shows the results of comparisons of the mean distances attained by all the ball brand pairs (there are 6 distinct pairs) for each club type separately. For drivers, brand C significantly outperforms brand A and brand D at an experiment wise error rate of 5%. For five irons, brand B significantly outperforms brand C and brand D at an experiment wise error rate of 5%. No other significant results are observed at an experiment wise error rate of 5%. The star beside any mean difference indicates that the difference is significantly different from 0. One could also check the confidence intervals, and find that none includes zero.


Estimates
Dependent Variable: distance

club        ball    Mean       Std. Error    95% CI Lower Bound    95% CI Upper Bound
Driver      A       228.425    2.923         222.393               234.457
            B       233.725    2.923         227.693               239.757
            C       243.100    2.923         237.068               249.132
            D       229.750    2.923         223.718               235.782
Five iron   A       171.300    2.923         165.268               177.332
            B       182.675    2.923         176.643               188.707
            C       167.200    2.923         161.168               173.232
            D       160.500    2.923         154.468               166.532

Figure 5.24: Multiple Comparison of Mean Distances by Club for Each Brand

Figure 5.24 gives the outcomes of four multiple comparisons (one for each Brand of ball) of the mean distances between the driver and five iron. All results are significant. The mean distance made using the driver is always greater than the mean distance using the five iron for all brands of ball. Diagrams made for each Ball brand do not have any underlining as there are no nonsignificant results.

Diagram for Ball A:   Five iron 171.300   Driver 228.425
Diagram for Ball B:   Five iron 182.675   Driver 233.725
Diagram for Ball C:   Five iron 167.200   Driver 243.100
Diagram for Ball D:   Five iron 160.500   Driver 229.750


Figure 5.25: Multiple Comparison of Mean Distances by Brand for Each Club

The output in Figure 5.25 gives the outcomes of two multiple comparisons (one for the driver and a second for the five iron) of the mean distances for the different brands of balls.

Diagram for Driver (ordered means; underlining in such diagrams connects means that are not significantly different):
A 228.4     D 229.8     B 233.7     C 243.1

The diagram shows that at a 5% experiment wise error rate, balls from brand C went significantly farther when hit by the driver than balls from brands A and D. No other differences are significant at this error rate.

Diagram for Five Iron (ordered means):
D 160.5     C 167.2     A 171.3     B 182.7

The diagram shows that at an experiment wise error rate of 5%, balls from brand B went significantly farther when hit with the five iron than balls from brand D. No other differences are significant at this error rate.


Note: If you wished to use Bonferroni and Tukey (or any of the other pairwise comparison methods offered) to compare marginal pairwise means for ball and/or for club (if significance indicated their investigation was warranted), you could have included those requests in the Post Hoc commands when running the Analyze>GLM>Univariate commands. However, they are not of interest here because there is interaction in the model. They would only have been of interest if no interaction had been found, and only for factors found to be significant.

IMPORTANT: In general, if a 2-way ANOVA model does not have interaction, and a factor turns out to be significant, then the pairwise multiple comparison intervals should not be created separately for particular levels of the other factor. In this case, it is correct to create multiple comparison intervals for the differences of the marginal pairwise means themselves. This should be done from the Post Hoc box when following the Analyze>General Linear Model>Univariate path. This was what we did in the randomized block question above, where the model assumes no interaction between language tests (first factor) and blocking (second factor). There, when the factor language test turned out to be significant, we created intervals to compare each pair of language test means (e.g. μWAIS versus μWSA).


Chapter 6 Non-Parametric Statistics

6.1 Wilcoxon (Mann-Whitney) Rank Sum Test for 2 Independent Samples

Assumptions:

1. Independent simple random samples
2. Numerical Response Variable
3. Same shaped populations with equal variances.
4. Continuous population distributions (so not many ties)
5. At least 10 observations in each sample

Note the absence of a normality assumption.

HO: LOCATION OF D1 AND D2 IS THE SAME
HA: LOCATIONS OF D1 AND D2 DIFFER (OR D1 IS SHIFTED RIGHT OF D2, OR D2 IS SHIFTED RIGHT OF D1)
(WHERE D1 AND D2 ARE IDENTICALLY SHAPED DISTRIBUTIONS FROM WHICH THE TWO INDEPENDENT SIMPLE RANDOM SAMPLES WERE CHOSEN)

Example: Do patients taking Drug A take less time to recover? A new medicine, Drug A, has been developed for treating patients with low hemoglobin counts. The pharmaceutical company that developed the new medicine is planning to advertise that it is superior to another medicine, Drug B, currently in use. As evidence the company uses the number of days to recovery of a sample of patients who were independently and randomly assigned to one or the other of the two drugs.

The data:

Drug A: 14, 10, 1, 12, 11, 14, 8, 10, 2, 12, 16, 12, 12, 15, 4 Drug B: 17, 15, 5, 14, 18, 3, 16, 13, 15, 16, 17, 8, 19

The data are given in the SPSS data file drug.sav . Do patients taking Drug A take less time to recover? Conduct either a t-test or the two-sample Wilcoxon rank sum test. Use a 5% significance level in your test.

Solution:

In order to decide which test to use, first we check the assumptions. The side-by-side boxplots and histograms showing the time in days to recovery for the two drugs are shown in Figures 6.1 and 6.2.

To draw the boxplots,

1. Select Graphs>Legacy Dialogs>Boxplot
2. Choose Simple and Summaries for groups of cases
3. Click Define
4. Variable: Time
5. Category Axis: Drug
6. Click OK


Figure 6.1: Side by Side Boxplots Showing Time (in days) to Recovery

To draw the histograms,

7. Select Graphs>Legacy Dialogs>Histogram
8. Check Display normal curve
9. Choose time for the variable
10. Use drug for rows
11. Click OK

Figure 6.2: Histograms for Drug A and B, with Normal Curve

You can also get the box-plots and separate histograms for the two drugs by using Analyze>Descriptive Statistics>Explore.


Based on the side-by-side boxplots and histograms, it appears reasonable to assume that the underlying distributions are similar in shape. However, both distributions are skewed to the left (as evidenced by the boxplot whiskers and the tails in the histograms) and do not appear to be normal. Next we take a look at the probability plots.

1. Select Analyze>Descriptive Statistics>Explore
2. Drag Time to the Dependent List
3. Drag Drug to the Factor List
4. Click the Plots button
5. Select Normality Plots and then Continue
6. Click OK
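The same boxplots, histograms and normality plots can also be requested in one step with syntax along these lines (a sketch, assuming the drug.sav variables are named Time and Drug):

EXAMINE VARIABLES=Time BY Drug
  /PLOT=BOXPLOT HISTOGRAM NPPLOT
  /STATISTICS=DESCRIPTIVES.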

Figure 6.3

The sigmoid shape of the normality plots further bears out the skewness of the sample data in the two sample distributions that we noted in the histograms and the boxplots. The sample distributions do not suggest that the populations are normal. We should avoid an independent samples t-test here.

The Wilcoxon Rank Sum Test is the appropriate tool to use.

SPSS only accepts numerical grouping variables for the 2-sample test, so we have to recode the variable Drug into a numerical variable, say n_drug, prior to performing the Wilcoxon rank sum test. Commands to transform the data are:

1. Transform>Recode>Into Different Variables...
2. Double click the variable Drug and write the name and the label of the new variable n_drug
3. Click Change
4. Select Old and New Values
5. Enter the old (A, B) and the new values (1, 2), and every time click the Add button (see Figure 6.4)
6. Select Continue to close the Old and New Values dialog box
7. Click OK to close the Recode into Different Variables dialog box
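The equivalent recode syntax is roughly as follows (a sketch; the variable label text is illustrative only):

RECODE Drug ('A'=1) ('B'=2) INTO n_drug.
VARIABLE LABELS n_drug 'Drug coded numerically (1=A, 2=B)'.
EXECUTE.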


Figure 6.4: Recoding into a Numerical Variable

Now we can perform the Wilcoxon Rank Sum (Mann-Whitney) test:

Use Di, i = 1,2, for the distribution shapes of the recovery days for the 2 populations (1=Drug A and 2=Drug B). Let M1 = the median number of days to recover by drug A, and M2 = the median number of days to recover by drug B. Let µ1 = the mean number of days to recover by drug A, and µ2 = the mean number of days to recover by drug B. Step 1: Hypotheses: Ho: D1 – D2 >= 0 versus Ha: D1 – D2 < 0

(or H0: M1 – M2 ≥ 0 versus Ha: M1 – M2 < 0) (or Ho: µ1 – µ2 ≥ 0 versus Ha: µ1 – µ2 < 0)

Step 2: Assumptions: Independent simple random samples (as per sample design), numerical response variable, same shape continuous populations with equal variances, at least 10 observations in each sample. Prior to obtaining the SPSS output to help finish the formal write-up for the hypothesis test, recall how the WRS test works. First, all data from both samples are ranked (when there are ties, each tied observation is assigned the mean of the ranks they would have had if no ties were present). Then the average ranks for the two samples are compared to see if they differ significantly from what would be expected if the null hypothesis were true. An average rank from one population that was significantly smaller than the average rank of the other population would imply that the median of the one population was significantly smaller than the median of the second population, and vice-versa. Let W = the sum of the ranks of sample 1. Since the sum of all the ranks is n(n+1)/2, the sum of the ranks of sample 2 can readily be determined if W is known. Therefore, it suffices to choose W as the test statistic. When both sample sizes are at least 10, the W distribution, if Ho is true, is close to normal with mean n1(n1 + n2 + 1)/2 and standard deviation √( n1 n2 (n1 + n2 + 1) / 12 ).

SPSS will calculate both the asymptotic p-value and an exact p-value. The exact p-value for a WRS (M-W) test can be calculated for any sample sizes, even those less than 10. However, if one chooses to report an exact p-value for sample sizes less than 10, and use it to perform a test, one should be very sure that the other assumptions are met. Due to the difficulty of making inferences from small samples to larger populations, many textbooks (such as Weiss) choose to teach students that they


should only perform the Wilcoxon rank sum test when both sample sizes are at least 10, and have them use the asymptotic p-value (as the sample sizes are then a bit larger). We take a moment to look at the by-hand results for this example. This is often a good idea, as we can check that we are running the SPSS commands correctly with a small amount of data, in order to ensure we get it right for samples with more data.

SAMPLE 1 (Drug A) overall ranks: 1, 2, 4, 6.5, 8.5, 8.5, 10, 12.5, 12.5, 12.5, 12.5, 17, 17, 20, 23
SAMPLE 2 (Drug B) overall ranks: 3, 5, 6.5, 15, 17, 20, 20, 23, 23, 25.5, 25.5, 27, 28

Sum of ranks, Sample 1 (Wilcoxon W) = 167.5     Average rank, Sample 1 = 11.16667
Sum of ranks, Sample 2 = 238.5                  Average rank, Sample 2 = 18.34615

SPSS commands for the Mann-Whitney test:

(a) Select Analyze>Nonparametric Tests>Legacy Dialogs>2 Independent Samples
(b) Select Time for the Test Variable List and n_drug as the Grouping Variable (see Figure 6.5a)
(c) Click Define Groups, enter 1 for Group 1 and 2 for Group 2, then click Continue
(d) Click OK
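The corresponding legacy syntax is approximately:

NPAR TESTS
  /M-W=Time BY n_drug(1 2).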


Figure 6.5a: Wilcoxon (Mann-Whitney) Test

Step 3: U* = 47.5 (or W* = 167.5) (see Figure 6.5b)
Step 4: P-value = 0.0095 (1/2 of the reported 2-sided P-value of 0.019) (see Figure 6.5b)
Step 5: Since our P-value < 0.05, we reject Ho.
Step 6: We have significant evidence, at the 5% significance level, that patients using the new drug A recover faster than patients taking drug B.

Figure 6.5b: SPSS Output for Mann-Whitney Non-Parametric Test

6.2 Inferences About Two Population Medians Using Wilcoxon's Signed Rank Tests for Paired Differences

The Wilcoxon Signed Rank Test for matched pairs experiments is used when the underlying population of differences does not have a normal distribution.

Assumptions:
1. Simple random paired samples
2. Symmetric differences
3. Continuous population distribution of differences (not many ties)


Example: We select a simple random sample of 20 parent/child pairs in the first year the child is in University and record the minutes they spend texting each other in a typical week. The data for these 20 pairs can be found below and in the file parentchild.sav.

Child    Parent    Difference (= Parent – Child)
45       40        -5
50       45        -5
56       52        -4
62       58        -4
66       63        -3
77       75        -2
77       76        -1
78       78         0
79       80         1
85       87         2
87       90         3
89       93         4
90       95         5
96       102        6
98       105        7
102      110        8
111      120        9
115      124        9
123      133       10
133      143       10

1. Examine boxplots, histograms and normal probability plots for the parent and child distributions.

We can see that the two distributions are similar in shape, and that there are no outliers. The upper whiskers on the boxplots, the right tails on the histograms, and the slight falling below the line at the upper end of the probability plots all indicate a slight right skew to the sample data. When the distributions from which we calculate the differences are similar in shape, it follows that the distribution of the differences will be symmetric.


Figure 6.6

We also calculate a sample histogram for the Difference = Parent – Child. As can be seen, the sample distribution of the differences is quite symmetric. We expect this as the shapes of the parent and child sample distributions are similar.


Figure 6.7: Distribution of Difference in Minutes

2. Test, at a 5% significance level, if there is a difference in the median minutes of texting between parent and child.

We use D1 for the distribution of the child texting minute population and D2 for the distribution of the parent texting minute population. We use Difference = Parent Texting Minutes – Child Texting Minutes. We pre-chose α to be 0.05.

Step 1: H0: D1 = D2 vs. Ha: D1 ≠ D2 (or Mdifferences = 0 versus Mdifferences ≠ 0) (or µdifferences = 0 versus µdifferences ≠ 0)

Step 2: Assumptions: We have a simple random paired sample. The distribution of texting minute differences is continuous (although we did end up with 4 ties because we presented minutes in whole units). The distribution of sample differences, as shown above, looks to be symmetric. Although this is only one sample distribution, it can suggest that the distribution of the population differences is also symmetric.

Prior to obtaining the SPSS output to help finish the formal write-up for the hypothesis test, recall how the WSR test works. We calculate the absolute differences, |di|, for all pairs. If any of the di equal 0, they are removed from the experiment, and the number of pairs is reduced accordingly. The |di| are then ranked, and finally, the ranks are given signs that correspond to the signs of the original di. The di give us an indication of how far away the differences (of the pairs) are from 0. If two or more of the absolute paired differences are tied, each is assigned the mean of the ranks they would have had if there were no ties. If the null hypothesis is true, then we would expect the sum of the positive ranks and the sum of the negative ranks to be similar in magnitude. That is, we would expect both sums to be about [n(n+1)/2]/2 = n(n+1)/4. We compare them to see if they differ significantly from what would be expected if the null hypothesis were true.


The test statistic W = the sum of the positive ranks is chosen. When the number of non-zero differences n is at least about 20, the W distribution, if Ho is true, is close to normal with mean n(n+1)/4 and standard deviation √( n(n+1)(2n+1) / 24 ).

SPSS will calculate the p-value ONLY for this situation (that is, it will only calculate the asymptotic P-value). We take a moment to look at the by hand results for this example. This is often a good idea as we can check that we are running the SPSS commands correctly with a small amount of data, in order to ensure we get it right for samples with more data.

In our case, we have one difference of 0, so we remove it from the data prior to doing the test. This means that we will have 19 paired differences. 19 is close enough to 20 that we decide to proceed with the test. Our Difference here is “Parent – Child”

Child    Parent    Difference    Abs Diff    Rank    Signed Rank
45       40        -5            5           11      -11
50       45        -5            5           11      -11
56       52        -4            4           8       -8
62       58        -4            4           8       -8
66       63        -3            3           5.5     -5.5
77       75        -2            2           3.5     -3.5
77       76        -1            1           1.5     -1.5
79       80         1            1           1.5      1.5
85       87         2            2           3.5      3.5
87       90         3            3           5.5      5.5
89       93         4            4           8        8
90       95         5            5           11       11
96       102        6            6           13       13
98       105        7            7           14       14
102      110        8            8           15       15
111      120        9            9           16.5     16.5
115      124        9            9           16.5     16.5
123      133       10           10           18.5     18.5
133      143       10           10           18.5     18.5

The absolute total of the positive ranks above is 141.5 and the absolute total of the negative ranks above is 48.5. This matches the information in the output below in Figure 6.8. Here we are using the difference "Parent – Child" in minutes. You must put Child into SPSS first and Parent into SPSS second if you wish the test to present positive and negative ranks that match this difference.

SPSS commands for Wilcoxon signed rank tests:

i. Select Analyze>Nonparametric Tests>Legacy Dialogs>2 Related Samples...
ii. Select Child - Parent as Test Pair(s)
iii. Check Wilcoxon in the Test Type box
iv. Click OK
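In syntax form this is approximately the following sketch (the pair is listed as Child WITH Parent, matching step ii):

NPAR TESTS
  /WILCOXON=Child WITH Parent (PAIRED).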

Step 3: Z* = -1.874
Step 4: P-value = 0.061 (2-sided, as provided by SPSS)
Step 5: Since P-value > α (0.061 > 0.05), we do not reject Ho


Step 6: The data do not provide evidence, at a 5% significance level, that the median of the "Parent – Child" texting minute differences differs from 0. We do note, however, that our P-value of 0.061 is relatively close to our pre-chosen level of significance of 0.05. Note, also, that although we entered the pair as Child - Parent in step ii above, SPSS labels the output "Parent - Child": it lists 7 negative ranks (pairs with Parent < Child) and 12 positive ranks, which matches the Parent – Child differences computed by hand. Watch out for that idiosyncrasy.

Ranks
                                   N      Mean Rank    Sum of Ranks
Parent - Child   Negative Ranks    7a     6.93         48.50
                 Positive Ranks    12b    11.79        141.50
                 Ties              0c
                 Total             19
a. Parent < Child    b. Parent > Child    c. Parent = Child

Test Statistics(b)
                           Parent - Child
Z                          -1.874(a)
Asymp. Sig. (2-tailed)     .061
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test

Figure 6.8: SPSS Output for the Wilcoxon Signed Rank Test

6.3 The Kruskal-Wallis Test for k independent samples

The Kruskal-Wallis test is a generalization of the Wilcoxon Rank Sum test for comparing the locations of two distributions based on independent samples to the case of comparing the locations of k distributions based on k independent samples. It can be used to test equality of medians (and means) when the parent distributions are similar in shape. The hypotheses can thus be worded in terms of the population medians, if one wishes. Assumptions:

1. Independent simple random samples
2. Numerical Response Variable
3. Same shaped populations with equal variances.
4. Continuous population distributions (so not many ties)
5. At least 5 observations in each sample

HO: LOCATION OF ALL Di DISTRIBUTIONS IS THE SAME, i = 1, 2, …, k
HA: LOCATION OF AT LEAST ONE OF THE Di DISTRIBUTIONS DIFFERS
(WHERE D1, D2, …, Dk ARE IDENTICALLY SHAPED DISTRIBUTIONS FROM WHICH THE k INDEPENDENT SIMPLE RANDOM SAMPLES WERE CHOSEN)


Example: The carbon monoxide level was measured (in parts per million) at three randomly selected industrial sites. Data can be found below and in the file carbonmonoxide.sav. Is there a significant difference in carbon monoxide levels at the three sites? Test at a significance level of 10%.

Site A: 0.106, 0.127, 0.132, 0.105, 0.117, 0.109, 0.107, 0.109

Site B: 0.121, 0.119, 0.121, 0.120, 0.117, 0.134, 0.118, 0.142

Site C: 0.119, 0.110, 0.106, 0.118, 0.115, 0.121, 0.109, 0.134

1. How do you classify the shapes of the distributions? Are the three distributions similar in shape?
2. Should the ordinary ANOVA F-test or the Kruskal-Wallis test be used to compare the centers of the three distributions?
3. Conduct the appropriate test at a significance level of 10%.
4. If appropriate, conduct a multiple comparison for the centers of the distributions of the carbon monoxide levels at the three industrial sites.

Solution:

1. The boxplots of the sample distributions in Figure 6.9 show that all three distributions are skewed right (the top 25% have the widest range). Thus, normality of the sample distributions is not indicated. However, the distributions are somewhat similar in shape and look to have variances that are not dissimilar. This can be borne out by examination of the sample histograms, which have right tails. Although this is just one possible set of sample data, these results can suggest that the population distributions are similar in shape and right tailed. A summary of descriptive statistics for these 3 sites is provided in the table below (full output is not included). The ratio of the largest to the smallest standard deviation is .010323/.008832 = 1.168818 < 2.

          Mean      Median    Standard Deviation
Site A    .11400    .10900    .010323
Site B    .12400    .12050    .009008
Site C    .11800    .11650    .008832


Figure 6.9: Boxplots and Histograms to Investigate Distributions of Carbon Monoxide Level at Three Sites

Because the histograms and boxplots indicate an obvious right skew to the sample data, the probability plots are not included here. However, for completeness, one should, in general, create boxplots, histograms, and probability plots when comparing distributions.

2. Because it does not appear that we have normal population distributions of carbon monoxide levels at the sites, ANOVA should not be used for comparing the carbon monoxide levels. However, the shapes of the sample distributions are similar (thus suggesting the same in the population distributions), and a Kruskal-Wallis test will be fine for comparing the centers (medians) of these distributions as long as all other necessary assumptions for performing a KW test are met. We mention these in Step 2 of the hypothesis test below, and indicate which we assumed and which we checked.

3. We perform the Kruskal-Wallis test.

Step 1: H0: D1 = D2 = D3 versus Ha: At least one distribution is shifted to the right or left of the other distributions. Use α = 0.10

Step 2: Assumptions: Independent random samples (as per sample design), numerical response variable of carbon monoxide levels, same shape continuous populations (checked) with equal variances (checked), and all sample sizes are at least 5 (checked). To perform a Kruskal-Wallis test, we rank the data from all samples combined, and calculate the average rank, Ri, for each sample. If R = n(n+1)/2 represents the total of the ranks, then the overall mean of the n ranks is (n+1)/2. The Kruskal-Wallis test statistic H is based on a weighted average of the squared


differences between the Ri and (n+1)/2. H measures the variation among the mean ranks and it follows a chi-square distribution with k-1 degrees of freedom when Ho is true.

H = [12 / (n(n+1))] Σ ni (Ri - (n+1)/2)^2

If the Di are all equal, then the mean ranks Ri are all close to (n+1)/2, and H will be small, but if any one (or more) of the Di is/are shifted away from the others, that/those Di will have a mean rank farther from (n+1)/2, and H will tend to be larger.

Site A    Overall Rank    Site B    Overall Rank    Site C    Overall Rank
0.105     1               0.117     10.5            0.106     2.5
0.106     2.5             0.118     12.5            0.109     6
0.107     4               0.119     14.5            0.110     8
0.109     6               0.120     16              0.115     9
0.109     6               0.121     18              0.118     12.5
0.117     10.5            0.121     18              0.119     14.5
0.127     20              0.134     22.5            0.121     18
0.132     21              0.142     24              0.134     22.5

Sum of ranks:   Site A 71,      Site B 136,    Site C 93
Mean ranks:     Site A 8.875,   Site B 17,     Site C 11.625

H = [12 / (n(n+1))] Σ ni (Ri - (n+1)/2)^2
  = [12 / (24(25))] [ 8(8.875 - 12.5)^2 + 8(17 - 12.5)^2 + 8(11.625 - 12.5)^2 ]
  = 0.02 [ 8(-3.625)^2 + 8(4.5)^2 + 8(-0.875)^2 ]
  = 0.02 (8)(13.140625 + 20.25 + 0.765625)
  = (0.02)(8)(34.15625) = 5.465

(Note that SPSS applies a correction for ties when computing the test statistic, so the H found here by hand does not exactly match the H in the SPSS output. However, the sums and averages of the ranks for the three treatments match, so we know that we have set the problem up correctly when we submitted it to SPSS.)

SPSS commands:

1. Select Analyze>Nonparametric Tests>Legacy Dialogs>K Independent Samples...
2. Select CarbonMon for the Test Variable List and site as the Grouping Variable (see Figure 6.10)
3. Click on Define Range and type 1 for Minimum and 3 for Maximum, then click on Continue
4. Check the Kruskal-Wallis H button in the Test Type dialog box
5. Click OK
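The corresponding syntax is approximately the sketch below (assuming site is coded 1 to 3, as entered in Define Range):

NPAR TESTS
  /K-W=CarbonMon BY site(1 3).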

Step 3: H* = 5.496 (as per Figure 6.11) with 2 df
Step 4: P-value = 0.064
Step 5: Since our P-value < 0.10, reject Ho
Step 6: We have significant evidence, at the 10% significance level, that at least one of the sites has a carbon monoxide level distribution that differs in location from the others.


Figure 6.10: Dialog Box for the Kruskal-Wallis Test

Figure 6.11: SPSS Output for the Kruskal-Wallis Test

4. Since the Kruskal-Wallis test indicates that not all three medians are equal, we want to know where the differences are. In order to control for the experiment wise error rate a Bonferroni procedure should be used.

Choose an experiment wise error rate of 10%.
The number of comparisons we have to do is c = k(k-1)/2 = 3.
Then the comparison wise error rate has to be α* = α/c = 0.0333.


Now we have to compare the medians for each industrial site with each other industrial site. We use the Wilcoxon Rank Sum test for each pair. The following table summarizes the results of using SPSS for testing H0: Mi = Mj versus Ha: Mi ≠ Mj for each i, j pair with α* = 0.0333.

Sites    Test Statistic    P-value    Decision
A - B    48.500            0.038      do not reject H0
A - C    58.500            0.328      do not reject H0
B - C    51.500            0.083      do not reject H0
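These three pairwise tests can be obtained by rerunning the Mann-Whitney commands with only the two sites of interest entered in Define Groups, or with syntax roughly like the sketch below (assuming site is coded 1 = A, 2 = B, 3 = C); each reported 2-sided P-value is then compared to α* = 0.0333:

NPAR TESTS /M-W=CarbonMon BY site(1 2).
NPAR TESTS /M-W=CarbonMon BY site(1 3).
NPAR TESTS /M-W=CarbonMon BY site(2 3).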

Even though the Kruskal-Wallis test was significant, at an error rate of 10% we do not find any pairwise significant difference between the median carbon monoxide levels for the three industrial sites using Bonferroni's procedure. It is worth noting that the A-B P-value of 0.038 is very close to the α* of 0.033, and we have a result that borders on significant when we look at the difference between sites A and B in this situation.


Chapter 7 Simple Linear Regression

Objectives: After studying this chapter you should be able to

1. Create an X-Y scatterplot for two quantitative variables
2. Perform a simple linear regression analysis
3. Test hypotheses concerning the linear relationship between two quantitative variables
4. Evaluate the goodness of fit of a linear regression model
5. Check the model assumptions.

7.1 Linear Regression Model

Many statistical studies are designed to explore the relationship between quantitative variables, such as the relationship between height and weight of people, the concentration of an injected drug and heart rate, or the consumption level of some nutrient and weight gain. The nature and strength of the relationship between two variables of interest, such as these, may be examined by two important statistical techniques called regression and correlation analysis.

Consider the simple case where there is just one explanatory (independent) variable X and the response (dependent) variable Y. The response variable depends on the explanatory variable. We assume the mean response can be expressed as a linear combination of the explanatory variable:

µy = β0 + β1x

This expression is the population regression equation. β0 is the intercept of the line and β1 is the slope of the line. We cannot directly observe this equation because the observed values of y vary about their means. We can think of subpopulations of responses, each corresponding to a particular observed explanatory variable x. In each subpopulation, y varies normally with mean given by the population regression equation. The regression model assumes that the standard deviation σ of the responses is the same in all subpopulations. The simple linear regression (SLR) model can be written

y = β0 + β1x + ε,   ε ~ N(0, σ2)

The model parameters are β0, β1 and σ. The random errors corresponding to the subpopulations of responses are assumed uncorrelated. The non-random (deterministic) part of the SLR model, the line relating x and y, is the population regression line mentioned above.


It is of interest to fit a line of "best" fit to n observed data points (xi, yi), i = 1, …, n. We choose to fit a "least squares line" that minimizes the sum of the squared vertical deviations of the points (x1, y1), …, (xn, yn) from the line. Some calculus will provide us with β̂0 and β̂1, the least squares estimates of β0 and β1. The simple regression line (based on the observed units) can be written as

ŷ = β̂0 + β̂1x

where ŷ is the predicted value for a given value x of the explanatory variable.

Example: The SPSS data file sbp.sav contains data on systolic blood pressure (SBP) and age for a sample of 30 individuals.

1. Construct a scatter diagram and describe the relationship between SBP and age.
2. Obtain the estimated regression line of sbp on age.
3. Obtain an estimate of σ.
4. Find the correlation coefficient and the coefficient of determination for sbp and age.
5. Conduct a test, at the 1% level of significance, to decide whether or not there is a positive linear association between SBP and age.
6. Obtain 95% confidence intervals for the slope parameter β1 and the intercept β0.
7. Obtain a 95% confidence interval for the estimate of the mean SBP of all individuals aged 65 years.
8. Predict with 95% confidence the SBP of an individual whose age is 65.
9. Check the model assumptions by using the appropriate graphical methods and tests.

Solution:

1. To create a scatter diagram of SBP and age using SPSS,

(a) Choose Graphs>Legacy Dialogs>Scatter/Dot
(b) Click the Simple Scatter icon and select Define
(c) Specify sbp in the Y axis text box and age in the X axis box
(d) Click Titles, type SBP versus Age and click Continue
(e) Click OK
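Pasting this dialog gives syntax close to the following sketch:

GRAPH
  /SCATTERPLOT(BIVAR)=age WITH sbp
  /TITLE='SBP versus Age'.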

The scatter diagram appears in the SPSS viewer window (see Figure 7.1). There appears to be a moderately strong, positive, linear association between age and SBP. One outlier seems to be included in the data set.


Figure 7.1: Scatter Diagram of Systolic Blood Pressure versus Age

2. To estimate the regression line of sbp on age using SPSS...

(a) Choose Analyze>Regression>Linear
(b) Select sbp for the Dependent box
(c) Select age for the Independent(s) box
(d) Click Statistics, check Estimates in the Regression Coefficients box and Model fit, and click Continue
(e) Click Options, check Include constant in equation, then click Continue
(f) Click OK
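The pasted regression syntax looks roughly like this sketch (COEFF, R and ANOVA request the coefficient estimates, the model summary and the ANOVA table):

REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT sbp
  /METHOD=ENTER age.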

Figure 7.2: Regression Dialog Box


From the SPSS output in Figure 7.3, we can see that the estimated regression equation is

sbp = 98.715 + 0.971age

Figure 7.3: Estimated Regression Coefficients

3. In Figure 7.4 we see the output for the model summary and ANOVA. The estimate for σ is given in

the last column of the model summary table. An alternative way to calculate σ uses information from the ANOVA table

Estimate σ = √𝑀𝑆𝐸 = √299.766 = 17.314

Figure 7.4: Model Fit

4. From the model summary table, the correlation coefficient r = 0.658 and the coefficient of

determination R2 = 0.432 = 43.2%. Thus, we have a moderate linear relation, and 43.2% of the variation in the variable sbp is explained by the linear regression on age. (Note that the presence of the “outlier” weakens the linear fit and affects the slope of the line.)

5. The hypotheses are

(a) H0: β1 ≤ 0 (b) Ha: β1 > 0

From the SPSS output in Figure 7.3, the test statistic t = 4.618, df = dfe = 28 and the P-value

<0.001/2. Since the P-value is very small, the null hypothesis is rejected and we conclude that there is a positive linear relationship between sbp and age.


6. To obtain confidence intervals for the slope and the intercept parameters choose Analyze>Regression>Linear, select Statistics and check Confidence intervals. From the table in Figure 7.5 we can see that the 95% confidence interval for β1 is (0.54, 1.401) and the 95% confidence interval for β0 is (78.230, 119.200).

7. To calculate a confidence interval for the mean of y for a given value of x with SPSS, choose Analyze>Regression>Linear, select Save and check Mean in the Prediction Intervals box. If you also check Individual you will get the individual prediction intervals, too. (In the Predicted Values box, check Unstandardized. This will return the point estimates for your confidence interval and prediction interval.) Type 95 in the Confidence Interval text box (see Figure 7.6). The lower bounds of the confidence intervals for the average sbp will appear in the column LMCI_1 and the upper bounds in the column UMCI_1 of the datasheet. The 95% confidence interval estimate for the average sbp of all people 65 years old is (151.09234, 172.55024) (see row number 5).
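In syntax, the saved point estimates and interval bounds for parts 7 and 8 can be requested approximately as follows (a sketch: PRED saves the unstandardized predicted values, MCIN the confidence interval for the mean response, and ICIN the individual prediction interval; the CIN value sets the confidence level for the saved intervals):

REGRESSION
  /STATISTICS COEFF CI(95) R ANOVA
  /CRITERIA=CIN(95)
  /DEPENDENT sbp
  /METHOD=ENTER age
  /SAVE PRED MCIN ICIN.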

Figure 7.5: Prediction Intervals

8. The lower bounds of the individual prediction intervals are in the column LICI_1 and the upper bounds are in the column UICI_1. The 95% prediction interval for the sbp of a particular 65 year old is (124.76836, 198.87422) (see row number 5).

Figure 7.6: Confidence Intervals for the Slope and the Intercept Parameters


Note: If you wish to obtain an individual prediction interval at a particular age of interest that does not appear in the data values, type that age in the cell below the bottom entry in the age column in the data file (leaving sbp blank for that row). Then run the commands above again. The upper and lower bounds of an individual prediction interval at the age of interest will appear in the data file.

7.2 Residual Analysis

Our model makes the following assumptions about the error terms εi of the model.

Population Regression Line: The εi have a mean of 0
Independence: The εi are independent
Normality: The εi are normally distributed
Homogeneity of variances: The εi have the same variance σ2

Some properties of residuals: Since the error terms εi are unknown, we use the residuals, ei = yi - ŷi, as estimates of the error terms.

E(ei) = E(yi - ŷi) = 0

The ei are functions of the xi, and as such, have different variances. In fact, the further away an xi is from x̄, the smaller V(ei) is:

V(ei) = V(yi - ŷi) = σ^2 (1 - 1/n - (xi - x̄)^2 / Sxx), where Sxx = Σ (xi - x̄)^2

We write V(yi - ŷi) = σ^2 (1 - hii), where hii = 1/n + (xi - x̄)^2 / Sxx.

The quantities (yi - ŷi) / √V(yi - ŷi) are approximately N(0,1) (note: this requires normality of the model errors εi). They are not independent; however, as long as the hii are fairly close to zero, they can be considered to be independent.

We can substitute s^2 = MSE = σ̂^2 for σ^2, and create studentized residuals:

si = (yi - ŷi) / √( MSE (1 - hii) )

where the hii can be viewed as "leverage" values that help indicate how far an xi value lies from the mean x̄. They can be shown to be functions of only the xi in the model, and they are such that 0 <= hii <= 1.

Observation: If the εi have the same variance σ^2, then the studentized residuals have a Student's t distribution with n - 2 degrees of freedom. This, of course, is rather close to a N(0,1) distribution for large enough n. Further details about this material on residuals can be found in statistics texts such as "Probability and Statistics for Engineering and the Sciences, Eighth Edition", by Jay L. Devore, Brooks/Cole, 2012.

Graphical techniques are used in the residual assessment. A probability plot of residuals and/or a histogram can check for normality.


Plotting residuals versus the independent variable, and/or plotting residuals versus the predicted values allows us to check for whether the errors are centered at 0, and have equal variances. If these assumptions are met, then a plot of the residuals should show a randomized pattern, and they should appear in a horizontal band centered around 0. If this is not the case, then one of these assumptions is not being met. Plotting residuals versus the independent variable when the independent variable has a natural ordering to it (such as time) allows us to make sure that no patterns of dependency exist in such a case. These plots work quite well where there is only one independent variable.

Patterns to watch out for are included below in Figure 7.7.

Figure 7.7: Patterns of Note in Plots of Residuals versus Predicted Values (panels a, b, c, d)

Pattern a) above is satisfactory; a horizontal band centered on 0 appears. Patterns b) and c) can indicate that the assumption of equal variances is doubtful (with variance increasing with time, x, or the predicted values in b), and variance changing unevenly with time, x, or the predicted values in c)). Pattern d) can indicate a non-linear relationship and that a transformation of the x variable may be needed to bring linearity to the model. Since the size of the residuals will depend on the particular problem at hand, it often facilitates residual analysis to standardize or studentize the residuals.

Standardized residuals may be found by dividing the residuals ei = yi - ŷi by the estimate (s = √MSE) of σ. Furthermore, since the variance of the residuals decreases as the independent variable values xi move further from their mean, many texts (and your instructor) suggest that the use of studentized residuals in residual analysis is perhaps more sensible. Studentized residuals are always larger in absolute value than standardized residuals because the hii are always between 0 and 1, making studentized residuals more sensitive to outlying values. SPSS has a feature that allows calculation of unstandardized, standardized and studentized residuals. To save the residuals and the predicted values for a residual analysis, from the regression window:

(a) Select Save...
(b) Check Unstandardized, Standardized and Studentized in the Residuals box
(c) Check Unstandardized in the Predicted Values box
(d) Click Continue
(e) Click OK
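The equivalent syntax is approximately the sketch below (RESID, ZRESID and SRESID save the unstandardized, standardized and studentized residuals, and PRED the unstandardized predicted values):

REGRESSION
  /DEPENDENT sbp
  /METHOD=ENTER age
  /SAVE PRED RESID ZRESID SRESID.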


The unstandardized residuals, standardized residuals, studentized residuals, and unstandardized predicted values are now saved as the variables RES_1, ZRE_1, SRE_1, and PRE_1 in the SPSS worksheet. (Standardized predicted values can also be saved; it is easier to identify the mean of standardized predicted values than of unstandardized predicted values when looking at the axis of a graph. We will use unstandardized predicted values in our exploration of model fit, mainly to facilitate looking up predicted values of interest.) To check if it is reasonable to assume that the error is normally distributed, we obtained a normal Q-Q plot and a histogram for each of RES_1, ZRE_1 and SRE_1. (Students should also make boxplots here.) The commands below indicate how to do this for RES_1. Analogous commands would do this for ZRE_1 and SRE_1.

(a) Click Analyze>Descriptive Statistics>Q-Q Plots...
(b) Choose RES_1 (or Unstandardized Residuals) as the variable
(c) Make sure the Test Distribution is Normal
(d) Click OK
(e) Click Graphs>Legacy Dialogs>Histogram
(f) Choose RES_1 (or Unstandardized Residuals) as the variable
(g) Click OK
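A syntax sketch for the RES_1 plots (analogous commands apply for ZRE_1 and SRE_1):

PPLOT
  /VARIABLES=RES_1
  /TYPE=Q-Q
  /DIST=NORMAL.
GRAPH
  /HISTOGRAM=RES_1.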

Note that in practice, one would choose one variable (likely the studentized residuals) for the y axis and one variable (likely the standardized predicted values) for the x axis. All graphs are included so that students can see their similarity. The output appears in Figures 7.8 and 7.9.

Figure 7.8: Q-Q Plots of the Unstandardized, Standardized, and Studentized Residuals

Note that scaling the data gives us a better perspective on the “closeness” of the points to the line.

Figure 7.9: Histogram of the Unstandardized, Standardized, and Studentized Residuals


With the exception of an outlier, the histogram of residuals is fairly normal and centered at 0. The Q-Q plot likewise indicates a fairly normal distribution for the error, again with the exception of the one outlier. A Q-Q plot is often preferable to a histogram when the number of points is small, although in this case n = 30, which is sometimes viewed as reasonably large in statistical applications. The one outlier is responsible for the small correlation coefficient: it pulls the regression line up above the scatter of the other points, so that the points are not as close to the regression line as they might otherwise be. To check whether it is reasonable to assume that the errors have a mean of 0 and constant variance, we can plot the residuals versus the independent variable and/or plot the residuals versus the predicted values. Commands are provided below to plot the unstandardized residuals, RES_1, against the unstandardized predicted values, PRE_1; analogous commands apply for ZRE_1 and SRE_1. Output is presented in Figures 7.10 and 7.11.

(a) Click Graphs>Legacy Dialogs>Scatter/Dot... (b) Click the Simple Scatter icon and click Define (c) Choose RES_1 (Unstandardized Residuals) for the Y axis (d) Choose PRE_1 (Unstandardized Predicted Values) for the X axis (e) Click OK

Figure 7.10: Scatterplot of the Unstandardized, Standardized, and Studentized Residuals versus the Unstandardized Predicted Values

With the exception of the readily identified outlying value, all plots of the residuals show what appear to be random points centred in a symmetric band of constant width around 0. The assumptions of linearity and constant variance appear to be met. A scatter plot that overlays the standardized and studentized residuals against the predicted values (see Figure 7.10) shows that the studentized residuals and standardized residuals are quite close, but does point out that the studentized approach views the identified outlier as even (slightly) more problematic.

Figure 7.11: Comparison plot of Standardized and Studentized residuals against Predicted values


Chapter 8 Multiple Linear Regression

Objectives: After studying this chapter you should be able to

1. Create a matrix plot
2. Fit a multiple regression model with SPSS
3. Conduct statistical inference concerning the regression coefficients
4. Use the multiple linear regression model for estimation and prediction
5. Analyze the residuals
6. Define multiple linear regression models with dummy variables
7. Apply variable selection techniques

8.1 The Multiple Regression Model

In the multiple linear regression setting, the response variable y depends not on just one, but on k explanatory variables x1, x2, ..., xk. The mean response is a linear combination of the explanatory variables:

µy = β0 + β1x1 + ... + βkxk

This expression is the population regression equation. We cannot directly observe this equation because the observed values of y vary about their means. We can think of subpopulations of responses, each corresponding to a particular set of values for all of the explanatory variables x1, x2, ..., xk. In each subpopulation, y varies normally with mean given by the population regression equation. The regression model assumes that the standard deviation σ of the responses is the same in all subpopulations. The multiple linear regression model can then be written as

y = β0 + β1x1 + ... + βkxk + ε,   ε ~ N(0, σ2)

The model parameters are β0, β1, ..., βk and σ. The random errors corresponding to the subpopulations of responses are assumed uncorrelated. It is of interest to fit a "best" fitting equation to n observed data points (yi, x1i, x2i, ..., xki), i = 1, ..., n.

Some calculus provides β̂0, β̂1, ..., β̂k, the least squares estimates of β0, β1, ..., βk. The multiple regression equation (based on the observed values) can be written as:

ŷ = β̂0 + β̂1x1 + ... + β̂kxk

where ŷ is the predicted value for given values of the explanatory variables.

Example: Fuel consumption in heating a home is a function of other variables such as outdoor air temperature (x1) and wind velocity (x2). For illustrative purposes, suppose the data in the SPSS file fuel.sav were collected to investigate how, for a sample of 10 winter days, the amount of fuel required to heat a home depends upon the outdoor temperature and wind velocity.


1. How strongly are the explanatory variables related to the response? Use a matrix plot (multiple scatter plots) and the correlation matrix for the data set to examine the pairwise relationships among the three variables.

2. Obtain the estimated regression equation for predicting fuel consumption from the two other variables. Interpret the coefficients. Report the standard error of the estimate (the estimate of σ).

3. Test at a 1% significance level whether or not the model is useful for predicting the mean fuel consumption.

4. Test at a significance level of 1% whether or not temperature is linearly related to fuel consumption.

5. Check the model assumptions using the appropriate graphical techniques and tests. 6. Check if there are any outliers or influential observations.

Solution:

1. To create a matrix plot of fuel consumption, temperature, and wind velocity using SPSS:

(a) Choose Graphs>Legacy Dialogs>Scatter/Dot (b) Click the Matrix icon and then select Define (c) Select fuelc, temp, and wind for the Matrix variables box (d) Click Titles, type Matrix Plot of Fuel Consumption, Temperature and Wind Velocity then

click Continue (e) Click OK

In Figure 8.1, we see scatterplots relating each pair of these three variables. In the first row, both graphs have fuelc on the vertical axis; in the first column, fuelc forms the horizontal axis.

Figure 8.1: Matrix Plot of Fuel Consumption, Temperature, and Wind Velocity


To compute the correlation coefficients for each pair, do the following:

(a) Choose Analyze>Correlate>Bivariate (see Figure 8.2) (b) Move fuelc, temp, and wind into the Variables box (c) Click OK

Figure 8.2: Correlation Dialog Box

In the SPSS Viewer window you will find the correlation matrix (see Figure 8.3), which reports the correlation between each pairing of the three variables: the correlation coefficient for a pair of variables appears at the intersection of the corresponding row and column. For example, fuel consumption and temperature have a significant negative correlation of -0.879, while fuel consumption and wind velocity have a non-significant positive correlation of only 0.424.

We expect fuel consumption to increase as the wind velocity increases, and to decrease as the temperature increases. The matrix plot and the correlation matrix confirm our expectation.

Figure 8.3: Correlation Matrix of Fuel Consumption, Temperature, and Wind Variables

Correlations
                               fuelc      temp       wind
fuelc   Pearson Correlation    1          -.879**    .424
        Sig. (2-tailed)                   .001       .222
        N                      10         10         10
temp    Pearson Correlation    -.879**    1          -.071
        Sig. (2-tailed)        .001                  .846
        N                      10         10         10
wind    Pearson Correlation    .424       -.071      1
        Sig. (2-tailed)        .222       .846
        N                      10         10         10
**. Correlation is significant at the 0.01 level (2-tailed).
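As a cross-check outside SPSS, the following short Python sketch (illustrative only; it assumes the three columns have been exported as equal-length lists or arrays keyed by the names fuelc, temp, and wind) computes the same pairwise Pearson correlations and two-tailed P-values.

from scipy import stats

def correlation_matrix(columns):
    """Pairwise Pearson correlations and two-tailed P-values for a
    dict of equal-length numeric columns, e.g. {'fuelc': [...], ...}."""
    names = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r, p = stats.pearsonr(columns[a], columns[b])
            print(f"{a} vs {b}: r = {r:.3f}, two-tailed P = {p:.3f}")

With the fuel.sav values supplied, this should reproduce (up to rounding) the correlations −0.879, 0.424, and −0.071 shown in Figure 8.3.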


2. To obtain the equation of the multiple regression model of fuel consumption on temperature and wind velocity using SPSS, enter the following commands (see Figure 8.4):

(a) Choose Analyze>Regression>Linear (b) Select fuelc for the Dependent box (c) Select temp, wind for the Independent(s) box (d) Click Statistics, check Estimates from the Regression Coefficients box and Model fit then

click Continue (e) Click Options, check Include constant in equation then click Continue (f) Click OK

Figure 8.4: Regression Dialog Box

The results are displayed in the SPSS Viewer window. We can see that we have one intercept and two slopes, one for each of the two explanatory variables. The estimated regression equation is:

ŷ = 11.928 − 0.628·temp + 0.130·wind

The intercept (11.928) represents the fuel consumption when the temperature is zero degrees and the wind velocity equals zero. Each slope represents the mean change in fuel consumption associated with a one-unit increase in the corresponding explanatory variable, with the other explanatory variable held fixed. For example, if temperature were to increase by one degree and wind velocity were to remain constant, then fuel consumption would decrease on average by 0.628 units. In the Model Summary output (see Figure 8.6) the standard error of the estimate (the estimate of σ) is 1.22492.

Coefficientsa

Model             Unstandardized Coefficients    Standardized Coefficients
                  B          Std. Error          Beta           t          Sig.
1   (Constant)    11.928     .932                               12.793     .000
    temp          -.628      .086                -.853          -7.275     .000
    wind           .130      .042                 .364           3.102     .017

a. Dependent Variable: fuelc

Figure 8.5: Regression Coefficients
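If you want to see the least squares arithmetic outside the package, here is a minimal Python/NumPy sketch (an illustration, not part of the lab procedure); the arrays X and y are placeholders for the temp/wind columns and the fuelc column from fuel.sav, whose values are not reproduced here.

import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y on the columns of X (an n-by-k array).
    Returns the coefficient vector (intercept first) and the standard
    error of the estimate s = sqrt(SSE / (n - k - 1))."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])  # add intercept column
    y = np.asarray(y, float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)                  # minimizes ||y - X beta||^2
    resid = y - X @ beta
    n, p = X.shape                                                # p = k + 1 coefficients
    s = np.sqrt(resid @ resid / (n - p))
    return beta, s

With the ten observations from fuel.sav supplied, this calculation should reproduce (up to rounding) the coefficients 11.928, −0.628, and 0.130 and the standard error of the estimate 1.22492 reported by SPSS.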


Figure 8.6: ANOVA Table

3. The F statistic given in the ANOVA table is used to test the overall significance of the regression

model. The hypotheses are: (a) H0: β1 = β2 = 0 (b) Ha: At least one of the βi ≠ 0, i = 1,2

From the SPSS output in Figure 8.6, the test statistic is F = 33.036 with df1 = 2 and df2 = 7. Since the P-value (< 0.001) is smaller than the 1% significance level requested, the null hypothesis is rejected, and we conclude that the multiple regression model is useful for predicting mean fuel consumption; not all slope coefficients are zero.

4. The t-statistics (in Figure 8.5) are used for testing the hypotheses. The null hypothesis: H0: β1 = 0. The alternative hypothesis: Ha: β1 ≠ 0. The t-statistic = −7.275 with df = dfE = 7 and the P-value < 0.001 in the second row of the table tell us to reject the null hypothesis at the 1% significance level. We conclude that temperature has a significant linear relationship with fuel consumption.

Note that a test of the hypothesis β2 = 0 versus β2 ≠ 0 could also be performed here. Students are encouraged to do so for practice. It will yield a significant result.
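For readers who want to confirm the quoted P-values from the reported statistics and degrees of freedom, the following is a small, illustrative Python sketch using scipy (not part of the SPSS workflow).

from scipy import stats

# Overall F test: F = 33.036 with df1 = 2 and df2 = 7 (from Figure 8.6)
p_overall = stats.f.sf(33.036, 2, 7)
print(f"overall F test: P = {p_overall:.5f}")           # well below 0.001

# t test for the temp coefficient: t = -7.275 with df = 7 (from Figure 8.5)
p_temp = 2 * stats.t.sf(abs(-7.275), 7)
print(f"temp coefficient: two-sided P = {p_temp:.5f}")  # also below 0.001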

In multiple regression, we again use the residuals, ei = yi − ŷi, as estimates of the error terms. As before, we can create studentized residuals:

si = (yi − ŷi) / √(MSE(1 − hii))

where the hii can be viewed as "leverage" values that help indicate when a point in x-space is remote from the rest of the data (that is, hii can be viewed as a measure of the distance of the point (xi1, ..., xik) from the average of all the points in the data set). As before, the hii are functions only of the x values in the model, and 0 ≤ hii ≤ 1. The hii are not as readily calculated by hand in multiple regression, but fortunately SPSS will do all the calculation for us.

5. Commands to save residuals to use in residual analysis follow.

(a) Select Save... (b) Check unstandardized, standardized and studentized in the Residual box


(c) Check unstandardized and standardized in the Predicted Values box (d) Click Continue (e) Click OK

In order to check the assumption of the error being normally distributed we should do a Q-Q plot (see Figure 8.7) and a histogram (see Figure 8.8) for the residuals. (Students should also make boxplots here.) Commands for unstandardized residuals are given below. Commands for standardized and studentized residuals would be analogous.

(a) Click Analyze>Descriptive Statistics...>Q-Q Plots... (b) Choose RES_1 (Unstandardized Residuals) as the variable (c) Click OK (d) Click Graphs>Legacy Dialogs>Histogram (e) Choose RES_1 (Unstandardized Residuals) as the variable (f) Click OK

Figure 8.7: Q-Q Plots

Figure 8.8: Histograms

The Q-Q plots display a slight sigmoid shape, indicating that the assumption of normal error might be violated. The studentized residuals tend to be further from the line than the unstandardized or standardized ones, and given that studentized residuals have an "edge" when it comes to discerning points to watch, this is something to note. A larger sample would be helpful.

The histograms of the residuals are centered at about 0, but do not resemble a bell curve. There are really too few points here to get a good idea of what is going on. The histograms, at least, do not seem to identify any outlying points, but could perhaps indicate less probability in the middle of the distribution and more probability in the tails than we would see in a normal distribution.


In order to check if the assumptions of linearity and homogeneity of variance are correct, scatter plots of residuals against the predicted values are of interest, as are scatterplots of residuals against the individual independent variables. Commands to obtain a scatterplot for RES_1 and PRE_1 follow. Other graphs would be obtained with analogous commands.

(a) Click Graphs>Legacy Dialogs>Scatter/Dot. (b) Click the Simple Scatter icon and click Define (c) Choose RES_1 for the Y axis (d) Choose PRE_1 for the X axis (e) Click OK

Note that in practice, one would choose one variable (likely the studentized residuals) for the y axis and one variable (likely the standardized predicted values) for the x axis. All graphs are included so that students can see their similarity.

Figure 8.9: Scatterplots of Residuals versus Predicted Values

Figure 8.10a: Scatterplot of Residuals versus Temp


Figure 8.10b: Scatterplot of Residuals versus Wind

As with SLR, it is of interest to plot the residuals against the predicted values (Figure 8.9). One would expect a constant band symmetrically scattered around 0 if the assumptions of linearity and constant variance are met. However, with multiple regression, multicollinearity (correlations among the independent variables) can become a problem, and it can be difficult to interpret these graphs. They may, however, identify points of interest that may be outlying. In addition, plotting the residuals against each independent variable can help to discern whether problems exist, but even then, such issues are not always readily identified with these graphs.

No curvature appears in any of the plots and the data is randomly and somewhat symmetrically scattered in a constant band about 0; the assumptions of linearity (proper independent variables included in the model) and constant variance appear to be met.

With only ten data points, it is very difficult to investigate the residuals. The data on the graphs of the residuals versus the predicted values fall (roughly) in a horizontal band. None of these graphs shows any outlying (wind, temp) point strongly affecting the predicted value. However, Figures 8.10a and 8.10b do highlight, respectively, that one point has a temp value much smaller than all the other temp values and one point has a wind value much higher than all the other wind values. A look in the data file identifies these as observation 7, with (temp, wind) = (−15.50, 5.90), and observation 3, with (temp, wind) = (−10.00, 41.20). What we see here are two lower temperatures (relative to the rest of the data set), but in one case the wind is low and in the other case the wind is high! It is likely that a larger data set would give a better picture here.

6. To identify influential observations, the following two additional statistics can be calculated: the leverage value and Cook's Distance.

(a) Leverage value (hii) for finding cases with unusual explanatory variable values: if hii > 2p/n (= 6/10 = 0.6 in our case), then observation i has a high potential for influence, where p is the number of regression coefficients and n is the number of data points in the study.

(b) Cook’s Distance (Di) for Finding Influential Cases: This is a measure that considers the

squared distance between the usual least squares estimate (which uses the β̂s) based on all n observations and the estimate obtained when the ith point is deleted. If a point is influential, its removal would result in some of the regression coefficients changing considerably.


If Di is close to or larger than 1 then case i may be considered influential. To calculate these statistics in SPSS from the regression window: (a) Select Save (b) Check Leverage values and Cook’s under Distances (c) Click Continue (d) Click OK
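For reference, here is a hedged Python/NumPy sketch (illustrative only, not the SPSS procedure) of how leverage values and Cook's distances could be computed directly from the standard formulas; X and y are placeholders for the predictor columns and the response.

import numpy as np

def leverage_and_cooks(X, y):
    """Leverage values h_ii and Cook's distances D_i for a least-squares
    fit of y on the columns of X (intercept added automatically)."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    y = np.asarray(y, float)
    n, p = X.shape                                     # p = number of regression coefficients
    H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
    h = np.diag(H)                                     # leverage values h_ii
    e = y - H @ y                                      # residuals
    mse = e @ e / (n - p)
    cooks = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)  # Cook's distance D_i
    flagged = np.where(h > 2 * p / n)[0]               # cases with high leverage
    return h, cooks, flagged

For the fuel data there are p = 3 regression coefficients and n = 10 observations, so the cutoff 2p/n = 0.6 used in the text would be applied to the returned leverage values.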

We create scatterplots of the Cook's Distances and Leverage values against fuelc in order to discern whether there are any "outlying" values of interest (Figures 8.11a and 8.11b).

Figure 8.11a: Leverage versus Fuelc

Figure 8.11b: Cook’s Distance versus Fuelc

One quick way to identify which case(s) have more influence is to click on a graph to bring up the Chart Editor and then click on a point to bring up the Properties window. Then, on the Variables tab, under Case number, drop down and select X Axis. This will re-label the X axis with case numbers. You can then look up the point in the original data file. See Figures 8.12a and 8.12b.

Figure 8.12a: Leverage versus Fuelc by Case Number

Figure 8.12b: Cook's Distance versus Fuelc by Case Number


In Figure 8.12a we see that there are 2 observations – observation 7 (fuelc = 21.83) and observation 3 (fuelc = 23.76) – with leverages greater than 0.6. Note that these findings match our previous observations when we looked at the graphs of residuals versus the independent variables above. Recall that observations 7 and 3 have colder values of temperature than the other observations, but in one case wind is very low and in the other case wind is very high. In Figure 8.12b we have only one observation (observation 7 on the worksheet corresponding to fuelc = 21.83) with a Cook’s distance greater than 1 and substantially larger than the rest. Note that Cook’s distance measure does not identify case number 3 as influential.

This suggests we look further at observations 7 and 3 as potentially influential observations. Sometimes we may decide to eliminate "outlying" observations. In what follows, we eliminate observation 7 and then estimate the regression coefficients using the remaining 9 observations. Note: you want to be very careful if you eliminate observations, as you will need to be able to justify your decision; evaluating the data for measurement anomalies is useful. Comparing the results displayed in Figures 8.5 and 8.13, a striking consequence of the exclusion of case 7 is the drop in significance of the two slope coefficients (from two-sided P-values of < 0.001 and 0.017 to two-sided P-values of 0.003 and 0.172, respectively). Now wind does not have a significant influence on fuel consumption (holding temp constant).

Figure 8.13: Estimated Regression without Case 7

Finally, out of curiosity, let us see what happens if we choose to eliminate both observation 3 and observation 7. See Figure 8.14.

Coefficientsa

Model             Unstandardized Coefficients    Standardized Coefficients
                  B          Std. Error          Beta           t          Sig.
1   (Constant)    12.966     1.806                              7.180      .001
    temp          -.753      .169                -.884          -4.442     .007
    wind           .038      .110                 .068           .343      .745

a. Dependent Variable: fuelc

Figure 8.14: Estimated Regression without Cases 7 and 3

Eliminating both observation 3 and observation 7 further reduces the significance of the two slope coefficients (from two-sided P-values of 0.003 and 0.172 to two-sided P-values of 0.007 and 0.745, respectively). This model continues to find that wind has no significant effect on fuel consumption (holding temp constant).


Note that the 8 remaining (xi, yi) points are much closer to their respective means (x̄, ȳ). We note that a larger sample would have been very useful here. Intuitively, the independent factor that would have the biggest impact on fuel consumption is temperature, and it may be that a simple linear regression with temperature as the only independent variable would fit well enough to allow for prediction, without needing to consider other factors. On the other hand, we may want to consider a model that pairs temperature with a different predictor than wind, such as how well insulated a house is.

8.2 Dummy Variables in Regression Analysis

Dummy variables, which are also called indicator variables, are used to employ categorical predictor variables in a regression analysis. By using dummy variables, we can broaden the application of regression analysis. As you saw before, regression analysis is used to analyze the relationships among quantitative variables; through the introduction of dummy variables we can now include categorical variables as well. In particular, dummy variables allow us to employ regression to produce the same information obtained by analytical procedures such as analysis of variance and analysis of covariance. In this section we focus on one important application of dummy variables: comparing several regression equations by using a single multiple regression model.

Example: Suppose a random sample of data was collected on residential sales in a large city. The data consist of the sale price y (in $1000s), living area x1 (in hundreds of square feet), number of bedrooms x2, total number of rooms x3, age x4, and location of each house (dummy variables z1 and z2 defined as follows: z1 = z2 = 0 for in town; z1 = 1, z2 = 0 for inner suburbs; z1 = 0, z2 = 1 for outer suburbs). The data are available in the SPSS file house.sav.

1. Identify a single regression model that uses the data for all three locations, and that defines straight-line models relating sale price (y) and area (x1) for each location.

2. Determine the fitted straight line for each location.
3. Test whether or not the straight lines for the three locations coincide.
4. Test whether or not the lines are parallel.
5. In light of your answers to parts (3) and (4), comment on the differences and similarities in the sale price-area relationship for the three locations.

Solution:

1. The multiple regression model can be written as: y = β0 + β1x1 + β2z1 + β3z2 + β4 x1z1 + β5x1z2 + ε

For each location, the simple regression equations are:
For in town: y = β0 + β1x1 + ε
For inner suburb: y = (β0 + β2) + (β1 + β4)x1 + ε
For outer suburb: y = (β0 + β3) + (β1 + β5)x1 + ε

To estimate the coefficients, we select Transform>Compute and add two more columns corresponding to x1z1 and x1z2. Then we select Analyze>Regression>Linear with y as the dependent variable and x1, z1, z2, x1z1, x1z2 as the independent variables. We get the results shown in Figure 8.15.
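The same construction can be sketched outside SPSS; the following illustrative Python snippet builds the dummy and interaction columns from a living-area column and a list of location labels. The label strings 'intown', 'inner', and 'outer' are hypothetical codings chosen for the example, not names taken from house.sav.

import numpy as np

def design_with_dummies(x1, location):
    """Build the design columns x1, z1, z2, x1*z1, x1*z2 from the living
    area x1 and location labels ('intown', 'inner', 'outer') -- labels are
    assumed codings for this sketch."""
    x1 = np.asarray(x1, float)
    location = np.asarray(location)
    z1 = (location == 'inner').astype(float)   # 1 for inner suburbs, else 0
    z2 = (location == 'outer').astype(float)   # 1 for outer suburbs, else 0
    return np.column_stack([x1, z1, z2, x1 * z1, x1 * z2])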


Figure 8.15: SPSS Output

2. From the SPSS output, the estimated linear regression equation for each location can be written as:
For in town: ŷ = 8.97 + 4.81x1
For inner suburb: ŷ = 61.07 + 1.61x1
For outer suburb: ŷ = 57.57 + 2.00x1

3. For each location, the simple regression is as in point 1 above. All three lines will coincide if all slopes and intercepts are the same. Thus, we must have:

β0 = β0 + β2 = β0 + β3 and β1 = β1 + β4 = β1 + β5

This is only the case if β2 = β3 = β4 = β5 = 0. To determine this we apply a partial F test:

Reduced Model: y = β0 + β1x1 + ε
Full Model: y = β0 + β1x1 + β2z1 + β3z2 + β4x1z1 + β5x1z2 + ε

The hypotheses are: Ho: β2 = β3 = β4 = β5 = 0, Ha: At least one of the βi ≠ 0, i = 2, ..., 5. For the reduced model we have the ANOVA table in Figure 8.16.


Figure 8.16: ANOVA table for the Coincident Regression Lines

From the ANOVA tables in Figures 8.15 and 8.16 we get the test statistic

F = [(1282.296 − 395.701)/4] / 16.483 = 221.649/16.483 ≈ 13.45,

which has an F(4, 24) distribution. Since the P-value < 0.001, we have enough evidence against H0, and we conclude that the lines are not coincident; i.e., the three lines are not all the same.

4. All three lines will be parallel if all the slopes are the same. Thus, we would have:

β1 = β1 + β4 = β1 + β5 or β4 = β5 = 0

Again we apply a partial F test:
Reduced Model: y = β0 + β1x1 + β2z1 + β3z2 + ε
Full Model: y = β0 + β1x1 + β2z1 + β3z2 + β4x1z1 + β5x1z2 + ε

The hypotheses are:

H0: β4 = β5 = 0, Ha: At least one of the βi ≠ 0, i = 4, 5. For the reduced model we have the ANOVA table in Figure 8.17.

Figure 8.17: ANOVA Table for the Parallel Regression Lines

From the ANOVA tables in Figures 8.15 and 8.17 we get the test statistic

F = [(946.277 − 395.701)/2] / 16.483 = 275.288/16.483 ≈ 16.70,

which has an F(2, 24) distribution. Since the P-value < 0.001, we have enough evidence against H0 and we conclude that the lines are not parallel.
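The partial F calculation is easy to reproduce in code. The sketch below is illustrative only; it uses the SSE, MSE, and degrees-of-freedom values quoted from Figures 8.15-8.17, and because those values are rounded in the output the computed statistics may differ slightly from the ones printed in the text.

from scipy import stats

def partial_f(sse_reduced, sse_full, df_diff, mse_full, df_full):
    """Partial F statistic and P-value for comparing a reduced model
    against a full model."""
    F = ((sse_reduced - sse_full) / df_diff) / mse_full
    return F, stats.f.sf(F, df_diff, df_full)

# Coincident-lines test (values from Figures 8.15 and 8.16)
print(partial_f(1282.296, 395.701, 4, 16.483, 24))
# Parallel-lines test (values from Figures 8.15 and 8.17)
print(partial_f(946.277, 395.701, 2, 16.483, 24))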


5. The in-town location has a much lower baseline price than the suburbs, but the in-town price increases faster with house size. The two suburban locations are similar in both baseline sale price and the rate at which price increases with house size.

8.3 Selecting the Best Regression Equation

The purpose of model selection is:

1. To identify important predictor variables (variable screening) for the prediction of the response variable;
2. To improve the accuracy of prediction;
3. To simplify the prediction equation.

Several changes occur as we include more predictor variables in a regression.

1. Prediction improves. R2 (but not necessarily adjusted R2) increases, and se, the residual standard deviation, typically decreases. Is this improvement substantial?
2. Coefficients describe how the additional variables affect ŷ. Are these coefficients significantly different from zero and large enough to be of substantial importance?
3. Spurious coefficients may shrink. Do the added variables substantially alter our conclusions regarding the effects of other predictor variables?

Affirmative answers to any of these questions support keeping the added variable(s) in the model. Negative answers indicate the variables contribute little and should be left out unless theoretically important. When practical, all possible regressions should be considered, and the model having the largest R2, the smallest MSE, and so on, chosen. When the number of variables in the maximum model is large, the amount of calculation necessary becomes impractical. There are several selection procedures that examine a reduced number of subsets among which a good model may be found. The search strategy for selecting variables is concerned with determining how many variables, and also which particular variables, should be in the final model. Here are some important procedures:

1. Forward selection procedure

2. Backward selection procedure

3. Stepwise procedure

Forward Selection Procedure

1. Start by fitting all possible one-variable models of the form µy = β0 + β1x1 to the data. For each model, conduct a t-test for the single β parameter with hypotheses H0: β1 = 0 versus Ha: β1 ≠ 0. The independent variable that produces the smallest P-value (or largest |t|-value) is declared the best one-variable predictor of y and enters the model, provided that its P-value does not exceed the specified entry value α.
2. At each step, add the variable (among those not yet in the model) having the smallest P-value (or largest |t|-value).
3. Stop adding variables when a stopping rule is satisfied (stop when none of the variables not yet in the model would have a P-value smaller than the entry value).
4. The model used is the one containing all predictors that were added. A sketch of this procedure in code appears below.
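The following Python sketch is a simplified stand-in for the forward procedure described above, not SPSS's exact algorithm (SPSS uses the probability of F to enter, which for a single added variable is equivalent to the t-test criterion since F = t²). All function and variable names are illustrative.

import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test P-values for each coefficient (intercept excluded)
    in a least-squares fit of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    n, p = X1.shape
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta = XtX_inv @ X1.T @ y
    resid = y - X1 @ beta
    mse = resid @ resid / (n - p)
    se = np.sqrt(mse * np.diag(XtX_inv))
    t = beta / se
    return 2 * stats.t.sf(np.abs(t), n - p)[1:]      # drop the intercept's P-value

def forward_select(X, y, names, alpha_enter=0.25):
    """Greedy forward selection: at each step add the candidate whose
    t-test P-value (in the enlarged model) is smallest and below alpha_enter."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = None
        for j in remaining:
            p_new = ols_pvalues(X[:, chosen + [j]], y)[-1]   # P-value of the candidate
            if best is None or p_new < best[1]:
                best = (j, p_new)
        if best[1] > alpha_enter:
            break
        chosen.append(best[0]); remaining.remove(best[0])
        print("entered", names[best[0]], "P =", round(best[1], 4))
    return [names[j] for j in chosen]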


Backward Elimination Procedure

1. Start with the full model.
2. At each step, remove the variable having the smallest |t|-value or the largest P-value (the least significant variable).
3. Stop removing variables when a stopping rule is satisfied (stop when all the variables remaining in the model have P-values smaller than the removal value).
4. The model used is the one containing all predictors that were not eliminated.

Stepwise Forward-Backward Procedure

This is a modified version of the forward procedure that permits re-examination, at every step, of the variables incorporated in the model in previous steps. A variable that entered at an early stage may become superfluous at a later stage because of its relationship with other variables subsequently added to the model. To check this possibility, at each step we do a partial F-test for each variable currently in the model, as though it were the most recent variable entered, irrespective of its actual entry point into the model. The variable with the smallest non-significant partial F statistic (if there is such a variable) is removed; the model is refitted with the remaining variables; the partial F's are obtained and similarly examined; and so on. The whole process continues until no more variables can be entered or removed. One should not rely too heavily on stepwise regression: interpret the results carefully and perform residual or diagnostic analysis as described before.

Example: A random sample of data was collected on residential sales in a large city: the selling price y in $1000s, the area x1 in hundreds of square feet, the number of bedrooms x2, the total number of rooms x3, the house age x4 in years, and the location z (z = 0 for in town and inner suburbs, z = 1 for outer suburbs). Use x1, x2, x3, x4, and z as the predictor variables. The data are in the SPSS file house2.sav.

1. Use the forward procedure to suggest a best model 2. Use the backward elimination procedure to suggest a best model 3. Use the stepwise procedure to suggest a best model. 4. Which of the models previously selected seems to be the best model, and why?

Solution:

1. The SPSS commands for the forward selection procedure are:

a. Select Analyze>Regression>Linear b. Select y as the Dependent variable and x1, x2, x3, x4 and z as Independent variable(s) c. Choose Forward as Method d. Click Statistics , check R squared change e. Click Options, in Use probability of F, type 0.25 in the Entry box and 1 in the Removal box f. Click OK

The SPSS output contains 5 tables. The Variables Entered/Removed table shows the variables added and the order in which they are added. The final model contains the predictors x3, x1, x4, and x2 (see Figure 8.18). The Model Summary table contains the values of R, R-squared, adjusted R-squared, the standard error of the estimate, the R-squared changes, and their significance. The final model has the best values of R, R-squared, adjusted R-squared, and standard error of the estimate, but compared with the previous model with predictors x3, x1, and x4 there is not a large improvement (see Figure 8.18).


Figure 8.18: The Coefficient Table for the Forward Procedure

The Coefficients tables give the estimated coefficients for each model and the t and P-values for each coefficient in each model (see Figure 8.18). The ANOVA table provides the results of the F-tests for each model (see Figure 8.19). The entry value of 0.25 is chosen for forward selection to allow the procedure to continue through most of its subset sizes. By inspecting the P-values we can see that the selection would have stopped at the two-variable model (x3, x1) had the entry value been chosen as 0.05.


Figure 8.19: ANOVA Table for the Forward Procedure

Figure 8.20: The Excluded Variables Table for the Forward Procedure

The Excluded variables table summarizes for each model the variables that have not yet been considered (see Figure 8.20).

2. The SPSS commands for the backward selection procedure are:

(a) Select Analyze>Regression>Linear (b) Select y as the Dependent variable and x1, x2, x3, x4 and z as Independent variable(s)


(c) Choose Backward as Method (d) Click Statistics, check R squared change (e) Click Options, under Use probability of F, type 0.05 in the Entry box and 0.1 in the Removal box (f) Click OK

Figure 8.21: Partial Output of Backward Procedure

The Variables Entered/Removed table shows that the final model contains the predictors x3 and x1 (see Figure 8.21). This last model does not have the best R-squared or adjusted R-squared (see the Model Summary table in Figure 8.21). The coefficients and their t and P-values are given in the Coefficients table displayed in Figure 8.22. Notice that the P-values (significance) for the coefficients corresponding to x3 and x1 are 0.000 < 0.1 and 0.005 < 0.1, respectively.

Figure 8.22: The Coefficients Table for the Backward Procedure


3. The SPSS commands for the stepwise selection procedure are:

(a) Select Analyze>Regression>Linear (b) Select y as the Dependent variable and x1, x2, x3, x4 and z as Independent variable(s) (c) Choose Stepwise as Method (d) Click Statistics, check R squared change (e) Click Options, under Use probability of F, type 0.149 in the Entry box and 0.15 in the Removal box (f) Click OK

The Variables Entered/Removed table shows that the final model contains the predictors x3, x1, and x4 (see Figure 8.23). This last model has the best R-squared and adjusted R-squared (see the Model Summary table in Figure 8.23).

4. Comparing the Model Summary tables for the models found using the forward, backward, and stepwise procedures, we can see that the model found with the stepwise procedure, containing the variables x3, x1, and x4, is a simple model with high R-squared and adjusted R-squared values and the smallest standard error. This appears to be the best model. The model found using the forward procedure is larger, but the improvement in the R-squared and adjusted R-squared values after adding the extra predictor x2 is not significant.

Figure 8.23: Partial Output of Stepwise Procedure


Chapter 9 One Sample and Two Sample Proportions

Students are not currently responsible for this material. This chapter is for information purposes only, and may provide a useful reference for students when doing coursework in other classes. In section 9.1 of this chapter, the reader will use SPSS to create material from sample data with which to perform inference about a single population proportion. After studying this section, you should be able to:

1. Use SPSS to display tallies that count observations of particular events in a relevant column of sample data.

2. Use SPSS with given sample data so as to calculate test statistic and p-value material relevant to hypothesis testing problems (two sided and one sided) about the population proportion.

3. Use SPSS with given sample data so as to calculate relevant confidence intervals about a population proportion.

4. Use relevant output to set up, investigate, and form conclusions about the population proportion for posed one sample and two sample inference problems

In section 9.2 of this chapter, the reader will use SPSS to create material from sample data with which to perform inference about the difference between two population proportions. After studying this section, you should be able to:

1. Use SPSS to display tallies that count observations of particular events in relevant columns of sample data.

2. Use SPSS with given sample data so as to calculate test statistic and p-value material relevant to hypothesis testing problems (two sided and one sided) about the difference in two population proportions

3. Use SPSS with given sample data so as to calculate relevant confidence intervals about the difference in two population proportions.

4. Use relevant output to set up, investigate, and form conclusions about the difference in two population proportions for posed one sample and two sample inference problems.

PLEASE NOTE: SPSS is not set up to easily and readily calculate one- and two-proportion confidence intervals and tests. Nevertheless, we will show you how it can be "tricked" into doing all of these. Although none of these methods can be construed as intuitive by any stretch of the imagination, they will serve to whet your appetite for further statistics courses, as you will get a glimmer of the power of SPSS.

9.1 Introduction: One Sample Inference about a Proportion

In order to perform one sample hypothesis tests and confidence intervals about a population proportion, p, using a sample proportion, p̂ = x/n (where x is the number of sample members with a certain attribute of interest and n is the number of observations in the sample), it is necessary that some assumptions be met. In particular, in this lab, the following are assumed:

1. A simple random sample
2. The number of successes, x, and the number of failures, n − x, are both 5 or greater


When these assumptions are met, the sampling distribution of the sample proportion p̂ has mean p and standard deviation σp̂ = √(p(1 − p)/n), and p̂ is approximately normally distributed for large n.

Assumptions MUST be checked prior to undertaking statistical inference. The first assumption is built into the sample design, and in this course you will assume that this condition has been met. The second assumption can be readily investigated.

9.1.1 Examining the Data

In what follows, we demonstrate how to create a tally from a column of data and how to calculate a pertinent proportion from the results.

Example: In the fall of 2015, students in the Stat 151 classes at MacEwan filled out two surveys, one in early September and one just after the October 2015 election. The first survey elicited their opinions on election issues and their voting intentions, while the second survey elicited further information on how they had planned to vote in September and how they actually voted in October. The raw survey data for the two surveys are available in the SPSS worksheets Election2015DatabaseallStudents and FullPostElectionStudentData on Blackboard. The actual survey questions are available in the files ElectionSurvey and PostElectionSurveyQ on Blackboard. The following commands create a tally that will enable us to examine the proportion of students who planned to vote for certain parties and how they actually voted in the election. The FullPostElectionStudentData worksheet will be used.

1. Click on Analyze>Descriptive Statistics>Frequencies 2. Move the variables PlanSept and Actual Oct to the Variables box 3. Check Display Frequency tables 4. Click on OK

Figure 9.1: Frequencies Distribution Setup for Planned September and Actual October votes


Output is as follows.

Figure 9.2: Tally Output for Planned September and Actual October Votes

We can create proportions of interest using these tallies from our sample. Note that some students did not plan to vote, were undecided (and eligible), or were ineligible. In September, there were 300 students who had decided how they planned to vote (87 + 8 + 88 + 117) and 367 students who were eligible to vote and planned to vote (300 + 67). So the proportion of decided sampled students who planned to vote Conservative in September was 87/300 = 0.29, or 29%, while the proportion of sampled students who were eligible to vote and planned to vote Conservative was 87/367 = 0.2371, or 23.71%. In pre-election polls this is an important point to note, as a fairly high proportion of people can be undecided until very close to the election. Here the proportion of sampled students who were undecided in September was 67/367 = 0.1826, or 18.26%.

9.1.2 Confidence Intervals for 1 Proportion

Consider a population in which each member either does or does not have a certain attribute. The population proportion, p, is the proportion of members of the population with that attribute, while the sample proportion, p̂, is the proportion of members of a large sample taken from the population that have that attribute. A one sample (1 − α)100% confidence interval for p is

p̂ ± zα/2 √(p̂(1 − p̂)/n)


where p̂ = x/n is the sample proportion, n is the sample size, and zα/2 is the z value such that the area to the right of zα/2 is α/2, as long as the assumptions in the introduction to section 9.1 have been met.

Example: For the purposes of this question, consider our class data to be a random sample from a hypothetical larger population of 1st year statistics students in Edmonton. Use the FullPostElectionStudentData.sav worksheet to do this problem, and refer to the tallied information in Section 9.1.1 to get the appropriate counts. Create a 95% confidence interval for the proportion of decided voters who planned to vote NDP in September. Let p denote the population proportion planning to vote NDP in September in the hypothesised population. The commands used to answer this question follow. Open a new SPSS worksheet. You will need to put the data for this problem into a different format and use the Weight Cases command to tell SPSS how to treat the counts. Create two new numeric variables in Variable View in your new SPSS worksheet, as shown below.

Figure 9.3: Variable View and Data View for the Two New Variables

NDPORNOT takes the value 1 if a person planned to vote NDP and 2 if they did not. We consider only the 300 people who had decided how they planned to vote in September, so when NDPORNOT = 1, COUNT = 117 and when NDPORNOT = 2, COUNT = 183. Perform the following commands:

1. Select Data>Weight Cases
2. Choose to weight cases by COUNT (bring COUNT to the frequency box)
3. Choose OK
4. Select Analyze>Generalized Linear Models>Generalized Linear Models
5. On the Type of Model tab, select Custom and choose Binomial for the Distribution box and Identity for the Link Function
6. On the Response tab, place NDPORNOT into the Dependent Variable box
7. Select OK
8. Say Yes to the popup box that asks if you want to fit an intercept-only model.

(As mentioned above, the use of these commands is not intuitive for students at this level. Suffice it to say that Generalized Linear Models form a very large library of models, and we can specify a particular one in order to do the problem. A binomial model with the identity link fits our needs.)


Figure 9.4: Dialog Boxes for creating a One Sample Proportion Confidence Interval

Partial output of use follows.

Figure 9.5: Output to obtain a 95% Confidence Interval for a Proportion

Note that the value p̂ = 0.39, or 39%, can be read in the output. The 95% Wald Confidence Interval is the one we need. So a 95% confidence interval for p is (0.335, 0.455), or approximately (34%, 46%). We are 95% confident that the proportion of decided 1st year statistics student voters in Edmonton who planned to vote NDP in September falls between 34% and 46%. Similarly, the following proportions can be determined.


Create a 95% confidence interval for all decided and undecided voters who planned to vote NDP in September.

Summarized data: Number of Events = 117, Number of Trials = 367 (all but ineligible and noplaneligible), 95% normal approximation.

Test and CI for One Proportion
Sample    X      N      Sample p     95% CI
1         117    367    0.318801     (0.271124, 0.366478)
Using the normal approximation.

A 95% confidence interval for p is (0.27,0.37). We are 95% confident the proportion of decided and undecided 1st year Edmonton Statistics student voters who planned to vote NDP in September falls between 27% and 37%.
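The normal-approximation interval above is easy to verify by hand or in code. The following is a small, illustrative Python sketch (not part of the SPSS workflow); the function name is ours, and z = 1.96 gives a 95% interval.

import math

def one_prop_ci(x, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a single proportion."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - half, p_hat + half)

# 117 of the 367 eligible students planned to vote NDP in September
print(one_prop_ci(117, 367))   # roughly (0.271, 0.366), matching the output above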

Create a 95% confidence interval for decided voters who actually voted NDP in October.

Summarized data: Number of Events = 72, Number of Trials = 322 (all but novoteeligible and ineligible), 95% normal approximation.

Test and CI for One Proportion
Sample    X      N      Sample p     95% CI
1         72     322    0.223602     (0.178093, 0.269112)
Using the normal approximation.

A 95% confidence interval for p is (0.18, 0.27). We are 95% confident that the proportion of decided 1st year Edmonton statistics student voters who actually voted NDP in October falls between 18% and 27%.

It is interesting to note the different intervals created for decided versus decided-and-undecided voters, and for how students voted versus how they planned to vote! Undecided voters are highly sought after in the run-up to an election!

9.1.3 Hypothesis Tests for 1 Proportion

As long as the necessary assumptions are met, the test statistic used for testing H0: p = p0 versus Ha: p ≠ p0 looks at how unusual the sample proportion is compared to the hypothesized population proportion when the null hypothesis is true. The formula for the test statistic is shown below.

z = (p̂ − p0) / √(p0(1 − p0)/n)

How unusual the test statistic is (the P-value) is then determined and compared to the pre-chosen level of significance, α, which is the cut-off used to decide whether the observed test statistic represents a significant departure from a population distribution centered on the hypothesized null proportion.

Example: In the 2015 federal election, 19.7% of Canadians voted NDP. You are interested in how this compares to the percentage of 1st year statistics students in Edmonton who voted NDP. Use a hypothesis test to determine, at a significance level of 5%, whether there is evidence that the proportion of students voting NDP differs from 19.7%. Also answer this question using a 95% confidence interval. For the purposes of this question, consider our class data to be a random sample from a hypothetical larger population of 1st year statistics students in Edmonton. Use the FullPostElectionStudentData.sav worksheet to do this problem.


Solution: Let p denote the population proportion of 1st year statistics students in Edmonton who voted NDP. The commands used to answer this question follow. As always, we first set up our data in a new summary table in order to use the Weight Cases command. When doing these questions, I recommend a new datasheet for each of them; it will minimize the chance of running commands with weights attached to the wrong set of data. Create two new numeric variables in Variable View in your new SPSS worksheet, as shown below.

Figure 9.6: Variable View and Data View for the Two New Variables

NDPOCT takes the value 1 if a person voted NDP and 2 if they did not. We consider only the 322 people who actually voted. Perform the following commands to weight the cases:

1. Select Data>Weight Cases
2. Choose to weight cases by COUNTOCT (bring COUNTOCT to the frequency box)
3. Choose OK

Perform the following commands to obtain output that contains the test statistic and P-value needed in order to perform a 6-step hypothesis test:

1. Analyze>Nonparametric Tests>Legacy Dialogs>Chi-square
2. In the Chi-square Test pop-up box, move NDPOCT to the Test Variable List
3. Type 0.197 in the Values box, and then click the Add button
4. Type 0.803 in the now vacant Values box, and then click the Add button again
5. Click OK

Note that you give the procedure the null proportion being investigated (0.197) along with its complement, 0.803. Without going into detail, the problem we are investigating, with only two possible categories for the variable NDPOCT (success = NDP, failure = not NDP), is part of a much larger set of problems that can be investigated with a chi-square test when more categories are of interest.
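As an independent check on the SPSS chi-square output, here is a short, illustrative Python sketch using the counts and null proportion from this example (the comparison of the chi-square statistic with z² holds because there are only two categories).

import math
from scipy import stats

# Observed counts: 72 voted NDP, 250 did not (n = 322); null proportion p0 = 0.197
observed = [72, 322 - 72]
expected = [322 * 0.197, 322 * (1 - 0.197)]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
z = math.sqrt(chi2)                 # for two categories, chi-square = z^2
print(round(chi2, 3), round(z, 2), round(p_value, 3))   # about 1.441, 1.20, 0.230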


Figure 9.7: Dialog Boxes for Creating Output for a One Sample Proportion Test

Output is presented here.

Figure 9.8 Output for a One Sample Proportion Test

Note that the observed N values are the counts we provided for weighting. Checking this is a good way to make sure you have the right cases weighted when you do a problem!


The z test statistic needed for Step 3 in the regular 6-step hypothesis procedure write-up is the square root of the chi-square test statistic provided in the Test Statistics output (it is 1.20 = √1.441). The P-value needed for Step 4 is found in the Asymp. Sig. box in the Test Statistics output (it is 0.230). We now answer the question about the significance of the result by writing up the 6-step hypothesis procedure, followed by a write-up of the confidence interval procedure. Note that we go back to Section 9.1.2 to get the confidence interval.

Step 1 Hypothesis Ho: p = 0.197 versus Ha: p ≠ 0.197

Step 2 Assumptions Simple random sample assumed; # successes = 72 > 5, # failures = 322 − 72 = 250 > 5. Proceed (use the normal approximation).

Step 3 Test Statistic z0 = 1.20

Step 4 P-value 0.230
Step 5 Decision Do not reject Ho, since the P-value of 2P(Z > 1.20) = 0.230 > α = 0.05

Step 6 Conclusion At the 5% level of significance, there is no significant evidence that the true proportion of those voting NDP in a proposed population of first year statistics students differs from a population proportion p of 19.7%.

CI: Assumptions As above (you may do this if both a HT and CI are asked for – otherwise you will need to state them)

CI: Interval (0.178093, 0.269112) ≈ (0.18, 0.27) = (18%, 27%)

CI Conclusion Since 19.7% lies within the confidence interval, there is no significant evidence that the true proportion of those voting NDP in a proposed population of first year Edmonton statistics students differs from a population proportion p of 19.7%.

9.1.4 One Sided Hypothesis Tests for One Proportion

Here is how one-sided hypothesis tests for a proportion are handled, where the alternative hypothesis states that p exceeds or is below the hypothesized null value. The null and alternative hypotheses for these cases are as follows.

Right-sided test: Ho: p = p0 versus Ha: p > p0
Left-sided test: Ho: p = p0 versus Ha: p < p0

In such cases the P-value, the probability of observing a value as extreme as the one observed if Ho is true, is computed from the right tail of the null distribution for a right-sided test and from the left tail of the null distribution for a left-sided test.

Ho: p = p0 versus Ha: p > p0: P-value = P(Z > z0) for the test statistic z0
Ho: p = p0 versus Ha: p < p0: P-value = P(Z < z0) for the test statistic z0


9.2 Introduction: Two Independent Samples Inference for Proportions

In order to perform hypothesis tests and confidence intervals about a difference between two population proportions, p1 − p2, using a sample proportion difference, p̂1 − p̂2 (where p̂1 = x1/n1 and p̂2 = x2/n2, x1 and x2 are the numbers of sample members in each sample with the attribute of interest, and n1 and n2 are the numbers of observations in each sample), it is necessary that some assumptions be met. In particular, in this lab, the following are assumed:

1. Simple random samples 2. Independent samples 3. x1, n1 – x1, x2, and n2 – x2 are all 5 or greater

When these assumptions are met, the sampling distribution of p̂1 − p̂2 has mean p1 − p2 and standard deviation σp̂1−p̂2 = √(p1(1 − p1)/n1 + p2(1 − p2)/n2), and p̂1 − p̂2 is approximately normally distributed for large n1 and n2. Assumptions MUST be checked prior to undertaking statistical inference. The first two assumptions are built into the sample design, and in this course you will assume that these conditions have been met. The third assumption can be readily investigated.

9.2.1 Examining the Data

Example: We return to the FullPostElectionStudentData.sav worksheet. We are interested in determining whether the proportion of first year statistics students in Edmonton who viewed jobs as the most important issue differs between those who view the Conservative platform as best and those who view the Liberal platform as best. The following commands create a crosstab table that will enable us to examine the counts for the two variables "Most Important Issue" (MostImptIssue) and "Best Platform" (BestPlat).

1. Click on Analyze>Descriptive Statistics>Crosstabs 2. Move the variable MostImptIssue to the Rows box 3. Move the variable BestPlat to the Columns box 4. In the Cells popup, make sure Observed is checked under Counts 5. Click Continue and then OK


Figure 9.9: Crosstab Setup for MostImptIssue and BestPlat table

Output follows.

Figure 9.10: Crosstab Table Output


In the following sections, we will investigate whether the proportion of first year statistics students in Edmonton who viewed jobs as the most important issue differs between those who view the Conservative platform as best and those who view the Liberal platform as best. From above, we can see that the proportion of students who viewed jobs as the most important issue among those preferring the Conservative platform was 51/130 = 0.392, while the proportion of students who viewed jobs as the most important issue among those preferring the Liberal platform was 61/220 = 0.277.

9.2.2 Confidence Intervals for 2 Proportions

A two sample 100(1 − α)% confidence interval for p1 − p2 is

(p̂1 − p̂2) ± zα/2 √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]

where p̂1 and p̂2 are the sample proportions for samples 1 and 2, p1 and p2 are the population proportions for populations 1 and 2, n1 and n2 are the sample sizes, and zα/2 is the z value such that the area to the right of it is α/2, as long as the assumptions in the 9.2 introduction have been met.

Example: For the purposes of this question, consider our class data to be a random sample from a hypothetical larger population of 1st year statistics students in Edmonton. Again, use the FullPostElectionStudentData.sav datasheet, and refer to the tallied information in Section 9.2.1 to get the appropriate counts. Let p1 denote the population proportion of Conservative platform aficionado students viewing jobs as the most important issue and p2 denote the population proportion of Liberal platform aficionado students viewing jobs as the most important issue. Create a 95% confidence interval for p1 − p2, the difference in the proportions of students who viewed jobs as the most important issue between those who view the Conservative platform as best and those who view the Liberal platform as best.

The commands used to answer this question follow. Open a new SPSS worksheet. You will need to put the data for this problem into a different format and use the Weight Cases command to let SPSS know how to treat the counts. The two variables of interest are ISSUE (Jobs Most Important Issue = 1, Other Issue Most Important = 2) and PARTY (Conservative = 1, Liberal = 2). Note that if we make a mini-table (by hand) for our problem, it looks like this.


                                          PARTY
ISSUE                              Conservative (1)   Liberal (2)   Total
Jobs Most Important Issue (1)             51               61        112
Other Issue Most Important (2)            79              159        238
Total                                    130              220        350

In order to do the tests we need, we create three new numeric variables (PARTY, ISSUE, and COUNTISPA) in Variable View in a new SPSS worksheet, and populate them in Data View as shown below.

Figure 9.11: Variable View and Data View for the three new variables.

Be sure to set up your Data View to match your original question so that the software can solve the problem correctly. In this case, your two samples of interest are the parties; it is within those parties that you are looking at the proportions of students who view jobs as the most important issue. Keep this in mind.

In the Data View window, PARTY (the variable corresponding to your two samples of interest) is your leftmost column. ISSUE, the other categorical variable, is the middle column. Your counts (COUNTISPA) are in the third column. Always set your first column to contain all of its 1s first, and then all of its 2s. Then, for each of those 1s and 2s, place 1 and 2 in order in the second column. Copy the numbers from the Conservative (1) column of the hand-made table into the rows where PARTY is 1, and copy the numbers from the Liberal (2) column into the rows where PARTY is 2.

Remember which proportions you wish to compare: those who see jobs as most important among the Conservatives versus those who see jobs as most important among the Liberals. The numerator of the Jobs, Conservative proportion should be in the (PARTY = 1, ISSUE = 1) row and the numerator of the Jobs, Liberal proportion should be in the (PARTY = 2, ISSUE = 1) row. A small sketch of this layout follows; after it, perform the SPSS commands listed below.
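Here is the same layout sketched programmatically. This is only an illustration in Python with pandas (neither is part of this SPSS lab); it mirrors the Data View entries and checks them against the hand-made table.

    # Sketch only: the weighted long format, one row per PARTY x ISSUE cell.
    import pandas as pd

    weighted = pd.DataFrame({
        "PARTY":     [1, 1, 2, 2],        # 1 = Conservative, 2 = Liberal (all 1s first)
        "ISSUE":     [1, 2, 1, 2],        # 1 = Jobs, 2 = Other issue
        "COUNTISPA": [51, 79, 61, 159],   # cell counts from the hand-made table
    })

    # Pivoting the counts back into a 2 x 2 table should reproduce the hand-made table.
    print(weighted.pivot(index="PARTY", columns="ISSUE", values="COUNTISPA"))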


1. Select Data>Weight Cases
2. Choose to weight cases by COUNTISPA (move it to the frequency variable box)
3. Choose OK
4. Select Analyze>Generalized Linear Models>Generalized Linear Models
5. On the Type of Model tab, select Custom; choose Binomial for the Distribution box and Identity for the Link Function
6. On the Response tab, place ISSUE into the Dependent Variable box
7. On the Predictor tab, place PARTY into the Factors box
8. On the Model tab, place PARTY into the Model box
9. Select OK
10. Say Yes to the popup box that asks if you want to fit an intercept-only model.

(As mentioned above, the use of these commands is not intuitive for students at this level. Suffice it to say that Generalized Linear Models are a very large family of models, and we specify a particular one in order to do the problem; a Binomial model with an Identity link fits our needs.)

Students sometimes find it difficult to see what to choose for the Response and Predictor variables. In our case, it makes sense to think of the party whose platform you consider best as a predictor of the issue you think is most important. Make sure that your Data View is set up so that the variable in your leftmost column (PARTY, the one referencing your two samples of interest) is used on the Predictor and Model tabs, and the middle-column variable (ISSUE) is used on the Response tab. (This is what your instructor does so that the defaults work.) Whew!

Figure 9.12: Dialog Box for creating a two sample Proportion Interval. It is set at Step 8 above.


Output is as follows. The confidence interval that appears in the PARTY=1 line is the confidence interval for p1 − p2. The B value in that line, 0.115, is p̂1 − p̂2 = 0.392 − 0.277.

Figure 9.13: Output for the Two Sample Proportion 95% Confidence Interval

A 95% confidence interval for p1 − p2 is (0.012, 0.217), or about (1%, 22%). We are 95% confident that the difference in the proportions of students viewing jobs as the most important issue, Conservative platform group minus Liberal platform group, falls between about 1 and 22 percentage points.
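If you would like to verify this interval outside SPSS, the unpooled formula from the start of this section reproduces it. A minimal sketch in Python with scipy (not part of this lab) follows.

    # Sketch only: 95% CI for p1 - p2 using the unpooled standard error.
    from math import sqrt
    from scipy.stats import norm

    x1, n1, x2, n2 = 51, 130, 61, 220
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_crit = norm.ppf(0.975)                    # z value with right-tail area alpha/2 = 0.025

    print(round(diff - z_crit * se, 4), round(diff + z_crit * se, 4))   # about 0.0124 and 0.2177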

9.2.3 Hypothesis Tests for 2 Proportions
As long as the necessary assumptions are met, the test statistic used for testing Ho: p1 − p2 = δ versus Ha: p1 − p2 ≠ δ looks at how unusual the observed difference in sample proportions is compared to the hypothesized difference δ in population proportions, if the null hypothesis is true. We will only consider the case δ = 0, so our null and alternative hypotheses will be Ho: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0. The formula for the test statistic is shown below.

z0 = (p̂1 − p̂2) / √[ p̂p(1 − p̂p)(1/n1 + 1/n2) ],   where p̂p = (x1 + x2)/(n1 + n2) is the pooled estimate of the proportion.

How unusual the test statistic is (the p-value) is then determined and compared to the pre-chosen level of significance, α. This cut-off decides whether the observed test statistic is unusual enough to count as significant evidence that the difference in the population proportions departs from the hypothesized value of 0.
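The same calculation can be sketched outside SPSS. The following is a minimal Python/scipy illustration (not part of this lab) of the pooled z statistic and its two sided p-value, using the example counts from this chapter.

    # Sketch only: pooled two-proportion z test of Ho: p1 - p2 = 0.
    from math import sqrt
    from scipy.stats import norm

    x1, n1, x2, n2 = 51, 130, 61, 220
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                       # pooled estimate of the proportion

    z0 = (p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_two_sided = 2 * norm.sf(abs(z0))                   # 2 P(Z > |z0|)
    print(round(z0, 2), round(p_two_sided, 3))           # about 2.23 and 0.026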


Example: You are interested in whether the proportion of students who viewed jobs as the most important issue differs between those who view the Conservative platform as best and those who view the Liberal platform as best. Use a hypothesis test to determine, at a significance level of 5%, whether there is evidence that the proportion of Conservative platform aficionado students viewing jobs as the most important issue and the proportion of Liberal platform aficionado students viewing jobs as the most important issue differ. Also answer this question using a 95% confidence interval. For the purposes of this question, consider our class data to be a random sample from a hypothetical larger population of 1st year statistics students in Edmonton. Again, we use the FullPostElectionStudentData.sav datasheet.

Solution: Let p1 denote the population proportion of Conservative platform aficionado students viewing jobs as the most important issue and p2 denote the population proportion of Liberal platform aficionado students viewing jobs as the most important issue. The commands used to answer this question follow.

1. Select Analyze>Descriptive Statistics>Crosstabs
2. Move PARTY to the Rows box
3. Move ISSUE to the Columns box
4. Click the Cells button, and check Observed counts and Row percentages
5. Click the Statistics button, and check the Chi-square box
6. Click Continue and OK

Note, without going into further detail, that the problem we are investigating, with only two possible categories for the variable PARTY (Conservative, Liberal) and for the variable ISSUE (success = Jobs, failure = Other Issue), is part of a much larger set of problems that can be investigated with a Chi-square test when more categories are of interest.

Figure 9.14: Dialog Boxes for performing a Two Sample Proportion Test set at Step 5 above.


Output is as follows.

Figure 9.15 Output for a Two Sample Proportion Test

Note the Crosstab table, where one can check that the data entry is correct and that SPSS is working with the right data. The z test statistic needed for Step 3 in the regular 6 step hypothesis procedure write-up is the square root of the Chi-Square test statistic provided in the Test Statistics output (it is 2.23 = √4.969). The p-value needed for Step 4 is found in the Asymptotic Significance box in the same row (it is 0.026). We now answer the question about the significance of the result by writing up the 6 step hypothesis procedure, followed by a write-up of the confidence interval procedure. Note that we have to go back to Section 9.2.2 to get the necessary confidence interval.
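As a quick sanity check on this relationship outside SPSS, a chi-square test of independence on the 2 × 2 table (without the continuity correction) should reproduce the Pearson chi-square. The following is a sketch in Python with scipy, not SPSS output.

    # Sketch only: Pearson chi-square for the 2 x 2 table, and its square root.
    from math import sqrt
    from scipy.stats import chi2_contingency

    observed = [[51, 79],     # Conservative platform: jobs, other issue
                [61, 159]]    # Liberal platform: jobs, other issue
    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print(round(chi2, 3), round(sqrt(chi2), 2), round(p_value, 3))   # about 4.969, 2.23, 0.026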


Recall that p1 denotes the population proportion for Conservative platform aficionado students viewing jobs as the most important issue and p2 denotes the population proportion for Liberal platform aficionado students viewing jobs as the most important issue.

Step 1 Hypotheses Ho: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0

Step 2 Assumptions
Simple Random Samples assumed
Independent Samples
Sample 1: # successes = 51 > 5, # failures = 130 − 51 = 79 > 5
Sample 2: # successes = 61 > 5, # failures = 220 − 61 = 159 > 5
Proceed (use pooled estimate of proportion)

Step 3 Test Statistic z0 = 2.23
Step 4 P-value 2P(Z > 2.23) = 0.026

Step 5 Decision Reject Ho, since the p-value of 0.026 is less than α = 0.05

Step 6 Conclusion At the 5% significance level, we do have significant evidence that the population proportion of those who viewed jobs as the most important issue differs between those who view the Conservative platform as best and those who view the Liberal platform as best.

CI: Assumptions As above (you may do this if both a HT and CI are asked for – otherwise you will need to state them)

CI: Interval (0.0123519, 0.217718)≈ (0.01, 0.22) ≈ (1%, 22%)

CI Conclusion Since the confidence interval does not cover 0, at the 5% significance level we do have significant evidence of a difference between the population proportion of Conservative platform aficionado students viewing jobs as the most important issue and the population proportion of Liberal platform aficionado students viewing jobs as the most important issue.

9.2.4 One Sided Hypothesis Tests for 2 Proportions
Here is how SPSS handles one sided hypothesis tests for two proportions, where the alternative hypothesis states that p1 − p2 exceeds, or falls below, 0. The null and alternative hypotheses for these cases follow.

Right Sided Test Ho: p1 – p2 = 0 versus Ha: p1 – p2 > 0

Left Sided Test Ho: p1 – p2 = 0 versus Ha: p1 – p2 < 0

In such cases the p-value (the probability of observing a test statistic at least as extreme as the one observed, if Ho is true) is taken from the right tail of the null distribution for a right sided test and from the left tail for a left sided test.

Ha: p1 − p2 > 0: p-value = P(Z > z0) for the test statistic z0
Ha: p1 − p2 < 0: p-value = P(Z < z0) for the test statistic z0
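If a one sided p-value is needed, it can be obtained from the signed z statistic, since the chi-square significance reported by Crosstabs is inherently two sided. A minimal sketch in Python with scipy (not part of this lab), using the z0 from the example above:

    # Sketch only: one sided p-values from the signed z statistic.
    from scipy.stats import norm

    z0 = 2.23                        # sign follows p1_hat - p2_hat in the numerator
    print(round(norm.sf(z0), 3))     # P(Z > z0), for Ha: p1 - p2 > 0; about 0.013
    print(round(norm.cdf(z0), 3))    # P(Z < z0), for Ha: p1 - p2 < 0; about 0.987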


9.3 Online Proportion Calculators
The work done in SPSS above, as chosen by your instructor, matches the Minitab answers for one proportion and two proportion confidence intervals, and also matches the answers obtained by the approach taken in your textbook. A few online calculators for proportion inference are provided below; students are encouraged to scout about for more. Your instructor could not find an online calculator that calculates the confidence interval for the difference in two proportions the same way.

One Proportion: This gives a match on the confidence interval.
https://www.easycalculation.com/statistics/population-confidence-interval.php

One Proportion: This gives a z value and a p-value that match. The CI is calculated differently and will not match.
https://www.medcalc.org/calc/test_one_proportion.php

Two Proportions: This gives a match on the z value and on the p-value. The CI is not provided.
http://www.socscistatistics.com/tests/ztest/

Two Proportions: The z value and the p-value match, but the CI does not.
http://vassarstats.net/prop2_ind.html