A Guide to Data Analysis in R Commander

Viann Nguyen-Feng, M.A.

Mark A. Stellmack, Ph.D.

University of Minnesota

Copyright © 2016 by Viann Nguyen-Feng and Mark A. Stellmack


About the authors

Viann Nguyen-Feng received her M.P.H. from Eastern Virginia Medical School, after which she completed a post-graduate epidemiology fellowship and then an M.A. at the University of Minnesota. She is currently pursuing a Ph.D. at the University of Minnesota.

Mark A. Stellmack received his Ph.D. in Experimental Psychology from Loyola University of Chicago, specializing in the study of auditory perception. He teaches undergraduate statistics and research methods courses at the University of Minnesota.

Please contact the authors with questions, comments, and suggestions:
Viann Nguyen-Feng: [email protected]
Mark Stellmack: [email protected]


Table of Contents

The Purpose of this Guide .......... 5
Installing R, RStudio, and Rcmdr .......... 6
Running R Commander .......... 7
Initial data entry .......... 9
    Saving a Comma-Separated Values (CSV) spreadsheet in Excel .......... 9
    Opening a Comma-Separated Values (CSV) spreadsheet in Rcmdr .......... 10
    Viewing a data set .......... 11
    Editing a data set .......... 11
    Switching between multiple data sets .......... 11
    Saving and loading a data set .......... 12

BASIC ANALYSES .......... 15
Descriptive statistics .......... 15
    Mean, standard deviation, standard error of mean, interquartile range, coefficient of variation, skewness, kurtosis, quantiles .......... 15
Correlations .......... 18
    Correlation matrix .......... 18
Two-sample t-tests .......... 21
    Between-groups/Independent-groups t-test .......... 21
    Within-groups/Repeated-measures/Correlated-groups/Paired t-test .......... 26
One-Way Analysis of Variance (ANOVA) and Post-Hoc Tests .......... 30
Two-Way Analysis of Variance (ANOVA) .......... 35
    Testing main effects and interactions with multi-way ANOVAs .......... 35
    Graphing interactions with multi-way ANOVAs .......... 38
Linear regression .......... 40
Chi-square .......... 43
    Chi-square using raw data .......... 43
    Chi-square using frequency counts .......... 46

ADVANCED ANALYSES .......... 49
Descriptive statistics for sub-groups .......... 49
Correlation test: Testing the significance of correlations .......... 53
Graphs .......... 56
    Scatterplot .......... 56
    Scatterplot by groups .......... 59
    Line graph .......... 61
    Bar graph .......... 64
Miscellaneous .......... 66
    Opening and Entering data .......... 66
    Opening an Excel spreadsheet .......... 66
    Entering data directly into Rcmdr .......... 67
    Recoding variables .......... 70
    Combining items .......... 73
    Converting variables from numeric to factor items .......... 75
Coding in R .......... 77
    Deleting data sets .......... 77
    Labeling points in a scatterplot .......... 79
    Repeated-measures ANOVA .......... 82
    Mixed-method ANOVA .......... 89
Updates .......... 93
    Updating packages .......... 93
    RStudio updates .......... 93


The Purpose of this Guide

Our background is in Psychology. We teach introductory statistics and research methods courses, which are typically required for most Psychology majors. Our statistics course teaches the basics of descriptive and inferential statistics, and our students perform computations by hand. In our research methods course, students learn to perform the same statistical analyses on a computer. In the past, we used a popular software package for which our university purchased a site license and which ran on university-owned computers. However, the software is prohibitively expensive for students. As a result, students were able to use the software only on university computers during times when the computers were available. We sought an alternative software package that worked like the more expensive option but that was more affordable and that students could use anytime on their own computers. That led us to the R programming language.

R is a free, powerful data-analysis program that performs many complex statistical analyses, but using R requires one to learn the R programming language. R Commander (Rcmdr) is a simple point-and-click interface to the R language that provides easy access to the most common analyses that Psychology students are likely to want to perform.

Our goal was not to write a statistics textbook. Thus, this guide does not contain exercises or practice problems. Rather, our goal was to write a guide for users who already have knowledge of basic statistical techniques. This guide provides simple, step-by-step instructions for performing those analyses. This guide assumes that the user has knowledge of statistics at the level of a student who has completed an introductory course, including an understanding of the interpretation of p-values.

Rcmdr is somewhat intuitive, but there are enough quirks and hidden data-formatting requirements to bring some analyses grinding to a halt if you do not format things properly. In addition, the Rcmdr output sometimes can be cluttered and difficult to wade through. This guide instructs the user on how to format the data for a particular analysis, what to click on, and where to find the most relevant information in the output. We tried to keep the writing as brief as possible so that this guide would be a useful, quick reference tool.

Layouts and instructions may vary depending on your operating system or computer type (e.g., Mac or PC), so your screen may appear slightly different from the ones in this guide. The instructions were prepared using a Windows platform.

Disclaimer: All data presented in this guide are entirely fabricated and perhaps even nonsensical. They are meant to serve an illustrative purpose in understanding the basics of data analysis in Rcmdr.


Installing R, RStudio, and Rcmdr

R is a free software package and programming language for performing a wide variety of statistical analyses. RStudio and R Commander are interfaces that make it easier and more convenient to use R. You will only need to follow the instructions on this page once, when you first install R, RStudio, and Rcmdr. You must first download and install R, then download and install RStudio; only then can you install R Commander.

1. Go to one of the following websites and follow the instructions to download and install the R software:
   Windows: http://cran.r-project.org/bin/windows/base/
   Macintosh: http://cran.r-project.org/bin/macosx/

2. After R is downloaded and installed, go to the following website and follow the instructions to download the RStudio software for your operating system: http://www.rstudio.com/products/rstudio/download/

3. After RStudio is downloaded and installed, launch RStudio.

4. When RStudio opens, at the command prompt (>) in the Console panel, type

install.packages("Rcmdr", dependencies=TRUE)

and press Enter. Note: R and RStudio are case sensitive! If a pop-up window appears asking you to “Please select a CRAN mirror for use in this session,” select the site closest to you, then click “OK.”

5. Many messages will appear in the Console window as R Commander is being installed. When the installation is complete, the command prompt (>) will appear again at the bottom of the Console window.

6. To open R Commander, type library(Rcmdr) at the command prompt and press Enter.

If a pop-up window appears saying that you need to install another package and asking if you want to do so, click on “YES”.

To run R Commander in the future after it is installed, you only need to launch RStudio and type library(Rcmdr) at the command prompt in the Console window. (You will need to click in the Console window before you can type in it.)
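The install-once, load-every-time pattern described above can also be written so that installation happens only when the package is missing. The following is a minimal sketch, not something Rcmdr requires; the helper name ensure_installed is hypothetical and is not part of R or Rcmdr.

```r
# Hypothetical helper (not part of R or Rcmdr): install a package only if
# it is not already available, then report whether it can be loaded.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Example call (commented out so that nothing is installed by accident):
# if (ensure_installed("Rcmdr")) library(Rcmdr)
```

Typing library(Rcmdr) by hand, as described above, remains the simplest way to start R Commander once it is installed.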


Running R Commander

To run R Commander, you must first launch RStudio (by double-clicking the RStudio icon). The window shown below will open. The panel on the left-hand side of the RStudio window is the “Console”. The “>” in the Console panel is the command prompt. At the command prompt, type library(Rcmdr) and press Enter. (You probably will need to click in the Console window before you can type in it.)

When you press Enter, Rcmdr will open. Rcmdr looks like this:

You can click on commands in the Rcmdr menus to run your analyses. All of your output (the results of your point-and-click commands) will appear in the RStudio “Console” window.

The big, white box at the bottom of the R Commander window is the “R Script” box of Rcmdr. You will not need to type anything in the “R Script” box when you are doing basic statistical analyses in Rcmdr. The “R Script” box has two purposes:

1. Script: You can enter R code into this box. This guide does not focus on writing and entering R code, though the end of this guide provides code for you to type in to perform some special functions. Rcmdr also generates code that appears in this box when you point and click commands. Thus, as you click on commands in R Commander, code will appear in the “R Script” box.

2. Messages: Rcmdr will give you “notes,” “warnings,” or “error messages” about the commands that you execute. These messages are generated and displayed by Rcmdr in the “R Script” box as you perform various operations in Rcmdr. “Notes” do not require any action, “warnings” may require some action, and “error messages” definitely require some action in order to have the command run properly. The messages will most likely explicitly tell you what went wrong and, thus, what must be changed in order to have the command run properly.

Initial data entry

Here, we describe how to open a Comma-Separated Values (CSV) spreadsheet in Rcmdr. You can easily create a CSV file by entering your data in Excel and saving it in CSV format. Not all versions of Rcmdr are the same. If the methods for opening a data file described below do not work on your computer, see the Miscellaneous section of this guide (page 66) for other methods of entering data (e.g., opening an Excel spreadsheet, manually entering data). We highlight opening a CSV spreadsheet as the primary data entry method because CSV files seem to open across all operating platforms.

Saving a Comma-Separated Values (CSV) spreadsheet in Excel

Launch Excel and enter your data in a spreadsheet, remembering to enter each subject’s data in a different row and each variable in a different column. Save your Excel sheet as a .csv instead of a .xls or .xlsx: In Excel, go to File → Save As. In the dropdown menu next to “Save as type,” select “CSV (Comma delimited).” Click “OK” through the remaining windows to save the file.

Opening a Comma-Separated Values (CSV) spreadsheet in Rcmdr

In Rcmdr, click on Data → Import data → from text file, clipboard, or URL…

A “Read Text Data From File, Clipboard, or URL” window will open. The default data set name is “Dataset.” This is a label for the data set that is used within Rcmdr. You may change this to something more meaningful by clicking in the box and typing. (This is particularly useful if you intend to open more than one data set at a time.) Under “Field Separator,” change the setting from “White space” to “Comma” because the values in a CSV file are separated by commas. You may keep the other default values in this window.

Click “OK” and select the data file you want to open in the subsequent windows to create the new data set.
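For readers curious about what the import dialog does behind the scenes, the sketch below reads a CSV file with base R's read.table(), using a comma as the field separator. It is only an approximation of the code that Rcmdr writes to the “R Script” box; the file contents are fabricated, and a temporary file is used so that the example is self-contained.

```r
# Write a small, fabricated CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student,class,test1,test2",
             "Ann,freshman,12,45",
             "Bob,sophomore,20,50",
             "Cat,freshman,25,57"),
           csv_path)

# Read the file with a comma as the field separator; the first row holds
# the variable names.
Dataset <- read.table(csv_path, header = TRUE, sep = ",",
                      na.strings = "NA", dec = ".", strip.white = TRUE)

str(Dataset)   # one row per subject, one column per variable
```

With your own data, csv_path would be replaced by the location of the file you saved from Excel.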

Viewing a data set

When you return to the Rcmdr home screen, you can view the data you read into Rcmdr by clicking the “View data set” button.

Your data set table will pop up in a new window. You cannot edit the variable names or the data in this window.

Editing a data set

To change variable names or to change specific cell values, click on the “Edit data set” button. The Data Editor window will open with your data in it. In the Data Editor window, you can click on variable names or cell values to edit them. (For more information, refer to the section “Entering data directly into Rcmdr” on page 67.)

Switching between multiple data sets

You can open or create more than one data set during an Rcmdr session. In the Rcmdr home screen, the button next to “Data set:” shows the data set that is currently active. All commands that you click on will operate on the data set that is named in the button.

If you opened or created more than one data set during this Rcmdr session and you would like to switch to a different data set, click the button next to “Data set:”. A window labeled “Select Data Set” will pop up. This window contains a list of all of the data sets that you have created during this session.

Select the data set that you would like to use, then click “OK.” You will notice that the button on your Rcmdr home screen will change to reflect the name of the data set that you have selected. That data set is now the active data set.

Saving and loading a data set

After creating a data set, you may want to save it as an R Data file so that you can easily load it into Rcmdr later instead of re-importing the CSV (or Excel, etc.) file. This is particularly handy when you have entered the data manually (see p. 67). Go to: Data → Active data set → Save active data set…


A “Save As” window will pop up. Select the folder that you would like the data set to be stored in and change the File name as appropriate. Click “Save.”

To load the saved data set later, go to: Data → Load data set…

An “Open” window will pop up. Select the R Data file in the appropriate folder. Press “Open” and the data file will be loaded into Rcmdr. The data set you loaded will be the active data set, and the name that you gave the data set before you saved it will be displayed in the “Data set:” button in Rcmdr.
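The save-and-load round trip can be sketched in base R with save() and load(). This is an illustration under our own assumptions (a fabricated data set and a temporary file), not necessarily the exact code that Rcmdr generates. It shows why the original data set name reappears after loading: the R Data file stores the object under its own name.

```r
# Fabricated data set, named "Dataset" as in the Rcmdr default.
Dataset <- data.frame(test1 = c(12, 20, 25), test2 = c(45, 50, 57))

# Save the object to an R Data file (a temporary file in this sketch).
rdata_path <- tempfile(fileext = ".RData")
save("Dataset", file = rdata_path)

# Remove the object to imitate starting a fresh session, then load it back.
rm(Dataset)
restored_names <- load(rdata_path)   # load() returns the names it restored

restored_names   # the data set comes back under the name it was saved with
```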

BASIC ANALYSES

Descriptive statistics

Mean, standard deviation, standard error of mean, interquartile range, coefficient of variation, skewness, kurtosis, quantiles

Statistics → Summaries → Numerical summaries…

A “Numerical Summaries” window will pop up. Select the variable(s) for which you want to calculate descriptive statistics. Only numeric variables will be shown in this box. Hold the “Ctrl” key while clicking to select more than one variable. Hold the “Shift” key and click to select a range of variables that are listed directly next to each other.

Click on the “Statistics” tab to select which descriptive statistics you would like. The ones selected below (mean, standard deviation, interquartile range, quantiles) are selected by default.

Click “OK.” The output will appear in your RStudio console window.

    Rcmdr> numSummary(OurData2[,"test1"],
    Rcmdr+   statistics=c("mean", "sd", "IQR", "quantiles"),
    Rcmdr+   quantiles=c(0,.25,.5,.75,1))
     mean       sd IQR 0% 25% 50% 75% 100%  n
     19.4 4.102264   7 12  16  20  23   25 15

Interpretation:
Mean (mean) = 19.4
Standard deviation (sd) = 4.102264
Interquartile range (IQR) = 7
0th percentile score (0%) = 12
25th percentile score (25%) = 16
50th percentile score (50%) = 20
75th percentile score (75%) = 23
100th percentile score (100%) = 25
Sample size (n) = 15
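The same summaries can be computed one at a time with base R functions, which may help in seeing what each column of the numSummary output represents. The sketch below uses fabricated scores (not the data behind the output above). The standard error and coefficient of variation lines use the usual textbook definitions, sd/sqrt(n) and sd/mean, which we assume match the corresponding options on the “Statistics” tab.

```r
test1 <- c(12, 14, 15, 16, 18, 20, 21, 23, 25)   # fabricated scores

summary_stats <- c(
  mean = mean(test1),
  sd   = sd(test1),
  se   = sd(test1) / sqrt(length(test1)),   # standard error of the mean
  cv   = sd(test1) / mean(test1),           # coefficient of variation
  IQR  = IQR(test1),                        # 75th minus 25th percentile
  quantile(test1, probs = c(0, .25, .5, .75, 1)),
  n    = length(test1)
)

summary_stats
```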

Correlations

Correlation matrix

A correlation matrix is a table with all of the variables of interest listed in the rows and the columns. The intersecting cell of a particular row and a particular column shows the Pearson product-moment correlation (r) between the two variables. Correlations are shown for all pair-wise combinations of the variables of interest.

Statistics → Summaries → Correlation matrix…

A “Correlation Matrix” window will pop up.

- Under “Variables (pick two or more)”: Select the variables that you want to include in the correlation matrix. Only numeric variables appear in this list. Hold the “Ctrl” key while clicking to select more than one variable. Hold the “Shift” key and click to select a range of variables that are listed directly next to each other.
- Under “Type of Correlations”: When most people talk about a correlation, they are referring to the Pearson product-moment correlation (r). This is the default setting.
- Under “Observations to Use”: The option selected here does not matter if you chose only two variables in the box above. If you chose three or more variables, the options in this section have the following effects:

1. “Complete observations”: If the value of one variable is missing for a case (or row), then the entire case/row will be omitted from all correlation computations in the correlation matrix. This will result in the same number of observations across all correlations.

2. “Pairwise-complete observations”: If the value of one variable is missing for a case (or row), then that case will be omitted from the analysis only for the correlations involving the variable with the missing observation. This may result in different numbers of observations for each correlation.
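The difference between the two options can be seen in a small sketch that uses base R's cor() function rather than the function Rcmdr calls. The data are fabricated, and one test3 score is deliberately missing.

```r
# Fabricated scores; test3 is missing (NA) for the second student.
scores <- data.frame(
  test1 = c(12, 16, 20, 23, 25),
  test2 = c(45, 47, 52, 53, 57),
  test3 = c(30, NA, 41, 39, 48)
)

# "Complete observations": row 2 is dropped from every correlation.
r_complete <- cor(scores, use = "complete.obs")

# "Pairwise-complete observations": row 2 is dropped only for the
# correlations that involve test3.
r_pairwise <- cor(scores, use = "pairwise.complete.obs")

# The test1-test2 correlation is based on 4 rows in the first matrix and
# on all 5 rows in the second, so the two values generally differ.
r_complete["test1", "test2"]
r_pairwise["test1", "test2"]
```

The correlations involving test3 are the same in both matrices, because in either case only the four students with a test3 score can contribute.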


To calculate the p-value of the correlation (to determine if the r-value is significantly different from zero), select the “Pairwise p-values” option so that a checkmark appears in the box.

Click “OK.” The output will appear in the RStudio console window.

    Rcmdr> rcorr.adjust(Dataset[,c("test1","test2")], type="pearson", use="complete")

          test1 test2
    test1  1.00  0.92
    test2  0.92  1.00

     n= 15

     P
          test1 test2
    test1           0
    test2     0

     Adjusted p-values (Holm's method)
          test1 test2
    test1           0
    test2     0

Interpretation: The output tells us that the correlation between a student’s score on test1 and his/her score on test2 is 0.92 (r = 0.92). Looking at the p-values in the P section, we see that the p-value for this correlation is 0. As a result, we would conclude that this r is statistically significantly different from zero (assuming we have chosen the .05 level of significance). (Note that 0s in the P section indicate that p < .001, not zero. The value of p is never zero.)
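To see that a p-value displayed as 0 is really a very small positive number, the correlation can be tested with base R's cor.test(), which reports the unrounded p-value. This is a sketch with fabricated scores, not the data behind the output above.

```r
test1 <- c(12, 14, 15, 16, 18, 20, 21, 23, 25)   # fabricated scores
test2 <- c(44, 47, 45, 49, 50, 53, 52, 55, 58)   # fabricated scores

result <- cor.test(test1, test2, method = "pearson")

result$estimate    # Pearson r
result$p.value     # very small, but strictly greater than zero

# Format the p-value for reporting; values below eps are shown as "<0.001".
format.pval(result$p.value, eps = 0.001)
```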

Two-sample t-tests

Between-groups/Independent-groups t-test

In the following example, we will perform a between-groups/independent-groups t-test in which we want to compare how students in different classes (freshman and sophomore) perform on a specific test. Data sets often contain more information than needed for a particular analysis. For example, in this case, we have scores for two different tests, but we will only compare how students performed on one test, test1. The students’ names (student variable) also are additional data that are not necessary to perform the between-groups/independent-groups t-test in this case. This is what our data file looks like:

IMPORTANT DATA-FORMATTING NOTE: Rcmdr will only make the independent-samples t-test analysis available if Rcmdr can identify a potential grouping variable in your data set. In order for the “Independent samples t-test…” option to be available in the Rcmdr menu, your data must have at least one variable (column) with exactly two different values of a character variable (e.g. “a” and “b”) that can possibly serve as the grouping variable; there cannot be more than two values of the character variable that you want to use as the grouping variable. Also, the grouping variable cannot be numeric (e.g., “1” and “2”). For example, in the data set shown above, the class variable has values of only “freshman” and “sophomore”, so it is a potential grouping variable. If the active data set does not satisfy these conditions, the “Independent samples t-test…” option will be grayed out (not selectable) in the Rcmdr menu.
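If your grouping variable was entered as numbers, one way to make it usable is to convert it to a factor with descriptive labels. The sketch below does this in R code with fabricated data; the coding 1 = freshman and 2 = sophomore is an assumption made for this example.

```r
# Fabricated data in which class was coded numerically
# (assumed coding: 1 = freshman, 2 = sophomore).
Dataset <- data.frame(
  class = c(1, 2, 1, 2, 1, 2),
  test1 = c(12, 20, 25, 16, 23, 18)
)

# Convert the numeric codes to a factor with descriptive labels.
Dataset$class <- factor(Dataset$class, levels = c(1, 2),
                        labels = c("freshman", "sophomore"))

is.factor(Dataset$class)   # TRUE
nlevels(Dataset$class)     # 2, so class can identify exactly two groups
```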

Statistics → Means → Independent samples t-test…

An “Independent Samples t-Test” window will pop up.

- Under “Groups (pick one)”: Select the grouping variable that identifies the two groups. This will be the independent variable. Only character variables that Rcmdr determines to be potential grouping variables (see the IMPORTANT DATA-FORMATTING NOTE above) will be shown in this list; in this example, student does not appear on the list because no student name appears more than once, so it does not seem to be a grouping variable.
- Under “Response Variable (pick one)”: Select the variable for which you want the means to be calculated. This will be the dependent variable. Only numeric variables will be shown in this list.

Select the “Options” tab (next to the “Data” tab). “Difference” has been automatically set to “freshman – sophomore” (freshman minus sophomore) as these are the two categories in alphabetical order under the “Groups” variable (class). In this example, Rcmdr will calculate the difference between means as the freshman mean minus the sophomore mean.

- Under “Alternative Hypothesis”: Select “Two-sided” if your alternative hypothesis is non-directional and states that the freshman mean is different from the sophomore mean. Select “Difference < 0” if your alternative hypothesis is directional and predicts that freshman – sophomore (i.e., the freshman mean minus the sophomore mean) is less than 0, meaning that the sophomore mean is larger than the freshman mean. Select “Difference > 0” if your alternative hypothesis is directional and predicts that freshman – sophomore (i.e., the freshman mean minus the sophomore mean) is greater than 0, meaning that the freshman mean is larger than the sophomore mean.
- Under “Confidence Level”: 1 – the confidence level = alpha, your chosen level of significance. Setting alpha to .05 is typical, so you probably will keep the default setting of Confidence Level = .95.
- Under “Assume equal variances?”: Assuming that we have checked our assumptions beforehand, we would ideally want the variances of the two groups (freshman and sophomore) to be equal. The default setting is “No.” Change this to “Yes.”

Click “OK.” The output will appear in your RStudio console window. (See following pages.)

When you assume equal variances in the Independent Samples t-test, you are assuming that your data meet the condition of homogeneity of variance. Homogeneity of variance means that the variances of the populations from which your samples were drawn are equal. The homogeneity of variance condition is most important when there is a large difference between the sizes of the samples. If the sample sizes and the sample variances are very different, the results of the Independent Samples t-test will be less interpretable. If you suspect that the samples may have been drawn from populations with unequal variances, there are tests for homogeneity of variance; for example, Hartley’s F-max test. Refer to an advanced statistics text for instructions on performing that test.
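As a rough illustration of checking this condition in R code, the sketch below compares the two sample variances for fabricated data. It uses base R's var.test(), an F test for the equality of two variances, which is a different test from Hartley's F-max and is shown only as one readily available option. The last line runs Welch's t-test, the version of the test that does not assume equal variances.

```r
# Fabricated scores for two groups of unequal size.
Dataset <- data.frame(
  class = factor(rep(c("freshman", "sophomore"), times = c(5, 8))),
  test1 = c(12, 25, 16, 23, 19, 18, 20, 19, 21, 20, 18, 22, 19)
)

# Sample variance within each group, and the larger-to-smaller ratio.
group_vars <- tapply(Dataset$test1, Dataset$class, var)
variance_ratio <- max(group_vars) / min(group_vars)
variance_ratio

# F test for the equality of two variances (not Hartley's F-max test).
variance_check <- var.test(test1 ~ class, data = Dataset)
variance_check$p.value

# Welch's t-test: the two-sample t-test without the equal-variance assumption.
welch <- t.test(test1 ~ class, data = Dataset, var.equal = FALSE)
welch
```

A large variance ratio combined with very different sample sizes is the situation in which the equal-variance version of the test is least trustworthy.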

Output when Alternative Hypothesis = Two-sided

    Rcmdr> t.test(test1~class, alternative='two.sided', conf.level=.95, var.equal=TRUE,
    Rcmdr+   data=Dataset)

            Two Sample t-test

    data:  test1 by class
    t = 0.3913, df = 13, p-value = 0.7019
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.874943  5.589228
    sample estimates:
     mean in group freshman mean in group sophomore
                   19.85714                19.00000

Interpretation: The freshman and sophomore group means do not significantly differ from each other, as p is greater than alpha = .05. In APA format, we would write the results: t(13) = 0.39, p = .70.

Output when Alternative Hypothesis = Difference < 0

    Rcmdr> t.test(test1~class, alternative='less', conf.level=.95, var.equal=TRUE,
    Rcmdr+   data=Dataset)

            Two Sample t-test

    data:  test1 by class
    t = 0.3913, df = 13, p-value = 0.649
    alternative hypothesis: true difference in means is less than 0
    95 percent confidence interval:
         -Inf 4.736207
    sample estimates:
     mean in group freshman mean in group sophomore
                   19.85714                19.00000

Interpretation: The sophomore mean is not significantly larger than the freshman mean, as p is greater than alpha = .05. In APA format, we would write the results: t(13) = 0.39, p = .65.

Output when Alternative Hypothesis = Difference > 0

    Rcmdr> t.test(test1~class, alternative='greater', conf.level=.95, var.equal=TRUE,
    Rcmdr+   data=Dataset)

            Two Sample t-test

    data:  test1 by class
    t = 0.3913, df = 13, p-value = 0.351
    alternative hypothesis: true difference in means is greater than 0
    95 percent confidence interval:
     -3.021921       Inf
    sample estimates:
     mean in group freshman mean in group sophomore
                   19.85714                19.00000

Interpretation: The freshman mean is not significantly larger than the sophomore mean, as p is greater than alpha = .05. In APA format, we would write the results: t(13) = 0.39, p = .35.
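The three p-values above are related: the two one-sided p-values sum to 1 (.649 + .351 = 1), and the two-sided p-value is twice the smaller one-sided value (2 × .351 ≈ .70). The sketch below checks the same relationships on fabricated data; it does not reproduce the output shown in this section.

```r
# Fabricated scores for 7 freshmen and 8 sophomores.
Dataset <- data.frame(
  class = factor(rep(c("freshman", "sophomore"), times = c(7, 8))),
  test1 = c(12, 25, 16, 23, 19, 24, 20,
            18, 20, 15, 21, 23, 14, 22, 19)
)

# Run the same test under each alternative hypothesis and keep the p-value.
p_for <- function(alt) {
  t.test(test1 ~ class, data = Dataset, alternative = alt,
         conf.level = .95, var.equal = TRUE)$p.value
}
p_two     <- p_for("two.sided")
p_less    <- p_for("less")
p_greater <- p_for("greater")

p_less + p_greater            # the one-sided p-values sum to 1
2 * min(p_less, p_greater)    # equals the two-sided p-value
p_two
```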

Within-groups/Repeated-measures/Correlated-groups/Paired t-test

In the following example, we will use the data set that we introduced in the previous section (covering the Between-groups/Independent-groups t-test):

IMPORTANT DATA-FORMATTING NOTE: In order for the repeated-measures t-test (“Paired t-test…”) option to be available in the Rcmdr menu, your data must have at least two variables (columns) containing all numerical values. For example, in the data set shown above, the test1 and test2 variables contain all numerical values. If your data do not satisfy these conditions, the “Paired t-test…” option will be grayed out (not selectable) in the Rcmdr menu.

Statistics → Means → Paired t-test…


A “Paired t-Test” window will pop up. Select one variable under “First variable (pick one)” and then a different variable under “Second variable (pick one).” Only variables that contain all numerical values will be shown.

Select the “Options” tab (next to the “Data” tab). “Difference” has been automatically set to your “First variable” – your “Second variable” (first minus second variable), as selected on the “Data” tab. In this case, “Difference” refers to test1 – test2. (Note that, unlike the “Independent samples t-test”, the “Paired t-test” does not show the difference in this window.)

- Under “Alternative Hypothesis”: Select “Two-sided” if your alternative hypothesis is non-directional and states that the test1 mean is different from the test2 mean. Select “Difference < 0” if your alternative hypothesis is directional and predicts that test1 – test2 (i.e., the test1 score minus the test2 score) is less than 0, meaning that the test2 mean is larger than the test1 mean. Select “Difference > 0” if your alternative hypothesis is directional and predicts that test1 – test2 (i.e., the test1 score minus the test2 score) is greater than 0, meaning that the test1 mean is larger than the test2 mean.
- Under “Confidence Level”: 1 – the confidence level = alpha. Setting alpha at .05 is typical, so you probably will keep the default setting of Confidence Level = .95.

Click “OK.” The output will appear in your RStudio console window. (See following pages.)

28

Output when Alternative Hypothesis = Two-sided Rcmdr> with(Dataset, (t.test(test1, test2,

alternative='two.sided', conf.level=.95,

Rcmdr+ paired=TRUE)))

Paired t-test

data: test1 and test2

t = -35.515, df = 14, p-value = 4.044e-15

alternative hypothesis: true difference in means is not equal to

0


-32.87212 -29.12788

sample estimates:

mean of the differences

-31

Interpretation: The test1 and test2 mean scores significantly differ from each other, as p is less than alpha = .05. In APA format, we would write the results: t(14) = -35.52, p < .001.

Output when Alternative Hypothesis = Difference < 0

Rcmdr> with(Dataset, (t.test(test1, test2, alternative='less', conf.level=.95,
Rcmdr+ paired=TRUE)))

        Paired t-test

data: test1 and test2
t = -35.515, df = 14, p-value = 2.022e-15
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -29.4626
sample estimates:
mean of the differences
                    -31

Interpretation: The test2 mean score is significantly larger than the test1 mean score, as p is less than alpha = .05. In APA format, we would write the results: t(14) = -35.52, p < .001.

Output when Alternative Hypothesis = Difference > 0

Rcmdr> with(Dataset, (t.test(test1, test2, alternative='greater', conf.level=.95,
Rcmdr+ paired=TRUE)))

        Paired t-test

data: test1 and test2
t = -35.515, df = 14, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -32.5374      Inf
sample estimates:
mean of the differences
                    -31

Interpretation: The test1 mean score is not significantly larger than the test2 mean score, as p is greater than alpha = .05. In APA format, we would write the results: t(14) = -35.52, p > .99.
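The confidence limits in the three outputs above can be reconstructed from the reported mean difference, t statistic, and degrees of freedom. The following is a minimal sketch in base R; the variable names are our own, and small discrepancies from the printed output are expected because the inputs are rounded.

```r
# Values copied from the paired t-test output above
mean_diff <- -31       # mean of the differences (test1 - test2)
t_value   <- -35.515   # reported t statistic
df_t      <- 14        # degrees of freedom (number of pairs minus 1)

# Because t = mean_diff / SE, the standard error can be recovered
se_diff <- mean_diff / t_value

# Two-sided 95% interval: estimate +/- critical t * SE
ci_two_sided <- mean_diff + c(-1, 1) * qt(0.975, df_t) * se_diff

# One-sided 95% bounds use the 0.95 quantile instead of the 0.975 quantile
upper_less    <- mean_diff + qt(0.95, df_t) * se_diff  # "Difference < 0"
lower_greater <- mean_diff - qt(0.95, df_t) * se_diff  # "Difference > 0"

print(round(ci_two_sided, 5))
print(round(c(upper_less, lower_greater), 4))
```

The one-sided intervals are open on one side (-Inf or Inf in the output) because a directional hypothesis only requires a bound in one direction.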

One-Way Analysis of Variance (ANOVA) and Post-Hoc Tests

Perform a one-way ANOVA if you want to compare the means of three or more samples that differ in terms of the level of a single independent variable. If the means of the samples are significantly different, you may want to perform a post-hoc test to test the significance of the difference between each pairwise combination of group means in your data set.

IMPORTANT DATA-FORMATTING NOTE: Rcmdr will only make the one-way ANOVA analysis available if Rcmdr can identify a potential grouping variable in your data set. In order for the “One-way ANOVA…” option to be available in the Rcmdr menu, your data must have at least one variable (column) with at least two different values of a character variable (e.g., “a”, “b”, and “c”) that can possibly serve as the grouping variable. The grouping variable cannot be numeric (e.g., “1”, “2”, and “3”). If the active data set does not satisfy these conditions, the “One-way ANOVA…” option will be grayed out (not selectable) in the Rcmdr menu. Note that the “One-way ANOVA…” option will be available if your data have only two levels of the grouping variable, but it is more appropriate to perform a t-test in that situation.

For this example, we would set up our data as shown in the following table.

Click on Statistics → Means → One-way ANOVA…

A “One-Way Analysis of Variance” window will pop up. You may elect to change the model name beside “Enter name for model:” or you may keep the default name of “AnovaModel.1.”

- Under “Groups (pick one)”: Select the grouping variable that identifies the different levels of the independent variable. Only character variables that satisfy the conditions described above in the IMPORTANT DATA-FORMATTING NOTE appear in this list.
- Under “Response Variable (pick one)”: Select the variable for which you wish to compare sample means. This will be the dependent variable. Only numeric variables appear in this list.
- “Pairwise comparison of means”: Select this box to output the contrasts, that is, comparisons of each possible pair among your three or more groups (in this case, vehicle types).

Click “OK.” The output will appear in the RStudio console window. (See following pages.)

ANOVA output

Rcmdr> AnovaModel.1 <- aov(maxspeed ~ vehicle, data=Dataset2)

Rcmdr> summary(AnovaModel.1)
            Df Sum Sq Mean Sq F value  Pr(>F)
vehicle      2   1383   691.7    6.58 0.00454 **
Residuals   28   2943   105.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rcmdr> with(Dataset2, numSummary(maxspeed, groups=vehicle, statistics=c("mean",
Rcmdr+ "sd")))
          mean        sd data:n
sedan 58.70000  9.592242     10
SUV   75.20000  9.330952     10
van   65.18182 11.539655     11

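Interpretation: The maxspeed means of the vehicle types are not all equal; at least one vehicle type's mean differs significantly from another, as p is less than an assumed alpha = .05. In APA format, we would write the results with both the between-groups and the residual degrees of freedom: F(2, 28) = 6.58, p = .005.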
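The F value and p-value in the ANOVA table follow directly from the mean squares and degrees of freedom printed there. A minimal sketch in base R, using the rounded values from the output above (the variable names are our own):

```r
# Values copied from the one-way ANOVA table above
ms_between <- 691.7   # Mean Sq for vehicle
ms_within  <- 105.1   # Mean Sq for Residuals
df_between <- 2       # number of groups minus 1
df_within  <- 28      # total sample size minus number of groups

# F is the ratio of between-group to within-group variability
f_value <- ms_between / ms_within

# The p-value is the upper-tail area of the F distribution beyond f_value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(round(p_value, 5))
```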

Pairwise comparison of means

Rcmdr> local({
Rcmdr+ .Pairs <- glht(AnovaModel.1, linfct = mcp(vehicle = "Tukey"))
Rcmdr+ print(summary(.Pairs)) # pairwise tests
Rcmdr+ print(confint(.Pairs)) # confidence intervals
Rcmdr+ print(cld(.Pairs)) # compact letter display
Rcmdr+ old.oma <- par(oma=c(0,5,0,0))
Rcmdr+ plot(confint(.Pairs))
Rcmdr+ par(old.oma)
Rcmdr+ })

         Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = maxspeed ~ vehicle, data = Dataset2)

Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)
SUV - sedan == 0   16.500      4.585   3.599  0.00346 **
van - sedan == 0    6.482      4.480   1.447  0.33134
van - SUV == 0    -10.018      4.480  -2.236  0.08240 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

         Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = maxspeed ~ vehicle, data = Dataset2)

Quantile = 2.4749
95% family-wise confidence level

Linear Hypotheses:
                 Estimate      lwr      upr
SUV - sedan == 0  16.5000   5.1521  27.8479
van - sedan == 0   6.4818  -4.6052  17.5688
van - SUV == 0   -10.0182 -21.1052   1.0688

sedan   SUV   van
  "a"   "b"  "ab"

Interpretation: Because the one-way ANOVA was significant, we may examine the linear contrasts (pairwise comparisons) to determine which vehicle types differed. In the first section labeled “Multiple Comparisons of Means: Tukey Contrasts”, the results of comparisons between each pair of levels of the independent variable are shown. The left-hand column identifies the levels being compared and the right-hand column shows the p-value for that comparison. Each p-value is followed by a code that allows you to quickly identify whether the difference tested on that line is statistically significant. The possible codes are listed on the line labeled “Signif. codes”. For example, if ** appears after a p-value, the difference between means tested on that line is significant at the .01 level of significance. In the example output shown above, there was a significant difference (p < .01) between maxspeed means of SUVs and sedans and a marginally significant difference (p = .08) between maxspeed means of vans and SUVs. The maxspeed means of vans and sedans did not significantly differ (p = .33). This is also explained at the very bottom of the output, in which sedan is given a symbol of “a”, SUV is given a symbol of “b”, and van is given a symbol of “ab”. Because ab (van) has symbols that overlap a (sedan) and b (SUV), this indicates that the maxspeed means of van are not significantly different from those of sedan or SUV. Because a (sedan) and b (SUV) do not have symbols that overlap (i.e., “a” and “b” are different symbols entirely), this indicates that the maxspeed means of sedan are different from those of SUV.
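To see roughly where the adjusted p-values come from, the t values in the output can be converted to studentized range statistics and looked up with base R's ptukey(). This is a sketch of the Tukey-Kramer approximation, not the single-step method that glht() uses, so the results should be close to the printed p-values but need not match them exactly. The variable names are our own.

```r
# t values copied from the "Linear Hypotheses" table above
t_values <- c("SUV - sedan" = 3.599, "van - sedan" = 1.447, "van - SUV" = -2.236)
n_groups <- 3    # sedan, SUV, van
df_resid <- 28   # residual degrees of freedom from the ANOVA table

# A studentized range statistic q equals |t| * sqrt(2)
q_values <- abs(t_values) * sqrt(2)

# Upper-tail probability of the studentized range distribution
p_adj <- ptukey(q_values, nmeans = n_groups, df = df_resid, lower.tail = FALSE)
names(p_adj) <- names(t_values)

# Which comparisons are significant at alpha = .05?
signif_at_05 <- p_adj < 0.05

print(round(p_adj, 4))
print(signif_at_05)
```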

“95% family-wise confidence level” graph

A depiction of the 95% confidence intervals for the pairwise differences between sample means will be automatically displayed in a separate window. If an interval does not contain 0, you can conclude with 95% confidence that the two means are not equal (i.e., the difference between means is not equal to 0). In the example figure shown below, the confidence interval for the SUV – sedan comparison does not include 0, which is consistent with the difference between the SUV and sedan means being statistically significant, as shown by the post hoc test.

If you wish to save this image, right click, and select “Copy as metafile.” The metafile will have higher resolution than the bitmap option, so it may be more appropriate for use as a figure in a paper.
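Each interval in the graph is the estimated difference plus or minus the printed quantile times the standard error of that difference. A minimal sketch using the rounded values from the pairwise comparison output (variable names are our own; the last digits may differ slightly from the output because of rounding):

```r
# Values copied from the pairwise comparison output
estimate  <- c("SUV - sedan" = 16.5000, "van - sedan" = 6.4818, "van - SUV" = -10.0182)
std_error <- c(4.585, 4.480, 4.480)
quantile_95 <- 2.4749   # "Quantile" line of the confidence interval output

# Simultaneous 95% confidence limits for each pairwise difference
lwr <- estimate - quantile_95 * std_error
upr <- estimate + quantile_95 * std_error

# An interval that excludes 0 corresponds to a significant difference
excludes_zero <- lwr > 0 | upr < 0

print(round(cbind(lwr, upr), 3))
print(excludes_zero)
```

Only the SUV – sedan interval excludes 0, which agrees with the compact letter display.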


Two-Way Analysis of Variance (ANOVA)

Testing main effects and interactions with multi-way ANOVAs

For this example, we would set up our data as shown in the following table.

Click on Statistics → Means → Multi-way ANOVA…

A “Multi-Way Analysis of Variance” window will pop up. You may elect to change the model name in the “Enter name for model:” text box or you may keep the default name of “AnovaModel.1.”

- Under “Factors (pick one or more)”: Select the factors whose main effects and interaction you would like to test. These are the independent variables. Only character variables appear in this list. Hold down the “Ctrl” key while clicking to select more than one variable. Hold down the “Shift” key to select several variables that are listed directly next to each other.
- Under “Response Variable (pick one)”: Select the variable for which you want to compare the sample means. This will be the dependent variable. Only numeric variables appear in this list.

Click “OK.” The output will appear in your RStudio console window.

ANOVA output

Rcmdr> AnovaModel.1 <- (lm(maxspeed ~ driver*vehicle, data=Dataset3))

Rcmdr> Anova(AnovaModel.1)
Anova Table (Type II tests)

Response: maxspeed
                Sum Sq Df F value   Pr(>F)
driver            0.02  1  0.0002 0.988984
vehicle        1383.46  2  6.2116 0.006456 **
driver:vehicle  159.31  2  0.7153 0.498777
Residuals      2784.00 25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The main effect of driver is not significant; in other words, the maxspeed means do not differ significantly with the experience of the driver (new or old), as p is greater than alpha = .05, F(1, 25) < 0.001, p = .99.

The main effect of vehicle is significant; in other words, the maxspeed means do differ with the type of vehicle (sedan, SUV, or van), as p is less than alpha = .05, F(2, 25) = 6.21, p = .006.

The interaction of driver and vehicle is not significant; in other words, the effect of vehicle type on maxspeed does not depend significantly on the experience of the driver (driver:vehicle), as p is greater than alpha = .05, F(2, 25) = 0.72, p = .50.
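Each F value in the two-way table is the effect's mean square (Sum Sq divided by Df) divided by the residual mean square, which is why both degrees of freedom are reported in APA format. A minimal sketch in base R using the rounded values from the output (variable names are our own):

```r
# Values copied from the Anova table above
ss_effect <- c(driver = 0.02, vehicle = 1383.46, "driver:vehicle" = 159.31)
df_effect <- c(driver = 1, vehicle = 2, "driver:vehicle" = 2)
ss_resid  <- 2784.00
df_resid  <- 25

# Mean squares and F ratios
ms_resid <- ss_resid / df_resid
f_values <- (ss_effect / df_effect) / ms_resid

# Upper-tail p-values from the F distribution
p_values <- pf(f_values, df_effect, df_resid, lower.tail = FALSE)
names(p_values) <- names(ss_effect)

print(round(f_values, 4))
print(round(p_values, 4))
```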

Descriptive statistics that are automatically outputted

Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle), mean, na.rm=TRUE)))
Rcmdr+ # means
    sedan  SUV  van
new  57.0 73.6 68.0
old  60.4 76.8 61.8

Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle), sd, na.rm=TRUE)))
Rcmdr+ # std. deviations
        sedan       SUV      van
new  8.514693  7.092249 12.58571
old 11.282730 11.798305 10.42593

Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle), function(x)
Rcmdr+ sum(!is.na(x))))) # counts
    sedan SUV van
new     5   5   6
old     5   5   5

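Interpretation: The comment at the end of each command identifies the table that follows it. The table labeled # means shows the mean of each cell, the table labeled # std. deviations shows the standard deviation of each cell, and the table labeled # counts shows the sample size of each cell. Each table has one row for each level of driver and one column for each level of vehicle.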
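The cell means and counts above can be combined into the overall mean for each vehicle type or each driver group by weighting each cell mean by its sample size. A minimal sketch in base R (the object names are our own):

```r
# Cell means and counts copied from the output above
cell_means <- matrix(c(57.0, 73.6, 68.0,
                       60.4, 76.8, 61.8),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(driver = c("new", "old"),
                                     vehicle = c("sedan", "SUV", "van")))
cell_counts <- matrix(c(5, 5, 6,
                        5, 5, 5),
                      nrow = 2, byrow = TRUE,
                      dimnames = dimnames(cell_means))

# Weighted means: sum of (cell mean * cell n) divided by total n
vehicle_means <- colSums(cell_means * cell_counts) / colSums(cell_counts)
driver_means  <- rowSums(cell_means * cell_counts) / rowSums(cell_counts)

print(round(vehicle_means, 5))
print(round(driver_means, 4))
```

The vehicle means computed this way agree with the group means reported in the one-way ANOVA example (58.70, 75.20, and 65.18), and the two driver means are nearly identical, which is consistent with the non-significant main effect of driver.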

Graphing interactions with multi-way ANOVAs

Click on Graphs → Plot of means…

A “Plot Means” window will pop up.

- Under “Factors (pick one or more)”: Select two or more factors to plot. These are the independent variables. Only character variables appear in this list. Hold down the “Ctrl” key while clicking to select more than one variable. Hold down the “Shift” key to select several variables that are listed directly next to each other.
- Under “Response Variable (pick one)”: Select the variable for which you want the means to be calculated. This will be the dependent variable. Only numeric variables appear in this list.

Click on the “Options” tab (next to the “Data” tab).

- Under “Error Bars”: The default option is “Standard errors.” To simplify the graph, we changed the selection to “No error bars”.
- Under “Plot Labels”: The default x-axis label is the name of the factor that comes first alphabetically. The default y-axis label is “mean of [the name of your response variable].” The default graph title is “Plot of Means.” To change any of these default labels, click on the white boxes to type in your new labels.

Click “OK.” A graph will automatically be outputted in a separate window. If you wish to save this image, right click, and select “Copy as metafile.” The metafile will have higher resolution than the bitmap option.

Linear regression

Perform this analysis when you want to find the equation of the best-fitting straight line to a scatterplot of data involving a predictor (X) variable and a criterion (Y) variable. Finding the best-fitting line amounts to finding a and b (the regression coefficients) in the equation Y = a + bX, where a is the Y-intercept and b is the slope of the best-fitting line. In addition, the standard error of estimate (sY.X) is a measure of the spread of the points in the scatterplot about the regression line (the typical error of predictions made with the regression equation).

For this example, we would set up our data as shown in the following table.

Click on Statistics → Fit models → Linear regression…

A “Linear Regression” window will pop up. You may elect to change the model name in the “Enter name for model:” text box or you may keep the default name of “RegModel.1.” This window is set up differently from the others because the response variable (criterion) is on the left-hand side, not the right.

- Under “Response variable (pick one)”: Select one variable that you want to serve as the response variable (Y; criterion, or predicted variable). Only numeric variables appear in this list.
- Under “Explanatory Variable (pick one or more)”: Select the variable(s) that you want to serve as the predictor variable (X). Only numeric variables appear in this list.

Click “OK.” The output will appear in your RStudio console window as shown on the next page.

Rcmdr> RegModel.1 <- lm(traveltime~maxspeed, data=Dataset3)

Rcmdr> summary(RegModel.1)

Call:
lm(formula = traveltime ~ maxspeed, data = Dataset3)

Residuals:
   Min     1Q Median     3Q    Max
 -5457  -4356  -3140   3045  21865

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4866.1     6938.9   0.701    0.489
maxspeed        12.1      103.0   0.117    0.907

Residual standard error: 6775 on 29 degrees of freedom
Multiple R-squared: 0.0004755, Adjusted R-squared: -0.03399
F-statistic: 0.0138 on 1 and 29 DF, p-value: 0.9073

Interpretation: The regression coefficients are shown in the Estimate column of the Coefficients: section of the output. The Y-intercept (a) is shown in the (Intercept) row and the slope (b) is shown in the maxspeed row. In this example, a = 4866.1 and b = 12.1, so the equation of the best-fitting line is Y = 4866.1 + 12.1 X. The standard error of estimate (sY.X) is shown in the Residual standard error: row. In this example, sY.X = 6775.

There are several ways to test the significance of the results of the linear regression. The question is whether the predictor variable predicts the criterion variable. One way to answer this is to estimate the proportion of variance in one variable associated with variance in the other variable (which you may recognize as the correlation coefficient squared). This is expressed in the output as Multiple R-squared; in this case, 0.0004755. The p-value shown on the following line of the output tests whether the Multiple R-squared value is significantly different from zero. In this case, p = 0.9073, which is greater than alpha = .05, so there is not a significant relationship between maxspeed and traveltime.

Another way to ask whether maxspeed significantly predicts traveltime is to test whether the slope (b) is significantly different from zero. The p-values for the tests of both regression coefficients are shown in the Pr(>|t|) column. For the slope, p = 0.907, which indicates that the slope is not significantly different from zero. For the Y-intercept, p = 0.489, which indicates that the Y-intercept is not significantly different from zero; this test concerns where the line crosses the Y-axis rather than whether maxspeed predicts traveltime. (The p-value testing the significance of the slope is equal to the p-value testing the significance of the Multiple R-squared because they are identical statistical tests for linear regression involving only one predictor variable.)
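The relationships described above can be checked with a few lines of base R. This is a minimal sketch using the rounded values from the regression output; the maxspeed value of 60 used for the prediction is a hypothetical input chosen only for illustration, and the variable names are our own.

```r
# Values copied from the regression output above
a        <- 4866.1   # Y-intercept
b        <- 12.1     # slope for maxspeed
t_slope  <- 0.117    # t value for the slope
f_stat   <- 0.0138   # F-statistic
df_resid <- 29       # residual degrees of freedom

# Predicted traveltime from the regression equation Y = a + bX
predict_traveltime <- function(maxspeed) a + b * maxspeed
pred_60 <- predict_traveltime(60)   # 60 is a hypothetical maxspeed

# Two-sided p-value for the slope
p_slope <- 2 * pt(-abs(t_slope), df_resid)

# With one predictor, the squared t for the slope equals the F-statistic
f_from_t <- t_slope^2

# R-squared can be recovered from F and its degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)

print(pred_60)
print(round(p_slope, 3))
print(round(c(f_from_t, r_squared), 6))
```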

Chi-square

Perform a chi-square test of independence when you want to test whether there are significant differences between the proportions of observations that fall into different categories of a nominal variable for different groups. For example, suppose we want to determine whether freshman and sophomore students differ in their choices of lunch options. Suppose that in this simple example, we are only examining freshman and sophomore students and the only lunch options are pizza and salad. For each student who enters the cafeteria, we record whether the student is a freshman or sophomore and whether the student chooses pizza or salad. Note that both variables are measured on nominal scales of measurement. Therefore, the question is whether different proportions of freshman and sophomore students choose pizza and salad.

There are two ways to test this question in Rcmdr. We can either (1) enter the raw data for individual observations or (2) enter the frequency counts (i.e., the total numbers of freshmen and sophomores who choose each type of lunch food). We describe both methods below.

Chi-square using raw data

For this example, we would set up our data as shown in the following table. Each line represents a different subject that we observed. The values in each line show the food option chosen by that subject (pizza or salad) and the class to which the subject belongs (freshman or sophomore).

IMPORTANT DATA-FORMATTING NOTE: Rcmdr will only make the chi-square analysis available if Rcmdr can identify at least two character variables in your data set. In order for the “Two-way table…” option to be available in the Rcmdr menu, your data must have at least two variables (columns) that contain character data (rather than numeric data). If the active data set does not contain at least two character variables, the “Two-way table…” option will be grayed out (not selectable) in the Rcmdr menu.
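If your categories were recorded as numeric codes, one way to satisfy this requirement is to convert the codes to labeled factors before running the analysis. The following is a minimal sketch in base R; the data frame survey and its coding (1 = freshman or pizza, 2 = sophomore or salad) are hypothetical and only illustrate the idea.

```r
# Hypothetical data set in which both variables were entered as numeric codes
survey <- data.frame(class = c(1, 1, 2, 2, 1),
                     lunch = c(1, 2, 2, 2, 1))

# Replace the numeric codes with labeled categories
survey$class <- factor(survey$class, levels = c(1, 2),
                       labels = c("freshman", "sophomore"))
survey$lunch <- factor(survey$lunch, levels = c(1, 2),
                       labels = c("pizza", "salad"))

# Both columns are now categorical rather than numeric
str(survey)
```

The same conversion applies to the grouping variable of a one-way ANOVA when its levels were typed in as numbers.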

Click on Statistics → Contingency tables → Two-way table…

A “Two-Way Table” window will pop up. Select one variable under “Row variable (pick one)” and another variable under “Column variable (pick one).” Only character variables appear in these lists. (The statistical outcome will not change if you interchange your row and column variables, but the results will be formatted differently in the output.)

Click on the “Statistics” tab (next to the “Data” tab) to select which percentages and hypothesis tests you would like to appear.

- Under “Compute Percentages”: To simplify the outcome, “No percentages” is the default. However, if you would like percentages in addition to frequencies to be outputted, then you may select one of the other options.
- Under “Hypothesis Tests”: “Chi-square test of independence” is the default option. To simplify the output, you may leave “Components of chi-square statistic” and “Print expected frequencies” unchecked. If your data set is small enough that a cell may have an expected frequency of less than five, then select “Fisher’s exact test.”

Click “OK.” The output will appear in the RStudio console window.

Rcmdr> local({
Rcmdr+ .Table <- xtabs(~class+lunch, data=Dataset4)
Rcmdr+ cat("\nFrequency table:\n")
Rcmdr+ print(.Table)
Rcmdr+ .Test <- chisq.test(.Table, correct=FALSE)
Rcmdr+ print(.Test)
Rcmdr+ })

Frequency table:
           lunch
class       pizza salad
  freshman     12    16
  sophomore     8    24

        Pearson's Chi-squared test

data: .Table
X-squared = 2.1429, df = 1, p-value = 0.1432

Interpretation: Lunch preferences (pizza or salad) do not differ significantly by class (freshman or sophomore), as p is greater than an assumed value of alpha = .05. In APA format, we would write the results: χ²(1) = 2.14, p = .14.
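To make the link between raw data and the frequency table concrete, the sketch below rebuilds a raw data set with one row per student from the counts in the output and then runs the same two commands that Rcmdr generated. The data frame name raw is our own, and the order of the rows is arbitrary.

```r
# One row per student, reconstructed from the frequency table above:
# 28 freshmen (12 pizza, 16 salad) and 32 sophomores (8 pizza, 24 salad)
raw <- data.frame(
  class = rep(c("freshman", "sophomore"), times = c(28, 32)),
  lunch = c(rep(c("pizza", "salad"), times = c(12, 16)),
            rep(c("pizza", "salad"), times = c(8, 24)))
)

# Cross-tabulate the raw observations into a frequency table
freq_table <- xtabs(~ class + lunch, data = raw)

# Chi-square test of independence without continuity correction
result <- chisq.test(freq_table, correct = FALSE)

print(freq_table)
print(result)
```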

Chi-square using frequency counts

If you already have the frequency counts of each cell of your frequency table (as in the Frequency table: output above), then you can enter those counts directly rather than using the raw data method to enter data for each individual subject.

Click on Statistics → Contingency tables → Enter and analyze two-way table…

An “Enter Two-Way Table” window will pop up.

- Next to “Number of Rows:”: Adjust the number of rows by sliding the horizontal bar. The default is 2 rows.
- Next to “Number of Columns:”: Adjust the number of columns by sliding the horizontal bar. The default is 2 columns.
- Under “Enter counts”: Change the row and column labels (“1” and “2”) to the levels of your variables (in this example, “pizza” and “salad” for the columns and “freshman” and “sophomore” for the rows). Enter the counts in the remaining boxes of the table.

Click “OK.” The output will appear in your RStudio console window.

Rcmdr> .Table <- matrix(c(12,16,8,24), 2, 2, byrow=TRUE)

Rcmdr> rownames(.Table) <- c('freshman', 'sophomore')

Rcmdr> colnames(.Table) <- c('pizza', 'salad')

Rcmdr> .Table # Counts
          pizza salad
freshman     12    16
sophomore     8    24

Rcmdr> .Test <- chisq.test(.Table, correct=FALSE)

Rcmdr> .Test

        Pearson's Chi-squared test

data: .Table
X-squared = 2.1429, df = 1, p-value = 0.1432

Interpretation: The output is identical to that obtained using the raw data method. Lunch preferences (pizza or salad) do not differ significantly by class (freshman or sophomore), as p is greater than an assumed value of alpha = .05. In APA format, we would write the results: χ²(1) = 2.14, p = .14.
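The chi-square statistic compares each observed count with the count expected if class and lunch choice were independent (row total times column total, divided by the grand total). The sketch below computes the expected frequencies and the statistic by hand from the counts above, checks whether any expected frequency is small enough to suggest Fisher's exact test, and runs that test with base R's fisher.test() for comparison. The object names are our own.

```r
# Observed counts copied from the output above
counts <- matrix(c(12, 16, 8, 24), nrow = 2, byrow = TRUE,
                 dimnames = list(class = c("freshman", "sophomore"),
                                 lunch = c("pizza", "salad")))

# Expected frequency for each cell under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((counts - expected)^2 / expected)

# Does any cell have an expected frequency below five?
any_small <- any(expected < 5)

# Fisher's exact test on the same table, for comparison
fisher <- fisher.test(counts)

print(round(expected, 2))
print(round(chi_sq, 4))
print(any_small)
print(fisher$p.value)
```

Here every expected frequency is well above five, so the chi-square test is a reasonable choice for these data.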


ADVANCED ANALYSES

Descriptive statistics for sub-groups

You can output descriptive statistics for sub-groups of your data. To do so, you can identify a grouping variable and Rcmdr will output descriptive statistics for different values of that variable. For example, suppose we have data on test scores and we want to see how students in different classes (e.g., freshman or sophomore) performed on a given test (e.g., test1). In the data below, you can see that there is a variable called class that indicates whether the student is in the freshman or the sophomore class.

Click on Statistics → Summaries → Numerical summaries…

The “Numerical Summaries” window will appear. In the “Numerical Summaries” window, click on the variable(s) for which you want descriptive statistics (in this example, test1).


Click on the “Statistics” tab and select which statistics you would like to calculate (see p. 15). After you have chosen at least one variable and you have chosen which descriptive statistics you want, click on the “Data” tab (if it is not already selected) and click “Summarize by groups…”

A “Groups” window will pop up. This list contains all of the character variables in your data set. You may select one, and only one, variable to serve as the basis of groups in your output (in this example, class). Rcmdr will compute descriptive statistics separately for each unique value of the “Groups variable” (in this example, freshman and sophomore).

Click “OK.” The output will appear in the RStudio console window.

Rcmdr> numSummary(Dataset[,"test1"], groups=Dataset$class, statistics=c("mean",
Rcmdr+ "sd", "IQR", "quantiles"), quantiles=c(0,.25,.5,.75,1))
              mean       sd IQR 0%  25% 50%  75% 100% data:n
freshman  19.85714 4.598136 6.0 12 17.5  20 23.5   25      7
sophomore 19.00000 3.891382 6.5 14 15.0  20 21.5   24      8

Interpretation: Each row of the output shows the descriptive statistics for one of the sub-groups. For example, the mean for those in the freshman class is equal to 19.86 while the mean for those in the sophomore class is equal to 19.00. The column labeled data:n shows the sample size in each group. In this example, there are 7 individuals in the freshman group and 8 individuals in the sophomore group.
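The same kind of grouped summary can be produced in base R with tapply(), which applies a function separately to each sub-group. The sketch below uses a small hypothetical data set (not the data from this example), so its numbers will not match the output above.

```r
# Hypothetical scores for six students in two classes
scores <- data.frame(
  class = c("freshman", "freshman", "freshman",
            "sophomore", "sophomore", "sophomore"),
  test1 = c(12, 20, 25, 14, 20, 24)
)

# Apply each summary function separately within each class
group_mean <- tapply(scores$test1, scores$class, mean)
group_sd   <- tapply(scores$test1, scores$class, sd)
group_n    <- tapply(scores$test1, scores$class, length)

# Combine into one table with a row per sub-group
summary_table <- cbind(mean = group_mean, sd = group_sd, n = group_n)
print(round(summary_table, 3))
```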

Correlation test: Testing the significance of correlations

When you compute a correlation matrix (see p. 18), you can compute correlations between many different variables at one time and you can obtain the p-values to test whether those correlations are significantly different from zero. In this analysis, the correlation test, you can compute the correlation between only two variables at one time. However, in the correlation test analysis, the Student’s t statistic pertaining to the p-value calculation as well as a 95% confidence interval for r are also outputted.

Click on Statistics → Summaries → Correlation test…

A “Correlation Test” window will pop up. Select two, and only two, variables that you would like to analyze. Only numeric variables appear in this list. Hold down the “Ctrl” key to select more than one variable. Hold down the “Shift” key to select several variables that are listed directly next to each other.

- Under “Type of Correlations”: Select the type of correlation you would like to compute. The default is the Pearson product-moment correlation coefficient (r).
- Under “Alternative Hypothesis”: Select “Two-sided” if you want the alternative hypothesis to assess whether r is different from 0 (a non-directional test). Select “Correlation < 0” if you want to test whether r is significantly less than 0. Select “Correlation > 0” if you want to test whether r is significantly greater than 0.

Click “OK.” The output will appear in the RStudio console window.

Output when Alternative Hypothesis = Two-sided

Rcmdr> with(Dataset, cor.test(test1, test2, alternative="two.sided",
Rcmdr+ method="pearson"))

        Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 1.554e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7623700 0.9723368
sample estimates:
    cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is 0.91665 (i.e., r = 0.92). The correlation coefficient is significantly different from 0, as p is less than an assumed alpha = .05 (p < .001).
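The t statistic and the confidence interval in this output can both be derived from r and the degrees of freedom. The sketch below uses the standard formula t = r·sqrt(df) / sqrt(1 – r²) and the Fisher z transformation for the interval; it is a minimal illustration in base R with variable names of our own, and the last digits may differ slightly from the output because r is rounded.

```r
# Values copied from the two-sided correlation test output above
r        <- 0.91665
df_r     <- 13
n_pairs  <- df_r + 2   # df = n - 2 for a correlation test

# t statistic and two-sided p-value for the test that the correlation is 0
t_value     <- r * sqrt(df_r) / sqrt(1 - r^2)
p_two_sided <- 2 * pt(-abs(t_value), df_r)

# 95% confidence interval via the Fisher z transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n_pairs - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

print(round(t_value, 3))
print(signif(p_two_sided, 4))
print(round(ci, 5))
```

Note that the interval is not symmetric around r, because the transformation keeps the limits between -1 and 1.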

Output when Alternative Hypothesis = Correlation < 0

Rcmdr> with(Dataset, cor.test(test1, test2, alternative="less",
Rcmdr+ method="pearson"))

        Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 1
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
 -1.0000000  0.9669085
sample estimates:
    cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is 0.91665 (i.e., r = 0.92). The correlation is not significantly less than 0 because p is greater than alpha = .05 (p approaches 1). That is, the data provide no support for our hypothesis that test2 scores decrease as test1 scores increase. The output also shows us that we can be 95% confident that the true correlation in the population is between -1 and 0.9669085.

Output when Alternative Hypothesis = Correlation > 0

Rcmdr> with(Dataset, cor.test(test1, test2, alternative="greater",
Rcmdr+ method="pearson"))

        Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 7.77e-07
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.7979031 1.0000000
sample estimates:
    cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is 0.91665 (i.e., r = 0.92). The correlation is significantly greater than 0 because p is less than alpha = .05 (p < .001). That is, the data support our hypothesis that test2 scores increase as test1 scores increase. The output also shows us that we can be 95% confident that the true correlation in the population is between 0.7979031 and 1.

Graphs

Scatterplot

A scatterplot typically is used to show the relationship between two variables that were measured from a group of subjects. The researcher often computes a correlation coefficient or performs a linear regression to further describe the relationship between the two variables.

Click on Graphs → Scatterplot…

A “Scatterplot” window will pop up. Select one x-variable and one y-variable. Only numeric variables appear in these lists.

Click on the “Options” tab (next to the “Data” tab).

- Under “Plot Options”: Deselect all of the checked boxes. The bottom four options are selected by default, but deselecting them will simplify your scatterplot.
- Under “Identify Points”: “Automatically” is the default. Change the default to “Do not identify” to simplify your scatterplot.
- Under “Plot Labels and Points”: These controls allow you to customize the axis labels and other elements that affect how your scatterplot is displayed.

Click “OK.” A scatterplot will be displayed in a separate window. If you wish to copy and paste the scatterplot (into a Word document, for example), right click, and select “Copy as metafile.” (The metafile will have higher resolution than the bitmap option.) You can then click on a Word document and press Ctrl+v to paste the image.

Scatterplot by groups

Use this option if you want to plot several scatterplots with different symbol types on a single set of axes. Follow the above directions to return to the “Scatterplot” window. Select the “Plot by groups…” button.

A “Groups” window will pop up. Select the variable in your data that you would like to use as the basis of the different groups (symbol types). Be sure that the “Plot lines by group” box is checked in order to see a separate icon for each level of the chosen variable, which is vehicle in this case. Click “OK” to return to the “Scatterplot” window and to make the necessary changes to the “Options” tab, as mentioned in the Scatterplot explanation in the section above.


After the default options in the “Options” tab have been changed, click “OK.” A scatterplot will be displayed in a separate window. If you wish to copy and paste the scatterplot (into a Word document, for example), right click, and select “Copy as metafile.” (The metafile will have higher resolution than the bitmap option.) You can then click on a Word document and press Ctrl+v to paste the image.

Line graph

Construct a line graph when you want to show how the value of one variable (the y-variable) changes as the value of another variable (the x-variable) increases. (The line graph function of Rcmdr can plot several lines on one graph to depict how several y-variables change as the x-variable increases.) Your data set must have one variable (column of values) to be plotted on the x-axis and at least one other variable (column of values) to be plotted on the y-axis.

Note the following two points about the way Rcmdr handles your data in plotting a line graph:

1. Rcmdr will plot the x-y pairs in the order that they appear in rows of your data set (from top to bottom), so you should be sure that the values of your x-variable appear in ascending order in your data set. Rcmdr will give you a warning before drawing the line graph if your x-values are not in order. If you choose to have Rcmdr draw the line graph anyway, the line that it plots will zigzag back and forth to follow the ordering of the x-variable.
2. Rcmdr will plot a point on the line graph for the x-y pair in each row of your data set, so if you have repeated x-values in your data set, Rcmdr will plot multiple points for that x-value on the line graph (and the line graph will zigzag back and forth or it will contain vertical segments).

Suppose we want to construct a line graph to depict how temperature changes as time passes. In the sample data set below, the x-variable (time) increases from the top of the data set to the bottom and there is only one entry for each time value.

Click on Graphs → Line graph…

A “Line Plot” window will pop up. Select one x-variable (independent variable) and one or more y-variables (dependent variables). Only numeric variables appear in these lists.


Click “OK.” A graph will automatically be outputted in a separate window. If you wish to copy and paste the graph (into a Word document, for example), right click, and select “Copy as metafile.” (The metafile will have higher resolution than the bitmap option.) You can then click on a Word document and press Ctrl+v to paste the image.
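The two data requirements for line graphs described earlier in this section (x-values in ascending order and no repeated x-values) can be checked, and the ordering fixed, before the graph is drawn. The following is a minimal sketch in base R; the data frame readings and its values are hypothetical.

```r
# Hypothetical temperature readings entered out of time order
readings <- data.frame(time        = c(3, 1, 2, 5, 4),
                       temperature = c(21.5, 19.0, 20.2, 23.1, 22.4))

# Are the x-values out of order, or repeated?
needs_sorting <- is.unsorted(readings$time)
has_repeats   <- anyDuplicated(readings$time) > 0

# Reorder the rows so that time increases from top to bottom
readings_sorted <- readings[order(readings$time), ]

print(needs_sorting)
print(has_repeats)
print(readings_sorted)
```

After sorting, a call such as plot(readings_sorted$time, readings_sorted$temperature, type = "b") would draw the points connected from left to right without zigzagging.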

Bar graph

A bar graph depicts the number of times each value of a nominal variable appears in your data set. Suppose that for the data set shown below, we want to graphically show the number of people who chose pizza for lunch and the number of people who chose salad for lunch. The variable for which you want to plot counts must have character values (e.g., “pizza” and “salad”) rather than numeric values (e.g., “1” and “2”).

(Note that Rcmdr can only show counts for different values of one of the nominal variables in the data set. For example, Rcmdr can show the numbers of students who chose pizza and salad. Rcmdr cannot, for example, show the numbers of freshmen who chose pizza and salad and the numbers of sophomores who chose pizza and salad.)

Click on Graphs → Bar graph…

A “Bar Graph” window will pop up. Select one variable under “Variable (pick one).” Rcmdr will count the number of times the different values appear under that variable.

Click “OK.” A graph will be displayed in a separate window. If you wish to copy and paste the graph (into a Word document, for example), right click, and select “Copy as metafile.” (The metafile will have higher resolution than the bitmap option.) You can then click on a Word document and press Ctrl+v to paste the image.
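The heights of the bars are simply the counts of each value, which base R's table() function returns. The sketch below assumes the lunch data from the chi-square example (12 and 16 freshmen, 8 and 24 sophomores, choosing pizza and salad respectively); the data frame name lunch_data is our own. It also shows the two-variable counts that the Rcmdr bar graph menu does not display.

```r
# One row per student, assuming the counts from the chi-square example
lunch_data <- data.frame(
  class = rep(c("freshman", "sophomore"), times = c(28, 32)),
  lunch = c(rep(c("pizza", "salad"), times = c(12, 16)),
            rep(c("pizza", "salad"), times = c(8, 24)))
)

# Counts for one nominal variable: these are the bar heights
lunch_counts <- table(lunch_data$lunch)

# Counts broken down by a second nominal variable
class_by_lunch <- table(lunch_data$class, lunch_data$lunch)

print(lunch_counts)
print(class_by_lunch)
```

In base R (outside the Rcmdr menus), barplot(lunch_counts) would draw the one-variable graph, and barplot(class_by_lunch, beside = TRUE) would draw side-by-side bars for freshmen and sophomores.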

Miscellaneous

Opening and Entering data

There are several ways in which you can open and enter data in Rcmdr. Instructions beginning on page 9 describe opening a .csv spreadsheet. You can also open an Excel spreadsheet (.xls or .xlsx file) or you can type your data directly into a spreadsheet in Rcmdr. Both methods are described below.

Opening an Excel spreadsheet

(Not all versions of Rcmdr are the same. Some versions may not show the option for opening an Excel file that is described below.)

Data → Import data → from Excel file…

An “Import Excel Data Set” window will pop up. The default data set name is “Dataset.” You may change this by clicking in the box. By default, Rcmdr will use the first row of the Excel spreadsheet as the variable names. Uncheck the first box if you do not want this to happen. If the first column of your spreadsheet contains names for the rows/observations (which is not typical), check the second box as well. By default, this is unchecked.

Click “OK” to create the new data set. You can click on the “Edit data set” or “View data set” buttons in the Rcmdr window to edit and view, respectively, the active data set. (See p. 11.)

Entering data directly into Rcmdr

You can create a data set and type your data directly into Rcmdr.

Data → New data set…

A “New Data Set” window will pop up. Press “OK.”

A Data Editor window will pop up. If the Data Editor does not pop up, then you most likely have an illegal character in the data set name. In your RStudio console, you might see an error message like one of those below. If so, correct the name, and then press “OK” again.

RcmdrMsg: [1] ERROR: "data set" is not a valid name.
RcmdrMsg: [2] ERROR: "Dataset!" is not a valid name.
RcmdrMsg: [3] ERROR: "data$set" is not a valid name.
RcmdrMsg: [4] ERROR: "data,set" is not a valid name.
RcmdrMsg: [5] ERROR: "12set" is not a valid name.
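A candidate name can be tested ahead of time with base R. The sketch below uses make.names(), which rewrites a string into a syntactically valid R name; it is an assumption that this rule agrees with the check Rcmdr performs, although it does reject all five names shown in the error messages.

```r
# The five rejected names from the error messages, plus one acceptable name
candidate_names <- c("data set", "Dataset!", "data$set", "data,set",
                     "12set", "Dataset")

# make.names() rewrites a string into a syntactically valid R name;
# a string that comes back unchanged was already valid
is_valid <- make.names(candidate_names) == candidate_names
print(data.frame(name = candidate_names, valid = is_valid))
```

Only the last name, Dataset, comes back unchanged. The others contain a space, a punctuation character, or a leading digit.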

You can edit values in the Data Editor window by clicking on a cell and typing and/or deleting as necessary. For example, click on the column heading in order to change the variable name. Variable names follow the same guidelines as data set names. Be careful when double clicking on any cell, as the value will then be changed to NA, and there is no undo option, so you will permanently delete whatever value was in that cell. To add more variables, select the “Add column” button.

To add more observations (i.e., subjects), select the “Add row” button.

Click “OK” when you are done to save your changes in the new data set and to exit the Data Editor. If you enter only numbers in a column, then Rcmdr will recognize the variable as “numeric.” If you enter characters (i.e., at least some letters) in a column, then Rcmdr will recognize the variable as “character.” The distinction between numeric and character variables is very important because Rcmdr will not allow you to run any statistical analyses (i.e., the variable will not appear or the statistical test will be grayed out) for which the appropriate variable type is not set.
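To see how R has classified each variable, you can inspect the column types in code. This is a small base-R sketch with a hypothetical data frame named survey; it illustrates the numeric versus character distinction and is not part of the Rcmdr menus.

```r
# Hypothetical data set with one numeric and one character column
survey <- data.frame(
  age   = c(19, 21, 20),
  lunch = c("pizza", "salad", "pizza"),
  stringsAsFactors = FALSE
)

# class() reports how R has stored each column
column_types <- sapply(survey, class)
print(column_types)

# A single non-numeric entry makes the whole set of values character
mixed <- c(19, 21, "twenty")
mixed_type <- class(mixed)
print(mixed_type)
```

The second part shows why one mistyped entry (letters in a column of numbers) is enough to make a variable unavailable for numeric analyses.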

Recoding variables

Imagine you have a numeric variable that indicates the maximum speed at which each subject drives a car during an experiment. Suppose you want to group the subjects into a few categories based on their maximum driving speed; for example, subjects who drove 50 mph or greater would be categorized as “fast” and those driving 49 mph or less would be “slow”. This requires you to recode the variable from a continuous, numerical variable to a categorical variable (or “factor”, in Rcmdr language). Follow these steps to recode the variable:

Data → Manage variables in active data set → Recode variables…

A “Recode Variables” window will pop up. Select the variable that you want to recode, in this case, maxspeed. The default setting is that each new variable will be a factor, meaning that the variable will no longer be numeric, but categorical. This is the most common reason to recode variables. If you want to recode your variables to numeric data, then uncheck the box. The default variable name is “variable,” but you may change that by clicking on the text box and typing.

Below, we have chosen to name the new variable “speed.” Type the directions to recode the variable in the large box at the bottom of the window.

50:hi = "fast": This code indicates that we want maxspeed values of 50 or more (50:hi, i.e., maxspeed = 50 to the highest value of maxspeed) to be coded as “fast” in the new variable called “speed.”

else = "slow": All other maxspeed values (i.e., values less than 50) will equal “slow.”

Alternatively, this code could have been used:

lo:49 = "slow": This code indicates that we want maxspeed values of 49 or less (lo:49, i.e., maxspeed = the lowest value of maxspeed to 49) to be coded as “slow” in the new variable called “speed.”

else = "fast": All other maxspeed values (i.e., values greater than 49) will equal “fast.”

Click “OK.” If you view your data set, you will see that a new variable speed has been created with two levels: fast and slow.

In general, when recoding a variable, indicate the range of values that you want to assign to each category by separating the lowest and highest values of the range with a colon (:). For example, we could have created three categories (fast, medium, and slow) in this way:

50:hi = "fast": This code indicates that we want maxspeed values of 50 or more to be coded as “fast” in the new variable called “speed.”

40:49 = "medium": maxspeed values from 40 to 49 will be assigned a value of “medium”.

else = "slow": All other maxspeed values (i.e., values less than 40) will equal “slow.”
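The same three-category grouping can be sketched in base R with cut(). This is an alternative illustration, not the code that the Recode Variables window generates, and it assumes that speeds are recorded as whole numbers; the data values are hypothetical.

```r
# Hypothetical maximum speeds (whole numbers of mph)
Dataset <- data.frame(maxspeed = c(35, 40, 49, 50, 72))

# cut() assigns each value to an interval; with the default right = TRUE
# the intervals are (-Inf, 39], (39, 49], and (49, Inf)
Dataset$speed <- cut(
  Dataset$maxspeed,
  breaks = c(-Inf, 39, 49, Inf),
  labels = c("slow", "medium", "fast")
)
print(Dataset)
```

The new speed column is a factor, just as in the point-and-click method. For speeds that are not whole numbers the two approaches may differ: a value such as 49.5 falls in the “fast” interval here, whereas it matches neither 40:49 nor 50:hi in the directives above.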

Combining items

This is useful when you want to add, subtract, divide, multiply, etc., different variables to form a new one.

Data → Manage variables in active data set → Compute new variable…

A “Compute New Variable” window will pop up.

In the “New variable name” box, type the name of the new variable, e.g., test_total.

We want test_total to be equal to the sum of test1 and test2. We can double-click test1 in the “Current variables (double-click to expression)” box to move it under “Expression to compute.”

Then we can type in the operation. In this case, because we want to add the two variables, we type in the + sign. Other common options:

- = subtract
* = multiply
/ = divide

Then complete the remaining expression (in this case, double-click test2).

Click “OK.” If you view your data set, you will see that a new variable test_total has been created as the sum of test1 and test2.
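The same computation can be written directly in R. The sketch below assumes a data set named Dataset with hypothetical test1 and test2 scores; the code that Rcmdr writes to the R Script box may be formatted differently.

```r
# Hypothetical test scores
Dataset <- data.frame(test1 = c(80, 75, 92), test2 = c(85, 70, 88))

# Add the two columns row by row and store the result as a new column
Dataset$test_total <- with(Dataset, test1 + test2)
print(Dataset)
```

Replacing + with -, *, or / gives the other operations listed above.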

Converting variables from numeric to factor items

If you have coded a categorical variable as numeric, then you will find that you will not be able to run certain analyses. Rcmdr only allows you to run analyses that it thinks are most appropriate for your data. The simplest way of converting variables from numeric to factor (categorical) is presented below. In this example, condition is coded as 1 or 2 for two different experimental conditions. Though we intend for this to be a categorical or factor variable, Rcmdr is reading these numbers as numeric measurements.

Data → Manage variables in active data set → Convert numeric variables to factors…

A “Convert Numeric Variables to Factors” window will pop up. Select the variable that you want to convert under “Variables (pick one or more),” in this case, condition. Under “Factor Levels,” select “Use numbers” in order to change the 1 and 2 to factor levels. The numbers will still appear in the condition column of your data, but they will be coded as factor instead of numeric data. Click “OK.”

Your data will look the same, but condition will now be a factor variable.
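In code, the conversion amounts to a single call to as.factor(). The following base-R sketch uses a hypothetical data set; it mirrors the “Use numbers” option, in which the factor levels keep the original numbers as their labels.

```r
# Hypothetical data: condition is coded 1 or 2 but is stored as numeric
Dataset <- data.frame(condition = c(1, 2, 1, 2), score = c(10, 14, 9, 15))

# Convert the numeric codes to a factor; the levels are labeled "1" and "2"
Dataset$condition <- as.factor(Dataset$condition)

# str() shows that condition is now a factor while score stays numeric
str(Dataset)
```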

Coding in R

If you are feeling adventurous, you can try entering your own code into the “R Script” box in the “R Commander” window or in the Console window of RStudio. This allows you to exert greater control (or “command”) over R, but it is not necessary for most basic analyses. Some examples are shown in the following sections.

Deleting data sets

You may want to delete data sets from your session in order to unclutter your list of available data sets. To do this, type rm(Dataset) into the “R Script” box, where Dataset is the name of the data set that you wish to delete. In this case, the name of the data set is Dataset, but you will most likely have given your unwanted data set a more creative name. Click the “Submit” button.

Your RStudio Console will say: Rcmdr> rm(Dataset). If you do not have other data sets and if you try to click on the word “Dataset” next to “Data set:” in the Rcmdr window, you will get these error messages in RStudio:

RcmdrMsg: [1] ERROR: the dataset Dataset is no longer available
RcmdrMsg: [2] ERROR: There are no data sets from which to choose.
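You can confirm that a data set has been removed by asking R whether an object with that name still exists. This is a minimal base-R sketch with a hypothetical data set.

```r
# Create a small hypothetical data set
Dataset <- data.frame(x = 1:3)

# exists() reports whether an object with this name is available
before <- exists("Dataset")

# rm() deletes the object from the session
rm(Dataset)
after <- exists("Dataset")

print(c(before = before, after = after))
```

Typing ls() lists every object that remains in the session.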

Labeling points in a scatterplot

Imagine we have a dataset named demodata that contains students’ names (student), the number of lectures each student attended (lectures), and each student’s score on exam 1 (exam1). Suppose we would like to create a scatterplot depicting number of lectures on the x-axis and exam 1 score on the y-axis, with each point in the scatterplot labeled with the corresponding student’s name.

First, create a scatterplot with the point-and-click instructions shown previously in this guide (p. 56). After you do so, code similar to this should be created by Rcmdr in the R Script box:

    scatterplot(exam1~lectures, reg.line=FALSE, smooth=FALSE, spread=FALSE, boxplots=FALSE, span=0.5, data=demodata)

This code lists y then x, separated by ~: y~x (exam1~lectures).

Underneath that code, type in:

    text(demodata$lectures, demodata$exam1, demodata$student)

In this code, the first variable in the list (lectures) gives the x-axis position of each label, the second variable (exam1) gives the y-axis position, and the third variable (student) gives the text that is printed at each point. As illustrated here, the order in which you list the variables determines whether they are used for the x-axis position, the y-axis position, or the labels for the points.

Highlight all of the code starting from scatterplot and ending at demodata$student) then click the “Submit” button below the Script box.

The graph will pop up in a separate window (but you may have to move your current window in order to see the graph).
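The labeling step can be tried without Rcmdr by using base R graphics. In the sketch below, plot() stands in for the scatterplot() call that Rcmdr generates, and the student names and scores are made up for illustration.

```r
# Hypothetical data set with the three variables described above
demodata <- data.frame(
  student  = c("Ana", "Ben", "Cy", "Dee"),
  lectures = c(10, 14, 8, 12),
  exam1    = c(72, 88, 65, 80),
  stringsAsFactors = FALSE
)

# plot() draws the points: x-values first, then y-values
plot(demodata$lectures, demodata$exam1, xlab = "lectures", ylab = "exam1")

# text() writes each name at its (x, y) position;
# pos = 1 places the label just below the point instead of on top of it
text(demodata$lectures, demodata$exam1, labels = demodata$student, pos = 1)
```

The pos argument is optional. Without it, each name is centered on its point, as in the guide's example.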

Repeated-measures ANOVA

Perform a repeated-measures ANOVA when (1) you want to compare the means of three or more samples of scores and (2) each subject contributes a score to each sample.

For example, imagine that you perform an experiment in which you measure whether people are more relaxed (1) playing with a puppy, (2) playing with a kitten, or (3) sitting alone. You recruit 10 subjects. Each subject plays with a puppy for 15 minutes, then completes a questionnaire that measures how relaxed the subject feels (where higher scores mean greater relaxation). Then each subject plays with a kitten for 15 minutes and completes the questionnaire again. Finally, each subject sits alone for 15 minutes, then completes the questionnaire a third time. Thus each subject contributes a score to the puppy group, the kitten group, and the alone group. Therefore, there are three groups of scores, each containing 10 scores (one from each subject). The independent variable in this experiment is the “puppy/kitten/alone” condition and the dependent variable is the score on the relaxation questionnaire.

The data file would appear as shown below. The first column, subject, shows a unique subject identifier for each subject. (Note that each identifier appears three times because each subject participated in each of the three conditions.) The second column, condition, shows whether that row of the table contains the subject’s relaxation score for the puppy, kitten, or alone condition (the IV). The third column, relax, shows the relaxation score (the DV) for that subject and condition.

You will have to install a special package in RStudio that will let you run a repeated-measures ANOVA. You will do this in a way that is similar to the way in which you first installed the Rcmdr package, except this package is called ez instead of Rcmdr. The package is called ez because it should make your life easier. (Get it? ez = “easy”! Could they be any more clever?) Type the following into your RStudio (not Rcmdr) Console window and press Enter:

    install.packages("ez")

Once the package has been installed, return to your Rcmdr window in order to load the package.

Tools → Load package(s)…

A “Load Packages” window will pop up. Scroll down until you see the ez package. Select ez then click OK.

Return to your RStudio console window. Type in the following line of code and press Enter:

    options(contrasts=c("contr.sum", "contr.poly"))

Then type in the following line of code (one continuous line of code) and press Enter:

    ezANOVA(data=RM_Anova, dv=.(relax), wid=.(subject), within=.(condition), detailed=TRUE)

The code shown above applies to this specific example. For your data, you should replace the following items:

Replace RM_Anova with your dataset name.
Replace relax with your dependent variable.
Replace subject with the variable that identifies your subjects.
Replace condition with your independent variable.

After you press Enter, the output will appear in your RStudio Console window. You might first see a warning that says:

    Warning: Converting "subject" to factor for ANOVA.

This is okay. This means that R is converting the subject variable from a numeric variable (because we used numbers to identify individual subjects) into a factor or categorical variable in order to meet the criteria for an ANOVA.

Underneath the warning (if it appears), you will see your repeated-measures ANOVA results.

$ANOVA
       Effect DFn DFd       SSn  SSd          F            p p<.05       ges
1 (Intercept)   1   9 6453.3333 48.0 1210.00000 6.624759e-11     * 0.9804319
2   condition   2  18  345.8667 80.8   38.52475 3.132592e-07     * 0.7286517

Interpretation: The output appears in a table with the following column headings:

Effect = Effect that is tested in each row of the table; we are interested in the row of the table labeled with the name of the IV (in this case, condition).
DFn = Numerator (or between-groups) degrees of freedom
DFd = Denominator (or within-groups) degrees of freedom
SSn = Numerator (or between-groups) sum of squares
SSd = Denominator (or within-groups) sum of squares
F = F-ratio
p = p-value for the specific effect that we are looking at, which is condition in this case
p<.05 = An asterisk (*) appears in this column if the effect shown in the row is significant at the .05 level of significance.
ges = Generalized eta-squared measure of effect size.

According to these results, there was a significant difference between relaxation scores for the participants in this sample depending on whether they interacted with a puppy or kitten or sat alone, and the effect size was large, F(2,18) = 38.52, p < .001, generalized η² = .73.
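The F-ratio, p-value, and ges value in the condition row can be reproduced from the other columns of the table. The sketch below copies the numbers from the output above. The formula used for ges is an assumption that applies to a design like this one with a single within-subjects variable; it is offered because it reproduces the printed value, not as a statement of how the ez package computes it internally.

```r
# Values copied from the condition row of the output table
SSn <- 345.8667; DFn <- 2
SSd <- 80.8;     DFd <- 18

# SSd from the (Intercept) row: variability between subjects
SS_subjects <- 48.0

# F is the ratio of the two mean squares (sum of squares / degrees of freedom)
F_ratio <- (SSn / DFn) / (SSd / DFd)

# Probability of an F this large or larger if there were no effect
p_value <- pf(F_ratio, DFn, DFd, lower.tail = FALSE)

# Assumed formula: effect SS divided by effect SS plus all error SS
ges <- SSn / (SSn + SSd + SS_subjects)

print(c(F = F_ratio, p = p_value, ges = ges))
```

The results agree with the F, p, and ges columns of the output to within rounding.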

Post-hoc test for repeated-measures ANOVA

Post-hoc tests for repeated-measures ANOVA require that you install and load another package. The package is called agricolae. Type the following into your RStudio Console window and press Enter:

    install.packages("agricolae")

Once the package has been installed, return to your Rcmdr window in order to load the package.

Tools → Load package(s)…

A “Load Packages” window will pop up. Scroll down until you see the agricolae package. Select agricolae then click OK.

Return to your RStudio window. You need to run the repeated-measures ANOVA in a way that saves the results as an object. This just involves typing an extra word and an arrow in front of the repeated-measures ANOVA code that you typed before. Additionally, you would change the last bit of code to return_aov instead of detailed. So the code would look like this (the additions are resultsname <- and return_aov=TRUE):

    options(contrasts=c("contr.sum", "contr.poly"))
    resultsname <- ezANOVA(data=RM_Anova, dv=.(relax), wid=.(subject), within=.(condition), return_aov=TRUE)

resultsname may be replaced with any name you choose for the results of the analysis. This name becomes the “object name” for the results within RStudio. When you provide an object name, it gives you a way of referring to the results later so you can tell R to do additional things with the results. Type the code shown above into the RStudio console window and press Enter.

Next, save the portion of the results containing the ANOVA summary. We called the entire set of results resultsname. The part of resultsname that contains the ANOVA summary is indicated by $aov. To save the ANOVA summary portion of the results as a new object called resultsname2, type the following in RStudio and press Enter:

    resultsname2 <- summary(resultsname$aov)

In the line of code above, resultsname is the name you gave to the ANOVA results in the previous step. resultsname2 can be any name you choose.

Next, type in the following code so that your degrees of freedom within (dfW) and Mean Squares Within (MSW) can be recalled later. Hit Enter after typing each line. The numbers just refer to the specific row and columns that you want to pull from. Insert the name you chose above for resultsname2 but do not change the numbers:

    MSW <- resultsname2[[2]][[1]][2, 3]
    dfW <- resultsname2[[2]][[1]][2, 1]
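To see what the numbers in the square brackets refer to, it can help to build a similar summary object with base R's aov(). The sketch below is an illustration under assumptions: the scores are randomly generated rather than real data, and aov() with an Error() term stands in for the object that ezANOVA returns when return_aov=TRUE.

```r
# Hypothetical long-format data: 10 subjects, each measured in 3 conditions
set.seed(1)
RM_Anova <- data.frame(
  subject   = factor(rep(1:10, each = 3)),
  condition = factor(rep(c("puppy", "kitten", "alone"), times = 10)),
  relax     = round(rnorm(30, mean = 15, sd = 3))
)

# Error(subject / condition) tells aov() that condition varies within subject
fit <- aov(relax ~ condition + Error(subject / condition), data = RM_Anova)
fit_summary <- summary(fit)
print(fit_summary)

# [[2]] selects the within-subjects part of the summary and [[1]] its table;
# row 2 is Residuals, column 1 is Df, and column 3 is Mean Sq
dfW <- fit_summary[[2]][[1]][2, 1]
MSW <- fit_summary[[2]][[1]][2, 3]
print(c(dfW = dfW, MSW = MSW))
```

With 10 subjects and 3 conditions, dfW is 18, which matches the DFd value reported for condition in the repeated-measures ANOVA output.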

Finally, to run the post-hoc test, type in this code in your RStudio window:

    (HSD.test(y = RM_Anova$relax, trt = RM_Anova$condition, DFerror = dfW, MSerror = MSW, alpha = .05))

The code shown above applies to this specific example. For your data, you should replace the following items:

Replace RM_Anova with your dataset name (the dataset containing the raw data).
Replace relax with your dependent variable.
Replace condition with the within-group variable.

Press Enter. The output will appear in your RStudio console window (as shown below). (The sample output below comes from an analysis in which the dependent variable was score and the within-group variable had two levels, pre and post.)

$statistics

Mean CV MSerror HSD

66.13333 14.83639 96.27143 7.684262

$parameters

Df ntr StudentizedRange

14 2 3.033186

$means

RM_Anova$score std r Min Max

post 66.33333 12.89334 15 48 90

pre 65.93333 11.84704 15 48 87

$comparison

NULL

$groups

trt means M

1 post 66.33333 a

2 pre 65.93333 a

Look under the $groups heading. If the letters under the M column are different from each other, it means that the group means are significantly different. If the letters under the M column are not different from each other, that means that the means are not significantly different from each other. In this case, the means for pre- and post-test were not significantly different from each other, as evidenced by both of them being labeled with the letter a.

Here’s an example of output in which there are significant differences between means:

$groups

trt means M

1 D 73.250 a

2 C 56.875 b

3 B 35.625 c

4 A 34.125 c

In this case, the Group A and Group B means are not significantly different from each other (because both of their M levels are the same, c), but both of these group means are significantly different from the Group C mean and the Group D mean (because the M levels for Group C and Group D are b and a, respectively).
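The HSD value in the first sample output of this section (the pre/post output) can be reproduced from the other numbers printed there. The sketch below copies those numbers and uses base R's qtukey() function. It assumes equal group sizes, which holds for that output (r = 15 in both groups).

```r
# Values copied from the $statistics, $parameters, and $means output
MSerror <- 96.27143   # mean square error
df_err  <- 14         # Df
ntr     <- 2          # number of groups being compared
r       <- 15         # number of scores in each group
alpha   <- 0.05

# Critical value of the studentized range (StudentizedRange in the output)
q_crit <- qtukey(1 - alpha, nmeans = ntr, df = df_err)

# Smallest difference between two means that counts as significant
hsd <- q_crit * sqrt(MSerror / r)

# The post and pre means differ by far less than the HSD value
mean_diff <- 66.33333 - 65.93333
differs   <- abs(mean_diff) > hsd

print(c(q_crit = q_crit, hsd = hsd, mean_diff = mean_diff))
```

Because the difference between the two means (0.4) is smaller than the HSD value, the two groups receive the same letter in the M column.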

Mixed-method ANOVA

Perform a mixed-method ANOVA when you have two independent variables, where one IV is an independent-groups (between-subjects) variable and the other IV is a repeated-measures (within-subjects) variable.

For example, suppose we want to know whether students in three statistics classes with three different instructors learn different amounts of the course material. We will examine five students from class A, five from class B, and five from class C (a total of 15 students). We will give each student a pre-test at the beginning of the semester to see how much statistics they know before taking the class. Then we will give each student a test at the end of the semester (a post-test) to see how much statistics they know after taking the class.

In this example, the type of test, pre or post, is a repeated-measures variable because every student takes both tests. The class, A, B, or C, is an independent-groups variable because each student is in only one of the three classes. The test type (pre or post) and the class are independent variables. The test score is the dependent variable. The following table shows a way to visualize the data for this example:

Class    Pre-test      Post-test
A        Scorepre1     Scorepost1
         Scorepre4     Scorepost4
         Scorepre7     Scorepost7
         Scorepre10    Scorepost10
         Scorepre13    Scorepost13
B
C

In the table, Scorepre1 is the pre-test score for student 1, Scorepost1 is the post-test score for student 1, etc. There is a total of six groups of scores in this research design. Note that each student has a score in both columns, which represent the two levels of the repeated-measures variable. Also note that each student appears in only one of the three Class rows, which represent the three levels of the independent-groups variable.

For this example, we can set up the data file as shown below. In each row of data, participant is a unique identifier for each student, test indicates whether that row contains a pre- or post-test score, class indicates whether the student is in class A, B, or C, and score is the score on the test. Note that there are two rows for each student because each student took both the pre-test and post-test, the two levels of the repeated-measures variable. Also note that each student is in only one of the three classes because Class is an independent-groups variable.

In the data shown above, the participants are listed in numerical order such that the pre- and post-test scores alternate and the students in each class are not listed together. You can enter the data in any order as long as the participant number and level of each variable are shown correctly on each line.

Return to your RStudio window. Type in the following code:

    summary(aov(score ~ class*test, data = Dataset))

The code shown above applies to this specific example. For your data, replace the following items:

Replace score with your dependent variable.
Replace class with your between-groups (independent-groups) variable.
Replace test with your within-groups (repeated-measures) variable.

The order of class and test does not matter.
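The data layout described above can be built in code to check that the analysis call runs and gives the expected degrees of freedom. The sketch below uses randomly generated scores, which are hypothetical and not the guide's data, and assigns students to classes in rotation (student 1 to A, 2 to B, 3 to C, 4 to A, and so on).

```r
# Hypothetical data: 15 students, each with a pre and a post score
set.seed(2)
Dataset <- data.frame(
  participant = rep(1:15, each = 2),
  test        = factor(rep(c("pre", "post"), times = 15)),
  class       = factor(rep(rep(c("A", "B", "C"), times = 5), each = 2)),
  score       = round(rnorm(30, mean = 65, sd = 12))
)

# The same call as in the guide; [[1]] extracts the results table
anova_table <- summary(aov(score ~ class * test, data = Dataset))[[1]]
print(anova_table)

# Degrees of freedom for class, test, class:test, and Residuals
df_column <- anova_table[["Df"]]
```

The F values will differ from the guide's output because the scores are made up, but the Df column depends only on the layout: 2, 1, 2, and 24.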

Press Enter. The output will appear in your RStudio console window.

            Df Sum Sq Mean Sq F value Pr(>F)
class        2    461   230.5   1.523  0.238
test         1      1     1.2   0.008  0.930
class:test   2    198    98.8   0.653  0.530
Residuals   24   3634   151.4

Interpretation: Each row shows the results for a test of a main effect or of the interaction, as shown in the first column.

Df = Degrees of freedom
Sum Sq = Sum of squares
Mean Sq = Mean squares
F value = F-ratio
Pr(>F) = p-value

According to these results, there were no significant main effects of class or test, nor was there a significant class x test interaction.

class: F(2,24) = 1.52, p = .24
test: F(1,24) = 0.01, p = .93
class:test (class by test interaction): F(2,24) = 0.65, p = .53
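The columns of the output table are related by simple arithmetic, which can be checked for the class row. The sketch below copies the rounded values from the output, so the results agree with the table only to within rounding.

```r
# Rounded values copied from the class and Residuals rows of the output
ss_class <- 461;  df_class <- 2
ss_resid <- 3634; df_resid <- 24

# Mean Sq = Sum Sq / Df
ms_class <- ss_class / df_class
ms_resid <- ss_resid / df_resid

# F value = Mean Sq for the effect / Mean Sq for Residuals
f_class <- ms_class / ms_resid

# Pr(>F) = probability of an F this large or larger if there were no effect
p_class <- pf(f_class, df_class, df_resid, lower.tail = FALSE)

print(c(ms_class = ms_class, ms_resid = ms_resid, F = f_class, p = p_class))
```

The same steps, applied to the test and class:test rows, give the other F values and p-values in the table.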

Post-hoc test for mixed-method ANOVA

In this example, the mixed-method ANOVA detected no significant effects. If any of the effects had been significant, it would be appropriate to follow up the ANOVA with a post-hoc test to do pair-wise comparisons of each pair of the six group means. After you perform the mixed-method ANOVA (following the instructions in the preceding section), running the post-hoc test is easy. All you have to do is place some of the mixed-method ANOVA code (all but the word summary) in these parentheses:

    TukeyHSD()

Here’s how the example mixed-method ANOVA post-hoc test code would look:

    TukeyHSD((aov(score ~ class*test, data = Dataset)))

Press Enter. The output will appear in your RStudio console window.

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = score ~ class * test, data = Dataset)

$class

diff lwr upr p adj

B-A 9.4 -4.341888 23.141888 0.2227517

C-A 3.0 -10.741888 16.741888 0.8498866

C-B -6.4 -20.141888 7.341888 0.4860130

$test

diff lwr upr p adj

pre-post -0.4 -9.673008 8.873008 0.9297983

$`class:test`

diff lwr upr p adj

B:post-A:post 10.2 -13.8615 34.2615 0.7764164

C:post-A:post -2.0 -26.0615 22.0615 0.9998258

A:pre-A:post -3.2 -27.2615 20.8615 0.9982935

B:pre-A:post 5.4 -18.6615 29.4615 0.9808958

C:pre-A:post 4.8 -19.2615 28.8615 0.9886974

C:post-B:post -12.2 -36.2615 11.8615 0.6262243

A:pre-B:post -13.4 -37.4615 10.6615 0.5314074

B:pre-B:post -4.8 -28.8615 19.2615 0.9886974

C:pre-B:post -5.4 -29.4615 18.6615 0.9808958

A:pre-C:post -1.2 -25.2615 22.8615 0.9999861

B:pre-C:post 7.4 -16.6615 31.4615 0.9287373

C:pre-C:post 6.8 -17.2615 30.8615 0.9491845

B:pre-A:pre 8.6 -15.4615 32.6615 0.8744693

C:pre-A:pre 8.0 -16.0615 32.0615 0.9038305

C:pre-B:pre -0.6 -24.6615 23.4615 0.9999996

Interpretation: According to these results, none of the pairwise comparisons between the six individual group means (a total of 15 comparisons) is significant, as all of the p-values (p adj) are greater than the alpha level of .05. This is expected in this specific case because the initial mixed-method ANOVA produced non-significant results. Once again, in a real-life situation, you would not have bothered to perform the post-hoc test in this case.
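The count of 15 comparisons is the number of distinct pairs that can be formed from six group means. The sketch below confirms this with choose() and with the number of rows that TukeyHSD() returns; the data are randomly generated and hypothetical, with the same 3 x 2 layout as the example.

```r
# Number of distinct pairs among 6 group means
n_pairs <- choose(6, 2)

# Hypothetical data: 3 classes x 2 tests, 5 scores in each combination
set.seed(3)
Dataset <- data.frame(
  class = factor(rep(c("A", "B", "C"), each = 10)),
  test  = factor(rep(c("pre", "post"), times = 15)),
  score = round(rnorm(30, mean = 65, sd = 12))
)

tukey <- TukeyHSD(aov(score ~ class * test, data = Dataset))

# One row per pairwise comparison in the class:test part of the output
n_rows <- nrow(tukey[["class:test"]])
print(c(n_pairs = n_pairs, n_rows = n_rows))
```

The $class and $test parts of the output contain 3 comparisons and 1 comparison, respectively, for the same reason: choose(3, 2) is 3 and choose(2, 2) is 1.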

Updates

Updating packages

Every once in a while, an R package will be updated. In order to check that all of your packages are up to date, type the following into your RStudio (not Rcmdr) Console window and press Enter:

    update.packages()

You will then see prompts asking you:

    Update (y/N/c)?

Type y into the Console window and press Enter. There will be one prompt for every package that may be updated.

RStudio updates

To keep up with any RStudio updates and find update instructions, visit https://blog.rstudio.org/. You likely will not be impacted unless commands fail to run.
