Introduction to R: for Absolute Beginners

Office of Methodological & Data Sciences

Sarah Schwartz1

BNR 278

12:30 pm - 3:20 pm, October 2, 2012

1 EDUC 455, (435) 797-0169, [email protected] or [email protected], http://www.cehs.usu.edu/research/omds


Download & Install 2 pieces of free software

Video walk-through of both installations: https://www.youtube.com/watch?v=EHjakj38Nnw (accept all defaults)

R: https://www.r-project.org/

• Install first

• “Software Environment”

• The brain

• We won’t work directly with it

RStudio: https://www.rstudio.com/

• Install second

• “User Interface”

• The go-between for us

• Auto completes & color codes


Helpful Websites

Tutorials by William B. King, PhD, Coastal Carolina University: http://ww2.coastal.edu/kingw/statistics/R-tutorials

RexRepos, R Example Repository: http://www.uni-kiel.de/psychologie/rexrepos

R-bloggers, R news & tutorials with broad coverage: http://www.r-bloggers.com

Quick-R, accessing the power of R, includes some graphs: http://www.statmethods.net

Psychology, using R for psychological research: http://personality-project.org/r


Outline

How to Speak R

• Nuts & Bolts

• Using Add-on Packages

• How to Read in YOUR Own Data

Getting to Know Your Data

• Numeric Summaries

• Graphical Summaries

Fitting Statistical Models

• Motor Trend Car Road Tests

• Comparing Group Centers

• Regression Models


RStudio Workspace


Other User Interfaces Exist...

R Commander (Rcmdr): http://www.rcommander.com


Basic Calculations

prompt: in the console (command line)

case sensitive: ‘anova’ is not the same as ‘ANOVA’

comment lines: use the # symbol at least once

1 + 3 #### addition

## [1] 4

16 / 2 #### division

## [1] 8

5 ^ 2 ###### powers

## [1] 25

sqrt(144) # square root

## [1] 12

log(1.3) #### natural logarithm

## [1] 0.2623643
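Because log() returns the natural logarithm by default, a short sketch of the other bases may help. The object names below are only for illustration.

```r
# log() is the natural log (base e); exp() is its inverse
ln_e   <- log(exp(1))         # 1

# Other bases
log_10 <- log10(1000)         # 3
log_2  <- log2(8)             # 3
log_3  <- log(81, base = 3)   # 4

# Order of operations: powers, then multiplication, then addition
ordered_ops <- 1 + 2 * 3^2    # 19
```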


Create & Remove Objects

# ALL OF THESE DO THE SAME THING
x=7
x = 7
x= 7
x = 7
x =     # Press Enter here.
7       # Press Enter again.

# TWO WAYS TO ASSIGN OBJECTS
Aval = 7      # use the equal
B.val = 15    # names: no spaces
Cval <- 10    # use an arrow
ls()          # list environment

## [1] "Aval" "B.val" "Cval" "x"

# YOU CAN REMOVE OBJECTS AFTER CREATING THEM
rm(B.val)     # remove from environment
ls()          # list the environment

## [1] "Aval" "Cval" "x"

Aval # what is assigned to this?

## [1] 7

aval # CAPS MATTER!!!

## Error in eval(expr, envir, enclos): object ’aval’ not found


A double equals sign (==) tests for equality:

5 == 6 # are these equal?

## [1] FALSE

3 < 10 # 'less than'

## [1] TRUE

1 < 2 | 2 == 3 # '|' means 'or'

## [1] TRUE

Aval < Cval # can test objects

## [1] TRUE

# Create a vector with "combine"
vec1 = c(1, 2, 7, 3, 2, -3)

# Are there ANY TWOs?
2 %in% vec1

## [1] TRUE

# test EACH VALUE to see if it is TWO
2 == vec1

## [1] FALSE  TRUE FALSE FALSE
## [5]  TRUE FALSE

# COUNT the number of TWOs
sum(2 == vec1)

## [1] 2
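The same logical tests can also locate values or give proportions. This is a minimal sketch using the vector from the slide; the object names are illustrative.

```r
vec1 <- c(1, 2, 7, 3, 2, -3)

# WHERE are the twos? (positions, not values)
two_positions <- which(vec1 == 2)      # 2 5

# What PROPORTION of values are two? (TRUE counts as 1)
prop_twos <- mean(vec1 == 2)           # 2 out of 6

# '&' means 'and': values that are positive AND below 5
in_range   <- vec1 > 0 & vec1 < 5      # TRUE TRUE FALSE TRUE TRUE FALSE
n_in_range <- sum(in_range)            # 4
```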


Some Possible CLASSES of R Objects

Individual VALUES:

numeric: number values

logical: either ‘TRUE’ (codes to 1) or ‘FALSE’ (codes to 0)

factor: categorical levels, nominal or ordinal

character: text (called ‘string’ in SPSS)

Data OBJECTS:

vector: a 1-dimensional listing of single elements

matrix: a 2-dimensional array of elements (rows & columns)

data.frame: like a matrix, but with more formatting (nice labels) and columns that can have different classes


x = 1:5

class(x)

## [1] "integer"

x

## [1] 1 2 3 4 5

y = x / 3

class(y)

## [1] "numeric"

y

## [1] 0.3333333
## [2] 0.6666667
## [3] 1.0000000
## [4] 1.3333333
## [5] 1.6666667

z = x > 4

class(z)

## [1] "logical"

z

## [1] FALSE FALSE
## [3] FALSE FALSE
## [5]  TRUE

c = factor(c("m", "m", "f", "f", "m")) # works, but avoid naming objects after functions such as c()

class(c)

## [1] "factor"

c

## [1] m m f f m
## 2 Levels: f ...
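The slides show vectors of each class; the two larger data objects can be sketched the same way. The toy values and names below are made up for illustration.

```r
# A matrix: every element has the same class
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)                     # 2 rows, 3 columns

# A data.frame: each column can have its own class
df <- data.frame(
  id    = 1:3,
  score = c(2.5, 3.0, 4.5),
  group = factor(c("a", "b", "a"))
)
col_classes <- sapply(df, class)   # id: "integer", score: "numeric", group: "factor"
```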


Finding a Function

If you’re not sure of a function’s name, use ‘apropos’ to search for it:

apropos("round")

## [1] "round"
## [2] "round.Date"
## [3] "round.POSIXt"

Then you can search for the name of the function in the HELP tab of RStudio (or use Google).

apropos("mean")

## [1] ".colMeans"
## [2] ".rowMeans"
## [3] "colMeans"
## [4] "kmeans"
## [5] "mean"
## [6] "mean.Date"
## [7] "mean.default"
## [8] "mean.difftime"
## [9] "mean.POSIXct"
## [10] "mean.POSIXlt"
## [11] "rowMeans"
## [12] "weighted.mean"


You can use the Help tab in RStudio to find out about a function.

# Ask for the function's arguments
args(round)

## function (x, digits = 0)
## NULL

round(2.4)

## [1] 2

ceiling(2.4)

## [1] 3

floor(2.4)

## [1] 2

round(2.7)

## [1] 3

ceiling(2.7)

## [1] 3

floor(2.7)

## [1] 2
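The digits argument reported by args(round) controls where the rounding happens. A small sketch with illustrative object names:

```r
two_places <- round(3.14159, digits = 2)   # 3.14
thousands  <- round(12345, digits = -3)    # 12000 (negative digits round left of the decimal)
three_sig  <- signif(123456, digits = 3)   # 123000 (significant digits instead)

# Exact halves are rounded to the nearest EVEN digit
halves <- round(c(0.5, 1.5, 2.5))          # 0 2 2
```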


Missing Values

data = c(1, 0, 2, 5, NA)
is.na(data)

## [1] FALSE FALSE FALSE
## [4] FALSE  TRUE

anyNA(data)

## [1] TRUE

Different functions have different default ways to handle missing values. Use the HELP to determine what the default is and how to change it.

1 + 0 + 2 + 5

## [1] 8

mean(data)

## [1] NA

mean(data, na.rm = TRUE)

## [1] 2

sd(data)

## [1] NA

sd(data, na.rm = TRUE)

## [1] 2.160247
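Beyond na.rm, it is often useful to count or drop the missing values directly. A minimal sketch on the same vector; the object names are illustrative.

```r
data <- c(1, 0, 2, 5, NA)

n_missing <- sum(is.na(data))          # 1
complete  <- data[!is.na(data)]        # 1 0 2 5
total     <- sum(data, na.rm = TRUE)   # 8

# Careful: '==' cannot find NA, because NA == NA is itself NA
na_compare <- NA == NA                 # NA, so always use is.na()
```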


R Base vs. External Packages

When you download R, you are only getting the base functions. This is a relatively small collection of functions, but it keeps R running fast.

# included in R base:
summary(data) # basic summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    0.75    1.50    2.00    2.75    5.00       1

table(data) # tabulates categoricals

## data
## 0 1 2 5
## 1 1 1 1

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.

By only downloading and installing the packages you need, on a project-by-project basis, R uses less storage space on your hard drive and less active memory.


Thousands of packages are available for download and installation. Many are vetted and distributed by CRAN, others are available on GitHub, or you can create & share packages on an individual level.

Install: download to your computer’s hard drive ONLY ONCE

Load: activate the package’s library EVERY session

# Code for installing all the
# packages in this document
# (the names must be combined into ONE vector with c())

install.packages(c("psych", "xlsx", "haven", "lattice", "MASS", "ggplot2", "popbio", "beeswarm", "car"))

NOTE: when you download your first package, select a mirror (a proxy server)
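Since installation is only needed once, one option is to check what is already in your library first. The helper missing_packages() below is a hypothetical name written for this sketch, not part of R; the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: which of the requested packages are not yet installed?
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

needed     <- c("psych", "haven", "ggplot2")
to_install <- missing_packages(needed)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)

shipped <- missing_packages(c("stats", "utils"))   # character(0): these ship with R
```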


The ‘psych’ Package

The psych package has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory, a draft of which is available at http://personality-project.org/r/book.

# 'LOAD' or 'activate' the package
library(psych)

This package has a nice feature for reading in data from your clipboard:

1. Highlight the data in Excel, including the first row with variable names

2. ‘Copy’ the selection, moving the information to the clipboard

3. Run the code below to store it in R as an object named bfi

# International Personality Item Pool
bfi = read.clipboard.tab()


The data are 25 personality self-report items taken from the International Personality Item Pool (http://ipip.ori.org) and were included as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project (http://SAPA-project.org).

5 Items x 5 Factors

• Agreeableness

• Conscientiousness

• Extraversion

• Neuroticism

• Openness

Response Scale

1. Very Inaccurate

2. Moderately Inaccurate

3. Slightly Inaccurate

4. Slightly Accurate

5. Moderately Accurate

6. Very Accurate

Demographic

• gender

• education

• age



Investigate the Form of Your Data

class(bfi) # you probably want a data.frame

## [1] "data.frame"

dim(bfi) # rows (subjects) & columns (variables)

## [1] 2800 28

names(bfi) # columns should have variable names

## [1] "A1" "A2" "A3" "A4" "A5" "C1" "C2"
## [8] "C3" "C4" "C5" "E1" "E2" "E3" "E4"
## [15] "E5" "N1" "N2" "N3" "N4" "N5" "O1"
## [22] "O2" "O3" "O4" "O5" "gender" "education" "age"

table(complete.cases(bfi)) # are the cases complete? (no missing values)

##
## FALSE  TRUE
##   564  2236
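Two more base R checks are handy at this stage: str() for the class of every column and colSums(is.na()) for missing values per column. The sketch below uses a small made-up data frame (toy) as a stand-in for your own data.

```r
# Hypothetical toy data standing in for your own data.frame
toy <- data.frame(
  item1 = c(2, NA, 5, 4),
  item2 = c(1, 3, NA, 6),
  age   = c(16, 18, 17, 21)
)

str(toy)                                     # class of every column at a glance

missing_per_column <- colSums(is.na(toy))    # item1: 1, item2: 1, age: 0
n_complete         <- sum(complete.cases(toy))   # 2 rows with no missing values
```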


Declare Categorical Variables - GENDER

# look at the raw form: 4 ways to designate a variable
bfi[, 26]            # designate column number...
bfi[, c("gender")]   # ...or column name...
bfi["gender"]        # ...same values, but kept as a 1-column data.frame...
bfi$gender           # ...this is the most common

class(bfi$gender) # the variable's "class"

## [1] "integer"

head(bfi$gender) # look at top cases

## [1] 1 2 2 2 1 2

summary(bfi$gender) # how does it get summarized?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.672   2.000   2.000

table(bfi$gender) # what does "table" do?

##
##    1    2
##  919 1881


Declare Categorical Variables - GENDER

# define it as categorical: FACTOR is "nominal"
bfi$gender = factor(bfi$gender, labels = c("male", "female"))

# now it's ready to go
class(bfi$gender) # did the "class" change?

## [1] "factor"

head(bfi$gender) # does it look different?

## [1] male female female female male female
## Levels: male female

summary(bfi$gender) # is the summary the same?

##   male female
##    919   1881

levels(bfi$gender) # this lists the LABELS

## [1] "male" "female"
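When only labels are given, R pairs them with the sorted unique values. Naming the levels as well makes the mapping explicit. A minimal sketch with made-up codes (1 = male, 2 = female is assumed here to mirror the slide):

```r
raw <- c(1, 2, 2, 1, 2)   # hypothetical codes: 1 = male, 2 = female

# Naming 'levels' as well as 'labels' makes the code-to-label mapping explicit
sex <- factor(raw, levels = c(1, 2), labels = c("male", "female"))

sex_counts <- table(sex)        # male: 2, female: 3
sex_codes  <- as.integer(sex)   # back to the underlying codes: 1 2 2 1 2
```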


Declare Categorical Variables - EDUCATION

table(bfi$education) # look at the raw form

#### 1 2 3 4 5## 224 292 1249 394 418

# define as categorical: ORDERED is "ordinal"
bfi$education = ordered(bfi$education,
                        labels = c("<HS", "HS", "HS+ ", "degree", "grad+"))

# now it's ready to go
head(bfi$education, n = 15)

## [1] <NA> <NA> <NA> <NA> <NA> HS+ <NA> HS <HS <NA> <HS <NA> <NA> <NA> <HS
## Levels: <HS < HS < HS+ < degree < grad+

summary(bfi$education)

##    <HS     HS   HS+  degree  grad+   NA's
##    224    292   1249    394    418    223

levels(bfi$education)

## [1] "<HS" "HS" "HS+ " "degree" "grad+"
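One practical payoff of ordered() is that comparisons between levels become meaningful. The sketch below uses a short made-up vector rather than the bfi data.

```r
edu <- ordered(c(1, 3, 5, 2, NA),
               levels = 1:5,
               labels = c("<HS", "HS", "HS+", "degree", "grad+"))

# Ordered factors support comparisons; plain (nominal) factors do not
at_least_hs <- edu >= "HS"                 # FALSE TRUE TRUE TRUE NA
highest     <- max(edu, na.rm = TRUE)      # grad+
```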


bfi[1:3, ] # specify rows (subjects) in FRONT of the comma

##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5 gender
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3   male
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3 female
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2 female
##       education age
## 61617      <NA>  16
## 61618      <NA>  18
## 61620      <NA>  17

bfi[1:4, 1:7] # specify columns (variables) AFTER the comma

##       A1 A2 A3 A4 A5 C1 C2
## 61617  2  4  3  4  4  2  3
## 61618  2  4  5  2  5  5  4
## 61620  5  4  5  4  4  4  5
## 61621  4  4  6  5  5  4  4

# ...or list the names of the variables (after comma)
bfi[1:3, c("A1", "A2", "A3", "A4", "A5", "gender", "education", "age")]

##       A1 A2 A3 A4 A5 gender education age
## 61617  2  4  3  4  4   male      <NA>  16
## 61618  2  4  5  2  5 female      <NA>  18
## 61620  5  4  5  4  4 female      <NA>  17


Saving a Reduced Dataset

# suppose I'm only interested in subjects under the age of 35
table(bfi$age < 35)

##
## FALSE  TRUE
##   738  2062

# AND I only want to keep a few variables (for demo)
bfiA = bfi[bfi$age < 35,
           c("A1", "A2", "A3", "A4", "A5", "gender", "education", "age")]

dim(bfiA) # rows & columns that were kept

## [1] 2062 8

headTail(bfiA) # see a few lines from top and bottom

##        A1  A2  A3  A4  A5 gender education age
## 61617   2   4   3   4   4   male      <NA>  16
## 61618   2   4   5   2   5 female      <NA>  18
## 61620   5   4   5   4   4 female      <NA>  17
## 61621   4   4   6   5   5 female      <NA>  17
## ...   ... ... ... ... ...   <NA>      <NA> ...
## 67551   6   1   3   3   3   male       HS+  19
## 67552   2   4   4   3   5   male    degree  27
## 67556   2   3   5   2   5 female    degree  29
## 67559   5   2   2   4   4   male    degree  31
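Row selection with a logical test works cleanly here because age has no missing values. If the test variable does contain NA, the result picks up all-NA rows. A hedged sketch on made-up data showing two common ways around that:

```r
toy <- data.frame(id = 1:5, age = c(20, 40, NA, 31, 36))

# A logical index containing NA yields an all-NA row
n_with_na_row <- nrow(toy[toy$age < 35, ])          # 3 (includes one NA row)

# which() keeps only the rows where the test is TRUE
n_which <- nrow(toy[which(toy$age < 35), ])         # 2

# subset() also drops rows where the condition is NA
young <- subset(toy, age < 35, select = c(id, age))
```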


How to Read in YOUR Own Data

Before you can load your data, you need to tell R where to look.

# get the working directory
getwd()

## [1] "C:/Users/A00315273/Box Sync/Office of Research Services/OMDS/OMDS Workshops/OMDS intro to R"

Notice: you need to use forward slashes (/) instead of backslashes (\)

# change the working directory to YOUR COMPUTER!!!
setwd("C:/Users/A00315273/OMDSworkshop")

If the data is stored in a TEXT file (whitespace or comma delimited)...

# these functions are part of BASE R
myData = read.table("data.txt", header = TRUE)   # whitespace delimited
myData = read.csv("data.csv", header = TRUE)     # comma delimited
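To try the read functions without a data file of your own, you can write a small file and read it back. This sketch uses a temporary file and made-up values; toy and path are illustrative names.

```r
# Write a small hypothetical data set to a temporary .csv file
toy  <- data.frame(id = 1:3, score = c(4.5, NA, 3.0))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read it back; blank cells and the text "NA" are both treated as missing
myData <- read.csv(path, header = TRUE, na.strings = c("", "NA"))

str(myData)   # confirm each column has the class you expect
```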


Best Practices: DataSet in Excel

Often, you may enter your data into Excel.

Make sure the FIRST ROW contains the names of variables.

Names, Values, & Fields

• FIRST variable is unit identification

• NEVER use white SPACES

• AVOID symbols or punctuation: ? [ } * $ %

• USE . or _ to push words together

• KEEP it short, but meaningful

• ALWAYS use numbers over text

• LEAVE missing cells blank (not .)


Read in Data from Excel Files

Bad Example

Much Better!


Read in Data from Excel Files

# there's a package for that!
# "Read, write, format Excel 2007 (xlsx) files"
library(xlsx)

# read.xlsx tries to guess variable classes
# read.xlsx2 is faster on bigger datasets

myData = read.xlsx("data.xlsx",
                   sheetIndex = 1,   # or use sheetName, instead
                   header = TRUE)    # TRUE if 1st row = names

NOTE: If you are having problems with Excel datasets, try saving the data as a “.csv” file (comma delimited) and use the read.csv function in Base R.


Read in Data from SPSS, SAS, & Stata Files

# New package this summer...Hadley Wickham is my HERO!
library(haven)

# Currently haven can read and write:
# logical, integer, numeric, character and factors

# SPSS: Supports both sav & por files
myData = read_spss("data.sav")
myData = read_sav("data.sav")
myData = read_por("data.por")

# SAS: Supports both sas7bdat & sas7bcat files
myData = read_sas("data.sas7bdat")

# Stata
myData = read_stata("data.dta")
myData = read_dta("data.dta")

# NOTE all labeled variables are a new class: "labelled"
# ... use as_factor() to treat the variable as categorical
# ... use zap_labels() to treat the variable as continuous







Mean, Standard Deviation, Etc...

# descriptives on all variables
describe(bfiA)

##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 2053  2.52 1.42      2    2.36 1.48   1   6     5  0.73    -0.44 0.03
## A2            2 2040  4.75 1.20      5    4.92 1.48   1   6     5 -1.07     0.86 0.03
## A3            3 2048  4.57 1.31      5    4.75 1.48   1   6     5 -0.97     0.39 0.03
## A4            4 2048  4.59 1.54      5    4.81 1.48   1   6     5 -0.91    -0.29 0.03
## A5            5 2050  4.50 1.26      5    4.64 1.48   1   6     5 -0.79     0.07 0.03
## gender*       6 2062  1.66 0.47      2    1.70 0.00   1   2     1 -0.68    -1.54 0.01
## education*    7 1853  3.09 1.06      3    3.11 0.00   1   5     4 -0.04    -0.03 0.02
## age           8 2062 23.16 5.22     22   22.98 5.93   3  34    31  0.25    -0.59 0.12


Mean, Standard Deviation, Etc...

# split by a grouping variable
describeBy(bfiA, bfiA$gender)

## group: male
##            vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 699  2.81 1.43      3    2.71 1.48   1   6     5  0.48    -0.75 0.05
## A2            2 691  4.46 1.30      5    4.61 1.48   1   6     5 -0.88     0.25 0.05
## A3            3 695  4.38 1.30      5    4.52 1.48   1   6     5 -0.78     0.01 0.05
## A4            4 697  4.31 1.51      5    4.45 1.48   1   6     5 -0.64    -0.62 0.06
## A5            5 695  4.35 1.33      5    4.49 1.48   1   6     5 -0.74    -0.13 0.05
## gender*       6 699  1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN 0.00
## education*    7 626  3.11 1.15      3    3.14 1.48   1   5     4 -0.04    -0.40 0.05
## age           8 699 22.83 5.04     22   22.63 4.45   3  34    31  0.27    -0.29 0.19
## -------------------------------------------------------------------
## group: female
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 1354  2.37 1.39      2    2.17 1.48   1   6     5  0.88    -0.16 0.04
## A2            2 1349  4.90 1.12      5    5.07 1.48   1   6     5 -1.16     1.22 0.03
## A3            3 1353  4.67 1.31      5    4.86 1.48   1   6     5 -1.10     0.68 0.04
## A4            4 1351  4.74 1.53      5    4.99 1.48   1   6     5 -1.08     0.04 0.04
## A5            5 1355  4.58 1.22      5    4.71 1.48   1   6     5 -0.80     0.14 0.03
## gender*       6 1363  2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN 0.00
## education*    7 1227  3.08 1.02      3    3.09 0.00   1   5     4 -0.04     0.21 0.03
## age           8 1363 23.32 5.31     23   23.17 5.93   9  34    25  0.23    -0.73 0.14
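If psych is not loaded, base R can produce simple grouped summaries too. This sketch uses the built-in mtcars data (not bfi) so that it runs anywhere; in the raw data am is coded 0 = automatic, 1 = manual.

```r
# Mean mpg within each transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
round(group_means, 2)    # 0: 17.15, 1: 24.39

# Several variables at once, using a formula
by_group <- aggregate(cbind(mpg, wt) ~ am, data = mtcars, FUN = mean)
by_group
```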


Cross Tabulations & χ² test for Independence

# cross-tabulate two categorical variables
# If a variable is included on the left side of the formula,
# it is assumed to be a vector of frequencies
edXgender = xtabs(~ education + gender, data = bfiA)
edXgender

##          gender
## education male female
##    <HS      71    109
##    HS       70    121
##    HS+     303    691
##    degree   84    169
##    grad+    98    137

# chi-squared test for independence
chisq.test(edXgender)

##
## Pearson's Chi-squared test
##
## data: edXgender
## X-squared = 14.746, df = 4, p-value = 0.005258
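To see where the test statistic comes from, the counts in the cross-tabulation can be typed in and the calculation done step by step. This is only an illustrative sketch; the object names are made up.

```r
# Counts copied from the education-by-gender cross-tabulation
counts <- matrix(c(71, 70, 303, 84, 98,
                   109, 121, 691, 169, 137),
                 ncol = 2,
                 dimnames = list(education = c("<HS", "HS", "HS+", "degree", "grad+"),
                                 gender    = c("male", "female")))

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic, degrees of freedom, and p-value
x2 <- sum((counts - expected)^2 / expected)          # about 14.746
df <- (nrow(counts) - 1) * (ncol(counts) - 1)        # 4
p  <- pchisq(x2, df = df, lower.tail = FALSE)

test <- chisq.test(counts)   # the same statistic, computed by R
```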


Correlation Matrix

How strong is the association between the 5 Agreeableness items?

# reduce the dataset for ease of demonstration
bfiAonly = bfi[, c("A1", "A2", "A3", "A4", "A5")]

# GET CORRELATION VALUES
cor(bfiAonly, use = "pairwise.complete.obs")

##            A1         A2         A3         A4         A5
## A1  1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
## A2 -0.3401932  1.0000000  0.4850980  0.3350872  0.3900836
## A3 -0.2652471  0.4850980  1.0000000  0.3604283  0.5041411
## A4 -0.1464245  0.3350872  0.3604283  1.0000000  0.3075373
## A5 -0.1814383  0.3900836  0.5041411  0.3075373  1.0000000

round(cor(bfiAonly, use = "pairwise.complete.obs"), 3)

##        A1     A2     A3     A4     A5
## A1  1.000 -0.340 -0.265 -0.146 -0.181
## A2 -0.340  1.000  0.485  0.335  0.390
## A3 -0.265  0.485  1.000  0.360  0.504
## A4 -0.146  0.335  0.360  1.000  0.308
## A5 -0.181  0.390  0.504  0.308  1.000
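cor() returns only the coefficients. For a single pair of variables, base R's cor.test() adds a p-value and a confidence interval. The sketch uses the built-in mtcars data so it runs without bfi.

```r
# Pearson correlation between weight and fuel economy, with a test
ct <- cor.test(mtcars$wt, mtcars$mpg)

ct$estimate   # the correlation coefficient r (about -0.868)
ct$p.value    # p-value for H0: true correlation is 0
ct$conf.int   # 95% confidence interval for r
```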


Correlation Matrix with p-values

corr.test(bfiAonly,
          adjust = "none",
          method = "spearman")

## Call:corr.test(x = bfiAonly, method = "spearman", adjust = "none")
## Correlation matrix
##       A1    A2    A3    A4    A5
## A1  1.00 -0.37 -0.30 -0.16 -0.22
## A2 -0.37  1.00  0.50  0.34  0.40
## A3 -0.30  0.50  1.00  0.36  0.53
## A4 -0.16  0.34  0.36  1.00  0.31
## A5 -0.22  0.40  0.53  0.31  1.00
## Sample Size
##      A1   A2   A3   A4   A5
## A1 2784 2757 2759 2767 2769
## A2 2757 2773 2751 2758 2757
## A3 2759 2751 2774 2759 2758
## A4 2767 2758 2759 2781 2765
## A5 2769 2757 2758 2765 2784
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##    A1 A2 A3 A4 A5
## A1  0  0  0  0  0
## A2  0  0  0  0  0
## A3  0  0  0  0  0
## A4  0  0  0  0  0
## A5  0  0  0  0  0
##
## To see confidence intervals of the correlations, print with the short=FALSE option


Correlation Matrix: Visualize

A picture can be worth a thousand words

cor.plot(cor(bfiAonly, use = "pairwise.complete.obs", method = "spearman"))

[Figure: "Correlation plot" of items A1 to A5, with a color scale from −1 to 1]


psych’s All-in-One Plot

A picture can be worth a thousand words

# plots pairs of variables
pairs.panels(bfiAonly)

[Figure: scatterplot matrix of items A1 to A5 (each scored 1 to 6) with the pairwise correlations printed in the panels]


Histogram: Defaults vs. Options

# all defaults
hist(bfi$A1)

[Figure: "Histogram of bfi$A1"; x-axis: bfi$A1 (1 to 6), y-axis: Frequency]

# better with some options
hist(bfi$A1,
     breaks = 0.5:6.5,
     main = "This is Much Better",
     xlab = "Item A-1",
     col = "gray")

[Figure: "This is Much Better"; x-axis: Item A-1 (1 to 6), y-axis: Frequency]


Histogram: Use More Code!

[Figure: "Ready for Publication"; histogram of Agreeableness Item #1 (q.1146), ''Am indifferent to the feelings of others''; x-axis: Very Inaccurate to Very Accurate, y-axis: Frequency]


Density Plot: Continuous Distribution

# one way to put two plots on the same page
par(mfrow=c(1, 2))                     # 1 row & 2 columns
hist(bfi$age)                          # rough distribution
plot(density(bfi$age, na.rm = TRUE))   # smoothed out

[Figure, left: "Histogram of bfi$age"; x-axis: bfi$age (0 to 80), y-axis: Frequency]

[Figure, right: "density.default(x = bfi$age, na.rm = TRUE)"; N = 2800, Bandwidth = 2.047; y-axis: Density]


Density Plot: AGE

[Figure: "Compare to the Normal Curve"; x-axis: Age (0 to 80), y-axis: Proportion; legend "Curves": density, normal]
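The slide does not show the code behind this figure. One way such a plot could be drawn in base R is sketched below; the ages are simulated (not the bfi data), and the line types and colors are assumptions.

```r
set.seed(1)
age <- round(rnorm(500, mean = 29, sd = 11))   # hypothetical ages, not the bfi data

dens <- density(age, na.rm = TRUE)
plot(dens, main = "Compare to the Normal Curve", xlab = "Age")

# Normal curve with the same mean and standard deviation as the data
curve(dnorm(x, mean = mean(age), sd = sd(age)),
      add = TRUE, lty = 2, col = "red")

legend("topright", title = "Curves",
       legend = c("density", "normal"),
       lty = c(1, 2), col = c("black", "red"))
```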


Bar Plot: Categorical Distribution

par(mfrow=c(1, 2)) # 1 row & 2 columns

# one variable at a time (must give it counts!)
barplot(table(bfi$gender))
barplot(table(bfi$education))

[Figure, left: bar plot of gender counts (male, female)]

[Figure, right: bar plot of education counts (<HS to grad+)]


Bar Plot: Compare 2 Categorical Distributions

[Figure: "Synthetic Aperture Personality Assessment (SAPA)"; x-axis: Highest Level of Education (<HS, HS, HS+, degree, grad+), y-axis: Frequency; legend: male, female]


Boxplots: GENDER & EDUCATION

par(mfrow=c(1, 2)) # 1 row & 2 columns

# all together
boxplot(bfiA$age)

# split by education groups
boxplot(bfi$age ~ bfi$education)

[Figure, left: boxplot of age in bfiA]

[Figure, right: boxplots of age by education level (<HS to grad+)]


Boxplots: Use More Options

# reset to one plot per page
par(mfrow=c(1, 1))

# make it look better
boxplot(age ~ education, data = bfi,
        col = heat.colors(5),
        main = "Build a Better Boxplot",
        xlab = "Highest Education Obtained",
        ylab = "Age (years)")

[Figure: "Build a Better Boxplot"; x-axis: Highest Education Obtained, y-axis: Age (years), 0 to 80]


Boxplots: AGE & EDUCATION

[Figure: "Compare the Genders"; boxplots of Age (years) by Highest Education Obtained, shown separately for male and female]


Scatterplots: Display Associations

Jitter the education level so dots don’t cover each other so much.

# put 3 plots in one row/page
par(mfrow = c(1, 3))

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 0.25),
     main = "factor = 0.25")

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 1),
     main = "factor = 1")

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 2),
     main = "factor = 2")

[Figure: three scatterplots of jittered education (1 to 5) against bfi$age (0 to 80), titled "factor = 0.25", "factor = 1", and "factor = 2"]


Scatterplots: AGE & EDUCATION

[Figure: "Jitter the Ordinal Variable"; x-axis: Age (years), 0 to 80; y-axis: Education (<HS, HS, HS+, degree, grad+)]


Bubble Plot: Helpful with Overplotting

If you can dream of a type of plot, you can create it!

# aggregate the data
bfiAag = aggregate(bfiA,
                   by = list(bfiA$A1, bfiA$A2),
                   length)

# circle's area ~ number of points
symbols(bfiAag$Group.1,
        bfiAag$Group.2,
        circles = sqrt(bfiAag$A1/pi)/50,
        inches = FALSE,
        main = "Bubble Plot",
        xlab = "item A1",
        ylab = "item A2")

[Figure: "Bubble Plot"; x-axis: item A1 (1 to 6), y-axis: item A2 (1 to 6)]







Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

mpg: Miles/(US) gallon

cyl: Number of cylinders

disp: Displacement (cu.in.)

hp: Gross horsepower

drat: Rear axle ratio

wt: Weight (lb/1000)

qsec: 1/4 mile time

vs: V/S (engine shape: 0 = V-shaped, 1 = straight)

am: Transmission (0 = automatic, 1 = manual)

gear: Number of forward gears

carb: Number of carburetors


Load car Package & the mtcars Data

# Load a New Package:
library(car) # "Companion to Applied Regression" (a textbook)

data(mtcars) # mtcars ships with R (package 'datasets'); make it active in the environment

# check out the data
dim(mtcars)

## [1] 32 11

names(mtcars)

## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

# set the categorical variables
mtcars$vs = factor(mtcars$vs, labels = c("v", "s"))
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))


headTail(mtcars)

##                 mpg cyl disp  hp drat   wt  qsec   vs        am gear carb
## Mazda RX4        21   6  160 110  3.9 2.62 16.46    v    manual    4    4
## Mazda RX4 Wag    21   6  160 110  3.9 2.88 17.02    v    manual    4    4
## Datsun 710     22.8   4  108  93 3.85 2.32 18.61    s    manual    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.44    s automatic    3    1
## ...             ... ...  ... ...  ...  ...   ... <NA>      <NA>  ...  ...
## Ford Pantera L 15.8   8  351 264 4.22 3.17  14.5    v    manual    5    4
## Ferrari Dino   19.7   6  145 175 3.62 2.77  15.5    v    manual    5    6
## Maserati Bora    15   8  301 335 3.54 3.57  14.6    v    manual    5    8
## Volvo 142E     21.4   4  121 109 4.11 2.78  18.6    s    manual    4    2

summary(mtcars)

##       mpg             cyl             disp             hp             drat
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec       vs             am          gear            carb
##  Min.   :1.513   Min.   :14.50   v:18   automatic:19   Min.   :3.000   Min.   :1.000
##  1st Qu.:2.581   1st Qu.:16.89   s:14   manual   :13   1st Qu.:3.000   1st Qu.:2.000
##  Median :3.325   Median :17.71                         Median :4.000   Median :2.000
##  Mean   :3.217   Mean   :17.85                         Mean   :3.688   Mean   :2.812
##  3rd Qu.:3.610   3rd Qu.:18.90                         3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.424   Max.   :22.90                         Max.   :5.000   Max.   :8.000


Test Central Differences in 2 Independent Groups

# find the means
describeBy(mtcars$mpg, mtcars$am)

## group: automatic
##   vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 19 17.15 3.83   17.3   17.12 3.11 10.4 24.4    14 0.01     -0.8 0.88
## -------------------------------------------------------------------
## group: manual
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 13 24.39 6.17   22.8   24.38 6.67  15 33.9  18.9 0.05    -1.46 1.71

# view the two groups side-by-side
boxplot(mpg ~ am, data = mtcars, horizontal = TRUE)

[Figure: horizontal boxplots of mpg (10 to 30) for automatic and manual transmissions]


Test Central Differences in 2 Independent Groups

PARAMETRIC t-test for means, assumes normality

t.test(mpg ~ am, data = mtcars)

##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual
##                17.14737                24.39231

NON-PARAMETRIC Mann-Whitney U Test, based on ranks

wilcox.test(mpg ~ am, data = mtcars)

##
## Wilcoxon rank sum test with continuity correction
##
## data: mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
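The printed results can also be stored and queried piece by piece, which helps when reporting. The sketch below uses the built-in mtcars (am coded 0/1) and additionally fits the classic equal-variance t-test for comparison; the object names are illustrative.

```r
welch  <- t.test(mpg ~ am, data = mtcars)                     # default: unequal variances
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)   # classic Student t-test

welch$p.value      # about 0.0014, matching the Welch output above
welch$conf.int     # 95% CI for the difference in means
pooled$parameter   # df = 30 for the pooled test (32 cars - 2 groups)
```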


More than Two Groups?

# plot to investigate
boxplot(drat ~ cyl,
        data = mtcars,
        main = "Between vs. Within",
        xlab = "Number of Cylinders",
        ylab = "Rear Axle Ratio",
        col = "light gray")

grid()

# we can use another package
library(beeswarm)

# overlay the raw points (stripchart() itself is part of base R graphics)
stripchart(drat ~ cyl,
           data = mtcars,
           vertical = TRUE,
           method = 'jitter',
           jitter = 0.2,
           cex = 1,
           pch = 16,
           col = c("red", "blue", "dark green"),
           add = TRUE)

[Figure: "Between vs. Within"; boxplots with jittered points; x-axis: Number of Cylinders (4, 6, 8), y-axis: Rear Axle Ratio (3.0 to 5.0)]


ANOVA

# run the ANOVA
anova1 = aov(drat ~ cyl, data = mtcars)
summary(anova1)

##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342   28.81 8.24e-06 ***
## Residuals   30  4.521   0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# to get type III sums of squares
Anova(anova1, type = "III")

## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df F value    Pr(>F)
## (Intercept) 57.217  1 379.714 < 2.2e-16 ***
## cyl          4.342  1  28.814 8.245e-06 ***
## Residuals    4.521 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
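Note that cyl has Df = 1 in these tables: it is stored as a number, so aov() fits a single slope rather than comparing the three cylinder groups. If group means are what you want to compare, one option is to wrap the variable in factor(). The sketch below is an illustration, not the slide's original analysis, and its output is not shown here.

```r
# As stored, cyl is numeric, so aov() fits a single slope (Df = 1)
fit_numeric <- aov(drat ~ cyl, data = mtcars)

# Wrapping it in factor() compares the three cylinder groups (Df = 2)
fit_groups <- aov(drat ~ factor(cyl), data = mtcars)
summary(fit_groups)

# Pairwise follow-up comparisons between the groups
TukeyHSD(fit_groups)
```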


ANCOVA

# add a continuous covariate
anova2 = aov(drat ~ cyl + wt, data = mtcars)
summary(anova2)

##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342  32.284 3.83e-06 ***
## wt           1  0.620   0.620   4.613   0.0402 *
## Residuals   29  3.900   0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova(anova2, type = "III")

## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df  F value  Pr(>F)
## (Intercept) 56.578  1 420.6933 < 2e-16 ***
## cyl          0.464  1   3.4493 0.07346 .
## wt           0.620  1   4.6129 0.04022 *
## Residuals    3.900 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Kruskal Wallis Test

# non-parametric version: uses ranks instead of means
kruskal.test(drat ~ cyl, data = mtcars)

##
## Kruskal-Wallis rank sum test
##
## data: drat by cyl
## Kruskal-Wallis chi-squared = 14.395, df = 2, p-value = 0.0007486


Simple Linear Regression: Fit Model

# Simple Linear Regression
linreg = lm(mpg ~ wt, data = mtcars)
slr = summary(linreg)
slr

##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

summary(linreg)$r.squared

## [1] 0.7528328

summary(linreg)$adj.r.squared

## [1] 0.7445939
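Once the model is fitted, the same object can give confidence intervals and predictions. A minimal sketch; the two car weights in new_cars are hypothetical values chosen for illustration.

```r
linreg <- lm(mpg ~ wt, data = mtcars)

coef(linreg)      # intercept about 37.29, slope about -5.34
confint(linreg)   # 95% confidence intervals for both coefficients

# Predicted mpg for hypothetical cars weighing 3,000 and 4,000 lb (wt is in 1000s)
new_cars <- data.frame(wt = c(3, 4))
pred <- predict(linreg, newdata = new_cars, interval = "confidence")
pred[, "fit"]     # about 21.25 and 15.91
```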


Simple Linear Regression: Visualize the Fit

# Plot of relationship and least squares line
plot(mtcars$wt, mtcars$mpg)
abline(linreg, col = "red")
text(x = 2,
     y = 12,
     labels = bquote(~R^2 == .(round(slr$r.squared, 3))),
     col = "red")

text(x = 4.75,
     y = 30,
     labels = bquote(~adj-R^2 == .(round(slr$adj.r.squared, 3))),
     col = "blue")

title(main = "Linear Regression")
grid()

[Figure: "Linear Regression"; scatterplot of mtcars$mpg against mtcars$wt with the fitted line; annotations: R² = 0.753 and adj-R² = 0.745]


Introducing ggplot2

# a VERY COOL plotting package for next semester's workshop...
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  facet_grid(. ~ am) +
  theme_bw()

[Figure: scatterplots of mpg against wt with fitted lines, in two panels: automatic and manual]


Multiple Linear Regression: Fit the Model

# add several variables to the model
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
summary(linreg2)

##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## hp          -0.01804    0.01188  -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
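A natural follow-up is to ask whether the extra predictors improve on the weight-only model. One way to check, sketched here with base R's anova() for nested models:

```r
linreg  <- lm(mpg ~ wt, data = mtcars)
linreg2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# F-test: do cyl and hp together improve on the weight-only model?
comparison <- anova(linreg, linreg2)
comparison

# Adjusted R-squared penalizes the extra predictors
adj_r2 <- c(simple   = summary(linreg)$adj.r.squared,    # about 0.745
            multiple = summary(linreg2)$adj.r.squared)   # about 0.826
```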


Multiple Linear Regression: Residual Diagnostics

[Figure: "Distribution of Studentized Residuals"; x-axis: sresid (−2 to 3), y-axis: Density (0.0 to 0.5)]
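The code for this figure is not shown on the slide. A plot of this kind could be produced roughly as follows; this sketch assumes base R's rstudent() for the studentized residuals, which may differ from the function the author used.

```r
linreg2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# Studentized residuals (rstudent() is in base R's stats package)
sresid <- rstudent(linreg2)

hist(sresid, freq = FALSE,
     main = "Distribution of Studentized Residuals")

# Overlay a standard normal curve for comparison
xfit <- seq(min(sresid), max(sresid), length.out = 40)
lines(xfit, dnorm(xfit))

# Base R's four standard diagnostic plots
par(mfrow = c(2, 2))
plot(linreg2)
par(mfrow = c(1, 1))
```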


Logistic Regression: Fit the Model

# run the logistic regression (outcome has 2 levels)
logreg = glm(am ~ mpg,
             data = mtcars,
             family = binomial(link = "logit"))

summary(logreg)

##
## Call:
## glm(formula = am ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5701 -0.7531 -0.4245  0.5866  2.0617
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 29.675 on 30 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
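The coefficients are on the log-odds scale, so they are usually easier to read after exponentiating or converting to probabilities. A sketch using the built-in mtcars (am coded 0 = automatic, 1 = manual); the mpg values in new_cars are hypothetical.

```r
logreg <- glm(am ~ mpg, data = mtcars, family = binomial(link = "logit"))

# Odds ratio: each extra mpg multiplies the odds of a manual transmission
odds_ratio <- exp(coef(logreg))["mpg"]     # about 1.36

# Predicted probability of manual for hypothetical cars at 15, 20, and 25 mpg
new_cars    <- data.frame(mpg = c(15, 20, 25))
prob_manual <- predict(logreg, newdata = new_cars, type = "response")
round(prob_manual, 2)
```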


Logistic Regression: Visualize the Fit

[Figure: "Motor Trend Car Road Tests"; x-axis: Miles/(US) gallon (10 to 30), y-axis: Transmission, Automatic vs. Manual (0.0 to 1.0)]


Other Generalized Regression Models

# Can do other distributions and links
poisreg = glm(carb ~ hp,
              data = mtcars,
              family = poisson(link="log"))

summary(poisreg)

##
## Call:
## glm(formula = carb ~ hp, family = poisson(link = "log"), data = mtcars)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -0.86441 -0.55608 -0.07877  0.21395  1.49103
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.148971   0.265018   0.562    0.574
## hp          0.005517   0.001387   3.977 6.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 27.043 on 31 degrees of freedom
## Residual deviance: 12.279 on 30 degrees of freedom
## AIC: 105.64
##
## Number of Fisher Scoring iterations: 4
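With a log link, exponentiated coefficients are multiplicative effects on the expected count. The sketch below also includes a rough dispersion check (residual deviance divided by its degrees of freedom), which is a common rule of thumb rather than a formal test.

```r
poisreg <- glm(carb ~ hp, data = mtcars, family = poisson(link = "log"))

# Multiplicative change in the expected count per 1 hp, and per 50 hp
rate_ratio_1  <- exp(coef(poisreg)["hp"])        # about 1.0055
rate_ratio_50 <- exp(50 * coef(poisreg)["hp"])   # about 1.32

# Rough dispersion check: values well above 1 would suggest overdispersion
dispersion <- deviance(poisreg) / df.residual(poisreg)   # 12.279 / 30, about 0.41
```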
