How to Speak R Getting to Know Your Data Fitting Statistical Models
Introduction to R: for Absolute Beginners
Office of Methodological & Data Sciences
Sarah Schwartz¹
BNR 278
12:30 pm - 3:20 pm, October 2, 2012
¹ EDUC 455, (435)797-0169, [email protected] or [email protected], http://www.cehs.usu.edu/research/omds
Download & Install 2 pieces of free software
Video walk-through of both installations (link: HERE)
Accept all defaults
https://www.r-project.org/
• Install first
• “Software Environment”
• The brain
• We won’t work directly with it
https://www.rstudio.com/
• Install second
• “User Interface”
• The go-between for us
• Auto completes & color codes
Helpful Websites
Tutorials by William B. King, PhD, Coastal Carolina University: http://ww2.coastal.edu/kingw/statistics/R-tutorials
RexRepos, R Example Repository: http://www.uni-kiel.de/psychologie/rexrepos
R-bloggers, R news & tutorials, broad coverage: http://www.r-bloggers.com
Quick-R, accessing the power of R, includes some graphs: http://www.statmethods.net
Psychology, using R for psychological research: http://personality-project.org/r
Outline
How to Speak R
• Nuts & Bolts
• Using Add-on Packages
• How to Read in YOUR Own Data
Getting to Know Your Data
• Numeric Summaries
• Graphical Summaries
Fitting Statistical Models
• Motor Trend Car Road Tests
• Comparing Group Centers
• Regression Models
RStudio Workspace
Other User Interfaces Exist...
R Commander (Rcmdr) http://www.rcommander.com
Basic Calculations
prompt: in the console, command-line
case sensitive: ‘anova’ is not the same as ‘ANOVA’
comment lines: use the # symbol at least once
1 + 3 #### addition
## [1] 4
16 / 2 #### division
## [1] 8
5 ^ 2 ###### powers
## [1] 25
sqrt(144) # square root
## [1] 12
log(1.3) #### natural logarithm
## [1] 0.2623643
Create & Remove Objects
# ALL OF THESE DO THE SAME THING
x=7
x = 7
x= 7
x   =   7
x = # Press Enter here.
7 # Press Enter again.
# TWO WAYS TO ASSIGN OBJECTS
Aval = 7 # use the equal
B.val = 15 # names: no spaces
Cval <- 10 # use an arrow
ls() # list environment
## [1] "Aval" "B.val" "Cval" "x"
# YOU CAN REMOVE OBJECTS AFTER CREATING THEM
rm(B.val) # remove from environment
ls() # list the environment
## [1] "Aval" "Cval" "x"
Aval # what is assigned to this?
## [1] 7
aval # CAPS MATTER!!!
## Error in eval(expr, envir, enclos): object ’aval’ not found
A double-equal tests for equivalence:
5 == 6 # are these equal?
## [1] FALSE
3 < 10 # 'less than'
## [1] TRUE
1 < 2 | 2 == 3 # '|' means 'or'
## [1] TRUE
Aval < Cval # can test objects
## [1] TRUE
# Create a vector with "combine"
vec1 = c(1, 2, 7, 3, 2, -3)
# Are there ANY TWOs?
2 %in% vec1
## [1] TRUE
# test EACH VALUE to see if it is TWO
2 == vec1
## [1] FALSE TRUE FALSE FALSE
## [5] TRUE FALSE
# COUNT the number of TWOs
sum(2 == vec1)
## [1] 2
Some Possible CLASSES of R Objects
Individual VALUES:
numeric number values
logical either ‘TRUE’ (codes to 1) or ‘FALSE’ (codes to 0)
factor categorical levels, nominal or ordinal
character text (called ‘string’ in SPSS)
Data OBJECTS:
vector a 1-dimensional listing of single elements
matrix a 2-dimensional array of elements (rows & columns)
data.frame a matrix-like table with more formatting (nice labels); columns may have different classes
x = 1:5
class(x)
## [1] "integer"
x
## [1] 1 2 3 4 5
y = x / 3
class(y)
## [1] "numeric"
y
## [1] 0.3333333
## [2] 0.6666667
## [3] 1.0000000
## [4] 1.3333333
## [5] 1.6666667
z = x > 4
class(z)
## [1] "logical"
z
## [1] FALSE FALSE
## [3] FALSE FALSE
## [5] TRUE
c = factor(c("m", "m", "f", "f", "m"))
class(c)
## [1] "factor"
c
## [1] m m f f m
## 2 Levels: f ...
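The examples above cover single values and vectors. As a minimal sketch of the three data OBJECTS listed earlier (the names and values here are made up for illustration), the same ideas extend to a matrix and a data.frame:

```r
# A vector: one dimension, all elements share a class
scores = c(90, 85, 72)

# A matrix: rows & columns, still a single class throughout
m = matrix(1:6, nrow = 2, ncol = 3)
dim(m)                    # 2 rows, 3 columns

# A data.frame: named columns that may each have a different class
students = data.frame(name  = c("Ann", "Bo", "Cy"),   # hypothetical values
                      score = scores,
                      pass  = scores > 80)
class(students)
sapply(students, class)   # class of each column
```

A data.frame is usually what you want for a dataset with one row per subject and one column per variable.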
Finding a Function
If you’re not sure of a function’s name, use ‘apropos’ to search for it:
apropos("round")
## [1] "round"
## [2] "round.Date"
## [3] "round.POSIXt"
Then you can search for the name of the function in the HELP tab of RStudio (or use Google).
apropos("mean")
## [1] ".colMeans"
## [2] ".rowMeans"
## [3] "colMeans"
## [4] "kmeans"
## [5] "mean"
## [6] "mean.Date"
## [7] "mean.default"
## [8] "mean.difftime"
## [9] "mean.POSIXct"
## [10] "mean.POSIXlt"
## [11] "rowMeans"
## [12] "weighted.mean"
You can use the Help tab in RStudio to find out about a function.
# Ask for the function's arguments
args(round)
## function (x, digits = 0)
## NULL
round(2.4)
## [1] 2
ceiling(2.4)
## [1] 3
floor(2.4)
## [1] 2
round(2.7)
## [1] 3
ceiling(2.7)
## [1] 3
floor(2.7)
## [1] 2
Missing Values
data = c(1, 0, 2, 5, NA)
is.na(data)
## [1] FALSE FALSE FALSE
## [4] FALSE TRUE
anyNA(data)
## [1] TRUE
Different functions have different default ways to handle missing values. Use the HELP to determine what the default is and how to change it.
1 + 0 + 2 + 5
## [1] 8
mean(data)
## [1] NA
mean(data, na.rm = TRUE)
## [1] 2
sd(data)
## [1] NA
sd(data, na.rm = TRUE)
## [1] 2.160247
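A short sketch of two other common ways to inspect and drop missing values, using the same vector as above (na.omit() is base R; the object name `complete` is just for illustration):

```r
data = c(1, 0, 2, 5, NA)      # same vector as above

sum(is.na(data))              # how many values are missing?
which(is.na(data))            # which positions are missing?

complete = na.omit(data)      # drop the missing values
length(complete)
mean(complete)                # same as mean(data, na.rm = TRUE)
```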
R Base vs. External Packages
When you download R, you are only getting the base functions. This is a relatively small collection of functions, but it keeps R running fast.
# included in R base:
summary(data) # basic summary statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    0.75    1.50    2.00    2.75    5.00       1
table(data) # tabulates categoricals
## data
## 0 1 2 5
## 1 1 1 1
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.
By only downloading and installing the packages you need, on a project-by-project basis, R uses less storage space on your hard drive and less active memory.
Thousands of packages are available for download and installation. Many are vetted and distributed by CRAN, others are available on GitHub, or you can create & share packages on an individual level.
Install Download to your computer’s hard drive ONLY ONCE
Load Activate the package’s library EVERY session
# Code for installing all the
# packages in this document
# (the names must be combined into ONE vector with c())
install.packages(c("psych", "xlsx", "haven", "lattice", "MASS", "ggplot2", "popbio", "beeswarm", "car"))
NOTE: when you download your first package, select a mirror (a proxy server)
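Since installing is needed ONLY ONCE, it can help to check what is already installed first. A minimal sketch, assuming a helper called `missing_packages` (a made-up name, not part of R or any package):

```r
# Hypothetical helper: which of these packages are NOT installed yet?
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted = c("psych", "ggplot2")
to_install = missing_packages(wanted)
to_install                       # character(0) if everything is installed

# Install only what is missing (uncomment to run; needs internet access)
# if (length(to_install) > 0) install.packages(to_install)
```

After installing, each package still needs library() once per session before its functions can be used.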
The ‘psych’ Package
This has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory, a draft of which is available at http://personality-project.org/r/book.
# 'LOAD' or 'activate' the package
library(psych)
This package has a nice feature for reading in data from your clipboard:
1. Highlight the data in Excel, including the first row with variable names
2. ‘Copy’ the selection, moving the information to the clipboard
3. Run the code below to store it in R as an object named bfi
# International Personality Item Pool
bfi = read.clipboard.tab()
The data are 25 personality self-report items taken from the International Personality Item Pool (http://ipip.ori.org); they were included as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project (http://SAPA-project.org).
5 Items x 5 Factors
• Agreeableness
• Conscientiousness
• Extraversion
• Neuroticism
• Openness
Response Scale
1. Very Inaccurate
2. Moderately Inaccurate
3. Slightly Inaccurate
4. Slightly Accurate
5. Moderately Accurate
6. Very Accurate
Demographic
• gender
• education
• age
Investigate the Form of Your Data
class(bfi) # you probably want a data.frame
## [1] "data.frame"
dim(bfi) # rows (subjects) & columns (variables)
## [1] 2800 28
names(bfi) # columns should have variable names
## [1] "A1" "A2" "A3" "A4" "A5" "C1" "C2"
## [8] "C3" "C4" "C5" "E1" "E2" "E3" "E4"
## [15] "E5" "N1" "N2" "N3" "N4" "N5" "O1"
## [22] "O2" "O3" "O4" "O5" "gender" "education" "age"
table(complete.cases(bfi)) # are the cases complete? (no missing values)
##
## FALSE  TRUE
##   564  2236
Declare Categorical Variables - GENDER
# look at the raw form: 4 ways to designate a variable
bfi[, 26] # designate column number...
bfi[, c("gender")] # ...or column name...
bfi["gender"] # ...all pick the same column (this form keeps it as a data.frame)...
bfi$gender # ...this is the most common
class(bfi$gender) # the variable's "class"
## [1] "integer"
head(bfi$gender) # look at top cases
## [1] 1 2 2 2 1 2
summary(bfi$gender) # how does it get summarized?
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.672   2.000   2.000
table(bfi$gender) # what does "table" do?
##
##    1    2
##  919 1881
Declare Categorical Variables - GENDER
# define it as categorical: FACTOR is "nominal"
bfi$gender = factor(bfi$gender, labels = c("male", "female"))
# now it's ready to go
class(bfi$gender) # did the "class" change?
## [1] "factor"
head(bfi$gender) # does it look different?
## [1] male female female female male female
## Levels: male female
summary(bfi$gender) # is the summary the same?
##   male female
##    919   1881
levels(bfi$gender) # this lists the LABELS
## [1] "male" "female"
Declare Categorical Variables - EDUCATION
table(bfi$education) # look at the raw form
##
##    1    2    3    4    5
##  224  292 1249  394  418
# define as categorical: ORDERED is "ordinal"
bfi$education = ordered(bfi$education,
                        labels = c("<HS", "HS", "HS+ ", "degree", "grad+"))
# now it's ready to go
head(bfi$education, n = 15)
## [1] <NA> <NA> <NA> <NA> <NA> HS+ <NA> HS <HS <NA> <HS <NA> <NA> <NA> <HS
## Levels: <HS < HS < HS+ < degree < grad+
summary(bfi$education)
##    <HS     HS   HS+  degree  grad+   NA's
##    224    292   1249    394    418    223
levels(bfi$education)
## [1] "<HS" "HS" "HS+ " "degree" "grad+"
bfi[1:3, ] # specify rows (subjects) in FRONT of the comma
##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5 gender
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3   male
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3 female
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2 female
##       education age
## 61617      <NA>  16
## 61618      <NA>  18
## 61620      <NA>  17
bfi[1:4, 1:7] # specify columns (variables) AFTER the comma
##       A1 A2 A3 A4 A5 C1 C2
## 61617  2  4  3  4  4  2  3
## 61618  2  4  5  2  5  5  4
## 61620  5  4  5  4  4  4  5
## 61621  4  4  6  5  5  4  4
# ...or list the names of the variables (after comma)
bfi[1:3, c("A1", "A2", "A3", "A4", "A5", "gender", "education", "age")]
##       A1 A2 A3 A4 A5 gender education age
## 61617  2  4  3  4  4   male      <NA>  16
## 61618  2  4  5  2  5 female      <NA>  18
## 61620  5  4  5  4  4 female      <NA>  17
Saving a Reduced Dataset
# suppose I'm only interested in subjects under the age of 35
table(bfi$age < 35)
##
## FALSE  TRUE
##   738  2062
# AND I only want to keep a few variables (for demo)
bfiA = bfi[bfi$age < 35,
           c("A1", "A2", "A3", "A4", "A5", "gender", "education", "age")]
dim(bfiA) # rows & columns of the reduced dataset
## [1] 2062 8
headTail(bfiA) # see a few lines from top and bottom
##        A1  A2  A3  A4  A5 gender education age
## 61617   2   4   3   4   4   male      <NA>  16
## 61618   2   4   5   2   5 female      <NA>  18
## 61620   5   4   5   4   4 female      <NA>  17
## 61621   4   4   6   5   5 female      <NA>  17
## ...   ... ... ... ... ...   <NA>      <NA> ...
## 67551   6   1   3   3   3   male       HS+  19
## 67552   2   4   4   3   5   male    degree  27
## 67556   2   3   5   2   5 female    degree  29
## 67559   5   2   2   4   4   male    degree  31
How to Read in YOUR Own Data
Before you can load your data, you need to tell R where to look.
# get the working directory
getwd()
## [1] "C:/Users/A00315273/Box Sync/Office of Research Services/OMDS/OMDS Workshops/OMDS intro to R"
Notice: you need to use forward slashes (/) instead of backslashes (\)
# change the working directory to YOUR COMPUTER!!!
setwd("C:/Users/A00315273/OMDSworkshop")
If the data is stored in a TEXT file, comma delimited...
# these functions are part of BASE R
myData = read.table("data.txt", header = TRUE) # whitespace-delimited text
myData = read.csv("data.csv", header = TRUE) # comma-delimited text
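To try read.csv() without a data file of your own, the following sketch writes a tiny made-up dataset to a temporary file and reads it back in. The object names and values are assumptions for illustration only.

```r
# Build a small made-up data set (hypothetical values, for illustration only)
demo = data.frame(id    = 1:3,
                  score = c(4.5, 3.0, 5.5),
                  group = c("a", "b", "a"))

# Write it to a temporary comma-delimited file...
csvFile = tempfile(fileext = ".csv")
write.csv(demo, csvFile, row.names = FALSE)

# ...and read it back in, as you would with your own file
myData = read.csv(csvFile, header = TRUE)
dim(myData)      # rows & columns
names(myData)    # variable names from the first row
str(myData)      # class of each column
```

With your own file, replace `csvFile` with the file name (in the working directory) or its full path.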
Best Practices: DataSet in Excel
Often, you may enter your data into Excel.
Make sure the FIRST ROW contains the names of variables.
Names, Values, & Fields
• FIRST variable is unit identification
• NEVER use white SPACES
• AVOID symbols or punctuation: ? [ } * $ %
• USE . or _ to push words together
• KEEP it short, but meaningful
• ALWAYS use numbers over text
• LEAVE missing cells blank (not .)
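A small sketch of why blank cells work well: when a comma-delimited file is read in, an empty numeric cell becomes NA. The data below are typed inline and are made up for illustration.

```r
# A tiny comma-delimited "file" typed inline (hypothetical data);
# subject 2 has a blank cell for first.score
csvText = "id,first.score,age_years
1,5,21
2,,34
3,7,19"

myData = read.csv(text = csvText, header = TRUE)
myData
is.na(myData$first.score)                  # the blank cell is now NA
mean(myData$first.score, na.rm = TRUE)     # average of the two observed scores
```

The variable names follow the rules above: no spaces, with . or _ pushing words together.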
Read in Data from Excel Files
[Figure: two example data layouts, labeled "Bad Example" and "Much Better!"]
Read in Data from Excel Files
# there's a package for that!
# "Read, write, format Excel 2007 (xlsx) files"
library(xlsx)
# read.xlsx tries to guess variable classes
# read.xlsx2 is faster on bigger datasets
myData = read.xlsx("data.xlsx",
                   sheetIndex = 1, # or use sheetName, instead
                   header = TRUE)  # TRUE if 1st row = names
NOTE: If you are having problems with Excel datasets, try saving the sheet as a “.csv” file (comma delimited) and use the read.csv function in base R.
Read in Data from SPSS, SAS, & Stata Files
# New package this summer...Hadley Wickham is my HERO!
library(haven)
# Currently haven can read and write:
# logical, integer, numeric, character and factors
# SPSS: Supports both sav & por files
myData = read_spss("data.sav")
myData = read_sav("data.sav")
myData = read_por("data.por")
# SAS: Supports both sas7bdat & sas7bcat files
myData = read_sas("data.sas7bdat")
# Stata
myData = read_stata("data.dta")
myData = read_dta("data.dta")
# NOTE all labeled variables are a new class: "labelled"
# ... use as_factor() to treat the variable as categorical
# ... use zap_labels() to treat the variable as continuous
Mean, Standard Deviation, Etc...
# descriptives on all variables
describe(bfiA)
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 2053  2.52 1.42      2    2.36 1.48   1   6     5  0.73    -0.44 0.03
## A2            2 2040  4.75 1.20      5    4.92 1.48   1   6     5 -1.07     0.86 0.03
## A3            3 2048  4.57 1.31      5    4.75 1.48   1   6     5 -0.97     0.39 0.03
## A4            4 2048  4.59 1.54      5    4.81 1.48   1   6     5 -0.91    -0.29 0.03
## A5            5 2050  4.50 1.26      5    4.64 1.48   1   6     5 -0.79     0.07 0.03
## gender*       6 2062  1.66 0.47      2    1.70 0.00   1   2     1 -0.68    -1.54 0.01
## education*    7 1853  3.09 1.06      3    3.11 0.00   1   5     4 -0.04    -0.03 0.02
## age           8 2062 23.16 5.22     22   22.98 5.93   3  34    31  0.25    -0.59 0.12
Mean, Standard Deviation, Etc...
# split by a grouping variable
describeBy(bfiA, bfiA$gender)
## group: male
##            vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 699  2.81 1.43      3    2.71 1.48   1   6     5  0.48    -0.75 0.05
## A2            2 691  4.46 1.30      5    4.61 1.48   1   6     5 -0.88     0.25 0.05
## A3            3 695  4.38 1.30      5    4.52 1.48   1   6     5 -0.78     0.01 0.05
## A4            4 697  4.31 1.51      5    4.45 1.48   1   6     5 -0.64    -0.62 0.06
## A5            5 695  4.35 1.33      5    4.49 1.48   1   6     5 -0.74    -0.13 0.05
## gender*       6 699  1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN 0.00
## education*    7 626  3.11 1.15      3    3.14 1.48   1   5     4 -0.04    -0.40 0.05
## age           8 699 22.83 5.04     22   22.63 4.45   3  34    31  0.27    -0.29 0.19
## -------------------------------------------------------------------
## group: female
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 1354  2.37 1.39      2    2.17 1.48   1   6     5  0.88    -0.16 0.04
## A2            2 1349  4.90 1.12      5    5.07 1.48   1   6     5 -1.16     1.22 0.03
## A3            3 1353  4.67 1.31      5    4.86 1.48   1   6     5 -1.10     0.68 0.04
## A4            4 1351  4.74 1.53      5    4.99 1.48   1   6     5 -1.08     0.04 0.04
## A5            5 1355  4.58 1.22      5    4.71 1.48   1   6     5 -0.80     0.14 0.03
## gender*       6 1363  2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN 0.00
## education*    7 1227  3.08 1.02      3    3.09 0.00   1   5     4 -0.04     0.21 0.03
## age           8 1363 23.32 5.31     23   23.17 5.93   9  34    25  0.23    -0.73 0.14
Cross Tabulations & χ² Test for Independence
# split by a grouping variable
# If a variable is included on the left side of the formula,
# it is assumed to be a vector of frequencies
edXgender = xtabs(~ education + gender, data = bfiA)
edXgender
##          gender
## education male female
##    <HS      71    109
##    HS       70    121
##    HS+     303    691
##    degree   84    169
##    grad+    98    137
# chi-squared test for independence
chisq.test(edXgender)
##
##  Pearson's Chi-squared test
##
## data:  edXgender
## X-squared = 14.746, df = 4, p-value = 0.005258
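To see where the chi-squared statistic comes from, here is a sketch that re-enters the counts from the table above as a plain matrix (so it runs without the bfi data) and pulls out the expected counts and residuals. The matrix is built by hand here; in the workshop code it comes from xtabs().

```r
# Re-enter the counts from the education-by-gender table above
edXgender = matrix(c( 71, 109,
                      70, 121,
                     303, 691,
                      84, 169,
                      98, 137),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(education = c("<HS", "HS", "HS+", "degree", "grad+"),
                                   gender    = c("male", "female")))

test = chisq.test(edXgender)
test$statistic                 # same X-squared as reported above
round(test$expected, 1)        # counts expected if education & gender were independent
round(test$residuals, 2)       # Pearson residuals: (observed - expected) / sqrt(expected)
```

Cells with large residuals (in absolute value) contribute the most to the test statistic.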
Correlation Matrix
How strong is the association between the 5 Agreeableness Items?
# reduce the dataset for ease of demonstration
bfiAonly = bfi[, c("A1", "A2", "A3", "A4", "A5")]
# GET CORRELATION VALUES
cor(bfiAonly, use = "pairwise.complete.obs")
##            A1         A2         A3         A4         A5
## A1  1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
## A2 -0.3401932  1.0000000  0.4850980  0.3350872  0.3900836
## A3 -0.2652471  0.4850980  1.0000000  0.3604283  0.5041411
## A4 -0.1464245  0.3350872  0.3604283  1.0000000  0.3075373
## A5 -0.1814383  0.3900836  0.5041411  0.3075373  1.0000000
round(cor(bfiAonly, use = "pairwise.complete.obs"), 3)
##        A1     A2     A3     A4     A5
## A1  1.000 -0.340 -0.265 -0.146 -0.181
## A2 -0.340  1.000  0.485  0.335  0.390
## A3 -0.265  0.485  1.000  0.360  0.504
## A4 -0.146  0.335  0.360  1.000  0.308
## A5 -0.181  0.390  0.504  0.308  1.000
Correlation Matrix with p-values
corr.test(bfiAonly,
          adjust = "none",
          method = "spearman")
## Call:corr.test(x = bfiAonly, method = "spearman", adjust = "none")
## Correlation matrix
##       A1    A2    A3    A4    A5
## A1  1.00 -0.37 -0.30 -0.16 -0.22
## A2 -0.37  1.00  0.50  0.34  0.40
## A3 -0.30  0.50  1.00  0.36  0.53
## A4 -0.16  0.34  0.36  1.00  0.31
## A5 -0.22  0.40  0.53  0.31  1.00
## Sample Size
##      A1   A2   A3   A4   A5
## A1 2784 2757 2759 2767 2769
## A2 2757 2773 2751 2758 2757
## A3 2759 2751 2774 2759 2758
## A4 2767 2758 2759 2781 2765
## A5 2769 2757 2758 2765 2784
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##    A1 A2 A3 A4 A5
## A1  0  0  0  0  0
## A2  0  0  0  0  0
## A3  0  0  0  0  0
## A4  0  0  0  0  0
## A5  0  0  0  0  0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Correlation Matrix Visualize
A picture can be worth a thousand words
cor.plot(cor(bfiAonly, use = "pairwise.complete.obs", method = "spearman"))
[Figure: "Correlation plot" of items A1-A5 (rows and columns), shaded on a scale from -1 to 1]
psych’s All-in-One Plot
A picture can be worth a thousand words
# plots pairs of variables
pairs.panels(bfiAonly)
[Figure: pairs.panels plot matrix for A1-A5, showing the pairwise correlations (A1-A2 -0.34, A1-A3 -0.27, A1-A4 -0.15, A1-A5 -0.18, A2-A3 0.49, A2-A4 0.34, A2-A5 0.39, A3-A4 0.36, A3-A5 0.50, A4-A5 0.31)]
Histogram: Defaults vs. Options
# all defaults
hist(bfi$A1)
[Figure: "Histogram of bfi$A1"; x-axis: bfi$A1 (1 to 6); y-axis: Frequency]
# better with some options
hist(bfi$A1,
     breaks = 0.5:6.5,
     main = "This is Much Better",
     xlab = "Item A-1",
     col = "gray")
[Figure: "This is Much Better"; x-axis: Item A-1 (1 to 6); y-axis: Frequency]
Histogram: Use More Code!
[Figure: histogram titled "Ready for Publication"; x-axis: ''Am indifferent to the feelings of others'', Agreeableness Item #1 (q.1146), labeled from Very Inaccurate to Very Accurate; y-axis: Frequency (0 to 1000)]
Density Plot: Continuous Distribution
# one way to put two plots on the same page
par(mfrow=c(1, 2)) # 1 row & 2 columns
hist(bfi$age) # rough distribution
plot(density(bfi$age, na.rm = TRUE)) # smoothed out
[Figure, left panel: "Histogram of bfi$age"; x-axis: bfi$age (0 to 80); y-axis: Frequency]
[Figure, right panel: "density.default(x = bfi$age, na.rm = TRUE)"; x-axis: N = 2800, Bandwidth = 2.047; y-axis: Density]
Density Plot: AGE
[Figure: "Compare to the Normal Curve"; x-axis: Age (0 to 80); y-axis: Proportion; curves: density, normal]
Bar Plot: Categorical Distribution
par(mfrow=c(1, 2)) # 1 row & 2 columns
# one variable at a time (must give it counts!)
barplot(table(bfi$gender))
barplot(table(bfi$education))
[Figure: two bar plots of counts: gender (male, female) and education (<HS, HS, HS+, degree, grad+)]
Bar Plot: Compare 2 Categorical Distributions
[Figure: "Synthetic Aperture Personality Assessment (SAPA)"; x-axis: Highest Level of Education (<HS, HS, HS+, degree, grad+); y-axis: Frequency; bars grouped by gender (male, female)]
Boxplots: GENDER & EDUCATION
par(mfrow=c(1, 2)) # 1 row & 2 columns
# all together
boxplot(bfiA$age)
# split by education groups
boxplot(bfi$age ~ bfi$education)
[Figure: two boxplots of age: all together (bfiA), and split by education level (bfi)]
Boxplots: Use More Options
# reset to one plot per page
par(mfrow=c(1, 1))
# make it look better
boxplot(age ~ education, data = bfi,
        col = heat.colors(5),
        main = "Build a Better Boxplots",
        xlab = "Highest Education Obtained",
        ylab = "Age (years)")
[Figure: "Build a Better Boxplots"; x-axis: Highest Education Obtained (<HS, HS, HS+, degree, grad+); y-axis: Age (years)]
Boxplots: AGE & EDUCATION
[Figure: "Compare the Genders"; x-axis: Highest Education Obtained (<HS, HS, HS+, degree, grad+); y-axis: Age (years); boxes split by gender (male, female)]
Scatterplots: Display Associations
Jitter the education level so dots don’t cover each other so much.
# put 3 plots in one row/page
par(mfrow = c(1, 3))
plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 0.25),
     main = "factor = 0.25")
plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 1),
     main = "factor = 1")
plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 2),
     main = "factor = 2")
[Figure: three scatterplots titled factor = 0.25, factor = 1, and factor = 2; x-axis: bfi$age (0 to 80); y-axis: jittered as.numeric(bfi$education) (1 to 5)]
Scatterplots: AGE & EDUCATION
[Figure: "Jitter the Ordinal Variable"; x-axis: Age (years); y-axis: Education (<HS, HS, HS+, degree, grad+)]
Bubble Plot: Helpful with Overplotting
If you can dream of a type of plot, you can create it!
# aggregate the data
bfiAag = aggregate(bfiA,
                   by = list(bfiA$A1, bfiA$A2),
                   length)
# circle's area ~ number of points
symbols(bfiAag$Group.1,
        bfiAag$Group.2,
        circles = sqrt(bfiAag$A1/pi)/50,
        inches = FALSE,
        main = "Bubble Plot",
        xlab = "item A1",
        ylab = "item A2")
[Figure: "Bubble Plot"; x-axis: item A1 (1 to 6); y-axis: item A2 (1 to 6)]
Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission
gear Number of forward gears
carb Number of carburetors
Load car Package & the mtcars Data
# Load a New Package:
library(car) # "Companion to Applied Regression" (a textbook)
data(mtcars) # Make the Data Set Active in the Environment (mtcars ships with base R)
# check out the data
dim(mtcars)
## [1] 32 11
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
# set the categorical variables
mtcars$vs = factor(mtcars$vs, labels = c("v", "s"))
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))
headTail(mtcars)
##                 mpg cyl disp  hp drat   wt  qsec   vs        am gear carb
## Mazda RX4        21   6  160 110  3.9 2.62 16.46    v    manual    4    4
## Mazda RX4 Wag    21   6  160 110  3.9 2.88 17.02    v    manual    4    4
## Datsun 710     22.8   4  108  93 3.85 2.32 18.61    s    manual    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.44    s automatic    3    1
## ...             ... ...  ... ...  ...  ...   ... <NA>      <NA>  ...  ...
## Ford Pantera L 15.8   8  351 264 4.22 3.17  14.5    v    manual    5    4
## Ferrari Dino   19.7   6  145 175 3.62 2.77  15.5    v    manual    5    6
## Maserati Bora    15   8  301 335 3.54 3.57  14.6    v    manual    5    8
## Volvo 142E     21.4   4  121 109 4.11 2.78  18.6    s    manual    4    2
summary(mtcars)
##       mpg             cyl             disp             hp             drat
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec       vs             am          gear            carb
##  Min.   :1.513   Min.   :14.50   v:18   automatic:19   Min.   :3.000   Min.   :1.000
##  1st Qu.:2.581   1st Qu.:16.89   s:14   manual   :13   1st Qu.:3.000   1st Qu.:2.000
##  Median :3.325   Median :17.71                         Median :4.000   Median :2.000
##  Mean   :3.217   Mean   :17.85                         Mean   :3.688   Mean   :2.812
##  3rd Qu.:3.610   3rd Qu.:18.90                         3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.424   Max.   :22.90                         Max.   :5.000   Max.   :8.000
Test Central Differences in 2 Independent Groups
# find the means
describeBy(mtcars$mpg, mtcars$am)
## group: automatic
##   vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 19 17.15 3.83   17.3   17.12 3.11 10.4 24.4    14 0.01     -0.8 0.88
## -------------------------------------------------------------------
## group: manual
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 13 24.39 6.17   22.8   24.38 6.67  15 33.9  18.9 0.05    -1.46 1.71
# view the two groups side-by-side
boxplot(mpg ~ am, data = mtcars, horizontal = TRUE)
[Figure: horizontal boxplots of mpg (10 to 30) by transmission (automatic, manual)]
Test Central Differences in 2 Independent Groups
PARAMETRIC t-test for means, assumes normality
t.test(mpg ~ am, data = mtcars)
##
##  Welch Two Sample t-test
##
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual
##                17.14737                24.39231
NON-PARAMETRIC Mann-Whitney U Test, based on ranks
wilcox.test(mpg ~ am, data = mtcars)
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
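The test results can also be stored as an object so that individual pieces can be pulled out. A minimal sketch using the built-in mtcars data (the object names `welch` and `student` are made up here); it also shows the classic equal-variance version for comparison:

```r
data(mtcars)                               # built-in data set

# 'am' is 0/1 in the raw data; label it as in the slides
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))

welch = t.test(mpg ~ am, data = mtcars)    # default: unequal variances (Welch)
welch$statistic                            # t value
welch$p.value                              # p-value
welch$conf.int                             # 95% CI for the difference in means

# Classic Student's t-test: assumes the two groups have equal variances
student = t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
student$parameter                          # df = n1 + n2 - 2
```

Use names(welch) to see everything that is stored in the result.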
More than Two Groups?
# plot to investigate
boxplot(drat ~ cyl,
        data = mtcars,
        main = "Between vs. Within",
        xlab = "Number of Cylinders",
        ylab = "Rear Axle Ratio",
        col = "light gray")
grid()
# we can use another package
library(beeswarm)
stripchart(drat ~ cyl,
           data = mtcars,
           vertical = TRUE,
           method = 'jitter',
           jitter = 0.2,
           cex = 1,
           pch = 16,
           col = c("red", "blue", "dark green"),
           add = TRUE)
[Figure: "Between vs. Within"; x-axis: Number of Cylinders (4, 6, 8); y-axis: Rear Axle Ratio (3.0 to 5.0); boxplots with jittered points overlaid]
ANOVA
# run the ANOVA
# NOTE: cyl is stored as a number here, so it is fit as ONE slope (Df = 1);
# declare it with factor(cyl) to compare the three cylinder groups
anova1 = aov(drat ~ cyl, data = mtcars)
summary(anova1)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342   28.81 8.24e-06 ***
## Residuals   30  4.521   0.151
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# to get type III sums of squares
Anova(anova1, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df F value    Pr(>F)
## (Intercept) 57.217  1 379.714 < 2.2e-16 ***
## cyl          4.342  1  28.814 8.245e-06 ***
## Residuals    4.521 30
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
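In the tables above cyl has Df = 1, because cyl is stored as a number and is fit as a single slope. To compare the three cylinder groups (4, 6, 8) as groups, one option is to declare it as a factor first. A minimal sketch; the column name `cylF` and the object `anovaF` are made up for illustration:

```r
data(mtcars)

class(mtcars$cyl)                       # "numeric": treated as a continuous predictor

# Declare cylinders as a grouping variable with 3 levels
mtcars$cylF = factor(mtcars$cyl)        # cylF is a new, hypothetical column name
levels(mtcars$cylF)

anovaF = aov(drat ~ cylF, data = mtcars)
summary(anovaF)                         # cylF now has Df = 2 (3 groups - 1)

# Follow-up: which pairs of groups differ?
TukeyHSD(anovaF)
```

With Df = 2 the ANOVA lines up with the Kruskal-Wallis test in this document, which also reports df = 2.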
ANCOVA
# add a continuous covariate
anova2 = aov(drat ~ cyl + wt, data = mtcars)
summary(anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342  32.284 3.83e-06 ***
## wt           1  0.620   0.620   4.613   0.0402 *
## Residuals   29  3.900   0.134
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(anova2, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df  F value  Pr(>F)
## (Intercept) 56.578  1 420.6933 < 2e-16 ***
## cyl          0.464  1   3.4493 0.07346 .
## wt           0.620  1   4.6129 0.04022 *
## Residuals    3.900 29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Kruskal Wallis Test
# non-parametric version: uses ranks instead of means
kruskal.test(drat ~ cyl, data = mtcars)
##
##  Kruskal-Wallis rank sum test
##
## data:  drat by cyl
## Kruskal-Wallis chi-squared = 14.395, df = 2, p-value = 0.0007486
Simple Linear Regression: Fit Model
# Simple Linear Regression
linreg = lm(mpg ~ wt, data = mtcars)
slr = summary(linreg)
slr
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
summary(linreg)$r.squared
## [1] 0.7528328
summary(linreg)$adj.r.squared
## [1] 0.7445939
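Once the model is fit, it can be used for prediction. A short sketch with the same model; the new car weighing 3,000 lb is a made-up example:

```r
data(mtcars)
linreg = lm(mpg ~ wt, data = mtcars)     # same model as above

coef(linreg)                             # intercept & slope
confint(linreg)                          # 95% confidence intervals

# Predicted mpg for a 3,000 lb car (wt is in 1000s of lb)
newCar = data.frame(wt = 3)              # hypothetical new observation
predict(linreg, newdata = newCar)        # 37.2851 - 5.3445 * 3, about 21.25
predict(linreg, newdata = newCar, interval = "prediction")
```

The column name in `newCar` must match the predictor name used in the model formula.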
Simple Linear Regression: Visualize the Fit
# Plot of relationship and least squares line
plot(mtcars$wt, mtcars$mpg)
abline(linreg, col = "red")
text(x = 2,
     y = 12,
     labels = bquote(~R^2 == .(round(slr$r.squared, 3))),
     col = "red")
text(x = 4.75,
     y = 30,
     labels = bquote(~adj-R^2 == .(round(slr$adj.r.squared, 3))),
     col = "blue")
title(main = "Linear Regression")
grid()
[Figure: "Linear Regression"; x-axis: mtcars$wt (2 to 5); y-axis: mtcars$mpg (10 to 30); fitted line annotated with R² = 0.753 and adj-R² = 0.745]
Introducing ggplot2
# a VERY COOL plotting package for next semester's workshop...
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  facet_grid(. ~ am) +
  theme_bw()
[Figure: mpg vs. wt with fitted lines, in two panels: automatic and manual]
Multiple Linear Regression: Fit the Model
# add several variables to the model
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
summary(linreg2)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## hp          -0.01804    0.01188  -1.519 0.140015
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
Multiple Linear Regression: Residual Diagnostics
[Figure: "Distribution of Studentized Residuals"; x-axis: sresid (-2 to 3); y-axis: Density (0.0 to 0.5)]
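The slide shows only the plot, not the code behind it. The sketch below is one way to draw a similar picture with base R; using rstudent() and an overlaid normal curve is an assumption here, not necessarily how the original figure was made.

```r
data(mtcars)
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)   # same model as above

sresid = rstudent(linreg2)        # studentized residuals (base R; an assumption
                                  # about how the original plot was made)

hist(sresid, freq = FALSE,
     main = "Distribution of Studentized Residuals")
curve(dnorm(x), add = TRUE, col = "red")    # standard normal for reference

# the usual four diagnostic plots on one page
par(mfrow = c(2, 2))
plot(linreg2)
par(mfrow = c(1, 1))
```

If the histogram is roughly bell-shaped and centered at zero, the normality assumption for the residuals looks reasonable.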
Logistic Regression: Fit the Model
# run the logistic regression (outcome has 2 levels)
logreg = glm(am ~ mpg,
             data = mtcars,
             family = binomial(link = "logit"))
summary(logreg)
##
## Call:
## glm(formula = am ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5701  -0.7531  -0.4245   0.5866   2.0617
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
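The coefficients above are on the log-odds scale. A minimal sketch of turning them into odds ratios and predicted probabilities (the mpg values 15, 20, and 25 are made up for illustration):

```r
data(mtcars)
logreg = glm(am ~ mpg, data = mtcars,
             family = binomial(link = "logit"))     # am is 0/1 in the raw data (1 = manual)

exp(coef(logreg))                 # odds ratios: exp(0.3070) is about 1.36 per extra mpg

# Predicted probability of a manual transmission at 15, 20, and 25 mpg
newCars = data.frame(mpg = c(15, 20, 25))           # hypothetical values
predict(logreg, newdata = newCars, type = "response")
```

Without type = "response", predict() returns values on the log-odds scale instead of probabilities.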
Logistic Regression: Visualize the Fit
[Figure: "Motor Trend Car Road Tests"; x-axis: Miles/(US) gallon (10 to 30); y-axis: Transmission, Automatic vs. Manual (0.0 to 1.0)]
Other Generalized Regression Models
# Can do other distributions and links
poisreg = glm(carb ~ hp,
              data = mtcars,
              family = poisson(link = "log"))
summary(poisreg)
##
## Call:
## glm(formula = carb ~ hp, family = poisson(link = "log"), data = mtcars)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.86441  -0.55608  -0.07877   0.21395   1.49103
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.148971   0.265018   0.562    0.574
## hp          0.005517   0.001387   3.977 6.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 27.043  on 31  degrees of freedom
## Residual deviance: 12.279  on 30  degrees of freedom
## AIC: 105.64
##
## Number of Fisher Scoring iterations: 4
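With a log link, the coefficients are easier to read after exponentiating: they become multiplicative changes in the expected count. A short sketch with the same model (the 100 hp step is just an illustrative choice):

```r
data(mtcars)
poisreg = glm(carb ~ hp, data = mtcars,
              family = poisson(link = "log"))       # same model as above

exp(coef(poisreg))                # multiplicative change in expected count per 1 hp
exp(100 * coef(poisreg)["hp"])    # per 100 hp: exp(0.5517), about 1.74
```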