Modeling Social Data, Lecture 3: Data manipulation in R
Data manipulation in R
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 3, 2017
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 1 / 20
The good, the bad, & the ugly
• R isn’t the best programming language out there
• But it happens to be great for data analysis
• The result is a steep learning curve with a high payoff
For instance . . .
• You’ll see a mix of camelCase, this.that, and snake_case conventions
• Dots (.) (mostly) don’t mean anything special
• Likewise, $ gets used in funny ways
• R is loosely typed, which can lead to unexpected coercions and silent failures
• It also tries to be clever about variable scope, which can backfire if you’re not careful
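A minimal sketch of a few of these surprises in base R. The variable names (mixed, settings, and so on) are made up for illustration and are not from the lecture:

```r
# Mixing types in one vector silently coerces everything to the most
# general type, here character.
mixed <- c(1, "2", TRUE)
mixed_class <- class(mixed)            # "character"

# Comparisons coerce too: the number 1 becomes the string "1".
same <- (1 == "1")                     # TRUE

# Logical values act as 0/1 in arithmetic.
total <- TRUE + TRUE                   # 2

# $ does partial name matching on lists, so an abbreviated name still works...
settings <- list(threshold = 0.5)
partial <- settings$thr                # 0.5

# ...while a name that matches nothing returns NULL without any error.
missing_value <- settings$cutoff       # NULL
```

Using settings[["threshold"]] instead of $ requires an exact name match, which avoids the partial-matching surprise.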
But it will help you . . .
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can think of
This will change the way you do data analysis, because you’ll ask questions you wouldn’t have bothered to otherwise
Basic types
• int, double: for numbers
• character: for strings
• factor: for categorical variables (∼ struct or ENUM)
Factors are handy, but take some getting used to
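A short sketch of why factors take getting used to, using made-up labels (sizes and numeric_labels are illustrative names, not from the lecture):

```r
# A character vector of categorical labels (made-up values).
sizes <- c("small", "large", "medium", "small")

# By default, levels are sorted alphabetically.
f_default <- factor(sizes)
default_levels <- levels(f_default)    # "large" "medium" "small"

# Pass levels explicitly to control their order.
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))

# A factor stores integer codes that index into its levels.
codes <- as.integer(f_ordered)         # 1 3 2 1

# Gotcha: converting a factor of numeric-looking labels gives the codes,
# not the numbers. Go through as.character() to recover the values.
numeric_labels <- factor(c("10", "20", "10"))
wrong <- as.numeric(numeric_labels)                 # 1 2 1
right <- as.numeric(as.character(numeric_labels))   # 10 20 10
```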
Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for tables of rectangular data of mixed types (∼ matrix)
We’ll mostly work with data frames, which themselves are lists of vectors
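The three containers can be sketched in base R as follows. All names and values here are invented for illustration:

```r
# vector: every element has the same type.
v <- c(1.5, 2.0, 3.5)

# list: elements can have different types and can be named.
person <- list(name = "Ada", age = 36, languages = c("R", "Python"))
person_name <- person$name             # "Ada"
person_age  <- person[["age"]]         # 36

# data.frame: a table whose columns are equal-length vectors.
df <- data.frame(
  id    = 1:3,
  label = c("a", "b", "c"),
  score = v,
  stringsAsFactors = FALSE
)

# A data frame is itself a list of column vectors.
df_is_list  <- is.list(df)             # TRUE
column_names <- names(df)              # "id" "label" "score"
second_label <- df[["label"]][2]       # "b"
```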
The tidyverse
The tidyverse is a collection of packages that work together to make data analysis easier:
• dplyr for split / apply / combine type counting
• ggplot2 for making plots
• tidyr for reshaping and “tidying” data
• readr for reading and writing files
• ...
Tidy data
The core philosophy is that your data should be in a “tidy” table with:
• One observation per row
• One variable per column
• One measured value per cell
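As a rough illustration, here is a made-up “wide” table reshaped into tidy form. This sketch assumes a recent tidyr that provides pivot_longer(); the table and its column names are hypothetical:

```r
library(tidyr)

# Hypothetical "wide" table: one row per station, one column per month.
# The month columns hold the same variable (trip counts), so it is not tidy.
wide <- data.frame(
  station = c("A", "B"),
  jan = c(100, 80),
  feb = c(120, 90)
)

# Reshape so each row is one (station, month) observation
# and each column is one variable.
tidy <- pivot_longer(wide, cols = c(jan, feb),
                     names_to = "month", values_to = "num_trips")

tidy
```

The tidy version has four rows (two stations times two months) and three columns: station, month, and num_trips.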
Tidy data
• Most of the work goes into getting your data into shape
• After which descriptive statistics, modeling, and visualization are easy
dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
  • filter: restrict rows based on a condition (N → N′)
  • arrange: reorder rows by a variable (N → N′)
  • select: pick out specific columns (K → K′)
  • mutate: create new or change existing columns (K → K′)
  • summarize: collapse a column into one value (N → 1)
• The group_by function creates indices to take care of the split and combine phases
The cost is that you have to think “functionally”, in terms of “vectorized” operations
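To make the row and column bookkeeping concrete, here is a sketch that applies each verb to a tiny stand-in for the trips table and records the resulting dimensions. The data values are made up; only the column names follow the lecture's examples:

```r
library(dplyr)

# Hypothetical stand-in for the trips table; the values are made up.
trips <- data.frame(
  tripduration = c(300, 900, 1500, 600),
  gender = c("F", "M", "F", "M"),
  start_station_name = c("A", "B", "A", "C")
)

shape_original  <- dim(trips)                                   # 4 rows, 3 cols
shape_filter    <- dim(filter(trips, tripduration > 500))       # fewer rows
shape_arrange   <- dim(arrange(trips, tripduration))            # same shape, new order
shape_select    <- dim(select(trips, gender, tripduration))     # fewer columns
shape_mutate    <- dim(mutate(trips, time_in_min = tripduration / 60))  # one more column
shape_summarize <- dim(summarize(trips, mean_duration = mean(tripduration)))  # one row
```

Each verb takes a data frame and returns a new one, leaving trips itself unchanged, which is what thinking “functionally” amounts to here.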
filter
filter(trips, start_station_name == "Broadway & E 14 St")
arrange
arrange(trips, starttime)
select
select(trips, starttime, stoptime,
       start_station_name, end_station_name)
mutate
mutate(trips, time_in_min = tripduration / 60)
summarize
summarize(trips, mean_duration = mean(tripduration) / 60,
          sd_duration = sd(tripduration) / 60)
group_by
trips_by_gender <- group_by(trips, gender)
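On its own, group_by does not change any rows or columns; it only records how the table is grouped. A small sketch with made-up data (the toy trips table below is an assumption, not the lecture's dataset):

```r
library(dplyr)

# Hypothetical stand-in for the trips table; the values are made up.
trips <- data.frame(
  tripduration = c(300, 900, 1500, 600),
  gender = c("F", "M", "F", "M")
)

trips_by_gender <- group_by(trips, gender)

# Same rows and columns as before...
same_shape <- identical(dim(trips_by_gender), dim(trips))

# ...but the table now remembers its grouping.
grouping   <- group_vars(trips_by_gender)     # "gender"
num_groups <- n_groups(trips_by_gender)       # 2

# ungroup() removes the grouping again.
after_ungroup <- group_vars(ungroup(trips_by_gender))   # character(0)
```

The grouping only has a visible effect once a verb such as summarize or mutate is applied to the grouped table.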
group_by + summarize
summarize(trips_by_gender,
          count = n(),
          mean_duration = mean(tripduration) / 60,
          sd_duration = sd(tripduration) / 60)
%>%: the pipe operator
trips %>%
  group_by(gender) %>%
  summarize(count = n(),
            mean_duration = mean(tripduration) / 60,
            sd_duration = sd(tripduration) / 60)
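The pipe rewrites x %>% f(y) as f(x, y), so the piped and nested forms should give the same result. A runnable sketch on made-up data (the toy trips table is an assumption for illustration):

```r
library(dplyr)

# Hypothetical stand-in for the trips table; the values are made up.
trips <- data.frame(
  tripduration = c(300, 900, 1500, 600),
  gender = c("F", "M", "F", "M")
)

# Nested form: has to be read from the inside out.
nested <- summarize(group_by(trips, gender),
                    count = n(),
                    mean_duration = mean(tripduration) / 60)

# Piped form: the same steps, read top to bottom.
piped <- trips %>%
  group_by(gender) %>%
  summarize(count = n(),
            mean_duration = mean(tripduration) / 60)

# Both forms produce the same table: one row per gender.
same_result <- isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
```

With these values, each gender has two trips, with mean durations of 15 minutes (F) and 12.5 minutes (M).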
r4ds.had.co.nz