R Handout 2 - Introduction to the tidyverse / dplyr package in R

In this handout, we will introduce the dplyr package, which can be used to manipulate data in R. Much of the code below is taken from the following R documentation, which offers a more thorough discussion:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

As stated in this documentation, the dplyr package provides “simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.”

Data Source: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

Tidyverse – A collection of R packages for Data Science

Select Import Dataset to read in this file.

Select From Text (readr) and specify the location of the csv file

A snippet of the code; copy and paste this into your script window…
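The code generated by the import dialog is not reproduced here. As a rough sketch of what such a readr call looks like, the example below writes a tiny temporary csv file and reads it back. The file contents, the name csv_path, and the name FlightDelays_demo are made up for illustration; in your own script you would point read_csv() at the location of the csv file you downloaded.

```r
library(readr)

# Hypothetical stand-in for the downloaded csv file: a tiny temporary file
# holding three of the columns used later in this handout (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("OP_UNIQUE_CARRIER,ORIGIN,DEP_DELAY",
             "AA,MSP,5",
             "DL,RST,-3"),
           csv_path)

# Read the csv file into a data.frame (a tibble)
FlightDelays_demo <- read_csv(csv_path)

# Check the number of rows and columns that were read in
dim(FlightDelays_demo)
```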

Start by installing the tidyverse library / package…

#Download most recent version of tidyverse
install.packages("tidyverse")

#Load the library / package
library(tidyverse)

Basic commands for understanding your dataset

#Getting all the fields
names(FlightDelays)

#Getting first few rows
head(FlightDelays)

#Getting last few rows
tail(FlightDelays)

#Viewing the data.frame
View(FlightDelays)

Select columns with select()

If you’re working with a large data set and only a few variables are of actual interest to you, you can select that subset of variables easily with dplyr. For example, consider the following:

dplyr::select(FlightDelays, OP_UNIQUE_CARRIER, ORIGIN, DEP_DELAY)

Using SELECT to get a range of adjacent columns (DAY_OF_WEEK through DEP_DELAY)…

dplyr::select(FlightDelays, DAY_OF_WEEK:DEP_DELAY)
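To see what the two forms of select() return without loading the full file, here is a small sketch on a made-up data frame. The data frame toy and its values are hypothetical; only the column names are borrowed from FlightDelays.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real FlightDelays file
toy <- tibble(
  DAY_OF_WEEK       = c(1, 1, 2),
  OP_UNIQUE_CARRIER = c("AA", "DL", "AA"),
  ORIGIN            = c("MSP", "RST", "LSE"),
  DEST              = c("ORD", "MSP", "MSP"),
  DEP_DELAY         = c(5, NA, -3)
)

# Pick individual columns by name
by_name <- dplyr::select(toy, OP_UNIQUE_CARRIER, DEP_DELAY)

# Pick every column from ORIGIN through DEP_DELAY, in the order stored
by_range <- dplyr::select(toy, ORIGIN:DEP_DELAY)

names(by_name)
names(by_range)
```

The range form depends on the order of the columns in the data frame, so it is worth running names() first to confirm which columns fall between the two endpoints.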

You can also use various “helper functions” within select(), as shown below.

starts_with()
ends_with()
matches()
contains()

Using SELECT to get fields that start with DEST

dplyr::select(FlightDelays, starts_with("DEST"))

Using SELECT to get fields that contain the word DELAY

dplyr::select(FlightDelays, contains("DELAY"))
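The other two helpers, ends_with() and matches(), work the same way. The sketch below uses a made-up data frame (toy, with hypothetical values) so the selected column names are easy to check; matches() takes a regular expression rather than a fixed piece of text.

```r
library(dplyr)

# Hypothetical toy data (made-up values); column names mimic FlightDelays
toy <- tibble(
  ORIGIN         = c("MSP", "RST"),
  DEST           = c("ORD", "MSP"),
  DEST_CITY_NAME = c("Chicago, IL", "Minneapolis, MN"),
  DEP_DELAY      = c(5, -3),
  ARR_DELAY      = c(2, 0)
)

# Columns whose names end with DELAY
ends <- dplyr::select(toy, ends_with("DELAY"))

# Columns whose names match a regular expression:
# here, names that are exactly ORIGIN or exactly DEST
airports <- dplyr::select(toy, matches("^(ORIGIN|DEST)$"))

names(ends)
names(airports)
```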

Also, a common use of the select() function, combined with distinct(), is to determine how many unique (or distinct) values a variable (or a set of variables) takes on. The number of rows returned by distinct() is the number of unique values.

dplyr::distinct(dplyr::select(FlightDelays, OP_UNIQUE_CARRIER))

Getting unique combinations of fields…

dplyr::distinct(dplyr::select(FlightDelays, OP_UNIQUE_CARRIER, DEST))

Saving the unique combinations as a new data.frame…

dplyr::distinct(dplyr::select(FlightDelays, OP_UNIQUE_CARRIER, DEST)) -> Carrier_Destinations
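If only the count is needed, nrow() can be applied to the result of distinct(). The sketch below shows this on a made-up data frame (toy and its values are hypothetical), along with dplyr::n_distinct(), which counts the unique values of a single column directly.

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  OP_UNIQUE_CARRIER = c("AA", "AA", "DL", "DL"),
  DEST              = c("MSP", "MSP", "MSP", "ORD")
)

# Unique carriers, then the number of them
carriers <- dplyr::distinct(dplyr::select(toy, OP_UNIQUE_CARRIER))
n_carriers <- nrow(carriers)

# Same count taken directly from the column
n_carriers_direct <- dplyr::n_distinct(toy$OP_UNIQUE_CARRIER)

# Number of unique carrier / destination combinations
# (AA-MSP appears twice in toy but is counted once)
n_pairs <- nrow(dplyr::distinct(dplyr::select(toy, OP_UNIQUE_CARRIER, DEST)))
```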

Filter rows with filter()

With the dplyr package, the filter() function allows you to select a subset of the rows of a data frame.

dplyr::filter(FlightDelays, OP_UNIQUE_CARRIER == 'AA' & DAY_OF_WEEK == 1)

With the filter() function you can give any number of filtering conditions which are joined together with “&” or other Boolean operators. For example, consider the following set of commands…

(1) dplyr::filter(FlightDelays, OP_UNIQUE_CARRIER == 'AA' & DAY_OF_WEEK == 1)

(2) dplyr::filter(FlightDelays, (OP_UNIQUE_CARRIER == 'AA') & (DAY_OF_WEEK == 1) & (ORIGIN == 'MSP') )

(3) dplyr::filter(FlightDelays, (ORIGIN == 'RST') & (ORIGIN == 'LSE') )

(4) dplyr::filter(FlightDelays, (ORIGIN == 'RST') | (ORIGIN == 'LSE') )

(5) dplyr::filter(FlightDelays, ( (ORIGIN == 'RST') | (ORIGIN == 'LSE') ) & (FL_DATE == '2019-01-10' )) -> Local_Flights_Jan10

For each, describe what the filter is doing and determine how many rows meet the conditions specified…

Purpose / # rows

(1)

(2)

(3)

(4)

(5)
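One way to get the number of rows for each filter is to wrap the result in nrow(). The sketch below does this on a made-up data frame (toy and its values are hypothetical, so the counts are not the answers to the questions above). It also shows %in%, a base R operator that can stand in for several == tests joined by |.

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  OP_UNIQUE_CARRIER = c("AA", "AA", "DL", "DL"),
  DAY_OF_WEEK       = c(1, 2, 1, 3),
  ORIGIN            = c("MSP", "ORD", "MSP", "DEN")
)

# AND: both conditions must hold for a row to be kept
n_both <- nrow(dplyr::filter(toy, OP_UNIQUE_CARRIER == "AA" & DAY_OF_WEEK == 1))

# OR: a row is kept if at least one condition holds
n_either <- nrow(dplyr::filter(toy, OP_UNIQUE_CARRIER == "AA" | DAY_OF_WEEK == 1))

# %in% keeps rows whose ORIGIN is any of the listed values
n_in_set <- nrow(dplyr::filter(toy, ORIGIN %in% c("MSP", "ORD")))
```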

Add new columns with mutate()

First, let us create a new data.frame with only the following columns.

FlightDelays2 <- dplyr::select(FlightDelays, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN, DEST, DEP_DELAY, ARR_DELAY)

In addition to selecting from existing columns, you can add new columns that are functions of existing columns.

dplyr::mutate(FlightDelays2, Gain = ARR_DELAY - DEP_DELAY)

Note that the newly created column is *not* automatically put into the existing data.frame. The number of variables in FlightDelays2 did *not* change from above.

Putting the new column into the data.frame requires you to make an assignment…

dplyr::mutate(FlightDelays2, Gain = ARR_DELAY - DEP_DELAY) -> FlightDelays2
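The difference between calling mutate() with and without an assignment can be checked with ncol(). This is a small sketch on a made-up data frame (toy and its values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  DEP_DELAY = c(10, -2, 30),
  ARR_DELAY = c(4, -5, 45)
)

# Without an assignment the result is only printed; toy itself is unchanged
dplyr::mutate(toy, Gain = ARR_DELAY - DEP_DELAY)
cols_before <- ncol(toy)   # still 2 columns

# With an assignment the new column is kept
toy <- dplyr::mutate(toy, Gain = ARR_DELAY - DEP_DELAY)
cols_after <- ncol(toy)    # now 3 columns
```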

Summarize values with summarize()

This lets you create summaries that collapse a data frame to a single row. For example, consider the following:

dplyr::summarise(FlightDelays2, Avg.DEP_DELAY = mean(DEP_DELAY))

Because DEP_DELAY contains missing values (NA), the mean above is returned as NA. The NA values need to be removed from the mean calculation…

dplyr::summarise(FlightDelays2, Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE))
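The effect of na.rm=TRUE is easy to see on a small made-up column. In the sketch below, toy and its values are hypothetical, and the Missing column is an extra summary added for illustration to show how many values were dropped.

```r
library(dplyr)

# Hypothetical toy data (made-up values) with one missing delay
toy <- tibble(DEP_DELAY = c(10, NA, 20))

# A single NA makes the mean NA
with_na <- dplyr::summarise(toy, Avg.DEP_DELAY = mean(DEP_DELAY))

# na.rm=TRUE drops the NA first; sum(is.na()) counts how many were missing
without_na <- dplyr::summarise(toy,
                               Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE),
                               Missing = sum(is.na(DEP_DELAY)))
```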

Commonalities of functions in the dplyr package

Note that all of these functions are similar in the following ways:

The first argument is a data frame
Subsequent arguments tell R what to do with that data frame
The result is a new data frame

As stated in the aforementioned R documentation, these functions together “provide the basis of a language of data manipulation.” At the most basic level, we alter data sets in the following ways:

SELECT: Select variables (columns) of interest
FILTER: Select observations (rows) of interest
MUTATE: Add new variables (columns) that are functions of existing variables
SUMMARISE: Aggregate (summarize) rows

Grouped summaries / Aggregation with group_by()

Finally, note that you can also use all of the above functions to process a data set “by group.”

group.carrier <- dplyr::group_by(FlightDelays2, OP_UNIQUE_CARRIER )

dplyr::summarise(group.carrier, Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE) )

dplyr::summarise(group.carrier, Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE), Count = n() )

Other Common Summaries

Standard Deviation: sd()
Minimum: min()
Maximum: max()
Count: n()
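These summaries can be combined in a single summarise() call. The sketch below uses a made-up data frame (toy, by_carrier, stats, and the new column names are all hypothetical). Note that n() counts every row in the group, including rows where DEP_DELAY is NA.

```r
library(dplyr)

# Hypothetical toy data (made-up values) with one missing delay
toy <- tibble(
  OP_UNIQUE_CARRIER = c("AA", "AA", "DL", "DL", "DL"),
  DEP_DELAY         = c(0, 10, 5, NA, 15)
)

by_carrier <- dplyr::group_by(toy, OP_UNIQUE_CARRIER)

# One row per carrier, one column per summary
stats <- dplyr::summarise(by_carrier,
                          SD.DEP_DELAY  = sd(DEP_DELAY, na.rm=TRUE),
                          Min.DEP_DELAY = min(DEP_DELAY, na.rm=TRUE),
                          Max.DEP_DELAY = max(DEP_DELAY, na.rm=TRUE),
                          Count = n())
stats
```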

Chaining Operations, i.e. piping

The following is an example of piping in the dplyr package… The piping action (the %>% operator) automatically pushes the data.frame ahead to the next function for use. For example, the FlightDelays2 data.frame is automatically pushed into the dplyr::group_by() function, the output from dplyr::group_by() is automatically pushed into dplyr::summarise(), etc.

( FlightDelays2 %>%
    dplyr::group_by( OP_UNIQUE_CARRIER ) %>%
    dplyr::summarise( Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE) )
)

The following code will simply arrange the average departure delays in descending order…

( FlightDelays2 %>%
    dplyr::group_by( OP_UNIQUE_CARRIER ) %>%
    dplyr::summarise( Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE) ) %>%
    dplyr::arrange( desc(Avg.DEP_DELAY) )
)
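A piped chain does the same work as nesting the function calls inside one another; the pipe only changes the order in which the steps are written. The sketch below compares the two forms on a made-up data frame (toy, nested, piped, and the values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  OP_UNIQUE_CARRIER = c("AA", "AA", "DL", "DL"),
  DEP_DELAY         = c(0, 10, 20, 40)
)

# Nested form: read from the inside out
nested <- dplyr::arrange(
  dplyr::summarise(
    dplyr::group_by(toy, OP_UNIQUE_CARRIER),
    Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE)
  ),
  desc(Avg.DEP_DELAY)
)

# Piped form: read from top to bottom
piped <- toy %>%
  dplyr::group_by(OP_UNIQUE_CARRIER) %>%
  dplyr::summarise(Avg.DEP_DELAY = mean(DEP_DELAY, na.rm=TRUE)) %>%
  dplyr::arrange(desc(Avg.DEP_DELAY))

# Compare the two results
all.equal(as.data.frame(nested), as.data.frame(piped))
```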

Consider the following set of commands with its corresponding output…

(FlightDelays %>%
   dplyr::filter( (DEST=='RST') | (DEST=='LSE') ) %>%
   dplyr::group_by(ORIGIN, DEST) %>%
   dplyr::summarise(Avg_Arrival_Delay = mean(ARR_DELAY, na.rm=TRUE)) %>%
   dplyr::arrange(DEST))

Questions:

1. What is the purpose of line 2?

2. What are lines 3 & 4 doing?

Consider the following modification to the code provided above. What additional information does this provide? Discuss…

(FlightDelays %>%
   dplyr::filter( (DEST=='RST') | (DEST=='LSE') ) %>%
   dplyr::group_by(ORIGIN, DEST, OP_UNIQUE_CARRIER) %>%
   dplyr::summarise(Avg_Arrival_Delay = mean(ARR_DELAY, na.rm=TRUE)) %>%
   dplyr::arrange(DEST))
