Stat 115 Lecture Notes 4 - Washington State University

12
Stat 115 Lecture Notes 4 Xiongzhi Chen Washington State University Contents 2 Overview of data 2 Course materials ............................................ 2 NYC flight data: nycflights13 ................................... 2 NYC flight data: flights ........................................ 2 NYC flight data: airlines ........................................ 2 NYC flight data: airports ....................................... 3 NYC flight data: planes ........................................ 3 NYC flight data: weather ....................................... 3 NYC flight data ............................................ 3 NYC flight data ............................................ 4 Mutating Joins 4 Key ................................................... 4 Data: flights ............................................. 4 Data: airlines ............................................ 5 Add airline name to flights2 .................................... 5 Inner join ................................................ 5 Outer Joins ............................................... 6 Left outer join ............................................. 6 Right outer join ............................................. 6 Full outer join .............................................. 7 Join flights and plances ...................................... 7 Join flights and airports ...................................... 7 Filtering Joins ............................................. 8 Missing values 8 Course contents ............................................. 8 Check object types ........................................... 8 Check object types ........................................... 8 Coerce an object ............................................ 9 Missing observations/values ...................................... 9 Example ................................................. 10 Check for NA .............................................. 10 Remove observatiosn with NAs ..................................... 10 Complete and fill in NAs ........................................ 11 Mising as if not ............................................. 11 Fill in missing values .......................................... 11 License and session Information .................................... 12 1

Transcript of Stat 115 Lecture Notes 4 - Washington State University

Page 1: Stat 115 Lecture Notes 4 - Washington State University

Stat 115 Lecture Notes 4Xiongzhi Chen

Washington State University

Contents2

Overview of data 2Course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: nycflights13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: weather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Mutating Joins 4Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Add airline name to flights2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Inner join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Outer Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Left outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Right outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Full outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and plances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Filtering Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Missing values 8Course contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Coerce an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Missing observations/values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Check for NA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Remove observatiosn with NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Complete and fill in NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Mising as if not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Fill in missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11License and session Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1

Page 2: Stat 115 Lecture Notes 4 - Washington State University

Overview of data

Course materials

• Chapters 10 of book “R for data science” by Wickham and Grolemun• R library: tidyverse, dplyr, nycflights13

NYC flight data: nycflights13

nycflights13 contains 5 tibbles

• airlines: full carrier name• airports: information about each airport• planes: information plane, identified by tailnum• weather: hourly weather at each NYC airport• flights: airplane scheduled departure and arrival times, tailnum, etc

NYC flight data: flights

# A tibble: 6 x 19year month day dep_time sched_dep_time dep_delay arr_time

<int> <int> <int> <int> <int> <dbl> <int>1 2013 1 1 517 515 2 8302 2013 1 1 533 529 4 8503 2013 1 1 542 540 2 9234 2013 1 1 544 545 -1 10045 2013 1 1 554 600 -6 8126 2013 1 1 554 558 -4 740# ... with 12 more variables: sched_arr_time <int>,# arr_delay <dbl>, carrier <chr>, flight <int>,# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

NYC flight data: airlines

# A tibble: 6 x 2carrier name<chr> <chr>

1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.

2

Page 3: Stat 115 Lecture Notes 4 - Washington State University

NYC flight data: airports

# A tibble: 6 x 8faa name lat lon alt tz dst tzone<chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>

1 04G Lansdowne Airp~ 41.1 -80.6 1044 -5 A America/~2 06A Moton Field Mu~ 32.5 -85.7 264 -6 A America/~3 06C Schaumburg Reg~ 42.0 -88.1 801 -6 A America/~4 06N Randall Airport 41.4 -74.4 523 -5 A America/~5 09J Jekyll Island ~ 31.1 -81.4 11 -5 A America/~6 0A9 Elizabethton M~ 36.4 -82.2 1593 -5 A America/~

NYC flight data: planes

# A tibble: 6 x 9tailnum year type manufacturer model engines seats speed<chr> <int> <chr> <chr> <chr> <int> <int> <int>

1 N10156 2004 Fixe~ EMBRAER EMB-~ 2 55 NA2 N102UW 1998 Fixe~ AIRBUS INDU~ A320~ 2 182 NA3 N103US 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA4 N104UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA5 N10575 2002 Fixe~ EMBRAER EMB-~ 2 55 NA6 N105UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA# ... with 1 more variable: engine <chr>

NYC flight data: weather

# A tibble: 6 x 15origin year month day hour temp dewp humid wind_dir<chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>

1 EWR 2013 1 1 1 39.0 26.1 59.4 2702 EWR 2013 1 1 2 39.0 27.0 61.6 2503 EWR 2013 1 1 3 39.0 28.0 64.4 2404 EWR 2013 1 1 4 39.9 28.0 62.2 2505 EWR 2013 1 1 5 39.0 28.0 64.4 2606 EWR 2013 1 1 6 37.9 28.0 67.2 240# ... with 6 more variables: wind_speed <dbl>,# wind_gust <dbl>, precip <dbl>, pressure <dbl>,# visib <dbl>, time_hour <dttm>

NYC flight data

For nycflights13:

• flights connects to planes via a single variable, tailnum• flights connects to airlines through the carrier variable• flights connects to airports in two ways: via the origin and dest variables• flights connects to weather via origin (the location), and year, month, day, and hour (the time).

3

Page 4: Stat 115 Lecture Notes 4 - Washington State University

NYC flight data

Mutating Joins

Key

• a key is a variable (or set of variables) that uniquely identifies an observation• A primary key uniquely identifies an observation in its own table. For example, planes$tailnum

is a primary key because it uniquely identifies each plane in the planes table• A foreign key uniquely identifies an observation in another table. For example, flights$tailnum

is a foreign key because it appears in the flights table where it matches each flight to a unique plane.

Data: flights

> library(dplyr)> flights2 <- flights %>%+ select(year:day, hour, origin, dest, tailnum, carrier)> flights2# A tibble: 336,776 x 8

year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA

4

Page 5: Stat 115 Lecture Notes 4 - Washington State University

7 2013 1 1 6 EWR FLL N516JB B68 2013 1 1 6 LGA IAD N829AS EV9 2013 1 1 6 JFK MCO N593JB B6

10 2013 1 1 6 LGA ORD N3ALAA AA# ... with 336,766 more rows

Data: airlines

# A tibble: 6 x 2carrier name<chr> <chr>

1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.

Add airline name to flights2

> library(dplyr)> flights2 %>%+ select(-origin, -dest) %>%+ left_join(airlines, by = "carrier")# A tibble: 336,776 x 7

year month day hour tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr>

1 2013 1 1 5 N14228 UA United Air Lines In~2 2013 1 1 5 N24211 UA United Air Lines In~3 2013 1 1 5 N619AA AA American Airlines I~4 2013 1 1 5 N804JB B6 JetBlue Airways5 2013 1 1 6 N668DN DL Delta Air Lines Inc.6 2013 1 1 5 N39463 UA United Air Lines In~7 2013 1 1 6 N516JB B6 JetBlue Airways8 2013 1 1 6 N829AS EV ExpressJet Airlines~9 2013 1 1 6 N593JB B6 JetBlue Airways

10 2013 1 1 6 N3ALAA AA American Airlines I~# ... with 336,766 more rows

Inner join

inner_join: unmatched rows are not included in the result

5

Page 6: Stat 115 Lecture Notes 4 - Washington State University

y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

The colored column represents the “key” variable: these are used tomatch the rows between the tables. The gray column represents the“value” column that is carried along for the ride. In these examplesI’ll show a single key variable and single value variable, but the ideageneralizes in a straightforward way to multiple keys and multiplevalues.

A join is a way of connecting each row in x to zero, one, or morerows in y. The following diagram shows each potential match as anintersection of a pair of lines:

(If you look closely, you might notice that we’ve switched the orderof the key and value columns in x. This is to emphasize that joinsmatch based on the key; the value is just carried along for the ride.)

In an actual join, matches will be indicated with dots. The numberof dots = the number of matches = the number of rows in the out‐put.

Inner JoinThe simplest type of join is the inner join. An inner join matchespairs of observations whenever their keys are equal:

180 | Chapter 10: Relational Data with dplyr

Outer Joins

outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

• A left join keeps all observations in x• A right join keeps all observations in y• A full join keeps all observations in x and y

Left outer joinGraphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

Right outer join

Graphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

6

Page 7: Stat 115 Lecture Notes 4 - Washington State University

Full outer join

Graphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

Join flights and plances

joint flights and planes by tailnum> flights2 %>% left_join(planes, by = "tailnum")# A tibble: 336,776 x 16

year.x month day hour origin dest tailnum carrier year.y<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int>

1 2013 1 1 5 EWR IAH N14228 UA 19992 2013 1 1 5 LGA IAH N24211 UA 19983 2013 1 1 5 JFK MIA N619AA AA 19904 2013 1 1 5 JFK BQN N804JB B6 20125 2013 1 1 6 LGA ATL N668DN DL 19916 2013 1 1 5 EWR ORD N39463 UA 20127 2013 1 1 6 EWR FLL N516JB B6 20008 2013 1 1 6 LGA IAD N829AS EV 19989 2013 1 1 6 JFK MCO N593JB B6 2004

10 2013 1 1 6 LGA ORD N3ALAA AA NA# ... with 336,766 more rows, and 7 more variables:# type <chr>, manufacturer <chr>, model <chr>,# engines <int>, seats <int>, speed <int>, engine <chr>

Join flights and airports

joint flights and airports by matching dest in flights to faa in airports; the variables in x will beused in the output> flights2 %>%+ left_join(airports, c("dest" = "faa"))# A tibble: 336,776 x 15

year month day hour origin dest tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA Geor~2 2013 1 1 5 LGA IAH N24211 UA Geor~3 2013 1 1 5 JFK MIA N619AA AA Miam~4 2013 1 1 5 JFK BQN N804JB B6 <NA>5 2013 1 1 6 LGA ATL N668DN DL Hart~6 2013 1 1 5 EWR ORD N39463 UA Chic~

7

Page 8: Stat 115 Lecture Notes 4 - Washington State University

7 2013 1 1 6 EWR FLL N516JB B6 Fort~8 2013 1 1 6 LGA IAD N829AS EV Wash~9 2013 1 1 6 JFK MCO N593JB B6 Orla~

10 2013 1 1 6 LGA ORD N3ALAA AA Chic~# ... with 336,766 more rows, and 6 more variables: lat <dbl>,# lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not thevariables. There are two types:

• semi_join(x, y) keeps all observations in x that have a match in y• anti_join(x, y) drops all observations in x that have a match in y

Missing values

Course contents

• Chapter 9 of book “R for data science” by Wickham and Grolemun• R packages: tidyr, dplyr and tidyverse

Check object types

After R imports data, make sure variables have correct object types

• str gives object type of each variable in a data frame or tibble• summary gives object type of each variable, together with 5-value summary for each numeric object• head gives object type for each variable in a data frame or tibble

Check object types

> summary(flights2)year month day

Min. :2013 Min. : 1.000 Min. : 1.001st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00Median :2013 Median : 7.000 Median :16.00Mean :2013 Mean : 6.549 Mean :15.713rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00Max. :2013 Max. :12.000 Max. :31.00

hour origin destMin. : 1.00 Length:336776 Length:3367761st Qu.: 9.00 Class :character Class :characterMedian :13.00 Mode :character Mode :characterMean :13.183rd Qu.:17.00Max. :23.00

tailnum carrier

8

Page 9: Stat 115 Lecture Notes 4 - Washington State University

Length:336776 Length:336776Class :character Class :characterMode :character Mode :character

> str(flights2)Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 8 variables:$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...$ month : int 1 1 1 1 1 1 1 1 1 1 ...$ day : int 1 1 1 1 1 1 1 1 1 1 ...$ hour : num 5 5 5 5 6 5 6 6 6 6 ...$ origin : chr "EWR" "LGA" "JFK" "JFK" ...$ dest : chr "IAH" "IAH" "MIA" "BQN" ...$ tailnum: chr "N14228" "N24211" "N619AA" "N804JB" ...$ carrier: chr "UA" "UA" "AA" "B6" ...

> head(flights2)# A tibble: 6 x 8

year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA

Coerce an object

Force to change the type of a object by as.objecType()> x = c("1","2","3")> x[1] "1" "2" "3"> as.numeric(x)[1] 1 2 3> as.factor(x)[1] 1 2 3Levels: 1 2 3> as.matrix(x,3,1)

[,1][1,] "1"[2,] "2"[3,] "3"> y = c("a","b")> as.numeric(y)[1] NA NA

Missing observations/values

• Missing explicitly, i.e., flagged with NA (,i.e., presence of an absence)

9

Page 10: Stat 115 Lecture Notes 4 - Washington State University

• Missing implicitly, i.e., simply not present in the data (i.e., absence of a presence)

Example

> library(tidyr)> stocks <- tibble(+ year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),+ qtr = c( 1, 2, 3, 4, 2, 3, 4),+ return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)+ )> stocks# A tibble: 7 x 3

year qtr return<dbl> <dbl> <dbl>

1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 2 0.926 2016 3 0.177 2016 4 2.66

Check for NA

Command: is.na() or any(is.na())> any(is.na(stocks))[1] TRUE> is.na(stocks)

year qtr return[1,] FALSE FALSE FALSE[2,] FALSE FALSE FALSE[3,] FALSE FALSE FALSE[4,] FALSE FALSE TRUE[5,] FALSE FALSE FALSE[6,] FALSE FALSE FALSE[7,] FALSE FALSE FALSE

Remove observatiosn with NAs

> stocks %>%+ spread(year, return) %>%+ gather(year, return, `2015`:`2016`, na.rm = TRUE)# A tibble: 6 x 3

qtr year return* <dbl> <chr> <dbl>1 1 2015 1.882 2 2015 0.593 3 2015 0.35

10

Page 11: Stat 115 Lecture Notes 4 - Washington State University

4 2 2016 0.925 3 2016 0.176 4 2016 2.66

Complete and fill in NAs

> stocks %>% complete(year, qtr)# A tibble: 8 x 3

year qtr return<dbl> <dbl> <dbl>

1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 1 NA6 2016 2 0.927 2016 3 0.178 2016 4 2.66

Mising as if not

scenario: missing values indicate that the previous value should be carried forward> library(tibble)> treatment <- tribble(+ ~ person, ~ treatment, ~response,+ "Derrick Whitmore", 1, 7,+ NA, 2, 10,+ NA, 3, 9,+ "Katherine Burke", 1, 4+ )> treatment# A tibble: 4 x 3

person treatment response<chr> <dbl> <dbl>

1 Derrick Whitmore 1 72 <NA> 2 103 <NA> 3 94 Katherine Burke 1 4

Fill in missing values

> treatment %>% fill(person)# A tibble: 4 x 3

person treatment response<chr> <dbl> <dbl>

1 Derrick Whitmore 1 72 Derrick Whitmore 2 10

11

Page 12: Stat 115 Lecture Notes 4 - Washington State University

3 Derrick Whitmore 3 94 Katherine Burke 1 4

License and session Information

License> sessionInfo()R version 3.5.0 (2018-04-23)Platform: x86_64-w64-mingw32/x64 (64-bit)Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:[1] LC_COLLATE=English_United States.1252[2] LC_CTYPE=English_United States.1252[3] LC_MONETARY=English_United States.1252[4] LC_NUMERIC=C[5] LC_TIME=English_United States.1252

attached base packages:[1] stats graphics grDevices utils datasets methods[7] base

other attached packages:[1] tidyr_0.8.2 dplyr_0.7.8 nycflights13_1.0.0[4] knitr_1.21

loaded via a namespace (and not attached):[1] Rcpp_1.0.0 rstudioapi_0.8 bindr_0.1.1[4] magrittr_1.5 tidyselect_0.2.5 R6_2.3.0[7] rlang_0.3.0.1 fansi_0.4.0 stringr_1.3.1

[10] tools_3.5.0 xfun_0.4 utf8_1.1.4[13] cli_1.0.1 htmltools_0.3.6 yaml_2.2.0[16] digest_0.6.18 assertthat_0.2.0 tibble_1.4.2[19] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5[22] glue_1.3.0 evaluate_0.12 rmarkdown_1.11[25] stringi_1.2.4 compiler_3.5.0 pillar_1.3.1[28] pkgconfig_2.0.2

12