Coding Lab: Visualizing data with ggplot2

Post on 20-May-2022

36 views 0 download

Transcript of Coding Lab: Visualizing data with ggplot2

Coding Lab: Visualizing data with ggplot2

Ari Anisfeld

Summer 2020

1 / 36

How to use ggplot

I How to map data to aesthetics with aes() (and what thatmeans)

I How to visualize the mappings with geomsI How to get more out of your data by using multiple aestheticsI How to use facets to add dimensionality

There are whole books on how to use ggplot. This is a quickintroduction!

2 / 36

Understanding ggplot()By itself, ggplot() tells R to prepare to make a plot.texas_annual_sales <-texas_housing_data %>%group_by(year) %>%summarize(total_volume = sum(volume, na.rm = TRUE))

ggplot(data = texas_annual_sales)

3 / 36

Adding a mappingAdding mapping = aes() says how the data will map to“aesthetics”.

I e.g. tell R to make x-axis year and y-axis total_volume.I Each row of the data has (year, total_volume).

I R will map that to the coordinate pair (x,y) .I Look at the data before moving on!

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume))

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

4 / 36

Visualizing the mapping with a geomgeom_<name> tells R what type of visualization to produce.

Here we see points.

I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y).

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_point()

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

5 / 36

Visualizing the mapping with a geomHere we see bars.

I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y)

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_col()

0e+00

2e+10

4e+10

6e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

6 / 36

Visualizing the mapping with a geom

Here we see a line connecting each (x,y) pair.ggplot(data = texas_annual_sales,

mapping = aes(x = year, y = total_volume)) +geom_line()

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

7 / 36

Visualizing the mapping with a geomHere we see a smooth line. R does a statistical transformation!

I Now R doesn’t visualize the mapping (year, total_volume) to each (x,y)pair

I Instead it fits a model to the (x,y) and then plots the “smooth” line

ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +

geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.5e+10

5.0e+10

7.5e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

8 / 36

Visualizing the mapping with a geom

We can overlay several geom.ggplot(data = texas_annual_sales,

mapping = aes(x = year, y = total_volume)) +geom_smooth() +geom_point()

2.5e+10

5.0e+10

7.5e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

9 / 36

Visualizing the mapping with a geom

I We saw that we can visualize a relationship between twovariables mapping data to x and y

I The data can be visualized with different geoms that can becomposed (+) together.

I We can even calculate new variables with statistics and plotthose on the fly.

Next: Now we’ll look at aesthetics that go beyond x and y axes.

10 / 36

Using aesthetics to explore data.We’ll use midwest data and start with only mapping to x and y

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty)) +geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

11 / 36

Using aesthetics to explore data.I color maps data to the color of points or lines.

I Each state is assigned a color.I This works with discrete data and continuous data.

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,color = state)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

12 / 36

Using aesthetics to explore data.I shape maps data to the shape of points.

I Each state is assigned a shape.I This works with discrete data only.

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,shape = state)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

13 / 36

Using aesthetics to explore data.I alpha maps data to the transparency of points.

I Here we map the percentage of people within a known povertystatus to alpha

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,alpha = poptotal)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y poptotal

1e+06

2e+06

3e+06

4e+06

5e+06

14 / 36

Using aesthetics to explore data.I size maps data to the size of points and width of lines.

I Here we map the percentage of people within a known povertystatus to size

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,size = poptotal)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y poptotal

1e+06

2e+06

3e+06

4e+06

5e+06

15 / 36

Using aesthetics to explore data.

We can combine any and all aesthetics, and even map the same variable tomultiple aestheticsmidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,alpha = percpovertyknown,size = poptotal,color = state))+

geom_point()

16 / 36

Using aesthetics to explore data.

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

percpovertyknown

85

90

95

state

IL

17 / 36

Using aesthetics to explore data

Different geoms have specific aesthetics that go with them.

I use ? to see which aesthetics a geom accepts (e.g?geom_point)

I the bold aesthetics are required.I the ggplot cheatsheet shows all the geoms with their associated

aesthetics

18 / 36

FacetsFacets provide an additional tool to explore multidimensional datamidwest %>%

ggplot(aes(x = log(poptotal),y = percbelowpoverty)) +

geom_point() +facet_wrap(vars(state))

OH WI

IL IN MI

8 10 12 14 8 10 12 14

8 10 12 140

1020304050

01020304050

log(poptotal)

perc

belo

wpo

vert

y

19 / 36

discrete vs continuous dataaes discrete continuous

limited number of classes unlimited number of classesusually chr or lgl numeric

x, y yes yescolor, fill yes yesshape yes (6 or fewer categories) nosize, alpha not advised yesfacet yes not advised

Here, discrete and continuous have different meaning than in math

I For ggplot meaning is more fluid.I If you do group_by with the var and there are fewer than 6 to

10 groups, discrete visualizations can workI If your “discrete” data is numeric, as.character() or

as_factor() to enforce the decision.20 / 36

color can be continuousmidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,color = percpovertyknown)) +

geom_point()

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

85

90

95

percpovertyknown

21 / 36

shape does not play well with many categoriesI Will only map to 6 categories, the rest become NA.I We can override this behavior and get up to 25 distinct shapes

midwest %>%ggplot(aes(x = percollege,

y = percbelowpoverty,shape = county)) +

geom_point() +# legend off, otherwise it overwhelmstheme(legend.position = "none")

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y

22 / 36

alpha and size can be misleading with discrete datamidwest %>%

ggplot(aes(x = percollege,y = percbelowpoverty,alpha = state)) +

geom_point()

## Warning: Using alpha for a discrete variable is not advised.

0

10

20

30

40

50

10 20 30 40 50percollege

perc

belo

wpo

vert

y state

IL

IN

MI

OH

WI

23 / 36

Adding vertical linestexas_annual_sales %>%

ggplot(aes(x = year, y = total_volume)) +geom_point() +geom_vline(aes(xintercept = 2007),

linetype = "dotted")

4e+10

5e+10

6e+10

7e+10

8e+10

2000 2005 2010 2015year

tota

l_vo

lum

e

I add horizontal lines with geom_hline()I add any linear fit using geom_abline() by providing a slope

and intercept.24 / 36

Key take aways

I ggplot starts by mapping data to “aesthetics”.I e.g. What data shows up on x and y axes and how color,

size and shape appear on the plot.I We need to be aware of ‘continuous’ vs. ‘discrete’ variables.

I Then, we use geoms to create a visualization based on themapping.

I Again we need to be aware of ‘continuous’ vs. ‘discrete’variables.

I Making quick plots helps us understand data and makes usaware of data issues

Resources: R for Data Science chap. 3 (r4ds.had.co.nz); RStudio’sggplot cheatsheet.

25 / 36

Appendix: Some graphs you made along the way

26 / 36

lab 0: a map

geom_path is like geom_line, but connects (x, y) pairs in theorder they appear in the data set.storms %>%

group_by(name, year) %>%filter(max(category) == 5) %>%

ggplot(aes(x = long, y = lat, color = name)) +geom_path() +borders("world") +coord_quickmap(xlim = c(-130, -60), ylim = c(20, 50))

27 / 36

lab 0: a map

20

30

40

50

−120 −100 −80 −60long

lat

Dean

Emily

Felix

Gilbert

Hugo

Isabel

Ivan

Katrina

Mitch

28 / 36

lab 1: a line plot

french_data <-wid_data %>%filter(type == "Net personal wealth",

country == "France") %>%mutate(perc_national_wealth = value * 100)

french_data %>%ggplot(aes(y = perc_national_wealth,

x = year,color = percentile)) +

geom_line()

29 / 36

lab 1: a line plot

0

25

50

75

1900 1925 1950 1975 2000year

perc

_nat

iona

l_w

ealth

percentile

p0p50

p50p90

p90p100

p99p100

30 / 36

lab 2: distributions

I geom_density() only requires an x asthetic and it calculatesthe distribution to plot.

I We can set the aesthetics manually, independent of data fornicer graphs.chi_sq_samples <-tibble(x = c(rchisq(100000, 2),

rchisq(100000, 3),rchisq(100000, 4)),

df = rep(c("2", "3", "4"), each = 1e5))

chi_sq_samples %>%ggplot(aes(x = x, fill = df)) +geom_density( alpha = .5) +labs(fill = "df", x = "sample")

31 / 36

lab 2: distributions

0.0

0.1

0.2

0.3

0.4

0 10 20 30sample

dens

ity

df

2

3

4

32 / 36

lab 4: grouped bar graphs

I position = "dodge2" tells R to put bars next to each other,rather than stacked on top of each other.

I Notice we use fill and not color because we’re “filling” anarea.mean_share_per_country %>%

ggplot(aes(y = country,x = mean_share,fill = percentile)) +

geom_col(position = "dodge2") +labs(x = "Mean share of national wealth",

y = "",fill = "Wealth\npercentile")

33 / 36

lab 4: grouped bar graphs

ChinaFrance

IndiaKorea

RussiaS Africa

UKUSA

0.00 0.25 0.50 0.75Mean share of national wealth

Wealthpercentile

p90p100

p99p100

34 / 36

lab 4: faceted bar graph

I Notice that we manipulate our data to the right specificationbefore making this graph

I Using facet_wrap we get a distinct graph for each timeperiod.

mean_share_per_country_with_time %>%ggplot(aes(x = country,

y = mean_share,fill = percentile)) +

geom_col(position = "dodge2") +facet_wrap(vars(time_period))

35 / 36

lab 4: faceted bar graph

1980 to 1999 2000 to present

1959 and earlier 1960 to 1979

ChinaIndiaUSA ChinaIndiaUSA

0.00.20.40.60.8

0.00.20.40.60.8

country

mea

n_sh

are

percentile

p90p100

p99p100

36 / 36