ggplot2 for Epi Studies - University of North Carolina at ...
ggplot2 for Epi Studies
Leah McGrath, PhD
November 13, 2017
Introduction
· Know your data: data exploration is an important part of research
· Data visualization is an excellent way to explore data
· ggplot2 is an elegant R library that makes it easy to create compelling graphs
· Plots can be built up iteratively and easily modified
Learning objectives
· To create graphs used in manuscripts for epidemiology studies
· To review and incorporate previously learned aspects of formatting graphs
· To demonstrate novel data visualizations using Shiny
ggplot architecture review
· Aesthetics: specify the variables to display
  - what are x and y?
  - can also link variables to color, shape, size and transparency
· "geoms": specify the type of plot
  - do you want a scatter plot, line, bars, densities, or another type of plot?
· Scales: for transforming variables (e.g., log, sq. root)
  - also used to set the legend: title, breaks, labels
· Facets: create separate panels for different factors
· Themes: adjust appearance (background, fonts, etc.)
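The five components above can be added to a plot one at a time. The following is a minimal sketch, assuming simulated stand-in data (the `demo` data frame is hypothetical and is not the NHANES file); only the column names follow the codebook.

```r
library(ggplot2)

# Simulated stand-in data (NOT the NHANES anemia file)
set.seed(1)
n <- 200
demo <- data.frame(
  age  = runif(n, 20, 85),
  sex  = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  race = factor(sample(c("Hispanic", "White", "Black", "Other"), n, replace = TRUE))
)
demo$hgb <- 13.5 + 1.5 * (demo$sex == "Male") + rnorm(n)

# Aesthetics: map variables to x, y and color
p <- ggplot(demo, aes(x = age, y = hgb, color = sex))
# Geom: choose the type of plot
p <- p + geom_point(alpha = 0.5)
# Scales: transform the x axis and set the legend title
p <- p + scale_x_log10() + scale_color_discrete(name = "Sex")
# Facets: one panel per level of race
p <- p + facet_wrap(~race)
# Themes: adjust the overall appearance
p <- p + theme_bw()
p
```

Because each step returns a plot object, any layer can be swapped or dropped without rewriting the rest.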
Hemoglobin data
· Data from the National Health and Nutrition Examination Survey (NHANES), 1999-2000
· Contains data on n=3,990 participants
· The file was created by merging the demographic data with the complete blood count file and the nutritional biochemistry lab file
· Contains measures of hemoglobin, iron status, and other anemia-related parameters
Anemia data codebook
· age = age of participant (years)
· sex = sex of participant (Male vs Female)
· tsat = transferrin saturation (%)
· iron = total serum iron (ug/dL)
· hgb = hemoglobin concentration (g/dL)
· ferr = serum ferritin (ng/mL)
· folate = serum folate (ng/mL)
· race = participant race (Hispanic, White, Black, Other)
· rdw = red cell distribution width (%)
· wbc = white blood cell count (SI)
· anemia = indicator variable for anemia (according to WHO definition)
Scatter plot review: hemoglobin by age, stratified by ethnicity and sex
ggplot(data=anemia, aes(x=age, y=hgb, color=sex)) +
  geom_smooth() +
  geom_jitter(aes(size=1/iron), alpha=0.1) +
  xlab("Age") + ylab("Hemoglobin (g/dl)") +
  scale_size(name = "Iron Deficiency") +
  scale_color_discrete(name = "Sex") +
  facet_wrap(~race) +
  theme_bw()
Box plots
ggplot(data=anemia, aes(x=race,y=hgb)) + geom_boxplot()
Box plots with points
ggplot(data=anemia, aes(x=race,y=hgb,color=sex)) + geom_boxplot()+ geom_jitter(alpha=0.1)
Box plots with coordinates flipped
ggplot(data=anemia, aes(x=race,y=hgb,color=sex)) + geom_boxplot()+ geom_jitter(alpha=0.1) + coord_flip()
Violin plots
· Kernel density estimates are drawn on each side and mirrored, so that each group forms a symmetrical shape
· Easy to compare several distributions
Violin plots
ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()
Violin plots with underlying data points
ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()+ geom_jitter(alpha=0.1)
Violin plots stratified by 2 variables
ggplot(data=anemia, aes(x=sex,y=hgb,color=race)) + geom_violin()
Violin plots & boxplot with no outliers
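ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  # narrow boxplot inside each violin, with outliers hidden
  geom_boxplot(width=.1, fill="black", outlier.color=NA) +
  # white point marking the median (fun replaces fun.y in ggplot2 >= 3.3.0)
  stat_summary(fun=median, geom="point", fill="white", shape=21, size=2.5)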
Practice
· Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
· Try faceting, flipping orientation, changing colors and labels
str(anemia)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
##  $ vite  : num 1488 1897 1311 528 3092 ...
##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
##   .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...
Forest plots
· First gather the data into the proper format, including the following variables:
  - Estimate
  - Lower CI
  - Upper CI
  - Grouping variable
Forest plots
· For this example, we take the mean and calculate the upper and lower confidence limits for hemoglobin.
· We will stack the row observations into one variable called "Type".
library(dplyr)

# Mean hemoglobin with a normal-approximation 95% CI, by sex
anemia1 <- anemia %>%
  select(sex, hgb) %>%
  group_by(sex) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia1)[1] <- "Type"

# Same summary, by race
anemia2 <- anemia %>%
  select(race, hgb) %>%
  group_by(race) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia2)[1] <- "Type"

# Stack both summaries into one data frame
anemia3 <- rbind(anemia1, anemia2)
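The lower and upper limits on this slide use the normal approximation, mean ± 1.96 × sd/√n. As a rough check of that arithmetic, here is a sketch on simulated values (the numbers are illustrative and are not NHANES results); the t-based interval is included only for comparison.

```r
# Simulated hemoglobin values (illustrative only)
set.seed(42)
hgb <- rnorm(500, mean = 14, sd = 1.5)

n   <- length(hgb)
se  <- sd(hgb) / sqrt(n)   # standard error of the mean
est <- mean(hgb)

# Normal-approximation 95% CI, as used in the slides
ci <- c(lower = est - 1.96 * se, upper = est + 1.96 * se)

# t-based 95% CI, slightly wider for finite n
t_ci <- est + c(-1, 1) * qt(0.975, df = n - 1) * se
```

With samples as large as the groups in this dataset, the two intervals are nearly identical.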
Forest plots
ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) + geom_pointrange()
Forest plots: flip the axes, add labels
ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Demographics") + ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()
Forest plots: calculating mean and CI within ggplot
· ggplot can calculate the mean and CI using stat_summary
· Further data manipulation would be needed to stack multiple variables
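One way to do the stacking mentioned above is to reshape the data to long format before plotting. The sketch below assumes simulated stand-in data and uses tidyr's `pivot_longer()`; `demo`, `stacked` and the `mean_ci()` helper are hypothetical names, not part of the original slides.

```r
library(ggplot2)
library(tidyr)

# Simulated stand-in for the anemia data (illustrative only)
set.seed(7)
n <- 300
demo <- data.frame(
  sex  = sample(c("Male", "Female"), n, replace = TRUE),
  race = sample(c("Hispanic", "White", "Black", "Other"), n, replace = TRUE),
  hgb  = rnorm(n, mean = 14, sd = 1.5)
)

# Stack sex and race into one grouping column: each person now
# contributes one row per grouping variable
stacked <- pivot_longer(demo, cols = c(sex, race),
                        names_to = "variable", values_to = "Type")

# Hypothetical helper: mean with a normal-approximation 95% CI, returned
# in the y/ymin/ymax form that stat_summary(fun.data = ) expects
mean_ci <- function(x) {
  se <- sd(x) / sqrt(length(x))
  data.frame(y = mean(x), ymin = mean(x) - 1.96 * se, ymax = mean(x) + 1.96 * se)
}

p <- ggplot(stacked, aes(x = Type, y = hgb)) +
  stat_summary(fun.data = mean_ci, geom = "pointrange") +
  coord_flip() +
  theme_bw()
p
```

Writing the summary function by hand avoids the extra package dependency of `mean_cl_normal` and keeps the interval consistent with the 1.96 multiplier used earlier.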
Calculating mean and CI within ggplot
# mean_cl_normal() requires the Hmisc package to be installed
ggplot(anemia, aes(x=race, y=hgb)) +
  stat_summary(fun.data=mean_cl_normal) +
  coord_flip() +
  theme_bw() +
  xlab("Demographics") + ylab("Mean Hemoglobin (95% CI)")
Forest plots: adding faceting
ggplot(any.fit3, aes(x=V3, y=A1, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Predictor Variable") +
  ylab("Adjusted Risk Difference per 100 (95% CI)") +
  scale_y_continuous(breaks=c(-20,-15,-10,-5,0,5,10,15,20,25), limits = c(-21,26)) +
  theme_bw() +
  geom_hline(yintercept=0, lty=2) +
  facet_grid(setting~., scales= 'free', space='free')
Practice
· Use the anemia dataset to practice making forest plots using other continuous variables
· Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
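One possible starting point for the second exercise, sketched on simulated data: `cut()` creates the categorized age variable as a factor in a single step. The break points and labels here are arbitrary assumptions, not values from the slides.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data (illustrative only)
set.seed(3)
demo <- data.frame(age = runif(400, 18, 90), hgb = rnorm(400, 14, 1.5))

# cut() returns a factor whose levels follow the order of the breaks
by_age <- demo %>%
  mutate(age_cat = cut(age, breaks = c(18, 40, 60, 75, Inf), right = FALSE,
                       labels = c("18-39", "40-59", "60-74", "75+"))) %>%
  group_by(age_cat) %>%
  summarise(n = n(), mean = mean(hgb), se = sd(hgb) / sqrt(n),
            lower = mean - 1.96 * se, upper = mean + 1.96 * se)

p <- ggplot(by_age, aes(x = age_cat, y = mean, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  coord_flip() +
  xlab("Age category (years)") + ylab("Mean hemoglobin (95% CI)") +
  theme_bw()
p
```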
Kaplan-Meier plots - WIHS data
· The Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
· Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
· Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death
Kaplan-Meier plots
· MANY package options to plot survival functions
· All use the survival package to calculate survival over time
  - survfit (survival) + survplot (rms)
  - ggkm (sachsmc/ggkm) & ggplot2
  - ggkm (michaelway/ggkm)
· Allows for multiple treatments and subgroups
· Does not take into account competing risks
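Since every option listed above relies on the survival package for the estimate, a Kaplan-Meier curve can also be drawn with plain ggplot2 and no GitHub packages. The sketch below assumes simulated time-to-event data (not the WIHS data) and flattens the `survfit` object by hand.

```r
library(survival)
library(ggplot2)

# Simulated time-to-event data (illustrative only)
set.seed(11)
n <- 300
sim <- data.frame(
  group       = rep(c("A", "B"), each = n / 2),
  event_time  = c(rexp(n / 2, rate = 0.10), rexp(n / 2, rate = 0.20)),
  censor_time = runif(n, 0, 15)
)
sim$time   <- pmin(sim$event_time, sim$censor_time)
sim$status <- as.numeric(sim$event_time <= sim$censor_time)  # 1 = event, 0 = censored

fit <- survfit(Surv(time, status) ~ group, data = sim)

# Flatten the survfit object; fit$strata holds the number of time
# points per group, so the group names are repeated accordingly
km <- data.frame(
  time  = fit$time,
  surv  = fit$surv,
  group = rep(names(fit$strata), times = fit$strata)
)

p_km <- ggplot(km, aes(x = time, y = surv, color = group)) +
  geom_step() +
  ylim(0, 1) +
  xlab("Time") + ylab("Survival probability") +
  theme_bw()
p_km
```

The curves in this sketch start at the first observed time rather than at time 0, which is one of the details the dedicated packages handle automatically.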
Kaplan-Meier example 1
· Calculate KM within ggplot
· https://github.com/sachsmc/ggkm
· Prep data
wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
wihs <- wihs %>% mutate(time = ifelse(is.na(time), study_end, time))
KM plot within ggplot2
devtools::install_github("sachsmc/ggkm")
library(ggkm)
ggplot(wihs, aes(time = time, status = outcome)) + geom_km()
KM by treatment group
ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) + geom_km()
Add confidence bands
ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) + geom_km() + geom_kmband()
KM example #2
· Calculated using survival package
· Plots KM curve with numbers at risk
· Same package name as previous example!
· https://github.com/michaelway/ggkm
remove.packages("ggkm")
devtools::install_github("michaelway/ggkm")
library(ggkm)
KM example 2
library(survival)
fit <- survfit(Surv(time, outcome) ~ idu, data=wihs)
ggkm(fit)
KM with numbers at risk
ggkm(fit, table=TRUE, marks = FALSE, ystratalabs = c("No IDU", "History of IDU"))
Cumulative incidence plots
· 1 - survival probability
· ipwrisk package - coming soon!
  - calculates adjusted cumulative incidence curves using IPTW
  - addresses censoring (IPCW) and competing risks
  - produces tables and graphics
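The first bullet can be illustrated directly: a crude (unadjusted) cumulative incidence curve is one minus the Kaplan-Meier estimate. This is a minimal sketch on simulated data using only the survival package and ggplot2; it does not use ipwrisk and, unlike that package, does no weighting and ignores competing risks.

```r
library(survival)
library(ggplot2)

# Simulated time-to-event data (illustrative only)
set.seed(5)
n <- 200
event_time  <- rexp(n, rate = 0.15)
censor_time <- runif(n, 0, 12)
time   <- pmin(event_time, censor_time)
status <- as.numeric(event_time <= censor_time)  # 1 = event, 0 = censored

fit <- survfit(Surv(time, status) ~ 1)

# Cumulative incidence = 1 - survival probability
cuminc <- data.frame(time = fit$time, risk = 1 - fit$surv)

p_ci <- ggplot(cuminc, aes(x = time, y = risk)) +
  geom_step() +
  xlab("Time") + ylab("Cumulative incidence") +
  theme_bw()
p_ci
```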
Sankey diagram
· Visualization that shows the flow of patients between states (over time)
· States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
· Paths connecting states are called links; the thickness of each line corresponds to the proportion of patients
· Example: https://vizhub.healthdata.org/dex/
Basic sankey diagrams in R
library(networkD3)
library(reshape2)
library(magrittr)

# One row per state (node)
nodes <- data.frame(name=c("Renal Failure", "Hemodialysis at 6m", "Transplant at 6m",
                           "Death by 6m", "Hemodialysis at 12m", "Transplant at 12m",
                           "Death by 12m"))

# One row per flow; source and target are zero-based row indices into nodes
links <- data.frame(source=c(0,0,0,1,1,1,2,2,2,3),
                    target=c(1,2,3,4,5,6,4,5,6,6),
                    value=c(70,20,10,40,20,10,15,4,1,10))

sankeyNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 22, nodeWidth = 30, nodePadding = 5)
Final Tips
· Spend time planning your graph
· Make sure to have the data in the correct structure before you start graphing
· Start with a simple graph, then gradually build in complexity
Further reading
· ggplot2: http://docs.ggplot2.org/current/
· Cookbook for R: http://www.cookbook-r.com/Graphs/
· Quick-R: http://www.statmethods.net/index.html
Wrap-up
· Questions?
· Acknowledgements: Alan Brookhart, Sara Levintow
· Contact info: [email protected]