R User Guide 1 - Data Restructuring


Lei Wang, CFAR Biometrics Core, University of Washington

Updated: November 17, 2019

Contents

1 Introduction

1.1 Install R and RStudio

1.2 Create a new R Script

1.3 Load packages

1.4 Import and export dataset

1.5 R Basics

1.6 dplyr

1.7 Reshape

2 HIV Pediatrics Project

2.1 Import dataset from REDCap into R

2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the restructuring will go more smoothly downstream.

2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to each CRF.

2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps:

1 Introduction

The goal of this guide is to show how to do data management and analysis in R. The first section covers R basics. The second section is a data restructuring example based on the HIV Pediatrics dataset, which corresponds to the STATA and SAS versions on the CFAR training website. More data management skills are introduced in the “Data management syntax”.

My data folders have this structure:

Data
    SourceData
        2019_09sep_09
            Stata
            SAS
            R (This file imports the data from here.)
    OutputData (This file writes the data to here.)


Figure 1: How to use this guide

There are three fonts in this file (Figure 1):

1) Text introduction - explains what we are exploring.

2) R script - the R code for the tasks described in the text introduction.

3) R output - the lines that follow the R script and start with “##” are the output produced by executing the script. You will see the same output when you run the code in R/RStudio.

1.1 Install R and RStudio

Download the latest version of R from CRAN (select the version for Mac, Windows, or Linux), then install it.

Download RStudio, the most popular R IDE (the free version), then install it.

Then open RStudio. It lets you have the console, your script file (more on this later), and the help panes within the same interface. It also includes menus and buttons if you prefer a gentler introduction to statistical software, particularly if you are coming to R from a program like SPSS. Even for more advanced users, RStudio is a useful platform for knitr, rmarkdown, and shiny.

1.2 Create a new R Script

1) In RStudio, click File -> New File -> R Script to open a blank .R file.

2) At the top of the R script, enter the title, author, date, and description of the file as comments (lines beginning with “#”). This will help you organize your R scripts.

#######################################################################
# Title: Data Management in R
# Description: Introduce the data management and analysis skills in R
# Author: Lei Wang (CFAR Biometrics Core, UW)
# Source file: R User Guide 2019_09_09.Rmd
# Output file: R_User_Guide_2019_09_09.pdf
# Created: 02-20-2019
# Updated: 09-27-2019
#######################################################################

3) Save the .R file by clicking File -> Save

1.3 Load packages

R has many add-on packages (collections of functions) that you can use for specific analysis purposes. First install a package by running “install.packages()”, then make its contents available by running “library()”. Below is a list of useful packages.

# install the package first (you only need to install it once)
# install.packages("haven")

# you need to load the package every time you open an R project
library(haven) # useful functions: read_dta, read_sas
# haven enables R to read and write data from SAS, SPSS, and STATA.

# install.packages("dplyr")
library(dplyr)
# dplyr is our go-to package for fast data manipulation.

# install.packages("reshape2")
library(reshape2)
# reshape2 provides melt() and dcast() to transform data between wide and long format.
# (The reshape() function used in section 1.7 is part of base R.)

# install.packages("ggplot2")
library(ggplot2)
# visualization

# install.packages("geepack")
library(geepack)
# GEE models (function geeglm)
# Other model packages:
# lme4/nlme - Linear and Non-linear mixed effects models
# randomForest - Random forest methods from machine learning
# multcomp - Tools for multiple comparison testing
# glmnet - Lasso and elastic-net regression methods with cross validation
# survival - Tools for survival analysis

# install.packages("openxlsx")
library(openxlsx)
# output datasets into Excel with adjustable formats
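Because install.packages() only needs to run once per machine, it can help to check which packages are missing before loading them. The following is a minimal sketch and is not part of the original guide: the helper name missing_packages is hypothetical, and it relies only on base R's requireNamespace().

```r
# Hypothetical helper (not from the guide): report which packages are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages listed in this section
wanted <- c("haven", "dplyr", "reshape2", "ggplot2", "geepack", "openxlsx")
to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Keeping the install step commented out means the script never changes your library without you noticing.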


1.4 Import and export dataset

# Use "/" instead of "\" in the file directories

# .csv dataset (dt_csv is the assigned dataset name in R)
dt_csv <- read.csv("C:/Users/lwang7/OneDrive/Dataset.csv", header = TRUE)

# SPSS dataset (read_sav reads SPSS .sav files)
dt_spss <- read_sav("C:/Users/lwang7/OneDrive/Dataset.sav")

# SAS dataset (read_sas reads SAS .sas7bdat files; the file name is illustrative)
dt_sas <- read_sas("C:/Users/lwang7/OneDrive/Dataset.sas7bdat")

# STATA dataset
dt_stata <- read_dta("C:/Users/lwang7/OneDrive/stata_dataset.dta")

# Excel spreadsheet
dt_xlsx <- read.xlsx("C:/Users/lwang7/OneDrive/Data/Dataset.xlsx",
                     sheet = "Calculator", startRow = 1, colNames = TRUE,
                     rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE,
                     skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE,
                     namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE)

# Export dataset into .csv file
write.csv(mydata,
          file = paste0("C:/Users/lwang7/OneDrive/Output/Newdataset_", Sys.Date(), ".csv"),
          na = "")
# This will create a CSV dataset with the current date in the file name.
# paste0() is used because paste() would insert spaces around the date.
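To see the dated export in action without touching real folders, here is a small round-trip sketch. It is an illustration under assumptions: the toy values are made up, tempdir() stands in for the output folder, and row.names = FALSE and colClasses are optional choices that the guide itself does not use.

```r
# Toy data standing in for "mydata" (values are made up for illustration)
mydata <- data.frame(ID = c("0001", "0002"), weight = c(10.5, NA),
                     stringsAsFactors = FALSE)

# Build "<folder>/Newdataset_<date>.csv"; tempdir() is used so the sketch runs anywhere
out_file <- file.path(tempdir(), paste0("Newdataset_", Sys.Date(), ".csv"))

# na = "" writes missing values as empty cells;
# row.names = FALSE avoids an extra column of row numbers
write.csv(mydata, file = out_file, na = "", row.names = FALSE)

# Read the file back; colClasses keeps the leading zeros in ID
check <- read.csv(out_file, colClasses = c(ID = "character"),
                  stringsAsFactors = FALSE)
print(basename(out_file))
print(check)
```

Reading the exported file back once is a cheap way to confirm that IDs and missing values survived the export.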

1.5 R Basics

# display the current path
getwd()
# set the destination path
setwd("C:/Users/lwang7/Documents")

# overview of the dataset (describe() comes from the Hmisc package)
describe(mydata)

# Table one variable
table(mydata$gender)

# show information of missing values
table(mydata$gender, exclude=NULL)

# crosstable of gender*outcome
table(mydata$gender, mydata$outcome, exclude=NULL)

# Histogram
hist(mydata$weight)

# check the row number and column number of the dataset
dim(mydata)
# [1] 1289 1474 # row and column, respectively

# %notin% is not part of base R, so define it first
`%notin%` <- Negate(`%in%`)

# Display the ID in pcr dataset but not in the main dataset
pcr$ID[which(pcr$ID %notin% main$ID)]

# subset dataset (select rows by ID and then keep 3 columns)
pcr2 <- pcr[which(pcr$ID %notin% main$ID) , c("ID", "date", "pcr_value")]
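The ID comparison above can be tried on toy data. This sketch uses only base R; the data frames pcr and main mirror the names in the guide, but their contents are invented for illustration.

```r
# Toy datasets (values are made up)
pcr <- data.frame(ID = c("0001", "0002", "0005", "0007"),
                  date = as.Date(c("2019-01-10", "2019-01-12", "2019-02-01", "2019-02-03")),
                  pcr_value = c(120, 85, 430, 60),
                  stringsAsFactors = FALSE)
main <- data.frame(ID = c("0001", "0002", "0003"), stringsAsFactors = FALSE)

# IDs present in pcr but absent from main
only_in_pcr <- setdiff(pcr$ID, main$ID)
print(only_in_pcr)

# The same rows selected with a logical index, keeping three columns
pcr_unmatched <- pcr[!(pcr$ID %in% main$ID), c("ID", "date", "pcr_value")]
print(pcr_unmatched)
```

setdiff() returns the unique unmatched IDs, while the logical index keeps every unmatched row, which matters when an ID appears more than once.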

1.6 dplyr

dplyr is a powerful R package for data management. It provides simple functions that correspond to the most common data manipulation tasks, which helps you translate your ideas into code. It is also efficient, so you spend less time waiting for the computer to process each step. dplyr uses the pipe operator %>% to chain steps, which makes it easy to organize multiple manipulation steps together.

# syntax
# dplyr uses "%>%" to separate commands
# with "%>%" at the end of a line, R continues to the next line
# until it reaches a line without %>% at the end

newdata <- olddata %>% # continue
  arrange(ID, DATE) %>% # continue
  group_by(ID) # stop and execute one by one

newdata2 <- newdata %>%
  mutate (newvar = age) # stop and execute

# the two steps above can be combined together as below
newdata2 <- olddata %>% # continue
  arrange(ID, DATE) %>% # continue
  group_by(ID) %>% # continue
  mutate (newvar = age) # stop and execute one by one
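A small runnable version of the same pattern may make the chaining easier to follow. This is a sketch on invented data (the data frame visits and its values are assumptions); it needs the dplyr package to be installed.

```r
library(dplyr)

# Toy visit data, deliberately unsorted (values are made up)
visits <- data.frame(
  ID = c("02", "01", "01", "02", "01"),
  DATE = as.Date(c("2019-03-01", "2019-02-01", "2019-01-01",
                   "2019-01-15", "2019-03-10")),
  stringsAsFactors = FALSE
)

visits2 <- visits %>%
  arrange(ID, DATE) %>%            # sort by subject, then by date
  group_by(ID) %>%                 # the next mutate works within each subject
  mutate(n_visits = n(),           # number of rows for the subject
         visit_no = row_number(),  # 1, 2, 3, ... within the subject
         last_date = max(DATE)) %>%
  ungroup()                        # drop the grouping again

print(visits2)
```

Ending the chain with ungroup() keeps later steps from silently running per subject.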

newdata <- olddata %>%

  # Create a new variable
  mutate (ID = substr(as.character(SubjID), 3, 6) ) %>%
  mutate (DATE = as.Date(DATE2, format = "%Y-%m-%d")) %>%

  # replace 8 with 0 in existing variable
  mutate (adlq01now = ifelse (adlq01now == 8, 0, adlq01now)) %>%

  # group by ID
  group_by(ID) %>%

  # Sort the dataset by ID and DATE
  arrange(ID, DATE) %>%

  # Create a variable for the number of observations of every subject
  mutate (n2=n()) %>%

  # get the max follow-up date
  mutate (dt_follow_max = max(dt_follow)) %>%

  # free the dataset from group
  ungroup() %>%

  # Calculate the max value of test from multiple tests
  # Use pmax instead of max to calculate maximum for every row/visit
  # (pass the columns as separate arguments, not wrapped in c())
  # Please remember to use "na.rm=T" because NAs always cause issues
  # when we calculate min, max, and sum.
  mutate (score_max = pmax(t1, t2, t3, t4, t5, na.rm=T)) %>%

  # Transform the date variable from numeric format to date format
  mutate (date_enroll = as.Date(DATE, format = "%Y-%m-%d", origin = "1970-01-01")) %>%

  # Calculate the days between two dates
  mutate ( cd4_days = difftime(date_enroll, cd4dt, units = "days") ) %>%

  # Merge datasets
  right_join(fu, by=c("ID", "date_enroll")) %>%

  # Subset datasets - remove ID equals to "0793"
  filter (ID != "0793" ) %>%

  # Remove these variables and keep others
  select (-nID, -flag, -date_enroll1) %>%

  # Keep these variables only
  select (ID, DATE, cd4_days)
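The difference between max() and pmax(), and the effect of na.rm, is easiest to see on a few rows. The sketch below uses base R only; the data frame scores and its values are made up for illustration.

```r
# Three test scores per visit, with some missing values (made-up data)
scores <- data.frame(t1 = c(3, NA, 7),
                     t2 = c(5, 2, NA),
                     t3 = c(4, 9, 1))

# max() collapses everything into a single number for the whole dataset
overall_max <- max(scores$t1, scores$t2, scores$t3, na.rm = TRUE)

# pmax() compares the columns element by element: one maximum per row
row_max <- pmax(scores$t1, scores$t2, scores$t3, na.rm = TRUE)

# Without na.rm = TRUE, any NA in a row makes that row's result NA
row_max_with_na <- pmax(scores$t1, scores$t2, scores$t3)

print(overall_max)
print(row_max)
print(row_max_with_na)
```

Note that the columns are passed to pmax() as separate arguments; wrapping them in c() would glue them into one long vector and lose the row structure.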

1.7 Reshape

# wide to long
cm2 <- reshape(cm1, direction='long',
               # one vector of wide columns per long-format variable,
               # listed in the same order as v.names
               varying=list(c('cm_name1','cm_name2','cm_name3'),
                            c('cm_indication1','cm_indication2','cm_indication3'),
                            c('cm_start_date1','cm_start_date2','cm_start_date3'),
                            c('cm_stop_date1','cm_stop_date2','cm_stop_date3')),
               timevar="trt_number",
               times=c('1','2','3'),
               v.names=c('cm_name','cm_indication','cm_start_date','cm_stop_date'),
               idvar=c('ptid','visitcode'))
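A reduced example shows what the wide-to-long call produces. This is a sketch on invented data (cm_wide, the drug names, and the dates are assumptions) that uses base R's reshape() with two treatment slots instead of three. Here varying is given as a list with one vector per long-format variable, in the same order as v.names, so the mapping from wide columns to long columns is explicit.

```r
# Two participants, up to two concomitant medications each (made-up data)
cm_wide <- data.frame(
  ptid = c("A", "B"),
  cm_name1 = c("drugX", "drugY"),
  cm_start_date1 = c("2018-01-01", "2018-02-01"),
  cm_name2 = c("drugZ", NA),
  cm_start_date2 = c("2018-03-01", NA),
  stringsAsFactors = FALSE
)

cm_long <- reshape(cm_wide, direction = "long",
                   varying = list(c("cm_name1", "cm_name2"),
                                  c("cm_start_date1", "cm_start_date2")),
                   v.names = c("cm_name", "cm_start_date"),
                   timevar = "trt_number",
                   times = c("1", "2"),
                   idvar = "ptid")

# One row per participant and treatment slot
print(cm_long)

# Drop the empty slots if only recorded medications are wanted
cm_long_recorded <- cm_long[!is.na(cm_long$cm_name), ]
```

The long dataset has one row per participant per slot, including slots that were never filled in, which is why the last line filters on cm_name.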


2 HIV Pediatrics Project

The dataset was exported from REDCap. REDCap generated a .csv file (the data) and a .r file (data formatting code). We need to run the .r program first to import and format the dataset, and then follow with data management and analysis.

2.1 Import dataset from REDCap into R

The following code was copied from the .r file downloaded from REDCap. I made two changes and recommend them:

1) CHANGE 1 - Change the file path to where you saved your downloaded REDCap dataset.

2) CHANGE 2 - Add a parameter to read.csv so that text is imported as character instead of factor. Character variables are easier to use in data management. Some people say that you have to use factor variables in statistical models (e.g. linear regression), but this is not necessary because the lm() function converts character variables into factors automatically. When you do need a factor, use as.factor() to convert the variable.

The R code from REDCap will generate a list of new factor variables. The reason is that REDCap saves categorical variables with a code number and a category name, both of which can be found in the code book. For instance, the variable “demographics_complete” has three values “0”, “1”, and “2”, which correspond to “Incomplete”, “Unverified”, and “Complete” (in the variable “demographics_complete.factor”), respectively.
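The effect of stringsAsFactors, and the claim that lm() handles character predictors, can be checked on a few rows. The sketch below is illustrative only: the inline CSV text and its values are made up, and read.csv(text = ...) is used so that no file is needed. (Since R 4.0.0, stringsAsFactors = FALSE is already the default, so the argument matters mainly on older versions.)

```r
# A tiny CSV held in a string (made-up values)
csv_text <- "id,arm,weight
1,Control,10.2
2,Intervention,11.5
3,Control,9.8
4,Intervention,12.1"

# Text columns stay character with stringsAsFactors = FALSE ...
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(toy$arm))

# ... and become factors with stringsAsFactors = TRUE
toy_f <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(class(toy_f$arm))

# lm() converts the character predictor to a factor on its own
fit <- lm(weight ~ arm, data = toy)
print(names(coef(fit)))

# Convert explicitly only when a factor is really needed
toy$arm_f <- as.factor(toy$arm)
print(levels(toy$arm_f))
```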

#Clear existing data and graphics
rm(list=ls())
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data

#################### CHANGE 1 ####################
## add the file path where you saved the .csv
setwd('C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data/')
getwd()

## [1] "C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data"

#################### CHANGE 2 ####################
## add stringsAsFactors=FALSE in read.csv
data=read.csv('SourceData/2019_09sep_09/R/CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv',
              stringsAsFactors=FALSE)

# data=read.csv('CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv') # original from REDCap

#Setting Labels

label(data$study_id)="Participant Enrollment ID (ends in 9)"
label(data$redcap_event_name)="Event Name"
label(data$demographics_complete)="Complete?"
label(data$ptid_studyid_validation)="Participant Enrollment ID (ends in 9)"
label(data$check_studyid_validation)="Check to validate Participant ID"
label(data$studyid_validation_complete)="Complete?"
label(data$crfversion_en)="Enrollment (EN): CRF Version"
label(data$scid_en)="Participant Screening ID (ends in 7)"
label(data$ptid_en)="Participant Enrollment ID (ends in 9)"
label(data$visitcode_en)="Visit Code"
label(data$visitdate_en)="Visit Date"
label(data$ageyrs_en)="1a. Age (years)"
label(data$sex_en)="1c. Sex"
label(data$sympfev_en)="Fever"
label(data$sympfevdur_en)="Fever: Duration (days)"
label(data$sympoth1_en)="Other (1)"
label(data$sympoth1sp_en)="Other (1): Specify"
label(data$sympoth1dur_en)="Other (1) : Duration (days)"
label(data$sympoth2_en)="Other (2)"
label(data$sympoth2sp_en)="Other (2): Specify"
label(data$sympoth2dur_en)="Other (2): Duration (days)"
label(data$txmed1_en)="Medication (1)"
label(data$txmed1name_en)="Medication (1): Name"
label(data$txmed1ind_en)="Medication (1): Indication"
label(data$txmed1startdt_en)="Medication (1): Start date"
label(data$txmed1stopdt_en)="Medication (1): End date"
label(data$txmed2_en)="Medication (2)"
label(data$txmed2name_en)="Medication (2): Name"
label(data$txmed2ind_en)="Medication (2): Indication"
label(data$txmed2startdt_en)="Medication (2): Start date"
label(data$txmed2stopdt_en)="Medication (2): End date"
label(data$weight_en)="Current weight (kg)"
label(data$height_en)="Current height (cm)"
label(data$temp_en)="Body temperature (degrees C)"
label(data$heart_en)="Heart rate (per minute)"
label(data$bpsystol_en)="Blood pressure: Systolic (mmHg)"
label(data$bpdiast_en)="Blood pressure: Diastolic (mmHg)"
label(data$nxtvisitdt_en)="Next visit: Date"
label(data$frmcomplby_en)="Form completed by: Staff code"
label(data$frmcomplbydt_en)="Form completed by: Date"
label(data$frmdbaseby_en)="Form entered into database by: Staff code"
label(data$frmdbasebydt_en)="Form entered into database by: Date"
label(data$dbasecomment1_en)="Database comments (1)"
label(data$dbasecomment2_en)="Database comments (2)"
label(data$dbasecomment3_en)="Database comments (3)"
label(data$enrollment_en_complete)="Complete?"
label(data$crfversion_rand)="Randomization arm (RAND):CRF Version"
label(data$ptid_rand)="Participant Enrollment ID (ends in 9)"
label(data$randdate_rand)="Randomization: Date"
label(data$randarm_rand)="Randomization: Arm"
label(data$frmcomplby_rand)="Form completed by: Staff code"
label(data$frmcomplbydt_rand)="Form completed by: Date"
label(data$frmdbaseby_rand)="Form entered into database by: Staff code"
label(data$frmdbasebydt_rand)="Form entered into database by: Date"
label(data$dbasecomment1_rand)="Database comments (1)"
label(data$dbasecomment2_rand)="Database comments (2)"
label(data$dbasecomment3_rand)="Database comments (3)"
label(data$randomization_rand_complete)="Complete?"
label(data$crfversion_lr)="Lab Results (LR):CRF Version"
label(data$ptid_lr)="Participant Enrollment ID (ends in 9)"
label(data$visitcode_lr)="Sample collection: Visit Code"
label(data$visitdate_lr)="Sample collection: Visit Date"
label(data$cd4cnt_lr)="CD4 Count (cells/uL)"
label(data$cd8cnt_lr)="CD8 Count (cells/uL)"
label(data$cd4to8ratio_lr)="CD4/CD8 ratio"
label(data$wbc_lr)="WBC (x10^3 cells/uL)"
label(data$rbc_lr)="RBC (x10^6 cells/uL)"
label(data$hb_lr)="Hemoglobin (g/dl)"
label(data$viralload_lr)="Viral load (copies/ml)"
label(data$frmcomplby_lr)="Form completed by: Staff code"
label(data$frmcomplbydt_lr)="Form completed by: Date"
label(data$frmdbaseby_lr)="Form entered into database by: Staff code"
label(data$frmdbasebydt_lr)="Form entered into database by: Date"
label(data$dbasecomment1_lr)="Database comments (1)"
label(data$dbasecomment2_lr)="Database comments (2)"
label(data$dbasecomment3_lr)="Database comments (3)"
label(data$lab_results_lr_complete)="Complete?"
label(data$crfversion_fu)="Followup (FU) Visit: CRF Version"
label(data$ptid_fu)="Participant Enrollment ID (ends in 9)"
label(data$visitcode_fu)="Visit Code"
label(data$visitdate_fu)="Visit Date"
label(data$sympfev_fu)="Fever"
label(data$sympfevdur_fu)="Fever: Duration (days)"
label(data$sympoth1_fu)="Other (1)"
label(data$sympoth1sp_fu)="Other (1): Specify"
label(data$sympoth1dur_fu)="Other (1) : Duration (days)"
label(data$sympoth2_fu)="Other (2)"
label(data$sympoth2sp_fu)="Other (2): Specify"
label(data$sympoth2dur_fu)="Other (2): Duration (days)"
label(data$txmed1_fu)="Medication (1)"
label(data$txmed1name_fu)="Medication (1): Name"
label(data$txmed1ind_fu)="Medication (1): Indication"
label(data$txmed1startdt_fu)="Medication (1): Start date"
label(data$txmed1stopdt_fu)="Medication (1): End date"
label(data$txmed2_fu)="Medication (2)"
label(data$txmed2name_fu)="Medication (2): Name"
label(data$txmed2ind_fu)="Medication (2): Indication"
label(data$txmed2startdt_fu)="Medication (2): Start date"
label(data$txmed2stopdt_fu)="Medication (2): End date"
label(data$weight_fu)="Current weight (kg)"
label(data$height_fu)="Current height (cm)"
label(data$temp_fu)="Body temperature (degrees C)"
label(data$heart_fu)="Heart rate (per minute)"
label(data$bpsystol_fu)="Blood pressure: Systolic (mmHg)"
label(data$bpdiast_fu)="Blood pressure: Diastolic (mmHg)"
label(data$nxtvisitdt_fu)="Next visit: Date"
label(data$frmcomplby_fu)="Form completed by: Staff code"
label(data$frmcomplbydt_fu)="Form completed by: Date"
label(data$frmdbaseby_fu)="Form entered into database by: Staff code"
label(data$frmdbasebydt_fu)="Form entered into database by: Date"
label(data$dbasecomment1_fu)="Database comments (1)"
label(data$dbasecomment2_fu)="Database comments (2)"
label(data$dbasecomment3_fu)="Database comments (3)"
label(data$followup_fu_complete)="Complete?"
#Setting Units

#Setting Factors(will create new variable for factors)
data$redcap_event_name.factor = factor(data$redcap_event_name,levels=c("enrollment_arm_1","followup_6month_arm_1","followup_12month_arm_1"))
data$demographics_complete.factor = factor(data$demographics_complete,levels=c("0","1","2"))
data$check_studyid_validation.factor = factor(data$check_studyid_validation,levels=c("1"))
data$studyid_validation_complete.factor = factor(data$studyid_validation_complete,levels=c("0","1","2"))
data$crfversion_en.factor = factor(data$crfversion_en,levels=c("1","2"))
data$sex_en.factor = factor(data$sex_en,levels=c("1","2"))
data$sympfev_en.factor = factor(data$sympfev_en,levels=c("1","0"))
data$sympoth1_en.factor = factor(data$sympoth1_en,levels=c("1","0"))
data$sympoth2_en.factor = factor(data$sympoth2_en,levels=c("1","0"))
data$txmed1_en.factor = factor(data$txmed1_en,levels=c("1","0"))
data$txmed2_en.factor = factor(data$txmed2_en,levels=c("1","0"))
data$frmcomplby_en.factor = factor(data$frmcomplby_en,levels=c("1","2"))
data$frmdbaseby_en.factor = factor(data$frmdbaseby_en,levels=c("1","2"))
data$enrollment_en_complete.factor = factor(data$enrollment_en_complete,levels=c("0","1","2"))
data$crfversion_rand.factor = factor(data$crfversion_rand,levels=c("1","2"))
data$randarm_rand.factor = factor(data$randarm_rand,levels=c("0","1"))
data$frmcomplby_rand.factor = factor(data$frmcomplby_rand,levels=c("1","2"))
data$frmdbaseby_rand.factor = factor(data$frmdbaseby_rand,levels=c("1","2"))
data$randomization_rand_complete.factor = factor(data$randomization_rand_complete,levels=c("0","1","2"))
data$crfversion_lr.factor = factor(data$crfversion_lr,levels=c("1","2"))
data$frmcomplby_lr.factor = factor(data$frmcomplby_lr,levels=c("1","2"))
data$frmdbaseby_lr.factor = factor(data$frmdbaseby_lr,levels=c("1","2"))
data$lab_results_lr_complete.factor = factor(data$lab_results_lr_complete,levels=c("0","1","2"))
data$crfversion_fu.factor = factor(data$crfversion_fu,levels=c("1","2"))
data$sympfev_fu.factor = factor(data$sympfev_fu,levels=c("1","0"))
data$sympoth1_fu.factor = factor(data$sympoth1_fu,levels=c("1","0"))
data$sympoth2_fu.factor = factor(data$sympoth2_fu,levels=c("1","0"))
data$txmed1_fu.factor = factor(data$txmed1_fu,levels=c("1","0"))
data$txmed2_fu.factor = factor(data$txmed2_fu,levels=c("1","0"))
data$frmcomplby_fu.factor = factor(data$frmcomplby_fu,levels=c("1","2"))
data$frmdbaseby_fu.factor = factor(data$frmdbaseby_fu,levels=c("1","2"))
data$followup_fu_complete.factor = factor(data$followup_fu_complete,levels=c("0","1","2"))

levels(data$redcap_event_name.factor)=c("Enrollment","Followup: 6-month","Followup: 12-month")
levels(data$demographics_complete.factor)=c("Incomplete","Unverified","Complete")
levels(data$check_studyid_validation.factor)=c("Validate Participant ID")
levels(data$studyid_validation_complete.factor)=c("Incomplete","Unverified","Complete")
levels(data$crfversion_en.factor)=c("1.0","2.0")
levels(data$sex_en.factor)=c("Male","Female")
levels(data$sympfev_en.factor)=c("Yes","No")
levels(data$sympoth1_en.factor)=c("Yes","No")
levels(data$sympoth2_en.factor)=c("Yes","No")
levels(data$txmed1_en.factor)=c("Yes","No")
levels(data$txmed2_en.factor)=c("Yes","No")
levels(data$frmcomplby_en.factor)=c("Personnel1","Personnel2")
levels(data$frmdbaseby_en.factor)=c("Personnel1","Personnel2")
levels(data$enrollment_en_complete.factor)=c("Incomplete","Unverified","Complete")
levels(data$crfversion_rand.factor)=c("1.0","2.0")
levels(data$randarm_rand.factor)=c("Control","Intervention")
levels(data$frmcomplby_rand.factor)=c("Personnel1","Personnel2")
levels(data$frmdbaseby_rand.factor)=c("Personnel1","Personnel2")
levels(data$randomization_rand_complete.factor)=c("Incomplete","Unverified","Complete")
levels(data$crfversion_lr.factor)=c("1.0","2.0")
levels(data$frmcomplby_lr.factor)=c("Personnel1","Personnel2")
levels(data$frmdbaseby_lr.factor)=c("Personnel1","Personnel2")
levels(data$lab_results_lr_complete.factor)=c("Incomplete","Unverified","Complete")
levels(data$crfversion_fu.factor)=c("1.0","2.0")
levels(data$sympfev_fu.factor)=c("Yes","No")
levels(data$sympoth1_fu.factor)=c("Yes","No")
levels(data$sympoth2_fu.factor)=c("Yes","No")
levels(data$txmed1_fu.factor)=c("Yes","No")
levels(data$txmed2_fu.factor)=c("Yes","No")
levels(data$frmcomplby_fu.factor)=c("Personnel1","Personnel2")
levels(data$frmdbaseby_fu.factor)=c("Personnel1","Personnel2")
levels(data$followup_fu_complete.factor)=c("Incomplete","Unverified","Complete")

# COPY CONTENT FROM REDCAP .R FILE HERE: ENDS.
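The REDCap script builds each “.factor” variable in two steps: factor() with the code numbers as levels, then levels()<- to swap in the category names. The sketch below reproduces that pattern on a made-up vector (status_code is an invented example, using the 0/1/2 completion codes described above) and shows, as an alternative rather than as REDCap's method, the equivalent one-step call with labels=.

```r
# Completion codes as they appear in the .csv file (made-up vector)
status_code <- c(0, 2, 1, 2, NA)

# Step 1: the codes become the factor levels
status <- factor(status_code, levels = c("0", "1", "2"))
# Step 2: the levels are renamed to the category names
levels(status) <- c("Incomplete", "Unverified", "Complete")

# Equivalent one-step version using labels=
status_one_step <- factor(status_code, levels = c(0, 1, 2),
                          labels = c("Incomplete", "Unverified", "Complete"))

print(status)
# exclude = NULL also counts the missing value
print(table(status, exclude = NULL))
```

Because the order of levels and the order of names must match position by position, it is worth tabulating the code against the factor once after creating it.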

2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the restructuring will go more smoothly downstream.

# Now we have the data in R and its name is "data".
# Check the "Environment" pane on the top right of RStudio:
# "data" has 60 observations of 146 variables
# (114 from the .csv file plus 32 factor variables created above).
dim(data)

## [1] 60 146

# sort "data" by id and visit name
data <- data %>%
  arrange (study_id, redcap_event_name)

# save the label information
data_label = label(data)

2.2.1 QC PTIDs.

# display the top 5 rows of the variables that are listed
head(data[, c('study_id', 'redcap_event_name', 'ptid_studyid_validation',
              'ptid_en','ptid_rand', 'ptid_lr', 'ptid_fu','redcap_event_name' )], n=5)

##   study_id      redcap_event_name ptid_studyid_validation ptid_en
## 1  9.9e+09       enrollment_arm_1                      NA 9.9e+09
## 2  9.9e+09 followup_12month_arm_1                      NA      NA
## 3  9.9e+09  followup_6month_arm_1                      NA      NA
## 4  9.9e+09       enrollment_arm_1                      NA 9.9e+09
## 5  9.9e+09 followup_12month_arm_1                      NA      NA
##   ptid_rand ptid_lr ptid_fu    redcap_event_name.1
## 1   9.9e+09 9.9e+09      NA       enrollment_arm_1
## 2        NA 9.9e+09 9.9e+09 followup_12month_arm_1
## 3        NA 9.9e+09 9.9e+09  followup_6month_arm_1
## 4   9.9e+09 9.9e+09      NA       enrollment_arm_1
## 5        NA 9.9e+09 9.9e+09 followup_12month_arm_1

# study_id is not friendly for review; convert it to a character ID for convenience
data$study_id = as.character(data$study_id)

# create a new dataset with checking variables
ck1 = data %>%
  select (study_id, redcap_event_name, ptid_studyid_validation, ptid_en,
          ptid_rand, ptid_lr, ptid_fu, redcap_event_name) %>%
  mutate (check1 = ifelse (!is.na(ptid_en) & study_id != ptid_en, 1, NA),
          check2 = ifelse (!is.na(ptid_lr) & study_id != ptid_lr, 1, NA),
          check3 = ifelse (!is.na(ptid_fu) & study_id != ptid_fu, 1, NA),
          check4 = ifelse (!is.na(ptid_en) & !is.na(ptid_lr)
                           & ptid_en != ptid_lr, 1, NA))

# These 4 checking variables should be all missing.
ck1_vars = c("check1", "check2", "check3", "check4")
if (!all(is.na(ck1[, ck1_vars]))){
  cat("QC PTIDs FAILED!!! check the data below\n")
  # print the rows where at least one check is flagged
  print (ck1[rowSums(!is.na(ck1[, ck1_vars])) > 0, ])
}else{ cat( "QC PTIDs PASSED!")}

## QC PTIDs PASSED!

# *Now that we've QC'd the PTIDs, we can proceed with just study_id
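To see what a failed check looks like, the same idea can be run on a toy dataset with one deliberate error. This is a base R sketch under assumptions: the data frame ids and its ID values are invented, and logical flags are used instead of the 1/NA coding above.

```r
# Toy IDs (made up); the third row has a deliberate mismatch in ptid_en
ids <- data.frame(
  study_id = c("9900000019", "9900000029", "9900000039"),
  ptid_en  = c("9900000019", NA,           "9900000038"),
  ptid_lr  = c("9900000019", "9900000029", "9900000039"),
  stringsAsFactors = FALSE
)

# TRUE only when the form-level ID is present and differs from study_id
ids$check_en <- !is.na(ids$ptid_en) & ids$study_id != ids$ptid_en
ids$check_lr <- !is.na(ids$ptid_lr) & ids$study_id != ids$ptid_lr

# Keep the rows where any check is flagged
flagged <- ids[ids$check_en | ids$check_lr, ]

if (nrow(flagged) > 0) {
  cat("QC PTIDs FAILED: review the rows below\n")
  print(flagged)
} else {
  cat("QC PTIDs PASSED\n")
}
```

Testing !is.na() first matters: a missing form-level ID means the form was not collected at that event, which is not an error.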

2.2.2 QC DATES

# display the variable with "date" or "dt" in their name
var_list = colnames(data)[which( grepl("date", colnames(data)) |
                                 grepl("dt", colnames(data)))]
var_list

##  [1] "visitdate_en"      "txmed1startdt_en"  "txmed1stopdt_en"
##  [4] "txmed2startdt_en"  "txmed2stopdt_en"   "nxtvisitdt_en"
##  [7] "frmcomplbydt_en"   "frmdbasebydt_en"   "randdate_rand"
## [10] "frmcomplbydt_rand" "frmdbasebydt_rand" "visitdate_lr"
## [13] "frmcomplbydt_lr"   "frmdbasebydt_lr"   "visitdate_fu"
## [16] "txmed1startdt_fu"  "txmed1stopdt_fu"   "txmed2startdt_fu"
## [19] "txmed2stopdt_fu"   "nxtvisitdt_fu"     "frmcomplbydt_fu"
## [22] "frmdbasebydt_fu"

# overview these variables
describe(data[,var_list])

## data[, var_list]
##
##  22  Variables      60  Observations
## ---------------------------------------------------------------------------
## visitdate_en : Visit Date
##        n  missing distinct    value
##       20       40        1 2018-01-01
##
## Value      2018-01-01
## Frequency          20
## Proportion          1
## ---------------------------------------------------------------------------
## randdate_rand : Randomization: Date
##        n  missing distinct    value
##       20       40        1 2018-01-01
##
## Value      2018-01-01
## Frequency          20
## Proportion          1
## ---------------------------------------------------------------------------
## visitdate_lr : Sample collection: Visit Date
##        n  missing distinct
##       60        0        3
##
## Value      2018-01-01 2018-07-01 2019-01-01
## Frequency          20         20         20
## Proportion      0.333      0.333      0.333
## ---------------------------------------------------------------------------
## visitdate_fu : Visit Date
##        n  missing distinct
##       40       20        2
##
## Value      2018-07-01 2019-01-01
## Frequency          20         20
## Proportion        0.5        0.5
## ---------------------------------------------------------------------------
##
## Variables with all observations missing:
##
##  [1] txmed1startdt_en  txmed1stopdt_en   txmed2startdt_en
##  [4] txmed2stopdt_en   nxtvisitdt_en     frmcomplbydt_en
##  [7] frmdbasebydt_en   frmcomplbydt_rand frmdbasebydt_rand
## [10] frmcomplbydt_lr   frmdbasebydt_lr   txmed1startdt_fu
## [13] txmed1stopdt_fu   txmed2startdt_fu  txmed2stopdt_fu
## [16] nxtvisitdt_fu     frmcomplbydt_fu   frmdbasebydt_fu

# check the format --> date variables were imported as characters (chr)
# str(data[,var_list])

# format date variables into date formats
data = data %>%
  mutate ( visitdate_en = as.Date( visitdate_en , format = "%Y-%m-%d"),
           randdate_rand = as.Date( randdate_rand , format = "%Y-%m-%d"),
           visitdate_lr = as.Date( visitdate_lr , format = "%Y-%m-%d"),
           visitdate_fu = as.Date( visitdate_fu , format = "%Y-%m-%d") )

# check for issues
ck2 = data %>%
  mutate (check1 = ifelse( !is.na(visitdate_en) & !is.na(visitdate_lr) &
                           visitdate_en != visitdate_lr, 1, NA),
          check2 = ifelse( !is.na(visitdate_fu) & !is.na(visitdate_lr) &
                           visitdate_fu != visitdate_lr, 1, NA))

if (!all(is.na(ck2[, c("check1", "check2")]))){
  cat("QC DATES FAILED!!! check the data below\n")
  # print the rows where at least one check is flagged
  print (ck2[rowSums(!is.na(ck2[, c("check1", "check2")])) > 0, ])
}else{ cat( "QC VISIT DATES PASSED!")}

## QC VISIT DATES PASSED!
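One thing the conversion step can hide is a date typed in an unexpected format: as.Date() with an explicit format returns NA for it, so it looks the same as a blank. The base R sketch below, on made-up strings, flags such entries and also shows the number of days between two visits.

```r
# Raw date strings as they might arrive from a .csv file (made-up values)
raw_dates <- c("2018-01-01", "2018-07-01", "", "01/07/2019", NA)

# Entries that do not match the format become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Entries that had text but could not be parsed in the expected format
bad_format <- !is.na(raw_dates) & raw_dates != "" & is.na(parsed)
print(raw_dates[bad_format])

# Days between the first two visits
gap_days <- as.numeric(difftime(parsed[2], parsed[1], units = "days"))
print(gap_days)
```

Comparing the raw strings with the parsed dates distinguishes truly missing dates from dates that were lost during conversion.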

2.2.3 QC VISIT CODES

# overview variables with "code" in their name
describe(data[,colnames(data)[which( grepl("code", colnames(data)))] ])

## data[, colnames(data)[which(grepl("code", colnames(data)))]]
##
##  3  Variables      60  Observations
## ---------------------------------------------------------------------------
## visitcode_en : Visit Code
##        n  missing distinct     Info     Mean      Gmd
##       20       40        1        0        0        0
##
## Value       0
## Frequency  20
## Proportion  1
## ---------------------------------------------------------------------------
## visitcode_lr : Sample collection: Visit Code
##        n  missing distinct     Info     Mean      Gmd
##       60        0        3    0.889        6    5.424
##
## Value          0     6    12
## Frequency     20    20    20
## Proportion 0.333 0.333 0.333
## ---------------------------------------------------------------------------
## visitcode_fu : Visit Code
##        n  missing distinct     Info     Mean      Gmd
##       40       20        2     0.75        9    3.077
##
## Value        6   12
## Frequency   20   20
## Proportion 0.5  0.5
## ---------------------------------------------------------------------------

# tables <encourage using "exclude=NULL" to display missing values>
table(data$redcap_event_name, data$visitcode_en, exclude=NULL)

##
##                           0 <NA>
##   enrollment_arm_1       20    0
##   followup_12month_arm_1  0   20
##   followup_6month_arm_1   0   20

table(data$redcap_event_name, data$visitcode_fu, exclude=NULL)

##
##                           6 12 <NA>
##   enrollment_arm_1        0  0   20
##   followup_12month_arm_1  0 20    0
##   followup_6month_arm_1  20  0    0

table(data$redcap_event_name, data$visitcode_lr, exclude=NULL)

##
##                           0  6 12
##   enrollment_arm_1       20  0  0
##   followup_12month_arm_1  0  0 20
##   followup_6month_arm_1   0 20  0

# check for issues
ck3 = data %>%
  mutate (check1 = ifelse( !is.na(visitcode_en) & !is.na(visitcode_lr) &
                           visitcode_en != visitcode_lr, 1, NA),
          check2 = ifelse( !is.na(visitcode_fu) & !is.na(visitcode_lr) &
                           visitcode_fu != visitcode_lr, 1, NA))

if (!all(is.na(ck3[, c("check1", "check2")]))){
  cat("QC VISIT CODES FAILED!!! check the data below\n")
  # print the rows where at least one check is flagged
  print (ck3[rowSums(!is.na(ck3[, c("check1", "check2")])) > 0, ])
}else{ cat( "QC VISIT CODES PASSED!")}

## QC VISIT CODES PASSED!
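The tables above show that each event should carry one visit code (0 at enrollment, 6 and 12 at the follow-ups). A complementary check, sketched here in base R on made-up rows, compares each recorded code with the code expected for its event; the lookup vector expected_code is a hypothetical helper and not part of the guide.

```r
# Toy records; the last row has a deliberate error (6-month event coded as 12)
visits <- data.frame(
  redcap_event_name = c("enrollment_arm_1", "followup_6month_arm_1",
                        "followup_12month_arm_1", "followup_6month_arm_1"),
  visitcode_lr = c(0, 6, 12, 12),
  stringsAsFactors = FALSE
)

# Expected visit code for each event, as seen in the cross-tables above
expected_code <- c(enrollment_arm_1 = 0,
                   followup_6month_arm_1 = 6,
                   followup_12month_arm_1 = 12)

# Look up the expected code by event name and compare
visits$expected <- unname(expected_code[visits$redcap_event_name])
visits$mismatch <- !is.na(visits$visitcode_lr) &
  visits$visitcode_lr != visits$expected

print(table(visits$redcap_event_name, visits$visitcode_lr, exclude = NULL))
print(visits[visits$mismatch, ])
```

In the cross-table the error shows up as a non-zero off-pattern cell; the mismatch flag points to the exact row.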

2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to each CRF.

2.3.1 RANDOMIZATION: EXTRACT.

table( data$redcap_event_name, data$crfversion_rand, exclude=NULL)

##
##                           1 <NA>
##   enrollment_arm_1       20    0
##   followup_12month_arm_1  0   20
##   followup_6month_arm_1   0   20

Randomization0 <- data %>%
  # KEEP DESIRED EVENT: enrollment
  filter( redcap_event_name == "enrollment_arm_1" ) %>%
  filter( !is.na( crfversion_rand) ) %>%

  # KEEP VARIABLES FOR THE CRF.
  select ( study_id,
           redcap_event_name,
           crfversion_rand : dbasecomment3_rand,
           randomization_rand_complete) %>%

  # create a new variable
  mutate (rand_indata=1)

# let's see the result
table( Randomization0$redcap_event_name, Randomization0$crfversion_rand, exclude=NULL)

##
##                     1
##   enrollment_arm_1 20

# 20 subjects

2.3.2 VISITS_ENROLLMENT: EXTRACT.

table( data$redcap_event_name, data$crfversion_en, exclude=NULL)

##
##                           1 <NA>
##   enrollment_arm_1       20    0
##   followup_12month_arm_1  0   20
##   followup_6month_arm_1   0   20

Visits_Enrollment0 <- data %>%
  # KEEP DESIRED EVENT: enrollment
  filter( redcap_event_name == "enrollment_arm_1" ) %>%
  filter( !is.na( crfversion_en) ) %>%

  # KEEP VARIABLES FOR THE CRF.
  select ( study_id,
           redcap_event_name,
           crfversion_en : dbasecomment3_en,
           enrollment_en_complete) %>%

  # create a new variable
  mutate ( en_indata=1)

# let's see the result
table( Visits_Enrollment0$redcap_event_name, Visits_Enrollment0$crfversion_en, exclude=NULL)

##
##                     1
##   enrollment_arm_1 20

# 20 subjects

2.3.3 VISITS_FOLLOWUP: EXTRACT.

table( data$redcap_event_name, data$crfversion_fu, exclude=NULL)

##
##                           1 <NA>
##   enrollment_arm_1        0   20
##   followup_12month_arm_1 20    0
##   followup_6month_arm_1  20    0

Visits_Followup0 <- data %>%
  # KEEP DESIRED EVENT: FOLLOWUP.
  mutate ( keepvisit = ifelse ( redcap_event_name %in%
                                c( "followup_6month_arm_1", "followup_12month_arm_1" ),
                                1, NA )) %>%
  filter ( keepvisit == 1) %>%
  select ( -keepvisit ) %>%

  filter( !is.na( crfversion_fu) ) %>%

  # KEEP VARIABLES FOR THE CRF.
  select ( study_id,
           redcap_event_name,
           crfversion_fu : dbasecomment3_fu,
           followup_fu_complete) %>%

  # SORT
  arrange (study_id, visitcode_fu)

# let's see the result
table( Visits_Followup0$redcap_event_name, Visits_Followup0$crfversion_fu, exclude=NULL)

##
##                           1
##   followup_12month_arm_1 20
##   followup_6month_arm_1  20

# 20 subjects with 2*20 observations

2.3.4 LABS (ENROLLMENT+FOLLOWUP): EXTRACT.

table( data$redcap_event_name, data$crfversion_lr, exclude=NULL)

##
##                           1
##   enrollment_arm_1       20
##   followup_12month_arm_1 20
##   followup_6month_arm_1  20

Labs0 <- data %>%
  # KEEP DESIRED EVENTS: ENROLLMENT + FOLLOWUP.
  mutate ( keepvisit = ifelse ( redcap_event_name %in%
                                c( "enrollment_arm_1",
                                   "followup_6month_arm_1",
                                   "followup_12month_arm_1" ),
                                1, NA )) %>%
  filter ( keepvisit == 1) %>%
  select ( -keepvisit ) %>%

  filter( !is.na( crfversion_lr) ) %>%

  # KEEP VARIABLES FOR THE CRF.
  select ( study_id,
           redcap_event_name,
           crfversion_lr : dbasecomment3_lr,
           lab_results_lr_complete) %>%

  # SORT
  arrange (study_id, visitcode_lr)

# let's see the result
table( Labs0$redcap_event_name, Labs0$crfversion_lr, exclude=NULL)

##
##                           1
##   enrollment_arm_1       20
##   followup_12month_arm_1 20
##   followup_6month_arm_1  20

# 20 subjects with 3*20 observations & 21 variables
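The four extractions in this section repeat the same steps: keep the events for the CRF, drop rows where the CRF was not filled in, and keep the ID, the event name, and the CRF's block of variables. As an illustration only, the steps could be wrapped in a small base R function. The helper extract_crf and the toy dataset below are hypothetical; the variable names follow the guide, but the values are invented and the blocks are shortened.

```r
# Hypothetical helper (not from the guide): keep the rows and columns of one CRF.
extract_crf <- function(dat, events, first_var, last_var, complete_var) {
  cols <- names(dat)
  # columns from first_var through last_var, like first_var:last_var in select()
  block <- cols[match(first_var, cols):match(last_var, cols)]
  # keep the requested events, and only rows where the CRF was filled in
  keep <- dat$redcap_event_name %in% events & !is.na(dat[[first_var]])
  dat[keep, c("study_id", "redcap_event_name", block, complete_var), drop = FALSE]
}

# Toy master file: 2 participants x 3 events (made-up values)
toy <- data.frame(
  study_id = rep(c("A", "B"), each = 3),
  redcap_event_name = rep(c("enrollment_arm_1", "followup_6month_arm_1",
                            "followup_12month_arm_1"), times = 2),
  crfversion_rand = c(1, NA, NA, 1, NA, NA),
  randarm_rand = c(0, NA, NA, 1, NA, NA),
  randomization_rand_complete = c(2, 0, 0, 2, 0, 0),
  crfversion_lr = 1,
  cd4cnt_lr = c(350, 410, 480, 290, 330, 360),
  lab_results_lr_complete = 2,
  stringsAsFactors = FALSE
)

# Randomization: enrollment event only
rand0 <- extract_crf(toy, "enrollment_arm_1",
                     "crfversion_rand", "randarm_rand",
                     "randomization_rand_complete")

# Labs: all three events
labs0 <- extract_crf(toy, unique(toy$redcap_event_name),
                     "crfversion_lr", "cd4cnt_lr",
                     "lab_results_lr_complete")

print(rand0)
print(labs0)
```

Spelling out each extraction, as the guide does, is easier to audit; a helper like this mainly pays off when a project has many CRFs.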


2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps:

2.4.1 Prepare files for incorporation with either of two files: (1) Main (1 record/participant), (2) Longitudinal (1 record/visit).

Subsequent steps merge these two files together to create additional variables (e.g. 1st pre-HAART viral load), and analysis follows.

2.4.2 Append and merge to create the Main and Longitudinal files.

2.4.3 Create other analysis variables.

2.4.4 Example descriptives, figures, and analyses

Please email [email protected] for the source code. Thanks!
