Reproducibility with R

Reproducible R coding CMEC R-Group Martin Jung 12.02.2015



Goals of reproducible programming?

- Make your code readable by you and others
- Group your code and functionalize
- Embrace collaboration, version control and automation

First step - readability

1. Writing cleaner code

Writing cleaner R code | Names

- Keep new filenames descriptive and meaningful

    "helper-functions.R"
    # or for sequences of processing work
    "01_Download.R"
    "02_Preprocessing.R"
    #...
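Numbered file names also make it easy to run a whole workflow in order. A minimal sketch, assuming scripts that follow the two-digit prefix pattern above; the two placeholder scripts and their contents are hypothetical and are created in a temporary folder only so the example is self-contained:

```r
# Create two tiny placeholder scripts in a temporary folder (hypothetical content)
workdir <- tempfile("workflow-")
dir.create(workdir)
writeLines("steps <- c(steps, 'download')", file.path(workdir, "01_Download.R"))
writeLines("steps <- c(steps, 'preprocess')", file.path(workdir, "02_Preprocessing.R"))

# Find all scripts with a two-digit prefix and run them in sorted order
scripts <- sort(list.files(workdir, pattern = "^[0-9]{2}_.*\\.R$", full.names = TRUE))
steps <- character(0)
for (script in scripts) {
  source(script, local = TRUE)  # evaluate in the current environment
}
steps
```

Because the order is encoded in the file names, adding a new step only requires a new file with the next number.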

- Use CamelCase or snake_case for variables

    "spatial_data"
    "ModelFit"
    "regression.results"

Avoid names that R already uses, such as c or plot
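Why this matters: a function that reuses the name of a built-in one shadows it. A small sketch (the replacement function is purely illustrative):

```r
# Inside local(), a user-defined function called `c` shadows base::c
masked <- local({
  c <- function(...) "not a vector"
  c(1, 2, 3)  # calls the user-defined function, not base::c
})
masked

# Outside that scope the built-in behaviour is untouched
unmasked <- c(1, 2, 3)
unmasked
```

In a long script the shadowing definition and the failing call can be far apart, which makes such errors hard to trace.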

Writing cleaner R code | Spacing

Use spacing just as in the English language

    # Good
    model.fit <- lm(age ~ circumference, data = Orange)

    # Bad
    f1=lm(Orange$age~Orange$circumference)

Don’t be afraid of using new lines

    model.results <- data.frame(Type = sample(letters, 10),
                                Data = NA,
                                SampleSize = 10)

# Same goes for loops

# And don't forget good documentation
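The same spacing and documentation habits applied to a loop, as a minimal sketch using the built-in Orange dataset (the variable names are illustrative):

```r
# Mean circumference (mm) per tree, computed with an explicit loop
tree.ids <- levels(Orange$Tree)
mean.circumference <- numeric(length(tree.ids))
names(mean.circumference) <- tree.ids

for (id in tree.ids) {
  # Select the measurements of the current tree
  tree.data <- Orange[Orange$Tree == id, ]
  mean.circumference[id] <- mean(tree.data$circumference)
}

mean.circumference
```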

More on writing clean code

- Google R Style Guide
- Hadley Wickham's Style Guide
- rOpenSci Guide

There is even an R package to clean up your code:

formatR
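A minimal sketch of how formatR can be called, assuming the package is installed. tidy_source() reformats code passed as a string via its text argument, and arrow = TRUE asks it to replace = assignments with <-; the variable names here are illustrative:

```r
# The "bad" example from the spacing slide, as a string
messy <- "f1=lm(Orange$age~Orange$circumference)"

# Only run if formatR is available (assumption: it may not be installed)
if (requireNamespace("formatR", quietly = TRUE)) {
  tidied <- formatR::tidy_source(text = messy, arrow = TRUE, output = FALSE)
  cat(tidied$text.tidy, sep = "\n")  # print the reformatted code
}
```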

Further ways to improve reproducibility

- Ideally attach your code + data to publications
- Open-access hoster (DataDryad, Figshare, Zenodo)
- Restructuring of workflow with RMarkdown / LaTeX / HTML

Functionalize!

- Many R users are tempted to write their code very specialized and non-reusable

- Number 1 rule for clear coding:

DRY - Don't repeat yourself!

Simple example: We want to fit a linear model to test whether, in an orange orchard, the circumference (mm) increases with age (age of trees). If so, we want to quantify and display the Root-Mean-Square-Error (RMSE) of this fit for each individual orange tree in the dataset (N = 5).

Normal way:

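    # Linear model
    model.fit <- lm(age ~ circumference, data = Orange)
    model.resid <- residuals(model.fit)
    model.fitted <- fitted(model.fit)
    # Note: this expression combines residuals and fitted values; the textbook
    # RMSE is sqrt(mean(model.resid^2))
    rmse <- sqrt(mean((model.resid - model.fitted)^2))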

    tapply(model.resid - model.fitted, Orange$Tree,
           function(x) sqrt(mean(x^2)))

[Figure: bar plot of the per-tree values (x-axis: trees 3, 1, 5, 2, 4; y-axis: 0 to 1400)]

Defining your functions

Essentially, most R packages are just a compilation of useful functions that users have written.

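    # We want to get the RMSE of a linear model
    # Note: as written, the fitted values are subtracted from the residuals
    # before squaring; the textbook RMSE uses the residuals alone. The outputs
    # shown further down come from the expression as written.
    rmse <- function(fit, groups = NULL, ...) {
      f.resid <- residuals(fit)
      f.fitted <- fitted(fit)
      if (!is.null(groups)) {
        # One value per group
        tapply((f.resid - f.fitted), groups, function(x) sqrt(mean(x^2, ...)))
      } else {
        # One value for the whole model
        sqrt(mean((f.resid - f.fitted)^2, ...))
      }
    }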

model.fit <- lm(age ~ circumference, data = Orange)

    # This function is more flexible, can be further customized and
    # applied in other situations
    rmse(model.fit)

## [1] 1041.809

rmse(model.fit, Orange$Tree)

    ##         3         1         5         2         4
    ##  602.4244  688.8896  929.9055 1319.1573 1408.7033
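Note that the rmse() function shown on this slide subtracts the fitted values from the residuals before squaring, whereas the textbook RMSE is computed from the residuals alone. A minimal sketch of that variant, using the hypothetical name rmse_resid to keep it apart from the function on the slide:

```r
# Textbook RMSE: square root of the mean squared residual
rmse_resid <- function(fit, groups = NULL, ...) {
  f.resid <- residuals(fit)
  if (is.null(groups)) {
    sqrt(mean(f.resid^2, ...))
  } else {
    tapply(f.resid, groups, function(x) sqrt(mean(x^2, ...)))
  }
}

model.fit <- lm(age ~ circumference, data = Orange)
rmse_resid(model.fit)               # overall RMSE
rmse_resid(model.fit, Orange$Tree)  # RMSE per tree
```

The interface is the same as before, so the two versions can be swapped without touching the calling code.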

(Very) short intro to pipes

Pipes (|) are a common tool in the Linux / programming world that can be used to chain the inputs and outputs of functions together. In R, two packages, dplyr and magrittr, provide a pipe operator (%>%) that enables general piping between functions.

Goal:

Solve complex problems by combining simple pieces (Hadley Wickham)

library(dplyr)

    model.rmse <- Orange %>%
      lm(age ~ circumference, data = .) %>%
      rmse(., Orange$Tree) %>%
      barplot

OR like this (Correlation within Iris dataset)

    iris %>%
      group_by(Species) %>%
      summarize(count = n(),
                pear_r = cor(Sepal.Length, Petal.Length)) %>%
      arrange(desc(pear_r))

    ## Source: local data frame [3 x 3]
    ##
    ##      Species count    pear_r
    ## 1  virginica    50 0.8642247
    ## 2 versicolor    50 0.7540490
    ## 3     setosa    50 0.2671758
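To see what a pipe does conceptually, here is a toy operator written in base R. It is only a sketch of the idea, not how magrittr or dplyr implement %>%; the names %then% and round2 are hypothetical:

```r
# A toy pipe: hand the left-hand side to the function on the right-hand side
`%then%` <- function(lhs, fun) fun(lhs)

# Small helper so that every step takes exactly one argument
round2 <- function(x) round(x, 2)

# Nested calls read inside-out ...
nested <- round2(sqrt(sum(c(1, 4, 9, 16))))

# ... while the piped version reads left to right, step by step
piped <- c(1, 4, 9, 16) %then% sum %then% sqrt %then% round2
```

Both expressions compute the same value; only the reading order differs.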

Outsource your functions

    # Put your function into an extra file

    # At the beginning of your main processing script
    # you simply load it via source
    source("outsourced.rmse.R")
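A self-contained sketch of this pattern, assuming a hypothetical helper function double_it that is written to a temporary file standing in for a file such as "outsourced.rmse.R":

```r
# Write a small helper function to its own file
helper.file <- file.path(tempdir(), "outsourced.helper.R")
writeLines("double_it <- function(x) 2 * x", helper.file)

# In the main script: load the helper, then use it
source(helper.file, local = TRUE)
double_it(21)
```

The main script stays short, and the same helper file can be sourced from several analyses.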

Easy package writing

- Open RStudio
- Install the devtools and roxygen2 packages
- Create a new package project and use the existing function as basis
- Create the documentation for it
- Update the package metadata and build your package

    library(roxygen2)
    library(devtools)
    # Build your package with two simple commands
    # Has to be run within your package project
    document()  # Update the documentation and namespace
    install()   # Install the package

- However, package development has multiple facets and options.
- More detailed info on package development with RStudio.
- Higher acceptance for method papers and analysis code. Make it citable with a DOI
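The documentation step relies on roxygen2 comments placed directly above a function; document() turns them into help pages. A minimal sketch of such a block, where the function name rmse_simple and the wording of the documentation are illustrative, and the body uses the textbook residual-based RMSE:

```r
#' Root-mean-square error of a fitted model
#'
#' @param fit A fitted model object for which residuals() is defined,
#'   for example the result of lm().
#' @return A single numeric value.
#' @examples
#' fit <- lm(age ~ circumference, data = Orange)
#' rmse_simple(fit)
#' @export
rmse_simple <- function(fit) {
  sqrt(mean(residuals(fit)^2))
}
```

Outside a package the #' lines are ordinary comments, so the file can still be loaded with source().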

Software management and collaboration with Github

- Git is one of the most commonly used revision control systems
- Originally developed for the Linux kernel by Linus Torvalds

- Github is a web-based software repository service offering distributed revision control
- Californian startup, now the largest code hoster in the world
- Offers public repositories for free, private ones for money, and a nice snippet exchange service called gists

How to Git with RStudio (do it later)

1. Set up an account with a git repository hoster like Github
2. Install RStudio and git for your platform (http://www.rstudio.com/ide/docs/version_control/overview)
3. Link to the git executable within the RStudio options
4. Create a new repository on Github and a new project in RStudio -> Version Control -> git
5. Clone your empty project (pull), add new files/changes to it (commit) and (push)
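The add and commit part of this workflow can also be tried locally without RStudio. A minimal sketch that calls the git command line from R via system2(), assuming git is installed and on the PATH. The repository lives in a temporary folder, the user name and e-mail are placeholders, and cloning from and pushing to Github are left out because they need a real remote:

```r
if (nzchar(Sys.which("git"))) {
  # Temporary folder standing in for a project directory
  repo <- tempfile("demo-repo-")
  dir.create(repo)

  # Small wrapper: run a git command inside the repository, capture its output
  git <- function(...) {
    system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
  }

  git("init")
  writeLines("double_it <- function(x) 2 * x", file.path(repo, "helper.R"))

  git("add", "helper.R")                       # stage the new file
  git("-c", "user.name=Demo",                  # placeholder identity
      "-c", "user.email=demo@example.com",
      "commit", "-m", shQuote("Add helper function"))

  commit.log <- git("log", "--oneline")        # one line per commit
  print(commit.log)
}
```

With a remote configured, the same wrapper could run the push step, but that depends on your account setup and is not shown here.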

Idea for CMEC R Users:

- Create a Github organization (like a repository basecamp)

Further developments

There are now packages to push gists and normal git updates directly from within R. To use them you need a Github API key (instructions on the websites below): rgithub

Too detailed to show here, but have a look at the gistr package: gistr