
Introduction to R programming

Presented by: Shailender Nagpal, Al Ritacco

Research Computing, UMASS Medical School

09/17/2012, Information Services

AGENDA

• R Basics: Vectors, Matrices, Data frames
• Built-in functions, Blocks, Branching, Loops
• String, Vector and Matrix operations
• File reading and writing
• Writing custom functions
• Statistical analysis and plotting
• Providing input to programs
• Using R scripts with the LSF cluster

What is R?

• R is a high-level, general-purpose, interpreted, interactive programming language

• Provides a simple iterative, top-down, left-to-right programming environment for users to create small and fairly large programs

• R is a graphical and statistical language that lets users create complex plots from their data

Features of R

• R code is portable between Linux, Mac, Windows
• Easy to use and lots of resources are available
• Procedural and object-oriented programming, not strongly "typed"
• Similar programming syntax as other languages
  – if, if-then-else, while, for, functions, classes, etc
• Provides several methods to manipulate data
  – Vectors, Matrices, Data Frames, Objects
• Statements are not terminated by semi-colon

Advantages of R

• R is a general-purpose programming language like C, Java, etc. But it is "higher" level, which makes it advantageous in certain applications like Bioinformatics
  – Fewer lines of code than C, Java
  – No compilation necessary. Prototype and run!
  – Run every line of code interactively
  – Vast function library geared towards scientific computing
  – Save coding time and automate computing tasks
  – Intuitive. Code is concise, but human readable

First R program

• The obligatory "Hello World" program

    # Comment: 1st program: variable, print
    name = "World"
    cat(sprintf("Hello %s\n", name))

• Save these lines of text as a text file with the ".R" extension, then at the command prompt (linux):

    Rscript hello.R
    Hello World

Understanding the code

• 1st line: A comment, beginning with "#"
• 2nd line: Declaration of a string variable
• 3rd line: Printing some text to the shell; "sprintf" substitutes the value of the variable for %s, and "cat" writes the result
• The quotes are not printed, and %s is replaced by "World" in the output. (Calling sprintf on its own would display the result as [1] "Hello World", with quotes.)

Second program

• Report summary statistics of DNA sequence

    dna = "ATAGCAGATAGCAGACGACGAGA"
    cat("Length of DNA is", nchar(dna), "\n")
    cat("Number of A bases is", nchar(dna)-nchar(gsub("A","",dna)), "\n")
    cat("Number of C bases is", nchar(dna)-nchar(gsub("C","",dna)), "\n")
    cat("Number of G bases is", nchar(dna)-nchar(gsub("G","",dna)), "\n")
    cat("Number of T bases is", nchar(dna)-nchar(gsub("T","",dna)), "\n")
    cat("Number of GC dinucleotides is", (nchar(dna)-nchar(gsub("GC","",dna)))/2, "\n")

• In a few lines of code, we can summarize our data!
• Can re-use this code to find motifs, RE sites, etc
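The counting trick above (remove the pattern, then compare string lengths) can be wrapped in a small helper so it is reusable for any motif. This is only a sketch: count_motif is a hypothetical name, not a base R function, and it counts non-overlapping occurrences only.

```r
# Count non-overlapping occurrences of a motif in a DNA string.
# count_motif is a hypothetical helper, not part of base R.
count_motif = function(dna, motif) {
    # fixed = TRUE treats the motif as literal text, not a regular expression
    remaining = gsub(motif, "", dna, fixed = TRUE)
    (nchar(dna) - nchar(remaining)) / nchar(motif)
}

dna = "ATAGCAGATAGCAGACGACGAGA"
bases = c("A", "C", "G", "T")

# Apply the helper to each base; the result is a named numeric vector
base_counts = sapply(bases, count_motif, dna = dna)
print(base_counts)
cat("GC dinucleotides:", count_motif(dna, "GC"), "\n")
```

Because the helper divides by the motif length, the same call works for single bases, dinucleotides or longer restriction sites.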

How to get R

• Visit www.r-project.org
  – Download the installation files for your OS if you want the full R IDE
  – Run through the steps of the Installer, launch R through the Start Menu

• Many versions of R are available on our compute cluster
  – R/2.14.2
  – R/2.15.0
  – R/3.0.1
  – R/3.0.2
  – R/3.1.0
  – R/3.2.0

• To use one of them, enter the following on a shell connected to MGHPCC:

    module load R/3.2.0; R

The "R" IDE

This R session can be quit by using the q( ) command

The "R" IDE (…contd)

• The prompt
  – Interactive environment for entering commands
  – A "command" is an instruction, telling the R interpreter what to do

• The menus
  – Loading and saving data, command history, creating and running scripts, etc
  – Managing packages
  – Editing and deleting data
  – Getting Help

The "R" IDE: Working directory

Always navigate to the "working directory" where you intend to maintain all data files, graphics, results, etc.

The R Prompt

• User can enter instructions to use R as a calculator, then hit the "Enter" button to execute, e.g.

    > 1 + 3
    > (3*5+1)/2

• User can also use "variables" to contain data. PEMDAS rules are followed to build mathematical expressions. Built-in math functions can be used

    x = 1
    y = 3
    x + y
    z = 5
    z/y
    a = (y*z+x)/2
    b = sin(x); c = sum(x, y, z);

The R "Prompt"

• Anything typed and executed on the R prompt is a "command"

• A "command" follows strict rules of valid syntax and will not execute otherwise. Try: 1+/-2))))

• Multiple commands can be executed on the command prompt using the semi-colon operator

• If a command fails to execute, it will usually provide an error message

R Prompt (…contd)

• If a command is "hung-up" for any reason, pressing Ctrl+C will return the R prompt

• If an incomplete command is typed in, it results in a "+" prompt, expecting user input to complete the command

R Comments

• Use the "#" character to add comments; any text after # on that line does not execute

• Comments help you and others understand and remember what you did and what your thought process was for adjacent lines of code

• Let's say you intend to sum up a list of numbers

    # (sum of x for x from 1 to 100)

• The code would look like this:

    total = 0                # Initialize variable called "total" to 0
    for (x in 1:100) {       # Use "for" loop to iterate over 1 to 100
        total = total + x    # Add x to the previous total
    }
    cat(sprintf("The sum of 1..100 is %d\n", total))    # Report the result

R Variables

• Variables
  – Provide a location to "store" data we are interested in

• Strings, decimals, integers, characters, lists, …
  – What is a character? A single letter, digit or symbol
  – What is a string? A sequence of characters
  – What is an integer? A whole number such as 4. A number with a decimal point, such as 4.7, is referred to as a real number

• Variables can be assigned or changed easily within an R script

Variables and built-in keywords

• Variable names should represent or describe the data they contain
  – Do not use meta-characters; stick to letters, digits, periods and underscores. Begin variable names with a letter

• R as a language has keywords that should not be used as variable names. They are reserved for writing syntax and logical flow of the program
  – Examples include: if, else, for, while, repeat, function, break, next, in, TRUE, FALSE, NULL, NA, NaN, Inf

Creating and displaying Variables

• Variables are "placeholders" for data inside R and have a unique name to reference them

    x = 10    # creates a variable
    x         # displays contents of variable

• A list of existing variables can be displayed using

    ls()
    objects()

• Variables can be deleted using the "rm" command:

    rm(tmp)            # removes "tmp" variable
    rm(list = ls())    # removes all variables

Variables (…contd)

• Variables in R are stored as objects which belong to one of four types
  – Vector = row/column of data
  – Matrix = table of data
  – Data Frame = table with row and column headers
  – List = group of data objects

• The type of a variable can be checked using the following functions, which return TRUE or FALSE

    is.vector(variable_name)
    is.matrix(variable_name)
    is.data.frame(variable_name)
    is.list(variable_name)

Modes in R

• Data "modes" in R
  – Character = text information
  – Numeric = numbers
  – Factor = categorical data (strictly a class rather than a mode; mode() reports "numeric" for a factor)
  – Logical = True/False

• The mode of a variable can be checked by

    mode(x)
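As a quick illustration of checking modes, the sketch below calls mode() and class() on one value of each kind. The values and the list name are made up for the example; the point is that a factor is stored as integer codes, so its mode is "numeric" while its class is "factor".

```r
# One example value per kind of data (values are illustrative only)
values = list(
    text    = "BRCA",
    number  = 48.1,
    flag    = TRUE,
    grouped = factor(c("ca", "ma", "ca"))
)

# Report the mode and the class of each value
for (name in names(values)) {
    v = values[[name]]
    cat(sprintf("%-8s mode=%-10s class=%s\n", name, mode(v), class(v)))
}
```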

Variable naming conventions

• Variable names must be a set of characters that
  – cannot begin with a numerical value
  – cannot have spaces or special characters
  – can have numbers between or at the end of name
  – can use period "." to separate long variable names

• In a data frame, row and column names have to be unique

Displaying text and variables

• Many commands will display text and values of variables. Typing in the variable name will display the value. Other functions include "print", "cat", "sprintf"

• Print is most verbose, does not process escape sequences such as \t and \n, and only prints one item

    print("This \t is a test\nwith text")
    [1] "This \t is a test\nwith text"

• Cat processes escape sequences and can take a list of items to concatenate and display. Can send output to file

    cat("This \t is a test\nwith text", sep="\n")
    This      is a test
    with text

Printing variables

• To process variables in a formatted manner, "sprintf" can be used. It returns the formatted string; wrap it in "cat" to display it without quotes

    x = 5
    name = "John"
    cat(sprintf("%s has %d dollars\n", name, x))

• If you run this as a program, you get this output

    John has 5 dollars

• Works best with single values. Try using with vector and notice vectorized outputs!

    x = 1:10
    sprintf("%s has %d dollars", name, x)
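To make the difference between the display functions concrete, here is a small sketch that formats several values at once with sprintf and then shows the result with print and with cat. The amounts are made-up numbers.

```r
name = "John"
amounts = c(5, 12, 40)    # illustrative values

# sprintf is vectorized: it returns one formatted string per element of amounts
lines = sprintf("%s has %d dollars", name, amounts)

print(lines)               # shows the index and quotes around each string
cat(lines, sep = "\n")     # writes one plain line per element
cat("\n")
```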

Using help in R

• The following syntax is used for getting help in R

    ?command_name
    apropos("plot"); help(plot); help.search("plot")
    help.start(); demo()

• R help/search available at:
  – http://finzi.psych.upenn.edu/nmz.html
  – http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html

• Various guides/docs/tutorials available at CRAN
  – http://www.r-project.org (Click under documentation)

Vectors

• Vector variables can be created in many ways

    x = 1:10
    y = c(1, 3, 5, 7, 9, 15)
    z = c(x, 11:20)
    z = 1, 4, x               # Not valid!!
    u = seq(0, pi, by=0.1)    # Equally-spaced vector
    v = scan()                # Accepts user input

• The number of elements in a vector is given by

    length(x)

Vectors (…contd)

• A note of caution about vectors:
  – Vectors containing both numbers and strings, i.e. heterogeneous data, are converted to "character" mode

    x = c(7, 9, "Paul")
    x
    [1] "7"    "9"    "Paul"

• Vectors of the same "mode" should be used

    genes = c('ATM1', 'BRCA', 'TNLP2')
    gene_scores = c(48.1, 129.7, 73.2)
    nucleotides = c("adenine", "cytosine", "guanine", "thymine", "uracil")

Numeric indexing of vectors

• Vector elements can be accessed by subscripting using their numerical index (beginning with 1). R has no open-ended range such as 2: so use length() to reach the last value:

    abc = c("a","b","c","d","e","f","g")
    abc[1]                # displays a
    abc[5:7]              # displays e f g
    abc[2:length(abc)]    # displays b c d e f g
    abc[c(1,2,5,7)]       # displays a b e g
    abc[c(3:4,6:7)]       # displays c d f g

Logical indexing of vectors

• Vector elements can also take logical index (true/false) values as a subscript

• A logical index can be created by specifying a condition on the vector that will result in true or false values

    x = 21:30
    y = x<25
    y
    [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    x[y]    # x[x<25] will also work

Logical vs Numerical indexing

• Logical subscripting can be used to delete elements in a vector, based on some conditions

    x = 21:30
    x = x[x<25]           # keeps 21 22 23 24
    x = x[x>21 & x<24]    # keeps 22 23

• Numerical subscripting is effective for deleting elements, too:

    x = 21:30
    x[c(3, 5, 7)]     # all except 3rd, 5th, 7th removed
    x[-c(3, 5, 7)]    # just 3rd, 5th, 7th removed
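The indexing styles discussed so far can be compared side by side. The sketch below uses the same vector 21:30; the variable names are chosen for illustration only.

```r
x = 21:30

first_three = x[1:3]                   # numeric index: positions 1 to 3
tail_part   = x[8:length(x)]           # from the 8th element to the last one
small       = x[x < 25]                # logical index: keep values below 25
dropped     = x[-c(1, 3, 5, 7, 9)]     # negative index: remove those positions

print(first_three)
print(tail_part)
print(small)
print(dropped)
```

None of these expressions changes x itself; assign the result back (x = x[...]) to actually delete elements.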

Vector Operations

• Various built-in functions can be applied to vectors, e.g.

    x = c(1, 2, 51, 7, 27, 3, 62, 8, 91, 0, 12, 20, 14)
    length(x)         # How many values in x
    sum(x)            # Sum of values in x
    mean(x)           # Mean of x
    sd(x)             # Standard deviation of x
    length(x[x>5])    # How many values > 5

• You can inspect vectors for missing or "NA" values

    z = c(1:5, NA, 10)
    ind = !is.na(z)
    z[ind]

R built-in functions

• The R language comes with many built-in functions that can be used with variables to produce summary output. Many mathematical functions are also available

    min(x)        max(x)         length(x)        sort(x)
    sum(x)        abs(x)         as.numeric(x)    rev(x)
    x^y           range(x)       round(x,n)       runif(n)
    ceiling(x)    cos(x)         x*180/pi         rnorm(n)
    x*pi/180      exp(x)         floor(x)         seq(x,y)
    sqrt(x^2+y^2) log(x,base)    sin(x)           c()
    sqrt(x)       tan(x)         as.integer(x)    cat()

Vectorization

• Many R functions operate on individual elements of a vector or matrix, and return a vector or matrix of the same size

    x = 1:10
    y = sin(x)
    y
     [1]  0.8414710  0.9092974  0.1411200 -0.7568025 -0.9589243 -0.2794155
     [7]  0.6569866  0.9893582  0.4121185 -0.5440211

• Many functions require either vectors or matrices as input, and produce a vector or matrix as output

Vectorization (..contd)

• In the same vein, arithmetic operations on vectors or matrices work with every element

    x = 1:10
    2*x
    x/2
    x**2    # same as x^2
    x%%2
    x+2
    x-2

Finding elements in vectors

• The "which" command can be used to find elements in a vector that satisfy a condition. The index/position of those elements is returned

    x = round(runif(100)*50)
    x
    which(x==5)           # Any in x have a value of 5?
    which(x>30)           # Any in x greater than 30?
    which(x>10 & x<20)    # Any in x between 10 and 20?
    which(x<10 | x>20)    # Any in x less than 10 or greater than 20?

Plotting vectors

• Vectors can be plotted in various ways

    xData = seq(0, pi, by=0.1)
    plot(x = xData)
    plot(x = xData, type = "l", col = "green")
    plot(x = xData, y = sin(xData), col = "green")
    hist(xData)
    boxplot(xData)

• On MGHPCC, graphics are suppressed. Output can be sent to PDF

Matrices

• There are many ways of creating data matrices

    x = array(1:6, dim=c(3,2))
    x
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6

    y = array(c(1:3, 9:11, 15:17), dim=c(3,3))
    z = array(scan(), dim=c(3,3))

• The "matrix" command

    x = matrix(scan(), byrow=T, ncol=5)
    x = matrix(scan(), nrow=5)

Matrices indexing

• Matrix elements can be accessed by subscripting using their numerical index

    x[1, 2]         # 1st row, 2nd column
    x[1, ]          # 1st row, all columns
    x[, 1]          # All rows, 1st column
    x[1, c(1,3)]    # 1st row, 1st and 3rd column
    x[2:3, ]        # 2nd and 3rd row, all columns

• Logical indices can also be used

    x[x<5]    # vector of elements less than 5

Matrices indexing (..contd)

• Elements can also be accessed by row and column names, but those names have to be provided first

    x = matrix(1:10, nrow=5)    # 5 rows, 2 columns
    dimnames(x) = list(c("Row1", "Row2", "Row3", "Row4", "Row5"), c("Col1", "Col2"))
    x[ , "Col1"]        # All elements of Column 1
    x["Row1", ]         # All elements of Row 1
    x["Row1", "Col1"]   # Element located at R1,C1

• To remove label names associated with a matrix

    dimnames(x) = list(c(), c())

Matrices operations

• The dimensions of a matrix can be obtained using the "dim" command

    dim(matrix_variable)

• Vectors of equal length can be concatenated row-wise or column-wise to create a matrix

    x = 1:10; y = 11:20
    matrix1 = cbind(x, y)
    matrix2 = rbind(x, y)

• Matrices can be transposed

    t(x)    # Transpose x (rows as columns, columns as rows)

Matrices operations (..contd)

• The "determinant" of a matrix can be readily calculated by using the "det" function

    x = array(1:9, dim=c(3,3))
    det(x)
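The matrix operations above can be tried together in one short run. This is a sketch with small made-up inputs; m2 is an extra 2x2 matrix added here only because its determinant is easy to verify by hand (2*3 - 1*1).

```r
x = 1:5; y = 11:15
by_col = cbind(x, y)     # 5 rows, 2 columns
by_row = rbind(x, y)     # 2 rows, 5 columns
print(dim(by_col))
print(dim(by_row))

# Transposing the column-wise matrix gives the row-wise one
print(all(t(by_col) == by_row))

# The 3x3 matrix filled with 1:9 has linearly dependent rows,
# so its determinant is zero up to rounding error
m = array(1:9, dim = c(3, 3))
print(det(m))

# A small invertible matrix for comparison
m2 = matrix(c(2, 1, 1, 3), nrow = 2)
print(det(m2))
```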

Data Frames

• A data frame is similar to a matrix, except it has the concept of "columns" being independent "vectors" and of mixed "mode"

• A data frame can be created by reading data from a file

    x = read.table("data3.txt", header=F)
    x = read.table("data2.txt", header=T)
    x = read.table("data1.txt", header=T, row.names=1)

• The column header names of a data frame can be obtained using the "colnames" command

    colnames(x); rownames(x)

Data Frames (…contd)

• Columns of data frames can be referred to by their names

    x$A
    x$Gene1    # will not work if Gene1 is a row name: "$" selects columns only

• Using "attach" function, we can make columns accessible as vectors by using the column name as a variable

    attach(x)
    A
    B

• Use "detach" to stop this behavior

    detach(x)

Assignment 2: Data Frames

• Let's load a sample data frame and see how we can manipulate and analyze data

    data(iris)    # IRIS data included in R
    iris
    dim(iris)
    colnames(iris)
    iris[1,1]
    iris[1:5, ]
    iris[5:10, c(1,3,5)]
    iris$Sepal.Length
    iris$Sepal.Length[4]

Assignment 2: Data frames (…contd)

    iris$Sepal.Length[c(1,4,7)]
    plot(iris$Sepal.Length, iris$Species, pch=18)
    hist(iris$Sepal.Length)
    boxplot(iris$Sepal.Length)
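Going one step beyond the assignment, the sketch below combines logical row selection with column names and then summarizes a column per group with tapply. The 7 cm cutoff and the variable names are arbitrary choices for illustration.

```r
data(iris)

# Rows where the sepal is longer than 7 cm, keeping two named columns
long_sepals = iris[iris$Sepal.Length > 7, c("Sepal.Length", "Species")]
print(long_sepals)

# Mean sepal length for each species
species_means = tapply(iris$Sepal.Length, iris$Species, mean)
print(species_means)
```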

Factors

• A "factor" is a vector that specifies a discrete classification (grouping) of other vectors of same length

    state = c("wa","ma","ca","ri","ca","ma","ri","wa")
    incomes = c(60, 65, 40, 55, 76, 44, 63, 71)
    statef = factor(state)
    statef
    levels(statef)
    mean.income = tapply(incomes, statef, mean)

R "strings"

• R supports string variables and can perform many useful operations on strings

    dna = "ATAGCAGATAGCAGACGACGAGA"
    rest.enzymes = c("TATA","GGACAG","TTAGAA")

• To concatenate 2 or more strings and re-assign to a new variable, the "paste" statement can be used

    dna2 = paste(dna,"ATGATAGCAGCAGCAG",sep="")

• To split a string based on a pattern or delimiter, use the "strsplit" command

    fragments = strsplit(dna2,"GAC")

String subscripting

• Once a string is created, parts of it can be extracted with "substr", using character positions that begin with 1. Square brackets index the vector, not the characters

    word = "Programming"    # A 1-item vector in R
    word[1]                 # Returns "Programming"
    substr(word,1,1)        # "P"
    substr(word,1,4)        # "Prog"
    substr(word,5,7)        # "ram"

• The length of a string can be determined by the function "nchar"

    nchar(word)

String operations

• R strings can be searched for patterns using "grep", "grepl", "regexpr" and "gregexpr"

    regexpr("GAC",dna2)     # First match position
    gregexpr("GAC",dna2)    # Report all matches
    grep("GAC",dna2)        # Index of each matching element (1 here); integer(0) if absent
    grepl("GAC",dna2)       # TRUE or FALSE

• R strings can be searched for patterns and replaced with substitution patterns using "sub" and "gsub"

    gsub("T","U",dna2,ignore.case = TRUE)    # DNA->RNA

String Functions

• Get complement of DNA sequence

    chartr("ATGC", "TACG", dna2)

• Reverse a DNA sequence

    paste(rev(strsplit(dna2,"")[[1]]),collapse='')

• Some other functions

    tolower(dna2)
    toupper(dna2)
    nchar(dna2)
    strtoi("28")
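The complement and reverse steps above combine naturally into a reverse-complement helper. The function name reverse_complement is hypothetical (it is not a base R function), and the sketch assumes an upper-case sequence containing only A, T, G and C.

```r
# Reverse complement of a DNA string (upper-case A/T/G/C assumed).
# reverse_complement is a hypothetical helper name.
reverse_complement = function(dna) {
    complemented = chartr("ATGC", "TACG", dna)        # swap each base for its partner
    bases = strsplit(complemented, "")[[1]]           # split into single characters
    paste(rev(bases), collapse = "")                  # reverse and join back together
}

dna = "ATAGCAGATAGCAGACGACGAGA"
rc = reverse_complement(dna)
cat(dna, "\n", rc, "\n", sep = "")
```

Applying the helper twice returns the original sequence, which is a handy sanity check.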

Lists

• Lists are a generic vector containing other objects
• Users can combine numeric vectors, string vectors, matrices and data frames into a list to create a complex data structure

• To access each item in a list, a numerical index has to be provided in double square brackets [[ ]]

• Many functions in R return a list, requiring double-brackets to retrieve each returned item

Lists (…contd)

• Example

    n = c(2, 3, 5)
    s = c("aa", "bb", "cc", "dd", "ee")
    b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
    # Build a list containing copies of n, s, b
    x = list(n, s, b, 3)
    # Retrieve a list slice
    x[1]
    x[1:2]
    # Retrieve a member's copy by its reference
    x[[2]]
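Since many functions return lists, it helps to see one being built and unpacked. In this sketch, summarize is a made-up function and its item names (n, average, values) are illustrative; naming list items lets them be retrieved by name as well as by position.

```r
# A made-up function that returns several results in one list
summarize = function(v) {
    list(n = length(v), average = mean(v), values = v)
}

res = summarize(c(2, 3, 5))

print(res[["average"]])    # double brackets return the item itself
print(res$n)               # $name is shorthand for [["name"]]
print(res[[3]])            # items can still be reached by position
print(class(res["n"]))     # single brackets return a smaller list
```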

Command blocks in R

• A group of statements surrounded by braces {}
• Creates a new context for statements and commands
• Ex:

    if (x > 1) {
        print("Test")
        print("x is greater than 1")
    }

Logical Operators

• Many types of operators are provided
  – Relational (<, >, <=, >=)
  – Equality (==, !=)
  – Logical (and (&), or (|), not (!))

The "if" statement

• The if-then-else syntax

    if (is.vector(x)) {
        y = "x is a vector"
    } else {
        y = "x is not a vector"
    }

• Note: Keep "else" on the same line as the closing brace of the "if" block; do not place it on a new line

For loops: repeating a code block

• For loops are used for repeatedly executing a code block a pre-set number of times

• An iterator variable is used to change the value of one variable each time the code block is executed

• The code block is executed repeatedly until the iterator variable exhausts its pre-chosen values

    x = 1:10
    for (i in 1:10) {
        cat(x[i], " ")
    }

Output:
    1 2 3 4 5 6 7 8 9 10

For loops: repeating a code block

• Another example

    nucleotides = c("adenine", "cytosine", "guanine", "thymine", "uracil")
    for (nt in nucleotides) {
        cat("Nucleotide is: ", nt, "\n")
    }

Output:
    Nucleotide is: adenine
    Nucleotide is: cytosine
    Nucleotide is: guanine
    Nucleotide is: thymine
    Nucleotide is: uracil

While loops: repeating a code block

• While loops are used for repeatedly executing a code block as long as a condition remains true. The loop is exited when the condition turns false

    i = 1
    while (i < 6) {
        cat(i, " ")
        i = i + 1
    }

Output:
    1 2 3 4 5

Break and Next Statements

• The "break" statement in a "for" or "while" loop transfers control execution outside the loop

• The "next" statement (R's equivalent of "continue" in other languages) in a "for" or "while" loop transfers control to the next iteration of the loop, skipping the remaining statements after it
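A short sketch of both statements in one loop follows. Note that R spells the skip-to-next-iteration statement "next". The loop bounds and the variable name evens are arbitrary choices for the example.

```r
evens = c()
for (i in 1:20) {
    if (i %% 2 == 1) next    # odd number: skip the rest of this iteration
    if (i > 10) break        # leave the loop entirely once past 10
    evens = c(evens, i)      # only reached for even numbers up to 10
}
print(evens)
```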

Scripts

• Collection of valid R commands, placed in a text file so they execute in a specified order

• Scripts are for implementing workflows and algorithms, and the idea behind them is automation

• Scripts in R have the ".R" extension
• To run an R script, use the "source" command

    source("avg_height.R")

Bracket Notation in R

• ( )
  – Creating vectors
  – Making a function call with input arguments

• [ ]
  – Indexing for a matrix or vector

• { }
  – Embedding a chunk of R code while creating a function or enclosing in a for or while loop

User-defined functions

• Collection of valid R commands that implements a workflow or algorithm

• A function may
  – Read data in a certain file format
  – Implement a statistical algorithm such as ANOVA
  – Create a plot

• A function may call other functions – both in-built and user defined

• Functions provide a way for user to utilize the underlying "functionality" just by entering the name of the function on the R command prompt, with its input and output "arguments"

User-defined functions (…contd)

• Functions return values
  – Explicitly with the return command
  – Implicitly as the value of the last executed statement

• The return value is a single R object: a scalar, vector, matrix, data frame or list. To return several results, collect them in a list

User-defined functions

• Basic declaration is

    function_name = function(arg1, arg2, …) expression

    twosamp = function(y1, y2) {
        n1 = length(y1); n2 = length(y2)
        yb1 = mean(y1); yb2 = mean(y2)
        s1 = var(y1); s2 = var(y2)
        s = ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
        tst = (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
        return(tst)
    }
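The twosamp function computes the pooled-variance two-sample t-statistic, so its result can be checked against the built-in t.test with var.equal = TRUE. The sketch below repeats the function so it runs on its own; the two samples a and b are made-up numbers.

```r
# Pooled-variance two-sample t-statistic (same formula as on the slide)
twosamp = function(y1, y2) {
    n1 = length(y1); n2 = length(y2)
    yb1 = mean(y1); yb2 = mean(y2)
    s1 = var(y1); s2 = var(y2)
    s = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)    # pooled variance
    (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

a = c(5.1, 4.9, 6.2, 5.8, 6.0)    # illustrative data
b = c(4.2, 4.8, 5.0, 4.5, 4.1)

ours    = twosamp(a, b)
builtin = unname(t.test(a, b, var.equal = TRUE)$statistic)

cat("twosamp:", ours, "\n")
cat("t.test :", builtin, "\n")
```

Without var.equal = TRUE, t.test uses the Welch approximation and the statistic generally differs from twosamp.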

Using R with Files: Scripts/Output

• In R, there are many ways to read data from files, including R script files
  – R files containing R commands (scripts) or functions can be input into R and executed by the "source" command
  – The output of R commands can be sent to a text file instead of the R prompt using the "sink" command

    sink("r_output.txt")     # start saving results
    source("r_script.R")     # execute script
    sink()                   # stop saving results

Using R with Files: Workspace

• To save existing variables for later use, do
  – File -> Save Workspace, and save all variables as a ".RData" file in a user specified directory, or

    save.image("project/umw_rcs/training/r/session.RData")

• To load saved variables in a new R session
  – File -> Load Workspace, and specify the ".RData" file

    load("project/umw_rcs/training/r/session.RData")

Using R with Files: Command History

• All the commands issued at R prompt can be collected at the end of a session

• File -> Save History, and save the commands as a ".Rhistory" file in a user specified directory
  – loadhistory(file = ".Rhistory")
  – savehistory(file = ".Rhistory")

• This is recommended, because you may load and then recall a lengthy command later, from the command history file, using the Up/Down keyboard arrow keys

• User can edit a command history file and create a script

Using R with Files: Image files

• Images generated by R can also be saved as PDF using the "pdf" command

• This is one of the ways to get graphical output from a script that runs on a linux server

    pdf("simple-plot.pdf")
    plot(1:10,1:10,type="l",main="Simple plot")
    dev.off()

Using R with Files: Data import

• Files containing tab or comma delimited data can also be read in as a data.frame using the "read.table" function as illustrated in the Data Frames slide

• To write tab-delimited data from a data frame x to a file, use this syntax:

    write.table(x, "out.file.txt", row.names=FALSE, sep="\t", quote=FALSE)
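A write-then-read round trip is a simple way to confirm that the delimiter and header settings match. This sketch builds a small data frame from the gene names and scores used earlier, writes it to a temporary file (tempfile() is used here so nothing is left behind; any writable path works), and reads it back.

```r
scores = data.frame(gene  = c("ATM1", "BRCA", "TNLP2"),
                    score = c(48.1, 129.7, 73.2))

out_file = tempfile(fileext = ".txt")    # temporary path for the example

# Write tab-delimited text without row names or quotes
write.table(scores, out_file, row.names = FALSE, sep = "\t", quote = FALSE)

# Read it back; header = TRUE because the first line holds the column names
restored = read.table(out_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(restored)

unlink(out_file)    # clean up the temporary file
```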

R File access

• What is line-by-line file access?
  – Set of R commands/syntax to work with data files

• Why do we need it?
  – Makes reading data from files easy, we can also create new data files

• What different types are there?
  – Read, write, append

R File access Example

• Method 1

    con = file("input.file.txt", open="r")
    lines = readLines(con)
    for (line in lines) {
        print(line)
    }
    close(con)

Input file (field layout, then a sample record):
    Last name:First name:Age:Address:Apartment:City:State:ZIP
    Smith:Al:18:123 Apple St.:Apt. #1:Cambridge:MA:02139

R File access Example (…contd)

• Method 2

    con = file("input.file.txt", open="r")
    while (length(line <- readLines(con, n = 1)) > 0) {
        fields = strsplit(line, split=":")[[1]]
        cat(fields[2], fields[1], "\n")
        cat(paste0(fields[4], ","), fields[5], "\n")
        cat(paste0(fields[6], ","), fields[7], fields[8], "\n")
    }
    close(con)

Output (for the Smith record):
    Al Smith
    123 Apple St., Apt. #1
    Cambridge, MA 02139

R File access writing

• Example with formatting

    fr = file("mailing_list", "r")
    fw = file("labels", "w")
    number = 0                                      # running record number
    while (length(line <- readLines(fr, n = 1)) > 0) {
        fields = strsplit(line, split=":")[[1]]
        number = number + 1
        writeLines(sprintf("%s", number), fw, sep="\n")
        writeLines(paste(fields[c(2,1)], collapse=" "), fw)
        writeLines(paste(fields[c(2,3)], collapse=" "), fw)
        writeLines(paste(fields[4:6], collapse=" "), fw)
    }
    close(fr)
    close(fw)
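The read, split and write steps can be exercised end to end without any existing files. The sketch below is one possible arrangement, not the presenters' script: it creates a temporary input file containing the Smith record from the earlier slide plus one invented record (Jones), and format_label is a hypothetical helper that assumes the eight colon-separated fields shown earlier.

```r
# Create a small colon-delimited input file (second record is invented)
in_file = tempfile(fileext = ".txt")
writeLines(c("Smith:Al:18:123 Apple St.:Apt. #1:Cambridge:MA:02139",
             "Jones:Bea:34:9 Elm Rd.:Unit 4:Boston:MA:02108"), in_file)

# Turn one record into three label lines (hypothetical helper)
format_label = function(line) {
    f = strsplit(line, split = ":", fixed = TRUE)[[1]]
    c(paste(f[2], f[1]),                       # First Last
      paste0(f[4], ", ", f[5]),                # Address, Apartment
      paste0(f[6], ", ", f[7], " ", f[8]))     # City, State ZIP
}

# Read the file one line at a time and collect the label lines
con = file(in_file, open = "r")
labels = c()
while (length(line <- readLines(con, n = 1)) > 0) {
    labels = c(labels, format_label(line))
}
close(con)

# Write the labels to an output file and display them
out_file = tempfile(fileext = ".txt")
writeLines(labels, out_file)
cat(readLines(out_file), sep = "\n")

unlink(c(in_file, out_file))    # remove the temporary files
```

Keeping the formatting in its own function means the same loop could serve both for printing to the screen and for writing to a file.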

Assignment #3

• Use the function twosamp.R in a script that reads in data from a file (data1.txt) and does a t-test for each row (the file has 4 columns, 2 belonging to one group, and 2 belonging to another group)

• Sort the results by t-statistic
• Hint: use data frames, for loops, and twosamp.R

Other useful R functions

    dir()        getwd()       setwd(dir)     options()
    tapply()     sapply()      cut()          table()
    head()       unique()      intersect()    union()
    setdiff()    setequal()    eigen()        rep()
    seq()        file.show()

Plotting functions

    plot()       par()           dev.off()    abline()
    barplot()    pie()           hist()       stem()
    boxplot()    points(x,y)     arrows()     lines()
    pairs()      matplot()       image()      pdf() / png()
    jpeg()       qqplot()

Statistics functions

    range()         cumsum()           mean()            median()
    quantile()      IQR()              var()             sd()
    cov()           cor()              moment()          skewness()
    kurtosis()      binom.test()       wilcox.test()     t.test()
    kruskal.test()  prop.test()        chisq.test()      lm()
    summary()       coefficients()     predict()         resid()
    glm()           rnorm()            runif()           sample()

(moment(), skewness() and kurtosis() come from add-on packages such as "moments", not base R)

Packages in R

• A package is a coherent group of functions available for solving a particular problem in statistics

• Eg: "Bioconductor" has some packages which have functions that:– Load gene expression images– Extract gene scores and normalize– Fit gene error models and find differentially regulated genes

• Packages can be downloaded from www.r-project.org or through the Packages -> Install Packages menu in R

Providing input to programs

• It is sometimes convenient not to have to edit a program to change certain data variables

• R allows you to read data from shell directly into program variables with the "scan" or "readLines" command

• Examples:

    x = scan(n=1)         # Reads numbers
    y = readLines(n=1)    # Reads strings

Command Line Arguments

• Command line arguments are optional data values that can be passed as input to the R program as the program is run
  – After the name of the program, place string or numeric values with spaces separating them
  – Accessed by the "commandArgs" function inside the program
  – Avoid entering or replacing data by editing the program

• Examples:

    Rscript arguments.R 10 20 30 40
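Inside the script, commandArgs(trailingOnly = TRUE) returns just the values typed after the script name, as a character vector. The sketch below shows what a script such as arguments.R might contain; sum_arguments is a hypothetical function, and the arguments are passed in explicitly here so the example also runs at the interactive prompt.

```r
# Sum the numeric command line arguments.
# By default the values come from the command line; they always arrive as
# strings, so they are converted with as.numeric().
sum_arguments = function(args = commandArgs(trailingOnly = TRUE)) {
    numbers = as.numeric(args)
    sum(numbers)
}

# Simulates: Rscript arguments.R 10 20 30 40
total = sum_arguments(c("10", "20", "30", "40"))
cat("Sum of arguments:", total, "\n")
```

In a real script the call would simply be sum_arguments(), picking up whatever was typed on the command line.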

Using R programs on the cluster

• R scripts can easily be submitted as jobs to be run on the MGHPCC infrastructure

• Basic understanding of Linux commands is required, and an account on the cluster

• Lots of useful information, including account registration, at www.umassrc.org

• Feel free to reach out to Research Computing for help

What is a computing "Job"?

• A computing "job" is an instruction to the HPC system to execute a command or script
  – Simple linux commands or R scripts that can be executed within milliseconds would probably not qualify to be submitted as a "job"
  – Any command that is expected to take up a big portion of CPU or memory for more than a few seconds on a node would qualify to be submitted as a "job". Why? (Hint: multi-user environment)

How to submit a "job"

• The basic syntax is:

    bsub <valid linux command>

• bsub: LSF command for submitting a job
• Let's say user wants to execute an R script. On a linux PC, the command is

    Rscript countDNA.R

• To submit a job to do the work, do

    bsub Rscript countDNA.R

Specifying more "job" options

• Jobs can be marked with options for better job tracking and resource management
  – Job should be submitted with parameters such as queue name, estimated runtime, job name, memory required, output and error files, etc.

• These can be passed on in the bsub command

    bsub -q short -W 1:00 -R "rusage[mem=2048]" -J "Myjob" -o hpc.out -e hpc.err Rscript countDNA.R

Job submission "options"

    Option flag    Description
    -q             Name of queue to use. On our systems, possible values are "short" (<=4 hrs execution time), "long" and "interactive"
    -W             Allocation of node time. Specify hours and minutes as HH:MM
    -J             Job name. Eg. "Myjob"
    -o             Output file. Eg. "hpc.out"
    -e             Error file. Eg. "hpc.err"
    -R             Resources requested from assigned node. Eg: "-R rusage[mem=1024]", "-R span[hosts=1]"
    -n             Number of cores to use on assigned node. Eg. "-n 8"

Why use the correct queue?

• Match requirements to resources
• Jobs dispatch quicker
• Better for entire cluster
• Help MGHPCC staff determine when new resources are needed

A bioinformatics demo

• Log on to the UMass server using PuTTY on Windows or Terminal on Mac

• Request an interactive shell session on one of the compute nodes for this demo

    $ bsub -q interactive -W 4:00 -Is bash

• Navigate to the training directory or copy the examples to your local directory

    $ cd /project/umw_rcs/training/R

  OR

    $ cp /project/umw_rcs/training/R/* .

A bioinformatics demo (…contd)

• Load the R module to get the R prompt

    module load R/3.2.0
    R

• Let's say we have gene expression microarray data in a text file "sample_and_gene_matrix.txt"

• Goal is to load the data as a data frame, then perform a t-test on each gene, comparing 2 groups

    x = read.table("sample_and_gene_matrix.txt", header=T, row.names=1)
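Once the data frame is loaded, the per-gene t-test could be sketched as follows. The layout of the real file is not shown in these slides, so this example substitutes a small random data frame: the gene names, the sample names, and the assumption that the first three columns form one group and the last three the other are all made up for illustration.

```r
set.seed(1)

# Made-up expression data: 5 genes (rows) x 6 samples (columns)
expr = as.data.frame(matrix(rnorm(30, mean = 8), nrow = 5))
rownames(expr) = paste0("Gene", 1:5)
colnames(expr) = c("ctrl1", "ctrl2", "ctrl3", "trt1", "trt2", "trt3")

group1 = 1:3    # assumed columns of the first group
group2 = 4:6    # assumed columns of the second group

# Run a two-sample t-test on every row (gene)
t_stats = apply(expr, 1, function(row) {
    unname(t.test(row[group1], row[group2])$statistic)
})
p_values = apply(expr, 1, function(row) {
    t.test(row[group1], row[group2])$p.value
})

# Collect the results and list the genes from smallest to largest p-value
results = data.frame(t = t_stats, p = p_values)
print(results[order(results$p), ])
```

With thousands of genes the p-values would normally be adjusted for multiple testing, for example with p.adjust.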

Questions?

• How can we help further?
• Please check out books we recommend as well as web references (next 2 slides)

R Books

• R books which may be helpful
  – Hands-on Programming with R: http://shop.oreilly.com/product/0636920028574.do
  – Learning R: http://shop.oreilly.com/product/0636920028352.do
  – R Data Analysis Cookbook: http://shop.oreilly.com/product/9781783989065.do
  – The Essential R reference: http://shop.oreilly.com/product/9781118391419.do