__________________________________________________________________________________________________

10/25/2013 GCBA815

Tools and Algorithms in Bioinformatics

GCBA815, Fall 2013

Week11: Introduction to R

(Heatmaps and Scatter Plots)

Guest demonstrator: You Li, Sanjit Pandey

Babu Guda

Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center

__________________________________________________________________________________________________

Introduction to R

http://cran.r-project.org/

• An integrated suite of software facilities for data manipulation, calculation and graphical display

• Among other things it has

– an effective data handling and storage facility,

– a suite of operators for calculations on arrays, in particular matrices,

– a large, coherent, integrated collection of intermediate tools for data analysis,

– graphical facilities for data analysis and display either directly at the computer or on hardcopy.

• Resources:

– http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf

– http://cran.r-project.org/

__________________________________________________________________________________________________

Introduction to R

• Download:

• Download R for Windows

• Download R for Linux

• Download R for Mac

• Getting help:

– help(keyword)

– help(heatmap) or ?heatmap

– In the GUI version: further help topics are available under the “Help” menu.

__________________________________________________________________________________________________

Introduction to R

• R command basics

• Change the working directory:

– GUI: File > Change working directory

– Sets the default directory for file storage and retrieval.

– On Windows, write each path separator as ‘\\’, because a single ‘\’ is the escape character in R strings. Alternatively, use ‘/’.

> setwd("X:\\path\\to\\your\\work dir\\")

> getwd()

• Commands are case sensitive: ?heatmap and ?Heatmap are different requests.

• Use the up and down arrow keys to step through previous commands in the history.
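The escaping rule can be checked directly in the console. The sketch below is only an illustration: the path is the placeholder from the slide, and tempdir() is used as a directory that is guaranteed to exist.

```r
# A doubled backslash in R source is a single character once the string is parsed.
win_path <- "X:\\path\\to\\your\\work dir"   # placeholder path, not a real directory
fwd_path <- gsub("\\\\", "/", win_path)      # the same path written with forward slashes
nchar("\\")                                  # length of a single backslash
fwd_path

# setwd() returns the previous working directory (invisibly), so it can be restored.
old_dir <- setwd(tempdir())
new_dir <- getwd()
setwd(old_dir)
```

Keeping the value returned by setwd() is a convenient way to return to the original directory at the end of a script.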

__________________________________________________________________________________________________

Introduction to R

• Executing commands from a file:

> source("commands.R")

• Assignment:

> a <- 100

> a/10

> a+2

• The simplest R data structure is a numeric vector. c() combines its arguments into a single vector:

> a <- c(1,2,3,4,5,6,7,8,9)
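As a small illustration of why vectors matter, arithmetic and comparisons in R apply to every element at once. The summary calls below (length, sum, mean) are additions for illustration and are not part of the original slide.

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

length(a)        # number of elements
a / 10           # arithmetic is applied element by element
a + 2
total <- sum(a)
avg   <- mean(a)
big   <- a[a > 5]   # logical indexing keeps the elements that pass the test
big
```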

__________________________________________________________________________________________________

Introduction to R

• Loading data from a file:

– R is strict about input file formats. It is the user’s responsibility to provide an appropriately formatted file.

– read.table() and read.csv() read tabular data from CSV or any other character-delimited file. A plain text file is the safest choice.

• Example: read data from a tab-delimited file and store it as a matrix in “table”

> table <- as.matrix(read.table("heatmap_example_gcba815_100.txt", header=TRUE, row.names=1, sep = "\t"))
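To try the import step without the course file, the following sketch writes a tiny tab-delimited file to a temporary location and reads it back with the same arguments. The toy values and the variable name expr are made up for illustration.

```r
# Hypothetical toy data standing in for the course file.
toy <- data.frame(gene     = c("g1", "g2", "g3"),
                  sample_1 = c(1.5, 20, 0),
                  sample_2 = c(2.5, 18, 3))
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# header=TRUE uses the first line as column names;
# row.names=1 uses the first column as row names.
expr <- as.matrix(read.table(path, header = TRUE, row.names = 1, sep = "\t"))
dim(expr)
rownames(expr)
is.numeric(expr)
```

Because the only text column became the row names, every remaining column is numeric and as.matrix() yields a numeric matrix.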

__________________________________________________________________________________________________

Introduction to R

• Packages:

– A package is a set of tools that serves a specific function.

– Standard packages are part of the R source code and contain the basic functions that allow R to work.

– Contributed packages are written by different developers to add functionality missing from R, e.g. Bioconductor, Omegahat etc.

• List the installed packages:

> library()

• Install a package:

> install.packages("pheatmap")

• Load a package:

> library(pheatmap)
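A script can check whether a package is present before trying to load it. This is a minimal sketch using requireNamespace() from base R; whether to install automatically is left to the user.

```r
pkg <- "pheatmap"

if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed; run install.packages(\"", pkg, "\")")
}

# Standard packages such as 'stats' ship with R and should always be found.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
```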

__________________________________________________________________________________________________

Basic Heatmaps in R

• Heatmaps:

– A graphical representation of data in which the values in the dataset are represented by colors of different intensities.

– Useful for representing expression data, population density etc.

• Example:

> hmap <- heatmap(table, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(1,10), cexRow=0.7)

• Sort by a column (here the fourth column of table.sub, a subset of the columns of table):

> table.sub[order(table.sub[,4]),]

• Limit the dataset to a threshold: select only the rows whose value in the first column is greater than 4.

> heat_set<-table.sub[table.sub[,1]>4,]
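The sorting and thresholding steps can be tried on a small matrix. The matrix m below is invented for illustration. Note that order() only returns row positions, so the sorted result has to be assigned if it is needed later.

```r
# Hypothetical 4 x 2 matrix; matrix() fills column by column.
m <- matrix(c(5, 1, 9, 3,
              2, 8, 4, 6),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), c("sample_1", "sample_2")))

# Sort rows by the values in one column.
sorted <- m[order(m[, "sample_2"]), ]

# Keep only rows whose first column exceeds the threshold.
# drop = FALSE keeps the result a matrix even if a single row survives.
kept <- m[m[, "sample_1"] > 4, , drop = FALSE]

sorted
kept
```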

__________________________________________________________________________________________________

Basic Heatmaps in R

• Draw the heatmap:

• Try different variations of the parameters to fit your requirements.

> pheatmap(heat_set, cellwidth = 6, cellheight = 10, color = colorRampPalette(c("red", "darkgreen", "green"))(100), main = "Example heatmap", treeheight_row=60, treeheight_col=10, fontsize=5, fontsize_row=5, margins=c(5,10), border_color=NA)
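The color argument is simply a vector of colour codes, which can be inspected on its own. The sketch below builds the same palette and, only if pheatmap happens to be installed, draws a toy heatmap to a temporary PDF. The matrix toy is random data made up for illustration.

```r
# colorRampPalette() returns a function; calling it with 100 gives 100 interpolated colours.
pal <- colorRampPalette(c("red", "darkgreen", "green"))(100)
length(pal)
pal[c(1, 100)]    # the two ends of the gradient

# Hypothetical toy matrix standing in for heat_set.
set.seed(1)
toy <- matrix(rnorm(40), nrow = 8,
              dimnames = list(paste0("gene", 1:8), paste0("sample_", 1:5)))

# Guarded so the example also runs where pheatmap is not installed.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  out <- tempfile(fileext = ".pdf")
  pheatmap::pheatmap(toy, color = pal, border_color = NA,
                     main = "Example heatmap", filename = out)
}
```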

__________________________________________________________________________________________________

Basic Heatmaps in R

• Draw the heatmap

> table <- as.matrix(read.table("heatmap_example_gcba815_100.txt", header=TRUE, row.names=1, sep = "\t"))

> table.sub <- subset(table, select = c(sample_1,sample_2,sample_3,sample_4))

> library()

> install.packages("pheatmap")

> library(pheatmap)

> hmap <- heatmap(table, Rowv=TRUE, Colv=TRUE, col = cm.colors(256), scale="column", margins=c(5,20))

> #different color mode cm.colors -> heat.colors

> hmap <- heatmap(table, Rowv=TRUE, Colv=TRUE, col = heat.colors(256), scale="column", margins=c(5,20))

> #Order by a column

> table.sub[order(table.sub[,4]),]

> #select only the rows with value of first column greater than 4

> heat_set<-table.sub[table.sub[,1]>4,]

> library("gplots")

> library("pheatmap")

> pheatmap(heat_set, cellwidth = 6, cellheight = 10,color = colorRampPalette(c("red", "darkgreen", "green"))(100), main = "Example heatmap",treeheight_row=60, treeheight_col=10,fontsize=5, fontsize_row=5,margins=c(5,10),border_color=NA, file="heatmap_example_gcba85.pdf")

> dev.off()

__________________________________________________________________________________________________

Basic Scatter Plot in R

Final Version Overview

> setwd("X:/PATH/TO/YOUR/FILE/")

> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))

> class(table) <- "numeric"

> s1 <- subset(table, select="sample_1")

> s2 <- subset(table, select="sample_2")

> pdf("Comparison_S1S2_scatter.pdf")

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))

> abline(lm(log10(s2+1)~log10(s1+1)), col="red")

> abline(a=0, b=1, col="blue")

> dev.off()

__________________________________________________________________________________________________

Dataset Example

• Example (.csv)

__________________________________________________________________________________________________

Set working directory

• Set the working directory in R

– Windows writes paths with backslashes (e.g. C:\windows\) instead of forward slashes. In an R string the backslash is an escape character, so replace each backslash with a forward slash (or double it).

> setwd("X:/PATH/TO/YOUR/FILE/")

> getwd()

__________________________________________________________________________________________________

Load the data

• Load the dataset into R

– read.csv() reads the file, as.matrix() converts the result to a matrix, and class(table) <- "numeric" makes sure the values are stored as numbers. The relative path “./” is resolved against the working directory.

> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))

> class(table) <- "numeric"

__________________________________________________________________________________________________

Load the data (cont’d)

• Load the dataset into R

– The matrix is stored in the variable named “table”

__________________________________________________________________________________________________

Comparison between S1&S2

• Select the subset

> s1 <- subset(table, select="sample_1")

> s2 <- subset(table, select="sample_2")
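On a matrix, subset() with select keeps the matrix shape, whereas plain column indexing returns a vector. The sketch below uses an invented 3 x 2 matrix m to show the difference. Either form can be passed to plot() and cor().

```r
# Hypothetical toy matrix with the same column names as the course data.
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("sample_1", "sample_2")))

s1 <- subset(m, select = "sample_1")   # one-column matrix, row names kept
v1 <- m[, "sample_1"]                  # named vector, dimensions dropped

dim(s1)
dim(v1)
```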

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 1)

> plot(s1, s2)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 2)

> plot(log10(s1+1), log10(s2+1))
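The reason for adding 1 before taking the logarithm is that log10(0) is not finite, so zero values could not be plotted. A short check with made-up counts:

```r
counts <- c(0, 9, 99, 999)    # hypothetical values, including a zero

raw_log     <- log10(counts)      # the zero becomes -Inf
shifted_log <- log10(counts + 1)  # the zero maps to 0, the others to 1, 2, 3

raw_log
shifted_log
```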

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 3)

– xlab and ylab set the x-axis and y-axis labels

– pch specifies the plotting symbol (19 is a solid circle)

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 3)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

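• Plot (Version 4)

– Adjust how the x-axis and y-axis ranges are calculated by specifying xaxs and yaxs

– “i” (internal) fits the axis to the original data range; “r” (regular, the default) first extends the range by 4 percent at each end

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i")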

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 4)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 5)

– Set the x-axis and y-axis limits with xlim and ylim to display the right data range

– c(0, 2) means from 0 to 2

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2))

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 5)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 6)

– Calculate correlation coefficient (r)

> cor(log10(s1+1),log10(s2+1))
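By default cor() computes the Pearson correlation coefficient, which ranges from -1 to 1. The vectors below are invented to show the two extremes.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)        # y increases exactly linearly with x

r_pos <- cor(x, y)            # perfect positive linear relationship
r_neg <- cor(x, rev(y))       # reversing y makes the relationship perfectly negative
r_pos
r_neg

# Other coefficients are available through the 'method' argument.
cor(x, y, method = "spearman")
```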

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 6)

– “main” specifies the title of the graph

– paste(c(a, b), collapse=""), where a is “r = ” and b is round(cor(log10(s1+1), log10(s2+1)), digits=3), joins the two into a single string

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
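The title string can be built and inspected separately before it is passed to plot(). The value of r below is made up. paste0() and sprintf() are shown as possible alternatives, not as the approach used in the slides.

```r
r <- 0.98765   # hypothetical correlation value

title1 <- paste(c("r = ", round(r, digits = 3)), collapse = "")  # form used in the slides
title2 <- paste0("r = ", round(r, digits = 3))                    # shorter equivalent
title3 <- sprintf("r = %.3f", r)                                  # always prints three decimals

title1
title2
title3
```

The three forms agree here. They can differ when the rounded value ends in a zero, because round() drops trailing zeros while sprintf("%.3f") keeps them.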

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Version 6)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Final Version)

– Add the regression line of sample 2 (y-axis) on sample 1 (x-axis) in red

– Add the line y = x (blue), on which all points would fall if the two samples were identical

– “a=0, b=1” means “y = a + bx” => “y = x”

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))

> abline(lm(log10(s2+1)~log10(s1+1)), col="red")

> abline(a=0, b=1, col="blue")
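abline() draws y = a + bx from the intercept and slope of the fitted model, so the response in the lm() formula should be the variable on the vertical axis. The sketch below uses made-up data to show that swapping the two sides of the formula gives different coefficients.

```r
x <- c(0, 1, 2, 3)
y <- c(1, 3, 5, 7)        # exactly y = 1 + 2x

fit_yx <- lm(y ~ x)       # y as response: the line to draw on a plot(x, y)
fit_xy <- lm(x ~ y)       # x as response: describes x = -0.5 + 0.5y instead

coef(fit_yx)
coef(fit_xy)
```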

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Plot (Final Version)

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Create PDF file

– Commands that are used for generating the plot

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))

> abline(lm(log10(s2+1)~log10(s1+1)), col="red")

> abline(a=0, b=1, col="blue")

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Create PDF file

– Wrap the commands in the PDF graphics device: open it with pdf() before plotting and close it with dev.off() afterwards

> pdf("Comparison_S1S2_scatter.pdf")

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))

> abline(lm(log10(s2+1)~log10(s1+1)), col="red")

> abline(a=0, b=1, col="blue")

> dev.off()
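The device pattern can be tested on its own with any plot. The sketch below writes to a temporary file (the file name and the plotted values are placeholders) and then checks that the file was actually produced.

```r
out <- tempfile(fileext = ".pdf")   # placeholder output path

pdf(out)                            # open the PDF device: later plotting goes to the file
plot(1:10, (1:10)^2, pch = 19)
abline(a = 0, b = 10, col = "blue") # lines added after plot() land in the same page
dev.off()                           # close the device; the file is complete only after this

file.exists(out)
file.size(out)
```

If dev.off() is forgotten, the PDF may be empty or unreadable, and later plots keep going to the file instead of the screen.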

__________________________________________________________________________________________________

Comparison between S1&S2 (cont’d)

• Final Version

> setwd("X:/PATH/TO/YOUR/FILE/")

> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))

> class(table) <- "numeric"

> s1 <- subset(table, select="sample_1")

> s2 <- subset(table, select="sample_2")

> pdf("Comparison_S1S2_scatter.pdf")

> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))

> abline(lm(log10(s2+1)~log10(s1+1)), col="red")

> abline(a=0, b=1, col="blue")

> dev.off()