__________________________________________________________________________________________________
10/25/2013 GCBA815
Tools and Algorithms in Bioinformatics
GCBA815, Fall 2013
Week 11: Introduction to R
(Heatmaps and Scatter Plots)
Guest demonstrators: You Li, Sanjit Pandey
Babu Guda
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
__________________________________________________________________________________________________
Introduction to R
http://cran.r-project.org/
• Integrated suite of software facilities for data manipulation,
calculation and graphical display
• Among other things it has
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular
matrices,
• a large, coherent, integrated collection of intermediate tools
for data analysis,
• graphical facilities for data analysis and display either
directly at the computer or on hardcopy.
• Resources:
• http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
• http://cran.r-project.org/
__________________________________________________________________________________________________
Introduction to R
• Download:
• Download R for Windows
• Download R for Linux
• Download R for Mac
• Getting help:
• help(keyword)
• help(heatmap) or ?heatmap
• In the GUI version: different help topics are available under the “Help” menu.
__________________________________________________________________________________________________
Introduction to R
• R commands basics
• Change working directory:
• File > Change working directory
• On Windows, write the path separator as ‘\\’, because ‘\’ is the escape character in R strings. Alternatively, use ‘/’.
• The working directory is the default directory for file storage and retrieval.
• Commands are case sensitive: ?heatmap is not the same as ?Heatmap.
• Use the up and down arrow keys to access previous commands in the history.
> setwd("X:\\path\\to\\your\\work dir\\")
> getwd()
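The working-directory commands above can be tried safely with a temporary directory. This is a minimal sketch, not part of the course material: the directory name gcba815_demo is a hypothetical example, and file.path() is used so that no backslash escaping is needed on any operating system.

```r
# Remember where we started so the session can be restored afterwards.
old_dir <- getwd()

# Hypothetical demo directory inside R's per-session temporary directory.
demo_dir <- file.path(tempdir(), "gcba815_demo")
dir.create(demo_dir, showWarnings = FALSE)

# file.path() joins path components with "/", which R also accepts on
# Windows, so the path never contains a backslash that needs escaping.
setwd(demo_dir)
current_dir <- getwd()

# Inside a string literal, "\\" stands for one backslash character.
one_backslash <- "\\"
n_chars <- nchar(one_backslash)

# Restore the original working directory.
setwd(old_dir)
```

Saving the old directory first and restoring it at the end keeps the rest of the session unaffected by the experiment.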
__________________________________________________________________________________________________
Introduction to R
• Executing commands from a file:
> source("commands.R")
• Assignment:
> a <- 100
> a/10
> a+2
• The simplest R data structure is a numeric vector.
• c() combines its arguments into a single vector:
> a <- c(1,2,3,4,5,6,7,8,9)
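As a small illustration of assignment and vectors, the sketch below repeats the commands from this slide and adds a few element-wise operations. The variable names other than a are chosen here for illustration only.

```r
# A single number in R is a numeric vector of length 1.
a <- 100
a_tenth <- a / 10
a_plus2 <- a + 2

# c() combines its arguments into one vector.
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Arithmetic on a vector is applied element by element.
v_scaled <- v / 10

# Summary functions reduce the vector to a single value.
v_len <- length(v)
v_mean <- mean(v)
```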
__________________________________________________________________________________________________
Introduction to R
• Loading data from file:
• R is strict about input file formats. It is the user’s responsibility to provide an appropriately formatted file.
• “read.table()” or “read.csv()” function:
• These functions read tabular data from CSV or other character-delimited files. The safest choice is a plain-text file.
• Example: read data from a file and store it in “table”
> table <- as.matrix(read.table("heatmap_example_gcba815_100.txt", header=TRUE, row.names=1, sep = "\t"))
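To see what the read.table() arguments do without the course data file, the following sketch writes a tiny tab-delimited file with made-up values to a temporary path and reads it back. The result is stored in expr rather than table, simply to avoid masking R's built-in table() function; both the file contents and the name expr are assumptions made for this example.

```r
# Write a tiny tab-delimited file (made-up values) to a temporary path.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "gene\tsample_1\tsample_2",
  "geneA\t5.1\t3.2",
  "geneB\t0.4\t7.9",
  "geneC\t6.3\t6.0"
), demo_file)

# header = TRUE : the first line holds the column names.
# row.names = 1 : the first column holds the row names.
# sep = "\t"    : fields are separated by tabs.
expr <- as.matrix(read.table(demo_file, header = TRUE,
                             row.names = 1, sep = "\t"))

# Inspect the result: dimensions, row names and storage type.
expr_dim <- dim(expr)
expr_rows <- rownames(expr)
expr_is_numeric <- is.numeric(expr)
```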
__________________________________________________________________________________________________
Introduction to R
• Packages:
• A package is a set of tools that serves a specific function.
• Standard packages are part of the R source code and contain the basic functions that allow R to work.
• Contributed packages are written by different developers to add functionality missing from R, e.g. Bioconductor, Omegahat.
• List of available packages:
> library()
• Installing a package:
> install.packages("pheatmap")
• Loading a package:
> library(pheatmap)
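One possible way to combine the installing and loading steps is to check first whether the package is already present. The sketch below is only an illustration of that pattern; it deliberately does not install anything itself, because installation needs network access.

```r
# requireNamespace() returns TRUE if the package is installed,
# without attaching it to the search path.
pkg <- "pheatmap"
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  # Attach the package for this session.
  library(pkg, character.only = TRUE)
} else {
  # Only suggest the install command instead of running it.
  message("Package '", pkg, "' is missing; run install.packages(\"", pkg, "\")")
}

# Standard packages such as 'stats' ship with every R installation.
stats_available <- requireNamespace("stats", quietly = TRUE)
```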
__________________________________________________________________________________________________
Basic Heatmaps in R
• Heatmaps:
• A graphical representation of data in which the values in the dataset are represented by colors of different intensities.
• Useful for representing expression data, population density, etc.
• Example:
> hmap <- heatmap(table, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(1,10), cexRow=0.7)
• Sorting by a column (here the fourth):
> table.sub[order(table.sub[,4]),]
• Limit the data set to a threshold: select only the rows whose value in the first column is greater than 4.
> heat_set<-table.sub[table.sub[,1]>4,]
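The sorting and thresholding steps can be checked on a small matrix. The values and the names m, sorted and kept below are made up for this sketch; only the indexing pattern comes from the slide.

```r
# Made-up expression matrix: 4 genes (rows) x 4 samples (columns).
m <- matrix(c(5, 2, 9, 1,
              3, 8, 4, 7,
              6, 6, 2, 3,
              4, 1, 7, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), paste0("sample_", 1:4)))

# order() returns the row indices that sort column 4 in ascending order;
# using them as row index rearranges the whole matrix.
sorted <- m[order(m[, 4]), ]
rownames(sorted)   # "gene1" "gene4" "gene3" "gene2"

# A logical row index keeps only the rows where column 1 exceeds 4.
# drop = FALSE keeps the result a matrix even if only one row matches.
kept <- m[m[, 1] > 4, , drop = FALSE]
rownames(kept)     # "gene1" "gene3"
```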
__________________________________________________________________________________________________
Basic Heatmaps in R
• Draw the heatmap:
• Try different variations of the parameters to fit your requirements.
> pheatmap(heat_set, cellwidth = 6, cellheight = 10, color = colorRampPalette(c("red", "darkgreen", "green"))(100), main = "Example heatmap", treeheight_row=60, treeheight_col=10, fontsize=5, fontsize_row=5, margins=c(5,10), border_color=NA)
__________________________________________________________________________________________________
Basic Heatmaps in R
• Draw the heatmap
> table <- as.matrix(read.table("heatmap_example_gcba815_100.txt", header=TRUE, row.names=1, sep = "\t"))
> table.sub <- subset(table, select = c(sample_1,sample_2,sample_3,sample_4))
> library()
> install.packages("pheatmap")
> library(pheatmap)
> hmap <- heatmap(table, Rowv=TRUE, Colv=TRUE, col = cm.colors(256), scale="column", margins=c(5,20))
> #different color mode cm.colors -> heat.colors
> hmap <- heatmap(table, Rowv=TRUE, Colv=TRUE, col = heat.colors(256), scale="column", margins=c(5,20))
> #Order by a column
> table.sub[order(table.sub[,4]),]
> #select only the rows with value of first column greater than 4
> heat_set<-table.sub[table.sub[,1]>4,]
> library("gplots")
> library("pheatmap")
> pheatmap(heat_set, cellwidth = 6, cellheight = 10,color = colorRampPalette(c("red", "darkgreen", "green"))(100), main = "Example heatmap",treeheight_row=60, treeheight_col=10,fontsize=5, fontsize_row=5,margins=c(5,10),border_color=NA, file="heatmap_example_gcba85.pdf")
> dev.off()
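For readers without the course data file or the pheatmap package, a comparable heatmap can be sketched with base R alone. The matrix below holds random, made-up values, and the output goes to a temporary PDF; these choices are assumptions for illustration, not the course's dataset.

```r
# Reproducible made-up data: 10 genes x 4 samples.
set.seed(1)
m <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("sample_", 1:4)))

# Draw into a temporary PDF file instead of an on-screen window.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
hm <- heatmap(m,
              Rowv = NA, Colv = NA,    # no clustering: keep the input order
              col = heat.colors(256),  # 256-step color palette
              scale = "column",        # center and scale each column
              margins = c(5, 10))      # room for column and row labels
invisible(dev.off())                   # close the device to finish the file

# heatmap() invisibly returns the row and column order it used.
n_rows_drawn <- length(hm$rowInd)
```

Setting Rowv and Colv to TRUE (or leaving them at their defaults) would instead reorder rows and columns by hierarchical clustering and draw the dendrograms.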
__________________________________________________________________________________________________
Basic Scatter Plot in R
Final Version Overview
> setwd("X:/PATH/TO/YOUR/FILE/")
> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))
> class(table) <- "numeric"
> s1 <- subset(table, select="sample_1")
> s2 <- subset(table, select="sample_2")
> pdf("Comparison_S1S2_scatter.pdf")
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
> #regression of y (sample 2) on x (sample 1), matching the plot axes
> abline(lm(log10(s2+1)~log10(s1+1)), col="red")
> abline(a=0, b=1, col="blue")
> dev.off()
__________________________________________________________________________________________________
Dataset Example
• Example (.csv)
__________________________________________________________________________________________________
Set working directory
• Load the dataset into R
– Windows paths use backslashes (e.g. C:\windows\) instead of forward slashes. In R strings the backslash is an escape character, so replace each backslash in the path with a forward slash.
> setwd("X:/PATH/TO/YOUR/FILE/")
> getwd()
__________________________________________________________________________________________________
Load the data
• Load the dataset into R
> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))
> class(table) <- "numeric"
__________________________________________________________________________________________________
Load the data (cont’d)
• Load the dataset into R
– The matrix is stored in the variable named “table”
__________________________________________________________________________________________________
Comparison between S1&S2
• Select the subset
> s1 <- subset(table, select="sample_1")
> s2 <- subset(table, select="sample_2")
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 1)
> plot(s1, s2)
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 2)
> plot(log10(s1+1), log10(s2+1))
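The reason for adding 1 before taking the logarithm can be seen on a few made-up values: a value of 0 has no finite logarithm, while the shifted value does. The vector counts below is a hypothetical example.

```r
# Made-up expression values, including a zero.
counts <- c(0, 9, 99, 999)

# Without the shift, the zero becomes -Inf and cannot be plotted.
raw_log <- log10(counts)

# With the shift, every value stays finite (here 0, 1, 2 and 3).
shifted_log <- log10(counts + 1)

zero_is_finite <- is.finite(raw_log[1])       # FALSE
all_finite <- all(is.finite(shifted_log))     # TRUE
```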
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 3)
– xlab and ylab set the x-axis and y-axis labels
– pch specifies the plotting symbol (19 is a solid circle)
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19)
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 4)
– Adjust the x-axis and y-axis style with xaxs and yaxs
– “i” (internal) fits the axis to the data range; “r” (regular, the default) first extends the range by 4% at each end
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i")
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 5)
– Set xlim and ylim to display the desired data range
– c(0, 2) means from 0 to 2
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2))
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 6)
– Calculate correlation coefficient (r)
> cor(log10(s1+1),log10(s2+1))
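By default cor() computes Pearson's correlation coefficient. As a sketch of what that number is, the example below compares cor() with a hand calculation on two short made-up vectors; x, y and the result names are hypothetical.

```r
# Made-up paired values.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Built-in Pearson correlation.
r_builtin <- cor(x, y)

# The same value by hand: the sum of products of the deviations from
# the means, divided by the square root of the product of the summed
# squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_rounded <- round(r_builtin, digits = 3)   # 0.775
```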
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Version 6)
– “main” specifies the title of the graph
– paste(c(a, b), collapse=""), where a is “r = ” and b is round(cor(log10(s1+1), log10(s2+1)), digits=3), joins the two into a single string
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
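The title string can be built in several equivalent ways. The sketch below uses a hypothetical correlation value to compare the paste() call from the slide with paste0() and sprintf(), which are alternatives and not what the slides use.

```r
# Hypothetical correlation value, rounded to three digits.
r <- 0.774597
r_rounded <- round(r, digits = 3)

# As on the slide: paste() on one vector, glued with collapse = "".
title_a <- paste(c("r = ", r_rounded), collapse = "")

# Alternative: paste0() joins its arguments without a separator.
title_b <- paste0("r = ", r_rounded)

# Alternative: sprintf() formats to exactly three decimals,
# keeping trailing zeros that round() would drop.
title_c <- sprintf("r = %.3f", r)
```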
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Plot (Final Version)
– Add the regression line (red)
– Add the line y = x, on which the points would fall if both samples were identical (blue)
– “a=0, b=1” means “y = a + bx” => “y = x”
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
> #regression of y (sample 2) on x (sample 1), matching the plot axes
> abline(lm(log10(s2+1)~log10(s1+1)), col="red")
> abline(a=0, b=1, col="blue")
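In an lm() formula the response (the variable on the y-axis) goes on the left of ~ and the predictor (the x-axis variable) on the right, and abline() draws y = intercept + slope * x from the fitted coefficients. The sketch below shows this on made-up values; pdf(NULL) opens a PDF device that writes no file, so the example leaves nothing behind.

```r
# Made-up paired values.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit y as a linear function of x: response ~ predictor.
fit <- lm(y ~ x)
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["x"]]

# Swapping the two sides fits a different model with a different slope.
slope_swapped <- coef(lm(x ~ y))[["y"]]

# Draw the points and both lines on a device that writes no file.
pdf(NULL)
plot(x, y, pch = 19)
abline(fit, col = "red")             # fitted line from the lm object
abline(a = 0, b = 1, col = "blue")   # identity line y = x
invisible(dev.off())
```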
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Create PDF file
– Commands used to generate the plot
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
> #regression of y (sample 2) on x (sample 1), matching the plot axes
> abline(lm(log10(s2+1)~log10(s1+1)), col="red")
> abline(a=0, b=1, col="blue")
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Create PDF file
– Wrap the commands in the PDF graphics device: open it with pdf() before plotting and close it with dev.off() afterwards
> pdf("Comparison_S1S2_scatter.pdf")
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
> #regression of y (sample 2) on x (sample 1), matching the plot axes
> abline(lm(log10(s2+1)~log10(s1+1)), col="red")
> abline(a=0, b=1, col="blue")
> dev.off()
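The open, draw, close pattern of the PDF device can be tried on its own. In this sketch the output path is a temporary file and the plotted values are made up; the width and height arguments (in inches) are optional and shown only as an example.

```r
# Hypothetical output path in the session's temporary directory.
out_file <- tempfile(fileext = ".pdf")

# 1. Open the PDF device; all following plot commands draw into it.
pdf(out_file, width = 6, height = 6)

# 2. Draw. Nothing appears on screen while the PDF device is active.
plot(1:10, (1:10)^2, pch = 19, main = "Demo plot")

# 3. Close the device; this finalizes the file on disk.
invisible(dev.off())

pdf_written <- file.exists(out_file)
```

If dev.off() is forgotten, the file stays open and is usually incomplete when viewed.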
__________________________________________________________________________________________________
Comparison between S1&S2 (cont’d)
• Final Version
> setwd("X:/PATH/TO/YOUR/FILE/")
> table <- as.matrix(read.csv("./dataExample_gcba815_100.csv", header=TRUE, row.names=1))
> class(table) <- "numeric"
> s1 <- subset(table, select="sample_1")
> s2 <- subset(table, select="sample_2")
> pdf("Comparison_S1S2_scatter.pdf")
> plot(log10(s1+1), log10(s2+1), xlab="sample 1 (log10 transformed)", ylab="sample 2 (log10 transformed)", pch=19, xaxs="i", yaxs="i", xlim=c(0, 2), ylim=c(0,2), main=paste(c("r = ", round(cor(log10(s1+1), log10(s2+1)), digits=3)), collapse=""))
> #regression of y (sample 2) on x (sample 1), matching the plot axes
> abline(lm(log10(s2+1)~log10(s1+1)), col="red")
> abline(a=0, b=1, col="blue")
> dev.off()