Tutorial: analysing Microarray data using …Guiyuan Lei Tutorial: analysing Microarray data using...
Transcript of Tutorial: analysing Microarray data using …Guiyuan Lei Tutorial: analysing Microarray data using...
Tutorial: analysing Microarray data usingBioConductor
Guiyuan Lei
Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN)School of Mathematics & Statistics
Newcastle Universityhttp://www.mas.ncl.ac.uk/∼ngl9/
4 Feb, 2008
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Outline of Tutorial
Tutorial on CISBAN Internal wiki, under ‘Tools & Resources’page http://bioinf.ncl.ac.uk/cisban/doku.php?id=resources:resourceshome
Introduction to BioconductorFibroblast data set (Chapter 2) and Yeast time course data(Chapter 3)
Pre-process of dataModel Fitting for Identifying Differential ExpressionNetwork Inference
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Outline of Tutorial
Tutorial on CISBAN Internal wiki, under ‘Tools & Resources’page http://bioinf.ncl.ac.uk/cisban/doku.php?id=resources:resourceshome
Introduction to BioconductorFibroblast data set (Chapter 2) and Yeast time course data(Chapter 3)
Pre-process of dataModel Fitting for Identifying Differential ExpressionNetwork Inference
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Why Bioconductor
Bioconductor:open source software for bioinformaticsprovide innovative methodology for analyzing genomic datausing R statistical computing environment
R: Powerful grapphic feature and cut-edge statisticaltechniques, around 800 packages available, around 60basic packages (like affy, limma) in BioconductorPublished Papers using Bioconductorhttp://www.bioconductor.org/pub
Google Scholar Beta, PubMed, BEPress (BerkeleyElectronic Press), Biostatistics, BioMed CentralBioinformatics and IngentaFor example, in Bioinformatics, 161 papers found with‘Bioconductor’ in title
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Why Bioconductor
Bioconductor:open source software for bioinformaticsprovide innovative methodology for analyzing genomic datausing R statistical computing environment
R: Powerful grapphic feature and cut-edge statisticaltechniques, around 800 packages available, around 60basic packages (like affy, limma) in BioconductorPublished Papers using Bioconductorhttp://www.bioconductor.org/pub
Google Scholar Beta, PubMed, BEPress (BerkeleyElectronic Press), Biostatistics, BioMed CentralBioinformatics and IngentaFor example, in Bioinformatics, 161 papers found with‘Bioconductor’ in title
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Why Bioconductor
Bioconductor:open source software for bioinformaticsprovide innovative methodology for analyzing genomic datausing R statistical computing environment
R: Powerful grapphic feature and cut-edge statisticaltechniques, around 800 packages available, around 60basic packages (like affy, limma) in BioconductorPublished Papers using Bioconductorhttp://www.bioconductor.org/pub
Google Scholar Beta, PubMed, BEPress (BerkeleyElectronic Press), Biostatistics, BioMed CentralBioinformatics and IngentaFor example, in Bioinformatics, 161 papers found with‘Bioconductor’ in title
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Getting Bioconductor and associated packages
Install base packages, such as affy and limmasource("http://bioconductor.org/biocLite.R")biocLite()
Install specific packages, such as yeast2probesource("http://bioconductor.org/biocLite.R")biocLite("yeast2probe")
Set search directories in the .Renviron file, e.g.R_LIBS=/data/Rpackages/
Update Bioconductorsource("http://bioconductor.org/biocLite.R")update.packages(repos=biocinstallRepos(), ask=FALSE)
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Pre-process of data
Entering data into BioconductorExtraction of Cerevisiae probesetsExploratory data analysisNormalising Microarray dataProbeset level expression to gene level expressionPrincipal Component Analysis
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Entering data into Bioconductor
library(affy)fns2 = list.celfiles(path="data2", full.names=TRUE)rawdata = ReadAffy(filenames=fns2)print(rawdata)
Strain 0 hours 1 hour 2 hours 3 hours 4 hoursMutant 1 yeast01.cel yeast02.cel yeast03.cel yeast04.cel yeast05.celWild type 1 yeast06.cel yeast07.cel yeast08.cel yeast09.cel yeast10.celMutant 2 yeast11.cel yeast12.cel yeast13.cel yeast14.cel yeast15.celWild type 2 yeast16.cel yeast17.cel yeast18.cel yeast19.cel yeast20.celMutant 3 yeast21.cel yeast22.cel yeast23.cel yeast24.cel yeast25.celWild type 3 yeast26.cel yeast27.cel yeast28.cel yeast29.cel yeast30.cel
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Mask file for Cerevisiae probesets
Mask file to filter out pombe probesetshttp://www.affymetrix.com/Auth/support/downloads/mask files/s cerevisiae.zips_cerevisiae<-scan("s_cerevisiae.msk", skip=2, list("", ""))pombe_filter_out<-s_cerevisiae[[1]]
RemoveProbe 1
source("RemoveProbes.r")library(yeast2probe)cleancdf = cleancdfname("yeast2")RemoveProbes(listOutProbes=NULL, pombe_filter_out,
"yeast2cdf","yeast2probe")
1W.G. Alvord et al., “A microarray analysis for differential gene expressionin the soybean genome using Bioconductor and R.”, Briefings inBioinformatics, September 2007
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Cerevisiae probesets IDs
Yeast probeset IDslibrary(yeast2)genenames = as.list(yeast2GENENAME)YeastProbeID <- names(genenames)
Cerevisiae probeset IDsCerevisiaeProbeID <-YeastProbeID[-match(pombe_filter_out,YeastProbeID)]
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Cerevisiae gene names
Yeast Transcript IDs from annotation fileYeast 2.na24.annot.csv.zipyeast2annotation=read.csv(file="yeast2annotation.csv",
header=TRUE,stringsAsFactors=FALSE)YeastTranscriptID<-yeast2annotation[,3]yeast2annotationProbesetID<-yeast2annotation[,1]YeastTranscriptID<-YeastTranscriptID[match(YeastProbeID,
yeast2annotationProbesetID)]
Yeast gene namesYeastGeneName<-character()for(i in 1:length(YeastProbeID)){YeastGeneName[i]=genenames[i][[1]]if(is.na(YeastGeneName[i])){YeastGeneName[i]=YeastTranscriptID[i]}
}
Cerevisiae gene namesCerevisiaeGeneName<-YeastGeneName[-match(pombe_filter_out,YeastProbeID)]
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Exploratory data analysis: examining raw images
png(filename="cerevisiaeimage.png",width=960, height=480)par(mfrow=c(1,2))image(rawdata[,1])image(rawdata[,27])dev.off()
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Exploratory data analysis: probe intensities
png(filename="yeastintensities.png",width=960, height=960)hist(rawdata, lty=1:30, lwd=2)legend(14, 0.60, legend=sampleNames(rawdata), lty=1:30, lwd=2)dev.off()
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Exploratory data analysis: MA plots
png(filename="cerevisiaemaplot.png",width=960, height=480)par(mfrow=c(1,2))MAplot(rawdata,which=1)MAplot(rawdata,which=27)dev.off()
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Exploratory data analysis: RNA degradation
RNAdeg <- AffyRNAdeg(rawdata)plotAffyRNAdeg(RNAdeg)
RNA degradation plot
5' <−−−−−> 3' Probe Number
Mea
n In
tens
ity :
shift
ed a
nd s
cale
d
0 2 4 6 8 10
010
2030
40
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Pre-process of data: Normalisation
eset.rma=rma(rawdata)library(affyPLM)par(mfrow=c(1,2))boxplot(rawdata, col="red",main="Cerevisiae Probe intensities")boxplot(eset.rma, col="blue",main="Cerevisiae RMA expression values")
yeast01.cel yeast08.cel yeast15.cel yeast22.cel yeast29.cel
68
1012
1416
Cerevisiae Probe intensities
yeast01.cel yeast08.cel yeast15.cel yeast22.cel yeast29.cel
46
810
1214
Cerevisiae RMA expression values
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Probeset level expression to gene level expression
There are usually several probesets map to one gene inAffymetrix.
CerevisiaeGeneNameLevels<-factor(CerevisiaeGeneName)#Function to average the expression of probesets#which map to same geneprobeset2genelevel<-function(onesample){
return(tapply(onesample,CerevisiaeGeneNameLevels,mean))}#Do the average for each column/arrayCerevisiaeGeneData<-apply(CerevisiaeProbeData,2,probeset2genelevel)
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Principal Component Analysis
library(smida)cluster.samples(t(CerevisiaeProbeData),method="pca")
−20 −10 0 10 20 30 40
−10
−5
05
10
1st PC
2nd
PC
Clustering Samples
Principal Component Projection
yeast01.cel
yeast02.cel
yeast03.cel
yeast04.cel
yeast05.cel
yeast06.cel
yeast07.cel
yeast08.cel
yeast09.celyeast10.cel
yeast11.cel
yeast12.celyeast13.cel
yeast14.cel
yeast15.cel
yeast16.cel
yeast17.celyeast18.cel
yeast19.celyeast20.cel
yeast21.cel
yeast22.cel
yeast23.cel
yeast24.cel
yeast25.cel
yeast26.celyeast27.cel
yeast28.cel
yeast29.cel
yeast30.cel
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Model Fitting for Identifying Differential Expression
Limma modelConstruct design matrixConstruct constrasts
Plot time course for top differential expressionHeatmap
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Limma: design matrix
library(limma)levels = c("m0","m1", "m2", "m3", "m4","w0","w1", "w2", "w3", "w4")X= rep(levels, 3)TS <- factor(X, levels= levels)design <- model.matrix(˜0+TS)colnames(design) <- levels(TS)
E
2666666666666666664
0BBBBBBBBBBBBBBBBB@
yg1yg2...
yg10...
yg26yg27yg28yg29yg30
1CCCCCCCCCCCCCCCCCA
3777777777777777775
=
0BBBBBBBBBBBBBBBBB@
1 0 0 0 0 0 0 0 0 00 1 0 0 0 0 0 0 0 0...
......
......
......
......
...0 0 0 0 0 0 0 0 0 1...
......
......
......
......
...0 0 0 0 0 1 0 0 0 00 0 0 0 0 0 1 0 0 00 0 0 0 0 0 0 1 0 00 0 0 0 0 0 0 0 1 00 0 0 0 0 0 0 0 0 1
1CCCCCCCCCCCCCCCCCA
0BBBBBBBBBBBBB@
m0m1m2m3m4w0w1w2w3w4
1CCCCCCCCCCCCCA(1)
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Limma: construct constrsts and model fitting
For identifying differential expression, combine the contrasts bycomparing mutant type and wild type at time point 1,2,3 and 4.
#Model Fittingfit<-lmFit(CerevisiaeProbeData, design)mc<-makeContrasts(’m1-w1’,’m2-w2’,’m3-w3’,’m4-w4’,levels=design)fit2<-contrasts.fit(fit, mc)eb<-eBayes(fit2)
Different ways to rank the differentially expressed probesets:
topTable(eb,sort.by=’logFC’) #log-fold change
topTableF(eb) #F-statistics
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Up and down regulated list
modFpvalue<-eb$F.p.value #$F-test p valueselectedgenesindx<-p.adjust(eb$F.p.value,method="bonferroni")<0.05 #$Sig<-modFpvalue[selectedgenesindx]nsiggenes<-length(Sig) #number of differential expressionresults1<-decideTests(eb, method="global")modF<-eb$F #$F-test valuemodFordered<-order(modF, decreasing = TRUE)CerevisiaeRankProbe<-CerevisiaeProbeID[modFordered[1:nsiggenes]]CerevisiaeRankGeneName<-CerevisiaeGeneName[modFordered[1:nsiggenes]]updown<-results1[modFordered[1:nsiggenes],]
Probeset ID Gene Symbol T1 T2 T3 T4ProbesetID 1 Gene 1 -1 -1 -1 -1ProbesetID 2 Gene 2 1 1 1 1ProbesetID 3 Gene 3 -1 -1 -1 -1
Table: Up and down regulated list2
2Not use real gene names hereGuiyuan Lei Tutorial: analysing Microarray data using BioConductor
Plot time course for top differential expression
#Rank the i+1’th differential expressionindx <- rank(modF) == nrow(CerevisiaeProbeData)-i
M
M
M
M
M
0 1 2 3 4
89
1011
1213
ProbesetID 1 Gene 1 Rank= 1
time
Exp
ress
ion
M
M
MM
M
M
M
M
M
M
W W W WW
W W W W WW
W WW
W
M
M
M
M M
0 1 2 3 4
89
1011
1213
14
ProbesetID 2 Gene 2 Rank= 2
time
Exp
ress
ion
M
M
M
M M
M
M
M
MM
W W
W W W
W
W
W W W
W W
W W W
M
M
M
M
M
0 1 2 3 4
78
910
11
ProbesetID 3 Gene 3 Rank= 3
time
Exp
ress
ion
M
M
M
M
M
M
M
M
M
M
W
WW W W
W
W W W
W
W W
W W
W
Figure: Time course expression for top 3 differentially expressedYeast genes3
3Not use real gene names hereGuiyuan Lei Tutorial: analysing Microarray data using BioConductor
Code for plotting time course data
modF <- eb$F #$F is to get F-statisticwhich.M <- c(seq(1,5), seq(11,15), seq(21,25))which.W <- c(seq(6,10), seq(16,20), seq(26,30))par(mfrow=c(1,3),ask=T,cex=0.5)#cex:font sizefor(i in 0:2){indx <- rank(modF) == nrow(CerevisiaeProbeData)-irow1 = CerevisiaeProbeData[indx, which.M]row2 = CerevisiaeProbeData[indx, which.W]id=CerevisiaeProbeID[indx]name = CerevisiaeGeneName[indx]genetitle<-paste(sprintf("%.30s",id)," ",sprintf("%.30s",name)," Rank=", i+1)time=c(0,1,2,3,4)plot(time,row1[1:5],ylim=range(min(row1,row2), max(row1,row2)),
ylab="Expression", main=genetitle,pch=’M’,type=’b’,col=2)lines(time, row1[6:10] , pch=’M’, type=’b’, col=2)lines(time, row1[11:15], pch=’M’, type=’b’, col=2)lines(time, row2[1:5], pch=’W’, type=’b’, col=1)lines(time, row2[6:10], pch=’W’, type=’b’, col=1)lines(time, row2[11:15], pch=’W’, type=’b’, col=1)
}
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Heatmap
ngenes = 100m = matrix(nrow=ngenes,ncol=30)rnames = vector("list", length(1))for(i in 0:(ngenes-1)){indx <- rank(modF) == nrow(CerevisiaeProbeData) - im[i+1,] = CerevisiaeProbeData[indx,]rnames[i+1]=CerevisiaeGeneName[indx]}
}heatmap(m)
r1t0
r2t0
r3t0
r1t1
r2t1
r3t1
r1t2
r2t2
r3t2
r1t3
r2t3
r3t3
r1t4
r2t4
r3t4
Gene 2Gene 6Gene 38Gene 93Gene 80Gene 94Gene 88Gene 63Gene 85Gene 12Gene 91Gene 60Gene 46Gene 50Gene 19Gene 74Gene 92Gene 73Gene 100Gene 71Gene 69Gene 9Gene 43Gene 72Gene 32Gene 55Gene 64Gene 61Gene 75Gene 59Gene 86Gene 89Gene 45Gene 44Gene 83Gene 67Gene 68Gene 24Gene 53Gene 84Gene 90Gene 37Gene 87Gene 96Gene 15Gene 13Gene 21Gene 8Gene 39Gene 7Gene 3Gene 4Gene 5Gene 22Gene 42Gene 16Gene 95Gene 48Gene 77Gene 65Gene 76Gene 36Gene 20Gene 29Gene 33Gene 54Gene 52Gene 79Gene 51Gene 70Gene 97Gene 58Gene 62Gene 66Gene 14Gene 47Gene 23Gene 41Gene 1Gene 35Gene 57Gene 10Gene 99Gene 81Gene 98Gene 27Gene 40Gene 34Gene 31Gene 56Gene 11Gene 49Gene 82Gene 78Gene 18Gene 26Gene 25Gene 17Gene 30Gene 28
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
Network Inference
GeneNetStrimmer’s VAR model
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
GeneNet: partial correlation network
Yeast Network
Gene 41
Gene 87
Gene 64
Gene 99Gene 57
Gene 81
Gene 94 Gene 91
Gene 98
Gene 96
Gene 58
Gene 68
Gene 88
Gene 92
Gene 80
Gene 95
Gene 69
Gene 97
Gene 65
Gene 63
Gene 66Gene 74
Gene 49
Gene 67 Gene 89
Gene 70
Gene 82Gene 75
Gene 100
Gene 90
Gene 22
Gene 53
Gene 44
Gene 93
Gene 60
Gene 56
Gene 59Gene 55Gene 76
Gene 54
Gene 34
Gene 10
Gene 35
Gene 71
Gene 28
Gene 50
Gene 84
Gene 79
Gene 73
Gene 16
Gene 86
Gene 42
Gene 20
Gene 31
Gene 27
Gene 40Gene 39
Gene 77
Gene 18
Gene 38
Gene 24
Gene 48
Gene 33
Gene 15
Gene 32
Gene 14
Gene 83
Gene 11
Gene 23
Gene 25
Gene 36 Gene 4
Gene 52
Figure: Inferred network by GeneNet package for top 100differentially expressed Yeast genes
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
GeneNet step 1: build longitudial object
#Need to transpose data matrix, rows to be arraysm = t(m)#Need to rearrange the rows so that the rows are ordered by time points#using the property of design matrix#the entry design[i,j] with value 1 means array i is for time point j!!!mnew = t(matrix(nrow=ngenes,ncol=30))#Get the index of array ordered by time pointarrayindx<-numeric(0)ntime=5for(j in 1:ntime){arrayindx<-c(arrayindx,grep(1,design[,j]),grep(1,design[,j+ntime]))}mnew<-m[arrayindx,]library("GeneNet")# step 1: create longitudinal object #mlong = as.longitudinal(mnew,repeats=c(6,6,6,6,6),time=c(0,1,2,3,4))
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor
GeneNet: step 2 to step 5
# step 2: compute partial correlations #pcor.dyn <- ggm.estimate.pcor(mlong, method = "dynamic")# step 3: assign (local) fdr values to all possible edges #m.edges <- network.test.edges(pcor.dyn,direct=TRUE)dim(m.edges)# step 4: construct graph containing the 150 top edges #m.net <- extract.network(m.edges, method.ggm="number", cutoff.ggm=150)# step 5: plot graph using graphviz ##If rnames has no "", Need for Graphvizfor(i in 1:ngenes){
rnames[i] = paste(’"’,rnames[i],’"’,sep="")}
colnames(m) = rnamesnode.labels <- colnames(m)network.make.dot(filename="net.dot", m.net, node.labels, main="Yeast Network")
Guiyuan Lei Tutorial: analysing Microarray data using BioConductor