
EMBO Practical Course on Analysis of

High-Throughput Sequencing Data

Day 6 - ChIP-seq data analysis

Borbala Gerle

University College London, London, UK

Kathi Zarnack

CRUK London Research Institute, London, UK

26th October 2013

This practical illustrates common ChIP-seq analysis steps based on a number of Bioconductor packages (see References). We will start from aligned read data of ChIP-seq experiments with the sequence-specific transcription factor NF-κB.

In the first part, we will use the package chipseq to perform some initial filtering steps, determine the fragment size and obtain some diagnostic plots to assess the data quality. We will then extend the reads and use chipseq functionality to naively call peaks from the data. All these analyses will be performed in R.

In the second part, we will leave R and run several commonly used peak-calling algorithms, including MACS version 2, USeq and SISSRs. Returning to R, we will then compare the results of the different peak callers, assessing peak shape, genomic location and the overlap between peak callers. We will further perform motif analyses to identify the binding motif of this sequence-specific transcription factor and assess the location of this motif within the peaks.

In the last part, which is optional in this practical, we will look at approaches to assess differential binding using DESeq.


1 Peak calling

1.1 The aligned read data

For this practical, we will use the NF-κB ChIP-seq data from Kasowski et al. (2010). In addition to sequence reads from the ChIP-seq library, the study also provides sequencing data of the input DNA, which we will use for comparison.

In the first step of a ChIP-seq analysis, the obtained sequence reads are mapped to the reference genome of the organism that was used in the experiments. In our case, the reads were aligned to the human reference genome (hg19) allowing one mismatch using the program Bowtie that was introduced earlier in the course (http://bowtie-bio.sourceforge.net/index.shtml). The mapping results were then imported into R using the readAligned() function from the ShortRead package and filtered to keep only reads mapping to the canonical chromosomes. Storing the aligned read data as a GRanges object (a set of stranded genomic intervals) saves a large amount of memory, as sequences and quality strings are discarded.

Since these initial steps can be very time-consuming, we performed them prior to the practical. For convenience, we will only consider the subset of reads from both the IP ([["NFKB_IP"]]) and the input ([["input"]]) samples that map to chromosomes 10-12. These were stored in the NFKB object, which you can load from the shared folder. The commands that were used to generate the NFKB object from bed files of aligned reads are shown below (you do not need to run these during the practical!).

########### NOT TO BE RUN DURING THE COURSE ####################

### performed outside R ###

# conversion of sra into fastq

for i in *.sra

do

~/programmes/sratoolkit.2.1.7-centos_linux64/fastq-dump $i

done

# genome alignment using bowtie

for i in *.fastq

do bsub bowtie -n 1 -k 1 --best hg19 $i ${i%%.fastq}.aln

done

# PREPARATION OF THE DATA OBJECT

### performed in R ###

library(ShortRead)

library(rtracklayer)

# generically to define the folder where the aln files are stored

aln.path <- "/path/to/your/alignment/files/"

data.path <- "/path/to/your/results/folder/"

# specifically for this course define the folder where the aln files are stored

aln.path = "/home/data/ChIP_seq/"

# get all files in this folder

aln.files <- list.files(aln.path, pattern="\\.aln$", full.names=TRUE)


# import all files and convert into GRanges

NFKB <- seqapply(aln.files, function(file.name) {

reads=readAligned(file.name, type="Bowtie")

reads=as(reads, "GRanges")

return( reads )

}

)

names(NFKB) <- c("NFKB_IP","input")

seqlevels(NFKB) <- c("chr10", "chr11", "chr12")

# save file

save(NFKB, file=paste(data.path,"NFKB_data.Rdata",sep=""))

1.2 Extending reads

Start the practical by importing the required libraries and loading the NFKB object from the shared folder at /nfs/training/PeakCalling/.

# loading the required libraries

library(chipseq)

library(GenomicRanges)

# load the NFKB object

load("/nfs/training/PeakCalling/NFKB_data.Rdata")

NFKB is an object of class GRangesList, where each component represents the alignments from one sample (IP or input) as a GRanges object. Inspect the object (type NFKB and NFKB$NFKB_IP).

Our data consist of aligned 28-nt single-end reads which correspond to the 5' ends of the co-purified DNA fragments that were sequenced. Extending the alignment of each short read to the estimated fragment length will ensure that most intervals cover the actual binding position of NF-κB. Several methods to estimate the fragment length are implemented in the estimate.mean.fraglen() function. Depending on the experimental conditions used, the fragment lengths commonly vary around 200 nt. To recover the complete co-purified fragments, we therefore extend all reads to a length of 200 nt. Since NFKB is a GRangesList, we use the function endoapply() to loop over the GRanges objects in the list (endoapply() returns an object of the same class, i.e. a GRangesList in our case).

# estimate the fragment length

fraglen <- estimate.mean.fraglen(NFKB$NFKB_IP)

fraglen

# extend all reads to the estimated fragment length (approximately 200 nt)

reads.ext <- endoapply(NFKB, function(x) resize(x, width = 200))

reads.ext

The default behaviour of the function resize() from the GenomicRanges package is to extend the intervals in a "strand-aware" manner, meaning that all reads are extended in 3' direction on the respective strand. Type ?resize to see a range of other utility functions for modifying GRanges objects.
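To make the idea of "strand-aware" extension concrete, the following minimal sketch reproduces the coordinate arithmetic in base R on two hypothetical reads. It is an illustration only and not the GenomicRanges implementation; the data frame reads and the names frag.len and ext are made up for this example.

```r
# Toy illustration in base R (hypothetical reads, not the resize() implementation):
# extend 28-nt reads to 200 nt in the 3' direction of their own strand.
reads <- data.frame(
  start  = c(1000, 5000),
  end    = c(1027, 5027),
  strand = c("+", "-")
)
frag.len <- 200

plus <- reads$strand == "+"
ext <- reads
# plus strand: the 5' end is the start, so keep it and move the end downstream
ext$end[plus] <- reads$start[plus] + frag.len - 1
# minus strand: the 5' end is the end coordinate, so keep it and lower the start
ext$start[!plus] <- reads$end[!plus] - frag.len + 1

ext
```

Both extended intervals are 200 nt wide, but they grow in opposite genomic directions, which is why the plus- and minus-strand reads flanking a binding site end up covering the same central position.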

1.3 Coverage, islands and depth

In ChIP-seq, one is usually interested in the number of precipitated DNA fragments in the sample that were mapped to a given genomic locus. This is best represented by a "coverage vector" (or "pile-up vector"), which indicates how many reads cover each base in the genome. The function coverage() in the ShortRead package calculates such a vector from alignment information.

In order to allocate a vector of the right size, coverage() needs to know the length of the chromosomes. Since we specified the underlying genome version (genome="hg19") during the bed file import, this information is already included in our GRanges object. Otherwise, it could for example be retrieved from the BSgenome.Hsapiens.UCSC.hg19 package, which contains the full DNA sequences of the human genome (using seqlengths(Hsapiens)).

In a first step, we calculate the coverage only on the IP sample to introduce the basic steps, and then explain the processing of multiple samples in the next chapter.

# calculate the coverage vector

cov.IP <- coverage(reads.ext$NFKB_IP)

cov.IP

For efficiency, the coverage is stored in the run-length encoded Rle format. Rather than storing the read count on each base, Rle describes runs of identical values. Type ?Rle to learn more.
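The principle of run-length encoding can be tried out with the base R function rle(). Note that this is only a stand-in for the idea: the Rle class used by coverage() is a separate Bioconductor data structure, and the vector cov.toy below is invented for illustration.

```r
# Base R rle() as a stand-in for the idea behind the Bioconductor Rle class
cov.toy <- c(0, 0, 0, 1, 1, 2, 2, 2, 1, 0, 0)

runs <- rle(cov.toy)
runs$values    # the value of each run
runs$lengths   # how many consecutive positions share that value

# the original per-base vector can be recovered without loss
identical(inverse.rle(runs), cov.toy)
```

Eleven per-base values are stored as five runs here; on a real chromosome, where long stretches have zero coverage, the saving is far larger.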

In order to detect peaks of NF-κB binding, we next search for regions consisting of contiguous segments of non-zero coverage, also known as islands. Such islands can be identified using the function slice(), specifying a lower threshold of coverage that delimits the peaks from background. We start by including all pile-ups regardless of their height (lower=1).

# determine islands

islands <- slice(cov.IP, lower = 1)

islands

# to see an element within a list, use [[

islands[["chr10"]]

# get the number of peaks (summed over the 3 chromosomes)

sapply(islands,length)

sum(sapply(islands,length))

For each island, we can compute the number of reads that fall within it using the function viewSums(), which reports the sum of coverage values of all bases within the island.


# calculate number of reads within the islands

viewSums(islands)

viewSums(islands)/200

# get the frequency (step-by-step; check what happens with each step)

nread.tab <- viewSums(islands)/200

nread.tab <- table(nread.tab)

nread.tab <- colSums(nread.tab)

nread.tab
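Why does dividing viewSums() by 200 give the number of reads? Every extended read contributes exactly 200 to the summed coverage of its island. The base R sketch below checks this on four hypothetical reads; it mimics coverage(), slice(), viewSums() and viewMaxs() with plain vectors and is not how these functions are implemented. All object names are invented for this example.

```r
# Toy example in base R (hypothetical reads, already extended to 200 nt)
frag.len <- 200
starts <- c(101, 151, 251, 1001)   # start positions of four extended reads
chr.len <- 2000

# per-base coverage: add 1 at every position covered by a fragment
cov.toy <- integer(chr.len)
for (s in starts) {
  idx <- s:(s + frag.len - 1)
  cov.toy[idx] <- cov.toy[idx] + 1L
}

# islands are maximal runs of non-zero coverage; label them 1, 2, ... (0 = background)
runs <- rle(cov.toy > 0)
island.id <- rep(cumsum(runs$values) * runs$values, runs$lengths)
in.island <- island.id > 0

# summed coverage and maximum depth per island
island.sum <- tapply(cov.toy[in.island], island.id[in.island], sum)
island.max <- tapply(cov.toy[in.island], island.id[in.island], max)

island.sum / frag.len   # number of reads per island
island.max              # depth at the summit of each island
```

The first three reads overlap and form one island, the fourth read forms an island of its own, so the summed coverage divided by the fragment length recovers the read counts per island.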

Similarly, we can use the function viewMaxs() to calculate the maximum read depth of the islands (i.e. the height of the summits).

# calculate the number of reads at the summit (depth)

viewMaxs(islands)

# calculate read depth over islands (now all in one command)

depth.tab <- colSums(table(viewMaxs(islands)))

depth.tab <- data.frame( depth=as.numeric(names(depth.tab)), freq=depth.tab )

You can plot the distribution of island depths using the function xyplot() from the lattice package (see http://stat.ethz.ch/R-manual/R-devel/library/lattice/html/xyplot.html for more information). To save the plot to a file on your desktop instead of having it displayed on your screen, use bitmap(file="~/Desktop/<filename>.jpg",type="jpeg") before and dev.off() after the plotting command (alternatively, you can save it as a pdf using pdf(file="~/Desktop/<filename>.pdf")).

library(lattice)

xyplot( log(freq) ~ depth, data=depth.tab, subset=depth<=20,

type="p", pch=16,

ylim=c(-0.1,12.5),aspect=0.8

)

1.4 Processing multiple lanes

It is useful to be able to apply a procedure to all samples. A function for this purpose is seqapply(), which applies a function to a Sequence object and returns the result as another Sequence object, if possible. Let's therefore define a function islandDepthSummary() that combines all the analysis steps that we performed in the previous chapter as follows:

# define a custom function to apply all steps

islandDepthSummary <- function(x){

isl <- slice(coverage(x), lower = 1)

tab <- colSums(table(viewMaxs(isl)))

df <- data.frame( depth=as.numeric(names(tab)), freq=tab )

return(df)

}


Below we use the seqapply() function to summarise the full dataset and then flatten the returned data.frames with the function stack() into one data.frame for subsequent plotting.

# apply islandDepthSummary to extended reads

depth.islands <- seqapply(reads.ext, islandDepthSummary)

lapply(depth.islands, head)

# combine both samples into one data.frame

depth.islands <- stack(depth.islands)

names(depth.islands) <- c("sample", "depth", "freq")

depth.islands <- as.data.frame(depth.islands)

# plotting the distributions for IP and input sample together

xyplot( log(freq) ~ depth, groups=sample, data=depth.islands, subset=depth<=20,

type="p", pch=16, col=c("red","black"),

key=list(text=list(levels(factor(depth.islands$sample)),cex=1.2),col=c("red","black"), corner=c(0.9,0.9)),

ylim=c(-0.5,12.5),aspect=0.8

)

As you would expect, the IP sample contains many more peaks with higher depth compared to the input sample. If reads were sampled randomly from the genome, then the null distribution of the island depth k would be a Poisson distribution:

$$ f(k) = \frac{\lambda^{k} e^{-\lambda}}{k!} $$

where λ is the mean read depth over all islands. You could use this distribution to model the "noise" in the ChIP-seq experiment.
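As a rough illustration of how such a null model can be turned into a depth threshold, the base R sketch below evaluates the Poisson formula and looks for the smallest depth that is unlikely under the null. The value of lambda.toy and the significance level alpha.toy are arbitrary assumptions chosen for this example, and the procedure is a simplification, not the algorithm implemented in the chipseq package.

```r
# Illustration only: hypothetical parameters, simplified thresholding
lambda.toy <- 1.5   # assumed mean island depth under the null
k <- 0:10

# dpois() evaluates the Poisson formula given above
null.prob <- dpois(k, lambda.toy)
all.equal(null.prob, lambda.toy^k * exp(-lambda.toy) / factorial(k))

# upper-tail probability P(K >= k) for each depth k
tail.prob <- ppois(k - 1, lambda.toy, lower.tail = FALSE)

# smallest depth whose tail probability falls below the chosen level
alpha.toy <- 1e-4
cutoff.toy <- min(k[tail.prob < alpha.toy])
cutoff.toy
```

Islands at or above cutoff.toy would be rare if reads were placed at random, which is the reasoning behind using a depth cutoff to separate candidate peaks from background.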

1.5 Identifying peaks using chipseq

To obtain a set of putative binding sites, or peaks, we need to find regions that show coverage significantly above the noise level. Based on the Poisson-based approach for estimating the noise distribution as mentioned above, the function peakCutoff() returns a cutoff value for a specified false-discovery rate (FDR):

# determine a peak cutoff with FDR = 0.01%

peakCutoff(cov.IP, fdr = 0.0001)

Use the plot from the previous chapter to see how the returned cutoff value of 10 fits the observed depth distributions from IP and input. We next use the newly defined cutoff to identify the peaks using slice() and the peakSummary() function:

# get peaks above cutoff (use the rounded value you determined with peakCutoff() )

peaks.IP <- slice(cov.IP, lower = 10)


lapply(peaks.IP, head)

# summarize the peak information

chippeaks <- peakSummary(peaks.IP)

chippeaks

# convert from RangedData to GRanges

chippeaks <- as(chippeaks, "GRanges")

The result of peakSummary() is a RangedData object with two columns representing the maximum coverage depth and the total number of reads in the peak, which we convert into a GRanges object. It is possible to extend this object with additional columns (such as other peak statistics), which is often useful.

We can now compute the strand-specific coverage and look at the coverage underlying individual peaks using the following code. Can you explain the observed pattern?

# calculate the coverage separately for both strands

cov.pos <- coverage(NFKB$NFKB_IP[strand(NFKB$NFKB_IP) == "+"])

cov.neg <- coverage(NFKB$NFKB_IP[strand(NFKB$NFKB_IP) == "-"])

# get coverage over peaks

peaks.pos <- Views(cov.pos, ranges(chippeaks))

peaks.neg <- Views(cov.neg, ranges(chippeaks))

# identify the peaks with the highest depth on chr10 (use only peaks that are less than 500 nt wide)

sel <- width(chippeaks[seqnames(chippeaks)=="chr10"])<500

peak.order <- order(elementMetadata(chippeaks)$max[as.vector(seqnames(chippeaks)=="chr10")][sel], decreasing=TRUE)

# plot the coverage underlying the 5 highest peaks

coverageplot(peaks.pos$chr10[sel][peak.order[1]], peaks.neg$chr10[sel][peak.order[1]])

coverageplot(peaks.pos$chr10[sel][peak.order[2]], peaks.neg$chr10[sel][peak.order[2]])

coverageplot(peaks.pos$chr10[sel][peak.order[3]], peaks.neg$chr10[sel][peak.order[3]])

coverageplot(peaks.pos$chr10[sel][peak.order[4]], peaks.neg$chr10[sel][peak.order[4]])

coverageplot(peaks.pos$chr10[sel][peak.order[5]], peaks.neg$chr10[sel][peak.order[5]])


2 Running different peak callers

Although it is in principle possible to identify peaks by simply using a threshold to slice through a coverage vector, there are much more elaborate tools available to identify peaks in ChIP-seq data, which offer much more accurate peak predictions. In particular, these tools commonly take into account the input data to discriminate real peaks from artifacts, arising e.g. from PCR amplification biases or differences in chromatin accessibility.

In this part of the practical, we will leave R and apply several commonly used algorithms to call peaks on our NF-κB ChIP-seq data. These include MACS version 2 (https://github.com/taoliu/MACS), USeq (http://useq.sourceforge.net/) and SISSRs (http://sissrs.rajajothi.com/), which were introduced in the lecture this morning. We will run all peak callers on the bed files SRR038461.bed (NF-κB IP) as well as SRR038464_1 and SRR038465 (input replicates), which are located in the shared folder at /nfs/training/PeakCalling/, and then compare the results.

Running MACS v2:

cd ~/Desktop/

macs2 callpeak -t /nfs/training/PeakCalling/treatment/SRR038461.bed \
    -c /nfs/training/PeakCalling/all_control.bed -n NFkB_MACS2 -f BED -g hs

-t and -c specify the treatment and control files. Note that MACS cannot handle replicates, which is why we merged the two input replicates in a combined bed file called all_control.bed. -f defines the input format, -g the genome, and -n can be used to set a name for the output file. For further parameter options, please take a look at https://github.com/taoliu/MACS/blob/master/README.rst. MACS will save the output file into the folder where you are when starting the program. Therefore, please make sure you go to ~/Desktop/ before starting MACS v2.


Running SISSRs:

sissrs.pl -i /nfs/training/PeakCalling/treatment/SRR038461.bed \
    -o ~/Desktop/SISSRS_peaks -s 3080000000 \
    -b /nfs/training/PeakCalling/all_control.bed

Parameter options for SISSRs can be found at http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf.

Running USeq:

java -Xmx2G -jar ~/USeq/Apps/ChIPSeq -s ~/Desktop/USeq_NFkB \
    -t /nfs/training/PeakCalling/treatment/ -c /nfs/training/PeakCalling/control/ \
    -y bed -v H_sapiens_Feb_2009 -b -e -m

To learn about the different parameter options, please take a look at http://useq.sourceforge.net/cmdLnMenus.html#ChIPSeq.

Since USeq is the fastest of these algorithms, we will use it to test several parameter settings:

- One example to test would be the minimum number of reads required per window. The default is 10, so we could also test 1 and 5 (parameter -i).

- Other relevant parameters include the peak shift (-p; by default, USeq calculates this from the optimum of the data or defaults to 150; you could try 20 and 77 for smaller fragment sizes or just fix it to the default value of 150) and the window size (-w; by default, this value is calculated as peak shift + standard deviation or defaults to 250 if no peak shift could be calculated; for our comparisons, we always set it to 250 to avoid re-calculation).


3 Comparing the peak caller results

3.1 Importing the peaks

In this chapter, we return to R to compare the results from the different peak callers that were used in the last chapter. As a first step, we will define custom functions to import the peak positions of the different peak callers and convert them into GRanges objects to aid comparisons.

#### custom function for MACS v2

# function to convert MACS output table into a GRanges object

macs2GRanges <-function(peaks) {

# generate GRanges object

myrange <- GRanges(

seqnames=peaks$chr,

ranges=IRanges(start=peaks$start, end=peaks$end, names=paste(peaks$chr,peaks$start,sep=":")),

strand="*",

count=peaks$pileup,

score=peaks$X.log10.pvalue.,

FE=peaks$fold_enrichment,

fdr=peaks$X.log10.qvalue.,

maxpos=peaks$abs_summit

)

return(myrange)

}

# load MACS v2 peaks and convert to GRanges

macspeaks <- read.table("~/Desktop/NFkB_MACS2_peaks.xls", header=TRUE, sep="\t", stringsAsFactors=FALSE)

macspeaks <- macs2GRanges(macspeaks)

macspeaks <- macspeaks[seqnames(macspeaks) %in% c("chr10","chr11","chr12")]

#### custom function for USeq

# function to convert USeq output table into a GRanges object

useq2GRanges <-function(peaks) {

# generate GRanges object

myrange <- GRanges(

seqnames=peaks$chr,

ranges=IRanges(start=peaks$start, end=peaks$end, names=paste(peaks$chr,peaks$start,sep=":")),

strand="*",

score=peaks$score,

# use midpoint of summit bin as maxpos

maxpos=peaks$summit.start+floor((peaks$summit.end-peaks$summit.start)/2)

)

return(myrange)

}

# load USeq peaks (only relevant columns) and convert to GRanges

## Note that you need to complete the path to the USeq file below as the exact naming differs with any new run

useqpeaks <- read.table("~/Desktop/USeq_NFkB/ScanSeqs/EnrichedRegions.../windowData... .xls", skip=1,

header=FALSE, sep="\t", stringsAsFactors=FALSE,

colClasses=c("NULL", "character", "numeric", "numeric", rep("NULL",7), "numeric",

"numeric", rep("NULL", 7), "numeric", rep("NULL",4))

)

names(useqpeaks) <- c("chr","start","end","summit.start","summit.end", "score")

useqpeaks <- useq2GRanges(useqpeaks)

useqpeaks <- useqpeaks[seqnames(useqpeaks) %in% c("chr10","chr11","chr12")]


#### custom function for SISSRs

# function to convert SISSRs output table into a GRanges object

sissr2GRanges <-function(peaks) {

# generate GRanges object

myrange <- GRanges(

seqnames=peaks$Chr,

ranges=IRanges(start=peaks$cStart, end=peaks$cEnd, names=paste(peaks$Chr,peaks$cStart,sep=":")),

strand="*",

count=peaks$NumTags,

score=peaks[,"p-value"],

FE=peaks$Fold

)

return(myrange)

}

# load SISSRs peaks and convert to GRanges (keep only peaks on chr10-12)

sissrpeaks <- read.table("~/Desktop/SISSRS_peaks", skip=57, header=FALSE, sep="\t",

stringsAsFactors=FALSE, comment.char="=")

names(sissrpeaks) <- scan("~/Desktop/SISSRS_peaks", skip=55, nlines=1, sep="\t",

what="character")

sissrpeaks <- subset(sissrpeaks, Chr %in% c("chr10","chr11","chr12"))

sissrpeaks <- sissr2GRanges(sissrpeaks)

# use midpoint as maxpos, since SISSRs does not report the summit position

elementMetadata(sissrpeaks)$maxpos <- start(sissrpeaks) + floor(width(sissrpeaks)/2)

3.2 Comparing peak overlaps, width and shape

As a first step to compare the performance of the different peak callers, we can calculate the overlap between the three sets, which can be displayed in a Venn diagram (note that the overlap of some peaks might be ambiguous due to their shifted locations in the different sets). To obtain a 'weighted' Venn diagram, in which the circles and overlapping regions are drawn roughly to scale, set the parameter doWeights in the plotting function to TRUE. Venn diagrams offer an informative comparison for 3-4 datasets, but get harder to interpret with more datasets. The example below compares the peaks called by our naive chipseq approach from Chapter 1 with the results of MACS v2 and USeq. Adjust the code to compare different combinations of peak caller results.

# calculate overlaps between peak sets (centered on chipseq peaks)

chip2macs <- chippeaks%over%macspeaks

chip2useq <- chippeaks%over%useqpeaks

# (100) chippeaks, (010) macspeaks, (001) useqpeaks

weights <- c(

"100"=sum( !(chippeaks%over%union(macspeaks,useqpeaks)) ),

"010"=sum( !(macspeaks%over%union(chippeaks,useqpeaks)) ),

"001"=sum( !(useqpeaks%over%union(chippeaks,macspeaks)) ),

"110"=sum( chip2macs & !chip2useq),

"101"=sum( !chip2macs & chip2useq),

"011"=sum( useqpeaks%over%macspeaks & !(useqpeaks%over%chippeaks) ),

"111"=sum( chip2macs & chip2useq )

)


# create Venn object

library(Vennerable)

venn.tab <- Venn(SetNames=c("chipseq","MACS2","USeq"), Weight=weights)

# plot Venn diagram

plot(venn.tab, doWeights=TRUE)
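The seven weights correspond to the seven regions of a three-set Venn diagram, labelled by membership in (chipseq, MACS2, USeq). The base R sketch below shows this bookkeeping on hypothetical peak identifiers, where membership is unambiguous. It is a simplification: with genomic ranges, one peak may overlap several peaks of another set, so the counts depend on which set is used as the reference. All names below are invented for this example.

```r
# Toy illustration with peak identifiers instead of genomic ranges
set.a <- c("p1", "p2", "p3", "p4")   # stands in for chipseq peaks
set.b <- c("p2", "p3", "p5")         # stands in for MACS2 peaks
set.c <- c("p3", "p4", "p5", "p6")   # stands in for USeq peaks

# build the three-digit membership code for every peak
all.ids <- unique(c(set.a, set.b, set.c))
region <- paste0(as.integer(all.ids %in% set.a),
                 as.integer(all.ids %in% set.b),
                 as.integer(all.ids %in% set.c))

# count peaks per Venn region, keeping empty regions as zero
region.levels <- c("100", "010", "001", "110", "101", "011", "111")
weights.toy <- table(factor(region, levels = region.levels))
weights.toy
```

Each peak is counted in exactly one region, so the weights sum to the number of distinct peaks. This property does not necessarily hold for the range-based weights above, which is the ambiguity mentioned in the text.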

To get a better idea of the behaviour of the different peak callers, we can also compare the widths of peaks.

# generate a dataframe including the peak widths

df <- data.frame(peak.caller=factor(rep(c("chipseq", "MACS2", "USeq"),

c(length(chippeaks), length(macspeaks), length(useqpeaks)))))

df <- data.frame(df, width=c(width(chippeaks), width(macspeaks), width(useqpeaks)))

# plot histograms of the peak width

histogram( ~width | peak.caller, data=subset(df, width<2000), breaks=20)

Another way to check the peaks from the different sets is to plot the average shape of the peaks. In the following example, we produce such a plot for NF-κB ChIP-seq reads averaged around the top 100 peak summits predicted by MACS v2 on chromosomes 10-12.

# define windows around summits of the top 100 peaks (+- 1000 bp)

peak.windows <- macspeaks[order(elementMetadata(macspeaks)$score, decreasing=TRUE)[1:100]]

start(peak.windows) <- elementMetadata(peak.windows)$maxpos-1000

end(peak.windows) <- elementMetadata(peak.windows)$maxpos+999

# get sections from coverage

summit.cov <- Views( cov.IP, ranges(peak.windows))

ss <- which(viewSums(summit.cov)>0)

summit.cov <- summit.cov[ss]

# resolve list structure and combine into matrix

summit.cov <- lapply(summit.cov, function(x) lapply(x,as.vector))

summit.cov <- do.call(c,summit.cov)

summit.cov <- do.call(rbind,summit.cov)

# plot

xyplot( colSums(summit.cov)/nrow(summit.cov) ~ -1000:999,

type="l", main="average NFKB fragment profile (over peak summits)",

xlab="distance to summit", ylab="average fragment depth"

)

Try to prepare a similar plot for the peaks from the other peak callers. Note that since USeq reports a summit bin instead of a single position, we use the midpoint of this bin to center the plots (see import of USeq peaks above).

3.3 Placing peaks in their genomic context

It is a common goal to classify peaks of interest in the context of genomic features such as promoters, upstream regions, etc. Using the GenomicFeatures package, one can obtain gene annotations from different data sources, including UCSC and Ensembl. For interest, the following code demonstrates how to download the latest Ensembl gene predictions for the human genome and generate a GRanges object containing contiguous transcribed regions of the genome (from the first to the last exon). To save time during the practical, we will import a previously prepared GRanges object containing human transcript annotations (the code that was used to prepare gregions is given below).

#library(GenomicFeatures)

#db <- makeTranscriptDbFromBiomart()

#gregions <- transcripts(db)

#gregions <- gregions[as.vector(seqnames(gregions))%in%c(10:12)]

#seqlevels(gregions) <- c("chr10","chr11","chr12")

#save(gregions, file="/nfs/training/PeakCalling/gregions.RData")

We will use the transcript coordinates to estimate promoter positions for each transcript, using a window of 2 kb before and 500 bp after the TSS. We will determine the number of peaks that fall into these promoter regions for all peak sets. Which peak caller shows the highest percentage of peaks in promoter regions? Compare this to the number of peaks falling into similar-sized regions at the 3' end of genes (hint: flank(..., start=FALSE)).

load("/nfs/training/PeakCalling/ChIP_seq/gregions.Rdata")

# get promoter locations

promoters <- flank(gregions, width=2000)

promoters <- resize(promoters, width=2500)

promoters <- unique(promoters)

# check overlap of chipseq peaks

table(chippeaks %over% promoters)

table(chippeaks %over% promoters)/length(chippeaks)

# determine percentage for all peak callers

all.peaks <- list(chippeaks, macspeaks, useqpeaks, sissrpeaks)

names(all.peaks) <- c("chipseq", "MACS2", "USeq", "SISSRs")

perc.prom <- sapply( all.peaks, function(peaks){

sum(peaks%over%promoters)/length(peaks)

})

perc.prom
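The promoter windows built with flank() and resize() above depend on the strand of each transcript. The base R sketch below spells out the coordinate arithmetic for two hypothetical transcripts, following our understanding of the default behaviour of these two functions (upstream flank, then resize anchored at the strand-aware start). The data frame tx and all other names are invented for this example.

```r
# Toy illustration in base R: 2 kb upstream and 500 bp downstream of the TSS
tx <- data.frame(
  start  = c(10000, 50000),
  end    = c(12000, 58000),
  strand = c("+", "-")
)
up <- 2000
down <- 500

plus <- tx$strand == "+"
# the TSS is the start of plus-strand and the end of minus-strand transcripts
tss <- ifelse(plus, tx$start, tx$end)

# upstream means lower coordinates on "+" and higher coordinates on "-"
prom.start <- ifelse(plus, tss - up, tss - down + 1)
prom.end   <- ifelse(plus, tss + down - 1, tss + up)

data.frame(tx, tss, prom.start, prom.end, width = prom.end - prom.start + 1)
```

Both windows are 2500 bp wide, with the first 500 bp of the transcript included and the remaining 2000 bp lying upstream on the respective strand.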

A common plot to generate with ChIP-seq data shows how the coverage correlates with specific genomic coordinates, e.g. transcription start sites, summit positions or other ChIP-seq peaks. The concept is very similar to the approach we used to determine the average read coverage around the peak summits. In the following example, we plot the average coverage of NF-κB along the promoters (note that this plot is done with the raw reads rather than peak coordinates; you could however repeat the analyses using only reads that fall within significant peaks).

# get sections from coverage

prom.cov <- Views( cov.IP, ranges(promoters))


ssp <- which(viewSums(prom.cov)>0)

prom.cov <- prom.cov[ssp]

# resolve list structure (list of lists) and combine into matrix (1st step to convert Rle's takes some time)

prom.cov <- lapply(prom.cov, function(x) lapply(x,as.vector))

prom.cov <- do.call(c, prom.cov)

prom.cov <- do.call(rbind, prom.cov)

ssp <- lapply(ssp, function(x) lapply(x, as.vector))

ssp <- as.vector( unlist(ssp) )

# invert profiles for transcript on minus strand

minus <- as.vector(strand(promoters)[ssp])=="-"

prom.cov[minus,] <- prom.cov[minus, ncol(prom.cov):1, drop=FALSE]

# plot

xyplot( colSums(prom.cov)/nrow(prom.cov) ~ -2000:499,

type="l", main="average NFKB fragment profile around TSS",

xlab="distance to TSS", ylab="average fragment depth"

)
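When the profiles are stored as rows of a matrix, inverting the minus-strand profiles means reversing the column order of the selected rows only, so that all profiles run from upstream to downstream before averaging. The base R sketch below shows this on a small hypothetical matrix; prof and strand.toy are invented for this example.

```r
# Toy illustration: one coverage profile per row, columns ordered by genomic position
prof <- rbind(c(1, 2, 3, 4),   # plus-strand transcript
              c(4, 3, 2, 1),   # minus-strand transcript (signal rises towards lower coordinates)
              c(0, 1, 2, 3))   # plus-strand transcript
strand.toy <- c("+", "-", "+")

# reverse the column order for minus-strand rows only
minus <- strand.toy == "-"
prof[minus, ] <- prof[minus, ncol(prof):1, drop = FALSE]

# average profile; equivalent to colSums(prof)/nrow(prof)
colMeans(prof)
```

Note that indexing a matrix with a single logical vector and no comma selects individual elements rather than rows, which is why the row index and the explicit column order are both given here.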

3.4 Visualising the data in a genome browser

When performing computational analyses, it is important to look at your input data and results in a genome browser. This can either be a webserver like the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway) or programmes like the Integrated Genome Browser (IGB; http://bioviz.org/igb/) that you run locally on your computer. You will find a link to the local installation on your desktop.

In our case, we would like to upload the peak locations and compare them to the read coverage from the NF-κB IP and the input sample. To this end, we need to output the data in a format that can be readily imported into the genome browser. A suitable format for the read coverages is a WIG file (http://genome.ucsc.edu/goldenPath/help/wiggle.html), while the peak locations can be displayed as a BED file (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). Functions to import and export a number of commonly used NGS file formats can be obtained from the package rtracklayer. An example for the peak locations is given below.

library(rtracklayer)

# specific output folder

wig.path <- "~/Desktop/<your folder>/"

# set strand to "+"

strand(chippeaks)="+"

strand(useqpeaks)="+"

strand(macspeaks)="+"

# write bed files of the peaks

export.bed(chippeaks, paste(wig.path,"/chipseq_peaks.bed",sep=""))

export.bed(useqpeaks, paste(wig.path,"/USeq_peaks.bed",sep=""))

export.bed(macspeaks, paste(wig.path,"/MACS2_peaks.bed",sep=""))

The resulting .bed files can be loaded into IGB (Integrated Genome Browser, http://bioviz.org/igb/) for visualisation. To do this, start IGB, select your genome of interest (human genome, version hg19) and then import your newly created files.

4 Motif analysis

In the next step of the practical, we want to identify the binding motif that is recognised by NF-κB. We will use the BSgenome package that contains infrastructure for Biostrings-type objects. These implement memory-efficient string containers, string-matching algorithms and other utilities for fast manipulation of large biological sequences or sets of sequences (see the BSgenome vignette for details). We will start by loading the package for the human genome hg19.

library("BSgenome")

# check which genomes are available

available.genomes()

# load the human genome (version hg19)

library("BSgenome.Hsapiens.UCSC.hg19")

4.1 De novo motif discovery

High-resolution ChIP-seq data allows for an efficient detection of sequence motifs that are preferentially bound by the studied transcription factors (if such motifs actually exist). De novo motif discovery can reveal the binding specificity of an unknown factor, confirm the validity of candidate motifs and help validate ChIP-seq experiments.

Here, we will take advantage of the ability of the ChIP-seq peak detection software MACS to determine peak summit positions in order to perform a de novo motif discovery on a restricted set of sequences (if you want to perform a similar search on the USeq peaks for comparison, you could use the center position as before). Taking a window around the center of the bound region, we will save the 100 top-ranking and the 100 bottom-ranking peaks as well as a random set of 100 peaks into fasta files and submit them to MEME for motif discovery. MEME is one of the most popular packages for motif discovery (Bailey et al. 2009). It employs expectation maximization to detect sequence patterns that occur repeatedly in a group of sequences, and can be used both online and in a locally installed version. It will output results in HTML format, which can then be visualised in a browser of choice.

# use a distance of 25 nt on either side of the summit (51-nt window)

mydist <- 25

# define summit regions

macs.summits <- GRanges(
  seqnames=seqnames(macspeaks),
  ranges=IRanges( start=elementMetadata(macspeaks)$maxpos - mydist,
                  end=elementMetadata(macspeaks)$maxpos + mydist),
  strand="+"
)

# order the summit regions based on their score

macs.summits <- macs.summits[order(-elementMetadata(macspeaks)$score)]

# create unique names for all peaks

names(macs.summits) <- paste( seqnames(macs.summits), start(macs.summits), end(macs.summits), sep=":")

# take a look at the resulting GRanges centered around the MACS peak summit positions

macs.summits

# define the folder where you want to save the fasta files

out.folder <- wig.path

# top 100 peaks

seqs <- getSeq(Hsapiens, macs.summits[1:100], as.character=FALSE)

names(seqs) <- names(macs.summits[1:100])

# export to fasta file

writeXStringSet(seqs, filepath=paste(out.folder,"NFKB_MACS2_top100.fa",sep=""), format="fasta", append=FALSE)

# random sample of 100 peaks

index <- sample(1:length(macs.summits), size=100, replace=FALSE)

seqs <- getSeq(Hsapiens, macs.summits[index], as.character=FALSE)

names(seqs) <- names(macs.summits[index])

writeXStringSet(seqs, filepath=paste(out.folder,"NFKB_MACS2_random100.fa",sep=""), format="fasta", append=FALSE)

# bottom 100 scoring sequences

seqs <- getSeq(Hsapiens, macs.summits[(length(macs.summits)-99):length(macs.summits)], as.character=FALSE)

names(seqs) <- names(macs.summits[(length(macs.summits)-99):length(macs.summits)])

writeXStringSet(seqs, filepath=paste(out.folder,"NFKB_MACS2_bottom100.fa",sep=""), format="fasta", append=FALSE)
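The selection of top-ranking, bottom-ranking and random peaks can be sketched on simulated scores. This is a minimal illustration under stated assumptions: `toy.scores` is a hypothetical vector that stands in for the MACS scores, and the call to set.seed() is an addition that makes the random subset reproducible between runs.

```r
# hypothetical peak scores (in the practical these come from macspeaks)
set.seed(1)   # fix the seed so that the random subset is reproducible
toy.scores <- round(runif(1000, min=0, max=500))
n <- 100

# rank peaks from highest to lowest score
ranked <- order(toy.scores, decreasing=TRUE)

# n best-scoring peaks
top.idx <- ranked[1:n]

# n worst-scoring peaks
bottom.idx <- ranked[(length(ranked)-n+1):length(ranked)]

# n peaks drawn irrespective of their score
random.idx <- sample(ranked, size=n, replace=FALSE)

# sanity check: every top peak scores at least as high as every bottom peak
min(toy.scores[top.idx]) >= max(toy.scores[bottom.idx])
```

Note that the random set may overlap with the top or bottom set, since it is drawn from all peaks.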

The three different sets of 100 peak sequences are now ready for motif discovery. The code snippet below demonstrates how MEME can be called directly from R, with the command line output redirected into our R window. In this practical, however, we will use the online version of MEME and upload the three .fa files that we have just produced to the MEME webserver (http://meme.sdsc.edu/meme/cgi-bin/meme.cgi).

###### calling MEME from R ### (RUN MEME ONLINE IN THIS PRACTICAL)

# get list of fasta files

#fa.files <- list.files(out.folder, pattern="\\.fa$", full.names=TRUE)

#

# define MEME parameters

#parameters <- " -nmotifs 3 -minsites 100 -minw 12 -maxw 35 -revcomp -maxsize 500000 -dna -oc "

#

#for (fa.file in fa.files) {

#

# # define MEME output folder

# meme.out <- substr(fa.file,1,nchar(fa.file)-3)

#

# # construct full MEME command

# mycommand <- paste("meme ",fa.file,parameters,meme.out,sep="")

#

# # send command to system

# system(mycommand)

#}

Inspect the HTML output. What information is provided with each motif? How many sequences were used to derive the motifs? Compare your motifs to the canonical NF-κB motif (search for RelA in the JASPAR database at http://jaspar.genereg.net/). Reassuringly, the same motif is also recovered in the other two sets (bottom 100 and randomly chosen peaks), showing that the motif is present in a large fraction of our peaks, regardless of their enrichment score.

4.2 Motif scan

If the sequence preferences of a factor of interest are known, one can scan ChIP-seq peak sequences with the corresponding position-weight matrix (PWM) or consensus to check if the majority of them indeed contain the known motif. The BSgenome package allows easy consensus matching as well as PWM scanning. For the following motif scan, we will use the NF-κB binding site that is given in the JASPAR database.

In a first step, we import the table of base frequencies of the NF-κB binding site and plot a weblogo (for a more detailed description of this type of motif visualisation, see http://weblogo.berkeley.edu/).

# read the NF-κB PWM from a text file

NFKB.pwm <- read.table("/nfs/training/PeakCalling/NFKB_pwm_meme.txt",

sep="\t", header=FALSE, row.names=1)

NFKB.pwm <- t(NFKB.pwm)

rownames(NFKB.pwm) <- c("A","C","G","T")

# view the motif matrix

NFKB.pwm

# plot the motif logo

library(seqLogo)

seqLogo(NFKB.pwm, ic.scale=TRUE)
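With ic.scale=TRUE, the height of each logo column reflects its information content. As a rough sketch of what this quantity is, the base R code below computes it for a hypothetical frequency matrix (`toy.pfm` and `infoContent()` are illustrative names, not part of seqLogo). It assumes a uniform background and no small-sample correction, which may differ in detail from what a given logo tool does.

```r
# hypothetical 4 x 3 matrix of base frequencies (rows A, C, G, T; columns sum to 1)
toy.pfm <- matrix(c(1.00, 0.00, 0.00, 0.00,
                    0.25, 0.25, 0.25, 0.25,
                    0.50, 0.50, 0.00, 0.00),
                  nrow=4, dimnames=list(c("A","C","G","T"), NULL))

# information content per position: 2 bits minus the Shannon entropy
# (bases with zero frequency contribute nothing to the entropy)
infoContent <- function(pfm) {
  apply(pfm, 2, function(p) {
    p <- p[p > 0]
    2 + sum(p * log2(p))
  })
}

ic <- infoContent(toy.pfm)
ic
```

A fully conserved position carries 2 bits, a position with two equally likely bases carries 1 bit, and a position with uniform base usage carries 0 bits.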

Next, we retrieve the sequence under the peaks and prepare two helper functions that will be called on each peak set. These functions will return a matrix with one row per peak, containing the start position of the selected motif hit (relative to the peak start), the sequence of this motif instance and the total number of hits in the peak. mymatchPWM() uses the function matchPWM() (from Biostrings, which is loaded together with BSgenome), which initially returns all PWM hits within a specific string at a specified cutoff, and selects the hit that is closest to the peak summit. Below, we will call these functions on our three peak sets to check how many of the binding sites contain an NF-κB motif.
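Before turning to the helper functions, it may help to see what a PWM scan with a percentage cutoff amounts to. The base R sketch below is not the Biostrings implementation: `toy.pwm` and `scanPWM()` are hypothetical, only the forward strand is scanned, the sequence is assumed to contain only A, C, G and T, and the cutoff is taken as a percentage of the highest achievable score (our reading of how a string such as "80%" is interpreted by matchPWM()).

```r
# hypothetical 4 x 3 scoring matrix (rows A, C, G, T), favouring the word "GGA"
toy.pwm <- matrix(c(0.1, 0.1, 0.7, 0.1,
                    0.1, 0.1, 0.7, 0.1,
                    0.7, 0.1, 0.1, 0.1),
                  nrow=4, dimnames=list(c("A","C","G","T"), NULL))

# score every window of the sequence and report those reaching the cutoff
scanPWM <- function(pwm, myseq, percent) {
  bases  <- strsplit(myseq, "")[[1]]
  w      <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  # sum the matrix entries selected by the bases in each window
  scores <- sapply(starts, function(s) {
    rows <- match(bases[s:(s+w-1)], rownames(pwm))
    sum(pwm[cbind(rows, 1:w)])
  })
  # cutoff as a percentage of the best achievable score
  cutoff <- percent/100 * sum(apply(pwm, 2, max))
  data.frame(start=starts, score=scores)[scores >= cutoff, ]
}

hits <- scanPWM(toy.pwm, "TTGGATTGGCA", 80)
hits
```

In this toy example only the window starting at position 3 ("GGA") reaches 80% of the maximal score; lowering the percentage would also admit the partial matches "GGC" and "GCA".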

# obtain the sequences under the peaks

elementMetadata(macspeaks)$seqs <- getSeq(Hsapiens, macspeaks, as.character=TRUE)

elementMetadata(chippeaks)$seqs <- getSeq(Hsapiens, chippeaks, as.character=TRUE)

elementMetadata(useqpeaks)$seqs <- getSeq(Hsapiens, useqpeaks, as.character=TRUE)

# function to scan a string for PWM matches at the specified threshold by calling matchPWM()

# keeps only the closest of multiple hits

mymatchPWM <- function (pwm, myseq, threshold, summit) {

# get all matches of PWM

mymatch <- matchPWM(pwm, myseq, min.score=threshold)

# collect starts/seqs into matrix (if any)

if (length(mymatch)==0) {

found <- cbind(NA,NA,0)

} else {

found <- cbind(start(mymatch), as.character(mymatch), length(start(mymatch)))

}

colnames(found) <- c("start","seq","nr")

# keep only the match that is closest to the summit

found <- found[order(abs(as.integer(found[,1])-summit),decreasing=FALSE)[1],]

# return matrix

return(found)

}

# function to call mymatchPWM() on each peak in a set

# returns a matrix with motif information per peak

ScanPeaks <- function(peak.GR, pwm, threshold) {

# get all peak sequences

myseqs <- elementMetadata(peak.GR)$seqs

# get summit positions (relative to peak coordinates)

summits <- elementMetadata(peak.GR)$maxpos - start(peak.GR)

# apply mymatchPWM() to all peaks in the set

motifmatrix <- sapply(1:length(myseqs), function(x) mymatchPWM(pwm, myseqs[x], threshold, summits[x]))

motifmatrix <- t(motifmatrix)

# set peak IDs as rownames

rownames(motifmatrix) <- names(peak.GR)

# return motif matrix

return(motifmatrix)

}

# scan all peak sequences for the motif (using a cutoff of 80%)

### This step is likely to take ~ 10 min, so at this point it's convenient to have a coffee :P ###

macs.motifs <- ScanPeaks(macspeaks, as.matrix(NFKB.pwm), "80%")

useq.motifs <- ScanPeaks(useqpeaks, as.matrix(NFKB.pwm), "80%")

chipseq.motifs <- ScanPeaks(chippeaks, as.matrix(NFKB.pwm), "80%")

# add motif information (number of motifs per peaks) to the peak GRanges object

elementMetadata(macspeaks)$motif.no <- as.integer(macs.motifs[,3])

elementMetadata(useqpeaks)$motif.no <- as.integer(useq.motifs[,3])

elementMetadata(chippeaks)$motif.no <- as.integer(chipseq.motifs[,3])

# plot the proportion of peaks with 0, 1 or more motifs in the three peak sets

par(mfrow=c(1,3))

pie( c(sum(as.numeric(elementMetadata(macspeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(macspeaks)$motif.no==1)),

sum(as.numeric(elementMetadata(macspeaks)$motif.no>1))),

labels=c("0","1","2+"), main=list("MACS v2",cex=1.5),

col=c("white","lightgrey","darkgrey")

)

pie( c(sum(as.numeric(elementMetadata(useqpeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(useqpeaks)$motif.no==1)),

sum(as.numeric(elementMetadata(useqpeaks)$motif.no>1))),

labels=c("0","1","2+"), main=list("USeq",cex=1.5),

col=c("white","lightgrey","darkgrey")

)

pie( c(sum(as.numeric(elementMetadata(chippeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(chippeaks)$motif.no==1)),

sum(as.numeric(elementMetadata(chippeaks)$motif.no>1))),

labels=c("0","1","2+"), main=list("chipseq",cex=1.5),

col=c("white","lightgrey","darkgrey")

)

For sequence-specific transcription factors, the enrichment of binding motifs under the peaks is usually a good measure to compare the performance of different peak callers. However, as we have seen previously, different peak callers have a tendency to detect peaks of different widths, which influences the chances of finding a motif hit. Therefore, resolution has a direct effect on this analysis. An alternative way of comparing motif content would be to look at regions of the same size, e.g. 150 bp up- and downstream of the peak summit position, and to compare the motif content of peak regions with background regions.

(OPTIONAL) Having obtained the location of putative motifs, one could now try to refine the PWM itself based on the sequences that have matched the initial scan. One could also try extending the motif up- and downstream by checking if there are any additional bases at the flanks showing a non-random distribution, as follows.

# set the extension distance

mydist <- 10

# get absolute start positions of the motif hits
# (hit positions are 1-based within the peak sequence, hence the -1)

abs.starts <- as.integer(macs.motifs[,1]) + start(macspeaks) - 1

abs.starts <- abs.starts[!is.na(abs.starts)]

# make GRanges

motifs.GR <- GRanges(

seqnames=seqnames(macspeaks)[!is.na(macs.motifs[,1])],

ranges=IRanges(start=abs.starts - mydist,

end=abs.starts + mydist + 9),

strand="+"

)

# get sequences

motif.seqs <- getSeq(Hsapiens, motifs.GR, as.character=TRUE)

# get letter counts per position

bs.matrix <- consensusMatrix(motif.seqs)

# transform counts to frequencies

bs.matrix <- apply(bs.matrix, 2, function(x){ x/sum(x) })

# plot the newly obtained sequence logo

seqLogo(bs.matrix, ic.scale=TRUE)
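The step from matched sequences to a frequency matrix can also be written in base R, which makes explicit what consensusMatrix() followed by the column-wise normalisation produces. The sequences in `toy.sites` are made up for illustration and are assumed to be of equal length and to contain only A, C, G and T.

```r
# hypothetical aligned motif instances of equal length
toy.sites <- c("GGGAATTCCC", "GGGACTTTCC", "GGGAATTTCC", "GGGGATTCCC")

# split into a character matrix: one row per site, one column per position
site.matrix <- do.call(rbind, strsplit(toy.sites, ""))

# count each base per position (rows A, C, G, T)
site.counts <- apply(site.matrix, 2, function(column)
  table(factor(column, levels=c("A","C","G","T"))))

# convert counts to frequencies; every column now sums to 1
toy.freq <- apply(site.counts, 2, function(x) x/sum(x))
toy.freq
```

Columns that remain close to 0.25 for all four bases carry no sequence preference, whereas flanking columns that deviate from this would hint at an extended motif.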

4.3 Motif localisation

A typical part of a ChIP-seq analysis is looking at the motif distribution with respect to the estimated peak summit. The spread of this distribution can be a good measure of noise in the data. NF-κB has a high information content motif as well as sharp ChIP-seq peaks, so one expects to see motifs peaking sharply around the center of the plot. Below, we will create profiles for the three different peak sets, in order to get an idea of how the use of different peak-calling methods can influence our result.
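The idea behind such a profile can be illustrated on a handful of numbers. In the following sketch all values are hypothetical: each entry of `toy.starts` is the 1-based start of the selected motif hit within a window that extends `window.size` positions to either side of the summit, so that the summit itself sits at position window.size + 1 of the window.

```r
# hypothetical motif start positions within summit-centred windows
window.size <- 5      # window spans -5..+5 around the summit (11 positions)
motif.width <- 3
toy.starts  <- c(4, 5, 5, 6, 9)

# expand each hit to all positions it covers
covered <- unlist(lapply(toy.starts, function(x) seq(x, x + motif.width - 1)))

# shift to coordinates relative to the summit (window position window.size + 1)
rel.pos <- covered - (window.size + 1)

# count how often each position is covered; positions without a hit get 0
toy.profile <- data.frame(
  position  = -window.size:window.size,
  frequency = as.integer(table(factor(rel.pos, levels=-window.size:window.size)))
)
toy.profile
```

Here the coverage is highest at position 0, i.e. directly at the summit, which is the pattern one would hope to see for a well-resolved peak set.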

# set window size

mydist <- 200

# function to determine the motif profile around peak summits

getProfile <- function(peaks.GR, pwm, window.size){

# get regions around summit
# (the summit is stored as maxpos so that ScanPeaks() can select the hit closest to it)

summits.GR <- GRanges(
  seqnames=seqnames(peaks.GR),
  ranges=IRanges( start=elementMetadata(peaks.GR)$maxpos - window.size,
                  end=elementMetadata(peaks.GR)$maxpos + window.size),
  strand="+",
  maxpos=elementMetadata(peaks.GR)$maxpos
)

# create unique names for all peaks

names(summits.GR) <- paste( seqnames(summits.GR), start(summits.GR), end(summits.GR), sep=":")

# get sequence

elementMetadata(summits.GR)$seqs <- getSeq(Hsapiens, summits.GR, as.character=TRUE)

# scan sequences with the PWM

summit.motifs <- ScanPeaks(summits.GR, pwm, "80%")

summit.motifs <- summit.motifs[!is.na(summit.motifs[,1]),]

# get all covered positions

motif.pos <- sapply(as.integer(summit.motifs[,1]), function(x) seq(x, x+ncol(pwm)-1))

motif.pos <- table(unlist(motif.pos))

# convert to data.frame

motif.pos <- data.frame(motif.pos)

names(motif.pos) <- c("position","frequency")

# shift positions relative to summit (the summit is at position window.size+1 of each window)

motif.pos$position <- as.integer(as.character(motif.pos$position)) - window.size - 1

# ensure all positions are present in the output

profile <- data.frame(

position=-window.size:window.size,

frequency=motif.pos$frequency[match(-window.size:window.size, motif.pos$position)]

)

profile[is.na(profile)] <- 0

return(profile)

}

# get the profiles for the three different peak sets

macs.profile <- getProfile(macspeaks, as.matrix(NFKB.pwm), mydist)

useq.profile <- getProfile(useqpeaks, as.matrix(NFKB.pwm), mydist)

chipseq.profile <- getProfile(chippeaks, as.matrix(NFKB.pwm), mydist)

# generate plots (xyplot() is provided by the lattice package)

library(lattice)

pl1 <- xyplot(frequency~position, data=macs.profile,

type="l", main="motifs around MACS v2 peak summits",

aspect=0.8

)

pl2 <- xyplot(frequency~position, data=useq.profile,

type="l", main="motifs around USeq peak summits",

aspect=0.8

)

pl3 <- xyplot(frequency~position, data=chipseq.profile,

type="l", main="motifs around chipseq peak summits",

aspect=0.8

)

# print plots

print(pl1, split=c(1,1,1,3), more=TRUE)

print(pl2, split=c(1,2,1,3), more=TRUE)

print(pl3, split=c(1,3,1,3))

5 Differential histone modification using DESeq

In this last part of the tutorial, we give a simple example of how one may address the question of which histone modifications are significantly different in condition A versus condition B. For convenience, we will use H3K36me3 data in two different cell types, HepG2 and K562. In such a case, one may actually prefer doing a simple peak overlap analysis rather than following the procedure described below. However, we will use it here to illustrate the functionality of the DESeq package with ChIP-seq count data. DESeq has been primarily developed by Simon Anders for the determination of differential expression using a test based on the negative binomial distribution, but it can in principle be used for any count data. For details on the underlying methodology, please read the DESeq vignette and reference manual.

First, we will import the library and prepare the intervals in which we would like to search for differential histone modifications. Determining the regions of interest can be done in a number of different ways: one could take a peak-based approach, as we do here, one could partition the genome into equally sized regions, or one could use existing annotation, such as promoter regions.
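As an illustration of the second option, partitioning a chromosome into equally sized regions needs only a few lines of base R. The function `makeBins()` and the chromosome length used here are hypothetical; in practice the lengths would be taken from a table of chromosome sizes.

```r
# tile a chromosome into consecutive bins of equal width (the last bin may be shorter)
makeBins <- function(chr, chr.length, bin.width) {
  starts <- seq(1, chr.length, by=bin.width)
  ends   <- pmin(starts + bin.width - 1, chr.length)
  data.frame(chr=chr, start=starts, end=ends,
             name=paste(chr, starts, ends, sep=":"),
             stringsAsFactors=FALSE)
}

# hypothetical chromosome of 10,500 nt tiled into 1-kb bins
toy.bins <- makeBins("chr10", 10500, 1000)

# the last bin is truncated at the chromosome end
tail(toy.bins, 2)
```

Such a table uses the same chr:start:end naming scheme as the peak-based regions and could be converted to a GRanges object in the same way.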

# import library

library(DESeq)

# define colour range for plotting

Lab.palette <- colorRampPalette(c("white", "yellow", "red"), space = "Lab")

### Read the wanted dataset###

chrlen <- read.table("/nfs/training/PeakCalling/Chromosome_lengths_human.txt",sep="\t")

mychr <- c(10:12)

chrmychr <- paste("chr", mychr, sep="")

chrlen <- chrlen[which(chrlen[,1]%in%mychr),]

###define function to read in data from only our chromosomes to GRanges objects

macs2GRange <- function(peaks, goodchr){

myrange <- GRanges(seqnames=peaks[,1],

ranges=IRanges(peaks[,2],peaks[,3], names=paste(peaks[,1], (peaks[,5]+peaks[,2]),sep=":")),

count=peaks[,6],

score=peaks[,7],

FE=peaks[,8],

fdr=peaks[,9],

summit=(peaks[,5]+peaks[,2])

)

myrange <- myrange[which(seqnames(myrange)%in%goodchr)]

return(myrange)

}

#### read the peak files

# H3K36me3 in K562

histone1.GR <- macs2GRange(read.table("/nfs/training/PeakCalling/H3K36me3-K562_peaks.xls",
  sep="\t",header=TRUE), chrmychr)

# H3K36me3 in HepG2

histone2.GR <- macs2GRange(read.table("/nfs/training/PeakCalling/H3K36me3-HepG2_peaks.xls",
  sep="\t",header=TRUE), chrmychr)

# select union of two regions as enquiry space

interm <- union(histone1.GR, histone2.GR)

names(interm) <- paste(seqnames(interm), start(interm), end(interm), sep=":")

what <- "H3K36me3"

Once the intervals are defined, one needs to obtain the corresponding read counts. There are several ways of doing this and here we provide an example starting from bed-formatted aligned read files. This procedure takes a while and can be skipped in this tutorial, as the counts have already been precalculated and can be imported from an RData object.

### Returns a vector with read coverage per each interval starting from a GRanges object (chr, start, end, names) ###

### can be used e.g. for calculating reads/peaks or reads/transcripts ###

### chrlen is a matrix consisting of chromosome names and lengths ###

readsPerInterval <- function(path, outpath, readdir, myint, chrlen, readfile, organism, extend) {

## vector that stores the read count per interval

peakreads <- NULL

## Per chromosome basis (faster reading)

for (index in 1:dim(chrlen)[1]) {

# Grep only the wanted chromosome and transform it to bed

systemcall <- paste("grep 'chr",chrlen[index,1],"' ",path,readdir,readfile," > ",outpath,

readdir,"temp.bed",sep="")

system(systemcall)

# Test that we have reads in input

if (strsplit(system(paste("wc -l ",outpath,readdir,"temp.bed",sep=""),intern=TRUE)," ")[[1]][1]!="0") {

reads <- read.csv(paste(outpath,readdir,"temp.bed",sep=""), sep="\t", header=FALSE,

colClasses=c("character","numeric","numeric","NULL","NULL","character")

)

# chromosome name with "chr" prefix, as used in the regions of interest

chrname <- paste("chr", chrlen[index,1], sep="")

reads.GR <- GRanges(seqnames=chrname, ranges=IRanges(reads[,2],reads[,3]), strand=reads[,4])

### extend step: reads are resized to 'extend' nt; for no extension, set extend to the read length (36 nt here)

reads.GR <- resize(reads.GR, width = extend)

smallint <- myint[which(seqnames(myint)%in%chrname)]

counts <- countOverlaps(smallint, reads.GR)

names(counts) <- names(smallint)

peakreads <- c(peakreads, counts)

}

}

return(peakreads)

}

### Obtain the counts in a set of intervals (represented as GRanges) [can be skipped]

#dircontent <- list.files(paste(inpath,"data/filtered/",sep=""), pattern="*filter_rmdup.bed")

#myfiles <- dircontent[c(3:4,11:12)]

#mylist <- lapply(myfiles, function(x) readsPerInterval(inpath,outpath,"data/filtered/",interm,chrlen,x,"human",150))

#names(mylist) <- myfiles

#save(mylist, file=paste(outpath,dir,"Reads_per_peaks.filtered.rmdup.RData",sep=""))
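The core of the counting procedure, strand-aware read extension followed by overlap counting, can be sketched in base R on a few made-up reads. All objects below (`toy.reads`, `toy.regions`, `toy.counts`) are hypothetical, and a single chromosome is assumed, so that chromosome names need not be compared.

```r
# hypothetical 36-nt reads on one chromosome, with their strand
toy.reads <- data.frame(start=c(100, 180, 400, 950),
                        end=c(135, 215, 435, 985),
                        strand=c("+", "-", "+", "-"),
                        stringsAsFactors=FALSE)
extend <- 150

# extend each read to the fragment length in its 3' direction:
# "+" reads keep their start, "-" reads keep their end
ext.start <- ifelse(toy.reads$strand == "+", toy.reads$start, toy.reads$end - extend + 1)
ext.end   <- ifelse(toy.reads$strand == "+", toy.reads$start + extend - 1, toy.reads$end)

# hypothetical regions of interest
toy.regions <- data.frame(start=c(1, 300, 800), end=c(200, 500, 900))

# count extended reads that overlap each region by at least one position
toy.counts <- sapply(seq_len(nrow(toy.regions)), function(i)
  sum(ext.start <= toy.regions$end[i] & ext.end >= toy.regions$start[i]))
toy.counts
```

The last read illustrates why extension matters: its 36-nt alignment (950 to 985) lies outside the third region, but the extended fragment (836 to 985) overlaps it and is counted.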

Once reads/intervals have been obtained, one can proceed with the differential modification analysis. All counts/interval sets for each condition are transformed into a CountDataSet object, which is the central data structure in DESeq. The effective library size for each sample is then either estimated from the data or set by the user. For each condition, a function that predicts the variance from the mean is estimated (DESeq assumes that the mean is a good predictor of the variance). Next, one needs to verify the variance-mean dependence and do some further analyses to confirm the variance estimations, all described in detail in the DESeq vignette. We will not expand on this here, but we recommend that users read the documentation and perform the diagnostic tests before testing for differential modification in a real-life scenario.
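To make the notion of an effective library size more concrete, the following base R sketch computes size factors by the median-of-ratios approach described in the DESeq documentation. It is a simplified re-implementation for illustration only: `toy.table` and `sizeFactors()` are hypothetical names, and in the practical the function estimateSizeFactors() should be used.

```r
# hypothetical count table: 4 regions (rows) x 2 samples (columns);
# sample B was sequenced twice as deeply as sample A
toy.table <- cbind(A=c(10, 20, 40, 80), B=c(20, 40, 80, 160))

# median-of-ratios size factors: compare each sample to the per-region geometric mean
sizeFactors <- function(counts) {
  log.geo.means <- rowMeans(log(counts))
  apply(counts, 2, function(k) {
    ok <- is.finite(log.geo.means) & k > 0   # skip regions with zero counts
    exp(median(log(k[ok]) - log.geo.means[ok]))
  })
}

sf <- sizeFactors(toy.table)
sf

# dividing by the size factors puts both samples on a common scale
norm.table <- sweep(toy.table, 2, sf, "/")
norm.table
```

Because the median is used, a small number of strongly changing regions has little influence on the estimated size factors.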

# load the counts per region of interest ###

load("/nfs/training/PeakCalling/Reads_per_peaks.filtered.rmdup.RData")

# set output directory for DESeq analysis

deseqdir <- out.folder

# verify the content of the list

names(mylist)

# construct a counts table

countsperroi <- cbind(mylist[[1]], mylist[[2]], mylist[[3]], mylist[[4]])

rownames(countsperroi) <- names(mylist[[1]])

# define conditions

conds <- c("HepG2","HepG2","K562","K562")

# construct a CountDataSet object

cds <- newCountDataSet( countsperroi, conds)

cds

head(counts(cds))

# estimate the library size (for details on the following functions, please refer to the DESeq documentation)

cds <- estimateSizeFactors(cds)

# estimate the variance

cds <- estimateDispersions(cds)

# test for differences between the base means of the two conditions

res <- nbinomTest(cds, "HepG2", "K562")

head(res)

We have obtained a table with mean modification levels per condition (as well as a joint estimate), the fold change (raw and log2), the p-value for the statistical significance of this change, as well as ratios of the per-region estimates of the base variance to the fitted value. To get an overview of our data as well as the regions that we have called significant, we can now produce scatterplots of counts and estimated means within and between conditions.
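Regions are typically called significant by combining a cutoff on the multiple-testing adjusted p-value with a cutoff on the fold change. The sketch below shows this logic on a made-up result table (`toy.res` is hypothetical); it applies the Benjamini-Hochberg adjustment with p.adjust(), which we understand to be the method behind the padj column in DESeq.

```r
# hypothetical test results for six regions
toy.res <- data.frame(
  id         = paste("region", 1:6, sep=""),
  pval       = c(0.0001, 0.0002, 0.02, 0.0004, 0.5, 0.9),
  foldChange = c(4.0, 0.25, 3.0, 1.1, 2.0, 1.0),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment for multiple testing
toy.res$padj <- p.adjust(toy.res$pval, method="BH")

# keep regions that are significant AND change at least 1.5-fold in either direction
keep <- which(toy.res$padj < 0.01 &
              (toy.res$foldChange > 1.5 | toy.res$foldChange < 1/1.5))
toy.res[keep, ]
```

In this example region4 is significant but changes by only 10%, and region3 changes 3-fold but does not pass the adjusted p-value cutoff, so only region1 and region2 are retained.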

# function to produce a scatterplot using the smoothScatter function

nice_scatter <- function(x, y, name.x, name.y, what, legend.name, legend.value) {

smoothScatter(x, y, colramp = Lab.palette, main=paste(name.x," vs. ",name.y,sep=""),

xlab=paste(what," ",name.x,sep=""), ylab=paste(what," ",name.y,sep="")

)

abline(lm(y ~ x), col="red", lty="dotted")

legend("topleft", paste(legend.name," = ",round(legend.value,digits=2),sep=""))

}

# function to produce a simple scatterplot using the plot function

simple_scatter <- function(x, y, name.x, name.y, what, legend.name, legend.value) {

plot(x, y, main=paste(name.x," vs. ",name.y,sep=""),

xlab=paste(what," ",name.x,sep=""), ylab=paste(what," ",name.y,sep="")

)

abline(lm(y ~ x), col="red", lty="dotted")

legend("topleft", paste(legend.name," = ",round(legend.value,digits=2),sep=""))

}

# select according to a significance threshold

condition <- which(res$padj<0.01 & (res$foldChange>1.5 | res$foldChange<0.67))

# plot an overview of the regions called significant

par(mfrow=c(2,2))

nice_scatter(log2(countsperroi[,1]+1), log2(countsperroi[,2]+1), "HepG2 1", "HepG2 2", "log2(counts)",

"Spearman cor", cor(countsperroi[,1],countsperroi[,2],method="spearman"))

nice_scatter(log2(countsperroi[,3]+1), log2(countsperroi[,4]+1), "K562 1", "K562 2", "log2(counts)",

"Spearman cor", cor(countsperroi[,3],countsperroi[,4],method="spearman"))

nice_scatter(log2(res[,3]+1), log2(res[,4]+1), "HepG2", "K562", "log2(counts)",

"Spearman cor", cor(res[,3],res[,4],method="spearman"))

simple_scatter(log2(res[,3]+1), log2(res[,4]+1), "HepG2", "K562",

paste("log2(counts)",length(res[condition,3]),"with p<0.01"),

"Spearman cor", cor(res[,3],res[,4],method="spearman"))

points(log2(res[condition,3]+1), log2(res[condition,4]+1), col="red")

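# select only regions called significant according to the condition set,
# sorted by adjusted p-value so that the row order matches the gff table built below

changing <- res[condition,]

changing <- changing[order(changing$padj),]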

dim(changing)

# function to split a string vector at a certain pattern and return a trimmed string vector

mysplit <- function(vector, splitsign, numbers, position) {

return(unlist(strsplit(as.character(vector),splitsign))[seq(position, length(vector)*numbers, by=numbers)])

}

# writing gff output files for HepG2- and K562-specific sites
# (the numeric column indices below depend on the layout of the result table; check colnames(changing) first)

mygff <- cbind(as.character(mysplit(changing[order(changing$padj),1],":",3,1)),what,"ROI",

(mysplit(changing[order(changing$padj),1],":",3,2)),(mysplit(changing[order(changing$padj),1],":",3,3)),

changing[order(changing$padj),13], -10*log10(changing[order(changing$padj),8]),

"+", as.character(changing[order(changing$padj),14]),

paste(what,as.character(changing[order(changing$padj),1]),sep=":"))

write.table(mygff[which(changing$foldChange>1),],

paste(deseqdir,"/DEseq.HepG2-unique.",what,".p01.FC15.gff",sep=""),

sep="\t", row.names=FALSE, col.names=FALSE, quote=FALSE)

write.table(mygff[which(changing$foldChange<1),],

paste(deseqdir,"/DEseq.K562-unique.",what,".p01.FC15.gff",sep=""),

sep="\t", row.names=FALSE, col.names=FALSE, quote=FALSE)

write.table(changing[order(changing$padj),],paste(deseqdir,"/DEseq.HepG2_K562.",what,".p01.FC15.txt",sep=""),

sep="\t", row.names=TRUE, col.names=TRUE, quote=FALSE)

write.table(res[order(res$padj),],paste(deseqdir,"/DEseq.HepG2_K562.",what,".all.txt",sep=""),

sep="\t", row.names=TRUE, col.names=TRUE, quote=FALSE)

We can now visualise the tracks we have just produced in a browser of choice. How does the DESeq methodology compare to a simple peak-overlap comparison?

24

Page 25: EMBO Practical Course on Analysis of High-Throughput ... · orF this practical, we will use the NF- B ChIP-seq data from Kasowski et al. (2010). In addition to sequence reads from

References:

• "Some Basic Analysis of ChIP-Seq Data", Bioconductor. http://www.bioconductor.org/packages/release/bioc/vignettes/chipseq/inst/doc/Workflow.pdf

• "Efficient genome searching with Biostrings and the BSgenome data packages", Bioconductor. http://www.bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf

• "Analysing RNA-Seq data with the DESeq package", Bioconductor. http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

• Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. (2010) Variation in transcription factor binding among humans. Science 328:232-5.

• Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(Web Server issue):W202-8.