Practical - ChIP-seq data analysis (University of Helsinki)

Borbala Gerle

European Bioinformatics Institute, Cambridge, UK

Kathi Zarnack

University College London, London, UK

10.-13. September 2012

This practical illustrates common ChIP-seq analysis steps based on a number of Bioconductor packages (see References). We will start from aligned read data of ChIP-seq experiments with a sequence-specific transcription factor (NF-κB) as well as with a histone modification mark (H3K36me3).

We will first use the package chipseq to perform some initial filtering steps, determine the fragment size and obtain some diagnostic plots to assess the data quality. We will then extend the reads and use chipseq to call peaks, which we will compare to the results of the commonly used peak-finding algorithms MACS and SISSR. We will further look at the localisation of the peaks with respect to genomic annotation and perform motif analyses to identify the binding motif of this sequence-specific transcription factor. Finally, we use the data of the histone modification mark H3K36me3 in two different tissues to illustrate an approach to assess differential binding from ChIP-seq data.

1 Peak calling

1.1 The aligned read data

For the first part of the practical, we will make use of the NF-κB ChIP-seq data from Kasowski et al. (2010). In addition to sequence reads from the ChIP-seq library, the study also provides sequencing data of the input DNA, which we will use for comparison.

In the first step of a ChIP-seq analysis, the obtained sequence reads are mapped back to the reference genome of the organism that was used in the experiments. In our case, the reads were aligned to the human reference genome (hg19) using the program Bowtie, allowing one mismatch (http://bowtie-bio.sourceforge.net/index.shtml). The mapping results were then imported into R using the readAligned function from the ShortRead package and filtered to keep only reads mapping to the canonical chromosomes. Storing the aligned read data as a GRanges object (a set of stranded genomic intervals) saves a large amount of memory as sequences are discarded.

Since these initial steps can be very time-consuming, we performed them for you prior to the practical. For convenience, we will only consider the subset of reads from both the IP ([["NFKB_IP"]]) and the input ([["input"]]) samples that map to chromosomes 10-12. These were stored for you in the NFKB object, which you can load from our homepage. The commands that were used to map the sequence reads in the original fastq files and to generate the NFKB object from the resulting aln files are shown below (you do not need to run these during the practical!).

############ performed outside R ############

# conversion of sra into fastq

for i in *.sra

do

~/programmes/sratoolkit.2.1.7-centos_linux64/fastq-dump $i

done

# genome alignment using Bowtie

for i in *.fastq

do

bsub bowtie -n 1 -k 1 --best hg19 $i ${i%%.fastq}.aln

done

############ performed in R ############

library(ShortRead)

# define the folder where the aln files are stored

alnPath = "/nfs/research2/luscombe/kathi/teaching/2012_Helsinki/data/"

# get all files in this folder

alnFiles = list.files(alnPath, pattern="\\.aln$", full.names=TRUE)

# import all files and convert into GRanges

NFKB = seqapply(alnFiles, function(file) {

reads=readAligned(file,type="Bowtie")

reads=as(reads, "GRanges")

return( reads )

}

)

names(NFKB) <- c("NFKB_IP", "input")

NFKB <- NFKB[seqnames(NFKB) %in% c("chr10", "chr11", "chr12")]

save(NFKB, file=paste(alnPath,"NFKB_data.RData",sep=""))

1.2 Extending reads

Start the practical by loading the NFKB object from the course folder on our homepage as follows:

library(GenomicRanges)

load(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/NFKB_data.RData"))

NFKB is an object of class GRangesList, where each component represents the alignments from one sample (IP or input) as a GRanges object. Inspect the object (type NFKB and NFKB$NFKB_IP).

Our data consists of aligned 28-nt single-end reads which correspond to the 5' ends of the co-purified DNA fragments that were sequenced. Extending the alignment of the short read to the estimated fragment length will ensure that most intervals cover the actual binding position of NF-κB. There are several methods to estimate the fragment length, which are implemented by the estimate.mean.fraglen() function.

library(chipseq)

fraglen <- estimate.mean.fraglen(NFKB$NFKB_IP)

fraglen

As expected from the experimental conditions used, the fragment lengths vary around 200 nt. To recover the complete co-purified fragments, we therefore extend all reads to a length of 200 nt. Since NFKB is a GRangesList, we use the function endoapply() to loop over the GRanges objects in the list (endoapply() returns an object of the same class, i.e. a GRangesList in our case).

# extend all reads to the estimated fragment length (approximately 200 nt)

NFKB.ext <- endoapply(NFKB, function(x) resize(x, width = 200))

NFKB.ext

The default behaviour of the function resize from the GenomicRanges package is to extend the intervals in a "strand-aware" manner, meaning that all reads are extended in 3' direction on the respective strand. Type ?resize to see a range of other utility functions for modifying GRanges objects.
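To make the idea of strand-aware extension concrete, here is a minimal base-R sketch. The helper extend_reads() and the toy coordinates are hypothetical and only mimic what resize() does by default; it does not clip intervals at chromosome ends.

```r
# Hypothetical helper: extend reads to a fixed width in 3' direction.
# Plus-strand reads keep their start, minus-strand reads keep their end.
extend_reads <- function(start, end, strand, width = 200) {
  plus <- strand == "+"
  data.frame(
    start  = ifelse(plus, start, end - width + 1),
    end    = ifelse(plus, start + width - 1, end),
    strand = strand
  )
}

# two toy 28-nt reads, one on each strand
reads <- data.frame(start = c(1000, 5000), end = c(1027, 5027), strand = c("+", "-"))
extended <- extend_reads(reads$start, reads$end, reads$strand)
extended
```

The plus-strand read grows to the right of its 5' end, while the minus-strand read grows to the left, so both intervals cover the presumed fragment.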

1.3 Coverage, islands and depth

In ChIP-seq, one is usually interested in the number of precipitated DNA fragments in the sample that were mapped to a given genomic locus. This is best represented by a "coverage vector" (or "pile-up vector"), which indicates how many reads cover each base in the genome. The function coverage() in the ShortRead package calculates such a vector from alignment information. In order to allocate a vector of the right size, coverage() needs to know the length of the chromosomes. This information could for example be retrieved from the BSgenome package which contains the full DNA sequences of the human genome (using seqlengths(Hsapiens)). However, since installation of this package is very time-consuming, you can load a table with the chromosome sizes from the course folder on our homepage:

#library(BSgenome.Hsapiens.UCSC.hg19)

#seqlengths(NFKB) <- seqlengths(Hsapiens)

chrlen=read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/Chromosome_lengths_human.txt"),sep="\t")

# adjust chromosome names to fit the GRanges object

chrlen[,1]=paste("chr",chrlen[,1],sep="")

chrlen[chrlen[,1]=="chrMT",1]="chrM"

seqlengths(NFKB)=chrlen[match(names(seqlengths(NFKB)),chrlen[,1]),2]

seqlengths(NFKB.ext) <- chrlen[match(names(seqlengths(NFKB.ext)),chrlen[,1]),2]

Next, we calculate the read coverage. We perform these analyses only on the IP sample to introduce the basic steps, and then explain the processing of multiple samples in the next chapter.

# calculate the coverage vector

cov.NFKB <- coverage(NFKB.ext$NFKB_IP)

cov.NFKB
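As an illustration of what a coverage vector contains, the following base-R sketch computes one for three toy intervals on a chromosome of length 10. The function coverage_vector() is a hypothetical helper, not the Bioconductor implementation, and it assumes 1 <= start <= end <= chr_length.

```r
# Hypothetical helper: count how many intervals cover each base.
coverage_vector <- function(start, end, chr_length) {
  # add +1 where an interval begins and -1 just after it ends,
  # then accumulate to get the depth at every position
  delta <- numeric(chr_length + 1)
  for (i in seq_along(start)) {
    delta[start[i]] <- delta[start[i]] + 1
    delta[end[i] + 1] <- delta[end[i] + 1] - 1
  }
  cumsum(delta)[seq_len(chr_length)]
}

cov_toy <- coverage_vector(start = c(2, 4, 4), end = c(5, 6, 8), chr_length = 10)
cov_toy
```

This also shows why the chromosome length is needed: it fixes the size of the vector, including trailing positions that no read covers.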

For efficiency, the coverage is stored in the run-length encoded Rle format. Rather than storing the read count on each base, Rle describes runs of identical values. Type ?Rle to learn more.
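The principle of run-length encoding can be tried out with the base-R function rle(), which is related to, but not the same class as, the Bioconductor Rle. The toy vector below is made up for illustration.

```r
# a toy coverage vector with long runs of identical values
cov_toy <- c(0, 0, 0, 1, 1, 2, 2, 2, 2, 0)

# run-length encoding: one (length, value) pair per run
runs <- rle(cov_toy)
runs$lengths
runs$values

# decoding restores the original vector
restored <- inverse.rle(runs)
```

Ten positions are described by four runs here; on a real chromosome, where most of the coverage is zero, the saving is far larger.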

In order to detect peaks of NF-κB binding, we next search for regions consisting of contiguous segments of non-zero coverage, also known as islands. Such islands can be identified using the function slice(), specifying a lower threshold of coverage that delimits the peaks from background. We start with all islands (lower=1).

# determine islands

islands <- slice(cov.NFKB, lower = 1)

islands

# to see an element within a list, use [[

islands[["chr10"]]

# get the number of peaks (summed over the 3 chromosomes)

sum(sapply(islands,length))
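What slice() does conceptually can be sketched in base R on a toy coverage vector. The helper find_islands() is hypothetical and only returns start and end positions of runs at or above the threshold.

```r
# Hypothetical helper: contiguous stretches with coverage >= lower.
find_islands <- function(cov, lower = 1) {
  runs <- rle(cov >= lower)
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1
  # keep only the runs where the threshold is met
  data.frame(start = run_start[runs$values], end = run_end[runs$values])
}

cov_toy <- c(0, 1, 1, 3, 3, 2, 1, 1, 0, 0, 2, 2, 0)
islands_toy <- find_islands(cov_toy, lower = 1)  # all islands
peaks_toy <- find_islands(cov_toy, lower = 3)    # only the high part
islands_toy
peaks_toy
```

Raising the threshold shrinks and removes islands, which is the same mechanism that is used later to call peaks with a cutoff.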

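For each island, we can compute the number of reads that form it using the function viewSums(), which reports the sum of coverage values of all bases within the island. Since every extended read contributes 200 covered bases, dividing this sum by 200 gives the number of reads.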

# calculate number of reads within the islands

viewSums(islands)

viewSums(islands)/200

# get the frequency (step-by-step; check what happens with each step)

nread.tab <- viewSums(islands)/200

nread.tab <- table(nread.tab)

nread.tab <- colSums(nread.tab)

nread.tab

Similarly, we can use the function viewMaxs() to calculate the maximum read depth of the islands (i.e. the height of the summits).

# calculate the number of reads at the summit (depth)

viewMaxs(islands)

# calculate read depth over islands (now all in one command)

depth.tab <- colSums(table(viewMaxs(islands)))

depth.tab <- data.frame( depth=as.numeric(names(depth.tab)), freq=depth.tab )
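The per-island summaries can be reproduced on a toy vector in base R. This is only a sketch of the idea behind viewSums() and viewMaxs(); the variable names are made up.

```r
cov_toy <- c(0, 1, 1, 3, 3, 2, 1, 1, 0, 0, 2, 2, 0)

# label every base with the index of the run it belongs to
runs <- rle(cov_toy >= 1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
in_island <- rep(runs$values, runs$lengths)

# sum and maximum of the coverage within each island
island_sum <- tapply(cov_toy[in_island], run_id[in_island], sum)
island_max <- tapply(cov_toy[in_island], run_id[in_island], max)

# frequency table of island depths, as used for the plot
depth_tab <- table(island_max)
```

With reads of equal width, dividing island_sum by that width would give the number of reads per island, as in the viewSums(islands)/200 step above.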

You can plot the distribution of island depths using the function xyplot() from the lattice package (see http://stat.ethz.ch/R-manual/R-devel/library/lattice/html/xyplot.html for more information). To save the plot to a file instead of having it displayed on your screen, use bitmap(file="/<path>/<filename>.jpg",type="jpeg") before and dev.off() after the plotting command.

library(lattice)

xyplot( log(freq) ~ depth, data=depth.tab, subset=depth<=20,

type="p", pch=16,

aspect=0.7

)

1.4 Processing multiple lanes

It is useful to be able to apply a procedure to all samples. A function for this purpose is seqapply(), which applies a function to a Sequence object and returns the result as another Sequence object, if possible. Let us therefore define a function islandDepthSummary() that combines all the analysis steps that we performed in the previous chapter as follows:

# define a custom function to apply all steps

islandDepthSummary <- function(x){

isl <- slice(coverage(x), lower = 1)

tab <- colSums(table(viewMaxs(isl)))

df <- data.frame( depth=as.numeric(names(tab)), freq=tab )

return(df)

}

Below we are using the seqapply() function to summarise the full dataset, flattening the returned data.frames with the function stack() into one data.frame for subsequent plotting.

# apply islandDepthSummary to extended reads

depth.islands <- seqapply(NFKB.ext, islandDepthSummary)

lapply(depth.islands,head)

# combine both samples into one data.frame

depth.islands <- stack(depth.islands)

names(depth.islands) <- c("sample","depth","freq")

depth.islands <- as.data.frame(depth.islands)

# plotting the distributions for IP and input sample

xyplot( log(freq) ~ depth | sample, data=depth.islands, subset=depth<=20,

type="p", pch=16,

aspect=0.7

)
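The pattern of applying one summary function per sample and then stacking the results can be sketched in base R with lapply() and rbind(). The depth values and the helper depth_summary() are made up for illustration and stand in for the Bioconductor objects used above.

```r
# Hypothetical helper: frequency table of island depths as a data.frame.
depth_summary <- function(max_depths) {
  tab <- table(max_depths)
  data.frame(depth = as.numeric(names(tab)), freq = as.vector(tab))
}

# toy island depths for two samples
samples <- list(NFKB_IP = c(1, 1, 2, 5, 1, 2), input = c(1, 1, 1, 2))

# one data.frame per sample
per_sample <- lapply(samples, depth_summary)

# add the sample name as a column and stack everything
combined <- do.call(rbind, Map(function(df, nm) cbind(sample = nm, df),
                               per_sample, names(per_sample)))
rownames(combined) <- NULL
combined
```

The resulting long-format table has one row per sample and depth, which is the shape that a conditioned plot such as log(freq) ~ depth | sample expects.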

As you would expect, the IP sample contains many more peaks with higher depth compared to the input sample. If reads were sampled randomly from the genome, then the null distribution of the island depth k would be a Poisson distribution:

$$f(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where λ is the mean read depth over all islands. We can use this distribution to model the "noise" in the ChIP-seq experiment. Although our samples are not random, the islands with just one or two reads may be representative of this null distribution. Using the above formula, express λ in terms of f(1) and f(2).
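One way to work this out from the Poisson form f(k) = λ^k e^(-λ) / k! is to write down the two terms and take their ratio, so that the exponential cancels:

```latex
f(1) = \lambda e^{-\lambda}, \qquad
f(2) = \frac{\lambda^{2} e^{-\lambda}}{2}
\quad\Longrightarrow\quad
\frac{f(2)}{f(1)} = \frac{\lambda}{2}
\quad\Longrightarrow\quad
\lambda = \frac{2\, f(2)}{f(1)}
```

Because only the ratio enters, the observed island counts at depths 1 and 2 can be used in place of the probabilities.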

We now plot the island read depth distributions together with the Poisson estimate of the noise distribution:

xyplot( log(freq) ~ depth | sample, data=depth.islands, subset=depth<=20,

type="p", pch=16,

panel = function(x, y, ...) {

# add background lines for orientation

panel.abline(v=seq(0,20,5), lty="dashed", col="lightgrey")

# add poisson distribution

lambda <- 2 * exp(y[2]) / exp(y[1])

null.est <- function(xx) {

(lambda^xx)*exp(-lambda)/factorial(xx)

}

log.N.hat <- log(null.est(1)) - y[1]

panel.lines(1:10, -log.N.hat + log(null.est(1:10)), col = "black")

# add island depths distribution

panel.xyplot(x, y, ...)

},

ylim=c(-0.1,14.1),aspect=0.7

)

1.5 Identifying peaks

To obtain a set of putative binding sites, or peaks, we need to find regions that show coverage significantly above the noise level. Using the same Poisson-based approach for estimating the noise distribution as in the plot above, the function peakCutoff() returns a cutoff value for a specified false-discovery rate (FDR):

# determine a peak cutoff with FDR = 0.01%

peakCutoff(cov.NFKB, fdr = 0.0001)
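The reasoning behind such a cutoff can be sketched in base R. This is a simplified illustration under stated assumptions, not the implementation used by peakCutoff(): the island counts are made-up numbers, λ is estimated from the counts at depths 1 and 2, and the FDR at depth k is taken as the expected number of null islands with depth >= k divided by the observed number.

```r
# made-up table: number of islands observed at each depth
depth <- 1:12
freq <- c(100000, 10000, 900, 200, 90, 60, 40, 30, 20, 15, 10, 8)

# Poisson rate from the two lowest depths: lambda = 2 * f(2) / f(1)
lambda <- 2 * freq[2] / freq[1]

# scale the null so that it matches the observed count at depth 1
N <- freq[1] / dpois(1, lambda)

# expected null islands and observed islands with depth >= k
expected_ge <- N * ppois(depth - 1, lambda, lower.tail = FALSE)
observed_ge <- rev(cumsum(rev(freq)))

# estimated FDR per depth and the smallest depth below the target
fdr <- expected_ge / observed_ge
cutoff <- min(depth[fdr <= 1e-4])
cutoff
```

Islands at low depth are almost entirely explained by the null, so the estimated FDR starts near 1 and drops quickly as the depth increases.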

Use the plot from the previous chapter to see how the returned cutoff value of 10 fits to the observed depth distributions from IP, input and Poisson random background. We next use the newly defined cutoff to identify the peaks using slice() and the peakSummary() function:

# get peaks above cutoff (use the rounded value you determined with peakCutoff() )

peaks.NFKB <- slice(cov.NFKB, lower = 10)

lapply(peaks.NFKB, head)

# summarize the peak information

chippeaks <- peakSummary(peaks.NFKB)

chippeaks

# convert from RangedData to GRanges

chippeaks <- as(chippeaks, "GRanges")

The result of peakSummary() is a RangedData object with two columns representing the maximum coverage depth and the total number of reads in the peak, which we convert into a GRanges object in the last step. It is possible to extend this object with additional columns (such as other peak statistics), which is often useful.

We can now compute the strand-specific coverage and look at the coverage underlying individual peaks using the following code. Can you explain the observed pattern?

# calculate the coverage separately for both strands

cov.pos <- coverage(NFKB$NFKB_IP[strand(NFKB$NFKB_IP) == "+"])

cov.neg <- coverage(NFKB$NFKB_IP[strand(NFKB$NFKB_IP) == "-"])

# get coverage over peaks

peaks.pos <- Views(cov.pos, as(chippeaks, "RangesList"))

peaks.neg <- Views(cov.neg, as(chippeaks, "RangesList"))

# identify the peaks with the highest depth on chr10 (use only peaks that are less than 500 nt wide)

sel <- width(chippeaks[seqnames(chippeaks)=="chr10"])<500

peak.order <- order(elementMetadata(chippeaks)$max[as.vector(seqnames(chippeaks)=="chr10")][sel], decreasing=TRUE)

# plot the coverage underlying the highest peak (change the index 1 to 2, ..., 5 to see the next four)

coverageplot(peaks.pos$chr10[sel][peak.order[1]], peaks.neg$chr10[sel][peak.order[1]])





1.6 Comparing peaks

In addition to the approach described above, there are a number of command-line tools currently available to identify peaks in ChIP-seq data. We have called peaks on the NF-κB ChIP-seq data using two such tools: MACS and SISSR. Below, we define custom functions to import the peak positions of the different peak callers and convert them into GRanges objects to aid comparisons.

#### MACS

# function to convert MACS output table into a GRanges object

macs2GRanges <-function(peaks) {

# generate GRanges object

myrange <- GRanges(

seqnames=peaks$chr,

ranges=IRanges(start=peaks$start, end=peaks$end, names=paste(peaks$chr,peaks$abs_summit,sep=":")),

strand="*",

count=peaks$pileup,

score=peaks$X.log10.pvalue.,

FE=peaks$fold_enrichment,

fdr=peaks$X.log10.qvalue.,

maxpos=peaks$abs_summit

)

return(myrange)

}

# load MACS peaks and convert to GRanges (keep only peaks on chr10-12)

macspeaks <- read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/NFKB_MACS.xls"),

header=TRUE, sep="\t", stringsAsFactors=FALSE)

macspeaks <- subset(macspeaks, chr %in% c("chr10","chr11","chr12"))

macspeaks <- macs2GRanges(macspeaks)

#### SISSR

# function to convert SISSR output table into a GRanges object

sissr2GRanges <-function(peaks) {

# generate GRanges object

myrange <- GRanges(

seqnames=peaks$Chr,

ranges=IRanges(start=peaks$cStart, end=peaks$cEnd, names=paste(peaks$Chr,peaks$cStart,sep=":")),

strand="*",

count=peaks$NumTags,

score=peaks[,"p-value"],

FE=peaks$Fold

)

return(myrange)

}

# load SISSR peaks and convert to GRanges (keep only peaks on chr10-12)

sissrpeaks <- read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/SISSR_NFKB.out"),

skip=57, header=FALSE, sep="\t", stringsAsFactors=FALSE, comment.char="=")

names(sissrpeaks) <- scan(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/SISSR_NFKB.out"),

skip=55, nlines=1, sep="\t", what="character")

sissrpeaks <- subset(sissrpeaks, Chr %in% c("chr10","chr11","chr12"))

sissrpeaks <- sissr2GRanges(sissrpeaks)

# use midpoint as maxpos, since SISSR does not report the summit position

elementMetadata(sissrpeaks)$maxpos <- start(sissrpeaks) + floor(width(sissrpeaks)/2)

As a first step to compare the performance of the different peak callers, we can calculate the overlap between the three sets, which can be displayed in a Venn diagram (note that the overlap of some peaks might be ambiguous due to their slightly shifted locations in the different sets).

# calculate overlaps between peak sets (centered on chipseq peaks)

chip2macs <- chippeaks%in%macspeaks

chip2sissr <- chippeaks%in%sissrpeaks

# (100) chippeaks, (010) macspeaks, (001) sissrpeaks

weights <- c(

"100"=sum( !(chippeaks%in%union(macspeaks,sissrpeaks)) ),

"010"=sum( !(macspeaks%in%union(chippeaks,sissrpeaks)) ),

"001"=sum( !(sissrpeaks%in%union(chippeaks,macspeaks)) ),

"110"=sum( chip2macs & !chip2sissr),

"101"=sum( !chip2macs & chip2sissr),

"011"=sum( sissrpeaks%in%macspeaks & !(sissrpeaks%in%chippeaks) ),

"111"=sum( chip2macs & chip2sissr )

)

# create Venn object

library(Vennerable)

venn.tab <- Venn(SetNames=c("chipseq","MACS","SISSR"), Weight=weights)

# plot Venn diagram

plot(venn.tab, doWeights=FALSE)
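The overlap test that underlies these counts can be sketched in base R for two toy peak sets on a single chromosome. The helper overlaps_any() is hypothetical and ignores chromosome names and strand.

```r
# Hypothetical helper: does each query interval overlap any subject interval?
overlaps_any <- function(q_start, q_end, s_start, s_end) {
  vapply(seq_along(q_start), function(i) {
    # two closed intervals overlap if each starts before the other ends
    any(q_start[i] <= s_end & q_end[i] >= s_start)
  }, logical(1))
}

# toy peak sets A and B
a_start <- c(100, 500, 900); a_end <- c(200, 600, 950)
b_start <- c(150, 2000);     b_end <- c(300, 2100)

a_in_b <- overlaps_any(a_start, a_end, b_start, b_end)
b_in_a <- overlaps_any(b_start, b_end, a_start, a_end)

# counts for a two-set Venn diagram (shared peaks counted from set A)
venn_counts <- c(only_a = sum(!a_in_b), only_b = sum(!b_in_a), shared = sum(a_in_b))
venn_counts
```

Counting shared peaks from one set, as done here and in the code above, is a choice: if one broad peak overlaps two narrow ones, the number of shared peaks differs depending on which set is used as the reference.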

To get a better idea of the behaviour of the different peak callers, we can also compare the widths of peaks that are either shared or unique for one of the peak callers (since all peaks from MACS and SISSR were also detected by chipseq, we consider these peaks as unique if they are only detected by one of the two).

# determine overlapping peaks

# 1. chipseq peaks (shared with SISSR or MACS)

shared <- chippeaks%in%sissrpeaks | chippeaks%in%macspeaks

# 2. MACS peaks (shared with SISSR)

shared <- c(shared, macspeaks%in%sissrpeaks)

# 3. SISSR peaks (shared with MACS)

shared <- c(shared, sissrpeaks%in%macspeaks)

shared[shared] <- "shared"

shared[shared!="shared"] <-"unique"

# generate a dataframe including the peak widths

df <- data.frame(peak.caller=factor(rep(c("chipseq", "MACS", "SISSR"),

c(length(chippeaks), length(macspeaks), length(sissrpeaks)))))

df <- data.frame(df, width=c(width(chippeaks), width(macspeaks), width(sissrpeaks)))

df <- data.frame(df, shared=shared)

# inspect the overlap

table(list(df$peak.caller,df$shared))

# plot histograms of the peak width

histogram(~width|peak.caller*shared, data=subset(df, width<2000),breaks=20)

Another way to check the peaks from the different sets is to plot the average shape of the peaks. In the following example, we produce such a plot for NF-κB ChIP-seq reads averaged around the top 100 peak summits predicted by MACS on chromosomes 10-12.

# define windows around summits of the top 100 peaks (+- 1000 bp)

NFKB.windows <- macspeaks[order(elementMetadata(macspeaks)$score, decreasing=TRUE)[1:100]]

start(NFKB.windows) <- elementMetadata(NFKB.windows)$maxpos-1000

end(NFKB.windows) <- elementMetadata(NFKB.windows)$maxpos+999

# get sections from coverage

summit.cov <- Views( cov.NFKB, as(NFKB.windows,"RangesList"))

# resolve list structure and combine into matrix

summit.cov <- lapply(summit.cov, function(x) lapply(x,as.vector))

summit.cov <- do.call(c,summit.cov)

summit.cov <- do.call(rbind,summit.cov)

# plot

xyplot( colSums(summit.cov)/nrow(summit.cov) ~ -1000:999,

type="l", main="average NFKB fragment profile (MACS peak summits)",

xlab="distance to summit", ylab="average fragment depth"

)
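The averaging step can be illustrated in base R on a toy coverage vector with two summits. The helper average_profile() is hypothetical, uses a symmetric window for simplicity, and assumes that no window runs past the ends of the vector.

```r
# Hypothetical helper: mean coverage in fixed windows around given centers.
average_profile <- function(cov, centers, flank = 3) {
  offsets <- -flank:flank
  # one row per center, one column per offset
  windows <- t(sapply(centers, function(ctr) cov[ctr + offsets]))
  list(offset = offsets, mean_depth = colMeans(windows))
}

cov_toy <- c(0, 1, 2, 5, 2, 1, 0, 0, 1, 3, 6, 3, 1, 0)
prof <- average_profile(cov_toy, centers = c(4, 11), flank = 3)
prof$mean_depth
```

Stacking the windows into a matrix and averaging the columns is the same operation as colSums(summit.cov)/nrow(summit.cov) in the code above.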

Try to prepare a similar plot for the SISSR peaks. Since SISSR does not report a summit position, you can use the midpoint of the peaks to center the plots:

midpoints <- start(NFKB.windows) + floor(width(NFKB.windows)/2)

1.7 Placing peaks in their genomic context

It is a common goal to classify peaks of interest in the context of genomic features such as promoters, upstream regions, etc. Using the GenomicFeatures package one can obtain gene annotations from different data sources, including UCSC and Ensembl. For interest, the following code demonstrates how to download the latest Ensembl gene predictions for the human genome and generate a GRanges object containing contiguous transcribed regions of the genome (from the first to the last exon). To save time during the practical, we will import a previously prepared GRanges object containing human transcript annotations (the code that was used to prepare gregions is given below).

#library(GenomicFeatures)

#db <- makeTranscriptDbFromBiomart()

#gregions <- transcripts(db)

#gregions <- GRanges(

# seqnames=paste("chr",seqnames(gregions),sep=""),

# range=ranges(gregions),

# strand=strand(gregions)

#)

#save(gregions, file="gregions.Rdata")

We will use the transcript coordinates to estimate promoter positions for each transcript, using a window of 2 kb before and 500 bp after the TSS. We will determine the number of peaks that fall into these promoter regions for all peak sets. Compare this to the number of peaks falling into similar-sized regions at the 3' end of genes (hint: flank(..., start=FALSE)).

load(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/gregions.RData"))

# get promoter locations

promoters <- flank(gregions, width=2000)

promoters <- resize(promoters, width=2500)

promoters <- unique(promoters)

# check overlap of peak sets

table(chippeaks %in% promoters)

table(chippeaks %in% promoters)/length(chippeaks)
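The coordinates produced by the flank() and resize() steps can be checked by hand with a small base-R sketch. The helper promoter_window() and the two toy transcripts are hypothetical; the TSS is taken as the start of a plus-strand transcript and the end of a minus-strand transcript.

```r
# Hypothetical helper: 2 kb upstream and 500 bp downstream of the TSS.
promoter_window <- function(start, end, strand, upstream = 2000, downstream = 500) {
  plus <- strand == "+"
  tss <- ifelse(plus, start, end)
  data.frame(
    start  = ifelse(plus, tss - upstream, tss - downstream + 1),
    end    = ifelse(plus, tss + downstream - 1, tss + upstream),
    strand = strand
  )
}

# one transcript on each strand
prom <- promoter_window(start = c(10000, 15000), end = c(12000, 20000),
                        strand = c("+", "-"))
prom
```

On the minus strand, "upstream" lies at higher coordinates, so the window is mirrored around the TSS rather than shifted.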

A common plot to generate with ChIP-seq data is to see how the coverage correlates with specific genomic coordinates, e.g. transcription start sites, summit positions or other ChIP-seq peaks. The concept is very similar to the approach we used to determine the average read coverage around the peak summits. In the following example, we plot the average coverage of NF-κB along the promoters.

# get sections from coverage

prom.cov <- Views( cov.NFKB, as(promoters,"RangesList"))

# resolve list structure and combine into matrix

prom.cov <- lapply(prom.cov, function(x) lapply(x,as.vector))

prom.cov <- do.call(c,prom.cov)

prom.cov <- do.call(rbind,prom.cov)

# invert profiles for transcript on minus strand

prom.cov[as.vector(strand(promoters))=="-"] <- rev(prom.cov[as.vector(strand(promoters))=="-"])

# plot

xyplot( colSums(prom.cov)/nrow(prom.cov) ~ -2000:499,

type="l", main="average NFKB fragment profile around TSS",

xlab="distance to TSS", ylab="average fragment depth"

)

1.8 Visualising the data in a genome browser

When performing computational analyses, it is important to look at your input data and results in a genome browser. This can either be a webserver like the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway) or programmes like the Integrated Genome Browser (IGB; http://bioviz.org/igb/) that you run locally on your computer.

In our case, we would like to upload the peak locations and compare them to the read coverage from the NF-κB IP and the input sample. To this end, we need to output the data in a format that can be readily imported into the genome browser. A suitable format for the read coverages is a WIG file (http://genome.ucsc.edu/goldenPath/help/wiggle.html), while the peak locations can be displayed as a BED file (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). Functions to import and export a number of commonly used NGS file formats can be obtained from the package rtracklayer.

library(rtracklayer)

# calculate the coverages of both samples

covs <- lapply(NFKB.ext, coverage)

# convert to RangedData and set sequence length

covs <- lapply(covs, function(x){

x=as(x,"RangedData")

seqlengths(x) <- chrlen[match(names(genome(x)),chrlen[,1]),2]

return(x)

})

# export as WIG file

wig.path <- "<path.to.your.folder>"

export.wig(covs[["NFKB_IP"]], paste(wig.path,"/coverage_NFKB_IP.wig",sep=""),

dataFormat="auto", name="coverage NFKB IP")

export.wig(covs[["input"]], paste(wig.path,"/coverage_input.wig",sep=""),

dataFormat="auto", name="coverage input")

# write bed files of the peaks

export.bed(chippeaks, paste(wig.path,"/chipseq_peaks.bed",sep=""))

export.bed(sissrpeaks, paste(wig.path,"/SISSR_peaks.bed",sep=""))

export.bed(macspeaks, paste(wig.path,"/MACS_peaks.bed",sep=""))
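To see what ends up in such a file, the following base-R sketch formats two toy peaks as minimal four-column BED lines. The helper peaks_to_bed() and the peak names are hypothetical; the point is the coordinate convention, since BED starts are 0-based while ends are exclusive.

```r
# Hypothetical helper: 1-based closed intervals to tab-separated BED lines.
peaks_to_bed <- function(chr, start, end, name) {
  # only the start shifts by one; the end stays the same number
  sprintf("%s\t%d\t%d\t%s", chr, as.integer(start - 1), as.integer(end), name)
}

bed_lines <- peaks_to_bed(chr = c("chr10", "chr11"),
                          start = c(100, 5000), end = c(200, 5400),
                          name = c("peak1", "peak2"))

# write the lines to a temporary file and read them back
bed_file <- tempfile(fileext = ".bed")
writeLines(bed_lines, bed_file)
readLines(bed_file)
```

An off-by-one shift between peaks and coverage in the browser is a typical sign that this conversion was skipped or applied twice.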

2 Motif analysis

In the next step of the practical, we want to identify the binding motif that is recognised by NF-κB. We will use the BSgenome package that contains infrastructure for Biostrings-type objects. These implement memory-efficient string containers, string-matching algorithms and other utilities for fast manipulation of large biological sequences or sets of sequences (see the BSgenome vignette for details). We will start by loading the package for the human genome hg19.

library("BSgenome")

# check which genomes are available

available.genomes()

# load the human genome (version hg19)

library("BSgenome.Hsapiens.UCSC.hg19")

2.1 De novo motif discovery

High-resolution ChIP-seq data allows for an efficient detection of sequence motifs, which are preferentially bound on the DNA by the studied transcription factors (if such motifs actually exist). De novo motif discovery can reveal the binding specificity of an unknown factor, confirm the validity of candidate motifs and help validate ChIP-seq experiments.

Here, we will take advantage of the ability of the ChIP-seq peak detection software MACS to determine peak summit positions in order to perform a de novo motif discovery on a restricted set of sequences (if you want to perform a similar search on the SISSR peaks for comparison, you could use the center of the peak as in Chapter 1.6). Taking a window around the center of the bound region, we will save the 100 top and bottom-ranking peaks as well as a random set of peaks into a fasta file and submit them to MEME for motif discovery. MEME is one of the most popular packages for motif discovery (Bailey et al. 2009). It employs expectation maximization to detect sequence patterns that occur repeatedly in a group of sequences, and can be used both online and in a locally installed version. It will output results in html format, which can then be visualised in a browser of choice.

# use a distance of 25nt on either side of the summit

mydist <- 25

# define summit regions

macs.summits <- GRanges(

seqnames=seqnames(macspeaks),

ranges=IRanges( start=elementMetadata(macspeaks)$maxpos - mydist,

end=elementMetadata(macspeaks)$maxpos + mydist),

strand="+"

)

# order the summit regions based on their score

macs.summits <- macs.summits[order(-elementMetadata(macspeaks)$score)]

# create unique names for all peaks

names(macs.summits) <- paste( seqnames(macs.summits), start(macs.summits), end(macs.summits), sep=":")

# take a look at the resulting GRanges centered around the MACS peak summit positions

macs.summits

# define a folder where you want to save the fasta files

out.folder <- "<path.to.your.folder>/"   # quoted path, ending with a slash

# top 100 peaks

seqs <- getSeq(Hsapiens, macs.summits[1:100], as.character=FALSE)

names(seqs) <- names(macs.summits[1:100])

# export to fasta file

write.XStringSet(seqs,file=paste(out.folder,"NFKB_MACS_top100.fa",sep=""), "fasta", append=FALSE)

# random sample of 100 peaks

index <- sample(1:length(macs.summits), size=100, replace=FALSE)

seqs <- getSeq(Hsapiens, macs.summits[index], as.character=FALSE)

names(seqs) <- names(macs.summits[index])

write.XStringSet(seqs, file=paste(out.folder,"NFKB_MACS_random100.fa",sep=""), "fasta",append=FALSE)

# bottom 100 scoring sequences

seqs <- getSeq(Hsapiens, macs.summits[(length(macs.summits)-99):length(macs.summits)], as.character=FALSE)

names(seqs) <- names(macs.summits[(length(macs.summits)-99):length(macs.summits)])

write.XStringSet(seqs, file=paste(out.folder,"NFKB_MACS_bottom100.fa",sep=""), "fasta", append=FALSE)
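The two operations behind these commands, cutting a window around a summit and writing fasta records, can be sketched in base R on a made-up sequence. The helper to_fasta(), the toy "genome" and the name chrToy are hypothetical.

```r
# Hypothetical helper: interleave ">name" header lines with sequences.
to_fasta <- function(ids, seqs) {
  as.vector(rbind(paste0(">", ids), seqs))
}

# toy sequence with a summit at position 13
genome_toy <- "ACGTACGTGGGAATTTCCACGTACGT"
summit <- 13
mydist <- 5

# window of mydist nt on either side of the summit
window_seq <- substr(genome_toy, summit - mydist, summit + mydist)

# name the record by its coordinates, as done for macs.summits
record_id <- paste("chrToy", summit - mydist, summit + mydist, sep = ":")
fasta <- to_fasta(record_id, window_seq)
fasta
```

Note that a distance of mydist on both sides gives a window of 2 * mydist + 1 bases, so the windows in the practical are 51 nt wide.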

The three different sets of 100 peak sequences can now be uploaded to MEME. The code snippet below demonstrates how MEME can be called directly from R and the command line output will be redirected into our R window. We will however use the online version of MEME in this practical and upload the three .fa files that we have just produced to the MEME webserver (http://meme.sdsc.edu/meme/cgi-bin/meme.cgi).

###### calling MEME from R ### (RUN MEME ONLINE IN THIS PRACTICAL)

# get list of fasta files

#fa.files <- list.files(paste(out.folder,sep=""), pattern="*fa$", full.names=TRUE)

#

# define MEME parameters

#parameters <- " -nmotifs 3 -minsites 100 -minw 12 -maxw 35 -revcomp -maxsize 500000 -dna -oc "

#

#for (fa.file in fa.files) {

#

# # define MEME output folder

# meme.out <- substr(fa.file,1,nchar(fa.file)-3)

#

# # construct full MEME command

# mycommand <- paste("meme ",fa.file,parameters,meme.out,sep="")

#

# # send command to system

# system(mycommand)

#}

Inspect the HTML output. What information is provided with each motif? How many sequences were used to derive the motifs? As expected, our top hit in the top 100 peaks closely resembles the canonical NF-κB motif (for comparison, search for RelA in the JASPAR database at http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?rm=browse&db=core&tax_group=vertebrates). Reassuringly, the same motif is also recovered in the other two sets (bottom 100 and randomly chosen peaks), suggesting that the NF-κB motif is present in a good fraction of our peaks, regardless of their enrichment score.

2.2 Motif scan

If the sequence preferences of a factor of interest are known, one can scan ChIP-seq peak sequences with the corresponding position weight matrix (PWM) or consensus to check if the majority of them indeed contain the known motif. The BSgenome and Biostrings packages allow easy consensus matching as well as PWM scanning. For the following motif scan, we will use the NF-κB binding site as it is given in the JASPAR database. Alternatively, you can adjust the code to use the newly obtained PWM that you got from your MEME analysis.
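The principle of a PWM scan can be sketched in base R with a made-up three-position matrix. This is a simplified illustration and not the Biostrings implementation: scan_pwm() is a hypothetical helper that sums the matrix entries of the bases in each window and reports windows reaching a given fraction of the highest possible score, in the spirit of the "80%" cutoff used later. It assumes a sequence of A, C, G and T that is at least as long as the motif.

```r
# toy PWM (rows A, C, G, T; one column per motif position), consensus "AGT"
pwm <- matrix(c(0.7, 0.1, 0.1, 0.1,   # position 1 prefers A
                0.1, 0.1, 0.7, 0.1,   # position 2 prefers G
                0.1, 0.1, 0.1, 0.7),  # position 3 prefers T
              nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Hypothetical helper: start positions of windows scoring above the cutoff.
scan_pwm <- function(pwm, seq, min_frac = 0.8) {
  bases <- match(strsplit(seq, "")[[1]], rownames(pwm))
  w <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  # score of a window = sum of the entries picked by its bases
  scores <- sapply(starts, function(s) {
    sum(pwm[cbind(bases[s:(s + w - 1)], seq_len(w))])
  })
  threshold <- min_frac * sum(apply(pwm, 2, max))
  starts[scores >= threshold]
}

hits <- scan_pwm(pwm, "CCAGTTTAGTA")

# keep only the hit closest to a (toy) summit position
summit <- 7
closest <- hits[which.min(abs(hits - summit))]
```

Wider peaks contain more windows and therefore more chances for a hit, which is worth keeping in mind when motif content is compared between peak sets of different widths.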

In a first step, we import the table of base frequencies of the NF-κB binding site and plot a weblogo (for a more detailed description of this type of motif visualisation, see http://weblogo.berkeley.edu/).

# read the NFKB PWM from a text file

NFKB.pwm <- read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/NFKB_pwm_meme.txt"), sep="\t", header=FALSE)

NFKB.pwm <- t(NFKB.pwm[,2:5])

rownames(NFKB.pwm) <- c("A","C","G","T")

# view the motif matrix

NFKB.pwm

# plot the motif logo

library(seqLogo)

seqLogo(NFKB.pwm, ic.scale=TRUE)

Next, we retrieve the sequence under the peaks and prepare two helper functions that will be called on each peak set. These functions will return a matrix containing, for each peak, the start position and sequence of the motif hit closest to the summit as well as the number of hits. mymatchPWM uses the function matchPWM (from the Biostrings package, which is loaded together with BSgenome), which initially returns all PWM hits within a specific string at a specified cutoff, and selects the hit that is closest to the peak summit. Below, we will call these functions on our three peak sets to check how many of the binding sites contain an NF-κB motif.

# obtain the sequences under the peaks

elementMetadata(macspeaks)$seqs <- getSeq(Hsapiens, macspeaks, as.character=TRUE)

elementMetadata(chippeaks)$seqs <- getSeq(Hsapiens, chippeaks, as.character=TRUE)

elementMetadata(sissrpeaks)$seqs <- getSeq(Hsapiens, sissrpeaks, as.character=TRUE)

# function to scan a string for PWM matches at the specified threshold by calling matchPWM()

# keeps only the closest of multiple hits!

mymatchPWM <- function (pwm, myseq, threshold, summit) {

# get all matches of PWM

mymatch <- matchPWM(pwm, myseq, min.score=threshold)

# collect starts/seqs into matrix (if any)

if (length(mymatch)==0) {

found <- cbind(NA,NA,0)

} else {

found <- cbind(start(mymatch), as.character(mymatch), length(start(mymatch)))

}

colnames(found) <- c("start","seq","nr")

# keep only the match that is closest to the summit

found <- found[order(abs(as.integer(found[,1])-summit),decreasing=FALSE)[1],]

# return matrix

return(found)

}

# function to call mymatchPWM() on each peak in a set; returns a matrix with motif information per peak

ScanPeaks <- function(NFKB.GR, pwm, threshold) {

# get all peak sequences

myseqs <- elementMetadata(NFKB.GR)$seqs

# get summit positions (relative to peak coordinates)

summits <- elementMetadata(NFKB.GR)$maxpos - start(NFKB.GR)

# apply mymatchPWM() to all peaks in the set

motifmatrix <- sapply(1:length(myseqs), function(x) mymatchPWM(pwm, myseqs[x], threshold, summits[x]))

motifmatrix <- t(motifmatrix)

# set peak IDs as rownames

rownames(motifmatrix) <- names(NFKB.GR)

# return motif matrix

return(motifmatrix)

}

# scan all peak sequences for the motif (using a cutoff of 80%)

### This step is likely to take ~ 10 min, so at this point it's convenient to have a coffee :P ###

macs.motifs <- ScanPeaks(macspeaks, NFKB.pwm, "80%")

sissr.motifs <- ScanPeaks(sissrpeaks, NFKB.pwm, "80%")

chipseq.motifs <- ScanPeaks(chippeaks, NFKB.pwm, "80%")

# add motif information (number of motifs per peaks) to the peak GRanges object

elementMetadata(macspeaks)$motif.no <- as.integer(macs.motifs[,3])

elementMetadata(sissrpeaks)$motif.no <- as.integer(sissr.motifs[,3])

elementMetadata(chippeaks)$motif.no <- as.integer(chipseq.motifs[,3])

# plot the proportion of peaks with 0, 1 or more motifs in the three peak sets

par(mfrow=c(1,3))

pie( c(sum(as.numeric(elementMetadata(macspeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(macspeaks)$motif.no==1)),

sum(as.numeric(elementMetadata(macspeaks)$motif.no>1))),

labels=c("0","1","2+"), main="NFKB MACS peaks",

col=c("white","lightgrey","darkgrey")

)

pie( c(sum(as.numeric(elementMetadata(sissrpeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(sissrpeaks)$motif.no==1)),

sum(as.numeric(elementMetadata(sissrpeaks)$motif.no>1))),

labels=c("0","1","2+"), main="NFKB SISSR peaks",

col=c("white","lightgrey","darkgrey")

)

pie( c(sum(as.numeric(elementMetadata(chippeaks)$motif.no==0)),

sum(as.numeric(elementMetadata(chippeaks)$motif.no==1)),


sum(as.numeric(elementMetadata(chippeaks)$motif.no>1))),

labels=c("0","1","2+"), main="NFKB chipseq peaks",

col=c("white","lightgrey","darkgrey")

)

As you can see, at the cutoffs used, 30-40% of peaks contain at least one NF-κB motif, depending on the peak finding algorithm used. MACS peaks appear to be most enriched for motifs. However, as we have seen previously, different peak callers have a tendency to detect peaks of different width, which influences the chances of finding a motif hit. Therefore, resolution has a direct effect on this analysis. A better way of comparing motif content would be to look at regions of the same size, e.g. 150 bp up- and downstream of the peak summit position, and to compare the motif content of peak regions with background regions.
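The fixed-window comparison suggested above could be sketched as follows. This is only an outline in base R under stated assumptions: toy.peaks, peak.hits and background.hits are invented numbers, and summit_windows and motif_fraction are hypothetical helpers that are not part of any package used in this practical.

```r
# Hypothetical peak table with absolute summit positions
toy.peaks <- data.frame(chr = c("chr1", "chr1", "chr2"),
                        summit = c(1000L, 5000L, 120L),
                        stringsAsFactors = FALSE)

# Build windows of identical size around each summit
# (starts are clipped at position 1 near the chromosome start)
summit_windows <- function(peaks, flank = 150L) {
  data.frame(chr = peaks$chr,
             start = pmax(peaks$summit - flank, 1L),
             end = peaks$summit + flank,
             stringsAsFactors = FALSE)
}

toy.windows <- summit_windows(toy.peaks)

# Fraction of regions that contain at least one motif hit
motif_fraction <- function(motif.no) mean(motif.no > 0)

# Made-up motif counts per region for peaks and for background regions
peak.hits <- c(1, 0, 2, 1, 0, 1, 0, 0, 3, 1)
background.hits <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)

# Ratio of the two fractions as a simple measure of motif enrichment
enrichment <- motif_fraction(peak.hits) / motif_fraction(background.hits)
```

With real data, the motif counts would come from scanning the window sequences and a matched set of background sequences with the same PWM and cutoff.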

(OPTIONAL) Having obtained the location of putative motifs, one could now try to refine the PWM itself based on the sequences that have matched the initial scan. One could also try extending the motif up- and downstream by checking if there are any additional bases at the flanks showing a non-random distribution, as follows.

# set the extension distance

mydist <- 10

# get absolute start positions of the motif hits (hit starts are 1-based within the peak sequence)

abs.starts <- as.integer(macs.motifs[,1]) + start(macspeaks) - 1

abs.starts <- abs.starts[!is.na(abs.starts)]

# make GRanges

motifs.GR <- GRanges(

seqnames=seqnames(macspeaks)[!is.na(macs.motifs[,1])],

ranges=IRanges(start=abs.starts - mydist,

end=abs.starts + mydist + 9),

strand="+"

)

# get sequences

motif.seqs <- getSeq(Hsapiens, motifs.GR, as.character=TRUE)

# get letter counts per position

NFKB.matrix <- consensusMatrix(motif.seqs)

# transform counts to frequencies

NFKB.matrix <- apply(NFKB.matrix, 2, function(x){ x/sum(x) })

# plot the newly obtained sequence logo

seqLogo(NFKB.matrix, ic.scale=TRUE)
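For readers who want to see what happens between the aligned motif instances and the logo, the following base R sketch builds a count matrix, converts it to frequencies and computes the information content per position in bits. The sequences in toy.seqs are invented, count_matrix and information_content are hypothetical helper names, and no small-sample correction is applied, so the numbers need not match those of a dedicated logo package exactly.

```r
# Made-up, equally long motif instances
toy.seqs <- c("GGGAATTCCC", "GGGACTTTCC", "GGGAATTTCC", "GGGGCTTCCC")

# Count the letters A, C, G, T at each position
count_matrix <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  chars <- do.call(rbind, strsplit(seqs, ""))
  counts <- apply(chars, 2, function(column) {
    as.vector(table(factor(column, levels = alphabet)))
  })
  rownames(counts) <- alphabet
  counts
}

toy.counts <- count_matrix(toy.seqs)

# Transform counts to frequencies (each column sums to 1)
toy.freqs <- sweep(toy.counts, 2, colSums(toy.counts), "/")

# Information content per position: 2 + sum(p * log2(p)), with 0 * log2(0) := 0
information_content <- function(freqs) {
  terms <- ifelse(freqs > 0, freqs * log2(freqs), 0)
  2 + colSums(terms)
}

toy.ic <- information_content(toy.freqs)
```

A fully conserved position reaches the maximum of 2 bits, whereas flanking positions without sequence preference stay close to 0, which is the signal to look for when deciding whether the motif can be extended.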

2.3 Motif localisation

A typical part of a ChIP-seq analysis is looking at the motif distribution with respect to the estimated peak summit. The spread of this distribution can be a good measure of noise in the data. NF-κB has a high information content motif as well as sharp ChIP-seq peaks, so one expects to see motifs peaking sharply around the center of the plot. Below, we will create profiles for the three different peak sets, in order to get an idea of how the use of different peak-calling methods can influence our result.

# set window SIZE

mydist <- 200

# function to determine the motif profile around peak summits

getProfile <- function(NFKB.GR, pwm, window.size){

# get regions around summit

summits.GR <- GRanges(

seqnames=seqnames(NFKB.GR),

ranges=IRanges( start=elementMetadata(NFKB.GR)$maxpos - window.size,

end=elementMetadata(NFKB.GR)$maxpos + window.size),

strand="+"

)

# create unique names for all peaks

names(summits.GR) <- paste( seqnames(summits.GR), start(summits.GR), end(summits.GR), sep=":")

# get sequence

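elementMetadata(summits.GR)$seqs <- getSeq(Hsapiens, summits.GR, as.character=TRUE)

# carry over the summit positions, which ScanPeaks() needs to select the closest hit

elementMetadata(summits.GR)$maxpos <- elementMetadata(NFKB.GR)$maxpos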

# scan sequences with the PWM

summit.motifs <- ScanPeaks(summits.GR, pwm, "80%")

summit.motifs <- summit.motifs[!is.na(summit.motifs[,1]),]

# get all covered positions

motif.pos <- sapply(as.integer(summit.motifs[,1]), function(x) seq(x, x+ncol(pwm)-1))

motif.pos <- table(unlist(motif.pos))

# convert to data.frame

motif.pos <- data.frame(motif.pos)

names(motif.pos) <- c("position","frequency")

# shift positions relative to summit (the summit sits at window position window.size + 1)

motif.pos$position <- as.integer(as.character(motif.pos$position)) - (window.size + 1)

# ensure all positions are present in the output

profile <- data.frame(position=-window.size:window.size,

frequency=motif.pos$frequency[match(-window.size:window.size, motif.pos$position)]

)

profile[is.na(profile)] <- 0

return(profile)

}

# get the profiles for the 3 different peak sets

macs.profile <- getProfile(macspeaks, NFKB.pwm, mydist)

sissr.profile <- getProfile(sissrpeaks, NFKB.pwm, mydist)

chipseq.profile <- getProfile(chippeaks, NFKB.pwm, mydist)

# generate plots

pl1 <- xyplot(frequency~position, data=macs.profile,

type="l", main="motifs around MACS peak summits",

aspect=0.8

)

pl2 <- xyplot(frequency~position, data=sissr.profile,

type="l", main="motifs around SISSR peak summits",

aspect=0.8

)

pl3 <- xyplot(frequency~position, data=chipseq.profile,

type="l", main="motifs around chipseq peak summits",

aspect=0.8

)

# print plots

print(pl1, split=c(1,1,1,3), more=TRUE)

print(pl2, split=c(1,2,1,3), more=TRUE)

print(pl3, split=c(1,3,1,3))
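The counting step at the heart of getProfile() can be tried out on a handful of numbers without any genome data. The sketch below is a simplified stand-in written in base R: motif_profile is a hypothetical helper, the motif starts are invented, and it assumes 1-based motif starts within a window of width 2 * window.size + 1, so that the summit sits at window position window.size + 1.

```r
# Count how often each position around the summit is covered by a motif hit
motif_profile <- function(motif.starts, motif.width, window.size) {
  # all positions covered by the motif hits, in window coordinates
  covered <- unlist(lapply(motif.starts, function(s) seq(s, s + motif.width - 1)))
  # shift so that the summit becomes position 0
  relative <- covered - (window.size + 1)
  # tabulate over the full range so that uncovered positions are reported as 0
  positions <- -window.size:window.size
  data.frame(position = positions,
             frequency = as.integer(table(factor(relative, levels = positions))))
}

# three made-up hits of a 3 bp motif in windows of +/- 5 bp
toy.profile <- motif_profile(c(4, 5, 6), motif.width = 3, window.size = 5)
```

In this toy case the coverage rises towards position 0 and falls off symmetrically, which is the shape one hopes to see for a sharp, motif-driven peak set.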


3 Differential histone modification using DESeq

In the last part of the tutorial, we will show how you can assess differential protein occupancy using ChIP-seq data. As an example, we will investigate the distribution of histone modifications and identify regions that are significantly different in condition A versus condition B. For convenience, we will use ChIP-seq data for H3K36 trimethylation (H3K36me3) in two different cell types, namely HepG2 and K562, to illustrate the functionality of the package DESeq with ChIP-seq count data. DESeq was originally developed for the determination of differential expression using a test based on the negative binomial distribution (Anders and Huber, 2010), but it can in principle be used for any type of count data. For details on the underlying methodology, take a look at the publication and the vignette of the DESeq package.

First, we will import the library and prepare the intervals in which we would like to search for differential histone modifications. Determining the regions of interest can be done in a number of different ways: instead of the peak-based approach that we describe here, you could also partition the genome into equally sized regions, or use existing annotation, such as promoter regions.

library(DESeq)

# read the peak files

histone1.GR <- read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/H3K36me3-K562_peaks.xls"),

# the remaining arguments are missing in the source; header=TRUE is assumed here

header=TRUE)

histone1.GR <- subset(histone1.GR, chr %in% c("chr10","chr11","chr12"))

histone1.GR <- macs2GRanges(histone1.GR)

histone2.GR <- read.table(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/H3K36me3-HepG2_peaks.xls"),

# the remaining arguments are missing in the source; header=TRUE is assumed here

header=TRUE)

histone2.GR <- subset(histone2.GR, chr %in% c("chr10","chr11","chr12"))

histone2.GR <- macs2GRanges(histone2.GR)

# get absolute summit coordinates

elementMetadata(histone1.GR)$summit <- start(histone1.GR)+elementMetadata(histone1.GR)$maxpos

elementMetadata(histone2.GR)$summit <- start(histone2.GR)+elementMetadata(histone2.GR)$maxpos

# create union of both peak sets as total enquiry space

region.sum <- union(histone1.GR, histone2.GR)

names(region.sum) <- paste( seqnames(region.sum), start(region.sum), end(region.sum), sep=":")
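As an alternative to the peak union, the equally sized regions mentioned earlier can be generated with a few lines of base R. The sketch below is only an illustration: make_bins is a hypothetical helper, and the chromosome name and length are invented. For whole genomes, the GenomicRanges function tileGenome() serves a similar purpose.

```r
# Partition one chromosome into consecutive bins of equal size
# (the last bin is shortened if the length is not a multiple of bin.size)
make_bins <- function(chr, chr.length, bin.size) {
  starts <- as.integer(seq(1, chr.length, by = bin.size))
  ends <- as.integer(pmin(starts + bin.size - 1, chr.length))
  data.frame(chr = chr, start = starts, end = ends,
             name = paste(chr, starts, ends, sep = ":"),
             stringsAsFactors = FALSE)
}

# made-up chromosome of 2500 bp, split into 1 kb bins
toy.bins <- make_bins("chrToy", chr.length = 2500, bin.size = 1000)
```

The bin names follow the same chr:start:end pattern that is used for region.sum, so the bins could be passed through the subsequent counting steps in the same way.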

Once the intervals are defined, one needs to obtain the corresponding read counts. There are several ways of doing this. Below, we provide an example starting from a BED file of aligned sequence reads. This procedure takes a while and can be skipped in this tutorial, as counts have already been precalculated and can be imported from an RData object (see below).

#library(rtracklayer)

#

#bed.files <- c("SRR038461_chr10_chr11_chr12.bed","SRR038464_chr10_chr11_chr12.bed")

#

#read.list <- lapply(bed.files, function(bed.file){

# # import bed file

# reads <- import.bed(url(paste("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/",bed.file,sep="")),

# genome="hg19",asRangedData=FALSE)


# # mask strand information

# strand(reads) <- "*"

# # count reads into histone mark regions

# read.counts <- countOverlaps(region.sum, reads)

#

# return(read.counts)

#})

#

#save(read.list, file="/homes/zarnack/public_html/Helsinki_2012/data/Reads_per_peak_region.RData")
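What the counting step does can be illustrated on a few invented intervals. The following base R sketch mimics the behaviour of counting reads that overlap each region by at least one base, ignoring strand. The function count_overlaps and the toy tables are hypothetical and far too slow for real data, where countOverlaps should be used as shown above.

```r
# Count, for each region, the reads on the same chromosome that overlap it
count_overlaps <- function(regions, reads) {
  sapply(seq_len(nrow(regions)), function(i) {
    sum(reads$chr == regions$chr[i] &
        reads$start <= regions$end[i] &
        reads$end >= regions$start[i])
  })
}

# made-up regions and reads
toy.regions <- data.frame(chr = c("chr10", "chr10", "chr11"),
                          start = c(100, 500, 100),
                          end = c(200, 600, 200),
                          stringsAsFactors = FALSE)

toy.reads <- data.frame(chr = c("chr10", "chr10", "chr10", "chr11", "chr12"),
                        start = c(90, 150, 590, 300, 100),
                        end = c(125, 185, 625, 335, 135),
                        stringsAsFactors = FALSE)

toy.counts <- count_overlaps(toy.regions, toy.reads)   # 2, 1 and 0 reads
```

Reads that only partially overlap a region, such as the first and third toy read, are counted as well.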

Once the read counts per interval have been obtained, one can proceed with the differential modification analysis. The counts of all intervals for each condition are transformed into a CountDataSet object, which is the central data structure in DESeq. The effective library size for each sample is then estimated from the data using the function estimateSizeFactors(). In the next step, the variance is estimated from the mean, based on the assumption that the mean is a good predictor of the variance. The DESeq vignette describes in more detail how to verify the variance-mean dependence and presents some further analyses to confirm the variance estimations. We will not expand on this here, but we recommend that users read the documentation and generate the diagnostic plots before testing for differential modification in a real-life scenario.
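To give an intuition for the effective library size, the sketch below implements the median-of-ratios idea described by Anders and Huber (2010) in base R. It is a simplified illustration and not the code of the DESeq package: estimate_size_factors is a hypothetical helper and toy.table contains invented counts in which the second sample was sequenced exactly twice as deeply as the first.

```r
# Size factor of a sample: median ratio of its counts to the per-region
# geometric mean across all samples (regions containing a zero are skipped)
estimate_size_factors <- function(counts) {
  log.geo.means <- rowMeans(log(counts))
  apply(counts, 2, function(sample.counts) {
    usable <- is.finite(log.geo.means) & sample.counts > 0
    exp(median((log(sample.counts) - log.geo.means)[usable]))
  })
}

# made-up count table: 4 regions, 2 samples
toy.table <- cbind(s1 = c(10, 20, 30, 40),
                   s2 = c(20, 40, 60, 80))

toy.sf <- estimate_size_factors(toy.table)

# dividing each column by its size factor brings the samples to a common scale
toy.normalised <- sweep(toy.table, 2, toy.sf, "/")
```

Because the median is used, a small number of strongly changed regions has little influence on the estimate, which is why this approach is usually preferred over scaling by the total number of reads.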

library(DESeq)

# load the ChIP-seq counts per region

load(url("http://www.ebi.ac.uk/~zarnack/Helsinki_2012/data/Reads_per_peak_region.RData"))

# inspect the count lists

names(read.list)

lapply(read.list, head)

# set output dir for DESeq analysis

out.folder <- "<path.to.your.folder.for.DESeq.results>"

# construct a count table

dcounts <- do.call(cbind, read.list)

# define conditions (including replicates)

conds <- c("HepG2","HepG2","K562","K562")

# create and inspect the CountDataSet object

cds <- newCountDataSet(dcounts, conds)

cds

head(counts(cds))

# estimate size of the libraries (could also be determined externally)

cds <- estimateSizeFactors(cds)

# estimate variance

cds <- estimateDispersions(cds)

# test for differences between the base means of the two conditions

res <- nbinomTest(cds, "HepG2", "K562")

# inspect the result

head(res)

table(res$padj<0.05)

table(res$padj<0.05 & res$log2FoldChange>0)

We have obtained a table with (i) mean read counts per condition (as well as a joint estimate), (ii) the fold change of each region (raw and log2; calculated conditionB over conditionA, i.e. K562 over HepG2), and (iii) the p-value and, more importantly, the adjusted p-value for the statistical significance of this change. To get an overview of the results, we can now produce scatterplots of counts and estimated mean within replicates and between conditions.

# create a colour ramp

Lab.palette <- colorRampPalette(c("white", "yellow", "red"), space = "Lab")

# function to produce a nice scatterplot using the smoothScatter() function

nice_scatter <- function(x, y, name.x, name.y, type){

smoothScatter(x, y,

colramp=Lab.palette,

main=paste(name.x," vs. ",name.y,sep=""),

xlab=paste(type," ",name.x,sep=""), ylab=paste(type," ",name.y,sep="")

)

# add a local regression line

lines(lowess(x,y),col="red", lty="dotted")

# add Spearman correlation

legend("topleft", paste("Spearman cor = ",round(cor(x,y,method="spearman"),digits=2)))

}

# generate 4 different plots

par(mfrow=c(2,2))

# HepG2 replicates (add a pseudocount of 1 to the raw counts to mask zeros)

nice_scatter(log2(dcounts[,1]+1), log2(dcounts[,2]+1),

"HepG2 1","HepG2 2","log2(counts)"

)

# K562 replicates

nice_scatter(log2(dcounts[,3]+1), log2(dcounts[,4]+1),

"K562 1","K562 2","log2(counts)"

)

# HepG2 vs. K562 (using mean)

nice_scatter(log2(res$baseMeanA+1), log2(res$baseMeanB+1),

"HepG2","K562","log2(mean)"

)

# set a significance cutoff (adjusted p-value<0.01 plus a fold change>1.5)

cutoff <- res$padj<0.01 & (res$log2FoldChange>log2(1.5) | res$log2FoldChange<(-log2(1.5)))

# MA-type plot of significant changes

plot( log2(res$baseMean), res$log2FoldChange,

pch=19, col=ifelse(cutoff, "red","black"),

main="significant changes"

)
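The significance cutoff used for the MA-type plot can also be explored on a small invented result table. The sketch below uses base R only: toy.res and its values are made up, the fold change is calculated as condition B over condition A as in the result table, and the p-values are adjusted with the Benjamini-Hochberg method via p.adjust(), which we assume corresponds to the adjustment behind the padj column.

```r
# made-up result table with mean counts per condition and raw p-values
toy.res <- data.frame(
  id = paste0("region", 1:6),
  baseMeanA = c(100, 50, 10, 200, 80, 5),
  baseMeanB = c(400, 55, 40, 190, 20, 6),
  pval = c(0.0001, 0.6, 0.03, 0.8, 0.0005, 0.9),
  stringsAsFactors = FALSE
)

# log2 fold change (B over A) and multiple testing correction
toy.res$log2FoldChange <- log2(toy.res$baseMeanB / toy.res$baseMeanA)
toy.res$padj <- p.adjust(toy.res$pval, method = "BH")

# adjusted p-value < 0.01 plus a fold change > 1.5 in either direction
toy.cutoff <- toy.res$padj < 0.01 & abs(toy.res$log2FoldChange) > log2(1.5)

toy.significant <- toy.res$id[toy.cutoff]   # "region1" and "region5"
```

Note that region3 shows a fourfold change but does not pass the adjusted p-value threshold, which illustrates why both criteria are combined.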


References:

• "Some Basic Analysis of ChIP-Seq Data", Bioconductor. http://www.bioconductor.org/packages/release/bioc/vignettes/chipseq/inst/doc/Workflow.pdf

• "Efficient genome searching with Biostrings and the BSgenome data packages", Bioconductor. http://www.bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf

• "Analysing RNA-Seq data with the DESeq package", Bioconductor. http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

• Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. (2010) Variation in transcription factor binding among humans. Science 328:232-5.

• Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(Web Server issue):W202-8.

• Anders S, Huber W. (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106.
