Introduction to Chromatin IP – sequencing (ChIP-seq) data analysis
Stockholm, 7 November 2018
Agata Smialowska, NBIS, SciLifeLab, Stockholm University
Workshop on ChIP-seq data analysis
Chromatin state and gene expression
PEV: position effect variegation in Drosophila eye (nature.com)
Juxtaposition of eye colour genes with heterochromatin results in the “mottled” eye colouration (red and white).
Proteins which bind heterochromatin act to “spread” the silencing signal by providing a forward feedback loop.
Heterochromatin Protein 1; histone methyltransferase Su(var)3-9; H3K9 methylation
First observed by H. Muller, 1930
Chromatin immunoprecipitation
[Figure credit: RnD systems]
Applications
• General transcription machinery
• Promoter-associated transcription factors
• Distal enhancers
• Histone modifications and variants; activation states
• Co-factors
ChIP-seq workflow
Liu, Pott and Huss, BMC Biology 2010
Workflow of a ChIP-seq study
• design study
• obtain input chromatin
• perform precipitation
• construct library
• sequence library
• bioinformatic analysis
Wet lab
Critical factors
• Antibody selection
• Proper control sample (input chromatin or mock IP)
• Library cloning and sequencing
• Algorithm for peak detection
• Enough material and biological replicates
• Reproducibility in chromatin fragmentation
• Cross-linker choice
Experiment design
• Sound experimental design: replication, randomisation and blocking (R.A. Fisher, 1935)
• In the absence of a proper design, it is essentially impossible to partition biological variation from technical variation
• Sequencing depth: depends on the structure of the signal; cannot be linearly scaled to genome size
• Single- vs. paired-end reads: PE improves read mapping confidence and gives a direct measure of fragment size, which otherwise has to be modelled or estimated
Ideal design:
• Each sample has a matched input
• Input sequenced to a comparable depth as the IP sample
[Figure: schematics of input and ChIP replicate layouts (input, library/sequencing, ChIP replicates) marked ✓ or X, and of under-sequenced versus well-sequenced input alongside ChIP]
Experiment design
Biological replicates and randomisation
• Technical replicates are generally a waste of time and money
• Many studies do not account for batch effects: (i) time, (ii) origin
• ≥2 biological replicates for site identification
• ≥3 biological replicates for differential binding
• If you need to pool your data, then it is under-sequenced
[Figure: schematics of samples, replicates, libraries and sequencing laid out over time (experiment 1, experiment 2, experiment 3, …) and by origin, marked ✓ or X; pooled data, under-sequenced data and actual replicates, marked ✓ or X]
Importance of sequencing depth
Sequencing depth depends on data type
• Signal types: point-source, mixed signal, broad signal
• Example targets: transcription factors, chromatin remodellers, histone marks, RNA polymerase II
• Human: TF: 20M; H3K4me3: 25M; H3K36me3: 35M; H3K27me3: 40M; H3K9me3: >55M
• No clear guidelines for mixed and broad type of peaks
Source: The ENCODE consortium; Jung et al, NAR 2014
• ChIP–sequencing: introduction from a bioinformatics point of view
• Principles of analysis of ChIP-seq data
• ChIP-seq: downstream analyses
• Resources
Chromatin = DNA + proteins
Park, Nature Rev Genetics, 2009
Workflow of a ChIP-seq study
Iterative process; wet lab followed by data analysis
• design study
• obtain input chromatin
• perform precipitation
• construct library
• sequence library
• library quality control
• filter sequences
• align sequences
• filter alignments
• identify peaks / regions of enrichment
• assess data quality
• understand the data / results
• downstream analyses
Two questions to address
• 1. Did the ChIP part of the ChIP-seq experiment work? Was the enrichment successful?
• 2. Where are the binding sites (of the protein of interest)?
Word of caution!
ChIP-seq experiments are more unpredictable than RNA-seq! Error sources:
• chromatin structure
• PCR over-amplification
• non-specific antibody
• other things?
ChIP-seq QC: did the ChIP work?
• 1. Inspect the signal (mapped reads, coverage profiles) in genome browser
• 2. Compute peak-independent quality metrics (cross-correlation, cumulative enrichment)
• 3. Assess replicate consistency (correlations between replicates of the same condition)
Things to look at: tag density distribution, reproducibility, similarity of coverage signal at known sites, …
Aims: spotting inconsistencies, confounding factors, under-sequenced libraries, …
How do I know my data is of good quality?
Marinov et al, G3 2013
Library complexity
• Sequence duplication level >80% (low complexity library)
• Quality control: tag uniqueness – library complexity metric
• NRF: non-redundant fraction (of reads): proportion of unique tags / total
[Figure credit: FastQC, Babraham Institute]
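A minimal sketch of the NRF calculation, assuming each mapped read is summarised as a hypothetical (chromosome, 5ʹ position, strand) tuple; real pipelines would derive these from a BAM file, which is not shown here.

```python
def non_redundant_fraction(reads):
    """Proportion of distinct mapping positions among all mapped reads.

    reads: iterable of (chrom, five_prime_position, strand) tuples
    (an assumed, simplified representation of aligned reads).
    """
    reads = list(reads)
    if not reads:
        raise ValueError("no reads given")
    # Reads sharing chromosome, position and strand count as one unique tag.
    return len(set(reads)) / len(reads)


# Toy example: four reads, two of them duplicates of each other.
toy_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 150, "-"),
    ("chr2", 40, "+"),
]
nrf = non_redundant_fraction(toy_reads)
print(f"NRF = {nrf:.2f}")
```

Values close to 1 suggest a complex library; low values point towards heavy duplication.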
How do I know my data is of good quality?
Marinov et al, G3 2013
Objective (i.e. peak-independent) metrics to quantify enrichment in ChIP-seq; for TF in mammalian systems:
• Normalised Strand Correlation (NSC)
• Relative Strand Correlation (RSC)
Large-scale quality analysis of published ChIP-seq datasets:
• 20% low quality
• 25% intermediate quality
• 30% of inputs have metrics similar to IPs
Strand cross-correlation
Carroll et al, Front Genet 2014
The correlation between the signal of the 5ʹ ends of reads on the (+) and (-) strands is assessed after successive shifts of the reads on the (+) strand, and the point of maximum correlation between the two strands is used as an estimate of fragment length.
[Figure: cross-correlation profile; x axis: strand shift, y axis: cross-correlation]
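The shifting procedure can be sketched on toy data as below. This is an illustration only, assuming per-base counts of read 5ʹ ends are already available as plain lists (the names `plus`, `minus` and the toy binding sites are made up); dedicated tools work on real alignments and are far more efficient.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cross_correlation_profile(plus, minus, max_shift):
    """Correlate (+) strand counts, shifted downstream by 0..max_shift, with (-) strand counts."""
    n = len(plus)
    # Shifting (+) by `shift` pairs plus[i] with minus[i + shift].
    return {
        shift: pearson(plus[: n - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


# Toy region: three binding sites, 5' ends 60 bp either side of each site,
# so (+) and (-) clusters sit 120 bp apart (the simulated fragment length).
length = 900
plus = [0] * length
minus = [0] * length
for site in (100, 350, 700):
    plus[site - 60] += 5
    minus[site + 60] += 5

profile = cross_correlation_profile(plus, minus, max_shift=200)
estimated_fragment_length = max(profile, key=profile.get)
print("estimated fragment length:", estimated_fragment_length)
```

In this toy case the correlation peaks at the simulated strand separation; on real data the profile is noisier and usually shows an additional peak at the read length.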
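Strand cross-correlation
Carroll et al, Front Genet 2014
\[ \mathrm{NSC} = \frac{\text{Max CC value (fLen)}}{\text{Min CC}} \]
\[ \mathrm{RSC} = \frac{\text{Max CC} - \text{Min CC}}{\text{Phantom CC} - \text{Min CC}} \]
Max CC is the cross-correlation at the fragment length (fLen); Phantom CC is the cross-correlation at the read length.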
Cross-correlation plots
[Figure: five cross-correlation profiles; x axis: strand-shift (−500 to 1500), y axis: cross-correlation]
• ENCFF000OWMed.sorted.1.bam.picard.bam: strand-shift (105,455); NSC=1.14102, RSC=1.06452, Qtag=1
• ENCFF000PET.sorted.1.bam.picard.bam: strand-shift (100,265,245); NSC=1.01443, RSC=0.289702, Qtag=−1
• ENCFF000PMG.sorted.1.bam: strand-shift (130); NSC=1.28071, RSC=0.987276, Qtag=0
• ENCFF000PMJ.sorted.1.bam: strand-shift (125); NSC=1.21367, RSC=1.39752, Qtag=1
• ENCFF000PON.sorted.1.bam.picard.bam: strand-shift (90,200,210); NSC=1.0166, RSC=0.92739, Qtag=0
Panel labels: very good enrichment; acceptable enrichment; poor enrichment, possibly undersequenced; no clustering (good input); read clustering (bad input)
Cumulative enrichment, aka “fingerprint”, is another metric for successful enrichment
[Figure: cumulative enrichment curves for Input and ChIP]
http://deeptools.readthedocs.org; Diaz et al, Genome Biol 2012
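The idea behind the fingerprint plot can be sketched as follows, assuming read counts per genomic bin are already available as a list (the toy counts are invented). This is not the deepTools implementation, only the underlying calculation: rank bins by coverage and accumulate the fraction of reads they hold.

```python
def fingerprint(bin_counts):
    """Cumulative fraction of reads over bins ranked from lowest to highest count."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("no reads in any bin")
    cumulative = []
    running = 0
    for count in sorted(bin_counts):
        running += count
        cumulative.append(running / total)
    return cumulative


# Toy data: an input with uniform coverage and a ChIP with one strongly enriched bin.
input_bins = [10] * 10
chip_bins = [1] * 9 + [91]

input_curve = fingerprint(input_bins)
chip_curve = fingerprint(chip_bins)

# Fraction of reads held by the 90% lowest-covered bins.
print("input:", input_curve[8])
print("ChIP: ", chip_curve[8])
```

A uniform input follows the diagonal, whereas a well-enriched ChIP keeps most of its reads in a small number of top-ranked bins, so its curve stays low and rises sharply at the end.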
Park, Nature Rev Genetics, 2009
Peak calling
Appropriate methodologies depend on data type
• Signal types: punctate, mixed signal, broad signal
• Example targets: transcription factors, chromatin remodellers, histone marks, RNA polymerase II
• Tools: SPP, MACS2; for broader signal, MACS2 in broad mode and windows approaches
This is an active area of algorithm development
Principle of peak detection
• Symmetry in reads mapped to opposite DNA strands
• Computation of enrichment model
Pepke, 2009
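One simple way to picture an enrichment model is a Poisson test of the ChIP read count in a window against the count expected from the control. The sketch below is a deliberately simplified illustration with made-up counts; it is not the model of any particular peak caller, and real tools estimate the background locally and correct for multiple testing.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def window_p_value(chip_count, control_count, chip_total, control_total):
    """Enrichment p-value for one window, scaling control to the ChIP library size."""
    scale = chip_total / control_total
    # Assumption: at least one expected read, to avoid a zero background rate.
    expected = max(control_count * scale, 1.0)
    return poisson_upper_tail(chip_count, expected)


# Toy windows from libraries of equal size (hypothetical numbers).
p_enriched = window_p_value(chip_count=30, control_count=5,
                            chip_total=1_000_000, control_total=1_000_000)
p_background = window_p_value(chip_count=5, control_count=5,
                              chip_total=1_000_000, control_total=1_000_000)
print(p_enriched, p_background)
```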
Point-source vs. broad peak detection
Wilbanks 2010
• Sequence-specific binding (TFs)
• Distributed binding (histones, RNA pol2)
Comparison of peak calling algorithms
Wilbanks 2010
Peak overlap (Ho et al, 2012)
>50%
20%
“Hyper-chippable” regions
Carroll et al, Front Genet 2014
• DER – Duke Excluded Regions (11 repeat classes)
• UHS – Ultra High Signal (open chromatin)
• DAC – consensus excluded regions
Reads mapped to these regions should be filtered out prior to peak calling.
Tracks available from UCSC for human, mouse, fly and worm.
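Filtering reads against excluded regions amounts to an interval-overlap test. A minimal sketch follows, assuming reads and excluded regions are given as hypothetical (chrom, start, end) tuples with half-open coordinates; in practice this is done on BAM and BED files with dedicated tools.

```python
def overlaps_any(chrom, start, end, regions):
    """True if the half-open interval [start, end) overlaps any region on the same chromosome."""
    return any(
        region_chrom == chrom and start < region_end and region_start < end
        for region_chrom, region_start, region_end in regions
    )


def filter_reads(reads, excluded_regions):
    """Keep only reads that do not overlap an excluded region."""
    return [
        read for read in reads
        if not overlaps_any(read[0], read[1], read[2], excluded_regions)
    ]


# Toy example: one excluded region on chr1 (coordinates are invented).
excluded = [("chr1", 1000, 2000)]
reads = [
    ("chr1", 950, 1001),   # overlaps the region start: removed
    ("chr1", 2000, 2050),  # starts exactly at the region end: kept
    ("chr1", 500, 550),    # outside: kept
    ("chr2", 1500, 1550),  # other chromosome: kept
]
kept = filter_reads(reads, excluded)
print(len(kept), "of", len(reads), "reads kept")
```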
Quality considerations
• ChIP-seq quality guidelines from the ENCODE project (relative strand cross-correlation, irreproducible discovery rate)
• Antibody validation
• Appropriate sequencing depth (depending on genome size and peak type). For the human genome and broad-source peaks, a minimum of 40-50M reads is required.
• Experimental replication
• Fraction of reads in peaks (FRiP) > 1%
• Cross-correlation (correlation of the density of sequences aligned to opposite DNA strands after shifting by the fragment size)
• Experimental verification of known binding sites (and sites not bound as negative controls)
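FRiP can be sketched as below, assuming each read is reduced to a single hypothetical (chrom, position) pair and peaks are non-overlapping half-open intervals; the toy coordinates are invented.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, position)
    peaks: list of (chrom, start, end), half-open and non-overlapping
    """
    if not read_positions:
        raise ValueError("no reads given")
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak starting at or before the read position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [("chr1", 150), ("chr1", 250), ("chr1", 599), ("chr1", 600), ("chr2", 150)]
frip_value = frip(reads, peaks)
print(f"FRiP = {frip_value:.2f}; above the 1% guideline: {frip_value > 0.01}")
```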
ChIP-exo: improvement in binding site identification
Rhee and Pugh, Cell 2011
Other functional genomics techniques
Clifford et al, Nature Rev Genet, 2014
ChIP-seq downstream analyses
• Validation (wet lab)
• Downstream analysis
– Motif discovery
– Annotation
– Integration of binding and expression data
– Integration of various binding datasets
– Differential binding
Signal visualisation and interpretation
Binding profile of a TF in relation to the transcription start site
Tools: deepTools, ngsplots, seqMiner
• Clustering
• Heatmaps
• Profiles
• Comparison of different datasets
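A binding profile around transcription start sites is, at its core, an average of coverage in strand-aware windows centred on each TSS. A minimal sketch, assuming per-base coverage is available as plain lists per chromosome and TSSs as hypothetical (chrom, position, strand) tuples; the tools listed above do this at scale with binning and normalisation.

```python
def tss_profile(coverage, tss_list, flank):
    """Mean coverage from -flank to +flank around each TSS, oriented 5' to 3'.

    coverage: dict mapping chrom -> list of per-base coverage values
    tss_list: list of (chrom, position, strand) with strand "+" or "-"
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for chrom, pos, strand in tss_list:
        cov = coverage[chrom]
        # Skip TSSs whose window would run off the chromosome.
        if pos - flank < 0 or pos + flank >= len(cov):
            continue
        window = cov[pos - flank: pos + flank + 1]
        if strand == "-":
            window = window[::-1]  # flip so upstream is always on the left
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    if used == 0:
        raise ValueError("no TSS window fits inside the coverage")
    return [total / used for total in totals]


# Toy coverage: signal 1 bp upstream of a (+) TSS at 5 and of a (-) TSS at 15.
toy_coverage = {"chr1": [0] * 20}
toy_coverage["chr1"][4] = 10
toy_coverage["chr1"][16] = 10
toy_tss = [("chr1", 5, "+"), ("chr1", 15, "-")]

profile_around_tss = tss_profile(toy_coverage, toy_tss, flank=2)
print(profile_around_tss)
```

Flipping windows for genes on the (-) strand keeps "upstream" on the same side for every gene, which is what makes the averaged profile interpretable.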
Further reading
• Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data. Carroll et al, Front. Genet. 2014
• Impact of sequencing depth in ChIP-seq experiments. Jung et al, NAR 2014
• ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Landt et al, Genome Res. 2012
• http://genome.ucsc.edu/ENCODE/qualityMetrics.html#definitions
• https://www.encodeproject.org/data-standards
Bioconductor ChIP-seq resources
• General purpose tools:
– Rsubread (read mapping; not ideal for global alignment)
– Rbowtie (global alignment)
– GenomicRanges (tools for manipulating range data)
– Rsamtools (SAM/BAM support)
– htSeqTools (tools for NGS data; post-alignment QC)
– chipseq (utilities for ChIP-seq analysis)
• Peak calling
– SPP
– BayesPeak (HMM and Bayesian statistics)
– MOSAiCS (MOdel-based one and two Sample Analysis and Inference for ChIP-Seq)
– iSeq (Hidden Ising models)
– ChIPseqR (developed to analyse nucleosome positioning data)
– csaw (a pipeline for ChIP-seq analysis, including statistical analysis of differential occupancy)
• Quality control
– ChIPQC
• Differential occupancy
– edgeR
– DESeq2
– DiffBind (compatible with objects used for ChIPQC; wrapper for DESeq and edgeR DE functions)
• Peak annotation
– ChIPpeakAnno (annotating peaks with genome context information)
– ChIPSeeker (functional annotation of peaks)
The Epigenomics Roadmap Project
http://www.roadmapepigenomics.org/
• Reference human epigenomes
• DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts
• Stem cells and primary ex vivo tissues
• 111 tissue and cell types
• 2,804 genome-wide datasets
Questions?
• ChIP–sequencing: introduction from a bioinformatics point of view
• Principles of analysis of ChIP-seq data
• ChIP-seq: downstream analyses
• Resources
• Exercise overview
Exercise
• 1. Quality control
• 2. Read preprocessing
• 3. Peak calling
• 4. Exploratory analysis (sample clustering)
• 5. Visualisation
Did my ChIP work?
[Figure: cross-correlation profile (x axis: strand-shift, y axis: cross-correlation) and cumulative enrichment plot for ENCFF000PED.chr12.bam; strand-shift (100); NSC=2.50193, RSC=1.87725, Qtag=2]
Exploratory analysis
• Clustering of libraries by reads mapped in bins, genome-wide (Spearman)
• Clustering of libraries by reads mapped in peaks (Pearson)
[Figure: clustered correlation heatmaps; sample labels: HeLa, Sknsh, HepG2, neural; libraries marked Ch and I]
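The Spearman correlation used for clustering libraries by binned read counts can be sketched as follows, assuming each library is already summarised as a list of counts per genomic bin (the toy counts are invented). Spearman is simply the Pearson correlation of the ranks.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


# Toy binned counts: two ChIP replicates and one input library.
chip_rep1 = [5, 40, 8, 90, 3, 60]
chip_rep2 = [7, 35, 9, 80, 2, 70]
input_lib = [20, 18, 22, 19, 21, 20]

rho_replicates = spearman(chip_rep1, chip_rep2)
rho_chip_input = spearman(chip_rep1, input_lib)
print(rho_replicates, rho_chip_input)
```

Replicates of the same condition are expected to correlate more strongly with each other than with input libraries.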
Binding profile around TSS
That’s all for now, time to do some hands-on work
Library quality control and preprocessing
• FastQC / Prinseq
• Trim adapters if any adapter sequences are present in the reads (as determined by the QC)
• In some cases, you’ll observe k-mer enrichment (especially if the data is ChIP-exo, a new variation of ChIP-seq). It is not necessarily a bad thing if sequence duplication levels are low. However, it may indicate low complexity of the library, a warning sign that the enrichment in ChIP was not successful or that the libraries are over-amplified (often the latter is the consequence of the former).
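To make the adapter-trimming step concrete, here is a toy sketch that removes an exact 3ʹ adapter match (complete, or partial at the read end). The adapter sequence and the `min_overlap` parameter are assumptions chosen for illustration; dedicated trimming tools also tolerate mismatches and handle base qualities, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an exact 3' adapter occurrence from a read sequence."""
    # Case 1: the whole adapter occurs somewhere in the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with a prefix of the adapter (longest prefix first).
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Assumed adapter sequence, for illustration only.
ADAPTER = "AGATCGGAAGAGC"

trimmed_full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)
trimmed_partial = trim_adapter("ACGTACGTAGATC", ADAPTER)
untouched = trim_adapter("ACGTACGT", ADAPTER)
print(trimmed_full, trimmed_partial, untouched)
```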
Mapping reads to the reference genome
• Choose the right reference: assembly version (the newest is not always the best) and type (primary assembly, or assembly from individual chromosome sequences + non-chromosomal contigs; not the toplevel assembly); choose the matching annotation file (GTF, GFF)
• Read mapping: global alignment
• Mappers (= aligners): Bowtie, BWA, BBMap, Novoalign, … (lots of tools are available)
• Visualise data in genome browser
– BAM files or tracks (wig, bedgraph, bigWig)
– Local (IGV) or web-based (UCSC genome browser)
– Data quality assessment
Cross-correlation profiles, RSC and NSC
• Metrics to quantify the fragment length signal and the ratio of fragment length signal to read length signal
• Normalised Cross Correlation (NSC):
\[ \mathrm{NSC} = \frac{\mathrm{CC}(\text{fragment length})}{\min(\mathrm{CC})} \]
• Relative Cross Correlation (RSC), ChIP to artifact signal:
\[ \mathrm{RSC} = \frac{\mathrm{CC}(\text{fragment length}) - \min(\mathrm{CC})}{\mathrm{CC}(\text{read length}) - \min(\mathrm{CC})} \]
• TFs: fragment lengths are often greater than the size of the DNA binding event, so the distinct clustering of (+) and (-) reads around the site is very apparent
• NSC > 1.1 (higher values indicate more enrichment; 1 = no enrichment)
• RSC > 0.8 (0 = no signal; <1 low quality ChIP; >1 high enrichment)
• Broad peaks: this clustering may be more diffuse (fragment length < peak)
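The two metrics can be computed from a cross-correlation profile as sketched below. The profile is assumed to be a dict mapping strand shift to cross-correlation value, and the toy numbers are invented; the thresholds are the guideline values quoted above.

```python
def nsc_rsc(profile, fragment_length, read_length):
    """Return (NSC, RSC) from a {shift: cross-correlation} profile."""
    min_cc = min(profile.values())
    fragment_cc = profile[fragment_length]
    phantom_cc = profile[read_length]  # peak at read length ("phantom" peak)
    if min_cc == 0 or phantom_cc == min_cc:
        raise ValueError("profile too flat to compute NSC/RSC")
    nsc = fragment_cc / min_cc
    rsc = (fragment_cc - min_cc) / (phantom_cc - min_cc)
    return nsc, rsc


# Toy profile: minimum 0.20, read-length peak 0.21 at shift 36,
# fragment-length peak 0.23 at shift 120 (hypothetical values).
toy_profile = {0: 0.20, 36: 0.21, 120: 0.23, 500: 0.20}
nsc, rsc = nsc_rsc(toy_profile, fragment_length=120, read_length=36)

passes_guidelines = nsc > 1.1 and rsc > 0.8
print(f"NSC = {nsc:.2f}, RSC = {rsc:.2f}, passes: {passes_guidelines}")
```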