Integrate Linux Mint 17.1 to Windows Server 2012 Active Directory Domain Controller
mint + annotatr: a pipeline to integrate and annotate DNA ...mint + annotatr: a pipeline to...
Transcript of mint + annotatr: a pipeline to integrate and annotate DNA ...mint + annotatr: a pipeline to...
mint + annotatr: a pipeline to integrate and annotate DNA methylation and hydroxymethylation dataRaymond G. Cavalcante1, Yanxiao Zhang1, Yongseok Park2, Snehal Patil1, Maureen A. Sartor1,3,4 1 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA, 2 Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA 3 Department of Biostatistics, University of Michigan, Ann Arbor, USA, and 4 Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
bismark methylation extractor
macs2 PePr
methylSig
Classifications
wrt tocondition 1 Hyper hmc Hypo hmc No DM No signal
Hyperhmc + mc
Hyper mc Hyper hmc
Hyper mc Hypo hmc Hyper mc Hyper mc
Hypohmc + mc
Hypo mc Hyper hmc
Hypo mc Hypo hmc Hypo mc Hypo mc
No DM Hyper hmc Hypo hmc No DM No DM
No signal Hyper hmc Hypo hmc No DM unclassifiable
hmc peak No hmc peak No signal
Highhmc + mc hmc mc hmc or mc
Lowhmc + mc hmc mc (low) hmc or mc (low)
Nohmc + mc hmc no methylation no methylation
No signal hmc no methylation unclassifiable
<1Kb1-5Kb
txStart txEnd
5’UTR 3’UTR
CDS
Promoter
5’UTR exon
5’UTR intron
CDS intron CDS exon
3’UTR intron
3’UTR exon
cdsStart cdsEnd
CpG island (CGI)
2kb 2kb
CpG shore
CpG shelf
open sea (interCGI)open sea (interCGI)
!
CpG shore
CpG shelf
2kb2kb
<1Kb1-5Kb
txStart txEnd
5’UTR 3’UTR
CDS
Promoter
5’UTR exon
5’UTR intron
CDS intron CDS exon
3’UTR intron
3’UTR exon
cdsStart cdsEnd
CpG island (CGI)
2kb 2kb
CpG shore
CpG shelf
open sea (interCGI)open sea (interCGI)
!
CpG shore
CpG shelf
2kb2kb
Motivation for integrationDNA methylation occurs in a variety of forms. 5-methylcytosine (mc) is the most widely studied, and there is ample evidence for its importance in development and gene regulation. 5-hydroxymethylcytosine (hmc) is a less abundant, but appears to be a good marker for epigenetic changes correlating with changes in gene expression.
The problem: Widely-used assays measuring methylation (e.g. BSseq & RRBS) do not distinguish mc and hmc. This may confuse biological interpretation because it is unclear which mark is playing a role relative to a phenotype.
Integrating assays measuring mc+hmc (e.g. BSseq & RRBS) with others measuring mc (e.g. oxBSseq & MeDIPseq) or hmc (e.g. TABseq & hMeDIPseq), helps differentiate the mc and hmc marks from one another.
C mc hmc
T C C
BSseq RRBS
mc+hmc
mc
hmc
RRBS
hMeDIP
mint integration
The solution: We developed the mint pipeline to • be a complete pipeline from raw reads to interpretation • do sample methylation quantification • do differential methylation analysis between groups • differentiate between regions of mc vs hmc by integrating
data from methylation assays using a simple classifier • automatically generate UCSC genome browser tracks • provide summaries of genomic annotations to facilitate
biological interpretation.
Supported designsHybrid
Bisulfite-based assay and pulldown assay.
Pulldown
Pulldown assays for mc and hmc.
or
andSample-wise
Integration on a per-sample basis.
Comparison-wise
Integration on a per-group basis.
and or
+ +
Overall mint workflowQC
FastQC + MultiQC
Adapter / Quality Trimming
Trim Galore
Alignment bismark / bowtie2
Sample Meth.Quantification
bismark / macs2
Differential Meth. methylSig [1] /
PePr [2]
Classification
UCSC Browser Tracks
Annotation and Visualization annotatr [3]
Setting up and running mintOn a Mac or Linux system: 1. Get mint at github.com/sartorlab/mint 2. Setup dependencies and reference genomes. 3. Annotate the experimental design. 4. Initiate the project with:
Rscript init.R —project name —genome g —datapath path 5. Customize the config.mk file with desired parameters. 6. Use make to run the analysis modules as per the design:
make bisulfite_align make bisulfite_compare make compare_classification
make pulldown_align make pulldown_sample make pulldown_compare make sample_classification
References1. Park Y, et al. (2014) “MethylSig: a whole genome DNA methylation analysis
pipeline” Bioinformatics. 2. Zhang Y, et al. (2014) “PePr: A peak-calling prioritization pipeline to identify
consistent or differential peaks from replicated ChIP-Seq data” Bioinformatics. 3. Cavalcante RG & Sartor MA (2016) “annotatr: Associating genomic regions
with genomic annotations” bioRxiv.
This work was funded by NIH grants RO1CA158286S1 and T32GM070499.
1
2
3
4
5
Quality control
Classification schemasSample classification
Comparison classification
annotatr: simple, fast & flexible annotation of genomic regions
Annotation and Visualization
We developed a general purpose R package, annotatr [3] that: • annotates genomic regions in BED format to pre-built human
and mouse CpG, genic (with Entrez Gene IDs and symbols), and enhancers (below), or custom genomic annotations
• reports all annotations overlapping input genomic regions to highlight multiple-annotations rather than using a prioritization
• summarizes and plots annotations with associated data (right) • is faster than similar R packages, and on par with bedtools.
Get annotatr at github.com/rcavalcante/annotatr UCSC Genome Browser Track Hub
0
10000
20000
30000
CpG islands
CpG shores
CpG shelvesinterCGI
enhancers1to5kb
promoters5UTRs
exonsintrons
3UTRs
intergenic
Annotations
# C
pGs
Data Random Regions
Regions per annotationCpG islands CpG shores CpG shelves interCGI
enhancers 1to5kb promoters 5UTRs
exons introns 3UTRs intergenic
0.00
0.05
0.10
0.15
0.00
0.05
0.10
0.15
0.00
0.05
0.10
0.15
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
Percent Methylation
dens
ity
All perc_meth perc_meth in annot_type
Percent methylation over annotations
CpG islands CpG shores CpG shelves interCGI
enhancers 1to5kb promoters 5UTRs
exons introns 3UTRs intergenic
0.0000.0250.0500.075
0.0000.0250.0500.075
0.0000.0250.0500.075
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
Fold Changede
nsity
All fold fold in annot_type
macs2 fold change over annotations
0.00
0.25
0.50
0.75
1.00
All
IDH2mutNBM
Random Regions
DM status
Prop
ortio
n AnnotationsCpG islandsCpG shoresCpG shelvesinterCGI
PePr peaks by annotation
CpG islands CpG shores CpG shelves interCGI
enhancers 1to5kb promoters 5UTRs
exons introns 3UTRs intergenic
012345
012345
012345
−50 0 50 −50 0 50 −50 0 50 −50 0 50Methylation Difference (IDH2mut − NBM)
−log
10(p
valu
e)
methylSig meth. diff. vs −log10(pval)
0.00
0.25
0.50
0.75
1.00
All
mc_hmc_high
mc_hmc_med
mc_hmc_low
mc_hmc_none
Random Regions
Simple classification
Prop
ortio
n AnnotationsCpG islandsCpG shoresCpG shelvesinterCGI
Bis. Simple classification by annotation
0.00
0.25
0.50
0.75
1.00
All
IDH2mutNBM
noDM
Random Regions
DM status
Prop
ortio
n
Annotationsenhancers1to5kbpromoters5UTRsexonsintrons3UTRsintergenic
methylSig DM status by annotation
0.00
0.25
0.50
0.75
1.00
All
hype
r_mc_
hype
r_hmc
hypo
_mc_
hypo
_hmc
hype
r_mc_
hypo
_hmc
hypo
_mc_
hype
r_hmc
hype
r_mc
hype
r_hmc
hypo
_mc
hypo
_hmc
no_D
M
Rando
m Region
s
Compare classification
Prop
ortio
n
Annotationsenhancers1to5kbpromoters5UTRsexonsintrons3UTRsintergenic
Compare classification by annotation
Scalechr21:
IDH2mut_1_mc_hmc_bisulfite_simp_class
IDH2mut_1_hmc_pulldown_peaks
RefSeq Genes
SequencesSNPs
Human mRNAs
DNase Clusters
Txn Factor ChIP
RhesusMouse
DogElephantChicken
X_tropicalisZebrafishLamprey
Common SNPs(146)
RepeatMasker
1 kb hg1944,078,500 44,079,000 44,079,500 44,080,000 44,080,500 44,081,000 44,081,500 44,082,000 44,082,500
IDH2mut_v_NBM_compare_classification
IDH2mut_v_NBM_hmc_pulldown_DM_PePr_peaks
IDH2mut_v_NBM_mc_hmc_bisulfite_DM_methylSig
IDH2mut_1_hmc_pulldown_simple_classification
IDH2mut_1_mc_hmc_bisulfite_simple_classification
IDH2mut_1_mc_hmc_bisulfite_percent_methylation
IDH2mut_1_hmc_pulldown_macs2_peaks
IDH2mut_1_hmc_pulldown_coverage
IDH2mut_1_hmc_input_pulldown_coverage
UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics)
RefSeq Genes
Publications: Sequences in Scientific Articles
Human mRNAs from GenBank
H3K27Ac Mark (Often Found Near Active Regulatory Elements) on 7 cell lines from ENCODE
DNaseI Hypersensitivity Clusters in 125 cell types from ENCODE (V3)
Transcription Factor ChIP-seq (161 factors) from ENCODE with Factorbook Motifs
100 vertebrates Basewise Conservation by PhyloP
Multiz Alignments of 100 Vertebrates
Simple Nucleotide Polymorphisms (dbSNP 146) Found in >= 1% of Samples
Repeating Elements by RepeatMasker
hypo_hmcno_DM
hypo_hmc
hyper_mc_hypo_hmchypo_hmc
no_DM
hypo_hmc
no_DM
NBM NBMNBM
hmc_med hmc_low
PDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9A
IDH2mut_v_NBM_mc_hmc_bisulfite_DM
62.4012 _
0 _
IDH2mut_1_mc_hmc_bisulfite_pct_meth
100 _
0 _
IDH2mut_1_hmc_pulldown_cov
89 _
1 _
IDH2mut_1_hmc_input_pulldown_cov
9 _
1 _
Layered H3K27Ac
100 Vert. Cons