Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that...
Transcript of Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that...
Vijayachitra Modhukur BIIT
Next generation sequencing (NGS3)-Epigenomics
1 11/27/13 Bioinformatics course
NGS lectures
11/27/13 Bioinformatics course 2
Genomics
Transcriptomics
Proteomics
Epigenomics
Epigenetics
11/27/13 Bioinformatics course 3
� Greek, epi = above, upon - genetics � The study of heritable changes in gene function that
occur without a change in the DNA sequence. � The study of changes in gene silencing that occur
without changes in the genes themselves. Many genes in the body are permanently turned off as part of normal development. But sometimes that process goes awry, turning off genes that should otherwise remain active. ..
� Epigenome represents all epigenetic phenomenon across the Genome
11/27/13 4
Epigenetics
11/27/13 Bioinformatics course 5
Epigenetics
11/27/13 Bioinformatics course 6
Epigenetic impact
11/27/13 7
Epigenetics market
11/27/13 Bioinformatics course 8
Epigenetics publications
11/27/13 Bioinformatics course 9
Central dogma
11/27/13 Bioinformatics course 10
Epigenetics may also control the central Dogma!
Some regulators
11/27/13 Bioinformatics course 11
11/27/13 Bioinformatics course 12
11/27/13 Bioinformatics course 13
11/27/13 Bioinformatics course 14
Bind to enhancer sites, controlled by hormones or other signals. They increase transcription of the regulated gene
Bind to silencer sites, controlled by hormones or other signals. They decrease transcription of the regulated gene, possibly by interfering with activator
Bind to activators and/or repressors and to basal factors. They communicate the signal from activators and/or repressors to the RNA polymerase.
They enable RNA polymerase to initiate transcription. However, they require interaction with coactivators.
Epigenetic tools
11/27/13 Bioinformatics course 15
Various Epigenetic projects
11/27/13 Bioinformatics course 16
Various projects
11/28/12 Bioinformatics course 30
Chromatin, histones and modifications
11/27/13 Bioinformatics course 17
DNA methylation
11/27/13 18
Some Roles of DNA Methylation in Mammalian System
• Genomic Imprinting • X chromosome inactivation • Developmental controls • Tissue specific regulation
Genomic imprinting
11/27/13 Bioinformatics course 20
11/27/13 Bioinformatics course 21
X chromosome in activation
11/27/13 Bioinformatics course 22
Developmental controls
11/27/13 Bioinformatics course 23
Tissue specific regulation
11/27/13 Bioinformatics course 24
DNA Methylation and Other Human Diseases
-- Imprinting Disorder: • Beckwith-Wiedemann syndrom (BWS) • Prader-Willi syndrome (PWS) • Transient neonatal diabetes mellitus (TNDM)
-- Repeat-instability diseases • Fragile X syndrome (FRAXA) • Facioscapulohumeral muscular dystroph
-- Defects of the methylation machinery • Systemic lupus erythemtosus (SLE) • Immunodeficiency, centromeric instability and facial anomalies (ICF) syndrome
Inheritance of DNA methylation
11/27/13 Bioinformatics course 26
CpG island methylation
11/27/13 27
Understanding of cancer through epigenetics
11/27/13 Bioinformatics course 28
DNA methylation and cancer
11/27/13 29
DNA methylation : data and analysis
11/27/13 Bioinformatics course 30
11/27/13 Bioinformatics course 31
Differentially methylated regions(DMRs). Genomic regions that exhibit statistically significant differences in DNA methylation between sample groups.
BisulphiteBisulphite ions (HSO3
−) selectively deaminate unmethylated but not methylated Cs, giving rise to Us, which are replaced by Ts during subsequent PCR amplification.
DNA methylation and their interpretation in a broader biological context. The final section outlines emerging trends in the analysis and interpretation of DNA meth-ylation data. The structure of this Review follows the flow of a typical DNA methylation mapping study, as illustrated in FIG. 1, and a list of the described software tools is available from TABLE 1.
Data processing and quality controlVarious experimental methods have been devel-oped for genome-wide DNA methylation mapping, each with their own advantages and challenges14,19,20. In this Review, we focus on the three most popular
approaches: bisulphite sequencing, bisulphite microar-rays and enrichment-based methods (FIG. 1a; BOX 1). These three approaches pose distinct computational challenges during data processing and quality control, as outlined below.
Processing bisulphite-sequencing data. As a result of DNA treatment with the bisulphite chemical, the vast majority of unmethylated Cs appears as Ts among the sequencing reads, whereas methylated Cs are largely protected from bisulphite-induced conversion. To cal-culate absolute DNA methylation levels from bisulphite-sequencing data, sequencing reads are aligned to the
8GTKH[KPI�CPF�XCNKFCVKPI�FKȭGTGPEGU�KP�&0#�OGVJ[NCVKQPr Global analysis of DMR list: volcano plots, Q–Q plots, Manhattan plotsr Manual or computational ranking and selection of promising DMRs for experimental XGTKȮECVKQP�CPF�QT�validationr�%QORWVCVKQPCN�FGUKIP�QH�JKIJ�VJTQWIJRWV�CUUC[U�HQT�EQPȮTOKPI�VJG� UGPUKVKXKV[�CPF�URGEKȮEKV[�QH�&/4�KFGPVKȮECVKQP�KP�NCTIG�UCORNG�EQJQTVU
+PVGTRTGVKPI�FKȭGTGPEGU�KP�&0#�OGVJ[NCVKQPr Integrative analysis in the context of other genomic data setsr�5GCTEJ�HQT�UKIPKȮECPV�GPTKEJOGPV�QH�IGPG�HWPEVKQPU�CPF�TGIWNCVQT[� elements among the DMRsr Statistical assessment of confounding factors to assess whether it YQWNF�DG�RNCWUKDNG�VQ�J[RQVJGUK\G�ECWUCN�GȭGEVU�
0CVWTG�4GXKGYU�| )GPGVKEU
a #UUC[U�HQT�&0#�OGVJ[NCVKQP�OCRRKPI
$KUWNRJKVG�UGSWGPEKPIDNA treatment with bisulphite URGEKȮECNN[�introduces mutations at unmethylated Cs. These mutations are mapped by next-generation sequencing
$KUWNRJKVG�OKETQCTTC[U&0#�OGVJ[NCVKQP�URGEKȮE�OWVCVKQPU�CTG�introduced by bisulphite treatment. These mutations are mapped using a genotyping microarray that covers a selection of Cs
'PTKEJOGPV�DCUGF�OGVJQFUMethylated (alternatively, unmethylated) DNA fragments are enriched in a DNA NKDTCT[��6JG�NKDTCT[�EQORQUKVKQP�KU�SWCPVKȮGF�by next-generation sequencing
b &CVC�RTQEGUUKPI�CPF�SWCNKV[�EQPVTQN
2TQEGUUKPI�DKUWNRJKVG�UGSWGPEKPI�FCVCr Bisulphite sequence alignmentr�3WCPVKȮECVKQP�QH�CDUQNWVG�&0#� methylation at single-base resolutionr Quality control
2TQEGUUKPI�DKUWNRJKVG�OKETQCTTC[�FCVCr Data normalizationr�3WCPVKȮECVKQP�QH�CDUQNWVG�&0#� methylation at single-base resolutionr Quality control
2TQEGUUKPI�GPTKEJOGPV�DCUGF�FCVCr DNA sequence alignmentr�3WCPVKȮECVKQP�QH�TGNCVKXG�GPTKEJOGPVr Statistical inference of absolute DNA methylation corrected for CpG densityr Quality control
c &CVC�XKUWCNK\CVKQP�CPF�UVCVKUVKECN�CPCN[UKU
8KUWCNK\KPI�&0#�OGVJ[NCVKQP�FCVCr Visual inspection of selected regions in a genome browserr Global visualization of the distribution of DNA methylationr�%NWUVGTKPI�DCUGF�CUUGUUOGPV�QH�INQDCN�UKOKNCTKV[�CPF�FKȭGTGPEGU� in a set of samples
+FGPVKH[KPI�FKȭGTGPVKCNN[�OGVJ[NCVGF�TGIKQPUr�5VCVKUVKECN�VGUVKPI�HQT�FKȭGTGPVKCN�&0#�OGVJ[NCVKQP�CV�UKPING� %R)U�CPF�QT�NCTIGT�IGPQOKE�TGIKQPUr Statistical correction for multiple hypothesis testingr�4CPMKPI�DCUGF�QP�UVCVKUVKECN�UKIPKȮECPEG�CPF�GȭGEV�UK\G
d 8CNKFCVKQP�CPF�KPVGTRTGVCVKQP
Unprocessed DNA sequencing or OKETQCTTC[�FCVC�CUUC[�URGEKȮE�
.KUV�QH�&/4U�VJCV�CTG�UVCVKUVKECNN[�UKIPKȮECPV
Table with DNA methylation levels for each CpG in each sample (assay-independent)
Figure 1 | Workflow for analysing and interpreting DNA methylation data. a | Genome-wide DNA methylation is mapped with one of the three most commonly used assays, resulting in methylation-specific DNA sequencing or microarray data. b | These raw data are processed and quality-controlled using assay-specific algorithms and software. The main result of data normalization is an assay-independent CpG methylation table that contains absolute DNA methylation levels (!-values) for all covered CpGs. c | Data visualization and statistical analysis identifies relevant associations and derives a list of differentially methylated regions (DMRs) between cases and controls. d | The resulting DMR list is validated both computationally and experimentally, and biological interpretation is assisted by computational tools. (Note that the separation of the analysis workflow into four subsequent steps constitutes a conceptual simplification, and there are a number of reasons why a specific study may need to deviate from this approach.)
REVIEWS
706 | OCTOBER 2012 | VOLUME 13 www.nature.com/reviews/genetics
© 2012 Macmillan Publishers Limited. All rights reserved
Bi-sulphate sequencing
11/27/13 Bioinformatics course 32
Bi-sulphite sequencing
11/27/13 Bioinformatics course 33
Bi-sulfate treated sequencing reads
11/27/13 Bioinformatics course 34
M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.
Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.
ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.
advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.
Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.
Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful
Nature Reviews | Genetics
a Setup of the example
b Wild-card alignment
CCGATGATGTCGCTGACGCACGA
YYGATGATGTYGYTGAYGYAYGA
100% 50% 50% 0%
ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT
DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing
Genomic DNA sequenceDNA methylation level
Bisulphite-sequencing reads
TCGATCGA
TCGTTTGT
ACGTATGT
ATGAATGA
ATGT
c Three-letter alignment
TTGATGATGTTGTTGATGTATGA
TtGATtGA
TtGATtGA
TtGTTTGT
AtGTAtGTATGT
ATGAATGA
ATGT
50% N/A 0%N/A
50% 100% 0%100%
Reference sequence
Reference sequence
Read alignment
DNA methylation level
DNA methylation level
Read alignment
Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)
REVIEWS
NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711
© 2012 Macmillan Publishers Limited. All rights reserved
Wild card alignment
11/27/13 Bioinformatics course 35
M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.
Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.
ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.
advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.
Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.
Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful
Nature Reviews | Genetics
a Setup of the example
b Wild-card alignment
CCGATGATGTCGCTGACGCACGA
YYGATGATGTYGYTGAYGYAYGA
100% 50% 50% 0%
ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT
DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing
Genomic DNA sequenceDNA methylation level
Bisulphite-sequencing reads
TCGATCGA
TCGTTTGT
ACGTATGT
ATGAATGA
ATGT
c Three-letter alignment
TTGATGATGTTGTTGATGTATGA
TtGATtGA
TtGATtGA
TtGTTTGT
AtGTAtGTATGT
ATGAATGA
ATGT
50% N/A 0%N/A
50% 100% 0%100%
Reference sequence
Reference sequence
Read alignment
DNA methylation level
DNA methylation level
Read alignment
Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)
REVIEWS
NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711
© 2012 Macmillan Publishers Limited. All rights reserved
3 letter alignment
11/27/13 Bioinformatics course 36
M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.
Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.
ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.
advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.
Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.
Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful
Nature Reviews | Genetics
a Setup of the example
b Wild-card alignment
CCGATGATGTCGCTGACGCACGA
YYGATGATGTYGYTGAYGYAYGA
100% 50% 50% 0%
ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT
DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing
Genomic DNA sequenceDNA methylation level
Bisulphite-sequencing reads
TCGATCGA
TCGTTTGT
ACGTATGT
ATGAATGA
ATGT
c Three-letter alignment
TTGATGATGTTGTTGATGTATGA
TtGATtGA
TtGATtGA
TtGTTTGT
AtGTAtGTATGT
ATGAATGA
ATGT
50% N/A 0%N/A
50% 100% 0%100%
Reference sequence
Reference sequence
Read alignment
DNA methylation level
DNA methylation level
Read alignment
Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)
REVIEWS
NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711
© 2012 Macmillan Publishers Limited. All rights reserved
MEDIP-Methylated DNA immuno precipitation (Enrichment Based)
11/27/13 Bioinformatics course 37
Antibody 5MeC-mAB
directed against 5-methylcytidine added…
is
binds to the methylated
fraction of
… that the
Magnetic Beads
… that bind to the antibody…
Magnetic beads are added…
… allowing the methylated fraction to be captured with magnets.
… allowing the methylated fraction to be captured with magnets.
… proteinase K…, The methylated fraction can then be isolated using…
… analyzed at candidate loci using qPCR…
Once isolated, methylated DNA can be…
… hybridised to microarrays for genome-wide testing…
… or high-throughput sequenced for whole-genome analysis.
Enrichment based analysis steps
11/27/13 Bioinformatics course 45
� Normalization � log ratios of input vs enriched � Determining regions enriched in methylation(peaks) � Assigning enriched regions to the closest gene � Further downstream analysis, according to question
Bi-sulphate microarray based methylation data
11/27/13 Bioinformatics course 46
11/27/13 Bioinformatics course 47
11/27/13 Bioinformatics course 48
Methylation value in infinium platform
11/27/13 Bioinformatics course 49
� The methylation measures are represented as beta values
� Beta values are continuous variables between 0 and 1
� 0= Not methylated. � 1=100% Methylated. � Methylation beta values: � B=M/(U+M+e) ; 0<B<1
Histone modifications
11/27/13 Bioinformatics course 50
Histone Modifications
http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif
Li e. al. (2007) Cell 128, 707
Li e. al. (2007) Cell 128, 707
Histone Modifications in Relation to Gene Transcription
Histone Modifications and Human Diseases
Coffin-Lowry syndrome is a rare genetic disorder characterized by mental retardation and abnormalities of the head and facial and other areas. It is caused by mutations in the RSK2 gene (histone phosphorylation) and is inherited as an X-linked dominant genetic trait. Males are usually more severely affected than females.
Rubinstein-Taybi syndrome is characterized by short stature,
moderate to severe intellectual disability, distinctive facial
features, and broad thumbs and first toes. It is caused by mutations in CREB-binding protein (histone acetylation)
Methods to profile histone modifications
11/27/13 Bioinformatics course 55
11/27/13 Bioinformatics course 56
Chip-seq
11/27/13 Bioinformatics course 57
Data analysis of chip-chip and chip-seq
11/27/13 Bioinformatics course 58
Summary of chip-chip chip-seq analysis
11/28/12 Bioinformatics course 53
11/27/13 Bioinformatics course 59
Table 1 | Publicly available ChIP-seq software packages discussed in this review
Profile Peak criteriaa Tag shift Control datab Rank by FDRcUser input parametersd
Artifact filtering:
strand-based/ duplicatee Refs.
CisGenome v1.1
Strand-specific window scan
1: Number of reads in window 2: Number of ChIP reads minus control reads in window
Average for highest ranking peak pairs
Conditional binomial used to estimate FDR
Number of reads under peak
1: Negative binomial 2: conditional binomial
Target FDR, optional window width, window interval
Yes / Yes 10
ERANGE v3.1
Tag aggregation
1: Height cutoff Hiqh quality peak estimate, per-region estimate, or input
Hiqh quality peak estimate, per-region estimate, or input
Used to calculate fold enrichment and optionally P values
P value 1: None 2: # control
# ChIP
Optional peak height, ratio to background
Yes / No 4,18
FindPeaks v3.1.9.2
Aggregation of overlapped tags
Height threshold Input or estimated
NA Number of reads under peak
1: Monte Carlo simulation 2: NA
Minimum peak height, subpeak valley depth
Yes / Yes 19
F-Seq v1.82
Kernel density estimation (KDE)
s s.d. above KDE for 1: random background, 2: control
Input or estimated
KDE for local background
Peak height 1: None 2: None
Threshold s.d. value, KDE bandwidth
No / No 14
GLITR Aggregation of overlapped tags
Classification by height and relative enrichment
User input tag extension
Multiply sampled to estimate background class values
Peak height and fold enrichment
2: # control# ChIP
Target FDR, number nearest neighbors for clustering
No / No 17
MACS v1.3.5
Tags shifted then window scan
Local region Poisson P value
Estimate from high quality peak pairs
Used for Poisson fit when available
P value 1: None 2: # control
# ChIP
P-value threshold, tag length, mfold for shift estimate
No / Yes 13
PeakSeq Extended tag aggregation
Local region binomial P value
Input tag extension length
Used for significance of sample enrichment with binomial distribution
q value 1: Poisson background assumption 2: From binomial for sample plus control
Target FDR No / No 5
QuEST v2.3
Kernel density estimation
2: Height threshold, background ratio
Mode of local shifts that maximize strand cross-correlation
KDE for enrichment and empirical FDR estimation
q value 1: NA 2: # control
# ChIP
as a function of profile threshold
KDE bandwidth, peak height, subpeak valley depth, ratio to background
Yes / Yes 9
SICER v1.02
Window scan with gaps allowed
P value from random background model, enrichment relative to control
Input Linearly rescaled for candidate peak rejection and P values
q value 1: None 2: From Poisson P values
Window length, gap size, FDR (with control) or E-value (no control)
No / Yes 15
SiSSRs v1.4
Window scan N+ – N- sign change, N+ + N- threshold in regionf
Average nearest paired tag distance
Used to compute fold-enrichment distribution
P value 1: Poisson 2: control distribution
1: FDR 1,2: N++ N- threshold
Yes / Yes 11
spp v1.0
Strand specific window scan
Poisson P value (paired peaks only)
Maximal strand cross-correlation
Subtracted before peak calling
P value 1: Monte Carlo simulation 2: # control
# ChIP
Ratio to background
Yes / No 12
USeq v4.2
Window scan Binomial P value Estimated or user specified
Subtracted before peak calling
q value 1, 2: binomial 2: # control
# ChIP
Target FDR No / Yes 20
aThe labels 1: and 2: refer to one-sample and two-sample experiments, respectively. bThese descriptions are intended to give a rough idea of how control data is used by the software. ‘NA’ means that control data are not handled. cDescription of how FDR is or optionally may be computed. ‘None’ indicates an FDR is not computed, but the experimental data may still be analyzed; ‘NA’ indicates the experimental setup (1 sample or 2) is not yet handled by the software. # control / # ChIP, number of peaks called with control (or some portion thereof) and sample reversed. dThe lists of ‘user input parameters’ for each program are not exhaustive but rather comprise a subset of greatest interest to new users. e’Strand-based’ artifiact filtering rejects peaks if the strand-specific distributions of reads do not conform to expectation, for example by exhibiting extreme bias of tag populations for one strand or the other in a region. ‘Duplicate’ filtering refers to either removal of reads that occur in excess of expectation at a location or filtering of called peaks to eliminate those due to low complexity read pileups that may be associated with, for example, microsatellite DNA. fN+ and N– are the numbers of positive and negative strand reads, respectively.
S26 | VOL.6 NO.11s | NOVEMBER 2009 | NATURE METHODS SUPPLEMENT
REVIEW
©20
09 N
ature
Ame
rica,
Inc. A
ll righ
ts re
serv
ed.
Cross-talk between DNA methylation and histone modifications
11/27/13 Bioinformatics course 60
11/27/13 Bioinformatics course 61
11/27/13 Bioinformatics course 62
Epigenetic Databases
11/27/13 Bioinformatics course 63
Databases
11/28/12 Bioinformatics course 56
Bioinformation open access
www.bioinformation.net Current Trends
ISSN 0973-2063 (online) 0973-8894 (print)
Bioinformation 4(7): 331-337 (2010) © 2010 Biomedical Informatics
337
Supplementary material Table 1. Some epigenetic and related databases reviewed in this article.
Database Description URL Ref MethDB Contains information on 19,905 DNA methylation content data
and 5,382 methylation patterns for 48 species, 1,511 individuals, 198 tissues and cell lines and 79 phenotypes.
http://www.methdb.de [39]
PubMeth Contains over 5,000 records on methylated genes in various cancer types.
www.pubmeth.org/ [43]
REBASE Contains over 22,000 DNA methyltransferases genes derived from GenBank.
http://rebase.neb.com/rebase/ rebase.html
[127]
MeInfoText Contains gene methylation information across 205 human cancer types.
http://mit.lifescience.ntu.edu.tw/ [44]
MethPrimerDB Contains 259 primer sets from human, mouse and rat for DNA methylation analysis.
medgen.ugent.be/methprimerdb/ [40]
The Histone Database Contains 254 sequences from histone H1, 383 from histone H2, 311 from histone H2B, 1043 from histone H3 and 198 from histone H4, altogether representing at least 857 species.
http://genome.nhgri.nih.gov/ histones/
[42]
ChromDB Contains 9,341 chromatin-associated proteins, including RNAi-associated proteins, for a broad range of organisms.
http://www.chromdb.org/ [128]
CREMOFAC Contains 1725 redundant and 720 non-redundant chromatin-remodeling factor sequences in eukaryotes.
http://www.jncasr.ac.in/cremofac/
[129]
The Krembil Family Epigenetics Laboratory
Contains DNA methylation data of human chromosomes 21, 22, male germ cells and DNA methylation profiles in monozygotic and dizygotic twins.
http://www.epigenomics.ca í
MethyLogiX DNA methylation database
Contains DNA methylation data of human chromosomes 21 and 22, male germ cells and late-onset Alzheimer's disease.
http://www.methylogix.com/ genetics/database.shtml.htm
[20]
NGS summary
11/27/13 Bioinformatics course 64
� Genomics : to study genome content, variations like snps, cnvs and mutations.
� Transcriptomics (RNA-seq) : to study gene expression, novel transcript fusions, splice variants and splice junctions.
� Epigenomics : Modifications such as DNA methylation and histone modifications and its role in development, disease.
Combinatorial analysis
11/27/13 Bioinformatics course 65
Combinatorial regulation
11/28/12 Bioinformatics course 59
NHGRI Current Topics in Genome Analysis 2010
Week 6: Regulatory and Epigenetic Landscapes of Mammalian Genomes
February 23, 2010
Laura Elnitski, Ph.D.
48
Effect
Towards personalized medicine
11/27/13 Bioinformatics course 66
11/27/13 Bioinformatics course 67