Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that...

67
Vijayachitra Modhukur BIIT [email protected] Next generation sequencing (NGS3)- Epigenomics 1 11/27/13 Bioinformatics course

Transcript of Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that...

Page 1: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Vijayachitra Modhukur BIIT

[email protected]

Next generation sequencing (NGS3)-Epigenomics

1 11/27/13 Bioinformatics course

Page 2: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

NGS lectures

11/27/13 Bioinformatics course 2

Genomics

Transcriptomics

Proteomics

Epigenomics

Page 3: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetics

11/27/13 Bioinformatics course 3

� Greek, epi = above, upon - genetics � The study of heritable changes in gene function that

occur without a change in the DNA sequence. � The study of changes in gene silencing that occur

without changes in the genes themselves. Many genes in the body are permanently turned off as part of normal development. But sometimes that process goes awry, turning off genes that should otherwise remain active. ..

�   Epigenome represents all epigenetic phenomenon across the Genome

Page 4: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 4

Page 5: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetics

11/27/13 Bioinformatics course 5

Page 6: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetics

11/27/13 Bioinformatics course 6

Page 7: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetic impact

11/27/13 7

Page 8: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetics market

11/27/13 Bioinformatics course 8

Page 9: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetics publications

11/27/13 Bioinformatics course 9

Page 10: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Central dogma

11/27/13 Bioinformatics course 10

Epigenetics may also control the central Dogma!

Page 11: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Some regulators

11/27/13 Bioinformatics course 11

Page 12: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 12

Page 13: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 13

Page 14: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 14

Bind to enhancer sites, controlled by hormones or other signals. They increase transcription of the regulated gene

Bind to silencer sites, controlled by hormones or other signals. They decrease transcription of the regulated gene, possibly by interfering with activator

Bind to activators and/or repressors and to basal factors. They communicate the signal from activators and/or repressors to the RNA polymerase.

They enable RNA polymerase to initiate transcription. However, they require interaction with coactivators.

Page 15: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetic tools

11/27/13 Bioinformatics course 15

Page 16: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Various Epigenetic projects

11/27/13 Bioinformatics course 16

Various projects

11/28/12 Bioinformatics course 30

Page 17: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Chromatin, histones and modifications

11/27/13 Bioinformatics course 17

Page 18: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

DNA methylation

11/27/13 18

Page 19: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Some Roles of DNA Methylation in Mammalian System

•  Genomic Imprinting •  X chromosome inactivation •  Developmental controls •  Tissue specific regulation

Page 20: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Genomic imprinting

11/27/13 Bioinformatics course 20

Page 21: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 21

Page 22: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

X chromosome in activation

11/27/13 Bioinformatics course 22

Page 23: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Developmental controls

11/27/13 Bioinformatics course 23

Page 24: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Tissue specific regulation

11/27/13 Bioinformatics course 24

Page 25: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

DNA Methylation and Other Human Diseases

-- Imprinting Disorder: •  Beckwith-Wiedemann syndrom (BWS) •  Prader-Willi syndrome (PWS) •  Transient neonatal diabetes mellitus (TNDM)

-- Repeat-instability diseases •  Fragile X syndrome (FRAXA) •  Facioscapulohumeral muscular dystroph

-- Defects of the methylation machinery •  Systemic lupus erythemtosus (SLE) •  Immunodeficiency, centromeric instability and facial anomalies (ICF) syndrome

Page 26: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Inheritance of DNA methylation

11/27/13 Bioinformatics course 26

Page 27: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

CpG island methylation

11/27/13 27

Page 28: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Understanding of cancer through epigenetics

11/27/13 Bioinformatics course 28

Page 29: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

DNA methylation and cancer

11/27/13 29

Page 30: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

DNA methylation : data and analysis

11/27/13 Bioinformatics course 30

Page 31: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 31

Differentially methylated regions(DMRs). Genomic regions that exhibit statistically significant differences in DNA methylation between sample groups.

BisulphiteBisulphite ions (HSO3

−) selectively deaminate unmethylated but not methylated Cs, giving rise to Us, which are replaced by Ts during subsequent PCR amplification.

DNA methylation and their interpretation in a broader biological context. The final section outlines emerging trends in the analysis and interpretation of DNA meth-ylation data. The structure of this Review follows the flow of a typical DNA methylation mapping study, as illustrated in FIG. 1, and a list of the described software tools is available from TABLE 1.

Data processing and quality controlVarious experimental methods have been devel-oped for genome-wide DNA methylation mapping, each with their own advantages and challenges14,19,20. In this Review, we focus on the three most popular

approaches: bisulphite sequencing, bisulphite microar-rays and enrichment-based methods (FIG. 1a; BOX 1). These three approaches pose distinct computational challenges during data processing and quality control, as outlined below.

Processing bisulphite-sequencing data. As a result of DNA treatment with the bisulphite chemical, the vast majority of unmethylated Cs appears as Ts among the sequencing reads, whereas methylated Cs are largely protected from bisulphite-induced conversion. To cal-culate absolute DNA methylation levels from bisulphite-sequencing data, sequencing reads are aligned to the

8GTKH[KPI�CPF�XCNKFCVKPI�FKȭGTGPEGU�KP�&0#�OGVJ[NCVKQPr Global analysis of DMR list: volcano plots, Q–Q plots, Manhattan plotsr Manual or computational ranking and selection of promising DMRs for experimental XGTKȮECVKQP�CPF�QT�validationr�%QORWVCVKQPCN�FGUKIP�QH�JKIJ�VJTQWIJRWV�CUUC[U�HQT�EQPȮTOKPI�VJG� UGPUKVKXKV[�CPF�URGEKȮEKV[�QH�&/4�KFGPVKȮECVKQP�KP�NCTIG�UCORNG�EQJQTVU

+PVGTRTGVKPI�FKȭGTGPEGU�KP�&0#�OGVJ[NCVKQPr Integrative analysis in the context of other genomic data setsr�5GCTEJ�HQT�UKIPKȮECPV�GPTKEJOGPV�QH�IGPG�HWPEVKQPU�CPF�TGIWNCVQT[� elements among the DMRsr Statistical assessment of confounding factors to assess whether it YQWNF�DG�RNCWUKDNG�VQ�J[RQVJGUK\G�ECWUCN�GȭGEVU�

0CVWTG�4GXKGYU�| )GPGVKEU

a #UUC[U�HQT�&0#�OGVJ[NCVKQP�OCRRKPI

$KUWNRJKVG�UGSWGPEKPIDNA treatment with bisulphite URGEKȮECNN[�introduces mutations at unmethylated Cs. These mutations are mapped by next-generation sequencing

$KUWNRJKVG�OKETQCTTC[U&0#�OGVJ[NCVKQP�URGEKȮE�OWVCVKQPU�CTG�introduced by bisulphite treatment. These mutations are mapped using a genotyping microarray that covers a selection of Cs

'PTKEJOGPV�DCUGF�OGVJQFUMethylated (alternatively, unmethylated) DNA fragments are enriched in a DNA NKDTCT[��6JG�NKDTCT[�EQORQUKVKQP�KU�SWCPVKȮGF�by next-generation sequencing

b &CVC�RTQEGUUKPI�CPF�SWCNKV[�EQPVTQN

2TQEGUUKPI�DKUWNRJKVG�UGSWGPEKPI�FCVCr Bisulphite sequence alignmentr�3WCPVKȮECVKQP�QH�CDUQNWVG�&0#� methylation at single-base resolutionr Quality control

2TQEGUUKPI�DKUWNRJKVG�OKETQCTTC[�FCVCr Data normalizationr�3WCPVKȮECVKQP�QH�CDUQNWVG�&0#� methylation at single-base resolutionr Quality control

2TQEGUUKPI�GPTKEJOGPV�DCUGF�FCVCr DNA sequence alignmentr�3WCPVKȮECVKQP�QH�TGNCVKXG�GPTKEJOGPVr Statistical inference of absolute DNA methylation corrected for CpG densityr Quality control

c &CVC�XKUWCNK\CVKQP�CPF�UVCVKUVKECN�CPCN[UKU

8KUWCNK\KPI�&0#�OGVJ[NCVKQP�FCVCr Visual inspection of selected regions in a genome browserr Global visualization of the distribution of DNA methylationr�%NWUVGTKPI�DCUGF�CUUGUUOGPV�QH�INQDCN�UKOKNCTKV[�CPF�FKȭGTGPEGU� in a set of samples

+FGPVKH[KPI�FKȭGTGPVKCNN[�OGVJ[NCVGF�TGIKQPUr�5VCVKUVKECN�VGUVKPI�HQT�FKȭGTGPVKCN�&0#�OGVJ[NCVKQP�CV�UKPING� %R)U�CPF�QT�NCTIGT�IGPQOKE�TGIKQPUr Statistical correction for multiple hypothesis testingr�4CPMKPI�DCUGF�QP�UVCVKUVKECN�UKIPKȮECPEG�CPF�GȭGEV�UK\G

d 8CNKFCVKQP�CPF�KPVGTRTGVCVKQP

Unprocessed DNA sequencing or OKETQCTTC[�FCVC�CUUC[�URGEKȮE�

.KUV�QH�&/4U�VJCV�CTG�UVCVKUVKECNN[�UKIPKȮECPV

Table with DNA methylation levels for each CpG in each sample (assay-independent)

Figure 1 | Workflow for analysing and interpreting DNA methylation data. a | Genome-wide DNA methylation is mapped with one of the three most commonly used assays, resulting in methylation-specific DNA sequencing or microarray data. b | These raw data are processed and quality-controlled using assay-specific algorithms and software. The main result of data normalization is an assay-independent CpG methylation table that contains absolute DNA methylation levels (!-values) for all covered CpGs. c | Data visualization and statistical analysis identifies relevant associations and derives a list of differentially methylated regions (DMRs) between cases and controls. d | The resulting DMR list is validated both computationally and experimentally, and biological interpretation is assisted by computational tools. (Note that the separation of the analysis workflow into four subsequent steps constitutes a conceptual simplification, and there are a number of reasons why a specific study may need to deviate from this approach.)

REVIEWS

706 | OCTOBER 2012 | VOLUME 13 www.nature.com/reviews/genetics

© 2012 Macmillan Publishers Limited. All rights reserved

Page 32: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Bi-sulphate sequencing

11/27/13 Bioinformatics course 32

Page 33: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Bi-sulphite sequencing

11/27/13 Bioinformatics course 33

Page 34: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Bi-sulfate treated sequencing reads

11/27/13 Bioinformatics course 34

M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.

Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.

ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.

advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.

Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.

Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful

Nature Reviews | Genetics

a Setup of the example

b Wild-card alignment

CCGATGATGTCGCTGACGCACGA

YYGATGATGTYGYTGAYGYAYGA

100% 50% 50% 0%

ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT

DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing

Genomic DNA sequenceDNA methylation level

Bisulphite-sequencing reads

TCGATCGA

TCGTTTGT

ACGTATGT

ATGAATGA

ATGT

c Three-letter alignment

TTGATGATGTTGTTGATGTATGA

TtGATtGA

TtGATtGA

TtGTTTGT

AtGTAtGTATGT

ATGAATGA

ATGT

50% N/A 0%N/A

50% 100% 0%100%

Reference sequence

Reference sequence

Read alignment

DNA methylation level

DNA methylation level

Read alignment

Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711

© 2012 Macmillan Publishers Limited. All rights reserved

Page 35: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Wild card alignment

11/27/13 Bioinformatics course 35

M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.

Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.

ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.

advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.

Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.

Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful

Nature Reviews | Genetics

a Setup of the example

b Wild-card alignment

CCGATGATGTCGCTGACGCACGA

YYGATGATGTYGYTGAYGYAYGA

100% 50% 50% 0%

ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT

DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing

Genomic DNA sequenceDNA methylation level

Bisulphite-sequencing reads

TCGATCGA

TCGTTTGT

ACGTATGT

ATGAATGA

ATGT

c Three-letter alignment

TTGATGATGTTGTTGATGTATGA

TtGATtGA

TtGATtGA

TtGTTTGT

AtGTAtGTATGT

ATGAATGA

ATGT

50% N/A 0%N/A

50% 100% 0%100%

Reference sequence

Reference sequence

Read alignment

DNA methylation level

DNA methylation level

Read alignment

Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711

© 2012 Macmillan Publishers Limited. All rights reserved

Page 36: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

3 letter alignment

11/27/13 Bioinformatics course 36

M valuesLogistically transformed !-values. The transformation mitigates some statistical problems of the !-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.

Batch effectsSystematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.

ConfoundingA nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.

advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases ver-sus controls) and to use tools for batch effect removal, which can substantially increase robustness and statis-tical power50,52,53. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes54) and the presence of genetic vari-ants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with mul-tiple genomic regions as well as those overlapping with common genetic variants.

Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various meth-ods for enriching DNA in a methylation-specific manner.

Methylated DNA can be enriched using methylation- specific antibodies (in methylated DNA immuno-precipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restric-tion enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequenc-ing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful

Nature Reviews | Genetics

a Setup of the example

b Wild-card alignment

CCGATGATGTCGCTGACGCACGA

YYGATGATGTYGYTGAYGYAYGA

100% 50% 50% 0%

ACGT,ATGA,ATGA,ATGT,TCGA,TCGA,TCGT,TTGT

DNA fragmentation, selectiveconversion of unmethylatedCs into Ts, DNA sequencing

Genomic DNA sequenceDNA methylation level

Bisulphite-sequencing reads

TCGATCGA

TCGTTTGT

ACGTATGT

ATGAATGA

ATGT

c Three-letter alignment

TTGATGATGTTGTTGATGTATGA

TtGATtGA

TtGATtGA

TtGTTTGT

AtGTAtGTATGT

ATGAATGA

ATGT

50% N/A 0%N/A

50% 100% 0%100%

Reference sequence

Reference sequence

Read alignment

DNA methylation level

DNA methylation level

Read alignment

Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 13 | OCTOBER 2012 | 711

© 2012 Macmillan Publishers Limited. All rights reserved

Page 37: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

MEDIP-Methylated DNA immuno precipitation (Enrichment Based)

11/27/13 Bioinformatics course 37

Page 38: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Antibody 5MeC-mAB

directed against 5-methylcytidine added…

is

binds to the methylated

fraction of

… that the

Page 39: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Magnetic Beads

… that bind to the antibody…

Magnetic beads are added…

… allowing the methylated fraction to be captured with magnets.

Page 40: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

… allowing the methylated fraction to be captured with magnets.

Page 41: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

… proteinase K…, The methylated fraction can then be isolated using…

Page 42: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

… analyzed at candidate loci using qPCR…

Once isolated, methylated DNA can be…

Page 43: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

… hybridised to microarrays for genome-wide testing…

Page 44: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

… or high-throughput sequenced for whole-genome analysis.

Page 45: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Enrichment based analysis steps

11/27/13 Bioinformatics course 45

�  Normalization �  log ratios of input vs enriched �   Determining regions enriched in methylation(peaks) �  Assigning enriched regions to the closest gene �  Further downstream analysis, according to question

Page 46: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Bi-sulphate microarray based methylation data

11/27/13 Bioinformatics course 46

Page 47: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 47

Page 48: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 48

Page 49: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Methylation value in infinium platform

11/27/13 Bioinformatics course 49

� The methylation measures are represented as beta values

� Beta values are continuous variables between 0 and 1

� 0= Not methylated. � 1=100% Methylated. � Methylation beta values: � B=M/(U+M+e) ; 0<B<1

Page 50: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Histone modifications

11/27/13 Bioinformatics course 50

Page 51: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Histone Modifications

http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif

Page 52: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Li e. al. (2007) Cell 128, 707

Page 53: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Li e. al. (2007) Cell 128, 707

Histone Modifications in Relation to Gene Transcription

Page 54: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Histone Modifications and Human Diseases

Coffin-Lowry syndrome is a rare genetic disorder characterized by mental retardation and abnormalities of the head and facial and other areas. It is caused by mutations in the RSK2 gene (histone phosphorylation) and is inherited as an X-linked dominant genetic trait. Males are usually more severely affected than females.

Rubinstein-Taybi syndrome is characterized by short stature,

moderate to severe intellectual disability, distinctive facial

features, and broad thumbs and first toes. It is caused by mutations in CREB-binding protein (histone acetylation)

Page 55: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Methods to profile histone modifications

11/27/13 Bioinformatics course 55

Page 56: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 56

Page 57: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Chip-seq

11/27/13 Bioinformatics course 57

Page 58: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Data analysis of chip-chip and chip-seq

11/27/13 Bioinformatics course 58

Summary of chip-chip chip-seq analysis

11/28/12 Bioinformatics course 53

Page 59: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 59

Table 1 | Publicly available ChIP-seq software packages discussed in this review

Profile Peak criteriaa Tag shift Control datab Rank by FDRcUser input parametersd

Artifact filtering:

strand-based/ duplicatee Refs.

CisGenome v1.1

Strand-specific window scan

1: Number of reads in window 2: Number of ChIP reads minus control reads in window

Average for highest ranking peak pairs

Conditional binomial used to estimate FDR

Number of reads under peak

1: Negative binomial 2: conditional binomial

Target FDR, optional window width, window interval

Yes / Yes 10

ERANGE v3.1

Tag aggregation

1: Height cutoff Hiqh quality peak estimate, per-region estimate, or input

Hiqh quality peak estimate, per-region estimate, or input

Used to calculate fold enrichment and optionally P values

P value 1: None 2: # control

# ChIP

Optional peak height, ratio to background

Yes / No 4,18

FindPeaks v3.1.9.2

Aggregation of overlapped tags

Height threshold Input or estimated

NA Number of reads under peak

1: Monte Carlo simulation 2: NA

Minimum peak height, subpeak valley depth

Yes / Yes 19

F-Seq v1.82

Kernel density estimation (KDE)

s s.d. above KDE for 1: random background, 2: control

Input or estimated

KDE for local background

Peak height 1: None 2: None

Threshold s.d. value, KDE bandwidth

No / No 14

GLITR Aggregation of overlapped tags

Classification by height and relative enrichment

User input tag extension

Multiply sampled to estimate background class values

Peak height and fold enrichment

2: # control# ChIP

Target FDR, number nearest neighbors for clustering

No / No 17

MACS v1.3.5

Tags shifted then window scan

Local region Poisson P value

Estimate from high quality peak pairs

Used for Poisson fit when available

P value 1: None 2: # control

# ChIP

P-value threshold, tag length, mfold for shift estimate

No / Yes 13

PeakSeq Extended tag aggregation

Local region binomial P value

Input tag extension length

Used for significance of sample enrichment with binomial distribution

q value 1: Poisson background assumption 2: From binomial for sample plus control

Target FDR No / No 5

QuEST v2.3

Kernel density estimation

2: Height threshold, background ratio

Mode of local shifts that maximize strand cross-correlation

KDE for enrichment and empirical FDR estimation

q value 1: NA 2: # control

# ChIP

as a function of profile threshold

KDE bandwidth, peak height, subpeak valley depth, ratio to background

Yes / Yes 9

SICER v1.02

Window scan with gaps allowed

P value from random background model, enrichment relative to control

Input Linearly rescaled for candidate peak rejection and P values

q value 1: None 2: From Poisson P values

Window length, gap size, FDR (with control) or E-value (no control)

No / Yes 15

SiSSRs v1.4

Window scan N+ – N- sign change, N+ + N- threshold in regionf

Average nearest paired tag distance

Used to compute fold-enrichment distribution

P value 1: Poisson 2: control distribution

1: FDR 1,2: N++ N- threshold

Yes / Yes 11

spp v1.0

Strand specific window scan

Poisson P value (paired peaks only)

Maximal strand cross-correlation

Subtracted before peak calling

P value 1: Monte Carlo simulation 2: # control

# ChIP

Ratio to background

Yes / No 12

USeq v4.2

Window scan Binomial P value Estimated or user specified

Subtracted before peak calling

q value 1, 2: binomial 2: # control

# ChIP

Target FDR No / Yes 20

aThe labels 1: and 2: refer to one-sample and two-sample experiments, respectively. bThese descriptions are intended to give a rough idea of how control data is used by the software. ‘NA’ means that control data are not handled. cDescription of how FDR is or optionally may be computed. ‘None’ indicates an FDR is not computed, but the experimental data may still be analyzed; ‘NA’ indicates the experimental setup (1 sample or 2) is not yet handled by the software. # control / # ChIP, number of peaks called with control (or some portion thereof) and sample reversed. dThe lists of ‘user input parameters’ for each program are not exhaustive but rather comprise a subset of greatest interest to new users. e’Strand-based’ artifiact filtering rejects peaks if the strand-specific distributions of reads do not conform to expectation, for example by exhibiting extreme bias of tag populations for one strand or the other in a region. ‘Duplicate’ filtering refers to either removal of reads that occur in excess of expectation at a location or filtering of called peaks to eliminate those due to low complexity read pileups that may be associated with, for example, microsatellite DNA. fN+ and N– are the numbers of positive and negative strand reads, respectively.

S26 | VOL.6 NO.11s | NOVEMBER 2009 | NATURE METHODS SUPPLEMENT

REVIEW

©20

09 N

ature

Ame

rica,

Inc. A

ll righ

ts re

serv

ed.

Page 60: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Cross-talk between DNA methylation and histone modifications

11/27/13 Bioinformatics course 60

Page 61: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 61

Page 62: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 62

Page 63: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Epigenetic Databases

11/27/13 Bioinformatics course 63

Databases

11/28/12 Bioinformatics course 56

Bioinformation open access

www.bioinformation.net Current Trends

ISSN 0973-2063 (online) 0973-8894 (print)

Bioinformation 4(7): 331-337 (2010) © 2010 Biomedical Informatics

337

Supplementary material Table 1. Some epigenetic and related databases reviewed in this article.

Database Description URL Ref MethDB Contains information on 19,905 DNA methylation content data

and 5,382 methylation patterns for 48 species, 1,511 individuals, 198 tissues and cell lines and 79 phenotypes.

http://www.methdb.de [39]

PubMeth Contains over 5,000 records on methylated genes in various cancer types.

www.pubmeth.org/ [43]

REBASE Contains over 22,000 DNA methyltransferases genes derived from GenBank.

http://rebase.neb.com/rebase/ rebase.html

[127]

MeInfoText Contains gene methylation information across 205 human cancer types.

http://mit.lifescience.ntu.edu.tw/ [44]

MethPrimerDB Contains 259 primer sets from human, mouse and rat for DNA methylation analysis.

medgen.ugent.be/methprimerdb/ [40]

The Histone Database Contains 254 sequences from histone H1, 383 from histone H2, 311 from histone H2B, 1043 from histone H3 and 198 from histone H4, altogether representing at least 857 species.

http://genome.nhgri.nih.gov/ histones/

[42]

ChromDB Contains 9,341 chromatin-associated proteins, including RNAi-associated proteins, for a broad range of organisms.

http://www.chromdb.org/ [128]

CREMOFAC Contains 1725 redundant and 720 non-redundant chromatin-remodeling factor sequences in eukaryotes.

http://www.jncasr.ac.in/cremofac/

[129]

The Krembil Family Epigenetics Laboratory

Contains DNA methylation data of human chromosomes 21, 22, male germ cells and DNA methylation profiles in monozygotic and dizygotic twins.

http://www.epigenomics.ca í

MethyLogiX DNA methylation database

Contains DNA methylation data of human chromosomes 21 and 22, male germ cells and late-onset Alzheimer's disease.

http://www.methylogix.com/ genetics/database.shtml.htm

[20]

Page 64: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

NGS summary

11/27/13 Bioinformatics course 64

� Genomics : to study genome content, variations like snps, cnvs and mutations.

� Transcriptomics (RNA-seq) : to study gene expression, novel transcript fusions, splice variants and splice junctions.

� Epigenomics : Modifications such as DNA methylation and histone modifications and its role in development, disease.

Page 65: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Combinatorial analysis

11/27/13 Bioinformatics course 65

Combinatorial regulation

11/28/12 Bioinformatics course 59

NHGRI Current Topics in Genome Analysis 2010

Week 6: Regulatory and Epigenetic Landscapes of Mammalian Genomes

February 23, 2010

Laura Elnitski, Ph.D.

48

Effect

Page 66: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

Towards personalized medicine

11/27/13 Bioinformatics course 66

Page 67: Next generation sequencing (NGS3)- Epigenomics · advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing

11/27/13 Bioinformatics course 67