Next generation sequencing (NGS3) - Epigenomics

Vijayachitra Modhukur BIIT



11/27/13 Bioinformatics course

NGS lectures


Genomics

Transcriptomics

Proteomics

Epigenomics

Epigenetics

•  Greek: epi = above, upon (genetics).
•  The study of heritable changes in gene function that occur without a change in the DNA sequence.
•  The study of changes in gene silencing that occur without changes in the genes themselves. Many genes in the body are permanently turned off as part of normal development. But sometimes that process goes awry, turning off genes that should otherwise remain active.
•  The epigenome represents all epigenetic phenomena across the genome.


Epigenetics


Epigenetic impact


Epigenetics market


Epigenetics publications


Central dogma


Epigenetics may also control the central dogma!

Some regulators



•  Activators: bind to enhancer sites and are controlled by hormones or other signals. They increase transcription of the regulated gene.
•  Repressors: bind to silencer sites and are controlled by hormones or other signals. They decrease transcription of the regulated gene, possibly by interfering with activators.
•  Coactivators: bind to activators and/or repressors and to basal factors. They communicate the signal from activators and/or repressors to the RNA polymerase.
•  Basal factors: enable RNA polymerase to initiate transcription. However, they require interaction with coactivators.

Epigenetic tools


Various Epigenetic projects


Various projects


Chromatin, histones and modifications


DNA methylation


Some Roles of DNA Methylation in Mammalian System

•  Genomic imprinting
•  X chromosome inactivation
•  Developmental controls
•  Tissue-specific regulation

Genomic imprinting


X chromosome inactivation


Developmental controls


Tissue specific regulation


DNA Methylation and Other Human Diseases

-- Imprinting disorders
•  Beckwith-Wiedemann syndrome (BWS)
•  Prader-Willi syndrome (PWS)
•  Transient neonatal diabetes mellitus (TNDM)

-- Repeat-instability diseases
•  Fragile X syndrome (FRAXA)
•  Facioscapulohumeral muscular dystrophy

-- Defects of the methylation machinery
•  Systemic lupus erythematosus (SLE)
•  Immunodeficiency, centromeric instability and facial anomalies (ICF) syndrome

Inheritance of DNA methylation


CpG island methylation


Understanding of cancer through epigenetics


DNA methylation and cancer


DNA methylation: data and analysis

Differentially methylated regions (DMRs): Genomic regions that exhibit statistically significant differences in DNA methylation between sample groups.

Bisulphite: Bisulphite ions (HSO3−) selectively deaminate unmethylated but not methylated Cs, giving rise to Us, which are replaced by Ts during subsequent PCR amplification.
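The effect of bisulphite treatment on a read can be illustrated with a minimal sketch. It assumes an idealised, 100% efficient conversion; the function name `bisulphite_convert` and its arguments are hypothetical and chosen only for this illustration.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate ideal bisulphite treatment followed by PCR.

    Unmethylated Cs are read as T; methylated Cs (given as 0-based
    positions) stay C. Assumes complete conversion, which real
    experiments only approximate.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


# A four-base fragment with a CpG at positions 1-2:
print(bisulphite_convert("ACGT", {1}))    # methylated C is protected
print(bisulphite_convert("ACGT", set()))  # unmethylated C is read as T
```

Comparing the converted read with the reference is what later allows the methylation state of each C to be inferred.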

... DNA methylation and their interpretation in a broader biological context. The final section outlines emerging trends in the analysis and interpretation of DNA methylation data. The structure of this Review follows the flow of a typical DNA methylation mapping study, as illustrated in FIG. 1, and a list of the described software tools is available from TABLE 1.

Data processing and quality control

Various experimental methods have been developed for genome-wide DNA methylation mapping, each with their own advantages and challenges (refs 14, 19, 20). In this Review, we focus on the three most popular approaches: bisulphite sequencing, bisulphite microarrays and enrichment-based methods (FIG. 1a; BOX 1). These three approaches pose distinct computational challenges during data processing and quality control, as outlined below.

Processing bisulphite-sequencing data. As a result of DNA treatment with the bisulphite chemical, the vast majority of unmethylated Cs appears as Ts among the sequencing reads, whereas methylated Cs are largely protected from bisulphite-induced conversion. To calculate absolute DNA methylation levels from bisulphite-sequencing data, sequencing reads are aligned to the ...

Figure 1 panels (text recovered from the figure):

a Assays for DNA methylation mapping
•  Bisulphite sequencing: DNA treatment with bisulphite specifically introduces mutations at unmethylated Cs. These mutations are mapped by next-generation sequencing.
•  Bisulphite microarrays: DNA methylation-specific mutations are introduced by bisulphite treatment. These mutations are mapped using a genotyping microarray that covers a selection of Cs.
•  Enrichment-based methods: Methylated (alternatively, unmethylated) DNA fragments are enriched in a DNA library. The library composition is quantified by next-generation sequencing.

Output: Unprocessed DNA sequencing or microarray data (assay-specific)

b Data processing and quality control
•  Processing bisulphite-sequencing data: bisulphite sequence alignment; quantification of absolute DNA methylation at single-base resolution; quality control.
•  Processing bisulphite microarray data: data normalization; quantification of absolute DNA methylation at single-base resolution; quality control.
•  Processing enrichment-based data: DNA sequence alignment; quantification of relative enrichment; statistical inference of absolute DNA methylation corrected for CpG density; quality control.

Output: Table with DNA methylation levels for each CpG in each sample (assay-independent)

c Data visualization and statistical analysis
•  Visualizing DNA methylation data: visual inspection of selected regions in a genome browser; global visualization of the distribution of DNA methylation; clustering-based assessment of global similarity and differences in a set of samples.
•  Identifying differentially methylated regions: statistical testing for differential DNA methylation at single CpGs and/or larger genomic regions; statistical correction for multiple hypothesis testing; ranking based on statistical significance and effect size.

Output: List of DMRs that are statistically significant

d Validation and interpretation
•  Verifying and validating differences in DNA methylation: global analysis of DMR list (volcano plots, Q–Q plots, Manhattan plots); manual or computational ranking and selection of promising DMRs for experimental verification and/or validation; computational design of high-throughput assays for confirming the sensitivity and specificity of DMR identification in large sample cohorts.
•  Interpreting differences in DNA methylation: integrative analysis in the context of other genomic data sets; search for significant enrichment of gene functions and regulatory elements among the DMRs; statistical assessment of confounding factors to assess whether it would be plausible to hypothesize causal effects.

Nature Reviews | Genetics

Figure 1 | Workflow for analysing and interpreting DNA methylation data. a | Genome-wide DNA methylation is mapped with one of the three most commonly used assays, resulting in methylation-specific DNA sequencing or microarray data. b | These raw data are processed and quality-controlled using assay-specific algorithms and software. The main result of data normalization is an assay-independent CpG methylation table that contains absolute DNA methylation levels (β-values) for all covered CpGs. c | Data visualization and statistical analysis identifies relevant associations and derives a list of differentially methylated regions (DMRs) between cases and controls. d | The resulting DMR list is validated both computationally and experimentally, and biological interpretation is assisted by computational tools. (Note that the separation of the analysis workflow into four subsequent steps constitutes a conceptual simplification, and there are a number of reasons why a specific study may need to deviate from this approach.)
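Deriving a DMR list in panel c usually means testing many CpGs or regions at once, so the resulting P values need a correction for multiple hypothesis testing. The sketch below shows one common choice, the Benjamini-Hochberg adjustment, in plain Python. The source does not say which correction the reviewed tools use, so this is an illustrative assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (same order as input)."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG P values from a case/control comparison.
pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))
```

Candidate DMRs would then be those whose adjusted value falls below the chosen false discovery rate threshold.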

REVIEWS

706 | OCTOBER 2012 | VOLUME 13 www.nature.com/reviews/genetics

© 2012 Macmillan Publishers Limited. All rights reserved

Bisulphite sequencing


Bisulphite sequencing


Bisulphite-treated sequencing reads


M values: Logistically transformed β-values. The transformation mitigates some statistical problems of the β-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.
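As a minimal sketch of this transformation, the following assumes the commonly used base-2 logit, M = log2(β / (1 − β)). The base and the small clamping constant `eps` are assumptions made for this example and are not stated in the glossary.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value in [0, 1] to an M value (base-2 logit).

    eps clamps beta away from 0 and 1 so the logarithm stays finite;
    its value is an arbitrary choice for this sketch.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transformation from an M value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


for beta in (0.1, 0.5, 0.8):
    print(beta, round(beta_to_m(beta), 3))
```

A β-value of 0.5 maps to an M value of 0, and values near 0 or 1 are stretched onto an unbounded scale, which is where the reduced biological interpretability comes from.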

Batch effects: Systematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.

Confounding: A nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.

... advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases versus controls) and to use tools for batch effect removal, which can substantially increase robustness and statistical power (refs 50, 52, 53). Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation on the autosomes (ref. 54)) and the presence of genetic variants affecting probe binding or read-out. The impact of these technical issues can be minimized by removing all probes that exhibit a high sequence identity with multiple genomic regions as well as those overlapping with common genetic variants.
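One simple way to follow this advice is to spread cases and controls evenly over the processing batches. The sketch below is only an illustration of that idea: the function `assign_batches`, the sample identifiers and the round-robin scheme are hypothetical and are not taken from the review.

```python
import random


def assign_batches(samples, n_batches, seed=0):
    """Distribute (sample_id, group) pairs over batches, balanced by group.

    Samples are shuffled within each group and dealt round-robin, so each
    batch receives a near-equal share of every group.
    """
    rng = random.Random(seed)
    by_group = {}
    for sample_id, group in samples:
        by_group.setdefault(group, []).append(sample_id)

    batches = [[] for _ in range(n_batches)]
    offset = 0
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)
        for k, sample_id in enumerate(ids):
            batches[(offset + k) % n_batches].append((sample_id, group))
        offset += len(ids)
    return batches


# Hypothetical cohort: six cases and six controls, three microarray batches.
cohort = [(f"case{i}", "case") for i in range(6)]
cohort += [(f"ctrl{i}", "control") for i in range(6)]
for number, batch in enumerate(assign_batches(cohort, 3), start=1):
    print(number, batch)
```

With such a layout, batch and phenotype are not confounded by design, which also makes later batch effect removal less risky.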

Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various methods for enriching DNA in a methylation-specific manner. Methylated DNA can be enriched using methylation-specific antibodies (in methylated DNA immunoprecipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restriction enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred. In contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequencing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful ...

Figure 2 panels (text recovered from the figure):

a Setup of the example
Genomic DNA sequence: CCGATGATGTCGCTGACGCACGA
DNA methylation level at the four CpGs: 100%, 50%, 50%, 0%
Processing: DNA fragmentation, selective conversion of unmethylated Cs into Ts, DNA sequencing
Bisulphite-sequencing reads: ACGT, ATGA, ATGA, ATGT, TCGA, TCGA, TCGT, TTGT

b Wild-card alignment
Reference sequence: YYGATGATGTYGYTGAYGYAYGA
DNA methylation level: 100%, 50%, 100%, 0%

c Three-letter alignment
Reference sequence: TTGATGATGTTGTTGATGTATGA
DNA methylation level: N/A, 50%, N/A, 0%

Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. 
(As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)
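The two matching rules described in the caption can be sketched in a few lines. This is a toy illustration under the caption's own simplifications (zero mismatches, zero gaps, forward strand only). The function names are hypothetical, and real bisulphite aligners are considerably more involved.

```python
# Genomic DNA sequence from the Figure 2 example.
REFERENCE = "CCGATGATGTCGCTGACGCACGA"


def wildcard_hits(read, ref):
    """Positions where the read matches if every reference C may pair with C or T."""
    def base_matches(ref_base, read_base):
        if ref_base == "C":  # reference C acts as the wild card Y
            return read_base in ("C", "T")
        return ref_base == read_base

    n = len(read)
    return [
        pos
        for pos in range(len(ref) - n + 1)
        if all(base_matches(r, q) for r, q in zip(ref[pos:pos + n], read))
    ]


def three_letter_hits(read, ref):
    """Positions where read and reference agree after converting every C to T."""
    ref3 = ref.replace("C", "T")
    read3 = read.replace("C", "T")
    n = len(read3)
    return [pos for pos in range(len(ref3) - n + 1) if ref3[pos:pos + n] == read3]


for read in ("TCGA", "TCGT", "ACGT"):
    print(read, wildcard_hits(read, REFERENCE), three_letter_hits(read, REFERENCE))
```

For the methylated reads TCGA and ACGT, the wild-card rule finds a single position, whereas the three-letter rule finds two, so those reads would be discarded as ambiguous. This is consistent with the caption's observation that the three-letter alignment provides no value for the first and third CpG.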



Wild-card alignment
















Three-letter alignment
















MeDIP: methylated DNA immunoprecipitation (enrichment-based)

•  The antibody 5MeC-mAB, directed against 5-methylcytidine, is added and binds to the methylated fraction of the DNA.
•  Magnetic beads are added that bind to the antibody, allowing the methylated fraction to be captured with magnets.
•  The methylated fraction can then be isolated using proteinase K.
•  Once isolated, methylated DNA can be analyzed at candidate loci using qPCR, hybridised to microarrays for genome-wide testing, or high-throughput sequenced for whole-genome analysis.

Enrichment-based analysis steps

•  Normalization
•  Log ratios of input vs enriched
•  Determining regions enriched in methylation (peaks)
•  Assigning enriched regions to the closest gene
•  Further downstream analysis, according to the question
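The normalization and log-ratio steps listed above can be sketched as follows. The example assumes counts-per-million scaling and a pseudocount of 1; both are illustrative choices rather than something the slides specify, and the function name is hypothetical.

```python
import math


def log2_enrichment(ip_count, input_count, ip_total, input_total, pseudocount=1.0):
    """Log2 ratio of enriched (IP) vs input reads in one genomic window.

    Counts are scaled to reads per million to normalize for library size;
    the pseudocount avoids division by zero in empty windows.
    """
    ip_cpm = (ip_count + pseudocount) / ip_total * 1e6
    input_cpm = (input_count + pseudocount) / input_total * 1e6
    return math.log2(ip_cpm / input_cpm)


# Hypothetical window: 39 MeDIP reads vs 9 input reads, equal library sizes.
print(log2_enrichment(39, 9, 1_000_000, 1_000_000))
```

Windows whose log ratio exceeds a chosen threshold would be candidate peaks for the next step.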

Bisulphite microarray-based methylation data


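Methylation values on the Infinium platform


•  The methylation measures are represented as beta values (β).
•  Beta values are continuous variables between 0 and 1.
•  0 = not methylated.
•  1 = 100% methylated.
•  Methylation beta values: β = M / (U + M + e), with 0 < β < 1, where M and U are the methylated and unmethylated signal intensities and e is a small constant offset.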
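A minimal sketch of the beta-value formula follows. The offset default of 100 is the value commonly recommended for Infinium arrays, but it is an assumption here because the slide does not give a value for e. Clipping negative intensities to zero is likewise an illustrative choice.

```python
def beta_value(meth, unmeth, offset=100.0):
    """Beta = M / (U + M + e) for one CpG probe.

    meth and unmeth are the methylated and unmethylated signal
    intensities; negative values (possible after background
    subtraction) are clipped to zero in this sketch.
    """
    m = max(meth, 0.0)
    u = max(unmeth, 0.0)
    return m / (m + u + offset)


# Hypothetical probe intensities.
print(beta_value(4500.0, 500.0))   # mostly methylated
print(beta_value(300.0, 5000.0))   # mostly unmethylated
```

Because of the offset, the result stays strictly below 1 and is pulled towards 0 when both intensities are low, which stabilises values for weak probes.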

Histone modifications


Histone Modifications

http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif

Li et al. (2007) Cell 128, 707

Histone Modifications in Relation to Gene Transcription

Histone Modifications and Human Diseases

Coffin-Lowry syndrome is a rare genetic disorder characterized by mental retardation and abnormalities of the head and facial and other areas. It is caused by mutations in the RSK2 gene (histone phosphorylation) and is inherited as an X-linked dominant genetic trait. Males are usually more severely affected than females.

Rubinstein-Taybi syndrome is characterized by short stature, moderate to severe intellectual disability, distinctive facial features, and broad thumbs and first toes. It is caused by mutations in CREB-binding protein (histone acetylation).

Methods to profile histone modifications


ChIP-seq


Data analysis of ChIP-chip and ChIP-seq


Summary of ChIP-chip and ChIP-seq analysis



Table 1 | Publicly available ChIP-seq software packages discussed in this review

Fields per program: Profile; Peak criteria (a); Tag shift; Control data (b); Rank by; FDR (c); User input parameters (d); Artifact filtering: strand-based / duplicate (e); Refs.

CisGenome v1.1
  Profile: Strand-specific window scan
  Peak criteria: 1: Number of reads in window; 2: Number of ChIP reads minus control reads in window
  Tag shift: Average for highest ranking peak pairs
  Control data: Conditional binomial used to estimate FDR
  Rank by: Number of reads under peak
  FDR: 1: Negative binomial; 2: conditional binomial
  User input parameters: Target FDR, optional window width, window interval
  Artifact filtering: Yes / Yes
  Refs.: 10

ERANGE v3.1
  Profile: Tag aggregation
  Peak criteria: 1: Height cutoff
  Tag shift: High quality peak estimate, per-region estimate, or input
  Control data: Used to calculate fold enrichment and optionally P values
  Rank by: P value
  FDR: 1: None; 2: # control / # ChIP
  User input parameters: Optional peak height, ratio to background
  Artifact filtering: Yes / No
  Refs.: 4, 18

FindPeaks v3.1.9.2
  Profile: Aggregation of overlapped tags
  Peak criteria: Height threshold
  Tag shift: Input or estimated
  Control data: NA
  Rank by: Number of reads under peak
  FDR: 1: Monte Carlo simulation; 2: NA
  User input parameters: Minimum peak height, subpeak valley depth
  Artifact filtering: Yes / Yes
  Refs.: 19

F-Seq v1.82
  Profile: Kernel density estimation (KDE)
  Peak criteria: s s.d. above KDE for 1: random background, 2: control
  Tag shift: Input or estimated
  Control data: KDE for local background
  Rank by: Peak height
  FDR: 1: None; 2: None
  User input parameters: Threshold s.d. value, KDE bandwidth
  Artifact filtering: No / No
  Refs.: 14

GLITR
  Profile: Aggregation of overlapped tags
  Peak criteria: Classification by height and relative enrichment
  Tag shift: User input tag extension
  Control data: Multiply sampled to estimate background class values
  Rank by: Peak height and fold enrichment
  FDR: 2: # control / # ChIP
  User input parameters: Target FDR, number nearest neighbors for clustering
  Artifact filtering: No / No
  Refs.: 17

MACS v1.3.5
  Profile: Tags shifted then window scan
  Peak criteria: Local region Poisson P value
  Tag shift: Estimate from high quality peak pairs
  Control data: Used for Poisson fit when available
  Rank by: P value
  FDR: 1: None; 2: # control / # ChIP
  User input parameters: P-value threshold, tag length, mfold for shift estimate
  Artifact filtering: No / Yes
  Refs.: 13

PeakSeq
  Profile: Extended tag aggregation
  Peak criteria: Local region binomial P value
  Tag shift: Input tag extension length
  Control data: Used for significance of sample enrichment with binomial distribution
  Rank by: q value
  FDR: 1: Poisson background assumption; 2: From binomial for sample plus control
  User input parameters: Target FDR
  Artifact filtering: No / No
  Refs.: 5

QuEST v2.3
  Profile: Kernel density estimation
  Peak criteria: 2: Height threshold, background ratio
  Tag shift: Mode of local shifts that maximize strand cross-correlation
  Control data: KDE for enrichment and empirical FDR estimation
  Rank by: q value
  FDR: 1: NA; 2: # control / # ChIP as a function of profile threshold
  User input parameters: KDE bandwidth, peak height, subpeak valley depth, ratio to background
  Artifact filtering: Yes / Yes
  Refs.: 9

SICER v1.02
  Profile: Window scan with gaps allowed
  Peak criteria: P value from random background model, enrichment relative to control
  Tag shift: Input
  Control data: Linearly rescaled for candidate peak rejection and P values
  Rank by: q value
  FDR: 1: None; 2: From Poisson P values
  User input parameters: Window length, gap size, FDR (with control) or E-value (no control)
  Artifact filtering: No / Yes
  Refs.: 15

SiSSRs v1.4
  Profile: Window scan
  Peak criteria: N+ – N- sign change, N+ + N- threshold in region (f)
  Tag shift: Average nearest paired tag distance
  Control data: Used to compute fold-enrichment distribution
  Rank by: P value
  FDR: 1: Poisson; 2: control distribution
  User input parameters: 1: FDR; 1, 2: N+ + N- threshold
  Artifact filtering: Yes / Yes
  Refs.: 11

spp v1.0
  Profile: Strand-specific window scan
  Peak criteria: Poisson P value (paired peaks only)
  Tag shift: Maximal strand cross-correlation
  Control data: Subtracted before peak calling
  Rank by: P value
  FDR: 1: Monte Carlo simulation; 2: # control / # ChIP
  User input parameters: Ratio to background
  Artifact filtering: Yes / No
  Refs.: 12

USeq v4.2
  Profile: Window scan
  Peak criteria: Binomial P value
  Tag shift: Estimated or user specified
  Control data: Subtracted before peak calling
  Rank by: q value
  FDR: 1, 2: binomial; 2: # control / # ChIP
  User input parameters: Target FDR
  Artifact filtering: No / Yes
  Refs.: 20

(a) The labels 1: and 2: refer to one-sample and two-sample experiments, respectively.
(b) These descriptions are intended to give a rough idea of how control data is used by the software. 'NA' means that control data are not handled.
(c) Description of how FDR is or optionally may be computed. 'None' indicates an FDR is not computed, but the experimental data may still be analyzed; 'NA' indicates the experimental setup (1 sample or 2) is not yet handled by the software. # control / # ChIP, number of peaks called with control (or some portion thereof) and sample reversed.
(d) The lists of 'user input parameters' for each program are not exhaustive but rather comprise a subset of greatest interest to new users.
(e) 'Strand-based' artifact filtering rejects peaks if the strand-specific distributions of reads do not conform to expectation, for example by exhibiting extreme bias of tag populations for one strand or the other in a region. 'Duplicate' filtering refers to either removal of reads that occur in excess of expectation at a location or filtering of called peaks to eliminate those due to low complexity read pileups that may be associated with, for example, microsatellite DNA.
(f) N+ and N– are the numbers of positive and negative strand reads, respectively.
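Several entries in the table use a Poisson P value as the peak criterion. The sketch below shows the general idea only: the probability of seeing at least the observed number of reads in a window, given an expected background rate. It is a generic illustration with hypothetical names and numbers, not the implementation used by MACS, spp or any other listed package.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by summing the lower tail."""
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


# Hypothetical window: 30 ChIP reads observed where the background
# model expects 5 reads on average.
print(poisson_sf(30, 5.0))
```

A window would be reported as a peak when this probability falls below the chosen P-value threshold. In a two-sample design, the expected rate would typically be estimated from the control data.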

S26 | VOL.6 NO.11s | NOVEMBER 2009 | NATURE METHODS SUPPLEMENT

REVIEW

© 2009 Nature America, Inc. All rights reserved.

Cross-talk between DNA methylation and histone modifications


Epigenetic Databases


Databases


Bioinformation open access

www.bioinformation.net Current Trends

ISSN 0973-2063 (online) 0973-8894 (print)

Bioinformation 4(7): 331-337 (2010) © 2010 Biomedical Informatics


Supplementary material Table 1. Some epigenetic and related databases reviewed in this article.

MethDB
  Description: Contains information on 19,905 DNA methylation content data and 5,382 methylation patterns for 48 species, 1,511 individuals, 198 tissues and cell lines and 79 phenotypes.
  URL: http://www.methdb.de
  Ref: [39]

PubMeth
  Description: Contains over 5,000 records on methylated genes in various cancer types.
  URL: www.pubmeth.org/
  Ref: [43]

REBASE
  Description: Contains over 22,000 DNA methyltransferase genes derived from GenBank.
  URL: http://rebase.neb.com/rebase/rebase.html
  Ref: [127]

MeInfoText
  Description: Contains gene methylation information across 205 human cancer types.
  URL: http://mit.lifescience.ntu.edu.tw/
  Ref: [44]

MethPrimerDB
  Description: Contains 259 primer sets from human, mouse and rat for DNA methylation analysis.
  URL: medgen.ugent.be/methprimerdb/
  Ref: [40]

The Histone Database
  Description: Contains 254 sequences from histone H1, 383 from histone H2, 311 from histone H2B, 1043 from histone H3 and 198 from histone H4, altogether representing at least 857 species.
  URL: http://genome.nhgri.nih.gov/histones/
  Ref: [42]

ChromDB
  Description: Contains 9,341 chromatin-associated proteins, including RNAi-associated proteins, for a broad range of organisms.
  URL: http://www.chromdb.org/
  Ref: [128]

CREMOFAC
  Description: Contains 1725 redundant and 720 non-redundant chromatin-remodeling factor sequences in eukaryotes.
  URL: http://www.jncasr.ac.in/cremofac/
  Ref: [129]

The Krembil Family Epigenetics Laboratory
  Description: Contains DNA methylation data of human chromosomes 21, 22, male germ cells and DNA methylation profiles in monozygotic and dizygotic twins.
  URL: http://www.epigenomics.ca
  Ref: -

MethyLogiX DNA methylation database
  Description: Contains DNA methylation data of human chromosomes 21 and 22, male germ cells and late-onset Alzheimer's disease.
  URL: http://www.methylogix.com/genetics/database.shtml.htm
  Ref: [20]

NGS summary


•  Genomics: to study genome content and variations such as SNPs, CNVs and mutations.
•  Transcriptomics (RNA-seq): to study gene expression, novel transcript fusions, splice variants and splice junctions.
•  Epigenomics: to study modifications such as DNA methylation and histone modifications, and their roles in development and disease.

Combinatorial analysis


Combinatorial regulation


NHGRI Current Topics in Genome Analysis 2010

Week 6: Regulatory and Epigenetic Landscapes of Mammalian Genomes

February 23, 2010

Laura Elnitski, Ph.D.


Effect

Towards personalized medicine

