Download - Enzymatic Methyl-seq: Next Generation Methylomes

Transcript
Page 1: Enzymatic Methyl-seq: Next Generation Methylomes

Fiona J. Stewart1, V K Chaithanya Ponnaluri1, Brittany S. Sexton1, Louise Williams1, Bradley W. Langhorst1, Romualdas Vaisvila1, Yvonne Helbert2, Lei Zhang2, Timothy Harkins2, Kevin J. McKernan2, Eileen T. Dimalanta1, Theodore B. Davis1

1 New England Biolabs, Inc., Ipswich, MA 01938 2 Medicinal Genomics Corp., Woburn, MA 01801

Enzymatic Methyl-seq: Next Generation Methylomes

DNA methylation is important for gene regulation. The ability to accurately identify 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) gives us greater insight into potential gene regulatory mechanisms. Bisulfite sequencing (BS) is traditionally used to detect methylated Cs, however, BS does have its drawbacks. DNA is commonly damaged and degraded by the chemical bisulfite reaction resulting in libraries that demonstrate high GC bias and are enriched for methylated regions. To overcome these limitations, we developed an enzymatic approach, NEBNextÒEnzymatic Methyl-seq (EM-seqÔ), for methylation detection that minimizes DNA damage, resulting in longer fragments and minimal GC bias.

Illumina libraries were prepared using bisulfite and EM-seq methods with 50 ng DNA from Arabidopsis thaliana and Cannabis sativa DNA. Libraries were sequenced using Illumina’s NextSeq500 (2x75). Reads were aligned using BWAMeth 0.2. and methylation information was extracted from the alignments using MethyDackel. Total 5mC levels were compared between the sequencing data from EM-seq and WGBS libraries and LCMS (Liquid Chromatography Mass Spectrometry). 5mC levels determined by EM-seq are close to those from LCMS, whereas, WGBS results in an overestimation of 5mC. Additionally, EM-seq libraries produce higher quality sequencing metrics such as longer inserts, lower duplication rates, a higher percentage of mapped reads and less GC bias compared to bisulfite converted libraries. We conclude that EM-seq is superior to WGBS and delivers higher library yields, more accurate methylation information, reduced DNA damage, increased sequencing length, and decreased GC-bias.

Sample Preparation

Data Analysis

Methyl KitBWAmeth

Methyl Extraction

Align(BWAmeth) Deduplicate Picard

• Reads were aligned to Jamaican Lion reference genome (August 2018 assembly) or theArabidopsis reference genome (TAIR10) (for Jamaican Lion, four miscellaneous contigswere removed from methylation analysis)

• Data were analyzed using the tools in above flowchart.

• Two plant DNAs were used to make EM-seq libraries• Cannabis sativa genomic DNA (Jamaican Lion): female clones (leaf, seeded and

unseeded flowers) & male sibling (flowers) plants• Arabidopsis thaliana

• Libraries were made using 50 ng genomic DNA, spiked with control DNA (unmethylatedlambda & CpG-methylated pUC19)

• Libraries were sequenced using an Illumina NextSeq 500, 2x76 base paired reads. 5caC issequenced as C and deaminated C as T.

• Bisulfite conversion was performed using Zymo Research EZ DNA Methylation-GoldTM kit

Cannabis sativa: Higher Quality Sequencing Metrics with EM-seq compared to WGBS Cannabis sativa female leaf EM-seq libraries are superior to WGBS

EM-seq libraries outperform WGBS for Cannabis sativa input DNA. (A)Fewer PCR cycles are required for EM-seq. (B) EM-seq libraries havelower duplication than WGBS. (C) EM-seq libraries have larger libraryinsert sizes than WGBS. In addition, the EM-seq protocol has beenoptimized for both standard and large insert libraries. (D) GC distributionis more even for EM-seq than WGBS. (E) EM-seq has higher coveragedepths than WGBS. 50 million 2 x 76 base reads were used for analysis(Figures A-D) and 130 million 2 x 76 base reads were used for Figure E.

0

5

10

15

20

25

Female Leaf EM-seq-1 Female Leaf EM-seq-2 Female Leaf WGBS-1 Female Leaf WGBS-2

Yiel

d (n

M)

PCR

Yie

ld (n

M)

EM-seq-1 EM-seq-2 WGBS-1 WGBS-20

5

10

15

20

25

6 PCR cycles 8 PCR cycles

0

0.5

1

1.5

2

2.5

3

3.5

Female Leaf EM-seq-1 Female Leaf EM-seq-2 Female Leaf WGBS-1 Female Leaf WGBS-2

Perc

ent D

uplic

atio

n

0.5

1.0

1.5

2.0

2.5

3.0

3.5

EM-seq-1 EM-seq-2 WGBS-1 WGBS-2

% D

uplic

atio

n

0

EM-seq can be used to investigate plant genomic DNA• analysis of the Cannabis sativa methylome identified genes involved in seed and THC production • the Arabidopsis methylome was successfully probed

Acknowledgements Thank you to Laurie Mazzola and Danielle Fuchs for their technical support.

• Higher library yields with fewer PCR cycles• Lower percent duplication• More even base coverage

• Larger library insert sizes• Less GC bias• Similar percentage methylation as LC-MS

Differential CpG methylation identified between Cannabis flower tissues

Differential methylation across Cannabis sativa flower tissues using EM-seq data. Female flower and female seeded flower (clones) as well as the male flower (sibling)were studied. (A) Percent cytosine methylation in the CpG, CHG, and CHH contexts. Female methylation levels for both flower and seeded flower were higher than for themale flower indicating methylation patterns are potentially determined by sex. Control DNA methylation levels were <0.5% for unmethylated lambda and >97.5% for CpGmethylated pUC19 (data not shown). (B) Volcano plot of the significant (q<0.01) differential methylation calls for the CpG context between (1) female flower andfemale seeded flower (2) female flower and male flower. The differential methylation calls for female flower and female seeded flower identified 30 hypomethylatedCpGs but no hypermethylated CpGs. The differential methylation calls for female and male flower identified >50,000 hypomethylated CpGs and >11,000hypermethylated CpGs. (C) Comparison of the hypomethylated CpGs with differential methylation between the female flower samples & female and male flowers.

B

MethylatedUnmethylated

Genomic Location (504 bp)

MethylatedUnmethylated

Differential methylation analysis comparingthe female flower with the seeded flower(clones) and the male flower (sibling).Significant differential methylation calls(q<0.01) for CpG context (1x coverageminimum) between female flower with femaleseeded flower and/or male flower wereidentified. (A) Region 1 BLASTs to the edestingene family of the Cannabis sativa genome.These genes are linked to the positiveregulation of seed production. The femaleflower is CpG methylated in this region,suggesting the expression of the genes areturned off, while the female seeded flower isunmethylated at this CpG. (B) Region 2BLASTs to the THC acid synthase (THCA)gene of the Cannabis sativa genome and thisgene is linked to the positive regulation ofTHCA production. The female flower is CpGunmethylated in this region, suggesting theexpression of the gene is turned on, while themale flower is methylated at this CpG,suggesting the expression of the gene isturned off. Male flowers produce log scaleless THCA.

(A) Comparison of the total percentage of 5mC determined using LC-MS, EM-seq and WGBS for female leaf. The percentage ofmethylated cytosines using LC-MS is calculated by determining theamount of methylated cytosines and the total cytosines ((5mC/(totalC)x100). 5mC percentages for EM-seq and WGBS were determinedby combining 5mC in the three cytosine contexts (CpG, CHH, CHG).EM-seq cytosine methylation numbers are closer to LCMSmethylation values than WGBS. (B) Cytosine methylation in the CpG,CHG, and CHH contexts for EM-seq and WGBS for female leaf.Control cytosine methylation for unmethylated lambda DNA was<0.5% for EM-seq and <2% for WGBS, and for CpG methylatedpUC19 was >97.5% for both EM-seq and WGBS (data not shown).(C) CpG context correlations of EM-seq and WGBS libraries (1xminimum coverage). Both EM-seq and WGBS libraries were highlycorrelated between replicates and methods (CHG and CHH contextdata not shown). 50 million, 2 x 76 base reads were used formethylation analysis.

Female & Female Seeded Flower

Female & Male Flower

C

B

Methylation profiles identified genes involved in seed & cannabinoid production

Genomic Location (992 bp)

WGBS0

10

20

30

40

LC-MS Female Leaf EM-seq Female Leaf WGBS Female Leaf

Tota

l % 5

mC

A 40

30

20

10

0EM-seqLC-MS

A

0

25

50

75

100

CpG CHG CHH CpG CHG CHH CpG CHG CHH

100

75

50

25

0

% 5

mC

per

C c

onte

xt

Female Flower Female Seeded Flower

Male Flower

C

Hypo Hyper

0 100

q-va

lue

Methylation Difference

0.0000

0.0025

0.0050

0.0075

0.0100

−100 −50 0 50 100meth.diff

qvalue factor(factor)

hypo

0.0000

0.0025

0.0050

0.0075

0.0100

−100 −50 0 50 100meth.diff

qvalue

factor(factor)hypo

hyper

0.0100

0.0075

0.0050

0.0025

0.0000

-100 0 100Methylation Difference

-100

0.0100

0.0075

0.0050

0.0025

0.0000

q-va

lue

Female & Female Seeded Flower Female & Male Flower

50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800

Insert Size

JL

0.00%

0.02%

0.04%

0.06%

0.08%

0.10%

0.12%

0.14%

0.16%

0.18%

0.20%

0.22%

Avg. %

of in

sert

s

% o

f Rea

ds

EM-seq (standard)EM-seq (large insert)WGBS

0

4

8

12

16

20

50 100 200 300 400 500 600 700Insert Size (bp)

Nor

mal

ized

Cov

erag

e

Percent GC0 10 20 30 40 50 60 70 80 90 100

EM-seqWGBS

00.20.40.60.81.01.21.41.61.82.02.2

A B

C D

Coverage Depth

% o

f Bas

es a

t Cov

erag

e D

epth

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Coverage Depth

Pro

ject

Medic

inalG

enom

ics_Leaf_

WG

BS

_E

Mseq_C

om

bin

edB

am

s

0.0%

1.0%

2.0%

3.0%

4.0%

5.0%

6.0%

7.0%

8.0%

9.0%

10.0%

Avg. F

rac B

ases

10

0

4

2

6

8

1

EM-seqWGBS

305 10 15 20 25

E

0

25

50

75

100

CpG CHG CHH CpG CHG CHH CpG CHG CHH CpG CHG CHH

EM-seq Female Leaf WGBS Female Leaf

B 100

75

50

25

0

% 5

mC

per

C c

onte

xt

CpG

CH

GC

HH

CpG

CH

GC

HH

CpG

CH

GC

HH

CpG

CH

GC

HH

C

0

0.05

0.1

0.15

0.2

0.25

0.3

0 200 400 600

Cou

ntM

illion

s

Insert Size (bp)

Library Insert Size WGBS-1WGBS-2EM-seq-1EM-seq-2

Histogram of CpG coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

050

0000

1000

000

1500

000

2000

000

0.30 00.30.200.50.30.31.42.1

7.3

21

41.1

20.4

4.2

0.20 0 0 0 00.10.10 0 0 0 00.10 0 0

EM−seq−1

Histogram of CpG coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

020

0000

4000

0060

0000

8000

0010

0000

012

0000

0

1.20 0

2.2

3.6

0

11.7

7.57.8

21.4

10.510.5

7.8

5.8

32.2

1.71.10.80.50.30.20.10 0 0 0 0 0 0 0 0 0

WGBS−1

Histogram of % CpG methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

050

0000

1500

000

2500

000

3500

000 64.5

1.8 0.7 0.6 0.5 0.5 0.5 0.5 0.4 0.7 0.4 0.7 0.8 1.2 1.7 2.3 3.24.9 5.9

8.1

EM−seq−1

Histogram of % CpG methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

050

0000

1500

000

2500

000

3500

000

63.6

2 1.2 1 0.6 0.4 0.6 0.5 0.4 1 0.2 0.9 0.6 1.4 1.9 2.1 2.44.5 4.1

10.7

WGBS−1

0

1

2

3

4

5

6

1 11 21 31 41

CpG

s co

vere

dM

illion

s

Minimum Coverage Depth

CpG coverage distribution

WGBS-1WGBS-2EM-seq-1EM-seq-2

EM-seq and WGBS libraries were made using 50 ng Arabidopsis thaliana genomic DNA. Libraries were sequenced on an Illumina NextSeq 500 (2 x 75 bases). 125 million paired end reads for each library were aligned to TAIR10 using bwa-meth 0.2.2. CpG, CHH and CHG sites on both strands were counted independently. (A) EM-seq libraries have larger insert sizes. (B) Cytosine methylation in the CpG and CHN contexts for EM-seq and WGBS. (C) GC Bias plot for EM-seq and WGBS libraries. EM-seq libraries show less bias across the GC content.

Histogram of CHG coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

050

0000

1000

000

1500

000

2000

000

2500

000

0.30 00.20.200.40.20.21.21.8

6.5

19.9

41.6

22

4.7

0.20 0 0 0 00.10.10 0 0 0 00.10 0 0

EM−seq−1

Histogram of CHG coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

020

0000

4000

0060

0000

8000

0012

0000

0

1.10 0

2.1

3.5

0

11.8

7.78.2

22.5

1110.6

7.5

5.3

2.61.91.40.90.60.40.30.20.10 0 0 0 0 0 0 0 0 0

WGBS−1

Histogram of % CHG methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

0e+0

01e

+06

2e+0

63e

+06

4e+0

65e

+06

80.7

1.8 1 1 1 0.9 1 1.1 0.9 1.4 0.8 1.2 1 1.2 1.2 1 0.9 0.8 0.5 0.7

EM−seq−1

Histogram of % CHG methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

0e+0

01e

+06

2e+0

63e

+06

4e+0

6

78.7

2.5 1.5 1.3 1 0.7 1 1 0.8 1.6 0.5 1.2 0.8 1.3 1.3 1.1 0.9 1 0.6 1.5

WGBS−1

Histogram of CHH coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0e

+00

2.0e

+06

4.0e

+06

6.0e

+06

8.0e

+06

1.0e

+07

1.2e

+07

0.30 00.30.200.40.20.31.31.9

6.8

20.3

41.4

21.4

4.6

0.20 0 0 0 00.10.10 0 0 0 00.10 0 0

EM−seq−1

Histogram of CHH coverage

log10 of read coverage per base

Freq

uenc

y

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0e+0

01e

+06

2e+0

63e

+06

4e+0

65e

+06

6e+0

6

1.20 0

23.4

0

11.2

7.27.7

21.4

10.710.8

8.1

6

3.12.3

1.71.10.80.50.30.20.10 0 0 0 0 0 0 0 0 0

WGBS−1

Histogram of % CHH methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

0.0e

+00

5.0e

+06

1.0e

+07

1.5e

+07

2.0e

+07

2.5e

+07

88.9

3.5 1.6 1.3 0.9 0.6 0.6 0.5 0.4 0.5 0.2 0.3 0.2 0.2 0.1 0.1 0 0 0 0.1

EM−seq−1

Histogram of % CHH methylation

% methylation per base

Freq

uenc

y

0 20 40 60 80 100

0.0e

+00

5.0e

+06

1.0e

+07

1.5e

+07

2.0e

+07

2.5e

+07 84.8

4.5 2.5 1.8 1.1 0.7 0.8 0.6 0.4 0.7 0.2 0.4 0.2 0.3 0.2 0.1 0.1 0.1 0 0.2

WGBS−1

01234567

1 11 21 31 41

CH

Gs

cove

red

Milli

ons

Minimum Coverage Depth

CHG coverage distribution

WGBS-1WGBS-2EM-seq-1EM-seq-2

05

101520253035

1 21 41

CH

H c

over

edM

illion

s

Minimum Coverage Depth

CHH coverage distribution

WGBS-1WGBS-2EM-seq-1EM-seq-2

www.NEBNext.com

0%

10%

20%

30%

40%

EM-seq WGBS

% 5

mC

in C

pG c

onte

xt

CpG

EM-seq WGBS

0%

2%

4%

6%

8%

10%

EM-seq WGBS

% 5

mC

in C

HN

con

text

CHN

EM-seq WGBS

EM-seq libraries cover more cytosines to greater minimum depths than WGBS. EM-seq identifies more CpGs, CHHs and CHGs, at higher coverage depth compared to WGBS, resulting in more usable information.

Cytosine correlations of EM-seq and WGBS libraries (1x minimum coverage). Both EM-seq and WGBS libraries were highly correlated between replicates and methods). 125 million, 2 x75 base reads were used for methylation analysis. Correlations in the CpG, CHG and CHH contexts are shown

Cannabis sativa

Arabidopsis thaliana

Introduction Methods

Methods

CpG

CH

G

CH

H

CpG

CH

G

CH

H

CpG

CH

G

CH

H

C

AFemale Flower-1

Female Flower-2

Female Seeded Flower-1Female Seeded Flower-2

Male Flower-1

Male Flower-2

Female Flower-1

Female Flower-2

Female Seeded Flower-1Female Seeded Flower-2

Male Flower-1

Male Flower-2

012345678

0 20 40 60 80 100

Nor

mal

ized

cov

erag

e

Percent GC

GC-bias plotWGBS-1WGBS-2EM-seq-1EM-seq-2

A

C

C

Conclusion

Histograms showing CpG,CHG and CHH coverage ofEM-seq (A) and WGBS (C)libraries were plotted usingMethylKit. The histogram isshifted right for EM-seqcompared to WGBS indicatingthat more CpGs are detected athigher coverage for EM-seqlibraries compared to WGBS.Percent methylation forlibraries made using EM-seq(B) and WGBS (D) are shownin the CpG, CHG and CHHcontexts. EM-seq and WGBSdata both show that mostcytosines in the CHG and CHHcontexts are not methylated.Methylation is found in the CpGcontext for both EM-seq andWGBS libraries.

A

B

C

D

EM-seq libraries compared to WGBS libraries had: