

Machine Learning Applied to Single-Molecule Electronic DNA Mapping for Structural Variant Verification in Human Genomes

Bready, B.; Davis, J.; Grinberg, B.; Kaiser, M.; Oliver, J.; Sage, J.; Seward, L. Nabsys 2.0 LLC, Providence, RI 02903

Using SV-Verify, thousands of hypotheses can be queried in a single analysis. In the human genome NA24385, a set of 9,000 putative deletion calls ≥300 bp, with varying levels of support from different technologies, was evaluated. The distributions of the resulting posterior probability values for all considered deletions asserted by 1 (n = 4443), 2 (n = 600), 3 (n = 691), and 4 (n = 244) technologies are shown in the left panel. The tails of the distributions indicate that SV-Verify is able to discriminate between accurate calls (enriched in the 4 Tech set) and inaccurate calls (prevalent in the 1 Tech set). ROC curves were used to convert the posterior probability output from each SVM to a specificity value for each putative call. Here we employed a specificity threshold of 0.9 (sp90) to confirm a putative deletion. The percentage of evaluated deletions confirmed at sp90, broken down by putative deletion size and number of technologies making the call, is shown in the right panel. SV-Verify demonstrated good sensitivity across all considered putative deletion size ranges, down to 300 bp.
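The conversion from posterior probability to specificity, followed by the sp90 cut, can be illustrated with a minimal sketch. It assumes that the specificity assigned to a call is the fraction of negative controls scoring strictly below that call's posterior probability; the poster does not spell out the exact procedure, so this rule, the function names, and all numbers below are hypothetical.

```python
from bisect import bisect_left


def specificity_at(posterior, negative_scores_sorted):
    """Fraction of negative controls scoring strictly below `posterior`.

    If calls were accepted whenever their score is >= `posterior`, this is
    the share of known negatives that would be correctly rejected.
    `negative_scores_sorted` must be sorted in ascending order.
    """
    if not negative_scores_sorted:
        raise ValueError("need at least one negative-control score")
    below = bisect_left(negative_scores_sorted, posterior)
    return below / len(negative_scores_sorted)


def confirm_calls(posteriors, negative_scores, min_specificity=0.9):
    """Return True for each call whose specificity meets the threshold."""
    negatives = sorted(negative_scores)
    return [specificity_at(p, negatives) >= min_specificity for p in posteriors]


# Hypothetical posterior probabilities, not taken from the poster.
negative_controls = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70]
putative_calls = [0.95, 0.50, 0.12]

confirmed = confirm_calls(putative_calls, negative_controls)
print(confirmed)
```

With these made-up scores, only the first call clears sp90: all ten negative controls score below 0.95, whereas only eight of ten score below 0.50.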

SV-Verify training and workflow

The Nabsys SV-Verify software package provides an efficient, robust pipeline for the systematic and automated evaluation of putative SVs. SVM training was accomplished using reference material from the National Institute of Standards and Technology (NIST) Genome in a Bottle (GIAB) consortium for a well-characterized human genome, NA12878. Deletion calls ≥300 bp asserted by multiple technologies were parsed into four classes and used to train four distinct SVMs. The relationship between specificity and sensitivity for each respective SVM is graphed as a receiver operating characteristic (ROC) curve shown to the right.

To evaluate hypothesized SVs, Nabsys single-molecule reads, a base reference map, and variant reference maps are used as inputs to HD-Mapping. SV-Verify utilizes the four unique SVMs tailored to different classes of structural variation (i.e., size, type, and complexity) to output a posterior probability for each putative variant. Posterior probabilities and ROC curves are then used to determine calls at a given specificity threshold.
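As an illustration of how a ROC curve relates sensitivity to specificity for one classifier, the sketch below sweeps a threshold over the scores of labeled training examples. It is a generic construction under stated assumptions, not the SV-Verify implementation; `roc_points`, `lowest_threshold_for_specificity`, and the example scores are all hypothetical.

```python
def roc_points(positive_scores, negative_scores):
    """Return (threshold, sensitivity, specificity) tuples, ascending by threshold.

    A score is treated as a positive prediction when it is >= threshold.
    """
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    points = []
    for t in thresholds:
        true_pos = sum(s >= t for s in positive_scores)
        true_neg = sum(s < t for s in negative_scores)
        points.append(
            (t, true_pos / len(positive_scores), true_neg / len(negative_scores))
        )
    return points


def lowest_threshold_for_specificity(points, target):
    """First (lowest) threshold whose specificity reaches `target`, or None."""
    for threshold, sensitivity, specificity in points:
        if specificity >= target:
            return threshold, sensitivity, specificity
    return None


# Hypothetical classifier scores for true deletions and negative controls.
deletions = [0.90, 0.80, 0.75, 0.40]
controls = [0.10, 0.20, 0.30, 0.60, 0.05]

curve = roc_points(deletions, controls)
operating_point = lowest_threshold_for_specificity(curve, 0.9)
print(operating_point)
```

Because specificity can only rise as the threshold rises, taking the lowest threshold that reaches the target retains as much sensitivity as possible at that specificity.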

To construct whole genome maps, high molecular weight DNA is isolated using solution-phase, commercially available kits. The genomic DNA is tagged in a sequence-specific manner through an enzymatic nicking reaction. As single molecules pass through the detector, the DNA backbone and attached tags are sensed as changes in the resistance of the detector. The resulting data indicate the time between tagged sites on each DNA backbone. These temporal events are converted to distance-based events in which the distance between adjacent tags (termed an “interval”) is reported in base pairs.
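The time-to-distance conversion can be sketched as follows, assuming a constant translocation velocity for the molecule. That is a simplification made here for illustration; the poster does not describe how the instrument performs the conversion. The function name and the tag times are hypothetical.

```python
def intervals_in_bp(tag_times_us, velocity_bp_per_us):
    """Convert tag detection times to inter-tag distances in base pairs.

    tag_times_us: detection times of successive tags, in microseconds.
    velocity_bp_per_us: assumed constant velocity (1 Mbp/s equals 1 bp/us).
    """
    if velocity_bp_per_us <= 0:
        raise ValueError("velocity must be positive")
    times = sorted(tag_times_us)
    return [
        round((later - earlier) * velocity_bp_per_us)
        for earlier, later in zip(times, times[1:])
    ]


# Hypothetical tag times on one molecule moving at 1 Mbp/s.
tag_times = [0, 300, 1100, 2600]
intervals = intervals_in_bp(tag_times, velocity_bp_per_us=1.0)
print(intervals)  # [300, 800, 1500]
```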

Single-molecule electronic detection

The accuracy of Nabsys single-molecule read mapping is central to the effective detection of SVs across a range of sizes using SV-Verify. To demonstrate the mapping accuracy of the Nabsys platform, single-molecule data were collected for E. coli MG1655 nicked with Nt.BspQI and mapped to the high-quality reference. As shown in the plot above, there is a high degree of agreement between the expected reference interval sizes and the consensus interval sizes generated through Nabsys read mapping. The linear relationship observed (R² = 0.9999) extended down to intervals as small as 300 bp, well below the diffraction limit of optical mapping approaches. De novo assembly of these data resulted in 3 maps that spanned 99.4% of the reference with 0 false positives and 0 false negatives for intervals >500 bp. These results highlight the accuracy and resolution of Nabsys HD-Mapping, which forms the foundation of the SV-Verify pipeline.
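A comparison of consensus and reference interval sizes of this kind is typically summarized with an ordinary least-squares line and its R² value. The sketch below shows that calculation on a handful of made-up interval pairs; the data are hypothetical and are not the E. coli measurements reported here.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical reference and consensus interval sizes in base pairs.
reference_bp = [300, 800, 1300, 1800, 5000]
consensus_bp = [310, 805, 1312, 1805, 5003]

slope, intercept, r_squared = linear_fit(reference_bp, consensus_bp)
print(f"y = {slope:.4f}x + {intercept:.2f}, R^2 = {r_squared:.4f}")
```

A slope near 1 and a small intercept indicate that consensus intervals track the reference without systematic scaling or offset.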

Underlying technology biases can significantly impact the accuracy of SV calls, particularly as the size of a given SV exceeds read length. To investigate this phenomenon, we determined the percentages of evaluated putative deletion calls made using Illumina or PacBio data (underlying data type; may include several data sets and several callers) that were confirmed at sp90 or refuted using a posterior probability threshold of ≤0.1 by SV-Verify in various size ranges (see below). The number of calls per category is indicated above each bar. As expected based on read length, the percentage of Illumina deletion calls confirmed by SV-Verify decreased as a function of deletion size, while the longer PacBio read lengths translated to more consistent calls across a range of deletion sizes, highlighting the importance of long-range information for large SV call accuracy.
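The confirmed/refuted tally per size range can be sketched as below. The thresholds (specificity ≥0.9 to confirm, posterior probability ≤0.1 to refute) and the size bins come from the text and figure axes; the "undetermined" label for calls that meet neither threshold, the function names, and the example calls are assumptions made for illustration.

```python
from collections import Counter

# Size bins taken from the figure axes; calls of 3000 bp or more share one bin.
SIZE_BINS = [(300, 499, "300-499"), (500, 999, "500-999"), (1000, 2999, "1000-2999")]


def size_bin(size_bp):
    """Return the size-bin label for a deletion, or None if below 300 bp."""
    if size_bp >= 3000:
        return ">=3000"
    for low, high, label in SIZE_BINS:
        if low <= size_bp <= high:
            return label
    return None


def classify(specificity, posterior, confirm_spec=0.9, refute_post=0.1):
    """Label a call as confirmed, refuted, or (hypothetical label) undetermined."""
    if specificity >= confirm_spec:
        return "confirmed"
    if posterior <= refute_post:
        return "refuted"
    return "undetermined"


def tally(calls):
    """Count calls per (size bin, outcome); each call is (size_bp, specificity, posterior)."""
    counts = Counter()
    for size_bp, specificity, posterior in calls:
        label = size_bin(size_bp)
        if label is not None:
            counts[(label, classify(specificity, posterior))] += 1
    return counts


# Hypothetical calls: (deletion size in bp, specificity, posterior probability).
example_calls = [
    (350, 0.97, 0.92),
    (420, 0.40, 0.05),
    (1500, 0.95, 0.88),
    (4200, 0.30, 0.02),
    (4200, 0.60, 0.45),
    (250, 0.99, 0.99),  # below 300 bp, so it is skipped
]
counts = tally(example_calls)
print(dict(counts))
```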

Conclusions

ASHG 2017 Booth 753

The Nabsys HD-Mapping platform combined with the SV-Verify software package provides a high-throughput, fully automated tool for the evaluation of structural variation in human genomes. SV-Verify can clearly distinguish between accurate and inaccurate deletion calls as small as 300 bp in size. The results here highlight the critical need for orthogonal technologies with a broad effective size range for accurate characterization of SVs on a genome-wide scale.

We thank Justin Zook, Ph.D., and Marc Salit, Ph.D., of the NIST Genome-Scale Measurement Group for welcoming our participation in GIAB.

[Panel titles: NA24385 putative deletion call set evaluation using SV-Verify; Evaluation of Illumina and PacBio calls (panels: Illumina, PacBio); Mapping of single-molecule reads to evaluate thousands of putative SVs simultaneously; ROC curves resulting from each support vector machine]

The importance of structural variation in human disease and the difficulty of detecting structural variants larger than 50 base pairs have led to the development of several long-read sequencing technologies and optical mapping platforms. Frequently, multiple technologies and ad hoc methods are required to obtain a consensus regarding the location, size, and nature of a structural variant, with no single approach able to reliably bridge the gap in variant sizes between those readily detected using NGS technologies and the largest rearrangements observed with optical mapping. Often, structural variants larger than 10 kilobases are not detected.

To address this unmet need, we have developed a new software package, SV-Verify™, which utilizes data collected with the Nabsys High Definition Mapping (HD-Mapping™) system to perform hypothesis-based verification of putative deletions. We demonstrate that whole genome maps, constructed from data generated by electronic detection of tagged DNA hundreds of kilobases in length, can be used effectively to facilitate calling of structural variants ranging in size from 300 base pairs to hundreds of kilobase pairs. SV-Verify implements hypothesis-based verification of putative structural variants using supervised machine learning. Machine learning is realized using a set of support vector machines capable of concurrently testing several thousand independent hypotheses. We describe support vector machine training, utilizing 1089 deletions and 4637 negative controls from a well-characterized human genome. Plots delineating the specificity versus sensitivity of each of the support vector machines will be presented. We subsequently applied the trained classifiers to another human genome, evaluating >5000 putative deletions and demonstrating high sensitivity and specificity for deletions from 300 base pairs to hundreds of kilobases. Over 78% of deletions called by three or more technologies were confirmed by SV-Verify.

Single-molecule tag detection at a velocity of >1 Mbp/s.

[Workflow figure labels: High molecular weight DNA isolation; Sequence-specific tag attachment; Tagged sample introduction into instrument; 35-500 kb molecules. Detector trace with axes labeled t and V, showing the DNA backbone and sequence-specific tags.]

[Four deletion classes labeled in the figure: Deletions 300–499 bp; Deletions 500–999 bp; Deletions spanning multiple intervals; Deletions ≥1000 bp]

Advantages of Nabsys electronic detection:
• Long-range information
• Easy to multiplex without cross-talk
• Highly scalable
• High resolution, direct detection of 300 bp intervals
• Low, stochastic single-molecule false-positive and false-negative rates
• Electrophoretic and hydrodynamic control of access to detector
• Highly sensitive detection enables tag detection during translocation
• Wider range of useful DNA lengths as compared to optical methods

[Figure: E. coli MG1655 Nt.BspQI consensus vs. reference interval sizes. x-axis: Reference Interval Size (bp); y-axis: Consensus Interval Size (bp). Axis ticks span 0 to 50,000 bp, with a second set spanning 300 to 1800 bp. Linear fit: y = 0.9983x + 10.52, R² = 0.9999.]

[Figure: Posterior probability distributions parsed by number of asserting technologies (panels a and b). x-axis: Posterior Probability (bin center), 0.05 to 0.95; y-axis: Fraction of Total Deletions, 0.00 to 0.50. Series: 1 Tech, 2 Tech, 3 Tech, 4 Tech.]

[Figure: Call sensitivity parsed by number of asserting technologies for different deletion sizes. x-axis: Putative Deletion Size (bp), in bins 300-499, 500-999, 1000-2999, ≥3000; y-axis: Percent of Evaluated Confirmed, 0% to 100%. Series: 1 Tech, 2 Tech, 3 Tech, 4 Tech.]


[Figure: Confirmed and refuted putative deletion calls by technology. Panels: a. Illumina; b. PacBio; c. Bionano; d. Complete Genomics. x-axis: Putative Deletion Size (bp), in bins 300-499, 500-999, 1000-2999, ≥3000; y-axis: Percent of Evaluated, 0% to 100%. Series: Confirmed, Refuted. Call counts printed above the bars, in extraction order: 562, 139, 454, 255, 778, 1024, 540, 935, 465, 50, 164, 33, 234, 71, 228, 59, 11, 1, 67, 5, 192, 29, 208, 39, 263, 14, 93, 5, 161, 30, 232192.]
