Telomere-to-telomere assembly of a complete human X … · 2/28/2018  · Telomere-to-telomere...


Page 1: Telomere-to-telomere assembly of a complete human X … · 2/28/2018  · Telomere-to-telomere assembly of a complete ... The genomes assembly problem Esperanza Duke, highland sire

Sergey Koren, Staff Scientist, Genome Informatics Section, NHGRI

Telomere-to-telomere assembly of a complete human X chromosome

@sergekoren

Page 2:

• Rodger Staden (1979)
• “If the 5′ end of the sequence from one gel reading is the same as the 3′ end of the sequence from another the data is said to overlap. If the overlap is of sufficient length to distinguish it from being a repeat in the sequence the two sequences must be contiguous. The data from the two gel readings can then be joined to form one longer continuous sequence.”
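The overlap-and-join rule in the quote can be sketched in a few lines. This is a minimal illustration, not Staden's program: the function names and the `min_len` threshold (standing in for "sufficient length to distinguish it from being a repeat") are assumptions, and sequencing errors and reverse complements are ignored.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first match is the longest one.
    for k in range(max_len, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def join_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two readings into one longer sequence if they overlap enough."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None  # overlap too short (or absent) to trust
    return left + right[k:]


joined = join_reads("ACGTTGCA", "TGCAGGT", min_len=4)
```

Here the two readings share the four bases `TGCA`, so they join into a single 11-base sequence. Raising `min_len` trades sensitivity for protection against joining at short repeats.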

40 years of genome assembly

Page 3:

• First complete de novo assemblies
  • 2012: Bacteria (10^6 bp)

[Figure: genome repeat classes (Class I, Class II, Class III); examples: Yersinia pestis CO92, Escherichia coli O26:H11 11368, Bacillus anthracis Ames; plotted values: 0, 20, 0, 161, 16, 171]

Are we there yet?

Page 4:

• First complete de novo assemblies
  • 2012: Bacteria (10^6 bp)
  • 2014: Yeast (10^7 bp)

Are we there yet?

Page 5:

• First complete de novo assemblies
  • 2012: Bacteria (10^6 bp)
  • 2014: Yeast (10^7 bp)
  • 2014: Drosophila (10^8 bp)

[Figure: Drosophila chromosome arms 2L, 2R, 3L, 3R, and X]

Are we there yet?

Page 6:

• First complete de novo assemblies
  • 2012: Bacteria (10^6 bp)
  • 2014: Yeast (10^7 bp)
  • 2014: Drosophila (10^8 bp)
  • ????: Human (10^9 bp)

Are we there yet?

Page 7:

The Vertebrate Genomes Project Pipeline

Rhie and VGP Assembly Working Group, in preparation

[Figure: assembly pipeline. Labels: PacBio (contigging + purging), 10XG (scaffolding), BioNano (scaffolding), Hi-C (scaffolding), gap-filling & polishing, manual curation, final assembly; primary and alternate assemblies]

Page 8:

VGP GenomeArk: 1st data release

Jennifer Vashon of Maine Department of Inland Fisheries and Wildlife, left, and UMass lynx team coordinator, Tanya Lama, with an adult male lynx from northern Maine whose DNA was used to create first-ever whole genome for the species. The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)

https://VertebrateGenomesProject.org

Page 9:

Not all genomes are equal

[Figure: contigs plotted by repeat content (%) and contig NG50 (Mb)]
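Contig NG50 is the headline statistic on this and later slides. As a reminder of what it measures, here is a minimal sketch using the standard definition: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the estimated genome size. The function name and the toy inputs are illustrative assumptions.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or 0 if it covers less than half the genome.

    Unlike N50, the target is half of the estimated genome size,
    not half of the assembled bases.
    """
    target = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Toy example: four contigs against an estimated genome size of 100.
example = ng50([40, 30, 20, 10], genome_size=100)
```

Because the denominator is the expected genome size, NG50 values stay comparable between assemblies of the same genome even when one assembly is inflated by duplicated alleles.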

Page 10:

Heterozygosity causes allelic duplication

[Figure: assembled size / expected genome size (%), primary and alternate contigs]

Falcon-unzip   Primary   Alt
Size (Gbp)     1.95      0.73
NG50 (Mbp)     2.6       0.02
BUSCO          94.2%     40.6%
Duplication    20.8%     3.4%

1.6%

Lynx image credit: Jacob W. Frank

Image credit: wikimedia

Page 11:

Resolving haplotypes

Page 12:

The genomes assembly problem

Esperanza

Molly, yak dam
Duke, highland sire

~1% heterozygosity

Page 13:

State of the art: pseudo-haplotype

Page 14:

Trio binning with TrioCanu

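[Figure: trio binning. Dam × sire produce an F1 cross; parental k-mers sort the F1 reads into sire haplotype, dam haplotype, and unassigned bins; the bins are assembled into a sire assembly and a dam assembly]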

Complete assembly of parental haplotypes with trio binning. Koren, Rhie et al. 2018
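The binning step can be illustrated with a toy sketch. It is not TrioCanu's implementation: the function names are hypothetical, k-mers are compared by raw hit count without the normalization a real tool would need, and reverse complements and sequencing errors are ignored.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parental_specific(sire_seqs, dam_seqs, k):
    """K-mers seen in only one parent; shared k-mers carry no haplotype signal."""
    sire = set().union(*(kmers(s, k) for s in sire_seqs))
    dam = set().union(*(kmers(s, k) for s in dam_seqs))
    return sire - dam, dam - sire


def bin_read(read, sire_only, dam_only, k):
    """Assign an F1 read to the parent whose specific k-mers it contains more of."""
    read_kmers = kmers(read, k)
    sire_hits = len(read_kmers & sire_only)
    dam_hits = len(read_kmers & dam_only)
    if sire_hits > dam_hits:
        return "sire"
    if dam_hits > sire_hits:
        return "dam"
    return "unassigned"


sire_only, dam_only = parental_specific(["AAAACCCC"], ["AAAAGGGG"], k=4)
bins = [bin_read(r, sire_only, dam_only, k=4) for r in ("AACCCC", "AAGGGG", "AAAA")]
```

Each bin would then be assembled on its own, which is why the haplotypes come out separated rather than collapsed. Reads in the unassigned bin carry no parent-specific k-mers in this sketch.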

Page 15:

Esperanza: The nearly perfect diploid

125x PacBio coverage (~60x per haplotype), TrioCanu haplotig NG50 70 Mbp, BUSCOs 94%

[Figure: Esperanza haplotype assemblies, chromosomes 1–29 and X, shown for the dam (yak) and the sire (Highland)]

Page 16:

Resolving repeats

Page 17:

Real-time data from a nanopore
• $1000 (free) instrument
• $100 / bacterial genome
• 85–90% read accuracy

Oxford Nanopore Technologies

Page 18:

• ONT R9 pore: E. coli CsgG membrane protein
• Read lengths >1 Mbp possible

Ultra-long read nanopore sequencing

*Assuming 3.4 Å per bp, 1 Mbp = 3,400,000 Å (0.34 mm) = 40,000x height of the pore
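The footnote's arithmetic can be checked directly. This sketch assumes the 85 Å label in the pore diagram is the pore height used for the ratio, which is the value consistent with the stated 40,000x; the constant and function names are illustrative.

```python
ANGSTROM_PER_BP = 3.4        # rise per base pair, from the slide footnote
PORE_HEIGHT_ANGSTROM = 85.0  # assumption: the 85 Å label is the pore height
MM_PER_ANGSTROM = 1e-7       # 1 Å = 1e-10 m = 1e-7 mm


def dna_length_angstrom(n_bp):
    """Contour length of a DNA molecule of n_bp base pairs, in ångströms."""
    return n_bp * ANGSTROM_PER_BP


length_angstrom = dna_length_angstrom(1_000_000)            # 1 Mbp read
length_mm = length_angstrom * MM_PER_ANGSTROM                # about 0.34 mm
pore_heights = length_angstrom / PORE_HEIGHT_ANGSTROM        # about 40,000
```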

[Figure labels: pore dimensions 120 Å and 85 Å; scale analogy: 8 cm, ~2 miles in 37 min]

Page 19:

Ultra-long read benefits

[Figure: NG50 (Mbp, 0–100) versus maximum spanned repeat (kbp, 10–500); labels: NA12878 20x, NA12878 30x, NA12878 30x + 5x ultra, CHM1 P6, Clive, GRCh38]

Nanopore sequencing and assembly of a human genome with ultra-long reads. Jain, Koren, Miga, Quick, Rand, Sasani, Tyson, et al. Nature Biotech (2018)

Page 20:

Ultra-long read benefits

[Figure: NG50 (Mbp, 0–100) versus maximum spanned repeat (kbp, 10–500); labels: NA12878 20x, NA12878 30x, NA12878 30x + 5x ultra, NA12878 30x ultra, CHM1 P6, Clive, GRCh38]

Nanopore sequencing and assembly of a human genome with ultra-long reads. Jain, Koren, Miga, Quick, Rand, Sasani, Tyson, et al. Nature Biotech (2018)

Page 21:

• The human reference is incomplete
  • 368 unresolved issues, 102 gaps
  • Segmental duplications, rDNAs
  • Centromeres, telomeres, heterochromatin
• These gaps contain important information
  • Missing reference sequence leads to mapping artifacts
  • Variation in these gaps is unexplored (e.g. rDNAs)
  • We don’t know what we don’t know…
• Long-read sequencing can close these gaps

The One Genome Project

Page 22:

The One Genome: CHM13hTERT

Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton, Stowers
N=46; XX

Page 23:

• From May 1/18 – Jan 8/19
• 94 MinION/GridION flow cells
• 11.1M reads
• 155 Gb (1.6 Gb / flow cell) (50x)
• 99 Gb in reads >50 kb (32x)
• 78 Gb in reads >70 kb (25x)
• Max mapped read length 1.04 Mb

• Assembled with Canu
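The coverage figures in the list follow from dividing sequenced bases by genome size. The sketch below assumes a haploid human genome size of 3.1 Gb, which is not stated on the slide but reproduces its rounded values; the names are illustrative.

```python
GENOME_SIZE_GB = 3.1  # assumption: approximate haploid human genome size


def coverage(total_gb, genome_gb=GENOME_SIZE_GB):
    """Average depth of coverage for a given yield in gigabases."""
    return total_gb / genome_gb


all_reads_x = coverage(155)   # about 50x
over_50kb_x = coverage(99)    # about 32x
over_70kb_x = coverage(78)    # about 25x
yield_per_flow_cell_gb = 155 / 94  # about 1.6 Gb per flow cell
```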

CHM13 sequencing at NISC

Page 24:

The human genome, 2017

GRCh38 NG50 contig 56.4 Mbp

[Figure: GRCh38 contigs by chromosome, 1–22 and X]

Page 25:

The human genome, 2018

CHM13 NG50 contig 74.8 Mbp (70x PacBio + 35x UL ONT)

[Figure: CHM13 contigs by chromosome, 1–22 and X]

Page 26:

ChrX centromere detail

[Figure: X centromere (XCEN) between Xq and Xp; DXZ1 satellite array spanning 2.72 Mb, ~2 kb repeat unit, n = 1346; panel labels: PFGE Southern, ddPCR]

Page 27:

Repeat resolution using variants

[Figure: 2.72 Mb array]

Page 28:

First complete X chromosome

DXZ4 arrays contained between 12 and 100 repeat units of 3.0 kb, with an average array containing 57 (Tremblay et al. 2011)

Page 29:

How to validate a 1300-copy 2 kb repeat?

[Figure: alignments ≥500 bp at ≥97% identity; scale bar 500 kb]

Page 30:

Independent markers confirm structure

Page 31:

Lab validation supports structure

[Figure: BstEII restriction sites across the 2.72 Mb array between Xq and Xp; labeled sizes 1.82 Mb, 269 kb, 668 kb; gel size markers 2.4, 1.8, 1.1, 0.7, 0.5, 0.3 Mb]

Page 32:

• Large repeats remain challenging
  • Satellite arrays, SegDups, rDNAs, …
• Haploid assembly is solved by long reads
  • Complete human reference in 1–2 years
• Diploid assembly is solved by trios/Hi-C/ss
  • Complete haplotypes will become the new norm
• Enabling reference genomes for all species
  • Genome10k and the Vertebrate Genomes Project

Almost there!

Page 33:

The next 20 years

Page 34:

Bioinformatics is currently artisanal

Page 35:

Scaling up, training volunteers

Marcela Uliano
Maxmilian Driller
Chul Lee
Giulio Formenti
Simona Secomandi

Univ. of Milan
Freie Univ. Berlin
Seoul Nat. Univ.

Calvinna Caswara
Majid Vafadar
Chai Fungtammasan
Nicholas Hill

Page 36:

Automating sequence QC

Mash: fast genome and metagenome distance estimation using MinHash. Ondov et al. 2016

[Figure panels: “Rice SMRT cells in Canadian Lynx” and “Example of no contamination (mPhyDis1)”; series: 10X, PacBio, PacBio low cov., PacBio contam.]
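The idea behind Mash-style QC is that two read sets from the same organism share most of their k-mers, while a contaminated or mislabeled run does not. The sketch below shows the bottom-s MinHash estimate of Jaccard similarity and the Mash distance formula from Ondov et al. It is a simplification and not Mash itself: it hashes with SHA-1 from the standard library and skips canonical (strand-independent) k-mers, and the function names are illustrative.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Fraction of the s smallest hashes of the union found in both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when nothing is shared."""
    if jaccard == 0:
        return 1.0
    return math.log((1 + jaccard) / (2 * jaccard)) / k


a = sketch("ACGTACGTTGCA", k=4, s=100)
same = mash_distance(jaccard_estimate(a, a, 100), k=4)
x = sketch("AAAAAAAA", k=4, s=100)
y = sketch("CCCCCCCC", k=4, s=100)
different = mash_distance(jaccard_estimate(x, y, 100), k=4)
```

Because only the sketches are compared, each sequencing run can be screened against the expected species in a fraction of the time an alignment would take.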

Page 37:

Luo et al. Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing. 2018

Automating assembly QC?

Page 38:

Carmine Bee-eater
  Total bp             1,182.18 Mbp
  Number of contigs    723
  NG50                 10.27 Mbp
  Max contig size      41.28 Mbp

Common brushtail possum
  Total bp             3,450.80 Mbp
  Number of contigs    3,124
  NG50                 4.60 Mbp
  Max contig size      35.90 Mbp

First trainee results

Page 39:

Evolving compute requirements

https://www.pacb.com/wp-content/uploads/SMRT-Analysis-Brochure-Gain-a-deeper-understanding-of-your-sequencing-data.pdf

These are biowulf norm nodes!

Page 40:

ML gaining popularity

A universal SNP and small-indel variant caller using deep neural networks. Poplin et al. 2018

From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Rang et al. 2018

Page 41:

Yet another nanopore basecaller: flappie

• Runtime for 2.5k reads = 15.14 h (8 cores)
• 2.5k of 11M reads ≈ 1/4000th of total
• 15.14 * 8 * 4000 ≈ 515,282 CPU h!
• Reservation!

Page 42:

Tools changed too fast

[Figure: read identity (%)]

• But at least it is better
• New GPU basecaller released before reservation
• Same job took 0.03 h
• 0.03 * 1 * 4000 ≈ 130 CPU h
• Finished in < 3 hrs
• Used reservation for other analysis but under-utilized
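The two basecalling estimates, on this slide and the previous one, use the same extrapolation: time for a 2.5k-read batch, times cores, times roughly 4000 batches. A minimal sketch with the rounded inputs shown on the slides follows; the function name is illustrative. With these rounded inputs the products come out near 484,000 and 120 hours, slightly below the slides' 515,282 and 130, which were presumably computed from unrounded measurements.

```python
def total_compute_hours(hours_per_batch, cores, n_batches):
    """Extrapolate total compute hours from one timed batch."""
    return hours_per_batch * cores * n_batches


N_BATCHES = 4000  # roughly 11M reads / 2.5k reads per batch, as on the slide

cpu_basecaller_hours = total_compute_hours(15.14, 8, N_BATCHES)  # flappie, 8 cores
gpu_basecaller_hours = total_compute_hours(0.03, 1, N_BATCHES)   # GPU basecaller
speedup = cpu_basecaller_hours / gpu_basecaller_hours
```

The ratio of the two, a reduction of about three orders of magnitude, is the point of the slide: a reservation sized for the first estimate was far larger than the job needed once the new tool arrived.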

Page 43:

• $750 40x human genome*
• 15–90 min sample prep
• 1 min–48 h runtime
• 0.1 Tb per hour (6 Tb in 48 h)
• 1.44 GB/s raw data stream
• >100 kb reads possible
• Pay for runs, not instrument
• “Sequence until”
• “Read until”
• Primary data to retain?
• 10 TB/genome @ 10k genomes
• 100 petabytes
  • SRA = 5 petabytes
  • $0.5–2 M/yr on S3

https://www.nanoporetech.com
* Assuming 2,880 flowcell commitment and 120 Gbp/cell
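The storage estimate on this slide is a short multiplication. The sketch below reproduces it, assuming decimal units (1 PB = 1000 TB) and reading "10tb" as terabytes; the per-terabyte cost is simply derived from the slide's own S3 range and is not a quoted price.

```python
TB_PER_PB = 1000  # assumption: decimal units


def archive_size_pb(tb_per_genome, n_genomes):
    """Total raw-data archive size in petabytes."""
    return tb_per_genome * n_genomes / TB_PER_PB


total_pb = archive_size_pb(10, 10_000)   # 100 PB of primary data
times_sra = total_pb / 5                 # relative to the slide's 5 PB SRA figure

# Implied yearly cost per terabyte for the slide's $0.5-2 M/yr range.
total_tb = total_pb * TB_PER_PB
low_usd_per_tb_year = 500_000 / total_tb
high_usd_per_tb_year = 2_000_000 / total_tb
```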

Democratized sequencing is here

Page 44:

• How to democratize expertise/HPC?
• What is the role of HPC in a peer-to-peer world?
• Lots of small groups doing compute
• No experience with “proper” HPC usage
• Jobs that run fine on a laptop don’t translate to HPC

Democratized sequencing is here

Page 45:

Acknowledgements

Karen Miga

genomeinformatics.github.io
• Adam Phillippy
• Arang Rhie
• Brian Walenz
• Alexander Dilthey
• Brian Ondov
