Design Goals Crash Course: Reference-guided Assembly.

MICHAEL STRÖMBERG Boston College Data Club April 2008

Design Goals

Crash Course: Reference-guided Assembly


Sequencing Technologies

[Figure: sequencing technologies; the only surviving label is "future"]

Next-Gen Sequence Lengths

[Figure, panel 1: sequence length (bp), max/mean/min, for Capillary (Sanger) and Roche 454 FLX; y-axis 0 to 1600 bp]

[Figure, panel 2: sequence length (bp), max/mean/min, for Illumina, AB SOLiD and Helicos; y-axis 0 to 80 bp]

Unique Genome Coverage (H. sapiens)

[Figure: unique genome coverage (0% to 100%) versus sequence length (3 to 33)]
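The idea behind unique genome coverage can be tried on a toy scale. The sketch below counts, for a given read length k, the fraction of k-length windows in a sequence that occur exactly once. The function name `unique_fraction` and the toy sequence are assumptions made for illustration, not taken from the talk, and a real analysis would also have to account for the reverse strand and a full genome.

```python
from collections import Counter

def unique_fraction(sequence: str, k: int) -> float:
    """Fraction of k-length windows in `sequence` that occur exactly once."""
    n_windows = len(sequence) - k + 1
    if n_windows <= 0:
        return 0.0
    # Count every k-mer, then keep only those seen a single time.
    counts = Counter(sequence[i:i + k] for i in range(n_windows))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / n_windows

# Hypothetical toy sequence; longer windows tend to be unique more often.
toy = "ACGTACGTTTGACGTA"
for k in (2, 4, 6, 8):
    print(k, round(unique_fraction(toy, k), 2))
```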

Mixing It Up: Paired-end Reads

[Figure: histogram of read pairs (count, 0 to 1800) versus fragment length (bp, 0 to 350)]

How Does It Work?


C. elegans: a case for INDELs

SPEED
• 100 million Illumina reads
• Alignment time: 93 min (17,800 reads/s)
• Assembly time: 100 min

INDELS
• INDEL validation rate: 89.3 % (216)
• SNP validation rate: 97.8 % (229)
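The alignment time quoted on this slide follows directly from the read count and the throughput. A minimal arithmetic check, where the helper name `alignment_minutes` is an assumption for illustration:

```python
def alignment_minutes(read_count: int, reads_per_second: float) -> float:
    """Wall-clock minutes needed to align `read_count` reads at a fixed rate."""
    return read_count / reads_per_second / 60.0

# Figures from the slide: 100 million reads at 17,800 reads/s.
minutes = alignment_minutes(100_000_000, 17_800)
print(f"{minutes:.1f} min")  # about 93.6 min, consistent with the quoted 93 min
```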

P. stipitis: Co-assembly

[Figure labels: Capillary, 454 FLX, 454 GS20, Illumina]

Scaling Up

[Figure: reference sequence length (bp, log scale from 10,000 to 10,000,000,000) versus project date (Dec-05 to Jun-08). Projects plotted: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region]

Performance: Aligners

Aligners: Feature Set

Aligner:              [logo lost; presumably MOSAIK] | ELAND | MAQ | Newbler | SHRiMP | SOAP
Sequencing Platforms: Illumina, 454, SOLiD, capillary | Illumina | Illumina, SOLiD | 454 | Illumina, SOLiD | Illumina
Alignment Algorithm:  Smith-Waterman | Hash-based | Hash-based | FlowMapper | Smith-Waterman | Hash-based
Co-assembly Creation: ?
Gapped Alignments:    ?
Paired-end Reads:     [marks lost]
Platform Binaries:    Windows, Mac, Linux, Sun, iPhone | Mac, Linux | Linux | Mac, Linux | Mac, Linux (five values survive; column assignment unclear)

Performance: Aligner

Illumina 35 bp (X Chromosome)

program aligned reads/s

MOSAIK 180 - 16,658

ELAND 7,716

SOAP 1,637

MAQ 1,376

SHRiMP 39

[Figure: bar chart of aligned reads/s (0 to 16000) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP and SHRiMP]

Performance: Aligner

Roche 454 FLX ~250 bp

program aligned reads/s

Roche 454 Newbler 1,176

MOSAIK 317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Accuracy: Synthetic Data Sets

[Diagram labels: H. sapiens X chromosome; 1 million; 1 per 1.3 kb; 1 per 7.2 kb]

Accuracy: Classification

[Figure: share of reads (0% to 100%) classified as unique or non-unique, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ and SOAP]

Accuracy: Unique Read Alignment

[Figure: percentages (0% to 100%) for reads, INDELs and SNPs, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ and SOAP]

Reasons to use MOSAIK?

• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

(Near) Future Development

• All technologies
  – Pacific BioSciences
  – Helicos
• All application areas
  – Adapter trimming
  – Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

1000 Genomes Project

• Many samples with light coverage (1000 dg)
  – 100 samples from 10 populations at 2x coverage
  – Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
  – 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
  – Assembly can take a long time (slow disk speed)
• Hypothetical solution
  – Optimize the file formats
  – Ditch the built-in index
  – Keep data sorted by aligned location
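One reason sorted data can stand in for a built-in index: when alignments are kept in order of aligned position, a region query reduces to a binary search. The sketch below assumes a hypothetical in-memory layout (a sorted list of start positions plus a parallel list of read names); it isn't MOSAIK's file format, and the function name `reads_in_region` is made up for illustration.

```python
from bisect import bisect_left, bisect_right

def reads_in_region(sorted_positions, read_names, start, end):
    """Return names of reads whose aligned start lies in [start, end].

    `sorted_positions` must be in ascending order and parallel to `read_names`.
    """
    lo = bisect_left(sorted_positions, start)   # first position >= start
    hi = bisect_right(sorted_positions, end)    # one past the last position <= end
    return read_names[lo:hi]

# Hypothetical alignments, already sorted by aligned location.
positions = [5, 12, 12, 40, 97, 150]
names = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(reads_in_region(positions, names, 10, 100))
```

Each query costs two binary searches rather than a scan or a separate index lookup, which is the appeal of keeping the file sorted.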


Scaling Up: Memory Footprint

• Current situation: the entire human genome is stored with all associated hash locations
  – Optimized hash table ≈ 55 GB RAM
  – File-based hash table (BerkeleyDB)
    • User selects how much RAM to use
    • Dreadfully slow performance
    • Large disk footprint ≈ 65 GB file


[Figure: JumpDB Memory Usage (Human Genome). Memory used (GB RAM, 0 to 70) versus hash size (bp, 9 to 18). Series: JumpDB, MOSAIK hash table]

[Figure: Alignment Performance with 35 bp human reads. Reads/s (0 to 20) for Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database) and Mosaik hash table]

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases

• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.

• Status: Implemented but not tested.
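The proposal above can be sketched under one reading of "clustering": seed hits from a small hash size count only when a run of adjacent hits on the same diagonal (reference position minus read offset) covers at least a preset number of bases. That reading, the function names `build_index` and `candidate_starts`, and the toy sequences are assumptions for illustration; the slides don't describe MOSAIK's actual implementation.

```python
from collections import defaultdict

def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_starts(read: str, index: dict, k: int, min_cluster: int) -> list:
    """Return diagonals supported by a run of adjacent seed hits
    that covers at least `min_cluster` bases of the read."""
    # Group read offsets by diagonal (reference position minus read offset).
    diagonals = defaultdict(list)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            diagonals[pos - offset].append(offset)

    candidates = []
    for diagonal, offsets in diagonals.items():
        offsets.sort()
        run_start = prev = offsets[0]
        best = k  # a single seed covers k bases
        for off in offsets[1:]:
            if off == prev + 1:
                best = max(best, off + k - run_start)  # extend the current run
            else:
                run_start = off                        # gap: start a new run
            prev = off
        if best >= min_cluster:
            candidates.append(diagonal)
    return sorted(candidates)

# Hypothetical toy data: the read is an exact copy of reference[5:15].
reference = "TTGACCATGGCATTACGGA"
index = build_index(reference, k=3)
print(candidate_starts("CATGGCATTA", index, k=3, min_cluster=8))
```

With a hash size of 3, isolated chance hits appear on other diagonals, and the cluster requirement filters them out. A read with a mismatch breaks the run, so the minimum cluster length becomes the knob that trades sensitivity against the number of candidates.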

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Acknowledgements

Boston College
Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart
Thomas Seyfried, Mike Kiebish

Washington University School of Medicine
Elaine Mardis, Jarret Glasscock, Vincent Magrini

Agencourt
Douglas Smith, Wei Tao