Making the most of short reads - Torsten Seemann - AGRF NGS SIG - 28 Apr 2009

Torsten Seemann

Victorian Bioinformatics Consortium, Monash University

Outline

● About the VBC
● Sequencing technologies
● Read mapping
● Applications
● Conclusion
● Questions

What is the VBC?

● Victorian Bioinformatics Consortium
● 2000-2005
  – Monash .med .infotech, CSIRO, DPI
  – $4M STI grant from State Govt.
● 2005+
  – Dept. Microbiology, Monash Uni.
  – NHMRC/ARC Network Parasitology
  – Micromon (sequencing centre)

Where is the VBC?

● Monash Uni.
● Clayton Campus
● STRIP2 / Bldg 76
● Level 2
● Microbiology
● Rooms 223-225

VBC capabilities

● Sequence analysis
● Assembly, annotation, SNPs
● Anything-omics!
● Microarray analysis/storage
● Data mining/visualization
● Custom software development
● Computer system architecture

VBC Collaborators

● Monash Uni.
● Uni. Melbourne
● Bio21
● UNSW, Uni. Syd
● UQ : IMB
● MIMR, MMC, Austin
● MISCL

● CSIRO : FSA, LI
● USDA : ARS
● Pasteur Institute
● TIGR
● UCSD
● UCLA
● Uni. Copenhagen

Sanger sequencing

● Dye-terminator capillary sequencing
● Read length ~ 300 - 900 bp
● Yield ~ 1 Mbp per day maximum
● Cost ~ $HIGH

Roche 454 FLX+

● Pyro-sequencing
● Read length ~ 100 - 250 bp
● Yield ~ 600 Mbp (250 bp PE)
● Run time ~ 1 day
● Prep time ~ 5 days
● Homo-polymer run errors
● Cost $MEDIUM

ABI SOLiD 3

● Sequencing by ligation
● Read length ~ 35 - 50 bp
● Yield ~ 15,000 Mbp (50 bp PE)
● Run time ~ 14 days
● Prep time ~ ? days
● Colour space error propagation
● Cost $MEDIUM

Illumina GA2 (Solexa)

● Sequencing by synthesis
● Read length ~ 36 - 100 bp
● Yield ~ 6,000 Mbp (36 bp PE)
● Run time ~ 5 days
● Prep time ~ 1 day
● No homo-polymer errors
● Cost $LOW

Illumina output 36bp

Good read

@HWUSI-EAS100R:3:1:5:1526#0/1
TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
+HWUSI-EAS100R:3:1:5:1526#0/1
abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a\`\

Bad read

@HWUSI-EAS100R:3:1:3:1073#0/2
TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
+HWUSI-EAS100R:3:1:3:1073#0/2
a\DDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB

'a' = Q33, Pr(wrong) = 0.0005

'B' = Q2, Pr(wrong) = 0.38
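The two quality annotations can be reproduced with a few lines of Python. This is only a sketch: it assumes the quality characters use an ASCII offset of 64, which the slide does not state, and the function names are illustrative. Under that assumption the slide's 0.38 for 'B' matches the older Solexa odds-based scale (the Phred formula would give about 0.63 for Q2), while both scales give about 0.0005 for Q33.

```python
OFFSET = 64  # assumption: Illumina/Solexa-style ASCII offset, not stated on the slide


def quality_score(char: str) -> int:
    """Convert one FASTQ quality character to its integer score."""
    return ord(char) - OFFSET


def phred_error_prob(q: int) -> float:
    """Phred scale: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_error_prob(q: int) -> float:
    """Solexa scale: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


for char in "Ba":
    q = quality_score(char)
    print(char, q, round(phred_error_prob(q), 4), round(solexa_error_prob(q), 4))
```

The two scales diverge only at low scores, which is exactly where the bad read sits.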

Read mapping

● Align 10^8 36 bp reads to a 5 Mbp reference
● Traditional tools too slow
● New crop of “short read aligners” (SRA)
  – SHRiMP
  – MAQ
  – Bowtie
  – ELAND
  – Novocraft

SRA capabilities

● SNP = Single nucleotide polymorphism
  – Substitution, e.g. A → C
  – Insertion or deletion (“indel”), e.g. A → -
● Warning: not all aligners support indels!
● We tend to use SHRiMP
  – Supports substitutions and indels
  – Fast SIMD implementation & parallelizable
  – Full post-hit Smith-Waterman alignment
  – Will identify “most” high scoring hits
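To illustrate why a Smith-Waterman step lets an aligner cope with both substitutions and indels, here is a minimal scoring sketch. It is not SHRiMP's implementation (which is vectorized and differs in detail); the scoring values and the toy sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(read: str, ref: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            # Diagonal move: align read[i-1] with ref[j-1] (match or substitution).
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # Vertical/horizontal moves: open a gap in the reference or the read.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


ref = "ACGTTGCAACGT"  # hypothetical reference fragment
exact = smith_waterman_score("GTTGCAAC", ref)         # read copied from the reference
substitution = smith_waterman_score("GTTGAAAC", ref)  # one base changed
deletion = smith_waterman_score("GTTGAAC", ref)       # one base missing from the read
print(exact, substitution, deletion)
```

A read with a substitution or a missing base still scores close to the exact match, so the hit is kept rather than discarded.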

Genome coverage

● Mapped 7 M reads to 4 Mbp genome
● Yellow line is mean coverage (56x)
● Bowl shaped coverage = circular genome
● Could be used to guide scaffolding
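A coverage graph like this one can be derived from the mapped read positions. The sketch below assumes fixed-length reads, 0-based start positions and a linear coordinate system (it does not wrap reads around a circular genome); the function name and toy data are illustrative.

```python
from itertools import accumulate


def depth_profile(read_starts, read_len, genome_len):
    """Per-base depth from 0-based read start positions (difference-array method)."""
    delta = [0] * (genome_len + 1)
    for start in read_starts:
        end = min(start + read_len, genome_len)  # clip reads at the end of the reference
        delta[start] += 1
        delta[end] -= 1
    # A running sum of the +1/-1 events gives the depth at every base.
    return list(accumulate(delta[:-1]))


# Toy example: 10 bp reference, 4 bp reads
depth = depth_profile([0, 2, 2, 5, 7], read_len=4, genome_len=10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

The mean of the profile corresponds to the horizontal line drawn on the graph.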

Missing DNA

● Read coverage drops to zero where reference has DNA that the new sequence does not

● LB022 absent
● hemH present
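Candidate missing regions can be listed automatically by scanning a per-base depth array for runs of zeros. This is a minimal sketch; the minimum run length and the toy depth values are assumptions, and intervals are reported as 0-based, half-open (start, end) pairs.

```python
def zero_coverage_regions(depth, min_len=1):
    """Return (start, end) half-open intervals where depth is zero for at least min_len bases."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos  # a zero-depth run begins here
        elif d != 0 and start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the reference.
    if start is not None and len(depth) - start >= min_len:
        regions.append((start, len(depth)))
    return regions


depth = [5, 4, 0, 0, 0, 6, 7, 0, 3, 0, 0]
regions = zero_coverage_regions(depth, min_len=2)
print(regions)
```

Intersecting these intervals with gene annotations would then flag genes such as the absent one shown on the slide.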

Repeated DNA

● Coverage increases in repeated areas
● LA_SNP3199 is probably triplicated in this strain (depth 120, average 40)
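The triplication estimate is simply the ratio of local depth to average depth. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def estimated_copy_number(region_depth: float, mean_depth: float) -> int:
    """Estimate copies of a region as its depth relative to the genome-wide mean."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return round(region_depth / mean_depth)


copies = estimated_copy_number(120, 40)  # the depth and average quoted on the slide
print(copies)
```

Rounding hides noise in the depth estimate, so a ratio near a half-integer should be treated with caution.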

SNPs

● SNPs appear as dips/pinches in the coverage graph

● LA1299 gene has 4 possible SNPs relative to the reference

● Rest of gene has average coverage

Repairing 454 data

● 454 has “homopolymer” errors
● Loses track if same base > 3 times in a row
● Traditional assemblers don't like too many indels or frame shifts
● 454 developed Newbler assembler
● Challenging for hybrid assemblies
● What if we could “repair” our 454 data?

454 Repair Guide

● One sample with 454 and Illumina reads
● Get a read mapper supporting indels
● Align all your Illumina reads to the 454 data
● If sufficient unambiguous depth
  – correct the 454 sequence!

● Can apply to old closed sequences, 454 contigs, 454 reads etc.

● Find old errors via resequencing
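The "sufficient unambiguous depth" test could be expressed as a small decision rule. The sketch below is an assumption-laden illustration: the thresholds are invented, and it interprets "-" in the evidence counts as reads that support leaving the sequence unchanged, which the slides do not spell out.

```python
def should_correct(evidence, min_depth=20, min_fraction=0.9):
    """Return the consensus call if the evidence is deep and unambiguous enough, else None.

    evidence maps each observed call (a base, or "-" for "no change") to the
    number of Illumina reads supporting it. Both thresholds are assumptions.
    """
    total = sum(evidence.values())
    if total < min_depth:
        return None  # not enough reads to trust any call
    call, count = max(evidence.items(), key=lambda kv: kv[1])
    if call == "-" or count / total < min_fraction:
        return None  # majority says "no change", or the reads disagree too much
    return call


print(should_correct({"A": 212, "-": 12}))  # counts of the kind reported in a repair table
print(should_correct({"A": 5}))
```

Stricter thresholds trade fewer repairs for fewer wrong ones.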

Example repair

>FF6ELPM06G1HYY original 180bp
AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT

Sequence        Pos  Change type       Old  New  Evidence
FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1

>FF6ELPM06G1HYY repaired 183bp
AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
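Applying "insertion-before" changes like these is mostly bookkeeping. The sketch below assumes positions are 1-based and refer to the original (unrepaired) sequence, which is consistent with this example but is not stated on the slide; the function name is illustrative.

```python
def apply_insertions(seq, insertions):
    """Insert bases before 1-based positions given in original-sequence coordinates."""
    out = seq
    # Work from right to left so earlier positions are not shifted by later insertions.
    for pos, base in sorted(insertions, reverse=True):
        out = out[:pos - 1] + base + out[pos - 1:]
    return out


# First 20 bases of read FF6ELPM06G1HYY, with its first reported change (position 11).
prefix = "AAATCTAAAAGAATAGTCGT"
repaired_prefix = apply_insertions(prefix, [(11, "A")])
print(repaired_prefix)
```

Each insertion here lengthens a run of A bases by one, which is the homopolymer under-call the repair is meant to fix.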

Trimming short reads

● Quality worsens toward 3' end
● Many reads have “N” basecalls
● Variation across flowcell/slide

● Will reduce data size
● Trade quality for depth
● Is it worth it?
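One simple way to trim is to walk in from the 3' end and drop bases until a trustworthy one is found. This is a sketch rather than a recommended procedure: the quality threshold of 20 and the ASCII offset of 64 are assumptions, and the function name is illustrative.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Drop bases from the 3' end while they are 'N' or below min_q."""
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]


# The low-quality read from the Illumina output slide
seq = "TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT"
qual = "a\\DDDDDD^[K]" + "B" * 24
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, len(trimmed_seq))
```

On this read only 12 of the 36 bases survive, and several of those are still "N", which shows how much depth trimming can cost.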

Should I trim?

● For 36 bp
  – Results are mixed
  – Usually best NOT to trim
  – Depth will “fix” most errors
● For 75+ bp
  – 3' quality can be very poor
  – Seems best to trim
  – Not all reads need trimming

● More research needed

Conclusion

● Short read mapping is a powerful tool for genomic discovery

  – Automated analysis, e.g. SNPs
  – Visualization, e.g. depth/coverage graphs
  – Repairing longer read data

● Still need de novo assembly for unmapped reads

Contact me

Web: http://www.vicbioinformatics.com/

[email protected]