CS/BioE 598AGB: Genome Assembly, part II Tandy Warnow.

CS/BioE 598AGB: Genome Assembly, part II

Tandy Warnow

Nature Biotechnology, volume 29, number 11, November 2011

Supplementary Figure 1. De Bruijn graph from reads with sequencing errors. (a) A de Bruijn graph E on our set of reads with k = 4. Finding an Eulerian cycle is already a straightforward task, but for this value of k, it is trivial. (b) If TGGAGTG is incorrectly sequenced as a sixth read (in addition to the correct TGGCGTG read), then the result is a bulge in the de Bruijn graph, which complicates assembly.

(Supplementary materials from the Compeau, Pevzner, and Tesler paper, Nature Biotech, 2011)
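
The bulge in panel (b) can be reproduced with a minimal sketch. It assumes the convention in which nodes are (k-1)-mers and each k-mer in a read contributes one edge; the function name de_bruijn_edges is hypothetical, and repeated k-mers are collapsed into a single edge, which ignores multiplicities. Only the two reads named in the caption are used, since the full read set is not given here.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Hypothetical helper: every k-mer of every read adds one edge from its
    prefix (first k-1 letters) to its suffix (last k-1 letters).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Correct read only: a single path TGG -> GGC -> GCG -> CGT -> GTG.
correct = de_bruijn_edges(["TGGCGTG"], 4)

# Adding the mis-sequenced read opens a second path between TGG and GTG.
with_error = de_bruijn_edges(["TGGCGTG", "TGGAGTG"], 4)

# Nodes with more than one successor mark where a bulge begins.
branching = sorted(node for node, succ in with_error.items() if len(succ) > 1)
print(branching)  # ['TGG']
```

The two paths leave TGG and rejoin at GTG, which is the bulge shape the caption describes. Bulge removal itself is not sketched here.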

(c) An illustration of a de Bruijn graph E with many bulges. The process of bulge removal should leave only the red edges remaining, yielding an Eulerian path in the resulting graph.

(Supplementary materials from the Compeau, Pevzner, and Tesler paper, Nature Biotech, 2011)

N50

• The N50 value is the size of the smallest contig (or scaffold) such that 50% of the genome is contained in contigs of size N50 or larger. This is the standard metric used to evaluate the quality of an assembly.

• Salzberg et al. computed “corrected N50” values by splitting contigs (or scaffolds) where errors are identified.
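
The two bullets above can be illustrated with a short sketch. The function names n50 and split_at_breakpoints, the contig lengths, and the error position are all hypothetical; when no genome size is supplied, the total assembly length is used in its place, which is an assumption rather than part of the definition above. How errors are identified is not covered here, so the breakpoint is simply given as input.

```python
def n50(contig_lengths, genome_size=None):
    """Smallest contig size L such that contigs of size >= L cover 50% of the genome.

    Assumption: if genome_size is None, the sum of contig lengths stands in for it.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs never reach half of the genome size

def split_at_breakpoints(length, breakpoints):
    """Split one contig of the given length at each interior breakpoint."""
    cuts = [0] + sorted(b for b in breakpoints if 0 < b < length) + [length]
    return [end - start for start, end in zip(cuts, cuts[1:])]

# Hypothetical assembly: six contigs, 290 bases in total.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70, since 80 + 70 = 150 covers at least half of 290

# Suppose an error is found at position 30 of the 80-base contig.
corrected = split_at_breakpoints(80, [30]) + lengths[1:]
print(n50(corrected))  # 50, since 70 + 50 + 50 = 170 covers at least half of 290
```

Splitting at errors can only shorten contigs, so a corrected value of this kind is never larger than the uncorrected one.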

From Mihai Pop’s paper

Differing Conclusions

• Compeau et al.: “De Bruijn graphs are not a cure-all … Short read sequencing technologies … favor the use of de Bruijn graphs … and are also well suited to representing genomes with repeats. However, if a future sequencing technology produces high quality reads with tens of thousands of bases, …, the pendulum could swing back toward favoring overlap-based approaches for assembly.”

Mihai Pop’s conclusion

Salzberg’s conclusions

Salzberg’s conclusions
