A hierarchical approach to building contig scaffolds
![Page 1: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/1.jpg)
A hierarchical approach to building contig scaffolds
Mihai Pop, Dan Kosack, Steven L. Salzberg
Genome Research 14(1), pp. 149-159, 2004.
![Page 2: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/2.jpg)
Sequencing pipeline
• Random sequencing
  • unrelated reads, 500-700 base-pairs
• Assembly
  • unrelated contigs, 5,000-10,000 base-pairs
• Scaffolding
  • unrelated scaffolds, 30,000-50,000 base-pairs
• Finishing/gap closure
  • completed genomes, millions to billions of base-pairs
![Page 3: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/3.jpg)
Scaffolding
• Given a set of non-overlapping contigs, order and orient them along a chromosome
[Figure: contigs I, II, III and IV placed along a chromosome]
![Page 4: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/4.jpg)
Clone-mates
[Figure: a clone and its insert with end reads F and R; placements of the F and R reads in contigs I and II]
![Page 5: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/5.jpg)
Scaffolder output
Sequencing gaps
Physical gaps
![Page 6: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/6.jpg)
Problems with the data
• Incorrect sizing of inserts
  • cut from gel – sizing is subjective
  • error increases with size
• Chimeras (ends belong to different inserts)
  • biological reasons (esp. for large inserts)
  • sample tracking (human error)
• Software must handle a certain error rate.
![Page 7: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/7.jpg)
Theoretical abstraction
• Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs), provide a linear or circular embedding that preserves most constraints.
![Page 8: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/8.jpg)
Graph representation
• Nodes: contigs
• Directed edges: constraints on the relative placement of contigs – relative order and relative orientation
• Embedding: order (coordinate along chromosome) and orientation (strand sampled)
![Page 9: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/9.jpg)
Challenges
• Orientation – node coloring problem (forward/reverse)
  • feasibility – no cycles with an odd number of “reversal” edges (blue edges)
  • optimality – remove the minimum number of edges such that a solution exists (NP-hard)
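The feasibility condition for orientation can be checked with a standard two-coloring pass over the link graph. The following is a minimal sketch, not the authors' implementation; the function name `orient_contigs` and the `(a, b, reversal)` tuple layout are assumptions made for illustration. It only tests feasibility and does not attempt the NP-hard edge-removal step.

```python
from collections import deque


def orient_contigs(contigs, links):
    """Assign 'F' or 'R' to each contig, or return None if impossible.

    links: iterable of (a, b, reversal); reversal=True means contigs a and b
    must end up in opposite orientations (hypothetical input layout).
    """
    adjacency = {c: [] for c in contigs}
    for a, b, reversal in links:
        adjacency[a].append((b, reversal))
        adjacency[b].append((a, reversal))

    flipped = {}  # contig -> True if reversed relative to its component's seed
    for seed in contigs:
        if seed in flipped:
            continue
        flipped[seed] = False
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for neighbor, reversal in adjacency[node]:
                expected = flipped[node] != reversal  # XOR of the two flags
                if neighbor not in flipped:
                    flipped[neighbor] = expected
                    queue.append(neighbor)
                elif flipped[neighbor] != expected:
                    # A cycle with an odd number of reversal edges was found.
                    return None
    return {c: ("R" if is_flipped else "F") for c, is_flipped in flipped.items()}
```

A cycle containing an odd number of reversal edges makes the function return `None`, which matches the feasibility condition stated above.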
![Page 10: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/10.jpg)
Challenges
• Ordering – generate a linear embedding
  • feasibility – lengths of parallel DAG paths are consistent
  • optimality – remove the minimum number of edges such that the DAG is feasible (NP-hard)
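One way to test whether a set of distance constraints admits a linear embedding is to treat them as a system of difference constraints and run Bellman-Ford. This is a swapped-in textbook technique, not necessarily what the authors use; `place_contigs` and the `(a, b, lo, hi)` tuple layout are assumptions, and all contigs are assumed to be already oriented.

```python
def place_contigs(contigs, links):
    """Return contig origin coordinates, or None if the links are inconsistent.

    links: iterable of (a, b, lo, hi), meaning lo <= pos[b] - pos[a] <= hi
    (hypothetical input layout).
    """
    if not contigs:
        return {}
    edges = []
    for a, b, lo, hi in links:
        edges.append((a, b, hi))   # pos[b] <= pos[a] + hi
        edges.append((b, a, -lo))  # pos[a] <= pos[b] - lo

    # Starting every contig at 0 is equivalent to a virtual source node
    # connected to all contigs with zero-weight edges.
    pos = {c: 0 for c in contigs}
    for _ in range(len(contigs)):
        changed = False
        for u, v, weight in edges:
            if pos[u] + weight < pos[v]:
                pos[v] = pos[u] + weight
                changed = True
        if not changed:
            break
    else:
        # Still relaxing after |V| passes: a negative cycle exists, so two
        # parallel paths disagree on the distance between some contigs.
        return None

    shift = min(pos.values())
    return {c: p - shift for c, p in pos.items()}
```

If the function returns coordinates, every link's distance range is respected; choosing which links to drop when it returns `None` is the hard optimization problem noted above and is not attempted here.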
![Page 11: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/11.jpg)
The real world
• Use of scaffolds
  • Analysis – longest unambiguous sub-graphs
  • Finishing – present all “reliable” relationships between contigs
• Sources of error
  • mis-assemblies
  • sizing errors (increase with library size)
  • chimeras
![Page 12: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/12.jpg)
Ambiguous scaffold
[Figure: alternative layouts of contigs I, II, III and I’ for an ambiguous scaffold]
![Page 13: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/13.jpg)
Hierarchical scaffolding
1. For each contig pair, consolidate all linking data into a single relationship – 2 correct links required
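The consolidation step can be sketched as grouping links by contig pair and keeping a pair only when enough links agree. This is an illustrative sketch under stated assumptions: since the true correctness of a link is unknown, agreement between at least `min_links` links (same orientation, overlapping distance ranges) stands in for "correct", and the tuple layout and the name `bundle_links` are hypothetical.

```python
from collections import defaultdict


def bundle_links(links, min_links=2):
    """Collapse raw links into at most one relationship per contig pair.

    links: iterable of (a, b, orientation, lo, hi), with each pair already
    given in a canonical (a, b) order (assumed input layout).
    """
    groups = defaultdict(list)
    for a, b, orientation, lo, hi in links:
        groups[(a, b, orientation)].append((lo, hi))

    bundles = {}
    for (a, b, orientation), ranges in groups.items():
        if len(ranges) < min_links:
            continue  # not enough agreeing links to trust this relationship
        lo = max(r[0] for r in ranges)
        hi = min(r[1] for r in ranges)
        if lo > hi:
            continue  # the distance ranges do not overlap
        best = bundles.get((a, b))
        if best is None or len(ranges) > best["support"]:
            bundles[(a, b)] = {
                "orientation": orientation,
                "min": lo,
                "max": hi,
                "support": len(ranges),
            }
    return bundles
```

Intersecting the distance ranges is one simple design choice; a real scaffolder might instead combine the link estimates statistically.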
![Page 14: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/14.jpg)
Hierarchical scaffolding
2. Use most reliable links to build scaffolds
3. Repeatedly build super-scaffolds based on less reliable linking data
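The tiered order of steps 2 and 3 can be illustrated with a union-find structure that merges contigs into scaffolds one link group at a time, most reliable group first. This sketch tracks scaffold membership only; it ignores orientation, distances, and conflicting links, and the name `hierarchical_scaffolds` and its input layout are assumptions.

```python
def hierarchical_scaffolds(contigs, link_groups):
    """Return the scaffold membership after each tier of links is applied.

    link_groups: list of lists of (a, b) contig pairs, ordered from the most
    reliable group to the least reliable (hypothetical input layout).
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Path-halving lookup of the scaffold representative.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    snapshots = []
    for links in link_groups:
        for a, b in links:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # join two scaffolds into one
        scaffolds = {}
        for c in contigs:
            scaffolds.setdefault(find(c), []).append(c)
        snapshots.append(sorted(scaffolds.values()))
    return snapshots
```

For example, with contigs I to IV, a reliable tier linking I-II and III-IV, and a weaker tier linking II-III, the first snapshot holds two scaffolds and the second holds one super-scaffold.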
![Page 15: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/15.jpg)
Rationale
[Figure: problem complexity plotted against hierarchical step, with problem size (#nodes) and error rate indicated]
![Page 16: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/16.jpg)
Linking information
• Overlaps
• Mate-pair links
• Similarity links
• Physical markers
• Gene synteny
reference genome
physical map
![Page 17: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/17.jpg)
BAMBOO (BAMBUS)
Best effort Attempt
Multiple Branches allowed
Order, Orient
![Page 18: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/18.jpg)
Inputs
• Set of contigs: names and lengths
• Groups of contig links:
  • groups correspond to “quality” of links
  • link: relative distance between contig origins, relative orientation of contigs
• Priorities for each group – specify order in which links are considered
BE EB
min <= dist <= max
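The inputs described on this slide could be modeled in memory roughly as follows. This is a hypothetical data model for illustration, not the tool's actual input format; the class `ContigLink`, its field names, and the helper `links_by_priority` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContigLink:
    """One link between two contigs (all field names are hypothetical)."""
    source: str
    target: str
    same_orientation: bool  # relative orientation of the two contigs
    min_dist: int           # lower bound on distance between contig origins
    max_dist: int           # upper bound on distance between contig origins
    group: str              # link group, reflecting the "quality" of the link

    def is_satisfied(self, source_origin, target_origin):
        # Checks min <= dist <= max for a proposed placement.
        return self.min_dist <= target_origin - source_origin <= self.max_dist


def links_by_priority(links, group_priority):
    """Order links so that groups with a lower priority number come first."""
    return sorted(links, key=lambda link: group_priority[link.group])
```

Because `sorted` is stable, links within the same group keep their original relative order.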
![Page 19: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/19.jpg)
Outputs
• XML representation of layout:
  • contig orientations
  • contig position (x-coordinate of contig origin)
  • links used to construct layout
• Graphical display of the layout
  • uses GraphViz package from AT&T
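A layout like the one described above can be handed to GraphViz by writing it out in the DOT language. The sketch below is only an illustration of that idea and does not reproduce the tool's real output; the function `scaffold_to_dot`, the label format, and the input layout are assumptions.

```python
def scaffold_to_dot(placements, links):
    """Render a scaffold as GraphViz DOT text.

    placements: dict mapping contig name -> (position, orientation)
    links: iterable of (a, b) contig pairs used to build the layout
    (both layouts are hypothetical).
    """
    lines = ["digraph scaffold {", "  rankdir=LR;"]
    # Emit contigs in coordinate order so the file reads left to right.
    for name, (position, orientation) in sorted(
        placements.items(), key=lambda item: item[1][0]
    ):
        lines.append(f'  "{name}" [label="{name} ({orientation}) @ {position}"];')
    for a, b in links:
        lines.append(f'  "{a}" -> "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and rendered with the GraphViz `dot` command.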
![Page 20: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/20.jpg)
1.0 release
• XML input not yet supported
• All scaffolds placed in the same output file
• Only Linux executable released
• Hacker friendly
![Page 21: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/21.jpg)
Current release: 2.33
• XML input and more general output module
• Collection of input modules from common assembly formats
• Better handling of priority data
• Repeat masking features
• More platforms supported and source code released as open source
• http://amos.sourceforge.net
![Page 22: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/22.jpg)
Future enhancements
• Option to generate un-ambiguous (non-branching) scaffolds
• Better layout algorithms
• Specialized drawing tools
• Interactive browser
• Represent/handle multiple haplotypes
![Page 23: A hierarchical approach to building contig scaffolds](https://reader033.fdocuments.net/reader033/viewer/2022051116/5681555f550346895dc329ec/html5/thumbnails/23.jpg)
Acknowledgements
• Dan Kosack
• Steven Salzberg
• Martin Shumway
• Hean Koo
• Luke Tallon
• Jessica Vamathevan