Biological Motivation for Fragment Assembly

Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake


Page 1: Biological Motivation for Fragment Assembly

Biological Motivation for Fragment Assembly

Rhys Price Jones

Anne R. Haake

Page 2: Biological Motivation for Fragment Assembly

What is fragment assembly?

• The reconstruction of the contiguous chromosomal DNA sequence from short, experimentally generated fragments
  – i.e. sequence reassembly

• The sequence reassembly process must realign the short fragments, in the correct order, and then generate a consensus sequence.

Page 3: Biological Motivation for Fragment Assembly

A Simple Case

• Suppose the target sequence is known to be about 10 bp long

• Sequenced fragments are:

ACCGT
CGTGC
TTAC
TACCGT

Page 4: Biological Motivation for Fragment Assembly

--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC

Overlaps between fragments and the estimated length of the target sequence guide the assembly
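The overlap-guided assembly of the four fragments in this example (ACCGT, CGTGC, TTAC, TACCGT) can be sketched with a simple greedy merge. This is only an illustration, not the method of any particular assembler: the function names `overlap` and `greedy_assemble` are hypothetical, and the sketch assumes error-free reads, a known orientation, and no repeats.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Best (overlap length, left index, right index) over all ordered pairs.
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags)
            if i != j
        )
        # If k is 0 the two fragments are simply concatenated.
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]


print(greedy_assemble(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```

The result is 9 bp long, which agrees with the estimate of about 10 bp for the target. Greedy merging is not guaranteed to find the shortest or the correct reconstruction in general.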

Page 5: Biological Motivation for Fragment Assembly

Why is fragment assembly important?

• We need to have reliable, complete genomic sequences of human and other model organisms

• Base-pair sequence is the most basic piece of DNA information (gene structure and function described by sequence)

Page 6: Biological Motivation for Fragment Assembly

Why fragment the DNA in the first place?

• The human genome is large: ~3 × 10^9 base pairs long

• Sequencers can generate sequences only approx. 500-600 bp long at a time

Page 7: Biological Motivation for Fragment Assembly

Solutions?

• Directed Sequencing: use custom primers to sequence sequentially from genomic DNA. This is a slow and expensive process.

• Shotgun Sequencing: DNA is extracted, fragmented (e.g. sheared), cloned, sequenced from both ends of clone, reassembled, and finished (gaps are closed)

Page 8: Biological Motivation for Fragment Assembly

Solutions?

• Cloning of fragments is accomplished using different vectors, chosen according to the size of the fragments (inserts into the vector).

• Large fragments: YACs (1 Mb), BACs (100-200 kb)

• Intermediate: Cosmids, Lambda

• Small: Plasmids, M13

Page 9: Biological Motivation for Fragment Assembly

Genome Sequencing Strategies

• Human Genome Project: map-based strategy
  – initially used a “tiling set” of large clones that cover the genome
  – ends of the tiling set clones sequenced to allow ordering/mapping to the chromosome
  – individual clones subjected to shotgun sequencing
  – the sequences from the clones (shotgun fragments) then reassembled

• Celera: whole genome sequence strategy
  – shotgun sequencing

Page 10: Biological Motivation for Fragment Assembly

Celera: Whole Genome Sequencing

• Celera (which won the race for the draft human sequence) took a whole genome sequence strategy

• cloned all of the fragmented human genome into 3 different sized clone libraries

• sequenced both ends of each clone

• reassembly

• advances in automated sequencing speed and accuracy were key to the success of the Celera approach

Page 11: Biological Motivation for Fragment Assembly

Another Reason Fragment Assembly is Important:

• Assembling and/or clustering sets of expressed sequence tags (ESTs)

• The problem is that ESTs are partial sequences, and a single EST may span more than one exon (intron sequences, present in the genomic sequence, have been spliced out)

• Identifying ESTs and assigning them to genes is aided by finding overlaps with other ESTs.

Page 12: Biological Motivation for Fragment Assembly

Experimental issues present some challenges for algorithm development

• DNA sequencing data is imperfect

• Every base in the DNA should be covered several times (at least twice; once in each direction) to minimize the effects of random errors

• Base-calling errors (errors in determining the base identity from the DNA sequencer trace) can occur; the quality of traces is not always high. Capillary sequencing has reduced the errors caused by lane bleed-through in slab gel sequencing
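The requirement that every base be covered several times can be turned into rough arithmetic. The sketch below assumes reads of a fixed length placed uniformly at random, so that the number of reads covering a given base is approximately Poisson distributed. The function names and the example coverage values are illustrative assumptions, not figures from the slides.

```python
import math


def reads_needed(genome_len, read_len, coverage):
    """Reads required so that total sequenced bases = coverage * genome_len."""
    return math.ceil(coverage * genome_len / read_len)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


genome = 3_000_000_000  # ~3 x 10^9 bp, as quoted for the human genome
read = 500              # lower end of the 500-600 bp read length

for c in (1, 2, 8):     # hypothetical coverage depths
    print(c, reads_needed(genome, read, c), expected_uncovered_fraction(c))
```

Under these assumptions even one-fold coverage takes six million reads, and random placement still leaves a noticeable fraction of bases unsequenced, which is one reason gaps must be closed in a finishing step.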

Page 13: Biological Motivation for Fragment Assembly

• Base-calling software (e.g. Phred) attempts to assign a base to each position in the sequence, along with a quality value

• The quality of the sequence tends to degrade at the ends.

• Vector sequence also contaminates the ends of reads.

• NHGRI standard: 99.99% accuracy before submission of sequence to GenBank.
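Phred-style quality values relate to the estimated error probability P of a base call by Q = -10 log10(P), so 99.99% accuracy corresponds to Q = 40. The sketch below shows that conversion and a deliberately simplified way of trimming low-quality read ends. The function names and the threshold of 20 are assumptions made for illustration; this is not the trimming procedure of Phred or of any specific tool.

```python
import math


def phred_to_error(q):
    """Estimated error probability for a Phred-style quality value q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred-style quality value for an estimated error probability p."""
    return -10 * math.log10(p)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases from both ends while their quality is below min_q."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(error_to_phred(1 - 0.9999))                                  # close to 40
print(trim_low_quality_ends("ACGTAC", [5, 30, 40, 35, 10, 8]))     # CGT
```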

Page 14: Biological Motivation for Fragment Assembly

SeqMan

Contig assembler and trace viewer. Can align against a reference sequence

http://www.dnastar.com/images2/r13a_lg.gif

Page 15: Biological Motivation for Fragment Assembly

A big issue:

• Human genome contains repetitive sequences
  – Highly repetitive: not transcribed, role unknown, present in millions of copies. Satellite (5-50 bp)
  – Moderately repetitive: some are transcribed, present in up to hundreds of thousands of copies
    • Tandem repeats e.g. Minisatellite (12-100 bp), Microsatellite (2-6 bp), telomeres
    • Interspersed repeats: larger repeats with high copy number e.g. SINE (Alu), LINE, tRNAs, rRNAs
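As a small illustration of the tandem-repeat category, the sketch below scans a sequence for microsatellite-like runs: a unit of 2-6 bp repeated back to back. The function name and the minimum of four consecutive copies are assumptions chosen for the example. Real repeat finders also tolerate mismatches, which this one does not.

```python
import re

# A 2-6 bp unit (shortest unit preferred) followed by at least 3 more copies.
_TANDEM = re.compile(r"([ACGT]{2,6}?)\1{3,}")


def find_microsatellites(seq):
    """Return (start, unit, copies) for each exact tandem run found in seq."""
    hits = []
    for m in _TANDEM.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


print(find_microsatellites("TTG" + "AC" * 5 + "GG"))  # [(3, 'AC', 5)]
```

Reads that fall entirely inside such a run look identical to each other, so overlaps alone cannot tell an assembler how many copies of the unit the target contains.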

Page 16: Biological Motivation for Fragment Assembly

Another issue:

• Orientation of the fragments is unknown

• Is the input fragment or its reverse complement a substring of the consensus?

Input     Layout
CACGT     CACGT--------
ACGT      -ACGT--------
ACTACG    --CGTAGT-----
GTACT     -----AGTAC---
ACTGA     --------ACTGA
CTGA      ---------CTGA
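The orientation question can be checked mechanically. In the layout for this example, the input fragments ACTACG and GTACT are placed as their reverse complements, CGTAGT and AGTAC. The sketch below computes reverse complements and reports which orientation of a fragment occurs in a candidate consensus. The consensus string CACGTAGTACTGA is read off the columns of that layout, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fitting_orientations(fragment, consensus):
    """List the orientations in which fragment is a substring of consensus."""
    fits = []
    if fragment in consensus:
        fits.append("forward")
    if reverse_complement(fragment) in consensus:
        fits.append("reverse")
    return fits


consensus = "CACGTAGTACTGA"  # inferred from the layout, not given explicitly
for frag in ["CACGT", "ACGT", "ACTACG", "GTACT", "ACTGA", "CTGA"]:
    print(frag, reverse_complement(frag), fitting_orientations(frag, consensus))
```

A fragment can fit in both orientations. GTACT does here, because the consensus contains the palindromic stretch AGTACT, so substring tests alone do not always settle the orientation.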

Page 17: Biological Motivation for Fragment Assembly

Yet another issue:

• Chimeras (mixed or heterogeneous DNA) may be introduced during the cloning process

• DNA from non-contiguous regions of the chromosome may be introduced as well as host DNA (for example, when growing plasmids in E. coli, the E. coli chromosomal DNA often contaminates clones)

Page 18: Biological Motivation for Fragment Assembly

General Considerations:

• The algorithms used to generate the consensus sequence must take the biological issues into account (although some don’t!).

• Need to consider prior biological information when analyzing a program’s assembly output.
  – e.g. known chromosomal sites or DNA fingerprinting data may be inconsistent with the program’s assembly output.