String Metrics in Classification of Mobile Genetic Elements
Discrete Mathematical Biology, Math 8803
www.yale.edu/turner/projects/ecoli.htm www.geneticengineering.org/evolution
Ryan Wagner
Biology/Bioinformatics PhD student
http://pdb.lbl.gov/microscopies
String Metrics in Classification of Mobile Genetic Elements
1. Mathematical relevance
2. Biological relevance
3. Test and review of four distance methods
4. What was “good, bad, and ugly”.
Introduction: distances on strings
Formal definition of a distance function, D:
D(a, b) = 0 ⇔ a = b, the identity axiom
D(a, b) = D(b, a), the symmetry axiom
D(a, b) + D(b, c) ≥ D(a, c), the triangle inequality
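The three axioms can be checked mechanically on any matrix of pairwise distances. The following is a minimal sketch, not part of the original slides: the function name `check_metric_axioms` and the tolerance `tol` are illustrative assumptions, and the rows of the matrix are assumed to correspond to distinct sequences.

```python
from itertools import permutations

def check_metric_axioms(dist, tol=1e-9):
    """Return a list of axiom violations found in a square distance matrix.

    Hypothetical helper for illustration; `dist` is a list of lists and
    each row is assumed to correspond to a distinct sequence.
    """
    n = len(dist)
    violations = []
    for i in range(n):
        for j in range(n):
            # Identity: the distance is zero exactly on the diagonal.
            if (abs(dist[i][j]) <= tol) != (i == j):
                violations.append(("identity", i, j))
            # Symmetry: D(a, b) = D(b, a).
            if abs(dist[i][j] - dist[j][i]) > tol:
                violations.append(("symmetry", i, j))
    # Triangle inequality: D(a, b) + D(b, c) >= D(a, c) for all triples.
    for i, j, k in permutations(range(n), 3):
        if dist[i][j] + dist[j][k] < dist[i][k] - tol:
            violations.append(("triangle", i, j, k))
    return violations
```

An empty list means no violation was found at the chosen tolerance; it does not prove that the underlying distance function is a metric on all strings.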
Tree Additivity
A matrix of pairwise distances is tree-additive when a tree can be built from it in which the distance between any two leaves (sequences) equals the sum of the edge lengths on the path connecting them (Baake and Haeseler, 1997).
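One standard way to test tree additivity without building the tree is the four-point condition: for every four leaves, the two largest of the three pairwise sums must be equal. The sketch below assumes a symmetric matrix given as a list of lists; the name `is_additive` and the tolerance are illustrative and do not come from the slides.

```python
from itertools import combinations

def is_additive(dist, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        # The three ways of pairing up four leaves.
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # Additive (tree) metrics have the two largest sums equal.
        if sums[2] - sums[1] > tol:
            return False
    return True
```

For example, a quartet tree ((A,B),(C,D)) with pendant edges of length 1 and an internal edge of length 2 gives distances 2 within each pair and 4 between pairs, and passes the check. A matrix can satisfy the triangle inequality and still fail it.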
Introduction: mathematical distance vs. evolutionary distance
Mathematical distance
• the three metric properties comprise the basis for characterization
• may also be characterized by Turing Machine computability (Ahlbrandt et al., 2004)
• amenable to both alignment-based and alignment-free methods
Evolutionary distance
• when obtained by common statistical correction techniques, fails to satisfy the triangle inequality
• the tree additivity property may hold where the triangle inequality fails
• not developed for alignment-free distances on DNA strings
Introduction: mobile genetic elements
Plasmid - an autonomous, self-replicating circular piece of DNA found outside the chromosome in bacteria. www8.nos.noaa.gov/coris_glossary
Bacteriophage - a virus that attacks and infects bacterial cells. www.ncbi.nlm.nih.gov/ICTBdb/ICTVdB
www.microbe-edu.org/etudiant
Transposon - a DNA sequence capable of moving to new locations within the same cell
Methods: data collection and software
• DNA sequences for replication initiation (RepA) and division partition (ParA) in both plasmids and host chromosomes obtained from NCBI, www.ncbi.nlm.nih.gov/genomes/lproks.cgi
• DNA sequences of selected plasmids from gram-negative bacteria also obtained from NCBI
• Neighbor-joining trees constructed for each set of pairwise distances using PHYLIP, http://evolution.genetics.washington.edu/phylip.html
• Custom Perl script used to generate matrices of pairwise distances:
    4
G_lovleyi  0.000000 0.864000 0.887000 0.844000
Acidovoro  0.864000 0.000000 0.664000 0.724000
Acid_JS42  0.887000 0.664000 0.000000 0.836000
Xanth_axo  0.844000 0.724000 0.836000 0.000000
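The original Perl script is not shown, so the following Python stand-in only sketches how such a matrix could be produced. It assumes the classic PHYLIP square format (a taxon count on the first line, names padded to 10 characters); `phylip_distance_matrix` and `hamming_fraction` are hypothetical names, and the Hamming fraction is a placeholder for any of the four distances discussed here.

```python
def hamming_fraction(a, b):
    """Placeholder distance: fraction of mismatched positions (equal lengths)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def phylip_distance_matrix(names, seqs, distance):
    """Format all pairwise distances as a PHYLIP-style square matrix."""
    lines = [f"{len(names):5d}"]  # first line: number of taxa
    for name, seq_i in zip(names, seqs):
        row = [distance(seq_i, seq_j) for seq_j in seqs]
        # Classic PHYLIP pads or truncates taxon names to 10 characters.
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines)
```

Calling `phylip_distance_matrix(["s1", "s2"], ["ACGT", "ACGA"], hamming_fraction)` yields a two-taxon matrix in the same layout as the one above.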
Methods: edit distance
Here is where horizontal gene transfer begins to cause problems.
Data structure in custom script for test input: ATTGCGAGC and ATGCGACC
        A  T  G  C  G  A  C  C
     0  1  2  3  4  5  6  7  8
  A  1  0  1  2  3  4  5  6  7
  T  2  1  0  1  2  3  4  5  6
  T  3  2  1  2  3  4  5  6  7
  G  4  3  2  1  2  3  4  5  6
  C  5  4  3  2  1  2  3  4  5
  G  6  5  4  3  2  1  2  3  4
  A  7  6  5  4  3  2  1  2  3
  G  8  7  6  5  4  3  2  3  4
  C  9  8  7  6  5  4  3  2  3
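Edit distance = 3, read from the lower right corner; no traceback needed. The table charges 2 for a substitution (one deletion plus one insertion); with unit substitution cost the Levenshtein distance for this pair is 2.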
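The dynamic-programming table can be filled row by row, keeping only the previous row in memory. This is a minimal sketch rather than the custom script itself; the function name and the `sub_cost` parameter are assumptions introduced so that both the unit-cost Levenshtein distance and the cost-2 substitution scheme can be compared.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and configurable substitution cost."""
    prev = list(range(len(b) + 1))  # row 0: build b from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # column 0: delete the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                   # delete ca
                curr[j - 1] + 1,                               # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]  # lower right corner of the table

print(edit_distance("ATTGCGAGC", "ATGCGACC", sub_cost=2))  # → 3
print(edit_distance("ATTGCGAGC", "ATGCGACC"))              # → 2
```

With `sub_cost=2` the rows generated by the loop match the table for the test input; with the default unit cost one deletion and one substitution suffice.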
Methods: the problem with edit distance
Consider GTGACGTACTATTGC_ and
GTGAGTACTATTGCC
1 character delete/insert: edit distance = 2
Consider GTGACGTACTATTGC_ and
GTACTATTGCGTGAC
5 character delete/insert: edit distance = 8
Allowing block deletions, block insertions, and block reversals gives a better approximation of the recombinant nature of DNA evolution (Long-Hui et al., 2004).
However, the least-constrained application of block edit distance has O(n³) time complexity. Constrained block edit distance computation is NP-hard (Lopresti and Tomkins, 1997).
Methods: Euclidean distance over dinucleotide counts
A new paradigm: complexity-based distance metrics which employ neither alignments nor dynamic programming
a = GTGACGTACTATTGC
b = GTACTATTGCGTGAC
Computation of counts vectors for a and b
dinucleotide a b L2
TC + GA 1 1 0
TG + CA 2 2 0
CT + AG 1 1 0
AC + GT 4 4 0
TT + AA 1 1 0
CC + GG 0 0 0
CG 1 1 0
AT 1 1 0
GC 1 1 0
TA 2 2 0
L2 = (1/16) ∑ | a*ij − b*ij |, summed over all dinucleotides ij,
where a*ij = freq(ij) / (freq(i) · freq(j))
Here L2 = 0
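The counts in the table and the distance formula can be reproduced with a short script. This is a sketch under stated assumptions: the function names are illustrative, each dinucleotide is pooled with its reverse complement as in the table rows, and the distance is computed from single-strand frequencies, which may differ from the original implementation.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def symmetrized_dinucleotide_counts(seq):
    """Pool each dinucleotide with its reverse complement (e.g. TC + GA)."""
    raw = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    classes = Counter()
    for word, n in raw.items():
        # The alphabetically smaller of the pair is used as the class key.
        classes[min(word, reverse_complement(word))] += n
    return classes

def relative_abundance(seq):
    """a*_ij = freq(ij) / (freq(i) * freq(j)) for all 16 dinucleotides."""
    n = len(seq)
    base_freq = {base: seq.count(base) / n for base in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    rho = {}
    for x, y in product("ACGT", repeat=2):
        denom = base_freq[x] * base_freq[y]
        # Assumption: a dinucleotide involving an absent base contributes 0.
        rho[x + y] = (pair_counts[x + y] / (n - 1)) / denom if denom else 0.0
    return rho

def dinucleotide_distance(a, b):
    """(1/16) * sum over ij of |a*_ij - b*_ij|."""
    ra, rb = relative_abundance(a), relative_abundance(b)
    return sum(abs(ra[w] - rb[w]) for w in ra) / 16

a = "GTGACGTACTATTGC"
b = "GTACTATTGCGTGAC"
print(symmetrized_dinucleotide_counts(a) == symmetrized_dinucleotide_counts(b))  # → True
```

Because b is a rotation of a with identical dinucleotide content, the two count vectors coincide and the distance is zero, regardless of how far the block was moved.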
Methods: compression distance by the Burrows-Wheeler transform (scheme from Mantaci et al., 2008)
a0 = GTGACGTACTATTGC b0 = GTACTATTGCGTGAC
a1 = TGACGTACTATTGCG b1 = TACTATTGCGTGACG
a2 = GACGTACTATTGCGT b2 = ACTATTGCGTGACGT
a3 = ACGTACTATTGCGTG b3 = CTATTGCGTGACGTA
...
a14 = CGTGACGTACTATTG b14 = CGTACTATTGCGTGA
Merge lists: “Blue” list and “Red” list
Merged list is then sorted:
Sorted rotation    Last character
ACGTACTATTGCGTG    G
ACGTACTATTGCGTG    G
ACTATTGCGTGACGT    T
ACTATTGCGTGACGT    T
ATTGCGTGACGTACT    T
ATTGCGTGACGTACT    T
CGTACTATTGCGTGA    A
CGTGACGTACTATTG    G
CTATTGCGTGACGTA    A
CTATTGCGTGACGTA    A
GACGTACTATTGCGT    T
GACGTACTATTGCGT    T
GCGTGACGTACTATT    T
GCGTGACGTACTATT    T
GTACTATTGCGTGAC    C
GTACTATTGCGTGAC    C
GTGACGTACTATTGC    C
GTGACGTACTATTGC    C
TACTATTGCGTGACG    G
TACTATTGCGTGACG    G
TATTGCGTGACGTAC    C
TATTGCGTGACGTAC    C
TGACGTACTATTGCG    G
TGACGTACTATTGCG    G
TGCGTGACGTACTAT    T
TGCGTGACGTACTAT    T
TTGCGTGACGTACTA    A
TTGCGTGACGTACTA    A
Row colors, top to bottom (B = “Blue”, R = “Red”):
BRBRBRBRBRBRRBBRRBRBBRRBBRRB
Column of last characters is the Burrows-Wheeler transform.
Note “runs” of nucleotides.
Sequence “color” is then correlated to Burrows-Wheeler column
If the color counts in each segment of runs are equal, the sum is 0.
Else, sum up the total of unequal colors.
Distance = 2
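The color-counting step can be sketched as follows. The reading used here is an assumption: each run of identical characters in the Burrows-Wheeler column forms a segment, and every segment contributes the absolute difference between its blue and red counts. The function name is hypothetical, and the two input strings are the last-character column and the row colors read off the sorted list in this example.

```python
from itertools import groupby

def color_run_distance(bwt_column, colors):
    """Sum, over runs of equal characters in the BWT column, of |#blue - #red|."""
    total = 0
    rows = zip(bwt_column, colors)
    for _, run in groupby(rows, key=lambda pair: pair[0]):
        run_colors = [color for _, color in run]
        total += abs(run_colors.count("B") - run_colors.count("R"))
    return total

bwt_column = "GGTTTTAGAATTTTCCCCGGCCGGTTAA"
colors = "BRBRBRBRBRBRRBBRRBRBBRRBBRRB"
print(color_run_distance(bwt_column, colors))  # → 2
```

Only the single-character runs "A" and "G" in rows 7 and 8 are unbalanced; every other run holds equal numbers of blue and red rows.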
Methods: “rank” distance
Related to Hamming distance, but less sensitive to insertions/deletions (from Dinu and Sgarro, 2006)
a = GTGACGTACTATTGC b = GTACTATTGCGTGAC
Index each base and correlate it to its position in the sequence relative to the other sequence:
• e.g. count the first occurrence of G in a and b, compute the difference in their positions,
• count the second occurrence of G in a and b, compute the difference in their positions, …
Methods: “rank” distance
a = GTGACGTACTATTGC b = GTACTATTGCGTGAC
position difference = 0
position difference = 6
position difference = 5
Sum rank counts for G
Repeat procedure for T, A, and C
Sum rank counts for all four bases and normalize by arithmetic mean of sequence length
Distance = 0.01667, cf. normalized edit distance = 0.5333
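The per-base position differences can be computed as below. This is a simplified sketch, not the full rank distance of Dinu and Sgarro: the function names are illustrative, the two sequences are assumed to have the same base composition (unmatched occurrences are ignored), and the normalization step is left out because its exact form is not spelled out here.

```python
from collections import defaultdict

def occurrence_positions(seq):
    """Map each base to the 1-based positions of its successive occurrences."""
    positions = defaultdict(list)
    for index, base in enumerate(seq, start=1):
        positions[base].append(index)
    return positions

def rank_offsets(a, b):
    """For each base, |position in a - position in b| of the k-th occurrence."""
    pos_a, pos_b = occurrence_positions(a), occurrence_positions(b)
    return {
        base: [abs(p - q) for p, q in zip(pos_a[base], pos_b[base])]
        for base in sorted(set(pos_a) | set(pos_b))
    }

a = "GTGACGTACTATTGC"
b = "GTACTATTGCGTGAC"
offsets = rank_offsets(a, b)
print(offsets["G"])  # → [0, 6, 5, 1]
```

The first three entries for G are the position differences shown above; summing the lists for all four bases gives the unnormalized total.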
Results of attempt to cluster by mobile element type
Multiple sequence alignment-based NJ tree - customary bioinformatics.
Sequences of different taxonomic groups paired closely - diagnostic of mobile genetic elements
Results of attempt to cluster by mobile element type
Edit distance tree gives same topology
Results of attempt to cluster by mobile element type
Euclidean distance over dinucleotide counts and rank distance successfully group two plasmids
Results of attempt to cluster by mobile element type
Burrows-Wheeler compression pairwise distances do not give a clear clustering.
Why did the BWT distances not perform well?
Insurmountable problem: the BWT distance script, as given, could not compute distances on whole plasmids.
Diagnosis: the time complexity of BWT is O(n·log(n)), but the space complexity is O(n²)
RepA-ParA sequence data were too short for useful shared repeat regions to appear.
Remedy: Run complete plasmid sequences through BWT distance script
Mantaci et al. (2008) also found that their BWT distance does not satisfy the triangle inequality.
Can dinucleotide counts or rank distance be made to perform better in separating mobile elements?
Li et al. (2004) used trinucleotide counts combined with higher-order nucleotide word counts to accurately infer an evolutionary tree of mammalian mitochondrial DNA.
Such simple methods cannot hope to approximate Kolmogorov complexity distance.
Recall that Kolmogorov complexity is related to the length of the Turing Machine needed to transform sequence a into sequence b (Li et al., 2004).
Open issues
• So far, only dinucleotide counts have been developed for clustering of mobile elements (Blaisdell and Karlin, 1996)
• BWT distance and Rank distance were developed to cluster mammalian mitochondrial DNA (Mantaci et al., 2008; Dinu and Sgarro, 2006).
• Rank distance not shown to satisfy triangle inequality
• Can it be proven whether or not a pairwise distance satisfying the triangle inequality yields an additive tree?
References
Ahlbrandt, C., Benson, G., and Casey, W. (2004) “Minimal entropy probability between genome families.” Journal of Mathematical Biology. 48:563-590.
Baake, E. (1998) “What can and cannot be inferred from pairwise sequence comparisons?” Mathematical Biosciences. 154:1-21.
Blaisdell, B.E., Campbell, A.M., and Karlin, S. (1996) “Similarities and dissimilarities of phage genomes.” Proc. Natl. Acad. Sci. 93:5854-5859.
Dinu, L.P. and Sgarro, A. (2006) “A low-complexity distance for DNA strings.” Fundamenta Informaticae. 76:361-372.
Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P. (2004) “The similarity metric.” IEEE Transactions on Information Theory XX(Y)
Long-Hui, W., Juan, L., Zhou, H-B., and Feng, Shi. (2004) “A new distances metric and its application in phylogenetic tree construction.” Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
Lopresti, D. and Tomkins, A. (1997) “Block edit models for approximate string matching.” Theoretical Computer Science. 181:159-179.
Mantaci, S., Restivo, A., and Sciortino, M. (2008) “Distance measure for biological sequences: Some recent approaches.” International Journal of Approximate Reasoning. 47:109-124.