Precomputing Edit-Distance Specificity of Short Oligonucleotides
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
2
Polymerase Chain Reaction
4
Primer Specificity
• Need to ensure that primers hybridize to a specific (specified) locus only
  • Require exactly one occurrence of specified sequence
  • Require no (potential) mis-hybridization loci
• Bottleneck computation in primer-design
  • Design / check iteration is problematic
5
k-unique 20-mers
• Edit-distance as a surrogate for mis-hybridization potential
• k-unique loci:
  • All non-self genomic loci require more than k edits in (global) alignment
  • Closest non-self genomic locus requires (k+1) edits in (global) alignment
6
Find all k-unique 20-mers
• Naïve algorithm: O(n²km)
  • Quadratic in size of genome.
• 0-unique (exact match) 20-mers
  • (Expected) linear time algorithm
• Achieve expected linear time using a hybrid approach (blastn):
  • Use partial exact match to “seed” expensive dynamic programming alignment
  • Large chunks → Fast, but miss occurrences
  • Small chunks → Slow, but correct
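The exact-match (0-unique) case can be sketched in a few lines. This is a minimal illustration, assuming the genome fits in memory as a plain string and that only the forward strand is considered; the function name `zero_unique_positions` is made up for this example and does not come from the talk.

```python
from collections import Counter

def zero_unique_positions(genome, m=20):
    """Return start positions of m-mers that occur exactly once in genome."""
    # One pass counts every m-mer, a second pass reports the singletons.
    # With a hash table this is expected linear time in len(genome) for fixed m.
    counts = Counter(genome[i:i + m] for i in range(len(genome) - m + 1))
    return [i for i in range(len(genome) - m + 1)
            if counts[genome[i:i + m]] == 1]

# Toy input: ACGT occurs twice (starts 0 and 4), every other 4-mer once.
print(zero_unique_positions("ACGTACGTTT", m=4))   # [1, 2, 3, 5, 6]
```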
7
Inexact sequence match
Baeza-Yates Perleberg:
• Correct and O(n) for small k
• At least 1 chunk is observed with no error.
• Small k → Large chunks → Fast and correct
• Largest correct chunk: floor(m/(k+1))
[Figure: query q aligned to genome g in chunks marked ≠, =, ≠]
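The chunk bound above is a pigeonhole argument: k edits can damage at most k of k+1 non-overlapping chunks. The sketch below is only an illustration of that argument; the helper names `largest_correct_chunk` and `chunks` are hypothetical, and substring lookup stands in for a real index.

```python
def largest_correct_chunk(m, k):
    """Chunk length that guarantees an error-free chunk: floor(m / (k + 1))."""
    return m // (k + 1)

def chunks(query, k):
    """Split query into k + 1 non-overlapping chunks of that length.

    Leftover bases at the end are ignored here for simplicity.
    """
    size = largest_correct_chunk(len(query), k)
    return [(i * size, query[i * size:(i + 1) * size]) for i in range(k + 1)]

# With at most k edits, at most k of the k + 1 chunks can be damaged,
# so at least one chunk must appear unchanged in the other sequence.
query = "ACTTGTCCACAGTGCTTAAG"
target = "ACTTGTGCACAGTCCTTAAG"   # two substitutions away from query
for offset, chunk in chunks(query, k=2):
    print(offset, chunk, chunk in target)
```

For 20-mers this gives chunk lengths 10, 6, 5 and 4 for k = 1, 2, 3 and 4. In the example only the first chunk survives, which is the worst case the bound allows.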
8
Example worst case alignments
TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG
||||||*||||||*||||||
ACTTGTGCACAGTCCTTAAG
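Both alignments above use two edits: one indel plus one substitution in the first, two substitutions in the second. A standard unit-cost Levenshtein recurrence, written here as a minimal sketch (the function name `edit_distance` is this example's own), confirms that neither pair can be aligned with fewer.

```python
def edit_distance(a, b):
    """Levenshtein distance (unit-cost substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match / substitution
                           prev[j] + 1,               # delete ca from a
                           cur[j - 1] + 1))           # insert cb into a
        prev = cur
    return prev[-1]

# First pair with the gap character removed from the top sequence.
print(edit_distance("TCCCGCTAGATTGAGATCT", "TCCCGCCTAGATTTAGATCT"))   # 2
print(edit_distance("ACTTGTCCACAGTGCTTAAG", "ACTTGTGCACAGTCCTTAAG"))  # 2
```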
9
Brute-force approach
ACTTGTGCACAGTCCTTAAG
2-mer position table
AA:18   AC:1,9   AG:11,19   CA:8,10   CC:14   CT:2,15
GC:7    GT:5,12  TA:17      TC:13     TG:4,6  TT:3,16
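The position table can be rebuilt with a single pass over the sequence. The sketch below assumes 1-based positions, as the table uses; `kmer_positions` is a name chosen for this example.

```python
from collections import defaultdict

def kmer_positions(seq, l):
    """Map each l-mer to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - l + 1):
        table[seq[i:i + l]].append(i + 1)   # 1-based, as in the table above
    return dict(table)

table = kmer_positions("ACTTGTGCACAGTCCTTAAG", 2)
for kmer in sorted(table):
    print(kmer, table[kmer])
```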
10
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
16
Brute-force approach
Divide the genome into 10 Mb blocks
For all pairs of blocks:
  For all l-mer matches:
    Do all pair-wise DPs containing match
    If ≤ k edits, mark position non-unique
300 x 300 pairs of blocks
For 20-mers:
  k=1 → l=10; k=2 → l=6; k=3 → l=5; k=4 → l=4.
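The loop above can be sketched for one pair of blocks as follows. This is a toy illustration under stated assumptions, not the implementation from the talk: the blocks are short in-memory strings, the seed length is taken as floor(m/(k+1)), verification uses a free-end-gap dynamic program over a window widened by k on each side, and all function names are hypothetical. Comparing a block against itself would also need to skip the trivial self-alignment, which this sketch leaves out.

```python
from collections import defaultdict

def best_substring_distance(query, window):
    """Minimum edit distance between query and any substring of window."""
    prev = [0] * (len(window) + 1)            # a match may start anywhere
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j - 1] + (qc != wc),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return min(prev)                          # ...and end anywhere

def non_unique_positions(block_a, block_b, m=20, k=2):
    """Starts i in block_a whose m-mer occurs in block_b within k edits."""
    l = m // (k + 1)                          # seed length from the chunk bound
    index = defaultdict(list)
    for p in range(len(block_b) - l + 1):
        index[block_b[p:p + l]].append(p)
    marked = set()
    for i in range(len(block_a) - m + 1):
        mer = block_a[i:i + m]
        for o in range(m - l + 1):            # every l-mer inside the m-mer
            for p in index.get(mer[o:o + l], ()):
                lo = max(0, p - o - k)        # indels shift the locus by <= k
                hi = min(len(block_b), p - o + m + k)
                if best_substring_distance(mer, block_b[lo:hi]) <= k:
                    marked.add(i)
                    break
            if i in marked:
                break
    return marked

a = "ACTTGTCCACAGTGCTTAAG"
b = "GGG" + "ACTTGTGCACAGTCCTTAAG" + "GGG"
print(non_unique_positions(a, b, m=20, k=2))   # {0}
print(non_unique_positions(a, b, m=20, k=1))   # set()
```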
17
Brute-force approach
Things are looking really, really bad:
• Seeds are too short
• 90,000 pair-wise block comparisons
Actually quite good (seed size 12):
• Non-uniqueness certificates are dense
• Almost all positions eliminated early
• Behaves more like linear time than quadratic
18
In practice (edit-dist 4)
19
In practice (edit-dist 4)
20
In practice (edit-dist 4)
21
In practice (edit-dist 3)
22
In practice (edit-dist 3)
23
In practice (edit-dist 4,3,2)
24
In practice (edit-dist 4,3,2)
25
Edit distance 2
• After seed size 12
  • ~ 27K (0.288%) positions have no match
• After seed size 8
  • ~ 3K (0.029%) positions have no match
• Using seed size 6 is still too slow
  • Need a more sophisticated hashing strategy
  • 6-mers match in too many places!
26
Spaced seed-set design problem
• Given:
  • mer-size: m ( = 20 )
  • # errors: k ( = 1,2,3 )
  • # cares: l ( = 10,12,14 )
• Find the smallest set of spaced seeds that will find all alignments.
27
Solution for (20,2,8)
• 11111111, 111101111
TCCCGCGTAGATTGAGATCT
||||||*||||||*||||||
TCCCGCCTAGATTTAGATCT
• How can we find these spaced seed set solutions?
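Whether a seed set finds all alignments can be checked by brute force for small cases. The sketch below assumes a substitution-only error model (indels are not handled here) and exactly k errors per 20-mer; `seed_hits` and `uncovered` are names invented for this example.

```python
from itertools import combinations

def seed_hits(seed, errors, m):
    """True if some placement of seed has no error under a care ('1') position."""
    cares = [i for i, c in enumerate(seed) if c == "1"]
    return any(all(start + c not in errors for c in cares)
               for start in range(m - len(seed) + 1))

def uncovered(seeds, m, k):
    """Placements of exactly k substitution errors that no seed detects."""
    return [errs for errs in combinations(range(m), k)
            if not any(seed_hits(s, set(errs), m) for s in seeds)]

print(len(uncovered(["11111111", "111101111"], m=20, k=2)))   # 0
print(len(uncovered(["11111111"], m=20, k=2)))                # 10
```

Under these assumptions the pair of seeds covers every placement of two substitutions, while the contiguous seed alone misses 10 of them, including errors at 0-based positions 6 and 13 as in the alignment above.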
28
Spaced seed set design set-cover formulation
• Set cover instance:
  • Ground set: all possible placements of the k errors (alignments)
  • Covering sets: all possible placements of the l care positions
• For (m=20,k=2,l=10),
  • 190 elements, 184,756 sets!
  • Need to reduce the number of sets!
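The instance sizes quoted above are binomial coefficients, which can be checked directly. The third quantity, how many error placements one care placement covers, is not stated in the talk; it is derived here under the assumption that a placement covers an alignment exactly when no error falls on a care position.

```python
from math import comb

m, k, l = 20, 2, 10
ground_set = comb(m, k)          # placements of the k errors
covering_sets = comb(m, l)       # placements of the l care positions
# One placement of cares detects exactly those error placements that fall
# entirely on the m - l don't-care positions.
covered_per_set = comb(m - l, k)

print(ground_set, covering_sets, covered_per_set)   # 190 184756 45
```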
29
Dirty secret of spaced seeds
• Spaced seeds take O(# cares) to update!
  • Contiguous seeds are O(1) to update
• 101010101010101 vs 11111111
  • 8 steps to update vs 1 step to update
• Constant time update for spaced seeds?
  • Yes, if they have a certain structure
30
O(1) spaced seed update
ACGTACGTACGTACGTACGTA G A G C T C T G A G A T C T C ...
Spaced seed 1010101 can be updated in 1 step!
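One way to read the one-step claim is to keep a separate rolling key for each phase of the period, so that the key at start i is derived from the key at start i - 2 with a single shift-and-add. The sketch below is an interpretation under that assumption, with a 2-bit base encoding; the function names and the encoding are not from the talk.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def direct_keys(seq, period=2, cares=4):
    """Key of seed 1010101 at each start, recomputed from scratch (4 steps)."""
    span = period * (cares - 1) + 1
    keys = []
    for i in range(len(seq) - span + 1):
        key = 0
        for c in range(cares):
            key = (key << 2) | CODE[seq[i + c * period]]
        keys.append(key)
    return keys

def rolling_keys(seq, period=2, cares=4):
    """Same keys, but each start costs one shift-and-add.

    One running key is kept per phase (position mod period); the key at
    start i is derived from the key at start i - period.
    """
    span = period * (cares - 1) + 1
    mask = (1 << (2 * cares)) - 1
    state = [0] * period
    keys = []
    for pos, base in enumerate(seq):
        phase = pos % period
        state[phase] = ((state[phase] << 2) | CODE[base]) & mask
        if pos >= span - 1:                  # a full seed span has been read
            keys.append(state[phase])
    return keys

seq = "ACGTACGTAGAGCTCTGAGATCTC"
print(rolling_keys(seq) == direct_keys(seq))
```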
31
O(1) spaced seed update
• “Periodic” spaced seeds can be updated in “constant” time
  • 11011011011: 2 steps
  • 11001100110011: 2 steps
  • 1000010000100001: 1 step
• Need to minimize the number of update steps, not the number of templates
  • 11111111,111101111 has update cost 5.
32
Edit-distance SS-SDP
• Position of matching bases might shift!
  • Need 11111111 ↓ to get CCGCTAGA
  • Need 111101111 ↑ to get CCGCTAGA
• Set cover formulation no longer works
TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT
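The shift can be made concrete by applying each template to its own sequence. In this sketch the top sequence is used with its gap character removed, and the 0-based offset 2 is chosen by inspection so that both templates yield CCGCTAGA; `apply_template` is a hypothetical helper.

```python
def apply_template(seq, template, offset):
    """Concatenate the bases of seq under the care ('1') positions."""
    return "".join(seq[offset + i] for i, c in enumerate(template) if c == "1")

top = "TCCCGCTAGATTGAGATCT"       # top sequence, gap character removed
bottom = "TCCCGCCTAGATTTAGATCT"   # bottom sequence

print(apply_template(top, "11111111", 2))      # CCGCTAGA
print(apply_template(bottom, "111101111", 2))  # CCGCTAGA
```

The don't-care position of the longer template absorbs the extra base in the bottom sequence, so the two keys agree even though a single shared template would not.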
33
Edit-Distance SS-SDP
• Use a variation on set cover:
  • q:111101111, r:11111111 covers:
r: TCCCGC-TAGATTGAGATCT
   ||||||v||||||*||||||
q: TCCCGCCTAGATTTAGATCT
• Pay for query & reference update costs separately
• Control size of problem by only enumerating templates with small update cost
34
Solution for (20,2,10)
Query Templates:
   1: 11111111110000000000  Cost: 1
   2: 11111011111000000000  Cost: 5
  27: 11111000001111100000  Cost: 5
  42: 11111000000001111100  Cost: 5
Text Templates:
   1: 11111111110000000000  Cost: 1
   2: 11111011111000000000  Cost: 5
  32: 11111000000111110000  Cost: 5
  37: 11111000000011111000  Cost: 5
  42: 11111000000001111100  Cost: 5
Pairs of templates:
   1: 11111111110000000000   1: 11111111110000000000  Covers: 1274
   2: 11111011111000000000   1: 11111111110000000000  Covers: 260
   2: 11111011111000000000   2: 11111011111000000000  Covers: 1218
   1: 11111111110000000000   2: 11111011111000000000  Covers: 309
  42: 11111000000001111100  32: 11111000000111110000  Covers: 42
  27: 11111000001111100000  32: 11111000000111110000  Covers: 319
  42: 11111000000001111100  37: 11111000000011111000  Covers: 186
  27: 11111000001111100000  37: 11111000000011111000  Covers: 51
  42: 11111000000001111100  42: 11111000000001111100  Covers: 287
35
k-unique human 20-mers
• No 4-unique 20-mers
• No 3-unique 20-mers
• 0.038% of (forward) human 20-mers are 2-unique
  • 1088322 in total
  • about 1 every 2638 bases
  • Fast 2-uniqueness oracle
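The three figures above are mutually consistent, which a short calculation shows. The variable names are this example's own, and the implied number of 20-mer positions is a derived quantity, not a figure given in the talk.

```python
count = 1_088_322          # 2-unique 20-mers reported above
spacing = 2_638            # "about 1 every 2638 bases"

fraction_percent = 100 / spacing          # density implied by the spacing
implied_positions = count * spacing       # positions implied by both figures

print(f"{fraction_percent:.3f}%")             # 0.038%
print(f"{implied_positions / 1e9:.2f} Gb")    # 2.87 Gb
```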
36
F. tularensis 20-mer signatures
• Exact match in all six strains
• No match to bacterial background at edit-distance k
• No 3-unique 20-mer signatures
• 263 2-unique 20-mer signatures
  • 0.013%
• 1.3M 20-mer signatures (no background check)
  • 1.2M 0-unique 20-mer signatures
  • 580K 1-unique 20-mer signatures
37
Conclusions
• Precompute of human k-unique 20-mers is now feasible!
  • Faster for large edit-distance!
  • Need spaced seed-set designs
• Constant time update for spaced seeds
• Good integer programming formulation of SS-SDP
  • Limited template enumeration based on update cost
  • Work with integer programming experts to solve effectively
38
Next Steps
• Publish!
• Adapt for Tm and/or hybridization model
• Convert to native BOINC-application
• Integrate with primer-design software