Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance

Specificity of Short

Oligonucleotides

Precomputing Edit-Distance

Specificity of Short

OligonucleotidesNathan EdwardsCenter for Bioinformatics and Computational BiologyUniversity of Maryland, College Park

2

Polymerase Chain Reaction

3

Polymerase Chain Reaction

4

Primer Specificity

• Need to ensure that primers hybridize to a specific (specified) locus only• Require exactly one occurrence of specified

sequence• Require no (potential) mis-hybridization loci

• Bottleneck computation in primer-design• Design / check iteration is problematic

5

k-unique 20-mers

• Edit-distance as a surrogate for mis-hybridization potential

• k-unique loci:• All non-self genomic loci are require more

than k edits in (global) alignment• Closest non-self genomic loci requires

(k+1) edits in (global) alignment

6

Find all k-unique 20-mers

• Naïve algorithm: O(n2km)• Quadratic in size of genome.

• 0-unique (exact match) 20-mers• (Expected) linear time algorithm

• Achieve expected linear time using a hybrid approach (blastn):• Use partial exact match to “seed” expensive

dynamic programming alignment• Large chunks ) Fast, but miss occurrences• Small chunks ) Slow, but correct

7

Baeza-Yates Perleberg: • Correct and O(n) for small k

• At least 1 chunk is observed with no error.• Small k → Large chunks → Fast and correct• Largest correct chunk: floor(m/(k+1))

Inexact sequence match

≠ = ≠q

g

8

Example worst case alignments

TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG||||||*||||||*||||||ACTTGTGCACAGTCCTTAAG

9

Brute-force approach

ACTTGTGCACAGTCCTTAAG

AA:18AC:1,9AG:11,19CA:8,10CC:14CT:2,15

GC:7GT:5,12TA:17TC:13TG:4,6 TT:3,16

2-mer position table

10




11




12




13




14




15




16


Divide the genome into 10 Mb blocksFor all pairs of blocks:

For all l-mer matches:Do all pair-wise DPs containing matchIf ≤ k edits, mark position non-unique

300 x 300 pairs of blocksFor 20-mers:

k=1 ) l=10; k=2 ) l=6; k=3 ) l=5 ; k=4 ) l=4.

17


Things are looking really, really, bad:• Seeds are too short• 90,000 pair-wise block comparisons

Actually quite good (seed size 12):• Non-uniqueness certificates are dense• Almost all positions eliminated early• Behaves more like linear time than

quadratic

18

In practice (edit-dist 4)

19


20


21


22


23

In practice (edit-dist 4,3,2)

24

In practice (edit-dist 4,3,2)

25

Edit distance 2

• After seed size 12• ~ 27K (0.288%) positions have no match

• After seed size 8• ~ 3K (0.029%) positions have no match

• Using seed size 6 is still too slow• Need a more sophisticated hashing strategy• 6-mers match in too many places!

26

Spaced seed-set design problem

• Given:• mer-size: m ( = 20 )• # errors: k ( = 1,2,3)• # cares: l ( = 10,12,14 )

• Find the smallest set of spaced seeds that will find all alignments.

27

Solution for (20,2,8)

• 11111111, 111101111

TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT

• How can we find these spaced seed set solutions?

28

Spaced seed set design set-cover formulation

• Set cover instance:• Ground set: all possible placements of the

k errors (alignments)• Covering sets: all possible placements of

the l care positions

• For (m=20,k=2,l=10),• 190 elements, 184,756 sets!• Need to reduce the number of sets!

29

Dirty secret of spaced seeds

• Spaced seeds take O(# cares) to update!• Contiguous seeds are O(1) to update

• 101010101010101 vs 11111111• 8 steps to update vs 1 step to update

• Constant time update for spaced seeds?• Yes, if they have a certain structure

30

O(1) spaced seed update

ACGTACGTACGTACGTACGTA G A G C T C T G A G A T C T C ...Spaced seed 1010101 can be updated

in 1 step!

31

O(1) spaced seed update

• “Periodic” spaced seeds can be updated in “constant” time

• 11011011011 2 steps• 11001100110011 2 steps• 1000010000100001 1 step

• Need to minimize the number of update steps, not the number of templates• 11111111,111101111 has update cost 5.

32

Edit-distance SS-SDP

• Position of matching bases might shift!• Need 11111111 ↓ to get CCGCTAGA

• Need 111101111 ↑ to get CCGCTAGA

• Set cover formulation no longer works

TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT

33

Edit-Distance SS-SDP

• Use a variation on set cover:• q:111101111,r:11111111 covers:

• Pay for query & reference update costs separately

• Control size of problem by only enumerating templates with small update cost

r:TCCCGC-TAGATTGAGATCT ||||||v||||||*||||||q:TCCCGCCTAGATTTAGATCT

34

Solution for (20,2,10)

Query Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 27: 11111000001111100000 Cost: 5 42: 11111000000001111100 Cost: 5 Text Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 32: 11111000000111110000 Cost: 5 37: 11111000000011111000 Cost: 5 42: 11111000000001111100 Cost: 5Pairs of templates: 1: 11111111110000000000 1: 11111111110000000000 Covers: 1274 2: 11111011111000000000 1: 11111111110000000000 Covers: 260 2: 11111011111000000000 2: 11111011111000000000 Covers: 1218 1: 11111111110000000000 2: 11111011111000000000 Covers: 309 42: 11111000000001111100 32: 11111000000111110000 Covers: 42 27: 11111000001111100000 32: 11111000000111110000 Covers: 319 42: 11111000000001111100 37: 11111000000011111000 Covers: 186 27: 11111000001111100000 37: 11111000000011111000 Covers: 51 42: 11111000000001111100 42: 11111000000001111100 Covers: 287

35

k-unique human 20-mers

• No 4-unique 20-mers• No 3-unique 20-mers

• 0. 038% of (forward) human 20-mers are 2-unique• 1088322 in total• about 1 every 2638 bases• Fast 2-uniquness oracle

36

F. tularensis 20-mer signatures

• Exact match in all six strains• No match to bacterial background at edit-

distance k

• No 3-unique 20-mer signatures• 263 2-unique 20-mer signatures

• 0.013%

• 1.3M 20-mer signatures (no background check)• 1.2M 0-unique 20-mer signatures• 580K 1-unique 20-mer signatures

37

Conclusions

• Precompute of human k-unique 20-mers is now feasible!• Faster for large edit-distance!• Need spaced seed-set designs

• Constant time update for spaced seeds• Good integer programming formulation of SS-

SDP• Limited template enumeration based on update cost• Work with integer programming experts to solve

effectively

38

Next Steps

• Publish!

• Adapt for Tm and/or hybridization model

• Convert to native BOINC-application

• Integrate with primer-design software

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Documents

Transcript of Precomputing Edit-Distance Specificity of Short Oligonucleotides