Precomputing Edit-Distance Specificity of Short Oligonucleotides

42
Precomputing Edit- Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

description

Precomputing Edit-Distance Specificity of Short Oligonucleotides. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. Polymerase Chain Reaction. Polymerase Chain Reaction. Primer Specificity. - PowerPoint PPT Presentation

Transcript of Precomputing Edit-Distance Specificity of Short Oligonucleotides

Page 1: Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance

Specificity of Short

Oligonucleotides

Precomputing Edit-Distance

Specificity of Short

OligonucleotidesNathan EdwardsCenter for Bioinformatics and Computational BiologyUniversity of Maryland, College Park

Page 2: Precomputing Edit-Distance Specificity of Short Oligonucleotides

2

Polymerase Chain Reaction

Page 3: Precomputing Edit-Distance Specificity of Short Oligonucleotides

3

Polymerase Chain Reaction

Page 4: Precomputing Edit-Distance Specificity of Short Oligonucleotides

4

Primer Specificity

• Need to ensure that primers hybridize to a specific (specified) locus only• Require exactly one occurrence of specified

sequence• Require no (potential) mis-hybridization loci

• Bottleneck computation in primer-design• Design / check iteration is problematic

Page 5: Precomputing Edit-Distance Specificity of Short Oligonucleotides

5

k-unique 20-mers

• Edit-distance as a surrogate for mis-hybridization potential

• k-unique loci:• All non-self genomic loci are require more

than k edits in (global) alignment• Closest non-self genomic loci requires

(k+1) edits in (global) alignment

Page 6: Precomputing Edit-Distance Specificity of Short Oligonucleotides

6

Find all k-unique 20-mers

• Naïve algorithm: O(n2km)• Quadratic in size of genome.

• 0-unique (exact match) 20-mers• (Expected) linear time algorithm

• Achieve expected linear time using a hybrid approach (blastn):• Use partial exact match to “seed” expensive

dynamic programming alignment• Large chunks ) Fast, but miss occurrences• Small chunks ) Slow, but correct

Page 7: Precomputing Edit-Distance Specificity of Short Oligonucleotides

7

Baeza-Yates Perleberg: • Correct and O(n) for small k

• At least 1 chunk is observed with no error.• Small k → Large chunks → Fast and correct• Largest correct chunk: floor(m/(k+1))

Inexact sequence match

≠ = ≠q

g

Page 8: Precomputing Edit-Distance Specificity of Short Oligonucleotides

8

Example worst case alignments

TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG||||||*||||||*||||||ACTTGTGCACAGTCCTTAAG

Page 9: Precomputing Edit-Distance Specificity of Short Oligonucleotides

9

Brute-force approach

ACTTGTGCACAGTCCTTAAG

AA:18AC:1,9AG:11,19CA:8,10CC:14CT:2,15

GC:7GT:5,12TA:17TC:13TG:4,6 TT:3,16

2-mer position table

Page 10: Precomputing Edit-Distance Specificity of Short Oligonucleotides

10

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 11: Precomputing Edit-Distance Specificity of Short Oligonucleotides

11

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 12: Precomputing Edit-Distance Specificity of Short Oligonucleotides

12

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 13: Precomputing Edit-Distance Specificity of Short Oligonucleotides

13

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 14: Precomputing Edit-Distance Specificity of Short Oligonucleotides

14

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 15: Precomputing Edit-Distance Specificity of Short Oligonucleotides

15

Brute-force approach

ACTTGTGCACAGTCCTTAAG

ACTTGTGCACAGTCCTTAAG

Page 16: Precomputing Edit-Distance Specificity of Short Oligonucleotides

16

Brute-force approach

Divide the genome into 10 Mb blocksFor all pairs of blocks:

For all l-mer matches:Do all pair-wise DPs containing matchIf ≤ k edits, mark position non-unique

300 x 300 pairs of blocksFor 20-mers:

k=1 ) l=10; k=2 ) l=6; k=3 ) l=5 ; k=4 ) l=4.

Page 17: Precomputing Edit-Distance Specificity of Short Oligonucleotides

17

Brute-force approach

Things are looking really, really, bad:• Seeds are too short• 90,000 pair-wise block comparisons

Actually quite good (seed size 12):• Non-uniqueness certificates are dense• Almost all positions eliminated early• Behaves more like linear time than

quadratic

Page 18: Precomputing Edit-Distance Specificity of Short Oligonucleotides

18

In practice (edit-dist 4)

Page 19: Precomputing Edit-Distance Specificity of Short Oligonucleotides

19

In practice (edit-dist 4)

Page 20: Precomputing Edit-Distance Specificity of Short Oligonucleotides

20

In practice (edit-dist 4)

Page 21: Precomputing Edit-Distance Specificity of Short Oligonucleotides

21

In practice (edit-dist 3)

Page 22: Precomputing Edit-Distance Specificity of Short Oligonucleotides

22

In practice (edit-dist 3)

Page 23: Precomputing Edit-Distance Specificity of Short Oligonucleotides

23

In practice (edit-dist 4,3,2)

Page 24: Precomputing Edit-Distance Specificity of Short Oligonucleotides

24

In practice (edit-dist 4,3,2)

Page 25: Precomputing Edit-Distance Specificity of Short Oligonucleotides

25

Edit distance 2

• After seed size 12• ~ 27K (0.288%) positions have no match

• After seed size 8• ~ 3K (0.029%) positions have no match

• Using seed size 6 is still too slow• Need a more sophisticated hashing strategy• 6-mers match in too many places!

Page 26: Precomputing Edit-Distance Specificity of Short Oligonucleotides

26

Spaced seed-set design problem

• Given:• mer-size: m ( = 20 )• # errors: k ( = 1,2,3)• # cares: l ( = 10,12,14 )

• Find the smallest set of spaced seeds that will find all alignments.

Page 27: Precomputing Edit-Distance Specificity of Short Oligonucleotides

27

Solution for (20,2,8)

• 11111111, 111101111

TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT

• How can we find these spaced seed set solutions?• One/two table? 2 x false positives

Page 28: Precomputing Edit-Distance Specificity of Short Oligonucleotides

28

Spaced seed set design set-cover formulation

• Set cover instance:• Ground set: all possible placements of the

k errors (alignments)• Covering sets: all possible placements of

the l care positions

• For (m=20,k=2,l=10),• 190 elements, 184,756 sets!• Need to reduce the number of sets!

Page 29: Precomputing Edit-Distance Specificity of Short Oligonucleotides

29

Dirty secret of spaced seeds

• Spaced seeds take O(# cares) to update!• Contiguous seeds are O(1) to update

• 101010101010101 vs 11111111• 8 steps to update vs 1 step to update

• Constant time update for spaced seeds?• Yes, if they have a certain structure(s)• Restrict spaced seeds to small update cost

Page 30: Precomputing Edit-Distance Specificity of Short Oligonucleotides

30

O(1) spaced seed update

ACGTACGTACGTACGTACGT1: A G A G2: C T C T ...Spaced seed 1010101 can be updated

in 1 step!

Page 31: Precomputing Edit-Distance Specificity of Short Oligonucleotides

31

O(1) spaced seed update

ACGTACGTACGTACGTACGTACGTACG -> ACGACG CGTACGT -> CGTCGT...Spaced seed 1110111 can be updated

in 1 step!

Page 32: Precomputing Edit-Distance Specificity of Short Oligonucleotides

32

O(1) spaced seed update

• “Period” step update• 11001100110011 2 steps• 1000010000100001 1 step

• “Runs” step update• 11100111111 1 step• 11101110111 2 steps

• Minimize the number of update steps• Weighted set-cover instance…

Page 33: Precomputing Edit-Distance Specificity of Short Oligonucleotides

33

Edit-distance SS-SDP

• Position of matching bases might shift!• Need 11111111 ↓ to get CCGCTAGA

• Need 111101111 ↑ to get CCGCTAGA

• Set cover formulation no longer works

TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT

Page 34: Precomputing Edit-Distance Specificity of Short Oligonucleotides

34

Edit-Distance SS-SDP

• Use a variation on set cover:• q:111101111,r:11111111 covers:

• Pay for query & reference update costs separately

• Control size of problem by only enumerating templates with small update cost

r:TCCCGC-TAGATTGAGATCT ||||||v||||||*||||||q:TCCCGCCTAGATTTAGATCT

Page 35: Precomputing Edit-Distance Specificity of Short Oligonucleotides

35

Correct solutions for 1-unique 20-mers107 random sequence x 107 random sequence

• Seed: 1111111111 (Best single seed solution, weight 10)• ~ 9.5 expensive dynamic programs per locus

• Seed set: 11111111111;111110111111 (weight 11)• ~ 8.9 DP/locus

• Seed set (weight 11)

• ~ 7.8 DP/locus

• Seed set: 111111111111;1111101111111 (weight 12)• ~ 2.2 DP/locus

11111111111 ~ 111101111111111101111111 ~ 11111111111111101111111 ~ 111101111111

Page 36: Precomputing Edit-Distance Specificity of Short Oligonucleotides

36

Correct solutions for 1-unique 20-mers

107 random sequence x 107 random sequence

• Seed set: 111111111111;1111101111111 (weight 12)• ~ 2.5 DP/locus

• Seed set (weight 12)

• ~ 1.8 DP/locus

• Seed set: 1111111111111;11111101111111 (weight 13)• ~ 0.56 DP/locus (same specificity as contiguous seed weight 12)

• Seed set (weight 14)• ~ 0.20

1111110111111 x 111111111111 1111110111111

111111001111111

11111111111111 ~ 11111111111111 111110111111111 ~ 11111111111111 111111111110111 ~ 11111111111111 111111111110111 ~ 111111111110111 11111111111111 ~ 111111101111111111110111111111 ~ 111110111111111

Page 37: Precomputing Edit-Distance Specificity of Short Oligonucleotides

Correct solutions for 2-unique 20-mers

• Seed: 111111 (Best single seed solution, weight 6)• ~ 2439 DP/locus

• Weight 10 – 73 DP/locus (specificity of 8/9 contig seed)

• Weight 12 – 10 DP/locus (specificity of 10 contig seed)

37

1111111111 ~ 1111111111 11111011111 ~ 1111111111 11111011111 ~ 11111011111 1111100000011111 ~ 1111100000011111 11111000000011111 ~ 1111100000011111 111110000000011111 ~ 1111100000011111 111110000011111 ~ 1111100000011111 111110000000011111 ~ 11111000000000011111

Page 38: Precomputing Edit-Distance Specificity of Short Oligonucleotides

38

k-unique human 20-mers

• No 4-unique 20-mers• No 3-unique 20-mers

• 0. 038% of (forward) human 20-mers are 2-unique• 1088322 in total• about 1 every 2638 bases• Fast 2-uniquness oracle

Page 40: Precomputing Edit-Distance Specificity of Short Oligonucleotides

40

Integration with High Performance Computing

• Break sequence into chunks of size 107.• Remember which positions have been eliminated.

• Integrated with (UMIACS) Condor• Too unreliable for very large sequences• NFS filesystem is unreliable• Simultaneous jobs capped at ~ 300

• Integrated with hadoop/map-reduce on 80 nodes (Edwards Lab)• Reliability improved, DFS solves (some) filesystem

woes• Much better scalability (in theory, yet to be tested)• Explicit synchronization of reduce step is undesirable.

Page 41: Precomputing Edit-Distance Specificity of Short Oligonucleotides

41

Other improvements

• Other groupings are possible:• Species designation on FASTA defline can be

any suitable partition• Constraints on the position of edits:

• False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies

• Constraint on valid Tm range:• Computed as in Primer3• Can eliminate undesirable mers early

Page 42: Precomputing Edit-Distance Specificity of Short Oligonucleotides

42

Conclusions

• Precompute of human k-unique 20-mers is now feasible!• Faster for large edit-distance!• Need spaced seed-set designs

• Constant time update for spaced seeds• Good integer programming formulation of SS-

SDP• Limited template enumeration based on update cost• Work with integer programming experts to solve

effectively