PatternHunter: A Fast and Highly Sensitive Homology Search Method
Bin Ma
Department of Computer Science
University of Western Ontario
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
|| |||||  | |||  ||||   ||    |||||||||||||||||| | |||||||| |  | |||||
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
|| ||||||||| ||||||  | |||||   |||||||| ||| ||||||||  |  | | ||
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
                  ||||||||||||| ||| ||||||||||| || ||||||| || ||||   |
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
||||||| |||| | | |||| ||||| || |||||  ||        |||||| |||||||||||||||
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
A homology between mouse and human genomes
Smith-Waterman is the most accurate method. Time complexity: O(mn), where m and n are the lengths of the two sequences.
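To make the O(mn) cost concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The function name and the scoring parameters (`match=1`, `mismatch=-1`, linear `gap=-1`) are illustrative assumptions, not values from the slides. Every cell of an (m+1) x (n+1) table is filled once, which is where the O(mn) comes from.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b.

    Scoring values are illustrative assumptions; gaps use a linear penalty.
    Only two rows of the table are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
embedded = smith_waterman_score("XXACGTXX", "YYACGTYY")
```

Running this on two whole genomes is what seed-based methods such as BLAST and PatternHunter try to avoid.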
BLAST finds a “hit” and then extends
Seed match = hit
Example of missing a target
Fail (the longest run of matches is 9, shorter than an 11-base contiguous seed):
GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
Dilemma
Sensitivity (the success rate of finding a homology) needs shorter seeds.
Speed needs longer seeds. Mega-BLAST uses seeds of length 28.
PatternHunter uses “spaced seeds”
111010010100110111 (called a model)
Eleven required matches (weight = 11)
Seven “don’t care” positions
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT…
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT…
       111010010100110111
Hit = all the required matches are satisfied.
BLAST seed model = 11111111111
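The hit definition can be checked mechanically. The sketch below (function names are illustrative, not from PatternHunter) turns the two aligned sequences from this slide into a 0/1 match string and slides a seed model along it. A position is a hit when every `1` in the model sits on a match.

```python
def match_string(a, b):
    """'1' where the aligned characters agree, '0' elsewhere."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_hits(matches, model):
    """0-based start positions where all required ('1') positions match."""
    required = [k for k, c in enumerate(model) if c == "1"]
    last_start = len(matches) - len(model)
    return [
        i
        for i in range(last_start + 1)
        if all(matches[i + k] == "1" for k in required)
    ]


top = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
bottom = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"
matches = match_string(top, bottom)

spaced = seed_hits(matches, "111010010100110111")  # PatternHunter model
contiguous = seed_hits(matches, "1" * 11)          # BLAST-style model
```

On this alignment the spaced model hits once, at offset 7, while the contiguous weight-11 model finds nothing because the longest run of matches is only 9.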
Observations re. spaced seeds
Seed models with different shapes can detect different homologies.
Two consequences:
• Some models may detect more homologies than others: more sensitive homology search (PatternHunter I).
• Several seed models can be used simultaneously to hit more homologies: approaching 100% sensitive homology search (PatternHunter II).
Spaced Seed – PatternHunter I:
Weight of a seed
Lemma: The expected number of hits of a weight-W, length-M seed model within a length-L region with similarity p is $(L-M+1)p^W$.
Proof: There are $(L-M+1)$ positions where a hit can occur. At each position, $p^W$ hits are expected. Q.E.D.
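The lemma is easy to sanity-check on a small case. The sketch below (illustrative names; it assumes each position of the region matches independently with probability p) compares the closed form with an exact enumeration over all 0/1 match strings of a short region.

```python
from itertools import product


def count_hits(region, model):
    """Number of start positions where the model's '1's all fall on matches."""
    required = [k for k, c in enumerate(model) if c == "1"]
    return sum(
        all(region[i + k] == "1" for k in required)
        for i in range(len(region) - len(model) + 1)
    )


def expected_hits_formula(L, model, p):
    """(L - M + 1) * p^W from the lemma."""
    M, W = len(model), model.count("1")
    return (L - M + 1) * p ** W


def expected_hits_enumerated(L, model, p):
    """Exact expectation by summing over all 2^L match strings (small L only)."""
    total = 0.0
    for bits in product("01", repeat=L):
        region = "".join(bits)
        ones = region.count("1")
        total += p ** ones * (1 - p) ** (L - ones) * count_hits(region, model)
    return total


closed_form = expected_hits_formula(10, "1101", 0.7)
enumerated = expected_hits_enumerated(10, "1101", 0.7)
```

The two values agree for any model shape of the same weight and length, which is why weight alone determines the expected number of hits but not the sensitivity.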
Seed models with the same weight generate approximately the same number of hits.
Speed is approximately the same.
Sensitivity is not necessarily the same.
Number of hits vs. number of regions that contain hits.
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT
       111010010100110111
Simulated sensitivity curves
Why are spaced seeds better?
TTGACCTCACC?
|||||||||||?
TTGACCTCACC?
11111111111
 11111111111

CAA?A??A?C??TA?TGG?
|||?|??|?|??||?|||?
CAA?A??A?C??TA?TGG?
111010010100110111
 111010010100110111
• BLAST’s seed usually uses more than one hit to detect one homology (redundant)
• Spaced seeds use fewer hits to detect one homology (efficient)
PH’s seed does not overlap heavily
PH’s seed does not overlap heavily with itself when shifted:
111010010100110111
 111010010100110111
  111010010100110111
   111010010100110111
    111010010100110111
     111010010100110111
      111010010100110111
......
The hits at different positions are independent.
The probability of having the second hit is $5p^6 + \dots$, compared to BLAST’s model: $p + p^2 + p^3 + p^4 + \dots$
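The overlap argument can be inspected with a few lines of code. This sketch (the helper name is an assumption) counts, for a given shift, how many required positions of the shifted model are not already required by the unshifted one. Under the simplification that a second hit at that shift needs exactly those extra matches, its probability is p raised to that count.

```python
def extra_matches(model, shift):
    """Required positions of the shifted model not covered by the original."""
    ones = {i for i, c in enumerate(model) if c == "1"}
    shifted = {i + shift for i in ones}
    return len(shifted - ones)


PH_MODEL = "111010010100110111"
BLAST_MODEL = "1" * 11

# Extra matches needed for a second hit one position to the right.
ph_shift1 = extra_matches(PH_MODEL, 1)
blast_shift1 = extra_matches(BLAST_MODEL, 1)

# Per-shift counts, for comparing the two series term by term.
ph_series = [extra_matches(PH_MODEL, k) for k in range(1, 6)]
blast_series = [extra_matches(BLAST_MODEL, k) for k in range(1, 6)]
```

For the contiguous model a shift of k needs only k new matches (the terms p, p^2, p^3, ...), whereas the spaced model needs six new matches even for a shift of one.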
Indeed, given at least one hit in a length-64 homology with 70% similarity, the average number of hits in that region is:
• 2.0 for PH’s weight-11 seed
• 3.6 for the contiguous weight-11 seed
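These averages can be related to the lemma on expected hits. The following is a sketch, where $H$ (the number of hits in the region) is notation introduced here and positions are assumed to match independently with probability $p$. Because $H \ge 1$ exactly when the region is hit, the conditional mean is the unconditional mean divided by the hit probability:

```latex
\[
E[H \mid H \ge 1] \;=\; \frac{E[H]}{\Pr(H \ge 1)}
                  \;=\; \frac{(L-M+1)\,p^{W}}{\Pr(\text{region is hit})}
\]
\[
L = 64,\; p = 0.7,\; W = 11:\qquad
E[H] = 47 \cdot 0.7^{11} \approx 0.93 \;\; (M = 18), \qquad
E[H] = 54 \cdot 0.7^{11} \approx 1.07 \;\; (M = 11)
\]
```

Read this way, the averages 2.0 and 3.6 correspond to hit probabilities of roughly 0.46 for the spaced seed and 0.30 for the contiguous seed: a similar expected number of hits is spread over more regions by the spaced seed.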
A dynamic programming algorithm to compute sensitivity
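R[1..n]: random homology, Pr(R[i]=1) = p. We want Pr(R is hit by a seed model x).
DP[i,s] denotes Pr(R[1..i] is hit | R[1..i] ends with s).
$$
DP[i,s] =
\begin{cases}
1 & \text{if } |s|=|x| \text{ and } s \text{ is hit} \\
DP[i-1,\, s[1..|s|-1]] & \text{if } |s|=|x| \text{ and } s \text{ is not hit} \\
p \cdot DP[i,(1s)] + (1-p) \cdot DP[i,(0s)] & \text{otherwise}
\end{cases}
$$
Running time: $O(n \cdot 2^{|x|})$. A better algorithm exists.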
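The recurrence translates almost directly into a memoized function. The sketch below is one possible reading, not the PatternHunter implementation. The base case `i < |x|` (a region shorter than the seed cannot be hit) is an assumption added here, since the slide does not state it, and `brute_force` exists only to cross-check small cases.

```python
from functools import lru_cache
from itertools import product


def window_is_hit(window, seed):
    """True if every required ('1') position of the seed falls on a match."""
    return all(w == "1" for w, x in zip(window, seed) if x == "1")


def hit_probability(seed, n, p):
    """Pr(a random 0/1 region of length n is hit by seed), via the DP."""
    m = len(seed)

    @lru_cache(maxsize=None)
    def dp(i, s):
        # Assumed base case: a prefix shorter than the seed cannot be hit.
        if i < m:
            return 0.0
        if len(s) == m:
            if window_is_hit(s, seed):
                return 1.0
            # No hit ends at i, so look at R[1..i-1], which ends with s minus
            # its last character.
            return dp(i - 1, s[:-1])
        # Condition on the character just before the known suffix.
        return p * dp(i, "1" + s) + (1 - p) * dp(i, "0" + s)

    return dp(n, "")


def brute_force(seed, n, p):
    """Same probability by enumerating all 2^n regions (small n only)."""
    m = len(seed)
    total = 0.0
    for bits in product("01", repeat=n):
        r = "".join(bits)
        if any(window_is_hit(r[j:j + m], seed) for j in range(n - m + 1)):
            ones = r.count("1")
            total += p ** ones * (1 - p) ** (n - ones)
    return total


dp_value = hit_probability("1101", 10, 0.7)
check = brute_force("1101", 10, 0.7)
```

The memo table holds one entry per (i, suffix) pair, which matches the $O(n \cdot 2^{|x|})$ bound. For an 18-long seed that is large for pure Python, so this form is best suited to short seeds or to checking a faster implementation.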
PatternHunter I performance
| Sequences | Blastn | MB28 | PH |
|---|---|---|---|
| E. coli (4.7M) vs. H. inf (1.8M) | 716s / 158M | 5s / 561M | 34s / 78M |
| Arabidopsis chr2 (19.6M) vs. chr4 (17.5M) | -- | 21720s / 1087M | 5020s / 279M |
| Human chr21 (26.2M) vs. chr22 (35M) | -- | -- | 14512s / 419M |

Entries are running time / memory. All runs used a 700 MHz Pentium III PC with 1 GB of memory.
Human (3G) vs. Mouse (3G)*: Using a 2-hit, weight-12 seed, PH took 6 days on a 1 GHz Pentium III PC with 2 GB of memory. With BLAST, it would otherwise take months with parallel computers to finish.
Multiple Seeds – PatternHunter II:
PatternHunter II: Optimized multiple seeds
Basic Searching Algorithm
1. Select a group of spaced seed models.
2. For each hit of each model, conduct extension to find a homology.
Selecting the optimal set of multiple seeds is NP-hard.
Seed Selection Algorithm
1. Let A be an empty set.
2. Let s be the seed such that A ∪ {s} has the highest hit probability.
3. A = A ∪ {s}; if |A| < K go to 2.
Approximation ratio: 1 − 1/e.
Computing the hit probability of multiple seeds is NP-hard.
An efficient algorithm exists when the number of zeros is limited.
PTAS to compute the probability approximately.
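The three greedy steps can be sketched as follows. This is an illustration under stated assumptions, not PatternHunter II's code: the candidate list is made up, and `hit_probability` enumerates all 0/1 regions exactly, which is only practical for very short regions (the slides point to a PTAS and special-case algorithms for realistic sizes).

```python
from itertools import product


def region_is_hit(region, model):
    """True if the model hits the 0/1 region at some start position."""
    required = [k for k, c in enumerate(model) if c == "1"]
    return any(
        all(region[i + k] == "1" for k in required)
        for i in range(len(region) - len(model) + 1)
    )


def hit_probability(models, n, p):
    """Exact Pr(some model hits a random length-n region); small n only."""
    total = 0.0
    for bits in product("01", repeat=n):
        region = "".join(bits)
        if any(region_is_hit(region, m) for m in models):
            ones = region.count("1")
            total += p ** ones * (1 - p) ** (n - ones)
    return total


def greedy_select(candidates, k, n, p):
    """Steps 1-3: repeatedly add the seed that maximizes the joint hit probability."""
    chosen = []                                   # step 1: A is empty
    while len(chosen) < k:                        # step 3: loop until |A| = K
        remaining = [c for c in candidates if c not in chosen]
        best = max(                               # step 2: best seed to add
            remaining,
            key=lambda c: hit_probability(chosen + [c], n, p),
        )
        chosen.append(best)
    return chosen


# Hypothetical weight-4 candidates on a length-12 region with 70% similarity.
candidates = ["1111", "11011", "110101", "1100011"]
chosen = greedy_select(candidates, 2, 12, 0.7)
```

Each round re-evaluates the joint hit probability rather than ranking seeds individually, because a seed that is strong alone may add little to what the already chosen seeds detect.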
PTAS to compute the probability approximately.
Randomly generate m homologies independently. Suppose n of them are hit by our seeds. Let p be the sensitivity of our seeds.
If $m \ge \frac{3 \log K}{\epsilon^2}$, then with probability $1 - 2/K$, $\left| \frac{n}{m} - p \right| \le \epsilon$.
Can be proved by Chernoff’s bounds.
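A sampling estimator along these lines might look like the sketch below. It assumes the sample-size condition reads m ≥ 3 log K / ε² with a natural logarithm, and that homologies are modeled as independent 0/1 positions with match probability `p_match`. All function and parameter names are illustrative.

```python
import math
import random


def sample_size(eps, K):
    """Smallest integer m with m >= 3 ln(K) / eps^2 (natural log assumed)."""
    return math.ceil(3 * math.log(K) / eps ** 2)


def region_is_hit(region, model):
    """True if the model hits the 0/1 region at some start position."""
    required = [k for k, c in enumerate(model) if c == "1"]
    return any(
        all(region[i + k] == "1" for k in required)
        for i in range(len(region) - len(model) + 1)
    )


def estimate_sensitivity(models, length, p_match, eps, K, rng=None):
    """Estimate Pr(some model hits) as n/m over m random homologies."""
    rng = rng or random.Random()
    m = sample_size(eps, K)
    n_hit = 0
    for _ in range(m):
        region = "".join(
            "1" if rng.random() < p_match else "0" for _ in range(length)
        )
        if any(region_is_hit(region, model) for model in models):
            n_hit += 1
    return n_hit / m


# Hypothetical run: two short seeds, length-20 regions, 70% similarity.
estimate = estimate_sensitivity(
    ["1101", "10111"], 20, 0.7, eps=0.05, K=1000, rng=random.Random(0)
)
```

The number of samples depends only on ε and K, not on the number or shape of the seeds, which is what makes the approach usable inside a seed-selection loop.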
The seeds obtained under a simple homology distribution (homology identity = 0.7, homology length = 64):
111011001011010111,
1111000100010011010111,
1100110100101000110111,
1110100011110010001101,
…………
Simulated sensitivity curves:
Solid curves: Multiple (1, 2, 4, 8, 16) weight-12 spaced seeds.
Dashed curves: Optimal spaced seeds with weight = 11, 10, 9, 8.
Typically, “doubling the seed number” gains better sensitivity than “decreasing the weight by 1”.
Curves labelled in the figure: one weight-12 seed, two weight-12 seeds, one weight-11 seed.
Coding region seeds
The first two bases of a codon are more conserved than the third base.
Coding region matches have patterns like 110110……
The seeds trained under a coding region homology distribution are called the coding region seeds.
PHII’s default seeds were trained under a simple distribution (0.8, 0.8, 0.5).
Experiments on real data
About 30k mouse ESTs (25Mb) and 4k human ESTs (3Mb) were downloaded from NCBI GenBank. “Low complexity” regions were filtered out.
SSearch (Smith-Waterman method) finds “all” pairs of ESTs with significant local alignments.
Check what percentage of those pairs can be “found” by BLAST and by different configurations of PatternHunter.
Sensitivity curves:
Recent development
Can 100% sensitivity be achieved with reasonable speed?
Yes. When similarity is >= 80%, 100% sensitivity can be achieved with approximately 40 weight-9 seeds.
Open questions:
Can the hit probability of one seed (or a constant number of seeds) be computed in polynomial time?
Current: Polynomial-time algorithms exist when the number of 0s in one seed is O(log n). PTAS.
Can the optimal seed (or set of seeds) be found in polynomial time?
For general distributions of the homologies, these are NP-hard.
How are the hits found efficiently?
Put all the seeds of the database in a lookup table.
For each seed in the query, find all the occurrences of the seed in the database by looking at the lookup table.
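The lookup-table idea can be sketched with a dictionary keyed by the letters at a model's required positions. This is a minimal illustration with assumed names, using a plain Python dict rather than whatever table layout PatternHunter actually uses. The two sequences are the aligned pair from the spaced-seed slide, treated here as a tiny "database" and "query".

```python
from collections import defaultdict


def spaced_key(seq, pos, model):
    """Letters of seq at the model's required ('1') positions, from pos."""
    return "".join(seq[pos + k] for k, c in enumerate(model) if c == "1")


def build_index(database, model):
    """Map each spaced key to the database positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(database) - len(model) + 1):
        index[spaced_key(database, pos, model)].append(pos)
    return index


def find_hits(query, index, model):
    """All (query_pos, database_pos) pairs whose spaced keys are equal."""
    hits = []
    for qpos in range(len(query) - len(model) + 1):
        # .get avoids creating empty entries in the defaultdict.
        for dpos in index.get(spaced_key(query, qpos, model), []):
            hits.append((qpos, dpos))
    return hits


MODEL = "111010010100110111"
database = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"
query = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"

index = build_index(database, MODEL)
hits = find_hits(query, index, MODEL)
```

Building the table takes one pass over the database and each query position then costs one lookup, so the work per hit is independent of database size. Each reported pair would next be handed to the extension step.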