BLAST: Basic Local Alignment Search Tool Altschul et al. J ...
Transcript of BLAST: Basic Local Alignment Search Tool Altschul et al. J ...
![Page 1: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/1.jpg)
BLAST:Basic Local Alignment Search Tool
Altschul et al. J. Mol Bio. 1990.
![Page 2: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/2.jpg)
Hashing
A hash function maps a key to a value
![Page 3: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/3.jpg)
• Hash table is a data structure: a way to store key-value pairs, and a way to retrieve them
• Based on the idea of a hash function. This maps a key or an object (e.g., a string, or a more complex record) to an integer, the “address”
• The value of the key is then stored at that address in memory
Hash table
![Page 4: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/4.jpg)
• Key: (AAACGTAT, 1234321) • i.e., a 8 bp-string and its location in genome
• We want to store many such strings and their locations
• and later retrieve all locations of a particular string really quickly
• Hash function h(AAACGTAT) = 435
Hashing: an example
Key=String
Value = Address of where Location(String) is stored
![Page 5: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/5.jpg)
• Let’s assume that there are 48 = 64K memory locations available.
• The first time we see (AAACGTAT, *), we store it at address h(AAACGTAT) = 435.
• The next time we see (AAACGTAT, *), we compute h(AAACGTAT), go to 435, find it already occupied. A collision!
Hashing: an example
![Page 6: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/6.jpg)
• Buckets: Address 435 can store multiple keys/objects (e.g., as a linked list)
• Linear probing: If an address is occupied, store the key/object in next available location
• Multiple hashing: have an army of hash functions. If the first one (“h”) led to a collision, try another hash function (“h2”)
How to handle collisions
![Page 7: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/7.jpg)
Bucketing and Chaining
![Page 8: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/8.jpg)
Open addressing and linear probing
![Page 9: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/9.jpg)
Preprocessing and hash
Preprocessing: store exact matches of all short patterns on the textby a hash table
ATC
AAA
TTC
{1,6,100,2000,5454, …, }
{15,21,30,785,3434, …, }
{5,164,220,502,943, …, }
h
haddress1
haddress2
address3
retrieve
retrieve
retrieve
![Page 10: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/10.jpg)
• Given two sequences of same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues
• Maximal segment pair (MSP): Highest scoring pair of identical length segments from the two sequences being compared (“query” and “subject”)
• The similarity score of an MSP is called the MSP score
• BLAST heuristically aims to find them
BLAST: finding maximal segment pairs
![Page 11: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/11.jpg)
Maximal segment pairs and High scoring pairs
• Goal: report database sequences that have MSP score above some threshold S.
• Thus, sequences with at least one locally maximal segment pair that scores above S.
![Page 12: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/12.jpg)
• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair
• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction
• BLAST attempts to find all locally maximal segment pairs above some score cutoff.
High scoring pairs (or local maximal segment pairs)
![Page 13: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/13.jpg)
A quick way to find MSPs
SeedExtend Extend
• Homologous sequences tend to have very similar or even identical substrings, also called seeds.
• From a seed, it is possible to construct a local HSP/MSP by extending to flanking regions.
![Page 14: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/14.jpg)
Efficient algorithm?
![Page 15: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/15.jpg)
1. Break query sequence into words
![Page 16: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/16.jpg)
• Find exact matches to query words • Can be done in efficiently
• Hashing • Alternatively AC finite state machine
ATC
AAA
TTC
{1,6,100,2000,5454, …, }
{15,21,30,785,3434, …, }
{5,164,220,502,943, …, }
h
haddress1
haddress2
address3
retrieve
retrieve
retrieve
2. Find database hits
![Page 17: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/17.jpg)
2. Find database hits
![Page 18: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/18.jpg)
3. Extend hits
![Page 19: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/19.jpg)
How to handle possible mismatches in words?
Neighbor words
![Page 20: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/20.jpg)
How to handle possible mismatches in words?
![Page 21: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/21.jpg)
How to handle possible mismatches in words?
![Page 22: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/22.jpg)
How to handle possible mismatches in words?
![Page 23: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/23.jpg)
Parameters
• Word length: 3 for protein, 11 for DNA/RNA • Thresholds T and S:
• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.
• Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score >= T
• Intuition: If the sequence pair has to score above S, its most well matched word (of some predetermined small length) must score above T
• Lower T => Fewer false negatives • Lower T => More pairs to analyze
![Page 24: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/24.jpg)
• BLAST may not find all segment pairs above threshold S
• Bounds on the error: not hard bounds, but statistical bounds• “Highly likely” to find the MSP
Choosing threshold S
![Page 25: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/25.jpg)
• Is the score high enough to provide evidence of homology?
• Are the scores of alignments of random sequences higher than this score?
• What are is the expected number of alignments between random sequences with score greater than this score?
Choosing threshold S
• BLAST may not find all segment pairs above threshold S
• Bounds on the error: not hard bounds, but statistical bounds• “Highly likely” to find the MSP
![Page 26: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/26.jpg)
• BLAST may not find all segment pairs above threshold S
• Bounds on the error: not hard bounds, but statistical bounds• “Highly likely” to find the MSP
Choosing threshold S
• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)
• Suppose this observed MSP with a score S. • What are the chances that the MSP score for two
unrelated sequences would be >= S? • If the chances are very low, then we can be confident that
the two sequences must not have been unrelated
![Page 27: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/27.jpg)
Statistics: Question
• Given two random sequences of lengths m and n• What is the probability that they will produce an MSP score
of >= S ?
![Page 28: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/28.jpg)
Statistics: intuition
Given a binary 0/1 sequence and a query string of k consecutive ones • Probability in a sequence of length k: 1/2k • Probability in a sequence of length k+1?
• 1 - (1 - 1/2k)2 • How about the probability in a sequence of length k+n?
• 1 - (1 - 1/2k)n+1 • The longer the sequence, the more likely you are going to
get k ones by chance!
![Page 29: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/29.jpg)
The probability will depend on: • How long is are the sequences (the longer the
easier to get a local score above threshold by chance)
• Scoring matrix • Distribution of amino acids in each sequence
Statistics: more intuition
![Page 30: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/30.jpg)
Statistics: Intuition
![Page 31: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/31.jpg)
Approach
![Page 32: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/32.jpg)
How to compute the probability?
![Page 33: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/33.jpg)
Simulation1. Generate many random sequence pairs
2. Compute the distribution of the SCOREs
score
frequency
SCORE
![Page 34: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/34.jpg)
How to compute the p-value (probability)?
![Page 35: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/35.jpg)
Statistical test
Simulation
p-value = 0.001p-value = 0.45
![Page 36: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/36.jpg)
Is this efficient enough?
![Page 37: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/37.jpg)
Another observation
![Page 38: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/38.jpg)
Extreme value distribution
![Page 39: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/39.jpg)
Extreme value distribution
z
z
![Page 40: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/40.jpg)
Compute a p-value
![Page 41: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/41.jpg)
Parameters
z
z
![Page 42: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/42.jpg)
Statistical test
EVD
p-value = 0.001p-value = 0.45
![Page 43: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/43.jpg)
Significance: P-value and E-value
![Page 44: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/44.jpg)
Parameters
z
z
![Page 45: BLAST: Basic Local Alignment Search Tool Altschul et al. J ...](https://reader033.fdocuments.net/reader033/viewer/2022051600/6280182d9d281246346b4386/html5/thumbnails/45.jpg)
E-value
Approximation:
if x is very small, then 1-exp(-x) can be approximated by x
Therefore,
P(Z>=x)
So E-value = DatabaseLength * p-value
where N is the database size (not the aligned length n)