Basic Local Alignment Search Tool Presented by Mei Liu August 7, 2008.
Introduction
BLAST
  Finds regions of local similarity between sequences
  Assesses which DNA or protein sequences in a large database have significant similarity with a given query sequence
  Helps infer functional and evolutionary relationships between sequences
  Helps identify members of gene families
Two implementations of BLAST: one by NCBI and the other at Washington University (WU-BLAST)
Introduction
WU-BLAST printouts give the following values:
  Score or High Score
  Bit scores
  Expect values
  P-values
Outline
Comparison of two aligned sequences
  BLAST random walk
  Parameter calculations
  Choice of score
  Bounds and approximation for BLAST P-value
  Normalized and bit scores
  Number of high-scoring excursions
  Karlin-Altschul sum statistic
Outline
Comparison of two unaligned sequences
Comparison of a query sequence against a database
Minimum significance lengths
Parametric or non-parametric test?
Gapped BLAST and PSI BLAST
1. Two Aligned Sequences
Given an ungapped global alignment of two protein sequences, both of length N
Null hypothesis: for each aligned pair of amino acids, the two amino acids are generated by independent mechanisms
Null hypothesis probability of the amino acid pair (j, k):
$$p_j \, p'_k$$ (10.1)
Alternative hypothesis probability of the amino acid pair (j, k):
$$q(j, k)$$ (10.2)
1.1 BLAST Random Walk
Number the positions from left to right as 1, 2, …, N
A score S(j, k) is allocated to each aligned amino acid pair (j, k)
In applications of BLAST, the score is taken from a BLOSUM or PAM substitution matrix
1.1 BLAST Random Walk: PAM
Developed by Margaret Dayhoff in the 1970s
Calculated by observing the differences in closely related proteins
PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed
Derived matrices go as high as PAM250
Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance
Does not work very well for aligning evolutionarily divergent sequences
1.1 BLAST Random Walk: BLOSUM
Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins
Probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments
To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered
For BLOSUM62, this threshold was set at 62%
Larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity
1.1 BLAST Random Walk
Accumulated score at position i is calculated as the sum of the scores for the amino acid comparisons at positions 1, 2, …, i
Sequence 1: T Q L A A W C R M T C F E I E C K V
Sequence 2: R H L D S W R R A S D D A R I E E G
S(j, k): -1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4, etc.
Accumulated Score: -1, 0, 5, 3, 4, 19, 15, 22, etc.
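The accumulated score above is a running sum, and the excursion heights used later are measured from the most recent ladder point (taken here to be a new overall minimum of the walk). A minimal sketch using only the eleven scores listed on the slide; the function name `excursion_heights` is illustrative, not part of BLAST:

```python
from itertools import accumulate

def excursion_heights(scores):
    """Return the accumulated scores and the maximum height reached
    above each ladder point before the next ladder point is hit."""
    accumulated = list(accumulate(scores))
    ladder = 0      # height of the current ladder point (walk starts at 0)
    peak = 0        # highest point reached since that ladder point
    heights = []
    for total in accumulated:
        if total < ladder:              # new minimum: a new ladder point
            heights.append(peak - ladder)
            ladder = peak = total
        else:
            peak = max(peak, total)
    heights.append(peak - ladder)       # excursion still open at the end
    return accumulated, heights

# Scores S(j, k) listed on the slide (the "etc." part is not available)
scores = [-1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4]
accumulated, heights = excursion_heights(scores)
y_max = max(heights)
print(accumulated)
print(heights, y_max)
```

For these scores the only ladder point after the start is at height -1, so the largest excursion height is measured from -1.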
1.1 BLAST Random Walk
A ladder point is a point at which the walk reaches a new minimum, lower than any height reached before
Let Y1, Y2, … be the respective maximum heights of the walk relative to the height of any ladder point, after leaving this ladder point and before arriving at the next
Define Ymax as the maximum of these maxima
Ymax is the test statistic used in BLAST, so it is necessary to find its null hypothesis distribution
Random variables Yi have a geometric-like distribution:
$$P(Y \ge y) \sim C e^{-\lambda y}$$
C and $\lambda$ depend on the substitution matrix used and the amino acid frequencies $\{p_j\}$ and $\{p'_k\}$
Probability distribution of Ymax, apart from C and $\lambda$, also depends on the mean number of ladder points in the walk
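1.2 Parameter Calculations
Step size is identified with a score S(j,k)
Null hypothesis probability of taking a step of any size is found from the two sets of frequencies $\{p_j\}$ and $\{p'_k\}$
When the null hypothesis is true, $\lambda$ can be calculated as the positive solution of
$$\sum_{j,k} p_j \, p'_k \, e^{\lambda S(j,k)} = 1$$ (10.3)
and C can be calculated from
$$C = \frac{\bar{Q}\left(1 - \sum_{j=1}^{c} R_{-j}\, e^{-j\lambda}\right)}{\left(1 - e^{-\lambda}\right)\sum_{k=1}^{d} k\, Q_k\, e^{k\lambda}}$$ (7.61)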
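1.2 Parameter Calculations
Ymax depends on C, $\lambda$, and the mean number of ladder points in the BLAST walk
Mean number of ladder points in turn depends on the mean distance A between ladder points:
$$A = \frac{\sum_{j=1}^{c} j\, R_{-j}}{-\sum_{j=-c}^{d} j\, P_j}$$ (7.41)
where $P_j$ is the null hypothesis probability of a step of size j
Calculation of A depends on the calculation of the $R_{-j}$
Two alternative approaches to this calculation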
1.2 Parameter Calculations
Decomposition of paths
Ex. A walk with 2 possible steps, +1 and -2, with respective probabilities p and q = 1 - p
Any ladder point reached in the walk is at a distance 1 or 2 below the previous one
Respective probabilities of the two cases are $R_{-1}$ and $R_{-2} = 1 - R_{-1}$
Probability that -2 is a ladder point is the sum of:
  Probability that the walk goes to -2 immediately (directly: 0 → -2), and
  Probability that it first goes to +1, later returns to 0, and then reaches -2 (+1 → 0 → -2)
1.2 Parameter Calculations
$$R_{-2} = q + p\,(1 - R_{-2})\,R_{-2}$$ (10.4)
$$R_{-2} = \frac{-q + \sqrt{q^2 + 4pq}}{2p}, \qquad R_{-1} = 1 - R_{-2}$$ (10.5)
Then the value of A follows from Eq. (7.41):
$$A = \frac{\sum_{j=1}^{c} j\, R_{-j}}{-\sum_{j=-c}^{d} j\, P_j}$$ (7.41)
Since the two sequences compared are each of length N, and the mean distance between ladder points is A, the mean number of ladder points is N/A
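For the two-step walk (+1 with probability p, -2 with probability q = 1 - p), the ladder-point probabilities come from the positive root of $R_{-2} = q + p(1 - R_{-2})R_{-2}$, and A is the mean drop between ladder points divided by the magnitude of the mean step. A small numerical sketch; the function name and the value p = 0.3 are illustrative assumptions:

```python
import math

def ladder_parameters(p):
    """Return (R_-1, R_-2, A) for a walk with steps +1 (prob p) and -2 (prob 1 - p)."""
    q = 1.0 - p
    mean_step = p - 2.0 * q
    if mean_step >= 0:
        raise ValueError("the walk needs a negative mean step (p < 2/3)")
    # Positive root of p*R^2 + q*R - q = 0, i.e. R = q + p*(1 - R)*R
    r2 = (-q + math.sqrt(q * q + 4.0 * p * q)) / (2.0 * p)
    r1 = 1.0 - r2
    # Mean drop between ladder points divided by |mean step|
    a = (1.0 * r1 + 2.0 * r2) / (-mean_step)
    return r1, r2, a

p = 0.3
r1, r2, a = ladder_parameters(p)
print(r1, r2, a)
```

With N aligned positions, the mean number of ladder points would then be estimated as N / a.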
1.3 Choice of a Score
BLAST score is a log likelihood ratio. Why?
Similar to sequence analysis: if random variable Y has a discrete probability distribution, this "score" statistic is defined as the log likelihood ratio
$$S_{1,0}(y) = \log \frac{P(y;\, \xi_1)}{P(y;\, \xi_0)}$$
where $\xi_0$ and $\xi_1$ are the parameter values under the null and alternative hypotheses
If amino acid pair (j,k) is observed at any position, and $p_j p'_k$ and q(j,k) are the null and alternative hypothesis probabilities, the score is
$$S(j,k) = \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.6)
1.3 Choice of a Score
A second argument leads to the choice of a specific proportionality constant
Suppose some arbitrary substitution matrix is chosen with (j,k) element S(j,k), and let q(j,k) be defined implicitly by
$$S(j,k) = \lambda^{-1} \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.7)
where $\lambda$ is defined in equation (10.3):
$$\sum_{j,k} p_j \, p'_k \, e^{\lambda S(j,k)} = 1$$ (10.3)
Thus q(j,k) can be defined explicitly by
$$q(j,k) = p_j \, p'_k \, e^{\lambda S(j,k)}$$ (10.8)
and, by (10.3), $\sum_{j,k} q(j,k) = 1$
1.3 Choice of a Score
Karlin and Altschul (1990) and Karlin (1994) showed that, when the null hypothesis is true, the frequency with which the observation (j,k) arises in high-scoring excursions is asymptotically equal to q(j,k)
They then argued that a scoring scheme is "optimal" if the frequency of the observation (j,k) in high-scoring excursions is asymptotically equal to the "target" frequency q(j,k), the frequency arising if the alternative hypothesis is true
  i.e. the frequency in the most biologically relevant alignments of conserved regions
1.3 Choice of a Score
The argument for the use of S(j,k) as the score statistic leads to the following procedure:
$$S(j,k) = \lambda^{-1} \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.7)
Various possibilities for q(j,k)
One frequently adopted choice is derived from the evolutionary arguments that lead to the PAMn matrix construction in Section 6.5.3:
$$q(j,k) = p_j \, m_{jk}^{(n)}$$ (10.9)
Substituting (10.9) into (10.7) gives
$$S(j,k) = \lambda^{-1} \log \frac{m_{jk}^{(n)}}{p'_k}$$ (10.10)
1.3 Choice of a Score
Choice of S(j,k) can also be related to relative entropy
The score defined in (10.7) is proportional to the support given by the observation (j,k) in favor of the alternative hypothesis over the null hypothesis:
$$S(j,k) = \lambda^{-1} \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.7)
Eq. (1.124) shows that when the alternative hypothesis is true, the mean support for the alternative over the null hypothesis is
$$H = \sum_{j,k} q(j,k) \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.11)
$$H = \lambda \sum_{j,k} q(j,k)\, S(j,k) = \lambda\, E(S(j,k))$$ (10.12)
1.3 Choice of a Score
From (10.7) and (10.12), the mean score in high-scoring segments is asymptotically
$$\lambda^{-1} H = \sum_{j,k} q(j,k)\, S(j,k)$$ (10.13)
1.3 Choice of a Score
Simulations show that the convergence to this asymptotic value is very slow
Direct computation of H from (10.12) is not possible: the frequencies $\{p_j\}$, $\{p'_k\}$ and the scores S(j,k) are known, but q(j,k) is unknown
BLAST uses an indirect approach to calculate H, where q(j,k) is first calculated by
$$q(j,k) = p_j \, p'_k \, e^{\lambda S(j,k)}$$ (10.8)
and then substituted into
$$H = \lambda \sum_{j,k} q(j,k)\, S(j,k)$$ (10.12)
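The indirect calculation can be sketched numerically: solve $\sum_{j,k} p_j p'_k e^{\lambda S(j,k)} = 1$ for the positive root $\lambda$, form $q(j,k) = p_j p'_k e^{\lambda S(j,k)}$, and then compute $H = \lambda \sum q(j,k) S(j,k)$. The bisection solver below and the toy DNA-style scoring (match +1, mismatch -1, uniform frequencies) are assumptions made for illustration, not BLAST's own code:

```python
import math

def solve_lambda(p, p_prime, S, tol=1e-12):
    """Positive root of sum_jk p_j p'_k exp(lam * S[j][k]) = 1, by bisection.
    Assumes a negative mean score and at least one positive score."""
    def f(lam):
        return sum(p[j] * p_prime[k] * math.exp(lam * S[j][k])
                   for j in range(len(p)) for k in range(len(p_prime))) - 1.0
    lo, hi = 1e-6, 1.0
    while f(hi) <= 0:          # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Toy example (assumed): four equally frequent letters, match +1, mismatch -1
p = [0.25] * 4
S = [[1 if j == k else -1 for k in range(4)] for j in range(4)]
lam = solve_lambda(p, p, S)
q = [[p[j] * p[k] * math.exp(lam * S[j][k]) for k in range(4)] for j in range(4)]
H = lam * sum(q[j][k] * S[j][k] for j in range(4) for k in range(4))
print(lam, H)
```

For this toy scoring the equation reduces to a quadratic in $e^{\lambda}$ with root 3, so $\lambda = \log 3$ and $H = \tfrac{1}{2}\log 3$ nats, which makes the sketch easy to check by hand.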
1.4 Bounds and Approximation for BLAST P-value
Test statistic used in BLAST is the maximum Ymax of n ≈ N/A random variables
Each is a random upwards excursion height following a ladder point in the BLAST random walk
In Section 7.6.4, it was shown that each upward excursion has the geometric-like distribution
Obtain asymptotic bounds for the null hypothesis distribution of Ymax, and hence asymptotic bounds for a BLAST P-value
1.4 Bounds and Approximation for BLAST P-value
There exists an asymptotic distribution for the maximum of n iid continuous random variables whose density function has support of the form (A, +∞)
However, Ymax is a discrete random variable
Use the continuous distribution results to find asymptotic bounds for the distribution of Ymax
If Xmax is the max of n iid continuous r.v. and if Ymax = floor(Xmax), then Ymax is a discrete r.v. with
$$X_{\max} - 1 < Y_{\max} \le X_{\max}$$
Thus, for any positive integer y
$$P(X_{\max} \le y) \le P(Y_{\max} \le y) \le P(X_{\max} \le y + 1)$$ (10.14)
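1.4 Bounds and Approximation for BLAST P-value
Let Xmax be the max of n iid r.v., each having an exponential distribution, and let Ymax = floor(Xmax)
Ymax has the same distribution as the max of n iid r.v., each having a geometric distribution
Applying Eq. (2.130) and the bounds in (10.14), we have, to a close approximation,
$$e^{-n e^{-\lambda y}} \le P(Y_{\max} \le y) \le e^{-n e^{-\lambda (y+1)}}$$ (10.15)
$$1 - e^{-n e^{-\lambda y}} \le P(Y_{\max} \ge y) \le 1 - e^{-n e^{-\lambda (y-1)}}$$ (10.16)
For the geometric-like distribution $P(Y \ge y) \sim C e^{-\lambda y}$, the corresponding bounds are
$$1 - e^{-nC e^{-\lambda y}} \le P(Y_{\max} \ge y) \le 1 - e^{-nC e^{-\lambda (y-1)}}$$ (10.17)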
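1.4 Bounds and Approximation for BLAST P-value
If we replace n by N/A, the mean number of BLAST ladder points, and define a new parameter K by
$$K = \frac{C}{A}\, e^{-\lambda}$$ (10.18)
the inequality (10.17) becomes
$$1 - e^{-NK e^{-\lambda (y-1)}} \le P(Y_{\max} \ge y) \le 1 - e^{-NK e^{-\lambda (y-2)}}$$ (10.19)
Equivalently, $e^{-NK e^{-\lambda (y-1)}} \le P(Y_{\max} \le y) \le e^{-NK e^{-\lambda y}}$
If we replace y by $x + \lambda^{-1} \log N$, we have
$$e^{-K e^{-\lambda (x-1)}} \le P(Y_{\max} \le x + \lambda^{-1} \log N) \le e^{-K e^{-\lambda x}}$$ (10.20)
$$1 - e^{-K e^{-\lambda x}} \le P(Y_{\max} > x + \lambda^{-1} \log N) \le 1 - e^{-K e^{-\lambda (x-1)}}$$ (10.21)
In terms of an observed value $y_{\max} = x + \lambda^{-1} \log N$,
$$1 - e^{-KN e^{-\lambda y_{\max}}} \le P(Y_{\max} > y_{\max}) \le 1 - e^{-KN e^{-\lambda (y_{\max} - 1)}}$$ (10.22)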
1.4 Bounds and Approximation for BLAST P-value
These bounds for the BLAST P-value are not directly relevant in practice, because a BLAST search involves
  Comparison of a short query sequence with a large DB containing many fragments
  No a priori alignment
Nevertheless, the P-value approximation derives ultimately from the lower P-value bound in Eq. (10.22)
It would be more appropriate to use the conservative upper bound in (10.22), which overestimates the true P-value, rather than the lower bound
1.5 Normalized and Bit Scores
Karlin and Altschul (1993) call the following expression a "normalized score":
$$S' = \lambda Y_{\max} - \log(NK)$$ (10.25)
In terms of this score, the inequalities (10.20) can be written as
$$e^{-e^{\lambda} e^{-s}} \le P(S' \le s) \le e^{-e^{-s}}$$ (10.26)
From the upper inequality,
$$P(S' \ge s) \approx 1 - e^{-e^{-s}}$$ (10.27)
P-value corresponding to an observed value s' is
$$P\text{-value} \approx 1 - e^{-e^{-s'}}$$ (10.28)
1.5 Normalized and Bit Scores
BLAST records a score similar to the normalized score S', namely the "bit" score, defined by
$$\text{bit score} = \frac{\lambda Y_{\max} - \log K}{\log 2}$$
1.6 Number of High-Scoring Excursions
Quantity E' = quantity "Expect" in BLAST
Under the null hypothesis, for each excursion, the maximum height $Y_i$ has a geometric-like distribution:
$$P(Y_i \ge v) \sim C e^{-\lambda v}$$
Number of excursions ≈ N/A
In BLAST, the mean number of excursions reaching a height v or more is taken to be approximately
$$NK e^{-\lambda v}$$ (10.34)
where
$$K = \frac{C}{A}\, e^{-\lambda}$$ (10.18)
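1.6 Number of High-Scoring Excursions
Expected value of the number of excursions corresponding to the observed maximal score ymax:
$$E' = NK e^{-\lambda y_{\max}}$$ (10.35)
In terms of the normalized score,
$$S' = -\log E'$$
$$P\text{-value} \approx 1 - e^{-E'}$$ (10.36)
$$E' = -\log(1 - P\text{-value})$$ (10.37)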
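The quantities on the last few slides are simple functions of ymax, N, K and $\lambda$: the normalized score $\lambda y_{\max} - \log(NK)$, the bit score $(\lambda y_{\max} - \log K)/\log 2$, $E' = NK e^{-\lambda y_{\max}}$, and P-value $\approx 1 - e^{-E'}$. A minimal sketch; the function name and the parameter values ($\lambda$ = 0.3, K = 0.1, N = 1000, ymax = 40) are made up for illustration:

```python
import math

def blast_summary(y_max, n, k, lam):
    """Return (normalized score, bit score, E', approximate P-value)."""
    normalized = lam * y_max - math.log(n * k)
    bit_score = (lam * y_max - math.log(k)) / math.log(2)
    e_prime = n * k * math.exp(-lam * y_max)
    p_value = -math.expm1(-e_prime)   # 1 - exp(-E'), accurate for small E'
    return normalized, bit_score, e_prime, p_value

# Hypothetical parameter values, chosen only to exercise the formulas
s_norm, bits, e_prime, p_val = blast_summary(y_max=40, n=1000, k=0.1, lam=0.3)
print(s_norm, bits, e_prime, p_val)
```

Note that the bit score differs from the normalized score only by the term log N and the change from natural logarithms to base 2, so it does not depend on the sequence length.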
1.7 Karlin-Altschul Sum Statistic
Focusing on Ymax loses the information provided by the heights of the 2nd, 3rd, etc. excursions in the random walk
Consider the r largest $Y_i$ values
$$Y_1 (= Y_{\max}) \ge Y_2 \ge \cdots \ge Y_r$$
Compute r normalized scores $S'_1, S'_2, \ldots, S'_r$, where
$$S'_i = \lambda Y_i - \log(NK)$$ (10.38)
1.7 Karlin-Altschul Sum Statistic
Karlin and Altschul (1993) showed that, to a close approximation, the null hypothesis joint density function of $S'_1, S'_2, \ldots, S'_r$ is
$$f_S(s_1, \ldots, s_r) = \exp\left(-e^{-s_r} - \sum_{k=1}^{r} s_k\right)$$ (10.39)
Any reasonable function of $S'_1, S'_2, \ldots, S'_r$ can be the test statistic
Use the transformation methods introduced in Chap. 2 to find the distribution of this test statistic
This in turn allows computation of the P-value and the E or Expect value corresponding to any observed value of this statistic
1.7 Karlin-Altschul Sum Statistic
The statistic suggested is the sum of the normalized scores, called the Karlin-Altschul sum statistic:
$$T_r = S'_1 + S'_2 + \cdots + S'_r$$
Null hypothesis density function f(t) of $T_r$:
$$f_{T_r}(t) = \frac{e^{-t}}{r!\,(r-2)!} \int_0^{\infty} y^{r-2} \exp\left(-e^{(y-t)/r}\right) dy$$ (10.40)
When t is sufficiently large, this density function can be used to find the approximate expression
$$P(T_r \ge t) \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}$$ (10.41)
1.7 Karlin-Altschul Sum Statistic
The approximation (10.41) is sufficiently accurate when t > r(r+1), and BLAST uses it when the inequality holds:
$$P(T_r \ge t) \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}$$ (10.41)
If t is the observed value of $T_r$, the right-hand side in (10.41) provides the approximate P-value corresponding to this observed value
This is used as a component of the eventual BLAST printout P-value
Ex. $s_1$ = 4.4 and $s_2$ = 2.5
  r = 1: P-value for the highest normalized score 4.4 is $e^{-4.4}$ = 0.012
  r = 2: P-value for the sum 6.9 is $(6.9/2)\, e^{-6.9}$ = 0.0035
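The worked example can be reproduced directly from the tail approximation $P(T_r \ge t) \approx e^{-t} t^{r-1} / (r!\,(r-1)!)$, applied only when t > r(r+1). A minimal sketch; the function name is illustrative:

```python
import math

def sum_statistic_pvalue(normalized_scores):
    """Approximate P-value for the sum of the r largest normalized scores."""
    r = len(normalized_scores)
    t = sum(normalized_scores)
    if t <= r * (r + 1):
        raise ValueError("approximation is only used when t > r(r+1)")
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))

p_r1 = sum_statistic_pvalue([4.4])        # r = 1, t = 4.4
p_r2 = sum_statistic_pvalue([4.4, 2.5])   # r = 2, t = 6.9
print(round(p_r1, 4), round(p_r2, 4))
```

In this example the pair of segments (r = 2) is more significant than the best single segment (r = 1), which is the situation the sum statistic is designed to detect.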
2. Two Unaligned Sequences
Given two sequences of lengths N1 and N2, but no specific alignment is given
Need to find the significance of high-scoring segment pairs between all possible (ungapped) local alignments
2.1 Theoretical and Empirical Background
BLAST considers all ungapped alignments determined by all possible relative positions of two sequences
For each relative position, alignment is extended as far as possible in either direction, giving a total of N1+N2-1 ungapped alignments
2.1 Theoretical and Empirical Background
Each alignment yields a random walk
Total of N1N2 comparisons between the two sequences, taking all possible positions relative to each other
Many conclusions from the previous section can be carried over to the present case, with N replaced by N1N2 or by a more refined function allowing for edge effects
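2.1 Theoretical and Empirical Background
Ymax is the maximum score achieved in the random walks comparing the sequences, using all possible ungapped local alignments
Mean number of ladder points:
$$N_1 N_2 / A$$ (10.42)
Assuming the null hypothesis is true, the inequalities in (10.21) are replaced by
$$1 - e^{-K e^{-\lambda x}} \le P\left(Y_{\max} > \lambda^{-1} \log(N_1 N_2) + x\right) \le 1 - e^{-K e^{-\lambda (x-1)}}$$ (10.43)
Normalized score S' is redefined as
$$S' = \lambda Y_{\max} - \log(N_1 N_2 K)$$ (10.44)
Expected number E' of excursions reaching a height ymax or more is
$$E' = N_1 N_2 K e^{-\lambda y_{\max}}$$ (10.45)
Null hypothesis mean of Ymax is approximately
$$\lambda^{-1}\left(\log(N_1 N_2 K) + \gamma\right)$$ (10.46)
where $\gamma \approx 0.5772$ is Euler's constant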
2.2 Edge Effects
A high-scoring random walk excursion might be cut short at the end of a sequence match
So the heights of high-scoring excursions and the number of such excursions will be less than predicted by theory
Edge effects are an important factor in the comparison of two comparatively short sequences
BLAST theory concerns two long sequences
In practice, BLAST considers databases containing a large number of short sequences
2.2 Edge Effects
BLAST calculations allow for edge effects by subtracting from both N1 and N2 a factor depending on the mean length of any high-scoring excursion
Eq. (10.13) showed that the mean value of the step in a high-scoring excursion asymptotically approaches $\lambda^{-1} H$
Given that the height achieved by a high-scoring excursion is y, the mean length E(L|y) of this excursion, conditional on y, is
$$E(L \mid y) = \frac{\lambda y}{H}$$ (10.47)
BLAST theory replaces N1 and N2 by
$$N'_1 = N_1 - E(L), \qquad N'_2 = N_2 - E(L)$$
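2.2 Edge Effects
Specifically, the normalized score is replaced by
$$S' = \lambda Y_{\max} - \log(N'_1 N'_2 K)$$ (10.48)
where
$$N'_1 = N_1 - \frac{\lambda Y_{\max}}{H}, \qquad N'_2 = N_2 - \frac{\lambda Y_{\max}}{H}$$ (10.49)
Expected number of excursions scoring v or higher is replaced by
$$N'_1 N'_2 K e^{-\lambda v}, \qquad N'_1 = N_1 - \frac{\lambda v}{H}, \quad N'_2 = N_2 - \frac{\lambda v}{H}$$ (10.50)
E' is given by
$$E' = N'_1 N'_2 K e^{-\lambda y_{\max}}$$ (10.51)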
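The edge correction amounts to shortening both sequence lengths by the mean excursion length $\lambda y / H$ before computing $E' = N'_1 N'_2 K e^{-\lambda y}$. A small sketch comparing corrected and uncorrected values; the function name and every numerical value below are hypothetical, and H is taken in the same (natural log) units as $\lambda$:

```python
import math

def edge_corrected_expect(y, n1, n2, k, lam, h):
    """E' with both lengths reduced by the mean excursion length lam*y/h."""
    mean_length = lam * y / h
    n1_eff = n1 - mean_length
    n2_eff = n2 - mean_length
    if n1_eff <= 0 or n2_eff <= 0:
        raise ValueError("sequences are shorter than the mean excursion length")
    return n1_eff * n2_eff * k * math.exp(-lam * y)

# Hypothetical values, for illustration only
y, n1, n2, k, lam, h = 40, 250, 400, 0.1, 0.3, 0.6
corrected = edge_corrected_expect(y, n1, n2, k, lam, h)
uncorrected = n1 * n2 * k * math.exp(-lam * y)
print(corrected, uncorrected)
```

Here the mean excursion length is 20 positions, so the effective search space shrinks from 250 x 400 to 230 x 380 and E' drops by the same ratio.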
2.2 Edge Effects
The use of the edge correction in (10.49) assumes that the asymptotic formula for the mean step size in a high-scoring excursion is appropriate
Values calculated from Eq. (10.47) are inaccurate for anything other than very large values of N:
$$E(L \mid y) = \frac{\lambda y}{H}$$ (10.47)
Use of the edge correction in (10.49) might in practice lead to P-value estimates less than the correct values for anything other than very large N
2.2 Edge Effects
In BLAST, the edge effect correction factor for the Karlin-Altschul sum statistic $T_r$ is calculated as follows
Raw edge effect correction is calculated as
$$\lambda (Y_1 + Y_2 + \cdots + Y_r) / H$$
Edge correction value E(L) is defined from this raw correction, r, and f by Eq. (10.52)
f is an "overlap adjustment factor" that can be chosen by the user
Default f = 0.125 implies that overlaps between segments of up to 12.5% are allowed
2.3 Multiple Testing
No obvious choice for the value of r
BLAST considers all r = 1, 2, 3, … and chooses the set of HSPs with the lowest sum statistic P-value as the most significant
However, this implies a sequence of tests, one for each r, so the issue of multiple testing arises
Ignoring the multiple testing issue can lead to P-values that are too small, significantly overstating the significance of a match
Unfortunately, no rigorous theory is available to deal with this issue
In practice, it is handled in an ad hoc manner
2.3 Multiple Testing
Ex. WU-BLAST
P-value is adjusted by dividing by the factor
$$(1 - \pi)\, \pi^{r-1}$$
When r = 1, the factor becomes $1 - \pi$, which implies that E' is divided by $1 - \pi$
BLAST default value $\pi$ = 0.5 implies that E = 2E', so that
$$E = 2 N'_1 N'_2 K e^{-\lambda y_{\max}}$$ (10.56)
P-value is then found as
$$P\text{-value} \approx 1 - e^{-E}$$ (10.57)
3. Query Sequence vs. Database
Compare the query sequence to each database sequence to obtain P-values for individual comparisons
For r = 1, the probability that a comparison yields a match with score v or more is
$$1 - e^{-E}$$ (10.58)
Expect, the mean number of HSPs scoring v or more in the entire database, is given by
$$\text{Expect} = \left(1 - e^{-E}\right) \frac{D}{N_2}$$ (10.59)
  D = total length of DB (sum of lengths of all database sequences)
  N2 = length of the database sequence
The corresponding P-value is
$$P\text{-value} \approx 1 - e^{-\text{Expect}}$$ (10.60)
3. Query Sequence vs. Database
For r > 1, from each P-value, a total database value of Expect is calculated by
$$\text{Expect} = (P\text{-value})\, \frac{D}{N_2}$$ (10.61)
$$P\text{-value} \approx 1 - e^{-\text{Expect}}$$ (10.60)
Finally, all single (r = 1) HSPs or summed (r > 1) HSPs with sufficiently low values of Expect are listed
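The scaling from a single pairwise comparison to the whole database multiplies the pairwise P-value by D/N2 to get Expect, then converts back with P-value $\approx 1 - e^{-\text{Expect}}$. A minimal sketch of that arithmetic; the function names and the numbers (database of 10^7 residues, database sequence of length 400, pairwise E = 10^-6) are hypothetical:

```python
import math

def database_expect(pairwise_p_value, db_length, seq_length):
    """Expect for the whole database: pairwise P-value scaled by D / N2."""
    return pairwise_p_value * db_length / seq_length

def database_p_value(expect):
    """P-value = 1 - exp(-Expect), computed stably for small Expect."""
    return -math.expm1(-expect)

# Hypothetical values, for illustration only
pairwise_e = 1e-6
pairwise_p = -math.expm1(-pairwise_e)      # r = 1 case: 1 - exp(-E)
expect = database_expect(pairwise_p, db_length=10_000_000, seq_length=400)
p_db = database_p_value(expect)
print(expect, p_db)
```

A hit that looks very significant in one pairwise comparison is much less so once the size of the database is taken into account.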
4. Minimum Significance Lengths
Correct Choice of n
When sequences are distantly related, similarities between them might be subtle
  Cannot detect significant similarity unless a long alignment is available
On the other hand, if sequences are very similar, then a relatively short alignment is sufficient
If the similarity is subtle, each aligned pair will tell us less (in terms of information) than an aligned pair in more similar sequences
This leads to the concept of information content per position in an alignment
4. Minimum Significance Lengths
Using a PAMn matrix amounts to testing:
  Alternative hypothesis: n is the correct value to use in the evolutionary process leading to the two protein sequences
  Null hypothesis: the appropriate value of n is +∞
Here, assume that the alternative hypothesis is correct (i.e. the correct value of n is chosen)
Explore aspects of the power of the testing procedure by finding the mean length of protein sequence needed before the alternative hypothesis is accepted
4. Minimum Significance Lengths
Suppose that we decide to adopt a testing procedure with Type I error $\alpha$ (false positive rate)
The value s of the normalized score statistic S' is given by $s = -\log \alpha$
Corresponding value ymax of Ymax is
$$y_{\max} = \lambda^{-1} \log \frac{NK}{\alpha}$$ (10.64)
When the alternative hypothesis is true, the mean score for the amino acid comparison at any position is
$$\sum_{j,k} q(j,k)\, S(j,k) = \lambda^{-1} \sum_{j,k} q(j,k) \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.65)
4. Minimum Significance Lengths
Chapter 7 showed that if
  Mean final position in a random walk is F
  Mean step size is G
then the mean number of steps needed to reach the final position is F/G
Mean sequence length needed in the maximally scoring local alignment in order to obtain significance with Type I error $\alpha$ is
$$\frac{\log (NK/\alpha)}{\sum_{j,k} q(j,k) \log \dfrac{q(j,k)}{p_j \, p'_k}}$$ (10.66)
4. Minimum Significance Lengths
Since the various components can be interpreted in terms of bits of information, write the ratio (10.66) as
$$\frac{\log_2 (NK/\alpha)}{\sum_{j,k} q(j,k) \log_2 \dfrac{q(j,k)}{p_j \, p'_k}}$$ (10.67)
Denominator = mean of the relative support, in terms of bits, provided by one observation for the alternative hypothesis against the null hypothesis, given that the alternative hypothesis is true
Numerator = mean total number of bits of information needed to claim that two sequences are similar
4. Minimum Significance Lengths
It is known that typically K = 0.1, and $\alpha$ = 0.05 or 0.01
Thus the numerator is largely determined by the length N, and is approximately $\log_2 N$
Ex. N = 1000: need about $\log_2 1000$ = 9.97 bits of information to claim significant similarity between two sequences
Main interest is the minimum significant length
4. Minimum Significance Lengths
If n is large, q(j,k) is close to $p_j p'_k$
  Mean information per aligned pair given in the denominator is small
  Minimum significant length is large
  If the null and alternative hypotheses specify quite similar probabilities for any aligned pair, many observations will in general be needed to decide between the two hypotheses
If n is small
  Mean relative support for the alternative hypothesis is large
  Minimum significant length is small
4. Minimum Significance Lengths
Limiting (n → 0) values
  $q(j,j) = p_j$
  $q(j,k) = 0$ for j ≠ k
Denominator, the mean support from each position in favor of the alternative hypothesis, approaches
$$-\sum_j p_j \log_2 p_j$$
If all amino acids are equally frequent, this mean support is $\log_2 20$ = 4.32
In practice, actual frequencies of observed amino acids imply that a more appropriate value is about 4.17
Thus, the minimum significant length is $(\log_2 N)/4.17$
If N = 1000, this is about 2.39
4. Minimum Significance Lengths
When N = 1000 and n = 250
  Corresponds to a PAM250 substitution matrix
  Probabilities q(j,k) are such that each amino acid pair provides a mean of only 0.36 bits of information
  A minimum significance length of $\log_2(1000)/0.36 \approx 28$ is required on average to accept the alternative hypothesis
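The numbers quoted on these slides follow from dividing the bits needed (approximately $\log_2 N$, or $\log_2(NK/\alpha)$ in full) by the mean bits per aligned pair. A short check of the arithmetic; the function names are illustrative, and the defaults K = 1 and $\alpha$ = 1 are chosen only so that the numerator reduces to the slide's $\log_2 N$ approximation:

```python
import math

def entropy_bits(freqs):
    """-sum p log2 p: limiting (n -> 0) mean support per aligned pair, in bits."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

def min_significant_length(n, bits_per_pair, k=1.0, alpha=1.0):
    """log2(N K / alpha) / (mean bits per pair); defaults give log2(N) on top."""
    return math.log2(n * k / alpha) / bits_per_pair

uniform_bits = entropy_bits([1 / 20] * 20)         # 20 equally frequent amino acids
len_identical = min_significant_length(1000, 4.17)  # n -> 0 limit, observed frequencies
len_pam250 = min_significant_length(1000, 0.36)     # PAM250
print(round(uniform_bits, 2), round(len_identical, 2), round(len_pam250, 1))
```

Passing explicit values such as k=0.1 and alpha=0.05 shows how much the slide's $\log_2 N$ shortcut understates the numerator for a given N.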
4. Minimum Significance Lengths
Incorrect Choice of n
Above calculations all assume that the correct value of n is chosen, and thus that the correct alternative hypothesis probabilities q(j,k) are used
In practice, it is impossible to choose a unique correct value for n when using a PAM matrix
Suppose there is a unique correct value m leading to a PAMm matrix, but an incorrect value n was chosen and a PAMn matrix is used instead
What does this imply?
4. Minimum Significance Lengths
Suppose that with the correct choice m, the probability of the ordered pair (j,k) is r(j,k)
The mean score is then
$$\lambda^{-1} \sum_{j,k} r(j,k) \log \frac{q(j,k)}{p_j \, p'_k}$$ (10.68)
r(j,k) = q(j,k) when n = m, and the mean score is then positive
More generally, the mean score is positive when n and m are close
But as m → +∞, r(j,k) → $p_j p'_k$, and the mean score is negative
Thus for any choice of n there will be values of m sufficiently large compared to n so that the mean score is negative
4. Minimum Significance Lengths
When the mean score is positive, the minimal significance length is
$$\frac{\log (NK/\alpha)}{\sum_{j,k} r(j,k) \log \dfrac{q(j,k)}{p_j \, p'_k}}$$ (10.69)
Minimal length depends on q(j,k), that is, on the choice of n
Choice of n involves substantial extrinsic guesswork, thus it is important to assess the implications of an incorrect choice
4. Minimum Significance Lengths
Negative means arise when m is sufficiently large compared to n, that is,
  When the two species being compared diverged a long time in the past relative to the time assumed by the PAM matrix used in the analysis
The more negative this mean is, the more likely it is that the null hypothesis will be accepted
In the limit m → +∞, when r(j,k) = $p_j p'_k$, the probability of rejecting the null hypothesis is equal to the chosen Type I error
Ex. If n = 100 is chosen, the mean score is negative when m is 193 or more
4. Minimum Significance Lengths
In conclusion,
  A correctly chosen small value of n leads to shorter minimal significance lengths
  An incorrect small choice may lead to the possibility that a real similarity between the two sequences will not be picked up
In practice, to overcome this problem, a variety of substitution matrices is sometimes used
However, this practice must be viewed with some caution, especially in the light of the multiple testing problem
5. Parametric or Non-parametric
Parametric test: test statistic is found from likelihood ratio arguments
Non-parametric test: test statistic is found on reasonable but nevertheless arbitrary grounds
Many of the calculations and arguments used in the preceding sections derive from the derivation of the score S(j,k) in a substitution matrix from likelihood ratio arguments
In this sense, BLAST testing theory can be thought of as a parametric procedure deriving from likelihood ratio theory
5. Parametric or Non-parametric
Assumptions made in the theory are, however, subject to debate
Time homogeneity assumption implicit in the calculations cannot be sustained
  Genetic code influenced substitutions earlier in time, and various chemical properties influenced substitutions more recently
  Thus, comparisons of distantly related species can be problematic
Further, if data in a large database come from a collection of species whose respective evolutionary divergence times might differ widely, the concept of a uniformly correct choice of n is not meaningful
5. Parametric or Non-parametric
Even if these claims are true, the statistical aspects of the BLAST procedure are still valid
P-value calculations are still correct: even if the scores were chosen in any more or less reasonable way, no problems arise with the correctness of the calculations
In this sense, the BLAST testing process can be thought of as a non-parametric procedure
6.1 Gapped BLAST
Allows gaps in sequence alignments
In the comparison of two sequences, there will be some maximum score $Y_{\max}^{(\text{gapped})}$ over all possible gapped alignments
Null hypothesis probability distribution of this maximum is determined by the substitution matrix used and the gap penalty chosen
The distribution can be estimated through simulation:
  Randomly generate two sequences of lengths N1 and N2
  From these sequences, find the observed maximum score, denoted by y1
  Procedure is repeated n times, yielding n observed highest scores y1, y2, …, yn
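6.1 Gapped BLAST
Approximation was made that the distribution of Ymax in the gapped case is of the same form as in the ungapped case, with revised values of K and $\lambda$
The approach described above depends on simulation results: with $\bar{y}$ and s the sample mean and standard deviation of y1, …, yn, and $\gamma$ Euler's constant, the estimates are
$$\hat{\lambda} = \frac{\pi}{s\sqrt{6}}, \qquad \hat{K} = \frac{e^{\hat{\lambda} \bar{y} - \gamma}}{N_1 N_2}$$ (10.72)
If a penalty of δ is assigned to each gap in the alignment of two sequences, then (10.45) is replaced by Eq. (10.73)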
6.2 PSI BLAST
PSI (Position Specific Iterated) BLAST
In regular BLAST, a fixed substitution matrix is used to score positions in alignments
  It relies on one matrix to provide the most meaningful scores for all positions in the query sequence simultaneously
PSI-BLAST
  Uses a standard substitution matrix in the first step
  Sequences found are then used to derive a separate scoring scheme for each position in the query sequence, which is used for the second BLAST search
  The procedure is iterated until no further iteration seems useful
6.2 PSI BLAST
Query sequence is first compared to database sequences
All database sequence segments having a sufficiently close similarity with the query (e.g. Expect < 0.01) are reported
From this collection of sites, a frequency $f_i$ of amino acid i is calculated, and used to estimate the frequency $Q_i$ of amino acid i at this site:
$$g_i = \sum_j \frac{f_j \, q(i,j)}{p_j}, \qquad Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$ (10.75)
where $\alpha$ and $\beta$ are the relative weights given to the observed frequencies and to the pseudocount frequencies $g_i$
In PSI-BLAST, $\sum_i g_i = 1$ no longer holds
Schaffer et al. (2001) described a new implementation
$$Q_i = \frac{\alpha f_i + \beta \sum_j f_j \, p(i,j)/p_j}{\alpha + \beta}$$ (10.76)
where $p_i$ is the background frequency of amino acid i and p(i,j) is the frequency with which amino acids i and j are aligned through evolutionary descent
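The position-specific estimate can be illustrated with a toy calculation. The sketch below assumes the pseudocount form $g_i = \sum_j f_j\, q(i,j)/p_j$ and $Q_i = (\alpha f_i + \beta g_i)/(\alpha + \beta)$; the two-letter alphabet, the target frequencies q(i,j), and the weights $\alpha$ = 3, $\beta$ = 1 are all made-up values for illustration, not PSI-BLAST defaults:

```python
def pseudocount_frequencies(f, q, p):
    """g_i = sum_j f_j * q[i][j] / p_j, for observed column frequencies f."""
    n = len(f)
    return [sum(f[j] * q[i][j] / p[j] for j in range(n)) for i in range(n)]

def position_frequencies(f, g, alpha, beta):
    """Q_i = (alpha * f_i + beta * g_i) / (alpha + beta)."""
    return [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]

# Toy two-letter alphabet (assumed values)
p = [0.5, 0.5]                       # background frequencies
q = [[0.4, 0.1], [0.1, 0.4]]         # target pair frequencies, marginals equal p
f = [0.9, 0.1]                       # observed frequencies at one query position
g = pseudocount_frequencies(f, q, p)
Q = position_frequencies(f, g, alpha=3.0, beta=1.0)
print(g, Q)
```

The pseudocounts pull the estimate toward what the substitution model predicts (0.74 / 0.26 here), and the weight $\alpha$ controls how strongly the observed column frequencies dominate.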