Basic Local Alignment Search Tool Presented by Mei Liu August 7, 2008.

73
Basic Local Alignment Search Tool Presented by Mei Liu August 7, 2008
  • date post

    22-Dec-2015
  • Category

    Documents

  • view

    216
  • download

    3

Transcript of Basic Local Alignment Search Tool Presented by Mei Liu August 7, 2008.

Basic Local Alignment Search Tool

Presented by Mei Liu

August 7, 2008

Introduction

BLAST Finds regions of local similarity between sequences Assesses which DNA or protein sequences in a large

database have significant similarity with a given query sequence

Infer functional and evolutionary relationships between sequences

Help identify members of gene families Two implementations of BLAST: one by NCBI and the

other at Washington University

Introduction

WU-BLAST printouts give the following values Score or High Score Bit scores Expect values P-values

Outline

Comparison of two aligned sequences BLAST random walk Parameter calculations Choice of score Bounds and approximation for BLAST p-value Normalized and bit scores Number of high-scoring excursions Karlin-Altschul sum statistic

Outline

Comparison of two unaligned sequences Comparison of a query sequence against a

database Minimum significance lengths Parametric or non-parametric test? Gapped BLAST and PSI BLAST

1. Two Aligned Sequences Given an ungapped global alignment of two

protein sequences, both of length N Null hypothesis: for each aligned pair of amino

acids, the two amino acids are generated by independent mechanisms

Null hypothesis probability of the amino acid pair (j, k) =

Alternative hypothesis probability of the amino acid pair (j, k) =

'kj pp

),( kjq

(10.1)

(10.2)

1.1 BLAST Random Walk

Number the positions from left to right as 1, 2, …, N

A score S(j, k) is allocated to each aligned amino acid pair (j, k)

In application of BLAST, the score is found by BLOSUM or PAM

1.1 BLAST Random Walk PAM

Developed by Margaret Dayhoff in 1970s calculated by observing the differences in closely

related proteins PAM1 matrix estimates what rate of substitution would

be expected if 1% of the amino acids had changed Derived matrices as high as PAM250 Higher numbers in the PAM matrix naming scheme

denote larger evolutionary distance Not work very well for aligning evolutionarily

divergent sequences

1.1 BLAST Random Walk BLOSUM

Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins

Probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments

To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered

For the BLOSUM62, this threshold was set at 62% Larger numbers in the BLOSUM matrix naming

scheme denote higher sequence similarity

1.1 BLAST Random Walk

1.1 BLAST Random Walk Accumulated score at

position i is calculated as the sum of scores for various amino acid comparisons at positions 1, 2, … , i

Sequence 1: T Q L A A W C R M T C F E I E C K V

Sequence 2: R H L D S W R R A S D D A R I E E G

S(j, k): -1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4, etc.

Accumulated Score: -1, 0, 5, 3, 4, 19, 15, 22, etc.

1.1 BLAST Random Walk Let Y1, Y2, … be the respective maximum heights of the

walk relative to the height of any ladder point after leaving this ladder point and before arriving at the next

Define Ymax as the maximum of these maxima Ymax is the test statistic used in BLAST, so it is necessary

to find its null hypothesis distribution Random variables Yi exhibit geometric-like distribution

C and depends on the substitution matrix used and amino acid frequencies { } and { }

Probability distribution of Ymax, apart from C and , also depends on the mean number of ladder points in the walk

yCeyYP )(

'kpjp

1.2 Parameter Calculations

Step size is identified with a score S(j,k) Null hypothesis probability of taking a step of any size is

found from the two sets of frequencies { } and { } When null hypothesis is true, can be calculated

jp

d

k

kk

C

j

jj

ekQe

eRQ

C

1

1

**

*

1

1

'kp

kj

kjSkj epp

,

),(' 1(10.3)

(7.61)

1.2 Parameter Calculations Ymax depends on C, , and mean number of ladder points in

BLAST walk Mean number of ladder points in turn depends on the

distance A between ladder points

Calculation of A depends on the calculation of R-j

Two alternative approaches in calculation

d

cjj

c

jj

jP

jR

A 1(7.41)

1.2 Parameter Calculations Decomposition of paths

Ex. A walk with 2 possible steps: +1, -2 with respective probabilities p, q=1-p

Any ladder point reached in the walk is at a distance 1 or 2 below the previous one

Respective probabilities of the two cases are R-1 and R-2 = 1 – R-1

Probability that -2 is a ladder point is: Probability that it goes to -2 immediately, and Probability that it first goes to +1 reaches 0 -2

1.2 Parameter Calculations

21

2

2 12

4

RR

p

qpqqR

Then value of A follows from Eq. (7.41)

Since two sequences compared are each of length N, and mean distance between ladder points is A

The mean number of ladder points is N/A

(10.5)

(10.4)

d

Cj j

C

j j

jp

jRA 1 (7.41)

222 )1( RRpqR

Directly -2 +1 0 -2

1.3 Choice of a Score BLAST score is a log likelihood ratio Why?

Similar to sequence analysis If random variable Y has a discrete probability distribution,

this “score” statistic is defined as the log likelihood ratio

If amino acid pair (j,k) is observed at any position, and if pjpk' and q(j,k) are null and alternative hypothesis probabilities

);(

);(log)(

0

10,1

yP

yPyS

'

),(log),(

kj pp

kjqkjS (10.6)

1.3 Choice of a Score Second argument leads to the choice of a specific

proportionality constant Suppose some arbitrary substitution matrix is chosen

with (j,k) element S(j,k), let q(j,k) be defined implicitly by

where is defined in equation (10.3)

Thus q(j,k) can be defined explicitly by

kj

kjSkj epp

,

),(' 1

'1 ),(log),(

kj pp

kjqkjS

),('),( kjSkj eppkjq

kjkjq

,1),(

(10.7)

(10.3)

(10.8)

1.3 Choice of a Score Karlin and Altschul (1990) and Karlin (1994)

showed that When null hypothesis is true, the frequency with which

the observation (j,k) arises in high-scoring excursions is asymptotically equal to q(j,k)

Then argued that a score scheme is “optimal” if the frequency of the observation (j,k) in high-scoring excursions is asymptotically equal to the “target” frequency q(j,k), the frequency arising if the alternative hypothesis is true

i.e. frequency in the most biologically relevant alignments of conserved regions

1.3 Choice of a Score

Argument for the use of S(j,k) as the score statistic lead to following procedures:

Various possibilities for q(j,k) One frequently adopted choice is derived from

the evolutionary arguments that lead to PAMn matrix construction in 6.5.3

'

)(

log),(k

njk

p

mkjS

'1 ),(log),(

kj pp

kjqkjS

)(),( njkjmpkjq

(10.7)

(10.9) (10.10)

1.3 Choice of a Score Choice of S(j,k) can as be related to relative entropy

Score defined is proportional to the support given by the observation (j,k) in favor of the alternative hypothesis over the null hypothesis

Eq. 1.124 shows that when the alternative hypothesis is true, the mean support for the alternative over the null hypothesis is

'1 ),(log),(

kj pp

kjqkjS

kj kj pp

kjqkjqH

,'

),(log),(

kj

kjSEkjSkjqH,

)),((),(),(

(10.7)

(10.11) (10.12)

1.3 Choice of a Score

Mean score in high-scoring segments is asymptotically

'1 ),(log),(

kj pp

kjqkjS

kj

kjSkjqH,

),(),(

H1

kjkjSkjq

,),(),(

(10.13)

(10.12)

(10.7)

1.3 Choice of a Score Simulations show that the convergence to this

asymptotic value is very slow Direct computation of H is not possible

and S(j,k) are known, but q(j,k) is unknown BLAST uses indirect approach to calculate H

where q(j,k) is first calculated by),('),( kjS

kj eppkjq

kj

kjSkjqH,

),(),(

(10.8)

(10.12)

1.4 Bounds and Approximation for BLAST P-value

Test statistic used in BLAST is the maximum Ymax of n ≈ N/A random variables Each being a random upwards excursion height

following a ladder point in the BLAST random walk In section 7.6.4, it was shown that each upward

excursion has the geometric-like distribution Obtain asymptotic bounds for the null hypothesis

distribution of Ymax and hence asymptotic bounds for a BLAST P-value

1.4 Bounds and Approximation for BLAST P-value

There exists an asymptotic distribution for the maximum of n iid continuous random variables whose density function has support of the form (A, +∞)

However, Ymax is a discrete random variable Use the continuous distribution results to find asymptotic

bounds for the distribution of Ymax If Xmax is the max of n iid continuous r.v. and if Ymax =

floor(Xmax), then Ymax is a discrete r.v.

Thus, for any positive integer y

maxmaxmax 1 XYX

)1()()( maxmaxmax yXPyYPyXP (10.14)

1.4 Bounds and Approximation for BLAST P-value

Let Xmax be the max of n iid r.v. each having exponential distribution and Ymax = Floor(Xmax) Ymax has the same distribution as the max of n iid r.v.

each having geometric distribution Applying Eq. (2.130) and bounds in (10.14), we

have a close approximation

)1(

)1(

)1(

1)(1

1)(1

)(

max

max

max

yy

yy

yy

nCenCe

nene

nene

eyYPe

eyYPe

eyYPe

(10.15)

(10.16)

(10.17)

1.4 Bounds and Approximation for BLAST P-value

If we replace n by N/A for the mean number of BLAST ladder points and define a new parameter K by

The inequality (10.17) becomes

If replace y by x+-1logN, we have

eA

CK

)2()1(

1)(1 max

yy NKeNKe eyYPe

)1max(max

)1(

)1(

1)(1

1)log(1

)log(

maxmax

1max

1max

yy

xx

xx

KNeKNe

KeKe

KeKe

eyYPe

exNYPe

exNYPe

(10.18)

(10.19)

(10.20)

(10.21)

(10.22)

1.4 Bounds and Approximation for BLAST P-value

These bounds for BLAST P-value are not directly relevant in practice because BLAST search involves comparison of short query

sequence with a large DB with many fragments No a priori alignment

Nevertheless, P-value approximation derives ultimately from the lower P-value bound in Eq. (10.22)

More appropriate to use conservative (overestimate the true P-value) upper bound in (10.22) rather than lower bound

1.5 Normalized and Bit Scores Karlin and Altschul (1993) call the following

expression a “normalized score”

In terms of this score, the inequalities (10.20) can be written as

From the upper inequality P-value corresponding to an observed value s' is

)log(' max NKYS

ss eee esSPe )'(

seesSP 1)'(

'

1seevalueP

(10.28)

(10.27)

(10.26)

(10.25)

1.5 Normalized and Bit Scores

BLAST record a score similar to the normalized score S', namely the “bit” score defined by

2log

logmax KYscorebit

1.6 Number of High-Scoring Excursions

Quantity E' = quantity “Expect” in BLAST Under null hypothesis, for each excursion, the

maximum height Y has a geometric-like distribution

# of excursions = N/A In BLAST, mean number of excursions reaching a

height v or more is approximately e

A

CK vNKe

vi cevYP )(

(10.34)(10.18)

1.6 Number of High-Scoring Excursions

Expected value of the number of excursions corresponding to the observed maximal score ymax

)1log('

1

'log'

'

'

max

valuePE

evalueP

ES

NKeE

E

y

(10.35)

(10.36)

(10.37)

1.7 Karlin-Altschul Sum Statistic

Focusing on Ymax loses information provided by heights of the 2nd, 3rd, etc. excursions in the random walk

Consider r largest Yi values

Compute r normalized scores where

rYYYY 2max1 )(

)log(' NKYS ii (10.38)

1.7 Karlin-Altschul Sum Statistic Karlin and Altschul (1993) showed that to a close

approximation, the null hypothesis joint density function is

Any reasonable function of can be the test statistic

Use transformation methods introduced in Chap. 2 to find the distribution of this test statistic

In turn allows computations of P-value and E or Expect value corresponding to any observed value of this statistic

r

kk

srS sessf r

11 exp),,(

''2

'1 ,, rSSS

(10.39)

1.7 Karlin-Altschul Sum Statistic Statistic suggested is the sum of the normalized

scores, called the Karlin-Altschul sum statistic

Null hypothesis density function f(t) of Tr

When t is sufficiently large, this density function can be used to find the approximate expression

''1 rr SST

0

/)()2( )exp()!2(!

)( dyeyrr

etf rtyr

t

Tr

)!1(!)(

1

rr

tetTP

rt

r

(10.40)

(10.41)

1.7 Karlin-Altschul Sum Statistic The approximation (10.41) is sufficiently accurate when t

> r(r+1), and BLAST uses it when the inequality holds

If t is the observed value of Tr, the right hand side in (10.41) provides the approximate P-value corresponding to this observed value

This is used as a component of the eventual BLAST printout P-value

Ex. s1 = 4.4 and s2 = 2.5 r = 1, P-value for the highest normalized score 4.4 = e-4.4 = 0.012 r = 2, P-value for the sum 6.9 = 6.9/2 * e-6.9 = 0.0035

)!1(!)(

1

rr

tetTP

rt

r (10.41)

2. Two Unaligned Sequences

Given two sequences of lengths N1 and N2, but no specific alignment is given

Need to find the significance of high-scoring segment pairs between all possible (ungapped) local alignments

2.1 Theoretical and Empirical Background

BLAST considers all ungapped alignments determined by all possible relative positions of two sequences

For each relative position, alignment is extended as far as possible in either direction, giving a total of N1+N2-1 ungapped alignments

2.1 Theoretical and Empirical Background

Each alignment yields a random walk Total N1N2 comparisons between two

sequences taking all possible positions relative to each other

Many conclusions from previous section can be carried over to the present case with N replaced by N1N2 or a more refined function allowing for edge

effects

2.1 Theoretical and Empirical Background

Ymax is the maximum score achieved in the random walk comparing sequences, using all possible ungapped local alignments

Mean number of ladder points: Assume null hypothesis is true, inequalities in (10.21) is replaced by

Normalized score S' is redefined as

Expected number E' of excursions reaching a height ymax or more is

Null hypothesis mean of Ymax is

ANN 21

)1(

1))log((1 211

max

xx KeKe exNNYPe

)log(' 21max KNNYS

max21' yKeNNE

))(log( 211 KNN

(10.42)

(10.43)

(10.44)

(10.45)

(10.46)

2.2 Edge Effects A high-scoring random walk excursion might be

cut short at the end of a sequence match So the height of high-scoring excursions and the

number of such excursions will be less than predicted by theory

Edge effects is an important factor in the comparison of two comparatively short sequences

BLAST theory concerns two long sequences In practice, BLAST considers databases of large

number of short sequences

2.2 Edge Effects BLAST calculations allow for edge effects by subtracting

from both N1 and N2 a factor depending on the mean length of any high-scoring excursion

Eq. (10.13) showed that the mean value of the step in high-scoring excursion asymptotically approaches

Given the height achieved by a high-scoring excursion is denoted by y, the mean length E(L|y) of this excursion, conditional on y, is

BLAST theory replaces N1 and N2 by

H1

H

yyLE

)|(

)()( 2'21

'1 LENNLENN

(10.47)

2.2 Edge Effects Specifically, the normalized score is replaced by

Expected number of excursions scoring v or higher is replaced by

E' is given by

H

YNN

H

YNN

KNNY

max2

'2

max1

'1

'2

'1max )log(

H

vNN

H

vNN

KeNN v

2'21

'1

'2

'1

max'2

'1' yKeNNE (10.51)

(10.50)

(10.49)

(10.48)

2.2 Edge Effects The use of edge correction in (10.49) assumes that

asymptotic formula for the mean step size in a high-scoring excursion is appropriate

Values calculated from Eq. (10.47) is inaccurate for anything other than very large values of N

Use of edge correction in (10.49) might in practice lead to P-value estimates less than the correct values for anything other than very large N

H

yyLE

)|( (10.47)

2.2 Edge Effects In BLAST, edge effect correction factor for the

Karlin-Altschul sum statistic Tr is calculated as follows Raw edge effect correction is calculated as

Edge correction value E(L) is defined by

f is an “overlap adjustment factor” that can be chosen by the user

Default f = 0.125 implies that overlaps between segments of up to 12.5% are allowed

HYYY r /)( 21

11

1)()( 21

rfr

rYYY

HLE r

(10.52)

2.3 Multiple Testing No obvious choice for the value of r BLAST considers all r = 1, 2, 3, … and choose the set of

HSPs with lowest sum statistic P-value as the most significant

However, it implies that a sequence of tests, one for each r So issue of multiple testing arises Ignoring multiple testing issue can lead to a significant

overestimate of BLAST P-values Unfortunately, no rigorous theory available to deal with

this issue In practice, it is handled in an ad hoc manner

2.3 Multiple Testing Ex. WU-BLAST

P-value is adjusted by dividing by a factor

When r = 1, the factor became 1- π, which implies that E' is divided by 1- π

BLAST default value 0.5 of π implies that E=2E', so that

P-value is then found as

1)1( r

max'2

'12 yKeNNE

EevalueP 1

(10.56)

(10.57)

3. Query Sequence vs. Database

Compare query sequence to each database sequence to obtain P-values for individual comparisons

For r = 1, probability that in a match with score v or more is

Expect, the mean number of HSPs scoring v or more in the entire database is given by

D = total length of DB (sum of lengths of all database sequences) N2 = length of the database sequence

Ee1

2

)1(

N

DeExpect

E

ExpectevalueP 1

(10.58)

(10.59)

(10.60)

3. Query Sequence vs. Database For r > 1, from each P-value, a total

database value of Expect is calculated by

Finally, all single (r = 1) HSPs or summed (r > 1) HSPs with sufficiently low values of Expect are listed

2

)(

N

DvaluePExpect

ExpectevalueP 1 (10.60)

(10.61)

4. Minimum Significance Lengths

Correct Choice of n When sequences are distantly related, similarities

between them might be subtle Cannot detect significant similarity unless a long

alignment is available On the other hand, if sequences are very similar, then a

relatively short alignment is sufficient If the similarity is subtle, each aligned pair will tell us

less than an aligned pair in more similar sequences (in terms of information)

This lead to the concept of information content per position in an alignment

4. Minimum Significance Lengths Using a PAMn matrix is to test: Alternative hypothesis: n is the correct value to

use in the evolutionary process leading to the two protein sequences

Null hypothesis: appropriate value of n is +∞ Here, assume that the alternative hypothesis is

correct (i.e. correct value of n is chosen) Explore aspects of power of the testing procedure

by finding the mean length of protein sequence needed before the alternative hypothesis is accepted

4. Minimum Significance Lengths

Suppose that, we decide to adopt a testing procedure with Type I error α (FP)

The value s of the normalized score statistic S' is given by s = -logα

Corresponding value ymax of Ymax is

When alternative hypothesis is true, mean score for the amino acids comparison at any position is

NK

y log1max

kj kj kj pp

kjqkjqkjSkjq

, ,

1 ),(log),(),(),(

(10.64)

(10.65)

4. Minimum Significance Lengths

In Chapter 7, it showed that if Mean final position in a random walk is F Mean step size is G Then mean number of steps needed to reach the final

position is F/G Mean sequence length needed in the maximally

scoring local alignment in order to obtain significance with Type I error α is

kjkj ppkjq

kjq

NK

,

),(log),(

log

(10.66)

4. Minimum Significance Lengths Since various components can be interpreted in terms of

bits of information, thus write the ratio (10.66) as

Denominator = mean of the relative support, in terms of bits, provided by one observation for the alternative hypothesis against the null hypothesis, given that the alternative hypothesis is true

Numerator = mean total number of bits of information needed to claim that two sequences are similar

kjkj ppkjq

kjq

NK

, 2

2

),(log),(

log

(10.67)

4. Minimum Significance Lengths

It is known that typically K = 0.1, α = 0.05 or 0.01 Thus numerator is largely determined by length

N, which is approximately log2N Ex. N = 1000, need 9.97 bits of information to

claim significant similarity between two sequences

Main interest is the minimum significant length

4. Minimum Significance Lengths If n is large, q(j,k) is close to pjpk

Mean information per aligned pair given in the denominator is small

Minimum significant length is large If null and alternative hypotheses specify quite similar

probabilities for any aligned pair, many observations will in general be needed to decide between two hypotheses

If n is small Mean relative support for the alternative hypothesis is

large Minimum significant length is small

4. Minimum Significance Lengths Limiting (n 0) values

q(j,j) = pj q(j,k) = 0 for j ≠ k Denominator, mean support from each position in favor

of the alternative hypothesis, approaches

If all amino acids are equally frequent, this mean support is log220 = 4.32

In practice, actual frequencies of observed amino acids imply that a more appropriate value is about 4.17

Thus, minimum significant length is (log2N)/4.17 If N = 1000, this is about 2.39

j jj pp 2log

4. Minimum Significance Lengths

When N = 1000 and n = 250 Corresponds to a PAM250 substitution matrix Probabilities q(j,k) are such that each amino

acid pair provides a mean of only 0.36 bits of information

Minimum significance length is log(1000)/0.36 = 28 is required on average to accept the alternative hypothesis

4. Minimum Significance Lengths

Incorrect Choice of n Above calculations all assume the correct value of n is

chosen, thus correct alternative hypothesis probabilities q(j,k) is used

In practice, it is impossible to choose a unique correct value for n when using a PAM matrix

Suppose there is a unique correct value m leading to a PAMm matrix, but an incorrect value n was chosen and PAMn matrix is used instead

What does this imply?

4. Minimum Significance Lengths Suppose that with the correct choice m, the probability of

the ordered pair (j,k) is r(j,k) The mean score is then

r(j,k) = q(j,k) when n = m, mean score is positive More generally, mean score is positive when n and m are

close But, as m +∞, r(j,k) pjpk, mean score is negative Thus for any choice of n there will be values of m

sufficiently large compared to n so that the mean score is negative

kj kj pp

kjqkjr

,

1 ),(log),( (10.68)

4. Minimum Significance Lengths

When mean score is positive, minimal significance length is

Minimal length depends on q(j,k), that is on the choice of n

Choice of n involves substantial extrinsic guesswork, thus it is important to assess the implications of an incorrect choice

kjkj ppkjq

kjr

NK

,

),(log),(

log (10.69)

4. Minimum Significance Lengths Negative means arise when m is sufficiently large

compared to n, that is When two species being compared diverged a long

time in the past relative to the time assumed by the PAM matrix used in analysis

The more negative this mean is, the more likely that the null hypothesis will be accepted

In the limit m +∞, when r(j,k) = pjpk, the probability of rejecting the null hypothesis is equal to the chosen Type I error

Ex. If n = 100 is chosen, the mean score is negative when m is 193 or more

4. Minimum Significance Lengths

In conclusion, Correctly chosen small value of n leads to shorter

minimal significance lengths Incorrect small choice may lead to the possibility that a

real similarity between the two sequences will not be picked up

In practice, to overcome this problem, sometimes uses a variety of substitution matrices

However, it must be viewed with some caution, especially in the light of multiple testing problem

5. Parametric or Non-parametric Parametric test: test statistic is found from

likelihood ratio arguments Non-parametric test: test statistic is found on

reasonable but nevertheless arbitrary grounds Many of calculations and arguments used in

preceding sections derive from the derivation of the score S(j,k) in a substitution matrix from likelihood ratio arguments

In this sense, BLAST testing theory can be thought of as a parametric procedure deriving from the likelihood ratio theory

5. Parametric or Non-parametric Assumptions made in the theory are, however, subject to

debate Time homogeneity assumption implicit in calculations

cannot be sustained Genetic code influenced substitutions earlier in time and various

chemical properties influenced substitutions more recently Thus, comparisons of distantly related species can be problematic

Further, if data in a large database come from a collection of species whose respective evolutionary divergence times might differ widely, the concept of a uniformly correct choice of n is not meaningful

5. Parametric or Non-parametric Even if these claims are true, the statistical aspects

of the BLAST procedure are still valid P-value calculations are still correct, so even if

these scores were chosen in any more or less reasonable way, no problems arise with the correctness of the calculations

In this sense, BLAST testing process can be thought of as a non-parametric procedure

6.1 Gapped BLAST Allows gaps in sequence alignments In comparison of two sequences, there will be some

maximum scores

Maximum score over all possible gapped alignments Null hypothesis probability distribution is determined by

the substitution matrix used and gap penalty chosen The distribution can be estimated through simulation

Randomly generate two sequences of lengths N1 and N2 From these sequences, find the observed maximum score denoted

by y1 Procedure is repeated n times yielding n observed highest scores

y1, y2, …, yn

)(max

gappedY

6.1 Gapped BLAST Approximation was made that the distribution of

Ymax in gapped case is of the same form in the ungapped case with revised values of K and

Approach described above depends on simulation results

If a penalty of δ is assigned to each gap in the alignment of two sequences, then (10.45) is replaced by

)6/()(_

121 seNNK y

11'

*21max

e

TKeNNE y (10.73)

(10.72)

6.2 PSI BLAST PSI (Position Specific Iterated) BLAST In regular BLAST, a fixed substitution matrix is used to

score positions in alignments It relies on one matrix to provide the most meaningful

scores for all positions in the query sequence simultaneously

PSI-BLAST Uses a standard substitution matrix in the first step Sequences found are then used to derived a separate scoring

scheme for each position in the query sequence and used for the second BLAST search

The procedure is iterated until no further iteration seems useful

6.2 PSI BLAST Query sequence is first compared to database sequences All database sequence segments having a sufficiently close

similarity with the query (ex. Expect < 0.01) are reported From this collection of sites, a frequency fi of amino acid i

is calculated, and used to estimate frequency Qi of amino acid i at this site

In PSI-BLAST, Σigi = 1 no longer holds Shaffer et al. (2001) described a new implementation

where pi is the background frequency of amino acid i and p(i,j) is the frequency with amino acids i and j aligned through evolutionary descent

iii

jjji

gfQpjiqfg /),(

j jji

i

pjipffQ

/),(

(10.75)

(10.76)

Any questions?