  • Bioinformatics Algorithms

    David Hoksza

    http://siret.ms.mff.cuni.cz/hoksza

    Pairwise Sequence Similarity

  • Outline

    • Motivation

    • Basic string algorithms
      • Hamming distance
      • Edit distance

    • Domain-specific algorithms
      • Needleman-Wunsch
      • Smith-Waterman

    • Scoring matrices

    • Evolutionary rationale for sequence alignment

  • Motivation

    • Function prediction
      • Identification of possible functions of new molecules
      • Analysis of known molecules for previously unknown activity

    • Motif discovery and search
      • Identification of commonalities in a set of sequences
      • Function-related motifs

    • Construction of evolutionary trees
      • Pairwise sequence similarity is employed when determining the closeness of elements in the tree

    • A single step in a wide variety of complex algorithms and pipelines

  • Biological Sequences

    • DNAs, RNAs and proteins can be linearly ordered
      • sequence of nucleotides (DNA, RNA)
      • sequence of amino acids (proteins)

    • DNA and RNA molecules can be viewed as strings over alphabet of size 4 (ACGU/ACGT)

    • Protein molecules can be viewed as strings over alphabet of size 20 (ARNDCEQGHILKMFPSTWYV)

    • Stringology algorithms are often employed

  • Hamming Distance (HD)

    • The Hamming distance between two strings of the same length is the number of positions where the two strings differ (mismatches)

    • $HD(S_1, S_2) = \sum_{i=1}^{n} f(S_1[i], S_2[i])$, where $f(a, b) = \begin{cases} 0 & \iff a = b \\ 1 & \iff a \neq b \end{cases}$

    • Exercise
      o Algorithm?
      o INDUSTRY/INTEREST

    HD(
      G C G C A T G G A T T G A G C G A,
      T G C G C C A T T G A T G A C C A
    ) = 15
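The definition above translates almost directly into code. The following is a minimal Python sketch; the function name `hamming_distance` is an illustrative choice and does not come from the slides.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # HD is defined only for strings of the same length.
        raise ValueError("Hamming distance requires strings of equal length")
    # True counts as 1, so summing the comparisons counts the mismatches.
    return sum(a != b for a, b in zip(s1, s2))


print(hamming_distance("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))  # 15
```

The computation is a single pass over both strings, so it runs in time linear in the sequence length.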

  • Alignment

    A global alignment of a pair of sequences $(S_1, S_2)$ is defined as a pair of sequences $(G_1, G_2)$ where $G_i$ (defined over the same alphabet as $S_i$, enriched by the “-” symbol) is derived from $S_i$ by inserting “-” (gaps) at various positions. Moreover, $|G_1| = |G_2|$ and $\nexists i: G_1[i] = G_2[i] =$ “-”. If position $i$ contains “-”, it is called a gap position. A non-gap position $i$ where $G_1[i] = G_2[i]$ is called a match; otherwise ($G_1[i] \neq G_2[i]$, neither being “-”) it is called a mismatch.

    Let $(G_1, G_2)$ be a global alignment of a pair of sequences $(S_1, S_2)$ and let $L_1$ and $L_2$ be substrings of $G_1$ and $G_2$ such that $|L_1| = |L_2|$. Then $(L_1, L_2)$ forms a local alignment of $(S_1, S_2)$.

    Among all the possible alignments we are interested in the optimal one, i.e. the alignment with the best score (interpreted as sequence similarity). How an alignment is found and scored depends on the particular method (see the following sections); however, the score is always based on a sum over the scores of the aligned characters/gaps.

  • Alignment

    W R I T E R S

    V I N T N E R

    W R I T T E R S

    V I N T N E R

    W R I T E R S

    V I N T N E R

    Which one is best, and what does “best” mean?

  • Edit Distance (ED)

    • A drawback of HD is its sensitivity to sequence shift: a single insertion or deletion shifts all following positions and turns otherwise identical regions into mismatches

    • The edit (Levenshtein) distance of two strings is the minimum number of edit operations (insertions, deletions, substitutions) on individual characters needed to convert one string into the other.

    • Let $\mathcal{A}$ be the set of all possible alignments $(A_1, A_2)$ of sequences $S_1$ and $S_2$. The edit distance $ED$ is defined as

    $$ED(S_1, S_2) = \min_{(A_1, A_2) \in \mathcal{A}} HD(A_1, A_2)$$

    HD(
      G C G C A T T,
      T G C G C A T
    ) = 6

    HD(
      _ G C G C A T T,
      T G C G C A T _
    ) = 2

  • ED Computation (1)

    • ED can be computed using dynamic programming; using backtracking one can also output the alignments corresponding to the ED value

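    $$D(i, j) = \min \begin{cases} D(i-1, j) + 1 \\ D(i, j-1) + 1 \\ D(i-1, j-1) + \delta(i, j) \end{cases}, \qquad \delta(i, j) = \begin{cases} 0, & S_1(i) = S_2(j) \\ 1, & S_1(i) \neq S_2(j) \end{cases}$$

    $$D(i, 0) = i, \qquad D(0, j) = j$$

    • $D(i, j)$ expresses the ED of the $i$- and $j$-long prefixes of $S_1$ and $S_2$

    • $ED(S_1, S_2) = D(|S_1|, |S_2|)$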
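The recurrence can be filled in row by row. Below is a minimal Python sketch of that dynamic program; the function name `edit_distance` is an illustrative choice, and the table is stored as a plain list of lists.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Fill the table D, where D[i][j] is the edit distance between the
    i-long prefix of s1 and the j-long prefix of s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,          # gap in s2
                D[i][j - 1] + 1,          # gap in s1
                D[i - 1][j - 1] + delta,  # match or substitution
            )
    return D[n][m]


print(edit_distance("WRITERS", "VINTNER"))  # 5
```

Both running time and memory are proportional to the product of the two sequence lengths.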

  • ED Computation (2)

              W  R  I  T  E  R  S
           0  1  2  3  4  5  6  7
        V  1  1  2  3  4  5  6  7
        I  2  2  2  2  3  4  5  6
        N  3  3  3  3  3  4  5  6
        T  4  4  4  4  3  4  5  6
        N  5  5  5  5  4  4  5  6
        E  6  6  6  6  5  4  5  6
        R  7  7  6  7  6  5  4  5

    $ED(WRITERS, VINTNER) = D(7, 7) = 5$

    • With five operations, it is possible to transform string WRITERS to string VINTNER

  • ED - Backtracking

    • Add arrows (pointers) based on the following conditions

    • $D(i, j) \rightarrow D(i, j-1) \iff D(i, j) = D(i, j-1) + 1$

    • $D(i, j) \rightarrow D(i-1, j) \iff D(i, j) = D(i-1, j) + 1$

    • $D(i, j) \rightarrow D(i-1, j-1) \iff D(i, j) = D(i-1, j-1) + \delta(i, j)$

    • Follow the paths starting in the lower right corner to get all optimal alignments

              W  R  I  T  E  R  S
           0  1  2  3  4  5  6  7
        V  1  1  2  3  4  5  6  7
        I  2  2  2  2  3  4  5  6
        N  3  3  3  3  3  4  5  6
        T  4  4  4  4  3  4  5  6
        N  5  5  5  5  4  4  5  6
        E  6  6  6  6  5  4  5  6
        R  7  7  6  7  6  5  4  5

    W R I T - E R S
    V I N T N E R -

    W R I - T - E R S
    - V I N T N E R -

    W R I - T - E R S
    V - I N T N E R -
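The pointer conditions can be followed recursively to enumerate every optimal alignment. The sketch below is one possible Python rendering; the function name `optimal_alignments` and the use of “-” as the gap symbol are illustrative assumptions. It rebuilds the table and then walks back from the lower right corner, branching wherever more than one condition holds.

```python
def optimal_alignments(s1: str, s2: str) -> list[tuple[str, str]]:
    """Return all alignments of s1 and s2 that realise the edit distance."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + delta)

    results = []

    def walk(i: int, j: int, a1: str, a2: str) -> None:
        if i == 0 and j == 0:
            results.append((a1, a2))
            return
        # Each branch corresponds to one pointer (arrow) leaving cell (i, j).
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            walk(i - 1, j, s1[i - 1] + a1, "-" + a2)
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            walk(i, j - 1, "-" + a1, s2[j - 1] + a2)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            walk(i - 1, j - 1, s1[i - 1] + a1, s2[j - 1] + a2)

    walk(n, m, "", "")
    return results


for top, bottom in optimal_alignments("WRITERS", "VINTNER"):
    print(top)
    print(bottom)
    print()
```

For WRITERS and VINTNER the enumeration yields three alignments, each corresponding to five edit operations. The number of optimal alignments can grow exponentially for longer sequences, so a practical implementation would usually return only one of them.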

  • Operation-Weighted ED

    • Generalization of ED with arbitrary weight (cost, score) associated with each of the editing operations

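    $$D(i, j) = \min \begin{cases} D(i-1, j) + d \\ D(i, j-1) + d \\ D(i-1, j-1) + \delta(i, j) \end{cases}, \qquad \delta(i, j) = \begin{cases} e, & S_1(i) = S_2(j) \\ r, & S_1(i) \neq S_2(j) \end{cases}$$

    $$D(i, 0) = i \times d, \qquad D(0, j) = j \times d$$

    $$OWED(S_1, S_2) = D(|S_1|, |S_2|)$$

    • ED … $d = 1$, $e = 0$, $r = 1$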
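Only the constants change with respect to plain ED, which the following Python sketch makes explicit. The function name `weighted_edit_distance` is an illustrative choice; the parameters `d`, `e` and `r` follow the symbols of the recurrence, and their defaults reproduce ordinary edit distance.

```python
def weighted_edit_distance(s1: str, s2: str, d: float = 1, e: float = 0, r: float = 1) -> float:
    """Operation-weighted edit distance: d = indel cost, e = match cost,
    r = replacement (substitution) cost."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + delta)
    return D[n][m]


# With the default weights the result equals the plain edit distance.
print(weighted_edit_distance("WRITERS", "VINTNER"))  # 5
```

When `r` is larger than `2 * d`, a substitution is never chosen, because a deletion followed by an insertion is cheaper.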

  • OWED

    • Exercise

    • strings: WRITERS, VINTNER

    • $d = 2$, $e = 0$, $r = 3$

  • Alphabet-Weighted Edit Distance

    • Generalization of ED where not only the operations are weighted, but the cost of each substitution or indel (insertion/deletion) operation is a function of the elements participating in that operation

    • Exercise

    • formalization of AWED concept

    • Computation of AWED is identical to OWED (dynamic programming)

  • Global Sequence Alignment (GA)

    • The global sequence alignment problem is to find the alignment with the highest score among all possible global alignments of two sequences, where the similarity of a pair of symbols at a position in an alignment is a function of the aligned symbols

    • unlike in classical ED (or its variants), in the GA problem we work with similarity and not distance (better alignments have a higher similarity/score)

    • in a protein or nucleotide sequence alignment it is more favorable to have one longer gap than multiple short gaps, and therefore the opening and the extension of a gap should be penalized differently (affine gap penalty model)

    $$GA(S_1, S_2) = \max_{(A_1, A_2) \in \mathcal{A}} \left[ \sum_{i=1}^{|(A_1, A_2)|} s(A_1[i], A_2[i]) - \gamma_{open} \cdot \#gaps - \gamma_{extend} \cdot \#spaces \right]$$

    where $\mathcal{A}$ is the set of all global alignments $(A_1, A_2)$ of $S_1$ and $S_2$

  • Gap penalization in alignments

    • Optimally we would like the penalty function
      • To be a function of the individual symbols in the gap, in order to reflect the specific composition (sequence) of symbols including its length
      • To consider the neighborhood of the gap

    • Issues
      • No polynomial algorithm is available
      • Difficult to parametrize due to the lack of available data (see the slides about scoring systems)

    • Available models
      • Linear gap penalty model (e.g. edit distance)
      • Affine gap penalty model (Needleman-Wunsch/Smith-Waterman)

  • GA – Needleman-Wunsch (1)

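    $$G(i, j) = \max \begin{cases} H(i, j) \\ V(i, j) \\ G(i-1, j-1) + s(i, j) \end{cases}$$

    $$H(i, j) = \max \begin{cases} H(i-1, j) - \gamma_{extend} \\ G(i-1, j) - \gamma_{open} \end{cases}$$

    $$V(i, j) = \max \begin{cases} V(i, j-1) - \gamma_{extend} \\ G(i, j-1) - \gamma_{open} \end{cases}$$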

    $$G(i, 0) = V(i, 0) = -\gamma_{open} - (i-1) \times \gamma_{extend}, \qquad G(0, j) = H(0, j) = -\gamma_{open} - (j-1) \times \gamma_{extend}$$

    • Basically AWED, but expressed in terms of similarity rather than distance

    • Moreover, the affine gap penalty model is applied (as opposed to the linear one)
      • open gap penalty: penalty for opening a gap in the alignment
      • extend gap penalty: penalty for non-opening gap positions

    • The recursion formula is very similar to that of ED, but two additional matrices (to account for the affine gap penalty model) are used

    • Can be optimized to work with 1D arrays

  • GA – Needleman-Wunsch (3)

    [Figure: H matrix, V matrix, L matrix]

    source: Jones & Pevzner, An Introduction to Bioinformatics Algorithms


  • GA – Needleman-Wunsch (2)

    • Gap costs are positive
      • See later slides for how to obtain reasonable scoring systems for biological sequences

    • The value of the similarity (score of the alignment) is in the lower right corner of 𝐺

    • All optimal alignments can be found by backtracking the 𝐺 matrix (identically as in ED)
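A score-only version of the affine-gap recurrences might look as follows in Python. This is a sketch under stated assumptions rather than a reference implementation: the names `needleman_wunsch_score` and `match_score` are hypothetical, the toy scoring function (+1 match, -1 mismatch) stands in for a real substitution matrix, and border cells of H and V that cannot hold a gap in their own direction are set to minus infinity, which is a common convention and not something the slides state.

```python
NEG_INF = float("-inf")


def match_score(a: str, b: str) -> int:
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def needleman_wunsch_score(s1, s2, score, gap_open, gap_extend):
    """Global alignment score with affine gaps; gap costs are positive."""
    n, m = len(s1), len(s2)
    G = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best score overall
    H = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with s1[i] against a gap
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with s2[j] against a gap
    G[0][0] = 0.0
    for i in range(1, n + 1):
        H[i][0] = -gap_open - (i - 1) * gap_extend
        G[i][0] = H[i][0]
    for j in range(1, m + 1):
        V[0][j] = -gap_open - (j - 1) * gap_extend
        G[0][j] = V[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j] - gap_extend, G[i - 1][j] - gap_open)
            V[i][j] = max(V[i][j - 1] - gap_extend, G[i][j - 1] - gap_open)
            G[i][j] = max(H[i][j], V[i][j],
                          G[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]))
    return G[n][m]


# One gap of length two (cost 2 + 1) plus two matches.
print(needleman_wunsch_score("ACGT", "AT", match_score, 2, 1))  # -1.0
```

The similarity is read from the lower right cell of G. Recovering the alignment itself would additionally require storing, for each cell, which of the three matrices supplied the maximum.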

  • Local Sequence Alignment

    • Motivation
      • Even sequences not sharing common global characteristics can be functionally similar

    • Two globally dissimilar proteins can share a common core due to the evolutionary nature of protein similarity
      • The residues corresponding to the protein active site are more stable with respect to mutation than the rest of the protein
      • a mutation in the active site can easily lead to malfunction of the given protein
      • stability of the active site residues is “propagated” back to the DNA sequence coding for that protein

    • Local sequence alignment (LA) enables one to align only substrings sharing high similarity
      • can be viewed as GA over all pairs of substrings

  • LA - example

    source: Jones & Pevzner, An Introduction to Bioinformatics Algorithms

  • LA – Smith-Waterman

    • Modification of NW such that it allows alignments to start at an arbitrary position (free ride)

    • The dynamic programming matrix is modified by not allowing negative values
      • alignments passing through a cell with a negative value would have a lower overall score than those starting at that location

    • The alignment is obtained by backtracking from the highest-scoring cell to a cell having score 0

    • The expected score of randomly aligned residues (random match) needs to be negative

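    $$L(i, j) = \max \begin{cases} H(i, j) \\ V(i, j) \\ L(i-1, j-1) + s(i, j) \\ 0 \end{cases}$$

    $$H(i, j) = \max \begin{cases} H(i-1, j) - \gamma_{extend} \\ L(i-1, j) - \gamma_{open} \end{cases}$$

    $$V(i, j) = \max \begin{cases} V(i, j-1) - \gamma_{extend} \\ L(i, j-1) - \gamma_{open} \end{cases}$$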

    $$L(i, 0) = V(i, 0) = 0, \qquad L(0, j) = H(0, j) = 0$$
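The local variant can be sketched in Python as a small change to the global recurrences: every cell of L is clamped at zero and the result is the maximum over the whole matrix. The names `smith_waterman_score` and `match_score` are hypothetical, the +1/-1 scoring function is a toy stand-in for a substitution matrix, and the border handling (L borders equal to 0, H and V borders equal to minus infinity) is an assumption of this sketch.

```python
NEG_INF = float("-inf")


def match_score(a: str, b: str) -> int:
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def smith_waterman_score(s1, s2, score, gap_open, gap_extend):
    """Best local alignment score with affine gaps; gap costs are positive."""
    n, m = len(s1), len(s2)
    L = [[0.0] * (m + 1) for _ in range(n + 1)]
    H = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j] - gap_extend, L[i - 1][j] - gap_open)
            V[i][j] = max(V[i][j - 1] - gap_extend, L[i][j - 1] - gap_open)
            # The extra 0 lets an alignment start afresh at any cell.
            L[i][j] = max(0.0, H[i][j], V[i][j],
                          L[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]))
            best = max(best, L[i][j])
    return best


# The shared core ACGT is found despite the dissimilar flanks.
print(smith_waterman_score("GGACGTCC", "TTACGTAA", match_score, 2, 1))  # 4.0
```

To recover the aligned substrings, one would remember the position of the best cell and backtrack from it until a cell with score 0 is reached.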

  • Other flavors for special use cases – overlap matches (semi-global alignment)

    • Do not penalize overhanging matches

    • Simply initialize 𝐺(𝑖, 0) and 𝐺(0, 𝑗) to 0 and do not penalize first/last row and column + adjust traceback accordingly

    • Repeated matches version (see the following slide) can be modified to the semi-global version as well


    source: Durbin et al. Probabilistic sequence analysis
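The overlap variant can be illustrated with a short Python sketch. To keep it brief it uses a linear gap penalty instead of the affine model, which is a simplification made here and not part of the slides; the names `overlap_score`, `match_score` and `gap` are likewise illustrative. The first row and column are initialized to 0, and the result is the best value found in the last row or last column, so overhanging ends cost nothing.

```python
def match_score(a: str, b: str) -> int:
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def overlap_score(s1, s2, score, gap):
    """Semi-global (overlap) alignment score with a linear gap penalty."""
    n, m = len(s1), len(s2)
    # Zero borders: leading overhangs are free.
    G = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(
                G[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                G[i - 1][j] - gap,
                G[i][j - 1] - gap,
            )
    # Trailing overhangs are free: take the best cell on the last row or column.
    last_row = max(G[n])
    last_column = max(row[m] for row in G)
    return max(last_row, last_column)


# The suffix ACGT of the first sequence overlaps the prefix of the second.
print(overlap_score("TTTTACGT", "ACGTCCCC", match_score, 2))  # 4.0
```

A traceback for this variant would start at the maximal border cell and stop as soon as it reaches the first row or column.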

  • Other flavors for special use cases – repeated matches

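    • Retrieve all matches with a significant score (e.g. multiple motifs) → threshold $t$

    [Figure: repeated-matches alignment example; $t = 20$, dots = unaligned positions]

    $$L(i, j) = \max \begin{cases} L(i, 0) \\ L(i-1, j-1) + s(i, j) \\ L(i, j-1) - \gamma \\ L(i-1, j) - \gamma \end{cases}$$

    $$L(i, 0) = \max \begin{cases} L(i-1, 0) \\ L(i-1, j) - t, & j = 1, \dots, |S_2| \end{cases}, \qquad L(0, 0) = 0$$

    source: Durbin et al. Probabilistic sequence analysis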

  • Optimizations

    • Space
      • The score can be computed in $O(m)$ space instead of $O(n \times m)$, if only the score is required
      • Hirschberg’s algorithm also yields the alignment itself in $O(n + m)$ space

    • Time
      • Using the “Four Russians” technique, the distance can be computed in $O(n^2 / \log n)$ time
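The space saving for the score-only case can be sketched in Python using edit distance as the example. Each row of the table depends only on the row above it, so two rows suffice; the function name `edit_distance_two_rows` is an illustrative choice.

```python
def edit_distance_two_rows(s1: str, s2: str) -> int:
    """Edit distance using memory proportional to the shorter sequence."""
    if len(s2) > len(s1):
        s1, s2 = s2, s1  # keep the shorter sequence along the row
    previous = list(range(len(s2) + 1))  # row 0 of the table: D(0, j) = j
    for i, a in enumerate(s1, start=1):
        current = [i]  # D(i, 0) = i
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # D(i-1, j) + 1
                current[j - 1] + 1,          # D(i, j-1) + 1
                previous[j - 1] + (a != b),  # D(i-1, j-1) + delta
            ))
        previous = current
    return previous[-1]


print(edit_distance_two_rows("WRITERS", "VINTNER"))  # 5
```

Because earlier rows are discarded, this version cannot backtrack; recovering the alignment within linear space is what Hirschberg's divide-and-conquer scheme adds.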

  • Evolutionary nature of sequence alignment

    • We align sequences that we assume to be evolutionarily related, i.e. to have a common ancestor sequence

    • Mismatches and gaps accumulated over time, resulting in the two observed sequences

    • The correct alignment is the one which matches positions that originate in the same amino acid position in the common ancestor

  • Scoring system (1)

    • How to define scores for the residue pairs and gaps?

    • Probabilistic model
      • Given an aligned pair of sequences (alignment), determine the relative likelihood that the sequences are related as opposed to being unrelated

    • Random model $R$ (sequences $S_1$ and $S_2$ unrelated); probability of the alignment under the random model:

    $$P(S_1, S_2 | R) = \prod_i f(S_1[i]) \prod_j f(S_2[j])$$

      • $f(a)$: probability/frequency of letter $a$ occurring in a sequence

    • Match model $M$ (sequences $S_1$ and $S_2$ related, i.e. having the same origin); probability of the alignment under the match model:

    $$P(S_1, S_2 | M) = \prod_i p_{S_1[i], S_2[i]}$$

      • $p_{ab}$: joint probability that residues $a$ and $b$ have been independently derived from some unknown origin

  • Scoring system (2)

    • The ratio of the match and random models → the odds ratio

    $$\frac{P(S_1, S_2 | M)}{P(S_1, S_2 | R)} = \frac{\prod_i p_{S_1[i], S_2[i]}}{\prod_i f(S_1[i]) \prod_i f(S_2[i])} = \prod_i \frac{p_{S_1[i], S_2[i]}}{f(S_1[i]) f(S_2[i])}$$

    • To arrive at an additive scoring system → the log odds ratio

    $$S = \sum_i s(S_1[i], S_2[i]), \qquad s(a, b) = \log \left( \frac{p_{ab}}{f(a) f(b)} \right)$$

    We want the score for an alignment to be as high as possible for related sequences and as low as possible for unrelated sequences → odds ratio
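A small numeric illustration of the log-odds score, written as a Python sketch. All numbers are made up for a two-letter alphabet: the background frequencies `f`, the joint probabilities `p`, and the choice of base-2 logarithms are assumptions for illustration only, and the function handles ungapped alignments only.

```python
from math import log2

# Hypothetical background frequencies and joint (match-model) probabilities.
f = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds(a: str, b: str) -> float:
    """s(a, b) = log( p_ab / (f(a) f(b)) ), here with a base-2 logarithm."""
    return log2(p[(a, b)] / (f[a] * f[b]))


def alignment_score(a1: str, a2: str) -> float:
    """Additive score of an ungapped alignment: the sum of per-column log-odds."""
    return sum(log_odds(x, y) for x, y in zip(a1, a2))


print(log_odds("A", "A") > 0)  # True: identical pairs are more likely than chance
print(log_odds("A", "B") < 0)  # True: mixed pairs are less likely than chance
print(alignment_score("AAB", "ABB"))
```

Taking the logarithm turns the product of per-column odds ratios into a sum, which is what makes the score usable inside the dynamic programming recurrences.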

  • Scoring system (3)

    • How to determine the substitution probabilities?

    • The simplest approach to define $s(a, b)$ would be to count the frequencies of the aligned residue pairs in confirmed alignments → normalized $p_{ab}$, $f(a)$ (including gaps)

    • Issues
      • Obtaining a random sample of confirmed alignments
      • Different pairs of sequences have diverged by different amounts
        • Recent divergence from a common ancestor → $p_{ab}$ ($a \neq b$) should be small → $s(a, b)$ should be strongly negative
        • Ancient divergence from a common ancestor → $p_{ab}$ approaches $f(a) f(b)$, i.e. the odds ratio should be close to one for all $a$ and $b$ → $s(a, b)$ should be close to zero
        • Thus scores should depend on the expected divergence of the sequences to be aligned

  • Substitution matrices

    • PAM
      • Point/Percent Accepted Mutation (Dayhoff et al.)

    • BLOSUM
      • BLOckS SUbstitution Matrix (Henikoff et al.)

    • For evolutionary studies, PAM 250 or BLOSUM 62 matrices are often used as the default ones

  • PAM matrices (1)

    • The idea of PAM matrices is to obtain substitution data from similar sequences (PAM 1) and to extrapolate that knowledge to longer evolutionary distances (PAM n)

    • PAM unit
      • The amount of evolution that will on average change 1% of the amino acids within a protein sequence
      • Two sequences $S_1$ and $S_2$ are defined as being one PAM unit diverged if a series of accepted point mutations has converted $S_1$ to $S_2$ with an average of one accepted point-mutation event per one hundred amino acids
        • An accepted point mutation is a non-lethal, non-silent mutation

    • One PAM unit divergence does not mean one percent sequence difference, because one position can undergo multiple mutations
      • even sequences which are 100 PAM units diverged do not differ in every position
      • sequences that are 200 PAM units diverged are expected to be identical in about 25% of positions

  • PAM matrices (2)

    • PAM matrices are scoring matrices encoding the expected evolutionary change at the AA level

    • Ideal PAM $n$ matrix construction
      • collect sequences known to be $n$ PAM units diverged (point mutations only)
      • align each pair of sequences
      • compute $f(i, j)$, the number of positions where $A_i$ aligns with $A_j$ divided by the total number of pairs in all the aligned data
      • let $f(i)$ be the frequency of $A_i$ in all the sequences

    $$PAM_n(i, j) = \log \frac{f(i, j)}{f(i) f(j)}$$

  • PAM matrices (3)

    • Actual PAM $n$ matrix construction
      • When insertions and deletions are considered, the correct correspondence needs to consider the introduction of gaps
        • Gap distribution could be obtained using an alignment, but that, in turn, requires an existing substitution matrix

    • Take highly similar sequences (Dayhoff et al. took phylogenetic trees of 71 families)
      • Sequences with low divergence are essentially the same length, or the gap positions are evident

    • Compute $M_1(i, j)$, the probability that $A_i$ will mutate into $A_j$ in one PAM unit
      • $M_n = M_1^n$ is then the probability matrix for $n$ PAM units
      • Probabilities in $M_n$ are modeled by a Markov chain; thus multiplying $M_n$ by $M_1$ gives, for each amino acid pair $(i, j)$, the probability that $i$ will turn into $j$ through any intermediate amino acids in $n + 1$ steps

    $$PAM_n(i, j) = \log \frac{f(i) M_n(i, j)}{f(i) f(j)} = \log \frac{M_n(i, j)}{f(j)}$$
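The extrapolation step can be illustrated with a toy Python sketch. The two-letter alphabet, the one-step matrix `M1` and the frequencies `f` are invented for illustration; the helper names `mat_mul`, `mat_pow` and `pam_scores` are hypothetical, and a base-10 logarithm without any scaling or rounding is an assumption, since the formula above leaves the base unspecified.

```python
from math import log10


def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(M, n):
    """M raised to the n-th power by repeated multiplication (n >= 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def pam_scores(M1, f, n):
    """Log-odds matrix log( M_n(i, j) / f(j) ) with M_n = M1 ** n."""
    Mn = mat_pow(M1, n)
    size = len(M1)
    return [[log10(Mn[i][j] / f[j]) for j in range(size)] for i in range(size)]


# Toy data: each residue changes with probability 1% per PAM unit.
M1 = [[0.99, 0.01], [0.01, 0.99]]
f = [0.5, 0.5]

print(pam_scores(M1, f, 1))
print(pam_scores(M1, f, 250))  # all scores drift towards zero
```

With growing $n$ the rows of the power matrix approach the background frequencies, so all scores tend to zero, which reflects the fading signal between anciently diverged sequences.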

  • PAM matrices (4)

    • Assumptions

    • Sites evolve independently

    • The frequencies of AAs remain constant over time

    • The mutational process causing replacements in an interval of one PAM unit is the same for longer periods

    • Evolution of a single site is independent of previous mutations

  • BLOSUM matrices (1)

    • PAM matrices for high N are derived from N=1 by raising to the corresponding power, which does not capture well the true difference between short-time and long-time substitutions

    • PROSITE
      • dictionary of biologically significant sites and patterns in proteins

    • BLOCKS
      • database of protein motifs derived from the PROSITE library
      • consists of blocks of contiguous intervals in multiple sequence alignments
      • for each PROSITE group, a multiple sequence alignment of the sequences in that group was formed, and high sequence similarity segments of up to 60 positions were identified

  • BLOSUM matrices (2)

    • Define $P$ as the set of all pairs of positions in BLOCKS such that each pair comes from the same column in the same block
      • if each block contains $k$ rows and there are $n$ columns in total, $|P| = n \binom{k}{2}$

    • Let $n(i, j)$ be the relative number of all pairs in $P$ containing amino acids $i$ and $j$ (observed)

    • Let $f(i)$ be the fraction of positions containing amino acid $i$

    • Let $e(i, j) = f(i) f(j)$ (expected)

    $$BLOSUM(i, j) = 2 \log_2 \frac{n(i, j)}{e(i, j)}$$

    • BLOSUM $n$ is then obtained by removing highly similar rows, i.e., rows having sequence similarity higher than $n$%
      • BLOSUM 62 is formed by successive removal of sequences so that finally every pair of sequences is at most 62% similar

  • BLOSUM - Example

    • Let us have three AA residues A, B, C

    Block:

      B A B A
      A A A C
      A A C C
      A A B A
      A A C C
      A A B C

    AA   Observed freq. f(i)
    A    14/24
    B    4/24
    C    6/24

    AA pair   Observed freq. n(i,j) (OF)   Expected freq. e(i,j) (EF)   2log(OF/EF)
    A to A    26/60                        14/24 × 14/24 = 196/576       0.70
    A to B    8/60                         14/24 × 4/24 = 112/576       -1.09
    A to C    10/60                        14/24 × 6/24 = 168/576       -1.61
    B to B    3/60                         4/24 × 4/24 = 16/576          1.70
    B to C    6/60                         4/24 × 6/24 = 48/576          0.53
    C to C    7/60                         6/24 × 6/24 = 36/576          1.80

    BLOSUM (rounded scores):

          A    B    C
      A   1   -1   -2
      B  -1    2    1
      C  -2    1    2

    Credit: Sorin Istrail (Brown University)