Computational Genomics Lecture #3


Page 1: Computational Genomics Lecture #3

Computational Genomics Lecture #3

Much of this class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.

Background Readings: Chapters 2.5, 2.7 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Chapters 3.5.1-3.5.3, 3.6.2 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997. Chapter 15 in Gusfield’s book.

1. Crash intro to ML
2. Scoring functions for DNA and AAs
3. Multiple sequence alignment

Page 2: Computational Genomics Lecture #3

Scoring Functions, Reminder

• So far, we discussed dynamic programming algorithms for
– global alignment
– local alignment

• All of these assumed a scoring function

$\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$

that determines the value of perfect matches, substitutions, insertions, and deletions.

Page 3: Computational Genomics Lecture #3

Where does the scoring function come from?

We have defined an additive scoring function by specifying a function $\sigma(\cdot,\cdot)$ such that
• $\sigma(x,y)$ is the score of replacing x by y
• $\sigma(x,-)$ is the score of deleting x
• $\sigma(-,x)$ is the score of inserting x

But how do we come up with the “correct” score?

Answer: By encoding experience of which sequences are similar for the task at hand. Similarity depends on time, evolution trends, and sequence types.

Page 4: Computational Genomics Lecture #3

Why is a probabilistic setting appropriate to define and interpret a scoring function?

• Similarity is probabilistic in nature because biological changes like mutation, recombination, and selection are random events.

• We could answer questions such as:
– How probable is it for two sequences to be similar?
– Is the similarity found significant or spurious?
– How to change a similarity score when, say, the mutation rate of a specific area on the chromosome becomes known?

Page 5: Computational Genomics Lecture #3

A Probabilistic Model

• For starters, we will focus on alignment without indels.

• For now, we assume each position (nucleotide/amino acid) is independent of other positions.

• We consider two options:
M: the sequences are Matched (related)
R: the sequences are Random (unrelated)

Page 6: Computational Genomics Lecture #3

Unrelated Sequences

• Our random model R of unrelated sequences is simple
– Each position is sampled independently from a distribution over the alphabet $\Sigma$
– We assume there is a distribution $q(\cdot)$ that describes the probability of letters in such positions.

• Then:

$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$

Page 7: Computational Genomics Lecture #3

Related Sequences

• We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor
• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.

$P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i])$

Compare to:

$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$

Page 8: Computational Genomics Lecture #3

Odds-Ratio Test for Alignment

$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \frac{\prod_i p(s[i],t[i])}{\prod_i q(s[i])\, q(t[i])} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$

If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R).

If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

Page 9: Computational Genomics Lecture #3

Log Odds-Ratio Test for Alignment

Taking logarithm of Q yields

$\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \log \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])} = \sum_i \log \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$

where each summand is Score(s[i],t[i]).

If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.

How can we relate this quantity to a score function?

Page 10: Computational Genomics Lecture #3

Probabilistic Interpretation of Scores

• We define the scoring function via

$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$

• Then, the score of an alignment is the log-ratio between the two models:
– Score > 0 ⇒ Model is more likely
– Score < 0 ⇒ Random is more likely
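As a rough illustration of the definition on this slide, the sketch below builds the log-odds scores from a pair distribution p and a background distribution q, then scores a gap-free alignment by summing over positions. The function names and the two-letter toy distributions are assumptions made up for the example; they are not taken from the lecture.

```python
import math

def log_odds_matrix(p, q):
    """Return sigma(a, b) = log(p(a, b) / (q(a) * q(b))) for every pair in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}

def alignment_score(s, t, sigma):
    """Sum sigma over the aligned positions of two equal-length, gap-free strings."""
    if len(s) != len(t):
        raise ValueError("a gap-free alignment needs strings of equal length")
    return sum(sigma[(a, b)] for a, b in zip(s, t))

# Hypothetical toy distributions over a two-letter alphabet (illustration only).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

sigma = log_odds_matrix(p, q)
score = alignment_score("AAB", "ABB", sigma)  # equals log Q for this pair
```

With these numbers, matches get a positive score and mismatches a negative one, so the sign of the total plays the role of the Q > 1 versus Q < 1 test.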

Page 11: Computational Genomics Lecture #3

Modeling Assumptions

• It is important to note that this interpretation depends on our modeling assumption!

• For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

• If we assume, for proteins, some joint distribution of letters that are nearby in 3D space after protein folding, then the likelihood ratio will again be different.

Page 12: Computational Genomics Lecture #3

Estimating Probabilities

• Suppose we are given a long string s[1..n] of letters from $\Sigma$
• We want to estimate the distribution q(·) that generated the sequence
• How should we go about this?

We build on the theory of parameter estimation in statistics, using either maximum likelihood estimation (today) or the Bayesian approach (later on).

Page 13: Computational Genomics Lecture #3

Estimating q(·)

• Suppose we are given a long string s[1..n] of letters from $\Sigma$
– s can be the concatenation of all sequences in our database

• We want to estimate the distribution q(·)

• That is, q is defined per single letter

Likelihood function:

$L(s \mid q) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$

where $N_a$ is the number of occurrences of the letter a in s.

Page 14: Computational Genomics Lecture #3

Estimating q(·) (cont.)

Likelihood function:

$L(s \mid q) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$

How do we define q? Intuitively,

$q(a) = \frac{N_a}{n}$

These are the ML (Maximum Likelihood) parameters.
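A minimal sketch of this estimate, assuming the training string is given as a plain Python string: count each letter and divide by the length. The helper names and the short example string are hypothetical.

```python
import math
from collections import Counter

def estimate_q(s):
    """ML estimate of the letter distribution: q(a) = N_a / n."""
    counts = Counter(s)
    n = len(s)
    return {a: counts[a] / n for a in counts}

def log_likelihood(s, q):
    """log L(s | q) = sum over positions of log q(s[i])."""
    return sum(math.log(q[a]) for a in s)

s = "ACGTACGGAA"          # made-up training string
q_hat = estimate_q(s)     # A: 4/10, C: 2/10, G: 3/10, T: 1/10
uniform = {a: 0.25 for a in "ACGT"}
```

On this string, the estimated q gives a higher log-likelihood than the uniform distribution, which is what maximizing the likelihood should deliver.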

Page 15: Computational Genomics Lecture #3

Crash Course on Maximum Likelihood: Binomial Experiment

When tossed, this device (“thumbtack”) can land in one of two positions: Head or Tail

Head Tail

We denote by $\theta$ the (unknown) probability P(H).

Estimation task: Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = $\theta$ and P(T) = 1 - $\theta$

Page 16: Computational Genomics Lecture #3

Statistical Parameter Fitting

• Consider instances x[1], x[2], …, x[M], such that
– The set of values that x can take is known
– Each is sampled from the same distribution
– Each is sampled independently of the rest
(that is, the instances are i.i.d. samples)

The task is to find a vector of parameters $\theta$ that has generated the given data. This parameter vector can be used to predict future data.

Page 17: Computational Genomics Lecture #3

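The Likelihood Function

• How good is a particular $\theta$? It depends on how likely it is to generate the observed data:

$L_D(\theta) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$

• The likelihood for the sequence H,T,T,H,H is

$L_D(\theta) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta$

[Figure: plot of L(θ) for θ between 0 and 1]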

Page 18: Computational Genomics Lecture #3

Sufficient Statistics

• To compute the likelihood in the thumbtack example we only require $N_H$ and $N_T$ (the number of heads and the number of tails):

$L_D(\theta) = \theta^{N_H} (1-\theta)^{N_T}$

• $N_H$ and $N_T$ are sufficient statistics for the binomial distribution

Page 19: Computational Genomics Lecture #3

Sufficient Statistics

• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood

[Figure: datasets mapped to statistics]

Formally, s(D) is a sufficient statistic if for any two datasets D and D’:

$s(D) = s(D') \Rightarrow L_D(\theta) = L_{D'}(\theta)$

Page 20: Computational Genomics Lecture #3

Maximum Likelihood Estimation

MLE Principle:

Choose parameters that maximize the likelihood function

• This is one of the most commonly used estimators in statistics

• Intuitively appealing

• One usually maximizes the log-likelihood function, defined as $l_D(\theta) = \log_e L_D(\theta)$

Page 21: Computational Genomics Lecture #3

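Example: MLE in Binomial Data

• Applying the MLE principle (taking the derivative of the log-likelihood and setting it to zero), we get

$l_D(\theta) = N_H \log \theta + N_T \log (1-\theta)$

$\frac{N_H}{\theta} = \frac{N_T}{1-\theta} \;\Rightarrow\; \hat{\theta} = \frac{N_H}{N_H + N_T}$

Example: $(N_H, N_T) = (3,2)$

MLE estimate is 3/5 = 0.6

[Figure: plot of L(θ) for θ between 0 and 1]

(Which coincides with what one would expect)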
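As a quick numerical sanity check of the closed form, the sketch below maximizes the binomial log-likelihood over a grid of candidate values for the example (N_H, N_T) = (3, 2). The grid resolution and variable names are arbitrary choices for illustration.

```python
import math

def log_likelihood(theta, n_heads, n_tails):
    """l_D(theta) = N_H * log(theta) + N_T * log(1 - theta)."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

n_heads, n_tails = 3, 2

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda th: log_likelihood(th, n_heads, n_tails))

# Closed-form MLE from the derivation: N_H / (N_H + N_T).
theta_closed = n_heads / (n_heads + n_tails)
```

Both routes land on 0.6 here; the grid search is only a check, since the closed form is exact.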

Page 22: Computational Genomics Lecture #3

Estimating p(·,·)

Intuition:

• Find a pair of aligned sequences s[1..n], t[1..n],

• Estimate the probability of pairs:

$p(a,b) = \frac{N_{a,b}}{n}$, where $N_{a,b}$ is the number of times a is aligned with b in (s,t)

• The sequences s and t can be the concatenation of many aligned pairs from the database
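The counting estimate can be sketched in a few lines, assuming the aligned pair is given as two equal-length, gap-free strings. The function name and the example strings are made up; note that this version counts ordered pairs (a, b), so obtaining a symmetric matrix would need an extra symmetrization step that is not shown.

```python
from collections import Counter

def estimate_p(s, t):
    """ML estimate of the pair distribution: p(a, b) = N_ab / n."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    n = len(s)
    counts = Counter(zip(s, t))   # N_ab for each ordered aligned pair
    return {pair: c / n for pair, c in counts.items()}

# Hypothetical aligned pair, six positions long.
p_hat = estimate_p("ACGTAC", "ACGAAC")
```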

Page 23: Computational Genomics Lecture #3

Problems in Estimating p(·,·)

• How do we find pairs of aligned sequences?

• How far is the ancestor?
– earlier divergence ⇒ low sequence similarity
– recent divergence ⇒ high sequence similarity

• Does one letter mutate to the other, or are they both mutations of a common ancestor having yet another residue/nucleotide?

Page 24: Computational Genomics Lecture #3

Scoring Matrices

Deal with DNA first (simpler), then AA (not too bad either)

Page 25: Computational Genomics Lecture #3

What is it & why ?

• Let the alphabet contain N letters
– N = 4 for nucleotides and N = 20 for amino acids

• N x N matrix
• Entry (i,j) shows the relationship between the i-th and j-th letters.
– Positive number if letter i is likely to mutate into letter j
– Negative otherwise
– Magnitude shows the degree of proximity

• Symmetric

Page 26: Computational Genomics Lecture #3

Scoring Matrices for DNA

Identity:

     A   C   G   T
A    1   0   0   0
C    0   1   0   0
G    0   0   1   0
T    0   0   0   1

BLAST:

     A   C   G   T
A    1  -3  -3  -3
C   -3   1  -3  -3
G   -3  -3   1  -3
T   -3  -3  -3   1

Transitions & transversions:

     A   C   G   T
A    1  -5  -1  -5
C   -5   1  -5  -1
G   -1  -5   1  -5
T   -5  -1  -5   1

Page 27: Computational Genomics Lecture #3

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -2  -2   0
R  -2   7   0  -1  -3   1   0  -2   0  -3  -2   3  -1  -2  -2  -1  -1  -2  -1  -2
N  -1   0   6   2  -2   0   0   0   1  -2  -3   0  -2  -2  -2   1   0  -4  -2  -3
D  -2  -1   2   7  -3   0   2  -1   0  -4  -3   0  -3  -4  -1   0  -1  -4  -2  -3
C  -1  -3  -2  -3  12  -3  -3  -3  -3  -3  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   6   2  -2   1  -2  -2   1   0  -4  -1   0  -1  -2  -1  -3
E  -1   0   0   2  -3   2   6  -2   0  -3  -2   1  -2  -3   0   0  -1  -3  -2  -3
G   0  -2   0  -1  -3  -2  -2   7  -2  -4  -3  -2  -2  -3  -2   0  -2  -2  -3  -3
H  -2   0   1   0  -3   1   0  -2  10  -3  -2  -1   0  -2  -2  -1  -2  -3   2  -3
I  -1  -3  -2  -4  -3  -2  -3  -4  -3   5   2  -3   2   0  -2  -2  -1  -2   0   3
L  -1  -2  -3  -3  -2  -2  -2  -3  -2   2   5  -3   2   1  -3  -3  -1  -2   0   1
K  -1   3   0   0  -3   1   1  -2  -1  -3  -3   5  -1  -3  -1  -1  -1  -2  -1  -2
M  -1  -1  -2  -3  -2   0  -2  -2   0   2   2  -1   6   0  -2  -2  -1  -2   0   1
F  -2  -2  -2  -4  -2  -4  -3  -3  -2   0   1  -3   0   8  -3  -2  -1   1   3   0
P  -1  -2  -2  -1  -4  -1   0  -2  -2  -2  -3  -1  -2  -3   9  -1  -1  -3  -3  -3
S   1  -1   1   0  -1   0   0   0  -1  -2  -3  -1  -2  -2  -1   4   2  -4  -2  -1
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -1  -1   2   5  -3  -1   0
W  -2  -2  -4  -4  -5  -2  -3  -2  -3  -2  -2  -2  -2   1  -3  -4  -3  15   3  -3
Y  -2  -1  -2  -2  -3  -1  -2  -3   2   0   0  -1   0   3  -3  -2  -1   3   8  -1
V   0  -2  -3  -3  -1  -3  -3  -3  -3   3   1  -2   1   0  -3  -1   0  -3  -1   5

The BLOSUM45 Matrix

Page 28: Computational Genomics Lecture #3

Scoring Matrices for Amino Acids

• Chemical similarities
– Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
– Polar, Hydrophilic (S, T, C, Y, N, Q)
– Electrically charged (D, E, K, R, H)
– Requires expert knowledge

• Genetic code: Nucleotide substitutions
– E: GAA, GAG
– D: GAU, GAC
– F: UUU, UUC

• Actual substitutions
– PAM
– BLOSUM

Page 29: Computational Genomics Lecture #3

Scoring Matrices: Actual Substitutions

• Manually align proteins

• Look for amino acid substitutions

• Entry ~ log (freq(observed) / freq(expected))

• Log-odds matrices

Page 30: Computational Genomics Lecture #3

BLOSUM BLOcks Substitution Matrices

Henikoff & Henikoff, 1992

Next slides taken from lecture notes by Tamer Kahveci, CISE Department, University of Florida (www.cise.ufl.edu/~tamer/teaching/fall2004/lectures/03-CAP5510-Fall04.ppt)

Page 31: Computational Genomics Lecture #3

BLOSUM Matrix

• Begin with a set of protein sequences and obtain aligned blocks.
– ~2000 blocks from 500 families of related proteins

• A block is the ungapped alignment of a highly conserved region of a family of proteins.

• The MOTIF program is used to find blocks
• Substitutions in these blocks are used to compute the BLOSUM matrix

block 1      block 2              block 3
WWYIR        CASILRKIYIYGPV       GVSRLRTAYGGRKNRG
WFYVR   …    CASILRHLYHRSPA   …   GVGSITKIYGGRKRNG
WYYVR        AAAVARHIYLRKTV       GVGRLRKVHGSTKNRG
WYFIR        AASICRHLYIRSPA       GIGSFEKIYGGRRRRG

Page 32: Computational Genomics Lecture #3

Constructing the Matrix

• Count the frequency of occurrence of each amino acid. This gives the background distribution $p_a$

• Count the number of times amino acid a is aligned with amino acid b: $f_{ab}$
– A block of width w and depth s contributes ws(s-1)/2 pairs.
– Denote by $n_p$ the total number of pairs.

• Compute the occurrence probability of each pair: $q_{ab} = f_{ab} / n_p$

• Compute the expected probability of occurrence of each pair:

$e_{ab} = 2 p_a p_b$ if $a \ne b$, and $e_{ab} = p_a p_b$ otherwise

• Compute twice (?) the log likelihood ratios, normalize, and round to nearest integer:

$2 \log_2 (q_{ab} / e_{ab})$
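The counting steps on this slide can be sketched as follows. This is only an illustration under simplifying assumptions: blocks are passed as lists of equal-length strings, no clustering by percent identity is done, pairs that never occur get no score (their log-odds would be undefined), and the function name and the toy block are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(blocks):
    """blocks: list of blocks, each a list of equal-length ungapped strings."""
    pair_counts = Counter()      # f_ab, stored as unordered pairs (a <= b)
    residue_counts = Counter()   # used for the background distribution p_a
    for block in blocks:
        for column in zip(*block):               # one alignment column at a time
            residue_counts.update(column)
            for a, b in combinations(column, 2): # s(s-1)/2 pairs per column
                pair_counts[tuple(sorted((a, b)))] += 1
    n_pairs = sum(pair_counts.values())          # n_p
    n_residues = sum(residue_counts.values())
    p = {a: c / n_residues for a, c in residue_counts.items()}
    scores = {}
    for (a, b), f_ab in pair_counts.items():
        q_ab = f_ab / n_pairs
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q_ab / e_ab))
    return scores, n_pairs

# Toy block of width 2 and depth 3: contributes 2 * 3 * 2 / 2 = 6 pairs.
scores, n_pairs = blosum_like_scores([["AC", "AC", "AD"]])
```

In the toy block the A/A pair is observed twice as often as expected, so its score is 2 * log2(2) = 2.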

Page 33: Computational Genomics Lecture #3

Computation of BLOSUM-X

• The amount of similarity in blocks has a great effect on the BLOSUM score. BLOSUM-X is generated by taking only blocks with X% identity.

• For example, a BLOSUM62 matrix is calculated from protein blocks with 62% identity.

• So BLOSUM80 represents closer sequences (more recent divergence) than BLOSUM62.

• On the web, BLAST uses BLOSUM80, BLOSUM62 (the default), or BLOSUM45.

Page 34: Computational Genomics Lecture #3

BLOSUM 62 Matrix

Check scores for:

M I L V - small hydrophobic
N D E Q - acid, hydrophilic
H R K - basic
F Y W - aromatic
S T P A G - small hydrophilic
C - sulphydryl

Page 35: Computational Genomics Lecture #3

PAM vs. BLOSUM

Equivalent PAM and BLOSUM matrices:

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

BLOSUM62 is the default matrix to use.

Page 36: Computational Genomics Lecture #3

And Now

Ladies and Gentlemen, Boys and Girls: the holy grail

Multiple Sequence Alignment

Page 37: Computational Genomics Lecture #3

Multiple Sequence Alignment

S1=AGGTC
S2=GTTCG
S3=TGAAC

Possible alignment:

S1: A G G T - C -
S2: - G - T T C G
S3: T G - A A C -

Possible alignment:

S1: A G G T - C -
S2: G T T - - C G
S3: - T G A A - C

Page 38: Computational Genomics Lecture #3

Multiple Sequence Alignment

Aligning more than two sequences.

Definition: Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S’1, S’2, …, S’k that may contain blanks, where:

1. |S’1| = |S’2| = … = |S’k|

2. The removal of spaces from S’i leaves Si

Page 39: Computational Genomics Lecture #3

Multiple alignments

We use a matrix to represent the alignment of k sequences, K=(x1,...,xk). We assume no column consists solely of blanks.

x1  M Q - I L L L
x2  M L R - L L -
x3  M K - I L L L
x4  M P P V L I L

The common scoring functions give a score to each column, and set: score(K) = ∑i score(column(i))

For k=10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric: the order of arguments need not matter, e.g. score(I,-,I,V) = score(-,I,I,V).

Page 40: Computational Genomics Lecture #3

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

A common scoring function is SP, the sum of scores of the projected pairwise alignments: SPscore(K) = ∑i<j score(xi,xj).

In order for this score to be written as ∑i score(column(i)), we set score(-,-) = 0. Why?

Because pairs of blanks appear in the sum over columns but not in the sum over projected pairwise alignments (lines), where columns consisting of two blanks are dropped.

Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).

Page 41: Computational Genomics Lecture #3

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all $\binom{k}{2}$ projected pairwise alignments induced by A, where the pairwise alignment function score(xi,xj) is additive.

Page 42: Computational Genomics Lecture #3

Example

Consider the following alignment:

a c - c d b -
- c - a d b d
a - b c d a d

Using the edit distance, with $\delta(x,x) = 0$ and $\delta(x,y) = 1$ for $x \ne y$, this alignment has an SP value of

3 + 4 + 5 = 12
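The SP value of this example can be recomputed with a short sketch. It assumes the alignment rows are given as equal-length strings with "-" for blanks, and uses the unit edit cost above, under which a pair of blanks costs 0; the function names are hypothetical.

```python
from itertools import combinations

def delta(x, y):
    """Unit edit cost: 0 for identical symbols (including two blanks), 1 otherwise."""
    return 0 if x == y else 1

def sp_value(rows):
    """Sum of the costs of all projected pairwise alignments."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(delta(x, y) for x, y in zip(r1, r2))
    return total

alignment = ["ac-cdb-", "-c-adbd", "a-bcdad"]
pair_costs = [sum(delta(x, y) for x, y in zip(r1, r2))
              for r1, r2 in combinations(alignment, 2)]  # [3, 4, 5]
sp = sp_value(alignment)                                 # 12
```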

Page 43: Computational Genomics Lecture #3

Multiple Sequence Alignment

Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes

SP-score(K) = ∑i<j score(xi,xj).

Instead of a 2-dimensional table, we now have a k-dimensional table to fill.

For each vector i = (i1,..,ik), compute an optimal multiple alignment for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).

The adjacent entries are those that differ in their index by one or zero. Each entry depends on $2^k - 1$ adjacent entries.

Page 44: Computational Genomics Lecture #3

The idea via k=2

Recall the notation:

$V[i,j] = d(s[1..i], t[1..j])$

and the following recurrence for V:

$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$

[Figure: the four table cells V[i,j], V[i+1,j], V[i,j+1], V[i+1,j+1]]

Note that the new cell index (i+1,j+1) differs from previous indices by one of the $2^k - 1$ non-zero binary vectors (1,1), (1,0), (0,1).
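For k=2 the recurrence can be sketched as a plain table fill. This is a minimal illustration, assuming a similarity score that is maximized; `global_alignment_score` and `toy_sigma` are hypothetical names, and Python's 0-based `s[i]` corresponds to s[i+1] in the slide's 1-based notation.

```python
def global_alignment_score(s, t, sigma):
    """Fill V so that V[i][j] is the best score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):                      # first column: s prefix against blanks
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):                      # first row: blanks against t prefix
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),      # index step (1,1)
                V[i][j + 1] + sigma(s[i], "-"),   # index step (1,0)
                V[i + 1][j] + sigma("-", t[j]),   # index step (0,1)
            )
    return V[n][m]

def toy_sigma(x, y):
    """Made-up score: +1 for a match, -1 for a mismatch or an indel."""
    return 1 if x == y else -1

same = global_alignment_score("ACGT", "ACGT", toy_sigma)
one_deletion = global_alignment_score("AC", "A", toy_sigma)
```

The three arguments of `max` correspond to the three non-zero binary vectors; for k sequences the same loop would range over $2^k - 1$ such vectors.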

Page 45: Computational Genomics Lecture #3

Multiple Sequence Alignment

Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment.

Computational Cost:

• Instead of a 2-dimensional table we now have a k-dimensional table to fill.

• Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries.

Number of evaluations of scoring function: $O(2^k n^k)$

Page 46: Computational Genomics Lecture #3

Complexity of the DP approach

Number of cells: $n^k$.

Number of adjacent cells: $O(2^k)$.

Computation of the SP score for each column(i,b) is $O(k^2)$.

Total run time is $O(k^2 2^k n^k)$, which is totally unacceptable!

Maybe one can do better?
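To get a feel for how quickly this bound grows, the sketch below simply evaluates the product $k^2 \cdot 2^k \cdot n^k$ for a couple of input sizes. The sizes are arbitrary examples, and the result is an operation count up to constant factors, not a measured running time.

```python
def dp_cost(k, n):
    """Evaluate k^2 * 2^k * n^k, the run-time bound of the k-dimensional DP."""
    return k**2 * 2**k * n**k

cost_two = dp_cost(2, 100)   # 4 * 4 * 100**2
cost_six = dp_cost(6, 100)   # 36 * 64 * 100**6
```

Going from two to six sequences of length 100 moves the count from 160,000 to about 2.3 * 10^15.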

Page 47: Computational Genomics Lecture #3

But MSA is Intractable

Not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete (the proof is quite tricky and recent; some previous proofs were bogus).

Look at Isaac Elias’ presentation of the NP-completeness proof.

Need heuristic or approximation to reduce time.

Page 48: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm

Now we will see an $O(k^2 n^2)$ multiple alignment algorithm for the SP-score that approximates the optimal solution’s score by a factor of at most 2(1-1/k) < 2.

Page 49: Computational Genomics Lecture #3

Star Alignments

Rather than summing up all pairwise alignments, select a fixed sequence S1 as a center, and set

Star-score(K) = $\sum_{j>1}$ score(S1,Sj).

The algorithm to find an optimal alignment: at each step, add another sequence aligned with S1, keeping old gaps and possibly adding new ones.

Page 50: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm

Polynomial time algorithm:

Assumption: the function δ is a distance function:

• $\delta(x,x) = 0$
• $\delta(x,y) = \delta(y,x) \ge 0$
• $\delta(x,z) \le \delta(x,y) + \delta(y,z)$ (triangle inequality)

Let D(S,T) be the value of the minimum global alignment between S and T.

Page 51: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Polynomial time algorithm:

The input is a set Γ of k strings Si.

1. Find the “center string” S1 that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$

2. Call the remaining strings S2, …, Sk.

3. Add a string to the multiple alignment that initially contains only S1 as follows:

• Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1. Add Si by running the dynamic programming algorithm on S’1 and Si to produce S’’1 and S’i.

• Adjust S’2, …, S’i-1 by adding spaces to those columns where spaces were added to get S’’1 from S’1.

• Replace S’1 by S’’1.
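The three steps can be sketched as below. This is an illustrative sketch rather than a reference implementation: it assumes the unit edit cost (with a pair of blanks costing 0), represents blanks as "-", and the helper names `align_pair` and `center_star` are invented for the example.

```python
def delta(x, y):
    """Unit edit cost; two blanks cost 0, as the analysis requires."""
    return 0 if x == y else 1

def align_pair(s, t):
    """Minimum-cost global alignment; returns (cost, s_aligned, t_aligned)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], "-"),
                          D[i][j - 1] + delta("-", t[j - 1]))
    a, b, i, j = [], [], n, m      # trace back one optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], "-"):
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def center_star(strings):
    k = len(strings)
    # Step 1: the center minimizes the sum of distances to all other strings.
    dist = [[align_pair(s, t)[0] for t in strings] for s in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))
    rows = {c: strings[c]}                     # index -> aligned row
    for i in range(k):                         # Steps 2-3: add the other strings
        if i == c:
            continue
        old = rows[c]
        _, new_center, new_row = align_pair(old, strings[i])
        # Mark the columns of new_center that are freshly inserted blanks.
        is_new, p = [], 0
        for ch in new_center:
            if p < len(old) and ch == old[p]:
                is_new.append(False)
                p += 1
            else:
                is_new.append(True)
        for idx in rows:                       # pad every aligned row (center included)
            it = iter(rows[idx])
            rows[idx] = "".join("-" if new else next(it) for new in is_new)
        rows[i] = new_row
    return [rows[i] for i in range(k)]

strings = ["AGGTC", "GTTCG", "TGAAC"]
msa = center_star(strings)
```

The returned rows all have the same length, and deleting the blanks from row i gives back the i-th input string, which is exactly the definition of a multiple alignment.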

Page 52: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Time analysis:

• Choosing S1: running the dynamic programming algorithm $\binom{k}{2}$ times, which takes $O(k^2 n^2)$

• When Si is added to the multiple alignment, the length of S’1 is at most i·n, so the time to add all k strings is

$\sum_{i=1}^{k-1} O((i \cdot n) \cdot n) = O(k^2 n^2)$

Page 53: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Performance analysis:

• M - The alignment produced by this algorithm.

• d(i,j) - the distance M induces on the pair Si,Sj.

• $v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j)$

• M* - optimal alignment.

For all i, d(1,i) = D(S1,Si) (we performed an optimal alignment between S’1 and Si, and $\delta(-,-) = 0$)

Page 54: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm (cont.)

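Performance analysis:

$v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \le \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \left[ d(1,i) + d(1,j) \right]$ (triangle inequality)

$= 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l)$

$v(M^*) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} D(S_i,S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1,S_j) = k \sum_{j=2}^{k} D(S_1,S_j)$ (definition of S1)

where d*(i,j) is the distance M* induces on the pair Si,Sj. Hence

$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$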

Page 55: Computational Genomics Lecture #3

Multiple Sequence Alignment – Approximation Algorithm

The algorithm relies heavily on the scoring function being a distance. It produces an alignment whose SP score is at most twice the minimum.

What if the scoring function was a similarity?

Can we get an efficient algorithm whose score is half the maximum? A third of the maximum? …

We dunno!

Page 56: Computational Genomics Lecture #3

Tree Alignments

Assume that there is a tree T=(V,E) whose leaves are the sequences.
• Associate a sequence with each internal node.
• Tree-score(K) = $\sum_{(i,j) \in E}$ score(xi,xj).

Finding the optimal assignment of sequences to the internal nodes is NP-hard.

We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).

Page 57: Computational Genomics Lecture #3

Multiple Sequence Alignment Heuristics

Example - 4 sequences A, B, C, D.

A. Perform all 6 pairwise alignments. Find scores. Build a “similarity tree”.

[Figure: similarity tree with leaves ordered B, D, A, C, on a scale from similar to distant]

B. Multiple alignment following the tree from A.
– Align the most similar pair (B, D), allowing gaps to optimize the alignment.
– Align the next most similar pair (A, C).
– Now, “align the alignments”, introducing gaps if necessary to optimize the alignment of (BD) with (AC).

Page 58: Computational Genomics Lecture #3

The tree-based progressive method for multiple sequence alignment, used in practice (Clustal)

(a) a tree (dendrogram) obtained by cluster analysis
(b) pairwise alignment of 2 sequences’ alignments.

(a) [Figure: dendrogram with leaves DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF]

(b)
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
L R R - A R T A S A
L - R G A R A A A E

(modified from Speed’s ppt presentation, see p. 81 in Kanehisa’s book)

Page 59: Computational Genomics Lecture #3

Visualization of Alignment Helps