CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics


Transcript of CS 5263 Bioinformatics

Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Lecture 6: Sequence Alignment Statistics

Page 2: CS 5263 Bioinformatics

Review of last lecture

• How to map gaps more accurately?

GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8 x m - 3 x d

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m - 3 x d

Gaps usually occur in bunches

- During evolution, chunks of DNA may be lost or inserted entirely

- Aligning genomic sequences vs. cDNAs: cDNAs are spliced versions of the genomic seqs

Page 3: CS 5263 Bioinformatics

Model gaps more accurately

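• Previous model:
  – Gap of length n incurs penalty nd

• General:
  – Convex function $\gamma(n)$
  – E.g. $\gamma(n) = c \sqrt{n}$

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \ldots i-1} F(k, j) - \gamma(i - k) \\ \max_{k = 0 \ldots j-1} F(i, k) - \gamma(j - k) \end{cases}$$

  – Running time: O((M+N)MN) (cubic)
  – Space: O(NM)

[Figure: gap penalty plotted against gap length n for the two models]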

Page 4: CS 5263 Bioinformatics

Compromise: affine gaps

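$\gamma(n) = d + (n - 1)e$, where d is the gap-open penalty and e is the gap-extension penalty

[Figure: $\gamma(n)$ against gap length n; the first gap position costs d, each further position costs e]

Match: 2

Gap open: -5

Gap extension: -1

GACGCCGAACG
|||||   |||
GACGC---ACG
Score: 8x2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score: 8x2 - 3x5 = 1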
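The two scores above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function name `affine_alignment_score` is made up for illustration, and the mismatch score of -2 is an assumption (this slide gives only match, gap open and gap extension, and neither alignment contains a mismatch).

```python
def affine_alignment_score(top, bottom, match=2, mismatch=-2,
                           gap_open=-5, gap_extend=-1):
    """Score an already-gapped pairwise alignment; '-' marks a gap.

    The first position of each gap run costs gap_open, every further
    position in the same run costs gap_extend.
    """
    assert len(top) == len(bottom), "alignment rows must have equal length"
    score = 0
    gap_row = None  # row holding the current gap run: 'top', 'bottom' or None
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # 9
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # 1
```

One long gap is charged the opening cost once, while three separate gaps are charged it three times, which is why the first alignment wins under the affine model.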

• We want to find the optimal alignment with affine gap penalty in
  – O(MN) time
  – O(MN) or, better, O(M+N) memory

Page 5: CS 5263 Bioinformatics

Dynamic programming

• Consider three sub-problems when aligning x1..xi and y1..yj
  – F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
  – Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to a gap
  – Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to a gap

[Figure: the last alignment column in each case: (xi, yj) for F(i, j); (-, yj) for Ix(i, j); (xi, -) for Iy(i, j)]

Page 6: CS 5263 Bioinformatics

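[Figure: finite-state machine with states F (start state), Ix and Iy. Each edge is labeled input / output: every state moves to F on (xi, yj) / $\sigma(x_i, y_j)$; F moves to Ix on (-, yj) / d and Ix stays in Ix on (-, yj) / e; F moves to Iy on (xi, -) / d and Iy stays in Iy on (xi, -) / e]

Current state   Input      Output                  Next state
F               (xi, yj)   $\sigma(x_i, y_j)$      F
F               (-, yj)    d                       Ix
F               (xi, -)    d                       Iy
Ix              (-, yj)    e                       Ix
…               …          …                       …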

Page 7: CS 5263 Bioinformatics

Aligning x = AAC with y = ACT: three example alignments and their state paths

F-F-F-F
AAC
|||
ACT

F-Iy-F-F-Ix
AAC-
 ||
-ACT

F-F-Iy-F-Ix
AAC-
| |
A-CT

[Figure: the FSM with states F (start state), Ix and Iy, as on the previous slide]

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.

Optimal alignment: a state path to read the two sequences such that the total output score is the highest
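The correspondence between an alignment and its state path can be sketched in a few lines. The helper name `state_path` is hypothetical; the state names follow the slides (Ix when y_j is aligned to a gap, Iy when x_i is aligned to a gap), and the path starts in F because F is the start state.

```python
def state_path(x_row, y_row):
    """Return the FSM state path for a gapped alignment of x (top) and y (bottom)."""
    assert len(x_row) == len(y_row), "alignment rows must have equal length"
    path = ["F"]  # the machine starts in state F
    for a, b in zip(x_row, y_row):
        if b == "-":
            path.append("Iy")  # x_i is aligned to a gap
        elif a == "-":
            path.append("Ix")  # y_j is aligned to a gap
        else:
            path.append("F")   # x_i is aligned to y_j
    return "-".join(path)


print(state_path("AAC", "ACT"))    # F-F-F-F
print(state_path("AAC-", "-ACT"))  # F-Iy-F-F-Ix
print(state_path("AAC-", "A-CT"))  # F-F-Iy-F-Ix
```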

Page 8: CS 5263 Bioinformatics

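[Figure: the FSM; every transition into F reads (xi, yj) and outputs $\sigma(x_i, y_j)$]

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ I_x(i-1, j-1) + \sigma(x_i, y_j) \\ I_y(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

The last alignment column is (xi, yj).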

Page 9: CS 5263 Bioinformatics

[Figure: the FSM; the transitions into Ix read (-, yj), with output d from F and output e from Ix]

$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d \\ I_x(i, j-1) + e \end{cases}$$

The last alignment column for Ix(i, j) is (-, yj).

Page 10: CS 5263 Bioinformatics

[Figure: the FSM; the transitions into Iy read (xi, -), with output d from F and output e from Iy]

$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d \\ I_y(i-1, j) + e \end{cases}$$

The last alignment column for Iy(i, j) is (xi, -).

Page 11: CS 5263 Bioinformatics

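$$F(i, j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in } x \\ I_y(i-1, j-1) & \text{closing gaps in } y \end{cases}$$

$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d & \text{opening a gap in } x \\ I_x(i, j-1) + e & \text{gap extension in } x \end{cases}$$

$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d & \text{opening a gap in } y \\ I_y(i-1, j) + e & \text{gap extension in } y \end{cases}$$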
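The three recurrences translate almost line for line into code. The sketch below is one possible implementation, not the lecture's own: the function name `affine_global_score` is invented, the boundary conditions (F(0,0) = 0, Ix(0,j) = d + (j-1)e, Iy(i,0) = d + (i-1)e, everything else minus infinity) are taken from the worked example that follows, and d and e are passed as negative scores, as in that example.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Global alignment score with affine gaps; returns (best, F, Ix, Iy)."""
    m, n = len(x), len(y)
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    F[0][0] = 0
    for j in range(1, n + 1):          # y_1..y_j all aligned to gaps
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, m + 1):          # x_1..x_i all aligned to gaps
        Iy[i][0] = d + (i - 1) * e

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = sigma + max(F[i - 1][j - 1],
                                  Ix[i - 1][j - 1],
                                  Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)

    return max(F[m][n], Ix[m][n], Iy[m][n]), F, Ix, Iy


best, F, Ix, Iy = affine_global_score("GCAC", "GCC")
print(best)  # 1
```

Each cell is computed from a constant number of neighbors, so the running time is O(MN), and keeping only the previous row of each table would reduce the memory to O(N) when only the score is needed.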

Page 12: CS 5263 Bioinformatics

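Example: x = GCAC (rows, index i), y = GCC (columns, index j); m = 2, s = -2, d = -5, e = -1

Initialization ("." = not yet filled):

F: aligned on both
       -     G     C     C
  -    0    -∞    -∞    -∞
  G   -∞     .     .     .
  C   -∞     .     .     .
  A   -∞     .     .     .
  C   -∞     .     .     .

Iy: insertion on y
       -     G     C     C
  -   -∞    -∞    -∞    -∞
  G   -5     .     .     .
  C   -6     .     .     .
  A   -7     .     .     .
  C   -8     .     .     .

Ix: insertion on x
       -     G     C     C
  -   -∞    -5    -6    -7
  G   -∞     .     .     .
  C   -∞     .     .     .
  A   -∞     .     .     .
  C   -∞     .     .     .

Dependencies:
• F(i, j) is computed from F(i-1, j-1), Ix(i-1, j-1) and Iy(i-1, j-1), adding $\sigma(x_i, y_j)$
• Ix(i, j) is computed from F(i, j-1) (adding d) and Ix(i, j-1) (adding e)
• Iy(i, j) is computed from F(i-1, j) (adding d) and Iy(i-1, j) (adding e)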

Page 13: CS 5263 Bioinformatics

Fill F(1,1): $\sigma(x_1, y_1) = \sigma(G, G) = 2$, so

F(1,1) = 2 + max{F(0,0), Ix(0,0), Iy(0,0)} = 2 + 0 = 2

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 14: CS 5263 Bioinformatics

Fill F(1,2): $\sigma(x_1, y_2) = \sigma(G, C) = -2$, so

F(1,2) = -2 + max{F(0,1), Ix(0,1), Iy(0,1)} = -2 + (-5) = -7

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 15: CS 5263 Bioinformatics

Row 1 of F is complete: F(1,1) = 2, F(1,2) = -7, F(1,3) = -8.

Fill row 1 of Ix (d = -5, e = -1):

Ix(1,1) = -∞
Ix(1,2) = max{F(1,1) + d, Ix(1,1) + e} = 2 - 5 = -3
Ix(1,3) = max{F(1,2) + d, Ix(1,2) + e} = max{-12, -4} = -4

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 16: CS 5263 Bioinformatics

Row 1 of Iy: Iy(1,1) = Iy(1,2) = Iy(1,3) = -∞.

Fill F(2,1): $\sigma(x_2, y_1) = \sigma(C, G) = -2$, so

F(2,1) = -2 + max{F(1,0), Ix(1,0), Iy(1,0)} = -2 + (-5) = -7

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 17: CS 5263 Bioinformatics

Fill the rest of row 2 of F: $\sigma(x_2, y_2) = \sigma(x_2, y_3) = \sigma(C, C) = 2$, so

F(2,2) = 2 + max{F(1,1), Ix(1,1), Iy(1,1)} = 2 + 2 = 4
F(2,3) = 2 + max{F(1,2), Ix(1,2), Iy(1,2)} = 2 + (-3) = -1

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 18: CS 5263 Bioinformatics

Fill row 2 of Ix (d = -5, e = -1):

Ix(2,1) = -∞
Ix(2,2) = max{F(2,1) + d, Ix(2,1) + e} = -7 - 5 = -12
Ix(2,3) = max{F(2,2) + d, Ix(2,2) + e} = max{-1, -13} = -1

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 19: CS 5263 Bioinformatics

Fill Iy(2,1) (d = -5, e = -1):

Iy(2,1) = max{F(1,1) + d, Iy(1,1) + e} = 2 - 5 = -3

(x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1)

Page 20: CS 5263 Bioinformatics

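Completed tables (x = GCAC in rows, y = GCC in columns; m = 2, s = -2, d = -5, e = -1):

F
       -     G     C     C
  -    0    -∞    -∞    -∞
  G   -∞     2    -7    -8
  C   -∞    -7     4    -1
  A   -∞    -8    -5     2
  C   -∞    -9    -2     1

Iy
       -     G     C     C
  -   -∞    -∞    -∞    -∞
  G   -5    -∞    -∞    -∞
  C   -6    -3   -12   -13
  A   -7    -4    -1    -6
  C   -8    -5    -2    -3

Ix
       -     G     C     C
  -   -∞    -5    -6    -7
  G   -∞    -∞    -3    -4
  C   -∞    -∞   -12    -1
  A   -∞    -∞   -13   -10
  C   -∞    -∞   -14    -7

Dependencies:
• F(i, j) is computed from F(i-1, j-1), Ix(i-1, j-1) and Iy(i-1, j-1), adding $\sigma(x_i, y_j)$
• Ix(i, j) is computed from F(i, j-1) (adding d) and Ix(i, j-1) (adding e)
• Iy(i, j) is computed from F(i-1, j) (adding d) and Iy(i-1, j) (adding e)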

Page 21: CS 5263 Bioinformatics

Traceback (x = GCAC, y = GCC; m = 2, s = -2, d = -5, e = -1):

The best final score is F(4,3) = 1. Tracing back through the three tables:

F(4,3) = 1  <-  Iy(3,2) = -1  <-  F(2,2) = 4  <-  F(1,1) = 2  <-  F(0,0) = 0

x = GCAC
    || |
y = GC-C

Page 22: CS 5263 Bioinformatics

Today: statistics of alignment

Where does $\sigma(x_i, y_j)$ come from?

Are two aligned sequences actually related?

Page 23: CS 5263 Bioinformatics

Probabilistic model of alignments

• We’ll first focus on protein alignments without gaps

• Given an alignment, we can consider two possible models
  – R: the sequences are related by evolution
  – U: the sequences are unrelated

• How can we distinguish these two models?

• How is this view related to the amino-acid substitution matrix?

Page 24: CS 5263 Bioinformatics

Model for unrelated sequences

• Assume each position of the alignment is independently sampled from some distribution of amino acids

• $p_s$: probability of amino acid s in the sequences

• Probability of seeing an amino acid s aligned to an amino acid t by chance is
  – $\Pr(s, t \mid U) = p_s \, p_t$

• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn by chance is

$$\Pr(x, y \mid U) = \prod_i p_{x_i} \prod_i p_{y_i}$$

Page 25: CS 5263 Bioinformatics

Model for related sequences

• Assume each pair of aligned amino acids evolved from a common ancestor

• Let qst be the probability that amino acid s in one sequence is related to t in another sequence

• The probability of an alignment of x and y is given by

$$\Pr(x, y \mid R) = \prod_i q_{x_i y_i}$$

Page 26: CS 5263 Bioinformatics

Probabilistic model of Alignments

• How can we decide which model (U or R) is more likely?

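• One principled way is to consider the relative likelihood of the two models (the odds ratio)

$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

  – A higher ratio means that R is more likely than U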

Page 27: CS 5263 Bioinformatics

Log odds ratio

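• Taking the logarithm, we get

$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

• Recall that the score of an alignment is given by

$$S = \sum_i \sigma(x_i, y_i)$$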

Page 28: CS 5263 Bioinformatics

• Therefore, if we define

$$\sigma(s, t) = \log \frac{q_{st}}{p_s \, p_t}$$

• We are actually defining the alignment score as the log odds ratio between the two models R and U

Page 29: CS 5263 Bioinformatics

How to get the probabilities?

• ps can be counted from the available protein sequences

• But how do we get qst? (the probability that s and t have a common ancestor)

• Counted from trusted alignments of related sequences

Page 30: CS 5263 Bioinformatics

Protein Substitution Matrices

• Two popular sets of matrices for protein sequences
  – PAM matrices [Dayhoff et al, 1978]
    • Better for aligning closely related sequences
  – BLOSUM matrices [Henikoff & Henikoff, 1992]
    • For both closely and remotely related sequences

Page 31: CS 5263 Bioinformatics

BLOSUM-N matrices

• Constructed from a database called BLOCKS

• BLOCKS contains many closely related sequences
  – Conserved amino acids may be over-counted

• N = 62: the probabilities $q_{st}$ were computed using trusted alignments with no more than 62% identity
  – identity: % of matched columns

• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

Page 32: CS 5263 Bioinformatics

Positive for chemically similar substitution

Common amino acids get low weights

Rare amino acids get high weights

$\lambda$: scaling factor to convert scores to integers.

Important: when you are told that a scoring matrix is in half-bits, $\lambda = \tfrac{1}{2} \ln 2$

Page 33: CS 5263 Bioinformatics

BLOSUM-N matrices

• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75

• On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50

• BLOSUM-62: good for most purposes

[Scale: BLOSUM45 (weak homology) … BLOSUM62 … BLOSUM90 (strong homology)]

Page 34: CS 5263 Bioinformatics

For DNAs

• No database of trusted alignments to start with

• Specify the percentage identity you would like to detect

• You can then get the substitution matrix by some calculation

Page 35: CS 5263 Bioinformatics

For example

• Suppose pA = pC = pT = pG = 0.25

• We want 88% identity

• $q_{AA} = q_{CC} = q_{TT} = q_{GG} = 0.22$, the rest = 0.12/12 = 0.01

$\sigma(A, A) = \sigma(C, C) = \sigma(G, G) = \sigma(T, T) = \log(0.22 / (0.25 \times 0.25)) = 1.26$

$\sigma(s, t) = \log(0.01 / (0.25 \times 0.25)) = -1.83$ for s ≠ t

(log is the natural logarithm)

Page 36: CS 5263 Bioinformatics

Substitution matrix

A C G T

A 1.26 -1.83 -1.83 -1.83

C -1.83 1.26 -1.83 -1.83

G -1.83 -1.83 1.26 -1.83

T -1.83 -1.83 -1.83 1.26

Page 37: CS 5263 Bioinformatics

• Scale won’t change the alignment

• Multiply by 4 and then round off to get integers

A C G T

A 5 -7 -7 -7

C -7 5 -7 -7

G -7 -7 5 -7

T -7 -7 -7 5
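The DNA calculation above is easy to reproduce. The following is a small sketch under the slide's assumptions (uniform background of 0.25, identity spread evenly over the four matching pairs, the remainder spread over the twelve mismatching pairs, natural logarithm); the function name `dna_log_odds` and its parameters are made up for illustration.

```python
import math


def dna_log_odds(identity=0.88, background=0.25, scale=1.0):
    """Return (match_score, mismatch_score) as scaled log-odds ratios."""
    q_match = identity / 4            # q_AA = q_CC = q_GG = q_TT
    q_mismatch = (1 - identity) / 12  # each of the 12 off-diagonal pairs
    chance = background * background  # p_s * p_t under the unrelated model
    return (scale * math.log(q_match / chance),
            scale * math.log(q_mismatch / chance))


match, mismatch = dna_log_odds()
print(round(match, 2), round(mismatch, 2))  # 1.26 -1.83

# Multiply by 4 and round to get the integer matrix.
match4, mismatch4 = dna_log_odds(scale=4)
print(round(match4), round(mismatch4))      # 5 -7
```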

Page 38: CS 5263 Bioinformatics

Arbitrary substitution matrix

• Say you have a substitution matrix provided by someone

• It’s important to know what you are actually looking for when you use the matrix

Page 39: CS 5263 Bioinformatics

• What’s the difference?

• Which one should I use for my sequences?

NCBI-BLAST
     A    C    G    T
A    1   -2   -2   -2
C   -2    1   -2   -2
G   -2   -2    1   -2
T   -2   -2   -2    1

WU-BLAST
     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5

Page 40: CS 5263 Bioinformatics

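• We had

$$\sigma(s, t) = \log \frac{q_{st}}{p_s \, p_t}$$

• Scale it, so that

$$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s \, p_t}$$

• Reorganize:

$$q_{st} = p_s \, p_t \, e^{\lambda \sigma(s, t)}$$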

Page 41: CS 5263 Bioinformatics

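• Since all probabilities must sum to 1,

$$\sum_{s,t} q_{st} = 1$$

• We have

$$\sum_{s,t} p_s \, p_t \, e^{\lambda \sigma(s, t)} = 1$$

• Suppose again $p_s = 0.25$ for any s

• We know $\sigma(s, t)$ from the substitution matrix

• We can solve the equation for λ

• Plug λ into $q_{st} = p_s \, p_t \, e^{\lambda \sigma(s, t)}$ to get $q_{st}$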

Page 42: CS 5263 Bioinformatics

NCBI-BLAST
     A    C    G    T
A    1   -2   -2   -2
C   -2    1   -2   -2
G   -2   -2    1   -2
T   -2   -2   -2    1

λ = 1.33
$q_{st}$ = 0.24 for s = t, and 0.004 for s ≠ t
Translate: 95% identity

WU-BLAST
     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5

λ = 0.19
$q_{st}$ = 0.16 for s = t, and 0.03 for s ≠ t
Translate: 65% identity

Page 43: CS 5263 Bioinformatics

Details for solving

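Known: $\sigma(s, t) = 1$ for s = t, and $\sigma(s, t) = -2$ for s ≠ t.

Since $q_{st} = p_s \, p_t \, e^{\lambda \sigma(s, t)}$ and $\sum_{s,t} q_{st} = 1$, we have

$$12 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{-2\lambda} + 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{\lambda} = 1$$

Let $e^{\lambda} = x$; we have $\tfrac{3}{4} x^{-2} + \tfrac{1}{4} x = 1$. Hence $x^3 - 4x^2 + 3 = 0$.

• x has three solutions: 3.8, 1, -0.8
• Only the first leads to a positive λ: λ = ln(3.8) = 1.33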

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1
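For matrices where the cubic is less convenient, λ can also be found numerically. The sketch below uses plain bisection on the normalization condition and assumes a uniform background (p = 0.25) and a matrix with one match score and one mismatch score; `solve_lambda` and `expected_identity` are hypothetical names, not functions from any BLAST distribution.

```python
import math


def solve_lambda(match, mismatch, p=0.25, lo=1e-6, hi=10.0, iters=200):
    """Find the positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1."""
    def f(lam):
        return (12 * p * p * math.exp(lam * mismatch)
                + 4 * p * p * math.exp(lam * match) - 1.0)

    # lambda = 0 is always a trivial root. With a negative expected score,
    # f is negative just above 0 and positive for large lambda, so the
    # bracket [lo, hi] contains exactly the root we want.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_identity(match, lam, p=0.25):
    """Target identity: the four diagonal q_ss = p^2 * exp(lambda * match)."""
    return 4 * p * p * math.exp(lam * match)


for name, match, mismatch in [("NCBI-BLAST", 1, -2), ("WU-BLAST", 5, -4)]:
    lam = solve_lambda(match, mismatch)
    print(name, round(lam, 2), round(expected_identity(match, lam), 2))
```

Bisection is chosen here only for simplicity; any one-dimensional root finder would do.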

Page 44: CS 5263 Bioinformatics

Today: statistics of alignment

Where does $\sigma(x_i, y_j)$ come from?

Are two aligned sequences actually related?

Page 45: CS 5263 Bioinformatics

Statistics of Alignment Scores

• Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)?
  – Is a score of 82 good? What about 180?

• A: determine how likely it is that such an alignment score would result from chance

Page 46: CS 5263 Bioinformatics

P-value of alignment

• p-value
  – The probability that the alignment score can be obtained from aligning random sequences
  – A small p-value means the score is unlikely to happen by chance

• The most common thresholds are 0.01 and 0.05
  – The choice also depends on the purpose of the comparison and the cost of a false claim

Page 47: CS 5263 Bioinformatics

Statistics of global seq alignment

• Theory only applies to local alignment

• For global alignment, your best bet is to do Monte-Carlo simulation
  – What’s the chance you can get a score as high as the real alignment by aligning two random sequences?

• Procedure
  – Given sequences X, Y
  – Compute a global alignment (score = S)
  – Randomly shuffle sequence X (or Y) N times, obtain X1, X2, …, XN
  – Align each Xi with Y (score = Ri)
  – P-value: the fraction of Ri >= S
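The shuffling procedure can be sketched as follows. To keep the example short, the global aligner here is a plain linear-gap Needleman-Wunsch scorer with made-up parameters (match 1, mismatch -1, gap -2), not the affine-gap scorer or the protein matrix used for the HEXA examples; `global_score` and `shuffle_p_value` are hypothetical names.

```python
import random


def global_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty, one row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Fraction of shuffled copies of x that score at least as well against y."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = global_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # same composition, random order
        if global_score("".join(letters), y) >= observed:
            hits += 1
    return observed, hits / n_shuffles


score, p = shuffle_p_value("GATTACAGATTACAGGCATCG", "GATTACAGATTACAGGCATCG")
print(score, p)
```

Shuffling preserves the composition of X, so the null distribution reflects sequences with the same letter frequencies but no shared order.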

Page 48: CS 5263 Bioinformatics

Human HEXA

Fly HEXO1

Score = -74

Page 49: CS 5263 Bioinformatics

[Figure: histogram; x-axis: alignment score (-95 to -50); y-axis: number of sequences (0 to 45); the real score, -74, is marked]

Distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences

There are 88 random sequences with alignment score >= -74. So: p-value = 88 / 200 = 0.44 => alignment is not significant

Page 50: CS 5263 Bioinformatics


Mouse HEXA

Human HEXA

Score = 732

Page 51: CS 5263 Bioinformatics

[Figure: histogram; x-axis: alignment score (-200 to 800); y-axis: number of sequences (0 to 45); the real score, 732, is marked far to the right of the random scores]

Distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences

[Figure: the same histogram zoomed in on the random scores; x-axis: alignment score (-230 to -150); y-axis: number of sequences (0 to 45)]

• No random sequences with alignment score >= 732
  – So: the P-value is less than 1 / 200 = 0.005

• To get a smaller p-value, we have to align more random sequences
  – Very slow

• Unless we can fit a distribution (e.g. normal distribution)
  – Such a distribution may not be generalizable
  – No theory exists for the global alignment score distribution

Page 52: CS 5263 Bioinformatics

Statistics for local alignment

• Elegant theory exists

• Score for ungapped local alignment follows an extreme value distribution (Gumbel distribution)

[Figure: density curves of a normal distribution and an extreme value distribution]

An example extreme value distribution:

• Randomly sample 100 numbers from a normal distribution, and compute max

• Repeat 100 times.

• The max values will follow extreme value distribution
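The sampling experiment described above takes only a few lines. This is an illustrative sketch using the standard library; the function name `sample_maxima`, the standard normal (mean 0, standard deviation 1) and the fixed seed are assumptions made for the example.

```python
import random
import statistics


def sample_maxima(n_per_sample=100, n_repeats=100, seed=0):
    """Draw n_per_sample standard normal values, keep the max, repeat."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_per_sample))
            for _ in range(n_repeats)]


maxima = sample_maxima()
print("mean of maxima:  ", round(statistics.mean(maxima), 2))
print("median of maxima:", round(statistics.median(maxima), 2))
print("range of maxima: ", round(min(maxima), 2), "to", round(max(maxima), 2))
```

The maxima cluster well above zero and their histogram has a longer tail on the right, which is the characteristic shape of the extreme value distribution. A local alignment score is itself a maximum over many candidate segment pairs, which is the intuition for why the same distribution appears.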

Page 53: CS 5263 Bioinformatics

Statistics for local alignment

• Given two unrelated sequences of lengths M, N

• Expected number of ungapped local alignments with score at least S can be calculated by
  – $E(S) = K M N \exp[-\lambda S]$
  – Known as the E-value
  – λ: scaling factor as computed in last lecture
  – K: empirical parameter ~ 0.1
    • Depends on sequence composition and substitution matrix

Page 54: CS 5263 Bioinformatics

P-value for local alignment score

• P-value for a local alignment with score S

$$P(x \ge S) = 1 - \exp(-E(S)) = 1 - \exp\left(-K M N e^{-\lambda S}\right) \approx E(S)$$

when P is small.

Page 55: CS 5263 Bioinformatics

Example

• You are aligning two sequences, each has 1000 bases

• m = 1, s = -1, d = -inf (ungapped alignment)

• You obtain a score 20

• Is this score significant?

Page 56: CS 5263 Bioinformatics

• λ = ln 3 = 1.1 (computed as discussed on slide #41)
• $E(S) = K M N \exp\{-\lambda S\}$
• $E(20) = 0.1 \times 1000 \times 1000 \times 3^{-20} = 3 \times 10^{-5}$
• P-value = $3 \times 10^{-5}$ << 0.05
• The alignment is significant

[Figure: histogram; x-axis: alignment score (9 to 18); y-axis: number of sequences (0 to 400); the observed score, 20, lies to the right of all random scores]

Distribution of 1000 random sequence pairs

Page 57: CS 5263 Bioinformatics

Multiple-testing problem

• Searching a 1000-base sequence against a database of $10^6$ sequences (each of length 1000)

• How significant is a score of 20 now?

• You are essentially comparing 1000 bases with $1000 \times 10^6 = 10^9$ bases (ignoring edge effects)

• $E(20) = 0.1 \times 1000 \times 10^9 \times 3^{-20} \approx 30$

• By chance we would expect to see about 30 matches
  – The P-value (probability of seeing at least one match with score >= 20) is $1 - e^{-30} \approx 0.9999999999$
  – The alignment is not significant
  – Caution: it does NOT mean that the two sequences are unrelated. Rather, it simply means that you have NO confidence to say whether the two sequences are related.
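Both the pairwise and the database calculation follow from the same two formulas. A minimal sketch, with hypothetical function names `e_value` and `p_value`, K = 0.1 as on the slides, and λ = ln 3 for the m = 1, s = -1 scoring scheme:

```python
import math


def e_value(score, m, n, lam, K=0.1):
    """Expected number of chance local alignments with score >= `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, K=0.1):
    """P(at least one chance alignment with score >= `score`) = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value(score, m, n, lam, K))


lam = math.log(3)

# Pairwise comparison: 1000 bases vs 1000 bases.
print(e_value(20, 1000, 1000, lam), p_value(20, 1000, 1000, lam))

# Database search: 1000 bases vs 10**6 sequences of 1000 bases each.
print(e_value(20, 1000, 10**9, lam), p_value(20, 1000, 10**9, lam))
```

With the unrounded value of $3^{-20}$ the database E-value comes out near 29, which the slide rounds to 30.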

Page 58: CS 5263 Bioinformatics

Score threshold to determine significance

• You want a p-value that is very small (even after taking into consideration multiple-testing)

• What S will guarantee you a significant p-value?

$E(S) \approx P(S) \ll 1$

=> $K M N \exp[-\lambda S] \ll 1$

=> $\log(KMN) - \lambda S < 0$

=> $S > T + \log(MN) / \lambda$  ($T = \log(K) / \lambda$, usually small)

Page 59: CS 5263 Bioinformatics

Score threshold to determine significance

• In the previous example
  – m = 1, s = -1, d = -inf => λ = 1.1

• Aligning 1000 bp vs 1000 bp: $S > \log(10^6) / 1.1 = 13$. So 20 is significant.

• Searching 1000 bp against $10^6$ x 1000 bp: $S > \log(10^{12}) / 1.1 = 25$. So 20 is not significant.
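The two thresholds can be computed directly from the search-space size. A short sketch with a hypothetical helper `score_threshold`; by default it ignores the term T = log(K)/λ, as the slide does, and includes it only when K is supplied. It uses λ = ln 3 rather than the rounded 1.1, so the numbers differ slightly from the slide's before rounding.

```python
import math


def score_threshold(m, n, lam, K=None):
    """Score above which E(S) drops below 1: log(MN)/lambda (+ log(K)/lambda)."""
    threshold = math.log(m * n) / lam
    if K is not None:
        threshold += math.log(K) / lam  # T = log(K) / lambda, usually small
    return threshold


lam = math.log(3)
pairwise = score_threshold(1000, 1000, lam)   # roughly 12.6
database = score_threshold(1000, 10**9, lam)  # roughly 25.2
print(round(pairwise, 1), round(database, 1))
print(20 > pairwise, 20 > database)           # True False
```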

Page 60: CS 5263 Bioinformatics

Statistics for gapped local alignment

• Theory not well developed

• Extreme value distribution works well empirically

• Need to estimate K and λ empirically
  – Given the database and substitution matrix, generate some random sequence pairs
  – Do local alignment
  – Fit an extreme value distribution to obtain K and λ

Page 61: CS 5263 Bioinformatics

In summary

• How to obtain a substitution matrix?
  – Obtain $q_{st}$ and $p_s$ from established alignments (for DNA: from your knowledge)
  – Computing score: $\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s \, p_t}$

• How to understand an arbitrary substitution matrix?
  – Solve $\sum_{s,t} p_s \, p_t \, e^{\lambda \sigma(s, t)} = 1$ to obtain λ and the target $q_{st}$
  – Which tells you what percent identity you are expecting

• How to understand an alignment score?
  – Probability that a score can be expected from chance
  – Global alignment: Monte-Carlo simulation
  – Local alignment: extreme value distribution
    • Estimate p-value from a score
    • Determine a score threshold without computing a p-value