Sequence Similarity

Page 1: Sequence Similarity

Sequence Similarity

Page 2: Sequence Similarity

Why sequence similarity

structural similarity

>25% sequence identity → similar structure

evolutionary relationship

all proteins come from < 2000 (super)families

related functional role

similar structure → similar function

functional modules are often preserved

Page 3: Sequence Similarity

Muscle cells and contraction

Page 4: Sequence Similarity

Actin and myosin during muscle movement

Page 5: Sequence Similarity

Actin structure

Page 6: Sequence Similarity

Actin sequence

• Actin is ancient and abundant

Most abundant protein in cells

1-2 actin genes in bacteria, yeasts, amoebas

Humans: 6 actin genes

α-actin in muscles; β-actin, γ-actin in non-muscle cells

• ~4 amino acids different between each version

MUSCLE ACTIN Amino Acid Sequence

  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF

Page 7: Sequence Similarity

A related protein in bacteria

Page 8: Sequence Similarity

Relation between sequence and structure

Page 9: Sequence Similarity

A multiple alignment of actins

Page 10: Sequence Similarity

Gene expression

DNA: CCTGAGCCAACTATTGATGAA

↓ transcription

RNA: CCUGAGCCAACUAUUGAUGAA

↓ translation

Protein: PEPTIDE
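The two steps on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the DNA string shown is the coding strand (so transcription only swaps T for U) and assuming the standard genetic code; the names `CODON_TABLE`, `transcribe` and `translate` are illustrative and not part of the slides.

```python
# Standard genetic code, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate mRNA codon by codon; '*' marks a stop codon."""
    dna = rna.replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```

On the slide's example the RNA comes out as the string shown above and the translation spells the one-letter amino acid codes P-E-P-T-I-D-E.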

Page 11: Sequence Similarity

Biomolecules as Strings

• Macromolecules are the chemical building blocks of cells

Proteins
• 20 amino acids

Nucleic acids
• 4 nucleotides {A, C, G, T}

Polysaccharides

Page 12: Sequence Similarity

The information is in the sequence

• Sequence → Structure → Function

• Sequence similarity → Structural and/or Functional similarity

• Nucleic acids and proteins are related by molecular evolution

Orthologs: two proteins in animals X and Y that evolved from one protein in immediate ancestor animal Z

Paralogs: two proteins that evolved from one protein through duplication in some ancestor

Homologs: orthologs or paralogs that exhibit sequence similarity

Page 13: Sequence Similarity

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

[Figure labels: orthologs, paralogs, duplication]

Page 14: Sequence Similarity

Evolution

Page 15: Sequence Similarity

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Page 16: Sequence Similarity

Evolutionary Rates

[Figure: sequence changes passed to the next generation, marked "OK", "X", or "Still OK?"]

Changes in non-functional sites are OK, so will be propagated

Most changes in functional sites are deleterious and will be rejected

Page 17: Sequence Similarity

Sequence conservation implies function

Proteins between humans and rodents are on average 85% identical

Page 18: Sequence Similarity

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition

Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
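The definition can be checked mechanically: removing the gaps from each row of an alignment must give back the original strings. The sketch below assumes `-` is the gap character and that a column of two gaps is not allowed. The helper names are illustrative, and the identity measure (identical columns divided by alignment length) is only one of several conventions in use.

```python
def ungap(row):
    """Remove gap characters from one row of an alignment."""
    return row.replace("-", "")


def is_alignment_of(row_x, row_y, x, y):
    """True if (row_x, row_y) is a gapped version of (x, y)."""
    if len(row_x) != len(row_y):
        return False
    # A column must hold at least one letter.
    if any(a == "-" and b == "-" for a, b in zip(row_x, row_y)):
        return False
    return ungap(row_x) == x and ungap(row_y) == y


def percent_identity(row_x, row_y):
    """Identical letter columns as a percentage of alignment length."""
    same = sum(a == b and a != "-" for a, b in zip(row_x, row_y))
    return 100.0 * same / len(row_x)


row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment_of(row_x, row_y,
                      "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                      "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"))
```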

Page 19: Sequence Similarity

What is a good alignment?

Alignment: The “best” way to match the letters of one sequence with those of the other

How do we define “best”?

Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits

Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other

Page 20: Sequence Similarity

Scoring Function

• Sequence edits: AGGCCTC

Mutations: AGGACTC

Insertions: AGGGCCTC

Deletions: AGG-CTC

Scoring Function:
Match: +m
Mismatch: -s
Gap: -d

Score $F = (\#\,\text{matches}) \cdot m - (\#\,\text{mismatches}) \cdot s - (\#\,\text{gaps}) \cdot d$
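The scoring function can be applied directly to a given alignment by counting column types. A minimal sketch, assuming both rows have equal length, `-` marks a gap, and s and d are passed as positive penalties; the default values m = s = d = 1 are only illustrative.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """F = (#matches) * m - (#mismatches) * s - (#gaps) * d."""
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The three edits from the slide, each aligned against AGGCCTC.
print(score_alignment("AGGCCTC", "AGGACTC"))    # one mutation
print(score_alignment("AGG-CCTC", "AGGGCCTC"))  # one insertion
print(score_alignment("AGGCCTC", "AGG-CTC"))    # one deletion
```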

Page 21: Sequence Similarity

How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments:

$O(2^{M+N})$
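To get a feel for this growth, one can count alignments for small lengths. The sketch below assumes one particular counting convention: an alignment is identified by which letters of x are paired with which letters of y, so alignments that differ only in the order of adjacent gaps count once. Under that assumption the count is $\sum_k \binom{M}{k}\binom{N}{k} = \binom{M+N}{M}$, which stays below $2^{M+N}$. The function name is illustrative.

```python
from math import comb


def count_alignments(M, N):
    """Number of ways to choose k aligned letter pairs, summed over k."""
    return sum(comb(M, k) * comb(N, k) for k in range(min(M, N) + 1))


for length in (5, 10, 20, 40):
    print(length, count_alignments(length, length), 2 ** (2 * length))
```

Even for two sequences of length 40 the count is far too large to enumerate, which is the motivation for the dynamic programming approach that follows.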

Page 22: Sequence Similarity

Alignment is additive

Observation: The score of aligning x1……xM with y1……yN is additive.

Say that

x1…xi   xi+1…xM

aligns to

y1…yj   yj+1…yN

The two scores add up:

$F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])$

Key property: the optimal solution to the entire problem is composed of optimal solutions to subproblems. This is what makes Dynamic Programming applicable.
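Additivity is easy to verify numerically: cut an alignment at any column and the scores of the two pieces sum to the score of the whole. A small sketch, assuming the linear scoring scheme with m = s = d = 1 (s and d as positive penalties); the function names are illustrative.

```python
def column_score(a, b, m=1, s=1, d=1):
    """Score of a single alignment column."""
    if a == "-" or b == "-":
        return -d
    return m if a == b else -s


def alignment_score(row_x, row_y):
    """Sum of column scores over the whole alignment."""
    return sum(column_score(a, b) for a, b in zip(row_x, row_y))


def split_scores(row_x, row_y, cut):
    """Scores of the columns before and after position `cut`."""
    left = alignment_score(row_x[:cut], row_y[:cut])
    right = alignment_score(row_x[cut:], row_y[cut:])
    return left, right


total = alignment_score("AGTA", "A-TA")
for cut in range(5):
    left, right = split_scores("AGTA", "A-TA", cut)
    print(cut, left, right, left + right == total)
```

This works because the score is a plain sum over columns; a scheme in which a column's score depends on its neighbours would need extra bookkeeping at the cut.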

Page 23: Sequence Similarity

Dynamic Programming

Suppose we wish to align x1……xM with y1……yN.

Construct a DP matrix F of size (M+1) × (N+1), with indices i = 0…M and j = 0…N.

Let F(i, j) = optimal score of aligning x1……xi with y1……yj

Page 24: Sequence Similarity

Dynamic Programming (cont’d)

Notice three possible cases:

1. xi aligns to yj

x1……xi-1 xi
y1……yj-1 yj

$F(i, j) = F(i-1, j-1) + \begin{cases} m, & \text{if } x_i = y_j \\ -s, & \text{if not} \end{cases}$

2. xi aligns to a gap

x1……xi-1 xi
y1……yj   -

$F(i, j) = F(i-1, j) - d$

3. yj aligns to a gap

x1……xi   -
y1……yj-1 yj

$F(i, j) = F(i, j-1) - d$

Page 25: Sequence Similarity

Dynamic Programming (cont’d)

• How do we know which case is correct?

Inductive assumption: F(i, j - 1), F(i - 1, j), F(i - 1, j - 1) are optimal

Then,

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not.

Page 26: Sequence Similarity

Example

x = AGTA, y = ATA

m = 1, s = 1, d = 1 (a match scores +1, a mismatch -1, a gap -1)

F(i, j)    i = 0    1    2    3    4
                    A    G    T    A
j = 0         0   -1   -2   -3   -4
j = 1  A     -1    1    0   -1   -2
j = 2  T     -2    0    0    1    0
j = 3  A     -3   -1   -1    0    2

Optimal Alignment:

F(4, 3) = 2

AGTA
A-TA
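The table for this example can be regenerated by filling F with the three-case recurrence. This is a sketch under the convention that s and d are positive penalties, so a mismatch contributes -1 and a gap -1; `fill_matrix` is an illustrative name. F is stored as `F[i][j]` with i running over x and j over y, and is printed one row per j to match the slide layout.

```python
def fill_matrix(x, y, m=1, s=1, d=1):
    """Return F with F[i][j] = best score of aligning x[:i] with y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y1..yj aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligns to yj
                          F[i - 1][j] - d,         # xi aligns to a gap
                          F[i][j - 1] - d)         # yj aligns to a gap
    return F


x, y = "AGTA", "ATA"
F = fill_matrix(x, y)
for j in range(len(y) + 1):
    print([F[i][j] for i in range(len(x) + 1)])
```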

Page 27: Sequence Similarity

The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j d
   c. F(i, 0) = -i d

2. Main Iteration. Filling in partial alignments
   a. For each i = 1……M
        For each j = 1……N

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1, j) - d & \text{[case 2]} \\ F(i, j-1) - d & \text{[case 3]} \end{cases}$$

$$\mathrm{Ptr}(i, j) = \begin{cases} \text{DIAG}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{UP}, & \text{if [case 3]} \end{cases}$$

3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
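The three steps above can be sketched as a single function that fills F and Ptr and then traces back. This is an illustrative implementation, not the slides' own code: it assumes s and d are positive penalties, breaks ties in the order DIAG, LEFT, UP, and sets boundary pointers so the traceback can run all the way to (0, 0).

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment; returns (score, aligned x row, aligned y row)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization: leading gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"
    # 2. Main iteration: take the best of the three cases.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            choices = [(F[i - 1][j - 1] + sub, "DIAG"),  # case 1
                       (F[i - 1][j] - d, "LEFT"),        # case 2
                       (F[i][j - 1] - d, "UP")]          # case 3
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    # 3. Termination: follow pointers back from (M, N).
    row_x, row_y, i, j = [], [], M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("AGTA", "ATA"))
```

For x = AGTA and y = ATA with m = s = d = 1 this returns score 2 with rows AGTA and A-TA. When several alignments share the optimal score, the tie-breaking order decides which one is reported.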

Page 28: Sequence Similarity

Performance

• Time: O(NM)

• Space: O(NM)

• Possible to reduce space to O(N+M) using Hirschberg’s divide & conquer algorithm
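The space saving rests on the fact that each row of F depends only on the row before it. The sketch below keeps just two rows and returns the optimal score only. It is not Hirschberg's algorithm itself, which combines this row-by-row scoring with divide and conquer to recover the alignment as well; the function name and the default penalties are illustrative.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(len(y)) memory."""
    prev = [-j * d for j in range(len(y) + 1)]       # row i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)               # F(i, 0) = -i d
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub,         # diagonal
                          prev[j] - d,               # xi against a gap
                          curr[j - 1] - d)           # yj against a gap
        prev = curr
    return prev[-1]


print(nw_score_linear_space("AGTA", "ATA"))
```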

Page 29: Sequence Similarity

Substitutions of Amino Acids

Mutation rates differ dramatically between pairs of amino acids!

How can we quantify the differences in the rates at which one amino acid replaces another across related proteins?

Page 30: Sequence Similarity

Substitution Matrices

BLOSUM matrices:

1. Start from the BLOCKS database (curated, gap-free alignments)

2. Cluster sequences according to > X% identity

3. Calculate $A_{ab}$: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes

4. Estimate

$$P(a) = \frac{\sum_b A_{ab}}{\sum_{c \le d} A_{cd}}; \qquad P(a, b) = \frac{A_{ab}}{\sum_{c \le d} A_{cd}}$$

Page 31: Sequence Similarity

Gaps are not inserted uniformly

Page 32: Sequence Similarity

A state model for alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

States: M(+1,+1), I(+1, 0), J(0, +1)

Alignments correspond 1-to-1 with sequences of states M, I, J
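The correspondence can be made concrete by converting between the two representations. The sketch below follows the labelling visible in the slide's example, where a column with a gap in the top row is labelled I and a column with a gap in the bottom row is labelled J; that reading of the figure is an assumption, and the function names are illustrative.

```python
def alignment_to_states(top, bottom):
    """Label each column M (two letters), I (gap on top) or J (gap below)."""
    states = []
    for a, b in zip(top, bottom):
        if a == "-":
            states.append("I")
        elif b == "-":
            states.append("J")
        else:
            states.append("M")
    return "".join(states)


def states_to_alignment(states, top_seq, bottom_seq):
    """Rebuild the two gapped rows from a state string and the raw sequences."""
    top, bottom, t, b = [], [], 0, 0
    for state in states:
        if state == "M":
            top.append(top_seq[t]); bottom.append(bottom_seq[b]); t += 1; b += 1
        elif state == "I":
            top.append("-"); bottom.append(bottom_seq[b]); b += 1
        else:
            top.append(top_seq[t]); bottom.append("-"); t += 1
    return "".join(top), "".join(bottom)


top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
states = alignment_to_states(top, bottom)
print(states)
print(states_to_alignment(states, top.replace("-", ""), bottom.replace("-", "")))
```

Going from the alignment to the states and back returns the original two rows, which is the 1-to-1 property stated on the slide.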

Page 33: Sequence Similarity

Let’s score the transitions

Scores on the transitions between states M(+1,+1), I(+1, 0), J(0, +1):

Entering M (from M, I or J): s(xi, yj)

Opening a gap (M to I, M to J): -d

Extending a gap (I to I, J to J): -e
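With these transition scores, a run of k gap columns costs d + (k - 1) e rather than k d. A minimal sketch of scoring a given alignment this way, assuming d and e are positive penalties, that match columns score +m or -s, and that switching directly from a gap in one row to a gap in the other counts as opening a new gap (the slide does not say). The values d = 2, e = 1 are hypothetical.

```python
def affine_score(row_x, row_y, m=1, s=1, d=2, e=1):
    """Score an alignment with gap-open penalty d and gap-extend penalty e."""
    total, prev = 0, "M"
    for a, b in zip(row_x, row_y):
        if a == "-":
            state = "X-gap"      # gap in the first row
        elif b == "-":
            state = "Y-gap"      # gap in the second row
        else:
            state = "M"
        if state == "M":
            total += m if a == b else -s
        elif state == prev:
            total -= e           # extending the current gap
        else:
            total -= d           # opening a new gap
        prev = state
    return total


print(affine_score("AGTA", "A-TA"))     # one gap of length 1
print(affine_score("ACGTA", "A---A"))   # one gap of length 3
```

Setting d equal to e recovers the linear gap score used earlier.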

Page 34: Sequence Similarity

A probabilistic model for alignment

Assign probabilities to every transition (arrow), and emission (pair of letters or gaps)

• Probabilities of mutation reflect amino acid similarities

• Different probabilities for opening and extending gap

Page 35: Sequence Similarity

A Pair HMM for alignments

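States and emissions:

M: emits the pair (xi, yj) with probability P(xi, yj)

I: emits xi with probability P(xi)

J: emits yj with probability P(yj)

Transition scores:

M to M: log(1 - 2δ)

M to I, M to J: log δ

I to I, J to J: log ε

I to M, J to M: log(1 - ε)

Emission score in state M: log Prob(xi, yj)

Highest scoring path corresponds to the most likely alignment!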

Page 36: Sequence Similarity

How do we find the highest scoring path?

• Compute the following matrices (DP)

M(i, j): score of the most likely alignment of x1…xi with y1…yj ending in state M

I(i, j): score of the most likely alignment of x1…xi with y1…yj ending in state I

J(i, j): score of the most likely alignment of x1…xi with y1…yj ending in state J

$$M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \begin{cases} M(i-1, j-1) + \log(1 - 2\delta) \\ I(i-1, j-1) + \log(1 - \epsilon) \\ J(i-1, j-1) + \log(1 - \epsilon) \end{cases}$$

$$I(i, j) = \max \begin{cases} M(i-1, j) + \log \delta \\ I(i-1, j) + \log \epsilon \end{cases}$$

Page 37: Sequence Similarity

The Viterbi algorithm for alignment

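• For each i = 1, …, M
    For each j = 1, …, N

$$M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \begin{cases} M(i-1, j-1) + \log(1 - 2\delta) \\ I(i-1, j-1) + \log(1 - \epsilon) \\ J(i-1, j-1) + \log(1 - \epsilon) \end{cases}$$

$$I(i, j) = \max \begin{cases} M(i-1, j) + \log \delta \\ I(i-1, j) + \log \epsilon \end{cases}$$

$$J(i, j) = \max \begin{cases} M(i, j-1) + \log \delta \\ J(i, j-1) + \log \epsilon \end{cases}$$

When the matrices are filled, we can trace back from (M, N) the likeliest alignment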
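A compact sketch of this procedure in log space is given below. Several things in it are assumptions rather than content from the slides: the gap-open and gap-extend probabilities `delta` and `eps`, the toy emission model (`p_same` for identical letters, `p_diff` otherwise), and the predecessor cells, which are taken as (i-1, j-1) for M, (i-1, j) for I and (i, j-1) for J. As in the recurrences, states I and J add no emission term.

```python
import math

NEG_INF = float("-inf")


def pair_hmm_viterbi(x, y, delta=0.1, eps=0.3, p_same=0.2, p_diff=0.01):
    """Return (log score, state path) of the best M/I/J path for x and y."""
    log_mm = math.log(1 - 2 * delta)   # M -> M
    log_gm = math.log(1 - eps)         # I -> M, J -> M
    log_open = math.log(delta)         # M -> I, M -> J
    log_ext = math.log(eps)            # I -> I, J -> J

    def emit(a, b):
        return math.log(p_same if a == b else p_diff)

    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    J = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0                      # start as if coming from state M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = emit(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1] + log_mm,
                    I[i - 1][j - 1] + log_gm,
                    J[i - 1][j - 1] + log_gm)
            if i > 0:
                I[i][j] = max(M[i - 1][j] + log_open, I[i - 1][j] + log_ext)
            if j > 0:
                J[i][j] = max(M[i][j - 1] + log_open, J[i][j - 1] + log_ext)

    # Trace back from (n, m) by re-deriving the best predecessor state.
    tables = {"M": M, "I": I, "J": J}
    state = max("MIJ", key=lambda s: tables[s][n][m])
    score, path, i, j = tables[state][n][m], [], n, m
    while i > 0 or j > 0:
        path.append(state)
        if state == "M":
            i, j = i - 1, j - 1
            prev = {"M": M[i][j] + log_mm, "I": I[i][j] + log_gm, "J": J[i][j] + log_gm}
        elif state == "I":
            i -= 1
            prev = {"M": M[i][j] + log_open, "I": I[i][j] + log_ext}
        else:
            j -= 1
            prev = {"M": M[i][j] + log_open, "J": J[i][j] + log_ext}
        state = max(prev, key=prev.get)
    return score, "".join(reversed(path))


print(pair_hmm_viterbi("AGTA", "ATA"))
```

Here state I consumes a letter of x and state J a letter of y, so for x = AGTA and y = ATA the reported path places the G of x against a gap.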

Page 38: Sequence Similarity

One way to view the state paths – State M

[Figure: grid of M states indexed by x1 … xm and y1 … yn]

Page 39: Sequence Similarity

State I

[Figure: grid of I states indexed by x1 … xm and y1 … yn]

Page 40: Sequence Similarity

State J

[Figure: grid of J states indexed by x1 … xm and y1 … yn]

Page 42: Sequence Similarity

Putting it all together

States I(i, j) are connected with states I and M at (i-1, j)

States J(i, j) are connected with states J and M at (i, j-1)

States M(i, j) are connected with states M, I and J at (i-1, j-1)

Optimal solution is the best scoring path from top-left to bottom-right corner

This gives the likeliest alignment according to our HMM

[Figure: state grid indexed by x1 … xm and y1 … yn]

Page 43: Sequence Similarity

Yet another way to represent this model

[Figure: states BEGIN, Mx1 … Mxm (along Sequence X), Iy, Ix, END]

We are aligning, or threading, sequence Y through sequence X

Every time yj lands in state xi, we get the substitution score s(xi, yj)

Every time yj is gapped, or some xi is skipped, we pay a gap penalty