CAP5510 – Bioinformatics Multiple Alignment


Page 1: CAP5510 – Bioinformatics Multiple Alignment

CAP5510 – Bioinformatics Multiple Alignment

Tamer Kahveci

CISE Department

University of Florida

Page 2: CAP5510 – Bioinformatics Multiple Alignment

Goals

• Understand
  – What is multiple alignment
  – Why align multiple sequences
• Learn
  – How multiple alignments are scored
  – Major multiple alignment methods
    • Dynamic programming
      – Standard
      – MSA
    • Progressive alignment
      – Star
      – CLUSTALW

Page 3: CAP5510 – Bioinformatics Multiple Alignment

What is Multiple Alignment?

• Alignment of more than two sequences
• Global: multiple alignment
  – http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/

scxa_buteu vrdgyiaddk dcayfcgr.. .naycdeeck ...kgaesgk cwyagqygna
scx1_titse .kdgypveyd ncayicwnyd .naycdklck ..dkkadsgy cyw...vhil
scx6_titse .regypadsk gckitcflta .agycntect ..lkkgssgy caw.....pa
scx1_cenno .kdgylvdak gckkncyklg kndycnrecr mkhrggsygy c.....ygfg
six2_leiqu ..dgyirkrd gcklsclfg. .negcnkeck ..syggsygy cwt...wgla

scxa_buteu cwcyklpdwv pikqkvsgk. cn....
scx1_titse cycyglpdse ptktn..gk. cksgkk
scx6_titse cycyglpesv kiwtsetnk. c.....
scx1_cenno cyceglsdst ptwplp.nkt csgk..
six2_leiqu cwceglpd.e ktwksetn.t cg....

Page 4: CAP5510 – Bioinformatics Multiple Alignment

What is Local Multiple Alignment?

• Local: motif (http://blocks.fhcrc.org/blocks-bin/getblock.sh?PR00624 )

ID   HISTONEH5; BLOCK
AC   PR00624A; distance from previous block=(9,12)
DE   Histone H5 signature
BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
H10_HUMAN|P07305 ( 10) AKPKRAKASKKSTDHPKYSDMI  63
H5A_XENLA|P22844 ( 11) AKPKRSKALKKSTDHPKYSDMI  71
H10_RAT|P43278   ( 10) AKPKRAKAAKKSTDHPKYSDMI  70
H10_MOUSE|P10922 ( 10) AKPKRAKASKKSTDHPKYSDMI  63
Q91759           (  9) AKPRRSKASKKSTDHPKYSDMI  71
H5B_XENLA|P22845 (  9) AKPRRSKASKKSTDHPKYSDMI  71
H5_CHICK|P02259  ( 11) AKPKRVKASRRSASHPTYSEMI 100
H5_CAIMO|P06513  ( 12) AKPKRAKAPRKPASHPSYSEMI  91
H5_ANSAN|P02258  ( 12) AKPKRARAPRKPASHPTYSEMI 100

Page 5: CAP5510 – Bioinformatics Multiple Alignment

Why Multiple Alignment

• Basis for phylogeny
• Helps find conserved regions in sets of proteins
  – Conserved regions
    • Provide insight into substitution patterns
    • Give hints about functional sites

Page 6: CAP5510 – Bioinformatics Multiple Alignment

How to Evaluate Multiple Alignments

Page 7: CAP5510 – Bioinformatics Multiple Alignment

Sum of Pairs (SP)

• Sum of the induced pairwise alignment scores of all pairs
• Ignore space pairs aligned together

Multiple alignment:

A cwcyklpdwv pikqkvsgk. cn....
B cycyglpdse ptktn..gk. cksgkk
C cycyglpesv kiwtsetnk. c.....
D cyceglsdst ptwplp.nkt csgk..

Induced pairwise alignments (their scores are added):

A cwcyklpdwv pikqkvsgk cn....
B cycyglpdse ptktn..gk cksgkk

A cwcyklpdwv pikqkvsgk cn
C cycyglpesv kiwtsetnk c.

A cwcyklpdwv pikqkvsgk. cn..
D cyceglsdst ptwplp.nkt csgk

B cycyglpdse ptktn..gk cksgkk
C cycyglpesv kiwtsetnk c.....

B cycyglpdse ptktn.gk. cksgkk
D cyceglsdst ptwplpnkt csgk..

C cycyglpesv kiwtsetnk. c...
D cyceglsdst ptwplp.nkt csgk
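The SP definition above can be sketched in a few lines of Python. This is only an illustration: the match, mismatch and gap values are assumptions (the slides do not fix a scoring scheme; a substitution matrix such as BLOSUM 62 could be used instead), and the function names are hypothetical.

```python
from itertools import combinations

GAP = "."  # the slides write gaps as dots


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one induced pairwise alignment, skipping gap-gap columns."""
    total = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # a space aligned with a space is ignored
        if x == GAP or y == GAP:
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows, **scores):
    """Sum the induced pairwise scores over all pairs of rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(a, b, **scores) for a, b in combinations(rows, 2))


# The four rows from the slide, with the spacing between blocks removed.
rows = [
    "cwcyklpdwvpikqkvsgk.cn....",
    "cycyglpdseptktn..gk.cksgkk",
    "cycyglpesvkiwtsetnk.c.....",
    "cyceglsdstptwplp.nktcsgk..",
]
print(sum_of_pairs(rows))
```

Skipping the gap-gap columns inside `pair_score` has the same effect as deleting those columns from each induced pairwise alignment before scoring it.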

Page 8: CAP5510 – Bioinformatics Multiple Alignment

BAliBASE Benchmark

• Compare to a set of hand-aligned sequences
• Check positions of letters
  – If the letters appear at the same position as the benchmark => good
• Score between 0 (worst) and 1 (best)
• http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/prog_scores.html
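One way to turn "letters at the same position as the benchmark" into a number between 0 and 1 is to count how many residue pairs aligned in the reference are also aligned in the test alignment. The sketch below does exactly that; it is a simplified illustration, not the BAliBASE scoring program itself, which may differ in details (for example by restricting the comparison to selected regions). Function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows, gap="."):
    """Set of residue pairs ((i, p), (j, q)) that share a column.

    (i, p) means residue number p (0-based, gaps not counted) of sequence i.
    """
    counters = [0] * len(rows)
    pairs = set()
    for col in zip(*rows):
        present = []
        for i, ch in enumerate(col):
            if ch != gap:
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def benchmark_score(test_rows, reference_rows, gap="."):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference_rows, gap)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test_rows, gap)) / len(ref)


reference = ["acg", "a.g"]
print(benchmark_score(["acg", "a.g"], reference))  # identical alignment
print(benchmark_score(["acg", ".ag"], reference))  # one of two pairs kept
```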

Page 9: CAP5510 – Bioinformatics Multiple Alignment

Finding Multiple Alignments

Page 10: CAP5510 – Bioinformatics Multiple Alignment

Dynamic Programming

Page 11: CAP5510 – Bioinformatics Multiple Alignment

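Dynamic Programming

• Similar to pairwise alignment
  – Compare NV and NS: the last column is one of $2^2 - 1 = 3$ cases

\[
s(NV, NS) = \max
\begin{cases}
s(N, N) + \delta(V, S) \\
s(N, NS) + \delta(V, -) \\
s(NV, N) + \delta(-, S)
\end{cases}
\]

• If k sequences are aligned
  – => k-dimensional matrix is filled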

Page 12: CAP5510 – Bioinformatics Multiple Alignment

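Dynamic Programming

Compare NA, NS and NV: k = 3, so the last column is one of $2^k - 1 = 7$ cases

\[
s(NA, NS, NV) = \max
\begin{cases}
s(N, N, N) + \delta(A, S, V) \\
s(N, N, NV) + \delta(A, S, -) \\
s(N, NS, N) + \delta(A, -, V) \\
s(NA, N, N) + \delta(-, S, V) \\
s(N, NS, NV) + \delta(A, -, -) \\
s(NA, N, NV) + \delta(-, S, -) \\
s(NA, NS, N) + \delta(-, -, V)
\end{cases}
\]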
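The k-dimensional recurrence can be sketched for any small k by enumerating the $2^k - 1$ possible last columns with `itertools.product`. The column score used here (sum of pairs with match = 1, mismatch = -1, gap = -1, gap-gap = 0) is an assumption chosen for illustration, as are the function names.

```python
from itertools import combinations, product


def column_score(col, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one column; '-' paired with '-' scores 0."""
    total = 0
    for x, y in combinations(col, 2):
        if x == "-" and y == "-":
            continue
        if "-" in (x, y):
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def msa_score(seqs):
    """Optimal multiple alignment score by filling a k-dimensional table."""
    k = len(seqs)
    # Each move says which sequences contribute a letter to the last column:
    # all 0/1 vectors except the all-zero one, i.e. 2^k - 1 cases.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {(0,) * k: 0}
    # Lexicographic order guarantees every predecessor is filled first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell in best:
            continue  # the origin
        candidates = []
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = [s[p] if d else "-" for s, p, d in zip(seqs, prev, m)]
            candidates.append(best[prev] + column_score(col))
        best[cell] = max(candidates)
    return best[tuple(len(s) for s in seqs)]


print(msa_score(["NV", "NS"]))        # pairwise case, 3 moves per cell
print(msa_score(["NA", "NS", "NV"]))  # k = 3, 7 moves per cell
```

The table is kept in a dictionary keyed by index tuples so that the same code works for any k; this is only practical for very short sequences.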

Page 13: CAP5510 – Bioinformatics Multiple Alignment

Complexity

• Space complexity: $O(n^k)$ for k sequences each n long.
• Computing at a cell: $O(2^k)$ × cost of computing δ.
• Time complexity: $O(2^k n^k)$ × cost of computing δ.
• Finding the optimal solution is exponential in k
• Proven to be NP-complete for a number of cost functions
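A rough worked example of these bounds, using illustrative sizes that are not from the slides (n = 100) and ignoring the cost of δ:

```latex
\[
k = 3:\quad n^k = 100^3 = 10^6 \text{ cells},\qquad
(2^3 - 1)\cdot 10^6 = 7 \times 10^6 \text{ cell updates}
\]
\[
k = 10:\quad n^k = 100^{10} = 10^{20} \text{ cells},\qquad
(2^{10} - 1)\cdot 10^{20} \approx 10^{23} \text{ cell updates}
\]
```

Three sequences are easily manageable; ten are far out of reach, which motivates the pruning and heuristic methods that follow.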

Page 14: CAP5510 – Bioinformatics Multiple Alignment

MSA

(Carrillo, Lipman '88)

Page 15: CAP5510 – Bioinformatics Multiple Alignment

MSA – Idea

Page 16: CAP5510 – Bioinformatics Multiple Alignment

MSA algorithm (1/3)

• Find pairwise alignment
• Trial multiple alignment produced by a tree, cost = d
• This provides a limit to the volume within which optimal alignments are found
• Specifics
  – Sequences $x_1, \ldots, x_r$.
  – Alignment A, cost = c(A)
  – Optimal alignment $A^*$
  – $A_{ij}$ = induced alignment on $x_i, x_j$ on account of A
  – $D(x_i, x_j)$ = cost of optimal pairwise alignment of $x_i, x_j$, so $D(x_i, x_j) \le c(A_{ij})$

Page 17: CAP5510 – Bioinformatics Multiple Alignment

MSA algorithm (2/3)

• The trial cost d bounds the cost of every induced pairwise alignment of $A^*$:

\[
d \ge c(A^*) = c(A^*_{uv}) + \sum_{\substack{i<j \\ (i,j)\ne(u,v)}} c(A^*_{ij})
\ge c(A^*_{uv}) + \sum_{\substack{i<j \\ (i,j)\ne(u,v)}} D(x_i, x_j)
\]

\[
c(A^*_{uv}) \le d - \sum_{\substack{i<j \\ (i,j)\ne(u,v)}} D(x_i, x_j) = B(u,v)
\]

• Compute B(u,v) for each pair of u,v
• Consider any cell f with projection (s,t) on u,v plane.
• If $A^*$ passes through f then $A^*_{uv}$ passes through (s,t)
  – $\mathrm{best}^{st}_{uv}$ = best pairwise alignment of $x_u, x_v$ that passes through (s,t).
  – $\mathrm{best}^{st}_{uv}$ = distance of the prefixes up to (s,t) + $\mathrm{cost}(x_u^s, x_v^t)$ + distance of suffixes after (s,t)

Page 18: CAP5510 – Bioinformatics Multiple Alignment

MSA algorithm (3/3)

• If $\mathrm{best}^{st}_{uv} > B(u,v)$, then
  – $A^*$ cannot pass through cell f
  – Discard such cells from computation of DP
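The pruning test can be sketched as follows. This is an illustration under stated assumptions, not the MSA program itself: costs are unit edit-distance costs (mismatch = 1, gap = 1), the upper bound d is passed in by the caller instead of being derived from a tree-based trial alignment, and the best cost through (s,t) is computed in a node-based form (prefix cost up to (s,t) plus suffix cost after it) rather than with a separate middle term. All function names are hypothetical.

```python
from itertools import combinations


def prefix_costs(a, b, sub=1, gap=1):
    """F[s][t] = optimal alignment cost (edit distance) of a[:s] and b[:t]."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for s in range(len(a) + 1):
        for t in range(len(b) + 1):
            if s == 0 and t == 0:
                continue
            options = []
            if s > 0:
                options.append(F[s - 1][t] + gap)
            if t > 0:
                options.append(F[s][t - 1] + gap)
            if s > 0 and t > 0:
                options.append(F[s - 1][t - 1] + (0 if a[s - 1] == b[t - 1] else sub))
            F[s][t] = min(options)
    return F


def best_through(a, b):
    """best[s][t] = cost of the best alignment of a, b whose path visits (s, t)."""
    F = prefix_costs(a, b)
    R = prefix_costs(a[::-1], b[::-1])  # suffix costs via the reversed strings
    n, m = len(a), len(b)
    return [[F[s][t] + R[n - s][m - t] for t in range(m + 1)] for s in range(n + 1)]


def surviving_cells(seqs, d):
    """For each pair (u, v), the projections (s, t) with best <= B(u, v)."""
    pairs = list(combinations(range(len(seqs)), 2))
    best = {p: best_through(seqs[p[0]], seqs[p[1]]) for p in pairs}
    D = {p: best[p][0][0] for p in pairs}  # every path visits the origin
    total = sum(D.values())
    keep = {}
    for p in pairs:
        bound = d - (total - D[p])  # B(u, v) = d - sum of the other D values
        keep[p] = {(s, t)
                   for s, row in enumerate(best[p])
                   for t, cost in enumerate(row)
                   if cost <= bound}
    return keep


keep = surviving_cells(["ACG", "ACG", "ACG"], d=0)
print(sorted(keep[(0, 1)]))
```

A cell f of the k-dimensional table is kept only if every one of its pairwise projections is in the corresponding `keep` set; the smaller d is, the more cells are discarded.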

Page 19: CAP5510 – Bioinformatics Multiple Alignment

Question

Align (using BLOSUM 62):

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

Page 20: CAP5510 – Bioinformatics Multiple Alignment

Progressive Alignment

Page 21: CAP5510 – Bioinformatics Multiple Alignment

Star Alignment

Page 22: CAP5510 – Bioinformatics Multiple Alignment

Star Alignments

• Heuristic method for multiple sequence alignments
• Select a sequence c as the center of the star
• For each sequence $x_1, \ldots, x_k$ such that $x_i \ne c$, perform a Needleman-Wunsch global alignment of $x_i$ and c

Page 23: CAP5510 – Bioinformatics Multiple Alignment

Star Alignments Example

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

Center: s2 (MKE). Pairwise alignments of s1, s3, s4 to the center:

MPE     MSKE    SKE
| |     | ||     ||
MKE     M-KE    MKE

Merging the pairwise alignments one at a time:

MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE

• Every induced pairwise alignment to the center sequence is the optimal one.
• How should we choose a center? (Exercise: try s4 as the center)
• Try all of them?
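The example can be reproduced with a short sketch of star alignment. The scoring values (match = 1, mismatch = -1, gap = -1) are an assumption standing in for BLOSUM 62, the center is supplied by the caller, and the function names are hypothetical. Gaps introduced into the center are kept in every later step.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    sub = lambda x, y: match if x == y else mismatch
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal move
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def star_alignment(seqs, center):
    """Align every sequence to seqs[center] and merge the results."""
    rows = {center: seqs[center]}  # gapped rows, keyed by input index
    for idx, seq in enumerate(seqs):
        if idx == center:
            continue
        c_new, x_new = needleman_wunsch(seqs[center], seq)
        c_old = rows[center]
        merged = {k: [] for k in rows}
        x_row = []
        i = j = 0
        while i < len(c_old) or j < len(c_new):
            if (i < len(c_old) and j < len(c_new)
                    and c_old[i] != "-" and c_new[j] != "-"):
                for k in rows:            # same center residue in both
                    merged[k].append(rows[k][i])
                x_row.append(x_new[j]); i += 1; j += 1
            elif i < len(c_old) and c_old[i] == "-":
                for k in rows:            # gap already present in the center
                    merged[k].append(rows[k][i])
                x_row.append("-"); i += 1
            else:
                for k in rows:            # new gap: pad all existing rows
                    merged[k].append("-")
                x_row.append(x_new[j]); j += 1
        rows = {k: "".join(v) for k, v in merged.items()}
        rows[idx] = "".join(x_row)
    return [rows[k] for k in range(len(seqs))]


for row in star_alignment(["MPE", "MKE", "MSKE", "SKE"], center=1):
    print(row)
```

Passing `center=3` runs the exercise with s4 as the center. A common way to pick the center automatically is to try each sequence and keep the one with the best total pairwise score to the others.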

Page 24: CAP5510 – Bioinformatics Multiple Alignment

CLUSTAL-W

(Thompson, Higgins, Gibson 1994)

Page 25: CAP5510 – Bioinformatics Multiple Alignment

CLUSTAL-W (1/4)

• Given sequences A, B, C, D, E
• Compare all pairs and construct a distance matrix

     A  B  C  D  E
  A
  B
  C
  D
  E
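A minimal sketch of the all-pairs comparison step. As a stand-in for the alignment-based distances a real tool would compute, it uses edit distance divided by the longer sequence length; this substitution, the function names, and the example sequences (borrowed from the earlier question slide) are assumptions for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y))) # match or substitute
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of normalized pairwise distances in [0, 1]."""
    k = len(seqs)
    D = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            longest = max(len(seqs[i]), len(seqs[j]), 1)
            D[i][j] = D[j][i] = edit_distance(seqs[i], seqs[j]) / longest
    return D


for row in distance_matrix(["MPE", "MKE", "MSKE", "SKE"]):
    print(["%.2f" % x for x in row])
```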

Page 26: CAP5510 – Bioinformatics Multiple Alignment

CLUSTAL-W (2/4)

• Find phylogenetic tree for A, B, C, D, E using neighbor joining

[Figure: neighbor-joining tree over the leaves D, B, A, C, E, built up in stages]
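The neighbor-joining step can be sketched as below. The distance values are made-up illustrative numbers, not data from the slides, and the function returns only the sequence of joins with their branch lengths rather than a full tree object. Ties between equally good pairs are broken by iteration order, so later joins may differ between implementations.

```python
def neighbor_joining(names, D):
    """Return the joins (left, right, left_length, right_length) in order."""
    nodes = list(names)
    dist = {(a, b): D[i][j]
            for i, a in enumerate(names) for j, b in enumerate(names)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] = total distance from a to every other current node
        r = {a: sum(dist[a, b] for b in nodes) for a in nodes}
        # choose the pair minimising Q(a, b) = (n - 2) d(a, b) - r[a] - r[b]
        i, j = min(((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
                   key=lambda p: (n - 2) * dist[p] - r[p[0]] - r[p[1]])
        len_i = 0.5 * dist[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dist[i, j] - len_i
        u = (i, j)  # the new internal node is named by its two children
        for k in nodes:
            if k not in (i, j):
                dist[u, k] = dist[k, u] = 0.5 * (dist[i, k] + dist[j, k] - dist[i, j])
        dist[u, u] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append((i, j, len_i, len_j))
    return joins


# Hypothetical distances between five sequences A..E.
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
for join in neighbor_joining("ABCDE", D):
    print(join)
```

The last two remaining nodes are connected by a single edge, which completes the unrooted tree.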

Page 27: CAP5510 – Bioinformatics Multiple Alignment

CLUSTAL-W (3/4)

• Align sequences starting from leaf level
  – Edge weights are used to compute the score of the alignment
• $O(k^2 n^2)$ time
• $O(n^2)$ space
• Result depends on sequence order

[Figure: guide tree with leaves D, B, A, C, E]

CLUSTAL-W (4/4)

• Sample query using ClustalW

• http://www.cise.ufl.edu/~tamer/teaching/fall2007/other/sampleMSAquery

• http://www.ebi.ac.uk/clustalw/

Page 29: CAP5510 – Bioinformatics Multiple Alignment

Other Progressive Methods

• T-COFFEE
• PILEUP
• MUSCLE
• …

Page 30: CAP5510 – Bioinformatics Multiple Alignment

T-coffee (Notredame, Higgins, Heringa 2000)

• Find a library of alignments between pairs of sequences.
• Create a new scoring matrix for each pair of sequences using the library
  – Directly from alignment of s1 and s2
  – Indirectly through alignment of s1, s3 and s3, s2.
• Use these scoring matrices during progressive alignment

[Figure: scoring matrix for s1 and s2]

Page 31: CAP5510 – Bioinformatics Multiple Alignment

Iterative Alignment

Page 32: CAP5510 – Bioinformatics Multiple Alignment

PRRP (Gotoh 1996)

• Motivation: If the initial sequences are not good ones, progressive alignment fails.

• Idea: Iteratively update the alignment

Page 33: CAP5510 – Bioinformatics Multiple Alignment

PRRP

1. Find some initial alignment

A cwcyklpdwv pikqkvsgk. cn....
B cycyglpdse ptktn..gk. cksgkk
C cycyglpesv kiwtsetnk. c.....
D cyceglsdst ptwplp.nkt csgk..
E cyceglpdst piwplp.nkt ctgk..

2. Construct phylogenetic tree based on multiple alignment

[Figure: tree with leaves D, B, A, C, E]

3. Align sequences

A cwcyklpdwv pikqkvsgk. cn....
B cycyglpdse ptktn..gk. cksgkk
C cycyglpesv kiwtsetnk. c.....
D cyceglsdst ptwplp.nkt csgk..
E cyceglpdst piwplp.nkt ctgk..

Go back to step 2 if the result has improved

Page 34: CAP5510 – Bioinformatics Multiple Alignment

Other methods

• Genetic algorithm (machine learning)
• Partial order graphs (graph matching)
• HMMER (hidden Markov model)
• For a comparison:
  – http://www.cise.ufl.edu/~tamer/papers/psb2006.pdf

Page 35: CAP5510 – Bioinformatics Multiple Alignment

Motif Logos

ID   HISTONEH5; BLOCK
AC   PR00624A; distance from previous block=(9,12)
DE   Histone H5 signature
BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
H10_HUMAN|P07305 ( 10) AKPKRAKASKKSTDHPKYSDMI  63
H5A_XENLA|P22844 ( 11) AKPKRSKALKKSTDHPKYSDMI  71
H10_RAT|P43278   ( 10) AKPKRAKAAKKSTDHPKYSDMI  70
H10_MOUSE|P10922 ( 10) AKPKRAKASKKSTDHPKYSDMI  63
Q91759           (  9) AKPRRSKASKKSTDHPKYSDMI  71
H5B_XENLA|P22845 (  9) AKPRRSKASKKSTDHPKYSDMI  71
H5_CHICK|P02259  ( 11) AKPKRVKASRRSASHPTYSEMI 100
H5_CAIMO|P06513  ( 12) AKPKRAKAPRKPASHPSYSEMI  91
H5_ANSAN|P02258  ( 12) AKPKRARAPRKPASHPTYSEMI 100
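A sequence logo draws each column of such a block with a height that reflects how conserved it is. A minimal sketch of that per-column quantity, assuming the usual information-content measure (maximum entropy of a 20-letter alphabet minus the observed entropy) and omitting the small-sample correction that logo tools commonly apply; the function name is hypothetical.

```python
from collections import Counter
from math import log2


def column_information(rows, alphabet_size=20):
    """Bits per column: log2(alphabet_size) minus the Shannon entropy."""
    info = []
    for col in zip(*rows):
        total = len(col)
        counts = Counter(col)
        entropy = -sum((c / total) * log2(c / total) for c in counts.values())
        info.append(log2(alphabet_size) - entropy)
    return info


# The nine rows of the Histone H5 block shown above.
block = [
    "AKPKRAKASKKSTDHPKYSDMI",
    "AKPKRSKALKKSTDHPKYSDMI",
    "AKPKRAKAAKKSTDHPKYSDMI",
    "AKPKRAKASKKSTDHPKYSDMI",
    "AKPRRSKASKKSTDHPKYSDMI",
    "AKPRRSKASKKSTDHPKYSDMI",
    "AKPKRVKASRRSASHPTYSEMI",
    "AKPKRAKAPRKPASHPSYSEMI",
    "AKPKRARAPRKPASHPTYSEMI",
]
for position, bits in enumerate(column_information(block), 1):
    print(position, round(bits, 2))
```

Fully conserved columns reach the maximum of log2(20) bits; variable columns score lower.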