Bioinformatics Sequence comparison 2people.brunel.ac.uk/.../website_bioinformaticsHM/... · Global...

Bioinformatics David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Sequence comparison 2 local pairwise alignment


Bioinformatics

David Gilbert
Bioinformatics Research Centre
www.brc.dcs.gla.ac.uk
Department of Computing Science, University of Glasgow

Sequence comparison 2: local pairwise alignment


(c) David Gilbert 2008 Sequence Comparison (2) 2

Lecture contents

• Variations on dynamic programming
  – Gap penalties
  – Substitution matrices
• To explain why local alignments may be more appropriate than global ones
• To describe the use of dot plots in visualising an alignment
• To describe the Smith-Waterman method of finding and scoring an optimal local pairwise alignment
• To describe in outline the BLAST algorithm for database search

Solution to Week 1 Exercise

[Figure: matrix for the Week 1 exercise. Only the labels A, A, D, C, 0 and ACEEA0 survive; the cell values are not recoverable.]

Percentage sequence identity

$$\text{percentage identity} = \frac{\text{number of identical residues} \times 100}{\text{number of residues in smallest sequence}}$$

Can differ if have gaps/no_gaps: compute for these sequences:

    -TGCAT-A-
     | | | |
    AT-C-TGAT

    TGCATA
      |  |
    ATCTGAT

Sequence similarity - a change at an amino-acid residue or nucleotide that preserves the physico-chemical properties of the residue
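As a rough check on the exercise above, the identity formula can be sketched in Python. The function name `percent_identity` is hypothetical, and the sketch assumes the two rows are already aligned, with `-` marking a gap and the shorter row simply overlaid from the left in the ungapped case.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage identity of two aligned rows; '-' marks a gap."""
    # Count columns where both rows hold the same residue (gaps never count).
    identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    # Denominator as on the slide: residues (not gaps) in the smallest sequence.
    smallest = min(len(a.replace("-", "")), len(b.replace("-", "")))
    return 100.0 * identical / smallest


gapped = percent_identity("-TGCAT-A-", "AT-C-TGAT")
ungapped = percent_identity("TGCATA", "ATCTGAT")
print(f"with gaps: {gapped:.1f}%  without gaps: {ungapped:.1f}%")
```

The gapped alignment has 4 identical columns and the ungapped overlay has 2, both over 6 residues, so the two figures differ by a factor of two for the same pair of sequences.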

β and α globin, without gaps

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

β KEFTPPVQAAYQKVVAGVANALAHKYH
α VHASLDKFLASVSTVLTSKYR

Compute the identity%

β and α globin, with gaps

CLUSTAL W (1.81) multiple sequence alignment

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ

β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP

β KEFTPPVQAAYQKVVAGVANALAHKYH
α AEFTPAVHASLDKFLASVSTVLTSKYR

Compute the identity%

Human beta globin hits coyote!

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
Length = 146

Score = 276 bits (698), Expect = 2e-74
Identities = 131/146 (89%), Positives = 137/146 (93%)

Query:   2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
           VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
Sbjct:   1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60

Query:  62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
           KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
Sbjct:  61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120

Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
           EFTP VQAAYQKVVAGVANALAHKYH
Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

Compute the identity%

What happened?

Penalising gaps

• Gap = maximal consecutive run of spaces in an alignment (1 or more spaces)
• Simple penalty - each gap contributes a constant weight
• More complex - gap penalty proportional to gap length
• Large gap penalty → few gaps (fewer substrings in alignment). Small penalty → fragmented alignments.
• FASTA:
  – GAPOPEN: Penalty for the first residue in a gap (-12 for proteins, -16 for DNA).
  – GAPEXT: Penalty for additional residues in a gap (-2 for proteins, -4 for DNA).
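The GAPOPEN/GAPEXT scheme above can be sketched as a small function. This is an illustration only: `gap_penalty` is a hypothetical name, and it assumes the reading given on the slide, where the first residue of a gap costs GAPOPEN and each further residue costs GAPEXT.

```python
def gap_penalty(length: int, gap_open: int = -12, gap_ext: int = -2) -> int:
    """Penalty for one gap of `length` residues (defaults: the protein values above)."""
    if length < 1:
        raise ValueError("a gap contains at least one residue")
    # First residue pays gap_open, every additional residue pays gap_ext.
    return gap_open + (length - 1) * gap_ext


protein_gap = gap_penalty(3)                          # one protein gap of 3 residues
dna_gap = gap_penalty(3, gap_open=-16, gap_ext=-4)    # same gap with the DNA values
print(protein_gap, dna_gap)
```

With these values one gap of three residues is penalised less than three separate single-residue gaps, which is why a large opening penalty gives few gaps.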

Substitution matrices

• Unitary matrix: match=1, mismatch=0
  – sparse matrix (most elements are 0)
• Poor diagnostic power
  – all identical matches carry identical weighting
• We can enhance scoring potential of weak but biologically significant signals
• Scoring matrices - weight matches for non-identical residues according to observed substitution rates.
• More on this later!

[Table: 4 × 4 matrix with rows and columns A, C, G, T; cell values not recoverable]
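A unitary matrix for the nucleotide alphabet can be written down directly from the rule match=1, mismatch=0. The sketch below is illustrative; the names `unitary` and `score_ungapped` are not from the lecture.

```python
ALPHABET = "ACGT"

# Unitary matrix: 1 on the diagonal (match), 0 everywhere else (mismatch).
unitary = {(a, b): 1 if a == b else 0 for a in ALPHABET for b in ALPHABET}


def score_ungapped(x: str, y: str, matrix: dict) -> int:
    """Sum the matrix entries column by column for two equal-length strings."""
    return sum(matrix[(a, b)] for a, b in zip(x, y))


print(score_ungapped("ACGT", "ACCT", unitary))
```

Only 4 of the 16 entries are non-zero, and every match scores the same whichever residue is involved, which is the poor diagnostic power noted above.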

Global and local alignment

• Global alignment - as per dynamic programming solution as explained
  – Needleman & Wunsch algorithm (1970)
• Local alignment - find local regions from each string which are similar:
  – Corresponds to shorter, localised paths in the matrix.
  – Justification - biological functional sites localised to short conserved regions (no indels/mutations).
  – Smith-Waterman algorithm (1981)

Local alignment

• Start & end dynamic programming computation at any cells instead of (0,0) and (i,j)
• The matrix contains a maximum value that may not be at (i,j) [the end of the input sequences]
  – represents the endpoint of an alignment such that no other pair of segments with greater similarity exists between the 2 sequences

Global vs local alignment

[Figure: two panels labelled "Global, Needleman & Wunsch" and "Local, Smith & Waterman"]

Local Pairwise Alignment

• Distantly related sequences, e.g. proteins
  – Uneven accumulation of mutations along sequence
    • Similar segments in overall dissimilar sequences
  – Rearrangement of gene segments in genome
    • Related sub-sequences in unrelated genes
• Local similarity corresponds to
  – Shared structural or functional motif
    • Robust to mutations
    • Evolutionarily important
• Global alignment may fail in such cases
  – Island of similarity lost in random symbol matches

Local Pairwise Alignment

• Need to find similar segments in sequences
• Database search task: find homologous sequences {d} to query q in database D
  – In a reasonable time
  – Present only homologous sequences (True Positives)
  – Do not present non-homologous sequences (False Positives)
• First – how to find local alignments?

Dot Matrices

• First technique to discover local similarities
  – M by N matrix created – symbols of q (length M) on one axis, symbols of d (length N) on the other
  – Matrix populated with dots and spaces
  – Dot in cell (i,j) indicates that q(i) = d(j)
• Easy to understand visualisation
• Common substrings found easily – contiguous diagonal dots

Dot plots

• A convenient way of comparing 2 sequences visually
• Use matrix, put 1 sequence on X-axis, 1 on Y-axis
• Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)
• Identical sequences will look like WHAT?
• Similar sequences will have a broken diagonal, plus some other lines
• Distantly related sequences - much noisier.
• Can generate an alignment
• Best path through dotplot given by dynamic programming algorithms:
  – global alignment = Needleman & Wunsch
  – local alignment = Smith & Waterman

Dot plots

    H L T P E E K S V H T
  H
  A
  K
  P
  E
  E
  K
  S
  A
  V
  T

Dot plots

    H L T P E E K S V H T
  H x                 x
  A
  K             x
  P       x
  E         x x
  E         x x
  K             x
  S               x
  A
  V                 x
  T     x               x

Alignment

    HLTPEEKSVHT
    |  |||||  |
    HAKPEEKSAVT
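The dot plot above can be reproduced with a few lines of Python. This is a minimal sketch of the simplest scheme (a dot wherever the two symbols are identical); the function names `dot_matrix` and `render` are illustrative.

```python
def dot_matrix(q: str, d: str) -> list[list[bool]]:
    """Cell (i, j) is True when q[i] == d[j]."""
    return [[q_sym == d_sym for d_sym in d] for q_sym in q]


def render(q: str, d: str) -> str:
    """Text dot plot: d along the top, q down the side, 'x' for a dot."""
    lines = ["  " + " ".join(d)]
    for q_sym, row in zip(q, dot_matrix(q, d)):
        cells = " ".join("x" if dot else " " for dot in row)
        lines.append((q_sym + " " + cells).rstrip())
    return "\n".join(lines)


q, d = "HAKPEEKSAVT", "HLTPEEKSVHT"
print(render(q, d))
```

The run of dots for PEEKS on the main diagonal is the common substring that the alignment picks up.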


A dotplot

Try a dotplot and alignment

    M T F R D L L S V S F E G P R P
  M
  T
  F
  R
  D
  L
  L
  S
  V
  S
  F
  E
  G
  P
  R
  P
  D
  S
  S
  A
  G
  G


Try a dotplot and alignment

• Two sequences

q = ANTGDSCTAWCDEFGHIKPQWERTY

d = TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT

Dot Matrices

• Easy to identify common recurring substring CDEFGHIK
• Anti-diagonal identifies reversed substring TRE
• Matrix image can be ‘noisy’
  – Most of the dots are not associated with a common substring
• Matrices can be very large & unwieldy for typical protein sequences of ~500 to ~1000 amino acids
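One way to cut through the noise is to report only runs of contiguous dots along a diagonal. The sketch below does this for the q and d of the exercise; `diagonal_runs` is a hypothetical helper, and reversing d is used here as a simple way to look along anti-diagonals.

```python
def diagonal_runs(q: str, d: str, min_len: int) -> list[tuple[int, int, int]]:
    """Return (i, j, length) for each maximal diagonal run of matches >= min_len."""
    runs = []
    for i in range(len(q)):
        for j in range(len(d)):
            if q[i] != d[j]:
                continue
            # Skip cells in the middle of a run; only start at its first cell.
            if i > 0 and j > 0 and q[i - 1] == d[j - 1]:
                continue
            length = 0
            while (i + length < len(q) and j + length < len(d)
                   and q[i + length] == d[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
d = "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"

forward = diagonal_runs(q, d, 8)           # long common substrings
backward = diagonal_runs(q, d[::-1], 3)    # runs against reversed d (anti-diagonals)
for i, j, n in forward:
    print(q[i:i + n], "at q offset", i, "and d offset", j)
```

Offsets are 0-based. The forward search reports CDEFGHIK twice, once for each copy in d; in the backward search an offset j refers to the reversed string, so j = 38 with length 3 corresponds to the TRE at the start of d.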

Smith-Waterman Method

• Require objective score of alignment
• Can employ dynamic programming method (Lecture 1) though requires some changes
• Consider following two sequences
  – q = ACEDECADE
  – d = REDCEDKL
• Unsure at what symbols (residues) highest scoring local alignments end – all pairs should be considered
• Consider q8 and d6

Smith-Waterman Method

• Consider q8 and d6 i.e. q = ACEDECAD & d = REDCED
• Scoring 0.5 equality, -0.3 inequality, -0.5 gap
• Removing first two pairs in alignment will improve alignment score – negative scores

q8      A     C     E     D     E     C     A     D
d6      R     -     E     D     -     C     E     D
c.s  -0.3  -0.5   0.5   0.5  -0.5   0.5  -0.3   0.5
a.s  -0.3  -0.8  -0.3   0.2  -0.3   0.2  -0.1   0.4

(a.s is the running sum of the c.s row)

Smith-Waterman Method

q3…8      E     D     E     C     A     D
d2…6      E     D     -     C     E     D
c.s     0.5   0.5  -0.5   0.5  -0.3   0.5
a.s     0.5   1.0   0.5   1.0   0.7   1.2

Smith-Waterman Method

• Removing prefixes with negative scores improves overall score
• To find best local alignment ending in qi & dj must find best starting point
• Fortunately a DP method can be employed
• In this case all negative values are replaced with zero
• Simple change to the global alignment DP
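The effect of dropping a negative-scoring prefix can be checked on the column scores of the q8/d6 alignment shown earlier. The sketch below is only an illustration for one fixed alignment (the DP does this for all alignments at once); `best_start` is a hypothetical helper.

```python
# Column scores of the q8/d6 alignment: 0.5 equality, -0.3 inequality, -0.5 gap.
column_scores = [-0.3, -0.5, 0.5, 0.5, -0.5, 0.5, -0.3, 0.5]


def best_start(scores: list[float]) -> tuple[int, float]:
    """Index to start from so that the sum up to the last column is largest."""
    best_index, best_sum, running = len(scores), 0.0, 0.0
    # Walk backwards from the fixed end point, accumulating the suffix sum.
    for i in range(len(scores) - 1, -1, -1):
        running += scores[i]
        if running > best_sum:
            best_index, best_sum = i, running
    return best_index, round(best_sum, 10)


start, score = best_start(column_scores)
print(start, score)
```

The best start is index 2 (0-based), i.e. the first two pairs are removed, and the score rises from 0.4 for the whole alignment to 1.2.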

Smith-Waterman Method

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(q_i, d_j) \\ S_{i,j-1} + s(-, d_j) \\ S_{i-1,j} + s(q_i, -) \\ 0 \end{cases}$$

Smith-Waterman Method

• Now initialise first row & column with 0
• In this example: Score as 0.5 equality, -0.3 inequality, -0.5 gap, Si,j ≥ 0

I/J     0    R    E    D    C    E    D    K    L
0       0    0    0    0    0    0    0    0    0
A       0    0    0    0    0    0    0    0    0
C       0    0    0    0  0.5    0    0    0    0
E       0    0  0.5    0    0  1.0  0.5    0    0
D       0    0    0  1.0  0.5  0.5  1.5  1.0  0.5
E       0    0  0.5  0.5  0.7  1.0  1.0  1.2  0.7
C       0    0    0  0.2  1.0  0.5  0.7  0.7  0.9
A       0    0    0    0  0.5  0.7  0.2  0.4  0.4
D       0    0    0  0.5    0  0.2  1.2  0.7  0.2
E       0    0  0.5    0  0.2  0.5  0.7  0.9  0.4

Smith-Waterman Method

• Initialise first row & column with 0
• Si,j ≥ 0
• Find maximum partial alignment score Sk,l
• Trace backwards, constructing alignment, until reach a cell with value 0
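The four steps above can be sketched in Python as follows. This is a minimal illustration with the lecture's example scores (0.5 equality, -0.3 inequality, -0.5 per gap symbol) and not a tuned implementation; the function name, the `move` table and the tie-breaking order (diagonal first) are assumptions of the sketch.

```python
def smith_waterman(q, d, match=0.5, mismatch=-0.3, gap=-0.5):
    """Return (best score, aligned q segment, aligned d segment, score matrix)."""
    rows, cols = len(q) + 1, len(d) + 1
    S = [[0.0] * cols for _ in range(rows)]      # first row & column stay 0
    move = [[None] * cols for _ in range(rows)]  # None marks a cell with value 0
    best, best_cell = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if q[i - 1] == d[j - 1] else mismatch
            candidates = [
                (S[i - 1][j - 1] + s, "diag"),  # q_i aligned with d_j
                (S[i - 1][j] + gap, "up"),      # q_i aligned with a gap
                (S[i][j - 1] + gap, "left"),    # d_j aligned with a gap
            ]
            score, step = max(candidates, key=lambda c: c[0])
            score = round(score, 10)            # keep float noise out of the table
            if score > 0:                       # negative values are replaced by 0
                S[i][j], move[i][j] = score, step
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    # Trace back from the maximum until a cell with value 0 is reached.
    i, j = best_cell
    top, bottom = [], []
    while move[i][j] is not None:
        step = move[i][j]
        if step == "diag":
            top.append(q[i - 1]); bottom.append(d[j - 1]); i -= 1; j -= 1
        elif step == "up":
            top.append(q[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(d[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), S


score, top, bottom, S = smith_waterman("ACEDECADE", "REDCEDKL")
print(score, top, bottom)
```

For q = ACEDECADE and d = REDCEDKL the maximum is 1.5, reached at q4/d6, and the traceback gives CED aligned with CED. The cell for q8/d6 holds 1.2, the score found by hand for the q3…8 / d2…6 alignment.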


FASTA (Lipman & Pearson 1985)

• Fast approximation to the Smith-Waterman algorithm
• Local alignments - tries to find paths of regional similarity, rather than trying to find the best alignment between 2 sequences.
• Alignments can contain gaps.
• Rapid
• Heuristic - not guaranteed to find the best alignment between 2 sequences; it may miss matches.
  – uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.
• A substitution matrix is used during all phases of protein searches

FASTA

• Identifies short words (k-tuples) common to both sequences (nucleotides k=1 or 2, amino-acids, k up to 6)
• Similar to focussing on diagonal matches in dynamic programming algorithm
• Uses heuristic to join k-tuples close on same diagonal
• If significant number of matches found, then uses dynamic programming to compute gapped alignment
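The first step, finding shared k-tuples and grouping them by diagonal, can be sketched as below. This is not the FASTA program itself, only an illustration of the lookup idea; `ktuple_diagonals` is a hypothetical name and the sequences are the q and d of the dot plot exercise.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(q: str, d: str, k: int = 2) -> Counter:
    """Count k-tuples shared by q and d, keyed by diagonal (offset in d minus offset in q)."""
    positions = defaultdict(list)
    for i in range(len(q) - k + 1):
        positions[q[i:i + k]].append(i)          # lookup table of k-tuples in q
    diagonals = Counter()
    for j in range(len(d) - k + 1):
        for i in positions.get(d[j:j + k], ()):  # every place this k-tuple occurs in q
            diagonals[j - i] += 1
    return diagonals


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
d = "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"
diagonals = ktuple_diagonals(q, d, k=2)
print(diagonals.most_common(3))
```

The two diagonals that carry the CDEFGHIK matches collect 7 shared 2-tuples each, while every other diagonal has at most one, so they are the obvious candidates for the later joining and dynamic programming steps.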

FASTA algorithm

www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure9.pdf

FASTA output

HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. (146 aa)
initn: 886 init1: 886 opt: 886 Z-score: 1095.3 bits: 208.6 E(): 2.9e-54
Smith-Waterman score: 886; 89.726% identity (89.726% ungapped) in 146 aa overlap (2-147:1-146)

               10        20        30        40        50        60
EMBOSS MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
        :::: :::: :..:::::::::::::::::::.:::::::::.::::::::::::.: :
SW:HBB  VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAK
                10        20        30        40        50

               70        80        90       100       110       120
EMBOSS VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
       :::::::::..::::: .::::::::: ::::::::::::::::.:::::::::::::::
SW:HBB VKAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFG
      60        70        80        90       100       110

              130       140
EMBOSS KEFTPPVQAAYQKVVAGVANALAHKYH
       ::::: :::::::::::::::::::::
SW:HBB KEFTPQVQAAYQKVVAGVANALAHKYH
     120       130       140

Database Search - BLAST

• SW complexity similar to DP for global alignment
• Not realistic for database search in terms of time
• Trade off the guarantee of finding the best alignment against time expense
• Basic Local Alignment Search Tool – BLAST (Altschul et al 1990)
• Employs fast search to find small segments with similar score in both sequences
• Extend small segments (local alignments)
• Returns maximal segment pairs (MSP) & MSP score & statistical significance of scores

Database Search - BLAST

• Query sequence split into words of defined length
  – Query string q = AFGTULL with word length of L = 3
  – AFG, FGT, GTU, TUL, ULL
• Define a threshold alignment score T
• Find all word-pairs of length L with score ≥ T
  – For amino acids there are 20^3 = 8000 distinct words w of length L=3
  – e.g. find all w such that S(w, AFG) ≥ T
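The word list and the set of words scoring at least T against a query word can be sketched as follows. The scoring function `toy_score` (2 for identical residues, -1 otherwise) is an assumption made purely for illustration; a real protein search would score with a substitution matrix. The function names are also illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def query_words(q: str, L: int = 3) -> list[str]:
    """All overlapping words of length L in the query."""
    return [q[i:i + L] for i in range(len(q) - L + 1)]


def toy_score(a: str, b: str) -> int:
    """Assumed toy scoring: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def neighbourhood(word: str, T: int, score=toy_score) -> list[str]:
    """Every word w of the same length with S(w, word) >= T."""
    return ["".join(w) for w in product(AMINO_ACIDS, repeat=len(word))
            if sum(score(a, b) for a, b in zip(w, word)) >= T]


words = query_words("AFGTULL")
near_afg = neighbourhood("AFG", T=3)
print(words, len(near_afg))
```

With this toy scoring and T = 3, the neighbourhood of AFG is the word itself plus the 57 words that differ from it in exactly one position, out of the 8000 candidates.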

Database Search - BLAST

• Search database for all ‘hits’ - sequences with exact matches to each w
  – Indexing of sequences to create ‘inverted file’ by employing hash table to index sequences – fast
• Extend alignment of ‘hits’ while score increases – producing High Scoring Pairs (HSPs)
• Return sequences with HSPs which have significantly (statistically) higher scores than a threshold Smax
• Smax obtained empirically from random sequences
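The indexing and extension steps can be sketched as below. This is a simplified illustration, not the BLAST implementation: the extension rule here stops as soon as the next pair fails to raise the score, the scoring function `toy_score` (2 for identity, -1 otherwise) is an assumption, and all names are hypothetical.

```python
from collections import defaultdict


def build_index(database: dict[str, str], L: int = 3) -> dict:
    """Hash table from each length-L word to its (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for j in range(len(seq) - L + 1):
            index[seq[j:j + L]].append((seq_id, j))
    return index


def toy_score(a: str, b: str) -> int:
    """Assumed toy scoring: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def extend_hit(q, d, i, j, L, score=toy_score):
    """Extend the exact hit q[i:i+L] == d[j:j+L] without gaps while the score increases."""
    total = sum(score(a, b) for a, b in zip(q[i:i + L], d[j:j + L]))
    start_q, start_d, length = i, j, L
    # Extend to the right, one pair at a time.
    while (start_q + length < len(q) and start_d + length < len(d)
           and score(q[start_q + length], d[start_d + length]) > 0):
        total += score(q[start_q + length], d[start_d + length])
        length += 1
    # Extend to the left in the same way.
    while (start_q > 0 and start_d > 0
           and score(q[start_q - 1], d[start_d - 1]) > 0):
        total += score(q[start_q - 1], d[start_d - 1])
        start_q, start_d, length = start_q - 1, start_d - 1, length + 1
    return start_q, start_d, length, total


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
database = {"d": "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"}
index = build_index(database)
hits = index["DEF"]                                    # exact matches to the word DEF
hsp = extend_hit(q, database["d"], q.find("DEF"), hits[0][1], 3)
print(hits, hsp)
```

Starting from the word DEF, the extension grows the hit in both directions to the full CDEFGHIK segment pair.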

Database Search - BLAST

• Varying the threshold alignment score T
  – Search time decreases as T is increased, fewer word pairs are found
  – Sensitivity of search decreases as T is increased, word pairs overlooked (homologous sequences may be discarded)
• The score of the alignment Smax AND the associated statistical significance are required to assess whether homology is suggested

BLAST - Basic Local Alignment Search Tool (Altschul et al 1990)

• Given 2 sequences:
  – Segment pair - pair of subsequences of the same length forming an ungapped alignment
  – Computes all segment pairs
  – If there is an MSP, maximal segment pair (highest score of all pairs for 1 comparison), above some cutoff score C and C is “significant” then report hit
  – Also reports those sequences where the score of MSP < C, but several segment pairs in combination are significant.
  – Reports score from highest scoring pairs & probability scores [E values] (expected by chance).
• Only produces ungapped alignments

BLAST algorithm

www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf

Gapped BLAST (Altschul et al 1997)

• Seeks only 1 (not all) ungapped alignments with significant match
• Uses dynamic programming to extend residues in both directions to give gapped alignment
• 3x faster than ungapped BLAST

Edited results (EMBL)

Database: embl: 958,670 sequences; 2,466,994,978 total letters

                                                                Score    E
Sequences producing significant alignments:                    (bits)  Value

EM_HUM:HSBGL1 V00497 Human messenger RNA for beta-globin.        1241  0.0
EM_HUM:AF181989 AF181989 Homo sapiens hemoglobin beta subuni...  1116  0.0
EM_HUM:HSHEMOB M25113 Human sickle beta-hemoglobin mRNA.         1100  0.0
EM_PAT:I32884 I32884 Sequence 9 from patent US 5589367.           910  0.0
EM_HUM:HS202231 U20223 Human thalassemia beta globin gene, c...   416  e-114
EM_OM:AGHBD M19061 Spider monkey (A.geoffroyi) delta-globin ...   369  1e-99
EM_OM:CPHBB5CP J00330 monkey (c.polykomos) beta-globin gene;...   367  4e-99
EM_OM:PPHBD M21825 Orangutan delta globin gene, complete cds.     347  4e-93
EM_OM:CPHBDPSC J00335 Monkey (colobus) delta-globin pseudoge...   297  3e-78
EM_OM:LMHBB M15734 Lemur (brown) beta-globin gene, complete ...   270  7e-70
EM_OM:TSHBD J04428 T.syrichta delta-globin gene, complete cds.    266  1e-68
EM_OM:OCU60902 U60902 Otolemur crassicaudatus epsilon-, gamm...   266  1e-68
EM_OM:LEBGLOB Y00347 Lepus europaeus adult beta-globin gene       266  1e-68
EM_OM:GCDELGLB M61740 G.crassicaudatus beta globin gene, com...   266  1e-68
EM_OM:MOHBDPS J00332 monkey (anubis) silent delta-globin gene.    262  2e-67
EM_OM:TSHBB J04429 T.syrichta beta globin gene, complete cds.     260  7e-67
EM_PAT:A34698 A34698 Synthetic pSXBeta+ sequence                  258  3e-66
EM_OM:OCBGLO V00882 Rabbit (O. cuniculus) gene for beta-globin.   250  7e-64
EM_OM:BTBG M63453 Bovine Beta globin gene and globin (PSI-3)...   220  6e-55

Edited results (Swiss-prot)

Database: swissprot: 86,593 sequences; 31,411,157 total letters

                                                             Score    E
Sequences producing significant alignments:                 (bits)  Value

SW:HBB_HUMAN P02023 HEMOGLOBIN BETA CHAIN. (human)             306  2e-83
SW:HBB_GORGO P02024 HEMOGLOBIN BETA CHAIN. (gorilla)           305  4e-83
SW:HBB2_PANLE P18988 HEMOGLOBIN BETA-2 CHAIN. (lion)           302  3e-82
SW:HBB_HYLLA P02025 HEMOGLOBIN BETA CHAIN. (gibbon)            300  8e-82
SW:HBB_PREEN P02032 HEMOGLOBIN BETA CHAIN. (Hanuman langur)    298  5e-81
SW:HBB_COLPO P19885 HEMOGLOBIN BETA CHAIN. (Colobus)           295  3e-80
SW:HBB_CERAE P02028 HEMOGLOBIN BETA CHAIN. (Green monkey)      295  3e-80
SW:HBB_MACFU P02027 HEMOGLOBIN BETA CHAIN. (Japanese macaque)  293  2e-79
SW:HBB_CALAR P18985 HEMOGLOBIN BETA CHAIN. (Marmoset)          292  2e-79
SW:HBB_ATEGE P02034 HEMOGLOBIN BETA CHAIN. (Spider monkey)     292  2e-79
SW:HBB_MANSP P08259 HEMOGLOBIN BETA CHAIN. (Mandrill)          291  4e-79
…
SW:HBB1_RAT P02091 HEMOGLOBIN BETA CHAIN. (Rat)                255  4e-68
SW:HBB_ERIEU P02059 HEMOGLOBIN BETA CHAIN. (Hedgehog)          252  2e-67
SW:HBB_PANPO P04244 HEMOGLOBIN BETA CHAIN. (Leopard)           251  5e-67
SW:HBB_BISBO P09422 HEMOGLOBIN BETA CHAIN. (Bison)             251  5e-67

Blast alignment

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
Length = 146

Score = 276 bits (698), Expect = 2e-74
Identities = 131/146 (89%), Positives = 137/146 (93%)

Query:   2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
           VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
Sbjct:   1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60

Query:  62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
           KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
Sbjct:  61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120

Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
           EFTP VQAAYQKVVAGVANALAHKYH
Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

FASTA vs BLAST

• BLAST faster than FASTA without significant loss of ability to find the similar database sequences.
• BLAST & FASTA equivalent for highly similar sequences
• FASTA may be better for less similar sequences
• But can always make a full local alignment (Smith-Waterman algorithm) - highest potential of finding less similar sequences in a database search since the entire sequence lengths are compared.

EMBL - some stats (31/01/06)

Total nucleotides (current 118,263,140,052)

Number of entries (current 65,933,089)


Smith-Waterman programs

• MPsrch Edinburgh University

– http://www.ebi.ac.uk/MPsrch/

• Scanps2.3 Geoff Barton (EBI; University of Dundee)

– http://www.ebi.ac.uk/scanps/

• Blitz (bic_sw) Compugen -- uses MPsrch & Scanps

– http://www.ebi.ac.uk/bic_sw/ (email only)

Multiple alignments

• Analyse gene families
  – reveal (subtle) conserved family characteristics

characters   1  2  3  4  5  6  7  8  9  10
S1           Y  D  G  G  A  V  -  E  A  L
S2           Y  D  G  G  -  -  -  E  A  L
S3           F  E  G  G  I  L  V  E  A  L
S4           F  D  -  G  I  L  V  Q  A  V
S5           Y  E  G  G  A  V  V  Q  A  L
consensus    y  d  G  G  AI VL V  e  A  l

(rows S1 to S5 are the sequences)

Multiple alignment - methods

• Simultaneous: N-wise alignment (adapted from pairwise approach)
  – uses N-dimension matrix.
  – Complexity is
    • O(m1 m2) [2 sequences of length m1 & m2]
    • O(m^n) [n sequences of length m]
  – Thus only good for short sequences.
• Manual (!)
• Progressive (heuristic) e.g. ClustalW:
  – compute pairwise sequence identities
  – construct binary tree (can output phylogenetic tree)
  – align similar sequences in pairs, add distantly related ones later.

[Figure: sequences s1 to s5 and alignments a1 to a4]

Multiple sequence alignment (globins)

CLUSTAL W (1.81) multiple sequence alignment

Human   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Gorilla VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Rabbit  VHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKV 60
Pig     VHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKV 60
        ***:.***.** .*******:****************************..:***.****

Human   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 120
Gorilla KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120
Rabbit  KAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLSHHFGK 120
Pig     KAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLARRLGH 120
        ******** :**:** **********.*******:********:*****:* **::::*:

Human   EFTPPVQAAYQKVVAGVANALAHKYH 146
Gorilla EFTPPVQAAYQKVVAGVANALAHKYH 146
Rabbit  EFTPQVQAAYQKVVAGVANALAHKYH 146
Pig     DFNPNVQAAFQKVVAGVANALAHKYH 146
        :*.* ****:****************

Multiple sequence alignments & phylogenetic trees

((Human:0.00000,Gorilla:0.00685):0.04110,Rabbit:0.05479,Pig:0.10959);

Pair            Score
Human-Gorilla   99
Human-Rabbit    90
Gorilla-Rabbit  89
Human-Pig       84
Gorilla-Pig     84
Rabbit-Pig      83

What can we do with multiple alignments?

• Create (databases of) profiles derived from multiple alignments for protein families
  – profile = multiple alignment + observed character frequencies at each position
• Search with a sequence against a database of profiles (e.g. PROSITE database)
  – faster than sequence against sequence
  – gives a more general result (“the input sequence matches globin profile”)
• Search with a profile against a database of sequences
  – PSI-BLAST: can identify more distant relationships than by normal BLAST search
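The "observed character frequencies at each position" part of a profile can be sketched in Python, using the five-sequence example alignment S1 to S5 from the multiple alignments slide. Leaving gap symbols out of the counts is an assumption of this sketch, and `column_profile` is a hypothetical name.

```python
from collections import Counter

# Rows S1..S5 of the example multiple alignment ('-' is a gap).
alignment = [
    "YDGGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]


def column_profile(rows: list[str]) -> list[dict[str, float]]:
    """Per-column residue frequencies, ignoring gap symbols."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        profile.append({residue: n / total for residue, n in counts.items()})
    return profile


profile = column_profile(alignment)
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

Columns 3, 4 and 9 contain a single residue type (G, G and A), matching the upper-case letters of the consensus line, while column 1 splits into Y at 0.6 and F at 0.4.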

PSI-BLAST (position specific iterated BLAST)

[Figure: flow chart. Labels: "Single protein sequence"; "Search database (BLAST)"; "Multiple alignment"; "Profile"; "Estimate statistical significance of local alignments"; "? iterate until convergence"]

PSI-BLAST (Altschul et al 1997)

(1) Start with 1 sequence (or profile) = ‘probe’

(2) Search with BLAST and select top hits manually or automatically

(3) Make multiple alignment & profile

(4) Estimate statistical significance of local alignments. If significance ok & you want to continue, then go to (1) using the profile, else exit

Dates & programs

[Figure: labels "FASTA", "BLAST", "Gapped BLAST & PSI-BLAST"]