Dynamic Programming for Pairwise Alignment 2
Dr Alexei Drummond
Department of Computer Science
alexei@cs.auckland.ac.nz
Semester 2, 2006
Review
Dynamic programming algorithm for global alignment (Needleman & Wunsch)
Given sequences $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$:
$F(i,j)$ = score of best alignment between $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$
Principle of Optimality
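The optimal alignment of $x_1, x_2, x_3, \ldots, x_i$ with $y_1, y_2, y_3, \ldots, y_j$ has score $F(i,j)$.
It looks like:
$x_1, x_2, x_3, \ldots, x_{i-1}$ aligned with $y_1, y_2, y_3, \ldots, y_{j-1}$, followed by $x_i$ aligned with $y_j$: score $F(i-1, j-1) + s(x_i, y_j)$
or:
$x_1, x_2, x_3, \ldots, x_i$ aligned with $y_1, y_2, y_3, \ldots, y_{j-1}$, followed by a gap ($-$) aligned with $y_j$: score $F(i, j-1) - d$
or:
$x_1, x_2, x_3, \ldots, x_{i-1}$ aligned with $y_1, y_2, y_3, \ldots, y_j$, followed by $x_i$ aligned with a gap ($-$): score $F(i-1, j) - d$
so:
$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d \\ F(i-1, j) - d \end{cases}$$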
Basis
$x_1, x_2, x_3, \ldots, x_i$ aligned entirely with gaps: $F(i,0) = F(i-1,0) + s(x_i, -)$
$y_1, y_2, y_3, \ldots, y_j$ aligned entirely with gaps: $F(0,j) = F(0,j-1) + s(-, y_j)$
$F(0,0) = 0$
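The recurrence and basis above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names, the toy scoring function `simple_score`, and the choice $s(x_i, -) = s(-, y_j) = -d$ (a linear gap penalty) are assumptions made for the example.

```python
def simple_score(a, b):
    # Hypothetical toy substitution score: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


def needleman_wunsch(x, y, score, d):
    """Fill the (m+1) x (n+1) matrix F, then trace back one optimal global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Basis: a prefix aligned entirely with gaps (assumes s(x_i, -) = -d).
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] - d
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] - d
    # General recurrence: match/mismatch, gap in X, gap in Y.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
    # Traceback from F(m, n) to F(0, 0), re-deriving which case produced each cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))
```

For instance, `needleman_wunsch("ACGT", "AGT", simple_score, 2)` aligns `ACGT` with `A-GT`. When several alignments share the optimal score, the traceback order above (diagonal first) decides which one is reported.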
Filling up table
[Figure: F matrix with rows 0, 1, 2, ..., m (X) and columns 0, 1, 2, ..., n (Y); the cell holding the optimal alignment score is marked.]
Constructing alignment
[Figure: F matrix with rows 0, 1, 2, ..., m (X) and columns 0, 1, 2, ..., n (Y); the cell holding the optimal alignment score is marked.]
Example
F matrix (rows: X = PAWHEAE, indexed 0, 1, 2, ..., m; columns: Y = HEAGAWGHEE, indexed 0, 1, 2, ..., n)

           H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Optimal alignment score: 1 (bottom-right cell)

Alignment
Y  H E A G A W G H E - E
X  - - P - A W - H E A E
Time and space
The F matrix has $(m+1) \times (n+1)$ table entries $\Rightarrow \Theta(mn)$ space
Each entry computed in constant time $\Rightarrow \Theta(mn)$ time
Smith & Waterman algorithm
Computes local alignment.
i.e. look for the best alignment of subsequences of X and Y, ignoring the scores of regions on either side.
[Figure: X and Y with the best subsequence alignment marked.]
Recurrences
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d \\ F(i-1, j) - d \end{cases}$$
Basis: $F(i,0) = F(0,j) = 0$
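A minimal Python sketch of these local-alignment recurrences follows. The function names and the toy scoring function are assumptions for illustration only. The traceback starts at the highest-scoring cell and stops at the first cell with score 0, which is the usual convention for Smith & Waterman.

```python
def toy_score(a, b):
    # Hypothetical toy substitution score: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


def smith_waterman(x, y, score, d):
    """Return (best local score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    # Basis: F(i, 0) = F(0, j) = 0.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,  # start a fresh local alignment here
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

Unlike the global case, the answer is read from the maximum over the whole matrix rather than from the bottom-right cell.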
Example
F matrix (rows: X = PAWHEAE; columns: Y = HEAGAWGHEE)

F       H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26

Alignment
Y  A W G H E
X  A W - H E
Repeated (local) matches
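For long sequences we are interested in all local alignments with a significant score, i.e. a score > threshold T.
e.g. copies of a repeated domain or motif in a protein.
X = sequence containing motif
Y = target sequence
Method is asymmetric: it finds non-overlapping regions of Y that match parts of X.
[Figure: target sequence Y with the regions matching parts of X marked.]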
Principle of Optimality
Given sequences $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$
Define $F(i,j)$ $(i \ge 1)$ = best sum of match scores in $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$, assuming $y_j$ is in a matched region and the match ends in $x_i$ or $y_j$
Ends of matches
$F(0,0) = 0$
$F(0,j)$ = best sum of completed match scores to $(y_1, y_2, \ldots, y_j)$, assuming that $y_j$ is not in a matched region
$$F(0,j) = \max \begin{cases} F(0, j-1) \\ F(i, j-1) - T, \quad i = 1, \ldots, m \end{cases}$$
Row 0 therefore marks unmatched regions and ends of matches in Y.
General recurrence
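$$F(i,j) = \max \begin{cases} F(0, j) & \text{start of new match} \\ F(i-1, j-1) + s(x_i, y_j) & \text{extension of previous match} \\ F(i, j-1) - d & \text{extension of previous match} \\ F(i-1, j) - d & \text{extension of previous match} \end{cases}$$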
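The row-0 rule and the general recurrence for repeated matches can be sketched as below. This is an illustrative sketch under stated assumptions: the function and scoring names are hypothetical, $F(i,0)$ is taken to be 0 for $i \ge 1$, and the final total is obtained by applying the row-0 rule once more past the last column of Y. The matrix is filled column by column because $F(0,j)$ depends on all of column $j-1$.

```python
def toy_score(a, b):
    # Hypothetical toy substitution score: +3 for a match, -3 for a mismatch.
    return 3 if a == b else -3


def repeated_matches(x, y, score, d, T):
    """Return (total score of all matches above threshold T, matrix F).

    x is the motif sequence (rows 1..m), y the target sequence (columns 1..n);
    row 0 holds the best sum of completed match scores.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # assumes F(i, 0) = 0
    for j in range(1, n + 1):
        # Row 0: stay unmatched, or close a match that ended in column j-1.
        F[0][j] = max(F[0][j - 1],
                      max(F[i][j - 1] for i in range(1, m + 1)) - T)
        for i in range(1, m + 1):
            F[i][j] = max(
                F[0][j],  # start of a new match
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
    # Extra cell: apply the row-0 rule once more to close a match ending at y_n.
    total = max(F[0][n], max(F[i][n] for i in range(1, m + 1)) - T)
    return total, F
```

With the toy scores above, a motif `ACG` found twice in `ACGTTACG` contributes $9 - T$ per copy, so `repeated_matches("ACG", "ACGTTACG", toy_score, 4, 5)[0]` evaluates to 8.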
Filling up table
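[Figure: F matrix with rows 0, 1, 2, ..., m (X) and columns 0, 1, 2, ..., n (Y), filled column by column; the cell holding the optimal sum of alignment scores is marked.]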
Example
F matrix (rows: X = PAWHEAE; columns: Y = HEAGAWGHEE)

F       H   E   A   G   A   W   G   H   E   E
    0   0   0   0   1   1   1   1   1   3   9
P   0   0   0   0   1   1   1   1   1   3   9
A   0   0   0   5   1   6   1   1   1   3   9
W   0   0   0   0   2   1  21  13   5   3   9
H   0  10   2   0   1   1  13  19  23  15   9
E   0   2  16   8   1   1   5  11  19  29  21
A   0   0   8  21  13   6   1   5  11  21  28
E   0   0   6  13  18  12   4   1   5  17  27

Extra cell for final total score: 9

Alignment
Y  H E A G A W G H E E
X  H E A . A W - H E .
Overlap matches
[Figure: the possible overlap configurations of X and Y.]
Don’t penalize overhanging ends, i.e. set $F(i,0) = F(0,j) = 0$
Otherwise:
$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d \\ F(i-1, j) - d \end{cases}$$
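A short Python sketch of overlap matching is given below. The slide only states the zero basis and the recurrence; reading the answer as the maximum over the last row and last column (so that a trailing overhang is also free) is the usual convention and is an assumption here, as are the function and scoring names.

```python
def toy_score(a, b):
    # Hypothetical toy substitution score: +2 for a match, -2 for a mismatch.
    return 2 if a == b else -2


def overlap_score(x, y, score, d):
    """Best overlap-match score: overhanging ends of x and y are not penalized."""
    m, n = len(x), len(y)
    # Basis: F(i, 0) = F(0, j) = 0, so a leading overhang costs nothing.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i][j - 1] - d,
                F[i - 1][j] - d,
            )
    # A trailing overhang is free too: take the best cell on the bottom
    # or right border instead of only F(m, n).
    return max(max(F[m]), max(row[n] for row in F))
```

For example, the suffix `ACG` of `TTTACG` overlaps the prefix `ACG` of `ACGGGG`, so `overlap_score("TTTACG", "ACGGGG", toy_score, 3)` gives 6.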
Example
F matrix (rows: X = PAWHEAE; columns: Y = HEAGAWGHEE)

F       H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   0  -2  -2   4  -1   3  -4  -4  -4  -3  -2
W   0  -3  -5  -4   1  -4  18  10   2  -6  -6
H   0  10   2  -6  -6  -1  10  16  20  12   4
E   0   2  16   8   0  -7   2   8  16  26  18
A   0  -2   8  21  13   5  -3   2   8  18  25
E   0   0   4  13  18  12   4  -4   2  14  24

Alignment
Y  G A W G H E E
X  P A W - H E A
Affine gap penalties
• Affine score: $\gamma(g) = -d - (g-1)e$, where $d$ is the gap-open penalty and $e$ is the gap-extension penalty
• Different penalties are associated with opening a gap and with extending an alignment by a further gap symbol:
Y = C C T W P
X = C S T W -
is different from
Y = C C T W P
X = C S T - -
General recurrence
$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{extend by matching } x_i \text{ and } y_j \\ F(k, j) + \gamma(i-k), \quad k = 0, 1, \ldots, i-1 & \text{extend by matching suffix of } X \text{ to gap of length } i-k \\ F(i, k) + \gamma(j-k), \quad k = 0, 1, \ldots, j-1 & \text{extend by matching suffix of } Y \text{ to gap of length } j-k \end{cases} \qquad (i, j > 0)$$
Problem: the procedure runs in worst-case time $\Theta(n^3)$.
$\Theta(n^2)$ version
Extra variables
$M(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with $y_j$
$I_x(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $x_i$ is aligned with a gap
$I_y(i,j)$ = best score of alignment of $(x_1, x_2, \ldots, x_i)$ and $(y_1, y_2, \ldots, y_j)$ given that $y_j$ is aligned with a gap
Recurrences
$$M(i,j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases} \qquad (i, j > 0)$$
$$I_x(i,j) = \max \begin{cases} M(i-1, j) - d & x_i \text{ aligned to start of gap} \\ I_x(i-1, j) - e & x_i \text{ aligned to continuation of gap} \end{cases} \qquad (i, j > 0)$$
$$I_y(i,j) = \max \begin{cases} M(i, j-1) - d & y_j \text{ aligned to start of gap} \\ I_y(i, j-1) - e & y_j \text{ aligned to continuation of gap} \end{cases} \qquad (i, j > 0)$$
Procedure runs in worst-case time $\Theta(n^2)$.
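The three-matrix recurrences can be sketched in Python as follows. The slides do not give boundary conditions, so the initialization used here ($M(0,0) = 0$, a leading gap of length $g$ scored $-d - (g-1)e$, every other boundary cell $-\infty$) is an assumption, as are the function and scoring names. Only the score is computed; a traceback would additionally need to remember which matrix each maximum came from.

```python
NEG_INF = float("-inf")


def toy_score(a, b):
    # Hypothetical toy substitution score: +2 for a match, -3 for a mismatch.
    return 2 if a == b else -3


def affine_global_score(x, y, score, d, e):
    """Global alignment score with gap-open penalty d and gap-extension penalty e."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions (not stated on the slides).
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e  # x_1..x_i aligned to one long gap
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e  # y_1..y_j aligned to one long gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[m][n], Ix[m][n], Iy[m][n])
```

With `d = 4` and `e = 1`, aligning `ACGTTTACG` with `ACGACG` opens one gap of length 3 (cost $4 + 1 + 1$) against six matches, for a score of 6.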
Linear space alignment
[Figure: F matrix with rows 0, 1, 2, ..., m (X) and columns 0, 1, 2, ..., n (Y); row $\lfloor m/2 \rfloor$ is marked.]
Linear space algorithm
From top: $F_{top}(j)$
From bottom: $F_{bot}(j)$
From top + from bottom = $F_{top}(j) + F_{bot}(j)$
Choose $k \in \{0, 1, \ldots, n\}$ such that $F_{top}(k) + F_{bot}(k)$ is maximized
$k$ is on the path of an optimal alignment
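The split step can be sketched in Python as below, assuming a linear gap penalty $d$ as in the global recurrence. The names `nw_last_row` and `split_column` are hypothetical. Each pass keeps only two rows of the F matrix, so space is linear in $n$; the "from bottom" pass is done by aligning the reversed lower half of X against reversed Y.

```python
def toy_score(a, b):
    # Hypothetical toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


def nw_last_row(x, y, score, d):
    """Last row of the global-alignment F matrix, keeping only two rows in memory."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + score(x[i - 1], y[j - 1]),
                           cur[j - 1] - d,
                           prev[j] - d))
        prev = cur
    return prev


def split_column(x, y, score, d):
    """Return (k, score): the column k where an optimal path crosses row floor(m/2)."""
    mid = len(x) // 2
    n = len(y)
    top = nw_last_row(x[:mid], y, score, d)                # F_top(j)
    bot = nw_last_row(x[mid:][::-1], y[::-1], score, d)    # reversed: bot[n - j] = F_bot(j)
    totals = [top[j] + bot[n - j] for j in range(n + 1)]
    k = max(range(n + 1), key=totals.__getitem__)
    return k, totals[k]
```

The maximum of `totals` equals the full global alignment score, which gives a simple consistency check. A full linear-space aligner would recurse on the two sub-problems on either side of `(mid, k)`.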
Linear space alignment: Hirschberg’s insight
[Figure: F matrix running from (0, 0) to (m, n), with row $\lfloor m/2 \rfloor$ and column $k$ marked.]
Software for pairwise alignment
Pure D.P. runs in $\Theta(mn)$ time
Example
100 million residues in database
Search sequence of length 10,000
Number of F matrix cells to be calculated: $10^{12}$
Computer speed: 10 million cells a second
Total time: 100,000 seconds = 28 hours (approx.)
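The arithmetic behind this estimate can be checked with a few lines of Python; the variable names are just labels for the figures quoted above.

```python
database_residues = 100_000_000   # 100 million residues in the database
query_length = 10_000             # search sequence length
cells_per_second = 10_000_000     # assumed machine speed from the example

cells = database_residues * query_length   # F matrix cells to fill
seconds = cells // cells_per_second
hours = seconds / 3600

print(f"{cells:.0e} cells, {seconds} s, about {hours:.1f} h")
```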
Heuristic methods
FASTA (Pearson & Lipman, 1988)
Words in X and Y (length ktup)
[Figure: lookup table in which each word (e.g. cgtta) points to a list of pairs ..., (i, j), ..., where i is the position in X and j is the position in Y.]
• sort matches on j - i
• extend best matches (ungapped)
• join neighbouring matches by inserting gaps
• realign best matches by dynamic programming
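The first step (word lookup and grouping of hits by diagonal $j - i$) can be sketched as below. This is an illustration of the idea only, not FASTA's actual implementation; the function names are hypothetical and the later extension, joining and realignment steps are omitted.

```python
from collections import Counter, defaultdict


def ktup_hits(x, y, ktup):
    """All pairs (i, j) such that the word of length ktup at x[i:] equals the one at y[j:]."""
    index = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)      # word -> positions in X
    hits = []
    for j in range(len(y) - ktup + 1):
        for i in index.get(y[j:j + ktup], ()):
            hits.append((i, j))
    return hits


def best_diagonals(hits, top=3):
    """Diagonals j - i ranked by number of word hits, as (diagonal, count) pairs."""
    return Counter(j - i for i, j in hits).most_common(top)
```

For example, `best_diagonals(ktup_hits("GGACGTAC", "ACGTACTT", 3), 1)` reports diagonal -2 with four hits, since `ACGTAC` starts at position 2 in the first sequence and position 0 in the second (0-based).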
Sensitivity
Tradeoff
High values of ktup: faster search, but may miss significant matches
Low values of ktup: catches more matches, but slower
ktup = 1 for sensitivity close to dynamic programming
Available from
http://www.fasta.bioch.virginia.edu/
Example

Input sequences:

>short1.seq Length: 100 August 1, 2003 11:09 Type: N Check: 5940
atgaaattaacagcaatagctaaagcaacattagcattaggaatattaacaacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaagt
>short2.seq Length: 100 August 1, 2003 10:43 Type: N Check: 1744
atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagcgactggggttataacatcaacggctcaaactgtaaatgcgagcgaacatg

/seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @

Output matches:

resetting to DNA matrix
resetting to DNA matrix
LALIGN finds the best local alignments between two sequences
version 2.1u03 April 2000
Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381
resetting to DNA matrix
alignments < E( 0.05):score: 51 (50 max)
Comparison of:
(A) @ short1.seq Length: 100 August 1, 2003 11:09 Type - 100 nt
(B) @ short2.seq Length: 100 August 1, 2003 10:43 Typ - 100 nt
using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

71.4% identity in 91 nt overlap (1-91:1-91); score: 221 E(10000): 3.7e-12

               10        20        30        40        50        60        70
short1 ATGAAATTAACAGCAATAGCTAAAGCAACATTAGCATTAGGAATATTAACAACAGGTGTGATGACAGCAGAAAGT
       :::::  : :::::::: :: ::::: :  :::::  :: : :: ::: : :: :: :: :: ::: ::     :
short2 ATGAAGATGACAGCAATTGCGAAAGCCAGTTTAGCTCTAAGTATTTTAGCGACTGGGGTTATAACATCAACGGCT
               10        20        30        40        50        60        70

          80        90
short1 CAAACTGTAAACGCGA
       ::::::::::: ::::
short2 CAAACTGTAAATGCGA
          80        90

----------
Example

More matches:

64.1% identity in 39 nt overlap (17-54:32-69); score: 53 E(10000): 3.7e+02

         20        30         40        50
short1 TAGCTAAAGCAACATTAGC-ATTAGGAATATTAACAACA
       :::::  :   :  ::::: : : ::  ::: :::: ::
short2 TAGCTCTAAGTATTTTAGCGACTGGGGTTAT-AACATCA
              40        50        60

----------

73.9% identity in 23 nt overlap (60-77:6-28); score: 53 E(10000): 3.7e+02

      60             70
short1 GATGACAGCA-----GAAAGTCA
       ::::::::::     ::::: ::
short2 GATGACAGCAATTGCGAAAGCCA
          10        20

----------
BLAST
Developed by Altschul et al. (1990)
Preprocesses query sequence
Makes list of “neighbourhood words” with match > T
Tries to extend “seed” matches (ungapped) in database sequences
GAPPED-BLAST looks for gapped alignments
Genetics Computer Group package
GCG at University of Wisconsin
Commercial package (http://www.gcg.com/)
* assemble * backtranslate * bestfit * blast * breakup * chopup * circles * codonfrequency * codonpreference * coilscan * compare * composition * compresstext * comptable * consensus * correspond * corrupt * dataset * detab * distances * diverge * domes * dotplot * extractpeptide * fasta * fasta_parsable_output * fetch * figure * findpatterns * fingerprint * fitconsensus * foldrna * framealign * frames * framesearch * fromembl * fromfasta * fromgenbank * fromig * frompir * fromstaden * gap * gapshow * gcgtoblast * gelassemble * geldisassemble
* gelenter * gelintroduction * gelmerge * gelstart * gelview * getseq * growtree * helicalwheel * hthscan * isoelectric * lineup * listfile * lookup * lprint * map * mapplot * mapsort * mfold * moment * motifs * mountains * name * names * netblast * nooverlap * olddistances * onecase * overlap * paupdisplay * paupsearch * pepdata * pepplot * peptidemap * peptidesort * peptidestructure * pileup * plasmidmap * plotfold * plotsimilarity * plotstructure * plottest * pretty * prime * profileanalysis * profilegap * profilemake
* profilescan * profilesearch * profilesegments * publish * red * reformat * repeat * replace * reverse * sample * seg * segments * seqed * seqlab * setkeys * setplot * shiftover * shuffle * simplify * spew * spscan * squiggles * statplot * stemloop * stringsearch * symbol * terminator * testcode * tfasta * tofasta * toig * topir * tostaden * translate * whats_new_9.0 * whats_new_9.1 * window * wordsearch * xnu
GAP
GAP (“Global Alignment Program” ?)
Needleman & Wunsch algorithm
Input in GCG format
Use GETSEQ
!!NA_SEQUENCE 1.0 GETSEQ from gcg, August 14, 19103 12:19.
Length: 389 August 14, 19103 12:19 Type: N Check: 9580 ..
1 AAATGATAAA CTATTTTACT TTATGTCTAA GGTCTTTCAT AATATGAAAT
51 AGAATGTAGA TATTGCAACA ATAGCATTTT TGGAGACAGC TACCTCCTTT
101 ACCAGGAATA ATCTTTGCAT GTCACATTTA GAGATAAAGC TCAAAATGCA
151 AATCCTTCCC CTGAGAGTGG GAAAGCATTA ACAAATGAGA GTGGGAAAAG
201 CATTAACAAA GCATTAACAC AGGTCTTTAC ATATTCAAAA TATTAAACTA
251 ATGCTAGGAT TATAGACTTG ATTTTAAGAC ATGGTAGTTA ATAGAAAAGT
301 TCTAGATTGA AAACAATTTT GCAAAAATAT ACATTTGGTA TATGTGTATA
351 TATGTATGTG GTATATATAT ATCNACTAGG GAAAATATA
Example
<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>gap
Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.
GAP of what sequence 1 ? Hs#S374655.gcg
Begin (* 1 *) ? End (* 389 *) ? Reverse (* No *) ?
to what sequence 2 (* Hs#S374655.gcg *) ? Hs#S1117589.gcg
Begin (* 1 *) ? End (* 323 *) ? Reverse (* No *) ?
What is the gap creation penalty (* 50 *) ?
What is the gap extension penalty (* 3 *) ?
What should I call the paired output display file (* Hs#S374655.pair *) ?
Aligning ................-. Aligning ................-..
Gaps: 0 Quality: 3080 Quality Ratio: 9.536 % Similarity: 95.356 Length: 389
Display file

GAP of: Hs#S374655.gcg check: 9580 from: 1 to: 389
GETSEQ from gcg, August 14, 19103 12:19.
to: Hs#S1117589.gcg check: 8814 from: 1 to: 323
GETSEQ from gcg, August 14, 19103 12:20.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/nwsgapdna.cmp
CompCheck: 8760

Gap Weight: 50          Average Match: 10.000
Length Weight: 3        Average Mismatch: 0.000
Quality: 3080           Length: 389
Ratio: 9.536            Gaps: 0
Percent Similarity: 95.356   Percent Identity: 95.356

Match display thresholds for the alignment(s):
| = IDENTITY
: = 5
. = 1

Hs#S374655.gcg x Hs#S1117589.gcg August 18, 19103 17:59 ..

                  .         .         .         .         .
       1 AAATGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 50
            |||||||||||||||||||||||||||||||||||||||||||||||
       1 ...TGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 47
                  .         .         .         .         .
      51 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 100
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      48 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 97
                  .         .         .         .         .
     101 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAAATGCA 150
         |||||||||||||||||||||||||||||||||||||||||||| |||||
      98 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAGATGCA 147
                  .         .         .         .         .
     151 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 200
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     148 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 197
                  .         .         .         .         .
     201 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 250
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     198 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 247
                  .         .         .         .         .
     251 ATGCTAGGATTATAGACTTGATTTTAAGACATGGTAGTTAATAGAAAAGT 300
         |||||||||||||||||||||||||||     ||||||||    |||||
     248 ATGCTAGGATTATAGACTTGATTTTAAACATGGGTAGTTATAGAAAAAGG 297
                  .         .         .         .         .
     301 TCTAGATTGAAAACAATTTTGCAAAAATATACATTTGGTATATGTGTATA 350
         |||||||||||||||| |||   |||
     298 TCTAGATTGAAAACAAATTTTGCAAA........................ 323
Bestfit
<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>bestfit
BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.
BESTFIT of what sequence 1 ? short1.gcg
Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?
to what sequence 2 (* short1.gcg *) ? short2.gcg
Begin (* 1 *) ? End (* 100 *) ? Reverse (* No *) ?
What is the gap creation penalty (* 50 *) ?
What is the gap extension penalty (* 3 *) ?
What should I call the paired output display file (* short1.pair *) ?
Aligning ....-. Aligning ....-.
Gaps: 0 Quality: 416 Quality Ratio: 4.571 % Similarity: 71.429 Length: 91
Smith & Waterman algorithm
Local alignment
Same interface as GAP
Bestfit display file

BESTFIT of: short1.gcg check: 2998 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:25.
to: short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/swgapdna.cmp
CompCheck: 2335

Gap Weight: 50          Average Match: 10.000
Length Weight: 3        Average Mismatch: -9.000
Quality: 416            Length: 91
Ratio: 4.571            Gaps: 0
Percent Similarity: 71.429   Percent Identity: 71.429

Match display thresholds for the alignment(s):
| = IDENTITY
: = 5
. = 1

short1.gcg x short2.gcg August 18, 19103 15:27 ..

                  .         .         .         .         .
       1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
         |||||  | |||||||| || ||||| |  |||||  || | || ||| |
       1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
                  .         .         .         .
      51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcga 91
          || || || || ||| ||     |||||||||||| ||||
      51 gactggggttataacatcaacggctcaaactgtaaatgcga 91
Wordsearch
Algorithm similar to algorithm of Wilbur and Lipman (1983).
Compares one sequence (the query) to any group of sequences.
Comparisons can be viewed as set of dot-plots.
Search finds registers of comparison (diagonals) that have the largest number of short perfect matches (words).
Best segment of similarity along each diagonal viewed with program SEGMENTS.
Wordsearch example
<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>wordsearch
WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.
WORDSEARCH with what query sequence ? short1.gcg
Begin (* 1 *) ? End (* 100 *) ?
Search for query in what sequence(s) (* GenEMBL:* *) ? short2.gcg
What word size (* 6 *) ?
List how many best diagonals (* 50 *) ? 4
Integrate how many adjacent diagonals (* 3 *) ?
What should I call the output file (* short1.word *) ?
1 short2.gcg Len: 100
6-mers found: 168 Diagonals with words: 6 Total diagonals: 398 Sequences searched: 1 CPU time: 00.03
Output file: /usr/users/gcg/359Stuff/short1.word
Short1.word contents
!!SEQUENCE_LIST 1.0 (Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to:100
GETSEQ from gcg, August 18, 19103 15:25.
TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47
Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398 Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00
Sequence      Strd   Diag  Score  Width  Documentation ..
/short2.gcg   +         0     20      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   +       -54     10      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   -       -47      7      3  GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg   +       -69      7      3  GETSEQ from gcg, August 18, 19103 15:26.
Run SEGMENTS
<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>segments
Segments aligns and displays the segments of similarity found by WordSearch.
(BestFit) SEGMENTS from what WORDSEARCH file ? short1.word
What should I call the output file (* short1.pairs *) ?
Aligning ....-.
/usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 500 / Length: 98
Aligning ..-.
/usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 100 / Length: 10
Aligning ..-.
/usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 112 / Length: 32
Aligning .-.
/usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 96 / Length: 16
Short1.pairs contents

(BestFit) SEGMENTS from: short1.word August 18, 19103 15:48

(Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:25.
TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47
Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398
Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

AvMatch: 3.84 AvMisMatch: -6.00 GapWeight: 50 LengthWeight: 3 ..

Match display thresholds for the alignment(s):
| = IDENTITY
: = 3
. = 1

short1.gcg check: 2998 from: 1 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 500 Ratio: 5.102 Score: 20 Width: 3 Limits: +/-4

                  .         .         .         .         .
       1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
         |||||  | |||||||| || ||||| |  |||||  || | || ||| |
       1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
                  .         .         .         .
      51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaa 98
          || || || || ||| ||     |||||||||||| ||||  | | |
      51 gactggggttataacatcaacggctcaaactgtaaatgcgagcgaaca 98
Short1.pairs (continued)

short1.gcg check: 2998 from: 54 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 100 Ratio: 10.000 Score: 10 Width: 3 Limits: +/-4

      60 gatgacagca 69
         ||||||||||
       6 gatgacagca 15

short1.gcg check: 2998 from: 47 to: 100 /Reverse
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 112 Ratio: 3.500 Score: 7 Width: 3 Limits: +/-4

      40 ctaatgctaatgttgctttagctattgctgtt 9
         | | ||| || |    ||||||| |   | ||
      14 caattgcgaaagccagtttagctctaagtatt 45

short1.gcg check: 2998 from: 69 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 96 Ratio: 6.000 Score: 7 Width: 3 Limits: +/-4

      79 actgtaaacgcgaaag 94
         || | ||  |||||||
      10 acagcaattgcgaaag 25
EMBOSS

What is EMBOSS?
The European Molecular Biology Open Software Suite.
EMBOSS is a package of high-quality FREE Open Source software for sequence analysis.

Applications in EMBOSS
The EMBOSS programs and their documentation.

User Documentation
Tutorial, Command syntax, Sequences and Databases, Reference

Jemboss and other Interfaces
Many groups are creating graphical interfaces to EMBOSS.
Jemboss is our supported interface.

Downloading the software
You can download, install and run the software on most UNIX computers.
It is known to work on: Irix, AIX (4.3.3 and 5.1), Red Hat, SuSe, Debian, HPUX11/IA64, MacOSX, Mandrake, NetBSD, Slackware, Solaris, Tru64 Unix (Full support soon. Loan machine being arranged)
It is reported to work on: FreeBSD, OSF, SuSE-PPC

LATEST NEWS: Release 2.7.1 available as of 3rd June 2003
Suite of programs

getorf          HGMP    Finds and extracts open reading frames (ORFs)
helixturnhelix  HGMP    Finds nucleic acid binding domains.
hmoment         HGMP    Hydrophobic moment calculation
iep             HGMP    Calculates the isoelectric point of a protein
infoalign       HGMP    Information on a multiple sequence alignment
infoseq         HGMP    Displays some simple information about sequences
isochore        Sanger  Plots isochores in large DNA sequences
jembossctl      HGMP    Jemboss Authentication Control
lindna          Norway  Draws linear maps of DNA constructs
listor          HGMP    Writes a list file of the logical OR of two sets of sequences
marscan         HGMP    Finds MAR/SAR sites in nucleic sequences
maskfeat        HGMP    Mask off features of a sequence
maskseq         HGMP    Mask off regions of a sequence.
matcher         Sanger  Local alignment of two sequences
megamerger      HGMP    Merge two large overlapping nucleic acid sequences
merger          HGMP    Merge two overlapping sequences
msbar           HGMP    Mutate sequence beyond all recognition
mwcontam        HGMP    Shows molwts that match across a set of files
mwfilter        HGMP    Filter noisy molwts from mass spec output
needle          HGMP    Needleman-Wunsch global alignment.