CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local...
-
date post
21-Dec-2015 -
Category
Documents
-
view
219 -
download
0
Transcript of CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local...
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[1] 06-11-2006 Sequence Analysis
Alignments 2:Local alignment
Sequence Analysis06-11-2006
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[2] 06-11-2006 Sequence Analysis
What can sequence alignment tell us about structure
HSSP Sander & Schneider, 1991
≥30% sequence identity
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[3] 06-11-2006 Sequence Analysis
Pair-wise alignmentComplexity of the problem
Combinatorial explosion- 1 gap in 1 sequence: n+1 possibilities- 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc.
2n (2n)! 22n
= ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!
T D W V T A L KT D W L - - I K
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[4] 06-11-2006 Sequence Analysis
The algorithm for linear gap penalties
M[i-1,j-1]+score(X[i],Y[j])
M[i,j]= max M[i,j-1]-g
M[i-1,j]-g
Corresponds to:X1…Xi-1 Xi
Y1…Yj-1 Yj
X1…Xi -
Y1…Yj-1 Yj
X1…Xi-1 Xi
Y1…Yj-1 -
Value from residueexchange matrix
i-1 i
j-1 j
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[5] 06-11-2006 Sequence Analysis
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=k
• Affine: gp(k)=+k
• Concave, e.g.: gp(k)=log(k)The score for an alignment is the sum
of the scores over all alignment columns
Dynamic programmingScoring alignments
•/Gap length
pen
alt
y
affine
concave
linear
,
General alignment score:
Sa,b = )(kgpNk
k li jbas ),(
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[6] 06-11-2006 Sequence Analysis
Scoring and gap penalties
• Scoring:– DNA or protein?– Which evolutionary distance do we cover?
• Gap penalty: always less than 0 (i.e. they lower the alignment score)– Linear: g(k)=k– Affine: g(k)=+k harder to implement
– Concave: g(k)=log(k) no efficient solution
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[7] 06-11-2006 Sequence Analysis
Global dynamic programming – general
algorithm
M[i-1,j-1]
M[i,j] = score(X[i],Y[j]) + max max{M[0<x<i-1, j-1] - gopen - (i-x-1)gextension}
max{M[i-1, 0<x<j-1] - gopen - (i-y-1)gextension}
Value form residueexchange matrix
i-1 i
j-1 j
Gap open penalty
Gap extension penalty
Number of gap extensions
This more general way of dynamic programming also
allows for affine or other gap penalty regimes
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[8] 06-11-2006 Sequence Analysis
Easy DP recipe for using affine gap penalties
• M[i,j] is optimal alignment (highest scoring alignment until [i,j])• Check
– Cell[i-1, j-1]: apply score for cell[i-1, j-1] – preceding row until j-2: apply appropriate gap penalties
– preceding column until i-2: apply appropriate gap penalties
i-1
j-1
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[9] 06-11-2006 Sequence Analysis
Note about gap penalties
• Some affine schemes use
gap_penalty = -gopen –gextension*(l-1),
while others use
gap_penalty = -gopen –gextension*l,
where l is the length of the gap.
One can be converted into the other by adapting the gopen penalty
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[10] 06-11-2006 Sequence Analysis
Global dynamic programming
gopen=10, gextension=2
D W V T A L K
0 -12 -14 -16 -18 -20 -22 -24
T -12 8 -9 -6 -5 -9 -11 -14
D -14 0 9 2 2 3 -5 -3 -34
W -16 -13 25 11 5 4 9 0 -21
V -18 -10 -4 37 21 19 19 15 -16
L -20 -14 -2 23 46 31 37 26 1
K -22 -12 -9 17 33 53 39 50 14
-34 -29 -1 17 39 27 50
D W V T A L K
T 8 3 8 11 9 9 8
D 12 1 6 8 8 4 8
W 1 25 2 3 2 6 5
V 6 2 12 8 8 10 6
L 4 6 10 9 6 14 5
K 8 5 6 8 7 5 13
These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)
The extra bottom row and rightmost column give the final global alignment scores
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[11] 06-11-2006 Sequence Analysis
Variation on global alignment
• Global alignment: the previous algorithm is called globalalignment, because it uses all letters from both sequences.
CAGCACTTGGATTCTCGG CAGC-----G-T----GG
• Semi-global alignment: don’t penalize for start/end gaps (omit the start/end of sequences).
CAGCA-CTTGGATTCTCGG ---CAGCGTGG--------– Applications of semi-global:
– Finding a gene in genome– Placing marker onto a chromosome– One sequence much longer than the other
– Danger! – really bad alignments for divergent seqs
seq X:
seq Y:
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[12] 06-11-2006 Sequence Analysis
Global alignments - review
• Take two sequences: X[j] and Y[j]
• The best alignment for X[1…i] and Y[1…j] is called M[i, j]
• Initiation: M[0,0]=0
• Apply the equation
• Find the alignment with backtracing
6543210
76543210
AGCGGAG
0-AGTGAG-
X[j]
Y[i]
2
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[13] 06-11-2006 Sequence Analysis
Global and local alignmentGlobal and local alignment
Local
Global
A B
A
B
A B
B A
AB
Local
Global
A BA
CC
A B C
A B C
A BA
CC
Problem: Problem:
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[14] 06-11-2006 Sequence Analysis
Local alignment
• What’s local?– Allow only parts of the sequence to match– Results in High Scoring Segments– Locally maximal: cannot make it better by
trimming/extending the alignment
Seq X:
Seq Y:
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[15] 06-11-2006 Sequence Analysis
Local alignment
• Why local?
– Parts of sequence diverge faster evolutionary pressure does not put constraints on the whole sequence
– Proteins have modular constructionsharing domains between sequences
Seq X:
Seq Y:
seq X:
seq Y:
seq X:
seq Y:
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[16] 06-11-2006 Sequence Analysis
Domains - exampleImmunoglobulin domain
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[17] 06-11-2006 Sequence Analysis
Global local alignment• Take the old, good equation
• Look at the result of the global alignment
q
e
s
-
euqes-
CAGCACTTGGATTCTCG-CA-C-----GATTCGT-G
a) global align
b) retrieve the result
c) sum score along the result
align.pos.
sum
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[18] 06-11-2006 Sequence Analysis
Local alignment – breaking the alignment• A recipe
– Just don’t let the score go below 0
– Start the new alignment when it happens
– Where is the result in the matrix?
Before: After:
sum
align.pos.
q
e
s
0-
euqes-
q
e
0s
0-
euqes-
sum
align.pos.
sum
align.pos.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[19] 06-11-2006 Sequence Analysis
Local alignment – the equation
• Init the matrix with 0’s
• Read the maximal value from anywhere in the matrix
• Find the result with backtracking
M[i, j] =M[i, j-1] – g
M[i-1, j] – g
M[i-1, j-1] + score(X[i],Y[j])
max
0Great contribution to science!
q
e
s
-
euqes-
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[20] 06-11-2006 Sequence Analysis
Finding second best alignment• We can find the best local alignment in the
sequence
• But where is the second best one?
…
1613A
1815G
1618201714A
15171916G
141618A
…
…AGA…
A clump
Best alignment
Scoring:1 for match-2 for a gap
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[21] 06-11-2006 Sequence Analysis
Clump of an alignment
• Alignments sharing at least one aligned pair
A clump…
A
G
20A
1916G
1618A
…
…AGA…
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[22] 06-11-2006 Sequence Analysis
Example: repeats proteins
Local alignment traces showing similarity between repeats
Note: repeats are similar motifs but need not be identical
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[23] 06-11-2006 Sequence Analysis
Clumpsge
ne Y
gene X
The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)
‘Shadows’ of various alignments are visible – these all derive from parts of a same locally optimal alignment
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[24] 06-11-2006 Sequence Analysis
Finding second best alignment
• Don’t let any matched pair contribute to the next alignment
1…
130A
02G
10001A
0000G
100A
…
…AGA…
Recalculate the clump
“Clear” the best alignment
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[25] 06-11-2006 Sequence Analysis
Clumps ge
ne Y
gene X
The figure shows two proteins with so-called tandem repeats (similar motifs adjacent in the sequence)
The highest scoring local alignment is indicated
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[26] 06-11-2006 Sequence Analysis
Clumpsge
ne Y
gene X
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[27] 06-11-2006 Sequence Analysis
Extraction of alignments – Waterman-Eggert algorithm
1. Repeata. Retrieve the highest scoring alignment
b. Set its trace to 0
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[28] 06-11-2006 Sequence Analysis
General local DP algorithm for using affine gap penalties
M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from:
any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties;
any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties;
or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell anywhere in matrix and do trace-back until
zero-valued cell or start of sequence(s)
i-1
j-1 Penalty = Popen + gap_length*Pextension
Si,j = Max
Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}
Si,j + Si-1,j-1
Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}
0
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[29] 06-11-2006 Sequence Analysis
Example: general algorithm for local alignmentDP algorithm with affine gap penalties (PAM250,
Po=10, Pe=2)
Extra start/end columns/rows not necessary (no end-gaps). Each negative scoring cell is set to zero. Highest scoring cell may be found anywhere in search matrix after calculating it. Trace highest scoring cell back to first cell with zero value (or the beginning of one or both sequences)
D W V T A L K
T 0 -5 0 3 1 1 0
D 4 -7 -2 0 0 -4 0
W -7 17 -6 -5 -6 -2 -3
V -2 -6 4 0 0 2 -2
L -4 -2 2 1 -2 6 -3
K 0 -3 -2 0 -1 -3 5
D W V T A L K
T 0 0 0 3
D 4 0 0 0
W 0 21 0 0
V 0 0 25 9
L 0 0 11
K 0 0
PAM250
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[30] 06-11-2006 Sequence Analysis
Pitfalls of alignments
Alignment is often not a reconstruction of evolution(common ancestor is usually extinct by the time of alignment)
Repeats: matches to the same fragment
seq X:
seq Y:
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[31] 06-11-2006 Sequence Analysis
Summary
1. Global e.g Needleman-Wunsch algorithm
2. Semi-global 3. Local
e.g.Smith-Waterman algorithm
4. Many local alignmentsaka Waterman-Eggert algorithm
What’s the number of steps in these algorithms?
How much memory is used?
seq X:
seq Y:
seq X:
seq Y:
seq X:
seq Y:
seq X:
seq Y:
These questions deal with the algorithmic complexity (time, memory) of alignment
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[32] 06-11-2006 Sequence Analysis
Semi-global alignment
• Ignore start gaps– First row/column set
to 0
• Ignore end gaps– Read the result from
last row/column
T
G
A
-
GTGAG-
1300-10
-202-210
-1-1-11-10
000000
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[33] 06-11-2006 Sequence Analysis
Low complexity regions• Local composition bias
– Replication slippage: e.g. triplet repeats
• Results in spurious hits– Breaks down statistical
models– Different proteins reported as
hits due to similar composition
– Up to ¼ of the sequence can be biased
• Widely used filtering program: SEGS (Wootton and Federhen, 1996)
Huntington’s disease
– Huntingtin gene of unknown (!) function
– Repeats #: 6-35: normal; 36-120: disease
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[34] 06-11-2006 Sequence Analysis
Synteny
• Synteny is preservation of gene order– 5 to 20 million years of divergence
Sequencing and comparison of yeast species to identify genes and regulatory elements, Manolis Kellis et.al, Nature 423, 241 - 254 (15 May 2003)
Arrows in figure represent genes; arrow direction indicates ‘+’ or ‘-’ strand of DNA
Figure shows preservation of gene order in four yeast species
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[35] 06-11-2006 Sequence Analysis
Genome vs.
gene
Genome Gene
Built of... DNA DNA
Describes Organism Protein
Stored as... Part of genome
Circular/ linear Linear
“Life cycle” DNA-DNA-DNA-... DNA-RNA-protein
Size 100b ... 100000b
Amount per cell 1 500 .. 50000
2% ... 100% ~30%
Single molecule, or a few of them
Both (depending on the species)
0.5Mb*) ... 3500Mb
Information content
*) b stands for bases or nucleotides
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[36] 06-11-2006 Sequence Analysis
Take-home messages (I)
• Global, semi-global and local alignment
• Three types of gap penalties
• ‘easy’ (fast) and general algorithm
• Pitfalls of local alignment
• Low-complexity regions
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[37] 06-11-2006 Sequence Analysis
Make sure you understand and can carry out
1. the ‘simple’ DP algorithm (for linear gap penalties)
2. The general DP algorithm for global, semi-global and local alignment!
Take-home messages (II)