Intro Sequence comparisons Visualization Alignments Scoring Algorithms
Last time
• Introduction
• What is Bioinformatics?
• Databases in Bioinformatics
Today: Sequence comparisons
• Visualization
• Different objectives
• Pairwise alignments
Sequence comparisons: Goals
• What are the similarities?
• Local similarities: domains and motifs
• What is variable?
• Identify positions: basis for evolutionary studies
• Understand structural similarities
• Determine ancestry
Homology
• Definition: Homology = common ancestry
• Principle: Similarity ⇒ homology
• Quote: "These sequences are somewhat homologous". Bad!
Similarity ≠ homology
• Correct: "These sequences are somewhat similar".
Important questions
• When are two sequences significantly similar?
• How do we evaluate similarity?
Data
• DNA: genes, genomes, non-coding DNA, etc.
• Codons
• RNA
• Peptides
Idea of dotplots
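  Q V A S K I N T N E
S . . . • . . . . . .
V . • . . . . . . . .
A . . • . . . . . . .
T . . . . . . . • . .
K . . . . • . . . . .
I . . . . . • . . . .
Y . . . . . . . . . .
M . . . . . . . . . .
N . . . . . . • . • .
E . . . . . . . . . •
Put a dot where the residues are identical, then filter out randomness.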
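The dotplot idea can be sketched in a few lines of Python. This is only an illustration, not the tool used for the slides: the function name `dotplot` and the `window` parameter are assumptions, and requiring an exact match over a short window is just one simple way to "filter out randomness".

```python
def dotplot(horizontal, vertical, window=1):
    """Return the rows of a text dotplot.

    A '•' at (row i, column j) means the `window` residues starting at
    vertical[i] are identical to those starting at horizontal[j].
    With window=1 this is the plain residue-by-residue dotplot.
    """
    rows = []
    for i in range(len(vertical) - window + 1):
        row = ""
        for j in range(len(horizontal) - window + 1):
            match = vertical[i:i + window] == horizontal[j:j + window]
            row += "•" if match else "."
        rows.append(row)
    return rows


# The two peptide fragments from the slide.
h, v = "QVASKINTNE", "SVATKIYMNE"

print("  " + h)
for residue, row in zip(v, dotplot(h, v)):
    print(residue, row)

# A crude filter: keep only dots that start a run of two identical residues.
for row in dotplot(h, v, window=2):
    print(row)
```

With `window=2` only the dots on the main diagonal survive (the VA, KI and NE pairs), which is the kind of signal a dotplot is meant to expose.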
Dotplots in practice
PttMAP20 (horizontal) vs. OsMAP20 (vertical)
[Figure: dotplot, PttMAP20 on the horizontal axis, OsMAP20 on the vertical axis]
What happened here?
s1: A B C D
s2: A C B D
Genomic dotplot
Many inversions around origin and termini of replication.
Visualizing with alignment

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRR
             S+ +PK  + ++ +P   F+LHT +RA+KRA FNY VA+KI  NE  +R
PttMAP20  43 SKVAPKPFAKENTKPQE-FKLHTGQRALKRAMFNYSVATKIYMNEQQKR

OsMAP20  118 FEEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKE
               E++ K+IEE E++ MRKEMV +AQLMP FD+PF PQRS+RPLTVP+E
PttMAP20  91 QIERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPRE

OsMAP20  167 PSF
             PSF
PttMAP20 140 PSF

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ  91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153
Alignments
• Def: A pairwise alignment is a pairing of symbols between two sequences.
• Global alignment: Involves whole sequences.
• Local alignment: Involves parts of sequences.
• Semiglobal or ends-free alignment: Ignore "overhang" in similar sequences with different lengths.
Global vs local

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ  91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153

OsMAP20    1 MEK--TRKATSPKSSMTSSTGPKSPVRNGGSPPHKKSTSEFRGRKNESQI  48
             |||  |:.|.......:|.:.|.|....|.:....|..
PttMAP20   1 MEKAHTKSALKKLVKASSQSAPWSNAARGMAKDDLKDP------------  38

OsMAP20   49 FRKGGQDSITLDESKRRSPTSQTSPKRSSPKHEQPLSYFRLHTEERAIKR  98
                      ..|:||       .:||..:.::.:| ..|:|||.:||:||
PttMAP20  39 ---------LYDKSK-------VAPKPFAKENTKP-QEFKLHTGQRALKR  71

OsMAP20   99 AGFNYQVASKINTNEIIRRFEEKLSKVIEEREIKMMRKEMVHKAQLMPAF 148
             |.|||.||:||..||..:|..|::.|:|||.|::.||||||.:|||||.|
PttMAP20  72 AMFNYSVATKIYMNEQQKRQIERIQKIIEEEEVRTMRKEMVPRAQLMPYF 121

OsMAP20  149 DKPFHPQRSTRPLTVPKEPSF--LRLKC--CIGGEFHRHFCYNA------ 188
             |:||.||||:||||||:||||  :..||  ||..:...::..:|
PttMAP20 122 DRPFFPQRSSRPLTVPREPSFHMVNSKCWSCIPEDELYYYFEHAHPHDHA 171

OsMAP20  189 -KAIK 192
              |.:|
PttMAP20 172 WKPVK 176
More terminology
• Insertion
• Deletion
• Indel: when we don't know which of the two it was
• Gap: indel in an alignment
• Indel character: usually "-"

  1 MEK--TRKATSPKSSMTSSTGPKSPVRNGGSPPHKKSTSEFRGRKNESQI  48
    |||  |:.|.......:|.:.|.|....|.:....|..
  1 MEKAHTKSALKKLVKASSQSAPWSNAARGMAKDDLKDP------------  38

 49 FRKGGQDSITLDESKRRSPTSQTSPKRSSPKHEQPLSYFRLHTEERAIKR  98
             ..|:||       .:||..:.::.:| ..|:|||.:||:||
 39 ---------LYDKSK-------VAPKPFAKENTKP-QEFKLHTGQRALKR  71
Choosing alignment?

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRR
             S+ +PK  + ++ +P   F+LHT +RA+KRA FNY VA+KI  NE  +R
PttMAP20  43 SKVAPKPFAKENTKPQE-FKLHTGQRALKRAMFNYSVATKIYMNEQQKR

OsMAP20  118 FEEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKE
               E++ K+IEE E++ MRKEMV +AQLMP FD+PF PQRS+RPLTVP+E
PttMAP20  91 QIERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPRE

OsMAP20  167 PSF
             PSF
PttMAP20 140 PSF

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ  91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153
Principle: Identity
• Def: The identity in an alignment is the fraction of identical paired symbols.
• Early selection criterion: Choose the alignment with the highest identity
Here: 62/112 ≈ 55% identity

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ  91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153
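The identity figure for this alignment can be checked with a short Python sketch. The function name `identity` is an assumption made for illustration. Gap columns are counted in the alignment length but never as matches, which is what the 62/112 on the slide corresponds to.

```python
def identity(row_x, row_y, gap="-"):
    """Return (identical pairs, alignment length) for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = sum(a == b and a != gap for a, b in zip(row_x, row_y))
    return matches, len(row_x)


# The two rows of the local alignment shown on the slide.
os_row = ("SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF"
          "EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS"
          "F--LRLKC--CI")
ptt_row = ("SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ"
           "IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS"
           "FHMVNSKCWSCI")

matches, length = identity(os_row, ptt_row)
print(f"{matches}/{length} = {100 * matches / length:.0f}% identity")
```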
Scoring an alignment
• Identity loses information on similarity
• Better: assign a score to every pair of symbols, $s(x, y) = c$. Example: for DNA

s    A   T   G   C
A    2  -1   1  -1
T   -1   2  -1   1
G    1  -1   2  -1
C   -1   1  -1   2

• Indel scores: $s(x, -) = s(-, x) \stackrel{?}{=} -1$
Scoring an alignment
• Alignment $x', y'$ (the rows with gap characters inserted) from sequences $x$ and $y$.
E.g.: $x$ = AAGTT, $y$ = AATT, alignment is
x' AAGTT
y' AA-TT
• Alignment score is
$$S(x', y') = \sum_{i=1}^{|x'|} s(x'_i, y'_i)$$
• Here:
$$S(x', y') = s(A, A) + s(A, A) + s(G, -) + s(T, T) + s(T, T)$$
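As a worked instance of the sum above, here is a minimal Python sketch using the DNA score table from the earlier slide. It assumes the tentative indel score of -1. The names `SCORES`, `INDEL`, `pair_score` and `alignment_score` are made up for this illustration.

```python
# DNA pair scores from the slide: 2 for identity, 1 for A/G or C/T, -1 otherwise.
SCORES = {
    "A": {"A": 2, "T": -1, "G": 1, "C": -1},
    "T": {"A": -1, "T": 2, "G": -1, "C": 1},
    "G": {"A": 1, "T": -1, "G": 2, "C": -1},
    "C": {"A": -1, "T": 1, "G": -1, "C": 2},
}
INDEL = -1  # assumption: the slide marks this value with a question mark


def pair_score(a, b, gap="-"):
    """s(a, b): table lookup, or the indel score if either symbol is a gap."""
    if a == gap or b == gap:
        return INDEL
    return SCORES[a][b]


def alignment_score(row_x, row_y):
    """S(x', y'): sum of pair scores over the alignment columns."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row_x, row_y))


# s(A,A) + s(A,A) + s(G,-) + s(T,T) + s(T,T) = 2 + 2 - 1 + 2 + 2
print(alignment_score("AAGTT", "AA-TT"))  # 7
```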
How do we choose an alignment?
• Want to choose the best global alignment
• Many alignments
• Given $x = x_1 x_2 \cdots x_m$ and $y = y_1 y_2 \cdots y_n$, find the alignment $x', y'$ (the two rows with gaps inserted) that maximizes the score $S(x', y')$.
• Idea: Find the best way of ending the alignment
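To get a feeling for "many alignments", the sketch below counts global alignments with the same three-way case split (pair, gap in y, gap in x). It counts alignments as sequences of columns, so two alignments that differ only in the order of an adjacent insertion and deletion are counted separately. The function name `count_alignments` is an assumption for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of global alignments (as column sequences) of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only gaps remain: exactly one way to finish
    return (count_alignments(m - 1, n - 1)   # last column pairs x_m with y_n
            + count_alignments(m - 1, n)     # last column pairs x_m with a gap
            + count_alignments(m, n - 1))    # last column pairs a gap with y_n


for size in (2, 3, 10):
    print(size, count_alignments(size, size))
```

Even for two sequences of length 10 there are already millions of alignments, so trying them all is not an option for real proteins.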
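How to end alignment: alternatives
Let $M_{i,j}$ be the best score for aligning $x_1 \cdots x_i$ with $y_1 \cdots y_j$. The last column is one of:

x_1 ... x_{m-1}  x_m
y_1 ... y_{n-1}  y_n
Score: $M_{m-1,n-1} + s(x_m, y_n)$

or

x_1 ... x_{m-1}  x_m
y_1 ... y_n      -
Score: $M_{m-1,n} + s(x_m, -)$

or

x_1 ... x_m      -
y_1 ... y_{n-1}  y_n
Score: $M_{m,n-1} + s(-, y_n)$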
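A recursion for max alignment score
Note: for global alignment
$$M_{0,0} = 0$$
$$M_{m,n} = \max \begin{cases} M_{m-1,n-1} + s(x_m, y_n) & m > 0,\ n > 0 \\ M_{m-1,n} + s(x_m, -) & m > 0,\ n \ge 0 \\ M_{m,n-1} + s(-, y_n) & m \ge 0,\ n > 0 \end{cases}$$
We get:
$$M_{m,n} = \max_{x', y'} S(x', y')$$
where the maximum runs over all alignments $x', y'$ of $x$ and $y$.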
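The recursion can be transcribed almost literally into Python. This is a sketch for small inputs only, assuming the DNA scores from the earlier slide and an indel score of -1. The names `max_global_score`, `s` and `M` are chosen here to mirror the slide's notation.

```python
from functools import lru_cache


def max_global_score(x, y, indel=-1):
    """Best global alignment score M_{m,n}, computed straight from the recursion."""

    def s(a, b):
        # Slide's DNA example: 2 for identity, 1 for A/G or C/T, -1 otherwise.
        if a == b:
            return 2
        return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1

    @lru_cache(maxsize=None)
    def M(m, n):
        if m == 0 and n == 0:
            return 0
        options = []
        if m > 0 and n > 0:
            options.append(M(m - 1, n - 1) + s(x[m - 1], y[n - 1]))
        if m > 0:
            options.append(M(m - 1, n) + indel)   # x_m paired with a gap
        if n > 0:
            options.append(M(m, n - 1) + indel)   # gap paired with y_n
        return max(options)

    return M(len(x), len(y))


print(max_global_score("AAGTT", "AATT"))  # 7, the score of AAGTT / AA-TT
```

The cache is what keeps this from recomputing the same $M_{i,j}$ over and over. Without it the running time grows exponentially.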
Computing $M_{m,n}$
• Keep $M_{i,j}$ in a table
• Table + Recursion = Dynamic Programming
• Needleman-Wunsch algorithm
• $mn$ elements in table ⇒ time complexity is ∼ $mn$.
• When filling the table, note alternatives.
• Backtracking for retrieving the alignment.
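A minimal sketch of the table-filling and backtracking steps, in Python. It assumes a constant indel score of -1 and the DNA scores from the earlier slide. The names `needleman_wunsch` and `dna_score` are illustrative. Instead of storing the alternatives while filling the table, this version re-derives them during backtracking, which is a simplification of what the slide describes.

```python
def dna_score(a, b):
    """Slide's DNA example: 2 for identity, 1 for A/G or C/T, -1 otherwise."""
    if a == b:
        return 2
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1


def needleman_wunsch(x, y, score=dna_score, indel=-1):
    """Return (best global score, aligned row for x, aligned row for y)."""
    m, n = len(x), len(y)
    # M[i][j] = best score for aligning x[:i] with y[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + indel
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)

    # Backtrack from the bottom-right corner, rebuilding one optimal alignment.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + indel:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


best, ax, ay = needleman_wunsch("AAGTT", "AATT")
print(best)
print(ax)
print(ay)
```

When several alignments share the best score, the backtracking order (pair first, then gap in y, then gap in x) decides which one is returned.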
DP and backtracking
From Eddy, Nature Biotech, 2004
DP for local alignments
• Smith-Waterman algorithm
• Allow "restarting" from zero.
$$M_{0,0} = 0$$
$$M_{m,n} = \max \begin{cases} M_{m-1,n-1} + s(x_m, y_n) & m > 0,\ n > 0 \\ M_{m-1,n} + s(x_m, -) & m > 0,\ n \ge 0 \\ M_{m,n-1} + s(-, y_n) & m \ge 0,\ n > 0 \\ 0 & \leftarrow \text{Here!} \end{cases}$$
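A short Python sketch of the local variant, again assuming the DNA scores from the earlier slide and an indel score of -1. The names `smith_waterman_score` and `dna_score` are illustrative. Note that the best local score is the largest entry anywhere in the table, not necessarily the bottom-right one. Recovering the alignment itself would mean backtracking from that entry until a zero is reached, which is left out here.

```python
def dna_score(a, b):
    """Slide's DNA example: 2 for identity, 1 for A/G or C/T, -1 otherwise."""
    if a == b:
        return 2
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1


def smith_waterman_score(x, y, score=dna_score, indel=-1):
    """Best local alignment score of x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: the extra "0" case makes every border cell 0.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(0,  # restart: drop everything aligned so far
                          M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)
            best = max(best, M[i][j])
    return best


# ACGT occurs inside TTACGTT, so the local score is 4 identities = 8.
print(smith_waterman_score("ACGT", "TTACGTT"))
```

For the same pair, a global alignment has to pay for the three unmatched T's, which is why local and global alignments of the same sequences can look so different.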