Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
![Page 1: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/1.jpg)
Sequence Analysis
CSC 487/687 Introduction to computing for Bioinformatics
![Page 2: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/2.jpg)
Aligning Sequences
Sequences
  Represent protein or nucleic acid (DNA/RNA) molecules
  Order of amino acids (for proteins; nucleotides for DNA/RNA) along one chain
Sequence alignment
  The identification of residue-residue correspondences
  Any assignment of correspondences that preserves the order of residues within the sequences
![Page 3: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/3.jpg)
Evolutionary Basis of Sequence Alignment
Identity: quantity that describes how much two sequences are alike in the strictest terms.
Similarity: quantity that relates how much two amino acid sequences are alike.
Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.
![Page 4: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/4.jpg)
Evolutionary Basis of Sequence Alignment
Homologous sequences
  Related by evolution (common ancestors)
Alignment of homologous sequences
  Identifying the relationship between the sequence elements
  Match up characters that derive from the same character in the ancestor
![Page 5: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/5.jpg)
Alignment and Evolution
Assume we know the evolutionary history relating q and d.
The true alignment can be found using h as a template:
  h : GLVS T
  q’: GLISVT
  d’: GIV--T
![Page 6: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/6.jpg)
Alignment and Evolution
Given an alignment, several different evolutionary histories may be (equally) plausible.
Example:
Alignment:
  q’: GLISVT
  d’: G-I-VT
One possible history:
          H*: GLIVT
            /    \
     ->S   /      \   L->
          /        \
  q: GLISVT      d: GIVT
![Page 7: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/7.jpg)
Global and Local Alignment
Global
  Assumes that the complete sequences result from evolution from the same ancestor sequence
Local
  Aligns segments of the sequences so that the segments are evolutionarily related
[Figure: two panels, each relating an Ancestor sequence to S1 and S2]
![Page 8: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/8.jpg)
Pairwise sequence alignments vs. multiple sequence alignments
Pairwise sequence alignment: an alignment of two sequences
Multiple sequence alignment: a mutual alignment of more than two sequences
![Page 9: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/9.jpg)
The dotplot
![Page 10: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/10.jpg)
The dotplot
Captures not only the overall similarity of two sequences, but also the complete set and relative quality of different possible alignments.
  Diagonal: a residue from each sequence is aligned (match or substitution)
  Horizontal: a gap is introduced in the sequence indexing the rows
  Vertical: a gap is introduced in the sequence indexing the columns
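A minimal sketch of how a dotplot can be built, assuming the simplest rule: a dot wherever the row character equals the column character, with no windowing or substitution matrix. The names `dotplot` and `render` are illustrative and not taken from the slides.

```python
def dotplot(row_seq, col_seq):
    """Return a matrix of booleans: True where row_seq[i] == col_seq[j]."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Draw the dotplot as text, '*' for a match and '.' otherwise."""
    grid = dotplot(row_seq, col_seq)
    lines = ["  " + col_seq]  # header row: the sequence indexing the columns
    for r, row in zip(row_seq, grid):
        lines.append(r + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GLISVT", "GIVT"))
```

Runs of `*` parallel to the main diagonal mark stretches where the two sequences agree; a break in the run means an alignment path has to step off the diagonal, which introduces a gap.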
![Page 11: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/11.jpg)
Dotplots and alignments
A path through the dotplot can be read as an edit script:
  Each move performs an operation: a substitution, an insertion or a deletion.
  When the end of the path is reached, the combined effect is to change one sequence into the other.
Several different sequences of edit operations may convert one string to the other in the same number of steps.
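The idea of reading a path as an edit script can be sketched as follows. This assumes the path is given as two gapped rows of equal length, with '-' as the blank; `edit_script`, `apply_script` and `steps` are hypothetical helper names. The two example alignments of INVEST and INTEREST are different scripts with the same number of steps.

```python
def edit_script(top, bottom, blank="-"):
    """List the operations that turn the ungapped top row into the bottom row."""
    ops = []
    for x, y in zip(top, bottom):
        if x == blank:
            ops.append(("insert", y))
        elif y == blank:
            ops.append(("delete", x))
        elif x != y:
            ops.append(("substitute", x, y))
        else:
            ops.append(("match", x))
    return ops


def apply_script(ops):
    """Replay the script and return the resulting string."""
    out = []
    for op in ops:
        if op[0] == "insert" or op[0] == "match":
            out.append(op[1])
        elif op[0] == "substitute":
            out.append(op[2])
        # a delete emits nothing
    return "".join(out)


def steps(ops):
    """Number of real edit operations (matches cost nothing)."""
    return sum(op[0] != "match" for op in ops)


for top, bottom in [("IN--VEST", "INTEREST"), ("INV--EST", "INTEREST")]:
    ops = edit_script(top, bottom)
    print(apply_script(ops), steps(ops))
```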
![Page 12: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/12.jpg)
Dotplots and alignments
Although a sequence of edit operations derived from an optimal alignment may correspond to an actual evolutionary pathway, it is impossible to prove that it does.
The larger the edit distance, the larger the number of reasonable evolutionary pathways between two sequences.
![Page 13: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/13.jpg)
Dotplots and alignments
Dotplots between pairs of proteins with increasingly distant relationships.
Dotplot comparisons of the sulphydryl proteinase papain from papaya with four homologues: the close relative, kiwi fruit actinidin, and the more distant relatives, human procathepsin L, human cathepsin B, and a homologue from Staphylococcus aureus.
![Page 14: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/14.jpg)
Example
![Page 15: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/15.jpg)
Example
![Page 16: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/16.jpg)
Example
![Page 17: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/17.jpg)
Example
![Page 18: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/18.jpg)
Measures of sequence similarity
Hamming distance: the number of positions with mismatching characters.
Edit distance: the minimum number of "edit operations" required to change one string into the other.
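Both measures can be sketched in a few lines. Two assumptions are made here that the slide does not state: the Hamming distance is only computed for strings of equal length, and the edit operations are substitution, insertion and deletion, each counted as one step. The function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of unit-cost substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


print(hamming("GLISVT", "GLIVVT"))
print(edit_distance("INVEST", "INTEREST"))
```

The edit distance, unlike the Hamming distance, is defined for strings of different lengths because insertions and deletions can absorb the difference.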
![Page 19: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/19.jpg)
What is an Alignment?
A global alignment of two sequences A and B contains all characters of A and B in the same order:
  one symbol from A can be aligned with one symbol from B
  a symbol can be aligned with a blank, written as '-'
  two blanks cannot be aligned
  every symbol from A and from B must be aligned
Example: A: INVEST, B: INTEREST
  IN--VEST    INV--EST    IN-V--EST
  INTEREST    INTEREST    INT-EREST
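The rules above can be checked mechanically. The sketch below assumes an alignment is given as two gapped rows with '-' as the blank; `is_global_alignment` is a hypothetical helper name.

```python
def is_global_alignment(row_a, row_b, a, b, blank="-"):
    """Check the global-alignment rules for two gapped rows."""
    # Every column pairs one entry from each row.
    if len(row_a) != len(row_b):
        return False
    # Two blanks cannot be aligned with each other.
    if any(x == blank and y == blank for x, y in zip(row_a, row_b)):
        return False
    # Removing the blanks must give back A and B, complete and in order.
    return row_a.replace(blank, "") == a and row_b.replace(blank, "") == b


print(is_global_alignment("IN--VEST", "INTEREST", "INVEST", "INTEREST"))
print(is_global_alignment("INV--EST", "INTEREST", "INVEST", "INTEREST"))
print(is_global_alignment("IN--VEST-", "INTEREST-", "INVEST", "INTEREST"))
```

The last call fails the check because its final column aligns a blank with a blank.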
![Page 20: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/20.jpg)
Computing Alignments
There exist a large number of alignments for a pair of sequences
To use a computer to do the alignment process in a meaningful way, we need:
  Scoring scheme: a mathematical way to calculate the goodness of candidate alignments
  Search method: an algorithm able to identify high-scoring alignments
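To give a feel for how large the number of alignments is, the sketch below counts them. It assumes two alignments are different whenever their sequences of columns differ, and that a column is either a pair of characters or a character against a blank. Under that assumption the last column can be one of three kinds, which gives the recurrence in the code; the resulting values are known as the Delannoy numbers. `count_alignments` is an illustrative name.

```python
def count_alignments(m, n):
    """Count global alignments of sequences of lengths m and n."""
    # Aligning against an empty sequence can be done in exactly one way.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j]        # last column: (a_i, -)
                            + counts[i][j - 1]      # last column: (-, b_j)
                            + counts[i - 1][j - 1]) # last column: (a_i, b_j)
    return counts[m][n]


for length in (2, 3, 6, 20):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, which is why scoring every candidate is not a practical search method.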
![Page 21: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/21.jpg)
Choosing Scoring Scheme
The scoring scheme should be:
  Simple, to allow for efficient calculation and search for the best alignment
  Biologically meaningful (give high scores to biologically good alignments)
![Page 22: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/22.jpg)
Simple Scoring Scheme
Assign a score to each column in the alignment
Columns are of the following sorts:
  two characters a and b aligned: score R(a,b)
  a character aligned with a blank: score -g (gap penalty g)
Alignment score: sum of the scores over all columns
R: matrix giving the score for all possible character pairs (e.g., all pairs of amino acid symbols)
![Page 23: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/23.jpg)
Alignment Score: Example
R is the identity matrix: identical characters score 1, unequal characters score 0; gap penalty g = 1
ALIGN1:
  V  -  E  I  T  G  E  I  S  T
  P  R  E  -  T  E  R  I  -  T
  0 -1  1 -1  1  0  0  1 -1  1    Score: 1
ALIGN2:
  V  E  I  T  G  E  I  S  T
  P  R  E  T  -  E  R  I  T
  0  0  0  1 -1  1  0  0  1    Score: 2
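The column-by-column scoring used in this example can be sketched as below, assuming the identity matrix (match 1, mismatch 0) and a gap penalty g = 1 subtracted for every column containing a blank. The function and parameter names are illustrative.

```python
def alignment_score(row_a, row_b, match=1, mismatch=0, gap=1, blank="-"):
    """Sum the column scores of an alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == blank or y == blank:
            score -= gap        # column with a blank: pay the gap penalty
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # unequal characters
    return score


print(alignment_score("V-EITGEIST", "PRE-TERI-T"))  # ALIGN1: 1
print(alignment_score("VEITGEIST", "PRET-ERIT"))    # ALIGN2: 2
```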
![Page 24: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/24.jpg)
Finding the Minimum Scoring Alignment
Large number of possible alignments: we cannot generate them all and score each one to find the best
Task: align A = a1a2...am and B = b1b2...bn
![Page 25: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/25.jpg)
Independence Between Sub-alignments
Observations:
  The score of the alignment up to and including character i from A and character j from B is independent of how the rest of the sequences are aligned
  The best solution to (i,j) can be "locked", and its score recorded in D_{i,j}
  D_{m,n} is the score of the best global alignment
Amenable to dynamic programming
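One way to picture the "locking" of sub-solutions is memoization: each (i,j) is solved once and its value reused. The sketch below assumes unit costs (1 per mismatch or blank, 0 per match) and minimizes cost; `best_cost` is a hypothetical name, and Python's `functools.lru_cache` plays the role of the table D.

```python
from functools import lru_cache


def best_cost(a, b):
    """Return (cost of the best global alignment, number of locked cells)."""

    @lru_cache(maxsize=None)
    def D(i, j):
        # Cost of the best alignment of a[:i] with b[:j].
        if i == 0:
            return j  # align the empty prefix with j blanks
        if j == 0:
            return i
        return min(D(i - 1, j) + 1,                          # a_i against a blank
                   D(i, j - 1) + 1,                          # b_j against a blank
                   D(i - 1, j - 1) + (a[i - 1] != b[j - 1])) # a_i against b_j

    cost = D(len(a), len(b))
    return cost, D.cache_info().currsize


cost, cells = best_cost("INVEST", "INTEREST")
print(cost, cells)
```

Only (m+1)(n+1) distinct sub-problems are ever solved, even though the number of complete alignments is far larger.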
![Page 26: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/26.jpg)
Dynamic programming algorithm
Individual edit operations include:
  Substitution of bj for ai, represented (ai, bj)
  Deletion of ai from sequence A, represented (ai, -)
  Deletion of bj from sequence B, represented (-, bj)
![Page 27: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/27.jpg)
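Dynamic programming algorithm
A cost function d is defined on edit operations:
  d(ai, bj) = cost of a mutation in an alignment in which position i of sequence A corresponds to position j of sequence B
  d(ai, -) or d(-, bj) = cost of a deletion or insertion
The minimum weighted distance between sequences A and B is defined as

$$
D(A,B) = \min_{A \to B} \sum d(x,y)
$$

where the minimum is taken over all sequences of edit operations that convert A into B, and the sum runs over the operations (x,y) in the sequence.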
![Page 28: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/28.jpg)
Three Alternative Alignment Ends
The alignment between a1a2...ai and b1b2...bj ends in one of three ways:
  a1..i-1  ai      a1..i-1  ai      a1..i    -
  b1..j    -       b1..j-1  bj      b1..j-1  bj
To calculate D_{i,j} we pick the one that gives the lowest cost.
![Page 29: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/29.jpg)
Recurrence Relation
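Assume that D_{i-1,j}, D_{i-1,j-1} and D_{i,j-1} have already been calculated. Then

$$
D_{i,j} = \min
\begin{cases}
D_{i-1,j} + d(a_i, -) \\
D_{i-1,j-1} + d(a_i, b_j) \\
D_{i,j-1} + d(-, b_j)
\end{cases}
$$

The three cases correspond to an alignment ending in (ai, -), (ai, bj) and (-, bj) respectively.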
![Page 30: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/30.jpg)
Basis of Recursion
Aligning an empty string to a string of length i (resp. j) can be done by aligning it to i (resp. j) blanks:

$$
D_{0,0} = 0, \qquad
D_{i,0} = \sum_{k=1}^{i} d(a_k, -), \qquad
D_{0,j} = \sum_{k=1}^{j} d(-, b_k)
$$
Calculating Score of Best Alignment Using Matrix
[Figure: the H matrix; label: cost of best alignment]
![Page 32: Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.](https://reader037.fdocuments.net/reader037/viewer/2022110211/56649ec55503460f94bcf7b8/html5/thumbnails/32.jpg)
Time Complexity
Sequences of lengths n and m: O(nm)
Two sequences of length l: O(l^2)
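One way to see these bounds, assuming the dynamic programming table has one cell for every pair (i, j) with 0 ≤ i ≤ m and 0 ≤ j ≤ n, and that filling a cell takes at most a constant number c of operations (c is a symbol introduced here for illustration):

```latex
\[
T(m,n) \le c\,(m+1)(n+1) = O(nm),
\qquad
T(l,l) \le c\,(l+1)^2 = O(l^2)
\]
```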