Sequence Alignment, Blast, Fasta, MSA
![Page 1: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/1.jpg)
Sequence Similarity Searches (Blast), Pairwise and Multiple Sequence Alignments
Sucheta Tripathy, 16th November 2012
![Page 2: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/2.jpg)
Sequence Similarities: Why?
A protein sequence from species A
◦ Which species has the most similar protein?
◦ Where did it originate?
◦ What is its putative function?
◦ Does it contain a conserved motif, etc.?
![Page 3: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/3.jpg)
Similarity Searches
Blast (Basic Local Alignment Search Tool)
◦ NCBI Blast
◦ Wu-Blast
◦ PSI-Blast
Fasta
SSearch
![Page 4: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/4.jpg)
Blast
Heuristic (educated guess)
Does not compare the sequences in their entirety.
Quickly locates short matches (seeds); the seed length is the word size.
Seeds are extended in both directions.
A score threshold is defined:
◦ Score > threshold -> keep the alignment
◦ Score < threshold -> discard the alignment
![Page 5: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/5.jpg)
Example of word size
GLKFA, word size 3 -> GLK, LKF, KFA
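The word-size example can be reproduced with a sliding window. This is a minimal sketch; the function name `words` is chosen here for illustration and does not come from any Blast implementation.

```python
def words(sequence, word_size):
    """Return every overlapping word (k-mer) of length word_size, left to right."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


print(words("GLKFA", 3))  # ['GLK', 'LKF', 'KFA']
```

A sequence of length L yields L - k + 1 words of size k, so the 5-residue example gives 3 words.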
![Page 6: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/6.jpg)
Blast Contd…
A query sequence
◦ Nucleotide
◦ Protein
A target database
◦ Nucleotide
◦ Protein
Blast program
◦ Blastn
◦ Blastp
◦ tBlastx (slowest; translated nucleotide query against translated nucleotide database)
◦ tBlastn (protein query against translated nucleotide database)
◦ Blastx (translated nucleotide query against protein database)
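The choice of program follows from the query type and the database type. The lookup below is only a sketch of that mapping; the table layout and the helper name `candidate_programs` are assumptions made for this example, not part of any Blast interface.

```python
# program -> (query type, database type, query translated?, database translated?)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", False, False),
    "blastp":  ("protein",    "protein",    False, False),
    "blastx":  ("nucleotide", "protein",    True,  False),
    "tblastn": ("protein",    "nucleotide", False, True),
    "tblastx": ("nucleotide", "nucleotide", True,  True),
}


def candidate_programs(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database, _, _) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )


print(candidate_programs("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
print(candidate_programs("protein", "nucleotide"))     # ['tblastn']
```

A nucleotide query against a nucleotide database has two options: blastn compares the nucleotides directly, while tblastx compares translations of both and is therefore the slowest.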
![Page 7: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/7.jpg)
Blast Parameters
E value -> expected number of hits with an equal or better score that would occur by chance
Score -> similarity score
◦ Chance analogy: the probability of rain is 0.001
◦ Passing by chance, etc.
◦ The lower the E-value, the more significant the alignment.
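For reference, a sketch of how the E value relates to the score under Karlin-Altschul statistics, which are commonly used to describe Blast results. The symbols are not on the slide and are introduced here for illustration: $S$ is the raw alignment score, $m$ the query length, $n$ the database length, and $K$ and $\lambda$ are constants that depend on the scoring matrix and the background residue frequencies.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

The E value therefore grows with the size of the database searched and falls exponentially as the score rises, which is why the same score can be significant in a small database and not in a large one.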
![Page 8: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/8.jpg)
Blast Step by Step
Remove low-complexity regions.
Generate all the k-mers.
List all possible matching key words.
- Blast cares only about high-scoring pairs.
- Fasta stores all pairs irrespective of their scores.
Extend the matches into high-scoring pairs (HSPs).
Evaluate the results against the thresholds set.
Extend HSPs and join them together.
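The seed and extend steps can be sketched as follows. This is a deliberately simplified illustration, not the Blast implementation: it uses exact word matches only (protein Blast also considers similar words that score above a threshold), a plain match/mismatch score instead of a substitution matrix, and an ungapped extension that stops once the score drops too far. The names `find_seeds`, `extend_seed` and `x_drop` are assumptions made for this example.

```python
def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) for every exact k-mer shared by both."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend one seed without gaps in both directions.

    Each direction stops when the running score falls more than x_drop
    below the best score seen so far. Returns (score, query_start, query_end)
    with query_end exclusive.
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    return k * match + left_score + right_score, i - left_len, i + k + right_len


query, subject = "AAGLKFATT", "CCGLKFACC"
seeds = find_seeds(query, subject, 3)
print(seeds)                                          # [(2, 2), (3, 3), (4, 4)]
print(extend_seed(query, subject, *seeds[0], 3))      # (5, 2, 7)
```

Here the first seed GLK is extended to the right into GLKFA (query positions 2 to 6), and the extension stops when the flanking mismatches pull the score down.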
![Page 9: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/9.jpg)
ATGGGGCGAGGCAGCGGCACCTTCGAGCGTCTCCTAGACAAGGCGACCAGCCAGCTCCTGTTGGAGACAGATTGGGAGTCCATTTTGCAGATCTGCGACCTGATCCGCCAAGGGGACACACAAGCAAAATATGCTGTGAATTCCATCAAGAAGAAAGTCAACGACAAGAACCCACACGTCGCCTTGTATGCCCTGGAGGTCATGGAATCTGTGGTAAAGAACTGTGGCCAGACAGTTCATGATGAGGTGGCCAACAAGCAGACCATGGAGGAGCTGAAGGACCTGCTGAAGAGACAAGTGGAGGTAAACGTCCGTAACAAGATCCTGTACCTGATCCAGGCCTGGGCGCATGCCTTCCGGAACGAGCCCAAGTACAAGGTGGTCCAGGACACCTACCAGATCATGAAGGTGGAGGGGCACGTCTTTCCAGAATTCAAAGAGAGCGATGCCATGTTTGCTGCCGAGAGAGCCCCAGACTGGGTGGACGCTGAGGAATGCCACCGCTGCAGGGTGCAGTTCGGGGTGATGACCCGTAAGCACCACTGCCGGGCGTGTGGGCAGATATTCTGTGGAAAGTGTTCTTCCAAGTACTCCACCATCCCCAAGTTTGGCATCGAGAAGGAGGTGCGCGTGTGTGAGCCCTGCTACGAGCAGCTGAACAGGAAAGCGGAGGGAAAGGCCACTTCCACCACTGA
![Page 10: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/10.jpg)
Pairwise Sequence comparison
Dot matrix method (bioinfx.net)
Dynamic programming method
◦ Global (Needleman-Wunsch method)
◦ Local (Smith-Waterman method)
Word method or k-tuple method (heuristic)
FTFTALILLAVAV
FTALLLAAV
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC50453/pdf/pnas01096-0363.pdf
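The global dynamic programming method can be sketched in a few lines. This is a minimal Needleman-Wunsch example under assumed scores (match +1, mismatch -1, linear gap penalty -1); a real alignment would use a substitution matrix and separate gap-open and gap-extend penalties. It is applied to the two peptide sequences shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned against gaps only
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("FTFTALILLAVAV", "FTALLLAAV")
print(best)  # 5
print(top)
print(bottom)
```

The score of 5 comes from 9 matches and 4 gap positions. Several alignments reach that score, so the gap placement printed depends on the traceback order. The local (Smith-Waterman) variant differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell.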
![Page 11: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/11.jpg)
Multiple Sequence Alignment
![Page 12: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/12.jpg)
Clustal
Uses a neighbor-joining (NJ) guide tree.
◦ N = number of sequences
Number of pairs = ½ * N! / (N-r)! with r = 2, i.e. N(N-1)/2
5 sequences (5, 4, 3, 2, 1) give 10 pairs:
(5,4), (5,3), (5,2), (5,1); (4,3), (4,2), (4,1); (3,2), (3,1); (2,1)
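The pair count can be checked by enumerating the pairs directly. A small sketch using the standard library; the helper name `sequence_pairs` is chosen here for illustration.

```python
from itertools import combinations


def sequence_pairs(n):
    """All unordered pairs among sequences labelled n, n-1, ..., 1."""
    return list(combinations(range(n, 0, -1), 2))


pairs = sequence_pairs(5)
print(len(pairs))  # 10
print(pairs[:4])   # [(5, 4), (5, 3), (5, 2), (5, 1)]
```

Because the number of pairwise comparisons grows as N(N-1)/2, doubling the number of sequences roughly quadruples the work needed before the guide tree can be built.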
![Page 13: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/13.jpg)
Types of Matrices
PAM
BLOSUM
GONNET
DNA identity matrix
DNA PUPY matrix
![Page 14: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/14.jpg)
Scores and Penalty
Substitution Matrices
Insertions and deletions are less likely than a substitution.
An insertion or deletion in a DNA sequence can lead to a frame shift.
PAM Matrices (Point Accepted Mutation matrices), Margaret Dayhoff, 1978
PAM1 -> expected rates of substitution if 1% of the amino acids have changed
BLOSUM: Blocks Substitution Matrix (numbered by % identity)
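As a sketch of how higher-numbered PAM matrices relate to PAM1: in Dayhoff's model the PAM1 mutation probability matrix is multiplied by itself to extrapolate to larger evolutionary distances. The symbol $M_n$ is introduced here for illustration and is not on the slide.

```latex
M_n = (M_1)^n, \qquad \text{e.g. } M_{250} = (M_1)^{250}
```

The scoring matrix named PAM250 is the log-odds form of $M_{250}$, which is why higher PAM numbers suit more distantly related sequences.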
![Page 15: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/15.jpg)
PAM matrices are based on a simple evolutionary model
MATLFC
MLTLCC
Ancestral sequence? M(A/L)TL(F/C)C
Two changes
• Only mutations are allowed
• Sites evolve independently
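Because the model treats sites independently and allows substitutions only, the two sequences can be compared position by position. A minimal sketch; the function name `differing_sites` is an assumption made for this example.

```python
def differing_sites(a, b):
    """Return (position, residue_a, residue_b) for each mismatching site (1-based)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (no gaps in this model)")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(differing_sites("MATLFC", "MLTLCC"))  # [(2, 'A', 'L'), (5, 'F', 'C')]
```

The two reported sites are exactly the ambiguous positions in the candidate ancestral sequence M(A/L)TL(F/C)C.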
![Page 16: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/16.jpg)
Guidelines for Using Matrices

| Protein query length | Matrix | Open gap | Extend gap |
|---|---|---|---|
| >300 | BLOSUM50 | -10 | -2 |
| 85-300 | BLOSUM62 | -7 | -1 |
| 50-85 | BLOSUM80 | -16 | -4 |
| >300 | PAM250 | -10 | -2 |
| 85-300 | PAM120 | -16 | -4 |
| 35-85 | MDM40 | -12 | -2 |
| <=35 | MDM20 | -22 | -4 |
| <=10 | MDM10 | -23 | -4 |

PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
![Page 17: Sequence Alignment,Blast, Fasta, MSA](https://reader035.fdocuments.net/reader035/viewer/2022081422/556e2fa0d8b42a6a698b4f54/html5/thumbnails/17.jpg)
Scoring Matrices
S = [s_ij] gives the score of aligning character i with character j for every pair i, j.
STPP
CTCA
0 + 3 + (-3) + 1 = 1
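The column-by-column sum can be written out as a short sketch. Only the four pair scores that appear in the slide's example are included; a full substitution matrix would hold a score for every pair of amino acids. The names `PAIR_SCORES` and `alignment_score` are chosen here for illustration.

```python
# Pair scores taken from the worked example only (a tiny excerpt of a matrix).
PAIR_SCORES = {
    ("S", "C"): 0,
    ("T", "T"): 3,
    ("P", "C"): -3,
    ("P", "A"): 1,
}


def pair_score(x, y):
    """Look up s_ij; the matrix is symmetric, so try both orders."""
    if (x, y) in PAIR_SCORES:
        return PAIR_SCORES[(x, y)]
    return PAIR_SCORES[(y, x)]


def alignment_score(top, bottom):
    """Sum the pair scores over the columns of an ungapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(top, bottom))


print(alignment_score("STPP", "CTCA"))  # 1
```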