Sequence Analysis: Part II. Database Searching
Transcript of Sequence Analysis: Part II. Database Searching
![Page 1: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/1.jpg)
Bioinformatics
Sequence Analysis: Part II. DatabaseSearching
Fran Lewitter, Ph.D.Head, BiocomputingWhitehead Institute
![Page 2: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/2.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 2
Course SyllabusJan 7 Sequence Analysis I. Pairwise alignments, database searching including
BLAST (FL) [1, 2, 3]Jan 14 Sequence Analysis II. Database searching (continued), Pattern searching(FL)[7]Jan 21 No Class - Martin Luther King HolidayJan 28 Sequence Analysis III. Hidden Markov models, gene finding algorithms (FL)[8]
Feb 4 Computational Methods I. Genomic Resources and Unix (GB)Feb 11 Computational Methods II. Sequence analysis with Perl. (GB)Feb 18 No Class - President's BirthdayFeb 25 Computational Methods III. Sequence analysis with Perl and BioPerl (GB)
Mar 4 Proteins I. Multiple sequence alignments, phylogenetic trees (RL) [4, 6]Mar 11 Proteins II. Profile searches of databases, revealing protein motifs (RL) [9]Mar 18 Proteins III.Structural Genomics:structural comparisons and predictions (RL)
Mar 25 Microarrays: designing chips, clustering methods (FL)
![Page 3: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/3.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 3
Topics to Cover• Introduction• Scoring alignments• Alignment methods
– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm
(Smith-Waterman (Local), Needleman-Wunsch(Global)
– Heuristic methods; Approximate methods; word or k-tuple (FASTA, BLAST)
• Significance of alignments• Database searching methods• Demo
![Page 4: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/4.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 4
Questions
• Why do a database search?• What database should be searched?• What alignment algorithm to use?• What do the results mean?
![Page 5: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/5.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 5
Issues affecting DB Search
• Substitution matrices• Statistical significance• Filtering• Database choices
![Page 6: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/6.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 6
BLASTP Results
![Page 7: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/7.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 7
Topics to Cover• Introduction• Scoring alignments• Alignment methods
– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm (Smith-
Waterman (Local), Needleman-Wunsch (Global)– Heuristic methods; Approximate methods; word or k-tuple
(FASTA, BLAST)• Significance of alignments• Database searching methods• Demo
![Page 8: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/8.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 8
Significance of Alignment
![Page 9: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/9.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 8
Significance of Alignment
How strong can an alignment be expected by chancealone?
![Page 10: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/10.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 8
Significance of Alignment
How strong can an alignment be expected by chancealone?
• Real but non-homologous sequences
![Page 11: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/11.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 8
Significance of Alignment
How strong can an alignment be expected by chancealone?
• Real but non-homologous sequences• Real sequences that are shuffled to preserve
compositional properties
![Page 12: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/12.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 8
Significance of Alignment
How strong can an alignment be expected by chancealone?
• Real but non-homologous sequences• Real sequences that are shuffled to preserve
compositional properties• Sequences that are generated randomly based upon a
DNA or protein sequence model
![Page 13: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/13.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 9
Extreme Value Distribution• When 2 sequences have been
aligned optimally, thesignificance of a localalignment score can be testedon the basis of the distributionof scores expected by aligningtwo random sequences of thesame length and composition asthe two test sequences.
-2 20 5
x
![Page 14: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/14.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 10
Statistical Significance
![Page 15: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/15.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 10
Statistical Significance
• Raw Scores - score of an alignment equal to thesum of substitution and gap scores.
![Page 16: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/16.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 10
Statistical Significance
• Raw Scores - score of an alignment equal to thesum of substitution and gap scores.
• Bit scores - scaled version of an alignment’s rawscore that accounts for the statistical properties ofthe scoring system used.
![Page 17: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/17.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 10
Statistical Significance
• Raw Scores - score of an alignment equal to thesum of substitution and gap scores.
• Bit scores - scaled version of an alignment’s rawscore that accounts for the statistical properties ofthe scoring system used.
• E-value - expected number of distinct alignmentsthat would achieve a given score by chance.Lower E-value => more significant.
![Page 18: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/18.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 11
Some formulas
![Page 19: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/19.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 11
Some formulas
E = Kmn e-lS
![Page 20: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/20.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 11
Some formulas
E = Kmn e-lS
This is the Expected number of high-scoringsegment pairs (HSPs) with score at least S
for sequences of length m and n.
![Page 21: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/21.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 11
Some formulas
E = Kmn e-lS
This is the Expected number of high-scoringsegment pairs (HSPs) with score at least S
for sequences of length m and n.
This is the E value for the score S.
![Page 22: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/22.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 12
Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods
– BLAST - ungapped and gapped– BLAST vs. FASTA– PSI-BLAST– PHI-BLAST– Pattern searching
• Demo
![Page 23: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/23.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 13
Low Complexity Regions• Local regions of biased composition• Common in real sequences• Generate false positives on BLAST search
• DUST for BLASTN (n’s in sequence)• SEG for other programs (x’s in sequence)
Filtering is only applied to the query sequence(or its translation products), not to databasesequences.
![Page 24: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/24.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 14
Filtered Sequence>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTT
![Page 25: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/25.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 14
Filtered Sequence>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTT
NEEYTKNKTEYEE
![Page 26: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/26.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 14
Filtered Sequence>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTT
NEEYTKNKTEYEE
TALTTEETLT
![Page 27: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/27.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 15
Example Alignment w/o filteringScore = 29.6 bits (65), Expect = 1.8Identities = 22/70 (31%), Positives = 32/70 (45%), Gaps = 12/70 (17%)
Query: 31 PPPTTQGAPRTSSFTPTTLT------------NGTSHSPTALNGAPSPPNGFS 71 PPP+ Q R S + T T NG+S S ++ + + S + SSbjct: 1221 PPPSVQNQQRWGSSSVITTTCQQRQQSVSPHSNGSSSSSSSSSSSSSSSSSTS 1273
Query: 72 NGPSSSSSSSLANQQLP 88 + SSSS+SS Q PSbjct: 1274 SNCSSSSASSCQYFQSP 1290
![Page 28: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/28.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 16
Example BLAST w/ filtering Score = 36.6 bits (83), Expect = 0.67 Identities = 21/58 (36%), Positives = 25/58 (42%), Gaps = 1/58 (1%)
Query: 471 AEDALAVINQQEDSSESCWNCGRKASETCSGCNTARYCGSFCQHKDWE-KHHHICGQT 527 A D V Q + + C CG A TCS C A YC Q DW+ H C Q+Sbjct: 61 ASDTECVCLQLKSGAHLCRVCGCLAPMTCSRCKQAHYCSKEHQTLDWQLGHKQACTQS 118
Score = 37.0 bits (84), Expect = 0.55 Identities = 18/55 (32%), Positives = 22/55 (39%)
Query: 483 DSSESCWNCGRKASETCSGCNTARYCGSFCQHKDWEKHHHICGQTLQAQQQGDTP 537 D C CG A++ C+ C ARYC Q DW H C + D PSbjct: 75 DGPGLCRICGCSAAKKCAKCQVARYCSQAHQVIDWPAHKLECAKAATDGSITDEP 129
![Page 29: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/29.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
![Page 30: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/30.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments
![Page 31: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/31.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments• Use different scoring system for gaps
![Page 32: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/32.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics
![Page 33: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/33.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by
default
![Page 34: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/34.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by
default• WU-BLAST looks for and reports multiple
regions of similarity
![Page 35: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/35.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 17
WU-BLAST vs NCBI BLAST
• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by
default• WU-BLAST looks for and reports multiple
regions of similarity• Results will be different
![Page 36: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/36.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT
![Page 37: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/37.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC
![Page 38: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/38.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC• BLAT is not BLAST
![Page 39: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/39.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC• BLAT is not BLAST• For DNA it is designed to quickly find sequences of >=
95% similarity of length 40 bases or more.
![Page 40: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/40.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC• BLAT is not BLAST• For DNA it is designed to quickly find sequences of >=
95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of
length 20 amino acids or more.
![Page 41: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/41.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC• BLAT is not BLAST• For DNA it is designed to quickly find sequences of >=
95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of
length 20 amino acids or more.• DNA BLAT works by keeping an index of the entire
genome in memory - non-overlapping 11-mers (< 1 GB ofRAM)
![Page 42: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/42.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 18
BLAT• Developed by Jim Kent at UCSC• BLAT is not BLAST• For DNA it is designed to quickly find sequences of >=
95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of
length 20 amino acids or more.• DNA BLAT works by keeping an index of the entire
genome in memory - non-overlapping 11-mers (< 1 GB ofRAM)
• Protein BLAT uses 4-mers (~ 2 GB)
![Page 43: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/43.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 19
FASTA
![Page 44: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/44.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 19
FASTA
• Index "words" and locate identities
![Page 45: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/45.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 19
FASTA
• Index "words" and locate identities• Rescore best 10 regions
![Page 46: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/46.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 19
FASTA
• Index "words" and locate identities• Rescore best 10 regions• Find optimal subset of initial regions
that can be joined to form single alignment
![Page 47: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/47.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 19
FASTA
• Index "words" and locate identities• Rescore best 10 regions• Find optimal subset of initial regions
that can be joined to form single alignment• Align highest scoring sequences using
Smith-Waterman
![Page 48: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/48.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 20
PSI-BLAST
![Page 49: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/49.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 20
PSI-BLAST• Position specific iterative BLAST uses a profile (or
position specific scoring matrix, PSSM) that is constructed(automatically) from a multiple alignment of the highestscoring hits in an initial BLAST search.
![Page 50: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/50.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 20
PSI-BLAST• Position specific iterative BLAST uses a profile (or
position specific scoring matrix, PSSM) that is constructed(automatically) from a multiple alignment of the highestscoring hits in an initial BLAST search.
• The PSSM is generated by calculating position-specificscores for each position in the alignment. Highlyconserved positions receive high scores and weaklyconserved positions receive scores near zero.
![Page 51: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/51.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 20
PSI-BLAST• Position specific iterative BLAST uses a profile (or
position specific scoring matrix, PSSM) that is constructed(automatically) from a multiple alignment of the highestscoring hits in an initial BLAST search.
• The PSSM is generated by calculating position-specificscores for each position in the alignment. Highlyconserved positions receive high scores and weaklyconserved positions receive scores near zero.
• The profile is used to perform a second (etc.) BLASTsearch and the results of each "iteration" is used to refinethe profile. This iterative searching strategy results inincreased sensitivity.
![Page 52: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/52.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 21
Start with a BLASTP search
![Page 53: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/53.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 21
Start with a BLASTP search
![Page 54: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/54.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 21
Start with a BLASTP search
![Page 55: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/55.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 21
Start with a BLASTP search
![Page 56: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/56.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 22
PSI-BLAST - Iteration 1
![Page 57: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/57.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 58: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/58.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 59: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/59.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 60: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/60.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 61: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/61.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 62: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/62.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 63: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/63.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 23
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 64: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/64.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 24
Pattern Hit Initiated (PHI)-BLAST>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTT
![Page 65: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/65.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 24
Pattern Hit Initiated (PHI)-BLAST>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTT
DNA mismatchrepair proteins mutS
family signature
![Page 66: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/66.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 25
PHI-BLAST
![Page 67: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/67.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 25
PHI-BLAST
![Page 68: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/68.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 26
Basic Searching Strategies
• Search early and often• Use specialized databases• Use multiple matrices• Use filters• Consider Biology
![Page 69: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/69.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 27
Demo• Readseq• Entrez• NCBI
– BLAST2.0– PSI-BLAST– PHI-BLAST
• WU-BLAST2• FASTA• Smith-Waterman
![Page 70: Sequence Analysis: Part II. Database Searching](https://reader031.fdocuments.net/reader031/viewer/2022020706/61fcb66fc9d2d703a06db0d2/html5/thumbnails/70.jpg)
WIBR Bioinformatics Course, © Whitehead Institute, 2002 28
BLAST and FASTA Citations
http://www.ncbi.nlm.nih.gov/blasthttp://www.ebi.ac.uk/blast2/
http://www2.ebi.ac.uk/fasta33/http://www2.ebi.ac.uk/bic_sw/
• PNAS, 1988, 85: 2444-2448.• Journal of Molecular Biology, 1990, 215: 403-410.• Nature Genetics, 1994, 6: 119-129.• Nucleic Acids Research, 1997, 25(17):3389-3402.• Nucleic Acids Research, 1998, 26(17):3986-3990.• TIBS, 1998, 23:444-447.• Nucleic Acids Research, 2001, 29(2):351-361.• Nucleic Acids Research, 2001, 29(14):2994-3005.