Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright 1996,...

29
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Robert F. Murphy Copyright Copyright 1996, 1999, 1996, 1999, 2001. 2001. All rights reserved. All rights reserved.

Transcript of Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright 1996,...

Computational Biology, Part 9Efficient database searching

methods

Computational Biology, Part 9Efficient database searching

methods

Robert F. MurphyRobert F. Murphy

Copyright Copyright 1996, 1999, 2001. 1996, 1999, 2001.

All rights reserved.All rights reserved.

Efficient database searching methodsEfficient database searching methods Dynamic programming requires order NDynamic programming requires order N22L L

computations (where N is size of the query computations (where N is size of the query sequence and L is the size of the database)sequence and L is the size of the database)

Given size of databases, more efficient Given size of databases, more efficient methods neededmethods needed

“Hit and extend” sequence searching“Hit and extend” sequence searching Problem: Too many calculations “wasted” Problem: Too many calculations “wasted”

by comparing regions that have nothing in by comparing regions that have nothing in commoncommon

Initial insight: Regions that are Initial insight: Regions that are similarsimilar between two sequences are likely to share between two sequences are likely to share short stretches that are short stretches that are identicalidentical

Basic method: Look for similar regions only Basic method: Look for similar regions only near short stretches that match near short stretches that match exactlyexactly

“Hit and extend” sequence searching“Hit and extend” sequence searching We define a We define a wordword size that is the minimum size that is the minimum

number of exact “letter” matches that must number of exact “letter” matches that must occur before we do any further comparison or occur before we do any further comparison or alignmentalignment

How do we find all of the occurences of How do we find all of the occurences of matching words between a sequence and a matching words between a sequence and a database?database? Could scan sequence a word at a time, but this is Could scan sequence a word at a time, but this is

order L (size of database)order L (size of database)

Word searching - hashingWord searching - hashing

Solution: Use a precomputed table that lists Solution: Use a precomputed table that lists where in the database each possible word where in the database each possible word occursoccurs GenerationGeneration of the table is of order L (size of of the table is of order L (size of

database) but database) but useuse of the table is of order N (size of the table is of order N (size of query sequence)of query sequence)

The computer science term for this The computer science term for this approach is approach is hashinghashing

HashingHashing

(Demonstration A9)(Demonstration A9)

Database searching using wordsReferences

Database searching using wordsReferences

W. J. Wilbur and D. J. Lipman. W. J. Wilbur and D. J. Lipman. Rapid similarity Rapid similarity searches of nucleic acid and protein data banks. searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:Proc. Natl. Acad. Sci. U.S.A. 80:726-730 (1983)726-730 (1983)

D. J. Lipman and W. R. Pearson. D. J. Lipman and W. R. Pearson. Rapid and Rapid and sensitive protein similarity searches. sensitive protein similarity searches. Science Science 227:227:1435-1441 (1985) 1435-1441 (1985) [FASTP][FASTP]

W. R. Pearson and D. J. Lipman. W. R. Pearson and D. J. Lipman. Improved tools Improved tools for biological sequence comparison. for biological sequence comparison. Proc. Natl. Proc. Natl. Acad. Sci. U.S.A. 85:Acad. Sci. U.S.A. 85:2444-2448 (1988) 2444-2448 (1988) [FASTA][FASTA]

FASTAFASTA

Heavily used for searching databases until Heavily used for searching databases until advent of BLAST (see below)advent of BLAST (see below)

InputsInputs k (word or ktuple) sizek (word or ktuple) size similarity matrixsimilarity matrix

Compares query sequence pairwise with Compares query sequence pairwise with each sequence in the databaseeach sequence in the database

FASTA methodFASTA method

1. Find 1. Find diagonals diagonals (paired pieces from each (paired pieces from each sequence without gaps) that have the sequence without gaps) that have the highest density of common wordshighest density of common words

2. Rescore these using a scoring (similarity) 2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute matrix and trim ends that do not contribute to the highest scoreto the highest score Result: partial alignments Result: partial alignments withoutwithout gaps gaps Reported as the “init1” scoreReported as the “init1” score

FASTA methodFASTA method

3. Join regions together, including penalties 3. Join regions together, including penalties for gapsfor gaps Result: Result: unoptimizedunoptimized alignment with gaps alignment with gaps Reported as the “initn” scoreReported as the “initn” score

4. Use dynamic programming in a band 32 4. Use dynamic programming in a band 32 residues wide around the best “initn” scoreresidues wide around the best “initn” score Result: optimized alignment with gapsResult: optimized alignment with gaps Reported as the “opt” scoreReported as the “opt” score

Comments on FASTAComments on FASTA

Larger ktuple increases speed since fewer Larger ktuple increases speed since fewer “hits” are found but it also decreases “hits” are found but it also decreases sensitivity for finding similar but not sensitivity for finding similar but not identical sequences since exact matches of identical sequences since exact matches of this length are requiredthis length are required

Limitations of FASTALimitations of FASTA

FASTA can miss significant similarity sinceFASTA can miss significant similarity since For proteins, similar sequences do not have to For proteins, similar sequences do not have to

share identical residuesshare identical residuesAsp-Lys-ValAsp-Lys-Val is quite similar to is quite similar to Glu-Arg-IleGlu-Arg-Ile yet yet

it is missed even with ktuple size of 1 since no it is missed even with ktuple size of 1 since no amino acid matchesamino acid matches

Gly-Asp-Gly-Lys-GlyGly-Asp-Gly-Lys-Gly is quite similar to is quite similar to

Gly-Glu-Gly-Arg-GlyGly-Glu-Gly-Arg-Gly but there is match with but there is match with ktuple size of 2ktuple size of 2

Limitations of FASTALimitations of FASTA

FASTA can miss significant similarity sinceFASTA can miss significant similarity since For nucleic acids, due to codon “wobble”, DNA For nucleic acids, due to codon “wobble”, DNA

sequences may look like XXyXXyXXy where sequences may look like XXyXXyXXy where X’s are conserved and y’s are notX’s are conserved and y’s are notGGuUCuACgAAgGGuUCuACgAAg and and GGcUCcACaAAAGGcUCcACaAAA

both code for the same peptide sequence (Gly-Ser-both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with ktuple size of 3 Thr-Lys) but they don’t match with ktuple size of 3 or higheror higher

BLAST (Basic Local Alignment Search Tool)BLAST (Basic Local Alignment Search Tool) Goal: find sequences from database similar Goal: find sequences from database similar

to query sequenceto query sequence Previous tools use either Previous tools use either

direct, theoretically sound but computationally direct, theoretically sound but computationally slow approach to examine all possible slow approach to examine all possible alignments of query with database (dynamic alignments of query with database (dynamic programming)programming)

indirect, heuristic but computationally fast indirect, heuristic but computationally fast approach to find similar sequences by first approach to find similar sequences by first finding identical stretches (FASTP, FASTA)finding identical stretches (FASTP, FASTA)

BLAST (Basic Local Alignment Search Tool)BLAST (Basic Local Alignment Search Tool) BLAST combines best of both by using BLAST combines best of both by using

theoretically sound method which searches theoretically sound method which searches for similar sequences directly but for similar sequences directly but computationally fastcomputationally fast

ReferenceReference S. F. Altschul, W. Gish, W. Miller, E. W. S. F. Altschul, W. Gish, W. Miller, E. W.

Myers and Myers and D. J. LipmanD. J. Lipman. . Basic Local Basic Local Alignment Search Tool. Alignment Search Tool. J. Mol. Biol. 215:J. Mol. Biol. 215:403-403-410 (1990)410 (1990)

Global vs. Local AlgorithmsGlobal vs. Local Algorithms

We distinguishWe distinguish GlobalGlobal similar algorithms which optimize similar algorithms which optimize

overall alignment between two sequences overall alignment between two sequences (dynamic programming)(dynamic programming)

LocalLocal similar algorithms which see only similar algorithms which see only relatively conserved pieces of sequence relatively conserved pieces of sequence (FASTA, BLAST)(FASTA, BLAST)

BLAST basicsBLAST basics

Need similarity measure, as in dynamic Need similarity measure, as in dynamic programming - use PAM-120 for proteinsprogramming - use PAM-120 for proteins

Define Define maximal segment pair (MSP) maximal segment pair (MSP) to be to be the highest scoring pair of identical length the highest scoring pair of identical length segments chosen from 2 sequences (in segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal)FASTA terms, highest init1 diagonal)

BLAST basicsBLAST basics

Define a segment pair to be locally maximal Define a segment pair to be locally maximal if its score cannot be improved either by if its score cannot be improved either by extending or by shortening both segmentsextending or by shortening both segments

BLAST basicsBLAST basics

Approach: find segment pairs by first Approach: find segment pairs by first finding word pairs that score above a finding word pairs that score above a threshold, i.e., find word pairs of fixed threshold, i.e., find word pairs of fixed length length ww with a score of at least with a score of at least TT

Key concept: Seems similar to FASTA, but Key concept: Seems similar to FASTA, but we are searching for words which we are searching for words which score score above T above T rather than that rather than that match exactlymatch exactly

BLAST method for proteinsBLAST method for proteins

1. Compile a list of words which give a score 1. Compile a list of words which give a score above above TT when paired with the query when paired with the query sequence.sequence. Example using PAM-120 for query sequence Example using PAM-120 for query sequence

ACDE (ACDE (ww=4, =4, TT=17):=17): A C D EA C D EACDE = +3 +9 +5 +5 = 22ACDE = +3 +9 +5 +5 = 22

try all possibilities:try all possibilities:AAAA = +3 -3 0 0 = 0 AAAA = +3 -3 0 0 = 0 no goodno goodAAAC = +3 -3 0 -7 = -7 AAAC = +3 -3 0 -7 = -7 no goodno good

...too slow, try directed change...too slow, try directed change

Generating word listGenerating word list

A C D EA C D EACDE = +3 +9 +5 +5 = 22ACDE = +3 +9 +5 +5 = 22

change 1st pos. to all acceptable substitutionschange 1st pos. to all acceptable substitutionsgCDEgCDE = 1 9 5 5 = 20 = 1 9 5 5 = 20 ok ok (=pCDE,sCDE,(=pCDE,sCDE, tCDE)tCDE)nCDEnCDE = 0 9 5 5 = 19 = 0 9 5 5 = 19 ok ok (=dCDE,eCDE,(=dCDE,eCDE, nCDE,vCDE)nCDE,vCDE)iCDEiCDE = -1 9 5 5 = 18 = -1 9 5 5 = 18 ok ok (=qCDE)(=qCDE)kCDEkCDE = -2 9 5 5 = 17 = -2 9 5 5 = 17 ok ok (=mCDE)(=mCDE)

change 2nd pos.: can't - all alternatives negative and change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13 the other three positions only add up to 13

change 3rd pos. in combination with first positionchange 3rd pos. in combination with first positiongCnEgCnE = 1 9 2 5 = 17 = 1 9 2 5 = 17 okok

continue - use recursioncontinue - use recursion

Generating word listGenerating word list

For "best" values of For "best" values of ww and and TT there are there are typically about 50 words in the list for every typically about 50 words in the list for every residue in the query sequenceresidue in the query sequence

BLAST method for proteinsBLAST method for proteins

2. Scan the database for hits with the compiled 2. Scan the database for hits with the compiled list of words. Two approaches:list of words. Two approaches: Use index of all possible words (for Use index of all possible words (for ww=4, need =4, need

array of size 20array of size 2044=160,000. Can compress this =160,000. Can compress this index using pointers to save space.index using pointers to save space.

Use finite state machine (actually used)Use finite state machine (actually used) Calculate a state transition table that tells what state to Calculate a state transition table that tells what state to

go to based on the next character in the sequencego to based on the next character in the sequence

3. Extend hits to form segment pairs3. Extend hits to form segment pairs

BLAST Method for DNABLAST Method for DNA

1. Make list of all contiguous 1. Make list of all contiguous ww-mers in the -mers in the query sequence (often query sequence (often ww=12)=12)

2. Compress database by packing 4 2. Compress database by packing 4 nucleotides into a single byte (use auxiliary nucleotides into a single byte (use auxiliary table to tell you where sequences start and table to tell you where sequences start and stop within the compressed database) -- stop within the compressed database) -- doesn't allow for unspecified bases doesn't allow for unspecified bases (wildcards)(wildcards)

BLAST Method for DNABLAST Method for DNA

3. Compress the 3. Compress the ww-mers from the query -mers from the query sequence the same way.sequence the same way.

4. Search the compressed database for matches 4. Search the compressed database for matches with the compressed with the compressed ww-mers-mers Since all frames of the query sequence are Since all frames of the query sequence are

considered separately, any match of length considered separately, any match of length ww>=11 >=11 must contain a match of length 8 that lies on a byte must contain a match of length 8 that lies on a byte boundary of one of the boundary of one of the ww-mers from the query -mers from the query sequence. Thus can scan a (packed) byte at a time, sequence. Thus can scan a (packed) byte at a time, improving speed 4-fold over comparing one improving speed 4-fold over comparing one nucleotide at a time.nucleotide at a time.

BLAST Method for DNABLAST Method for DNA

Problem: if query sequence has a stretch of Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., or a repeated sequence element (e.g., AluAlu sequence) there will be many hits with sequence) there will be many hits with "uninteresting" regions."uninteresting" regions.

BLAST Method for DNABLAST Method for DNA

Solution:Solution: During compression of the database, tabulate During compression of the database, tabulate

frequencies of all 8-tuples.frequencies of all 8-tuples. Make a list of those occurring very frequently (more Make a list of those occurring very frequently (more

frequently than expected by chance).frequently than expected by chance). Remove these words from the query list of Remove these words from the query list of ww-mers -mers

before searching database.before searching database. Remove words matching a sublibrary of repeated Remove words matching a sublibrary of repeated

sequences (but report the matches to that sublibrary sequences (but report the matches to that sublibrary when done).when done).

BLAST Statistical significanceBLAST Statistical significance

A key to the utility of BLAST is the ability A key to the utility of BLAST is the ability to calculate expected probabilities of to calculate expected probabilities of occurrence of Maximum Segment Pairs occurrence of Maximum Segment Pairs (MSPs) given w and T(MSPs) given w and T

This allows BLAST to rank matching This allows BLAST to rank matching sequences in order of “significance” and to sequences in order of “significance” and to cut off listings at a user-specified cut off listings at a user-specified probabilityprobability

Summary of Database Search MethodsSummary of Database Search Methods

Authors (Program) Description

Needleman & Wunsch full alignment

Wilbur & Lipman match ktuple - formdiag - NW

Lipman & Pearson(FASTP)

ktuple - diag - rescore

Pearson & Lipman(FASTA)

FASTP - join diags-NW

Altschul et al (BLAST) word match list -statistics