Blast

BLAST- Basic Local Alignment Search Tool

- Similarity search program developed at NCBI
- Available as a free service over the Internet
- Provides very fast, accurate, and sensitive database searching
- A heuristic algorithm which seeks local alignments to detect relationships among sequences that share only isolated regions of similarity
- Like FASTA, BLAST is a ‘word-based’ method

BLAST works through the following 3 steps.

1. Breaks the query sequence into words of length w (typically 3 for amino acids and 11 for nucleotides), builds a list of high-scoring words from them, and locates all similar words in the current test (database) sequence.

2. Compares the word list to the database and identifies the exact matches. If similar words are found, BLAST tries to extend the alignment into the adjacent positions on both sides, without allowing for gaps.

3. After all words are tested, a set of Maximal Segment Pairs (MSPs) is chosen for that database sequence. Several short, non-overlapping MSPs may be combined in a statistical test to create a larger, more significant match.
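The three steps above can be sketched as a toy seed-and-extend search. This is only an illustration under simplifying assumptions, not NCBI's implementation: seeds are exact word matches (no list of similar words from a scoring matrix), scoring is a plain match/mismatch count, and the function names and the `x_drop` cutoff are hypothetical choices made for this example.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment pair.
    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far.
    """
    score = best = w * match

    # Extend to the right of the seed.
    best_right = 0
    k = 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break

    # Extend to the left, starting again from the best score so far.
    score = best
    best_left = 0
    k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break

    return best, i - best_left, j - best_left, best_left + w + best_right
```

For example, `find_seeds("TTACGTGG", "CCACGTGA", 4)` finds the shared words ACGT and CGTG, and extending the first seed grows it by one matching base to the right. A real search would repeat this for every database sequence and keep only segment pairs that pass a statistical significance test.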

Purpose of BLAST

As a growing number of genomes are sequenced, a researcher often comes across a novel DNA or protein sequence for which no functional information is available.

Some basic information on the sequence is necessary before a molecular biologist can even take the new sequence into the lab and perform meaningful experiments with it.

Database searches reveal sequences that have some degree of similarity to the query sequence. These database sequences are commonly referred to as ‘hits’ and are used to infer homology and molecular function.

Identity- when two sequences are compared to each other, identity indicates the extent to which the two sequences have the exact same composition (i.e., nucleotide base or amino acid residue) at equivalent positions, usually expressed as a percentage.

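Similarity- when two genes or proteins are compared with each other, similarity indicates the level of relatedness between the two on the basis of their primary sequences. For DNA sequences, this is the number of identical bases at equivalent positions, usually expressed as a percentage. For protein sequences, similarity also counts positions where one residue is replaced by a chemically similar one (a conservative substitution).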
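As a worked illustration of the identity definition, the sketch below computes percent identity for two sequences that are already aligned. It assumes equal-length aligned strings with `-` marking gaps and divides by the full alignment length; other denominators (for example, the shorter sequence's length) are also in use, so the choice here is an assumption. The function name is illustrative.

```python
def percent_identity(a, b):
    """Percentage of alignment columns in which both sequences carry the same residue.

    Assumes a and b are already aligned (equal length, '-' for gaps).
    A column containing a gap never counts as identical.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)
```

For the aligned pair `ACGTACGT` and `ACGTTCGA`, six of the eight columns are identical, so the function returns 75.0.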

Simple Classification of amino acids

Based on the nature of side-chains:

- Aliphatic amino acids: G, A, V, L, I, P
- Aromatic amino acids: F, Y, W
- Polar amino acids: S, T, N, Q
- Sulfur-containing amino acids: C, M
- Charged amino acids: D, E, H, K, R

Based on Hydrophobicity:

- Amino acids with hydrophilic side-chains: N, G, Q, R, H, K
- Amino acids with hydrophobic side-chains: V, I, L, M, P

Based on charge:

- Positively charged: K, R
- Negatively charged: D, E
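The side-chain grouping above maps naturally onto a lookup table. The following is a minimal sketch that encodes the five classes listed under "Based on the nature of side-chains"; the names `SIDE_CHAIN_CLASS`, `CLASS_OF`, `residue_class`, and `same_class` are illustrative. Treating "same class" as a stand-in for "similar residue" is a simplification made for this example; BLAST itself scores substitutions with a matrix such as BLOSUM62.

```python
# Classes taken from the side-chain classification in the text.
SIDE_CHAIN_CLASS = {
    "aliphatic": "GAVLIP",
    "aromatic": "FYW",
    "polar": "STNQ",
    "sulfur-containing": "CM",
    "charged": "DEHKR",
}

# Invert the table: one-letter residue code -> class name.
CLASS_OF = {aa: name for name, residues in SIDE_CHAIN_CLASS.items() for aa in residues}


def residue_class(aa):
    """Return the side-chain class of a one-letter amino acid code."""
    return CLASS_OF[aa.upper()]


def same_class(a, b):
    """True if both residues fall in the same side-chain class."""
    return residue_class(a) == residue_class(b)
```

With this table, `same_class("K", "R")` is true (both charged), while `same_class("D", "L")` is false.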

BLAST Services from NCBI

1. Nucleotide BLAST- allows one to input nucleotide sequences and compare these against other nucleotide sequences.

2. Standard nucleotide-nucleotide BLAST- takes nucleotide sequences in FASTA format, GenBank accession numbers or GI numbers and compares them against the NCBI nucleotide databases.

3. MEGA BLAST- This program uses a ‘greedy algorithm’ for nucleotide sequence alignment searches and concatenates many queries to save time spent scanning the database. It is optimized for aligning sequences that differ slightly and is up to 10 times faster than more common sequence similarity programs. It can be used to compare two large sets of sequences against each other and gives the results very quickly.

4. Protein BLAST- allows one to input protein sequences and compare these against other protein sequences.

5. Standard protein-protein BLAST- takes protein sequences in FASTA format, GenBank accession numbers or GI numbers and compares them against the NCBI protein database.

6. Pattern Hit Initiated BLAST (PHI-BLAST)- combines matching of a regular expression pattern with a Position Specific Iterative protein search. PHI-BLAST can locate other protein sequences that both contain the regular expression pattern and are homologous to a query protein sequence.

7. Translating BLAST- translates query sequences or databases from nucleotides to proteins so that protein-nucleotide sequence comparisons can be performed.

8. Translated query- Protein database (BLASTX)- converts a nucleotide query sequence into protein sequences in all 6 reading frames. The translated protein products are then compared against the NCBI protein databases.

9. Protein query- Translated database (TBLASTN)- takes a protein query sequence and compares it against an NCBI nucleotide database that has been translated in all six reading frames.

10. Translated query- Translated database (TBLASTX)- converts a nucleotide query sequence into protein sequences in all 6 reading frames and then compares this to an NCBI nucleotide database which has been translated in all 6 reading frames.

11. Position Specific Iterated BLAST (PSI-BLAST)- an implementation of BLAST for finding protein families. Instead of using a single amino acid at a given position in the query sequence, it is better to use a combination of amino acids known to be present at the same position in that protein and related ones. The search of sequence databases will thereby be expanded to include additional related sequences that might otherwise be missed. The major difficulty with such an expanded search is that an alignment of related sequences must already be available in order to know the variations at each position in the query sequence. PSI-BLAST has been designed to provide information on this variation starting with a BLAST search by a single query sequence.
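The six-frame translation used by BLASTX, TBLASTN, and TBLASTX (items 8 to 10 above) can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an uppercase or lowercase DNA string over A, C, G, T; the function names are hypothetical, frames are labelled +1 to +3 for the forward strand and -1 to -3 for the reverse complement, and any codon containing another character is translated as `X`.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame label (+1..+3, -1..-3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For the input `ATGGCCTAA`, frame +1 reads ATG GCC TAA and gives `MA*`, while frame -1 reads the reverse complement TTAGGCCAT and gives `LGH`. Each of the six translations would then be compared against protein sequences as in an ordinary protein search.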

PSI-BLAST involves a series of repeated steps or iterations:

(i) A database search of a protein sequence database is performed using a query sequence.

(ii) The results of the search are presented and can be assessed visually to see if any database sequences that are significantly related to the query sequence are present.

(iii) If so, the user can decide to go through another iteration of the search.

(iv) The high scoring sequence matches found in the first step are aligned and from the alignment a sequence motif that indicates the variations at each aligned position is produced. The database is then searched with this motif. The search has thus been expanded to include sequences that match the variations found in the motif at each sequence position.

(v) The results are again displayed, indicating any newly discovered sequences that are significantly related to the motif sequences in addition to those found in the previous iteration.

(vi) Again, an opportunity is given to go through another iteration of the program, but this time including any newly recruited sequences to refine the motif. In this fashion, a new family of sequences that are significantly similar to the original query sequence can be found.
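The motif built in step (iv) can be pictured as a table of residue frequencies per alignment column. The sketch below is a simplified illustration, not the PSI-BLAST algorithm itself: PSI-BLAST derives a position-specific scoring matrix with sequence weighting and pseudocounts, whereas this example uses raw column frequencies and a plain sum as the score. The function names and the example sequences are made up for illustration.

```python
from collections import Counter


def build_profile(aligned_hits):
    """Return one {residue: frequency} dict per alignment column (gaps ignored)."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values())
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile


def profile_score(profile, segment):
    """Sum, over positions, the profile frequency of the residue seen in segment."""
    return sum(column.get(r, 0.0) for column, r in zip(profile, segment))
```

With the aligned hits `MKVL`, `MRVL`, and `MKIL`, the second column allows both K and R and the third allows both V and I. A candidate such as `MRIL` then receives credit at every position, even though it differs from the original query `MKVL` at two of them. This is the sense in which the search is expanded to include the variations found at each position.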

PSI-BLAST applications:

- Distant homology detection
- Fold assignment
- Domain identification
- Evolutionary analysis (i.e., tree building)
- Sequence annotation/function assignment
- Profile export to other programs
- Sequence clustering
- Structural genomics target selection
