BLAST bioinformatics

BLAST 06/01/2012

Introduction:

BLAST is an acronym for Basic Local Alignment Search Tool.

The BLAST program was developed by Stephen Altschul et al. of NCBI in 1990.

Like FASTA, it is a heuristic method.

It is one of the most popular programs for sequence analysis.

It enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold.

The objective is to find high-scoring ungapped segments among related sequences.

Using BLAST

http://www.ncbi.nlm.nih.gov/BLAST

1. Select the BLAST program to use (blastn, blastp, blastx, tblastn, tblastx)

2. Select the database to search (different BLAST programs have different databases)

3. Enter the query sequence

4. Submit the search

Steps in BLAST

The query sequence is optionally filtered to remove low-complexity regions (e.g., AGAGAG…).

The next step is to create a list of words from the query sequence.

Each word is typically 3 residues long for protein sequences and 11 residues long for DNA sequences.

The list includes every possible word extracted from the query sequence.

This step is also called seeding.

PROTEIN WORDS

Query: GTQITVEDLFYNIATRRKALKN

Words (word size = 3): GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...

Neighborhood words (similar words, e.g. for ITV): LTV, MTV, ISV, LSV, etc.

Make a lookup table of words.

Word size can be 2 or 3 (default = 3).

NUCLEOTIDE WORDS

Query: GTACTGGACATGGACCCTACAGGAA

Words (word size = 11): GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ...

Make a lookup table of words.

Minimum word size = 7; blastn default = 11; megablast default = 28.
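The word lists above can be reproduced with a few lines of Python. This is a minimal sketch of the seeding idea only, not BLAST's actual implementation; the function names `extract_words`, `build_lookup_table`, and `find_exact_hits`, and the short subject sequence, are made up for illustration.

```python
from collections import defaultdict

def extract_words(sequence, word_size):
    """Return every overlapping word of length word_size, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]

def build_lookup_table(sequence, word_size):
    """Map each word to the list of query positions where it starts."""
    table = defaultdict(list)
    for pos, word in enumerate(extract_words(sequence, word_size)):
        table[word].append(pos)
    return table

def find_exact_hits(table, subject, word_size):
    """Scan a subject sequence; report (query_pos, subject_pos) for each word hit."""
    hits = []
    for s_pos, word in enumerate(extract_words(subject, word_size)):
        for q_pos in table.get(word, []):
            hits.append((q_pos, s_pos))
    return hits

protein_query = "GTQITVEDLFYNIATRRKALKN"
dna_query = "GTACTGGACATGGACCCTACAGGAA"

protein_words = extract_words(protein_query, 3)   # word size 3 for proteins
dna_words = extract_words(dna_query, 11)          # word size 11 for DNA

print(protein_words[:8])
print(dna_words[:3])

# Hypothetical subject sequence containing the first DNA word at offset 2.
dna_table = build_lookup_table(dna_query, 11)
print(find_exact_hits(dna_table, "TTGTACTGGACATTT", 11))
```

A dictionary keyed by word gives constant-time lookup while scanning the database, which is the point of building the table once per query.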

The third step is to search a sequence database for occurrences of these words, that is, to identify database sequences containing matching words.

Using a substitution scoring matrix, each query word is scored against the database words, and the (log-odds) scores of the aligned residues are added.

A cut-off score (T) is selected to reduce the number of matches to the most significant ones.

The above procedure is repeated for each word in the query sequence.

The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences.
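The scoring of words against a cut-off T can be sketched as follows. The sketch assumes a small, hand-typed subset of BLOSUM62 covering only the residues I, L, M, V, T and S (check the values against a full matrix before relying on them), and the names `pair_score`, `word_score`, and `neighborhood` are hypothetical. A real search would enumerate candidates over all 20 amino acids.

```python
from itertools import product

# Assumption: a few BLOSUM62 entries typed in by hand, for six residues only.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1, ("I", "V"): 3,
    ("I", "T"): -1, ("I", "S"): -2,
    ("L", "L"): 4, ("L", "M"): 2, ("L", "V"): 1,
    ("L", "T"): -1, ("L", "S"): -2,
    ("M", "M"): 5, ("M", "V"): 1, ("M", "T"): -1, ("M", "S"): -1,
    ("V", "V"): 4, ("V", "T"): 0, ("V", "S"): -2,
    ("T", "T"): 5, ("T", "S"): 1,
    ("S", "S"): 4,
}
ALPHABET = "ILMVTS"  # restricted alphabet for this illustration

def pair_score(a, b):
    """Look up a symmetric substitution score for two residues."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(word1, word2):
    """Add the substitution scores of the aligned residues."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))

def neighborhood(word, threshold):
    """Return {candidate: score} for candidates scoring at least threshold."""
    neighbors = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        candidate = "".join(candidate)
        score = word_score(word, candidate)
        if score >= threshold:
            neighbors[candidate] = score
    return neighbors

for candidate in ["ITV", "LTV", "MTV", "ISV", "LSV"]:
    print(candidate, word_score("ITV", candidate))

print(sorted(neighborhood("ITV", 11)))
```

Under these entries the query word ITV scores 13 against itself, and LTV, MTV, ISV and LSV score 11, 10, 9 and 7, so how many of them count as neighborhood words depends directly on the value chosen for T.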

If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow.

The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is 22 for proteins and 20 for DNA).

The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
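The ungapped extension can be sketched as below. This is one reading of the rule above, in which extension stops once the running score falls more than a drop-off value below the best score seen so far. The match/mismatch scores (+5/-4), the function name `extend_ungapped`, and the test sequences are assumptions for illustration; only the drop threshold of 20 for DNA comes from the text.

```python
def extend_ungapped(query, subject, q_start, s_start, word_size,
                    match=5, mismatch=-4, x_drop=20):
    """Extend a word hit in both directions without gaps.

    Returns (best_score, hsp_query_start, hsp_query_end), end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(score(query[q_start + i], subject[s_start + i])
               for i in range(word_size))

    # Extend to the right of the seed.
    current, right, i = best, 0, 0
    while (q_start + word_size + i < len(query)
           and s_start + word_size + i < len(subject)):
        current += score(query[q_start + word_size + i],
                         subject[s_start + word_size + i])
        i += 1
        if current > best:
            best, right = current, i      # remember the best end point
        elif best - current > x_drop:
            break                         # score dropped too far: stop

    # Extend to the left of the seed, starting from the best score so far.
    current, left, i = best, 0, 0
    while q_start - 1 - i >= 0 and s_start - 1 - i >= 0:
        current += score(query[q_start - 1 - i], subject[s_start - 1 - i])
        i += 1
        if current > best:
            best, left = current, i
        elif best - current > x_drop:
            break

    return best, q_start - left, q_start + word_size + right

# Hypothetical sequences sharing the 11-mer GTACTGGACAT at offset 2.
hsp = extend_ungapped("ACGTACTGGACATGG", "ACGTACTGGACATAA", 2, 2, 11)
print(hsp)
```

Trimming the HSP back to the position of the best score, rather than the position where the extension stopped, keeps trailing mismatches out of the reported segment.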

In the original version of BLAST, the highest-scoring HSPs are presented as the final report.

A more recent improvement in the implementation of BLAST is the ability to provide gapped alignments.

In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.

The extension continues as long as the alignment score stays above a certain threshold; otherwise it is terminated.
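To illustrate how dynamic programming lets gaps into an alignment, here is a simplified Smith-Waterman style local alignment score. It is a stand-in technique, not gapped BLAST itself: it fills the whole matrix rather than extending from a seed with a score threshold, and the linear gap penalty and the scores (match +2, mismatch -1, gap -1) are assumed values.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap         # gap in b
            left = curr[j - 1] + gap   # gap in a
            curr[j] = max(0, diag, up, left)   # 0 restarts the local alignment
            best = max(best, curr[j])
        prev = curr
    return best

# One extra G in the first sequence is absorbed by a single gap.
print(local_alignment_score("ACGGT", "ACGT"))
```

With these scores the gapped alignment of ACGGT and ACGT (four matches, one gap) beats the best ungapped segment ACG, which is the kind of case the gapped extension is meant to capture.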

BLAST Output

1. an introduction that tells where the search occurred and what database and query were compared

2. a list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance

3. alignments of the high-scoring segment pairs showing identical and similar residues

4. a complete list of the parameter settings used for the search.

BLAST Variants

Program    Query sequence             Database sequence
BLASTP     protein                    protein
BLASTN     nucleic acid               nucleic acid
BLASTX     translated nucleic acid    protein
TBLASTN    protein                    translated nucleic acid
TBLASTX    translated nucleic acid    translated nucleic acid
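The table above can be encoded as a small lookup, for example to pick a program from the sequence types at hand. This is only an illustrative sketch; `choose_program` and its arguments are hypothetical and not part of any BLAST package.

```python
# (query type, database type) -> program, following the table above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # database is translated
}

def choose_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for the given sequence types."""
    if translate_both:
        # Translating both sides only makes sense for nucleotide vs nucleotide.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}, {db_type}")

print(choose_program("nucleotide", "protein"))
```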

Databases available on BLAST Web server

Database - Description

A. Peptide sequence databases

1. nr - translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF

2. month - new or revised entries or updates to nr in the previous 30 days

3. Swissprot - latest release of the SwissProt protein sequence database

4. Drosophila genome - provided by Celera and the Berkeley Drosophila Genome Project

5. yeast - yeast (Saccharomyces cerevisiae) genomic sequences

6. E. coli - E. coli genomic sequences

7. pdb - sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank

8. yeast - yeast (S. cerevisiae) protein sequences

9. E. coli - E. coli genomic coding sequence translations

10. kabat [kabatpro] - Kabat's database of sequences of immunological interest

11. alu - translations of select Alu repeats from REPBASE, a database of sequence repeats

B. Nucleotide sequence databases

1. nr - GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded)

2. month - new or revised entries or updates to nr in the previous 30 days

3. dbest - EST sequences from GenBank, EMBL, and DDBJ with redundancies removed

4. dbsts - STS sequences from GenBank, EMBL, and DDBJ with redundancies removed

5. htgs - high-throughput genomic sequences

6. kabat [kabatnuc] - Kabat's database of sequences of immunological interest

7. vector - vector subset of GenBank

8. mito - database of mitochondrial sequences

9. alu - select Alu repeats from REPBASE, a database of sequence repeats; suitable for masking Alu repeats from query sequences

10. epd - eukaryotic promoter database

11. gss - genome survey sequences, including single-pass genomic data, exon-trapped sequences, and Alu PCR sequences

Difference between BLAST and FASTA

BLAST                                                          FASTA
Uses a substitution matrix to find matching words              Uses a hashing procedure to find matching words
Word size: protein = 3; DNA = 11                               K-tuple: protein = 2; DNA = 4-6
Faster than FASTA                                              Slower than BLAST
Higher specificity than FASTA, due to low-complexity masking   Lower specificity

E-value (expectation value)

An important statistical indicator in sequence alignment.

It estimates how many alignments with an equally good score would be expected to arise by random chance in a database search; for small values it is close to the probability that a given sequence match is purely due to chance.

The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is.

Formula

The E-value is determined by the equation

E = m × n × P

where

m is the total number of residues in the database,

n is the number of residues in the query sequence, and

P is the probability that an HSP alignment is a result of random chance.
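As a quick worked example of the equation above, the sketch below plugs in made-up numbers. The function name `e_value` and all three input values are hypothetical and chosen only to show the arithmetic.

```python
def e_value(db_residues, query_length, p_chance):
    """E = m * n * P, with the symbols as defined above."""
    return db_residues * query_length * p_chance

# Assumed values: a 1,000,000-residue database, a 100-residue query,
# and a chance probability of 1e-12 for the HSP.
m, n, p = 1_000_000, 100, 1e-12
print(e_value(m, n, p))

# The same HSP against a database ten times larger is ten times less significant.
print(e_value(10 * m, n, p))
```

Because E scales with m and n, the same alignment yields a larger E-value when searched against a larger database, which is why E-values from different searches are not directly comparable.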

Bit Score

A bit score is another prominent statistical indicator used in addition to the E-value in a BLAST output.

The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score.

Formula

The bit score (S) is determined by the following formula:

S = (λ × s − ln K) / ln 2

where

λ is the Gumbel distribution constant,

s is the raw alignment score, and

K is a constant associated with the scoring matrix used.

Thus, the bit score (S) is linearly related to the raw alignment score (s).

Hence, the higher the bit score, the more significant the match is.
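The bit score formula translates directly into code. In this sketch the values λ = 0.318 and K = 0.134 are assumptions, used here as typical figures quoted for ungapped BLOSUM62 searches; the actual constants depend on the scoring system and are reported in the BLAST output.

```python
import math

def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Assumed constants for illustration only.
LAMBDA, K = 0.318, 0.134

for s in (50, 100, 200):
    print(s, round(bit_score(s, LAMBDA, K), 1))
```

Each extra unit of raw score adds the same λ / ln 2 bits, which is the linear relationship noted above.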
