Blast bioinformatics

BLAST 06/01/2012

Transcript of Blast bioinformatics

Page 2: Blast bioinformatics

Introduction:

Acronym for Basic Local Alignment Search Tool

The BLAST program was developed by Stephen Altschul et al. at NCBI in 1990

Also a heuristic method, like FASTA

It is one of the most popular programs for sequence analysis

Page 3: Blast bioinformatics

BLAST enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold

The objective is to find high-scoring ungapped segments among related sequences

Page 4: Blast bioinformatics

Using BLAST

http://www.ncbi.nlm.nih.gov/BLAST

1. Select the BLAST program to use (blastn, blastp, blastx, tblastn, tblastx)

2. Select the database to search (different BLAST programs have different databases)

3. Enter the query sequence

4. Submit the search

Page 5: Blast bioinformatics

Steps in BLAST

The query sequence is optionally filtered to remove low-complexity regions (AGAGAG…)

The next step is to create a list of words from the query sequence.

Each word is typically 3 residues for protein sequences and 11 residues for DNA sequences.

The list includes every possible word extracted from the query sequence.

This step is also called seeding.
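The word list described above can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's own code; the function name `extract_words` is hypothetical.

```python
def extract_words(sequence, word_size):
    """Return every overlapping word of length word_size, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


# Word size 3 for a protein query, 11 for a DNA query (the typical sizes above).
protein_words = extract_words("GTQITVEDLFYNIATRRKALKN", 3)
dna_words = extract_words("GTACTGGACATGGACCCTACAGGAA", 11)
```

A query of length L yields L - w + 1 words of size w, so the 22-residue protein query gives 20 words and the 25-base DNA query gives 15.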

Page 6: Blast bioinformatics

PROTEIN WORDS

Query: GTQITVEDLFYNIATRRKALKN

Words:

GTQ

TQI

QIT

ITV

TVE

VED

EDL

DLF

...

Neighborhood words (for the word ITV): LTV, MTV, ISV, LSV, etc.

Make a lookup table of words

Word size = 3 (word size can be 2 or 3; default = 3)

Page 7: Blast bioinformatics

NUCLEOTIDE WORDS

Query: GTACTGGACATGGACCCTACAGGAA

Words:

GTACTGGACAT

TACTGGACATG

ACTGGACATGG

CTGGACATGGA

TGGACATGGAC

GGACATGGACC

GACATGGACCC

ACATGGACCCT

...........

Make a lookup table of words

Word size = 11 (minimum word size = 7; blastn default = 11; megablast default = 28)

Page 8: Blast bioinformatics

The third step is to search a sequence database for occurrences of these words.

This step identifies the database sequences that contain matching words
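One way to picture the word search is a lookup table keyed by query word, scanned against each database sequence. The sketch below assumes exact word matches only; the names `build_lookup` and `find_hits` and the subject sequence are made up for illustration.

```python
from collections import defaultdict


def build_lookup(query, word_size):
    """Map each query word to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    table = build_lookup(query, word_size)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


query = "GTACTGGACATGGACCCTACAGGAA"
subject = "AATGGACATGGACTT"  # hypothetical database sequence
hits = find_hits(query, subject, 11)
```

Each hit is a seed: a pair of positions from which an alignment can later be extended.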

Page 9: Blast bioinformatics

Using a substitution score matrix, each query word is evaluated for matches with words in any database sequence; the (log-odds) scores of the aligned residue pairs are added to give a word score

A cut-off score (T) is selected to reduce the number of matches to the most significant ones

The above procedure is repeated for each word in the query sequence.

The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences.
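The scoring and cut-off idea can be sketched as follows. This is an illustration under loud assumptions: the alphabet is reduced to six residues, the pair scores are a small subset meant to mirror BLOSUM62 entries (check them against the real matrix before relying on them), and the thresholds used are arbitrary example values, not BLAST defaults.

```python
from itertools import product

# Assumed pair scores for a reduced alphabet (illustrative subset, not a full matrix).
SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("M", "M"): 5,
    ("V", "V"): 4, ("T", "T"): 5, ("S", "S"): 4,
    ("I", "L"): 2, ("I", "M"): 1, ("I", "V"): 3,
    ("L", "M"): 2, ("L", "V"): 1, ("M", "V"): 1,
    ("S", "T"): 1, ("I", "T"): -1, ("I", "S"): -2,
    ("L", "T"): -1, ("L", "S"): -2, ("M", "T"): -1,
    ("M", "S"): -1, ("V", "T"): 0, ("V", "S"): -2,
}
ALPHABET = "ILMVTS"


def pair_score(a, b):
    """Look up a symmetric pair score."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def neighborhood(word, threshold):
    """Return {candidate: score} for all words scoring >= threshold against word."""
    result = {}
    for letters in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(x, y) for x, y in zip(word, letters))
        if score >= threshold:
            result["".join(letters)] = score
    return result


loose = neighborhood("ITV", 7)    # example threshold T = 7
strict = neighborhood("ITV", 11)  # example threshold T = 11
```

Raising T shrinks the neighborhood, which trades sensitivity for speed.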

Page 10: Blast bioinformatics

If a good match is found, an alignment is extended from the match area in both directions as long as the score continues to grow.

The extension continues until mismatches pull the alignment score down by more than a drop threshold from the best score reached so far

(the drop threshold is twenty-two for proteins and twenty for DNA).
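A minimal sketch of ungapped extension with a drop-off rule is shown below. The scoring (+1 match, -3 mismatch), the drop threshold of 5, and the function names are assumptions chosen to keep the example small; they are not the values quoted above.

```python
def extend_one_way(query, subject, q, s, step, x_drop, match, mismatch):
    """Extend from (q, s) in direction step (+1 or -1).

    Returns (best_gain, best_len): the best score gained and how many
    residues were added to reach it.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_size,
               x_drop=5, match=1, mismatch=-3):
    """Extend a word hit in both directions; return (score, q_start, q_end)."""
    seed = sum(match if query[q_pos + i] == subject[s_pos + i] else mismatch
               for i in range(word_size))
    right_gain, right_len = extend_one_way(
        query, subject, q_pos + word_size, s_pos + word_size, 1,
        x_drop, match, mismatch)
    left_gain, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop, match, mismatch)
    return (seed + left_gain + right_gain,
            q_pos - left_len,
            q_pos + word_size + right_len)


query = "CCCACGTACGTGGG"
subject = "AAAACGTACGTTTT"
result = extend_hit(query, subject, 5, 5, 4)  # seed word "GTAC"
```

Here the seed "GTAC" grows to the segment "ACGTACGT" before mismatches on both sides trigger the drop-off.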

Page 11: Blast bioinformatics

The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP)

In the original version of BLAST, the highest-scoring HSPs are presented as the final report

Page 13: Blast bioinformatics

A more recent improvement in the implementation of BLAST is the ability to provide gapped alignments.

In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.

The extension continues while the alignment score stays above a certain threshold; otherwise it is terminated
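To show how dynamic programming lets gaps into an alignment, here is a plain Smith-Waterman local alignment score with a linear gap penalty. This is a swapped-in textbook technique, not the seeded, threshold-limited extension that gapped BLAST performs; the scoring values and function name are assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, or open a gap in a or in b.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


gapped = local_alignment_score("AAAACCCC", "AAAATCCCC")
# A prohibitive gap penalty approximates an ungapped comparison.
ungapped = local_alignment_score("AAAACCCC", "AAAATCCCC", gap=-100)
```

With gaps allowed, the extra T in the second sequence is skipped at the cost of one gap, which scores higher than forcing a mismatch.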

Page 14: Blast bioinformatics

BLAST Output

1. an introduction that tells where the search occurred and what database and query were compared

2. a list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance

3. alignments of the high-scoring segment pairs showing identical and similar residues

4. a complete list of the parameter settings used for the search.

Page 17: Blast bioinformatics

BLAST Variants

Program   Query sequence             Database sequence

BLASTP    protein                    protein

BLASTN    nucleic acid               nucleic acid

BLASTX    translated nucleic acid    protein

TBLASTN   protein                    translated nucleic acid

TBLASTX   translated nucleic acid    translated nucleic acid
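The table above amounts to a small lookup from query type and database type to a program name. A sketch, with a hypothetical helper `choose_program` and made-up type labels:

```python
# (query type, database type) -> program, following the table above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program; translate_both selects tblastx for DNA vs DNA."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]
```

For example, a DNA query searched against a protein database maps to blastx.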

Page 18: Blast bioinformatics

Databases available on BLAST Web server

Database - Description

A. Peptide sequence databases

1. nr - translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF

2. month - new or revised entries or updates to nr in the previous 30 days

3. Swissprot - latest release of the SwissProt protein sequence database

4. Drosophila genome - provided by Celera and the Berkeley Drosophila Genome Project

5. yeast - yeast (Saccharomyces cerevisiae) genomic sequences

6. E. coli - E. coli genomic sequences

7. pdb - sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank

8. yeast - yeast (S. cerevisiae) protein sequences

9. E. coli - E. coli genomic coding sequence translations

10. kabat [kabatpro] - Kabat’s database of sequences of immunological interest

11. alu - translations of select Alu repeats from REPBASE, a database of sequence repeats

Page 19: Blast bioinformatics

B. Nucleotide sequence databases

1. nr - GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded)

2. month - new or revised entries or updates to nr in the previous 30 days

3. dbest - EST sequences from GenBank, EMBL, and DDBJ with redundancies removed

4. dbsts - STS sequences from GenBank, EMBL, and DDBJ with redundancies removed

5. htgs - high-throughput genomic sequences

6. kabat [kabatnuc] - Kabat’s database of sequences of immunological interest

7. vector - vector subset of GenBank

8. mito - database of mitochondrial sequences

9. alu - select Alu repeats from REPBASE, a database of sequence repeats; suitable for masking Alu repeats from query sequences

10. epd - eukaryotic promoter database

11. gss - genome survey sequences, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences

Page 20: Blast bioinformatics

Difference between BLAST and FASTA

BLAST                                                         FASTA

Uses a substitution matrix to find matching words             Uses the hashing procedure

Word size: protein = 3; DNA = 11                              K-tuple: protein = 2; DNA = 4-6

Faster than FASTA                                             Slower than BLAST

Higher specificity than FASTA, due to low-complexity masking  Lower specificity

Page 21: Blast bioinformatics

E-value (expectation value)

Important statistical indicator in sequence alignment

It estimates how many alignments with an equal or better score would be expected to arise by random chance in a database search; for small values it is close to the probability of such a chance match

The E-value therefore provides information about the likelihood that a given sequence match is purely by chance.

The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is

Page 22: Blast bioinformatics

Formula

The E-value is determined by the equation

E = m × n × P

where

m is the total number of residues in the database,

n is the number of residues in the query sequence, and

P is the probability that an HSP alignment is a result of random chance.
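As a quick arithmetic check of the equation above, the sketch below evaluates E for made-up inputs. The function name `e_value` and all the numbers are hypothetical.

```python
def e_value(db_residues, query_residues, p_chance):
    """E = m x n x P, with m and n in residues and P a probability."""
    return db_residues * query_residues * p_chance


# Hypothetical search: a 1,000,000-residue database, a 100-residue query,
# and a chance probability of 1e-10 for the HSP.
e = e_value(1_000_000, 100, 1e-10)
```

With these inputs E is about 0.01; the same HSP probability against a database 100 times larger would give E of about 1, which is why E-values depend on database size.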

Page 23: Blast bioinformatics

Bit Score

A bit score is another prominent statistical indicator used in addition to the E-value in a BLAST output.

The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score.

Page 24: Blast bioinformatics

Formula

The bit score (S) is determined by the following formula:

S = (λ × s − ln K) / ln 2

where

λ is the Gumbel distribution constant,

s is the raw alignment score, and

K is a constant associated with the scoring matrix used.

Thus, the bit score (S) is linearly related to the raw alignment score (s).

Hence, the higher the bit score, the more significant the match is.
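The bit score formula can be checked numerically with a short sketch. The values of λ and K below are illustrative assumptions, not taken from these slides, and the relation E = m × n × 2^(−S) comes from standard Karlin-Altschul statistics rather than from the formula given earlier in this deck.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2**(-S); assumes Karlin-Altschul statistics apply."""
    return m * n * 2 ** (-bits)


lam, k = 0.318, 0.134  # assumed constants, for illustration only
bits = bit_score(100, lam, k)
expected_hits = e_value_from_bits(bits, 1_000_000, 250)
```

Because S is linear in s, each extra raw score point adds λ / ln 2 bits, and each extra bit halves the expected number of chance hits.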
