Introduction to Bioinformatics - Tutorial no. 2 Global Alignment Local Alignment FASTA BLAST.

Date posted: 21-Dec-2015

Page 1: Introduction to Bioinformatics - Tutorial no. 2 Global Alignment Local Alignment FASTA BLAST.

Introduction to Bioinformatics - Tutorial no. 2

Global Alignment
Local Alignment
FASTA
BLAST

Page 2:

Sequence Alignment

Page 3:

DP (dynamic programming) – what does it mean?

The principle: reduce the number of paths that need to be examined.

If the best path from X→Z passes through Y, then the best path from X→Y is independent of the best path from Y→Z, so each part can be optimized separately.

Page 4:

Global vs. Local alignment

Dotplot showing identities between the short name (DOROTHYHODGKIN) and the full name (DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer.

S1 = DOROTHYHODGKIN

S2 = DOROTHYCROWFOOTHODGKIN
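
A dotplot of this kind can be sketched in a few lines of Python. The function name `dotplot` and the characters `*` and `.` are choices made for this illustration, not something the slides prescribe. The sketch marks single-letter identities only, with no window or threshold.

```python
def dotplot(s1, s2):
    """Return one text row per letter of s1; '*' marks an identity with s2."""
    return ["".join("*" if a == b else "." for b in s2) for a in s1]


S1 = "DOROTHYHODGKIN"
S2 = "DOROTHYCROWFOOTHODGKIN"

rows = dotplot(S1, S2)
print("  " + S2)                      # column labels
for letter, row in zip(S1, rows):
    print(letter + " " + row)         # row label followed by the dots
```

The two diagonal runs of `*` correspond to DOROTHY and HODGKIN; the offset between them reflects the inserted CROWFOOT.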

Page 5:

Global vs. Local alignment (cont’d)

Global alignment:

DOROTHY--------HODGKIN

DOROTHYCROWFOOTHODGKIN
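
For reference, a minimal Needleman-Wunsch sketch that reproduces this global alignment. It assumes the scoring this tutorial gives for local alignment "just as in global alignment" (match +1, mismatch -1, indel -2). The function name `global_align` is hypothetical, and the traceback prefers the diagonal move when several moves are possible, which is an arbitrary choice.

```python
MATCH, MISMATCH, INDEL = 1, -1, -2  # scoring assumed from this tutorial


def global_align(s, t):
    """Needleman-Wunsch sketch: return (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * INDEL
    for j in range(1, m + 1):
        F[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                          F[i - 1][j] + INDEL,     # gap in t
                          F[i][j - 1] + INDEL)     # gap in s
    # Trace back from the far corner to the origin.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = MATCH if i > 0 and j > 0 and s[i - 1] == t[j - 1] else MISMATCH
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + INDEL:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = global_align("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN")
print(score)
print(top)
print(bottom)
```

Under this scoring the 14 matches and 8 indels give a total of 14 - 16 = -2.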

Page 6:

Local Alignment

The problem: we want to find the substrings of s and t with the highest similarity.

Scoring system, just as in global alignment:
Match: +1
Mismatch: -1
Indel: -2

Page 7:

Local Alignment – cont’d

The differences:

1. We can start a new match instead of extending a previous alignment. This means that at each cell we can start calculating the score from 0 (even if this means ignoring the prefix). We do this only if it’s better than the alternative (that is, only if the alternative is negative).

2. Instead of looking only at the far corner, we look anywhere in the table for the best score (even if this means ignoring the suffix).
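
The two differences can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `local_table` is hypothetical, the scoring values are the ones given above, and the sequences TACTAA (columns) and TAATA (rows) are the ones used in the table walkthrough that follows.

```python
MATCH, MISMATCH, INDEL = 1, -1, -2


def local_table(s, t):
    """Fill the local-alignment table; rows follow t, columns follow s."""
    H = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]  # row 0 and column 0 stay 0
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            H[i][j] = max(0,                       # difference 1: start a new match
                          H[i - 1][j - 1] + sub,   # extend diagonally
                          H[i - 1][j] + INDEL,     # indel (move down)
                          H[i][j - 1] + INDEL)     # indel (move right)
    return H


H = local_table("TACTAA", "TAATA")
best = max(max(row) for row in H)   # difference 2: look anywhere in the table
for row in H:
    print(" ".join(f"{v:2d}" for v in row))
print("best score:", best)
```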

Page 8:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0
T 1
A 2
A 3
T 4
A 5

Page 9:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0
T 1
A 2
A 3
T 4
A 5

T
-

Page 10:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1
A 2
A 3
T 4
A 5

TACTAA
------

Page 11:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0
A 2  0
A 3  0
T 4  0
A 5  0

-----
TAATA

Page 12:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1
A 2  0
A 3  0
T 4  0
A 5  0

T
T

Page 13:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  ?
A 2  0
A 3  0
T 4  0
A 5  0

Candidates for the cell marked "?":

TA
T-     score: -1

TA-
--T    score: -2

TA
-T     score: -1

Start a new alignment: 0

Page 14:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1
A 2  0
A 3  0
T 4  0
A 5  0

TACT
---T

Page 15:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0
T 4  0
A 5  0

Page 16:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0  0  1  1  0  1  3
T 4  0
A 5  0

Page 17:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0  0  1  1  0  1  3
T 4  0  1  0  0  2  0  1
A 5  0

Page 18:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0  0  1  1  0  1  3
T 4  0  1  0  0  2  0  1
A 5  0  0  2  0  0  3  1

Page 19:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0  0  1  1  0  1  3
T 4  0  1  0  0  2  0  1
A 5  0  0  2  0  0  3  1

TACTAA
TAATA

Page 20:

        T  A  C  T  A  A
     0  1  2  3  4  5  6
  0  0  0  0  0  0  0  0
T 1  0  1  0  0  1  0  0
A 2  0  0  2  0  0  2  1
A 3  0  0  1  1  0  1  3
T 4  0  1  0  0  2  0  1
A 5  0  0  2  0  0  3  1

TACTAA
TAATA
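
To recover the aligned substrings, one can trace back from a best-scoring cell until a 0 is reached. The following is a hedged sketch under the same scoring (match +1, mismatch -1, indel -2). The helper names `fill` and `traceback` are hypothetical, and when several moves explain a cell's score the diagonal move is preferred, which is an arbitrary choice.

```python
MATCH, MISMATCH, INDEL = 1, -1, -2


def fill(s, t):
    """Local-alignment table; rows follow t, columns follow s."""
    H = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + INDEL, H[i][j - 1] + INDEL)
    return H


def traceback(H, s, t, i, j):
    """Walk back from cell (i, j) until a 0 cell; return the aligned substrings."""
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
        if H[i][j] == H[i - 1][j - 1] + sub:      # diagonal: pair two letters
            a.append(s[j - 1])
            b.append(t[i - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + INDEL:      # up: letter of t against a gap
            a.append("-")
            b.append(t[i - 1])
            i -= 1
        else:                                     # left: letter of s against a gap
            a.append(s[j - 1])
            b.append("-")
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


s, t = "TACTAA", "TAATA"
H = fill(s, t)
best = max(max(row) for row in H)
cells = [(i, j) for i, row in enumerate(H) for j, v in enumerate(row) if v == best]
for i, j in cells:
    print((i, j), traceback(H, s, t, i, j))
```

With these sequences the best score is 3 and it occurs in two cells, giving the local alignments TAA/TAA and TACTA/TAATA.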

Page 21:

How do you prefer it – right or fast?

Exact methods: the result is guaranteed to be (mathematically) optimal.
Needleman-Wunsch (global)
Smith-Waterman (local)

Heuristic methods: make some assumptions that hold most, but not all, of the time.
FASTA
BLAST

Still, a typical run takes minutes to complete.

Page 22:

FASTA

Page 23:

FASTA

http://www.ebi.ac.uk/fasta33/

Performs a local alignment of the input sequence against a complete database.
Finds the n subsequences with the best alignments.
Speed-up: doesn’t really look at all the sequences, just those that ‘look similar’ (details in the course Algorithms in Computational Biology).

Still, a typical run takes minutes to complete.

Page 24:

FASTA Variations (programs)

fasta3 – DNA sequence – DNA database, protein sequence – protein database

fastx/y3 – DNA sequence - protein database. DNA is translated in forward and reverse frames.

tfastx/y3 - protein sequence - translated DNA DB

… and more

Page 25:

Databases

Depend on the type chosen (nucleic acid / protein).
EMBL: all the nucleotide databases of the European Molecular Biology Laboratory.
Some are organism-type specific: FUNGI, INVERTEBRATES, PLANTS, MAMMALS, MOUSE, HUMAN.
Some are content-specific: ESTs, STSs.

Page 26:

More FASTA options

Gap penalties – different for opening gaps and for continuing them (residue = indel)

Scores and Alignments – how many (max) to retrieve?

KTUP – see the algorithm description in the lecture

DNA Strand

Matrix – for searches that involve proteins (next week)

Page 27:

E-values

The number of hits (with the same or a higher similarity score) one can "expect" to see just by chance when searching for the given string in a database of a particular size.

Higher E-value means lower similarity.

From the FASTA documentation:
“sequences with E-value of less than 0.01 are almost always found to be homologous”
“sequences with E-value between 1 and 10 frequently turn out to be related as well”

FASTA defaults for the upper limit:
10 for FASTA with protein searches
5 for translated DNA/protein comparisons
2 for DNA/DNA searches

The lower bound is normally 0 (we want to find the best).

Page 28:

BLAST

Page 29:

BLAST – Outline

Sequence Alignment
Complexity and indexing
BLASTN and BLASTP
Basic parameters
PAM and BLOSUM matrices
Affine gap model
E Values (once again)

Page 30:

Advanced BLAST

Databases
BLAST options
BLAST output
Taxonomic BLAST
Pairwise BLAST

Page 31:

BLAST Variations

Name     Query type          Database
blastn   Genomic             Genomic
blastp   Protein             Protein
blastx   Translated genomic  Protein
tblastn  Protein             Translated genomic
tblastx  Translated genomic  Translated genomic

Genomic translations test all 6 possibilities: 3 codon frames × 2 strands (forward and reverse complement).
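
The six translations can be illustrated with a short Python sketch. It assumes the standard genetic code and uppercase A/C/G/T input only; the function names and the frame labels (`+1` … `-3`) are choices made for this example, and real tools also handle ambiguity codes and alternative genetic codes.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the 3 forward and 3 reverse-complement translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```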

Page 32:

BLASTN Databases

nr      GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs    High-throughput genomic sequences (draft)
pat     Patented nucleotide sequences
mito    Mitochondrial sequences
vector  Vector subset of GenBank
month   GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom   Contigs and chromosomes from RefSeq

Page 33:

BLASTP Databases

nr         GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot  SWISS-PROT
pat        Patented protein sequences
pdb        Protein Data Bank
month      GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days

Page 34:

BLASTN/P Options (1)

Only search part of the database, using NCBI Entrez query format

Search a specific organism

Remove regions of low information content, e.g. short repeats or regions rich in only 2 nucleotides

Remove known human repeats (LINEs, SINEs)

Page 35:

BLASTN/P Options (2)

Threshold for significance of results

Use index based on words of 7, 11 or 15 nucleotides

Costs to open and extend gap, score for nucleotide match or mismatch. Allowed gap scores: 10/1, 10/2, 11/1, 8/2, 9/2
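
The idea of a word index can be sketched as follows. This is only an illustration of exact-word lookup with hypothetical function names and made-up sequences; it is not BLAST's actual implementation, which adds steps such as extending each hit into a longer alignment.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def seed_hits(query, index, word_size):
    """Return (query_pos, subject_pos) pairs that share an identical word."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for spos in index.get(query[qpos:qpos + word_size], []):
            hits.append((qpos, spos))
    return hits


subject = "GATTACAGATTACACCGGTTACGATTACA"   # made-up database sequence
query = "TTGATTACAGG"                       # made-up query
index = build_word_index(subject, word_size=7)
hits = seed_hits(query, index, word_size=7)
print(hits)
```

A larger word size gives fewer chance hits and a faster search, at the cost of missing matches that contain no exact word of that length.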

Page 36:

BLASTP Options

Scoring matrix: PAM, etc…

Search for a motif (PSI-BLAST)

Costs to open and extend gap

Page 37:

BLASTN/P Formatting (1)

Show colored bar chart

Number of sequences listed

Number of alignments shown

Other (less important) options on what to show

Page 38:

BLASTN/P Formatting (2)

How to display alignments

Only show results which match Entrez search or are from specific organism

Only show results with E values in this range

Page 39:

BLASTN Results

Query sequence representation

Matched areas of database sequences

Multiple matches on sequence

Page 40:

BLAST Output Header

Request ID for later retrieval

Query sequence details

Database details

Tax BLAST

Page 41:

BLAST Alignments (1)

Sequence identifier

Sequence description

Score and E value

Page 42:

BLAST Alignments (2)

Several alignments possible for one sequence match

Normalized score of alignment

Expected number of such hits (2e-11 = 2 × 10^-11)

Number of exact matches

Number of matches with positive score

Number of insertions / deletions

Page 43:

BLAST Alignments (3)

Query sequence

Exact match

Insertion / deletion

Matched sequence

Mismatch with positive score

Position within sequence

Masked low complexity region

Page 44:

Expectation Values

Increases linearly with length of query sequence

Increases linearly with length of database

Decreases exponentially with score of alignment
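
These three dependencies match the Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length and S the alignment score. The slide does not state this formula; it is given here as the standard expression, and the values of K and λ in the sketch below are placeholders chosen for illustration only (the real ones depend on the scoring matrix and gap costs).

```python
import math


def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); K and lam are illustrative placeholders."""
    return K * query_len * db_len * math.exp(-lam * score)


base = expect_value(score=50, query_len=200, db_len=1_000_000)
print(base)
print(expect_value(50, 400, 1_000_000) / base)   # doubling the query doubles E
print(expect_value(50, 200, 2_000_000) / base)   # doubling the database doubles E
print(expect_value(60, 200, 1_000_000) / base)   # 10 more score points: factor exp(-3)
```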

Page 45:

Tax BLAST

Lineage of organism with strongest hit

Score of organism’s strongest hit

Number of organism hits

Shared ancestry in taxonomic tree

Page 46:

BLAST2SEQ

Scoring scheme

Type of program

Gap model, Expect Value, advanced options

Sequences

Scoring matrix

Sequences

GO!