Introduction to Bioinformatics - Tutorial no. 2 Global Alignment Local Alignment FASTA BLAST.
Introduction to Bioinformatics - Tutorial no. 2
Global Alignment, Local Alignment
FASTA, BLAST
Sequence Alignment
DP (dynamic programming): what does it mean?
The principle reduces the number of paths that need to be examined:
If the best path from X→Z passes through Y, it is made up of the best path from X→Y and the best path from Y→Z, and each of the two can be found independently of the other.
Global vs. Local alignment
Dotplot showing identities between the short name (DOROTHYHODGKIN) and the full name (DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer.
S1 = DOROTHYHODGKIN
S2 = DOROTHYCROWFOOTHODGKIN
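The dotplot itself is not reproduced in this transcript. As a rough sketch of how such a plot can be built, the Python snippet below marks every position where a character of S1 equals a character of S2. The function name `dotplot` and the text rendering with `*` and `.` are choices made for this illustration only, not part of the tutorial.

```python
def dotplot(s1: str, s2: str) -> list[str]:
    """Return one text row per character of s2; '*' marks an identity with s1."""
    return ["".join("*" if a == b else "." for a in s1) for b in s2]


s1 = "DOROTHYHODGKIN"
s2 = "DOROTHYCROWFOOTHODGKIN"

# Print s1 along the top and s2 down the left-hand side.
print("  " + s1)
for label, row in zip(s2, dotplot(s1, s2)):
    print(label + " " + row)
```

In the output, the two shared segments (DOROTHY and HODGKIN) show up as two diagonal runs of `*`, shifted against each other by the length of the inserted CROWFOOT.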
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
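A minimal sketch of how the score of this global alignment can be computed with dynamic programming (Needleman-Wunsch). It assumes the scoring system this tutorial uses for alignment (match +1, mismatch -1, indel -2); the function name and the two-row table layout are choices made for this illustration.

```python
def global_alignment_score(s: str, t: str, match: int = 1,
                           mismatch: int = -1, indel: int = -2) -> int:
    """Needleman-Wunsch score of the best global alignment of s and t."""
    # prev[j] is the best score for aligning the current prefix of s with t[:j].
    # Aligning the empty prefix of s with t[:j] costs j indels.
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]  # s[:i] against the empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]  # the far corner of the table


print(global_alignment_score("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"))  # -2
```

The result, -2, corresponds to the alignment shown above: 14 matches (+14) and 8 indels (-16).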
Local Alignment
The problem: we want to find the substrings of s and t with the highest similarity.
Scoring system: just as in global alignment. Match: +1, Mismatch: -1, Indel: -2
Local Alignment (cont'd)
The differences:
1. We can start a new match instead of extending a previous alignment. This means that at each cell we can start calculating the score from 0 (even if this means ignoring the prefix). We do this only if it is better than the alternatives (that is, only if all the alternatives are negative).
2. Instead of looking only at the far corner, we look anywhere in the table for the best score (even if this means ignoring the suffix).
Worked example: s = TACTAA (columns), t = TAATA (rows)
Step 1. Row 0 and column 0 are filled with 0, since a local alignment may start anywhere.
Step 2. Row 1, column 1: T matches T, so the score is 0 + 1 = 1.
Step 3. Row 1, column 2: the candidates are
  diagonal (T aligned with A, a mismatch): 0 - 1 = -1
  from above (indel): 0 - 2 = -2
  from the left (indel): 1 - 2 = -1
  start a new alignment: 0
The best of these is 0.
Step 4. The remaining cells are filled in the same way, row by row:

         T  A  C  T  A  A
      0  1  2  3  4  5  6
   0  0  0  0  0  0  0  0
T  1  0  1  0  0  1  0  0
A  2  0  0  2  0  0  2  1
A  3  0  0  1  1  0  1  3
T  4  0  1  0  0  2  0  1
A  5  0  0  2  0  0  3  1

Step 5. The best score in the table is 3. It appears in row 3, column 6 (TAA of s aligned with TAA of t) and in row 5, column 5 (TACTA of s aligned with TAATA of t: four matches and one mismatch).
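The table-filling procedure for local alignment can be sketched in Python as follows. This is a minimal illustration of the Smith-Waterman recurrence with the scoring system stated above (match +1, mismatch -1, indel -2). The function name, the return format and the tie-breaking rule (the first best cell found when scanning row by row) are assumptions of this sketch.

```python
def smith_waterman(s: str, t: str, match: int = 1,
                   mismatch: int = -1, indel: int = -2):
    """Return (table, best score, aligned part of s, aligned part of t)."""
    rows, cols = len(t) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)

    def sub(i: int, j: int) -> int:
        return match if t[i - 1] == s[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Difference 1: a cell may restart from 0 instead of extending.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + indel,
                          H[i][j - 1] + indel)
            # Difference 2: the best score may be anywhere in the table.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            a.append(s[j - 1]); b.append(t[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + indel:
            a.append("-"); b.append(t[i - 1]); i -= 1
        else:
            a.append(s[j - 1]); b.append("-"); j -= 1
    return H, best, "".join(reversed(a)), "".join(reversed(b))


H, best, a, b = smith_waterman("TACTAA", "TAATA")
for row in H:
    print(" ".join(f"{v:2d}" for v in row))
print(best, a, b)  # 3 TAA TAA
```

For s = TACTAA and t = TAATA the best score is 3. Because of the tie-breaking rule, the sketch reports the alignment of TAA with TAA; a second alignment with the same score (TACTA with TAATA) ends at a later cell.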
How do you prefer it: right or fast?
Exact methods: the result is guaranteed to be (mathematically) optimal
  Needleman-Wunsch (global)
  Smith-Waterman (local)
Heuristic methods: make some assumptions that hold most, but not all, of the time
  FASTA
  BLAST
Still, a typical run takes minutes to complete.
FASTA
http://www.ebi.ac.uk/fasta33/
Performs a local alignment of the input sequence against a complete database.
Finds the n subsequences with the best alignments.
Speed-up: it does not really look at all the sequences, just at those that 'look similar' (details in the course Algorithms in Computational Biology).
Still, a typical run takes minutes to complete.
FASTA Variations (programs)
fasta3: DNA sequence vs. DNA database, or protein sequence vs. protein database
fastx3 / fasty3: DNA sequence vs. protein database. The DNA is translated in forward and reverse frames.
tfastx3 / tfasty3: protein sequence vs. translated DNA database
... and more
Databases
Depend on the type chosen (nucleic acid / protein)
EMBL: all the nucleotide databases of the European Molecular Biology Laboratory
Some are organism-type specific: FUNGI, INVERTEBRATES, PLANTS, MAMMALS, MOUSE, HUMAN
Some are content-specific: ESTs, STSs
More FASTA options
Gap penalties: different for opening gaps and for continuing them (residue = indel)
Scores and Alignments: how many (max) to retrieve?
KTUP: see the algorithm description in the lecture
DNA Strand
Matrix: for searches that involve proteins (next week)
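To illustrate what "different for opening gaps and for continuing them" means, here is a small sketch that compares a linear gap cost with an affine one. The convention used (pay the opening penalty once, plus the extension penalty for every residue in the gap) is an assumption of this sketch; some programs charge the extension penalty only from the second residue on. The parameter values are placeholders, not FASTA defaults.

```python
def linear_gap_cost(length: int, indel: int = 2) -> int:
    """Every gapped residue costs the same amount."""
    return indel * length


def affine_gap_cost(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Assumed convention: gap_open once per gap, gap_extend per gapped residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of 8 residues versus eight separate gaps of 1 residue each.
print(affine_gap_cost(8))      # 18
print(8 * affine_gap_cost(1))  # 88
print(linear_gap_cost(8), 8 * linear_gap_cost(1))  # 16 16
```

Under the affine model a single long gap is much cheaper than many short ones, whereas the linear model cannot tell the two cases apart.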
E-values
The number of hits (with the same similarity score) one can "expect" to see just by chance when searching for the given string in a database of a particular size.
The higher the E-value, the lower the similarity.
From the FASTA documentation:
  “sequences with E-value of less than 0.01 are almost always found to be homologous”
  “sequences with E-value between 1 and 10 frequently turn out to be related as well”
FASTA defaults for the upper limit:
  10 for FASTA with protein searches
  5 for translated DNA/protein comparisons
  2 for DNA/DNA searches
The lower bound is normally 0 (we want to find the best).
BLAST
BLAST – Outline
Sequence Alignment
  Complexity and indexing
BLASTN and BLASTP
  Basic parameters
  PAM and BLOSUM matrices
  Affine gap model
  E Values (once again)
Advanced BLAST
  Databases
  BLAST options
  BLAST output
  Taxonomic BLAST
  Pairwise BLAST
BLAST Variations
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
Genomic translations test all 6 possibilities: 3 codon frames x 2 strands (forward and reverse complement)
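The six translation possibilities can be sketched in Python as follows. This is only an illustration of the idea, not how BLAST implements it. It uses the standard genetic code, writes stop codons as `*`, and writes codons containing anything other than A, C, G or T as `X`. The function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Three codon frames on the forward strand, three on the reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

For the toy input ATGGCCTAA, frame +1 reads ATG GCC TAA, that is M, A and a stop codon.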
BLASTN Databases
nr      GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs    High-throughput genomic sequences (draft)
pat     Patented nucleotide sequences
mito    Mitochondrial sequences
vector  Vector subset of GenBank
month   GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom   Contigs and chromosomes from RefSeq
BLASTP Databases
nr         GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot  SWISS-PROT
pat        Patented protein sequences
pdb        Protein Data Bank
month      GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
BLASTN/P Options (1)
Only search part of the database, using NCBI Entrez query format
Search a specific organism
Remove low information content, e.g. short repeats or regions rich in only 2 nucleotides
Remove known human repeats (LINEs, SINEs)
BLASTN/P Options (2)
Threshold for the significance of results
Use an index based on words of 7, 11 or 15 nucleotides
Costs to open and extend a gap, score for a nucleotide match or mismatch
Allowed gap scores: 10/1, 10/2, 11/1, 8/2, 9/2
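A rough sketch of what "an index based on words" means: every word of a fixed size in the database sequence is recorded together with its positions, so that words from the query can be looked up directly instead of scanning the whole database. This illustrates exact-word seeding only; it is not BLAST's actual implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence: str, word_size: int = 11) -> dict[str, list[int]]:
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return dict(index)


def find_seeds(query: str, index: dict[str, list[int]],
               word_size: int = 11) -> list[tuple[int, int]]:
    """Return (query position, database position) pairs of identical words."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for dbpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos, dbpos))
    return seeds


# Tiny example with a word size of 3 so that the result is easy to check by hand.
index = build_word_index("TACTAA", word_size=3)
print(find_seeds("TAATA", index, word_size=3))  # [(0, 3)]
```

A smaller word size finds more seeds (more sensitive, slower); a larger one finds fewer (faster, less sensitive).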
BLASTP Options
Scoring matrix: PAM, etc…
Search for a motif (PHI-BLAST)
Costs to open and extend gap
BLASTN/P Formatting (1)
Show colored bar chart
Number of sequences listed
Number of alignments shown
Other (less important) options on what to show
BLASTN/P Formatting (2)
How to display alignments
Only show results which match Entrez search or are from specific organism
Only show results with E values in this range
BLASTN Results
Query sequence representation
Matched areas of database sequences
Multiple matches on sequence
BLAST Output Header
Request ID for later retrieval
Query sequence details
Database details
Tax BLAST
BLAST Alignments (1)
Sequence Identifier
Sequence description
Score and E value
BLAST Alignments (2)
Several alignments possible for one sequence match
Normalized score of alignment
Expected number of such hits (2e-11 = 2 × 10^-11)
Number of exact matches
Number of matches with positive score
Number of insertions / deletions
BLAST Alignments (3)
Query sequence
Exact match
Insertion / deletion
Matched sequence
Mismatch with positive score
Position within sequence
Masked low complexity region
Expectation Values
Increases linearly with the length of the query sequence
Increases linearly with the length of the database
Decreases exponentially with the score of the alignment
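These three properties are captured by the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the alignment score. The sketch below only demonstrates the behaviour; the default values chosen for K and lambda are placeholders for illustration, since the real values depend on the scoring system and are estimated by the search program.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 K: float = 0.1, lam: float = 0.3) -> float:
    """E = K * m * n * exp(-lambda * S); K and lam are placeholder values."""
    return K * query_len * db_len * math.exp(-lam * score)


base = expect_value(score=50, query_len=200, db_len=1_000_000)
print(expect_value(50, 400, 1_000_000) / base)  # doubling the query doubles E
print(expect_value(50, 200, 2_000_000) / base)  # doubling the database doubles E
print(expect_value(60, 200, 1_000_000) / base)  # 10 more score points: factor exp(-3)
```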
Tax BLAST
Lineage of organism with strongest hit
Score of organism’s strongest hit
Number of organism hits
Shared ancestry in taxonomic tree
BLAST2SEQ
Scoring scheme
Type of program
Gap model, Expect value, Advanced options
Sequences
Scoring matrix
Sequences
GO!