BITS: Basics of Sequence similarity
Basic bioinformatics concepts, databases and tools
Module 2
Searching for similar sequences
Joachim Jacob
http://www.bits.vib.be
Updated February 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf
Based on annotations, we can use text searching to get sequences of interest (module 1)
WHERE
– Primary dbs
– Derived dbs
HOW to find sequences
by keywords
by literature
by annotation
See BITS website - module 1
In this module, we will look into sequence similarity to get and analyze sequences
Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics
Why would we like to detect similar sequences?
1. Searching in sequence databases for similar sequences
2. From a high-throughput experiment, every read needs to be 'aligned' to a genomic reference sequence or to each other (assembly)
3. Elucidation of functionality by detecting sites of conservation (sequence parts that resemble each other more than would be expected)
4. Phylogeny is built upon comparison of multiple sequences
How to search for sequence similarity
- in sequence databases, via BLAST or FASTA
- to compare two sequences in detail: pairwise sequence comparison
- to compare multiple sequences at once: multiple sequence alignment, de novo assembly
Methods can be categorized into:
Optimal/exhaustive – heuristic – Graphical
Comparing sequences can be classified into 'one to one', 'one to many', or 'many to many'
One to many
One to one
many to many
Conceptualizing the source of sequence similarity
Sequences can be similar because ...
they are derived from evolutionary related organisms
they evolve in similar conditions: convergent evolution
So the first question we have to solve: what is similar? How do we measure similarity?
The source of sequence similarity
Summary of the changes that really occurred
Let's assume this toy example: a short sequence mutating over time (without insertions or deletions occurring).
So taking the most divergent sequences (the first and the last), the only correct alignment for those two, given their history, is:
KLRMWILVATAEIDD
KPRMCILVAIADIRD
But we usually don't have all intermediate sequences: only the first and the last. How to determine what is the correct alignment?
In addition, multiple changes can have happened at one location over time
Many possibilities exist to align them: slide the sequences over each other. One of those positions will have the highest number of identical residues, called matches (green in the slide).
In this example, we claim 'we have a match' if we see an identical residue at that position in both sequences.
The identity matrix summarizes this scoring system, listing all residue combinations in a table
Residue  A  C  Y  W
A        1  0  0  0
C        0  1  0  0
Y        0  0  1  0
W        0  0  0  1
1 = match (identical residues), 0 = mismatch
Substitution or scoring matrices provide a means to determine similarity in an objective way
Such matrices are called substitution or scoring matrices. They are used to calculate a score for every possible pair of aligned amino acids in aligned sequences, in order to have a measure for sequence similarity.
Shifted alignments of the two sequences give only a few matches (the slide examples show sums of scores of 1 and 2). The unshifted alignment scores highest:
KLRMWILVATAEIDD
KPRMCILVAIADIRD
Score: 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1
Sum of the scores: 10
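The sliding and counting described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names identity_score and best_offset are made up for this example, and the score is the plain identity score (1 for a match, 0 for a mismatch).

```python
def identity_score(seq_a, seq_b, offset=0):
    """Count identical residues when seq_b is shifted `offset` positions
    to the right relative to seq_a (a negative offset shifts it left)."""
    score = 0
    for i, residue in enumerate(seq_b):
        j = i + offset  # position in seq_a that faces residue i of seq_b
        if 0 <= j < len(seq_a) and seq_a[j] == residue:
            score += 1
    return score


def best_offset(seq_a, seq_b):
    """Try every shift with at least one overlapping position and
    return the (offset, score) pair with the highest identity score."""
    offsets = range(-(len(seq_b) - 1), len(seq_a))
    scored = ((off, identity_score(seq_a, seq_b, off)) for off in offsets)
    return max(scored, key=lambda pair: pair[1])


first = "KLRMWILVATAEIDD"
last = "KPRMCILVAIADIRD"
offset, score = best_offset(first, last)
print(f"best shift: {offset}, identical residues: {score}")
```

Trying every shift is cheap for two short sequences, but it ignores gaps and treats all mismatches as equally bad, which is what the substitution matrices and gap penalties in the following slides address.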
More complex substitution matrices are more meaningful and more sensitive for detecting similarity
The two most popular are PAM and BLOSUM. Every pair of aligned residues gets a score, based on the matrix. E.g. an A-A alignment gets score 2 (PAM120) or 4 (BLOSUM62). An F-G gets -5 or -3.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html
Likely changes: positive score - unlikely changes: negative score
Substitution matrices are derived from analysis of multiple alignments of related sequences
PAM (Point Accepted Mutations) by Margaret Dayhoff: global alignments of proteins with >85% identity --> phylogenetic trees --> count substitutions --> estimate probability of conservation/substitution at a distance of 1 mutation per 100 aa ==> PAM1 table
PAMn tables are obtained by matrix multiplication (the PAM1 matrix raised to the n-th power)
BLOSUM (BLOCKS Substitution Matrices) by Henikoff and Henikoff: BLOCKS (= local multiple sequence alignments without gaps) databank made from protein families from the PROSITE databank --> BLOSUMn table derived from blocks in which sequences sharing at least n% identical aa are clustered together
http://en.wikipedia.org/wiki/Substitution_matrix
    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1  -2
W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2  -3
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7  -2
Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
The BLOSUM62 similarity matrix
ftp://ftp.ncbi.nih.gov/blast/matrices/
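As a small illustration of how such a matrix is used, the sketch below scores the ungapped toy alignment from earlier in this module with values copied from the BLOSUM62 table shown in this section. The names BLOSUM62_EXCERPT, pair_score and alignment_score are invented for this example, and only the residue pairs that occur in the toy alignment are listed; a real program would load the full table.

```python
# Excerpt of the BLOSUM62 table from this section. Only the pairs needed
# for the toy alignment are included (an assumption to keep this short).
BLOSUM62_EXCERPT = {
    ("K", "K"): 5, ("R", "R"): 5, ("M", "M"): 5, ("I", "I"): 4,
    ("L", "L"): 4, ("V", "V"): 4, ("A", "A"): 4, ("D", "D"): 6,
    ("L", "P"): -3, ("W", "C"): -2, ("T", "I"): -1, ("E", "D"): 2,
    ("D", "R"): -2,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up the score of one aligned residue pair.
    The matrix is symmetric, so (a, b) and (b, a) score the same."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq_a, seq_b, matrix=BLOSUM62_EXCERPT):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(alignment_score("KLRMWILVATAEIDD", "KPRMCILVAIADIRD"))
```

Compared with the identity matrix, a conservative change such as E to D still contributes a positive score, while an unlikely change such as L to P is penalised.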
The substitution matrices capture the similarity in properties between residues
From Livingstone, C. D. and Barton, G. J. (1993),"Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation", Comp. Appl. Bio. Sci., 9, 745-756.
A matrix does not capture insertions and deletions: penalties are given to deal with them
When two sequences align, the relation between aligned residues can only be seen as one of the following:
identity
mismatch (substitution (DNA) or similarity level (protein))
gap (insertion/deletion)
The two parts of the gap penalty: a higher penalty for creation, one lower for its extension
Score from substitution matrix
Gap penalty
Substitution matrices are used in many algorithms to detect sequence similarity
Pairwise sequence comparison – one to one
To create an alignment between two sequences
- Manually (?)
- Two sequences (= pairwise alignment): optimal alignment through 'dynamic programming'
– Needleman-Wunsch (global alignment)
– Smith-Waterman (local alignment)
Dynamic programming uses a gap penalty and a scoring scheme to align two sequences
Dynamic programming: two things needed
Scoring scheme to measure identity and similarity
• choose a scoring matrix for similarity and identity (e.g. PAM250)
Gap penalty
• For each gap, a penalty is applied to the final score, also called weight or cost
most used: a + b * (n-1) for a gap of n positions
a: gap opening penalty, the higher penalty (negative score)
b: gap extension penalty, a smaller penalty to widen a gap
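The formula a + b * (n-1) can be written out as a tiny function. This is a minimal sketch: the function name affine_gap_penalty and the values 10 and 1 used for a and b are illustrative assumptions, not the defaults of any particular program.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty for one gap of n positions: a + b * (n - 1).
    gap_open is a (charged once), gap_extend is b (charged per extra position)."""
    if n < 1:
        raise ValueError("a gap has at least one position")
    return gap_open + gap_extend * (n - 1)


# Illustrative values only: opening a gap costs 10, each extension costs 1.
for length in (1, 2, 5):
    print(length, affine_gap_penalty(length, gap_open=10, gap_extend=1))
```

With these values one gap of five positions is penalised far less than five separate gaps of one position, which reflects the idea that a single insertion or deletion event can span several residues.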
http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html
Dynamic programming and backtracking
Example from the slide: two short sequences (source and target) are aligned by filling in the score matrix cell by cell and then backtracking from the last cell, which gives the alignment
A T T -
- T T C
Each cell of the matrix takes the best of the three possible moves:
$S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i,b_j),\; S_{i,j-1} + s(-,b_j),\; S_{i-1,j} + s(a_i,-) \,\}$
scoring scheme:
• s(ai,bj) = +1 if ai = bj
• s(ai,bj) = -1 if ai ≠ bj
• s(ai,-) = -1
• s(-,bj) = -1
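The matrix filling and backtracking can be sketched as follows. This is a minimal global alignment under the simple scoring scheme of the slide (+1 match, -1 mismatch, -1 per gap position, no separate gap opening penalty); the function name needleman_wunsch and the example sequences ATT and TTC are assumptions made for this illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        S[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        S[0][j] = j * gap

    def s(x, y):
        return match if x == y else mismatch

    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Backtrack from the last cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ATT", "TTC")
print(score)
print(top)
print(bottom)
```

When several alignments share the best score, the order of the checks during backtracking decides which one is reported; here the diagonal move is preferred.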
Two approaches to align pairwise: align globally versus locally
Global alignment of A and B: the Needleman-Wunsch algorithm considers similarity across the full extent of the sequences
Local alignment of A and B: the Smith-Waterman algorithm focuses on regions of similarity in parts of the sequences
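The difference between global and local alignment shows up as a small change in the dynamic programming recurrence: in a local alignment a cell is never allowed to drop below zero, and the best score is taken from anywhere in the matrix. The sketch below only computes the best local score; the function name smith_waterman_score and the +1/-1/-1 scoring are assumptions for this illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty).
    Only two matrix rows are kept because no traceback is done."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment start afresh at any cell.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

Flanking regions that do not match no longer pull the score down, so a short shared region inside two otherwise unrelated sequences can still be found.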
Software for creating an optimal pairwise alignment
Best global alignment (Needleman-Wunsch)
- EMBOSS needle (webinterface here, here on Mobyle)
- EMBOSS stretcher (with Myers-Miller optimization, for very long sequences – webinterface on Mobyle)
Best local alignment (Smith-Waterman)
- EMBOSS water
- SIM (Huang and Miller, with optimization for very long sequences, can also find non-overlapping suboptimal alignments) (link)
- EMBOSS matcher (idem as SIM)
- modified version of SIM (by Laurent Duret) with output for graphical viewer
http://mobyle.pasteur.fr/cgi-bin/portal.py#welcome
Parameters that are set
A graphical method: dot plots can be made to rapidly identify regions with similar sequence
The parameters of a dotplot (which uses the identity matrix) are the word size (e.g. per 3 residues) and the threshold (the percentage of residues in a word that must be identical). This is very convenient for large molecules, e.g. chromosomes
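A text-only dot plot with these two parameters could look like the sketch below. It is an illustration only: the function names dotplot and render are invented here, the threshold is given as a number of identical residues per word rather than a percentage, and real tools use faster methods than comparing every pair of words.

```python
def dotplot(seq_a, seq_b, word_size=3, threshold=3):
    """Return the set of (i, j) positions where the word starting at i in
    seq_a and the word starting at j in seq_b share at least `threshold`
    identical residues."""
    dots = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in range(len(seq_b) - word_size + 1):
            identities = sum(seq_a[i + k] == seq_b[j + k]
                             for k in range(word_size))
            if identities >= threshold:
                dots.add((i, j))
    return dots


def render(dots, n_cols, n_rows):
    """Draw the dots as text: columns follow seq_a, rows follow seq_b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(n_cols))
        for j in range(n_rows)
    )


a = "KLRMWILVATAEIDD"
b = "KPRMCILVAIADIRD"
dots = dotplot(a, b, word_size=3, threshold=2)
print(render(dots, len(a) - 2, len(b) - 2))
```

A larger word size or a stricter threshold removes background noise, at the risk of hiding weakly conserved regions.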
Software for making dotplots
EMBOSS contains the following programs
– dottup: word comparison
– dotmatcher: window/threshold comparison
– polydot: word comparison, makes n*n dotplots in one graph
Dotter – developed by Erik Sonnhammer and Richard Durbin (U. Stockholm, Sweden)
Dotlet – (Java applet) at the Swiss Institute of Bioinformatics
Gepard – (Munich Information Center for Protein Sequences, Germany): with heuristic for speeding up computation, for comparing very long sequences
...
http://www.bits.vib.be/wiki/index.php/Dotplot
Dot plots generate typical patterns which can be interpreted
http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76
Typical patterns shown for Sequence A versus Sequence B: simple repeat, insertion in sequence B, insertion in sequence A, complex repeat, palindrome
Multiple sequence alignment
Multiple sequence alignment is not simply expanding pairwise sequence alignments (many to many)
One could try dynamic programming, but this is too time-consuming: aligning 20 sequences would already need more time than the universe has existed...
Heuristic methods lead the way: "progressive alignment" most used
1. Use pairwise dynamic programming for all pairs of sequences
2. Guide tree is constructed based on scores
3. Two sequences are aligned, and sequentially every sequence is added following the guide tree (progressive clustering)
N(N-1)/2 pairwise sequence alignments --> similarity matrix --> guide tree (A, B, C, D) by progressive clustering --> progressive alignment ("once a gap, always a gap") --> multiple sequence alignment
similarity matrix
    A    B    C
B   142
C   95   101
D   60   62   55
Progressive clustering is a two-step process: measuring distance and constructing alignment
Take mean of AB
Take mean of ABC
STEP 1: Measure similarity
STEP 2: construct MSA
Progressive clustering: once a gap, always a gap
The guide tree is NOT a phylogenetic tree !
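The clustering step can be illustrated with the similarity matrix from the slide. The sketch below repeatedly joins the two most similar clusters and uses the mean of the pairwise similarities between their members (average linkage); this choice of linkage and the names SIMILARITY and guide_tree are assumptions for this example, not a description of what a specific aligner does.

```python
from itertools import combinations

# Pairwise similarity scores taken from the similarity matrix on the slide.
SIMILARITY = {
    frozenset("AB"): 142, frozenset("AC"): 95, frozenset("BC"): 101,
    frozenset("AD"): 60, frozenset("BD"): 62, frozenset("CD"): 55,
}


def guide_tree(names, sim):
    """Build a guide tree as nested tuples by joining, at every step,
    the two clusters with the highest mean pairwise similarity."""
    # Each cluster is (tree, leaves): the nested tuple and its member names.
    clusters = [(name, (name,)) for name in names]

    def mean_similarity(c1, c2):
        values = [sim[frozenset((x, y))] for x in c1[1] for y in c2[1]]
        return sum(values) / len(values)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: mean_similarity(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


print(guide_tree("ABCD", SIMILARITY))
```

The nested tuples give the order in which a progressive aligner would add sequences: A and B first, then C, then D. As the slide stresses, this tree only steers the alignment and should not be read as a phylogeny.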
The progressive alignment framework can be extended to make it faster and more sensitive
More sensitive:
- consistency: per position scoring scheme (T-Coffee)
- structural guidance: based on structural info alignment is guided (Expresso)
Faster:
- distance measured by analysing k-tuples (see later) instead of pairwise aligning (Clustal Omega)
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030123
Different formats of aligned sequences exist
http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment#Formats
Clustal format
h1_pea    -MATEEPIVAVETVPEPIVTEPTTITEPEVPEKEEPKAEVEKTKKAKGSKPKKASKPRNP
h1_sollc  -MATEEPVIVNEVVEEQAA--PETVKDEANPPAKSGKAKKETKAKKPAAPRKRSATP---
h11_volca MSETEAAPVVAPAAEAAPAAEAPKAKAPKAKAPKQPKAPKAPKEPKAPKEKKPKAAP---

h1_pea    ASHPTYEEMIKDAIVSLKEKNGSSQYAIAKFIEEKQ-KQLP-ANFKKLLLQNLKKNVASG
h1_sollc  -THPPYFEMIKDAIVTLKERTGSSQHAITKFIEEKQ-KSLP-SNFKKLLLTQLKKFVASE
h11_volca -THPPYIEMVKDAITTLKERNGSSLPALKKFIENKYGKDIHDKNFAKTLSQVVKTFVKGG

Phylip format
 3 298
h1_pea    -MATEEPIVA VETVPEPIVT EPTTITEPEV PEKEEPKAEV EKTKKAKGSK
h1_sollc  -MATEEPVIV NEVVEEQAA- -PETVKDEAN PPAKSGKAKK ETKAKKPAAP
h11_volca MSETEAAPVV APAAEAAPAA EAPKAKAPKA KAPKQPKAPK APKEPKAPKE

          PKKASKPRNP ASHPTYEEMI KDAIVSLKEK NGSSQYAIAK FIEEKQ-KQL
          RKRSATP--- -THPPYFEMI KDAIVTLKER TGSSQHAITK FIEEKQ-KSL
          KKPKAAP--- -THPPYIEMV KDAITTLKER NGSSLPALKK FIENKYGKDI
Software that implements these algorithms, and editors to manually adjust the alignments
Alignment editors
- SeaView
- SeqPup
- GeneDoc
- Jalview
- BioEdit
- CLC Sequence Viewer
- UGene
http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment
Additional references
Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency]
Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm]
Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm]
Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.
Searching sequence databases is done through a little trick
Problem
find me all similar sequences to a query sequence in a database.
(Find me the position of many short reads in a genome)
Bottleneck:
we cannot compute an optimal alignment against every sequence in the database and then determine which is best (~MSA). This is too time-consuming and only practicable on special computers (parallel computers or computer clusters)
"Heuristic" algorithm : gain of speed at the expense of some loss in sensitivity
• BLAST (developed by S. Altschul et al. at NCBI)
• fastA (developed by R. Pearson at U. of Virginia)
BLAST quickly finds similar sequences by giving up some sensitivity
Algorithm (= steps to follow to reach your goal)
http://www.ncbi.nlm.nih.gov/books/NBK21097/
BLAST step 1 : neighbouring
BLAST step 2 : searching the little words in the db
BLAST step 3 : extend where the words match
Only if words match >Sg score
Proteins: only extension if another hit <40
Proteins: optimal composition adapted
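The idea of looking up short words and extending around the hits can be sketched as below. This is a strongly simplified illustration and not the BLAST algorithm itself: it only uses exact word matches (no neighbouring words, no score thresholds, no gapped extension), and the names build_word_index and seed_and_extend are invented for this example.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every word of length k in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def seed_and_extend(query, subject, k=3):
    """Find exact word hits and extend each one without gaps in both
    directions for as long as the residues keep matching.
    Returns (query_start, subject_start, length) tuples, longest first."""
    index = build_word_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            q_start, s_start = q, s
            while q_start > 0 and s_start > 0 and \
                    query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            q_end, s_end = q + k, s + k
            while q_end < len(query) and s_end < len(subject) and \
                    query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    return sorted(hits, key=lambda hit: -hit[2])


print(seed_and_extend("GGACGTACC", "TTACGTATT", k=3))
```

The speed gain comes from the index: only positions that share a word with the query are examined, at the price of missing similarities that contain no matching word at all.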
Each BLAST search hit has an E-value, which is how many hits we expect by chance
$E = K \, m \, n \, e^{-\lambda S}$
query sequence length m, total databank length n
K and λ: parameters obtained by simulation (search of a random sequence against a random databank)
Expect value: number of unrelated databank sequences expected to yield the same or a higher score S by pure chance (extreme value distribution)
http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
BLAST statistics
bit score: score corrected for the scale of the scoring scheme
$S' = \frac{\lambda S - \ln K}{\ln 2}$
$P = 1 - e^{-E}$
Probability P that the databank yields by pure chance at least one alignment with the same or a higher score
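The three formulas can be tried out numerically with the sketch below. The values used for K, λ, the raw score and the lengths are placeholders chosen for illustration; in a real search K and λ depend on the scoring matrix and gap penalties, and BLAST reports them in its output.

```python
import math


def e_value(m, n, raw_score, K, lam):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


# Placeholder values, for illustration only.
K, lam = 0.041, 0.267
m, n, S = 250, 50_000_000, 120

E = e_value(m, n, S, K, lam)
print(f"bit score = {bit_score(S, K, lam):.1f}")
print(f"E-value   = {E:.3g}")
print(f"P-value   = {p_value(E):.3g}")
```

Substituting the bit score into the first formula gives E = m n 2^(-S'), which is why bit scores from searches with different scoring schemes can be compared directly. For small E, P and E are almost equal.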
Interpreting the BLAST results by E-value and bit score
E-value: the lower the better (= number of hits with such a similarity expected by chance with a random sequence and a database of the same size) (e.g. 0.1 means that in about 1 in 10 searches, this similarity could have arisen by chance alone)
Max/Total score: bit score – the higher the better (= score constructed from length of total alignment of the high scoring pair)
Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version
Different flavours of BLAST
Depending on query sequence: DNA or protein
and database: DNA or protein
Flavour: query - database
blastn: DNA - DNA
blastp: protein - protein
blastx: translated DNA - protein
tblastn: protein - translated DNA
tblastx: translated DNA - translated DNA
You can adjust a few parameters of the BLAST algorithm
SEG filter for proteins
DUST filter for nucleic acids
E-value threshold for searching, rule of thumb: below 1e-05 = good; between 1e-05 and 1e-01 = weak similarity; between 1e-01 and 10 = take a good look
Lower word size = sensitivity up (but the search gets slower)
A lot of power lies within choosing the right database for the BLAST search.
The choice of the database
The "nr/nt" database is the largest nucleotide database available through NCBI BLAST; select the "nr/nt" database for this exercise. It includes all GenBank, RefSeq Nucleotides, EMBL (European nucleotide database), DDBJ (Japanese nucleotide database) and PDB (Protein Data Bank) sequences, but no EST, STS, GSS, or phase 0, 1 or 2 htgs (unfinished high throughput genomic) sequences. The NCBI nr database originally got its name from the phrase "nonredundant" nucleotide database, but there is no longer any claim to nonredundancy in the sequence set.
Nearly every sequence database comes with BLAST services nowadays
Numerous online websites, mostly running WU-BLAST or NCBI BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ebi.ac.uk/Tools/sss/
But very easy to install on own computer ('run locally')
1. Download blast programs ( here )
2. Format your 'database' (multifasta file)
3. Run BLAST
You can also choose to use NCBI Blast online outside of the browser by using netblast (instructions here)
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/
Some adjustments to the BLAST protocol exist for particular purposes
Identifying very distantly related proteins
PSI-BLAST (position specific iterated) (see module 3)
BLAST protein with matching of a pattern
PHI-BLAST (pattern hit initiated) (see module 3)
BLAST highly similar nucleotide sequences
Mega-BLAST
LastZ explanation – have a look at the dotplots here
BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph
The tool to do this is called BLAST2SEQ: e.g. comparing chrI with ChrVIII of S. cerevisiae
insertions!
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&SHOW_DEFAULTS=on&BLAST_SPEC=blast2seq&LINK_LOC=align2seq
BLAT is derived from BLAST, and is used for searching very similar sequences in a genome
BLAT = BLAST like alignment tool
- database is a genome sequence
- the database is not read from files, but is kept in memory as an index of words of size 11
- it can only be used for very similar sequences, e.g. if you have a fragment for which you want to know the position in the genome.
The technique of indexing 'words' is also used in some short read aligners
In BLAST for nucleotides: k = 11 (11-mers)
11111111111 → 11 consecutive matches
However, non-consecutive matches improve sensitivity: a spaced seed.
111010010100110111 → 55% more sensitive
• 1 means a match, a 0 means a don't care position
– Key size: number of 1's
– Key width: total number of 0's and 1's
• The 'keys' are used to index the genome or the reads, depending on the aligner
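How a spaced seed turns a stretch of sequence into a key can be shown with the pattern from this slide. The sketch is an illustration only: the names SEED, spaced_key and index_keys are invented here, and real aligners store the keys in compact hash tables rather than a Python dictionary.

```python
SEED = "111010010100110111"  # pattern from the slide: 1 = match, 0 = don't care


def spaced_key(sequence, start, seed=SEED):
    """Build the key for the window starting at `start`: keep only the
    bases that sit under a 1 in the seed pattern."""
    window = sequence[start:start + len(seed)]
    if len(window) < len(seed):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


def index_keys(sequence, seed=SEED):
    """Index every window of the sequence by its spaced key."""
    index = {}
    for start in range(len(sequence) - len(seed) + 1):
        index.setdefault(spaced_key(sequence, start, seed), []).append(start)
    return index


reference = "ACGTACGTACGTACGTAC"
read = "ACGAACGTACGTACGTAC"  # differs from the reference at a don't care position
print("key size:", SEED.count("1"), "key width:", len(SEED))
print(spaced_key(reference, 0) == spaced_key(read, 0))
```

Because the mismatch falls under a 0 of the pattern, both windows produce the same key, so the read would still be found in the index even though an 11-mer of consecutive matches covering that position would fail.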
doi: 10.1093/bib/bbq015 on http://dx.doi.org
FastA is another popular sequence database search algorithm
Find runs of identities
Rescore using PAM matrix and keep top scoring segments
Apply 'joining threshold' to eliminate segments that are unlikely to be part of the alignment that includes the highest scoring alignment
Use dynamic programming to optimize the alignment in a narrow band that encompasses the top scoring segments
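The first step, finding runs of identities, is commonly explained by counting word hits per diagonal: hits that lie on the same diagonal (same difference between subject and query position) belong to the same ungapped run. The sketch below shows only that counting step; the function name diagonal_hits and the word length used are assumptions, and the rescoring, joining and banded dynamic programming steps are not implemented.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count exact word matches of length ktup per diagonal,
    where the diagonal is subject position minus query position."""
    positions = defaultdict(list)
    for s in range(len(subject) - ktup + 1):
        positions[subject[s:s + ktup]].append(s)
    diagonals = Counter()
    for q in range(len(query) - ktup + 1):
        for s in positions.get(query[q:q + ktup], []):
            diagonals[s - q] += 1
    return diagonals


hits = diagonal_hits("ACGTACGT", "TTACGTACGT", ktup=4)
print(hits.most_common())
```

The diagonal with the most hits marks the region where the later, more expensive steps are worth applying.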
FastA is accessible on the website of EBI
Further explanation of algorithm: here
Accessibility
• EBI (help link)
FastA developers: link
The interpretation of FastA output is similar to that of BLAST output.
http://www.ebi.ac.uk/Tools/sss/fasta/
Similarity you observe, homology you infer
Interpreting results
Sequences are similar if their similarity score is significantly higher than that of random sequences of same length and composition.
Sequences are homologous if they are similar because they diverged from a common ancestor.
Sequences are analogous if they are similar because of convergent evolution (e.g. binding sites for same ligand)
Similarity you observe, homology you infer !
You can speak of %similarity or %identity, not of %homology !
Homology: orthologous and paralogous (in- and out-paralogs)
Summary sequence similarity
Pairwise (one to one)
– Dotplot (graphical)
– Smith-Waterman / Needleman-Wunsch (optimal)
Multiple sequence alignment (many to many) (heuristic)
– ClustalW
– Muscle, ...
Database search (one to many) (heuristic)
– BLAST
– FastA
– BLAT
What can you check to stay updated?
Biocatalogue http://www.biocatalogue.org/
EMBRACE http://www.embraceregistry.net/
Bioinformatics Links Directory http://www.bioinformatics.ca/links_directory/
Summary
Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics
The identity matrix summarizes this scoring system, listing all residue combinations in a table
Substitution or scoring matrices provide a means to determine similarity in an objective way
More complex substitution matrices are more meaningful and more sensitive for detecting similarity
Substitution matrices are derived from analysis of multiple alignments of related sequences
The substitution matrices capture the similarity in properties between residues
A matrix does not capture insertions and deletions: penalties are given to deal with them
The two parts of the gap penalty: a higher penalty for creation, one lower for its extension
Dynamic programming uses a gap penalty and a scoring scheme to align two sequences
Needleman-Wunsch to align two sequences over the whole length (global alignment)
Smith-Waterman to align the most similar parts of two sequences (local alignment)
A graphical method: dot plots can be made to rapidly identify regions with similar sequence
Dot plots generate typical patterns which can be interpreted
Multiple sequence alignment is not simply expanding pairwise sequence alignments
Searching sequence databases is done through a little trick
BLAST quickly finds similar sequences by giving up some sensitivity
Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version
You can adjust a few parameters of the BLAST algorithm
A lot of power lies within choosing the right database for the BLAST search.
Some adjustments to the BLAST protocol exist for particular purposes
BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph
BLAT is derived from BLAST, and is used for searching very similar sequences in a genome
The technique of indexing 'words' is also used in some short read aligners
FastA is another popular sequence database search algorithm
FastA is accessible on the website of EBI
The interpretation of FastA output is similar to that of BLAST output.
Similarity you observe, homology you infer
Homology: orthologous and paralogous (in- and out-paralogs)