BITS: Basics of Sequence similarity

Basic bioinformatics concepts, databases and tools

Module 2

Searching for similar sequences

Joachim Jacob, http://www.bits.vib.be

Updated February 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf


Based on annotations, we can use text searching to get sequences of interest (module 1)

WHERE

– Primary dbs

– Derived dbs

HOW to find sequences

by keywords

by literature

by annotation

See BITS website - module 1

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203651:basic-bioinformatics-concepts-databases-and-tools&catid=81:training-pages&Itemid=190

In this module, we will look into sequence similarity to get and analyze sequences

Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics

Why would we like to detect similar sequences?

1. Searching in sequence databases for similar sequences

2. In a high-throughput experiment, every read needs to be 'aligned' to a genomic reference sequence, or the reads to each other (assembly)

3. Elucidation of functionality by detecting sites of conservation (sequence parts that resemble each other more than would be expected)

4. Phylogeny is built upon comparison of multiple sequences

How to search for sequence similarity

- in sequence databases, via BLAST or FASTA

- to compare two sequences in detail: pairwise sequence comparison

- to compare multiple sequences at once: multiple sequence alignment, de novo assembly

Methods can be categorized into:

Optimal/exhaustive – heuristic – Graphical

Comparing sequences can be classified into 'one to one', 'one to many', or 'many to many'

One to many

One to one

many to many

Conceptualizing the source of sequence similarity

Sequences can be similar because ...

they are derived from evolutionarily related organisms

they evolve in similar conditions: convergent evolution

So the first question we have to solve is: what is similar? How do we measure similarity?

The source of sequence similarity

Let's assume this toy example: a short sequence mutating over time, without insertions or deletions occurring.

[Figure: the toy sequence and its successive mutated descendants, with a summary of the changes that really occurred]

Taking the most divergent sequences (the first and the last), the only correct alignment for those two, given their history, is:

KLRMWILVATAEIDD

KPRMCILVAIADIRD

But we usually don't have all intermediate sequences: only the first and the last. How to determine what is the correct alignment?

In addition, multiple changes can have happened at one location over time

Many possibilities exist to align them: slide the sequences over each other. One of those positions will have the highest number of identical residues, called matches.

[Figure: KLRMWILVATAEIDD and KPRMCILVAIADIRD shifted against each other at different offsets, with matching residues highlighted in green]

In this example, we claim 'we have a match' when we see an identical residue at that position in both sequences.

The identity matrix summarizes this scoring system, listing all residue combinations in a table

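[Figure: identity matrix with the residues (A, C, Y, W, ...) as rows and columns; identical residues are scored as a match, all other combinations as a mismatch]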

Substitution or scoring matrices provide a means to determine similarity in an objective way

Such matrices are called substitution or scoring matrices. They are used to calculate a score for every pair of aligned amino acids in aligned sequences, in order to have a measure for sequence similarity.

Scoring three placements with the identity matrix (1 = match, 0 = mismatch):

KPRMCILVAIADIRD
     KLRMWILVATAEIDD

Score: 0 1 0 0 0 0 0 0 0 0

Sum of the scores: 1

KLRMWILVATAEIDD
   KPRMCILVAIADIRD

Score: 0 0 0 0 0 0 0 0 0 1 0 1

Sum of the scores: 2

KLRMWILVATAEIDD
KPRMCILVAIADIRD

Score: 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1

Sum of the scores: 10
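The sliding procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming ungapped placements only and the identity matrix as scoring scheme; the function names are made up for this example.

```python
def identity_score(a, b):
    """Score two equal-length strings with the identity matrix (1 = match, 0 = mismatch)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def slide(seq1, seq2):
    """Yield (offset, score) for every ungapped placement of seq1 against seq2.

    offset is the index in seq2 at which seq1 starts; a negative offset means
    seq1 sticks out to the left of seq2.
    """
    for offset in range(-len(seq1) + 1, len(seq2)):
        start1 = max(0, -offset)
        start2 = max(0, offset)
        overlap = min(len(seq1) - start1, len(seq2) - start2)
        yield offset, identity_score(seq1[start1:start1 + overlap],
                                     seq2[start2:start2 + overlap])


def best_offset(seq1, seq2):
    """Return the (offset, score) pair with the highest number of matches."""
    return max(slide(seq1, seq2), key=lambda pair: pair[1])


if __name__ == "__main__":
    print(best_offset("KLRMWILVATAEIDD", "KPRMCILVAIADIRD"))
```

Trying every offset costs on the order of len(seq1) * len(seq2) comparisons, which is fine for two short sequences but not for a whole database.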

Complex substitutions matrices are more meaningful and sensitive to detect similarity

The two most popular are PAM and BLOSUM. Every pair of aligned residues gets a score, based on the matrix. E.g. an A-A alignment gets score 2 (PAM120) or 4 (BLOSUM62). An F-G gets -5 or -3.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html

Likely changes: positive score - unlikely changes: negative score
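As a small illustration of scoring with a substitution matrix, the sketch below sums BLOSUM62 scores over an ungapped alignment. Only the residue pairs needed for the toy sequences are included; the values are copied from the BLOSUM62 table in this module, and the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are made up for this example.

```python
# Subset of BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "L"): 4, ("M", "M"): 5, ("R", "R"): 5, ("V", "V"): 4,
    ("L", "P"): -3, ("W", "C"): -2, ("T", "I"): -1, ("E", "D"): 2,
    ("D", "R"): -2, ("F", "G"): -3,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a score; the matrix is symmetric, so try both orders."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # raises KeyError if the pair is not in the subset


def alignment_score(a, b, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores of an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(x, y, matrix) for x, y in zip(a, b))


if __name__ == "__main__":
    print(alignment_score("KLRMWILVATAEIDD", "KPRMCILVAIADIRD"))
```

Unlike the identity matrix, a conservative change such as E-D still contributes a positive score, while an unlikely change such as L-P is penalised.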


Substitution matrices are derived from analysis of multiple alignments of related sequences

PAM (Point Accepted Mutation) by Margaret Dayhoff: global alignments of proteins with >85% identity --> phylogenetic trees --> count substitutions --> estimate probability of conservation/substitution at a distance of 1 mutation per 100 aa ==> PAM1 table

PAMn tables by matrix multiplication (PAM1 multiplied by itself n times)

BLOSUM (BLOCKS Substitution Matrices) by Henikoff and Henikoff: BLOCKS (= local multiple sequence alignments without gaps) databank made from protein families from the PROSITE databank --> BLOSUMn table derived from blocks in which sequences sharing more than n% identical aa are clustered and counted as one
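The step from PAM1 to PAMn can be illustrated with a toy example. The sketch below assumes a hypothetical alphabet of only three residues and invented PAM1 probabilities (the real PAM1 is a 20 x 20 matrix estimated from Dayhoff's data); it only shows that multiplying the matrix by itself n times gives the substitution probabilities at distance n.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m multiplied by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1 for a three-residue alphabet: each row sums to 1 and
# roughly 1% of the residues change per step. These numbers are invented.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

if __name__ == "__main__":
    for row in mat_power(TOY_PAM1, 250):
        print(["%.3f" % p for p in row])
```

The actual scoring tables are log-odds scores derived from such probabilities, a step that is left out of this sketch.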

http://en.wikipedia.org/wiki/Substitution_matrix


    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1  -2
W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2  -3
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7  -2
Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5

The BLOSUM62 similarity matrix

ftp://ftp.ncbi.nih.gov/blast/matrices/

The substitution matrices capture the similarity in properties between residues

From Livingstone, C. D. and Barton, G. J. (1993),"Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation", Comp. Appl. Bio. Sci., 9, 745-756.

A matrix does not capture insertions and deletions: penalties are given to deal with them

When two sequences are aligned, the relation between aligned residues can only be seen as one of the following:

- identity

- mismatch (substitution (DNA) or similarity level (protein))

- gap (insertion/deletion)

The two parts of the gap penalty: a higher penalty for opening a gap, a lower one for extending it

Score from substitution matrix

Gap penalty



Substitution matrices are used in many algorithms to detect sequence similarity


Pairwise sequence comparison – one to one

To create an alignment between two sequences

- Manually (?)

- Two sequences (= pairwise alignment): optimal alignment through 'dynamic programming'

– Needleman-Wunsch (global alignment)

– Smith-Waterman (local alignment)

Dynamic programming uses a gap penalty and a scoring scheme to align two sequences

Dynamic programming: two things needed

Scoring scheme to measure identity and similarity

• choose a scoring matrix for similarity and identity (e.g. PAM250)

Gap penalty

• For each gap, a penalty in the ultimate score is given, also called weight, or cost

most used: a + b * (n-1) for a gap of n positions

a: gap opening penalty, the higher penalty (negative score)

b: gap extension penalty, a smaller penalty to widen a gap
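The gap penalty formula translates directly into code. In this sketch the default values (10 for opening, 0.5 for extension) are only illustrative assumptions; the penalty is returned as a positive cost that is subtracted from the alignment score.

```python
def gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of n positions: a + b * (n - 1)."""
    if n < 1:
        raise ValueError("a gap has at least one position")
    return gap_open + gap_extend * (n - 1)


if __name__ == "__main__":
    # One gap of five positions is much cheaper than five separate gaps.
    print(gap_penalty(5), 5 * gap_penalty(1))
```

This difference is the reason for the two-part penalty: a single insertion or deletion event of several residues is considered more likely than several independent events.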



Dynamic programming and backtracking

[Figure: dynamic programming matrix for aligning the source sequence ATT (rows) with the target sequence TTC (columns); arrows mark the backtracking path]

Each cell is filled from its three neighbours:

$S_{i,j} = \max \{\, S_{i-1,j-1} + s(a_i, b_j),\; S_{i,j-1} + s(-, b_j),\; S_{i-1,j} + s(a_i, -) \,\}$

scoring scheme:

• s(ai,bj) = +1 if ai = bj

• s(ai,bj) = -1 if ai ≠ bj

• s(ai,-) = -1

• s(-,bj) = -1

Filled matrix:

      -   T   T   C
  -   0  -1  -2  -3
  A  -1  -1  -2  -3
  T  -2   0   0  -1
  T  -3  -1  +1   0

Resulting alignment (score 0):

A T T -
- T T C
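The recurrence and the backtracking step can be sketched as follows. This is a minimal global alignment (Needleman-Wunsch) assuming the simple scoring scheme of this slide (match +1, mismatch -1, gap -1) and a linear gap cost rather than separate opening and extension penalties; when several paths are equally good, only one of them is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill every cell from its diagonal, upper and left neighbour.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrack from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ATT", "TTC"))
```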

Two approaches to align pairwise: align globally versus locally

Global alignment: the Needleman-Wunsch algorithm considers similarity across the full extent of sequences A and B

Local alignment: the Smith-Waterman algorithm focuses on regions of similarity in parts of sequences A and B
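The difference between both approaches is small in code. The sketch below is a minimal local alignment (Smith-Waterman), assuming the same toy scoring scheme (match +1, mismatch -1, gap -1): a cell never drops below 0, and backtracking starts at the highest cell and stops at the first 0.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Backtrack from the highest cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("ATT", "TTC"))
```

For ATT and TTC the local alignment keeps only the shared TT and drops the flanking residues that a global alignment has to pay for.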

Software for creating an optimal pairwise alignment

Best global alignment (Needleman-Wunsch)

- EMBOSS needle (web interface: http://emboss.bioinformatics.nl/cgi-bin/emboss/needle, or on Mobyle: http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::needle)

- EMBOSS stretcher (with Myers-Miller optimization, for very long sequences; web interface on Mobyle: http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::stretcher)

Best local alignment (Smith-Waterman)

- EMBOSS water

- SIM (Huang and Miller, with optimization for very long sequences, can also find non-overlapping suboptimal alignments): http://expasy.org/tools/sim-prot.html

- EMBOSS matcher (idem as SIM)

- modified version of SIM (by Laurent Duret) with output for graphical viewer

Mobyle portal: http://mobyle.pasteur.fr/cgi-bin/portal.py#welcome

Parameters that are set

A graphical method: dot plots can be made to rapidly identify regions with similar sequence

The parameters of a dotplot (which uses the identity matrix), are the word size (e.g. per 3 residues) and the threshold (% of a word that are identities). This is very convenient for large molecules, e.g. chromosomes
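A dotplot with these two parameters can be sketched as follows. This is a naive illustration, assuming the threshold is given as a minimum number of identities per word (`min_identities`, a name made up here) instead of a percentage; real tools use faster word lookups.

```python
def dotplot(seq_a, seq_b, word_size=3, min_identities=None):
    """Return the set of (i, j) where the words starting at seq_a[i] and seq_b[j]
    share at least min_identities identical positions (default: the whole word)."""
    if min_identities is None:
        min_identities = word_size
    dots = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in range(len(seq_b) - word_size + 1):
            identities = sum(1 for k in range(word_size)
                             if seq_a[i + k] == seq_b[j + k])
            if identities >= min_identities:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dots as text: seq_a runs down the rows, seq_b across the columns."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "KLRMWILVATAEIDD", "KPRMCILVAIADIRD"
    print(render(a, b, dotplot(a, b, word_size=3, min_identities=2)))
```

A larger word size or a stricter threshold removes background noise, at the risk of missing weakly similar regions.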

Software for making dotplots

EMBOSS contains the following programs

– dottup: word comparison

– dotmatcher: window/threshold comparison

– polydot: word comparison, makes n*n dotplots in one graph

Dotter – developed by Erik Sonnhammer and Richard Durbin (U. Stockholm, Sweden)

Dotlet – (Java applet) at the Swiss Institute of Bioinformatics

Gepard – (Munich Information Center for Protein Sequences, Germany): with a heuristic for speeding up computation, for comparing very long sequences

...

http://www.bits.vib.be/wiki/index.php/Dotplot


Dot plots generate typical patterns which can be interpreted

http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

[Figure: dot plots of sequence A against sequence B showing typical patterns: simple repeat, insertion in sequence B, insertion in sequence A, complex repeat, palindrome]







Multiple sequence alignment


Multiple sequence alignment is not simply expanding pairwise sequence alignments

One could try dynamic programming, but it is too time-consuming: 20 sequences would already need more time than the universe has existed...

Heuristic methods lead the way: "progressive alignment" most used

1. Use pairwise dynamic programming for all sequences

2. Guide tree is constructed based on scores

3. Two sequences are aligned, and sequentially every sequence is added following the guide tree (progressive clustering)

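[Figure: progressive alignment workflow: N (N-1)/2 pairwise sequence alignments --> similarity matrix --> guide tree (progressive clustering) --> multiple sequence alignment (progressive alignment, "once a gap, always a gap")]

Similarity matrix for sequences A, B, C and D:

      A    B    C
B   142
C    95  101
D    60   62   55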

Progressive clustering is a two-step process: measuring distance and constructing alignment

STEP 1: Measure similarity

STEP 2: Construct MSA

[Figure labels: take mean of AB; take mean of ABC]
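The clustering step can be sketched with the similarity matrix of sequences A, B, C and D. This is a simplified illustration, assuming that the similarity between two clusters is the mean over all their sequence pairs; the function name and data layout are made up for this example, and real programs build the guide tree with methods such as UPGMA or neighbour joining.

```python
from itertools import combinations

# Pairwise similarity scores taken from the similarity matrix of this module.
SIMILARITY = {
    frozenset("AB"): 142, frozenset("AC"): 95, frozenset("BC"): 101,
    frozenset("AD"): 60, frozenset("BD"): 62, frozenset("CD"): 55,
}


def guide_tree_order(names, similarity):
    """Return the merges, in order, as (cluster1, cluster2, mean similarity)."""
    def mean_similarity(c1, c2):
        values = [similarity[frozenset((x, y))] for x in c1 for y in c2]
        return sum(values) / len(values)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are most similar on average.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: mean_similarity(*pair))
        merges.append((c1, c2, mean_similarity(c1, c2)))
        rest = [c for c in clusters if c not in (c1, c2)]
        clusters = [c1 + c2] + rest
    return merges


if __name__ == "__main__":
    for c1, c2, sim in guide_tree_order("ABCD", SIMILARITY):
        print("".join(c1), "+", "".join(c2), "at mean similarity", sim)
```

The merge order (first A with B, then C, then D) is the order in which the progressive alignment adds the sequences.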

Progressive clustering: once a gap, always a gap

The guide tree is NOT a phylogenetic tree !

The progressive alignment framework can be extended to make it faster and more sensitive

More sensitive:

- consistency: per position scoring scheme (T-Coffee)

- structural guidance: based on structural info alignment is guided (Expresso)

Faster:

- distance measured by analysing k-tuples (see later) instead of pairwise aligning (Clustal Omega)


http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0030123

Different formats of aligned sequences exist

http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment#Formats

Clustal format

h1_pea     -MATEEPIVAVETVPEPIVTEPTTITEPEVPEKEEPKAEVEKTKKAKGSKPKKASKPRNP
h1_sollc   -MATEEPVIVNEVVEEQAA--PETVKDEANPPAKSGKAKKETKAKKPAAPRKRSATP---
h11_volca  MSETEAAPVVAPAAEAAPAAEAPKAKAPKAKAPKQPKAPKAPKEPKAPKEKKPKAAP---

h1_pea     ASHPTYEEMIKDAIVSLKEKNGSSQYAIAKFIEEKQ-KQLP-ANFKKLLLQNLKKNVASG
h1_sollc   -THPPYFEMIKDAIVTLKERTGSSQHAITKFIEEKQ-KSLP-SNFKKLLLTQLKKFVASE
h11_volca  -THPPYIEMVKDAITTLKERNGSSLPALKKFIENKYGKDIHDKNFAKTLSQVVKTFVKGG

Phylip format

3 298
h1_pea    -MATEEPIVA VETVPEPIVT EPTTITEPEV PEKEEPKAEV EKTKKAKGSK
h1_sollc  -MATEEPVIV NEVVEEQAA- -PETVKDEAN PPAKSGKAKK ETKAKKPAAP
h11_volca MSETEAAPVV APAAEAAPAA EAPKAKAPKA KAPKQPKAPK APKEPKAPKE

          PKKASKPRNP ASHPTYEEMI KDAIVSLKEK NGSSQYAIAK FIEEKQ-KQL
          RKRSATP--- -THPPYFEMI KDAIVTLKER TGSSQHAITK FIEEKQ-KSL
          KKPKAAP--- -THPPYIEMV KDAITTLKER NGSSLPALKK FIENKYGKDI


Software that implements these algorithms and lets you manually adjust the alignments

Alignment editors

- SeaView

- SeqPup

- GeneDoc

- Jalview

- BioEdit

- CLC Sequence Viewer

- UGene

http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment


Additional references

Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency]

Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm]

Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm]

Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.






Searching sequence databases is done through a little trick

Problem

find me all similar sequences to a query sequence in a database.

(Find me the position of many short reads in a genome)

Bottleneck:

we cannot compute an optimal alignment for every sequence and determine which is best (~MSA). This is time-consuming and only practicable on special computers (parallel computer or computer cluster)

"Heuristic" algorithm : gain of speed at the expense of some loss in sensitivity

• BLAST (developed by S. Altschul et al. at NCBI)

• fastA (developed by W. Pearson at U. of Virginia)

BLAST quickly finds similar sequences by giving up some sensitivity

Algorithm (= steps to follow to reach your goal)

http://www.ncbi.nlm.nih.gov/books/NBK21097/


BLAST step 1 : neighbouring

BLAST step 2 : searching the little words in the db

BLAST step 3 : extend where the words match

Only if words match >Sg score

Proteins: extension only if another hit lies within 40 residues on the same diagonal

Proteins: optimal composition adapted

Each BLAST search hit has an E-value, which is how many hits we expect by chance

$E = m \, n \, K \, e^{-\lambda S}$

m: query sequence length; n: total databank length

K and λ: parameters obtained by simulation (search of a random sequence against a random databank)

Expect value: number of unrelated databank sequences expected to yield the same or a higher score S by pure chance (extreme value distribution)

http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf

BLAST statistics

Bit score: score corrected for the scale of the scoring scheme

$S' = \dfrac{\lambda S - \ln K}{\ln 2}$

Probability P that the databank yields by pure chance at least one alignment with the same or a higher score

$P = 1 - e^{-E}$
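The three BLAST statistics can be computed with a few lines of Python. This sketch only restates the formulas; the values used for λ and K in the example (0.267 and 0.041) and the query and databank lengths are assumptions for illustration, since the real parameters depend on the scoring matrix and gap penalties used.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """E = m * n * K * exp(-lambda * S)."""
    return query_len * db_len * k * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """P = 1 - exp(-E): chance of at least one hit with this score or better."""
    return 1.0 - math.exp(-e)


if __name__ == "__main__":
    lam, k = 0.267, 0.041       # assumed values, for illustration only
    m, n = 250, 100_000_000     # hypothetical query and databank lengths
    for s in (40, 80, 120):
        e = e_value(s, m, n, lam, k)
        print(s, round(bit_score(s, lam, k), 1), e, p_value(e))
```

With the bit score the E-value can be written as m * n * 2^(-S'), which is why bit scores from searches with different scoring schemes can be compared.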

Interpreting the BLAST results by E-value and bit score

E-value: the lower the better (= number of hits with at least this score expected by chance when searching a random sequence against a database of the same size) (e.g. 0.1 means that in 1 out of 10 searches, this similarity could have arisen by chance alone)

Max/Total score: bit score – the higher the better (= score constructed from the length of the total alignment of the high scoring pair)

Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version

Different flavours of BLAST

Depending on query sequence: DNA or protein

and database: DNA or protein

Flavour: query - database

blastn: DNA - DNA

blastp: protein - protein

blastx: translated DNA - protein

tblastn: protein - translated DNA

tblastx: translated DNA - translated DNA

You can adjust a few parameters of the BLAST algorithm

SEG filter for proteins

DUST filter for nucleic acids

E-value threshold for searching, rule of thumb: below 1e-05: good similarity; between 1e-05 and 1e-01: weak similarity; between 1e-01 and 10: take a good look

Lower word size = sensitivity up (at the cost of speed)

A lot of power lies within choosing the right database for the BLAST search.

The choice of the database

The "nr/nt" database is the largest nucleotide database available through NCBI BLAST; select the "nr/nt" database for this exercise. It includes all GenBank, RefSeq Nucleotides, EMBL (European nucleotide database), DDBJ (Japanese nucleotide database) and PDB (Protein Data Bank) sequences, but no EST, STS, GSS, or phase 0, 1 or 2 htgs (unfinished high throughput genomic) sequences. The NCBI nr database originally got its name from the phrase "nonredundant" nucleotide database, but there is no longer any claim to nonredundancy in the sequence set.

Nearly every sequence database comes with BLAST services nowadays

Numerous online websites, mostly running WU-BLAST or NCBI BLAST

http://blast.ncbi.nlm.nih.gov/Blast.cgi

http://www.ebi.ac.uk/Tools/sss/

But very easy to install on own computer ('run locally')

1. Download blast programs ( here )

2. Format your 'database' (multifasta file)

3. Run BLAST
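Steps 2 and 3 can be scripted, for example from Python. The sketch below assumes that the NCBI BLAST+ programs `makeblastdb` and `blastn` are installed and on the PATH (older installations use different program names); the file names, the database name and the E-value cut-off are hypothetical.

```python
import shutil
import subprocess


def blast_commands(db_fasta, query_fasta, db_name="mydb", evalue=1e-5,
                   out="hits.tsv"):
    """Build the two command lines: format the database, then search it."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "nucl",
               "-out", db_name]
    search = ["blastn", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", out]
    return make_db, search


def run_local_blast(db_fasta, query_fasta, **options):
    """Run both steps; stop with an error if a program is missing or fails."""
    for command in blast_commands(db_fasta, query_fasta, **options):
        if shutil.which(command[0]) is None:
            raise RuntimeError(command[0] + " not found: is BLAST+ installed?")
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the commands, nothing is executed.
    for command in blast_commands("genome.fasta", "query.fasta"):
        print(" ".join(command))
```

Output format 6 is a tab-separated table with one hit per line, which is convenient for further filtering on E-value or bit score.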

You can also choose to use NCBI Blast online outside of the browser by using netblast (instructions here)

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/


http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/netblast.html


Some adjustments to the BLAST protocol exist for particular purposes

Identifying very distantly related proteins

PSI-BLAST (position specific iterated) (see module 3)

BLAST protein with matching of a pattern

PHI-BLAST (pattern hit initiated) (see module 3)

BLAST highly similar nucleotide sequences

Mega-BLAST

LastZ explanation – have a look at the dotplots here

http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html

BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph

The tool to do this is called BLAST2SEQ: e.g. comparing chrI with ChrVIII of S. cerevisiae

insertions!

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&SHOW_DEFAULTS=on&BLAST_SPEC=blast2seq&LINK_LOC=align2seq



BLAT is derived from BLAST, and is used for searching very similar sequences in a genome

BLAT = BLAST like alignment tool

- database is a genome sequence

- the database is not kept in files, but in memory as words of size 11

- it can only be used for very similar sequences, e.g. if you have a fragment of which you want to know the position in the genome.

The technique of indexing 'words' is also used in some short read aligners

In BLAST for nucleotides: k = 11 (11-mers)

11111111111 → 11 consecutive matches

However, non-consecutive matches improve sensitivity: a spaced seed.

111010010100110111 → 55% more sensitive

• 1 means a match, 0 means a "don't care" position

– Key size: number of 1's

– Key width: total number of 0's and 1's

• The 'keys' are used to index the genome or the reads, depending on the aligner
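Indexing with a spaced seed can be sketched as follows. This is a minimal illustration that indexes the genome (some aligners index the reads instead); the function names and the toy genome are invented for this example.

```python
SEED = "111010010100110111"  # key size 11 (number of 1's), key width 18


def seed_key(window, seed=SEED):
    """Keep only the residues at the 'match' (1) positions of the seed."""
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


def index_genome(genome, seed=SEED):
    """Map every key to the list of genome positions where it occurs."""
    index = {}
    for pos in range(len(genome) - len(seed) + 1):
        key = seed_key(genome[pos:pos + len(seed)], seed)
        index.setdefault(key, []).append(pos)
    return index


def find_hits(read, index, seed=SEED):
    """Return (read offset, genome position) pairs that share a key."""
    hits = []
    for offset in range(len(read) - len(seed) + 1):
        key = seed_key(read[offset:offset + len(seed)], seed)
        for pos in index.get(key, []):
            hits.append((offset, pos))
    return hits


if __name__ == "__main__":
    genome = "ACGTTGCATGCCATGACGTAGCTAGGATCC"   # toy genome
    read = "GCAAGCCATGACGTAGCT"   # genome[5:23] with one mismatch (T -> A)
    print(find_hits(read, index_genome(genome)))
```

The mismatch in the read falls on a "don't care" position, so the spaced seed still finds the hit, whereas the first 11 consecutive residues of the read do not occur in the genome.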

doi: 10.1093/bib/bbq015 on http://dx.doi.org


FastA is another popular sequence database search algorithm

1. Find runs of identities

2. Rescore using a PAM matrix and keep the top scoring segments

3. Apply a 'joining threshold' to eliminate segments that are unlikely to be part of the alignment that includes the highest scoring segment

4. Use dynamic programming to optimize the alignment in a narrow band that encompasses the top scoring segments
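The first step, finding runs of identities, relies on a word lookup. The sketch below is a simplified illustration and not the FastA implementation: it counts, per diagonal, how many words of length `ktup` (a parameter name assumed here) the query and the database sequence share. Diagonals with many hits are the candidates for the later rescoring and banded alignment steps.

```python
from collections import defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count shared words of length ktup per diagonal (subject pos - query pos)."""
    # Lookup table: word -> all positions of that word in the query.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    # Scan the subject once and collect the hits on each diagonal.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return dict(diagonals)


if __name__ == "__main__":
    hits = diagonal_hits("KLRMWILVATAEIDD", "KPRMCILVAIADIRD")
    print(sorted(hits.items(), key=lambda item: -item[1]))
```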

FastA is accessible on the website of EBI

Further explanation of the algorithm: http://en.wikipedia.org/wiki/FASTA

Accessibility

• EBI (help: http://www.ebi.ac.uk/Tools/blast2/help.html)

FastA developers: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

The interpretation of FastA output is similar to that of BLAST output.

http://www.ebi.ac.uk/Tools/sss/fasta/


Similarity you observe, homology you infer

Interpreting results

Sequences are similar if their similarity score is significantly higher than that of random sequences of same length and composition.

Sequences are homologous if they are similar because they diverged from a common ancestor.

Sequences are analogous if they are similar because of convergent evolution (e.g. binding sites for same ligand)

Similarity you observe, homology you infer !

You can speak of %similarity or %identity, not of %homology !

Homology: orthologous and paralogous (in- and out-)

(out)

(in)

Summary sequence similarity

Pairwise (one to one)

– Dotplot (graphical)

– Smith-Waterman / Needleman-Wunsch (optimal)

Multiple sequence alignment (many to many) (heuristic)

– ClustalW

– Muscle, ...

Database search (one to many) (heuristic)

– BLAST

– FastA

– BLAT

What can you check to stay up to date?

Biocatalogue http://www.biocatalogue.org/

EMBRACE http://www.embraceregistry.net/

Bioinformatics Links Directory http://www.bioinformatics.ca/links_directory/


Summary

Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics

The identity matrix summarizes this scoring system, listing all residue combinations in a table

Substitution or scoring matrices provide a means to determine similarity in an objective way

Complex substitution matrices are more meaningful and sensitive to detect similarity

Substitution matrices are derived from analysis of multiple alignments of related sequences

The substitution matrices capture the similarity in properties between residues

A matrix does not capture insertions and deletions: penalties are given to deal with them

The two parts of the gap penalty: a higher penalty for creation, one lower for its extension

Dynamic programming uses a gap penalty and a scoring scheme to align two sequences

Needleman-Wunsch to align two sequences over the whole length (global alignment)

Smith-Waterman to align the most similar parts of two sequences (local alignment)

A graphical method: dot plots can be made to rapidly identify regions with similar sequence

Dot plots generate typical patterns which can be interpreted

Multiple sequence alignment is not simply expanding pairwise sequence alignments

Searching sequence databases is done through a little trick

BLAST quickly finds similar sequences by giving up some sensitivity

Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version

You can adjust a few parameters of the BLAST algorithm

A lot of power lies within choosing the right database for the BLAST search.

Some adjustments to the BLAST protocol exist for particular purposes

BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph

BLAT is derived from BLAST, and is used for searching very similar sequences in a genome

The technique of indexing 'words' is also used in some short read aligners


FastA is accessible on the website of EBI

The interpretation of FastA output is similar to that of BLAST output.

Similarity you observe, homology you infer

Homology: orthologous and paralogous (in- and out-)
