BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5...

72
Data bases (Biosequences, Structures, Genomes, DNA Chips, Computational Biology Tools for: BIOINFORMATICS Genomes, DNA Chips, Proteomics, Interactomics) •Design •Curation •Data Mining •Sequence analysis •Structure prediction •Docking •Structural genomics •Functional genomics •Proteomics •Interactomics

Transcript of BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5...

Page 1: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Data bases

(Biosequences, Structures, Genomes, DNA Chips,

Computational Biology

Tools for:

BIOINFORMATICS

Genomes, DNA Chips,

Proteomics, Interactomics)

•Design

•Curation

•Data Mining

•Sequence analysis•Structure prediction•Docking•Structural genomics•Functional genomics•Proteomics •Interactomics

Page 2: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The problem: Sequence comparison

•How to compare two sequences

•How to compare one sequence (target) to many sequences (database search)sequences (database search)

The solution: Sequence alignment

Page 3: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Why to compare

Similarity search is necessary for:

•Family assignment•Sequence annotation•Sequence annotation•Construction of phylogenetic trees•Protein structure prediction

Page 4: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence Alignment

Rita Casadio

Page 5: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Types of alignments:

•Aligment of Pairs of Sequences

•Multiple Sequence Alignment of three or more protein sequences

Page 6: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Pairwise and multiple sequence alignments can be global or local

•Globlal: the whole sequence is aligned

•Local : only fragments of the sequence are aligned

Page 7: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Pairwise sequence alignments can be global or local

Page 8: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

A basic concept:

Measures of sequence similarity

Page 9: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Given two character strings, two measures of the distance between them are:

1) The Hamming distancedefined between two strings of equal length is the number of positions with mismatching charactersagtccgta Hd=2

2) The Levenshtein or edit distance 2) The Levenshtein or edit distance between two strings of not necessarily equal length is the minimal number of edit operations required to change one string into the other, where an edit operation is a deletion, insertionor alteration of a single characterin either sequence. A given sequence of edit operations induces an unique alignment, but not viceversaag-tcccgctca Ld=3

Page 10: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Scoring schemes

A scoring scheme must account for residue substitutions, insertions or deletions (gaps)

Scores are measures of sequence similarity (similar Scores are measures of sequence similarity (similar sequences have small distances (high scores), dissimilar sequences give large distances (low scores))

Page 11: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Algorithms for optimal aligment can seek either to minimize a dissimilarity measureor maximize a scoring functionmaximize a scoring function

Page 12: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

For nucleic acid sequences (Genetic Code Scoring)

A simple scheme for substitution for nucleic acid sequences:match +1match +1mismatch -1

More complicated scheme are based on the higher frequency of transition mutations than transverse mutations

Page 13: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Example:

a t g ca 20 10 5 5t 10 20 5 5g 5 5 20 10c 5 5 10 20c 5 5 10 20

identity=20high frequency=5low frequency=10

purine=purine, and pyrimidine=pyrimidine are more common than purine=pyrimidine, or pyrimidine=purine.

Page 14: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

For proteins a variety of scoring schemes have been proposed

Similarity of physicochemical type (Chemical similarity scoring. Eg. McLachlan similarity matrix:polar, non polar;size, shape, charge, rare (F))polar;size, shape, charge, rare (F))

Substitution matrices

Page 15: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Substitution matrices for proteins

•PAM

•BLOSUM

•Matrices derived from tertiary structure •Matrices derived from tertiary structure aligment

Page 16: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Derivation of substitution matrices

The Dayhoff mutation matrix:

As sequence diverge, mutations accumulate. To measure the relativeprobability of any particular substitution we can count the number of changes in pairs of aligned similar sequences (relative frequency of such changes to form a scoring matrix for substitution)

1PAM: 1 Percent Accept Mutationtwo sequences 1PAM apart have 99% identical residues. One change in any position

Collecting statistics from pairs of sequences closely related (1PAM) and correcting for different aminoacid abundances produces the 1PAM substitution matrix

Page 17: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

For more widely diverged sequences powers of 1PAM are used

PAM250 (20% overall sequence identity)

Different PAMS for different level of sequence identity

PAM 0 30 80 110 200 250% Identity 100 75 50 60 25 20

M.Dayhoff (1978)

PET91 (Jones et al.,1992) is based on 2621 families

Page 18: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Score of mutation i,j

log observed i,j mutation rate/mutation rate expected from aminoacidfrequencies

log-odds values (x10)

2 =0.2 (scaling) log 10= 10^0.2=1.6˜ 2 The value is the expectation value of the mutationThe probability of two independent mutational events is the product of their probabilities (addition when we consider logs)

Page 19: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

PAM250

Page 20: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The BLOSUM Matrices

Henikoff and Henikoff (1991)

Best performing in identifying distant relationships, making use of the much larger amount of data that had become available since M. Dayhoff’s work

Based on a data base of multiple alignments without gaps for short Based on a data base of multiple alignments without gaps for short regions of related sequences. Within each alignment in the data base, the sequences were clustered into groups where the sequences are similar at some threshold value of percentage identitylog odds BLOSUM(blocks substitution matrix)

BLOSUM40, 62, 80..

Page 21: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

BLOSUM62

Page 22: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

DOTPLOT ANALYSIS

a simple picture that gives an overwiev of the similarites between two sequences

The doplot is a table or matrix based on scoring schemesThe doplot is a table or matrix based on scoring schemes

Page 23: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Dotlet - A Java applet for sequence comparisons using the dot matrix method

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Page 24: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 25: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

How to search for internal repeats in the sequenceHow to search for internal repeats in the sequence

Page 26: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 27: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

How to search for conserved domains

Page 28: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 29: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

ANCALM (J05545):TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTGGACTAAGCGACGCCCTTGCCAGGCATCCAGCTTAGTGGCTGTTGGTTTATTTGTAGAGTCCCCTTAACTCTCTCTCCCCCACATCGCCCATCTCCACCGACGCCTCTCTCTCTCGTGTTATTTCTCCCCATTCTCGCTTCATTTCCCATCCATTTTCGAGTTCTGCAATATCCTCACTAACTAGTATAGCCATGGTACGCCTCACTCGATCATCATCGTTGTTCGTGCGCTCAAACGCATCCGCTGTGCGGGGCAGATCTACTGGTGTCCTCCTGCGTAGATGAGCTGACGACTTCACTTCCAGGCCGACTCTCTGACCGAAGAGCAAGTTTCCGAGTACAAGGAGGCCTTCTCCCTATTTGTAAGTGCCATTGGTTACTGTTATATCAAAATCGAATTTGTATTGAGAGTATACTAATACATTCCGCACTAAACAGGACAAGGATGGCGATGGTTAGTGCATCTGTCCCCCCAGGCTTGATCGCATTCGCCCAGCATGTCTGCTGTAGCTCTATATAACCGTTTCTGACAAACGGCGACAGGCCAGATTACCACTAAGGAGCTTGGCACTGTCATGCGCTCGCTCGGTCAGAATCCTTCAGAGTCTGAGCTTCAGGACATGATCAACGAAGTTGACGCCGACAACAATGGCACC

Exons and Introns: How to find them

TGCGCTCGCTCGGTCAGAATCCTTCAGAGTCTGAGCTTCAGGACATGATCAACGAAGTTGACGCCGACAACAATGGCACCATTGACTTTCCAGGTACGCGAACTCCCCAATCTACTTCGCACCAGCCTAGAAATGTACTAATGCTAAACAGAGTTCCTTACCATGATGGCCAGAAAGATGAAGGACACCGATTCCGAGGAGGAAATTCGGGAGGCGTTCAAGGTCTTCGACCGTGACAACAATGGTTTCATCTCCGCTGCTGAGCTGCGTCACGTCATGACCTCGATCGGTGAGAAGCTCACCGATGACGAAGTCGACGAGATGATCCGCGAGGCGGACCAGGATGGCGACGGCCGAATTGACTGTACGTTGGCTCCCCGCTTATCCTTGACCGTAGAAGAGGTATGATACTGATCGGCTGCAGACAACGAATTCGTCCAACTTATGATGCAAAAATAAACGCTCTTACCTTTGATGTTTATCGTTAGCGAAGAAGGTGTGGACACTTTCCAGCTGTCTCATCTTAGTTGTCATATCATTGAATGTAGCCTATCTGATTGCGGATAAGCAACTGATGGTTGTAACGGCTTCCATTTTGCTCTGACTTCTGAGTACCCTTTTCCTTCATGTTTGTTCGTCGACCATTCTGCTAGTGAGATATGCGTAGAGTTGGGTAGGCTGAATTTACGAGTCTCTGTTGGGGGATATCACATGCTTCAC

TACAATCTTTCTCTAC

CALM_EMENI (P19533):ADSLTEEQVSEYKEAFSLFDKDGDGQITTKELGTVMRSLGQNPSESELQDMINEVDADNNGTIDFPEFLTMMARKMKDTD

SEEEIREAFKVFDRDNNGFISAAELRHVMTSIGEKLTDDEVDEMIREADQDGDGRIDYNEFVQLMMQK

Page 30: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 31: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence Alignment:Methods

Rita Casadio

Page 32: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Alignment of pairs of sequences

•Dot matrix analysis (dotplot)

•The dynamic programming algorithm

•Word or K-tuple methods (FASTA, BLAST)•Word or K-tuple methods (FASTA, BLAST)

Page 33: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence comparison with gaps

Deletions are referred to as ``gaps'', while insertions and deletions are collectively referred to as ``indels''. Insertions and deletions are needed to align accurately even quite closely related sequences such as the αααα and ββββ globins

The naive approach to finding the best alignment of two sequences including gaps is to generate all possible alignments, add up the scores for equivalencing each amino acid pair in each alignment then select the highest scoring alignment. However, for two sequences of 100 residues there are alternative alignments so such an approach would be time consuming and infeasible for longer sequences.

Page 34: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Finding the best alignment with dynamic programming

A group of algorithms calculate the best score and alignment in the order of steps. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch(J Mol Biol 48, 443, 1970). protein sequence comparison by Needleman and Wunsch(J Mol Biol 48, 443, 1970).

Page 35: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Pairwise and multiple sequence alignments can be global or local

•Globlal: the whole sequence is aligned

•Local : only fragments of the sequence are aligned

Page 36: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Pairwise sequence alignments can be global or global or local

Page 37: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

DATABASE SCANNING

•Word or K-tuple methods (FAST, BLAST)

Page 38: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Heuristic strategies for sequence sequence comparison

Page 39: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence similarity with FASTA

Page 40: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

FASTA

Page 41: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence similarity with BLAST (Basic Local Alignment Search Tool)

Page 42: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

BLAST

Page 43: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

PAM250

Page 44: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

A R N D C Q E G H I L K M F P S T W Y VA 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4

Blosum50

H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5

Page 45: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

BLOSUM62

Page 46: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 47: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 48: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

MEGABLAST SearchMega BLAST uses the greedy algorithm for nucleotide sequence alignment search. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When larger word size is used (see slightly as a result of sequencing or other similar "errors". When larger word size is used (see explanation below), it is up to 10 times faster than more common sequence similarity programs. Mega BLAST is also able to efficiently handle much longer DNA sequences than the blastn program of traditional BLAST algorithm.

Page 49: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 50: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 51: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,
Page 52: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

SH2 domain analysis with the program AMAS

Livingstone and Burton, Cabios 9,745, 1993

Page 53: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

1 Y K D Y H S - D K K K G E L - -2 Y R D Y Q T - D Q K K G D L - -3 Y R D Y Q S - D H K K G E L - -4 Y R D Y V S - D H K K G E L - -5 Y R D Y Q F - D Q K K G S L - -6 Y K D Y N T - H Q K K N E S - -7 Y R D Y Q T - D H K K A D L - -8 G Y G F G - - L I K N T E T T K 9 T K G Y G F G L I K N T E T T K10 T K G Y G F G L I K N T E T T K

A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0

Position

Multiple Alignment

F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0

How to build a sequence profile

Page 54: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions

(3) The profile is compared to the protein database, again seeking local

The design of PSI-BLAST

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Page 55: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

http://helix.biology.mcmaster.ca/721/distance/node9.html

Page 56: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

http://barton.ebi.ac.uk/papers/rev93_/http://barton.ebi.ac.uk/papers/rev93_/tableofcontents3_1.html

Page 57: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence Alignment: Statistics

Rita Casadio

Page 58: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The statistics of global sequence comparison

Page 59: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Distribution of SD scores obtained with 100000 alignments of length>20 between unrelated proteins.

100 randomizations, a global alignment method, Pam250

The tail of high SD scores

Page 60: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The Z score:

Z=(Xs-Xt)/σσσσsXs=average of distribution scores with random sequencesXt=average of distribution score with real sequencesσσσσs=SD of distribution scores with random sequences

Accuracy of the alignment:

Z<3 not significant3<Z<6 putatively significant6<Z<10 possibly significantZ>10 significant

Page 61: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The Z score of this local alignment is 7.5 over 54 residues, identity is25.9%. The sequences are of completely different secondary structure

How reliable is the Z score?

Citrate synthase (2cts) vs transthyritin (2paba)

Page 62: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Predicting quality using percentage identity

% I= No of identities over the length of the alignment

Page 63: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Typical alignment score distributions resulting from data base scan

a) Not discriminating

b) Intermediary result

c) Perfect discrimination

Page 64: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The statistics of global sequence comparison

To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties or (iii) sequences that are generated randomly based upon a sequences that are generated randomly based upon a DNA or protein sequence model.

Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

Page 65: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The P-value

The probability that a variate would assume a value greater than or equal to the observed value strictly by chance P(z>zo)

It is possible to express the score of interest in terms of standard deviations from the mean; it is a mistake to assume that the relevant distribution is normal and convert the Z-value into a P-value; the tail deviations from the mean; it is a mistake to assume that the relevant distribution is normal and convert the Z-value into a P-value; the tail behavior of global alignment scores is unknown. The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.

Page 66: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The statistics of local sequence comparison

Page 67: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

A local alignment without gaps consists simply of a pair of equal length segments, one from each of the two sequences being compared.

The scores of segment pairs can not be improved by extension or trimming. These are called high-scoring segment pairs or HSPs.

The maximum of a large number of independent idendically The maximum of a large number of independent idendically distributed random variables tends to an extreme value distribution.

Page 68: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

The maximum of a large number of independent idendically distributed random variables tends to an extreme value distribution

P(S<x)=exp[-exp(-x)]

P(S>=x)= 1- exp[-exp(-x)]

Page 69: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

E value

In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda.

Most simply, the expected number of HSPs with score at least S is given by the E-value for the score S:

E=Kmn exp(-λλλλS)

The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.scales for the search space size and the scoring system respectively.

Bit scores:

S’= (λλλλS-lnK)/ln2

The E-value corresponding to a given bit score is:

E=mn 2–S’

Page 70: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

P-value:

The chance of finding zero HSPs with score >=S is exp(-E), so the probability of finding at least one such HSP is

P=1-exp(-E)P=1-exp(-E)

This is the P-value associated with the score S.

Page 71: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,

Sequence divergence through evolution

Page 72: BIOINFORMATICS - biocomp.unibo.itsavojard/slides/lbc10/lezione2.pdf · a 20 10 5 5 t 10 20 5 5 g 5 5 20 10 c 5 5 10 20 identity=20 high frequency=5 low frequency=10 purine=purine,