Post on 02-Oct-2020
Introduction to Bioinformatics for Medical Research
Gideon Greenspan, gdg@cs.technion.ac.il
Lecture 4: Protein Sequence Alignment

Protein Sequence Alignment
• Alignment Recap
• Needleman-Wunsch Algorithm
• Genomes to Proteins
• Scoring Matrices
• PAM
• BLOSUM
• Genomic vs Protein

Alignment Recap (1)
• Two input sequences
  – Add dashes to make same length
• Score for each position in alignment
  – Positive for match, negative for mismatch
  – Final score is sum of position scores
• Dashes placed so as to maximize score
  – Needleman-Wunsch algorithm
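
The scoring rule in this recap can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `alignment_score` is hypothetical, and the values +2 for a match and -1 for a mismatch or a dash are assumptions borrowed from the nucleotide example used later in the lecture.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Sum the position scores of two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two dashes")
        if x == "-" or y == "-":
            total += gap        # residue against a dash
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# One way of padding ACGCTG and CATGT with dashes to the same length
print(alignment_score("ACGCTG-", "-C-ATGT"))
```

The function only scores a given placement of dashes; choosing the placement that maximizes the score is the job of the Needleman-Wunsch algorithm.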

Alignment Recap (2)
• Global vs Local Alignment
  – Smith-Waterman algorithm
• Gap scores
  – Affine gap model
• Complexity
  – Indexing to reduce alignments
• FASTA and BLAST algorithms

Needleman-Wunsch Alignment
• Global alignment between sequences
  – Compare entire sequence against another
• Create scoring table
  – Sequence A across top, B down left
• Cell at row i and column j contains the score of best alignment between the first j elements of A and the first i elements of B
  – Global alignment score is bottom right cell
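
A minimal sketch of how such a scoring table could be filled, with A across the top (columns) and B down the left (rows). The function name `nw_table` is hypothetical, and the scores (+2 match, -1 mismatch, -1 per dash) are assumptions chosen to agree with the worked example in this lecture.

```python
def nw_table(a, b, match=2, mismatch=-1, gap=-1):
    """Return the Needleman-Wunsch scoring table for sequences a and b.

    table[i][j] is the best score for aligning the first i elements of b
    (rows) with the first j elements of a (columns).
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # top row: prefix of a against dashes
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, rows):          # left column: prefix of b against dashes
        table[i][0] = table[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[j - 1] == b[i - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # diagonal: a[j-1] against b[i-1]
                table[i - 1][j] + gap,       # up: b[i-1] against a dash
                table[i][j - 1] + gap,       # left: a[j-1] against a dash
            )
    return table


table = nw_table("ACGCTG", "CATGT")
for row in table:
    print(" ".join(f"{v:3d}" for v in row))
print("global score:", table[-1][-1])
```

Each cell depends only on its three neighbors above and to the left, so the table can be filled row by row in time proportional to the product of the two sequence lengths.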

Needleman-Wunsch Example
Sequences: A = ACGCTG, B = CATGT
• Score of best alignment between AC and CATG: -1
• …between AC and CATGT: -2
• …between ACG and CATG: 2
• Calculate score between ACG and CATGT: ?

Needleman-Wunsch Example
Sequences: A = ACGCTG, B = CATGT
• -1 from before plus -1 for mismatch of G against T → -2
• 2 from before plus -1 for mismatch of – against T → 1
• -2 from before plus -1 for mismatch of G against – → -3
• Cell gets highest score of -2, 1, -3 → 1

Scoring table for A = ACGCTG (across top) and B = CATGT (down left), with the first row and column filled:

        0    1    2    3    4    5    6
             A    C    G    C    T    G
0       0   -1   -2   -3   -4   -5   -6
1 C    -1
2 A    -2
3 T    -3
4 G    -4
5 T    -5

• Top left cell is 0
• First cell of top row: A against - → -1
• End of top row: ACGCTG against ------ → -6
• End of left column: ----- against CATGT → -5

The remaining cells are filled one at a time; each score corresponds to a best alignment of the two prefixes:
• A against C → -1
    A
    C
• AC against C → 1
    AC
    -C
• ACG against C → 0
    ACG
    -C-
• ACGC against C → -1, reached by two alignments
    ACGC    ACGC
    -C--    ---C
• ACG against CA → 0
    ACG
    -CA

Completed scoring table; the global alignment score is the bottom right cell, 2:

        0    1    2    3    4    5    6
             A    C    G    C    T    G
0       0   -1   -2   -3   -4   -5   -6
1 C    -1   -1    1    0   -1   -2   -3
2 A    -2    1    0    0   -1   -2   -3
3 T    -3    0    0   -1   -1    1    0
4 G    -4   -1   -1    2    1    0    3
5 T    -5   -2   -2    1    1    3    2

Tracing back from the bottom right cell through the cells that produced each score recovers the alignment. Only the cells on a traceback path are shown:

        0    1    2    3    4    5    6
             A    C    G    C    T    G
0       0   -1
1 C    -1         1    0
2 A          1         0   -1
3 T               0              1
4 G                    2    1         3
5 T                              3    2

Three paths give the same score of 2:

    ACGCTG-    ACGCTG-    -ACGCTG
    -C-ATGT    -CA-TGT    CATG-T-
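
The traceback step can be sketched as a recursive walk that follows every neighbor able to explain a cell's score, which is how several equally good alignments arise. This is an illustrative sketch only: the function name `nw_all_alignments` is hypothetical, and the scores (+2 match, -1 mismatch, -1 per dash) are assumptions matching the example table.

```python
def nw_all_alignments(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, alignments), listing every optimal global alignment."""
    rows, cols = len(b) + 1, len(a) + 1

    def pair(i, j):
        return match if a[j - 1] == b[i - 1] else mismatch

    # Fill the scoring table: rows follow b, columns follow a.
    t = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        t[0][j] = j * gap
    for i in range(1, rows):
        t[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            t[i][j] = max(t[i - 1][j - 1] + pair(i, j),
                          t[i - 1][j] + gap,
                          t[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        # top/bottom are built back to front and reversed at the origin.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and t[i][j] == t[i - 1][j - 1] + pair(i, j):
            walk(i - 1, j - 1, top + a[j - 1], bottom + b[i - 1])
        if i > 0 and t[i][j] == t[i - 1][j] + gap:
            walk(i - 1, j, top + "-", bottom + b[i - 1])
        if j > 0 and t[i][j] == t[i][j - 1] + gap:
            walk(i, j - 1, top + a[j - 1], bottom + "-")

    walk(rows - 1, cols - 1, "", "")
    return t[-1][-1], results


score, alignments = nw_all_alignments("ACGCTG", "CATGT")
print("score:", score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

Enumerating every path can take exponential time on long, repetitive sequences, so practical tools usually report a single optimal alignment.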

From Genomes to Proteins
• 3-nucleotide ‘codons’ for each amino acid
  – 4³ = 64 possible codon values
  – Only 20 amino acids → degeneracy
• Start and stop codons
  – Start codon determines reading frame
• Silent, missense and nonsense mutations
• Different codes, e.g. for mitochondria
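
The points above can be illustrated with a compact encoding of the standard genetic code. This is a sketch under stated assumptions: the helper names `translate` and `classify_mutation` are hypothetical, translation simply starts at the first base rather than searching for a start codon, and only the three mutation classes named on the slide are distinguished.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code for all 64 codons, listed in
# TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon from its first base."""
    usable = len(dna) - len(dna) % 3   # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_mutation(codon, mutated):
    """Classify a codon change as silent, missense or nonsense.

    Simplification: a change away from a stop codon is not treated separately.
    """
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


print(len(CODON_TABLE), "codons,", len(set(AMINO_ACIDS) - {"*"}), "amino acids")
print(translate("ATGGCCTGGTAA"))
print(classify_mutation("GCC", "GCT"))   # third-position change
print(classify_mutation("GCC", "GAC"))   # second-position change
print(classify_mutation("TGG", "TGA"))   # change to a stop codon
```

Shifting the start position by one or two bases changes every codon that follows, which is why the start codon fixes the reading frame.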

The Standard Genetic Code

Scoring Matrices (1)
• The standard scoring scheme for aligning nucleotides could be expressed as a matrix:

       A    C    G    T
A     +2   -1   -1   -1
C     -1   +2   -1   -1
G     -1   -1   +2   -1
T     -1   -1   -1   +2

Scoring Matrices (2)
• But we could take account of relative likelihood of transitions and transversions:

       A    C    G    T
A     +1   -5   -1   -5
C     -5   +1   -5   -1
G     -1   -5   +1   -5
T     -5   -1   -5   +1

• The main diagonal is the axis of symmetry
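
A matrix of this kind can be generated from the chemistry rather than typed in by hand. In the sketch below, a transition is a change within the purines (A, G) or within the pyrimidines (C, T), and anything else is a transversion; the function name `substitution_score` is hypothetical and the scores +1, -1 and -5 are read off the matrix above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = "ACGT"


def substitution_score(x, y, match=1, transition=-1, transversion=-5):
    """Score one aligned nucleotide pair."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


matrix = {
    x: {y: substitution_score(x, y) for y in NUCLEOTIDES}
    for x in NUCLEOTIDES
}

print("   " + "".join(f"{y:>4}" for y in NUCLEOTIDES))
for x in NUCLEOTIDES:
    print(f"{x:<3}" + "".join(f"{matrix[x][y]:>+4}" for y in NUCLEOTIDES))
```

Because the score depends only on the unordered pair of nucleotides, the resulting matrix is symmetric about its main diagonal.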

Amino Acid Matrices
• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
• Matrices represent biological processes
  – Mutation causes changes in sequence
  – Evolution tends to conserve protein function
  – Similar function requires similar amino acids
• Could base matrix on amino acid properties
  – In practice: based on empirical data

PAM
• Percent Accepted Mutations
  – Margaret Dayhoff (1978)
• Based on very similar protein sequences
  – 34 known protein superfamilies
• Phylogenetic trees from global alignment
  – Count number of observed changes
  – Changes are obviously ‘accepted’

PAM Matrices
• PAM1 defines unit of evolutionary change
  – One percent accepted mutation, i.e. one amino acid difference per 100 residues on average
• Any PAMn calculated from PAM1
  – PAMn = PAM1 multiplied by itself n times
• PAMn is not n differences per 100 residues
  – One amino acid can change several times
  – PAM250 is in common use
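
The effect of multiplying PAM1 by itself can be seen with a toy model. The sketch below is not Dayhoff's data: it assumes a hypothetical 4-letter alphabet in which every residue changes with probability 0.01 per step and all replacements are equally likely, whereas the real PAM1 is a 20 × 20 matrix of unequal mutation probabilities, from which the published scores are then derived as log-odds values.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise square matrix m to the power n by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy 'PAM1' (assumption): 1% chance of change per step, spread evenly
# over the three other letters of a 4-letter alphabet.
SIZE = 4
toy_pam1 = [[0.99 if i == j else 0.01 / (SIZE - 1) for j in range(SIZE)]
            for i in range(SIZE)]

for n in (1, 60, 120, 250):
    unchanged = mat_pow(toy_pam1, n)[0][0]
    print(f"toy PAM{n}: {100 * (1 - unchanged):.1f} differences per 100 residues")
```

In this toy model the number of observed differences grows more slowly than n, because a position can change more than once or change back to its original letter.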

Selecting a PAM Matrix
• For PAMn, higher n suitable for sequences which are longer or less similar
  – PAM120 recommended for general use
  – PAM60 for close relations
  – PAM250 for distant relations
• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended

BLOSUM
• Blocks Substitution Matrix
  – Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database
  – Families of proteins with identical function
  – Highly conserved protein domains
• Ungapped local alignment to identify motifs
  – Counts amino acids observed in same column
  – Symmetrical model of substitution

BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• Value n defines how sequences are grouped within family before counting amino acids
  – For BLOSUMn, sequences which are more than n percent identical are considered as one
• Purpose: to prevent bias in favor of closely related protein sequences
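
The grouping step can be sketched as a simple clustering of the sequences in one ungapped block. Everything here is illustrative: the function names and the four 5-residue sequences are hypothetical, single-linkage grouping is an assumption about how "more than n percent identical" is applied, and the later counting step, in which each group contributes as a single sequence, is not shown.

```python
def percent_identity(s, t):
    """Percentage of columns in which two equal-length sequences agree."""
    same = sum(1 for x, y in zip(s, t) if x == y)
    return 100.0 * same / len(s)


def cluster_block(block, n):
    """Group the sequences of an ungapped block.

    A sequence joins (and merges) every existing cluster that contains a
    member more than n percent identical to it.
    """
    clusters = []
    for seq in block:
        merged = [seq]
        remaining = []
        for cluster in clusters:
            if any(percent_identity(seq, member) > n for member in cluster):
                merged.extend(cluster)
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


# Hypothetical block of aligned 5-residue segments
block = ["AVLKG", "AVLKA", "AILRG", "TWMPC"]
for n in (80, 62, 45):
    groups = cluster_block(block, n)
    print(f"n = {n}: {len(groups)} groups {groups}")
```

Lowering n merges more sequences into each group, so close relatives count for less and the resulting matrix reflects more distant relationships.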

Selecting a BLOSUM Matrix
• For BLOSUMn, higher n suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations
• BLOSUM unsuitable for short sequences
  – Use PAM30 instead

PAM vs BLOSUM

Matrix              PAM             BLOSUM
Higher n →          Distant         Close
Source alignment    Global          Local
Source sequences    Very similar    Similar
Underlying model    Evolution       Domains
Effective for       Ancestry        Domains
Calculation         Dependent       Independent

Provisional Guidelines

Query Length    Matrix      Gap Costs
< 35            PAM30       9, 1
35 … 50         PAM70       10, 1
50 … 85         BLOSUM80    10, 1
> 85            BLOSUM62    11, 1
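
The guideline table can be turned into a small lookup. This is a sketch with labeled assumptions: the function name `suggest_parameters` is hypothetical, the two gap cost numbers are assumed to be the cost of opening a gap followed by the cost of extending it, and the shared boundaries are resolved by assigning a length of exactly 50 to PAM70 and exactly 85 to BLOSUM80, which the table itself does not specify.

```python
def suggest_parameters(query_length):
    """Return (matrix, gap_open, gap_extend) for a protein query length."""
    if query_length < 35:
        return ("PAM30", 9, 1)
    if query_length <= 50:      # assumption: 50 falls in the 35 … 50 band
        return ("PAM70", 10, 1)
    if query_length <= 85:      # assumption: 85 falls in the 50 … 85 band
        return ("BLOSUM80", 10, 1)
    return ("BLOSUM62", 11, 1)


for length in (20, 40, 70, 300):
    matrix, gap_open, gap_extend = suggest_parameters(length)
    print(f"length {length}: {matrix}, gap costs {gap_open}, {gap_extend}")
```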

Other Scoring Matrices
• Genetic code changes
  – Distance between amino acid codons
• Amino acid properties
  – Particular properties for application
• Matrices from specific families
  – Example: transmembrane proteins
• Multi-site substitution model

Amino Acid Properties

Genomic vs Protein Alignment

Sequence              Genomic      Protein
Relationship          Phylogeny    Function
Scoring matrix        simple       complex
Interspecies range    Closer       Further
Mechanism             Mutation     Selection

Different genes    Different proteins    exponential    1    16

BLAST Variations

Name       Query type            Database
blastn     Genomic               Genomic
blastp     Protein               Protein
blastx     Translated genomic    Protein
tblastn    Protein               Translated genomic
tblastx    Translated genomic    Translated genomic