Lecture 4 Protein Sequence Alignment -...

Post on 02-Oct-2020


Introduction to Bioinformatics for Medical Research

Gideon Greenspan
gdg@cs.technion.ac.il

Lecture 4
Protein Sequence Alignment

Protein Sequence Alignment

• Alignment Recap
• Needleman-Wunsch Algorithm
• Genomes to Proteins
• Scoring Matrices
• PAM
• BLOSUM
• Genomic vs Protein

Alignment Recap (1)

• Two input sequences
  – Add dashes to make same length
• Score for each position in alignment
  – Positive for match, negative for mismatch
  – Final score is sum of position scores
• Dashes placed so as to maximize score
  – Needleman-Wunsch algorithm
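The position-by-position scoring described above can be sketched in a few lines of Python. The function name `score_alignment` and the default values (+2 for a match, -1 for a mismatch, -1 for a dash) are assumptions made for illustration; the recap slide itself only says that matches score positive and mismatches negative.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-1):
    """Sum the per-position scores of two equal-length aligned rows.

    The default scores are assumed values for illustration only.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # a letter placed against a dash
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ACGCTG-", "-C-ATGT"))
```

Different placements of the dashes give different totals; finding the placement with the highest total is the job of the Needleman-Wunsch algorithm.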

Alignment Recap (2)

• Global vs Local Alignment
  – Smith-Waterman algorithm
• Gap scores
  – Affine gap model
• Complexity
  – Indexing to reduce alignments
• FASTA and BLAST algorithms
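As a reminder of how local alignment differs from global alignment, here is a minimal Python sketch of Smith-Waterman scoring. It is a simplification: it uses a linear gap penalty rather than the affine gap model, it returns only the score, and the name `smith_waterman_score` and the default scores are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed scores)."""
    prev = [0] * (len(a) + 1)   # previous row of the scoring table
    best = 0
    for y in b:
        curr = [0]
        for i, x in enumerate(a, start=1):
            diag = prev[i - 1] + (match if x == y else mismatch)
            up = prev[i] + gap
            left = curr[i - 1] + gap
            # Unlike global alignment, a cell never drops below zero,
            # so a local alignment can start afresh anywhere.
            curr.append(max(0, diag, up, left))
        best = max(best, max(curr))
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "TTAC"))
```

The local score is the largest value anywhere in the table, not the value in the bottom right cell.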

Needleman-Wunsch Alignment

• Global alignment between sequences
  – Compare entire sequence against another
• Create scoring table
  – Sequence A across top, B down left
• Cell at column i and row j contains the score of the best alignment between the first i elements of A and the first j elements of B
  – Global alignment score is bottom right cell

Needleman-Wunsch Example

Sequences: A = ACGCTG, B = CATGT

[Figure: part of the scoring table, rows 2 to 5]

• Score of best alignment between AC and CATG: -1
• …between AC and CATGT: -2
• …between ACG and CATG: 2
• Calculate score between ACG and CATGT: ?

Needleman-Wunsch Example

Sequences: A = ACGCTG, B = CATGT

• -1 from before plus -1 for mismatch of G against T → -2
• 2 from before plus -1 for mismatch of – against T → 1
• -2 from before plus -1 for mismatch of G against – → -3
• Cell gets highest score of -2, 1, -3 → 1
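The three-way choice worked through above can be written as a recurrence. The symbols are notation introduced here, not taken from the slides: F(i, j) is the score of the best alignment between the first i elements of A and the first j elements of B, s(a_i, b_j) is the match or mismatch score, and g is the score for a letter against a dash. The second line plugs in the numbers from the example (ACG against CATGT).

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{align } a_i \text{ with } b_j \\
F(i-1,\,j) + g             & \text{align } a_i \text{ with a dash} \\
F(i,\,j-1) + g             & \text{align a dash with } b_j
\end{cases}

F(3,5) = \max\bigl(-1 + (-1),\; -2 + (-1),\; 2 + (-1)\bigr) = \max(-2,\,-3,\,1) = 1
```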

Needleman-Wunsch Example: Building the Table

Sequences: A = ACGCTG, B = CATGT

• Corner cell starts at 0
• First row and first column: letters against dashes only
  – A against –: -1
  – ACGCTG against ------: -6
  – ----- against CATGT: -5
• Remaining cells filled one at a time
  – A against C: -1
  – AC against -C: 1
  – ACG against -C-: 0
  – ACGC against -C-- or ---C: -1
  – ACG against -CA: 0
• Completed table

          A    C    G    C    T    G
     0   -1   -2   -3   -4   -5   -6
C   -1   -1    1    0   -1   -2   -3
A   -2    1    0    0   -1   -2   -3
T   -3    0    0   -1   -1    1    0
G   -4   -1   -1    2    1    0    3
T   -5   -2   -2    1    1    3    2

• Tracing back from the bottom right cell gives three alignments with score 2

  ACGCTG-    ACGCTG-    -ACGCTG
  -C-ATGT    -CA-TGT    CATG-T-
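The whole example can be reproduced with a short Python sketch. The scores used (+2 for a match, -1 for a mismatch, -1 for a dash) are inferred from the example values rather than stated on a slide, and the function name `needleman_wunsch` and the tie-breaking order in the traceback are assumptions for illustration. With a different tie-breaking order the traceback would return another of the equally good alignments.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Return (table, aligned_a, aligned_b) for a global alignment.

    table[j][i] is the best score for the first i letters of a against
    the first j letters of b (a runs across the top, b down the left).
    """
    cols, rows = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        table[0][i] = i * gap          # prefix of a against dashes
    for j in range(1, rows):
        table[j][0] = j * gap          # dashes against prefix of b
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            table[j][i] = max(table[j - 1][i - 1] + s,   # letter vs letter
                              table[j - 1][i] + gap,     # dash vs b[j-1]
                              table[j][i - 1] + gap)     # a[i-1] vs dash

    # Trace back from the bottom right cell; ties go to the diagonal first.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if table[j][i] == table[j - 1][i - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and table[j][i] == table[j - 1][i] + gap:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
    return table, "".join(reversed(out_a)), "".join(reversed(out_b))


table, row_a, row_b = needleman_wunsch("ACGCTG", "CATGT")
print(table[-1][-1])   # global alignment score (bottom right cell)
print(row_a)
print(row_b)
```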

From Genomes to Proteins

• 3-nucleotide ‘codons’ for each amino acid
  – 4³ = 64 possible codon values
  – Only 20 amino acids → degeneracy
• Start and stop codons
  – Start codon determines reading frame
• Silent, missense and nonsense mutations
• Different codes, e.g. for mitochondria


The Standard Genetic Code
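To make the codon table concrete, here is a small Python sketch that builds the standard genetic code and uses it to translate a sequence and to classify a single-codon change. The compact 64-letter string is the usual way of writing the standard code with bases ordered T, C, A, G; the function names `translate` and `classify_mutation` are hypothetical, and `translate` assumes for simplicity that the reading frame starts at the first base.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each
# position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def classify_mutation(codon_before, codon_after):
    """Label a codon change as silent, missense or nonsense."""
    before, after = CODON_TABLE[codon_before], CODON_TABLE[codon_after]
    if before == after:
        return "silent"       # same amino acid, thanks to degeneracy
    if after == "*":
        return "nonsense"     # amino acid replaced by a stop codon
    return "missense"         # a different amino acid


print(len(CODON_TABLE), len(set(CODON_TABLE.values()) - {"*"}))
print(translate("ATGGCCTAA"))
print(classify_mutation("GCC", "GCT"),
      classify_mutation("GCC", "GAC"),
      classify_mutation("TGG", "TGA"))
```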

Scoring Matrices (1)

• The standard scoring scheme for aligning nucleotides could be expressed as a matrix:

       A    C    G    T
  A   +2   -1   -1   -1
  C   -1   +2   -1   -1
  G   -1   -1   +2   -1
  T   -1   -1   -1   +2

Scoring Matrices (2)

• But we could take account of relative likelihood of transitions and transversions:

       A    C    G    T
  A   +1   -5   -1   -5
  C   -5   +1   -5   -1
  G   -1   -5   +1   -5
  T   -5   -1   -5   +1

  (axis of symmetry: the main diagonal)
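Both nucleotide matrices can be generated from a simple rule, which the following Python sketch illustrates. A transition swaps two purines (A, G) or two pyrimidines (C, T); any other change is a transversion. The function names and the dictionary representation are assumptions for illustration; the score values are the ones shown in the two matrices.

```python
BASES = "ACGT"
PURINES = {"A", "G"}   # C and T are the pyrimidines


def uniform_matrix(match=2, mismatch=-1):
    """Every mismatch scores the same."""
    return {(x, y): match if x == y else mismatch
            for x in BASES for y in BASES}


def transition_transversion_matrix(match=1, transition=-1, transversion=-5):
    """Transitions are penalised less than transversions."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = match
            elif (x in PURINES) == (y in PURINES):
                matrix[x, y] = transition      # A<->G or C<->T
            else:
                matrix[x, y] = transversion    # purine <-> pyrimidine
    return matrix


def score_ungapped(a, b, matrix):
    """Score two equal-length sequences position by position."""
    return sum(matrix[x, y] for x, y in zip(a, b))


simple = uniform_matrix()
weighted = transition_transversion_matrix()
print(score_ungapped("ACGT", "GCGA", simple))
print(score_ungapped("ACGT", "GCGA", weighted))
```

The same pair of sequences scores differently under the two schemes because one of its mismatches is a transition and the other a transversion.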

Amino Acid Matrices

• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
• Matrices represent biological processes
  – Mutation causes changes in sequence
  – Evolution tends to conserve protein function
  – Similar function requires similar amino acids
• Could base matrix on amino acid properties
  – In practice: based on empirical data


PAM

• Percent Accepted Mutations
  – Margaret Dayhoff (1978)
• Based on very similar protein sequences
  – 34 known protein superfamilies
• Phylogenetic trees from global alignment
  – Count number of observed changes
  – Changes are obviously ‘accepted’

PAM Matrices

• PAM1 defines unit of evolutionary change
  – One percent accepted mutation, i.e. one amino acid difference per 100 residues on average
• Any PAMn calculated from PAM1
  – PAMn = PAM1 raised to the nth power (matrix multiplication)
• PAMn is not n differences per 100 residues
  – One amino acid can change several times
  – PAM250 is in common use
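The point that PAMn does not mean n differences per 100 residues can be checked numerically. The Python sketch below uses a toy PAM1 in which every amino acid is equally likely and changes to any other with equal probability; this is an assumption made purely for illustration and is not Dayhoff's empirical matrix. All function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, n):
    """Raise a square matrix to the nth power by repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def toy_pam1(size=20, change=0.01):
    """Toy mutation matrix: 1% chance of change, spread evenly (assumed)."""
    off_diagonal = change / (size - 1)
    return [[1.0 - change if i == j else off_diagonal for j in range(size)]
            for i in range(size)]


def fraction_changed(m):
    """Average probability that a residue differs from its starting value."""
    return 1.0 - sum(m[i][i] for i in range(len(m))) / len(m)


pam1 = toy_pam1()
for n in (1, 100, 250):
    print(n, round(fraction_changed(mat_power(pam1, n)), 3))
```

In this toy model the fraction of differing residues grows more slowly than n percent and can never exceed 100 percent, because repeated changes at one position overwrite each other or revert to the original residue.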

Selecting a PAM Matrix

• For PAMn, higher n suitable for sequences which are longer or less similar
  – PAM120 recommended for general use
  – PAM60 for close relations
  – PAM250 for distant relations
• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended

BLOSUM

• Blocks Substitution Matrix
  – Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database
  – Families of proteins with identical function
  – Highly conserved protein domains
• Ungapped local alignment to identify motifs
  – Counts amino acids observed in same column
  – Symmetrical model of substitution

BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS
• Value n defines how sequences are grouped within family before counting amino acids
  – For BLOSUMn, sequences which are more than n percent identical are considered as one
• Purpose: to prevent bias in favor of closely related protein sequences
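The grouping step can be illustrated with a short Python sketch. It clusters the ungapped segments of one block so that any two segments more than n percent identical end up in the same group. This is a simplified sketch under stated assumptions: the function names are hypothetical, the strict "more than" comparison follows the wording above, linking through intermediate segments (single linkage) is assumed, and the later steps of weighting and counting are left out.

```python
def percent_identity(s, t):
    """Percentage of aligned positions at which two segments agree."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_block(segments, n):
    """Group segment indices; segments over n% identical share a group."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) > n:
                parent[find(i)] = find(j)   # merge the two groups

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


block = ["ACDEF", "ACDEG", "WYWYW"]   # made-up segments, not real data
print(cluster_block(block, 62))
print(cluster_block(block, 90))
```

A lower n merges more segments into each group, so closely related sequences contribute less to the counts.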

Selecting a BLOSUM Matrix

• For BLOSUMn, higher n suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations
• BLOSUM unsuitable for short sequences
  – Use PAM30 instead

PAM vs BLOSUM

Matrix             PAM            BLOSUM
Source sequences   Very similar   Similar
Source alignment   Global         Local
Higher n →         Distant        Close
Underlying model   Evolution      Domains
Effective for      Ancestry       Domains
Calculation        Dependent      Independent

Provisional Guidelines

Query Length   Matrix     Gap Costs
< 35           PAM30      9, 1
35 … 50        PAM70      10, 1
50 … 85        BLOSUM80   10, 1
> 85           BLOSUM62   11, 1
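The guideline table translates directly into a lookup, sketched below in Python. The function name `suggest_parameters` is hypothetical, and how the boundary lengths 50 and 85 are assigned is an assumption, since the table lists them in two rows. The gap costs are returned as the pair of numbers shown in the table, without interpreting them further.

```python
def suggest_parameters(query_length):
    """Return (matrix name, gap costs) for a query of the given length.

    Boundary handling at exactly 50 and 85 is an assumption.
    """
    if query_length < 35:
        return "PAM30", (9, 1)
    if query_length <= 50:
        return "PAM70", (10, 1)
    if query_length <= 85:
        return "BLOSUM80", (10, 1)
    return "BLOSUM62", (11, 1)


for length in (20, 40, 60, 300):
    print(length, suggest_parameters(length))
```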

Other Scoring Matrices

• Genetic code changes
  – Distance between amino acid codons
• Amino acid properties
  – Particular properties for application
• Matrices from specific families
  – Example: transmembrane proteins
• Multi-site substitution model


Amino Acid Properties

Genomic vs Protein Alignment

Sequence             Genomic     Protein
Relationship         Phylogeny   Function
Different proteins   6           1
Different genes      1           exponential
Scoring matrix       simple      complex
Interspecies range   Closer      Further
Mechanism            Mutation    Selection

BLAST Variations

Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
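The table can be read as a decision rule: the program name follows from what kind of sequence the query and the database hold, and whether genomic sequence should be compared at the protein level. The Python sketch below encodes that rule; the function name `choose_blast_program` and its parameter names are hypothetical and are not part of any BLAST software.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST variant from the sequence types ("genomic" or "protein").

    compare_as_protein only matters when both sides are genomic: it selects
    a comparison of the translated sequences instead of the nucleotides.
    """
    kinds = {"genomic", "protein"}
    if query_type not in kinds or database_type not in kinds:
        raise ValueError("sequence types must be 'genomic' or 'protein'")
    if query_type == "protein" and database_type == "protein":
        return "blastp"
    if query_type == "genomic" and database_type == "protein":
        return "blastx"     # query is translated
    if query_type == "protein" and database_type == "genomic":
        return "tblastn"    # database is translated
    return "tblastx" if compare_as_protein else "blastn"


print(choose_blast_program("genomic", "protein"))
print(choose_blast_program("genomic", "genomic", compare_as_protein=True))
```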