Introduction to Comparative Protein Modelinghomes.nano.aau.dk/fp/md/Introduction to Comparative...

Chapter 4

Part I

Introduction to Comparative Protein Modeling

1

Information on Proteins

Each modeling study depends on the quality of the known experimental data, which form the basis of the model.

Search the literature: a 3D structure of the protein of interest may already be available.

When you have a sequence, compare it to other sequences to find similarities or differences.

Several algorithms have been developed for this; a comparison takes only a few minutes.

Example: differences between a healthy and a diseased individual.

2

Databases

Nucleotide and protein sequences:

EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database

Universal Protein Resource (UniProt) database

3D structural information:

Protein Data Bank (PDB). Can be searched by author name, journal name or part of a sequence.

3

PDB file

The formats of the data files are similar. PDB files are widely used, so the standard format of a protein PDB file is described here.

Header: General information about the protein. Includes the official name, references, the resolution of the crystal structure and other useful remarks.

Atomic coordinates: Atoms belonging to standard amino acids are labeled ATOM.

Individual peptide chains are separated by TER records.

Atoms of non-standard residues are labeled HETATM.

4

PDB file

When the file is read into a modeling program, bonds are built between ATOM records but not between HETATM records.

For this reason an additional connectivity table (CONECT records) is placed at the end of the data file.

The atom types of HETATM atoms are often assigned incorrectly when the file is read into a modeling program, so all atom types must be checked.

PDB files of crystal structures usually do not include hydrogen atoms.

Keep in mind:

The resolution of the crystal structure should at least lie between 2.5 and 1.5 Å.

As NMR measurements are performed in solution, the results are highly dependent on the solvent.
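The record types described above can be read with a few lines of Python. This is a minimal sketch, not a full PDB reader: it relies on the fixed column positions of the PDB format for ATOM and HETATM records, ignores the header and the CONECT table, and the function name `parse_pdb_atoms` and the dictionary keys are hypothetical names chosen for illustration.

```python
def parse_pdb_atoms(pdb_text):
    """Collect ATOM and HETATM records from PDB-format text.

    Returns a list of chains; a new chain starts after each TER record.
    Field positions follow the fixed-column PDB layout.
    """
    chains, current = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            current.append({
                "record": record,                 # ATOM or HETATM
                "name": line[12:16].strip(),      # atom name, columns 13-16
                "res_name": line[17:20].strip(),  # residue name, columns 18-20
                "chain_id": line[21],             # chain identifier, column 22
                "res_seq": int(line[22:26]),      # residue number, columns 23-26
                "xyz": (float(line[30:38]),       # x, columns 31-38
                        float(line[38:46]),       # y, columns 39-46
                        float(line[46:54])),      # z, columns 47-54
            })
        elif record == "TER" and current:
            chains.append(current)
            current = []
    if current:  # atoms after the last TER, typically HETATM records
        chains.append(current)
    return chains
```

Keeping the record label with each atom makes it easy to list all HETATM atoms afterwards, which are the ones whose atom types need to be checked by hand.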

5

Protein Structure

The 3D structure of proteins is characterized by four levels of structural organization:

The primary structure is the linear arrangement of the amino acids in the protein sequence.

The secondary structure describes the local architecture of linear fragments of the chain (α-helix, β-sheet).

The supersecondary structure is an additional level that describes the association of secondary elements through side chain interactions. Such an arrangement is also called a motif (hairpins, Greek key, ...).

The tertiary structure is the overall topology of the folded peptide chain.

The quaternary structure describes the arrangement of separate subunits in the functional protein.

6

Conformational Properties of Proteins

20 different amino acids are found in nature.

The physicochemical properties of their side chains (size, shape, hydrophobicity, charge and hydrogen bonding) span a considerable range.

Within the peptide chain their degrees of freedom are strongly restricted.

The dominant influences on protein conformation are:

Hydrogen bonding capabilities

Chirality of the amino acids. All 19 chiral amino acids (glycine is not chiral) have the L-configuration

Linear connectivity

Steric volume

7


Central carbon: Cα

Backbone: amino N, Cα and the carbonyl C

i is the residue number, counted from the amino end of the chain

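Torsion angles:

ϕ: rotation about the N-Cα bond (main chain)

ψ: rotation about the Cα-C’ bond (main chain)

ω: rotation about the C’-N bond (peptide bond, main chain)

χ: side chain torsion angles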
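Each of the torsion angles listed above is defined by four consecutive atoms, and can be computed from their coordinates. The following is a minimal sketch; the function name `dihedral` and the helper names are chosen for illustration. For residue i, ϕ uses the atoms C'(i-1), N(i), Cα(i), C'(i); ψ uses N(i), Cα(i), C'(i), N(i+1); and ω uses Cα(i), C'(i), N(i+1), Cα(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond.

    0 degrees corresponds to cis, +/-180 degrees to trans.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

Using `atan2` instead of `acos` keeps the sign of the angle, which is needed to tell apart the regions of a ϕ/ψ map.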

8


The peptide bond is planar.

It is nearly always in the trans configuration (ω = 180°), which is energetically more favorable than cis (ω = 0°).

The imino acid proline is sometimes found in cis.

9


Rotation about ϕ and ψ:

Makes the peptide chain flexible

Constrained geometrically by steric hindrance between neighboring atoms

The possible combinations of ϕ and ψ are shown in a Ramachandran plot:

White: Disallowed region

Red: Favored region

Yellow: Allowed region

Sub-regions: α and β

10

Types of Secondary Structural Elements

α-helix:

Best known and most easily recognized structure

Repetitive structure: Cα atoms are in identical relative positions, so the ϕ and ψ angles are the same for each residue in the helix

Repeats itself every 5.4 Å, with 3.6 amino acids per turn

Hydrogen bonds between the carbonyl of residue n and the NH of residue n+4 give a regular and favored state

Right-handed helix due to the L-amino acids

Side chains point outwards

Length: 10-15 residues
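The helix numbers above fix the geometry per residue. As a small worked example, assuming an ideal helix and approximating the helix length as the number of residues times the rise per residue (the function name is hypothetical):

```python
PITCH = 5.4              # repeat distance along the helix axis in Å (from the slide)
RESIDUES_PER_TURN = 3.6  # amino acids per turn (from the slide)

def helix_geometry(n_residues):
    """Approximate geometry of an ideal α-helix with n_residues residues."""
    rise = PITCH / RESIDUES_PER_TURN      # about 1.5 Å along the axis per residue
    rotation = 360.0 / RESIDUES_PER_TURN  # about 100 degrees per residue
    return {
        "rise_per_residue": rise,
        "rotation_per_residue": rotation,
        "turns": n_residues / RESIDUES_PER_TURN,
        "length": n_residues * rise,      # approximate length in Å
    }
```

With these values a typical helix of 10-15 residues is roughly 15-22.5 Å long.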

11


Types of Secondary Structural Elements

β-sheet:

Second most regular and recognizable structure

Periodic element formed from β-strands

Hydrogen bonds are formed between different strands rather than within one strand, so β-sheets are less favorable

Parallel: All strands run in the same direction

Anti-parallel: Neighboring strands run in opposite directions. Most common

Side chains are perpendicular to the plane of the hydrogen bonds

Length of a strand: 3-10 residues

12


Types of Secondary Structural Elements

Turns:

About 1/3 of all residues of globular proteins are involved in turns

General function: Reverse the direction of the peptide chain

Located on the protein surface, so they mostly contain charged and polar amino acids

Turns often connect anti-parallel β-strands; these are named β-turns or hairpin bends. They often contain only 2 residues

Schematic form:

Very helpful tool to see and understand the overall structure of a protein

Side chains are often omitted to give a clearer picture

13

Homologous Proteins

Proteins that have evolved from a common ancestor are said to be homologous.

The 3D structure of homologous proteins is more conserved than the sequence identity.

The structure is crucial for the function.

Trypsin and α-chymotrypsin belong to the serine protease family. They have only 44 % sequence identity but are very similar in structure.

Dissimilarities: Loop regions. The core is more conserved.

The structures of homologous proteins appear to be highly conserved during evolution; this is the basis for comparative protein modeling.

14

Comparative Protein Modeling

Known sequences vs. known 3D structures:

Protein sequence determination is much faster than determining the 3D structure by X-ray crystallography or NMR, so theoretical procedures for predicting the 3D structure on the basis of the sequence are needed

There is no general rule for the folding of a protein, so structural predictions are based on the conformation of available homologous reference proteins

Use the comparative protein modeling approach when:

A sequence is found to be homologous to another sequence with a known 3D structure; the method is then used to predict the structure of the unknown protein

Also called the homology modeling approach

15

Comparative Protein Modeling

Process:

1. Determination of proteins that are related to the protein being studied (sequence alignment)

2. Identification of structurally conserved regions (SCRs) and structurally variable regions (SVRs)

3. Alignment of the sequence of the unknown protein with those of the reference protein(s) within the SCRs

4. Construction of the SCRs of the target protein using coordinates from the template structure(s)

5. Construction of the SVRs

6. Side chain modeling

7. Structural refinement using energy minimization methods and molecular dynamics

16

Sequence Alignment

Sequence alignment by database search:

Major methods: FASTA and BLAST

Used in many available software packages: HOMOLOGY, MODELLER…

Sequence alignment is important in order to:

Find related sequences

Identify conserved regions

Find the amino acids of the known reference protein that correspond to those of the protein to be modeled. This is the basis for transferring the coordinates of the reference protein to the new protein, and it requires more sensitive and selective alignment procedures

Needleman and Wunsch algorithm (aligns two sequences)
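The Needleman and Wunsch algorithm fills a dynamic programming table with the best score for aligning every pair of sequence prefixes, then traces back from the last cell to recover the global alignment. The sketch below assumes the simplest possible scoring (identity scores and a constant penalty per gap position); the parameter values and the function name are illustrative choices, not values from the slides.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):   # first column: prefix of seq_a against gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):   # first row: prefix of seq_b against gaps only
        score[0][j] = j * gap

    def pair(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align residues
                              score[i - 1][j] + gap,             # gap in seq_b
                              score[i][j - 1] + gap)             # gap in seq_a

    # Trace back from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` with `A-GT`. In practice the identity scores would be replaced by a substitution matrix.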

17

Sequence Alignment

Optimal local alignment:

Finds the best local identity between two sequences

Only considers relatively conserved subsequences

Important tool for comparing sequences
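The slides do not name an algorithm for local alignment. A standard dynamic programming method for it is the Smith-Waterman algorithm, sketched here under assumed scoring values (match +2, mismatch -1, gap -2). It differs from a global alignment in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,                                 # start a new local alignment
                              score[i - 1][j - 1] + pair(i, j),  # align residues
                              score[i - 1][j] + gap,             # gap in seq_b
                              score[i][j - 1] + gap)             # gap in seq_a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Only the conserved subsequence is reported; the unrelated flanking residues of both sequences are left out of the alignment.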

18

Sequence Alignment

Scoring scheme:

Indicates the weight for substituting one amino acid with another; the weights are collected in matrices

High number: Substitution is likely

Low number: Substitution is unlikely

Different kinds of matrices:

Identity matrix: The simplest; gives 1 to identical pairs and 0 to all others

Codon substitution matrix: Scoring values are derived from the codons. Identical amino acids get 9; if one mutation is required the score is 3, and if two are required the score is 1

Mutation or Dayhoff matrix: Obtained by counting the number of substitutions of one amino acid by others observed in related proteins across species. Large scores are given to substitutions that are frequently observed, and low scores to substitutions that are not observed.
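The first two matrix types can be computed directly. The sketch below builds the identity score and a codon substitution score from the standard genetic code, using the values given above (9, 3 and 1). The score of 0 for amino acid pairs that need three mutations is an assumption, since the slide does not state it, and all names in the code are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}  # one-letter amino acid -> list of its codons
for codon, aa in zip(product(BASES, repeat=3), CODE):
    if aa != "*":
        CODONS.setdefault(aa, []).append("".join(codon))

# Scores for 0, 1 and 2 mutations are from the slide; 3 -> 0 is an assumption.
SCORE_BY_MUTATIONS = {0: 9, 1: 3, 2: 1, 3: 0}

def min_mutations(aa1, aa2):
    """Smallest number of base changes that turns a codon of aa1 into a codon of aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])

def codon_score(aa1, aa2):
    return SCORE_BY_MUTATIONS[min_mutations(aa1, aa2)]

def identity_score(aa1, aa2):
    return 1 if aa1 == aa2 else 0
```

For instance, phenylalanine (TTT, TTC) and leucine (TTA, TTG, ...) differ by a single base, so `codon_score("F", "L")` is 3, while the identity matrix scores the same pair as 0.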

19

Sequence Alignment

Dayhoff matrix:

Gives larger scores to some non-identical substitutions than to some identical pairs

Statistical method

20

Sequence Alignment

21

Gaps:

Differences in sequence length or variations in the location of conserved regions complicate the alignment, so gaps are introduced

An additional factor is introduced in the alignment algorithms: the gap penalty function

The balance between the number of aligned amino acids and the smallest number of required gaps leads to an optimal alignment

Combination of an alignment algorithm, a scoring matrix and a gap penalty function:

Optimal alignment of two or more sequences

The quality of the alignment is described by an alignment score

The derived alignment can only be used as a basis for a protein model if it agrees with all known structural data

Sequence Alignment

22

Problems:

Sequence similarity is lost more quickly during evolution than structural similarity

This makes it difficult to formulate simple rules

Investigations to solve the problems:

Doolittle et al.: Rules of thumb. Sequences that are longer than 100 residues and more than 25% identical are very likely to be related. If the identity is 15-25% the sequences may still be related, and if the identity is less than 15% they are probably not related

Chothia and Lesk: Success in modeling the structure of a protein from its sequence, using the 3D structure of a homologous protein as template, depends very much on the sequence identity being above 50%
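The Doolittle rules of thumb above translate into a short check on an existing alignment. In this sketch the percent identity is computed over the aligned positions that contain no gap, which is one of several conventions in use and is an assumption here; the function names and the returned labels are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues among aligned, gap-free positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def doolittle_relatedness(identity_percent, length):
    """Apply the rule of thumb; it only covers sequences longer than 100 residues."""
    if length <= 100:
        return "rule of thumb not applicable"
    if identity_percent > 25:
        return "very likely related"
    if identity_percent >= 15:
        return "may be related"
    return "probably not related"
```

By this rule trypsin and α-chymotrypsin, at 44 % identity, fall clearly in the first class, assuming both sequences are longer than 100 residues.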

Determination and Generation of Structurally Conserved Regions (SCRs)

23

Building a model protein with the homology approach relies on the fact that all proteins belonging to the same family contain regions that are nearly identical in their 3D structures.

These regions tend to be located in the inner core of the protein.

SCRs:

In strongly related proteins these regions have the same relative orientation of their secondary structural units in space (α-helices and β-sheets) throughout the whole family

Used as a natural framework for the atomic coordinates of another protein in the family



24

Find SCRs within a family:

Depends on the number of available crystal structures of related proteins

If more than one crystal structure is available, superimpose them on each other. This is done by a least-squares fitting method

The problem is to select the atoms to fit: the fit is done on the Cα atoms. The method can then be optimized by using only matching points located in the secondary structure

The resulting superimposed 3D structures tend to show that large parts of the two proteins are very similar; these appear to be the SCRs, while other regions differ very much

Keep in mind:

By definition an SCR must terminate at the end of a secondary structural unit, so the secondary structural elements must be assigned for the protein. They can be taken from:

Crystal files

Programs like DSSP or STRIDE (based on the H-bonding pattern or the main chain dihedral angles)
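Once two structures have been superimposed, candidate SCRs can be picked out as stretches where equivalent Cα atoms lie close together. The sketch below assumes that the least-squares fit has already been done and that both coordinate lists contain the same residues in the same order. The distance cutoff and the minimum segment length are hypothetical parameters, and the sketch does not enforce the rule that an SCR must end at the end of a secondary structural unit.

```python
import math

def rmsd(ca_a, ca_b):
    """Root mean square deviation between two equally long lists of Cα coordinates."""
    if len(ca_a) != len(ca_b) or not ca_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(ca_a, ca_b)) / len(ca_a))

def conserved_segments(ca_a, ca_b, cutoff=1.0, min_length=3):
    """Index ranges (start, end), inclusive, where Cα atoms lie within cutoff Å."""
    if len(ca_a) != len(ca_b):
        raise ValueError("coordinate lists must have the same length")
    close = [math.dist(p, q) <= cutoff for p, q in zip(ca_a, ca_b)]
    segments, start = [], None
    for i, ok in enumerate(close + [False]):  # sentinel closes a trailing segment
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    return segments
```

The segments found this way would then be trimmed to the boundaries of the secondary structure elements assigned from the crystal files or by DSSP or STRIDE.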



25

Find SCRs within a family:

If only one crystal structure is available, detect the SCRs manually using both sequence and structural information of the proteins

Residues in the core are more conserved than residues on the surface

Amino acids involved in hydrogen bonds and disulfide bridges are most likely to be conserved within the protein family. The residues in the active site also tend to be conserved

If the SCRs of the reference protein are known:

Find the regions of the model protein that correspond to the SCRs

Done by aligning the target sequence with the sequences of the SCRs

No gaps are allowed in the SCRs, so different alignment programs are needed

After the alignment, the coordinates of the SCRs can be assigned using the coordinates of the reference protein as a basis

Segments with identical side chains: All coordinates are used

Segments with non-identical side chains: Only the backbone coordinates are used
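The coordinate transfer rule above can be written down compactly. This sketch applies the rule residue by residue rather than segment by segment, which is a simplification, and it assumes a gap-free alignment inside the SCR. The data layout (a list of residue name and atom dictionary pairs) and the function name are hypothetical.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}  # PDB atom names of the main chain

def transfer_scr_coordinates(template_residues, target_sequence):
    """Copy SCR coordinates from a template to a target.

    template_residues: list of (residue_name, {atom_name: (x, y, z)})
    target_sequence:   list of residue names, aligned one to one (no gaps)
    """
    if len(template_residues) != len(target_sequence):
        raise ValueError("no gaps are allowed inside an SCR")
    model = []
    for (template_name, atoms), target_name in zip(template_residues, target_sequence):
        if template_name == target_name:
            copied = dict(atoms)  # identical residue: keep all coordinates
        else:
            # different residue: keep the backbone, the side chain is modeled later
            copied = {name: xyz for name, xyz in atoms.items() if name in BACKBONE_ATOMS}
        model.append((target_name, copied))
    return model
```

Residues that keep only their backbone atoms are the ones handed over to the side chain modeling step.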

Construction of Structurally Variable Regions (SVRs)

26

SVRs:

Normally occur in loop regions

Construction of these regions is more difficult

Insertions and deletions complicate the modeling procedure because the number of amino acids varies

A good guide for modeling a missing region: Use a segment of similar length from a homologous protein

Studies have shown that loops with the same length and amino acid character have the same conformation

The coordinates can then be transferred to the model protein

Construction of Structurally Variable Regions (SVRs)

27

If no comparable loops exist in the protein family, the coordinates for the SVRs can be obtained by:

Loop search method: A peptide segment that is found in other proteins and fits into the model’s spatial environment is used

De novo generation: A loop segment is generated de novo. A peptide chain is built between two conserved segments using randomly generated values for all backbone dihedral angles. Rather complex, so it can only be used when the loop is shorter than seven residues

All loops should be refined by energy minimization in order to remove steric hindrance and relax the loop conformations

Side-Chain Modeling

28

Once the backbone is constructed, the next step is to add the side chains:

The prediction of the side chain conformations is a much more complex process

Many of the side chains have one or more degrees of freedom and can adopt many energetically allowed conformations

It is generally assumed that identical residues in homologous proteins adopt similar conformations

Side chains of highly similar amino acids (isoleucine and leucine or valine) are also assumed to adopt the same orientation in the protein

Difficult when the substituted amino acids are not related

Studies have shown that side chains usually adopt only a small number of the many possible conformations; these are collected in statistical rotamer libraries

Still difficult because the conformation depends on the local environment

Final Model

29

A refinement of the final model is often desirable because:

Regions where SCRs and SVRs are connected often have a lot of steric strain and need to be energy minimized

Several side chains adopt positions with bad van der Waals contacts

A stepwise refinement is needed, because minimizing all the residues at once will destroy important internal hydrogen bonds

Secondary Structure Prediction

30

For cases where no homologous protein exists, methods for predicting the secondary structure have been developed

90% of the residues are in α-helices, β-sheets or reverse turns

If these are predicted, it seems possible to combine the segments into a complete protein structure

Not as reliable as homology modeling

Three different methods:

Statistical

Stereochemical

Neural network-based


31

Statistical method:

First to be developed

Idea: Many of the 20 amino acids have preferred secondary structures

Ala, Arg, Gln, Glu, Met, Leu and Lys: α-helix

Cys, Ile, Phe, Thr, Tyr and Val: β-sheet

The simplest method was proposed by Chou and Fasman

It calculates the probability that an amino acid is in a given secondary structure from its frequency in the different structures found in the PDB

Limitations: Below 56% accuracy in predicting helices, sheets and loops
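The statistical idea can be illustrated by computing a propensity for each amino acid: the fraction of its occurrences found in a given secondary structure, divided by the fraction of all residues found in that structure. Values above 1 mark a preference. This sketch covers only the counting step, not the full Chou and Fasman prediction procedure, and it expects sequences with a per-residue structure string as input; the one-letter state codes (H for helix and so on) and the function name are assumptions.

```python
from collections import Counter

def propensities(sequences, structures, state="H"):
    """Propensity of each amino acid for one secondary structure state.

    sequences:  list of amino acid strings
    structures: list of equally long strings with one state letter per residue
    """
    total, in_state = Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must have equal length")
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_total, n_state = sum(total.values()), sum(in_state.values())
    if n_state == 0:
        return {}
    overall = n_state / n_total  # fraction of all residues in this state
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}
```

With a toy input of one sequence `AAGG` annotated `HHHC`, alanine gets a helix propensity of about 1.33 and glycine about 0.67; meaningful values require counting over many known structures.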


32

Stereochemical:

Based on the hydrophobic, hydrophilic and electrostatic properties of the side chains

The method of Lim: Takes into account the interactions between side chains separated by up to 3 residues in the sequence, in view of their packing behavior

A sequence with alternating hydrophobic and hydrophilic side chains is likely to be found in a β-sheet, with the hydrophilic residues exposed to the solvent and the hydrophobic residues buried in the interior of the protein
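The alternation pattern mentioned above is easy to search for in a sequence. The sketch below only finds the longest stretch of strictly alternating hydrophobic and hydrophilic residues; it is not the method of Lim. The set of residues treated as hydrophobic is an assumption, since classifications differ between hydrophobicity scales.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed classification, one-letter codes

def longest_alternating_run(sequence):
    """Longest stretch of alternating hydrophobic/hydrophilic residues.

    Returns (start_index, length); an empty sequence gives (0, 0).
    """
    if not sequence:
        return (0, 0)
    flags = [aa in HYDROPHOBIC for aa in sequence]
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(flags)):
        if flags[i] == flags[i - 1]:  # alternation broken, restart here
            start = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return (best_start, best_len)
```

A long alternating run is only a hint of a β-strand with one buried and one exposed face; the minimum length that counts as significant is left to the user.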

Neural network-based:

Uses neural networks, which can be trained. Rules are not needed in advance; they are formed by the network itself, based on known facts

More than 70% accuracy in the prediction of three classes of secondary structure on the basis of just one known homologous sequence

Fold recognition/threading Methods

34

Use when: The structural similarity is limited to only the part of the structure having a common

structural motif, and the rest is completely different

First methods: Recognize folds in the absence of sequence similarity.

Now: Comparative modeling and threading approaches are done simultaneously

Closely related to ab initio methods, but limited to searching among the conformations of known structures

Thus, threading methods fail for any protein that adopts a new fold
