Introduction to Comparative Protein Modeling
homes.nano.aau.dk/fp/md/Introduction to Comparative...
Chapter 4
Part I
Introduction to Comparative Protein Modeling
Information on Proteins
Every modeling study depends on the quality of the known experimental data, which form the basis of the model.
Search the literature:
The 3D structure of the protein of interest may already have been determined.
When you have a sequence, compare it with others to find similarities or differences:
Several algorithms have been developed for this.
A comparison takes only a few minutes.
It can reveal, for example, differences between a healthy and a diseased individual.
Databases
Nucleotide and protein sequences:
EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database
Universal Protein Resource (UniProt) database
3D structural information:
Protein Data Bank (PDB)
Searchable by author name, journal name, or part of a sequence
PDB file
The formats of structural data files are similar, and PDB files are widely used.
The standard layout of a PDB protein file is:
Header:
General information about the protein.
Includes the official name, references, the resolution of the crystal structure and other useful remarks.
Atomic coordinates:
Atoms belonging to standard amino acids are stored in ATOM records.
Individual peptide chains are separated by TER records.
Atoms of non-standard residues (and of other hetero groups such as ligands or water) are stored in HETATM records.
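The record types above can be read with a few lines of code. The following is a minimal sketch, assuming the fixed-column layout of the PDB coordinate records; `parse_pdb_atoms` and `demo_line` are hypothetical helper names, and the demo coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    record: str    # "ATOM" or "HETATM"
    name: str      # atom name, e.g. "CA"
    res_name: str  # residue name, e.g. "ALA"
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_pdb_atoms(text):
    """Collect ATOM/HETATM records and count TER records in PDB text."""
    atoms, ter_count = [], 0
    for line in text.splitlines():
        record = line[0:6].strip()
        if record == "TER":
            ter_count += 1
        elif record in ("ATOM", "HETATM"):
            atoms.append(Atom(
                record=record,
                name=line[12:16].strip(),      # columns 13-16
                res_name=line[17:20].strip(),  # columns 18-20
                chain_id=line[21],             # column 22
                res_seq=int(line[22:26]),      # columns 23-26
                x=float(line[30:38]),          # columns 31-38
                y=float(line[38:46]),          # columns 39-46
                z=float(line[46:54]),          # columns 47-54
            ))
    return atoms, ter_count


def demo_line(record, serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: write one coordinate line in the fixed columns."""
    return "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
        record, serial, name, res_name, chain, res_seq, x, y, z)


# Invented demo data: two protein atoms, a chain terminator, one water atom.
demo = "\n".join([
    demo_line("ATOM", 1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504),
    demo_line("ATOM", 2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147),
    "TER",
    demo_line("HETATM", 3, "O", "HOH", "A", 101, 2.000, 3.500, 4.250),
])
atoms, ter_count = parse_pdb_atoms(demo)
print(len(atoms), ter_count, atoms[2].record)
```

Keeping the record type on each atom makes it easy to list all HETATM entries later, when their atom types have to be checked by hand.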
PDB file
When the file is read into a modeling program, bonds are built between ATOM entries but not between HETATM entries.
An additional connectivity table (CONECT records) is given at the end of the data file.
The atom types of HETATM entries are often assigned incorrectly when the file is read into a modeling program, so all atom types must be checked.
PDB files of crystal structures usually do not include hydrogen atoms.
Keep in mind:
The resolution of the crystal structure should be at least in the range of 2.5 to 1.5 Å (a smaller value means a better resolution).
NMR measurements are performed in solution, so the results depend strongly on the solvent.
Protein Structure
The 3D structure of proteins is characterized by four levels of structural organization:
The primary structure is the linear arrangement of the amino acids in the protein sequence.
The secondary structure describes the local architecture of linear fragments of the chain (α-helix, β-sheet).
The supersecondary structure is an additional, intermediate level that describes the association of secondary structural elements through side chain interactions. It is also called a motif (hairpins, Greek key, ...).
The tertiary structure is the overall topology of the folded peptide chain.
The quaternary structure describes the arrangement of separate subunits in the functional protein.
Conformational Properties of Proteins
20 different amino acids occur in natural proteins.
The physicochemical properties of their side chains (size, shape, hydrophobicity, charge and hydrogen bonding) span a considerable range.
Their conformational degrees of freedom are strongly restricted.
The dominant influences on protein conformation are:
Hydrogen bonding capabilities
Chirality of the amino acids: all 19 chiral amino acids (glycine is not chiral) have the L-configuration
Linear connectivity
Steric volume
Conformational Properties of Proteins
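Central carbon: Cα
Backbone: amino N, Cα and carbonyl C (C’).
i is the residue number, counted from the amino end of the chain.
Main chain torsion angles:
ϕ: rotation about the N-Cα bond
ψ: rotation about the Cα-C’ bond
ω: rotation about the C’-N bond (peptide bond)
Side chain torsion angles:
χ: rotation about side chain bonds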
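Each of these torsion angles is the dihedral angle defined by four consecutive atoms. The sketch below computes a signed dihedral from four Cartesian points with the usual atan2 formula; the function name `dihedral` and the tuple-based vectors are illustrative choices, not part of the slides.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Planar test geometries: cis gives 0 degrees, trans gives 180 degrees.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(abs(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))))
```

For residue i, ϕ would be `dihedral(C’(i-1), N(i), Cα(i), C’(i))`, ψ would be `dihedral(N(i), Cα(i), C’(i), N(i+1))` and ω would be `dihedral(Cα(i), C’(i), N(i+1), Cα(i+1))`.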
Conformational Properties of Proteins
The peptide bond is planar.
It is nearly always in the trans configuration (ω = 180°), which is energetically more favorable than cis (ω = 0°).
The imino acid proline is sometimes found in the cis configuration.
Conformational Properties of Proteins
Rotation about ϕ and ψ:
Makes the peptide chain flexible.
Is geometrically constrained by steric hindrance between neighboring atoms.
The possible combinations of ϕ and ψ are shown in a Ramachandran plot:
White: disallowed region
Red: favored region
Yellow: allowed region
Sub-regions: α and β
Types of Secondary Structural Elements
α-helix:
The best known and most easily recognized secondary structure.
Repetitive structure: the Cα atoms are in identical relative positions, so the ϕ and ψ angles are the same for each residue in the helix.
Repeats itself every 5.4 Å (one turn), with 3.6 amino acids per turn.
Hydrogen bonds between the carbonyl of residue n and the NH of residue n+4 give a regular and favored state.
Right-handed helix, due to the L-amino acids.
Side chains point outwards.
Typical length: 10-15 residues.
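The two helix parameters given above fix the rise and rotation per residue. A small worked calculation, with the helix length taken simply as residues times rise (an approximation introduced here, not stated on the slide):

```python
PITCH_ANGSTROM = 5.4      # repeat distance of one turn, from the slide
RESIDUES_PER_TURN = 3.6   # from the slide

# Derived quantities: about 1.5 Å and about 100 degrees per residue.
rise_per_residue = PITCH_ANGSTROM / RESIDUES_PER_TURN
rotation_per_residue = 360.0 / RESIDUES_PER_TURN


def helix_length(n_residues):
    """Approximate length along the helix axis in Å (residues x rise)."""
    return n_residues * rise_per_residue


for n in (10, 15):
    print(n, "residues:", round(helix_length(n), 1), "Å")
```

A typical helix of 10-15 residues is therefore roughly 15 to 22.5 Å long.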
Types of Secondary Structural Elements
β-sheet:
The second most regular and recognizable structure.
A periodic element formed from β-strands.
Hydrogen bonds form between different strands rather than within one continuous segment, so β-sheets are less favorable.
Parallel: all strands run in the same direction.
Anti-parallel: neighboring strands run in opposite directions. This is the most common arrangement.
Side chains are perpendicular to the plane of the hydrogen bonds.
Length of a strand: 3-10 residues.
Types of Secondary Structural Elements
Turns:
About 1/3 of all residues of globular proteins are involved in turns.
General function: reverse the direction of the peptide chain.
Located on the protein surface, so they contain mostly charged and polar amino acids.
Turns often connect anti-parallel β-strands; these are called β-turns or hairpin bends and often contain only 2 residues.
Schematic form:
A very helpful tool to see and understand the overall structure of a protein.
Side chains are often omitted to give a clearer picture.
Homologous Proteins
Proteins that have evolved from a common ancestor are said to be homologous.
The 3D structure of homologous proteins is better preserved than the sequence identity.
The structure is crucial for the function.
Trypsin and α-chymotrypsin belong to the serine protease family. They share only 44 % sequence identity, but their structures are very similar.
Dissimilarities are found mainly in the loop regions; the core is better preserved.
The 3D structures of homologous proteins appear to be highly conserved during evolution, which is the basis for comparative protein modeling.
Comparative Protein Modeling
Known sequences vs. known 3D structures:
Determining a protein sequence is much faster than determining the 3D structure by X-ray crystallography or NMR, so theoretical procedures for predicting the 3D structure from the sequence are needed.
There is no general rule for the folding of a protein, so structural predictions are based on the conformations of available homologous reference proteins.
Use the comparative protein modeling approach when:
A sequence is found to be homologous to another sequence with a known 3D structure. The method is then used to predict the structure of the unknown protein.
Also called the homology modeling approach.
Comparative Protein Modeling
Process:
1. Determination of proteins that are related to the protein being studied (sequence alignment)
2. Identification of structurally conserved regions (SCRs) and structurally variable regions (SVRs)
3. Alignment of the sequence of the unknown protein with those of the reference protein(s) within the SCRs
4. Construction of the SCRs of the target protein using coordinates from the template structure(s)
5. Construction of the SVRs
6. Side chain modeling
7. Structural refinement using energy minimization methods and molecular dynamics
Sequence Alignment
Sequence alignment by database search:
Major methods: FASTA and BLAST
Used in many available software packages: HOMOLOGY, MODELLER, ...
Sequence alignment is important because it is used to:
Find related sequences
Identify conserved regions
Find the amino acids of the known reference protein that correspond to those of the protein to be modeled. This is the basis for transferring the coordinates of the reference protein to the new protein.
This last step needs more sensitive and selective alignment procedures:
Needleman and Wunsch algorithm (aligns two sequences)
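A minimal sketch of a global alignment in the style of Needleman and Wunsch is given below. It assumes a simple identity-like score and a constant penalty per gap position; the parameter values and the function name `needleman_wunsch` are illustrative, not taken from the slides or from any particular software package.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACDE", "ADE"))
```

In practice the match and mismatch values would be replaced by a lookup in a scoring matrix.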
Sequence Alignment
Optimal local alignment:
Finds the best local identity between two sequences.
Considers only relatively conserved subsequences.
An important tool for comparing sequences.
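One common way to compute an optimal local alignment is the Smith-Waterman dynamic programming scheme. The slides do not name an algorithm, so this choice, the scoring values and the function name `smith_waterman` are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores never drop below zero, so an alignment can start anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACDEFKKK", "TTACDEFPP"))
```

Only the conserved stretch shared by both sequences is reported; the unrelated flanking residues do not contribute to the score.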
Sequence Alignment
Scoring scheme:
Gives the weight for substituting one amino acid with another, collected in scoring matrices.
High number: substitution is likely
Low number: substitution is unlikely
Different kinds of matrices:
Identity matrix: the simplest one; gives 1 to identical pairs and 0 to all others.
Codon substitution matrix: scoring values are derived from the codons. Identical amino acids get 9; if one mutation is required the score is 3, and if two are required the score is 1.
Mutation or Dayhoff matrix: obtained by counting the substitutions of one amino acid by another observed in related proteins across species. Large scores are given to substitutions that occur frequently, and low scores to substitutions that are not observed.
Sequence Alignment
Dayhoff matrix:
Gives larger scores to some non-identical substitutions than to some identical ones.
A statistical method.
Sequence Alignment
Gaps:
Differences in sequence length, or variations in the location of conserved regions, complicate the alignment, so gaps are introduced.
An additional factor is introduced in the alignment algorithms: the gap penalty function.
The balance between the number of aligned amino acids and the smallest number of required gaps leads to an optimal alignment.
Combination of an alignment algorithm, a scoring matrix and a gap function:
Gives an optimal alignment of two or more sequences.
The quality of the alignment is described by an alignment score.
The derived alignment can only be used as the basis for a protein model if it agrees with all known structural data.
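How a scoring matrix and a gap penalty function combine into one alignment score can be sketched as follows. The linear and affine penalty forms are two common choices; the slides do not say which form is meant, and all numerical values here are illustrative assumptions.

```python
def identity_score(x, y):
    """Identity matrix: 1 for identical pairs, 0 for all others."""
    return 1.0 if x == y else 0.0


def linear_gap_penalty(length, per_residue=2.0):
    """Penalty proportional to the gap length (assumed value)."""
    return per_residue * length


def affine_gap_penalty(length, opening=10.0, extension=0.5):
    """Opening a gap costs more than extending it (assumed values)."""
    if length <= 0:
        return 0.0
    return opening + extension * (length - 1)


def alignment_score(aligned_a, aligned_b, substitution, gap_penalty):
    """Sum of substitution scores minus one penalty per run of gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total, run, run_in = 0.0, 0, None
    for x, y in zip(aligned_a, aligned_b):
        gap_in = "a" if x == "-" else "b" if y == "-" else None
        if gap_in != run_in and run:   # a run of gaps has just ended
            total -= gap_penalty(run)
            run = 0
        if gap_in is None:
            total += substitution(x, y)
        else:
            run += 1
        run_in = gap_in
    if run:                            # gap run reaching the end
        total -= gap_penalty(run)
    return total


print(alignment_score("AC--DE", "ACGGDE", identity_score, affine_gap_penalty))
print(alignment_score("AC--DE", "ACGGDE", identity_score, linear_gap_penalty))
```

With an affine penalty, one long gap is cheaper than several short ones of the same total length, which favors alignments with few, contiguous gaps.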
Sequence Alignment
Problems:
Sequence similarity is lost more quickly during evolution than structural similarity.
This makes it difficult to formulate simple rules.
Investigations to solve the problems:
Doolittle et al.: rules of thumb. Sequences that are longer than 100 residues and more than 25 % identical are very likely to be related. If the identity is 15-25 % the sequences may still be related, and if the identity is less than 15 % they are probably not related.
Chothia and Lesk: success in modeling the structure of a protein from its sequence, using the 3D structure of a homologous protein as template, depends very much on the sequence identity, which should be above 50 %.
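The rules of thumb above can be written as a small classifier. This is a sketch only: the identity is computed here over aligned, gap-free columns, which is one of several possible definitions, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns without a gap (assumed definition)."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def relatedness(identity_percent, length):
    """Apply the quoted rules of thumb for sequences longer than 100 residues."""
    if length <= 100:
        return "rule of thumb not applicable"
    if identity_percent > 25:
        return "very likely related"
    if identity_percent >= 15:
        return "may be related"
    return "probably not related"


print(percent_identity("ACDE", "ACGE"))
print(relatedness(30.0, 150), "|", relatedness(20.0, 150), "|", relatedness(10.0, 150))
```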
Determination and Generation of Structurally Conserved Regions (SCRs)
Building a model protein with the homology approach relies on the fact that all proteins belonging to the same family contain regions that are nearly identical in their 3D structures.
These regions tend to be located in the inner core of the protein.
SCRs:
In strongly related proteins these regions have the same relative orientations of their secondary structural units (α-helices and β-sheets) in space throughout the whole family.
They are used as a natural framework for the atomic coordinates of another protein in the family.
Determination and Generation of Structurally Conserved Regions (SCRs)
Find SCRs within a family:
The procedure depends on the number of available crystal structures of related proteins.
If more than one crystal structure is available, superimpose them on each other with a least-squares fitting method.
The problem is the selection of the fitting atoms: fit by the Cα atoms. The fit can then be optimized by using only matching points located in the secondary structure.
The superimposed 3D structures tend to show that large parts of the two proteins are very similar, and these appear to be the SCRs, while other regions differ very much.
Keep in mind:
By definition an SCR must terminate at the end of a secondary structural unit, so the secondary structural elements of the protein must be assigned. They can be taken from:
Crystal structure files
Programs like DSSP or STRIDE (based on the hydrogen-bonding pattern or the main chain dihedral angles)
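The quantity that a least-squares fit of Cα atoms minimizes is the root-mean-square deviation (RMSD) between matched atoms. The sketch below covers only the translational part of the fit (moving both centroids to the origin) and the RMSD itself; the optimal rotation, commonly obtained with the Kabsch algorithm, is deliberately left out, and the coordinates are invented.

```python
import math


def centroid(coords):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def center(coords):
    """Translate the coordinates so that their centroid is at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched atoms (no fitting done here)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a[k] - b[k]) ** 2
                  for a, b in zip(coords_a, coords_b) for k in range(3))
    return math.sqrt(squared / len(coords_a))


# Invented Cα positions; the second set is the first one shifted in space.
ca_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
ca_b = [(x + 5.0, y - 2.0, z + 3.0) for x, y, z in ca_a]
print(rmsd(ca_a, ca_b), rmsd(center(ca_a), center(ca_b)))
```

Regions whose matched Cα atoms stay close after the full superposition are the candidates for SCRs.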
Determination and Generation of Structurally Conserved Regions (SCRs)
Find SCRs within a family:
If only one crystal structure is available, detect the SCRs manually using both sequence and structural information on the proteins.
Residues in the core are more conserved than residues on the surface.
Amino acids involved in hydrogen bonds and disulfide bridges are most likely to be conserved within the protein family. Residues in the active site also tend to be conserved.
If the SCRs of the reference protein are known:
Find the regions of the model protein that correspond to the SCRs.
This is done by aligning the target sequence with the sequences of the SCRs.
No gaps are allowed in the SCRs, so different programs are needed.
After the alignment, the coordinates of the SCRs can be assigned using the coordinates of the reference protein as a basis:
Segments with identical side chains: all coordinates are used.
Segments with non-identical side chains: only the backbone coordinates are used.
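The coordinate transfer rule can be sketched in a few lines. The sketch applies the rule per residue rather than per segment and counts N, CA, C and O as backbone atoms; both are simplifying assumptions, and the data layout and function name are hypothetical.

```python
BACKBONE = {"N", "CA", "C", "O"}   # assumption: carbonyl O counted as backbone


def transfer_scr_coordinates(template_residues, target_sequence):
    """Copy template coordinates onto the target within one gap-free SCR.

    template_residues: list of (residue_name, {atom_name: (x, y, z)})
    target_sequence:   list of residue names of the same length
    """
    if len(template_residues) != len(target_sequence):
        raise ValueError("no gaps are allowed within an SCR")
    model = []
    for (template_name, atoms), target_name in zip(template_residues,
                                                   target_sequence):
        if template_name == target_name:
            kept = dict(atoms)                 # identical: all coordinates
        else:
            kept = {name: xyz for name, xyz in atoms.items()
                    if name in BACKBONE}       # different: backbone only
        model.append((target_name, kept))
    return model


# Invented two-residue template segment.
template = [
    ("ALA", {"N": (0, 0, 0), "CA": (1, 0, 0), "C": (2, 0, 0),
             "O": (2, 1, 0), "CB": (1, 1, 0)}),
    ("SER", {"N": (3, 0, 0), "CA": (4, 0, 0), "C": (5, 0, 0),
             "O": (5, 1, 0), "CB": (4, 1, 0), "OG": (4, 2, 0)}),
]
model = transfer_scr_coordinates(template, ["ALA", "THR"])
print([(name, sorted(atoms)) for name, atoms in model])
```

The side chains that are dropped here are the ones rebuilt later in the side chain modeling step.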
Construction of Structurally Variable Regions (SVRs)
SVRs:
Normally occur in loop regions.
Their construction is more difficult.
Insertions and deletions complicate the modeling procedure because the number of amino acids varies.
A good guide for modeling a missing region: use a segment of similar length from a homologous protein.
Studies have shown that loops with the same length and amino acid character adopt the same conformation.
The coordinates can then be transferred to the model protein.
Construction of Structurally Variable Regions (SVRs)
If no comparable loop exists in the protein family, the coordinates for the SVRs can be retrieved from:
Loop search method: a peptide segment that is found in other proteins and fits into the model’s spatial environment is used.
De novo generation: a loop segment is generated from scratch. A peptide chain is built between two conserved segments using randomly generated values for all backbone dihedral angles. This is rather complex, so it can only be used when the loop is shorter than seven residues.
All loops should be refined by energy minimization in order to remove steric hindrance and relax the loop conformations.
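The random sampling step of de novo loop generation might look like the sketch below. It only draws candidate backbone dihedral angles; building coordinates, checking that the loop closes between the two conserved segments and the energy minimization are not shown. Keeping ω fixed at 180° (trans) is an assumption made here, and the function name is hypothetical.

```python
import random


def random_loop_dihedrals(n_residues, rng=None, max_residues=6):
    """Draw random (phi, psi, omega) triples in degrees for a short loop."""
    if not 1 <= n_residues <= max_residues:
        raise ValueError("de novo generation is limited to loops shorter "
                         "than seven residues")
    rng = rng or random.Random()
    return [(rng.uniform(-180.0, 180.0),   # phi
             rng.uniform(-180.0, 180.0),   # psi
             180.0)                        # omega: trans peptide bond (assumed)
            for _ in range(n_residues)]


candidate = random_loop_dihedrals(5, rng=random.Random(0))
for phi, psi, omega in candidate:
    print(round(phi, 1), round(psi, 1), omega)
```

In a real procedure many such candidates would be generated, and only those that close the gap without steric clashes would be kept for refinement.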
Side-Chain Modeling
Once the backbone is constructed, the next step is to add the side chains:
Predicting the side chain conformations is a much more complex process.
Many side chains have one or more degrees of freedom and can adopt many energetically allowed conformations.
It is generally assumed that identical residues in homologous proteins adopt similar conformations.
Side chains of highly similar amino acids (isoleucine and leucine or valine) are also assumed to adopt the same orientation in the protein.
The prediction is difficult when the substituted amino acids are not related.
Studies have shown that side chains usually adopt only a small number of the many possible conformations; these are collected in statistical rotamer libraries.
The prediction remains difficult because the conformations depend on the local environment.
Final Model
A refinement of the final model is often desirable because:
Regions where SCRs and SVRs are connected often have a lot of steric strain and need to be minimized.
Several side chains adopt positions with bad van der Waals contacts.
A stepwise refinement is needed, because minimizing all residues at once will destroy important internal hydrogen bonds.
Secondary Structure Prediction
For cases where no homologous protein exists, methods for predicting the secondary structure have been developed:
90 % of the residues are in α-helices, β-sheets or reverse turns.
If these are predicted, it seems possible to combine the segments into a complete protein structure.
Not as reliable as homology modeling.
Three different methods:
Statistical
Stereochemical
Neural network-based
Secondary Structure Prediction
Statistical method:
The first to be developed.
Idea: many of the 20 amino acids have preferred secondary structures.
Ala, Arg, Gln, Glu, Met, Leu and Lys: α-helix
Cys, Ile, Phe, Thr, Tyr and Val: β-sheet
The simplest method was proposed by Chou and Fasman.
It calculates the probability that an amino acid is in a given secondary structure from its frequency in the different structures found in the PDB.
Limitations: below 56 % accuracy in predicting helices, sheets and loops.
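The frequency idea behind the statistical method can be illustrated with a propensity calculation: the fraction of an amino acid found in a given state, divided by the fraction of all residues in that state. The data below are a tiny invented example, not PDB statistics, and the sketch stops at the propensities; it is not the full Chou and Fasman prediction procedure.

```python
from collections import Counter


def propensities(residues, states):
    """Propensity of each (amino acid, state) pair.

    residues: one-letter amino acid codes
    states:   same-length string, e.g. 'H' helix, 'E' strand, 'C' other
    A value above 1 means the residue prefers that state.
    """
    if len(residues) != len(states):
        raise ValueError("residues and states must have equal length")
    total = len(residues)
    aa_count = Counter(residues)
    state_count = Counter(states)
    pair_count = Counter(zip(residues, states))
    result = {}
    for (aa, state), n in pair_count.items():
        fraction_of_aa_in_state = n / aa_count[aa]
        fraction_of_all_in_state = state_count[state] / total
        result[(aa, state)] = fraction_of_aa_in_state / fraction_of_all_in_state
    return result


# Invented toy data: Ala mostly helical, Val mostly in strands.
toy_residues = "AAAAVVVVGG"
toy_states = "HHHCEEECCC"
table = propensities(toy_residues, toy_states)
for key in sorted(table):
    print(key, round(table[key], 3))
```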
Secondary Structure Prediction
Stereochemical:
Based on the hydrophobic, hydrophilic and electrostatic properties of the side chains.
The method of Lim: takes into account the interactions between side chains separated by up to 3 residues in the sequence, in view of their packing behavior.
A sequence with alternating hydrophobic and hydrophilic side chains is likely to be found in a β-sheet, with the hydrophilic residues exposed to the solvent and the hydrophobic residues buried in the interior of the protein.
Neural network-based:
Uses neural networks, which can be trained: rules are not needed in advance but are formed by the network itself, based on known facts.
More than 70 % accuracy in the prediction of three classes of secondary structure on the basis of just one known homologous sequence.
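The alternating hydrophobic and hydrophilic pattern mentioned for the stereochemical approach can be searched for with a simple scan. This is a much simplified illustration and not the method of Lim; the two-class split of the amino acids and the minimum length are assumptions.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumption: a simple two-class split


def alternating_segments(sequence, min_length=4):
    """Return (start, end) index pairs of stretches in which hydrophobic and
    hydrophilic residues strictly alternate; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(sequence) + 1):
        at_end = i == len(sequence)
        # A stretch ends at the sequence end or when two neighbors share a class.
        if at_end or ((sequence[i] in HYDROPHOBIC)
                      == (sequence[i - 1] in HYDROPHOBIC)):
            if i - start >= min_length:
                segments.append((start, i))
            start = i
    return segments


seq = "KKVKVKVKK"
for start, end in alternating_segments(seq):
    print(start, end, seq[start:end])
```

Stretches found this way are only candidates for β-strands; a real predictor would also weigh the neighboring residues and the packing behavior.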
Fold Recognition/Threading Methods
Use when:
The structural similarity is limited to the part of the structure that has a common structural motif, and the rest is completely different.
First methods: recognize folds in the absence of sequence similarity.
Now: comparative modeling and threading approaches are carried out simultaneously.
Closely related to ab initio methods, but limited to searching among the conformations of known structures.
Thus, threading methods fail for any protein that adopts a new fold.