Chap. 11 Protein Structures
Amino Acid
• R: large white and gray
• C: black
• Nitrogen: blue
• Oxygen: red
• Hydrogen: white
General structure of amino acids
• An amino group
• A carboxyl group
• An α-carbon bonded to a hydrogen and a side-chain group, R
• The side chain R determines the identity of a particular amino acid
Protein
• Protein: a polymer consisting of amino acids (AAs) linked by peptide bonds
• An AA in a polymer is called a residue
• Folded into 3D structures
• The structure of a protein determines its function
• Primary structure: linear arrangement of AAs
• The AA sequence (primary structure) determines the 3D structure of a protein, which in turn determines its properties
• N- and C-terminal
• Secondary structure: short stretches of AAs
• Tertiary structure: overall 3D structure
Protein Structures
Secondary structure
• Secondary structures have repetitive interactions resulting from hydrogen bonding between the N-H and carbonyl (C=O) groups of the peptide backbone
• Conformations of the side chains of AAs are not part of the secondary structure
α-helix
Secondary structure β-pleated sheet
Parallel/antiparallel
3D form of antiparallel
Secondary structure: domain
(a) βαβ unit
(b) αα unit (helix-turn-helix)
(c) β-meander
(d) Greek key
• Part of a chain can fold independently of the folding of other parts
• Such an independently folded portion of a protein is called a domain (super-secondary structure)
Domain
• Larger proteins are modular
• Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins
• Domains are not only structurally, but also functionally, discrete units: domain family members are structurally and functionally conserved and recombined in complex ways during evolution
• Domains can be seen as the units of evolution
• Novelty in protein function often arises as a result of gain or loss of domains, or by re-shuffling existing domains along the sequence
• For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)
DNA binding domains http://en.wikipedia.org/wiki/DNA-binding_domain
Motif
• A short, conserved region (frequently the most conserved region of a domain)
• Critical for the domain to function
Domain vs. Motif
• Motifs are structural characteristics
• Domains are functional regions, usually consisting of a few motifs
Motif Representation
Motif
• In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks
• These tend to correspond to core structural and functional elements of the proteins
Motif
(a) complement control protein module
(b) Immunoglobulin module
(c) Fibronectin type I module
(d) Growth factor module
(e) Kringle module
The Greek key motif is often found in β-barrel tertiary structure
(a) Linked series of β-meanders
(b) Greek key pattern
(c) Alternating βαβ units
(d) Top and side views (α-helical section is outside)
Secondary structure: conformation
(a) Schematic diagrams of fibrous and globular proteins
(b) Computer-generated model of globular protein
Two types of Protein Conformations
• Fibrous
• Globular: folds back onto itself to create a spherical shape
Secondary Structure Prediction
• Ab initio prediction (from AA sequence) is still an open problem
• 1974: Peter Chou and Gerald Fasman
• Use known structures to determine which AA contributes to each secondary structure
• Propensity values: likelihood that an AA appears in a particular structure. P(a), P(b) and P(turn) greater than 1 (the table lists them multiplied by 100, so greater than 100) indicate a greater than average chance (log-odds ratios)
• Frequency values: frequency of an AA being found in a hairpin
• Four positions in a hairpin beta-turn: f(i), f(i+1), f(i+2), f(i+3)
• Accuracy is around 50-60%, but the method remains popular because it is the foundation for later prediction programs
AA             P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142    83     66     0.060  0.076   0.035   0.058
Arginine        98    93     95     0.070  0.106   0.099   0.085
Asparagine      67    89     95     0.161  0.083   0.191   0.091
Aspartic acid  101    54    146     0.147  0.110   0.179   0.081
Cysteine        70   119    119     0.149  0.050   0.117   0.128
Glutamic acid  151    37     74     0.056  0.060   0.077   0.064
Glutamine      111   110     98     0.074  0.098   0.037   0.098
Glycine         57    75    156     0.102  0.085   0.190   0.152
Histidine      100    87     95     0.140  0.047   0.093   0.054
Isoleucine     108   160     47     0.043  0.034   0.013   0.054
Leucine        121   130     59     0.061  0.025   0.036   0.070
Lysine         114    74    101     0.055  0.115   0.072   0.095
Methionine     145   105     60     0.068  0.082   0.014   0.055
Phenylalanine  113   138     60     0.059  0.041   0.065   0.065
Proline         57    55    152     0.102  0.301   0.034   0.068
Serine          77    75    143     0.120  0.139   0.125   0.106
Threonine       83   119     96     0.086  0.108   0.065   0.079
Tryptophan     108   137     96     0.077  0.013   0.064   0.167
Tyrosine        69   147    114     0.082  0.065   0.114   0.125
Valine         104   170     50     0.062  0.048   0.028   0.053
Chou-Fasman Algorithm
Step 1: identify alpha-helices
• Find a region of six contiguous residues where at least four have P(a) > 103
• Extend the region until a set of four contiguous residues with P(a) < 100 is found
• If the region's average P(a) > 103, its length is > 5, and ∑P(a) > ∑P(b), call it alpha
Step 2: beta strands
• Find a region of five contiguous residues where at least three have P(b) > 105
• Extend the region until a set of four contiguous residues with P(b) < 100 is found
• If the region's average P(b) > 105 and ∑P(b) > ∑P(a), call it beta
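The nucleation and acceptance tests of Steps 1 and 2 can be sketched in Python as below. This is only an illustrative sketch: the function names are hypothetical, the one-letter residue codes are the standard ones, the P(a)/P(b) values are copied from the propensity table in this section, and the extension step is deliberately left out.

```python
# P(a) and P(b) propensities (x100) copied from the table in this section,
# keyed by standard one-letter amino acid codes.
P_ALPHA = {"A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
           "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
           "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 104}
P_BETA = {"A": 83, "R": 93, "N": 89, "D": 54, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def nucleation_sites(sequence, propensity, window, min_strong, threshold):
    """Return start indices of windows with >= min_strong residues above threshold."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        strong = sum(1 for aa in segment if propensity[aa] > threshold)
        if strong >= min_strong:
            sites.append(start)
    return sites


def accepts_region(region, primary, other, threshold, min_length=0):
    """Final test: length, average propensity, and sum comparison for a region."""
    total_primary = sum(primary[aa] for aa in region)
    total_other = sum(other[aa] for aa in region)
    return (len(region) > min_length
            and total_primary / len(region) > threshold
            and total_primary > total_other)


# Helix: window of 6, at least 4 residues with P(a) > 103, length > 5.
helix_starts = nucleation_sites("EEEEEE", P_ALPHA, 6, 4, 103)
is_helix = accepts_region("EEEEEE", P_ALPHA, P_BETA, 103, min_length=5)

# Strand: window of 5, at least 3 residues with P(b) > 105.
strand_starts = nucleation_sites("VVVVV", P_BETA, 5, 3, 105)
is_strand = accepts_region("VVVVV", P_BETA, P_ALPHA, 105)
```

The same two helpers serve both steps; only the window size, counts and thresholds change.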
Chou-Fasman Algorithm
Step 3: beta turns
• For each residue position j, determine the turn propensity P(t) as
  P(t)_j = f(i)_j × f(i+1)_(j+1) × f(i+2)_(j+2) × f(i+3)_(j+3)
• Call a turn at position j if P(t) > 0.000075, the average P(turn) from j to j+3 is > 100, and ∑P(a) < ∑P(turn) > ∑P(b)
Step 4: overlaps
• If an alpha region overlaps with a beta region, the ∑P(a) and ∑P(b) of the overlapped region determine the most likely structure there
• If ∑P(a) > ∑P(b) for the overlapping region, alpha
• If ∑P(a) < ∑P(b) for the overlapping region, beta
• If ∑P(a) = ∑P(b), no valid call
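As a rough illustration of Steps 3 and 4, the sketch below computes the turn product for a tetrapeptide and applies the overlap rule. The function names are hypothetical, and the bend-frequency dictionary holds only a few residues copied from the table in this section to keep the example short.

```python
# Bend frequencies (f(i), f(i+1), f(i+2), f(i+3)) for a few residues,
# copied from the propensity table in this section (subset for brevity).
BEND = {
    "A": (0.060, 0.076, 0.035, 0.058),
    "N": (0.161, 0.083, 0.191, 0.091),
    "D": (0.147, 0.110, 0.179, 0.081),
    "G": (0.102, 0.085, 0.190, 0.152),
    "P": (0.102, 0.301, 0.034, 0.068),
    "S": (0.120, 0.139, 0.125, 0.106),
}


def turn_probability(tetrapeptide):
    """P(t): product of the position-specific bend frequencies of four residues."""
    product = 1.0
    for offset, aa in enumerate(tetrapeptide):
        product *= BEND[aa][offset]
    return product


def is_turn(p_t, avg_p_turn, sum_alpha, sum_turn, sum_beta):
    """All three Step 3 conditions; the chained comparison means a < t and t > b."""
    return p_t > 0.000075 and avg_p_turn > 100 and sum_alpha < sum_turn > sum_beta


def resolve_overlap(sum_alpha, sum_beta):
    """Step 4: pick the structure with the larger propensity sum, or no call on a tie."""
    if sum_alpha > sum_beta:
        return "alpha"
    if sum_beta > sum_alpha:
        return "beta"
    return None


likely_turn = turn_probability("NPDG")    # turn-favouring residues
unlikely_turn = turn_probability("AAAA")  # helix-favouring residue
```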
Secondary structure prediction
Page 427
• Chou and Fasman (1974): based on the frequencies of amino acids found in α-helices, β-sheets, and turns
• Proline: occurs at turns, but not in α-helices
• GOR (Garnier, Osguthorpe, Robson): related algorithm
• Modern algorithms: use multiple sequence alignments and achieve a higher success rate (about 70-75%)
Secondary structure prediction
Web servers:
• GOR4
• Jpred
• NNPREDICT
• PHD
• Predator
• PredictProtein
• PSIPRED
• SAM-T99sec
Table 11-3, Page 429
Secondary Structure Prediction by PSIPRED
• Prediction of regions of the protein that form alpha-helix, beta-sheet, or random coil
• http://bioinf.cs.ucl.ac.uk/psipred/
• Based on neural networks
• Uses a Chou-Fasman-like algorithm, but first does a PSI-BLAST search to get a collection of sequences related to the input (searching for orthologous sequences)
• Univ. College London, 1999
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)
Page 146
R,I,K C D,E,T K,R,T N,L,Y,G
Inspect the blastp output to identify empirical “rules” regarding amino acids tolerated at each position
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
 1 M  -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
 2 K  -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
 3 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 4 V   0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
 5 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 6 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 7 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 8 L  -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
 9 L  -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
10 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
11 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
12 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
13 W  -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
14 A   3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
15 A   2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
16 A   4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
37 S   2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
38 G   0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
39 T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
40 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
41 Y  -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
42 A   4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
Columns: the 20 amino acids
Rows: all the amino acids from position 1 to the end of your PSI-BLAST query protein
note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine—depending on the position in the protein
note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan—depending on the position in the protein
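The position dependence noted above can be checked with a small lookup. This is a minimal sketch: the function name is hypothetical, and the four rows are copied from the PSSM excerpt shown earlier in this section (positions 3, 6, 13 and 14).

```python
# Column order of the PSSM, as in the excerpt.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Query position -> 20 scores in AMINO_ACIDS order (rows copied from the excerpt).
PSSM_ROWS = {
    3:  [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
    6:  [5, -2, -2, -2, -1, -1, -1, 0, -2, -2, -2, -1, -1, -3, -1, 1, 0, -3, -2, 0],
    13: [-2, -3, -4, -4, -2, -2, -3, -4, -3, 1, 4, -3, 2, 1, -3, -3, -2, 7, 0, 0],
    14: [3, -2, -1, -2, -1, -1, -2, 4, -2, -2, -2, -1, -2, -3, -1, 1, -1, -3, -3, -1],
}


def pssm_score(position, residue):
    """Score for aligning `residue` against the given query position."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]


# Alanine matching alanine: different scores at positions 6 and 14.
ala_scores = (pssm_score(6, "A"), pssm_score(14, "A"))
# Tryptophan matching tryptophan: different scores at positions 3 and 13.
trp_scores = (pssm_score(3, "W"), pssm_score(13, "W"))
```

At position 14 glycine scores higher than the query's own alanine, which is the kind of information a fixed substitution matrix cannot express.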
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)
3. The PSSM is used as a query against the database
4. PSI-BLAST estimates statistical significance (E values)
5. Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query
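To make step 2 concrete, the sketch below derives a toy profile from an ungapped alignment using log-odds scores. It is not PSI-BLAST's actual procedure: the uniform background frequency, the pseudocount of 1, the function name and the three-sequence alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Toy PSSM: per-column log2(observed frequency / background frequency)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm


# Hypothetical ungapped alignment of three short sequences.
toy_alignment = ["MKW", "MRW", "MKW"]
toy_pssm = build_pssm(toy_alignment)
```

Residues seen often in a column get positive scores and unseen residues get negative scores, so the same residue can score differently at different positions.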
Page 146
SRC protein
• Tyrosine kinase: an enzyme that puts a phosphate group on a tyrosine residue (phosphorylation)
• Activates an inactive protein, which eventually activates cell-division proteins
• NP_005408
>gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS
DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV
APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC
FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL
DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC
PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
Examining Crystal Structure
• Cn3D: NCBI structure viewer and modeling tool
• DeepView: SWISSPROT
• JMOL
NCBI Structure database
• Links to NCBI MMDB (Molecular Modeling Database)
• MMDB contains experimentally verified protein structures
SRC – MMDB ID 56157, PDB ID 1FMK
View Structure from NCBI Structure database
• Opens up a Cn3D window
• Click to rotate; Ctrl-click to zoom; Shift-click to move
• Rendering and coloring menus
Tertiary structure
• 3D arrangement of all atoms in the molecule
• Considers the arrangement of helical and sheet sections, conformations of side chains, arrangement of atoms of side chains, etc.
• Experimentally determined by X-ray crystallography: measures diffraction patterns of atoms
• Or by NMR (Nuclear Magnetic Resonance) spectroscopy: uses protein samples in aqueous solution
• Tertiary structure of α-lactalbumin and myoglobin
Protein families
• Groups of genes of identical or similar sequence are common
• Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product
• e.g., a genome contains multiple copies of ribosomal RNA genes
• Human chromosome 1 has 2000 genes for 5S rRNA (S: sedimentation coefficient), and chr 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S
• Amplification of rRNA genes evolved because of the heavy demand for rRNA synthesis during cell division
• These rRNA genes are examples of gene families having identical or near-identical sequences
• Sequence similarities indicate a common evolutionary origin
• α- and β-globin families have distinct sequence similarities and evolved from a single ancestral globin gene
Protein families and superfamilies
• Dayhoff classification, 1978
• Protein families: at least 50% AA sequence similarity (based on physico-chemical AA features)
• Related proteins with less similarity (35%) belong to a superfamily and may have quite diverse functions
• α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily
Protein family database
• Pattern or secondary database derived from sequences
• A pattern may be the most conserved aspects of sequence families
• The most conserved part may vary between species
• Use a scoring system to account for some variability
• Position-specific scoring matrix (PSSM) or profile
• In contrast, a pairwise alignment applies the same weight regardless of position
• Protein family databases are derived by different analytical techniques
• But all try to find motifs, conserved regions considered to reflect shared structural or functional characteristics
• Three groups: single motifs, multiple motifs, or full domain alignments
Protein family databases
Database   Data source              Stored info
PROSITE    Swiss-Prot               Regular expressions (patterns) of single most conserved motif
Profiles   Swiss-Prot               Weighted matrices (profiles) of position-sensitive weights
PRINTS     Swiss-Prot and TrEMBL    Aligned motifs (fingerprints)
Pfam       Swiss-Prot and TrEMBL    Multiple sequence alignment of a protein domain or conserved region
Blocks     InterPro/PRINTS          Aligned motifs (blocks)
eMOTIF     Blocks/PRINTS            Permissive regular expressions
Pattern or secondary database derived from sequences
Single Motif Method
• Regular expression
• PROSITE (PDB 1ivy)
• Carboxypet_Ser_His (PS00560): [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
• [ ]: any of the enclosed symbols
• x: any residue
• (3): number of repeats
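A PROSITE-style pattern maps almost directly onto an ordinary regular expression. The converter below is a minimal sketch handling only the notation explained on this slide (brackets, x, repeat counts) plus the {..} exclusion form; the function name and the test peptide are made up for illustration, and the pattern string is the slide's PS00560 pattern with its extraction typos normalized, which is an assumption.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(3) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


PS00560 = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-"
           "[SG]-H-x-[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(PS00560)

# Hypothetical peptide built to satisfy the pattern, embedded in flanking residues.
hit = re.search(regex, "MK" + "LAALAIGSSHAIPAAAP" + "GG")
```

Replacing the invariant histidine in the peptide makes the match fail, which is how a single-motif pattern rejects near misses.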
Fuzzy regular expression
• Build regular expressions with info on shared biochemical properties of AAs
• Provide flexibility according to AA group clustering
Multiple motif methods
PRINTS
• Encodes multiple motifs (called fingerprints) in ungapped, unweighted local alignments
BLOCKS
• Derived from PROSITE and PRINTS
• Uses the most highly conserved regions in protein families in PROSITE
• Uses a motif-finding algorithm to generate a large number of candidate blocks
• Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors
• Blocks are iteratively extended and ultimately encoded as ungapped local alignments
• Graph theory is used to assemble a best set of blocks for a given family
• Uses a position-specific scoring matrix (PSSM), similar to a profile
Full domain alignment
Profiles
• Use a family-based scoring matrix via dynamic programming
• Have position-specific info on insertions and deletions in the sequence family
Hidden Markov Model (HMM)
• PFAM, SMART, TIGRFAM represent full domain alignments as HMMs
PFAM
• Represents each family as a seed alignment, a full alignment, and an HMM
• The seed contains representative members of the family
• The full alignment contains all members of the family as detected with the HMM constructed from the seed alignment
Structure-based Sequence Alignment
• It is well known that an alignment based on sequence similarity alone is not always correct, and that proteins can have similar structures with no sequence similarity
• Sequence alignment is augmented by structural alignments
• COMPASS, HOMSTRAD, PALI, ..
Protein Structure Comparison/Classification
Protein structures
Domain
• The polypeptide chain in a protein folds into a 'tertiary' structure
• One or more compact globular regions called domains
• The tertiary structure associated with a domain region is also described as a protein fold
Multi-domain
• Proteins whose polypeptide chains fold into several domains
• Nearly half the known globular structures are multidomain, more than half of them with two domains
• Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in the PDB
Structure comparison algorithms
• Two main components in structure comparison algorithms
• Scoring similarities in structural features
• Optimization strategy maximizing the similarities measured
• Most are based on geometric properties from 3D coordinates
• Intermolecular method: superpose structures by minimizing the distance between superposed positions
• Intramolecular method: compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions
• Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
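The RMSD definition above translates directly into code. The sketch below assumes the two structures are already superposed and that equivalent atoms are listed in the same order; the function name and the coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Square root of the mean squared distance between equivalent atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two hypothetical two-atom structures, shifted by 1 unit along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shift_rmsd = rmsd(structure_a, structure_b)
```

An intermolecular method would first find the rotation and translation that minimize this value; that superposition step is not shown here.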
Inter vs. Intra
RMSD
Distant homolog
• Structure is more conserved than sequence during evolution
• Structural similarity between distant homologs can be found
• Figure labels: pairwise sequence similarity; SSAP structural similarity score in parentheses (0 to 100)
Distant homolog
Structural variations in protein families
Structure comparison algorithms
• SSAP, 1989: residue level, intra, dynamic programming
• DALI, 1993: residue fragment level, intra, Monte Carlo optimization
• COMPARER, 1990: multiple element level, both, dynamic programming
Structure classification hierarchy
• Class level: proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations)
• Mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β are segregated)
• Architecture: the manner in which secondary structure elements are packed together (arrangement of secondary structures in 3D space)
• Fold group (topology): orientation of secondary structures and the connectivity between them
• Superfamily
• Family
Hierarchy example
Protein Structure databases
PDB
• Over 20,000 entries deduced from X-ray diffraction, NMR or modeling
• Massively redundant
• 1FMK, 1BK5, 2F9C, ..
Protein Structure databases
SCOP (Structural Classification of Proteins)
• A multi-domain protein is split into its constituent domains
• Known structures are classified according to evolutionary and structural relationships
• Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes
• Family level: groups together domains with clear sequence similarities
• Superfamily: group of domains with structural and functional evidence for their descent from a common evolutionary ancestor
• Fold: group of domains with the same major secondary structures and the same chain topology
• Domains are identified manually by visually inspecting structures
• Proteins in the same superfamily often have the same function
Protein Structure databases
CATH (Class, Architecture, Topology, Homology)
• Homology: clustered domains with 35% sequence identity and shared common ancestry
• 800 fold families, 10 of which are super-folds
• 2009, www.cs.uml.edu/~kim/580/08_cath.pdf
Structure classification
• Most structure classifications are established at the domain level
• The domain is thought to be an important evolutionary unit, and it is easier to determine domain boundaries from structural data than from sequence data
Criteria for assessing domain regions within a structure
• The domain possesses a compact globular structure
• Residues within a domain make more internal contacts than contacts to residues in the rest of the polypeptide
• Secondary structure elements are usually not shared with other regions of the polypeptide
• There is evidence for the existence of this region as an evolutionary unit
CATH classifications
Multi-domain structures
Protein Function/Structure Prediction
Protein Function Prediction
• In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function
• The more similar the sequence, the more similar the function is likely to be
• Not always true
• Can clues to function be derived directly from 3D structure?
Definition of function
• Function can be described at many levels: biochemical, biological processes, pathways, organ level
• Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..
• GO (Gene Ontology) scheme
Protein Function Prediction
Sequence-based: largely unreliable
Profile-based
• Profiles are constructed from sequences of whole protein families, where families are grouped by 3D structure or function (as in Pfam)
• Start with sequences matched by an initial search, then iteratively pull in more remote homologues
• More sensitive than simple sequence comparison because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable
Structure-based
Fold-based
• Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein
• Sometimes, the function of proteins alters during evolution while the fold stays unchanged
• Thus, a fold match is not always reliable
Surface clefts and binding pockets
Chap. 12 RNA Structures
Stem-loop structure
RNA structure
A loop structure
• A loop between i and j when the base at i pairs with the base at j
• The base at i+1 pairs with the base at j
• Or the base at i pairs with the base at j-1
• Or a multiple loop
RNA structure
• Search for minimum free energy
• Gibbs free energy at 37 degrees (C)
• Free energy increments of base pairs are counted as stacks of adjacent pairs
• Successive CGs: -3.3 kcal/mol
• Unfavorable loop initiation energy to constrain bases in a loop
RNA secondary structure
Ad-hoc approach
• Simply look at a strand and find areas where base pairing can occur
• Possible to find many locations where folds can occur
• Prediction should be able to determine the most likely one
• What should be the criteria?
1980, Nussinov-Jacobson Algorithm
• The more stable structure is the most likely structure
• Find the fold that forms the greatest number of base pairs (base-pairing lowers the overall energy of the strand, making it more stable)
• Checking all possible folds is impossible -> dynamic programming
RNA structure prediction
• Create an n x n matrix for a sequence with n bases
• Initialize the diagonal to 0
• Fill the matrix with the largest number of base pairs (S)
• w(i,j) = 1 if base i can be paired with base j
Nussinov-Jacobson Algorithm
S(i,j) = max { S(i+1, j-1) + w(i,j),
               S(i+1, j),
               S(i, j-1),
               max over i < k < j of [ S(i,k) + S(k+1,j) ] }
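The recurrence can be filled in by increasing subsequence length. The sketch below is one straightforward way to do it, under stated assumptions: only Watson-Crick pairs (A-U, G-C) count, there is no minimum loop length, and only the number of pairs is returned (no traceback). The function names are hypothetical.

```python
def can_pair(a, b):
    """w(i,j): assumption that only Watson-Crick pairs A-U and G-C are allowed."""
    return {a, b} in ({"A", "U"}, {"G", "C"})


def nussinov(seq):
    """Maximum number of base pairs in seq, filled by increasing span j - i."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]          # diagonal stays 0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Cases: i unpaired, j unpaired.
            best = max(S[i + 1][j], S[i][j - 1])
            # Case: i pairs with j (inner interval may be empty).
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = max(best, inner + (1 if can_pair(seq[i], seq[j]) else 0))
            # Case: bifurcation into two independent substructures.
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0


hairpin_pairs = nussinov("GGGAAACCC")   # three nested G-C pairs around an AAA loop
```

The triple loop gives O(n^3) time and O(n^2) space, which is what makes the search over all folds tractable.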