Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by...
-
Upload
laura-goodman -
Category
Documents
-
view
225 -
download
8
Transcript of Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by...
Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and
Sequence
Residues Aligned
% S
equ
ence
Id
enti
ty
Homologous relationshipsestablished by both 3Dstructure and sequence:
Homologous
Non-homologous
Adapted from work by Sanders and co-workers
Structure can often provide valuable clues to biochemical and biophysical
aspects of protein function
Structure-based Functional Genomics
Biological Functionsof Genes and Proteins
• Genetic Function / Phenotype
• Cellular Function
• Biochemical Function
• Detailed Atomic Mechanism
•Biochemical Function
•Detailed Atomic Mechanism
An Important Approach to the Protein Folding Problem is to
Characterize the “Natural Language of Proteins”
Representative 3D Structure from Each of Several Thousand Sequence Families of Domains
National Institutes of HealthProtein Structure Initiative (PSI)
Long-Range Goal
To make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences
http://www.nigms.nih.gov/psi.html/
J. Norvell
• Structure provides information on function and will aid in the design of experiments
• Development of better therapeutic targets from comparisons of protein structures from:– Pathogens vs. hosts– Diseased vs. normal tissues
Expected PSI Expected PSI BenefitsBenefits
J. Norvell
• Collection of structures will address key biochemical and biophysical problems– Protein folding, prediction, folds, evolution, etc.
• Benefits to biologists– Technology developments – Structural biology facilities– Availability of reagents and materials– Experimental outcome data on protein production
and crystallization
PSI Benefits (con’t)PSI Benefits (con’t)
J. Norvell
PSI Pilot Phase
• 5-year pilot phase, September, 2000• Pilot phase Goals
– Development of high throughput structure genomics pipeline to produce unique, non-redundant protein structures
– Pilots for testing all facets and strategies of structural genomics
• PSI target selection policy – Representatives of protein sequence families – Public release of all targets, progress, results, and
structures
J. Norvell
PSI Pilot Research Centers
•Seven research centers funded in FY2000
•Two additional research centers funded in FY2001
•Co-funding by NIAID for two of the nine research centers
•Many subprojects
J. Norvell
PSI Pilot Phase -- Lessons Learned
• Structural genomics pipelines can be constructed and scaled-up
• High throughput operation works for many proteins
• Genomic approach works for structures• Bottlenecks remain for some proteins• A coordinated, 5-year target selection policy
must be developed• Homology modeling methods need
improvement
J. Norvell
Bioinformatics
Barry Honig, Columbia University Mark Gerstein, Yale UniversitySharon Goldsmith, Columbia UniversityChern Goh, Yale UniversityIgor Jurisica, Ontario Cancer Inst.Andrew Laine, Columbia UniversityJessica Lau, Rutgers UniversityJinfeng Liu, Columbia UniversityDiana Murray, Cornell Medical SchoolBurkhard Rost, Columbia UniversityMike Wilson, Yale University
X-ray Crystallography
Wayne Hendrickson, Columbia UniversityPeter Allen, Columbia UniversityGeorge DeTitta, Hauptman-WoodwardJohn Hunt, Columbia University Rich Karlin, Columbia University Joe Luft, Hauptman-WoodwardAlex Kuzin, Columbia University Phil Manor, Columbia UniversityLiang Tong, Columbia UniversityKalyan Das, Rutgers University
Protein Production / Biophysics
Gaetano Montelione, Rutgers University Thomas Acton, Rutgers UniversityStephen Anderson, Rutgers UniversityCheryl Arrowsmith, Ontario Cancer Inst.YiWen Chiang, Rutgers UniversityNatasha Dennisova, Rutgers UnivedrsityMasayori Inouye, RWJMS - UMDNJLichung Ma, Rutgers UniversityRong Xiao, Rutgers UniversityAdlinda Yee, Ontario Cancer Instit
Protein NMR
Thomas Szyperski, SUNY BuffaloJames Aramani, Rutgers University Cheryl Arrowsmith, Ontario Cancer Inst.John Cort, Pacific Northwest Natl LabsMichael Kennedy, Pacific Northwest Natl Labs Gaouhua Liu , SUNY Buffalo Theresa Ramelot, Pacific Northwest Natl LabsJanet Huang, Rutgers UniversityGaetano Montelione, Rutgers UniversityGVT Swapna, Rutgers UniversityBin Wu, Ontario Cancer Inst.
Northeast Structural Genomics Consortium:A SG Research Network
Goals of the NESG Consortium
Short TermDevelop a Scalable Platform for Structural and Functional Proteomics of Prokaryotic and Eukaryotic Proteins
Long TermCharacterize the repertoire of eukaryotic protein structural domain families
The NESG Publication Network
PubNetDouglas, Montelione, GersteinBioinformatics, 2005 in press
Target Selection Strategy
Target Selection for Structural ProteomicsC. Orengo, Snowbird, UT 4.17.04
How many protein families can we identify in the How many protein families can we identify in the genomes with/without structural genomes with/without structural
representatives?representatives?
Which families should we target to maximise Which families should we target to maximise the structural coverage of the genomes?the structural coverage of the genomes?
Can we select families to optimise function Can we select families to optimise function coverage?coverage?
Rost Clusters: Structural Genomics Targets
• Protein domain families / clusters
• Full length proteins < 340 amino acids
• No member > 30% identity to PDB structures
• No regions of low complexity
• Not predicted to be membrane associated
Target genomes Reagent genomes (prokaryotes):Eukaryotes Eubacteria Archea
Arabidopsis thaliana (A) Aquifex aeolicus (Q) Aeropyrum pernix (X)Caenorhabditis elegans (W) Bacillus subtilis (S) Archaeoglobus fulgidus (G)Drosophila melanogaster (F) Escherichia coli (E) Methanobacterium thermoautotrophicum (T)
Homo sapiens (H) Haemophilus influenzae (I) Pyrococcus horikoshii (J)Saccharomyces cerevisiae (Y) Helicobacter pylori (P)
(Mus musculus) Staphylococcus aureus (Z)Thermotoga maritima (V)Campylobacter jejuni (B)
Neisseria meningitides (M)Thermus thermophilus (U)
~ 20,000 “NESG Clusters”
NESG Domain Clusters• Protein domain families / clusters
• Full length proteins < 340 amino acids
• No member > 30% identity to PDB structures
• No regions of low complexity
• Not predicted to be membrane associated
Aeropyrum pernixAquifex aeolicusArabidopsis thalianaArchaeglobus fulgidisBacillus subtilisBrucella melitensisCaenorhabditis elegansCampylobacter jejuniCaulobacter crescentus
Deinococcus radioduransDrosophila melanogaster
Escherichia coliFusobacterium nucleatumHaemophilus influenzaeHelicobacter pyloriHomo sapiens
Human cytomegalovirus Lactococcus lactisM. thermoautotrophicumNeisseria meningitidisOtherPyrococcus furiosusPyrococcus horikoshi
Saccharomyces cerevisiaeStaphylococcus aureusStreptococcus pyogenesStreptomyces coelicolor
Thermoplasma acidophilumThermotoga maritimaThermus thermophilusVibrio cholerae
Liu, Hegi, Acton, Montelione, & Rost PROTEINS 2004. 56: 188-200 Wunderlich et al. PROTEINS 2004 56: 181-187 Acton et al. Meths Enzymol. 2005 in press
1 Euka: 2 Proka
Cloned / Expressed> 1000 Human Proteins
WR41ET8
Protein StructureProduction
Primer Prímer Program
http://www-nmr.cabm.rutgers.edu/bioinformatics/index.html
DNA Mini-preps PCR ReactionSet up-96 well
PCR Purification
Res
tric
tio
n D
iges
t
Qiaquick Purify
Lig
atio
nT
ran
sfo
rm C
olo
ny
PC
R
Cycle Sequencing
Big Dye removal
Auto-Steps with the Biorobot 8000
96- Well Expression
Overnight culture
24 Well Blocks
2 ml of MJ9
Transfer ~200 ul of overnight culture to appropriate well
HR969
HSQC and HetNOE Screening
Amenability to Structural Determination by NMR
Is Determined on NiNTA-Purified Samples
# Targets Good Excellent314 60 25
20% 8%
Some 30% of full-length, expressed, soluble eukaryotic proteinsfrom the Rost Clusters produced in E. coli by NESG are DISORDERED based on Heteronuclear 1H-15N NOE Data
Critical NMR Observation From SPiNE
It may not be possible to determine 3D structures of a large portion of the Rost domain families in isolation!
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
Sample Optimization - Buffer Screening
Microdialysis Buttons- Optimization for NMR
Buffer NaCl DTT Arginine
50 mM Ammonium Acetate pH 5.0 0 0 0
50 mM Ammonium Acetate pH 5.0 0 10 mM 0
50 mM Ammonium Acetate pH 5.0 0.1 M 10 mM 0
50 mM Ammonium Acetate pH 5.0 0 10 mM 0.1 M
50mM MES pH 6.0 0 0 0
50mM MES pH 6.0 0 10 mM 0
50mM MES pH 6.0 0.1 M 10 mM 0
50mM MES pH 6.0 0 10 mM 0.1 M
50mM Bis.Tris pH 6.5 0 0 0
50mM Bis.Tris pH 6.5 0 10 mM 0
50mM Bis.Tris pH 6.5 0.1 M 10 mM 0
50mM Bis.Tris pH 6.5 0 10 mM 0.1 M
Vary Buffer Conditions - Stability
Screen for ppt.
100 mM Arginine Small sample mass
(50 ug/button)
Bagby S, Tong KI, Liu D, Alattia JR, Ikura M. 1997. J Biomol NMR.
Monodisperse Conditions
Aggregation Screening - Crystallization
Analytical Gel Filtration with Light Scattering
Proterion - 96 Well
Less Sample
More Conditions
Philip Manor, Roland Satterwhite and John Hunt
LS
RI
5 hours
12 hours
ÄKTAxpress™
4 modules in parallel 16 samples AC-GF
AC
AC/GF
Affinity Chromatography (AC)HiTrap™ Chelating HP, 1 and 5 ml
Gel Filtration (GF)HiLoad 16/60 Superdex 200 pg
Solubility / 2004 Stats
Organism Cloned % Sol* PDBA. aeolicus (Q) 85 46 3A. thaliana (A) 35 29 1A. fulgidis (G) 23 74 2B. subtilis (S) 158 49 4B. melitensis (L) 15 67 0C. elegans (W) 90 50 6C. jejuni (B) 20 55 0D. melanogaster (F) 113 15 1E. faecalis (Ef) 23 100 0E. coli ( E) 118 50 12H. influenzae (I) 101 57 4H. pylori (P) 75 21 1H. sapiens (H) 548 43 4N. meningitidis (M) 22 54 1P. furiosus (Pf) 48 46 2P. horikosh i (J) 19 63 1S. pyogenes (D) 12 50 1
*defined as greater than 60% soluble by SDS-PAGE analysis
Many HR (Human) proteins in advanced stages of NMR
3 HR Crystal structures
Total Week GoalCloned 511 51 50Fermented 183 20 ~20-24Purified 180 20 ~20-24
2004 ProductionSolubility vs Organism
2004 HR Success
T. Acton et al
Internet-based Data Management
Cloned Targets 4,220
Purified Targets 1,458
Crystal Structures in PDB 84
NMR Structures in PDB 72
Structures In PDB 147
Total Structures 160
In Refinement (NMR + Xray) 13
Intrinsicly Unfolded Proteins > 70
New Folds 12
Publications 209
NESG PROGRESS SUMMARY Jan 1, 2005
Intrinsically Disordered ProteinsFull-length Proteins Produced in E. coli
Organism % UnfoldedE. coli 8%yeast 18%fly / worm 25%human 35%
Phylogenetic Distribution of 160 NESG Structures
Most (>95%) completed NESG structures are
members of eukaryotic protein domain families
Eukar
yotic
Eubacteria
Archea
Some 35 (~20%) NESG structures submitted to the PDB are eukaryotic
proteins
Uniqueness of NESG Structures
Leverage of NESG Structures
lower panel: number of proteins for which the sequence-unique structures experimentally determined (red) by each consortium could be used to buildhomology models (light green).
upper panel shows the number of new models that could be built for ten entirely sequenced eukaryotes (tan) and for the human genome (green)
Total Leverage ~20,000 Structures Novel Leverage ~ 4,000 Structures
Liu and Rost