Hidden in plain sight
Classifying the known and the unknown proteome
Valerie Wood, PomBase
We tend to study what we know
5054 protein coding
2154 published, small scale (blue); 2050 inferred from orthologs (red)
Steady progress in characterizing proteins already studied in other organisms
‘Unknowns’ decreasing only gradually: 509 conserved (green), 321 Schizosaccharomyces specific (purple)
Similar situation in other organisms (other organisms checked for annotation)
Progress in characterising proteins since 2006
Classifying as unknown
• The concept of “known function” is vague/arbitrary: we may know the molecular function (oxidase/protease) but nothing about the broader cellular role
• For fission yeast we use ‘unknown’ for proteins with no information about the broad cellular role in which they participate, which therefore cannot be assigned to a ‘biological process’ in the ‘GO slim’ (e.g. transcription, translation, replication, amino acid metabolism)
• People tend to work on processes, so assigning proteins to a process makes them accessible as candidates for follow-up
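The ‘unknown’ rule above can be sketched in a few lines of Python. This is only an illustration: the slim term list is a tiny hypothetical subset, the gene names are invented, and the sketch assumes each gene's process annotations have already been mapped up to slim terms.

```python
# Hypothetical subset of GO slim biological process terms (a real slim has many more).
GO_SLIM_PROCESSES = {
    "transcription",
    "translation",
    "DNA replication",
    "amino acid metabolism",
}

def is_unknown(process_annotations, slim_terms=GO_SLIM_PROCESSES):
    """Return True if none of a gene's process annotations is a slim term."""
    return not (set(process_annotations) & set(slim_terms))

# Invented genes: geneB may have a molecular function, but it has no slim process.
annotations = {
    "geneA": {"transcription"},
    "geneB": set(),
    "geneC": {"translation", "amino acid metabolism"},
}

unknowns = sorted(g for g, procs in annotations.items() if is_unknown(procs))
print(unknowns)
```

A gene with a known molecular function but an empty set of slim processes still counts as ‘unknown’ under this rule, which matches the distinction made in the first bullet.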
Taxonomic conservation
Unknown 830
321 fission yeast specific
• Schizosaccharomyces only 178
• Schizosaccharomyces pombe only 143
509 conserved in other organisms
• vertebrate, eukaryote only and vertebrate, bacteria: 179 (have clear human ortholog)
• other non-fungal eukaryote, no vertebrate
• fungi only 186
• fungi and bacteria
• horizontal transfer 19
• To make useful inferences for ‘unknowns’ we need a clear and accurate picture of what we know
• A “GO slim” is a way of summarizing the biological roles of an organism's gene products
• This “matrix” shows genes co-annotated to pairs of GO slim terms
• Many GO slim terms do not share annotations with other slim terms
• Used for QC, to identify ontology and annotation errors (especially in electronic annotation)
What is known?
DNA replication, recombination, repair frequently intersect with each other and with chromosome organization, mitotic cell cycle regulation
Rarely (currently never) intersect with: carbohydrate metabolism, amino acid metabolism, cytokinesis, lipid metabolism, nucleocytoplasmic transport, protein glycosylation
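A co-annotation “matrix” of the kind described above can be sketched as a count of genes per pair of slim terms. The gene names and annotations below are invented for illustration, and the function name is hypothetical. It is not the PomBase implementation.

```python
from collections import Counter
from itertools import combinations

def coannotation_counts(gene_to_slim_terms):
    """Count, for each pair of slim terms, the genes annotated to both."""
    counts = Counter()
    for terms in gene_to_slim_terms.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts

# Invented example annotations.
genes = {
    "gene1": {"DNA replication", "DNA repair"},
    "gene2": {"DNA replication", "chromosome organization"},
    "gene3": {"DNA repair", "DNA replication"},
    "gene4": {"lipid metabolism"},
}

counts = coannotation_counts(genes)
print(counts[("DNA repair", "DNA replication")])        # genes sharing both terms
print(counts[("DNA replication", "lipid metabolism")])  # pair that never co-occurs
```

Pairs with a count of zero correspond to the "currently never" intersections. A non-zero count for a pair that should never intersect is a candidate for QC.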
[Figure: GO slim terms grouped into modules, with the number of genes annotated to each. Terms are listed under the nearest module heading in the extracted text; the counts of genes shared between terms are not recoverable.]
TOTAL 5054
Unknown 830
CELL DIVISION 751
• cytoskeleton org 206
• nuclear DNA replication, recombination, repair 305
• mitotic chromosome segregation 184
• regulation of mitotic cell cycle 232
• cytokinesis 110
MITOCHONDRIAL ORG/EXP 280
MEMBRANES, TRAFFICKING, CELL SURFACE 787
• cell wall org 130
• lipid met 222
• vesicle mediated transport 324
• glycosylation
• polysacc met 140
• membrane org 199
SMALL MOLECULE TM TRANSPORT 288
• detox
CENTRAL MET, ENERGY AND BUILDING BLOCKS 549
• AA & sulfur met 220
• vitamin cofactor met
• nucleo-base/side/tide met 219
• small sugar met 77
• nitrogen 15
• other energy generation 25
signalling 404
sexual reproductive process 262 (many intersections)
Other 290 (no intersections; includes adhesion, many proteases, peroxions)
EXPRESSION 1294
EXPRESSION submod 863
• ribosome biogenesis 317
• RNA metabolism 772
• cytoplasmic translation 249
• nucleocyto transport 110
• transcription 479
PROTEIN ASSEMBLY/STABILITY 765
• protein catabolism & autophagy 251
• ubiquitination 192
• folding 102
• complex assembly 325
Note on one intersection: all cardiolipin synthesis
This covers the known “process options” for a single-celled eukaryote
- First step for the unknowns: assign to a broad process (bring to the notice of interested researchers)
- If we can predict a strong association with some module or submodule, the protein is unlikely to be associated with others (caveat)
- Provides a ‘framework’ to begin to partition “unknowns” based on general or specific non-process characteristics (constrain predictions, and evaluate them, based on existing knowledge)
New biology?
Function prediction
1. Find features informative for known processes
• Phenotypes
• Taxonomic distribution
• Location
• Catalytic grouping
2. Identify these informative features in unknown proteins
3. Cluster similar features
4. Ask “which known genes best match these profiles?”
5. Look for matching processes
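Steps 4 and 5 above can be sketched as a profile query followed by a tally of processes. All gene names, features and process labels below are invented, and the helper names are hypothetical. A real query would be run against curated annotation, for example with a database search tool.

```python
from collections import Counter

def genes_matching(profile, gene_features):
    """Return genes whose feature set contains every feature in the profile."""
    return sorted(g for g, feats in gene_features.items() if profile <= feats)

def process_tally(genes, gene_process):
    """Count how many of the given genes are annotated to each process."""
    return Counter(gene_process[g] for g in genes)

# Invented features for genes of known process.
known_features = {
    "k1": {"nucleus", "methyltransferase domain", "conserved in bacteria"},
    "k2": {"nucleus", "methyltransferase domain"},
    "k3": {"ER", "methyltransferase domain"},
    "k4": {"nucleus", "kinase domain"},
}
known_process = {
    "k1": "tRNA metabolism",
    "k2": "tRNA metabolism",
    "k3": "lipid metabolism",
    "k4": "signalling",
}

# Feature profile taken from a (hypothetical) cluster of unknown proteins.
profile = {"nucleus", "methyltransferase domain"}
matches = genes_matching(profile, known_features)
tally = process_tally(matches, known_process)
print(matches, tally.most_common(1))
```

The most common process among the matching known genes becomes the candidate process for the unknowns that share the profile.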
Classification/clustering of unknowns (conserved to human subset)
1. Identify informative features for each unknown protein:
• Phenotypes
• Location
• Taxonomic distribution
• Catalytic grouping
2. Group by similar features
For 100/179 of the conserved to human subset see poster 144
Example feature profile:
1. ER localization
2. >1 <4 TM domain
3. Absent from S. cerevisiae
4. Conserved in vertebrates
Ask “which known genes best match these profiles?”
Query using PomBase advanced search tool:
Look for matching processes
“What are these genes enriched for?”
1. Present in nucleus
2. Methyltransferase domain
3. Conserved in bacteria
4. Conserved in vertebrates
11/15 are tRNA metabolism, … there are orphan tRNA enzymes
11/15 tRNA met, 3/15 rRNA met
Adding another feature, “HU sensitivity”, increases specificity for tRNA metabolism
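One common way to answer “what are these genes enriched for?” is a one-sided hypergeometric test. The sketch below uses the 11/15 result and the 5054 protein-coding total from the slides. The number of tRNA metabolism genes in the whole genome (150 here) is a made-up placeholder, so the printed value is only illustrative. The slides do not say which test was used.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes in a sample of n,
    drawn without replacement from N genes of which K are annotated."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

N = 5054   # protein-coding genes (from the slides)
n = 15     # genes matching the feature profile (from the slides)
k = 11     # of those, annotated to tRNA metabolism (from the slides)
K = 150    # ASSUMPTION: placeholder count of tRNA metabolism genes genome-wide

p = hypergeom_pvalue(k, n, K, N)
print(f"p = {p:.2e}")
```

A very small p-value means the profile picks out the process far more often than chance would. Adding a further feature changes `n` and `k`, and the test can be re-run to see whether specificity improves.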
Real example
SPCC1840.09 was recently characterised as coq11 in S. cerevisiae
4/9 with these phenotypes are ubiquinone biosynthesis
1 transcription
1 (SPAC823.10c) indirect (reannotated to heme transport)
Guilt by association
All unknowns in STRING (http://string-db.org/)
Human AMMECR1
Human MEMO1
Using AnGeLi
The AMMECR subnetwork has connections to the meiotic cell cycle (7/8), and its members are upregulated in response to caffeine and rapamycin and to stress
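Guilt by association can be sketched as a tally of the processes annotated to an unknown gene's network neighbours. The edges and annotations below are invented, and the function name is hypothetical. Real interaction data would come from a resource such as STRING, and this is not how STRING or AnGeLi compute their results.

```python
from collections import Counter

def neighbour_processes(gene, edges, gene_process):
    """Tally the known processes of all genes directly linked to `gene`."""
    neighbours = {b for a, b in edges if a == gene} | {a for a, b in edges if b == gene}
    return Counter(gene_process[n] for n in neighbours if n in gene_process)

# Invented undirected interaction edges around an unknown gene "unk1".
edges = [
    ("unk1", "g1"),
    ("unk1", "g2"),
    ("g3", "unk1"),
    ("g4", "g1"),
]
gene_process = {
    "g1": "meiotic cell cycle",
    "g2": "meiotic cell cycle",
    "g3": "signalling",
    "g4": "transcription",
}

tally = neighbour_processes("unk1", edges, gene_process)
print(tally.most_common())
```

The process that dominates the tally is a hypothesis to test, not an annotation. It should still be checked against other features of the unknown gene.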
What do we need to make good predictions?
CURATION
• Accurate prediction requires high-quality curation: continual removal of known false positive annotations (arising from ontology errors, incorrect experiments, manual curation errors and incorrect automated mapping) from the ‘true positives training set’
FUNCTION PREDICTION METHODS
• Many pipelines for function prediction produce lots of false positives because they are not fully constrained by existing knowledge
• Integrate all methods into a combined approach to prediction; different methods will suit different processes (no one size fits all)
• Identify more informative features (e.g. phenotypes) which correlate strongly and specifically with known processes (some phenotypes, e.g. ‘abnormal shape’, may be enriched for some processes but are non-specific)
EXPERIMENTAL DATA
• Require more datasets which provide strong positive and negative discriminators
• More high-quality physical interactions
ACCESS
• Make predictions prominently accessible so that they can be validated in small-scale follow-up
The future
Acknowledgements
• Midori Harris
• Antonia Lock
• Jurg Bahler
• Danny Bitton (AnGeLi)
• Steve Oliver
Spare slides
[Figure: Venn diagram of genes annotated in each of the three GO aspects]
Biological Process, e.g. cell cycle, transcription, DNA replication, transport, regulation of process
Molecular Function, e.g. transporter, enzyme (protein kinase, ubiquitin ligase, oxidoreductase, protease), binding functions, enzyme regulator (direct)
Cellular Component (location or complex)
Region counts as extracted (assignment to regions not recoverable): 167, 3436, 831, 455, 15, 1842, 90
Total 5054
All 3 aspects unknown
No biological role: 717 unknown role, plus 113 where the process is not very informative, total 830
Find “non process” features that correlate with processes
e.g. “Mitochondrial organization” (A GO slim term)
More likely to be:
Location Phenotype
Phenotype
Expression
Less likely to be:
Protein feature
Using AnGeLi http://bahlerweb.cs.ucl.ac.uk/cgi-bin/GLA/GLA_input
275/280 genes involved in mitochondrion organization are mitochondrial, BUT not all mitochondrial genes (732) are involved in mitochondrial organization
10/10 genes which show decreased population growth on galactose are annotated to mitochondrial organization
Less likely to be periodic (8/497) or to have an abnormal cell cycle (4/695)
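The asymmetry in the location figures can be made explicit with a little arithmetic. The sketch below uses the counts quoted above and assumes that the 275 mitochondrial genes annotated to mitochondrial organization are a subset of the 732 mitochondrial genes.

```python
# Counts quoted in the slide.
mito_org_genes = 280               # genes annotated to mitochondrion organization
mito_org_and_mitochondrial = 275   # of those, located in the mitochondrion
mitochondrial_genes = 732          # all genes located in the mitochondrion

# Fraction of process genes that carry the feature.
sensitivity = mito_org_and_mitochondrial / mito_org_genes
# Fraction of genes with the feature that belong to the process
# (ASSUMPTION: the 275 are among the 732).
precision = mito_org_and_mitochondrial / mitochondrial_genes

print(f"sensitivity = {sensitivity:.1%}")  # 98.2%
print(f"precision   = {precision:.1%}")    # 37.6%
```

So mitochondrial location is close to necessary for the process but far from sufficient, which is why it needs to be combined with more specific features such as the galactose growth phenotype.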
Between module intersections
Numbers of between module intersections are low
Excluding signalling co-annotation, the intersections between modules are minimal (mainly Ub mediated protein degradation)