Hidden in plain sight

Classifying the known and the unknown proteome

Valerie Wood, PomBase

We tend to study what we know

5054 protein coding
• 2154 published, small scale (blue)
• 2050 inferred from orthologs (red)

Steady progress in characterizing proteins already studied in other organisms

‘Unknowns’ decreasing only gradually
• 509 conserved (green)
• 321 Schizosaccharomyces specific (purple)

Similar situation in other organisms (other organisms checked for annotation)

Progress in characterising proteins since 2006

Classifying as unknown

• The concept of “known function” is vague and arbitrary: we may know the molecular function (oxidase/protease) but nothing about the broader cellular role

• For fission yeast we use ‘unknown’ if there is no information about the broad cellular role in which a gene product participates, so that it cannot be assigned to a ‘biological process’ in the ‘GO slim’ (e.g. transcription, translation, replication, amino acid metabolism)

• People tend to work on processes, so assigning a gene product to a process makes it accessible as a candidate for follow-up
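The ‘unknown’ rule above can be sketched in a few lines. This is a minimal illustration, not PomBase's implementation: the slim here contains only the four example processes named on the slide, and the `GeneProduct` record, the `partition` helper and the gene identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Assumption: a toy GO slim holding only the example processes from the slide.
GO_SLIM_PROCESSES: Set[str] = {
    "transcription",
    "translation",
    "replication",
    "amino acid metabolism",
}


@dataclass
class GeneProduct:
    """Hypothetical record: what is annotated for one gene product."""
    molecular_functions: Set[str] = field(default_factory=set)
    processes: Set[str] = field(default_factory=set)


def is_unknown(gp: GeneProduct, slim: Set[str] = GO_SLIM_PROCESSES) -> bool:
    # 'Unknown' means no annotation to any slim process, even if a
    # molecular function (e.g. oxidase) is known.
    return gp.processes.isdisjoint(slim)


def partition(genes: Dict[str, GeneProduct]) -> Tuple[List[str], List[str]]:
    """Split gene identifiers into (known, unknown), each sorted."""
    known = sorted(g for g, gp in genes.items() if not is_unknown(gp))
    unknown = sorted(g for g, gp in genes.items() if is_unknown(gp))
    return known, unknown


# Made-up identifiers, for illustration only.
example_genes = {
    "geneA": GeneProduct({"DNA binding"}, {"transcription"}),
    "geneB": GeneProduct({"oxidase"}, set()),  # function known, role unknown
    "geneC": GeneProduct(set(), {"translation", "replication"}),
}
known, unknown = partition(example_genes)
print(known, unknown)
```

Note that `geneB` counts as unknown although its molecular function is annotated, which is the distinction the first bullet draws.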

Taxonomic conservation of the unknowns (830)

321 fission yeast specific
• Schizosaccharomyces pombe only 143
• Schizosaccharomyces only 178

509 conserved in other organisms
• vertebrate, eukaryote only; vertebrate, bacteria: 179 together (have clear human ortholog)
• other non-fungal eukaryote, no vertebrate
• fungi only 186
• fungi and bacteria
• horizontal transfer 19

• To make useful inferences for ‘unknowns’ we need a clear and accurate picture of what we know

• A “GO slim” is a way of summarizing the biological roles of an organism's gene products

• This “matrix” shows genes co-annotated to pairs of GO slim terms

• Many GO slim terms do not share annotations with other slim terms

• Used for QC: identifying ontology and annotation errors (especially in electronic annotation)
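A minimal sketch of how such a co-annotation “matrix” could be counted, assuming each gene maps to a set of GO slim terms. The function names and the toy gene-to-term data are hypothetical; only the idea (count genes shared by each pair of slim terms, then list pairs that never intersect) comes from the slide.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def coannotation_counts(gene_to_slim: Dict[str, Set[str]]) -> Counter:
    """Count genes co-annotated to each (sorted) pair of slim terms."""
    counts: Counter = Counter()
    for terms in gene_to_slim.values():
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


def never_intersecting(counts: Counter, terms: Set[str]) -> List[Pair]:
    """Pairs of slim terms that share no genes: a shared gene showing up
    later in such a pair is worth a QC look."""
    return [p for p in combinations(sorted(terms), 2) if counts[p] == 0]


# Made-up annotations, for illustration only.
toy = {
    "g1": {"DNA replication", "DNA repair"},
    "g2": {"DNA replication", "DNA repair", "chromosome organization"},
    "g3": {"lipid metabolism"},
    "g4": {"cytokinesis"},
}
counts = coannotation_counts(toy)
all_terms = set().union(*toy.values())
zero_pairs = never_intersecting(counts, all_terms)
print(counts[("DNA repair", "DNA replication")], len(zero_pairs))
```

Pairs are stored with their terms in sorted order so that each pair has a single key.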

What is known?

DNA replication, recombination and repair frequently intersect with each other and with:
• chromosome organization
• mitotic cell cycle regulation

They rarely (currently never) intersect with:
• carbohydrate metabolism
• amino acid metabolism
• cytokinesis
• lipid metabolism
• nucleocytoplasmic transport
• protein glycosylation

[Figure: GO slim terms grouped into modules, with gene counts per module and per term; the numbers of genes shared between terms appear in the intersections]

TOTAL 5054

Modules (genes)
• CELL DIVISION 751
• MITOCHONDRIAL ORG/EXP 280
• MEMBRANES, TRAFFICKING, CELL SURFACE 787
• SMALL MOLECULE TM TRANSPORT 288
• CENTRAL MET, ENERGY AND BUILDING BLOCKS 549
• EXPRESSION 1294 (EXPRESSION submod 863)
• PROTEIN ASSEMBLY/STABILITY 765
• signalling 404
• sexual reproductive process 262 (many intersections)
• Other 290 (no intersections; includes adhesion, many proteases, peroxions)
• Unknown 830

Slim terms labelled in the figure (genes, where legible)
• cytoskeleton org 206
• nuclear DNA replication, recombination, repair 305
• mitotic chromosome segregation 184
• regulation of mitotic cell cycle 232
• cytokinesis 110
• cell wall org 130
• lipid met 222
• vesicle mediated transport 324
• glycosylation / polysacc met 140
• membrane org 199
• detox
• AA & sulfur met 220
• vitamin cofactor met
• nucleo-base/side/tide met 219
• small sugar met 77
• nitrogen 15
• other energy generation 25
• ribosome biogenesis 317
• RNA metabolism 772
• cytoplasmic translation 249
• nucleocyto transport 110
• transcription 479
• protein catabolism & autophagy 251
• ubiquitination 192
• folding 102
• complex assembly 325

Annotation on one intersection: all cardiolipin synthesis

This covers the known “process options” for a single-celled eukaryote
• First step for the unknowns: assign to a broad process (bring to the notice of interested researchers)
• If we can predict a strong association with one module or submodule, a gene is unlikely to be associated with others (caveat)
• Provides a ‘framework’ to begin to partition “unknowns” based on general or specific non-process characteristics (constrain predictions, and evaluate them, based on existing knowledge)

New biology?

Function prediction

1. Find features informative for known processes
• Phenotypes
• Taxonomic distribution
• Location
• Catalytic grouping

2. Identify these informative features in unknown proteins

3. Cluster similar features

4. Ask “which known genes best match these profiles?”

5. Look for matching processes
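The five steps can be sketched as set operations over feature profiles. This is a hedged illustration under simple assumptions (features are plain strings, a known gene “matches” when it has every feature in the profile); the function names and all gene data are hypothetical, and the real analysis used the PomBase advanced search tool rather than code like this.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Profile = FrozenSet[str]


def cluster_by_profile(unknowns: Dict[str, Profile]) -> Dict[Profile, List[str]]:
    """Step 3: group unknown genes that share an identical feature profile."""
    clusters: Dict[Profile, List[str]] = {}
    for gene, profile in sorted(unknowns.items()):
        clusters.setdefault(profile, []).append(gene)
    return clusters


def matching_known_genes(profile: Profile, known: Dict[str, Set[str]]) -> List[str]:
    """Step 4: known genes that carry every feature in the profile."""
    return sorted(g for g, feats in known.items() if profile <= feats)


def process_tally(genes: List[str], processes: Dict[str, Set[str]]) -> Counter:
    """Step 5: tally the processes annotated to the matching known genes."""
    tally: Counter = Counter()
    for g in genes:
        tally.update(processes.get(g, ()))
    return tally


# Made-up data, for illustration only.
profile = frozenset({"nucleus", "methyltransferase domain",
                     "conserved in bacteria", "conserved in vertebrates"})
unknowns = {"u1": profile, "u2": profile, "u3": frozenset({"ER"})}
known_features = {
    "k1": set(profile), "k2": set(profile) | {"HU sensitive"},
    "k3": set(profile), "k4": {"nucleus"},
}
known_processes = {"k1": {"tRNA metabolism"}, "k2": {"tRNA metabolism"},
                   "k3": {"rRNA metabolism"}, "k4": {"transcription"}}

clusters = cluster_by_profile(unknowns)
matches = matching_known_genes(profile, known_features)
tally = process_tally(matches, known_processes)
print(clusters[profile], matches, tally.most_common(1))
```

Exact-profile clustering is the simplest choice; a similarity measure over partial profiles would be a natural refinement.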

Classification/clustering of unknowns (conserved to human subset)

1. Identify informative features for each unknown protein:
• Phenotypes
• Location
• Taxonomic distribution
• Catalytic grouping

2. Group by similar features

For 100/179 of the conserved to human subset see poster 144

1. ER localization
2. More than 1 and fewer than 4 TM domains
3. Absent from S. cerevisiae
4. Conserved in vertebrates

Ask “which known genes best match these profiles?”

Query using PomBase advanced search tool:

Look for matching processes

“What are these genes enriched for?”

1. Present in nucleus
2. Methyltransferase domain
3. Conserved in bacteria
4. Conserved in vertebrates

11/15 are tRNA metabolism and 3/15 are rRNA metabolism … there are orphan tRNA enzymes

Adding another feature, “HU sensitivity”, increases specificity for tRNA metabolism
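One common way to ask “what are these genes enriched for?” is a hypergeometric tail probability; the slide does not say which test was used, so this is an assumption. The sketch takes 11 of 15 matching genes (from the slide) and a genome of 5054 protein-coding genes (from the slide); the number of tRNA metabolism genes in the genome, set to 150 here, is a made-up placeholder.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the process of interest."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


N_GENES = 5054         # protein-coding genes (from the slide)
N_TRNA_MET = 150       # ASSUMPTION: placeholder, not a PomBase figure
p_value = hypergeom_tail(k=11, n=15, K=N_TRNA_MET, N=N_GENES)
print(f"P(at least 11 of 15 by chance) = {p_value:.3g}")
```

A small probability says the profile picks out the process far more often than chance would; it says nothing about whether an individual unknown gene with the same profile truly belongs to it.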

Real example

SPCC1840.09 was recently characterised as coq11 in S. cerevisiae

4/9 with these phenotypes are ubiquinone biosynthesis; 1 transcription; 1 (SPAC823.10c) indirect (reannotated to heme transport)

Guilt by association

All unknowns in STRING (http://string-db.org/)

Human AMMECR1

Human MEMO1

Using AnGeLi

The AMMECR subnetwork has connections to the meiotic cell cycle (7/8), and its genes are upregulated in response to caffeine and rapamycin, and to stress

What do we need to make good predictions?

CURATION
• Accurate prediction requires high quality curation: continual removal of known false positive annotations (from ontology errors, incorrect experiments, manual curation errors and incorrect automated mapping) from the ‘true positives training set’

FUNCTION PREDICTION METHODS
• Many pipelines for function prediction produce lots of false positives because they are not fully constrained by existing knowledge
• Integrate all methods: different methods will suit different processes (no one size fits all)
• Identify more informative features (e.g. phenotypes) which correlate strongly and specifically with known processes (some phenotypes, e.g. ‘abnormal shape’, may be enriched for some processes but are non-specific)

EXPERIMENTAL DATA
• Require more datasets which provide more strong positive and negative discriminators
• More high quality physical interactions

ACCESS
• Make predictions prominently accessible so they can be validated in small scale follow-up

The future

Acknowledgements

• Midori Harris
• Antonia Lock
• Jurg Bahler
• Danny Bitton (AnGeLi)
• Steve Oliver

Spare slides

[Figure: Venn diagram of the three GO aspects for all protein-coding genes; total 5054, including a class with all 3 aspects unknown]

• Biological Process, e.g. cell cycle, transcription, DNA replication, transport, regulation of process
• Molecular Function, e.g. transporter, enzyme (protein kinase, ubiquitin ligase, oxidoreductase, protease), binding functions, enzyme regulator (direct)
• Cellular Component (location or complex)

No biological role: 717 with unknown role, plus 113 where the process is not very informative; total 830

Find “non process” features that correlate with processes

e.g. “Mitochondrial organization” (A GO slim term)

Using AnGeLi http://bahlerweb.cs.ucl.ac.uk/cgi-bin/GLA/GLA_input

Feature types: location, phenotype, expression, protein feature

More likely to be:
• Location: 275/280 genes involved in mitochondrion organization are mitochondrial, BUT not all mitochondrial genes (732) are involved in mitochondrial organization
• Phenotype: 10/10 genes which show decreased population growth on galactose are mitochondrial organization

Less likely to be:
• Expression: periodic, 8/497
• Phenotype: abnormal cell cycle, 4/695
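The location example can be read as recall versus precision. A short worked calculation, assuming the 275 mitochondrial genes annotated to mitochondrion organization are counted among the 732 mitochondrial genes; the function name is illustrative.

```python
from typing import Tuple


def precision_recall(both: int, with_feature: int, with_process: int) -> Tuple[float, float]:
    """Precision: fraction of genes with the feature that are in the process.
    Recall: fraction of genes in the process that have the feature."""
    return both / with_feature, both / with_process


# Counts from the slide: 275 of 280 mitochondrion organization genes are
# mitochondrial, and 732 genes are mitochondrial overall.
precision, recall = precision_recall(both=275, with_feature=732, with_process=280)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# precision is about 0.376, recall about 0.982
```

So mitochondrial location captures almost all of the process (high recall) but is not specific to it on its own (low precision), which is why it needs combining with other features such as the galactose growth phenotype.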

Between module intersections

• Numbers of between module intersections are low
• Excluding signalling co-annotation, the intersections between modules are minimal (mainly Ub mediated protein degradation)
