ISSCB 09/03 M. Linial
Discussion of Function: talk IV
1. Enzyme Seq-Fun
2. Annotation
3. Integration
Sequence and Function relationship
Taking one example: Enzymes
• well known
• functionality is defined
• conserved
• essential
• tree-like classification
Relatively easy function
ENZYMES
Enzymes, WIT, KEGG etc
Functionally characterized Enzymes: By Cofactors
6-hydroxyDOPA, Ammonia, Ascorbate, ATP, Bicarbonate, Bile salts, Biotin, Cadmium, Calcium, Cobalamin, Cobalt, Coenzyme F430, Coenzyme-A, Copper, Dipyrromethane, Dithiothreitol, Divalent cation, F420, FAD, Fe(II), Flavin, Flavoprotein, FMN, Glutathione, Heme, Heme-thiolate, Iron, Iron(II), Iron-molybdenum, Iron-sulfur, Lipoyl group, Magnesium, Manganese, Molybdenum, Molybdopterin, Monovalent cation, NAD, NAD(P)H, Nickel, Potassium, PQQ, Proto heme IX, Pterin, Pyridoxal phosphate, Pyridoxal-phosphate, Pyruvate, Reduced flavin, Selenium, Siroheme, Sodium, Tetrahydropteridine, Thiamine pyrophosphate, Thiol-dependent, Tryptophan, …
Functionally characterized Enzymes
1.-.-.- Oxidoreductases.
1.1.-.- Acting on the CH-OH group of donors.
1.2.-.- Acting on the aldehyde or oxo group of donors.
1.3.-.- Acting on the CH-CH group of donors.
1.4.-.- Acting on the CH-NH(2) group of donors.
1.5.-.- Acting on the CH-NH group of donors.
1.6.-.- Acting on NADH or NADPH.
5.-.-.- Isomerases.
5.1.-.- Racemases and epimerases.
5.2.-.- Cis-trans-isomerases.
5.3.-.- Intramolecular oxidoreductases.
5.4.-.- Intramolecular transferases (mutases).
5.5.-.- Intramolecular lyases.
Catalysis
Structurally based alignments of structurally and functionally characterized sequences
[Figure: pairs of aligned structures arranged by Sequence identity (90%, 45%, 20%; source organisms shown: Human, Chick, E coli, B ster., Yeast) and by Function:]
• Same exact function: 5.3.1.1 (TP Isomerase) and 5.3.1.1 (TP Isomerase)
• Both Class 5 (isom.): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isom.)
• Different classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase)
Relationship of Similarity in Sequence to that in Function
[Plot: x-axis %ID (sequence similarity of pairs of proteins); y-axis % Same Function]
Percentage of pairs that have the same precise function as defined by the Enzyme & FlyBase functional classifications
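A minimal sketch of how such a curve could be tabulated, assuming each pair comes with a percent identity and one EC number per protein. The function name `same_function_by_identity`, the 10% bin width and the example pairs are all hypothetical; the pairing of identities with EC numbers below is invented for illustration and is not read off the plot.

```python
from collections import defaultdict

def same_function_by_identity(pairs, bin_width=10):
    """Return {bin_start: % of pairs with identical EC numbers}.

    pairs: iterable of (percent_identity, ec_a, ec_b) tuples.
    """
    totals = defaultdict(int)
    same = defaultdict(int)
    for pct_id, ec_a, ec_b in pairs:
        # Clamp so that 100% identity falls into the last bin.
        lo = min(int(pct_id // bin_width) * bin_width, 100 - bin_width)
        totals[lo] += 1
        if ec_a == ec_b:  # "same precise function" = identical EC number
            same[lo] += 1
    return {lo: 100.0 * same[lo] / totals[lo] for lo in sorted(totals)}

# Hypothetical pairs (identity values and pairings are assumptions).
pairs = [
    (90, "5.3.1.1", "5.3.1.1"),
    (45, "5.3.1.1", "5.3.1.1"),
    (45, "5.3.1.1", "5.3.1.24"),
    (20, "5.3.1.1", "4.1.3.3"),
    (20, "5.3.1.1", "4.2.1.11"),
]
result = same_function_by_identity(pairs)
print(result)
```

With real data the bins would be filled from all-against-all comparisons of functionally characterized proteins; the sketch only shows the bookkeeping.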
Relationship of Similarity in Sequence to that in Function (M. Gerstein)
[Plot shown in successive builds: x-axis %ID; y-axis % Same Function. Regions marked on the plot:]
• Can transfer both Fold & Functional Annotation
• Can transfer Annotation related to Fold but not Function
• Can not transfer Fold or Functional Annotation ("Twilight Zone")
• Broad vs Narrow Similarity
Annotation-Based Analysis of Protein Sets
Few words on
Protein Function
Protein Annotation
Prediction of Function
What is function? This is not a simple term
Function may be:
• a biological process (e.g. serine protease activity)
• a molecular event (e.g. proteolysis of a specific substrate)
• a cellular structure (e.g. membrane; chromatin, etc.)
• relevance to a whole process (e.g. cell cycle)
• relevance to the whole organism (e.g. ovulation)
“omics”: genomics and proteomics
• Main idea: use high throughput as a means of tackling biological complexity.
[Images: DIGE 2D gel; DNA microarray; SELDI-TOF spectrum]
“omic” research
• Experimental Stage: data collection
• Computational Stage: statistical analysis
• Result: “graveyards” of genes/proteins
CD44 HSP CAT ERP2 RPL1 ENO
SODa TRD PMS DUF ACT GLU
A protein graveyard
CRZ 1 | HMO 1 | POL 1 | SNU 13 | RPC 10 | SFL 1
BEM 2 | NHP 6A | DPB 3 | PRP 19 | RPB 8 | GAL 4
SYF 1 | EPL 1 | POL 2 | KEM 1 | RPB 10 | MIG1
CDC 13 | SCC 4 | RFA 2 | SEH 1 | RPO 26 | HSF 1
SHE 3 | RSC 9 | RFA 3 | NPL 6 | STB 4 | MOT3
NCE 4 | ISW 1 | RFA 1 | HOT1 | TOA2 | STE 12
- | ISW 2 | RFC 3 | DAL 82 | TOA1 | NUT 1
ECM 5 | TRA 1 | RFC 2 | ACE 2 | SUA 7 | BDF 1
EAF 3 | - | RFC 4 | BUR 6/NCB 1 | TAF1 | UME 6
HFI1 | IOC3 | TOP1 | NCB 2 | TAF9 | MMS 4
MSI 1 | - | TOF2 | SSU 72 | TAF10 | ABF 2
CAC 2 | CST 6 | RNH 35 | KIN 28 | TAF11 | GAT1
RSC 1 or 2 | HOF 1 | FOB 1 | MOT1 | TAF3 | RTG1
RSC 6 | ACT 1 | SSA 3 | RPB 9 | TAF6 | SKN 7
RSC 8 | ARP 4 | SSA 2 | RPB 2 | TAF12 | TAF4/MPT1
STH 1 | ARP 9 | GLC 7 | RPB 7 | TAF7 | SIR 2
SFH 1 | ARP 8 | TDH 1, 2, 3 | RPO 21 | TAF5 | MSN 2
RSC 2 | ARP 7 | PDC 2 | RPB 4 | SPT 15 | MET 31
CHD 1 | APN 1 | HPR 5 | RPB 3 | TAF2 | HAC 1
SMC 3 | PHR 1 | - | MED 8 | TAF8 | SSL 2
IRR 1 | NTG2 | RIM 1 | SRB 8 | TFA2 | RAD 3
SWI 3 | MSH 6 | MGM 101 | MED 2 | TFA1 | UBP 8
SNF 12 | MSH 2 | CBF 5 | SRB 7 | TFG1 | CCT 4
SNF 2 | RAD 26 | CBF 2 | SSN 2 | TAF14/TFG3 | RPL 10
SWI 1 | RPH 1 | CHL 1 | SRB 4 | TFG2 | RPP 0
GCN 5 | MUS 81 | CDC 14 | FHL 1 | TFB4 | RPL 11 A or B
SPT 7 | MEC 1 | SMC 1 | SRB 5 | TFB3 | RPL 12 A or B
NGG 1 | RAD 52 | SMC 2 | SRB 2 | TFB2 | RPL 15 B
SPT 3 | RAD 59 | SGS 1 | MED 6 | TFB1 | RPL 19 A or B
ADA 2 | MSH 3 | YCS 4 | RGR 1 | SSL 1 | RPL 1A or B
YNG 2 | RAD 7 | MCD 1 | MED 11 | CCL 1 | RPL 25
SPT 8 | RAD 4 | SCC 2 | SIN 4 | REB 1 | RPL 3
SPT 20 | RAD 14 | CFT1 | CSE 2/MED 9 | FHL 1 | RPL 30
ESA 1 | RAD 23 | YSH 1 | GAL 11 | SUB 1 | RPL 34 A or B
RPD 3 | DPB 4 | REF 2 | MED 7 | GIS1 | RPL 35 A or B
HTB 2 or 1 | RFC 5 | NAM 7 | MED 4 | ARO 80 | RPL 4A or B
RSC 58 | RFC 1 | PAP 1 | MED 1 | FKH 2 | RPL 8A or B
IOC4 | TOP2 | PRP 43 | RPB 11 | - | RPS 0A or B
ITC1 | TOP3 | PRP 9 | ROX3 | IXR1 | RPS 11 A or B
NHP 10 | MIP 1 | PRP 46 | SRB 6 | SGV 1 | RPS 1A
Biological analysis of protein sets
• Biological interpretation requires intimate knowledge of the proteins and is time-consuming.
• Usually only a few proteins are examined.
• How can we interpret the results efficiently?
• How can we understand the results at a proteomic level?
• Solution: analysis of protein annotations.
Protein annotations
• Annotation (keyword): a binary property of a protein, from a “library” of properties.
• Cover various biological aspects: function, structure, taxonomy, localization, biological pathway…
• Annotations come from different sources.
• Growth in annotation amount and variety.
• Libraries of annotations allow computational analysis.
Major sources of incorrect annotations
In the protein world:
1. Wrong gene finding (exon-intron)
2. Premature cleavage: wrong tails (nt sequencing mistakes)
3. ESTs may be misleading
4. Automatic assignment of features
5. Rush due to publication/public/government pressure (the human genome is the worst)
6. No replacement for manual curators
Annotation types
Gene Ontology (GO)
GO provides controlled annotations of:
1. Molecular function
2. Biological process
3. Cellular component
The annotations are part of a hierarchical graph, in which each GO term has a parent or parents, and might have child terms.
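The "parent or parents" structure can be sketched as a small directed acyclic graph in which every term maps to its list of parents. The edges below are a toy illustration written for this example, not loaded from a GO release, and the two terms prefixed with "toy" are hypothetical; they are there only to show a term with two parents.

```python
# term -> list of parent terms (toy data, assumed for illustration)
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "toy catalytic term": ["molecular function"],                      # hypothetical
    "toy DNA-binding enzyme": ["DNA binding", "toy catalytic term"],   # hypothetical
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("DNA binding", PARENTS)))
print(sorted(ancestors("toy DNA-binding enzyme", PARENTS)))
```

A protein annotated with a specific term is implicitly annotated with all of that term's ancestors, which is what makes the hierarchy useful when looking for a term that describes a whole set.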
Assigning GO terms to proteins
Nuclear Protein Hcc-1 (SWISS-PROT: P82979)
Cellular Component
GO:0005634 – nucleus
Molecular Function
GO:0003676 – nucleic acid binding
GO:0003677 – DNA binding
Biological Process
GO:0006350 – transcription
GO:0006355 – transcription regulation
GO:0006417 – general regulation of protein biosynthesis
GO:0006355 – translational regulation
Advantages of using GO terms
• Assigned to most of the proteins in SWISS-PROT (~100%).
• Checked one-by-one by experts (EBI).
• Comprehensive: based on SWISS-PROT keywords, EC number, InterPro keywords and PubMed abstracts.
• Tree-like structure (DAG): using the hierarchy to find the “best” GO term describing a protein or a set of proteins.
Semantic similarity measures
“Information content”: “chaperone” is a more informative term than “signal transducer” because the former is used several hundred times, while the latter is used several thousand times.
GO:0003674 : molecular function
GO:0004871 : signal transducer
GO:0004872 : receptor
GO:0009881 : photoreceptor
GO:0004888 : transmembrane receptor
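One common way to turn usage counts into an information content score is IC(term) = -log2(count / total); the slide does not say which formula is used, so this definition is an assumption. The counts below (300 and 3000 out of 100,000 annotated proteins) are hypothetical stand-ins for "several hundred" and "several thousand".

```python
import math

def information_content(term_counts, total):
    """IC = -log2(relative frequency); rarer terms score higher."""
    return {term: -math.log2(count / total) for term, count in term_counts.items()}

# Hypothetical usage counts (assumed, not taken from a database).
counts = {"chaperone": 300, "signal transducer": 3000}
ic = information_content(counts, total=100_000)

for term, score in ic.items():
    print(f"{term}: {score:.2f} bits")
```

Because the score depends only on the ratio of counts, a tenfold difference in usage always separates two terms by log2(10) bits, whatever the total.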
Protein annotations
9% have more than 20 annotations per protein (not including Taxonomy)
Annotation types
• GO annotation has a broad distribution; the accuracy level varies widely
• Some overlap in keywords but different definitions
Computational analysis – naïve
100 proteins:
annotation    amount
membrane      60
enzyme        40
Summation: a naïve method for protein set analysis.
Something is missing…
Intersection and inclusion
60 membrane, 40 enzyme
[Figure: diagrams of the possible relations between the enzyme and membrane sets]
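A small sketch of why the summation (60 membrane, 40 enzyme) is not enough: three different protein sets give exactly the same counts but different relations between the keywords. The `relation` helper and the synthetic protein names are hypothetical and exist only for this illustration.

```python
def relation(a, b):
    """Classify how two protein sets relate to each other."""
    a, b = set(a), set(b)
    if not (a & b):
        return "disjoint"
    if a == b:
        return "identical"
    if a <= b or b <= a:
        return "inclusion"
    return "intersection"

proteins = [f"P{i}" for i in range(100)]     # 100 synthetic proteins
membrane = set(proteins[:60])                # 60 membrane proteins

enzyme_disjoint = set(proteins[60:])         # 40 enzymes, none in membrane
enzyme_overlap = set(proteins[40:80])        # 40 enzymes, 20 of them in membrane
enzyme_included = set(proteins[:40])         # 40 enzymes, all in membrane

for name, enzyme in [("disjoint", enzyme_disjoint),
                     ("overlap", enzyme_overlap),
                     ("included", enzyme_included)]:
    print(name, len(membrane), len(enzyme), relation(membrane, enzyme))
```

All three cases print 60 and 40, so only the intersection and inclusion relations tell them apart.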
• A web-based tool aimed at biological analysis of protein sets.
• Biological information is shown through intersection and inclusion.
• Goal: provide a “biological roadmap” of the protein set.
Method
Protein-keyword table (1 = the protein carries the keyword):
      enzyme  cytoplasm  hydrolase  transcription  nucleus  kinase
P1    1       0          0          1              1        0
P2    1       1          0          1              1        0
P3    1       1          1          0              0        0
P4    1       1          1          0              0        1
P5    1       1          1          0              0        0
P6    1       1          1          0              0        1
Graph nodes (keywords: proteins = number of proteins):
• enzyme: P1 P2 P3 P4 P5 P6 = 6
• cytoplasm: P2 P3 P4 P5 P6 = 5
• hydrolase: P3 P4 P5 P6 = 4
• kinase: P4 P6 = 2
• transcription, nucleus: P1 P2 = 2
• cytoplasm, transcription, nucleus: P2 = 1
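The six-protein example can be worked through in code. This is a sketch under assumptions, not PANDORA's actual implementation: it takes the bit order to be enzyme, cytoplasm, hydrolase, transcription, nucleus, kinase, and it builds nodes as the distinct non-empty protein sets obtained from the keywords and their intersections. The function names are hypothetical.

```python
from itertools import combinations

KEYWORDS = ["enzyme", "cytoplasm", "hydrolase", "transcription", "nucleus", "kinase"]
ROWS = {"P1": "100110", "P2": "110110", "P3": "111000",
        "P4": "111001", "P5": "111000", "P6": "111001"}

def keyword_extents(rows, keywords):
    """Map each keyword to the frozenset of proteins carrying it."""
    extents = {kw: set() for kw in keywords}
    for protein, bits in rows.items():
        for kw, bit in zip(keywords, bits):
            if bit == "1":
                extents[kw].add(protein)
    return {kw: frozenset(ps) for kw, ps in extents.items()}

def graph_nodes(extents):
    """Close the keyword protein sets under non-empty intersection."""
    nodes = set(extents.values())
    grew = True
    while grew:
        grew = False
        for a, b in combinations(list(nodes), 2):
            common = a & b
            if common and common not in nodes:
                nodes.add(common)
                grew = True
    # Label each node with every keyword shared by all of its proteins.
    return {node: sorted(kw for kw, ext in extents.items() if node <= ext)
            for node in nodes}

extents = keyword_extents(ROWS, KEYWORDS)
nodes = graph_nodes(extents)
for members, kws in sorted(nodes.items(), key=lambda kv: (-len(kv[0]), kv[1])):
    print(", ".join(kws), ":", " ".join(sorted(members)), "=", len(members))
```

Keywords with identical protein sets (here transcription and nucleus) collapse into a single node, and inclusion between nodes is simply the subset relation between their protein sets.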
Graph complexity
• 20 keywords: >1,000,000 nodes
• This worst case doesn’t occur for large K values in the protein-keyword world.
• Still, highly complex graphs do occur.
Theoretical complexity: $\sum_{n=1}^{K} \binom{K}{n} = 2^{K} - 1$
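The worst-case figure for 20 keywords can be checked numerically by counting every non-empty subset of K keywords. The helper name `max_nodes` is hypothetical.

```python
from math import comb

def max_nodes(k):
    """Number of non-empty keyword subsets of k keywords."""
    return sum(comb(k, n) for n in range(1, k + 1))

print(max_nodes(20))            # worst-case node count for 20 keywords
print(max_nodes(20) == 2**20 - 1)
```

The count doubles with every added keyword, which is why the worst case quickly becomes unmanageable even though real protein-keyword data rarely reaches it.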
Resolution
• A user-controlled threshold trading graph accuracy for simplicity.
• Represents the maximal level of error allowed, in proteins.
[Figure: an example graph with its node sizes, before and after simplification at Resolution = 2 proteins]
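One possible reading of the resolution threshold, offered as an assumption rather than as the tool's actual algorithm: keywords whose protein sets differ by at most `resolution` proteins are drawn as one node. The greedy grouping below, the `simplify` name and the reuse of the six-protein toy data from the Method slide are all choices made for this sketch.

```python
# Toy keyword -> protein sets, taken from the six-protein Method example.
EXTENTS = {
    "enzyme": {"P1", "P2", "P3", "P4", "P5", "P6"},
    "cytoplasm": {"P2", "P3", "P4", "P5", "P6"},
    "hydrolase": {"P3", "P4", "P5", "P6"},
    "kinase": {"P4", "P6"},
    "transcription": {"P1", "P2"},
    "nucleus": {"P1", "P2"},
}

def simplify(extents, resolution):
    """Greedily group keywords whose protein sets differ by <= resolution proteins.

    Each group is represented by its first (largest) member; the difference is
    measured as the size of the symmetric difference with that representative.
    """
    groups = []  # list of (representative_set, [keywords])
    for kw in sorted(extents, key=lambda k: (-len(extents[k]), k)):
        for rep, members in groups:
            if len(rep ^ extents[kw]) <= resolution:
                members.append(kw)
                break
        else:
            groups.append((extents[kw], [kw]))
    return [members for _, members in groups]

for r in (0, 1, 2):
    print(r, simplify(EXTENTS, r))
```

Raising the threshold gives fewer, coarser nodes, at the cost of treating up to `resolution` proteins per merged keyword as if they did not differ.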
Annotation types
• Some non-informative words: complete genome, Disease…
• Some are partially annotated: EC x.y.
• Growing very fast
• Still many terms are inconsistent
The power of integration
• Integration of various biological aspects: structure, function, taxonomy, localization, pathways…
• Integration of various characteristics (e.g. SCOP).
• Methods: full integration, “zooming”.
Biological examples
• Take the set of all 576 proteins annotated by ‘GO molecular function’ as ‘anion channel’.
• View through InterPro keywords (sequential signatures).
[Figure: zoom into the BASIC SET viewed through InterPro keywords; each node shows its number of proteins. Nodes: Neurotransmitter-gated ion channel; GABA A receptor; Nicotinic acetylcholine receptor; Voltage-gated chloride channel; Intracellular chloride channel; H+ transporting ATPase; Eukaryotic porin]
Sensitivity: TP/(TP+FN)
red = FN, white = TP
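The sensitivity formula can be written out as a small helper. This is a sketch with hypothetical names and made-up protein identifiers: `truth` stands for the proteins that really belong to a class and `annotated` for the proteins that carry the keyword.

```python
def sensitivity(annotated, truth):
    """TP / (TP + FN) for one keyword against a reference set."""
    tp = len(annotated & truth)    # carry the keyword and belong to the class
    fn = len(truth - annotated)    # belong to the class but lack the keyword
    if tp + fn == 0:
        raise ValueError("reference set is empty")
    return tp / (tp + fn)

# Made-up example: 10 true members, the keyword finds 8 of them plus 2 others.
truth = {f"T{i}" for i in range(10)}
annotated = {f"T{i}" for i in range(8)} | {"X1", "X2"}
print(sensitivity(annotated, truth))
```

The two extra proteins X1 and X2 are false positives and do not affect sensitivity; they would show up in a precision-style measure instead.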
[Figure: InterPro view of the GABA A receptor node: alpha subunit, beta subunit, gamma subunit]
[Figure: Taxonomy view. Nodes: Eukaryota; chordata; Drosophila; C. elegans; human; chicken; mammalia; rodentia]
[Figures: the same graph displayed at Resolution: 0, 1, 2, 5, 8, 15 and 30 proteins]
Biological examples - ProtoNet
• Very large cluster of GTP binding proteins (A244299)
Biological examples - ProtoNet
• Very large cluster of UREASE SF (as in talk III)
17 different enzymatic groups
Highly pure (white)
Biological examples - experimental
• Comparative proteomic experiment: E. coli response to benzoic acid (Yan et al, 2002).
• A set of 51 proteins is down-regulated by a factor of 1.3 or more.
• Benzoic acid is known to inhibit E. coli growth (Lambert et al, 1997).
• Could we guess this without examining individual proteins?
[Figure: GO biological process view. Nodes: Cell growth and/or maintenance; Transport; Metabolism; Biosynthesis; Amino acid biosynthesis; Vitamin biosynthesis; Protein biosynthesis; Nitrogen metabolism; Coenzyme biosynthesis; Lipid metabolism; Phosphate metabolism; Carbohydrate metabolism]
False Annotations
• Automated
• Consistency of KW
• Connectivity
Tested against ProSite (manually)
78% correct + 9% full separation (TP/FP)
Detecting false annotations
• Automatic statistical annotation methods are susceptible to errors of both type I and type II.
• False positives are especially problematic because of incorrect annotation transfer.
• InterPro ‘Glutamine synthetase’ – 131 proteins
• Glutamine synthetase (Glutamate ammonia ligase) reaction:
ATP + L-glutamate + NH3 → ADP + phosphate + L-glutamine
False-positives
[Figure: graph of the InterPro ‘Glutamine synthetase’ set with keywords from InterPro, SwissProt and ENZYME. Nodes: Glutamine synthetase; Ligase; Glutamine Synthetase class I adenylation site; and the keywords Outer-membrane, Virulence, Bacterial Ig-like, Bacterial adhesion mediator, Peptidoglycan-binding, Actin-binding, WD repeat, Coiled coil]
Conclusion
• PANDORA offers:
– Interactive comprehensible graph display.
– Full protein-keyword intersection and inclusion relations.
– User-controlled data simplification.
– Integration of 6 annotation sources.
– Detection of false annotations.
Future plans
• Enlarge and enrich sources.
• An automatic detection method of false annotations.
• Deal with quantitative properties.
Quantitative properties
• Not all biological properties are naturally binary.
• Some interesting quantitative properties are user-specific (e.g. change in expression level).
www.pandora.cs.huji.ac.il
We come in peace…