Semantic empowerment of Life Science Applications
October 2006
Amit Sheth LSDIS Lab, Department of Computer Science,
University of Georgia
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York)
and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.
Computation, data and semantics in life sciences
• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins
• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.
Semantic Web and Life Science
• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
• How much is that?
– Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
• Death by data
• The need for
– Search
– Integration
– Analysis, decision support
– Discovery
Not data, but analysis and insight, leading to decisions and discovery
Semantic empowerment of Life Science Applications
Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world.
We need more automated ways for integration and analysis leading to insight and discovery
- to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.
Benefits of Semantics
• Development of large domain-specific knowledge – for reference, common nomenclature, tagging
• Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases
• Semantic search, browsing, integration analysis, and discovery
Faster and more reliable discovery leading to quality of life improvements
What is semantics & Semantic Web
• Meaning and use of data
• From syntax and structure to semantics (beyond formatting, organization, query interfaces, …)
• XML -> RDF -> OWL -> Rules -> Trust
• Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge
• (Automatic) Semantic annotation, reasoning, …
• Also, increasing use of Service-Oriented Architecture -> semantic Web services
• W3C SW for Health Care and Life Sciences
Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
• Building large (populated) life science ontologies (GlycO, ProPreO)
• Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry)
• Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition
• Ontology-driven applications developed
Semantic Applications
• Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support
• Semantic Browser Demo: contextual browsing of PubMed aided by ontology- and schema-level (in future, instance-level) relationships
• N-glycosylation process: an example of scientific workflow
• Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data
Others we will not discuss: SemBowser, SemDrug, ….
Let us start with a couple of simple applications
Life Science Ontologies
• ProPreO
– An ontology for capturing process and lifecycle information related to proteomic experiments
– 398 classes, 32 relationships
– 3.1 million instances
– Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)
• GlycO
– An ontology for structure and function of Glycopeptides
– 573 classes, 113 relationships
– Published through the National Center for Biomedical Ontology (NCBO)
N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=>
UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
GNT-V attaches GlcNAc at position 6
UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
N-acetyl-glucosaminyl_transferase_V
N-glycan_beta_GlcNAc_9
N-glycan_alpha_man_4
• Challenge – model hundreds of thousands of complex carbohydrate entities
• But the differences between the entities are small (e.g., just one component)
• How to model all the concepts but preclude redundancy → ensure maintainability, scalability
GlycO ontology
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
[Figure: GlycoTree, showing D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]
EnzyO
• The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms
• GlycO together with EnzyO contains all the information needed to describe metabolic pathways, e.g., N-Glycan Biosynthesis
Pathway representation in GlycO
Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.
Zooming in a little …
The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145.
The product of this reaction is the Glycan with KEGG ID 00020.
Reaction R05987 catalyzed by enzyme 2.4.1.145
adds_glycosyl_residue
N-glycan_b-D-GlcpNAc_13
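Since pathways are inferred rather than stored, one way to picture the inference is to chain reaction descriptions whose product is the substrate of the next reaction. The following is a minimal Python sketch under stated assumptions: each reaction is reduced to (id, enzyme, substrate, product); R05987, EC 2.4.1.145 and the glycan IDs come from the slides (written here with the KEGG "G" prefix), while the identifier R_GNT_V for the GNT-V reaction is a placeholder, not a KEGG ID. It is not the GlycO reasoning mechanism itself.

```python
from collections import defaultdict

# (reaction id, enzyme, substrate glycan, product glycan)
REACTIONS = [
    ("R05987", "EC 2.4.1.145", "G00015", "G00020"),
    ("R_GNT_V", "GNT-V", "G00020", "G00021"),  # placeholder reaction id
]

def infer_pathways(reactions, start):
    """Return every maximal chain of reactions reachable from `start`."""
    by_substrate = defaultdict(list)
    for rid, enzyme, substrate, product in reactions:
        by_substrate[substrate].append((rid, enzyme, product))

    paths = []

    def walk(glycan, path, seen):
        # Skip products already on the path so cycles cannot loop forever.
        steps = [s for s in by_substrate[glycan] if s[2] not in seen]
        if not steps:
            if path:
                paths.append(path)
            return
        for rid, enzyme, product in steps:
            walk(product, path + [(glycan, rid, enzyme, product)], seen | {product})

    walk(start, [], {start})
    return paths

for pathway in infer_pathways(REACTIONS, "G00015"):
    for substrate, rid, enzyme, product in pathway:
        print(f"{substrate} --{rid} ({enzyme})--> {product}")
```

Adding a reaction description extends every pathway that passes through its substrate, which is why no explicit pathway objects need to be maintained.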
• Multiple data sources used in populating the ontology
o KEGG - Kyoto Encyclopedia of Genes and Genomes
o SWEETDB
o CarbBank Database
• Each data source has different schema for storing data
• There is significant overlap of instances in the data sources
• Hence, entity disambiguation and a common representational format are needed
GlycO population
Ontology population workflow
[Figure: flowchart. Elements: Semagix Freedom knowledge extractor; Instance Data; Has CarbBank ID? (YES / NO); IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Already in KB? (YES: next Instance / NO: Insert into KB)]
LINUCS representation of a glycan:
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}
GLYDE representation of the glycan (Asn-linked GlcNAc-GlcNAc-Man core with two antennae):
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
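The "LINUCS to GLYDE" step of the population workflow can be pictured as a small recursive-descent conversion. The sketch below is an illustration in Python, not the converter used in the project: it assumes every LINUCS node has the form [(parent position+anomeric carbon)][anomer-chirality-monosaccharide]{children}, and that the ring-form letter is dropped from the monosaccharide name (GlcpNAc becomes GlcNAc, Manp becomes Man) through the small hypothetical lookup table RING_FREE.

```python
import re
import xml.etree.ElementTree as ET

LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]"
    "{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

# One node header: "[(link+anomeric)][name]{" ; the root has an empty "[]".
NODE = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]\{")
RING_FREE = {"GlcpNAc": "GlcNAc", "Manp": "Man"}  # assumption, see lead-in

def parse_nodes(text, pos):
    """Parse sibling nodes starting at `pos`; return (nodes, next position)."""
    nodes = []
    while pos < len(text) and text[pos] == "[":
        match = NODE.match(text, pos)
        if match is None:
            raise ValueError(f"malformed node at offset {pos}")
        link, anomeric, name = match.groups()
        children, pos = parse_nodes(text, match.end())
        if pos >= len(text) or text[pos] != "}":
            raise ValueError(f"missing closing brace at offset {pos}")
        pos += 1
        nodes.append((link, anomeric, name, children))
    return nodes, pos

def add_residue(parent, node):
    link, anomeric, name, children = node
    anomer, chirality, mono = name.split("-", 2)
    element = ET.SubElement(
        parent, "residue", link=link, anomeric_carbon=anomeric,
        anomer=anomer, chirality=chirality,
        monosaccharide=RING_FREE.get(mono, mono),
    )
    for child in children:
        add_residue(element, child)

def linucs_to_glyde(linucs):
    nodes, pos = parse_nodes(linucs, 0)
    if pos != len(linucs) or len(nodes) != 1:
        raise ValueError("expected exactly one root node")
    _, _, aglycon, children = nodes[0]
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", name=aglycon)
    for child in children:
        add_residue(glycan, child)
    return glycan

print(ET.tostring(linucs_to_glyde(LINUCS), encoding="unicode"))
```

Because both notations are trees, a canonical ordering of sibling residues would still be needed before two glycans from different sources can be compared in the "Already in KB?" step; that ordering is not handled here.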
• Two aspects of glycoproteomics:
o What is it? → identification
o How much of it is there? → quantification
• Heterogeneity in data generation process, instrumental parameters, formats
• Need data and process provenance → ontology-mediated provenance
• Hence, ProPreO models both the glycoproteomics experimental process and attendant data
ProPreO ontology
ProPreO population: transformation to RDF
[Figure: Scientific Data → Computational Methods → Ontology instances]
• Scientific data: Protein Data (amino-acid sequence)
• Computational methods: Calculate Chemical Mass; Calculate Monoisotopic Mass; Determine N-glycosylation Consensus; Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence
• Intermediate RDF: ChemicalMass RDF; MonoisotopicMass RDF; Amino-acid Sequence RDF
• “Protein RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus
• “Peptide RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein
• Key: Protein Path, Peptide Path
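The computational methods named on this slide can be sketched in a few lines. The Python below is an illustration only, with these assumptions: peptides are produced by a trypsin-like rule (cleave after K or R unless followed by P), the N-glycosylation consensus is the N-X-S/T sequon with X not P, the namespace NS and property names are hypothetical stand-ins for ProPreO terms, and the example sequence is invented. Residue masses are the standard monoisotopic values.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
NS = "http://example.org/propreo#"  # hypothetical namespace

def monoisotopic_mass(sequence):
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def sequon_positions(sequence):
    """1-based positions of N in every N-X-S/T motif where X is not P."""
    return [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", sequence)]

def tryptic_peptides(protein):
    """Assumed digestion rule: cut after K or R, but not before P."""
    return re.findall(r".*?[KR](?!P)|.+$", protein)

def to_triples(subject, sequence, parent=None):
    triples = [
        (subject, NS + "amino-acid_sequence", sequence),
        (subject, NS + "monoisotopic_mass", f"{monoisotopic_mass(sequence):.5f}"),
        (subject, NS + "n-glycosylation_consensus",
         str(bool(sequon_positions(sequence))).lower()),
    ]
    if parent is not None:
        triples.append((subject, NS + "parent_protein", parent))
    return triples

protein = "MKNVSAGRPLNGTK"  # invented example sequence
for triple in to_triples("protein_1", protein):
    print(triple)
for i, peptide in enumerate(tryptic_peptides(protein), start=1):
    for triple in to_triples(f"peptide_{i}", peptide, parent="protein_1"):
        print(triple)
```

The peptide path reuses the same calculations as the protein path and only adds the parent-protein link, mirroring the two paths in the figure key.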
Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
• building large life science ontologies (GlycO, an ontology for the structure and function of glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications
• entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data
• semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive
• semantic applications developed
Relationship extraction from unstructured data
(other related research: biological entity extraction)
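Overview
[Figure: UMLS schema classes Biologically active substance, Lipid, and Disease or Syndrome, connected by the relationships affects, causes and complicates. The MeSH terms Fish Oils and Raynaud’s Disease are each linked by instance_of to UMLS classes; the relationship between the two terms is unknown (???????). PubMed document counts shown: 9284 documents, 4733 documents, 5 documents.]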
About the data used
• UMLS – A high level schema of the biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships, using variant lookup (tools from NLM)
• MeSH
– Terms already asserted as instance of one or more classes in UMLS
• PubMed
– Abstracts annotated with one or more MeSH terms
Example relationship variants for T147: effect, induce, etiology, cause, effecting, induced
Example PubMed abstract (for the domain expert)
Abstract
Classification/Annotation
Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
Method – Identify entities and Relationships in Parse Tree
Modifiers, Modified entities, Composite Entities
Method – Fact Extraction from Parse Tree
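To make the fact-extraction step concrete, the sketch below reads the bracketed parse shown earlier and pulls out a (subject, relationship, object) candidate. It is a deliberately naive Python illustration, not the method used in the talk: it assumes the sentence has the shape (TOP (S (NP ...) (VP (VB* ...) (NP ...)))), normalizes the verb by stripping a trailing "s", and uses a tiny hypothetical lookup table built from the T147 variants listed on the data slide.

```python
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous)) (NN stimulation)) (PP (IN by) (NP (NN estrogen)))) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia)) "
    "(PP (IN of) (NP (DT the) (NN endometrium)))))))"
)

# Variants taken from the slide; mapping them all to T147 follows the slide.
RELATIONSHIP_VARIANTS = {
    "effect": "T147", "induce": "T147", "etiology": "T147",
    "cause": "T147", "effecting": "T147", "induced": "T147",
}

def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        label = tokens[i + 1]          # tokens[i] is "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
                children.append(child)
            else:
                children.append(tokens[i])
                i += 1
        return (label, children), i + 1

    tree, _ = read(0)
    return tree

def leaves(node):
    words = []
    for child in node[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words

def first_child(node, prefix):
    return next(c for c in node[1]
                if isinstance(c, tuple) and c[0].startswith(prefix))

def extract_fact(tree):
    sentence = first_child(tree, "S")
    subject = first_child(sentence, "NP")
    verb_phrase = first_child(sentence, "VP")
    verb = first_child(verb_phrase, "VB")
    obj = first_child(verb_phrase, "NP")
    return " ".join(leaves(subject)), leaves(verb)[0], " ".join(leaves(obj))

subject, verb, obj = extract_fact(parse_tree(PARSE))
stem = verb[:-1] if verb.endswith("s") else verb   # naive normalization
print(subject, "|", verb, RELATIONSHIP_VARIANTS.get(stem), "|", obj)
```

A fuller treatment would separate modifiers from modified entities inside each noun phrase (for example "excessive" from "stimulation by estrogen") and match the remaining heads against MeSH terms, as the preceding slides indicate.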
Semantic annotation of scientific/experimental data
ms/ms peaklist data

parent ion m/z    parent ion abundance    parent ion charge
830.9570          194.9604                2

fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043
ProPreO: Ontology-mediated provenance
Mass Spectrometry (MS) Data
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
             mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
Ontological Concepts
ProPreO: Ontology-mediated provenance
Semantically Annotated MS Data
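The step from a raw ms/ms peak list to its annotated form is mechanical once each column has a name. As a minimal Python sketch (an illustration, not the project's annotation tool), the function below assumes the first line holds the parent ion (m/z, abundance, charge) and every later line a fragment ion (m/z, abundance), and reuses the element and attribute names from the annotated example on the slide.

```python
import xml.etree.ElementTree as ET

PEAKLIST = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"

def annotate_peaklist(peaklist, instrument):
    """Wrap each number of a raw ms/ms peak list in a named attribute."""
    rows = [line.split() for line in peaklist.strip().splitlines()]
    parent_mz, parent_abundance, parent_charge = rows[0]

    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": "ms-ms"})
    ET.SubElement(root, "parent_ion", {
        "m-z": parent_mz, "abundance": parent_abundance, "z": parent_charge,
    })
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root

print(ET.tostring(annotate_peaklist(PEAKLIST, INSTRUMENT), encoding="unicode"))
```

Values are kept as the original strings so that no precision is lost between the instrument output and the annotated file.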
N-Glycosylation Process (NGP)
[Figure: experimental workflow. Samples: Cell Culture, Glycoprotein Fraction, Glycopeptides Fraction, Peptide Fraction. Steps: extract, proteolysis, Separation technique I, PNGase, Separation technique II, Mass spectrometry, Data reduction, binning, Peptide identification, Signal integration, Data correlation. Data: ms data, ms/ms data, ms peaklist, ms/ms peaklist, N-dimensional array, Peptide list. Result: Glycopeptide identification and quantification. Multiplicity labels shown on the fractions: 1, n, n*m.]
Semantic Web Process to incorporate provenance
[Figure: a chain of agent-run services, each with an input (I) and an output (O): Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Pre-process; DB Search (Mascot/Sequest); Results Post-process (ProValt). Items in Storage: Raw Data, Standard Format Data, Filtered Data, Search Results, Final Output. Also shown: Biological Information, Semantic Annotation Applications.]
Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene
Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)
Biomedical Knowledge Repository
[Figure: Entrez and other sources (….) feeding the Biomedical Knowledge Repository]
Implementation
[Figure: Entrez Gene → Entrez Gene XML → XSLT → Entrez Gene RDF → Entrez Gene RDF graph, with a Web interface]
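The slides name XSLT as the transformation from Entrez Gene XML to RDF. To show the idea without a stylesheet, here is a comparable mapping sketched in Python, which is a swapped-in technique and not the implementation described in the talk. The sample record is a flattened, hypothetical simplification (real Entrez Gene XML is deeply nested), the gene identifier is a placeholder, and the namespace EG and the predicate has_locus are assumptions; only has_protein_reference_name_E appears on the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, flattened stand-in for an Entrez Gene XML record.
SAMPLE_XML = """<Entrezgene>
  <Gene-track_geneid>0000</Gene-track_geneid>
  <Gene-ref_locus>APP</Gene-ref_locus>
  <Prot-ref_name_E>amyloid beta A4 protein</Prot-ref_name_E>
  <Prot-ref_name_E>protease nexin-II</Prot-ref_name_E>
</Entrezgene>"""

EG = "http://example.org/entrezgene#"  # hypothetical namespace

# XML element name -> RDF predicate local name (assumed mapping).
PREDICATES = {
    "Gene-ref_locus": "has_locus",
    "Prot-ref_name_E": "has_protein_reference_name_E",
}

def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def entrez_xml_to_ntriples(xml_text):
    root = ET.fromstring(xml_text)
    gene_id = root.findtext("Gene-track_geneid")
    subject = f"<{EG}gene_{gene_id}>"
    lines = []
    for element in root.iter():
        predicate = PREDICATES.get(element.tag)
        if predicate and element.text and element.text.strip():
            literal = escape_literal(element.text.strip())
            lines.append(f'{subject} <{EG}{predicate}> "{literal}" .')
    return lines

for line in entrez_xml_to_ntriples(SAMPLE_XML):
    print(line)
```

Each mapped element becomes one triple about the gene, so repeated elements such as protein reference names simply yield several triples with the same predicate.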
Connecting different genes
APP gene [Homo sapiens]
APP gene [Gallus gallus]
APP gene [Canis familiaris]
protease nexin-II
amyloid beta A4 protein
amyloid-beta protein
A4 amyloid protein
beta-amyloid peptide
amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)
cerebral vascular amyloid peptide
amyloid protein
eg:has_protein_reference_name_E
amyloid beta A4 protein
Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
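Once gene records are triples, the question above reduces to finding genes that share a value of eg:has_protein_reference_name_E. The Python sketch below illustrates that lookup; the gene labels and protein names are taken from the slide, but which name belongs to which gene is an illustrative assumption (apart from the shared "amyloid beta A4 protein"), and a shared name is only a candidate signal for functional homology.

```python
from collections import defaultdict

NAME = "eg:has_protein_reference_name_E"

# Illustrative assignment of names to genes; not taken from Entrez Gene.
TRIPLES = [
    ("APP gene [Homo sapiens]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Homo sapiens]", NAME, "protease nexin-II"),
    ("APP gene [Gallus gallus]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", NAME, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", NAME, "amyloid protein"),
]

def genes_sharing_names(triples, gene):
    """Map every other gene to the protein names it shares with `gene`."""
    names_of = defaultdict(set)
    genes_with = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == NAME:
            names_of[subject].add(obj)
            genes_with[obj].add(subject)

    shared = {}
    for name in names_of[gene]:
        for other in genes_with[name] - {gene}:
            shared.setdefault(other, set()).add(name)
    return shared

for other, names in sorted(genes_sharing_names(TRIPLES, "APP gene [Homo sapiens]").items()):
    print(other, "shares", sorted(names))
```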
Integrated Semantic Information and Knowledge System (ISIS)
[Figure: PROTEOMICS WORKFLOW: Raw → Raw2mzXML → mzXML → mzXML2Pkl → Pkl → Pkl2pSplit → pSplit → MASCOT Search → MASCOT result → ProVault → ProVault result. Also shown: EXPERIMENTAL DATA, PROTEOMECOMMONS, Semantic Annotation Metadata File, Semantic Metadata Registry, ProPreO ontology, SPARQL query-based User Interface.]
Have I made an error? Is the result erroneous? Give me all result files from a similar organism, cell, preparation and mass spectrometric conditions, and compare results.
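The request above is a query over semantic metadata. The slides name a SPARQL query-based interface; as a stand-in that runs without an RDF store, the Python sketch below filters an in-memory registry. Everything in it is hypothetical: the record fields, the sample values, and above all the reading of "similar" as exact equality on the four fields, where an ontology-backed system could instead match on related classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResultFile:
    name: str
    organism: str
    cell: str
    preparation: str
    ms_conditions: str

SIMILARITY_KEYS = ("organism", "cell", "preparation", "ms_conditions")

def similar_results(registry, query, keys=SIMILARITY_KEYS):
    """Return other result files that agree with `query` on every key."""
    return [
        record for record in registry
        if record.name != query.name
        and all(getattr(record, key) == getattr(query, key) for key in keys)
    ]

# Invented registry entries for illustration.
registry = [
    ResultFile("run_01", "organism_A", "cell_X", "prep_1", "ms-ms"),
    ResultFile("run_02", "organism_A", "cell_X", "prep_1", "ms-ms"),
    ResultFile("run_03", "organism_A", "cell_Y", "prep_1", "ms-ms"),
    ResultFile("run_04", "organism_B", "cell_X", "prep_1", "ms-ms"),
]

for match in similar_results(registry, registry[0]):
    print(match.name)
```

Passing a shorter `keys` tuple relaxes the comparison, for example to retrieve runs from the same organism and cell regardless of preparation.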
Summary, Observations, Conclusions
• We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
• http://lsdis.cs.uga.edu
• http://knoesis.org
• http://lsdis.cs.uga.edu/projects/asdoc/
• http://lsdis.cs.uga.edu/projects/glycomics/