Discovering Disease Associations using a Biomedical Semantic Web: Integration and Ranking

Modifier Genes (16), ranked

Rank  GeneSymbol  Score        Pubmed Evidence
1     UTRN        21.89344952  12868498, 10423348, 11186
2     FASLG       17.42028994  16168288, 16080838
3     ACTA1       12.36025539  16945537, 10508519
4     DTNA        8.888475658  16644324, 16427346
5     DAG1        5.893112758  15117830, 14564412
6     KCNJ12      4.838225059  Novel Gene
7     SNTA1       4.623228312  16427346


One of the principal goals of biomedical research is to elucidate the complex network of gene interactions underlying common human diseases. Although integrative genomics approaches have been shown to be successful in understanding the underlying pathways and biological processes in normal and disease states, most current biomedical knowledge is spread across different databases in different formats. Semantic Web principles, standards and technologies provide an ideal platform to integrate such heterogeneous information and to bring forth implicit relations hitherto embedded in these large biomedical and genomic datasets. Semantic Web query languages such as SPARQL can be used to mine the biological entities underlying complex diseases through richer and more complex queries on this integrated data. However, the result sets are frequently large and unmanageable. There is therefore a need for techniques that rank resources on the Semantic Web, so that query results can be retrieved in ranked order and information overload avoided. Such ranking can be used to prioritize newly discovered disease–gene, disease–pathway or disease–process relationships. We implemented an existing Semantic Web based knowledge mining technique that not only discovers the genes, processes and pathways underlying diseases but also determines the importance of resources, so that search results can be ranked while semantic associations are determined.

Data Integration - RDF MODEL

Ranga Chandra Gudivada 1,2, Xiaoyan A. Qu 1,2, Anil G Jegga 2,3,4, Eric K. Neumann 5, Bruce J Aronow 1,2,3,4

Departments of Biomedical Engineering 1 and Pediatrics 2, University of Cincinnati, Center for Computational Medicine 3 and Division of Biomedical Informatics 4, Cincinnati Children’s Hospital Medical Center, Cincinnati OH-45229, USA and Teranode Corporation 5, Seattle, WA 98104

Case Study - Prioritizing Modifier Genes, Pathways and Biological Processes for CARDIOMYOPATHY, DILATED

Computational Problem

Data integration: biological feature complexity is deep, heterogeneous, and extensive.

Data complexity poses a formidable challenge to efforts to integrate, formally model, and simulate biological systems behaviors

Likelihood Ranking requires mining and prioritization of entities and events that function in the context of biological networks

Biological Problem

Disease genes discovered to date likely represent the easy ones. Discovering the genetic basis of the remaining Mendelian and complex gene-X-gene-X-environment disorders will be challenging and will require consideration of many more features and causal relationships.

No gene operates in a vacuum; all gene, protein and pathway interactions can lead to modifier gene effects.

Identifying modifier genes, i.e. the gene networks underlying diseases (pathways, biological processes and functions), is challenging.

Benefits of Semantic Web

Semantic Web standards such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) facilitate semantic integration of heterogeneous multi-source data

SPARQL, a Semantic Web query language capable of expressing queries over higher-order relationships in multi-dimensional data, can be used to mine Bio-RDF graphs

Prioritization of biological entities on the Semantic Web can be accomplished by extending [2] and applying existing graph algorithms, such as the Kleinberg algorithm [1]

[Figure: RDF model]

Node types: Gene Symbol; Disease CUI, Disease Name; Anatomy CUI, Anatomy Name; Pathway Id, Pathway Description; Biol.Process GO ID, Biol.Process Description; Mol.Function GO ID, Mol.Function Description; Cell.Component GO ID, Cell.Component Description; Mouse Phenotype ID, Mouse Phenotype Description

Properties: rdfs:label (links each ID to its name or description), inBiologicalProcess, inMolecularFunction, occursInPathway, hasAssociatedGene, hasAssociatedAnatomy, hasAssociatedDisease, hasMousePhenoType
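To make the RDF model concrete, the following is a minimal sketch of how such a graph could be held as subject-property-object triples and traversed. It uses only the Python standard library rather than a real RDF store. The property names are taken from the figure, but the edge directions, the identifier `disease:CMD_XL`, the `gene:`/`pathway:` prefixes, and the helper names `match`, `pathways_for_disease` and `label` are assumptions made for illustration only.

```python
from collections import namedtuple

# One RDF-style statement: subject, property, object.
Triple = namedtuple("Triple", ["s", "p", "o"])

# Toy graph. Property names follow the poster's figure; the edge directions
# and the identifiers (e.g. "disease:CMD_XL") are hypothetical.
TRIPLES = [
    Triple("disease:CMD_XL", "rdfs:label", "CARDIOMYOPATHY, DILATED, X-LINKED"),
    Triple("disease:CMD_XL", "hasAssociatedGene", "gene:DMD"),
    Triple("gene:DMD", "rdfs:label", "DMD"),
    Triple("gene:DMD", "occursInPathway", "pathway:h_agrPathway"),
    Triple("pathway:h_agrPathway", "rdfs:label",
           "Agrin in Postsynaptic Differentiation"),
    Triple("gene:DMD", "inBiologicalProcess", "GO_0006936"),
    Triple("GO_0006936", "rdfs:label", "muscle contraction"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t.s == s)
        and (p is None or t.p == p)
        and (o is None or t.o == o)
    ]


def pathways_for_disease(triples, disease):
    """Follow disease -> gene -> pathway, keeping first-seen order."""
    found = []
    for gene_edge in match(triples, s=disease, p="hasAssociatedGene"):
        for pathway_edge in match(triples, s=gene_edge.o, p="occursInPathway"):
            if pathway_edge.o not in found:
                found.append(pathway_edge.o)
    return found


def label(triples, resource):
    """Return the rdfs:label of a resource, or the resource id if unlabeled."""
    labels = match(triples, s=resource, p="rdfs:label")
    return labels[0].o if labels else resource


for pathway in pathways_for_disease(TRIPLES, "disease:CMD_XL"):
    print(pathway, "->", label(TRIPLES, pathway))
```

Because every source is reduced to the same triple shape, adding another database amounts to appending more triples that reuse the same identifiers, which is the integration benefit the poster attributes to RDF.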

[Figure labels, data sources: BIND, REACTOME, Nature Pathway Interaction Database]

Ranking on Semantic Web

Kleinberg Algorithm [1]

Authoritative node (high authority score): when it is pointed to by good hubs, its authority score increases

Hub node (high hub score): pointing to many authoritative sites increases its hub score
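The mutual reinforcement between hubs and authorities described above can be sketched as a short iteration. This is a generic, standard-library illustration of the hub/authority update from [1] on a toy directed graph; the node names, the fixed iteration count and the function name `hits` are assumptions, and it is not the poster's implementation.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Compute hub and authority scores for a list of (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes pointing to you.
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the authority scores of nodes you point to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Toy graph: h1 points to both targets, h2 to only one of them.
EDGES = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub_scores, auth_scores = hits(EDGES)
for node in sorted(auth_scores, key=auth_scores.get, reverse=True):
    print(node, round(auth_scores[node], 4), round(hub_scores[node], 4))
```

In this toy graph `a1` ends up with a higher authority score than `a2` because two hubs point to it, and `h1` ends up with a higher hub score than `h2` because it points to more authoritative nodes.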

Extending the Kleinberg algorithm [2] for the Semantic Web

Non-symmetric property: gene --associatedPathway--> Pathway (the property carries a subjectivity weight and an objectivity weight)

Subjectivity weight > objectivity weight

A single gene participating in multiple biological pathways is considered more sensitive to perturbation than a single pathway having a large number of nodes (different weights for non-symmetric properties)

Corollary, symmetric property: geneA --interacts-- geneB (the property carries a subjectivity weight and an objectivity weight)

Subjectivity weight = objectivity weight

GeneA interacting with various genes has equal significance as GeneB interacting with various genes (equal weights for symmetric properties)
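One plausible reading of the property-weighted extension is sketched below: each property has a subjectivity weight (applied when a resource is the subject of a triple) and an objectivity weight (applied when it is the object), and the hub/authority iteration is run with those weights. The exact update rule in [2] may differ; the weight values, the function names `weighted_scores` and `expand_symmetric`, and the toy triples are all assumptions for illustration.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def expand_symmetric(triples, symmetric_props):
    """Add the reverse triple for every symmetric property (e.g. interacts)."""
    out = list(triples)
    for s, p, o in triples:
        if p in symmetric_props and (o, p, s) not in out:
            out.append((o, p, s))
    return out


def weighted_scores(triples, weights, iterations=50):
    """Iterate subjectivity/objectivity scores over (s, p, o) triples.

    weights[p] is a (subjectivity_weight, objectivity_weight) pair.
    This update rule is an assumed formulation, not taken from [2].
    """
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    subj = dict.fromkeys(nodes, 1.0)
    obj = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Objectivity: credit flowing into a resource from its subjects.
        new_obj = dict.fromkeys(nodes, 0.0)
        for s, p, o in triples:
            new_obj[o] += weights[p][1] * subj[s]
        # Subjectivity: credit flowing back from the objects it points to.
        new_subj = dict.fromkeys(nodes, 0.0)
        for s, p, o in triples:
            new_subj[s] += weights[p][0] * new_obj[o]
        obj = _normalize(new_obj)
        subj = _normalize(new_subj)
    return subj, obj


# Hypothetical weights: subjectivity > objectivity for the non-symmetric
# property, equal weights for the symmetric one.
WEIGHTS = {"associatedPathway": (1.0, 0.5), "interacts": (1.0, 1.0)}

TRIPLES = expand_symmetric(
    [
        ("geneA", "associatedPathway", "pathway1"),
        ("geneA", "associatedPathway", "pathway2"),
        ("geneB", "associatedPathway", "pathway1"),
        ("geneA", "interacts", "geneB"),
    ],
    symmetric_props={"interacts"},
)

subjectivity, objectivity = weighted_scores(TRIPLES, WEIGHTS)
for node in sorted(subjectivity, key=subjectivity.get, reverse=True):
    print(node, round(subjectivity[node], 4), round(objectivity[node], 4))
```

With these toy inputs, geneA (two pathways) receives a higher subjectivity score than geneB (one pathway), matching the intuition that a gene participating in several pathways matters more. The symmetric `interacts` edge contributes equally in both directions.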

CARDIOMYOPATHY, DILATED, X-LINKED

Step 1: Primary Genes (1): DMD
  Pathways (1): h_agrPathway, Agrin in Postsynaptic Differentiation
  Biological Processes (4): GO_0006936 muscle contraction; GO_0007016 cytoskeletal anchoring; GO_0043043 peptide biosynthesis; GO_0007517 muscle development
  Interacting Partners (16)

Step 2: Primary genes + Interacting Partners (1+16)
  Pathways (28)
  Biological Processes (27)

QUERY RESULT WITH PRIORITIZATION

Modifier Genes (16)

Pathways (28), top-ranked

Rank  Pathway ID           Description                                                           Score
1     h_agrPathway         Agrin in Postsynaptic Differentiation                                 1.134984242
2     h_hsp27Pathway       Stress Induction of HSP Regulation                                    0.139887918
3     h_actinYPathway      Y branching of actin filaments                                        0.093908976
3     h_no1Pathway         Actions of Nitric Oxide in the Heart                                  0.093908976
3     h_nfatPathway        NFAT and Hypertrophy of the heart (Transcription in the broken heart) 0.093908976
3     h_metPathway         Signaling of Hepatocyte Growth Factor Receptor                        0.093908976
3     h_salmonellaPathway  How does salmonella hijack a cell                                     0.093908976
3     h_mCalpainPathway    mCalpain and friends in Cell motility                                 0.093908976
3     h_PDZsPathway        Synaptic Proteins at the Synaptic Junction                            0.093908976
3     h_rabPathway         Rab GTPases Mark Targets In The Endocytotic Machinery                 0.093908976

Biological Processes (27), top-ranked

Rank  GO ID       Description                        Score
1     GO_0006936  muscle contraction                 1.5385859
2     GO_0007517  muscle development                 0.3562762
3     GO_0007165  signal transduction                0.1139403
4     GO_0048741  skeletal muscle fiber development  0.1102909
4     GO_0030240  muscle thin filament assembly      0.1102909
4     GO_0043043  peptide biosynthesis               0.1027902
4     GO_0007016  cytoskeletal anchoring             0.1027902

[Figure: integrated data sources]

Disease: OMIM, Mammalian Phenotype, Others
Gene / Protein Annotations: Entrez Gene, SwissProt, Gene Ontology, others
Pathways: BIOCARTA, KEGG, BIOCYC
Molecular Interactions

SPARQL QUERY

PREFIX CCHMC: <http://www.cchmc.com/test.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?pathway
WHERE {
  ?pathway rdf:type CCHMC:Pathway .
  ?resource ?PROPERTY ?pathway .
}
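The query above selects every distinct resource that is typed as a CCHMC:Pathway and is also the object of at least one triple, whatever the property. The following standard-library sketch evaluates the same two patterns over an in-memory list of triples, to show what the query returns. The sample triples, the `CCHMC:occursInPathway` property name and the function `select_linked_pathways` are hypothetical; a real deployment would run the SPARQL query against an RDF store instead.

```python
RDF_TYPE = "rdf:type"


def select_linked_pathways(triples, pathway_class="CCHMC:Pathway"):
    """Emulate the two triple patterns of the poster's SPARQL query.

    Pattern 1: ?pathway rdf:type CCHMC:Pathway
    Pattern 2: ?resource ?PROPERTY ?pathway  (pathway appears as an object)
    Returns the distinct matches in sorted order.
    """
    typed_as_pathway = {s for s, p, o in triples
                        if p == RDF_TYPE and o == pathway_class}
    used_as_object = {o for _, _, o in triples}
    return sorted(typed_as_pathway & used_as_object)


# Hypothetical sample data: two typed pathways, only one of which is
# referenced by another resource.
SAMPLE = [
    ("CCHMC:h_agrPathway", RDF_TYPE, "CCHMC:Pathway"),
    ("CCHMC:h_no1Pathway", RDF_TYPE, "CCHMC:Pathway"),
    ("CCHMC:DMD", RDF_TYPE, "CCHMC:Gene"),
    ("CCHMC:DMD", "CCHMC:occursInPathway", "CCHMC:h_agrPathway"),
]

print(select_linked_pathways(SAMPLE))
```

In the sample data only `CCHMC:h_agrPathway` satisfies both patterns, since nothing points to `CCHMC:h_no1Pathway`. Because the second pattern leaves the property unbound, the query returns pathways linked through any relation, which is one reason the unranked result sets can grow large.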

1. Kleinberg, J. M. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (Sep. 1999)

2. Bhuvan Bamba, Sougata Mukherjea: Utilizing Resource Importance for Ranking Semantic Web Query Results. SWDB 2004: 185-198

Conclusion

We have shown that related yet heterogeneous information can be integrated using RDF-OWL and that this approach can support mechanistic analyses of diseases. Specifically, we have uncovered additional genes and pathways that could play a role in the onset and treatment of cardiomyopathy. We intend to expand our analyses into additional modalities such as anatomy, cell type, and symptoms/phenotypes.