GraphConnect Europe 2016 - Building a Repository of Biomedical Ontologies with Neo4j - Simon Jupp
Building a repository of biomedical ontologies with Neo4j
Simon Jupp, [email protected], @simonjupp
Samples, Phenotypes and Ontologies Team
European Bioinformatics Institute
Cambridge, UK
Biological data heavily interlinked
[Figure. Labels: tissue; Genome, Proteome, Metabolome; mRNA array, miRNA array, antibody array, LC-MS/MS, CE-MS; example mass spectrum (intensity against m/z); Pathways, Protein Interaction, Drug targets]
We need terminology standards
Dyschromatopsia
Search PubMed for “color blindness”
Search PubMed for “Dyschromatopsia”
Search PubMed for "abnormality of the eye"
The ontology of color blindness
HP:0011518 (Dichromacy)
UBERON:0000970 (Eye)
HP:0000551 (Abnormality of color vision)
HP:0007641 (Dyschromatopsia)
Relations shown: Is-a, Is-a, Disease-location
synonym: “Colorblindness”
definition: “A form of colorblindness in which only two of the three fundamental colors can be distinguished due to a lack of one of the retinal cone pigments.”
Ontologies for life sciences
Genotype, Phenotype
Sequence, Proteins, Gene products, Transcript, Pathways, Cell type, BRENDA tissue / enzyme source, Development, Anatomy, Phenotype, Plasmodium life cycle
- Sequence types and features
- Genetic context
- Molecule role
- Molecular function
- Biological process
- Cellular component
- Protein covalent bond
- Protein domain
- UniProt taxonomy
- Pathway ontology
- Event (INOH pathway ontology)
- Systems biology
- Protein-protein interaction
- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development (FBdv)
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version
- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development
- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Human phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history
- eVOC (Expressed Sequence Annotation for Humans)
Ontology Lookup Service
• Ontology search engine (Solr)
• Graph database of terms (Neo4j)
• Powerful RESTful API (built with Spring Data Neo4j / Spring Data REST)
• Open source project
• Generic infrastructure (can load any ontology represented in OWL)
https://github.com/EBISPOT/OLS
Repository of over 140 biomedical ontologies (4.5 million terms, 11 million relations)
http://www.ebi.ac.uk/ols/beta
Web Ontology Language (OWL)
• W3C standard vocabulary for describing ontologies
• Powerful knowledge representation
However
• OWL ontologies aren’t graphs, but…
  … can be represented as an RDF graph
  … people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste
OWL to Neo4j schema
• Each node is labelled with one of {Class, Property, Individuals} AND the ontology name
• All OWL annotations become node properties (labels, id, descriptions, etc.)
• Superclass axioms (named classes and simple existential restrictions) become edges in Neo4j
• E.g. in OWL: “heart” subClassOf (part-of some “cardiovascular system”)
  in Neo4j: “heart” part-of “cardiovascular system”
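The mapping rule on this slide can be sketched in a few lines of Python. This is an illustration only, not the OLS loader: the tuple-based axiom format and the names `Axiom` and `axioms_to_edges` are assumptions made for the example, and the "heart is an organ" axiom is invented to show the named-superclass case. Only the rule itself (named superclasses and simple existential restrictions become edges) comes from the slide.

```python
from typing import NamedTuple, Optional


class Axiom(NamedTuple):
    """A simplified subclass axiom (hypothetical format, not OWL API output)."""
    subclass: str
    superclass: str
    # None for a named superclass; a property name for "<property> some <superclass>".
    some_property: Optional[str] = None


def axioms_to_edges(axioms):
    """Turn subclass axioms into (source, edge_label, target) triples."""
    edges = []
    for ax in axioms:
        # A named superclass becomes a SUBCLASSOF edge; a simple existential
        # restriction becomes an edge labelled with its property.
        label = ax.some_property if ax.some_property else "SUBCLASSOF"
        edges.append((ax.subclass, label, ax.superclass))
    return edges


axioms = [
    Axiom("heart", "organ"),                             # invented example axiom
    Axiom("heart", "cardiovascular system", "part-of"),  # the slide's example
]
edges = axioms_to_edges(axioms)
```

Flattening "part-of some X" into a direct edge loses the existential semantics, which is presumably why the conclusion slide describes the indexer as handling "simplified OWL".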
What are the subtypes of “colorblindness”?
MATCH (n:Class {obo_id: 'HP:0007641'})<-[r*]-(types:Class)
RETURN n, r, types
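What the variable-length pattern `<-[r*]-` computes can be shown on a small in-memory graph. The sketch below is a plain-Python stand-in, not a call to Neo4j. It follows is-a links only, whereas the untyped Cypher pattern follows incoming relationships of any type. The two HP entries are assumed from the "ontology of color blindness" slide, and `EX:0000001` is a made-up placeholder added to show that the traversal is transitive.

```python
from collections import deque

# child -> parents: a toy is-a hierarchy (see the assumptions above).
IS_A = {
    "HP:0011518": ["HP:0007641"],  # Dichromacy is-a Dyschromatopsia
    "HP:0007641": ["HP:0000551"],  # Dyschromatopsia is-a Abnormality of color vision
    "EX:0000001": ["HP:0011518"],  # hypothetical term, not a real identifier
}


def descendants(term, is_a):
    """Return every term that reaches `term` through one or more is-a links."""
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first walk down the hierarchy.
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


subtypes = descendants("HP:0007641", IS_A)
```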
What parts of the eye are related to diseases?
MATCH (eye:Class {obo_id: 'UBERON:0000970'})<-[r:Related {label: "part_of"}]-(eye_part:Class)<-[r1:Related {label: "has_disease_location"}]-(disease:Class)
RETURN eye, r, r1, eye_part, disease
Finding common ancestors via shortest path
MATCH p = shortestPath( (a:Class)-[r:SUBCLASSOF*]-(b:Class) )
RETURN nodes(p)
What is the common taxonomic superfamily of gibbons and chimpanzees? (or Hylobatidae and Pan troglodytes!)
https://commons.wikimedia.org/wiki/File:Hylobates_lar_pair_of_white_and_black_01.jpg
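The gibbon and chimpanzee question can be worked through on a hand-written fragment of the taxonomy. Note that this sketch swaps in a different technique from the slide: instead of an undirected `shortestPath` over SUBCLASSOF edges, it walks parent pointers upward from both taxa and returns the first shared ancestor. The `PARENT` map is a simplified fragment typed in for the example (some ranks are omitted), not data loaded from OLS, and the function names are illustrative.

```python
# child -> parent: a simplified, hand-written fragment of the primate taxonomy.
PARENT = {
    "Pan troglodytes": "Pan",
    "Pan": "Homininae",
    "Homininae": "Hominidae",
    "Hominidae": "Hominoidea",
    "Hylobatidae": "Hominoidea",
    "Hominoidea": "Catarrhini",
}


def lineage(taxon, parent):
    """Return the taxon followed by its ancestors, nearest first."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def nearest_common_ancestor(a, b, parent):
    """Return the first taxon on b's lineage that is also on a's lineage."""
    ancestors_of_a = set(lineage(a, parent))
    for taxon in lineage(b, parent):
        if taxon in ancestors_of_a:
            return taxon
    return None


common = nearest_common_ancestor("Hylobatidae", "Pan troglodytes", PARENT)
```

In a strict tree the two approaches agree, because the shortest undirected path between two nodes passes through their nearest common ancestor.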
OLS visualisations
• Partonomy for heart from the UBERON anatomy ontology
MATCH path = (n:Class)-[r:SUBCLASSOF|PartOf*]->(ancestor)
REST API (Spring Data REST + Neo4j)
• Crawlable API - Hypermedia driven (HAL)
• Get ontology and term metadata
  • /ontologies
  • /ontologies/{name}
  • /ontologies/{name}/terms
  • /ontologies/{name}/terms/{termid}
• Get related terms and navigate ontology structure
  • /ontologies/{name}/terms/{termid}/parent
  • /ontologies/{name}/terms/{termid}/children
  • /ontologies/{name}/terms/{termid}/descendants
  • /ontologies/{name}/terms/{termid}/ancestors
  • /ontologies/{name}/terms/{termid}/{relation} e.g. part_of
http://www.ebi.ac.uk/ols/beta/api
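The path templates on this slide can be assembled with a small helper. The sketch below only builds URL strings and makes no network calls. Several details are assumptions: the slides do not say how `{termid}` should be written or encoded, so percent-encoding the identifier is a guess, as is using `hp` as the ontology name. The helper names `ontology_url` and `term_url` are invented for the example.

```python
from urllib.parse import quote

# Base URL as given on the slide (the beta deployment).
BASE_URL = "http://www.ebi.ac.uk/ols/beta/api"


def ontology_url(name):
    """Build /ontologies/{name}."""
    return "/".join([BASE_URL, "ontologies", quote(name, safe="")])


def term_url(name, term_id, relation=None):
    """Build /ontologies/{name}/terms/{termid}[/{relation}].

    `relation` may be one of the fixed segments from the slide (parent,
    children, descendants, ancestors) or a relation name such as part_of.
    """
    # Assumption: the identifier is percent-encoded as a single path segment.
    parts = [ontology_url(name), "terms", quote(term_id, safe="")]
    if relation is not None:
        parts.append(quote(relation, safe=""))
    return "/".join(parts)


children_url = term_url("hp", "HP:0007641", "children")
```

Because the API is hypermedia driven, a client would normally follow the links returned in each HAL response rather than construct URLs by hand; a helper like this is mainly useful for an entry point.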
Building the index
• We check all 140 external ontology files nightly for changes
• We have a master build index
• When an ontology is updated we remove the old version and reload it using the Neo4j BatchInserter (potentially fragile)
• We push the master index to various production data centers
• Provides load balancing
Nightly crawl of all >140 registered ontologies
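The slides do not say how a changed ontology file is detected. One common approach, shown here purely as an assumption, is to keep a content hash per ontology and reload only those whose hash differs from the previous run. The function names and the in-memory `known` dictionary are hypothetical; a real pipeline would persist the fingerprints and download the files.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of an ontology file's bytes."""
    return hashlib.sha256(content).hexdigest()


def ontologies_to_reload(downloaded, known_fingerprints):
    """Return the names whose content changed, updating the stored fingerprints.

    `downloaded` maps ontology name -> file bytes from tonight's crawl;
    `known_fingerprints` maps ontology name -> digest from the last run.
    """
    changed = []
    for name, content in downloaded.items():
        digest = fingerprint(content)
        if known_fingerprints.get(name) != digest:
            changed.append(name)
            known_fingerprints[name] = digest
    return sorted(changed)


known = {}
first_night = ontologies_to_reload({"hp": b"v1", "uberon": b"v1"}, known)
second_night = ontologies_to_reload({"hp": b"v2", "uberon": b"v1"}, known)
```

On the first run every ontology counts as changed; on the second only `hp` does, so only that ontology would be removed and reloaded.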
Conclusion
• We’ve built a scalable repository of biomedical ontologies with Neo4j
• Generic OWL indexer (simplified OWL)
• Powerful REST API built with Spring
• Acts as standalone OWL ontology server
• Now being deployed externally
• Beta ~2000 users / 10 million requests per month
• Would like to discuss
  • Batch Inserter
  • Migrating to Spring Data Neo4j 4
Acknowledgements
• Samples, Phenotypes and Ontologies Team - Tony Burdett, James Malone, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Matt Pearce – Flax (BioSOLR project)
• Michal Bachman and GraphAware team (Neo4j training)
• Funding
  • European Molecular Biology Laboratory (EMBL)
  • European Union projects: DIACHRON, BioMedBridges and CORBEL