Comprehensive Self-Service Life Science Data Federation with SADI Semantic Web Services and HYDRA
![Page 1: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/1.jpg)
COMPREHENSIVE SELF-SERVICE
LIFE SCIENCE DATA FEDERATION
WITH SADI SEMANTIC WEB SERVICES
AND HYDRA
Alexandre Riazanov, CTO
IPSNP Computing Inc
Oslo University, Sep 23, 2015
![Page 2: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/2.jpg)
WHO WE ARE
• IPSNP Computing Inc -- a Canadian startup, building on and commercializing prior academic research on SADI.
• Founded to develop an industrial-strength query tool for SADI, to supersede a research proof-of-concept prototype.
• Looking for customers/partners and investors.
![Page 3: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/3.jpg)
BIOMEDICAL RESEARCHERS AND CLINICIANS USE DATA FROM MULTIPLE SOURCES
• Online and in-house databases, spreadsheets.
• Web services, e.g., literature search, etc.
• Nomenclatures, ontologies, controlled vocabularies.
• Web sites, scientific publications, patents, etc.
• Algorithms, e.g., BLAST, molecular structure prediction, various text mining programs, etc.
![Page 4: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/4.jpg)
BIG VISION: FEDERATED QUERYING OF HETEROGENEOUS AND DISTRIBUTED DATA SOURCES
• We want to query 1000s of data sources as a single database.
• We want more agility than data warehousing can provide: e.g., just-in-time algorithm execution, plug-and-play data source addition, live data querying.
• We want to use simple and declarative queries, not to program workflow scripts.
![Page 5: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/5.jpg)
IS THIS SCI-FI?
![Page 6: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/6.jpg)
WE CAN ACTUALLY DO THIS WITH SEMANTIC WEB SERVICES
Here is how our data federation engine HYDRA works:
![Page 7: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/7.jpg)
HOW IS THIS ALL POSSIBLE?
• Key ingredient: the SADI framework for Semantic Web services (Semantic Automated Discovery and Integration).
• SADI services are:
– RESTful services,
– consuming and producing one format -- RDF,
– with semantic descriptions (in OWL) fully defining their functionality.
![Page 8: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/8.jpg)
PLAN OF THE TALK
• What are SADI services?
• Automatic service discovery and invocation in query engines (HYDRA).
• Self-service querying vision.
• Query composition with HYDRA GUI.
• An overview of Bioinformatics and Clinical Intelligence case studies.
Tons of screenshots!
![Page 9: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/9.jpg)
SADI SERVICE I/O
• Input: RDF description of an input object.
• Output: another RDF graph providing more (computed or retrieved) info about the input object or linking it to other objects.
• Since all SADI services “talk the same language” (RDF), they are 100% syntactically interoperable:
– output of one SADI service can be directly consumed by any other SADI service.
“Describe your input, and I will tell you something else about it”
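The chaining idea can be illustrated with a minimal Python sketch. It is a toy model only: an RDF graph is represented as a plain set of (subject, predicate, object) tuples and each service as a function from graph to graph, whereas real SADI services exchange RDF over HTTP. Every name and value here (the `ex:` terms, the two service functions, the lookup tables) is hypothetical and invented for illustration.

```python
# Toy model: an RDF graph is a set of (subject, predicate, object) triples.
# All identifiers and lookup values below are hypothetical.

def service_gene_to_protein(graph):
    """Accepts nodes typed ex:Gene; outputs ex:encodes links to proteins."""
    lookup = {"ex:gene1": "ex:protein1"}  # stand-in for a real data source
    out = set()
    for s, p, o in graph:
        if p == "rdf:type" and o == "ex:Gene" and s in lookup:
            out.add((s, "ex:encodes", lookup[s]))
            out.add((lookup[s], "rdf:type", "ex:Protein"))
    return out

def service_protein_to_function(graph):
    """Accepts nodes typed ex:Protein; outputs ex:hasFunction assertions."""
    lookup = {"ex:protein1": "ex:kinaseActivity"}  # stand-in for a real data source
    out = set()
    for s, p, o in graph:
        if p == "rdf:type" and o == "ex:Protein" and s in lookup:
            out.add((s, "ex:hasFunction", lookup[s]))
    return out

# The output of the first service is fed to the second without any conversion.
input_graph = {("ex:gene1", "rdf:type", "ex:Gene")}
step1 = service_gene_to_protein(input_graph)
step2 = service_protein_to_function(step1)
merged = input_graph | step1 | step2

for triple in sorted(merged):
    print(triple)
```

Because both functions read and write the same triple representation, no adapter code is needed between them, which is the point the slide makes about a single shared format.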
![Page 10: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/10.jpg)
COMPLETE SEMANTIC DESCRIPTIONS OF SERVICE FUNCTIONALITY
• SADI services carry semantic descriptions of their I/O that completely define what the service expects and can accept as input, and what RDF assertions the service can output.
• Unique and extremely powerful property: it facilitates completely automatic discovery and orchestration of services.
![Page 11: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/11.jpg)
HYDRA QUERY ENGINE
● Given a SPARQL query, HYDRA analyses it by using an intelligent logic-based algorithm (proprietary, unlike SADI itself).
● HYDRA requests descriptions of potentially useful services from available SADI service registries.
● HYDRA processes the descriptions and figures out which services have to be invoked, on what data and in what order.
SPARQL is a W3C standard semantic query language -- much more intuitive than SQL.
![Page 12: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/12.jpg)
QUERY EXAMPLE
• Find documents mentioning "haloalkane dehalogenase activity", extract information about mutations and visualise the mutations on 3D protein structure images.
• HYDRA automatically finds and orchestrates 5 services from our registry:
– PubMed search: keyword query ⟶ document PubMed IDs
– PDF retrieval: PubMed ID ⟶ PDF file URL
– ASCII extraction: PDF file ⟶ ASCII text
– Text mining: ASCII text ⟶ mutation info
– Visualisation: mutation & protein ⟶ 3D image (Jmol)
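How an engine might derive an invocation order from service descriptions alone can be sketched with naive forward chaining. This is not HYDRA's algorithm, which is proprietary and logic-based; it is a simplified illustration. The five service names come from the list above, while the type labels (`Keyword`, `PubMedID`, and so on) are made up, and the visualisation service is reduced to a single input for brevity.

```python
# Hypothetical service descriptions: (name, input type, output type).
# Deliberately listed out of order to show that the plan is derived, not given.
SERVICES = [
    ("Visualisation", "MutationInfo", "Image3D"),
    ("Text mining", "AsciiText", "MutationInfo"),
    ("PubMed search", "Keyword", "PubMedID"),
    ("ASCII extraction", "PdfUrl", "AsciiText"),
    ("PDF retrieval", "PubMedID", "PdfUrl"),
]

def plan(available, goal, services):
    """Forward-chain from the available data types until the goal is reachable.

    Returns service names in invocation order, or None if the goal
    cannot be produced from the available inputs.
    """
    known = set(available)
    order = []
    progress = True
    while goal not in known and progress:
        progress = False
        for name, needs, gives in services:
            if needs in known and gives not in known:
                known.add(gives)
                order.append(name)
                progress = True
    return order if goal in known else None

steps = plan({"Keyword"}, "Image3D", SERVICES)
print(steps)
```

A naive planner like this schedules every service that becomes applicable, not only those the goal needs; a real engine would prune the plan against the query.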
![Page 13: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/13.jpg)
RESULTS
Deploying mutation impact text-mining software with the SADI Semantic Web Services framework
http://www.biomedcentral.com/qc/1471-2105/12/S4/S6
![Page 14: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/14.jpg)
WHAT IS SO COOL ABOUT IT?
• Data federation at its best:
– independent, heterogeneous data sources (PubMed doc search, PubMed Central for PDFs);
– not only data is integrated: ASCII extraction, text mining and 3D visualisation are algorithms!
• Execution is completely automatic: HYDRA finds and invokes the services without any help from the user.
![Page 15: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/15.jpg)
MORE QUERY EXAMPLES
• Find drug products that contain active ingredient X.
• Find drugs that have been studied in clinical trials targeting infections caused by bacteria X.
• Annotate a DNA sequence X with molecular functions of proteins produced by the corresponding gene.
• Find patients with precondition X diagnosed with infections Y resulting from procedure Z.
• Many, many other questions that life scientists and clinicians ask on a daily basis.
![Page 16: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/16.jpg)
IT’S ONLY ½ OF THE STORY
![Page 17: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/17.jpg)
REMEMBER THE BIG VISION?
![Page 18: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/18.jpg)
HERE IS AN EVEN BIGGER VISION:
Self-service ad hoc querying of federated data.
![Page 19: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/19.jpg)
HYDRA IMPLEMENTS SEMANTIC QUERYING
• Users need not know how the source data is organised or accessed.
• They just need to know the terminology of their subject domain.
• Queries are completely declarative: specify what you want to find, not how.
![Page 20: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/20.jpg)
HYDRA ALSO SUPPORTS CONCEPT HIERARCHIES AND RULES
● Some queries would be too complex if we could not exploit generality:
– a query concerning all antibiotics requires generalisation, otherwise all types of antibiotics would have to be enumerated in the query.
● A much better way to do this is to import a classification of drugs and use it in query execution.
● HYDRA facilitates such reasoning and even more complex reasoning with rules.
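The benefit of an imported classification can be shown with a small Python sketch. It assumes a toy hierarchy and made-up records; HYDRA's actual reasoning machinery is not described in these slides, so this only illustrates the general idea of answering a query about a general class without enumerating its subclasses.

```python
# Toy drug classification: child class -> parent class (illustrative only).
SUBCLASS_OF = {
    "Penicillin": "Antibiotic",
    "Tetracycline": "Antibiotic",
    "Amoxicillin": "Penicillin",
    "Ibuprofen": "AntiInflammatory",
}

def is_a(cls, ancestor, hierarchy):
    """True if cls equals ancestor or reaches it by following parent links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = hierarchy.get(cls)
    return False

# Hypothetical records: (patient identifier, prescribed drug class).
prescriptions = [("p1", "Amoxicillin"), ("p2", "Ibuprofen"), ("p3", "Tetracycline")]

# The query names only the general class; the hierarchy supplies the rest.
on_antibiotics = [p for p, drug in prescriptions
                  if is_a(drug, "Antibiotic", SUBCLASS_OF)]
print(on_antibiotics)
```

Adding a new antibiotic subclass to the hierarchy changes the answer without any change to the query itself.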
![Page 21: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/21.jpg)
THERE ARE NO FUNDAMENTAL OBSTACLES TO SELF-SERVICE QUERYING
We just need an adequate user interface for building queries.
![Page 22: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/22.jpg)
HYDRA QUERY TOOL = ENGINE + GUI
![Page 23: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/23.jpg)
QUERY COMPOSITION
Queries are built from “Google-like” keyphrases:
Keyphrase: “document mentions protein “P22607””
![Page 24: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/24.jpg)
A QUERY GRAPH IS GENERATED FOR THE KEYPHRASE
“document mentions protein “P22607””
![Page 25: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/25.jpg)
ADDING ANOTHER KEYPHRASE
Keyphrase: “has pubmed id”
![Page 26: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/26.jpg)
QUERY GRAPH IS EXTENDED WITH NODES CORRESPONDING TO THE SECOND KEYPHRASE
Keyphrase: “has pubmed id”
Keyphrase: “document mentions protein “P22607””
![Page 27: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/27.jpg)
OPTION 2: MANUALLY ADD/DELETE CLASSES, INCOMING AND OUTGOING PROPERTIES
![Page 28: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/28.jpg)
MANUALLY ADDED PROPERTY
![Page 29: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/29.jpg)
FINISHED QUERY: FIND PUBMED IDS OF DOCUMENTS MENTIONING PROTEIN P22607 AND CO-MENTIONED PROTEINS
![Page 30: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/30.jpg)
SERVICES IN THE REGISTRY
![Page 31: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/31.jpg)
SPARQL GENERATION
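As a rough illustration of this step, the sketch below turns a query graph, held as a list of triple patterns, into SPARQL text for the finished query about protein P22607. It is an assumption-laden sketch, not HYDRA's generator: the `ex:` vocabulary, the `http://example.org/vocab#` namespace, and the `to_sparql` helper are all hypothetical.

```python
# A query graph as triple patterns; names starting with "?" are variables.
# The ex: vocabulary is hypothetical, not the one HYDRA uses.
QUERY_GRAPH = [
    ("?doc", "rdf:type", "ex:Document"),
    ("?doc", "ex:mentions", "?protein"),
    ("?protein", "ex:hasIdentifier", '"P22607"'),
    ("?doc", "ex:hasPubMedId", "?pmid"),
    ("?doc", "ex:mentions", "?other"),
    ("?other", "rdf:type", "ex:Protein"),
]

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/vocab#",
}

def to_sparql(patterns, select, prefixes):
    """Render triple patterns as a SPARQL SELECT query string."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT DISTINCT " + " ".join(select))
    lines.append("WHERE {")
    lines.extend(f"  {s} {p} {o} ." for s, p, o in patterns)
    lines.append("}")
    return "\n".join(lines)

sparql = to_sparql(QUERY_GRAPH, ["?pmid", "?other"], PREFIXES)
print(sparql)
```

Without an additional FILTER, `?other` can also bind to the queried protein itself; the sketch leaves that out to keep the graph-to-text mapping easy to follow.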
![Page 32: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/32.jpg)
QUERY EXECUTION WITH THE HYDRA ENGINE
![Page 33: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/33.jpg)
EXPORTED RESULTS IN AN EXCEL SPREADSHEET
![Page 34: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/34.jpg)
SADI AND HYDRA QUERY TOOL
AT WORK
![Page 35: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/35.jpg)
BIOINFORMATICS AND CHEMINFORMATICS CASE STUDIES AND PILOTS WITH SADI AND HYDRA
• Integrating genomics text mining results with online biomedical data and visualisation algorithms.
• Integrating programs for lipid molecule structural analysis and classification.
• Interpreting toxicity experiment data by discovering relevant info in online databases.
• Large-scale retrieval of toxicity information from publications.
![Page 36: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/36.jpg)
INTERPRETING TOXICITY EXPERIMENT DATA
• Partner: university lab studying effects of environmental pollutants.
• Querying needs: finding relevant prior experiments, gene annotation, protein domain annotation, etc.
• Data sources: ArrayExpress, BLAST, HMMER3, RefSeq, Pfam, ORFPredictor, GO, UniProt, NCBI Taxonomy -- all queried as a single DB!
![Page 37: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/37.jpg)
SUBTASK: DNA MICROARRAY ANNOTATION
• Toxicity experiments with microarrays: which DNA sequences are under/overexpressed after organism’s exposure to toxin X?
• Interpretation requires knowing affected protein functions and domains.
• HYDRA virtually implements this workflow:
![Page 38: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/38.jpg)
RETRIEVAL OF TOXICITY DATA FROM PUBLICATIONS
• Customer: government agency (Canada).
• Querying needs: online publication search by organism and chemical types, text-mining for toxicity data.
• Data sources: NCBI Taxonomy and ChEBI with free-text search, PubMed search, electronic libraries, journal Web sites, Google Scholar, specialised text-mining algorithm, text utilities.
Apparent value: some queries save a postdoc many weeks of work.
![Page 39: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/39.jpg)
CLASSIFYING NEW LIPID MOLECULES
• One of the early experiments with SADI.
• A group at Carleton U. had a program for identifying functional groups in a molecule structure.
• A group at U. of New Brunswick had a classifier estimating lipid classes based on presence/absence of functional groups.
• Publishing the prototypes as SADI services allowed us to integrate them with each other and relevant external resources.
![Page 40: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/40.jpg)
CLINICAL IT CASE STUDIES AND PILOTS WITH SADI AND HYDRA
• Ad hoc querying of clinical data for Hospital-Acquired Infections surveillance and research (with UNB, McGill SoM and Ottawa H.).
• On-going pilot with a US hospital.
• Looking for pilot opportunities for Clinical Trial Cohort selection:
– trial eligibility criteria can be implemented as queries over heterogeneous and distributed clinical data;
– benefits: cost reduction and timely alerts.
![Page 41: Comprehensive Self-Service Lif Science Data Federation with SADI semantic Web services and HYDRA](https://reader031.fdocuments.net/reader031/viewer/2022030312/58ede65b1a28ab0b408b46a1/html5/thumbnails/41.jpg)
THANK YOU!
Further materials/services are available on request:
• Live and recorded demos.
• Publications on previous (academic) case studies.
• Training/consulting.
• http://ipsnp.com/ (Canada) and http://ipsnp.co/ (UK)