SC1 - Hangout 2: The Open PHACTS pilot
BIG DATA EUROPE H2020 CSA (2015-17) SOCIETAL CHALLENGE “HEALTH”
Integrating Big Data, Software & Communities for Addressing Europe’s Societal Challenges
06.07.2016
BigDataEurope
Today:
• Short overview of Big Data Europe (Ronald Siebes)
• What is Open PHACTS (Stian Soiland-Reyes, Bryn Williams-Jones)
• The Big Data Europe infrastructure (Erika Pauwels, Aad Versteden)
• Pilot 1: The Open PHACTS docker (Stian Soiland-Reyes)
• Q&A
Stian Soiland-Reyes, BioExcel and University of Manchester
Ronald Siebes, VU Amsterdam
Erika Pauwels, Tenforce
Aad Versteden, Tenforce
Bryn Williams-Jones, Open PHACTS Foundation
Big Data Europe
6-Jul-16 www.big-data-europe.eu
Partners :
Q&A
Open PHACTS Architecture and Docker install
Stian Soiland-Reyes, University of Manchester
http://orcid.org/0000-0001-9842-9718
@soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
Big Data Europe Webinar, 2016-07-06
This work has been done as part of the BioExcel CoE (www.bioexcel.eu), a project funded by the EC H2020 program, contract number EINFRA-5-2015 675728
https://slides.com/soilandreyes/2016-07-06-openphacts
http://www.openphacts.org/
Bringing together pharmacological data resources in an integrated, interoperable infrastructure
Data sources integrated and linked together so that you can easily see the relationships between compounds, targets, pathways, diseases and tissues.
ChEBI, ChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, FAERS, Gene Ontology, neXtProt, SureChEMBL, UniProt, WikiPathways
Data integration
https://www.openphacts.org/2/sci/data.html
https://dev.openphacts.org/docs/2.1
Re-exposed as public API
{
  "format": "linked-data-api",
  "version": "1.5",
  "result": {
    "_about": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=json",
    "definition": "https://beta.openphacts.org/api-config",
    "extendedMetadataVersion": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=json&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite",
    "linkPredicate": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "activeLens": "Default",
    "primaryTopic": {
      "_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
      "inDataset": "http://www.conceptwiki.org",
      "exactMatch": [
        {
          "_about": "http://bio2rdf.org/drugbank:DB00398",
          "description_en": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the treatment of advanced renal cell carcinoma (primary kidney cancer). It has also received \"Fast Track\" designation by the FDA for the treatment of advanced hepatocellular carcinoma (primary liver cancer), and has since performed well in Phase III trials.\nSorafenib is a small molecular inhibitor of Raf kinase, PDGF (platelet-derived growth factor), VEGF receptor 2 & 3 kinases and c Kit the receptor for Stem cell factor. A growing number of drugs target most of these pathways. The originality of Sorafenib lays in its simultaneous targeting of the Raf/Mek/Erk pathway.",
          "description": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the treatment of advanced renal cell carcinoma (primary kidney cancer). It has also received \"Fast Track\" designation by the FDA for the treatment of advanced hepatocellular carcinoma (primary liver cancer), and has since performed well in Phase III trials.\nSorafenib is a small molecular inhibitor of Raf kinase, PDGF (platelet-derived growth factor), VEGF receptor 2 & 3 kinases and c Kit the receptor for Stem cell factor. A growing number of drugs target most of these pathways. The originality of Sorafenib lays in its simultaneous targeting of the Raf/Mek/Erk pathway.",
          "drugType_en": [ "investigational", "approved" ],
          "drugType": [ "investigational", "approved" ],
          "genericName_en": "Sorafenib",
          "genericName": "Sorafenib",
          "metabolism_en": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabolism, mediated by CYP3A4, as well as glucuronidation mediated by UGT1A9. Sorafenib accounts for approximately 70-85% of the circulating analytes in plasma at steady-state. Eight metabolites of sorafenib have been identified, of which five have been detected in plasma. The main circulating metabolite of sorafenib in plasma, the pyridine N-oxide, shows in vitro potency similar to that of sorafenib. This metabolite comprises approximately 9-16% of circulating analytes at steady-state.",
          "metabolism": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabolism, mediated by CYP3A4, as well as glucuronidation mediated by UGT1A9. Sorafenib accounts for approximately 70-85% of the circulating analytes in plasma at steady-state. Eight metabolites of sorafenib have been identified, of which five have been detected in plasma. The main circulating metabolite of sorafenib in plasma, the pyridine N-oxide, shows in vitro potency similar to that of sorafenib. This metabolite comprises approximately 9-16% of circulating analytes at steady-state.",
          "proteinBinding_en": "99.5% bound to plasma proteins.",
          "proteinBinding": "99.5% bound to plasma proteins.",
          "toxicity_en": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adverse reactions observed at this dose were primarily diarrhea and dermatologic events. No information is available on symptoms of acute overdose in animals because of the saturation of absorption in oral acute toxicity studies conducted in animals.",
          "toxicity": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adverse reactions observed at this dose were primarily diarrhea and dermatologic events. No information is available on symptoms of acute overdose in animals because of the saturation of absorption in oral acute toxicity studies conducted in animals.",
          "inDataset": "http://www.openphacts.org/bio2rdf/drugbank",
<?xml version="1.0" encoding="utf-8"?>
<result format="linked-data-api" version="1.5" href="https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=xml">
  <primaryTopic href="http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5">
    <prefLabel xml:lang="en">Sorafenib</prefLabel>
    <exactMatch>
      <item href="http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336">
        <type href="http://rdf.ebi.ac.uk/terms/chembl#SmallMolecule"/>
        <inDataset href="http://www.ebi.ac.uk/chembl"/>
        <mw_freebase datatype="double">464.82</mw_freebase>
      </item>
      <item href="http://ops.rsc.org/OPS379634">
        <smiles>CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F</smiles>
        <rtb datatype="double">5.0</rtb>
        <ro5_violations datatype="double">1.0</ro5_violations>
        <psa datatype="double">92.35</psa>
        <molweight datatype="double">464.825</molweight>
        <molformula>C21H16ClF3N4O3</molformula>
        <logp datatype="double">5.158</logp>
        <inchikey>MLDQJTXFUGDVEO-UHFFFAOYSA-N</inchikey>
        <inchi>InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3-6-14)28-20(31)29-13-4-7-17(22)16(10-13)21(23,24)25/h2-11H,1H3,(H,26,30)(H2,28,29,31)</inchi>
        <hbd datatype="double">3.0</hbd>
        <hba datatype="double">7.0</hba>
        <inDataset href="http://ops.rsc.org"/>
      </item>
      <item href="http://aers.data2semantics.org/resource/drug/NEXAVAR">
        <prefLabel>NEXAVAR</prefLabel>
        <reportedAdverseEvent>
          <item href="http://aers.data2semantics.org/resource/diagnosis/HEAD_INJURY">
            <prefLabel>HEAD INJURY</prefLabel>
            <inDataset href="http://aers.data2semantics.org/"/>
          </item>
          <item href="http://aers.data2semantics.org/resource/diagnosis/SUPRAVENTRICULAR_TACHYCARDIA">
            <prefLabel>SUPRAVENTRICULAR TACHYCARDIA</prefLabel>
            <inDataset href="http://aers.data2semantics.org/"/>
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns0: <http://www.openphacts.org/api#> .
@prefix ns1: <http://bio2rdf.org/> .
@prefix ns2: <http://rdf.ebi.ac.uk/terms/chembl#> .
@prefix chembl1336: <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336#> .
@prefix linked-data: <http://purl.org/linked-data/api/vocab#> .
@prefix msg0: <http://www.openphacts.org/api/> .

<http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5>
  skos:exactMatch <http://aers.data2semantics.org/resource/drug/NEXAVAR> ;
  skos:exactMatch <http://aers.data2semantics.org/resource/drug/SORAFENIB> ;
  skos:exactMatch <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5> ;
  skos:exactMatch <http://bio2rdf.org/drugbank:DB00398> ;
  skos:exactMatch <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336> ;
  skos:exactMatch <http://ops.rsc.org/OPS379634> ;
  skos:prefLabel "Sorafenib"@en ;
  void:inDataset <http://www.conceptwiki.org> ;
  foaf:isPrimaryTopicOf <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl> .

<https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl>
  foaf:primaryTopic <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5> ;
  linked-data:definition <https://beta.openphacts.org/api-config> ;
  msg0:activeLens "Default" ;
  void:linkPredicate skos:exactMatch ;
  linked-data:extendedMetadataVersion <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite> .

<http://ops.rsc.org/OPS379634>
  void:inDataset <http://ops.rsc.org> ;
  ns0:smiles "CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F" ;
  ns0:inchi "InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3-6-14)28-20(31)29-13-4-7-17(22)16(10-13)21(23,24)25/h2-11H,1H3,(H,26,30)(H2,28,29,31)" ;
  ns0:inchikey "MLDQJTXFUGDVEO-UHFFFAOYSA-N" ;
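The JSON, XML and Turtle responses shown for Sorafenib come from the same compound call and differ only in the `_format` parameter of the request URL. The following Python sketch shows how such a URL could be assembled and how the `exactMatch` entries of a JSON response could be grouped by dataset. It is a minimal illustration, not the official client: the function names are made up for this example, the `app_id` and `app_key` values are placeholders you would replace with your own credentials, and the sample response is heavily abbreviated from the slide.

```python
import json
from urllib.parse import urlencode

# Host and API version as they appear in the example URLs on the slides
BASE_URL = "https://beta.openphacts.org/1.5"


def compound_url(concept_uri, app_id, app_key, fmt="json"):
    """Build a compound lookup URL; fmt is "json", "xml" or "ttl" as in the examples."""
    query = urlencode({
        "uri": concept_uri,      # percent-encoded by urlencode
        "app_id": app_id,
        "app_key": app_key,
        "_format": fmt,
    })
    return f"{BASE_URL}/compound?{query}"


def matches_by_dataset(response):
    """Group the exactMatch entries of a parsed JSON response by their inDataset value."""
    topic = response["result"]["primaryTopic"]
    grouped = {}
    for entry in topic.get("exactMatch", []):
        # Assumption: entries without properties may appear as bare URI strings; skip those
        if isinstance(entry, dict):
            grouped[entry.get("inDataset", "unknown")] = entry
    return grouped


# Abbreviated sample shaped like the JSON response on the slide (most fields omitted)
SAMPLE = json.loads("""
{
  "format": "linked-data-api",
  "version": "1.5",
  "result": {
    "primaryTopic": {
      "_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
      "inDataset": "http://www.conceptwiki.org",
      "exactMatch": [
        {
          "_about": "http://bio2rdf.org/drugbank:DB00398",
          "genericName": "Sorafenib",
          "proteinBinding": "99.5% bound to plasma proteins.",
          "inDataset": "http://www.openphacts.org/bio2rdf/drugbank"
        }
      ]
    }
  }
}
""")
```

Calling `compound_url(SAMPLE["result"]["primaryTopic"]["_about"], "MY_ID", "MY_KEY", "ttl")` yields the Turtle variant of the request, and `matches_by_dataset(SAMPLE)` returns one entry keyed by the DrugBank dataset URI.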
Architecture
API architecture
Chemical Structure Search
RDF/SPARQL (Virtuoso)
Identity Mapping Service
Identity Resolution Service (ConceptWiki)
ChEMBL, UniProt, ...
Data loading
SC1 Health Webinar
Technical overview
6 July 2016
Platform goals
◎ Low total cost of ownership
◎ Simple to get started with Big Data
◎ Cater for widely varying use cases
◎ Embrace emerging Big Data technologies
◎ Simple integration with custom components
Key actors
Big Data is
◎ Volume
o Quantity of data
◎ Velocity
o Speed at which data is provided
◎ Variety
o Different formats/models in which data is provided
◎ Veracity
o Accuracy/truthfulness of the data
Why did we need all this?
Platform architecture
Semantic Big Data
ongoing research!
◎ Semantic Data Lake
o from data swamp to data lake
o query contents in the data lake
◎ SANSA stack
o Big Data analytics on semantic graph
Support layer
◎ Swarm UI
o Launch, install and manage pipelines
◎ Pipeline daemon & monitor
o Determine order in which steps are executed
o e.g.: Upload files before running computations
◎ Integrator UI
o Present dashboards in a unified interface
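The ordering task described for the pipeline daemon (upload files before running computations) can be pictured as a dependency sort. The sketch below uses Python's standard `graphlib` module to illustrate the idea; it is not the Big Data Europe daemon itself, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps that must finish first
PIPELINE = {
    "upload_files": set(),
    "run_computation": {"upload_files"},
    "publish_results": {"run_computation"},
}


def execution_order(steps):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(steps).static_order())
```

For the chain above, `execution_order(PIPELINE)` places `upload_files` first and `publish_results` last. A cycle in the declared dependencies raises `graphlib.CycleError`.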
Platform architecture
Key actors
Platform installation
◎ Manual installation guide
◎ Using Docker Machine
o On local machine (VirtualBox)
o In the cloud (AWS, DigitalOcean, Azure)
o Bare metal
Platform development
◎ High level picture
o docker-compose.yml describes pipeline topology
◎ Common components
o extend template image with your code
◎ New components
o build a Docker image for your component
o this is your own little Virtual Machine for your component
◎ Sharing
o publish topology as git repository
o publish new components on Docker Hub
Deployment
Swarm UI
Integrator UI
Workflow UI
More monitoring
This topic is ongoing, with many interesting options:
◎ Visualise logs with Kibana?
◎ Combine logs for large overview?
◎ Monitor node load?
◎ Provide autoscheduling?
Concluding remarks
◎ Used in practice
◎ Easy to get started
◎ Improving as we speak
Linux Container technology
..light-weight "virtual" virtual machine
A container is started from an image
Images downloaded from Docker Hub
Dockerfile: Layer-based recipe
Philosophy: One service, one image → microservices
Cloud's best friend: scalable, reproducible, customizable
https://www.docker.com/
https://hub.docker.com/r/openphacts/
ops-explorer :3001
ops-api :3002
ops-virtuoso :3003
ops-ims :3004
ops-memcached
ops-mysql
ops-virtuosodata
ops-mysqldata
ops-virtuosostaging
ops-mysqlstaging
https://data.openphacts.org/
https://hub.docker.com/
ops-docker
https://github.com/openphacts/ops-docker/
Docker Compose
https://www.docker.com/products/docker-compose
Which images to download
Which data volumes to use
Which network ports are exposed
How are containers linked
How to start/stop the containers
$ docker-compose up -d
docker-compose.yml
# Open PHACTS platform
# Docker Compose configuration

explorer:
  image: openphacts/explorer2
  ports:
    - "3001:3000"
  links:
    - api
  environment:
    - API_URL=http://localhost:3002
  #restart: always

api:
  image: openphacts/ops-linkeddataapi
  ports:
    - "3002:80"
  links:
    - ims
    - memcached
    - virtuoso:sparql

# SPARQL server
virtuoso:
  build: virtuoso-ops
  ports:
    - "3003:8890"
  volumes_from:
    - virtuosodata

virtuosodata:
  image: busybox
  volumes:
    - /virtuoso
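Each `ports` entry in the configuration above has the form "host:container", so the first number is the one you reach from your own machine. The short Python sketch below turns those entries into local URLs. The dictionary is copied by hand from the excerpt rather than parsed from the YAML file, and the function name is made up for this example.

```python
# Port mappings copied from the docker-compose.yml excerpt ("host:container")
SERVICE_PORTS = {
    "explorer": "3001:3000",
    "api": "3002:80",
    "virtuoso": "3003:8890",
}


def host_url(service, host="localhost"):
    """Return the URL under which a service is exposed on the Docker host."""
    host_port, _container_port = SERVICE_PORTS[service].split(":")
    return f"http://{host}:{host_port}/"
```

For example, `host_url("api")` gives the address that the `explorer` container is told to use through its `API_URL` environment variable.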
Data staging
6 . 1
Docker and data?
Docker Hub maximum image size: 10 GB
Open PHACTS data (compressed): ~30 GB
Open PHACTS data (installed): ~200 GB
Solution: Added staging Docker containers
Download from https://data.openphacts.org/
Verify consistency
Import into Virtuoso and MySQL
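The slides do not say how the staging containers verify consistency. One common approach for large downloads is to compare a checksum of the file against a published value, and the Python sketch below illustrates that approach under this assumption. The use of SHA-256 and all function names are choices made for this example, not details of the Open PHACTS staging images.

```python
import hashlib


def file_chunks(path, chunk_size=1 << 20):
    """Yield a file's content in fixed-size chunks so large downloads fit in memory."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def digest_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def is_consistent(chunks, expected_hex):
    """Compare the computed digest with an expected value, ignoring case and whitespace."""
    return digest_chunks(chunks) == expected_hex.strip().lower()
```

A staging step would call `is_consistent(file_chunks(downloaded_path), expected_checksum)` before importing, where both arguments are supplied by whatever process fetched the data.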
data.openphacts.org
RDF datasets
RDF linksets
VoID metadata/provenance
MySQL-imported linksets
Virtuoso-imported datasets
→ Maven repository: release data as software
→ Research Objects: propagate metadata
Try it!
https://github.com/openphacts/ops-docker
Hardware requirements:
150 GB of disk space (ideal: 250 GB)
16 GB of RAM (ideal: 128 GB)
4 CPU cores (ideal: 8 cores)
Prerequisites:
Recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7)
Fast Internet connection
Docker
Docker Compose
What do I need?
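Before starting the install it can help to compare a machine against the minimum figures listed above. The Python sketch below does this with the standard library only. The function names are invented for this example, the RAM measurement relies on `os.sysconf` names that are available on Linux (matching the stated prerequisite), and the sizes are treated as decimal gigabytes.

```python
import os
import shutil

# Minimum figures from the hardware requirements slide
MINIMUM = {"disk_gb": 150, "ram_gb": 16, "cpu_cores": 4}


def shortfalls(disk_gb, ram_gb, cpu_cores, minimum=MINIMUM):
    """List every requirement that the given measurements fail to meet."""
    measured = {"disk_gb": disk_gb, "ram_gb": ram_gb, "cpu_cores": cpu_cores}
    return [
        f"{name}: have {measured[name]:.0f}, need {needed}"
        for name, needed in minimum.items()
        if measured[name] < needed
    ]


def measure(path="/"):
    """Measure free disk space, physical RAM and CPU count (Linux only for the RAM part)."""
    gb = 10 ** 9
    disk_gb = shutil.disk_usage(path).free / gb
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / gb
    return disk_gb, ram_gb, os.cpu_count() or 1
```

Running `shortfalls(*measure())` on the target host returns an empty list when all three minimums are met. Pass the path of the disk that will hold Docker's data to `measure` if it is not the root filesystem.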
Follow the GitHub tutorial exactly, customize later
Install latest Docker and Docker Compose
Just testing on Windows or OS X? ..modify Docker's Linux VM to have enough disk and memory
Firewall? Different settings depend on your firewall details.
Don't worry - Docker is containerized! ..you won't break your machine
Don't jump ahead..
Get the software
$ curl -L https://github.com/openphacts/ops-docker/archive/master.tar.gz | tar xzv
$ cd ops-docker-master
$ sudo docker-compose pull
Get the data
$ sudo docker-compose up --no-recreate -d mysqlstaging virtuosostaging
$ sudo docker-compose logs mysqlstaging virtuosostaging
ops-mysqlstaging | mySQL staging finished
ops-mysqlstaging exited with code 0
ops-virtuosostaging | 09:13:35 --> Backup file # 675 [0x3F02-0x74-0x8A]
ops-virtuosostaging | 09:13:36 --> Backup file # 676 [0x3F02-0x74-0x8A]
ops-virtuosostaging | 09:13:37 End of restoring from backup, 6751701 pages
ops-virtuosostaging | 09:13:37 Server exiting
ops-virtuosostaging | Loading completed
ops-virtuosostaging exited with code 0
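Staging can take a long time, so it may be convenient to check the captured log text for the completion messages shown above instead of watching it scroll. The Python sketch below does that for saved output of the `logs` command. The marker strings are taken from the log excerpt on the slide and may differ between releases, and the function name is made up for this example.

```python
# Completion messages as they appear in the staging log excerpt
FINISHED_MARKERS = {
    "ops-mysqlstaging": "mySQL staging finished",
    "ops-virtuosostaging": "Loading completed",
}


def staging_finished(log_text):
    """Report, per staging container, whether its completion message was logged."""
    done = {name: False for name in FINISHED_MARKERS}
    for line in log_text.splitlines():
        container, separator, message = line.partition("|")
        container = container.strip()
        # Lines without "|" (such as "... exited with code 0") carry no log message
        if separator and container in FINISHED_MARKERS:
            if FINISHED_MARKERS[container] in message:
                done[container] = True
    return done
```

Both values in the returned dictionary should be `True` before the remaining services are started.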
Start the services
$ sudo docker-compose up --no-recreate -d
$ sudo docker-compose logs --tail=5
Using the services
http://localhost:3001/ Explorer
http://localhost:3002/ API
http://localhost:3003/ SPARQL
http://localhost:3004/QueryExpander Identity Mapping
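Once the containers are running, the four local addresses listed above can be probed to see which services respond. The Python sketch below uses only the standard library. Treating any HTTP status below 500 as "up" is a choice made for this example, and the function name is hypothetical.

```python
import urllib.error
import urllib.request

# Local endpoints as listed on the "Using the services" slides
SERVICES = {
    "Explorer": "http://localhost:3001/",
    "API": "http://localhost:3002/",
    "SPARQL": "http://localhost:3003/",
    "Identity Mapping": "http://localhost:3004/QueryExpander",
}


def is_up(url, timeout=5.0):
    """Return True if the URL answers with any HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar
        return False
```

A quick overview is then `{name: is_up(url) for name, url in SERVICES.items()}`, run on the Docker host.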
What's next?
Custom data staging
Different Open PHACTS 2.1 licensing options:
Non-Commercial users: Everything
Commercial users: No DrugBank, partial SureChEMBL
Open PHACTS members: Full SureChEMBL
Microservices per dataset
Most queries have separate fragments per dataset
..which could be executed on separate microservices
Better cloud scalability
Easier to test upgrades of individual datasets
But still need "API" layer to do Identity Mapping and selecting datasets to query
BioExcel Workflow blocks
BioExcel approach: Spin up virtual machine when an Open PHACTS workflow is started
Workflow bound dynamically to VM instance(s)
Scalability (exclusive access)
Reproducibility (independent/fixed OPS install)
Tool descriptions - exposed in bio.tools
Customization
Make it easier to add third-party data: datasets, linksets, queries, API calls
..so pharma industry can mix in their in-house data
..so academics can upgrade and expand datasets
More tooling, more documentation, or more training?
Feedback
https://github.com/openphacts/ops-docker/issues
http://support.openphacts.org/
http://ask.bioexcel.eu/
https://data.openphacts.org/