Dataset Descriptions in Open PHACTS and HCLS

Dataset Descriptions in Open PHACTS and W3C HCLS IG
Alasdair J G Gray, Heriot-Watt University
www.alasdairjggray.co.uk [email protected]
NDEx Call, April 2014

description

This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The specification was driven by a real need within the project to track the datasets used. The presentation covers the dataset metadata captured, the vocabularies used to model it, and the tools developed to support uptake of the specification. Over the last 12 months, the W3C Health Care and Life Sciences Interest Group (HCLS IG) has been developing a community profile for dataset descriptions, drawing on the ideas in the Open PHACTS specification. A brief overview of the forthcoming community profile is also given. This presentation was given to the Network Data Exchange project (http://www.ndexbio.org/) on 2 April 2014.

Transcript of Dataset Descriptions in Open PHACTS and HCLS

Page 1: Dataset Descriptions in Open PHACTS and HCLS

Dataset Descriptions in Open PHACTS and W3C HCLS IG

Alasdair J G Gray
Heriot-Watt University

www.alasdairjggray.co.uk [email protected]

NDEx Call, April 2014

Page 2: Dataset Descriptions in Open PHACTS and HCLS

[Architecture diagram of the Open PHACTS platform. Labels recoverable from the diagram: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Core Platform; Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing; Data Cache (Virtuoso Triple Store). Data sources: Public Content, Commercial, Public Ontologies, User Annotations, shown as RDF, Nanopub and Db sources accompanied by VoID. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]

Page 3: Dataset Descriptions in Open PHACTS and HCLS

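[Architecture diagram. Labels recoverable from the diagram: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Core Platform; Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Data Cache (Triple Store). Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Data source labels: ChEMBL-RDF, ChEMBL, Chem2Bio2RDF, SD, ChEMBL. Version labels: v13, v12, v2 or v8. Date label: January 2012.]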

Page 4: Dataset Descriptions in Open PHACTS and HCLS
Page 5: Dataset Descriptions in Open PHACTS and HCLS

ChemSpider

• Data aggregator: over 400 sources
  – What data does it contain?
  – What version of ?? did they load?
  – When are new versions loaded?

• OPS data covers
  – ChEBI
  – ChEMBL
  – DrugBank

2 April 2014 OPS Dataset Descriptions – A. J. G. Gray 5

Page 6: Dataset Descriptions in Open PHACTS and HCLS

Metadata Challenges

• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries

• Datasets build on each other
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information


Users require data provenance!
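As an illustration of what a provenance trail over dataset versions could look like, the sketch below walks version links between dataset descriptions. It assumes each description records its predecessor with the PAV property pav:previousVersion; the dataset IRIs, version labels and the function name version_trail are hypothetical and not taken from the slides.

```python
# Hypothetical descriptions keyed by dataset IRI; IRIs and versions are illustrative only.
DESCRIPTIONS = {
    "http://example.org/dataset/v3": {
        "pav:version": "3",
        "pav:previousVersion": "http://example.org/dataset/v2",
    },
    "http://example.org/dataset/v2": {
        "pav:version": "2",
        "pav:previousVersion": "http://example.org/dataset/v1",
    },
    "http://example.org/dataset/v1": {"pav:version": "1"},
}


def version_trail(descriptions, start):
    """Follow pav:previousVersion links from `start` back through known versions.

    Stops when a description has no predecessor or the predecessor is not
    described; raises ValueError if the links form a cycle.
    """
    trail = []
    seen = set()
    current = start
    while current is not None and current in descriptions:
        if current in seen:
            raise ValueError(f"cycle in version links at {current}")
        seen.add(current)
        trail.append((current, descriptions[current].get("pav:version")))
        current = descriptions[current].get("pav:previousVersion")
    return trail


trail = version_trail(DESCRIPTIONS, "http://example.org/dataset/v3")
for iri, version in trail:
    print(version, iri)
```

With descriptions like these, a consumer can answer "which version was loaded, and what came before it?" without relying on a registry that may be out of date.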

Page 7: Dataset Descriptions in Open PHACTS and HCLS


Page 8: Dataset Descriptions in Open PHACTS and HCLS


Page 9: Dataset Descriptions in Open PHACTS and HCLS

Description Model
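The model diagram on this slide did not survive extraction. As a rough illustration only, the sketch below serialises a small dataset description as Turtle using terms from the VoID, Dublin Core Terms and PAV vocabularies. The choice of properties, the example.org IRIs, the literal values and the helper names literal and to_turtle are assumptions made for this example, not the Open PHACTS model itself.

```python
from datetime import date

# Namespaces of the vocabularies used in this sketch.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def literal(value):
    """Render a Python value as a Turtle literal (dates become xsd:date)."""
    if isinstance(value, date):
        return f'"{value.isoformat()}"^^xsd:date'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_turtle(subject, properties):
    """Serialise one description; `properties` holds (predicate, rendered object) pairs."""
    header = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in properties)
    return "\n".join(header) + f"\n\n<{subject}>\n{body} .\n"


# All IRIs and values below are hypothetical.
description = to_turtle(
    "http://example.org/void#exampleDataset",
    [
        ("a", "void:Dataset"),
        ("dcterms:title", literal("Example dataset")),
        ("dcterms:description", literal("A hypothetical dataset used for illustration.")),
        ("dcterms:license", "<http://example.org/licence>"),
        ("dcterms:issued", literal(date(2014, 4, 2))),
        ("pav:version", literal("1")),
        ("void:dataDump", "<http://example.org/dumps/example-v1.ttl>"),
    ],
)
print(description)
```

The sketch uses only the standard library so that it runs anywhere; in practice an RDF library would handle serialisation and escaping.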


Page 10: Dataset Descriptions in Open PHACTS and HCLS

Realisation of Dataset Descriptions

• Needs to be incorporated into data publishing pipeline

• Hard for publishers to provide conformant descriptions
  – Datasets are complex
  – Evolve over time
  – Seen as yet another burden


Page 11: Dataset Descriptions in Open PHACTS and HCLS

VoID Editor


Page 12: Dataset Descriptions in Open PHACTS and HCLS

Validator
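The slide shows only the tool's name, so the following is a minimal sketch of the kind of check a description validator might perform, not the behaviour of the Open PHACTS validator. It assumes a description is held as a dictionary of property names to values, and the list of required properties is hypothetical; the actual specification defines its own requirements.

```python
# Assumed required properties; the real specification defines its own list.
REQUIRED = ("dcterms:title", "dcterms:description", "dcterms:license", "pav:version")


def validate(description):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for prop in REQUIRED:
        value = description.get(prop)
        # Treat absent values and blank strings alike as missing.
        if value is None or str(value).strip() == "":
            problems.append(f"missing required property: {prop}")
    return problems


# Hypothetical descriptions for illustration.
complete = {
    "dcterms:title": "Example dataset",
    "dcterms:description": "A hypothetical dataset used for illustration.",
    "dcterms:license": "http://example.org/licence",
    "pav:version": "1",
}
incomplete = {"dcterms:title": "Example dataset", "pav:version": " "}

print(validate(complete))
for problem in validate(incomplete):
    print(problem)
```

Reporting every missing property at once, rather than stopping at the first, keeps the feedback loop short for publishers who already see metadata as a burden.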


Page 13: Dataset Descriptions in Open PHACTS and HCLS

W3C HCLS Group

Page 14: Dataset Descriptions in Open PHACTS and HCLS

HCLS Community Profile Model


Page 15: Dataset Descriptions in Open PHACTS and HCLS

Future Vision

Metadata: Write once, use many times

• Provide rich and accurate provenance trail of data
  – Automatic pipeline from VoID file to registries

• Align Open PHACTS with W3C HCLS
  – Update tools for HCLS profile
