The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Alasdair J G Gray, Heriot-Watt University (www.macs.hw.ac.uk/~ajg33, [email protected], @gray_alasdair); Michel Dumontier, Stanford University; M. Scott Marshall, MAASTRO Clinic. 30/11/2016

Transcript of The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Page 1: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Alasdair J G Gray, Heriot-Watt University
www.macs.hw.ac.uk/~ajg33, [email protected], @gray_alasdair

Michel Dumontier, Stanford University

M. Scott Marshall, MAASTRO Clinic

30/11/2016

Page 2: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Open PHACTS Example

Page 3: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

[Figure: Open PHACTS platform architecture. Core Platform components: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data sources shown: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, with version labels v13, v12, and v2 or v8.]

Which ChEMBL version?

Page 4: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Open PHACTS Example

Page 5: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

OPS Example

Page 6: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

[Figure: the same Open PHACTS architecture diagram as on page 3, labelled "Open PHACTS Discovery Platform, Historic Use Case, ~January 2012" and annotated "Open PHACTS v2.1, ChEMBL 20".]

http://tiny.cc/ops-datasets

Which ChEMBL version?

Page 7: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Challenges

• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information

Scientists require data provenance!

Page 8: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Dublin Core Metadata Initiative

• Widely used
• Broadly applicable
  – Documents
  – Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties

“Date: A point or period of time associated with an event in the lifecycle of the resource.”

Page 9: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

VoID: Vocabulary of Interlinked Datasets

• Metadata carried with data
  – Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data

Page 10: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

DCAT: Data Catalog

• Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
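The Dataset/Distribution split can be pictured with a minimal sketch. The class names, fields, and example.org URLs below are illustrative assumptions, not part of DCAT or of the original slides. The point is only that one abstract dataset can have several concrete files, and that nothing in this shape records which release of the data a file belongs to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """A concrete, downloadable form of a dataset (cf. dcat:Distribution)."""
    media_type: str
    download_url: str  # hypothetical URLs, for illustration only


@dataclass
class Dataset:
    """The abstract dataset (cf. dcat:Dataset). Note: no slot for a version."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


def media_types(dataset: Dataset) -> List[str]:
    """List the formats in which a dataset is offered."""
    return [d.media_type for d in dataset.distributions]


chembl = Dataset(
    title="ChEMBL",
    distributions=[
        Distribution("text/turtle", "http://example.org/chembl.ttl"),
        Distribution("application/rdf+xml", "http://example.org/chembl.rdf"),
    ],
)
print(media_types(chembl))
```

With only these two levels, the question "Which ChEMBL version?" has no place to be answered, which is the gap the slide marks with ✗.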

Page 11: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

W3C HCLS Group

Page 12: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

HCLS Dataset Descriptions

• 61 metadata properties from 18 vocabularies
• 5 modules: Core, Identifiers, Provenance, Distributions, Stats

Page 13: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Prescribed Usage

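Element            | Property        | Value                             | Summary Level | Version Level | Distribution Level
-------------------+-----------------+-----------------------------------+---------------+---------------+-------------------
Core Metadata      |                 |                                   |               |               |
Type declaration   | rdf:type        | dctypes:Dataset                   | MUST          | MUST          | SHOULD
Type declaration   | rdf:type        | void:Dataset or dcat:Distribution | MUST NOT      | MUST NOT      | MUST
Title              | dct:title       | rdf:langString                    | MUST          | MUST          | MUST
Alternative titles | dct:alternative | rdf:langString                    | MAY           | MAY           | MAY
Description        | dct:description | rdf:langString                    | MUST          | MUST          | MUST
…                  | …               | …                                 | …             | …             | …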
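The requirement levels in this excerpt lend themselves to a mechanical check. The following is a minimal sketch, assuming a description is held as a plain dict from property to a list of values. `RULES` and `check` are hypothetical names, only the five rows shown above are encoded, and this is not the validator used by the HCLS group.

```python
from typing import Dict, List

LEVELS = ("summary", "version", "distribution")

# (property, accepted values or None for "any value", requirement per level),
# transcribed from the Prescribed Usage excerpt; the remaining rows are elided.
RULES = [
    ("rdf:type", {"dctypes:Dataset"}, ("MUST", "MUST", "SHOULD")),
    ("rdf:type", {"void:Dataset", "dcat:Distribution"}, ("MUST NOT", "MUST NOT", "MUST")),
    ("dct:title", None, ("MUST", "MUST", "MUST")),
    ("dct:alternative", None, ("MAY", "MAY", "MAY")),
    ("dct:description", None, ("MUST", "MUST", "MUST")),
]


def check(description: Dict[str, List[str]], level: str) -> List[str]:
    """Return MUST / MUST NOT violations; SHOULD and MAY are not reported."""
    idx = LEVELS.index(level)
    problems = []
    for prop, accepted, requirements in RULES:
        values = description.get(prop, [])
        if accepted is None:
            present = bool(values)
            label = prop
        else:
            present = any(v in accepted for v in values)
            label = f"{prop} {' or '.join(sorted(accepted))}"
        requirement = requirements[idx]
        if requirement == "MUST" and not present:
            problems.append(f"missing {label} (MUST)")
        elif requirement == "MUST NOT" and present:
            problems.append(f"forbidden {label} (MUST NOT)")
    return problems


# A summary-level description that wrongly declares void:Dataset
# and omits the description.
summary = {
    "rdf:type": ["dctypes:Dataset", "void:Dataset"],
    "dct:title": ["ChEMBL"],
}
for problem in check(summary, "summary"):
    print(problem)
```

Keeping the rules as data rather than as code means the same function serves all three levels, and further rows of the table could be added without touching the logic.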

Page 14: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

ChEMBL: Summary Level
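As a rough illustration of what a summary-level description contains, the sketch below writes the three MUST properties from the Prescribed Usage table as Turtle. The subject IRI and the description text are placeholders and are not taken from the actual ChEMBL record; `to_turtle` is a hypothetical helper, and only standard-library string handling is used.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dct": "http://purl.org/dc/terms/",
    "dctypes": "http://purl.org/dc/dcmitype/",
}


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_turtle(subject: str, title: str, description: str, lang: str = "en") -> str:
    """Serialise a minimal summary-level description as Turtle."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines.append("")
    lines.append(f"<{subject}>")
    lines.append("    rdf:type dctypes:Dataset ;")
    lines.append(f'    dct:title "{escape(title)}"@{lang} ;')
    lines.append(f'    dct:description "{escape(description)}"@{lang} .')
    return "\n".join(lines)


# Placeholder IRI and description, for illustration only.
print(to_turtle(
    "http://example.org/dataset/chembl",
    "ChEMBL",
    "Placeholder description of the dataset.",
))
```

A full summary-level description would carry further properties from the profile; this sketch stops at the rows visible in the table excerpt.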

Page 15: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Requires Tooling

• Creation
• Validation

Page 16: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Implementations

RDF Platform

More coming…

Page 17: The HCLS Community Profile: Describing Datasets, Versions, and Distributions

HCLS Dataset Descriptions

https://www.w3.org/TR/hcls-dataset/

Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331. https://doi.org/10.7717/peerj.2331

[email protected] @gray_alasdair