Supporting Dataset Descriptions in the Life Sciences

Supporting Dataset Descriptions in the Life Sciences Alasdair J G Gray Heriot-Watt University www.macs.hw.ac.uk/~ajg33 [email protected] @gray_alasdair

Page 1: Supporting Dataset Descriptions in the Life Sciences

Supporting Dataset Descriptions in the Life Sciences

Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
[email protected]
@gray_alasdair

Page 2: Supporting Dataset Descriptions in the Life Sciences

FAIR Data Principles

5 April 2017 @gray_alasdair www.macs.hw.ac.uk/~ajg33 2

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18

Page 3: Supporting Dataset Descriptions in the Life Sciences

Degrees of FAIRness

Page 4: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Explorer

Page 5: Supporting Dataset Descriptions in the Life Sciences

Which ChEMBL version?

Historic Use Case (~January 2012)

[Figure: Open PHACTS Core Platform: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data sources: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, with version labels v13, v12, and v2 or v8.]

Open PHACTS v2.1: ChEMBL 20

http://tiny.cc/ops-datasets

Page 6: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Provenance

Page 7: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS FAIR Data

Page 8: Supporting Dataset Descriptions in the Life Sciences

Data Reuse Challenges

• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information

Scientists require data provenance!

Page 9: Supporting Dataset Descriptions in the Life Sciences

Goal: To be FAIR

Page 10: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Dataset Description Guidelines

Challenging for Publishers:
• Datasets are complex
• Evolve over time
• Another publishing burden
• Requires RDF knowledge
• Descriptions are complex
• Metadata precision

Tooling support required!

Page 11: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Dataset Description Model

Page 12: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Dataset Description Guidelines

Page 13: Supporting Dataset Descriptions in the Life Sciences

Help me describe my data!

No! Use the Open PHACTS VoID Editor

Thanks for converting my data to RDF, can you help me make it findable by creating a VoID dataset description?

Dataset description Metadata Boring

Here are the guidelines, just write the terms in a text document.

Characters reproduced from Piled Higher and Deeper by Jorge Cham, http://phdcomics.com

Page 14: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS VoID Editor

Page 15: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS VoID Editor

Page 16: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS VoID Editor

Page 17: Supporting Dataset Descriptions in the Life Sciences

Open PHACTS Validator

Page 18: Supporting Dataset Descriptions in the Life Sciences

(Some) Life Sciences Metadata Specifications

[Figure: metadata specifications positioned along Depth and Reach axes; HCLS DataDesc is labelled.]

Page 19: Supporting Dataset Descriptions in the Life Sciences

Bioschemas

Schema.org for biology

Minimum properties for
• Finding data
• Presenting search results
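As a rough illustration of what a minimal, findable description could look like, the sketch below builds a Schema.org Dataset description as JSON-LD using only Python's standard library. The property names (name, description, url, keywords, license) are genuine Schema.org terms, but choosing these five is an assumption made for illustration and is not the Bioschemas profile; all values are placeholders, and `to_script_tag` is a hypothetical helper.

```python
import json

# Hypothetical minimal description; every value is a placeholder, not real metadata.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "A placeholder description used for illustration.",
    "url": "http://example.org/dataset",
    "keywords": ["example", "life sciences"],
    "license": "http://example.org/licence",
}


def to_script_tag(description: dict) -> str:
    """Wrap a JSON-LD description in the <script> element used to embed it in a page."""
    body = json.dumps(description, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(to_script_tag(dataset))
```

A search engine or registry crawler could read such a block without parsing the visible page content, which is the "finding data" and "presenting search results" use the slide refers to.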

Page 20: Supporting Dataset Descriptions in the Life Sciences

Structured data markup for web pages

Without markup

<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts: <span>144 kcal</span>,
  </div>

  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .

Page 21: Supporting Dataset Descriptions in the Life Sciences

Structured data markup for web pages

Without markup

<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts: <span>144 kcal</span>,
  </div>

  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .

Parts a human reader recognises: Recipe, Title, Nutrition, Calories, Ingredients

Page 22: Supporting Dataset Descriptions in the Life Sciences

Structured data markup for web pages

With markup (Microdata shown; RDFa and JSON-LD are alternative syntaxes)

<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>,
  </div>

  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
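To show why the markup matters to a machine, here is a minimal sketch that pulls the itemprop values out of the recipe fragment using only Python's standard library. `ItempropCollector` and `extract` are hypothetical names, the elided ingredients are left out, and a closing `</div>` has been added so the fragment is well-formed; a real microdata parser handles nesting and item types far more carefully than this.

```python
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from elements whose value is their text."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # itemprop of the element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Elements with itemscope are nested items (containers), not text values.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        self._current = None


def extract(html: str):
    collector = ItempropCollector()
    collector.feed(html)
    return collector.pairs


if __name__ == "__main__":
    for prop, value in extract(PAGE):
        print(prop, "=", value)
```

Run against the unmarked version of the page, the same code returns nothing, since there are no itemprop attributes to tell the title from the ingredients.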

Page 23: Supporting Dataset Descriptions in the Life Sciences
Page 24: Supporting Dataset Descriptions in the Life Sciences

• Minimum information
• Controlled vocabularies
• Cardinality
• Data model
• New properties

Page 25: Supporting Dataset Descriptions in the Life Sciences

The ELIXIR Implementation Study

Work areas:
1. Data Repositories
2. Datasets
3. Beacons
4. Samples
5. Plant Phenotypes
6. Protein Annotations
7. Bioschemas registry
8. Validation

People named on the slide: Henning Hermjakob, Susanna A Sansone, Serena Scollen, Helen Parkinson, Rafa Jimenez, ???, Maria Martin, Audald Lloret, Alasdair Gray

Phases:
1. Planning: March-April 2017
2. Agreement: May-June 2017
3. Adoption: July-Oct 2017
4. Application: Nov-Feb 2018

Page 26: Supporting Dataset Descriptions in the Life Sciences

(Some) Life Sciences Metadata Specifications

[Figure: metadata specifications positioned along Depth and Reach axes; HCLS DataDesc is labelled.]

Page 27: Supporting Dataset Descriptions in the Life Sciences

W3C HCLS Group

Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI:10.7717/peerj.2331

Page 28: Supporting Dataset Descriptions in the Life Sciences

Use Case Requirements

Standard metadata requirements plus:

1. Resolvable identifiers for metadata

2. Descriptions of data identifiers

3. Data provenance

4. Data statistics

Page 29: Supporting Dataset Descriptions in the Life Sciences

HCLS Dataset Descriptions

61 Metadata properties from 18 vocabularies

5 Modules: Core, Identifiers, Provenance, Distributions, Stats

Page 30: Supporting Dataset Descriptions in the Life Sciences

Prescribed Usage

Element | Property | Value | Summary Level | Version Level | Distribution Level
Core Metadata
Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST
Title | dct:title | rdf:langString | MUST | MUST | MUST
Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY
Description | dct:description | rdf:langString | MUST | MUST | MUST
… | … | … | … | … | …
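The level-dependent requirements can be held in a small lookup table. The sketch below transcribes only the five rows shown on the slide into plain Python; the names `PRESCRIBED_USAGE`, `requirements_for` and `must_items` are hypothetical, and the data structure is an illustration rather than how the HCLS profile or any tool stores these rules.

```python
# Requirement keywords per description level, transcribed from the slide's table.
# Keys are (property, value); tuples are ordered (summary, version, distribution).
LEVELS = ("summary", "version", "distribution")

PRESCRIBED_USAGE = {
    ("rdf:type", "dctypes:Dataset"): ("MUST", "MUST", "SHOULD"),
    ("rdf:type", "void:Dataset or dcat:Distribution"): ("MUST NOT", "MUST NOT", "MUST"),
    ("dct:title", "rdf:langString"): ("MUST", "MUST", "MUST"),
    ("dct:alternative", "rdf:langString"): ("MAY", "MAY", "MAY"),
    ("dct:description", "rdf:langString"): ("MUST", "MUST", "MUST"),
}


def requirements_for(level: str) -> dict:
    """Return {(property, value): keyword} for one description level."""
    idx = LEVELS.index(level)
    return {key: keywords[idx] for key, keywords in PRESCRIBED_USAGE.items()}


def must_items(level: str) -> list:
    """List the (property, value) pairs that are MUST at the given level."""
    return [key for key, kw in requirements_for(level).items() if kw == "MUST"]


if __name__ == "__main__":
    for level in LEVELS:
        print(level, must_items(level))
```

The type declaration rows are the ones that change between levels: a summary or version description must be a dctypes:Dataset, while a distribution description must instead be typed as void:Dataset or dcat:Distribution.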

Page 31: Supporting Dataset Descriptions in the Life Sciences

ChEMBL: Summary Level

Page 32: Supporting Dataset Descriptions in the Life Sciences

Implementations

RDF Platform

More coming…

Page 33: Supporting Dataset Descriptions in the Life Sciences

(Some) Life Sciences Metadata Specifications

[Figure: metadata specifications positioned along Depth and Reach axes; HCLS DataDesc is labelled.]

Page 34: Supporting Dataset Descriptions in the Life Sciences

Layered Descriptions

[Figure: layers of description around a Dataset: minimal dataset description, more detailed description, sketch of content. HCLS DataDesc is labelled.]

Page 35: Supporting Dataset Descriptions in the Life Sciences

Configurable Tooling

Creation ✗
Validation ✓

Page 36: Supporting Dataset Descriptions in the Life Sciences

Configurable Tooling

Creation ✓
Validation ✓

Page 37: Supporting Dataset Descriptions in the Life Sciences

Constraint Languages

 | ShEx | SHACL | JSON Schema
Status | W3C Draft CG Report | W3C Working Draft | IETF Internet-Draft v5
Notation | Concise notation | Extended SPARQL | JSON
Data model | RDF | RDF | JSON (JSON-LD?)
Open/closed | Supported | Supported | Closed
Result format | Defined | Defined |
Constraint types supported
• Domain | ✓ | ✓ | ✓
• Values | ✓ | ✓ | ✓
• Cardinality | ✓ | ✓ | ✓
• Vocabulary | ✓ | ✓ | ✗
• Recursion | ✓ | ✗ | ✗
• Conformance levels | Extension | Fixed | ✗

Page 38: Supporting Dataset Descriptions in the Life Sciences

Example Constraint

• A Dataset
  – MUST be declared to be of type dctypes:Dataset
  – MUST have a dcterms:title as a language typed string
  – MUST NOT have dcterms:created date

• Shape
  [Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

Dates are associated with versions in HCLS
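The three rules can be checked by hand in a few lines. The sketch below is plain Python over (subject, predicate, object) tuples, not ShEx and not the way Validata works; `LangString`, `check_dataset` and the example data are hypothetical stand-ins, with prefixed names kept as ordinary strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangString:
    """Stand-in for an RDF language-tagged literal (rdf:langString)."""
    text: str
    lang: str


def check_dataset(subject, triples):
    """Return the list of example rules that `subject` violates."""
    objects = {}
    for s, p, o in triples:
        if s == subject:
            objects.setdefault(p, []).append(o)

    problems = []
    if "dctypes:Dataset" not in objects.get("rdf:type", []):
        problems.append("MUST be declared to be of type dctypes:Dataset")
    titles = objects.get("dcterms:title", [])
    if not titles or not all(isinstance(t, LangString) for t in titles):
        problems.append("MUST have a dcterms:title as a language typed string")
    if "dcterms:created" in objects:
        problems.append("MUST NOT have dcterms:created date")
    return problems


# Hypothetical example data: one conforming description, one with a forbidden date.
good = [
    (":d", "rdf:type", "dctypes:Dataset"),
    (":d", "dcterms:title", LangString("Example", "en")),
]
bad = good + [(":d", "dcterms:created", "2017-04-05")]

if __name__ == "__main__":
    print(check_dataset(":d", good))
    print(check_dataset(":d", bad))
```

Hard-coding rules like this does not scale to a profile with dozens of properties and several levels, which is the motivation for expressing them declaratively in a constraint language.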

Page 39: Supporting Dataset Descriptions in the Life Sciences

Example Validation

• Shape
  [Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

• Data
  [Figure: example data graph.]

Valid

Page 40: Supporting Dataset Descriptions in the Life Sciences

Example Validation

• Shape
  [Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

• Data
  [Figure: example data graph.]

Not Valid

Page 41: Supporting Dataset Descriptions in the Life Sciences

Example Validation

• Shape
  [Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

• Data
  [Figure: example data graph.]

Valid

Page 42: Supporting Dataset Descriptions in the Life Sciences

Shape Expressions (ShEx)

Shape

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

[Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

Page 43: Supporting Dataset Descriptions in the Life Sciences

ShEx: Validation

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

Example data

Validator can’t warn of missing property

Page 44: Supporting Dataset Descriptions in the Life Sciences

Requirement Levels

Shape

<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}

[Figure: <Dataset> shape diagram, with an rdf:langString value and one property marked ✗.]

Validator can warn of missing property
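One way to picture what requirement levels buy a validator is a severity lookup: the keyword attached to each constraint decides whether a missing property is an error, a warning, or nothing. The sketch below is plain Python with hypothetical names (`SEVERITY_IF_MISSING`, `CONSTRAINTS`, `report`); the mapping of SHOULD to a warning is an assumption for illustration, and none of this is Validata's actual implementation.

```python
# Assumed mapping from requirement keyword to the severity of a missing property.
SEVERITY_IF_MISSING = {"MUST": "error", "SHOULD": "warning", "MAY": None}

# (keyword, property, forbidden) entries mirroring the annotated <Dataset> shape.
CONSTRAINTS = [
    ("MUST", "rdf:type", False),
    ("MUST", "dct:title", False),
    ("MAY", "dct:alternative", False),
    ("MUST", "dct:created", True),  # `MUST` !dct:created: the property is forbidden
]


def report(present, constraints=CONSTRAINTS):
    """Return (severity, message) pairs for the set of property names found."""
    findings = []
    for keyword, prop, forbidden in constraints:
        if forbidden and prop in present:
            findings.append(("error", f"{prop}: forbidden property is present"))
        elif not forbidden and prop not in present:
            severity = SEVERITY_IF_MISSING[keyword]
            if severity:
                findings.append((severity, f"{prop}: {keyword} property is missing"))
    return findings


if __name__ == "__main__":
    # A description with a type but no title: one error, no complaint about
    # the optional alternative title.
    print(report({"rdf:type"}))
```

Without the keywords, every constraint in the shape is equally binding, so the validator can only pass or fail the data; with them, an absent optional or recommended property can be reported at a softer level.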

Page 45: Supporting Dataset Descriptions in the Life Sciences

Implementation

Validata
• Web app front end
• Javascript + HTML
• Relies on ShEx-validator
  – Validates documents
  – Returns report
https://github.com/HW-SWeL/Validata

ShEx-validator
• Validation system
• Validation API
• Javascript
  – nodejs engine
• Reuses
  – n3: RDF Library
  – ShExParser
https://github.com/HW-SWeL/ShEx-validator

Page 46: Supporting Dataset Descriptions in the Life Sciences

http://hw-swel.github.io/Validata/ VALIDATA DEMO

Page 47: Supporting Dataset Descriptions in the Life Sciences

(Some) Life Sciences Metadata Specifications

[Figure: metadata specifications positioned along Depth and Reach axes; HCLS DataDesc is labelled.]

Page 48: Supporting Dataset Descriptions in the Life Sciences

Configurable Tooling

Creation ✓
Validation ✓

Page 49: Supporting Dataset Descriptions in the Life Sciences

Acknowledgements

Bioschemas
• Carole Goble
• Rafael Jimenez

FAIR Data
• FAIRdom project
• Jun Zhao

Open PHACTS
• Christian Brenninkmeijer
• Lefteris Tatakis
• Andra Waagmeester

Validata (MEng 2015)
• Andrew Beveridge
• Jacob Baungard Hansen
• Johnny Val
• Leif Gehrmann
• Roisin Farmer
• Sunil Khutan
• Tomas Robertson

• Eric Prud’hommeaux

Page 50: Supporting Dataset Descriptions in the Life Sciences

Questions

Validata https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
  – Configurable to any profile
• Shape Expression (ShEx) constraints

Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI:10.7717/peerj.2331

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18

www.macs.hw.ac.uk/~ajg33/
[email protected]
@gray_alasdair