Supporting Dataset Descriptions in the Life Sciences

Post on 11-Apr-2017


Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair

FAIR Data Principles

5 April 2017 @gray_alasdair www.macs.hw.ac.uk/~ajg33 2

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18

Degrees of FAIRness


Open PHACTS Explorer

[Figure: Open PHACTS Core Platform. Components: Data Cache (Triple Store); Semantic Workflow Engine; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Identity Resolution Service; Identifier Management Service. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374.]

Which ChEMBL version?

[Figure labels: ChEMBL-RDF, ChEMBLv13, Chem2Bio2RDF, SD; versions v13, v12, v2 or v8]

Historic Use Case, ~January 2012

Open PHACTS v2.1: ChEMBL 20

http://tiny.cc/ops-datasets

Open PHACTS Provenance


Open PHACTS FAIR Data


Data Reuse Challenges

• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information

Scientists require data provenance!

Goal: To be FAIR


Open PHACTS Dataset Description Guidelines

Challenging for Publishers:
• Datasets are complex
• Evolve over time
• Another publishing burden
• Requires RDF knowledge
• Descriptions are complex
• Metadata precision

Tooling support required!


Open PHACTS Dataset Description Model


Open PHACTS Dataset Description Guidelines


Help me describe my data!

No! Use the Open PHACTS VoID Editor

Thanks for converting my data to RDF, can you help me make it findable by creating a VoID dataset description?

Dataset description Metadata Boring

Here are the guidelines, just write the terms in a text document.

Characters reproduced from Piled Higher and Deeper by Jorge Cham, http://phdcomics.com


Open PHACTS VoID Editor


Open PHACTS VoID Editor


Open PHACTS VoID Editor


Open PHACTS Validator


(Some) Life Sciences Metadata Specifications

[Figure. Axes: Depth, Reach. Labelled: HCLS DataDesc.]

Bioschemas

Schema.org for biology

Minimum properties for
• Finding data
• Presenting search results

Structured data markup for web pages

Without markup

<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts: <span>144 kcal</span>,
  </div>
  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .

Concepts present in the unmarked page: Recipe, Title, Nutrition, Calories, Ingredients

With markup (Microdata)

<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .

Markup formats: RDFa, JSON-LD, Microdata
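The same recipe facts can also be written in JSON-LD, another of the markup formats named on the slide. The following is a minimal sketch in Python, assuming the schema.org property names used in the Microdata example; the variable names are illustrative, and only the two ingredients visible on the slide are included.

```python
import json

# The recipe from the Microdata example, expressed as a JSON-LD document.
# Only the ingredients shown on the slide are listed; the rest are elided there.
recipe = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    "recipeIngredient": ["800g small new potato", "3 shallot"],
}

# Serialised text that a page would embed in a
# <script type="application/ld+json"> element.
json_ld = json.dumps(recipe, indent=2)
print(json_ld)
```

Unlike Microdata, the JSON-LD block sits apart from the visible HTML, so the page layout does not need to change.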

Minimum information

Controlled vocabularies

Cardinality

Data model

New properties

The ELIXIR Implementation Study

1. Data Repositories
2. Datasets
3. Beacons
4. Samples
5. Plant Phenotypes
6. Protein Annotations
7. Bioschemas registry
8. Validation

People named: Henning Hermjakob, Susanna A Sansone, Serena Scollen, Helen Parkinson, Rafa Jimenez, ???, Maria Martin, Audald Lloret, Alasdair Gray

Timeline:
1. Planning: March-April 2017
2. Agreement: May-June 2017
3. Adoption: July-Oct 2017
4. Application: Nov-Feb 2018


(Some) Life Sciences Metadata Specifications

[Figure. Axes: Depth, Reach. Labelled: HCLS DataDesc.]

W3C HCLS Group

Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI: 10.7717/peerj.2331

Use Case Requirements

Standard metadata requirements plus:

1. Resolvable identifiers for metadata

2. Descriptions of data identifiers

3. Data provenance

4. Data statistics


HCLS Dataset Descriptions

61 Metadata properties from 18 vocabularies

5 Modules: Core, Identifiers, Provenance, Distributions, Stats

Prescribed Usage

Element | Property | Value | Summary Level | Version Level | Distribution Level
Core Metadata
Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST
Title | dct:title | rdf:langString | MUST | MUST | MUST
Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY
Description | dct:description | rdf:langString | MUST | MUST | MUST
… | … | … | … | … | …
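To make the table concrete, the rows shown above can be held in a small lookup structure and queried per description level. This is a sketch in Python under stated assumptions: the names `LEVELS`, `CORE_METADATA`, `requirements_for` and `mandatory_properties` are hypothetical, and only the five rows visible on the slide are encoded, since the rest of the table is elided.

```python
# Column order of the requirement keywords in the table.
LEVELS = ("summary", "version", "distribution")

# (property, value) -> requirement keyword per level, copied from the visible rows.
CORE_METADATA = {
    ("rdf:type", "dctypes:Dataset"): ("MUST", "MUST", "SHOULD"),
    ("rdf:type", "void:Dataset or dcat:Distribution"): ("MUST NOT", "MUST NOT", "MUST"),
    ("dct:title", "rdf:langString"): ("MUST", "MUST", "MUST"),
    ("dct:alternative", "rdf:langString"): ("MAY", "MAY", "MAY"),
    ("dct:description", "rdf:langString"): ("MUST", "MUST", "MUST"),
}


def requirements_for(level):
    """Return {(property, value): keyword} for one description level."""
    index = LEVELS.index(level)
    return {key: keywords[index] for key, keywords in CORE_METADATA.items()}


def mandatory_properties(level):
    """Sorted property names that carry a MUST at the given level."""
    return sorted(
        {prop for (prop, _value), keyword in requirements_for(level).items()
         if keyword == "MUST"}
    )


print(requirements_for("distribution")[("rdf:type", "dctypes:Dataset")])
print(mandatory_properties("summary"))
```

Keeping the levels as columns of one structure mirrors the table, so a change to a requirement is made in exactly one place.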

ChEMBL: Summary Level


Implementations

RDF Platform

More coming…

(Some) Life Sciences Metadata Specifications

[Figure. Axes: Depth, Reach. Labelled: HCLS DataDesc.]

Layered Descriptions

[Figure labels: Minimal dataset description; More detailed description; Dataset; Sketch of content; HCLS DataDesc]

Configurable Tooling

Creation: ✗
Validation: ✓

Configurable Tooling

Creation: ✓
Validation: ✓

Constraint Languages

Feature | ShEx | SHACL | JSON Schema
Status | W3C Draft CG Report | W3C Working Draft | IETF Internet-Draft v5
Notation | Concise notation | Extended SPARQL | JSON
Data model | RDF | RDF | JSON (JSON-LD?)
Open/closed | Supported | Supported | Closed
Result format | Defined | Defined |
Constraint types supported:
• Domain | ✓ | ✓ | ✓
• Values | ✓ | ✓ | ✓
• Cardinality | ✓ | ✓ | ✓
• Vocabulary | ✓ | ✓ | ✗
• Recursion | ✓ | ✗ | ✗
• Conformance levels | Extension | Fixed | ✗

Example Constraint

• A Dataset
  – MUST be declared to be of type dctypes:Dataset
  – MUST have a dcterms:title as a language-typed string
  – MUST NOT have a dcterms:created date

• Shape

[Figure: shape graph for <Dataset>, with an rdf:langString value and a forbidden (✗) value of any type (.)]

Dates are associated with versions in HCLS
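The three constraints on a Dataset can be checked by hand over a small set of triples, which shows what a validator has to decide. The sketch below is in Python and is not how ShEx or Validata work internally: triples are plain tuples, a language-tagged literal is modelled as a (text, language) pair, and the `:chembl` example data is hypothetical.

```python
def check_dataset(triples, subject):
    """Return the list of constraints that the given dataset node violates."""
    # Group the objects of every predicate attached to the subject.
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)

    problems = []
    if "dctypes:Dataset" not in props.get("rdf:type", []):
        problems.append("MUST be declared to be of type dctypes:Dataset")

    titles = props.get("dcterms:title", [])
    # A language-tagged string is represented here as a (text, language) pair.
    if not titles or not all(isinstance(t, tuple) and len(t) == 2 for t in titles):
        problems.append("MUST have a dcterms:title as a language-typed string")

    if "dcterms:created" in props:
        problems.append("MUST NOT have a dcterms:created date")
    return problems


# Hypothetical example data, not taken from the slides.
good = [
    (":chembl", "rdf:type", "dctypes:Dataset"),
    (":chembl", "dcterms:title", ("ChEMBL", "en")),
]
bad = good + [(":chembl", "dcterms:created", "2017-04-05")]

print(check_dataset(good, ":chembl"))
print(check_dataset(bad, ":chembl"))
```

Writing such checks by hand for every property quickly becomes unmanageable, which is the motivation for a declarative constraint language.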

Example Validation

[Figure: three examples, each checking example data against the <Dataset> shape. Results shown: Valid, Not Valid, Valid.]

Shape Expressions (ShEx)

Shape

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

ShEx: Validation

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

Example data

Validator can’t warn of missing property

Requirement Levels

Shape

<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}

Validator can warn of missing property
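The effect of annotating constraints with requirement levels can be sketched as follows: a violated MUST becomes an error, while a missing lower-level property becomes a warning. This Python sketch is an illustration only and is not the Validata or ShEx-validator implementation; the `Constraint` class and `validate` function are hypothetical, and the check looks at property presence only, not at value types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    level: str               # "MUST", "SHOULD" or "MAY"
    prop: str
    forbidden: bool = False  # True models a negated constraint such as !dct:created


# The annotated <Dataset> shape from the slide, reduced to presence checks.
DATASET_SHAPE = [
    Constraint("MUST", "rdf:type"),
    Constraint("MUST", "dct:title"),
    Constraint("MAY", "dct:alternative"),
    Constraint("MUST", "dct:created", forbidden=True),
]


def validate(properties, shape=DATASET_SHAPE):
    """Return (errors, warnings) for the set of properties a description uses."""
    errors, warnings = [], []
    for c in shape:
        present = c.prop in properties
        if c.forbidden and present:
            message = f"{c.prop} is present but not allowed"
        elif not c.forbidden and not present:
            message = f"{c.prop} is missing"
        else:
            continue
        # Only MUST violations fail validation; other levels are reported as warnings.
        (errors if c.level == "MUST" else warnings).append(f"{c.level}: {message}")
    return errors, warnings


errors, warnings = validate({"rdf:type", "dct:title"})
print(errors)
print(warnings)
```

A description with only a type and a title passes with no errors, but the missing alternative title is still surfaced to the publisher as a warning.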

Implementation

Validata
• Web app front end
• Javascript + HTML
• Relies on ShEx-validator
  – Validates documents
  – Returns report

https://github.com/HW-SWeL/Validata

ShEx-validator
• Validation system
• Validation API
• Javascript
  – nodejs engine
• Reuses
  – n3: RDF Library
  – ShExParser

https://github.com/HW-SWeL/ShEx-validator

VALIDATA DEMO

http://hw-swel.github.io/Validata/

(Some) Life Sciences Metadata Specifications

[Figure. Axes: Depth, Reach. Labelled: HCLS DataDesc.]

Configurable Tooling

Creation: ✓
Validation: ✓

Acknowledgements

BioSchemas
• Carole Goble
• Rafael Jimenez

FAIR Data
• FAIRdom project
• Jun Zhao

Open PHACTS
• Christian Brenninkmeijer
• Lefteris Tatakis
• Andra Waagmeester

Validata (MEng 2015)
• Andrew Beveridge
• Jacob Baungard Hansen
• Johnny Val
• Leif Gehrmann
• Roisin Farmer
• Sunil Khutan
• Tomas Robertson
• Eric Prud’hommeaux

Questions

Validata https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
  – Configurable to any profile
• Shape Expression (ShEx) constraints

Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI: 10.7717/peerj.2331

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18

www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair