Tdwg14 fp-kurator-ludaescher


Workflow Support for Continuous Data Quality Control in a FilteredPush Network. J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song. Presentation given at TDWG 2014, Jönköping, Sweden.

Transcript of Tdwg14 fp-kurator-ludaescher

Page 1: Tdwg14 fp-kurator-ludaescher


Workflow Support for Continuous Data Quality Control in a FilteredPush Network

J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song

Page 2: Tdwg14 fp-kurator-ludaescher


Problem: Data & Metadata Quality

• Collections & occurrence data
  – … is all over the map
  – … literally (off the map!)
• DQ Issues, e.g., …
  – Lat/Long transposition, coordinate & projection issues
  – Scientific Names (spelling errors, other)
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Related Projects:
  – Filtered-Push
  – Kurator

Page 3: Tdwg14 fp-kurator-ludaescher


What problems are we trying to solve?

• Detect and flag data quality issues
• Repair if possible
  – … ask human curators as needed
• Keep track of provenance
  – automatic repairs
  – human curators’ edits
• Employ workflow (semi-)automation
  – Scientific workflow systems:
    • Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, …
  – Related technologies
    • Akka parallel execution platform
    • Script-based automation (e.g. Python) and digital notebooks (iPython)

Page 4: Tdwg14 fp-kurator-ludaescher


Data Curation Workflow

Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177

Page 5: Tdwg14 fp-kurator-ludaescher


Customers of Curation Workflows

• Collection Managers
  – … who are managing the collections databases
  – Can run curation workflows periodically
    • … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
  – To perform an analysis in the presence of (partially) dirty data, researchers need to
    • Clean or fix dirty data
    • Throw out unfixable data
  – Reporting back to the collection managers (cf. FPush)

Page 6: Tdwg14 fp-kurator-ludaescher


Filtered Push

http://xkcd.com/386/

(1) Kvetch about data

(2) Push to interested parties

(3) Human Filter

(4) Change data in databases

(5) Store all assertions

Source: Paul J. Morris

Page 7: Tdwg14 fp-kurator-ludaescher


Akka curation workflow on FP2, working on DW spreadsheet reports

Symbiota Instance & DB

Symbiota Instance

Source: Paul J. Morris

Page 8: Tdwg14 fp-kurator-ludaescher


Overall Dataflow

[Diagram; labels: Access Point, Symbiota Portal Access Point, Akka Kurator Workflows, FilteredPush Node, Occurrence Records, Quality Control Annotations, Quality Control Workflow, Quality Controlled Data Set]

Source: Paul J. Morris

Page 9: Tdwg14 fp-kurator-ludaescher


Example Curation Workflow …

• Load Dataset
• Scientific Name Validation
• Georeference Validation
• Collection Date Validation
• [Create Annotations into FPush Network]
• Output results
  – translate to spreadsheet
  – with provenance!

some steps of a larger workflow
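The step list above can be read as a chain of functions that each take a record, possibly change it, and append a note to its provenance trail. The following is a minimal Python sketch of that idea, not the Kurator or Akka implementation. The `CuratedRecord` class, the step functions, the Darwin Core style field name `scientificName`, and the whitespace check are all assumptions made for illustration; the georeference and date steps are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CuratedRecord:
    """One occurrence record plus the trail of what was done to it."""
    data: Dict[str, str]
    provenance: List[str] = field(default_factory=list)


Step = Callable[[CuratedRecord], CuratedRecord]


def validate_scientific_name(rec: CuratedRecord) -> CuratedRecord:
    # Illustrative check only: collapse stray whitespace in the name.
    name = rec.data.get("scientificName", "")
    cleaned = " ".join(name.split())
    if cleaned != name:
        rec.data["scientificName"] = cleaned
        rec.provenance.append("scientificName: whitespace normalized")
    else:
        rec.provenance.append("scientificName: checked, unchanged")
    return rec


def validate_georeference(rec: CuratedRecord) -> CuratedRecord:
    # Placeholder step (hypothetical): a real step would test the coordinates.
    rec.provenance.append("georeference: not checked in this sketch")
    return rec


def validate_collection_date(rec: CuratedRecord) -> CuratedRecord:
    # Placeholder step (hypothetical): a real step would test the date.
    rec.provenance.append("collection date: not checked in this sketch")
    return rec


STEPS: List[Step] = [
    validate_scientific_name,
    validate_georeference,
    validate_collection_date,
]


def run_workflow(records: List[Dict[str, str]], steps: List[Step]) -> List[CuratedRecord]:
    """Load each record, then pass it through every step in order."""
    curated = []
    for raw in records:
        rec = CuratedRecord(dict(raw))  # copy so the input is left untouched
        for step in steps:
            rec = step(rec)
        curated.append(rec)
    return curated


def to_spreadsheet_rows(curated: List[CuratedRecord]) -> List[Dict[str, str]]:
    """Flatten records into rows with one extra provenance column."""
    return [{**c.data, "provenance": " | ".join(c.provenance)} for c in curated]
```

Calling `to_spreadsheet_rows(run_workflow(records, STEPS))` yields one row per input record, with the record's fields followed by a `provenance` column that lists every step's note in order.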

Page 10: Tdwg14 fp-kurator-ludaescher


… Curation Workflow Output …

Page 11: Tdwg14 fp-kurator-ludaescher


… close up …

• CORRECT
  – Checked and OK
• CURATED:
  – Checked and fixed
• UNABLE_CURATE
  – Internally inconsistent
  – cannot fix
• UNABLED_DET_VALIDITY
  – Not enough data:
    • No external reference found

Page 12: Tdwg14 fp-kurator-ludaescher


… even closer: Spreadsheet Provenance

• Assertions made
  – sign changed coordinates are on the Earth's surface
  – Coordinates not inside country
  – transposed/sign changed coordinates to place inside country
  – Transposed/sign changed coordinates are near georeference of locality from Geolocate
• Sources used
  – Land data from Natural Earth
  – Country boundary data from GeoCommunity
  – GeoLocate

Page 13: Tdwg14 fp-kurator-ludaescher


Date Validation

• Check:
  – Collector’s life span
  – … vs. Date-Collected
• Possible outcomes:
  – Valid
  – Corrected
  – Unable to validate
• Internal inconsistency
  – Contradicting dates
• External inconsistency
  – Lack of date data
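The check above can be sketched in Python as a comparison of the date collected against the collector's life span. This is a sketch under stated assumptions: the function name, its signature, and the explanation strings are hypothetical, and the slide does not say how a date gets "Corrected", so this version only distinguishes "Valid" from "Unable to validate".

```python
from datetime import date
from typing import Optional, Tuple


def validate_collection_date(
    collected: Optional[date],
    born: Optional[date],
    died: Optional[date],
) -> Tuple[str, str]:
    """Return (outcome, explanation) for one record.

    Outcomes follow the slide's wording; the "Corrected" outcome is
    deliberately not produced, since the repair rule is not given.
    """
    # External inconsistency: no reference data to compare against.
    if collected is None or born is None or died is None:
        return "Unable to validate", "external inconsistency: lack of date data"
    # Internal inconsistency: the reference dates contradict each other.
    if born > died:
        return "Unable to validate", "internal inconsistency: birth after death"
    if born <= collected <= died:
        return "Valid", "date collected falls within collector's life span"
    # Internal inconsistency: collection date contradicts the life span.
    return "Unable to validate", "internal inconsistency: date outside life span"
```

For example, a specimen dated 1850 attributed to a collector who lived from 1800 to 1870 comes back as "Valid", while the same specimen dated 1900 is flagged as contradicting the life span.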

Page 14: Tdwg14 fp-kurator-ludaescher


The Logic Behind Each Step …

• Date Collected
  – … collector’s lifetime vs. date collected
• Georeference Validation
  – Lat/long valid (on Earth)
  – … within a country (shape file), point in polygon
  – If georef is “bad” then try
    • … transpositions, sign-swapping etc. of lat/long
    • If they match, fix it!
    • Make sure to record in provenance
    • Using the transposed (or sign-fixed) original data (not the Geolocate)
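The georeference logic above can be sketched in Python as: test the point against the country polygon, and if that fails, try the transposed and sign-changed variants of the original coordinates. The function names and the simple planar ray-casting test are assumptions for illustration; a real workflow would load country boundaries from a shape file and handle cases such as the antimeridian, which this sketch ignores. The status strings reuse the labels shown in the workflow output.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def on_earth(lat: float, lon: float) -> bool:
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def point_in_polygon(lat: float, lon: float, polygon: Sequence[Point]) -> bool:
    """Planar ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):  # edge crosses the point's latitude
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


def variants(lat: float, lon: float) -> List[Tuple[float, float, str]]:
    """All transposed and/or sign-changed versions of the original pair."""
    out = []
    for a, b, swapped in ((lat, lon, False), (lon, lat, True)):
        for sign_a in (1, -1):
            for sign_b in (1, -1):
                if not swapped and sign_a == 1 and sign_b == 1:
                    continue  # that is the original pair itself
                parts = []
                if swapped:
                    parts.append("transposed")
                if sign_a == -1 or sign_b == -1:
                    parts.append("sign changed")
                out.append((sign_a * a, sign_b * b, "/".join(parts)))
    return out


def validate_georeference(lat: float, lon: float, country: Sequence[Point]):
    """Return ((lat, lon), status, provenance notes)."""
    if on_earth(lat, lon) and point_in_polygon(lat, lon, country):
        return (lat, lon), "CORRECT", ["Coordinates inside country"]
    notes = ["Coordinates not inside country"]
    for vlat, vlon, label in variants(lat, lon):
        if on_earth(vlat, vlon) and point_in_polygon(vlat, vlon, country):
            notes.append(f"{label} coordinates to place inside country")
            return (vlat, vlon), "CURATED", notes
    return (lat, lon), "UNABLE_CURATE", notes
```

The repair is derived only from the record's own coordinates, in line with the slide's note that the fix uses the transposed or sign-fixed original rather than a value from Geolocate, and every repair leaves a note for the provenance column.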

Page 15: Tdwg14 fp-kurator-ludaescher


… Logic Behind Each Step (cont’d)

• Scientific Name Validation
  – Customer-dependent:
    • Collection Managers:
      – Nomenclature
    • Researchers:
      – Taxonomy (current names)
  – Several Remote services
    • IPNI, GNI, …
• …. <your logic here> …

Page 16: Tdwg14 fp-kurator-ludaescher


Curation Workflow Challenges: Machine Cycles

• Scalability & Technology Issues:
  – Clean aggregated data at a FP Node
    • Headless
    • Use of Kepler/COMAD, pros & cons:
      – OK on human cycles, but NOT OK on machine cycles
• Akka
  – Parallelize remote service invocation: helps
  – Non-trivial programming
    • => add another layer on top of Akka
    • .. or … ?? <tell us about your technology!>
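To illustrate why parallelizing remote service invocation helps, here is a small Python sketch that uses the standard library's `ThreadPoolExecutor` as a stand-in for Akka; it is not Akka and not the project's code. The `lookup_name` function is a hypothetical placeholder that simulates a slow remote name service with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def lookup_name(name: str) -> str:
    """Hypothetical stand-in for a remote name-service call."""
    time.sleep(0.05)  # simulated network latency
    return name.strip().capitalize()


def validate_names_parallel(names: List[str], max_workers: int = 8) -> List[str]:
    # The calls spend most of their time waiting on the (simulated) network,
    # so running them in threads lets the waits overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(lookup_name, names))
```

With eight workers, eight lookups take roughly the time of one, because the waiting overlaps; run one after another, they would take about eight times as long.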

Page 17: Tdwg14 fp-kurator-ludaescher


Challenges: Human Cycles

• New Kurator project:
  – Enable tool makers
  – Make it easy to build
    • components (software “actors”, services)
    • workflows (gluing services together)
• Data Curation Workflows Interest Group !?
  – Service builders
  – Service & Workflow Registries
    • cf. myExperiment
  – Service aggregators
    • cf. BioVel, DwC validator, …

Page 18: Tdwg14 fp-kurator-ludaescher


What is Kurator?

• NSF-DBI #1356751
  – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
  – Sept. 2014 – 2017
  – @Illinois:
    • B. Ludäscher, James Macklin, Tim McPhillips, …
  – @Harvard:
    • James Hanken, Paul Morris, Bob Morris, …
  – @TDWG community
    • <your name here>

Page 19: Tdwg14 fp-kurator-ludaescher


Kurator Tenets

• Technology Agnostic
  – … to the extent we can …
  – … avoid reinventing the wheel
  – … one size probably doesn’t fit all
  => Deploy curation steps on different wf systems, platforms
• For Tool Makers
• Agile, Community-Driven Development
• Kurator just started, evolving
  – Get involved now!
  – Kick-off meeting November 17 & 18
    • @ NCSA (University of Illinois, Urbana-Champaign)

Page 20: Tdwg14 fp-kurator-ludaescher


How we do it

• Build a library of curation services such that curation workflows can be run from various platforms
  – Scientific workflow systems
    • e.g. Restflow, Kepler, Taverna, Galaxy
  – Other platforms
    • e.g. Akka, Python-based, …
• … leveraging existing technologies

Page 21: Tdwg14 fp-kurator-ludaescher


How we do it

• Open source, community-friendly approach
  – git repository (NCSA open source projects)
• Agile software development
  – NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
  – Small bioinformatics tools manifesto (post-facto)
    • cf. Unix tenets (small tools, use filters, pipes, … KISS!)
  – Experience with other (sometimes not so agile) development projects

Page 22: Tdwg14 fp-kurator-ludaescher


Agile Kurator Development

Interested in looking under the hood?

Kurator/Akka curation wf demo: Wed PM

Initial URL: opensource.ncsa.illinois.edu/projects/KURATOR

Page 23: Tdwg14 fp-kurator-ludaescher


Related Research (Tianhong Song, UC Davis)

• Analyze linear workflow “story”

• Use patterns to discover wf design issues (e.g. use before update); then fix them

• Parallelize when possible