Data Management Plans: A good idea, but not sufficient

Andreas Rauber, Department of Software Technology and Interactive Systems, Vienna University of Technology & Secure Business Austria, [email protected], http://www.ifs.tuwien.ac.at/~andi


Page 1: Data Management Plans: A good idea, but not sufficient

Data Management Plans: A good idea, but not sufficient

Andreas Rauber

Department of Software Technology and Interactive Systems

Vienna University of Technology &

Secure Business Austria

[email protected]

http://www.ifs.tuwien.ac.at/~andi

Page 2: Data Management Plans: A good idea, but not sufficient

Outline

Why are Data Management Plans good but insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Page 3: Data Management Plans: A good idea, but not sufficient

Sustainable (e-)Science

Data is key enabler in science

- Basis for evaluation and verification

- Basis for re-use

- Basis for meta-studies

Safeguarding investment made in data

Need to preserve and curate the data

Preservation: keeping data usable over time, fighting mostly technical and semantic obsolescence

How to avoid data being lost after projects end?

Page 4: Data Management Plans: A good idea, but not sufficient

Sustainable (e-)Science

Data Management Plans as integral part of research proposals

Need recognized by researchers, funding bodies,…

Focus on

- Data

- Descriptions

- Declarations of activities to ensure long-term availability of data

Data Management Plans are good, but not sufficient!

https://data.uni-bielefeld.de/de/data-management-plan

https://dmp.cdlib.org/

https://dmponline.dcc.ac.uk/

Page 5: Data Management Plans: A good idea, but not sufficient

Data Management Plans

Short, free-form text, requiring human interpretation

Declarations of intent

Not enforceable, hardly verifiable

(Burden remains with researchers / institutions, who need to become data management experts)

Focuses solely on data, ignoring the process: pre-processing, processing, analysis

Limits

- availability of data & results

- verification of results

- re-use and re-purposing

http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1

http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf

Page 6: Data Management Plans: A good idea, but not sufficient

From Data to Processes

Excursion: Scientific Processes

Page 7: Data Management Plans: A good idea, but not sufficient

From Data to Processes

Rhythm Pattern Feature Set

- extracts numeric descriptors from audio

- basically 2 Fourier Transforms

- some psycho-acoustic modelling

- some filters (Gaussian, gradient) to make features more robust

Used for

- music genre classification

- clustering of music by similarity

- retrieval

Implemented first in Matlab, then in Java

- both publicly available on website

- same same but different...

Page 8: Data Management Plans: A good idea, but not sufficient

From Data to Processes

Excursion: scientific processes

[Figure: panels set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz and set1_freq440Hz_Am11.0Hz; columns Java and Matlab]

Page 9: Data Management Plans: A good idea, but not sufficient

From Data to Processes

Excursion: Scientific Processes

Why do the Java and Matlab results differ?

- Bug?

- Psychoacoustic transformation tables?

- Forgetting a transformation?

- Different implementation of filters?

- Limited accuracy of calculation?

- Difference in FFT implementation?

- ...?
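One of the candidate causes above, limited accuracy of calculation, can be shown in a few lines. The following is a minimal sketch and not the Rhythm Pattern code: the values are hypothetical and chosen only to make rounding visible. It sums the same three numbers in two different orders, as two implementations of the same formula might do.

```python
import math

# Hypothetical values chosen to make rounding visible; not taken from the talk.
values = [1e16, 1.0, -1e16]


def sum_left_to_right(xs):
    """Plain floating-point accumulation in the given order."""
    total = 0.0
    for x in xs:
        total += x
    return total


# Same numbers, different order of operations.
in_order = sum_left_to_right(values)               # 1.0 is absorbed by 1e16
reordered = sum_left_to_right([1e16, -1e16, 1.0])  # large terms cancel first
exact = math.fsum(values)                          # correctly rounded reference

print(in_order, reordered, exact)
```

Both loops are "correct" implementations of a sum, yet they disagree. Differences of this kind can accumulate through filters and transforms, which is why documenting the data alone does not pin down the result.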

Page 10: Data Management Plans: A good idea, but not sufficient

From Data to Processes

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

Page 11: Data Management Plans: A good idea, but not sufficient

From Data to Processes

To sum up:

Data

- is the fuel for scientific processes

- is the result of scientific processes

Curation of data thus needs to consider these processes

Data Management Plans

- are data centric

- put too little focus on the processes associated with data

- are written by humans for humans

Page 12: Data Management Plans: A good idea, but not sufficient

Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Page 13: Data Management Plans: A good idea, but not sufficient

Process Management Plans

Process Management Plans (PMPs)

Go beyond data to cover research process:

- ideas, steps, tools, documentation, results, …

- data is only one (important) element, commonly actually a result of a research (pre-)process

Ensure re-executability, re-usability

Must be machine-actionable & verifiable

Basis for preservation and re-use of research

Similar to “research objects”, “executable papers”, …

Page 14: Data Management Plans: A good idea, but not sufficient

Process Management Plans

Need to establish

Models for representing such process management plans (PMPs)

Must be machine-readable and machine-actionable

Identify “minimum set” of information

Devise means to automate (most of) the activity in creating and maintaining those PMPs

Establish them to replace (enhance / subsume / …) Data Management Plans

Page 15: Data Management Plans: A good idea, but not sufficient

Process Management Plans

Structure of PMPs (following concept of DMPs):

1. Overview and context

2. Description of processes and their implementation: Process description | Process implementation | Data used and produced by process

3. Preservation: Preservation history | Long term storage and funding

4. Sharing and reuse: Sharing | Reuse | Verification | Legal aspects

5. Monitoring and external dependencies

6. Adherence and review
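To illustrate what "machine-readable and machine-actionable" could mean for this structure, here is a minimal sketch. The section names follow the slide; the class, the field layout and the idea of treating every section as required are illustrative assumptions, not a published PMP model.

```python
from dataclasses import dataclass, field

# Section names follow the PMP structure on the slide; treating all of them
# as required is an assumption made for this sketch.
REQUIRED_SECTIONS = (
    "overview_and_context",
    "processes_and_implementation",
    "preservation",
    "sharing_and_reuse",
    "monitoring_and_external_dependencies",
    "adherence_and_review",
)


@dataclass
class ProcessManagementPlan:
    """Hypothetical container for a PMP; not a standardised schema."""
    title: str
    sections: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return required sections that are absent or empty."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s)]


plan = ProcessManagementPlan(
    title="Music genre classification",
    sections={
        "overview_and_context": {"summary": "Classify music into genres"},
        "processes_and_implementation": {
            "description": "Feature extraction followed by classification",
            "implementation": ["Java feature extractor"],
            "data": {"used": ["music files"], "produced": ["genre labels"]},
        },
    },
)

print(plan.missing_sections())
```

Unlike a free-form DMP, a plan in this form can be checked automatically, for example as part of a proposal submission or a periodic review.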

Page 16: Data Management Plans: A good idea, but not sufficient

Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Page 17: Data Management Plans: A good idea, but not sufficient

Process Capture

Need to establish what forms part of a process:

- analyzing process documentation

- establishing context of process, relationships between elements

- monitoring of process activities

Capture and describe this in a context model

Page 18: Data Management Plans: A good idea, but not sufficient

Architectural Concepts

Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …

DIO: Domain-Independent Ontology

DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)

[Diagram: DIO (ArchiMate) linked to DSO-1 via the DIO-DSO1 Transformation Map and to DSO-2 via the DIO-DSO2 Transformation Map]

Page 19: Data Management Plans: A good idea, but not sufficient

Process Capture

Example: Music Classification Process

Input: music (e.g. MP3 format)

Input: training data, i.e. music with genre labels

Output: classification of music, e.g. into genres

Intermediate steps

- extract numeric description (features) from music

- combine features with ground truth into specific file format, …
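The inputs, outputs and intermediate steps of this example can be written down as a small process description. This is a sketch only: the step and artefact names are hypothetical labels, not identifiers from Taverna or any other tool. From such a description, the boundary of the process (what must be supplied, what is produced, what is only intermediate) can be derived automatically.

```python
# Step and artefact names are hypothetical labels for the music
# classification example; they do not come from any real tool.
steps = [
    {"name": "extract_features",
     "inputs": {"music"}, "outputs": {"features"}},
    {"name": "combine_with_ground_truth",
     "inputs": {"features", "genre_labels"}, "outputs": {"training_file"}},
    {"name": "classify",
     "inputs": {"training_file"}, "outputs": {"classification"}},
]


def process_boundary(steps):
    """Split artefacts into external inputs, intermediates and final outputs."""
    consumed = set().union(*(s["inputs"] for s in steps))
    produced = set().union(*(s["outputs"] for s in steps))
    return {
        "external_inputs": consumed - produced,   # must be preserved or cited
        "intermediate": consumed & produced,      # can be regenerated
        "final_outputs": produced - consumed,     # results of the process
    }


boundary = process_boundary(steps)
for kind, artefacts in boundary.items():
    print(kind, sorted(artefacts))
```

A description like this is a very reduced form of a context model: it records which elements belong to the process and how they relate to each other.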

Page 20: Data Management Plans: A good idea, but not sufficient

Process Capture

Taverna


Page 21: Data Management Plans: A good idea, but not sufficient

Process Capture

On operating systems with software package management (e.g. Linux), the software setup can be detected automatically;

this allows detection of licenses and dependencies
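As a rough illustration of automatic detection, the sketch below records installed packages and versions on a Debian-like system by calling dpkg-query. The choice of dpkg is an assumption (other distributions use other package managers), the sample package list is made up, and license and dependency detection are not covered here.

```python
import shutil
import subprocess


def parse_package_list(text):
    """Parse 'name<TAB>version' lines into a {name: version} dictionary."""
    packages = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, version = line.partition("\t")
        packages[name] = version
    return packages


def installed_packages():
    """Query dpkg on Debian-like systems; return {} where dpkg is unavailable."""
    if shutil.which("dpkg-query") is None:
        return {}
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package}\t${Version}\n"],
        capture_output=True, text=True, check=True,
    )
    return parse_package_list(result.stdout)


# Made-up sample in the same format, so the parser can be tried anywhere.
sample = "openjdk-17-jre\t17.0.9\nsox\t14.4.2\n"
snapshot = parse_package_list(sample)
print(snapshot)
```

Storing such a snapshot next to the process description records which software environment a result was produced in.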

Page 22: Data Management Plans: A good idea, but not sufficient

Process Capture

Page 23: Data Management Plans: A good idea, but not sufficient

Process Capture


Example:

Music Classification Workflow

Page 24: Data Management Plans: A good idea, but not sufficient
Page 25: Data Management Plans: A good idea, but not sufficient

Process Re-deployment

Preservation and Re-deployment

“Encapsulate” as complex “research objects” (RO)

Re-deployment beyond original environment

- Format migration of elements of ROs

- Cross-compilation of code

- Emulation-as-a-Service, virtual machines, …

Page 26: Data Management Plans: A good idea, but not sufficient

Process Re-deployment

Verification, Validation & Data

Verify correctness of re-execution

- validation and verification framework

- process instance data

- points of capture

- metrics
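A minimal sketch of such a check, assuming that numeric outputs were recorded at a point of capture during the original run and again during re-execution. The function name, the tolerances and the sample values are illustrative assumptions, not part of a specific verification framework.

```python
import math


def compare_outputs(reference, rerun, rel_tol=1e-9, abs_tol=1e-12):
    """Compare captured values from the original run with a re-execution."""
    if len(reference) != len(rerun):
        raise ValueError("outputs differ in length")
    deviations = [abs(a - b) for a, b in zip(reference, rerun)]
    mismatches = [
        i for i, (a, b) in enumerate(zip(reference, rerun))
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return {
        "max_abs_deviation": max(deviations, default=0.0),  # simple metric
        "mismatches": mismatches,                           # indices out of tolerance
    }


# Made-up process instance data for illustration.
reference = [0.25, 1.5, 3.0]
rerun = [0.25, 1.5000000001, 3.5]
report = compare_outputs(reference, rerun)
print(report)
```

The tolerance is itself a design decision: bit-identical output is often too strict across platforms, so the metric and its threshold need to be recorded together with the data.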

Data and data citation

- Identifying subsets of data in large and dynamic databases

- Timestamping and versioning of data

- Assigning PID (DOI, …) to time-stamped query

[Diagram: Data (Table A, Table B), Query, Query Store, Subsets, PID Provider, PID Store]
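The data citation idea can be sketched with an in-memory SQLite database. Everything specific here is an assumption for illustration: the table layout, the integer timestamps, and the hash-based identifier that stands in for a PID issued by a real provider such as a DOI service.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE measurements (id INTEGER, value REAL,
                               valid_from INTEGER, valid_to INTEGER);
    CREATE TABLE query_store (pid TEXT PRIMARY KEY, sql TEXT,
                              as_of INTEGER, result_hash TEXT);
""")
# Versioned rows: an update closes the old version and adds a new one.
db.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(1, 10.0, 1, 5), (1, 12.0, 5, None), (2, 7.0, 2, None)],
)

SUBSET_SQL = (
    "SELECT id, value FROM measurements "
    "WHERE valid_from <= :as_of AND (valid_to IS NULL OR valid_to > :as_of) "
    "ORDER BY id"
)


def run_as_of(as_of):
    """Return the subset as it existed at the given timestamp."""
    return db.execute(SUBSET_SQL, {"as_of": as_of}).fetchall()


def cite(as_of):
    """Store the time-stamped query and return a stand-in PID."""
    rows = run_as_of(as_of)
    result_hash = hashlib.sha256(repr(rows).encode()).hexdigest()
    pid = "pid:" + hashlib.sha256(f"{SUBSET_SQL}|{as_of}".encode()).hexdigest()[:12]
    db.execute("INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)",
               (pid, SUBSET_SQL, as_of, result_hash))
    return pid


def resolve(pid):
    """Re-run the stored query and check the result against its stored hash."""
    sql, as_of, result_hash = db.execute(
        "SELECT sql, as_of, result_hash FROM query_store WHERE pid = ?",
        (pid,)).fetchone()
    rows = db.execute(sql, {"as_of": as_of}).fetchall()
    unchanged = hashlib.sha256(repr(rows).encode()).hexdigest() == result_hash
    return rows, unchanged


pid = cite(as_of=3)
rows, unchanged = resolve(pid)
print(pid, rows, unchanged)
```

The citation stores the query and its timestamp rather than a copy of the data, so the cited subset stays reproducible even after the database has been updated.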

Page 27: Data Management Plans: A good idea, but not sufficient

Sustainable (e-)Science

How to get there?

Research infrastructure support

- Versioning systems

- Logging (“virtual lab-book”)

- Virtual machines / pre-configured virtual labs for research

- Data citation support for large, dynamic databases

R&D in process preservation, re-deployment & verification

- Evolving research environments, code migration, …

- Verification of process re-execution

- Financial impact, business models

Page 28: Data Management Plans: A good idea, but not sufficient

Summary

Need to move beyond concept of data

Need to move beyond the focus on description

Process Management Plans (PMPs) extending DMPs

Process capture, preservation & verification

Capture “all” elements of a research process

Machine-readable and -actionable

Data and process re-use as basis for data driven science

Page 29: Data Management Plans: A good idea, but not sufficient

Thank you!

http://www.ifs.tuwien.ac.at/imp
