Claudia Bauzer Medeiros Digital preservation – caring for our data to foster knowledge discovery...

Post on 11-May-2015


Digital preservation – caring for our data to foster knowledge discovery and dissemination

Claudia Bauzer Medeiros

Institute of Computing

UNICAMP

Preserve

Prae-servare

(Before) – (Save)

= save before it disappears

Maintain

Manu-tenere

= being able to get/find it

Dec 2008

Feb 2010

Data deluge

• At end of 2011 – info created and replicated > 1.8 zettabytes

• 90% of data created in the last 2 years

• 5 hour flight – 240 Tbytes

• Facebook – 200 million users, >70 languages

• Each person in England is filmed 300 times/day

• Teenagers in the US send an average of 110 phone text messages a day

=> We need to build arks during the deluge - PRESERVATION

Outline

• Why preserve?

• What to preserve?

• How to preserve?

• Where to preserve?

And a few associated challenges

WHY PRESERVE

• Costly to produce

• Contribute to progress of science

• Intrinsic value: culture/science/sustainability

WHY PRESERVE

• Costly to produce
– Infrastructure, power, software, models, visualization, people
– Hardware, Software, Peopleware

• Contribute to progress of science
– Reproducibility and reusability
– Publication and sharing
– Quality

• Intrinsic value: culture/science/sustainability
– Digital humanities
– Domesday project
– Fonoteca Neotropical Jacques Vielliard

The Domesday Project 1086-1986

• Digital decay

• Equipment obsolescence

• Software obsolescence

Domesday reloaded

Fonoteca Neotropical Jacques Vielliard

Outline

• Why preserve?

• What to preserve?

• How to preserve?

And associated challenges

What to preserve?

• Data

• BUT what is “data”?

– Files and records

– Models, documentation, annotations, sketches, experiments, recordings

• Only data?

– How it was produced: workflows, devices, methodologies, materials and methods, reasoning, logs (provenance)
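
As a minimal sketch of what keeping provenance next to the data might look like, the Python record below bundles a reference to a data file with a description of how it was produced. The class name, field names and example values are all hypothetical; the slides do not prescribe a schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ProvenanceRecord:
    """Hypothetical record: the data object plus how it was produced."""
    data_file: str                                # the preserved object itself
    workflow: str                                 # workflow or procedure used
    devices: list = field(default_factory=list)   # instruments involved
    methods: str = ""                             # materials and methods
    log: list = field(default_factory=list)       # free-text processing log

    def add_log(self, message: str) -> None:
        """Append one processing step to the log."""
        self.log.append(message)

    def to_json(self) -> str:
        """Serialize the whole record so it can be stored with the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only.
record = ProvenanceRecord(
    data_file="recording_001.wav",
    workflow="field-recording",
    devices=["portable recorder"],
    methods="10 minute samples recorded at dawn",
)
record.add_log("digitized from tape")
print(record.to_json())
```

Storing the serialized record beside the file is one simple option; a real archive would more likely follow an agreed metadata standard.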

What to preserve?

• Data

• Environment in which it was produced

• Data needed for preservation occupies more space than the data itself

• Preservation means storing more than the object itself


What about our research data? (slide adapted from Jim Gray)

[Diagram: “Collaboratory”, data-driven science. Labels: Questions, Answers, Models, Simulations, Papers, Files, Experiments, Instruments, DATA]


Data sources? Table of Product Characteristics

id          Property name   Value
MilkProd    productsrep     MilkA
MilkProd    quantity        10000
MilkProd    validity date   10/06/2006
CheeseProd  productsrep     Minas
CheeseProd  quantity        2000
CheeseProd  validity date   12/02/2006
CheeseProd  shape           Circular
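
The product table stores each product as (id, property name, value) triples, so different products can carry different properties (only CheeseProd has a shape). A minimal Python sketch, assuming the rows are available as tuples, groups them into one dictionary per product. The function name `pivot` is illustrative and not part of the original slides.

```python
from collections import defaultdict

# Rows copied from the table: (id, property name, value).
rows = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "Minas"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot(triples):
    """Group (id, property, value) triples into {id: {property: value}}."""
    products = defaultdict(dict)
    for product_id, prop, value in triples:
        products[product_id][prop] = value
    return dict(products)


products = pivot(rows)
print(products["CheeseProd"]["shape"])   # Circular
print("shape" in products["MilkProd"])   # False
```

Values are kept as strings on purpose: the table itself does not say how quantities or dates should be typed.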


eEnvironmental Science

• Direct and indirect observations


Data sources


We are DATASCOPE engineers

Software is the device/tool

Outline

• Why preserve?

• What to preserve?

• How to preserve?

And associated challenges

How to preserve?

How to construct the ark during the

deluge?

Praeservare, Manutenere and Share

How to preserve?

• To ensure retrievability and sharing
– Index structures
– Ontologies, metadata, keywords, standards
– Workflows

• To ensure longevity
– Media decay, software decay, hardware decay
– PEOPLE DECAY

• To ensure quality
– Curation procedures, metadata, standards

• To afford maintenance costs
– Cloud? CAP theorem? => WHERE
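
One common way to detect media decay is a fixity check: store a checksum with the object and recompute it later. The sketch below uses Python's standard `hashlib`; the helper names `sha256_of` and `is_intact` and the temporary demo file are illustrative assumptions, not something the slides specify.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_checksum):
    """True if the file still matches the checksum stored at ingest time."""
    return sha256_of(path) == expected_checksum


# Demo on a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"field recording, 2010")
    demo_path = tmp.name

stored = sha256_of(demo_path)            # checksum kept with the metadata
ok_before = is_intact(demo_path, stored)

with open(demo_path, "ab") as handle:    # simulate decay: one stray byte
    handle.write(b"\x00")
ok_after = is_intact(demo_path, stored)

os.remove(demo_path)
print(ok_before, ok_after)               # True False
```

A checksum only detects damage; recovering from it still requires a second copy stored elsewhere.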

Sharing and open access

NSF Data Management Policy

Paper and data publication

“Sharing of Data Leads to Progress on Alzheimer’s”

By Gina Kolata, The New York Times

Published: August 12, 2010

In 2003, a group of scientists and executives from the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups joined in a project that experts say had no precedent: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain.

share all the data, making every single finding public immediately, available to anyone with a computer anywhere in the world

=> AVAILABILITY and REUSE


• Data must be properly curated throughout its life-cycle and released with the appropriate high-quality metadata.

• Source: Medical Research Council, UK


• Research data should be made available for use by other researchers. Researchers must retain research data, including electronic data, in a durable, indexed and retrievable form.

• Source: Australian Government National Health and Medical Research Council


Microsoft Academic Search

40M publications

19M authors

75 publishers (Wiley, Springer, ACM, IEEE …)


Google Scholar Citations


• Citing data is as important as citing papers

• For researchers, publishers, data centers

• Over 1M DOIs, several major national research libraries
– Germany, France, Korea, Netherlands, Australia, USA...

• Present manager – German National Library of Science and Technology


Publish on the Cloud

Add metadata

Pre-print sharing


FNJV

proj.lis.ic.unicamp.br/fnjv

• Sharing by publishing on the Web

• Retrievability by extending metadata

CURATION AND USE OF STANDARDS

Workflows and model preservation


Comb-e-Chem

[Diagram labels: X-Ray, e-Lab, Analysis, Properties, Simulation, Video, Diffractometer, Grid Middleware, Structures, Database]

The cloud and CAP

Outline

• Why preserve?

• What to preserve?

• How to preserve?

• Where to preserve?

And a few associated challenges

PRE-SAVE and MANU-TENERE

Outline

• Why preserve?
– Costly to produce (hardware, software, peopleware)
– Contribute to progress of science
– Value: culture, science, sustainability

• What to preserve?
– Data [WHAT IS DATA?]
– Context of production and use

• How to preserve?
– Accessibility and sharing: standards, metadata, ontologies
– Integrity and quality: context of use (hardware, software), standards


References


NSF – CISE Data management policy

The Domesday Project

http://www.atsf.co.uk/dottext/domesday.html

The CLARIN Project (languages)

Eigenfactor.org

Altmetrics movement

Thank you!!!!