
The ZBW is a member of the Leibniz Association

Statistical Research Data on the Semantic Web

SWIB 2012, Cologne, Germany

Daniel Bahls, Leibniz Information Centre for Economics (ZBW)

Outline

1. Introduction

2. Research data in economics and scientific practices

3. Thoughts on data representation

4. Repeatability of research results

5. Outlook

6. Data access and retrieval

7. Proxies and empirical models


MaWiFo Project

Management of Economic Research Data


“What researchers want”

Source: Feijen (2011)

• Tools and services must be in tune with researchers’ workflows, which are often discipline-specific

• They must be easy to use

• “Cafeteria model”: researchers can pick and choose from a set of tools and services

• Benefits must be clearly visible – not in three years’ time, but now

Research Data as Bibliographic Artefacts

• Re-use

Data Sharing gives more opportunities for research

• Citation

Data acquisition and assignment of Persistent Identifiers

• Transparency

Reproducibility: a fundamental criterion for good scientific practice


Research data in economics and scientific practices

Target Group: Researchers in Economics

Community Building for Knowledge Exchange:

Economists – Data Librarians – Computer Scientists

Interviews on: Sources, Data Management, Processing, Sharing, Publishing


What does Research Data look like in Economics?


Interviews with Researchers in Economics


Sources

• Data Agencies

• Statistical Offices

• Trusted Institutes and Researchers

• Own Surveys & Studies

Data Management

• Local File System

• Backup Server

• DVD, External HD, ...

Processing

• SPSS, Stata, Matlab, ...

• Programming Languages

• High Performance Computing

• Execution Times: seconds, minutes, hours

Sharing

• Within Teams

• Trusted Colleagues

• On Request (?)

Publishing

• practiced sometimes

• Zip Files

• not included in review process

Particular Findings

Research is driven by the availability of data

(to some extent)

Some research is based on external data,

Some research is based on self-conducted studies

Combining and Merging of data sets

On average, an estimated 66% of the data comes from external sources

Particular Findings

Data Usage Rights – e.g. Thomson-Reuters Datastream

Data Protection

on-site access, virtual access

sample data to understand structure

analysis scripts

aggregation

protection maintained?

Copy to third party?

Thoughts on Data Representation

data review, curation, transparency, re-use, repeatability


Often, the legal situation does not allow the entire data set to be published as it was used

Interim Conclusion

A model based on copying is insufficient

We suggest fine-grained referencing

single data items must be referenceable (merging, curation)

highly distributable (distributed data sources)

extensible (heterogeneous long tail data, curation)

LOD-based approach


[Diagram: a UserDataSet (type: DataSet) is linked via includesData to Data Items from the researcher's own survey and to Data Items of an external dataset]
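The reference model in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the class names mirror the diagram labels, while the `example.org` URIs, the `includes_data` field and the `sources` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass(frozen=True)
class DataItemRef:
    """A reference to a single data item: a URI only, never the value."""
    uri: str


@dataclass
class UserDataSet:
    """A researcher's data set, described purely by references (includesData)."""
    uri: str
    includes_data: list[DataItemRef] = field(default_factory=list)

    def sources(self) -> set[str]:
        """Return the distinct hosts that the referenced items live on."""
        return {urlparse(ref.uri).netloc for ref in self.includes_data}


# Hypothetical URIs: one item of an external dataset, one from an own survey.
user_ds = UserDataSet(
    uri="http://example.org/userdataset/1",
    includes_data=[
        DataItemRef("http://stats.example.org/dataset/003311/item/X"),
        DataItemRef("http://example.org/survey/2012/item/17"),
    ],
)
print(sorted(user_ds.sources()))
```

Because the user data set stores references rather than copies, it can be published even when some of the referenced items are not freely redistributable.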

Source: Data Cube vocabulary, StatsWales: Life Expectancy, Dataset 003311 (used for our example)

RDF-Representation for Statistical Data


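[Diagram: model elements DataSet, Dimension (with label), Item (with dataProperty) and DimValue]

Example:

• Item X, rdf:value: 83.7

• Dimension A, label "time": 2005-7

• Dimension B, label "region": Cardiff

• Dimension C, label "gender": Female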

Using the semantic model, data can be referenced at a very detailed level without the data itself having to be public

[Diagram: the same item X (time: 2005-7, region: Cardiff, gender: Female), with its rdf:value 83.7 marked as protected]

You can omit single information items, such as the value itself, yet the data is still referenceable
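A small Python sketch may make the idea concrete. It represents the example item as plain (subject, predicate, object) tuples and publishes only the unprotected ones. The item URI, the `ex:` predicate names and the `public_view` function are assumptions made for this illustration; only `rdf:value` and the example figures come from the slide.

```python
# Triples are (subject, predicate, object); the URI and ex: names are hypothetical.
RDF_VALUE = "rdf:value"
ITEM_X = "http://example.org/dataset/003311/item/X"

observation = [
    (ITEM_X, "ex:time", "2005-7"),
    (ITEM_X, "ex:region", "Cardiff"),
    (ITEM_X, "ex:gender", "Female"),
    (ITEM_X, RDF_VALUE, 83.7),
]


def public_view(triples, protected=frozenset({RDF_VALUE})):
    """Drop triples with a protected predicate; subject and dimensions remain."""
    return [t for t in triples if t[1] not in protected]


published = public_view(observation)
for triple in published:
    print(triple)
```

The published triples still identify item X by its URI and dimension values, so other data sets can refer to it, while the protected value stays with the data provider.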

RDF-Representation for Statistical Data

Challenge: Stable URIs required for every single data item
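One conceivable way to obtain such URIs is to derive them deterministically from the dataset identifier and the dimension values of a cell. The sketch below only illustrates that idea; the `BASE` namespace, the URI layout and the function name `item_uri` are assumptions, not a scheme proposed in the talk.

```python
from urllib.parse import quote

BASE = "http://example.org/data"  # hypothetical namespace


def item_uri(dataset_id: str, dimensions: dict[str, str]) -> str:
    """Build a URI from the dataset ID and the dimension values, sorted by name."""
    parts = [
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in sorted(dimensions.items())
    ]
    return f"{BASE}/{quote(dataset_id, safe='')}/" + ";".join(parts)


uri = item_uri("003311", {"time": "2005-7", "region": "Cardiff", "gender": "Female"})
print(uri)
# prints http://example.org/data/003311/gender=Female;region=Cardiff;time=2005-7
```

Sorting the dimensions makes the result independent of the order in which they are supplied. Such a URI is only as stable as the dataset identifier and the dimension codes it is built from.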


SCOVO


RDF Data Cube Vocabulary (QB)

Source: http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html

Repeatability of research results


aggregation and data cleaning?

missing values

seasonal adjustment

purchasing power adjustment

plausibility tests

basket analyses

...

McCullough, B. D. “Got Replicability? The _Journal of Money, Credit and Banking_ Archive”. Econ Journal Watch, 2007, 4, 326-337

Interesting read

Repeatability of research results


scripts (“do-files”)

working copies of data

change parameters, so that the effect can be shown clearly

no overall build process

A build script for empirical analyses

Maven-like, ANT-like
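As a rough sketch of what such a build script could look like, the Python example below declares analysis steps with their dependencies and runs them in a valid order using the standard library's `graphlib`. The step names are illustrative (some are borrowed from the data-cleaning slide), and the `build` function and `actions` mapping are hypothetical; in practice an action might invoke a do-file.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on; names are illustrative.
steps = {
    "fetch_external_data": set(),
    "load_own_survey": set(),
    "merge": {"fetch_external_data", "load_own_survey"},
    "clean_missing_values": {"merge"},
    "seasonal_adjustment": {"clean_missing_values"},
    "run_analysis": {"seasonal_adjustment"},
}


def build(steps, actions):
    """Run every step exactly once, after all of its dependencies."""
    executed = []
    for name in TopologicalSorter(steps).static_order():
        actions.get(name, lambda: None)()  # no-op if no action is registered
        executed.append(name)
    return executed


order = build(steps, actions={})
print(order)
```

Declaring the dependencies explicitly is what gives the overall build process that individual scripts and working copies lack: anyone can rerun the whole chain from the source data to the result.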


[Diagram: the UserDataSet (type: DataSet) with its includesData links to Data Items from the own survey and of an external dataset, now extended by a buildScript]

No gaps

Trust

Incentive


Communication & Architecture


[Diagram components: Client; Digital Library (DOI, Reference Model); Archives A, B, C and D (Authenticate & Request Data)]
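One possible reading of this architecture, sketched in Python: the client resolves a DOI at the digital library to obtain the reference model, then requests each referenced item from the archive that hosts it, and each archive decides whether the user is authorised. Everything here is an in-memory stand-in invented for illustration, including the DOI, the archive hosts, the user name and the second data value.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the digital library and two archives.
LIBRARY = {
    "doi:10.0000/example": [
        "http://archive-a.example.org/item/1",
        "http://archive-b.example.org/item/2",
    ]
}
ARCHIVES = {
    "archive-a.example.org": {"allowed": {"alice"}, "data": {"/item/1": 83.7}},
    "archive-b.example.org": {"allowed": set(), "data": {"/item/2": 42.0}},
}


def request_item(uri, user):
    """Ask the hosting archive for one item; return None if access is denied."""
    parsed = urlparse(uri)
    archive = ARCHIVES[parsed.netloc]
    if user not in archive["allowed"]:
        return None
    return archive["data"][parsed.path]


def reconstruct(doi, user):
    """Resolve the reference model via the DOI, then fetch each item."""
    return {uri: request_item(uri, user) for uri in LIBRARY[doi]}


result = reconstruct("doi:10.0000/example", "alice")
print(result)
```

Items the user may not access remain visible as references with no value attached, so the composition of the data set stays transparent even where the data itself is protected.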

Open Challenges (practical)

Researchers in economics would love to re-use data from others.

Researchers in economics hesitate to share their data.

Competitive advantage:

“We put too much effort into data production,

so we want to be the ones to publish on it.”

“The code discloses too much of our know-how.”

Incentives needed:

Data citation

Trust in research results (no gaps from data sources to results)

Open Challenges (technical)

Precise referencing:

A unique URI for every data item / table cell?

How about curation and data versioning?

Maven-like build scripts:

How to specify entire system environments and software modules?

Vocabulary extensions:

Specific data needs specific description,

where do the necessary rdf:Properties come from?

Summing up

• Reference model for exact reconstruction of research data sets

• Build scripts and dependency management for repeatability

• Transparency of data sources and processes

• “executable paper”, learning from others, data reviews, ...

• rerun analysis – with curated values – with latest data


Thank you
