The ZBW is a member of the Leibniz Association
Statistical Research Data on the Semantic Web
SWIB 2012, Cologne, Germany
Daniel Bahls, Leibniz Information Centre for Economics (ZBW)
Outline
1. Introduction
2. Research data in economics and scientific practices
3. Thoughts on data representation
4. Repeatability of research results
5. Outlook
6. Data access and retrieval
7. Proxies and empirical models
MaWiFo Project
Management of Economic Research Data
“What researchers want”
Source: Feijen (2011)
• Tools and services must be in tune with researchers’ workflows, which are often discipline-specific
• They must be easy to use
• “Cafeteria model”: researchers can pick and choose from a set of tools and services
• Benefits must be clearly visible – not in three years’ time, but now
Research Data as Bibliographic Artefacts
• Re-use
Data sharing creates more opportunities for research
• Citation
Data acquisition and assignment of persistent identifiers
• Transparency
Reproducibility: a fundamental criterion for good scientific practice
Research data in economics and scientific practices
Target Group: Researchers in Economics
Community Building for Knowledge Exchange:
Economists – Data Librarians – Computer Scientists
Interviews on:
• Data Management
• Sources
• Processing
• Sharing
• Publishing
What does research data look like in economics?
Interviews with Researchers in Economics
Sources
• Data Agencies
• Statistical Offices
• Trusted Institutes and Researchers
• Own Surveys & Studies
Data Management
• Local File System
• Backup Server
• DVD, External HD, ...
Processing
• SPSS, Stata, Matlab, ...
• Programming Languages
• High Performance Computing
• Execution Times: seconds, minutes, hours
Sharing
• Within Teams
• Trusted Colleagues
• On Request (?)
Publishing
• practiced sometimes
• Zip Files
• not included in review process
Particular Findings
Research is driven by the availability of data
(to some extent)
Some research is based on external data,
Some research is based on self-conducted studies
Combining and Merging of data sets
On average, an estimated 66% of the data comes from external sources
Particular Findings
Data Usage Rights – e.g. Thomson-Reuters Datastream
Data Protection
on-site access, virtual access
sample data to understand structure
analysis scripts
aggregation
protection maintained?
Copy to third party?
Thoughts on Data Representation
data review, curation, transparency, re-use, repeatability
Often, the legal situation does not allow for publishing the entire data set as was used
Interim Conclusion
A model based on copying is insufficient
We suggest fine-grained referencing
single data items must be referenceable (merging, curation)
highly distributable (distributed data sources)
extensible (heterogeneous long tail data, curation)
LOD-based approach
[Diagram: a UserDataSet (type: DataSet) holds data items from the researcher's own survey and is linked via includesData to data items of an external dataset.]
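The reference model in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class names `ItemRef` and `UserDataSet`, the `source` labels, and all `example.org` URIs are assumptions made for this sketch. The point it shows is that a user data set holds references to single data items rather than copies of their values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ItemRef:
    """Reference to a single data item; the value itself is not copied."""
    uri: str     # hypothetical item URI
    source: str  # "own-survey" or "external" (labels assumed for this sketch)


@dataclass
class UserDataSet:
    uri: str
    items: List[ItemRef] = field(default_factory=list)

    def include(self, ref: ItemRef) -> None:
        """Add a reference once; repeated references are ignored."""
        if ref not in self.items:
            self.items.append(ref)

    def external_share(self) -> float:
        """Fraction of referenced items that come from external sources."""
        if not self.items:
            return 0.0
        external = sum(1 for ref in self.items if ref.source == "external")
        return external / len(self.items)


ds = UserDataSet("http://example.org/dataset/my-study")
ds.include(ItemRef("http://example.org/survey/item/17", "own-survey"))
ds.include(ItemRef("http://example.org/lifeexp/item/X", "external"))
ds.include(ItemRef("http://example.org/lifeexp/item/X", "external"))  # duplicate

print(len(ds.items), ds.external_share())
```

Because only URIs are stored, such a data set could be passed on even when the external values themselves may not be redistributed.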
RDF-Representation for Statistical Data
Source: Data Cube vocabulary
StatsWales: Life Expectancy, Dataset 003311 (used for our example)
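[Diagram: schema with DataSet, Dimension (with label), Item, DimValue and dataProperty. Example: item X has rdf:value 83.7 and the dimension values 2005-7 (dimension A, label "time"), Cardiff (dimension B, label "region") and Female (dimension C, label "gender").]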
Using the semantic model, referencing of data at a very detailed level is possible - without need for the data itself to be public
[Diagram: the same example item X (time 2005-7, region Cardiff, gender Female), with its rdf:value 83.7 marked as protected.]
You can omit single information items, such as the value itself, yet the data is still referenceable
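As a rough illustration of this idea, the example item X can be written as plain subject-predicate-object triples, once with and once without its value. The namespace `EX`, the item URI and the predicate names for the dimensions are hypothetical; only the `rdf:value` URI and the figures (2005-7, Cardiff, Female, 83.7) are taken from the slide.

```python
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
EX = "http://example.org/lifeexp/"  # hypothetical namespace for this sketch


def describe_item(item_uri, dimensions, value=None):
    """Return (subject, predicate, object) triples for one data item.

    If value is None, the rdf:value triple is left out, so the item
    stays referenceable by its URI while the value is withheld.
    """
    triples = [
        (item_uri, EX + name, dim_value)
        for name, dim_value in sorted(dimensions.items())
    ]
    if value is not None:
        triples.append((item_uri, RDF_VALUE, value))
    return triples


item_x = EX + "item/X"
dims = {"time": "2005-7", "region": "Cardiff", "gender": "Female"}

full = describe_item(item_x, dims, value=83.7)
protected = describe_item(item_x, dims)  # value withheld

for triple in protected:
    print(triple)
```

Both descriptions use the same subject URI, so a reference to item X resolves to the same item whether or not the reader is allowed to see the value.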
RDF-Representation for Statistical Data
Challenge: stable URIs are required for every single data item
SCOVO
RDF Data Cube Vocabulary (QB)
Source: http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
Repeatability of research results
Aggregation and data cleaning?
• missing values
• seasonal adjustment
• purchasing power adjustment
• plausibility tests
• basket analyses
• ...
Interesting read: McCullough, B. D. Got Replicability? The _Journal of Money, Credit and Banking_ Archive. Econ Journal Watch, 2007, 4, 326-337
Repeatability of research results
scripts (“do-files”)
working copies of data
change parameters so that the effect can be shown clearly
no overall build process
A build script for empirical analyses
Maven-like, ANT-like
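What a build script for an empirical analysis might look like can be sketched with a small dependency runner. This is an assumption-laden illustration, not the tool proposed in the talk: the function `run_build`, the step names and the toy input values are all invented here. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) so that every step runs exactly once, after the steps it depends on.

```python
from graphlib import TopologicalSorter


def run_build(steps, dependencies):
    """Run every step once, after all steps it depends on.

    steps:        step name -> function taking a dict of upstream results
    dependencies: step name -> set of step names it depends on
    """
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return order, results


# Toy pipeline with invented values: load, drop missing values, analyse.
steps = {
    "load": lambda up: [1.0, None, 3.0],
    "clean": lambda up: [v for v in up["load"] if v is not None],
    "analyse": lambda up: sum(up["clean"]) / len(up["clean"]),
}
dependencies = {"load": set(), "clean": {"load"}, "analyse": {"clean"}}

order, results = run_build(steps, dependencies)
print(order, results["analyse"])
```

Declaring the steps and their dependencies explicitly is what removes the gaps between the data sources and the published result: anyone with access to the inputs can rerun the whole chain.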
[Diagram: a UserDataSet (type: DataSet) with data items from the researcher's own survey, an includesData link to an external dataset, and a buildScript.]
No gaps
Trust
Incentive
Communication & Architecture
[Diagram: Client, Digital Library (DOI, Reference Model), and Archives A, B, C and D (Authenticate & Request Data).]
Open Challenges (practical)
Researchers in economics would love to re-use data from others.
Researchers in economics hesitate to share their data.
Competitive advantage:
“We put too much effort into data production,
so we want to be the ones to publish on it.”
“The code discloses too much of our know-how.”
Incentives needed:
Data citation
Trust in research results (no gaps from data sources to results)
Open Challenges (technical)
Precise referencing:
A unique URI for every data item / table cell?
How about curation and data versioning?
Maven-like build scripts:
How to specify entire system environments and software modules?
Vocabulary extensions:
Specific data needs specific description,
where do the necessary rdf:Properties come from?
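One conceivable answer to the precise-referencing question is to derive each cell URI deterministically from the data set identifier and the cell's dimension values, with an optional version for curated data. The sketch below is only that: the base URL, the URI layout and the `version` query parameter are assumptions of this example, not a scheme proposed in the talk. The data set number 003311 and the dimension values come from the StatsWales example.

```python
from urllib.parse import quote

BASE = "http://example.org/data/"  # hypothetical base URL


def cell_uri(dataset_id, dimensions, version=None):
    """Build a URI for one table cell from its dimension values.

    Dimensions are sorted by name, so the same cell always yields the
    same URI regardless of the order in which they are given.
    """
    key = "/".join(
        f"{quote(name, safe='')}={quote(str(value), safe='')}"
        for name, value in sorted(dimensions.items())
    )
    uri = f"{BASE}{quote(dataset_id, safe='')}/{key}"
    if version is not None:
        uri += f"?version={quote(str(version), safe='')}"
    return uri


cell = {"time": "2005-7", "region": "Cardiff", "gender": "Female"}
print(cell_uri("003311", cell))
print(cell_uri("003311", cell, version=2))
```

With such a scheme a curated value would get a new version of the same cell URI rather than a new identity, which leaves open how archives agree on version numbers.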
Summing up
• Reference model for exact reconstruction of research data sets
• Build scripts and dependency management for repeatability
• Transparency of data sources and processes
• “Executable paper”, learning from others, data reviews, ...
• rerun analysis – with curated values – with latest data
Thank you