The Structured Data Hub in 2019
Uploaded by: richard-zijdeman
Category: Data & Analytics
Transcript of The Structured Data Hub in 2019
The Structured Data Hub
Today’s fiction, 2019’s reality
Status quo
Many datasets currently live in isolation. They are stored on people’s computers and are not findable. Moreover, little effort is made to link such datasets. When data are linked, the datasets first have to be cleaned and harmonised, which is very time-intensive. More importantly, such linkage efforts are seldom shared, literally producing ‘disposable research’.
What we envisage
We envisage selecting core micro, meso and macro datasets from the field of economic and social history and creating a structured data hub from them.
Structured Data Hub
Your data
Tooling
WWW
Next, we want to allow you to connect your data to the hub and to build such connections yourself, while we ensure that your data is findable and linkable to other datasets on the (semantic) world wide web.
The Structured Data Hub: a place to
store data
augment data
link data
find data
ask questions! (for data analysis and visualization)
So, the Structured Data Hub is a place to …. Now let’s go into more detail on some of these aspects.
Data augmentation
A first feature of the Structured Data Hub is augmentation. By augmentation we mean the process of enhancing your data with core variables from the social, demographic and economic sciences.
For example, think of a dataset containing individual characteristics, including occupation and HISCO code. If we wanted to know whether these persons were incumbents of high or low occupations, we would need to add a stratification measure.
Here, we add the universal HISCAM scale, but any other HISCO-based stratification scale or class measure can be added.
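The lookup step described here can be sketched as a simple join on the HISCO code. This is only a minimal illustration, not the hub's actual tooling: the function name `augment_with_scale`, the column names, and all codes and scores below are hypothetical placeholders, not real HISCO or HISCAM values.

```python
# Hypothetical lookup table: occupational code -> stratification score.
# The codes and scores are placeholders, NOT real HISCO/HISCAM values.
SCALE_LOOKUP = {"00001": 70.0, "00002": 45.0}

# Hypothetical input rows with an occupation and an occupational code.
persons = [
    {"surname": "Bridges", "occupation": "civil engineer", "hisco": "00001"},
    {"surname": "Fumes", "occupation": "cigarmaker", "hisco": "00002"},
    {"surname": "Moves", "occupation": "dancer", "hisco": "00003"},
]


def augment_with_scale(rows, lookup, new_column):
    """Return copies of the rows with a score added under new_column.

    Rows whose code is missing from the lookup get None, so gaps
    stay visible instead of being silently dropped.
    """
    return [{**row, new_column: lookup.get(row["hisco"])} for row in rows]


augmented = augment_with_scale(persons, SCALE_LOOKUP, "hiscam")
for row in augmented:
    print(row["surname"], row["hiscam"])
```

Because the scale is passed in as a plain lookup table, swapping in another HISCO-based scale or class measure only means supplying a different table.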
We might also be interested in the area where people are working, here indicated by the place variable. If we wanted to map such values, or calculate distances between these places, we would need information on the latitude and longitude.
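As a rough sketch of what becomes possible once latitude and longitude are attached, the great-circle distance between two places can be computed with the haversine formula. The gazetteer below and the function name `haversine_km` are assumptions made for illustration; the two cities and their approximate coordinates are examples and do not come from the slides.

```python
import math

# Illustrative gazetteer: place name -> (latitude, longitude) in decimal degrees.
# Approximate coordinates, chosen only as an example.
GAZETTEER = {
    "Amsterdam": (52.37, 4.90),
    "Rotterdam": (51.92, 4.48),
}


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


distance = haversine_km(GAZETTEER["Amsterdam"], GAZETTEER["Rotterdam"])
print(round(distance, 1), "km")
```

The haversine formula treats the earth as a sphere, which is usually accurate enough for exploratory work on historical place names.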
Another type of data augmentation concerns the application of basic calculations to derive new variables. Income, for example, is seldom analysed in its raw form and is often rescaled using a log transformation.
The Structured Data Hub facilitates the creation and documentation of such newly derived variables.
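A derived variable and its documentation could be produced together, for instance along the following lines. This is a sketch under assumptions: the function `derive_log_income`, the column names, and the shape of the documentation record are all hypothetical, and the income figures are made up for the example.

```python
import math


def derive_log_income(rows, source="income", target="log_income"):
    """Add a natural-log version of an income column and describe how it was made.

    Missing or non-positive values cannot be log-transformed and become None.
    """
    derived = []
    for row in rows:
        value = row.get(source)
        log_value = math.log(value) if value is not None and value > 0 else None
        derived.append({**row, target: log_value})
    documentation = {
        "variable": target,
        "derived_from": source,
        "transformation": "natural logarithm; missing or non-positive values become None",
    }
    return derived, documentation


# Made-up example values.
rows = [{"income": 1.0}, {"income": 0}, {"income": None}, {"income": 250.0}]
derived, documentation = derive_log_income(rows)
print(documentation)
```

Returning the documentation record alongside the data keeps the description of the transformation from drifting away from the variable it describes.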
Provenance tracking
A second feature of the Data Hub is traceable provenance. Currently, bigger datasets such as Clio-Infra consist of a core part derived from a bigger statistical agency, combined with many smaller datasets as well as ‘corrections’ of the data by the researcher. After an iteration it is hard to track who contributed what, or which number was changed by whom and for what reason. We therefore introduce provenance tracking.
version 1 + activity = version 2
The basic formula for provenance we use is that one version leads to the next as the result of an activity.
activity
who
when
what
how
For proper provenance it is crucial to describe this activity, at least in terms of what the activity entailed, how it was performed, by whom, and in which time period.
surname occupation
Fumes cigarmaker
Bridges civil engineer
Moves dancer
Bones undertaker
New PID! PID: ab.123 PID: bc.789
- added occupation Bones
- from Gravediggers Vol II
- 2015-12-09A09:30:17
- dai:richard.zijdeman
surname occupation
Fumes cigarmaker
Bridges civil engineer
Moves dancer
Bones
In this example, the occupation for ‘Bones’ is added, which leads to a new version of the data and hence a new PID. Moreover, the action of adding the value for occupation is itself recorded as provenance.
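The ‘Bones’ example can be sketched in code to show how one version plus a described activity yields the next version. This is an illustration only: the classes `Activity` and `Version`, the helper names, and the way a PID is derived from a content hash are assumptions made here, not the hub's actual PID scheme. The activity fields reuse the who, when, what and how from the slide.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    who: str
    when: str
    what: str
    how: str


@dataclass(frozen=True)
class Version:
    pid: str
    rows: tuple  # tuple of (surname, occupation) pairs
    provenance: tuple = ()  # activities that led to this version


def make_pid(rows):
    """Hypothetical PID: a short hash of the content, so changed data gets a new PID."""
    digest = hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()
    return "pid:" + digest[:8]


def apply_activity(version, activity, surname, occupation):
    """version + activity = next version; the earlier version is left untouched."""
    new_rows = tuple(
        (s, occupation if s == surname else o) for s, o in version.rows
    )
    return Version(make_pid(new_rows), new_rows, version.provenance + (activity,))


rows_v1 = (
    ("Fumes", "cigarmaker"),
    ("Bridges", "civil engineer"),
    ("Moves", "dancer"),
    ("Bones", None),
)
v1 = Version(make_pid(rows_v1), rows_v1)

activity = Activity(
    who="dai:richard.zijdeman",
    when="2015-12-09A09:30:17",  # timestamp copied as given on the slide
    what="added occupation Bones",
    how="from Gravediggers Vol II",
)
v2 = apply_activity(v1, activity, "Bones", "undertaker")
print(v1.pid, "->", v2.pid)
```

Keeping versions immutable means every earlier state stays retrievable under its own PID, and the chain of activities answers who changed what, when, and how.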
Quality flags
An important aspect to consider when combining data is that datasets will come in various forms of quality.
Allow for quality flags of content
e.g. created by scientists
e.g. peer reviewed (by scientist)
created by public and peer reviewed
We will design a system in which datasets are accompanied by a ‘quality flag’, an indicator of the trustworthiness of the dataset. This might involve simple reputation effects, but could also provide more advanced features, such as whether other data confirm the values in the dataset. Work together with sestet on this
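One possible shape for such flags, purely as a sketch: the flag names below mirror the examples on the slide, while the class `QualityFlag`, the function `confirmation_share`, and the idea of measuring agreement as a simple share of matching values are assumptions made for illustration. No ranking between the flags is implied.

```python
from enum import Enum


class QualityFlag(Enum):
    """Flags taken from the examples on the slide; no ordering is implied."""
    CREATED_BY_SCIENTISTS = "created by scientists"
    PEER_REVIEWED = "peer reviewed (by scientist)"
    PUBLIC_PEER_REVIEWED = "created by public and peer reviewed"


def confirmation_share(dataset, other):
    """Share of keys present in both datasets whose values agree.

    Returns None when the datasets have no keys in common, because
    there is then nothing to confirm or contradict.
    """
    shared = dataset.keys() & other.keys()
    if not shared:
        return None
    agreeing = sum(1 for key in shared if dataset[key] == other[key])
    return agreeing / len(shared)


# Made-up example: two sources that agree on one of two shared persons.
source_a = {"Bones": "undertaker", "Moves": "dancer"}
source_b = {"Bones": "undertaker", "Moves": "singer", "Fumes": "cigarmaker"}
share = confirmation_share(source_a, source_b)
print(QualityFlag.PEER_REVIEWED.value, share)
```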
Basic visualisation
Focus on visual exploration of data and results
‘Ask’ a question and get visual output:
e.g. bar, line graph etc.
get output on map or even as ‘movie’
A final feature that we want to highlight here is to ask questions and receive a ‘visual’ answer. Data visualisations are increasingly present in all sorts of media and our hub will allow for such visualisations to answer basic questions on historical patterns.
To society and back
From Science to Society and back
Provide data to public: ‘enthusiasts’, journalists
Have enthusiasts add data to the hub (creating linked data): e.g. stucadoors dataset, harbour datasets, railway datasets, etc.
And back: link scientific data to crowd projects like DBpedia: enhance occupations with descriptions
The last point we want to make about the Structured Data Hub is that it is not just for academics: we provide our tools to a broader audience too. This means that we assume a fairly low level of historical knowledge and technical skill. However, we also believe that ‘the public’ is creating quite interesting datasets, from which we may borrow and to which we may give back by enriching them with scientific knowledge.