The Structured Data Hub in 2019

The Structured Data Hub

Today’s fiction, 2019’s reality

Status quo

Many datasets currently live in isolation. They are stored on people’s computers and are not findable. Moreover, little effort is made to link such datasets. When data are linked, the datasets first need to be cleaned and harmonised, which is very time-intensive. More importantly, such linkage efforts are seldom shared, which effectively produces ‘disposable research’.

What we envisage

We envisage selecting core micro, meso and macro datasets from the field of economic and social history and creating a structured data hub from them.

[Diagram: Your data, Tooling, Structured Data Hub, WWW]

Next, we want to allow you to connect your own data and to build such connections yourself, while we ensure that your data is findable and linkable to other datasets on the (semantic) World Wide Web.

The Structured Data Hub: a place to

store data

augment data

link data

find data

ask questions! (for data analysis and visualization)

So, the Structured Data Hub is a place to … Now let’s go into more detail on some of these aspects.

Data augmentation

A first feature of the Structured Data Hub is augmentation. By augmentation we mean the process of enhancing your data with core variables from the social, demographic and economic sciences.

For example, think of a dataset containing individual characteristics, including occupation and HISCO code. If we wanted to know whether these persons held high- or low-status occupations, we would need to add a stratification measure.

Here, we add the universal HISCAM scale, but any other HISCO based stratification scale or class measure can be added.
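As a minimal sketch of this kind of augmentation, the lookup below adds a stratification score to each record by its HISCO code. The function name, field names, codes and scores are all hypothetical placeholders for illustration; they are not real HISCO codes or HISCAM values, and this is not the hub's actual implementation.

```python
# Hypothetical lookup table: occupational code -> stratification score.
# The codes and scores are made-up placeholders, NOT real HISCO/HISCAM values.
PLACEHOLDER_SCALE = {
    "00001": 45.0,
    "00002": 72.5,
}


def augment_with_scale(records, scale, code_field="hisco", new_field="hiscam"):
    """Return copies of the records with a stratification score added.

    Records whose code is not in the scale get None, so gaps stay visible.
    """
    return [{**rec, new_field: scale.get(rec.get(code_field))} for rec in records]


people = [
    {"surname": "Fumes", "occupation": "cigarmaker", "hisco": "00001"},
    {"surname": "Bridges", "occupation": "civil engineer", "hisco": "00002"},
    {"surname": "Moves", "occupation": "dancer", "hisco": "99999"},  # not in the scale
]

augmented = augment_with_scale(people, PLACEHOLDER_SCALE)
for row in augmented:
    print(row["surname"], row["hiscam"])
```

Because the scale is passed in as an argument, any other HISCO-based scale or class measure could be swapped in without changing the function.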

We might also be interested in the area where people are working, here indicated by the place variable. If we wanted to map such values, or calculate distances between these places, we would need information on the latitude and longitude.
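A small sketch of what this could look like: places are looked up in a gazetteer to obtain coordinates, after which the great-circle distance follows from the haversine formula. The gazetteer and its rounded coordinates are illustrative assumptions, as are all function names.

```python
import math

# Hypothetical mini-gazetteer with approximate (latitude, longitude) in degrees.
GAZETTEER = {
    "Amsterdam": (52.37, 4.90),
    "Rotterdam": (51.92, 4.48),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on a sphere."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def add_coordinates(records, gazetteer, place_field="place"):
    """Return copies of the records with 'lat' and 'lon' added (None if unknown)."""
    augmented = []
    for rec in records:
        lat, lon = gazetteer.get(rec.get(place_field), (None, None))
        augmented.append({**rec, "lat": lat, "lon": lon})
    return augmented


workers = add_coordinates(
    [{"surname": "Fumes", "place": "Amsterdam"},
     {"surname": "Bridges", "place": "Rotterdam"}],
    GAZETTEER,
)
distance = haversine_km(workers[0]["lat"], workers[0]["lon"],
                        workers[1]["lat"], workers[1]["lon"])
print(f"{distance:.1f} km")
```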

Another type of data augmentation concerns the application of basic calculations to derive new variables. Income, for example, is seldom analysed in its raw form and is often rescaled using a log transformation.

The Structured Data Hub facilitates the creation and documentation of such newly derived variables.
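To illustrate, the sketch below derives a log-transformed income variable and returns a small documentation record alongside it. The function name and the fields of the documentation record are assumptions made for this example, not a description of the hub's own format.

```python
import math


def derive_log_variable(records, source, target=None):
    """Add the natural log of `source` to each record and document the step.

    The log is undefined for zero, negative or missing values; those rows
    get None rather than a guessed value.
    """
    target = target or f"log_{source}"
    derived = []
    for rec in records:
        value = rec.get(source)
        new_value = math.log(value) if value is not None and value > 0 else None
        derived.append({**rec, target: new_value})
    documentation = {
        "variable": target,
        "derived_from": source,
        "transformation": "natural logarithm",
        "note": "None where the source value is missing or not positive",
    }
    return derived, documentation


rows = [{"income": 100.0}, {"income": 0}, {"income": None}]
with_log, doc = derive_log_variable(rows, "income")
print(with_log[0]["log_income"], doc["transformation"])
```

Returning the documentation together with the data keeps the description of a derived variable from drifting apart from the variable itself.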

Provenance tracking

A second feature of the Data Hub is traceable provenance. Currently, bigger datasets such as Clio-Infra consist of a core part derived from a large statistical agency, combined with many smaller datasets as well as ‘corrections’ of the data by the researcher. After an iteration it is hard to track who contributed what, or which number was changed by whom and for what reason. We therefore introduce provenance tracking.

version 1 + activity = version 2

The basic formula for provenance we use is that one version leads to the next as the result of an activity.

activity

who

when

what

how

For proper provenance it is crucial to describe this activity, at least in terms of what the activity entailed, how it was performed, by whom and in which time period.
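One way to picture such a description is as a small record with exactly these four fields. The class below is a hypothetical sketch, not the hub's data model. The example values are borrowed from the ‘Bones’ example in this deck; treating the source volume as the ‘how’ is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Activity:
    """Hypothetical provenance record: who did what, when and how."""
    who: str        # identifier of the contributor
    when: datetime  # moment (or start) of the activity
    what: str       # what the activity entailed
    how: str        # how it was performed, e.g. the source consulted

    def is_complete(self):
        """True if none of the descriptive text fields is left blank."""
        return all(text.strip() for text in (self.who, self.what, self.how))


activity = Activity(
    who="dai:richard.zijdeman",
    when=datetime(2015, 12, 9, 9, 30, 17),
    what="added occupation for Bones",
    how="taken from Gravediggers Vol II",  # assumption: the source counts as 'how'
)
print(activity.is_complete(), activity.when.isoformat())
```

Making the record immutable (`frozen=True`) reflects the idea that a provenance statement, once recorded, should not be silently edited.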

Version 1

surname    occupation
Fumes      cigarmaker
Bridges    civil engineer
Moves      dancer
Bones

Activity: added occupation Bones - from Gravediggers Vol II - 2015-12-09T09:30:17 - dai:richard.zijdeman

Version 2

surname    occupation
Fumes      cigarmaker
Bridges    civil engineer
Moves      dancer
Bones      undertaker

New PID! PID: ab.123, PID: bc.789

In this example, the occupation for ‘Bones’ is added, which leads to a new version of the data and hence a new PID. Moreover, the action of adding the value for occupation is itself recorded as provenance.
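The same step can be sketched in code: applying an activity to one version yields a new version with its own PID, while the old version stays untouched. Deriving the PID from a hash of the content is purely an assumption for this illustration (real PIDs are typically issued by a registration service), and all names are hypothetical.

```python
import hashlib
import json


def make_pid(rows):
    """Hypothetical PID: a short hash of the dataset content."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return "pid." + hashlib.sha256(payload).hexdigest()[:8]


def add_value(version, surname, field, value, activity):
    """Create a new version in which `field` is set for the given surname."""
    new_rows = [
        {**row, field: value} if row["surname"] == surname else dict(row)
        for row in version["rows"]
    ]
    return {
        "pid": make_pid(new_rows),       # changed content, hence a new PID
        "rows": new_rows,
        "derived_from": version["pid"],  # link back to the previous version
        "activity": activity,            # the provenance of the change
    }


rows_v1 = [
    {"surname": "Fumes", "occupation": "cigarmaker"},
    {"surname": "Bridges", "occupation": "civil engineer"},
    {"surname": "Moves", "occupation": "dancer"},
    {"surname": "Bones", "occupation": None},
]
v1 = {"pid": make_pid(rows_v1), "rows": rows_v1, "derived_from": None, "activity": None}
v2 = add_value(v1, "Bones", "occupation", "undertaker",
               activity="added occupation Bones, from Gravediggers Vol II")
print(v1["pid"], "->", v2["pid"])
```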

Quality flags

An important aspect to consider when combining data is that datasets will come in various forms of quality.


Allow for quality flags of content

e.g. created by scientists

e.g. peer reviewed (by scientists)

e.g. created by the public and peer reviewed

We will design a system in which datasets will be accompanied by a ‘quality flag’, an indicator of the trustworthiness of the dataset. This might involve simple reputation effects, but could also provide more advanced features, such as whether other data confirms the values in this dataset. Work together with sestet on this.
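As a rough sketch of both ideas, the code below models the flags as an enumeration and computes how often another dataset confirms the values of this one. The flag names follow the examples given for quality flags in this deck; the confirmation measure and all identifiers are assumptions, since the actual system has yet to be designed.

```python
from enum import Enum


class QualityFlag(Enum):
    """Hypothetical flags, named after the examples on the slide."""
    CREATED_BY_SCIENTISTS = "created by scientists"
    PEER_REVIEWED = "peer reviewed (by scientists)"
    PUBLIC_PEER_REVIEWED = "created by the public and peer reviewed"


def confirmation_rate(dataset, reference, key, field):
    """Share of comparable rows whose `field` value matches the reference.

    Only rows whose `key` also occurs in the reference are compared.
    Returns None when there is nothing to compare.
    """
    reference_values = {row[key]: row[field] for row in reference}
    comparable = [row for row in dataset if row[key] in reference_values]
    if not comparable:
        return None
    confirmed = sum(1 for row in comparable
                    if row[field] == reference_values[row[key]])
    return confirmed / len(comparable)


ours = [
    {"surname": "Fumes", "occupation": "cigarmaker"},
    {"surname": "Bones", "occupation": "undertaker"},
    {"surname": "Moves", "occupation": "dancer"},
]
theirs = [
    {"surname": "Fumes", "occupation": "cigarmaker"},
    {"surname": "Bones", "occupation": "gravedigger"},
]
flag = QualityFlag.PEER_REVIEWED
rate = confirmation_rate(ours, theirs, key="surname", field="occupation")
print(flag.value, rate)
```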

Basic visualisation

Focus on visual exploration of data and results

‘Ask’ a question and get visual output:

e.g. bar, line graph etc.

get output on a map or even as a ‘movie’

A final feature that we want to highlight here is the ability to ask questions and receive a ‘visual’ answer. Data visualisations are increasingly present in all sorts of media, and our hub will allow such visualisations to answer basic questions on historical patterns.

To society and back

From Science to Society and back

Provide data to the public: ‘enthusiasts’, journalists

Have enthusiasts add data to the hub (creating linked data): e.g. stucadoors dataset, harbour datasets, railway datasets, etc.

And back: link scientific data to crowd projects like DBpedia: enhance occupations with descriptions

The last point we want to make about the Structured Data Hub is that it is not just for academics: we provide our tools to a broader audience too. This means that we assume a fairly low level of historical knowledge and technical skill. However, we also believe that ‘the public’ is creating quite interesting datasets, from which we may borrow and to which we may give back by enriching them with scientific knowledge.
