The Structured Data Hub in 2019


Transcript of The Structured Data Hub in 2019

Page 1: The Structured Data Hub in 2019

The Structured Data Hub

Today’s fiction, 2019’s reality

Page 2: The Structured Data Hub in 2019

Status quo

Many datasets currently live in isolation. They are stored on people’s computers and are not findable. Moreover, little effort goes into linking such datasets. When data is linked, the datasets first need to be cleaned and harmonised, which is very time intensive. More importantly, such linkage efforts are seldom shared, literally producing ‘disposable research’.

Page 3: The Structured Data Hub in 2019

What we envisage

We envisage selecting core micro, meso and macro datasets from the field of economic and social history and creating a structured data hub from those.

Page 4: The Structured Data Hub in 2019

What we envisage

Structured Data Hub

Your data

Tooling

WWW

Next, the hub will allow you to connect your data, with tooling to build such connections yourself, while we ensure your data is findable and linkable to other datasets on the (semantic) world wide web.

Page 5: The Structured Data Hub in 2019

The Structured Data Hub: a place to

store data

augment data

link data

find data

ask questions! (for data analysis and visualization)

So, the Structured Data Hub is a place to … Now let’s go into more detail on some of these aspects.

Page 6: The Structured Data Hub in 2019

Data augmentation

A first feature of the Structured Data Hub is augmentation. By augmentation we mean the process of enhancing your data with core variables from the social, demographic and economic sciences.

Page 7: The Structured Data Hub in 2019

For example, think of this dataset containing individual characteristics, including occupation and HISCO code. If we wanted to know whether these persons were incumbents of high or low occupations, we would need to add a stratification measure.

Page 8: The Structured Data Hub in 2019

Here, we add the universal HISCAM scale, but any other HISCO-based stratification scale or class measure can be added.
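Adding a stratification measure of this kind amounts to a lookup on the HISCO code. The sketch below is only an illustration under stated assumptions: the function name `augment`, the codes `code_a` to `code_c` and the scores are all made-up placeholders, not real HISCO codes or HISCAM values, and this is not the hub's actual implementation.

```python
# Hypothetical records: the "hisco" values are placeholders, not real HISCO codes.
records = [
    {"surname": "Fumes", "occupation": "cigarmaker", "hisco": "code_a"},
    {"surname": "Bridges", "occupation": "civil engineer", "hisco": "code_b"},
    {"surname": "Moves", "occupation": "dancer", "hisco": "code_c"},
]

# Hypothetical lookup table from code to scale score (made-up numbers).
scale = {"code_a": 50.0, "code_b": 80.0}


def augment(records, lookup, key, new_var):
    """Return copies of the records with new_var looked up via key.

    Records whose key has no match in the lookup get None, so missing
    scale values stay visible instead of being silently dropped.
    """
    return [{**record, new_var: lookup.get(record[key])} for record in records]


augmented = augment(records, scale, key="hisco", new_var="hiscam")
for row in augmented:
    print(row["surname"], row["hiscam"])
```

Because the lookup table is passed in as an argument, the same function could attach any other HISCO-based scale or class measure.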

Page 9: The Structured Data Hub in 2019

We might also be interested in the area where people are working, here indicated by the place variable. If we wanted to map such values, or calculate distances between these places, we would need information on the latitude and longitude.
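Once latitude and longitude are available, distances between places can be derived. A minimal sketch, assuming a spherical Earth and the haversine formula; the place names and the approximate coordinates are illustrative assumptions and do not come from the dataset on the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative places with approximate coordinates (latitude, longitude).
places = {
    "Amsterdam": (52.37, 4.90),
    "Utrecht": (52.09, 5.12),
}

distance = haversine_km(*places["Amsterdam"], *places["Utrecht"])
print(f"{distance:.1f} km")
```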

Page 10: The Structured Data Hub in 2019

Another type of data augmentation concerns the application of basic calculations to derive new variables. Income, for example, is seldom analysed in its raw form and is often rescaled using a log transformation.

Page 11: The Structured Data Hub in 2019

The Structured Data Hub facilitates the creation and documentation of such newly derived variables.
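As an illustration, a derived variable such as log income can be created together with a short description of how it was derived. This is a sketch under assumptions: the function name `add_log_variable`, the documentation fields and the income values are hypothetical and do not describe the hub's actual interface.

```python
import math


def add_log_variable(records, source, target=None):
    """Return copies of records with the natural log of `source` added.

    Missing or non-positive values have no logarithm and become None.
    Also returns a small description documenting the derivation.
    """
    target = target or f"log_{source}"
    derived = []
    for record in records:
        value = record.get(source)
        log_value = math.log(value) if value is not None and value > 0 else None
        derived.append({**record, target: log_value})
    documentation = {
        "variable": target,
        "derived_from": source,
        "method": "natural logarithm; non-positive or missing values set to None",
    }
    return derived, documentation


# Made-up income values, for illustration only.
incomes = [{"id": 1, "income": 100.0}, {"id": 2, "income": 0.0}, {"id": 3}]
with_log, doc = add_log_variable(incomes, "income")
print(doc["variable"], [row["log_income"] for row in with_log])
```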

Page 12: The Structured Data Hub in 2019

Provenance tracking

A second feature of the Data Hub is traceable provenance. Currently, bigger datasets such as Clio-Infra consist of a core part derived from a bigger statistical agency, combined with many smaller datasets as well as ‘corrections’ of the data by the researcher. After an iteration it is hard to track who contributed what, or which number was changed by whom and for what reason. We therefore introduce provenance tracking.

Page 13: The Structured Data Hub in 2019

version 1 + activity = version 2

The basic formula for provenance we use is that one version leads to the next as the result of an activity.

Page 14: The Structured Data Hub in 2019

activity

who

when

what

how

For proper provenance it is crucial to describe this activity, at least in terms of what the activity entailed, how it was performed, by whom and in which time period.

Page 15: The Structured Data Hub in 2019

Version 1 (PID: ab.123)

surname   occupation
Fumes     cigarmaker
Bridges   civil engineer
Moves     dancer
Bones

Activity

- added occupation Bones
- from Gravediggers Vol II
- 2015-12-09A09:30:17
- dai:richard.zijdeman

Version 2 (PID: bc.789, new PID!)

surname   occupation
Fumes     cigarmaker
Bridges   civil engineer
Moves     dancer
Bones     undertaker

In this example, the occupation for ‘Bones’ is added, which leads to a new version of the data and hence a new PID. Moreover, the action of adding the value for occupation is recorded as provenance.
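The ‘Bones’ example can be sketched in code as one version plus an activity yielding the next version. This is a minimal sketch, not the hub's implementation: the class names `Activity` and `Version`, the helpers `make_pid` and `apply_activity`, and the use of a content hash as PID are all assumptions made for illustration (the PIDs on the slide are not hashes).

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    what: str  # what the activity entailed
    how: str   # how it was performed (here: the source used)
    who: str   # who performed it
    when: str  # when it was performed


@dataclass(frozen=True)
class Version:
    pid: str
    rows: tuple
    provenance: tuple = ()


def make_pid(rows):
    """Hypothetical PID scheme: a short hash of the data content."""
    payload = json.dumps(list(rows), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def apply_activity(version, activity, change):
    """version + activity = next version, with its own PID and provenance."""
    new_rows = [dict(row) for row in version.rows]  # copy; the old version stays intact
    change(new_rows)
    return Version(make_pid(new_rows), tuple(new_rows),
                   version.provenance + (activity,))


rows = [
    {"surname": "Fumes", "occupation": "cigarmaker"},
    {"surname": "Bridges", "occupation": "civil engineer"},
    {"surname": "Moves", "occupation": "dancer"},
    {"surname": "Bones", "occupation": None},
]
v1 = Version(make_pid(rows), tuple(rows))


def add_bones_occupation(new_rows):
    for row in new_rows:
        if row["surname"] == "Bones":
            row["occupation"] = "undertaker"


activity = Activity(
    what="added occupation Bones",
    how="from Gravediggers Vol II",
    who="dai:richard.zijdeman",
    when="2015-12-09A09:30:17",  # timestamp copied as it appears on the slide
)
v2 = apply_activity(v1, activity, add_bones_occupation)
print(v1.pid, "->", v2.pid, "|", v2.provenance[-1].what)
```

Keeping the earlier version untouched and appending the activity to the provenance chain means every PID keeps pointing at exactly the data it was issued for.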

Page 16: The Structured Data Hub in 2019

Quality flags

An important aspect to consider when combining data is that datasets come in varying levels of quality.

Page 17: The Structured Data Hub in 2019

Quality flags

Allow for quality flags of content

e.g. created by scientists

e.g. peer reviewed (by scientist)

created by public and peer reviewed

We will design a system in which datasets are accompanied by a ‘quality flag’, an indicator of the trustworthiness of the dataset. This might involve simple reputation effects, but could also provide more advanced features, such as whether other data confirm the values in these datasets. Work together with sestet on this.

Page 18: The Structured Data Hub in 2019

Basic visualisation

Focus on visual exploration of data and results

‘Ask’ a question and get visual output:

e.g. bar, line graph etc.

get output on map or even as ‘movie’

A final feature that we want to highlight here is the ability to ask questions and receive a ‘visual’ answer. Data visualisations are increasingly present in all sorts of media, and our hub will allow such visualisations to answer basic questions on historical patterns.

Page 19: The Structured Data Hub in 2019

To society and back

Page 20: The Structured Data Hub in 2019

From Science to Society and back

Provide data to public: ‘enthusiasts’, journalists

Have enthusiasts add data to the hub (creating linked data): e.g. plasterers (stucadoors) dataset, harbour datasets, railway datasets, etc.

And back: link scientific data to crowd projects like DBpedia: enhance occupations with descriptions

The last point we want to make about the Structured Data Hub is that it is not just for academics: we provide our tools for a broader audience too. This means that we assume a lowish level of historical knowledge and technical skills. However, we also believe that ‘the public’ is creating quite interesting datasets, from which we may borrow and to which we may give back by enriching them with scientific knowledge.