Linked Census Data
Rinke Hoekstra
CEDAR Kickoff, 26 January 2012
Thursday, 26 January 2012
Overview
Problem
Procedure (as I understand it)
Step-by-step
Vocabularies, tools
Conclusion
“Can Linked Data make a difference for historical analysis?”
Problem
~519 Excel spreadsheets (more? I heard 1200)
Want to do analysis over time and space, but...
Structure
Excel sheets cannot be readily imported in a database
Contents
Excel sheets are not normalised (age) nor harmonised (occupations/places)
Excel sheets contain errors (both original and data-entry)
Want to preserve all stages of data cleansing/harmonisation
Procedure
Archiving
Correcting/Interpreting
Normalising
Harmonising
Visualising
Verbatim import of sheets to database/triple store
Add missing information (headers)
Add corrected information (data)
Interpret and correct objective information
Link information across sheets
Link information to other datasets (e.g. locations)
Build (generic) visualisations of results
Documenting
... a bit about Linked Data
“Just another Data Model”
RDF ≠ Ontology (OWL)
RDF ≠ Taxonomy (RDFS/SKOS)
Globally Unique Identifiers (URI) for all entities
Dereferenceable on the Web (URI = URL)
HTTP-accessible databases (triple stores, SPARQL)
Triples all the way <subject, predicate, object>
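To make the triple model concrete, here is a minimal sketch in plain Python. It is not a real triple store or SPARQL engine; the namespace `http://example.org/census/` and the property names are invented for illustration.

```python
# Each statement is a (subject, predicate, object) tuple.
EX = "http://example.org/census/"  # hypothetical namespace, not from the slides

triples = {
    (EX + "Assendelft", EX + "type", EX + "Municipality"),
    (EX + "obs1", EX + "municipality", EX + "Assendelft"),
    (EX + "obs1", EX + "populationSize", 11),
}

def match(store, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in store
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]

# Everything known about obs1:
about_obs1 = match(triples, s=EX + "obs1")
```

A SPARQL query does essentially this kind of pattern matching, over HTTP and at a much larger scale.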
Spreadsheet ≠ Database
Primary Keys are entities
Column names are attributes
Cell values are attribute values
Foreign keys are relations to other entities
Spreadsheet ≠ Database
No Primary Keys!
Anything can be an entity
Column headers are “types”
Row headers are “types”
Hierarchies!
Cell values are entity “values”
No relations to other entities
Anatomy of a Spreadsheet
[Diagram 1: a Workbook contains Sheets; each Sheet contains a grid of Cells.]
[Diagram 2: the same structure with identifiers. Workbook1.xls contains Sheet1 (cells Sheet1:A1, Sheet1:B1, Sheet1:C1, Sheet1:A2, Sheet1:B2, Sheet1:C2, ...) and Sheet2 (cells Sheet2:A1, Sheet2:B1, Sheet2:C1, Sheet2:A2, Sheet2:B2, Sheet2:C2, ...).]
[Diagram 3: the same structure with example cell contents. Sheet1 holds "agriculture", "workers", "industry", 12 and 6; Sheet2 holds "A", "B", "diamond cutters", 34 and 67.]
NB: all URIs scoped to sheet!
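A minimal sketch of what sheet scoping could look like when minting URIs. The base URI `http://example.org/` and the path layout are assumptions for illustration, not the scheme used in the project.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical base URI

def cell_uri(workbook, sheet, cell):
    """Mint a URI for a cell, e.g. .../Workbook1.xls/Sheet1/A1."""
    return BASE + "/".join(quote(part, safe="") for part in (workbook, sheet, cell))

def value_uri(workbook, sheet, value):
    """Mint a URI for a cell value, scoped to the sheet it came from."""
    parts = (workbook, sheet, "value", value)
    return BASE + "/".join(quote(part, safe="") for part in parts)

# The same label in two sheets yields two distinct entities.
workers_1 = value_uri("Workbook1.xls", "Sheet1", "workers")
workers_2 = value_uri("Workbook1.xls", "Sheet2", "workers")
```

Because the two URIs differ, nothing is claimed about "workers" meaning the same thing in both sheets until a link between them is added explicitly.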
Data Cube
How to best represent numeric data, in a flexible way?
SDMX (Eurostat, World Bank, CBS, etc.)
Every data item is an observation
Every observation has a value
Every observation has one or more dimensions
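Example: the observation with value 1 has these dimensions:
positie (position): D
beroep (occupation): pannenbakkers
letter der beroepsklasse (occupation class letter): E
nummer der beroepsklasse (occupation class number): I
geslacht (sex): M
huwelijkse staat (marital status): O
geboortejaar (birth year): 1878
leeftijd (age): 12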
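The observation model can be sketched in a few lines of Python. This is an illustration of the idea only, not the Data Cube vocabulary itself. The pairing of dimension names with values follows a reading of the slide, and the second observation is made up so that the selection has something to exclude.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One data item: a value plus the dimensions that locate it."""
    value: int
    dimensions: dict = field(default_factory=dict)

def select(observations, wanted):
    """Keep observations whose dimensions include every key/value in `wanted`."""
    return [o for o in observations
            if all(o.dimensions.get(k) == v for k, v in wanted.items())]

# Example from the slide (name/value pairing is an assumption).
tile_makers = Observation(1, {
    "positie": "D", "beroep": "pannenbakkers",
    "letter der beroepsklasse": "E", "nummer der beroepsklasse": "I",
    "geslacht": "M", "huwelijkse staat": "O",
    "geboortejaar": 1878, "leeftijd": 12,
})
# Hypothetical second observation, not from the source data.
other = Observation(3, {"beroep": "steenbakkers", "geslacht": "M", "leeftijd": 12})

observations = [tile_makers, other]
total = sum(o.value for o in select(observations, {"geslacht": "M", "leeftijd": 12}))
```

Since every dimension is just another key, an observation can gain dimensions later without changing the shape of the data.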
Anatomy of a Spreadsheet
[Figure: a census sheet with its regions marked as Headers, Properties, Data and RowHeaders.]
http://github.com/Data2Semantics/TabLinker
[Figure: RDF graph produced for data cell Sheet1:L15, flattened here into its recoverable parts.]
Data cell: Sheet1:L15 rdf:type d2s:DataCell ; d2s:isObservation _:x
Observation: _:x d2s:populationSize "1"^^xsd:int, with one d2s:dimension link per dimension value
Dimension values shown: :O, :M, :D, :5, :10, :14--15_1875--1874, Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers
Hierarchy: Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers skos:broader :I/E ; :I/E skos:broader :I
Header cells (rdf:type d2s:Header): Sheet1:L3, Sheet1:L4, Sheet1:L5
Hierarchical row header cells (rdf:type d2s:HierarchicalRowHeader): Sheet1:B8, Sheet1:C14, Sheet1:E15
Row header cells (rdf:type d2s:RowHeader): Sheet1:D15, Sheet1:F15
Metadata cell (rdf:type d2s:Metadata): Sheet1:L6
Each of these cells carries a d2s:isDimension link.
Column labels shown: :Regelnummer, :Nummer_der_beroepsklasse, :Letter__Onderdeel_beroepsklasse_, :Positie_in_het_beroep__aangeduid_met_A__B__C_of_D, :BENAMING_van_de_onderdeelen_der_onderscheidene_beroepsklassen__met_de_daartoe_behoorende_beroepen
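The step from a marked-up sheet to observations can be sketched as follows. This is not TabLinker's implementation. It assumes a simplified layout where the first rows are column headers and the first columns are row headers, and the grid contents are toy data.

```python
def sheet_to_observations(grid, header_rows, header_cols):
    """Turn every data cell into an observation.

    grid        -- list of rows (lists); None marks an empty cell
    header_rows -- number of leading rows that hold column headers
    header_cols -- number of leading columns that hold row headers
    """
    observations = []
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            value = grid[r][c]
            if value is None:
                continue  # empty data cell: no observation
            # Dimensions are the headers above the cell and to its left.
            dims = [grid[h][c] for h in range(header_rows)]
            dims += [grid[r][k] for k in range(header_cols)]
            observations.append({
                "cell": (r, c),
                "value": value,
                "dimensions": [d for d in dims if d is not None],
            })
    return observations

# Toy sheet: one header row (sex), one row-header column (occupation).
grid = [
    [None, "M", "V"],
    ["pannenbakkers", 1, 0],
    ["steenbakkers", 3, 2],
]
observations = sheet_to_observations(grid, header_rows=1, header_cols=1)
```

Real census sheets add hierarchical row headers and metadata cells, which is why the cell types in the graph above are more varied than this sketch.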
What TabLinker can’t do
Annotations: “footnote”-style, on a separate sheet
Interpret functions e.g. automatic sums
Integrate/harmonise across sheets/files
Additional useful functionality:
“checksum” functionality
Export to database tables
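One possible reading of the “checksum” idea is to recompute totals that the sheet states and flag any differences. The sketch below assumes the total row has already been identified; the function name and the sample numbers are hypothetical.

```python
def check_totals(rows, total_row):
    """Compare stated column totals with the sums of the data rows.

    Returns a list of (column index, stated total, computed total)
    for every column where the two disagree.
    """
    mismatches = []
    for col, stated in enumerate(total_row):
        computed = sum(row[col] for row in rows)
        if computed != stated:
            mismatches.append((col, stated, computed))
    return mismatches

# Made-up numbers: the second column's stated total is off by one.
rows = [[1, 2], [11, 3]]
stated_totals = [12, 6]
problems = check_totals(rows, stated_totals)
```

A mismatch does not say which cell is wrong, only that the column deserves a look against the original source.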
Normalising & Correcting
[Figure: an observation before and after normalisation and correction.]
Before: _:x d2s:dimension :14--15_1875--1874 ; d2s:populationSize "1"^^xsd:int
After: _:x d2s:dimension :14--15_1875--1874 ; d2s:populationSize "11"^^xsd:int (shown next to the original "1"^^xsd:int) ; d2s:ageGroup :14-15 ; d2s:birthYears :1875--1874 ; d2s:censusYear "1889"^^xsd:int ; d2s:gemeente :Assendelft
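Normalising the combined header `14--15_1875--1874` into a separate age group and birth-year range could look like the sketch below. The regular expression is an assumption about how such labels are built, based on this one example.

```python
import re

# Assumed label shape: <age>--<age>_<year>--<year>
PATTERN = re.compile(r"^(\d+)--(\d+)_(\d{4})--(\d{4})$")

def normalise_age_header(label):
    """Split a combined age/birth-year label into two separate values."""
    m = PATTERN.match(label)
    if m is None:
        raise ValueError(f"unrecognised header label: {label!r}")
    age_lo, age_hi, year_a, year_b = m.groups()
    return {
        "ageGroup": f"{age_lo}-{age_hi}",
        "birthYears": f"{year_a}--{year_b}",
    }

normalised = normalise_age_header("14--15_1875--1874")
```

Labels that do not fit the pattern raise an error instead of being guessed at, so they can be handled by hand.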
Documenting
http://www.w3.org/TR/prov-o/
[Figure: provenance linking the graphs <http://example.com/workbook1/sheet1> and <http://example.com/workbook1/sheet1/corrected>.]
<http://example.com/workbook1/sheet1/corrected> provo:wasGeneratedBy :curation20120126
:curation20120126 rdf:type provo:Activity ; provo:hadAgent :RinkeHoekstra ; provo:endedAt _:a ; provo:startedAt _:b
_:a time:inXSDDateTime "20120126T09:00:00"
_:b time:inXSDDateTime "20120126T08:30:00"
Observation shown: _:x d2s:dimension :14--15_1875--1874 ; d2s:populationSize "11"^^xsd:int (next to the original "1"^^xsd:int) ; d2s:ageGroup :14-15 ; d2s:birthYears :1875--1874 ; d2s:censusYear "1889"^^xsd:int ; d2s:gemeente :Assendelft
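A provenance record of this shape can be generated mechanically. The sketch below emits plain tuples, reusing the property names as they appear on the slide. The helper name and the blank-node labels `_:start` and `_:end` are made up for illustration, and ISO 8601 timestamps are used here instead of the compact form on the slide.

```python
from datetime import datetime

def curation_record(result_graph, activity, agent, started, ended):
    """Describe which activity, by which agent, produced a corrected graph."""
    start_node, end_node = "_:start", "_:end"  # hypothetical blank-node labels
    return [
        (result_graph, "provo:wasGeneratedBy", activity),
        (activity, "rdf:type", "provo:Activity"),
        (activity, "provo:hadAgent", agent),
        (activity, "provo:startedAt", start_node),
        (start_node, "time:inXSDDateTime", started.isoformat()),
        (activity, "provo:endedAt", end_node),
        (end_node, "time:inXSDDateTime", ended.isoformat()),
    ]

record = curation_record(
    "<http://example.com/workbook1/sheet1/corrected>",
    ":curation20120126",
    ":RinkeHoekstra",
    datetime(2012, 1, 26, 8, 30),
    datetime(2012, 1, 26, 9, 0),
)
```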
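Harmonising
[Figure: SKOS hierarchy of occupations from the sheet, mapped to HISCO codes. Class I has subclasses A, D and E; the occupation labels attach to these subclasses via skos:broader, e.g. Fabricage van dakpannen (pannenbakkers) skos:broader E skos:broader I.]
Occupation labels:
Fabricage van dakpannen (pannenbakkers) [manufacture of roof tiles (tile makers)]
Fabricage van steen (molensteen, steenbakkers, tegelbakkers) [manufacture of stone (millstone, brick makers, tile makers)]
Fabricage van kalk [manufacture of lime]
Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.) [manufacture of earthenware (incl. porcelain, terracotta, stove makers, potters, etc.)]
HISCO codes shown: HISCO:23810, HISCO:23811, HISCO:25281, HISCO:26340, HISCO:26345
Mapping relations used: skos:exactMatch, skos:closeMatch, skos:broadMatch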
Harmonising
[Figure: the occupation hierarchy shown twice: once with every node scoped to its sheet (Sheet1:I, Sheet1:A, Sheet1:D, Sheet1:E, Sheet1:Fabricage van dakpannen (pannenbakkers), and so on) and once with the same labels unscoped. Both copies use skos:broader for the hierarchy.]
Harmonising
[Figure: the occupation hierarchy of the 1889 census next to that of the 1899 census. In both, class I has subclasses A, D and E, linked by skos:broader. Nodes are linked across the two years by skos:exactMatch, skos:narrowMatch and skos:closeMatch.]
1889:
Fabricage van dakpannen (pannenbakkers)
Fabricage van steen (molensteen, steenbakkers, tegelbakkers)
Fabricage van kalk
Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)
1899:
Fabricage van dakpannen (pannenbakkers)
Fabricage van steen (steenbakkers, tegelbakkers)
Fabricage van kalk
Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.)
Is SKOS sufficient?
NB: These are not strings, but globally unique URIs, scoped within their spreadsheet (graph!) of origin.
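The cross-year links can be sketched as explicit mapping triples between scoped concepts. Which SKOS relation belongs to which pair cannot be read from the flattened figure, so the two mappings below are illustrative assumptions, and the `census1889:` style prefixes are stand-ins for real URIs.

```python
def scoped(year, label):
    """Scope a label to its census of origin; a stand-in for a real URI."""
    return f"census{year}:{label}"

# Hypothetical mappings from 1889 concepts to 1899 concepts.
MAPPINGS = [
    (scoped(1889, "Fabricage van kalk"),
     "skos:exactMatch",
     scoped(1899, "Fabricage van kalk")),
    (scoped(1889, "Fabricage van steen (molensteen, steenbakkers, tegelbakkers)"),
     "skos:narrowMatch",
     scoped(1899, "Fabricage van steen (steenbakkers, tegelbakkers)")),
]

def counterparts(concept, mappings, allowed=("skos:exactMatch",)):
    """Find the concepts a given concept maps to, via the allowed relations."""
    return [target for source, relation, target in mappings
            if source == concept and relation in allowed]

kalk_1889 = scoped(1889, "Fabricage van kalk")
kalk_matches = counterparts(kalk_1889, MAPPINGS)
```

Choosing which relations to accept is where the analysis decision sits: restricting to skos:exactMatch is safe but drops categories that changed scope between censuses.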
Vocabularies, Tools
Vocabularies: Data Cube, SKOS, W3C Time, PROV-O
Excel + TabLinker: semi-automatic conversion of Excel sheets to RDF
ProvTracer: create a PROV-O provenance trail for shell/Python scripts
Visualization prototype: Sgvizler (SPARQL + Google Graph API)
Discussion
Advantages of Linked Data approach
Straightforward transformation from spreadsheets
Seamless integration of original, corrected and harmonised data
Ingestion of external (linked) data
Powerful documentation (provenance)
Everything is transparently queryable (SPARQL)
... on the Web
Discussion
Disadvantages of Linked Data approach (subject to research)
Size? (~300k triples per sheet × 519 sheets ≈ 156M triples)
Only rudimentary support for arithmetical operations in queries
No dynamic/conditional ‘view’-like graphs
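The size estimate is simple arithmetic and can be rerun for other sheet counts. The sketch assumes the figure of roughly 300k triples per sheet holds on average; the 1200-sheet case uses the higher count mentioned at the start of the talk.

```python
TRIPLES_PER_SHEET = 300_000  # rough average assumed on the slide

def estimated_triples(sheets, per_sheet=TRIPLES_PER_SHEET):
    """Estimate the total number of triples for a given number of sheets."""
    return sheets * per_sheet

for_519_sheets = estimated_triples(519)    # 155,700,000, i.e. about 156M
for_1200_sheets = estimated_triples(1200)  # 360,000,000
```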
SPARQL vs. SQL?
Middle ground?
Expose database through D2RQ
Fin