Page 1:

Semantic Enrichment of Open Data
Or: How to build an Open Data Knowledge Graph

Sebastian Neumaier
[email protected]
https://sebneumaier.wordpress.com/

Advisor: Dr. Axel Polleres
Reviewers: Dr. Christian Bizer, Dr. Elena Simperl

Rigorosum, TU Wien, November 20, 2019
Slides: tiny.cc/sebneu

Page 2:

Open Data comes in various ways

Page 3:

Available data is only partially structured and not linked

82 data portals, 160K datasets

Tim Berners-Lee’s 5-star deployment scheme for Open Data:

CSV (3-star)

Excel (2-star)

PDF (1-star)

Unknown format (1-star)

RDF? Not significant

Umbrich, J., Neumaier, S., Polleres, A.: Quality assessment & evolution of open data portals. International Conference on Open and Big Data (2015)
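The star levels on this slide can be read as a simple lookup from file format to rating. The sketch below is only an illustration: the table `STARS_BY_FORMAT` and the function `star_level` are hypothetical names, and a format alone only gives an upper bound, since the scheme also requires an open licence.

```python
# Assumed mapping from file format to star level, following the slide.
STARS_BY_FORMAT = {
    "pdf": 1,   # available on the Web, but not machine-readable structure
    "xls": 2,   # structured data (Excel is rated 2-star on the slide)
    "xlsx": 2,
    "csv": 3,   # structured and in a non-proprietary format
    "rdf": 4,   # uses URIs to denote things (5 stars needs links to other data)
}

def star_level(file_format: str) -> int:
    """Return the assumed star level; unknown formats count as 1 star."""
    return STARS_BY_FORMAT.get(file_format.strip().lower(), 1)

formats = ["CSV", "XLS", "PDF", "unknown"]
levels = [star_level(f) for f in formats]
print(dict(zip(formats, levels)))
```

Treating unknown formats as 1-star mirrors the "Unknown format (1-star)" entry above.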

Page 4:

Metadata vs. Data: two dimensions of Open Data

E.g. data.gv.at:

Metadata: attribute-value pairs

Data: CSVs, spreadsheets, PDFs, etc.

Page 5:

10 years of open-data government initiatives

Some milestones:

2009: First governmental data portals (data.gov.uk & data.gov)

2011: Austrian portal (data.gv.at)

2012: EU data portal (data.europa.eu)

2015: EU harvesting portal (europeandataportal.eu)

2018: Google dataset search (toolbox.google.com/datasetsearch)

Page 6:

Open Data as a Global Trend

Data portals of the G8 countries:

Country         URL                Datasets
United States   data.gov           304k
Canada          open.canada.ca     81k
UK              data.gov.uk        46.5k
France          www.data.gouv.fr   36.4k
Japan           data.go.jp         22.4k
Russia          data.gov.ru        21.5k
Germany         govdata.de         21.5k
Italy           dati.gov.it        20.5k

Page 7:

What do we find on Open Data Portals?

Page 10:

Hypothesis

Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.

Page 11:

Approach

Metadata

Monitoring and analysis of Open Data portals
➢ identify quality issues and build a corpus

Evaluate applicability of existing techniques
➢ find improvement strategies and (re-)publish as homogenized/standardized Linked Data

Data

Labeling and classification of numerical values
➢ develop a method to find and rank candidates of semantic context descriptions for numerical values

Extraction of spatial and temporal information
➢ resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods

Page 12:

Monitoring and analysis of Open Data portals
➢ identify metadata quality issues and build a corpus

Evaluate applicability of existing techniques

Labeling and classification of numerical values

Extraction of spatial and temporal information

Page 13:

Openness & Transparency Evaluations in the Literature

● [Reiche et al., 2014]: Small sample set of 10 portals, automated assessment
➢ completeness, accuracy, readability, misspelling, …

● [Zaveri et al., 2015]: Survey on Linked Open Data quality assessment
➢ transparency and openness aspects not covered

● Global Open Data Index: Manual expert evaluation of defined data categories/key datasets

● [Veljković et al., 2014]: A theoretical model for openness in e-government
➢ primary, authenticity, understandability, ...

Page 14:

Quality Metrics

Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Automated quality assessment of metadata across open data portals. ACM Journal of Data and Information Quality (JDIQ), 2016.

● 16 specific metrics along 4 dimensions
● automated and scalable assessment
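To give a feel for what an automated metric can look like, here is a minimal sketch of an existence-style score over attribute-value metadata. It is not one of the 16 metrics from the cited paper; the function `existence_score` and the key names in the example are assumptions made purely for illustration.

```python
def existence_score(metadata: dict, expected_keys: list[str]) -> float:
    """Fraction of expected keys that carry a non-empty value."""
    if not expected_keys:
        return 0.0
    present = sum(
        1 for key in expected_keys
        if metadata.get(key) not in (None, "", [], {})
    )
    return present / len(expected_keys)

# Hypothetical metadata record and key list.
record = {"title": "Population by district", "license": "", "format": None}
score = existence_score(record, ["title", "license", "format", "author"])
print(score)
```

Because a score like this needs nothing but the metadata record, it can be computed for every dataset of every portal, which is what makes such an assessment scalable.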

Page 15:

Open Data Portal Watch

Evolution of Portals and Metrics

https://data.wu.ac.at/portalwatch/

Page 16:

Identified Challenges

● Metadata is heterogeneous and (partially) messy
➢ Software-specific metadata (CKAN vs Socrata vs …)
➢ Portal-specific metadata
➢ Missing metadata (file formats, API descriptions, …)

● Metadata not available as Linked Data
➢ Only partially in common vocabulary

● Poor discoverability of datasets
➢ No content information in metadata (e.g., CSV headers)
➢ Datasets’ metadata not optimized for search engines

Page 17:

Monitoring and analysis of Open Data portals

Use of existing techniques
➢ to improve and (re-)publish the metadata as homogenized/standardized Linked Data

Labeling and classification of numerical values

Extraction of spatial and temporal information

Page 18:

Approach

1 Mapping to standard vocabularies

2 Enrich the datasets

3 Enable access

Page 19:

1 Mapping to standard vocabularies

● Mappings of metadata from CKAN, Socrata and OpenDataSoft portals
➢ Including mappings of most frequent (portal/domain specific) metadata fields

● Mapping and publishing of all descriptions as Schema.org:
➢ Enables the integration into knowledge graphs of major search engines
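As a rough illustration of such a mapping, the sketch below turns a CKAN-style package dictionary into a Schema.org `Dataset` description in JSON-LD. The choice of field pairs is an assumption for this example and not the mapping used in the thesis; `ckan_to_schema_org` and the example.org URLs are hypothetical.

```python
import json

def ckan_to_schema_org(pkg: dict) -> dict:
    """Map a CKAN-style package dict to a Schema.org Dataset (JSON-LD)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": pkg.get("title"),
        "description": pkg.get("notes"),
        "url": pkg.get("url"),
        "license": pkg.get("license_url") or pkg.get("license_title"),
        "keywords": [tag["name"] for tag in pkg.get("tags", [])],
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": res.get("url"),
                "encodingFormat": res.get("format"),
            }
            for res in pkg.get("resources", [])
        ],
    }
    # Drop empty fields so the published description stays compact.
    return {k: v for k, v in doc.items() if v not in (None, "", [])}

# Hypothetical package record.
package = {
    "title": "Population by district",
    "notes": "Yearly population counts.",
    "tags": [{"name": "population"}],
    "resources": [{"url": "https://example.org/pop.csv", "format": "CSV"}],
}
dataset = ckan_to_schema_org(package)
print(json.dumps(dataset, indent=2))
```

The resulting JSON-LD can be embedded in a dataset's HTML page, which is the form search engines read.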

Page 20:

2 Enrich the datasets

[Diagram: Open Data Portal Watch; CSV file; CSV metadata; Mapping to standard vocabularies]

Page 21:

2 Enrich the datasets

● Portal Watch quality dimensions:
➢ Measurements per dataset

[Diagram: Open Data Portal Watch; CSV file; CSV metadata; Quality Measures]

Page 22:

2 Enrich the datasets

● Portal Watch quality dimensions:
➢ Measurements per dataset

● Metadata for tables:
➢ CSV dialect, headers, column types, ...

[Diagram: Open Data Portal Watch; CSV file; CSV metadata]

Page 23:

2 Enrich the datasets

● Portal Watch quality dimensions:
➢ Measurements per dataset

● Metadata for tables:
➢ CSV dialect, headers, column types, ...

● Record provenance:
➢ Versioning, track modifications

[Diagram: Open Data Portal Watch; CSV file; CSV metadata]
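The "metadata for tables" item (dialect, headers, column types) can be approximated with the Python standard library. This is a minimal sketch under simple assumptions, not the profiling code used for Portal Watch: `describe_csv` and `infer_type` are hypothetical helpers, and the header check is a plain heuristic.

```python
import csv
import io

def infer_type(values):
    """Return 'integer', 'float', or 'string' for a sequence of cell strings."""
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"

def describe_csv(text):
    """Collect delimiter, header row, and column types for a CSV given as text."""
    dialect = csv.Sniffer().sniff(text, delimiters=";,\t|")
    rows = list(csv.reader(io.StringIO(text), dialect))
    # Heuristic (assumption): a header row contains no numeric cells.
    has_header = all(infer_type([cell]) == "string" for cell in rows[0])
    header, body = (rows[0], rows[1:]) if has_header else (None, rows)
    return {
        "delimiter": dialect.delimiter,
        "header": header,
        "column_types": [infer_type(column) for column in zip(*body)],
    }

sample = (
    "NUTS2;LAU2_NAME;YEAR;SEX;AGE_TOTAL\n"
    "AT31;Linz;2013;1;98157\n"
    "AT31;Steyr;2013;1;18763\n"
    "AT31;Wels;2013;1;29730\n"
)
info = describe_csv(sample)
print(info)
```

Real portal CSVs are messier (comment lines, multiple header rows, mixed encodings), so a production profiler would need more than this heuristic.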

Page 24:

3 Enable access

● SPARQL endpoint:
➢ Versions (snapshots) stored as named graphs
➢ ~180 million triples each

● Access & query historical data:
➢ Timestamp-based interfaces

● Schema.org via sitemap.xml:
➢ Publishing of all datasets as HTML-embedded Schema.org

Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Lifting data portals to the Web of Data. In WWW ’17 Workshop on Linked Data on the Web (LDOW2017), Perth, Australia, April 2017.
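Since each snapshot is stored as a named graph, a query can be scoped to one version with the SPARQL `GRAPH` keyword. The sketch below only builds the query text. It assumes the descriptions use the DCAT and Dublin Core vocabularies, and the graph URI is a made-up placeholder; the actual naming scheme of the endpoint is not given on the slide.

```python
def snapshot_query(graph_uri: str, limit: int = 10) -> str:
    """Build a SPARQL query listing datasets inside one named graph."""
    return f"""
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title
WHERE {{
  GRAPH <{graph_uri}> {{
    ?dataset a dcat:Dataset ;
             dct:title ?title .
  }}
}}
LIMIT {limit}
""".strip()

# Hypothetical graph URI for one snapshot.
query = snapshot_query("http://example.org/snapshot/2017-01", limit=5)
print(query)
```

Swapping the graph URI is then enough to run the same query against a different point in time.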

Page 25:

Example Table

federal state   district   year   sex    population
Upper Austria   Linz       2013   male   98157
Upper Austria   Steyr      2013   male   18763
Upper Austria   Wels       2013   male   29730
…               …          …      …      …

Page 26:

Open Data CSVs look more like this

Source: https://www.data.gv.at/katalog/dataset/e108dcc3-1304-4076-8619-f2185c37ef81

NUTS2   LAU2_NAME   YEAR   SEX   AGE_TOTAL
AT31    Linz        2013   1     98157
AT31    Steyr       2013   1     18763
AT31    Wels        2013   1     29730
…       …           …      …     …

Page 27:

Monitoring and analysis of Open Data portals

Evaluate applicability of existing techniques

Labeling and classification of numerical values
➢ develop a method to find and rank candidates of semantic context descriptions for numerical values

Extraction of spatial and temporal information

Page 28:

Use the numeric values in the tables

▪ Identifying the most likely semantic label for a bag of numerical values

▪ Deliberately ignore surroundings

NUTS2   LAU2_NAME   YEAR   SEX   AGE_TOTAL
AT31    Linz        2013   1     98157
AT31    Steyr       2013   1     18763
AT31    Wels        2013   1     29730
…       …           …      …     …

Page 29:

Use the numeric values in the tables

▪ Identifying the most likely semantic label for a bag of numerical values

▪ Deliberately ignore surroundings

98157

18763

29730

Page 30:

Use the numeric values in the tables

▪ Identifying the most likely semantic label for a bag of numerical values

▪ Deliberately ignore surroundings

98157

18763

29730

Label: population (a district) (country Austria)

Page 31:

Background Knowledge Graph

▪ Find properties with numerical range

▪ Hierarchical clustering approach

▪ Two hierarchical layers:

▪ Type hierarchy (using OWL classes)

▪ Property-object hierarchy (shared property-object pairs)

Page 32:

Label based on Nearest Neighbors

[Figure: neighbors numbered 1 to 6]
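Nearest-neighbour labelling can be sketched in a few lines: summarise the unknown bag of values, rank the labelled bags from the background knowledge by distance, and let the k closest vote. The sketch assumes a Euclidean distance over (min, max, mean, stddev) and uses invented toy numbers; `features` and `knn_label` are hypothetical names, and the hierarchical background graph is flattened here to a plain list.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev

def features(values):
    """Summarise a bag of numbers as (min, max, mean, stddev)."""
    return (min(values), max(values), mean(values), pstdev(values))

def knn_label(values, background, k=3):
    """background: list of (label, values) pairs taken from a knowledge graph."""
    target = features(values)
    ranked = sorted(background, key=lambda item: dist(target, features(item[1])))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy background knowledge, invented for illustration only.
background = [
    ("populationTotal", [120000, 15000, 40000]),
    ("populationTotal", [90000, 20000, 30000]),
    ("populationTotal", [110000, 25000, 45000]),
    ("elevation", [266, 310, 317]),
    ("elevation", [150, 420, 800]),
]
label = knn_label([98157, 18763, 29730], background, k=3)
print(label)
```

The distance function is the pluggable part: any measure that compares two bags of numbers can replace `dist` over the summary features.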

Page 33:

Labelling Results

populationTotal (a Settlement)

populationDensity (a City)

Page 34:

Lessons Learned

● We can assign fine-grained semantic labels
➢ If there is enough evidence in the background knowledge (BK)

● However: Missing domain knowledge for labelling Open Data (OD)

Conclusions:

● Complementary to existing approaches (column header labeling, entity linking and relation extraction)
➢ Combined approaches may improve results

● Focusing on core dimensions of specific domains, e.g. city data, may be more promising than “general” value labelling

Sebastian Neumaier, Jürgen Umbrich, Josiane Xavier Parreira, and Axel Polleres. Multi-level semantic labelling of numerical values. In Proceedings of the 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, October 2016. Nominated for best student paper award.

Page 35:

What else can we do/use?

NUTS2   LAU2_NAME   YEAR   SEX   AGE_TOTAL
AT31    Linz        2013   1     98157
AT31    Steyr       2013   1     18763
AT31    Wels        2013   1     29730
…       …           …      …     …

Focus on specific dimensions:

▪ Particularly temporal and geospatial queries require better support [2]

[2] Emilia Kacprzak, et al.: Characterising dataset search: An analysis of search logs and data requests. Journal of Web Semantics (2019)

Page 36:

Monitoring and analysis of Open Data portals

Evaluate applicability of existing techniques

Labeling and classification of numerical values

Extraction of spatial and temporal information
➢ resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods

Page 37:

Available Geospatial Knowledge Bases

Page 38:

Geo-Knowledge Graph Construction

[Diagram: Wikidata, GeoNames; European Classification of Territorial Units; Wikidata links; Mapping OSM entities to GeoNames regions; Extracting OSM streets and places]

Page 39:

Available Temporal Knowledge

Page 40:

Temporal Knowledge Graph Construction

● Named events and their labels

● Links to parent periods

● Links to the spatial coverage

● Temporal extent: a single start and end date

Page 41:

Spatio-temporal labelling

Table cell value disambiguation

▪ Row context:
  ▪ Filter candidates by potential parents (if available)

▪ Column context:
  ▪ Least common ancestor of the spatial entities

Metadata descriptions

▪ Restrict annotation to origin country

▪ Temporal tagging using the HeidelTime framework [3]

[3] Strötgen, Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 2013.
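The two table-level heuristics can be sketched over a child-to-parent hierarchy: the row context keeps only candidates whose ancestors also appear in the row, and the column context is the deepest region shared by all cells. The tiny hierarchy below is a hand-made stand-in for the geo-knowledge graph, and all function names are hypothetical.

```python
# Toy hierarchy, child -> parent (labels instead of real knowledge-graph IDs).
PARENT = {
    "Linz (AT)": "Upper Austria",
    "Linz (DE)": "Rhineland-Palatinate",
    "Steyr": "Upper Austria",
    "Wels": "Upper Austria",
    "Upper Austria": "Austria",
    "Rhineland-Palatinate": "Germany",
    "Austria": "Europe",
    "Germany": "Europe",
}

def ancestors(entity):
    """Return the chain from the entity up to the root, entity first."""
    chain = [entity]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def filter_by_row_context(candidates, row_entities):
    """Keep candidates that have an ancestor among the other row entities."""
    kept = [c for c in candidates if set(ancestors(c)[1:]) & set(row_entities)]
    return kept or candidates  # no evidence in the row: keep all candidates

def least_common_ancestor(entities):
    """Deepest node shared by the ancestor chains of all entities."""
    chains = [ancestors(e) for e in entities]
    shared = set(chains[0]).intersection(*map(set, chains[1:]))
    return next((node for node in chains[0] if node in shared), None)

row_pick = filter_by_row_context(["Linz (AT)", "Linz (DE)"], ["Upper Austria"])
column_region = least_common_ancestor(["Linz (AT)", "Steyr", "Wels"])
print(row_pick, column_region)
```

Falling back to the full candidate list when the row offers no parent matches the "(if available)" qualifier on the slide.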

Page 42:

Evaluations

Sample evaluation on record level:

● 11 portals
● 10 random CSVs per portal
● 10 random rows per dataset
● i.e. 1100 inspected values

Discussion:

● Partially incomplete knowledge
● Incomplete mapping of OSM
● Heuristics for portal-specifics would be required

Sebastian Neumaier and Axel Polleres. Enabling Spatio-Temporal Search in Open Data. Journal of Web Semantics (JWS), 2018.

Page 43:

Demo: Geo-entity Search “Leopoldstadt”

http://data.wu.ac.at/odgraphsearch/

Page 47:

Conclusions & Critical Discussions

Monitoring and analysis of Open Data portals
➢ The Portal Watch focuses on continuous metadata quality and archiving. A scalable, continuous profiling and archiving of the actual data is still missing.

(Re-)publish the improved dataset descriptions as Linked Data
➢ We developed the methods, and publish mapped and enriched metadata. Ideally, however, users would find the LD endpoints and rich descriptions directly at the data portals.

Labeling and classification of numerical values
➢ We can assign fine-grained semantic labels if there is enough evidence in BK. However, there is missing domain knowledge for labelling OD.

Extraction of spatial and temporal information
➢ We annotate CSV tables and metadata at scale, with links to spatial and temporal entities, and a search and query interface. It remains open whether our approach generalises to other entities such as categories, (governmental) organisations, etc.

Page 48:

Impact

● PhD builds on top of my master thesis (winner of the OCG-Förderpreis by the Austrian Computer Society)

● Publications:

Monitoring and analysis of Open Data portals
➢ Automated quality assessment of metadata [OBD 2015] (best paper award), [JDIQ 2016]
➢ Measures for assessing the data freshness/up-to-dateness [OBD 2016]
➢ Comparison of metadata quality [GIQ 2018]

Evaluate applicability of existing techniques
➢ Lifting data portals to the Web of Data [LDOW 2017]

Labeling and classification of numerical values
➢ Labelling of numerical values [ISWC 2016] (nominated for best student paper)

Extraction of spatial and temporal information
➢ Geo-semantic labelling of open data [SEMANTiCS 2018]
➢ Enabling Spatio-Temporal Search in Open Data [JWS 2019]

● Project & Community Work
➢ FFG Projects on Open Data: ADEQUATE, Communidata
➢ W3C working groups: CSV on the Web, Dataset Exchange (DXWG)
➢ Open Source projects: github.com/sebneu

● Integration of dataset assessments & improvements in data.gv.at

● Re-published corpus gets harvested by Google Dataset Search

Page 49:

Backup Slides

Page 50:

Research Question & Hypothesis

Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.

How can we use existing Semantic Web technologies?
➢ Report and analysis of current OD in order to select/filter methods

How to best describe and publish datasets?
➢ Standardized W3C vocabularies and interfaces to publish Linked Data and enable integration

How to best find and assign semantic labels to datasets?
➢ Labeling of numeric data, extracting spatial & temporal information

Page 51:

Open Data Portal Watch

Datasets and resources of the monitored portals

Page 53:

Numerical Labelling

Page 54:

Evaluation Setup

• Data
  • DBpedia 3.9
  • 50 most frequent numerical properties

• Distance functions
  • Euclidean distance (min, max, mean, stddev)
  • distribution similarity (Kolmogorov-Smirnov (KS) distance)
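For reference, the two-sample Kolmogorov-Smirnov distance is the largest vertical gap between the empirical distribution functions of two bags of values. A minimal pure-Python sketch follows; the function name `ks_distance` is chosen for this example, and a library routine could be used instead.

```python
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The maximum is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))
```

Unlike the Euclidean distance over summary statistics, this compares the whole shape of the two distributions and always lies between 0 and 1.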

Page 55:

Evaluation Setup

• Aggregation function
  • majority vote and average distance

• Aggregation levels
  • property
  • exact type

• 30 GB RAM

• 3 different knowledge bases
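The two aggregation functions can disagree on the same set of neighbours, which the following sketch makes concrete. The neighbour list is invented, and `majority_vote` and `average_distance` are hypothetical helper names rather than the evaluated implementation.

```python
from collections import Counter, defaultdict

def majority_vote(neighbours):
    """neighbours: list of (label, distance); pick the most frequent label."""
    return Counter(label for label, _ in neighbours).most_common(1)[0][0]

def average_distance(neighbours):
    """Pick the label whose neighbours have the smallest mean distance."""
    by_label = defaultdict(list)
    for label, distance in neighbours:
        by_label[label].append(distance)
    return min(by_label, key=lambda l: sum(by_label[l]) / len(by_label[l]))

# Invented distances for five nearest neighbours.
neighbours = [
    ("populationTotal", 0.10),
    ("populationTotal", 0.40),
    ("populationTotal", 0.45),
    ("populationDensity", 0.05),
    ("populationDensity", 0.15),
]
print(majority_vote(neighbours), average_distance(neighbours))
```

Here the majority vote picks `populationTotal` (three of five neighbours), while the average distance favours `populationDensity` (fewer but closer neighbours).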

Page 56:

Test/Training Data

• train/test split: 80/20
• 20% of the subjects for each property as test data
• test context graph: similar to the background construction, however, without constraints
• randomly select leaf nodes
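A per-property split like the one described here might look as follows. This is a sketch under assumptions: `split_subjects`, the fixed seed, and the subject identifiers are all made up for the example.

```python
import random

def split_subjects(subjects_by_property, test_fraction=0.2, seed=42):
    """Hold out a fraction of the subjects of each property as test data."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = {}, {}
    for prop, subjects in subjects_by_property.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_fraction)
        test[prop] = shuffled[:cut]
        train[prop] = shuffled[cut:]
    return train, test

# Hypothetical subjects for one numerical property.
data = {"populationTotal": [f"city_{i}" for i in range(10)]}
train, test = split_subjects(data)
print(len(train["populationTotal"]), len(test["populationTotal"]))
```

Splitting per property rather than over the whole corpus keeps every property represented in both the training and the test set.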

Page 57:

Evaluation: Distance Measure

• Best: Kolmogorov-Smirnov (KS) distance

exact = correct property, type and p-o (property-object pair)

prop = correct property

type = correct type

stype = correct super type

Page 58:

Evaluation: Large-Scale (33657 Test Nodes)

• 9% of test nodes are contained 1-1 in the knowledge graph!

• aggregation
  • majority and average vote
  • different neighbours

• majority vote slightly better

• more neighbours also better

Page 59:

Evaluation: Open Data Tables

● labelling numerical columns
● manual inspection of top 100 tables (based on distance)

Findings

• Dealing with timeline data: values for different time points -> not in DBpedia

• Missing domain knowledge: reports about spending, election results, tourism

• Aggregation of column scores: especially for type detection (majority vote over column types)

• Combine with complementary approaches

Page 60:

Related follow-up work on labeling numerical values in OD

● [Nguyen et al., 2019]: EmbNum+: Effective, Efficient, and Robust Semantic Labeling for Numerical Values
➢ neural embedding for learning representations and similarity metric from numerical columns

● [Kacprzak et al., 2018]: Making Sense of Numerical Data - Semantic Labelling of Web Tables

● [Alobaid et al., 2018]: Fuzzy Semantic Labeling of Semi-structured Numerical Datasets

● [Alobaid et al., 2019]: Typology-based Semantic Labeling of Numeric Tabular Data
➢ taking into account different kinds of numeric values

Page 61:

Spatio-temporal Labelling

Page 62:

Open Data is about Locations

Page 63:

Interface

Faceted query interface:

▪ Full-text queries

▪ Geo-entity queries

▪ Timespan & Time pattern

▪ SPARQL endpoint

Back end:

▪ MongoDB for efficient key look-ups

▪ Elasticsearch for indexing and full-text queries

▪ Virtuoso as a triple store