Visual Querying LOD sources with LODeX

DB Group @ UNIMO Visual Querying LOD sources with LODeX Fabio Benedetti, Sonia Bergamaschi, Laura Po Department of Engineering “Enzo Ferrari” University of Modena & Reggio Emilia K-Cap 2015 - The 8th International Conference on Knowledge Capture October 7-10, 2015, Palisades, NY, USA

Transcript of Visual Querying LOD sources with LODeX


Visual Querying LOD sources with LODeX

Fabio Benedetti, Sonia Bergamaschi, Laura Po

Department of Engineering “Enzo Ferrari”

University of Modena & Reggio Emilia

K-Cap 2015 - The 8th International Conference on Knowledge Capture October 7-10, 2015, Palisades, NY, USA


Linked Open Data: The story so far

[Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260]


Domain                    2009 Number   2009 %   2014* Number   2014* %
Cross-domain                       41   13.95%             41     4.04%
Geographic                         31   10.54%             21     2.07%
Government                         49   16.67%            183    18.05%
Life sciences                      41   13.95%             83     8.19%
Media                              25    8.50%             22     2.17%
Publications                       87   29.59%             96     9.47%
Social web                          0    0.00%            520    51.28%
User-generated content             20    6.80%             48     4.73%
Total                             294                    1014

*Only 570 datasets belong to the LOD cloud; the remaining datasets do not contain ingoing/outgoing links to the LOD Cloud.

[Pie charts: share of datasets per domain, 2009 and 2014]


Linked Open Data drawbacks

Open Access trends encourage the publication of Open Data in the form of Linked Data.

But

Discovering and consuming LOD sources is a complex task for both skilled and unskilled users.


Why is it a complex task?

• There is no standard for documenting a dataset.
• Many datasets are published without any real documentation that could help reveal their structure.

To understand whether a dataset really contains interesting information, a user has to explore it manually using SPARQL queries.

Unskilled user

A user with no SPARQL knowledge cannot become a consumer of Linked Data.

Skilled user

Exploring a dataset can be time-consuming without any knowledge of its structure.


Our solution - LODeX

A tool that promotes the understanding, navigation and querying of LOD sources

Requirements

• Portable to the LOD Cloud
• Provide a synthetic representation of the structure of the dataset
• Provide visual query building functionalities that hide the complexity of Semantic Web technologies


LODeX Architecture

Two main modules

• Extraction & Summarization
– Index Extraction (IE)
– Post Processing (PP)
• Visualization & Querying
– Schema Summary Visualization
– Query Orchestrator

[Architecture diagram. Labels: Endpoint URLs, LODeX Indexes Extraction, SPARQL Queries, LOD Cloud, Statistical Indexes, LODeX Post-processing, Schema Summary, NoSQL, Schema Summary Visualization, Query Orchestrator, Basic Query, Results]


Extraction & Summarization

Index Extraction [1]

The IE process generates the SPARQL queries used to extract the different indexes.

• Pattern Strategy technique
– It produces a larger number of less complex SPARQL queries

Post Processing

An algorithm combines the information contained in the Statistical Indexes to produce and store the Schema Summary.
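The slides do not spell out how the Pattern Strategy splits its queries, so the following is only an illustration of the general idea of trading one complex query for several simpler ones. The function names and the per-class decomposition are assumptions, not the LODeX implementation.

```python
from typing import List


def aggregate_query() -> str:
    """One complex query: count the instances of every class in a single pass."""
    return (
        "SELECT ?c (COUNT(DISTINCT ?s) AS ?n) "
        "WHERE { ?s a ?c } GROUP BY ?c"
    )


def per_class_queries(classes: List[str]) -> List[str]:
    """Many simple queries: one instance count per class that is already known."""
    return [
        f"SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE {{ ?s a <{c}> }}"
        for c in classes
    ]


# Hypothetical list of classes, e.g. obtained from a previous lightweight query.
classes = [
    "http://xmlns.com/foaf/0.1/Organization",
    "http://xmlns.com/foaf/0.1/Person",
]
queries = per_class_queries(classes)
for q in queries:
    print(q)
```

Each generated query avoids GROUP BY and touches a single class, which is the kind of simplification that may help with endpoints that time out on heavy aggregates.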


Schema Summary

The Schema Summary is a pseudograph composed of:

• C - Classes (nodes)
• P - Properties (edges)

And additional elements and functions:

• A - Attributes associated with each class
– Each attribute represents the existence of a datatype property on the instances of the class
• - labels
• l - labeling function
• count - count function

The Schema Summary is inferred from the distribution of the instances of a dataset.


Running example

[Figure: RDF graph of the running example, divided into Intensional Knowledge and Extensional Knowledge, with the Extensional Classes highlighted.
Classes: ex:Sector, foaf:Organization, foaf:Person, owl:Class, rdf:Property, owl:ObjectProperty.
Instances: sector1, organization1, organization2, person1.
Properties: rdf:type, rdfs:label, rdfs:domain, ex:sector, ex:ceo, ex:activity, dc:title, dbpedia:fax, foaf:firstName, foaf:lastName.
Literals: “sector”, “Energy”, “Village electrification in the Pacific”, “+41331231”, “Paolo”, “Rossi”.]

The information contained in the Intensional knowledge can be incomplete or absent


Indexes needed to generate a Schema Summary

These indexes belong to the extensional group of the Statistical Indexes [2]:

• SC (Subject Class) contains the pairs (p,c) where p is an object property and c is its domain class.
• SCl (Subject Class to literal) contains the pairs (p,c) where p is a datatype property and c is its domain class.
• OC (Object Class) contains the pairs (p,c) where p is an object property and c is its range class.
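The three index definitions above can be sketched over an in-memory list of triples. This is a minimal illustration, not the LODeX extraction (which runs SPARQL queries against an endpoint). The triples are an assumed reconstruction of the extensional part of the running example: in particular, attaching ex:ceo, ex:activity and dbpedia:fax to organization1 is a guess. Entries are emitted as (class, property, count), with the count taken as the number of distinct instances, which is also an assumption.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def lit(value):
    """Mark a value as a literal; plain strings are treated as resources."""
    return ("lit", value)


# Assumed reconstruction of the running example (extensional knowledge only).
TRIPLES = [
    ("sector1", RDF_TYPE, "ex:Sector"),
    ("organization1", RDF_TYPE, "foaf:Organization"),
    ("organization2", RDF_TYPE, "foaf:Organization"),
    ("person1", RDF_TYPE, "foaf:Person"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "ex:sector", "sector1"),
    ("organization1", "ex:ceo", "person1"),
    ("organization1", "ex:activity", lit("Village electrification in the Pacific")),
    ("organization1", "dbpedia:fax", lit("+41331231")),
    ("sector1", "dc:title", lit("Energy")),
    ("person1", "foaf:firstName", lit("Paolo")),
    ("person1", "foaf:lastName", lit("Rossi")),
]


def extract_indexes(triples):
    """Return (SC, SCl, OC) as sets of (class, property, count) entries."""
    # First pass: collect the classes of every typed instance.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    # Second pass: group distinct instances by (class, property).
    sc, scl, oc = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        if isinstance(o, tuple):  # literal object: datatype property
            for c in types[s]:
                scl[(c, p)].add(s)
        else:  # resource object: object property
            for c in types[s]:
                sc[(c, p)].add(s)
            for c in types[o]:
                oc[(c, p)].add(o)

    def counted(index):
        return {(c, p, len(members)) for (c, p), members in index.items()}

    return counted(sc), counted(scl), counted(oc)


SC, SCl, OC = extract_indexes(TRIPLES)
print(sorted(SC))
print(sorted(SCl))
print(sorted(OC))
```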


Schema Summary generation

We use an algorithm that combines these indexes to produce a Schema Summary.

Name   Values
SC     (foaf:Organization,ex:ceo,1), (foaf:Organization,ex:sector,2)
SCl    (foaf:Person,foaf:firstName,1), (foaf:Person,foaf:lastName,1), (ex:Sector,dc:title,1), (foaf:Organization,ex:activity,1), (foaf:Organization,dbpedia:fax,1)
OC     (ex:Sector,ex:sector,1), (foaf:Person,ex:ceo,1)


Schema Summary generation (result)

[Figure: Schema Summary of the running example]

• Nodes (classes, with instance counts): foaf:Organization (2), ex:Sector (1), foaf:Person (1)
• Edges (object properties, with counts): foaf:Organization to ex:Sector via ex:sector (2); foaf:Organization to foaf:Person via ex:ceo (1)
• Attributes of ex:Sector: dc:title (1)
• Attributes of foaf:Person: foaf:firstName (1), foaf:lastName (1)
• Attributes of foaf:Organization: ex:activity (1), dbpedia:fax (1)
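The slides do not give the combining algorithm, so the following is a minimal sketch under stated assumptions: SCl entries become class attributes, and an edge is created whenever an SC entry and an OC entry share the same property, labelled with the SC count. The `SchemaSummary` class, the `build_schema_summary` function and the separate `instance_counts` input are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Index = Set[Tuple[str, str, int]]  # entries of the form (class, property, count)


@dataclass
class SchemaSummary:
    # class -> number of instances
    classes: Dict[str, int] = field(default_factory=dict)
    # class -> {datatype property: count}
    attributes: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # (source class, property, target class, count); a list allows multi-edges and loops
    edges: List[Tuple[str, str, str, int]] = field(default_factory=list)


def build_schema_summary(sc: Index, scl: Index, oc: Index,
                         instance_counts: Dict[str, int]) -> SchemaSummary:
    summary = SchemaSummary(classes=dict(instance_counts))
    # Datatype properties become attributes of their subject class.
    for cls, prop, n in sorted(scl):
        summary.attributes.setdefault(cls, {})[prop] = n
    # Naive join on the property name: subject class -> object class.
    for src, prop, n in sorted(sc):
        for dst, oc_prop, _ in sorted(oc):
            if oc_prop == prop:
                summary.edges.append((src, prop, dst, n))
    return summary


# Index values from the slide (with foaf:Person as the range class of ex:ceo).
SC = {("foaf:Organization", "ex:ceo", 1), ("foaf:Organization", "ex:sector", 2)}
SCl = {
    ("foaf:Person", "foaf:firstName", 1),
    ("foaf:Person", "foaf:lastName", 1),
    ("ex:Sector", "dc:title", 1),
    ("foaf:Organization", "ex:activity", 1),
    ("foaf:Organization", "dbpedia:fax", 1),
}
OC = {("ex:Sector", "ex:sector", 1), ("foaf:Person", "ex:ceo", 1)}
# Assumption: per-class instance counts come from a separate index.
INSTANCE_COUNTS = {"foaf:Organization": 2, "ex:Sector": 1, "foaf:Person": 1}

summary = build_schema_summary(SC, SCl, OC, INSTANCE_COUNTS)
for edge in summary.edges:
    print(edge)
```

The join on the property name alone is a simplification: if one property connects several class pairs, it yields every subject/object combination, so a real implementation would need more information than these three indexes to disambiguate.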


Visualization & Querying

Schema Summary Visualization

Front end of the Web application, composed of three panels:

• List of datasets indexed in LODeX
• Schema Summary and query building panel
• Refinement panel

Query Orchestrator

• It manages the interaction between the user and the GUI
• It contains a SPARQL compiler that compiles the visual query into a SPARQL one


Schema Summary – Building a Visual Query


Refinement Panel


Visual Query & SPARQL compiler

[Diagram labels: Schema Summary, Basic Query, SPARQL compiler, SPARQL query]

The Visual Query has a tree structure.

A SPARQL compiler uses a recursive algorithm to generate the corresponding SPARQL query.

Operators supported by the compiler:

• AND
• OPTIONAL
• FILTER
• ORDER BY
• LIMIT
• OFFSET

The query is sent to the SPARQL endpoint and the results can be visualized in a tabular view.
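A recursive compilation of a tree-shaped visual query might look like the sketch below. It is not the LODeX compiler: the `Node` structure, its field names and the example query over the running-example vocabulary are assumptions, and PREFIX declarations are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    var: str        # SPARQL variable name, without the leading "?"
    rdf_class: str  # class of the node (prefixed name)
    attributes: List[Tuple[str, str]] = field(default_factory=list)  # (property, variable)
    filters: List[str] = field(default_factory=list)                 # raw FILTER expressions
    # (property, child node, optional?)
    children: List[Tuple[str, "Node", bool]] = field(default_factory=list)


def compile_patterns(node: Node, indent: str = "  ") -> List[str]:
    """Recursively emit the triple patterns of a node and its subtree (AND = juxtaposition)."""
    lines = [f"{indent}?{node.var} a {node.rdf_class} ."]
    for prop, var in node.attributes:
        lines.append(f"{indent}?{node.var} {prop} ?{var} .")
    for expr in node.filters:
        lines.append(f"{indent}FILTER({expr})")
    for prop, child, optional in node.children:
        link = f"?{node.var} {prop} ?{child.var} ."
        if optional:
            lines.append(f"{indent}OPTIONAL {{")
            lines.append(f"{indent}  {link}")
            lines.extend(compile_patterns(child, indent + "  "))
            lines.append(f"{indent}}}")
        else:
            lines.append(f"{indent}{link}")
            lines.extend(compile_patterns(child, indent))
    return lines


def collect_vars(node: Node) -> List[str]:
    """Collect the variables of the tree in depth-first order."""
    names = [node.var] + [var for _, var in node.attributes]
    for _, child, _ in node.children:
        names.extend(collect_vars(child))
    return names


def compile_query(root: Node, order_by: Optional[str] = None,
                  limit: Optional[int] = None, offset: Optional[int] = None) -> str:
    select = " ".join(f"?{v}" for v in collect_vars(root))
    parts = [f"SELECT {select}", "WHERE {", *compile_patterns(root), "}"]
    if order_by:
        parts.append(f"ORDER BY ?{order_by}")
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    if offset is not None:
        parts.append(f"OFFSET {offset}")
    return "\n".join(parts)


# Hypothetical visual query: organizations in the "Energy" sector, with an optional CEO.
query = compile_query(
    Node(
        var="org",
        rdf_class="foaf:Organization",
        children=[
            ("ex:sector",
             Node("sector", "ex:Sector",
                  attributes=[("dc:title", "title")],
                  filters=['?title = "Energy"']),
             False),
            ("ex:ceo",
             Node("ceo", "foaf:Person", attributes=[("foaf:lastName", "lastName")]),
             True),
        ],
    ),
    order_by="title",
    limit=10,
)
print(query)
```

Because the visual query is a tree, each child is visited exactly once, which is why only acyclic joins can be expressed this way.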


Evaluation of LODeX

We performed 3 different kinds of evaluation to inspect:

• Portability of LODeX to SPARQL endpoints
• SPARQL expressiveness
• Usability of LODeX
– to verify whether the graph visualization of the Schema Summary (SS) is clear in representing the structure of a dataset
– to verify whether the visual query panel is a powerful and adequate way of generating SPARQL queries


Portability evaluation

We evaluated the complexity of the graph visualization with a group of 5 students.

• Task: find a node in graphs of increasing size

The test set is composed of 185 datasets taken from Datahub.

Result of the portability test              Number of datasets     %
Huge Schema Summary (more than 80 nodes)                    40   21%
Offline endpoints                                            7    4%
Non-standard response                                       28   15%
Passed the test                                            110   60%


SPARQL expressiveness

We analyzed what kinds of SPARQL queries LODeX is able to generate.

We used as reference the queries contained in the Berlin SPARQL Benchmark (BSBM) [3]:

• LODeX is able to generate 6 of the 10 queries contained in BSBM

Not supported:

• UNION queries
• CONSTRUCT queries
• ASK queries

Supported:

• All acyclic JOIN queries
• All FILTER queries
• All ORDER queries


User Evaluation

We performed an online survey in which 27 users were enrolled.

The survey is divided into two parts with different goals:

• Evaluate the clarity of the Schema Summary
• Evaluate the functionality of visual query building

For each part, we designed some tasks and a SUS [4] questionnaire.


Clarity of Schema Summary

Tasks:

• (T1) Indicate the topic of the dataset
• (T2) Find the class with the largest number of instances
• (T3) Find the classes connected to a given class chosen by us
• (T4) Find the most used attribute of a class chosen by us

Datasets:

• Bio2RDF - INOH - pathway database of model organisms
• Linked Open Aalto Data Service - Open data published by Aalto University

Task    Number n   Correct      %
T1            54        48    89%
T2            54        48    89%
T3            27        23    89%
T4            27        27   100%
Total        162       148    91%


Query building functionalities

Tasks:

• (Q1) Return all the different categories of Nobel prizes
• (Q2) Return a table containing the list of winners of a Nobel prize, ordered by the name of the winner; the table has to contain the date of birth of the winner
• (Q3) Find the award files related to the award of Peter W. Higgs
• (Q4) Find the organizations that won a Nobel prize after 1999

Dataset:

• Nobel Prizes - Linked Open Data about every Nobel Prize

Task    Number n   Correct      %
Q1            27        27   100%
Q2            27        26    96%
Q3            27        22    81%
Q4            27        23    85%
Total        108        98    90%


Query building functionalities (2)

We obtained a median SUS score of 85.5.

• No remarkable differences between skilled and unskilled users
• This score classifies the usability of LODeX as “Excellent” [5]

Feedback

• Unskilled users wrote a SPARQL query for the first time
• “LODeX is cognitively less demanding than writing a SPARQL query”
• Browser rendering differences
• Starting a query can be difficult; keyword search techniques could be helpful
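For readers unfamiliar with SUS, the standard scoring rule from Brooke [4] can be sketched as follows: ten items answered on a 1 to 5 scale, odd-numbered items contribute (answer - 1), even-numbered items contribute (5 - answer), and the sum is multiplied by 2.5 to give a score between 0 and 100. The responses below are made up for illustration and are not the survey data.

```python
from statistics import median
from typing import List


def sus_score(responses: List[int]) -> float:
    """Compute the SUS score (0-100) of one questionnaire with ten answers in 1..5."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs exactly ten answers, each between 1 and 5")
    total = 0
    for i, answer in enumerate(responses):
        # Index 0, 2, 4, ... are items 1, 3, 5, ... (positively worded).
        total += (answer - 1) if i % 2 == 0 else (5 - answer)
    return total * 2.5


# Hypothetical questionnaires from three participants.
questionnaires = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
scores = [sus_score(q) for q in questionnaires]
print(scores, median(scores))
```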


Conclusion and Future Works

Conclusion

• LODeX is portable to 60% of the datasets tested
– 19% of the datasets failed because of endpoint issues
• Both skilled and unskilled users appreciated LODeX

Future works

• Modify the interface of LODeX according to the results of the online survey
• Define clustering and new browsing techniques to reduce the complexity of the Schema Summary for huge datasets
• Extend the group of operators supported by the SPARQL compiler


References

• [1] F. Benedetti, S. Bergamaschi, and L. Po. A visual summary for linked open data sources. International Semantic Web Conference (Posters & Demos), 2014.

• [2] F. Benedetti, S. Bergamaschi, and L. Po. Online index extraction from linked open data sources. Linked Data for Information Extraction (LD4IE) Workshop held at International Semantic Web Conference, 2014.

• [3] C. Bizer and A. Schultz. Benchmarking the performance of storage systems that expose SPARQL endpoints.

• [4] J. Brooke. SUS: a quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.

• [5] A. Bangor, P. Kortum, and J. Miller. Determining what individual SUS scores mean: Adding an adjective rating scale.


Acknowledgment


Thanks for your attention!

Try LODeX at: http://dbgroup.unimo.it/lodex2