Focused Exploration of Geospatial Context on Linked Open Data

Thomas Gottron Focused Exploration of LOD 1 Institute for Web Science and Technologies · University of Koblenz-Landau, Germany


Thomas Gottron, Johannes Schmitz, Stuart E. Middleton

20 October 2014 IESD workshop, Riva del Garda


Challenge: Focused Exploration of LOD

•  Linked Data entities
•  (Semantic) link structure
•  „Relevant“ entities
•  Seed entity

[Figure: graph of linked entities around a seed entity, with question marks on the links whose targets are unknown]

Classification: Which links lead to relevant entities?

Ranking: How probable is a link leading to a relevant entity?

Use Cases:
•  Guided exploration
•  Focused LOD crawler


Focused exploration of Geospatial Context

Relevant entities: Locations semantically related to seed entities

Bensheim (Germany)

Rovereto


Focused Exploration: Formalisation

•  E: set of entities (URIs)
•  R: set of RDF triples (s,p,o)
   – Restricted to s,o ∈ E
•  L ⊆ E: relevant entities
   – For us: Locations with coordinates
•  Task: for given s′ and all (s′,p,o) ∈ R
   – Classification: Predict which o are in L
   – Ranking: Sort object entities o starting from the one presumed most probable to be relevant

[Figure: an entity s ∈ L with wgs84:long -1.404 and wgs84:lat 50.897]
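The formalisation above can be written down as a small data model. The following is a minimal sketch, not the authors' implementation: the `ex:` URIs, the predicates and the second coordinate pair are invented for illustration, and only the first coordinate pair is taken from the slide's figure. It shows how L follows from the coordinates and which labels the classification task has to predict for a seed entity.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str


# Hypothetical toy data: all URIs below are made up for illustration.
R = [
    Triple("ex:Seed", "ex:twinCity", "ex:PlaceB"),
    Triple("ex:Seed", "ex:mayor", "ex:PersonC"),
]
E = {t.s for t in R} | {t.o for t in R}

# Entities with wgs84:lat / wgs84:long values, kept outside R because
# R is restricted to triples whose subject and object are both in E.
coordinates = {
    "ex:Seed": (50.897, -1.404),  # (lat, long) as in the slide's figure
    "ex:PlaceB": (45.0, 11.0),    # invented values
}

# L: relevant entities, i.e. locations with coordinates.
L = {e for e in E if e in coordinates}


def ground_truth(seed, triples, relevant):
    """Label every object o of a triple (seed, p, o) by membership in L."""
    return {t.o: t.o in relevant for t in triples if t.s == seed}


labels = ground_truth("ex:Seed", R, L)
```

The approaches on the following slides try to predict these labels from the link predicates alone, without looking up the object entities first.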


5 Approaches

•  Based on 3 paradigms:
   – Schema semantics (1 approach)
   – Supervised machine learning (2 approaches)
   – Information Retrieval inspired (2 approaches)


Exploration based on Schema Semantics

•  Exploit rdfs:range definitions of link predicates
•  Follow links which lead to locations

[Figure: dbponto:twinCity rdfs:range dbpedia:City; dbpedia:City rdfs:subClassOf dbpedia:Place]
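The range-based check can be sketched as follows. This is an illustration under stated assumptions: the schema is held in two plain dictionaries, each class has at most one superclass, `dbpedia:Place` is taken as the location class, and `ex:mayor` / `ex:Person` are invented entries. Only the `dbponto:twinCity` example comes from the slide.

```python
# Hypothetical schema fragment; the twinCity entries mirror the slide's figure.
RANGE = {
    "dbponto:twinCity": "dbpedia:City",
    "ex:mayor": "ex:Person",  # invented counter-example
}
SUBCLASS_OF = {"dbpedia:City": "dbpedia:Place"}
LOCATION_CLASS = "dbpedia:Place"  # assumed root class for locations


def is_subclass_of(cls, target, subclass_of):
    """Walk up the rdfs:subClassOf chain, guarding against cycles."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == target:
            return True
        seen.add(cls)
        cls = subclass_of.get(cls)
    return False


def leads_to_location(predicate):
    """True if the predicate's rdfs:range is a (sub)class of the location class."""
    rng = RANGE.get(predicate)
    return rng is not None and is_subclass_of(rng, LOCATION_CLASS, SUBCLASS_OF)


def classify_object(link_predicates):
    """An object is labelled relevant if any predicate pointing to it has a location range."""
    return any(leads_to_location(p) for p in link_predicates)
```

Predicates without a declared range are never followed in this sketch, so the check can only find locations that the schema explicitly describes.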


Exploration based on Schema Semantics

Classification
•  Range of any pi is a location? → Label = relevant

Ranking
•  Re-use classification:
   –  Relevant before irrelevant

[Figure: s linked to o via predicates p1, p2, ..., pm; is o a location?]


Supervised Machine Learning

•  Use incoming link predicates as features
   –  Learn predicates which typically lead to locations
•  Train a classifier (e.g. Naive Bayes)
•  2 Variations: Use all or only observed predicates

[Figure: an object o with wgs84:long and wgs84:lat values and incoming links p2, p3, and an object o′ with incoming links p4, p6]
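A Naive Bayes classifier over link predicates could look like the sketch below. It is an assumption-laden illustration, not the authors' code: it uses a Bernoulli model with Laplace smoothing, and it interprets the two variations as "every predicate in the vocabulary contributes, present or absent" (`use_all=True`) versus "only the predicates observed on the link contribute" (`use_all=False`). The training examples are invented.

```python
import math
from collections import Counter


def train(examples, alpha=1.0):
    """examples: list of (set_of_link_predicates, is_relevant) pairs."""
    vocab = set().union(*(preds for preds, _ in examples))
    n = {True: 0, False: 0}
    counts = {True: Counter(), False: Counter()}
    for preds, label in examples:
        n[label] += 1
        counts[label].update(preds)
    # Laplace-smoothed P(predicate present | class).
    cond = {
        c: {p: (counts[c][p] + alpha) / (n[c] + 2 * alpha) for p in vocab}
        for c in (True, False)
    }
    prior = {c: (n[c] + alpha) / (len(examples) + 2 * alpha) for c in (True, False)}
    return vocab, prior, cond


def log_odds(preds, model, use_all=True):
    """log of P(o in L) / P(o not in L); positive means 'relevant'."""
    vocab, prior, cond = model
    score = math.log(prior[True]) - math.log(prior[False])
    for p in vocab:
        if p in preds:
            score += math.log(cond[True][p]) - math.log(cond[False][p])
        elif use_all:
            # Absent predicates count as evidence only in the 'all predicates' variant.
            score += math.log(1 - cond[True][p]) - math.log(1 - cond[False][p])
    return score


# Invented training data: predicates on a link and whether its object is a location.
examples = [
    ({"ex:twinCity"}, True),
    ({"ex:twinCity", "ex:birthPlace"}, True),
    ({"ex:birthPlace"}, True),
    ({"ex:author"}, False),
    ({"ex:author", "ex:spouse"}, False),
]
model = train(examples)
```

A positive log-odds value corresponds to labelling the object as relevant, and sorting objects by the same value gives a ranking.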


Supervised Machine Learning

Classification
•  $P(o \in L) > P(o \notin L)$? → Label = relevant

Ranking
•  Rank by odds: $O(o \in L) = \frac{P(o \in L)}{P(o \notin L)}$

[Figure: s linked to o via predicates p1, p2, ..., pm; is o a location?]
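As a small worked example of the odds, assume two hypothetical objects with estimated probabilities 0.8 and 0.3 of being in L (the numbers are invented for illustration):

```latex
O(o_1 \in L) = \frac{0.8}{1 - 0.8} = 4 > 1
\qquad
O(o_2 \in L) = \frac{0.3}{1 - 0.3} \approx 0.43 < 1
```

Here $o_1$ would be labelled relevant and ranked before $o_2$. Because the odds grow monotonically with $P(o \in L)$, ranking by odds gives the same order as ranking by probability.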


IR Inspired Approaches

•  Discriminativeness of predicates (inspired by tf-idf)
•  Property relevance frequency: $\mathrm{prf} = c(p, L)$
•  Inverse property frequency: $\mathrm{ipf} = \log\left(\frac{c(*,*)}{c(p,*)}\right)$
•  Combine into prf-ipf and prr-ipf
•  2nd Variation: prr: normalised prf
•  Total score ρ: aggregate over all predicates

[Figure: an object o with an incoming link p3]
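The scores above can be computed from simple triple counts. The sketch below rests on assumptions the slide does not spell out: c(p, L) is read as the number of triples with predicate p whose object is in L, c(p, ∗) as the number of triples with predicate p, c(∗, ∗) as the total number of triples, prr as prf divided by c(p, ∗), and the total score ρ as a plain sum over the predicates on the link. The triples are invented.

```python
import math
from collections import Counter


def predicate_stats(triples, relevant):
    """Per-predicate prf-ipf and prr-ipf weights from (s, p, o) triples."""
    c_p = Counter(p for _, p, _ in triples)                     # c(p, *)
    c_pL = Counter(p for _, p, o in triples if o in relevant)   # c(p, L)
    total = len(triples)                                        # c(*, *)
    stats = {}
    for p, n in c_p.items():
        ipf = math.log(total / n)
        prf = c_pL[p]
        prr = prf / n  # assumed normalisation of prf
        stats[p] = {"prf-ipf": prf * ipf, "prr-ipf": prr * ipf}
    return stats


def score(predicates, stats, variant="prr-ipf"):
    """Total score rho: here simply the sum over all predicates on the link."""
    return sum(stats[p][variant] for p in predicates if p in stats)


# Invented training triples; "b" and "c" play the role of locations.
triples = [
    ("a", "ex:twinCity", "b"),
    ("a", "ex:twinCity", "c"),
    ("a", "ex:author", "d"),
    ("b", "ex:author", "e"),
    ("c", "ex:author", "b"),
    ("a", "ex:capital", "c"),
]
stats = predicate_stats(triples, relevant={"b", "c"})
```

In this toy data `ex:twinCity` always points to a location and therefore receives a higher weight than `ex:author`, which does so only once in three triples.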


IR Inspired Approaches

Classification
•  Determine threshold
   –  Nearest centroid

Ranking
•  Rank by score $\rho_{\text{prr-ipf}}(o)$

[Figure: s linked to o via predicates p1, p2, ..., pm; is o a location?]
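For a one-dimensional score, nearest-centroid classification reduces to a threshold halfway between the two class means. The sketch below illustrates this under the assumption that relevant objects have the higher mean score; the function names and training scores are invented.

```python
from statistics import mean


def centroid_threshold(scores, labels):
    """Midpoint between the mean score of relevant and of irrelevant objects."""
    relevant = [s for s, y in zip(scores, labels) if y]
    irrelevant = [s for s, y in zip(scores, labels) if not y]
    return (mean(relevant) + mean(irrelevant)) / 2


def is_relevant(score, threshold):
    """Closer to the relevant centroid means above the midpoint."""
    return score > threshold


# Invented training scores and labels.
train_scores = [0.9, 1.1, 0.1, 0.3]
train_labels = [True, True, False, False]
threshold = centroid_threshold(train_scores, train_labels)
```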


Evaluation

•  Metrics:
   –  Ranking:
      •  ROC curves
      •  AUC
   –  Classification:
      •  Precision
      •  Recall
      •  F1
      •  Accuracy
•  Cross validation:
   –  10-times / 10-fold
   –  Averages

[Figure: data set. Seed: 99,951 entities (owl:sameAs); Exploration: 425,338 entities, 128,171 relevant; 1,728,633 links]
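The metrics listed above can be computed as in the following sketch. It is a plain-Python illustration rather than the evaluation code behind the reported numbers: AUC is computed by pairwise comparison of relevant and irrelevant objects, with ties counted as one half, and the labels and scores are invented.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy from boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def auc(y_true, scores):
    """Probability that a relevant object is scored above an irrelevant one."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: four objects, two of them relevant.
y_true = [True, True, False, False]
y_pred = [True, False, True, False]
scores = [0.9, 0.4, 0.6, 0.1]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

In a 10-times 10-fold cross validation these values would be computed per fold and then averaged.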


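Performance (Ranking)

[Figure: ROC curves for random, Schema Semantics, NB (all predicates), NB (present predicates), prf-ipf and prr-ipf, both axes from 0 to 1; an inset zooms into the region from 0 to 0.05 on the horizontal axis and from 0.95 to 1 on the vertical axis]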


Performance (Classification & Ranking)

Table 2. Average performance of approaches († indicates significant improvements at confidence level ρ = 0.01)

Method                    Recall    Precision  F1        Accuracy  AUC
Schema Semantics          0.1188    0.8119     0.2073    0.7262    0.5552
NB (all predicates)       0.9906    0.9491     † 0.9694  † 0.9812  0.9970
NB (observed predicates)  0.9943    0.9436     0.9683    0.9804    0.9968
prf-ipf                   0.8512    † 0.9754   0.9091    0.9487    0.9958
prr-ipf                   † 0.9973  0.9240     0.9592    0.9745    0.9769

performance in bold. Furthermore, we marked the results where we had a significant improvement over the second best method at confidence level of ρ = 0.01. The aggregated values basically confirm the observations made above. In general, when considering the measures F1, Accuracy and AUC, the Naive Bayes classifier making use of all predicates performs best. However, the advantage in comparison to the Naive Bayes classifier using only observed terms is negligible. In application scenarios, where a high Recall is of importance, instead, the prr-ipf approach achieves the best results with more than 99.7%. When focusing on Precision, prf-ipf performs best and demonstrated the highest values. More than 97% of the objects predicted to have geo-coordinates actually did provide such information. In a setting where we want to focus on promising items this might be the kind of performance the end user is looking for.

One explanation for the very high accuracy in general might also be the dataset. Given that we started the exploration from location entities on DBPedia and LinkedGeoData, the overall dataset was biased towards entities from DBPedia. Hence, we intend to extend the evaluation to see if the quality of the supervised approaches remains at a comparable level, when using larger and even more diverse datasets.

6 Related Work

Previous work related to this paper can be found in three areas, each of which will be described below: (a) Extraction of geographic entities provides a starting point for our approach. The fields of (b) focused crawling on the WWW and (c) machine learning applied to Linked Data in general each share some similarities with our classification and ranking task, although differences do exist.

6.1 Extraction of Geographic Entities

Work done in the TRIDEC project [7] examined how geographic databases such as Geonames, OpenStreetMap and GooglePlaces could be used to avoid the need for error prone named entity recognition and thus increase the overall precision when geoparsing large volumes of Twitter reports for crisis mapping. This work directly compared crisis maps from Twitter with official post-disaster environment agency impact assessments, highlighting just how accurate maps based on large-scale geospatial report crowd sourcing can be. We are building on this approach within the REVEAL project and extending


Summary

•  Focused exploration feasible
•  ML approach performing best
•  Future work:
   – Other data sets
   – Generalise scenario (more than locations)
   – Better approaches using more features


Questions?

Thomas Gottron Institute for Web Science and Technologies Universität Koblenz-Landau [email protected]
