Focused Exploration of Geospatial Context on Linked Open Data
Thomas Gottron Focused Exploration of LOD 1 Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Focused Exploration of Geospatial Context on Linked Open Data
Thomas Gottron, Johannes Schmitz, Stuart E. Middleton
20 October 2014 IESD workshop, Riva del Garda
Challenge: Focused Exploration of LOD
• Linked Data entities
• (Semantic) link structure
• "Relevant" entities
• Seed entity
• Classification: Which links lead to relevant entities?
• Ranking: How probable is it that a link leads to a relevant entity?
• Use cases: guided exploration, focused LOD crawler
Focused Exploration of Geospatial Context
• Relevant entities: locations semantically related to seed entities
[Figure: example locations Bensheim (Germany) and Rovereto]
Focused Exploration: Formalisation
• E: set of entities (URIs)
• R: set of RDF triples (s,p,o)
– Restricted to s,o ∈ E
• L ⊆ E: relevant entities
– For us: locations with coordinates
• Task: for a given s' and all (s',p,o) ∈ R
– Classification: predict which o are in L
– Ranking: sort the object entities o, starting from the one presumed most probable to be relevant
[Figure: example entity s ∈ L with wgs84:lat 50.897 and wgs84:long -1.404]
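The formalisation can be made concrete with a small sketch. The entity URIs, predicates and the helper names `candidate_links` and `ground_truth` are hypothetical and only illustrate the sets E, R and L; they are not taken from the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str


# Hypothetical toy data; all URIs are illustrative only.
E = {"ex:Southampton", "ex:Hampshire", "ex:SomeBand", "ex:England"}
R = [
    Triple("ex:Southampton", "ex:isPartOf", "ex:Hampshire"),
    Triple("ex:Southampton", "ex:hometownOf", "ex:SomeBand"),
    Triple("ex:Hampshire", "ex:isPartOf", "ex:England"),
]
# L: entities assumed to carry wgs84:lat / wgs84:long coordinates.
L = {"ex:Southampton", "ex:Hampshire", "ex:England"}


def candidate_links(seed, triples, entities):
    """Return all (p, o) pairs leaving the seed, restricted to o in E."""
    return [(t.p, t.o) for t in triples if t.s == seed and t.o in entities]


def ground_truth(seed, triples, entities, relevant):
    """Label every candidate object: True if it is in L."""
    return {o: (o in relevant) for _, o in candidate_links(seed, triples, entities)}


labels = ground_truth("ex:Southampton", R, E, L)
```

A classifier has to predict these labels without dereferencing o; a ranker has to order the candidate objects so that those in L come first.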
5 Approaches
• Based on 3 paradigms:
– Schema semantics (1 approach)
– Supervised machine learning (2 approaches)
– Information Retrieval inspired (2 approaches)
Exploration based on Schema Semantics
• Exploit rdfs:range definitions of link predicates
• Follow links which lead to locations
[Figure: dbponto:twinCity rdfs:range dbpedia:City; dbpedia:City rdfs:subClassOf dbpedia:Place]
Exploration based on Schema Semantics
Classification
• Is the range of any p_i a location? → Label = relevant
Ranking
• Re-use classification:
– Relevant before irrelevant
[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
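A minimal sketch of the schema-based approach, assuming the rdfs:range and rdfs:subClassOf statements are already available as dictionaries. The schema fragments are hand-written for illustration, and single inheritance is assumed to keep the lookup short.

```python
# Hypothetical schema fragments, hand-written for illustration.
RANGE = {"dbponto:twinCity": "dbpedia:City", "ex:author": "ex:Person"}
SUBCLASS_OF = {"dbpedia:City": "dbpedia:Place"}  # simplification: one superclass
LOCATION_CLASSES = {"dbpedia:Place"}


def is_location_class(cls):
    """Walk rdfs:subClassOf upwards until a location class or the root is reached."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls in LOCATION_CLASSES:
            return True
        seen.add(cls)
        cls = SUBCLASS_OF.get(cls)
    return False


def classify(predicates):
    """Relevant if the range of any connecting predicate is a location class."""
    return any(is_location_class(RANGE.get(p)) for p in predicates)


def rank(candidates):
    """candidates: dict object -> list of predicates. Relevant before irrelevant."""
    return sorted(candidates, key=lambda o: not classify(candidates[o]))
```

Predicates without a declared range are treated as irrelevant here, so the sketch can only find locations whose links are typed in the schema.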
Supervised Machine Learning
• Use incoming link predicates as features
– Learn which predicates typically lead to locations
• Train a classifier (e.g. Naive Bayes)
• 2 variations: use all or only observed predicates
[Figure: training entities o (with wgs84:long and wgs84:lat values) and o' with their incoming link predicates]
Supervised Machine Learning
Classification
• $P(o \in L) > P(o \notin L)$? → Label = relevant
Ranking
• Rank by odds: $O(o \in L) = \frac{P(o \in L)}{P(o \notin L)}$
[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
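The odds-based decision can be sketched with a small hand-rolled Naive Bayes over binary predicate features. Several details are assumptions rather than the authors' setup: the Laplace smoothing, the reading of the "all" variation as also scoring absent predicates, and the function names `train` and `log_odds`. The training data must contain both classes.

```python
import math
from collections import Counter


def train(examples, vocabulary):
    """examples: list of (set_of_incoming_predicates, is_location)."""
    n = Counter(label for _, label in examples)
    counts = {True: Counter(), False: Counter()}
    for preds, label in examples:
        counts[label].update(preds)
    # Laplace-smoothed P(predicate present | class); smoothing is an assumption.
    cond = {
        label: {p: (counts[label][p] + 1) / (n[label] + 2) for p in vocabulary}
        for label in (True, False)
    }
    prior = {label: n[label] / len(examples) for label in (True, False)}
    return prior, cond


def log_odds(preds, model, vocabulary, use_all=True):
    """log O(o in L); use_all=False scores only the predicates that are present."""
    prior, cond = model
    score = math.log(prior[True]) - math.log(prior[False])
    for p in vocabulary:
        if p in preds:
            score += math.log(cond[True][p]) - math.log(cond[False][p])
        elif use_all:
            score += math.log(1 - cond[True][p]) - math.log(1 - cond[False][p])
    return score


vocab = ["p:location", "p:birthPlace", "p:author"]
examples = [
    ({"p:location"}, True),
    ({"p:birthPlace"}, True),
    ({"p:location", "p:birthPlace"}, True),
    ({"p:author"}, False),
    ({"p:author"}, False),
]
model = train(examples, vocab)
```

A positive log-odds value corresponds to P(o ∈ L) > P(o ∉ L), so the same score serves both the classification and the ranking task.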
IR Inspired Approaches
• Discriminativeness of predicates (inspired by tf-idf)
• Property relevance frequency: $\mathrm{prf} = c(p, L)$
• Inverse property frequency: $\mathrm{ipf} = \log\left(\frac{c(*,*)}{c(p,*)}\right)$
• 2nd variation: prr, a normalised prf
• Combine into prf-ipf and prr-ipf
• Total score ρ: aggregate over all predicates
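A sketch of how the counts c(p, L), c(p, ∗) and c(∗, ∗) could be turned into predicate weights. The slide does not spell out how prr is normalised or how ρ is aggregated; the sketch assumes prr = c(p, L) / c(p, ∗) and a plain sum, and the function names are hypothetical.

```python
import math
from collections import Counter


def predicate_weights(triples, relevant):
    """triples: list of (s, p, o); relevant: the set L."""
    c_p_any = Counter(p for _, p, _ in triples)                  # c(p, *)
    c_p_L = Counter(p for _, p, o in triples if o in relevant)   # c(p, L)
    total = sum(c_p_any.values())                                # c(*, *)
    weights = {}
    for p, n in c_p_any.items():
        ipf = math.log(total / n)
        prf = c_p_L[p]
        prr = prf / n  # assumed normalisation of prf
        weights[p] = {"prf-ipf": prf * ipf, "prr-ipf": prr * ipf}
    return weights


def score(predicates, weights, variant="prr-ipf"):
    """Total score rho(o): here simply the sum over the given predicates."""
    return sum(weights[p][variant] for p in predicates if p in weights)


triples = [
    ("a", "p:location", "x"),
    ("b", "p:location", "y"),
    ("c", "p:author", "z"),
    ("d", "p:author", "w"),
    ("e", "p:capital", "x"),
    ("f", "p:type", "z"),
]
weights = predicate_weights(triples, relevant={"x", "y"})
```

As with tf-idf, a predicate that occurs in every triple receives an ipf of zero and contributes nothing to the score.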
IR Inspired Approaches
Classification
• Determine threshold
– Nearest centroid
Ranking
• Rank by score $\rho_{\text{prr-ipf}}(o)$
[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
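For a one-dimensional score, a nearest-centroid rule reduces to a threshold halfway between the mean score of the relevant and the mean score of the irrelevant training objects. The sketch below assumes exactly that reading; the function names are hypothetical.

```python
from statistics import mean


def fit_centroids(scores, labels):
    """Mean score of relevant and of irrelevant training objects."""
    relevant = mean(s for s, y in zip(scores, labels) if y)
    irrelevant = mean(s for s, y in zip(scores, labels) if not y)
    return relevant, irrelevant


def threshold(centroids):
    """Decision boundary implied by the two centroids."""
    return sum(centroids) / 2


def classify(score, centroids):
    """Label as relevant if the score is closer to the relevant centroid."""
    relevant, irrelevant = centroids
    return abs(score - relevant) < abs(score - irrelevant)


def rank(scored):
    """scored: dict object -> score. Highest score first."""
    return sorted(scored, key=scored.get, reverse=True)


centroids = fit_centroids([0.9, 0.7, 0.1, 0.3], [True, True, False, False])
```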
Evaluation
• Metrics:
– Ranking: ROC curves, AUC
– Classification: Precision, Recall, F1, Accuracy
• Cross validation:
– 10-times / 10-fold
– Averages
[Figure: data set. Seed: 99,951 entities, owl:sameAs; Exploration: 425,338 entities, 128,171 relevant; 1,728,633 links]
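The metrics listed above follow their standard definitions; a minimal sketch is given here for reference. AUC is computed as the probability that a randomly chosen relevant object is scored above a randomly chosen irrelevant one (ties count half). This is a generic formulation, not necessarily the evaluation code used for the talk.

```python
def classification_metrics(predicted, actual):
    """Precision, recall, F1 and accuracy from two lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(actual)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def auc(scores, actual):
    """Area under the ROC curve via pairwise comparison (quadratic, fine for small data)."""
    pos = [s for s, a in zip(scores, actual) if a]
    neg = [s for s, a in zip(scores, actual) if not a]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```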
Performance (Ranking)
[Figure: ROC curves for random, Schema Semantics, NB (all predicates), NB (present predicates), prf-ipf and prr-ipf; both axes run from 0 to 1, with an inset zooming into the range 0 to 0.05 on the x-axis and 0.95 to 1 on the y-axis]
Performance (Classification & Ranking)
Table 2. Average performance of approaches († indicates significant improvements at confidence level ρ = 0.01)
Method                    Recall   Precision  F1       Accuracy  AUC
Schema Semantics          0.1188   0.8119     0.2073   0.7262    0.5552
NB (all predicates)       0.9906   0.9491     †0.9694  †0.9812   0.9970
NB (observed predicates)  0.9943   0.9436     0.9683   0.9804    0.9968
prf-ipf                   0.8512   †0.9754    0.9091   0.9487    0.9958
prr-ipf                   †0.9973  0.9240     0.9592   0.9745    0.9769
performance in bold. Furthermore, we marked the results where we had a significant improvement over the second best method at confidence level of ρ = 0.01. The aggregated values basically confirm the observations made above. In general, when considering the measures F1, Accuracy and AUC, the Naive Bayes classifier making use of all predicates performs best. However, the advantage in comparison to the Naive Bayes classifier using only observed terms is negligible. In application scenarios, where a high Recall is of importance, instead, the prr-ipf approach achieves the best results with more than 99.7%. When focusing on Precision, prf-ipf performs best and demonstrated the highest values. More than 97% of the objects predicted to have geo-coordinates actually did provide such information. In a setting where we want to focus on promising items this might be the kind of performance the end user is looking for.
One explanation for the very high accuracy in general might also be the dataset. Given that we started the exploration from location entities on DBPedia and LinkedGeoData, the overall dataset was biased towards entities from DBPedia. Hence, we intend to extend the evaluation to see if the quality of the supervised approaches remains at a comparable level, when using larger and even more diverse datasets.
6 Related Work
Previous work related to this paper can be found in three areas, each of which will be described below: (a) Extraction of geographic entities provides a starting point for our approach. The fields of (b) focused crawling on the WWW and (c) machine learning applied to Linked Data in general each share some similarities with our classification and ranking task, although differences do exist.
6.1 Extraction of Geographic Entities
Work done in the TRIDEC project [7] examined how geographic databases such as Geonames, OpenStreetMap and GooglePlaces could be used to avoid the need for error prone named entity recognition and thus increase the overall precision when geoparsing large volumes of Twitter reports for crisis mapping. This work directly compared crisis maps from Twitter with official post-disaster environment agency impact assessments, highlighting just how accurate maps based on large-scale geospatial report crowd sourcing can be. We are building on this approach within the REVEAL project and extending
Summary
• Focused exploration feasible
• ML approach performing best
• Future work:
– Other data sets
– Generalise scenario (more than locations)
– Better approaches using more features
Questions?
Thomas Gottron Institute for Web Science and Technologies Universität Koblenz-Landau [email protected]