An Analysis of the Quality Issues of the Properties Available
in the Spanish DBpedia
Nandana Mihindukulasooriya, Mariano Rico, Raúl García-Castro, and Asunción Gómez-Pérez
Ontology Engineering Group (OEG), Departamento de Inteligencia Artificial
Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid
Acknowledgments: 4V (TIN2013-46238-C4-2-R) and LIDER (EU FP7 610782) projects
http://loupe.linkeddata.es
2
Collaborative editing in Wikipedia
Ontology Engineering Group, Universidad Politécnica de Madrid
3
Spontaneous data model creation
4
Can spontaneous data models support us in data quality
assessment?
But, first, what is the quality of such spontaneous data models?
5
DBpedia – Exposing Wikipedia’s content as Linked Data
[Figure labels: RDF triplestore, rendering, mappings, infobox]
6
esDBpedia – the Spanish DBpedia chapter
http://es.dbpedia.org/
7
Can esDBpedia’s spontaneous data model support us in data quality
assessment?
But, first, what is the quality of the properties of such a spontaneous
data model?
8
Quality Dimensions for Datasets
A. Conciseness. A dataset does not contain redundant concepts with different identifiers
B. Consistency. A dataset does not contain conflicting or contradictory data
C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules
D. Semantic Accuracy. Values correctly represent real world facts
9
Extraction and inspection of property statistics
http://loupe.linkeddata.es/
10
Information extracted about properties
Property statistics template (with example data)

General information
  URI: http://es.dbpedia.org/property/edad
  Local name: edad
  Namespace: http://es.dbpedia.org/property/
  Number of triples: 4623

Subject analysis
  IRI subject count: 4623 (100 %)
  Extracted domain classes (i.e., ?subject a ?class):
    dbpedia-owl:Agent 2611 (56.48 %)
    schema:Person 1515 (32.77 %)
    …

Object analysis
  URI object count: 186 (4.02 %)
  Extracted range classes (i.e., ?object a ?class):
    skos:Concept 17 (9.14 %)
    schema:Place 2 (1.08 %)
    …
  Literal object count: 4437 (95.98 %)
  Numerical object count: 2491 (53.88 %)
  Integer object count: 2382 (51.52 %)
  Average of numerics: 3.53
  Max numeric sample: 8.79E11, 1.5E8, 1.5E7, 8.2E6, 8121540
  Min numeric sample: -5, 0, 1, 1.08, 1.2
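The object-analysis counts in the template can be illustrated with a small sketch. It is not Loupe's implementation: the `Obj` record and the `object_stats` function are hypothetical names, and an object is assumed to be available as its lexical value plus a flag saying whether it is an IRI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """One object of a triple (hypothetical record for this sketch)."""
    value: str
    is_iri: bool


def _parses_as(cast, text):
    """Return True if cast(text) succeeds, e.g. cast=float or cast=int."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def object_stats(objects):
    """Count IRI, literal, numeric and integer objects of one property."""
    total = len(objects)
    literals = [o.value for o in objects if not o.is_iri]
    numerics = [v for v in literals if _parses_as(float, v)]
    integers = [v for v in numerics if _parses_as(int, v)]

    def pct(n):
        # Percentage of all objects, rounded as in the template.
        return round(100 * n / total, 2) if total else 0.0

    counts = {
        "iri": total - len(literals),
        "literal": len(literals),
        "numeric": len(numerics),
        "integer": len(integers),
    }
    return {name: (n, pct(n)) for name, n in counts.items()}


sample = [
    Obj("http://es.dbpedia.org/resource/Madrid", True),
    Obj("42", False),
    Obj("1.5", False),
    Obj("unknown", False),
]
stats = object_stats(sample)
```

Note that `float` also accepts strings such as "nan" or "inf", so a real profiler would probably need a stricter numeric test.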
11
Properties in esDBpedia
Property prefix                               Properties         Property values
                                              #        %         #             %
http://es.dbpedia.org/property/               19,885   52.53     18,021,389    10.66
http://dbpedia.org/property/                  17,188   45.40     9,742,710     5.76
http://dbpedia.org/ontology/                  576      1.52      86,602,281    51.21
http://xmlns.com/foaf/0.1/                    12       0.03      8,132,328     4.81
http://www.w3.org/1999/02/22-rdf-syntax-ns#   8        0.02      12,298,451    7.27
http://www.w3.org/2000/01/rdf-schema#         7        0.02      5,366,982     3.17
http://www.w3.org/2002/07/owl#                6        0.02      16,523,751    9.77
http://purl.org/dc/terms/#                    4        0.01      4,148,399     2.45
http://www.w3.org/2004/02/skos/core#          4        0.01      1,153,685     0.68
http://purl.org/dc/elements/1.1/              3        0.01      3,346,874     1.98
http://www.w3.org/ns/prov#                    1        0.00      2,853,681     1.69
Other prefixes                                163      0.43      911,131       0.54
Total                                         37,857   100       169,101,662   100
12
A. Conciseness
• Many redundant properties in esDBpedia
  • 97.93% are auto-generated
• Causes
  • Capitalization (857): partidosEnPrimera, partidosenprimera
  • Synonyms: causaDeMuerte, causaDeFallecimiento
  • Prepositions: causaDeFallecimiento, causaFallecimiento
  • Spelling (7,495): apeliido, apelldio, apellid
  • Singular/plural: apellido, apellidos
  • Gender: administrador, administradora
  • Accent usage (1,252): administracion, administración
  • Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
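Two of the causes above, capitalization and accent usage, can be found by grouping property local names under a normalised key. The following is a minimal sketch under that assumption, not the method used in the study; `normalise` and `redundancy_candidates` are hypothetical names.

```python
import unicodedata
from collections import defaultdict


def normalise(name):
    """Case-fold a local name and strip combining accents."""
    decomposed = unicodedata.normalize("NFD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def redundancy_candidates(names):
    """Group names that collapse to the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    # Only groups with two or more members are candidates for redundancy.
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = [
    "partidosEnPrimera",
    "partidosenprimera",
    "administracion",
    "administración",
    "apellido",
]
candidates = redundancy_candidates(names)
```

Stripping all combining marks also maps ñ to n, so results for Spanish names are only candidates and need manual review. Spelling variants, synonyms and singular/plural pairs would need other techniques, such as string similarity or a lexicon.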
13
B. Consistency
• OWL properties with IRI and literal values
  • 3,380 properties
  • Use of strings and URLs interchangeably
    • esdbpedia:lugarDeEntierro
      • "Madrid"@es
      • http://es.dbpedia.org/resource/Madrid
• Diverse and incorrect domain and range types
  • esdbpedia:edad has range of type dbo:Place
  • esdbpedia:lugarmuerte has range of type dbo:Person
  • esdbpedia:pais has range of type dbo:Actor
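Finding properties that take both IRI and literal objects can be sketched as follows. This is an illustration only: the triples are assumed to be given as hypothetical `(property, object, is_iri)` tuples, and `mixed_object_properties` is not a function from Loupe or DBpedia.

```python
from collections import defaultdict


def mixed_object_properties(triples):
    """Return the properties whose objects include both IRIs and literals."""
    kinds = defaultdict(set)
    for prop, _obj, is_iri in triples:
        kinds[prop].add("iri" if is_iri else "literal")
    return sorted(prop for prop, seen in kinds.items() if len(seen) == 2)


triples = [
    ("esdbpedia:lugarDeEntierro", "Madrid", False),
    ("esdbpedia:lugarDeEntierro", "http://es.dbpedia.org/resource/Madrid", True),
    ("esdbpedia:edad", "42", False),
]
mixed = mixed_object_properties(triples)
```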
14
C. Syntactic Validity
• IRIs represented as strings
  • Many properties with IRI values and very few string values
  • Common cause: IRIs encoded as strings -> “http://...”@es
• Numerical values represented as strings
  • 3,675 properties with more than 99% integer objects and very few string literals
  • Common cause: numerics encoded as strings -> “2^^xsd:integer”
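The two common causes above can be spotted with simple pattern checks on string literals. The sketch below is one possible heuristic, not the authors' procedure; the category labels and the `classify_string_literal` name are assumptions made for illustration.

```python
import re

# A string that is nothing but an http(s) IRI.
IRI_LIKE = re.compile(r"^https?://\S+$")
# A datatype annotation that ended up inside the string, e.g. "2^^xsd:integer".
TYPED_IN_STRING = re.compile(r"^.+\^\^xsd:\w+$")


def classify_string_literal(text):
    """Guess what a string literal was meant to be."""
    stripped = text.strip()
    if IRI_LIKE.match(stripped):
        return "iri-as-string"
    if TYPED_IN_STRING.match(stripped):
        return "typed-literal-as-string"
    try:
        float(stripped)
    except ValueError:
        return "plain-string"
    return "number-as-string"


labels = [
    classify_string_literal(text)
    for text in ("http://es.dbpedia.org/resource/Madrid", "2^^xsd:integer", "42", "Madrid")
]
```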
15
D. Semantic Accuracy
• Outliers
  • Numerical values allow an automatic analysis
  • Properties such as diameter or edad with negative values
• Harder to detect automatically
  • Our plan is to use data fusion approaches to compare values from multiple sources
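An automatic check for numeric outliers could look like the sketch below. It combines a sign rule with a modified z-score based on the median absolute deviation (MAD). This technique is swapped in for illustration, since the slides do not say which method was used. The `flag_outliers` name, the 3.5 threshold and the sample values are assumptions.

```python
from statistics import median


def flag_outliers(values, non_negative=True, threshold=3.5):
    """Flag negative values and values far from the median."""
    flagged = []
    if not values:
        return flagged
    med = median(values)
    mad = median([abs(v - med) for v in values])
    for v in values:
        if non_negative and v < 0:
            # Properties such as edad (age) cannot be negative.
            flagged.append((v, "negative"))
        elif mad > 0 and 0.6745 * abs(v - med) / mad > threshold:
            # Modified z-score above the threshold.
            flagged.append((v, "extreme"))
    return flagged


ages = [25, 30, 41, 67, -5, 8121540]
suspicious = flag_outliers(ages)
```

The sign rule only applies to properties whose values cannot be negative, so `non_negative` would have to be set per property.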
16
Conclusions and future work
• DBpedia’s spontaneous data model can support quality assessment
• Errors in DBpedia are introduced in many stages
  • Crowd-sourced data
  • Mappings
  • Extraction framework
• Some errors can be eliminated with pre-processing and cleaning
• Quality assessment currently semi-automatic
  • Currently working towards its automation
• We plan to investigate if the quality issues are the same in other DBpedia instances
Questions?
http://loupe.linkeddata.es/