An analysis of the quality issues of the properties available in the Spanish DBpedia

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia. Nandana Mihindukulasooriya, Mariano Rico, Raúl García-Castro, and Asunción Gómez-Pérez. Ontology Engineering Group (OEG), Departamento de Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid. Acknowledgments: 4V (TIN2013-46238-C4-2-R) and LIDER (EU FP7 610782) projects. http://loupe.linkeddata.es

Transcript of An analysis of the quality issues of the properties available in the Spanish DBpedia

Page 1: An analysis of the quality issues of the properties available in the Spanish DBpedia

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia

Nandana Mihindukulasooriya, Mariano Rico, Raúl García-Castro, and Asunción Gómez-Pérez

Ontology Engineering Group (OEG), Departamento de Inteligencia Artificial

Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid

Acknowledgments: 4V (TIN2013-46238-C4-2-R) and LIDER (EU FP7 610782) projects

http://loupe.linkeddata.es

Page 2: An analysis of the quality issues of the properties available in the Spanish DBpedia

Collaborative editing in Wikipedia

Ontology Engineering Group, Universidad Politécnica de Madrid

Page 3: An analysis of the quality issues of the properties available in the Spanish DBpedia

Spontaneous data model creation

Page 4: An analysis of the quality issues of the properties available in the Spanish DBpedia

Can spontaneous data models support us in data quality assessment?

But, first, what is the quality of such spontaneous data models?

Page 5: An analysis of the quality issues of the properties available in the Spanish DBpedia

DBpedia – Exposing Wikipedia’s content as Linked Data

[Figure: diagram with the labels infobox, mappings, RDF Triplestore, and Rendering]

Page 6: An analysis of the quality issues of the properties available in the Spanish DBpedia

esDBpedia – the Spanish DBpedia chapter

http://es.dbpedia.org/

Page 7: An analysis of the quality issues of the properties available in the Spanish DBpedia

Can esDBpedia’s spontaneous data model support us in data quality assessment?

But, first, what is the quality of the properties of such a spontaneous data model?

Page 8: An analysis of the quality issues of the properties available in the Spanish DBpedia

Quality Dimensions for Datasets

A. Conciseness. A dataset does not contain redundant concepts with different identifiers

B. Consistency. A dataset does not contain conflicting or contradictory data

C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules

D. Semantic Accuracy. Values correctly represent real world facts

Page 9: An analysis of the quality issues of the properties available in the Spanish DBpedia

Extraction and inspection of property statistics

http://loupe.linkeddata.es/

Page 10: An analysis of the quality issues of the properties available in the Spanish DBpedia

Information extracted about properties

Property statistics template, with example data

General information
  URI: http://es.dbpedia.org/property/edad
  Local name: edad
  Namespace: http://es.dbpedia.org/property/
  Number of triples: 4623

Subject analysis
  IRI subject count: 4623 (100 %)
  Extracted domain classes (i.e., ?subject a ?class):
    dbpedia-owl:Agent 2611 (56.48 %)
    schema:Person 1515 (32.77 %)
    …

Object analysis
  URI object count: 186 (4.02 %)
  Extracted range classes (i.e., ?object a ?class):
    skos:Concept 17 (9.14 %)
    schema:Place 2 (1.08 %)
    …
  Literal object count: 4437 (95.98 %)
  Numerical object count: 2491 (53.88 %)
  Integer object count: 2382 (51.52 %)
  Average of numerics: 3.53
  Max numeric sample: 8.79E11, 1.5E8, 1.5E7, 8.2E6, 8121540
  Min numeric sample: -5, 0, 1, 1.08, 1.2
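
The statistics in this template can be approximated with a few counters. The following is a minimal sketch in Python, not the Loupe implementation: the toy triples, the angle-bracket convention for IRIs, and the function name property_stats are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: (subject, property, object) triples.
# Terms in angle brackets are treated as IRIs, everything else as a literal.
TRIPLES = [
    ("<es:Ana>", "<prop:edad>", "34"),
    ("<es:Luis>", "<prop:edad>", "-5"),
    ("<es:Eva>", "<prop:edad>", "1.08"),
    ("<es:Juan>", "<prop:edad>", "<es:Madrid>"),
    ("<es:Ana>", "<rdf:type>", "<schema:Person>"),
    ("<es:Luis>", "<rdf:type>", "<schema:Person>"),
    ("<es:Madrid>", "<rdf:type>", "<schema:Place>"),
]


def is_iri(term):
    return term.startswith("<") and term.endswith(">")


def as_number(term):
    """Return the literal as a float, or None if it is not numeric."""
    try:
        return float(term)
    except ValueError:
        return None


def property_stats(triples, prop, type_prop="<rdf:type>"):
    # Index the classes of every typed resource (?resource a ?class).
    classes = {}
    for s, p, o in triples:
        if p == type_prop:
            classes.setdefault(s, set()).add(o)

    subjects = [s for s, p, _ in triples if p == prop]
    objects = [o for _, p, o in triples if p == prop]
    iri_objects = [o for o in objects if is_iri(o)]
    literals = [o for o in objects if not is_iri(o)]
    numbers = [n for n in map(as_number, literals) if n is not None]

    return {
        "triples": len(objects),
        "iri_objects": len(iri_objects),
        "literal_objects": len(literals),
        "numeric_objects": len(numbers),
        "integer_objects": sum(1 for n in numbers if n.is_integer()),
        "domain_classes": Counter(c for s in subjects for c in classes.get(s, ())),
        "range_classes": Counter(c for o in iri_objects for c in classes.get(o, ())),
        "min_numeric": min(numbers) if numbers else None,
        "max_numeric": max(numbers) if numbers else None,
    }


stats = property_stats(TRIPLES, "<prop:edad>")
print(stats)
```

On a real dataset the same counts would more likely come from SPARQL aggregate queries against the triple store; the in-memory version only shows which quantities the template records.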

Page 11: An analysis of the quality issues of the properties available in the Spanish DBpedia

Properties in esDBpedia

Property prefix                                Properties          Property values
                                               #         %         #              %
http://es.dbpedia.org/property/                19,885    52.53     18,021,389     10.66
http://dbpedia.org/property/                   17,188    45.40     9,742,710      5.76
http://dbpedia.org/ontology/                   576       1.52      86,602,281     51.21
http://xmlns.com/foaf/0.1/                     12        0.03      8,132,328      4.81
http://www.w3.org/1999/02/22-rdf-syntax-ns#    8         0.02      12,298,451     7.27
http://www.w3.org/2000/01/rdf-schema#          7         0.02      5,366,982      3.17
http://www.w3.org/2002/07/owl#                 6         0.02      16,523,751     9.77
http://purl.org/dc/terms/#                     4         0.01      4,148,399      2.45
http://www.w3.org/2004/02/skos/core#           4         0.01      1,153,685      0.68
http://purl.org/dc/elements/1.1/               3         0.01      3,346,874      1.98
http://www.w3.org/ns/prov#                     1         0.00      2,853,681      1.69
Other prefixes                                 163       0.43      911,131        0.54
Total                                          37,857    100       169,101,662    100

Page 12: An analysis of the quality issues of the properties available in the Spanish DBpedia

A. Conciseness

• Many redundant properties in esDBpedia
  • 97.93% are auto-generated

• Causes
  • Capitalization (857): partidosEnPrimera, partidosenprimera
  • Synonyms: causaDeMuerte, causaDeFallecimiento
  • Prepositions: causaDeFallecimiento, causaFallecimiento
  • Spelling (7,495): apeliido, apelldio, apellid
  • Singular/plural: apellido, apellidos
  • Gender: administrador, administradora
  • Accent usage (1,252): administracion, administración
  • Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
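
Some of these causes lend themselves to simple string checks. The sketch below is one possible approach, not the method used in the study: it groups property local names that differ only in capitalization or accents, and uses a similarity ratio to suggest likely misspellings. The function names and the 0.85 threshold are assumptions chosen for illustration; synonyms such as causaDeMuerte and causaDeFallecimiento would not be caught this way.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a property name and strip its accents."""
    decomposed = unicodedata.normalize("NFD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


def group_variants(names):
    """Group names that are identical after case and accent folding."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def likely_misspellings(names, threshold=0.85):
    """Return pairs of distinct normalized names that are very similar."""
    keys = sorted({normalize(name) for name in names})
    pairs = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if SequenceMatcher(None, first, second).ratio() >= threshold:
                pairs.append((first, second))
    return pairs


names = [
    "partidosEnPrimera", "partidosenprimera",
    "administracion", "administración",
    "apellido", "apeliido", "edad",
]
print(group_variants(names))
print(likely_misspellings(names))
```

The pairwise comparison is quadratic in the number of properties, so for tens of thousands of names it would need blocking (for example, comparing only names that share a prefix) before it is practical.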

Page 13: An analysis of the quality issues of the properties available in the Spanish DBpedia

B. Consistency

• OWL properties with IRI and literal values
  • 3,380 properties
  • Use of strings and URLs interchangeably
    • esdbpedia:lugarDeEntierro
      • "Madrid"@es
      • http://es.dbpedia.org/resource/Madrid

• Diverse and incorrect domain and range types
  • esdbpedia:edad has range of type dbo:Place
  • esdbpedia:lugarmuerte has range of type dbo:Person
  • esdbpedia:pais has range of type dbo:Actor
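
The first inconsistency, properties whose values are sometimes IRIs and sometimes literals, can be found by recording which kinds of object each property takes. The following is a minimal sketch under assumed conventions: objects are modelled as (kind, value) pairs, and the subjects ex:PersonA, ex:PersonB and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples; each object is a (kind, value) pair where
# kind is either "iri" or "literal".
TRIPLES = [
    ("ex:PersonA", "esdbpedia:lugarDeEntierro", ("literal", "Madrid")),
    ("ex:PersonB", "esdbpedia:lugarDeEntierro",
     ("iri", "http://es.dbpedia.org/resource/Madrid")),
    ("ex:PersonA", "esdbpedia:edad", ("literal", "82")),
    ("ex:PersonB", "esdbpedia:edad", ("literal", "47")),
]


def mixed_value_properties(triples):
    """Return the properties that have both IRI and literal objects."""
    kinds_by_property = defaultdict(set)
    for _subject, prop, (kind, _value) in triples:
        kinds_by_property[prop].add(kind)
    return sorted(
        prop for prop, kinds in kinds_by_property.items()
        if kinds == {"iri", "literal"}
    )


print(mixed_value_properties(TRIPLES))
```

A property flagged this way is only a candidate: whether the IRI or the literal form is the intended one still has to be decided per property.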

Page 14: An analysis of the quality issues of the properties available in the Spanish DBpedia

C. Syntactic Validity

• IRIs represented as strings
  • Many properties with IRI values and very few string values
  • Common cause:
    • IRIs encoded as strings -> "http://..."@es

• Numerical values represented as strings
  • 3,675 properties with more than 99% integer objects and very few string literals
  • Common cause:
    • Numerics encoded as strings -> "2^^xsd:integer"
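
Both patterns can be spotted by inspecting the lexical form of string literals. The sketch below is an illustration rather than the authors' tooling: the regular expressions, the category labels, and the function name classify_string_literal are assumptions, and the set of xsd datatypes it recognizes is deliberately small.

```python
import re

# A string whose whole content is an HTTP(S) address.
IRI_LIKE = re.compile(r"^https?://\S+$")
# A number followed by a datatype marker that ended up inside the string,
# as in "2^^xsd:integer".
TYPED_NUMERIC = re.compile(
    r"^[+-]?\d+(\.\d+)?\^\^xsd:(integer|decimal|double|float)$"
)
# A bare number stored as a string.
PLAIN_NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def classify_string_literal(value):
    """Label a string literal by what its content appears to encode."""
    text = value.strip()
    if IRI_LIKE.match(text):
        return "iri-as-string"
    if TYPED_NUMERIC.match(text):
        return "typed-numeric-as-string"
    if PLAIN_NUMERIC.match(text):
        return "numeric-as-string"
    return "plain-string"


for sample in [
    "http://es.dbpedia.org/resource/Madrid",
    "2^^xsd:integer",
    "42",
    "Madrid",
]:
    print(sample, "->", classify_string_literal(sample))
```

Applied to the few string values of a property that is otherwise almost entirely IRI-valued or integer-valued, a check like this separates encoding mistakes from genuinely textual values.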

Page 15: An analysis of the quality issues of the properties available in the Spanish DBpedia

D. Semantic Accuracy

• Outliers
  • Numerical values allow an automatic analysis
  • Properties such as diameter or edad with negative values

• Harder to detect automatically
  • Our plan is to use data fusion approaches to compare values from multiple sources
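
For the numeric case, the simplest automatic check is a plausibility range per property. The sketch below assumes such ranges are available; the bounds shown for edad and diameter are hypothetical values picked for illustration, not figures from the study, and the sample values echo the minimum and maximum samples reported for edad earlier.

```python
# Hypothetical plausibility bounds (inclusive) per property local name.
BOUNDS = {
    "edad": (0, 130),
    "diameter": (0, float("inf")),
}


def out_of_range(values, lower, upper):
    """Return the values that fall outside [lower, upper]."""
    return [v for v in values if not (lower <= v <= upper)]


def flag_outliers(observations, bounds=BOUNDS):
    """Map each property to its implausible values, skipping clean ones."""
    flagged = {}
    for prop, values in observations.items():
        if prop not in bounds:
            continue  # no bounds known, nothing to check
        lower, upper = bounds[prop]
        bad = out_of_range(values, lower, upper)
        if bad:
            flagged[prop] = bad
    return flagged


observations = {
    "edad": [34, -5, 8.79e11, 1.08],
    "diameter": [12.5, -3.0],
    "nombre_count": [3, 4],
}
print(flag_outliers(observations))
```

A range check only catches values that are impossible for the property. Values that are plausible but wrong are the harder case mentioned above, which is where comparing several sources would come in.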

Page 16: An analysis of the quality issues of the properties available in the Spanish DBpedia

Conclusions and future work

• DBpedia’s spontaneous data model can support quality assessment

• Errors in DBpedia are introduced in many stages
  • Crowd-sourced data
  • Mappings
  • Extraction framework

• Some errors can be eliminated with pre-processing and cleaning

• Quality assessment currently semi-automatic
  • Currently working towards its automation

• We plan to investigate if the quality issues are the same in other DBpedia instances

Page 17: An analysis of the quality issues of the properties available in the Spanish DBpedia

Questions?

http://loupe.linkeddata.es/