Linked Data Quality Assessment: A Survey

Post on 14-Jan-2017


Data Quality Assessment for Linked Data: A Survey

Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, Sören Auer

Data Quality Tutorial, September 12, 2016

Outline

Survey Methodology

LDQ Dimensions and Metrics

LDQ Assessment Tools

LDQ In Practice


Survey Methodology — Steps I

Related Surveys

Research Questions

Eligibility Criteria

Search Strategy

Title & Abstract Reviewing


Survey Methodology — Research Questions

• How can one assess the quality of Linked Data employing a conceptual framework integrating prior approaches?

• What are the data quality problems that each approach assesses?

• Which are the data quality dimensions and metrics supported by the proposed approaches?

• What kinds of tools are available for data quality assessment?


Survey Methodology — Eligibility Criteria

Inclusion criteria:

Must satisfy:

• published between 2002 and 2014.

Should satisfy:

• data quality assessment

• trust assessment

• proposed and/or implemented an approach

• assessed the quality of LD or information systems based on LD

Exclusion criteria:

• not peer-reviewed

• published as a poster abstract

• data quality management

• other forms of structured data

• did not propose any methodology or framework


Survey Methodology — Steps

Remove duplicates

Further potential articles

Compare short-listed articles

Quantitative analysis

Qualitative analysis


Survey Methodology — Results


30 core articles

Conference - 21

Journal - 8

Master's Thesis - 1

18 Dimensions

69 Metrics


LDQ Dimensions & Metrics

• Data Quality: commonly conceived as a multi-dimensional construct, popularly defined as ‘fitness for use’*.

• Dimension: a characteristic of a dataset.

• Metric (or indicator): a procedure for measuring an information quality dimension.


*Juran et al., The Quality Control Handbook, 1974

18 LDQ Dimensions


LDQ Dimensions - Accessibility dimensions & metrics

• Availability - extent to which data (or some portion of it) is present, obtainable and ready for use

• accessibility of the SPARQL endpoint and the server

• dereferenceability of the URI

• Interlinking - degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources

• detection of the existence and usage of external URIs

• detection of all local in-links or back-links: all triples from a dataset that have the resource’s URI as the object
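The two interlinking metrics above can be illustrated with a minimal Python sketch. It assumes the dataset is already loaded as a list of (subject, predicate, object) string tuples. The function names `in_links` and `external_uris`, the `local_prefix` parameter and the sample triples are hypothetical and do not come from the survey.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def in_links(triples: Iterable[Triple], resource_uri: str) -> List[Triple]:
    """Return all triples that have resource_uri as their object (back-links)."""
    return [t for t in triples if t[2] == resource_uri]


def external_uris(triples: Iterable[Triple], local_prefix: str) -> Set[str]:
    """Return object URIs that lie outside the dataset's own namespace."""
    return {
        o for _, _, o in triples
        if o.startswith("http") and not o.startswith(local_prefix)
    }


# Illustrative data only (assumed, not from the survey).
SAMPLE: List[Triple] = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/c", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/A"),
]
```

A real assessment would read the triples from an RDF parser or a SPARQL endpoint rather than from an in-memory list.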


LDQ Dimensions - Representational dimensions & metrics

• Interoperability - degree to which the format and structure of the information conforms to previously returned information as well as data from other sources

• detection of whether existing terms from all relevant vocabularies for that particular domain have been reused

• usage of existing vocabularies for a particular domain

• Interpretability - refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether the machine is able to process the data

• detection of invalid usage of undefined classes and properties

• detecting the use of appropriate language, symbols, units, datatypes and clear definitions
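As a rough illustration of the first interpretability metric, the sketch below flags classes and properties that appear in the data but are not declared in a known vocabulary. It assumes triples are plain string tuples and that the vocabulary's declared terms are available as two sets. The function name `undefined_terms` and the `ex:` terms in the usage note are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def undefined_terms(
    triples: Iterable[Triple],
    defined_classes: Set[str],
    defined_properties: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (classes, properties) used in the data but not declared in the vocabulary."""
    triples = list(triples)  # allow two passes over any iterable
    # Classes are the objects of rdf:type statements.
    used_classes = {o for _, p, o in triples if p == RDF_TYPE}
    # Every other predicate counts as a used property.
    used_properties = {p for _, p, _ in triples if p != RDF_TYPE}
    return used_classes - defined_classes, used_properties - defined_properties
```

For example, with `ex:Person` and `ex:name` declared, a triple using the misspelled property `ex:nmae` would be reported in the second set.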


LDQ Dimensions - Intrinsic dimensions & metrics

• Syntactic Validity - degree to which an RDF document conforms to the specification of the serialization format

• detecting syntax errors using (i) validators, (ii) via crowdsourcing

• detecting syntactically inaccurate values by (i) use of an explicit definition of the allowed values for a datatype, (ii) syntactic rules (type of characters allowed and/or the pattern of literal values)
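The idea of syntactic rules for literal values can be sketched with regular expressions. This is a minimal example under stated assumptions: the `PATTERNS` table is hypothetical and deliberately simplified (the date pattern ignores time zones and does not check month or day ranges), and literals are given as (value, datatype) pairs.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical, simplified per-datatype patterns; real rules would follow
# each datatype's specification.
PATTERNS = {
    "xsd:integer": re.compile(r"[+-]?\d+"),
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def invalid_literals(literals: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (value, datatype) pairs whose value violates the datatype's pattern."""
    bad = []
    for value, datatype in literals:
        pattern = PATTERNS.get(datatype)
        # Datatypes without a known pattern are skipped, not reported.
        if pattern is not None and pattern.fullmatch(value) is None:
            bad.append((value, datatype))
    return bad
```

A validator-based check (option (i) on the slide) would instead hand the whole document to an RDF parser and collect its error reports.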


LDQ Dimensions - Intrinsic dimensions & metrics

• Completeness

• Schema - ontology completeness

• no. of classes and properties represented / total no. of classes and properties

• Property - missing values for a specific property

• no. of values represented for a specific property / total no. of values for a specific property

• Population - percentage of all real-world objects of a particular type that are represented in the dataset

• Interlinking - degree to which instances in the dataset are interlinked
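The completeness ratios above are simple fractions, as the following sketch shows. It assumes that the "total" is known from a reference source (a gold standard or the full ontology), and, for property completeness, that each record is expected to carry exactly one value for the property. The function names and the dictionary-based record layout are hypothetical.

```python
from typing import Dict, Iterable, Optional


def completeness(represented: int, total: int) -> float:
    """Ratio of represented items to the expected total, e.g. for schema completeness."""
    if total <= 0:
        raise ValueError("total must be positive")
    return represented / total


def property_completeness(records: Iterable[Dict[str, Optional[str]]], prop: str) -> float:
    """Share of records that carry a non-empty value for prop."""
    records = list(records)
    present = sum(1 for r in records if r.get(prop) not in (None, ""))
    return completeness(present, len(records))
```

For instance, a dataset that uses 3 of the 4 classes and properties defined in its ontology has a schema completeness of 0.75.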


LDQ Dimensions - Contextual dimensions & metrics

• Understandability - refers to the ease with which data can be comprehended without ambiguity and be used by a human information consumer

• human-readable labelling of classes, properties and entities as well as presence of metadata

• indication of the vocabularies used in the dataset

• Timeliness - measures how up-to-date data is relative to a specific task

• freshness of datasets based on currency and volatility

• freshness of datasets based on their data source
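The slide does not give a formula for freshness. One common formulation in the data quality literature, used here purely as an assumption, scores timeliness as max(0, 1 - currency / volatility), where currency is the age of the data and volatility is how long the data stays valid. The function name and the choice of days as the unit are hypothetical.

```python
def timeliness(currency_days: float, volatility_days: float) -> float:
    """Score in [0, 1]: 1 means fresh, 0 means older than the validity period.

    Assumed formula: max(0, 1 - currency / volatility).
    """
    if volatility_days <= 0:
        raise ValueError("volatility_days must be positive")
    if currency_days < 0:
        raise ValueError("currency_days must not be negative")
    return max(0.0, 1.0 - currency_days / volatility_days)
```

Under this assumption, data that is 30 days old and stays valid for 120 days scores 0.75, while data older than its validity period scores 0.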


LDQ Assessment Tools


LDQ Assessment Tools - RDFUnit

http://aksw.org/Projects/RDFUnit.html

Syntactic Validity

Semantic Accuracy

Consistency

LDQ Assessment Tools - Dacura

http://dacura.cs.tcd.ie/about-dacura/

Interpretability

Semantic Accuracy

Consistency


Linked Data Quality — In Practice


Linked Data Quality

Methodologies

Tools

Use Cases

Beyond Data

Vocabulary


Crowdsourcing Linked Data Quality Assessment

LDQ Assessment Tools — Luzzu

http://eis-bonn.github.io/Luzzu/index.html

1 Metric

2 Assess

3 Clean

4 Store

5 Rank

LDQ Assessment Tools — LODLaundromat

http://lodlaundromat.org/

LDQ Use Cases — Open Data Portals


Automated Quality Assessment of Metadata across Open Data Portals. Neumaier et al., JDIQ 2016.

Completeness Interoperability

Relevancy Accuracy

Openness

LDQ Beyond Data — Mapping Quality


Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality. ISWC 2015.

https://github.com/RMLio/RML-Validator


W3C Data Quality Vocabulary

https://www.w3.org/TR/vocab-dqv/

[Figure: DQV class diagram]

Classes: dqv:Category, dqv:Dimension, dqv:Metric, dqv:QualityMeasurement (qb:Observation), dqv:QualityMeasurementDataset (qb:DataSet)

Properties: dqv:inDimension, dqv:inCategory, dqv:isMeasurementOf, dqv:hasQualityMeasurement
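To show how the DQV terms above fit together, the sketch below builds the triples for a single quality measurement as plain Python tuples. It is an assumption-laden illustration, not an official DQV example: the function name `dqv_measurement` and every `ex:` identifier in the usage note are hypothetical, and `dqv:value` is taken from the DQV specification rather than from the slide.

```python
from typing import List, Tuple

Triple = Tuple[str, str, object]


def dqv_measurement(
    dataset: str,
    measurement: str,
    metric: str,
    dimension: str,
    category: str,
    value: float,
) -> List[Triple]:
    """Describe one quality measurement of a dataset using DQV terms."""
    return [
        (metric, "dqv:inDimension", dimension),      # metric belongs to a dimension
        (dimension, "dqv:inCategory", category),     # dimension belongs to a category
        (measurement, "rdf:type", "dqv:QualityMeasurement"),
        (measurement, "dqv:isMeasurementOf", metric),
        (measurement, "dqv:value", value),           # dqv:value: from the DQV spec, not the slide
        (dataset, "dqv:hasQualityMeasurement", measurement),
    ]
```

Calling `dqv_measurement("ex:myDataset", "ex:m1", "ex:propertyCompleteness", "ex:completeness", "ex:intrinsic", 0.75)` yields six triples that could be serialized to Turtle with an RDF library.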

Challenges

• Propagation of errors

• Management/Improvement

• Usage of the standard vocabulary

• Quality-based search engines


Thank you!

Questions?

amrapali@stanford.edu @AmrapaliZ

A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer. Quality assessment for linked data: A survey. Semantic Web 7 (1), 63-93.