Behshid Behkamal, Date: 1388/11/14

Data Quality Definition

Quality of Linked Data

Data Quality Dimensions

Data Quality Model

Some definitions

data quality: degree to which the characteristics of data satisfy stated and implied needs when used under specified conditions

data quality characteristic: category of data quality attributes that bears on data quality

data quality measure: variable to which a value is assigned as the result of measurement of a data quality characteristic

Classification of Data Quality Problems

[Diagram: data quality problems are split into single-source problems and multi-source problems. Each branch is divided into schema-related and instance-specific problems, and each of these can arise at the attribute, record, record type, and source levels.]

Single Source Problem - Schema Related

Multi Source Problem - Schema Related

Measuring Data Quality in Data Warehousing – 2001 [1]

Data Quality Dimensions – 2003 [2]

Task independent: metrics reflect states of the data without contextual knowledge of the application, and can be applied to any data set, regardless of the task at hand.

Task dependent: metrics include the organization’s business rules, company and government regulations, and constraints provided by the database administrator, and are developed in specific application contexts.

Task Independent

Task Dependent

Dimensions of Data Quality – 2005 [3]

Process: Dimensions of DQ related to the generation, assembly, description and maintenance of data

- Reliability (with several sub dimensions), Metadata, Security and Confidentiality.

Data: Dimensions of DQ specifically associated with the data themselves.

- Record/table level: Accuracy, Completeness, Consistency and Validity

- Database level: Identifiability and Joinability.

User: Dimensions of DQ related to use and users

- Accessibility, Interpretability, Relevance and Timeliness.

Dimensions of Data Quality – 2006 [4]

Depth of Data Quality
• Accuracy
• Completeness
• Validity
• Currentness

Width of Data Quality
• Consistency
• Integration

Dimensions of Data Quality – 2008 [5]

User-based: Consistent representation, Interpretability, Ease of understanding, Concise representation, Timeliness, Completeness, Value-added, Relevance, Appropriate, Meaningfulness, Lack of confusion, Arrangement, Readable, Reasonable

System: Data deficiencies, Design deficiencies, Operation deficiencies

Inherent IQ: Accuracy, Cost, Objectivity, Believability, Reputation, Accessibility, Correctness, Unambiguous, Consistency

Intuitive: Precision, Reliability, Freedom from bias

ISO/IEC 25012 Data Quality Model – 2008 [6]

The ISO/IEC 25012 data quality model categorizes quality attributes into fifteen characteristics, considered from two points of view:

– Inherent data quality refers to the data itself, in particular to:
  - data domain values and possible restrictions
  - relationships of data values
  - metadata

– System dependent data quality depends on the technological domain in which data are used:
  - computer systems' components such as hardware devices (precision)
  - computer system software (recoverability)
  - other software (portability)

Inherent data quality

From the inherent point of view, data quality refers to data itself, in particular to:

data domain values and possible restrictions (e.g. business rules governing the quality required for the characteristic in a given application);

relationships of data values (e.g. consistency);

metadata.

System dependent data quality

System dependent data quality refers to the degree to which data quality is reached and preserved within a computer system when data is used under specified conditions.

From this point of view, data quality depends on the technological domain in which data are used; it is achieved by the capabilities of:

– computer systems' components such as hardware devices (e.g. to make data available or to obtain the required precision),

– computer system software (e.g. backup software to achieve recoverability),

– other software (e.g. migration tools to achieve portability).

1. Accuracy

The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.

– Syntactic accuracy– Semantic accuracy

Measurement Function A/B, with three variants:

– A: number of records in which all attributes are accurate; B: total number of records in the dataset

– A: number of records with the specified field syntactically accurate; B: number of records

– A: number of attribute values that are accurate; B: records × attributes
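The record-level and attribute-level A/B ratios can be sketched as below. This is only an illustration: it assumes that a trusted reference copy of each record is available to compare against, and the function name `accuracy_ratios` and the sample data are hypothetical.

```python
def accuracy_ratios(records, reference):
    """Return (record-level A/B, attribute-level A/B).

    Assumption: `reference` holds the true values, in the same order as `records`.
    """
    total_records = len(records)
    total_values = sum(len(rec) for rec in records)  # records x attributes
    accurate_records = 0
    accurate_values = 0
    for rec, ref in zip(records, reference):
        matches = [rec[key] == ref.get(key) for key in rec]
        accurate_values += sum(matches)   # attribute values that are accurate
        accurate_records += all(matches)  # records with every attribute accurate
    return accurate_records / total_records, accurate_values / total_values


# Made-up sample: the second record has a wrong age.
records = [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 41}]
truth = [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 40}]
record_level, attribute_level = accuracy_ratios(records, truth)
print(record_level, attribute_level)
```

One wrong value lowers the record-level ratio more sharply than the attribute-level ratio, which is why the two variants are reported separately.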

2. Completeness

The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.

Measurement Function A/B, with three variants:

– A: number of records with no missing attribute; B: total number of records in the dataset

– A: number of data required for the particular context in the data file; B: number of data in the specified particular context of intended use

– A: number of attribute fields containing values; B: records × attributes
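A minimal sketch of the record-level and field-level completeness ratios follows. It assumes that a missing value is represented by `None` or an empty string; the function name and the sample records are hypothetical.

```python
def completeness_ratios(records, expected_attributes):
    """Return (records with no missing attribute / records,
    filled attribute fields / (records x attributes))."""
    total_cells = len(records) * len(expected_attributes)
    filled_cells = 0
    complete_records = 0
    for rec in records:
        # Assumption: None and "" both count as missing.
        filled = [rec.get(attr) not in (None, "") for attr in expected_attributes]
        filled_cells += sum(filled)
        complete_records += all(filled)
    return complete_records / len(records), filled_cells / total_cells


# Made-up sample with two incomplete records out of four.
people = [
    {"name": "Ann", "email": "ann@example.org"},
    {"name": "Bob", "email": None},
    {"name": "", "email": "c@example.org"},
    {"name": "Dee", "email": "dee@example.org"},
]
print(completeness_ratios(people, ["name", "email"]))
```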

3. Consistency

The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use.

A particular case of inconsistency is represented by synonyms: a dictionary of terms used to define data could be useful to avoid it.

EXAMPLE An employee's birth date cannot be later than his “recruitment date”.
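The employee example can be expressed as a rule that is checked against every record. The sketch below assumes each employee is a dictionary with `birth_date` and `recruitment_date` fields; those field names and the sample data are illustrative only.

```python
from datetime import date


def find_inconsistent(employees):
    """Return the ids of employees whose birth date is later than their recruitment date."""
    return [e["id"] for e in employees if e["birth_date"] > e["recruitment_date"]]


employees = [
    {"id": 1, "birth_date": date(1980, 5, 1), "recruitment_date": date(2005, 3, 15)},
    {"id": 2, "birth_date": date(1999, 7, 9), "recruitment_date": date(1990, 1, 1)},
]
print(find_inconsistent(employees))
```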

4. Credibility (validity)

Validity is a weakened but more readily measured form of accuracy.

Attribute values may be valid without being correct, but not vice versa.

An attribute value is valid if it falls in some externally defined, domain-knowledge-dependent set of values.

Validity can range from:
– Mechanical (Example: 18/19/2002 is not well formed and so is not a valid date)
– Logical (Example: -5 is not a valid age)
– Domain-derived (Example: 1234 pounds is not a valid weight for a person)
– Task dependent (Example: 16:12 may be a valid time in one database but not in another)
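The mechanical, logical and domain-derived levels can be sketched as three small checks. The date format, the upper age bound and the weight bound used here are assumptions chosen for illustration, not values taken from the cited sources.

```python
from datetime import datetime


def is_well_formed_date(text, fmt="%d/%m/%Y"):
    """Mechanical validity: the string must parse as a date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def is_valid_age(age, max_age=130):
    """Logical validity: an age cannot be negative (max_age is an assumed bound)."""
    return 0 <= age <= max_age


def is_valid_person_weight(pounds, max_pounds=1000):
    """Domain-derived validity: max_pounds is an assumed domain limit."""
    return 0 < pounds <= max_pounds


print(is_well_formed_date("18/19/2002"))  # there is no month 19
print(is_valid_age(-5))
print(is_valid_person_weight(1234))
```

Task-dependent validity is not covered by the sketch, because it needs rules that differ from one database to another.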

5. Currentness

The degree to which data has attributes that are of the right age in a specific context of use.

EXAMPLE The timetable of a railway station must be updated with the frequency required to allow passengers to take a train even if the scheduled time or platform changes.

6. Accessibility

The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability.

EXAMPLE Data that should be managed by a screen reader cannot be stored as an image.

Inherent Data Quality Measure for Sound data accessibility
• Measurement Function A/B
A = number of data stored only as “sound” (e.g. without a textual representation of sound)
B = number of data values representing a sound

System Dependent Data Quality Measure for Multi channel data accessibility
• Measurement Function A/B
A = number of data that the differently abled user successfully accesses
B = number of data available

7. Compliance

The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use.

EXAMPLE: Credit risk data of a bank must comply with specific laws and standards.

Inherent Data Quality Measure for Privacy law non-conformity: values
• Measurement Function A
A = number of items that do not conform to privacy law statements due to data content

System Dependent Data Quality Measure for Privacy law non-conformity: architecture
• Measurement Function A
A = number of items that do not conform to privacy law statements due to technical architecture failures

8. Confidentiality

The degree to which data has attributes that ensure that it is only accessible and interpretable by authorized users in a specific context of use.

Confidentiality is an aspect of information security (together with availability, integrity) as defined in ISO/IEC 13335-1:2004.

EXAMPLE: Data that refers to personal or confidential information like health or profit must be accessed only by authorized users or should be written in secret code.

Inherent Data Quality Measure for Encryption usage
• Measurement Function A/B
A = number of database fields encrypted
B = number of fields with an encryption requisite

System Dependent Data Quality Measure for Non vulnerability
• Measurement Function 1 - A/B
A = number of successful penetrations during formal penetration tests
B = number of penetrations attempted
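Both confidentiality measures reduce to simple ratios. The sketch below assumes that each database field is described by two boolean flags, `requires_encryption` and `encrypted`; those flag names and the sample schema are hypothetical.

```python
def encryption_usage(fields):
    """A/B: encrypted fields among the fields that have an encryption requisite."""
    required = [f for f in fields if f["requires_encryption"]]
    encrypted = [f for f in required if f["encrypted"]]
    return len(encrypted) / len(required)


def non_vulnerability(successful_penetrations, attempted_penetrations):
    """1 - A/B: share of penetration attempts that did not succeed."""
    return 1 - successful_penetrations / attempted_penetrations


# Made-up schema: four fields need encryption, three of them are encrypted.
schema = [
    {"name": "ssn", "requires_encryption": True, "encrypted": True},
    {"name": "salary", "requires_encryption": True, "encrypted": True},
    {"name": "diagnosis", "requires_encryption": True, "encrypted": True},
    {"name": "card_number", "requires_encryption": True, "encrypted": False},
    {"name": "city", "requires_encryption": False, "encrypted": False},
]
print(encryption_usage(schema))
print(non_vulnerability(1, 4))
```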

9. Efficiency

The degree to which data has attributes that can be processed and provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use.

EXAMPLE: Using more space than necessary to store data can cause waste of storage, memory and time.

Inherent Data Quality Measure for Numbers stored as strings
• Measurement Function A
A = number of data stored as strings

System Dependent Data Quality Measure for Wasted space
• Measurement Function Σ(B - A)
A = benchmarked average space for efficient data storage of a database
B = used space for data in any physical files of the database
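The two efficiency measures can be sketched as follows. The sketch assumes that a string is "numeric" when Python's `float()` accepts it, and that each physical file is described by a hypothetical `(benchmark, used)` pair of sizes in the same unit.

```python
def numbers_stored_as_strings(values):
    """A: count the values that are strings but could be parsed as numbers."""
    count = 0
    for value in values:
        if isinstance(value, str):
            try:
                float(value)  # assumption: float() decides what counts as numeric
            except ValueError:
                continue
            count += 1
    return count


def wasted_space(files):
    """Sum of (B - A) over all files, where each item is (benchmark A, used B)."""
    return sum(used - benchmark for benchmark, used in files)


print(numbers_stored_as_strings(["42", "3.14", "abc", 7, "1e3"]))
print(wasted_space([(100, 120), (200, 260)]))
```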

10. Precision

The degree to which data has attributes that are exact or that provide discrimination in a specific context of use.

Look for rounding errors. For example, a precision of 5 decimal places allows different functionality than a precision of 2 decimal places.

Precision in location latitude and longitude declarations: values must contain seconds in the Degree/Minute/Second system.

Inherent Data Quality Measure for Precision of data values
• Measurement Function A/B
A = number of data values with the requested precision
B = total number of data values

System Dependent Data Quality Measure for Precision of fields of a database
• Measurement Function A/B
A = number of data fields of the database defined with the requested precision
B = total number of data fields of the database
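The inherent precision measure can be sketched by counting decimal places. The sketch assumes values arrive as decimal strings and that "requested precision" means a minimum number of decimal places; both are illustrative assumptions, as are the function names.

```python
from decimal import Decimal


def decimal_places(text):
    """Number of digits after the decimal point in a finite decimal string."""
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)


def precision_ratio(values, requested_places):
    """A/B: values with at least the requested number of decimal places."""
    precise = sum(decimal_places(v) >= requested_places for v in values)
    return precise / len(values)


# Made-up readings: two of the four have 5 decimal places.
readings = ["3.14159", "2.50", "40.71280", "7"]
print(precision_ratio(readings, 5))
```

Parsing from strings rather than floats keeps trailing zeros such as the one in "40.71280", which would be lost after conversion to a binary float.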

11. Traceability

The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use.

EXAMPLE: Public administrations must keep information about the access executed by users for investigating who read/wrote confidential data.

Inherent Data Quality Measure for Traceability of values
• Measurement Function A/B
A = number of data items for which required traceability of values is available
B = number of data items for which traceability is tested

System Dependent Data Quality Measure for Automatic traceability
• Measurement Function A
A = number of data items traced automatically (using system capabilities)

12. Understandability

The degree to which data has attributes that enable it to be read and interpreted by users, and are expressed in appropriate languages, symbols and units in a specific context of use.

Some information about data understandability is provided by metadata.

EXAMPLE: To represent a State (within a country), the standard acronym is more understandable than a numeric code.

Inherent Data Quality Measure for Master data understandability due to existing metadata
• Measurement Function A/B
A = number of data of master data files with existing metadata
B = number of data of master data files

System Dependent Data Quality Measure for Master data understandability due to linked metadata
• Measurement Function A/B
A = number of fields having metadata automatically linked to related data
B = total number of fields

13. Availability

The degree to which data has attributes that enable it to be retrieved by authorized users and/or applications in a specific context of use.

A particular case of availability is concurrent access (both to read or to update data) by more than one user and/or application.

Another case of availability is the capability of data to be available for a specific period of time.

System Dependent Data Quality Measure for Data items availability
• Measurement Function A/B
A = number of data items available during backup/restore activities
B = number of data items of backup/restore procedures

14. Portability

The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another, preserving the existing quality in a specific context of use.

System Dependent Data Quality Measure for Data portability
• Measurement Function A/B
A = number of data that preserved the existing quality attribute after the migration to a different computer system
B = number of data migrated

15. Recoverability

The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use.

Recoverability can be provided by features like commit/synch point, rollback (fault-tolerance capability) or by backup-recovery mechanisms.

EXAMPLE: When a media device has a failure, data stored in that device should be recoverable.

System Dependent Data Quality Measure for Recoverability
• Measurement Function A/B
A = number of data items successfully backed up/restored during backup/restore operation
B = number of data items of backup/restore procedures

Credibility (or validity)

[3] Measurement Function A/B
A = number of records for which all entries are valid
B = total number of records in the dataset

[5] Measurement Function A/B
A = number of data certified by internal audit after obtaining credit risk information data
B = number of data used to obtain credit risk information

[6] Measurement Function A/B
A = number of attribute values that are valid
B = records × attributes

[7] Look for artificial keys, identity values and system-generated keys, and apply at least one business key to a data grouping, say in a data mart, or to a row occurrence for a registry-type data group (an inventory list such as a list of persons or a list of vehicles).

Understandability

[5] Measurement Function A/B
A = number of data of master data files with existing metadata
B = number of data of master data files

[7] Look for lack of referential integrity where the same attributes are used in various tables

Look for loss of history data with no record of previous values

Understandability according to Ref#2

Look for consistency between the business types that an organization is licensed for and the related types of returns or transactions

Look for lack of referential integrity where the same attributes are used in various tables

For uniquely traceable items, such as serial numbers or particular licensed item identifiers, check whether the same item can be involved with another item at the same time.

This applies to ownership, involvement, and lineage.

Look for loss of history data with no record of previous values

Linked Data

The goal of the Semantic Web, or Web of Data: processing of data directly or indirectly by machines

Linked Data provides the means to reach that goal

Linked Data refers to data published on the Web in such a way that:
– It is machine-readable
– Its meaning is explicitly defined
– It is linked to other datasets
– It can be linked to/from external datasets

Quality Characteristics of Linked Data

According to the definition of Linked Data:

– Compliance
  HTTP URIs to identify resources
  HTTP protocol to retrieve resources

– Understandability
  It is machine-readable
  Its meaning is explicitly defined

– Portability
  RDF data model to represent resources (any application that understands the model can consume any data source published based on the model)
  It can be linked to/from other datasets
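The compliance characteristic above (HTTP URIs to identify resources) could be measured as an A/B ratio in the style of the ISO/IEC 25012 measures. The sketch below is one possible reading, not a measure defined in the source: it assumes a resource identifier is compliant when its scheme is `http` or `https` and it has a host part.

```python
from urllib.parse import urlparse


def is_http_uri(uri):
    """True if the identifier uses the HTTP(S) scheme and names a host."""
    parts = urlparse(uri)
    # Assumption: https is accepted alongside http.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def http_uri_ratio(uris):
    """A/B: HTTP URIs among all resource identifiers in a dataset."""
    return sum(is_http_uri(u) for u in uris) / len(uris)


identifiers = [
    "http://dbpedia.org/resource/Spain",
    "urn:isbn:0451450523",
    "http://sws.geonames.org/2510769",
    "mailto:someone@example.org",
]
print(http_uri_ratio(identifiers))
```

The check is purely syntactic; whether a URI can actually be retrieved over HTTP would need a network request and is left out of the sketch.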

Classification of Quality characteristics in Linked Data

Inherent data quality
– Accuracy
– Validity
– Precision

Context related
– Completeness
– Currentness

System dependent
– Accessibility
– Traceability
– Recoverability
– Availability
– Efficiency
– Confidentiality (privacy protection and licensing in Linked Data)

Consistency
– One of the biggest challenges in Linked Data is data fusion

Data Fusion

Process of integrating multiple data items representing the same real-world object into a single, consistent, and clean representation.
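As a toy illustration of this definition, the sketch below fuses several descriptions of the same object by keeping, for each attribute, the value reported most often. Majority voting is only one of many possible conflict-resolution strategies and is an assumption of this example, as are the function name and the sample data.

```python
from collections import Counter


def fuse(records):
    """Merge records describing one real-world object into a single record.

    Assumption: conflicts are resolved by majority vote per attribute;
    missing (None) values are ignored.
    """
    fused = {}
    keys = {key for rec in records for key in rec}
    for key in sorted(keys):
        values = [rec[key] for rec in records if rec.get(key) is not None]
        if values:
            fused[key] = Counter(values).most_common(1)[0][0]
    return fused


# Made-up descriptions of the same country from three sources.
sources = [
    {"name": "Spain", "capital": "Madrid", "currency": "EUR"},
    {"name": "Spain", "capital": "Madrid", "currency": None},
    {"name": "España", "capital": "Madrid", "currency": "EUR"},
]
print(fuse(sources))
```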

Co-reference

A single URI identifies more than one resource
– Example: a number of people in DBLP with the same name are incorrectly identified as being the same person.

Multiple URIs identify the same resource
– Different datasets use their own URIs to identify the same resource. People and places are entities that suffer from URI multiplicity.
– Example: Spain has at least four URIs:
1. http://dbpedia.org/resource/Spain
2. http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
3. http://sws.geonames.org/2510769
4. http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a

Author Disambiguation [7]

1. Single author having multiple identities (variation in the spelling)

– ‘Hugh Glaser’
– ‘H. Glaser’
– ‘Glaser, H.’

2. Many authors who share the same name

Author Disambiguation …

– Solutions: citation matching, name matching, Name equivalence identification

– All of them involve some form of string matching and word sense disambiguation.

– Help in identifying names with different spellings or written in different formats

– Disambiguating authors with exactly the same name remains a challenge.
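A very simple form of the name matching mentioned above is to normalize each spelling to a comparison key. The sketch below reduces a name to (surname, first initial); the function name and the normalization rules are assumptions for illustration, not the method of the cited work.

```python
def name_key(name):
    """Reduce an author name to a (surname, first initial) key, both lower-case."""
    name = name.replace(".", "").strip()
    if "," in name:
        # "Surname, Given" format
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Given Surname" format: the last word is taken as the surname
        parts = name.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    return surname.lower(), given[:1].lower()


variants = ["Hugh Glaser", "H. Glaser", "Glaser, H."]
print({name_key(v) for v in variants})
```

All three spellings collapse to one key, but so would any other author named, say, "Helen Glaser", which is exactly the remaining challenge of distinguishing authors who share a name.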

Consistent Reference Services [8]

The CRS introduces the concept of a bundle to group together resources that have been deemed to refer to the same concept within a given context.

Different bundles may be used to group together URIs of the same resource in different contexts.

For example, there may be a bundle containing all of the URIs about a person in the context of institution 1; and another bundle containing all of the URIs about the same person in the context of institution 2.

Each CRS can use different algorithms to identify equivalent resources.
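The bundle idea can be illustrated with a small grouping routine. The sketch assumes that equivalence between URIs has already been decided by some algorithm and is given as pairs; it then groups the URIs into bundles with a union-find structure. This is a hypothetical illustration, not the CRS implementation, and the example.org URIs are placeholders.

```python
def build_bundles(equivalent_pairs):
    """Group URIs into bundles, given pairs of URIs deemed equivalent in one context."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in equivalent_pairs:
        parent[find(a)] = find(b)  # merge the two bundles

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    return list(bundles.values())


pairs = [
    ("http://dbpedia.org/resource/Spain",
     "http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain"),
    ("http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain",
     "http://sws.geonames.org/2510769"),
    ("http://example.org/a", "http://example.org/b"),  # placeholder URIs
]
for bundle in build_bundles(pairs):
    print(sorted(bundle))
```

Keeping one such structure per context would give different bundles for the same resource in different contexts, as described above.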

An Entity Name System for Linking Semantic Web Data [9]

An Entity Name System (ENS) might play for the Semantic Web the role that DNS played for interlinking hypertexts on the Web.

Interlinking Distributed Social Graphs [10]

1. Export social data contained within data silos (Facebook, Twitter, MySpace) into the same semantic form.

2. Link person instances from separate social networks referring to the same real-world person.

3. Publish a decentralized linked social graph.

1. Markus Helfert, Institute of Information Management, University of St. Gallen, Managing and Measuring Data Quality in Data Warehousing, 2001

2. Leo L. Pipino, Yang W. Lee, and Richard Y. Wang, Data Quality Assessment, 2003

3. Alan F. Karr and Ashish P. Sanil, Data Quality: A Statistical Perspective, 2005

4. Kyung-Seok Ryu, Joo-Seok Park, and Jae-Hong Park, A Data Quality Management Maturity Model, ETRI Journal, Vol. 28, No. 2, 191-204, 2006

5. Ying Su, Zhanming Jin, A Methodology for Information Quality Assessment in Data Warehousing, ICC 2008 proceedings, IEEE Communications Society.

6. ISO/IEC 25012 - Data Quality Model, Final Draft: 2008-11-04

7. Afraz Jaffri, Hugh Glaser, Ian C. Millard, URI Disambiguation in the Context of Linked Data, LDOW2008, China.

8. Hugh Glaser, Afraz Jaffri, Ian C. Millard, Managing Co-reference on the Semantic Web, LDOW2009, Spain.

9. Paolo Bouquet, Heiko Stoermer, Daniele Cordioli, An Entity Name System for Linking Semantic Web Data, LDOW2008, China.

10. Matthew Rowe, Interlinking Distributed Social Graphs, LDOW2009, Spain.
