CLEANING-Error-Flagging-Javier

Data Cleaning and Data Publishing Workshop 2013
18-22 February, Nairobi, Kenya
Javier Otegui @jotegui
ERROR FLAGGING

Page 2: CLEANING-Error-Flagging-Javier

- What is flagging?
  - Adding a piece of information to a record or PBD (primary biodiversity data)
  - Giving extra information on something
  - Especially used to highlight records to inform the collector or user
- Aims of error flagging:
  - Provide a simple way of filtering records that might be problematic
  - Very useful for automated error processing
  - Reporting issues to the owner
- Difference between flagging and resolving:
  - Ownership

INTRODUCTION – ERROR FLAGGING
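The filtering aim above can be sketched in a few lines. This is only an illustration: the records, the `flags` field and the issue code are hypothetical and are not taken from the workshop material.

```python
# Hypothetical records; the "flags" field and its codes are illustrative assumptions.
records = [
    {"id": "rec-1", "country": "KE", "flags": []},
    {"id": "rec-2", "country": "KE", "flags": ["COORDINATES_SWAPPED"]},
    {"id": "rec-3", "country": "TZ", "flags": []},
]

def split_by_flag(records):
    """Separate records with no flags from those carrying at least one flag."""
    clean = [r for r in records if not r["flags"]]
    flagged = [r for r in records if r["flags"]]
    return clean, flagged

clean, flagged = split_by_flag(records)
print([r["id"] for r in clean])    # ['rec-1', 'rec-3']
print([r["id"] for r in flagged])  # ['rec-2']
```

The flagged records stay in the dataset; the flag only lets a user or an automated routine decide whether to include them.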

Page 3: CLEANING-Error-Flagging-Javier

DATA IS OURS

- We are directly responsible for the quality
- We may share the master copy of the data
- We can directly improve the quality of the data and serve it

DATA IS NOT OURS

- We are not directly responsible for the quality
- We point to the original source
- We cannot directly improve the quality of the data and serve it

INTRODUCTION – OWNERSHIP

Page 4: CLEANING-Error-Flagging-Javier

- Why flag and not resolve? Attribution and persistence

INTRODUCTION – ERROR FLAGGING

Page 5: CLEANING-Error-Flagging-Javier

- Data from an aggregator comes with certain restrictions or conditions
- Acknowledge the original source of the data
- Each collection might have additional rules

INTRODUCTION - ATTRIBUTION

Page 8: CLEANING-Error-Flagging-Javier

- Persistence of the correction
- Local work = no permanence of corrections
- Next researcher must repeat the cleaning process
- Error flagging as an excellent tool for reporting issues
- Once reported, owners can clean the data
- Example of flagging: annotations

INTRODUCTION - PERSISTENCE
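As a rough sketch of how flags can feed a report to data owners, the snippet below groups flagged records by owner. The record layout, the `owner` and `issue` field names and the sample values are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical flagged records; "owner" and "issue" are illustrative field names.
flagged_records = [
    {"id": "rec-2", "owner": "Collection A", "issue": "coordinates swapped"},
    {"id": "rec-7", "owner": "Collection B", "issue": "missing country"},
    {"id": "rec-9", "owner": "Collection A", "issue": "date in the future"},
]

def group_issues_by_owner(flagged_records):
    """Collect one list of 'record id: issue' lines per data owner."""
    reports = defaultdict(list)
    for rec in flagged_records:
        reports[rec["owner"]].append(f"{rec['id']}: {rec['issue']}")
    return dict(reports)

reports = group_issues_by_owner(flagged_records)
for owner, lines in reports.items():
    print(owner)
    for line in lines:
        print("  " + line)
```

Each owner receives only the issues found in their own records, which is what makes a correction at the source possible.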

Page 9: CLEANING-Error-Flagging-Javier

- Data manipulation: add a piece of information to the original record
- New fields, populated if an issue is detected
- Recommendation: use (and document) a codification

INTRODUCTION - MECHANISMS

[Figure: free-text flag values for the same issue ("Coordinates swapped", "Swapped coordinates", "Coordinates transposed", "Coordnates transposed") contrasted with a single consistent code (1)]
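A minimal sketch of the mechanism described on this slide: a new field that is filled in only when an issue is detected, using a documented code instead of free text. The code table, the `coordinate_flag` field name and the detection heuristic are all assumptions for illustration; the heuristic only catches swaps where the misplaced value is larger than 90 in magnitude.

```python
# Documented codification: every code used in the flag field is listed here.
# The codes and the "coordinate_flag" field name are illustrative assumptions.
FLAG_CODES = {
    1: "Coordinates swapped (latitude and longitude exchanged)",
}

def flag_swapped_coordinates(record):
    """Return a copy of the record with a new 'coordinate_flag' field.

    The original latitude and longitude are left untouched. The check is a
    simple heuristic: a latitude outside [-90, 90] that would be a valid
    longitude, paired with a longitude that would be a valid latitude.
    """
    lat, lon = record["latitude"], record["longitude"]
    flagged = dict(record)              # copy, so the input record is not modified
    flagged["coordinate_flag"] = None   # new field, empty unless an issue is found
    if 90 < abs(lat) <= 180 and abs(lon) <= 90:
        flagged["coordinate_flag"] = 1
    return flagged

good = {"id": "rec-1", "latitude": 5.0, "longitude": 100.5}
swapped = {"id": "rec-2", "latitude": 100.5, "longitude": 5.0}

print(flag_swapped_coordinates(good)["coordinate_flag"])     # None
print(flag_swapped_coordinates(swapped)["coordinate_flag"])  # 1
```

Because the flag is a code looked up in `FLAG_CODES`, the same issue is always recorded the same way, which avoids the spelling and wording variants that free text invites.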

Page 10: CLEANING-Error-Flagging-Javier

- Data Usage Terms
  - Accepted when using the portal
  - Among other conditions, the need to cite the data
- Data Sharing Agreement
  - “GBIF Secretariat may cache a copy and serve full or partial data further to other users together with the terms and conditions for use set by the Data Publisher”
  - Partial: based on issues detected in the quality of the data
- How do they detect issues?
  - Processing routines search for the most common issues
  - Errors are flagged; they cannot alter the data
  - Flags are used to alert users and are reported back to owners

INTRODUCTION – EXAMPLE: GBIF

Page 11: CLEANING-Error-Flagging-Javier

INTRODUCTION – EXAMPLE: GBIF

Coordinates fall outside specified country, territory or island
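A check of this kind could be sketched as follows. This is not GBIF's actual routine: the bounding box is a deliberately loose, hypothetical rectangle around Kenya used only for illustration, whereas a real check would use proper boundary data.

```python
# Very rough, illustrative bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# These values are assumptions for the example, not authoritative boundaries.
COUNTRY_BOXES = {
    "KE": (-5.0, 6.0, 33.0, 42.0),
}

def outside_country(record, boxes=COUNTRY_BOXES):
    """Return True if the coordinates fall outside the box for the stated country.

    Returns None when no box is available, so the record is left unchecked
    rather than wrongly flagged.
    """
    box = boxes.get(record["country"])
    if box is None:
        return None
    min_lat, max_lat, min_lon, max_lon = box
    inside = (min_lat <= record["latitude"] <= max_lat
              and min_lon <= record["longitude"] <= max_lon)
    return not inside

nairobi = {"id": "rec-1", "country": "KE", "latitude": -1.29, "longitude": 36.82}
misplaced = {"id": "rec-2", "country": "KE", "latitude": 48.85, "longitude": 2.35}

print(outside_country(nairobi))    # False
print(outside_country(misplaced))  # True
```

The function only reports the mismatch; it does not try to decide whether the country or the coordinates are wrong, which is left to the data owner.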

Page 12: CLEANING-Error-Flagging-Javier

INTRODUCTION – EXAMPLE: GBIF

138,458 records with coordinates
138,312 records in map
146 records with wrong coordinates

Page 13: CLEANING-Error-Flagging-Javier

- What happens when errors are flagged?
- Flags or annotations should reach the owner
- Owner is the only one who can solve issues at the source
- Corrected data is then deployed and re-indexed
- This has happened often…

INTRODUCTION – RESOLUTION PATH

Page 14: CLEANING-Error-Flagging-Javier

INTRODUCTION – RESOLUTION PATH

[Figure: before / after comparison]

Page 15: CLEANING-Error-Flagging-Javier

- Key factor: awareness and involvement of data owners
  - Some owners correct their data
  - Some owners don’t
- Without this step, the process of error flagging loses part of its purpose

INTRODUCTION – RESOLUTION PATH

Page 16: CLEANING-Error-Flagging-Javier

- Error flagging can be applied to several data storage formats
- Each format has its own requirements
- Formats:
  - Text files: tab-delimited, CSV files…
  - Spreadsheets: LibreOffice Calc, Google Spreadsheets, Microsoft Office…
  - Database tables

ERROR FLAGGING
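For the text-file case, a minimal sketch with Python's standard `csv` module is shown below. It reads tab-delimited data, appends a `flag` column and writes the result to a separate output, leaving the source untouched. The sample data, the `flag` column name and the code `1` for a missing latitude are assumptions for the example; in-memory buffers stand in for real files.

```python
import csv
import io

# In-memory stand-in for a tab-delimited source file (hypothetical sample data).
source = io.StringIO(
    "id\tlatitude\tlongitude\n"
    "rec-1\t-1.29\t36.82\n"
    "rec-2\t\t36.82\n"
)

def flag_missing_latitude(src, dst):
    """Copy rows from src to dst, adding a 'flag' column (1 = latitude missing)."""
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["flag"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Original values are written back unchanged; only the new column is filled.
        row["flag"] = "1" if row["latitude"].strip() == "" else ""
        writer.writerow(row)

flagged_copy = io.StringIO()
flag_missing_latitude(source, flagged_copy)
print(flagged_copy.getvalue())
```

With real files, the same function could be called with two handles opened via `open(..., newline="")`, one for reading the original and one for writing the flagged copy.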

Page 17: CLEANING-Error-Flagging-Javier

- In some respects, the most convenient way of managing data
- Semi-structured, visual management of information
  - Rows, columns and cells
  - Cells are not restricted to any specific type of data
  - Records can be plotted in several ways
- Calculations with cells
- Some of the most common operations:

ERROR FLAGGING – SPREADSHEETS

Page 18: CLEANING-Error-Flagging-Javier

- Sorting

ERROR FLAGGING – SPREADSHEETS
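The idea behind sorting as a cleaning aid is that extreme or impossible values end up at the top or bottom of the column. The same idea outside a spreadsheet can be sketched as below; the records are hypothetical.

```python
# Hypothetical records; rec-3 carries an impossible latitude.
records = [
    {"id": "rec-1", "latitude": -1.29},
    {"id": "rec-2", "latitude": 0.52},
    {"id": "rec-3", "latitude": 129.0},
    {"id": "rec-4", "latitude": -3.4},
]

# Sort by latitude so that suspicious values collect at either end of the list.
by_latitude = sorted(records, key=lambda r: r["latitude"])
lowest, highest = by_latitude[0], by_latitude[-1]

print(lowest["id"], lowest["latitude"])    # rec-4 -3.4
print(highest["id"], highest["latitude"])  # rec-3 129.0
```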

Page 19: CLEANING-Error-Flagging-Javier

- Filtering

ERROR FLAGGING – SPREADSHEETS

Page 20: CLEANING-Error-Flagging-Javier

- Conditional formatting

ERROR FLAGGING – SPREADSHEETS

Page 21: CLEANING-Error-Flagging-Javier

- Controlled vocabulary

ERROR FLAGGING – SPREADSHEETS
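A controlled vocabulary check can also be expressed outside a spreadsheet. In the sketch below, the allowed values are only an illustrative subset chosen for the example, and the records are hypothetical; a real list would come from the data standard in use.

```python
# Illustrative subset of a controlled vocabulary (an assumption for this example).
ALLOWED_BASIS_OF_RECORD = {"PreservedSpecimen", "HumanObservation", "FossilSpecimen"}

records = [
    {"id": "rec-1", "basisOfRecord": "PreservedSpecimen"},
    {"id": "rec-2", "basisOfRecord": "preserved specimen"},
    {"id": "rec-3", "basisOfRecord": "HumanObservation"},
]

def flag_outside_vocabulary(records, field, allowed):
    """Return the ids of records whose value for `field` is not in `allowed`."""
    return [r["id"] for r in records if r[field] not in allowed]

print(flag_outside_vocabulary(records, "basisOfRecord", ALLOWED_BASIS_OF_RECORD))
# ['rec-2']
```

The comparison is deliberately strict, so a value that differs only in case or spacing is still flagged for review rather than silently accepted.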

Page 22: CLEANING-Error-Flagging-Javier

- Visualizations

ERROR FLAGGING – SPREADSHEETS

Page 23: CLEANING-Error-Flagging-Javier

- Formulae & Advanced scripting

ERROR FLAGGING – SPREADSHEETS

Page 24: CLEANING-Error-Flagging-Javier

- Error flagging: the process of reporting issues without modifying the original data
- Useful when working with shared data
- In spreadsheets
  - Simple, yet powerful
  - Adaptable levels of difficulty
  - Several possibilities to filter and flag records

CONCLUSION