CLEANING-Error-Flagging-Javier
Data Cleaning and Data Publishing Workshop 2013
18-22 February, Nairobi, Kenya
Javier Otegui (@jotegui)
ERROR FLAGGING
• What is flagging?
  ▪ Adding a piece of information to a record or PBD
  ▪ Give extra information on something
  ▪ Especially used to highlight records to inform collector or user
• Aims of error flagging:
  ▪ Provide a simple way of filtering records that might be problematic
  ▪ Very useful for automated error processing
  ▪ Reporting issues to the owner
• Difference between flagging and resolving:
  ▪ Ownership
INTRODUCTION – ERROR FLAGGING
DATA IS OURS
• We are directly responsible for the quality
• We may share the master copy of the data
• We can directly improve the quality of the data and serve it
DATA IS NOT OURS
• We are not directly responsible for the quality
• We point to the original source
• We cannot directly improve the quality of the data and serve it
INTRODUCTION – OWNERSHIP
• Why flag and not resolve? Attribution and persistence
INTRODUCTION – ERROR FLAGGING
• Data from an aggregator comes with certain restrictions or conditions
• Acknowledge the original source of the data
• Each collection might have additional rules
INTRODUCTION - ATTRIBUTION
• Persistence of the correction
• Local work = no permanence of corrections
• Next researcher must repeat the cleaning process
• Error flagging is an excellent tool for reporting issues
• Once reported, owners can clean the data
• Example of flagging: annotations
INTRODUCTION - PERSISTENCE
• Data manipulation: add a piece of information to the original record
• New fields, populated if an issue is detected
• Recommendation: use (and document) a codification
INTRODUCTION - MECHANISMS
[Figure: free-text flag variants ("Coordinates swapped", "Swapped coordinates", "Coordinates transposed", "Coordnates transposed", …) shown alongside a single code, 1]
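The mechanism above (a new field, filled in only when an issue is detected, using a documented set of codes) could be sketched as follows. This is a minimal illustration, not the workshop's own tooling: the `FLAG_CODES` table, the `flags` field name and the `flag_record` helper are all hypothetical.

```python
# Hypothetical codification: the codes and their meanings are illustrative
# only and should be documented alongside the data set.
FLAG_CODES = {
    1: "Latitude and longitude appear to be swapped",
    2: "Coordinates are exactly 0, 0",
}


def flag_record(record, code):
    """Return a copy of the record with `code` appended to its 'flags' field."""
    if code not in FLAG_CODES:
        raise ValueError(f"Undocumented flag code: {code}")
    flagged = dict(record)  # copy, so the original record is left untouched
    flagged["flags"] = list(record.get("flags", [])) + [code]
    return flagged


record = {"id": "rec-1", "decimalLatitude": 36.8, "decimalLongitude": -1.3}
flagged = flag_record(record, 1)

print(flagged["flags"])    # [1]
print("flags" in record)   # False
```

Using one documented code per issue avoids the free-text variants (and typos) that make flags hard to filter on later.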
• Data Usage Terms
  ▪ Accepted when using the portal
  ▪ Among others, the need to cite the data
• Data Sharing Agreement
  ▪ “GBIF Secretariat may cache a copy and serve full or partial data further to other users together with the terms and conditions for use set by the Data Publisher”
  ▪ “Partial” is based on detected quality issues
• How do they detect issues?
  ▪ Processing routines search for the most common issues
  ▪ Errors are flagged; they cannot alter the data
  ▪ Flags are used to alert users and are reported back to owners
INTRODUCTION – EXAMPLE: GBIF
Coordinates fall outside specified country, territory or island
INTRODUCTION – EXAMPLE: GBIF
• 138,458 records with coordinates
• 138,312 records in map
• 146 records with wrong coordinates
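A check of the "coordinates fall outside specified country" kind could look roughly like the sketch below. It is not GBIF's actual routine. The bounding box, the country code `XX`, the flag name and the sample records are all made up for illustration, and a real check would use proper country geometries. Flags are kept in a separate structure, so the records themselves are never altered.

```python
# Hypothetical bounding boxes as (min_lat, max_lat, min_lon, max_lon).
# The values are invented for illustration and are not real country limits.
COUNTRY_BOXES = {"XX": (-5.0, 5.0, 30.0, 42.0)}

OUTSIDE_COUNTRY = "COORDINATES_OUTSIDE_COUNTRY"  # illustrative flag name


def outside_country(record):
    """True if the record's coordinates fall outside its stated country's box."""
    box = COUNTRY_BOXES.get(record["country"])
    if box is None:
        return False  # no reference box available, so nothing to flag
    min_lat, max_lat, min_lon, max_lon = box
    inside = (min_lat <= record["lat"] <= max_lat
              and min_lon <= record["lon"] <= max_lon)
    return not inside


records = [
    {"id": "a", "country": "XX", "lat": 0.5, "lon": 35.0},
    {"id": "b", "country": "XX", "lat": 48.0, "lon": 2.0},
    {"id": "c", "country": "XX", "lat": 36.0, "lon": 1.0},  # looks swapped
]

# Flags live next to the data, keyed by record id; the records are unchanged.
flags = {r["id"]: [OUTSIDE_COUNTRY] if outside_country(r) else [] for r in records}
mappable = [r for r in records if not flags[r["id"]]]

print(len(records), len(mappable), len(records) - len(mappable))  # 3 1 2
```

The three printed counts mirror the slide's breakdown: records with coordinates, records shown on the map, and records flagged for wrong coordinates.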
• What happens when errors are flagged?
• Flags or annotations should reach the owner
• Owner is the only one who can solve issues at the source
• Corrected data is then deployed and re-indexed
• This has happened often…
INTRODUCTION – RESOLUTION PATH
INTRODUCTION – RESOLUTION PATH
[Figure: before and after comparison]
• Key factor: awareness and involvement of data owners
  ▪ Some owners correct their data
  ▪ Some owners don’t
• Without this step, the process of error flagging loses part of its purpose
INTRODUCTION – RESOLUTION PATH
• Error flagging can be applied to several data storage formats
• Each format has its own requirements
• Formats:
  ▪ Text files: tab-delimited, CSV files…
  ▪ Spreadsheets: LibreOffice Calc, Google Spreadsheets, Microsoft Office…
  ▪ Database tables
ERROR FLAGGING
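For the text-file case, flagging can be as simple as writing a copy of the file with one extra column. The sketch below uses Python's standard `csv` module on an in-memory CSV. The column names, the sample rows and the `LAT_OUT_OF_RANGE` flag are assumptions made for the example.

```python
import csv
import io

# In-memory stand-in for a CSV file; column names and values are illustrative.
source = io.StringIO(
    "id,latitude,longitude\n"
    "1,-1.29,36.82\n"
    "2,136.82,-1.29\n"
)

reader = csv.DictReader(source)
output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=reader.fieldnames + ["flag"],  # original columns plus a flag column
    lineterminator="\n",
)
writer.writeheader()

for row in reader:
    latitude = float(row["latitude"])
    # Original values are copied through unchanged; only the new column is filled.
    row["flag"] = "" if -90 <= latitude <= 90 else "LAT_OUT_OF_RANGE"
    writer.writerow(row)

flagged_csv = output.getvalue()
print(flagged_csv)
```

The same idea carries over to spreadsheets (an extra column) and database tables (an extra field or a separate flags table).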
• In some respects, the most comfortable way of managing data
• Semi-structured, visual management of information
  ▪ Rows, columns and cells
  ▪ Cells are not restricted to any specific type of data
  ▪ Plotting records in several ways
• Calculations with cells
• Some of the most common operations:
ERROR FLAGGING – SPREADSHEETS
• Sorting
• Filtering
• Conditional formatting
• Controlled vocabulary
• Visualizations
• Formulae & advanced scripting
ERROR FLAGGING – SPREADSHEETS
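Two of the operations listed above, controlled vocabulary and filtering, combine naturally for flagging: values that are not in the agreed vocabulary get a flag, and the flag column is then used to filter the rows that need review. The sketch below does this outside a spreadsheet, in plain Python. The column name and vocabulary values are borrowed from Darwin Core's `basisOfRecord` purely as an illustration, and the `NOT_IN_VOCABULARY` flag is hypothetical.

```python
# Illustrative controlled vocabulary for a "basisOfRecord"-style column.
VOCABULARY = {"PreservedSpecimen", "HumanObservation", "FossilSpecimen"}

rows = [
    {"id": 1, "basisOfRecord": "PreservedSpecimen"},
    {"id": 2, "basisOfRecord": "preserved specimen"},  # not in the vocabulary
    {"id": 3, "basisOfRecord": "HumanObservation"},
]

# Add a flag column; the original values are kept exactly as they were.
flagged_rows = [
    {**row, "flag": "" if row["basisOfRecord"] in VOCABULARY else "NOT_IN_VOCABULARY"}
    for row in rows
]

# Filter on the flag column to list the rows that need review.
to_review = [row["id"] for row in flagged_rows if row["flag"]]
print(to_review)  # [2]
```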
• Error flagging: the process of reporting issues without modifying the original data
• Useful when working with shared data
• In spreadsheets
  ▪ Simple, yet powerful
  ▪ Adaptable levels of difficulty
  ▪ Several possibilities to filter and flag records
CONCLUSION