Data quality challenges in the Canadensys network of occurrence records: examples, tools, and...
![Page 1: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/1.jpg)
Data quality challenges in the Canadensys network of
occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
![Page 2: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/2.jpg)
Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations
![Page 3: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/3.jpg)
A network of people and collections
![Page 4: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/4.jpg)
Canadensys Headquarters Université de Montréal Biodiversity Centre
![Page 5: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/5.jpg)
data.canadensys.net/vascan
![Page 6: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/6.jpg)
data.canadensys.net/ipt
![Page 7: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/7.jpg)
data.canadensys.net/explorer
![Page 8: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/8.jpg)
Data quality related activities
From an aggregator perspective
![Page 9: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/9.jpg)
During data entry
• Help to avoid typographical errors
• Help to convert verbatim data
Actor : data entry person
![Page 10: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/10.jpg)
Before publication
Actor : data publisher
• Detect file character encoding issues
• Detect duplicate or missing IDs
Previous Activity: Data entry
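The two pre-publication checks above can be sketched in a few lines. This is a minimal illustration, not the Canadensys tooling: the function names `check_ids` and `is_valid_utf8` are hypothetical, and UTF-8 is assumed as the expected file encoding.

```python
from collections import Counter


def check_ids(ids):
    """Return (row indexes with a missing ID, sorted list of duplicated IDs)."""
    missing = [i for i, value in enumerate(ids) if value is None or not value.strip()]
    counts = Counter(value.strip() for value in ids if value and value.strip())
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


def is_valid_utf8(raw_bytes):
    """True if the bytes decode cleanly as UTF-8 (assumed target encoding)."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


missing, duplicates = check_ids(["a1", "a2", "", "a1", None])
print(missing, duplicates)
print(is_valid_utf8("Québec".encode("latin-1")))
```

A file saved as Latin-1 but declared as UTF-8 fails the second check, which is a typical source of garbled accented place names.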
![Page 11: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/11.jpg)
During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control
Actor : data aggregator
Previous Activity: Before publication
![Page 12: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/12.jpg)
After aggregation
• Allow and facilitate community feedback
• Help data publisher to integrate corrections
Actor : users and community
Previous Activity: Aggregation
![Page 13: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/13.jpg)
Canadensys tools during data entry
data.canadensys.net/tools
![Page 14: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/14.jpg)
Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedure
![Page 15: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/15.jpg)
Data processing
![Page 16: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/16.jpg)
Processing solutions
Narwhals to the rescue
Narwhal image Public Domain
![Page 17: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/17.jpg)
The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with a common interface ease integration and usage
● Collaboration
https://github.com/Canadensys/narwhal-processor
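The "common interface" idea can be sketched as follows. This is not the narwhal-processor API (that library is written in Java): `Result`, `FieldProcessor`, `UppercaseCodeProcessor` and `process_record` are hypothetical names used only to show how single-field processors that share one method can be combined per record.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Result:
    raw: str                      # value as provided
    value: Optional[str]          # processed value, or None on failure
    error: Optional[str] = None   # reason for failure, if any


class FieldProcessor(Protocol):
    """Every processor exposes the same single-field method."""

    def process(self, raw: str) -> Result: ...


class UppercaseCodeProcessor:
    """Toy processor: trims and upper-cases a short code."""

    def process(self, raw: str) -> Result:
        cleaned = raw.strip().upper()
        if not cleaned:
            return Result(raw, None, "empty value")
        return Result(raw, cleaned)


def process_record(record, processors):
    """Apply one processor per field; fields without a processor are skipped."""
    return {
        field: processors[field].process(value)
        for field, value in record.items()
        if field in processors
    }


out = process_record(
    {"stateProvince": " qc ", "country": "Canada"},
    {"stateProvince": UppercaseCodeProcessor()},
)
print(out["stateProvince"])
```

Because every processor returns the same result shape, a combined-field check can consume the outputs of several single-field processors without knowing how each one works.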
![Page 18: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/18.jpg)
Data usability before processing
[Bar chart: % of non-null clean verbatim data]
• country text: 92%
• state/province text: 60%
• coordinates: 96%
• dates: 44%
![Page 19: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/19.jpg)
Data usability after processing
• 7% of provided country text
USA → ISO 3166-2:US, United States
![Page 20: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/20.jpg)
Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
Qué → ISO 3166-2 CA-QC, Quebec
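The country and state/province examples suggest matching provided text against known variants. A minimal sketch, assuming a small hand-made lookup table: `STATE_PROVINCE_VARIANTS` and both function names are hypothetical, and a real vocabulary would be far larger.

```python
import unicodedata

# Hypothetical lookup: normalized variant -> (ISO 3166-2 code, preferred name).
STATE_PROVINCE_VARIANTS = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def normalize_key(text):
    """Strip accents, surrounding whitespace and periods, then lower-case."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().strip(".").lower()


def match_state_province(text):
    """Return (code, name) for a known variant, or None if unrecognized."""
    return STATE_PROVINCE_VARIANTS.get(normalize_key(text))


print(match_state_province("Qué"))
print(match_state_province("Québec"))
```

Unrecognized text returns `None` rather than a guess, so the record can be reported back to the publisher instead of being silently altered.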
![Page 21: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/21.jpg)
Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
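The coordinate conversion shown here is degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal sketch, assuming one coordinate per string with a trailing hemisphere letter; `dms_to_decimal` is a hypothetical name and the pattern covers only this layout.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter.
DMS_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*°\s*"
    r"(?:(\d+(?:\.\d+)?)\s*['′]\s*)?"
    r'(?:(\d+(?:\.\d+)?)\s*["″]\s*)?'
    r"([NSEW])",
    re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a string such as 45° 32' 25" N to decimal degrees, or None."""
    match = DMS_PATTERN.search(text)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return round(value, 7)


pair = "45° 32' 25\" N, 129° 40' 31\" W"
lat, lon = (dms_to_decimal(part) for part in pair.split(","))
print(lat, lon)
```

For the latitude on the slide: 45 + 32/60 + 25/3600 = 45.5402778 after rounding to seven decimals.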
![Page 22: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/22.jpg)
Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
• 42% of provided dates
2008 VI 13 → 2008-06-13
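Dates with a Roman-numeral month, as in the example above, can be converted to ISO 8601 with a small lookup. A minimal sketch, assuming year, month, day order as on the slide; `parse_roman_month_date` is a hypothetical name and other orders would need their own patterns.

```python
import re
from datetime import date

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def parse_roman_month_date(text):
    """Parse 'YYYY <roman month> DD' into an ISO date string, or None."""
    match = re.fullmatch(
        r"\s*(\d{4})[\s.\-/]+([IVX]+)[\s.\-/]+(\d{1,2})\s*", text.upper()
    )
    if match is None:
        return None
    year, roman, day = match.groups()
    month = ROMAN_MONTHS.get(roman)
    if month is None:
        return None
    try:
        # date() rejects impossible days such as 31 June.
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None


print(parse_roman_month_date("2008 VI 13"))
```

Building a real `date` object means impossible dates are rejected instead of being passed through as well-formed but wrong strings.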
![Page 23: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/23.jpg)
Data usability including processed data
[Bar chart: % of non-null provided data, clean and processed]
• country text: 92% clean, 7% processed
• state/province text: 60% clean, 16% processed
• coordinates: 96% clean, 4% processed
• dates: 44% clean, 42% processed
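The combined figures can be worked out by adding the processed share to the clean share for each field. This sketch assumes both series are percentages of the same non-null provided values, which the chart title suggests but the transcript does not state outright.

```python
# Percentages read from the two series in the chart.
clean = {"country text": 92, "state/province text": 60, "coordinates": 96, "dates": 44}
processed = {"country text": 7, "state/province text": 16, "coordinates": 4, "dates": 42}

# Assumption: both shares use the same denominator, so they can be summed.
usable = {field: clean[field] + processed[field] for field in clean}

for field, percent in usable.items():
    print(f"{field}: {percent}%")
```

Under that assumption, dates gain the most from processing (44% to 86%), while coordinates were already largely usable as provided.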
![Page 24: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/24.jpg)
Projects with data quality tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine
![Page 25: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/25.jpg)
Hopes and expectations
![Page 26: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/26.jpg)
We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists
![Page 27: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/27.jpg)
We prefer to
• Efficiently use specialized resources/services
• Provide reports and quality indices
![Page 28: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/28.jpg)
Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in USSR in 1980)
![Page 29: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/29.jpg)
Reporting and log
• Darwin Core annotations for processed data
• Shared vocabulary for structured reports and quality indices
![Page 30: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/30.jpg)
Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies
![Page 31: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/31.jpg)
Thanks
Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
![Page 32: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/32.jpg)
Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys
Gulo gulo, Larry Master (www.masterimages.org)
![Page 33: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/33.jpg)
Multi-field processing

| DwC Field | Raw data | Processed data |
| --- | --- | --- |
| verbatimLatitude | 45°30′N | 45.5 |
| verbatimLongitude | 73°34′W | -73.5666667 |
| country | Canada | Canada |
| stateProvince | QC | Quebec |
| municipality | Montreal City | Montreal |
![Page 34: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions](https://reader034.fdocuments.net/reader034/viewer/2022051515/554fa1b4b4c90586258b49f9/html5/thumbnails/34.jpg)
Multi-field processing
1. Get information on coordinates: 45.5, -73.5666667
2. Compare with processed data
3. Assert that these coordinates are in Montréal
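The three steps above can be sketched as a cross-check between processed coordinates and processed locality fields. This is an illustration only: `MUNICIPALITY_BOUNDS` is a hypothetical table whose bounding box is a rough, hand-picked approximation, and a real check would rely on a gazetteer or reverse-geocoding service.

```python
# Hypothetical, approximate bounding boxes: (min_lat, max_lat, min_lon, max_lon).
MUNICIPALITY_BOUNDS = {
    ("Canada", "Quebec", "Montreal"): (45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_locality(lat, lon, country, state_province, municipality):
    """True/False if the locality is known, None if it cannot be checked."""
    bounds = MUNICIPALITY_BOUNDS.get((country, state_province, municipality))
    if bounds is None:
        return None
    min_lat, max_lat, min_lon, max_lon = bounds
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Processed values from the multi-field example.
print(coordinates_match_locality(45.5, -73.5666667, "Canada", "Quebec", "Montreal"))
```

Returning `None` for unknown localities keeps "could not verify" distinct from "verified mismatch", which matters when the result feeds a structured quality report.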