Data Quality: The elephant in the (big data) room
Chris Park, Data Scientist, UK Data Service
DataFirst Data Quality Workshop, Cape Town, South Africa, 6-7 July 2017
Janitors?
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labour of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
New York Times
“Data Science”
2016 CrowdFlower Survey
2016 CrowdFlower Survey
Key Messages
• Data cleaning and data quality are important in the present era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data for secondary research.
• The way forward is to work across disciplines and sectors, e.g. academia, government, and industry, to provide standardized access to and use of data that has potential to provide public value, e.g. energy data.
UK Data Service
• Curator of the UK’s largest collection of digital social and economic research data.
• Serving the data needs of social and economics researchers since 1967.
• Promotes data sharing and reproducibility, a topic of increasing importance, e.g. data as academic output.
• Has undergone a number of key transformations in response to changing user needs.
Decline of Survey Data, 1980 - 2010
Chetty, R. (2012). The Transformative Potential of Administrative Data for Microeconometric Research. Retrieved from http://conference.nber.org/confer/2012/SI2012/LS/ChettySlides.pdf
AER: American Economic Review; JPE: Journal of Political Economy; QJE: Quarterly Journal of Economics; ECMA: Econometrica
Rise of Administrative Data, 1980 - 2010
Chetty, R. (2012). The Transformative Potential of Administrative Data for Microeconometric Research. Retrieved from http://conference.nber.org/confer/2012/SI2012/LS/ChettySlides.pdf
AER: American Economic Review; JPE: Journal of Political Economy; QJE: Quarterly Journal of Economics; ECMA: Econometrica
Human Activity
Human Activity
Same architecture, different infrastructure
And also: in response to changing user needs, diversifying into new and emerging forms of data with public impact, e.g. energy data.
Smarter Household Energy Data
• Partnership between UK Data Service, UCL Centre for Energy Epidemiology, and DataFirst.
• Explore ways to scale up research using household energy data, e.g. benefits and barriers.
• Energy research is important:
  - Energy is the linchpin of modern economic activity,
  - Efficient use can help reduce negative impact on the environment and help consumers save money on their bills,
  - Linking with sociodemographic data can help identify and support fuel poor households, etc.
Energy Research
• Key lies in linking energy data with administrative data such as building and sociodemographic data.
• Topics studied include:
  - Forecasting based on machine learning, which helps with estimating supply.
  - Helping consumers save money on their bills by shifting energy consumption to lower-tariff times of the week.
  - Disaggregating energy use to break down consumption to the appliance level.
Barriers to Energy Research
• Heavily anonymized data, e.g. limited ability to link with other datasets.
• Limited and biased samples, e.g. recruitment-based studies.
• One-time datasets, e.g. sprawl, limited reproducibility.
• Data governance and provenance issues, e.g. no standard documentation.
Barriers to Energy Research
• Missing and duplicate observations and lack of standardized markers, e.g. “”, NA, NULL, 99, ‘99’, etc.
• Timestamp formats: different combinations of date, time, and date + time columns, and handling of time zones, e.g. daylight saving creates false features:
  - Duplicates when the clock turns back 1 hour,
  - Missing observations when the clock shifts forward 1 hour.
• 80-90% of time spent in janitorial work.
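The missing-value markers and daylight saving effects listed above can be handled in a few lines. The following is a minimal Python sketch, not the workflow used in the project: the function names `normalize_missing` and `local_to_utc`, the set of sentinel markers, and the `Europe/London` time zone are assumptions made for illustration. It also assumes that readings arrive in chronological order with naive local timestamps.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Assumed sentinel set, taken from the markers listed on the slide.
# Caution: treating 99 as missing also discards any genuine reading of 99.
MISSING_MARKERS = {"", "NA", "NULL", "99"}


def normalize_missing(raw):
    """Return None for an assumed missing marker, else the value as a float."""
    if raw is None:
        return None
    # Strip whitespace and surrounding straight quotes, so '99' matches 99.
    text = str(raw).strip().strip("'\"")
    if text.upper() in MISSING_MARKERS:
        return None
    return float(text)


def local_to_utc(local_stamps, tz_name="Europe/London"):
    """Convert chronologically ordered naive local timestamps to UTC.

    A local time that has already been seen is treated as the repeated
    hour after the clocks go back (fold=1), so it maps to a distinct
    UTC instant instead of looking like a duplicate.
    """
    tz = ZoneInfo(tz_name)
    seen = set()
    converted = []
    for stamp in local_stamps:
        fold = 1 if stamp in seen else 0
        seen.add(stamp)
        aware = stamp.replace(tzinfo=tz, fold=fold)
        converted.append(aware.astimezone(timezone.utc))
    return converted


if __name__ == "__main__":
    print([normalize_missing(v) for v in ["", "NA", "'99'", "12.5"]])
    # Half-hourly local readings across the autumn clock change in 2016.
    autumn = [
        datetime(2016, 10, 30, 1, 0),
        datetime(2016, 10, 30, 1, 30),
        datetime(2016, 10, 30, 1, 0),   # same wall-clock time, one hour later
        datetime(2016, 10, 30, 1, 30),
    ]
    for utc_stamp in local_to_utc(autumn):
        print(utc_stamp.isoformat())
```

Storing timestamps in UTC is one way to remove both false features: the repeated local hour becomes two distinct instants, and the hour skipped in spring is not a gap in UTC.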
Key Messages
• Data cleaning and data quality are important in the era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data.
• The way forward is through collaborative projects between academia, government, and industry that facilitate access to and use of data with policy implications, e.g. energy data.
From ‘Dumb’ to ‘Smart’: Meters
Why Smart Meters?
• Better control and oversight over own energy use.
• No more ‘estimated’ bills, and no more meter readers visiting your home.
• Researchers can have access to raw, unadjusted data.
• Opportunity to standardize how energy data is stored and shared to encourage reproducibility.
Smart Meter Roll-out Plans in Europe
Data Quality Challenges
Retrieved from https://www.intechopen.com/source/html/50727/media/fig2.png
Lessons learned and way forward
• Academia, industry, and government all have something to offer.
• Smart meters provide a unique opportunity to demonstrate how data-driven innovation across industries and sectors can create public value.
• Want: a unified, standardized, and secure interface to smart meter data that can help researchers and policymakers.
Smart Meter Research Portal
• Serve as a knowledge base for intervention and longitudinal studies using energy data across the socio-technical spectrum.
• Provide seamless access to standardized smart meter data at half-hourly, daily, or monthly resolutions.
• Facilitate a secure data linkage service within an ISO-certified, trusted digital repository.
• Use cutting-edge technology based on the big data platform at the UK Data Service.
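The half-hourly, daily, and monthly resolutions mentioned above are related by simple aggregation. As a rough illustration only, and not a description of the portal itself, the sketch below sums half-hourly readings into daily or monthly totals. The function name `aggregate`, the `(timestamp, kWh)` tuple layout, and the choice to skip missing readings are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def aggregate(readings, resolution="daily"):
    """Sum (timestamp, kWh) half-hourly readings into daily or monthly totals.

    Readings whose value is None are skipped, so a period with missing
    data is undercounted instead of being flagged.
    """
    if resolution == "daily":
        key_format = "%Y-%m-%d"
    elif resolution == "monthly":
        key_format = "%Y-%m"
    else:
        raise ValueError("resolution must be 'daily' or 'monthly'")

    totals = defaultdict(float)
    for stamp, kwh in readings:
        if kwh is None:
            continue
        totals[stamp.strftime(key_format)] += kwh
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    sample = [
        (datetime(2017, 7, 6, 0, 0), 0.5),
        (datetime(2017, 7, 6, 0, 30), 0.25),
        (datetime(2017, 7, 7, 0, 0), 1.0),
        (datetime(2017, 8, 1, 0, 0), None),
    ]
    print(aggregate(sample, "daily"))
    print(aggregate(sample, "monthly"))
```

A production service would more likely report how many readings were missing in each period alongside the total, so that users can judge whether a daily or monthly figure is complete.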
Data Service as a Platform
Key Messages
• Data cleaning and data quality are important in the era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data.
• The way forward is to work across disciplines and sectors, e.g. academia, government, and industry, to provide standardized access to and use of data that has potential to provide public value, e.g. energy data.
Chris Park, Big Data Network Support, UK Data Service