Data Service Centre and Apache Spark
at Statistics Netherlands
Statistical process and DSC
[Diagram labels: Website, Statline, Open data, Articles, Books; DSC; Microdata services; RIN (shown four times)]
DSC not a traditional data warehouse
• Technical backend: document management system Documentum (Open Text)
• Only statistical data that can be stored in rows and columns (no documents, images, etc.)
• Data stored as text files (csv, fixed-width): future-proof
• Primary focus was archiving, but now increasingly on data exchange
• Retrieve data and process it in SPSS, R, Python or custom-built systems
• Almost 14,000 datasets, mostly microdata
• Covers all domains: social statistics, business statistics, national accounts, health statistics, energy statistics, agricultural statistics, etc.
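Because the data are stored as plain text (csv or fixed-width), a record can be read with nothing more than a column layout. A minimal sketch in plain Python, assuming a hypothetical layout: the column positions, widths and sample records below are invented for illustration and do not describe an actual DSC dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (column name, start offset, width).
# Invented for illustration; not a real DSC dataset description.
LAYOUT: List[Tuple[str, int, int]] = [
    ("RINPERSOONS", 0, 1),
    ("RINPERSOON", 1, 9),
    ("GEM", 10, 4),
]


def parse_fixed_width(line: str, layout: List[Tuple[str, int, int]] = LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    return {name: line[start:start + width].strip() for name, start, width in layout}


# Two made-up records of 14 characters each.
sample = ["R0000000010001", "R0000000020002"]
records = [parse_fixed_width(line) for line in sample]
print(records)
```

Fields are kept as strings so that leading zeros in identifiers and codes survive; converting to numbers is left to the consuming tool.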
DSC Catalogue
System of Social statistical Datasets (SSD)
• Subset of data in DSC
• Highly coordinated
• Mostly based on administrative sources, some surveys
• ‘Backbones’ (persons, buildings, households, companies)
• Linkable datasets
• Widely used for statistical production and research: longitudinal, small groups, intergenerational, networks
• SSD tool set on top of DSC
• https://www.cbs.nl/NR/rdonlyres/98BFF618-D7A7-4897-85D6-6293CFB8EA75/0/systemofsocialstatisticaldatasets.pdf
Proof of concept ‘Data lake’
[Diagram labels: data sources DSC, Raw data, Big data, Other SN data, Other data; Data virtualisation (Denodo); Users; Statistics Netherlands and the ‘outside’; Metadata; Governance; Organisation; Governance+]
BIG DATA is of all times
ca. 1981–1975 B.C.
1888 Hackathon US Census Bureau
Contest: the person who could process and tabulate the data fastest would earn a contract for the 1890 Census.
Process:
Participant A: 144 hrs
Participant B: 100 hrs
Participant C: 72 hrs
Tabulate:
Participant A: 44 hrs
Participant B: 55 hrs
Participant C: 5 hrs
Herman Hollerith
1896 Tabulating Machine Company
1911 Computing-Tabulating-Recording Company
1924 International Business Machines Corporation
1908
2018: DSC contains about 14 thousand datasets (≈5 TB). Retrieving and processing data should go faster.
Can we build a tabulating machine based on contemporary technology?
Apache SPARK
Test case
DSC
Authentication
SPARK
Spark programming (PySpark)
Data control
Authorisation control
Metadata
User story
After a CBS press release about average capital per municipality*, a journalist asks whether the top 10 would be the same when one looks at average wage per municipality.
Top 10 average capital per municipality, 2016
Laren (NH.)
Blaricum
Bloemendaal
Wassenaar
Rozendaal
Heemstede
Bergen (NH.)
Alphen-Chaam
De Bilt
Westvoorne
*https://www.cbs.nl/nl-nl/nieuws/2018/06/vermogen-huishoudens-bijna-10-procent-hoger-in-2016
DSC Datasets
SPOLIS2015: all jobs in NL in 2015
• Variables: SBASISLOON (wage), SREGULIEREUREN (hours)
• Filter: SDATUMAANVANGIKO >= 20150101 and SDATUMAANVANGIKO <= 20150131
• Link by: RINPERSOONS, RINPERSOON
GBAADRESOBJECT2015: all addresses 2015
• Filter: GBADATUMAANVANGADRESHUISHOUDING <= 20150101 and GBADATUMEINDEADRESHUISHOUDING >= 20150101
• Link by: RINPERSOONS, RINPERSOON; SOORTOBJECTNUMMER, RINOBJECTNUMMER
VSLGWB2015: municipality-district-neighbourhood code of all addresses
• Variable: GEM, derived from GWBCODE2016 [1-4]
• Link by: SOORTOBJECTNUMMER, RINOBJECTNUMMER
Dataset sizes shown: 10 mln records, 1.74 GB; 61 mln records, 3.45 GB; 110 mln records, 68.76 GB
Aggregate on GEM (MUN)
UURLOON (HOURLYWAGE) = Sum(SBASISLOON) / Sum(SREGULIEREUREN)
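The test case itself ran as PySpark syntax on a cluster; that code is not reproduced in the slides. The sketch below only illustrates the logic of the steps above (filter, link, derive GEM, aggregate) in plain Python, so it runs without Spark. All record values are invented; only the variable names come from the slide. It assumes that "[1-4]" means the first four characters of GWBCODE2016 and that the links behave as inner joins.

```python
from collections import defaultdict

# Invented sample records; only the variable names come from the slide.
spolis = [  # jobs
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SDATUMAANVANGIKO": 20150101,
     "SBASISLOON": 3000.0, "SREGULIEREUREN": 150.0},
    {"RINPERSOONS": "R", "RINPERSOON": "2", "SDATUMAANVANGIKO": 20150115,
     "SBASISLOON": 2000.0, "SREGULIEREUREN": 100.0},
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SDATUMAANVANGIKO": 20150101,
     "SBASISLOON": 1600.0, "SREGULIEREUREN": 160.0},
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SDATUMAANVANGIKO": 20150201,
     "SBASISLOON": 9999.0, "SREGULIEREUREN": 1.0},  # February: filtered out
]
adres = [  # person-address spells
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "10",
     "GBADATUMAANVANGADRESHUISHOUDING": 20100101, "GBADATUMEINDEADRESHUISHOUDING": 88888888},
    {"RINPERSOONS": "R", "RINPERSOON": "2", "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "11",
     "GBADATUMAANVANGADRESHUISHOUDING": 20140601, "GBADATUMEINDEADRESHUISHOUDING": 88888888},
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "10",
     "GBADATUMAANVANGADRESHUISHOUDING": 20000101, "GBADATUMEINDEADRESHUISHOUDING": 20141231},  # ended: filtered out
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "20",
     "GBADATUMAANVANGADRESHUISHOUDING": 20150101, "GBADATUMEINDEADRESHUISHOUDING": 88888888},
]
gwb = [  # address -> municipality-district-neighbourhood code
    {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "10", "GWBCODE2016": "00010101"},
    {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "11", "GWBCODE2016": "00010203"},
    {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "20", "GWBCODE2016": "00020304"},
]

REFERENCE_DAY = 20150101


def in_january_2015(job):
    """SPOLIS filter from the slide."""
    return 20150101 <= job["SDATUMAANVANGIKO"] <= 20150131


def valid_on_reference_day(spell):
    """GBAADRESOBJECT filter from the slide: spell covers 1 January 2015."""
    return (spell["GBADATUMAANVANGADRESHUISHOUDING"] <= REFERENCE_DAY
            <= spell["GBADATUMEINDEADRESHUISHOUDING"])


# GEM: assumed to be the first four characters of GWBCODE2016.
gem_by_object = {(g["SOORTOBJECTNUMMER"], g["RINOBJECTNUMMER"]): g["GWBCODE2016"][:4]
                 for g in gwb}

# Link addresses to municipalities, keyed by person.
gem_by_person = {(a["RINPERSOONS"], a["RINPERSOON"]):
                 gem_by_object[(a["SOORTOBJECTNUMMER"], a["RINOBJECTNUMMER"])]
                 for a in adres if valid_on_reference_day(a)}

# Link jobs to persons and sum wage and hours per municipality.
totals = defaultdict(lambda: [0.0, 0.0])
for job in filter(in_january_2015, spolis):
    gem = gem_by_person.get((job["RINPERSOONS"], job["RINPERSOON"]))
    if gem is None:
        continue  # no matching address: dropped, as in an inner join
    totals[gem][0] += job["SBASISLOON"]
    totals[gem][1] += job["SREGULIEREUREN"]

# UURLOON = Sum(SBASISLOON) / Sum(SREGULIEREUREN), per GEM.
uurloon = {gem: wage / hours for gem, (wage, hours) in totals.items()}
ranking = sorted(uurloon, key=uurloon.get, reverse=True)
print(uurloon, ranking)
```

Note that UURLOON is a ratio of sums per municipality, not an average of individual hourly wages, so jobs with many hours weigh more heavily.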
User interface
Processing time of the syntax on the Spark cluster: approx. 1 minute
Other advantages:
- Open source
- Modern tool set
- Syntax based
- Sharing code
- Visualisations
- Commonly used, documentation
Disclaimer: data shown are for demo purposes only; they are not official outcomes