Barteld Braaksma and Kees Zeelenberg “Re-make / Re-model”: Should big data change the modelling paradigm in official statistics?

Barteld Braaksma and Kees Zeelenberg

“Re-make / Re-model”:

Should big data change the modelling paradigm in official statistics?

Lay-out of presentation

– Sources and modes of inference
– Big data examples at Statistics Netherlands
– How to use big data?
  ‐ ‘as is’
  ‐ models
– But how about quality?
– More examples
– Conclusions

Sources for official statistics

Always start from observations
– Traditional surveys
  • Statistical populations
  • Owned by statistical offices (full control)
  • Costly and burdensome
– Administrative sources
  • Administrative populations
  • Owned by government bodies (limited control)
  • Cheaper to obtain
– Big (‘organic’) data
  • Unclear populations
  • Owned by private companies (no control)
  • Cost unclear

Modes of inference in official statistics

Main approaches for collecting and processing data
– Design-based
  ‐ Stratified sample survey of sales
– Model-assisted
  ‐ Combine tax data with sales survey (regression)
– Model-based
  ‐ Add up all sales from tax declarations
  ‐ (small-area estimates)
  ‐ (seasonal adjustment)
  ‐ (…)
– Sometimes ‘implicit models’
  ‐ Imputation of missing values
  ‐ Preliminary estimates of GDP
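
The difference between the design-based and model-assisted modes can be illustrated with a minimal sketch. It assumes simple random sampling within strata for the design-based estimate, and a simple linear regression of surveyed sales on a tax-register variable for the model-assisted estimate. The function names and the toy figures are hypothetical and are not taken from the presentation.

```python
from statistics import mean


def stratified_total(strata):
    """Design-based estimate of a population total.

    Each stratum is a dict with the population size "N" and the list of
    sampled values "sample"; the estimate is the sum of N_h * sample mean.
    """
    return sum(s["N"] * mean(s["sample"]) for s in strata)


def regression_total(sample_y, sample_x, population_x_total, population_size):
    """Model-assisted (regression) estimate of the total of y.

    y is observed only in the sample (e.g. surveyed sales), while the
    auxiliary variable x (e.g. turnover from tax data) has a known
    population total. Assumes a simple random sample.
    """
    y_bar = mean(sample_y)
    x_bar = mean(sample_x)
    # Ordinary least-squares slope of y on x within the sample.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sample_x, sample_y))
    sxx = sum((x - x_bar) ** 2 for x in sample_x)
    slope = sxy / sxx
    # Shift the sample mean by the gap between population and sample x means.
    x_pop_mean = population_x_total / population_size
    return population_size * (y_bar + slope * (x_pop_mean - x_bar))


if __name__ == "__main__":
    toy_strata = [
        {"N": 100, "sample": [10, 12, 14]},   # small firms (toy values)
        {"N": 20, "sample": [100, 120]},      # large firms (toy values)
    ]
    print(stratified_total(toy_strata))
    print(regression_total([2, 4, 6], [1, 2, 3], 30, 10))
```

The design-based estimate relies only on the sampling design, while the regression estimate borrows strength from the auxiliary source and is therefore sensitive to how well the assumed linear relation holds.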

Big data at Statistics Netherlands

Experiments discussed today
– Traffic detection loops
– Social media messages
– Mobile phone data

Other examples, not discussed here
– Scanner data (in production)
– Satellite images
– Financial transactions
– Internet robots (close to production)
– Google Trends
– PM: Administrative data (in production)

Traffic detection loops: daily pattern

Daytime population based on mobile phone data

Big data ‘as is’

– Imperfect, yet timely, indicator of trends
– “These data exist and that’s why they are interesting”
– Example: social media messages
  ‐ Signals of human activity and feelings

Dutch social media activity, 2010-2012

What are people talking about on Twitter?

Sentiment indicator using social media

Big data and statistics

Important issues:
– Undercoverage
– Selectivity
– Volatility
– Interpretation
– Continuity

Traditionalists’ view:
– These sources are useless for producing quality statistics

Modernists’ view:
– We should stop doing surveys, everything is already out there

Déjà-vu:
– Similar discussions when introducing administrative data…

How to use big data?

– Many methodological issues
– No linking variables (often)
– Additional information may be available
– Possible approach: combine available information
  ‐ By old or new mathematical methods (often Bayesian)
  ‐ By integration techniques (“National accounts”-style)
– But how about models?

Examples of models in official statistics

– Correction by weighting for non-response
– Imputation for item non-response
– Seasonal adjustment
– Estimates for small areas
– Capture-recapture models for hard-to-observe populations
– Preliminary (flash) estimates of GDP
– So we are already using models in official statistics!
– But we should look carefully at principles and conditions
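
Two of the models listed above are simple enough to sketch in a few lines. The example assumes a weighting-class adjustment for non-response and the textbook two-source Lincoln-Petersen estimator (with Chapman's small-sample variant) for capture-recapture. These are standard formulas, not necessarily the ones used at Statistics Netherlands, and the function names are hypothetical.

```python
def nonresponse_weight(design_weight, n_sampled, n_respondents):
    """Weighting-class adjustment: inflate the design weight by the
    inverse of the response rate observed in the unit's weighting class."""
    if n_respondents <= 0:
        raise ValueError("a weighting class needs at least one respondent")
    return design_weight * n_sampled / n_respondents


def lincoln_petersen(n_first, n_second, n_both):
    """Two-source capture-recapture estimate of population size.

    n_first  -- units found in the first source
    n_second -- units found in the second source
    n_both   -- units found in both sources
    Assumes the sources are independent and the population is closed.
    """
    if n_both <= 0:
        raise ValueError("no overlap between sources: estimate undefined")
    return n_first * n_second / n_both


def chapman(n_first, n_second, n_both):
    """Chapman's variant, which is less biased in small samples and
    remains defined when the overlap is zero."""
    return (n_first + 1) * (n_second + 1) / (n_both + 1) - 1


if __name__ == "__main__":
    # Toy figures: 200 units in source A, 150 in source B, 30 in both.
    print(lincoln_petersen(200, 150, 30))
    print(chapman(200, 150, 30))
    print(nonresponse_weight(10.0, 50, 40))
```

Both estimators rest on explicit assumptions (response is random within a class; the two sources are independent), which is the kind of condition the slide asks to examine carefully.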

Guiding principles of official statistics

European Statistical System, mission statement
– “We provide the European Union, the world and the public with independent high quality information on the economy and society on European, national and regional levels and make the information available to everyone for decision-making purposes, research and debate.”

ESS Code of Practice, principle 6:
– “Statistical authorities develop, produce and disseminate European Statistics respecting scientific independence and in an objective, professional and transparent manner in which all users are treated equitably.”

ESS Code of Practice, principle 7:
– “Sound methodology underpins quality statistics. This requires adequate tools, procedures and expertise.”

ESS Code of Practice, principle 12:
– “European Statistics accurately and reliably portray reality.”

So how about quality?

For use of models this implies:
– Objectivity:
  ‐ Do not move too far from observed data
  ‐ Objects and populations for the model correspond to the statistical phenomenon
  ‐ No forecasting
– Reliability:
  ‐ Extensive specification to guarantee robustness against model failure
  ‐ No behavioural models

Some model-based examples

– Relation assumed between observations and phenomena
– Sophisticated modelling
– Trial and error
– Signal and noise

Bayesian recursive filter (single traffic loop)
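
The slide does not specify which filter was applied to the traffic loop data. As an illustration only, the sketch below uses the simplest member of the family, a scalar Kalman filter for a local-level (random walk plus noise) model. The function name, the parameters and the choice of model are assumptions, not the authors' method.

```python
def local_level_filter(observations, process_var, obs_var, prior_mean, prior_var):
    """Scalar Kalman filter for a random-walk level observed with noise.

    observations -- sequence of counts; None marks a missing reading
    process_var  -- variance of the random-walk step of the true level
    obs_var      -- variance of the measurement noise
    prior_mean, prior_var -- Gaussian prior on the initial level
    Returns a list of (posterior mean, posterior variance) per time step.
    """
    level, var = prior_mean, prior_var
    estimates = []
    for y in observations:
        # Predict: the level may have drifted, so uncertainty grows.
        var = var + process_var
        if y is not None:
            # Update: weigh the new reading against the prediction.
            gain = var / (var + obs_var)
            level = level + gain * (y - level)
            var = (1 - gain) * var
        estimates.append((level, var))
    return estimates


if __name__ == "__main__":
    # Toy minute counts from one loop, with one missing reading.
    counts = [12.0, 15.0, None, 14.0, 40.0, 13.0]
    for level, var in local_level_filter(counts, 1.0, 25.0, 12.0, 10.0):
        print(round(level, 2), round(var, 2))
```

Each step combines the previous posterior with the new reading, so missing readings are bridged by the prediction and isolated spikes are damped rather than passed through, which is the signal-versus-noise separation this kind of filter is meant to achieve.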

EMD-filtered monthly rush hour indicator and expected manufacturing development

Google Trends for nowcasting (Choi & Varian using a Bayesian regression method)

Mobile phone data vs. traffic loops: opportunities for integration?

Conclusions

– Big data leads to new opportunities
  ‐ Better accuracy and more details
  ‐ More frequent and more timely estimates
  ‐ Statistics in new areas
– Big data based statistics are useful in their own right
– Don’t be afraid to use models
  ‐ Documented and transparent
  ‐ Well tested
  ‐ Describe, do not judge