
Statistics Canada’s Small Area Estimation Product: BUPF 1.0 (Best Unbiased Prediction via Filtering)

SAE-SPORD Project Team
Statistics Research and Innovation Division
Statistics Canada, Ottawa

(for presentation to FLMM-LMIWG Workshop on Oct 17, 2007, Vancouver, BC)


Project: SAE-SPORD (Small Area Estimation for Statistical Product Oriented R&D)

Team: Avi Singh (Project Leader)

François Verret

Claude Nadeau

Pin Yuan

Acknowledgments:

Meth Res Block Fund, Labour Stat Div, FLMM-LMIWG


Outline

1. SAE: Introduction

2. SAE: Visual Depiction

3. Product BUPF: Description

4. BUPF Application to Labour Force Survey

5. BUPF Demonstration (GUI Sample Screen-shots)

6. Concluding Remarks and Future Work


1. SAE: Introduction

Direct estimates for small areas (or domains) are often not reliable; e.g., at the provincial level, annual LFS estimates for Managers in Manufacturing and Utilities (three-digit occupation code A39) are not reliable. Here, provinces can be regarded as small areas.

Data Requirements: Provincial estimates of employment by 3-digit occupation codes


Table 1: Monthly Total Employed (A39) (Annual Average for 2003 LFS)

Prov.     Population Size   Sample Size   Direct Estimate      SE   CV in %
NL                429,298         3,978               670     177      26.4
PEI               109,886         2,769               233      55      23.5
NS                758,549         5,858             1,532     292      19.0
NB                607,565         5,624             1,275     218      17.1
PQ              6,059,655        18,234            25,273   2,204       8.7
ON              9,766,566        30,373            42,447   3,178       7.5
MA                876,396         7,117             3,023     432      14.3
SK                744,431         7,295             1,963     339      17.3
AB              2,467,412        10,317             7,643   1,098      14.4
BC              3,346,181         9,636             8,676   1,228      14.2
Canada         25,165,939       101,201            92,734   4,260       4.6
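
The CV column in Table 1 is the standard error expressed as a percentage of the direct estimate. The short Python check below recomputes it from the table; the names `table1` and `cv_percent` are made up for this illustration, and small differences from the published column are expected because the table inputs are rounded.

```python
# (direct estimate, standard error, published CV in %) copied from Table 1.
table1 = {
    "NL": (670, 177, 26.4), "PEI": (233, 55, 23.5), "NS": (1532, 292, 19.0),
    "NB": (1275, 218, 17.1), "PQ": (25273, 2204, 8.7), "ON": (42447, 3178, 7.5),
    "MA": (3023, 432, 14.3), "SK": (1963, 339, 17.3), "AB": (7643, 1098, 14.4),
    "BC": (8676, 1228, 14.2), "Canada": (92734, 4260, 4.6),
}


def cv_percent(estimate, standard_error):
    """Coefficient of variation in percent: 100 * SE / estimate."""
    return 100.0 * standard_error / estimate


for area, (estimate, se, published_cv) in table1.items():
    print(f"{area:>6}: recomputed CV = {cv_percent(estimate, se):5.1f}%"
          f"  (table: {published_cv}%)")
```

The smaller provinces have the largest CVs, which is the reliability problem that motivates SAE.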


1. SAE: Introduction …cont.

A larger sample is needed to get more reliable direct estimates.

A cost-effective alternative: use a model such as the common mean model; e.g., assume that the proportion employed in A39 is common across provinces.

Quality of estimates depends on the validity of the model.


1. SAE: Introduction …cont.

Model provides an indirect (or synthetic) estimate at the area level.

For the common mean model, multiply the national total by the provincial population proportion to get the indirect estimate; e.g., for NL,

1.7% × 92,734 ≈ 1,582 (using the unrounded proportion 429,298 / 25,165,939)
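
The same calculation can be repeated for every province from the population sizes in Table 1. The sketch below is illustrative Python and not part of BUPF; the names `population` and `synthetic_estimates` are made up for this example.

```python
# Population sizes by province, copied from Table 1.
population = {
    "NL": 429_298, "PEI": 109_886, "NS": 758_549, "NB": 607_565,
    "PQ": 6_059_655, "ON": 9_766_566, "MA": 876_396, "SK": 744_431,
    "AB": 2_467_412, "BC": 3_346_181,
}
national_direct_total = 92_734  # direct estimate for Canada (Table 1)


def synthetic_estimates(pop, national_total):
    """Common mean model: allocate the national total by population share."""
    total_pop = sum(pop.values())
    return {prov: national_total * n / total_pop for prov, n in pop.items()}


indirect = synthetic_estimates(population, national_direct_total)
for prov, value in indirect.items():
    share = 100 * population[prov] / sum(population.values())
    print(f"{prov:>3}  share = {share:4.1f}%  indirect = {value:8,.0f}")
```

The allocations add up to the national total by construction, so the Canada row is the same for the direct and the indirect estimate.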


Table 2: Direct and Indirect (under an oversimplified model) Estimates for A39 (Annual Average for 2003 LFS)

Prov.     Population Portion   Sample Size (area)   Direct Estimate   SAE   Indirect Estimate   Sample Size (national)
NL                      1.7%                3,978               670                     1,582                  101,201
PEI                     0.4%                2,769               233                       405                  101,201
NS                      3.0%                5,858             1,532                     2,795                  101,201
NB                      2.4%                5,624             1,275                     2,239                  101,201
PQ                     24.1%               18,234            25,273                    22,329                  101,201
ON                     38.8%               30,373            42,447                    35,989                  101,201
MA                      3.5%                7,117             3,023                     3,229                  101,201
SK                      3.0%                7,295             1,963                     2,743                  101,201
AB                      9.8%               10,317             7,643                     9,092                  101,201
BC                     13.3%                9,636             8,676                    12,330                  101,201
Canada                100.0%              101,201            92,734                    92,734                  101,201


1. SAE: Introduction …cont.

A combination of the two estimates (direct and indirect) may provide a reasonable estimate with adequate precision, depending on the level of the small area.

The direct estimate is unbiased but not precise, while the indirect estimate is generally precise but biased.


1. SAE: Introduction …cont.

SAE combines the direct and the indirect in an optimal way:

SAE for area d = (shrinkage factor for d) × (direct estimate for d) + (1 − shrinkage factor for d) × (indirect estimate for d)

If the shrinkage factor is 10%, then only 10% of direct and 90% of indirect are used for SAE. If it is 50%, then both direct and indirect have equal say in compositing the two for SAE.
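
As a minimal illustration of this compositing rule, the Python sketch below applies it to the NL figures from Tables 1 and 2 (direct 670, common-mean indirect 1,582). The function name and the two shrinkage values are only for illustration; they are not BUPF outputs.

```python
def composite_estimate(direct, indirect, shrinkage):
    """Weighted combination: shrinkage * direct + (1 - shrinkage) * indirect."""
    if not 0.0 <= shrinkage <= 1.0:
        raise ValueError("shrinkage factor must lie between 0 and 1")
    return shrinkage * direct + (1.0 - shrinkage) * indirect


# NL figures from Tables 1 and 2: direct 670, common-mean indirect 1,582.
direct_nl, indirect_nl = 670.0, 1582.0
sae_10 = composite_estimate(direct_nl, indirect_nl, 0.10)  # mostly indirect
sae_50 = composite_estimate(direct_nl, indirect_nl, 0.50)  # equal say
print(sae_10, sae_50)
```

With a 10% shrinkage factor the result stays close to the indirect estimate (about 1,491); with 50% it is the simple average of the two (1,126).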


1. SAE: Introduction …cont.

The relative size of the shrinkage factor depends on variability in modeling error (in the indirect estimate) and sampling error (in the direct estimate).

Effective sample size for SAE is more than that for the direct estimate.
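
One common way to make this dependence concrete is the area-level (Fay-Herriot type) form, in which the shrinkage factor is the model error variance divided by the sum of the model error variance and the sampling variance. Whether BUPF uses exactly this form is not stated in these slides, so the sketch below rests on that assumption, and all variance values in it are hypothetical.

```python
def shrinkage_factor(model_variance, sampling_variance):
    """Weight given to the direct estimate in an area-level composite.

    Assumed form: model_variance / (model_variance + sampling_variance).
    """
    if model_variance < 0 or sampling_variance <= 0:
        raise ValueError("model variance must be >= 0 and sampling variance > 0")
    return model_variance / (model_variance + sampling_variance)


# Hypothetical values: one model error variance, three sampling variances.
model_variance = 1.0
factors = {}
for sampling_variance in (0.25, 1.0, 4.0):
    factors[sampling_variance] = shrinkage_factor(model_variance, sampling_variance)
    print(f"sampling variance {sampling_variance:4.2f} -> "
          f"shrinkage factor {factors[sampling_variance]:.2f}")
```

Under this assumed form, a noisier direct estimate gets a smaller shrinkage factor, so the SAE leans more on the indirect estimate.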


1. SAE: Introduction (Modeling Requirements)

Direct estimates from other small areas (termed indirect data) are needed for modeling purposes, i.e., for predicting the estimate for the area of interest.

Enough small areas are needed for adequate modeling. Subdivide provinces into subprovincial areas:
• Use ER (economic region), or ER by age by gender, instead of province, although it is the province level that is of interest.


1. SAE: Introduction (Modeling Requirements)

It is beneficial to have an auxiliary information source (administrative or census): true population totals are needed at the area level for all areas.

Using an auxiliary source can improve modeling with the indirect data.


1. SAE: Introduction (Modeling Requirements…cont.)

Examples of Auxiliary Information for LFS Application

Administrative Source
• Number of employment beneficiary claims at the area level
• Number with employment income

Population Census based demographic projections
• Subpopulation counts


1. SAE: Introduction (Modeling Requirements)

The model predictor based on indirect data and auxiliary data provides an indirect estimate for the area of interest.

The model can be simple, such as the common mean model, which does not use any auxiliary data, or it can be more advanced.


1. SAE: Introduction (Modeling Requirements)

All indirect estimates are biased, but the bias can be low if the model is good.

Combining direct and indirect estimates gives rise to estimates more precise than either one.

Benchmarking (Sum of small area total estimates within a subgroup of areas equals the direct estimate of the subgroup) helps in reducing model bias.
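
A simple way to see what the benchmarking constraint means is a ratio adjustment: scale the small area estimates in a subgroup so that they add up to the subgroup's direct estimate. This is only a stand-in for illustration; BUPF's self-benchmarking is described as part of the modeling and is not reproduced here. The function name and the input figures are hypothetical.

```python
def ratio_benchmark(area_estimates, subgroup_direct_total):
    """Scale estimates so their sum equals the subgroup's direct estimate."""
    current_total = sum(area_estimates.values())
    if current_total == 0:
        raise ValueError("cannot benchmark estimates that sum to zero")
    factor = subgroup_direct_total / current_total
    return {area: est * factor for area, est in area_estimates.items()}


# Hypothetical model-based estimates for three areas in one subgroup.
model_based = {"area_1": 120.0, "area_2": 300.0, "area_3": 80.0}
benchmarked = ratio_benchmark(model_based, subgroup_direct_total=550.0)
print(benchmarked, sum(benchmarked.values()))
```

After the adjustment the three areas add up to the subgroup's direct total of 550, while their relative sizes are unchanged.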


1. SAE: Introduction (User Concerns)

Detailed area-level requirements may vary from user to user.

However, one cannot go to a very low level for two reasons: the precision of SAEs may not be adequate, and auxiliary data may not be available.

There are bias concerns due to the use of indirect estimates for borrowing information; models may not be perfect, but one chosen with care may be useful.

SAE methodology involves a trade-off between bias and precision.


1. SAE: Introduction (User Concerns…cont.)

External validation of SAE can be done periodically using the census.

Validation by ‘local area’ knowledge is also possible.

Confidentiality concerns (this may or may not be a problem: the smaller the area, the larger the error in SAE, which gives built-in protection)


2. SAE: A Visual Depiction

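• However, with the usual SAE model the overall total is not preserved!

[Figure: For Employment in A39. Rows: provinces (NL, PEI, AB, BC), each split into ER by Age by Gender, and Canada. Columns: Before SAE (rag level), After SAE (rag level), After SAE (prov. level). Canada row: Good! (before SAE), Good? (after SAE, rag level), Good? (after SAE, prov. level).]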


2. SAE: A Visual Depiction...cont.

• Benchmarking ensures that the total stays the same after modeling

[Figure: For Employment in A39. Rows: provinces (NL, PEI, AB, BC), each split into ER by Age by Gender, and Canada. Columns: Before SAE (rag level), After SAE (rag level), After SAE (prov. level). Canada row: Good! in all three columns.]


3. Product BUPF: Description

STC’s SAE product is based on client needs identified at the SAE Workshop in Feb ’05 (see www.flmm-lmi.org for proceedings)

Main Features

• Menu-driven software system

• Sampling design is fully taken into account

• Self-benchmarking for protection against model breakdowns

• Area collapsing to include areas with no or few observations in the modeling process

• Extensive model diagnostics and evaluation of estimates

Existing software (such as SAS PROC MIXED, MLwiN, WinBUGS) are not satisfactory


3. Product BUPF 1.0: Description

Part I: Data Preparations

Part II: Modeling Preparations

Part III: Model Selection and Diagnostics

Part IV: Small Area Estimation and Evaluation

Part V: Summary Report


4. BUPF Application to LFS

Empirical results presented here are not yet final.

Two main components of the product:

• Modeling component (for increasing effective sample size)

• Estimation component (combining direct and indirect)


4. BUPF Application to LFS…cont

Model: Direct estimate for area d = True value + Sampling error

True value = Predictor + Model error

Predictor = x1β1 + x2β2 + …; it gives rise to indirect (or synthetic) estimates.

X-variables considered: number with reported income, number of employment beneficiaries, age-sex counts, etc., all at the small area level
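
The two-level structure above can be sketched in a few lines of Python. The example simulates areas from the stated model with a single x-variable, estimates β by weighted least squares, and then forms the indirect estimate (the predictor) and a composite SAE. It is a generic area-level illustration with made-up numbers and a model error variance treated as known; it is not BUPF's estimation procedure, and every name in it is hypothetical.

```python
import random

random.seed(2007)

beta_true = 0.05   # hypothetical regression coefficient
model_var = 4.0    # hypothetical model error variance (treated as known)

areas = []
for _ in range(40):
    x = random.uniform(100, 1000)               # auxiliary count for the area
    sampling_var = random.uniform(4.0, 100.0)   # sampling variance (known)
    true_value = x * beta_true + random.gauss(0.0, model_var ** 0.5)
    direct = true_value + random.gauss(0.0, sampling_var ** 0.5)
    areas.append({"x": x, "direct": direct, "sampling_var": sampling_var})

# Weighted least squares for beta (no intercept): each area is weighted by
# the inverse of the total variance of its direct estimate
# (model error variance + sampling variance).
num = sum(a["x"] * a["direct"] / (model_var + a["sampling_var"]) for a in areas)
den = sum(a["x"] ** 2 / (model_var + a["sampling_var"]) for a in areas)
beta_hat = num / den

for a in areas:
    a["indirect"] = a["x"] * beta_hat  # predictor = indirect estimate
    a["shrinkage"] = model_var / (model_var + a["sampling_var"])
    a["sae"] = (a["shrinkage"] * a["direct"]
                + (1.0 - a["shrinkage"]) * a["indirect"])

print(f"beta_hat = {beta_hat:.4f}")
```

Because the shrinkage factor lies between 0 and 1, each composite estimate falls between the direct and the indirect estimate for its area.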


Table 3: Direct, Indirect and SAE of Monthly Total Employed (A39) (Annual Average for 2003 LFS)

Prov.     Direct Estimate   Direct CV   SAE Estimate   SAE Mod. CV / Mod RRMSE (%)   Indirect Estimate   Indirect Mod. CV / Mod RRMSE   SAE - Dir (relative)
NF                    670       0.264            579                          14.4                 603                          0.229                 -0.136
PEI                   233       0.235            207                          16.8                 187                          0.179                 -0.111
NS                  1,532       0.19           1,417                          10.5               1,450                          0.177                 -0.075
NB                  1,275       0.171          1,112                          10.0               1,083                          0.168                 -0.128
PQ                 25,273       0.087         24,962                           5.6              25,381                          0.081                 -0.012
ON                 42,447       0.075         44,355                           6.3              46,255                          0.081                  0.045
MA                  3,023       0.143          2,348                           8.2               2,251                          0.129                 -0.223
SK                  1,963       0.173          1,766                           9.1               1,753                          0.164                 -0.100
AB                  7,643       0.144          7,276                           7.8               7,292                          0.134                 -0.048
BC                  8,676       0.142          8,712                           9.4               8,792                          0.129                  0.004
Canada             92,734       0.046         92,734                           4.6              95,047                          0.073                  0.000
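
The benchmarking property can be checked directly from Table 3: the provincial SAE estimates add up to the direct Canada total, while the indirect estimates do not. The snippet below is an illustrative Python check with arbitrary variable names. It also recomputes the last column as (SAE − direct) / direct, which appears to be how that column is defined; the PEI value differs in the third decimal, presumably because the table entries are rounded.

```python
# Rows of Table 3: (direct estimate, SAE estimate, indirect estimate).
table3 = {
    "NF": (670, 579, 603), "PEI": (233, 207, 187), "NS": (1532, 1417, 1450),
    "NB": (1275, 1112, 1083), "PQ": (25273, 24962, 25381),
    "ON": (42447, 44355, 46255), "MA": (3023, 2348, 2251),
    "SK": (1963, 1766, 1753), "AB": (7643, 7276, 7292),
    "BC": (8676, 8712, 8792),
}
canada_direct = 92_734

sae_total = sum(sae for _, sae, _ in table3.values())
indirect_total = sum(ind for _, _, ind in table3.values())

# Relative difference between SAE and direct, as in the last column.
rel_diff = {p: (sae - d) / d for p, (d, sae, _) in table3.items()}

print("Sum of SAE:     ", sae_total, "(direct Canada total:", canada_direct, ")")
print("Sum of indirect:", indirect_total)
for prov, value in rel_diff.items():
    print(f"{prov:>3}: {value:+.3f}")
```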


5. STC’s SAE Product Demonstration

BUPF 1.0 Demo


Part I: Data Preparations


Part II: Modeling Preparations


Part II: Modeling Preparations


Part III: Model Selection and Diagnostics


Part III: Model Selection and Diagnostics


Part IV: Small Area Estimation and Evaluation


6. Concluding Remarks and Future Work

The BUPF product for SAE has several unique features, such as self-benchmarking, domain collapsing for nonsampled domains, and extensive diagnostics.

The Graphical User Interface (GUI) for the product is useful as a systematic checklist or as a virtual analyst for efficient production; also useful for training and product demonstration.


6. Concluding Remarks and Future Work

Complete the beta version of BUPF 1.0; the current version is only an alpha version (a prototype) and is not suitable for production.

Plan a validation study with the 2006 Census.


For more information, please contact [email protected]

Thank you…Merci


Appendix

Product BUPF 1.0: Detailed Description


A1. Product BUPF 1.0: Description

Part I: Data Preparations
• M1: Data Specification
• M2: Task Specification
  • The definition of Small Area Modeling domains (SAM domains) is very important
  • Direct estimates, population counts and auxiliary data must be available at this level
  • The number of SAM domains should be high enough for proper modeling
  • Here, SAM domain = ER (73) by Age (4) by Gender (2), i.e., 73 × 4 × 2 = 584 SAM domains


A2. Product BUPF 1.0: Description

Part II: Modeling Preparations
• M3: Benchmark Constraints & Baseline Model
  • Self-benchmarking is important to protect against model breakdowns as no model is perfect
  • Option: No BC, Global BC, Regional BC
• M4: Domain Collapsing
  • Improved alternative to leaving small sample size SAM domains outside of the model
• M5: Variance Smoothing


A3. Product BUPF 1.0: Description

Part III: Model Selection and Diagnostics
• M6: Model Selection
  • Standard Forward and Backward procedures implemented
• M7: Variance Component
  • Needed to find the proper shrinkage to move indirect to direct
• M8: Innovation Sequence
  • Makes it possible to diagnose the model with standard “iid N(0,1)” error tests
• M9: Model Diagnostics
  • Residual Plots, QQ-plots, R-square, Chi-square test for overdispersion and for model adequacy…


A4. Product BUPF 1.0: Description

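Part IV: Small Area Estimation and Evaluation
• M10: Small Area Estimation

\[
\begin{aligned}
\mathrm{SAE}_d &= \gamma_d\,(\text{direct})_d + (1-\gamma_d)\,(\text{indirect})_d \\
               &= (\text{indirect})_d + \gamma_d\,\bigl[(\text{direct})_d - (\text{indirect})_d\bigr]
\end{aligned}
\]

  \gamma_d: shrinkage factor for area d

• M11: Evaluation of Estimates
  • Check for relative difference between direct and SAE
  • Other measures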


A5. Product BUPF 1.0: Description

Part V: Summary Report
• M12: Overall Summary
  • Sampling Design and Data Sources (Part I)
  • Input Diagnostics (Part II)
  • Modeling Diagnostics (Part III)
  • Output Diagnostics (Part IV)