STK 4600: Statistical methods for social sciences.
Survey sampling and statistical demography
Surveys for households and individuals
Survey sampling: 4 major topics
1. Traditional design-based statistical inference (7 weeks)
2. Likelihood considerations (1 week)
3. Model-based statistical inference (3 weeks)
4. Missing data and nonresponse (2 weeks)
Statistical demography (2 weeks)
• Mortality
• Life expectancy
• Population projections
Course goals
• Give students knowledge about:
– planning surveys in social sciences
– major sampling designs
– basic concepts and the most important estimation methods in traditional applied survey sampling
– Likelihood principle and its consequences for survey sampling
– Use of modeling in sampling
– Treatment of nonresponse
– A basic knowledge of demography
But first: Basic concepts in sampling
Population (Target population): The universe of all units of interest for a certain study
• Denoted, with N being the size of the population: U = {1, 2, ...., N}
All units can be identified and labeled
• Ex: Political poll – All adults eligible to vote
• Ex: Employment/Unemployment in Norway – All persons in Norway aged 15 or more
• Ex: Consumer expenditure: Unit = household
Sample: A subset of the population, to be observed. The sample should be "representative" of the population
Sampling design:
• The sample is a probability sample if all units in the sample have been chosen with certain probabilities, and such that each unit in the population has a positive probability of being chosen for the sample
• We shall only be concerned with probability sampling
• Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen for the sample.
• The probability distribution for SRS on all subsets of U is an example of a sampling design: the probability plan for selecting a sample s from the population:
$$p(s) = \begin{cases} 1/\binom{N}{n} & \text{if } |s| = n \\ 0 & \text{if } |s| \neq n \end{cases}$$
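The SRS design can be checked by brute force on a tiny population. The sketch below is illustrative only: the values N = 5 and n = 2 and all variable names are assumptions, not part of the course material. It enumerates every subset of size n, assigns each the same probability, and confirms that each unit's inclusion probability equals n/N.

```r
# Enumerate every possible sample of size n from a small population U = {1, ..., N}
N <- 5                                   # illustrative population size (assumption)
n <- 2                                   # illustrative sample size (assumption)
samples <- combn(N, n)                   # each column is one subset s with |s| = n
n_samples <- ncol(samples)               # number of subsets, equal to choose(N, n)
p_s <- rep(1 / n_samples, n_samples)     # SRS design: same probability for every subset

# Inclusion probability of unit i: sum of p(s) over the samples that contain i
incl_prob <- sapply(1:N, function(i) sum(p_s[colSums(samples == i) > 0]))
print(incl_prob)                         # every entry should equal n / N
```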
Basic statistical problem: Estimation
• A typical survey has many variables of interest
• Aim of a sample is to obtain information regarding totals or averages of these variables for the whole population
• Example: Unemployment in Norway
– Want to estimate the total number t of individuals unemployed.
For each person i (at least 15 years old) in Norway:
$$y_i = \begin{cases} 1 & \text{if person } i \text{ is unemployed} \\ 0 & \text{otherwise} \end{cases}$$
Then:
$$t = \sum_{i=1}^{N} y_i$$
• In general, variable of interest: y with yi equal to the value of y for unit i in the population, and the total is denoted
$$t = \sum_{i=1}^{N} y_i$$
• The typical problem is to estimate t or t/N
• Sometimes it is also of interest to estimate ratios of totals:
Example - estimating the rate of unemployment:
$$y_i = \begin{cases} 1 & \text{if person } i \text{ is unemployed} \\ 0 & \text{otherwise} \end{cases} \qquad x_i = \begin{cases} 1 & \text{if person } i \text{ is in the labor force} \\ 0 & \text{otherwise} \end{cases}$$
with totals $t_y$, $t_x$
Unemployment rate: $t_y / t_x$
Sources of error in sample surveys
1. Target population U vs Frame population UF
Access to the population is through a list of units – a register UF. U and UF may not be the same. Three possible errors in UF:
– Undercoverage: Some units in U are not in UF
– Overcoverage: Some units in UF are not in U
– Duplicate listings: A unit in U is listed more than once in UF
• UF is sometimes called the sampling frame
2. Nonresponse - missing data
• Some persons cannot be contacted
• Some refuse to participate in the survey
• Some may be ill and incapable of responding
• In postal surveys: Can be as much as 70% nonresponse
• In telephone surveys: 50% nonresponse is not uncommon
• Possible consequences:
– Bias in the sample, not representative of the population
– Estimation becomes more inaccurate
• Remedies:
– imputation, weighting
3. Measurement error – the correct value of yi is not measured
– In interviewer surveys:
• Incorrect marking
• Interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use
• Respondents misunderstand the question or do not remember correctly.
4. Sampling «error»
– The error (uncertainty, tolerance) caused by observing a sample instead of the whole population
– To assess this error (margin of error): measure sample-to-sample variation
– Design approach deals with calculating sampling errors for different sampling designs
– One such measure: 95% confidence interval: If we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t
• The first 3 errors: nonsampling errors
– Can be much larger than the sampling error
• In this course:
– Sampling error
– Nonresponse bias
– Shall assume that the frame population is identical to the target population
– No measurement error
Summary of basic concepts
• Population, target population
• unit
• sample
• sampling design
• estimation
– estimator
– measure of bias
– measure of variance
– confidence interval
• survey errors:
– register/frame population
– measurement error
– nonresponse
– sampling error
Example – Psychiatric Morbidity Survey 1993 from Great Britain
• Aim: Provide information about prevalence of psychiatric problems among adults in GB as well as their associated social disabilities and use of services
• Target population: Adults aged 16-64 living in private households
• Sample: Through several stages: 18,000 addresses were chosen and 1 adult in each household was chosen
• 200 interviewers, each visiting 90 households
Result of the sampling process
• Sample of addresses: 18,000
– Vacant premises: 927
– Institutions/business premises: 573
– Demolished: 499
– Second home/holiday flat: 236
• Private household addresses: 15,765
– Extra households found: 669
• Total private households: 16,434
– Households with no one 16-64: 3,704
• Eligible households: 12,730
• Nonresponse: 2,622
• Sample: 10,108 households with responding adults aged 16-64
Why sampling?
• reduces costs for acceptable level of accuracy (money, manpower, processing time...)
• may free up resources to reduce nonsampling error and collect more information from each person in the sample
– ex:
400 interviewers at $5 per interview: lower sampling error
200 interviewers at $10 per interview: lower nonsampling error
• much quicker results
When is a sample representative?
• Balance on gender and age:
– proportion of women in sample ≈ proportion in population
– proportions of age groups in sample ≈ proportions in population
• An ideal representative sample:
– A miniature version of the population
– implying that every unit in the sample represents the characteristics of a known number of units in the population
• Appropriate probability sampling ensures a representative sample "on the average"
Alternative approaches for statistical inference based on survey sampling
• Design-based:
– No modeling, only stochastic element is the sample s with known distribution
• Model-based: The values yi are assumed to be values of random variables Yi:
– Two stochastic elements: Y = (Y1, …, YN) and s
– Assumes a parametric distribution for Y
– Example: suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Yi on xi.
• Statistical principles of inference imply that the model-based approach is the most sound and valid approach
• Start with learning the design-based approach since it is the most applied approach to survey sampling used by national statistical institutes and most research institutes for social sciences.
– Is the easy way out: Do not need to model.
– All statisticians working with survey sampling in practice need to know this approach
Design-based statistical inference
• Can also be viewed as a distribution-free nonparametric approach
• The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U = {1, ..., N}
• No explicit statistical modeling is done for the variable y. All yi's are considered fixed but unknown
• Focus on sampling error
• Sets the sample survey theory apart from usual statistical analysis
• The traditional approach, started by Neyman in 1934
Estimation theory - simple random sample
Estimation of the population mean of a variable y:
$$\mu = \sum_{i=1}^{N} y_i / N$$
A natural estimator - the sample mean:
$$\bar{y}_s = \sum_{i \in s} y_i / n$$
Desirable properties:
(I) Unbiasedness: An estimator $\hat{\mu}$ is unbiased if $E(\hat{\mu}) = \mu$. $\bar{y}_s$ is unbiased for SRS design.
SRS of size n: Each sample s of size n has
$$p(s) = 1/\binom{N}{n}$$
Can be performed in principle by drawing one unit at a time at random without replacement
The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE):
$$Var(\hat{\mu}) = E(\hat{\mu} - \mu)^2, \text{ if } E(\hat{\mu}) = \mu$$
$$\hat{V}(\hat{\mu}) \text{ is an (unbiased) estimate of } Var(\hat{\mu})$$
$$SE(\hat{\mu}) = \sqrt{\hat{V}(\hat{\mu})}$$
Some results for SRS:
(1) Let $\pi_i$ be the probability that unit i is in the sample. Then $\pi_i = n/N = f$, the sampling fraction
(2) $E(\bar{y}_s) = \mu$
(3) Let $\sigma^2$ be the population variance:
$$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu)^2$$
Then
$$Var(\bar{y}_s) = \frac{\sigma^2}{n}(1 - f)$$
Here, the factor (1 - f) is called the finite population correction
• usually unimportant in social surveys:
n =10,000 and N = 5,000,000: 1- f = 0.998
n =1000 and N = 400,000: 1- f = 0.9975
n =1000 and N = 5,000,000: 1-f = 0.9998
• effect of changing n much more important than effect of changing n/N
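The three cases listed above can be recomputed directly. This is a small sketch; the function name `fpc` and the extra column `se_factor` are illustrative choices, not notation from the course. The second column shows the factor sqrt((1 - f)/n) that multiplies the population standard deviation in the standard error of the sample mean, which makes it visible that n matters far more than n/N.

```r
# Finite population correction 1 - f, with f = n / N
fpc <- function(n, N) 1 - n / N

designs <- data.frame(n = c(10000, 1000, 1000),
                      N = c(5e6, 4e5, 5e6))
designs$fpc <- fpc(designs$n, designs$N)

# Factor multiplying sigma in the standard error of the sample mean: sqrt((1 - f) / n)
designs$se_factor <- sqrt(designs$fpc / designs$n)
print(designs)
```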
An unbiased estimator of $\sigma^2$ is given by the sample variance
$$s^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y}_s)^2$$
The estimated variance:
$$\hat{V}(\bar{y}_s) = \frac{s^2}{n}(1 - f)$$
Usually we report the standard error of the estimate:
$$SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}$$
Confidence intervals for $\mu$ are based on the Central Limit Theorem:
$$\text{For large } n,\ N - n: \quad Z = (\bar{y}_s - \mu) / \left(\sigma \sqrt{(1-f)/n}\right) \sim N(0, 1) \text{ approximately}$$
Approximate 95% CI for $\mu$:
$$(\bar{y}_s - 1.96 \cdot SE(\bar{y}_s),\ \bar{y}_s + 1.96 \cdot SE(\bar{y}_s)) = \bar{y}_s \pm 1.96 \cdot SE(\bar{y}_s)$$
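The estimator, its standard error and the approximate interval can be collected in one small helper. The following is a minimal sketch: the function name `srs_mean_ci` and the gamma-distributed population are hypothetical and serve only to make the example runnable.

```r
# Sample mean, standard error with finite population correction, and approximate 95% CI
srs_mean_ci <- function(y_s, N, z = 1.96) {
  n <- length(y_s)
  f <- n / N                               # sampling fraction
  ybar <- mean(y_s)                        # sample mean
  se <- sqrt(var(y_s) / n * (1 - f))       # var() is the sample variance with divisor n - 1
  c(estimate = ybar, se = se, lower = ybar - z * se, upper = ybar + z * se)
}

set.seed(1)
N <- 5000
y <- rgamma(N, shape = 4, scale = 50)      # hypothetical population values (assumption)
s <- sample(1:N, 200)                      # SRS without replacement
print(srs_mean_ci(y[s], N))
```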
Example – Student performance in California schools
• Academic Performance Index (API) for all California schools
• Based on standardized testing of students
• Data from all schools with at least 100 students
• Unit in population = school (Elementary/Middle/High)
• Full population consists of N = 6194 observations
• Concentrate on the variable: y = api00 = API in 2000
• Mean(y) = 664.7 with min(y) =346 and max(y) =969
• Data set in R: apipop and y= apipop$api00
[Figure: Histogram of y (population) with fitted normal density. Horizontal axis: y, from 400 to 1000; vertical axis: density.]
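Histogram for sample mean and fitted normal density
y = api scores from 2000. Sample size n = 10, based on 10000 simulations
R-code:

    library(survey)                  # the api data sets ship with the survey package
    data(api)                        # loads apipop
    y <- apipop$api00
    b <- 10000                       # number of simulated samples
    N <- 6194
    n <- 10
    ybar <- numeric(b)
    for (k in 1:b) {
      s <- sample(1:N, n)            # SRS without replacement
      ybar[k] <- mean(y[s])
    }
    hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
    # overlay a normal density with the simulated mean and standard deviation
    x <- seq(mean(ybar) - 4 * sqrt(var(ybar)), mean(ybar) + 4 * sqrt(var(ybar)), 0.05)
    z <- dnorm(x, mean(ybar), sqrt(var(ybar)))
    lines(x, z)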
[Figure: Histogram of ybar with fitted normal density, api scores. Sample size n = 10, based on 10000 simulations. Horizontal axis: ybar, from 500 to 800; vertical axis: density.]
y = api00 for 6194 California schools
10000 simulations of SRS. Confidence level of the approximate 95% CI:

    n       Conf. level
    10      0.915
    30      0.940
    50      0.943
    100     0.947
    1000    0.949
    2000    0.951
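A coverage study of this kind can be sketched as follows. To keep the example self-contained it uses a synthetic population, which is an assumption and not the apipop data, so the resulting numbers will differ from the table; the function name `coverage` and the reduced number of replications are also illustrative.

```r
set.seed(4600)
N <- 6194
y <- 300 + 700 * rbeta(N, 2.5, 2.5)        # synthetic stand-in population, NOT apipop
mu <- mean(y)                              # true population mean

# Share of simulated SRS samples whose approximate 95% CI contains mu
coverage <- function(n, b = 2000) {
  hits <- logical(b)
  for (k in 1:b) {
    ys <- y[sample(1:N, n)]                # SRS without replacement
    se <- sqrt(var(ys) / n * (1 - n / N))  # standard error with finite population correction
    hits[k] <- abs(mean(ys) - mu) <= 1.96 * se
  }
  mean(hits)
}

print(sapply(c(10, 30, 100), coverage))
```

With small n the empirical level is expected to fall somewhat below 0.95, because the normal approximation and the variance estimate are both less reliable.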
For one sample of size n = 100:
$$\bar{y}_s = 654.5, \qquad SE(\bar{y}_s) = 12.6$$
Approximate 95% CI for the population mean:
$$654.5 \pm 1.96 \times 12.6 = 654.5 \pm 24.7 = (629.8,\ 679.2)$$
R-code (with the output reported for this particular sample):

    s <- sample(1:6194, 100)
    ybar <- mean(y[s])
    se <- sqrt(var(y[s]) * (6194 - 100) / (6194 * 100))
    ybar          # [1] 654.47
    var(y[s])     # [1] 16179.28
    se            # [1] 12.61668
The coefficient of variation for the estimate:
$$CV(\bar{y}_s) = SE(\bar{y}_s) / \bar{y}_s$$
• A measure of the relative variability of an estimate.
• It does not depend on the unit of measurement.
• More stable over repeated surveys, can be used for planning, for example determining sample size
• More meaningful when estimating proportions
The absolute value of the sampling error is not informative unless it is related to the value of the estimate
For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3
In the example: $CV(\bar{y}_s) = 12.6/654.5 = 0.019 = 1.9\%$
Estimation of a population proportion p with a certain characteristic A
p = (number of units in the population with A)/N
Let yi = 1 if unit i has characteristic A, 0 otherwise
Then p is the population mean of the yi's.
Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as
$$\hat{p} = \bar{y}_s = X/n$$
Then under SRS:
$$E(\hat{p}) = p$$
and
$$Var(\hat{p}) = \frac{p(1-p)}{n}\left(1 - \frac{n-1}{N-1}\right)$$
since the population variance equals
$$\sigma^2 = \frac{N p (1-p)}{N-1}$$
The sample variance is
$$s^2 = \frac{n}{n-1}\,\hat{p}(1-\hat{p})$$
So the unbiased estimate of the variance of the estimator:
$$\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)$$
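For readers who want the intermediate steps, the variance formula follows from the general SRS result $Var(\bar{y}_s) = \sigma^2 (1-f)/n$ once the population variance of a 0/1 variable is written out. The derivation below uses only that result and the fact that $y_i^2 = y_i$ for 0/1 values:

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - p)^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} y_i^2 - N p^2\right) = \frac{Np - Np^2}{N-1} = \frac{N p(1-p)}{N-1}$$

$$Var(\hat{p}) = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right) = \frac{p(1-p)}{n}\cdot\frac{N}{N-1}\cdot\frac{N-n}{N} = \frac{p(1-p)}{n}\cdot\frac{N-n}{N-1} = \frac{p(1-p)}{n}\left(1 - \frac{n-1}{N-1}\right)$$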
Examples
A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by:
$$\hat{p} = 280/1000 = 0.280$$
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} \approx \sqrt{\frac{0.28 \times 0.72}{999}} = 0.0142$$
Confidence interval requires normal approximation. Can use the guideline from the binomial distribution, when N - n is large: $np \geq 5$ and $n(1-p) \geq 5$
In this example: n = 1000 and N = 4,000,000
Approximate 95% CI:
$$\hat{p} \pm 1.96 \cdot SE(\hat{p}) = 0.280 \pm 0.028 = (0.252,\ 0.308)$$
Ex: Psychiatric Morbidity Survey 1993 from Great Britain
p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
n/N ≈ 0.00024
$$\hat{p} = 0.14$$
$$SE(\hat{p}) = \sqrt{(1 - 0.00024) \times 0.14 \times 0.86 / 9791} = 0.0035$$
$$95\%\ \text{CI}: \ 0.14 \pm 1.96 \times 0.0035 = 0.14 \pm 0.007 = (0.133,\ 0.147)$$
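Both worked examples can be reproduced with a few lines of R. This is a sketch; the helper name `srs_prop_ci` and its argument names are hypothetical. It implements the variance estimate with divisor n - 1 and the finite population correction, and it treats N = Inf as "correction negligible".

```r
# Estimated proportion, standard error and approximate 95% CI under SRS
srs_prop_ci <- function(x, n, N = Inf, z = 1.96) {
  p_hat <- x / n
  se <- sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
  c(p_hat = p_hat, se = se, lower = p_hat - z * se, upper = p_hat + z * se)
}

# Political poll: 280 of 1000 sampled voters, N = 4,000,000
poll <- srs_prop_ci(280, 1000, N = 4e6)
print(round(poll, 4))
```

For the poll the standard error comes out at about 0.0142, which gives the interval (0.252, 0.308) after rounding.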
General probability sampling
• Sampling design: p(s) - known probability of selection for each subset s of the population U
• Actually: The sampling design is the probability distribution p(.) over all subsets of U
• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
• The inclusion probability:
$$\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i \in s) = \sum_{\{s:\ i \in s\}} p(s)$$
Illustration
U = {1,2,3,4}
Sample of size 2; 6 possible samples
Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8
The inclusion probabilities:
$$\pi_1 = \sum_{\{s:\ 1 \in s\}} p(s) = p(\{1,2\}) + p(\{1,4\}) = 5/8$$
$$\pi_2 = \sum_{\{s:\ 2 \in s\}} p(s) = p(\{1,2\}) + p(\{2,3\}) = 3/4 = 6/8$$
$$\pi_3 = \sum_{\{s:\ 3 \in s\}} p(s) = p(\{2,3\}) + p(\{3,4\}) = 3/8$$
$$\pi_4 = \sum_{\{s:\ 4 \in s\}} p(s) = p(\{3,4\}) + p(\{1,4\}) = 2/8$$
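The same inclusion probabilities can be computed mechanically from the design. The sketch below stores only the samples with positive probability; the variable names are illustrative. It also shows that the inclusion probabilities add up to the fixed sample size 2.

```r
# Samples with positive probability under the design, and their probabilities p(s)
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p_s <- c(1/2, 1/4, 1/8, 1/8)

# Inclusion probability of unit i: add p(s) over all samples s that contain i
incl_prob <- sapply(1:4, function(i) {
  contains_i <- sapply(samples, function(s) i %in% s)
  sum(p_s[contains_i])
})

print(incl_prob)         # 5/8, 6/8, 3/8, 2/8
print(sum(incl_prob))    # equals the sample size, 2
```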
Some results
(I) $\pi_1 + \pi_2 + \dots + \pi_N = E(n)$; n is the sample size
(II) If the sample size is determined to be n in advance: $\pi_1 + \pi_2 + \dots + \pi_N = n$
Proof: Let $Z_i = 1$ if unit i is included in the sample, 0 otherwise. Then
$$\pi_i = P(Z_i = 1) = E(Z_i)$$
$$n = \sum_{i=1}^{N} Z_i \ \Rightarrow\ E(n) = \sum_{i=1}^{N} E(Z_i) = \sum_{i=1}^{N} \pi_i$$
Estimation theory: probability sampling in general
Problem: Estimate a population quantity for the variable y
For the sake of illustration: The population total
$$t = \sum_{i=1}^{N} y_i$$
An estimator of t based on the sample: $\hat{t}$
$$\text{Expected value:} \quad E(\hat{t}) = \sum_s \hat{t}(s)\, p(s)$$
$$\text{Variance:} \quad Var(\hat{t}) = E[\hat{t} - E\hat{t}]^2 = \sum_s [\hat{t}(s) - E\hat{t}]^2\, p(s)$$
$$\text{Bias:} \quad E(\hat{t}) - t$$
$\hat{t}$ is unbiased if $E(\hat{t}) = t$
Let $\hat{V}(\hat{t})$ be an (unbiased if possible) estimate of $Var(\hat{t})$
The standard error of $\hat{t}$: $SE(\hat{t}) = \sqrt{\hat{V}(\hat{t})}$
Coefficient of variation of $\hat{t}$: $CV(\hat{t}) = SE(\hat{t}) / \hat{t}$
CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases
Margin of error: $2 \cdot SE(\hat{t})$
Because, typically we have that
$$P(\hat{t} - 2 \cdot SE(\hat{t}) < t < \hat{t} + 2 \cdot SE(\hat{t})) \approx 0.95 \text{ for large } n,\ N - n$$
since $\hat{t}$ is approximately normally distributed for large n, N - n
$\hat{t} \pm 2 \cdot SE(\hat{t})$ is approximately a 95% CI
Some peculiarities in the estimation theory
Example: N = 3, n = 2, simple random sample
$$s_1 = \{1,2\},\quad s_2 = \{1,3\},\quad s_3 = \{2,3\}$$
$$p(s_k) = 1/3 \text{ for } k = 1, 2, 3$$
Let $\hat{t}_1 = 3\bar{y}_s$, unbiased
Let $\hat{t}_2$ be given by:
$$\hat{t}_2(s_1) = 3 \cdot \tfrac{1}{2}(y_1 + y_2) = \hat{t}_1(s_1)$$
$$\hat{t}_2(s_2) = 3\left(\tfrac{1}{2} y_1 + \tfrac{2}{3} y_3\right) = \hat{t}_1(s_2) + \tfrac{1}{2} y_3$$
$$\hat{t}_2(s_3) = 3\left(\tfrac{1}{2} y_2 + \tfrac{1}{3} y_3\right) = \hat{t}_1(s_3) - \tfrac{1}{2} y_3$$
Also $\hat{t}_2$ is unbiased:
$$E(\hat{t}_2) = \sum_{k=1}^{3} \hat{t}_2(s_k)\, p(s_k) = \frac{1}{3}\sum_{k=1}^{3} \hat{t}_2(s_k) = \frac{1}{3} \cdot 3t = t$$
$$Var(\hat{t}_1) - Var(\hat{t}_2) = \frac{1}{6}\, y_3 (3y_2 - 3y_1 - y_3)$$
$$Var(\hat{t}_1) > Var(\hat{t}_2) \text{ if } y_3 > 0 \text{ and } 3y_2 > 3y_1 + y_3$$
If the $y_i$ are 0/1 variables, this happens when $y_1 = 0$, $y_2 = y_3 = 1$
For this set of values of the yi's:
$$\hat{t}_1(s_1) = 1.5,\quad \hat{t}_1(s_2) = 1.5,\quad \hat{t}_1(s_3) = 3 \quad \text{(never correct)}$$
$$\hat{t}_2(s_1) = 1.5,\quad \hat{t}_2(s_2) = 2,\quad \hat{t}_2(s_3) = 2.5$$
$\hat{t}_2$ has clearly less variability than $\hat{t}_1$ for these y-values
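The comparison of the two estimators can be verified numerically by listing all three samples. The sketch below uses the values y = (0, 1, 1) from the example; the helper names `design_mean` and `design_var` are illustrative.

```r
y <- c(0, 1, 1)                            # population values from the example
t_true <- sum(y)                           # true total
samples <- list(c(1, 2), c(1, 3), c(2, 3))
p_s <- rep(1 / 3, 3)                       # SRS design with N = 3, n = 2

# t1 = 3 * sample mean; t2 shifts t1 by +y3/2 on s2 and by -y3/2 on s3
t1 <- sapply(samples, function(s) 3 * mean(y[s]))
t2 <- t1 + c(0, 0.5 * y[3], -0.5 * y[3])

design_mean <- function(est) sum(est * p_s)
design_var  <- function(est) sum((est - design_mean(est))^2 * p_s)

print(rbind(t1 = t1, t2 = t2))
print(c(E_t1 = design_mean(t1), E_t2 = design_mean(t2)))      # both equal t_true
print(c(Var_t1 = design_var(t1), Var_t2 = design_var(t2)))    # 1/2 versus 1/6
```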
Let y be the population vector of the y-values.
This example shows that $N\bar{y}_s$ is not uniformly best (minimum variance for all y) among linear design-unbiased estimators
The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models
In fact, we have the following much stronger result:
Theorem: Let p(.) be any sampling design. Assume each yi can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t
Proof:
Let $\hat{t}$ be unbiased, and let $\mathbf{y}_0$ be one possible value of $\mathbf{y}$.
Then there exists an unbiased $\hat{t}_0$ with $Var(\hat{t}_0) = 0$ when $\mathbf{y} = \mathbf{y}_0$:
$$\hat{t}_0(s, \mathbf{y}) = \hat{t}(s, \mathbf{y}) - \hat{t}(s, \mathbf{y}_0) + t_0, \quad t_0 \text{ is the total for } \mathbf{y}_0$$
1) $\hat{t}_0$ is unbiased:
$$E(\hat{t}_0) = t - \sum_s \hat{t}(s, \mathbf{y}_0)\, p(s) + t_0 = t - t_0 + t_0 = t$$
2) When $\mathbf{y} = \mathbf{y}_0$: $\hat{t}_0 = t_0$ for all samples, so $Var(\hat{t}_0) = 0$
This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible
Determining sample size
• The sample size has a decisive effect on the cost of the survey
• How large n should be depends on the purpose for doing the survey
• In a poll for determining voting preference, n = 1000 is typically enough
• In the quarterly labor force survey in Norway, n = 24000
Mainly three factors to consider:
1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest
2. Homogeneity of the population. Needs smaller samples if little variation in the population
3. Estimation for subgroups, domains, of the population
It is often factor 3 that puts the highest demand on the survey
• If we want to estimate totals for domains of the population we should take a stratified sample
• A sample from each domain
• A stratified random sample: From each domain a simple random sample
H strata that constitute the whole population
Sample sizes: $n_1, n_2, \dots, n_H$
Total sample size: $n = n_1 + n_2 + \dots + n_H$
Must determine each $n_h$
Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p
Let n be the sample size of this stratum, and assume that n/N is negligible
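Desired accuracy for this stratum: 95% CI for p should be ±5%
95% CI for p:
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The accuracy requirement:
$$1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \leq 0.05 = \frac{1}{20} \iff n \geq 1.96^2 \cdot 20^2 \cdot \hat{p}(1-\hat{p}), \quad \text{where } 1.96^2 \cdot 20^2 \cdot \hat{p}(1-\hat{p}) \leq 384$$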
The estimate $\hat{p}$ is unknown in the planning phase
Use the conservative size 384 or a planning value p0 with n = 1536 p0(1 - p0)
E.g., with p0 = 0.2: n = 246
In general with accuracy requirement d, 95% CI $\hat{p} \pm d$:
$$n \geq 3.84\, p_0(1-p_0)/d^2$$
Alternative accuracy requirement:
Length of the 95% CI is proportional to $\hat{p}$ (when $\hat{p} \leq 0.5$, otherwise estimate 1 - p):
$$1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \leq d\,\hat{p} \iff CV(\hat{p}) \leq d/1.96 = e$$
With e = 0.1, then we require approximately that
$$SE(\hat{p})/\hat{p} \leq e \iff \sqrt{\frac{1-\hat{p}}{n\,\hat{p}}} \leq e \iff n \geq \frac{1}{e^2} \cdot \frac{1-\hat{p}}{\hat{p}}$$
$$\text{Planning value:} \quad n \geq \frac{1}{e^2} \cdot \frac{1-p_0}{p_0}$$
when p0 = 0.5: 95% CI $\hat{p} \pm 0.10$ and n ≥ 100
when p0 = 0.1: 95% CI $\hat{p} \pm 0.02$ and n ≥ 900
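The planning formula for a coefficient-of-variation requirement is easy to evaluate for several planning values at once. A minimal sketch, with the hypothetical function name `n_for_cv`, returning the unrounded lower bound on n:

```r
# Smallest n (before rounding up) such that SE(p_hat) / p_hat <= e at planning value p0
n_for_cv <- function(e, p0) (1 - p0) / (p0 * e^2)

p0 <- c(0.5, 0.1)
n_needed <- n_for_cv(0.1, p0)
print(data.frame(p0 = p0, n = n_needed))   # 100 and 900
```

Rare characteristics (small p0) require much larger samples under a relative accuracy requirement, since the bound grows like 1/p0.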
Example: Monthly unemployment rate
Important to detect changes in unemployment rates from month to month
planning value p0 = 0.05
Desired accuracy: $1.96\,SE(\hat p) \le d \iff n \ge 3.84\,p_0(1-p_0)/d^2 = 0.1824/d^2$
d = 0.001 (margin of error of 0.1%): n = 182,400
d = 0.002: n = 45,600
d = 0.005: n = 7,300
Note: d = 0.005 gives $CV(\hat p) = 0.00255/0.05 = 0.051 \approx 5\%$
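The sample sizes in the unemployment example can be reproduced as follows. This is only a sketch of the arithmetic; it uses the rounded constant 3.84 from the slide rather than the exact value of 1.96 squared.

```r
p0 <- 0.05
d  <- c(0.001, 0.002, 0.005)       # margins of error considered in the example
n  <- 3.84 * p0 * (1 - p0) / d^2   # 3.84 is 1.96^2 rounded, as on the slide
cv <- (d / 1.96) / p0              # implied coefficient of variation of p_hat
data.frame(d = d, n = round(n), cv = round(cv, 3))
```

The last row gives n = 7296, which the slide rounds to 7300, with a CV of about 5%.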
Two basic estimators: Ratio estimator and Horvitz-Thompson estimator
• Ratio estimator in simple random samples
• H-T estimator for unequal probability sampling: The inclusion probabilities are unequal
• The goal is to estimate a population total t for a variable y
Ratio estimator
Suppose we have known auxiliary information for the whole population: $x = (x_1, x_2, \ldots, x_N)$
Ex: age, gender, education, employment status
Let $X = \sum_{i=1}^N x_i$
The ratio estimator for the y-total t:
$$\hat t_R = X\,\frac{\sum_{i\in s} y_i}{\sum_{i\in s} x_i} = X\,\frac{\bar y_s}{\bar x_s}$$
We can express the ratio estimator in the following form:
$$\hat t_R = \frac{X}{N\bar x_s}\,(N\bar y_s)$$
It adjusts the usual “sample mean estimator” in the cases where the x-values in the sample are too small or too large.
Reasonable if there is a positive correlation between x and y
Example: University of 4000 students, SRS of 400
Estimate the total number t of women who are planning a career in teaching; t = Np, where p is the proportion
yi = 1 if student i is a woman planning to be a teacher, t is the y-total
Results: 84 out of the 240 women in the sample plan to be teachers
$$\hat p = 84/400 = 0.21, \qquad \hat t = N\hat p = 840$$
HOWEVER: It was noticed that the university has 2700 women (67.5%), while the sample had 60% women. A better estimate, which corrects for the underrepresentation of women, is obtained by the ratio estimate using the auxiliary variable
x = 1 if the student is a woman (0 otherwise)
$$\hat t_R = \frac{2700}{4000\cdot 0.6}\,(840) = 945$$
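The two estimates in this example can be recomputed directly from the counts given. The following is a small sketch; the variable names are chosen here for illustration only.

```r
N <- 4000; n <- 400
X     <- 2700   # known number of women in the population (x-total)
sum_x <- 240    # women in the sample
sum_y <- 84     # women in the sample planning to teach
t_hat   <- N * sum_y / n        # expansion estimator N * ybar_s
t_hat_R <- X * sum_y / sum_x    # ratio estimator X * ybar_s / xbar_s
c(t_hat = t_hat, t_hat_R = t_hat_R)
```

The ratio estimator scales the sample rate among women (84/240) up to the known 2700 women, which is why it moves from 840 to 945.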
In business surveys it is very common to use a ratio estimator.
Ex: yi = amount spent on health insurance by business i
xi = number of employees in business i
We shall now do a comparison between the ratio estimator and the sample mean based estimator. We need to derive expectation and variance for the ratio estimator
First: Must define the population covariance
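$$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^N (x_i-\mu_x)(y_i-\mu_y)$$
where $\mu_x, \mu_y$ are the population means of the x and y variables:
$$\mu_x = \frac{1}{N}\sum_{i=1}^N x_i, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^N y_i$$
$$\sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i-\mu_x)^2, \qquad \sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i-\mu_y)^2$$
The population correlation coefficient: $\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x\sigma_y}$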
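Let $R = t/X = \sum_{i=1}^N y_i \big/ \sum_{i=1}^N x_i$ and $\hat R = \bar y_s/\bar x_s = N\bar y_s/N\bar x_s$
(I) Bias: $E(\hat t_R) - t = -Cov(\hat R, N\bar x_s)$
Proof:
$$\hat t_R - t = X\,\frac{N\bar y_s}{N\bar x_s} - t = N\bar y_s\left(1 - \frac{N\bar x_s - X}{N\bar x_s}\right) - t = (N\bar y_s - t) - \hat R\,(N\bar x_s - X)$$
$$E(\hat t_R) - t = -E\left[\frac{N\bar y_s}{N\bar x_s}\,(N\bar x_s - X)\right] = -Cov(\hat R, N\bar x_s)$$
since $E(N\bar y_s) = t$ and $E(N\bar x_s) = X$.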
It follows that
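$$\frac{|Bias(\hat t_R)|}{\sqrt{Var(\hat t_R)}} = \frac{|Cov(\hat R, N\bar x_s)|}{X\sqrt{Var(\hat R)}} = \frac{|Corr(\hat R, N\bar x_s)|\,\sqrt{Var(\hat R)\,Var(N\bar x_s)}}{X\sqrt{Var(\hat R)}} = |Corr(\hat R, N\bar x_s)|\;CV(N\bar x_s) \le CV(\bar x_s)$$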
Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x-sample mean is small
Certainly true for large n
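(II) $E(\hat t_R) \approx t$, for large n
(III)
$$Var(\hat t_R) \approx N^2\,\frac{1-f}{n}\left(R^2\sigma_x^2 - 2R\rho_{xy}\sigma_x\sigma_y + \sigma_y^2\right) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$$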
Note: The ratio estimator is very precise when the population points (yi, xi) lie close to a straight line through the origin with slope R.
The regression model generates the ratio estimator
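$$Var(\hat t_R) \approx N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$$
and recalling that
$$Var(N\bar y_s) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - \mu_y)^2$$
we have
$$Var(\hat t_R) \le Var(N\bar y_s) \iff \sum_{i=1}^N (y_i - Rx_i)^2 \le \sum_{i=1}^N (y_i - \mu_y)^2$$
The ratio estimator is more accurate if $Rx_i$ predicts $y_i$ better than $\mu_y$ does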
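To see how large the gain can be, the two variance expressions can be evaluated on a synthetic population. This is a sketch under made-up assumptions: the population below is simulated, with y roughly proportional to x, and is not data from the course.

```r
set.seed(1)
N <- 1000; n <- 50; f <- n / N
x <- runif(N, 10, 100)           # hypothetical auxiliary variable
y <- 2 * x + rnorm(N, sd = 5)    # y roughly proportional to x
R <- sum(y) / sum(x)
# Approximate variance of the ratio estimator (large-n formula)
var_ratio <- N^2 * (1 - f) / n * sum((y - R * x)^2) / (N - 1)
# Variance of the expansion estimator N * ybar_s
var_mean  <- N^2 * (1 - f) / n * sum((y - mean(y))^2) / (N - 1)
c(var_ratio = var_ratio, var_mean = var_mean, ratio = var_ratio / var_mean)
```

When the points lie near a line through the origin, the residuals around $Rx_i$ are far smaller than the deviations from the mean, so the first variance is a small fraction of the second.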
Estimated variance for the ratio estimator
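Estimate $\dfrac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$ by $\dfrac{1}{n-1}\sum_{i\in s} (y_i - \hat R x_i)^2$
$$\hat V(\hat t_R) = \left(\frac{\mu_x}{\bar x_s}\right)^2 N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s} (y_i - \hat R x_i)^2$$
Note: If $\bar x_s$ is very small, then $\hat R$ is more uncertain, and the variance estimate becomes larger to reflect that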
For large n, N-n: Approximate normality holds and an approximate 95% confidence interval is given by
$$\hat t_R \pm 1.96\,\frac{X}{\bar x_s}\sqrt{\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s} (y_i - \hat R x_i)^2}$$
R computation of the estimate, variance estimate and confidence interval

    > # apipop: California API data, assumed already loaded (e.g. via data(api) in the survey package)
    > y=apipop$api00
    > x=apipop$col.grad   # col.grad = percent of parents with college degree
    > # calculating the ratio estimator:
    > s=c(20,2000,3900,5000)
    > N=6194
    > n=4
    > r=mean(y[s])/mean(x[s])
    > # ratio estimate of t/N:
    > muhatr=r*mean(x)
    > muhatr
    [1] 542.3055
    > # variance estimate
    > ssqr=(1/(n-1))*sum((y[s]-r*x[s])^2)
    > varestr=(mean(x)/mean(x[s]))^2*(1-n/N)*ssqr/n
    > ser=sqrt(varestr)
    > ser
    [1] 63.85705
    > # confidence interval:
    > CI=muhatr+qnorm(c(0.025,0.975))*ser
    > CI
    [1] 417.1479 667.4630

n Conf.level
10 0.927
30 0.946
50 0.946
100 0.947
1000 0.947
2000 0.948
y = api00 for 6194 California schools
10000 simulations of SRS. Confidence level of approximate 95% CI
R-code for simulations to estimate true confidence level of 95% CI, based on the ratio estimator
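
    > simtratio=function(b,n,N)
    + {
    + r=numeric(b)        # ratio estimates
    + muhatr=numeric(b)   # ratio estimates of the mean
    + ssqr=numeric(b)     # residual sample variances
    + se=numeric(b)
    + for (k in 1:b){
    + s=sample(1:N,n)
    + r[k]=mean(y[s])/mean(x[s])
    + muhatr[k]=r[k]*mean(x)
    + ssqr[k]=(1/(n-1))*sum((y[s]-r[k]*x[s])^2)
    + se[k]=sqrt((mean(x)/mean(x[s]))^2*(1-n/N)*ssqr[k]/n)
    + }
    + # number of the b intervals that cover the true mean
    + sum(mean(y)<muhatr+1.96*se)-sum(mean(y)<muhatr-1.96*se)
    + }
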
Unequal probability sampling
Inclusion probabilities: $\pi_i = P(i \in s) > 0$ for all $i = 1,\ldots,N$
Example:
Psychiatric Morbidity Survey: Selected individuals from households
$\pi_i \propto 1/M_i$, where $M_i$ = number of adults aged 16-64 in the household that individual i belongs to
Horvitz-Thompson estimator- unequal probability sampling
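$\pi_i = P(i \in s) > 0$ for all $i = 1,\ldots,N$
Let’s try and use $N\bar y_s$
Let $Z_i = 1$ if $i \in s$, 0 otherwise, so that $E(Z_i) = \pi_i$
$$E(N\bar y_s) = \frac{N}{n}\,E\left(\sum_{i=1}^N y_i Z_i\right) = \sum_{i=1}^N \frac{\pi_i}{n/N}\,y_i \ne t \qquad \text{(not unbiased)}$$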
Bias is large if inclusion probabilities tend to increase or decrease systematically with yi
Use weighting to correct for bias:
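$\hat t = \sum_{i\in s} w_i y_i$; $w_i$ does not depend on s
$$E(\hat t) = E\left(\sum_{i=1}^N w_i y_i Z_i\right) = \sum_{i=1}^N w_i y_i \pi_i$$
and $\hat t$ is unbiased for all possible values $y_1,\ldots,y_N$ if and only if $w_i = 1/\pi_i$
$$\hat t_{HT} = \sum_{i\in s} \frac{y_i}{\pi_i}$$
In SRS, $\pi_i = n/N$ and $\hat t_{HT} = N\bar y_s$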
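The weighting idea is short to express in code. The sketch below uses a hypothetical helper `ht_total` and made-up sample values, and checks the SRS special case stated above.

```r
# Horvitz-Thompson estimate of a total: sum of y_i / pi_i over the sample.
# `ht_total` is a hypothetical helper name used only for this illustration.
ht_total <- function(y, pi) {
  stopifnot(length(y) == length(pi), all(pi > 0), all(pi <= 1))
  sum(y / pi)
}

# SRS special case: pi_i = n/N should give N * ybar_s
y_s <- c(12, 7, 9, 15)   # made-up sample values
N <- 100; n <- length(y_s)
ht_srs <- ht_total(y_s, rep(n / N, n))
c(ht_srs, N * mean(y_s))
```

Both numbers agree, which is the statement that H-T reduces to the usual expansion estimator when all inclusion probabilities are equal.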
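a)
$$Var(\hat t_{HT}) = \sum_{i=1}^N \frac{1-\pi_i}{\pi_i}\,y_i^2 + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\,y_i y_j$$
b) If $|s| = n$, then
$$Var(\hat t_{HT}) = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$
where $\pi_{ij} = P(i,j \in s) = P(Z_iZ_j = 1)$
The Horvitz-Thompson estimator is widely used, for example in official statistics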
Note that the variance is small if we determine the inclusion probabilities such that the $y_i/\pi_i$ are approximately equal, i.e. $\pi_i$ increases with increasing $y_i$
Of course, we do not know the value of yi when planning the survey; use a known auxiliary xi and choose
$$\pi_i \propto x_i \;\Rightarrow\; \pi_i = n\,x_i/X, \qquad \text{since } \sum_{i=1}^N \pi_i = n$$
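A minimal sketch of this choice of inclusion probabilities, with a made-up size measure x. It also checks that no probability exceeds 1, which is something the formula itself does not guarantee.

```r
x <- c(5, 10, 20, 25, 40)   # hypothetical size measure for N = 5 units
n <- 2                      # planned sample size
pi <- n * x / sum(x)        # inclusion probabilities proportional to x
pi
sum(pi)                     # equals n
any(pi > 1)                 # units with n * x_i / X > 1 would need special treatment
```

If some unit is so large that $n x_i / X > 1$, a common remedy is to include it with certainty and recompute the probabilities for the remaining units.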
If $\pi_i$ and $y_i$ are not related or are negatively "correlated", $Var(\hat t_{HT})$ can be enormous and one should not use the H-T estimator, even though the $\pi_i$'s are unequal
Example: Population of 3 elephants, to be shipped. Needs an estimate for the total weight
•Weighing an elephant is no simple matter. Owner wants to estimate the total weight by weighing just one elephant.
• Knows from earlier: Elephant 2 has a weight y2 close to the average weight. Wants to use this elephant and use 3y2 as an estimate
• However: To get an unbiased estimator, all inclusion probabilities must be positive.
• Sampling design: $|s| = 1$, with $\pi_2 = 0.90$ and $\pi_1 = \pi_3 = 0.05$
• The weights: 1, 2, 4 tons, total = 7 tons
• H-T estimator: $\hat t_{HT} = y_i/\pi_i$ if $s = \{i\}$:
20 if s = {1}
2.22 if s = {2}
80 if s = {3}
Hopeless! Always far from the true total of 7
Cannot be used, even though $E(\hat t_{HT}) = t = 7$
Problem:
$$Var(\hat t_{HT}) = (20-7)^2(0.05) + (2.22-7)^2(0.90) + (80-7)^2(0.05) = 295.46$$
True $SE(\hat t_{HT}) = \sqrt{Var(\hat t_{HT})} = 17.2$ !!!
The planned estimator, even though not an SRS: $\hat t_{eleph} = 3y_i$ if $s = \{i\}$
Possible values: 3, 6, 12
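$E(\hat t_{eleph}) = 6.15$: not unbiased, but look at
$SE(\hat t_{eleph}) = \sqrt{2.2275} = 1.49$
$$MSE(\hat t_{eleph}) = E(\hat t_{eleph} - t)^2 = Var(\hat t_{eleph}) + Bias^2(\hat t_{eleph}) = 2.95, \qquad \sqrt{MSE(\hat t_{eleph})} = 1.72$$
$\hat t_{eleph}$ is clearly preferable to $\hat t_{HT}$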
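Because the design has only three possible samples, the comparison of the two estimators can be done by exact enumeration. The sketch below does that; `summarise_est` is a helper name introduced here for illustration.

```r
y  <- c(1, 2, 4)            # elephant weights in tons
pi <- c(0.05, 0.90, 0.05)   # P(s = {i}); sample size is 1
t_true <- sum(y)

# Exact mean, variance, bias and root MSE of an estimator over the 3 samples
summarise_est <- function(est) {
  e <- sum(pi * est)
  v <- sum(pi * (est - e)^2)
  c(mean = e, var = v, bias = e - t_true, rmse = sqrt(v + (e - t_true)^2))
}

ht    <- summarise_est(y / pi)   # Horvitz-Thompson estimator
eleph <- summarise_est(3 * y)    # planned estimator 3 * y_i
round(rbind(ht, eleph), 2)
```

The H-T variance comes out as about 295.44 rather than 295.46, because the slide rounds 2/0.9 to 2.22 before squaring; the conclusion is unchanged.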
Variance estimate for H-T estimator
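Assume the size of the sample is determined in advance to be n.
An unbiased estimator of $Var(\hat t_{HT})$, provided all joint inclusion probabilities $\pi_{ij} > 0$:
$$\hat V(\hat t_{HT}) = \sum_{i\in s}\;\sum_{j\in s,\, j>i} \frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$
Approximate 95% CI, for large n, N-n:
$$\hat t_{HT} \pm 1.96\sqrt{\hat V(\hat t_{HT})}$$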
• Can always compute the variance estimate, since necessarily $\pi_{ij} > 0$ for all i, j in the sample s
• But: If not all $\pi_{ij} > 0$, should not use this estimate! It can give very incorrect estimates
• The variance estimate can be negative, but for most sampling designs it is always positive
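As a sanity check on the pairwise variance estimator for fixed-size designs (this form is commonly called the Sen-Yates-Grundy estimator), the sketch below evaluates it under SRS, where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$, and compares it with the familiar $N^2(1-f)s^2/n$. The function name `ht_var_est` and the sample values are made up for illustration.

```r
# Pairwise variance estimator for a fixed-size design.
# pi_joint: n x n matrix of joint inclusion probabilities of the sampled units
# (only the entries with i < j are used).
ht_var_est <- function(y, pi, pi_joint) {
  z  <- y / pi
  d2 <- outer(z, z, "-")^2                      # (y_i/pi_i - y_j/pi_j)^2
  w  <- (outer(pi, pi) - pi_joint) / pi_joint   # (pi_i pi_j - pi_ij) / pi_ij
  up <- upper.tri(w)                            # pairs with i < j
  sum(w[up] * d2[up])
}

set.seed(2)
N <- 200; n <- 10
y_s <- rnorm(n, mean = 50, sd = 10)             # made-up sample values
pi <- rep(n / N, n)
pi_joint <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
v_pairwise <- ht_var_est(y_s, pi, pi_joint)
v_srs <- N^2 * (1 - n / N) * var(y_s) / n       # usual SRS variance estimate
c(v_pairwise, v_srs)
```

The two numbers coincide, so the general formula contains the SRS variance estimate as a special case.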
A modified H-T estimator
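Consider first estimating the population mean $\mu_y = t/N$
An obvious choice: $\hat{\bar y}_{HT} = \hat t_{HT}/N$
Alternative: Estimate N as well, whether N is known or not
$$\hat N = \sum_{i\in s} \frac{1}{\pi_i} \qquad (y_i = 1 \text{ for all } i)$$
$$E(\hat N) = E\left(\sum_{i=1}^N \frac{Z_i}{\pi_i}\right) = \sum_{i=1}^N \frac{\pi_i}{\pi_i} = N$$
For SRS, $\pi_i = n/N$ and $\hat N = \sum_{i\in s} N/n = N$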
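$$\hat{\bar y}_w = \hat t_{HT}/\hat N = \frac{\sum_{i\in s} y_i/\pi_i}{\sum_{i\in s} 1/\pi_i}, \qquad \hat t_w = N\,\hat{\bar y}_w$$
Interestingly, $\hat t_w$ is often better than $\hat t_{HT}$, even though it is only approximately unbiased. It usually has smaller variance.
So $\hat t_w$ is ordinarily the estimator to use, whether $N$ is known or not. We note that it is a ratio estimator.
Illustration: $y_i = c$ for all $i = 1,\ldots,N$. Then
$$\hat t_{HT} = c\sum_{i\in s} 1/\pi_i = c\,\hat N$$
while $\hat t_w = Nc = t$, a better estimate if $Var(\hat N) > 0$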
If the sample size varies, the “ratio” estimator performs better than the H-T estimator, because the ratio is more stable than the numerator
Example:
Sampling design: Bernoulli sampling. Each unit in the population is selected with probability $\pi$, independently
$y_i = c$, for $i = 1,\ldots,N$
The $Z_i$'s are i.i.d. with $P(Z_i = 1) = P(i \in s) = \pi$
$n = |s|$ is a stochastic variable and has a binomial $(N, \pi)$ distribution, with $E(n) = N\pi$
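$$\hat t_{HT} = \sum_{i\in s} c/\pi = cn/\pi, \qquad E(\hat t_{HT}) = cN\pi/\pi = Nc = t$$
$$\hat t_w = N\,\frac{\hat t_{HT}}{\hat N} = N\,\frac{nc/\pi}{n/\pi} = Nc = t$$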
H-T estimator varies because n varies, while the modified H-T is perfectly stable
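A short simulation makes the contrast visible. This is a sketch with made-up values of N, c and the selection probability (called `p` in the code to avoid clashing with R's built-in `pi`).

```r
set.seed(3)
N <- 1000; c0 <- 5; p <- 0.1; B <- 2000
n <- rbinom(B, N, p)         # realised sample sizes under Bernoulli sampling
t_ht <- c0 * n / p           # H-T estimate: sum over s of c / p
t_w  <- N * t_ht / (n / p)   # modified estimate: N * t_ht / N_hat
c(mean_ht = mean(t_ht), sd_ht = sd(t_ht), sd_w = sd(t_w))
```

The H-T estimates are centred on t = Nc = 5000 but vary with the realised n, while every modified estimate equals 5000 up to floating-point error.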
Review of Advantages of Probability Sampling
• Objective basis for inference
• Permits unbiased or approximately unbiased estimation
• Permits estimation of sampling errors of estimators
– Use central limit theorem for confidence interval
– Can choose n to reduce SE or CV for estimator
Outstanding issues in design-based inference
• Estimation for subpopulations, domains
• Choice of sampling design
– discuss several different sampling designs
– appropriate estimators
• More on use of auxiliary information to improve estimates
• More on variance estimation
Estimation for domains
• Domain (subpopulation): a subset of the population of interest
• Ex: Population = all adults aged 16-64
Examples of domains:
– Women
– Adults aged 35-39
– Men aged 25-29
– Women of a certain ethnic group
– Adults living in a certain city
• Partition population U into D disjoint domains U1,…,Ud,..., UD of sizes N1,…,Nd,…,ND
Estimating domain means
Simple random sample from the population
True domain mean: $\mu_d = \sum_{i\in U_d} y_i / N_d$
• e.g., proportion of divorced women with psychiatric problems.
Estimate $\mu_d$ by the sample mean from $U_d$:
$$\bar y_{s_d} = \sum_{i\in s_d} y_i / n_d$$
where $s_d = s \cap U_d$ is the part of the sample in $U_d$, and $n_d = |s_d|$
Note: nd is a random variable
The estimator is a ratio estimator:
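Define $u_i = y_i$ if $i \in U_d$, 0 otherwise, and $x_i = 1$ if $i \in U_d$, 0 otherwise
$$\mu_d = \sum_{i=1}^N u_i \Big/ \sum_{i=1}^N x_i = R$$
$$\bar y_{s_d} = \sum_{i\in s} u_i \Big/ \sum_{i\in s} x_i = \bar u_s/\bar x_s = \hat R$$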
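$\bar y_{s_d}$ is approximately unbiased for large n
$$\hat V(\bar y_{s_d}) = \frac{1}{N_d^2}\left(\frac{N_d/N}{n_d/n}\right)^2 N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s} (u_i - \bar y_{s_d}\,x_i)^2$$
Let $s_d^2$ be the sample variance for the domain,
$$s_d^2 = \frac{1}{n_d-1}\sum_{i\in s_d} (y_i - \bar y_{s_d})^2$$
$$\hat V(\bar y_{s_d}) = \frac{n^2}{n_d^2}\cdot\frac{1-f}{n}\cdot\frac{(n_d-1)\,s_d^2}{n-1} \approx (1-f)\,\frac{s_d^2}{n_d}$$
$$SE(\bar y_{s_d}) = \sqrt{(1-f)\,s_d^2/n_d}, \qquad f = n/N$$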
For large samples: $f_d = n_d/N_d \approx f$
• Can then treat $s_d$ as an SRS from $U_d$
• Whatever the size of n is, conditional on $n_d$, $s_d$ is an SRS from $U_d$ (conditional inference)
Example: Psychiatric Morbidity Survey 1993
Proportions with psychiatric problems

| Domain d | $n_d$ | $\bar y_{s_d}$ | $SE(\bar y_{s_d})$ |
|---|---|---|---|
| Women | 4933 | 0.18 | $\sqrt{0.18\cdot 0.82/4932} = 0.005$ |
| Divorced women | 314 | 0.29 | $\sqrt{0.29\cdot 0.71/313} = 0.026$ |

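The standard errors in this example follow from treating each domain sample as an SRS and ignoring the finite population correction. A short sketch of the arithmetic:

```r
domain <- c("Women", "Divorced women")
n_d <- c(4933, 314)          # domain sample sizes
p_d <- c(0.18, 0.29)         # estimated proportions
# For a 0/1 variable, s_d^2 / n_d = p(1 - p) / (n_d - 1)
se_d <- sqrt(p_d * (1 - p_d) / (n_d - 1))
data.frame(domain, n_d, p_d, se = round(se_d, 3))
```

The smaller domain has a standard error about five times larger, which is the practical cost of estimating for small subpopulations.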
Estimating domain totals
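• $N_d$ is known: Use $N_d\,\bar y_{s_d}$
• $N_d$ unknown, must be estimated. Since $N_d$ is the x-total: $\hat N_d = N\bar x_s = N n_d/n$
$$\hat t_d = \hat N_d\,\bar y_{s_d} = N\,\frac{n_d}{n}\cdot\frac{1}{n_d}\sum_{i\in s} u_i = N\bar u_s$$
$$SE(\hat t_d) = N\sqrt{(1-f)\,s_u^2/n}$$
where $s_u^2$ is the sample variance of the $u_i$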
Stratified sampling
• Basic idea: Partition the population U into H subpopulations, called strata.
• Nh = size of stratum h, known
• Draw a separate sample from each stratum, sh of size nh from stratum h, independently between the strata
• In social surveys: Stratify by geographic regions, age groups, gender
• Ex: business survey. Canadian survey of employment. Establishments stratified by
o Standard Industrial Classification – 16 industry divisions
o Size – number of employees, 4 groups: 0-19, 20-49, 50-199, 200+
o Province – 12 provinces
Total number of strata: 16x4x12=768
Reasons for stratification
1. Strata form domains of interest for which separate estimates of given precision are required, e.g. strata = geographical regions
2. To “spread” the sample over the whole population. Easier to get a representative sample
3. To get more accurate estimates of population totals, reduce sampling variance
4. Can use different modes of data collection in different strata, e.g. telephone versus home interviews
Stratified simple random sampling
• The most common stratified sampling design
• SRS from each stratum
• Notation:
From stratum h: sample $s_h$ of size $n_h$
Total sample size: $n = \sum_{h=1}^H n_h$
Values from stratum h: $y_{hi}$, $i = 1,\ldots,N_h$
Sample: $(y_{hi}: i \in s_h)$
Sample mean: $\bar y_h = \sum_{i\in s_h} y_{hi}/n_h$
$t_h$ = y-total for stratum h: $t_h = \sum_{i=1}^{N_h} y_{hi}$
The population total: $t = \sum_{h=1}^H t_h$
Consider estimation of $t_h$: $\hat t_h = N_h\bar y_h$
Assuming no auxiliary information in addition to the “stratifying variables”
The stratified estimator of t:
$$\hat t_{st} = \sum_{h=1}^H \hat t_h = \sum_{h=1}^H N_h\bar y_h$$
To estimate the population mean $\mu = t/N$:
Stratified mean: $\bar y_{st} = \hat t_{st}/N = \sum_{h=1}^H \dfrac{N_h}{N}\,\bar y_h$
A weighted average of the sample stratum means.
• Properties of the stratified estimator follow from properties of SRS estimators.
• Notation:
Mean in stratum h: $\mu_h = \sum_{i=1}^{N_h} y_{hi}/N_h$
Variance in stratum h: $\sigma_h^2 = \dfrac{1}{N_h-1}\sum_{i=1}^{N_h} (y_{hi}-\mu_h)^2$
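$E(\hat t_{st}) = t$, i.e. $\hat t_{st}$ is unbiased
$$Var(\hat t_{st}) = \sum_{h=1}^H Var(\hat t_h) = \sum_{h=1}^H N_h^2\,\frac{\sigma_h^2}{n_h}\,(1-f_h), \qquad f_h = n_h/N_h$$
Estimated variance is obtained by estimating the stratum variance with the stratum sample variance
$$s_h^2 = \frac{1}{n_h-1}\sum_{i\in s_h} (y_{hi}-\bar y_h)^2$$
$$\hat V(\hat t_{st}) = \sum_{h=1}^H N_h^2\,\frac{s_h^2}{n_h}\,(1-f_h)$$
Approximate 95% confidence interval if n and N-n are large:
$$\hat t_{st} \pm 1.96\sqrt{\hat V(\hat t_{st})}$$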
Estimating population proportion in stratified simple random sampling
$p_h$: proportion in stratum h with a certain characteristic A
p is the population mean: $p = t/N = \sum_{h=1}^H N_h p_h / N$
Stratum mean estimator: $\hat p_h = \bar y_h$, where $y_{hi} = 1$ if unit i in stratum h has characteristic A
Stratified estimator of p:
$$\hat p_{st} = \bar y_{st} = \sum_{h=1}^H (N_h/N)\,\hat p_h$$
Stratified estimator of the total t = number of units in the population with characteristic A:
$$\hat t_{st} = N\hat p_{st} = \sum_{h=1}^H N_h\hat p_h$$
Estimated variance:
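$$\hat V(\hat p_h) = \left(1-\frac{n_h}{N_h}\right)\frac{\hat p_h(1-\hat p_h)}{n_h-1} \qquad \text{(slide 31)}$$
$$\hat V(\hat p_{st}) = \hat V\left(\sum_{h=1}^H W_h\hat p_h\right) = \sum_{h=1}^H W_h^2\left(1-\frac{n_h}{N_h}\right)\frac{\hat p_h(1-\hat p_h)}{n_h-1}, \qquad \text{where } W_h = N_h/N$$
and
$$\hat V(\hat t_{st}) = N^2\,\hat V(\hat p_{st}) = N^2\sum_{h=1}^H W_h^2\left(1-\frac{n_h}{N_h}\right)\frac{\hat p_h(1-\hat p_h)}{n_h-1}$$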
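The stratified proportion and its estimated standard error can be wrapped in a small function. This is a sketch only: `strat_prop` is a hypothetical helper name, and the stratum sizes and proportions below are made up.

```r
# Stratified estimate of a proportion with its estimated standard error.
# N_h: stratum sizes, n_h: stratum sample sizes, p_h: stratum sample proportions.
strat_prop <- function(N_h, n_h, p_h) {
  W_h  <- N_h / sum(N_h)
  p_st <- sum(W_h * p_h)
  v_st <- sum(W_h^2 * (1 - n_h / N_h) * p_h * (1 - p_h) / (n_h - 1))
  c(p_st = p_st, se = sqrt(v_st))
}

# Made-up numbers for two strata
res <- strat_prop(N_h = c(600, 400), n_h = c(60, 40), p_h = c(0.5, 0.2))
res
res["p_st"] + c(-1.96, 1.96) * res["se"]   # approximate 95% CI for p
```

Multiplying the estimate and the standard error by N gives the corresponding quantities for the total.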
Allocation of the sample units
• Important to determine the sizes of the stratum samples, given the total sample size n and given the strata partitioning: how to allocate the sample units to the different strata
• Proportional allocation
– A representative sample should mirror the population
– Strata proportions: Wh = Nh/N
– Strata sample proportions should be the same: nh/n = Wh
– Proportional allocation:
$$n_h = n\,\frac{N_h}{N}, \qquad \text{i.e. } \frac{n_h}{N_h} = \frac{n}{N} \text{ for all } h$$
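As a quick illustration of proportional allocation, the sketch below uses the stratum sizes of the California schools example that appears later in these slides (4421 elementary, 1018 middle and 755 high schools) with a total sample of 310.

```r
N_h <- c(elementary = 4421, middle = 1018, high = 755)
n <- 310
n_h <- round(n * N_h / sum(N_h))   # proportional allocation, rounded to integers
n_h
sum(n_h)
```

Here the rounded sizes happen to add up to n; in general the rounding may need a small manual adjustment so that the stratum sizes sum to the planned total.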
The stratified estimator under proportional allocation
Inclusion probabilities: $\pi_{hi} = n_h/N_h = n/N$, the same for all units in the population, but it is not an SRS
$$\hat t_{st} = \sum_{h=1}^H N_h\bar y_h = \sum_{h=1}^H \frac{N_h}{n_h}\sum_{i\in s_h} y_{hi} = \frac{N}{n}\sum_{h=1}^H\sum_{i\in s_h} y_{hi} = N\bar y_s$$
The stratified mean: $\bar y_{st} = \hat t_{st}/N = \bar y_s$
The equally weighted sample mean (the sample is self-weighting: every unit in the sample represents the same number of units in the population, N/n)
Variance and estimated variance under proportional allocation
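$$Var(\hat t_{st}) = \sum_{h=1}^H N_h^2\,\frac{\sigma_h^2}{n_h}\,(1-f) = N^2\,\frac{1-f}{n}\sum_{h=1}^H W_h\sigma_h^2, \qquad f = n/N,\; W_h = N_h/N$$
$$\hat V(\hat t_{st}) = N^2\,\frac{1-f}{n}\sum_{h=1}^H W_h s_h^2$$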
• The estimator in a simple random sample: $\hat t_{SRS} = N\bar y_s$
• Under proportional allocation: $\hat t_{st} = \hat t_{SRS}$
• but the variances are different:
Under SRS: $Var(\hat t_{SRS}) = N^2\,\dfrac{1-f}{n}\,\sigma^2$
Under proportional allocation: $Var(\hat t_{st}) = N^2\,\dfrac{1-f}{n}\sum_{h=1}^H W_h\sigma_h^2$
Using the approximations $\dfrac{N_h-1}{N_h} \approx 1$ and $\dfrac{N-1}{N} \approx 1$:
$$\sigma^2 \approx \sum_{h=1}^H W_h\sigma_h^2 + \sum_{h=1}^H W_h(\mu_h-\mu)^2$$
Total variance = variance within strata + variance between strata
Implications:
1. No matter what the stratification scheme is: Under these approximations, proportional allocation gives estimates of the population total that are at least as accurate as those from SRS
2. Choose strata with little variability, i.e. smaller strata variances. Then the strata means will vary more, the between-strata variance becomes larger, and the precision of the estimates increases compared to SRS
3. This is also essentially true in general, as seen from
$$Var(\hat t_{st}) = N^2\sum_{h=1}^H W_h^2\,\frac{1-f_h}{n_h}\,\sigma_h^2$$
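The decomposition into within- and between-strata variation holds exactly when the sums of squares are used, and this is easy to verify numerically. The sketch below uses a simulated population with three made-up strata.

```r
set.seed(4)
h <- rep(1:3, times = c(500, 300, 200))    # hypothetical stratum labels
y <- rnorm(1000, mean = c(10, 20, 40)[h], sd = c(2, 3, 5)[h])
N <- length(y); N_h <- tabulate(h)
mu_h <- tapply(y, h, mean)                 # stratum means
s2_h <- tapply(y, h, var)                  # stratum variances
within  <- sum((N_h - 1) * s2_h)           # within-strata sum of squares
between <- sum(N_h * (mu_h - mean(y))^2)   # between-strata sum of squares
c(total = (N - 1) * var(y), within = within, between = between)

W_h <- N_h / N
c(srs = var(y), proportional = sum(W_h * s2_h))   # per-unit variance factors
```

The last line compares the factor $\sigma^2$ that enters the SRS variance with $\sum W_h\sigma_h^2$ under proportional allocation; the difference is the between-strata component.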
Constructing stratification and drawing stratified sample in R
Use API in California schools as example, with school type as stratifier. 3 strata: Elementary, middle and high schools.
Stratum 1: Elementary schools, N1 = 4421
Stratum 2: Middle schools, N2 = 1018
Stratum 3: High schools, N3 = 755
5% stratified sample with proportional allocation:
n1 = 221
n2 = 51
n3 = 38
n = 310
R-code: making strata

    > x=apipop$stype
    > # To make a stratified variable from schooltype:
    > make123 = function(x)
    + {
    + x=as.factor(x)
    + levels_x = levels(x)
    + x=as.numeric(x)
    + attr(x,"levels") = levels_x
    + x
    + }
    > strata=make123(x)   # numeric stratum codes 1, 2, 3
    > y=apipop$api00
    > tapply(y,strata,mean)
           1        2        3
    672.0627 633.7947 655.7230
    > # 1=E, 2=H, 3=M. Will change stratum 2 and 3


    > x1=as.numeric(strata<1.5)
    > x2=as.numeric(strata<2.5)-x1
    > x3=as.numeric(strata>2.5)
    > stratum=x1+2*x3+3*x2
    > tapply(y,stratum,mean)
           1        2        3
    672.0627 655.7230 633.7947
    > # stratified random sample with proportional allocation
    > N1=4421
    > N2=1018
    > N3=755
    > n1=221
    > n2=51
    > n3=38
    > s1=sample(N1,n1)
    > s2=sample(N2,n2)
    > s3=sample(N3,n3)


    > y1=y[stratum==1]
    > y2=y[stratum==2]
    > y3=y[stratum==3]
    > y1s=y1[s1]
    > y2s=y2[s2]
    > y3s=y3[s3]
    > t_hat1=N1*mean(y1[s1])
    > t_hat2=N2*mean(y2[s2])
    > t_hat3=N3*mean(y3[s3])
    > t_hat=t_hat1+t_hat2+t_hat3
    > muhat=t_hat/6194
    > muhat
    [1] 661.8897
    > mean(y1s)
    [1] 671.1493
    > mean(y2s)
    [1] 652.6078
    > mean(y3s)
    [1] 620.1842


    > varest1=N1^2*var(y1s)*(N1-n1)/(N1*n1)
    > varest2=N2^2*var(y2s)*(N2-n2)/(N2*n2)
    > varest3=N3^2*var(y3s)*(N3-n3)/(N3*n3)
    > se=sqrt(varest1+varest2+varest3)
    > se
    [1] 44915.56
    > semean=se/6194
    > semean
    [1] 7.251463
    > CI=muhat+qnorm(c(0.025,0.975))*semean
    > CI
    [1] 647.6771 676.1023
    > # CI = (647.7, 676.1)

Suppose we regard the sample as an SRS:
> z=c(y1s,y2s,y3s)
> mean(z)
[1] 661.8516
> var(z)
[1] 17345.13
> sesrs=sqrt(var(z)*(6194-310)/(6194*310))
> sesrs
[1] 7.290523
The SRS standard error is 7.29, compared to 7.25 for the stratified SE.
Note: the two estimates are nearly identical (661.85 and 661.89, both about 661.9), since the allocation is proportional up to rounding
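The stratified calculation in the session above can be wrapped in a small helper. This is only a sketch: the function name `stratified_mean`, its arguments, and the toy data are assumptions made for illustration and do not come from the slides.

```r
# Stratified estimate of the population mean with its standard error.
# y_s:       sampled y-values
# stratum_s: stratum label of each sampled unit
# N_h:       named vector of stratum population sizes (names = stratum labels)
stratified_mean <- function(y_s, stratum_s, N_h) {
  h      <- names(N_h)
  n_h    <- tapply(y_s, stratum_s, length)[h]   # stratum sample sizes
  ybar_h <- tapply(y_s, stratum_s, mean)[h]     # stratum sample means
  s2_h   <- tapply(y_s, stratum_s, var)[h]      # stratum sample variances
  N      <- sum(N_h)
  est <- sum(N_h * ybar_h) / N
  # Same variance formula as varest1..varest3 in the session above
  se  <- sqrt(sum(N_h^2 * s2_h * (1 - n_h / N_h) / n_h)) / N
  c(estimate = est, se = se)
}

# Toy data (hypothetical): two strata of sizes 10 and 30
res <- stratified_mean(y_s = c(1, 3, 10, 14),
                       stratum_s = c("a", "a", "b", "b"),
                       N_h = c(a = 10, b = 30))
res
```

With the objects from the session, a call along the lines of `stratified_mean(c(y1s, y2s, y3s), rep(1:3, c(n1, n2, n3)), c("1" = N1, "2" = N2, "3" = N3))` should reproduce `muhat` and `semean`.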
Optimal allocation
If the only concern is to estimate the population total t:
• Choose nh such that the variance of the stratified estimator is minimized
• The solution depends on the unknown stratum variances
• If the stratum variances are approximately equal, proportional allocation minimizes the variance of the stratified estimator
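Optimal allocation (with $\sigma_h$ the standard deviation in stratum h):

$$n_h = n\,\frac{N_h\sigma_h}{\sum_{k=1}^{H} N_k\sigma_k}$$

Proof: Minimize $\mathrm{Var}(\hat t_{st})$ with respect to the sample sizes $n_h$, subject to $\sum_{h=1}^{H} n_h = n$ being fixed.

Use the Lagrange multiplier method: Minimize

$$Q = \sum_{h=1}^{H} N_h^2\sigma_h^2\left(\frac{1}{n_h}-\frac{1}{N_h}\right) + \lambda\left(\sum_{h=1}^{H} n_h - n\right)$$

$$\frac{\partial Q}{\partial n_h} = 0 \;\Leftrightarrow\; -\frac{N_h^2\sigma_h^2}{n_h^2} + \lambda = 0 \;\Leftrightarrow\; n_h = N_h\sigma_h/\sqrt{\lambda}$$

Result follows since the sample sizes must add up to n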
• Called Neyman allocation (Neyman, 1934)
• Should sample heavily in a stratum if
– The stratum accounts for a large part of the population
– The stratum variance is large
• If the stratum variances are equal, this is proportional allocation
• Problem, of course: Stratum variances are unknown
– Take a small preliminary sample (pilot)
– The variance of the stratified estimator is not very sensitive to deviations from the optimal allocation, so crude approximations of the stratum variances are enough
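As a rough illustration of Neyman allocation, the sketch below computes $n_h \propto N_h\sigma_h$ for the three API strata. The function name `neyman_allocation` and the pilot standard deviations are hypothetical; only the stratum sizes and n = 310 are taken from the example earlier in the slides.

```r
# Neyman allocation: n_h = n * N_h * sigma_h / sum(N_k * sigma_k)
# n:       total sample size
# N_h:     stratum population sizes
# sigma_h: (approximate) stratum standard deviations, e.g. from a pilot sample
neyman_allocation <- function(n, N_h, sigma_h) {
  stopifnot(length(N_h) == length(sigma_h), all(sigma_h >= 0))
  n * N_h * sigma_h / sum(N_h * sigma_h)
}

N_h <- c(4421, 1018, 755)          # stratum sizes from the API example
sigma_pilot <- c(120, 110, 100)    # hypothetical pilot values, not from the slides
alloc <- neyman_allocation(310, N_h, sigma_pilot)
round(alloc)

# With equal stratum standard deviations this is proportional allocation
prop <- neyman_allocation(310, N_h, rep(1, 3))
round(prop)
```

With equal standard deviations the rounded allocation is 221, 51 and 38, the sample sizes used in the proportional-allocation example. In practice the unrounded values need to be rounded so that they still sum to n.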
Optimal allocation when considering the cost of a survey
• C represents the total cost of the survey, fixed – our budget
• c0 : overhead cost, like maintaining an office
• ch : cost of taking an observation in stratum h
– Home interviews: traveling cost + interview
– Telephone or postal surveys: ch is the same for all strata
– In some strata: telephone, in others home interviews

$$C = c_0 + \sum_{h=1}^{H} n_h c_h$$
• Minimize the variance of the stratified estimator for a given total cost C
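Minimize

$$\mathrm{Var}(\hat t_{st}) = N^2\sum_{h=1}^{H} W_h^2\sigma_h^2\left(\frac{1}{n_h}-\frac{1}{N_h}\right), \qquad W_h = N_h/N$$

subject to: $c_0 + \sum_{h=1}^{H} n_h c_h = C$

Solution: $n_h \propto W_h\sigma_h/\sqrt{c_h}$

$$n_h = (C - c_0)\,\frac{W_h\sigma_h/\sqrt{c_h}}{\sum_{k=1}^{H} W_k\sigma_k\sqrt{c_k}}$$

Hence, for a fixed total cost C:

$$n = (C - c_0)\,\frac{\sum_{h=1}^{H} N_h\sigma_h/\sqrt{c_h}}{\sum_{h=1}^{H} N_h\sigma_h\sqrt{c_h}}$$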
1. Large samples in inexpensive strata
2. If the $c_h$'s are equal: Neyman allocation
3. If the $c_h$'s are equal and the $\sigma_h$'s are equal: proportional allocation

We can express the optimal sample sizes in relation to n

$$n_h = n\,\frac{W_h\sigma_h/\sqrt{c_h}}{\sum_{k=1}^{H} W_k\sigma_k/\sqrt{c_k}}$$

In particular, if ch = c for all h: $n = (C - c_0)/c$
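The cost-constrained allocation can be sketched in a few lines of R. The function name `cost_allocation`, the budget, the unit costs and the standard deviations below are all hypothetical values chosen for illustration; only the formulas come from the slides.

```r
# Optimal allocation for a fixed total cost C = c0 + sum(n_h * c_h):
# n_h is proportional to N_h * sigma_h / sqrt(c_h).
cost_allocation <- function(C, c0, N_h, sigma_h, c_h) {
  w <- N_h * sigma_h / sqrt(c_h)                         # allocation weights
  n <- (C - c0) * sum(w) / sum(N_h * sigma_h * sqrt(c_h))  # total sample size
  list(n = n, n_h = n * w / sum(w))
}

N_h     <- c(4421, 1018, 755)   # stratum sizes from the API example
sigma_h <- c(120, 110, 100)     # hypothetical stratum standard deviations
c_h     <- c(5, 10, 20)         # hypothetical cost per observation

res <- cost_allocation(C = 5000, c0 = 1000, N_h, sigma_h, c_h)
res$n_h
1000 + sum(res$n_h * c_h)       # total cost implied by the allocation

# Equal unit costs: n = (C - c0) / c and the allocation is Neyman allocation
res_eq <- cost_allocation(C = 5000, c0 = 1000, N_h, sigma_h, rep(10, 3))
res_eq$n
```

The second printed value checks that the allocation spends exactly the budget C. As with Neyman allocation, the resulting $n_h$ are not integers and must be rounded.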
Other issues with optimal allocation
• Many survey variables
• Each variable leads to a different optimal solution
– Choose one or two key variables
– Use proportional allocation as a compromise
• If nh > Nh, let nh = Nh and use optimal allocation for the remaining strata
• If nh = 1, we cannot estimate the variance. Force nh = 2 or collapse strata for variance estimation
• Number of strata: For a given n it is often best to increase the number of strata as much as possible. Depends on available information
• Sometimes the main interest is in precision of the estimates for stratum totals and less interest in the precision of the estimate for the population total
• Need to decide nh to achieve desired accuracy for estimate of th, discussed earlier
– If we decide to do proportional allocation, this can mean that the sample size nh must be increased in small strata (small Nh)
Poststratification
• Stratification reduces the uncertainty of the estimator compared to SRS
• In many cases one wants to stratify according to variables that are not known or used in sampling
• Can then stratify after the data have been collected
• Hence, the term poststratification
• The estimator is then the usual stratified estimator according to the poststratification
• If we take a SRS and N-n and n are large, the estimator behaves like the stratified estimator with proportional allocation
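A poststratified estimate of the population mean can be sketched as follows. The function name `poststratified_mean` and the toy data are hypothetical; the sketch assumes that the poststratum population counts $N_h$ are known and that every poststratum is represented in the sample.

```r
# Poststratified mean: the usual stratified estimator, with strata formed
# after the data have been collected.
# y:    sample values
# post: poststratum label of each sampled unit
# N_h:  named vector of known poststratum population counts
poststratified_mean <- function(y, post, N_h) {
  ybar_h <- tapply(y, post, mean)
  stopifnot(all(names(N_h) %in% names(ybar_h)))  # every poststratum observed
  sum(N_h * ybar_h[names(N_h)]) / sum(N_h)
}

# Toy sample (hypothetical): poststratum "a" is over-represented
y    <- c(10, 12, 11, 20, 22)
post <- c("a", "a", "a", "b", "b")
N_h  <- c(a = 50, b = 50)

mean(y)                             # plain sample mean
poststratified_mean(y, post, N_h)   # reweighted to the known counts
```

In this toy sample the plain mean is pulled towards poststratum "a", while the poststratified mean weights the two poststratum means by their known population shares.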
Poststratification to reduce nonresponse bias
• Poststratification is mostly used to correct for nonresponse
• Choose strata with different response rates
• Poststratification amounts to assuming that the response sample in poststratum h is representative for the nonresponse group in the sample from poststratum h
Systematic sampling
• Idea: Order the population and select every kth unit
• Procedure: U = {1,…,N} and N = nk + c, c < n
1. Select a random integer r between 1 and k, with equal probability
2. Select the sample sr by the systematic rule
sr = {i: i = r + (j-1)k, j = 1, …, nr}
where the actual sample size nr takes values [N/k] or [N/k] + 1
k : sampling interval = [N/n]
• Very easy to implement: Visit every 10th house or interview every 50th name in the telephone book
• k distinct samples each selected with probability 1/k
$$p(s) = \begin{cases} 1/k & \text{if } s = s_r,\ r = 1,\dots,k \\ 0 & \text{otherwise} \end{cases}$$
• Unlike in SRS, many subsets of U have zero probability
Examples:
1) N =20, n=4. Then k=5 and c=0. Suppose we select r =1. Then the sample is {1,6,11,16}
5 possible distinct samples. In SRS: 4845 distinct samples
2) N = 149, n = 12. Then k = 12, c = 5. Suppose r = 3. s3 = {3,15,27,39,51,63,75,87,99,111,123,135,147} and sample size is 13
3) N = 20, n = 8. Then k = 2 and c = 4. Sample size is nr = 10
4) N= 100 000, n = 1500. Then k = 66 , c=1000 and c/k =15.15 with [c/k]=15. nr = 1515 or 1516
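The systematic rule and the examples above can be checked with a short sketch. The helper name `systematic_sample` is an assumption for illustration; it simply takes every kth unit starting at a random start r.

```r
# Systematic sample: s_r = {r, r + k, r + 2k, ...} within 1..N
# N: population size; k: sampling interval; r: random start in 1..k
systematic_sample <- function(N, k, r = sample(k, 1)) {
  stopifnot(r >= 1, r <= k)
  seq(from = r, to = N, by = k)
}

systematic_sample(20, 5, r = 1)             # example 1: units 1, 6, 11, 16
length(systematic_sample(149, 12, r = 3))   # example 2: sample size 13
length(systematic_sample(20, 2, r = 1))     # example 3: sample size 10

# Example 4: the realized sample size depends on the random start r
sizes <- sapply(1:66, function(r) length(systematic_sample(100000, 66, r)))
table(sizes)
```

Calling `systematic_sample(N, k)` without `r` draws the random start with equal probability 1/k, as in step 1 of the procedure.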
Estimation of the population total
$$t(s) = \sum_{i\in s} y_i, \qquad n(s) = \text{sample size}$$

Two estimators (equal when N = nk):

1) $\hat t(s) = k\,t(s) = [N/n]\,t(s)$

2) $N\bar y_s = N\,t(s)/n(s)$

$n(s) = [N/k]$ or $[N/k] + 1$

These estimators are approximately the same:

$$\frac{N}{n(s)} \approx \frac{N}{N/k} = k$$
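Estimator 1), $\hat t = k\,t(s)$, is unbiased:

$$E(\hat t) = \sum_{r=1}^{k}\hat t(s_r)\,p(s_r) = \frac{1}{k}\sum_{r=1}^{k} k\,t(s_r) = \sum_{r=1}^{k} t(s_r) = t$$

Estimator 2), $N\bar y_s$, is only approximately unbiased (it's a ratio estimator)
- usually slightly smaller variance than $\hat t$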
• Advantage of systematic sampling: Can be implemented even where no population frame exists
•E.g. sample every 10th person admitted to a hospital, every 100th tourist arriving at LA airport.
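$$\mathrm{Var}(\hat t) = E(\hat t - t)^2 = \sum_{r=1}^{k}\left(\hat t(s_r) - t\right)^2 p(s_r) = \frac{1}{k}\sum_{r=1}^{k}\left(k\,t(s_r) - t\right)^2 = k\sum_{r=1}^{k}\left(t(s_r) - \bar t\right)^2$$

where $\bar t = \sum_{r=1}^{k} t(s_r)/k$ is the average of the sample totals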
• The variance is small if $t(s_r)$ varies little, i.e., if the "strata" {1,…,k}, {k+1,…,2k}, etc., are very homogeneous
• Or, equivalently, if the values within the possible samples sr are very different; the samples are heterogeneous
• Problem: The variance cannot be estimated properly because we have only one observation of t(sr)
Systematic sampling as Implicit Stratification
In practice: Very often when using systematic sampling (common design in national statistical institutes):
The population is ordered such that the first k units constitute a homogeneous “stratum”, the second k units another “stratum”, etc.

| Implicit strata | Units |
|---|---|
| 1 | 1, 2, …, k |
| 2 | k+1, …, 2k |
| ⋮ | ⋮ |
| n (n = N/k assumed) | (n-1)k+1, …, nk |
Systematic sampling selects 1 unit from each stratum at random
Systematic sampling vs SRS
• Systematic sampling is more efficient if the study variable is homogeneous within the implicit strata
– Ex: households ordered according to house numbers within neighbourhoods and study variable related to income
• Households in the same neighbourhood are usually homogeneous with respect to socio-economic variables
• If population is in random order (all N! permutations are equally likely): systematic sampling is similar to SRS
• Systematic sampling can be very bad if y has periodic variation relative to k:
– Approximately: y1 = yk+1, y2 = yk+2, etc.
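The effect of the population ordering can be illustrated by computing the exact design variances for two artificial populations. Everything in this sketch (function names, the two populations, N = 100 and k = 10) is a hypothetical illustration; for simplicity it assumes N = nk.

```r
# Exact design variance of t_hat = k * t(s_r) over the k possible
# systematic samples: k * sum_r (t(s_r) - tbar)^2
sys_design_var <- function(y, k) {
  stopifnot(length(y) %% k == 0)   # sketch assumes N = n * k
  totals <- sapply(1:k, function(r) sum(y[seq(r, length(y), by = k)]))
  k * sum((totals - mean(totals))^2)
}

# Design variance of N * ybar_s under SRS of size n
srs_design_var <- function(y, n) {
  N <- length(y)
  N^2 * var(y) * (1 - n / N) / n
}

N <- 100; k <- 10; n <- N / k
y_trend    <- 1:N                 # ordered population: homogeneous implicit strata
y_periodic <- rep(1:k, times = n) # period equal to the sampling interval k

c(systematic = sys_design_var(y_trend, k),    srs = srs_design_var(y_trend, n))
c(systematic = sys_design_var(y_periodic, k), srs = srs_design_var(y_periodic, n))
```

For the ordered population the systematic design variance is far below the SRS variance, while for the periodic population the ordering is reversed, because every unit in a given systematic sample then has the same y-value.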
Variance estimation
• No direct estimate: it is impossible to obtain an unbiased variance estimate from a single systematic sample
• If population is in random order: can use the variance estimate from SRS as an approximation
• Develop a conservative variance estimator by collapsing the “implicit strata”, overestimate the variance
• The most promising approach may be:
Under a statistical model, estimate the expected value of the design variance
• Typically, systematic sampling is used in the second stage of two-stage sampling (to be discussed later), may not be necessary to estimate this variance then.
Cluster sampling and multistage sampling
• Sampling designs so far: Direct sampling of the units in a single stage of sampling
• For economic and practical reasons, it may be necessary to modify these sampling designs
– There exists no population frame (register: list of all units in the population), and it is impossible or very costly to produce such a register
– The population units are scattered over a wide area, and a direct sample will also be widely scattered. In case of personal interviews, the traveling costs would be very high and it would not be possible to visit the whole sample
• Modified sampling can be done by
1. Selecting the sample indirectly in groups, called clusters; cluster sampling
– Population is grouped into clusters
– Sample is obtained by selecting a sample of clusters and observing all units within the clusters
– Ex: In Labor Force Surveys: Clusters = Households, units = persons
2. Selecting the sample in several stages; multistage sampling
3. In two-stage sampling:
• Population is grouped into primary sampling units (PSU)
• Stage 1: A sample of PSUs
• Stage 2: For each PSU in the sample at stage 1, we take a sample of population units, now also called secondary sampling units (SSU)
• Ex: PSUs are often geographical regions
Examples
1. Cluster sampling. Want a sample of high school students in a certain area, to investigate smoking and alcohol use. If a list of high school classes is available, we can select a sample of high school classes and give the questionnaire to every student in the selected classes; this is cluster sampling with the high school classes being the clusters
2. Two-stage cluster sampling. If a list of classes is not available, we can first select high schools, then classes and finally all students in the selected classes. Then we have a 2-stage cluster sample.
1. PSU = high school
2. SSU = classes
3. Units = students
Psychiatric Morbidity Survey is a 4-stage sample
- Population: adults aged 16-64 living in private households in Great Britain
- PSUs = postal sectors
- SSUs = addresses
- 3SUs = households
- Units = individuals
Sampling process:
1) 200 PSUs selected
2) 90 SSUs selected within each sampled PSU (interviewer workload)
3) All households selected per SSU
4) 1 adult selected per household
Cluster sampling
Number of clusters in the population: N
Number of units in cluster i: $M_i$
Population size: $M = \sum_{i=1}^{N} M_i$
$t_i$ = y-total in cluster i, $t = \sum_{i=1}^{N} t_i$
Population mean for the y-variable: $\bar y = t/M$
Sample of clusters: $s_I$, $n = |s_I|$
Final sample of units: s = all units in $s_I$
Size of final sample: $m = \sum_{i\in s_I} M_i$, not fixed in advance
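Simple random cluster sampling
Ratio-to-size estimator

Use auxiliary information: Size of the sampled clusters

$$\hat t_R = M\,\frac{\sum_{i\in s_I} t_i}{\sum_{i\in s_I} M_i}$$

Approximately unbiased with approximate variance

$$\mathrm{Var}(\hat t_R) \approx N^2\,\frac{1-f}{n}\,\frac{1}{N-1}\sum_{i=1}^{N} M_i^2\left(\bar y_i - \bar y\right)^2$$

where $\bar y_i = t_i/M_i$, the cluster mean, and $\bar y = t/M$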
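estimated by

$$\hat V(\hat t_R) = N^2\,\frac{1-f}{n}\left(\frac{M/N}{m/n}\right)^2\frac{1}{n-1}\sum_{i\in s_I} M_i^2\left(\bar y_i - \bar y_s\right)^2$$

where $f = n/N$ and $\bar y_s = \sum_{i\in s_I} t_i / \sum_{i\in s_I} M_i$ is the usual sample mean

Note that this ratio estimator is in fact the usual sample mean based estimator with respect to the y-variable

$$\hat t_R = M\bar y_s$$

And the corresponding estimator of the population mean of y is $\bar y_s$

Can be used also if M is unknown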
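The ratio-to-size estimator and its estimated standard error can be sketched as below. The function name `cluster_ratio_estimate` and the toy cluster data are hypothetical. The standard error of $\bar y_s$ is obtained by dividing the variance estimator for $\hat t_R$ by $M^2$, so it can be computed even when M is unknown.

```r
# Ratio-to-size estimator under simple random cluster sampling.
# t_i: y-totals of the n sampled clusters; M_i: their sizes
# N:   number of clusters in the population
# M:   population size (optional; only needed for the total)
cluster_ratio_estimate <- function(t_i, M_i, N, M = NULL) {
  n <- length(t_i); m <- sum(M_i); f <- n / N
  ybar_s <- sum(t_i) / m                          # usual sample mean
  ybar_i <- t_i / M_i                             # cluster means
  s2 <- sum(M_i^2 * (ybar_i - ybar_s)^2) / (n - 1)
  se_mean <- sqrt((1 - f) / n * s2) / (m / n)     # estimated SE of ybar_s
  out <- list(mean = ybar_s, se_mean = se_mean)
  if (!is.null(M)) {                              # total requires known M
    out$total <- M * ybar_s
    out$se_total <- M * se_mean
  }
  out
}

# Toy data (hypothetical): 3 sampled clusters out of N = 30, M = 300 units
res <- cluster_ratio_estimate(t_i = c(20, 30, 10), M_i = c(10, 10, 10),
                              N = 30, M = 300)
res
```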
• Estimator’s variance is highly influenced by how the clusters are constructed.

Choose clusters to make $\sum_{i=1}^{N} M_i^2\left(\bar y_i - \bar y\right)^2$ small: make the clusters heterogeneous, such that most of the variation in the y-values lies within the clusters, making the $\bar y_i$-values similar

• Note: The opposite in stratified sampling
• Typically, clusters are formed by “nearby units” like households, schools, hospitals for economic and practical reasons, with little variation within the clusters:
Simple random cluster sampling will lead to much less precise estimates compared to SRS, but this is offset by big cost reductions
Sometimes SRS is not possible; information only known for clusters
Design Effects
A design effect (deff) compares efficiency of two design-estimation strategies (sampling design and estimator) for same sample size
Now: Compare
Strategy 1:simple random cluster sampling with ratio estimator
Strategy 2: SRS, of same sample size m, with usual sample mean estimator
In terms of estimating population mean:
Strategy 1 estimator: $\hat t_R/M = \bar y_s$
Strategy 2 estimator: $\bar y_s$
The design effect of simple random cluster sampling, SCS, is then

$$\mathrm{deff}(SCS, \bar y_s) = \mathrm{Var}_{SCS}(\bar y_s)/\mathrm{Var}_{SRS}(\bar y_s)$$

Estimated deff: $\hat V_{SCS}(\bar y_s)/\hat V_{SRS}(\bar y_s)$
In probation example:
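$$\hat V_{SRS}(\hat p) = \frac{\hat p(1-\hat p)}{m-1}\,(1-f) \approx \frac{\hat p(1-\hat p)}{m-1} = 0.00387^2$$

Estimated deff = $0.0302^2/0.00387^2 = 60.9$

Conclusion: Cluster sampling is much less efficient

Note: We can estimate the f.p.c. factor $1 - m/M$ by letting $\hat M = N(m/n)$, so that $m/\hat M = n/N$ and $1 - m/\hat M = 1 - n/N = 16/26 = 0.615$. Then est deff = 60.9/0.615 = 99!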
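The arithmetic of the estimated design effect can be reproduced directly. The variable names below are chosen for illustration; the numbers 0.0302, 0.00387 and 16/26 are the ones quoted in the probation example.

```r
# Estimated design effect in the probation example
se_scs <- 0.0302    # estimated SE under simple random cluster sampling
se_srs <- 0.00387   # estimated SE under SRS, ignoring the f.p.c.

deff <- se_scs^2 / se_srs^2
round(deff, 1)      # 60.9

# Including the estimated finite population correction 1 - n/N = 16/26
fpc <- 16 / 26
round(deff / fpc)   # 99
```

Including the f.p.c. reduces the SRS variance in the denominator, which is why the estimated deff rises from about 61 to about 99.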