Sampling Methodology

LEARNING PROGRAMME: Sampling Methodology. Intermediate Training in Quantitative Analysis, Bangkok, 19-23 November 2007. Some materials are modified from the presentation ‘Comprehensive Survey Design’, Bradley A. Woodruff, CDC.

Transcript of Sampling Methodology

Page 1: Sampling Methodology

LEARNING PROGRAMME

Sampling Methodology

Intermediate Training in Quantitative Analysis

Bangkok, 19-23 November 2007

Some materials are modified from the presentation ‘Comprehensive Survey Design’, Bradley A. Woodruff, CDC

Page 2: Sampling Methodology

Topics to be covered in this presentation

1. Basic Introduction

2. Bias and error, accuracy and precision

3. Calculating sample size

4. Sampling Methodologies

5. Final Exercise

Page 3: Sampling Methodology

Learning objectives

By the end of this session, the participant should be able to:

• Differentiate between precision and accuracy, and between bias and error
• Calculate sample size
• Understand different sampling methodologies

Page 4: Sampling Methodology

Starting point

Define the objectives of the survey:
• Specific indicators to be measured (food security, nutrition)
• Target groups (e.g. displaced households)
• Population groups or geographic areas to be included in the survey

Must also determine the level(s) at which to survey (the unit of analysis):
• Community
• Household (most common for CFSVAs)
• Children under 5 years of age

Page 5: Sampling Methodology

Starting point (cont.)

Must clearly define the geographic area to be surveyed
• Defines the population to which results can be generalized

May be defined by:
• Area in which a programme has been implemented or is planned
• An easily defined political unit: district, province, country
• Combination of units: rural areas in a province
• Other units, such as livelihood zones, agro-ecological zones, etc.

Page 6: Sampling Methodology

What is a cross-sectional survey?

A cross-sectional survey is a collection of data from a specific population at a single point in time.

CFSVAs and EFSAs are typically cross-sectional surveys

Often referred to as a ‘snapshot in time’

Sometimes referred to as a population survey.

(FSMS is typically a longitudinal survey)

Page 7: Sampling Methodology

What is sampling?

Sampling is the process of selecting a number of subjects (a “sample population”) from all the subjects in a “target population” or “universe.”

Source: Last. A Dictionary of Epidemiology

Page 8: Sampling Methodology

Two sampling methods

Probability: Random methods decide who is selected, and the chance of a person being selected is known

Non-probability: Subjective judgment is used to select the sample, and the chance of a person being selected is not known

Page 9: Sampling Methodology

Why use probability sampling?

• To estimate or measure certain outcomes (prevalence of child malnutrition, food insecurity, etc.) for a larger population by measuring only a subset of that population
• Without probability sampling, a correct estimate for the larger population could only be attained by measuring the entire population

We will focus exclusively on probability sampling methods

Page 10: Sampling Methodology

Bias and Error, Accuracy and Precision

Page 11: Sampling Methodology

Bias and error

[Diagram: non-sampling bias and sampling bias (together: bias), alongside sampling error]

Page 12: Sampling Methodology

Non-sampling bias

Bias introduced into the survey that is not related to the sampling methodology or sampling scheme
• Always present to some extent, and cannot be measured

Examples:
• Sampling frame out of date, or no accurate population numbers or household locations
• Non-response to certain modules of the questionnaire, for whatever reason
• Measurement error: child ages and weights not recorded correctly

Page 13: Sampling Methodology

Sampling bias

Bias that is introduced by inadequate sampling methodologies
• Almost impossible to measure

Examples:
• Non-representative sampling
• Failure to weight

Page 14: Sampling Methodology

Sampling error

Difference between the survey result and the population value due to random selection of the sample
• Measurable and can be accounted for

Example:
• 15% GAM rate in the survey population but 10% GAM rate in the overall population (an error of 5 percentage points)

Sampling error is influenced by:
• Sample size
• Sampling scheme
• The spread of the indicator we want to measure

Page 15: Sampling Methodology

Sampling error

Measures of sampling error:
• Confidence limits
• Standard error
• Coefficient of variation
• Probability values (P values)
• Others

Use these measures to:
• Calculate sample size prior to sampling
• Determine how sure we are of the result after analysis

Page 16: Sampling Methodology

Bias and error need to be understood within the context of two other terms…

Accuracy: The degree to which a measurement, or an estimate based on measurements, represents the true value of the attribute that is being measured
• How close the sample population estimate is to the true population value

Precision: Precision corresponds to the reduction of random error
• How close the sample population estimates are to each other if the survey is repeated

A measurement can be precise (low random error) but still inaccurate (because of a systematic bias): give examples

Page 17: Sampling Methodology

Accuracy: obtaining results close to truth

Driven by whether the instrument accurately measures what is intended, whether the population measured is representative of the true population, etc. (whether there is bias)

[Figure: estimates from Survey 1, Survey 2 and Survey 3 compared with the real population value]

Page 18: Sampling Methodology

Precision: obtaining similar results with repeated measurement

Driven by sample size (error in the sample)

Page 19: Sampling Methodology

How do bias and error relate to these terms?

Bias (both sampling and non-sampling) affects accuracy

Sampling error affects precision, and precision can be controlled through sample size

Page 20: Sampling Methodology

Explain survey estimates in terms of each of these terms

Page 21: Sampling Methodology

And this??

Page 22: Sampling Methodology

And???

Page 23: Sampling Methodology

Finally, this???

Page 24: Sampling Methodology

Calculating sample size

Page 25: Sampling Methodology

Calculate sample size

Sample size calculation determines the number of individuals that need to be interviewed in order to properly estimate information for a larger population

Why calculate sample size?
• Collecting data is expensive
• Collecting data and specimens is inconvenient for subjects
• Collecting data takes time

Page 26: Sampling Methodology

Calculate sample size

To estimate sample size for a single survey, need to know:
• Estimate of the prevalence of the outcome (% food insecure households, % of wasted children, etc.)
• Precision desired
• Size of total population
• Level of confidence (always use 95%)

Page 27: Sampling Methodology

Calculate sample size

To calculate sample size for estimate of prevalence with 95% confidence limit:

$$N = \frac{1.96^2 \times P(1-P)}{d^2}$$

• 1.96 = Z value for p = 0.05 or 95% confidence intervals (1.645 = Z value for p = 0.10 or 90% confidence intervals)
• P = Estimated prevalence
• d = Desired precision (for example, 0.08 for ± 8%)
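As a rough illustration, the formula above can be written as a small Python function. The name `srs_sample_size` and the decision to round up to a whole respondent are assumptions made for this sketch; they do not come from the slides.

```python
import math

def srs_sample_size(p, d, z=1.96):
    """Sample size for estimating a prevalence p with precision +/- d.

    Implements N = z^2 * p * (1 - p) / d^2 for a large population.
    Rounding up is an assumption of this sketch.
    """
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    if d <= 0:
        raise ValueError("d must be positive")
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# 50% prevalence, +/- 8 percentage points, 95% confidence
print(srs_sample_size(0.5, 0.08))
# 50% prevalence, +/- 5 percentage points, 95% confidence
print(srs_sample_size(0.5, 0.05))
```

Passing `z=1.645` would give the corresponding 90% confidence calculation.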

Page 28: Sampling Methodology

Precision and sample size

[Figure: Effect of changing the desired precision (assume 95% CI, +/- .05, large population). Horizontal axis: width of confidence interval, 0.00 to 0.20. Vertical axis: required sample size, 0 to 3000.]
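The relationship plotted on this slide can be approximated with a few lines of Python. This is only a sketch: the assumed prevalence of 0.5 and the helper name `required_n` are choices made here, since the slide does not state which prevalence was used for the chart.

```python
import math

def required_n(d, p=0.5, z=1.96):
    # N = z^2 * p * (1 - p) / d^2, rounded up; p = 0.5 is an assumption
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

for d in (0.025, 0.05, 0.10, 0.20):
    print(f"precision +/- {d:.3f}: n = {required_n(d)}")
```

Because d is squared in the denominator, halving the desired precision roughly quadruples the required sample size.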

Page 29: Sampling Methodology

Calculate sample size

Where to get information to make an assumption about prevalence?
• Prior surveys
• Qualitative estimates
• Wild guesses

Err toward an assumed prevalence of 50% when calculating sample size: P(1-P) is largest at P = 0.5, so this gives the largest (most conservative) sample size.

Page 30: Sampling Methodology

Estimated prevalence and sample size

[Figure: Effect of changing the estimated prevalence (assume 95% CI, +/- .05, large population). Horizontal axis: estimated prevalence, 0.00 to 1.00. Vertical axis: required sample size, 0 to 500.]
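A short sketch of the curve described on this slide, using the stated assumptions (95% CI, precision of ± 0.05, large population). The helper name `required_n` and the rounding up are assumptions of the example.

```python
import math

def required_n(p, d=0.05, z=1.96):
    # N = z^2 * p * (1 - p) / d^2, rounded up to a whole respondent
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"estimated prevalence {p:.1f}: n = {required_n(p)}")
```

The required sample size peaks at a prevalence of 0.5 and is symmetric around it, which is why 50% is the cautious default.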

Page 31: Sampling Methodology

What about sample size for a cluster survey?

$$\text{Design effect} = \frac{\text{Variance with cluster sampling}}{\text{Variance with SRS}} = \frac{\text{Sample size of a cluster sample that gives the same CI as SRS}}{\text{Sample size with SRS}}$$

SRS: Simple Random Sampling

Page 32: Sampling Methodology

Design effect

Generally speaking, households in the same village are often similar to each other (there is an intra-cluster correlation).

Twenty households from two villages will not tell us as much about the entire population as twenty households all coming from different villages.

The higher the intra-cluster correlation and the more households come from the same cluster, the higher the design effect. Example: by chance we have 2 villages of predominantly fisherfolk.
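The slides do not give a formula linking intra-cluster correlation to design effect. A commonly used approximation for equal-sized clusters is DEFF = 1 + (m - 1) × ρ, where m is the number of households per cluster and ρ is the intra-cluster correlation. The sketch below applies it to the twenty-household comparison; the value ρ = 0.1 is purely hypothetical.

```python
def design_effect(households_per_cluster, icc):
    """Approximate design effect for equal-sized clusters:
    DEFF = 1 + (m - 1) * rho. Not taken from the slides."""
    return 1 + (households_per_cluster - 1) * icc

ICC = 0.1            # hypothetical intra-cluster correlation
TOTAL_HOUSEHOLDS = 20

for n_villages in (2, 20):
    m = TOTAL_HOUSEHOLDS // n_villages
    deff = design_effect(m, ICC)
    effective_n = TOTAL_HOUSEHOLDS / deff
    print(f"{n_villages} villages x {m} households: "
          f"DEFF = {deff:.2f}, effective sample size = {effective_n:.1f}")
```

Under these assumptions, twenty households from two villages carry roughly the information of ten or eleven independent households, while twenty households from twenty villages carry the full twenty.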

Page 33: Sampling Methodology

It is good to be familiar with these formulas, but we have computers to help us with the calculations…

Page 34: Sampling Methodology

Sample size calculators

• Epi Info (www.cdc.gov)
• ODAN stat calculator
• Excel worksheet

Page 35: Sampling Methodology

Sampling methodologies

• Simple random sample
• Systematic random sample
• Cluster sample
• Stratified sample
• Complex sampling designs

Page 36: Sampling Methodology

Simple random sampling (SRS)

• Most basic type of sampling
• Statistical theory is based on SRS
  • Calculation of p values and confidence limits
  • Output from most statistical computer programs assumes an SRS
• Selection of people is independent and random

Page 37: Sampling Methodology

Advantages and disadvantages of SRS

Advantages:
• No selection bias
• Self-weighting

Disadvantages:
• Requires knowledge of population
• Costly to survey when population is spread out
• Sampling frame may not be available or complete

Page 38: Sampling Methodology

Steps for conducting a simple random sample

• Create a list of all the sampling units
• Number each unit consecutively
• Randomly select numbers between 1 and the total number of sampling units, using one of:
  • Random number table
  • Computer-generated random numbers (RAND in Excel), picking the units with the highest numbers
  • “Pick a number from a hat”
  • Birthday or serial number on paper money
  • Flip a coin, roll a die, pick a card, pull a straw
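The steps above can be sketched in Python using only the standard library. The household names are the ones used on the example slide; the function names and the fixed seed are assumptions made for illustration. The second function mimics the spreadsheet approach of attaching a random number to every unit and keeping the highest ones.

```python
import random

households = ["Smith", "Pfeiffer", "Anderson", "Timmer", "Huff",
              "Hunt", "Parvanta", "Grummer-Strawn", "Bobrow", "Cooper"]

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement, each with equal probability."""
    rng = random.Random(seed)
    return rng.sample(units, n)

def sample_by_random_keys(units, n, seed=None):
    """Spreadsheet-style variant: give every unit a random number,
    then keep the n units with the highest numbers."""
    rng = random.Random(seed)
    keyed = [(rng.random(), unit) for unit in units]
    keyed.sort(reverse=True)
    return [unit for _, unit in keyed[:n]]

print(simple_random_sample(households, 3, seed=2007))
print(sample_by_random_keys(households, 3, seed=2007))
```

Fixing the seed makes a draw reproducible for documentation purposes; omit it for a fresh random selection.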

Page 39: Sampling Methodology

Simple random sampling

Number  Household
1       Smith
2       Pfeiffer
3       Anderson
4       Timmer
5       Huff
6       Hunt
7       Parvanta
8       Grummer-Strawn
9       Bobrow
0       Cooper

Random number table

7648 2352 6959 1937 2554 6804 9098 4316 4318 2346 7276 1880 7136 9603 0163 3152 7000 2865 8357 4475 9804 0042 1106 7949 2932 9958 9582 2235 1140 1164 7841 1688 4097 8995 5030 1785 5420 0125 4953 1332 5540 6278 1584 4392 3258 1374 1617 7427

Page 41: Sampling Methodology

Systematic random sampling

Similar to simple random sampling

First person chosen randomly

Systematic selection of subsequent people

Statistics same as simple random sampling

Page 42: Sampling Methodology

Steps for systematic random sample

• List the sampling units
• Divide the number of sampling units by the sample size to determine the sampling interval
• Choose a random number between 1 and the sampling interval
  • Identifies the first selected sampling unit
• Add the sampling interval to the random number to identify the second selected sampling unit
• Continue to add the sampling interval until the end of the list

Page 43: Sampling Methodology

Systematic sample example

• Example: A survey was undertaken to assess household livelihoods in one community of 480 houses. Sample size calculations revealed that 40 households would need to be sampled systematically to be representative of the larger community.

• Sampling interval = 480/40 = 12

• Random number between 1 and 12 was chosen (7)

• First house sampled = 7

• Subsequent households sampled:

• 7 + 12 = 19

• 19 + 12 = 31

• 31 + 12 = 43

• Etc.

Danger: unknown, hidden patterns in the population could bias the sample
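The worked example can be reproduced with a short Python sketch. The function name is an assumption, and the code assumes the population size divides evenly by the sample size, as it does here (480 / 40 = 12).

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based positions selected by systematic sampling.

    Assumes population_size is a multiple of sample_size, so that the
    sampling interval is a whole number.
    """
    interval = population_size // sample_size
    if start is None:
        # Random start between 1 and the sampling interval
        start = random.Random(seed).randint(1, interval)
    return list(range(start, population_size + 1, interval))[:sample_size]

selected = systematic_sample(480, 40, start=7)
print(selected[:4])    # first four house numbers
print(len(selected))   # number of houses selected
```

With the random start of 7 used on the slide, the selection begins 7, 19, 31, 43 and ends at house 475.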

Page 44: Sampling Methodology

Household listing

Page 45: Sampling Methodology

Systematic random sampling

For instance, every third unit in the sampling frame.

Danger: if the sampling frame has a built-in structure or order, the sample may be biased.

Page 46: Sampling Methodology

What is required for both simple and systematic random sampling?

• Both require a complete list of all basic sampling units arranged in some order.
• Resources have to be adequate to sample throughout the target population

Page 47: Sampling Methodology

What if there is no household listing?

What if the area of the target population is too widespread for available resources?

Page 48: Sampling Methodology

Cluster Sampling!!

Page 49: Sampling Methodology

Cluster sampling

Definition: Probability sampling in which sampling units at some point in the selection process are collections, or clusters, of population elements

Source: Kalsbeek, Introduction to survey sampling

Page 50: Sampling Methodology

Cluster sampling

Objective: To choose smaller geographic areas in which simple or systematic random sampling can be done

Cluster samples, for our purposes, are multistage (usually 2 or 3 stages)

Page 51: Sampling Methodology

Cluster Sampling - Illustration

[Figure: simple random sampling of 30 households from the sampling universe, showing selected and non-selected households]

Page 52: Sampling Methodology

Cluster Sampling - Illustration

[Figure: cluster sampling of 30 households in 3 clusters (labelled 1, 2 and 3) of 10 households each]

Page 53: Sampling Methodology

Cluster sampling

Advantages:
• Cheaper: basic sampling units are closer together
• Does not need a complete list of basic sampling units (usually households)

Disadvantages:
• Decreased precision of estimate
• Calculation of p values and confidence limits is more complicated

Page 54: Sampling Methodology

Cluster sampling

EFSAs, CFSVAs almost always use cluster sampling techniques.

Also used in UNICEF MICS surveys (Multiple Indicator Cluster Surveys), DHS surveys, and almost all other large scale surveys.

Page 55: Sampling Methodology

What about sample size for a cluster survey?

Cluster samples, as we just saw, have the disadvantage of decreased precision of an estimate.

To counteract this effect, we must increase the sample size by a certain factor.
• This factor is referred to as the Design Effect, and is simply a number by which to multiply your calculated SRS sample size, to give the required sample size in a cluster sample that will have the same precision.

Page 56: Sampling Methodology

Cluster surveys

To calculate sample size for estimate of prevalence with 95% confidence limit, taking into account cluster sampling:

$$N = \text{DEFF} \times \frac{1.96^2 \times P(1-P)}{d^2}$$

• DEFF = Design effect
• 1.96 = Z value for p = 0.05 or 95% confidence limits
• P = Estimated prevalence
• d = Desired precision (for example, 0.05 for ± 5%)
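A minimal sketch of the cluster formula in Python follows. The function name, the default design effect of 2 and the rounding up are assumptions of this example rather than a prescribed implementation.

```python
import math

def cluster_sample_size(p, d, deff=2.0, z=1.96):
    """N = DEFF * z^2 * p * (1 - p) / d^2, rounded up."""
    if deff < 1:
        raise ValueError("design effect is expected to be at least 1")
    return math.ceil(deff * z ** 2 * p * (1 - p) / d ** 2)

# 50% prevalence, +/- 5 percentage points, design effect of 2
print(cluster_sample_size(0.5, 0.05))
# Same inputs with a design effect of 1, which reduces to the SRS formula
print(cluster_sample_size(0.5, 0.05, deff=1.0))
```

With these inputs the cluster design needs about twice as many households as a simple random sample for the same precision.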

Page 57: Sampling Methodology

Where do you get design effect to calculate sample size?

• Prior surveys
• Assessment of clustering of outcome in population
• Wild guess:
  • 2 is usually used as the default design effect.
  • Remember that this is a best guess of design effect
  • Overestimate design effect if uncertain

Page 58: Sampling Methodology

How design effects behave

Design effect increases with:
• Clustering of outcome in population
• Increasing size of clusters (fewer clusters with more households)

To minimize design effect:
• Include more clusters and decrease the size of clusters

Page 59: Sampling Methodology

Steps to conduct a cluster survey

1. Determine the number of clusters needed and the total number of households in each cluster
• Calculate total sample size (with design effect)
• Decide how many clusters and of which size
  • More clusters with fewer households each results in a smaller design effect
  • More clusters increase cost and time required
  • Fewer than 30 clusters with 12 or more households each results in a high design effect
  • But more than 30 clusters (reducing the number of households to fewer than 10 per cluster) doesn’t usually decrease design effect much
• Divide total sample size by number of clusters
• Revisit logistic constraints given cluster size

Page 60: Sampling Methodology

Cluster sample - steps

2. Determine what grouping of sampling units will be used as the primary sampling units (PSUs)
• Population groups, communities, villages, etc.

3. Select the required number of PSUs using probability proportional to size sampling

Page 61: Sampling Methodology

Probability proportional to size (PPS) sampling

Ensures that the probability of any single household or person in the target population being chosen is the same (when the same number of households is then sampled in each selected PSU)

Process is the same as systematic random sampling, BUT

Larger PSUs (communities/villages) have a higher probability of being chosen than smaller ones

For example: a village of 2000 people gets 10 numbers assigned, while a village of 200 people only gets 1 number....
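To see why PPS gives every household the same overall chance, the sketch below multiplies the two stages together using exact fractions. The six village sizes, the two clusters and the ten households per cluster are hypothetical numbers chosen for illustration, and the calculation assumes no village is so large that its selection probability would exceed 1.

```python
from fractions import Fraction

# Hypothetical sampling frame: village name -> number of households
village_sizes = {"A": 2000, "B": 200, "C": 800,
                 "D": 1000, "E": 1500, "F": 2500}
N_CLUSTERS = 2        # villages to select with PPS (assumption)
HH_PER_CLUSTER = 10   # households sampled in each selected village (assumption)

total = sum(village_sizes.values())

def selection_probabilities(size):
    """Return (P(village selected), P(household | village), overall P)."""
    p_village = Fraction(N_CLUSTERS * size, total)
    p_household = Fraction(HH_PER_CLUSTER, size)
    return p_village, p_household, p_village * p_household

for name, size in village_sizes.items():
    p_village, p_household, p_overall = selection_probabilities(size)
    print(f"village {name} ({size} households): "
          f"P(village) = {p_village}, P(household | village) = {p_household}, "
          f"overall = {p_overall}")
```

The village of 2000 is ten times as likely to be selected as the village of 200, but each of its households is ten times less likely to be picked once there, so the overall probability is identical for every household.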

Page 62: Sampling Methodology

Cluster sample - steps

4. Visit each selected PSU and select the required number of households
• SRS or systematic sampling is the preferred 2nd stage sampling method

Page 63: Sampling Methodology

Cluster sample example

• We are doing a national household survey in Mongolia. Sample size calculations (taking into account a design effect of 2) indicated that 450 households had to be sampled to be representative of the target population. We have decided to include 12 households per cluster, which means we would need to sample 38 clusters (450 / 12 = 37.5, rounded up).

• Sampling interval calculation: We want 38 clusters, so divide the total number of households in the sampling frame by the number of clusters:
  • 129,177 / 38 = 3399.39 ≈ 3399
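The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the figures are those given on the slide.

```python
import math

TOTAL_HOUSEHOLDS = 129_177   # households in the sampling frame
REQUIRED_SAMPLE = 450        # from the sample size calculation
HH_PER_CLUSTER = 12

# Round up so that the required sample size is reached
n_clusters = math.ceil(REQUIRED_SAMPLE / HH_PER_CLUSTER)
achieved_sample = n_clusters * HH_PER_CLUSTER
sampling_interval = round(TOTAL_HOUSEHOLDS / n_clusters)

print(n_clusters, achieved_sample, sampling_interval)
```

Rounding up to 38 clusters means 456 households are actually visited, slightly more than the 450 required.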

Page 64: Sampling Methodology

Cluster sample example

No.  Province     Village       Number of HHs  Cumulative number of HHs
1    Bayan-Ulgii  tsagaannuur   231            231
2    Bayan-Ulgii  nogoonnuur    912            1143
3    Bayan-Ulgii  ulgii         3099           4242
4    Bayan-Ulgii  altansogts    376            4618
5    Bayan-Ulgii  bugat         484            5102
6    Bayan-Ulgii  bayannuur     763            5865
7    Bayan-Ulgii  tolbo         672            6537
8    Bayan-Ulgii  deluun        1251           7788
9    Bayan-Ulgii  bulgan        816            8604
10   Uvs          bukhmoron     326            8930
11   Uvs          davst         248            9178
12   Uvs          ulaangom      2455           11633
13   Uvs          khovd         389            12022
14   Uvs          umnogobi      512            12534
15   Uvs          ulgii         438            12972
16   Uvs          erdeneburen   457            13429
17   khovd        khovd         464            13893
18   khovd        myangad       455            14348
19   khovd        buyat         425            14773
20   khovd        jargalant     2837           17610
21   khovd        dorgon        412            18022
22   khovd        chandmana     545            18567
23   khovd        darvi         356            18923
etc… etc…         etc…          etc…           etc…
184  dornod       matad         267            129177

Random number table

7678 2352 6959 1937 2554 6804 9098 4316 4318 2346 7276 1880 7136 9603 0163 3152 7000 2865 8357 4475 9804 0042 1106 7949 2932 9958 9582 2235 1140 1164 7841 1688 4097 8995 5030 1785 5420 0125 4953 1332 5540 6278 1584 4392 3258 1374 1617 7427
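As an illustration of how PPS selection walks down the cumulative column, the sketch below uses only the 23 villages listed on the slide and a sampling interval of 3399. The random start of 2352 is an assumption (the first entry in the random number table that does not exceed the interval); the slide does not say which start was actually used. Villages beyond row 23 are not listed, so the sketch stops there.

```python
from bisect import bisect_left
from itertools import accumulate

# (province, village, number of households) for the rows shown on the slide
villages = [
    ("Bayan-Ulgii", "tsagaannuur", 231), ("Bayan-Ulgii", "nogoonnuur", 912),
    ("Bayan-Ulgii", "ulgii", 3099), ("Bayan-Ulgii", "altansogts", 376),
    ("Bayan-Ulgii", "bugat", 484), ("Bayan-Ulgii", "bayannuur", 763),
    ("Bayan-Ulgii", "tolbo", 672), ("Bayan-Ulgii", "deluun", 1251),
    ("Bayan-Ulgii", "bulgan", 816), ("Uvs", "bukhmoron", 326),
    ("Uvs", "davst", 248), ("Uvs", "ulaangom", 2455),
    ("Uvs", "khovd", 389), ("Uvs", "umnogobi", 512),
    ("Uvs", "ulgii", 438), ("Uvs", "erdeneburen", 457),
    ("khovd", "khovd", 464), ("khovd", "myangad", 455),
    ("khovd", "buyat", 425), ("khovd", "jargalant", 2837),
    ("khovd", "dorgon", 412), ("khovd", "chandmana", 545),
    ("khovd", "darvi", 356),
]
cumulative = list(accumulate(size for _, _, size in villages))

def select_listed_psus(start, interval):
    """Return (target number, row number, village) for each selection
    that falls within the listed part of the sampling frame."""
    picks = []
    target = start
    while target <= cumulative[-1]:
        # First row whose cumulative count reaches the target number
        row = bisect_left(cumulative, target)
        picks.append((target, row + 1, villages[row][1]))
        target += interval
    return picks

for target, row_number, village in select_listed_psus(start=2352, interval=3399):
    print(target, row_number, village)
```

Under these assumptions the first selections fall in rows 3, 6, 11, 15 and 20. Large villages such as ulgii (3099 households) and jargalant (2837) cover a wide span of the cumulative column and are therefore more likely to be hit.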

Page 65: Sampling Methodology

Taking cluster sampling into account during analysis

As we learned when discussing cluster design, cluster sampling usually DECREASES the precision of your estimates.

BUT, SPSS assumes an SRS.
• So, when calculating statistical tests, standard deviations, variations, etc. (as you’ll talk about in the next sessions), the error will likely be UNDERestimated.

Page 66: Sampling Methodology

Accounting for cluster sampling in the analysis

How do we take this increased error into account?
1. Use the Complex Samples option in SPSS (only available on some versions, highly complicated)
2. Aggregate values by cluster, and analyze the cluster-level database.
3. Use as many clusters as possible, and thus smaller clusters, in each stratum to decrease the amount of this error, and account for it in analysis by remembering that your CIs are underestimated (too narrow).

Option 3 is usually used by WFP analysts, although option 1 may be utilized where strict methodologies exist (wasting prevalence, vitamin deficiencies, etc.).

Page 67: Sampling Methodology

Overview of Stratified Sampling

Members of a target population are put into similar groups: geography, sex, livelihood

Each group is called a stratum

Random, systematic, or cluster sampling is used to select a sufficient number of subjects in each stratum

Must know approximate population size in each stratum

Page 68: Sampling Methodology

Goals of stratified sampling

Overall objective: to define the target population by a certain characteristic (usually urban/rural status, gender, admin units, etc.)

There are three primary reasons for stratification:

1. Stratifying increases precision of the survey by selecting homogeneous sub-groups (a priori or post hoc) and accounting for these groups in the analysis of results for the whole group.

2. Stratification ensures a better sample

3. A more common goal is to split a sampling universe into smaller groups or universes, with the goal of providing results for each of these groups, or strata.
• In CFSVAs, this is commonly done when creating a sampling design.

Page 69: Sampling Methodology

Types of stratification

Proportionate stratification
• If the number of people/households sampled per stratum is proportional to the population in each stratum

Disproportionate stratification
• If the number of people/households sampled per stratum is independent of the population in each stratum

Page 70: Sampling Methodology

Stratified sampling: 3 distinct livelihood zones

[Figure: Country X divided into three strata]
• Urban area; N = 20
• Pastoral zone; N = 13
• Agricultural zone; N = 18

Page 71: Sampling Methodology

Proportionate stratification

[Figure: Country X with 12 sampled, 8 sampled and 10 sampled in the three zones]
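The allocation shown in the figure (12, 8 and 10) can be reproduced by sharing a sample of 30 in proportion to the stratum sizes of 20, 13 and 18. The slide does not say how fractions were rounded; the sketch below assumes a largest-remainder rule, and it assumes the three sample counts correspond to the urban, pastoral and agricultural zones in that order.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Allocate total_sample across strata in proportion to their size,
    using largest-remainder rounding (an assumption of this sketch)."""
    total_population = sum(stratum_sizes.values())
    allocation, remainders = {}, {}
    for name, size in stratum_sizes.items():
        # Integer share and leftover numerator of size * n / total
        allocation[name], remainders[name] = divmod(size * total_sample,
                                                    total_population)
    leftover = total_sample - sum(allocation.values())
    # Hand the remaining units to the strata with the largest remainders
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation

zones = {"Urban area": 20, "Pastoral zone": 13, "Agricultural zone": 18}
print(proportionate_allocation(zones, 30))
```

The exact proportional shares are about 11.8, 7.6 and 10.6, so simple rounding would give 31 units; the largest-remainder rule keeps the total at 30.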

Page 72: Sampling Methodology

Disproportionate stratification

[Figure: Country X with 10 sampled in each of the three zones]
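With equal samples from unequal strata, the overall estimate needs weighting (see "Failure to weight" under sampling bias). The sketch below uses the stratum sizes of 20, 13 and 18 with 10 units sampled in each; the stratum prevalences are hypothetical values invented for the illustration, and the weight is taken as stratum population divided by stratum sample.

```python
stratum_population = {"Urban area": 20, "Pastoral zone": 13, "Agricultural zone": 18}
stratum_sample = {name: 10 for name in stratum_population}

# Hypothetical prevalence observed in each stratum sample (not from the slides)
stratum_prevalence = {"Urban area": 0.2, "Pastoral zone": 0.6, "Agricultural zone": 0.4}

# Sampling weight: how many population units each sampled unit represents
weights = {name: stratum_population[name] / stratum_sample[name]
           for name in stratum_population}

def weighted_prevalence():
    numerator = sum(weights[s] * stratum_sample[s] * stratum_prevalence[s]
                    for s in weights)
    denominator = sum(weights[s] * stratum_sample[s] for s in weights)
    return numerator / denominator

def unweighted_prevalence():
    cases = sum(stratum_sample[s] * stratum_prevalence[s] for s in stratum_sample)
    return cases / sum(stratum_sample.values())

print(weights)
print(f"weighted: {weighted_prevalence():.3f}, unweighted: {unweighted_prevalence():.3f}")
```

In this made-up case the unweighted figure overstates the overall prevalence because the small pastoral stratum, which has the highest prevalence, is over-represented in the sample.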

Page 73: Sampling Methodology

Disproportionate sampling

When several strata are identified during the survey design phase, with the goal of providing final estimates for these strata, this amounts to doing several individual surveys, one for each stratum.
• Sample size increases in direct proportion to the number of strata that will be analyzed and reported on!

Page 74: Sampling Methodology

Complex sample designs

Complex samples simply use a combination of techniques. For example:
• A country is stratified into 5 provinces.
• A cluster sample is used to select 30 villages in each of the strata.
• A systematic random sample is taken in each village to select 15 households.

This design yields 5 × 30 × 15 = 2,250 households in total.

Page 75: Sampling Methodology

Recommendations and rules of thumb

• Disproportionate sampling is usually used for strata you want to report on (provinces); proportionate sampling for strata that are there to have a better sample (e.g. districts, which we will not report on)
• CFSVAs usually have a complex sample design (cluster and stratified)
• Sampling is usually done as a 2 stage cluster (village, household).
• In any area of estimation, about 250 to 300 households are needed.
• Clusters are typically 10 households each, giving 25 to 30 clusters per area of estimation.
• In reporting, an assumed design effect of 2 is used (this is actually based on research of key food security indicators).

Page 76: Sampling Methodology

Recommendations and rules of thumb

• Where logistics allow, more households per cluster may be sampled (maintaining the 25 to 30 clusters, and thus increasing the overall sample size).
  • For example, when teams visit one cluster per day and can accomplish 10 or 12 households without considerable extra effort/cost, the number of households per cluster is increased.
• 95% confidence intervals are always used, with an estimated 50% prevalence of the indicator. With the design effect and the sample size, this gives a 95% CI of about ± 8 percentage points for estimates of 50% prevalence.
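The "about 8 percentage points" rule of thumb can be checked by rearranging the cluster sample size formula to solve for the precision d. The function name is an assumption of this sketch; the inputs (50% prevalence, design effect of 2, 250 to 300 households) are the ones quoted in the recommendations.

```python
import math

def margin_of_error(n, p=0.5, deff=2.0, z=1.96):
    """Half-width of the confidence interval: d = z * sqrt(DEFF * p * (1 - p) / n)."""
    return z * math.sqrt(deff * p * (1 - p) / n)

for n in (250, 300):
    print(f"n = {n}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

With 300 households the margin is very close to ± 8 percentage points, and with 250 households it is a little under ± 9.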