Precision and Validity: Information Bias. Dr. Jørn Olsen, Epi 200B, January 21 and 26, 2010.

1

Precision and Validity: Information Bias

Dr. Jørn Olsen, Epi 200B

January 21 and 26, 2010

2

Bias and confounding (Last, Dictionary)

Bias: Deviation of results or inference from truth, or processes leading to such deviations. Any trend in the collection, analysis, interpretation, publication, or review of data that can lead to conclusions that are systematically different from the truth.

3

Bias and confounding (Last, Dictionary)

Confounding: A situation in which the effects of two processes are not separated.

Confounder, confounding factor, confounding variable: poor terms, since confounding is study specific. No variable is always a confounder.

Dictionary; IEA/Last:

Information bias (observational bias):

A flaw in measuring exposure or outcome data that results in different quality (accuracy) of information between comparison groups.

5

Information Bias and Other Method Problems

Information: exposures, end points, confounders, modifiers

For discrete variables: classification error/misclassification

Differential/non-differential information bias

6

Data accuracy

Data are almost never 100% accurate

Coding errors, measurement errors

We ask questions that cannot be answered correctly, e.g. exposure to ETS (environmental tobacco smoke) last year

7

Non-differential – does not depend upon the value of other variables

Example: diagnosing has the same sensitivity and specificity among exposed and non-exposed. Or, exposure is reported with the same sensitivity and specificity among cases and controls.

Non-differential misclassification is preferable to differential misclassification

Non-differential misclassification can often be achieved in follow-up studies

Exposures are recorded prior to disease occurrence

Diseases may be recorded by doctors who do not ask about exposures

9

Recall bias: misclassification of the exposure

A serious problem in case-control studies or cross-sectional studies based upon recall

10

Recall bias

Hungarian case-control surveillance of congenital abnormalities (Epidemiology 2001; 12: 461-66.)

Drug use was measured in two ways:
self-reported data (interview, memory aids)
log-book: medicine prescribed by ANC doctors

              Self-reported drug use
Log-book      Yes       No
Yes           a         b
No            c         d

Sensitivity = a/(a+c)

Specificity = d/(b+d)

11

A low sensitivity is expected if mothers provide a complete recall since only ANC prescribed drugs are in the log book.

12

Short-term drugs

Case status Sensitivity Specificity

All cases 0.16 0.98

Severe 0.21 0.98

Visible 0.18 0.98

Controls 0.28 0.98

13

Long-term drugs

Case status Sensitivity Specificity

All 0.25 0.97

Severe 0.16 0.95

Visible 0.29 0.97

Controls 0.46 0.97

14

What to do to reduce differential information bias?

Use blinding if possible: "blind till it hurts" (Cochrane).

Use of hospital controls may, in some cases, help to reduce information bias.

The disease used to identify the comparison group must NOT be associated with the exposure under study (must not be a cause or a preventive factor).

15

For case-control studies

First study is important
No disclosure of study hypothesis
Use biomarkers of exposure if possible
Use secondary data collected prior to the disease
Use neutral interviewers

16

Differential misclassification of the endpoint: sometimes a problem in follow-up studies

17

Is this follow-up study vulnerable to differential misclassification of DVT?

Exposure    DVT    Obs time
OC +        a      t+
OC -        c      t-

18

Follow-up studies are usually less vulnerable to differential recall bias because the exposure is recorded before the end point, but knowing the hypothesis may introduce bias if the exposure is a suspected cause of the disease under study.

Blind the clinicians, if possible.

19

It is often stated that non-differential misclassification leads to bias towards no association (RR = IRR = OR = 1, RD = IRD = 0)

The first argument for this was provided by Bross in the 1950s.

Non-differential misclassification is not the same as random misclassification (random is only non-differential in the long run).

Random misclassification (blinding) can be very differential by chance in a small study.

20

                         True smoking
Group     Recorded       +        -
          smoking
Lung c    +              TPl      FPl
          -              FNl      TNl
Ref       +              TPr      FPr
          -              FNr      TNr

P = proportion of smokers (Pl and Pr); l = lung cancer, r = reference

21

TP = P x sens

FN = P x (1-sens)

FP = (1-P) (1-spec)

TN = (1-P) spec

If we take an interest in the difference between Pl and Pr: D = Pl – Pr

(normally we would take an interest in, for example, the exposure odds)

23

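We are only able to estimate $\hat{P}_l$ and $\hat{P}_r$, and then $\hat{D} = \hat{P}_l - \hat{P}_r$. In case of non-differential misclassification, $FP_l = FP_r = FP$ and $FN_l = FN_r = FN$.

$\hat{P}_l = P_l \cdot TP_l + (1 - P_l) \cdot FP_l$

$\hat{P}_r = P_r \cdot TP_r + (1 - P_r) \cdot FP_r$

$\hat{D} = \hat{P}_l - \hat{P}_r$

In these equations TP, FP and FN are used as rates: TP = sens, FN = 1 – sens, FP = 1 – spec.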

24

Then $\hat{D} = D \, (1 - (FN + FP))$ (check it out)

Meaning $\hat{D} \neq D$ if FN or FP ≠ 0 (sens + spec < 2)

FN + FP < 1.0: $|\hat{D}| < |D|$ (but same sign)

FP + FN = 1.0: $\hat{D} = 0$ (like flipping a coin)

FN + FP = 2: $\hat{D} = -D$ (coding!)

Also true for ORs
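The attenuation result can be checked numerically. The following is a minimal sketch, not part of the lecture: the true smoking proportions (0.8 and 0.5) are hypothetical, and FN and FP are taken as rates (FN = 1 – sens, FP = 1 – spec).

```python
def observed_proportion(p, sens, spec):
    """Recorded proportion of smokers when the true proportion is p."""
    return p * sens + (1 - p) * (1 - spec)


def observed_difference(p_l, p_r, sens, spec):
    """Recorded difference D-hat under non-differential misclassification
    (the same sens and spec apply to both groups)."""
    return observed_proportion(p_l, sens, spec) - observed_proportion(p_r, sens, spec)


# Hypothetical true proportions of smokers (assumption, not from the slides).
p_lung, p_ref = 0.8, 0.5
true_d = p_lung - p_ref

for sens, spec in [(0.9, 0.8), (0.5, 0.5), (0.0, 0.0)]:
    fn, fp = 1 - sens, 1 - spec
    d_hat = observed_difference(p_lung, p_ref, sens, spec)
    predicted = true_d * (1 - (fn + fp))
    print(f"FN+FP={fn + fp:.1f}  D_hat={d_hat:.3f}  D(1-(FN+FP))={predicted:.3f}")
```

The three settings correspond to the three cases listed above: attenuation with the same sign, no association at all, and a sign flip when every record is miscoded.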

25

Non-differential misclassification of a dichotomous variable will, in most cases, bias values towards no association (but there are other sources of error in a study, and the combined effect may be away from the null).

Non-differential misclassification of a variable with more than two categories can cause bias away from the null, but mainly in rather unusual situations.

Misclassification of a confounder can cause bias in any direction.

26

When estimating relative effect measures a high specificity is wanted.

True cohort data

Exp    N         D      Not D     RR
+      20,000    400    19,600    2.0
-      10,000    100    9,900

If sensitivity is 0.8 but specificity is 1

Exp    N         D      RR
+      20,000    320    2.0
-      10,000    80

27

If sensitivity is 1 but specificity is 0.80

Exp    N         D                      RR
+      20,000    400 + 3920 = 4320      1.04
-      10,000    100 + 1980 = 2080

28

If sensitivity is 0.8 and specificity is 0.9

Exp    N         D                                    RR
+      20,000    400 x 0.8 + 19600 x 0.10 = 2280      1.07
-      10,000    100 x 0.8 + 9900 x 0.10 = 1070
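The three misclassified cohort tables can be reproduced with a few lines of code. This is a sketch under the assumption that sensitivity and specificity apply equally to exposed and unexposed (non-differential); the function names are illustrative only.

```python
def observed_cases(diseased, non_diseased, sens, spec):
    """Number recorded as diseased: true positives plus false positives."""
    return diseased * sens + non_diseased * (1 - spec)


def misclassified_rr(sens, spec):
    """Recorded cases and RR for the cohort in the slides
    (400 of 20,000 exposed and 100 of 10,000 unexposed truly diseased)."""
    cases_exp = observed_cases(400, 19_600, sens, spec)
    cases_unexp = observed_cases(100, 9_900, sens, spec)
    rr = (cases_exp / 20_000) / (cases_unexp / 10_000)
    return cases_exp, cases_unexp, rr


for sens, spec in [(1.0, 1.0), (0.8, 1.0), (1.0, 0.8), (0.8, 0.9)]:
    cases_exp, cases_unexp, rr = misclassified_rr(sens, spec)
    print(f"sens={sens} spec={spec}: D+={cases_exp:.0f} D-={cases_unexp:.0f} RR={rr:.2f}")
```

With perfect specificity the RR is unchanged because both numerators shrink by the same factor; a modest loss of specificity adds many false positives from the large non-diseased groups and pulls the RR towards 1.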

29

The corresponding case-cohort studies would produce the following (similar) results, provided the study is done correctly as a case-cohort study.

Exp    Cases    Controls    OR
+      400      333.33      2.0
-      100      166.67
All    500      500

30

The corresponding case-cohort studies would produce the following (similar) results (sensitivity 0.8, specificity 1)

Exp    Cases    Controls    OR
+      320      266.67      2.0
-      80       133.33
All    400      400

31


(sensitivity 1, specificity 0.80)

Exp    Cases    Controls    OR
+      4320     4266.67     1.04
-      2080     2133.33
All    6400     6400

32


(sensitivity 0.8, specificity 0.9)

Exp    Cases    Controls    OR
+      2280     2233        1.07
-      1070     1117
All    3350     3350

33

If we get a reference pathologist to eliminate all FP cases, we would get (for the last table)

Exp    Cases                  Controls    OR
+      2280 – 1960 = 320      266.67      2.0
-      1070 – 990 = 80        133.33
All    400                    400
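The case-cohort odds ratios above follow from sampling controls from the whole cohort, where two thirds are exposed (20,000 of 30,000). The sketch below assumes, as in the tables, that as many controls as cases are drawn and that they reflect the cohort's exposure distribution exactly; the function name is illustrative.

```python
def case_cohort_or(cases_exp, cases_unexp, exposed_fraction=20_000 / 30_000):
    """OR with controls sampled from the full cohort, one control per case."""
    n_controls = cases_exp + cases_unexp
    controls_exp = n_controls * exposed_fraction
    controls_unexp = n_controls * (1 - exposed_fraction)
    return (cases_exp / cases_unexp) / (controls_exp / controls_unexp)


scenarios = {
    "true data": (400, 100),
    "sens 0.8, spec 1": (320, 80),
    "sens 1, spec 0.8": (4320, 2080),
    "sens 0.8, spec 0.9": (2280, 1070),
    "sens 0.8, spec 0.9, FP removed": (2280 - 1960, 1070 - 990),
}
for label, (cases_exp, cases_unexp) in scenarios.items():
    print(f"{label}: OR = {case_cohort_or(cases_exp, cases_unexp):.2f}")
```

Removing the false positives restores the odds ratio even though one in five true cases is still missed, which is the point of the reference pathologist example.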

34

Adjusting for misclassification is possible if sens and spec are known

Diagn    D+             D-                  All
+        P x sens       (1-P)(1-spec)       $\hat{P}$
-        P(1-sens)      (1-P) spec          $1 - \hat{P}$
All      P              1-P

$\hat{P} = P \cdot sens + (1 - P)(1 - spec)$

$\hat{P} = P \cdot sens + 1 - spec - P + P \cdot spec$

$\hat{P} - 1 + spec = P \, (sens + spec - 1)$

$P = (\hat{P} + spec - 1) / (sens + spec - 1)$

36

Example

sens = 0.44, spec = 0.94; based upon comparison with the "gold standard" (clinical diagnosing)

Sex Questionnaire – bronchitis

+ - All

M 350 1427 1777

F 277 1787 2064

RP = (350/1777) / (277/2064) = 1.47

37

Exp P (M) = (350/1777 + 0.94 – 1) / (0.44 + 0.94 – 1) = 0.360 (640 with the disease)

Exp P (F) = (277/2064 + 0.94 – 1) / (0.44 + 0.94 – 1) = 0.195 (403 with the disease)

RP = (640/1777) / (403/2064) = 1.85

In case of differential misclassification, use sex-specific sens and spec
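The bronchitis correction can be written as a small function. This is a minimal sketch of the formula P = (P̂ + spec – 1) / (sens + spec – 1); the function name and the guard against sens + spec = 1 are additions for illustration.

```python
def corrected_prevalence(observed, sens, spec):
    """Back-calculate the true prevalence from the recorded prevalence."""
    denominator = sens + spec - 1
    if denominator == 0:
        # sens + spec = 1 means the measurement carries no information.
        raise ValueError("sens + spec must differ from 1")
    return (observed + spec - 1) / denominator


sens, spec = 0.44, 0.94
observed_m, observed_f = 350 / 1777, 277 / 2064

p_m = corrected_prevalence(observed_m, sens, spec)
p_f = corrected_prevalence(observed_f, sens, spec)

print(f"Recorded RP:  {observed_m / observed_f:.2f}")
print(f"Corrected P (M) = {p_m:.3f}, about {p_m * 1777:.0f} with the disease")
print(f"Corrected P (F) = {p_f:.3f}, about {p_f * 2064:.0f} with the disease")
print(f"Corrected RP: {p_m / p_f:.2f}")
```

For differential misclassification the same function could be called with sex-specific values of sens and spec.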

38

Misclassification of a confounder may bias a result in any direction (Greenland & Robins. Am J Epidemiol 1985;122:495-506).

Let this be the true data:

39

E    C    Cases    Controls    OR
+    +    100      200         2.0
+    -    25       100
-    +    20       40          2.0
-    -    100      400

The confounder has an effect (OR=2)

The exposure has no effect (OR=1)

40

Now assume exposure and disease status are recorded without error. Only the confounder is non-differentially misclassified (sens=0.8 and spec=0.9). We then get:

E    C    Cases    Controls    OR
+    +    82.5     170         1.48
+    -    42.5     130
-    +    26       72          1.41
-    -    94       368

41

When stratifying on the confounder

True data

C    E    Cases    Controls    OR
+    +    100      200         1.0
+    -    20       40
-    +    25       100         1.0
-    -    100      400

42

Misclassified data

C    E    Cases    Controls    OR
+    +    82.5     170         1.34
+    -    26       72
-    +    42.5     130         1.28
-    -    94       368
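The misclassified confounder tables can be regenerated from the true data. This sketch assumes that only the confounder is misclassified (sens = 0.8, spec = 0.9, the same for cases and controls and for exposed and unexposed); the data layout and function names are illustrative. It prints the exposure odds ratios within strata of the recorded confounder rather than asserting them.

```python
def misclassify(c_pos, c_neg, sens=0.8, spec=0.9):
    """Recorded (C+, C-) counts given true (C+, C-) counts."""
    recorded_pos = c_pos * sens + c_neg * (1 - spec)
    recorded_neg = c_pos * (1 - sens) + c_neg * spec
    return recorded_pos, recorded_neg


def odds_ratio(cases_index, cases_ref, controls_index, controls_ref):
    return (cases_index * controls_ref) / (cases_ref * controls_index)


# True (C+, C-) counts by exposure and case status, taken from the slides.
true_counts = {
    ("E+", "cases"): (100, 25), ("E+", "controls"): (200, 100),
    ("E-", "cases"): (20, 100), ("E-", "controls"): (40, 400),
}
recorded = {key: misclassify(*counts) for key, counts in true_counts.items()}

# Confounder-disease OR within each exposure level (true value 2.0).
for e in ("E+", "E-"):
    (ca_pos, ca_neg), (co_pos, co_neg) = recorded[(e, "cases")], recorded[(e, "controls")]
    print(f"{e}: confounder OR = {odds_ratio(ca_pos, ca_neg, co_pos, co_neg):.2f}")

# Exposure-disease OR within each recorded confounder stratum (true value 1.0).
for idx, label in ((0, "C+"), (1, "C-")):
    exposure_or = odds_ratio(
        recorded[("E+", "cases")][idx], recorded[("E-", "cases")][idx],
        recorded[("E+", "controls")][idx], recorded[("E-", "controls")][idx],
    )
    print(f"{label}: exposure OR = {exposure_or:.2f}")
```

Stratifying on the recorded confounder leaves residual confounding, so the exposure appears to have an effect in both strata although its true odds ratio is 1.0.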

43

Misclassification is likely if we ask for sensitive data (alcohol intake), if we ask for data that cannot be easily recalled (diet), if the relevant time window is short (teratology), if we give too little attention to the data collection, or perhaps if we give too much attention to it.

44

Regression towards the mean. Misclassification for a group of people because we oversample large random errors. This selection leads to misclassification.

$\widehat{IQ} = IQ + \varepsilon$ (measured IQ = true IQ + measurement error)

Σε = 0 for all in the study but not for those selected from extreme parts of the distribution (Σε > 0 for those selected at the upper end). Their measured IQs may be unusual because their IQs are unusual or because their measurement errors were large, or both. In a new round of measuring IQ one would expect Σε to be zero (at least closer to 0).
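A small simulation illustrates the argument. All numbers are assumptions chosen for illustration (true IQ with mean 100 and SD 15, measurement error with SD 5, selection of those measured at 130 or above); none of them come from the lecture.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

N = 100_000
true_iq = [random.gauss(100, 15) for _ in range(N)]
first_test = [iq + random.gauss(0, 5) for iq in true_iq]   # measured = true + error
second_test = [iq + random.gauss(0, 5) for iq in true_iq]  # independent new errors

# Select people from the upper extreme of the first measurement.
selected = [i for i in range(N) if first_test[i] >= 130]

mean_error_first = mean(first_test[i] - true_iq[i] for i in selected)
mean_error_retest = mean(second_test[i] - true_iq[i] for i in selected)

print(f"selected: {len(selected)} of {N}")
print(f"mean error at selection: {mean_error_first:.2f}")
print(f"mean error at retest:    {mean_error_retest:.2f}")
print(f"mean measured IQ: first {mean(first_test[i] for i in selected):.1f}, "
      f"retest {mean(second_test[i] for i in selected):.1f}")
```

Among those selected, the average error is clearly positive at the first measurement and close to zero at the retest, so their average measured IQ moves back towards the population mean.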

45

Regression towards the mean comes in many different forms. Assume you want to predict PTB and collect data on a number of potential risk factors.

You select those who have the highest RR and claim you can predict 60% of PTB using these markers. When you apply these ‘predictors’ in a new data source, you are in for a disappointment, why?

46

Misclassification has an impact on estimates of effect sizes and power

A smaller study with better quality data may be preferable to a large study with poor quality data

Use blinding to avoid differential misclassification

Estimate misclassification/repeated measures

47

Capture – recapture to estimate completeness of recording (the degree of underreporting).

If you have two different data sources (parental reporting of febrile seizures and hospitalizations for febrile seizures), you may be able to estimate the actual coverage of each data source.

48

The argument comes from biologists and goes like this: You want to know the number of salmon in a given lake. You can empty the lake and count all the salmon, or:

1. You catch some salmon (M1) in the lake and give them a mark and throw them back into the lake

2. You make another catch of salmon (M2) and note how many had the mark, i.e. were also caught in the first catch (M3)

3. Now you know M1, M2 and M3 and you are ready to estimate the total number of salmon in the lake, N.

49

$P_1$ (first catch) $= M_1 / N$

$P_2$ (second catch) $= M_2 / N$

$M_3 = N \times P_1 \times P_2 = N \times \frac{M_1}{N} \times \frac{M_2}{N} = \frac{M_1 \times M_2}{N}$

$N = \frac{M_1 \times M_2}{M_3}$

50

Say, in our study, we had parental reports for 100 children with FS and 75 hospital reports.

Our estimate of the total number of children with FS in the study would be (if 50 were registered with FS both places)

(100 x 75)/50 = 150
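The same calculation as a short function. This two-source estimator is usually called the Lincoln-Petersen estimator in the capture-recapture literature (a name not used in the slides); the sketch assumes the two data sources are independent, and the function name is illustrative.

```python
def capture_recapture_total(m1, m2, m3):
    """Estimate the total N from two sources: N = M1 x M2 / M3.

    m1, m2: numbers recorded in source 1 and source 2
    m3: number recorded in both sources
    """
    if m3 <= 0:
        raise ValueError("the estimate needs at least one case found in both sources")
    return m1 * m2 / m3


parental, hospital, both = 100, 75, 50
total = capture_recapture_total(parental, hospital, both)

print(f"Estimated children with FS: {total:.0f}")
print(f"Coverage of parental reports: {parental / total:.2f}")
print(f"Coverage of hospital records: {hospital / total:.2f}")
```

The coverage figures are the estimated completeness of each source, that is, the share of all children with FS that each source records.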

51

Other Problems

52

In cross sectional studies, we do not know what came first

CVD – anxiety, stress, high blood pressure

But temporal ambiguity may also exist in longitudinal studies

53

Many diseases have a long preclinical phase before they are diagnosed. If the disease has an impact on the exposure (E) during the preclinical phase, reverse causation may be a problem. Example: exposure to selenium and breast cancer.

54

Repeated events like in reproductive epidemiology may produce other problems.

55

Example from reproductive epidemiology (Howard et al. Epidemiology 2007;18:544-51)

Women often have more than one child, and reproductive failures often repeat themselves.

Reproductive failures may impact exposure. Example: women who smoke and have a child with CA may stop smoking when they plan a new pregnancy. How to analyze the data?

DAG 1: No adjustment for O0 is needed when analyzing E1→O1

56

DAG 2

Now a backdoor path: E1←E0→O0→O1

Adjust for E0 or O0

57

DAG 3

Now 2 backdoor paths: E1←E0→O0→O1 and E1←O0→O1. Adjusting for O0 blocks both paths.

58

DAG 4A

59

DAG 4B

60

Add covariates: Ca, which causes the exposures, and Cb, which causes the endpoints

Including O0 opens the path E1←Ca→E0→O0←Cb→O1, because O0 is a collider on it; adding Ca, E0 and Cb solves the problem

61

DAG 5

62

Now 2 backdoor paths from E1 to O1: E1←O0←Cb→O1 and E1←Ca→E0→O1, and O0 is a collider

Ca and Cb would control this path

63

Studies on diseases that are part of a screening program

No protective effect of fruit and vegetables on breast cancer. The study did not take screening into consideration.

If women who like fruits and vegetables more often take part in screening, and screening is not considered in the analysis:

Bias in the early phase of screening?

Bias under steady state?

And if this had been colon cancer?

64

The ecological fallacy at the individual level

Many exposures come in packages – diet, air pollution, welding fume, coffee

Often, measurements are made at the aggregated level – carrots, coffee, etc. (more than just B-carotene and caffeine)

65

Conclusion

Make data as accurate as possible; this is also true for confounders.

Avoid differential misclassification (blinding)

Estimate sensitivity and specificity of key variables if possible

Avoid low specificity when measuring ratios (RR, IRR, OR)

Do sensitivity analyses
