Introduction to statistics for epidemiology and public health

ARISE partnership July 1st 2020

Paul Milligan, Faculty of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine

Aims of this session

• To appreciate the importance of statistics in epidemiology and public health

• To understand:

• The measures we use to quantify disease burden

• The measures used to quantify effects of risk factors and interventions

• How confidence intervals are used to express the degree of uncertainty in our estimates of these quantities

• The distinction between statistical significance and public health importance

• The characteristics that guide the choice of diagnostic tests for assessing disease

• The importance of data quality and steps to ensure data quality

Florence Nightingale

Used statistical data to show that more soldiers died from preventable diseases than from injuries, and that hygiene reforms reduced mortality.

https://en.wikipedia.org/wiki/File:Florence_Nightingale_(H_Hering_NPG_x82368).jpg

John Snow showed that cholera is a water-borne disease

The September 1854 cholera outbreak was centered in the Soho district, close to Snow's house.

Snow mapped the 13 public wells and all the known cholera deaths around Soho, and noted the spatial clustering of cases around one particular water pump on the southwest corner of the intersection of Broad (now Broadwick) Street and Cambridge (now Lexington) Street.

Despite strong scepticism from the local authorities (the prevailing view at the time was that cholera was spread by “miasma” in the air), he had the handle removed from the Broad Street pump, and the outbreak quickly subsided.

Snow’s map, showing each death as a bar, conveyed a clear message.

Analysis of data on the number of cases of cholera in London, in relation to water supply:

Water supply                            Households   Cholera cases in 1854   Rate per 1000 households per year
Southwark and Vauxhall Water Company        40,046                   1,263                                31.5
Lambeth Water Company                       26,107                      98                                 3.8
Rest of London                             256,423                   1,422                                 5.5
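The rates in this table can be checked directly from the household and case counts. The sketch below is only an illustration of that arithmetic; the names `water_supply` and `rate_per_1000` are invented here and are not part of the original slides.

```python
# Counts copied from the table above: (households, cholera cases in 1854).
water_supply = {
    "Southwark and Vauxhall Water Company": (40_046, 1_263),
    "Lambeth Water Company": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}


def rate_per_1000(cases: int, households: int) -> float:
    """Cases per 1000 households over the period covered by the counts."""
    if households <= 0:
        raise ValueError("households must be positive")
    return 1000 * cases / households


# Rate for each water supply, keyed by supplier name.
rates = {
    name: rate_per_1000(cases, households)
    for name, (households, cases) in water_supply.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 1000 households")
```

Dividing by the number of households is what makes the three groups comparable: the rest of London had the most cases, but a far lower rate than households supplied by Southwark and Vauxhall.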

https://en.wikipedia.org/wiki/File:John_Snow.jpg

Timeline of some key developments in statistics for epidemiology

• 1930s: Statistical inference, experimental design, confidence intervals
• 1950s: Survey sampling
• 1970s/80s: Modern case-control studies; bootstrap; survival analysis
• 1990s: Causal inference
• 2000-present: Electronic data capture

Data quality control

Many problems can be avoided by:

1. well-designed questionnaires that have been piloted
2. electronic data capture where possible
3. supervision
4. checking for completeness promptly
5. well-designed databases, relational structure, unique IDs, drop-down lists
6. entry-time range checks
7. attention to dates and date formats
8. inspection of datasets at regular intervals as they accumulate
9. active involvement of the PI in data QC

• Measures of disease and effect
• Characteristics of diagnostic tests
• Statistical uncertainty: confidence intervals

Questions: What is the burden of disease and ill health? What are the causes? How can we intervene?

How can the disease burden be quantified, summarised, measured? How can we measure the effects of possible causes? How can we measure the impact of interventions?

Measures of disease burden

• How many people have the disease now?
• How many cases of the disease have there been in the last year?
• How much time have people spent being sick?
• How many people are now affected by consequences of the disease?
• How has their quality of life been affected?
• How many deaths have been caused by the disease in the last year?

Standardized measures that can be compared between populations

Measures of disease and ill health:
• Prevalence
• Incidence
• Risk
• (Odds)
• Disability-adjusted life years; quality of life

Measures of effect:
• Ratio measures (strength of association with causes)
• Difference measures (impact of interventions)

Example 1

• The number of people who were smear-positive for TB in a recent community-based survey of a country was:
  Urban areas 150
  Rural areas 15
• How can we interpret these results?

Example 1

Area    No. of cases   Sample size   Prevalence   Prevalence per 100,000
Urban            150        50,000       0.003                       300
Rural             15        25,000       0.0006                       60

Prevalence = (No. of cases) / (No. of people in the population), at a particular point in time
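As a minimal sketch of the prevalence calculation in Example 1, the snippet below recomputes the two prevalences from the survey counts. The function name `prevalence` and the dictionary `survey` are illustrative choices, not something defined in the slides.

```python
def prevalence(cases: int, sample_size: int) -> float:
    """Proportion of the sampled population with the disease at the time of the survey."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return cases / sample_size


# Counts from Example 1: (smear-positive cases, people surveyed).
survey = {"Urban": (150, 50_000), "Rural": (15, 25_000)}

for area, (cases, n) in survey.items():
    p = prevalence(cases, n)
    print(f"{area}: prevalence {p:.4f} = {p * 100_000:.0f} per 100,000")
```

The raw counts differ tenfold (150 against 15), but once the sample sizes are taken into account the urban prevalence is five times the rural one.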

Example 2

The number of new cases of malaria diagnosed was reported for three districts:
• District A 30 cases
• District B 10 cases
• District C 45 cases

Before these counts can be interpreted, we need to know:
• What is the population in each district?
• Over what time period were cases documented?
• How were the disease cases defined?
• How were cases detected?

Incidence

District   Cases   Population   Period                  Years   Person-years   Rate     Rate per 1000 person-years
A             30       10,324   Jan 2018 to Dec 2018        1         10,324   0.0029                          2.9
B             10        4,210   Jan 2018 to Dec 2019        2          8,420   0.0012                          1.2
C             45       14,540   July 2019 to Dec 2019     0.5          7,270   0.0062                          6.2

If rates are constant: 10 people followed for 1 year = 1 person followed for 10 years = 5 people followed for 2 years = 10 person years
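The incidence table can be reproduced with a short calculation. This is a sketch under the same simplification the table uses, namely that person-years are the district population multiplied by the length of the observation period. The `District` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class District:
    name: str
    cases: int
    population: int
    years: float  # length of the observation period in years

    @property
    def person_years(self) -> float:
        # Assumes the whole population was under observation for the full period.
        return self.population * self.years

    @property
    def rate_per_1000(self) -> float:
        """Incidence rate per 1000 person-years."""
        return 1000 * self.cases / self.person_years


# Values taken from the incidence table above.
districts = [
    District("A", 30, 10_324, 1.0),
    District("B", 10, 4_210, 2.0),
    District("C", 45, 14_540, 0.5),
]

for d in districts:
    print(f"District {d.name}: {d.person_years:,.0f} person-years, "
          f"{d.rate_per_1000:.1f} per 1000 person-years")
```

Expressing every district in person-years puts the one-year, two-year and six-month observation periods on a common footing.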

Risk

The probability of getting the disease during a specific period of time:

Risk = (No. of new cases) / (No. without the disease at the start of the time period)

[Figure: N people free of disease at the start, followed over time T, during which d of them get the disease]

T is the duration of the study
N is the number free of disease at the start
d is the number of people who get the disease during time T

Risk during time T: d/N

Follow-up studies are needed to estimate risks

Prevalence, risk and incidence rates

• Prevalence: the probability that a person chosen at random from the population has the disease
• Risk: the probability that a person who does not have the disease will get the disease in a specified period of time
• Epidemiologists often work with incidence rates rather than risks, because the risk always has to be qualified by specifying the time period.
• A further advantage of rates is that we can include multiple events for each person.

Odds

• Odds: a way of expressing risks and proportions. For example, if 1 person in 4 has the disease:
  – Prevalence 1/4
  – Odds 1:3 = 1/3

• What are the odds of disease here?

  Prevalence   Odds
  1/2          1
  1/3          0.5
  2/3          2
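The conversion between a proportion and its odds is odds = p / (1 - p). The following sketch uses exact fractions so that the results match the values on the slide; the helper names `odds` and `proportion_from_odds` are made up for this illustration.

```python
from fractions import Fraction


def odds(p: Fraction) -> Fraction:
    """Convert a proportion (prevalence or risk) to odds: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def proportion_from_odds(o: Fraction) -> Fraction:
    """Inverse conversion: odds / (1 + odds)."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return o / (1 + o)


# Prevalences that appear on the slide.
for p in (Fraction(1, 4), Fraction(1, 2), Fraction(1, 3), Fraction(2, 3)):
    print(f"prevalence {p} -> odds {odds(p)}")
```

For small proportions the odds and the proportion are nearly equal, which is why odds ratios approximate risk ratios when the disease is rare.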

Measures of effect:

• Measures of association
  – Prevalence ratio
  – Risk ratio
  – Rate ratio
  – Odds ratio

  How much more likely are you to:
  – get lung cancer if you smoke?
  – have CVD if you are overweight?
  – get malaria if you don’t sleep under a bednet?

• Measures of impact
  – Risk difference
  – Rate difference

  How many cases of disease could we prevent if we:
  – reduce smoking?
  – reduce the number of people who are overweight?
  – improve the use of bednets?

Ratio and difference measures

Death rate from lung cancer and CVD per 1000 person-years for British male physicians 1951-1961

Smoking has a stronger association with lung cancer than with CVD. But the number of excess deaths (how many more deaths per 1000 person-years among smokers than among non-smokers) is greater for CVD. Because the rate of death from CVD is much higher, a smaller rate ratio translates into a larger number of deaths.
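To see how a smaller ratio can go with a larger difference, here is a small sketch. The rates in it are invented purely for illustration; they are not the figures from the British physicians study, which are not reproduced in this transcript. The function names are also hypothetical.

```python
def rate_ratio(rate_exposed: float, rate_unexposed: float) -> float:
    """Strength of association: how many times higher the rate is in the exposed."""
    return rate_exposed / rate_unexposed


def rate_difference(rate_exposed: float, rate_unexposed: float) -> float:
    """Impact: excess events per unit of person-time attributable to exposure."""
    return rate_exposed - rate_unexposed


# HYPOTHETICAL death rates per 1000 person-years: (exposed, unexposed).
# A rare outcome with a strong association, and a common outcome with a weak one.
outcomes = {
    "rare outcome": (1.0, 0.1),
    "common outcome": (9.0, 6.0),
}

for name, (exposed, unexposed) in outcomes.items():
    print(f"{name}: rate ratio {rate_ratio(exposed, unexposed):.1f}, "
          f"rate difference {rate_difference(exposed, unexposed):.1f} per 1000 person-years")
```

With these made-up numbers the rare outcome has the larger ratio but the common outcome has the larger difference, which is the same pattern the slide describes for lung cancer and CVD.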

Case definitions

• For quantifying disease burden, the case definition should be:
  – Sensitive (so we don’t miss cases)
  – Specific (so we don’t include as cases people who do not have the disease)
• For measuring the effect of interventions, it is important for the case definition to be:
  – Specific

Case definitions

TB:
– Smear-positive TB
– TB positive defined from sputum culture

Malaria:
– Malaria defined as fever or history of fever and other malaria-like symptoms with no other obvious cause of these symptoms
– Malaria defined as fever (axillary temperature ≥37.5°C) and ≥5000 parasites/µL

Sickle cell disease:
– Paper-based solubility test
– Electrophoresis

1000 individuals

              Disease
Test          Yes     No
Positive       90     90
Negative       10    810
TOTAL         100    900

Sensitivity: 90/100 = 90%
Specificity: 810/900 = 90%


1000 individuals, 100 (10%) with the disease

Sensitivity: 90/100 = 90%
Specificity: 810/900 = 90%
Positive predictive value: 90/180 = 50%


1000 individuals, 10 (1%) with the disease

Sensitivity: 9/10 = 90%
Specificity: 891/990 = 90%
Positive predictive value: 9/108 = 8%
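The two scenarios above differ only in how common the disease is. A short sketch of the calculation, with a hypothetical `TwoByTwo` class holding the four cell counts, shows the positive predictive value falling as prevalence falls while sensitivity and specificity stay at 90%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a diagnostic test cross-tabulated against true disease status."""
    true_pos: int   # test positive, disease present
    false_pos: int  # test positive, disease absent
    false_neg: int  # test negative, disease present
    true_neg: int   # test negative, disease absent

    @property
    def sensitivity(self) -> float:
        return self.true_pos / (self.true_pos + self.false_neg)

    @property
    def specificity(self) -> float:
        return self.true_neg / (self.true_neg + self.false_pos)

    @property
    def ppv(self) -> float:
        """Positive predictive value: probability of disease given a positive test."""
        return self.true_pos / (self.true_pos + self.false_pos)


# Counts from the two slides: 10% prevalence and 1% prevalence, 1000 people each.
prevalence_10_percent = TwoByTwo(true_pos=90, false_pos=90, false_neg=10, true_neg=810)
prevalence_1_percent = TwoByTwo(true_pos=9, false_pos=99, false_neg=1, true_neg=891)

for label, table in [("10% prevalence", prevalence_10_percent),
                     ("1% prevalence", prevalence_1_percent)]:
    print(f"{label}: sensitivity {table.sensitivity:.0%}, "
          f"specificity {table.specificity:.0%}, PPV {table.ppv:.0%}")
```

When the disease is rare, most positive results come from the much larger group of people without the disease, so even a test with 90% specificity yields mostly false positives.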

Uncertainty

• The confidence interval expresses the uncertainty in sample estimates of means, proportions, treatment efficacy etc.
• The larger the sample, the narrower the confidence interval
• An important aspect of planning research studies is to choose a sample size that yields confidence intervals narrow enough to draw conclusions
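The link between sample size and interval width can be illustrated with a confidence interval for a proportion. The sketch below uses the normal-approximation (Wald) interval, which is an assumption made here for simplicity; the slides do not say which method to use, and other methods behave better for small samples or proportions near 0 or 1. The sample counts are hypothetical.

```python
import math


def wald_ci(cases: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    if n <= 0 or not 0 <= cases <= n:
        raise ValueError("need n > 0 and 0 <= cases <= n")
    p = cases / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return max(0.0, p - z * se), min(1.0, p + z * se)


# HYPOTHETICAL samples with the same observed proportion (25%) but different sizes.
for cases, n in [(10, 40), (100, 400), (1000, 4000)]:
    lower, upper = wald_ci(cases, n)
    print(f"n={n}: {cases / n:.0%} (95% CI {lower:.1%} to {upper:.1%}), "
          f"width {upper - lower:.3f}")
```

Because the standard error shrinks with the square root of the sample size, a sample 100 times larger gives an interval about 10 times narrower.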

Significance testing

1. In hypothesis testing, the P-value is “the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct”.

2. The smaller the P-value, the stronger the evidence against the null hypothesis.

3. The P-value does not tell us whether the difference is big or small.

4. P-values are often misunderstood.

N (per group)   Control   Treated   P-value   Clinical value
N=4             1/4       3/4       >0.05     Unclear
N=12            3/12      9/12      <0.05     Unclear
N=40            10/40     30/40     <0.05     Useful

[Figure: risk ratio for each study, plotted on a log scale from 0.1 to 100]

• Confidence interval indicates the magnitude of the effect

• Small studies, wide confidence intervals, low precision, inconclusive results

• Large studies, narrow confidence intervals, more precise
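The three studies (N=4, N=12 and N=40 per group) all have the same risk ratio, and only the precision changes. The sketch below computes an approximate 95% confidence interval for each using the standard log-scale normal approximation. That choice of method is an assumption made for this illustration, since the slides do not state how the plotted intervals were obtained, and the approximation is rough for a study as small as N=4.

```python
import math


def risk_ratio_ci(events_treated: int, n_treated: int,
                  events_control: int, n_control: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Risk ratio (treated vs control) with an approximate 95% CI on the log scale."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    # Standard error of log(RR) for two independent binomial samples.
    se_log = math.sqrt(1 / events_treated - 1 / n_treated
                       + 1 / events_control - 1 / n_control)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Counts from the slide: (events in treated, N treated, events in control, N control).
studies = {
    "N=4": (3, 4, 1, 4),
    "N=12": (9, 12, 3, 12),
    "N=40": (30, 40, 10, 40),
}

for label, counts in studies.items():
    rr, lower, upper = risk_ratio_ci(*counts)
    print(f"{label}: RR {rr:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under this approximation the smallest study's interval spans 1, so it is inconclusive, while the largest study's interval is narrow enough to say something useful about the size of the effect.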


Interpreting confidence intervals and P-values

VE 31% (95% CI 0% to 52%), P=0.046

Interpretation of P-values

Goodman S (2008) Seminars in hematology 45:135

Studies with similar effects can produce very different levels of statistical significance

Very different effects can have the same P-value

V. Amrhein et al. Nature 567, 305–307; 2019

Software

Free software:
• OpenEpi
• R
• Epi-Info

Licensed software:
• Stata
