Introduction to statistics for epidemiology and public health
ARISE partnership July 1st 2020
Paul Milligan, Faculty of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine
Aims of this session
• To appreciate the importance of statistics in epidemiology and public health
• To understand:
  – The measures we use to quantify disease burden
  – The measures used to quantify effects of risk factors and interventions
  – How confidence intervals are used to express the degree of uncertainty in our estimates of these quantities
  – The distinction between statistical significance and public health importance
  – The characteristics that guide the choice of diagnostic tests for assessing disease
  – The importance of data quality and steps to ensure data quality
Florence Nightingale
Used statistical data to show that more soldiers died from preventable diseases than from injuries, and that hygiene reforms reduced mortality.
John Snow showed that cholera is a water-borne disease
The September 1854 cholera outbreak was centered in the Soho district, close to Snow's house.
Snow mapped the 13 public wells and all the known cholera deaths around Soho, and noted the spatial clustering of cases around one particular water pump on the southwest corner of the intersection of Broad (now Broadwick) Street and Cambridge (now Lexington) Street.
Despite strong scepticism from the local authorities (the prevailing view at the time was that cholera was spread by “miasma” in the air), he had the handle removed from the Broad Street pump and the outbreak quickly subsided.
Snow’s map, showing each death as a bar, conveyed a clear message.
Analysis of data on the number of cases of cholera in London, in relation to water supply:
Water supply                          Households  Cholera cases in 1854  Rate per 1000 households per year
Southwark and Vauxhall Water Company      40,046                  1,263                               31.5
Lambeth Water Company                     26,107                     98                                3.8
Rest of London                           256,423                  1,422                                5.5
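The rates in the table can be reproduced from the household and case counts. The following is a minimal sketch, not part of the original slides; the function name `rate_per_1000` and the dictionary layout are illustrative choices.

```python
# Households and 1854 cholera cases by water supplier, copied from the table above.
water_supply = {
    "Southwark and Vauxhall Water Company": (40_046, 1_263),
    "Lambeth Water Company": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}


def rate_per_1000(cases, households):
    """Cases per 1000 households over the one-year period."""
    return 1000 * cases / households


rates = {
    name: rate_per_1000(cases, households)
    for name, (households, cases) in water_supply.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 1000 households per year")

# Comparing the two companies as a ratio of rates (ratio measures are covered later).
ratio = rates["Southwark and Vauxhall Water Company"] / rates["Lambeth Water Company"]
print(f"Southwark and Vauxhall vs Lambeth: about {ratio:.1f} times the rate")
```

Expressing the counts per 1000 households is what makes the three groups comparable, since they differ greatly in size.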
Timeline of some key developments in statistics for epidemiology
• 1930s: Statistical inference, experimental design, confidence intervals
• 1950s: Survey sampling
• 1970s/80s: Modern case-control studies; bootstrap; survival analysis
• 1990s: Causal inference
• 2000-present: Electronic data capture
Data quality control
Many problems can be avoided by:
1. well-designed questionnaires that have been piloted
2. electronic data capture where possible
3. supervision
4. checking for completeness promptly
5. well-designed databases, relational structure, unique IDs, drop-down lists
6. entry-time range checks
7. attention to dates and date formats
8. inspection of datasets at regular intervals as they accumulate
9. active involvement of the PI in data QC
• Measures of disease and effect
• Characteristics of diagnostic tests
• Statistical uncertainty: confidence intervals
Questions:
• What is the burden of disease and ill health?
• What are the causes?
• How can we intervene?
In statistical terms:
• How can the disease burden be quantified, summarised, measured?
• How can we measure the effects of possible causes?
• How can we measure the impact of interventions?
Measures of disease burden
• how many people have the disease now?
• how many cases of the disease have there been in the last year?
• how much time have people spent being sick?
• how many people are now affected by consequences of the disease?
• how has their quality of life been affected?
• how many deaths have been caused by the disease in the last year?
Standardized measures that can be compared between populations
Measures of disease and ill health:
• Prevalence
• Incidence
• Risk
• (Odds)
• Disability-adjusted life years; quality of life
Measures of effect:
• Ratio measures (strength of association with causes)
• Difference measures (impact of interventions)
Example 1
• The number of people who were smear-positive for TB in a recent community-based survey of a country was:
  – Urban areas: 150
  – Rural areas: 15
• How can we interpret these results?
Example 1
Area    No. of cases   Sample size   Prevalence   Prevalence per 100,000
Urban   150            50,000        0.003        300
Rural   15             25,000        0.0006       60
Prevalence = (No. of cases) / (No. of people in the population), at a particular point in time
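As a quick check of the arithmetic in Example 1, the sketch below applies the prevalence formula to the survey counts. The helper name `prevalence` is an illustrative choice, not something defined in the slides.

```python
def prevalence(cases, sample_size):
    """Proportion of the surveyed population with the disease at the time of the survey."""
    return cases / sample_size


# (cases, sample size) from the TB survey table above.
survey = {"Urban": (150, 50_000), "Rural": (15, 25_000)}

for area, (cases, n) in survey.items():
    p = prevalence(cases, n)
    print(f"{area}: prevalence {p:.4f} = {p * 100_000:.0f} per 100,000")
```

The raw counts differ tenfold, but once the sample sizes are taken into account the urban prevalence is five times the rural prevalence.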
Example 2
The numbers of new cases of malaria diagnosed were reported for three districts:
• District A: 30 cases
• District B: 10 cases
• District C: 45 cases
Questions to ask before interpreting these counts:
• What is the population in each district?
• Over what time period were cases documented?
• How were the disease cases defined?
• How were cases detected?
Incidence
District  Cases  Population  Period                 Years  Person-years  Rate    Rate per 1000 person-years
A         30     10,324      Jan 2018 to Dec 2018   1      10,324        0.0029  2.9
B         10     4,210       Jan 2018 to Dec 2019   2      8,420         0.0012  1.2
C         45     14,540      July 2019 to Dec 2019  0.5    7,270         0.0062  6.2
If rates are constant: 10 people followed for 1 year = 1 person followed for 10 years = 5 people followed for 2 years = 10 person-years
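The incidence table can be reproduced with a few lines of code. This is a sketch under the same simplification the table makes, namely that each district's population is constant over the period, so person-years = population × years. The name `incidence_rate` is illustrative.

```python
# Cases, population and follow-up time in years, from the incidence table above.
districts = {
    "A": {"cases": 30, "population": 10_324, "years": 1.0},
    "B": {"cases": 10, "population": 4_210, "years": 2.0},
    "C": {"cases": 45, "population": 14_540, "years": 0.5},
}


def incidence_rate(cases, population, years):
    """New cases per person-year, assuming a constant population over the period."""
    person_years = population * years
    return cases / person_years


rates_per_1000 = {
    name: 1000 * incidence_rate(d["cases"], d["population"], d["years"])
    for name, d in districts.items()
}

for name, rate in rates_per_1000.items():
    print(f"District {name}: {rate:.1f} per 1000 person-years")
```

District C reports the most cases and also has the highest rate, but District B's rate is less than half of District A's even though its case count is a third: the denominator and the time period both matter.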
Risk
The probability of getting the disease during a specified period of time:
Risk = (No. of new cases) / (No. without the disease at the start of the time period)
[Figure: N people followed over time; d of them get the disease during the study period T]
• T is the duration of the study
• N is the number free of disease at the start
• d is the number of people who get the disease during time T
Risk during time T: d/N
Follow-up studies are needed to estimate risks
Prevalence, risk and incidence rates
• Prevalence: the probability that a person chosen at random from the population has the disease
• Risk: the probability that a person who does not have the disease will get the disease in a specified period of time
• Epidemiologists often work with incidence rates rather than risks, because the risk always has to be qualified by specifying the time period.
• A further advantage of rates is that we can include multiple events for each person.
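To illustrate why a risk must always be quoted with its time period, the sketch below converts a single incidence rate into risks over different periods. It assumes a constant rate and uses the standard relationship risk = 1 − exp(−rate × time); the rate of 50 per 1000 person-years is a hypothetical value chosen for illustration, not a figure from the slides.

```python
import math


def risk_from_rate(rate_per_person_year, years):
    """Probability of at least one event within `years`, assuming a constant rate."""
    return 1 - math.exp(-rate_per_person_year * years)


rate = 0.05  # hypothetical: 50 cases per 1000 person-years

for years in (1, 2, 5):
    risk = risk_from_rate(rate, years)
    print(f"Rate {rate * 1000:.0f} per 1000 person-years -> {years}-year risk {risk:.1%}")
```

The rate is one number, while the risk changes with every choice of follow-up period.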
Odds
• Odds: a way of expressing risks and proportions
  – Prevalence 1/4 (1 of 4 people has the disease)
  – Odds 1:3 = 1/3
• What are the odds of disease here?
  Prevalence   Odds
  1/2          1
  1/3          0.5
  2/3          2
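The conversions in the odds table follow the general relationship odds = p / (1 − p), where p is the prevalence (or risk). A minimal sketch using exact fractions; the function names are illustrative:

```python
from fractions import Fraction


def odds_from_prevalence(p):
    """Odds = p / (1 - p): people with the disease per person without it."""
    return p / (1 - p)


def prevalence_from_odds(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1 + odds)


for p in (Fraction(1, 4), Fraction(1, 2), Fraction(1, 3), Fraction(2, 3)):
    print(f"prevalence {p} -> odds {odds_from_prevalence(p)}")
```

When the prevalence is small, odds and prevalence are nearly equal; they diverge as the prevalence grows.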
Measures of effect:
• Measures of association
  – Prevalence ratio
  – Risk ratio
  – Rate ratio
  – Odds ratio
  How much more likely are you to:
  – get lung cancer if you smoke?
  – have CVD if you are overweight?
  – get malaria if you don’t sleep under a bednet?
• Measures of impact
  – Risk difference
  – Rate difference
  How many cases of disease could we prevent if we:
  – reduce smoking?
  – reduce the number of people who are overweight?
  – improve the use of bednets?
Ratio and difference measures
Death rate from lung cancer and CVD per 1000 person-years for British male physicians 1951-1961
Smoking has a stronger association with lung cancer than with CVD. But the excess deaths (how many more deaths per 1000 person-years among smokers than among non-smokers) are greater for CVD. Because the rate of death from CVD is much higher, a smaller rate ratio translates into a larger number of deaths.
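The contrast between ratio and difference measures can be sketched numerically. The rates below are hypothetical values chosen only to show the pattern (a rare outcome with a large ratio, a common outcome with a small ratio); they are NOT the figures from the physicians' study, and the function names are illustrative.

```python
def rate_ratio(rate_exposed, rate_unexposed):
    """Strength of association: how many times higher the rate is in the exposed."""
    return rate_exposed / rate_unexposed


def rate_difference(rate_exposed, rate_unexposed):
    """Impact: excess events per 1000 person-years among the exposed."""
    return rate_exposed - rate_unexposed


# Hypothetical death rates per 1000 person-years (illustrative only).
hypothetical_rates = {
    "Rare outcome": {"exposed": 1.0, "unexposed": 0.1},
    "Common outcome": {"exposed": 9.0, "unexposed": 7.0},
}

for outcome, r in hypothetical_rates.items():
    ratio = rate_ratio(r["exposed"], r["unexposed"])
    diff = rate_difference(r["exposed"], r["unexposed"])
    print(f"{outcome}: rate ratio {ratio:.1f}, rate difference {diff:.1f} per 1000 person-years")
```

In this made-up example the rare outcome has a rate ratio of 10 but only 0.9 excess deaths per 1000 person-years, while the common outcome has a rate ratio of about 1.3 and 2 excess deaths per 1000 person-years.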
Case definitions
• For quantifying disease burden, the case definition should be:
  – Sensitive (so we don’t miss cases)
  – Specific (so we don’t include as cases people who do not have the disease)
• For measuring the effect of interventions, it is important for the case definition to be:
  – Specific
Case definitions
TB:
– Smear-positive TB
– TB positive defined from sputum culture
Malaria:
– Malaria defined as fever or history of fever and other malaria-like symptoms with no other obvious cause of these symptoms
– Malaria defined as fever (axillary temperature ≥37.5°C) and ≥5000 parasites/µL
Sickle cell disease:
– Paper-based solubility test
– Electrophoresis
1000 individuals, 100 (10%) with the disease
                 Disease
Test           Yes     No
Positive        90     90
Negative        10    810
TOTAL          100    900
Sensitivity: 90/100 = 90%
Specificity: 810/900 = 90%
Positive predictive value: 90/180 = 50%

1000 individuals, 10 (1%) with the disease
                 Disease
Test           Yes     No
Positive         9     99
Negative         1    891
TOTAL           10    990
Sensitivity: 9/10 = 90%
Specificity: 891/990 = 90%
Positive predictive value: 9/108 = 8%
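The two tables can be summarised with one small function. This is a sketch with illustrative names (`diagnostic_summary`, `tp`, `fp`, `fn`, `tn` for true/false positives/negatives); the counts are taken from the tables above.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Sensitivity, specificity and positive predictive value from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased people who test positive
        "specificity": tn / (tn + fp),  # disease-free people who test negative
        "ppv": tp / (tp + fp),          # test-positives who truly have the disease
    }


# Same test, two populations of 1000 people.
prevalence_10_percent = diagnostic_summary(tp=90, fp=90, fn=10, tn=810)
prevalence_1_percent = diagnostic_summary(tp=9, fp=99, fn=1, tn=891)

for label, result in (("10% prevalence", prevalence_10_percent),
                      ("1% prevalence", prevalence_1_percent)):
    print(f"{label}: sensitivity {result['sensitivity']:.0%}, "
          f"specificity {result['specificity']:.0%}, PPV {result['ppv']:.0%}")
```

Sensitivity and specificity are the same in both populations, but the positive predictive value falls from 50% to about 8% when the disease is rarer, because false positives from the large disease-free group outnumber the true positives.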
Uncertainty
• The confidence interval expresses the uncertainty in sample estimates of means, proportions, treatment efficacy etc.
• The larger the sample, the narrower the confidence interval
• An important aspect of planning research studies is to choose the sample size to yield confidence intervals that are narrow enough to draw conclusions
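How the interval narrows with sample size can be sketched for a proportion. The example uses the simple normal-approximation (Wald) interval, which is one of several possible methods and is an assumption here, not necessarily what the lecture used; the observed proportion of 30% and the sample sizes are hypothetical.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


# Hypothetical surveys that all observe 30% prevalence, with increasing sample size.
widths = {}
for n in (100, 400, 1600):
    lower, upper = wald_ci(successes=round(0.3 * n), n=n)
    widths[n] = upper - lower
    print(f"n={n}: 30% (95% CI {lower:.1%} to {upper:.1%}), width {widths[n]:.1%}")
```

Under this approximation the width is proportional to 1/√n, so halving the width of the interval requires about four times as many participants.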
Significance testing
1. In hypothesis testing, the P-value is “the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct”.
2. The smaller the P-value, the stronger the evidence against the null hypothesis.
3. The P-value does not tell us whether the difference is big or small.
4. P-values are often misunderstood.
Study   Control   Treated   P-value   Clinical value
N=4     1/4       3/4       >0.05     Unclear
N=12    3/12      9/12      <0.05     Unclear
N=40    10/40     30/40     <0.05     Useful
[Figure: risk ratio with confidence interval for each study, plotted on a log scale from 0.1 to 100]
• Confidence interval indicates the magnitude of the effect
• Small studies: wide confidence intervals, low precision, inconclusive results
• Large studies: narrow confidence intervals, more precise
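The pattern in the three studies can be sketched by computing a risk ratio and an approximate confidence interval for each. The code uses the usual large-sample interval on the log scale, which is an assumption: the slides do not say which method produced the plotted intervals or P-values, and the approximation is crude for the smallest study. The function name `risk_ratio_ci` is illustrative.

```python
import math


def risk_ratio_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Risk ratio (treated vs control) with an approximate 95% CI on the log scale."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# (events, N) in the treated and control groups, from the table above.
studies = {"N=4": (3, 4, 1, 4), "N=12": (9, 12, 3, 12), "N=40": (30, 40, 10, 40)}

results = {name: risk_ratio_ci(*counts) for name, counts in studies.items()}
for name, (rr, lower, upper) in results.items():
    print(f"{name}: risk ratio {rr:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```

All three studies estimate the same risk ratio of 3. With this method the interval for the smallest study includes 1, the interval for N=12 only just excludes 1, and the interval for N=40 is the narrowest, which matches the pattern described in the bullets above.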
Interpreting confidence intervals and P-values
VE 31% (95% CI 0% to 52%), P=0.046
Interpretation of P-values
• Studies with similar effects can produce very different levels of statistical significance
• Very different effects can have the same P-value
Goodman S (2008) Seminars in Hematology 45:135
V. Amrhein et al. Nature 567, 305–307; 2019
Software
Free software:
• OpenEpi
• R
• Epi-Info
Licensed software:
• Stata