Section II Descriptive stats for continuous data

Descriptive stats for binary data and bivariate associations in binary data

Types of data

Numerical:
  Continuous - age, SBP, glucose
  Interval - parity, num infections

Ordinal (ranks) - cancer stage, Apgar score

Nominal (no order) - gender, ethnicity, treatment

Dataset used to illustrate some statistics in this section

Stomach cancer survival times in controls (Cameron & Pauling, PNAS, Oct 1976)

Days from end of treatment to death:

4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45

n = 13 subjects

Measures of central tendency (middle)

Data: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45

mean = 17.5 days
median = 15 days
mode = 8 days
Geometric mean: $\mathrm{GM} = \sqrt[13]{4 \times 6 \times 8 \times 8 \times \dots \times 45} = 14.26$ days

If we delete the most extreme value, 45, the mean is now 15.25, the median is 14.5, and GM = 13.0. The median changes least.
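The summary values above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `summarize` is made up for illustration.

```python
import math
import statistics

# Stomach cancer survival times in days (n = 13), from the slide above.
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]


def summarize(values):
    """Return (mean, median, geometric mean) for a list of positive numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # Geometric mean = exp(mean of natural logs) = nth root of the product.
    gm = math.exp(statistics.mean([math.log(v) for v in values]))
    return mean, median, gm


full = summarize(days)
trimmed = summarize(days[:-1])  # drop the most extreme value, 45
mode = statistics.mode(days)

print("all 13 values: mean=%.2f median=%.1f GM=%.2f" % full)
print("without 45:    mean=%.2f median=%.1f GM=%.2f" % trimmed)
print("mode =", mode)
```

Comparing the two printed lines shows how much each measure moves when the single largest value is removed.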

Mean versus Median (lesson #1 in how to lie with statistics)

Yearly income data from n=11 persons: one income is for Dr Brilliant, the other 10 incomes are from her 10 graduate students.

Yearly income in dollars:
950, 960, 970, 980, 990, 1010, 1020, 1030, 1040, 1050, 100,000
Total = $110,000

mean = 110,000/11 = $10,000
median = $1,010 (the sixth ordered value)

Which is the better summary of the “typical” value?

Example - Survival times in women with advanced breast cancer

Survival time in days after end of radiotherapy

woman    after 275 days f/u    after 305 days f/u
  1            14                    14
  2            26                    26
  3            43                    43
  4            45                    45
  5            50                    50
  6            58                    58
  7            60                    60
  8            62                    62
  9            70                    70
 10            70                    70
 11            83                    83
 12            98*                  128*
 13           104*                  134*
 14           124*                  154*
 15           125*                  155*
 16           275*                  305*

mean           75.6                  83.1
median         66.0                  66.0
SD             55.8                  66.3

* still alive (censored)

The median is still a valid measure when less than half the data are censored.

Cumulative frequencies & survival

Days     num dead   pct dead   cum dead   cum pct dead   cum pct alive = S
1-10        4         30.8        4          30.8            69.2
11-20       5         38.5        9          69.2            30.8
21-30       2         15.4       11          84.6            15.4
31-40       1          7.7       12          92.3             7.7
41-50       1          7.7       13         100.0             0
total      13

Stomach cancer survival time in days

day   cum dead   cum incidence   survival
  4       1           7.7%         92.3%
  6       2          15.4%         84.6%
  8       4          30.8%         69.2%
 12       5          38.5%         61.5%
 14       6          46.2%         53.8%
 15       7          53.8%         46.2%
 17       8          61.5%         38.5%
 19       9          69.2%         30.8%
 22      10          76.9%         23.1%
 24      11          84.6%         15.4%
 34      12          92.3%          7.7%
 45      13         100.0%          0.0%
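The table above can be rebuilt directly from the 13 survival times. The sketch below assumes no censoring (every subject is followed to death), which holds for this dataset; the function name `cumulative_table` is illustrative only.

```python
from collections import Counter

# Stomach cancer survival times in days (n = 13); all subjects died.
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]


def cumulative_table(times):
    """Rows of (day, cumulative deaths, cumulative incidence %, survival %).

    Valid only when there is no censoring.
    """
    deaths_by_day = Counter(times)
    rows = []
    cum_dead = 0
    for day in sorted(deaths_by_day):
        cum_dead += deaths_by_day[day]
        cum_inc = 100.0 * cum_dead / len(times)
        rows.append((day, cum_dead, round(cum_inc, 1), round(100.0 - cum_inc, 1)))
    return rows


table = cumulative_table(days)
for day, cum_dead, cum_inc, surv in table:
    print(f"{day:3d} {cum_dead:3d} {cum_inc:6.1f}% {surv:6.1f}%")
```

With censored observations this simple count no longer works, and a method that adjusts the number at risk over time is needed.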

[Figure: Stomach cancer cumulative incidence and survival (0% to 100%) versus days (0 to 54)]

Bevacizumab & Ovarian Cancer (Burger et al., NEJM, Dec 2011)

Why survival curves?

[Figure: percent alive (0% to 100%) versus day (0 to 10)]

Summarizing mortality - hazard rates

$h = \dfrac{\text{number of persons with outcome}}{\text{total person-time of follow-up in all at risk}}$

This is a rate per person-time. It is NOT a probability (not a risk).

In stomach cancer, n=13 with 13 deaths; total follow-up is
4+6+8+8+12+14+15+17+19+22+24+34+45 = 228 person-days

Hazard rate = mortality rate = 13/228 = 0.057 deaths per person-day, or 5.7 deaths per 100 person-days of follow-up. Do NOT report this as 5.7%; that is wrong.

Example: Why hazard rates?

Group   n     num dead   mean f/u   total f/u   rate per 1000
A       100   7          36         3600        7/3600 = 1.94
B       100   2          3          300         2/300  = 6.67

The mortality rate is higher for B than for A, even though the number of persons in each group is the same and more people died in group A.

The hazard rate ratio for A/B is 1.94/6.67 = 0.291.

When ALL patients are followed to the endpoint (no censoring), mean time to event = 1/hazard.
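The hazard-rate arithmetic on the last two slides can be sketched as follows. The helper name `hazard_rate` is hypothetical; the numbers are the stomach cancer data and the group A/B table above.

```python
def hazard_rate(events, person_time):
    """Events per unit of person-time (a rate, not a probability)."""
    if person_time <= 0:
        raise ValueError("person_time must be positive")
    return events / person_time


# Stomach cancer: 13 deaths, everyone followed to death (no censoring).
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
h_stomach = hazard_rate(len(days), sum(days))
mean_time = 1 / h_stomach  # equals the mean survival time when no one is censored

# Groups A and B from the table: deaths / total follow-up.
rate_a = hazard_rate(7, 3600)
rate_b = hazard_rate(2, 300)
hr_a_vs_b = rate_a / rate_b

print(f"stomach: {100 * h_stomach:.1f} deaths per 100 person-days")
print(f"mean time to death = {mean_time:.2f} days")
print(f"A: {1000 * rate_a:.2f} per 1000, B: {1000 * rate_b:.2f} per 1000")
print(f"hazard rate ratio A/B = {hr_a_vs_b:.3f}")
```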

Hazard rates & survival curves

[Figure: Survival S (0% to 100%) versus t (0 to 12) for hazard rates 0.2 and 0.4]

[Figure: log Survival, log(S) (0.0 to -6.0), versus t (0 to 12) for hazard rates 0.2 and 0.4]

$-\log_e(S) = \text{cumulative hazard} = h\,t$, so $-h$ is the (average) slope of $\log_e(S)$ vs $t$

Hazard rate ratios & survival curves

ha = hazard rate in group A, hb = hazard rate in group B.
The hazard rate ratio (HR) for A compared to B is HR = ha/hb.

If HR is constant over time, one can compute the survival in group A from the survival in group B:

$S_a = S_b^{HR}$

Ex: HR = 0.291 and S at t = 12 months is 90% in group B, so $S_a = 0.90^{0.291} = 0.970$, or 97.0%, in group A at t = 12 months.

A “protective” HR < 1 increases survival. HR > 1 decreases survival.
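The conversion from one group's survival to the other's is a one-line power. The sketch below assumes a constant hazard ratio over time, as the slide does; `survival_from_hr` is an illustrative name.

```python
def survival_from_hr(s_ref, hr):
    """Survival in the comparison group, S_a = S_b ** HR.

    Assumes the hazard ratio is constant over time.
    s_ref is the reference-group survival as a proportion (0 to 1).
    """
    if not 0.0 <= s_ref <= 1.0:
        raise ValueError("s_ref must be a proportion between 0 and 1")
    if hr < 0:
        raise ValueError("hazard ratio must be non-negative")
    return s_ref ** hr


s_a = survival_from_hr(0.90, 0.291)
print(f"S_a at 12 months = {s_a:.3f}")

# HR = 1 leaves survival unchanged; HR > 1 lowers it.
print(survival_from_hr(0.90, 1.0), survival_from_hr(0.90, 2.0))
```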

Cumulative hazard rate

$-\log_e(S) = \text{cumulative hazard} = \sum_t h_i = \int h(t)\,dt$

If h is constant over time, cumulative hazard = h T, where T is the follow-up time. In this case h = cumulative hazard/T, and h is the slope of the cumulative hazard vs t plot.

From: Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial. JAMA. 2002;288(3):321-333.

HR indicates hazard ratio; nCI, nominal confidence interval; and aCI, adjusted confidence interval. Global index = first occurrence of CHD, cancer, stroke, pulmonary embolism, hip fracture, or death.

Distribution skewness

Long right-tailed distribution: median < mean (common for survival data)

[Figure: right-skewed density with the median and mean marked]

Example: ICU length of stay (Howard)

n=94, mean=11.3 days, median=6 days, min=1 day, max=80 days

[Figure: histogram of LOS_ICU; x-axis days (6 to 78), y-axis percent (0 to 80)]

Skewness

Long left-tailed distribution: median > mean (not as common in biology/medicine)

[Figure: left-skewed density with the median and mean marked]

Symmetric (common in biology)

[Figure: symmetric density over -3.5 to 3.5 with mean and median marked at the center]

Can be symmetric without being bell-curve shaped; has one mode.

When data have a skewed distribution, must use “non parametric” methods.

Measures of variation, spread

IQR - interquartile range

[Figure: density over 0 to 20 with Q1, median, and Q3 marked; each of the four segments holds 25% of the data]

Box-whisker plot

[Figure: box-whisker plot on a 0 to 50 axis showing min, Q1, median, mean, Q3, and max]

Variation - Variance & SD

Mean = $\bar{Y}$ = 17.54 days

  Y     Y - Ybar    (Y - Ybar)^2
  4     -13.54        183.3
  6     -11.54        133.2
  8      -9.54         91.0
  8      -9.54         91.0
 12      -5.54         30.7
 14      -3.54         12.5
 15      -2.54          6.5
 17      -0.54          0.3
 19       1.46          2.1
 22       4.46         19.9
 24       6.46         41.7
 34      16.46        270.9
 45      27.46        754.1
sum       0          1637.2

$\text{Variance} = \dfrac{\sum_i (Y_i - \bar{Y})^2}{n-1}$

Var = 1637.2/12 = 136.4

SD = √Variance = √136.4 = 11.7 days
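The variance and SD calculation above can be reproduced step by step. This is a minimal sketch that follows the slide's table (sum of squared deviations divided by n - 1) and then cross-checks against the standard library.

```python
import math
import statistics

# Stomach cancer survival times in days (n = 13).
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

n = len(days)
mean = sum(days) / n

# Sum of squared deviations from the mean (last column of the table).
sum_sq = sum((y - mean) ** 2 for y in days)

variance = sum_sq / (n - 1)  # sample variance, divisor n - 1
sd = math.sqrt(variance)

print(f"mean = {mean:.2f}, sum of squares = {sum_sq:.1f}")
print(f"variance = {variance:.1f}, SD = {sd:.1f} days")

# statistics.stdev uses the same n - 1 divisor.
sd_check = statistics.stdev(days)
```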

Variation - Interpreting the SD

Rule of thumb from Gaussian (“Normal”) theory (will study more shortly). The rule is OK if the data have a unimodal, symmetric distribution.

Range of middle 2/3 of the data: mean +/- SD

Range of middle 95% of the data: mean +/- 2 SD

Implies SD ≈ range/4 (after extreme values are removed from the range)

SD of differences - paired data

chol in mmol/L

person   chol at start   chol at end   difference
1        12.6            10.0          2.6
2         8.5             7.5          1.0
3         7.0             5.8          1.2
4         6.9             4.9          2.0
5         5.8             4.0          1.8
6         4.1             3.8          0.3

mean      7.48            6.00         1.48
SD        2.90            2.38         0.82

Corr of start vs end: r = 0.971

If authors only report (mmol/L):

        start   end    change
mean    7.48    6.00   ??
SD      2.90    2.38   ??

Easy to get the mean difference = 7.48 - 6.00 = 1.48

But can't get the SD of the differences: 2.90 - 2.38 = 0.52 ≠ 0.82

The 1.48 mean diff is the average response. The 0.82 diff SD is the variation in response.

$SD_{diff} = \sqrt{SD_{start}^2 + SD_{end}^2 - 2\,r\,SD_{start}\,SD_{end}}$, where r = correlation coeff
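The formula for the SD of paired differences can be checked against the six cholesterol pairs. The sketch below computes the correlation by hand so it relies only on long-standing standard-library functions; variable names are illustrative.

```python
import math
import statistics

# Cholesterol (mmol/L) at start and end for the six persons above.
start = [12.6, 8.5, 7.0, 6.9, 5.8, 4.1]
end = [10.0, 7.5, 5.8, 4.9, 4.0, 3.8]
n = len(start)

sd_start = statistics.stdev(start)
sd_end = statistics.stdev(end)

# Pearson correlation from the sample covariance (divisor n - 1).
mean_s, mean_e = statistics.mean(start), statistics.mean(end)
cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(start, end)) / (n - 1)
r = cov / (sd_start * sd_end)

# SD of the differences from the summary statistics alone.
sd_formula = math.sqrt(sd_start**2 + sd_end**2 - 2 * r * sd_start * sd_end)

# SD of the differences computed directly from the raw pairs.
diffs = [s - e for s, e in zip(start, end)]
sd_direct = statistics.stdev(diffs)

print(f"r = {r:.3f}")
print(f"SD of differences: formula {sd_formula:.2f}, direct {sd_direct:.2f}")
```

Without r, the summary table alone is not enough: the formula needs the correlation between the start and end values.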

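SD of differences - two independent groups

Comparing ages in groups A vs B

         group A   group B   B - A    B + A
Data     30        50
         35        51
         77        55
         41
n        4         3
mean     45.75     52.00     6.25     97.75
SD       18.46     2.16      18.58    18.58
Var      340.69    4.67      345.35   345.35

(SD and Var in this table use divisor n rather than n-1, so the B - A and B + A columns match the sum of the group variances exactly.)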

All possible differences, B-A

A \ B    50     51     55
30       20     21     25
35       15     16     20
77      -27    -26    -22
41        9     10     14

mean 6.25, SD 18.58, Var 345.35

All possible sums, B+A

A \ B    50     51     55
30       80     81     85
35       85     86     90
77      127    128    132
41       91     92     96

mean 97.75, SD 18.58, Var 345.35
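The two tables above can be generated by pairing every age in A with every age in B. The sketch below uses the population variance (divisor n), which is the convention that reproduces the slide's figures; with that convention the variance of all pairwise differences (or sums) equals the sum of the two group variances.

```python
from itertools import product
from statistics import mean, pvariance

group_a = [30, 35, 77, 41]
group_b = [50, 51, 55]

# Every (A, B) pairing: 4 x 3 = 12 differences and 12 sums.
diffs = [b - a for a, b in product(group_a, group_b)]
sums = [b + a for a, b in product(group_a, group_b)]

var_a = pvariance(group_a)  # divisor n, as in the table
var_b = pvariance(group_b)

print(f"Var(A) = {var_a:.2f}, Var(B) = {var_b:.2f}, sum = {var_a + var_b:.2f}")
print(f"B - A: mean = {mean(diffs):.2f}, Var = {pvariance(diffs):.2f}")
print(f"B + A: mean = {mean(sums):.2f}, Var = {pvariance(sums):.2f}")
```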

Rule for SD of differences - two independent groups

Var(Y - X) = Var(Y) + Var(X)
Var(Y + X) = Var(Y) + Var(X)

$SD(Y - X) = \sqrt{SD^2(Y) + SD^2(X)}$
$SD(Y + X) = \sqrt{SD^2(Y) + SD^2(X)}$

[Figure: diagram relating SD(X), SD(Y), and SD(Y-X)]

BINARY DATA: Statistics

Associations for Binary data

                disease   no disease   total
Exposed (e)     a         b            a+b
Unexposed (u)   c         d            c+d

risk = P                 odds = O
Pe = a/(a+b)             Oe = a/b
Pu = c/(c+d)             Ou = c/d
RR = Pe/Pu               OR = Oe/Ou

Risk vs Odds

P = risk, O = odds

Risk = num sick/total
Odds = num sick/num not sick

O = P/(1-P), P = O/(1+O)
Ex: P = 1/10, O = 1/9.

RR = OR/(1 - Pu + OR Pu)

When Pu is small, RR ≈ OR. In general, OR is more extreme (farther from 1) than RR.
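The conversions above are short enough to write as functions. In this sketch the function names and the example inputs (OR = 3 with unexposed risks of 1% and 40%) are hypothetical, chosen only to show RR approaching OR when Pu is small.

```python
def odds_from_risk(p):
    """O = P / (1 - P)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("risk must be in [0, 1)")
    return p / (1.0 - p)


def risk_from_odds(o):
    """P = O / (1 + O)."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return o / (1.0 + o)


def rr_from_or(odds_ratio, p_unexposed):
    """RR = OR / (1 - Pu + OR * Pu)."""
    return odds_ratio / (1.0 - p_unexposed + odds_ratio * p_unexposed)


print(odds_from_risk(0.1))         # risk 1/10 gives odds 1/9
print(risk_from_odds(1 / 9))       # and back again

# Hypothetical values: the same OR maps to different RRs as Pu changes.
print(f"Pu = 0.01: RR = {rr_from_or(3.0, 0.01):.3f}")
print(f"Pu = 0.40: RR = {rr_from_or(3.0, 0.40):.3f}")
```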

Oral Contraceptive exposure vs Cancer
Prospective study (unbiased est of pop)

              disease   no disease   total   risk     odds
exposed       50        950          1000    0.0500   0.0526
unexposed     200       8550         8750    0.0229   0.0234
total         250       9500         9750

OC use (P)    20%       10%

RR = 2.188, OR = 2.250
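The risks, odds, RR, and OR in this table follow directly from the four cell counts. A minimal sketch, with an illustrative function name and the a, b, c, d layout from the earlier 2x2 table:

```python
def two_by_two(a, b, c, d):
    """Risks, odds, RR and OR for a 2x2 table.

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    """
    risk_e, risk_u = a / (a + b), c / (c + d)
    odds_e, odds_u = a / b, c / d
    return {
        "risk_exposed": risk_e,
        "risk_unexposed": risk_u,
        "odds_exposed": odds_e,
        "odds_unexposed": odds_u,
        "RR": risk_e / risk_u,
        "OR": odds_e / odds_u,
    }


# Oral contraceptive example from the table above.
result = two_by_two(a=50, b=950, c=200, d=8550)
for name, value in result.items():
    print(f"{name:15s} {value:.4f}")
```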

Ratios and differences

For rare events or diseases: Pe = 1/10,000, Pu = 1/100,000

RR = 10, risk difference = 9/100,000

Misleading to report only the ratio and not the actual risks.

Odds - case control study

              cancer   no cancer
OC            100      5
no OC         400      45
total         500      50

OC use (P)    20%      10%
Odds (O)      0.25     0.11

OR = 2.25

Why use ORs?

1. In a prospective study, one usually quotes disease risk & risk ratio (RR). In a case-control study, we always quote OR, not RR. The case-control OR of exposure in disease/no disease equals the prospective OR of disease in exposed/unexposed in the population if the probability of exposure is the same as in the target population. (Not necessarily true if there is confounding or bias.)

2. OR is more “stable” (universal) across studies. If unexposed risk = 20% and RR = 2, exposed risk = 40%. If unexposed risk = 60%, RR can't be 2, since the exposed risk would have to be 120%.

Independence rule for ORs

ORs for heart attack (MI):
For smokers/non-smokers: OR = 4
For alcohol/no alcohol: OR = 2

If independent, the OR for those who smoke AND drink alcohol is 4 x 2 = 8 (relative to no smoke, no alcohol). This is only true if smoking and drinking are independent influences on MI. However, smoking & drinking can be correlated with each other.

NNT - number needed to treat (or harm) (clinical trials)

Pc (like Pu) = prop w/ disease in control group
Pt (like Pe) = prop w/ disease in treat group

ARR = absolute risk reduction = risk difference = RD = Pc - Pt

RRR = relative risk reduction = (Pc - Pt)/Pc = ARR/Pc = 1 - RR

NNT = number needed to treat = 1/ARR

NNT Example

Pc = 0.36 = 36%, Pt = 0.34 = 34%

ARR = RD = 0.02 = 2%
RRR = 0.02/0.36 = 0.056 = 5.6% (a percent of a percent)

NNT = 1/0.02 = 50

So 50 patients must be given the treatment to prevent one additional disease case. Can be extended to more complex stats.
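The three quantities in this example come from two proportions. A minimal sketch; the function name `nnt_summary` is hypothetical.

```python
def nnt_summary(p_control, p_treated):
    """Return (ARR, RRR, NNT) from the disease proportions in each arm."""
    arr = p_control - p_treated       # absolute risk reduction
    if arr == 0:
        raise ValueError("no risk difference: NNT is undefined")
    rrr = arr / p_control             # relative risk reduction = 1 - RR
    nnt = 1.0 / arr                   # number needed to treat
    return arr, rrr, nnt


arr, rrr, nnt = nnt_summary(p_control=0.36, p_treated=0.34)
print(f"ARR = {arr:.2%}, RRR = {rrr:.1%}, NNT = {nnt:.0f}")
```

A negative ARR (treatment arm doing worse) gives a negative NNT, which is usually reported as a number needed to harm.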

NNT - Ovarian Ca screening

“Tests commonly recommended to screen healthy women for ovarian cancer do more harm than good and should not be performed, a panel of medical experts said on Monday. The screenings, blood tests for a substance linked to cancer and ultrasound scans to examine the ovaries, do not lower the death rate from the disease, and they yield many false-positive results that lead to unnecessary operations with high complication rates, the panel said. … To find one case of ovarian cancer, 20 women had to undergo surgery.” (NY Times, 10 Sept 2012)

Summary - Ratios

          Risk          Odds          Hazard
          P             O             h
Ratio:    RR = Pe/Pu    OR = Oe/Ou    HR = he/hu

All have the null value of 1.0 when there is no association. The distribution of the logs of these ratios from study to study is usually bell-curve shaped around the true log-scale value.

Sensitivity and Specificity

                 True-disease   True-No disease
Test-positive    a              b
Test-negative    c              d
Total            a+c            b+d

Sensitivity = a/(a+c), false negative = c/(a+c)

Specificity = d/(b+d), false positive = b/(b+d)

Positive predictive value = PPV = a/(a+b) *

Negative predictive value = NPV = d/(c+d) *

* Depends on disease prevalence, not just an attribute of the test

Sensitivity, Specificity, Accuracy

Accuracy = W Sensitivity + (1-W) Specificity where 0 < W < 1.

Often W=0.5 (unweighted accuracy)

We wish to maximize accuracy, which is the same as minimizing misclassification = 1 - Accuracy.

Choose W depending on “costs”.

[Figure: curves plotted against Y (0 to 1), vertical axis 0.00 to 0.50]

ROC curve: choose the continuous data cutpoint (threshold) for highest accuracy, best “separation”

“Modern” format for ROC

[Figure: sensitivity, specificity, and accuracy (0% to 100%) versus Yc = threshold (cutpoint), 0.0 to 1.0]

[Figure: unweighted accuracy (0% to 100%) versus Yc = threshold (cutpoint), 0 to 1]

Highest accuracy is NOT necessarily where sens = spec (only when SD1 = SD2).

“Traditional” ROC (not recommended: hard to label cutpoints)

[Figure: traditional ROC; sens (0% to 100%) versus false pos = 1 - spec (0% to 100%)]

C (concordance) statistic for ROC

C = area under the “traditional” ROC curve

0.5 (bad) < C < 1.0 (good)

If nd = a+c = true num w/ disease and nnd = b+d = true num w/o disease, then from all possible nd x nnd pairs with one diseased and one not, call a pair “concordant” if the diseased is positive and the non-diseased is negative. C is the proportion of the pairs that are concordant.
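The pair-counting definition of C can be sketched for a continuous marker. Two assumptions are made here that are not on the slide: a pair counts as concordant when the diseased subject's value is higher than the non-diseased subject's, and tied pairs count as half (a common convention). The marker values are made up for illustration.

```python
from itertools import product


def c_statistic(diseased_scores, healthy_scores):
    """Proportion of (diseased, non-diseased) pairs that are concordant.

    Assumption: higher score means "more positive"; ties count as 0.5.
    """
    pairs = list(product(diseased_scores, healthy_scores))
    if not pairs:
        raise ValueError("need at least one subject in each group")
    concordant = sum(
        1.0 if d > h else 0.5 if d == h else 0.0 for d, h in pairs
    )
    return concordant / len(pairs)


# Hypothetical marker values, not from the lecture.
diseased = [0.9, 0.8, 0.6, 0.4]
healthy = [0.7, 0.5, 0.3, 0.2, 0.1]

c = c_statistic(diseased, healthy)
print(f"{len(diseased) * len(healthy)} pairs, C = {c:.2f}")
```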

Positive and Negative predictive value

Positive predictive value (PPV) & negative predictive value (NPV) depend on sensitivity (sens), specificity (spec) & disease prevalence (P).

Sensitivity and specificity do NOT depend on disease prevalence.

Can only compute PPV = a/(a+b) & NPV = d/(c+d) directly from the table when the disease prevalence is P = (a+c)/(a+b+c+d) = (a+c)/n.

Bayes formulas for PPV and NPV

Let P = prevalence of disease

PPV = test true pos/(test true pos + test false pos) = sens x P / [sens x P + (1 - spec) x (1 - P)]

NPV = test true neg/(test true neg + test false neg) = spec x (1 - P) / [spec x (1 - P) + (1 - sens) x P]

But don't use these formulas; there is an easier way.

Example

                 disease   no disease   Total
Test positive    95        20           115
Test negative    5         1980         1985
Total            100       2000         2100

Sens = 95/100 = 0.95, Spec = 1980/2000 = 0.99

Disease prevalence = P = 100/2100 = 0.0476

PPV = (0.95 x 0.0476) / [0.95 x 0.0476 + 0.01 x 0.9524] = 0.826
PPV = 95/115 = 0.826

NPV = (0.99 x 0.9524) / [0.99 x 0.9524 + 0.05 x 0.0476] = 0.9974
NPV = 1980/1985 = 0.9974
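Both routes to PPV and NPV in this example, the Bayes formulas and the direct table counts, can be compared in code. A minimal sketch with an illustrative function name; the 0.1% prevalence at the end is a hypothetical value added to show how strongly PPV depends on prevalence.

```python
def predictive_values(sens, spec, prevalence):
    """Bayes formulas for (PPV, NPV) given sens, spec and prevalence P."""
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv


# Cell counts from the example table.
a, b, c, d = 95, 20, 5, 1980
sens = a / (a + c)
spec = d / (b + d)
prev = (a + c) / (a + b + c + d)

ppv, npv = predictive_values(sens, spec, prev)
print(f"Bayes:  PPV = {ppv:.3f}, NPV = {npv:.4f}")
print(f"Direct: PPV = {a / (a + b):.3f}, NPV = {d / (c + d):.4f}")

# Same test, hypothetical prevalence of 0.1%: PPV drops sharply.
ppv_rare, _ = predictive_values(sens, spec, 0.001)
print(f"PPV at 0.1% prevalence = {ppv_rare:.3f}")
```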

Bayesian “paradigm” for PPV

                                   Odds of disease                           Probability of disease
Prior                              100/2000 = 0.05                           100/2100 = 0.0476 = 4.76%
Positive test “data”:
  likelihood ratio (LR)            Sensitivity/false pos = 0.95/0.01 = 95    (not applicable)
Posterior given positive test
  = Prior x LR = PPV               0.05 x 95 = 4.75                          4.75/(1+4.75) = 0.826 = 82.6%
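The odds-based update in the table above takes three steps: convert the prior probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch with a hypothetical function name:

```python
def posterior_probability(prior_prob, likelihood_ratio):
    """Update a prior probability with a likelihood ratio via odds."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


sens, spec = 0.95, 0.99
lr_positive = sens / (1.0 - spec)   # sensitivity / false positive rate
prior = 100 / 2100                  # prevalence in the example table

post = posterior_probability(prior, lr_positive)
print(f"LR+ = {lr_positive:.0f}, prior = {prior:.4f}, posterior (PPV) = {post:.3f}")
```

The result matches the PPV of 95/115 obtained directly from the table.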

LR = Prob(+ test | disease)/Prob(+ test | no disease)

Posterior odds = Prior odds x LR

Bayes: Prior -> data -> Posterior

Prior probability is updated with data (LR) to get a posterior probability (PPV)

Bayes paradigm (algebra)

Prior -> (test) data -> Posterior

                 Disease   No disease   Total
Test positive    a         b            a+b
Test negative    c         d            c+d
Total            a+c       b+d          n

n = a+b+c+d
Prior disease risk = (a+c)/n
Prior disease odds = (a+c)/(b+d)

Test data: LR positive test = Sens/false pos = [a/(a+c)]/[b/(b+d)] = RR = LR

Posterior odds of disease = Prior odds x LR pos test = a/b
Posterior disease risk = a/(a+b) = PPV

Ex: FASTER Trial (NEJM 353:19, 10 Nov 2005)

Prior odds of Down's syndrome (varies with gestational age)
↓
LR from biochemical markers (& other factors/data)
↓
Posterior odds of Down's syndrome
