M&M Ch 6 Introduction to Inference OVERVIEW 660... · 2009-06-05 · M&M Ch 6 Introduction to...
M&M Ch 6 Introduction to Inference ... OVERVIEW
Introduction to Inference*

Bayes Theorem: Haemophilia
Brother has haemophilia => Prob(WOMAN is Carrier) = 0.5
New data: her son is normal (NL).
Update: Prob[Woman is Carrier, given her son is NL] = ??
Inference is about parameters (populations) or general mechanisms -- or future observations. It is not about data (samples) per se, although it uses data from samples. One might think of inference as statements about a universe, most of which one did not observe.
[Tree diagram in the original; the recoverable computation:]

1. PRIOR [prior to knowing the status of her son]:
   Prob(WOMAN is Carrier) = 0.5 ; Prob(WOMAN is Not Carrier) = 0.5

2. LIKELIHOOD [Prob son is NL | mother's status]:
   Prob(son NL | Carrier) = 0.5 ; Prob(son NL | Not Carrier) = 1.0

3. Products of PRIOR and LIKELIHOOD:
   Carrier: 0.5 x 0.5 = 0.25 ; Not Carrier: 0.5 x 1.0 = 0.5

   POSTERIOR, given that son is NL (probs scaled to add to 1):
   Prob(Carrier | son NL) = 0.33 ; Prob(Not Carrier | son NL) = 0.67
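The update above can be sketched in a few lines; a minimal sketch, with the prior and likelihood values taken from the tree:

```python
# Bayes update for the haemophilia example.
prior = {"Carrier": 0.5, "Not Carrier": 0.5}
# Likelihood of the observed data (son is normal, NL) under each status:
# a carrier's son is NL with probability 0.5; a non-carrier's son always is.
likelihood_NL = {"Carrier": 0.5, "Not Carrier": 1.0}

products = {s: prior[s] * likelihood_NL[s] for s in prior}   # 0.25, 0.50
total = sum(products.values())                               # 0.75 (scaling factor)
posterior = {s: products[s] / total for s in products}

print(posterior)  # Carrier: 1/3, Not Carrier: 2/3
```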
Two main schools or approaches:
Bayesian [not even mentioned by M&M]
• Makes direct statements about parameters and future observations
• Uses previous impressions plus new data to update impressions about parameter(s)
e.g. everyday life; medical tests: pre- and post-test impressions
Frequentist
• Makes statements about observed data (or statistics from data) (used indirectly [but often incorrectly] to assess evidence against certain values of a parameter)
• Does not use previous impressions or data outside of the current study (meta-analysis is changing this)
e.g.
• Statistical quality control procedures [for decisions]
• Sample survey organizations: confidence intervals
• Statistical tests of hypotheses
Unlike Bayesian inference, there is no quantified pre-test or pre-data "impression"; the ultimate statements are about data, conditional on an assumed null or other hypothesis.

Thus, an explanation of a p-value must start with the conditional: "IF the parameter is ..., the probability that the data would ..."

The book Statistical Inference by Michael W. Oakes is an excellent introduction to this topic and the limitations of frequentist inference.
Bayesian inference for a quantitative parameter

E.g. interpretation of a measurement that is subject to intra-personal (incl. measurement) variation. Say we know the pattern of inter-personal and intra-personal variation. Adapted from Irwig (JAMA 266(12):1678-85, 1991).

MY MEAN CHOLESTEROL µ (know there is substantial intra-personal & measurement variation)

[Figure in original; recoverable content:]
1. PRIOR: p(µ)
2. DATA: one measurement (•) on ME. LIKELIHOOD, i.e. f(• | µ): uses the known model for variation of measurements around µ, evaluated for various values of µ (3 shown: an 'under-estimate'? 'on target'? an 'over-estimate'?)
3. POSTERIOR for µ: P(µ | •), the product of PRIOR and LIKELIHOOD (scaled). The posterior is a composite of the prior and the data (•).

Bayesian inference ... in general

• Interest in a parameter θ.
• Have prior information concerning θ in the form of a prior distribution with probability density function p(θ). [To distinguish, might use lower case p for the PRIOR.]
• Obtain new data x (x could be a single piece of information or more complex).
The likelihood of the data for any contemplated value of θ is given by

    L[ x | θ ] = Prob[ x | θ ]

The posterior probability for θ, GIVEN x, is calculated as:

    P( θ | x ) = L[ x | θ ] p(θ) / ∫ L[ x | θ ] p(θ) dθ

[To distinguish, might use UPPER CASE P for the POSTERIOR.] The denominator is a summation/integration (the ∫ sign) over the range of θ and serves as a scaling factor that makes P(θ | x) sum to 1.
In the Bayesian approach, post-data statements of uncertainty about θ are made directly from the function P( θ | x ).
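The general recipe (posterior ∝ prior × likelihood, rescaled by the integral in the denominator) can be sketched numerically on a grid of θ values; all the specific numbers below (prior mean and SD, likelihood SD, the observation x) are illustrative assumptions, not from the notes:

```python
import numpy as np

# Grid approximation: the ∫ L[x|θ] p(θ) dθ denominator becomes a sum.
theta = np.linspace(2, 10, 2001)                      # grid over θ
d = theta[1] - theta[0]                               # grid spacing
prior = np.exp(-0.5 * ((theta - 5.0) / 1.0) ** 2)     # Gaussian prior: mean 5, SD 1 (assumed)
x = 6.2                                               # one new observation (assumed)
likelihood = np.exp(-0.5 * ((x - theta) / 0.8) ** 2)  # Gaussian L[x | θ], SD 0.8 (assumed)

unscaled = prior * likelihood
posterior = unscaled / (unscaled.sum() * d)           # divide by the scaling factor

print((posterior * d).sum())                          # integrates to 1 by construction
print(theta[posterior.argmax()])                      # mode lies between prior mean and x
```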
Re: previous 2 examples of Bayesian inference

Cholesterol example
θ = my mean cholesterol level.
p[θ] = ? In the absence of any knowledge about me, one would have to take as a prior the distribution of mean cholesterol levels for the population of my age and sex.
x = one cholesterol measurement on me.
Assume that if a person's mean is θ, the variation around θ would be Gaussian with standard deviation σw. (The Bayesian argument does not insist on Gaussian-ness.) So L[ X = x | my mean is θ ] is obtained by evaluating the height of the Gaussian(θ, σw) curve at X = x, and

    P[θ | X = x] = L[ X = x | θ ] p[θ] / ∫ L[ X = x | θ ] p[θ] dθ

If intra-individual variation is Gaussian with SD σw and the prior is Gaussian with SD σb [b for between individuals], then the mean of the posterior distribution is a weighted average of the prior mean and x, with weights inversely proportional to the squares of σb and σw respectively. So, the less the intra-individual and lab variation, the more the posterior is dominated by the measurement x on the individual --- and vice versa.

Haemophilia example
θ = possible status of the woman: "Carrier" or "Not a carrier".
p(θ = Carrier) = 0.5 ; p(θ = Not a Carrier) = 0.5.
x = status of son.
L[ x = Normal | Woman is Carrier ] = 0.5
L[ x = Normal | Woman is Not Carrier ] = 1

    P(θ = Carrier | x = Normal)
      = L[x=N | θ=C] p[θ=C] / { L[x=N | θ=C] p[θ=C] + L[x=N | θ=Not C] p[θ=Not C] }

[equation for the predictive value of a diagnostic test with binary results]
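For the Gaussian-prior, Gaussian-likelihood case just described, the posterior mean is available in closed form as a precision-weighted average; a sketch, in which the prior mean, σb and σw values are assumptions for illustration:

```python
# Closed-form posterior for the Gaussian prior / Gaussian likelihood case.
def normal_posterior(prior_mean, sd_between, x, sd_within):
    """Posterior mean = average of prior mean and x, weighted by 1/SD^2."""
    w_prior = 1.0 / sd_between**2   # weight on the prior mean
    w_data = 1.0 / sd_within**2     # weight on the individual's measurement
    post_mean = (w_prior * prior_mean + w_data * x) / (w_prior + w_data)
    post_sd = (w_prior + w_data) ** -0.5
    return post_mean, post_sd

# Hypothetical values: population mean 5.2, between-person SD 1.0,
# within-person/lab SD 0.6, one measurement of 6.5 on me.
m, s = normal_posterior(prior_mean=5.2, sd_between=1.0, x=6.5, sd_within=0.6)
print(m, s)  # mean lies between 5.2 and 6.5, closer to the measurement
```

With a small σw (precise lab, stable person), the weight on x dominates, exactly as the text says.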
M&M Ch 6.1 Introduction to Inference ... Estimating with Confidence
(Frequentist) Confidence Interval (CI) or Interval Estimate for a parameter

Formal definition:
A level 1 − α confidence interval for a parameter θ is given by two statistics, θ_Lower and θ_Upper, such that, when θ is the true value of the parameter,

    Prob( θ_Lower ≤ θ ≤ θ_Upper ) = 1 − α

Large-sample CI's
Many large-sample CI's are of the form

    θ̂ ± multiple of SE(θ̂),   or   f⁻¹[ f(θ̂) ± multiple of SE( f(θ̂) ) ],

where f is some function of θ̂ which has close to a Gaussian distribution, and f⁻¹ is the inverse function (another motivation is variance stabilization; cf. A&B ch. 11). Examples of the latter:
  θ = odds ratio: f = ln ; f⁻¹ = exp
  θ = proportion π: f = arcsine, f⁻¹ its reverse; or f = logit, f⁻¹ = exp(•)/[1 + exp(•)]

• A CI is a statistic: a quantity calculated from a sample.
• Usually one uses α = 0.01, 0.05 or 0.10, so that the "level of confidence", 1 − α, is 99%, 95% or 90%. We will also use "α" for tests of significance (there is a direct correspondence between confidence intervals and tests of significance).
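As a sketch of the f / f⁻¹ recipe, here is a 95% CI for an odds ratio computed on the ln scale and back-transformed with exp. The 2×2 counts are hypothetical, and the SE is Woolf's formula √(1/a + 1/b + 1/c + 1/d):

```python
import math

# Hypothetical 2x2 table: exposed/unexposed cases (a, c) and controls (b, d).
a, b, c, d = 20, 80, 10, 90
or_hat = (a * d) / (b * c)                      # point estimate of the odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)    # Woolf's SE on the ln scale

z = 1.96                                        # multiple for 95% confidence
lo = math.exp(math.log(or_hat) - z * se_log_or)
hi = math.exp(math.log(or_hat) + z * se_log_or)
print(or_hat, (lo, hi))                         # interval is asymmetric around or_hat
```

Note the back-transformed interval is not symmetric about the point estimate, which is the point of working on the f scale.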
Method of constructing a 100(1 − α)% CI (in general):

[Figure in original; recoverable content: the (point) estimate shown with sampling distributions drawn at candidate parameter values θ_Lower and θ_Upper -- is the estimate an "over"-estimate? an "under"-estimate?]

NB: the shapes of the distributions may differ at the 2 limits and thus yield asymmetric limits: see e.g. the CI for π, based on the binomial. Notice also the use of the concept of tail area (p-value) to construct the CI.
• Technically, we should say that we are using a procedure which is guaranteed to cover the true θ in a fraction 1 − α of applications. If we were not fussy about the semantics, we might say that any particular CI has a 1 − α chance of covering θ.
• For a given amount of sample data, the narrower the interval from L to U, the lower the degree of confidence in the interval, and vice versa.

The meaning of a CI is often "massacred" in the telling... users slip into Bayesian-like statements without realizing it.
STATISTICAL CORRECTNESS
The frequentist CI (a statistic) is the SUBJECT of the sentence (speak of the long-run behaviour of CI's).
In Bayesian parlance, the parameter is the SUBJECT of the sentence [speak of where parameter values lie].
The book Statistical Inference by Oakes is good here.
SD's* for "Large Sample" CI's for specific parameters

parameter    estimate    SD*(estimate)
θ            θ̂           SD(θ̂)
µx           x̄           σx / √n
µ∆X          d̄           σd / √n
π            p           √( π[1−π] / n )
µ1 − µ2      x̄1 − x̄2     √( σ1²/n1 + σ2²/n2 )
π1 − π2      p1 − p2     √( π1[1−π1]/n1 + π2[1−π2]/n2 )

The last two are of the form √( SD1² + SD2² ).

* In practice, measures of individual (unit) variation about θ {e.g. σx, √(π[1−π]), ...} are replaced by estimates (e.g. sx, √(p[1−p]), ...) calculated from the sample data, and if sample sizes are small, adjustments are made to the "multiple" used in the multiple of SD(θ̂). To denote the "substitution", some statisticians and texts use the term "SE" rather than SD; others (e.g. Colton, Armitage and Berry) use the term SE for the SD of a statistic -- whether one 'plugs in' SD estimates or not. Notice that M&M delay using SE until p504 of Ch 7.

Semantics behind a Confidence Interval (e.g.)

Probability is 1 − α that x̄ falls within z_{α/2} [or t_{α/2}] SD(x̄) of µx (see footnote 1):

    Pr{ µx − z_{α/2} SD(x̄) ≤ x̄ ≤ µx + z_{α/2} SD(x̄) } = 1 − α
    Pr{ − z_{α/2} SD(x̄) ≤ x̄ − µx ≤ + z_{α/2} SD(x̄) } = 1 − α
    Pr{ + z_{α/2} SD(x̄) ≥ µx − x̄ ≥ − z_{α/2} SD(x̄) } = 1 − α
    Pr{ x̄ + z_{α/2} SD(x̄) ≥ µx ≥ x̄ − z_{α/2} SD(x̄) } = 1 − α
    Pr{ x̄ − z_{α/2} SD(x̄) ≤ µx ≤ x̄ + z_{α/2} SD(x̄) } = 1 − α

... which is then (mis)read as: probability is 1 − α that µx "falls" within z_{α/2} SD(x̄) of x̄ (see footnote 2).

1 This is technically correct, because the subject of the sentence is the statistic x̄. The statement is about the behaviour of x̄.

2 This is technically incorrect, because the subject of the sentence is the parameter µX. In the Bayesian approach the parameter is the subject of the sentence. In the special case of "prior ignorance" [e.g. if one had just arrived from Mars], the incorrectly stated frequentist CI is close to a Bayesian statement based on the posterior density function p(µX | data). Technically, we are not allowed to "switch" from one to the other [it is not like saying "because St Lambert is 5 Km from Montreal, thus Montreal is 5 Km from St Lambert"]. Here µX is regarded as a fixed (but unknowable) constant; it doesn't "fall" or "vary around" any particular spot; in contrast, you can think of the statistic x̄ "falling" or "varying around" the fixed µX.
Clopper-Pearson 95% CI for π based on the observed proportion 4/12

[Figure in original; recoverable content: two binomial(n = 12) distributions, one at π_upper = 0.65 and one at π_lower = 0.10, each placing a tail area of P = 0.025 at the observed count of 4. See the similar graph in Fig 4.5, p120 of A&B.]

Notice that Prob[4] is counted twice, once in each tail.

The use of CI's based on mid-P values, where Prob[4] is counted only once, is discussed in Miettinen's Theoretical Epidemiology and in §4.7 of Armitage and Berry's text.
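The two limits can be obtained from the standard beta-quantile form of the Clopper-Pearson tail-area equations; a sketch using SciPy:

```python
from scipy.stats import beta

# Clopper-Pearson ("exact") CI for a binomial proportion: the limits are
# quantiles of beta distributions, the standard closed form of the
# equal-tail-area equations.
def clopper_pearson(x, n, alpha=0.05):
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

lo, hi = clopper_pearson(4, 12)
print(round(lo, 2), round(hi, 2))  # ≈ 0.10 and 0.65, as in the figure
```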
Constructing a Confidence Interval for µ

Assumptions & steps (simplified for didactic purposes)

(1) Assume (for now) that we know the SD (σ) of the Y values in the population. If we don't know it, suppose we take a "conservative" or larger-than-necessary estimate of σ.

(2) Assume also that either (a) the Y values are normally distributed, or (b) (if not) n is large enough that the Central Limit Theorem guarantees that the distribution of possible y̅'s is well enough approximated by a Gaussian distribution.

(3) Choose the degree of confidence (say 90%).

(4) From a table of the Gaussian distribution, find the z value such that 90% of the distribution is between −z and +z. Some 5% of the Gaussian distribution is above, or to the right of, z = 1.645, and a corresponding 5% is below, or to the left of, z = −1.645.

(5) Compute the interval y̅ ± 1.645 SD(y̅), i.e. y̅ ± 1.645 σ/√n.

Warning: Before observing y̅, we can say that there is a 90% probability that the y̅ we are about to observe will be within ±1.645 SD(y̅)'s of µ. But, after observing y̅, we cannot reverse this statement and say that there is a 90% probability that µ is in the calculated interval.

We can say that we are USING A PROCEDURE IN WHICH SIMILARLY CONSTRUCTED CI's "COVER" THE CORRECT VALUE OF THE PARAMETER (µ in our example) 90% OF THE TIME. The term "confidence" is a statistical legalism to indicate this semantic difference.

Polling companies who say "polls of this size are accurate to within so many percentage points 19 times out of 20" are being statistically correct -- they emphasize the procedure rather than what has happened in this specific instance. Polling companies (or reporters) who say "this poll is accurate ... 19 times out of 20" are talking statistical nonsense -- this specific poll is either "right" or "wrong"! On average, 19 polls out of 20 are "correct". But this poll cannot be right on average 19 times out of 20!
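The "procedure, not this interval" semantics can be illustrated by simulation: construct the interval in step (5) over and over and count how often it covers µ. The values of µ, σ and n below are arbitrary illustration choices:

```python
import random

# Coverage simulation for the 90% interval ybar ± 1.645·σ/√n.
random.seed(1)
mu, sigma, n = 50.0, 10.0, 25      # "true" population values (illustrative)
trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = sum(sample) / n
    half_width = 1.645 * sigma / n ** 0.5
    if ybar - half_width <= mu <= ybar + half_width:
        covered += 1

print(covered / trials)  # close to 0.90 -- a statement about the procedure
```

Any single interval either covers µ or it doesn't; the 90% describes the long run of applications.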
M&M Ch 6.2 Introduction to Inference ... Tests of Significance
(Frequentist) Tests of Significance

Use: to assess the evidence provided by sample data in favour of a pre-specified claim or 'hypothesis' concerning some parameter(s) or data-generating process. As with confidence intervals, tests of significance make use of the concept of a sampling distribution.

Example 1 (see R. A. Fisher, Design of Experiments, Chapter 2): the lady tasting tea, below.

Example 2
In 1949 a divorce case was heard in which the sole evidence of adultery was that a baby was born almost 50 weeks after the husband had gone abroad on military service. [Preston-Jones vs. Preston-Jones, English House of Lords]

To quote the court: "The appeal judges agreed that the limit of credibility had to be drawn somewhere, but on medical evidence 349 (days), while improbable, was scientifically possible." So the appeal failed.

In the U.S. [Lockwood vs. Lockwood, 19??], a 355-day pregnancy was found to be 'legitimate'.

Pregnancy duration: 17000 cases > 27 weeks (quoted in Guttmacher's book)
[Figure in original: histogram of % of births by week of gestation, weeks 28-46; y-axis 0-30%.]
STATISTICAL TEST OF SIGNIFICANCE

LADY CLAIMS SHE CAN TELL WHETHER MILK WAS POURED FIRST OR MILK WAS POURED SECOND.

BLIND TEST: 8 cups of tea, 4 with the milk poured first and 4 with the milk poured second; the lady must say which 4 are which.

Possible results (what the lady says vs the truth), with the probability of each result if she is just guessing:

  4 of the 'milk first' cups correct, 0 wrong :  1/70
  3 correct, 1 wrong                          : 16/70
  2 correct, 2 wrong                          : 36/70
  1 correct, 3 wrong                          : 16/70
  0 correct, 4 wrong                          :  1/70
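The guessing probabilities come from the hypergeometric count C(4,k)·C(4,4−k)/C(8,4); a quick check:

```python
from math import comb

# Guessing distribution for the tea-tasting design: 8 cups, 4 of each kind,
# and the lady must pick the 4 'milk first' cups. If she guesses, the number
# k she gets right is hypergeometric.
total = comb(8, 4)  # 70 equally likely selections of 4 cups out of 8
probs = {k: comb(4, k) * comb(4, 4 - k) / total for k in range(5)}
print(probs)        # k = 0..4 -> 1/70, 16/70, 36/70, 16/70, 1/70

p_value = probs[4]  # P(all 4 correct | just guessing)
print(p_value)      # 1/70 ≈ 0.014
```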
Other Examples:
3. Quality control (it has given us terminology)
4. Taste-tests (see exercises)
5. Adding water to milk: see M&M2 Example 6.6 p448
6. Water divining: see M&M2 exercise 6.44 p471
7. Randomness of the U.S. draft lottery of 1970: see M&M2 Example 6.6 p105-107, and 447-
8. Births in New York City after the "Great Blackout"
9. John Arbuthnot's "argument for divine providence"
10. US presidential elections: taller vs. shorter candidate.
Elements of a Statistical Test (with the Preston-Jones case as a running example)

The ingredients and the methods of procedure in a statistical test are:

1. A claim about a parameter (or about the shape of a distribution, or the way a lottery works, etc.). Note that the null and alternative hypotheses are usually stated using Greek letters, i.e. in terms of population parameters, and in advance of (and indeed without any regard for) sample data. [Some have been known to write hypotheses of the form H: y̅ = ..., thereby ignoring the fact that the whole point of statistical inference is to say something about the population in general, and not about the sample one happens to study. It is worth remembering that statistical inference is about the individuals one DID NOT study, not about the ones one did. This point is brought out in the absurdity of a null hypothesis that states that in a triangle taste test, exactly p = 0.333... of the n = 10 individuals to be studied will correctly identify the one of the three test items that is different from the two others.]

   [Preston-Jones: parameter (unknown) = DATE OF CONCEPTION. Claim about the parameter:
      H0: DATE ≤ date HUSBAND LEFT (use '=' as the 'best case')
      Ha: DATE > date HUSBAND LEFT]

2. A probability model (in its simplest form, a set of assumptions) which allows one to predict how a relevant statistic from a sample of data might be expected to behave under H0.

   [Preston-Jones: a probability model for the statistic -- Gaussian? Empirical?]

3. A probability level or threshold or dividing point below which (i.e. close to a probability of zero) one considers that an event with this low probability 'is unlikely' or 'is not supposed to happen in a single trial' or 'just doesn't happen'. This pre-established limit of extreme-ness is referred to as the "α (alpha) level" of the test.

   [Preston-Jones: an (a priori) "limit of extreme-ness" relative to H0 -- for the judge to decide. Note that extreme-ness is measured as a conditional probability, not in days.]
Elements of a Statistical Test (continued)

4. A sample of data, which under H0 is expected to follow the probability laws in (2).

   [Preston-Jones: the data are the date of delivery.]

5. The most relevant statistic (e.g. y̅ if interested in inference about the parameter µ).

   [Preston-Jones: the date of delivery; the same as the raw data, since n = 1.]

6. The probability of observing a value of the statistic as extreme or more extreme (relative to that hypothesized under H0) than we observed. This is used to judge whether the value obtained is either 'close to', i.e. 'compatible with', or 'far away from', i.e. 'incompatible with', H0. The 'distance from what is expected under H0' is usually measured as a tail area or probability and is referred to as the "P-value" of the statistic in relation to H0.

   [Preston-Jones: P-value = upper tail area: Prob[349 or 350 or 351 ...]: quite small.]

7. A comparison of this "extreme-ness" or "unusualness" or "amount of evidence against H0" or P-value with the agreed-on "threshold of extreme-ness". If it is beyond the limit, H0 is said to be "rejected". If it is not-too-small, H0 is "not rejected". These two possible decisions about the claim are reported as "the null hypothesis is rejected at the P = α significance level" or "the null hypothesis is not rejected at a significance level of 5%".

   [Preston-Jones: the judge didn't tell us his threshold, but it must have been smaller than that calculated in 6.]

Note: the p-value does not take into account any other 'facts', prior beliefs, testimonials, etc. in the case. But the judge probably used them in his overall decision (just like the jury did in the OJ case).
M&M Ch 6.3 and 6.4 Introduction to Inference ... Use and Misuse of Statistical Tests
"Operating" Characteristics of a Statistical Test

As with diagnostic tests, there are 2 ways a statistical test can be wrong:

1) The null hypothesis was in fact correct, but the sample was genuinely extreme, and the null hypothesis was therefore (wrongly) rejected.

2) The alternative hypothesis was in fact correct, but the sample was not incompatible with the null hypothesis, and so it was not ruled out.

The quantities (1 − β) and (1 − α) are the "sensitivity (power)" and "specificity" of the statistical test. Statisticians usually speak instead of the complements of these probabilities, the false positive fraction (α) and the false negative fraction (β), as "Type I" and "Type II" errors respectively. [It is interesting that those involved in diagnostic tests emphasize the correctness of the test results, whereas statisticians seem to dwell on the errors of the tests; they have no term for 1 − α.]

The probabilities of the various test results can be put in the same type of 2x2 table used to show the characteristics of a diagnostic test:

                          Result of Statistical Test
                          "Negative"             "Positive"
                          (do not reject H0)     (reject H0 in favour of Ha)
  TRUTH    H0             1 − α                  α
           Ha             β                      1 − β

Note that all of these probabilities start with (i.e. are conditional on knowing) the truth. This is exactly analogous to the use of sensitivity and specificity of diagnostic tests to describe the performance of the tests, conditional on (i.e. given) the truth. As such, they describe performance in a "what if" or artificial situation, just as sensitivity and specificity are determined under 'lab' conditions.

So, just as we cannot interpret the result of a Dx test simply on the basis of sensitivity and specificity, likewise we cannot interpret the result of a statistical test in isolation from what one already thinks about the null/alternative hypotheses.
Interpretation of a "positive statistical test"

It should be interpreted in the same way as a "positive diagnostic test", i.e. in the light of the characteristics of the subject being examined. The lower the prevalence of disease, the lower is the post-test probability that a positive diagnostic test is a "true positive". Similarly with statistical tests. We are now no longer speaking of sensitivity = Prob( test + | Ha ) and specificity = Prob( test − | H0 ) but rather, the other way round, of Prob( Ha | test + ) and Prob( H0 | test − ), i.e. of positive and negative predictive values, both of which involve the "background" from which the sample came.

The influence of "background" is easily understood if one considers an example such as a testing program for potential chemotherapeutic agents. Assume a certain proportion P are truly active, and that statistical testing of them uses Type I and Type II errors of α and β respectively. A certain proportion of all the agents will test positive, but what fraction of these "positives" are truly positive? It obviously depends on α and β, but it also depends in a big way on P, as is shown below for the case of α = 0.05, β = 0.2.

  P -->                 0.001     .01      .1      .5
  TP = P(1 − β) -->    .00080   .0080    .080    .400
  FP = (1 − P)(α) -->  .04995   .0495    .045    .025
  Ratio TP : FP -->    ≈ 1:62   ≈ 1:6   ≈ 2:1   ≈ 16:1

Note that the post-test odds TP : FP is

    P(1 − β) : (1 − P)(α) = { P : (1 − P) } x [ (1 − β) / α ]
                          = PRIOR odds x function of the TEST's characteristics

i.e. it has the form of a "prior odds" P : (1 − P), the "background" of the study, multiplied by a "likelihood ratio positive" which depends only on the characteristics of the statistical test. The text by Oakes is helpful here.

A Popular Misapprehension: It is not uncommon to see or hear seemingly knowledgeable people state that

    "the P-value (or alpha) is the probability of being wrong if, upon observing a statistically significant difference, we assert that a true difference exists."

But if one follows the analogy with diagnostic tests, this statement is like saying that "1-minus-specificity is the probability of being wrong if, upon observing a positive test, we assert that the person is diseased". We know [from dealing with diagnostic tests] that we cannot turn Prob( test | H ) into Prob( H | test ) without some knowledge about the unconditional or a-priori Prob( H )'s.

Glantz (in his otherwise excellent text) and Brown (Am J Dis Child 137: 586-591, 1983 -- on reserve) are two authors who have made statements like this. For example, Brown, in an otherwise helpful article, says (italics and strike-through by JH):

    "In practical terms, the alpha of .05 means that the researcher, during the course of many such decisions, accepts being wrong one in about every 20 times that he thinks he has found an important difference between two sets of observations." 1

1 [Incidentally, there is a second error in this statement: it has to do with equating a "statistically significant" difference with an important one... minute differences in the means of large samples will be statistically significant.]
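The table's TP : FP arithmetic, and its factorization into prior odds × likelihood-ratio-positive, can be checked directly:

```python
# Post-test odds TP:FP for the chemotherapeutic-agent screening example,
# with alpha = 0.05 and beta = 0.2 as in the table above.
alpha, beta = 0.05, 0.20
lr_positive = (1 - beta) / alpha        # "likelihood ratio positive" = 16

ratios = {}
for P in (0.001, 0.01, 0.1, 0.5):
    tp = P * (1 - beta)                 # truly active AND test positive
    fp = (1 - P) * alpha                # truly inactive AND test positive
    ratios[P] = tp / fp                 # post-test odds TP:FP
    # same value via prior odds x LR+:
    assert abs(ratios[P] - (P / (1 - P)) * lr_positive) < 1e-12
    print(f"P = {P}: TP = {tp:.5f}, FP = {fp:.5f}, TP:FP = {ratios[P]:.3f}")
```

At P = 0.5 the odds are 16:1, but at P = 0.001 a "significant" result is about 62 times more likely to be a false positive than a true one.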
"SIGNIFICANCE"

The difference between two treatments is 'statistically significant' if it is sufficiently large that it is unlikely to have arisen by chance alone. The level of significance is the probability of such a large difference arising in a trial when there really is no difference in the effects of the treatments. (But the lower the probability, the less likely it is that the difference is due to chance, and so the more highly significant is the finding.)

[notes prepared by FDK Liddell, ~1970]

• Statistical significance does not imply clinical importance.
• Even a very unlikely (i.e. highly significant) difference may be unimportant.
• Non-significance does not mean no real difference exists.
• A significant difference is not necessarily reliable.
• Statistical significance is not proof that a real difference exists.
• There is no 'God-given' level of significance. What level would you require before being convinced:
  a) to use a drug (without side effects) in the treatment of lung cancer?
  b) that effects on the foetus are excluded in a drug which depresses nausea in pregnancy?
  c) to go on to a second stage of a series of experiments with rats?
• Each statistical test (i.e. calculation of a level of significance, or of the unlikelihood of an observed difference) must be strictly independent of every other such test. Otherwise, the calculated probabilities will not be valid. This rule is often ignored by those who:
  - measure more than one response in each subject
  - have more than two treatment groups to compare
  - stop the experiment at a favourable point.

And then, even if the cure should be performed, how can he be sure that this was not because the illness had reached its term, or a result of chance, or the effect of something else he had eaten or drunk or touched that day, or the merit of his grandmother's prayers? Moreover, even if this proof had been perfect, how many times was the experiment repeated? How many times was the long string of chances and coincidences strung again for a rule to be derived from it?
    Michel de Montaigne, 1533-1592

The same arguments which explode the Notion of Luck may, on the other side, be useful in some Cases to establish a due comparison between Chance and Design. We may imagine Chance and Design to be as it were in Competition with each other for the production of some sorts of Events, and may calculate what Probability there is, that those Events should be rather owing to one than to the other... From this last Consideration we may learn in many Cases how to distinguish the Events which are the effect of Chance, from those which are produced by Design.
    Abraham de Moivre, 'Doctrine of Chances' (1719)

If we... agree that an event which would occur by chance only once in (so many) trials is decidedly 'significant', in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.
    R. A. Fisher, 'The Design of Experiments' (first published 1935)
The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)
Below are some previous students' answers to questions from the 2nd edition of Moore and McCabe, Chapter 6. For each answer, say whether the statement/explanation is correct and why.
Question 6.2
A New York Times poll on women's issues interviewed 1025 women and 472 men randomly selected from the United States excluding Alaska and Hawaii. The poll found that 47% of the women said they do not get enough time for themselves.
(a) The poll announced a margin of error of ±3 percentage points for 95% confidence in conclusions about women. Explain to someone who knows no statistics why we can't just say that 47% of all adult women do not get enough time for themselves.
(b) Then explain clearly what "95% confidence" means.
(c) The margin of error for results concerning men was ±4 percentage points. Why is this larger than the margin of error for women?

Students' answers, with comments:

1. "True value will be between 43 & 50% in 95% of repeated samples of same size."
   • No. The estimate will be between µ − margin & µ + margin in 95% of samples.

2. "Pollsters say their survey method has a 95% chance of producing a range of percentages that includes π."
   • Good. Emphasize average performance in repeated applications of the method.

3. "If this same poll were repeated many times, then 95 of every 100 such polls would give a range that included 47%."
   • No! See 1.

4. "You're pretty sure that the true percentage π is within 3% of 47%. '95% confidence' means that 95% of the time, a random poll of this size will produce results within 3% of π."
   • Bayesians would object (and rightly so!) to this use of the "true parameter" as the subject of the sentence. They would insist you use the statistic as the subject of the sentence and the parameter as the object.

5. "If (we) took 100 different samples, in 95% of cases the sample proportion will be between 44% and 50%."
   • NO! The sample proportion will be between truth − 3% & truth + 3% in 95% of them.

6. "With this one sample taken, we are sure 95 times out of 100 that 41-53% of the women surveyed do not get enough time for themselves."
   • NO. 95/100 times the estimate will be within 3% of π, i.e. the estimate will be in the interval π − margin to π + margin. The method used gives correct results 95% of the time.

7. "In 95 of 100 comparable polls, expect 44 - 50% of women will give the same answer."
   • NO. Same answer? As what?

8. "Using the poll procedure in which the CI or rather the true percentage is within +/- 3, you cover the true percentage 95% of the times it is applied."
   • A bit muddled... but "correct in 95% of applications" is accurate.

9. "Confident that a poll (such) as this one would have measured correctly that the true proportion lies between in 95%."
   • ??? [I have trouble parsing this!] In 95% of applications/uses, polls like these come within ±3% of the truth.

10. "95% chance that the info is correct for between 44 and 50% of women."
    • ??? The 95% confidence is in the procedure that produced the interval 44-50.

11. "95% confidence -> 95% of the time the proportion given is the good proportion (if we interviewed other groups)."
    • "Correct in 95% of applications." Good to connect the 95% with the long run, not specifically with this one estimate. Always ask yourself: what do I mean by "95% of the time"? If you substitute "applications" for "time", it becomes clearer.

12. "It means that 47% give or take 3% is an accurate estimate of the population mean 19 times out of 20 such samplings."
    • ??? 95% of applications of the CI give the correct answer. How can the same interval 47% ± 3 be accurate in 19 but not in the other 1?

"Given a parameter, we are 95% sure that the mean of this parameter falls in a certain interval."
   • We are not given a parameter (ever). If we were, we wouldn't need this course! "Mean of a parameter" makes no sense in frequentist inference.

Q6.4: "This result is trustworthy 19 times out of 20."
   • ??? "This" result: cf. the distinction between "my operation is successful 19 times out of 20..." and "operations like the one to be done on me are successful 19 times out of 20".

"95% of all samples we could select would give intervals between 8669 and 8811."
   • Surely not!
Question 6.18 4 Interval of true values ranges b/w27% + 33%.
• ??? There is only one true value.AND, it isn't 'going' or 'ranging' or'moving' anywhere!The Gallup Poll asked 1571 adults what they considered to be the most serious
problem facing the nation's public schools; 30% said drugs. This sample percentis an estimate of the percent of all adults who think that drugs are the schools'most serious problem. The news article reporting the poll result adds, "The pollhas a margin of error -- the measure of its statistical accuracy -- of three percentagepoints in either direction; aside from this imprecision inherent in using a sample torepresent the whole, such practical factors as the wording of questions can affecthow closely a poll reflects the opinion of the public in general" (The New YorkTimes, August 31, 1987).
5 Confident that in repeated samplesestimate would fall in this range95/100 times.
• NO. Estimate falls within 3% of π in95% of applications
6 95% of intervals will contain the true parameter value and 5% will not. Cannot know whether the result of applying a CI to a particular set of data is correct.
• GOOD. Say "Cannot know whether a CI derived from a particular set of data is correct." We know about the behaviour of the procedure! If not from Mars (i.e. if you use past info), you might be able to bet more intelligently on whether it does or not.
[Textbook, continued:] The Gallup Poll uses a complex multistage sample design, but the sample percent has approximately a normal distribution. Moreover, it is standard practice to announce the margin of error for a 95% confidence interval unless a different confidence level is stated.
7 In 1/20 times, the question will yield answers that do not fall into this interval.
• No. In 5% of applications, the estimate will be more than 3% away from the true answer. See 1, 2, 3 above.
a The announced poll result was 30% ± 3%. Can we be certain that the true population percent falls in this interval?
b Explain to someone who knows no statistics what the announced result 30% ± 3% means.
8 This type of poll will give an estimate of 27 to 33% 19 times out of 20.
• NO. It won't give 27 ± 3 in 19/20 polls. The estimate will be within ±3 of the truth in 19/20 applications.
9 5% risk that µ is not in this interval.
• ??? If an after-the-fact statement, somewhat inaccurate.
c This confidence interval has the same form we have met earlier: estimate ± Z* σestimate.
(Actually σestimate is estimated from the data, but we ignore this for now.)
What is the standard deviation σestimate of the estimated percent?
10 95 out of 100 times when doing the calculations the result 27-33% would appear.
• No it wouldn't. See 1, 2, 3, 7.
11 95% prob the computed interval will cover the parameter.
• Accurate if viewed as a prediction.
d Does the announced margin of error include errors due to practical problems such as undercoverage and nonresponse?
12 The true popl'n mean will fall within the interval 27-33 in 95% of samples drawn.
• NO. The true popl'n mean will not "fall" anywhere. It's a fixed, unknowable constant. Estimates may fall around it.
1 This means that the population result will be between 27% and 33% 19/20 times.
• NO! The population result is wherever it is and it doesn't move. Think of it as if it were the speed of light.
2 95% of the time the actual truth will be between 30 ± 3%, and 5% of the time it will be false.
• It either is or it isn't … the truth doesn't vary over samplings.
3 If this poll were repeated very many times, then 95 of 100 intervals would include 30%.
• NO. 95% of polls give an answer within 3% of the truth, NOT within 3% of the mean in this sample.
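The frequentist reading that these comments keep insisting on ("the procedure is right 95% of the time; any one interval either covers π or it doesn't") can be checked by simulation. The sketch below (not part of the notes) assumes a true π of 0.30 and the Gallup n of 1571, and counts how often the usual 95% interval covers the fixed parameter:

```python
# Simulation sketch: the interval moves from poll to poll; the parameter stays put.
# Assumed for illustration: true pi = 0.30, n = 1571 (the Gallup setting above).
import math
import random

def coverage(pi: float, n: int, reps: int, seed: int = 42) -> float:
    """Fraction of simulated polls whose 95% CI covers the fixed parameter pi."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        x = sum(rng.random() < pi for _ in range(n))   # number answering "drugs"
        p_hat = x / n
        margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # ~ "3 percentage points"
        if p_hat - margin <= pi <= p_hat + margin:
            covered += 1
    return covered / reps

cov = coverage(0.30, 1571, 2000)
print(cov)   # close to 0.95 -- it is the *procedure* that is 95% reliable
```

Note that the check is about the long-run behaviour of the method, exactly as in comments 1-3 above; it says nothing about whether any one particular interval (e.g. 27-33%) is among the lucky 95%.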
Question 6.22
In each of the following situations, a significance test for a population mean µ is called for. State the null hypothesis Ho and the alternative hypothesis Ha in each case.
a Experiments on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 18 seconds for one particular maze. A researcher thinks that a loud noise will cause the mice to complete the maze faster. She measures how long each of 10 mice takes with a noise as stimulus.
3 Ho: a loud noise has no effect on the rapidity of the mouse to find its way through the maze.
• OK if being generic, but not if it makes a prediction about a specific mouse (it sounds like this student was talking about a specific mouse). H is about the mean of a population, i.e. about mice (plural). It is not about the 10 mice in the study!
b The examinations in a large accounting class are scaled after grading so that the mean score is 50. A self-confident teaching assistant thinks that his students this semester can be considered a sample from the population of all students he might teach, so he compares their mean score with 50.
c A university gives credit in French language courses to students who pass a placement test. The language department wants to know if students who get credit in this way differ in their understanding of spoken French from students who actually take the French courses. Some faculty think the students who test out of the courses are better, but others argue that they are weaker in oral comprehension. Experience has shown that the mean score of students in the courses on a standard listening test is 24. The language department gives the same listening test to a sample of 40 students who passed the credit examination to see if their performance is different.

Question 6.24
A randomized comparative experiment examined whether a calcium supplement in the diet reduces the blood pressure of healthy men. The subjects received either a calcium supplement or a placebo for 12 weeks. The statistical analysis was quite complex, but one conclusion was that "the calcium group had lower seated systolic blood pressure (P=0.008) compared with the placebo group." Explain this conclusion, especially the P-value, as if you were speaking to a doctor who knows no statistics. (From R.M. Lyle et al., "Blood pressure and metabolic effects of calcium supplement in normotensive white and black men," Journal of the American Medical Association, 257 (1987), pp. 1772-1776.)
1 The P-value is a probability: "P=0.008" means 0.8%. It is the probability, assuming the null hypothesis is true, that a sample (similar in size and characteristics to the one in the study) would have an average BP this far (or further) below the placebo group's average BP. In other words, if the null hypothesis is really true, what's the chance 2 groups of subjects would have results this different or more different?
• Not bad!
1 Ho: Is there good evidence against the claim that πmale > πfemale?
Ha: Fail to give evidence against the claim that πmale > πfemale.
• NO. Hypotheses do not include statements about data or evidence. This student mixed parameters and statistics/data…
Put Ho, Ha in terms of the parameters πmale vs πfemale only;
H's have nothing to do with new data;
evidence has to do with p-values, data.
2 Only a 0.008 chance of finding this difference by chance if, in the population, there really was no difference between treatment and control groups.
• Good!
3 The p-value of .008 means that the probability of the observed results, if there is, in fact, no difference between "calcium" and "placebo" groups, is 8/1000 or 1/125.
• Good, but would change to "the observed results or results more extreme".
2 x̄ = average time of 10 mice w/ loud noise.
Ho: µ – x̄ = 0 or µ = x̄
• NO! Ho must be in terms of parameter(s). IT MUST NOT SPEAK OF DATA.
4 The p-value measures the probability or chance that the calcium supplement had no effect.
• No. First, Ho and Ha refer not just to the n subjects studied, but to all subjects like them. They should be stated in the present (or even future) tense.
Second, the p-value is about data, under the null H. It is not about the credibility of Ho or Ha.
Question 6.32
The level of calcium in the blood of healthy young adults varies with mean about 9.5 milligrams per deciliter and standard deviation about σ = 0.4. A clinic in rural Guatemala measures the blood calcium level of 180 healthy pregnant women at their first visit for prenatal care. The mean is x̄ = 9.57. Is this an indication that the mean calcium level in the population from which these women come differs from 9.5?
a State Ho and Ha.
5 There is strong evidence that Ca supplement in the diet reduces the blood pressure of healthy men. The probability of this being a wrong conclusion, according to the procedure and data collected, is only 0.008 (i.e. 0.8%).
• Stick to the "official wording":
.. IF Ca makes no ∆ to average BP, the chance of getting ...
Notice the illegal attempt to make the p-value into a predictive value -- about as illegal as a statistician trying to interpret a medical test that gave a reading in the top percentile of the 'healthy' population -- without even seeing the patient!
b Carry out the test and give the P-value, assuming that σ = 0.4 in this population. Report your conclusion.
c The 95% confidence interval for the mean calcium level µ in this population is obtained from the margin of error, namely 1.96 × 0.4 / √180 = 0.058, i.e. as 9.57 ± 0.058, or 9.512 to 9.628. We are confident that µ lies quite close to 9.5. This illustrates the fact that a test based on a large sample (n=180 here) will often declare even a small deviation from Ho to be statistically significant.
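The arithmetic for parts (b) and (c) can be sketched with nothing but the normal distribution (σ = 0.4 is taken as known, per the question):

```python
# Sketch of the Question 6.32 calculations: z test of H0: mu = 9.5 and 95% CI,
# with sigma = 0.4 known, n = 180, x-bar = 9.57.
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n, xbar, mu0 = 0.4, 180, 9.57, 9.5
se = sigma / math.sqrt(n)
z = (xbar - mu0) / se                        # test statistic, about 2.35
p_two_sided = 2 * (1 - normal_cdf(abs(z)))   # P-value, about 0.02
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se  # 95% CI: 9.57 +/- 0.058

print(round(z, 2), round(p_two_sided, 3), round(lo, 3), round(hi, 3))
```

The P of about 0.02 is exactly the "2%" that student answer 5 below tries (illegally) to attach to µ itself.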
1 95% of the time the mean will be included in the interval of 9.512 to 9.628, and 5% of the time it will be missed.
• No. See 1, 2, 3 in Q6.18 above.
6 Only 0.8% chance that the lower BP in the Calcium group is lower than placebo due to chance.
• If Ca++ does nothing, then the prob. of obtaining a result ≥ this extreme is ... The wording borders on the illegal.
2 Ho: There is no difference between the sample area and the population area:
H0: µ = x̄.
Ha: There is a significant difference between the sample mean and the population area.
• NO. This is quite muddled. Unless one takes a census, there will always -- because of sampling variability -- be some non-zero difference between x̄ and µ. The question posed is whether the "mean calcium level (µ) in the population from which these women come differs from 9.5".
ALSO: Must state H's in terms of PARAMETERS.
Here there is one population. If two populations, identify them by subscripts, e.g. Ho: µarea1 = µarea2.
"Significant" is used to interpret data (and can be roughly paraphrased as "evidence that the true parameter is non-zero"). Do not use "significant" when stating hypotheses.
PS A professor in the dept. of Math and Statistics questioned what we in Epi and Biostat are teaching, after he saw, in a grant proposal submitted by one of our doctoral students (now a faculty member!), a statement of the type
H0: µ = x̄.
Please do not give our critic any such ammunition! -- JH
7 The chance that the supplement made no change or raised the B/P is very slim.
• NO! The p-value is a conditional statement, predicated on (calculated on the supposition that) Ca makes no difference to µ. Often stated in the present tense; the p-value is more 'after the data', in the 'past-conditional' tense. Again, wording bordering on the illegal.
8 There is a 0.8% chance that this difference is due to chance alone and a 99.2% chance that this difference is a true difference.
• Not really. Just like the previous statements, this type of language is crossing over into Bayes-Speak.
9 There has been a significant reduction in the BP of the treated group... there's only a probability of 0.8% that this is due to chance alone.
• NO. Cannot talk about the cause... Can say "IF no other cause than chance, then the prob. of getting a difference ≥ this size is ..."
3 95% CI: µ ± 1.96 σ/√180
• NO. The CI for µ is x̄ ± 1.96 σ/√180 !!!!
If we knew µ, we would say µ ± 0 !!
and we wouldn't need a statistics course!
Rather than leave this column blank...
http://www.stat.psu.edu/~resources/Cartoons/
http://watson.hgen.pitt.edu/humor/
4 Ho: µ = x̄ = 9.57
Ha: µ ≠ x̄ = 9.57
• NO. Cannot use sample values in hypotheses. Must use parameters.
5 µ differs from 9.5, and the probability that this difference is only due to chance is 2%.
• Correct to say that "we found evidence that µ differs from 9.5".
In frequentist inference, one can speak probabilistically only about data (such as x̄). Should not speak about the probability that this difference is only due to chance. (What does this difference mean?)
This mis-speak illustrates that we would indeed prefer to speak about µ rather than about the data in a sample. We should indeed start to adopt a formal Bayesian way of speaking, and not 'mix our metaphors' as we commonly do when we try to stay within the confines of the frequentist approach.
6 Ho : µ = 9.5
Ha : µ ≠ 9.5
• Correct. Notice that Ho & Ha say nothing about what you will find in your data.
7 Q6.44: Ho: x̄ = 0.5; Ha: x̄ > 0.5
• NO. Must write H's in terms of parameters!
8 There is a 0.96% probability that this difference is due to chance alone.
• NO. This shorthand is so short that it misleads. If you want to keep it short, say something like "the difference is larger than expected under sampling variation alone". Don't get into attribution of cause.
M&M Ch 7.1 (FREQUENTIST) Inference for µ
Inference for µ: A&B Ch 7.1; Colton Ch 4; "Student"'s t distribution (continued)

CI for µ with small n => "Student"'s t distribution. Use it when we replace σ by s (an estimate of σ) in CIs and tests.

(1) Assume that either
(a) the Y values are normally distributed, or
(b) if not, n is large enough that the Central Limit Theorem guarantees that the distribution (histogram of the sampling distribution) of possible ȳ's is well enough approximated by a Gaussian distribution.
(2) Choose the desired degree of confidence [50%, 80%, 90%, 99%, ...] as before.
(3) Proceed as above, except use the t distribution rather than Z -- find the t value such that xx% of the distribution is between –t and +t. The cutpoints for %-iles of the t distribution vary with the amount of data used to estimate σ.

"Student"'s t distribution is the (conceptual) distribution one gets if one were to:
• take samples (of a given size n) from a Normal(µ, σ) distribution;
• form the quantity t = (x̄ – µ)/(s/√n) from each sample;
• compile a histogram of the results.

The t distribution:
• is symmetric around 0 (just like Z = (x̄ – µ)/(σ/√n));
• has a shape like that of the Z distribution, but with SD slightly larger than unity, i.e. slightly flatter and more wide-tailed; Var(t) = df/(df – 2);
• becomes indistinguishable from the Z distribution as n -> ∞ (in fact as n goes much beyond 30).

Instead of ± 1.96 σ/√n to enclose µ with 95% confidence, we need:

multiple    n    degrees of freedom ('df')
± 3.182     4      3
± 2.228    11     10
± 2.086    21     20
± 2.042    31     30
± 1.980   121    120
± 1.96      ∞      ∞

or, in Gosset's own words (W.S. Gosset, 1908):
"Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically [i.e. by simulation]. The material I used was a ... table containing the height and left middle finger measurements of 3000 criminals.... The measurements were written out on 3000 pieces of cardboard, which were then very thoroughly shuffled and drawn at random... each consecutive set of 4 was taken as a sample... [i.e. n=4 above]... and the mean [and] standard deviation of each sample determined.... This provides us with two sets of... 750 z's on which to test the theoretical results arrived at. The height and left middle finger... table was chosen because the distribution of both was approximately normal..."

• Test of µ = µ0:  Ratio = (x̄ – µ0) / (s/√n)
• CI for µ:  x̄ ± t s/√n
WORKED EXAMPLE 1: CI and Test of Significance

Response of interest: D = INCREASE IN HOURS OF SLEEP with DRUG
Test: H0: µD = 0 vs Halt: µD ≠ 0; α = 0.05 (2-sided)

Data:
            HOURS of SLEEP†        DIFFERENCE
Subject     DRUG     PLACEBO       Drug - Placebo
  1          6.1       5.2            0.9
  2          7.0       7.9           -0.9
  3          8.2       3.9            4.3
  4           .         .             2.9
  5           .         .             1.2
  6           .         .             3.0
  7           .         .             2.7
  8           .         .             0.6
  9           .         .             3.6
 10           .         .            -0.5

† NOTE: I deliberately omitted the full data on the drug and placebo conditions: all we need for the analysis are the 10 differences.

d̄ = 1.78; SD of the 10 differences SD[d] = 1.77

Test statistic = (1.78 - [0]) / (1.77/√10) = 3.18; CR: reference |t9| = 2.26.
Since 3.18 > 2.26, "Reject" H0.

95% CI for µD = 1.78 ± t9 × 1.77/√10 = 1.78 ± 1.26 = 0.5 to 3.0 hours

What if we are not sure the d's come from a Gaussian distribution? [for t: need Gaussian data or (via the CLT) a Gaussian statistic (d̄)]

WORKED EXAMPLE 2: C P G Barker, The Lancet, Vol 345, April 22, 1995, p 1047.

Posture, blood flow, and prophylaxis of venous thromboembolism

Sir--Ashby and colleagues (Feb 18, p 419) report adverse effects of posture on femoral venous blood flow. They noted a moderate reduction in velocity when a patient was sitting propped up at 35° in a hospital bed and a further pronounced reduction when the patient was sitting with legs dependent. Patients recovering from operations are often asked to sit in a chair with their feet elevated on a footrest. The footrests used in most hospitals, while raising the feet, compress the posterior aspect of the calf. Such compression may be important in the aetiology of venous thrombo-embolism. We investigated the effect of a footrest on blood flow in the deep veins of the calf by dynamic radionuclide venography.

Calf venous blood flow was measured in fifteen young (18-31 years) healthy male volunteers. 88 MBq technetium-99m-labelled pertechnetate in 1 mL saline was injected into the lateral dorsal vein of each foot, with ankle tourniquets inflated to 40 mm Hg, and the time the bolus took to reach the lower border of the patella was measured (Sophy DSX Rectangular Gamma Camera). Each subject had one foot elevated with the calf resting on the footrest and the other plantegrade on the floor as a control. The mean transit time of the bolus to the knee was 24.6 s (SE 2.2) for elevated feet and 14.8 s (SE 2.2) for control feet [see figure overleaf]. The mean delay was 9.9 s (95% CI 7.8–12.0).

Simple leg elevation without hip flexion increases leg venous drainage and femoral venous blood flow. The footrest used in this study raises the foot by extension at the knee with no change in the hip position. Ashby and colleagues' findings suggest that such elevation without calf compression would produce an increase in blood flow. Direct pressure on the posterior aspect of the calf therefore seems to be the most likely reason for the reduction in flow we observed. Sitting cross-legged also reduced calf venous blood flow, probably by a similar mechanism. If venous stasis is important in the aetiology of venous thrombosis, the practice of nursing patients with their feet elevated on footrests may need to be reviewed.

JH's analysis of the raw data [data abstracted by eye, so my calculations won't match exactly those in the text]:

d̄ (SD) = 9.8 (4.1); t = (9.8 - [0]) / (4.1/√15) = 9.8/1.0 = 9.8 > t14,0.05 of 2.145 -- the difference is 'off the t-scale'.

95% CI for µD: 9.8 ± 2.145 × [1.0] = 7.7 to 11.9 s
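The arithmetic of Worked Example 1 can be sketched in a few lines, using only the 10 differences (the t9 critical value, 2.262, is taken from the t table rather than computed):

```python
# Sketch of Worked Example 1: paired (one-sample) t analysis of the 10
# drug-minus-placebo differences in hours of sleep.
import math
import statistics

d = [0.9, -0.9, 4.3, 2.9, 1.2, 3.0, 2.7, 0.6, 3.6, -0.5]
n = len(d)
d_bar = statistics.mean(d)      # 1.78
sd = statistics.stdev(d)        # 1.77 (n-1 in the denominator)
se = sd / math.sqrt(n)
t = d_bar / se                  # 3.18 > 2.26, so "reject" H0
t9 = 2.262                      # 97.5th %-ile of t with 9 df, from the table
ci = (d_bar - t9 * se, d_bar + t9 * se)   # roughly 0.5 to 3.0 hours
print(round(t, 2), [round(x, 1) for x in ci])
```

Note that `statistics.stdev` uses the n-1 divisor, matching the SD[d] = 1.77 in the notes.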
WORKED EXAMPLE: Leg Elevation (continued)

[Figure: dot plot of transit time (s), scale 0 to 50, for the No FootRest, FootRest, and Delay columns tabulated below.]

No FootRest   FootRest   Delay
    38           48        10
    26           32         6
    21           28         7
    18           27         9
    16           21         5
    15           22         7
    14           25        11
    12           28        16
    12           31        19
    12           25        13
    11           20         9
     8           13         5
     7           17        10
     7           14         7
     5           18        13

mean   14.8   24.6    9.8
SD      8.5    8.7    4.1
SEM     2.2    2.2    1.0
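The column arithmetic above (the 15 within-subject delays and their mean, SD, and t ratio) can be re-done directly from the table; note that JH's SEM of 1.0 is rounded, so the exact t is a bit below the quoted 9.8 but still far beyond t14,0.05 = 2.145:

```python
# Re-doing the eyeballed analysis of the 15 within-subject delays
# (FootRest minus No FootRest transit times, in seconds) from the table.
import math
import statistics

no_footrest = [38, 26, 21, 18, 16, 15, 14, 12, 12, 12, 11, 8, 7, 7, 5]
footrest    = [48, 32, 28, 27, 21, 22, 25, 28, 31, 25, 20, 13, 17, 14, 18]
delays = [f - n for f, n in zip(footrest, no_footrest)]

d_bar = statistics.mean(delays)       # 9.8 s
sd = statistics.stdev(delays)         # about 4.1
sem = sd / math.sqrt(len(delays))     # about 1.0, as reported
t = d_bar / sem                       # far beyond t14,0.05 = 2.145
print(round(d_bar, 1), round(sd, 1), round(t, 1))
```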
Sample Size for CIs and tests involving µ

n to yield a (2-sided) CI with margin of error m at confidence level 1-α (see M&M p 447):

|<-- margin of error -->|
(-----------•-----------)

• large-sample CI: x̄ ± Z SE(x̄) = x̄ ± m
• SE(x̄) = σ/√n, so...  n = σ²Z² / m²

Remarks: If n is small, replace Zα/2 by tα/2. Typically one won't know σ, so use a guesstimate. In planning n for the example just discussed, the authors might have had pilot data on inter-leg differences in transit time -- with both legs in the No FootRest position. Sometimes one has to 'ask around' as to what the SD of the d's will be. It is always safer to assume a higher SD than might turn out to be the case.

Whereas the mean of the 15 differences between the 2 conditions is arithmetically equal to the difference of the 2 means of 15, the SE of the mean of these 15 differences is not the same as the SE of the difference of two independent means. In general...

Var(ȳ1 – ȳ2) = Var(ȳ1) + Var(ȳ2) – 2 Covariance(ȳ1, ȳ2)

The authors continue to report the SE of each of the 2 means, but these are of little use here, since we are not interested in the means per se, but in the mean difference.

Calculating Var(ȳ1 – ȳ2) = Var(ȳ1) + Var(ȳ2) assumes that we used one set of 15 subjects for the No FootRest condition, and a different set of 15 for the FootRest condition -- a much noisier contrast. As it is, even this inefficient analysis would have sufficed here because the 'signal' was so much greater than the 'noise'.

See the article On Reserve on the display of data from pairs.
Sample Size for CIs and tests involving µ (cont'd)

n for power 1-β if the mean is ∆ units from µ0 (the test value); type I error = α (cf. Colton p 142 or the CRC table on the next page):

Need (Zα/2 – Zβ) SE(x̄) ≤ ∆. Substitute SE(x̄) = σ/√n and solve for n:

n = {Zα/2 – Zβ}² σ² / ∆²

[Sketch: under µ0, the rejection cutoff sits Zα/2 SE(x̄) above µ0; under µalt = µ0 + ∆, a fraction β of the sampling distribution falls below that cutoff.]

If power > 0.5, then β < 0.5, and Zβ < 0.

e.g. α=0.05, β=0.2 => Zα/2 = 1.96, Zβ = –0.84

Technically, if n is small, use the t-test... see the table on the next page.

The ∆ to use is not a matter of statistics or samples, or what the last guy found in a study, but rather "the difference that makes a difference", i.e. it is a clinical judgement, and includes the impact, cost, alternatives, etc. It is the ∆ that, IF TRUE, would lead to a difference in management, or a substantial risk, or whatever...

Sign Test for the median

Test: H0: MedianD = 0 vs Halt: MedianD ≠ 0; α = 0.05 (2-sided)

DIFFERENCE          SIGN
Drug – Placebo      of d
  0.9                +
 -0.9                –
  4.3                +
  2.9                +
  1.2                +
  3.0                +
  2.7                +
  0.6                +
  3.6                +
 -0.5                –
∑ 8+, 2–

Reference: Binomial [n=10; π(+) = 0.5]. See Table C (last column of p T9) or the Sign Test Table provided in the chapter on distribution-free methods.

Upper tail: Prob(≥ 8+ | π = 0.5) = 0.0439 + 0.0098 + 0.0010 = 0.0547
2 tails: P = 0.0547 + 0.0547 = 0.1094
P > 0.05 (2-sided). (Less powerful than the t-test.)

In the above example on blood flow, the fact that all 15/15 had delays makes any formal test unnecessary... the "Intra-Ocular Traumatic Test" says it all. [Q: could it be that they always raised the left leg, and blood flow is less in the left leg? I doubt it, but I ask the question just to point out that just because we find a numerical difference doesn't necessarily mean that we know what caused the difference.]

A famous scientist begins by removing one leg from an insect and, in an accent I cannot reproduce on paper, says "quick march". The insect walks briskly. The scientist removes another leg, and again on being told "quick march" the insect walks along... This continues until the last leg has been removed, and the insect no longer walks. Whereupon the scientist, again in an accent I cannot convey here, pronounces: "There! it goes to prove my theory: when you remove the legs from an insect, it cannot hear you anymore!"
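The sign-test tail probabilities quoted above come straight from the Binomial(10, 0.5) reference distribution, so no table is actually needed; a minimal sketch:

```python
# The sign-test P-value above, from the Binomial(10, 0.5) reference distribution.
import math

def upper_tail(k: int, n: int) -> float:
    """Prob(X >= k) when X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

p_upper = upper_tail(8, 10)    # 0.0439 + 0.0098 + 0.0010 = 0.0547
p_two_sided = 2 * p_upper      # 0.1094 > 0.05: less powerful than the t-test
print(round(p_upper, 4), round(p_two_sided, 4))
```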
Number of Observations to Ensure Specified Power (1-β) when using a 1-sample or paired t-test of the Mean

        α = 0.005 (1-sided)       α = 0.025 (1-sided)       α = 0.05 (1-sided)
        α = 0.01  (2-sided)       α = 0.05  (2-sided)       α = 0.1  (2-sided)
β =     .01 .05 .10 .20 .50       .01 .05 .10 .20 .50       .01 .05 .10 .20 .50     [POWER = 1 – β]
∆/σ
0.2       -   -   -   -   -         -   -   -   -  99         -   -   -   -  70
0.3       -   -   -  134 78         -   -  119  90 45         -  122 97  71  32
0.4       -  115  97  77 45        117  84  68  51 26        101  70 55  40  19
0.5     100   75  63  51 30         76  54  44  34 18         65  45 36  27  13
0.6      71   53  45  36 22         53  38  32  24 13         46  32 26  19   9
0.7      53   40  34  28 17         40  29  24  19 10         34  24 19  15   8
0.8      41   32  27  22 14         31  22  19  15  9         27  19 15  12   6
0.9      34   26  22  18 12         25  19  16  12  7         21  15 13  10   5
1.0      28   22  19  16 10         21  16  13  10  6         18  13 11   8   5
1.2      21   16  14  12  8         15  12  10   8  5         13  10  8   6   -
1.4      16   13  12  10  7         12   9   8   7  -         10   8  7   5   -
1.6      13   11  10   8  6         10   8   7   6  -          8   6  6   -   -
1.8      12   10   9   8  6          8   7   6   -  -          7   6  -   -   -
2.0      10    8   8   7  5          7   6   5   -  -          6   -  -   -   -
2.5       8    7   6   6  -          6   -   -   -  -          -   -  -   -   -
3.0       7    6   6   5  -          5   -   -   -  -          -   -  -   -   -

∆/σ = (µ – µ0)/σ = "Signal"/"Noise"

Table entries transcribed from Table IV.3 of the CRC Tables of Probability and Statistics; "-" marks cells not tabulated there. Table IV.3 tabulates the n's for Signal/Noise ratios in increments of 0.1, and also includes entries for α = 0.01 (1-sided) / 0.02 (2-sided). See also Colton, page 142.

Sample sizes are based on t-tables, and so are slightly larger (and more realistic, when n is small) than those given by the z-based formula: n = (zα + zβ)² (σ/∆)².
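The z-based formula in the note above is easy to program; comparing it with the table shows the (t-based) table entries are indeed a little larger when n is small. Here α = 0.05 (2-sided) and power 0.80, so zα/2 = 1.96 and zβ = 0.84:

```python
# z-based one-sample / paired sample size: n = (z_alpha + z_beta)^2 (sigma/delta)^2,
# shown for 2-sided alpha = 0.05 and power 0.80.
import math

def n_one_sample(signal_to_noise: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Smallest whole n from the large-sample (z) formula."""
    return math.ceil((z_alpha + z_beta) ** 2 / signal_to_noise ** 2)

print(n_one_sample(1.0))   # 8, vs 10 in the t-based table
print(n_one_sample(0.5))   # 32, vs 34 in the t-based table
```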
"Definitive Negative" Studies? Starch blockers -- their effect on calorie absorption from a high-starch meal.
Bo-Linn GW et al., New England Journal of Medicine, 307(23):1413-6, 1982 Dec 2.

Abstract
It has been known for more than 25 years that certain plant foods, such as kidney beans and wheat, contain a substance that inhibits the activity of salivary and pancreatic amylase. More recently, this antiamylase has been purified and marketed for use in weight control under the generic name "starch blockers." Although this approach to weight control is highly popular, it has never been shown whether starch-blocker tablets actually reduce the absorption of calories from starch. Using a one-day calorie-balance technique and a high-starch (100 g) meal (spaghetti, tomato sauce, and bread), we measured the excretion of fecal calories after normal subjects had taken either placebo or starch-blocker tablets. If the starch-blocker tablets had prevented the digestion of starch, fecal calorie excretion should have increased by 400 kcal. However, fecal calorie excretion was the same on the two test days (mean ± S.E.M., 80 ± 4 as compared with 78 ± 2). We conclude that starch-blocker tablets do not inhibit the digestion and absorption of starch calories in human beings.

[Overview of Methods: The one-day calorie-balance technique begins with a preparatory washout in which the entire gastrointestinal tract is cleansed of all food and fecal material by lavage with a special calorie-free, electrolyte-containing solution. The subject then eats the test meal, which includes 51CrCl3 as a non-absorbable marker. After 14 hours, the intestine is cleansed again by a final washout. The rectal effluent is combined with any stool (usually none) that has been excreted since the meal was eaten. The energy content of the ingested meal and of the rectal effluent is determined by bomb calorimetry. The completeness of stool collection is evaluated by recovery of the non-absorbable marker.]

Table 1. Standard Test Meal.
Ingredients
  Spaghetti (dry weight)* ...... 100 g
  Tomato sauce ................. 112 g
  White bread ................... 50 g
  Margarine ..................... 10 g
  Water ........................ 250 g
  51CrCl3 ........................ 4 µCi
Dietary constituents†
  Protein ....................... 19 g
  Fat ........................... 14 g
  Carbohydrate (starch) ........ 108 g (97 g)
* Boiled for seven minutes in 1 liter of water.
† Determined by adding food-table contents of each item.

Table 2. Results in Five Normal Subjects on Days of Placebo and Starch-Blocker Tests.

            Placebo Test Day                    Starch-Blocker Test Day
Subject  Duplicate   Rectal     Marker       Duplicate   Rectal     Marker
         Test Meal*  Effluent   Recovery     Test Meal   Effluent   Recovery
           kcal        kcal        %           kcal        kcal        %
  1         664         81       97.8           665         76       96.6
  2         675         84       95.2           672         84       98.3
  3         682         80       97.4           681         73       94.4
  4         686         67       95.5           675         75      103.6
  5         676         89       96.3           687         83      106.9
Means       677         80       96.4           676         78      100
±S.E.M.     ±4          ±4       ±0.5           ±4          ±2       ±2

* Does not include calories contained in three placebo tablets (each tablet, 1.2 ± 0.1 kcal) or in three Carbo-Lite tablets (each tablet, 2.8 ± 0.1 kcal) that were ingested with each test meal.

[Figure: "kcal blocked", scale -100 to 400: the estimate from the study (with 95% CI, near 0) vs the company's claim (about 400 kcal).]

For a good paper on the topic of 'negative' studies, see Powell-Tuck J, "A defence of the small clinical trial: evaluation of three gastroenterological studies." British Medical Journal (Clinical Research Ed.), 292(6520):599-602, 1986 Mar 1. (Resources for Ch 7)
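A rough paired re-analysis of Table 2 (mine, not the authors' -- they used the calorie-balance comparison, and I am simply treating each subject's two effluent values as a pair) makes the "definitive negative" point: the 95% CI for the blocker-minus-placebo difference hugs 0 and is nowhere near the +400 kcal the claim implies:

```python
# Rough paired sketch of the five effluent (fecal kcal) pairs in Table 2.
# Assumed: t with 4 df, 97.5th %-ile = 2.776 (from the t table).
import math
import statistics

placebo = [81, 84, 80, 67, 89]    # fecal kcal, placebo day
blocker = [76, 84, 73, 75, 83]    # fecal kcal, starch-blocker day
d = [b - p for b, p in zip(blocker, placebo)]   # would be near +400 if the claim held

d_bar = statistics.mean(d)                       # -2.0 kcal
se = statistics.stdev(d) / math.sqrt(len(d))
t4 = 2.776
ci = (d_bar - t4 * se, d_bar + t4 * se)          # a CI hugging 0, far below 400
print(round(d_bar, 1), [round(x, 1) for x in ci])
```

The width of the CI, not just the non-significant P, is what lets this small study rule out the claimed 400 kcal effect.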
M&M Ch 7.2 (FREQUENTIST) Inference for µ1 – µ2
Inference for µ1 – µ2: A&B Ch 7.2; Colton Ch 4

Somewhat more complex than simply replacing σ1 and σ2 by s1 and s2 as estimates of the σ's in CIs and tests. Need to distinguish two theoretical situations (unfortunately seldom clearly distinguishable in practice), where:

σ1 = σ2 = σ: use a "pooled" estimate s of σ [see M&M page 550]; think of s²pooled as a weighted average of s1² and s2². The t-table is accurate (if Gaussian data).

σ1 ≠ σ2: use separate estimates of σ1 and σ2, and adjust the d.f. downwards from (n1 + n2 – 2) to compensate for the inaccuracy:
  Option 1 (p 549): "software approximation"*
  Option 2 (p 541), for hand use: df = Minimum[n1 – 1, n2 – 1]

[M&M are the only ones I know who suggest option 2; I think they do so because the undergraduates they teach may not be motivated enough to use equation 7.4 on page 549 to calculate the reduced degrees of freedom... I agree that the only time one should use option 2 is the first time, when learning about the t-test.]

* The SAS manual says that in its TTEST procedure it uses Satterthwaite's approximation [p. 549 of M&M] for the reduced degrees of freedom unless the user specifies otherwise.

Adjustments are not a big issue if the sample sizes are large or the variances similar.

WORKED EXAMPLE: CI and Test of Significance

Y = Fall in BP over 8 weeks [mm Hg] with a Modest Reduction in Dietary Salt Intake in Mild Hypertension (Lancet, 25 Feb 1989)

Test: H0: µY(Normal Sodium Diet) = µY(Low Sodium Diet)
      Halt: µY(Normal Sodium Diet) ≠ µY(Low Sodium Diet)
α = 0.05 (2-sided); β = 0.20 ==> Power = 80% if |µY(Nl) - µY(Low)| ≥ 2 mm DBP

Data given: Mean (SEM) Fall in SBP
  "Normal" Group (n=53): -0.6 (1.0)
  "Low" Group (n=50): -6.1 (1.1)

Reconstruct the s²'s via the relation s² = n × SEM²:
  Mean (s²) Fall in SBP: "Normal" (n=53): -0.6 (53); "Low" (n=50): -6.1 (60.5)

s² = ([52]53 + [49]60.5) / ([52] + [49]) = 56.63; s = √56.63 = 7.52

t = (-6.1 - [-0.6]) / (7.52 √(1/53 + 1/50)) = –3.71 vs t101,.05 = 1.98
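The pooled-variance arithmetic of the salt-reduction example, reconstructed from the published means and SEMs (via s² = n × SEM²), can be sketched as:

```python
# Pooled two-sample t for the salt-reduction example, reconstructed from
# the published means and SEMs.
import math

n1, mean1, sem1 = 53, -0.6, 1.0   # "Normal" sodium group, fall in SBP (mm Hg)
n2, mean2, sem2 = 50, -6.1, 1.1   # "Low" sodium group

s2_1 = n1 * sem1 ** 2             # 53
s2_2 = n2 * sem2 ** 2             # 60.5
s2_pooled = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 - 1 + n2 - 1)   # about 56.6
s = math.sqrt(s2_pooled)          # about 7.5
t = (mean2 - mean1) / (s * math.sqrt(1 / n1 + 1 / n2))   # -3.71 vs t101,.05 = 1.98
print(round(s2_pooled, 1), round(t, 2))
```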
Calculation of the t-test using separate variances

Had we used the separate s²'s in each sample we would have calculated

t = (-6.1 - [-0.6]) / √(53/53 + 60.5/50) = –3.70

This is equivalent to calculating:

t = (-6.1 - [-0.6]) / √(SE1² + SE2²) = (-6.1 - [-0.6]) / √(1.1² + 1.0²) = –3.70

M&M suggest that the appropriate df for t are:
  Option 1 (via their eqn. 7.4): 99.5
  Option 2 (smaller df): 49

Either way, the t ratio is far beyond the α=0.05 point of the null distribution. Notice that the reduction in df is minimal here because the two sample variances are quite close.

Incidentally, as per their power calculations, the primary response variable was DBP. Mean (SEM) Fall in DBP in the 2 samples: "Normal" Group (n=53): -0.9 (0.6); "Low" Group (n=50): -3.7 (0.6).

t = (-3.7 - [-0.9]) / √(0.6² + 0.6²) = –3.3

"Eye test": Judging the overlap of two independent CIs

|----------x----------|
          |----------x----------|      Overlapping CIs

How far apart do two independent x̄'s, say x̄1 and x̄2, have to be for a formal statistical test, using an alpha of 0.05 two-sided, to be statistically significant?

Need... |x̄1 – x̄2| ≥ 1.96 √({SE[x̄1]}² + {SE[x̄2]}²) if using a z-test*

If the 2 SEMs are about the same size (as they would be if the 2 n's, and the per-unit variability, were about the same), then... [as in exercise X in Chapter 5]

Need... |x̄1 – x̄2| ≥ 1.96 √(2 {SE[each x̄]}²)
i.e.    |x̄1 – x̄2| ≥ 1.96 √2 SE[each x̄], or...
        |x̄1 – x̄2| ≥ 2.77 SE[each x̄]

* If using t rather than z, the multiple would be somewhat higher than 1.96, so that when multiplied by √2 it might be higher than 2.77, closer to 3. Thus a rough answer to the question could be |x̄1 – x̄2| ≥ 3 SE[each x̄].

This means that even when two 100(1-α)% CIs overlap slightly, as above, the difference between the two means could be statistically significant at the α level. This is why Moses, in his article on graphical displays (see reserve material), advocates plotting the 2 CIs formed by

x̄1 ± 1.5 SE[x̄1] and x̄2 ± 1.5 SE[x̄2]

Thus, we can be reasonably sure that if these CIs do not overlap (i.e. if x̄1 and x̄2 are more than 3 SE[each x̄] apart) the difference between them is statistically significant at the alpha=0.05 level.

[estimate ± 1.5 SE(estimate) corresponds to an 86% CI if using the Z distribution]

Note: the above logic applies for other symmetric CIs too.
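The separate-variances (Welch) version of the SBP comparison, with the Satterthwaite df of M&M's equation 7.4, can be sketched as follows; it reproduces both the –3.70 and the 99.5 df quoted above:

```python
# Separate-variances (Welch) t and Satterthwaite df (M&M eqn 7.4) for the
# SBP comparison, from the published SEMs.
import math

n1, mean1, se1 = 53, -0.6, 1.0    # "Normal" sodium group
n2, mean2, se2 = 50, -6.1, 1.1    # "Low" sodium group

t = (mean2 - mean1) / math.sqrt(se1 ** 2 + se2 ** 2)    # about -3.70
df = (se1 ** 2 + se2 ** 2) ** 2 / (se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
print(round(t, 2), round(df, 1))   # df about 99.5; Option 2 would use min(52, 49) = 49
```

Note how little df is lost (101 down to 99.5): the two sample variances are close, exactly as the text remarks.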
Inferences regarding means --- Summary
Situation Object known unknown(or large n's)
1 Popln. CI for x- ± z
σx√n x
- ± tn-1
sx√n
, x
Test 0 z = x- - µ0
σx√n
tn-1 = x- - µ0
sx√n
(sample of n)
1 Popln. CI for d- ± z
σd√n d
- ± tn-1
sd√n
under 2 = d)condns.
Test 0 z = d- - ∆0
σd√n
tn-1 = d- - ∆0
sd√n
(sample of n within-pairdifferences {d=x1-x2} )
2 Poplns. CI for x-1 - x
-2 ± z
σ12
n1 +
σ22
n2 x
-1 - x
-2 ± tdf
s2
n1 +
s2
n2= 1- 2
Test 0 z = x-1 - x
-2 - ∆0
σ12
n1 +
σ22
n2
tdf = x-1 - x
-2 - ∆0
s2
n1 +
s2
n2
(independent samples of n1 and n2)
Notes:
•Pooled s2 = (n1-1)s1
2 + (n2-1)s22
(n1 - 1) + (n2 - 1) (weighted average of the two s2 's) •df = (n1-1) + (n2-1) = n1 + n2 -2
•If it appears that σ12 is very different from σ22, then a "separate variances" t-test is used with df reducedto account for the differing σ2 's
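The summary table's t procedures can be sketched in a few lines; the helper names below are illustrative, not from the notes:

```python
import math

def one_sample_t(xbar, s, n, mu0=0.0):
    """t statistic (and df) for one sample, or for n within-pair differences."""
    return (xbar - mu0) / (s / math.sqrt(n)), n - 1

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2, delta0=0.0):
    """Pooled-variance two-sample t statistic and its df = n1 + n2 - 2."""
    s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(s2_pooled / n1 + s2_pooled / n2)
    return (xbar1 - xbar2 - delta0) / se, n1 + n2 - 2
```

Compare the resulting t against the t[n-1] or t[df] critical point, as in the table.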
Sample Size for CI for µ1 – µ2; Sample Size for test of µ1 versus µ2

CI(µ1 – µ2): n's to produce a CI for µ1 – µ2 with a prespecified margin of error (cf. Colton p 145 or CRC tables)

• large-sample CI:  x̄1 - x̄2 ± Z SE(x̄1 - x̄2) = x̄1 - x̄2 ± margin of error
• SE(x̄1 - x̄2) = √( σ²/n1 + σ²/n2 )
• if equal n's are used, then

   n per group = 2 σ² Z² / [margin of error]²

example:
• 95% CI for the difference in mean Length of Stay (LOS);
• desired margin of error for the difference: 0.5 days;
• anticipated SD of individual LOS's, in each situation: 5 days.

95% -> α=0.05 -> Zα/2 = 1.96

   n per group = 2 · 5² · 1.96² / [0.5]² ≅ 800

Contrast the formula for the test with the formula for the CI: the CI involves no null and alternative values for the comparative parameter; notice also the absence of beta. See the reference to Greenland [below].

Test H0: µ1 = µ2 vs Ha: µ1 ≠ µ2: n's for power of 100(1–β)% if µ1 – µ2 = ∆; Prob(type I error) = α

   n per group = 2 {Zα/2 – Zβ}² σ²/∆² = 2 (Zα/2 – Zβ)² {σ/∆}²

Note that if β < 0.5, Zβ < 0 (also, Zβ is always 1-sided).

example: α=0.05 (2-sided) and β=0.2 ...
Zα/2 = 1.96; Zβ = -0.84,
2(Zα/2 – Zβ)² = 2{1.96 – (–0.84)}² ≈ 16, i.e.

   n per group ≈ 16 · {noise/signal ratio}²

These formulae are easily programmed in a spreadsheet. There are also specialized software packages for sample size and statistical power. See the web page under Resources for Chapter 7.

Greenland S. "On sample-size and power calculations for studies using confidence intervals". American Journal of Epidemiology 128(1):231-7, 1988 Jul. Abstract: A recent trend in epidemiologic analysis has been away from significance tests and toward confidence intervals. In accord with this trend, several authors have proposed the use of expected confidence intervals in the design of epidemiologic studies. This paper discusses how expected confidence intervals, if not properly centered, can be misleading indicators of the discriminatory power of a study. To rectify such problems, the study must be designed so that the confidence interval has a high probability of not containing at least one plausible but incorrect parameter value. To achieve this end, conventional formulas for power and sample size may be used. Expected intervals, if properly centered, can be used to design uniformly powerful studies but will yield sample-size requirements far in excess of previously proposed methods.
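Both sample-size formulae are one-liners; this sketch (my function names) also shows that the notes' "≅ 800" for the LOS example comes from rounding Z up to 2, since Z = 1.96 gives about 768 per group:

```python
import math

def n_per_group_for_test(sigma, delta, z_alpha2=1.96, z_beta=-0.84):
    """n per group for a 2-sample test of means (defaults: alpha=0.05, power 80%)."""
    return 2 * (z_alpha2 - z_beta)**2 * (sigma / delta)**2

def n_per_group_for_ci(sigma, margin, z=1.96):
    """n per group so the CI for mu1 - mu2 has the given margin of error."""
    return 2 * sigma**2 * z**2 / margin**2

# LOS example: SD = 5 days, margin of error 0.5 days
n_exact = n_per_group_for_ci(5, 0.5)        # about 768 with z = 1.96
n_rough = n_per_group_for_ci(5, 0.5, z=2)   # exactly 800, the notes' figure
```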
Number of Observations PER GROUP to Ensure Specified Power (1 - β) if a 2-sample t-test of 2 Means is used

           α=0.005 (1-sided)          α=0.025 (1-sided)          α=0.05 (1-sided)
           α=0.01  (2-sided)          α=0.05  (2-sided)          α=0.1  (2-sided)
  ∆/σ   β= 0.01 0.05 0.1 0.2 0.5   β= 0.01 0.05 0.1 0.2 0.5   β= 0.01 0.05 0.1 0.2 0.5
  0.2        .    .   .   .   .         .    .   .    .   .         .    .   .    . 137
  0.3        .    .   .   .   .         .    .   .    .  87         .    .   .    .  61
  0.4        .    .   .   .  85         .    .   .  100  50         .    .  108  78  35
  0.5        .    .   .  96  55         .  106  86   64  32         .   88   70  51  23
  0.6        .  101  85  67  39       104   74  60   45  23        89   61   49  36  16
  0.7      100   75  63  50  29        76   55  44   34  17        66   45   36  26  12
  0.8       77   56  49  39  23        57   42  34   26  14        50   35   28  21  10
  0.9       62   46  39  31  19        47   34  27   21  11        40   28   22  16   8
  1.0       50   38  32  26  15        38   27  23   17   9        33   23   18  14   7
  1.2       36   27  23  18  11        27   20  16   12   7        23   16   13  10   5
  1.4       27   20  17  14   9        20   15  12   10   6        17   12   10   8   4
  1.6       21   16  14  11   7        16   12  10    8   5        14   10    8   6   4
  1.8       17   13  11  10   6        13   10   8    6   4        11    8    7   5   .
  2.0       14   11  10   8   6        11    8   7    6   4         9    7    6   4   .
  2.5       10    8   7   6   4         8    6   5    4   .         6    5    4   3   .
  3.0        8    6   6   5   4         6    5   4    4   .         5    4    3   .   .

  ( . = entry not given in the source table )

∆/σ = (µ1 – µ2)/σ = "Signal"/"Noise"

Table entries transcribed from Table IV.4 of CRC Tables of Probability and Statistics. Table IV.4 tabulates the n's for Signal/Noise ratios in increments of 0.1, and also includes entries for alpha=0.01(1-sided)/0.02(2-sided).

See also Colton, page 145.

Sample sizes are based on t-tables, and so are slightly larger (and more realistic) than those given by the z-based formula: n/group = 2(zα/2 + zβ)² (σ/∆)².

See later (in Chapter 8) for unequal sample sizes, i.e. n1 ≠ n2.
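The z-based formula at the foot of the table can be used to sanity-check the tabulated entries; as the notes say, the t-based table values run slightly larger (function name illustrative):

```python
import math

def n_per_group(delta_over_sigma, z_alpha, z_beta):
    """z-based approximation to the tabulated (t-based) n per group."""
    return 2 * (z_alpha + z_beta)**2 / delta_over_sigma**2

# alpha = 0.05 (2-sided), beta = 0.2, signal/noise = 1.0:
n_z = math.ceil(n_per_group(1.0, 1.96, 0.84))  # 16; the t-based table entry is 17
```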
Inference concerning a single π -- M&M §8.1 (updated Dec 26, 2003)

Parameter: π, the proportion, e.g. ...
• with undiagnosed hypertension / seeing an MD during a 1-year span
• responding to a therapy
• still breast-feeding at 6 months
• of pairs where response on treatment > response on placebo
• of US presidential elections where the taller candidate was expected to win
• of twin pairs where the L-handed twin dies first
• able to tell imported from domestic beer in a "triangle taste test"
• who get a headache after drinking red wine
• (of all cases, exposed and unexposed) where the case was "exposed" [a function of the rate ratio & of the relative sizes of the exposed & unexposed denominators; a CONDITIONAL analysis (i.e. "fix" the # of cases), used for the CI for the ratio of 2 rates, especially in 'extreme' data configurations ... e.g. # seroconversions in the RCT of HPV16 Vaccine, NEJM Nov 21, 2002: 0/11084.0 W-Y in the vaccinated gp. versus 41/11076.9 W-Y in the placebo gp.]

(FREQUENTIST) Confidence Interval for π from a proportion p = x/n

1. "Exact" (not as "awkward to work with" as M&M p586 say they are): tables [Documenta Geigy, Biometrika, ...], nomograms, software.

e.g. what fraction π will return a 4-page questionnaire? 11/20 returns on a pilot test, i.e. p = 11/20 = 0.55.

95% CI (from the CI-for-proportion table, Ch 8 Resources): 32% to 77%.
[To save space, the table gives CI's only for p ≤ 0.5, so get the CI for the π of non-returns: the point estimate is 9/20 or 45%, CI 23% to 68% {1st row, middle column of the X=9 block}. Turn this back into 100-68 = 32% to 100-23 = 77% returns.]

95% CI (Biometrika nomogram): 32% to 77%.
[The nomogram uses c for the numerator; enter through the lower x-axis if p ≤ 0.5; in our case p = 0.55, so enter the nomogram from the top at c/n = 0.55 near the upper right corner; travel downwards until you hit the bowed line marked 20 (the 5th line from the top) and exit towards the rightmost border at πlower ≈ 0.32; go back and travel downward until you hit the companion bowed line marked 20 (the 5th line from the bottom) and exit towards the rightmost border at πupper ≈ 0.77.]

Others may use other names for the numerator and statistic, or use the symmetry Binomial[y, n, p] <--> Binomial[n-y, n, 1-p] to save space. The nomogram on the next page shows the full range, but uses an approximation.
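The "exact" limits read off the table or nomogram solve two Binomial tail equations. A sketch using only the standard library, with bisection standing in for the table look-up (helper names are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve_upper_tail(x, n, target):
    """Find pi such that P(X >= x | pi) = target (this tail rises with pi)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if 1 - binom_cdf(x - 1, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact CI: P(X >= x | pi_L) = alpha/2 and P(X <= x | pi_U) = alpha/2.
    The upper limit uses the symmetry Binomial[y,n,p] <--> Binomial[n-y,n,1-p]."""
    lower = 0.0 if x == 0 else _solve_upper_tail(x, n, alpha / 2)
    upper = 1.0 if x == n else 1 - _solve_upper_tail(n - x, n, alpha / 2)
    return lower, upper

# 11/20 returns -> (0.32, 0.77), matching the table and the nomogram
```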
Statistic: the proportion p = y/n in a sample of size n.

Inferences from y/n to π:

FREQUENTIST -- via Confidence Intervals and Tests
• Confidence Interval: where is π? Supplies a NUMERICAL answer (a range).
• Evidence (P-value) against H0: π = 0.xx.
• Test of Hypothesis: is the P-value < a preset α? Supplies a YES/NO answer (uses Prob[data | H0]).

Notice the link between a 100(1 - α)% CI and a two-sided test of significance with a preset α. If the true π were < πlower, there would be less than a 2.5% probability of obtaining, in a sample of 20, this many (11) or more respondents; likewise, if the true π were > πupper, there would be less than a 2.5% probability of obtaining, in a sample of 20, this many (11) or fewer respondents. The 100(1 - α)% CI for π includes all those parameter values such that, if the observed data were tested against them, the p-value (2-sided) would not be < α.

BAYESIAN -- via the posterior probability distribution for, and probabilistic statements concerning, π itself
• point estimate: median, mode, ...
• interval estimate: credible intervals, ...

Software:
• "Bayesian Inference for Proportion (Excel)" -- Resources Ch 8
• First Bayes { http://www.epi.mcgill.ca/Joseph/courses.html }

e.g. Experimental drug gives p = 0 successes / 14 patients => π = ??

95% CI for π (from table): 0% to 23%. The CI "rules out" (with 95% confidence) the possibility that π > 23%.
[One might use a 1-sided CI if one is interested in putting just an upper bound on a risk: e.g. what is the upper bound on π = probability of getting HIV from an HIV-infected dentist? See the JAMA article on "zero numerators" by Hanley and Lippman-Hand (in Resources for Chapter 8).]

cf. also A&B §4.7; Colton §4. Note that JH's notes use p for the statistic, π for the parameter.
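For the Bayesian column, a Beta prior is conjugate to the Binomial, so the posterior for π is again Beta. This sketch assumes a uniform Beta(1,1) prior and uses a grid in place of the Beta quantile function (all names illustrative); it reproduces the flavour of the 0/14 example:

```python
def beta_credible_interval(y, n, a=1, b=1, cred=0.95, grid=20000):
    """Central credible interval for pi: posterior is Beta(a + y, b + n - y)."""
    A, B = a + y, b + n - y
    ps = [(i + 0.5) / grid for i in range(grid)]
    w = [p**(A - 1) * (1 - p)**(B - 1) for p in ps]   # unnormalised posterior density
    total = sum(w)
    tail = (1 - cred) / 2
    cum, lo, hi = 0.0, 0.0, 1.0
    for p, wi in zip(ps, w):
        new = cum + wi / total
        if cum < tail <= new:        # lower 2.5% point crossed here
            lo = p
        if cum < 1 - tail <= new:    # upper 97.5% point crossed here
            hi = p
        cum = new
    return lo, hi

# 0 successes in 14 patients, uniform prior -> posterior Beta(1, 15);
# the central 95% interval reaches up to about 0.22, close to the
# frequentist "0% to 23%" upper bound from the exact CI table
```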
CI for π -- using a nomogram (many books of statistical tables have fuller versions)

[Nomogram: 95% CI for π as a function of the observed proportion p, with both axes running from 0 to 1 and paired bowed curves for sample sizes n = 20, 50, 100, 200, 400 and 1000.]
The (asymmetric) CI in the above nomogram uses the approximate formula

   π(lower, upper) = [ 2np + z² ∓ z √( 4np – 4np² + z² ) ] / [ 2(n + z²) ]    (cf. later page)

See Biometrika Tables for Statisticians for the "exact" Clopper-Pearson version, calculated so that Binomial Prob[ ≥ p | πlower ] = Prob[ ≤ p | πupper ] = 0.025 exactly.
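The nomogram's approximate formula is Wilson's score interval, easily checked by hand (function name mine):

```python
import math

def wilson_ci(p, n, z=1.96):
    """The nomogram's approximate (asymmetric) limits -- Wilson's score interval."""
    centre = (2 * n * p + z**2) / (2 * (n + z**2))
    half = z * math.sqrt(4 * n * p - 4 * n * p**2 + z**2) / (2 * (n + z**2))
    return centre - half, centre + half

# n = 10, p = 0.3 -> (0.11, 0.60)
```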
(FREQUENTIST) Confidence Interval for π from a proportion p = x/n

1. Exactly, but by trial and error, via SOFTWARE with the Binomial probability function.

NOTES
• Turn a spreadsheet of Binomial Probabilities (Table C) 'on its side' to get CI(π): simply find those columns (π values) for which the probability of the observed proportion is small.
• Read horizontally, the nomogram [previous page] shows the variability of proportions from SRS samples of size n. [It is very close in style to a table of Binomial Probabilities, except that only the central 95% range of variation is shown, and all n's are on the same diagram.]

Read vertically, it shows:
• CI -> symmetry as p -> 0.5 or n -> ∞ [in fact, as np & n(1-p) -> ∞]
• widest uncertainty at p = 0.5 => can be used as a 'worst case scenario'
• cf. the ± 4 percentage points in the 'blurb' with Gallup polls of size n ≈ 1000. Polls are usually cluster (rather than SRS) samples, and so have bigger margins of error [wider CI's] than predicted from the Binomial.

See the spreadsheet "CI for a Proportion (Excel spreadsheet, based on the exact Binomial model)" under Resources for Chapter 8. In this sheet one can obtain the direct solution, or get there by trial and error. Inputs in bold may be changed. The spreadsheet opens with the example of an observed proportion p = 11/20.

2. Exactly, and directly, using a table of (or a SOFTWARE function that gives) the percentiles of the F distribution.

The general "Clopper-Pearson" method for obtaining a Binomial-based CI for a proportion is explained in the 607 Notes for Chapter 6. One can obtain these limits by trial and error (e.g. in a spreadsheet) or directly using the link between the Binomial and F tail areas (also implemented in the spreadsheet). The basis for the latter is explained by Liddell (method, and reference, given at the bottom of the Table of 95% CI's).

95% CI? IC? ... Comment dit-on ... ? ["How does one say it?"]

[La Presse, Montréal, 1993; translated] The Gallup Institute recently asked a representative sample of the Canadian population to rate the way the federal government was handling various economic and general problems. For 59 per cent of respondents, the Liberals are not doing an effective job in this area, while 30 per cent declare the opposite opinion and eleven per cent express no opinion.

The same question was put by Gallup on 16 occasions between 1973 and 1990, and only once, in 1973, was the proportion of Canadians who said they were dissatisfied with the way the government was managing the economy below 50 per cent.

The conclusions of the poll are based on 1009 interviews carried out between May 2 and May 9, 1994, with Canadians aged 18 and over. A sample of this size gives results accurate to within 3.1 percentage points, 19 times out of 20. The margin of error is larger for the regions, owing to the smaller sample sizes; for example, the 272 interviews carried out in Québec produced a margin of error of 6 percentage points, 19 times out of 20.
2. CI for π based on "large-n" behaviour of p, or a function of p

[Diagram: Gaussian curves centred at πlower = 0.27 and πupper = 0.33, each with tail area 0.025 at the observed proportion p = 0.30; in the "usual" CI, the SD is calculated at 0.30 rather than at each limit.]

CI:  p ± z SE(p) = p ± z √( p[1-p]/n )

e.g. p = 0.3, n = 1000:

   95% CI for π = 0.3 ± 1.96 √( 0.3[0.7]/1000 )
               = 0.30 ± 1.96 (0.015)
               = 0.30 ± 0.03
               = 30% ± 3%

Note: the ± 3% is pronounced and written as "± 3 percentage points", to avoid giving the impression that it is 3% of 30%.

SE-based CI's (sometimes referred to in texts and software output as "Wald" CI's) use the same SE for the upper and lower limits -- they calculate one SE at the point estimate, rather than two separate SE's, calculated at each of the two limits.

"Large-sample n": how large is large?
• A rule of thumb: when the expected no. of positives, np, and the expected no. of negatives, n(1-p), are both bigger than 5 (or 10 if you read M&M).
• JH's rule: when you can't find the CI tabulated anywhere!
• If the distribution is not 'crowded' into one corner (cf. the shapes of binomial distributions in the Binomial spreadsheet -- in Resources for Ch 5), i.e., if, with the symmetric Gaussian approximation, neither of the tails of the distribution spills over a boundary (0 or 1 for proportions, or 0 or n on the count scale).

See M&M p383 and A&B §2.7 on the Gaussian approximation to the Binomial.

From SAS:

   DATA CI_propn;
   INPUT n_pos n;
   LINES;
   300 1000
   ;
   PROC genmod data = CI_propn;
     model n_pos/n = / dist = binomial link = identity waldci;
   RUN;

From Stata, the immediate command: cii 1000 300. Or, using a datafile:

   clear
   * glm doesn't like a file with 1 'observation',
   * so split across 2 'observations'
   input n_pos n
   140 500
   160 500
   end
   glm n_pos, family(binomial n) link(identity)
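The same "usual" (Wald) CI, in a form easy to check against the 30% ± 3 percentage-point example (function name mine):

```python
import math

def wald_ci(x, n, z=1.96):
    """The 'usual' large-n (Wald) CI: p +/- z * sqrt(p(1-p)/n)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Gallup-style example: 300/1000 -> roughly 27% to 33%
```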
2. CI for π based on "large-n" behaviour of p ... continued

Other, more accurate and more theoretically correct, large-sample (Gaussian-based) constructions

The "usual" approach is to form a symmetric CI as point estimate ± a multiple of the SE. This is technically incorrect in the case of a distribution, such as the binomial, whose variance changes with the parameter being measured. In the construction of CI's [see the diagram on page 1 of the material on Ch 6.1] there are two distributions involved: the binomial at πupper and the binomial at πlower. They have different shapes and different SD's in general. Approaches i and ii (below) take this into account.

i. Based on the Gaussian approximation to the binomial distribution, but with SD's calculated at the limits, SD = √( π[1–π]/n ), rather than at the point estimate itself {the "usual" CI uses SD = √( p[1–p]/n )}.

If we define the CI for π as (πL, πU), where

   Prob[ sample proportion ≥ p | πL ] = α/2
   Prob[ sample proportion ≤ p | πU ] = α/2,

and if we use Gaussian approximations to Binomial(n, πL) and Binomial(n, πU), and solve

   p = πL + zα/2 √( πL[1–πL]/n )   and   p = πU – zα/2 √( πU[1–πU]/n )

for πL and πU, this leads to asymmetric 100(1–α)% limits of the form

   [ 2np + z² ± z √( 4np – 4np² + z² ) ] / [ 2(n + z²) ]

Rothman (2002, p132) attributes this method i to Wilson 1927.

ii. Based on the Gaussian distribution of a variance-stabilizing transformation of the binomial, again with SD's calculated at the limits rather than at the point estimate itself:

   [ sin( sin⁻¹(√p) – z/(2√n) ) ]² ,  [ sin( sin⁻¹(√p) + z/(2√n) ) ]²

(as in most calculators, sin⁻¹ and the argument of sin[·] are measured in radians).

E.g. with α = 0.05, so that z = 1.96, we get:

   Method      n=10 p=0.0     n=10 p=0.3     n=20 p=0.15     n=40 p=0.075
   i.          [0.00, 0.28]   [0.11, 0.60]   [ 0.05, 0.36]   [ 0.03, 0.20]
   ii.         [0.09, 0.09]   [0.07, 0.60]   [ 0.03, 0.33]   [ 0.01, 0.18]
   "usual"     [0.00, 0.00]   [0.02, 0.58]   [–0.01, 0.31]   [–0.01, 0.16]
   Binomial*   [0.00, 0.31]   [0.07, 0.65]   [ 0.03, 0.38]   [ 0.02, 0.20]

   * from Mainland

References:
• Fleiss, Statistical Methods for Rates and Proportions
• Miettinen, Theoretical Epidemiology, Chapter 10.
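Method ii can be checked against the comparison table directly; note that at p = 0 both arcsine limits collapse to sin(z/(2√n))², which is the table's curious [0.09, 0.09] entry (function name mine):

```python
import math

def arcsine_ci(p, n, z=1.96):
    """Method ii: variance-stabilising (arcsine square root) interval."""
    centre = math.asin(math.sqrt(p))  # radians, as the notes require
    half = z / (2 * math.sqrt(n))
    return math.sin(centre - half)**2, math.sin(centre + half)**2

# n=10, p=0.3 -> (0.07, 0.60), matching the 'ii.' row of the table
```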
iii. CI for π based on "large-n" behaviour of the logit transformation of the proportion

Based on the Gaussian distribution of the logit transformation of the estimate (p, the observed proportion) of the parameter π.

PARAMETER: LOGIT[π] = log[ODDS] = log[ π/(1-π) ] = log[ "Proportion POSITIVE" / "Proportion NEGATIVE" ]
STATISTIC: logit[p] = log[odds] = log[ p/(1-p) ]

(Here, log = 'natural' log, i.e. to base e, which some write as ln.)
(UPPER CASE/Greek = parameter; lower case/Roman = statistic.)

Reverse transformation (to get back from LOGIT to π):

   π = ODDS/(1 + ODDS) = exp[LOGIT]/(1 + exp[LOGIT]);  and likewise p <-- logit
   πLOWER = exp[LOWER limit of LOGIT] / (1 + exp[LOWER limit of LOGIT]);  πUPPER likewise

(The anti-logit, exp[logit]/(1 + exp[logit]), is what Greenland calls the "expit" function.)

SE[logit] = Sqrt[ 1/#positive + 1/#negative ]

e.g. p = 3/10 => estimated odds = 3/7 => logit = log[3/7] = -0.85
SE[logit] = Sqrt[1/3 + 1/7] = 0.69
CI in the LOGIT scale: -0.85 ± 1.96×0.69 = {-2.2, 0.5}
CI in the π scale: { exp[-2.2]/(1+exp[-2.2]), exp[0.5]/(1+exp[0.5]) } = {0.10, 0.62}

iv. CI for π based on "large-n" behaviour of the log transformation of the proportion

Based on the Gaussian distribution of the estimate of log[π].

PARAMETER: log[π]
STATISTIC: log[p]

Reverse transformation (to get back from log[π] to π):

   π = antilog[ log[π] ] = exp[ log[π] ];  and likewise p <-- log[p]
   πLOWER = exp[ LOWER limit of log[π] ];  πUPPER likewise

SE[ log[p] ] = Sqrt[ 1/#positive - 1/#total ]

Limits for π from p = 3/10:  exp[ log[3/10] ± z × Sqrt[1/3 - 1/10] ]

Exercises:
1. Verify that you get the same answer by calculator and by software.
2. Even with these logit and log transformations, the Gaussian distribution is not accurate at such small sample sizes as 3/10. Compare their performance (against the exact methods) for various sample sizes and numbers positive.

From SAS (logit link):

   DATA CI_propn;
   INPUT n_pos n;
   LINES;
   3 10
   ;
   PROC genmod data = CI_propn;
     model n_pos/n = / dist = binomial link = logit waldci;
   RUN;

From Stata (logit link):

   clear
   input n_pos n
   1 5
   2 5
   end
   glm n_pos, family(binomial n) link(logit)

For method iv, replace link = logit with link = log (SAS), or link(logit) with link(log) (Stata).
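Both transformation-based CIs for the p = 3/10 example, as in exercise 1, can be verified with a short sketch (function names mine):

```python
import math

def logit_ci(x, n, z=1.96):
    """Method iii: CI for pi via the logit transformation, back-transformed by expit."""
    pos, neg = x, n - x
    est = math.log(pos / neg)
    se = math.sqrt(1 / pos + 1 / neg)
    expit = lambda t: math.exp(t) / (1 + math.exp(t))  # Greenland's "expit"
    return expit(est - z * se), expit(est + z * se)

def log_ci(x, n, z=1.96):
    """Method iv: CI for pi via the log transformation."""
    est = math.log(x / n)
    se = math.sqrt(1 / x - 1 / n)
    return math.exp(est - z * se), math.exp(est + z * se)

# logit_ci(3, 10) -> about (0.10, 0.62), matching the worked example
```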
The "Margin of Error blurb" was introduced (legislated) in the mid 1980's.

The Nielsen system for TV ratings in the U.S.A. (excerpt from an article on "Pollsters" from an airline magazine):

"... Nielsen uses a device that, at one-minute intervals, checks to see if the TV set is on or off and to which channel it is tuned. That information is periodically retrieved via a special telephone line and fed into the Nielsen computer center in Dunedin, Florida.

With these two samplings, Nielsen can provide a statistical estimate of the number of homes tuned in to a given program. A rating of 20, for instance, means that 20 percent, or 16 million of the 80 million households, were tuned in.

To answer the criticism that 1,200 or 1,500 are hardly representative of 80 million homes or 220 million people, Nielsen offers this analogy:

Mix together 70,000 white beans and 30,000 red beans and then scoop out a sample of 1000. The mathematical odds are that the number of red beans will be between 270 and 330, or 27 to 33 percent of the sample, which translates to a "rating" of 30, plus or minus three, with a 20-to-1 assurance of statistical reliability. The basic statistical law wouldn't change even if the sampling came from 80 million beans rather than just 100,000." ...

Why, if the U.S. has a 10 times bigger population than Canada, do pollsters use the same size samples of approximately 1,000 in both countries? After all, 1,200 are hardly representative of 80 million homes / 220 million people!

Answer: it depends on WHAT IT IS THAT IS BEING ESTIMATED. With n = 1,000, the SE or uncertainty of an estimated PROPORTION of 0.30 is indeed 0.03, or 3 percentage points. However, if one is interested in the NUMBER of households tuned in to a given program, the best estimate is 0.3N, where N is the number of units in the population (N = 80 million in the U.S. or N = 8 million in Canada). The uncertainty in the 'blown up' estimate of the TOTAL NUMBER tuned in is blown up accordingly, so that e.g. the estimated NUMBER of households is

   U.S.A.:  80,000,000 [0.3 ± 0.03] = 24,000,000 ± 2,400,000
   Canada:   8,000,000 [0.3 ± 0.03] =  2,400,000 ±   240,000

2.4 million is a 10 times bigger absolute uncertainty than 240,000. Our intuition about needing a bigger sample for a bigger universe probably stems from absolute errors rather than relative ones (which in our case remain at 0.03 in 0.3, or 240,000 in 2.4 million, or 2.4 million in 24 million, i.e. at 10%, irrespective of the size of the universe). It may help to think of why we do not take bigger blood samples from bigger persons: the reason is that we are usually interested in concentrations rather than in absolute amounts, and concentrations are like proportions.

Montreal Gazette, August 8, 1981: NUMBER OF SMOKERS RISES BY FOUR POINTS: GALLUP POLL

Compared with a year ago, there appears to be an increase in the number of Canadians who smoked cigarettes in the past week - up from 41% in 1980 to 45% today. The question asked over the past few years was: "Have you yourself smoked any cigarettes in the past week?" Here is the national trend:

   Smoked cigarettes in the past week
   Today ... 45%
   1980 .... 41
   1979 .... 44
   1978 .... 47
   1977 .... 45
   1976 .... Not asked
   1975 .... 47
   1974 .... 52

Men (50% vs. 40% for women), young people (54% vs. 37% for those > 50) and Canadians of French origin (57% vs. 42% for English) are the most likely smokers. Today's results are based on 1,054 personal in-home interviews with adults, 18 years and over, conducted in June.

The Gazette, Montreal, Thursday, June 27, 1985: 39% OF CANADIANS SMOKED IN PAST WEEK: GALLUP POLL

Almost two in every five Canadian adults (39 per cent) smoked at least one cigarette in the past week - down significantly from the 47 per cent who reported this 10 years ago, but at the same level found a year ago. Here is the question asked fairly regularly over the past decade: "Have you yourself smoked any cigarettes in the past week?" The national trend shows:

   Smoked cigarettes in the past week
   1985 .... 39%
   1984 .... 39
   1983 .... 41
   1982* ... 42   (* Smoked regularly or occasionally)
   1981 .... 45
   1980 .... 41
   1979 .... 44
   1978 .... 47
   1977 .... 45
   1975 .... 47

Those < 50 are more likely to smoke cigarettes (43%) than are those 50 years or over (33%). Men (43%) are more likely to be smokers than women (36%). Results are based on 1,047 personal, in-home interviews with adults, 18 years and over, conducted between May 9 and 11. A sample of this size is accurate within a 4-percentage-point margin, 19 in 20 times.
Test of the Hypothesis that π = some test value π0

1. n small enough -> Binomial Tables/Spreadsheet

i.e. if testing H0: π = π0 vs Ha: π ≠ π0 [or Ha: π > π0], and we observe x/n, then calculate

   Prob( observed x, or an x that is more extreme | π0 ),

using Ha to specify which x's are more extreme, i.e. provide even more evidence for Ha and against H0.

Or use the correspondence between a 100(1-α)% CI and a test which uses an alpha level of α, i.e. check whether the CI obtained from the CI table or nomogram includes the π value being tested. [There may be slight discrepancies between test and CI: the methods used to construct CI's don't always correspond exactly to those used for tests.]

e.g. 1: A common question is whether there is evidence against the proposition that a proportion π = 1/2 [testing preferences and discrimination in psychophysical matters, e.g. therapeutic touch; McNemar's test for discordant pairs when comparing proportions in a pair-matched study; the 'non-parametric' Sign Test for assessing intra-pair differences in measured quantities; ...]. Because of the special place of the Binomial at π = 1/2, the tail probabilities have been calculated and tabulated. See the table entitled "Sign Test" in the chapter on Distribution-Free Methods.

M&M (2nd paragraph, p 592) say that "we do not often use significance tests for a single proportion, because it is uncommon to have a situation where there is a precise proportion that we want to test". But they forget paired studies, and even the sign test for matched pairs, which they themselves cover in section 7.1, page 521. They give just 1 exercise (8.18) where they ask you to test π = 0.5 vs π > 0.5.

e.g. 2: Another example, dealing with responses in a setup where the "null" is π = 1/3, the "Triangle Taste Test", is described on the next page.

2. Large n: Gaussian Approximation

   Test π = π0:  z = (p – π0) / SE[p] = (p – π0) / √( π0[1–π0]/n )

Note that the test uses a variance based on the (specified) π0. The "usual" CI uses a variance based on the (observed) p.

(Dis)Continuity Correction†

Because we approximate a discrete distribution [where p takes on the values 0/n, 1/n, 2/n, ... n/n, corresponding to the integer values (0, 1, 2, ..., n) in the numerator of p] by a continuous Gaussian distribution, authors have suggested a 'continuity correction' (or, if you are more precise in your language, a 'discontinuity' correction). This is the same concept as we saw back in §5.1, where we said that a binomial count of 8 became the interval (7.5, 8.5) on the interval scale. Thus, e.g., if we want to calculate the probability that a proportion out of 10 is ≥ 8, we need the probability of ≥ 7.5 on the continuous scale.

If we work with the count itself in the numerator, this amounts to reducing the absolute deviation y – nπ0 by 0.5. If we work on the proportion scale, the absolute deviation is reduced by 0.5/n, viz.

   zc = ( |y – nπ0| – 0.5 ) / SE[y] = ( |y – nπ0| – 0.5 ) / √( nπ0[1–π0] )

or

   zc = ( |p – π0| – 0.5/n ) / SE[p] = ( |p – π0| – 0.5/n ) / √( π0[1–π0]/n )

†Colton [who has a typo in the formula on p ___] and A&B deal with this; M&M do not, except to say on p386-7 that "because most statistical purposes do not require extremely accurate probability calculations, we do not emphasize use of the continuity correction". There are some 'fundamental' problems here that statisticians disagree on. The "Mid-P" material (below) gives some of the flavour of the debate.
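The large-n test, with and without the correction, fits in one small function (name mine); the numbers anticipate the 50/120 taste-test example on the next page:

```python
import math

def z_test_pi0(y, n, pi0, corrected=True):
    """Large-n test of pi = pi0, with optional (dis)continuity correction."""
    dev = abs(y - n * pi0)
    if corrected:
        dev = max(dev - 0.5, 0.0)  # reduce the absolute deviation by 0.5
    return dev / math.sqrt(n * pi0 * (1 - pi0))

# 50/120 vs pi0 = 1/3: z = 1.94 uncorrected, 1.84 with the correction
```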
EXAMPLE of Testing π: THE TRIANGLE TASTE TEST

As part of preparation for a double-blind RCT of lactase-reduced infant formula on infant crying behaviour, the experimental formulation was tested for its similarity in taste to the regular infant formula. n mothers in the waiting room at MCH were given the TRIANGLE TASTE TEST, i.e. they were each given 3 coded formula samples -- 2 containing the regular formula and 1 the experimental one. Told that "2 of these samples are the same and one sample is different", p = y/n correctly identified the odd sample. Should the researcher be worried that the experimental formula does not taste the same? (Assume infants are no more or less taste-discriminating than their mothers.) [Study by Ron Barr, Montreal Children's Hospital]

The null hypothesis being tested is

   H0: π(correctly identified samples) = 0.33 against Ha: π > 0.33

[here, for once, it is difficult to imagine a 2-sided alternative -- unless mothers were very taste-discriminating but wished to confuse the investigator].

We consider two situations (the real study with n = 12, and a hypothetical larger sample of n = 120 for illustration).

• 5 of n = 12 mothers correctly identified the odd sample, i.e. p = 5/12 = 0.42.

Degree of evidence against H0
   = Prob(5 or more correct | π = 0.33) ... a Σ of 8 probabilities
   = 1 - Prob(4 or fewer correct | π = 0.33) ... a shorter Σ of only 5
   = 1 - [ P(0) + P(1) + P(2) + P(3) + P(4) ] = 0.37*

*These calculations can be done easily even on a calculator or spreadsheet, without any combinatorials:
   P(0) = 0.67¹² = 0.008
   P(1) = 12 × 0.33 × P(0) / [1 × 0.67] = 0.048
   P(2) = 11 × 0.33 × P(1) / [2 × 0.67] = 0.131
   P(3) = 10 × 0.33 × P(2) / [3 × 0.67] = 0.215
   P(4) =  9 × 0.33 × P(3) / [4 × 0.67] = 0.238
                                      Σ = 0.640
so Prob(5 or more correct | π = 0.33) = 1 - 0.64 = 0.36.

Using n = 12 and p = 0.30 in Table C gives 0.28; using p = 0.35 gives 0.42. Interpolation gives 0.37 approx. One can also obtain this probability directly via Excel, using the function 1 - BINOMDIST(4, 12, 0.33333, TRUE).

So, by conventional criteria (Prob < 0.05 is considered a cutoff for evidence against H0), there is not a lot of evidence to contradict the H0 of taste similarity of the regular and experimental formulae.

With a sample size of only n = 12, however, we cannot rule out the possibility that a sizable fraction of mothers could truly distinguish the two. Our observed proportion of 5/12 projects to a one-sided 95% CI of "as many as 65% in the population get it right". In this worst-case scenario, assuming that the percentage of right answers in the population is a mix of a proportion πcan who can really tell, and one third of the remaining (1 - πcan) who get it right by guessing, we equate

   0.65 = πcan + (1 - πcan)/3,

giving an upper bound of πcan = (0.65 - 0.33)/(2/3) = 0.48, or 48%.

• 50 of 120 (p = 0.42) mothers identified the odd sample.

   Test π = 0.33:  z = (0.42* - 0.33) / √( 0.33[1-0.33]/120 ) = 2.1

So P = Prob[ ≥ 50 | π = 0.33 ] = Prob[Z ≥ 2.1] = 0.018.

*We treat the proportion 50/120 as a continuous measurement; in fact it is based on an integer numerator, 50; we should treat 50 as 49.5 to 50.5, so ≥ 50 is really > 49.5. The probability of obtaining 49.5/120 or more is the probability of Z = (0.413 - 0.33) / √( 0.33[1-0.33]/120 ) or more. With n = 120, the continuity correction does not make a large difference; however, with smaller n, and its coarser grain, the continuity correction [which makes deviations smaller] is more substantial.
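The exact one-sided p-value for the n = 12 study (the Excel BINOMDIST calculation above) can equally be computed from the Binomial tail directly (function name mine):

```python
from math import comb

def binom_tail_geq(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) -- the exact one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_value = binom_tail_geq(5, 12, 1/3)   # about 0.37, as in the notes
```

Using π0 = 1/3 exactly, rather than the rounded 0.33, gives 0.37 rather than the recursion's 0.36.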
Sample Size for CI's and Tests involving π

n to yield a (2-sided) CI with margin of error m at confidence level 1-α (see M&M p 593, Colton p 161):

   |--- margin of error --->|
   (-----------•-----------)  CI

• see CI's as a function of n in tables and nomograms
• (or) large-sample CI: p ± Zα/2 SE(p) = p ± m, with SE(p) = √( p[1-p]/n ), so

   n = p[1-p] · Zα/2² / m²

If unsure, use the largest SE, i.e. when p = 0.5, i.e.

   n = 0.25 · Zα/2² / m²    [1.c]

n for power 1-β to "detect" a population proportion π1 that is ∆ units from π0 (the test value); type I error = α (Colton p 161):

   n = { Zα/2 √( π0[1–π0] ) – Zβ √( π1[1–π1] ) }² / ∆²    [1.t]
     ≈ { Zα/2 – Zβ }² { √( π̄[1–π̄] ) / ∆ }²,  where π̄ is the average of π0 and π1
     = { Zα/2 – Zβ }² { σ0/1 / ∆ }²

Notes: Zβ will be negative; the formula is the same as for testing µ. (See also homegrown exercise # ___)

Worked example 1: sample size for a test that π(preferences) = 0.5 vs ≠ 0.5, or a Sign Test that the median difference = 0.

Test: H0: MedianD = 0 vs Halt: MedianD ≠ 0, or H0: π(+) = 0.5 vs Halt: π(+) > 0.5; α = 0.05 (2-sided).

For power 1-β against Halt: π(+) = 0.65, say:

   [at π̄ = average of 0.5 & 0.65 = 0.575, √( π̄[1–π̄] ) = 0.494]

   n ≈ { Zα/2 – Zβ }² { 0.494/0.15 }²

With α = 0.05 (2-sided) & β = 0.2: Zα/2 = 1.96; Zβ = -0.84; (Zα/2 – Zβ)² = {1.96 – (–0.84)}² ≈ 8, i.e.

   n ≈ 8 { 0.494/0.15 }² = 87

Worked example 2: sample size for the Taste Test: π(correct) = 1/3 vs > 1/3.

If we set α = 0.05 (hardliners might allow a 1-sided test here), then Zα = 1.645; if we want 90% power, then Zβ = -1.28. Then, using eqn [1.t]:

   n's for 90% power against π(correct) = 0.4, 0.5, 0.6, 0.7, 0.8:
                               n =      400,  69,  27,  14,   8
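Formulae [1.c] and [1.t] as code (names mine). Note that the exact [1.t] for worked example 1 gives about 85; the notes' 87 comes from the π̄ shortcut with the z-constants rounded up to 8:

```python
import math

def n_for_test_pi(pi0, pi1, z_alpha2=1.96, z_beta=-0.84):
    """Formula [1.t]: n to detect pi1 when the test value is pi0."""
    num = z_alpha2 * math.sqrt(pi0 * (1 - pi0)) - z_beta * math.sqrt(pi1 * (1 - pi1))
    return (num / (pi1 - pi0))**2

def n_for_ci_pi(p, m, z_alpha2=1.96):
    """Formula [1.c]: n so the CI for pi has margin of error m."""
    return p * (1 - p) * z_alpha2**2 / m**2

# worst case p = 0.5 with m = 0.03 -> about 1067, cf. Gallup's n near 1000
```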
Inference concerning 2 π's -- M&M §8.2 (updated Dec 14, 2003)
Parameters: π1 and π2 .... the π's are proportions / prevalences / risks.

Large-sample CI for a COMPARATIVE MEASURE / PARAMETER (if the 2 estimates are uncorrelated)

IN GENERAL (if calculations are done in a transformed scale, one must back-transform):

    estimate1 - estimate2 ± z SE[ estimate1 - estimate2 ]
  = estimate1 - estimate2 ± z Sqrt[ Var[estimate1] + Var[estimate2] ]

IN PARTICULAR:

    p1 - p2 ± z SE[ p1 - p2 ]                    Remember: SE's don't add!
  = p1 - p2 ± z Sqrt[ SE²[p1] + SE²[p2] ]        Their squares do!

3 comparative measures / parameters:

1. (Risk or Prevalence) DIFFERENCE.  Parameter: π1 – π2;  estimate: p1 – p2.  CI:

     p1 - p2 ± z Sqrt[ p1(1-p1)/n1 + p2(1-p2)/n2 ]        cf. Rothman 2002 p 135, Eqn 7-2

2. (Risk or Prevalence) RATIO.  Parameter: π1/π2;  estimate: p1/p2.  In the new (log) scale,
   log[p1/p2] = log[p1] – log[p2], so the CI is

     anti-log[ log[p1/p2] ± z SE[ log[p1] – log[p2] ] ]
   = anti-log[ log[p1/p2] ± z Sqrt[ SE²[ log[p1] ] + SE²[ log[p2] ] ] ]      cf. Rothman 2002 p 135, Eqn 7-3

   From 8.1: SE²[ log[p1] ] = Var[ log[p1] ] = 1/#positive1 – 1/#total1
             SE²[ log[p2] ] = Var[ log[p2] ] = 1/#positive2 – 1/#total2

3. ODDS RATIO.  Parameter: { π1/(1–π1) } / { π2/(1–π2) };  estimate: odds1/odds2.  In the log scale,
   log[odds1/odds2] = log[odds1] – log[odds2] = logit1 – logit2, so the CI is

     anti-log[ log[oddsRatio] ± z SE[ logit1 – logit2 ] ]
   = anti-log[ log[oddsRatio] ± z Sqrt[ SE²[logit1] + SE²[logit2] ] ]        cf. Rothman 2002 p 139, Eqn 7-6

   From 8.1: SE²[ logit1 ] = Var[ logit1 ] = 1/#positive1 + 1/#negative1
             SE²[ logit2 ] = Var[ logit2 ] = 1/#positive2 + 1/#negative2

   Var[ log of OR est. ] = 1/a + 1/b + 1/c + 1/d  ==>  "Woolf's Method": CI[ODDS RATIO]
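The three CI recipes above can be wrapped into one small Python function (a sketch; the function name and argument order are mine):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def two_by_two_cis(a, b, c, d, conf=0.95):
    """a,b = +ve,-ve counts in sample 1; c,d = +ve,-ve counts in sample 2 (uncorrelated)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    # Risk / prevalence difference (Rothman Eqn 7-2)
    se_rd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rd = (p1 - p2 - z * se_rd, p1 - p2 + z * se_rd)
    # Risk / prevalence ratio, in the log scale: SE^2[log p] = 1/#positive - 1/#total (Eqn 7-3)
    se_lrr = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    rr = tuple(exp(log(p1 / p2) + s * z * se_lrr) for s in (-1, 1))
    # Odds ratio, Woolf's method: Var[log OR] = 1/a + 1/b + 1/c + 1/d (Eqn 7-6)
    se_lor = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    orr = tuple(exp(log(a * d / (b * c)) + s * z * se_lor) for s in (-1, 1))
    return rd, rr, orr
```

With the Vitamin C data from the pages that follow (302/407 vs 335/411 with colds), it reproduces CI[RR] = 0.85 to 0.98 and CI[OR] ≈ 0.47 to 0.91 (the notes round the upper OR limit to 0.90 because they carry fewer decimals).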
Examples: Large-sample Tests

Test of π1 = π2 is equivalent to a
  Test of π1 – π2 = 0   (Risk or Prevalence Difference = 0),
which is the same as a
  Test of π1/π2 = 1     (Risk or Prevalence Ratio = 1),
and the same as a
  Test of { π1/(1–π1) } / { π2/(1–π2) } = 1   (ODDS Ratio = 1).

The generic 2x2 contingency table:

                +ve        -ve        both
  --------------------------------------------
  sample 1     y1 (%)     n1-y1      n1 (100%)
  sample 2     y2 (%)     n2-y2      n2 (100%)
  --------------------------------------------
  total        y  (%)     n - y      n  (100%)

1  Bromocriptine for unexplained 1º infertility (BMJ 1979)

                    became       did     total no.
                    pregnant     not     couples
  ------------------------------------------------
  Bromocriptine     7 (29%)      17      24 (100%)
  Placebo           5 (22%)      18      23 (100%)
  ------------------------------------------------
  total             12 (26%)     35      47 (100%)
The test statistic:

  z = ( p1 - p2 - {∆=0} ) / SE[ p1 - p2 ] †

    = ( p1 - p2 ) / Sqrt[ p(1-p)/n1 + p(1-p)/n2 ] ††

    = ( p1 - p2 ) / Sqrt[ p(1-p) { 1/n1 + 1/n2 } ]

  [ use the estimate p = (n1 p1 + n2 p2)/(n1 + n2) = total +ve / total in the test ]

2  Vitamin C and the common cold (CMAJ Sept 1972 p 503)

               no colds      ≥ 1 cold    total subjects
  -----------------------------------------------------
  Vitamin C    105 (26%)     302         407 (100%)
  Placebo       76 (18%)     335         411 (100%)
  -----------------------------------------------------
  total        181 (22%)     637         818 (100%)

3  Stroke Unit vs Medical Unit for Acute Stroke in the elderly?
   Patient status at hospital discharge (BMJ 27 Sept 1980)

                 indept.      dependent   total no. pts
  -----------------------------------------------------
  Stroke Unit    67 (66%)     34          101 (100%)
  Medical Unit   46 (51%)     45           91 (100%)
  -----------------------------------------------------
  total          113 (59%)    79          192 (100%)

†  Continuity correction: use |p1 – p2| – [ 1/(2n1) + 1/(2n2) ] in the numerator.
†† Variances add if the proportions are uncorrelated.
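The pooled-p z statistic above is easy to script; a minimal Python sketch (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(y1, n1, y2, n2, continuity=False):
    """Large-sample z test of H0: pi1 = pi2, using the pooled p; returns (z, 2-sided P)."""
    p1, p2 = y1 / n1, y2 / n2
    p = (y1 + y2) / (n1 + n2)            # pooled estimate: total +ve / total
    num = abs(p1 - p2)
    if continuity:                       # continuity correction (dagger note above)
        num -= 1 / (2 * n1) + 1 / (2 * n2)
    z = num / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return z, 2 * (1 - NormalDist().cdf(z))
```

For the stroke-unit data, `two_prop_z(67, 101, 46, 91)` gives z ≈ 2.22 and P ≈ 0.026, matching the worked example on the next page.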
Rx for primary infertility

95% CI on ∆π:  0.29 - 0.22 ± z Sqrt[ 0.29 • 0.71 / 24 + 0.22 • 0.78 / 23 ]
            =  0.07 ± 1.96 × 0.13
            =  0.07 ± 0.25

test ∆π=0:

  z = ( 0.29 - 0.22 ) / Sqrt[ 0.26 [0.74] { 1/24 + 1/23 } ]
    = 0.07 / 0.13
    = 0.55

  P = 0.58 (2-sided)

Stroke Unit vs Medical Unit

95% CI for ∆π:  0.66 - 0.51 ± z Sqrt[ 0.66 • 0.34 / 101 + 0.51 • 0.49 / 91 ]
             =  0.15 ± 1.96 × 0.07
             =  0.15 ± 0.14

test ∆π=0:  [carrying several decimal places, for comparison with χ² later]

  z = ( 0.6634 – 0.5054 ) / Sqrt[ 0.5885 • 0.4115 • { 1/101 + 1/91 } ]
    = 0.1580 / 0.0711
    = 2.22

  [ estimate of the hypothesized common π = total +ve / total = 113/192 = 0.5885 ]

  P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)

Vitamin C and the common cold

95% CI on ∆π:  0.26 - 0.18 ± z Sqrt[ 0.26 • 0.74 / 407 + 0.18 • 0.81 / 411 ]
            =  0.08 ± 1.96 × 0.03
            =  0.08 ± 0.06

test ∆π=0:  [carrying several decimal places, for comparison with χ² later]

  z = ( 0.258 - 0.185 ) / Sqrt[ 0.221 [0.779] { 1/407 + 1/411 } ]
    = 0.073 / 0.029
    = 2.52   (2.517 = √6.337)

  P = 0.006 (1-sided);  P = 0.012 (2-sided)

Fall in BP with Reduction in Dietary Salt

Response of interest:  Y = Achieve DBP < 90 ?  [ 0 / 1 ]

H0:   π(Y=1 | Normal Sodium Diet) = π(Y=1 | Low Sodium Diet)
Halt: π(Y=1 | Normal Sodium Diet) ≠ π(Y=1 | Low Sodium Diet)

α = 0.05 (2-sided);

Proportion achieving DBP < 90 mm:  "Normal" Group 11/50 (22%);  "Low" Group 17/53 (32%)

test ∆π=0:

  z = ( 0.32 – 0.22 ) / Sqrt[ 0.27 • 0.73 • [ 1/53 + 1/50 ] ]

i.e. |z| = 1.14, which is < Zα/2 = 1.96, and so the observed difference of 10% is "N.S."
CI for Risk Ratio (Rel. RISK) or Prev. Ratio (cf. Rothman 2002 p 135) and CI for ODDS RATIO (cf. Rothman 2002 p 139)

Example: Vitamin C and the common cold (CMAJ Sept 1972 p 503) . . . REVISITED

               no colds      ≥ 1 cold     total subjects
  ------------------------------------------------------
  Vitamin C    105 (26%)     302 (74%)    407 (100%)
  Placebo       76 (18%)     335 (82%)    411 (100%)
  ------------------------------------------------------
  total        181 (22%)     637 (78%)    818 (100%)

                                        Vitamin C    Placebo
  # with cold(s) for every 1
    who avoided colds                   2.88 (:1)    4.41 (:1)
  odds of cold(s)                       2.88         4.41

RR^ = Prob[ cold | Vitamin C ] / Prob[ cold | Placebo ] = 74% / 82% = 0.91

CI[RR]: antilog{ log[0.91] ± z SE[ log[p1] – log[p2] ] }
      = antilog{ log[0.91] ± z Sqrt[ SE²[ log[p1] ] + SE²[ log[p2] ] ] }

From 8.1: SE²[ log[p1] ] = Var[ log[p1] ] = 1/302 – 1/407 = 0.000854
          SE²[ log[p2] ] = Var[ log[p2] ] = 1/335 – 1/411 = 0.000552

CI[RR]: antilog{ log[0.91] ± z Sqrt[ 0.000854 + 0.000552 ] }
      = antilog{ log[0.91] ± 0.073 } = 0.85 to 0.98

odds Ratio = 2.88 / 4.41 = 0.65   >>>  OR^ = 0.65

CI[OR] = anti-log[ log[oddsRatio] ± z SE[ logit1 – logit2 ] ]

From 8.1: SE²[ logit1 ] = 1/#positive1 + 1/#negative1
          SE²[ logit2 ] = 1/#positive2 + 1/#negative2

SE[ logit1 – logit2 ] = Sqrt[ (1/302 + 1/105) + (1/335 + 1/76) ] = 0.17

z SE[ logit1 – logit2 ] = 1.96 × 0.17 = 0.33

anti-log[ log[0.65] ± 0.33 ] = exp[ –0.43 ± 0.33 ] = 0.47 to 0.90

>>>>> OR^ = 0.65;  CI[OR] = { 0.47 to 0.90 }

Shortcut: calculate exp{ z × SE[ log RR^ ] } and use it as a multiplier and divider of RR^. In our e.g., exp{ z × SE[ log RR^ ] } = exp{ 0.073 } = 1.076. Thus { RRlower, RRupper } = { 0.91 / 1.076, 0.91 × 1.076 } = { 0.85, 0.98 }. You can use this shortcut whenever you are working with log-based CI's that you convert back to the original scale; there they become "multiply-divide" symmetric rather than "plus-minus" symmetric.
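A quick numeric check of the multiply-divide shortcut, in Python:

```python
from math import exp, sqrt

# Vitamin C data: 302/407 with colds (Vitamin C) vs 335/411 (Placebo)
rr = (302 / 407) / (335 / 411)                       # ~0.91
se_log_rr = sqrt(1/302 - 1/407 + 1/335 - 1/411)      # Sqrt[0.000854 + 0.000552]
f = exp(1.96 * se_log_rr)                            # multiplier/divider, ~1.076
print(round(rr / f, 2), round(rr * f, 2))            # 0.85 0.98
```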
From SAS:

  PROC FORMAT;
   VALUE onefirst 0="z_0" 1="a_1";
  DATA CI_RR_OR;
   INPUT vitC cold n_people;
   LINES;
  1 1 302
  1 0 105
  0 1 335
  0 0 76
  ;
  PROC FREQ data=CI_RR_OR ORDER=FORMATTED;
   TABLES vitC*cold / CMH;
   WEIGHT n_people;
   FORMAT vitC cold onefirst. ;
  RUN;

Be CAREFUL as to rows/cols: the index exposure category must be the 1st row; the reference exposure category must be the 2nd. If necessary, use FORMAT to have the table come out this way (note the trick above to reverse the rows/cols). SAS doesn't know whether the data are from a case-control or a cohort study, so the output gives both the RR and the OR.

From Stata, for the RR:

  Immediate form:  csi 302 335 105 76           ('cs' stands for cohort study)

  or, from data:

   input vit_c cold n_people
   1 1 302
   1 0 105
   0 1 335
   0 0 76
   end
   cs cold vit_c [freq = n_people]

From Stata, for the OR:

  Immediate form:  cci 302 335 105 76, woolf    ('cc' stands for case-control study)

  or, from data (entered as above):

   cc cold vit_c [freq = n_people], woolf
"Test-based" CI's ... IN GENERAL

Preamble

In 1959, when Mantel and Haenszel developed their summary Odds Ratio measure over 2 or more strata, they did not supply a CI to accompany this point estimate. From 1955 onwards, the main competitor was the weighted average (in the log OR scale), and its accompanying CI, obtained by Woolf. But this latter method has problems with strata where one or more cell frequencies are zero. In 1976, Miettinen developed the "test-based" method for epidemiologic situations where the summary point estimate is easily calculated, the standard error estimate is unknown or hard to compute, but a statistical test of the null value of the parameter of interest (derived by aggregating a "sub-statistic" from each stratum) is already available. Despite the 1986 development, by Robins, Breslow and Greenland, of a direct standard error for the log of the Mantel-Haenszel OR estimator, the "test-based" CI is still used (see A&B, KKM).

Even though its main usefulness is for summaries over strata, the idea can be explained using a simpler and familiar (single stratum) example: the comparison of two independent means using a z-test with large df (the principle does not depend on t vs. z). Suppose all that was reported was the difference in sample means, and the 2-sided p-value associated with a test of the null hypothesis that the mean difference was zero. From the sample means and the p-value, how could we obtain a 95% CI for the difference in the "population" means? The trick is to

1  work back (using a table of the normal distribution) from the p-value to the corresponding value of the z-statistic (the number of standard errors that the difference in sample means is from zero);

2  divide the observed difference by the observed z value, to get the standard error of the difference in sample means; and

3  use the observed difference, and the desired multiple (1.645 for a 90% CI, 1.96 for 95%, etc.), to create the CI.

The same procedure is directly applicable for the difference of two independently estimated proportions. If one tests the (null) difference using a z-test, one can obtain the SE of the difference by dividing the observed difference in proportions by the z statistic; if the difference was tested by a chi-square statistic, one can obtain the z-statistic by taking the square root of the observed chi-square value (authors call this square root an observed 'chi' value). Either way, the observed z-value leads directly to the SE, and from there to the CI.

See Section 12.3 of Miettinen's "Theoretical Epidemiology".

Technically, when the variance is a function of the parameter (as is the case with binary response data), the test-based CI is most accurate close to the Null. However, as you can verify by comparing test-based CI's with CI's derived in other ways, the inaccuracies are not as extreme as textbooks and manuals (e.g. Stata) suggest.

"Test-based" CI's for... IN PARTICULAR

• Difference of 2 proportions π1 – π2 (Risk or Prevalence Difference)

Observe: p1 & p2, and (maybe via a p-value) the calculated value of x².

This implies that

  Sqrt[observed x² value] = observed x value = observed z value;

But... observed z statistic = ( p1 – p2 ) / SE[ p1 – p2 ].

So... SE[ p1 – p2 ] = ( p1 – p2 ) / (observed z statistic)    [use the +ve sign]

95% CI for π1 – π2:

  ( p1 – p2 ) ± {z value for 95%} × SE[ p1 – p2 ]

i.e.,...

  ( p1 – p2 ) ± {z value for 95%} × ( p1 – p2 ) / (observed z statistic)

i.e., after re-arranging terms...

  ( p1 – p2 ) × { 1 ± (z value for 95%) / (observed z statistic) }

or, in terms of a reported chi-square statistic,

  ( p1 – p2 ) × { 1 ± (z value for 95%) / Sqrt[observed chi-square statistic] }

This is worked out in the next example, where it is assumed that the null hypothesis is tested via a chi-squared (x²) test.
"Test-based" CI's for... IN PARTICULAR

• Ratio of 2 proportions π1/π2 (Risk Ratio; Prevalence Ratio; Relative Risk; "RR")

Observe: (i) rr = p1/p2 and (ii) (maybe via a p-value) the value of the x² statistic (H0: RR=1).

  >>> Sqrt[observed x² value] = observed x value = observed z value

In the log scale, in relation to log[RRnull] = 0, the observed z value would be:

  observed z value = ( log[rr] – 0 ) / SE[ log[rr] ]

This implies that

  SE[ log[rr] ] = log[rr] / (observed z value)    [use the +ve sign]

95% CI for log[RR]:

  log[rr] ± {z value for 95%} × SE[ log[rr] ]

i.e.,...

  log[rr] ± {z value for 95%} × log[rr] / (observed z value)

i.e., after re-arranging terms...

  log[rr] × { 1 ± (z value for 95%) / (observed z statistic) }

Going back to the RR scale, by taking antilogs*...

95% CI for RR:

  rr to the power of { 1 ± (z value for 95%) / (observed z statistic) }

• Ratio of 2 odds, π1/[1–π1] and π2/[1–π2] (Odds Ratio; "OR")

Observe: (i) or = { p1/[1–p1] } / { p2/[1–p2] } ( " a×d / b×c " ) and (ii) (maybe via a p-value) the value of the x² statistic (H0: OR=1).

  >>> Sqrt[observed x² value] = observed x value = observed z value

In the log scale, in relation to log[ORnull] = 0, the observed z value would be:

  observed z value = ( log[or] – 0 ) / SE[ log[or] ]

This implies that

  SE[ log[or] ] = log[or] / (observed z value)    [use the +ve sign]

95% CI for log[OR]:

  log[or] ± {z value for 95%} × SE[ log[or] ]

i.e.,...

  log[or] ± {z value for 95%} × log[or] / (observed z value)

i.e., after re-arranging terms...

  log[or] × { 1 ± (z value for 95%) / (observed z statistic) }

Going back to the OR scale, by taking antilogs*...

95% CI for OR:

  or to the power of { 1 ± (z value for 95%) / (observed z statistic) }

See Section 13.3 of Miettinen's "Theoretical Epidemiology".

* antilog[ log[a] × b ] = exp[ log[a] × b ] = { exp[ log[a] ] } to the power of b = a to the power of b
Sample Size considerations... CI(π1 – π2)

n's to produce a CI for the difference in π's with a pre-specified margin of error m at a stated confidence level:

• large-sample CI: p1 – p2 ± Zα/2 SE(p1 – p2) = p1 – p2 ± m

• SE(p1 – p2) = Sqrt[ p1(1–p1)/n1 + p2(1–p2)/n2 ]

Simplify by using an average p; if we use equal n's, then

  n per group = 2p(1–p) Zα/2² / [margin of error]²

M&M use the fact that if p = 1/2 then p(1–p) = 1/4, and so 2p(1–p) = 1/2, so the above equation becomes

  [max] n per group = Zα/2² / ( 2 × [margin of error]² )

Sample Size considerations... Test involving πT and πC

Test H0: πT = πC vs Ha: πT ≠ πC:
n's for power 1–β if πT = πC + ∆; prob[type I error] = α

  n per group = { Zα/2 Sqrt[ 2πC(1–πC) ] – Zβ Sqrt[ πC(1–πC) + πT(1–πT) ] }² / ∆²    (See Colton p 168)

             ≈ 2 (Zα/2 – Zβ)² { Sqrt[ π(1–π) ] / ∆ }²,  with π the average of πC and πT

             = 2 {Zα/2 – Zβ}² { σ0/1 / ∆ }²

e.g.  α = 0.05 (2-sided) & β = 0.2 ...  Zα/2 = 1.96; Zβ = –0.84,

  2 (Zα/2 – Zβ)² = 2 {1.96 – (–0.84)}² ≈ 16,  i.e.

  n per group ≈ 16 • π(1–π) / ∆²

So n ≈ 100 for the T group and n ≈ 100 for the C group if πT = 0.6 and πC = 0.4.

See "Sample Size Requirements for Comparison of 2 Proportions" (from the text by Smith and Morrow) under Resources for Chapter 8.
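The averaged-π approximation above can be scripted in a few lines (a Python sketch; the function name is mine):

```python
from statistics import NormalDist

def n_per_group(pi_c, pi_t, alpha=0.05, power=0.80):
    """Approximate n per group: 2 (Z_{a/2} - Z_b)^2 * pbar(1-pbar) / Delta^2."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(1 - power)   # z_b negative when power > 0.5
    pbar = (pi_c + pi_t) / 2
    delta = abs(pi_t - pi_c)
    return 2 * (z_a - z_b)**2 * pbar * (1 - pbar) / delta**2
```

`n_per_group(0.4, 0.6)` gives about 98, matching the "n ≈ 100 per group" obtained with the rounded multiplier 16.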
Effect of Unequal Sample Sizes ( n1 ≠ n2 ) on the precision of estimated differences

If we write the SE of an estimated difference in mean responses as σ Sqrt[ 1/n1 + 1/n2 ], where σ is the (average) per-unit variability of the response, then we can establish the following principles:

1  If costs and other factors (including unit variability) are equal, and if both types of units are equally scarce or equally plentiful, then for a given total sample size of n = n1 + n2, an equal division of n, i.e. n1 = n2, is preferable, since it yields a smaller SE(estimated difference in means) than any non-symmetric division. However, the SE is relatively unaffected until the ratio exceeds 70:30. This is seen in the following table, which gives the value of Sqrt[ 1/n1 + 1/n2 ] = SE(estimated difference in means) for various combinations of n1 and n2 adding to 100 (the 100 itself is arbitrary), assuming σ = 1 (also arbitrary).

     n1    n2    SE(estimated difference in means)    % increase in SE over SE(50:50)*
     50    50    0.200                                 –
     60    40    0.204                                 2.1%
     65    35    0.210                                 4.8%
     70    30    0.218                                 9.1%
     75    25    0.231                                 15.5%
     80    20    0.250                                 25.0%
     85    15    0.280                                 40.0%

   * If the sample sizes are in the ratio π:(1–π), the SE, as a % of SE(50:50), is 50/Sqrt[ π(1–π) ].

2  If one type of unit is much scarcer, and thus the limiting factor, then it makes sense to choose all (say n1) of the available scarcer units, and some n2 ≥ n1 of the other type. The greater n2 is, the smaller the SE of the estimated difference. However, there is a 'law of diminishing returns' once n2 is more than a few multiples of n1. This is seen in the following table, which gives the value of Sqrt[ 1/n1 + 1/n2 ] for n1 fixed (arbitrarily) at 50 and n2 ranging from 1 × n1 to 100 × n1; again, we assume σ = 1.

     n1    n2     Ratio (K)   SE(µ1 - µ2)   SE(K:1) as % of SE(1:1)   SE(K:1) / SE(∞:1)*
     50    50     1.0         0.2000        –                         1.414
     50    75     1.5         0.1825        91.3%                     1.290
     50    100    2.0         0.1732        86.6%                     1.225
     50    150    3.0         0.1633        81.6%                     1.155
     50    200    4.0         0.1581        79.1%                     1.118
     50    250    5.0         0.1549        77.5%                     1.095
     50    300    6.0         0.1527        76.4%                     1.080
     50    400    8.0         0.1500        75.0%                     1.061
     50    500    10.0        0.1483        74.2%                     1.049
     50    1000   20.0        0.1449        72.4%                     1.025
     50    5000   100.0       0.1421        71.1%                     1.005
     50    ∞      ∞           0.1414        70.7%                     1

   * calculated as Sqrt[ (K + 1)/K ];  'efficiency' = K/(K + 1)

Note: these principles apply to both measurement and count data.
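A few entries of the two tables above can be reproduced with a one-line function (a Python sketch; the function name is mine):

```python
from math import sqrt

def se_diff(n1, n2, sigma=1.0):
    """SE of an estimated difference in means: sigma * Sqrt[1/n1 + 1/n2]."""
    return sigma * sqrt(1 / n1 + 1 / n2)

# reproduce a few entries of the tables above (sigma = 1)
print(round(se_diff(50, 50), 3))     # 0.2
print(round(se_diff(70, 30), 3))     # 0.218
print(round(se_diff(50, 200), 4))    # 0.1581
```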
Sample size calculation when using unequal sample sizes to estimate / test a difference in 2 means or proportions

For power (sensitivity) 1–β, and specificity 1–α (2-sided), the sample sizes n1 and n2 have to be such that

  Zα/2 SE( x̄1 – x̄2 ) – Zβ SE( x̄1 – x̄2 ) = ∆.

(If β < 0.5, then Zβ will be negative.) If we assume equal per-unit variability, σ, of the x's in the 2 populations, we can write the requirement as

  Zα/2 σ Sqrt[ 1/n1 + 1/n2 ] – Zβ σ Sqrt[ 1/n1 + 1/n2 ] = ∆.

If we rewrite 1/n1 + 1/n2 as (1/n1){ 1 + n1/n2 } and rearrange, we get

  n1 = { 1 + n1/n2 } (Zα/2 – Zβ)² { σ/∆ }²

or, denoting n2/n1 by K,

  n1 = { 1 + 1/K } (Zα/2 – Zβ)² { σ/∆ }²

i.e.

  n1 = { (K+1)/K } (Zα/2 – Zβ)² { σ/∆ }²

Notes:

a. If K = 1, so that n1 = n2, then we get the familiar "2" at the front of the sample size formula.

b. The same factor applies for proportions: if we use

     σ0/1 = Sqrt[ π(1–π) ]

   as an "average" standard deviation for the individual 0's and 1's in each population (π being the average of the two π's), then we get the approximate formula:

     n1 ≈ { (K+1)/K } (Zα/2 – Zβ)² { π(1–π) / ∆² }
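The (K+1)/K factor is easy to script (a Python sketch; the function name is mine):

```python
from statistics import NormalDist

def n1_unequal(sigma, delta, K=1.0, alpha=0.05, power=0.80):
    """n1 when n2 = K * n1:  n1 = ((K+1)/K) (Z_{a/2} - Z_b)^2 (sigma/delta)^2."""
    z = NormalDist().inv_cdf
    return (K + 1) / K * (z(1 - alpha / 2) - z(1 - power))**2 * (sigma / delta)**2
```

With K = 1 this reduces to the familiar 2(Zα/2 – Zβ)²(σ/∆)²; with K = 4, n1 is (5/4)/2 = 62.5% of the equal-allocation n1, illustrating the diminishing returns described on the previous page.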
Add-ins for M&M §8 and §9 statistics for epidemiology
Sample Size considerations... Test involving OR

Test H0: OR = 1 vs. Ha: OR = ORalt:
n's for power 1–β if OR = ORalt; prob[type I error] = α

Key points: ln[or^] is most precise when all 4 cells are of equal size; so...

1  Increasing the control:case ratio leads to diminishing marginal gains in precision.

   To see this... examine the function

     1/(# of cases) + 1/(multiple of this # of controls)

   for various values of the "multiple" [like we did back in Chapter 8, for the "effect of unequal sample sizes"].

2  The more unequal the distribution of the etiologic / preventive factor, the less precise the estimate.

   Examine the functions

     1/(# of exposed cases) + 1/(# of unexposed cases)
   and
     1/(# of exposed controls) + 1/(# of unexposed controls)

Here I use ln for natural log (elsewhere I have used log; I use them interchangeably).

Work in the ln(or) scale;  SE[ ln(or) ] = Sqrt[ 1/a + 1/b + 1/c + 1/d ]

Need  Zα/2 SE[ ln(or) | H0 ] – Zβ SE[ ln(or) | ORalt ] < "∆",  where "∆" = ln(ORalt)

[Diagram: two sampling distributions of ln(or), one centred at ln[OR] = 0 (with its upper α/2 critical value marked) and one centred at ln[ORalt], with tail area β; the distance between the two centres is ∆ = ln[ORalt].]

Substitute the expected a, b, c, d values under the null and alt. into the SE's and solve for the numbers of cases and controls.

Reading the graphs on the next page (note the log scale for the observed or):

Take as an example the study in the middle panel, with 200 cases and an exposure prevalence of 8%. Say that the Type I error rate is set at α = 0.05 (2-sided), so that the upper critical value (the one that cuts off the top 2.5% of the null distribution) is close to or = 2. Draw a vertical line at this critical value, and examine how much of each non-null distribution falls to the right of it. This area to the right of the critical value is the power of the study, i.e., the probability of obtaining a significant or when in fact the indicated non-null value of OR is correct. The two curves at each OR value are for studies with 1 (grey) and 4 (black) controls/case. Note that the OR values 1, 1.5, 2.25 and 3.375 are also on a log scale.

Power is larger if...

 i   the non-null OR is >> 1 (cf 1.5 vs 2.25 vs 3.375)
 ii  exposure is common (cf 2% vs 8% vs 32%) (and not near universal)
 iii one uses more cases (cf 100 vs 200 vs 400), and more controls/case (1 vs 4)

References: Schlesselman; Breslow and Day, Volume II, ...
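The "substitute expected cells and solve" recipe can be sketched in Python. This is a rough illustration only: the function name, and the simplification of using the controls' exposure prevalence for the null SE, are my own, not a standard routine.

```python
from math import log, sqrt
from statistics import NormalDist

def cc_power(n_cases, controls_per_case, p_exp_controls, or_alt, alpha=0.05):
    """Approximate power of a case-control study, via expected cells and Woolf's Var[ln or]."""
    nd = NormalDist()
    n_controls = controls_per_case * n_cases
    # expected exposure prevalence among cases when OR = or_alt
    odds_ctl = p_exp_controls / (1 - p_exp_controls)
    p_exp_cases = or_alt * odds_ctl / (1 + or_alt * odds_ctl)
    a, b = n_cases * p_exp_cases, n_cases * (1 - p_exp_cases)      # exposed / unexposed cases
    c, d = n_controls * p_exp_controls, n_controls * (1 - p_exp_controls)
    se_alt = sqrt(1 / a + 1 / b + 1 / c + 1 / d)                   # Woolf SE under the alternative
    # under H0, approximate both groups' exposure prevalence by that of the controls (an assumption)
    p0 = p_exp_controls
    se_null = sqrt(1 / (n_cases * p0) + 1 / (n_cases * (1 - p0))
                   + 1 / (n_controls * p0) + 1 / (n_controls * (1 - p0)))
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf((z_crit * se_null - log(or_alt)) / se_alt)
```

Playing with the arguments reproduces the qualitative conclusions i-iii above: power grows with the non-null OR, with the number of cases and controls/case, and with a less extreme exposure prevalence.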
Factors affecting variability of estimates from, and statistical power of, case-control studies
[Figure: nine panels showing sampling distributions of the observed or (log scale, 0.25 to 8), for case-control studies with 100, 200 and 400 cases (columns) and exposure prevalences of 2%, 8% and 32% (rows); within each panel, distributions are centred at OR = 1, 1.5, 2.25 and 3.375, with separate curves for 1 and 4 controls per case.]
jh 1995-2003
The "Exact" Test for 2 x 2 tables [material taken from A&B §4.9]
Even with the continuity correction there will be some doubt about the adequacy of the χ² approximation when the frequencies are particularly small. An exact test was suggested almost simultaneously in the mid-1930s by R. A. Fisher, J. O. Irwin and F. Yates. It consists in calculating the exact probabilities of the possible tables described in the previous subsection. The probability of a table with frequencies

      a    b  | r1
      c    d  | r2
  -----------------
      c1   c2 | N

is given by the formula

  P[ a | r1, r2, c1, c2 ] = r1! r2! c1! c2! / ( N! a! b! c! d! )        (4.25)

This is, in fact, the probability of the observed cell frequencies conditional on the observed marginal totals, under the null hypothesis of no association between the row and column classifications. Given any observed table, the probabilities of all tables with the same marginal totals can be calculated, and the P value for the significance test calculated by summation. Example 4.14 illustrates the calculations and some of the difficulties of interpretation which may arise. The data in Table 4.6, due to M. Hellman, are discussed by Yates (1934).

Table 4.6  Data on malocclusion of teeth in infants (Yates, 1934)

                 Normal teeth   Malocclusion   Total
  Breast-fed          4              16          20
  Bottle-fed          1              21          22
  ---------------------------------------------------
  Total               5              37          42

There are six possible tables with the same marginal totals as those observed, since neither a nor c (in the notation given above) can fall below 0 or exceed 5, the smallest marginal total in the table. The cell frequencies in each of these tables are shown in Table 4.7. Below them are shown the probabilities of these tables, calculated under the null hypothesis.

Table 4.7  Cell frequencies in tables with the same marginal totals as those in Table 4.6

  0 20 | 20    1 19 | 20    2 18 | 20    3 17 | 20    4 16 | 20    5 15 | 20
  5 17 | 22    4 18 | 22    3 19 | 22    2 20 | 22    1 21 | 22    0 22 | 22
  5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42

  a     0        1        2        3        4        5
  Pa    0.0310   0.1720   0.3440   0.3096   0.1253   0.0182

The probabilities of the various tables are calculated in the following way*: the probability that a = 0 is, from (4.25),

  P0 = 20! 22! 5! 37! / ( 42! 0! 20! 5! 17! ) = 0.03096.

Tables of log factorials (Fisher and Yates, 1963, Table XXX) are often useful for this calculation, and many scientific calculators have a factorial key (although it may only function correctly for integers less than 70). Alternatively, the expression for P0 can be calculated without factorials, by repeated multiplication and division after cancelling common factors:

  P0 = ( 22 × 21 × 20 × 19 × 18 ) / ( 42 × 41 × 40 × 39 × 38 ) = 0.03096.

The probabilities for a = 1, 2, . . ., 5 can be obtained in succession. Thus,

  P1 = (5 × 20)/(1 × 18) × P0,    P2 = (4 × 19)/(2 × 19) × P1,  etc.

The results are shown in the table above.

[Notes from JH:

1. The 5 tables from the tea-tasting experiment -- the 2x2 tables with all marginal totals = 4 -- are another example of this hypergeometric distribution.

* 2. Don't worry about the formula and the factorials; Excel has this function built in. It is called the hypergeometric probability function. It is like the binomial, except that instead of specifying p, one specifies the size of the POPULATION and the NUMBER OF POSITIVES IN THE POPULATION. For example, to get P1 above, one would ask for HYPGEOMDIST(a; r1; c1; N).

The spreadsheet "Fisher's Exact test" uses this function; to use the spreadsheet, simply type in the 4 cell frequencies a, b, c and d. The spreadsheet will calculate the probability for each possible table. Then you can find the tail areas yourself.]
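The same hypergeometric probabilities are a few lines in Python (a sketch; the function name is mine), using `math.comb`:

```python
from math import comb

def table_probs(r1, c1, N):
    """Hypergeometric P[a | margins] for every possible a, per eqn (4.25)."""
    lo, hi = max(0, r1 + c1 - N), min(r1, c1)
    return {a: comb(c1, a) * comb(N - c1, r1 - a) / comb(N, r1)
            for a in range(lo, hi + 1)}

probs = table_probs(20, 5, 42)        # margins of Table 4.6
print(round(probs[4], 4))             # 0.1253  (the observed table)
```

Summing the tail reproduces the quantities discussed on the next page: the one-sided P = probs[4] + probs[5] ≈ 0.1435, and the mid-P = 0.5 × probs[4] + probs[5] ≈ 0.0808.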
The "Exact" Test for 2 x 2 tables continued...
This is the complete conditional distribution for the observed marginal totals, and the probabilities sum to unity, as would be expected. Note the importance of carrying enough significant digits in the first probability to be calculated; the above calculations were carried out with more decimal places than recorded, by retaining each probability in the calculator for the next stage. The observed table has a probability of 0.1253. To assess its significance we could measure the extent to which it falls into the tail of the distribution by calculating the probability of that table or of one more extreme. For a one-sided test the procedure clearly gives P = 0.1253 + 0.0182 = 0.1435. The result is not significant at even the 10% level.

For a two-sided test the other tail of the distribution must be taken into account, and here some ambiguity arises. Many authors advocate that the one-tailed P value should be doubled. In the present example, the one-tailed test gave P = 0.1435, and the two-tailed test would give P = 0.2870. An alternative approach is to calculate P as the total probability of tables, in either tail, which are at least as extreme as that observed in the sense of having a probability at least as small. In the present example we should have

  P = 0.1253 + 0.0182 + 0.0310 = 0.1745.

The first procedure is probably to be preferred, on the grounds that a significant result is interpreted as strong evidence for a difference in the observed direction, and there is some merit in controlling the chance probability of such a result to no more than half the two-sided significance level. The tables of Finney et al. (1963) enable one-sided tests at various significance levels to be made without computation, provided the frequencies are not too great.

To calculate the mid-P value, only half the probability of the observed table is included, and we have

  mid-P = 0.5(0.1253) + 0.0182 = 0.0808

as the one-sided value; the two-sided value may be obtained by doubling this, to give 0.1617.

The results of applying the exact test in this example may be compared with those obtained by the χ² test with Yates's correction. We find X² = 2.39 (P = 0.12) without correction and X²C = 1.14 (P = 0.29) with correction. The probability level of 0.29 for X²C agrees well with the two-sided value 0.29 from the exact test, and the probability level of 0.12 for X² is a fair approximation to the exact mid-P value of 0.16.

Cochran (1954) recommends the use of the exact test, in preference to the χ² test with continuity correction, (i) if N < 20, or (ii) if 20 < N < 40 and the smallest expected value is less than 5. With modern scientific calculators and statistical software the exact test is much easier to calculate than previously, and should be used for any table with an expected value less than 5.

The exact test, and therefore the χ² test with Yates's correction for continuity, have been criticized over the last 50 years on the grounds that they are conservative, in the sense that a result significant at, say, the 5% level will be found in less than 5% of hypothetical repeated random samples from a population in which the null hypothesis is true. This feature was discussed in §4.7, where it was remarked that the problem is a consequence of the discrete nature of the data and causes no difficulty if the precise level of P is stated. Another source of criticism has been that the tests are conditional on the observed margins, which frequently would not all be fixed. For example, in Example 4.14 one could imagine repetitions of sampling in which 20 breast-fed infants were compared with 22 bottle-fed infants, but in many of these samples the number of infants with normal teeth would differ from 5. The conditional argument is that, whatever inference can be made about the association between breast-feeding and tooth decay, it has to be made within the context that exactly five children had normal teeth. If this number had been different, then the inference would have been made in this different context, but that is irrelevant to inferences that can be made when there are five children with normal teeth. Therefore, we do not accept the various arguments that have been put forward for rejecting the exact test based on consideration of possible samples with different totals in one of the margins. The issues were discussed by Yates (1984) and in the ensuing discussion, and by Barnard (1989) and Upton (1992), and we will not pursue this point further. Nevertheless, the exact test and the corrected χ² test have the undesirable feature that the average value of the significance level, when the null hypothesis is true, exceeds 0.5. The mid-P value avoids this problem, and so is more appropriate when combining results from several studies (see §4.7).
The "Exact" Test for 2 x 2 tables continued...
As for a single proportion, the mid-P value corresponds to an uncorrected χ² test, whilst the exact P value corresponds to the corrected χ² test. The confidence limits for the difference, ratio or odds ratio of two proportions based on the standard errors given by (4.14), (4.17) or (4.19), respectively, are all approximate, and the approximate values will be suspect if one or more of the frequencies in the 2 x 2 table are small. Various methods have been put forward to give improved limits, but all of these involve iterations and are tedious to carry out on a calculator. The odds ratio is the easiest case. Apart from exact limits, which involve an excessive amount of calculation, the most satisfactory limits are those of Cornfield (1956); see Example 16.1 and Breslow and Day (1980, §4.3) or Fleiss (1981, §5.6). For the ratio of two proportions, a method was given by Koopman (1984) and Miettinen and Nurminen (1985) which can be programmed fairly readily. The confidence interval produced gives a good approximation to the required confidence coefficient, but the two tail probabilities are unequal due to skewness. Gart and Nam (1988) gave a correction for skewness, but this is tedious to calculate. For the difference of two proportions, a method was given by Mee (1984) and Miettinen and Nurminen (1985). This involves more calculation than for the ratio limits, and again there could be a problem due to skewness (Gart and Nam, 1990).

Notes by JH:

• Fisher's exact test is usually used just as a test*; if one is interested in the difference ∆ = π1 – π2, the conditional approach does not yield a corresponding confidence interval for ∆. [It does provide one for the comparative odds-ratio parameter ψ = { π1/(1–π1) } ÷ { π2/(1–π2) }.]

• Thus, one can find anomalous situations where the (conditional) test provides P > 0.05, making the difference 'not statistically significant', whereas the large-sample (unconditional) CI for ∆, computed as p1 – p2 ± z SE(p1 – p2), does not overlap 0, and so would indicate that the difference is 'statistically significant'. [* See the Breslow and Day text, Vol I, §4.2, for CI's for ψ derived from the conditional distribution.]

• See the letter from Begin & Hanley re 1/20 mortality with pentamidine vs 5/20 with Trimethoprim-Sulfamethoxazole in patients with Pneumocystis carinii Pneumonia, Annals Int Med 106: 474, 1987.

• Miettinen's test-based method of forming CI's, while it can have some drawbacks, keeps the correspondence between test and CI, and avoids such anomalies.

• The word "exact" means that the p-values are calculated using a finite discrete reference distribution -- the hypergeometric distribution (cousin of the binomial) -- rather than using large-sample approximations. It doesn't mean that it is the correct test. [See the comment by A&B in their section dealing with mid-P values.] While greater accuracy is always desirable, this particular test uses a 'conditional' approach that not all statisticians agree with. Moreover, compared with some unconditional competitors, the test is somewhat conservative, and thus less powerful, particularly if sample sizes are very small.

• This illustrates one important point about parameters related to binary data -- with means of interval data, we typically deal just with differences*; however, with binary data, we often switch between differences and ratios, either because the design of the study forces us to use odds ratios (case-control studies), or because the most readily available regression software uses a ratio (i.e. logistic regression for odds ratios), or because one is easier to explain than the other, or because one has a more natural interpretation (e.g. in assessing the cost per life saved of a more expensive and more efficacious management modality, it is the difference in, rather than the ratio of, mortality rates that comes into the calculation). [* The sampling variability of estimated ratios of means of interval data is also more difficult to calculate accurately.]

• Two versions of an unconditional test for H0: π1 = π2 are available: Liddell; Suissa and Shuster.
page 14
Add-ins for M&M §8 and §9
FISHER'S EXACT TEST IN A DOUBLE-BLIND STUDY OF SYMPTOM PROVOCATION TO DETERMINE FOOD SENSITIVITY (N Engl J Med 1990; 323:429-33.)
Abstract

Background Some claim that food sensitivities can best be identified by intradermal injection of extracts of the suspected allergens to reproduce the associated symptoms. A different dose of an offending allergen is thought to "neutralize" the reaction.

Methods To assess the validity of symptom provocation, we performed a double-blind study that was carried out in the offices of seven physicians who were proponents of this technique and experienced in its use. Eighteen patients were tested in 20 sessions (two patients were tested twice) by the same technician, using the same extracts (at the same dilutions with the same saline diluent) as those previously thought to provoke symptoms during unblinded testing. At each session three injections of extract and nine of diluent were given in random sequence. The symptoms evaluated included nasal stuffiness, dry mouth, nausea, fatigue, headache, and feelings of disorientation or depression. No patient had a history of asthma or anaphylaxis.

Results The responses of the patients to the active and control injections were indistinguishable, as was the incidence of positive responses: 27 percent of the active injections (16 of 60) were judged by the patients to be the active substance, as were 24 percent of the control injections (44 of 180). Neutralizing doses given by some of the physicians to treat the symptoms after a response were equally efficacious whether the injection was of the suspected allergen or saline. The rate of judging injections as active remained relatively constant within the experimental sessions, with no major change in the response rate due to neutralization or habituation.

Conclusions When the provocation of symptoms to identify food sensitivities is evaluated under double-blind conditions, this type of testing, as well as the treatments based on "neutralizing" such reactions, appears to lack scientific validity. The frequency of positive responses to the injected extracts appears to be the result of suggestion and chance.

Table 1: Responses of 18 Patients Forced to Decide Whether Injections Contained an Active Ingredient or Placebo

Pt.     Active Injection     Placebo Injection     P
No.*    resp    no resp      resp    no resp       Value†
 3       2       1            1       8            0.13
 1       2       1            2       7            0.24
14a      2       1            2       7            0.24
12       1       2            0       9            0.25
16       2       1            3       6            0.36
18       2       1            4       5            0.50
14b      1       2            2       7            0.87
 4       1       2            2       7            0.87
 5       1       2            2       7            0.87
 9       0       3            0       9            --
 2a      0       3            1       8            0.75
13       0       3            1       8            0.75
15       1       2            3       6            0.76
 6       0       3            2       7            0.55
 8       0       3            2       7            0.55
17       1       2            5       4            0.50
 2b      0       3            3       6            0.38
 7       0       3            3       6            0.38
10       0       3            3       6            0.38
11       0       3            3       6            0.38

*Patients were numbered in the order they were studied. The order in the table is related to the degree that the results agree with the hypothesis that patients could distinguish active injections from placebo injections. The results listed below those of Patient 9 do not support this hypothesis; placebo injections were identified as active at a higher rate than were true active injections. The letters a and b denote the first and second testing sessions, respectively, in Patients 2 and 14.
-------------------------------------------------------------------------------------------------
† Calculated according to Fisher's exact test, which assumes that the hypothesized direction of effect is the same as the direction of effect in the data. Therefore, when the effect is opposite to the hypothesis, as it is for the data below those of Patient 9, the P value computed is testing the null hypothesis that the results obtained were due to chance as compared with the possibility that the patients were more likely to judge a placebo injection as active than an active injection.
ID denotes intradermal, and SC subcutaneous.
The value is the P value associated with the test of whether the common odds ratio(the odds ratio for all patients) is equal to 1.0. The common odds ratio was equal to1.13 (computed according to the Mantel-Haenszel test).
Notes on P-Values from Fisher's Exact Test in previous article
Patient no. 3
                        Response
                        +     –     Total
Active Injection        2     1    |  3
Placebo Injection       1     8    |  9
                       -------------
Total                   3     9

All possible tables with a total of 3 +ve responses
[hypergeometric: prob(a) = C(3,a)·C(9,3–a) / C(12,3)]

a (+ve on Active)     0           1          2       3
Active               0 3         1 2        2 1     3 0
Placebo              3 6         2 7        1 8     0 9
prob                 0.382       0.491      0.123   0.005
(pt #)          (2b,7,10,11)  (14b,4,5)     (3)
P-Value*             1.0         0.618      0.128   0.005

(*1-sided, guided by Halt: π of +ve responses with Active > π of +ve responses with Placebo)

Patient no. 18
                        Response
                        +     –     Total
Active Injection        2     1    |  3
Placebo Injection       4     5    |  9
                       -------------
Total                   6     6

All possible tables with a total of 6 +ve responses
[prob(a) = C(3,a)·C(9,6–a) / C(12,6)]

a                     0       1       2       3
Active               0 3     1 2     2 1     3 0
Placebo              6 3     5 4     4 5     3 6
prob                 0.091   0.409   0.409   0.091
(pt #)                       (17)    (18)
P-Value              1.0     0.909   0.500   0.091
(1-sided, as above)

Patient no. 1
                        Response
                        +     –     Total
Active Injection        2     1    |  3
Placebo Injection       2     7    |  9
                       -------------
Total                   4     8

All possible tables with a total of 4 +ve responses
[prob(a) = C(3,a)·C(9,4–a) / C(12,4)]

a                     0       1        2       3
Active               0 3     1 2      2 1     3 0
Placebo              4 5     3 6      2 7     1 8
prob                 0.255   0.510    0.218   0.018
(pt #)                       (15)   (1, 14a)
P-Value              1.0     0.745    0.236   0.018

In Table 1, the P-values for patients below patient 9 are calculated as 1-sided, but guided by the opposite Halt from that used for the patients in the upper half of the table, i.e. by Halt: π of +ve responses with Active < π of +ve responses with Placebo.

It appears that the authors decided the "sided-ness" of the Halt after observing the data!!! and that they used a different Halt for different patients!!!
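The hypergeometric probabilities and one-sided P-values in these panels are easy to verify numerically. This is an illustrative Python sketch (not part of the original notes; the function names are mine), shown here for patient 3's table (3 active, 9 placebo injections, 3 +ve responses in all):

```python
from math import comb

def hypergeom_probs(n_active, n_placebo, total_pos):
    """Conditional (hypergeometric) probabilities of a = 0, 1, ... +ve
    responses among the active injections, with all margins fixed."""
    n = n_active + n_placebo
    return [comb(n_active, a) * comb(n_placebo, total_pos - a) / comb(n, total_pos)
            for a in range(min(n_active, total_pos) + 1)]

def fisher_one_sided_p(n_active, n_placebo, total_pos, a_obs):
    """Upper-tail P-value: Prob(a >= a_obs | fixed margins)."""
    return sum(hypergeom_probs(n_active, n_placebo, total_pos)[a_obs:])

# Patient 3: 2 of 3 active vs 1 of 9 placebo judged "active"
probs = hypergeom_probs(3, 9, 3)          # 0.382, 0.491, 0.123, 0.005
p_value = fisher_one_sided_p(3, 9, 3, 2)  # 28/220 = 0.127, the 0.13 of Table 1
```

Changing the arguments to (3, 9, 6, 2) and (3, 9, 4, 2) reproduces the patient 18 and patient 1 panels.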
Fisher's Exact Test political and ecological correctness . . . M&M §8.2 updated Dec 14 2003
The Namibian government expelled the authors from Namibia following the publication of this article; the reason given was that their "data and conclusions were premature". .. jh ¶•

Since 1900 the world's population has increased from about 1.6 to over 5 billion; the U.S. population has kept pace, growing from nearly 75 to 260 million. While the expansion of humans and environmental alterations go hand in hand, it remains uncertain whether conservation programs will slow our biotic losses. Current strategies focus on solutions to problems associated with diminishing and less continuous habitats, but in the past, when habitat loss was not the issue, active intervention prevented extirpation. Here we briefly summarize intervention measures and focus on tactics for species with economically valuable body parts, particularly on the merits and pitfalls of biological strategies tried for Africa's most endangered pachyderms, rhinoceroses.

Given the inadequacies of protective legislation and enforcement, Namibia, Zimbabwe, and Swaziland are using a controversial preemptive measure, dehorning (Fig. D), with the hope that complete devaluation will buy time for implementing other protective measures (7). In Namibia and Zimbabwe, two species, black and white rhinos (Ceratotherium simum), are dehorned, a tactic resulting in sociological and biological uncertainty: Is poaching deterred? Can hornless mothers defend calves from dangerous predators?

On the basis of our work in Namibia during the last 3 years (8) and comparative information from Zimbabwe, some data are available. Horns regenerate rapidly, about 8.7 cm per animal per year, so that 1 year after dehorning the regrown mass exceeds 0.5 kg. Because poachers apparently do not prefer animals with more massive horns (8), frequent and costly horn removal may be required (9). In Zimbabwe, a population of 100 white rhinos, with at least 80 dehorned, was reduced to less than 5 animals in 18 months (10). These discouraging results suggest that intervention by itself is unlikely to eliminate the incentive for poaching. Nevertheless, some benefits accrue when governments, rather than poachers, practice horn harvesting, since less horn enters the black market. Whether horn stockpiles may be used to enhance conservation remains controversial, but mortality risks associated with anesthesia during dehorning are low (5).

Biologically, there have also been problems. Despite media attention and a bevy of allegations about the soundness of dehorning (11), serious attempts to determine whether dehorning is harmful have been remiss. A lack of negative effects has been suggested because (i) horned and dehorned individuals have interacted without subsequent injury; (ii) dehorned animals have thwarted the advance of dangerous predators; (iii) feeding is normal; and (iv) dehorned mothers have given birth (12). However, most claims are anecdotal and mean little without attendant data on demographic effects. For instance, while some dehorned females give birth, it may be that these females were pregnant when first immobilized. Perhaps others have not conceived or have lost calves after birth. Without knowing more about the frequency of mortality, it seems premature to argue that dehorning is effective.

We gathered data on more than 40 known horned and hornless black rhinos in the presence and absence of dangerous carnivores in a 7,000 km2 area of the northern Namib Desert and on 60 horned animals in the 22,000 km2 Etosha National Park. On the basis of over 200 witnessed interactions between horned rhinos and spotted hyenas (Crocuta crocuta) and lions (Panthera leo) we saw no cases of predation, although mothers charged predators in about 45% of the cases. Serious interspecific aggression is not uncommon elsewhere in Africa, and calves missing ears and tails have been observed from South Africa, Kenya, Tanzania, and Namibia (13).

To evaluate the vulnerability of dehorned rhinos to potential predators, we developed an experimental design using three regions:

• Area A had horned animals with spotted hyenas and occasional lions [ ... ]
• Area B had dehorned animals lacking dangerous predators.
• Area C consisted of dehorned animals that were sympatric with hyenas only.

Populations were discrete and inhabited similar xeric landscapes that averaged less than 125 mm of precipitation annually. Area A occurred north of a country-long veterinary cordon fence, whereas animals from areas B and C occurred to the south or east, and no individuals moved between regions.

The differences in calf survivorship were remarkable. All three calves in area C died within 1 year of birth, whereas all calves survived for both dehorned females living without dangerous predators (area B; n = 3) and for horned mothers in area A (n = 4). Despite admittedly restricted samples, the differences are striking [Fisher's (3 x 2) exact test, P = 0.017; area B versus C, P = 0.05; area A versus C, P = 0.029]††. The data offer a first assessment of an empirically derived relation between horns and recruitment.

                 A    B    C
†† survived      4    3    0
   died          0    0    3
   total         4    3    3

B vs C: all possible tables with the margins fixed

             B  C    B  C    B  C    B  C    tot*
survived     3  0    2  1    1  2    0  3     3
died         0  3    1  2    2  1    3  0     3
total        3  3    3  3    3  3    3  3

prob        1/20    9/20    9/20    1/20

A vs C: all possible tables with the margins fixed

             A  C    A  C    A  C    A  C    tot*
survived     4  0    3  1    2  2    1  3     4
died         0  3    1  2    2  1    3  0     3
total        4  3    4  3    4  3    4  3

prob        1/35   12/35   18/35    4/35

Our results imply that hyena predation was responsible for calf deaths, but other explanations are possible. If drought affected one area to a larger extent than the others, then calves might be more susceptible to early mortality. This possibility appears unlikely because all of western Namibia has been experiencing drought and, on average, the desert rhinos in one area were in no poorer bodily condition than those in another. Also, the mothers who lost calves were between 15 to 25 years old, suggesting that they were not first-time, inexperienced mothers (14). What seems more likely is that the drought-induced migration of more than 85% of the large herbivore biomass (kudu, springbok, zebra, gemsbok, giraffe, and ostrich) resulted in hyenas preying on an alternative food, rhino neonates, when mothers with regenerating horns could not protect them.

Clearly, unpredictable events, including drought, may not be anticipated on a short-term basis. Similarly, it may not be possible to predict when governments can no longer fund antipoaching measures, an event that may have led to the collapse of Zimbabwe's dehorned white rhinos. Nevertheless, any effective conservation actions must account for uncertainty. In the case of dehorning, additional precautions must be taken. [ ... ]
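The one-sided P-values for the B vs C and A vs C comparisons can be checked numerically. This is an illustrative Python sketch (not part of the original notes; the function name is mine):

```python
from math import comb

def fisher_upper_p(n1, n2, pos_total, obs):
    """P(survivors in group 1 >= obs), with all margins fixed (hypergeometric)."""
    denom = comb(n1 + n2, pos_total)
    return sum(comb(n1, k) * comb(n2, pos_total - k)
               for k in range(obs, min(n1, pos_total) + 1)) / denom

p_B_vs_C = fisher_upper_p(3, 3, 3, 3)  # all 3 of B's calves survived: 1/20 = 0.05
p_A_vs_C = fisher_upper_p(4, 3, 4, 4)  # all 4 of A's calves survived: 1/35 = 0.029
```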
¶ Do you agree?
Add-ins for M&M §8 and §9 [ updated Jan 3 2004 ] stratified data
Combining measures from several strata [cf. M&M3 §2.6, A&B3 §16.2, Rothman2002, Chapter 8]
Why not just add (sum, Σ) the 'a' frequencies across tables (strata), the 'b' frequencies across tables, ... the 'd' frequencies across tables, to make a single 2 x 2 table with entries

   Σa   Σb
   Σc   Σd

and use these 4 cell counts to perform the analyses?

e.g. 1 Batting Averages of Gehrig and Ruth
(see book "Innumeracy" by Paulos; see early chapter in text "Statistics" by Freedman et al)

                        Gehrig        Ruth
1st half of season       .290    <    .300
2nd half of season       .390    <    .400
––––––––––––––––––––––––––––––––––––––––––––––
Entire season            .357    >    .333 !!!

Explanation:
                        Gehrig        Ruth
1st half of season
   Hits                   29           60
   AT BAT                100          200
2nd half of season
   hits                   78           40
   AT BAT                200          100
––––––––––––––––––––––––––––––––––––––––––––––
Entire season
   hits                  107          100
   AT BAT                300          300

Two features, involving time, created this 'paradox':
-1- batting averages increased from the 1st to the 2nd half of the season
-2- Ruth had a greater proportion of his AT BATs in the 1st half than Gehrig

e.g. 2 Numbers of Applicants (n), and Admission rates (%), to Berkeley Graduate School

             Men                Women
Faculty    n     % admitted    n     % admitted
A          825   62            108   82
B          560   63             25   68
C          325   37            593   34
D          417   33            375   35
E          191   28            393   27
F          373    6            341    7
Combined  2691   44           1835   30

Paradox: π(admission | male) > π(admission | female) overall but, by and large, faculty by faculty, it's the other way!!!

Explanation: Women are more likely than men to apply to the faculties that admit lower proportions of applicants.

Remedy: aggregate the within-strata comparisons [like vs. like], rather than make comparisons with aggregated raw data -- see next for classical ways of doing this; MH stands for "Mantel-Haenszel".

Simpson's paradox is an extreme form of confounding. Some textbooks give made-up examples; see the web site for course 626 for several real examples.

For other examples:
1. See Moore and McCabe (3rd Ed) §2.6 (The Perils of Aggregation, including Simpson's paradox). They speak of 'lurking' variables; in epidemiology we speak of 'confounding' variables.
2. See Rothman2002, p1 (death rates Panama vs. Sweden) and p2 (20-year mortality in female smokers and non-smokers in Whickham, England).
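The batting-average reversal is pure arithmetic, and a few lines make the point. This is an illustrative Python sketch (not part of the original notes):

```python
def average(hits, at_bats):
    return hits / at_bats

gehrig = {"1st": (29, 100), "2nd": (78, 200)}
ruth   = {"1st": (60, 200), "2nd": (40, 100)}

# Ruth is ahead in each half-season...
ahead_each_half = all(average(*gehrig[h]) < average(*ruth[h]) for h in ("1st", "2nd"))

# ...but Gehrig is ahead for the whole season: the aggregated average is a
# weighted average, and Ruth's weight is on his weaker (1st) half.
g_season = average(29 + 78, 100 + 200)   # 107/300 = .357
r_season = average(60 + 40, 200 + 100)   # 100/300 = .333
```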
Story 4: Does Smoking Improve Survival? In the EESEE Expansion Modules on the website for the text (link from course description) [also in Rothman2002, with finer age-categories]
http://WWW.WHFREEMAN.COM/STATISTICS/IPS/EESEE4/EESEES4.HTM

A survey concerned with thyroid and heart disease was conducted in 1972-74 in a district near Newcastle, United Kingdom by Tunbridge et al (1977). A follow-up study of the same subjects was conducted twenty years later by Vanderpump et al (1996). Here we explore data from the survey on the smoking habits of 1314 women who were classified as being a current smoker or as never having smoked at the time of the original survey. Of interest is whether or not they survived until the second survey.

The following tables summarize the results of the experiment: [note from JH: we would not call it "an experiment"; mathematical statisticians call any process for generating data "an experiment"]

Table 1: Relationship between smoking habits and 20-year survival in 1314 women (582 Smokers, 732 Non-Smokers)

                        Smoking Status
Survival Status      Smoker      Non-Smoker      Compared...
Dead                  139           230
Alive                 443           502

Risk = #dead/#Total   139/582 = 23.9%   230/732 = 31.4%   Diff: –7.5%   Ratio: 0.76
Odds = #dead/#alive   139/443 = 0.314(:1)   230/502 = 0.458(:1)   Ratio: 0.68*

* shortcut: or = (a × d)/(b × c) = (139 × 502)/(230 × 443) = 69778/101890 = 0.68

Table 2: Twenty-year survival status for 1314 women categorized by age and smoking habits at the time of the original survey.

Age Group   Survival        Smoking Status
(Years)     Status       Smoker    Non-Smoker
18-44       Dead           19         13
            Alive         269        327        (or = 1.78)
44-64       Dead           78         52
            Alive         167        147        (or = 1.32)
>64         Dead           42        165
            Alive           7         28        (or = 1.02)

The odds ratio is > 1 in each age group!

Why the contradictory results? A message the tobacco companies would love us to believe!
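A quick computation (illustrative Python, not part of the original notes) shows the crude odds ratio pointing one way and every age-specific odds ratio pointing the other:

```python
def odds_ratio(a, b, c, d):
    # a, b = deaths (smokers, non-smokers); c, d = alive (smokers, non-smokers)
    return (a * d) / (b * c)

crude = odds_ratio(139, 230, 443, 502)        # about 0.68: smoking looks "protective"
by_age = {
    "18-44": odds_ratio(19, 13, 269, 327),    # about 1.78
    "44-64": odds_ratio(78, 52, 167, 147),    # about 1.32
    ">64":   odds_ratio(42, 165, 7, 28),      # about 1.02
}
# crude < 1, yet every stratum-specific odds ratio is > 1: age confounds the
# crude comparison (the smokers were, on the whole, younger).
```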
Adjustment (compare like with like, i.e. Σ within-category estimates**)

Ratio** estimators ("M-H") [implicit precision weighting]

Risk Ratio:
   Σ [ #cases_index × DENOM_ref / DENOM_total ]
   –––––––––––––––––––––––––––––––––––––––––––
   Σ [ #cases_ref × DENOM_index / DENOM_total ]

Rate Ratio: same, except that the denominators are amounts of person-time, not persons.

Odds Ratio [cohort / prevalence study]:
   Σ [ #cases_index × #"rest"_ref / total ]
   ––––––––––––––––––––––––––––––––––––––
   Σ [ #cases_ref × #"rest"_index / total ]

Odds Ratio [case-control study]: same as the risk and rate ratios above, except that the "denominators" are partial (pseudo) ones estimated from a denominator series* ("controls"); "size"_total refers to the size of the (stratum-specific) case series and denominator series combined. *MODERN way to view case-control studies.

Weighted averages [explicit weights (w's), scaled to sum to 1]: precision-based (inverse variance) or investigator-chosen ("standardized")

Mean Difference:   Σ w × ȳ_index – Σ w × ȳ_ref = Σ w × ( ȳ_index – ȳ_ref )

Risk Difference:   Σ w × risk_index – Σ w × risk_ref = Σ w × ( risk_index – risk_ref )

Odds Ratio ("Woolf" method, precision-based):
   exp[ Σ w × log odds_index – Σ w × log odds_ref ] = exp[ Σ w × log[odds ratio] ], all logs to base e,
   where w ∝ 1 / var[ log[odds ratio] ] = 1 / (1/a + 1/b + 1/c + 1/d)

Note: Computational formulae are often constructed to minimize the number of steps and to avoid division, and so may hide the real structure of the estimator.

e.g. 8.1 in Rothman p147, for the risk difference (precision weighting) [cohort / prevalence study]:

   Var[risk diff] is proportional to 1/N0 + 1/N1 = (N0 + N1)/(N0 N1)

so the denominator contribution, i.e. the weight, is

   w = 1/Var = (N0 N1)/(N0 + N1) = (N0 N1)/T

and the numerator contribution is

   ( risk_index – risk_ref ) × w = ( a/N1 – b/N0 ) × w = ( a/N1 – b/N0 ) × (N0 N1)/T
                                 = ( a N0 – b N1 ) / T   (after some algebra)

[On the odds ratio in cohort studies:] Not that common to use this measure, since the odds ratio is more cumbersome to explain, and less 'natural'. One might use it to maintain comparability with the results of a log-odds (logistic) regression. If #cases are a small fraction of the denominators, the odds ratio closely approximates the risk ratio.
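As a concrete sketch of the two aggregation styles, the hypothetical Python helpers below (not the notes' own software; the function names are mine) compute the Mantel-Haenszel ratio-of-sums and the Woolf precision-weighted average for a list of 2x2 strata (a, b, c, d):

```python
from math import exp, log

def or_mantel_haenszel(strata):
    """Ratio of two sums -- one ratio at the end, per Mantel's preference."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def or_woolf(strata):
    """exp of the precision-weighted average of ln(or); w = 1/var[ln or]."""
    sum_w = sum_wl = 0.0
    for a, b, c, d in strata:
        w = 1.0 / (1/a + 1/b + 1/c + 1/d)
        sum_w += w
        sum_wl += w * log((a * d) / (b * c))
    return exp(sum_wl / sum_w)

# Whickham strata: (dead smokers, dead non-smokers, alive smokers, alive non-smokers)
whickham = [(19, 13, 269, 327), (78, 52, 167, 147), (42, 165, 7, 28)]
# both estimators give about 1.36 here
```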
**NOTE ON RATIO ESTIMATORS: Even though one could (if all denominators were obligingly non-zero) rewrite the ratio estimator as a weighted average of ratios, this would run counter to Mantel's express wish: to calculate just one ratio at the end, i.e. a ratio of two sums, rather than a sum of ratios. The main reason is statistical stability: imagine a (simpler, non-comparative) situation where one wished to estimate the overall sex ratio in small day-care facilities: would you average the ratios from each facility, or take a single ratio of the total number of males to the total number of females? The caveat does not apply to absolute differences, where the difference of two weighted averages (same set of weights for both) is the same as the weighted average of the differences.
Matched pairs: the limiting case of finely stratified data

Examples: pair-matched case-control studies; mother → infant transmission of HIV in twins, in relation to order of delivery; & others... [see 607 notes for Ch 9]. ALSO: case-crossover studies (self-matched case-control studies), e.g. Redelmeier: auto accidents, while on/off cell phone when driving.

The 4 possibilities for the 2 pair-members are (using a generic 2 x 2 table; the 2nd row might be a 'denominator series' of 1 per case*):

                  Category of Determinant
Outcome           Index          Reference        Total
Yes               a = 1 or 0     b = 1 or 0         1
No                c = 0 or 1     d = 0 or 1         1
                  1              1                T = 2

The contributions to or_MH from the 4 possibilities are:

             Determinant
Outcome      Index   Ref   Total    (a×d)/T   (b×c)/T   No. of pairs
Yes            1      1      2
No             0      0                0         0          "A"
Yes            0      0      2
No             1      1                0         0          "D"
Yes            1      0      2
No             0      1               1/2        0          "B"
Yes            0      1      2
No             1      0                0        1/2         "C"

Odds Ratio estimator = [A×0 + B×(1/2) + C×0 + D×0] / [A×0 + B×0 + C×(1/2) + D×0] = B/C

Tabular format for displaying matched-pair data:

COHORT STUDY                     Result in Other PAIR Member
                                 + ve       – ve
Result in One      + ve           A           B
PAIR Member        – ve           C           D
                                            Total # PAIRS: n

e.g. response of the same subject in each of 2 conditions (self-paired); responses of a matched pair, one in one condition, one in the other; ∆'s in paired responses on an interval scale, reduced to the sign of ∆.

CASE-CTL STUDY                   Exposure in "Control"
                                 + ve       – ve
Exposure in        + ve           A           B
"Case"             – ve           C           D
                                            Total PAIRS: n

* In a matched (self- or other-matched) case-control study, the "denominator series" is not limited to 1 "probe-for-exposure" per case... one could ask about "usual" exposure (e.g. % time usually exposed) or sample several "person-moments" ['controls'] per case, i.e. the 2nd row total could be > 2.
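Feeding each pair into the Mantel-Haenszel estimator as its own stratum shows the collapse to B/C. The sketch below is illustrative Python (not part of the notes), using hypothetical pair counts:

```python
def or_mh_from_pairs(A, B, C, D):
    """Each pair is a 2x2 stratum with T = 2; concordant pairs (A, D)
    contribute 0 to both sums, discordant pairs contribute 1/2."""
    numerator = A * 0 + B * 0.5 + C * 0 + D * 0     # sum of (a*d)/T over pairs
    denominator = A * 0 + B * 0 + C * 0.5 + D * 0   # sum of (b*c)/T over pairs
    return numerator / denominator                  # = B / C

# hypothetical counts: 20 pairs both +ve, 15 index-only, 5 reference-only, 60 both -ve
or_mh = or_mh_from_pairs(20, 15, 5, 60)   # only the discordant pairs matter: 15/5
```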
Standardization of Rates [proportion-type and incidence-type] [explicit, investigator-selected weights]

SMR = Total # cases observed / Total # cases expected
    = Σ # observed / Σ # expected                                        (*)
    = Σ observed # / Σ (ref. rate × exposed PT)
    = Σ (observed rate × exposed PT) / Σ (ref. rate × exposed PT)
    = Σ (observed rate × w) / Σ (ref. rate × w), with w = exposed PT

• Usual to first calculate the standardized rate for the index category (of the determinant) and the standardized rate for the reference category (of the determinant) separately, then compare the standardized rates.

• If one uses the confounder distribution in one of the two compared determinant categories as the common set of weights, then the standardized rate in this category remains unchanged from the crude rate in this category. See the worked example comparing death rates in Quebec males in 1971 and 1991 in the document "Direct" and "Indirect" Standardization: 2 sides of the same coin? (.pdf) under "Material from previous years" on the c626 web page. This is an interesting local case of natural confounding: relative to 20 years earlier, the crude mortality rate in 1991 was unchanged (ratio 1.00), yet in every age category the rate in 1991 was at least 10% lower, and in many age groups more than 20% lower, than in 1971 (in the table, the rate ratios in bold are 71/91, so take their reciprocals to see the rate ratios 91/71).

If one starts again from (*), one can show that the SMR can also be represented as a weighted average of rate ratios [as was mentioned in the footnote to the Quebec table*]:

SMR = Σ # observed / Σ # expected
    = Σ (obs. rate × exposed PT) / Σ # expected
    = Σ [ (obs. rate / ref. rate) × ref. rate × exposed PT ] / Σ # expected     (divide & multiply by ref. rate)
    = Σ [ (obs. rate / ref. rate) × # expected ] / Σ # expected
    = weighted average of rate ratios, with the expected numbers as weights.

• Read Rothman's comment (p159) about the uniformity of effect (e.g. a constant rate ratio across age groups in the Quebec example). Why, in his last sentence in that paragraph, does he seem to "allow" a weighted average of very different rate ratios if they were derived from standardization, but NOT if they were derived from (precision-weighted) pooling?

• Rothman (p161) emphasizes how "silly" the term "indirect" standardization, used with the standardized mortality ratio, is. He correctly points out that "the calculations for any rate standardization, 'direct' or 'indirect', are basically the same". He leaves it as an exercise (Q4, page 166) to work out what the weights are in the so-called "indirect" standardization used to compute an SMR (or SIR). Hint: write the SMR (with Σ denoting a sum over strata) as a weighted average, as above.

*cf. Liddell FD. The measurement of occupational mortality. Br J Ind Med. 1960 Jul;17:228-33.
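The algebra above -- the SMR as a ratio of sums and, equivalently, as an expected-count-weighted average of stratum rate ratios -- can be checked numerically. The strata below are made-up numbers for illustration only:

```python
# hypothetical strata: (observed cases, exposed person-time, reference rate)
strata = [(30, 10_000, 0.002),   # younger stratum
          (60, 5_000, 0.010)]    # older stratum

observed = sum(obs for obs, pt, ref in strata)          # total cases observed
expected = sum(ref * pt for obs, pt, ref in strata)     # total cases expected
smr = observed / expected

# the same quantity as a weighted average of stratum rate ratios,
# with the expected counts as weights
rate_ratios = [(obs / pt) / ref for obs, pt, ref in strata]
weights = [ref * pt for obs, pt, ref in strata]
smr_as_weighted_ave = sum(rr * w for rr, w in zip(rate_ratios, weights)) / sum(weights)
```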
Table 2 (again): Twenty-year survival status for 1314 women categorized by age and smoking habits at the time of the original survey. Worked-out calculations (see the same calculations on the spreadsheet) for... (* r1, r2 are row totals; c1, c2 are column totals.) See Rothman Ch 8.

- Mantel-Haenszel summary odds ratio, or_MH, and the Mantel-Haenszel (chi-square) test of OR1 = OR2 = OR3 = 1
- Woolf: exp[ weighted average of ln(or)'s ]; Var[ weighted ave ] = 1/{Sum of Weights}

Columns: n; a·d/n, b·c/n (for the summary odds ratio); E[a|H0] = r1·c1/n, Var[a|H0] = r1·r2·c1·c2/{n²(n–1)} (for the test statistic*); ln or (1), Var[ln or] = 1/a + 1/b + 1/c + 1/d, Weight (2) = 1/Var[ln or], W × ln or = (1)×(2) (for Woolf's method).

Age     Surv.   Smoker  Non-Smk             n    a·d/n   b·c/n   E[a|H0]  Var[a|H0]   ln or   Var[ln or]  Weight   W×ln or
18-44   Dead      19      13
        Alive    269     327  (or = 1.78)  628    9.89    5.59    14.7      7.6       0.575    0.1363     7.335    4.218
44-64   Dead      78      52
        Alive    167     147  (or = 1.32)  444   25.82   19.56    71.7     22.8       0.278    0.0448    22.30     6.199
>64     Dead      42     165
        Alive      7      28  (or = 1.02)  242    4.86    4.77    41.9      4.9       0.018    0.2084     4.798    0.086
Sum   (Σa = 139)                          1314   40.57   29.92   128.3     35.2                          34.433   10.503

MH Odds Ratio:  or_MH = Σ(a·d/n) / Σ(b·c/n) = 40.57/29.92 = 1.36

X²_MH (1 df) = {Σa – ΣE[a|H0]}² / ΣVar[a|H0] = {139 – 128.3}²/35.2 = 3.24;   X_MH = √3.24 = 1.80

Woolf: weighted ave. of ln(or)'s = 10.503/34.433 = 0.305;  exp[0.305] = 1.36

(Miettinen) test-based 100(1–α)% CI for OR:  or_MH^(1 ± z_α/2 / X_MH) = 1.36^(1 ± 1.96/1.80) = 0.97 to 1.89 (95% CI)

(Woolf) 100(1–α)% CI for OR:  exp[ {weighted ave. of ln(or)'s} ± z_α/2 √(1/34.433) ] = exp[0.305 ± 1.96 × 0.170] = 0.97 to 1.89
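The test statistic and the Miettinen test-based interval in this worked example can be reproduced with a short script. This is an illustrative Python sketch (not the notes' own software):

```python
from math import sqrt

def mh_chi2(strata):
    """Summary chi-square: {sum a - sum E[a|H0]}^2 / sum Var[a|H0]."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        r1, r2 = a + b, c + d          # row totals
        c1, c2 = a + c, b + d          # column totals
        sum_a += a
        sum_e += r1 * c1 / n
        sum_v += r1 * r2 * c1 * c2 / (n**2 * (n - 1))
    return (sum_a - sum_e) ** 2 / sum_v

# (dead smokers, dead non-smokers, alive smokers, alive non-smokers) per age group
whickham = [(19, 13, 269, 327), (78, 52, 167, 147), (42, 165, 7, 28)]
chi2 = mh_chi2(whickham)                     # about 3.24

or_mh = 1.36
lo = or_mh ** (1 - 1.96 / sqrt(chi2))        # about 0.97
hi = or_mh ** (1 + 1.96 / sqrt(chi2))        # about 1.90
```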
Via SAS

data sasuser.simpson;
input age $ i_smoke i_dead number;
lines;
18-44 1 1 19
18-44 1 0 269
18-44 0 1 13
18-44 0 0 327
44-64 1 1 78
44-64 1 0 167
44-64 0 1 52
44-64 0 0 147
64-   1 1 42
64-   1 0 7
64-   0 1 165
64-   0 0 28
;
run;

options ls = 75 ps = 50; run;

proc freq data=sasuser.simpson;
tables age * i_smoke * i_dead /
   nocol norow nopercent cmh expected;
weight number;  /* weight indicates multiples */
run;

[For a SAS 'trick' to produce the tables in an orientation that gives the ratios of interest: use PROC FORMAT to associate another value with each actual value; then use the ORDER=FORMATTED option in PROC FREQ.]

TABLE 1 OF I_SMOKE BY I_DEAD, CONTROLLING FOR AGE=18-44

I_SMOKE      I_DEAD
Frequency |
Expected  |      0 |      1 |  Total
----------+--------+--------+
        0 |    327 |     13 |    340
          | 322.68 | 17.325 |
----------+--------+--------+
        1 |    269 |     19 |    288
          | 273.32 | 14.675 |
----------+--------+--------+
Total          596       32     628

TABLE 2 OF I_SMOKE BY I_DEAD, CONTROLLING FOR AGE=44-64

        0 |    147 |     52 |    199
          | 140.73 | 58.266 |
        1 |    167 |     78 |    245
          | 173.27 | 71.734 |
Total          314      130     444

TABLE 3 OF I_SMOKE BY I_DEAD, CONTROLLING FOR AGE=64-

        0 |     28 |    165 |    193
          | 27.913 | 165.09 |
        1 |      7 |     42 |     49
          | 7.0868 | 41.913 |
Total           35      207     242

SUMMARY STATISTICS FOR I_SMOKE BY I_DEAD, CONTROLLING FOR AGE

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Alt. Hypothesis           DF    Value    Prob
-----------------------------------------------
Nonzero Correlation        1    3.239    0.072
Row Mean Scores Differ     1    3.239    0.072
General Association        1    3.239    0.072

Estimates of Common Relative Risk (Row1/Row2)

Type of Study     Method              Estimate    95% Conf Bounds
-------------------------------------------------------------------
Case-Control      Mantel-Haenszel       1.357     0.973    1.892
(Odds Ratio)      Logit                 1.357     0.971    1.894
Cohort            Mantel-Haenszel       1.047     0.996    1.101
(Col1 Risk)       Logit                 1.034     0.998    1.072
Cohort            Mantel-Haenszel       0.864     0.738    1.013
(Col2 Risk)       Logit                 0.953     0.849    1.071

Confidence bounds for M-H estimates are test-based.

Breslow-Day Test for Homogeneity of the Odds Ratios
Chi-Square = 0.950   DF = 2   Prob = 0.622

Total Sample Size = 1314
Via Stata -- Aggregating Odds Ratios (ORs): Woolf's Method

Recall, for data from a single 2x2 table:

   or = ad/bc;   SE[ln(or)] = √(1/a + 1/b + 1/c + 1/d)

For data from several (K) 2x2 tables (Σ: summation over strata):

   ln(or_Woolf) = Σ w_k ln(or_k) / Σ w_k     (weighted average)
   with w_k = 1 / Var[ln(or_k)]              (weight ∝ 1/variance; note Var = SE²)

   SE[ln(or_Woolf)] = √(1/Σw_k) = √(Var*/K)  [see derivation #]
   (Var*: harmonic mean of the K Var's)

   CI[OR] = exp{ CI[ln(OR)] }

# Derivation: Var[ Σ{w × ln or}/Σw ] = (1/Σw)² × Σ{w² × Var[ln or]} = (1/Σw)² × Σ{w} = 1/Σw   [since w = 1/Var[ln or], so w² × Var[ln or] = w]

clear
input str5 age i_smoke i_dead number
18_44 1 1 19
18_44 1 0 269
18_44 0 1 13
18_44 0 0 327
44_64 1 1 78
44_64 1 0 167
44_64 0 1 52
44_64 0 0 147
64_ 1 1 42
64_ 1 0 7
64_ 0 1 165
64_ 0 0 28
end

cc i_dead i_smoke [freq=number], by(age)

         age | OR       [95% CI]        M-H Weight
-------------+-------------------------------------
       18_44 | 1.78     .87    3.61       5.57   (Cornfield)
       44_64 | 1.32     .87    1.99      19.56   (Cornfield)
         64_ | 1.02     .42    2.43       4.77   (Cornfield)
-------------+-------------------------------------
       Crude | .68      .53     .88              (Cornfield)
M-H combined | 1.36     .97    1.90
-------------+-------------------------------------
Test of homogeneity (M-H)   chi2(2) = 0.95    Pr>chi2 = 0.6234
Test that combined OR = 1:  Mantel-Haenszel chi2(1) = 3.24   Pr>chi2 = 0.0719

Also available:
cc i_dead i_smoke [freq=number], by(age) woolf
cc i_dead i_smoke [freq=number], by(age) tb        *tb = "test-based"
[Robins-Breslow-Greenland SE for ln or_MH not programmed]

See worked example in the spreadsheet (under Resources, Ch 9).
References: A&B Ch 4.8 and 16; Schlesselman; KKM; Rothman...
Summary Risk Ratio and Summary Rate Ratio: see Rothman pp 147– (Risk Ratio) and pp 153– (Rate Ratio).
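The derivation [#] hinges on the identity w² × Var = w when w = 1/Var. A numeric spot-check (illustrative Python, using the three Whickham Var[ln or] values):

```python
variances = [0.1363, 0.0448, 0.2084]   # Var[ln or] in the three age strata
weights = [1 / v for v in variances]   # precision weights

# Var of the weighted average: (1/sum w)^2 * sum(w^2 * Var); each term w^2 * Var = w,
# so the whole thing collapses to 1/sum(w).
var_weighted_ave = sum(w * w * v for w, v in zip(weights, variances)) / sum(weights) ** 2

se = (1 / sum(weights)) ** 0.5         # about 0.170, as in the worked example
```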
Berkeley Data: M:F comparative parameters -- Odds Ratio (OR), Risk Ratio (RR) and Risk Difference (R∆)

(Using KKM Table 17.16 notation)

        E     E–
D       a     b   |  m1
D–      c     d   |  m0
        n1    n0  |  n

Columns -- for R∆: a/n1, b/n0, R∆ = a/n1 – b/n0; for OR: (a·d)/(b·c), a·d/n, b·c/n; for RR: (a·n0)/(b·n1), a·n0/n, b·n1/n; for R∆ (weighting): var(R∆)*, w = 1/var, w·R∆.

Faculty A:  Y 512 89 | 601;  N 313 19 | 332;  All 825 108 | 933
   0.62  0.82  –0.20  |  0.35  10.4  29.9  |  0.75  59.3  78.7  |  1.63E-3   614   –125
Faculty B:  Y 353 17 | 370;  N 207 8 | 215;  All 560 25 | 585
   0.63  0.68  –0.05  |  0.80   4.8   6.0  |  0.93  15.1  16.3  |  9.12E-3   110     –5
Faculty C:  Y 120 202 | 322;  N 205 391 | 596;  All 325 593 | 918
   0.37  0.34  +0.03  |  1.13  51.1  45.1  |  1.08  77.5  71.5  |  1.10E-3   913     26
Faculty D:  Y 138 131 | 269;  N 279 244 | 523;  All 417 375 | 792
   0.33  0.35  –0.02  |  0.92  42.5  46.1  |  0.95  65.3  69.0  |  1.14E-3   879    –16
Faculty E:  Y 53 94 | 147;  N 138 299 | 437;  All 191 393 | 584
   0.28  0.24  +0.04  |  1.22  27.1  22.2  |  1.16  35.7  30.7  |  1.51E-3   661     25
Faculty F:  Y 22 24 | 46;  N 351 317 | 668;  All 373 341 | 714
   0.06  0.07  –0.01  |  0.83   9.8  11.8  |  0.84  10.5  12.5  |  3.41E-4  2935    –33
All:        Y 1198 557 | 1755;  N 1493 1278 | 2771;  All 2691 1835 | 4526
   0.44  0.30  +0.14  |  1.84 (crude)      |  1.47 (crude)

Σ (over faculties):  a·d/n: 145.8;  b·c/n: 161.1;  a·n0/n: 263.4;  b·n1/n: 278.7;  w: 6113;  w·R∆: –129

OR_MH = 145.8/161.1 = 0.91
RR_MH = 263.4/278.7 = 0.94
R∆_w = Σ(w·R∆)/Σw = –129/6113 = –0.02

* var(R∆) = sum of the 2 binomial variances
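The three summary measures in this table can be reproduced directly from the six faculty 2x2 tables. This is an illustrative Python sketch (not part of the notes); a = men admitted, b = women admitted, c = men rejected, d = women rejected:

```python
faculties = [  # (a, b, c, d) for faculties A..F
    (512, 89, 313, 19), (353, 17, 207, 8), (120, 202, 205, 391),
    (138, 131, 279, 244), (53, 94, 138, 299), (22, 24, 351, 317),
]

# Mantel-Haenszel odds ratio: ratio of sums of a*d/n and b*c/n
or_num = sum(a * d / (a + b + c + d) for a, b, c, d in faculties)   # about 145.8
or_den = sum(b * c / (a + b + c + d) for a, b, c, d in faculties)   # about 161.1
or_mh = or_num / or_den                                             # about 0.91

# Mantel-Haenszel risk ratio: n1 = a + c (men), n0 = b + d (women)
rr_num = sum(a * (b + d) / (a + b + c + d) for a, b, c, d in faculties)  # about 263.4
rr_den = sum(b * (a + c) / (a + b + c + d) for a, b, c, d in faculties)  # about 278.7
rr_mh = rr_num / rr_den                                                  # about 0.94
```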
Test of equal M:F admission rates; Confidence Intervals for ORMH (Berkeley data, KKM and A&B notation; cf. Rothman'02,Table 8.4, p152)
CI for ORMH [notation from A&B p461]
TEST M = F (Method of Robins, Breslow & Greenland 1986 *) CI ORMH continued...
Faculty E[a|Ho] Var[aHo] a+dn
b+cn
a•dn
b•cn
(P) (Q) (R) (S) P•R P•S Q•R Q•S lnORMH = ln 0.91 = –0.10 Admitted? Men Women AllA Y 512 89 | 601 531.4 21.9 0.57 0.43 10.4 29.9 5.9 17.0 4.5 12.9 Var[lnORMH ] = 0.0066 N 313 19 | 332 All 825 108 | 933 SE[lnORMH] =√Var = 0.08
B Y 353 17 | 370 354.2 5.6 0.62 0.38 4.8 6.0 3.0 3.7 1.8 2.3 CI[lnORMH]= –0.10 ± z•0.08 N 207 8 | 215 All 560 25 | 585 = –0.26 to 0.06 (95%)
C Y 120 202 | 322 114.0 47.9 0.56 0.44 51.1 45.1 28.5 25.1 22.7 20.0 CI[ ORMH ] = N 205 391 | 596 All 325 593 | 918 exp[–0.26] to exp[0.06]
D Y 138 131 | 269 141.6 44.3 0.48 0.52 42.5 46.1 20.5 22.3 22.0 23.9 = 0.77 to 1.06 N 279 244 | 523 All 417 375 | 792 _______________________________________
E Y 53 94 | 147 48.1 24.3 0.60 0.40 27.1 22.2 16.4 13.4 10.8 8.8 CI [ORMH] "test-based" (Miettinen 1976) N 138 299 | 437
All 191 393 | 584 Chi-MH = | ln orMH | / SE[ln orMH ] ===>
F Y 22 24 | 101 24.0 10.8 0.47 0.53 9.8 11.8 4.6 5.6 5.1 6.2 SE[ln orMH]=|ln orMH| / Chi-MH {0.10/√1.52= 0.08} N 351 317 | 668
All 373 341 | 769 Rothnan2002, p152 uses different notation CI[ln ORMH] = ln or ± z SE[ ln orMH ]
All Y 1198 557 | 1755 A&B {R,S, P,Q} -> Rothman{G,H, P,Q} CI[ORMH] = CI [exp[ln orMH]] N 1493 1278 | 2771
All 373 341 | 4526 ----- ----- ---- ---- ---- ---- = exp[CI for ln] = orMH[1 ± z/Chi-MH]
∑: 1213.4 154.7 145.8 161.1 78.9 87.1 66.9 74.1
(R+) (S+) = orMH[1 ± 1.96/ 1.52] in our example
{ ∑a – ∑a|Ho] }2
∑Var[a|Ho] =
{1198 – 1213.4}2
154.7 = 1 .52 [#] Var[ ln ORMH ] =
∑P•R2R+
2 + ∑[P•S + Q•R]
2R+•S+ +
∑Q•S2S+
2 ______________________________________
This MH X2 of 1.52 is "NS" in the χ2 1df distribution = 78.9
2•145.82 +
87.1 + 66.9]2•14.5.8•161.1
+ 74.1
2•161.12 = 0.0066 CI [R ] ... (continued from last column, previous page)
SE[R ] = 1/ w = 0.013 [#] see Rothman2002, p162 continued at top of next column ... CI [R ] = –0.02 ± z × 0.013
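As an arithmetic check on the column sums above, the Robins-Breslow-Greenland variance and the resulting 95% CI can be recomputed directly. This is a Python sketch (the notes themselves work by hand and with SAS); the stratum sums R+, S+, ∑P·R, ∑(P·S+Q·R) and ∑Q·S are taken from the table:

```python
import math

# Column sums from the stratified Berkeley table (A&B notation)
R_plus, S_plus = 145.8, 161.1              # sums of a*d/n and b*c/n
sum_PR, sum_PS_QR, sum_QS = 78.9, 87.1 + 66.9, 74.1

or_mh = R_plus / S_plus                    # Mantel-Haenszel OR, ~0.91
ln_or = math.log(or_mh)

# Robins-Breslow-Greenland variance of ln(OR_MH)
var_ln = (sum_PR / (2 * R_plus ** 2)
          + sum_PS_QR / (2 * R_plus * S_plus)
          + sum_QS / (2 * S_plus ** 2))
se_ln = math.sqrt(var_ln)

lo = math.exp(ln_or - 1.96 * se_ln)
hi = math.exp(ln_or + 1.96 * se_ln)
print(round(or_mh, 2), round(var_ln, 4), round(lo, 2), round(hi, 2))
# 0.91 0.0066 0.77 1.06
```

The reconstructed variance (0.0066) and CI (0.77 to 1.06) agree with the hand calculation above.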
Inference from 2 way Tables M&M §9 REVISED Dec 2003 Analysis of 2 x 2 tables ... chi-square test
Recall some earlier examples... More generally...

Cross-classification of a single sample of size n with respect to two characteristics, say A and B ("second" model in M&M p 641): a test of independence of the two characteristics.

          B1      B2      Total
A1        n11     n12     nA1
A2        n21     n22     nA2
Total     nB1     nB2     n

Montreal Metropolitan Population by knowledge of the official languages; data collected by Statistics Canada at the 1996 census {numbers rounded, so subtotals do not sum exactly to total}:

                         English?
Français?       Yes           No           Total
Oui          1,634,785     1,309,150     2,943,935
Non            280,205        63,500       343,705
Total        1,914,990     1,372,650     3,287,645
Stroke Unit vs. Medical Unit for Acute Stroke in the elderly? Patient status at hospital discharge (BMJ 27 Sept 1980):

                indept.        dependent   total no. pts
Stroke Unit        67             34           101
Medical Unit       46             45            91
total             113 (58.9%)     79           192

Cohort Study: fixed/variable follow-up; Person (P) or P-Time denominators. (Cross-sectional Study: document states rather than events.)

                   event (or state)        non-event    Total Persons D = Denominator,
                   c = "cases" numerator   (or state)   or Total Person-Time D = Denominator
"exposed" (1)        c1                                   D1
not exposed (0)      c0                                   D0
                     c                                    D

Bone mineral density and body composition in boys with distal forearm fractures (J Pediatr 2001 Oct;139(4):509-15):

                      Fracture?
                      Yes    No
Overweight?   Yes      36    14
              No       64    86
                      100   100

Case-Control Study: person- or person-time "quasi-denominators":

                   "c = cases"    quasi-denominators (persons),
                   numerator      or quasi-denominators (Person-Time)
"exposed" (1)        c1             d1
not exposed (0)      c0             d0
                                    d = sample of D

"To beat Roy, better to shoot low ..." (LA PRESSE, Montreal, Thursday, April 21, 1994 ... cf. Course 626). Over the twenty playoff games last year, the Canadiens gave up 51 goals... Of the 51 goals allowed by the best goaltender in the world, the puck entered the following part of the net:

Top        10   (20%)
Middle      5   (10%)
Bottom     36   (70%)
           51  (100%)
e.g. languages in Montreal: statistics from a 2 x 2 table via SAS PROC FREQ

Create the SAS file via the Program Editor (could also type directly into INSIGHT):

options ls = 75 ps = 50; run;

data sasuser.lang_mtl;
input Francais $ English $ number; /* $ sign after a name indicates a character variable */
/* one cell count ("number") per line, instead of one individual per line */
lines;
Oui Yes 1634785
Oui No 1309150
Non Yes 280205
Non No 63500
;
run;

proc freq data=sasuser.lang_mtl;
  tables Francais * English / all cellchi2 expected; /* turn on all output */
  weight number; /* use weight to indicate "multiples" */
run;

then.. via SAS INSIGHT: [Mosaic plot of FRANCAIS (Oui/Non) by ENGLISH (No/Yes)]

TABLE OF FRANCAIS BY ENGLISH. Each cell shows: (Observed) Frequency ["obs" for short]; Expected (under H0: independence) ["exp" for short]; (Cell Chi-Square); Percent; Row Pct; Col Pct.

FRANCAIS     ENGLISH:  No                     Yes                    Total
Non                    63500                  280205                  343705
                       exp 143503             exp 200202
                       (chi² 44602)           (chi² 31970)
                       1.93 | 18.48 | 4.63    8.52 | 81.52 | 14.63     10.45
Oui                    1309150                1634785                 2943935
                       exp 1229147            exp 1714788
                       (chi² 5207.3)          (chi² 3732.5)
                       39.82 | 44.47 | 95.37  49.73 | 55.53 | 85.37    89.55
Total                  1372650 (41.75)        1914990 (58.25)         3287640 (100.00)

143503 = 10.45% of 1372650 = 343705 × 1372650 / 3287640 = RowTotal × ColTotal / OverallTotal

44602 = (observed freq. – expected freq.)² / expected freq. = (63500 – 143503)² / 143503

Statistic      DF                    Value       Prob*
------------------------------------------------------
Chi-Square     (2–1) × (2–1) = 1     85511.881   0.001

Chi-Square = Σ (obs – exp)²/exp; DF = "degrees of freedom"; Σ is over all 4 cells.

* Clearly, p-values are not relevant here.
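The expected counts, cell chi-squares and overall statistic that PROC FREQ prints can be reproduced from the four observed counts alone. A Python sketch (the notes use SAS; this simply repeats the RowTotal × ColTotal / OverallTotal arithmetic):

```python
# Chi-square test of independence for the Montreal language 2x2 table.
obs = [[63500, 280205],      # Francais = Non : English No, Yes
       [1309150, 1634785]]   # Francais = Oui : English No, Yes

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)

x2 = 0.0
for i in range(2):
    for j in range(2):
        exp = row[i] * col[j] / n            # RowTotal x ColTotal / OverallTotal
        x2 += (obs[i][j] - exp) ** 2 / exp   # cell chi-square

print(round(x2, 1))   # close to SAS's 85511.881; df = (2-1)(2-1) = 1
```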
e.g. Stroke Unit vs. Medical Unit for Acute Stroke in the elderly? Patient status at hospital discharge (BMJ 27 Sept 1980).

data sasuser.str_unit;
input Unit $ Status $ number;
lines;
Stroke Indep 67
Stroke Dep 34
Medical Indep 46
Medical Dep 45
;
run;

proc freq data=sasuser.str_unit;
  weight number;
  tables Unit * Status / chisq cmh relrisk riskdiff nopercent nocol;
run;

Observed:

                independent     dependent   total no. pts
Stroke Unit      67 (66.3%)        34           101
Medical Unit     46 (50.5%)        45            91
total           113 (58.9%)        79           192

"Expected" numbers under H0: % discharged "independent" unaffected by type of unit:

                independent     dependent   total no. pts
Stroke Unit     59.4 (58.9%)      41.6          101
Medical Unit    53.6 (58.9%)      37.4           91
total          113   (58.9%)      79            192

¶ X² = {67 – 59.4}²/59.4 + {34 – 41.6}²/41.6 + {46 – 53.6}²/53.6 + {45 – 37.4}²/37.4
     = 4.93 (4.927 from unrounded expected numbers) [to be referred to the χ²(1 df) distribution]

SAS output (see NOTE: FREQ uses the upper row as the INDEX category, the lower row as the REFERENCE category):

Frequency |
Row Pct   |  Dep     | Indep   | Total
----------+----------+---------+------
Medical   |   45     |   46    |   91    (INDEX category)
          |  49.45   |  50.55  |
Stroke    |   34     |   67    |  101    (REFERENCE category)
          |  33.66   |  66.34  |
Total         79        113      192

Statistic                      DF    Value    Prob
--------------------------------------------------
Chi-Square                      1    4.927    0.026
Continuity Adj. Chi-Square      1    4.296    0.038
Mantel-Haenszel Chi-Square      1    4.901    0.027
Fisher's Exact Test                  Left: 0.991   Right: 0.019   2-Tail: 0.029

Generic formula for X²:

X² = Σ { observed – expected }² / expected   [over all cells!]

The expected number in the cell in row i and column j is:

(total of row i) × (total of column j) / overall total

Continuity-corrected X² †:

X²c = Σ { |observed – expected| – 0.5 }² / expected

Column 2 Risk Estimates:
                  Risk    ASE     95% Conf Bounds (Asymptotic)   (Exact)
Row 1 (Medical)   0.505   0.052    0.403   0.608                 0.399   0.612
Row 2 (Stroke)    0.663   0.047    0.571   0.756                 0.562   0.754
Total             0.589   0.036    0.519   0.658                 0.515   0.659
Row 1 – Row 2    –0.158   0.070   –0.296  –0.020

Estimates of the Relative Risk (Row1/Row2):
Type of Study         Value                 95% Confidence Bounds
Cohort (Col2 Risk)    0.762 (50.55/66.34)   0.596   0.975

¶ Use X² to refer to the calculated statistic in a sample, χ² for the distribution.

† (Yates') continuity correction reflects the fact that the binomial counts are discrete and that their probabilities are being approximated by intervals (count – 0.5, count + 0.5). The uncorrected X² is overly liberal, i.e. it produces a distribution of discrepancies that is larger than the tabulated distribution... hence the reduction of each absolute deviation |observed – expected| by 0.5.

cf. the z-test for a difference of 2 proportions, in Chapter 8:

z = (0.6634 – 0.5054) / sqrt[0.5885 · 0.4115 · (1/101 + 1/91)] = 0.1580/0.0711 = 2.22

P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)

z² = 2.22² = 4.93 (same as X² with 1 degree of freedom!)
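The chi-square/z equivalence just stated is easy to verify numerically. A Python sketch of the hand calculation (the notes use SAS) for the stroke-trial table:

```python
import math

# Stroke-unit trial: uncorrected chi-square and the equivalent 2-proportion z-test.
a, b = 67, 34   # Stroke Unit: independent, dependent
c, d = 46, 45   # Medical Unit: independent, dependent
r1, r2 = a + b, c + d          # row totals 101, 91
c1, c2 = a + c, b + d          # column totals 113, 79
n = r1 + r2                    # 192

# chi-square from observed and expected counts, summed over all 4 cells
x2 = sum((o - e) ** 2 / e
         for o, e in [(a, r1 * c1 / n), (b, r1 * c2 / n),
                      (c, r2 * c1 / n), (d, r2 * c2 / n)])

# z-test comparing p1 = 67/101 with p2 = 46/91, using the pooled p = 113/192
p1, p2, p = a / r1, c / r2, c1 / n
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / r1 + 1 / r2))

print(round(x2, 3), round(z, 2))  # 4.927 2.22 ; z**2 equals x2
```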
OTHER EXAMPLES. Note: in the following examples, in order to keep the formulae uncluttered, I have not shown the continuity correction. In some examples, as in the one above, the sample sizes are large and so the continuity correction makes only a small change. In others, as in the milk immunoglobulin example below, it makes a big difference. However, don't do as I do; do as I say -- use the continuity correction routinely. That way, editors and referees won't accuse you of trying to make your p-values more impressive by not using the correction.
Protection by milk immunoglobulin concentrate against oral challenge with enterotoxigenic E. coli (NEJM, May 12, 1988, p 1240):

                                    developed    did       Total
                                    diarrhea*    not*      subjects
received milk immunoglobulin
concentrate                          0 (4.5)    10 (5.5)     10
received control immunoglobulin
concentrate                          9 (4.5)     1 (5.5)     10
All                                  9 (45%)    11           20

* the numbers 4.5, 5.5, 4.5 and 5.5 in parentheses in the table are the "Expected" numbers, calculated on the (null) hypothesis H0: rate not changed by immunoglobulin concentrate.

x² = {0 – 4.5}²/4.5 + {9 – 4.5}²/4.5 + {10 – 5.5}²/5.5 + {1 – 5.5}²/5.5
   = {4.5}² (1/4.5 + 1/4.5 + 1/5.5 + 1/5.5) = 16.4

Continuity-corrected x² = {|0 – 4.5| – 0.5}²/4.5 + ... + = {4.0}² (1/4.5 + 1/4.5 + 1/5.5 + 1/5.5) = 12.9

Do infant formula samples shorten the duration of breast-feeding? (Bergevin Y, Dougherty C, Kramer MS. Lancet 1983 May 21;1(8334):1148-51.) % still breastfeeding at 1 month in an RCT which withheld free formula samples [normally given by baby-food companies to mothers leaving hospital with their infants] from a random half of those studied:

                breastfeeding    not breastfeeding   Total mothers
given sample     175 (77%)             52                227
no sample        182 (84%)             35                217
                 357 (80.4%)           87                444

"Expected" numbers under H0: rate not changed by giving samples:

                breastfeeding    not breastfeeding   Total mothers
given sample     182.5 (80.4%)        44.5               227
no sample        174.5 (80.4%)        42.5               217
                 357   (80.4%)        87                 444

X² = {175 – 182.5}²/182.5 + {182 – 174.5}²/174.5 + {52 – 44.5}²/44.5 + {35 – 42.5}²/42.5
   = 3.22 ["NS" at the 0.05 level, even with the uncorrected X²]

X² = 3.22 <--> |Z| = √3.22 = 1.79; Prob(|Z| > 1.79) = 2 × 0.0367 = 0.0734

Attack rates of ophthalmia neonatorum among exposed newborns receiving silver nitrate or tetracycline (NEJM, March 11, 1998, 653-7):

                              Silver Nitrate    Tetracycline
exposed to N. gonorrhoeae       5 / 71            2 / 66
  attack rate                   (7%)              (3%)
exposed to C. trachomatis      10 / 99            8 / 111
  attack rate                   (10%)             (7%)
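The uncorrected and Yates-corrected statistics for the two trials above can be recomputed with one small helper. A Python sketch of the hand calculation (the notes do these by hand); `chi2_2x2` is a hypothetical helper name, not from the notes:

```python
# Uncorrected and continuity-corrected chi-square for a 2x2 table [[a, b], [c, d]].
def chi2_2x2(a, b, c, d, corrected=False):
    """Sum of (dev)^2 / expected over all 4 cells; Yates correction optional."""
    n = a + b + c + d
    x2 = 0.0
    for obs, rt, ct in [(a, a + b, a + c), (b, a + b, b + d),
                        (c, c + d, a + c), (d, c + d, b + d)]:
        exp = rt * ct / n               # row total x column total / n
        dev = abs(obs - exp)
        if corrected:
            dev = max(dev - 0.5, 0.0)   # Yates continuity correction
        x2 += dev ** 2 / exp
    return x2

# milk Ig trial: 0/10 vs 9/10 developed diarrhea
print(round(chi2_2x2(0, 10, 9, 1), 1))          # 16.4 uncorrected
print(round(chi2_2x2(0, 10, 9, 1, True), 1))    # 12.9 corrected -- a big difference
# formula-samples trial: 175/227 vs 182/217 still breastfeeding
print(round(chi2_2x2(175, 52, 182, 35), 1))     # 3.2 ("NS"; 3.22 in the notes)
```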
Inference from 2 way Tables M&M §9 Notes on X2 tests and on analysis of binary data in general
• If one uses x², it must be based on counts, not on %'s.

• A short-cut method of calculation for the 2x2 table with 'generic' entries a, b, c, d, and with row, column and overall totals r1, r2, c1, c2 and N respectively, is (with the stroke data as e.g.):

x² = N { a·d – b·c }² / (r1·r2·c1·c2) = 192 { 67·45 – 34·46 }² / (101·91·113·79) = 4.93

For the continuity-corrected version, the shortcut formula is:

x²c = N { |a·d – b·c| – N/2 }² / (r1·r2·c1·c2) = 192 { |67·45 – 34·46| – 96 }² / (101·91·113·79) = 4.30

where | x | means 'the absolute value of x'.
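The shortcut (integer-only) formula is easy to check against the four-O's-and-four-E's version. A Python sketch with the stroke-trial entries:

```python
# Shortcut chi-square for a 2x2 table, checked against the stroke-trial numbers.
a, b, c, d = 67, 34, 46, 45
N = a + b + c + d
r1, r2, c1, c2 = a + b, c + d, a + c, b + d

# uncorrected: N(ad - bc)^2 / (r1 r2 c1 c2)
x2 = N * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
# continuity-corrected: N(|ad - bc| - N/2)^2 / (r1 r2 c1 c2)
x2_corr = N * (abs(a * d - b * c) - N / 2) ** 2 / (r1 * r2 * c1 * c2)

print(round(x2, 2), round(x2_corr, 2))   # 4.93 4.3
```

Note that a·d – b·c = 3015 – 1564 = 1451 here; its sign gives the direction of the difference in proportions, as the text goes on to explain.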
The formula involves the cross-products a·d and b·c. If their ratio (the empirical odds ratio) is 1, their difference is zero; the direction of the difference in proportions is given by the sign of ad – bc.

These formulae avoid fractions -- one doesn't see expectations or deviations, or the magnitude of the difference. Presumably for this reason, some books, such as Norman and Streiner's Statistics: the Bare Essentials, classify x² as "non-parametric" or "distribution-free", and so put it in the non-parametric chapter. After their first edition, I pointed out to them that the above use of the chi-square test is as a test of the difference between two binomial proportions -- how much more parametric or distribution-specific can that be? Look up the index in the latest edition to see if my arguments convinced them.

• The uncorrected version of the 2-sided z-test for comparing two proportions gives the same p-value as the uncorrected version of the x² test; one can check that Z² = x². Likewise, the corrected version of the 2-sided z-test gives the same p-value as the corrected version of the x² test.

[Figure: densities of Z ~ N(0,1) and of Z² ~ Chi-Square(df=1)]

The chi-square random variable (r.v.) is the square of the N(0,1) r.v. The very high "probability density", and the rapid change in this density, just to the right of Z² = 0 (cf. diagram) is a result of the Z -> Z² transformation. For example, the 3.98% of the "probability mass" between Z=0 and Z=0.1 is transferred to the small interval 0² to 0.1², i.e. 0 to 0.01, a width of 0.01 (an identical amount gets transferred from the Z interval –0.1 to 0). The 3.94% of the "probability mass" between Z=0.1 and Z=0.2 is transferred to the interval 0.1² to 0.2², i.e. 0.01 to 0.04, a width of 0.03; there is an identical transfer from the Z interval –0.2 to –0.1, for an "average" chi-square density of approximately (2 × 0.0394)/0.03 = 2.6. You can track the 'block transfers' using the different shades.

• By construction, the x² is a 2-sided test, unless one uses x (the signed square root) and refers it to the z-table.

• There are other chi-square distributions (with df > 1). See later.
• If one uses x², it must be based on all cells, not just on the numerators -- unless the more common type of outcome is so much more common that the contribution of those cells is negligible.

• The x² test is a large-sample test, i.e. it is always an approximation. Since x²(1 df) is just Z², one is no more exact than the other.

• In t-tests, n = 30 is often considered 'large enough' for large-sample procedures -- it depends on how skewed the data are and on whether the Central Limit Theorem will "Gaussianize" the distribution of the statistic. For 0/1 data, such as those above, the 'real' or 'effective' sample sizes are not the denominators 71 and 66, but rather the numerators 5 and 2 [e.g. changing the 2 to a 1 would make the ratio of attack rates appear twice as good]. It doesn't matter whether the 5 and 2 came from n's of 710 and 660 or of 7100 and 6600. The 'effective sample size' for binary data is the number of subjects having the less common outcome.

• The "guidelines" (such as they are) about when it is appropriate to use x² are based on the "Expected" numbers, not on the observed numbers. One quoted rule [often used by computer programs to generate warning messages] is that the expected numbers in most of the cells should exceed 5 for the χ² to be accurate. Thus, the 2x2 table on the left below will generate a 'warning'; that on the right will not.

     5    2            1   11
    66   64           11    1

Mantel-Haenszel Test Statistic for a single 2 x 2 table

Preamble

Some of the paragraphs on the next page more appropriately belong in the notes for Ch 8.2, where Fisher's exact test was introduced, but the reasons become important when we come to using the formulation below of the x² test for a single table to combine evidence over several (possibly sparse) 2x2 tables. Cochran first proposed combining evidence from 2 x 2 tables in 1954. His aim was to combine a small number of 'large' tables, and he did not anticipate that this technique could also be used to combine a large number of quite 'small' 2 x 2 tables (each one with quite sparse information), with the combination of data from n matched pairs as the limiting case. Thus, he was a wee bit careless about variances. It took the now famous Mantel-Haenszel paper of 1959 to make a variance correction that for 'large' tables was trivial, but for matched-pair tables was critical.

SAS and others rightly acknowledge Cochran's role in the test statistic, calling it the 'Cochran-Mantel-Haenszel' or 'CMH' statistic. (Indeed, 'CMH' is the option one uses with PROC FREQ to obtain the summary measures and the overall test statistic.) This formulation is also the one most commonly used for the log-rank test used to compare two survival curves.

• The regular uncorrected x² test statistic for a single 2x2 table can be written in a seemingly very different format, as

x² = { a – E[a | H0] }² / Variance[a | H0] = { a – E[a | H0] }² / { r1·r2·c1·c2 / N³ }

The variance in the denominator of this statistic can be viewed as arising from a statistical model in which the 2 compared proportions are separate independent random variables, i.e. the 'unconditional' or '2-independent-binomials' model.

Just like the formula with 4 O's and 4 E's, this format is not as calculator-friendly as the shortcut (integer-only) one. But... cf. Mantel-Haenszel.

Most consider that the biggest legacy of the "M-H" paper is the Mantel-Haenszel summary measure (point estimate) of the Odds Ratio. We will come back a little later to this issue of combining data from 2 x 2 tables.
Mantel-Haenszel Test Statistic for a single 2 x 2 table

Preamble: Conditional vs. Unconditional? continued...

Many of the reasons put forward for using the conditional test based on all margins fixed (i.e. the hypergeometric model, with only one random variable) involve practicality rather than adherence to a coherent set of inferential principles. They mostly have to do with one of the following 'supposed' difficulties: (a) using the normal approximation when the expected numbers are low; (b) the fact that there are two parameters, but one is only interested in their difference, or ratio, or odds ratio, and so the 'remaining' parameter is just a 'nuisance'; (c) how to order or rank the tables by their degree of evidence against H0. For example, in a 2x2 table with n0 = 23 and n1 = 24 (as in the bromocryptine and infertility study), there are theoretically 24 × 25 = 600 possible tables. However, if one -- after the fact -- restricts the analysis to only those tables where the total number of "successes" is 12 (12 pregnancies), then there are only 13 possible tables (see the notes and Excel spreadsheet for Fisher's exact test). And, by reducing the problem from a 2-dimensional one to a 1-dimensional one, it also becomes possible to more easily rank the tables by their degree of evidence against H0, something that is supposedly more difficult when the tables are simultaneously arrayed along both dimensions. (d) A fourth reason, which I will illustrate with the Marvin Zelen "Marbles in the Folger's Coffee Can" model, is that, after the fact, it is much easier to empirically -- and heuristically -- demonstrate a low p-value using the single-random-variable, conditional (hypergeometric) model than it is with the '2-separate-binomials' model.

In the separate-binomials model, the only marginal totals that are fixed ahead of time are the two sample sizes. In most instances, this model reflects reality. The only exception I know of is the design exemplified by the psychophysics study of the lady tasting tea. If she is told that there are 4 cups where the tea is poured first, and 4 where it is poured second, then she will arrange her responses so that there are 4 of each. Thus, in this instance, both the row totals and the column totals are "fixed" ahead of time, and so it makes sense that the (frequentist) inference be limited to the (only) 5 possible data tables that have all margins fixed.

This is the statistical model behind Fisher's exact test, and indeed Fisher used the tea-tasting example to explain it. But this test is now used for data situations where one cannot -- at least ahead of time -- consider both sets of marginal totals fixed. For example, in the food-sensitivity study in the ch 8.2 notes, from the answers given it appears that the subjects were not told that there were three injections of extract and nine of diluent, but the authors used the conditional test anyway.

In fact there are many ways to circumvent these objections without having to 'condition' on all margins, and there is still considerable debate, much of it philosophical, on this 100 years after analyses of 2x2 data were first introduced. However, since we often combine information from data arranged as matched pairs or 'finely stratified' strata, we do need to consider this one setting where conditioning is the 'right thing to do'. In the example here, there will be only 1 large table, so the difference will not be important. But when we come to matched pairs, the implications are large.
Mantel-Haenszel Test Statistic for a single 2 x 2 table
Details
In the conditional model, with both margins fixed, there is only
one cell entry that can vary independently. Without loss of generality, we
focus on the frequency in the 'a' cell. Then, under the null hypothesis,
a ~ Hypergeometric[parameters given by marginal totals]
i.e. by r1=Row1Total, r2, c1=Col1Total, c2, and N=OverallTotal.
Thus
Expected value[ a | H0 ] = E[ a | H0 ] = (r1 × c1) / N

Variance_condn'l[ a | H0 ] = { r1·r2·c1·c2 } / { N²(N–1) }
Example
In our stroke vs. medical unit example above, the marginal totals were
r1=101, r2=91, c1=113, and c2=79, so N=192. These yield an "excess in the a cell" of

67 – (101·113)/192 = 67 – 59.44 = 7.56

and a conditional variance of

{ 101·91·113·79 } / { 192²·(191) } = 11.6528

giving

X²MH = 7.56² / 11.6528 = 4.901,

in agreement with the printout from PROC FREQ in SAS. Under the null, the expectation is the same under the conditional and the unconditional models. Note however the difference in the variance: under the conditional model it is different, since it uses N²(N–1) rather than N³, reflecting the different pattern of variation in the frequency in the 'a' cell (and consequently in the other 3 cells) when all margins are fixed (vs. what would happen if the lady were not told "4 1st; 4 2nd").
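The hypergeometric mean and variance, and the resulting MH statistic, can be recomputed from the four marginal totals. A Python sketch of the hand calculation above:

```python
# Mantel-Haenszel (conditional, hypergeometric) statistic for the stroke table,
# matching SAS's "Mantel-Haenszel Chi-Square" line.
a = 67
r1, r2, c1, c2 = 101, 91, 113, 79
N = r1 + r2   # 192

e_a = r1 * c1 / N                                  # E[a | H0]
var_a = r1 * r2 * c1 * c2 / (N ** 2 * (N - 1))     # note N^2 (N-1), not N^3
x2_mh = (a - e_a) ** 2 / var_a

print(round(e_a, 2), round(var_a, 2), round(x2_mh, 3))  # 59.44 11.65 4.901
```

Replacing `N ** 2 * (N - 1)` by `N ** 3` gives back the unconditional statistic 4.927; for one large table the difference is trivial, as the text says.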
Note:
The MH test does not use the continuity correction with the {a – E[a]}.
Part of the justification for this is that when the point estimate of the odds
ratio falls at the null, i.e. a•d = b•c, so that E[a | H0 ] = a, it would be good if
the test statistic also had a value of zero. A continuity correction would
force the test-statistic to have a positive value even when the "observed
a" = "expected under the null" !
The test statistic using this conditional variance can be computed as a Z statistic,

Z = X = "chi" = { a – E[ a | H0 ] } / SD_condn'l[ a | H0 ],

which has the same form as the critical ratios used in the z-test for proportions or means, or as the more traditional square,

X²MH = { a – E[ a | H0 ] }² / Variance_condn'l[ a | H0 ].
Inference from 2 way Tables M&M §9 2x2, 2x1, 1x2 and 1x1 Tables†
2x2: samples reasonably equal in size, two types of outcome common
e.g. outcomes in trial of stroke vs. medical unit

            BAD OUTCOME   GOOD OUTCOME   Total persons, or Total Person-time
sample 1       bad1          good1          n1
sample 2       bad2          good2          n2
               bad           good           n

x² = {bad1 – E[bad1]}²/E[bad1] + {good1 – E[good1]}²/E[good1]
   + {bad2 – E[bad2]}²/E[bad2] + {good2 – E[good2]}²/E[good2]

"Expected" numbers of outcomes under H0: rates not different (split the events across the 2 samples in the ratio n1 : n2).

1x2: 1 sample; two types of outcome common
e.g. male and female births with specific timing of conception

            "LESS GOOD" OUTCOME   GOOD OUTCOME   Total persons, or Total Person-time
sample          less_good             good           n

x² = {less_good – E[less_good]}²/E[less_good] + {good – E[good]}²/E[good]

"Expected" numbers of outcomes under H0: rate not different from an EXTERNAL rate (use the EXTERNAL rate, based on a LARGE amount of data, e.g. national rates, to calculate the expected split of events). If an internal comparison is used, then we have the full 2 x 2 table.

2x1: samples large and reasonably equal in size, BAD outcome uncommon
e.g. leukemias and breast cancers

            BAD OUTCOME   GOOD OUTCOME   Total persons, or Total Person-time
sample 1       bad1          MOST           n1
sample 2       bad2          MOST           n2
               bad           MOST           n

x² ≈ {bad1 – E[bad1]}²/E[bad1] + minimal contribution
   + {bad2 – E[bad2]}²/E[bad2] + minimal contribution

"Expected" numbers of outcomes under H0: rates not different (only need the ratio n1 : n2 to get the expected split of the BAD events).

1x1: 1 large sample, BAD outcome uncommon
e.g. 78 cancers observed in Alberta study, 83.5 expected

            BAD OUTCOME   GOOD OUTCOME   Total persons, or Total Person-time
sample         bad           MOST           n

x² ≈ {bad – E[bad]}²/E[bad]

"Expected" number of outcomes under H0: rate not different from an EXTERNAL rate (use the EXTERNAL rate to calculate the expected number of BAD events).

This x² = {observed – expected}²/expected is equivalent to the large-sample approximation to the Poisson distribution [A&B §4.10]:

z = (observed – expected) / √expected, so that z² = {observed – expected}²/expected = x²

[see A&B §4.10; WE WILL REVISIT THIS 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING EFFECT MEASURES for INCIDENCE RATES]

† This terminology is my own: don't try it out on an editor!
e.g. Development of leukemia during a 6-year period following drug treatment for cancer:

              leukemia    not     Total persons
drug rx          14       2053       2067
no drug rx        1       1565       1566
                 15       3618       3633

"Expected" numbers of leukemia under H0: rate not increased by drug:

drug rx         8.53     2058.47     2067
no drug rx      6.47     1559.53     1566
                15       3618        3633

x² = {14 – 8.53}²/8.53 + {2053 – 2058.47}²/2058.47 + {1 – 6.47}²/6.47 + {1565 – 1559.53}²/1559.53 = 8.17
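The same four-cell sum can be computed without first rounding the expected counts. A Python sketch (the notes round E[a] to 8.53, etc., before squaring, which is why the last decimal differs slightly):

```python
# Chi-square for the leukemia-after-drug-treatment table, from exact expecteds.
obs = [[14, 2053],   # drug rx: leukemia, not
       [1, 1565]]    # no drug rx
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)         # 3633

x2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(2))
print(round(x2, 1))   # 8.2 (the notes get 8.17 from rounded expecteds)
```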
e.g. Breast Cancer in women repeatedly exposed to multiple X-ray fluoroscopies (Boice and Monson 1977):

              cancers    Women-years (WY)
exposed         41         28,010
not exposed     15         19,017
                56         47,027

"Expected" numbers of breast cancers under H0: rate not affected by X-rays:

exposed        33.4        28,010
not exposed    22.6        19,017
                56         47,027

χ² = {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 + {deviation}²/28K + {deviation}²/19K
   ≈ {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 = 4.29  [3.74 with continuity correction]

This is equivalent to testing whether the a+b (= 56) events could split in this extreme or a more extreme way; under H0 one would expect the split to be (apart from random variation) in the ratio WY_exposed : WY_not-exposed.

e.g. MI in the first 56 months of the US MDs' study of aspirin:

              MI      not         Total MDs
aspirin       104     remainder     11K
placebo       189     remainder     11K
              293     remainder     22K

"Expected" numbers of MI under H0: rate not affected by aspirin:

aspirin       146.5   rest          11K
placebo       146.5   rest          11K
              293     rest          22K

x² = {104 – 146.5}²/146.5 + {189 – 146.5}²/146.5 + {42.5}²/11K + {–42.5}²/11K ≈ 24.7

In effect, this tests whether 293 MIs could distribute this unevenly if each were allocated by tossing a fair coin.

[WE WILL REVISIT THE 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING EFFECT MEASURES for INCIDENCE RATES]
Inference from 2 way Tables M&M §9 Comparison of Proportions --- Paired Data
(McNemar) Test of equality of proportions:
e.g. response of the same subject in each of 2 conditions (self-paired); responses of a matched pair, one member in 1 condition, 1 in the other; ∆'s in paired responses on an interval scale, reduced to the sign of ∆.

-1- discard the concordant pairs (+,+) and (–,–) as being "un-informative" (this point is somewhat controversial)
-2- analyze the split of the (b+c) discordant pairs (under H0, expect 50:50)

                               Result in Other PAIR Member
                               Positive    Negative    Total PAIRS
Result in One     Positive        a           b
PAIR Member       Negative        c           d
                                                        n PAIRS

The same layout serves for matched case-control pairs, with "Exposure in Case" as the rows and "Exposure in Control" as the columns.

Example: HIV in twins in relation to order of delivery [Lancet, Dec 14 '91]. Mother -> infant transmission of HIV infection: 66 sets of twins.

                               Result in 2nd-born Twin
                               HIV +    HIV –    Total Sets
Result in          HIV +        10       18
1st-born Twin      HIV –         4       34
                                                  66 Sets

To analyze the 'split' of discordant pairs:

(if n small)   Binomial probabilities with "n" = b+c and π = 0.5
               (Table C, or the Table for the Sign Test, is helpful here)

(if n larger)  • Z test of the observed proportion p = b/(b+c) vs π = 0.5
               • χ² test of the observed 1x2 table [ b | c ] versus the 1x2 table
                 expected if H0 holds: [ (b+c)/2 | (b+c)/2 ]
               Note that the Z² and x² are equivalent.

Extreme situations (one-or-other / forced choice, e.g. exercise 8.18, or who dies first among twin pairs discordant for handedness):

                      Shorter
                      Won     Lost    Total PAIRS
Taller     Won         -       b
           Lost        c       -
                                       n PAIRS

One can also turn this table 'inside-out' and analyze it using the case-control approach:

                      Loser
                      Taller    Shorter    Total PAIRS
Winner     Taller       -         b
           Shorter      c         -
                                            n PAIRS
(McNemar) Test of equality of proportions: worked example

                               Result in 2nd-born Twin
                               HIV +    HIV –    Total Sets
Result in          HIV +        10       18
1st-born Twin      HIV –         4       34
                                                  66 Sets

Gaussian (or, equivalently, Chi-square) approximation to the Binomial:

Z = (18 – E[b]) / SD[b] = (18 – 22/2) / √(22 × 0.5 × 0.5) = (18 – 11)/√5.5 = 2.98 ; Z² = 7²/5.5 = 8.91

Prob(|Z| > 2.98) = 0.003; from Table F, Prob[X²(1 df) > 8.91] ≈ 0.003

χ² test of the observed 1x2 table [ 18 | 4 ] versus the 1x2 table expected if H0 holds, [ 11 | 11 ]:

X² = {18 – 11}²/11 + {4 – 11}²/11 = 7²/5.5 = 8.91

Analysis using the exact binomial: Binomial probabilities with "n" = b+c = 22 sets with discordant outcomes.

Under H0 that order makes no difference to the likelihood of HIV transmission, the split among these 22 should be like that obtained by tossing 22 coins, each with

π(first-born is the one to have the HIV transmitted) = 0.5

The Binomial(n=22, π=0.5) distribution is not available in Table C, but can be obtained from Excel. Of interest is the sum of the probabilities for 18/22, 19/22, 20/22, 21/22 and 22/22 (1-sided), then doubled if dealing with a 2-sided alternative, i.e. 2 × (0.00174 + 0.00036 + negligible terms) = 0.004.

Notice that because of the symmetry involved in testing π = 0.5 versus a 2-sided alternative, the test statistics have a particularly simple form:

Z² = x² = (b – c)² / (b + c) ; (18 – 4)²/22 = 8.91 in our example

With continuity correction: Z²c = x²c = (|b – c| – 1)² / (b + c) = 169/22 = 7.68 (2-sided P ≈ 0.005)

Q: Why a continuity correction of 1 rather than the usual 0.5?
A: The difference b – c jumps in 2's rather than 1's
   (e.g. if b+c = 18, then b – c = 18, 16, 14, ..., –14, –16, –18)
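The simple (b – c)²/(b + c) form, its corrected version, and the exact binomial p-value obtained above from Excel can all be verified with a few lines. A Python sketch for the HIV-twins discordant pairs:

```python
import math

# McNemar test for the discordant twin pairs: b = 18, c = 4.
b, c = 18, 4
n = b + c   # 22

x2 = (b - c) ** 2 / n                 # 196/22
x2_corr = (abs(b - c) - 1) ** 2 / n   # correction of 1, not 0.5: b - c jumps in 2's

# exact 2-sided p: 2 * P(B >= 18) for B ~ Binomial(22, 0.5)
p_one_sided = sum(math.comb(n, k) for k in range(b, n + 1)) / 2 ** n
p_two_sided = 2 * p_one_sided

print(round(x2, 2), round(x2_corr, 2), round(p_two_sided, 3))  # 8.91 7.68 0.004
```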
Inference from 2 way Tables M&M §9 Inferences regarding proportions --- Summary
Situation 1 -- One population; parameter π (sample of n; p = y/n are "positive"):

  CI for π
    • without Gaussian approximation: nomograms / tables / spreadsheet (cf. asymmetric; see notes on 8.1)
    • with Gaussian approximation: p ± z·√{ p[1 – p] / n }

  Test π = π0
    • without: Binomial distribution
    • with: z = (p – π0) / √{ π0[1 – π0] / n }    {z² = x²}

Situation 2 -- Two populations (matched samples), or 1 population under 2 conditions; OR = [π1/(1 – π1)] / [π2/(1 – π2)]
(sample of n pairs: n(++) = a, n(+–) = b, n(–+) = c, n(––) = d; b + c = n'; or (i.e. est. of OR) = b/c):

  CI for OR
    • b ~ Bin(n', OR/(1 + OR)); CI for OR/(1 + OR) => CI for OR

  Test π1 = π2, i.e. test OR = OR0
    • without: b ~ Bin(n', OR0/(1 + OR0))
    • with: z = { b/n' – OR0/[1 + OR0] } / SE[ b/n' | H0 ], or x²

Situation 3 -- Two populations, π1 and π2 (independent samples of n1 and n2):

  CI for ∆ = π1 – π2
    • without: Miettinen and Nurminen
    • with: p1 – p2 ± z·√{ p1[1 – p1]/n1 + p2[1 – p2]/n2 }  (also via binomial regression** ... RD)

  CI for RR
    • cf. Rothman p134; regression (RR)

  CI for OR
    • without: Conditional
    • with: Condn'l[Approx.] / Woolf / Miettinen (or via binomial (logistic) regression**)

  Test RR = 1 or OR = 1 or ∆ = 0
    • without: Fisher's Exact Test (cond'nl); unconditional methods (Suissa and Shuster); permutational (StatXact software)
    • with: z = { [p1 – p2] – ∆0 } / √{ p[1 – p]/n1 + p[1 – p]/n2 } (*)  {or x²}

Notes:
(*) p in the combined data = (n1p1 + n2p2) / (n1 + n2) = Σ numerators / Σ denominators (a weighted average of the two p's)
** Binomial regression: extension [to come] of the 1-parameter binomial regression models described in the notes for 8.1
Inference from 2 way Tables M&M §9 Tests of Association --- Tables with r rows c columns
e.g. independence of classification on 2 variables; similarity of multinomial profiles.

(generic) Relationship between one factor (rows) and another (columns) in n observations, cross-classified into an r(= # of rows) x c(= # of columns) table:

          Col1     Col2    ...   Colc     Total
Row1      n11      n12     ...   n1c      Nrow1
Row2      n21      n22     ...   n2c      Nrow2
...       ...      ...     ...   ...      ...
Rowr      nr1      nr2     ...   nrc      Nrowr
Total     Ncol1    Ncol2   ...   Ncolc    N

χ²df = Σ { observed – expected }² / expected

• expected number in a cell = Nrow · Ncolumn / N
• the summation is over all r x c cells
• degrees of freedom (df) = (r–1)(c–1). In e.g. 1 below, r = 3 and c = 3 => df = 4.

(e.g. 1) Relationship between laterality of hand and laterality of eye (measured by astigmatism, acuity of vision, etc.) in 413 subjects, cross-classified into a 3x3 table [data from Woo, Biometrika 2A 79-148]:

                Left-eyed    Ambiocular    Right-eyed    Total
Left-handed         34           62            28         124
Ambidextrous        27           28            20          75
Right-handed        57          105            52         214
Total              118          195           100         413

(e.g. 2) Quality of sleep before elective operation [BMJ]:

                            Bad    Reasonably good    Good    Total
Patients given Triazolam     2           17             12      31
Patients given Placebo       8           15              8      31
Total                       10           32             20      62

(e.g. 3) Outcome after 2 to 7 days of Rx in 20 patients with chronic oral candidiasis:

                  Outcome category
                  1 (good)    2    3    4 (poor)    Total
Clotrimazole          6       3    1       0         10
Placebo               1       0    0       9         10
Total                 7       3    1       9         20

Analyzing data from ORDERED categories: using a chi-square test for a table such as the 2x3 table of e.g. 2 ignores the ordered nature of the responses, and any dichotomization of the outcomes loses information and statistical power. Moses et al. suggest using the Mann-Whitney U test (also known as the Wilcoxon Rank Sum test) to take account of the ordered nature of the response categories. See the article by Moses L et al., NEJM 311:442-448, 1984 (also published as a chapter in Medical Uses of Statistics by J Bailar and F Mosteller).
The χ² statistic measures the deviation from independence of the row and column classifications (e.g. 1) and the dissimilarity of the distributions (profiles) of responses (e.g. 2 and 3). However, omnibus chi-square tests (H0: identical response profiles) with large df are seldom of interest, since the alternative hypothesis (profiles are not identical) is so broad, and the chi-square tests are invariant to the ordering of the rows and columns. More often, a specific alternative hypothesis is of interest; omnibus tests penalize one for looking in all directions, when in fact one's focus is narrower and aims to pick up a specific 'signal'. The next 2 examples (>2 ORDERED response categories in each of 2 groups; binary responses in >2 ORDERED exposure categories) are a more fruitful step in this direction.
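As a sketch of the χ² recipe just described (expected count = Nrow·Ncolumn/N, summed over all r x c cells), here it is applied to the 2x3 sleep-quality table of e.g. 2:

```python
def chi_square_rc(table):
    """Pearson chi-square and df for an r x c table of counts:
    sum over all cells of (observed - expected)^2 / expected,
    with expected = (row total)(column total)/N and df = (r-1)(c-1)."""
    r, c = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(row_tot)
    chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(r) for j in range(c))
    return chi2, (r - 1) * (c - 1)

# e.g. 2: rows = Triazolam, Placebo; columns = Bad, Reasonably good, Good
chi2, df = chi_square_rc([[2, 17, 12], [8, 15, 8]])
```

With df = 2 the statistic (about 4.5) does not reach the conventional 5% critical value of 5.99, and, as the text stresses, this omnibus test ignores the ordering of the response categories.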
Test for trend in (Response) Proportions [from A&B §12.2]

Suppose that, in a k x 2 contingency table, the k groups fall into a natural order. They may correspond to different values, or groups of values, of a quantitative variable like age; or they may correspond to qualitative categories, such as severity of a disease, which can be ordered but not readily assigned a numerical value. The usual χ²(k–1) test is designed to detect differences between the k proportions without taking the 'ordering' of the rows into account. It is an 'omnibus' test and is unchanged even if we interchange the order of the columns. More specifically, one might ask whether there is a significant trend in these proportions from group 1 to group k. Let us assign a quantitative variable, x, to the k groups. If the definition of groups uses such a variable, this can be chosen to be x. If the definition is qualitative, x can take integer values from 1 to k. The notation is as follows:

  Group   x    Pos   Neg       Total   Frequency proportion positive
    1     x1   r1    n1 – r1    n1      p1
    2     x2   r2    n2 – r2    n2      p2
    .     .    .     .          .       .
    i     xi   ri    ni – ri    ni      pi
    .     .    .     .          .       .
    k     xk   rk    nk – rk    nk      pk
   ---    --   --    ------     --     ---
   All         R     N – R      N       P (= R/N)

The χ²(1) statistic for trend, X²(1), which forms part of the overall X², can be computed as follows:

  X²(1 df) = N { N∑ri xi – R∑ni xi }² / ( R{N–R} [ N∑ni xi² – (∑ni xi)² ] )

Example [jh]
Distribution of subjects with polluted-water-exposure-related symptoms among Competitors and Employees, and Relative Risk (RR) according to number of falls in the water. Data from the article "Health Hazards Associated with Windsurfing on Polluted Water", AJPH 76: 690-691, 1986 -- research conducted at the Windsurfer Western Hemisphere Championship held over 9 days in August 1984. During the championships, the same single-menu meals were served to both competitors and employees.

  Groups of           No. of subjects   No.
  subjects            with symptoms     without   Total   RD    RR    OR
  Employees (ref gp)      8 (20%)          33       41    --    1.0   1.0
  Competitors:
    0-10 falls           15 (44%)          19       34    24%   2.3   3.3
    11-20 falls           9 (45%)          11       20    25%   3.5   3.4
    21-30 falls          10 (71%)           4       14    51%   3.7   10.3
    > 30 falls           10 (100%)          0       10    80%   5.1   inf.

Any dichotomization of exposure loses information and statistical power. The authors correctly used the chi-square test for trend, yielding χ²(1 df) = 25.3, P = 10^-6. I get 24.58 with the "spacing" 0, 5, 15, 25 and 40. SAS*, using the "Cochran-Armitage Trend Test" with the same spacing, gives a Z statistic of -4.969 (Z² = 24.69). The entire variation among the 5 proportions in the table (ignoring ordering) is approximately X²(4 df) = 27, but it is almost all explained by the exposure gradient. In smaller datasets, even if the overall X² is not significant, the trend portion can be. In this e.g. there was such a strong relationship that even the overall test was significant. The same is true in the example overleaf (dealing with birth date and sporting success), where again the sample sizes are large and the signal strong.

* Syntax:

From SAS, if there is 1 line of data for each of the 119 individuals:

  PROC FREQ DATA= ... ; TABLES falls*sick / TREND;

If instead you enter a variable (say "number") to indicate how many persons have each exposure/response pattern, then the syntax is:

  PROC FREQ DATA= ... ; TABLES falls*sick / TREND; WEIGHT number;

From Stata, with the grouped data:

  input falls ill number
   0 0 33
   0 1 8
   5 0 19
   5 1 15
  15 0 11
  15 1 9
  25 0 4
  25 1 10
  40 0 0
  40 1 10
  end

  tabodds ill falls [freq=number]
PS: If you look up A&B, you will find another χ² [Eqn. 12.2]. This value, calculated as the difference between the trend and the overall χ² statistics, can be used to test whether there is serious non-linear variation over and above the linear trend.
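The trend formula above can be sketched and checked against the windsurfing data, using the spacing 0, 5, 15, 25, 40 quoted in the text (employees scored x=0):

```python
def chi_square_trend(x, r, n):
    """X^2 (1 df) for trend in proportions (A&B 12.2):
    N*{ N*sum(ri*xi) - R*sum(ni*xi) }^2
      / ( R*(N-R)*[ N*sum(ni*xi^2) - (sum(ni*xi))^2 ] )"""
    N, R = sum(n), sum(r)
    s_rx = sum(ri * xi for ri, xi in zip(r, x))
    s_nx = sum(ni * xi for ni, xi in zip(n, x))
    s_nx2 = sum(ni * xi * xi for ni, xi in zip(n, x))
    return N * (N * s_rx - R * s_nx) ** 2 / (R * (N - R) * (N * s_nx2 - s_nx ** 2))

# windsurfing data: positives r and totals n for x = 0, 5, 15, 25, 40 falls
x2 = chi_square_trend([0, 5, 15, 25, 40], [8, 15, 9, 10, 10], [41, 34, 20, 14, 10])
```

This gives X² of about 24.69, i.e. the square of the SAS Cochran-Armitage Z = -4.969 quoted above (the 24.58 in the text presumably reflects a slightly different variant of the computation).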
Example: Birth date and sporting success

[Figure: bar chart of the number of Dutch soccer league players (vertical axis, 75 to 200) by birthdate quarter: Aug-Oct, Nov-Jan, Feb-Apr, May-Jul.] Relationship between birthdate and participation rates in the Dutch soccer league. Note the ordinate begins at 75 players.

SCIENTIFIC CORRESPONDENCE in NATURE, VOL 368, 14 APRIL 1994, p592
Sir — I have found a significant relationship between birth date and success in tennis and soccer. In the Netherlands and England, players born early in the competition year are more likely to participate in national soccer leagues. The high incidence of elite athletes born in the first quarter of the competition year can be explained by the effects of age-group position.

In organized sport, talent is considered predominantly in terms of physical skills, and the influence of social and psychological factors is often ignored or underestimated¹. Various studies have investigated the psychological characteristics of elite athletes², but none has looked for an effect of age. I discovered a strikingly skewed distribution of the dates of birth of 12- to 16-year-old tennis players in the top rankings of the Dutch youth league. Half of a sample of 60 tennis players were born in the first 3 months of the year.

This discovery led me to consider the distribution of the dates of birth of professional soccer players. In the Netherlands, there are two leagues comprising a total of 36 clubs. I found a striking difference between the participation rates of those born in August and July. The Dutch soccer competition year starts on the first of August. A chi-square test indicates that the distribution is not uniform (P<0.001); and a regression analysis demonstrates a clear linear relationship between month of birth and number of participants. The dates of birth of 621 players, compiled into quarters, are shown in the figure. This relationship cannot be attributed to the distribution of births in the Netherlands, as this is highly uniform.
We also inspected the distribution of the dates of birth of English football players in league clubs in the period 1991-92 (ref. 3). Birth dates for all players were tabulated by month and compiled into quarters. The results (table) show the significant effect of date of birth on the participation rate of soccer players within each of the national leagues, indicating that, as in the Netherlands, significantly more football players are born in the first quarter of the competition year (which starts in September in England).

PARTICIPATION RATES IN ENGLISH SOCCER LEAGUES

                  Players in birthdate quarters              Statistics
  League       Sep-Nov  Dec-Feb  Mar-May  Jun-Aug   Total  Chi-Square  Sig. Level
  FA premier     288      190      147      136       761     75.5     P<0.0001
  Division 1     264      169      154      147       734     48.47    P<0.0001
  Division 2     251      168      123      131       673     61.11    P<0.0001
  Division 3     217      169      121      102       609     52.38    P<0.0001
  Total        1,020      696      545      516     2,777    230.77    P<0.0001
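The chi-square values in the table are goodness-of-fit statistics against a uniform distribution over the four quarters (df = 3); a quick sketch reproduces the FA premier value:

```python
def chi2_uniform(counts):
    """Goodness-of-fit chi-square versus a uniform distribution
    over the k categories (df = k - 1)."""
    n, k = sum(counts), len(counts)
    expected = n / k
    return sum((o - expected) ** 2 / expected for o in counts)

# FA premier row: players per birthdate quarter (Sep-Nov, Dec-Feb, Mar-May, Jun-Aug)
chi2 = chi2_uniform([288, 190, 147, 136])
```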
There is a known relationship between date of birth and educational achievement⁵, implying that the younger children in any school year group are at a disadvantage compared to the older children. Children who participate in sports are also placed in age groups, and my results imply many athletes in organized sports may never get a fair chance because of this method of classification. Very little attention has been drawn to this problem. One of the few studies done in this area analysed the dates of birth of young Canadian hockey players in the 1983-84 season⁶. Players possessing a relative age advantage (born in the months January-June) were more likely to participate in minor hockey and more likely to play for top teams than players born in July-December.

More than 20 years ago, this journal published an article concerning the relationship between season of birth and cognitive development⁷. The authors attributed this relationship to a fault in the British educational system. A similar relationship was found⁵ in the Netherlands. Despite this, no action was undertaken to change the educational system. One can only hope that this will not be the case for sports.

Ad Dudink, Faculty of Psychology, University of Amsterdam, 1018 WB 3 Amsterdam, The Netherlands

References: 1. Dudink A. Eur J High Ability 1, 144-150 (1990). 2. Dudink A & Bakker F. Ned Tschr Psychol 48, 55-69 (1993). 3. Rollin J. Rothmans Football Yearbook 1992-93 (Headline, London, 1992). 4. Shearer E. Educ Res 10, 51-56 (1967). 5. Doornbos K. Date of birth and scholastic performance (Wolters-Noordhoff, Groningen, 1971). 6. Barnsley RH & Thompson AH. Can J Behav Sci 20, 167-176 (1988). 7. Williams P, Davies P, Evans R & Ferguson N. Nature 228, 1033-1036 (1970).
-----------------
For an example of an analysis of seasonal variation, see the article by H T Sørensen et al., "Does month of birth affect risk of Crohn's disease in childhood and adolescence?", p 907, BMJ Volume 323, 20 October 2001, bmj.com (copy of article, and associated dataset, on the course 626 website).
Correlation M&M §2.2

References: A&B Ch 5, 8, 9, 10; Colton Ch 6; M&M Chapter 2.2

Measures of Correlation

Loose Definition of Correlation:
The degree to which, in observed (x,y) pairs, the y value tends to be larger than average when x is larger (smaller) than average; the extent to which larger than average x's are associated with larger (smaller) than average y's.

Similarities between Correlation and Regression
• Both involve relationships between a pair of numerical variables.
• Both: "predictability", "reduction in uncertainty", "explanation".
• Both involve straight-line relationships [can get fancier too].

Differences

  Correlation                                  Regression
  Symmetric (doesn't matter which is on        Directional (matters which is on Y,
  the Y axis, which on the X axis)             which on the X axis)

  Choose n 'objects'; measure (X,Y)            (i) Choose n objects on the basis of their
  on each                                      X values; measure their Y; or (ii) choose
                                               objects, as with correlation, and measure
                                               (X,Y). Regard the X value as 'fixed'.

  Dimensionless (no units);                    ∆Y/∆X units, e.g. Kg/cm
  ranges from –1 to +1

  Can be extended to non-straight-line         Can relate Y to multiple X variables.
  relationships

Pearson Product-Moment Correlation Coefficient

  Context                 Symbol   Calculation

  sample of n pairs       r_xy     ∑{xi – x̄}{yi – ȳ} / √[ ( ∑{xi – x̄}² ) ( ∑{yi – ȳ}² ) ]

  "universe" of all pairs ρ_xy     E{ (X – µX)(Y – µY) } / √[ E{ (X – µX)² } E{ (Y – µY)² } ]

Notes:
° ρ: Greek letter r, pronounced 'rho'.
° E: Expected value.
° µ: Greek letter 'mu'; denotes the mean in the universe.
° Think of r as an average product of scaled deviations [M&M p127 use n-1 because the two SDs involved in creating Z scores implicitly involve 1/√(n-1); the result is the same as above].

Spearman's (Non-parametric) Rank Correlation Coefficient

  x -> rank : replace the x's by their ranks (1=smallest to n=largest)
  y -> rank : replace the y's by their ranks (1=smallest to n=largest)
  THEN calculate the Pearson correlation for the n pairs of ranks (see later).
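A sketch of the sample formula, with illustrative numbers; note the symmetry: swapping the roles of x and y leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """r = sum{(xi - xbar)(yi - ybar)}
         / sqrt( sum{(xi - xbar)^2} * sum{(yi - ybar)^2} )"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_xy = pearson_r([1, 2, 3, 4], [2, 4, 5, 8])   # illustrative data
r_yx = pearson_r([2, 4, 5, 8], [1, 2, 3, 4])   # same value: correlation is symmetric
```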
ρ² is a measure of how much the variance of Y is reduced by knowing the value of X (or vice versa).

Positive correlation: larger than ave. X's with larger than ave. Y's; smaller than ave. X's with smaller than ave. Y's.
Negative correlation: larger than ave. X's with smaller than ave. Y's; smaller than ave. X's with larger than ave. Y's.
None: larger than ave. X's 'equally likely' to be coupled with larger as with smaller than ave. Y's.

See the article by Chatillon on the "Balloon Rule" for visually estimating r (cf. Resources for Session 1, course 678 web page).

[Figure: scatterplots divided into quadrants at ave(X) and ave(Y), showing the sign of the PRODUCT of deviations in each quadrant:
  X-deviation –, Y-deviation + : PRODUCT –        X-deviation +, Y-deviation + : PRODUCT +
  X-deviation –, Y-deviation – : PRODUCT +        X-deviation +, Y-deviation – : PRODUCT –
and how r ranges from -1 (negative correlation) through 0 (zero correlation) to +1 (positive correlation); r is not tied to the x or y scale.]

  Var( Y | X ) = Var( Y ) × ( 1 – ρ² )
  Var( X | Y ) = Var( X ) × ( 1 – ρ² )

ρ² is called the "coefficient of determination". A large ρ² (i.e. ρ close to -1 or +1) -> close linear association of the X and Y values; far less uncertainty about the value of one variable if told the value of the other.

If the X and Y scores are standardized to have mean 0 and unit SD, it can be seen that ρ is like a "rate of exchange", i.e. the value of a standard deviation's worth of X in terms of PREDICTED standard deviation units of Y. If we know an observation is Z_X SDs from µX, then the least squares prediction of the observation's Z_Y value (i.e. relative to µY) is given by

  predicted Z_Y = ρ • Z_X

Notice the regression towards the mean: ρ is always less than 1 in absolute value, and so the predicted Z_Y is closer to 0 (or, equivalently, the predicted Y is closer to µY) than the Z_X was to 0 (or X was to µX).
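The prediction rule Z_Y = ρ • Z_X can be checked numerically: once both variables are standardized, the least-squares slope is exactly r (the data below are arbitrary illustrative numbers):

```python
import math

def z_scores(v):
    """Standardize to mean 0, SD 1 (SD computed with n-1)."""
    n, m = len(v), sum(v) / len(v)
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))
    return [(x - m) / sd for x in v]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

xs, ys = [1.0, 2.0, 4.0, 5.0, 8.0], [2.0, 3.0, 3.0, 6.0, 7.0]
zx, zy = z_scores(xs), z_scores(ys)
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)  # slope of zy on zx
r = pearson_r(xs, ys)
```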
Inferences re ρ [based on a sample of n (x,y) pairs]

Naturally, the observed r in any particular sample will not exactly match the ρ in the population (i.e. the coefficient one would get if one included everybody). The quantity r varies from one possible sample of n to another possible sample of n, i.e. r is subject to sampling fluctuations about ρ.¹

A question all too often asked of one's data is whether there is evidence of a non-zero correlation between 2 variables. To test this, one sets up the null hypothesis that ρ is zero and determines the probability, calculated under this null hypothesis that ρ = 0, of obtaining an r more extreme than we observed. If the null hypothesis is true, r would just be "randomly different" from zero, with the amount of the random variation governed by n.

This discrepancy of r from 0 can be measured as

  r √(n – 2) / √(1 – r²)

and should, if the null hypothesis of ρ = 0 is true, follow a t distribution with n-2 df.

[Colton's table A5 gives the smallest r which would be considered evidence that ρ ≠ 0. For example, if n=20, so that df = 18, an observed correlation of 0.44 or higher, or between -0.44 and -1, would be considered statistically significant at the P=0.05 level (2-sided). NB: this t-test assumes that the pairs are from a Bivariate Normal distribution. Also, it is valid only for testing ρ = 0, not for testing any other value of ρ.]

Other common questions: given that r is based only on a sample, what interval should I put around r so it can be used as a (say 95%) confidence interval for the "true" coefficient ρ? Or (answerable by the same technique): one observes a certain r1; in another population, one observes a value r2. Is there evidence that the ρ's in the 2 populations we are studying are unequal?

From our experience with the binomial statistic, which is limited to {0,n} or {0,1}, it is no surprise that the r statistic, limited as it is to {-1, +1}, also has a pattern of sampling variation that is not symmetric unless ρ is right in the middle, i.e. unless ρ = 0. The following transformation of r will lead to a statistic which is approximately normal even if the ρ('s) in the population(s) we are studying is(are) quite distant from 0:

  (1/2) ln { (1 + r) / (1 – r) }   [where ln is log to the base e, or natural log]

It is known as Fisher's transformation of r; the observed r, transformed to this new scale, should be compared against a Gaussian distribution with

  mean = (1/2) ln { (1 + ρ) / (1 – ρ) }   and   SD = 1 / √(n – 3).

¹ JH has seen many a researcher scan a matrix of correlations, highlighting those with a small p-value and hoping to make something of them. But very often, that ρ was non-zero was never in doubt; the more important question is how non-zero the underlying ρ really was. A small p-value (from maybe a feeble r but a large n!) should not be taken as evidence of an important ρ! JH has also observed several disappointed researchers who mistakenly see the small p-values and think they are the correlations! (The p-values associated with the test of ρ = 0 are often printed under the correlations.)
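The t statistic just described can be sketched in one line; here it is evaluated at r = 0.55 with n = 12, the values used in the worked confidence-interval example later in these notes:

```python
import math

def t_for_rho_zero(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2;
    valid only for H0: rho = 0, and assumes bivariate-normal pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_rho_zero(0.55, 12)   # compare with t(10, 0.05 two-sided) = 2.23
```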
Interesting example where r ≠ 0, and not by chance alone!

1970 U.S. DRAFT LOTTERY during the Vietnam War: see Moore and McCabe pp 113-114, along with the spreadsheet under Resources for Chapter 10, where the lottery is simulated using random numbers (Monte Carlo method).
Inferences re ρ [continued...]

e.g. 2a: Testing H0: ρ = 0.5

Observe r = 0.4 in a sample of n = 20. Compute

  z = [ (1/2) ln{ (1 + 0.4)/(1 – 0.4) } – (1/2) ln{ (1 + 0.5)/(1 – 0.5) } ] / [ 1/√(n – 3) ]

and compare with Gaussian(0,1) tables. Extreme values of the standardized Z are taken as evidence against H0. Often, the alternative hypothesis concerning ρ is 1-sided, of the form ρ > some quantity.

e.g. 2b: Testing H0: ρ1 = ρ2, with r1 & r2 observed in independent samples of n1 & n2

Remembering that "variances add; SDs do not", compute the test statistic

  z = [ (1/2) ln{ (1 + r1)/(1 – r1) } – (1/2) ln{ (1 + r2)/(1 – r2) } – 0 ] / √[ 1/(n1 – 3) + 1/(n2 – 3) ]

and compare with Gaussian(0,1) tables.

e.g. 2c: 100(1–α)% CI for ρ from an observed r (e.g. r = 0.4 in a sample of n = 20)

By solving the double inequality

  – z_{α/2}  ≤  [ (1/2) ln{ (1 + r)/(1 – r) } – (1/2) ln{ (1 + ρ)/(1 – ρ) } ] / [ 1/√(n – 3) ]  ≤  + z_{α/2}

for the middle term ρ, we can construct a CI for ρ:

  [Low, High] = [ 1 + r – {1 – r} e^{± 2 z_{α/2} / √(n – 3)} ] / [ 1 + r + {1 – r} e^{± 2 z_{α/2} / √(n – 3)} ]

Worked e.g.: 95% CI(ρ) based on r = 0.55 in a sample of n = 12.

With α = 0.05, z_{α/2} = 1.96, the lower & upper bounds for ρ are:

  [ 1 + 0.55 – {1 – 0.55} e^{± 2 • 1.96/√9} ] / [ 1 + 0.55 + {1 – 0.55} e^{± 2 • 1.96/√9} ]

  = [ 1.55 – 0.45 e^{± 1.307} ] / [ 1.55 + 0.45 e^{± 1.307} ]

  = (1.55 – 0.45 • 3.69)/(1.55 + 0.45 • 3.69) ,  (1.55 – 0.45/3.69)/(1.55 + 0.45/3.69)

  = –0.04 to 0.85

This CI, which overlaps zero, agrees with the test of ρ = 0 described above. For if we evaluate 0.55 √(12 – 2) / √(1 – 0.55²), we get a value of 2.08, which is not as extreme as the tabulated t(10, 0.05 2-sided) value of 2.23.

Note: There will be some slight discrepancies between the t-test of ρ = 0 and the z-based CIs. The latter are only approximate. Note also that both assume we have data which have a bivariate Gaussian distribution.
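The CI formula can be sketched compactly using atanh/tanh, which is algebraically the same as the {1 + r – (1 – r)e^±...} / {1 + r + (1 – r)e^±...} expression above:

```python
import math

def fisher_ci(r, n, z=1.96):
    """Approximate CI for rho via Fisher's transformation:
    tanh( (1/2) ln[(1+r)/(1-r)] +/- z/sqrt(n-3) )."""
    zr = 0.5 * math.log((1 + r) / (1 - r))
    half = z / math.sqrt(n - 3)
    return math.tanh(zr - half), math.tanh(zr + half)

lo, hi = fisher_ci(0.55, 12)   # the worked example: r = 0.55, n = 12
```

This reproduces the worked example (roughly -0.04 to 0.85); the interval overlaps 0, agreeing with the t-test.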
(Partial) NOMOGRAM for 95% CIs for ρ

n = 10, 15, 25, 50, 150

[Figure: nomogram plotting ρ (vertical axis, -1 to +1) against observed r (horizontal axis, -0.8 to +1), with one pair of curves for each n; the curves are given by {1+r-(1-r)Exp[±2z/Sqrt(n-3)]} / {1+r+(1-r)Exp[±2z/Sqrt(n-3)]}. Shown: the CI if the observed r = 0.5 (o) with n = 25.]

It is based on Fisher's transformation of r. In addition to reading it vertically to get a CI for ρ (vertical axis) based on an observed r (horizontal axis), one can also use it to test whether an observed r is compatible with, or significantly different (at the α = 0.05 level) from, some specific value, ρ0 say, on the vertical axis: simply read across from ρ = ρ0 and see if the observed r falls within the horizontal range appropriate to the sample size involved. Note that this test of a nonzero ρ is not possible via the t-test. Books of statistical tables have fuller nomograms.

One could also use the nomogram to gauge the approximate 95% limits of variation for the correlation in a draft lottery. The n = 366 is a little more than 2.44 times the n = 150 here, so the (horizontal) variations around ρ = 0 should be only 1/√2.44, or 64%, as wide as those shown here for n = 150. Thus the 95% range of r would be approximately -0.1 to +0.1 (since X and Y are uniform, rather than Gaussian, theory may be a little "off"). The observed r was -0.23.
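The draft-lottery calculation in the last paragraph can be sketched directly: under ρ = 0, the approximate 95% limits of sampling variation of r are ±tanh(1.96/√(n-3)):

```python
import math

def null_limits_for_r(n, z=1.96):
    """Approximate 95% limits of the sampling variation of r when rho = 0,
    from Fisher's transformation."""
    half = z / math.sqrt(n - 3)
    return -math.tanh(half), math.tanh(half)

lo, hi = null_limits_for_r(366)   # 1970 draft lottery: n = 366 birth dates
```

The limits come out near ±0.10, so the observed r = -0.23 is well outside the play-of-chance range.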
Spearman's (Non-parametric) Rank Correlation Coefficient

How Calculated:
(i) replace the x's and y's by their ranks (1=smallest to n=largest);
(ii) calculate the Pearson correlation using the n pairs of ranks.

Equivalently:

  r_Spearman = 1 – 6∑d² / [ n{n²-1} ]

  { d = difference between the "X rank" & "Y rank" for each observation }

Advantages
• Easy to do manually (if the ranking is not a chore).
• Less sensitive to outliers: x -> rank ==> variance fixed (for a given n). Extreme {xi – x̄} or {yi – ȳ} values can exert considerable influence on r_Pearson.
• Picks up on non-linear (but monotone) patterns: e.g. for a set of points that rise steadily but along a curve rather than a straight line, r_Spearman is 1, whereas r_Pearson is less.
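A sketch of the two-step recipe (rank, then Pearson via the 6∑d² shortcut), checked on a monotone but non-linear pattern (y = x³): Spearman's coefficient is exactly 1 while Pearson's is less. No ties are assumed.

```python
import math

def ranks(v):
    """Ranks 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    rk = [0] * len(v)
    for pos, i in enumerate(order):
        rk[i] = pos + 1
    return rk

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def spearman(xs, ys):
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    n = len(xs)
    return 1 - 6 * d2 / (n * (n * n - 1))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]     # monotone, but far from a straight line
```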
Correlations -- obscured and artifactual

(i) Diluted / attenuated / obscured. Examples:

1. Relationship, in McGill Engineering students, between their first-year university grades and their CEGEP grades.

2. Relationship between heights of offspring and heights of their parents, with
     X = average height of the 2 parents
     Y = height of offspring (ignoring the sex of the offspring).

   Galton's solution: 'transmute' female heights to male heights:
     'transmuted' height = height × 1.08

(ii) Artifact / artificially induced. Example:

1. Blood pressure of unrelated (male, female) 'couples'.

[Figure: schematic scatterplots showing how superimposing two subgroups ( + = ) can (i) dilute/attenuate an underlying correlation or (ii) artificially induce one.]
Regression M&M §2.3 and §10
Uses
• Curve fitting
• Summarization ('model')
• Description
• Prediction
• Explanation
• Adjustment for 'confounding' variables

Technical Meaning
• [originally] simply a line of 'best fit' to data points
• [nowadays] the regression line is the LINE that connects the CENTRES of the distributions of the Y's at each X value
• not necessarily a straight line; could be curved, as with growth charts
• not necessarily µY|X's used as CENTRES; could use medians etc.
• strictly speaking, we haven't completed the description unless we characterize the variation around the centres of the Y distributions at each X
• inference is not restricted to the distributions of Y's for which we make some observations; it applies to the distributions of Y's at all unobserved X values in between

Birth weight (g) percentiles by gestational age: live singleton births, Canada 1986 (extract)

                       MALES                          FEMALES
  Age, wk   Tot. No.   10th    50th    90th   Tot. No.   10th    50th    90th
    25         100      651     810     950        73     604     750     924
    30         257    1 156   1 530   2 214       216   1 040   1 485   2 001
    35       1 840    2 060   2 570   3 140     1 454   1 950   2 460   3 040
    40      68 102    3 020   3 570   4 160    67 149   2 900   3 430   4 000
    42      10 309    3 200   3 770   4 390     9 636   3 060   3 610   4 190

  (rows for the remaining weeks, 31-34, 36-39 and 41, appear in the full table)

[Figure: birth weight (1000g-4000g) vs gestational age (weeks 30-40), showing the birth-weight distribution for males at each week and the medians (50th percentiles) for males and for females.]
Source: Arbuckle & Sherman, CMAJ 140: 157-161, 1989

Examples (with appropriate caveats)
• Birth weight (Y) in relation to gestational age (X)
• Blood pressure (Y) in relation to age (X)
• Cardiovascular mortality (Y) in relation to water hardness (X)?
• Cancer incidence (Y) in relation to some exposure (X)?
• Scholastic performance (Y) vis-a-vis amount of TV watched (X)

Caveat: No guarantee that a simple straight-line relationship will be adequate. Also, in some instances the relationship might change with the type of X and Y variables used to measure the two phenomena being studied; also, the relationship may be more artifact than real (see later, under inference).
Simple Linear† Regression (one X; straight line)

Equation:

  µY|X = α + β X    or    ∆µY|X / ∆X = β = "rise"/"run"

In Practice:
One rarely sees an exact straight-line relationship in health science applications:

1 - While physicists are often able to examine the relationship between Y and X in a laboratory with all other things being equal (i.e. controlled or held constant), medical investigators largely are not. The universe of (X,Y) pairs is very large, and any 'true' relationship is disturbed by countless uncontrollable (and sometimes un-measurable) factors. In any particular sample of (X,Y) pairs these distortions will surely be operating.

2 - The true relationship (even if we could measure it exactly) may not be a simple straight line.

3 - The measuring instruments may be faulty or inexact (using 'instruments' in the broadest sense).

One always tries to have the investigation sufficiently controlled that the 'real' relationship won't be 'swamped' by factors 1 and 3 and that the background "noise" will be small enough that alternative models (e.g. curvilinear relationships) can be distinguished from one another.

Fitting a straight line to data - the Least Squares Method

The most common method is that of Least Squares. Note that least squares can be thought of as just a curve-fitting method and doesn't have to be thought of in a statistical (or random variation or sampling variation) context. Other, more statistically-oriented methods include the method of Minimum Chi-Square (matching observed and expected counts according to a measure of discrepancy) and the method of Maximum Likelihood (finding the parameters that made the data most likely). Each has a different criterion of "best fit".

Least Squares Approach:

• Consider a candidate slope (b) and intercept (a) and predict that the Y value accompanying any X=x is ŷ = a + b•x. The observed y value will deviate from this "predicted" or "fitted" value by an amount d = y - ŷ.

• We wish to keep this deviation as small as possible, but we must try to strike a balance over all the data points. Again, just as when calculating variances, it is easier to work with squared deviations¹:

    d² = (y - ŷ)²

• We weight all deviations equally (whether they be the ones in the middle or at the extremes of the x range), using ∑d² = ∑(y - ŷ)² to measure the overall (or average) discrepancy of the points from the line.

------------------------------------------------------------------
† Linear here means linear in the parameters. The equation y = B•Cˣ can be made linear in the parameters by taking logs, i.e. log[y] = log[B] + x log[C]; y = a + b•x + c•x² is already linear in the parameters a, b and c. The following model cannot be made linear in the parameters α, β, γ:

  proportion dying = α + (1 - α) / ( 1 + exp{β – γ log[dose]} )

¹ There are also several theoretical advantages to least squares estimates over others based, for example, on least absolute deviations: they are the most precise of all the possible estimates one could get by taking linear combinations of y's.
• From all the possible candidates for slope (b) and intercept (a), we choose the particular values a and b which make this sum of squares (the sum of squared deviations of 'fitted' from 'observed' Y's) a minimum, i.e. we search for the a and b that give us the least squares fit.

• Fortunately, we don't have to use trial and error to arrive at the 'best' a and b. Instead, it can be shown by calculus, or algebraically, that the a and b which minimize ∑d² are:

    b = β^ = ∑{xi – x̄}{yi – ȳ} / ∑{xi – x̄}²  =  r_xy • (s_y / s_x)

    a = α^ = ȳ – b x̄

It is helpful, for testing whether there is evidence of a non-zero slope, to think of the simplest of all regression models, namely the one which is a horizontal straight line:

    µY|X = α + 0 • X = the constant α.

That the least-squares estimate of this constant is ȳ is a re-statement of the fact that the sum of squared deviations around a constant horizontal line at height 'a' is smallest when 'a' = the mean.

[Note that a least-squares fit of the regression line of X on Y would give a different set of values for the slope and intercept: the slope of the line of x on y is r_xy • (s_x / s_y). One needs to be careful, when using a calculator or computer program, to specify which is the explanatory variable (X) and which is the predicted variable (Y).]

[We don't always use the mean as the best 'centre' of a set of numbers. Imagine waiting for one of several elevators with doors in a row along one wall; you do not know which one will arrive next, and so want to stand in the 'best' place no matter which one comes next. Where to stand depends on the criterion being optimized: if you want to minimize the maximum distance, stand midway between the door on the extreme left and the one on the extreme right; if you wish to minimize the average distance, where do you stand? If, for some reason, you want to minimize the average squared distance, where do you stand? If the elevator doors are not equally spaced from each other, what then?]

Meaning of the intercept parameter (a):

Unlike the slope parameter (which represents the increase/decrease in µY|X for every unit increase in x), the intercept does not always have a 'natural' interpretation. It depends on where the x-values lie in relation to x=0, and may represent part of what is really the mean Y. For example, the regression line for fuel economy of cars (Y) in relation to their weight (x) might be

    µY|weight = 60 mpg – 0.01 • weight in lbs    [0.01 mpg/lb]

but there are no cars weighing 0 lbs. It would be better to write the equation in relation to some 'central' value for weight, e.g. 3500 lbs; then the same equation can be cast as

    µY|weight = 25 – 0.01 • (weight – 3500).
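The closed-form least-squares estimates above can be sketched and checked on a tiny made-up dataset:

```python
def least_squares(xs, ys):
    """b = sum{(xi - xbar)(yi - ybar)} / sum{(xi - xbar)^2};  a = ybar - b*xbar."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b        # (intercept a, slope b)

a, b = least_squares([1, 2, 3, 4], [2, 4, 5, 8])   # illustrative data
```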
The anatomy of a slope: some re-expressions

Consider the formula:

  slope = b = ∑{x – x̄}{y – ȳ} / ∑{x – x̄}²

Without loss of generality, and for simplicity, assume ȳ = 0.

If we have 3 x's, 1 unit apart (e.g. x1=1, x2=2, x3=3), then

  x1 – x̄ = –1;  x2 – x̄ = 0;  x3 – x̄ = +1

so

  slope = b = [ {–1}y1 + {0}y2 + {+1}y3 ] / [ {–1}² + {0}² + {+1}² ]

i.e.

  slope = (y3 – y1) / (x3 – x1)

Note that y2 contributes to ȳ, and thus to an estimate of the average y (i.e. the level), but not to the slope.

If we have 4 x's, 1 unit apart (e.g. x1=1, x2=2, x3=3, x4=4), then

  x1 – x̄ = –1.5;  x2 – x̄ = –0.5;  x3 – x̄ = +0.5;  x4 – x̄ = +1.5

and so

  slope = b = [ {–1.5}y1 + {–0.5}y2 + {+0.5}y3 + {+1.5}y4 ] / [ {–1.5}² + {–0.5}² + {+0.5}² + {+1.5}² ]

i.e.

  slope = 1.5{y4 – y1}/5 + 0.5{y3 – y2}/5

i.e.

  slope = (9/10) • (y4 – y1)/(x4 – x1) + (1/10) • (y3 – y2)/(x3 – x2)

i.e. a weighted average of the slope from datapoints 1 and 4 and that from datapoints 2 and 3, with weights proportional to the squares of their distances on the x axis, {x4 – x1}² and {x3 – x2}².

Another way to think of the slope:

Rewrite

  b = ∑{x – x̄}{y – ȳ} / ∑{x – x̄}²

as

  b = ∑ {x – x̄}² • [ {y – ȳ}/{x – x̄} ] / ∑{x – x̄}²  =  ∑ weight • [ {y – ȳ}/{x – x̄} ] / ∑ weight

i.e. a weighted average, with weight {x – x̄}² attached to the estimate {y – ȳ}/{x – x̄} of the slope.

Yet another way to think of the slope:

b is a weighted average of all the pairwise slopes (yi – yj)/(xi – xj), with weights proportional to {xi – xj}².

e.g. If the 4 x's are 1 unit apart, denote by b1&2 the slope obtained from {x2,y2} & {x1,y1}, etc.; then

  b = [ 1•b1&2 + 4•b1&3 + 9•b1&4 + 1•b2&3 + 4•b2&4 + 1•b3&4 ] / [ 1 + 4 + 9 + 1 + 4 + 1 = 20 ]
jh 6/94
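The "weighted average of all pairwise slopes" identity is easy to verify numerically, with weights (xi - xj)²:

```python
from itertools import combinations

def ls_slope(xs, ys):
    """Usual least-squares slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def weighted_pairwise_slope(xs, ys):
    """Weighted average of the slopes (yi - yj)/(xi - xj) over all pairs,
    with weights proportional to (xi - xj)^2."""
    num = den = 0.0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        w = (x1 - x2) ** 2
        num += w * (y1 - y2) / (x1 - x2)
        den += w
    return num / den

xs, ys = [1, 2, 3, 4], [2, 4, 5, 8]   # illustrative data, x's 1 unit apart
```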
Inferences regarding Simple Linear Regression 50 than to take X =38, 39, 49, 41, 42. Any individual fluctuationswill 'throw off' the slope much less if the X's are far apart.
How reliable are
30 40 5030 40 50
BP BP
AGE AGE
(b)(a)(i) the (estimated) slope(ii) the (estimated) intercept(ii) the predicted mean Y at a given X(iv) the predicted y for a (future) individual with a given X
when they are based on data from a sample? i.e. how much would theseestimated quantities change if they were based on a different randomsample [with the same x values]?
We can use the concept of sampling variation to (i) describe the'uncertainty' in our estimates via CONFIDENCE INTERVALS or (ii)carry out TESTS of significance on the parameters (slope, intercept,predicted mean).
[Figure (a), (b): thick line: real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation when we study 1 individual at each of two ages (a) spaced closer together (b) spaced further apart.]
We can describe the degree of reliability of (or, conversely, the degree of uncertainty in) an estimated quantity by the standard deviation of the possible estimates produced by different random samples of the same size from the same x's. We call this (obviously conceptual) S.D. the standard error of the estimated quantity (just like the standard error of the mean when estimating µ). It is helpful to think of the slope as an average difference in means for 2 groups that are 1 x-unit apart.
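This "SD of the estimates over repeated samples" definition can be made concrete with a small simulation, a sketch in which the true line (BP = 90 + 1•age) and the individual scatter (σ = 10) are hypothetical:

```python
import random

# Regenerate y's at the SAME x's many times, refit the slope each time,
# and take the SD of the fitted slopes: that SD is the standard error.
random.seed(1)

xs = list(range(30, 51))                  # ages 30..50
xbar = sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)    # sum of (x - xbar)^2 = 770 here

def fit_slope(ys):
    ybar = sum(ys) / len(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

sigma = 10.0                              # hypothetical vertical scatter
slopes = []
for _ in range(2000):                     # 2000 conceptual "repeat studies"
    ys = [90 + 1.0 * x + random.gauss(0, sigma) for x in xs]
    slopes.append(fit_slope(ys))

mean_b = sum(slopes) / len(slopes)
sd_b = (sum((b - mean_b) ** 2 for b in slopes) / (len(slopes) - 1)) ** 0.5
print(sd_b, sigma / sxx ** 0.5)           # empirical SD ~ theoretical SE
```

The empirical SD of the 2000 slopes lands close to the theoretical σ/√(∑{x − x̄}²) ≈ 0.36 that appears later in these notes.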
Notes
The regression line refers to the relationship between the average Y at a given X and the X, not to individual Y's vs X. Of course, if the individual Y's are close to the average Y, so much the better!
The size of the standard error will depend on
1. How 'spread apart' the x's are.
This suggests studying individuals at the extremes of the X values of interest. We do this if we are sure that the relationship is a linear one. If we are not sure, it is wiser -- if we have a choice in the matter -- to take a 3-point distribution.
There is a common misapprehension that a Gaussian distribution of Xvalues is desirable for estimating a regression slope of Y on X. In fact,the 'inverted U' shape of the Gaussian is the least desirable!
2. How good a fit the regression line really is (i.e. how small isthe unexplained variation about the line)
3. How large the sample size, n, is.
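Factor 1 can be seen numerically: for three hypothetical designs of six ages on the 30-50 range, ∑{x − x̄}² (and hence the relative SE of the slope, which is proportional to 1/√∑{x − x̄}²) differs dramatically:

```python
# Sketch: how the spread of the x's drives SE(b) = sigma / sqrt(sum of
# squared deviations). All three "designs" below are illustrative.

def sum_sq_dev(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

designs = {
    "extremes only (fine if truly linear)":   [30, 30, 30, 50, 50, 50],
    "evenly spaced":                          [30, 34, 38, 42, 46, 50],
    "bunched near the middle ('inverted U')": [36, 38, 40, 40, 42, 44],
}
for name, xs in designs.items():
    s = sum_sq_dev(xs)
    print(f"{name:42s} sum of squares = {s:5.0f}, "
          f"relative SE(b) = {s ** -0.5:.3f}")
```

The extremes-only design gives the largest ∑{x − x̄}² (600 vs 280 vs 40 here), i.e. the smallest SE, while the Gaussian-like middle-heavy design is the worst, as the note above says.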
Factors affecting reliability (in more detail)
1. The spread of the X's: The best way to get a reliable estimate of the slope is to take Y readings at X's that are quite a distance from each other. E.g. in estimating the "per year increase in BP over the 30-50 yr. age range", it would be better to take X = 30, 35, 40, 45, 50 than to take X = 38, 39, 40, 41, 42. Any individual fluctuations will 'throw off' the slope much less if the X's are far apart.
Factors affecting reliability (continued)

2. The (vertical) variation about the regression line: Again, consider BP and age, and suppose that indeed the average BP of all persons aged X + 1 is β units higher than the average BP of all persons aged X, and that this linear relationship

    average BP of persons aged X = α + β • X
    (average of Y's at a given X = intercept + slope • X)

holds over the age span 30-50.

Obviously, everybody aged x = 32 won't have the exact same BP; some will be above the average of 32-yr-olds, some below. Likewise for the different ages x = 30, ..., 50. In other words, at any X there will be a distribution of y's about the average for age X. How wide this distribution is about α + β • X will have an effect on what slopes one could find in different samples (we measure the vertical spread around the line by σ).

NOTE: For unweighted regression, we should have roughly the same spread of Y's at each X.

[Figure (a), (b): thick line: real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation when we study 1 individual at each of two ages when the within-age distributions have (a) a narrow spread (b) a wider spread.]

3. Sample size (n): A larger n makes it more difficult for the kinds of extreme, misleading estimates caused by (1) poor X spread and (2) large variation in Y about µY|X to occur. It may be possible to spread the x's out so as to maximize their variance (and thus reduce the n required), but it may not be possible to change the magnitude of the variation about µY|X (unless there are other known factors influencing BP). Thus the need for a reasonably stable estimated y [i.e. estimate of µY|X].
Standard Errors

    SE(b) = SE(β^) = σ / √[ ∑{xi − x̄}² ]

    SE(a) = SE(α^) = σ • √[ 1/n + x̄² / ∑{xi − x̄}² ]

(Note: there is a negative correlation between a and b.)

We don't usually know σ, so we estimate it from the data, using the scatter of the y's from the fitted line (i.e. the SD of the residuals).

If we examine the structure of SE(b), we see that it reflects the 3 factors discussed above: (i) a large spread of the x's makes the contribution of each observation to ∑{xi − x̄}² large, and since this is in the denominator, it reduces the SE; (ii) a small vertical scatter is reflected in a small σ, and since this is in the numerator, it also reduces the SE of the estimated slope; (iii) a large sample size means that ∑{xi − x̄}² is larger, and like (i) this reduces the SE.

The formula, as written, tends to hide this last factor; note that ∑{xi − x̄}² is what we use to compute the spread of a set of x's -- we simply divide it by n − 1 to get a variance and then take the square root to get the SD. To make the point here, simplify n − 1 to n and write ∑{xi − x̄}² ≈ n • var(x), so that √[ ∑{xi − x̄}² ] ≈ √n • sd(x), and the equation for the SE simplifies to

    SE(b) = σ / [ √n • sd(x) ] = [ SDy|x / SDx ] / √n

with √n in its familiar place in the denominator of the SE (even in more complex SE's, this is where √n is usually found!).

The structure of SE(a): In addition to the factors mentioned above, all of which come in again in the expected way, there is the additional factor of x̄²; since a larger x̄² adds to the quantity under the square root, it increases the SE. This is natural in that if the data, and thus x̄, are far from x = 0, then any imprecision in the estimate of the slope will project backwards to a large imprecision in the estimated intercept. Also, if one uses 'centered' x's, so that x̄ = 0, the formula for the SE reduces to

    SE(a) = σ • √(1/n) = σ / √n

and we recognize this as SE(ȳ) -- not surprisingly, since ȳ is the 'intercept' for centered data.

CI's & tests of significance for α^ and β^ are based on the t-distribution (or Gaussian Z's if n is large):

    CI:  β^ ± t[n−2] • SE(β^)    and    α^ ± t[n−2] • SE(α^)

    Test of H0:  t[n−2] = (β^ − β0) / SE(β^), where β0 is the null-hypothesis value; similarly for α^.
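The estimate σ^ and the two SE formulas can be sketched on a small, entirely hypothetical data set (note the n − 2 degrees of freedom for σ^, because both a and b were estimated):

```python
# Sketch: fit a line, estimate sigma from the residuals, then apply
# SE(b) = sigma / sqrt(Sxx) and SE(a) = sigma * sqrt(1/n + xbar^2 / Sxx).
xs = [30, 35, 40, 45, 50]                  # illustrative ages
ys = [118.0, 125.0, 121.0, 136.0, 133.0]   # illustrative BPs

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

# sigma-hat from the residual scatter about the fitted line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sigma_hat = (sse / (n - 2)) ** 0.5

se_b = sigma_hat / sxx ** 0.5
se_a = sigma_hat * (1 / n + xbar ** 2 / sxx) ** 0.5
print(b, se_b)   # slope and its SE
print(a, se_a)   # intercept and its (much larger) SE, since xbar is far from 0
```

Note how much larger SE(a) is than SE(b) here: with x̄ = 40, the x̄²/∑{xi − x̄}² term dominates, exactly the "projecting backwards to x = 0" effect described above.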
Standard Error for Estimated µY|X or 'average Y at X'

We estimate 'average Y at X', or µY|X, by α^ + β^ • X. Since the estimate is based on two estimated quantities, each of which is subject to sampling variation, it contains the uncertainty of both:

    SE(estimated average Y at X) = σ • √[ 1/n + {X − x̄}² / ∑{xi − x̄}² ]

Again, we must use an estimate σ^ of σ.

First-time users of this formula suspect that it has a missing ∑ or an x instead of an xbar or something. There is no typographical error, and indeed if one examines it closely, it makes sense. X refers to the x-value at which one is estimating the mean -- it has nothing to do with the actual x's in the study which generated the estimated coefficients, except that the closer X is to the center of the data, the smaller the quantity {X − x̄}, and thus the quantity {X − x̄}², and thus the SE, will be. Indeed, if we estimate the average Y right at X = x̄, the estimate is simply ȳ (since the fitted line goes through [x̄, ȳ]) and its SE will be

    σ • √[ 1/n + {x̄ − x̄}² / ∑{xi − x̄}² ]   or   σ • √(1/n) = σ / √n = SE(ȳ).

cf. data on sleeping through the night; alcohol levels and eye speed.

Confidence Interval for individual Y at X

A certain percentage P% of individuals are within tP • σ of the mean µY|X = α + β • X, where tP is a multiple, depending on P, from the t or, if n is large, the Z table. However, we are not quite certain where exactly the mean α + β • X is -- the best we can do is estimate, with a certain P% confidence, that it is within tP • SE(α^ + β^ • X) of the point estimate α^ + β^ • X. The uncertainty concerning the mean and the natural variation of individuals around the mean -- wherever it is -- combine in the expression for the estimated P% range of individual variation, which is as follows:

    α^ + β^ • X ± t • σ • √[ 1 + 1/n + {X − x̄}² / ∑{xi − x̄}² ]

Both the CI for the estimated mean and the CI for individuals (i.e. the estimated percentiles of the distribution) are bow-shaped when drawn as a function of X. They are narrowest at X = x̄, and fan out from there. One needs to be careful not to confuse the much narrower CI for the mean with the much wider CI for individuals. If one can see the raw data, it is usually obvious which is which -- the CI for individuals is almost as wide as the raw data themselves.
PERIOPERATIVE NORMOTHERMIA TO REDUCE THE INCIDENCE OF SURGICAL-WOUND INFECTION AND SHORTEN HOSPITALIZATION p 1 of 11
ANDREA KURZ, M.D., DANIEL I. SESSLER, M.D., AND RAINER LENHARDT, M.D.,

WOUND infections are common and serious complications of anesthesia
and surgery. A wound infection can prolong hospitalization by 5 to 20
days and substantially increase medical costs.[1,2] In patients undergoing
colon surgery, the risk of such an infection ranges from 3 to 22 percent,
depending on such factors as the length of surgery and underlying medical
problems.[3] Mild perioperative hypothermia (approximately 2°C below
the normal core body temperature) is common in colon surgery.[4] It
results from anesthetic-induced impairment of thermoregulation,[5,6]
exposure to cold, and altered distribution of body heat.[7] Although it is
rarely desired, intraoperative hypothermia is usual because few patients are
actively warmed.[8]
FOR THE STUDY OF WOUND INFECTION AND TEMPERATURE GROUP¹
Abstract

Background. Mild perioperative hypothermia, which is common during major surgery, may promote surgical-wound infection by triggering thermoregulatory vasoconstriction, which decreases subcutaneous oxygen tension. Reduced levels of oxygen in tissue impair oxidative killing by neutrophils and decrease the strength of the healing wound by reducing the deposition of collagen. Hypothermia also directly impairs immune function. We tested the hypothesis that hypothermia both increases susceptibility to surgical-wound infection and lengthens hospitalization.
Methods. Two hundred patients undergoing colorectal surgery were randomly assigned to routine intraoperative thermal care (the hypothermia group) or additional warming (the normothermia group). The patients' anesthetic care was standardized, and they were all given cefamandole and metronidazole. In a double-blind protocol, their wounds were evaluated daily until discharge from the hospital and in the clinic after two weeks; wounds containing culture-positive pus were considered infected. The patients' surgeons remained unaware of the patients' group assignments.

Hypothermia may increase patients' susceptibility to perioperative wound
infections by causing vasoconstriction and impaired immunity. The
presence of sufficient intraoperative hypothermia triggers
thermoregulatory vasoconstriction,[9] and postoperative vasoconstriction is
universal in patients with hypothermia.[10] Vasoconstriction decreases the
partial pressure of oxygen in tissues, which lowers resistance to infection
in animals [11,12] and humans (unpublished data). There is decreased
microbial killing, partly because the production of oxygen and nitroso free
radicals is oxygen-dependent in the range of the partial pressures of
oxygen in wounds. [13,14] Mild core hypothermia can also directly impair
immune functions, such as the chemotaxis and phagocytosis of
granulocytes, the motility of macrophages, and the production of
antibody.[15,16] Mild hypothermia, by decreasing the availability of tissue
oxygen, impairs oxidative killing by neutrophils. And mild hypothermia
during anesthesia lowers resistance to inoculations with Escherichia coli
[17] and Staphylococcus aureus [18] in guinea pigs.
Results. The mean (±SD) final intraoperative core temperature was 34.7±0.6°C in the hypothermia group and 36.6±0.5°C in the normothermia group (P<0.001). Surgical-wound infections were found in 18 of 96 patients assigned to hypothermia (19 percent) but in only 6 of 104 patients assigned to normothermia (6 percent, P=0.009). The sutures were removed one day later in the patients assigned to hypothermia than in those assigned to normothermia (P=0.002), and the duration of hospitalization was prolonged by 2.6 days (approximately 20 percent) in the hypothermia group (P=0.01).
Conclusions. Hypothermia itself may delay healing and predispose patients to wound infections. Maintaining normothermia intraoperatively is likely to decrease the incidence of infectious complications in patients undergoing colorectal resection and to shorten their hospitalizations. (N Engl J Med 1996;334:1209-15. May 9)
1. From the Thermoregulation Research Laboratory and the Department of Anesthesia, University of California, San Francisco (A.K., D.I.S.); and the Departments of Anesthesiology and General Intensive Care, University of Vienna, Vienna, Austria (A.K., D.I.S., R.L.). Address reprint requests to Dr. Sessler at the Department of Anesthesia, 374 Parnassus Ave., 3rd Fl., University of California, San Francisco, CA 94143-0648. Supported in part by grants (GM49670 and GM27345) from the National Institutes of Health, by the Joseph Drown and Max Kade Foundations, and by Augustine Medical, Inc. The authors do not consult for, accept honorariums from, or own stock or stock options in any company whose products are related to the subject of this research. Presented in part at the International Symposium on the Pharmacology of Thermoregulation, Giessen, Germany, August 17-22, 1994, and at the Annual Meeting of the American Society of Anesthesiologists, Atlanta, October 21-25, 1995.
Vasoconstriction-induced tissue hypoxia may decrease the strength of the
healing wound independently of its ability to reduce resistance to infection.
The formation of scar requires the hydroxylation of abundant proline and
Iysine residues to form the cross-links between strands of collagen that
give healing wounds their tensile strength.[19] The hydroxylases that
catalyze this reaction are dependent on oxygen tension,[20] making
collagen deposition proportional to the partial pressure of arterial oxygen
in animals [21] and to oxygen tension in wound tissue in humans. [22]
The number of patients required for this trial was estimated on the
basis of a preliminary study in which 80 patients undergoing elective
colon surgery were randomly assigned to hypothermia (mean [±SD]
temperature, 34.4±0.4°C) or normothermia (involving warming with
forced air and fluid to a mean temperature of 37±0.3°C). The number of
wound infections (as defined by the presence of pus and a positive
culture) was evaluated by an observer unaware of the patients'
temperatures and group assignments. Nine infections occurred in the 38
patients assigned to hypothermia, but there were only four in the 42
patients assigned to normothermia (P=0.16). Using the observed
difference in the incidence of infection, we determined that an enrollment
of 400 patients would provide a 90 percent chance of identifying a
difference with an alpha value of 0.01.² We therefore planned to study a
maximum of 400 patients, with the results to be evaluated after 200 and
300 patients had been studied. The prospective criterion for ending the
study early was a difference in the incidence of surgical-wound infection
between the two groups with a P value of less than 0.01. To compensate
for the two initial analyses, a P value of 0.03 would be required when the
study of 400 patients was completed. The combined risk of a type I error
was thus less than 5 percent.[24]
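The planned enrollment can be reproduced, at least approximately, from the pilot figures. The authors' exact method is not stated; the sketch below uses the standard normal-approximation sample-size formula for comparing two proportions, with a two-sided alpha of 0.01 and 90 percent power:

```python
from statistics import NormalDist

# Pilot data: 9/38 infections with hypothermia vs 4/42 with normothermia.
p1, p2 = 9 / 38, 4 / 42
alpha, power = 0.01, 0.90

z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided alpha quantile
z_b = NormalDist().inv_cdf(power)           # power quantile
pbar = (p1 + p2) / 2                        # pooled proportion under H0

n_per_group = ((z_a * (2 * pbar * (1 - pbar)) ** 0.5
                + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
               / (p1 - p2) ** 2)
print(round(n_per_group))   # roughly 200 per group, i.e. about 400 in all
```

This lands close to the 400 patients the authors planned, which lends some plausibility to the reconstruction, though jh's footnote 2 questions basing the "important difference" on pilot data in the first place.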
Although safe and inexpensive methods of warming are available[8]
perioperative hypothermia remains common. [23] Accordingly, we tested
the hypothesis that mild core hypothermia increases both the incidence of
surgical-wound infection and the length of hospitalization in patients
undergoing colorectal surgery.
METHODS
With the approval of the institutional review board at each participating
institution and written informed consent from the patients, we studied
patients 18 to 80 years of age who underwent elective colorectal resection
for cancer or inflammatory bowel disease. Patients scheduled for
abdominal-peritoneal pull-through procedures were included, but not those
scheduled for minor colon surgery (e.g., polypectomy or colostomy
performed as the only procedure). The criteria for exclusion from the
study were any use of corticosteroids or other immunosuppressive drugs
(including cancer chemotherapy) during the four weeks before surgery; a
recent history of fever, infection, or both; serious malnutrition (serum
albumin, less than 3.3 g per deciliter, a white-cell count below 2500 cells
per milliliter, or the loss of more than 20 percent of body weight); or
bowel obstruction.
Study Protocol
The night before surgery, each patient underwent a standard mechanical
bowel preparation with an electrolyte solution. Intraluminal antibiotics
were not used, but treatment with cefamandole (2 g intravenously every
2. Italics by jh. It does not make a lot of sense to use the difference one observes in a pilot study [or, for that matter, what is in the literature] as a guide to what would be important [the clinically important difference].
eight hours) and metronidazole (500 mg intravenously every eight hours)
was started during the induction of anesthesia; this treatment was
maintained for about four days postoperatively. Anesthesia was induced
with thiopental sodium (3 to 5 mg per kilogram of body weight), fentanyl
(1 to 3 µg per kilogram), and vecuronium bromide (0.1 mg per kilogram).
The administration of isoflurane (in 60 percent nitrous oxide) was titrated
to maintain the mean arterial blood pressure within 20 percent of the
preinduction values. Additional fentanyl was administered on the
completion of surgery, to improve analgesia when the patient emerged
from anesthesia.
At the time of the induction of anesthesia, each patient was randomly assigned to one of the following two temperature-management groups with computer-generated codes maintained in numbered, sealed, opaque envelopes: the normothermia group, in which the patients' core temperatures were maintained near 36.5°C, and the hypothermia group, in which the core temperature was allowed to decrease to approximately 34.5°C. In both groups, intravenous fluids were administered through a fluid warmer, but the warmer was activated only in the patients assigned to extra warming. Similarly, a forced-air cover (Augustine Medical, Eden Prairie, Minn.) was positioned over the upper body of every patient, but it was set to deliver air at the ambient temperature in the hypothermia group and at 40°C in the normothermia group. Cardboard shields and sterile drapes were positioned in such a way that the surgeons could not discern the temperature of the gas inflating the cover. Shields were also positioned over the switches governing the fluid heater and the forced-air warmer so that their settings were not apparent to the operating-room personnel. The temperatures were not controlled postoperatively, and the patients were not informed of their group assignments.

The patients were hydrated aggressively during and after surgery, because hypovolemia decreases wound perfusion and increases the incidence of infection.[25,26] We administered 15 ml of crystalloid per kilogram per hour throughout surgery and replaced the volume of blood lost with either crystalloid in a 4:1 ratio or colloid in a 2:1 ratio. Fluids were administered intravenously at rates of 3.5 ml per kilogram per hour for the first 24 postoperative hours and 2 ml per kilogram per hour for the subsequent 24 hours. Leukocyte-depleted blood was administered as the attending surgeon considered appropriate.

Supplemental oxygen was administered through nasal prongs at a rate of 6 liters per minute during the first three postoperative hours and was then gradually eliminated while oxygen saturation was maintained at more than 95 percent. To minimize the decrease in wound perfusion due to activation of the sympathetic nervous system, postoperative pain was treated with piritramide (an opioid), the administration of which was controlled by the patient.

The attending surgeons, who were unaware of the patients' group assignments and core temperatures, determined when to begin feeding them again after surgery, remove their sutures, and discharge them from the hospital. The timing of discharge was based on routine surgical considerations, including the return of bowel function, the control of any infections, and adequate healing of the incision.
Measurements
The patients' morphometric characteristics and smoking history were
recorded. The preoperative laboratory evaluation included a complete
blood count; determinations of the prothrombin time and
partial-thromboplastin time; measurements of serum albumin, total protein,
and creatinine; and liver-function tests. The risk of infection was scored
with a standardized algorithm taken from the Study on the Efficacy of
Nosocomial lnfection Control (SENIC) of the Centers for Disease Control
and Prevention; in this scoring system, one point each is assigned for the
presence of three or more diagnoses, surgery lasting two hours or more,
surgery at an abdominal site, and the presence of a contaminated or
infected wound.[2] The scoring system was modified slightly from its
original form by the use of the diagnoses made at admission, rather than
discharge. The risk of infection was quantified further with the use of the
National Nosocomial Infection Surveillance System (NNISS), a scoring
system in which the patient's risk of infection was predicted on the basis of
the type of surgery, the patient's physical-status rating on a scale developed
by the American Society of Anesthesiologists, and the duration of surgery.[3]
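The modified SENIC point score described above amounts to a simple sum of indicator variables; a sketch follows (the function and field names are illustrative, not taken from the SENIC documentation):

```python
# One point each for: three or more diagnoses (at admission, per the
# authors' modification), surgery lasting two hours or more, an abdominal
# surgical site, and a contaminated or infected wound.

def senic_score(n_diagnoses, surgery_hours, abdominal_site, contaminated_wound):
    """Modified SENIC infection-risk score, range 0-4."""
    return (int(n_diagnoses >= 3)
            + int(surgery_hours >= 2)
            + int(abdominal_site)
            + int(contaminated_wound))

# e.g. a typical patient in this trial: abdominal site, 3.1-hour operation
score = senic_score(n_diagnoses=2, surgery_hours=3.1,
                    abdominal_site=True, contaminated_wound=False)
print(score)   # -> 2
```

This is consistent with Table 1, where nearly all patients in both groups score 2 or more.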
Thermal comfort was evaluated at 20-minute intervals for 6 hours
postoperatively with a 100-mm visual-analogue scale on which 0 mm
denoted intense cold, 50 mm denoted thermal comfort, and 100 mm
denoted intense warmth. The degree of surgical pain was evaluated
similarly, except that 0 mm denoted no pain and 100 mm the most intense
pain imaginable. Shivering was assessed qualitatively, on a scale on which
0 denoted no shivering; 1, mild or intermittent shivering; 2, moderate shivering; and 3, continuous, intense shivering. All the qualitative
assessments were made by observers unaware of the patients' group
assignments and core temperatures.
The patients' surgical wounds were evaluated daily during hospitalization
and again two weeks after surgery by a physician who was unaware of the
group assignments. Wounds were suspected of being infected when pus
could be expressed from the surgical incision or aspirated from a loculated
mass inside the wound. Samples of pus were obtained and cultured for
aerobic and anaerobic bacteria, and wounds were considered infected when
the culture was positive for pathogenic bacteria. All the wound infections
diagnosed within 15 days of surgery were included in the data analysis.
Core temperatures were measured at the tympanic membrane
(Mallinckrodt Anesthesiology Products, St. Louis), with values recorded
preoperatively, at 10-minute intervals intraoperatively, and at 20-minute
intervals for 6 hours during recovery. Arteriovenous-shunt flow was
quantified by subtracting the skin temperature of the fingertip from that of
the forearm, with values exceeding 0°C indicating thermoregulatory
vasoconstriction.[27] End-tidal concentrations of isoflurane and carbon
dioxide were recorded at 10-minute intervals during anesthesia.
Measurements of arterial blood pressure and heart rate were recorded
similarly during anesthesia and for six hours thereafter. Oxyhemoglobin
saturation was measured by pulse oximetry.
Wound healing and infections were also evaluated by the ASEPSIS
system,[28] in which a score is calculated as the weighted sum of points
assigned to the following factors: the duration of antibiotic administration,
the drainage of pus during local anesthesia, the debridement of the wound
during general anesthesia, the presence of a serous discharge, the presence
of erythema, the presence of a purulent exudate, the separation of deep
tissues, the isolation of bacteria from fluid discharged from the wound,
and a duration of hospitalization exceeding 14 days. Scores exceeding 20
on this scale indicate wound infection. As an additional indicator of
infection, preoperative differential white-cell counts were compared with
counts obtained on postoperative days 1, 3, and 6.
Collagen deposition in the wound was evaluated in a subgroup of 30 patients in the normothermia group and 24 patients in the hypothermia group. A 10-cm expanded polytetrafluoroethylene tube (Impra, International Polymer Engineering, Tempe, Ariz.) was inserted subcutaneously several centimeters lateral to the incision at the completion of surgery. On the seventh postoperative day, the tube was removed and assayed for hydroxyproline, a measure of collagen deposition.[29] The ingrowth of collagen in such tubes is proportional to the tensile strength of the healing wound[29] and the subcutaneous oxygen tension.[22]

Statistical Analysis

Outcomes were evaluated on an intention-to-treat basis. The number of postoperative wound infections in each study group and the proportion of smokers among the infected patients were analyzed by Fisher's exact test. Scores for wound healing, the number of days of hospitalization, the extent of collagen deposition, postoperative core temperatures, and potential confounding factors were evaluated by unpaired, two-tailed t-tests. Factors that potentially contributed to infection were included in a univariate analysis. Those that correlated significantly with infection were then included in a multivariate logistic regression with backward elimination; a P value of less than 0.25 was required for a factor to be retained in the analysis.

All the results are presented as means ±SD. A P value of less than 0.01 was required to indicate a significant difference in our major outcomes (the incidence of infection and the duration of hospitalization); a P value of less than 0.005 was considered to indicate a significant difference in postoperative temperature (to compensate for multiple comparisons); for all other data, a P value of less than 0.05 was considered to indicate a statistically significant difference.

RESULTS

Patients were enrolled in the study from July 1993 through March 1995; 155 were evaluated at the University of Vienna, 30 at the University of Graz, and 15 at Rudolfstiftung Hospital. According to the investigational protocol, the study was stopped after 200 patients were enrolled, because the incidence of surgical-wound infection in the two study groups differed with an alpha level of less than 0.01. One hundred four patients were assigned to the normothermia group, and 96 to the hypothermia group. An audit confirmed that the patients had been properly assigned to the groups and that the slight disparity in numbers was present in the original computer-generated randomization codes. All the patients allowed their wounds to be evaluated daily during hospitalization. Ninety-four percent returned for the two-week clinic visit after discharge; those who did not were evenly distributed between the study groups and mostly returned to visit the private offices of their attending surgeons. The wound status of these patients was determined by calling the physician. No previously unidentified wound infections were detected in the clinic for the first time.

Table 1 shows that the characteristics, diagnoses, types of surgical procedure, duration of surgery, hemodynamic values, and types of anesthesia of the patients in the two study groups were similar. Nor did smoking status or preoperative laboratory values differ significantly between the groups. The patients assigned to hypothermia required more transfusions of allogeneic blood
(P=0.01). Intraoperative vasoconstriction was observed in 74 percent of
the patients assigned to hypothermia but in only 6 percent of those
assigned to normothermia (P<0.001). Core temperatures at the end of
surgery were significantly lower in the hypothermia group than in the
normothermia group (34.7±0.6 vs. 36.6±0.5°C, P<0.001), and they
remained significantly different for more than five hours postoperatively
(Fig. 1).
Table 1. Characteristics of the Patients in the Two Study Groups.*

CHARACTERISTIC                            NORMOTHERMIA   HYPOTHERMIA   P VALUE
                                          (N = 104)      (N = 96)
Male sex (no. of patients)                58             50            0.70
Weight (kg)                               73±14          71±14         0.31
Height (cm)                               170±9          169±9         0.43
Age (yr)                                  61±15          59±14         0.33
History of smoking (no.)                  33             29            0.94
Diagnosis (no. of patients)
  Inflammatory bowel disease              10             8             0.94
  Cancer                                  94             88
Duke's stage                                                           1.0
  A                                       29             30
  B                                       37             34
  C                                       26             21
  D                                       2              3
Operative site                                                         0.61
  Colon                                   59             51
  Rectum                                  35             37
Preoperative variables
  Core temperature (°C)                   36.8±0.4       36.7±0.4      0.08
  Hemoglobin (g/dl)                       12.6±2.3       12.7±2.0      0.74
Intraoperative variables
  Fentanyl administered (mg)              0.7±0.3        0.6±0.5       0.09
  End-tidal isoflurane (%)                0.6±0.1        0.6±0.2       1.0
  Arterial blood pressure (mm Hg)         91±17          95±18         0.11
  Heart rate (beats/min)                  74±17          76±13         0.35
  Crystalloid (liters)                    3.3±1.5        3.2±0.9       0.57
  Colloid (liters)                        0.2±0.3        0.2±0.3       1.0
  Red-cell transfusion (no. of patients)  23             34            0.054
  Volume of blood transfused (units)      0.4±1.0        0.8±1.2       0.01
  Urine output (liters)                   0.6±0.4        0.7±0.4       0.08
  Duration of surgery (hr)                3.1±1.0        3.1±0.9       1.0
  Ambient temperature (°C)                21.9±1.2       22.1±0.9      0.19
  Oxyhemoglobin saturation (%)            97.3±1.5       97.5±1.3      0.32
  Final core temperature (°C)             36.6±0.5       34.7±0.6      <0.001

* Plus-minus values are means ± SD. Table continued on next page.

Postoperative vasoconstriction was observed in 78 percent of the patients in the hypothermia group; the vasoconstriction continued throughout the six-hour recovery period. In contrast, vasoconstriction, usually short-lived, was observed in only 22 percent of the patients in the normothermia group (P<0.001). Shivering was observed in 59 percent of the hypothermia group, but in only a few patients in the normothermia group. Thermal comfort was significantly greater in the normothermia group than in the hypothermia group (score on the visual-analogue scale one hour after surgery, 73±14 vs. 35±17 mm). The difference in thermal comfort remained statistically significant for three hours. Pain scores and the amount of opioid administered were virtually identical in the two groups at every postoperative measurement; hemodynamic values were also similar.
Table 1. Characteristics of the Patients in the Two Study Groups. (continued)*

                                          NORMOTHERMIA   HYPOTHERMIA   P VALUE
                                          (N = 104)      (N = 96)
Postoperative variables
  Hemoglobin (g/dl)                       11.7±1.9       11.6±1.4      0.67
  Prophylactic antibiotics (days)         3.7±1.9        3.6±1.4       0.67
SENIC score (no. of patients)                                          0.98
  1                                       3              3
  2                                       95             88
  3                                       6              5
NNISS score (no. of patients)                                          0.60
  0                                       32             31
  1                                       49             39
  3                                       23             26
Infection rate predicted by NNISS (%)     8.9            8.8           —
Oxyhemoglobin saturation (%)              98±1           98±1          1.0
Piritramide (mg)†                         20±13          22±12         0.26

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial Infection Control, and NNISS National Nosocomial Infection Surveillance System.

In a univariate analysis, tobacco use, group assignment, surgical site, NNISS score, SENIC score, need for transfusion, and age were all correlated with the risk of infection. In a multivariate backward-elimination analysis, tobacco use, group assignment, surgical site, NNISS score, and age remained risk factors for infection (Table 3).

Four patients in the normothermia group and seven in the hypothermia group required admission to the intensive care unit (P=0.47), mainly because of wound dehiscence, colon perforation, and peritonitis. Two patients in each group died during the month after surgery. The incidence of infection was similar at each study hospital, and no one surgeon was associated with a disproportionate number of infections.
Table 2 shows that significantly more collagen was deposited near the
wound in the patients in the normothermia group than in the patients in the
hypothermia group (328±135 vs. 254±114 µg per centimeter). The
patients assigned to hypothermia were first able to tolerate solid food one
day later than those assigned to normothermia (P=0.006); similarly, the
sutures were removed one day later in the patients assigned to hypothermia
(P = 0.002). The duration of hospitalization was 12.1±4.4 days in the
normothermia group and 14.7±6.5 days in the hypothermia group
(P=0.001). This difference was statistically significant even when the
analysis was limited to the uninfected patients. In the normothermia group,
the duration of hospitalization was 11.8±4.1 days in patients without
infection and 17.3±7.3 days in patients with infection (P=0.003). In the
hypothermia group the duration of hospitalization was 13.5±4.5 days in
patients without infection and 20.7±11.6 days in patients with infection
(P<0.001).
† The administration of this analgesic agent was controlled by the patient.
The overall incidence of surgical-wound infection was 12 percent.
Although the SENIC and NNISS scores for the risk of infection were
similar in the two groups, there were only 6 surgical-wound infections in
the normothermia group, as compared with 18 in the hypothermia group
(P = 0.009) (Table 2). Most positive cultures contained several different
organisms; the major ones were E. coli (11 cultures), S. aureus (7),
pseudomonas (4), enterobacter (3), and candida (3). Culture-negative pus
was expressed from the wounds of two patients assigned to hypothermia
and one patient assigned to normothermia. The ASEPSIS scores were
higher in the hypothermia group than in the normothermia group (13±16
vs. 7±10, P=0.002) (Table 2); these scores exceeded 20 in 32 percent of
the former but only 6 percent of the latter (P<0.001).
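The headline counts above lend themselves to a quick arithmetic check. The short Python sketch below (an editorial verification, not part of the authors' analysis) recomputes the per-group infection rates and their ratio from the reported counts:

```python
# Infection counts reported in the text and in Table 2.
norm_infected, norm_total = 6, 104    # normothermia group
hypo_infected, hypo_total = 18, 96    # hypothermia group

norm_rate = norm_infected / norm_total   # about 0.058, the "6%" of Table 2
hypo_rate = hypo_infected / hypo_total   # about 0.188, the "19%" of Table 2
risk_ratio = hypo_rate / norm_rate       # 3.25, i.e. roughly threefold risk

print(f"{norm_rate:.1%} vs {hypo_rate:.1%}; risk ratio {risk_ratio:.2f}")
```

The ratio of 3.25 is consistent with the later statement that the hypothermic patients had three times as many culture-positive wound infections.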
[Figure 1 appears here: core-temperature (°C) curves for the normothermia and hypothermia groups, plotted against time (hr) over the intraoperative (0-3 hr) and postoperative (0-6 hr and final) periods.]
Table 2. Postoperative Findings in the Two Study Groups.*

                                          Normothermia   Hypothermia   P Value
VARIABLE                                    (N = 104)      (N = 96)
All patients
  Infection - no. of patients (%)              6 (6)        18 (19)      0.009
  ASEPSIS score                                 7±10          13±16      0.002
  Collagen deposition - µg/cm                328±135        254±114      0.04
  Days to first solid food                   5.6±2.5        6.5±2.0      0.006
  Days to suture removal                     9.8±2.9       10.9±1.9      0.002
  Days of hospitalization                   12.1±4.4       14.7±6.5      0.001
Uninfected patients
  Number of patients                              98             78
  Days to first solid food                   5.2±1.6        6.1±1.6     <0.001
  Days to suture removal                     9.6±2.6       10.6±1.6      0.003
  Days of hospitalization                   11.8±4.1       13.5±4.5      0.01

* Plus-minus values are means ± SD.

Fig. 1. Core Temperatures during and after Colorectal Surgery in the Study Patients. The mean (±SD) final intraoperative core temperature was 34.7±0.6°C in the 96 patients assigned to hypothermia, who received routine thermal care, and 36.6±0.5°C in the 104 patients assigned to normothermia, who were given extra warming. The core temperatures in the two groups differed significantly at each measurement, except before the induction of anesthesia (first measurement) and after six hours of recovery.
Table 3. Multivariate Analysis of Risk Factors for Surgical-Wound Infection.

RISK FACTOR                                          ODDS RATIO (95% CI)
Tobacco use (yes vs. no)                             10.5 (3.2-34.1)
Group assignment (hypothermia vs. normothermia)       4.9 (1.7-14.5)
Surgical site (rectum vs. colon)                      2.7 (0.9-7.6)
NNISS score (per unit increase)*                      2.5 (1.2-5.3)
Age (per decade)                                      1.6 (1.0-2.4)

The postoperative hemoglobin concentrations did not differ significantly between the two groups (Table 1). On the first postoperative day, leukocytosis was impaired in the hypothermia group as compared with the normothermia group (white-cell count, 11,500±3500 vs. 13,400±2500 cells per cubic millimeter; P<0.001). On the third postoperative day, however, white-cell counts were significantly higher in the hypothermia group (10,100±3900 vs. 8900±2900 cells per cubic millimeter). The difference in values on the third day was not statistically significant when only uninfected patients were included in the analysis. By the sixth postoperative day, the white-cell counts were similar in the two groups.
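Table 3 reports multivariate (adjusted) odds ratios. As a point of comparison (a back-of-envelope editorial calculation, not from the paper), the crude odds ratio for group assignment can be recovered from the Table 2 counts; it differs from the adjusted 4.9 because the model also adjusts for tobacco use, surgical site, NNISS score, and age:

```python
# Crude (unadjusted) odds ratio for hypothermia vs. normothermia,
# from Table 2: 18 infected / 78 uninfected vs. 6 infected / 98 uninfected.
crude_or = (18 / 78) / (6 / 98)
print(round(crude_or, 2))   # about 3.77, versus the adjusted 4.9 in Table 3
```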
Table 4. Postoperative Findings in the Study Patients According to Smoking Status.*

                                           Smokers     Nonsmokers   P Value
CHARACTERISTIC                             (N = 62)     (N = 138)
Infection - no. of patients (%)            14 (23)        10 (7)      0.004
ASEPSIS score                                15±18          8±10     <0.001
Days to suture removal                    10.9±3.5      10.1±2.0      0.04
Days of hospitalization                   14.9±6.7      12.9±5.0      0.02
SENIC score (no. of patients)                                         0.25
  1                                              0             6
  2                                             58           125
  3                                              4             7
NNISS score (no. of patients)                                         0.08
  0                                             23            40
  1                                             30            58
  3                                              9            40

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial Infection Control, and NNISS National Nosocomial Infection Surveillance System.
Among smokers, the number of cigarettes smoked per day was similar in
the two groups (22±20 in the hypothermia group vs. 22±14 in the
normothermia group). The morphometric characteristics, anesthetic care,
and SENIC and NNISS scores of smokers and nonsmokers were not
significantly different. Nonetheless, the proportion of patients with wound
infection was significantly higher among smokers (23 percent, or 14 of
62) than among nonsmokers (7 percent, or 10 of 138; P=0.004).
Furthermore, the length of hospitalization was significantly greater among
smokers (14.9±6.7 days, vs. 12.9±5.0 days among nonsmokers; P=0.02)
(Table 4).
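The smoking comparison can be checked the same way (again an editorial sketch using only the counts in Table 4, not a calculation the authors show):

```python
smoker_rate = 14 / 62       # about 0.23, the "23 percent" in the text
nonsmoker_rate = 10 / 138   # about 0.07, the "7 percent" in the text
ratio = smoker_rate / nonsmoker_rate   # about 3.1
```

This roughly threefold ratio underlies the later statement that smokers had three times more surgical-wound infections than nonsmokers.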
DISCUSSION

The types of bacteria cultured from our patients' surgical wounds were similar to those reported previously.[2,3] These organisms are susceptible to oxidative killing, which is consistent with our hypothesis that hypothermia inhibits the oxidative killing of bacteria.[31] The overall incidence of infection in our study was approximately 35 percent higher than in previous reports.[3] One explanation for this relatively high incidence is that we considered all wounds draining pus that yielded a positive culture to be infected, although some may have been of minor clinical importance. The hospitalizations of infected patients were one week longer than those of patients without surgical-wound infections, however, indicating that most infections were substantial. Similar prolongation of hospitalization has been reported previously.[1,2]
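The "approximately 35 percent higher" figure is consistent with comparing the observed overall incidence against the NNISS-predicted rate in Table 1; note that this pairing is an editorial reading of the text, not a calculation the authors present:

```python
observed = 0.12     # overall surgical-wound infection incidence (12 percent)
predicted = 0.089   # infection rate predicted by NNISS, Table 1 (8.9 percent)
excess = observed / predicted - 1   # about 0.35, i.e. roughly 35 percent higher
```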
It is interesting to note that hospitalization was also prolonged (by about
two days) in the uninfected patients in the hypothermia group (Table 2). A
number of factors influenced the decision to discharge patients, but healing
of the incision (formation of a "healing ridge," for example) was among
the most important. As is consistent with a delay in clinical healing, sutures
were removed significantly later and the deposition of collagen (an index
of scar formation and the strength of the healing wound) was significantly
less in the hypothermia group than in the normothermia group. That the
patients assigned to hypothermia required significantly more time before
they could tolerate solid food is also consistent with impaired healing.
The initial hours after bacterial contamination are a decisive period for the
establishment of infection.[25] In surgical patients, perioperative factors
can contribute to surgical-wound infections, but the infection itself is
usually not manifest until days later.
In our study, forced-air warming combined with fluid warming maintained
normothermia in the treated patients, whereas the unwarmed patients had
core temperatures approximately 2°C below normal.[8] Perioperative
hypothermia persisted for more than four hours and thus included the
decisive period for establishing an infection.[25,30] The patients with mild
perioperative hypothermia had three times as many culture-positive
surgical-wound infections as the normothermic patients. Moreover, the
ASEPSIS scores showed that in the patients assigned to hypothermia the
reduction in resistance to infection was twice that in the normothermia
group.
In Austria's medical system, administrative factors and costs of
hospitalization do not influence the length of stay in the hospital. No data
on individual costs are tabulated by the participating hospitals, and they are
therefore not available for our patients. Nonetheless, the cost of a
prolonged hospitalization must exceed the cost of fluid and forced-air
warming (approximately $30 in the United States). In a managed-care
situation, the duration of hospitalization might have differed less, or not at
all. However, our data suggest that patients kept at normal temperatures
during surgery would be better prepared for discharge at a fixed time than
those allowed to become hypothermic.
Mild hypothermia can increase blood loss and the need for transfusion
during surgery.[38] In vitro studies suggest that perioperative hypothermia
may aggravate surgical bleeding by impairing the function of platelets and
the activity of clotting factors. [39,40] Blood transfusions may increase
susceptibility to surgical-wound infections by impairing immune function.
[41] Our patients assigned to hypothermia required significantly more
allogeneic blood to maintain postoperative hemoglobin concentrations than
did the patients assigned to normothermia. However, we administered only
leukocyte-depleted blood, and multivariate regression analysis indicated
that a requirement for transfusion did not independently contribute to the
incidence of wound infection. It is thus unlikely that the differences in the
incidence of infection in the two groups we studied resulted from
transfusion-mediated immunosuppression.
Among all 200 patients in our study, those who smoked had three times
more surgical-wound infections and significantly longer hospitalizations
than the nonsmokers. Similar data have been reported previously.[32]
Numerous factors contributed to these results; one may have been that
smoking markedly lowers oxygen tension in tissue for nearly an hour after
each cigarette.[33] (Thermoregulatory vasoconstriction produces a similar
reduction.[34]) The distribution of factors known to influence infection was
similar between smokers and nonsmokers, but the smokers may have had
other behavioral or physiologic factors predisposing them to infection.
The prevalence of smoking was similar in the two study groups. Other
factors may have influenced the patients' susceptibility to wound
infections, such as arterial hypoxemia, hypovolemia, the concentration of
the anesthetic used, and vasoconstriction resulting from pain-induced
stress.[25,26,35,36] However, the administration of oxygen,
oxyhemoglobin saturation, fluid balance, hemodynamic responses,
end-tidal concentrations of anesthetic, pain scores, and quantities of opioid
administered were all similar between the two groups. These factors are
therefore not likely to have confounded our results. It is also unlikely that
exaggerated bacterial growth aggravated the infections in the hypothermia
group, because small reductions in temperature actually decrease growth in
vitro.[37]
In summary, this double-blind, randomized study indicates that
intraoperative core temperatures approximately 2°C below normal triple the
incidence of wound infection and prolong hospitalization by about 20
percent. Maintaining intraoperative normothermia is thus likely to decrease
infectious complications and shorten hospitalization in patients undergoing
colorectal surgery.
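The summary's "about 20 percent" figure follows directly from the hospitalization durations in Table 2; the one-line check below is an editorial sketch, not part of the paper:

```python
# Mean hospitalization: 14.7 days (hypothermia) vs. 12.1 days (normothermia).
prolongation = 14.7 / 12.1 - 1   # about 0.21, i.e. roughly 20 percent longer
```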
We are indebted to Heinz Scheuenstahl for the collagen-deposition analysis; to Helene Ortmann, M.D., Andrea Hubacek, M.D., Michael Zimpfer, M.D., and Gerhard Pavecic for their generous assistance; and to Mallinckrodt Anesthesiology Products, Inc., for the donation of thermometers and thermocouples.

APPENDIX

The following investigators also participated in this study: patient safety and data auditing: H.W. Hopf and T.K. Hunt (University of California, San Francisco); site directors: G. Polak (Hospital Rudolfstiftung, Vienna, Austria) and W. Kroll (University of Graz, Graz, Austria); patient care: E. Lackner and R. Fuegger (University of Vienna); data acquisition: E. Narzt (University of Vienna), C. Wol~b (University of Vienna), E. Marker (University of Vienna), A. Bekar (Orthopedic Hospital, Speising, Vienna), H. Kaloud (University of Graz), U. Stratil (Hospital Rudolfstiftung), and R. Csepan (University of Vienna); wound evaluation: V. Goll (University of Vienna), G.S. Bayer (University of Vienna), and P. Steindorfer (University of Graz); and data management: B. Petschnigg (University of Vienna).
REFERENCES

1. Bremmelgaard A, et al. Computer-aided surveillance of surgical infections and identification of risk factors. J Hosp Infect 1989;13:1-18.
2. Haley RW, et al. Identifying patients at high risk of surgical wound infection: a simple multivariate index of patient susceptibility and wound contamination. Am J Epidemiol 1985;121:206-15.
3. Culver DH, et al. Surgical wound infection rates by wound class, operative procedure, and patient risk index. Am J Med 1991;91:152S-157S.
4. Frank SM, et al. Epidural versus general anesthesia, ambient operating room temperature, and patient age as predictors of inadvertent hypothermia. Anesthesiology 1992;77:252-7.
5. Matsukawa T, et al. Propofol linearly reduces the vasoconstriction and shivering thresholds. Anesthesiology 1995;82:1169-80.
6. Annadata RS, et al. Desflurane slightly increases the sweating threshold but produces marked, nonlinear decreases in the vasoconstriction and shivering thresholds. Anesthesiology 1995;83:1205-11.
7. Matsukawa T, et al. Heat flow and distribution during induction of general anesthesia. Anesthesiology 1995;82:662-73.
8. Kurz A, et al. Forced-air warming maintains intraoperative normothermia better than circulating-water mattresses. Anesth Analg 1993;77:89-95.
9. Ozaki M, et al. Nitrous oxide decreases the threshold for vasoconstriction less than sevoflurane or isoflurane. Anesth Analg 1995;80:1212-6.
10. Sessler DI, et al. Physiologic responses to mild perianesthetic hypothermia in humans. Anesthesiology 1991;75:594-610.
11. Chang N, et al. Comparison of the effect of bacterial inoculation in musculocutaneous and random-pattern flaps. Plast Reconstr Surg 1982;70:1-10.
12. Jonsson K, et al. Oxygen as an isolated variable influences resistance to infection. Ann Surg 1988;208:783-7.
13. Hohn DC, et al. The effect of O2 tension on microbicidal function of leucocytes in wound and in vitro. Surg Forum 1976;27:18-20.
14. Mader JT, et al. A mechanism for the amelioration by hyperbaric oxygen of experimental staphylococcal osteomyelitis in rabbits. J Infect Dis 1980;142:915-22.
15. van Oss CJ, et al. Effect of temperature on the chemotaxis, phagocytic engulfment, digestion and O2 consumption of human polymorphonuclear leucocytes. J Reticuloendothelial Soc 1980;27:561-5.
16. Leijh PC, et al. Kinetics of phagocytosis of Staphylococcus aureus and Escherichia coli by human granulocytes. Immunology 1979;37:453-65.
17. Sheffield CW, et al. Mild hypothermia during isoflurane anesthesia decreases resistance to E. coli dermal infection in guinea pigs. Acta Anaesthesiol Scand 1994;38:201-5.
18. Sheffield CW, et al. Mild hypothermia during halothane-induced anesthesia decreases resistance to Staphylococcus aureus dermal infection in guinea pigs. Wound Repair Regeneration 1994;2:48-56.
19. Prockop DJ, et al. The biosynthesis of collagen and its disorders. N Engl J Med 1979;301:13-23.
20. De Jong L, et al. Stoicheiometry and kinetics of the prolyl 4-hydroxylase partial reaction. Biochim Biophys Acta 1984;787:105-11.
21. Hunt TK, et al. The effect of varying ambient oxygen tensions on wound metabolism and collagen synthesis. Surg Gynecol Obstet 1972;135:561-7.
22. Jonsson K, et al. Tissue oxygenation, anemia and perfusion in relation to wound healing in surgical patients. Ann Surg 1991;214:605-13.
23. Frank SM, et al. The catecholamine, cortisol, and hemodynamic responses to mild perioperative hypothermia: a randomized clinical trial. Anesthesiology 1995;82:83-93.
24. Fleming TR, et al. Designs for group sequential tests. Control Clin Trials 1984;5:348-61.
25. Miles AA, et al. The value and duration of defense reactions of the skin to the primary lodgement of bacteria. Br J Exp Pathol 1957;38:79-96.
26. Jonsson K, et al. Assessment of perfusion in postoperative patients using tissue oxygen measurements. Br J Surg 1987;74:263-7.
27. Rubinstein EH, et al. Skin-surface temperature gradients correlate with fingertip blood flow in humans. Anesthesiology 1990;73:541-5.
28. Byrne DJ, et al. Postoperative wound scoring. Biomed Pharmacother 1989;43:669-73.
29. Rabkin JM, et al. Wound healing assessment by radioisotope incubation of tissue samples in PTFE tubes. Surg Forum 1986;37:592-4.
30. Classen DC, et al. The timing of prophylactic administration of antibiotics and the risk of surgical wound infection. N Engl J Med 1992;326:281-6.
31. Babior BM, et al. Chronic granulomatous diseases. Semin Hematol 1990;27:247-59.
32. Stopinski J, et al. L'abus de nicotine et d'alcool ont-ils une influence sur l'apparition des infections bactériennes postopératoires? [Do nicotine and alcohol abuse influence the occurrence of postoperative bacterial infections?] J Chir 1993;130:422-5.
33. Jensen JA, et al. Cigarette smoking decreases tissue oxygen. Arch Surg 1991;126:1131-4.
34. Sheffield CW, et al. Effect of thermoregulatory responses on subcutaneous oxygen tension. Wound Repair Regeneration (in press).
35. Knighton DR, et al. Oxygen as an antibiotic: the effect of inspired oxygen on infection. Arch Surg 1984;119:199-204.
36. Moudgil GC, et al. Influence of anaesthesia and surgery on neutrophil chemotaxis. Can Anaesth Soc J 1981;28:232-8.
37. Mackowiak PA. Direct effects of hyperthermia on pathogenic microorganisms: teleologic implications with regard to fever. Rev Infect Dis 1981;3:508-20.
38. Schmied H, et al. Mild hypothermia increases blood loss and transfusion requirements during total hip arthroplasty. Lancet 1996;347:289-92.
39. Michelson AD, et al. Reversible inhibition of human platelet activation by hypothermia in vivo and in vitro. Thromb Haemost 1994;71:633-40.
40. Rohrer MJ, et al. Effect of hypothermia on the coagulation cascade. Crit Care Med 1992;20:1402-5.
41. Jensen LS, et al. Postoperative infection and natural killer cell function following blood transfusion in patients undergoing elective colorectal surgery. Br J Surg 1992;79:513-6.
M&M Ch 3.1 & 3.2 Designing Experiments ... page 1
On Experimental Design
I constructed four miniature houses of worship
a Mohammedan mosque, a Hindu temple, a Jewish synagogue, a Christian cathedral
and placed them in a row.
I then marked 15 ants («fourmis») with red paint and turned them loose. They made several trips to and fro, glancing in at the places of worship, but not entering. I then turned loose 15 more painted blue; they acted just as the red ones had done. I now gilded 15 and turned them loose. No change in the result; the 45 travelled back and forth in a hurry persistently and continuously visiting each fane, but never entering.

This satisfied me that these ants were without religious prejudices--just what I wished; for under no other conditions would my next and greater experiment be valuable.

I now placed a small square of white paper within the door of each fane;

upon the mosque paper I put a pinch of putty, upon the temple paper a dab of tar, upon the synagogue paper a trifle of turpentine, upon the cathedral paper a small cube of sugar.

First I liberated the red ants. They examined and rejected the putty, the tar and the turpentine, and then took to the sugar with zeal and apparent sincere conviction.

I next liberated the blue ants, and they did exactly as the red ones had done.

The gilded ants followed. The preceding results were precisely repeated.

This seemed to prove that ants destitute of religious prejudice will always prefer Christianity to any other creed.

However, to make sure, I removed the ants and put putty in the cathedral and sugar in the mosque. I now liberated the ants in a body, and they rushed tumultuously to the cathedral.

I was very much touched and gratified, and went back in the room to write down the event.

But when I came back the ants had all apostatized and had gone over to the Mohammedan communion.

I saw that I had been too hasty in my conclusions, and naturally felt rebuked and humbled. With diminished confidence I went on with the test to the finish. I placed the sugar first in one house of worship then in another, till I had tried them all.

With this result: whatever Church I put the sugar in, that was the one the ants straightway joined.

This was true beyond a shadow of doubt: in religious matters the ant is the opposite of man, for man cares for but one thing, to find the only true Church; whereas the ant hunts for the one with the sugar in it.

from Mark Twain, "On Experimental Design," in Scott W.K. and L.L. Cummings, Readings in Organizational Behavior and Human Performance, Irwin: Homewood, Ill., p. 2 (1973).