CHAPTER 3: INFERENTIAL STATISTICS: Estimation and Testing

MOTIVATION

Perhaps the most appealing aspect of Statistics is that it provides methods for speculating on the nature of a larger world than the one you can actually see and measure. If your worry is that your particular research has resulted in “one-off” findings, likely to be overturned by the next person conducting similar research, then the methodology of inferential statistics may deliver quantitative reassurance. Alternatively, your anxiety may be warranted and the embarrassment of premature publication of your results may be averted.

Goodies like these do not come free. Speculations on the wider population depend for their credibility on the rigour applied to the research design, and on fulfilling certain assumptions concerning the nature of the wider world.

We will concentrate on the practical application of statistical methods for parameter estimation and hypothesis testing. The concepts we have already met – calculation of summary statistics such as means and proportions, and using models of the distributions of these statistics – provide the foundations for making inferences. However, it must be borne in mind that there are processes both before and after the “statistical mechanics” that ensure successful and valid research. Good study design, including valid methods of data collection, attention to ethical issues, and conducting research according to agreed, written protocols, are pre-requisites. The interpretation of the results of statistical analysis must be honest, critical and subject to searching peer-review.


§ 3.1 INTRODUCTION

It has taken quite a while to get to this point because without some knowledge of basic descriptive statistics, probability and distributions it is not possible to appreciate what is going on in inferential statistics. We will consider first the concept of statistical estimation, and then the process of testing hypotheses.

§ 3.2 ESTIMATION

§ 3.2.1 Basic Terms

We often want to know the value of a characteristic in a population (that is, a parameter) but we only have information on a sample. If we assume that the sample was selected at random from the population, then we may use sample statistics to estimate the population parameter. The population upon which we are speculating is termed the target population. Somewhat ponderously, we call a sample statistic an estimator of a population parameter if the statistic is used to estimate that parameter. The particular numeric value of the estimator is termed the estimate.

Example 3.1

A geneticist is interested in knowing the proportion of Drosophila melanogaster which hatch with wing deformities. (Drosophila is the fruit-fly; geneticists use it as a simple model to study genes.) A sample is selected at random from a specified laboratory-bred strain – the target population – and scrutinised on hatching for the deformity. The sample proportion deformed is found to be 0.001. The geneticist now uses the sample proportion, p, as the estimator for the proportion deformed in the entire strain, π. The estimate (numeric value) is 0.001.

The sample proportion is a commonly used estimator of the population proportion. Similarly, the mean calculated from a random sample, x̄, is used as an estimator of the population mean, µ. The population standard deviation, σ, is estimated by the sample standard deviation, s. In an epidemiological study, the incidence rate of new cases of an opportunistic fungal infection in an HIV-positive population might be estimated from the incidence rate in suitably chosen random samples from hospital outpatients and from the records of doctors specialising in the care of HIV/AIDS patients.
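As a concrete sketch of these basic estimators, the following short program computes a sample mean, a sample standard deviation (note the n – 1 denominator, discussed in §3.2.3) and a sample proportion. All the data values are hypothetical, chosen only for illustration.

```python
# Point estimates from random samples: a minimal sketch (hypothetical data).
# The sample mean estimates mu, the sample proportion estimates pi, and the
# sample standard deviation (n - 1 denominator) estimates sigma.
import math

sample = [11.2, 10.8, 12.1, 9.9, 11.5, 10.4]   # hypothetical measurements
n = len(sample)

mean = sum(sample) / n                          # estimator of mu
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
sd = math.sqrt(variance)                        # estimator of sigma

deformed = 1                                    # hypothetical counts, in the
hatched = 1000                                  # spirit of Example 3.1
p = deformed / hatched                          # estimator of pi

print(round(mean, 2), round(sd, 2), p)          # prints the three estimates
```

Each printed number is an estimate; the formulas `sum(sample)/n` and so on are the estimators.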

2004 Philip Ryan page 3-2


§ 3.2.2 Estimators used to Compare Groups

While we have so far considered estimators calculated from a single sample, in clinical and experimental studies interest is often centred on the comparison of two or more groups.

Example 3.2

A physiologist wishes to compare the effects of two concentrations of muscle relaxant on isolated smooth muscle. She hopes that this will have later practical applications in the treatment of certain disorders of intestinal motility. She chooses a piece of muscle tissue and divides it into 20 smaller pieces, each 8 mm long. The higher concentration of drug is randomly assigned to 10 pieces of muscle and the lower concentration to the other 10 pieces. The variable she will measure is the length of the muscle five minutes after the solutions of drug are applied to the muscle. She will calculate a sample mean length for each of the two groups of muscle. These two sample means are her estimates of the respective population means. More than that, she will use the difference in sample means as the estimator of the difference in population means. If the difference in sample means is around zero, she might speculate that, in the population, the two concentrations of drugs do not differ in their effect. In fact, the mean length for the high concentration was 13 mm, and for the low concentration, 9 mm. Therefore the difference in means on this occasion was 4 mm; the scientist speculated that higher drug concentrations might in general have a greater relaxing effect.

In Example 3.2, the comparison of groups was measured on an additive scale, in that the arithmetic difference between group means was the criterion of interest. There are other comparison scales used in particular circumstances. Sometimes a ratio scale is used to measure a difference between groups.
For example, in prevalence studies and in fixed cohort studies, the risk ratio, often also termed the relative risk (RR), of disease between two groups in the population is estimated by the relative risk of two samples. One group is known to have been exposed to a presumed causative agent, the other has not. The relative risk is the ratio of the risk of disease in the exposed group to the risk in the non-exposed group. If the risk of disease in the exposed group is greater than that in the non-exposed group, then the sample relative risk will exceed 1 – this may provide evidence that the population risk is also higher for those exposed.

Note, however, that comparisons of risk between two groups may also be expressed on an additive scale: the risk difference is the risk of disease among those exposed minus the risk among those not exposed. In dynamic cohort studies (see Example 3.3 below) one uses incidence rates and incidence rate ratios rather than risks and relative risks, but the principles are the same. Chapter 4 has some more details on epidemiological study designs. In comparison studies, all these estimates are sometimes given the generic name of measures of effect.
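The arithmetic behind these measures of effect is simple enough to set out in a few lines. The sketch below uses entirely hypothetical counts: risks come from a fixed cohort (cases out of people at risk), rates from a dynamic cohort (cases per person-year of observation).

```python
# Measures of effect on two scales (all counts hypothetical).
# Fixed cohort: risks are cases / people at risk.
exposed_cases, exposed_n = 30, 1000
unexposed_cases, unexposed_n = 10, 1000

risk_exp = exposed_cases / exposed_n        # 0.03
risk_unexp = unexposed_cases / unexposed_n  # 0.01

rr = risk_exp / risk_unexp                  # relative risk (ratio scale)
rd = risk_exp - risk_unexp                  # risk difference (additive scale)

# Dynamic cohort: rates are cases / person-years of observation.
exp_cases, exp_py = 12, 4000
unexp_cases, unexp_py = 5, 5000

rate_exp = exp_cases / exp_py               # 0.003 cases per person-year
rate_unexp = unexp_cases / unexp_py         # 0.001 cases per person-year

irr = rate_exp / rate_unexp                 # incidence rate ratio
rate_diff = rate_exp - rate_unexp           # rate difference

print(round(rr, 2), round(rd, 3), round(irr, 2), round(rate_diff, 4))
```

The same exposed group thus yields both a ratio-scale and an additive-scale comparison, the two complementary perspectives discussed in the text.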


Example 3.3

In a prospective study (looking forward in time), a sample of smokers and another sample of non-smokers are followed up over 20 years. In each group the number of persons developing oesophageal (gullet) cancer is recorded. In the exposed group (the smokers) 6 cases of cancer develop; in the non-exposed group (the non-smokers) there were 4 cases. Because subjects spent different periods under observation over the 20 years (some entered late, others were “lost to follow-up”, some re-entered after a period lost), this is a dynamic cohort study. The denominator in each group’s rate of disease was expressed in terms of person-years (p-y). One person-year is the equivalent of one person being observed without the cancer for one year. Two persons being observed over six months is also one person-year. Among the smokers there was a total of 1500 person-years observed. Among the non-smokers there was a total of 3800 person-years observed.

The incidence rate in smokers is: 6/1500 = 0.004 cases/p-y. The incidence rate in non-smokers is: 4/3800 ≈ 0.001 cases/p-y. And so the incidence rate ratio (IRR) is: 0.004/0.001 = 4 (no units). This sample IRR is our estimate of the population IRR of gullet cancer for smokers versus non-smokers. There is a four-fold increase in the incidence rate of oesophageal cancer in smokers. A doctor or health counsellor might quote the incidence rate ratio when trying to get a quit-smoking message across to an individual patient (“Mr Bloggs, because you smoke, you have 4 times the risk of cancer ....”). An epidemiologist might use the IRR to help decide causality.

It is just as valid to calculate a risk or rate difference on an additive scale. The difference is 0.003 cases/p-y, or equivalently, 3 cases/1000 p-y. For every 1000 p-y of exposure to smoking we may expect an extra 3 cases of oesophageal cancer above and beyond the “background” (non-smoking) incidence. The extra 3 cases are “attributed” to smoking, and so this measure is sometimes termed the attributable risk. This is the way a differential risk or rate might be used in planning for health services, so that adequate resources may be allocated to the problem or the cost/benefits in prevention programmes might be estimated. The rate ratio and the rate difference give two complementary perspectives on the issue of smoking and oesophageal cancer.

§ 3.2.3 Some Criteria for Estimators

Although it seems obvious to choose, for example, the sample mean as the estimator of the population mean, it would be nice to think that a choice for an estimator is a good one. In a non-rigorous sense, we can say that an estimator is “good” if:

• The estimator is unbiased, that is, if we calculate the estimate repeatedly on many randomly drawn samples, the average value should equal the true value of the population parameter.


• The estimator is consistent, that is, as the sample size increases, the estimate should converge towards the value of the population parameter.

• The estimator should have minimal variance, that is, its variability on repeated sampling should be as small as possible – this will deliver greater precision. Sometimes we need to compromise between precision and unbiasedness.

For example, the intuitively appropriate estimator of the population mean, the sample mean, can be shown to be a consistent, unbiased and minimal variance estimator, as can the sample proportion for the population proportion. The difference in means also meets these criteria. As was mentioned in Chapter 1 (see §1.4.2.1), we use (n – 1) as the denominator in calculating the sample standard deviation, since this makes the sample variance, s², an unbiased estimator of σ².

§ 3.2.4 Precision of Estimators & Confidence Intervals

You will recall from Chapter 1 (§1.5.1) that it is essential to accompany any estimate with a statement of its precision. Now we may see the power and usefulness of sampling distributions (see §2.5). Since our estimators are sample statistics, they will have sampling distributions, and sampling distributions have their own measures of variability called standard errors. For example, the sample difference in means found by the physiologist in Example 3.2 is just one of the possible differences in means which might be found if the experiment were repeated many times. These possible differences in means are numeric values (hers was 4 mm), and the relative frequency of these values will form a probability distribution. By the central limit theorem, many of the sampling distributions of commonly used estimators will be Normal – and the nice properties of the Normal curve can be exploited.

Without derivation, here are the standard errors of several commonly used estimators (see also E2.14 and E2.17):

PARAMETER                             ESTIMATOR      STANDARD ERROR
mean, µ                               x̄              σ/√n
difference in means, µ1 – µ2          x̄1 – x̄2        √(σ1²/n1 + σ2²/n2)
proportion, π                         p              √(π(1 – π)/n)
difference in proportions, π1 – π2    p1 – p2        √(π1(1 – π1)/n1 + π2(1 – π2)/n2)
                                                                                  E3.1
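The standard errors of E3.1 translate directly into code. The following sketch writes each one as a function (the function names are my own, not the text's); the illustrative call uses σ values of 0.8 and 0.6 with n = 10 per group, the figures quoted later in the continuation of Example 3.2.

```python
# Standard errors from E3.1, assuming the population parameters are known.
import math

def se_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def se_diff_means(sigma1, n1, sigma2, n2):
    """Standard error of the difference in sample means."""
    return math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

def se_proportion(pi, n):
    """Standard error of the sample proportion."""
    return math.sqrt(pi * (1 - pi) / n)

def se_diff_proportions(pi1, n1, pi2, n2):
    """Standard error of the difference in sample proportions."""
    return math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)

# sigma_H = 0.8 mm, sigma_L = 0.6 mm, n = 10 per group (Example 3.2)
print(round(se_diff_means(0.8, 10, 0.6, 10), 2))
```

In practice, as the Notes below point out, the unknown σ and π would themselves be replaced by their sample estimates before these formulae are applied.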


Notes:

The standard errors are themselves calculated using values of population parameters, for example, σ and π. If, as is common, the values of these parameters are unknown, they must also be estimated from the sample data. So, for example, we might use s to estimate σ.

The distribution of the sample proportion, p, is approximately Normal. The approximation is reasonable when 0.25 < π < 0.75.

One way to express the precision of an estimate is to accompany the estimate with the value of its standard error. A large standard error will lead to a cautious interpretation of an estimate. Another way is to calculate and quote a confidence interval for the population parameter.

A confidence interval is a set of real numbers that are consistent with the value of the true, but unknown, population parameter we are trying to estimate. A confidence interval is defined by the lowest value in the set (the lower confidence limit) and the highest value (the upper confidence limit). By “consistent” I mean that, given the sample data, any proposed value of the population parameter below the calculated lower confidence limit is too low to be entertained as a reasonable value for the population parameter. Similarly, any value higher than the upper confidence limit is too high.

To illustrate the formation of a confidence interval, let us imagine an experiment which aims to estimate the difference in mean responses of subjects to two experimental conditions. In this case (though unfortunately not in the general case), we know that the sampling distribution of the estimator is Normal, and this leads to an especially easy formulation of the confidence limits. Let us further imagine that we repeatedly perform the experiment a very large number of times; each experiment results in an estimate of the parameter of interest, the difference in means in the underlying populations. Each time we do the experiment we calculate the quantity:

[estimate – (1.96 × standard error)]

as the lower limit of an interval, and:

[estimate + (1.96 × standard error)]

as the upper limit. It can be shown that, in the long run, we may expect 95% of such intervals to enclose the true but unknown value of the population parameter. 5% will not.

Of course, in the real world we may only do the experiment once, and so we have only one such “95% confidence interval”. It would be wrong to say – although it is commonly said – that our single calculated confidence interval has a “95% chance” of including the population parameter. Once formed, it either will or it won’t (and we will never know). But if a claim is made about the value of the population difference in means and this claimed value is not included in our single 95% confidence interval, we might consider this to be (probabilistically) an unlikely state of affairs and doubt the validity of the claim. Similarly, if the claimed value of the parameter is captured within the interval, we would judge that the claim is in accord with the evidence of our sample data, and not reject the claim.

The width of the confidence interval, dependent as it is on the standard error, also gives us an idea of the precision of the estimate. Calculating and reporting an index of precision are absolutely necessary because biological populations are variable and can generate a variety of sample results on different occasions.
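The long-run interpretation of a 95% confidence interval can be checked by simulation. This sketch (with a hypothetical population mean and standard deviation of my choosing) draws many samples, builds an interval from each, and counts how often the interval covers the true mean; the proportion should settle near 0.95.

```python
# Simulating the long-run coverage of a 95% confidence interval.
# mu and sigma are hypothetical "true" population values, assumed known,
# so each interval is mean +/- 1.96 * (sigma / sqrt(n)).
import math
import random

random.seed(1)
mu, sigma, n, reps = 50.0, 10.0, 25, 2000
se = sigma / math.sqrt(n)

covered = 0
for _ in range(reps):
    sample_mean = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    lower, upper = sample_mean - 1.96 * se, sample_mean + 1.96 * se
    if lower <= mu <= upper:
        covered += 1

print(covered / reps)   # close to 0.95
```

Note that each individual interval either covers µ or it does not; the 95% figure describes the procedure, not any single interval.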


Example 3.2 continued

The physiologist knew from previous large-scale research that the value of the population standard deviation for the effect of high concentrations of the relaxant drug on muscle was σH = 0.8 mm, and the population standard deviation for the low concentration was σL = 0.6 mm. Therefore, by E3.1, the standard error of the estimate of the difference in means is:

√(0.8²/10 + 0.6²/10) = 0.32 mm

She reported her findings in a journal: “The difference in mean muscle lengths using the two drug concentrations was 4 mm, with a standard error of the difference of 0.32 mm. The limits of the 95% confidence interval for the difference in means are:

4 – (1.96 × 0.32) to 4 + (1.96 × 0.32)
that is, 3.37 mm to 4.63 mm (to 2 decimal places)

95% of such intervals calculated on repeated experiments would include the true difference in means. My interval does not include zero, so I reject the notion that there is no difference in population means: I have some assurance that the difference seen in my experiment reflects a real difference in the populations.”

Example 3.3 continued

Unfortunately the sampling distribution of the incidence rate ratio is not Normal and the standard error is too difficult for us to discuss in detail here. The 95% confidence interval around the population IRR for oesophageal cancer was calculated (formula not shown) as 2.9 to 8.4. Since this interval does not include 1 (the population value expected if there were no multiplication of the rate for smokers over non-smokers), we can surmise that the increased rate found in this study – the estimate was 4 – is likely to be real and not just due to chance sampling variation.

Notes:

In comparative studies (see Examples 3.2 and 3.3) it is instructive to see that differences on an additive scale yield confidence intervals that are symmetrically disposed around the point estimate (eg difference in means) and that interest centres on whether a zero difference is included within the interval. Differences on a ratio scale (multiples) are asymmetrically disposed around the point estimate (eg relative risk) and interest centres on whether a ratio of 1 is included within the interval.
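The worked interval of Example 3.2 takes only a few lines to reproduce. The sketch below uses the standard error rounded to 0.32 mm, as in the displayed formula; note that carrying the unrounded value √0.1 ≈ 0.3162 through instead would shift each limit by about 0.01 mm.

```python
# Reproducing the 95% CI of Example 3.2 from the rounded standard error.
diff_means = 4.0    # mm, observed difference in mean muscle lengths
se = 0.32           # mm, standard error from E3.1, rounded to 2 dp

lower = diff_means - 1.96 * se
upper = diff_means + 1.96 * se
print(round(lower, 2), round(upper, 2))
```

Since zero lies well outside this interval, the physiologist's conclusion that the interval "does not include zero" follows immediately.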

Using confidence intervals to give your audience an appreciation of the precision of your estimate of a parameter is known as interval estimation, as opposed to point estimation, that is, simply giving the value of the estimator without any information as to its precision. Most quality journals demand that authors provide both forms.


Example 3.2 continued

There is another way to grapple with the meaning of a confidence interval (CI), a concept many find a bit tricky when first met. The point estimate of the difference in muscle lengths is 4.0 mm and the confidence interval is (to 1 decimal place) 3.4 mm to 4.6 mm.

Consider the lower limit, 3.4 mm, to be a possible value of the population parameter. If 3.4 mm were truly the value of the population parameter, then it can be shown that such a population would generate a sample statistic as high or higher than 4.0 mm on at most 2.5% of occasions. Similarly, the upper limit of the calculated 95% CI represents that value of the true population difference in means that would yield a sample difference as low or lower than 4.0 mm on at most 2.5% of occasions.

Figure 3.1 shows that if the lower limit were any lower, and this was the true value of the population difference in means, there would be less than 2.5% chance of seeing the sample difference that we did. (Shift the entire leftmost distribution further to the left – what happens to the shaded area above 4 mm?) If the upper limit were any higher than 4.6 mm, there would also be less than 2.5% chance of seeing what we did.

In a probabilistic sense, our sample difference in means could have been generated from a population wherein the true difference was between the values specified by the lower and upper confidence limits. Outside these limits, the chance of seeing the sample difference in means that we obtained gets rather too low to be entertained. The lower and upper limits, calculated as they are, provide us with the coverage (95%) of the parameter we desired.

If we wish to have wider coverage, we could calculate, say, a 99% confidence interval; a 90% confidence interval gives us a narrower coverage. The 95% CI is much the most commonly used.
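This tail-area reading of the limits can be verified numerically. The sketch below reconstructs the (unrounded) lower limit from Example 3.2 and evaluates the chance of a sample difference at or above 4.0 mm if that limit were the true population value, using the standard Normal tail computed via the error function.

```python
# Tail-area interpretation of a confidence limit (Example 3.2 figures).
import math

def upper_tail(z):
    """P(Z >= z) for a standard Normal variate, via the error function."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

se = 0.32                           # mm, standard error of the difference
observed = 4.0                      # mm, observed difference in means
lower_limit = observed - 1.96 * se  # unrounded lower 95% confidence limit

# If the true difference were exactly lower_limit, how often would a
# sample difference of 4.0 mm or more arise?
z = (observed - lower_limit) / se   # equals 1.96 by construction
print(round(upper_tail(z), 3))      # 0.025, i.e. 2.5% of occasions
```

The symmetric argument applies to the upper limit, with the lower tail below 4.0 mm playing the same role.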

[Fig 3.1 Construction of a confidence interval for the difference in means. Two Normal sampling distributions of the difference in sample means are drawn over a horizontal axis of difference in mean length (2 to 6 mm): one centred at the lower limit, 3.4 mm, and one at the upper limit, 4.6 mm. Each curve has a tail area of 0.025 shaded beyond the observed difference of 4 mm, and the 95% CI from 3.4 mm to 4.6 mm is marked.]


§ 3.3 HYPOTHESIS TESTING

§ 3.3.1 Introduction

We are going to see how inferential statistics provides us with a method to choose between possible interpretations of experimental results; that is, we are going to see how to set up a statistical hypothesis test and use it to come to a decision. Hypothesis testing involves:

• Setting up two opposing hypotheses concerning parameter(s) in the population(s);
• Sampling the population(s) and extracting evidence from the resulting sample data;
• Making a judgement as to which hypothesis is best supported by the evidence; and
• Recognising and presenting the errors inherent in coming to a decision.

The crucial point to appreciate (see §3.3.5) is that testing the opposing hypotheses implies the generation of probability distributions – sampling distributions – so that our judgement of the evidence will be based on probabilities.

There are scores of different types of tests one can use to test an hypothesis, depending on the nature of the hypothesis, the method of experimentation and the nature of the raw data. It is easy to become proficient in mechanically applying test formulae to data, especially with today’s computers and programmable calculators. It is much more important that the rationale behind it all is understood; much research work, and hence clinical and policy practice, is based on concepts you will learn in this chapter.

§ 3.3.2 Definitions

Statistical inference is the process whereby statements about the characteristics of an entire population are made on the basis of data obtained from a sample of that population. A statistical hypothesis is an assumption or statement, which may or may not be true, concerning one or more populations. Researchers often forget that hypotheses relate to population, not sample, characteristics. (I suppose this happens because the data with which they work are sample data.)

Here are some examples of hypotheses (Example 3.4) which lend themselves to being tested using one or other of the methods we will have at our disposal. You will note that each hypothesis is testable. It is not uncommon for inexperienced researchers to pose questions which are so inexplicit or ill-defined as to be not testable in any scientific or statistical sense. (Even the hypotheses of Example 3.4 could be tightened further. For example, in [A], do we mean random or fasting blood sugar measurements? In [C], how would we measure “poverty”?)


Example 3.4

[A] The mean blood sugar level in (a population of) diabetics is 11 mmol/litre.

[B] The difference in population mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is zero.

[C] There is no association between education level and poverty (in the population).

[D] There is a linear relationship between age and the death rate from chronic leukaemia (in the population).

[E] There is an increased risk of cancer of the bladder in workers in the rubber industry compared to workers in the metal industry.

In future, for brevity’s sake, I’ll usually omit the reference to populations – as long as you remember it’s really there.

§ 3.3.3 Samples and Populations

Although it may seem pretty obvious what a sample is and what a population is, the fact is that in the world of statistical inference we must be very careful in applying these terms. The validity of the results of our hypothesis tests depends on the relation between the sample, which supplied the raw data, and the population about which we wish to make a claim. The validity of the tests we will discuss depends on the raw data being drawn as random samples. That is, each member of the population has a specified (usually, equal) chance of being chosen for inclusion in the sample. Work on the validity of random sampling is synonymous with the name of Sir Ronald Fisher, an eminent British statistician (who worked at Adelaide University for some time in the 1950s – the Fisher building is named for him). Many, though by no means all, tests further assume that the variable of interest has a Normal or near-Normal distribution in the population.

The population about which a researcher wishes to make an inference is called a target population. The researcher usually assumes that the population from which the sample is drawn is identical to the target population. Failure to ensure this can result in bias.

Example 3.5

We might take a random sample (say, 1 in 8 by lottery) of all elderly patients in teaching hospitals and find that the prevalence of hip fractures in the sample is 2%. Since we took a random sample, the inference now is that the prevalence of hip fractures in the sampled population is also 2%. But it might well not be safe to infer that the general non-hospitalised elderly population (the original target population) also has a 2% prevalence of hip fracture, because the elderly in hospital may not be a true reflection of the elderly in the general community. The sample suffers from bias.
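The mismatch between sampled and target populations in Example 3.5 is easy to demonstrate by simulation. All the numbers below are hypothetical: a community prevalence of 0.5% and a hospital prevalence of 2% are my own illustrative choices, not figures from the text.

```python
# Sampling bias sketch (hypothetical prevalences): a random sample drawn only
# from hospitalised elderly overestimates hip-fracture prevalence in the
# target (community) population, even though the sampling itself is random.
import random

random.seed(7)

# Hypothetical population of 100 000 elderly people, 5% of them in hospital.
community = [1 if random.random() < 0.005 else 0 for _ in range(95000)]
hospital = [1 if random.random() < 0.02 else 0 for _ in range(5000)]
target = community + hospital

# A perfectly random sample, but from the wrong (hospital-only) population.
sample = random.sample(hospital, 1000)

print(round(sum(sample) / len(sample), 3))   # near the hospital rate, 0.02
print(round(sum(target) / len(target), 3))   # the target rate is far lower
```

The random sampling validly estimates the sampled (hospital) population; the bias arises entirely from the gap between the sampled population and the target population.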


The moral is: always check that the chains of inference connecting the sample to the population whence the sample was drawn, and then to the target population, hold true. Sometimes the connection is pretty tenuous, so be suspicious.

But, I hear you say, it is quite common in health research not to draw random samples, but to use “samples of convenience”, for example, cooperative patients from a hospital clinic, or volunteers such as medical students (not permitted in this university!) or drug company employees. Such samples may limit the external generalisability of any research findings to the wider population, but in certain circumstances the results themselves may be valid. Consider a drug trial comparing the effects of a new cytotoxic drug and a standard drug on the survival time in patients with acute myelocytic leukaemia (AML). Patients enrolled into the study are unlikely to be a random sample from the entire population of AML sufferers. More likely they are volunteers from oncology clinics at teaching hospitals. But if the patients are randomly allocated to one or other of the two treatments, then valid probability distributions are generated and the results of the statistical testing procedure will be valid. That is, we are justified in making statements concerning the probability that any observed difference between the two treatments is due simply to chance or, alternatively, to a real difference in efficacy. To summarise: valid results from statistical hypothesis testing will be obtained if either random sampling or randomisation is used in the design of the research.

Those who have studied epidemiology will have come across observational study designs, such as case-control studies and cohort studies (see also Chapter 4). In such studies, neither randomisation nor complete random sampling is evident. Yet statistical tests of research hypotheses are routinely carried out in the context of these study designs. Usually, appeals are made to the notion that (under the Null hypothesis) some ill-defined random process is responsible for, say, the development of disease in an individual in a cohort study, or the inclusion of a case in a case-control study.

§ 3.3.4 Hypotheses

The opposing hypotheses submitted for testing should be mutually exclusive. It is conventional to call one hypothesis the Null hypothesis, denoted Ho. The Null hypothesis, rather arbitrarily, represents the reference, baseline or status quo situation. The opposing hypothesis, stating that something interesting is occurring (which the researcher would like to find, and so become rich and famous), is called the Research or Alternative hypothesis, denoted H1, or sometimes HA.

Example 3.4 continued

Hypotheses [B] and [C] are in the form of Null hypotheses, whereas [D] and [E] are in the form of Alternative hypotheses. The Null hypothesis for [D] would be:

“There is no linear relationship between age and death-rate from chronic leukaemia (in the population)”.

2004 Philip Ryan page 3-11


Example 3.4 continued

Hypothesis [A] might be a Null or an Alternative form. It would depend on the nature of the research question being posed. If the researcher considers that current prevailing opinion is that the mean blood sugar in diabetics is 11 mmol/L, and he would like to show something new, then [A] would be in the form of his Null hypothesis. If we do take it as a Null hypothesis, here are the mutually exclusive hypotheses: Ho: The mean blood sugar in diabetics is 11 mmol/L versus H1: The mean blood sugar in diabetics is not 11 mmol/L. Again, the mutually exclusive hypotheses for [C] are: Ho: There is no association between education level and poverty versus H1: There is an association between education level and poverty. Note that alternative hypotheses do not often specify the alternative numeric value of the population parameter, just that it is other than that specified by the null hypothesis. [A] and [B] are examples of this. Finally, an Alternative hypothesis can be either directional (one-sided), or non-directional (two-sided) depending on the research question. Example 3.4 continued

Here is the pair of hypotheses [B] with H1 in the form of a non-directional hypothesis:

Ho: The difference in population mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is zero.
versus
H1: The difference in population mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is not zero.

Note that this two-sided formulation of H1 allows simultaneous testing of two possibilities: (i) that mothers of NTD-affected foetuses might have higher zinc levels than mothers of non-affected foetuses, and (ii) that mothers of NTD-affected foetuses might have lower zinc levels than mothers of non-affected foetuses. A one-sided, or directional, formulation of H1 is:


H1: The difference in mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is greater than zero.

Example 3.4 continued

This allows a test only of whether mothers of NTD foetuses have higher zinc levels than mothers of normal foetuses. A separate, different one-sided formulation of H1 is:

H1: The difference in mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is less than zero.

This allows a test only of whether mothers of NTD foetuses have lower zinc levels than mothers of normal foetuses. As a general rule, it is better to use two-sided formulations of the Alternative hypothesis. A one-sided H1 restricts attention to results due to sampling variation on only one side of the expected (“status quo”, Null) result. One has to work harder, that is, collect more information, to maintain a two-sided Alternative hypothesis, since one has to spot variations in either direction, but it is usually worth the trouble. Use a one-sided hypothesis only if a wealth of previous experimentation, or physical or biological principles, dictates that sampling variation could only be on one side of an expected Null result. A critical audience will need to be convinced of the validity of a one-sided hypothesis.

Example 3.6

In the case of the following hypotheses:

Ho: The mean Forced Expiratory Volume in 1 second (FEV1) in untreated symptomatic asthmatic children is the same as the mean FEV1 in normal children
versus
H1: The mean FEV1 in untreated asthmatic children is less than the mean FEV1 in normal children,

the directional or one-sided nature of H1 is justified on physiological grounds, and searching for the opposite result (untreated symptomatic asthmatics having greater FEV1 than normals) would be a fruitless endeavour.

§ 3.3.5 The Test

Construction of a statistical test is really quite easy to understand if the concept of the sampling distribution is grasped. First I’ll list the steps, the algorithm, then ground the theory in an explicit example.


Algorithm 3.1

Step 1: Assume for the time being that the Null hypothesis is true. (And don’t forget that you have made this assumption!)

Step 2: Under the assumption of Step 1, construct the sampling distribution of the sample statistic (mean, proportion, difference in means, relative risk etc, whichever is the appropriate one for your particular test).

Step 3: Fit the value of the sample statistic calculated on your set of data to the distribution of sample statistic constructed in Step 2.

Step 4: Using tables or a computer calculate how likely it is that your sample statistic (or one more extreme than yours) occurs in the distribution specified by the assumed true Ho. In other words, calculate the probability associated with your calculated statistic.

Step 5: Decide on the basis of this probability whether it is reasonable to maintain the original assumption (see Step 1) that the Null hypothesis is true. A relatively low probability, say, 0.05 or less, might make the assumption untenable and lead to a decision to reject the Null hypothesis.
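The five steps above can be sketched as a small function for the common case where the sampling distribution under Ho is Normal with a known standard error (a z test). This is an illustration only; the function name and arguments are mine, not part of the text.

```python
from math import erf, sqrt

def phi(z):
    # Standard Normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_test(sample_stat, null_value, std_error, alpha=0.05):
    """Algorithm 3.1 for a Normal sampling distribution.

    Steps 1-2: assume Ho true; under Ho the statistic is Normal with
               mean `null_value` and standard deviation `std_error`.
    Steps 3-4: standardise the observed statistic and find the two-tailed
               probability of a result at least as extreme.
    Step 5:    compare that probability with the chosen alpha level.
    """
    z = (sample_stat - null_value) / std_error
    p_two_tailed = 2.0 * (1.0 - phi(abs(z)))
    return z, p_two_tailed, p_two_tailed < alpha  # True means reject Ho
```

For the zinc data of Example 3.7 below, `z_test(0.048, 0, 0.028)` gives z ≈ 1.714 and a two-tailed P of about 0.086, so at the 5% level Ho would not be rejected.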

Example 3.7

Consider hypotheses [B] concerning the levels of plasma zinc in mothers of NTD-affected and normal foetuses:

Ho: The difference in mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is zero
versus
H1: The difference in mean maternal plasma zinc levels in mothers of foetuses with a neural tube defect and mothers of normal foetuses is not zero.

[We assume that we have selected two samples of mothers at random from the respective populations of those carrying NTD foetuses and those with normal foetuses. We also assume that the distribution of plasma zinc levels in both populations is approximately Normal.] Let us denote the respective populations with subscripts A for affected foetuses and U for unaffected foetuses. The researcher sampled nA = 68 mothers of affected foetuses and nU = 138 mothers of unaffected foetuses. The respective sample mean maternal plasma zinc levels (at 20 weeks gestation) were x̄A = 0.790 µg/100ml and x̄U = 0.742 µg/100ml. (Of course the symbol “µ” here stands for “micro-”, not the population mean.) Now follow Algorithm 3.1:

Step 1: The assumption The research question is whether or not the mean plasma zinc is the same in each population. Assume, for the moment, the truth of the Null hypothesis, expressed as:


Ho: µA = µU, or equivalently, Ho: µA – µU = 0 (Of course the symbol “µ” here stands for the population mean, not “micro”.) Example 3.7

Step 2: The distribution
In a theoretical long-run repetition of the experiment we would find that the distribution of differences in means calculated from samples drawn from the two underlying Normal distributions with equal means would itself be Normal and have a mean of zero. If we had the value of the standard error of the differences in means we could draw the sampling distribution. The researcher found from previous large-scale survey work that the population standard deviations for each group were:

σA = 0.209 µg/100ml and σU = 0.138 µg/100ml

So, by E3.1 the standard error of the sampling distribution will be given as:

√[(0.209²/68) + (0.138²/138)] = 0.028 µg/100ml

We now have all the information needed to draw the sampling distribution of differences in means that we would expect if the null hypothesis were true, that is, if the difference in means in the populations were really zero.
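The E3.1 arithmetic above is easy to check numerically; the figures are those quoted in the example.

```python
from math import sqrt

# Standard error of a difference in means (E3.1), population SDs known
sigma_A, n_A = 0.209, 68    # mothers of NTD-affected foetuses
sigma_U, n_U = 0.138, 138   # mothers of unaffected foetuses

se = sqrt(sigma_A**2 / n_A + sigma_U**2 / n_U)
print(round(se, 3))  # 0.028
```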

[Figure: the Normal sampling distribution of the difference in mean plasma zinc under the Null hypothesis, centred at 0 with std error = 0.028; the observed difference of 0.048 is marked, and the tail areas beyond ±0.048 are shaded.]

Fig 3.2 Sampling distribution of differences in means under Null Hypothesis

Step 3: The sample statistic
Our calculated sample difference in means (affected minus unaffected) is 0.048 µg/100ml, and this is marked on the sampling distribution. A difference of -0.048 is also of interest to us because our alternative hypothesis is non-directional, and we must accept the possibility that zinc levels are higher in mothers of unaffected foetuses.


Example 3.7

Step 4: The probability
We now need to see the probability of getting a sample difference in means as extreme as 0.048 µg/100ml (that is, greater than +0.048 or less than –0.048 µg/100ml) when the assumed underlying difference in population means is zero. This corresponds to the shaded area in Fig 3.2. Unfortunately, we don’t have tables for a Normal distribution with mean 0 and standard deviation 0.028, so we use the Standard Normal transformation to find the z-values of the Standard Normal distribution corresponding to ±0.048 of the original distribution:

z = (0.048 – 0)/0.028 = 1.714 (see E2.15)
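Steps 3 and 4 amount to a standardisation followed by a table look-up; with `math.erf` the table can be dispensed with. A sketch using the rounded standard error of 0.028:

```python
from math import erf, sqrt

diff, se = 0.048, 0.028          # observed difference and its standard error
z = (diff - 0) / se              # standardise against the Null value of 0

# Two-tailed probability: area beyond +z plus area below -z
phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # Standard Normal CDF
p = 2 * (1 - phi(abs(z)))

print(round(z, 3))   # 1.714
print(round(p, 3))   # 0.086
```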

The Standard Normal Curve corresponding to Fig 3.2 is:

[Figure: the Standard Normal probability density (std dev = 1), with z = ±1.714 marked and a tail probability of P = 0.043 shaded beyond each.]

Fig 3.3 Standard Normal Distribution

The two-tailed probability associated with z-values as extreme as ±1.714 is given by tables as 0.086, that is, 0.043 in each tail. So in, say, 1000 random samples from a population where the true difference was zero, 86 samples would be expected to have a result as extreme as ours, that is, ±0.048 µg/100ml.

Step 5: The decision
We are faced with the following situation: we have an observed, sample difference in means of 0.048 µg/100ml, but we know that such a difference would arise only 86 times in 1000 when the population generating the samples has no difference in means. So, does our observed difference cast suspicion on the Null hypothesis of a zero difference in the population? This is where the statistics stops and the judgement of the researcher (or that of the reader of the researcher’s paper) must take over. Some would say that 0.086 is so low a probability that the Null assumption of no difference in population mean zinc levels is untenable; such a view would lead to a rejection of the Null hypothesis. Others may require an even lower probability before they are willing to abandon the Null hypothesis. For them, the decision will be to accept the Null, at least for the time being.


For rather obvious reasons, the statistical test constructed in Example 3.7 is termed a two-sample z test. Note that the level of probability below which we decide to reject Ho is, to an extent, arbitrary; most researchers set a level (termed α, see §3.3.7 below) of 0.01, 0.05 or 0.10, depending on the research context. The 0.05 or 5% level is the commonest choice. [We speak of the “5% significance level”.] You can see from tables of the Standard Normal Distribution that if we choose a cut-off probability of 0.05 for a two-tailed test, then the test statistic (the z-value) corresponding to a rejection of the Null hypothesis will need to exceed ±1.96. If we choose to adopt the 5% level as our criterion in Example 3.7, then, since our test statistic of z = ±1.714 is associated with a probability greater than 0.05 (0.086 in fact), we cannot reject the Null hypothesis. The set of extreme values of the test statistic that lead to rejection of the Null hypothesis, and hence acceptance of the Alternative hypothesis, is termed the rejection region of the test. Our test statistic of z = ±1.714 is not in the rejection region for a 5% test because it does not exceed ±1.96. You can see that, once a criterion significance level is set, the decision to accept or reject the Null hypothesis can be based on either the probability associated with the test statistic or the position of the test statistic with respect to the rejection region. The methods are simply two sides of the same coin.

§ 3.3.6 The Connection with Confidence Intervals

Using the methods of §3.2.4 and Example 3.2, we could construct a 95% confidence interval for the population difference in means:

lower limit = 0.048 – (1.96 × 0.028) = –0.007 µg/100ml
upper limit = 0.048 + (1.96 × 0.028) = +0.103 µg/100ml
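The arithmetic of these limits can be reproduced in a few lines; 1.96 is the two-sided 5% critical value of the Standard Normal distribution.

```python
diff, se, z_crit = 0.048, 0.028, 1.96   # estimate, standard error, critical z

lower = diff - z_crit * se   # -0.00688
upper = diff + z_crit * se   #  0.10288

# Rounded to three decimals this is the interval [-0.007, 0.103]
print(round(lower, 3), round(upper, 3))
```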

So, the 95% confidence interval is [–0.007, 0.103]. Note that this 95% confidence interval does include zero, the value specified by the Null hypothesis. Therefore, we cannot reject the Null hypothesis at the 5% level.

§ 3.3.7 It isn’t over yet - Errors

Two unavoidable errors arise in hypothesis testing:

Type 1 Error: If a test leads to rejection of the Null hypothesis and hence acceptance of the Alternative hypothesis, then such a decision might be wrong. There is still a chance, albeit a small one, that a population defined by Ho could have generated a sample statistic as extreme as the one observed. Making the decision to reject Ho when Ho is in fact true is a Type 1 error. We define the probability of making a Type 1 error as:

α = Pr[Rejecting Ho|Ho is true], or equivalently:


α = Pr[Accepting H1|H1 is false] This alpha is a conditional probability. Alpha is the risk you run of claiming a new research finding and being wrong. Note the distinction: the Type 1 error is an event, α is the probability of that event. In experimental study designs, α is typically pre-set by the researcher before the hypothesis is tested. Often a level of 0.05 (the “5% level”) is chosen. There is nothing magical about the 5% level, but it has the force of history and common usage behind it. When an α level is set, the researcher is proclaiming that she is willing to be wrong in rejecting the Null hypothesis but only if the risk in doing so does not exceed α. If the risk is greater than α, then the researcher will “play it safe” and continue to accept the Null hypothesis, at least until more compelling evidence is available. As we have seen in Example 3.7 above, the researcher will compare the probability associated with her test statistic – the so-called P value – with the pre-set α level: if the P value is less than α the researcher feels confident in rejecting the Null hypothesis; if the P value exceeds α she will fail to reject the Null.

Type 2 Error:

Why not make the Type 1 error level (α) smaller than 0.05? We could set α, the probability below which we will reject Ho, at 0.00001 or even less. That way, we’ll almost never make the mistake of rejecting the Null hypothesis when the Null is, in fact, correct. This is quite possible to do, but a price must be paid for such conservatism.

If we minimise the chance of making a Type 1 error, we will run the inevitable risk of accepting the Null hypothesis even when it is false. This is a Type 2 error, and the probability of committing this error is termed beta:

β = Pr[Rejecting H1|H1 is true], or equivalently:
β = Pr[Accepting Ho|Ho is false]

Beta is the risk you run of claiming that the status quo is correct and being wrong; you failed to uncover the research finding, even though it was there. Your chance of becoming rich and famous just went down the tubes. Beta can be visualised as that part of the area under the distribution of the Alternative hypothesis that is not in the rejection region of the test (see Fig 3.4 below). We usually have to wear β of about 0.1 to 0.2. That is, 10% to 20% of our statistical tests will fail to detect a difference between means or proportions or in relative risk or whatever, when that difference really does exist in the population(s) from which we sampled our data. A specific value of β has to be calculated for each of the infinite number of possible alternative distributions permissible under the Alternative hypothesis, or at least for that subset that interests us. In Example 3.7, we might need to calculate a β for a difference in the population means of -0.03 µg/100ml, -0.09 µg/100ml, 0.11 µg/100ml, 0.085 µg/100ml etc, depending on the particular alternative distribution(s) we were considering. Fig 3.4 shows the sampling distributions generated by the Null hypothesis (population difference equals zero) and by the particular Alternative hypothesis where the difference in mean zinc levels is 0.09 µg/100ml. The test is two-tailed at the α = 0.05 level. The rejection region, marked as “RR”, begins at ±0.055 µg/100ml, since this value of the difference in means is equivalent to a z-score of ±1.96 (you could satisfy yourself of this by using the standard Normal transformation).

[Figure: the Null sampling distribution of the difference in sample means (centred at 0) and a particular Alternative distribution (centred at 0.09). The rejection region (RR) lies beyond ±0.055, with α/2 = 0.025 in each tail of the Null distribution; β is the area of the Alternative distribution lying outside the rejection region, and Power = 1 – β is the remainder.]

Fig 3.4 Null and Alternative Sampling Distributions
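The β and Power shown in Fig 3.4 can be reproduced numerically. This is a sketch under the figure’s assumptions: Null mean 0, Alternative mean 0.09, standard error 0.028, two-tailed α = 0.05.

```python
from math import erf, sqrt

phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # Standard Normal CDF

se, alt_mean = 0.028, 0.09
boundary = 1.96 * se          # rejection region starts at +/- 0.055

# Beta: probability, under the Alternative distribution, of falling
# between the boundaries, i.e. outside the rejection region
beta = phi((boundary - alt_mean) / se) - phi((-boundary - alt_mean) / se)
power = 1 - beta

print(round(boundary, 3))  # 0.055
print(round(power, 2))     # 0.9
```

Moving the boundary to the right (smaller α) visibly enlarges β, which is the trade-off described in the text.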

It can be seen from Fig 3.4 that if you wish to lessen the probability of making a Type 1 error, you must necessarily increase your chance of making a Type 2 error, and vice versa. To see this, just move (↔) the boundary of the rejection region to the right (decreasing α and increasing β) or to the left (increasing α and decreasing β). Is there a way of decreasing both errors simultaneously? Yes! Select samples of greater size. How will this help? Note that the larger the sample size, the less the variability of the resulting sampling distributions (the standard error has the sample size n in the denominator – see E3.1). The distributions will be “narrower” and so overlap to a lesser extent, so for a given alpha, beta will be less than in the case of distributions with larger variance. From an Information Theory viewpoint, it seems only fair that if you have gone to the trouble of shedding more light on the situation by collecting a bigger sample, there should be a pay-off in a reduction of errors, enabling a less error-prone decision to be reached.

§ 3.3.8 Power


The higher the probability of a Type 2 error, the less the power of the test to find a difference if it really exists. This is just an example of the Law of Complements applied to the conditional probability β (see Chapter 2).

Power = Pr[Rejecting Ho|Ho is false], or equivalently
      = Pr[Accepting H1|H1 is true], or equivalently
      = 1 – β

If you specify the particular alternative distribution to which you feel your sample statistic belongs (that is, you specify the mean and standard deviation), then you can calculate beta and/or the power of your test. Note from Fig 3.4 that, together, power and beta exhaust the entire area under the Alternative probability distribution – just a graphical illustration of complements. We won’t pursue the calculation of power further here, except to note that there is not much point in going to the time and trouble of setting up a study if the sample size doesn’t afford you the power to detect whatever it is you want to find. Researchers typically try to obtain power values of greater than 0.8. Planning research almost always calls for prior consideration of sample size and consequent statistical power. Power depends on the degree of variability in the population being sampled, the difference you wish to detect, and the number of subjects sampled. Generally, the researcher can best address the issue of adequate power by varying the sample size. Too small a sample size and a difference, even if it exists, may not be found. Too large a sample size and either money is wasted recruiting more subjects than is necessary to find the difference sought, or a difference too small to be of any practical use may be found. This leads us to reconsider the notion of “significance”.

§ 3.3.9 Significant Differences

A phrase commonly used in reporting the results of statistical tests of comparison is:

‘... a significant difference was found between the two groups’.

This means that the probability associated with the test statistic was small enough to allow rejection of the Null hypothesis. It should be stressed that this “significant difference” is used in a statistical sense; whether the difference between the groups under test has any clinical or practical significance cannot usually be resolved by statistical methods alone. Experienced researchers design their studies so that any statistically significant difference found does correspond to something of importance in the real world. It is instructive to realise that, given sufficient sample size, virtually any difference, no matter how tiny and inconsequential, can be found and judged “statistically significant” at any level of α you care to nominate. To do this is not the point of good science or good statistical practice. It also gets very expensive!


§ 3.3.10 P values

The term P value does not introduce a new concept; its meaning is implicit in the foregoing discussion of hypothesis tests and errors. However, it is such a commonly used expression in the research literature that I will define it separately here:

A P value is the probability that a value of the sample statistic as extreme as the observed one could arise by random sampling of data from a population whose parameter of interest (ie the one corresponding to the sample statistic) is defined by the Null hypothesis.

The mere sight of a P value should lead you to:

• Recognise that someone has performed a statistical hypothesis test;
• Consider what is the Null hypothesis associated with the test;
• Consider whether this probability is sufficiently low for you to doubt that the Null-defined population gave rise to your sample. Often, especially in studies with an experimental design, this involves comparing the P value of the test with the pre-set α level. If P < α, the decision to reject Ho is made.

One can think of a P value as a measure of usualness of the observed sample data. But one can only judge what is usual or unusual if a context is available. In statistical testing the context is provided by the Null hypothesis. A small P value means that something unusual has happened if the Null hypothesis were true. (So, maybe it’s time to abandon the Null.) A large P value means that something relatively usual has happened, if the Null hypothesis were true. (So, maybe we’ll stick with the Null for the time being.) Note that the P value is a conditional probability: it makes no sense to interpret a P value without stating the condition “….. if the Null hypothesis were true”. The appearance of a P value comes at the very end of a long chain of events beginning with the design of the study that led to the collection of the data that formed the substrate for the analysis that yielded the P. Interpretation of a P value without detailed knowledge of all these steps – and their validity – is likely to be a waste of time, misleading, even dangerous.


§ 3.4 STUDENT’S t TEST

§ 3.4.1 Introduction

So far I’ve swept under the carpet the consequences of not knowing the population standard deviation. We need this parameter in order to calculate the standard error of the sampling distribution (see E3.1). If there are no previous large-scale surveys which have yielded this information, then we use the sample standard deviation to estimate the population standard deviation. The consequences of this substitution of s for σ are:

• The test statistic no longer follows a Normal distribution, but rather a t distribution.
• Accordingly, the test statistic is called a t statistic (not z), and the test is called a t test.
• Hypothesis tests and confidence intervals use the critical values of the t distribution rather than those of the standard normal distribution.
• In the case of the comparison of two means, without proof, the formula for the standard error of the sampling distribution of differences in means becomes:

se(x̄1 – x̄2) = √[(n1 + n2)/(n1·n2)] · √{[(n1 – 1)·s1² + (n2 – 1)·s2²]/(n1 + n2 – 2)}   E3.2

Admittedly, E3.2 looks pretty awful, but that is only a problem if you try and remember the formula. (In fact it does make sense – it involves a weighted average of the sample estimates of the variances of the two populations. The weights are the degrees of freedom associated with each variance estimate.) This formula assumes that the variances of the two populations are equal, that is, σ1² = σ2² = σ². This assumption is itself testable (see more advanced texts). The number of degrees of freedom associated with this standard error E3.2, and hence associated with any test statistic that uses it, is: n1 + n2 – 2.

Example 3.7(b)

Let me now admit that the standard deviations I quoted for the zinc levels in mothers of NTD-affected foetuses (0.209 µg/100ml) and in mothers of unaffected foetuses (0.138 µg/100ml) were not population standard deviations but were in fact the standard deviations calculated (see E1.3 and E1.4) from the sample data. Of course, this immediately changes the distribution of the test statistic from a Normal distribution to a t distribution.


Example 3.7(b) continued

By E3.2, the standard error of the sampling distribution of the difference in means is:

√[(68+138)/(68×138)] · √{[(67×0.209²) + (137×0.138²)]/(68+138–2)} = 0.0244 µg/100ml

So, our test statistic (completely analogous to the z-test) is:

t = (0.048 – 0)/0.0244 = 1.967
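E3.2 and the resulting t statistic can be checked directly; the numbers are those of Example 3.7(b).

```python
from math import sqrt

n1, s1 = 68, 0.209    # mothers of NTD-affected foetuses (sample SD)
n2, s2 = 138, 0.138   # mothers of unaffected foetuses (sample SD)

# E3.2: pooled standard error, weighting each variance by its df
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt((n1 + n2) / (n1 * n2)) * sqrt(pooled_var)

t = (0.048 - 0) / se   # observed difference against the Null value

print(round(se, 4))  # 0.0244
print(round(t, 3))   # 1.967
```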

In other words, our sample difference in means, 0.048, is 1.967 standard errors from the difference expected if the Null hypothesis were true, that is, from zero. We will return to Example 3.7(b) after discussing the t distribution.

§ 3.4.2 t Distribution

In 1908 a chemist called William Gosset, working for the Guinness brewery, published the derivation of a new distribution, now called the Student’s t distribution or t distribution, under the pseudonym “Student”, since the company frowned on its employees sharing any scientific knowledge. Gosset was interested in the behaviour of means of small samples, say with n less than about 25. For such samples, the Normal distribution can give misleading results. If the Null hypothesis is true, the test statistic as constructed in Example 3.7(b), using sample estimates of the population standard deviation, follows the t distribution. [It is true that with sample sizes as large as in our example, 68 and 138, the Normal and t distributions are virtually identical, but with smaller sample sizes it does matter.] So if we wish to know probabilities associated with our value of the test statistic we need to consult tables of the t distribution. The t distribution is similar in appearance to the standard Normal distribution. It has a mean of 0, it is symmetrical about the mean, and it is bell-shaped, but its variance is greater than 1: it is more spread out and has “fatter tails”. There is an infinite number of t distributions, each one uniquely specified by the particular value of its parameter. Unlike the Normal distribution, which requires two parameters, the mean and variance, the t distribution has only one parameter, our old friend (?) the number of degrees of freedom. Fig 3.5 shows just three t distributions: those associated with 1, 6 and 60 degrees of freedom. Each has been labelled both in the body and the tail of the distribution to make identification easier.
In fact, the distribution with 60 df is indistinguishable from the standard Normal distribution. Note that the distributions with few degrees of freedom have much fatter tails – more probability in the extremes. A t statistic needs to be more extreme to cut off, say, 5% of the tail probability when it is based on fewer degrees of freedom, that is, on a smaller sample size. If you like, to achieve a given level of statistical significance when faced with a small sample size you will need a larger difference in sample means.

[Figure: three t probability density curves with df = 1, 6 and 60, plotted over –4 to 4; the smaller the degrees of freedom, the fatter the tails.]

Fig 3.5 Some t distributions

In general, statistical tables for t distributions save space by documenting critical t values for a limited number of tail areas of interest for a subset of t distributions (defined by their degrees of freedom). Table 3 in the back of this book illustrates this. If the particular combination of α and df required is not tabulated, then one can interpolate from neighbouring cells or be conservative and use a smaller number of degrees of freedom. How do you know the number of degrees of freedom you should be using? For most tests there are easy rules. As we have seen, for the two sample t test the number of degrees of freedom is n1 + n2 – 2.

Example 3.7(b) continued

If we perform the test at the 5% level (that is, we agree that if the probability associated with our test statistic is less than 0.05 we will reject Ho), then our t statistic must exceed the tabulated critical value of t on 68+138–2 = 204 df. The tabulated value of t cutting off a total of 0.05 in both tails of the distribution is tcritical = 1.972. Our test statistic of t = 1.967 just fails this criterion and so, using this more correct approach to the problem, we just fail to reject Ho, though you might well maintain that the test is of “borderline significance” at the 5% level and wish that the researcher had chosen a slightly larger sample size to yield a clearer picture. Here is another example of the two sample (“independent samples”) t test that you should work through yourself. [Take the time to explicitly write down the opposing hypotheses and draw a diagram of the relevant t distribution with the critical value of t and the value of your calculated t statistic marked in.]

2004 Philip Ryan page 3-24


Example 3.8

The following raw data are reaction times (in milliseconds) from a stimulus (the onset of a red light flashing) to a braking action carried out in a driver simulation experiment by the Department of Transport and the Department of Health. Two samples were used: participants in sample 1 were selected at random from the University Women’s Christian Temperance League membership roll, and participants in sample 2 were selected from women leaving the University bar at closing time on a Friday night at the end of term.

Sample 1 data: 325 400 300 225 600 550
Sample 2 data: 500 790 480 550 700 600 510 370

Do the bar patrons have longer reaction times than the teetotallers? Use α = 0.1. Hints: The question as posed implies a directional Alternative hypothesis and so a one-tailed test will be used (although one might wish the researcher had justified the decision to rule out the possibility of the reverse being true, rather than opting for a two-tailed test).

Ho: µ1 = µ2 versus H1: µ2 > µ1

Here are some intermediate results so you can check your calculations. All units are milliseconds. You will need access to tables of the t distribution.

x̄1 = 400   x̄2 = 562.5   s1 = 147.48   s2 = 132.42   standard error = 75.01 (see E3.2)

t = [(562.5 – 400) – 0]/75.01 = 2.166

Observed difference in sample means = 162.5

Difference if the Null hypothesis is true = 0

So the observed difference in means is 2.166 standard errors from the Null hypothesised difference. Is this too great to be ascribed just to chance? From tables of the t distribution, the critical value of t cutting off 0.1 in the upper tail of the t distribution on (6 + 8 – 2) = 12 df is tcritical = 1.356. (Since this is a one-tailed test, in Table 3, we look for the entry with df = 12 and α = 0.2.)

Since t > tcritical we reject Ho.

That is, the probability of a difference in sample means of 162.5 msec arising purely by chance from a population wherein the true difference is zero, is less than the agreed cutoff of 1 in 10. Our test is said to be “significant at the 0.1 level”. Let’s hope the bar patrons aren’t driving home.

Of course, if you wished to go on and form a confidence interval for the difference in means, you would use the t critical value, not a z value. In our example, the confidence interval is also one-sided: since the test was one-tailed in the upper direction, the interval has a lower bound but no defined upper bound. The lower bound is given by: 162.5 – (1.356 × 75.01) = 60.8 msec, and since this bound lies above zero, the interval agrees with our rejection of Ho.
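The arithmetic of Example 3.8 can be checked in a few lines of code. Here is a minimal sketch using only Python’s standard library; the variable names are illustrative, not the book’s notation.

```python
# Pooled two-sample t test for Example 3.8 (independent samples).
import math
from statistics import mean, stdev

sample1 = [325, 400, 300, 225, 600, 550]             # teetotallers
sample2 = [500, 790, 480, 550, 700, 600, 510, 370]   # bar patrons

n1, n2 = len(sample1), len(sample2)
s1, s2 = stdev(sample1), stdev(sample2)   # sample standard deviations

# Pooled estimate of the common standard deviation, then the
# standard error of the difference in sample means.
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se = sp * math.sqrt(1 / n1 + 1 / n2)

t_stat = (mean(sample2) - mean(sample1) - 0) / se
df = n1 + n2 - 2

print(round(se, 2), round(t_stat, 3), df)   # 75.01 2.166 12
```

Since 2.166 exceeds the one-tailed critical value of 1.356 on 12 df, the code reproduces the rejection of Ho reached above.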


§ 3.5 MATCHING

The two-sample tests discussed above are called independent samples tests because each population is sampled without reference to the other population. In Example 3.8, the fact that the third reaction time for group 1 was 300 msec had no bearing on the fact that the third reaction time in the second sample was 480 msec. (This is additional to the notion that within each group the reaction times are independent of each other.) As mentioned, two-sample independent tests, especially t tests, are the commonest you’ll come across.

But sometimes we go out of our way to match or pair subjects, one of the pair from each group, before we measure the variable of main interest. In the NTD/zinc study (Example 3.7), we might take our samples, but then before measuring the plasma zinc, we might pair off one member of the unaffected group with one member of the affected group on the basis of another variable, for example, a measure of socioeconomic status (SES). SES could be important in this study, because it would be misleading to claim a significant difference in zinc levels on the basis of foetal NTD status when the difference might really be due to zinc levels changing with SES and NTD status also being related to SES. By individually matching on the basis of SES, we would eliminate this factor as an explanation for any difference found in zinc levels between NTD-affected and unaffected groups.

By adopting such a strategy we say we have controlled for the confounding effect of a third variable, in this case, SES. Socioeconomic status is termed a confounder or nuisance variable. It is a nuisance because if we do an independent test which ignores it, then even if we find a difference between groups, a critic will always be able to say that the difference is illusory and that our results are just due to the different distribution of SES in the two populations.

Compared with an unpaired experiment, by controlling for SES we will have increased the precision of our estimate of the population difference in mean zinc levels. This would be reflected in a smaller standard error and a test which has a smaller Type 2 error – it is more powerful in detecting a real difference if it is there. This seems fair; if we go to the trouble of matching subjects, the pay-off is a more powerful test. On the other hand, matching can cost time and money, and, if the matching is ill-conceived (say, SES was not really a confounder), no gain in precision may result.

Medical researchers have long recognised that twins provide a ready-made “natural pairing” situation. For example, in an effort to understand the causes of the severe psychotic disorder schizophrenia, psychiatrists have made intensive study of the effects of “nature versus nurture” (genetic make-up versus social environment) on sets of twins. Standard textbooks of Psychiatry can supply further information on this research.

When, and how, to match on which variables is not within the scope of this course, but you should know that simple statistical tests exist for situations in which matching has been used. Here is an example of a t test when samples have been matched.


Example 3.9

The following data represent the falls in systolic blood pressure (mmHg) in two samples of patients with hypertension who were studied in a drug trial. One group of patients was given a new, hopefully superior, anti-hypertensive drug, “Blotto”, and the other group was given an older, “standard” medication. For example, the first patient given Blotto experienced a 10 mmHg drop in blood pressure after administration of the drug. Each of the 6 patients in the Blotto group was paired with one in the standard group on the basis of age (to within ±2 years), sex and severity of their blood pressure condition (baseline systolic pressure to within ±5 mmHg).

Is Blotto any different from the standard drug in its effect on blood pressure? (Use a two-tailed test with α = 0.05.) The standard deviations in the populations aren’t given, so we will need to estimate them from the sample data. A t test rather than a z test is in order.

Data:

Blotto   Standard   (Blotto – Standard)
  10         8               2
   7         3               4
  12         1              11
   2         0               2
   4         9              -5
   6         5               1
                 Differences Total = 15

The first two columns above are the data as reported. The third has been calculated. You remember that, in the independent tests earlier, we calculated the difference in means and the standard error of the difference in means to construct a z or t test. In a paired situation, we get the mean of the differences and calculate the standard error of the mean of the differences. Better read that again.

Ho: mean of the differences, µd = 0 versus H1: mean of the differences, µd ≠ 0

(The notation: “≠” means “does not equal”.) If you think about it, even though a matched sample problem necessarily starts out with two samples, by calculating the differences between each pair, we end up with what amounts to a single sample – a sample of differences – so a matched samples test ends up as a single sample test.


Example 3.9 continued

If you examine the (Blotto – Standard) column of the data, you’ll see that the mean of these sample data, that is, the mean of the differences, is 15/6 = 2.5 mmHg and the standard deviation of the sample differences is 5.167 mmHg (using E1.3 and E1.4 on the difference scores). The estimated standard error of the sampling distribution of the mean of the differences is sd/√n, where the subscript d signifies that the standard deviation is that of the difference scores, and n is the number of pairs. Therefore the standard error is 5.167/√6 = 2.109 mmHg.

The calculation of the standard deviation involves 6 difference scores, so the degrees of freedom associated with the standard error and hence the t test is: (6 – 1) = 5. Our t statistic, to be compared against the t distribution on 5 degrees of freedom, is:

t = (2.5 – 0)/2.109 = 1.185

which is less than the tabulated critical value of t for a two-tailed test at the 5% level on 5 df: tcrit = 2.571. In fact, t = 1.185 is associated with a two-tailed P value of 0.29, which far exceeds α = 0.05. Hence we are unable to say that Blotto works any better or worse than the standard drug. We accept the Null hypothesis.

A 95% confidence interval around the population mean difference is given by: 2.5 – (2.571 × 2.109) to 2.5 + (2.571 × 2.109) = [–2.92, 7.92], and we note that this interval includes the Null hypothesised value of zero, leading as before to an acceptance of Ho. (Also note that the construction of the confidence interval uses the 5% critical value of t on 5 df, 2.571, rather than the z critical value of 1.96, since in constructing the standard error, we used the sample standard deviation as an estimate of the population standard deviation.)

In fact, if you rework Example 3.9 as an unmatched study, and do the independent sample t test, you will find that the P-value resulting from the test is 0.27; this is still not significant, but it is less than the result from the matched analysis (P = 0.29).
It appears that the matching strategy may have achieved nothing; indeed, if there really were a difference to be found, it may have compromised the performance of the test a little. The matching process leads to a loss of degrees of freedom compared with the unmatched study (n-1 versus n1+n2-2). From Fig 3.5 we know that t distributions with fewer degrees of freedom require larger sample statistics to achieve significance than distributions with many degrees of freedom. So a matched study will only perform


efficiently if the matching achieves a level of precision sufficient to outweigh the “handicap” of the loss of degrees of freedom. Much thought should precede the decision to match.

It is worthwhile knowing that matching is not the only way to control for the effects of confounding variables. Matching is a controlling stratagem we may adopt at the design phase of research; it then demands we subsequently perform the appropriate matched analysis. One can often avoid matching – which may be logistically difficult – by dealing with confounding solely at the analytic phase, using more complex variants of the simple statistical techniques discussed in this introductory book.

§ 3.6 SUMMARY

Inferential statistics is a set of powerful tools enabling the cautious researcher to speculate beyond the data at his or her immediate disposal. It is crucial to appreciate and report the errors associated with testing hypotheses. Interval estimation, or the reporting of confidence intervals, is becoming standard practice; these techniques allow an audience to gauge the precision of any research estimates.

A single experiment, or even multiple experiments, however well designed and carefully executed, cannot establish the truth. The very next study might overturn all previous work. Scientific method, including the judicious use of statistics, provides some assurance that, for the time being at least, we are on the right track.