
9. Comparisons


©Copyright 2020 Mikel Aickin

26 Jul 2020

Comparisons Many datasets are collected for the purpose of estimating some particular value, with the intent to compare it to what would be seen in different circumstances. The issue is, then, to compare the distribution of outcomes, often in two (or more) substudies, and this often reduces to comparing two (or more) QF-curves or parameter estimates. This chapter deals with such comparisons.

The topics are

• q-q plots for comparing distributions

• Plotting quantiles in different groups

• Combination of support functions from different groups

•••

In the situations we will consider there are multiple parameters but they all measure the same thing, just in different groups or under different conditions. The studies tend to fall into two categories. In the first, the scientists would like the parameters to vary, because this would be a phenomenon requiring explanation. In the second, they would like the parameters to be equal, because the different groups or conditions are not supposed to influence the parameters. The second category is called homogeneity, but we will start with the first category, called heterogeneity, because it is dominant.

Heterogeneity

Direct comparison of QF-curves. By far, most scientific investigation involves comparing measurements under different circumstances. In an experiment, the circumstances have been arranged to change their character in some understandable way, and the issue is to see whether this has anything to do with the resulting measurements. In a naturalistic study the characteristics of the circumstances will have been arranged by nature, but again the whole point is to see whether the distributions of measurements vary in some potentially informative way across the different circumstances.

Given two samples, taken under different circumstances, it makes sense to ask what transformation could be applied to one sample so that it would have the distribution of the other sample. That is, if we have a good QF-curve for a sample x1, …, xn and another good QF-curve for a sample y1, …, ym, then what transformation g could we apply as y = g(x), converting the QF-curve of the x's into the QF-curve of the y's? To say this a slightly different way, if Q1 is the quantile function for the x-sample and Q2 is the quantile function for the y-sample, then what g will give Q2 = g(Q1)? It seems apparent that the only way the two QF-curves could be the same for an increasing g is if g(x) = x, and this gives us a way of telling, or estimating, whether the two distributions are the same.

The theoretical computation of g lies in the fundamental QF theorem. The theorem says that u = F1(x) transforms a random sample of x’s with fractile function F1 into a uniform random sample, and then y = Q2(u) transforms a uniform random sample to a sample of y’s with fractile function F2. Thus g(x) = Q2(F1(x)). Here is a template that would implement this approach:

Xqf <- QF(x)    # make QF from x sample
Yqf <- QF(y)    # make QF from y sample
# evenly spaced values in the x-range
xvals <- seq(range(x)[1], range(x)[2], length=40)
# fractiles corresponding to xvals
fx <- quan(Xqf, xvals)[,2]
# yvals corresponding to the xvals fractiles
yvals <- frac(Yqf, fx)[,1]
plot(xvals, yvals, type="o")

Most of the complexity here comes from wanting the evenly spaced values, and this is because one does not want the result to be biased somehow by choosing the observed x-quantiles.

A more conventional computational strategy goes as follows. Create a u-sample, and then use it to create corresponding observations in both the x- and y-samples. This is done by making pairs (Q1(u), Q2(u)), and then seeing how they relate to each other. We would be looking for g such that Q2(u) = g(Q1(u)). The most organized way to do this is to take u not as a random sample, but as a fixed grid. Then the plot of Q2(u) against Q1(u) can be taken to show an estimator of g. Here is the computational template:

u <- sim(100, type="g")
xq <- frac(QF(x), u)[,1]
yq <- frac(QF(y), u)[,1]
plot(xq, yq, type="o", xlab="Quantile x", ylab="Quantile y")
lines(xq, xq)

Versions of this kind of plot are called q-q plots. To the extent that the points lie on the diagonal, we have nearly g(x) = x, and the two distributions are close. This suggests that the mean or median of |Q2(u) – Q1(u)| could be used as a measure of the degree to which we do not have g(x) = x, although there are certainly other sensible measures of discordance that we could use instead.


It is worthwhile to recognize that if the y-sample is a shifted and scaled version of the x-sample, then we would have

Q2(u) = µ + σQ1(u)

But this just says that if the q-q plot shows a straight line, then the shift-scale relationship holds. It even suggests that fitting a line to the q-q plot would provide estimates of the shift and scale.
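As a rough base-R illustration of these last two points, using quantile() and lm() in place of the chapter's QF tools (so a sketch, not the scripts' own code, and assuming x and y hold the two samples):

u <- seq(0.05, 0.95, by=0.05)    # fixed grid of fractiles
qx <- quantile(x, u)             # Q1(u), empirical quantiles of the x-sample
qy <- quantile(y, u)             # Q2(u), empirical quantiles of the y-sample
median(abs(qy - qx))             # a simple measure of discordance from g(x) = x
coef(lm(qy ~ qx))                # if the q-q plot is nearly straight, the intercept
                                 # and slope estimate the shift and scale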

An enormous amount of applied statistics focuses on the possibility of a shift, while ignoring not only the possibility of a scale change, but the even more likely possibility that there is no shift-scale relationship at all. This kind of conventional analysis shows itself by an almost exclusive preoccupation with comparing the means of the two distributions.

Calcium and systolic blood pressure. The script qf041.r shows the analysis of a small trial to see whether an intervention to increase calcium intake had an effect on systolic blood pressure (SBP). There were two groups, calcium-supplemented and a control with no supplementation. The study sample consisted of only African-American men, and since it is so small it should be regarded as an early-phase study. One purpose of such a study is to see whether the results suggest features that should be taken into account in subsequent studies.

SBP was measured at the beginning and end of the study, and the usual way to assess effects is to look at the changes. The QFview() of changes in the two groups cannot be terribly informative due to the small sample sizes. There are, for example, indications of outliers, but one needs to be cautious here. Note that the winsor() plots do not suggest outliers, but this is because they are not very informative with small samples with erratic variability. Again considering that this is an early-phase study, it is sensible to do the analysis both Winsorized and not Winsorized.

The first q-q plot (not Winsorized) shows a potentially important pattern. This is made even clearer in the Winsorized analysis, whose q-q plot is in Fig. 9-1. The changes in the lower end (that is, below the median) of the calcium group are quite a bit more negative than the changes in the lower end of the control group distribution. On the other hand, above the medians there is less and less difference between the two. This suggests that there may be two groups of people, one whose SBP is sensitive to calcium supplementation, and the other whose SBP is not. Or, there may be a spectrum of SBP responses to calcium. In either case, this is not a persuasive result, because it is only an early-phase study. It is a hint that there may be a basic issue in calcium effects on SBP in African-American males that needs serious attention in the next study. In particular, there is little point in performing any inference on this sample.


[Figure: q-q plot titled "SBP Changes"; Control on the horizontal axis, Calcium on the vertical axis.]

Fig. 9-1. Comparison (q-q plot) of systolic blood pressure changes (Winsorized) in a calcium-treated group against a control group. (qf041.r)

The publication in which these data were presented never offered any effect estimates in African-Americans, nor in the 54 European-Americans in the same study. It is unfortunately common in medical research, even in leading journals (Journal of the American Medical Association in this case) to fail to provide adequate information about treatment effects. It is also common, as in this case, for the study data to disappear after publication, so that no more informative analyses are possible. (It is not clear how the subset of the African-American data was preserved for textbook use.)

Darwin's Corn. In 1876 Charles Darwin published a volume comparing properties of cross- and self-fertilized plants. The script qf042.r uses a table from his book relating to corn. The data faithfully copied from Darwin's table are:


pot  cross   self    c4      c5      c6      c7      c8
1    23.5    17.375  23.5    20.375  23.5    20.375  -3.125
1    12      20.375  21      20      23.25   20      -3.25
1    21      20      12      17.375  23      20      -3
1    .       .       .       .       22.125  18.625  -3.5
1    22      20      22      20      22.125  18.625  -3.5
2    19.125  18.375  21.5    18.625  22      18.375  -3.625
2    21.5    18.625  19.125  18.375  21.625  18      -3.625
2    .       .       .       .       21.5    18      -3.5
2    22.125  18.625  23.25   18.625  21      18      -3
2    20.375  15.25   22.125  18      21      17.375  -3.625
3    18.25   16.5    21.625  16.5    20.375  16.5    -3.875
3    21.625  18      20.375  16.25   19.125  16.25   -2.875
3    23.25   16.25   18.25   15.25   18.25   15.5    -2.75
3    .       .       .       .       12      15.25   3.25
3    21      18      23      18      12      12.75   0.75
4    22.125  12.75   22.125  18      .       .       .
4    23      15.5    21      15.5    .       .       .
4    12      18      12      12.75   .       .       .

The arrangement may be a little confusing. There were multiple plants in each of four pots. Half within each pot were cross-fertilized and the others were self-fertilized. The eventual plant heights in inches are in the second and third columns. All the other columns were derived from these two. In c4 and c5 Darwin sorted the crosses and selfs within each pot, and then in c6 and c7 he sorted the crosses and selfs without regard to pot. The final column is the difference of the last two, self minus cross, so negative values favor cross. Here is what he says:

The observations as I received them are shown in (the table) where they certainly have no prima facie appearance of regularity. But as soon as we arrange them in the order of their magnitudes as in columns (c4 & c5), the case is materially altered. We now see, with few exceptions, that the largest plant on the crossed side in each pot exceeds the largest plant on the self-fertilised side, that the second exceeds the second, the third the third, and so on. Out of fifteen cases in the table there are only two exceptions to this rule.

Darwin noted that for his conclusion it does not make any difference whether you take pots into consideration or not. He thought that the evidence was convincing that the crossed plants did better, even though the dataset was small.

From qf042.r, the results in QFview() suggest that there are two small outliers in the crossed and one in the self samples. Although it will not make much difference to our analysis, both samples have been Winsorized. The q-q plot is shown in Fig. 9-2. Here it seems pretty obvious that at the very low end of heights there is not much to distinguish the two samples, but for the remainder of the distributions the crossed plants do better.


[Figure: q-q plot titled "Darwin Corn"; Self on the horizontal axis, Cross on the vertical axis.]

Fig. 9-2. Comparison of growth between self-fertilized and cross-fertilized corn plants, conducted by Charles Darwin. (qf042.r)

Since these samples are small, we should state the usual early-phase precautions, that these should be taken as indications rather than established facts. The hint given by the data is that in plots of corn some plants will falter, and among them the method of fertilization will make no difference, but among the plants that do not falter there may be an advantage in the crosses. This rather suggests that in the next experiment, there should be a clear rule for determining faltering plants that is applied the same way in all samples. It might be wise to compare the number of faltering plants to check that neither condition causes faltering, and inference on height would then be reserved for the remaining plants. In any case, it seems evident that no analysis seeking a shift, such as that based on means or medians, would be appropriate for Darwin’s data unless the faltering plants were culled out first.

It is worth emphasizing that although the Winsorization does not much affect our results, it can be important in cases like this. If there are outliers at one end of only one of the samples, then this effectively causes a shift in that distribution, which could be great enough to make the q-q plot misleading. This underscores the overall message that potential outliers should not be treated in a cavalier fashion.

The presentation of the data that I have copied above was due to Francis Galton. Darwin evidently realized that the odd distribution of his values would make analysis difficult, and so he called on his half cousin (they shared a grandfather) to help him out. Galton was used to expecting data to follow the “law of error” (aka the Normal distribution), and so he was not able to offer much help to his relative. Darwin did the best he could with the analysis I quoted above, and gave Galton credit for the assistance. This had a large but unexpected effect.

Darwin’s little dataset played a greater role in history than it might have done, due to the fact that R.A. Fisher included it as the first example of his “randomization” test, in the second chapter of The Design of Experiments in 1935. It was not one of Fisher’s finest works. He misunderstood Darwin’s explanation, drawing the conclusion that the cross- and self-fertilized plants occurred in pairs. Although he cited Darwin at length on how he did the experiment, he omitted passages in which Darwin made it plain (in combination with his data table) that there was no pairing of plants. Probably Fisher made this mistake because he wanted to use a certain kind of simulation to obtain the distribution of the average difference under the assumption that there was no difference in the theoretical distributions. Specifically, Fisher conceptually swapped the “cross” and “self” labels within each pair at random, across the entire sample, repeating such a sample-swap many times. This generated a histogram of mean differences, one for each random permutation of the entire sample, and this was supposed to represent the desired distribution. If he had understood that there was no pairing, then he would have seen that his random swapping made no sense. But this was not Fisher’s biggest mistake. He also overlooked the fact that even if there were pairing, but the distributions of cross and self plants were different, then his permutations did not, in fact, reproduce what would have happened if the distributions had been the same. He failed to produce the distribution he claimed to be aiming for. This was a major blunder that ultimately led to popular but spurious arguments favoring experimental randomization.

The reason Galton’s participation was important was because Fisher had an ongoing feud with Karl Pearson, who was a great friend of Galton. Fisher used every opportunity to try to belittle Pearson’s work, either directly or as in this case indirectly through a friend.

It is perhaps of further historical interest that Darwin had a good idea of how to look at his data, and that if he had been used to plotting graphs he might have discovered the q-q plot. The dominant trend in statistics became, however, to follow Fisher in a rush to inferential analysis based on means.

(Aside. This dataset is putatively available in R's HistData package under the name ZeaMays. It is worthwhile reading the documentation, which makes one mistake by saying that the data are paired, and a second mistake by saying that Fisher analyzed them in his book, whereas Fisher did not use the pairing that appears in the supplied dataset.)

Galton’s Peas. Francis Galton was very interested in quantifying the effects of inheritance. He collected a large number of different datasets, but one of the most famous consisted of the sizes of sweet peas in a parental generation and the corresponding distributions of sizes of offspring. Here is a reproduction of Table 2, Appendix C, of his 1889 volume Natural Inheritance.

Galton had sent seven packets of seeds to seven friends living in various parts of England, asking each to plant them, harvest the peas of the progeny, and measure their diameters. He claimed to have given them all explicit instructions, trying to suggest that all the peas were raised similarly, an odd presumption for anyone familiar with English weather. Each row in the table is for one of the packets, in which the parental seeds were (as nearly as possible) of the same size. The numbers across a row are the percentages of seeds in that row (corresponding to a parent) that fell into various categories (corresponding to offspring).

The reason this table is famous is Galton’s interpretation. By looking at the column of means at the right, he saw two things. First, the average offspring size was less than the (controlled) parental size. Secondly, there seemed to be a linear relationship between the child and parent values. According to his biographer, Karl Pearson, this was the first inkling Galton had of what would later be called “linear regression”.

For the purposes of analysis, we can note several things. The first is that Galton grouped the sizes of the progeny. He probably did the same for the parents, but we cannot tell. The next is that he grouped all those below 15 into one category (and similarly for those above 21). The serious effect of this is that we cannot see the left sides of each offspring distribution, and this becomes more serious the smaller the parent sizes. Thirdly, it is not clear how Galton computed the averages, and in particular how the severe groupings of the smaller sizes were handled.


If we note that about half of the child sizes for parental size 15 themselves lie below 15, then it seems plausible that 15 was about the median of the sizes of sweet peas. Thus we appear to have the upper half of the distribution of parental seeds. The key to doing an analysis is noting that we can estimate the quantiles corresponding to fractiles 0.5 and above for each parental sample, and for some we can even estimate a few of the lower quantiles. From this we can at least partially reconstruct what Galton could have seen with complete data.

The analysis is in qf043.r. Since the data were copied out of Galton's original table, they are displayed so they can be checked. The idea is to select offspring sizes and their corresponding counts using subset() for each parent size, then apply frac(), which will use the child and count columns to compute a data frame of (quantile, fractile) pairs for a discrete (type="d") QF-curve. With the qfcurve object we then compute a standard set of quantiles, and finally store them in a column of Q. The final segment plots the rows of Q one by one, on an empty plot that had been set up for that purpose. Note that the reason for the for() loop is that we need to fill Q by columns, but we intend to display it by rows. A great deal of the complexity of this script is due to the fact that R does not have good features for handling count data.
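A base-R approximation of what qf043.r does may clarify the bookkeeping. The data frame name peas and its column names (parent, child, count), as well as the use of rep() and quantile() in place of the discrete QF machinery, are assumptions for illustration, not the script's own code:

fracs <- seq(0.3, 0.9, by=0.1)                  # the fractiles plotted in Fig. 9-3
parents <- sort(unique(peas$parent))
Q <- matrix(NA, nrow=length(fracs), ncol=length(parents))
for (j in seq_along(parents)) {
  sub <- subset(peas, parent == parents[j])     # offspring sizes and counts for one parent size
  child <- rep(sub$child, sub$count)            # expand the count table to raw values
  Q[, j] <- quantile(child, fracs)              # fill Q one column (parent size) at a time
}
plot(range(parents), range(Q, na.rm=TRUE), type="n",
     xlab="Parent", ylab="Child quantiles")     # empty plot set up in advance
for (i in seq_along(fracs)) lines(parents, Q[i, ])   # display Q one row (fractile) at a time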

The resulting plot in Fig. 9-3 shows several interesting features. First, the medians show the "regression" effect that Galton discovered. The dot-dashed line is what would have been seen if offspring size had equaled parent size, and it is plain that the offspring size is less than this. The general regression effect is that the offspring tend to have values closer to the overall median than their parents do, but since Galton skimmed off the upper part of the distribution, the only effect would be that offspring values would tend to be lower than their parents. Next, it is a bit odd that Galton evidently referred to the effect in terms of means, since he tended to prefer the median as a representative of the center of a distribution. The medians do not show the same sharp relationship as the means do, but then we don't know how Galton computed the means. Another feature is that, so far as we can tell, the spread of the distributions does not change much with the parental size, and in fact the child distributions (for differently-sized parents) appear to be simple shifts of each other. This is the meaning of the parallelism in the quantile lines. We now know that the residual standard deviations in conditional Normal distributions are constant, so Galton had before him the information to have made an additional discovery.


[Figure: quantile curves titled "Galton's Peas"; Parent on the horizontal axis, Child Quantiles .3(.1).9 on the vertical axis.]

Fig. 9-3. Quantiles 0.3 through 0.9 (bottom to top) of offspring sizes according to parent sizes. The median offspring size is the thick line, and the dot-dashed line shows where offspring size would equal parent size. (qf043.r)

The history of Galton’s table is somewhat strange. He first published it in 1889, but there is no explanation for the table in Natural Inheritance. Instead the discussion appeared in a talk that he delivered in 1877. He mentioned the data almost in passing, despite the important role it played in his thinking. Galton’s data has been made available on the Internet for a number of years, but in a form that distorts his actual table. For the sake of making it easier for students to analyze it has been “enhanced” to be an ordinary dataset rather than a frequency table, hiding the excessive grouping.

It has been frequently said that Galton's discovery contributed to the arguments for eugenics in late 19th and early 20th century Britain. The idea was that if parents at the extremes of a distribution had children who, on average, were closer to the overall mean, then every heritable trait would tend to mediocrity over time. The implication seemed to be that a program was needed for potential parents in the lower part of the distribution to be discouraged from reproducing. While it is true that Galton was a proponent of eugenics, in his very first published words on regression to the mean in 1877 he recognized that the "reversion to mediocrity" argument was false. He explicitly stated that the phenomenon was technically necessary, since otherwise the variability of the inherited trait would inexorably rise without bound. In other words, any characteristic that has the same distribution in each generation must exhibit regression to the mean. There are many probabilists and statisticians who do not understand this even today. Possibly Galton also forgot it, since many of his later writings rather strongly suggest that the eugenics program he promoted was designed to counteract the regression effect.

Galton apparently thought of the regression phenomenon as making an important statement about the nature of heredity, especially the inheritance of beneficial human traits, which he defined as those of the class in British society to which he belonged. Whether he fully understood the mathematics underlying the phenomenon, independently of whether the data measured anything having to do with heredity, is questionable. In particular, Galton failed to recognize that regression had been studied thoroughly by Gauss more than 50 years earlier. This was probably the consequence of Galton’s education not being of the same caliber as his intellect.

London cholera. During the middle of the 19th century London was assaulted by cholera epidemics several times. Cholera is caused by the cholera vibrio, a micro-organism that is most frequently found in polluted water. When fecal matter leaks into the water supply, cholera becomes rather likely. At the time in London no one knew about microbes, and there was contentious speculation about where cholera came from.

John Snow, a London surgeon and anesthesiologist, became famous as the "father of epidemiology" for actions he took during the epidemic of 1854. He adopted the position that cholera was a water-borne disease, against those who felt it was air-borne. In fact, he took this position rather earlier than the London outbreak of 1848, and spent some time trying to assemble data to support his case. He was assisted by William Farr, then the chief statistician in the Registrar-General's office, who had command over government resources to collect data. Farr helped Snow despite the fact that he did not initially agree about the water-borne hypothesis.

In the epidemic of 1848 Farr collected a dataset that we will use to evaluate the opinions of the two men. For the moment, we will employ some simplifications, in qf044.r. First, there were three sources of water in London: one relatively clean, drawing from above the point on the Thames where most sewage was dumped, one downstream of the pollution, and one in between. Here I reduce them to two, by pooling the two more polluted sources. There are several variables in this dataset that we will use later on, but for now the one other variable I use is crate, the cholera death rate per 10,000 inhabitants during the epidemic. Each row in the dataset stands for one London district.

The first step is to obtain QF-curves for crate, separately in the two water-source groups, C0 and C1. I want to compare the values of Q(.25), Q(.50) and Q(.75) between the groups. For each parameter, the steps are the same. Simulate samples for the estimators, separately in each sample. Compute the difference between the estimates, for each simulation repetition. Computing the QF-curve for the comparison has a twist, that we set n=1. This has no actual effect here, but in general it is better than allowing nreps (default 1000) to become a fake sample size. Especially note that in support() we set sim=FALSE, because we are inputting the QF-curve of the estimator.
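The following sketch shows the shape of this computation for the median. Bootstrap resampling with replicate() stands in for whatever estimator simulation the earlier chapters' tools provide, crate0 and crate1 are assumed to hold the crate values in the C0 and C1 groups, and the QF() and support() arguments follow the descriptions in this chapter rather than qf044.r itself:

nreps <- 1000
med0 <- replicate(nreps, median(sample(crate0, replace=TRUE)))   # simulated medians, best water
med1 <- replicate(nreps, median(sample(crate1, replace=TRUE)))   # simulated medians, worst water
dif <- med1 - med0                  # one difference per simulation repetition
Dqf <- QF(dif, n=1)                 # n=1 so that nreps is not taken as a sample size
S <- support(Dqf, sim=FALSE)        # the QF-curve of the estimator is supplied directly

The same pattern, with quantile() at 0.25 or 0.75 in place of median(), gives the comparisons for the other two quartiles.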

Here are the tabular results of the simplified support functions for the differences between cholera rate quantiles, worst water minus best water. The estimated effects for Q(.25), Q(.50), and Q(.75) are 60, 92, and 105 extra deaths per 10,000 inhabitants. Well-supported values are only a few deaths above or below these numbers. From the bottom row, in all cases zero (meaning equal quantiles) would be very poorly supported. By default we have used the shift model for each individual quantile, but it seems unlikely that the shift model would apply across the entire distribution. In fact, the median is moved upward more than the 1st quartile, and the 3rd quartile is moved up more than the median. John Snow appears to have been right, not only in that the cholera rates were elevated in the worst water districts, but even further that the elevation is more severe at the higher cholera levels.

        Q(.25)             Q(.50)             Q(.75)
Supp    lo      hi         lo      hi         lo      hi
1.0     60.2    60.2       92.3    92.3      105.4   105.4
0.9     60.2    63.1       92.3    95.8      105.4   108.4
0.8     53.9    66.0       86.8    99.7      100.1   110.9
0.7     50.6    69.3       83.9   103.0       97.4   113.7
0.6     47.0    73.2       81.2   106.2       94.4   116.5
0.5     42.5    75.9       78.1   110.1       91.7   119.4
0.4     42.5    80.3       78.1   114.6       91.7   123.3
0.3     38.0    84.8       74.6   119.4       89.8   128.2
0.2     33.2    89.5       71.4   125.1       87.4   133.6
0.1     16.0    96.8       59.9   134.5       75.6   142.8

Support intervals for differences between quantiles in cholera death rates per 10,000 inhabitants (polluted minus unpolluted), in the London epidemic of 1848.

This latter point is worth some attention. One has to see that cholera occurred in all districts, and that even in those with the more favorable water supply there are differences among districts. In some districts something other than just the water supply must have been operating to produce cholera deaths. Whatever those factors may have been, the effect of more polluted water appears to have been higher in the more vulnerable districts than in the less vulnerable, perhaps suggesting some kind of interaction.

There is some additional information in Farr's dataset on this point. He also collected the total death rates in each district, for a short period occurring several years before the 1848 epidemic. These presumably reflect differences in the degree of poor sanitation and industrial pollution that plagued London at the time, and Farr's strategy was to see whether the cholera differences were merely another aspect of more general health differences. The theory would be that "water quality" was actually just a marker for a complex of other causes of poor health. We can analyze the potential effect of water source on total mortality by the simple expedient of replacing crate by drate. Here are the results:

        Q(.25)             Q(.50)             Q(.75)
Supp    lo      hi         lo      hi         lo      hi
1.00     5.0    5.0        20.5    20.5       14.4    14.4
0.95     3.5    6.0        19.2    22.1       13.8    15.2
0.75    -1.8   10.2        15.0    28.0       10.6    17.6
0.50   -11.2   16.7         9.1    37.0        7.3    21.9
0.25   -24.3   24.2         3.3    48.7        1.5    28.3
0.05   -38.0   35.8        -6.5    60.7       -6.2    40.8

Support intervals for difference of quantiles of total death rates (polluted minus unpolluted).

Compared with the apparent effect of water source on cholera death rates, the effects on total death rates are modest. Indeed, some negative values are well-supported for the 1st quartile, and 0 is not very poorly-supported for any quantile. If there is a water-related effect on total death rates, it is modest. It seems likely that any inferential comparison with the effects on cholera would say that water-source effects are not simply a reflection of a more general mortality effect. This supports the argument that the cause of cholera was water-borne. It is, however, not the end of the story on London cholera.

Tyndall’s glacier. One of the most celebrated scientists of the 19th century was John Tyndall. He wrote multiple books in which he tried to make science understandable for ordinary, intelligent people. Among his many interests, he was inspired by Louis Agassiz’s earlier measurements showing that glaciers moved, including some rough estimates of how fast they moved. This was important because glaciers were seen for the first time in the early 19th century as an explanation for how boulders could appear in places where one did not expect to find them, because moving glaciers would clearly transport them and then disappear. This in turn implied that places littered by unexpected boulders might have been glaciated long ago. Tyndall took on the task of making the velocity measurements more precise, as the first step toward better scientific understanding of glaciers.

In 1864 along with a companion he investigated the speed of ice flow in the Morteratsch glacier in the Swiss Alps. They set out three lines of stakes, one high up the glacier, one further down, and the last further down yet. He had devised a method, by sighting from a known position to a fixed position on the other side of the glacier with a telescope, to tell how much each stake moved in 24 hours. This was a substantial improvement over earlier use of moving stakes, which required many days or even months. Tyndall's data is read in qf045.r. The stake variable is numbered from the eastern side of the glacier to the west side, and the figures are the inches moved.


Tyndall was interested in the relative velocities at the three locations. These are shown on a per stake and glacier-position basis in Fig. 9-4. Tyndall scanned the numbers and concluded that there was a gradient, with the velocity being higher the higher up you go on the glacier. His conclusion was that there was an impaction on the ice in the lower reaches, which expressed itself in increased pressure on the sides of the glacier channel.

[Figure: Inches Moved against stake number, with separate curves for the High, Mid, and Low lines of stakes.]

Fig. 9-4. Recordings of movement in 24 hours of stakes distributed across a glacier at three different levels. (qf045.r)

But Fig. 9-4 suggests some potential problems. The first is that "velocity" depends on where you measure it across the glacier. Tyndall already knew this, because he had presented his work on it in the chapter preceding the one where he looked at Morteratsch. In qf045.r we use the median of each string of stakes, and so the first issue is to compare these. In [3] we convert each string into a QF-curve, and from that to a support function for the median. Note that the dbl option is used, to increase the number of quantiles in the qfcurves. The graph in Fig. 9-5 shows that the lower reaches are moving more slowly than the mid and high reaches, but between the latter two one needs to do a computation because the support functions overlap so much. For completeness, we look at all three comparisons in [4]. The strategy consists of several steps, sketched in code after the list.

• simulate samples of medians from the three groups

• compute a (simulated) sample of differences

• compute the QF-curve for the differences. This is complicated by the fact that the distributions of the medians are severely left-skewed, which is a bit unusual. Consequently, the tran="lexp" option of the smoothing is used.

• convert the resulting QF-curve to support. This is done directly (sim=FALSE) because the QF-curve of the preceding step is already for the differences.
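As with the cholera sketch, here is a rough outline of these steps for one of the comparisons (High minus Mid). Bootstrap resampling again stands in for the estimator simulation, the vectors high and mid are assumed to hold the stake movements for the two lines, and the tran and sim arguments follow the descriptions above rather than qf045.r itself:

medH <- replicate(1000, median(sample(high, replace=TRUE)))   # simulated medians, high line
medM <- replicate(1000, median(sample(mid, replace=TRUE)))    # simulated medians, mid line
dHM <- medH - medM                    # simulated sample of differences
Dqf <- QF(dHM, n=1, tran="lexp")      # left-skewed medians, so the lexp transform is used
S_HM <- support(Dqf, sim=FALSE)       # already the QF-curve of the difference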

The support for 0 as the difference between medians is pretty much ruled out in all three cases in Fig. 9-6, so that Tyndall appeared to have been right in his feeling that glacier velocity increased as one went upward. There is also some evidence that this effect diminishes the higher one goes.

One of the reasons that Tyndall’s observations are important is that he discovered that the ice moved at different speeds along a cross-section perpendicular to the glacier edges. In fact the pattern was exactly the same as what is seen in rivers, bolstering the idea that glaciers are simply very slowly-moving rivers. As Tyndall noted, however, unlike water, ice can pile up and even be compressed, and these consequences of differential glacier speed make it a critical measurement for glacier science.


[Figure: "Tyndall's Glacier Speeds"; Support (0 to 1) on the vertical axis against Inches per 24h on the horizontal axis.]

Fig. 9-5. Support functions for median number of inches moved in 24 hours for three glacier levels, from left to right: lower, middle, higher. (qf045.r)


[Figure: three panels of support functions against the difference of medians, titled "High minus Mid", "Mid minus Low", and "High minus Low".]

Fig. 9-6. Support functions for differences between median 24-hour shifts, for three glacier levels. (qf045.r)

Methods for multiple parameters. We have seen several approaches to inference for multiple parameters.

• Estimating the transformation from one QF-curve to another using q-q plots (calcium and SBP, Darwin’s corn)

• Graphing quantiles (Galton’s peas)

• Using simulation of estimators, allowing for various comparative approaches (London cholera, Tyndall’s glaciers)

These are only the most obvious approaches to heterogeneity, and one can develop other strategies tailored to particular data situations.

Homogeneity

Heterogeneity studies tend to dominate, because scientists are usually looking for patterns of variation that require explanation. Earlier in a program of study, however, it may be more important to verify that different ways of making measurements do not have much influence on the results. Often this involves "nuisance" factors, which one would like to be irrelevant, but which need to be checked. In the 19th century such studies were eminently publishable, because they assisted other investigators in developing their experimental or observational protocols, but in the 20th century gradually more and more emphasis was given to the importance of explainable heterogeneity. So-called "methods" studies of homogeneity have become increasingly less publishable, admittedly with exceptions in some research areas, and so some scientists have tended to skip the fundamental first steps in establishing measurement protocols.

There are two basic approaches to homogeneity studies. The first is to compare each pair of parameter estimates. We will do this by combining their support functions. The second (which is traditional) is to devise an overall measure of the extent to which the parameter estimates are discrepant, and then through simulation to use the support function of the discrepancy measure. When inhomogeneities are found, it is natural to start trying to understand how they are expressed in the QF-functions, and eventually to search for hidden variables that might explain them.

Combining supports. The issue in a homogeneity study is how to extend the inference for a single parameter to inference for several parameters. This can get complicated. In order to keep it from getting too complicated, we consider the situation where we have the same kind of parameter, but in two different circumstances, or experiments, or populations. Call the parameters θ1 and θ2. They could be medians, means, quantiles, or any other sensible parameter. The hope is that the experimental or population differences will be reflected in the parameter values. We assume that we have estimates t1 and t2 of the two parameters.

The usual situation is that we have done the "same" type of experiment, or collected data in the "same" sort of way, in two different circumstances. Although the circumstances are similar with regard to their form, they also differ in some significant ways, the consequences of which we worry will be captured in the values of θ1 and θ2. In any case, the estimates t1 and t2 are computed in the same fashion from the two resulting samples.

We will have separate support functions, support(θ1:t1) and support(θ2:t2), each obtained using the methods described in the preceding chapter. This notation includes the fact that the first support function is computed only from t1 and the second only from t2. What I want to construct is a joint support function of the form support(θ1, θ2: t1, t2) for both parameters based on both estimates.

The conceptual definition of support is one minus the largest p for which a p estimation interval would exclude the parameter value. Here "p estimation interval" means one with coverage probability p. Consider the fact that each of the individual support functions is determined by a complete family of estimation intervals for its parameter. From any pair of such intervals we can always get a joint estimation rectangle for both parameters; it is the collection of pairs (θ1, θ2) such that each component lies in its estimation interval.


Now focus on a particular pair (θ1, θ2). An estimation rectangle will exclude this pair if (and only if) it excludes at least one of the two. If it excludes the first, then we maximize its coverage probability by taking all possible values for the second. If it excludes the second, then we maximize its coverage probability by taking all possible values for the first. Thus we have two maximized estimation rectangles, each with their individual coverage probabilities. If we take the maximum of the two coverage probabilities, and subtract from 1, then this is the same as the minimum of 1 minus each of them. We have established

If support(θ1) and support(θ2) are support functions for mathematically independent parameters θ1 and θ2, then their joint support is

support(θ1, θ2) = min { support(θ1), support(θ2) }

The computation of support based on estimation intervals is capable of generalization to an arbitrary number of parameters. Let θ1, …, θm denote m parameters, usually of the "same type" but estimated in different experiments or circumstances. Then the joint support for all parameters is given by

support(θ1, …, θm) = min { support(θ1), …, support(θm) }
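As a toy numerical illustration (not the package's own code), two support functions tabulated on a shared grid can be combined by taking the pointwise minimum, which gives the joint support for a common value of the two parameters:

grid <- seq(0, 10, by=0.5)     # shared grid of candidate parameter values
s1 <- exp(-abs(grid - 4))      # made-up support curve peaking at 4
s2 <- exp(-abs(grid - 6))      # made-up support curve peaking at 6
joint <- pmin(s1, s2)          # joint support for theta1 = theta2 = each grid value
grid[which.max(joint)]         # best-supported common value: 5

This is essentially what the suppcomb2() function described later in the chapter does, after putting the two inputs on a common grid.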

In deriving the joint support for multiple parameters I have used my original definition based on confirmation/disconfirmation through central probability intervals and estimation intervals. If that principle made sense in the case of one parameter, then it seems to extend effortlessly to two or more parameters. Further, it says that if we have a collection of parameter values, each of which is well-supported by its corresponding sample, then the summary degree to which they are all supported should be high. Conversely, if one of the values is poorly supported, then the summary support should also be low.

There are three further things to emphasize. The first is that no assumption has been made about the various samples that are involved. In particular, they have not been assumed independent. The second is that the parameters are mathematically independent. This means in particular that there are no automatic relationships between them. As a trivial counterexample, if θ2 = θ1 + 1 then there could be cases in which their joint support (by the above equation) is rather high for impossible pairs that should have support zero.

The third comment is that the above derivation completely does away with one of the most peculiar features of conventional inference, which I will now describe. The source of the problem is with Fisherian hypothesis testing. Fisher's thinking was dominated by controlling the probability of saying that one of his crop treatments had an effect, when in fact it did not. This was because he wanted to avoid putting further experimental effort into a treatment that was not outstandingly good. If you apply this reasoning to two treatments simultaneously, then the probability of declaring one or the other (or both) to have an effect, when in fact neither did, can only be bounded by the sum of the two probabilities for them individually. So Fisher's followers would say that if you bound the probability of falsely finding a treatment effect to 0.05 in each case, then you cannot assert any bound below 0.10 for them both. (If you assume independence this can be improved, but only marginally.) The upshot is that if you want to test m treatments, with a bound of 0.05 on the probability of declaring any of the ineffective ones effective, then you must test each one with probability 0.05/m. This is called the "Bonferroni" adjustment. There is a uniformly superior way of accomplishing the same thing due to Sture Holm, but somehow it has never caught on.
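A quick check of the "only marginally" remark: with two independent tests each run at level 0.05, the probability of at least one false rejection is

1 - 0.95^2     # = 0.0975, only slightly below the Bonferroni bound of 0.10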

To be fair, it was Fisher's followers, not Fisher himself, who developed this draconian adjustment procedure. There is a corresponding adjustment to multiple confidence intervals that would make them each have coverage probability 1 – 0.05/m, but it is not used so much. Nevertheless, in effect either the Bonferroni or Holm adjustments make estimation intervals very much wider. That is, they exacerbate the tendency of a single confidence interval to make it possible to summarize an experiment with a value that is remote from the estimator. This means that there is a price to be paid for asking more than one scientific question, and it amounts to degrading the inference to the point that you can hardly say anything about anything. The astonishing consequence is that rigid statistical practitioners actually advise (or coerce) their clients to forego asking interesting scientific questions. It is hard to think of anything nice to say about a method of scientific inference that actively discourages the asking of questions.

There is a metaphoric way to interpret this that I find enlightening. The conventional view is that we have a finite budget of potential error probability, which by custom is set at 0.05. Each time we do a Fisherian null hypothesis test, we spend some of that budget on the probability of rejecting a null that is true. If we do only one test, we can expend our entire error budget on it. But if we do two or more tests, we must spend less on each individual one so that the sum does not exceed 0.05. It is inevitable that the decision to test a particular hypothesis degrades the effort to find an effect for all other hypotheses.

The support function view does not propose a finite support budget. A sample can support as many different claims as can be reasonably stated. The fact that it supports one claim does not diminish the support for other claims. The fact that I support climate change activism does not subtract somehow from my support for social justice or economic justice or voting rights. To the contrary, if we are talking about how much financial support I can give to these causes, then I do have a distinctly limited budget. I would maintain that in science, however, we do not have a support budget, because we are not spending anything. Fisher did have an actual budget because there was a time and space limitation on how many treatments he could investigate. So for him, in his special circumstances, spending error probability made sense. His huge mistake was generalizing this to all science.

I maintain that the above definition of a joint support function completely does away with this bizarre, unjustified error-spending methodology. You are permitted to estimate as many parameters as you want, but then you must restrict your joint support statement to equal the weakest support in the entire collection.


Syntax of suppjoin(), suppadj() and suppsm(). There are two possible kinds of support one might want to compute from two support functions. One is the joint support for parameter pairs, and the other is the joint support of a measure that compares them. Both can be done with suppjoin().

suppjoin <- function(s1, s2, FUN=NULL)
# Input
#   s1, s2 = support functions
#   FUN = comparative function
#         example: FUN=function(x,y){ y-x }
# Output
#   support function for pairs (if FUN=NULL)
#   support function for the FUN comparison (otherwise)

s1, s2 = two support functions

FUN = the name of a function previously defined, or the definition of an anonymous function

If no FUN is given, then the output is a support function for pairs, with the format (val1, val2, supp), where val1 and val2 pertain to s1 and s2. Otherwise it is a usual support function for the parameter comparing the two parameters, which is specified by FUN.

For example, suppose S1 and S2 are the support functions for two parameters, and we want to do inference about their ratio. Then do

J <- suppjoin(S1, S2, FUN=function(x,y){ z <- y/x })
plot(J, type="l")

Alternatively, we could do

rat <- function(x,y){ z <- y/x }
J <- suppjoin(S1, S2, FUN=rat)

Anonymous functions allow you to create a function on the fly, without having to name it, which is only a convenience. Also, remember that a function will return the last thing it computes, without a return() statement.

It sometimes happens that a support function which results from a sequence of steps displays some computational artifacts. One frequent occurrence is that there are far too many values in the grid of parameter values. In complicated procedures, the function might not be consistently increasing to the left of its peak, or not consistently decreasing to the right. Or the peak may not occur at 1. Or it may have roughness, with irregular jumps. All of these can be fixed.

suppadj <- function(s, rnd=NULL, dbl=NULL) {
# Input
#   s = support function
#   dbl = number of doublings of values (done first)
#   rnd = power of 10 for rounding values
# Output
#   support function, doubled/rounded

s = a support function

dbl = double values (or halve for negative values)

rnd = power of 10 to round values.

The output is the modified support function.

When a support function is too rough, or even shows oscillations:

suppsm <- function(s) {
# Input
#   s = support function
# Output
#   smoothed support function
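A hypothetical usage, assuming S is a support function produced by a long chain of computations; the argument values are illustrative only, interpreted from the descriptions above:

S2 <- suppadj(S, dbl=-1, rnd=-1)   # halve the grid once, then round values to the nearest 0.1
S3 <- suppsm(S2)                   # smooth away any remaining roughness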

Syntax of suppcomb2(). This is a convenience function for combining two support functions.

suppcomb2(s1, s2, rnd=-2, plot=FALSE, trim=TRUE, dbl=0, title="", xlab="")
# Input
#   s1, s2 = support functions (values, supp)
#   rnd = rounding for values
#   dbl = doubling of values grid
#   plot = request plot
#   title, xlab = plotting options
#   trim = remove NA's from output
# Output
#   dataframe = (values, suppmin, support1, support2)
#   plot if requested

s1 and s2 are support functions. Most functions produce these, but note that suppnorm reverses the columns. Thus it is useful in this case to define (for example)

s1 <- suppnorm(est,sd)[c(2,1)]

rnd is the power of 10 for rounding off the pooled set of values from the two support functions. This is useful to prevent too many exceedingly close values in the output.

dbl doubles the number of grid values. This deals with the case where the input support functions have value grids so coarse that the output support function is too irregular.


plot requests a simple plot of both support functions and their minimum.

trim deals with the fact that failure of the values in the two support functions to overlap creates NA's in the output, by removing them. This can leave one with an empty result, which is returned as a dataframe with a single line of three zeros.

The output dataframe contains three support functions: the two that were entered, but on the combined grid, and their minimum, which is the support of the presumed common value of the two parameters. The columns are (values, minimum support, s1, s2).

Applications of suppcomb2 can be concatenated to get the support for a presumed common value of multiple parameters. For example

S1 <- suppcomb2(s1, s2)
S2 <- suppcomb2(S1[,1:2], s3)

to combine s1, s2, and s3.

Moon radiation. In 1896 Svante Arrhenius published an article in which he calculated the effect on earth surface temperatures that would be due to different concentrations of carbon dioxide in the atmosphere. At that time so little was known about this effect that few would have worried that human activity, increasing carbon dioxide levels, would have any effect on weather. It took more than a century for reasonable people to appreciate the importance of what Arrhenius found.

He did not have much surface temperature data to use. In fact, virtually all of what he did was based on the theory of how the sun's heat energy would reach the surface and then be absorbed or re-radiated back into space. The concept was aptly expressed as a "greenhouse" effect. In a greenhouse much of the incoming radiation is in the infra-red (heat) end of the spectrum, although we tend to focus on the smaller amount of visible radiation (light). In any case, re-radiation tends to increase the wavelength, and glass acts more as a barrier to long-wavelength radiation than to short-wavelength radiation. Thus, there is a difference in the rate at which the glass admits the sun's (relatively short wavelength) energy and allows the escape of the (relatively long wavelength) re-radiated energy, and this difference accounts for the build-up of heat inside the greenhouse. With respect to the earth, the notion is that atmospheric carbon dioxide plays the role of the glass in the greenhouse.

Arrhenius' contribution was to do the theoretical calculations, and this was quite complex. One piece that he needed regarding heat radiation he got from a previous publication (Langley, 1888), in which Langley developed the theory for the easier-to-observe radiation from the moon. Although the moon's heat radiation is very small, it was detectable, and far easier to deal with than trying to analyze the sun's heat radiation. The only issue with Langley's data that I want to investigate here has to do with five series of similar measurements, each of the amount of lunar radiation, obtained under slightly different conditions, because Langley wanted to see if altering the measurement process slightly had any effect on the results. Thus the issue is to determine whether the five series were essentially the same.

The analysis is in qf046.r. The computations in [1] are to provide for converting angular values to wavelengths, and we need not worry about that here. Section [2] is to apply the conversions, ending up with L containing ser (the five groups of measurements), the wavelength w (which we ignore), and the radiation measurement rad. [3] is an optional check on the distributions within each series. In [4] we compute the QF-curve for each series, and then the support function for the median. It is convenient when dealing with multiple objects of the same type to put them into a list, which is S here. The str() function will display this structure, and you can check that subsequently I access each support function as an element in the list. In [5] we go through unique combinations of series, each time producing the combination of the support functions for the medians. Note that the elements of the list, each S[i], are support functions. We not only make a plot each time, but we capture the maximum of the support function for the supposed common value of the two medians. Here are the supports for the equality of medians:

       2       3       4       5
1    0.446   0.648   0.282   0.321
2            0.750   0.000   0.000
3                    0.000   0.000
4                            0.834

where the row name is the first series and the column name is the second series. Check the output to see how table2() reformatted X to produce this. The graphics of the support function combinations are shown below. Note that in some cases there is no overlap of the domains of the support functions, so that the implied maximum support is zero, and there is no plot.
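A rough sketch of the pairwise loop in [5], using suppcomb2() as documented earlier; S is assumed to be the list of five support functions for the series medians, and the loop bookkeeping is illustrative rather than copied from qf046.r:

X <- NULL
for (i in 1:4) {
  for (j in (i+1):5) {
    cmb <- suppcomb2(S[[i]], S[[j]], plot=TRUE, title=paste(i, "&", j))
    # maximum support for a supposed common median of series i and j
    X <- rbind(X, c(i, j, max(cmb[,2])))
  }
}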


[Figure: six panels of combined support functions for pairs of series, titled 1 & 2, 1 & 3, 1 & 4, 1 & 5, 2 & 3, and 4 & 5.]

Fig. 9-7. Combinations of support functions for medians in the 5 series. Some plots do not appear because there was no overlap in the domains of the support functions. (qf046.r)

From the table above, we have strong support for 4 and 5 having the same median, which we can consider a pooled 4&5. Next, 2 and 3 are reasonably supported to be the same, so we form 2&3. There is no support for either of 2 or 3 being the same as either 4 or 5, so these are quite distinct subgroups. 1 is a bit peculiar, since it could conceivably be equal to any of the others, but if a choice were necessary we would group it with 2&3. Our options are thus

1, 2&3, 4&5 with support 0.750, or

1&2&3, 4&5 with support 0.446

In qf046.r, section [6] computes the QF-curves for the first grouping and plots them together. The way to compare these is to pick a fractile and then run your eye along the graph to see where each QF-curve crosses it. It appears that 1 has uniformly lower values than 4&5, and that 2&3 looks like 1 at the lower end but like 4&5 at the higher end.
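In code, picking a fractile just means reading each curve at the same probability. A minimal sketch with made-up data, using quantile() as a stand-in for the QF-curves of [6]:

set.seed(2)                                  # made-up radiation-like values
groups <- list(g1  = rexp(40, rate = 1/3),
               g23 = rexp(60, rate = 1/5),
               g45 = rexp(50, rate = 1/8))
frac <- 0.5                                  # the chosen fractile
sapply(groups, quantile, probs = frac)       # where each QF-curve crosses it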

[Figure: QF-curves Q1, Q23, and Q45, plotting Frac against Radiation.]

Fig. 9-8. QF-curves for the three series groups. (qf046.r)

The overall conclusion seems to be that it would be risky to treat all five groups as being equivalent with respect to median radiation, and there may be either two or three distinct subgroups. Why the groups differ is a question that requires investigating how the circumstances of the samples varied.

Homogeneity support. The joint support of any collection of parameter values is the minimum of their individual supports, assuming they are mathematically independent (no relationships among them). Evaluating the joint support with all values equal is the support for them having that common value. Then the support for the claim that the parameters are all equal is the maximum such support. That is,

support(equal θ’s) = max over θ of support(θ, …, θ)

We can then define the conditional support

support(common value θ | equal θ’s) = support(θ, …, θ) / support(equal θ’s)

The first definition gives the support for equal θ’s as the largest value of the joint support at which all components take the same value. The second definition says that the support for any common value of θ, given that there is one, is the joint support for that value for all components, normed to have a maximum of 1.
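As a minimal sketch of these two definitions, assuming the individual support functions have already been evaluated on a common grid of θ values (the numbers here are made up):

theta <- seq(-1, 1, length.out = 401)          # common grid of parameter values
sup <- sapply(c(-0.10, 0.00, 0.15),            # three made-up support functions, one per column
              function(m) exp(-0.5 * ((theta - m) / 0.2)^2))

joint <- apply(sup, 1, min)                    # joint support at a common value
support_equal <- max(joint)                    # support for the claim that all are equal
conditional <- joint / support_equal           # conditional support, normed to a maximum of 1
c(support_equal = support_equal, best_common = theta[which.max(joint)])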

To be clear, there are two issues here. The first is whether there is support for the claim that the true parameter values are all equal. Without this claim being true, there is little reason to talk about their common value. If the support for the assertion of a common value is sufficiently high, then there is reason to go further to compute the support for this common parameter. It might (or might not) then make sense to condition on the assumption that there is a common parameter, which is the second definition above. On the other hand, if the assertion of a common parameter has poor support, then one should look in more detail at which parameter subsets seem homogeneous, work with those, and search for explanations.

Statins. The shift in medical research over the past century has been away from treatment of infections and acute illnesses and toward chronic diseases and their prevention. One reason is that in economically advanced countries, more people are threatened and afflicted by the latter than by the former, and these are the countries where medical research is concentrated. There is, however, a further economic reason. The more people at risk for a bad medical outcome, the larger the pool of available customers for preventive products, so that again attention toward prevention is often emphasized in more affluent countries. The other shift in medicine has been towards drugs as solutions for virtually all problems, and because it is expensive to move drugs to market, pharmaceutical companies have been keen to expand the market for each successful drug to the maximum extent possible.

The class of drugs known as statins is a good example. There have been multiple studies implying that survival of heart disease patients is prolonged by statins. While this is encouraging in a medical sense, in a marketing sense the pool of available heart disease patients might be considered too small. It would be very useful for a manufacturer to obtain a wider “indication” for statins. (“Indication” is a term of art which refers to the collection of conditions or patient characteristics for which a regulatory agency would permit use of the drug.)

Statins generally seem to lower low-density lipoprotein (LDL) in the blood. Starting with questionable research by Ancel Keys (discussed in Relationships), high blood cholesterol (a lipid) has been associated with heart disease. The marketing argument that came to be developed was that people with high LDL should be taking statins, in order to prevent heart disease. The pool of high-LDL people is much larger than the pool of heart disease patients.

This strategy has been strongly supported by some, and equally strongly rejected by others. The “pro” group has cited studies showing that statins appear to produce a range of favorable physiologic changes, of which lowering LDL is only one. The “anti” group has argued that most people will not benefit from statins at all, and that a tiny fraction suffers very severe side effects.

My position is somewhat mixed. As a person with heart disease I take a statin, partly due to the favorable studies for patients like me, and partly to please my cardiologist. But as a scientist I do not see that there has ever been a study that correctly evaluated the claim that statins protect against heart disease through the mechanism of lowering LDL. I’ve outlined what mechanistic studies need to look like (Aickin 2007), and statins have never been studied this way.

The data that I want to consider here were concerned not with preventing heart disease, but with postponing death from any cause among people without heart disease who were nevertheless at intermediate or high risk of it. Given that statins have a range of apparently favorable, but poorly understood, physiologic effects, it makes sense to jump to studies of their effects on all-cause mortality. Of course, since everyone is eventually at risk of death, the pool of statin customers would be greatly enlarged if such studies showed an effect.

The data in qf047.r are from nine studies of potential statin effects on all-cause mortality. For us the only interesting parts are the rate-ratios: the mortality rate in the statin-taking group divided by the mortality rate in the placebo group. The evaluation of a set of studies on a single treatment for a single disease or condition in a well-defined population of patients is called a “systematic review”. Our data come from such a systematic review (Ray et al. 2010).

The ostensible point of a systematic review is to come to an overall conclusion about whether the treatment helps or not, a favorable decision supporting an expanded indication. Our purpose differs from this only in that we want to portray the evidence accurately in terms of support, leaving decision-making to a later stage. Consequently, our first question should be whether there is anything worth summarizing; that is, do the studies preponderantly point in one direction? The support functions for the mortality rate-ratios of the nine studies are shown in Fig. 9-9. These are based on results extracted from the article, using Normal-based inference for the log rate-ratios, ln(rr). We are reminded of Charles Poole’s admonition that anyone looking at support functions will be far better informed than anyone who does not. There seems to be a split, with four studies showing beneficial effects and five studies showing zero or harmful effects.
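The Normal-based support used here is just the normalized Normal likelihood of ln(rr). A sketch with illustrative numbers (not the values extracted from Ray et al. 2010), recovering the standard error from a reported 95% interval:

rr <- 0.91; lo <- 0.79; hi <- 1.05           # illustrative rate-ratio and its 95% CI
est <- log(rr)
se  <- (log(hi) - log(lo)) / (2 * 1.96)      # SE of ln(rr) recovered from the interval
lrr <- seq(est - 4 * se, est + 4 * se, length.out = 301)
sup <- exp(-0.5 * ((lrr - est) / se)^2)      # support = normalized Normal likelihood
plot(lrr, sup, type = "l", xlab = "ln(RR)", ylab = "Support")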


[Figure: nine support functions plotted against ln(RR).]

Fig. 9-9. Support functions for the log mortality rate-ratios in 9 statin studies. (qf047.r)

We could use the approach based on combining support functions of pairs of studies, which would give us 36 support functions. It is fairly evident from the above graph that we would only learn what the graph already tells us, that we appear to have two subgroups of studies, each of which appears reasonably consistent, with little or no consistency between any study in one group with any study in the other group.

To be complete, our first step should be to quantify support for homogeneity. This is done in [4], with the graphic in Fig. 9-10, showing the minimum of all nine support functions. The maximum value of this support function is the support for homogeneity, which is about 0.125, and so homogeneity is supported only very weakly. If one were to take this small amount of support as being adequate, then the support function would be normalized by dividing by its maximum, giving the conditional support for a statin effect (on the ln(rr) scale) given that the studies are homogeneous. Since only negative values would then have any degree of support, the overall conclusion would be a beneficial statin effect.


[Figure: Homogeneity Support; the minimum of the nine support functions (supp) plotted against the common value (val).]

Fig. 9-10. Support for an assumed common ln(rr) in all 9 statin studies. (qf047.r)

A more sensible path would be to take the position that homogeneity must be reasonably well supported, perhaps even strongly supported, before one should even bother computing a conditional support function. There is no point summarizing something if the evidence is that there is nothing to summarize. By this argument, probably the next thing to do is to accept the fairly obvious pattern of Fig. 9-9, and divide the studies into two groups based on their estimate of the statin effect. The homogeneity computations for each group separately are in [5], with the output graphic in Fig. 9-11. One group has homogeneity support about 0.58, with well-supported statin effects around -0.25 to -0.3. The other has homogeneity support about 0.67 and well-supported effects around 0 to 0.05.

My interpretation is that there is inadequate support for homogeneity across all nine studies, but that there is support for two homogeneous groups of studies. One group indicates no effect or probably a small harmful effect, while the other group indicates a beneficial effect. Whether either of the indicated effects is of any clinical significance is an issue that requires investigation by clinicians.


[Figure: Homogeneity by Group; Support plotted against ln(rr) for each of the two groups of studies.]

Fig. 9-11. Support functions for statin effect in two seemingly homogeneous groups of studies. (qf047.r)

The systematic review article performed a significance test and concluded that the homogeneity hypothesis could not be rejected. This was done in a perfunctory manner, because the whole remainder of the article was based on the homogeneity assumption, and if the hypothesis test had led to rejection then there would have been no article. My analysis casts considerable doubt on homogeneity, and should make one uneasy about an analysis that is based on it. In general, one wants to work with models or under hypotheses that the data support, not just ones that are not badly unsupported.

In its final summary, the systematic review article came to the conclusion that there was no overall (pooled across all studies) benefit, so that there was no reason to promote statin use in the general population, at least not from the standpoint of prolonging life. My analysis would certainly agree with this latter point, but for an entirely different reason: some patients may be helped and others may receive no benefit or be harmed, but since we have no basis for differentiating between them, there is no support for a general recommendation in favor of statins. The hanging issue this leaves is, if there is a subpopulation of people who could be helped, who are they?


One might surmise that by intensive analysis of the actual studies, there might be some clue of how to find the subgroup of people that may benefit from statins. I looked at all of the characteristics listed in the systematic review, computing how they related to the study results, and found none that would even suggest an answer to this question. Since the authors of the systematic review relied on significance testing for their evaluation, they did not see the opportunity to identify a benefited subgroup, and so they did not look for it. Because the source data of the studies are not publicly available, and probably no longer exist in an analyzable form, a marketing opportunity for the pharmaceutical industry was lost.

There are two interesting messages in this example. The first is that it is quite common for medical researchers to use Fisherian hypothesis testing to decide the homogeneity question. This amounts to saying that unless there is extreme evidence of heterogeneity, the homogeneity claim will be accepted. It is very well known that analysis based on an incorrect model can be misleading just for that reason, and so the conventional medical research strategy is at best risky and at worst delusional. But that is not the end of the problems. What frequently happens next is another Fisherian hypothesis test, now of the hypothesis of no treatment effect. If there is enough heterogeneity among the studies, then by increasing the imprecision of the analysis, a no-effect decision may be reached, as it was in the statin case. On the other hand, if the potentially heterogeneous studies happen, when inappropriately combined, to provide sufficiently little support for no effect, then this is taken as the conclusion. That is, even if the studies are not consistent among themselves, when they are forced into an analysis that treats them as being consistent they may show a treatment effect. This is the usual pattern in overly optimistic meta-analyses. As in the case of statins, it sometimes happens that there is no meaningful overall effect, but instead there is a subclass of studies pointing in the effectiveness direction and another subclass pointing in the opposite direction. In such cases an overly pessimistic meta-analysis discourages further research.

The irony here is that the push for clinical trials, beginning in the late 1970’s, was supposed to consolidate and solidify our knowledge about which treatments work and which do not. Recall Wennberg positing that physician practices varied in part because of their ignorance about what was in the literature. Obviously systematic reviews were directed toward solving this problem. Although many such reviews do accomplish this, it is at least mildly unsettling how many do a better job of making it plain that the included studies do not speak with one voice. The bias that has developed has been to try to ignore between-study differences and pretend that there is a simple message.

Gravity. After you get past Newton’s three laws about dynamics, his fourth law describes gravity. It says that the force of gravitational attraction between two masses is proportional to the product of their masses divided by the square of the distance between their centers. The issue is then the “proportional to”, which means there is some multiplicative constant. It is called the “gravitational constant” and denoted G, which has led to it being called “big G”.


Most physicists believe that G is a universal constant; that is, it is the correct multiplier for Newton’s fourth law across the entire universe. Therefore it is worthwhile to nail down its value, and this is where the problems start. Experiments to measure G have to take place on the surface of the earth. The difficulty is that G is tiny, so the experiments have to be delicate. We do not usually think of gravity as being a weak force, but it is – very weak. If you consider that the gravity we experience is the result of G times the mass of the earth, which is huge, it becomes clear why G is so small. In any case, its tiny size means that attempts to measure it might give inconsistent answers, and this is precisely what has happened.

Data on some recent attempts to determine G are analyzed in qf048.r. The data file was set up for convenience of data entry, and so some initial manipulations need to be done to get sensible values. Each estimate of G is accompanied by what physicists call its “uncertainty”. This is certainly not a defined term in data analysis, and I do not believe it is even well-defined in physics, but experts in this area treat it as if it were the standard deviation of a sample of estimates, and so that is what I will do. The first plot is of the G estimates and bars representing two standard deviations on each side of the estimate. These are conventional 0.95 confidence intervals, but I prefer to think of them as slices of support functions. There are two astonishing things about the plot: (1) the estimates differ so much, and (2) their uncertainties differ so much. Incidentally, the reason for plotting them against the year in which they were published is that one often sees initial disagreements turn into agreement as the experimental technique becomes refined and improved. We do not see that here, which is perhaps the third astonishing thing.
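A sketch of how such a plot can be drawn, with made-up values standing in for the estimates and “uncertainties” that qf048.r reads from the data file:

year <- c(1986, 1995, 2001, 2007, 2010, 2014)          # made-up publication years
Gest <- 6.674 + c(2, 15, -3, 25, 1, 18) * 1e-4         # made-up G estimates
unc  <- c(6, 40, 3, 10, 2, 19) * 1e-5                  # made-up "uncertainties" (SDs)
plot(year, Gest, pch = 16, xlab = "Year", ylab = "G",
     ylim = range(Gest - 2 * unc, Gest + 2 * unc))
segments(year, Gest - 2 * unc, year, Gest + 2 * unc)   # 2SDE bars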


[Figure: G estimates with 2SDE bars plotted against Year.]

Fig. 9-12. Estimates of the gravitational constant G, and their putative 2SDE intervals. (qf048.r)

Even though we know roughly how it is going to look, we next show the individual support functions for the estimates. It is clear that there is inconsistency among the studies, although there do seem to be some clusters of studies that reasonably agree. Social scientists, who are used to seeing published studies contradict one another, might take heart here that the hardest of all physical sciences sometimes has the same embarrassing problem. I started this project with the intent of combining all the support functions into one, but no special computational procedure is available for doing this, due to their extreme disagreements.


[Figure: support functions plotted against G (trailing digits).]

Fig. 9-13. Support functions for the estimates of G from Fig. 9-12. (qf048.r)

One might throw up one’s hands and leave it by just saying that there are problems. The barrier to doing this is that astronomers, and especially cosmologists, have a vital need to know the value of G. They can make measurements of movements of extremely distant objects, but the only way they have to determine their masses is indirectly, by measuring the gravitational effects that they have on each other, and if we have G wrong then they have the masses wrong. This might lead to stories explaining the observations which are in fact only side-effects of an inaccurate G. In any case, physicists collectively want there to be a “standardized” or “accepted” value of G for all of their research needs. So the problem is how to forge concordance out of discordant data.

Here is one approach. We do not know exactly how the physicists who did the experiments computed their “uncertainties”. It appears to be based on a methodology that has grown up within the G-estimation community, independently of science more generally. Whatever variability these putative standard deviations measure, they clearly leave out “method variability”; that is, the variability due to different scientists using different methods. This suggests that, in order to force some kind of consensus, we could vastly inflate the uncertainties so as to get support functions that could conceivably be combined.


Here are the results of replacing all uncertainties with the standard deviation of G estimates across all studies. This is about 20 times their average. There are two little minimum support functions, one with a maximum of about 0.10 centered on 4000, and a second with maximum 0.40 centered a bit above 3000 obtained by dropping the two largest estimates. The first one would apply if we include all studies, and the second one if we exclude the two studies reporting relatively large values. Of course the one that discards two studies has higher maximal support, which is in the “ambiguous” range, whereas the smaller is “questionable”. One could argue that this approach ignores the basic issue, that the studies seem to be estimating different constants, but it does appear to be a reasonable support-based method of doing the best one can to resolve the need for a “consensus” value. We have at least succeeded in narrowing the options from 14 to 2, and provided support functions in both cases.
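The forcing device amounts to a single substitution. A sketch, again with made-up values rather than the data in qf048.r:

Gest <- 6.674 + c(2, 15, -3, 25, 1, 18) * 1e-4   # made-up G estimates
sd_between <- sd(Gest)                           # between-study SD replaces each reported "uncertainty"
grid <- seq(min(Gest) - 3 * sd_between, max(Gest) + 3 * sd_between, length.out = 401)
sup  <- sapply(Gest, function(m) exp(-0.5 * ((grid - m) / sd_between)^2))
joint <- apply(sup, 1, min)                      # minimum support function across studies
max(joint)                                       # support for a forced common value of G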

[Figure: support functions plotted against G (trailing digits), using the inflated standard deviation.]

Fig. 9-14. Support functions for G estimates, using a “between methods” standard deviation, in order to force a consensus. (qf048.r)

Because it was one of his laws, Newton knew about G, but he never had an idea about its value. His data came exclusively from the solar system – planets, moons, and the occasional comet. For his purposes, gravity applied to point masses at appreciable distances, because the radius of a planet is a tiny fraction of the size of its orbit. He only needed to know ratios of quantities, in which G always cancelled out. The first measurement of G was in 1798 by Henry Cavendish, using a torsion balance that became standard for subsequent attempts. Modern methods have vastly increased in subtlety and precision, but one suspects that the proliferation of experimental approaches may be exactly the source of between-study variability. The physicists are indeed trying to measure the same thing, but their instrumentation is subject to influences that are very difficult to detect and quantify. An alternative would be that Newton was wrong, and that there are in effect multiple different kinds of gravity. Or perhaps gravity behaves differently on the surface of a planet than it does throughout a solar system. And if it is the latter, then we have to ask whether gravity works the same within an entire galaxy as Newton found it to work for planets and comets.

General relativity. Albert Einstein published the special theory of relativity (for systems moving with constant velocity with respect to each other) in 1905, and the general theory (for accelerating systems and gravity) in 1916. Although these two events are often represented as revolutions solely due to Einstein’s genius, in fact quite a few theorists had been working on the same ideas, Joseph Larmor, Hendrik Lorentz, and Henri Poincaré in the case of special relativity and David Hilbert in general relativity. Hilbert would have been the discoverer of general relativity if Einstein had procrastinated even a few weeks. In fact, Poincaré presented the modern Lorentz transformation, the key element in special relativity, some 3 or 4 weeks before Einstein submitted his article. Lorentz himself had published it in 1904, and Joseph Larmor in 1900. In many ways Einstein simply pieced together the work of others to produce special relativity, but general relativity was more uniquely his idea.

Unlike many discoveries in physics, general relativity was not developed to explain some previously mysterious experimental result. In fact, as scientists absorbed and debated Einstein’s ideas, there were no experimental tests at all. The first feasible opportunity appeared in 1919, when the British supported two expeditions led by Arthur Eddington to locations where a solar eclipse would be observed, one in Sobral, Brazil, and the other on the island of Principe. The idea was that Einstein’s theory implied that starlight would bend as it passed near the sun, and the bending would be about twice what was predicted by Newton’s theory. In actuality, the “Newton” prediction was produced by Einstein (or Eddington, it is not clear which), putatively using Newton’s concept of light consisting of particles. The alternative prediction was said to follow from relativity, although this turned out to be less than clearly true. In any case, a star appearing near the rim of the sun would show an apparent shift in position, based on the projection of its trajectory when it was far from the rim, and the amount of shift could be used to compare the two theories. Of course it would only be possible to observe this phenomenon during an eclipse, when the moon blotted out enough of the sun to make it possible to see stars near the solar rim.

Eddington triumphantly announced that the experiment had proved Einstein right, and that Newton’s physics had therefore been supplanted. This was an overstatement, because testing a single prediction of a theory does not validate the theory. The results did, however, decisively show that masses were capable of bending light, and as I will discuss below, that “Newton’s” prediction was very poorly supported by Eddington’s data.


As part of the celebration of the 60th anniversary of Eddington’s expeditions, a re-analysis of the photographic plates was carried out by G.M. Harvey in 1979. Miraculously most of the plates that had been selected for analysis from one of the two expeditions had survived, and were in good enough shape to be re-measured by modern equipment. We can look at the re-analysis in qf049.r .

Harvey summed up his computations, and we will look at this first. In [1] we see the mean deflections and their SDEs according to Harvey, for the two telescopes that were used. One was a 4-inch (scope = 4) and the other was a 13-inch (scope = 13). In [2] we compute the support functions for the two scopes separately (S1 and S2), and for their average (S3). The top two panels in Fig. 9-15 are of S1 and S2, with the dotted line representing S3. The conclusion Harvey reached is that the average of the two estimates supports Einstein’s prediction of 1.75 seconds of arc (the vertical line). Our computation shows that the supports for this value are 0.17 for the 4-inch, 0.56 for the 13-inch, and 0.88 for their average. The problem with this is that it does not take into consideration whether the two scopes were measuring the same thing. To check this we compute the support function for the difference in mean deflections for the two scopes, resulting in the top right panel of Fig. 9-15, and a support of 0.33 for homogeneity (that they both estimate the same quantity).
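The homogeneity check is just the Normal-based support for the difference between two independent means. A sketch with illustrative numbers (not Harvey’s published values):

m4  <- 1.55; se4  <- 0.10       # illustrative mean deflection and SDE, 4-inch scope
m13 <- 1.98; se13 <- 0.35       # illustrative mean deflection and SDE, 13-inch scope
d   <- m13 - m4
sed <- sqrt(se4^2 + se13^2)     # SDE of the difference of two independent means
delta <- seq(d - 4 * sed, d + 4 * sed, length.out = 301)
sup   <- exp(-0.5 * ((delta - d) / sed)^2)
exp(-0.5 * (d / sed)^2)         # support for a zero difference (homogeneity)
plot(delta, sup, type = "l", xlab = "Seconds of Arc Difference", ylab = "Support")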

The problem this creates is a serious one. If one has measurements of what should be the same thing by several different instruments, and there are differences between their estimates, then there is a fundamental question about what is really being estimated, if anything. The conventional view of the homogeneity result above would be that 0 is not ruled out, so one is licensed to proceed as if the true mean difference were 0. But this is inappropriate. The issue is not whether there is massive evidence against homogeneity, but instead whether there is adequate support in favor of it. Remember that if one selects an incorrect model for analysis (homogeneity in this case) then all subsequent analysis is potentially wrong. The support for homogeneity in this case is quite questionable.


[Figure: four panels. Top: Re-analysis of Eddington’s Data (Support vs. Seconds of Arc) and Homogeneity (Support vs. Seconds of Arc Difference). Bottom: Re-analyzed Raw Data and Homogeneity Raw Data, on the same axes.]

Fig. 9-15. Re-analysis of Eddington’s data on the deflection of starlight near the rim of the sun. (qf049.r)

We can see why. The 13-inch results have considerably more variability than the 4-inch results. This has the effect of spreading out the support function, and that contributes to why the support for 1.75 is so high for the 13-inch data. It also causes the support function for homogeneity to spread out, which is why 0 has as much support as it does. This is a more precise way of saying that the critics’ complaint that the plates were too fuzzy to draw any conclusions is, at least for the 13-inch data, supported by the observations. From this perspective, Harvey got the final result he did by averaging two heterogeneous estimates, which happened to be on either side of the target value, and then using their large variability to settle on 1.75 as being the true value. For an issue as important as whether relativity is right or not, this is not a very impressive analysis.


(The 4-inch telescope was only included in the expedition at the insistence of an astronomer who could not make the trip. After the fact the story was told that the plates from the 13-inch telescope had warped, while the 4-inch plates had not.)

In this case we can take one step further back, into the raw data on which Harvey’s summary is based, because he included them in his article. The data are copied into a data frame directly in [4], and printed out to check against the publication. (The values here need to be multiplied by 1.75 to be on the same scale as the previous analysis.) Using QFview() on the 4-inch and 13-inch data separately strongly suggests that each dataset contains a low outlier. For the subsequent analysis here I remove them both. The usual argument against removing outliers is that we expect the data not to have any, so it is surprising to see some, and therefore additional evidence is necessary before attending to them. In this case the historical evidence emphasizes the difficulty Eddington’s team had in getting any usable measurements at all, and this raises our expectation that there will be outliers. (In any case, you are free to re-run the analysis including the outliers to see if it makes any difference.)

The result is in the lower panels of Fig. 9-15. On the left the support for 1.75 is 0.17 for the 4-inch and 0.71 for the 13-inch, but only 0.42 based on the average. The homogeneity support for 0 (right panel) is now 0.91. Consequently, the effect of using raw data and eliminating outliers has been to increase our confidence that the two datasets are estimating the same thing, while lowering the support for Einstein’s value. The results are ambiguous about Einstein’s prediction, but they certainly refute a deflection of about 0.85, supposedly implied by Newton’s theory.

From the modern statistical perspective, Harvey tested the wrong hypothesis. The rejectionist philosophy is that you pose (as the null hypothesis) a value that you would like to disprove, and then show that your data give very low support to that value. Harvey set up his analysis from a confirmationist viewpoint, posing the value that he wanted to prove (1.75), and then he used a rejectionist analysis by showing that his data did not give that value very low support. What Harvey needed to do was to follow his confirmationist formulation with a confirmationist analysis, which would have required high support for the value he wanted to be true, but he failed to do this. When properly re-analyzed the data are ambiguous about Einstein’s prediction. Thus the critics were right that Eddington did not “prove” Einstein right, and they correctly identified the problem, that (in our terms) the support function is so wide that it gives good support to values that contradict Einstein. A massive amount of subsequent research is said to show that Eddington was right, even if the evidence that he had at the time was inadequate.

On the other hand, along with scientists at the time Eddington took the position that the only plausible alternative to Einstein’s theory was the value purported to have been derived from Newtonian kinetics, which gave a value roughly one-half of Einstein’s 1.75. No very sophisticated reasoning is necessary to reject the Newtonian value using Harvey’s data, so if the issue really was Einstein vs. Newton then clearly Einstein won.


But one might nevertheless be bothered by the fact that the results point toward a value exceeding Einstein’s prediction. Although it did not happen in this case, it is not unusual for experiments testing two rival theories to give results somewhere between the two predictions, often casting doubt on both theories. In these cases the standard practice is to “confirm” whichever theory came closer, yet another instance in which hypothesis testing distorts inference.

I have presented the analysis here from the point of view that in addition to measuring support for Einstein’s theory, there is an issue of homogeneity that needs to be addressed. Indeed, statisticians are often drawn to questions like this, but sometimes they miss the point. In 1919 there were two expeditions. The re-analyzed plates all came from the one to Sobral (Eddington himself observed with the Principe party). Eddington took part in deciding which plates to discard, and the average of the discarded plates was evidently 0.93.

There is a post-script to this story that makes the analysis of Harvey’s data only vaguely relevant. In 1930 Charles Poor published an article that went into depth on all the aspects of eclipse investigations, including the 1919 one and another Australian one in 1922. Poor pointed out the problems posed by trying to determine that star positions had apparently shifted, using one plate taken during the eclipse and another, made later and under rather different conditions. The interpretational and computational problems were both fearsome and assumption-laden. Poor pointed out that the 1.75 figure was not obtained from general relativity, but by inserting a virtually arbitrary retardation factor into a known pre-relativity equation involving the propagation of light. He showed that the issue of aligning the plates was a major problem, and that the complex mathematical method of doing the alignment was based on general relativity. He also presented evidence that even after all of the considerable data-massaging done by Eddington and Dyson, there was remarkable variability in the amount and direction of the apparent star shifts. The data that was published included only the radial components (that is, deflections toward or away from the sun), ignoring the fact that the non-radial components could be taken as evidence against the reliability of the observations.

I would argue that even if Poor was right, my analysis of Harvey’s data is still sensible. The reason is that the only thing support functions can do is express what a sample has to say about parameter values. Whether the sample itself is valid or trustworthy is a subsequent concern, and complete scientific inference should go beyond just support functions to more general considerations.

As a post-post-script, the results of the 1919 expeditions began the process of turning Albert Einstein into a popular icon. He caught on perhaps because it was noteworthy for a physicist to become a celebrity, and he did what he could to cultivate his peculiarity, seldom combing his hair and usually going without socks. His life after relativity was mostly a failure. He objected to quantum physics, which dominated the field from the mid 1920’s, left Germany in 1933 when Hitler was appointed Reichskanzler, and spent the remainder of his life (to 1955) in the US generating abortive newspaper stories about breakthroughs in finding a unified field theory. He had been awarded the Nobel Prize in 1922 (the delayed 1921 prize) but explicitly not for his relativity theories – the only time the Nobel committee has specifically excluded work of a prize recipient – which one might possibly consider an indirect criticism of Eddington’s exuberance.


Tasks

1. It is widely accepted that cigarette smoking affects lung function, but this is usually studied in long-time smokers. Thus it is interesting to study whether the effect can be seen in teenage smokers, who have only just started. Do a comparative analysis of fev (forced expiratory volume in 1 sec) with the intent of seeing any diminution among teenage smokers. The data can be accessed by

load(file="../Data/Objects/fev", verbose=TRUE)
dfdesc(F)

Note the difference in age distributions. Re-do the comparison in a way that takes potential age effects out of the picture. (The data are probably real, but the web documentation is too poor to locate the source.)

2. Compare the distributions of alkaline phosphatase blood levels in men and women, which you cleaned up in Task 3 of Strategies.

3. Access some data on autism and mercury levels using the following script: load("../Data/Objects/autism", verbose=TRUE). The three variables of interest are A$Aut (an indicator of an autistic child, as opposed to a non-autistic child), A$BloodHg (blood mercury level), and A$HairHg (hair mercury level). Compare the median levels of blood and hair separately, between the autistic and non-autistic children. Which measure, if any, would you recommend for further study?

This dataset became notorious for the usual strange reasons. The original authors (Ip et al. 2004) miscomputed the “p-value”, but a better job was done by the authors who reanalyzed the data (DeSoto & Hitlan 2007), and then the original authors in response miscomputed it once again. Your analysis should make it clear that there is no difficulty in interpreting these data. This example illustrates how easy it is to create confusion with conventional statistical methods.

4. An analysis of published studies of the possible increase in various kinds of cardiovascular death related to intake of saturated fat gave the following summary:

The pooled relative risk estimates that compared extreme quantiles of saturated fat intake were 1.07 (95% CI: 0.96, 1.19; P = 0.22) for CHD, 0.81 (95% CI: 0.62, 1.05; P = 0.11) for stroke, and 1.00 (95% CI: 0.89, 1.11; P = 0.95) for CVD.

The authors concluded that saturated fat intake was unrelated to cardiovascular disease death. ("Risk estimate" is an epidemiologic term that usually means ratio of probabilities, although in this case it should have been ratio of death rates. "Quantile" probably meant quartile, that is, the comparison was between the lowest and highest quarter of the sample, with respect to saturated fat intake.)

Construct the support functions for the three “risk estimates”, and interpret them. The source is


Siri-Tarino PW, Sun Q, Hu FB, Krauss RM. Meta-analysis of prospective cohort studies evaluating the association of saturated fat with cardiovascular disease. Am J Clin Nutr doi: 10.3945/ajcn.2009.27725

Resources

In the scientific literature, the Snow vs. Farr story has become a surrogate for a battle between epidemiology and statistics, largely prosecuted by epidemiologists. The historical facts have become so distorted that articles like the next one are mostly of use for portraying the confusion.

Bingham P, Verlander NQ, Cheal MJ. John Snow, William Farr and the 1849 outbreak of cholera that affected London: a reworking of the data highlights the importance of the water supply. Public Health (Journal of the Royal Institute of Public Health) 2004;118:387-394

Darwin died a little too early to appreciate the implications of Mendel’s genetic results for his theory of evolution. He knew, however, that whatever the genetic inheritance mechanism was, it was important because it was possible that it might invalidate his work.

Darwin, C. The Effects of Cross & Self-Fertilisation in the Vegetable Kingdom. London UK: John Murray, 1876

Through the force of his personality, if not his brilliance, Fisher was probably the most influential statistician of the 20th century. He wrote two texts on statistics, the second of which was the design book in which he bungled an analysis of Darwin’s data to support his “randomization test”.

Fisher, RA. The Design of Experiments. New York NY: MacMillan Publishing Co, 1935

Galton was related to Darwin and worked with Karl Pearson, who wrote an enormous biography of him. It was possibly due to the antagonism that Fisher had for Pearson that he chose to indirectly criticize Darwin’s work in Chapter 2 of the above reference, due to the fact that Darwin claimed to have had Galton’s help in doing his analysis. Here are the sources of Galton’s data on peas.

Galton, F. Typical laws of heredity. Proceedings of the Royal Institution, 1877;8:282-301

Galton, F. Natural Inheritance. New York NY: MacMillan Publishing Co, 1889.

Eddington has been criticized for excluding most of the plates recording apparent star positions during the 1919 eclipse, thus biasing his analysis toward confirmation of relativity. The plates that Harvey was able to find were limited to those that had been originally chosen, and so his analysis was only a minor technical improvement, insufficient to remove the selection problem.

Harvey GM. Gravitational deflection of light: a re-examination of the observations of the solar eclipse of 1919. The Observatory 1979;99:195-198


One has to read this next paper, or at least skim it, to get a sense of what a heroic effort was necessary for Dyson and Eddington to turn the fuzzy plates into “definitive” results.

Dyson FW, Eddington AS, Davidson C. A determination of the deflection of light by the sun’s gravitational field from observations made at the total eclipse of May 29, 1919. Philosophical Transactions of the Royal Society, Series A, 1920; 220: 291-330

Charles Poor was in a position to be able to criticize the Eddington results with authority, because he not only understood the issues of telescopy and eclipses, but he had access to the raw data. Poor’s article is one of the major forgotten pieces of the history of physics.

Poor CL. The deflection of light as observed at total solar eclipses. Journal of the Optical Society of America 1930;20(4):173-211

The following is quite typical of modern systematic reviews, and as in this case the apparent inhomogeneity of the included medical studies is usually suppressed.

Ray KK, Seshasai SRK, Erqou S, Sever P, Jukema JW, Ford I, Sattar N. Statins and all-cause mortality in high-risk primary prevention. Archives of Internal Medicine 2010;170(12):1024-1031

My definition and treatment of support functions is a primitive version of a far more elaborate theory propounded in the next reference. Shafer takes on the issue of how to combine two or more sets of evidence about the same question, but he restricts the kinds of evidence in ways that might unnerve some practical data analysts.

Shafer G. A Mathematical Theory of Evidence. Princeton University Press: Princeton NJ, 1976

Tyndall is one of the more pleasant authors to read in the area of science in the late 19th century.

Tyndall J. The Forms of Water in Clouds and Rivers, Ice and Glaciers. D. Appleton & Co.: New York NY, 1896

The source of the data on radiation from the moon.

Langley SP. The temperature of the moon. Memoirs of the National Academy of Sciences, Vol IV, Part 2, 1889, pp. 107-212

This is the source of the studies to estimate the gravitational constant. It is not the only source of such studies, but the others tend to present even more inconsistencies.

Rothleitner C, Schlamminger S. Measurement of the Newtonian constant of gravitation, G. Review of Scientific Instruments 2017;88:111101

The source of the data on calcium and blood pressure:

Lyle RM, Melby CL, Hyner GC, Edmondson, Miller JZ, Weinberger MH. Blood pressure and metabolic effects of calcium supplementation on normotensive white and black men. Journal of the American Medical Association 1987;257:1771-1776


A treatment mechanism (as for statins) is supported when the treatment changes the mechanistic variable, and this latter change is then associated with an improvement. This requires analyzing triplets: treatment status, mechanism status, and outcome. Medical studies are never done this way.

Aickin M. Conceptualization and analysis of mechanistic studies. Journal of Alternative and Complementary Medicine 2007;13(1):151-158.