Gelman (2012) considers a case where the overall available evidence, E, is at odds with the indication of the results x from a given study:
Consider the notorious study in which a random sample of a few thousand people was analyzed, and it was found that the most beautiful parents were 8 percentage points more likely to have girls, compared to less attractive parents. The result was statistically significant (p<.05) and published in a reputable journal. But in this case we have good prior information suggesting that the difference in sex ratios in the population, comparing beautiful to less-beautiful parents, is less than 1 percentage point. A (non-Bayesian) design analysis reveals that, with this level of true difference, any statistically-significant observed difference in the sample is likely to be noise. At this point, you might well say that the original analysis should never have been done at all—but, given that it has been done, it is essential to use prior information (even if not in any formal Bayesian way) to interpret the data and generalize from sample to population.
Where did Fisher’s principle go wrong here? The answer is simple—and I think Cox would agree with me here. We’re in a setting where the prior information is much stronger than the data. (p. 3)
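The design analysis Gelman mentions can be sketched numerically. What follows is a minimal illustration, not Gelman's own calculation. It assumes the estimated difference is approximately normal, takes the true difference to be 1 percentage point (the upper end of what the prior information allows), and uses a hypothetical standard error of 3.5 percentage points, chosen only so that an 8-point estimate would reach p < .05. The function name design_analysis and all numeric inputs are assumptions made for illustration.

```python
from statistics import NormalDist


def design_analysis(true_diff, se, alpha=0.05):
    """Power, wrong-sign rate, and exaggeration ratio for a two-sided z-test.

    Assumes the estimate is approximately Normal(true_diff, se) and that
    true_diff > 0. All quantities are in the same units (here, percentage points).
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = true_diff / se
    a = z_crit - shift    # standardized cutoff for a significant positive estimate
    b = -z_crit - shift   # standardized cutoff for a significant negative estimate

    p_right = 1 - std.cdf(a)   # significant, correct sign
    p_wrong = std.cdf(b)       # significant, wrong sign
    power = p_right + p_wrong

    # Share of significant results that point in the wrong direction.
    wrong_sign_rate = p_wrong / power

    # Expected |estimate| given significance (truncated-normal means),
    # divided by the true difference.
    mean_abs_sig = (true_diff * (p_right - p_wrong)
                    + se * (std.pdf(a) + std.pdf(b))) / power
    exaggeration = mean_abs_sig / true_diff
    return power, wrong_sign_rate, exaggeration


# Hypothetical inputs: true difference 1 point, standard error 3.5 points.
power, wrong_sign_rate, exaggeration = design_analysis(true_diff=1.0, se=3.5)
print(f"power={power:.3f}  wrong-sign rate={wrong_sign_rate:.3f}  "
      f"exaggeration ratio={exaggeration:.1f}")
```

Under these assumed inputs the power sits only slightly above the 5% significance level, a significant estimate has a non-trivial chance of carrying the wrong sign, and a significant estimate overstates the assumed true difference several-fold on average. That is the sense in which a statistically significant difference in such a study is "likely to be noise."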
Let me simply grant Gelman that this prior information warrants (with severity) the hypothesis H:
H: “difference in sex ratios in the population, comparing beautiful to less-beautiful parents, is less than 1 percentage point,” (ibid.)
especially given my suspicions of the well-testedness of claims to show the effects of “beautiful to less-beautiful” on anything. I will simply take it as a given that it is well-tested background “knowledge.” Presumably, the well-tested claim goes beyond those individuals observed, and is generalizing at least to some degree. So we are given that the hypothesis H is one for which there is strong evidence.
For me this would mean, roughly, that if H were wrong and, in fact, there were as much as (or more than) a 1 percentage point difference in sex ratios in the population, comparing beautiful to less-beautiful parents, then these earlier studies would have indicated this. Or at least there are fairly good grounds for thinking they would have (or with high probability they would have) indicated a greater difference than they did. If, despite giving the effect a good chance to show itself, evidence in sync with a population difference of less than 1 percentage point is regularly found, then there is evidence for the absence of a greater effect.
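One rough way to put numbers on this reading of well-testedness, under a normal approximation: the claim "difference < 1 point" passes with high severity if, were the true difference as large as 1 point, the earlier studies would with high probability have produced a larger estimate than they did. The sketch below is illustrative only. The function name severity_less_than, the pooled estimate, and both standard errors are hypothetical and are not taken from any actual studies.

```python
from statistics import NormalDist


def severity_less_than(estimate, se, bound):
    """Severity for the claim 'true difference < bound'.

    Returns the probability of observing an estimate larger than the one
    actually observed, computed as if the true difference equaled `bound`.
    Assumes the estimate is approximately Normal(true difference, se).
    """
    return 1 - NormalDist(mu=bound, sigma=se).cdf(estimate)


# Hypothetical pooled result from earlier studies, in percentage points.
precise = severity_less_than(estimate=0.3, se=0.25, bound=1.0)

# Same estimate from a hypothetical small, noisy study.
noisy = severity_less_than(estimate=0.3, se=3.5, bound=1.0)

print(f"precise earlier evidence: {precise:.3f}")
print(f"single noisy study:       {noisy:.3f}")
```

With the precise (hypothetical) pooled evidence, a true difference of 1 point would almost always have shown up as a larger estimate, so H is well tested. With the noisy study the same estimate gives the effect little chance to show itself, and the severity is barely better than a coin flip.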
I don’t think there’s much for Gelman to disagree with in this rough construal of the well-testedness of H (but if there is, he will say so). My question is: Is Gelman suggesting that we translate this strong background information into a high prior for H? I’m guessing that the answer is no, that he imagines using it in an informal, non-Bayesian way. How then might the background knowledge about H be taken account of in reporting apparently conflicting data x?
I suggest that when it comes to the general hypothesis H, something like the following statement might be given:
If the statistical assumptions underlying this data x are approximately met, we would have an indication of a difference d (on sex ratio), but given what we already know, almost surely this study is in error somewhere (unless we are dealing with a very different population here).
Even without spotting the source of the error, this would be an entirely reasonable and informative report. It would not, of course, be a Bayesian updating to a posterior probability, but it would use the well-corroborated background information about the correctness of H to appraise the new data. Although he does not require it, would Gelman even allow blending the background together with the new information, rather than calling out the conflict?
To combine the presumed background knowledge with the anomalous data, even inferring something like:
as a result of data x, the degree of confirmation has gone down slightly for H
would seem very inadequate. Scientists interpreted anomalous data for Newton (in 1919 and before) as anomalies, and they did this despite the strong belief in Newton. (I once wrote a paper, “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’”.) But unlike the standard inductive Bayesian, Gelman seems to favor a testing (Bayesian) philosophy. Thus, I am guessing he would combine the information much as I do in the bold statement above. But I leave off with this question at the end.
My own view would go further to consult a general “repertoire of errors” with respect to the type of study: for example, might there be a difference in what counts as beautiful in this new study? Might a variety of other causal factors be responsible? Of course, there is always the possibility that the influence on sex ratio has changed from past inquiries. But if it is assumed that there is a universal generalization on the order of H (at least for this point in time), and without such an assumption Gelman wouldn’t be regarding the current data as in conflict with the known background, then there must be some explanation, even if there is no interest in discovering it.
If this all seems very obvious, that is exactly my point. We have no trouble (in science and in day-to-day reasoning) incorporating this prior information. Here’s Gelman:
We’re in a setting where the prior information is much stronger than the data. If one’s only goal is to summarize the data, then taking the difference of 8% (along with a confidence interval and even a p-value) is fine. But if you want to generalize to the population—which was indeed the goal of the researcher in this example—then it makes no sense to stop there. (2012, p. 3)
From the perspective of the frequentist error statistician, the background knowledge in favor of H would be used even at the stage of criticizing the assumptions of the statistical inference (whatever the range of generalization of interest). A statistical inference, if it is an “inference” based on valid statistical premises, always goes strictly beyond the data, even if it is limited to an inference about what is responsible for the particular data x in this study.
So I’m not so sure the reported confidence interval or p-value would “be fine”; instead, we would have stopped even before getting there. Still, granting Gelman that it would “be fine” for “summarizing the data,” this is very different from using the data to infer that there is a genuine effect, even of the statistical variety (even with a limited population). Rather than asking where Fisher’s principle seems to go wrong (at least in the statements by Cox under discussion), we would emphasize that Fisher always denied, even with the lady tasting tea, that an “isolated record” of statistically significant results suffices:
In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1935, p. 14)
It follows that, for Fisher, even showing the (statistical) reality of an experimental effect, much less showing the evidence of a specific explanation of the effect (be it an evolutionary story about beauty and daughters or something else), requires going beyond an “isolated record.” There is clearly a difference between
What this data x seems to indicate about H, and
What all available scientific evidence E indicates about H,
where “indicates” is being used as a neutral term, which the reader may substitute with others as they wish.
And, there are still more differences between what E indicates about H, and what is indicated about subsequent explanations for the real effect, once demonstrated (my levels again).
Beyond this, all statistical methods require choices—assumptions, if you will. . . . It’s just not possible to determine or even validate all one’s choices from the data at hand. (Gelman 2012, p. 4)
Now I understand the distinction Gelman has in mind, between merely describing data x and moving to much wider generalizations, but making inferences even from this limited data set x does not take place starting with a blank slate. Even in entirely informal situations outside statistics, interpreting “the data at hand” invokes background information. That doesn’t make all data “theory laden” in any interesting sense. So long as the aspects of the background information used, and the way they are used, do not interfere with what one is trying to learn, there is no reason to worry. (Nor need we even trot out all the background, much less believe in the truth of theories, to reliably use the background.)
But steering back to our main issue: How is David Cox using “prior information” in our conversation? Cox is clear about using prior knowledge in analyzing the data. When Cox says in our conversation (see Sept 12 post or [i]):
In fact you have very clever ways of making sure that your analysis is valid even if the prior information is totally wrong. If you use the wrong prior information you just got an inefficient design, that’s all. (p. 105),
he is using “prior information” to refer to various uncorroborated beliefs, conjectures, hunches, whether or not they are influenced by policy, ethical, or other values, and for which no evidence is given. That is why he mentions the possibility that “the prior information is totally wrong”.[i]
It doesn’t really matter much whether one describes these background theories and assumptions as part of the design and modeling (as Cox did in the passages Gelman cites), or as part of the various moves: (1) from “an isolated statistical record” to a genuine experimental effect, and separately (2) from genuine effects to explanations and predictions. I can see where Gelman regards the latter as preferable, and I do too. Background information enters at all levels, not in the form of mathematical prior probability distributions[ii], but as a repertoire of existing severely tested background claims (as well as those poorly tested, and why), along with a cluster of current obstacles, flaws and foibles to build upon.
That ends my deconstruction of (the two relevant sections in) Gelman. I have received two U-Phils which I will post next time*.
But I want to leave off with a general question (to which I invite responses):
Is the recommendation, in relation to his sex ratio example, where we grant him that “the prior information is much stronger than the data,” to report the background along the lines of my statement above (in bold)? Or is the recommendation to turn the prior information into a prior probability distribution and combine it Bayesianly with the new, anomalous, data x? The former is in the spirit of error statistical testing, at least for this case. (There may be other cases, obviously, where some kind of average or aggregation is what is wanted.) The latter is in the spirit of Bayesian confirmation theorists, at least in philosophy (given my background knowledge).
*The original U-Phil call was in my Sept 12 post, with a deadline of Sept. 25.
Gelman, A. (2012). Ethics and the statistical use of prior information.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd.
[ii] We do not deny there are some cases where background comes in via prior probabilities (I discuss this elsewhere); but even then, there will be other pieces of information that would not enter as formal probabilities.