What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs) of the sort Jim Berger recommends as replacements for P-values? It is that J. Berger regards these BFs as appropriate only when there are grounds for a high concentration (or spike) of prior probability on a sharp null hypothesis, e.g., H0: θ = θ0.
Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).
Now, to be clear, I do not think that P-values need to be misinterpreted (Bayesianly) to use them evidentially[i], and I think it’s a mistake to try to convert them into comparative measures of belief or support. However, it’s important to realize that even if you do think such a conversion is required, and are contemplating replacing them with the kind of BF Jim Berger advances, it would be wrong to do so if there were no grounds for a high prior belief in a point null. Jim said in the debate that people want a Bayes factor, so we give it to them. But if you’re asking for one, especially one described as a “default” method, you might assume it captures a reasonably common standpoint, not one that arises only in an idiosyncratic case. To allege, as the BF advocate does, that there’s really much less evidence against the sharp null than a P-value suggests is to hide the fact that most of this “evidence” is due to the spiked concentration of prior belief given to the sharp null hypothesis. This is an a priori bias in favor of the sharp null, not evidence in the data. (There is also the matter of how the remainder of the prior is smeared over the parameter values in the alternative.) Jim Berger, somewhat to my surprise, reaffirmed at the debate that this is the context for the intended use of his recommended Bayes factor with the spiked prior. Yet these BFs are being touted as a tool to replace P-values for everyday use.
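To see concretely how the spike drives the disagreement, here is a minimal sketch of my own (not Jim Berger’s exact default prior): a two-sided test of a point null on a normal mean, with prior mass pi0 spiked on the null and the rest spread as a normal over the alternatives. All the numbers (n, the prior sd tau, the spike pi0) are illustrative assumptions.

```python
# A minimal sketch, not Berger's exact default Bayes factor: spike-and-slab
# posterior probability of a point null H0: theta = 0, computed from the
# z-statistic. sigma = 1 and tau = 1 (prior sd of theta under H1) are
# illustrative assumptions.
import numpy as np
from scipy import stats

def posterior_null(z, n, pi0=0.5, tau=1.0):
    """P(H0 | z) with prior mass pi0 on theta = 0 and N(0, tau^2) on alternatives."""
    m0 = stats.norm.pdf(z, loc=0, scale=1)                        # marginal of z under H0
    m1 = stats.norm.pdf(z, loc=0, scale=np.sqrt(1 + n * tau**2))  # marginal of z under H1
    return pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)

z = 1.96                                   # two-sided P-value of about 0.05
for pi0 in (0.5, 0.1):
    print(pi0, round(posterior_null(z, n=50, pi0=pi0), 2))
```

With the 0.5 spike the posterior on H0 stays around one half despite P ≈ 0.05; shrink the spike to 0.1 and the dramatic “disagreement” largely dissolves, which is just the point Casella and R. Berger make below.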
Jim’s Sharp Null BFs Were Developed for a Very Special Case. Harold Jeffreys developed the spiked priors for a special Bayesian problem: how to give high posterior probabilities to well-corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications that an observed effect is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon, as R.A. Fisher emphasized.)
Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test (J. Berger and Sellke 1987, p. 136).
Casella and Roger Berger (1987b) respond to Jim Berger and Sellke and to Jim Berger and Delampady, all in 1987: “We would be surprised if most researchers would place even a 10% prior probability of H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values [and] P(H0|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H0] that was used.” They make the astute point that the most common uses of a point null, asserting that the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. As they note, “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, they conclude, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).
As I said in response to debate question 3, “the move to redefine statistical significance, advanced by a megateam in 2017, including Jim, rests upon the lump of high prior probability on the null as well as the appropriateness of evaluating P-values using Bayes factors. The redefiners are prepared to say there’s no evidence against, or even evidence for, a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies [from the point null]”.
Conduct an Error Statistical Critique. Thus a question you should ask in contemplating the application of the default BF is this: What’s the probability that the default BF would find no evidence against the null, or even evidence for it, when an alternative or discrepancy of interest to you is present? If that probability is fairly high, you’d not want to apply it.
Notice what we’re doing in asking this question: we’re applying the frequentist error statistical analysis to the Bayes factor. What’s sauce for the goose is sauce for the gander.[ii] This is what the error statistician needs to do whenever she’s told an alternative measure ought to be adopted as a substitute for an error statistical one: check its error statistical properties.
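Here is a rough sketch of such a probe (again a generic spike-and-slab stand-in, not Jim Berger’s exact default BF; the sample size n, the prior sd tau, and the discrepancy delta are all illustrative assumptions): simulate data from a discrepancy you care about and ask how often the BF comes out at or above 1 in favor of the null, i.e., reports no evidence against it.

```python
# A rough error statistical probe of a spike-and-slab Bayes factor (a stand-in,
# not Jim Berger's exact default): how often does it report the data as
# favoring H0 (BF for H0 >= 1) when a discrepancy of interest is present?
# n, tau, and delta are illustrative assumptions.
import numpy as np
from scipy import stats

def bf01(z, n, tau=1.0):
    """Bayes factor in favor of H0: theta = 0 versus a N(0, tau^2) alternative."""
    return (stats.norm.pdf(z, loc=0, scale=1)
            / stats.norm.pdf(z, loc=0, scale=np.sqrt(1 + n * tau**2)))

rng = np.random.default_rng(1)
n, delta, reps = 50, 0.3, 100_000                 # true discrepancy theta = 0.3 (sigma = 1)
z = rng.normal(loc=np.sqrt(n) * delta, scale=1, size=reps)
print("P(BF reports no evidence against H0 | theta = 0.3):", np.mean(bf01(z, n) >= 1))
```

Under these particular settings the answer comes out around 45%: roughly half the time this BF would report no evidence against the null even though the discrepancy of interest is real. That is the kind of error probability the error statistician wants on the table before adopting the procedure.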
Is the Spiked Prior Appropriate to the Problem, Even With a Well-corroborated Value? Even in those highly special cases where a well-corroborated substantive theory gives a high warrant for a particular value of a parameter, it’s far from clear that a spiked prior reflects how scientists examine the question: is the observed anomaly (for the theory) merely background noise or some systematic effect? Remember when neutrinos appeared to travel faster than light, an anomaly for special relativity, in the OPERA experiment in 2011?
This would be a case where Berger would place a high concentration of prior probability on the point null, the speed of light c given by special relativity. The anomalous results, at most, would lower the posterior belief. But I don’t think scientists were interested in reporting that the posterior probability for the special relativity value had gone down a bit, due to their anomalous result, but was still quite high. Rather, they wanted to know whether the anomaly was mere noise or genuine, and finding it was genuine, they wanted to pinpoint blame for the anomaly. It turns out a fiber optic cable wasn’t fully screwed in and one of the clocks was ticking too fast. Merely discounting the anomaly (or worse, interpreting it as evidence strengthening their belief in the precise null) because of strong belief in special relativity would sidestep the most interesting work: gleaning important information about how well or poorly run the experiment was.[iii]
It is interesting to compare the spiked prior position with an equally common Bayesian position, that all null hypotheses are false. The disagreement may stem from viewing H0 as asserting the correctness of a scientific theory (the spiked prior view) as opposed to asserting that the value of a parameter in a model, representing a portion of that theory, is correct (the “all nulls are false” view).
Search Under “Overstates” On This Blog for More (and “P-values” for much more). The reason to focus attention on the disagreement between the P-value and the Bayes factor with a sharp null is that it explains an important battle in the statistics wars, and thus points the way to (hopefully) getting beyond it. The very understanding of the use and interpretation of error probabilities differs in the rival approaches.
As I was writing my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), I was often distracted by high-pitched discussions in 2015–17 about P-values “overstating” evidence on account of being smaller than a posterior probability on a sharp null. Thus, I wound up writing several posts, the ingredients of which made their way into the book, notably Section 4.4. Here’s one. I eventually dubbed it the “P-values overstate the evidence” fallacy. For many excerpts from the book, including the rest of the “Tour” where this issue arises, see this blogpost.
Stephen Senn wrote excellent guest posts on P-values for this blog that are especially clarifying, such as this one. He observes that Jeffreys, having already placed the spiked prior on the point null, required only that the posterior on the alternative exceed 0.5 in order to find evidence against the null, not that it be a large number such as 0.95.
A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives. (S. Senn)
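To spell out Senn’s observation (a small sketch using the standard identity linking the posterior to the Bayes factor, with Jeffreys’s even split of prior probability assumed), write B10 for the Bayes factor in favor of the alternative:

$$P(H_1 \mid x) \;=\; \frac{\pi_1 B_{10}}{\pi_0 + \pi_1 B_{10}}, \qquad \pi_0 = \pi_1 = 0.5 \;\Longrightarrow\; \big[\, P(H_1 \mid x) > 0.5 \iff B_{10} > 1 \,\big].$$

So the criterion asks only that the data favor the alternative at all, not that they favor it strongly.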
***
[i] I mentioned two of the simplest inferential arguments using P-values during the debate: one for blocking an inference, a second for inferring incompatibility with (or discrepancy from) a null hypothesis, set as a reference: “If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.”
“…A small P-value indicates discrepancy from a null value because with high probability, 1 – p, the test would have produced a larger P-value (less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.” For a more detailed discussion see SIST, e.g., Souvenir C (SIST, p. 52) https://errorstatistics.files.wordpress.com/2019/04/sist_ex1-tourii.pdf.
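A quick way to see the “1 – p” claim (a sketch assuming a continuous test statistic): under H0 the P-value is uniformly distributed on (0, 1), so

$$\Pr(P > p \,;\, H_0) \;=\; 1 - p,$$

i.e., had H0 been adequate, a P-value larger than the one observed would have occurred with probability 1 – p.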
[ii] From SIST* (p. 247): “The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you.”
[iii] Conversely, the sharp null hypothesis in the discovery of the Higgs boson was disbelieved even before the expensive particle colliders were built (physicists knew there had to be a Higgs particle of some sort). You can find a number of posts on the Higgs on this blog (also in Mayo 2018, Excursion 3 Tour III).
*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). The book follows an “itinerary” of a stat cruise with lots of museum stops and souvenirs.
Thank you for the credit to Casella and Roger Berger (1987b) regarding our remarks about point null hypotheses.
Roger! Thank you for coming by to notice my blog. I’m honored! You came up during the debate on this point, and I’ve learned a lot from your papers. https://errorstatistics.com/2020/10/16/the-p-values-debate/
I think it would be great if you wrote an update on your thoughts regarding this old chestnut, if you were ever keen to do so.
Does J. Berger really think that a BF informs us “whether the theory is right or wrong”? The likelihoods of the hypotheses (understanding likelihood in the formal sense) are both going to be very small; even if some alternative explanation for the data were 30 or 40 times as likely as the precise null, it’s doubtful that this would be regarded as showing the theory wrong (where the theory is presumed to be represented by the point null hypothesis). Worse would be to suppose it’s evidence for a particular alternative that is more likely. The alternative, by the way, can be data-dependent on the BF approach. Jim said something about multiplicity being taken care of by the priors, so presumably the priors are arrived at after the data dredging, suggesting, many would say, that they are not really priors.
One thing that strikes me is that although I meet Bayesians who say that frequentists need to believe the model is true, and that there really needs to be infinite identical repeatability for frequentism to make sense, many interpretations of Bayesian measures seem to rely far more on a belief in “true models” than frequentist logic does, if the latter is applied with appropriate caution. This does not hold for proper de Finettians, for whom the sampling model is about belief as well, but the majority of Bayesians seem to talk as if this is about a data-generating process in reality; and then their evidence measurement business, particularly the drive to assign high probability to certain hypotheses being true, seems to rely much more strongly on it than the frequentist inquiry into which models are compatible with the data.