How often do you hear P-values criticized for “exaggerating” the evidence against a null hypothesis? If your experience is like mine, the answer is ‘all the time’, and in fact the charge is often taken as one of the strongest cards in the anti-statistical-significance playbook. The argument boils down to the fact that the P-value accorded to a point null H0 can be small while its Bayesian posterior probability is high, provided a high enough prior probability is accorded to H0. But why suppose P-values should match Bayesian posteriors? And what justifies the high (or “spike”) prior on a point null? While I discuss this criticism at considerable length in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018), I did not quote an intriguing response by R.A. Fisher to disagreements between P-values and posteriors (in Statistical Methods and Scientific Inference, Fisher 1956): namely, that such a prior probability assignment would itself be rejected by the observed small P-value, if the prior were itself regarded as a hypothesis to test. Or so he says. I did mention this response by Fisher in an encyclopedia article from way back in 2006 on “philosophy of statistics”:
Discussing a test of the hypothesis that the stars [in the Pleiades] are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to “exclude at a high level of significance any theory involving a random distribution” (Fisher 1956, 42). Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues–never minding “what such a statement of probability a priori could possibly mean”–the resulting high a posteriori probability to H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (44) “is not capable of finding expression in any calculation of probability a posteriori” (43)…. Indeed, if one were to consider the claim about the a priori probability to be itself a hypothesis, Fisher suggests, it would be rejected by the data! (Mayo 2006, pp. 813-814)
I have recently come across a (2021) article by David Bickel, “Null Hypothesis Significance Testing Defended and Calibrated by Bayesian Model Checking” in The American Statistician which seems to me (and I think to Bickel) to cash out what Fisher’s argument might look like. Bickel’s abstract begins:
Significance testing is often criticized because p-values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p-value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p-value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check. (from Bickel 2021)
Bickel will go further and use this result to transform a P-value into an upper bound on the posterior probability of the null hypothesis (conditional on rejection) for any model that would pass the check.
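To get a feel for the kind of computation involved, here is a minimal sketch, mine and not Bickel’s exact procedure, using a familiar spike-and-slab setup: a two-sided z-test of a normal mean, a prior probability pi0 on the point null, and a N(0, tau^2) prior on the mean under the alternative. The model’s prior predictive probability of the event “p ≤ alpha” is the quantity a prior predictive check would probe, and Bayes’ rule then gives the posterior probability of the null conditional on rejection. All the numbers are illustrative.

```python
# A minimal sketch (not Bickel's exact procedure) of a spike-and-slab model for a
# two-sided z-test of a normal mean: prior probability pi0 on H0: mu = 0, and a
# N(0, tau^2) "slab" prior on mu under the alternative. All numbers are illustrative.
from scipy.stats import norm

def check_quantities(alpha, pi0, n, sigma, tau):
    """Return P(H0 | p <= alpha) and the prior predictive P(p <= alpha)."""
    z_crit = norm.ppf(1 - alpha / 2)
    # Under H0 the p-value is uniform, so P(p <= alpha | H0) = alpha.
    p_reject_given_H0 = alpha
    # Under the slab, the z-statistic is marginally N(0, 1 + n*tau^2/sigma^2).
    marginal_sd = (1 + n * tau**2 / sigma**2) ** 0.5
    p_reject_given_H1 = 2 * norm.sf(z_crit / marginal_sd)
    # Prior predictive probability of rejection -- what a prior predictive check probes.
    prior_pred = pi0 * p_reject_given_H0 + (1 - pi0) * p_reject_given_H1
    # Bayes' rule, conditioning on the event "p <= alpha" rather than the exact data.
    posterior_H0 = pi0 * p_reject_given_H0 / prior_pred
    return posterior_H0, prior_pred

posterior, prior_pred = check_quantities(alpha=0.001, pi0=0.9, n=100, sigma=1.0, tau=1.0)
print(f"Prior predictive P(p <= alpha): {prior_pred:.4f}")
print(f"P(H0 | p <= alpha):             {posterior:.3f}")
```

The point the abstract makes can be read off this kind of calculation: the more the model stacks prior probability on (or tightly around) the null, the smaller its prior predictive probability of a very small p-value, which is exactly the sort of surprise the model check flags; conversely, a model that passes the check cannot give the null a high posterior probability conditional on rejection.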
Given the tenuousness of prior probability assignments, Fisher’s (and Bickel’s) suggestion that they be subjected to test seems like a good idea. But are Bayesians bound to accept such a test? I don’t know. Has anyone seen Bickel’s paper, and does anyone wish to comment on it? I will discuss it in a later post.
The key passages from Fisher are:
It is to be noted that the mental reluctance to accept an event intrinsically improbable would still be felt if, for example, a datum were added [to the problem] to the effect that it was a million to one a priori that the stars should be scattered at random. We need not consider what such a statement of probability a priori could possibly mean in the astronomical problem; all that is needed [is that were this datum introduced] a probability statement could be inferred a posteriori, to the effect that the odds were 30 to 1 that the stars really had been scattered at random. The inherent improbability of what has been observed being observable on this view still remains in our minds, and no explanation has been given of it…The observer is thus not left at all in the same state of mind as if the stars had actually displayed no evidence against a random arrangement, although he would have been forced logically to admit that (as far as statements in terms of probability went) such a theory was probably true.
The example shows that the resistance felt by the normal mind to accepting a story intrinsically too improbable is not capable of finding expression in any calculation of probability a posteriori…..
If the proposed datum, ‘The odds are a million to one a priori that the stars should really be distributed…at random’—if this datum were considered as a hypothesis, it would be rejected at once by the observations at a level of significance almost as great as the hypothesis ‘The stars are really distributed at random’ was rejected in the first instance. (Fisher 1956, pp. 42-44)
For the full pages, see Fisher (1956), pp. 42-44.
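For what it’s worth, Fisher’s “30 to 1” appears to follow from his other two numbers by the odds form of Bayes’ rule, if (as I read him) the p-value is treated as the probability of so extreme a clustering under a random distribution and the corresponding probability under a non-random arrangement is taken to be roughly 1:

```python
# Reading Fisher's numbers (my reconstruction, not his own worksheet): update the
# million-to-one prior odds in favor of randomness by the ratio of the probabilities
# of so extreme a clustering, taking that probability to be ~1 under non-randomness.
prior_odds_random = 1_000_000          # "a million to one a priori" for a random distribution
p_value = 1 / 33_000                   # "about 1 in 33,000"
posterior_odds_random = prior_odds_random * p_value
print(round(posterior_odds_random))    # ~30: "the odds were 30 to 1 that the stars
                                       #  really had been scattered at random"
```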
References:
Bickel, D. (2021). “Null Hypothesis Significance Testing Defended and Calibrated by Bayesian Model Checking,” The American Statistician 75(3): 249-255.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd. (See pp. 42-44.)
Mayo, D. (2006). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.), Philosophy of Science: An Encyclopedia, London: Routledge: 802-815.
Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
One thing that always strikes me when revisiting this argument is that it is hard for a Bayesian to defend saying (a) that null hypotheses are almost surely rejected by significance tests with large sample sizes because they are never precisely true, while (b) advertising a Bayesian analysis that uses a model putting considerable point mass on the null hypothesis.
Christian: Yes, I’ve mentioned this many times, including, of course, in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Competing Bayesians have opposite and extreme standpoints regarding the point null, which is artificial anyway. I think it is often used to churn out problems for the significance tester.
The bird’s-eye view permits shifting the attention to information, and to information quality. Christian’s point about large sample sizes leads to predictive analytics. The context there is assessing overfitting, interpretable AI, and generalization of findings; nothing to do with p-values or severe testing. Try to look at the big picture: it provides a reality check on what is being done now in the statistics and analytics domain. The paper at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3591808 provides such a context.
Ron:
All of those concerns connect directly to how severely claims have been tested, as they involve probing classic mistakes and threats to inferences.
You will find quite a lot on these topics in my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. For example, in Excursion 4, Tour II, “Rejection Fallacies: Who’s Exaggerating What?” (p. 239), the first section is 4.3, “Significant Results with Overly Sensitive Tests: Large n Problem.”
The conflict between the test critics who base their criticism on high spike priors, and those who hold that all nulls are false, is discussed in that Tour as well. Of course Christian knows this, but is usefully raising these points in relation to this article.
We deal with the sample size issue by reporting the extent of discrepancy indicated. This has a consequence for comparing equally statistically significant results from tests with different powers to discriminate. On the second point, I argue that the fact that a sufficiently sensitive test can find evidence of a discrepancy does not mean that all nulls are false in any interesting sense.
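To make the first point concrete, here is a minimal sketch (with illustrative numbers, not an excerpt from the book) of how the same p-value indicates a smaller discrepancy as n grows, in a one-sided test of a normal mean:

```python
# Illustrative sketch: one-sided test of H0: mu <= 0 vs H1: mu > 0 for a normal mean
# with known sigma = 1. Fix the observed p-value at 0.01 and compute the discrepancy
# mu_1 such that the claim "mu > mu_1" is indicated with severity 0.975.
from scipy.stats import norm

sigma, p_obs, sev = 1.0, 0.01, 0.975
z_obs = norm.ppf(1 - p_obs)                       # z-statistic corresponding to p = 0.01

for n in (100, 10_000):
    xbar = z_obs * sigma / n**0.5                 # sample mean that yields p = 0.01
    mu_1 = xbar - norm.ppf(sev) * sigma / n**0.5  # SEV(mu > mu_1) = 0.975
    print(f"n = {n:>6}: observed mean = {xbar:.4f}, "
          f"mu > {mu_1:.4f} indicated with severity {sev}")
```

With n = 100 the same p-value licenses a discrepancy roughly ten times larger than with n = 10,000, which is why equally significant results from tests of very different sensitivity should not be treated as interchangeable.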
Those quotes of Fisher may anticipate Box’s idea of a prior predictive check of a Bayesian model. On the other hand, Fisher was concerned with testing a scientific hypothesis about a physical distribution, whereas Box was concerned with checking a mental model assumed for data analysis and prediction.
The Fisher-Box analogy fits today’s overlap between physical and mental distributions. For example, empirical Bayes estimates of prior distributions are very useful in multiple testing. Also, deep neural networks, never intended to hypothesize physical distributions, have been tested as null hypotheses.
I wrote some thoughts here: https://ctwardy.micro.blog/2021/12/07/model-testing-mayo.html
Briefly: I see a tension in Fisher’s willingness to use priors to dismiss ESP but not to dismiss star clustering. I think the tension is resolved by considering the source of the prior: for ESP it is a century of failed or vanishing results; for the Pleiades it seems picked out of the air, and could plausibly have been 1:100,000 instead, leaving us uncertain after the evidence.