The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in the Jeffreys-Lindley disagreement (of the default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–along with some additions from my new book (forthcoming).
1. What you should ask…
When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, what you should ask is:
“What do you mean by overstating the evidence against a hypothesis?”
One honest answer is:
“What I mean is that when I put a lump of prior probability π0 = 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.”
Your reply might then be: (a) P-values are not intended as posteriors in H0 and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. A report of the discrepancies that are “poorly” warranted is what controls any overstatement of the discrepancies indicated with large n.
You might toss in the query: Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?
If you wanted to go even further you might ask: And by the way, what warrants your lump of prior to the null? (See Section 3. A Dialogue.)
2. J. Berger and Sellke and Casella and R. Berger
It is well known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0 (Jeffreys-Lindley disagreement). I.J. Good recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. (See Mayo and Spanos 2011: Fallacy #4, p. 174.)
The Jeffreys-Lindley result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.
“If n = 50…, one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null going from .5 to .82!
While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it! Note, the probability on the null goes up from .5 to .82 when a statistically significant result at the .025 level (one-sided) is observed.
The following chart records the posterior probability on the null.
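The posteriors quoted above can be recomputed with a few lines of Python. This is only a minimal sketch under stated assumptions: a Normal sample with known σ, prior mass π0 on the point null, and a N(μ0, σ²) prior on μ under the alternative (the choice of alternative prior is an assumption of the sketch, and the function name `posterior_null` is hypothetical). The statistic z is the standardized distance √n|x̄ − μ0|/σ, so z = 1.96 corresponds to a two-sided P-value of .05.

```python
import math

def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of the point null H0: mu = mu0.

    Assumptions of this sketch (not taken verbatim from any source):
    X_i ~ N(mu, sigma^2) with sigma known, prior mass `prior_null` on H0,
    and a N(mu0, sigma^2) prior on mu under H1.
    `z` is the observed value of sqrt(n) * |xbar - mu0| / sigma.
    """
    # Bayes factor in favor of H0: density of xbar under H0, N(mu0, sigma^2/n),
    # divided by its marginal density under H1, N(mu0, sigma^2 * (1 + 1/n)).
    bf_null = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (1 + n))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf_null
    return post_odds / (1 + post_odds)

if __name__ == "__main__":
    # Hold the two-sided P-value fixed at .05 (z = 1.96) and let n grow.
    for n in (10, 50, 100, 1000):
        print(f"n = {n:5d}   Pr(H0 | x) = {posterior_null(1.96, n):.2f}")
```

With z held at 1.96, the sketch gives roughly .52 at n = 50 and .82 at n = 1000, matching the figures quoted above. Under these assumptions the posterior on the null climbs with n even though the P-value never changes.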
Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H0, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. There’s nothing impartial about these priors. Casella and Berger show that the reason for the wide range of variation of the posterior is the fact that it depends radically on the choice of alternative to the null and its prior.[i] There’s ample latitude (in smearing the alternative) so that the Bayes test only detects (in the sense of favoring probabilistically) discrepancies quite large (for the context).
Stephen Senn argues, “…the reason that Bayesians can regard P-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other” (Senn 2002, p. 2442). Riffing on the well-known joke of Jeffreys (1961, p. 385):
It would require that a procedure is dismissed [by significance testers] because, when combined with information which it doesn’t require and which may not exist, it disagrees with a [Bayesian] procedure that disagrees with itself. Senn (ibid. p. 195)
In other words, if Bayesians disagree with each other even when they’re measuring the same thing–posterior probabilities–why be surprised that disagreement is found between posteriors and P-values? See Senn’s interesting points on this same issue in his letter (to Goodman) here, as well as in this post, and its sequel.
Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). The same conflict persists between Bayesian “tests” and Bayesian credibility intervals.
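To make the conflict with confidence interval reasoning concrete, here is a small sketch that computes the 95% confidence interval and the spiked-prior posterior from the same data. It assumes known σ = 1, μ0 = 0, π0 = 1/2, and a N(μ0, σ²) prior on μ under the alternative. Those settings, the choice of z = 2.0 and n = 1000, and the function name `ci_and_posterior` are illustrative assumptions, not values taken from the papers discussed.

```python
import math

def ci_and_posterior(z, n, sigma=1.0, mu0=0.0, prior_null=0.5):
    """Return the 95% CI for mu and Pr(H0 | x) for the point null mu = mu0.

    Illustrative assumptions: known sigma, and a N(mu0, sigma^2) prior on mu
    under H1 with prior mass `prior_null` on H0.
    """
    se = sigma / math.sqrt(n)
    xbar = mu0 + z * se                      # sample mean implied by z
    ci = (xbar - 1.96 * se, xbar + 1.96 * se)
    # Bayes factor in favor of H0 for the Normal model described above.
    bf_null = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (1 + n))
    post = prior_null * bf_null / (prior_null * bf_null + (1 - prior_null))
    return ci, post

if __name__ == "__main__":
    ci, post = ci_and_posterior(z=2.0, n=1000)
    print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})   Pr(H0 | x) = {post:.2f}")
```

In this configuration the interval lies entirely above 0, so the null value is excluded at the 95% level, while the posterior on the null is still above .8.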
3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):
So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.
EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.
P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.
EPA Rep: Why is that?
P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H0 is still not all that low, not as low as .05 for sure.
EPA Rep: Why do you assign such a high prior probability to H0?
P-value denier: If I gave H0 a value lower than .5, then, if there’s evidence to reject H0, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’?
The last sentence is a quote from Berger and Sellke!
“When giving numerical results, we will tend to present Pr(H0|x) for π0 = 1/2. The choice of π0 = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π0 should even be chosen larger than 1/2 since H0 is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π0 < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π0 should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke 1987, p. 115)
There’s something curious in assigning a high prior to the null H0–thereby making it harder to reject (or find evidence against) H0–and then justifying the assignment by saying it ensures that, if you do reject H0, there will be a meaningful drop in the probability of H0. What do you think of this?
4. A puzzle.
I agree with J. Berger and Sellke that we should not “force agreement”. Why should an account that can evaluate how well or poorly tested hypotheses are–as significance tests can do (if correctly used)–want to measure up to an account that can only give a comparative assessment (be they likelihoods, Bayes Factors, or other)?[ii] From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke.
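The one-sided reconciliation can be illustrated numerically. The sketch below is a simplified special case, not Casella and Berger's general argument: it tests H0: μ ≤ 0 against H1: μ > 0 with known σ and puts a N(0, τ²) prior on μ, with no spike at the null. The prior family, the parameter name `tau`, and the function name `one_sided_posterior` are assumptions made for illustration. Under this setup the posterior of μ is Normal, and Pr(H0 | x) = Φ(−√w · z) with w = τ²/(τ² + σ²/n).

```python
import math
from statistics import NormalDist

def one_sided_posterior(z, n, tau, sigma=1.0):
    """Pr(mu <= 0 | data) under a N(0, tau^2) prior on mu.

    Illustrative assumptions: X_i ~ N(mu, sigma^2) with sigma known, and
    z = sqrt(n) * xbar / sigma is the observed standardized statistic.
    """
    # Weight the posterior mean places on the sample mean (the rest on 0).
    w = tau**2 / (tau**2 + sigma**2 / n)
    # Posterior is N(w * xbar, w * sigma^2 / n), so
    # Pr(mu <= 0 | x) = Phi(-sqrt(w) * z).
    return NormalDist().cdf(-math.sqrt(w) * z)

if __name__ == "__main__":
    z, n = 1.645, 50
    p_value = NormalDist().cdf(-z)           # one-sided P-value
    print(f"one-sided P-value = {p_value:.4f}")
    for tau in (0.1, 1.0, 10.0, 1000.0):
        print(f"tau = {tau:7.1f}   Pr(H0 | x) = {one_sided_posterior(z, n, tau):.4f}")
```

As τ grows (an increasingly diffuse prior), w approaches 1 and the posterior probability of H0 approaches the one-sided P-value from above. Within this family, then, the P-value is the limiting lower bound of the posterior rather than a quantity that "overstates" it.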
Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers”, but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the default Bayesians) is that the prior is undefined and is simply a way to compute a posterior. There are several conflicting default priors; there’s no agreement on which to use. You might ask, as does David Cox: “If the prior is only a formal device and not to be interpreted as a probability, what interpretation is justified for the posterior as an adequate summary of information?” (Cox 2006, p. 77)
The most common argument behind the “P-values exaggerate evidence” meme collapses. It reappears in different forms, also fallacious.[iii] I end with a quote from Senn.
The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.
It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.
Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.
[o] In a recent blog, I discussed the “limb-sawing” fallacy in that same paper.
[i] Berger and Sellke try another gambit. “Precise hypotheses…ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136). Are we really not interested in magnitudes?
[ii] To a severe tester, these aren’t really tests, in the sense that they don’t falsify, and they don’t satisfy a minimal severity principle.
[iii] Bayesians, Edwards, Lindman and Savage (1963, p. 235), despite being the first (?) to raise the “P-values exaggerate” argument, aver that for Bayesian statisticians, “no procedure for testing a sharp null hypothesis is likely to be appropriate unless the null hypothesis deserves special initial credence”. See also Pratt 1987, commenting on Berger and Sellke.
Epidemiologists Sander Greenland and Charles Poole give this construal of a spiked prior in the case of a two-sided test:
“[A] null spike represents an assertion that, with prior probability q, we have background data that prove [μ = 0] with absolute certainty; q = 1/2 thus represents a 50-50 bet that there is decisive information literally proving the null. Without such information…a probability spike at the null is an example of ‘spinning knowledge out of ignorance’”. (Greenland and Poole 2013, p. 66).
You might be interested in the comments in the original blog, and the continuation of the discussion comments here.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion).
Casella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion).
Cox, D. R. (2006), Principles of Statistical Inference: CUP.
Edwards, Lindman and Savage (1963). “Bayesian Statistical Inference for Psychological Research”.
Greenland and Poole (2013). “Living With P-values”.
Mayo, D. Statistical Inference as Severe Testing: CUP.
Mayo, D. and Spanos, A. (2011). “Error Statistics”.
Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.”
Senn, S. (2001). “Two Cheers for P-values”. Journal of Epidemiology and Biostatistics 6(2): 193-204.
Seems to me that there is a very important mistake in those arguments that P-values overstate evidence. A fatal mistake that might not have been pointed out directly:
In the Bayesian framework the ‘evidence’ is the likelihood function, so any attempt to compare the evidential meaning of the P-values to Bayesian evidence should be based on the likelihood function rather than the posterior!
The criticism that the slab and spike prior is inappropriate is clearly related to mine, but it is not as direct. I suspect that few Bayesians would be able to defend the silliness of treating the posterior as evidence once it has been pointed out.