Excerpt from Excursion 4 Tour II*
4.4 Do P-Values Exaggerate the Evidence?
“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:
What do you mean by overstating the evidence against a hypothesis?
Several (honest) answers are possible. Here is one possibility:
What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.
More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.
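To make the first answer concrete, here is a minimal sketch of the kind of comparison the critic has in mind. It is only an illustration under stated assumptions, none of which come from the excerpt itself: a two-sided test of a Normal mean with known σ, a lump of prior weight π0 = 1/2 on the point null, and the remaining prior spread over the alternative as a Normal centered on the null with standard deviation τ = σ. The function names and the parameter `tau_over_sigma` are hypothetical, chosen for this example.

```python
import math


def two_sided_p_value(z):
    """Two-sided P-value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def posterior_null(z, n, prior_null=0.5, tau_over_sigma=1.0):
    """Posterior probability of a point null H0: mu = mu0.

    Assumptions (illustrative, not from the text): n Normal observations
    with known sigma; under the alternative, mu ~ Normal(mu0, tau^2).
    z is the usual statistic sqrt(n) * (xbar - mu0) / sigma.
    """
    # rho = ratio of prior variance to the sampling variance of xbar
    rho = n * tau_over_sigma ** 2
    # Bayes factor in favor of H0: density of xbar under H0 divided by
    # its marginal density under the alternative
    bf01 = math.sqrt(1.0 + rho) * math.exp(-0.5 * z * z * rho / (1.0 + rho))
    prior_odds_alt = (1.0 - prior_null) / prior_null
    return 1.0 / (1.0 + prior_odds_alt / bf01)


if __name__ == "__main__":
    z = 1.96
    print(f"P-value: {two_sided_p_value(z):.3f}")
    for n in (1, 10, 100):
        print(f"n = {n:>3}: posterior on H0 = {posterior_null(z, n):.2f}")
```

With z = 1.96 the P-value is about 0.05, while under these assumptions the posterior on H0 is roughly 0.35 at n = 1 and roughly 0.60 at n = 100. The gap depends entirely on the prior chosen: a different π0 or a different spread under the alternative gives different numbers.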
You might react by observing that: (a) P-values are not intended as posteriors in H0 (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, H0. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a P-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:
. . . [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone. (Senn 2001b, p. 202)
To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the P-value) is one of the probabilist ones?
All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.
Getting Beyond “I’m Rubber and You’re Glue”. The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.
Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.
To visit the core arguments, we travel to 1987 to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through quicksand of Excursion 3, Tour II, are about to pay large dividends.
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Readers can find blogposts that trace out the discussion of this topic, as I was developing it, along with comments. The following 2 are central:
(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised) 71 comments
(7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious? 39 comments
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.