The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow. It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature!
2. J. Berger and Sellke and Casella and R. Berger
Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0 (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.
The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0 .
“If n = 50…, one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null going from .5 to .82!
While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it!
Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H0, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here. 1/16 Update: Senn goes further in this post, and it’s sequel.
3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):
So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.
EPA Rep: We’ve conducted two studies (each with random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.
P-Value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.
EPA Rep: Why is that?
P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H0 is still not all that low, not as low as .05 for sure.
EPA Rep: Why do you assign such a high prior probability to H0?
P-value denier: If I gave H0 a value lower than .5, then, if there’s evidence to reject H0 , at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’?
The last sentence is a direct quote from Berger and Sellke!
There’s something curious in assigning a high prior to the null H0–thereby making it harder to reject (or find evidence against) H0–and then justifying the assignment by saying it ensures that, if you do reject H0, there will be a meaningful drop in the probability of H0. What do you think of this?
4. The real puzzle.
I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior, again?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: “Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?”
5. (Crude) Benchmarks for taking into account sample size:
Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :
H0: μ ≤ 0 vs. H1: μ > 0.
Let σ = 1, n = 25, so σx= (σ/√n).
(The logic is identical if we estimate σ , my example follows the existing discussions by Berger and Sellke (1987) and others.) For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:
m0 = 0 + 2(σ/√n).
With stat sig results from test T+, we worry about unwarranted inferences of form: μ > 0 + γ.
* The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).
Jan 17, 2016 add-on: Nevertheless, μ = 2(σ/√n) is the “most likely” alternative given m0 = 2(σ/√n).
*The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (The associated severity is .93).
*For n = 100, σ/√n = .1 (σ= 1); for n = 1600, σ/√n = .025
*Therefore, a .025 stat sig result is fairly good evidence that μ > .05, when n = 100; whereas, a .025 stat sig result is quite lousy evidence that μ > .05, when n = 1600.
You’re picking up smaller and smaller discrepancies as n increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)
Robert Cousins, a HEP physicist willing to talk to philosophers(and from whom I learned about statistics in the Higgs discovery), illuminates key issues in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).
7. July 20, 2014: There is a distinct issue here….That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another well-known fallacies, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E)—what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. 1/16 In any event, they say they aren’t and I’m prepared to distinguish what they are doing from that fallacy. Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:
The problem with using a P-value to assess evidence against a given null hypothesis H0 is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H0, given data x (especially as n increases). The mismatch is avoided with a suitably small P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.
[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.
[ii] Note too that the comparative assessment will vary depending on the “catchall”.
[iii] See for example:
[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation. Jan 17 update: It turns out these are invariably run together.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Cassella G. and Berger, R.. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.