The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow. It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature!

**1. What you should ask…**

When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

One honest answer is:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H_{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. A report on the discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior to the null?** (See Section 3. *A Dialogue*.)

^^^^^^^^^^^^^^^

**2. J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0}.

“If *n* = 50…, one can classically ‘reject *H*_{0} at significance level p = .05,’ although Pr(*H*_{0}|**x**) = .52 (which would actually indicate that the evidence favors *H*_{0}).” (Berger and Sellke, 1987, p. 113).

If *n* = 1000, a result statistically significant at the .05 level raises the probability of the null from its prior of .5 to a posterior of .82!
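These two figures can be checked with a short calculation. The sketch below is not Berger and Sellke's own code. It assumes the setup I take to lie behind their numbers: prior mass π_{0} = .5 on the point null, the remaining .5 spread over the alternative as a Normal prior centered at μ_{0} with variance σ², and an observed z-statistic of 1.96 (two-sided P = .05). The function name `posterior_null` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def posterior_null(z: float, n: int, prior_null: float = 0.5) -> float:
    """Posterior probability of a point null H0: mu = mu0.

    Assumes (for illustration) a N(mu0, sigma^2) prior on mu under H1,
    so the sample mean is N(mu0, sigma^2 * (1 + 1/n)) under H1 and
    N(mu0, sigma^2 / n) under H0.
    """
    # Bayes factor in favor of H0: ratio of the two marginal densities
    # of the sample mean, written in terms of the z-statistic.
    bayes_factor_01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1 + posterior_odds)


for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

Under these assumptions the posterior comes out at .52 for *n* = 50 and .82 for *n* = 1000, matching the quoted values: the z-statistic is the same in both cases, and only the sample size has changed.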

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here. 1/16 Update: Senn goes further in this post, and its sequel.

^^^^^^^^^^^^^^^^^

**3. A Dialogue** (ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|**x**) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*
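To see how much rides on π_{0}, one can hold the data fixed and vary only the prior mass placed on the null. This is a rough sketch, not taken from Berger and Sellke: it assumes a Normal prior centered at μ_{0} with variance σ² over the alternative, a z-statistic of 1.96, and *n* = 50. The helper name `posterior_for_prior` is hypothetical.

```python
import math


def posterior_for_prior(prior_null: float, z: float = 1.96, n: int = 50) -> float:
    """Posterior of a point null H0, given prior mass prior_null on it.

    Assumption for illustration: mu ~ N(mu0, sigma^2) under H1.
    """
    # Bayes factor for H0 against H1 under the assumed Normal prior.
    bayes_factor_01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))
    odds = bayes_factor_01 * prior_null / (1 - prior_null)
    return odds / (1 + odds)


for prior in (0.1, 0.25, 0.5, 0.75):
    print(f"prior {prior:.2f} -> posterior {posterior_for_prior(prior):.2f}")
```

With these inputs the Bayes factor is close to 1, so each posterior lands near the prior it started from. The number reported for H_{0} is then driven almost entirely by the choice of π_{0}.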

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle.**

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?”**

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size:**

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0.

Let σ = 1, *n* = 25, so σ_{x} = (σ/√*n*).

*(The logic is identical if we estimate σ; my example follows the existing discussions by Berger and Sellke (1987) and others.)* For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size *n*. In one case, *n* = 100, in a second, *n* = 1600. So, for simplicity, using the 2-standard deviation cut-off:

m_{0} = 0 + 2(σ/√*n*).

With stat sig results from test T+, we worry about unwarranted inferences of form: * μ > 0 + γ.*

*Some benchmarks:*

* The lower bound of a 50% confidence interval is 2(σ/√*n*). So there’s quite lousy evidence that μ > 2(σ/√*n*) (the associated severity is .5).

Jan 17, 2016 add-on: Nevertheless, μ = 2(σ/√*n*) is the “most likely” alternative given m_{0} = 2(σ/√*n*).

* The lower bound of the 93% confidence interval is .5(σ/√*n*). So there’s decent evidence that μ > .5(σ/√*n*) (the associated severity is .93).

* For *n* = 100, σ/√*n* = .1 (σ = 1); for *n* = 1600, σ/√*n* = .025.

* Therefore, a .025 stat sig result is fairly good evidence that μ > .05, when *n* = 100; whereas, a .025 stat sig result is quite lousy evidence that μ > .05, when *n* = 1600.

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
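The benchmarks above can be reproduced numerically. What follows is a minimal sketch, assuming σ = 1 and an observed mean sitting exactly at the 2-standard-error cut-off; the severity for the claim μ > γ is computed as Pr(M ≤ m_{0}; μ = γ). The function name `severity` and its parameters are hypothetical, chosen here for illustration.

```python
import math
from statistics import NormalDist


def severity(n: int, gamma: float, sigma: float = 1.0) -> float:
    """Severity for inferring mu > gamma in test T+ (H0: mu <= 0).

    Assumes the observed mean m0 sits exactly at the 2-standard-error
    cut-off, m0 = 0 + 2 * sigma / sqrt(n).
    """
    std_err = sigma / math.sqrt(n)
    m0 = 2 * std_err
    # Probability of a sample mean no larger than m0 if mu were gamma.
    return NormalDist().cdf((m0 - gamma) / std_err)


for n in (100, 1600):
    print(f"n = {n}: severity for mu > .05 is {severity(n, 0.05):.2f}")
```

Under these assumptions the same claim, μ > .05, has severity of about .93 when *n* = 100 but only .5 when *n* = 1600, even though the P-value is identical in the two cases.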

^^^^^^^^^^

**6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers (and from whom I learned about statistics in the Higgs discovery), illuminates key issues in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

*7. July 20, 2014:* **There is a distinct issue here…** That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, when I dubbed it a fallacy, some people assumed I was referring to one or another of the *well-known* fallacies, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E), what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. 1/16 In any event, they say they aren’t and I’m prepared to distinguish what they are doing from that fallacy. Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably small P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection”.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation. Jan 17 update: It turns out these are invariably run together.

**Please see the comments in the original blog, and the continuation of the discussion comments here.**

**References (minimalist)**

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.


Mayo, it seems to me that these discussions of P-values and evidence would gain clarity if it were to be explicitly stated that a P-value should never be interpreted as a numerical measure of evidence. I do not deny that P-values are related to evidence, but that relationship is by way of serving as an index to a likelihood function among the family of possible likelihood functions for the parameter of interest.

The key point is that, as likelihoodlums have to keep saying, evidence is by nature comparative, and the P-value is calculated without a comparison. Thus P-values are not ‘made of’ the right ‘substance’ to be a numerical measure of evidence. I think that such an explanation is far more useful than the very difficult descriptions of the shape-shifting aspect of the standardly wrong arguments about P-values and evidence.

Michael: And as error statisticians keep saying, to report “this is more likely than that” is not to make a useful statistical inference, and you can find comparative evidence against h with high prob even if h is true–so likelihoodlums have poor or even terrible error control (thanks for reminding me of this apt term for you hoodlums). The need to evaluate whether the data are consistent with a given h is so pressing (in many contexts) that even full blown Bayesians resort to p-values and variations on them. Note there’s a difference between having an alternative (a la N-P tests) and having the inference be comparative.

I should say, since I’ve got you here, that I thought your contribution to the p-value pow wow was among the best!
