Since the comments to my previous post are getting too long, I’m reblogging it here to make more room. I say that the issue raised by J. Berger and Sellke (1987) and Casella and R. Berger (1987) concerns evaluating the evidence in relation to a given hypothesis (using error probabilities). Given the information that *this* hypothesis H* was randomly selected from an urn with 99% true hypotheses, we wouldn’t say this gives a great deal of evidence for the truth of H*, nor suppose that H* had thereby been well-tested. (H* might concern the existence of a standard model-like Higgs.) I think the issues about “science-wise error rates” and long-run performance in dichotomous, diagnostic screening should be taken up separately, but commentators can continue on this, if they wish (perhaps see this related post).

**0. July 20, 2014:** Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

**1. What you should ask…**

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying that the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H_{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior on the null?** (See Section 3. *A Dialogue*.)

^^^^^^^^^^^^^^^

**2. J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0}.

“If n = 50…, one can classically ‘reject H_{0} at significance level p = .05,’ although Pr(H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).

If *n* = 1000, a result statistically significant at the .05 level leads to the posterior on the null going from .5 to .82!
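Both numbers can be reproduced directly. This is a minimal sketch, not Berger and Sellke’s own code: it assumes π_{0} = 1/2 on the point null, with the remaining mass spread as a Normal(μ_{0}, σ²) prior under the alternative (one of the priors they consider); the function name is mine.

```python
from math import exp, sqrt

def posterior_null(n, z=1.96, pi0=0.5, sigma=1.0, tau=1.0):
    """Pr(H0 | x) for a just-significant two-sided z-statistic,
    with prior mass pi0 on the point null H0: mu = mu0 and a
    Normal(mu0, tau^2) prior on mu under the alternative H1."""
    se2 = sigma**2 / n  # variance of the sample mean
    # Bayes factor in favor of H0: the N(mu0, se2) density over the
    # marginal N(mu0, tau^2 + se2) density, both at the observed mean
    b01 = sqrt((tau**2 + se2) / se2) * exp(-0.5 * z**2 * tau**2 / (tau**2 + se2))
    return pi0 * b01 / (pi0 * b01 + (1 - pi0))

print(round(posterior_null(n=50), 2))    # 0.52, as in Berger and Sellke
print(round(posterior_null(n=1000), 2))  # 0.82
```

With z held fixed at 1.96 (P = .05, two-sided), the posterior on the null climbs toward 1 as n grows: the Jeffreys-Good-Lindley effect.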

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’, where the ‘correct’ measure of evidence rests on an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.
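The one-sided reconciliation can be seen in a toy computation. This is a sketch under an assumption of mine chosen for illustration (an improper uniform prior over μ, a limiting case of the sensible priors Casella and R. Berger discuss) for the test H_{0}: μ ≤ 0 vs. H_{1}: μ > 0; under it the posterior Pr(μ ≤ 0 | x̄) equals the one-sided P-value exactly.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_sided(xbar, n, sigma=1.0):
    """One-sided P-value and flat-prior posterior for H0: mu <= 0."""
    se = sigma / sqrt(n)
    p_value = 1 - Phi(xbar / se)       # P-value for rejecting mu <= 0
    # With a flat prior, mu | xbar ~ N(xbar, se^2), so:
    posterior_null = Phi((0 - xbar) / se)
    return p_value, posterior_null

p, post = one_sided(xbar=0.196, n=100)  # z = 1.96
print(round(p, 3), round(post, 3))      # both 0.025: no conflict
```

Here the P-value and the posterior probability of the null coincide, so no “overstatement” arises in the one-sided case; the lump of prior on the point null is doing the work in the two-sided conflict.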

^^^^^^^^^^^^^^^^^

**3. A Dialogue** (ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices reporting that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|x) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (Some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle. **

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited [i]) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as of the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become ‘the’ measure of evidence that we should aim to match?”**

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size:**

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+:

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0. Let σ = 1, so the standard error of the sample mean is σ_{x} = (σ/√n).

*For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n: in one case, n = 100; in a second, n = 1600. So, for simplicity, using the 2-standard-deviation cut-off:*

m_{0} = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of the form: *μ > 0 + γ.*

*Some benchmarks:*

* *The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).*

* *The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (the associated severity is .93).*

* *For n = 100, σ/√n = .1 (σ = 1); for n = 1600, σ/√n = .025.*

* *Therefore, a .025 stat sig result is fairly good evidence that μ > .05 when n = 100; whereas a .025 stat sig result is quite lousy evidence that μ > .05 when n = 1600.*

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
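The benchmarks above can be checked numerically. A minimal sketch (the function name is mine): for a just-significant observed mean m = 2(σ/√n), the severity for the inference μ > γ is Pr(M ≤ m; μ = γ) = Φ((m − γ)/(σ/√n)).

```python
from math import erf, sqrt

def severity(gamma, n, sigma=1.0):
    """Severity for inferring mu > gamma when the observed mean is
    just significant at .025 in test T+: m = 2*(sigma/sqrt(n))."""
    se = sigma / sqrt(n)
    m = 2 * se                            # just-significant sample mean
    z = (m - gamma) / se
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z

print(round(severity(gamma=0.05, n=100), 2))   # 0.93: decent evidence
print(round(severity(gamma=0.05, n=1600), 2))  # 0.5: quite lousy evidence
```

The same .025-significant result warrants μ > .05 fairly well at n = 100 (γ sits 1.5 standard errors below m) but not at all at n = 1600 (γ coincides with the 2-standard-error cut-off itself).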

^^^^^^^^^^

**6.**** “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

**7. July 20, 2014: There is a distinct issue here….** That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another *well-known* fallacy, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E)—what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. If they were, Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv]

Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that has not been thought through carefully, but rather gets repeated on the basis of some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection“.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

**References** (minimalist). A number of additional links are given in comments to my previous post.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.

*Blog posts:*