Since the comments to my previous post are getting too long, I’m reblogging it here to make more room. I say that the issue raised by J. Berger and Sellke (1987) and Casella and R. Berger (1987) concerns evaluating the evidence in relation to a given hypothesis (using error probabilities). Given the information that this hypothesis H* was randomly selected from an urn with 99% true hypotheses, we wouldn’t say this gives a great deal of evidence for the truth of H*, nor suppose that H* had thereby been well-tested. (H* might concern the existence of a standard model-like Higgs.) I think the issues about “science-wise error rates” and long-run performance in dichotomous, diagnostic screening should be taken up separately, but commentators can continue on this, if they wish (perhaps see this related post).
0. July 20, 2014: Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.
1. What you should ask…
Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, which denies that the P-value aptly measures evidence, what you should ask is:
“What do you mean by overstating the evidence against a hypothesis?”
An honest answer might be:
“What I mean is that when I put a lump of prior probability π0 > 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.”
Your reply might then be: (a) P-values are not intended as posteriors in H0 and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. Reporting which discrepancies are poorly warranted is what controls any overstatement of the discrepancies indicated.
You might toss in the query: Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?
If you wanted to go even further you might rightly ask: And by the way, what warrants your lump of prior probability on the null? (See Section 3. A Dialogue.)
2. J. Berger and Sellke and Casella and R. Berger
Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0 (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.
The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.
“If n = 50…, one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level takes the posterior probability of the null from its prior of .5 up to .82!
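To see where the .52 and .82 figures come from, here is a minimal sketch (my own reconstruction in Python, not code from their paper), assuming the half of the prior not on the point null is spread over the alternative as a Normal(μ0, σ²) density, one of the priors Berger and Sellke consider; with that choice the Bayes factor in favor of H0 has a simple closed form.

```python
from math import sqrt, exp

def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of H0: mu = mu0 in the two-sided Normal test.

    Assumes a 'spike and slab' prior: mass `prior_null` on the point null,
    the rest spread as a Normal(mu0, sigma^2) density over the alternative.
    z is the standardized statistic sqrt(n)*(xbar - mu0)/sigma.
    """
    # Bayes factor in favor of H0: density of xbar under H0 divided by its
    # marginal density under the Normal(mu0, sigma^2) alternative prior
    bf_01 = sqrt(n + 1) * exp(-z**2 * n / (2 * (n + 1)))
    posterior_odds = (prior_null / (1 - prior_null)) * bf_01
    return posterior_odds / (1 + posterior_odds)

# A result just significant at the two-sided .05 level: z = 1.96
for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
# prints 50 0.52 and 1000 0.82, matching the figures quoted above
```

The same function shows the posterior on the null climbing toward 1 as n grows with z held fixed at 1.96, which is just the Jeffreys-Good-Lindley effect noted above.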
While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it!
Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H0, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111), whether in one- or two-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.
3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):
So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to posted notices saying that mean toxin levels in fish have been found to exceed the permissible mean concentration, set at 0.
EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.
P-Value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.
EPA Rep: Why is that?
P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H0 is still not all that low, not as low as .05 for sure.
EPA Rep: Why do you assign such a high prior probability to H0?
P-value denier: If I gave H0 a value lower than .5, then, if there’s evidence to reject H0 , at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’?
The last sentence is a direct quote from Berger and Sellke!
“When giving numerical results, we will tend to present Pr(H0|x) for π0 = 1/2. The choice of π0 = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π0 should even be chosen larger than 1/2 since H0 is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π0 < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability .1 to H0, and my conclusion is that H0 has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π0 should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke 1987, p. 115)
There’s something curious in assigning a high prior to the null H0–thereby making it harder to reject (or find evidence against) H0–and then justifying the assignment by saying it ensures that, if you do reject H0, there will be a meaningful drop in the probability of H0. What do you think of this?
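For concreteness, here is one way the P-value denier’s arithmetic could go, reusing the spike-and-slab posterior from the Section 2 sketch. The dialogue doesn’t spell out the prior over the alternative, so treating it as the same Normal(0, σ²) slab, and taking a single study of n = 100 fish significant at roughly P = .02 one-sided (z ≈ 2.05), are my assumptions.

```python
from math import sqrt, exp

def posterior_null(z, n, prior_null=0.5):
    # Same spike-and-slab posterior as in the Section 2 sketch
    # (repeated here so this snippet runs on its own).
    bf_01 = sqrt(n + 1) * exp(-z**2 * n / (2 * (n + 1)))
    posterior_odds = (prior_null / (1 - prior_null)) * bf_01
    return posterior_odds / (1 + posterior_odds)

# One study of n = 100 fish, significant at about P = .02 one-sided (z ~ 2.05)
print(round(posterior_null(2.05, 100), 2))   # ~0.56: "still not all that low"
```

So the denier can report a posterior above one half on “toxin levels are of no concern”, even though the data are statistically significantly discordant with that hypothesis; whether this is a virtue of the posterior or an indictment of the lump prior is exactly what is at issue below.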
4. The real puzzle.
I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited [i]) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke.

Personally, I don’t see why an error statistician would wish to construe the P-value as how “belief-worthy” or “bet-worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior, again?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: “Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become ‘the’ measure of evidence that we should aim to match?”
5. (Crude) Benchmarks for taking into account sample size:
Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :
H0: μ ≤ 0 vs. H1: μ > 0.
Let σ = 1, and write σM = (σ/√n) for the standard error of the sample mean M.
For this exercise, fix the sample mean to be just significant at the .025 level for a 1-sided test, and vary the sample size n: in one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard-error cut-off:
M0 = 0 + 2(σ/√n).
With stat sig results from test T+, we worry about unwarranted inferences of the form: μ > 0 + γ.
* The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).
* The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (the associated severity is .93).
* For n = 100, σ/√n = .1 (σ = 1); for n = 1600, σ/√n = .025.
* Therefore, a .025 stat sig result is fairly good evidence that μ > .05 when n = 100; whereas a .025 stat sig result is quite lousy evidence that μ > .05 when n = 1600.
You’re picking up smaller and smaller discrepancies as n increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
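These benchmarks are easy to verify with a short computation. Below is a sketch (mine, in Python, standard library only): the severity for the claim μ > γ, given the just-significant mean M0 = 2(σ/√n) from test T+, is the Normal probability Pr(M < M0; μ = γ), which equals the confidence level of the one-sided interval whose lower bound is γ.

```python
from math import sqrt, erf

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_mu_greater(gamma, m_obs, sigma, n):
    """Severity for 'mu > gamma' given observed mean m_obs from test T+;
    equals the confidence level of the one-sided interval with lower bound gamma."""
    return phi((m_obs - gamma) / (sigma / sqrt(n)))

sigma = 1.0
for n in (100, 1600):
    se = sigma / sqrt(n)
    m_obs = 2 * se   # sample mean fixed to be just significant at the .025 level
    print(f"n = {n}:",
          f"SEV(mu > 0.5*SE) = {severity_mu_greater(0.5 * se, m_obs, sigma, n):.2f},",
          f"SEV(mu > 2*SE) = {severity_mu_greater(2 * se, m_obs, sigma, n):.2f},",
          f"SEV(mu > 0.05) = {severity_mu_greater(0.05, m_obs, sigma, n):.2f}")
# n = 100:  SEV(mu > 0.05) is about .93 (0.05 is only half a standard error)
# n = 1600: SEV(mu > 0.05) is .50      (0.05 is the full 2-standard-error bound)
```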
6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)
Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).
7. July 20, 2014: There is a distinct issue here…. That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another of the well-known fallacies, leading them to guess I was referring to the fallacy of confusing P(E|H) with P(H|E)—what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. If they were, Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:
The problem with using a P-value to assess evidence against a given null hypothesis H0 is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H0, given data x (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv]

Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully but are simply repeated on the basis of some well-known articles.
[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.
[ii] Note too that the comparative assessment will vary depending on the “catchall”.
[iii] See for example:
Section 6.1 “fallacies of rejection”.
Slide #8 of Spanos lecture in our seminar Phil 6334.
[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.
References (minimalist)
A number of additional links are given in comments to my previous post.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.
To respond to the last comment from the previous post, someone named vl claimed: “When most non-statisticians think about frequentist guarantees what they really care about is given a decision rule (for what to believe), how often is my decision correct, how often is it wrong?”
Perhaps this kind of “performance” goal is in sync with the way frequentist “decision rules” are often presented, unfortunately. I think that what most people care about, if they aren’t trying to frame their data problem in terms of some formal statistical technique, isn’t that at all. It’s more like: what have I learned from the data about the particular problem at hand? What would it be warranted to infer, and what would be unwarranted?