“P-values overstate the evidence against the null”: legit or fallacious?

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow. It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature!

1. What you should ask…

When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

One honest answer is:

“What I mean is that when I put a lump of prior probability π₀ > 1/2 on a point null H₀ (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H₀.”

Your reply might then be: (a) P-values are not intended as posteriors in H₀ and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. Reporting which discrepancies are “poorly” warranted is what controls any overstatement of the discrepancies indicated.

You might toss in the query: Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?

If you wanted to go even further you might rightly ask: And by the way, what warrants your lump of prior to the null? (See Section 3. A Dialogue.)

^^^^^^^^^^^^^^^

2. J. Berger and Sellke and Casella and R. Berger

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H₀ (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H₀: μ = μ₀ versus H₁: μ ≠ μ₀.

“If n = 50…, one can classically ‘reject H₀ at significance level p = .05,’ although Pr (H₀|x) = .52 (which would actually indicate that the evidence favors H₀).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null going from .5 to .82!

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it!

From J. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.
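The two posteriors quoted above can be reproduced with a short calculation. What follows is a minimal sketch, not Berger and Sellke's own code. It assumes that under H₁ the mean μ gets a Normal prior N(μ₀, σ²) (an assumption made here because it recovers the quoted .52 and .82), that H₀ gets prior π₀ = .5, and that the result is just significant at the two-sided .05 level (z = 1.96). The function names are hypothetical.

```python
import math

def bayes_factor_null(z, n):
    """Bayes factor in favor of H0: mu = mu0, ASSUMING mu ~ N(mu0, sigma^2) under H1.

    z is the standardized statistic sqrt(n) * (xbar - mu0) / sigma.
    """
    return math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (1 + n))

def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of H0 given z, sample size n, and prior mass on H0."""
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bayes_factor_null(z, n)
    return post_odds / (1 + post_odds)

if __name__ == "__main__":
    z = 1.96  # just significant at the two-sided .05 level
    for n in (50, 1000):
        print(f"n = {n:5d}: Pr(H0 | x) = {posterior_null(z, n):.2f}")
```

Under these assumptions the sketch gives roughly .52 for n = 50 and .82 for n = 1000, and the posterior on H₀ keeps climbing as n grows with z held fixed. Lowering `prior_null` shows how much of the result is carried by the lump of prior on the null.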

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H₀, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H₀ as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here. 1/16 Update: Senn goes further in this post, and its sequel.

^^^^^^^^^^^^^^^^^

3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):

So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.

EPA Rep: We’ve conducted two studies (each with random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-Value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H₀ is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H₀?

P-value denier: If I gave H₀ a value lower than .5, then, if there’s evidence to reject H₀, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’?

The last sentence is a direct quote from Berger and Sellke!

“When giving numerical results, we will tend to present Pr(H₀|x) for π₀ = 1/2. The choice of π₀ = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π₀ should even be chosen larger than 1/2 since H₀ is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π₀ < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π₀ should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

There’s something curious in assigning a high prior to the null H₀–thereby making it harder to reject (or find evidence against) H₀–and then justifying the assignment by saying it ensures that, if you do reject H₀, there will be a meaningful drop in the probability of H₀. What do you think of this?

^^^^^^^^^^^^^^^^^^^^

4. The real puzzle.

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior, again?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: “Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?”

^^^^^^^^^^^^^^^^

5. (Crude) Benchmarks for taking into account sample size:

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H₀: μ ≤ 0 vs. H₁: μ > 0.

Let σ = 1, n = 25, so σ_x = σ/√n.

(The logic is identical if we estimate σ; my example follows the existing discussions by Berger and Sellke (1987) and others.) For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100; in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:

m₀ = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of form: μ > 0 + γ.

Some benchmarks:

* The lower bound of a 50% confidence interval is 2(σ/√n). So there’s quite lousy evidence that μ > 2(σ/√n) (the associated severity is .5).

Jan 17, 2016 add-on: Nevertheless, μ = 2(σ/√n) is the “most likely” alternative given m₀ = 2(σ/√n).

* The lower bound of the 93% confidence interval is .5(σ/√n). So there’s decent evidence that μ > .5(σ/√n) (the associated severity is .93).

* For n = 100, σ/√n = .1 (σ = 1); for n = 1600, σ/√n = .025.

* Therefore, a .025 stat sig result is fairly good evidence that μ > .05, when n = 100; whereas, a .025 stat sig result is quite lousy evidence that μ > .05, when n = 1600.

You’re picking up smaller and smaller discrepancies as n increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
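The benchmarks above can be checked numerically. Here is a rough sketch, assuming σ = 1 is known and taking the severity for the claim μ > μ₁ to be Pr(M ≤ m₀; μ = μ₁), with the observed mean m₀ sitting exactly at the 2-standard-error cut-off. The function names are illustrative only.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def severity(mu1, n, sigma=1.0, cutoff=2.0):
    """Severity for inferring mu > mu1 when the observed mean is
    m0 = 0 + cutoff * sigma / sqrt(n), i.e. just significant in test T+."""
    se = sigma / math.sqrt(n)   # standard error of the sample mean
    m0 = cutoff * se            # observed mean at the cut-off
    return normal_cdf((m0 - mu1) / se)

if __name__ == "__main__":
    for n in (100, 1600):
        se = 1.0 / math.sqrt(n)
        print(f"n = {n}: standard error = {se:.3f}, "
              f"SEV(mu > 0.05) = {severity(0.05, n):.2f}")
```

At the same .025 significance level, the claim μ > .05 passes with severity of about .93 when n = 100 but only .5 when n = 1600, which is the contrast drawn in the last benchmark.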

^^^^^^^^^^

6. “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)

Robert Cousins, a HEP physicist willing to talk to philosophers (and from whom I learned about statistics in the Higgs discovery), illuminates key issues in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014.)

^^^^^^^^^^^^^^^^^^^^^^^^^^

7. July 20, 2014: There is a distinct issue here. That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, when I dubbed it a fallacy, some people assumed I was referring to one or another well-known fallacy, leading them to guess I meant the fallacy of confusing P(E|H) with P(H|E), what some call the “prosecutor’s fallacy”. I wasn’t. Nor are Berger and Sellke committing a simple blunder of transposing conditionals. 1/16: In any event, they say they aren’t, and I’m prepared to distinguish what they are doing from that fallacy. Casella and Berger would scarcely have needed to write their reply to point this out. So how shall we state the basis for the familiar criticism that P-values overstate evidence against (a null)? I take it that the criticism goes something like this:

The problem with using a P-value to assess evidence against a given null hypothesis H₀ is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H₀, given data x (especially as n increases). The mismatch is avoided with a suitably small P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection“.
Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation. Jan 17 update: It turns out these are invariably run together.

Please see the comments in the original blog, and the continuation of the discussion comments here.

References (minimalist)

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Blog posts:

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.

3 thoughts on ““P-values overstate the evidence against the null”: legit or fallacious?”

January 21, 2016

Michael Lew

Mayo, it seems to me that these discussions of P-values and evidence would gain clarity if it were to be explicitly stated that a P-value should never be interpreted as a numerical measure of evidence. I do not deny that P-values are related to evidence, but that relationship is by way of serving as an index to a likelihood function among the family of possible likelihood functions for the parameter of interest.

The key point is that, as likelihoodlums have to keep saying, evidence is by nature comparative, and the P-value is calculated without a comparison. Thus P-values are not ‘made of’ the right ‘substance’ to be a numerical measure of evidence. I think that such an explanation is far more useful than the very difficult descriptions of the shape-shifting aspect of the standardly wrong arguments about P-values and evidence.

January 26, 2016

Mayo

Michael: And as error statisticians keep saying, reporting this is more likely than that is not to make a useful statistical inference, and you can find comparative evidence against h with high prob even if h is true–so likelihoodlums have poor or even terrible error control (thanks for reminding me of this apt term for you hoodlums). The need for evaluating whether the data are consistent with a given h is something so desperately needed (in many contexts) that even full blown Bayesians resort to p-values and variations on them. Note there’s a difference between having an alternative (a la N-P tests) and having the inference be comparative.

I should say, since I’ve got you here, that I thought your contribution to the p-value pow wow was among the best!

September 4, 2016

Mayo

Click to access casella-and-bergernocover1.pdf

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
