Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah,…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist significance tester: But you assumed 50% of the null hypotheses are true, and  computed P(H0|x) (imagining P(H0)= .5)—and then assumed my p-value should agree with the number you get!

But, our significance tester is not heard from as they move on to the next joke….
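The retort can be made concrete with a small simulation. What follows is only a sketch, not JB's applet: the function name share_true_nulls, the window (.04, .05] standing in for "p-values of .05", and the spread tau of the non-null effects are all assumptions introduced here for illustration. It suggests that the share of true nulls among such results is driven by the proportion of true nulls put into the urn.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def share_true_nulls(prior_null, tau=3.0, lo=0.04, hi=0.05,
                     n_tests=100_000, seed=1):
    """Fraction of true nulls among simulated tests with lo < p <= hi.

    prior_null: assumed proportion of true nulls in the urn (hypothetical).
    tau: assumed spread of the true effect, in standard-error units,
         when the null is false (hypothetical).
    """
    rng = random.Random(seed)
    true_null = 0
    false_null = 0
    for _ in range(n_tests):
        null_is_true = rng.random() < prior_null
        effect = 0.0 if null_is_true else rng.gauss(0.0, tau)
        z = rng.gauss(effect, 1.0)  # observed standardized statistic
        if lo < two_sided_p(z) <= hi:
            if null_is_true:
                true_null += 1
            else:
                false_null += 1
    return true_null / (true_null + false_null)


if __name__ == "__main__":
    # The answer moves with the assumed urn composition, not with the p-value.
    for prior_null in (0.1, 0.5, 0.9):
        print(prior_null, round(share_true_nulls(prior_null), 2))
```

Changing prior_null or tau changes the reported share, even though the p-values being selected are the same.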

Of course it is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0 [i] .  Somewhat more recent work generalizes the result, e.g., J. Berger and Sellke, 1987. Although from their Bayesian perspective, it appears that p-values come up short as measures of evidence, the significance testers balk at the fact that use of the recommended priors allows highly significant results to be interpreted as no evidence against the null — or even evidence for it!   An interesting twist in recent work is to try to “reconcile” the p-value and the posterior e.g., Berger 2003[ii].

The conflict between p-values and Bayesian posteriors is usually illustrated with the two-sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!
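The two figures just quoted can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: known σ, prior mass .5 on H0, and a N(μ0, σ²) prior on μ under H1 (an assumption made here; it does reproduce the .52 and .82 above). The function name posterior_null is introduced only for illustration.

```python
import math


def posterior_null(t, n, prior_null=0.5):
    """P(H0 | x) for H0: mu = mu0, Normal data with known sigma.

    Assumes a N(mu0, sigma^2) prior on mu under H1 (an assumption of
    this sketch). t is the standardized statistic
    sqrt(n) * |xbar - mu0| / sigma.
    """
    # Bayes factor in favour of H0: density of xbar under H0 divided by
    # its marginal density under H1.
    bayes_factor = math.sqrt(1.0 + n) * math.exp(-0.5 * t * t * n / (1.0 + n))
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # t = 1.96 corresponds to a two-sided p-value of .05.
    for n in (50, 1000):
        print(n, round(posterior_null(1.96, n), 2))
```

Passing a smaller prior_null (say .1) pulls the posterior down sharply, which shows how much of the result rests on the spike at H0.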


Table 1 (modified) from J.O. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.

Many find the example compelling evidence that the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative”(?) Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. Others charge that the problem is not p-values but the high prior (Casella and R. Berger, 1987). Moreover, the “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false”. Note too the conflict with confidence interval reasoning, since the null value lies outside the corresponding confidence interval (Mayo 2005).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and frequentist error probabilities: it is imagined that we sample randomly from a population of hypotheses, some proportion of which are assumed to be true, 50% is a common number used. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (and, it may be added, nothing else is known, or the like).

Therefore P(H0 is true) = .5.

It isn’t that one cannot play a carnival game of reaching into an urn of nulls (and one can imagine lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we could even tell), but this “generic hypothesis” is no longer the particular hypothesis one aims to use in computing the probability of data x0 (be it on eclipse data, risk rates, or whatever) under hypothesis H0. [iii] In any event .5 is not the frequentist probability that the chosen null H0 is true. (Note the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”.)

Yet J. Berger claims his applets are perfectly frequentist, and by adopting his recommended O-priors, we frequentists can become more frequentist (than using our flawed p-values)[iv]. We get what he calls conditional p-values (of a special sort). This is a reason for coining a different name, e.g., frequentist error statistician.

Upshot: Berger and Sellke tell us they will cure  the significance tester’s tendency to exaggerate the evidence against the null  (in two-sided testing) by using some variant on a spiked prior. But the result of their “cure” is that outcomes may too readily be taken as no evidence against, or even evidence for, the null hypothesis, even if it is false.  We actually don’t think we need a cure.  Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error statistician may well conclude that the flaw lies with the latter measure. This is precisely what Fisher argued:

Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to “exclude at a high level of significance any theory involving a random distribution” (Fisher, 1956, page 42). Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues—never minding “what such a statement of probability a priori could possibly mean”—the resulting high posteriori probability to H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (44) . . . “is not capable of finding expression in any calculation of probability a posteriori” (43). Sampling theorists do not deny there is ever a legitimate frequentist prior probability distribution for a statistical hypothesis: one may consider hypotheses about such distributions and subject them to probative tests. Indeed, Fisher says,  if one were to consider the claim about the a priori probability to be itself a hypothesis, it would be rejected by the data!

[i] A result my late colleague I.J. wanted me to call the Jeffreys-Good-Lindley Paradox.

[ii] An applet is available at∼berger

[iii] Bayesian philosophers, e.g., Achinstein, allow that this does not yield a frequentist prior, but he claims it yields an acceptable prior for the epistemic probabilist (see, e.g., Error and Inference 2010).

[iv]Does this remind you of how the Bayesian is said to become more subjective by using the Berger O-Bayesian prior? See Berger deconstruction.

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Fisher, R. A., (1956) Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Jeffreys, H. (1939). Theory of Probability. Oxford: Oxford University Press.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

53 thoughts on “Comedy Hour at the Bayesian Retreat: P-values versus Posteriors”

  1. guest

    A number of comments;

    * Berger and Sellke’s motivation is that (regular, two-sided) p-values are nowhere near posterior measures of support for the null, and that this is a problem because this is how non-experts mistakenly interpret p-values. One can certainly be critical of their resolution, in which the analysis behaves more like a posterior measure of support, but this doesn’t address the underlying problem of misinterpretation of p-values. If correct use of p-values is too subtle for most users – and it often is – then p-values do need a “cure” of some sort.

    * It’s not difficult to contrive examples where the best explanation for a tiny p-value (1 in 33000, if you like) is that something very unusual happened, not that the alternative is true. So Fisher’s argument is, in general, not entirely watertight.

    * While the fundamental goals of default Bayesian (two-sided) tests and default frequentist ones are different (as nicely explained in this blog comment) this does not mean that all Bayesian testing methods (and other methods) will disagree with all frequentist ones. The fundamental result is Bernstein von Mises theorem, which holds very broadly, and helps explain why good statisticians working in either paradigm will often agree. It can break down, notably when priors contain spikes, but when “we know all nulls are false” that’s not a big problem.

    * Why do you write that “Yet J. Berger claims his applets are perfectly frequentist”? Surely we can credit J Berger (or R Berger, or anyone else who publishes extensively on topics like these) with knowing what a long-run property is? If the problem is that your definition of “frequentist” differs from the one JB is using, it’s not a valid criticism.

  2. Guest: Just on the query in the last para, for now: Well he’s offering it as something that we frequentist testers should regard as doing a better job at what we want to do; it is assumed we have a problem. But admittedly, when I asked him, he seemed to disown the example of the applet. But it and variations on it are often given. It isn’t that there’s no frequency that can be defined (for randomly selecting hypotheses from urns), it’s whether it is a frequentist prior in the hypothesis about which one is making an inference. Moreover, if we put that to one side, what’s the grounds for us to switch to the O-Bayesian analysis here?

    • guest

      Frank Samaniego’s book (comparing frequentist and Bayesian estimation) has an extensive treatment on the frequentist prior idea, that might be of interest.

      Should we switch to the O-Bayes analysis here? Even if we’re being Bayesian, it’s easy to answer “no”. If a Bayesian analysis that approximates use of two-sided p-values better reflects our goals for the analysis at hand, we should use it instead.

      • Giving a high prior to my favorite null value may well “reflect my goal for the analysis” (as with the Newtonians who believed strongly in the ether) but this scarcely shows the posterior prob to indicate its (the null value’s) warrant by the data.

        • guest

          Such a spiky prior does reflect Your goals for the analysis if that’s Your prior and You want to know what You should (rationally) believe after seeing the data – and if You believe whatever modeling assumptions are built in.

          If not, fine, do something else.

  3. Guest: “If correct use of p-values is too subtle for most users – and it often is – then p-values do need a “cure” of some sort.”
    Well I don’t agree; please tell me how to interpret the O-Bayesian p-value in this test. And why are Bayesians even using p-values if they have been saying that what we really need are posteriors in the hypotheses and not p-values?
    I have a house of 30 Elbians who want to play philosophy charades, will write again later. Will open the comments.

    • guest

      I have not been defending the O-Bayes p-value, and I’m not going to start now. However, if you’d like examples of p-values being too complex for non-experts, look any place where e.g. p=0.01 in one study is considered much better evidence against the null than p=0.05 in another, without regard to power, effect size, severity, or whatever else one prefers to invoke. This “p-value culture” is a real problem, though a “cure” for it, if one is possible, may not require killing off p-values entirely.

      • As we saw, in the one sided normal test, the significance level is ~ the posterior probability for the O-Bayesian. So now your non-experts have a justification for regarding the .01 result as better evidence than the .05 result. At least with significance probabilities there are fairly ordinary grounds for distinguishing results, what is to be said once they’ve been given the blessing for a posterior probability assignment?

        • Eileen

          Wow! That’s right, I keep forgetting that p-values are being misinterpreted as posteriors... but if the O-Bayesian posterior = the p-value in the one-sided tests, as in the last post I commented on, then how is it magically ok?

  4. Guest: thanks for the comments and I will certainly study the book and post that you cite.
    * “Berger and Sellke’s motivation is that (regular, two-sided) p-values are nowhere near posterior measures of support for the null, and that this is a problem because this is how non-experts mistakenly interpret p-values.”
    But unless it is shown that the “posterior measures of support” are plausible and p-values are not, then the disagreement is scarcely to be settled against the frequentist p-value (why must the fault lie with us?). In particular, if the disagreement results only from spiked priors to a point null or to a small region around it, and that prior doesn’t make sense, then why would we want to use it? Now at least one big reason for deeming them implausible (at least for an error statistician) is that (a) they result in poor error probabilities, and (b) they are based on values that are at odds with the O-Bayesian’s own recommended priors in the one-sided test, which agree with the frequentist’s (see my last post).

    The kind of situation where it is thought this two-sided test would arise, say the O-Bayesians since Jeffreys, is when the null is a special value, perhaps one deemed plausible or about which there’s a lot of evidence. An example given is a test of a parameter like the deflection of light. Well Newtonians had a lot of evidence that it took a value consistent with Newton (0 or ½ the GTR value), and so would have accorded the null a prior even higher than .5. Thus, the statistically significant differences would rightly not have been construed as evidence against Newton. But, fortunately, that’s not how scientists were permitted to reason.

    * “It’s not difficult to contrive examples where the best explanation for a tiny p-value (1 in 33000, if you like) is that something very unusual happened, not that the alternative is true. So Fisher’s argument is, in general, not entirely watertight”.
    Yes, I understand the proper alternative here to be that there is a genuine discrepancy from the null, not that any particular alternative is true. So, I’m not sure if by “the alternative” you mean some substantive alternative which would not be the complement of the null.
    I want to be clear that I would not raise this example (or any of the other howlers on this blog) were it not that we have been beaten and bruised on these grounds over and over again, with few people really analyzing if the criticism is deserved.

    • guest

      If the spiked prior is nothing like what we believe, then I agree it’s hard to defend (and I’m not trying to defend it). The blog comment I mentioned before has a clear description of where the spike comes from – and it’s often a miscommunication. Where it’s not, and we want a posterior measure of support for the null, it’s hard to defend the p-value as an answer.

      I think the “fault”, as you put it, is not with either Bayesian or frequentist statistics, but with the insistence that what’s become a default method in one or other paradigm must be the right approach to every problem.

      Re: the p=1/30,000 issue; the examples I had in mind are those where the data is extremely unlikely under the posited model, and that this is a better explanation than the null being false. If we nevertheless insist the model is true, it may be most appropriate to interpret p=1/30,000 as evidence simply that something unusual happened. And please note I used the word “contrived” deliberately!

      • john byrd

        Guest, as to the 1/30,000 result, let me suggest that Fisher the statistician would be quick to acknowledge the possibility of the rare event, but then he as an experienced scientist would ask why one runs an experiment only to ignore the result?

        • guest

          @John, I agree Fisher would acknowledge this – and would furthermore have had the good sense to avoid such situations in his own research, by doing a (literally) textbook job of designing experiments that couldn’t lead to them.

          But as he also fairly famously noted, not every researcher consults a statistician prior to their study. Statisticians do get called in, post hoc, to extract whatever it is that studies without good design can tell us – and if the alternative is no study at all, grumbling about “what the experiment died of” isn’t justified. In these messy settings, even with small p, happenstance may be the best single explanation.

          • Fisher always emphasized that no single statistically significant result sufficed to show a genuine experimental effect with stringency. It required knowing how to bring about results that would rarely fail to be statistically significant.

  5. Eileen

    This last post and the discussion helps, but I’m all the more confused when Berger’s paper says all 3 would agree on these tests?!?

  6. David Rohde

    It seems a few issues are being mixed together here….
    In particular the evaluation of p-values by Neyman Pearson criteria and the evaluation of p-values by Bayesian criteria.

    You start off talking about the first and move to the second, I will just talk about the first. i.e. comedy hour at the Neyman Pearson retreat.

    “Frequentist (Fisherian) Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!”

    … as I understand it, this is false. The p-value is the probability under the null for a test statistic of the observed value or a more extreme value. It does not have a repeat sampling interpretation. The interpretation when the null is rejected is rather either something unexpected happened or the null is false.

    In order to evaluate the frequentist properties of an accept reject rule based on a p-value it is necessary to supply extra information in the form of the distribution of the p-value under H_1 (a prior is not needed).

    Let me ask some questions (I understand you may not have time or interest to answer…)

    * Do you disagree with Berger that his procedure has the advertised coverage?

    * Do you object to computing the probability of the p-value under H_1? that is maybe the Fisherian should retort “you made assumptions about the distribution of the p-value under H_1 which I don’t like”.

    * Do you see p-values and coverage as just different, and see no reason to reconcile them?

    * Do you object to conditional frequentist methods?

    • David: I owed you a bit more:
      I hope we are now clear that the p-value (the proper, ordinary frequentist animal) most certainly has the frequentist interpretation, P(T > t) is just a sampling distribution. Neither Fisher nor N-P denies this, J. Berger seems to be misinterpreting a couple of phrases from one Neyman paper, and others follow suit. As for the allegation of inconsistency between Fisherian and N-P tests, you might read an earlier post beginning the 4th para . Listen to Fisher. Or, if you win the palindrome contest, you can get Lehmann’s new book that also exposes this. No I don’t object to computing the p-value under alternatives. Do I object to conditional frequentist methods? This could only be answered if one knew what “conditional frequentist methods” are. If you mean the J. Berger attempt at conditional p-values, all I can say is that no account that even he deems satisfactory exists. I believe he said it wasn’t ready for prime time. J. Berger tries out all kinds of ingenious things, and drops them if they don’t seem to work; it is my understanding that this is one of those (and I’d be glad if he jumped in to clarify). But in their wake, there’s a trail of statements by his readers that haven’t followed the story to the end it seems….

      • David Rohde


        Thanks for the response… I believe you are right on the first point. (And I made a mistake: the distribution of the p-value under H_1 is relevant only to the type II error.)

        On the other hand, as “guest” has mentioned, reporting both the threshold and the p-value is problematic I believe… because the p-value changes when the test is repeated, but in unconditional tests the threshold doesn’t. For you, are both the threshold and the p-value error probabilities?… and hence there is no need to unify NP and Fisherian tests? Your points of difference with Berger seem to be around this point…

        Thanks for the post, reading it and the cited literature has helped my understanding….

        • To get to your main point, my points of difference with J. Berger are not around the need to unify NP and Fisher, in any way that remains within the frequentist error statistical account. What I mean is, the differences between the two are vastly exaggerated in the literature, but the issue Berger is raising has nothing to do with the minor differences between NP and Fisher, or between p-values and type 1 error probs (it has to do with his idea that p-values overstate evidence, and I deny that!). Actually, Senn just sent me something relating to this (he also denies the allegations of overstating); I might post it.

  7. All: I had to remove keywords from this post to avoid having automatic links to the wikipedia which is chock full of erroneous definitions. Anyone know how to avoid this?

  8. There’s a lot here to respond to, and I’m happy to. Taking one at a time, on the first: if one rejects the null whenever the p value is below a chosen threshold (in a proper significance test), then the probability of a type 1 error is p. I must leave for a few hours, but you might see my comment on Berger, Mayo 2003, on the previous post.

    • guest

      I’m sure this is just a typo, but you mean that the probability of a type I error is alpha (the threshold) – not p.

  9. Not a typo. Take Cox and Hinkley (1974); The p value “is the probability that we would mistakenly declare there to be evidence against Ho, were we to regard the data under analysis as just decisive against Ho.” (p. 66) Or take Lehmann and Romano on p-values: (Testing Statistical Hypotheses, third ed. 2005) “it is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level,….at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.”(64-5)
    They go on to discuss the p-value viewed as a statistic and show the probability the p-value is less than u is no greater than u.
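The property in that last sentence is easy to check numerically. The sketch below assumes a two-sided Normal test with a continuous statistic, where the bound holds with equality; the function name rejection_rate and the choice of cut-offs are introduced here purely for illustration. Each cut-off u is fixed before any data are drawn.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(u, n_reps=100_000, seed=7):
    """Relative frequency of p <= u over replications generated under H0."""
    rng = random.Random(seed)
    hits = sum(two_sided_p(rng.gauss(0.0, 1.0)) <= u for _ in range(n_reps))
    return hits / n_reps


if __name__ == "__main__":
    # Under H0 the observed frequency should sit close to each cut-off u.
    for u in (0.01, 0.05, 0.2):
        print(u, rejection_rate(u))
```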

    • guest

      If it’s not a typo then it needs a more precise statement;

      Pr(p<alpha) = alpha if we choose alpha in advance, and consider all replications of the experiment under the null, _including_ the one at hand.

      Pr(p*<p) = p, where p* denotes the p-value from replicate experiments under the null and _other_ than the one at hand, i.e. the one that generated p in the first place.

      In general, your statement about u is not correct; if one is permitted to choose u after seeing p, setting u=p+delta trivially provides Pr(p<u)=1.

      For more on the confusion caused, see Hubbard and Bayarri (2003, American Statistician)

      • Guest: Please just check the references I gave. Lehmann is correctly suggesting people report the p-value observed so that they can apply their own type 1 error rate to them. The computation is exactly the same. This is all well known and should not be controversial. Maybe you prefer Casella and R. Berger (2002, p. 397). No wonder there’s so much confusion about p-values. Your claims in your 3rd and 4th paras seem to involve different animals from ordinary p-values, maybe that’s the problem.

        • guest

          The practice of reporting the p-value is not the problem, and I’m not claiming any of it is controversial.

          You wrote “if one rejects the null whenever the p value is below [alpha] then the probability of a type 1 error is p”. But if one implements the testing procedure as described in these words, then in the long run (under the null) one makes Type I errors with probability alpha.

          If you are referring to some other sampling or testing procedure – which is what the “were we to regard” caveat of Cox and Hinkley entails – then other statements can of course be made.

          But if we don’t make the distinction, confusion reigns, as per Hubbard and Bayarri.

          • I do not see what part of the assertion you are regarding as a caveat, but never mind. If one rejects the null with the threshold, say, .01, then the probability that we would reject the null (and declare evidence against Ho) under the assumption Ho is true, is no more than .01. I don’t know what the “distinction” is that you are referring to, and Bayarri concurs with J. Berger who is the one making the joke about p-values not being frequentist in nature. I hope we have settled this now, or there’s little reason to go further.

            • guest

              What part of the assertion? The “were we to regard the data under analysis as just decisive against H_0” part, of course. Nothing says we *have* to do this, and many constructions of testing do *not* do it – in particular the version where one sets a threshold alpha and uses whether p<alpha to make the decision.

              On its own the term "probability" does not make a distinction between elements of the following non-exhaustive list;

              i) randomness of sampled data, unconditional on the data at hand
              ii) randomness of hypothetical replicates, that are compared to some function of the data at hand (typically its p-value)
              iii) randomness describing knowledge about parameters, should one choose to describe it this way

              The probability of Type I error in each is;
              i) preset threshold alpha
              ii) p from the data at hand
              iii) the posterior probability of the alternative

              These are all different, as I hope you'll acknowledge.

              • Cox’s assertion is just one of many implications of the correct mathematical definition of a p-value; it has nothing to do with what we have to do. Your (i)-(iii) regarding probability are extremely confusing, and of course (iii) has no part in the frequentist significance test. The knowledge about parameters gives posterior probabilities of the alternative? Oy! It seems the well-defined mathematical notions (e.g., significance levels) are being subjected to a hodge-podge of different interpretations, and this is supposed to be progress over the “p-value culture”. I think I’m beaten…going back to Elba.

  10. Mark

    Deborah, I’m confused. The p-value is certainly not the probability of a type 1 error… If it were, then 1-p would be the probability that the result is not a type 1 error, but it’s calculated assuming the null is true, and if it’s lower than the threshold then one would reject (and thus be making a type 1 error). Or, were you giving an example of mis-interpretations of p-values? I need to read this more closely…

    • Mark: Just to repeat the Cox and Hinkley reference above, The p value “is the probability that we would mistakenly declare there to be evidence against Ho, were we to regard the data under analysis as just decisive against Ho.” (p. 66)
      Abbreviate the outcome corresponding to the one that just attains stat sig equal to p as T(p) (for ease of symbols). So were we to regard the data under analysis as just decisive against Ho when T is as great as T(p), then whenever T < T(p) the result is regarded as NOT evidence against the null. For, in those cases, the p-value exceeds the threshold. So 1 – p is the probability of T < T(p) under the assumption that the null is true.
      P(do not reach p-level significance; Ho) = 1 – p.
      So of course this is computed under the assumption Ho. To just write as you do Mark, the “probability that the result is not a type 1 error” is HUGELY ambiguous (almost sounds like you could mean a posterior prob). But the assertion I wrote, which is correct, is not ambiguous, even though I’m writing it informally. Oy! I cannot teach significance tests here.

      • Where’s that enthusiastic commentator, Fisher–he always used to back me up!

    • john byrd

      I am not a statistician but will try to explain. The p-value is the estimate of the probability a TRUE null as defined in the experiment will yield a value as large/small as observed. It pertains to actual results in hand. Neyman/Pearson called this the prob of Type I error. Fisher called it the significance level.

      The alpha of Neyman/Pearson is a planning factor that will ensure that the Type I error RATE OVER TIME does not exceed alpha=a, for some “a” consistently used as a means of selecting the cutoff in the tests. Alpha is used to help control the Type I error rate, but is not the same thing as the p-value, which does provide your estimate of the probability of a Type I error in a single test result.

      If we use alpha=0.05 for a series of four tests, then perform the tests, we have good reason to believe we have established an upper bound on the probability of Type I errors at approx 0.05. However, if we obtain results like 0.01, 0.001, 0.01, and 0.02 as p-values in the respective runs, then we have done much better than 0.05 as probabilities of Type I errors. Alpha for planning and controlling through planning. P-value for interpreting a result.

      • guest

        Goodness. One final time, I promise – and I promise I am not asking for anything complex.

        Yes, p-values pertain to actual results in hand, p being the probability of seeing something more extreme than your data in replicate studies. Yes, you can use p-values to interpret data, though please don’t pretend that on their own they’ll make a good job of this. Yes, if you did other experiments and rejected when p*<p, you'd control the Type I error rate among them at p.

        The only point I have been raising is that, for statements saying Pr(Type I error) = p to make any sense, one needs to condition on the data at hand … and to *say* one is conditioning on it. If not, the reader can't tell between what you actually mean, or instead the Type I error rate among studies where we simply reject when the study's respective p-value is < alpha. They are different procedures, and they have a different Type I error rate. p's are not alpha's.

        It's not *any* more complicated than spelling out what the denominator means when we give other rates.

        That is all.

      • Mark

        John Byrd,

        Oh my, no that’s not correct, the p-value is not the probability of a type 1 error unless we somehow decide that the observed p-value will be “just decisive” against the null, which I personally think is bunk. Imagine doing a study where you plan to use a significance level of 0.01. Suppose at the end of the day, you observe a p of 0.20. You don’t have 20% chance of making a type 1 error in this case, you have 0 chance because you won’t reject the null. The p-value is a function of the observed data, not an error probability.

        • Mark, what you wrote is absurd. The p-values give you counterfactual information about the probability of a difference as large as what you got under the null. If you take that difference as just decisive (i.e., “evidence to reject the null”) then the probability of a type 1 error is of relevance. Were you to take a p value of .2 as grounds to infer “reject the null” then you’d be making such an inference even though the null is true with probability .2. Running through the error probabilities associated with inferential moves is to help understand the impact of the data. These claims hold regardless of the inference you actually reach. Seeing you’d be erroneously construing the result 20% of the time might well lead you not to declare it evidence against the null.
          You continue to write “the probability of a type 1 error” in ways that guarantee you will misinterpret it. I recommend going back and reading someone like Cox and Hinkley, and not bandying the term about so loosely and equivocally.

          • Mark

            “Even though the null is true with probability .2”? I do agree with the first part of your comment regarding the counterfactual information. That is precisely what p-values are (except that you need to say “as large or larger”), and that is all they are. They are not error probabilities.

            • You’d be erroneously making such an inference with prob. .2. These English phrases are much more clearly and unambiguously put in symbols, which is hard to do here, something like P(T > t(.2); Ho) = .2. Once again, let me recommend you see Cox and Hinkley (1974): the p value “is the probability that we would mistakenly declare there to be evidence against Ho, were we to regard the data under analysis as just decisive against Ho.” (p. 66) Or take Lehmann and Romano on p-values (Testing Statistical Hypotheses, third ed. 2005): “it is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level,….at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.”(64-5) Do not use for this purpose articles busy grinding an axe, e.g., to replace p-values with O-Bayesian conditional posteriors.
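A minimal sketch of the symbolic claim P(T > t(.2); Ho) = .2, assuming a one-sided test whose statistic T is standard Normal under Ho (that choice, and the function name `rejection_rate_under_null`, are illustrative assumptions, not anything specified in the thread). It simulates many studies in which Ho is true and counts how often a rule that treats p ≤ .2 as decisive would reject.

```python
import random
from statistics import NormalDist


def rejection_rate_under_null(cutoff_p, n_trials=100_000, seed=0):
    """Fraction of true-null trials whose one-sided p-value is <= cutoff_p."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        t = rng.gauss(0.0, 1.0)        # test statistic drawn with Ho true
        p = 1.0 - std_normal.cdf(t)    # one-sided p-value P(T > t; Ho)
        if p <= cutoff_p:              # treat this p-value as "just decisive"
            rejections += 1
    return rejections / n_trials


# t(.2) is the cutoff that the rule "reject when p <= .2" corresponds to.
t_cut = NormalDist().inv_cdf(1.0 - 0.2)
rate = rejection_rate_under_null(0.2)
print(f"t(.2) = {t_cut:.3f}; simulated erroneous rejection rate = {rate:.3f}")
```

The simulated rate is a frequency over hypothetical repetitions with Ho true. It is not a posterior probability that Ho is true given the data in hand.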

        • john byrd

          Mark, I think you are confounding how we can project future performance of tests with what we can say about a result. N-P define Type I error as “probability of a false rejection” of the null H. If my alpha going in was 0.05 but I obtain p=0.01, then the prob of having falsely rejected is best measured at 0.01, post experiment. Else you discard relevant info as pointed out by Lehmann.

  11. Christian Hennig

    Just for the moment shouting in something regarding the first remark of the guest, it is true that many people including too many researchers struggle to properly understand p-values.
    However, as opposed to what many Bayesians claim, Bayesian posterior probabilities are regularly misunderstood as well. For example, most practitioners faced with a statement that the posterior probability for the H_0 is, say, 28%, are not aware that according to Bayesian philosophy (I refer to de Finetti here but most others agree on this) there is no such thing as a “true underlying distribution” and that therefore the posterior probability doesn’t at all mean that the set of distributions formalised as H_0 would therefore be true with a probability of 28%, which is what many would make of this statement.

    • It is one thing to misunderstand or misinterpret something—that assumes there is a correct understanding, something error-statistical notions enjoy. I do not know the correct understanding of the O-Bayesian posteriors, and other Bayesian measures are equally fluid: I’ve collected around 4 meanings at least, sometimes by the same author. I will write more on this later…back to book.

  12. Hi, guys, it’s nice to see a discussion like this on the web. Let me try to give my 2 cents.

    Well, I think both the guest, Mark and Mayo have a point. Maybe you guys don’t realise that you’re saying the same thing with different words.

    First, it’s clear that p-values are different from alphas. One clear example is when the researcher makes his hypothesis after the fact. See, suppose he throws 10 coins, and the result is HHHTTTHHHT.

    If he wants to test if the coin is fair, he could check the data and look for “strange” features of the data. He could think “hum, what is the probability of three or more runs of heads and tails, alternating like this one?”. And he could calculate the p-value.

    But, if one knows that he did that after the result, one could easily claim that his “type I error probability” is not his p-value. Because other strange data could have been seen (is that the correct verbal tense?) and we know that he would also have found low p-values.

    On the other hand, if he set up the experiment in advance, and said what kind of “strangeness” of data he would take as evidence against the hypothesis, then we could calculate the real “type I error”.

    Now, is the p-value calculated before valueless? Of course not. It is still useful as a counterfactual reasoning. But, it has to be thought carefully.

    Another example is with randomized rejection regions. If your variable is discrete, you have to randomize your rejection region to get an arbitrary “type I error”. In that case, the difference between p-value and alpha is clear, because sometimes the same p-value will be counted as evidence and sometimes it won’t (though it is fair to say that I don’t know who would actually do that in practice, it is still an example).
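A small sketch of the randomized rejection region just described, under assumptions chosen only for illustration: 10 tosses of a fair coin under the null, a one-sided test that rejects for many heads, and a target type I error of exactly .05. The function names are hypothetical.

```python
from math import comb


def binom_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2), i.e. a fair coin under the null."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


def randomized_test(n, alpha):
    """Return (k, gamma) for 0 < alpha < 1: always reject when X > k,
    and reject with probability gamma when X == k."""
    k = n
    # Lower the boundary while the whole tail P(X >= k) still fits under alpha.
    while k >= 0 and binom_tail(n, k) <= alpha:
        k -= 1
    p_above = binom_tail(n, k + 1)      # always-reject region, size <= alpha
    p_at = comb(n, k) / 2 ** n          # probability of the boundary value
    gamma = (alpha - p_above) / p_at    # coin-flip weight on the boundary
    return k, gamma


k, gamma = randomized_test(10, 0.05)
size = binom_tail(10, k + 1) + gamma * comb(10, k) / 2 ** 10
print(f"boundary X = {k}, reject there with probability {gamma:.4f}, size = {size:.4f}")
```

With these numbers the boundary outcome has one fixed p-value, yet it is counted as a rejection only part of the time, which is the sense in which the same p-value is sometimes evidence and sometimes not.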



    • Just to make myself clear, if we know that the researcher would look for “rare” features of the data after the fact, then the real type I error would be something like: “what is the probability of finding at least one rare feature in the data (that is, something with low p-value), in all different metrics, if the null hypothesis were true”.

      But the p-value that the researcher shows us would be something like: “what is the probability of finding something rare, or rarer, in this particular metric, if the null hypothesis were true”.

      The same thing happens with multiple hypothesis testing (when he just wants to find a low p-value) and so on.
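To make the contrast between the two probabilities concrete, here is a hedged sketch that enumerates all 1024 outcomes of 10 fair coin tosses. The three "strangeness" metrics (head/tail imbalance, longest run, and how far the number of runs is from its expected value) are assumptions picked for illustration; a real researcher hunting after the fact could use many others.

```python
from collections import Counter
from itertools import product

N = 10
sequences = list(product("HT", repeat=N))  # all outcomes, equally likely under the null


def imbalance(seq):
    """How far the number of heads is from N/2."""
    return abs(seq.count("H") - N / 2)


def longest_run(seq):
    """Length of the longest block of identical tosses."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best


def run_count_deviation(seq):
    """How far the number of runs is from its null expectation (N + 1) / 2."""
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    return abs(runs - (N + 1) / 2)


STATISTICS = [imbalance, longest_run, run_count_deviation]


def p_value_table(stat):
    """Exact map: value -> P(stat >= value) under a fair coin."""
    counts = Counter(stat(s) for s in sequences)
    total = len(sequences)
    return {v: sum(c for u, c in counts.items() if u >= v) / total for v in counts}


tables = [p_value_table(stat) for stat in STATISTICS]


def type_one_rates(alpha=0.05):
    """Rejection rates under the null: each metric alone, and the smallest p over all."""
    total = len(sequences)
    single = [sum(t[stat(s)] <= alpha for s in sequences) / total
              for stat, t in zip(STATISTICS, tables)]
    hunted = sum(min(t[stat(s)] for stat, t in zip(STATISTICS, tables)) <= alpha
                 for s in sequences) / total
    return single, hunted


single, hunted = type_one_rates()
print("each metric fixed in advance:", [round(r, 4) for r in single])
print("smallest p-value picked after the fact:", round(hunted, 4))
```

Each metric fixed in advance rejects a true null no more than 5% of the time, while reporting whichever metric looks rarest rejects more often than that. The reported p-value stays the same, but the procedure's actual error rate does not.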

      • Sure, and we (error statistical p-value folk) would take such “selection effects” into account, and that is why the post-data specification of unusual features or the like that you describe would NOT be the correct computation for a p-value. But this issue is very different from the one under discussion, or the one in J. Berger. Thanks.

        • I see, I didn’t want to refer to Berger, just to the alpha x p-value discussion.

          I think that was what Mark was talking about: many researchers think that the p-value (no matter how it was calculated) is the real type I error probability.

          Something like this:



          • Carlos: I will look at your link when I can, all I can say is that they’d better care about such things as “hunting with a shotgun” if their reported p-values are to be anywhere close to the actual ones. But I want people just tuning in to realize that is a distinct issue. You might see chapters 9 & 10 of EGEK(1996) which is linked to on this blog.

  13. statmath


    My two cents about p-values…

    One way to find a p-value is as follows: we must first construct a statistic T(X) (whose distribution under the null hypothesis is known or at least approximately known) such that the larger it is, the more inconsistent the observed data are with the null hypothesis. The p-value is then defined as

    p = P( T(X) ≥ t ; under the null hypothesis),

    where t is the observed value of the statistic T. In these terms, the p-value is nothing more than the probability of finding a statistic T at least as extreme as the one that was observed. That is, if the observed statistic t is large then we expect to observe an associated small p-value.
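As a hedged illustration of this definition, the sketch below assumes the simplest textbook setting: a Normal sample with known standard deviation, a null value mu0 for the mean, and the one-sided statistic T(X) = sqrt(n)(mean - mu0)/sigma, which is standard Normal under the null. The function names and the sample values are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_statistic(sample, mu0, sigma):
    """T(X): standardized distance of the sample mean from the null value mu0."""
    n = len(sample)
    return sqrt(n) * (fmean(sample) - mu0) / sigma


def p_value(t):
    """P(T(X) >= t) when T is standard Normal under the null hypothesis."""
    return 1.0 - NormalDist().cdf(t)


sample = [0.9, 1.4, 0.2, 1.1, 0.6, 1.3, 0.8, 1.7]   # hypothetical data
t_obs = z_statistic(sample, mu0=0.0, sigma=1.0)
print(f"observed t = {t_obs:.3f}, p-value = {p_value(t_obs):.5f}")

# A larger observed statistic always gives a smaller p-value.
for t in (0.0, 1.0, 2.0, 3.0):
    print(f"t = {t:.1f} -> p = {p_value(t):.4f}")
```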

    A p-value of 0.01 may also mean that if the null hypothesis is true and we replicate the experiment a thousand times, we would expect just 10 of the replications to produce p-values smaller than the observed p-value (and 990 would be expected to be greater). Therefore, if we observe a small p-value our sample is denouncing a deviation from the null hypothesis (assuming, of course, that all suppositions are met).

    This is a nice way to evaluate `evidence’ against null hypotheses; however, it also has some logical inconsistencies. In general, statistical methods that use the sampling distribution of the statistic under null hypotheses carry some internal problems. Suppose the following null hypotheses H{01} and H{02} such that H{01} implies H{02} (if H{01} is true then H{02} must be true). By using the very same data, the p-value may show more disagreement with H{02} than H{01}.

    What do you think about that? I can build many instances to show this feature.

    Best regards,

    • Alexandre:
      Of course you can, and there’s no logical inconsistency, though it does seem problematic for the Bayesian of this example. I might note, before I get to that, that the replication remarks don’t hold, but I put those aside. OK, so briefly:
      Looking at my previous post and this one, you see the O-Bayesian will infer the one sided alternative but not the two sided at the given level. Nor is it just a slight difference as between one and two-sided tests, in cases where a “selection effect” is taken into account by an error statistician: as we see it is routinely going to be a huge difference. Now for a probabilist, or one seeking to use probability for a degree of belief or support, it is indeed quite odd. The error statistician, on the other hand, reports the relevant error probability, and if she is of the severe testing ilk, the warrant for doing this pertains to controlling and evaluating the severity or stringency or the like of the test, in relation to a specific inference. The error probability associated with the inference is not a posterior probability and does not obey the probability calculus. There is no onus whatever to obey versions of “the consequence principle” such that this would be assumed to follow, nor is it the “same” data or the same test. That doesn’t mean it is always violated (and I explain where and why it remains). I hope that helps.

      • Alexandre


        sorry but I cannot see any connections with my post. Maybe because I’m too tired today

  14. Alexandre: As they say, one wo/man’s modus ponens is another’s modus tollens.
    If that doesn’t get to what I took to be your issue, then I’m sorry, I was obviously going too fast in my concern to give feedback to a comment on an older post, rather than leave it unanswered. Why don’t you try again.

    • Alexandre

      Suppose the following null hypotheses H{01} and H{02} such that H{01} implies H{02} (if H{01} is true then H{02} must be true).

      By using the very same data, p-values may show more disagreement with H{02} than H{01}; this result goes against logical reasoning. I can build many instances to show this feature.

      What do you think about that?

  15. Alexandre

    This is a slightly tricky issue and needs careful attention.

    Sometimes practitioners have to test a complicated hypothesis H. For ease of computation, instead of testing H, they may think of testing another hypothesis H’ such that if H’ is false then H is also false. You can draw a Venn diagram to properly understand what I’m saying.

    By using the logical reasoning, if these hypotheses aren’t true, we hope to find more evidence against H than H’, so if we find evidence against H’ we can relax and claim evidence against H.

    However, p-values do not follow this logical reasoning. We cannot use p-values in such a situation, that is, *if we find evidence against H’ we cannot relax and claim evidence against H*.
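One hypothetical instance of this, not taken from the comment itself: observe two independent Normal values x1 and x2 with unit variance, let H be "both means are 0" and H’ be "the first mean is 0", so that H implies H’. The sketch assumes the usual tests, a two-sided z-test for H’ and a chi-square test with 2 degrees of freedom for H. The data values are made up.

```python
from math import erfc, exp, sqrt


def p_single_mean(x1):
    """Two-sided p-value for H': mu1 = 0, from one N(mu1, 1) observation."""
    return erfc(abs(x1) / sqrt(2.0))          # equals 2 * (1 - Phi(|x1|))


def p_both_means(x1, x2):
    """p-value for H: mu1 = mu2 = 0, using x1^2 + x2^2 ~ chi-square with 2 df."""
    return exp(-(x1 ** 2 + x2 ** 2) / 2.0)    # chi-square(2) upper tail is exp(-x/2)


x1, x2 = 2.0, 0.1                             # hypothetical observed data
print(f"p-value against H' (mu1 = 0):       {p_single_mean(x1):.4f}")
print(f"p-value against H  (mu1 = mu2 = 0): {p_both_means(x1, x2):.4f}")
```

With these numbers the p-value against the weaker hypothesis H’ falls below .05 while the p-value against the stronger hypothesis H does not, even though H can only be true if H’ is. The two p-values come from different test statistics with different sampling distributions, which is why they need not be ordered the way the logical implication is.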

  16. Ronald

    Dear Mayo

    It is fascinating to see the overall confusion about the meaning of the p value not only among researchers but also among the majority of statisticians. BTW, did you ever consider that Lehmann and Romano were wrong, since the p value is a random variable? According to them there are two kinds of significance levels, p values being the “smallest one”. This miscomprehension has been shared among many other statisticians in the past.
