The P-Values Debate


National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)

Categories: J. Berger, P-values, statistics debate | 7 Comments


7 thoughts on “The P-Values Debate”

  1. Yu-li

    I have been reading your book “Error and the Growth of Experimental Knowledge,” and the debate greatly helped me understand the book. I plan on reading all of your books! Thank you.

  2. Overall I enjoyed the debate. I think you were very on point, Prof. Mayo, and Jim put up a good fight, too. I was unimpressed by David. There are two points I’d like to comment on.

    1.) In the video, timestamps 1:09:20 to 1:10:10, Jim said:

    “Suppose the power of the test equals the type I error. Then a rejection of the null hypothesis means nothing. The rejection region could have equally happened under the null or the alternative and so you’ve learned nothing. Trying to infer something from type I error alone, not considering power, can be grossly misleading.”

    While I agree with the very last sentence in general, the specific example and logic put forth in the other sentences seem misguided. In the debate you pushed back on it a bit, but I think a more thorough argument can be made against the above statement.

    First, I think he meant to say that the type II error equals the type I error, not that power equals the type I error, since otherwise it becomes demonstrably absurd: a rejection of the null by a test with size alpha = .01 and power .01 against a reasonable alternative would be quite an amazing thing indeed, and would suggest a true effect size much greater than the one the test was powered for.

    Power is defined as the probability of observing a statistically significant outcome at level alpha if a particular point alternative is in fact true. His statement makes sense only if:

    – the null is incorrectly posed, e.g. a nil null when in fact we must reject a null about an effect of a certain size for it to matter substantively (super common in my opinion)
    – AND if the point alternative has a value coinciding with the observed effect size (shpower)
    – AND the observed p-value is ~ alpha

    Say we have a test of size alpha = .01 and beta = .01 (power = .99) towards the alternative mu = 1. Say the true effect is indeed mu = 1. If we reject mu = 0 with a p-value ~ alpha, we have learned something. If the test was well designed, the null would be such that rejecting it with any observed value would have substantive consequences. In any case we would certainly be able to infer a directional effect of non-zero size. Now, if one were to argue that rejecting a new null such as mu <= 1 would be just as warranted as rejecting mu <= 0, that would be an obvious mistake, which can be pointed out easily by various methods, severity obviously being a great way.
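    The numbers in this first example can be checked with a minimal sketch. It assumes a one-sided z-test of mu <= 0 with a known standard error, which the comment does not specify, and the helper names (`design_se`, `cutoff`, `power`, `severity_greater_than`) are hypothetical, chosen only for illustration. Severity for the claim mu > mu1 is computed here as the probability of a result no larger than the observed one if mu were equal to mu1.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def design_se(alpha, beta, mu_alt, mu_null=0.0):
    """Standard error at which a one-sided z-test of size alpha
    has power 1 - beta against the point alternative mu_alt."""
    return (mu_alt - mu_null) / (Z.inv_cdf(1 - alpha) + Z.inv_cdf(1 - beta))


def cutoff(alpha, se, mu_null=0.0):
    """Smallest observed mean that rejects mu <= mu_null at level alpha."""
    return mu_null + Z.inv_cdf(1 - alpha) * se


def power(mu, alpha, se, mu_null=0.0):
    """Probability of rejecting mu <= mu_null when the true mean is mu."""
    return 1 - Z.cdf((cutoff(alpha, se, mu_null) - mu) / se)


def severity_greater_than(mu1, observed, se):
    """Severity for the claim mu > mu1: P(mean <= observed; mu = mu1)."""
    return Z.cdf((observed - mu1) / se)


alpha = beta = 0.01
se = design_se(alpha, beta, mu_alt=1.0)
x_obs = cutoff(alpha, se)  # a just-significant result, p-value ~ alpha

print("observed mean at the cutoff:", round(x_obs, 3))
print("power against mu = 1:", round(power(1.0, alpha, se), 3))
print("severity for mu > 0:", round(severity_greater_than(0.0, x_obs, se), 3))
print("severity for mu > 1:", round(severity_greater_than(1.0, x_obs, se), 3))
```

    Under these assumptions the just-significant observed mean sits halfway between the null and the alternative. The claim mu > 0 passes with severity of about .99, while the claim mu > 1 gets only about .01, which is the sense in which rejecting mu <= 1 would not be warranted by the same data.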

    In another test, alpha = .01 and beta = .01 (power = .99) towards the alternative mu = 10. The true effect is mu = 1. If we were to observe a statistically significant outcome with an observed value of, say, 2, that would lead to learning, regardless of whether the observed p-value is ~ alpha or smaller. Even if the data don't warrant significant support for the particular alternative mu = 2, we would still be able to infer a directional effect of non-zero size by rejecting mu <= 0, which should be sufficient if the statistical null was chosen so that it reflects the substantive issue.

    While power would have an effect on what we can learn, I don't think there is a situation in which we've not learned something from a rejection of the posited null. And to the extent to which the choice of the null was not cookbook, rejecting it is telling us exactly what we have set out to learn.

    2.) David pointed out twice that his journal saw an increase in its impact factor and that its rejection rate went up. However, these are not great arguments. A journal can see an influx of more papers of lower quality if it is perceived to have lowered its standards, simply because it is getting more press, or by virtue of its increased impact factor. A higher rejection rate then does not mean that the standards have increased or even that they have remained the same. In fact, the standards could have been lowered a bit and the influx of papers could still more than offset this, so that the rejection rate goes up anyway.

    Impact factor is a poor measure of standards and quality of science, as "there is no association between statistical power and impact factor, and journals with higher impact factor have more papers with erroneous p-values." This is based on "Prestigious Science Journals Struggle to Reach Even Average Reliability" (DOI: 10.3389/fnhum.2018.00037) and "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature" (DOI: 10.1101/071530).

    • Georgi:
      Yes, I wanted to respond on this at the debate but, given the very limited time, I used it on the business of power over alpha. I will study what you wrote and comment. On the point you are discussing, I do think he meant what you think he didn’t mean, namely power = alpha. This is all related to the attempt to view these error probabilities as likelihoods. More later.
      Oh, on the claim his “impact factor” went up, in fact it went down. I’d never looked at impact factors before, but two people sent me this:
      When Trafimow first made his claim-to-fame move, people cited it, so it likely went up at first. At the debate, I didn’t understand why he thought that was at all relevant to the fact that his authors badly exaggerate results.

      • Christian: > Nice debate and well done you!
        Agree and also David “seemed” to just not understand scientific reasoning…

        It happens and also to philosophers who do not “seem” to understand statistical reasoning (what probability models are and how they work).

        It is hard to be sure of such judgements.

        Keith O’Rourke

        • Keith:
          But he is in charge of editing or co-editing a journal. Do they not have thresholds of understanding scientific reasoning for editing a journal that hinges on scientific reasoning? That’s one of the reasons I said (in my last answer) that some of the reactions to the replication crisis were unconstructive. Anything goes, in some circles.

  3. Christian Hennig

    Nice debate and well done you!
    I had a question for David that eventually wasn’t used, but anyway, he was banging on all the time about the fact that we know anyway that the null hypothesis isn’t true, so why should we test it; we can’t learn anything useful from it that we don’t know already.

    But this is nonsense. Of course the H0 isn’t literally true; however, if the data don’t allow us to distinguish the real situation from the H0, it is clear that no evidence can be claimed for anything substantially different from it. I’d have been curious how he could defend his “we can’t learn anything useful from testing an H0 that is wrong anyway” against this. (Apart from this, no parametric model is true anyway, so based on this he shouldn’t do any parametric statistics at all… and nothing nonparametric either, because data isn’t even perfectly i.i.d.! Being true is not the job of models – that doesn’t mean they don’t have any job!)

    • Christian:
      Glad that you watched. I did bring up, 2 or 3 times, the point that we can find out true things from deliberately false models. He’s obviously wedded to this dismissal. Such “all flesh is grass” pronouncements get us nowhere and miss how we learn from models in science.

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.
