The P-Values Debate



National Institute of Statistical Sciences (NISS): The Statistics Debate (Video)

Categories: J. Berger, P-values, statistics debate

Post navigation

14 thoughts on “The P-Values Debate

  1. Yu-li

    I have been reading your book “Error and the Growth of Experimental Knowledge,” and the debate greatly helped me understand the book. I plan on reading all of your books! Thank you.

  2. Overall I enjoyed the debate. I think you were very on point Prof. Mayo, Jim put a good fight, too. I was unimpressed by David. Two points I’d like to comment on.

    1.) In the video, timestamps 1:09:20 to 1:10:10, Jim said:

    “Suppose the power of the test equals the type I error. Then a rejection of the null hypothesis means nothing. The rejection region could have equally happened under the null or the alternative and so you’ve learned nothing. Trying to infer something from type I error alone, not considering power, can be grossly misleading.”

    While I agree with the very last sentence in general, the specific example and logic put forth in the other sentences seems misguided. In the debate you pushed back on it a bit, I think a more thorough argument can be made against the above statement.

    First, I think he meant to say that the type II error equals the type I error, not that power equals the type I error since otherwise it becomes demonstrably absurd – a rejection of the null by a test with size alpha .01 and power .01 against a reasonable alternative would be quite an amazing thing indeed and would suggest a true effect size much greater than the one the test was powered for.

    The definition of power which is the probability of observing a statistically significant outcome at level alpha if a particular point alternative is in fact true. His statement makes sense only if:

    – the null is incorrectly posed, e.g. a nill null when in fact we must reject a null about an effect of a certain size for it to matter substantively (super common in my opinion)
    – AND if the point alternative has a value coinciding with the observed effect size (shpower)
    – AND the observed p-value is ~ alpha

    Say we have a test of size alpha = .01 and beta = .01 (power = .99) towards the alternative mu = 1. Say the true effect is indeed mu = 1. If we reject mu = 0 with p-val ~ alpha, we have learned something. If the test was well designed, the null would be such that rejecting it with any observed value would have substantive consequences. In any case we would certainly be able to infer a directional effect of non-zero size. Now, if one were to argue that rejecting a new null such as mu <= 1 would be just as warranted as rejecting mu <= 0, then that would be an obvious mistake in which can be pointed out easily by various methods, severity being a great way, obv.

    In another test, alpha = .01 and beta = .01 (power = .99) towards the alternative mu = 10. The true effect is mu = 1. If we were to observe a statistically significant outcome with an observed value of say 2 that would lead to learning, regardless if the observed p-value is ~ alpha or is smaller. Even if the data doesn't warrant significant support for the particular alternative mu = 2, we would still be able to infer a directional effect of non-zero size by rejecting mu <= 0 which should be sufficient if the statistical null was chosen so it reflects the substantive issue.

    While power would have an effect on what we can learn, I don't think there is a situation in which we've not learned something from a rejection of the posited null. And to the extent to which the choice of the null was not cookbook, rejecting it is telling us exactly what we have set out to learn.

    2.) David pointed out twice that his journal saw an increase in its impact factor and that its rejection rate went up. However, these are not great arguments. A journal can see an influx of more papers of lower quality if it is perceived to have lowered its standards or simply due to it getting more press. Or by virtue of its increased impact factor. A higher rejection rate then does not mean that the standards have increased or even that they have remained the same. In fact the standards could have been lowered a bit and still the influx of papers could be offsetting the net effect on the rejection rate.

    Impact factor is a poor measure of standards and quality of science, as "there is no association between statistical power and impact factor, and journals with higher impact factor have more papers with erroneous p-values." based on "Prestigious Science Journals Struggle to Reach Even Average Reliability" DOI: 10.3389/fnhum.2018.00037 , "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature" DOI: 10.1101/071530

    • Georgi:
      Yes I wanted to respond on this at the debate but, given the very limited time, I used it on the business of power over alpha. I will study what you wrote and comment. On the point you are discussing, I do think he meant, what you think he didn’t mean, namely power = alpha. This is all related to the attempt to view these error probabilities as likelihoods. More later.
      Oh, on the claim his “impact factor” went up, in fact it went down. I’d never looked at impact factors before, but two people sent me this:
      When Trafimow first made his claim-to fame move, people cited it, so it likely went up at first. At the debate, I didn’t understand why he thought that was at all relevant to the fact that his authors badly exaggerate results.

      • Christian: > Nice debate and well done you!
        Agree and also David “seemed” to just not understand scientific reasoning…

        It happens and also to philosophers who do not “seem” to understand statistical reasoning (what probability models are and how they work).

        It is hard to be sure of such judgements.

        Keith O’Rourke

        • Keith:
          But he is in charge of editing or co-editing a journal. Do they not have thresholds of understanding scientific reasoning for editing a journal that hinges on scientific reasoning? That’s one of the reasons I said (in my last answer) that some of the reactions to the replication crisis were unconstructive. Anything goes, in some circles.

    • Georgi: I’m fairly sure that J. Berger did mean to say what he said, that you couldn’t learn anything from a stat sig result if the power was low. Of course “the power” doesn’t mean anything by itself. I think the remark is sufficiently perplexing to either write a post on it or write to Jim about it. It’s come up before on this blog having been claimed by Andrew Gelman.

      • Georgi Georgiev

        @Mayo – Thanks, I had just found the time to revisit your post “Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)” and I see that indeed Jim must have meant alpha equal to power. The post does a great job of explaining what I quickly dismissed as “demonstrably absurd”. I remember reading it back when it was first posted and appreciating it a lot, but I’ve not heard anyone bring up this argument since. Again, thank you for the debate!

  3. Christian Hennig

    Nice debate and well done you!
    I had a question for David that eventually wasn’t used, but anyway, he was banging on all the time about the fact that we know anyway that the null hypothesis isn’t true, so why should we test it; we can’t learn anything useful from it that we don’t know already.

    But this is nonsense. Of course the H0 isn’t literally true, however if the data don’t allow us to distinguish the real situation from the H0, it is clear that no evidence can be claimed for anything substantially different from it. I’d have been curious how he could defend his “we can’t learn anything useful from testing an H0 that is wrong anyway” against this. (Apart from this no parametric model is true anyway, so based on this he shouldn’t do any parametric statistics at all… and nothing nonparametric either, because data isn’t even perfectly i.i.d.! Being true is not the job of models – that doesn’t mean they don’t have any job!

    • Christian:
      Glad that you watched. I did bring up the point that we can find out true things from deliberately false models 2 or 3 times. He’s obviously wedded to this dismissal. Such “al flesh is grass” pronouncements get us nowhere and miss how we learn from models in science.

  4. Pingback: My Responses (at the P-value debate) | Error Statistics Philosophy

  5. Tom Junk

    Excellent debate! I very much enjoyed it! Regarding at least how high-energy physicists calculate p-values, we include the effects of uncertainty in the model as well as noise due to random data that fluctuate on repetitions of the experiment. While of course all models are wrong and some are useful, the extent to which they are useful has to be quantified, because some models are known to be highly accurate but still imperfect, while others are be less well constrained. Jim gives high-energy physicists a compliment by saying that they check their model assumptions very carefully, which is true, but the important thing is what to do with remaining model uncertainty after the checks have been made, since not all checks are perfectly constraining. After the checks, one has a better model that’s still imperfect. Sometimes “better” is still not all that good. There are of course a number of ways to include model uncertainty in the calculation of p-values and we ought to ask statisticians’ help in evaluating them. But the meaning of the p-value as computed by particle physicists is then not just the chance of observing the data or something more extreme under the null assuming the outcome is subject to random noise, but it also includes possible mismodeling which is not noisy but rather is systematic from one repetition to the next.

    describes common methods for including systematic uncertainties in the calculation of p-values in high-energy physics.

    Of course when we include the effects of uncertainty in the model, we only include those effects we know about. You make a very good comment about the OPERA faster-than-light neutrinos on your main blog page — why cover over with an uncertainty an actual effect that needs investigating? The loose cable is a special case in that it is so easily fixable and the experiment can be rerun (or perhaps the data can be corrected once its impact is known). Not all nuisance parameters are that easy to fix in the experiment. Sometimes they are inherent to the experiment design or they are impossible to constrain with data. And there are many, many of them. In the end, we care about the estimations of parameters of interest, their intervals, and hypothesis tests, so we do stop investigating systematic uncertainties when our knowledge of them is “good enough”.

    • Tom: Thanks for your comment. When I was writing about the Higgs boson around 2012-13, and since then as well, I was fortunate to have several consultations with physicist Robert Cousins. He recently wrote an article that connects that case , and stat in HEP physics in general, to my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.

  6. Pingback: Is it impossible to commit Type I errors in statistical significance tests? | Error Statistics Philosophy

  7. Pingback: The Statistics Debate (NISS) in Transcript Form–Question 1 | Error Statistics Philosophy

Blog at