Gelman blogged our exchange on abandoning statistical significance

A. Gelman

I came across this post on Gelman’s blog today:

Exchange with Deborah Mayo on abandoning statistical significance

It was straight out of blog comments and email correspondence back when the ASA, and significant others, were rising up against the concept of statistical significance. Here it is:

Exchange with Deborah Mayo on abandoning statistical significance

The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP paper and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low p-values, especially if it’s not an isolated case but happens time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering that our theory clashes with the facts, enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly and even to pinpointing the source of the misfit. So I’m puzzled.
I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.
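Gelman’s point can be made concrete with a small simulation (my own sketch, not part of the original exchange): give every measurement a small systematic error and, with enough data, the null of zero effect is soundly rejected, yet that rejection says nothing in favor of any preferred substantive theory A.

```python
import math
import random
import statistics

random.seed(1)

# True effect is zero; a small systematic error (bias = 0.1) contaminates
# every observation, violating the "zero systematic error" part of the null.
n = 10_000
data = [random.gauss(0.1, 1.0) for _ in range(n)]

mean = statistics.fmean(data)
se = statistics.stdev(data) / math.sqrt(n)
z = mean / se

# Two-sided normal-approximation p-value for H0: mean == 0
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"z = {z:.1f}, p = {p:.2e}")     # H0 soundly rejected at this n ...
print(f"estimated mean = {mean:.3f}")  # ... but the "effect" is just bias
```

Rejecting “mean = 0” here is correct as far as it goes; treating that rejection as support for a favored explanation A would be exactly the fallacy described.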

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy; N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative, and at most one learns, say, about a discrepancy from a hypothesized value. If a double-blind RCT repeatedly shows a statistically significant (small p-value) increase in cancer risks among the exposed, will you deny that’s evidence?


Me:

I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.


Mayo:

We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. Would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?
Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.


Me:

I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.


Mayo:

And I’ve said it all many times in great detail. I say drop NHST: it was never part of any official methodology. But that is no justification for endorsing an official policy that denies we can learn from statistically significant effects in controlled clinical trials, among other legitimate probes. Why not punish the wrongdoers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.


Me:

In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and it’s good to know that; (b) in some cases, as a convenient shorthand for a more thorough analysis; and (c) for finding flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up.

In answer to your previous email, I don’t want to punish anyone, I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.


Mayo:

One or two times would be enough if they were well controlled. And the ONLY reason they have meaning, even if it were time and time again, is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values and yet deny p-value reasoning.

As I discuss throughout my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims, precisely the kinds of claims that get support under other methods, be they likelihood or Bayesian.
Stop using NHST: there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that, for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.
I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board.

That said, it’s fun to be talking with you again.
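Mayo’s “time and again” argument can be illustrated with a toy simulation (my own sketch with made-up numbers, not part of the exchange): when a treatment genuinely raises risk, well-controlled trials produce small p-values repeatedly, a pattern that is hard to write off as coincidence.

```python
import math
import random

random.seed(7)

def two_prop_p(x1, n1, x2, n2):
    """Two-sided z-test p-value for equality of two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def trial(n=1000, risk_control=0.10, risk_treated=0.20):
    """One two-arm trial with a genuine doubling of risk."""
    x_c = sum(random.random() < risk_control for _ in range(n))
    x_t = sum(random.random() < risk_treated for _ in range(n))
    return two_prop_p(x_c, n, x_t, n)

p_values = [trial() for _ in range(5)]
print(["%.1e" % p for p in p_values])  # small p-values, time and again
```

With arm sizes and risks chosen this way, each replication rejects the no-difference null; the repeated pattern, under good controls, is the “argument from coincidence.”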


Me:

I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either; it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person, or else just estimate the average effect if you’re OK fitting that simpler model.
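The estimation framing can be shown on hypothetical trial counts (numbers invented purely for illustration): report the estimated risk difference and its interval rather than a reject/don’t-reject verdict.

```python
import math

# Hypothetical two-arm trial counts (made up for illustration):
x_c, n_c = 40, 400   # events in control arm
x_t, n_t = 72, 400   # events in treated arm

p_c, p_t = x_c / n_c, x_t / n_t
diff = p_t - p_c

# Unpooled (Wald) standard error for a difference of proportions
se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"risk difference = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
# → risk difference = 0.080, 95% CI = (0.032, 0.128)
```

The same data that would yield a small p-value here instead yield a magnitude and an uncertainty range, which is the information a prescribing decision actually needs.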

I can blog our exchange if you’d like.

And so I did.

Please be polite in any comments. Thank you.

I was glad to see that I’d pretty much said just what I’d want to say. I might have wanted to get the last word in regarding his last remark, namely: I think the task of distinguishing genuine from spurious effects is crucial. If you start out thinking you’re “estimating” something when it could readily have been exposed as noise, you will be led astray. The only confusion in what I’d said might be as regards the term “NHST”. On this, see the comments to this post and my “Farewell Keepsake” from SIST (2018, CUP).



7 thoughts on “Gelman blogged our exchange on abandoning statistical significance”

  1. I was referring to the term NHST which seems largely to be used to refer to an illicit animal wherein a low p-value is taken as evidence for one of a million substantive scientific claims that could “explain” the data.
    It’s interesting that Gelman asked his people to be “polite” in the comments. They can be real hooligans at times, though not those who commented today.
    It’s funny to agree to let someone blog your email exchange. Gelman has done it before, and I’m fine with it. But it’s the kind of thing where you don’t want to say, let me fix some of it first, and I didn’t.

    • Deborah:

      Thanks for agreeing to post our exchange; I think this sort of thing can be useful to people.

      Also, one thing. As I wrote here, you seem to like calling null hypothesis significance testing (NHST) a “fallacious animal” or a “straw man.” But it’s not an animal and it’s not a straw man. It’s a statistical method that thousands of researchers use every day. It’s a method with real problems, and these problems have malign consequences; see for example section 2 of this article.

      Finally, the commenters at our blog are not “my people.” They’re just people! They’re whoever chooses to comment at the blog. I agree that sometimes they can be rude on this topic, which is why I specifically asked for politeness in this discussion.

      • I allude to the very common and pejorative use of the term NHST. Proper significance tests or tests of hypotheses (N-P used both terms) do NOT endorse moving from a statistically significant effect to a substantive claim. You cannot condemn actual tests under the name NHST while still allowing NHST to be defined as an abusive animal, not anything recognizable by N-P-F. If you say it refers to actual significance tests as put forward in either the N-P or F form, then your criticisms don’t hold. If your criticisms hold, then you’re referring to the abusive NHST animal. The latter reading is the one I’m giving.

      • “Your people” was a shorthand for people who comment on your problem, whomever they may be.

  2. Yet another abusive construal of NHST. It never purports to give true/not-true verdicts, whatever that might mean. This is a falsificationist/corroboration account.

  3. rkenett

    Interesting and instructive exchange. Two issues however seem to deserve attention, beyond the above:
    1. Selective inference
    P-hacking is a particularly painful form of selective inference. You do not know what was selected against, and you will probably never know. Selective inference on the choices made in highlighting results is more assessable. Highlighting a small p-value in a large table without correcting for multiplicity is one such example.
    2. Generalisation of findings.
    The National Academies are now recognising this as an important aspect of research analysis. My book with Galit Shmueli on Information Quality dedicates a full chapter to it. What needs to be expanded are methods for doing so. Some generalizability is technical; the more qualitative type is more challenging. A proposal I made for this is available online. Ironically, clinical researchers refer to it and use it. Apparently, statisticians do not have the imagination required for this.
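    The multiplicity point in item 1 can be sketched with a quick simulation (my own illustration, not from the comment): among 100 purely null tests, the smallest p-value usually looks impressive until it is adjusted for the number of looks.

```python
import math
import random

random.seed(42)

def null_p(n=50):
    """p-value for a z-test on pure noise (the null is true by construction)."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)  # known sd = 1
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

m = 100  # size of the "table" being scanned for highlights
p_min = min(null_p() for _ in range(m))

print(f"smallest of {m} null p-values: {p_min:.4f}")
print(f"Bonferroni-adjusted:          {min(1.0, m * p_min):.4f}")
```

    A raw minimum like this would be “highlighted” as significant; the Bonferroni adjustment makes plain that nothing beyond noise was found.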

    Overall, statistics, the grammar of research, needs to expand its perspective. I guess one way to achieve this is to strengthen communication between academic and grassroots activities. Gelman’s blog is helping with this. Much more is needed….

    PS The information quality book is available from

    • Christian Hennig

      Thanks for pointing this out. I like the “Boundary of Meaning” concept a lot.
