An exchange between A. Gelman and D. Mayo on abandoning statistical significance: 5 years ago


Below is an email exchange that Andrew Gelman posted on this day 5 years ago on his blog, Statistical Modeling, Causal Inference, and Social Science.  (You can find the original exchange, with its 130 comments, here.) Note: “Me” refers to Gelman. I will share my current reflections in the comments.

Exchange with Deborah Mayo on abandoning statistical significance

The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP paper data and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low p-values, especially if it’s not an isolated case but time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering our theory clashes with the facts–enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly & even to pinpointing the source of the misfit. So I’m puzzled.
I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy. N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative and at most one learns, say, about a discrepancy from a hypothesized value. If a double-blind RCT clinical trial repeatedly shows a statistically significant (small p-value) increase in cancer risks among the exposed, will you deny that’s evidence?

Me:

I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.

Mayo:

We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. Would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?
Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.

Me:

I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.

Mayo:

And I’ve said it all many times in great detail. I say drop NHST. It was never part of any official methodology. That is no justification for endorsing official policy that denies we can learn from statistically significant effects in controlled clinical trials among other legitimate probes. Why not punish the wrong-doers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

Me:

In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and so it’s good to know that, and (b) in some cases as a convenient shorthand for a more thorough analysis, and (c) for finding flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up.

In answer to your previous email, I don’t want to punish anyone, I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.

Mayo:

One or two times would be enough if they were well controlled. And the ONLY reason they have meaning even if it were time and time again is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values & deny p-value reasoning.

As I discuss throughout my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims—precisely the kinds of claims that get support under other methods, be they likelihood or Bayesian.
Stop using NHST—there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.
I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board.

That said, it’s fun to be talking with you again.

Me:

I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either—it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person, or else just estimate the average effect if you’re OK fitting that simpler model.

I can blog our exchange if you’d like.

And so I did.

Please be polite in any comments. Thank you.

I am posting this with Gelman’s approval. You might find it interesting to check out some of the 130 comments on his blog here. I invite you to share reflections in the comments to this post.

Categories: 5-year memory lane, abandon statistical significance, Gelman blogs an exchange with Mayo


4 thoughts on “An exchange between A. Gelman and D. Mayo on abandoning statistical significance: 5 years ago”

  1. In some very significant senses, Gelman and I agree, and in others, we disagree. But the bulk of the apparent disagreement is semantics. Gelman says:

    “The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A.”

    Gelman’s statement is true only if null hypothesis significance testing is understood as moving from statistical significance (from a point null hypothesis) to inferring evidence of an alternative hypothesis H that “accords with” or “explains” data x. This is a fallacious use of statistical significance testing. Because NHST is so often understood to refer to precisely this fallacious use of statistical significance test reasoning—it is no part of either Fisher or Neyman-Pearson statistical tests—I am prepared to abandon NHST, IF IT IS understood as the fallacious animal.

    Classic statistical fallacies are of this variety: notably, moving from a statistical correlation to a causal hypothesis H that fits the data, or moving from the statistical inference “there is evidence of an effect” to evidence for substantive scientific hypotheses and theories. These fallacious uses of tests are statistical versions of the deductive fallacy of AFFIRMING THE CONSEQUENT. They are barred in a non-fallacious use of statistical significance testing. The error probabilities, e.g., the type 1 error probability, do not apply to inferring such an H.

    On the fallacy of affirming the consequent, see pp 62-3 in my book SIST:

    https://errorstatistics.com/wp-content/uploads/2019/09/ex2-ti.pdf

    However, Bayesian updating and Bayes factors do permit statistical affirming the consequent. So, if all the hullabaloo is about erroneously using statistical significance tests to statistically affirm the consequent, why would it be desirable to replace them with methods that license precisely this inferential move? Consider Bayesian confirmation (often written as C(H,x)):

    “The most familiar interpretation is that H is confirmed by x if x gives a boost to the probability of H, incremental confirmation. The components of C(H,x) are allowed to be any statements, and, in identifying C with Pr, no reference to a probability model is required. There is typically a background variable k, so that x confirms H relative to k: to the extent that Pr(H|x and k) > Pr(H|k). However, for readability, I will drop the explicit inclusion of k. More generally, if H entails x, then assuming Pr(x) ≠ 1 and Pr(H) ≠ 0, we have Pr(H|x) > Pr(H). This is an instance of probabilistic affirming the consequent. (Note: if Pr(H|x) > Pr(H) then Pr(x|H) > Pr(x).)”

    SIST: 66-67 from Excursion 2 Tour I. The entire Tour I is here:

    https://errorstatistics.com/wp-content/uploads/2019/09/ex2-ti.pdf
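
    To make probabilistic affirming the consequent concrete, here is a minimal numerical sketch (toy numbers of my own, not from SIST): if H entails x, then Pr(x|H) = 1, so Bayes’ theorem gives Pr(H|x) = Pr(H)/Pr(x) > Pr(H) whenever Pr(x) < 1 and Pr(H) > 0. Any hypothesis that entails the data gets a Bayes-boost, however many rival hypotheses also entail it.

```python
# Toy check of probabilistic affirming the consequent (illustrative
# numbers only): if H entails x, Bayesian updating boosts Pr(H).

prior_H = 0.01       # Pr(H): a not-very-plausible hypothesis
pr_x_given_H = 1.0   # H entails x, so Pr(x | H) = 1
pr_x = 0.30          # Pr(x): marginal probability of the data

posterior_H = pr_x_given_H * prior_H / pr_x   # Bayes' theorem
print(f"Pr(H)   = {prior_H:.4f}")             # 0.0100
print(f"Pr(H|x) = {posterior_H:.4f}")         # 0.0333 > Pr(H): x "confirms" H
assert posterior_H > prior_H                  # boost holds whenever Pr(x) < 1
```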

    Of course, even if Gelman were to agree that the problematic uses of statistical significance tests are strictly fallacious uses, he might well still want to abandon them because that is how he finds the tests are actually used. He says as much. Here is a place we disagree.

    • I can illustrate my point with studies I read about today in the NYT suggesting that Covid lockdowns damaged adolescent brains, and did so much more severely in women than men:

      “Social isolation due to lockdowns that were imposed because of the COVID-19 pandemic had a detrimental impact on adolescent mental health, with the mental health of females more affected than males. … accelerated cortical thinning in adolescents in association with the COVID-19 pandemic lockdowns (13, 14) … demonstrating a significant effect of sex in which females show more dramatic accelerated cortical thinning when compared to males”.

      A source that describes the statistical analysis is this:

      https://www.pnas.org/doi/10.1073/pnas.2403200121

      Even if there is a statistically significant difference between cortical thinning in females compared to males, one can’t immediately infer it is due to Covid lockdowns (nor do the authors do more than suggest the causal connection). To move from the statistically significant correlation to a causal claim would be an example of what I’m calling statistical affirming the consequent: If there were a causal connection, then the observed effect would be expected; but it’s fallacious to automatically move from the observed effect to the causal explanation. I’m not saying the authors commit this fallacy but just using it to illustrate my last comment, although in the NYT article we read:
      “Dr. Kuhl attributed the change to “social deprivation caused by the pandemic,” which she suggested had hit adolescent girls harder because they are more dependent on social interaction.”

      Whether or not this is plausible would depend on other background knowledge, which I can’t comment on.

      However, a Bayes factor would allow high support for the substantive causal explanation insofar as it is much more likely (in the technical sense) than the “no effect” null hypothesis. More generally, since the causal hypothesis gets a Bayes-boost by the data, the data “confirm” it.
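
      A toy sketch of how that Bayes-boost arises (invented numbers, not the PNAS analysis): an observed standardized effect is compared under a “no effect” null and under an alternative chosen after the fact to accord with the data, so the data-fitting hypothesis wins a large Bayes factor by construction, even though the two hypotheses do not exhaust the possibilities.

```python
# Toy Bayes factor sketch (invented numbers, not the PNAS analysis):
# the data are compared under a "no effect" null and under an
# alternative hypothesis chosen to accord with the observed effect.
from math import exp, sqrt, pi

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

z_obs = 3.0     # observed effect, in standard-error units
null_mu = 0.0   # "no effect" null hypothesis
alt_mu = z_obs  # alternative picked to fit the observed data

bf = normal_pdf(z_obs, alt_mu, 1.0) / normal_pdf(z_obs, null_mu, 1.0)
print(f"Bayes factor (alt vs null) = {bf:.1f}")   # ~90: strong "support"
# The alternative is favored because it was selected to accord with the
# data; the error probabilities of a significance test do not transfer
# to such an after-the-fact hypothesis.
```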

      This is a good example to take up Gelman’s other claim, with which I disagree, that we know that null hypotheses aren’t true. In responding to a recent comment of Gelman’s that ‘Rejection isn’t so helpful because we’re rejecting a null hypothesis that we know isn’t true anyway’, I noted, “The sense in which we know models and hypotheses aren’t true, even without testing them, is that they embody idealizations and approximations. But this is not the sense used in statistical testing. … A null hypothesis says things like: this observed effect is explainable by random variation alone, or it’s not a genuine effect, or the like” as expressed in a statistical model.

      https://errorstatistics.com/2024/08/18/andrew-gelman-guest-post-trying-to-clear-up-a-misunderstanding-about-decision-analysis-and-significance-testing/#comment-265576

      We clearly don’t know at the outset that the following null hypothesis is false:

      Ho: the difference in cortical thinning between adolescent males and females (during lockdown) is explainable by chance variability.
      This might be viewed as a “dividing null hypothesis,” or a one-sided test (or two one-sided tests). The use of a point null, accorded a prior lump of probability, is only found in Bayesian inference—not statistical significance tests. Note that while Gelman regards point nulls as false, the popular Bayes factor convention, recommended as a replacement for statistical significance tests, is to accord them high priors.

      • Deborah:

        Regarding that paper, you write, “nor do the authors do more than suggest the causal connection.” But they do suggest the causal connection! It’s right there in the very first paragraph (the Significance statement) of their paper. Here are some quotes:

        “the lockdown measures enacted during the COVID-19 pandemic resulted in . . .” Here, “resulted in” is a causal claim.

        “greater vulnerability of the female brain, as compared to the male brain, to the lifestyle changes resulting from the pandemic lockdowns”: Again, “resulted.” Also, “vulnerability” implies causation: if outcomes are “vulnerable” to lifestyle changes, this implies that the lifestyle changes are contributing to the outcomes.

        In the second paragraph (the Abstract), they write, “These findings suggest that the lifestyle disruptions associated with the COVID-19 pandemic lockdowns caused . . .” The word “suggest” is a qualifier, but it does not negate the causal statements that the authors used in their earlier paragraph.

        The paper under discussion uses something called “normative modeling,” which seems to be a form of the counterfactual reasoning used for causal inference in statistics and econometrics. I have no problem with this sort of modeling. Ultimately, though, they’re comparing time trends, and I guess that boys and girls differ in a lot of ways, so just because their trends on these measurements are different, on average, I don’t see what COVID-19, let alone “lockdowns,” would have to do with it. This looks like a lot of studies I’ve seen where the putative treatment is small and it’s supposed to have very large effects. I don’t think statistical significance or null hypothesis significance testing adds anything to this project, and I think it would be better for them to just graph their data.

        You refer to the hypothesis that “the difference in cortical thinning between adolescent males and females (during lockdown) is explainable by chance variability.” I would not call this a scientific hypothesis; it’s more of a statement about sample size. With a large enough sample, it should be possible to distinguish the differences between these groups to a precision that is finer than sampling variation. Similarly, I am sure that if I picked two students out of my classroom and took them to a basketball court, then with enough shots you could distinguish their average shooting abilities in a way that would not be explained by chance variation.
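
        Here is a minimal simulation sketch of that point (toy numbers only, nothing to do with the study): two groups differ by a tiny but real amount, and the z-statistic for the difference grows like the square root of the sample size, so the “chance variability” null is eventually rejected no matter how small the difference is.

```python
# Minimal sketch: a tiny but real group difference becomes
# "statistically significant" once n is large enough.
# (Toy numbers for illustration only.)
import random
random.seed(1)

true_diff = 0.02   # tiny real difference in means (in SD units)

for n in (100, 10_000, 1_000_000):
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(true_diff, 1.0) for _ in range(n)]
    diff = sum(b) / n - sum(a) / n
    se = (2.0 / n) ** 0.5        # standard error of the difference (SD = 1)
    z = diff / se
    print(f"n = {n:>9,}   z = {z:6.2f}")
# z grows roughly like sqrt(n/2) * true_diff, so any nonzero
# difference is eventually declared "significant".
```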

        • Andrew:

          Thanks so much for your reply to my comment. My points are simply, first, that a null hypothesis in statistical significance testing need not be known to be false; second, that statistical significance tests do not license the fallacious move I call statistical affirming the consequent; and third, that alternatives recommended as replacements for statistical significance tests do, e.g., null hypothesis Bayes factor tests.

          (1) You say “the hypothesis that ‘the difference in cortical thinning between adolescent males and females (during lockdown) is explainable by chance variability.’ I would not call this a scientific hypothesis”.

          It’s a statistical hypothesis, but that does not mean we know it’s false. It’s an empirical claim.

            I would distinguish a hypothesis that would be found false, if false, provided a large enough n, and a hypothesis that is known to be false. That is, we may know any discrepancy that exists will be found in a sensitive enough test, without knowing it is false (except in an uninteresting sense that statistical claims are approximations and idealizations).

            (2) Regarding your point that they actually do infer a causal explanation–I was trying to be generous, since it didn’t matter for my point of illustration (and the PNAS paper, which I mainly used because it contained a discussion of the statistics, was more reserved, saying the results “suggest”). For all I know, Covid lockdown did cause cortical thinning. But you’re right, they do go there in the NYT article.

            So we agree that it illustrates what I call the fallacy of statistical affirming the consequent. My point is that such an inference is not licensed by statistical significance tests, and it’s a classic fallacy. The test’s error probability does not extend to the causal claim. The causal claim is inseverely tested by dint of the statistical effect alone (even assuming its validity).

            (3) On the other hand, the probability of the data under the causal claim would be much higher than under the null hypothesis. So the causal claim gets support according to a Bayes factor (BF) test. The two hypotheses tested in a BF test need not exhaust the space of possibilities. Yet “null hypothesis BF tests” [NHBT] are being recommended as replacements for statistical significance tests, even though they instantiate your fallacy:

            “rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster.”

