Andrew Gelman (Guest post): (Trying to) clear up a misunderstanding about decision analysis and significance testing


Professor Andrew Gelman
Higgins Professor of Statistics
Professor of Political Science
Director of the Applied Statistics Center
Columbia University

 

(Trying to) clear up a misunderstanding about decision analysis and significance testing

Background

In our 2019 article, Abandon Statistical Significance, Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I talk about three scenarios: summarizing research, scientific publication, and decision making.

In making our recommendations, we’re not saying it will be easy; we’re just saying that screening based on statistical significance has lots of problems. P-values and related measures are not useless—there can be value in saying that an estimate is only 1 standard error away from 0 and so it is consistent with the null hypothesis, or that an estimate is 10 standard errors from zero and so the null can be rejected, or that an estimate is 2 standard errors from zero, which is something that we would not usually see if the null hypothesis were true. Comparison to a null model can be a useful statistical tool, in its place. The problem we see with “statistical significance” is when this tool is used as a dominant or default or master paradigm:

We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

For summarizing research, we recommend the acceptance of uncertainty and that “the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with the currently subordinate factors [e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain], as just one among many pieces of evidence.”

For publication, we again recommend against using statistical-significance-based thresholding or screening; as we pointed out, even those journals that screen, implicitly or explicitly, on p-values or other thresholds do not automatically publish every submission that contains a statistically-significant finding: these journals necessarily use other criteria to decide what to publish, and so we’re just saying to use these other criteria without applying a significance screen.

Finally, for decision making more generally, we recommend cost-benefit analysis accounting for experimental uncertainty. Or, when there are no clear costs and benefits, we recommend that decisions be made based on all the available results and “working directly with hypotheses of interest rather than reasoning indirectly from a null model.”

A point of dispute or confusion

Recently we had an online discussion regarding our recommendations for decision analysis. A recent guest post [by Daniël Lakens] on Deborah Mayo’s blog expressed the view that our decision rule will be in practice equivalent to a p-value. This claim is not correct—or, to put it another way, given any dataset summarized by a single point estimate, a decision threshold can be mapped to a p-value or a Bayes factor or a z-score or any other monotonic summary of that point estimate, but then the cutoff p-value (or whatever) will depend on the data, so it’s not a p-value (or whatever) threshold in the usual sense.

The applied point is that our recommended approach depends not on type I and II errors (or, for that matter, on type M and S errors) but rather on estimates of costs and benefits of the decision being considered, where costs and benefits are measured in terms of dollars, lives, or some other external units.

This is not to say that type I, II, M, and S errors are irrelevant! They can be useful concepts in helping to understand and evaluate statistical procedures. We just don’t want to use them as decision thresholds.

I think some of the confusion in our online discussion on decision thresholds came because there was confusion about what sort of decisions we were talking about. So I thought it would help to explain using a simple hypothetical example. We also presented this example on our blog where there is some discussion in comments. I thank Daniël Lakens for providing the comment that motivated this post and Deborah Mayo for the opportunity to add to this discussion.

Working through an example

Consider the following decision problem. Your company is considering adopting an innovation that would cost $10,000 and that has uncertain benefit. Your prior on the benefit is normal(0, $100,000). In order to inform your decision you conduct an experiment that gives you an unbiased estimate of the benefit with some standard error, s. To give some notation here, let C be the cost ($10,000 in this case), let sigma_0 be the prior sd ($100,000 in this case), and let theta be the benefit, so that theta – C is the net gain, and you get an estimate theta_hat ~ normal(theta, s) with prior theta ~ normal(0, sigma_0). Further assume that sigma_0 is a small amount of money relative to the size of your company, so that your utility is a linear function of the net gain.

Given the estimate theta_hat, what should your decision be? There are lots of rules you could use here, but I’m gonna go Bayesian, given that all the costs, benefits, and uncertainties are defined up front.

Given the data, the posterior mean of theta is (theta_hat/s^2)/(1/s^2 + 1/sigma_0^2), so the expected net benefit of adopting the innovation is (theta_hat/s^2)/(1/s^2 + 1/sigma_0^2) – C, and you should adopt the innovation if (theta_hat/s^2)/(1/s^2 + 1/sigma_0^2) > C, that is, if theta_hat > C(1 + (s/sigma_0)^2). Here we’re assuming C = $10,000.

Here are a couple examples: If s = $10,000, then s/sigma_0 = 0.1 and the rule is to adopt the innovation if theta_hat > $10,100. If s = $100,000, then the rule is to adopt if theta_hat > $20,000.
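As a quick illustration, here is a minimal sketch in Python of the rule just derived; the function name and the particular values plugged in are only illustrative choices, not part of the derivation:

```python
# Bayesian decision rule derived above: adopt the innovation
# if theta_hat > C * (1 + (s / sigma_0)^2).

def bayes_adopt_threshold(C, sigma_0, s):
    """Smallest estimate theta_hat at which adopting has positive expected net benefit."""
    return C * (1 + (s / sigma_0) ** 2)

C = 10_000         # cost of the innovation, in dollars
sigma_0 = 100_000  # prior sd of the benefit, in dollars

for s in (10_000, 100_000):
    print(f"s = ${s:,}: adopt if theta_hat > ${bayes_adopt_threshold(C, sigma_0, s):,.0f}")
# s = $10,000: adopt if theta_hat > $10,100
# s = $100,000: adopt if theta_hat > $20,000
```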

Another possible decision rule is based on statistical significance: adopt the innovation if the estimate is statistically significantly greater than zero, given some p-value threshold, which in this normal model is equivalent to adopting if the z-score—the estimate divided by its standard error—exceeds some value. In the above notation, the rule would be to accept if theta_hat > A*s, where A is some pre-chosen value. A commonly-used threshold is A=2, but sometimes people want to be conservative and set A to some larger value such as 3, while other times people will set A to a lower value such as 1.5 so as not to stifle innovation.

The Bayesian rule and the statistical significance rule are different! Not just in their motivations, but also in their mathematical forms. The Bayesian rule is to accept the innovation if theta_hat > C(1 + (s/sigma_0)^2); the statistical significance rule is to accept the innovation if theta_hat > A*s. The classical rule has two parameters: the threshold z-score A and the standard error s. The Bayesian rule has three parameters: the cost of the innovation C, the prior sd sigma_0 of the benefit, and the standard error s.

You might try to wrestle the Bayesian and statistical significance rules into agreement by setting the threshold z-score A to an appropriate value, but the resulting rules will still be different functions of s. If you have a Bayesian rule given some values of C and sigma_0, there is no significance threshold A which will give you the equivalent rule.
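To see this concretely, one can compute, for each value of s, the z-threshold A that would be needed to mimic the Bayesian rule; this sketch (same notation and numbers as above, with the particular values of s chosen only for illustration) shows that the implied A changes with s:

```python
# Compare the Bayesian threshold C*(1 + (s/sigma_0)^2) with the significance-rule
# threshold A*s across standard errors s. The Bayesian threshold is quadratic in s
# while A*s is linear, so no single A makes the two rules agree for every s.

C, sigma_0 = 10_000, 100_000

for s in (5_000, 10_000, 50_000, 100_000):
    bayes_threshold = C * (1 + (s / sigma_0) ** 2)
    implied_A = bayes_threshold / s  # z-threshold that would mimic the Bayesian rule at this s
    print(f"s = ${s:>7,}: Bayesian threshold ${bayes_threshold:>7,.0f}, implied A = {implied_A:.2f}")
# The implied A is about 2.0 when s = $5,000 but only 0.2 when s = $100,000.
```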

You might think that all this could be saved if the Bayesian inference used a flat prior, but, no, not at all! A flat prior here corresponds to sigma_0 = infinity, which in turn implies the decision rule to adopt the innovation if theta_hat > C. This rule depends entirely on the point estimate, not on the standard error at all! For it to be a p-value threshold, the decision would have to depend on theta_hat/s, which it doesn’t.

This is all basic decision theory (modulo any algebra or calculation or conceptual errors I’ve made in my above quick derivation), and I hadn’t thought much about it recently until I came across a comment from psychology researcher Daniël Lakens on Deborah Mayo’s blog where he pointed to this quote from a paper I’d coauthored: “Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model,” and responded:

I [Lakens] think if I would see the authors do this in practice, any decision they make ends up being a dichotomous claim based on a critical value, and I will be able to recompute that critical value into a p-value. I am happy to be proven wrong, and all the authors need to do is to link me to a real life example where this is not the case.

The request for a real-life example was easy to satisfy; I provided some in my follow-up comment, pointing to chapter 9 of Bayesian Data Analysis, third edition, and this article about decision making for radon measurement and remediation where we went through all aspects of uncertainty and utility quantification in detail. Providing references was no big deal; it can be hard to find things in an unfamiliar literature, and I was happy to help Lakens by answering his question by supplying examples.

The more interesting question to me was how it was that someone who’d thought about decision analysis and hypothesis testing could have thought that “any decision [we] make ends up being a dichotomous claim based on a critical value, and I [Lakens] will be able to recompute that critical value into a p-value.” If a data-based decision is dichotomous (in the above example, adopt the proposed innovation or stick with the status quo), it can be said to be based on a critical value, but, as the above example shows, that critical value cannot in general be recomputed into a p-value—unless you allow the p-value threshold to itself depend on various features of the decision problem, the prior distribution, and the data (in the above example, C, sigma_0, and s), in which case this is not a p-value threshold in the usual sense in that it cannot be defined before seeing the data.

After some reflection, though, I think I understand what Lakens was talking about, and it’s good to have a chance to clarify the point.

In my decision problem as stated above, there are two decision options:
(i) Adopt the innovation, or
(ii) Stick with the status quo,
and the relative utility of option (i) compared to option (ii) is theta-C. Very direct; as I said, it’s standard decision theory from the 1940s through today.

But here’s another way to frame the problem. You take the same two decision options, but the goal is not to maximize dollars but to make the correct decision. If you take option (i), there is a loss L1 if the innovation was not beneficial (that is, if theta<0), and if you take option (ii), there is a loss L2 if theta>0. (In our problem with a continuous prior distribution, we can ignore the zero-probability event that theta=0 exactly. You could include that possibility and it wouldn’t change the basic features of the problem.) These are the costs of classical type I and II errors, and the relative utility of option (i) compared to option (ii) is – L1*1_{theta<0} + L2*1_{theta>0}.

If you then attack the problem from a Bayesian perspective, the relative expected utility of option (i) compared to option (ii) is L2*Pr(theta>0|data) – L1*Pr(theta<0|data) = (L1 + L2) * Pr(theta>0|data) – L1, and this is positive if Pr(theta>0|data) > L1/(L1+L2).

Let’s quickly check this calculation. If L1=L2, then the two sorts of losses are equivalent, so it makes sense that you’d adopt the innovation if and only if it is more likely than not to have a positive benefit. If L1>L2, then the loss of mistakenly adopting the innovation is higher than that of mistakenly sticking with the status quo, so there’s a higher threshold, etc.

For the problem above, Pr(theta>0|data) = Phi((theta_hat/s^2)/sqrt(1/s^2 + 1/sigma_0^2)), where Phi is the cumulative normal distribution function, and so the Bayesian rule is to adopt the innovation if Phi((theta_hat/s^2)/sqrt(1/s^2 + 1/sigma_0^2)) > L1/(L1+L2), which maps to adopting if theta_hat > s^2*sqrt(1/s^2 + 1/sigma_0^2)*Phi^{-1}(L1/(L1+L2)). Again, the threshold depends on s, hence there is no critical p-value corresponding to this decision. Under a flat prior, though, with sigma_0 = infinity, the decision rule is to adopt the innovation if theta_hat/s > Phi^{-1}(L1/(L1+L2)), which would be a p-value threshold. So maybe that’s what Lakens was thinking about.
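As with the sketches above, here is a small illustration of this second, loss-based rule in code; the particular values of theta_hat, s, L1, and L2 are invented for illustration. It computes Pr(theta > 0 | data) under the normal model and compares it to the cutoff L1/(L1+L2).

```python
# Loss-based rule: adopt if Pr(theta > 0 | data) > L1 / (L1 + L2), where L1 is the loss
# from adopting a non-beneficial innovation and L2 the loss from passing up a beneficial one.

from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def prob_positive(theta_hat, s, sigma_0):
    """Pr(theta > 0 | data) under prior theta ~ N(0, sigma_0^2) and theta_hat ~ N(theta, s^2)."""
    precision = 1 / s**2 + 1 / sigma_0**2
    post_mean = (theta_hat / s**2) / precision
    post_sd = 1 / sqrt(precision)
    return Phi(post_mean / post_sd)

theta_hat, s, sigma_0 = 15_000, 10_000, 100_000  # illustrative numbers
L1, L2 = 3.0, 1.0  # adopting a dud is taken to be three times as bad as missing a winner

p_pos = prob_positive(theta_hat, s, sigma_0)
print(f"Pr(theta > 0 | data) = {p_pos:.3f}; adopt = {p_pos > L1 / (L1 + L2)}")
```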

From a decision-analytic perspective, I don’t think that this second formulation, in which the losses depend only on whether the optimal decision was chosen, makes sense. In any medical, business, or policy example I’ve seen, the benefit of the decision should depend on the actual costs and benefits of the decision options—indeed, that’s what McShane and I and the others were thinking when we wrote:

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

I can’t be sure, but I’m guessing that what Lakens was talking about in his above-quoted comment was not a business or policy decision (whether to adopt an innovation of uncertain benefit relative to its cost) but rather a scientific decision of whether to accept or reject a hypothesis. All I can say there is that I don’t see decision analysis as being particularly relevant here because I don’t see the need for accepting or rejecting a hypothesis or scientific model. Here’s what we wrote in our paper:

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.

The general point here is that there is no general equivalence between data-informed cost-benefit analysis and statistical significance thresholds or Neyman-Pearson hypothesis testing more generally. If you really want to use a p-value threshold, you can reverse-engineer a Bayesian decision analysis that will get there, but you have to use this weird utility function where a gain of a billion dollars (or the discovery of a huge effect) is as good as a gain of 50 cents (or the discovery of a tiny effect which might not even hold up in the future).

Some reasonable arguments in favor of statistical significance

That all said, I can see potential good reasons for using p-values and statistical significance to make decisions:
- It’s a standard approach, so by doing this you can communicate with others using this set of methods, and your results will be comparable to other results obtained in this way.
- Bayesian inference, and even more so Bayesian decision analysis, has lots of knobs you can turn, lots of ways for a researcher to mess up or cheat.
- Bayesian inference can be done in so many different ways. Even setting aside problems with researcher degrees of freedom, the advice to follow Bayesian methods does not in general provide much guidance to an applied researcher: there are so many Bayesian methods out there!
- If you’re already familiar with classical approaches, it might not be worth the effort to retool to learn other methods.
- Bayesian methods that have been effective for me in applications in pharmacology, political science, etc., might not be so appropriate in some areas of psychology.
- Our methods might work for us, but we actually could do just as well by judicious use of significance testing and p-values.
- Just cos significance testing has problems, it doesn’t mean that any particular specified alternative, Bayesian or otherwise, will perform better.

Ultimately, despite these issues, I still will recommend abandoning statistical significance, but I recognize that the methods used by myself and my collaborators are far from perfect. Other people can have good reasons for not following our advice. Also, as a researcher in statistical theory and methods, I remain interested in statistical significance and p-values, if for no other reason than that it behooves us to understand these methods that so many people remain committed to.

The point of this post is not to argue to abandon statistical significance—for that, I recommend our article!—but rather to highlight the fundamental differences between Bayesian decision analysis and statistical significance thresholds. They’re not just two ways of doing the same thing, and that’s important!

Real-world decisions are typically more complicated

To keep things simple, I’ve focused the above discussion on a single binary decision: go with the innovation or stick with the status quo. Blake McShane points out that “more typically, however, decisions are not binary; the utility function is unknown or different stakeholders disagree about it, etc.; the data generating process is unknown, etc. I would think all of these would accentuate differences between the decision theoretic and significance testing approaches.” Blake also points out that adopting a decision theoretic point of view does not require adopting a Bayesian approach to statistics. I find Bayesian inference to be a convenient framework here, but the key point is that if the decision depends on real-world costs and benefits—that is, if the utility or equivalent depends on theta, not just on theta>0 or theta!=0 or whatever—then decisions do not in general map to statistical-significance thresholds.

Sander Greenland also pointed out that what I’m calling the classical or significance-testing framework itself comes in many forms and has a long history. The present post is not intended to be any summary or tutorial of that approach. Rather, I’m focusing on the lack of equivalence between cost-benefit decision rules and significance-testing decision rules.

We’re incoherent too

Yes, our approach has holes! Our models are always under construction and we’re always checking them. A model check can be seen as a sort of hypothesis check, where if the data shows important features that are not predicted by the model, we need to alter or expand the model. Our predictions are stochastic, and so any such evaluation must be probabilistic. So . . . when we do the necessary step of model checking, we’re doing something like significance testing, asking whether the data differ from the model in important ways, more than could be explained by chance alone. This is a point that Deborah Mayo and others have made, and which we have discussed too. To loop back to “Abandon statistical significance,” there’s an awkwardness here because we are using some aspects of statistical significance when we do model checking. I’ll just say that we’re not using significance as a threshold; we’re using it as one piece of information in our decision making. Still, I wanted to acknowledge the tension that remains in our workflow.

Note from Mayo: This discussion began in the comments to Lakens’ guest post, part 1 and part 2. We welcome your questions and remarks in the comments to this blogpost.



29 thoughts on “Andrew Gelman (Guest post): (Trying to) clear up a misunderstanding about decision analysis and significance testing”

  1. I thank Gelman for expanding on his initial comment to provide a guest post aimed at continuing and clarifying the discussion that began with Lakens’ blog. I’m sure Lakens is making a deeper point (on which he will hopefully expand), but on the face of it, I agree with Gelman that the paper of which he is a joint author follows its title in declaring that we should “abandon significance”, quite in contrast to the Gelman of a decade or so ago who co-wrote: “What we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model” (Gelman and Shalizi 2013, 20). Indeed, the 2019 paper Gelman co-authored was one of the few, 5 or 6 of 43, that endorsed abandoning significance, as recommended in the editorial introduction (Wasserstein et al., 2019) to the special TAS volume. In my view, Gelman was the most influential in the 2019 “abandon significance” movement, with his call to sign on to Amrhein et al. 2019 on his blog. However, I also agree with Lakens that the McShane paper does not provide grounds for abandoning the use of thresholds, including those based on statistical significance. I wrote a paper with David Hand (2022) that responds specifically to the “2019 abandoners”:

    I don’t see how Gelman can test his models, let alone be a falsificationist, without thresholds.

    When the test is to distinguish genuine effects from those that easily arise from background variability, I do not think it makes sense to say: “ the probability of an observed effect even larger than mine occurs over 50% of the time from chance alone” but “I claim my observed effect is good evidence that it did not occur by chance alone”. If you agree with me, then you agree to a threshold of 50% (in accord with Cox’s “weak repeated sampling”). How do you even distinguish “inconclusive” results without some threshold? One doesn’t need a rigid dividing line in order to distinguish rather strong results from terrible ones—but that too demands a threshold.

    In a reply to me in the comments on Lakens, Gelman says:

    “I’m still interested in model checking (“pure significance testing,” “error probes,” “severe testing,” etc.) but I’m no longer doing this or recommending to do this using p-values.

    Even in pure research scenarios where there is no obvious cost-benefit calculation—….—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research”.

    It seems to me that even the notion of “inconclusive results” implies a threshold.

    Spanos recently commented that eyeballing won’t suffice in model-testing:

  2. I don’t see what turns on the distinction Gelman is keen to make in this post between statistical significance tests and “full-dressed” Bayesian decision-theory. Even subjective Bayesian Lindley distinguishes inference and decision, and allows that both N-P tests as decisions, and full-dressed Bayesian decision theory have roles.

    Lindley says: “The language of decision analysis has been used by Neyman and others in connection with hypothesis testing where they speak of the decisions to accept, and to reject, the hypothesis. …acceptance and rejection can legitimately be thought of as action, as with the rejection of a batch of items. Equally there are other cases where we could calculate [the posterior probability of H]…The latter form may, as in the last paragraph, be thought of as a decision. Both forms are valid and useful for different purposes… Our philosophy accommodates both views.”

    https://errorstatistics.com/2012/07/12/dennis-lindleys-philosophy-of-statistics/

    • Andrew Gelman

      Deborah,

      As I wrote in my post, neither my methods nor my philosophy are entirely coherent, so I don’t have perfectly clean answers here. My feeling about p-values and null hypothesis significance tests, in the areas of applications I’ve worked on (mostly social, biological, and environmental science) is that non-rejection can be informative, in the sense that it tells us that we don’t have enough data to learn what we want to learn from the data. Rejection isn’t so helpful because we’re rejecting a null hypothesis that we know isn’t true anyway.

      Regarding model checking: I think it’s helpful to understand the ways in which our models don’t fit our data. I don’t really see the relevance of tail-area probabilities when doing this.

      You’re right that my views have changed. When I first started working on hypothesis tests and model checking, back around 1989, it was all about p-values. I still think model checking is important (not just “eyeballing”); I’m not usually interested in doing it using tail-area probabilities.

      • Logic First

        Thanks for your remarks and post. Let’s get to the simple, logical stuff first (although 2am is a bit late for me to write).

        A. You say:

        1. Regarding model checking: I think it’s helpful to understand the ways in which our models don’t fit our data.
        2. Rejection isn’t so helpful because we’re rejecting a null hypothesis that we know isn’t true anyway.

        MAYO: 1 and 2 seem conflicting, since we find out about the ways models don’t fit our data by rejecting various null hypotheses about them.

        B. Moreover, it’s a misunderstanding to say “Rejection isn’t so helpful because we’re rejecting a null hypothesis that we know isn’t true anyway”. (Have you been speaking to Trafimow?) The sense in which “we know” models and hypotheses aren’t true, even without testing them, is that they embody idealizations and approximations. But this is not the sense used in statistical testing. Hypotheses don’t assert: I am precisely true—whatever that would mean. A null hypothesis says things like: a given observed effect is explainable by random variation alone, or it’s not a genuine effect, or the like.

        We clearly don’t know that these claims aren’t true!! That is, we do not know that the denials of these claims are true. It’s absurd to say we know these are genuine effects not explainable by chance. Therefore it’s not correct to say we know nulls are false. Think of that favorite example of yours; I guess it was that there’s no genuine correlation between ovulation and voting. The null asserts this, but we sure don’t know it’s false. To know it’s false would be to know there is a genuine correlation between ovulation and voting.

        C. non-rejection can be informative, in the sense that it tells us that we don’t have enough data to learn what we want to learn from the data.

        MAYO: non-rejection can be informative—I agree with that—but not by telling us we don’t have enough data to learn what we want. Think of non-rejection of the null hypothesis that there’s no genuine correlation between ovulation and politics or whatever. Or the non-rejection of a hypothesis that this drug does not benefit those with a given neurological disease. If the study was highly capable of having found a meaningful benefit, were one to exist, the non-rejection is evidence that there is not a meaningful effect. In some cases such negative results warrant inferring that if there is an effect, it’s smaller than such and such.

        I’ll get to tail areas tomorrow. Inform me of errors.

        • Andrew Gelman

          Deborah:

          I’ve been talking about the null hypothesis being false for many years; see this paper from 1995 in Sociological Methodology: http://stat.columbia.edu/~gelman/research/published/avoiding.pdf and this paper from 2011 in the American Journal of Sociology: http://stat.columbia.edu/~gelman/research/published/causalreview4.pdf

          I recognize that researchers have been divided on this issue for many years, and I’m not trying to convince everybody. I’m mostly trying to explain how my colleagues and I do things, and to show how in some cases (not all!) the null hypothesis significance testing framework leads to problems.

          • Andrew:

            I know you have, and one is free to declare all statistical hypotheses are strictly false, being housed within approximate models, but you still need a distinct notion to capture learning from statistically falsifying a test hypothesis or null hypothesis. When the high energy particle (HEP) researchers inferred the existence of a Higgs particle in 2012, they had an adequate model of “background alone” (represented in the model as mu = 0) and could find out if there was a signal (captured using a 1-sided test). They didn’t already know there was a signal, even if they would agree their models are approximate. After they found one, they could start studying its properties (which were not what they were expecting or hoping for).

            In the course of their research, bumps often disappear, and as much as they would like to say they can reject the “background alone” hypothesis and claim to have found a new particle, they have to conclude that the “background alone” hypothesis is adequate to explain the bump. The bumps fade away, as is typical for flukes. This is just one way falsification is informative, contra point B. And aren’t you a falsificationist of sorts (or not any more)? Of course, they first have to show the adequacy of their model of “background alone”, which they can do at CERN.

            I’m no expert on HEP research, but I briefly discuss it on and around p. 203 in SIST (2018).

            All I’m saying is that a blanket dismissal of test hypotheses as trivially false prevents us from talking about finding things out statistically.

            I’m reminded of extreme skeptics in philosophy: For all we know, we might be brains in a vat.

            • Andrew Gelman

              Deborah,

              As I wrote in my comment below, I guess it depends on the application. I’ve done statistical modeling for physics problems, and p-values never came up. That said, I’ve never worked at CERN on particle physics, and it could well be that such methods have been useful to them. There are many roads to Rome, and even if a method is, in my view, sub-optimal, it can still be useful to people.

              In the many application areas I’ve worked on, I’ve only very rarely found it helpful to reject null hypotheses. It’s come up from time to time, but not often. My usual mode of operation is to assume a model, perform inference conditional on that model, simulate fake data under that model, compare the observed data to the simulated data, and then improve the model, gather more data, etc. The “compare the observed data to the simulated data” step is model checking, and this can be done using p-values, as discussed in my 1996 paper with Meng and Stern, but since then I’ve been much more interested in understanding how the observed data differ from the model, rather than in establishing that some difference exists.
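              As a minimal sketch of the “simulate fake data under that model, compare the observed data to the simulated data” step, here is a toy version in Python, with a simple plug-in normal fit standing in for full posterior inference and with invented data and an invented test quantity:

```python
# Toy version of the fake-data comparison: fit a simple normal model, simulate
# replicated datasets from the fitted model, and compare one test quantity
# (here, the largest observation) between observed and simulated data.
# The data and all choices below are invented for illustration.

import random
import statistics

random.seed(1)
observed = [2.1, 2.4, 1.9, 2.2, 8.5, 2.0, 2.3, 1.8, 2.2, 2.1]  # made-up data with one outlier

mu_hat = statistics.mean(observed)   # plug-in fit standing in for full posterior inference
sd_hat = statistics.stdev(observed)

def test_stat(y):
    return max(y)  # test quantity chosen to probe a suspected misfit

n_sims = 1000
sim_stats = [test_stat([random.gauss(mu_hat, sd_hat) for _ in observed]) for _ in range(n_sims)]

obs_stat = test_stat(observed)
frac = sum(s >= obs_stat for s in sim_stats) / n_sims
print(f"observed max = {obs_stat:.1f}; fraction of simulated datasets with max >= observed: {frac:.3f}")
# A small fraction points to a specific way the data differ from the model (the outlier),
# which is the "how do the data differ" question rather than a bare accept/reject.
```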

              I will say, though, that I get very annoyed by researchers who refuse to check their models. I remember back in 1991 going to a Bayesian conference where researcher after researcher insisted to me that they had no need to check their model—indeed, they seemed to feel that model checking was somehow inappropriate—because their models were “subjective.” Which seemed so wrong to me: if a model is subjective, all the more reason to check its fit, no? I did not like that solipsistic philosophy, which reminded me of various non-Bayesian or anti-Bayesian statisticians I’d met who staunchly opposed all adjustments to data on the grounds that you couldn’t know that the adjustment would not make things worse, but then they’d be just fine treating real-world surveys as if they were simple random samples, and recommending the usual Poisson distributions etc. Both the extreme Bayesians and the extreme non-Bayesians held positions that were so skeptical as to be completely credulous of whatever models they happened to be using.

              So, yeah, I think that some version of (probabilistic) falsification is really important. I’m just not framing it in terms of type 1 and type 2 errors. The point of my above post is not to justify everything that my colleagues do and say but rather to explain that decision analysis is not necessarily equivalent to p-value thresholds or type 1 and 2 analyses.

              • Andrew:
                Thanks for your reply. I realize this issue is separate from your general point distinguishing statistical significance tests from full-blown Bayesian decision making, but it comes up at the end of your post, and in Hennig’s comment, and since I’m quite keen to ascertain what your current view is, I hope you won’t mind going a little further. Let me say that I’m very glad that there’s one thing you still hold: model checking.
                You wrote: “if a model is subjective, all the more reason to check its fit, no?”

                Subjective Bayes factor theorists tell me that what they want is to discriminate (beliefs in) two hypotheses given it is believed that one is true (or something I can’t quite translate).

                But one more thing before leaving off my comment, so we can always refer to it: my comment concerned an example where the model (of the background) was already well-tested, and the goal was to ascertain if there was a genuine signal. (It was not model testing at that stage, but testing within a model*.) So can we agree that falsification of “background alone” is informative in those contexts where the goal is just to learn if the observed difference (or “bump”) is a mere statistical fluctuation (as opposed to a genuine signal), where the model of “statistical fluctuation” is already shown adequate (and is continually checked)? Any subsequent estimation of the character of the effect depends on this, as does subsequent testing of HEP theories.

                • I think type 1 and 2 errors, i.e., N-P tests occur only (or mainly) in testing within a model. I will ask Spanos.
                • Particle physicist Robert Cousins sent me a reply by email, and is allowing me to share it. The link to his talk, which I haven’t yet looked at, is bound to be of enormous interest:

                  FROM ROBERT COUSINS TO ME:
                  Evidently, Andrew Gelman operates in an environment where models are assumed to be wrong, so “checking” how wrong they are is a somewhat low-stakes endeavor to see how they can be improved. In contrast, in particle physics, our “core” models are “Laws of Nature”, and it is a HUGE deal to add a parameter to a Law of Nature. If someone shows conclusively that the probability per year of a proton decaying is not zero, but rather 10^{-34}, that will win a Nobel Prize. I would point your readers to a talk that I gave over Zoom at STAMPS, https://www.cmu.edu/dietrich/statistics-datascience/stamps/events/webinars/webinar-archive/spring-21.html .
                  Scroll down to get my talk, slides and video.

                  One thing you wrote about the Higgs might be misleading: Not everyone was expecting or hoping for the same thing. What is true is that the properties (many measurements) are essentially all consistent with the Standard Model predictions. (The few outliers are not yet very significant.) So anyone expecting or hoping for no fun surprises got what they were expecting/hoping. Probably most people (including me) were hoping for serious discrepancies with the Standard Model, for example caused by dark matter particles. The hope was that the Higgs boson would be a “portal” to “New Physics”, a hope fueled by a technical argument (still true) that a spin 0 particle with positive parity would interact with new particles. So far, we have only confirmed its interactions with Standard Model particles, not unknown ones.

                  • Andrew Gelman

                    Robert:

                    As I wrote in my comment below, I guess it depends on the application. I’ve done statistical modeling for physics problems, and p-values never came up. That said, I’ve never worked at CERN on particle physics, and it could well be that such methods have been useful to them. There are many roads to Rome, and even if a method is, in my view, sub-optimal, it can still be useful to people.

                    I have done some physics research, but not on particle physics. The physics research I did had more of an engineering flavor and we were not attempting to test any fundamental models. My applied research has mostly been in areas related to social, biological, and environmental sciences, and my post was a reply to Lakens, whose research is in psychology.

                    There are people who use Bayesian methods and not tail-area probabilities when working on particle physics, and if any of them are reading this thread, they could respond to you. For me, I’ll just say that this is outside my area of expertise and that I have no problem accepting that methods that do not make much sense in my areas of application can work well in yours.

                    Relevant to all this is our paper, Abandon Statistical Significance, whose first sentence is, “The biomedical and social sciences are facing a widespread crisis with published findings failing to replicate at an alarming rate,” and which later states: “we propose to abandon statistical significance, to drop the null hypothesis significance testing (NHST) paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences.” We add, “we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors.” I don’t think any of this is in conflict with the situation as described in your comment. It would be fair to say that we should’ve changed the title of our paper from Abandon Statistical Significance to something like Abandon Statistical Significance for the Biomedical and Social Sciences; all I can say here was that this entire discussion was taking place within the context of the replication crisis in the biomedical and social sciences.

                    I hope this is helpful.

                    • Andrew, thanks very much. The use of Bayesian machinery in high energy physics (including by myself) is typically for estimation (especially interval estimation), and nearly always in a context where the intervals are known to have good frequentist properties. The more interesting/problematic case to me by far is the subject of my talk at Carnegie Mellon’s STAMPS that my email posted by Deborah pointed to: the test of a sharp null hypothesis (typically approx 0, or a non-zero Standard Model prediction) vs a continuous alternative (often but not always non-negative). As I said at the end, “This is what we do for a living!”. And indeed we compute p-values (after *long*, detailed scrutiny of systematic uncertainties and our models and nuisance parameters, where most of our analysis effort and internal review effort usually goes, especially as our sample sizes become enormous). What’s a compelling alternative?

                      As you know, and as much discussed in the statistics literature since Bartlett’s comment on Lindley’s 1957 “paradox” paper, the answers from the approach of Jeffreys Chapter V (Bayes Factor, etc.) depend directly on the scale of the prior pdf of the parameter, not a satisfactory situation for most of us. (And for the issues to arise, sharp null hypotheses do not have to be measure zero, just sharp on the scale of the measurement uncertainty; examples from HEP are in my talk.)  I see that BDA3 (pp. 182-184) also does not generally recommend Bayes Factors, indeed using this high sensitivity to the prior variance as an example. Bayes Factors are occasionally tried in HEP, but I think that the more common cases are in the related field of cosmology, which has different statistical traditions.

                      Abandoning p-values (or equivalent number of sigma) in this testing situation (sharp null vs continuous alternative) would require having something more useful. I mention alternatives in my talk, for example Bernardo’s Reference Analysis, and also quotes from a couple of your writings, pointing to your 2013 paper with Shalizi. But so far, alternatives to p-values (mapped to sigmas) have not caught on in HEP, despite investigation by some.  (I noticed that the recent book on Objective Bayesian Inference by Berger, Bernardo, and Sun does not include this testing problem in their scope, although Jim and José have each published a lot on it.)

                      Another issue is scaling with sample size. In HEP, reporting confidence intervals for the effect size (at one or more usual standard C.L., such as 68% and 95%) has always been essentially mandatory, and the accompanying interpretation of any p-values (for rejecting the sharp null hypothesis or not) is typically independent of the sample size n.  My talk discusses at some length how opinions have varied widely among statisticians regarding the sqrt(n) scaling in Jeffreys’s method. It is hard for me to imagine acceptance in HEP of scaling p-value thresholds as sqrt(n). 

                      Many of us in HEP know that p-values are widely bashed, violate the Likelihood Principle, etc.  But we also know that the Jeffreys method is unsatisfactory to us, and indeed to many Bayesians. I do not see how predictive p-values are a compelling replacement for p-values in this context.  So my talk concludes that “The status quo of the philosophical foundations is unsettling.”

                      There is more about this testing problem in my Synthese paper that was the basis of a lot of my talk, https://arxiv.org/abs/1310.3791. Section 9 also discusses the choice of Type I error probability alpha.

                      Thanks again,

                      Bob

                • Aris Spanos

                  Reply to Deborah Mayo’s query

                  Mayo query: “I think type 1 and 2 errors, i.e., N-P tests occur only (or mainly) in testing within a model.”

                  The type I and II error probabilities are formally defined in the context of testing within the boundaries of a statistical model when the null (H0) and alternative (H1) hypotheses are framed in terms of the statistical model parameter(s). This is because in such a context one can clearly demarcate the type I and II errors in terms of the range of values of the unknown parameter(s).

                  Example. In the context of a simple Bernoulli model, consider the hypotheses:

                                                      H0: θ≤θ₀ vs. H1: θ>θ₀

                  Since this model has only one parameter θ taking values in the interval (0,1), one can demarcate the Type I error: rejecting H0 when it is true, meaning that the true value of θ, say θ*, lies in the interval (0, θ₀]. Note: the round bracket “(” excludes the value after it, and the square bracket “]” includes the value before it. Analogously, the Type II error: accepting H0 when false, means that θ* lies in the interval (θ₀,1).

                  This clear demarcation of the clauses “H0 is true” and “H0 is false” is crucially important for the evaluation of the type I and II error probabilities that calibrate the capacity of a N-P test to detect different discrepancies from the null value. Without it, such error probabilities cannot be operationalized with sufficient accuracy to calibrate the capacity of the test!
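                  As a small illustration of this demarcation, here is a sketch (all numbers are illustrative choices of mine, not Spanos’s) that computes the type I and II error probabilities for a one-sided test in the simple Bernoulli model, rejecting H0: θ ≤ θ₀ when the number of successes exceeds a critical value c:

```python
# Type I and II error probabilities for a one-sided test in the simple Bernoulli model:
# reject H0: theta <= theta_0 when the number of successes X exceeds a critical value c.
# The values of n, theta_0, theta_1, and alpha below are illustrative choices.

from math import comb

def binom_tail(n, c, theta):
    """P(X > c) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(c + 1, n + 1))

n, theta_0, alpha = 100, 0.5, 0.05

# Smallest c whose rejection probability at theta_0 is <= alpha; theta_0 is the worst
# case over H0: theta <= theta_0 because the tail probability is increasing in theta.
c = next(c for c in range(n + 1) if binom_tail(n, c, theta_0) <= alpha)

theta_1 = 0.6  # one specific alternative value in H1: theta > theta_0
type_I = binom_tail(n, c, theta_0)       # rejecting H0 when the true theta* <= theta_0
type_II = 1 - binom_tail(n, c, theta_1)  # accepting H0 when the true theta* = theta_1

print(f"c = {c}; type I error prob. at theta_0 = {type_I:.3f}; type II error prob. at theta_1 = {type_II:.3f}")
```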

              • To continue my reply to you from last night:

                On “tail areas”
                Here’s the rest of my reply to Gelman.
                When E. Pearson (1970) takes up Jeffreys’ question, “Why did we use tail area probabilities . . .?”, his reply is that “this interpretation was not part of our approach” (p. 464). Tail areas simply fall out of the N-P desiderata of good tests. Good tests map observed values of d(x) into either “reject Ho” or “do not reject Ho” in such a way that there is a low probability of erroneously rejecting Ho and a much higher probability of correctly rejecting Ho.
                “Given the lambda criterion one needed to decide at what point Ho should be regarded as no longer tenable, that is where should one choose to bound the rejection region? To help in reaching this decision it appeared that the probability of falling into the region chosen, if Ho were true, was one necessary piece of information.” (ibid.)
                Cox and Hinkley (1974, p. 66) put it this way:
                “Suppose that we were to accept the available data as evidence against Ho. Then we would be bound to accept all data with a larger value of [d] as even stronger evidence. Hence pobs [the observed p-value] is the probability that we would mistakenly declare there to be evidence against Ho, were we to regard the data under analysis as just decisive against Ho.”
                Note that what is being accepted or not are claims that the data provide evidence against Ho—not claims to accept or reject Ho itself.

                “Reject Ho” and “fail to reject Ho” are generally interpreted as x is evidence against Ho, or x fails to provide evidence against Ho, (which is not the same as evidence for Ho.)*
                It is also a way to check whether you have a sensible test statistic d(X). If results get MORE PROBABLE under Ho the larger the value of d(X) is, then this is the wrong test statistic to use for finding evidence against Ho.
                So looking at the tail area could be seen as the result of formulating a sensible distance measure (for Fisher), or erecting a good critical region (for Neyman and Pearson). (SIST, p. 169)

                Bayesians often criticize the use of tail areas because they accept the LP and reject error probabilities such as p-values. Another reason, rarely discussed, is that they have a tendency to use Pr(d(X) > d(x) ; H1)/ Pr(d(X) > d(x) ; Ho) as a kind of likelihood ratio (in favor of H1) in Bayes formula. (They usually use an H1 that gives high power.) Using it this way, as opposed to how error statisticians use it, there is more evidence against Ho, compared to the ratio Pr(d(X) = d(x) ; H1)/ Pr(d(X) = d(x); Ho). For the statistical significance tester, looking at Pr(d(X) > d(x) ; Ho) makes it HARDER, not easier, to reject Ho than if she used Pr(d(X) = d(x); Ho). Since continuous d(x) are always improbable, we’d take any d(x) to reject Ho, if all we needed was improbability under Ho. SIST (334)

                *Note: The term “acceptance,” Neyman tells us, was merely shorthand: “The phrase ‘do not reject H’ is longish and cumbersome . . . My own preferred substitute for ‘do not reject H’ is ‘no evidence against H is found’” (Neyman 1976, p. 749).

                See SIST for references or the captain’s biblio on this blog.

        • Mayo: I think what helps here is my distinction between nominal and interpretative null hypotheses. The nominal null hypothesis (a specific probability model with a specific parameter value, say) will be false anyway, but what really is of interest is the interpretative null hypothesis such as “new drug isn’t better than placebo”. But this can be formalised by many models (potentially involving dependence, among other things), and not all of these can be distinguished from the nominal null (or the nominal overall model involving the alternative) by misspecification testing.

          The interesting question now is, how good is the test (derived based on the nominal null hypothesis) at distinguishing the interpretative null from the interpretative alternative. The answer may be complex and will certainly depend on the specific situation. The standard theory of testing treats the nominal model only; robustness theory offers more (and will often point to different tests), but doesn’t solve the problem fully, due to focus on worst case situations regarding distributional shape but only very limited potential regarding violations of i.i.d. and the like.

          • Christian:
            To say your “nominal null Ho” is false anyway is to suppose Ho asserts “this is a good representation of ‘new drug isn’t better than placebo'” –or something like that. But that is not what a statistical hypothesis Ho asserts. So it’s wrong to say “it’s false anyway”, by which I mean it’s logically incorrect. Although I would agree on the importance of showing the relevance of a statistical test for the substantive scientific question, that is a distinct task. Cox has a good taxonomy of types of nulls, including substantive ones. As I say in SIST, the trick is to find out true things using deliberately false (and simplified) models. Since humans do that regularly, it’s not clear what we learn from noting that models are idealizations.
            Can you explain your remark “robustness theory offers more (and will often point to different tests)”?

            • The “nominal null” in my terminology is a formal construct, like “X1,…,Xn distributed i.i.d. according to N(0,sigma^2)”. Of course in every situation we can discuss whether this is a good representation of the “interpretative null”. My point is that the distinction is helpful to clarify the discussion. Cox’ taxonomy is about the relation between interpretative and nominal null hypotheses.

              Whether this is “false” can be discussed on more than one level. I’d agree with the statement that calling it “false” is a triviality in the sense that it’s an idealised formal model, and it is not its job to be “true”. I can also sympathise with the view that calling a formal model on these grounds “false” is something of a category error, as there is no well defined true/false relation between reality and formal models (although I suspect one could make a “true/false” statement based on more precise consideration in which exact way a model is meant to represent reality – I’d be interested in what Roman Frigg has to say about this).

              On the other hand the recognition that many other models exist that could represent the same interpretative H0 and actually also give rise to the same data (and that there is no particular reason to privilege the nominal model compared to these alternative ones regarding the representation of the interpretative null) has implications for the interplay between theory and data analysis in practice. If somebody comes up with such a model, i.e., different from the nominal null hypothesis, but valid for formalising the same interpretative null hypothesis, and consistent with the same data, we would hope that the test that we’re running would still come to the same conclusion. This may or may not be the case, and here is where robustness theory comes in, as it investigates what happens if data are generated by models that are in some sense “similar” to the nominal model but not identical.

  3. Hi Andrew, I have a remark regarding a side aspect in the beginning of your posting (and discussion of p-values elsewhere). You seem to prefer expression of deviations from what is expected under the null in terms of standard errors rather than p-values. If the test statistic is normally distributed, these are translated into each other without difficulty. However in my view the argument (from the data) against the null hypothesis comes from low probability (“we have observed something surprising”) rather than from distance. Standard errors don’t seem so suitable for some test statistics with clearly non-normal distributional shape (e.g., skew, such as chi-squared with low df). Also p-values imply distinction between one- and two-sided tests, on which I’m keen (you may not be). I’d appreciate if you could comment on this. (I do realise that the current posting does not explicitly say that standard errors are a better unit than probabilities/p-values.)

    • Christian:

      We discussed how tests relate to test statistics in Spanos’ guest post:

      https://errorstatistics.com/2024/08/08/aris-spanos-guest-commentary-on-frequentist-testing-revisiting-widely-held-confusions-and-misinterpretations-2/

      Larger values of the test stat are designed to be improbable under Ho, while smaller values are more in accord with it.*

      You seem to be suggesting that Gelman would recommend just reporting the test statistic. Whether one reports the test stat or a corresponding p-value, what matters is whether anything is inferred. If there’s no statistical inference of any type, then there is no statistical test. I’m afraid that, in some quarters, that’s what “abandon significance thresholds” leads to. Moreover, reporting the test statistic or p-value is inadequate for statistical inference because the warranted discrepancy from Ho will differ with different sample sizes. A severity assessment ameliorates that.

      On another point you raise: I would agree with Cox (and Cox and Hinkley) that p-values are more suitably reported as one-sided unless the direction truly doesn’t matter and the inference is going to be to “there’s evidence of some discrepancy” in one or both directions, but without saying which. Both directions can be considered with an adjustment for selection. Spanos, I think, will just use one-sided p-values.

      *With chi-square tests of assumptions, of course, data can be too good to be true.

      • My posting was exclusively on Andrew’s apparent preference (not only in this posting) to say a test statistic is “2 standard errors away from what is expected under the null” rather than giving a p-value. This doesn’t amount to only giving the value of the test statistic, and neither does it imply that nothing else should be reported. It is rather a matter of communication.

    • Andrew Gelman

      Christian:

      The reason I prefer stating distance in terms of standard errors rather than p-values is that the standard errors have a direct interpretation without assuming the null hypothesis of no effect. In the applications I’ve looked at, I already know the null hypothesis is false, and I’m not particularly interested in the probability of some event conditional on that. I’m more interested in the distance of the data from what is predicted by the model.

      • Andrew:
        “I’m more interested in the distance of the data from what is predicted by the model.”
        But to pursue this, don’t you look at the probability of some data assuming the model is adequate (even if strictly false)?
        There may be some confusion between a test hypothesis that refers to no discrepancy from a hypothesized parameter value, within a model M, and one that asserts model M is adequate.
        In either case, the typical reductio begins “suppose this test hypothesis is true….” (The statistical version is analogous.)

  4. lakens

    I thank Andrew Gelman for his attempt to defend in more detail the idea that decision making based on cost-benefit analysis is different from significance testing. In this reply I will attempt to explain in more detail why they are not different.

    In 1933 Neyman and Pearson wrote:

    But whatever conclusion is reached the following position must be recognised. If we reject H0 , we may reject it when it is true; if we accept H0 , we may be accepting it when it is false, that is to say, when really some alternative Ht is true. These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by LAPLACE of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error ; is the punishment death or fine ; what is the danger to the community of released criminals ; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator.

    It is therefore surprising to read Andrew Gelman write: “The general point here is that there is no general equivalence between data-informed cost-benefit analysis and statistical significance thresholds or Neyman-Pearson hypothesis testing more generally.” Indeed, decision analysis was developed by Wald, who directly built on Neyman’s work, and Neyman was always very positive about the work Wald did.

    In other words, Neyman and Pearson stress that your choice of Type 1 and Type 2 error rates should depend on the costs. For example, if the costs of a Type 2 error are always higher than the costs of a Type 1 error, a researcher would set their alpha level to 1 (Field et al., 2004).

    As we explain in Maier and Lakens (2022), whenever you can put actual costs on Type 1 and Type 2 errors, you can use these to set the alpha level. We made an R package to perform the necessary calculations: https://lakens.github.io/JustifyAlpha/index.html
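
    A minimal sketch of this kind of calculation, written in Python rather than taken from the JustifyAlpha package, and with made-up costs, prior odds, effect size, and sample size: the alpha level is chosen to minimize the combined expected cost of Type 1 and Type 2 errors for a one-sided one-sample z-test.

    from scipy.stats import norm
    from scipy.optimize import minimize_scalar

    def expected_error_cost(alpha, effect=0.5, n=64,
                            cost_type1=1.0, cost_type2=4.0, prior_h1=0.5):
        """Expected cost of errors for a one-sided one-sample z-test (illustrative numbers)."""
        z_crit = norm.ppf(1 - alpha)                    # critical value under H0
        power = 1 - norm.cdf(z_crit - effect * n**0.5)  # power against `effect`
        beta = 1 - power                                # Type 2 error rate
        return (1 - prior_h1) * cost_type1 * alpha + prior_h1 * cost_type2 * beta

    res = minimize_scalar(expected_error_cost, bounds=(1e-6, 0.5), method="bounded")
    print(f"cost-minimizing alpha is roughly {res.x:.3f}")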

    Gelman seems to want to limit ‘significance testing’ to something that does not allow us to incorporate costs and benefits into where we set our alpha level, but I will continue to go with Neyman’s original view myself.

    As most statisticians do, Gelman focuses on how to analyze data and draw a statistical inference. But when discussing which approach to statistics should be used in science, there are more important things to consider. The real question is a social epistemological one: how can we use a method to generate knowledge that will work within the social system that science is?

    In the scientific system, decision making is rare beyond applied research. In applied research, such as health research and medicine, cost-effectiveness analyses are common (Neumann et al., 2016). But if you listened to the press conference announcing the discovery of the Higgs boson, you will not have heard anyone talk about costs and benefits.

    And yet, even at CERN they considered costs and benefits when deciding where to set the alpha level in their tests. Annoyed at the presence of too many Type 1 errors, the community itself decided to lower the alpha threshold for scientific discoveries to 5 sigma. They followed Neyman and Pearson’s advice to the letter, but at the general level of a field.
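
    For concreteness, the alpha level implied by that 5-sigma convention can be computed directly (illustrative arithmetic only, not CERN’s actual analysis pipeline):

    from scipy.stats import norm

    # One-sided tail probability beyond 5 standard deviations: about 2.9e-07
    print(norm.sf(5))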

    Having a fixed threshold in a field that guarantees there is still progress, but without too many Type 1 errors, is an example of a cost-benefit analysis. Perhaps Andrew Gelman wants to ‘abandon statistical significance’ in the sense of stopping people from mindlessly choosing a 5% alpha level for all their tests. A more appropriate title for their original article would then have been ‘Think about costs and benefits when setting your alpha level’. But in practice, it is not so easy to justify an alpha level other than 5% (for some social epistemological reasons, 5% might be just low enough, but not too low, to work in practice; Uygun-Tunç et al., 2023).

    Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x

    Maier, M., & Lakens, D. (2022). Justify Your Alpha: A Primer on Two Practical Approaches. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221080396. https://doi.org/10.1177/25152459221080396

    Neumann, P. J., Ganiats, T. G., Russell, L. B., Sanders, G. D., & Siegel, J. E. (Eds.). (2016). Cost-Effectiveness in Health and Medicine. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780190492939.001.0001

    Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2023). The epistemic and pragmatic function of dichotomous claims based on statistical hypothesis tests. Theory & Psychology, 33(3), 403–423. https://doi.org/10.1177/09593543231160112

    • Andrew Gelman

      Daniël:

      I guess it depends on the application. I’ve done statistical modeling for physics problems, and p-values never came up. That said, I’ve never worked at CERN on particle physics, and it could well be that such methods have been useful to them. There are many roads to Rome, and even if a method is, in my view, sub-optimal, it can still be useful to people.

      I’ve also not worked on the problem of “Is it more serious to convict an innocent man or to acquit a guilty?” This is an important problem! Just not something that I’ve worked on, and it doesn’t really line up with the sort of applications that I’ve ever looked at.

      Regarding decision analysis and hypothesis testing: I agree with you that, if you start with thresholds based on type 1 and type 2 errors, this can be formulated as a decision problem. As you say, “whenever you can put actual costs on Type 1 and Type 2 errors, you can use these to set the alpha level.”

      You write that we should’ve titled our paper, “think about costs and benefits when setting your alpha level.” No–that is not correct. We do not recommend that people “set an alpha level” at all.

      Here’s the key issue. When my colleagues and I talk about costs and benefits, we are not talking about costs and benefits of type 1 and type 2 errors. Indeed, in the sorts of problems that I work on, neither type 1 nor type 2 errors apply. When we talk about costs and benefits, we’re talking about costs and benefits in dollars, lives, or other externally measurable units. For examples, I refer you to chapter 9 of Bayesian Data Analysis, third edition–we have three different decision-analysis examples in that chapter, and the book is free to download–or to this paper from 1999: http://stat.columbia.edu/~gelman/research/published/lin.pdf
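
      To make that distinction concrete, here is a toy sketch (not one of the examples from chapter 9, and with entirely made-up numbers for the effect, the uncertainty, the cost, and the dollar value per unit of effect) of a decision made by maximizing expected net benefit in dollars over posterior uncertainty; no alpha level or p-value threshold appears anywhere.

      import numpy as np

      rng = np.random.default_rng(1)

      # Posterior draws for the effect of a hypothetical intervention
      effect_draws = rng.normal(loc=0.8, scale=0.6, size=10_000)

      cost_per_person = 50.0           # dollars per person (assumed)
      dollars_per_unit_effect = 100.0  # dollar value of one unit of effect (assumed)

      # Average the net benefit over the whole posterior, including the
      # possibility that the effect is small or negative.
      expected_net_benefit = np.mean(dollars_per_unit_effect * effect_draws
                                     - cost_per_person)
      decision = "intervene" if expected_net_benefit > 0 else "do not intervene"
      print(f"expected net benefit: ${expected_net_benefit:.0f} per person -> {decision}")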

      So I think this discussion has been useful in that it points to our disagreement. We are both comfortable with decision analysis. You work on problems where you can set up costs and benefits in terms of type 1 and type 2 errors, and I agree that if you’re working in that framework, there can be a mapping to significance testing thresholds. I work on problems where I would set up costs and benefits in terms of dollars and lives, and in such problems there is no mapping to significance testing thresholds. See also the paragraph beginning “The applied point” in my above post, and the example that follows.

      My colleagues and I also work on many problems where there is no thresholding at all, where we just fit models, do inference, gather new data, etc.

    • Daniel:
      I appreciate your guest post and comment. Gelman has moved to a more serious position. Recall that decision making was just one context touched on in the McShane et al. 2019 paper; the main one is research. Gelman was the major force behind the Wasserstein et al. 2019 movement. In my opinion, he was/is so fed up with bad statistical social science that he decided it was best to shoot down all of statistical significance, even though, ironically, he admits he relies on statistical significance test reasoning in model checking, which is all-important to him. I think that, with such ambivalence, a less radical position would have been along the lines of what you suggest. [I have corrected or cut some terminology, thanks to Gelman’s correction. I apologize. I deleted my ruminations about AI/ML coming in to save us!] I have responded separately to Gelman’s reply, which I very much appreciate.

      • Andrew Gelman

        Deborah:

        You write, “Gelman was the major force behind the Wasserstein et al 2019 movement.” What do you mean by that? I typically don’t work well with committees (I’m not proud of that; it’s a weakness of mine that I’m stubborn; I just have to make my contributions in other ways). But maybe there is something specific you are referring to but which I have forgotten?

        The point of my above post is that decision analysis based on external measures such as dollars and lives is different from decision analysis based on type 1 and type 2 errors, and that decision analysis based on dollars and lives is not equivalent to p-value thresholding.

        Regarding your incorrect statements that I “decided it was best to shoot down all of statistical significance” and that some sort of “thrill” was involved, let me just say two things:

        1. I get my thrills in other ways. I write these papers, blog comments, etc., out of a feeling of obligation, not thrill. Writing can be pleasant and even fun sometimes, but it is not thrilling for me.
        2. I am not interested in “shooting down” statistical significance. In our paper, Abandon Statistical Significance, we write, “we propose to abandon statistical significance, to drop the null hypothesis significance testing (NHST) paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences.” We add, “we have no desire to ‘ban’ p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors.” For better or worse, we are much more “sober” than would appear from your inaccurate characterization of our positions.
        • Thank you very much for your reply to my comment to Lakens. (I’m still having problems with the comments to this blog, and am not informed when comments go up.*) My reply to Lakens was attempting to underscore the point of your post—namely, that your title was correctly intended, and definitely not merely recommending a context-dependent setting of thresholds. I had already posted comments taking your side on the difference between statistical significance tests and “full-dressed” Bayesian decision making, which really is a different context, so I was looking for a stronger way to convey that Gelman really means it.

          In your guest post, you were serving as a spokesperson for your joint paper, and I was mistaken not to allude to McShane et al. 2019. I have often said before that I considered you a major force behind the “abandon/retire significance” movement. You have long been a leading voice to which scientists turn for guidance on these matters. In writing Mayo and Hand (2022), it was evident, as we traced out Wasserstein et al., that your arguments, reasons, and recommendations (and those in McShane et al. 2019) formed the foundation of the editorial.

          Wasserstein et al. 2019: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive.”
          “…Some (e.g., Amrhein, Trafimow, and Greenland 2019; Hurlbert, Levine, and Utts 2019; McShane et al. 2019) prefer to rip off the bandage and abandon use of statistical significance altogether.” (Wasserstein et al. 2019)

          I think it was the basis for ATOM (“Accept uncertainty. Be thoughtful, open, and modest”), the overarching recommendation of the editorial.
          Mayo and Hand (2022): https://errorstatistics.com/2022/05/22/d-mayo-d-hand-statistical-significance-and-its-critics-practicing-damaging-science-or-damaging-scientific-practice/

          It is obvious that you were incredibly successful with the call on your blog to “Abandon/retire statistical significance: your chance to sign a petition!”, resulting in over 800 people signing their names to the article in Nature by Amrhein et al.
          https://statmodeling.stat.columbia.edu/2019/03/05/abandon-retire-statistical-significance-your-chance-to-sign-a-petition/
          Hardwicke and Ioannidis (2019) were sufficiently impressed to embark on a survey analyzing the signatories: “Petitions in scientific argumentation: Dissecting the request to retire statistical significance”. A link to the article (also one by you, and one by me) is in a blogpost:
          https://errorstatistics.com/2024/05/04/5-year-review-hardwicke-and-ioannidis-
          Committees had nothing to do with it: your voice on these matters has been highly influential for a long time!

          I apologize for my jokey remark in my email reply to Lakens, which I have revised. From the view of an outsider like me, the hoopla surrounding the episode appeared very exciting: the 2019 ASA Executive Director’s editorial, the creation of the 2019 ASA Task Force (which recommended that statistical significance not be abandoned), etc. My opinion had been that it would be somewhat of a thrill to be at the center of this unique and successful movement.
          Link to the 2019 Task Force on Statistical Significance and Replication:

          presidents-task-force-statement.pdf

          *Evidently I’m not up to date with the new “block editing” in WordPress; I’m still using “classic editing”.

  5. It might be interesting to note that Daniël’s social epistemological question has been discussed extensively in philosophy of science, including the question of whether we should use (expected) costs to set the alpha level. Douglas (2000) is the locus classicus for affirming the use of even “non-epistemic” values in setting the alpha level.
    Tellingly, this discussion is a reaction to Jeffrey (1956), who takes a position akin to Andrew’s in not “see[ing] the need for accepting or rejecting a hypothesis or scientific model” in science.
    Neither side in this debate rejects the possibility of doing decision analysis in science; rather, they differ in their assessment of whether this is what science should do, a difference that perhaps divides Daniël and Andrew too.

    Douglas, Heather. 2000. “Inductive Risk and Values in Science.” Philosophy of Science 67 (4): 559–79. https://doi.org/10.1086/392855.

    Jeffrey, Richard. 1956. “Valuation and Acceptance of Scientific Hypotheses.” Philosophy of Science 23 (3): 237–46. https://www.jstor.org/stable/185418
