Guest Post: Daniël Lakens, “How were we supposed to move beyond p < .05, and why didn’t we?” (part 1 of 2)


Professor Daniël Lakens
Human Technology Interaction
Eindhoven University of Technology

*[Some earlier posts by D. Lakens on this topic are listed at the end of part 2, forthcoming this week]

How were we supposed to move beyond p < .05, and why didn’t we?

It has been 5 years since the special issue “Moving to a world beyond p < .05” came out (Wasserstein et al., 2019). I might be the only person in the world who has read all 43 contributions to this special issue. [In part 1] I will provide a summary of what the articles proposed we should do instead of p < .05, and [in part 2] offer some reflections on why they did not lead to any noticeable change.

1: Test range predictions

Perhaps surprisingly, the most common recommendation is to continue to make dichotomous claims based on whether a statistic falls below a critical score – exactly as we now do with p-values. However, instead of computing test scores for null hypothesis significance tests, the recommendation is to compute a test statistic for interval hypothesis tests. This recommendation is 70 years old (Hodges & Lehmann, 1954), and it solves most of the criticisms people have of the current use of p-values. However, it is challenging to specify a range of values to test against, and this is why adoption of range predictions is incredibly slow. It requires a large time investment to figure out which effect sizes we would meaningfully predict, and most research areas have not invested that time yet.
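To make this concrete, here is a minimal sketch (not taken from any of the papers in the special issue) of such an interval hypothesis test, implemented as a minimal-effect test against a smallest effect size of interest; the simulated data, the smallest effect of interest, and the alpha level are all hypothetical choices for illustration.

```python
# A minimal sketch (hypothetical numbers) of a minimal-effect test:
# instead of testing against 0, test against a smallest effect size
# of interest (SESOI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)
scores = rng.normal(loc=0.8, scale=1.0, size=50)  # simulated outcome scores

sesoi = 0.5   # smallest mean difference from 0 we would consider meaningful
alpha = 0.05

# H0: mean <= SESOI  vs  H1: mean > SESOI
t_stat, p_val = stats.ttest_1samp(scores, popmean=sesoi, alternative="greater")

if p_val < alpha:
    print(f"p = {p_val:.3f}: claim that the effect exceeds the SESOI")
else:
    print(f"p = {p_val:.3f}: no claim; data are compatible with effects at or below the SESOI")
```

The claim is still dichotomous and still based on a p-value; what changes is the hypothesis being tested.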

Anderson discusses the importance of testing against effects that are considered practically significant, and of taking precision into account (Anderson, 2019). He interprets confidence intervals as tests, and his contribution therefore boils down to a reminder that range predictions are an improvement over NHST. If you can specify a smallest effect of interest, and have an idea of how precise you want the estimate to be, Anderson proposes you should design studies that return confidence intervals that yield informative results with respect to an effect size you have determined would be practically significant.

Betensky also proposes testing range predictions (Betensky, 2019), and specifically minimum effect tests. She writes “The p-value and sample size jointly yield 95% confidence bounds for the effect of interest, which can be compared to the predetermined meaningful effect size to make inferences about the true effect.” She proposes: “The principle: Reject the null in favor of a meaningful effect if and only if the lower 95% confidence bound exceeds the smallest effect size considered meaningful. Thus, rejecting the null means we can be 95% confident that the true effect size is at least as large as the size considered to be clinically meaningful.” This is a test that yields a dichotomous claim based on a p-value. We are not moving beyond p < .05 in this proposal, but we are testing against meaningful effects, instead of a null hypothesis of 0.
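Betensky’s principle amounts to a one-line decision rule on the lower confidence bound, equivalent to the minimal-effect test sketched above. The sketch below uses a one-sample mean and hypothetical summary statistics purely for illustration.

```python
# A sketch (hypothetical numbers) of the rule "reject the null in favor of a
# meaningful effect iff the lower 95% confidence bound exceeds the smallest
# effect size considered meaningful."
import numpy as np
from scipy import stats

meaningful = 0.5            # smallest effect size considered meaningful
mean, sd, n = 0.9, 1.2, 60  # hypothetical sample summary statistics

se = sd / np.sqrt(n)
lower_95 = mean - stats.t.ppf(0.95, df=n - 1) * se  # one-sided lower 95% bound

if lower_95 > meaningful:
    print(f"lower bound {lower_95:.2f} > {meaningful}: claim a meaningful effect")
else:
    print(f"lower bound {lower_95:.2f} <= {meaningful}: no claim of a meaningful effect")
```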

Pogrow suggests focusing on practical benefit, which he proposes is evaluated by comparing the unadjusted actual performance of an experimental group to an existing benchmark (Pogrow, 2019). The recommendation is not very good – in essence, the author suggests ignoring all statistical inference. He acknowledges that this leaves open the question of when a difference is ‘large enough’. He writes: “At the very least “oomph” is a clearly noticeable improvement or other benefit that does not require precise statistical criteria to discern.” This will not do (my general writing advice is that if you cannot define what is within quotation marks, you typically need to think more about what you are proposing). The correct way to achieve this proposal would be a test (even if the Type 1 error rate is set to a higher level than 5% based on a cost-benefit analysis (Maier & Lakens, 2022)). Nevertheless, in essence the paper recommends comparing effects against a meaningful effect size.

Goodman, Spruill and Komaroff (2019) also propose a test of a range prediction, and call it a minimum effect size plus p-value (MESP) approach. They write “Here, rejecting the null also requires the difference of the observed statistic from the exact null to be meaningfully large or practically significant, in the researcher’s judgment and experience.” Again, this is still a test with a dichotomous conclusion based on a p-value – and in essence simply a minimal effect test (Mazzolari et al., 2022; Murphy & Myors, 1999).
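As a schematic illustration of the quoted MESP rule (not code from their paper), the decision combines an ordinary significance test with a check that the observed effect is at least the minimum meaningful size; the alpha level and the cutoff below are hypothetical.

```python
# A schematic of the "minimum effect size plus p-value" (MESP) decision rule
# as quoted above; alpha and the minimum meaningful effect are hypothetical.
def mesp_claim(p_value, observed_effect, alpha=0.05, min_meaningful=0.5):
    return (p_value < alpha) and (abs(observed_effect) >= min_meaningful)

print(mesp_claim(p_value=0.03, observed_effect=0.2))  # significant but too small: False
print(mesp_claim(p_value=0.03, observed_effect=0.8))  # significant and meaningful: True
```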

Blume and colleagues (2019) state that p-values might be criticized in the literature, but “having a gross indicator for when a set of data are sufficient to separate signal from noise is not a bad idea” (p. 157). They propose ‘Second Generation P-Values’, which is an interval null hypothesis test. As we have pointed out elsewhere, their idea is conceptually extremely similar to traditional equivalence tests (Lakens & Delacre, 2020), and equivalence tests are a superior solution. Regardless, this proposal again leads to dichotomous decisions based on whether a confidence interval falls within a range prediction, or not.
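For readers unfamiliar with the proposal, the second-generation p-value is essentially the fraction of the interval estimate that overlaps an interval null. The sketch below follows the commonly cited definition by Blume and colleagues, including the correction that treats very wide interval estimates as inconclusive, but it should be read as an illustration with hypothetical intervals rather than a verbatim transcription of their formula.

```python
# A sketch of a second-generation p-value (SGPV): the overlap between an
# interval estimate and an interval null, as a fraction of the interval
# estimate, with a correction for very wide interval estimates.
def sgpv(est_lo, est_hi, null_lo, null_hi):
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    est_len = est_hi - est_lo
    null_len = null_hi - null_lo
    return (overlap / est_len) * max(est_len / (2 * null_len), 1.0)

# Hypothetical 95% CI of [0.1, 0.9] versus an interval null of [-0.2, 0.2]:
print(sgpv(0.1, 0.9, -0.2, 0.2))  # 0.125: mostly outside the null interval
```

A value of 0 plays the role of rejecting the interval null, a value of 1 plays the role of declaring equivalence, and intermediate values are treated as inconclusive, so the procedure still yields discrete decisions.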

Greenland (2019) also suggests “computing P-values for contextually important alternatives to null (nil or “no effect”) hypotheses, such as minimal important differences”, in a piece in which he criticizes unwarranted criticism of p-values. He does push back against the dichotomization of p-values, but mainly in observational studies: “It may be argued that exceptions to (4) [the dichotomization of p-values] arise, for example when the central purpose of an analysis is to reach a decision based solely on comparing p to an α, especially when that decision rule has been given explicit justification including both false-acceptance (Type-II) error costs under relevant alternatives and false-rejection (Type-I) error costs (Lakens et al., 2018), as in quality-control applications. Such thoroughly justified applications are, however, uncommon in observational research settings.” This is perhaps similar to Cohen (1994) who criticized the use of NHST, only to accept that it might play a role in well-controlled experiments (Cohen, 1995).

Calin-Jageman and Cumming stress an estimation approach (Calin-Jageman & Cumming, 2019), but when a researcher wants to make a decision (such as whether or not a prediction is supported) they write “Another frequent concern is that scientists need to make clear Yes/No decisions (e.g. Does this drug work? Is this project worth funding?). No problem! Focusing on effect sizes and uncertainty does not preclude making decisions—in fact, it makes it easier because one can easily test a result against any desired standard of evidence. For example, suppose you know that a drug improves outcomes by 10% with a 95% confidence interval from 2% up to 18%. If the standard of evidence required is at least a 1% increase in outcomes, the drug would be considered suitable (because a 1% increase is not within the range of the confidence interval).” In other words, the authors propose testing range predictions, such as minimal effect tests, based on the confidence interval.

2: Complement p with something else

So far, we have seen that 7 of the articles either directly propose the use of p-values for interval hypothesis tests, or make a proposal that is strongly in line with it. Several other papers support the continued use of dichotomous claims, but recommend adding other statistics. Adding such statistics is already common in reporting guidelines, which typically require researchers to report effect sizes and confidence intervals. But the special issue saw some additional proposals.

For example, Colquhoun foresees the continued use of p-values, and suggests complementing them with additional statistics. Colquhoun (2019) writes: “It is suggested that p-values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p-value. This number could be the minimum FPR (that calculated on the assumption of a prior probability of 0.5, the largest value that can be assumed in the absence of hard prior data).”

Matthews (2019) suggests the use of Bayesian credible intervals instead of p-values (called AnCred), but the proposed alternative is still a testing procedure that leads to dichotomous conclusions, just based on a different critical value, which is now determined with a Bayesian flavor. As Matthews writes: “Perhaps the most obvious potential criticism of AnCred is that the concept of credibility simply replaces statistical significance as a means of dichotomizing findings. However, it should be stressed that dichotomization per se has never been the problem with NHST; it is the actions that flow from it.”

Benjamin and Berger (Benjamin & Berger, 2019) state that “In statistical practice, perhaps the single biggest problem with p-values is that they are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis.” They propose to complement p-values with a Bayes factor bound to communicate how strong the evidence in the data is. Their long-term plan is to make researchers more comfortable with Bayesian approaches. They hint at being OK with completely replacing p-values in the future, but when pressed, I doubt they would support giving up error control when making claims. As another contribution in the special issue shows, removing statistical error control will lead to more researchers claiming their predictions are supported than is warranted based on commonly used error rates (Fricker et al., 2019). Given the widespread misuse of Bayes factors (Wong et al., 2022), anyone proposing an alternative to p-values will also need to engage with the problems that will emerge if this recommendation were adopted at the same scale as p-values are currently adopted.
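Colquhoun’s minimum false positive risk and Benjamin and Berger’s Bayes factor bound are closely related. One common way to compute both (used here purely for illustration, not as a transcription of either paper’s exact calculations) is the -e·p·ln(p) bound of Sellke, Bayarri, and Berger (2001), combined with a prior probability of 0.5 for the null.

```python
# A sketch of two "complement the p-value" statistics via the -e*p*ln(p)
# bound: the Bayes factor bound (BFB) and the minimum false positive risk
# (FPR), assuming a prior probability of 0.5 for the null hypothesis.
import math

def bayes_factor_bound(p):
    """Upper bound on the Bayes factor against H0 (valid for p < 1/e)."""
    return 1.0 / (-math.e * p * math.log(p))

def min_false_positive_risk(p, prior_h0=0.5):
    """Lower bound on P(H0 | significant result) under the stated prior."""
    prior_odds_h1 = (1 - prior_h0) / prior_h0
    return 1.0 / (1.0 + prior_odds_h1 * bayes_factor_bound(p))

print(round(bayes_factor_bound(0.05), 2))       # ~2.46: at most modest evidence
print(round(min_false_positive_risk(0.05), 2))  # ~0.29: much larger than 0.05
```

At p = .05 this gives a minimum FPR of roughly 0.29, in line with the FPR50 values of 0.26–0.29 that Colquhoun cites in the discussion below.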

Krueger and Heck (2019) write “As experimentalists, we are reluctant to relinquish dichotomous decision-making entirely” (p. 125) and provide a defense of dichotomous claims in science. The authors do not really provide a suggestion for an alternative, beyond a generic statement that “We join those who recommend researchers use a toolbox of statistical techniques, employ good judgment, and keep an eye on developments in statistical and data science.”

3: Use Statistical Decision Theory

Statistical decision theory is an important upgrade to Neyman-Pearson hypothesis testing, and yet, Neyman-Pearson hypothesis tests remain the dominant approach to p-values in science. The reason statistical decision theory is not adopted is that we don’t know how. In statistical decision theory we still make dichotomous claims based on p-values, but the alpha level is no longer set to 5%, and the Type 2 error rate is no longer a minimum of 20%; instead, error rates are specified based on a cost-benefit analysis (Maier & Lakens, 2022). Researchers need to be able to quantify the costs and benefits of doing their study – and they can’t. This is already difficult enough in applied research, but only becomes more difficult in theoretical research. Several authors in the special issue propose to use statistical decision theory, but no one explains how we can achieve it.
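As one way to make this concrete (a sketch in the spirit of the cost-benefit approach cited above, not the exact procedure in Maier & Lakens, 2022), the alpha level can be chosen to minimize the expected cost of the two error types for a planned design. The sample size, assumed effect size, prior probability, and relative costs below are all hypothetical.

```python
# A sketch of choosing alpha by minimizing the expected cost of Type 1 and
# Type 2 errors for a planned two-group study, using a normal approximation
# for the power of a one-sided test. All inputs are hypothetical.
import numpy as np
from scipy import stats, optimize

n_per_group = 64   # planned sample size per group
effect_d = 0.5     # assumed standardized effect size if H1 is true
prior_h1 = 0.5     # prior probability that the effect exists
cost_type1 = 1.0   # relative cost of a false positive claim
cost_type2 = 4.0   # relative cost of missing a true effect

def expected_cost(alpha):
    z_crit = stats.norm.ppf(1 - alpha)
    ncp = effect_d * np.sqrt(n_per_group / 2)   # noncentrality under H1
    power = 1 - stats.norm.cdf(z_crit - ncp)
    beta = 1 - power
    return (1 - prior_h1) * cost_type1 * alpha + prior_h1 * cost_type2 * beta

res = optimize.minimize_scalar(expected_cost, bounds=(1e-4, 0.5), method="bounded")
print(f"cost-minimizing alpha ~ {res.x:.3f}")
```

With equal priors and equal error costs this reduces to minimizing the total error rate, in which case the optimal alpha shrinks as the sample size grows; that is essentially the recommendation by Gannon and colleagues discussed below.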

Manski (2019) provides a nice overview of statistical decision theory, and suggests using it instead of NHST. I would love to see this used more, but personally, I would not know how to implement it in practice in my research.

Gannon et al. (2019) join the choir of people who say we need to keep using p-values, but stress the need to choose alpha levels in a smarter way. They write “This article argues that researchers do not need to completely abandon the p-value, the best-known significance index, but should instead stop using significance levels that do not depend on sample sizes.” This is a solid recommendation, but it is a recommendation to justify alpha levels (Lakens et al., 2018), not a suggestion to move beyond p-values. It uses the simplest form of statistical decision theory, where we try to minimize errors, but consider all errors equally costly.

Despite the misleading title of ‘Abandon Statistical Significance’, McShane and colleagues (McShane et al., 2019) also propose the use of statistical decision theory to make dichotomous claims. They write: “While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.” This proposal would still be based on p-values (or a comparable critical value) and dichotomous decisions.

To be continued in Part 2 (to be posted later this week)…



23 thoughts on “Guest Post: Daniël Lakens, ‘How were we supposed to move beyond p < .05, and why didn’t we?’ (part 1 of 2)”

  1. Daniel:

    Thank you so much for the excellent blogpost reflecting in detail on many of the proposals in Wasserstein (2019). I was an invited speaker at the conference that gave rise to the special issue of TAS, but since the deadline was in 2018, I couldn’t contribute, as I was completing my book SIST (2018). The deadline then got very delayed, but in retrospect, it’s a good thing I wasn’t a contributor to it. I did write a short response to the ASA 2016 policy document: “Don’t throw out the error control baby with the bad statistics bathwater.”

    https://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/

    David Hand and I wrote a longish paper (in Synthese) in reaction to the “abandon significance” call by Wasserstein et al.: “Statistical Significance and its Critics: Practicing Damaging Science or Damaging Scientific Practice”

    https://errorstatistics.com/2022/05/22/d-mayo-d-hand-statistical-significance-and-its-critics-practicing-damaging-science-or-damaging-scientific-practice/

    It’s too bad authors weren’t warned that the introduction to the special volume would be declaring, “we take that step here” (don’t say significance) and claiming to have obtained grounds for doing so from the articles in the issue. Authors might have wanted to qualify their position on that matter (Jean and I counted around 6 or 7 of the 44 in agreement, though I forget the precise number).

    The suggestions to look at range predictions and effect sizes are, as you know, already accommodated with equivalence tests, severity, attained power, and a reinterpretation of confidence distributions. It’s amazing to me that any scientists would assume they are limited to nil nulls, which are highly artificial, although apt for (what David Cox calls) strong RCT tests.

    The central topic that needs to be addressed by those proposing alternatives to statistical significance is biasing selection effects, selective inference, and the like; I’m not sure from your review whether these came up. Suggesting p-values be replaced by measures that obey the likelihood principle, and thus, do not control error probabilities, overlooks the most important job for which scientists look to statistical significance tests: distinguishing genuine effects from chance. Most statisticians, even Bayesians, will try to sneak in p-values, because otherwise they cannot test their models. Thus, their proposed alternatives, if they pretend not to require reliance on p-values, are disingenuous.

     Here’s a link to a paper I recently wrote on “Error statistics, Bayes factor tests, and the fallacy of non-exhaustive alternatives”. Comments from readers are very welcome.

    https://osf.io/preprints/osf/tmgqd

    I will have a lot more to say about part 2 of your post. I might make my remarks a separate blogpost. I’d had the idea to do so after listening to your interesting recent talk at that “foundations of applied statistics” conference.

    Thanks again for all of your work on this.

    Mayo

  2. John Park

    Much ink has been spilled trying to erase p-values. The alternatives nicely listed here in this post either use p-values and/or use suggestions that perhaps just mildly improve current practice at the cost of unfamiliarity and complexity. As a clinician, I’d rather use what we know than some strange new statistical test that offers little to no improvement – better to deal with the devil we know…

    • Thanks for your comment John. I’ve been very relieved to find the importance still placed on achieving statistical significance in assessing the evidence from clinical trials of new treatments that are important for a family member. I was afraid Wasserstein might have convinced some at FDA to “abandon statistical significance”, just when it really matters, personally. When people try to sell us on their latest methods that ignore error probabilities, allow optional stopping (given strong beliefs in a drug’s effectiveness), downplay or ignore whether predesignated thresholds are met, etc. etc. we should hold off on buying. The general public needs to become “stat activists”.

  3. Thanks for this very informative posting. I had read a good number of these but not all of them and maybe not as thoroughly as you did.

    I have a comment partly already mentioned by Mayo. You write: “This recommendation (…) solves most of the criticisms people have of the current use of p-values.” I don’t think so. It addresses one major issue (ignorance of effect sizes), but there are many more, such as issues with multiple testing, encouragement of binary thinking, incentivisation of “fishing for significance” and selective reporting, issues with model assumptions, issues with combination of evidence from different trials or sources, issues with sequential decision making, misinterpretation of p-values as probabilities of the truth of the null hypothesis (and the Bayesian claim that this is what really matters to the practitioner), violation of the likelihood principle, and probably even more. I’m not saying all of these are in fact problems all the time, and for sure, whereas a good number of proposed alternative approaches address one issue, some issues are not addressed by any existing approach, and some may be impossible to address. In my discussion of Grünwald et al.’s paper on e-values I criticise the term “safe testing”, because they address one of these issues but do not help regarding several others. I think we should generally refrain from “solutionist” marketing claims. Much trouble is here to stay.

    • lakens

      Hi Christian, whether or not it solves most of the criticisms depends a little on how fine-grained we make the criticisms.

      I think most of the things you list are problems, but they have nothing to do with NHST and p-values. If you read Bacon’s Novum Organum, he already complained in 1620 about confirmation bias. Now we do it with p-values, but they are not the cause – human biases are the cause. It is very important to not make causal claims about what p-values or NHST lead to unless you can convincingly claim they are the actual cause! Too common a mistake, regrettably.

      Some other things, like the criticism about violating the likelihood principle, are indeed not solved. But range predictions solve more things than you mention. They solve the criticism that the null is never true, they solve the criticism that practically insignificant effects can be statistically significant, they solve the problem that tests are a mindless ritual, etc. If this boils down to “most” or “many”, we will have to decide while counting all criticisms over a beer one day 😉

      • “They solve the criticism that the null is never true” – not quite, as models are never precisely true anyway, neither a point null nor an interval null. However, once more, you’d be right saying that this is not a problem specific to p-values.

  4. I have two comments on the above post.

    First, I disagree with the claim that the title of our paper, “Abandon Statistical Significance,” is “misleading.” We indeed recommend abandoning statistical significance, which is a procedure for dichotomizing data summaries using tail-area probabilities.

    The fact that we make decisions (indeed, an entire chapter of our Bayesian Data Analysis book is devoted to decision analysis) does not in any way imply that we recommend making decisions based on statistical significance.

    Second, and relatedly, I disagree with the claim that our recommendations for decision analysis “would still be based on p-values (or a comparable critical value).” Not at all. A Bayesian decision analysis based on expected costs and benefits does not use any tail-area probabilities, nor does it use any comparable threshold based on data alone. Assessed costs and benefits are a key part of the decision.

    I’m not saying that Bayesian decision analysis is the only way to go or that it is always better than decision making based on statistical significance. We indeed recommend abandoning statistical significance, but we recognize that people have lots of reasons for using the methods that they use, and I could well believe that in many cases an existing approach, even one that I don’t like, could work just fine, while a new approach, even one that I like, could backfire. A lot depends on context. The point of this comment is just to clarify that: (1) the title of our paper is accurate, and (2) we do not recommend making decisions based on p-values or a comparable critical value. If you want to disagree with our prescriptions, go for it; just describe them accurately.

    • I’m sure that Lakens will clarify his remark, but I agree that (sadly, from my perspective) the paper in which Gelman is a joint author does follow through on its promise to declare that we should “abandon significance”, quite in contrast to the Gelman of a decade or so ago who co-wrote: “What we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model” (Gelman and Shalizi 2013, 20). Indeed, the 2019 paper Gelman co-authored was one of the 5 or 6 of 43 that endorsed abandoning significance, as recommended in the editorial introduction (Wasserstein et al., 2019) to the special TAS volume.* Incidentally, a disclaimer was added to that editorial 2 years ago, to try to avoid its being regarded – erroneously – as official ASA policy.
      https://errorstatistics.com/2022/06/15/too-little-too-late-the-dont-say-significance-editorial-gets-a-disclaimer/

      *Gelman will use (Bayesian) p-values for model checking.

      Gelman and Shalizi 2013 wrote: “Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” (ibid., p. 10).

      • lakens

        The recommendations in ‘Abandon Statistical Significance’ are intentionally vague. The authors do not want to propose any new procedure that can be mindlessly misused. It is therefore difficult to pin down what they end up recommending. But they acknowledge decisions need to be made. They write “Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.” I think that if I saw the authors do this in practice, any decision they make would end up being a dichotomous claim based on a critical value, and I would be able to recompute that critical value into a p-value. I am happy to be proven wrong, and all the authors need to do is to link me to a real life example where this is not the case.

        Of course that threshold is based on costs and benefits – so is the decision based on well-performed frequentist hypothesis testing, where we set the alpha and beta based on the cost of errors (see Maier & Lakens, 2022).

        I tried to figure out what Andrew’s recommendations are, beyond “statistics is hard”. I read his PNAS letter that he uses as an example of a scientific contribution which uses what he would consider a good statistical approach, but there were no details in the letter about the statistics at all. The technical report cited is just pictures and nothing else: https://statmodeling.stat.columbia.edu/wp-content/uploads/2017/03/auerbach_gelman_23_mar_2017_smallest.pdf It states “Details on model and estimation will be provided in a forthcoming paper.” Can you provide a link to that paper?

        Until I see more clearly what your recommendation boils down to in real life, I will stick to my evaluation that the title of your paper is misleading.

        • Lakens:

          1. You write, “The recommendations in ‘Abandon Statistical Significance’ are intentionally vague.” I have no idea how you, as a psychologist, think you can figure out our intentions.
          2. You ask for an example where we make decision recommendations that are not based on p-values. It’s not hard to find such examples! Classical Bayesian decision theory from the 1940s onwards is about computing expected utility by averaging over uncertainties. Tail-area probabilities just don’t come up. There are lots of textbooks on the topic. We have some examples in chapter 9 of Bayesian Data Analysis, third edition. An applied example from one of my research papers is here: http://stat.columbia.edu/~gelman/research/published/lin.pdf Again, this is standard decision analysis, not controversial in any way.
          3. If you are interested in my recommendations (with coauthors) regarding statistical practice, you could start with our books, Bayesian Data Analysis and Regression and Other Stories, which are full of methodological detail and applied examples. Both these books are available for free download online. I also have a bunch of research articles here: http://stat.columbia.edu/~gelman/research/published/, many of which directly address applied problems and very few of which look at statistical significance.

          Again, I recognize that different methods will be useful in different settings, and I’m not saying that you or anyone else is necessarily getting bad conclusions when using statistical significance to design and analyze experiments. In our paper, we recommend abandoning statistical significance (not misleadingly; we really do recommend that!); at the same time, we recognize that others will continue to use methods that we consider flawed. I get that you do not want to abandon statistical significance, and, given that, it makes sense for you and others to do research to figure out how best to use the tools that you want to use.

        • Let me put it another way that might be helpful.

          I am not trying to convince you to change your view, which I take to be that significance testing using p-values is a good idea for statistical analysis. Indeed, there are many good arguments that you and others have made in favor of this approach, including:

          • Lots of good research has been done using significance testing and p-values. These are methods that have stood the test of time.
          • Alternative methods of analysis are more complicated.
          • You and others have experience working with significance testing using p-values, and it makes sense to keep using what you know.
          • The methods that I use in pharmacology, political science, etc., might not be so appropriate in psychology.
          • Our methods might work for us, but we actually could do just as well by judicious use of significance testing and p-values.

          All of these are potentially good reasons, and I’m sure you have others. I’d disagree, but that’s fine. Different people have different experiences, perspectives, and goals. It makes sense that there are disagreements.

          In your post and comments above, though, you’re taking the position that we’re not really recommending abandoning statistical significance, which is an offbeat position to take, given the content of our paper and also our track records as researchers. Really you can make lots of strong arguments without needing to mischaracterize our paper, our statistical recommendations, or the applied statistics that we do.

      • Deborah:

        Indeed, I no longer want to use p-values. In that way, both my recommendations and my practice have changed since what I was saying and doing 30 years ago. For more on this, see the section, “My thinking has changed,” in this post from last year: https://statmodeling.stat.columbia.edu/2023/04/14/4-different-meanings-of-p-value-and-how-my-thinking-has-changed/ (Actually, that post is relevant to our discussion here and I’d be happy if you wanted to repost it on your blog in the context of the current series of posts.)

        I’m still interested in model checking (“pure significance testing,” “error probes,” “severe testing,” etc.) but I’m no longer doing this or recommending to do this using p-values.

        I remain interested in p-values, not so much for their own sake but because they are often used in practice so it is good to understand them. For example, in this paper with Zwet et al. (http://stat.columbia.edu/~gelman/research/published/pval_RCTs_rev3.pdf), we give an empirical interpretation of p-values, not conditional on the null hypothesis of zero effect but averaging over a distribution of effect sizes estimated from a corpus of medical studies.

        Also relevant is the following sentence from section 4 of my paper with McShane et al. (http://stat.columbia.edu/~gelman/research/published/abandon.pdf):

        “Even in pure research scenarios where there is no obvious cost-benefit calculation—for example, a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.”

        Actually I recommend that people read all of section 4 of that article. I pretty much stand by all of it today.

  5. Thanks for summarizing these viewpoints. I just wanted to share a few points for now.

    1. Are you familiar with “Norman’s Law” (Medical Care, lww.com)? The idea is that patients can start to detect an effect when it is equal to 1/2 of a standard deviation of a test measure. So, if my construct has a normed SD of 10, I can set the MCID at 5.
    2. I want to bring up something controversial here. I also brought this up in response to Christian’s post – he disagreed and I still need to respond to that. But my feeling with respect to statistical decision theory is that the 0.05 threshold is often too harsh. Here’s an example. I recently read something like the following sentence:
      • “Although the effect was not statistically significant (p = 0.07), it is important for policymakers to understand that the estimated effect was positive”.
      • One way to interpret this sentence is that they are not playing the NHST game fairly. If we cannot distinguish between signal and noise, why should policymakers treat this study as evidence that the intervention worked?
      • Another interpretation that I’ve heard is that a non-significant result means that we must act as if the intervention is not effective. I don’t endorse this either!
      • What I think happened here is a misapplication of statistical decision theory. The intervention in the study was very sensible. There are serious costs to patients from a Type II error. Power was limited by the cluster randomization design. I think this study would have been an ideal candidate to increase alpha to 0.1 or maybe even higher.
      • When I see people write “it didn’t meet the threshold but …. ” I think they are using a default threshold that is too harsh for the situation. It may be a sub-optimal interpretation, but it could be better for society if researchers allow for a “fudge factor” in cases like this.
    • lakens

      Hi Henry, I know about Norman’s law (I just taught about it yesterday) and I think it makes sense in psychology – individuals do not directly notice many effects, unless they are large enough.

      Decision theory is a useful approach and should be applied more where it can. In Maier and Lakens (2022) we discuss how to implement this by setting a smallest effect of interest, and then carefully choosing the alpha and beta error. When I discuss frequentist statistics, I always assume best practice, which means thinking about the error rates of the testing procedure.

    • I don’t think I disagreed in the earlier exchange with what you say here. What I wrote was sceptical about making very general prescriptions (not even for specific subject areas), whereas there may well be situations where I agree with you, and it may well be those you have in mind.

  6. David Colquhoun

    Thank you for this summary. When I wrote “It is suggested that p-values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p-value”, my intention was not to advocate the continued use of p values.

    Rather, my hope was that, by citing both p values and a measure of the false positive risk, people would come to realise the weakness of the evidence provided by p = 0.05 in any case where it is sensible to test a (near-) point null hypothesis. And once people realise how unreliable p values are in such cases, my hope and expectation was that they would gradually abandon reliance on p values. This, it seems to me, might be a more effective way than lecturing people to ensure that p values are eventually superseded by better measures of error.

    The paper by Benjamin and Berger proposed an even simpler measure of false positive risk than mine, and if it were adopted, the results would be similar.

    • David:

      Yes, that is the aim of the “redefiners”: to get people to reject p-values, by misleadingly presenting them as if they ought to be evaluated by an entirely different quantity – one that is half Bayesian and half frequentist. That is, they assume a “screening model” of tests, wherein one’s test hypothesis can be assumed to have been randomly selected from an urn of test hypotheses with a known (very low) proportion being “true”. Then they assume type 1 errors and power can serve as likelihoods in a Bayesian computation (highly problematic). Next, having given a very high (spike) “prior” to the test hypothesis H0, e.g., no effect, they show its “posterior probability” is still high after observing a statistically significant result. But this fails to show there’s good evidence for the test hypothesis of no effect, and low evidence for the (non-exhaustive) alternative hypothesis. It’s a well-known fallacy to construe negative results as evidence for no effect. Here, by those who defend this line of criticism of stat sig tests, a positive, i.e., statistically significant, result is being taken as evidence of no effect.

      Of course, we’ve discussed this at length before.

      • David Colquhoun

        Yes, we have indeed discussed this problem before. And I think that we need to keep talking about it until some sort of agreement is reached.

        I take it that it’s one role of statisticians to stop experimenters from making fools of themselves by claiming a discovery when in fact we are seeing only sampling error. It is therefore rather shocking that statisticians still haven’t managed to agree on the best way to answer the most basic of statistical questions: is there a real difference between the means of two independent samples? The question sounds simple but it turns out to be remarkably complicated. Nonetheless, this failure to agree must surely be the main reason why journals have failed to change their policies.

        You say “having given a very high (spike) “prior” to the test hypothesis H0, e.g., no effect, they show its “posterior probability” is still high after observing a statistically significant result.” It seems to me that, given the high failure rate in drug trials, a probability of 0.5 on H0 vs H1 cannot be described as “very high”. It seems distinctly overoptimistic to me.

        In any case, no such assumption is necessary. The fact that the likelihood ratio in favour of H1 is only about 3 if you’ve observed p=0.05 is quite sufficient to show the poor evidence provided by p=0.05, and that result doesn’t depend on assumptions about how much prior weight is put on H0.

        Also, the approaches by Berger & Sellke, and of Valen Johnson, come to similar conclusions to those that I (long after them) arrived at from simple simulations, despite the fact that their assumptions about priors were totally different from mine.

        Two quotations get to the heart of the matter for me.

        Paraphrasing Sellke et al. (2001)

        “knowing that the data are ‘rare’ when there is no true difference is of little use unless one determines whether or not they are also ‘rare’ when there is a true difference.”

        and, from Frank Harrell,

        “Do you want the probability of a positive result when there is nothing? Or the probability that a positive result turns out to be nothing”

        Most people want the latter, but the former is what a p value tells you about.

        • David:
          Taking the fact that a chosen likelihood ratio is not large enough as evidence that statistically significant results fail to provide grounds to infer genuine effects is another question-begging move, wherein it is assumed that the correct assessment of evidence against a null hypothesis is Bayesian. I say it is a “chosen” likelihood ratio, because it entirely depends on the H1 selected. In Johnson, and the popular Benjamin et al. (2017) article, although the redefiners could not themselves agree on the H1 to use, the alternatives are all non-exhaustive and, most important, they all represent effect sizes much larger than a statistical significance tester would infer. For example, where a statistical significance tester infers “the data are evidence of some positive discrepancy,” the criticism based on the likelihood ratio or Bayes factor compares the likelihood of mu = 0 vs mu = the observed sample mean mo – in their example of testing a Normal mean. To take a just stat sig result as evidence that mu > mo would be to use a procedure that errs 50% of the time. In other words, the Bayesian comparative inference to the chosen alternative H1 is much, much more extreme than what a stat sig tester would infer. Thus, it’s wrong to take a particular low Bayes factor as evidence against what stat sig testers infer, namely, evidence of some discrepancy. So the criticism is highly misleading, aside from insisting that a Bayes factor is the “right” measure. Jim Berger admits that the spiked priors used in all the examples in Benjamin et al. (2017) are appropriate only when there is high prior belief in the null (whatever that is to mean), else the p-value is a proper measure of evidence. All of this is discussed in detail in my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018).

          • David Colquhoun

            Yes, of course the result depends on the choice of H1, as it should. But if you compare H0: mu=0 with H1: mu=mu_hat, that is the comparison most favourable to rejecting H0 (insofar as mu_hat is the value with maximum likelihood). Yet H1 is only about three times as likely as H0 if you’ve observed p=0.05. An odds ratio of 3 to 1 on H1 sounds a lot less impressive than 19:1, which is how most people (mistakenly) interpret p = 0.05.

            I just can’t agree that a prior probability of 0.5 on H0 is “a high prior belief in the null”. You didn’t address the example of phase 3 drug trials, in which the failure rate is well over 50%. And to assume anything less than 50% would be to play into the hands of those who accuse Bayesian arguments of feeding prejudices into testing. As Stephen Senn has pointed out often, a (near) point null is not always a sensible thing to test, but when it is a sensible thing to test, I think that the discrepancy between p=0.05 and FPR_50 = 0.26–0.29 should really give pause for thought.

            • David:
              I’ve already addressed, in my last comment, the fallacy of comparing the statistical significance tester’s inference from a just stat sig result – namely, there’s evidence of a positive discrepancy – with evidence for the max likely alternative. To infer the latter would be to follow a policy with error prob .5. That’s why J. Berger says that the p-value correctly assesses evidence when you don’t have strong belief in a point null. Of course, testing the point null Bayesianly is highly artificial to begin with. Moreover, the very supposition that p-values need to match posterior probabilities – which are measuring very different things – is wrongheaded. We’ve had this conversation too many times to repeat it here. To readers: please search “p-values exaggerate” or “p-values overstate” on this blog, e.g., https://errorstatistics.com/2017/01/19/the-p-values-overstate-the-evidence-against-the-null-fallacy-2/

  7. Gelman put up a post on his blog that is intended to clarify the comment he makes to Lakens on this blog.

    https://statmodeling.stat.columbia.edu/2024/07/10/a-misunderstanding-about-decision-analysis-and-significance-testing/

