In what began as a guest commentary on my 2021 editorial in Conservation Biology, Daniël Lakens recently published a response to a recommendation against using null hypothesis significance tests by journal editors from the International Society of Physiotherapy Journal Editors. Here are some excerpts from his full article, the editors’ reply (‘response to Lakens’), links, and a few comments of my own.
“….The editors list five problems with p values. First, they state that ‘p values do not equate to a probability that researchers need to know’ because ‘Researchers need to know the probability that the null hypothesis is true given the data observed in their study.’ Regrettably, the editors do not seem to realise that only God knows the probability that the null hypothesis is true given the data observed, and no statistical method can provide it. Estimation will not tell you anything about the probability of hypotheses. Their second point is that a p values does not constitute evidence. Neither do estimates, so their proposed alternative suffers from the same criticism. Third, the editors claim that significant results have a low probability of replicating, and that when a p value between 0.005 and 0.05 is observed, repeating this study would only have a 67% probability of observing a significant result. This is incorrect. The citation to Boos and Stefanski is based on the assumption that multiple p values are available to estimate the average power of the set of studies, and that the studies will have 67% power. It is not possible to determine the probability a study will replicate based on a single p value. Furthermore, well-designed replication studies do not use the same sample size as an earlier study, but are designed to have high power for an effect size of interest. Fourth, the editors argue, without any empirical evidence, that in most clinical trials the null hypothesis must be false. The prevalence of null results make it doubtful this statement is true in any practical sense. In an analysis of 11,852 meta-analyses from Cochrane reviews, only 5,903 meta-analyses, or 49.8%, found a statistically significant meta-analytic effect. …Finally, the fifth point that ‘Researchers need to know more than just whether an effect does or does not exist’ is correct, but the ‘more than’ is crucial. 
It remains important to prevent authors from claiming there is an effect, when they are actually looking at random noise, and therefore, effect sizes complement, but do not replace, hypothesis tests.
The editors recommend to use estimation, and report confidence intervals around estimates. But then the editors write ‘The estimate and its confidence interval should be compared against the “smallest worthwhile effect” of the intervention on that outcome in that population.’ ‘If the estimate and the ends of its confidence interval are all more favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population. If the effect and its confidence interval are less favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered trivial by patients in that clinical population.’ This is not a description of an estimation approach. The editors are recommending a hypothesis testing approach against a smallest effect size of interest. When examining if the effect is more favorable than the smallest effect size of interest, this is known as a minimum effect test. When examining whether an effect is less favorable than the smallest effect size of interest, this is known as an equivalence test. Both are examples of interval hypothesis tests, where instead of comparing the observed effect against 0 (as in a null-hypothesis test) the effect is compared against a range of values that are deemed theoretically or practically interesting…. Forcing researchers to prematurely test hypotheses against a smallest effect size of interest, before they have carefully established a smallest effect size of interest, will be counterproductive.”
These are excellent points, and I would note a few others (even aside from their misdefining p-values). The problem I have with the complaint that p-values aren’t posterior probabilities is not that only God knows–presumably God would know if a claim was true or if an event would occur, or not. The problem is that no one knows how to obtain, interpret, and justify the required prior probability distributions (except in special cases of frequentist priors), and there’s no agreement as to whether they should be regarded as measuring belief (and in what?), or as supplying “default” (data-dominant) priors (obtained from one of several rival systems).
However, in their response to Lakens, the editors aver that “there is little point in knowing the probability that the null hypothesis is true”. That’s because they assume it is always false! They fall into the fallacy of thinking that because no models are literally true claims about the world–that’s why they’re called ‘models’–and with enough data a discrepancy from a point null, however small, may be found, it follows that all nulls are false. This is false. (Read proofs of Mayo 2018, Excursion 4 tour IV here). (It would follow, by the way, that all the assumptions needed to get estimations off the ground are false.) Ironically, the second most famous criticisms of statistical significance tests rest on assuming there is a high (spike) prior probability that the null hypothesis is true. (For a recent example–though the argument is quite old–see Benjamin 2018). Thus, many critics of statistical significance tests agree in retiring tests, but disagree as to whether it’s because the null hypothesis is always false or probably true!
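The large-sample point above can be sketched numerically. The following is only an illustration under assumptions not in the original discussion: a two-sided z-test of a point null μ = 0 with known standard deviation, and hypothetical values for the discrepancy and sample sizes. It shows that a tiny nonzero discrepancy is detected almost surely once n is large enough, while a null that is actually true is still rejected only at the rate α, whatever the sample size.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def rejection_probability(delta, sigma, n, alpha=0.05):
    """Probability that a two-sided z-test of H0: mu = 0 rejects at level
    alpha when the true mean is delta (known sigma, sample size n)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # true mean in standard-error units
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical values chosen only for illustration.
tiny_discrepancy = 0.01
sigma = 1.0

for n in (100, 10_000, 1_000_000):
    p_tiny = rejection_probability(tiny_discrepancy, sigma, n)
    p_null = rejection_probability(0.0, sigma, n)
    print(f"n={n:>9}: P(reject | delta=0.01)={p_tiny:.3f}, "
          f"P(reject | delta=0)={p_null:.3f}")
```

The second column of probabilities does not move with n: the ability of a large sample to expose small discrepancies says nothing about whether a given null is false.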
Interestingly, “probable” and “probability” come from the Latin probare, meaning to try, test, or prove. “Proof” as in “The proof is in the pudding” refers to how well you put something to the test. You must show or provide good grounds for the claim, not just believe strongly.* If we used “probability” this way, it would be very close to my idea of measuring how well or severely tested (or how well shown) a claim is. I discuss this on p. 10 of Mayo (2018), which you can read here. But it’s not our current, informal English sense of probability, as varied as that can be. (I recall that Donald Fraser (2011) blamed this on Dennis Lindley). In any of the currently used meanings, claims can be strongly believed or even known to be true while being poorly tested by data x.
I don’t generally agree with Stuart Hurlbert of the statistics wars, but he has an apt term for the strident movement to insist that only confidence intervals (CIs) be used, no tests of statistical hypotheses: the CI crusade. The problem I have with “confidence interval crusaders” is that while supplementing tests with confidence intervals (CIs) is good (at least where the model assumptions hold adequately), the “CIs only” crusaders advance the misleading perception that the only alternative is the abusive use of tests which fallaciously take statistical significance as substantive importance and commit all the well-known, hackneyed fallacies. The “CIs only” crusaders get their condemnation of statistical significance tests off the ground only by identifying them with a highly artificial point null hypothesis test, as if Neyman and Pearson tests (with alternative hypotheses, power etc.) never existed (let alone the variations discussed by Lakens). But Neyman developed CIs at the same time he and Pearson developed tests. There is a direct duality between tests and intervals: a confidence interval (CI) at level 1 – c consists of parameter values that are not statistically significantly different from the data at significance level c. You can obtain the lower CI bound (at level 1 – c) by asking: what parameter value is the data statistically significantly greater than at level c? Lakens is correct that the procedure the editors describe, checking for departures from a chosen value for a “smallest worthwhile effect,” is more properly seen as a test.
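The duality just described can be checked with a minimal sketch. It assumes a normal model with a known standard error, and all the numbers (the estimate, the standard error, and the “smallest worthwhile effect”) are hypothetical values chosen for illustration. The sketch checks three things: a value lies in the two-sided 1 – c interval exactly when a two-sided test at level c does not reject it; the one-sided lower bound at level 1 – c is the value the data are just significantly greater than at level c; and the editors’ rule “the whole interval is more favourable than the smallest worthwhile effect” coincides with a one-sided test against that value at level c/2.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_p(estimate, se, mu0):
    """Two-sided p-value for H0: mu = mu0 (normal model, known se)."""
    z = (estimate - mu0) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def one_sided_p_greater(estimate, se, mu0):
    """p-value for H0: mu <= mu0 against H1: mu > mu0."""
    return 1 - STD_NORMAL.cdf((estimate - mu0) / se)


def confidence_interval(estimate, se, c=0.05):
    """Two-sided (1 - c) interval: estimate +/- z_{1 - c/2} * se."""
    z_crit = STD_NORMAL.inv_cdf(1 - c / 2)
    return estimate - z_crit * se, estimate + z_crit * se


def one_sided_lower_bound(estimate, se, c=0.05):
    """Lower confidence bound at level 1 - c: estimate - z_{1 - c} * se."""
    return estimate - STD_NORMAL.inv_cdf(1 - c) * se


# Hypothetical numbers, for illustration only.
estimate, se, c = 5.0, 1.0, 0.05
smallest_worthwhile_effect = 2.5

lower, upper = confidence_interval(estimate, se, c)

# Duality: mu0 is inside the interval iff the two-sided test does not reject it.
duality_checks = []
for mu0 in (2.0, 3.0, 3.5, 6.5, 7.0, 8.0):
    inside = lower <= mu0 <= upper
    not_rejected = two_sided_p(estimate, se, mu0) >= c
    duality_checks.append(inside == not_rejected)

# The one-sided lower bound is the value whose one-sided p-value equals c.
lower_bound = one_sided_lower_bound(estimate, se, c)
p_at_lower_bound = one_sided_p_greater(estimate, se, lower_bound)

# "Whole interval above the smallest worthwhile effect" is a one-sided test.
ci_above_swe = lower > smallest_worthwhile_effect
test_rejects = one_sided_p_greater(estimate, se, smallest_worthwhile_effect) < c / 2

print(all(duality_checks), ci_above_swe == test_rejects)
```

Under these assumptions the interval comparison and the test give the same verdict, which is why the editors’ procedure is more naturally read as a test against a chosen value than as estimation alone.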
My own preferred reformulation of statistical significance tests–in terms of discrepancies (from a reference) that are well or poorly indicated–is in the spirit of CIs, but improves on them. It accomplishes the goal of CIs (to use the data to infer population effect sizes), while providing them with an inferential, and not merely a “performance” justification. The performance justification of a particular confidence interval estimate is merely that it arose from a method with good performance in a long run of uses. The editors to whom Lakens is replying deride this long-run frequency justification of tests, but fail to say how they get around this justification in using CIs. The erroneous construal of the confidence level as a probability the specific estimate is correct is encouraged. What confidence level should be used? The CI advocates typically stick with .95, but it is more informative to consider several different levels, mathematically akin to confidence distributions. Looking at estimates associated with low confidence levels, e.g., .6, .5, .4 is quite informative, but we do not see this. Moreover, the confidence interval estimate does not tell us, for a given parameter value μ’ in the interval estimate, the answer to questions like: how well (or poorly) warranted is μ > μ’? Again, it’s the poorly warranted (or inseverely tested) claims that are most informative, in my view. And unlike what some seem to think, estimation procedures require the same or more assumptions than do statistical significance tests. That is why simple significance tests (without alternatives) are relied on to test assumptions of statistical models used for estimation.
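To make the idea of several confidence levels and the question “how well warranted is μ > μ’?” concrete, here is a rough sketch. It assumes the simplest textbook setting (a one-sided test about a normal mean with known standard error), in which the severity for the claim μ > μ’ is computed as the probability of a sample mean no larger than the one observed, were μ equal to μ’. The observed mean and standard error are hypothetical, and the function names are mine, introduced only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def lower_confidence_bound(x_bar, se, level):
    """One-sided lower bound for mu at the given confidence level:
    x_bar - z_level * se (normal model, known se)."""
    return x_bar - STD_NORMAL.inv_cdf(level) * se


def severity_mu_greater(x_bar, se, mu_prime):
    """Severity for the claim mu > mu_prime, given observed mean x_bar:
    P(sample mean <= x_bar; mu = mu_prime) under the normal model."""
    return STD_NORMAL.cdf((x_bar - mu_prime) / se)


# Hypothetical observed mean and standard error, for illustration only.
x_bar, se = 0.4, 0.2

# Lower bounds at several confidence levels, not just .95.
bounds = {
    level: lower_confidence_bound(x_bar, se, level)
    for level in (0.95, 0.6, 0.5, 0.4)
}
for level, bound in bounds.items():
    print(f"level {level:.2f}: mu > {bound:.3f}")

# How well warranted is mu > mu' for a range of values mu'?
severities = {
    mu_prime: severity_mu_greater(x_bar, se, mu_prime)
    for mu_prime in (0.0, 0.2, 0.4, 0.6)
}
for mu_prime, sev in severities.items():
    print(f"SEV(mu > {mu_prime:.1f}) = {sev:.3f}")
```

In this setting the severity for μ > μ’, with μ’ set to the lower bound at a given level, equals that level, so reporting bounds at .6, .5 and .4 directly flags the claims that are poorly warranted. For instance, the claim that μ exceeds the observed mean itself has severity .5.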
Please share your constructive comments.
 D. Fraser’s (2011) “Is Bayes posterior just quick and dirty confidence?”
 Hurlbert and Lombardi 2009, p. 331. While I agree with their rejection of the “CIs only” crusaders, I reject their own crusade pitting Fisherian tests against Neyman and Pearson tests, rejecting the latter.
 For a very short discussion of how the severity reinterpretation of statistical tests of hypotheses connects and improves on CIs, see the Appendix of Mayo 2020. For access to the proofs of the entire book see excerpts on this blog.
*July 4, 2022 See Grieves’ comment and my reply on the roots of these words.
Hurlbert, S. and Lombardi, C. (2009). ‘Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the NeoFisherian’, Annales Zoologici Fennici 46, 311–49.
Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press.