In what began as a guest commentary on my 2021 editorial in *Conservation Biology*, Daniël Lakens recently published a response to a recommendation against using null hypothesis significance tests by journal editors from the International Society of Physiotherapy Journal Editors. Here are some excerpts from his full article, replies (‘response to Lakens’), links, and a few comments of my own.

“….The editors list five problems with p values. First, they state that ‘p values do not equate to a probability that researchers need to know’ because ‘Researchers need to know the probability that the null hypothesis is true given the data observed in their study.’ Regrettably, the editors do not seem to realise that only God knows the probability that the null hypothesis is true given the data observed, and no statistical method can provide it. Estimation will not tell you anything about the probability of hypotheses. Their second point is that a p value does not constitute evidence. Neither do estimates, so their proposed alternative suffers from the same criticism. Third, the editors claim that significant results have a low probability of replicating, and that when a p value between 0.005 and 0.05 is observed, repeating this study would only have a 67% probability of observing a significant result. This is incorrect. The citation to Boos and Stefanski is based on the assumption that multiple p values are available to estimate the average power of the set of studies, and that the studies will have 67% power. It is not possible to determine the probability a study will replicate based on a single p value. Furthermore, well-designed replication studies do not use the same sample size as an earlier study, but are designed to have high power for an effect size of interest. Fourth, the editors argue, without any empirical evidence, that in most clinical trials the null hypothesis must be false. The prevalence of null results make it doubtful this statement is true in any practical sense. In an analysis of 11,852 meta-analyses from Cochrane reviews, only 5,903 meta-analyses, or 49.8%, found a statistically significant meta-analytic effect. …Finally, the fifth point that ‘Researchers need to know more than just whether an effect does or does not exist’ is correct, but the ‘more than’ is crucial.
It remains important to prevent authors from claiming there is an effect, when they are actually looking at random noise, and therefore, effect sizes complement, but do not replace, hypothesis tests.

The editors recommend to use estimation, and report confidence intervals around estimates. But then the editors write ‘The estimate and its confidence interval should be compared against the “smallest worthwhile effect” of the intervention on that outcome in that population.’ ‘If the estimate and the ends of its confidence interval are all more favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population. If the effect and its confidence interval are less favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered trivial by patients in that clinical population.’ This is not a description of an estimation approach. The editors are recommending a hypothesis testing approach against a smallest effect size of interest. When examining if the effect is more favorable than the smallest effect size of interest, this is known as a minimum effect test. When examining whether an effect is less favorable than the smallest effect size of interest, this is known as an equivalence test. Both are examples of interval hypothesis tests, where instead of comparing the observed effect against 0 (as in a null-hypothesis test) the effect is compared against a range of values that are deemed theoretically or practically interesting…. Forcing researchers to prematurely test hypotheses against a smallest effect size of interest, before they have carefully established a smallest effect size of interest, will be counterproductive.”
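To make the two interval tests Lakens names concrete, here is a minimal sketch. It is my own illustration, not from Lakens or the editorial: the numbers, the “smallest worthwhile effect” of 0.5, and the normal approximation are all assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimum_effect_p(est, se, delta):
    """One-sided p value for H0: effect <= delta vs H1: effect > delta.
    A small p supports an effect larger than the smallest worthwhile effect."""
    return 1.0 - norm_cdf((est - delta) / se)

def equivalence_p(est, se, delta):
    """TOST p value for H0: |effect| >= delta vs H1: |effect| < delta.
    A small p supports an effect smaller than the smallest worthwhile effect."""
    p_vs_lower = 1.0 - norm_cdf((est + delta) / se)  # against effect <= -delta
    p_vs_upper = norm_cdf((est - delta) / se)        # against effect >= +delta
    return max(p_vs_lower, p_vs_upper)

# Hypothetical trial: estimate 0.2, standard error 0.1, smallest worthwhile effect 0.5
print(minimum_effect_p(0.2, 0.1, 0.5))  # large: no evidence the effect exceeds 0.5
print(equivalence_p(0.2, 0.1, 0.5))     # small: evidence the effect is below 0.5
```

Both are ordinary one-sided tests shifted away from zero, which is Lakens’ point: comparing an estimate and its confidence interval against a smallest worthwhile effect is hypothesis testing, not a test-free “estimation approach”.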

These are excellent points, and I would note a few others (even aside from their misdefining p-values). The problem I have with the complaint that p-values aren’t posterior probabilities is not that only God knows–presumably God would know if a claim was true or if an event would occur, or not. The problem is that no one knows how to obtain, interpret, and justify the required prior probability distributions (except in special cases of frequentist priors), and there’s no agreement as to whether they should be regarded as measuring belief (and in what?), or as supplying “default” (data-dominant) priors (obtained from one of several rival systems).

However, in their response to Lakens, the editors aver that “there is little point in knowing the probability that the null hypothesis is true”. That’s because they assume it is always false! They fall into the fallacy of thinking that because no models are literally true claims about the world–that’s why they’re called ‘models’–and because with enough data a discrepancy from a point null, however small, may be found, it follows that all nulls are false. It doesn’t follow. (Read proofs of Mayo 2018, Excursion 4 Tour IV here.) (It would follow, by the way, that all the assumptions needed to get estimation off the ground are false.) Ironically, the second most famous criticism of statistical significance tests rests on assuming there is a high (spike) prior probability that the null hypothesis is true. (For a recent example–though the argument is quite old–see Benjamin et al. 2018.) Thus, many critics of statistical significance tests agree on retiring tests, but disagree as to whether it’s because the null hypothesis is always false or probably true!

Interestingly, “probable” and “probability” come from the Latin *probare*, meaning to try, test, or prove. “Proof,” as in “The proof is in the pudding,” refers to how well you put something to the test. You must show or provide good grounds for the claim, not just believe strongly.* If we used “probability” this way, it would be very close to my idea of measuring how well or severely tested (or how well shown) a claim is. I discuss this on p. 10 of Mayo (2018), which you can read here. But it’s not our current, informal English sense of probability, as varied as that can be. (I recall that Donald Fraser (2011) blamed this on Dennis Lindley)[1]. In any of the currently used meanings, claims can be strongly believed or even known to be true while being poorly tested by data **x**.

I don’t generally agree with Stuart Hurlbert on the statistics wars, but he has an apt term for the strident movement insisting that only confidence intervals (CIs) be used, and no tests of statistical hypotheses: the CI crusade.[2] The problem I have with “confidence interval crusaders” is that while supplementing tests with CIs is good (at least where the model assumptions hold adequately), the “CIs only” crusaders advance the misleading perception that the only alternative is the abusive use of tests that fallaciously take statistical significance as substantive importance and commit all the well-known, hackneyed fallacies. The “CIs only” crusaders get their condemnation of statistical significance tests off the ground only by identifying them with a highly artificial point null hypothesis test, as if Neyman and Pearson tests (with alternative hypotheses, power, etc.) never existed (let alone the variations discussed by Lakens). But Neyman developed CIs at the same time he and Pearson developed tests. There is a direct duality between tests and intervals: a confidence interval at level 1 – c consists of the parameter values that are not statistically significantly different from the data at significance level c. You can obtain the lower CI bound (at level 1 – c) by asking: what parameter value is the data statistically significantly greater than at level c? Lakens is correct that the procedure the editors describe, checking for departures from a chosen value for a “smallest worthwhile effect,” is more properly seen as a test.
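The duality is easy to verify numerically. A minimal sketch (a hypothetical estimate and standard error, with a known-σ normal model assumed): the two-sided p value computed at either bound of a 1 – c interval comes out exactly c.

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def conf_int(est, se, level=0.95):
    """The set of mu0 NOT rejected by a two-sided z test at alpha = 1 - level."""
    z = nd.inv_cdf(1.0 - (1.0 - level) / 2.0)
    return est - z * se, est + z * se

def p_two_sided(est, se, mu0):
    """Two-sided p value for H0: mu = mu0 (normal approximation)."""
    return 2.0 * (1.0 - nd.cdf(abs(est - mu0) / se))

lo, hi = conf_int(1.0, 0.5, level=0.95)
# At the lower bound the data are just significantly different at level c = 0.05:
print(round(p_two_sided(1.0, 0.5, lo), 6))  # 0.05
print(round(p_two_sided(1.0, 0.5, hi), 6))  # 0.05
```

The same inversion run in the other direction turns any test into an interval estimator, which is why the “tests vs. estimation” opposition is artificial.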

My own preferred reformulation of statistical significance tests–in terms of discrepancies (from a reference) that are well or poorly indicated–is in the spirit of CIs, but improves on them. It accomplishes the goal of CIs (to use the data to infer population effect sizes), while providing them with an inferential, and not merely a “performance,” justification. The performance justification of a particular confidence interval estimate is merely that it arose from a method with good performance in a *long* run of uses. The editors to whom Lakens is replying deride this long-run frequency justification of tests, but fail to say how they get around it in using CIs. The erroneous construal of the confidence level as the probability that the specific estimate is correct is thereby encouraged. What confidence level should be used? The CI advocates typically stick with .95, but it is more informative to consider several different levels, mathematically akin to *confidence distributions*. Looking at estimates associated with low confidence levels, e.g., .6, .5, .4, is quite informative, but we do not see this. Moreover, the confidence interval estimate does not tell us, for a given parameter value μ′ in the interval, the answer to questions like: how well (or poorly) warranted is μ > μ′?[3] Again, it’s the poorly warranted (or inseverely tested) claims that are *most informative*, in my view. And unlike what some seem to think, estimation procedures require the same or more assumptions than do statistical significance tests. That is why simple significance tests (without alternatives) are relied on to test assumptions of statistical models used for estimation.
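As a sketch of what I mean (hypothetical numbers; a one-sided, known-σ normal model is assumed): lower confidence bounds at several levels, alongside the severity assessment of μ > μ′, which equals the confidence level at which μ′ would be the lower bound.

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def lower_bound(est, se, level):
    """Lower bound of a one-sided lower confidence interval at the given level."""
    return est - nd.inv_cdf(level) * se

def severity(est, se, mu1):
    """SEV(mu > mu1): probability of a result fitting 'mu > mu1' less well than
    the observed one, computed under mu = mu1 (known-sigma normal model)."""
    return nd.cdf((est - mu1) / se)

est, se = 1.0, 0.5
for level in (0.95, 0.6, 0.5, 0.4):
    print(level, round(lower_bound(est, se, level), 3))
# How well warranted is mu > 0.5 by this estimate?
print(round(severity(est, se, 0.5), 3))  # ~0.841
```

Reading the bounds across levels is the confidence-distribution idea; the severity function answers the per-claim question a single .95 interval leaves open.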

Lakens’ article is here. He is responding to this editorial; their reply to him, which not all the initial authors signed on to, is here.

**Please share your constructive comments.**

[1] D. Fraser’s (2011) “Is Bayes posterior just quick and dirty confidence?”

[2] Hurlbert and Lombardi 2009, p. 331. While I agree with their rejection of the “CIs only” crusaders, I reject their own crusade pitting Fisherian tests against Neyman and Pearson tests, rejecting the latter.

[3] For a very short discussion of how the severity reinterpretation of statistical tests of hypotheses connects and improves on CIs, see the Appendix of Mayo 2020. For access to the proofs of the entire book see excerpts on this blog.

*July 4, 2022 See Grieves’ comment and my reply on the roots of these words.

Hurlbert, S. and Lombardi, C. (2009). ‘Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the NeoFisherian’, Annales Zoologici Fennici 46, 311–49.

Mayo, D. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018)

The journal’s response to Lakens’ editorial ends with the following, which I don’t fully agree with.

“And there is another reason not to conduct minimum effect tests: researchers who supply confidence intervals, rather than conducting minimum effect tests, devolve the responsibility of distinguishing between important and unimportant effects to their readers. Arguably that is where that responsibility should lie.”

I pretty much agree with Tunç, Tunç and Lakens that dichotomous claims are important. Yoav Benjamini has written about “selection into the abstract”. Whatever is on the first page is implicitly important, even if it isn’t labelled as statistically significant. Resisting experimenter bias is important, as you’ve pointed out.

We should write clearly and be transparent about data. Authors should own their decisions about which results are highlighted, and can make space in the method section, supplement or github archive to get into all the details.

I have pointed out previously that your etymology of probability is incorrect. It does not stem from “probare” as you claim here and in SIST.

The Oxford English Dictionary has the root as: 1) Middle French probabilité and 2) classical Latin probābilitāt-, probābilitās appearance of truth, likelihood. cf the German Wahrscheinlichkeit – wahr=true, schein=appearance.

This is also supported by Dr Johnson’s Dictionary of the English Language.

The word I sought the root of is “probable”, not “probability”. But surely they share roots. Still, both are beset by equivocation, so their roots might also shift. The etymology book I used in writing the book is in NYC, but I think it’s fairly standard (I make no claims to etymology expertise). Here’s one: probable, from the Latin probare, to test, demonstrate: https://www.google.com/search?q=etymology+of+probable&source=hp&ei=95rDYp63K7mn5NoP0Yal8As&iflsig=AJiK0e8AAAAAYsOpB7NOHFM55FWYwb-XcFM-hnW4z3VR&ved=0ahUKEwjejvma0-D4AhW5E1kFHVFDCb4Q4dUDCAo&uact=5&oq=etymology+of+probable&gs_lcp=Cgdnd3Mtd2l6EAMyCAgAEB4QDxAWMggIABAeEA8QFjIFCAAQhgM6EQguEIAEELEDEIMBEMcBENEDOg4ILhCABBCxAxCDARDUAjoICC4QgAQQsQM6CAgAELEDEIMBOgsIABCABBCxAxCDAToOCC4QgAQQsQMQxwEQ0QM6BQguEIAEOgsILhCxAxCDARDUAjoLCC4QgAQQsQMQgwE6CAgAEIAEELEDOg4ILhCABBCxAxDHARCjAjoFCAAQgAQ6DQguEIAEEMcBENEDEAo6DQgAEIAEELEDEEYQ-QE6BggAEB4QFjoKCAAQHhAPEBYQClAAWMJMYNxRaAFwAHgAgAFziAGbDpIBBDIwLjKYAQCgAQE&sclient=gws-wiz

Here’s Oxford:

https://www.oxfordlearnersdictionaries.com/us/definition/english/probable_1

late Middle English (in the sense ‘worthy of belief’): via Old French from Latin probabilis, from probare ‘to test, demonstrate’.

It would be the same root as for “probate” which seems to be a concept I’m increasingly involved with these days.

I’m reminded of a reference on p. 302 of my book SIST where I refer to Box and Jenkins (1976):

“Box and Jenkins highlight the link between ‘prove’ and ‘test’: ‘A model is only capable of being ‘proved’ in the biblical sense of being put to the test’ (ibid., p. 286). They refer to 1 Thessalonians 5:21.”

1 Thessalonians 5:21 instructs us to “test [prove, KJV] all things,” which would include our old notions, and then “hold fast” to the good ones—the ones that pass the test.

Andy:

The book I used, “Dictionary of Word Origins” is here in Virginia after all.

“Probable: Latin probare meant ‘test’, … ‘prove’ … From it was derived the adjective probabilis ‘provable’.” etc. etc.

We can see some ambiguity, but the phrases you give also go back to probare. Again, I’m no expert. Thanks for pointing this out.

Deborah:

You write of “the complaint that p-values aren’t posterior probabilities.” That’s not quite right. The p-value is not the posterior probability of the null hypothesis–it’s not, and that’s just a mathematical fact. The complaint is that (a) the p-value is presented as a generally useful inferential summary or measure of evidence, and (b) many users, including expert users, have misconceptions about p-values (the belief that the p-value can be taken as the probability that the null hypothesis is true; the attitude that results that are not statistically significant can be treated as zero; the belief that estimates remain unbiased even after selection on statistical significance; and others; see for example this paper by McShane and Gal: https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1289846).

Again: it’s not that reformers such as myself are upset that the p-value is not the posterior probability of the null hypothesis, or that we want the p-value to be the posterior probability of the null hypothesis, or that we are interested in the posterior probability of the null hypothesis at all. The problem is that the p-value is the answer to a very indirect and typically irrelevant question, and users seem to want to act as if the p-value is answering questions it does not answer.

Andrew:

Thanks for your comment. This reference was to a specific group of editors addressed by Lakens:

“First, they state that ‘p values do not equate to a probability that researchers need to know’ because ‘Researchers need to know the probability that the null hypothesis is true given the data observed in their study.’”

That said, I think the basis for considerable criticism of p-values, and maybe error statistics in general, is that they don’t give posteriors–even though most Bayesians don’t either. It’s a very powerful rhetorical move, so much so that people often forget to ask whether the recommended replacement method gives posteriors, and whether they’d want to use them even if they did.

Why “indirect”? That’s generally said because it is assumed a “direct” assessment would be a posterior or Bayes factor. No?

Even if the alternative methods put forward don’t give posteriors either, I don’t think anyone can deny that a major source of criticism is that p-values disagree with posteriors–that’s the whole reason we hear that they “exaggerate” evidence, and why p-value thresholds should be lowered. It’s an extremely old argument. But of course we know that, for different priors, posteriors on a null can range from alpha to 1 – alpha.
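A toy calculation (my own illustration, not from this exchange) shows how elastic the posterior is: with a “spike and slab” prior (a point mass on H0: μ = 0, and μ ~ N(0, τ²) under H1), the same just-significant result yields almost any posterior on the null, depending on the spike’s prior weight.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_null(xbar, se, prior_null, tau=1.0):
    """P(H0 | xbar) with a point mass prior_null on H0: mu = 0 and
    mu ~ N(0, tau^2) under H1, where xbar ~ N(mu, se^2).
    All modeling choices here are illustrative assumptions."""
    bf01 = normal_pdf(xbar, se * se) / normal_pdf(xbar, tau * tau + se * se)
    return prior_null * bf01 / (prior_null * bf01 + 1.0 - prior_null)

# A 'just significant' result: xbar = 1.96 * se, i.e., two-sided p = 0.05
xbar, se = 1.96 * 0.1, 0.1
for prior_null in (0.1, 0.5, 0.9):
    print(prior_null, round(posterior_null(xbar, se, prior_null), 3))
```

Under these assumed settings a p = 0.05 result can leave the null with posterior probability around 0.6 even with a 50/50 spike prior, which is the familiar Berger–Sellke style “p-values exaggerate evidence” arithmetic; drop the spike and the apparent conflict largely dissolves.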

Here’s something you might wish to address: Bickel’s argument that when p-values disagree with posteriors, there is evidence that the model fails a model check. I have not seen any Bayesian reply. Jim Berger said he thinks it’s a difficult problem.

https://errorstatistics.com/2021/12/13/bickels-defense-of-significance-testing-on-the-basis-of-bayesian-model-checking/

As far as the relevance of p-values: many think it’s quite relevant to be able to ascertain whether a method and data were unable to distinguish genuine effects from those due to background variability. That suffices to deny there’s evidence of a genuine effect!

Still, as you well know, I have always held that p-values were inadequate and required supplements in the form of assessments of discrepancies (from a reference) that are well or poorly warranted by the data. These are based on considering what is often called the p-value distribution, which is the basis for severity assessments. (Other measures can also be used.) The error statistical account of statistical testing of claims, therefore, goes considerably beyond p-values. But most good scientists know how to use p-values as a limited, but altogether important, piece of a larger inquiry. Science involves lots of piecemeal tools to build strong error probes. You used to think so too, and I’m guessing you still do (e.g., Gelman and Shalizi).

Deborah:

I completely agree with you and Lakens that the statement, “Researchers need to know the probability that the null hypothesis is true given the data observed in their study,” is bad, and if that’s what those journal editors said, then they’re misinformed and it causes me to distrust whatever else they say on the topic.

Regarding the linked article by David Bickel: I have not read the article carefully so have no comments on its details, but I agree with the general point that when two different methods give different answers on a single dataset, this can be used as a model check. For a simple example, if you have independent data from a normal distribution with population mean mu, then the mean of the data and the median of the data are two different estimates of mu. If these two estimates differ by a lot, this is evidence that there is a problem with the model. I guess that’s my “Bayesian reply,” but I don’t really think the argument needs a “reply”: it’s just a general point that models can be checked in this way.

P.S. As a separate matter, I object to your use of the term “crusading” in the title of your post. This is a loaded term, no? The journal editors have a position on a topic that is important to them, so they take it seriously, just as you do and I do. We’re all “crusading.” Labeling people you disagree with as “crusaders” doesn’t seem right to me. They’re making arguments, just like the rest of us.

Andrew: Thank you for your reply. Yes, “crusaders” is loaded, it’s Hurlbert’s term, but you know what? It fits the “CIs only” crowd, who fought a (largely successful) crusade to rid statistical inference of anything other than CIs. (I added the word “largely” on 7/8/22.) They (the crusaders, not mere critics of tests) have done more to distort and misrepresent tests than any critics from the Bayesian tribes. Since the dual to CIs is N-P statistics, not simple single null hypothesis tests, it is unfair of them to contrast their view with simple p-values. CIs require much more in the way of model structure and assumptions than simple p-values. Neyman developed CIs at the same time as he was developing statistical tests with alternatives, power, etc. That said, remember, this is a blog and not an academic paper, and Hurlbert uses the term in academic papers.

I’ll come back to this, I’m in a waiting room, and don’t know if internet will last.

To be a “crusader”, to strongly champion a cause, is not pejorative. We hear, for instance, of someone being a crusader for a type of cancer research.

Wait–there was a “successful crusade to rid statistics of anything other than CIs”? Somebody forgot to tell the world about this success. This is just getting ridiculous. I get it, you disagree with these people. I don’t think anything is gained by using the term “crusader” to describe someone who holds a position different from yours.

Andrew:

I was just trying to give credit where credit is due. I’ve known Cumming since his attempt was just getting off the ground. At the time he also favored the kind of revised version of statistical significance tests that Spanos and I were developing. Severity seamlessly connects tests and CIs, but also improves on the latter by attaching assessments to claims corresponding to different points in the CI (at different confidence levels). In fact, Cumming created the first severity calculator, starting from his CI Excel program, at a workshop we were at. Then he decided on CIs only, and I don’t think anyone can deny he has had a huge influence, at least in psychology.

I’m going by what the people in psych themselves say. Take Haig in this blog:

“Eric Eich, in his accompanying editorial (Eich, 2014), explicitly discouraged prospective authors from using null hypothesis significance testing, and invited them to consider using the new statistics of effect sizes, estimation, and meta-analysis. Cumming, now with Bob Calin-Jageman, continues to assiduously promote the new statistics in the form of textbooks, articles, workshops, symposia, tutorials, and a dedicated website. It is fair to say that the new statistics has become the quasi-official position of the Association for Psychological Science…”

Recall, I said I was in a waiting room, so I might instead have written “largely successful” (which I have now written, with a date)–but success doesn’t have to be 100%. I’m happy for him, but I’m sad about the picture of hypothesis tests he has championed. What bothers me is that the good recommendation to accompany significance tests with CIs (when it is possible to compute them) has, amongst the crusaders, come hand in hand with a highly distorted view of tests. No shortcomings of CIs are mentioned (e.g., how to set and interpret the confidence level, and how it warrants particular inferences). We now have rival tribes led by very smart, convincing people, each rather successful at getting their preferred replacement for tests adopted, giving us opposed methods. For example, a Bayes factor can license no evidence against (or even evidence for) a null that is entirely excluded from a confidence interval at a high confidence level. Then there’s the lump prior on point nulls advocated by the “redefine significance” group, in contrast to the “all nulls are false” group. I’m not saying these apparent oppositions cannot be understood. My book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, tries to at least explain how we get these rival positions. I say improvement of methods can’t occur until these stat wars are at least understood. It came out in September 2018. In 2019, leading voices of these competing tribes found they could agree on one thing: statistical significance tests caused a crisis of replication.

It’s interesting that you say in your last comment that when competing methods disagree, we should look for a flaw in the underlying model. That’s not what’s suggested by those finding that p-values disagree with posteriors; instead, it’s taken as a criticism of p-values. What about the fact that they’re measuring very different things?

Andrew:

Today’s critics of statistical significance tests, built on age-old fallacies and/or rival statistical philosophies, have, in my opinion, lost any claim to the high ground on the use of overloaded words once they adopted the call to “abandon” statistical significance–and with the full backing of the Executive Director of the ASA. It turns out most of them don’t literally mean we should never use statistical significance tests, but if they don’t mean it, they shouldn’t be saying it, most especially with officials’ thumbs on the scale. That is why I wrote my editorial: Mayo 2021 https://conbio.onlinelibrary.wiley.com/doi/full/10.1111/cobi.13861

Of course, the sensational “scientists rise up against….” sells. An ASA Presidential Task Force on significance and replication came out against abandoning statistical significance tests, but no one would publish it–not even the ASA, whose president appointed the task force. It has been skillfully downplayed and muted. So there’s scarcely a level playing field.

A crusader is a person who works hard or campaigns forcefully for a cause. Most crusaders advocate dramatic social or political change.