A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated^{(note)} the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II)^{(note)} announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”.^{ }To herald the ASA II^{(note)}, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal *Nature* requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”!

Tom Hardwicke and John Ioannidis surveyed those signatories and give a report on the respondents (Hardwicke and Ioannidis 2019). I was invited to write an editorial on any aspect of the episode (“P-value thresholds: Forfeit at your peril“)–the opening of which is above. Hardwicke and Ioannidis 2019, a preprint of my editorial, and an editorial by Andrew Gelman are currently “free access” in the *European Journal of Clinical Investigation*. I guess that means these versions are currently freely accessible.

My article continues:

Note: By “ASA II”^{(note)} I allude only to the authors’ general recommendations, not their summaries of the 43 papers in the issue.)

Hardwicke and Ioannidis (2019) worry that recruiting signatories on such a paper politicizes the process of evaluating a stance on scientific method, and fallaciously appeals to popularity (argumentum ad populum) “because it conflates justification of a belief with the acceptance of a belief by a given group of people”. Opposing viewpoints are not given a similar forum. Fortunately, John Ioannidis (2019) can come out with a note in JAMA challenging ASA II^{(note)} and AGM, but the vast majority of stakeholders in the debate go unheard. Appealing to popularity gives a *prudential* reason to go along, it is risky to stand in opposition to the hundreds who signed, not to mention, the thought leaders at the ASA. There is also an appeal to fear, with the result that many will fear using statistical significance tests altogether. Why risk using a method that is persecuted with such zeal and fanfare?

Ioannidis (2019) points out what may not be obvious at first: it is not just a word ban but a gatekeeper ban:

Many fields of investigation … have major gaps in the ways they conduct, analyze, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science.

Among the top-cited signatories who respond to their questionnaire, Hardwicke and Ioannidis find a heavy representation of fields with prevalent concerns about low reproducibility. Yet “abandoning the concept of statistical significance would make claims of ‘irreproducibility’ difficult if not impossible to make. In our opinion this approach may give bias a free pass”.

I agree, and will show why.

I continue with (excerpts of a preprint of) my article; references are formatted in the usual way. You can read the “free access” version here.

….

It might be assumed I would agree to “retire significance” since I often claim “the crude dichotomy of ‘pass/fail’ or ‘significant or not’ will scarcely do” and because I reformulate tests so as to “determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.”(Mayo 2018) [Genuine effects, as Fisher insisted,require not isolated small P-values, but a reliable method to successfully generate them.] We should not confuse prespecifying minimal thresholds in each test, which I would uphold, with fixing a value to habitually use (which I would not). N-P tests called for the practitioner to balance error probabilities according to context, not rigidly fix a value like .05. Nor does having a minimal P-value threshold mean we do not report the attained P-value: we should, and N-P agreed!

**The “no threshold” view is not merely to never use the S word and report continuous P-values **

These two rules alone would not lead Hardwicke and Ioannidis to charge, correctly, in my judgment, that “this approach may give bias a free pass”. ASA II^{(note)} and AGM decry using any prespecified P-value threshold as the basis for categorizing data in some way, such as inferring that results are, or are not, evidence of a genuine effect.

- “Decisions to interpret or to publish results will not be based on statistical thresholds” (AGM).
- “Whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data (ASA II).
^{(note)}

Consider how far reaching the “no threshold” view is for interpreting data. For example, according to ASA II^{(note)}, in order for the U.S. Food and Drug Administration (FDA) to comply with its “no threshold” position, it does not suffice that they report continuous P-values and confidence intervals. The FDA would have to end its “long established drug review procedures that involve comparing *p*-values to significance thresholds for Phase III drug trials”.

The New England Journal of Medicine (NEJM) responds (2019) to the ASA call to revise their guidelines, but insists that a central premise on which their revisions are based is “the use of statistical thresholds for claiming an effect or association should be limited to analyses for which the analysis plan outlined a method for controlling type I error”. In the article accompanying the revised guidelines:

“A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments [for multiple trials] have a role in those decisions”.

Specifying “thresholds that have a strong theoretical and empirical justification” escapes the ASA II^{(note)} ruling: “Don’t conclude anything about scientific …importance based on statistical significance”.

Although less well advertised, the “no thresholds” view also torpedoes common uses of confidence intervals and Bayes Factor standards.

[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups. Similarly, we need to stop using confidence intervals [CIs] as another means of dichotomizing. (ASA II)

^{(note)}

AGM’s “compatibility intervals” are redolent of the consonance intervals of Kempthorne and Folks(1971) , except that the latter use many thresholds, one for each of several consonance levels. Even these would seem to violate the rule that results should not be “categorized into any number of groups”.

…Nor could Bayes factor thresholds be used, as they often are, to test a null against an alternative. It is not clear how any statistical tests survive. A claim has not passed a genuine test, if none of the results are allowed to count against it. We are not told what happens to the use of significance tests to check if statistical model assumptions hold approximately, or not–essential across methodologies. As George Box, a Bayesian, remarks, “diagnostic checks and tests of fit … require frequentist theory significance tests for their formal justification”(1983, p. 57).

**What arguments are given to accept the no threshold view?**

Getting past the appeals to popularity and fear, the reasons ASA II^{(note)} and AGM give are that thresholds can lead to well-known fallacies, and even to some howlers more extreme than those long lampooned. Of course it’s true:

a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment …). Nor do statistically significant results ‘prove’ some other hypothesis. (AGM)

^{ }

It is easy to be swept up in their outrage, but the argument: “significance thresholds can be used very badly, therefore remove significance thresholds” is a very bad argument. Moreover, it would remove the very standards we need to call out the fallacies. A rule that went from any non-significant result to inferring no effect was proved, or to take something less extreme, to inferring it is well warranted or the like, would have extremely high Type II error probabilities. They deal with a point null hypothesis, which makes it even worse.

…The “free access” version is here.

**Giving Data Dredgers a Free Pass**

The danger of removing thresholds on grounds they could be badly used is that they are not there when you need them. Ioannidis zeroes in on the problem:

The proposal to entirely remove the barrier does not mean that scientists will not often still wish to interpret their results as showing important signals and fit preconceived notions and biases. With the gatekeeper of statistical significance, eager investigators whose analyses yield, for example, P = .09 have to either manipulate their statistics to get to P < .05 or add spin to their interpretation to suggest that results point to an important signal through an observed “trend.” When that gate keeper is removed, any result may be directly claimed to reflect an important signal or fit to a preexisting narrative.

As against Ioannidis’ anything goes charge, it might be said that even in a world without thresholds a largish P-value could not be taken as evidence of a genuine effect. For to do so would be to say something nonsensical. It would be to say: Even though larger differences would frequently be expected by chance variability alone (i.e., even though the P-value is largish), I maintain the data provide evidence they are not due to chance variability.

But such a response turns on appealing to a threshold to block it, minimally requiring the P-value be rather small e.g., < .1? (It also shows why P-values are apt measures for the job of distinguishing random error.) Thus, our eager investigators, facing a non-small P-value, are still incentivized to manipulate their statistics. Say they ransack the data until finding a non-prespecified subgroup that provides a nominally small enough P-value. In a world without thresholds, we would be hamstrung from highlighting, critically, P-values that breach (as opposed to uphold) preset thresholds.

“Whether a p-value passes any arbitrary threshold should not be considered *at all* when deciding which results to present or highlight” (my emphasis, ASA II)^{(note)}.

More important than keeping a specific word is keeping a filter for error control. The 2016 ASA I warned in Principle 4: “Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting”. …An unanswered question is how Principle 4 is to operate in a world with ASA II^{(note)}.

The NEJM’s revised guidelines, far from agreeing to use P-values without error probability thresholds, will now be stricter in their use. When no method to adjust for multiplicity of inferences or controlling the Type I error probability is prespecified, the report of secondary endpoints

should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

Confidence intervals severed from their dualities with tests, from which they were initially developed, lose their error probability guarantees.

**Conclusion**

The ASA P-value project is lately careering into recommendations on which there has been little balanced discussion and much disagreement. Hardwicke and Ioannidis find that more than half of the respondents deny significance should be excluded from all science, and the 43 papers in the special issue “Moving to a world beyond ‘p < 0.05’” offer a cacophony of competing reforms.

It is hard to resist the missionary zeal of masterful calls: Do you want bad science to thrive? or Do you want to ban significance? (a false dilemma). A question to raise before jumping on the bandwagon: Are they asking the most unbiased questions about the consequences of removing thresholds currently ensconced into hundreds of legal statutes and best practice manuals? This needs to be carefully considered, if the reforms intended to improve credibility of statistics are not to backfire, as they may already be doing.

ASA II^{(note)} is part of a large undertaking; it contains plenty of sagacious advice. Notably the M in ATOM: Modesty.

Be modestby recognizing that different readers may have very different stakes on the results of your analysis, which means you should try to take the role of a neutral judge rather than an advocate for any hypothesis.

ASA II^{(note)} regards its positions “open to debate”. An open debate is very much needed.

Here’s the full (uncorrected) preprint of my editorial.

*Mayo (2018), Mayo and Cox (2006), Mayo and Spanos (2006).

**Acknowledgement**

I would like to thank D. Hand, N. Schachtman and A. Spanos for comments and corrections on earlier drafts.

**References not linked above**

Birnbaum, A. Statistical Methods in Scientific Inference (letter to the Editor), *Nature* 1970;225(5237):1033.

Box, G. An apology for ecumenism in statistics. In G. E. P. Box, T. Leonard, and D. F. J. Wu* (Eds.), Scientific inference, data analysis, and robustness. *Academic Press, 1983:51-84.

Fisher, RA. The design of experiments*,* Oliver and Boyd, 1947.

Kempthorne, O, Folks, J. Probability, statistics, data analysis. Iowa State University Press, 1971.

Mayo, D. G. *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. *CUP, 2018.

Mayo, D.G. and Cox, D. R. “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium *(ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), 2006; 49: 77-97.

Mayo, D. G. and Spanos, A. “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal of Philosophy of Science*, 2006;57: 323-357.

Neyman, J. Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics: Theory and Methods 1976;5(8):737–51.

NEJM Author Guidelines: Retrieved from: https://www.nejm.org/author-center/new-manuscripts on July 19, 2019.

**Relevant (2019) posts:**

The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring?

Statistics is difficult and putting probability models on real life is always subtle and prone to misunderstandings, paradoxes and the like. At the same time it offers an intriguing set of tools to make sense of and learn from data.

I think that tests and p-values are intriguing concepts. I’ve more than one student who, asked what motivated them to go into statistics, would have said “the principle of hypothesis tests” with all its implications and pitfalls.

Many people need statistics, many want to uncover simple black/white messages from their data, and very very many don’t understand statistics properly. Having tests and p-values as a celebrated and rather easy to apply method would automatically mean that their mindless and “automatic” (mis)use would be promoted and in many places even required, partly encouraging all kinds of conscious and unconscious cheating and manipulation. All of which has led to the current situation in which they seem to have become victims of their own success.

It saddens me that so many people nowadays seem to think that this is the fault of the approach per se, whereas I find it pretty clear that whatever is proposed to replace them, as long as it seems reasonably simple, will end up with pretty much the same issues. Unfortunately I’m not optimistic about forcing everyone who analyses data to collaborate with a statistician who knows what they’re doing… so bad statistics won’t die and surely can’t be stamped out by banning p-values, significance and the like. The best we can do, I think, is to make people aware of statistics’ subtleties and pitfalls in the hope that we can help some people to think a bit more and a bit better before jumping to fast and easy conclusions.

Christian:

Thanks for your comment. I don’t see how any proposed replacement, if relevant for the task of distinguishing genuine from chance variability, won’t actually be worse. But I agree with you that it’s not the methods but bad science and bad understanding. The current bandwagon greatly worsens things by making it appear that the methods permit the fallacious uses: moving from statistical to scientific (in a statistical affirming the consequent), data dredging & selection effects are among the main ones. The ‘alternative’ statistical methods permit those fallacies and they’re just as destructive in those settings, except for one big difference. Your direct grounds for identifying them & holding people accountable for committing them, vanishes. Unless they are supplemented by stipulations and principles that are not now standard. There are, of course, developments within the broad error statistical umbrella that are attacking and solving problems of selection effects. That is a worthwhile place to turn our attention.

I appreciate the comments and agree with them. As to, “Unfortunately I’m not optimistic about forcing everyone who analyses data to collaborate with a statistician who knows what they’re doing… so bad statistics won’t die and surely can’t be stamped out by banning p-values, significance and the like”, I will simply point out that ASA II with all its bad advice is coming from statisticians. As a practitioner, this does not boost my confidence in where the profession is going. A petition, and no thought as to the efficacy of proposed replacements for significance testing (for testing for random effects, etc.)? Looks like concepts to fear not trust.

John:

Thanks for your comment. You raise a very good point. The confusions we see today, I’m very sorry to say, are exacerbated by some from statistics and metastatistics. I had hoped to at least convince the authors of ASA II to modify some of their declarations, and wrote to Wasserstein with my suggestions soon after ASA II appeared on March 20.

It is little wonder that many view the “statistics wars” as proxy battles wherein leaders of various rival tribes push to extoll their favorite method/terminology, without those methods actually being the subject of scrutiny. In my opinion, the ASA representatives made a serious mistake in taking sides, and in giving most ungenerous readings of significant tests, rather than carefully delineating the core concepts. A fair and honest discussion, probably led by external parties, and without bandwagons, is needed.

I’m waiting for something like ASA XVII where they fully explain the cons of the numerous proposed alternative unproven approaches.

For example, take Greenland’s information type of measure like s = -log_base 2(p-value), or s = -log(p-value)/log(2), which can be interpreted as bits of information against H0, or the number of heads observed in as many flips of a coin, to measure “surprisal”. It is claimed it is better than using a p-value, namely because large values reject H0, it may be more intuitive, is on a better scale, etc. I don’t find this reasoning compelling. We are currently already going from raw data to summaries like means and SDs to standardized values like z-scores, and finally to p-values. Now we add an extra step and look at a transformation of the p-value? Probability is already a fairly natural scale, and small p-values already correspond to large values of your test statistic. If you want something intuitive and on a good, intuitive scale just use the observed data. There are many proposed “pet alternatives” to p-values, but what is their acceptance and performance in scientific and other areas all over the world? With p-values we already have this, but other approaches are unproven. I’m also not sure I think bits is that intuitive. Winning the lottery may be about 24 bits of surprisal, but that is not as intuitive to me as a really, really small probability. I can’t believe I actually read that writing 24 is more manageable than writing out a really, really small probability, when we can just write really, really small probabilities using scientific notation or a less than symbol.

Justin

http://www.statisticool.com

Justin: There’s quite a lot of info about the problems confronting the better known alternatives to error statistical methods (in which I would include CIs, but that might change). The hoped-for breakthroughs in default Bayesianism haven’t happened, for example. The apparent disinterest in bringing to light the problems with various alternatives sets a bad example.