When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi.” So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “we take that step here!” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they had just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.)
One final challenge, which I hope to address in my final month as ASA president, concerns issues of significance, multiplicity, and reproducibility. In 2016, the ASA published a statement that simply reiterated what p-values are and are not. It did not recommend specific approaches, other than “good statistical practice … principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.”
The guest editors of the March 2019 supplement to The American Statistician went further, writing: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. … [I]t is time to stop using the term ‘statistically significant’ entirely.”
Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. In fact, the ASA does not endorse any article, by any author, in any journal—even an article written by a member of its own staff in a journal the ASA publishes. (Kafadar, December President’s Corner)
Yet Wasserstein et al. 2019 describes itself as a continuation of the ASA 2016 Statement on P-values, which I abbreviate as ASA I. (Wasserstein is the Executive Director of the ASA.) It describes itself as merely recording the decision to “take that step here,” adding one more “don’t” to ASA I. As part of this new “don’t,” it also stipulates that we should not consider “at all” whether pre-designated P-value thresholds are met. (It also restates four of the six principles of ASA I so as to be considerably stronger than the originals. I argue, in fact, that the resulting principles are inconsistent with principles 1 and 4 of ASA I. See my post from June 17, 2019.) Since it describes itself as a continuation of the ASA policy in ASA I, and that description survived peer review at the journal TAS, readers presume that is what it is; absent any disclaimer to the contrary, that conception (or misconception) remains operative.
There really is no other way to read the claim in the Wasserstein et al. March 2019 editorial: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. … We take that step here.” Had the authors viewed their follow-up as anything but a continuation of ASA I, they would have said something like: “Our own recommendation is to go much further than ASA I. We suggest that all branches of science stop using the term ‘statistically significant’ entirely.” They do not say that. What they say is written from the perspective of “les stats, c’est moi.”
The 2019 P-value Project II
Kafadar deserves a great deal of credit for providing some needed qualification in her December note. However, there also needs to be a disclaimer by the ASA as regards what it calls its P-value Project. The P-value Project, started in 2014, refers to the overall ASA campaign to provide guides for the correct use and interpretation of P-values and statistical significance; journal editors and societies are asked to consider revising their instructions to authors in light of its guidelines. ASA I was distilled from many meetings and discussions among representatives of statistics. The only difference today is that both ASA I and the 2019 editorial by Wasserstein et al. are to form the new ASA guidelines–even if the latter is not to be regarded as a continuation of ASA I (in accord with Kafadar’s qualification). I will refer to this as the 2019 ASA P-value Project II. Wasserstein et al. 2019 is a piece of that Project, and the authors thank the ASA for its support of the Project at the end of the article.
Of Policies and Working Groups
Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and p-values?” Should the ASA have a policy on hypothesis testing or on using “statistical significance”?
Allow me to weigh in here: No, no it should not. At one time I would have said yes, but no more. I can hear the policy now (sounding much like Wasserstein et al. 2019, only written in stone): “Don’t say, never say, or, if you really feel you must say significance, and are prepared to thoroughly justify such a ‘thoughtless’ term, then you may only say ‘significance level p,’ where p is continuous, and never rounded up or cut off, ever. But never, ever use the ‘ant’ ending: significant. You can’t, can’t, can’t say results are statistically significant (at level p).” The only exception would be if you’re giving the history of statistics. (3)
Why can’t the ASA merely provide a bipartisan forum for discussion of the multitude of models, methods, aims, goals, and philosophies of its members? Wasserstein et al. 2019 admits there is no agreement, and that there might never be. Spare us another document whose implication is: we need not test, and cannot falsify claims, even statistically (since that is the consequence of no thresholds). I realize that Kafadar is calling for a serious statement–one that counters the impression of the Wasserstein et al. opinion.
To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece reflecting “good statistical practice,” without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice.” …The ASA should develop—and publicize—a properly endorsed statement on these issues that will guide good practice.
Be careful what you wish for. I give major plaudits to Kafadar for pressing hard to see that alternative views are respected, and for countering the popular but terrible arguments of the form: since these methods are misused, they should be banished and replaced with methods advocated by group Z (even if the credentials of Z’s methods haven’t been scrutinized!). We have already seen in 2019 the extensive politicization and sensationalizing of bandwagons in statistics. (See my editorial “P-Value Thresholds: Forfeit at Your Peril”.) The average ASA member, who doesn’t happen to be a thought leader or a member of a politically correct statistical-philosophical tribe, is in great danger of being muffled entirely. There’s already a loss of trust. We already know, under the motto that “a crisis should never be wasted,” that many leaders of statistical tribes view the crisis of replication as an opportunity to sell alternative methods they have long been promoting. Rather than the properly endorsed, truly representative statement that Kafadar seeks, we may get dictates from those who are quite convinced that they know best: “les stats, c’est moi.”
APPENDIX. How a Working Group on P-values and Significance Testing Could Work
I see one way that a working group could actually work. The 2016 ASA statement, ASA I, included a principle, #4, that you don’t hear about in the 2019 follow-up. It is that “P-values and related statistics” cannot be correctly interpreted without knowing how many hypotheses were tested, and how the data were specified and results selected for inference. Notice the qualification “and related statistics”: the presumption is that some methods don’t require that information! That information is necessary only if one is out to control the error probabilities associated with an inference.
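The point of principle #4 can be made concrete with a toy simulation (my own sketch, not part of ASA I): if twenty true null hypotheses are each tested at the 0.05 level and only the smallest p-value is reported, a nominally “significant” result turns up in roughly 64% of studies, not 5%. Without knowing that twenty tests were run, the reported p-value cannot be correctly interpreted.

```python
import math
import random

random.seed(7)

def two_sided_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mu = 0, known sigma (one-sample z-test)."""
    z = (sum(sample) / len(sample)) / (sigma / math.sqrt(len(sample)))
    # p = 2 * (1 - Phi(|z|)), using the standard normal CDF via erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def smallest_p(n_tests=20, n=30):
    """Test n_tests TRUE nulls on pure-noise data; report only the minimum p."""
    return min(two_sided_p([random.gauss(0, 1) for _ in range(n)])
               for _ in range(n_tests))

trials = 2000
false_alarms = sum(smallest_p() < 0.05 for _ in range(trials)) / trials
print(f"Selective reporting yields 'p < 0.05' in {false_alarms:.0%} of trials,"
      f" though every null is true (nominal level 5%; theory: 1 - 0.95**20 = 64%)")
```

The error probability attached to “the smallest of 20 p-values” is what standard multiple-testing adjustments (e.g., Bonferroni) are designed to control; methods indifferent to error probabilities have no use for the count of tests, which is exactly the presumption the principle’s qualification lets in.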
Here’s my idea: Have the group consist of those who work in areas where statistical inferences depend on controlling error probabilities (I call such methods error statistical). They would be involved in current uses and developments of statistical significance testing and the much larger (frequentist) error statistical methodology within which it forms just a part. They would be familiar with, and some would be involved in developing, the latest error statistical tools, including tests and confidence distributions, P-values with high-dimensional data, and current problems of adjusting for multiple testing and of testing statistical model assumptions; and they would be capable of addressing different aspects of comparative statistical methods (Bayesian and error statistical). They would present their findings and recommendations, and responses would be sought.
The need for the kind of forum I’m envisioning is so pressing that it should not be contingent on being created by any outside association. It should emerge spontaneously in 2020. We take that step here.
Please share your thoughts in the comments.
This is a pun on “l’état, c’est moi” (“I am the state,” Louis XIV*). I thank Glenn Shafer for the appropriate French spelling for my pun. (*Thanks to S. Senn for noticing I was missing the X in Louis XIV.)
They are referring to the last section of ASA I on “other measures of evidence”. Indeed, that section suggests an endorsement of an assortment of alternative measures of evidence, including Bayes factors, likelihood ratios, and others. There is no attention to whether any of these methods accomplish the key task of the statistical significance test: to distinguish genuine from spurious effects. For a fuller explanation of this last section, please see my posts from June 17, 2019 and November 14, 2019. And, obviously, check the last section of ASA I itself.
Shortly after the 2019 editorial appeared, I queried Wasserstein as to the relationship between it and ASA I. It was never clarified. I hope now that it will be. At the same time I informed him of what appeared to me to be slips in expressing principles of ASA I, and I offered friendly amendments (see my post from June 17, 2019).
If you’re giving the history of statistics, you can speak of those bad, bad men, the dichotomaniacs Neyman and Pearson, who, following Fisher, divided results into significant and non-significant discrepancies (introducing the alternative hypothesis, type I and II errors, power, and optimal tests) and thereby tried to reduce all of statistics to acceptance sampling, engineering, and 5-year plans in Russia, as Fisher (1955) himself said (after the professional break with Neyman in 1935). Never mind that Neyman developed confidence intervals at the same time, 1930. For a full discussion of the history of the Fisher-Neyman (and related) wars, please see my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018).
I was just sent this podcast interview of Ron Wasserstein, so I’m adding it as a footnote. There, Wasserstein et al. 2019 is clearly described as the ASA’s “further guidance,” and Wasserstein takes no exception to it. The interviewer says:
“But it would seem as though Ron’s work has only just begun. The ASA has just published further guidance in the most recent edition of The American Statistician, which is open access and written for non-statisticians. The guidance is intended to go further and argues for an end to the concept of statistical significance and towards a model which the ASA have coined their ATOM Principle: Accept uncertainty, Thoughtful, Open and Modest.”
Nathan Schachtman, in a new post just added to his law blog on this very topic, displays a letter from the ASA acknowledging that a journal has revised its guidelines taking into account both ASA I and the 2019 Wasserstein et al. editorial. I had seen this letter, in relation to the NEJM, but it’s hard to know what to make of it. I haven’t seen similar acknowledgments for other journals, and there have been around seven at this point. I may just be out of the loop.
Selected blog posts on ASA I and the Wasserstein et al. 2019 editorial:
- March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)