Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?


A little over a year ago, the board of the American Statistical Association (ASA) appointed a new Task Force on Statistical Significance and Replicability (under then-president Karen Kafadar) to provide it with recommendations. [Its members are here (i).] You might remember my blogpost at the time, “Les Stats C’est Moi”. The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings (JSM) at the end of July 2020. But the ASA hasn’t revealed the Task Force’s recommendations, and I just learned yesterday that it has no plans to do so*. A panel session I was in at the JSM (“P-values and ‘Statistical Significance’: Deconstructing the Arguments”) grew out of this episode, and papers from the proceedings are now out. The introduction to my contribution gives you the background to my question, while revealing one of the recommendations (I only know of 2).

[i] Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

You can access the full paper here.

 

Rejecting Statistical Significance Tests: Defanging the Arguments^

Abstract: I critically analyze three groups of arguments for rejecting statistical significance tests (don’t say ‘significance’, don’t use P-value thresholds), as espoused in the 2019 Editorial of The American Statistician (Wasserstein, Schirm and Lazar 2019). The strongest argument supposes that banning P-value thresholds would diminish P-hacking and data dredging. I argue that it is the opposite. In a world without thresholds, it would be harder to hold accountable those who fail to meet a predesignated threshold by dint of data dredging. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing statistical falsification. The second group of arguments constitutes a series of strawperson fallacies in which statistical significance tests are too readily identified with classic abuses of tests. The logical principle of charity is violated. The third group rests on implicit arguments. The first in this group presupposes, without argument, a different philosophy of statistics from the one underlying statistical significance tests; the second group—appeals to popularity and fear—only exacerbate the ‘perverse’ incentives underlying today’s replication crisis.

1. Introduction and Background 

Today’s crisis of replication gives a new urgency to critically appraising proposed statistical reforms intended to ameliorate the situation. Many are welcome, such as preregistration, testing by replication, and encouraging a move away from cookbook uses of statistical methods. Others are radical and might inadvertently obstruct practices known to improve on replication. The problem is one of evidence policy, that is, it concerns policies regarding evidence and inference. Problems of evidence policy call for a mix of statistical and philosophical considerations, and while I am not a statistician but a philosopher of science, logic, and statistics, I hope to add some useful reflections on the problem that confronts us today. 

In 2016 the American Statistical Association (ASA) issued a statement on P-values, intended to highlight classic misinterpretations and abuses. 

The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. (Wasserstein and Lazar 2016, p. 129) 

The statement itself grew out of meetings and discussions with over two dozen others, and was specifically approved by the ASA board. The six principles it offers are largely rehearsals of fallacious interpretations to avoid. In a nutshell: P-values are not direct measures of posterior probabilities, population effect sizes, or substantive importance, and can be invalidated by biasing selection effects (e.g., cherry picking, P-hacking, multiple testing). The one positive principle is the first: “P-values can indicate how incompatible the data are with a specified statistical model” (ibid., p. 131). 
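The claim that biasing selection effects can invalidate P-values can be illustrated with a small simulation. What follows is a minimal sketch and is not taken from the paper: the function names, the choice of 20 outcomes, 30 observations per outcome, and a z-test with known standard deviation are all illustrative assumptions. Every null hypothesis is true by construction, so any "significant" result is a false alarm. The sketch compares testing a single predesignated outcome with reporting only the smallest of 20 P-values.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample):
    """Two-sided z-test P-value for H0: mean = 0, assuming known sd = 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(n_outcomes, n_obs=30, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated studies whose smallest P-value falls below alpha.

    All data are drawn from N(0, 1), so every null hypothesis is true.
    n_outcomes = 1 mimics a single predesignated test; a larger value
    mimics dredging through many outcomes and reporting only the best one.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_trials


single = false_alarm_rate(n_outcomes=1)    # predesignated test
dredged = false_alarm_rate(n_outcomes=20)  # report the best of 20
print(f"predesignated outcome: {single:.3f}")
print(f"best of 20 outcomes:   {dredged:.3f}")
```

With independent outcomes, the probability that at least one of 20 true nulls yields p < 0.05 is 1 − 0.95^20 ≈ 0.64, so the second rate should land near that value while the first stays near 0.05. The nominal P-value reported after dredging no longer reflects the actual error probability of the procedure.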

The authors of the editorial that introduces the 2016 ASA Statement, Wasserstein and Lazar, assure us that “Nothing in the ASA statement is new” (p. 130). It is merely a “statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value” (p. 131). Thus, it came as a surprise, at least to this outsider’s ears, to hear the authors of the 2016 Statement, along with a third co-author (Schirm), declare in March 2019 that: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” (Wasserstein, Schirm and Lazar 2019, p. 2, hereafter, WSL 2019).

The 2019 Editorial announces: “We take that step here….[I]t is time to stop using the term ‘statistically significant’ entirely. …[S]tatistically significant –don’t say it and don’t use it” (WSL 2019, p. 2). Not just outsiders to statistics were surprised. To insiders as well, the 2019 Editorial was sufficiently perplexing for the then ASA President, Karen Kafadar, to call for a New ASA Task Force on Significance Tests and Replicability. 

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. 

… To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece … without leaving the impression that p-values and hypothesis tests…have no role in ‘good statistical practice’. (K. Kafadar, President’s Corner, 2019, p. 4) 

This was a key impetus for the JSM panel discussion from which the current paper derives (“P-values and ‘Statistical Significance’: Deconstructing the Arguments”). Kafadar deserves enormous credit for creating the new task force.1 Although the new task force’s report, submitted shortly before the JSM 2020 meeting, has not been disclosed, Kafadar’s presentation noted that one of its recommendations is that there be a “disclaimer on all publications, articles, editorials, … authored by ASA Staff”.2 In this case, a disclaimer would have noted that the 2019 Editorial is not ASA policy. Still, given that its authors include ASA officials, it has a great deal of impact.

We should indeed move away from unthinking and rigid uses of thresholds—not just with significance levels, but also with confidence levels and other quantities. No single statistical quantity from any school, by itself, is an adequate measure of evidence, for any of the many disparate meanings of “evidence” one might adduce. Thus, it is no special indictment of P-values that they fail to supply such a measure. We agree as well that the actual P-value should be reported, as all the founders of tests recommended (see Mayo 2018, Excursion 3 Tour II). But the 2019 Editorial goes much further. In its view: Prespecified P-value thresholds should not be used at all in interpreting results. In other words, the position advanced by the 2019 Editorial, “reject statistical significance”, is not just a word ban but a gatekeeper ban. For example, in order to comply with its recommendations, the FDA would have to end its “long established drug review procedures that involve comparing p-values to significance thresholds for Phase III drug trials” as the authors admit (p. 10). 

Kafadar is right to see the 2019 Editorial as challenging the overall use of hypothesis tests, even though it is not banning P-values. Although P-values can be used as descriptive measures, rather than as tests, when we wish to employ them as tests, we require thresholds. Ideally there are several P-value benchmarks, but even that is foreclosed if we take seriously their view: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (WSL 2019, p. 2). 

The March 2019 Editorial (WSL 2019) also includes a detailed introduction to a special issue of The American Statistician (“Moving to a World beyond p < 0.05”). The position that I will discuss, reject statistical significance, (“don’t say ‘significance’, don’t use P-value thresholds”), is outlined largely in the first two sections of the 2019 Editorial. What are the arguments given for the leap from the reasonable principles of the 2016 ASA Statement to the dramatic “reject statistical significance” position? Do they stand up to principles for good argumentation? 

Continue reading the paper here. Please share your comments.

NOTES:

1 Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020) 

2 Kafadar, K., “P-values: Assumptions, Replicability, ‘Significance’,” slides given in the Contributed Panel: P-Values and “Statistical Significance”: Deconstructing the Arguments at the (virtual) JSM 2020. (August 6, 2020). 

^CITATION: Mayo, D. (2020). Rejecting Statistical Significance Tests: Defanging the Arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association. (2020). 236-256.

*Jan 11 update. The ASA executive director, Ron Wasserstein, wants to emphasize that the ASA is leaving it to the members of the Task Force to decide when and how to release the report on their own. I do not know if they will do so or if all of the authors will agree to this shift. Personally, I don’t know why the ASA Board would not wish to reveal the recommendations of the Task Force that it created–even without any presumption that the report is thereby understood to be a policy document. There can be a clear disclaimer that it is not. The Task Force carried out the work that was asked of them in a timely manner. You can find a statement of the charge given to the Task Force in my comments.

Categories: 2016 ASA Statement on P-values, JSM 2020, replication crisis, statistical significance tests, straw person fallacy | 7 Comments


7 thoughts on “Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?”

  1. Stanley Young

    Deborah: Obviously unacceptable. Have you contacted any members? Would you like me to? Stan


    • Stan:
      I have written to the committee chairs, and, most recently, Wasserstein–and it would be great if others did as well. It appears the ASA has no intention of reporting on the recommendations of the Task Force it appointed to make recommendations. The January issue of Amstat (https://magazine.amstat.org/wp-content/uploads/2020/08/JANUARY2020_web2.pdf) reported that, at the November 2019 ASA Board meeting, members of the board approved the following motion: An ASA Task Force on Statistical Significance and Reproducibility will be created, “with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA BOD. The task force will report to the ASA BOD by November 2020.”

      Task Force members were listed in the Feb 2020 issue of Amstat: https://magazine.amstat.org/wp-content/uploads/2020/02/February2020_FINAL.pdf

      Apparently, having asked the Task Force for recommendations, certain individuals said “never mind”. I don’t know its recommendations aside from the 2 Kafadar mentioned in her JSM talk. The recommendation to add a disclaimer to articles authored by ASA staff–obvious as it seems–would be blocked by the individuals who resisted a disclaimer to begin with, being content, I suppose, to have their views interpreted as ASA policy. But I’m guessing the other recommendations had much more substance. Anything short of “don’t say significance, don’t use P-value thresholds” would be objectionable to those in favor of that stance, and it is no secret that that includes the ASA executive director, given he was the lead author of Wasserstein et al. (2019). I’m guessing (hoping) the Task Force will publish their report.

  2. rkenett

    Mayo – I wrote to you at the time that the ASA initiative to formulate position statements seems to have started with the VAM education models. It seems that the ASA now shies away from this and has reversed the approach. As I also wrote at the time, in my opinion, ASA (or other statistical associations) should provide platforms for discussion but avoid formulating position statements. From this perspective ASA should encourage the disclosure of the report of the committee Karen formed. In fact, as you point out, they sort of “preregistered” this disclosure implicitly.

    On the cynical side, it seems that this has not bearing on the practice of statistics. These events do impact what is published in most journals and certainly what is done with statistics in most places. The belief that “Les Stat c’est moi” applies more than ever. Instead of constructive discussions on how to meet new challenges, ASA took destructive directions on what not to do. Apparently it now stopped having such ambitions letting the task force members express their opinion, or not, as they choose.

    If this is the case, ASA should state this openly. Something like “ASA is a forum for discussion and is abstaining from making position statements as an organization” is needed.

    I was president of two statistical societies. This situation came up in both of them. Since we never made “statements” like ASA did, we never had to make statements like the above suggestion. Better avoid a problem than having to solve it…

    • Ron:

      You seem to be saying opposite things in suggesting “it seems that this has not bearing on the practice of statistics. These events do impact what is published in most journals and certainly what is done with statistics in most places.” How can the practice of statistics not be impacted by what is allowed to be published and what is done with statistics? I agree that “ASA took destructive directions” beginning with the 2016 statement and the Wasserstein et al., 2019 editorial. I doubt that the ASA can any longer say that it is “a forum for discussion and is abstaining from making position statements as an organization”. To anyone watching the episode, it appears that the ASA Board is allowing itself to be a voice piece of one faction in the debate about the use of statistical significance tests and p-value thresholds–based on some very flagrant abuses of the methods. Of course, it is possible that something changes. A first step would be a disclaimer associated with Wasserstein et al 2019, and an end to sending recommendations to journal editors that they revise their publication guidelines in compliance with its anti-statistical significance testing stance.

    • rkenett

      I corrected a typo in a subsequent note. My point is that some perspective to this would be useful. Again I refer to: “it seems that this has no bearing on the practice of statistics”. Trafimow is seeking exposure and the NISS debate gave him that. I currently have papers under review in a wide range of areas such as engineering, education, official statistics, translational medicine, clinical research and computer science (yes the pandemic has some positive aspects as I am stuck at home with extra time to work on publications). In none (NONE) of these journals have the topics under debate percolated. So, yes, I have evidence supporting my statement that “it seems that this has no bearing on the practice of statistics”.

  3. rkenett

    Typo correction:

    On the cynical side, it seems that this has no bearing on the practice of statistics. These events do not impact what is published in most journals and certainly what is done with statistics in most places.

