Invitation to discuss the ASA Task Force on Statistical Significance and Replication


The latest salvo in the statistics wars comes with the publication of the report of the ASA Task Force on Statistical Significance and Replicability, appointed by past ASA president Karen Kafadar in November/December 2019. (In the ‘before times’!) Its members are:

Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

The full report of this Task Force appears in The Annals of Applied Statistics, and on my blogpost. It begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

On Monday, August 2, the National Institute of Statistical Sciences (NISS) will hold a public discussion focused on this report, and several of its members will be there. (See the announcement at the end of this post.)

Kafadar, and the members of this task force, deserve a lot of credit for defying the popular movement to “abandon statistical significance” by unanimously declaring “that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”:

• P -values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P -values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.

• They are important tools that have advanced science through their proper application. …

• P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.
(Benjamini et al. 2021)

If you follow this blog, you know that I have often discussed the 2019 editorial in The American Statistician to which this Task Force report refers: Wasserstein, Schirm and Lazar (2019), hereafter WSL 2019 (see blog links below). But now I’m inviting you to share your views on any aspects of the overall episode (ASAgate?) for posting on this blog. (Send them by October 31, 2021; see the updated deadline info in Note [1].) I’d like to put together a blogpost with multiple authors, and multiple perspectives, soon after. For background see this post. (For even more background, see the links at the end of this post.)

I first assumed WSL 2019 was a continuation of the 2016 ASA Statement on P-values, especially given how it is written. According to WSL 2019, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and they announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds to distinguish data that do and do not indicate various discrepancies from a test hypothesis is also verboten.[2] In fact, it rejects any number of classifications of data: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (WSL 2019, p. 2).

To many, including then ASA president Karen Kafadar (2019), the position in WSL 2019 challenges the overall use of hypothesis tests even though it does not ban P-values:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and p-values?”

So the ASA Board created the new Task Force in November 2019 “with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors.” (AMSTATNEWS 1 February 2020).

Several of its members will be at Monday’s NISS meeting. A panel session I organized at the 2020 JSM (P-values and ‘Statistical Significance’: Deconstructing the Arguments) grew out of this episode (my contribution to the proceedings is here).

The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings (JSM) at the end of July 2020. But the ASA didn’t “endorse and share” the Task Force’s recommendations, and for months the document remained in limbo, turned down for publication in numerous venues, until recently finding a home in The Annals of Applied Statistics. So, it is finally out. What does it say aside from what I have quoted above? I’m guessing that because the statements were unanimous, they couldn’t go much beyond some rather unobjectionable claims. It’s quite short, and there’s also an editorial by Kafadar (editor-in-chief of the journal) in the same issue.

I imagine a statistical significance tester raising these objections to the task force report:

1. It does not tell us why, properly used, p-values increase rigor, namely by enabling error probability control.

2. It implicitly seems to accept the view that using thresholds means viewing test outcomes as leading directly to decisions or actions (as in the behavioristic interpretation of Neyman-Pearson tests), rather than as part of an appraisal of evidence. In fact, any time you test a claim (or compute power) you are implicitly using a threshold: it must be specified, in advance, that not all outcomes will be allowed to be taken as evidence for a given claim.

3. It doesn’t tell us what’s meant by “properly used”. While this might be assumed to be uncontroversial, nowadays you will sometimes hear critics aver that ‘of course the tests are fine if properly used’, but then in the next breath suggest that this requires p-values to agree with quantities measuring very different things (since ‘that’s what people want’). Even the meaning of “abandon” statistical significance has become highly equivocal (e.g., McShane et al. 2019).

4. Others? (Use the comments, or put them in your guest blog contributions.)
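As to point 1, what “error probability control” buys can be seen in a minimal simulation sketch (my own illustration, not from the Task Force report; the simple z-test, the sample size of 30, and α = 0.05 are arbitrary choices): if a tester rejects the null hypothesis only when p < α, then, when the null is in fact true, she erroneously rejects only about a fraction α of the time, however often the test is run.

```python
import math
import random

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean = mu0, with sigma known (simple z-test)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # 2 * (1 - Phi(|z|)), written via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)          # fixed seed so the simulation is repeatable
alpha = 0.05            # the pre-specified threshold
trials = 20_000
rejections = 0
for _ in range(trials):
    # generate data with the null hypothesis (mean 0) actually true
    sample = [random.gauss(0.0, 1.0) for _ in range(30)]
    if z_test_pvalue(sample) < alpha:
        rejections += 1

rejection_rate = rejections / trials
print(rejection_rate)   # close to alpha, i.e. roughly 0.05
```

The same pre-specified threshold is what makes a power calculation meaningful: power is the probability of crossing that threshold when a given discrepancy from the null is present, which is why computing power, too, presupposes a threshold.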

On the meta-level, of course, she would be concerned about the communication breakdown suggested by the very fact that the ASA Board felt the need to appoint a Task Force to dispel the supposition that a position advanced by its Executive Director reflects the views of the ASA itself.[3] Still, in today’s climate of anti-statistical significance test fervor, the Task Force’s pressing on to find a home for its report when the ASA declined to make it public is impressive, if not heroic. We need more of that type of independence if scientific integrity is to be restored. The Task Force deserves further accolades for sparing us, for once, a rehearsal of the well-known howlers and abuses of tests that have long been lampooned.

Here’s the announcement of the NISS Program for August 2, 2021 (5pm ET)

NISS Affiliate liaisons and representatives from academia, industry, and government institutions traditionally meet over lunch at JSM to catch up with one another and hear from speakers on a topic of current interest. This year (even though the event takes place in the evening), the featured ‘luncheon’ speaker will be Karen Kafadar, the 2019 ASA President, from the University of Virginia. Karen initiated the ASA Task Force on Statistical Significance and Replicability during her presidential year; it was convened to address issues surrounding the use of p-values and statistical significance, as well as their connection to replicability. Xuming He from the University of Michigan and Linda Young from NASS, who served as co-chairs of this Task Force, will summarize the discussion leading to the final report and invite other task force members to join the discussion.

Luncheon Speakers: Karen Kafadar, 2019 ASA President (University of Virginia); Xuming He, Task Force Co-chair (University of Michigan); Linda Young, Task Force Co-chair (NASS); Stephen Stigler (University of Chicago); Nancy Reid (University of Toronto); and Yoav Benjamini (Tel Aviv University).

This is a Special Event that is open to the public. You need not be a member of a NISS Affiliate Institution to attend the event this year.  Invite your colleagues!

NOTES

[1] Some have asked for more time, so I’m extending the deadline through the month of October, although I may well post some as they come in. They can be as long as you think apt. We can always post part of your commentary and link to the remainder (write me with questions). All who have their guest post included will receive a free copy of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

[2] Since the clarification was not made public until December 2019, an editorial I wrote, “P-value thresholds, forfeit at your peril”, mistakenly referred to WSL 2019 as ASA II: https://errorstatistics.com/wp-content/uploads/2019/11/mayo-2019-forfeit-own-peril-european_journal_of_clinical_investigation-2.pdf

[3] It would seem that a public disclaimer by the authors sent around to members and journals would have avoided this. Kafadar had indicated early on that the Task Force recommendations would include a call for a “Disclaimer on all publications, articles, editorials, … authored by ASA Staff (e.g., as required for U.S. Govt employees)”.
At any rate, this was in Kafadar’s slide presentation at our JSM forum. Perhaps at Monday’s forum someone will ask: Why was that sensible recommendation deleted from the final report?

REFERENCES:

Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)

Kafadar, K. (2019). “The Year in Review … And More to Come”. Amstat News (Dec. 2019).

Kafadar, K. (2020). “Task Force on Statistical Significance and Replicability”. ASA Amstat Blog (Feb. 1, 2020).

Kafadar, K. (2021). “Editorial: Statistical Significance, P-Values, and Replicability”. The Annals of Applied Statistics.

Mayo, D. G. (2020). Rejecting Statistical Significance Tests: Defanging the Arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association. 236-256.

Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril”. European Journal of Clinical Investigation 49(10). EJCI-2019-0447.

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). “Abandon Statistical Significance”. The American Statistician 73(S1), 235–245.

Wasserstein, R. & Lazar, N. (2016). “The ASA’s Statement on p-Values: Context, Process, and Purpose”. The American Statistician 70(2), 129–133. See also “The American Statistical Association’s Statement on and of Significance” (March 17, 2016).

Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05” (Editorial). The American Statistician 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913

 

(SELECTED) BLOGPOSTS ON WSL 2019 FROM ERRORSTATISTICS.COM:

March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”

June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

July 19, 2019: “The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)”

September 19, 2019: “(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access).” The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.

November 4, 2019: “On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests”

November 14, 2019: “The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)”

November 30, 2019: “P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)”

December 13, 2019: “’Les stats, c’est moi’: We take that step here! (Adopt our fav word or phil stat!)(iii)”

August 4, 2020: “August 6: JSM 2020 Panel on P-values & ‘Statistical significance’”

October 16, 2020: “The P-Values Debate”

December 13, 2020: “The Statistics Debate (NISS) in Transcript Form”

January 9, 2021: “Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?”

June 20, 2021:  “At Long Last! The ASA President’s Task Force Statement on Statistical Significance and Replicability”

June 28, 2021: “Statisticians Rise Up To Defend (error statistical) Hypothesis Testing”

Kafadar, K. (2020) JSM slides.

 

COMMENTS

  1. pmbrown

    I don’t see the regulatory context given a lot of attention in these discussions, maybe I have missed it, although you mention the FDA’s use of standard thresholds in the JSM 2020 presentation and how the ASA paper amounts to a gate-keeper ban. The regulatory statistician would note how important secondary endpoints have become in drug development. Companies provide long lists of secondary endpoints in the hope of differentiating between their product and their competitor’s (very similar) product – a difference on a secondary outcome can translate into a lot of money. In response the regulatory agencies request a hierarchical testing strategy for these endpoints (using p<0.05 to proceed to the next outcome). Also for composite endpoints which amalgamate a subset of secondary endpoints. The next president of the ASA, Dionne Price, is an FDA statistician, maybe this means the pragmatic value of p-values will soon be more widely appreciated ….?

