Statisticians Rise Up To Defend (error statistical) Hypothesis Testing


What message is conveyed when the board of a professional association X appoints a Task Force to dispel the supposition that a position advanced by the Executive Director of association X reflects the official views of association X, on a topic its members disagree about? What it says to me is that there is a serious breakdown of communication between the leadership and membership of that association. So while I’m extremely glad that the ASA appointed the Task Force on Statistical Significance and Replicability in 2019, I’m very sorry that the main reason it was needed was to address concerns that an editorial put forward by the ASA Executive Director (and 2 others) “might be mistakenly interpreted as official ASA policy”. The 2021 Statement of the Task Force (Benjamini et al. 2021) explains:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force…

It’s also too bad that the statement was blocked for nearly a year, and wasn’t shared by the ASA. In contrast to the 2019 editorial, the Task Force on Statistical Significance and Replicability writes that “P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”. The full statement is in The Annals of Applied Statistics, and on my blogpost. It is very welcome that leading statisticians rose up to block the attitude I describe in this post as Les Stats C’est Moi, an attitude that diminishes the inclusivity of the variety of methodologies and philosophies among ASA members. Where were Nature, Science, and other venues when they had their shot at an article, “Scientists Rise Up in Favor of (error statistical) Hypothesis Testing”? Nowhere to be found.

An excellent overview is given by Nathan Schachtman on his law blog here:

A Proclamation from the Task Force on Statistical Significance

June 21st, 2021

The American Statistical Association (ASA) has finally spoken up about statistical significance testing.[1] Sort of.

Back in February of this year, I wrote about the simmering controversy over statistical significance at the ASA.[2] Back in 2016, the ASA issued its guidance paper on p-values and statistical significance, which sought to correct misinterpretations and misrepresentations of “statistical significance.”[3] Lawsuit industry lawyers seized upon the ASA statement to proclaim a new freedom from having to exclude random error.[4] To obtain their ends, however, the plaintiffs’ bar had to distort the ASA guidance in statistically significant ways.

To add to the confusion, in 2019, the ASA Executive Director published an editorial that called for an end to statistical significance testing.[5] Because the editorial lacked disclaimers about whether or not it represented official ASA positions, scientists, statisticians, and lawyers on all sides were fooled into thinking the ASA had gone whole hog.[6] Then ASA President Karen Kafadar stepped into the breach to explain that the Executive Director was not speaking for the ASA.[7]

In November 2019, members of the ASA board of directors (BOD) approved a motion to create a “Task Force on Statistical Significance and Replicability.”[8] Its charge was

“to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA BOD. The task force will report to the ASA BOD by November 2020.”

The members of the Task Force identified in the motion were:

Linda Young (Nat’l Agricultural Statistics Service & Univ. of Florida; Co-Chair)

Xuming He (Univ. Michigan; Co-Chair)

Yoav Benjamini (Tel Aviv Univ.)

Dick De Veaux (Williams College; ASA Vice President)

Bradley Efron (Stanford Univ.)

Scott Evans (George Washington Univ.; ASA Publications Representative)

Mark Glickman (Harvard Univ.; ASA Section Representative)

Barry Graubard (Nat’l Cancer Instit.)

Xiao-Li Meng (Harvard Univ.)

Vijay Nair (Wells Fargo & Univ. Michigan)

Nancy Reid (Univ. Toronto)

Stephen Stigler (Univ. Chicago)

Stephen Vardeman (Iowa State Univ.)

Chris Wikle (Univ. Missouri)

Tommy Wright (U.S. Census Bureau)

[T]he Taskforce arrived at its recommendations, but curiously, its report did not find a home in an ASA publication. Instead, “The ASA President’s Task Force Statement on Statistical Significance and Replicability” has now appeared as an “in press” publication at The Annals of Applied Statistics, where Karen Kafadar is the editor-in-chief. The report is accompanied by an editorial by Kafadar.

You can read the rest of his post here.

Some links from Schachtman’s blog:

[1] Deborah Mayo, “At Long Last! The ASA President’s Task Force Statement on Statistical Significance and Replicability,” Error Statistics (June 20, 2021).

[2] “Falsehood Flies – The ASA 2016 Statement on Statistical Significance” (Feb. 26, 2021).

[3] Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The Am. Statistician 129 (2016); see “The American Statistical Association’s Statement on and of Significance” (March 17, 2016).

[4] “The American Statistical Association Statement on Significance Testing Goes to Court – Part I” (Nov. 13, 2018); “The American Statistical Association Statement on Significance Testing Goes to Court – Part 2” (Mar. 7, 2019).

[5] “Has the American Statistical Association Gone Post-Modern?” (Mar. 24, 2019); “American Statistical Association – Consensus versus Personal Opinion” (Dec. 13, 2019). See also Deborah G. Mayo, “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean,” Error Statistics Philosophy (June 17, 2019); B. Haig, “The ASA’s 2019 update on P-values and significance,” Error Statistics Philosophy (July 12, 2019); Brian Tarran, “THE S WORD … and what to do about it,” Significance (Aug. 2019); Donald Macnaughton, “Who Said What,” Significance 47 (Oct. 2019).

[6] Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 Am. Statistician S1, S2 (2019).

[7] Karen Kafadar, “The Year in Review … And More to Come,” AmStat News 3 (Dec. 2019) (emphasis added); see Kafadar, “Statistics & Unintended Consequences,” AmStat News 3,4 (June 2019).

[8] Karen Kafadar, “Task Force on Statistical Significance and Replicability,” ASA Amstat Blog (Feb. 1, 2020).

Categories: ASA Task Force on Significance and Replicability, Schachtman, significance tests


10 thoughts on “Statisticians Rise Up To Defend (error statistical) Hypothesis Testing”

  1. Michael Lew

    I’ve just read the statement and it seems to me that the task force has deliberately written about significance tests rather than hypothesis tests, with the latter being (properly, in my opinion) relegated to the occasions where “actions are required”.

    How does that accord with the title of your post?

    • Michael:
      Any decision to claim evidence or interpret data in a given way can be regarded as an action, and is regarded that way by Popper, Neyman, and C. S. Peirce. As for “hypothesis tests”, I was just picking up on Kafadar’s associated editorial:

      Click to access kafadar-editorial-2021.pdf

      I think her viewing statistical significance tests as under the umbrella of hypothesis tests is fairly standard. Of course the TF does recognize the importance of thresholds for p-values. I don’t know the basis for the way they framed their claims.

      • Michael Lew

        “Any decision to claim evidence or interpret data in a given way can be regarded as an action, and is regarded that way by Popper, Neyman, C. S. Peirce.” In other words, almost always. However, the ASA task force writes of the use of a (thoughtfully designed) threshold as something to use in special circumstances. Thus the ASA task force, like me, considers the use of a P-value to be mostly non-decisional. It is an inadequate account of inference to pretend for convenience (or self-validation) that an evidential summary statistic is a decision.

          • Michael: The non-decisional use of a P-value, then, is to be inferential or evidential, right? I concur. But what is inference? It’s usually defined as arriving at a claim based on one or more premises or data (at least that’s how we define it in 30+ years of teaching logic). That is not the cognitive or psychological act itself, but its product. Now I’ve always opposed the so-called “inductive behavior” conception of hypothesis tests. Erich Lehmann was a leader in developing it, but even he only saw that framework as a way to develop methodologies in terms of optimization, not a way to capture scientific inference. I know because I knew him personally for years, and this is what we mostly talked about and wrote to each other about (his are among my last, and much-valued, hand-written letters from statisticians).

          So the key issue doesn’t really turn on this philosophical point (about whether an inference, ontologically, is a kind of decision). If you look at the history, as delineated in my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), you see the historical explanation for why Neyman distinguished himself from the “inferentialists”. He was referring to those who sought uniquely rational rules of belief (such that anyone who deviated from them was irrational). He found them dogmatic. (He was also rejecting the likelihood principle.) So it’s really just an unfortunate historical accident that Neyman didn’t want to be an inferentialist.

          By the way, I’ve never been a fan of the decision-theoretic conceptions of logic and inference so popular in philosophy either. Decision theory involves costs and benefits and utilities, and my view is that this is separate from inference.

          So there’s a jumble of at least 3 core issues here. But back to your main point, although the ASA task force on Stat Sig and Rep likely had in mind ‘special circumstances’ such as deciding whether to approve a drug, that refers to using a p-value threshold for a POLICY DECISION based on an inference, not an inference itself. The policy decision, e.g., to approve a drug, is always a distinct step from the inference. However, the inference based on a test ALSO must have a threshold (in my view), of a different but parallel kind, in the sense that not every outcome is allowed to count as evidence for a claim. You don’t have a test of a claim if you can’t say in advance that not all data will be allowed to count as evidence in its favor. Well, I’ve talked about this point much more elsewhere.

          • Michael Lew

            I doubt that any argument for using a p-value threshold for a policy decision to approve a drug (or not) would be thought reasonable. Drug approval is, I hope, never granted on the basis of just a one-and-done experiment yielding a singular p-value, and that means that drug approval is NOT an example where it is practical to design a threshold and compare a p-value to it (or, more properly, to design a rejection region and see if the test statistic falls within it).

            The fact that people often use drug approval decisions as a setting where one might choose to use a p-value threshold shows that we struggle to find settings where it would be appropriate. Of course, there is nothing wrong with Fisher’s example of quality assurance testing…
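Lew’s quality-assurance setting is the one place in this exchange where a pre-specified rejection threshold is uncontested, and it can be sketched concretely: fix α before inspecting any lot, then compare the one-sided binomial p-value for the observed defect count against it. (A minimal illustration only; the acceptable defect rate, sample size, and α below are hypothetical choices, not taken from the discussion.)

```python
from math import comb

def binom_sf(k, n, p):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

ALPHA = 0.05          # rejection threshold, fixed before any lot is inspected
P_ACCEPTABLE = 0.02   # hypothesized defect rate under the null (illustrative)

def reject_lot(defects, sample_size):
    """Reject the lot when the observed defect count is improbably high under H0."""
    return binom_sf(defects, sample_size, P_ACCEPTABLE) < ALPHA

print(reject_lot(1, 100))   # False: p ≈ 0.87, consistent with the acceptable rate
print(reject_lot(7, 100))   # True: p ≈ 0.004, falls in the pre-set rejection region
```

The point of the sketch is Lew’s: the threshold (equivalently, the rejection region) is designed in advance, so each lot yields a genuine accept/reject decision rather than an after-the-fact evidential summary.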

  2. rkenett

    It is always interesting to go up the river and try to find the source of something, including this p-value thing. I mentioned before on this blog that the ASA “original sin” was the ASA Statement on Using Value-Added Models for Educational Assessment, published on April 8th, 2014. In my book with Galit Shmueli on information quality we assessed the information in this statement and found it very poor. In other words, the ASA did not produce an informative statement. The p-value statement is similar, and the AMSTAT special issue added confusion. Taking some steps back and looking at this from an organizational perspective:
    1. The writing was on the wall.
    2. As far as I know, the ASA did not run a retrospective evaluation of its 2014 VAM statement.
    3. If you do not learn from the past you are bound to repeat mistakes.
    4. Statisticians should be the first to encourage data-driven lessons learned (i.e., running a survey on the impact of an ASA statement).
    5. The impact of the ASA p-value statement and the AMSTAT special issue of Ron and Nicole was mostly educated discussions. As Mayo wrote, “Les Stats C’est Moi” does not work anymore, and statistics (and the ASA) needs to find an opportunity to play a constructive role in the current digital transformation.

    • Ron:
      I vaguely recall that statement. I wonder if there were other ASA “task forces” whose reporting was rejected as was this one. “Les Stats C’est Moi” works fine for the ASA Executive Director.

  3. rkenett

    There is an ASA Statement on the Use of Ketamine for a Non-medical Purpose

    but this is from the American Society of Anesthesiologists.

  4. rkenett

    Mayo, Michael: The point is what we call generalization and operationalization of findings (the 6th and 7th information quality dimensions). One is inferential; the other is about decision making.

    In our book on information quality (Wiley, 2016) we quote Lindley; specifically, in chapter 2, page 27:

    In Lindley’s foreword to an edited volume by Di Bacco et al. (2004), he treats the question “what is meant by statistics” by referring to those he considers the founding fathers: Harold Jeffreys, Bruno de Finetti, Frank Ramsey, and Jimmie Savage:

    “Both Jeffreys and de Finetti developed probability as the coherent appreciation of uncertainty, but Ramsey and Savage looked at the world rather differently. Their starting point was not the concept of uncertainty but rather decision-making in the face of uncertainty. They thought in terms of action, rather than in the passive contemplation of the uncertain world. Coherence for them was not so much a matter of how your beliefs hung together but of whether your several actions, considered collectively, make sense… If one looks today at a typical statistical paper that uses the Bayesian method, copious use will be made of probability, but utility, or maximum expected utility, will rarely get a mention… When I look at statistics today, I am astonished at the almost complete failure to use utility… Probability is there but not utility. This failure has to be my major criticism of current statistics; we are abandoning our task halfway, producing the inference but declining to explain to others how to act on that inference. The lack of papers that provide discussions on utility is another omission from our publications.”

    The meandering discussion on the ASA “clarification” task force is another example of poor information quality. People understand generalization and operationalization of findings. Treating the p-value threshold as inferential, or as driving decision making, is a workaround.

  5. Pingback: Too little, too late? The “Don’t say significance…” editorial gets a disclaimer | Error Statistics Philosophy
