ASA Task Force on Significance and Replicability

Comments on “The ASA p-value statement 10 years on” (ii)


Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Director’s editorial in The American Statistician (TAS), the 2020 ASA (President’s) Task Force, and the various casualties of the related teeth pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here: 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting that gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Director’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying relationships between the reactions. Here’s how Matthews’ article begins:

Popularised in the 1920s by the hugely influential English statistician Ronald Fisher, p-values lie at the heart of “significance testing”, widely used by researchers to claim to have found something interesting lurking in data. Yet despite their ubiquity in research journals, p-values have also long been criticised as misunderstood, misleading and open to abuse. The problem lies in their definition. p-values typically give the chances of getting an effect at least as impressive as that seen, assuming it’s actually just a fluke. If these chances are sufficiently low – less than 0.05 is the traditional standard – the finding is then deemed “statistically significant”. For many researchers, this has been taken as implying that their finding is not a fluke, and worth taking seriously. But this overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid…

Wait a minute. According to Matthews, taking a small p-value as evidence the observed effect is not a fluke “overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid.” This overlooks the very nature of reductio (or indirect, or falsificationist) proofs. Take the proof that there is no smallest positive rational number q: assume q is the smallest positive rational; then q/2 would be a smaller positive rational. From this contradiction, infer that there is no smallest positive rational number. It is a deductively valid argument. P-value reasoning is a statistical version of the reductio argument: it provides a statistical contradiction to the fluke assumption, with an associated error probability. The small p-value tells us it’s very probable (1-p) that a smaller effect would have resulted, were it due to chance alone. Replicating the small p-value strengthens the contradiction further.[0] So can we please stop saying that assuming a claim C in a reductio argument precludes finding evidence to falsify C?
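To make the reductio logic concrete, here is a minimal simulation sketch. It is my hypothetical illustration, not an example from the post: all of the numbers (sample size, effect size, normality) are assumed purely for illustration. We treat the fluke hypothesis H0 as an implicationary assumption, derive by simulation what results chance alone makes probable, and check the observed effect against that:

```python
import random
import statistics

# Hypothetical setup: H0 says the effect is a fluke (true mean = 0).
# Suppose a study of n = 100 observations yields a sample mean of 0.3.
random.seed(1)
n = 100
observed_mean = 0.3

# Distribution of the sample mean under H0, by simulation.
sims = 10_000
null_means = [
    statistics.fmean(random.gauss(0, 1) for _ in range(n))
    for _ in range(sims)
]

# p = P(T >= t; H0): how often chance alone produces an effect
# at least as large as the one observed.
p_value = sum(m >= observed_mean for m in null_means) / sims
print(p_value)  # small: chance alone rarely yields so large an effect

# With probability about 1 - p, a smaller effect would have occurred
# were H0 true -- the "statistical contradiction" of the fluke assumption.
```

The structure mirrors the reductio: assume H0, derive its consequences, and find that the data statistically contradict them, with the p-value bounding how often such a contradiction would arise by chance alone.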

The assumption in the null hypothesis is just an “implicationary assumption” for purposes of drawing out its consequences. Overlooking falsificationist logic is at the heart of today’s confusion over p-value reasoning. If we could run an experiment in which the p-value critics magically became falsificationists for one day, I think the scales would fall from the eyes of a statistically significant proportion of them, at least during that time.[2]

Admittedly, statistical significance tests are just a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). The simple Fisherian test that the 2016 Statement restricts itself to (there’s just the single null hypothesis, without considering alternatives or power) is an even smaller part. But even these tests have important uses, especially in testing the assumptions of statistical models (misspecification tests). In any event, their limited use is not grounds for misinterpreting their logic. Much less is it grounds to abandon or retire them.

Returning to Matthews:

“Finally, in 2021, the ASA issued [3] another statement, this time from a Presidential Task Force whose focus was not promoting the 2016 principles but addressing concerns” that an editorial in TAS (I’ll call it the ASA Executive Director’s editorial) “might be seen as official ASA policy.” Why the worry that it might be seen as ASA policy? One reason is that one of its authors was the ASA Executive Director, Wasserstein. A second is that it sounded like a continuation of the 2016 ASA statement, which is ASA policy. According to the 2019 Executive Director’s editorial, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and the authors announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds is also verboten: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (2019 Executive Director’s editorial, p. 2).

Then ASA president Karen Kafadar (2019) wrote in an ASA Newsletter:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

So she appointed a Task Force in 2019. Its full (one-page) report is in The Annals of Applied Statistics, and also on my blogpost.[4] The report (Benjamini et al. 2021) begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Among its main points:

  • “[T]he use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”…
  • P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.
  • They are important tools that have advanced science through their proper application. …(Benjamini et al. 2021)

According to Matthews:

“For those who saw improper use and misinterpretation as the key issue in the p-value debate, this seemed to miss the point.”

But defending the scientific value of a tool when an Executive Director’s editorial is calling for its abandonment is exactly to the point. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing falsification even of the statistical variety. What’s the point of requiring replication if at no point can you say an effect has failed to replicate?

Maybe the ASA should invite 10-year reflections, or maybe they’re out there and I haven’t seen them.

Please share your queries and thoughts in the comments.

References
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes–Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Some related posts (search this blog for others):

March 7, 2016: “Don’t throw out the error control baby with the bad statistics bathwater”
May 21, 2024: 5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)
June 20, 2021: At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
May 31, 2024: 2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest
June 17, 2019: The 2019 ASA executive editor’s guide to p-values: Don’t say what you don’t mean
June 4, 2024: 2-4 year review: commentaries on my editorial
May 15, 2022: 2-4 year review: commentaries on my editorial

My editorial: The statistics wars and intellectual conflicts of interest

 


[0] p-value. The significance test arises to test the conformity of the particular data under analysis with H0 in some respect: To do this we find a function t = t(y) of the data, to be called the test statistic, such that

  • the larger the value of t the more inconsistent are the data with H0;
  • the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…[We define the] p-value corresponding to any t as p = p(t) = P(T ≥ t; H0). (Mayo and Cox 2006, p. 81)
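The definition above, p(t) = P(T ≥ t; H0), can be written out directly. A minimal sketch, assuming (purely for illustration, not part of the Mayo and Cox passage) that the test statistic T is standard normal under H0, as in a z-test of a mean:

```python
from statistics import NormalDist

def p_value(t: float) -> float:
    """p(t) = P(T >= t; H0), with T ~ N(0, 1) under H0 (an assumed model)."""
    return 1 - NormalDist().cdf(t)

# Larger t means data more inconsistent with H0, hence smaller p,
# matching the two bullet points above.
print(round(p_value(1.0), 4))   # ~0.1587
print(round(p_value(1.96), 4))  # ~0.025
print(round(p_value(3.0), 4))   # ~0.0013
```

Any test statistic with a known null distribution slots into the same template; only the distribution of T under H0 changes.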

[1] The 2016 ASA Statement’s six principles: 1. P-values can indicate how incompatible the data are with a specified statistical model. 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. 4. Proper inference requires full reporting and transparency. 5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[2] There are a few critics who are falsificationists, notably Andrew Gelman.

[3] The 2019 ASA [president’s] task force submitted its statement to the ASA in 2020, and for a long time its contents were shrouded in mystery. It was eventually published in 2021 in the Annals of Applied Statistics, where Kafadar was editor-in-chief.

[4] The 2019 Task Force members: Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

 

Categories: abandon statistical significance, ASA Task Force on Significance and Replicability, P-values, significance tests, stat wars and their casualties | 22 Comments

My 2019 friendly amendments to that “abandon significance” editorial


It was three months before I decided to write a blogpost in response to Wasserstein, Schirm and Lazar’s (2019) editorial in The American Statistician, in which they recommend that the concept of “statistical significance” be abandoned; hereafter, WSL 2019. (I titled it “Don’t Say What You Don’t Mean”.) In that June 17, 2019 blogpost, pasted below, I proposed 3 “friendly amendments” to the language of that document. (There are 97 comments on that post!) The problem is that WSL 2019 presents several of the 6 principles from ASA I (the 2016 ASA Statement on P-Values and Statistical Significance) in a far stronger fashion, so as to be inconsistent, or at least in tension, with some of them. I didn’t think they really meant what they said. I discussed these amendments with Ron Wasserstein, Executive Director of the ASA at the time. Had these friendly amendments been carried out, the document would not have caused as much of a problem, and people might have focused more on the positive recommendations it includes about scientific integrity. The proposed ban on a key concept of statistics would still have been problematic, resulting in the 2019 ASA President’s Task Force, but the amendments would have helped the document. At the time, it was still not known whether WSL 2019 was intended as a continuation of the 2016 ASA policy document [ASA I]. That explains why I first referred to WSL 2019 in this blogpost as ASA II. Once it was revealed (many months later) that it was not official policy at all, but only the recommendations of the 3 authors, I placed a “note” after each mention of ASA II. But given that it caused sufficient confusion as to result in the then ASA president (Karen Kafadar) appointing an ASA Task Force on Statistical Significance and Replicability in 2019 (see here and here), and, later, a disclaimer by the authors, in this reblog I refer to it as WSL 2019. You can search this blog for other posts on the 2019 Task Force: their report is here, and the disclaimer here. Continue reading

Categories: 2016 ASA Statement on P-values, ASA Guide to P-values, ASA Task Force on Significance and Replicability | Leave a comment

Too little, too late? The “Don’t say significance…” editorial gets a disclaimer (ii)


Someone sent me an email the other day telling me that a disclaimer had been added to the editorial written by the ASA Executive Director and 2 co-authors (Wasserstein et al., 2019) (“Moving to a world beyond ‘p < 0.05’”). It reads:

 

The editorial was written by the three editors acting as individuals and reflects their scientific views, not an endorsed position of the American Statistical Association.

Continue reading

Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, editorial COIs, WSL 2019 | 20 Comments

January 11 Forum: “Statistical Significance Test Anxiety” : Benjamini, Mayo, Hand

Here are all the slides along with the video from the 11 January Phil Stat Forum with speakers: Deborah G. Mayo, Yoav Benjamini and moderator/discussant David Hand.


Continue reading

Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, P-values, statistical significance | 2 Comments

Nathan Schactman: Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice (Guest Post)


Nathan Schachtman,  Esq., J.D.
Legal Counsel for Scientific Challenges

Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice

The metaphor of law as an “empty vessel” is frequently invoked to describe the law generally, as well as pejoratively to describe lawyers. The metaphor rings true at least in describing how the factual content of legal judgments comes from outside the law. In many varieties of litigation, not only the facts and data, but the scientific and statistical inferences must be added to the “empty vessel” to obtain a correct and meaningful outcome. Continue reading

Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, PhilStat Law, Schachtman | 3 Comments

John Park: Poisoned Priors: Will You Drink from This Well?(Guest Post)


John Park, MD
Radiation Oncologist
Kansas City VA Medical Center

Poisoned Priors: Will You Drink from This Well?

As an oncologist, specializing in the field of radiation oncology, “The Statistics Wars and Intellectual Conflicts of Interest”, as Prof. Mayo’s recent editorial is titled, is of practical importance to me and my patients (Mayo, 2021). Some are flirting with Bayesian statistics to move on from statistical significance testing and the use of P-values. In fact, what many consider the world’s preeminent cancer center, MD Anderson, has a strong Bayesian group that completed 2 early phase Bayesian studies in radiation oncology that have been published in the most prestigious cancer journal, The Journal of Clinical Oncology (Liao et al., 2018 and Lin et al., 2020). This brings about the hotly contested issue of subjective priors, and much ado has been written about the ability to overcome this problem. Specifically in medicine, one thinks of Spiegelhalter’s classic 1994 paper mentioning reference, clinical, skeptical, or enthusiastic priors, which also uses an example from radiation oncology (Spiegelhalter et al., 1994) to make its case. This is all nice in theory, but what if there is ample evidence that the subject matter experts have major conflicts of interest (COIs) and biases, so that their priors cannot be trusted? A debate raging in oncology is whether non-invasive radiation therapy is as good as invasive surgery for early-stage lung cancer patients. This is not a trivial question, as postoperative morbidity from surgery can range from 19-50% and 90-day mortality anywhere from 0-5% (Chang et al., 2021). Radiation therapy is highly attractive as there are numerous reports hinting at equal efficacy with far less morbidity. Unfortunately, 4 major clinical trials were unable to accrue patients for this important question. Why could they not enroll patients, you ask? Long story short: if a patient is referred to radiation oncology and treated with radiation, the surgeon loses out on the revenue, and vice versa. Dr. David Jones, a surgeon at Memorial Sloan Kettering, notes there was no “equipoise among enrolling investigators and medical specialties… Although the reasons are multiple… I believe the primary reason is financial” (Jones, 2015). I am not skirting responsibility for my field’s biases. Dr. Hanbo Chen, a radiation oncologist, notes in his meta-analysis of multiple publications looking at surgery vs radiation that overall survival was associated with the specialty of the first author who published the article (Chen et al., 2018). Perhaps the pen is mightier than the scalpel! Continue reading

Categories: ASA Task Force on Significance and Replicability, Bayesian priors, PhilStat/Med, statistical significance tests | Tags: | 4 Comments

Philip Stark (guest post): commentary on “The Statistics Wars and Intellectual Conflicts of Interest” (Mayo Editorial)


Philip B. Stark
Professor
Department of Statistics
University of California, Berkeley

I enjoyed Prof. Mayo’s comment in Conservation Biology (Mayo, 2021) very much, and agree enthusiastically with most of it. Here are my key takeaways and reflections.

Error probabilities (or error rates) are essential to consider. If you don’t give thought to what the data would be like if your theory is false, you are not doing science. Some applications really require a decision to be made. Does the drug go to market or not? Are the girders for the bridge strong enough, or not? Hence, banning “bright lines” is silly. Conversely, no threshold for significance, no matter how small, suffices to prove an empirical claim. In replication lies truth. Abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of evidence, publication decisions are even more subject to cronyism, “taste”, confirmation bias, etc. Throwing away P-values because many practitioners don’t know how to use them is perverse. It’s like banning scalpels because most people don’t know how to perform surgery. People who wish to perform surgery should be trained in the proper use of scalpels, and those who wish to use statistics should be trained in the proper use of P-values. Throwing out P-values is self-serving to statistical instruction, too: we’re making our lives easier by teaching less instead of teaching better. Continue reading

Categories: ASA Task Force on Significance and Replicability, editorial, multiplicity, P-values | 6 Comments

The ASA controversy on P-values as an illustration of the difficulty of statistics


Christian Hennig
Professor
Department of Statistical Sciences
University of Bologna

The ASA controversy on P-values as an illustration of the difficulty of statistics

“I work on Multidimensional Scaling for more than 40 years, and the longer I work on it, the more I realise how much of it I don’t understand. This presentation is about my current state of not understanding.” (John Gower, world-leading expert on Multidimensional Scaling, at a conference in 2009)

“The lecturer contradicts herself.” (Student feedback to an ex-colleague for teaching methods and then teaching what problems they have)

1 Limits of understanding

Statistical tests and P-values are widely used and widely misused. In 2016, the ASA issued a statement on significance and P-values with the intention of curbing misuse while acknowledging their proper definition and potential use. In my view the statement did a rather good job of saying things that are worthwhile saying while trying to be acceptable both to those who are generally critical of P-values and to those who tend to defend their use. As was predictable, the statement did not settle the issue. A “2019 editorial” by some of the authors of the original statement (recommending “to abandon statistical significance”) and a 2021 ASA task force statement, much more positive on P-values, followed, showing the level of disagreement in the profession. Continue reading

Categories: ASA Task Force on Significance and Replicability, Mayo editorial, P-values | 3 Comments

E. Ionides & Ya’acov Ritov (Guest Post) on Mayo’s editorial, “The Statistics Wars and Intellectual Conflicts of Interest”


Edward L. Ionides


Director of Undergraduate Programs and Professor,
Department of Statistics, University of Michigan

Ya’acov Ritov
Professor
Department of Statistics, University of Michigan

 

Thanks for the clear presentation of the issues at stake in your recent Conservation Biology editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al. 2021). The Benjamini et al. (2021) statement is sensible advice that avoids directly addressing the current debate. For better or worse, it has no references, and just speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what justifications and misconceptions drive the different positions. Consequently, it may be ineffective at communicating to those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” here since we consider that their 2016 statement made an attack on p-values which was forceful, indirect and erroneous. Wasserstein & Lazar (2016) started with a constructive discussion about the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them”, to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully managed to frame and lead the debate, according to Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al. 2017), and we refer the reader to our article for more explanation of these issues (it may be found below). Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al. (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts: “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse has much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time”. Continue reading

Categories: ASA Task Force on Significance and Replicability, editors, P-values, significance tests | 2 Comments

B. Haig on questionable editorial directives from Psychological Science (Guest Post)


Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

 

What do editors of psychology journals think about tests of statistical significance? Questionable editorial directives from Psychological Science

Deborah Mayo’s (2021) recent editorial in Conservation Biology addresses the important issue of how journal editors should deal with strong disagreements about tests of statistical significance (ToSS). Her commentary speaks to applied fields, such as conservation science, but it is relevant to basic research, as well as other sciences, such as psychology. In this short guest commentary, I briefly remark on the role played by the prominent journal, Psychological Science (PS), regarding whether or not researchers should employ ToSS. PS is the flagship journal of the Association for Psychological Science, and two of its editors-in-chief have offered explicit, but questionable, advice on this matter. Continue reading

Categories: ASA Task Force on Significance and Replicability, Brian Haig, editors, significance tests | Tags: | 2 Comments

Invitation to discuss the ASA Task Force on Statistical Significance and Replication


The latest salvo in the statistics wars comes in the form of the publication of the statement of The ASA Task Force on Statistical Significance and Replicability, appointed by past ASA president Karen Kafadar in November/December 2019. (In the ‘before times’!) Its members are:

Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

The full report of this Task Force is in The Annals of Applied Statistics, and on my blogpost. It begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Continue reading

Categories: 2016 ASA Statement on P-values, ASA Task Force on Significance and Replicability, JSM 2020, National Institute of Statistical Sciences (NISS), statistical significance tests | 3 Comments

Statisticians Rise Up To Defend (error statistical) Hypothesis Testing


What is the message conveyed when the board of a professional association X appoints a Task Force intended to dispel the supposition that a position advanced by the Executive Director of association X reflects the views of association X, on a topic that members of X disagree on? What it says to me is that there is a serious breakdown of communication amongst the leadership and membership of that association. So while I’m extremely glad that the ASA appointed the Task Force on Statistical Significance and Replicability in 2019, I’m very sorry that the main reason it was needed was to address concerns that an editorial put forward by the ASA Executive Director (and 2 others) “might be mistakenly interpreted as official ASA policy”. The 2021 Statement of the Task Force (Benjamini et al. 2021) explains:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force…

Continue reading

Categories: ASA Task Force on Significance and Replicability, Schachtman, significance tests | 10 Comments

At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability

The ASA President’s Task Force Statement on Statistical Significance and Replicability has finally been published. It found a home in The Annals of Applied Statistics, after everyone else they looked to (including the ASA itself) refused to publish it. For background, see this post. I’ll comment on it in a later post. There is also an editorial, “Statistical Significance, P-Values, and Replicability”, by Karen Kafadar. Continue reading

Categories: ASA Task Force on Significance and Replicability | 11 Comments
