abandon statistical significance

Comments on “The ASA p-value statement 10 years on” (ii)

Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Director’s editorial in The American Statistician (TAS), the 2020 ASA (President’s) Task Force, and the various casualties of the related teeth pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here: 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting which gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Director’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying relationships between reactions. Here’s how Matthews’ article begins:

Popularised in the 1920s by the hugely influential English statistician Ronald Fisher, p-values lie at the heart of “significance testing”, widely used by researchers to claim to have found something interesting lurking in data. Yet despite their ubiquity in research journals, p-values have also long been criticised as misunderstood, misleading and open to abuse. The problem lies in their definition. p-values typically give the chances of getting an effect at least as impressive as that seen, assuming it’s actually just a fluke. If these chances are sufficiently low – less than 0.05 is the traditional standard – the finding is then deemed “statistically significant”. For many researchers, this has been taken as implying that their finding is not a fluke, and worth taking seriously. But this overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid…

Wait a minute. According to Matthews, taking a small p-value as evidence the observed effect is not a fluke “overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid.” This overlooks the very nature of reductio (or indirect or falsificationist) proofs. Take the proof that there’s no smallest positive rational number: Assume q is the smallest positive rational. If so, q/2 would be a smaller positive rational. From this contradiction, infer there is no smallest positive rational number. It is a deductively valid argument. P-value reasoning is a statistical version of the reductio argument, providing a statistical contradiction to the fluke assumption, with an associated error probability. The small p-value tells us it’s very probable (1-p) that a smaller effect would have resulted, were it due to chance alone. Replicating the small p-value strengthens the contradiction further. [0] So can we please stop saying that assuming a claim C in a reductio argument precludes finding evidence to falsify C?
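
To make the statistical reductio concrete, here is a minimal sketch in Python. The coin-tossing setup (16 heads in 20 tosses) and the function name `binomial_p_value` are illustrative assumptions, not anything taken from Matthews’ article or the ASA statement. The fluke hypothesis is assumed only in order to compute how probable so large a result would be if it held.

```python
from math import comb


def binomial_p_value(n: int, observed: int, theta0: float = 0.5) -> float:
    """P(X >= observed; H0), where X ~ Binomial(n, theta0) under H0."""
    return sum(
        comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        for k in range(observed, n + 1)
    )


# Hypothetical data: 16 heads in 20 tosses of a coin assumed (under H0) to be fair.
p = binomial_p_value(n=20, observed=16)

# 1 - p: probability of a smaller result (fewer than 16 heads) under chance alone.
prob_smaller_under_h0 = 1 - p

# Probability that two independent experiments would both do this well under H0.
p_both = p * p

print(f"p-value                    = {p:.5f}")
print(f"P(smaller result; H0), 1-p = {prob_smaller_under_h0:.5f}")
print(f"P(two such results; H0)    = {p_both:.7f}")
```

In this made-up case p is about 0.006, so a result this extreme would rarely occur were the coin fair, and two independent results this extreme would occur far more rarely still. Assuming H0 for the calculation is exactly what allows the data to count against it.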

The assumption in the null hypothesis is just an “implicationary assumption” for purposes of drawing out the consequences of C. Overlooking falsificationist logic is at the heart of today’s confusion over p-value reasoning. If we could run an experiment in which the p-value critics magically became falsificationists for one day, I think the scales would fall from the eyes of a statistically significant proportion of them, at least during that time.[2]

Admittedly, statistical significance tests are just a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). The simple Fisherian test that the 2016 Statement restricts itself to–there’s just the single null hypothesis without considering alternatives or power–is an even smaller part. But even they have important uses, especially in testing the assumptions of statistical models (misspecification tests). In any event, their limited use is not grounds for misinterpreting their logic. Much less is it grounds to abandon or retire them.

Returning to Matthews:

“Finally, in 2021, the ASA issued [3] another statement, this time from a Presidential Task Force whose focus was not promoting the 2016 principles but addressing concerns” that an editorial in TAS–I’ll call it the ASA Executive Director editorial–“might be seen as official ASA policy.” Why the worry it might be seen as ASA policy? One reason was that one of the authors was the ASA Executive Director, Wasserstein. The second was that it sounded like a continuation of the 2016 ASA statement–which is ASA policy. According to the 2019 Executive Director’s editorial, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and its authors announce: “We take that step here….‘statistically significant’–don’t say it and don’t use it”. The use of p-value thresholds is also verboten. “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (2019 Executive Director Editorial, p. 2).

Then ASA president Karen Kafadar (2019) wrote in an ASA Newsletter:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

So she appointed a Task Force in 2019. Its full (1 page) report is in The Annals of Applied Statistics, and also on my blogpost.[4] The report (Benjamini et al. 2021) begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Among its main points:

  • the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned…
  • P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.
  • They are important tools that have advanced science through their proper application. …(Benjamini et al. 2021)

According to Matthews:

“For those who saw improper use and misinterpretation as the key issue in the p-value debate, this seemed to miss the point.”

But defending the scientific value of a tool when an Executive Director’s editorial is calling for its abandonment is exactly to the point. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing falsification even of the statistical variety. What’s the point of requiring replication if at no point can you say an effect has failed to replicate?
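
The error-control point can be sketched with a small simulation. Everything in it is an assumption chosen for illustration rather than something from the Task Force report: a one-sided z-test, a null hypothesis that is true in every trial, a threshold of 0.05 fixed before the data are seen, and the hypothetical helper names `false_rejection_rate` and `replicated`.

```python
import random
from math import erfc, sqrt

ALPHA = 0.05  # threshold fixed in advance (illustrative choice)


def one_sided_p(z: float) -> float:
    """P(Z >= z; H0) for a standard normal test statistic."""
    return 0.5 * erfc(z / sqrt(2))


def false_rejection_rate(n_trials: int, alpha: float, rng: random.Random) -> float:
    """Fraction of trials with p <= alpha when H0 is true in every trial."""
    rejections = sum(
        one_sided_p(rng.gauss(0.0, 1.0)) <= alpha for _ in range(n_trials)
    )
    return rejections / n_trials


rng = random.Random(1)  # fixed seed so the sketch is reproducible
rate = false_rejection_rate(n_trials=20_000, alpha=ALPHA, rng=rng)
print(f"threshold = {ALPHA}, simulated false-rejection rate = {rate:.3f}")


def replicated(p_original: float, p_replication: float, alpha: float = ALPHA) -> bool:
    """With a predesignated threshold, 'failed to replicate' has a definite meaning."""
    return p_original <= alpha and p_replication <= alpha
```

With the threshold set in advance, the simulated rate of erroneous rejections stays close to 0.05, and `replicated(0.01, 0.40)` returns a definite `False`. Without any threshold there is no rule to simulate and no outcome that would count as a failed replication.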

Maybe the ASA should invite 10-year reflections, or maybe they’re out there and I haven’t seen them.

Please share your queries and thoughts in the comments.

References
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Some related posts (search this blog for others):

March 7, 2016: “Don’t throw out the error control baby with the bad statistics bathwater”
May 21, 2024: 5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)
June 20, 2021: At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
May 31, 2024: 2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest
June 17, 2019: The 2019 ASA executive editor’s guide to p-values: Don’t say what you don’t mean
June 4, 2024: 2-4 year review: commentaries on my editorial
May 15, 2022: 2-4 year review: commentaries on my editorial

My editorial: The statistics wars and intellectual conflicts of interest

 


[0] p-value. The significance test arises to test the conformity of the particular data under analysis with H0 in some respect: To do this we find a function t = t(y) of the data, to be called the test statistic, such that

  • the larger the value of t the more inconsistent are the data with H0;
  • the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…[We define the] p-value corresponding to any t as p = p(t) = P(T ≥ t; H0). (Mayo and Cox 2006, p. 81)
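
A minimal sketch of this definition in Python, under assumptions chosen only for illustration (they are not Mayo and Cox’s example): the data are n independent N(μ, 1) observations, H0 is μ = 0, and the test statistic is t(y) = √n · mean(y), so that T is standard normal when H0 is true. The helper names `t_stat`, `exact_p_value` and `simulated_p_value` are hypothetical.

```python
import random
from math import erfc, sqrt
from typing import Sequence


def t_stat(y: Sequence[float]) -> float:
    """t(y) = sqrt(n) * mean(y): larger values are more inconsistent with H0: mu = 0."""
    n = len(y)
    return sqrt(n) * sum(y) / n


def exact_p_value(t: float) -> float:
    """p(t) = P(T >= t; H0), using the known N(0, 1) distribution of T under H0."""
    return 0.5 * erfc(t / sqrt(2))


def simulated_p_value(t: float, n: int, n_sims: int, rng: random.Random) -> float:
    """Approximate P(T >= t; H0) by generating data sets from H0 itself."""
    hits = 0
    for _ in range(n_sims):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if t_stat(y) >= t:
            hits += 1
    return hits / n_sims


rng = random.Random(0)  # fixed seed so the sketch is reproducible
t_obs = 2.0             # hypothetical observed value of the test statistic
p_exact = exact_p_value(t_obs)
p_sim = simulated_p_value(t_obs, n=25, n_sims=10_000, rng=rng)
print(f"exact p = {p_exact:.4f}, simulated p = {p_sim:.4f}")
```

The simulated value should land close to the exact one, since both follow the same recipe: take H0 as given and ask how often a test statistic at least as large as the observed one would arise.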

[1] The 2016 ASA Statement’s six principles: 1. P-values can indicate how incompatible the data are with a specified statistical model. 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. 4. Proper inference requires full reporting and transparency. 5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[2] There are a few critics who are falsificationists, notably Andrew Gelman.

[3] The 2019 ASA [president’s] task force submitted its statement to the ASA in 2020, and for a long time its contents were shrouded in mystery. It was eventually published in 2021 in the Annals of Applied Statistics, where Kafadar was editor in chief.

[4] The 2019 Task Force members: Linda Young (Co-Chair), Xuming He (Co-Chair), Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar (Ex-officio). (Kafadar 2020)

 

Categories: abandon statistical significance, ASA Task Force on Significance and Replicability, P-values, significance tests, stat wars and their casualties | 22 Comments

An exchange between A. Gelman and D. Mayo on abandoning statistical significance: 5 years ago

Below is an email exchange that Andrew Gelman posted on this day 5 years ago on his blog, Statistical Modeling, Causal Inference, and Social Science.  (You can find the original exchange, with its 130 comments, here.) Note: “Me” refers to Gelman. I will share my current reflections in the comments.

Exchange with Deborah Mayo on abandoning statistical significance

Continue reading

Categories: 5-year memory lane, abandon statistical significance, Gelman blogs an exchange with Mayo | 4 Comments

Georgi Georgiev (Guest Post): “The frequentist vs Bayesian split in online experimentation before and after the ‘abandon statistical significance’ call”

Georgi Georgiev

  • Author of Statistical methods in online A/B testing
  • Founder of Analytics-Toolkit.com
  • Statistics instructor at CXL Institute

In online experimentation, a.k.a. online A/B testing, one is primarily interested in estimating if and how different user experiences affect key business metrics such as average revenue per user. A trivial example would be to determine if a given change to the purchase flow of an e-commerce website is positive or negative as measured by average revenue per user, and by how much. An online controlled experiment would be conducted with actual users assigned randomly to either the currently implemented experience or the changed one. Continue reading

Categories: A/B testing, abandon statistical significance, optional stopping | 25 Comments

Andrew Gelman (Guest post): (Trying to) clear up a misunderstanding about decision analysis and significance testing

Professor Andrew Gelman
Higgins Professor of Statistics
Professor of Political Science
Director of the Applied Statistics Center
Columbia University

 

(Trying to) clear up a misunderstanding about decision analysis and significance testing

Background

In our 2019 article, Abandon Statistical Significance, Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I talk about three scenarios: summarizing research, scientific publication, and decision making.

In making our recommendations, we’re not saying it will be easy; we’re just saying that screening based on statistical significance has lots of problems. P-values and related measures are not useless–there can be value in saying that an estimate is only 1 standard error away from 0 and so it is consistent with the null hypothesis, or that an estimate is 10 standard errors from zero and so the null can be rejected, or that an estimate is 2 standard errors from zero, which is something that we would not usually see if the null hypothesis were true. Comparison to a null model can be a useful statistical tool, in its place. The problem we see with “statistical significance” is when this tool is used as a dominant or default or master paradigm: Continue reading

Categories: abandon statistical significance, gelman, statistical significance tests, Wasserstein et al 2019 | 29 Comments

Aris Spanos Guest Post: “On Frequentist Testing: revisiting widely held confusions and misinterpretations”

Aris Spanos
Wilson Schmidt Professor of Economics
Department of Economics
Virginia Tech

The following guest post (link to PDF of this post) was written as a comment to Mayo’s recent post: “Abandon Statistical Significance and Bayesian Epistemology: some troubles in philosophy v3”.

On Frequentist Testing: revisiting widely held confusions and misinterpretations

After reading chapter 13.2 of the 2022 book Fundamentals of Bayesian Epistemology 2: Arguments, Challenges, Alternatives, by Michael G. Titelbaum, I decided to write a few comments relating to his discussion in an attempt to delineate certain key concepts in frequentist testing with a view to shedding light on several long-standing confusions and misinterpretations of these testing procedures. The key concepts include ‘what is a frequentist test’, ‘what is a test statistic and how it is chosen’, and ‘how the hypotheses of interest are framed’. Continue reading

Categories: abandon statistical significance, Spanos | 13 Comments

Guest Post: Yudi Pawitan: “Update on Behavioral aspects in the statistical significance war-game (‘abandon statistical significance 5 years on’)”

Professor Yudi Pawitan
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet, Stockholm, Sweden

[An earlier guest post on this topic by Y. Pawitan is Jan 10, 2022: Yudi Pawitan: Behavioral aspects in the statistical significance war-game]

Behavioral aspects in the statistical significance war-game

I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian-frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple since it is about statistical praxis; it is no longer Bayesian vs frequentist, with no consensus in sight and with wide implications affecting the day-to-day use of statistics. Continue reading

Categories: abandon statistical significance, game-theoretic analyses, Wasserstein et al. (2019) | 12 Comments

Guest Post: John Park: Abandoning P-values and Embracing Artificial Intelligence in Medicine (thoughts on “abandon statistical significance 5 years on”)

John Park, MD
Medical Director of Radiation Oncology
North Kansas City Hospital
Clinical Assistant Professor
Univ. of Missouri-Kansas City

[An earlier post  by J. Park on this topic: Jan 17, 2022: John Park: Poisoned Priors: Will You Drink from This Well? (Guest Post)]

Abandoning P-values and Embracing Artificial Intelligence in Medicine

The move to abandon P-values that started 5 years ago was, as we say in medicine, merely a symptom of a deeper, more sinister diagnosis. Within medicine, the diagnosis was a lack of statistical and philosophical knowledge. Specifically, this presented as an uncritical move towards Bayesianism away from frequentist methods, that went essentially unchallenged. The debate between frequentists and Bayesians, though longstanding, was little known inside oncology. Out of concern, I sought a collaboration with Prof. Mayo, which culminated in a lecture given at the 2021 American Society of Radiation Oncology meeting. The lecture included not only representatives from frequentist and Bayesian statistics, but another interesting guest that was flying under the radar in my field at that time… artificial intelligence (AI). Continue reading

Categories: abandon statistical significance, Artificial Intelligence/Machine Learning, oncology | 21 Comments

Guest Post: Ron Kenett: What’s happening in statistical practice since the “abandon statistical significance” call

Ron S. Kenett
Chairman of the KPA Group;
Senior Research Fellow, the Samuel Neaman Institute, Technion, Haifa;
Chairman, Data Science Society, Israel

 

What’s happening in statistical practice since the “abandon statistical significance” call

This is a retrospective view from experience gained by applying statistics to a wide range of problems, with an emphasis on the past few years. The post is kept at a general level in order to provide a bird’s eye view of the points being made. Continue reading

Categories: abandon statistical significance, Wasserstein et al 2019 | 26 Comments

Guest Post (part 2 of 2): Daniël Lakens: “How were we supposed to move beyond p < .05, and why didn’t we?”

Professor Daniël Lakens
Human Technology Interaction
Eindhoven University of Technology

[Some earlier posts by D. Lakens on this topic are at the end of this post]*

This continues Part 1:

4: Most do not offer any alternative at all

At this point, it might be worthwhile to point out that most of the contributions to the special issue do not discuss alternative approaches to p < .05 at all. They discuss general problems with low quality research (Kmetz, 2019), the importance of improving quality control (D. W. Hubbard & Carriquiry, 2019), results blind reviewing (Locascio, 2019), or the role of subjective judgment (Brownstein et al., 2019). There are historical perspectives on how we got to this point (Kennedy-Shaffer, 2019), ideas about how science should work instead, many stressing the importance of replication studies (R. Hubbard et al., 2019; Tong, 2019). Note that Trafimow both recommends replication as an alternative (Trafimow, 2019) and co-authors a paper stating we should not expect findings to replicate (Amrhein et al., 2019), thereby directly contradicting himself within the same special issue. Others propose not simply giving up on p-values, but on generalizable knowledge (Amrhein et al., 2019). The suggestion is to only report descriptive statistics. Continue reading

Categories: abandon statistical significance, D. Lakens, Wasserstein et al 2019 | 13 Comments

Guest Post (part 1 of 2): Daniël Lakens: “How were we supposed to move beyond p < .05, and why didn’t we?”

Professor Daniël Lakens
Human Technology Interaction
Eindhoven University of Technology

*[Some earlier posts by D. Lakens on this topic are listed at the end of part 2, forthcoming this week]

How were we supposed to move beyond p < .05, and why didn’t we?

It has been 5 years since the special issue “Moving to a world beyond p < .05” came out (Wasserstein et al., 2019). I might be the only person in the world who has read all 43 contributions to this special issue. [In part 1] I will provide a summary of what the articles proposed we should do instead of p < .05, and [in part 2] offer some reflections on why they did not lead to any noticeable change. Continue reading

Categories: abandon statistical significance, D. Lakens, Wasserstein et al. (2019) | 23 Comments

Guest Post: Christian Hennig: “Statistical tests in five random research papers of 2024, and related thoughts on the ‘don’t say significant’ initiative”

Professor Christian Hennig
Department of Statistical Sciences “Paolo Fortunati”
University of Bologna

[An earlier post by C. Hennig on this topic:  Jan 9, 2022: The ASA controversy on P-values as an illustration of the difficulty of statistics]

Statistical tests in five random research papers of 2024, and related thoughts on the “don’t say significant” initiative

This text follows an invitation to write on “abandon statistical significance 5 years on”, so I decided to do a tiny bit of empirical research. I had a look at five new papers listed on May 17 on the “Research Articles” site of Scientific Reports. I chose the most recent five papers when I looked without being selective. As I “sampled” papers for a general impression, I don’t want this to be a criticism of particular papers or authors; however, in the interest of transparency, the doi addresses of the papers are: Continue reading

Categories: 5-year memory lane, abandon statistical significance, Christian Hennig | 7 Comments

Guest Post: Andrea Saltelli: Analytic flexibility: a badly kept secret? (thoughts on “abandon statistical significance 5 years on”)

Professor Andrea Saltelli
UPF Barcelona School of Management, Barcelona, Spain, Centre for the Study of the Sciences and the Humanities, University of Bergen, Bergen, Norway

[An earlier post by A. Saltelli on this topic: Nov 22, 2019: A. Saltelli (Guest post): What can we learn from the debate on statistical significance?]

Analytic flexibility: a badly kept secret?

In a previous post in this blog I expressed concern about a loss of trust that the activity of scientific quantification – as practiced in several disciplines – could incur unless some technical and normative elements of crisis could be managed. The piece warned that the phenomenon could lead to “a decline of public trust in the findings of science”. Five years and one pandemic later, we may wonder if the danger has indeed materialized. Continue reading

Categories: abandon statistical significance | 11 Comments

2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest

Before posting new reflections on where we are 5 years after the ASA P-value controversy–both my own and readers’–I will reblog some reader commentaries from 2022 in connection with my (2022) editorial in Conservation Biology: “The Statistics Wars and Intellectual Conflicts of Interest”. First, here are excerpts from my editorial: Continue reading

Categories: 3-year memory lane, abandon statistical significance, stat activist watch 2023, stat wars and their casualties | Leave a comment

5-year review: Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils

 

Soon after the Wasserstein et al. (2019) “don’t say significance” editorial, John Ioannidis invited Andrew Gelman and me to write editorials from our different perspectives on an associated editorial that Nature invited. It was written by Amrhein, Greenland and McShane (AGM, 2019). Prior to the publication of AGM 2019, people were given the opportunity to add their names to the Nature article.

A campaign followed that aimed at the collection of signatures in what was called a ‘petition’ on the widely popular blogsite of Andrew Gelman. Ultimately, 854 scientists signed the petition and the list of their names was published along with commentary. (Hardwicke and Ioannidis, 2019, p. 2)

Tom Hardwicke and John Ioannidis (2019) took advantage of the opportunity “to perform a survey of the signatories to understand how and why they signed the endorsement” (ibid.). This post, reblogged from September 25, 2019, includes all 3 articles: the survey by Hardwicke and Ioannidis, and the editorials by Gelman and me. They appeared in the European Journal of Clinical Investigation (2019). I’m still interested in reader responses (in the comments) to the question I pose. Continue reading

Categories: 5-year memory lane, abandon statistical significance | Leave a comment

5-year Review: B. Haig: [TAS] 2019 update on P-values and significance (ASA II)(Guest Post)

This is the guest post by Brian Haig on July 12, 2019 in response to the “abandon statistical significance” editorial in The American Statistician (TAS) by Wasserstein, Schirm, and Lazar (WSL 2019). In the post it is referred to as ASA II, with a note added once we learned that it is actually not a continuation of the 2016 ASA policy statement. (I decided to leave it that way, as otherwise the context seems lost. But in the title to this post, I refer to the journal TAS.) Brian lists some of the benefits that were to result from abandoning statistical significance. I welcome your constructive thoughts in the comments.

Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand Continue reading

Categories: 5-year memory lane, abandon statistical significance, ASA Guide to P-values, Brian Haig | Leave a comment

5-year review: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

In a July 19, 2019 post I discussed The New England Journal of Medicine’s response to Wasserstein’s (2019) call for journals to change their guidelines in reaction to the “abandon significance” drive. The NEJM said “no thanks” [A]. However, confidence intervals (CIs) got hurt in the mix. In this reblog, I kept the reference to “ASA II” with a note, because that best conveys the context of the discussion at the time. Switching it to WSL (2019) just didn’t read right. I invite your comments. Continue reading

Categories: 5-year memory lane, abandon statistical significance, ASA Guide to P-values | 6 Comments

5-year review: Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences

On June 1, 2019, I posted portions of an article [i], “There is Still a Place for Significance Testing in Clinical Trials,” in Clinical Trials responding to the 2019 call to abandon significance. I reblog it here. While very short, it effectively responds to the 2019 movement (by some) to abandon the concept of statistical significance [ii]. I have recently been involved in researching drug trials for a condition of a family member, and I can say that I’m extremely grateful that they are still reporting error statistical assessments of new treatments, and using carefully designed statistical significance tests with thresholds. Without them, I think we’d be lost in a sea of potential treatments and clinical trials. Please share any of your own experiences in the comments. The emphasis in this excerpt is mine:

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials…. Continue reading

Categories: 5-year memory lane, abandon statistical significance, statistical tests | 9 Comments
