
Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Director’s editorial in The American Statistician (TAS), the 2019 ASA (President’s) Task Force, and the various casualties of the related teeth-pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here: 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting that gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Director’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying the relationships between reactions. Here’s how Matthews’ article continues:
Popularised in the 1920s by the hugely influential English statistician Ronald Fisher, p-values lie at the heart of “significance testing”, widely used by researchers to claim to have found something interesting lurking in data. Yet despite their ubiquity in research journals, p-values have also long been criticised as misunderstood, misleading and open to abuse. The problem lies in their definition. p-values typically give the chances of getting an effect at least as impressive as that seen, assuming it’s actually just a fluke. If these chances are sufficiently low – less than 0.05 is the traditional standard – the finding is then deemed “statistically significant”. For many researchers, this has been taken as implying that their finding is not a fluke, and worth taking seriously. But this overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid…
Wait a minute. According to Matthews, taking a small p-value as evidence the observed effect is not a fluke “overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid.” This overlooks the very nature of reductio (or indirect, falsificationist) proofs. Take the proof that there is no smallest positive rational number: Assume q is the smallest positive rational. Then q/2 would be a smaller positive rational. From this contradiction, infer there is no smallest positive rational number. It is a deductively valid argument. P-value reasoning is a statistical version of the reductio argument – providing a statistical contradiction to the fluke assumption, with an associated error probability. The small p-value tells us it’s very probable (1-p) that a smaller effect would have resulted, were it due to chance alone. Replicating the small p-value strengthens the contradiction further.[0] So can we please stop saying that assuming a claim C in a reductio argument precludes finding evidence to falsify C?
The assumption in the null hypothesis is just an “implicationary assumption” for purposes of drawing out the consequences of C. Overlooking falsificationist logic is at the heart of today’s confusion over p-value reasoning. If we could run an experiment in which the p-value critics magically became falsificationists for one day, I think the scales would fall from the eyes of a statistically significant proportion of them, at least during that time.[2]
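To see the reductio pattern numerically, here is a minimal simulation sketch (my own illustration; the sample size, effect size, and seed are arbitrary and not drawn from Matthews or the ASA documents). The p-value is computed under the “fluke” (chance-alone) assumption, and a small value is the statistical contradiction of that assumption: with probability roughly 1-p, chance alone would have produced a smaller test statistic.

```python
# Minimal sketch of p-value reasoning as a statistical reductio.
# Hypothetical numbers: a one-sample test of H0: mu = 0 (the "fluke" hypothesis).
import numpy as np

rng = np.random.default_rng(2016)
n = 30

# "Observed" data (made up for illustration), showing an apparent effect.
observed = rng.normal(loc=0.5, scale=1.0, size=n)
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(n))

# Implicationary assumption: suppose H0 is true, i.e., any effect is chance alone.
# Draw many datasets under that assumption and see what it implies for T.
sims = 100_000
null_data = rng.normal(loc=0.0, scale=1.0, size=(sims, n))
t_null = null_data.mean(axis=1) / (null_data.std(axis=1, ddof=1) / np.sqrt(n))

# p-value: how often chance alone yields a statistic at least as large as observed.
p_value = np.mean(t_null >= t_obs)
print(f"observed t = {t_obs:.2f}, simulated p = {p_value:.4f}")

# With probability about 1 - p, chance alone would have produced a smaller
# statistic -- the statistical "contradiction" of the fluke assumption.
print(f"P(smaller statistic under chance alone) ~ {1 - p_value:.4f}")
```

Nothing in the calculation requires the fluke assumption to be true; it only asks what that assumption implies about the data we could expect to see.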
Admittedly, statistical significance tests are just a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). The simple Fisherian test that the 2016 Statement restricts itself to – there’s just a single null hypothesis, without alternatives or power considerations – is an even smaller part. But even these tests have important uses, especially in checking the assumptions of statistical models (misspecification testing). In any event, their limited scope is not grounds for misinterpreting their logic, much less grounds to abandon or retire them.
Returning to Matthews:
“Finally, in 2021, the ASA issued [3] another statement, this time from a Presidential Task Force whose focus was not promoting the 2016 principles but addressing concerns” that an editorial in TAS – I’ll call it the ASA Executive Director’s editorial – “might be seen as official ASA policy.” Why the worry that it might be seen as ASA policy? One reason was that one of the authors was ASA Executive Director Wasserstein. A second was that it sounded like a continuation of the 2016 ASA statement – which is ASA policy. According to the 2019 Executive Director’s editorial, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and its authors announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds is also verboten: “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (2019 Executive Director Editorial, p. 2).
Then-ASA president Karen Kafadar (2019) wrote in an ASA newsletter:
Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.
So she appointed a Task Force in 2019. Its full (one-page) report is in The Annals of Applied Statistics, and also on my blogpost.[4] The report (Benjamini et al. 2021) begins:
In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)
Among its main points:
- the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned…
- P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.
- They are important tools that have advanced science through their proper application. …(Benjamini et al. 2021)
According to Matthews:
“For those who saw improper use and misinterpretation as the key issue in the p-value debate, this seemed to miss the point.”
But defending the scientific value of a tool when an Executive Director’s editorial is calling for its abandonment is exactly to the point. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing falsification even of the statistical variety. What’s the point of requiring replication if at no point can you say an effect has failed to replicate?
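As a rough illustration of the error-control point (a sketch with made-up settings, not anything from the Task Force report): if the threshold α = 0.05 is fixed before looking at the data, the long-run rate of mistakenly declaring an effect when there is none is held at about 5%. Abandon all thresholds, and there is no such guarantee to state.

```python
# Sketch (illustrative settings only): a predesignated threshold alpha controls
# the long-run rate of erroneously declaring an effect when H0 is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 25, 20_000

false_alarms = 0
for _ in range(trials):
    y = rng.normal(loc=0.0, scale=1.0, size=n)   # H0 true: no real effect
    _, p = stats.ttest_1samp(y, popmean=0.0)
    if p < alpha:                                 # the predesignated cut-off
        false_alarms += 1

print(f"type I error rate ~ {false_alarms / trials:.3f} (target: {alpha})")
```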
Maybe the ASA should invite 10-year reflections, or maybe they’re already out there and I haven’t seen them.
Please share your queries and thoughts in the comments.
References
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes–Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Some related posts (search this blog for others):
March 7, 2016: “Don’t throw out the error control baby with the bad statistics bathwater”
May 21, 2024: 5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)
June 20, 2021: At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
May 31, 2024: 2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest
[0] p-value. The significance test arises to test the conformity of the particular data under analysis with H0 in some respect: To do this we find a function t = t(y) of the data, to be called the test statistic, such that
- the larger the value of t the more inconsistent are the data with H0;
- the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.
…[We define the] p-value corresponding to any t as p = p(t) = P(T ≥ t; H0). (Mayo and Cox 2006, p. 81)
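A minimal sketch of this definition (my own toy example, not taken from Mayo and Cox): for a one-sample test of H0: μ = 0 with known σ = 1, the statistic T = √n·ȳ is standard normal under H0, and the p-value is its upper tail area beyond the observed t.

```python
# Sketch of p = p(t) = P(T >= t; H0) for a toy case: H0: mu = 0, known sigma = 1,
# so T = sqrt(n) * ybar has a standard normal distribution when H0 is true.
# (Data values are hypothetical, not from Mayo and Cox 2006.)
import numpy as np
from scipy import stats

y = np.array([0.3, -0.1, 0.8, 0.5, 0.2, 0.9, 0.4, 0.1])  # hypothetical observations
n = len(y)
t = np.sqrt(n) * y.mean()     # larger t means data more discordant with H0
p = stats.norm.sf(t)          # upper-tail area under H0: P(T >= t; H0)

print(f"t = {t:.3f}, p = {p:.4f}")
```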
[1] The 2016 ASA Statement’s six principles:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
[2] There are a few critics who are falsificationists, notably Andrew Gelman.
[3] The 2019 ASA (President’s) Task Force submitted its statement to the ASA in 2020, and for a long time its contents were shrouded in mystery. It was eventually published in 2021 in The Annals of Applied Statistics, where Kafadar was editor-in-chief.
[4] The 2019 Task Force members: Linda Young (Co-Chair), Xuming He (Co-Chair), Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, and Karen Kafadar (ex officio). (Kafadar 2020)


