Prof. Deborah Mayo, Emerita
Department of Philosophy
Virginia Tech

Prof. David Hand
Department of Mathematics (Statistics Section)
Imperial College London
Statistical significance and its critics: practicing damaging science, or damaging scientific practice? (Synthese)
[pdf of full paper.]
Abstract: While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the replication crisis in science. Many blame statistical significance tests themselves, and some regard them as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. We argue that banning the use of p-value thresholds in interpreting data does not diminish but rather exacerbates data-dredging and biasing selection effects. If an account cannot specify outcomes that will not be allowed to count as evidence for a claim (if all thresholds are abandoned), then there is no test of that claim. The contributions of this paper are: to explain the rival statistical philosophies underlying the ongoing controversy; to elucidate and reinterpret statistical significance tests, and explain how this reinterpretation ameliorates common misuses and misinterpretations; and to argue why recent recommendations to replace, abandon, or retire statistical significance undermine a central function of statistics in science: to test whether observed patterns in the data are genuine or due to background variability.
Keywords: Data-dredging · Error probabilities · Fisher · Neyman and Pearson · P-values · Statistical significance tests
Introduction and background
While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the so-called replication crisis in some scientific fields. In those fields, many results that had been found statistically significant are not found to be so (or have smaller effect sizes) when an independent group tries to replicate them. This has led many to blame the statistical significance tests themselves, and some view the use of p-value thresholds as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. In our view, if an account cannot specify outcomes that will not be allowed to count as evidence for a claim (if all thresholds are abandoned), then there is no test of that claim.
….
The goals of our paper are:
- To explain the key issues in the ongoing controversy surrounding statistical significance tests;
- To reinterpret statistical significance tests, and the use of p-values, and explain how this reinterpretation ameliorates common misuses that underlie criticisms of these methods;
- To show that underlying many criticisms of statistical significance tests, and especially proposed alternatives, are often controversial philosophical presuppositions about statistical evidence and inference;
- To argue that recommendations to replace, abandon, or retire statistical significance tests are damaging to scientific practice.
Section 2 sets out the main features of statistical significance tests, emphasizing aspects that are routinely misunderstood, especially by their critics. In Sects. 3 and 4 we will flesh out, and respond to, what seem to be the strongest arguments in support of the view that current uses of statistical significance tests are damaging to science. Section 3 explores five key mistaken interpretations of p-values, how these can lead to damaging science, and how to avoid them. In Sect. 4 we discuss and respond to central criticisms of p-values that arise from presupposing alternative philosophies of evidence and inference. In Sect. 5 we argue that calls to replace, abandon, or retire statistical significance tests are damaging to scientific practice. We argue that the “no threshold” view does not diminish but rather exacerbates data-dredging and biasing selection effects (Sect. 5.1), and undermines a central function of statistics in science: to test whether observed patterns in the data can be explained by chance variation or not (Sect. 5.2). Section 5.3 shows why specific recommendations to retire, replace, or abandon statistical significance yield unsatisfactory tools for answering the significance tester’s question. Finally, in Sect. 6 we pull together the main threads of the discussion, and consider some implications for evaluating statistical methods with integrity.
You can read the rest online.
- REFERENCE:
- Mayo, D.G., Hand, D. Statistical significance and its critics: practicing damaging science, or damaging scientific practice?. Synthese 200, 220 (2022). https://doi.org/10.1007/s11229-022-03692-0.
Other papers in the Special Topic: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications (so far), are here.
We welcome your constructive comments!
Deborah:
I have some disagreements with your article.
At the beginning you “propose to explain why some of today’s attempts to fix statistical practice are actually jeopardizing reliability and integrity.” The title of section 5 is “Abandoning statistical significance tests is damaging scientific practice.” But I don’t see that you offer any evidence in support of this claim, except for some legal case from 2018, which I don’t think tells us anything about scientific practice. Then in section 6 you say, “As noted in Sect. 1, our paper explains why some of today’s attempts to fix statistical practice by abandoning or replacing statistical significance are actually jeopardizing reliability and integrity.” Again, I see no evidence for that statement, and I don’t think the word “actually” helps to strengthen an evidence-free claim. You also say that this could be “the most dramatic example of a scientific discipline shooting itself in the foot.” I think there are many more dramatic examples of a scientific discipline shooting itself in the foot, for example psychology shooting itself in the foot in recent decades by heavily promoting unreplicable research, or, for a more serious case, Soviet biology shooting itself in the foot in the mid-twentieth century by enforcing Lysenkoism. Finally, the article concludes with the statement that “calls to abandon statistical significance are damaging scientific practice.” Again, I see no evidence for this claim of damage.
On the substance of the discussion, I continue to believe that null hypothesis significance testing in general, and p-values in particular, make sense in some narrow settings. It can be informative to know that data are consistent with a particular random number generator, or that they are not. Beyond that, null hypothesis significance tests have great practical use because they’re sometimes a convenient tool that people have, but I don’t think that people are doing “error statistics” when they’re rejecting some uninteresting null hypothesis that no one would ever believe.
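[A minimal sketch of the narrow use alluded to here, namely a “pure significance test” of whether data are consistent with a specified random number generator. The Uniform(0,1) generator, the sample sizes, and the alternative beta generator are purely illustrative assumptions, not anything from the paper or the comment.]

```python
# Hypothetical illustration: a pure significance test of whether data are
# consistent with a specified random number generator (here Uniform(0,1)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Case 1: data actually drawn from the hypothesised generator.
x_ok = rng.uniform(0, 1, size=500)

# Case 2: data drawn from a slightly different generator.
x_bad = rng.beta(1.3, 1.0, size=500)

for label, x in [("uniform draws", x_ok), ("beta(1.3, 1) draws", x_bad)]:
    # Kolmogorov-Smirnov test against the Uniform(0,1) CDF: the p-value is the
    # probability, under that generator, of a discrepancy at least this large.
    stat, p = stats.kstest(x, "uniform")
    print(f"{label}: KS statistic = {stat:.3f}, p-value = {p:.4f}")
```

In the first case the p-value will typically be unremarkable; in the second it will usually be small, indicating the data are not consistent with the hypothesised generator.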
Andrew:
I just found your comment by chance, deep inside the WordPress comments, and I have no idea why I never saw it. I’m so sorry your comment wasn’t put up.
We argue in great detail in the paper why it is damaging to abandon a method that performs a valuable, if limited, job that scientists look to statistics to perform. Moreover, we argue that none of the reasons put forward for abandoning statistical significance tests hold up. Damage comes with reforms that, we argue, enable rather than reveal illicit inferences due to multiple testing and data-dredging: either they obey the likelihood principle (LP) or they block thresholds. All this is plenty of evidence of damage, and the basis for each claim is given. The legal example was just one illustration of the damage already done on the basis of the largely uncontroversial 2016 ASA statement. To go further here I would only be repeating what we say in much greater detail in the paper, but I can do so in a subsequent post (I wanted to reply quickly, since your comments had been sitting there…). You, Gelman, had said, not so very long ago: “[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data.” (Gelman and Shalizi 2013, 10, 20).
As for its being “the most dramatic example of a scientific discipline shooting itself in the foot”: I am prepared to agree with your claim that “there are many more dramatic examples of a scientific discipline shooting itself in the foot…”. That phrase was a quote from a paper by David Hand. But it is shooting itself in the foot nevertheless, even if other cases could be said to be more dramatic. Perhaps the recent ASA (President’s) Task Force, and the even more recent disclaimer on Wasserstein et al. (2019), discussed in my current blog, corroborate at least parts of our allegations.
I totally agree with the points you make in this paper. My only concern is with the P value, which is the probability of an observed value, or of more extreme hypothetical values, conditional on another hypothetical value (the null hypothesis). This is a difficult concept for non-statisticians.
The reason I agree with what you write in this paper is that I think (subject to some assumptions, of course) that the above P value is equal to the probability of the null hypothesis or something more extreme, conditional on the same observed value as above. This means that the P value is equal to the probability of non-replication more extreme than the null hypothesis if the study is continued with impeccable consistency until there are an infinite number of observations (not just repeated with the same number of observations performed so far). The probability of replication less extreme than the null hypothesis after an infinite number of observations is therefore 1-P. Exactly the same considerations apply if ‘significant’ thresholds or bright lines are applied to probabilities of replication or non-replication, as does their vulnerability to ‘p-hacking’ etc., which is why I agree totally with what you write in your paper.
All this depends on the assumption that the likelihood distribution of the observed values is symmetrical and that the possible observed values are the same as the possible hypothetical values after an infinite number of observations. As each possible value is unique, the prior probabilities of the possible observations conditional on the universal set of all numbers are uniform and equal to the prior probabilities of each possible result of the hypothetical values conditional on the universal set of all numbers after an infinite number of observations. The likelihood distribution probabilities will therefore sum to 1.
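[A minimal numerical sketch of the symmetry being described in this comment, under one concrete set of assumptions: a normal likelihood with known standard deviation and an improper flat prior on the mean. The specific values of mu0, sigma, n, and xbar are illustrative only; this is offered as an illustration, not a general proof of the claim.]

```python
# Hypothetical check: with a normal likelihood (known sigma) and a flat prior,
# the one-sided p-value equals the posterior probability of the null or beyond.
import numpy as np
from scipy import stats

mu0 = 0.0          # null hypothesis value
sigma = 1.0        # known standard deviation
n = 25             # sample size
xbar = 0.4         # observed sample mean (illustrative)
se = sigma / np.sqrt(n)

# One-sided p-value: probability, under mu = mu0, of a sample mean
# at least as large as the one observed.
p_value = stats.norm.sf(xbar, loc=mu0, scale=se)

# Posterior probability that mu <= mu0 given xbar, under a flat prior:
# the posterior for mu is Normal(xbar, se).
posterior_prob = stats.norm.cdf(mu0, loc=xbar, scale=se)

print(p_value, posterior_prob)   # the two numbers coincide
```

In this symmetric case the two quantities agree exactly, which is the sense in which the P value can be read as a posterior probability under a uniform prior.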
If two studies of identical design are conducted independently, then the first probability distribution will be conditional on the first study result as well as the universal set of all numbers. If the latter is combined with the likelihood distribution of the second study to create a second posterior distribution, then this represents a simple frequentist meta-analysis combining the first and second studies. I emphasise that the above prior probability distribution is the result of combining a uniform prior probability conditional on the universal set of all numbers with the likelihood distribution from the first study. Bayesians will of course estimate a prior distribution for the first study based on personal background information.
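[A small sketch of the “simple frequentist meta-analysis” described above, assuming normal likelihoods with known standard errors and a flat initial prior; the two study summaries are made-up illustrative numbers.]

```python
# Hypothetical sketch: with normal likelihoods and a flat initial prior, using
# study 1's posterior as the prior for study 2 gives the usual
# inverse-variance-weighted (fixed-effect) combination of the two studies.
import numpy as np

# Illustrative study summaries: estimate and standard error.
est1, se1 = 0.40, 0.15
est2, se2 = 0.25, 0.10

# Precisions (inverse variances).
w1, w2 = 1 / se1**2, 1 / se2**2

# Combined ("posterior") mean and standard error after both studies.
combined_est = (w1 * est1 + w2 * est2) / (w1 + w2)
combined_se = np.sqrt(1 / (w1 + w2))

print(combined_est, combined_se)
# The same numbers arise from a fixed-effect meta-analysis of the two studies.
```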
When distributions are not symmetrical (e.g. when they are based on binomial distributions) the situation is more complex. The prior probabilities of possible results conditional on the universal set of all numbers will still be uniform but may be different for the possible observations and the possible postulated outcomes. This is explored in detail elsewhere [1]. I intend to explain the above interpretation of P values as being usually equal to a probability of replication after an infinite number of observations to medical students in the next edition of the Oxford Handbook of Clinical Diagnosis (unless someone provides valid objections). I would be grateful for advice.
Reference
1. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302