Prof. Deborah Mayo, Emerita

Department of Philosophy

Virginia Tech

Prof. David Hand

Department of Mathematics, Statistics

Imperial College London

**Statistical significance and its critics: practicing damaging science, or damaging scientific practice?** (*Synthese*)

**[pdf of full paper.]**

**Abstract:** While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the replication crisis in science. Many blame statistical significance tests themselves, and some regard them as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. We argue that banning the use of p-value thresholds in interpreting data does not diminish but rather exacerbates data-dredging and biasing selection effects. If an account cannot specify outcomes that will not be allowed to count as evidence for a claim (if all thresholds are abandoned), then there is no test of that claim. The contributions of this paper are: to explain the rival statistical philosophies underlying the ongoing controversy; to elucidate and reinterpret statistical significance tests, and explain how this reinterpretation ameliorates common misuses and misinterpretations; and to argue why recent recommendations to replace, abandon, or retire statistical significance undermine a central function of statistics in science: to test whether observed patterns in the data are genuine or due to background variability.

**Keywords:** Data-dredging · Error probabilities · Fisher · Neyman and Pearson · P-values · Statistical significance tests

**Introduction and background**

While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the so-called replication crisis in some scientific fields. In those fields, many results that had been found statistically significant are not found to be so (or have smaller effect sizes) when an independent group tries to replicate them. This has led many to blame the statistical significance tests themselves, and some view the use of p-value thresholds as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. In our view, if an account cannot specify outcomes that will not be allowed to count as evidence for a claim (if all thresholds are abandoned), then there is no test of that claim.

The goals of our paper are:

- To explain the key issues in the ongoing controversy surrounding statistical significance tests;
- To reinterpret statistical significance tests, and the use of p-values, and explain how this reinterpretation ameliorates common misuses that underlie criticisms of these methods;
- To show that underlying many criticisms of statistical significance tests, and especially proposed alternatives, are often controversial philosophical presuppositions about statistical evidence and inference;
- To argue that recommendations to replace, abandon, or retire statistical significance tests are damaging to scientific practice.

Section 2 sets out the main features of statistical significance tests, emphasizing aspects that are routinely misunderstood, especially by their critics. In Sects. 3 and 4 we will flesh out, and respond to, what seem to be the strongest arguments in support of the view that current uses of statistical significance tests are damaging to science. Section 3 explores five key mistaken interpretations of p-values, how these can lead to damaging science, and how to avoid them. In Sect. 4 we discuss and respond to central criticisms of p-values that arise from presupposing alternative philosophies of evidence and inference. In Sect. 5 we argue that calls to replace, abandon, or retire statistical significance tests are damaging to scientific practice. We argue that the “no threshold” view does not diminish but rather exacerbates data-dredging and biasing selection effects (Sect. 5.1), and undermines a central function of statistics in science: to test whether observed patterns in the data can be explained by chance variation or not (Sect. 5.2). Section 5.3 shows why specific recommendations to retire, replace, or abandon statistical significance yield unsatisfactory tools for answering the significance tester’s question. Finally, in Sect. 6 we pull together the main threads of the discussion, and consider some implications for evaluating statistical methods with integrity.
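The worry about data-dredging and biasing selection effects can be made concrete with a small simulation. The sketch below is an illustration only, not taken from the paper: it assumes a simple two-sided z-test with known variance, independent outcomes, and a 0.05 threshold, and the function names `two_sided_p` and `dredged_error_rate` are hypothetical. Every simulated study has no genuine effect; the only thing that varies is whether one pre-specified outcome is tested or the smallest of 20 p-values is reported.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, using a z-test with known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredged_error_rate(n_outcomes, n_obs=30, alpha=0.05, n_trials=2000, seed=0):
    """Fraction of null-only studies in which the smallest of
    n_outcomes p-values falls at or below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Every outcome is pure noise: the null hypothesis is true throughout.
        p_values = [
            two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        ]
        if min(p_values) <= alpha:
            hits += 1
    return hits / n_trials


single = dredged_error_rate(1)    # one pre-specified test
dredged = dredged_error_rate(20)  # report the best of 20 tests
print(f"one pre-specified outcome: {single:.3f}")
print(f"best of 20 outcomes:       {dredged:.3f}")
print(f"analytic value for 20:     {1 - 0.95 ** 20:.3f}")
```

Under these assumptions the pre-specified test errs at roughly the nominal 5% rate, while searching over 20 independent outcomes produces at least one nominally significant result in about 1 − 0.95^20 ≈ 64% of null-only studies. The reported p-value no longer reflects the actual error probability of the procedure unless the selection is taken into account.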

**REFERENCE:** Mayo, D.G., Hand, D. Statistical significance and its critics: practicing damaging science, or damaging scientific practice? *Synthese* **200**, 220 (2022). https://doi.org/10.1007/s11229-022-03692-0

