Edward L. Ionides

Director of Undergraduate Programs and Professor,

Department of Statistics, University of Michigan

Department of Statistics, University of Michigan

Thanks for the clear presentation of the issues at stake in your recent *Conservation Biology* editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al. 2021). The Benjamini et al. (2021) statement is sensible advice: for better or worse, it has no references, and simply speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what justifications and misconceptions drive the different positions. Consequently, it may be ineffective at communicating to those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” here because we consider that their 2016 statement made an attack on p-values which was forceful, indirect, and erroneous. Wasserstein & Lazar (2016) started with a constructive discussion of the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them,” to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully framed and led the debate, judging by citation counts on Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al. 2017), and we refer the reader to our article, reproduced below, for more explanation of these issues. Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al. (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts: “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse bears much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time.”

Benjamini, Y., De Veaux, R.D., Efron, B., Evans, S., Glickman, M., Graubard, B.I., He, X., Meng, X.L., Reid, N.M., Stigler, S.M. and Vardeman, S.B., 2021. ASA President’s Task Force Statement on Statistical Significance and Replicability. Annals of Applied Statistics, 15(3), pp. 1084-1085.

Ionides, E.L., Giessing, A., Ritov, Y. and Page, S.E., 2017. Response to the ASA’s statement on p-values: context, process, and purpose. The American Statistician, 71(1), pp. 88-89.

Mayo, D.G., 2021. The statistics wars and intellectual conflicts of interest. Conservation Biology, to appear (online 2021).

Wasserstein, R.L. and Lazar, N.A., 2016. The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), pp. 129-133.

Wasserstein, R.L., Schirm, A.L. and Lazar, N.A., 2019. Moving to a world beyond “p < 0.05”. The American Statistician, 73(sup1), pp. 1-19.

******

**THE AMERICAN STATISTICIAN 71(1): 88-89.**

**LETTERS TO THE EDITOR**

## Response to the ASA’s Statement on p-Values: Context, Process, and Purpose

Edward L. Ionides^{a}, Alexander Giessing^{a}, Yaacov Ritov^{a}, and Scott E. Page^{b}

^{a}Department of Statistics, University of Michigan, Ann Arbor, MI; ^{b}Departments of Complex Systems, Political Science and Economics, University of Michigan, Ann Arbor, MI

The ASA’s statement on *p*-values: context, process, and purpose (Wasserstein and Lazar 2016) makes several reasonable practical points on the use of *p*-values in empirical scientific inquiry. The statement then goes beyond this mandate, and in opposition to mainstream views on the foundations of scientific reasoning, to advocate that researchers should move away from the practice of frequentist statistical inference and deductive science. Mixed with the sensible advice on how to use *p*-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid *p*-values. They don’t tell you what you want to know.” We support the idea of an activist ASA that reminds the statistical community of the proper use of statistical tools. However, any tool that is as widely used as the *p*-value will also often be misused and misinterpreted. The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.

In particular, the ASA’s statement ends by suggesting that other approaches, such as Bayesian inference and Bayes factors, should be used to solve the problems of using and interpreting *p*-values. Many committed advocates of the Bayesian paradigm were involved in writing the ASA’s statement, so perhaps this conclusion should not surprise the alert reader. Other applied statisticians feel that adding priors to the model often does more to obfuscate the challenges of data analysis than to solve them. It is formally true that difficulties in carrying out frequentist inference can be avoided by following the Bayesian paradigm, since the challenges of properly assessing and interpreting the size and power of a statistical procedure disappear if one does not attempt to calculate them. However, avoiding frequentist inference is not a constructive approach to carrying out better frequentist inference.

On closer inspection, the key issue is a fundamental position of the ASA’s statement on the scientific method, related to but formally distinct from the differences between Bayesian and frequentist inference. Let us focus on a critical paragraph from the ASA’s statement: “In view of the prevalent misuses of and misconceptions concerning *p*-values, some statisticians prefer to supplement or even replace *p*-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes factors; and other approaches such as decision-theoretical modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.”

Some people may want to think about whether it makes scientific sense to “directly address whether the hypothesis is correct.” Some people may have already concluded that usually it does not, and be surprised that a statement on hypothesis testing that is at odds with mainstream scientific thought is apparently being advocated by the ASA leadership. Albert Einstein’s views on the scientific method are paraphrased by the assertion that, “No amount of experimentation can ever prove me right; a single experiment can prove me wrong” (Calaprice 2005). This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. In the words of Popper (1963), “It is easy to obtain confirmations, or verifications, for nearly every theory,” while, “Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability.” The ASA’s statement appears to be contradicting the scientific method described by Einstein and Popper. In case the interpretation of this paragraph is unclear, the position of the ASA’s statement is clarified in their Principle 2: “*p*-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Researchers often wish to turn a *p*-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The *p*-value is neither.” Here, the ASA’s statement misleads through omission: a more accurate end of the paragraph would read, “The *p*-value is neither. Nor is any other statistical test used as part of a deductive argument.” It is implicit in the way the authors have stated this principle that they believe alternative scientific methods may be appropriate to assess more directly the truth of the null hypothesis.
Many readers will infer that the ASA means to imply the inferiority of deductive frequentist methods for scientific reasoning. The ASA statement, in its current form, will therefore make it harder for scientists to defend a choice of frequentist statistical methods during peer review. Frequentist articles will become more difficult to publish, which will create a cascade of effects on data collection, research design, and even research agendas.
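Principle 2 can be made concrete with a small simulation (our own illustration here, not part of the original letter, with all numbers chosen arbitrarily for the sketch): among many z-tests that yield p just below 0.05, the fraction of true nulls depends on the mix of hypotheses being tested and is generally nowhere near 5%, confirming that a p-value is not the probability that the null hypothesis is true.

```python
# Simulate many one-sample z-tests. In half of them the null (mean zero)
# holds; in the other half the true mean is 2.5. We then ask: among tests
# landing just below the 0.05 threshold, how often was the null true?
from math import erfc, sqrt

import numpy as np

rng = np.random.default_rng(1)
n_tests = 200_000
null_true = rng.random(n_tests) < 0.5      # half the null hypotheses hold
mean = np.where(null_true, 0.0, 2.5)       # effect size when the null fails
z = rng.normal(loc=mean)                   # one z-statistic per test

# Two-sided p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
p = np.array([erfc(abs(zi) / sqrt(2)) for zi in z])

just_significant = (p > 0.04) & (p < 0.05)
frac_null = null_true[just_significant].mean()
print(f"P(null true | 0.04 < p < 0.05) ~= {frac_null:.2f}")
```

Under these (arbitrary) settings the printed fraction comes out well above 0.05, and changing the share of true nulls or the effect size moves it substantially, while the p-values themselves do not change: the p-value alone cannot be read as the probability that the hypothesis is true.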

Gelman and Shalizi (2013) provided a relevant discussion of the distinction between deductive reasoning (based on deducing conclusions from a hypothesis and checking whether they can be falsified, permitting data to argue against a scientific hypothesis but not directly for it) and inductive reasoning (which permits generalization, and therefore allows data to provide direct evidence for the truth of a scientific hypothesis). It is held widely, though less than universally, that only deductive reasoning is appropriate for generating scientific knowledge. Usually, frequentist statistical analysis is associated with deductive reasoning and Bayesian analysis is associated with inductive reasoning. Gelman and Shalizi (2013) argued that it is possible to use Bayesian analysis to support deductive reasoning, though that is not currently the mainstream approach in the Bayesian community. Bayesian deductive reasoning may involve, for example, refusing to use Bayes factors to support scientific conclusions. The Bayesian deductive methodology proposed by Gelman and Shalizi (2013) is a close cousin to frequentist reasoning, and in particular emphasizes the use of Bayesian *p*-values.

The ASA probably did not intend to make a philosophical statement on the possibility of acquiring scientific knowledge by inductive reasoning. However, it ended up doing so, by making repeated assertions implying, directly and indirectly, the legitimacy and desirability of using data to directly assess the correctness of a hypothesis. This philosophical aspect of the ASA statement is far from irrelevant for statistical practice, since the ASA position encourages the use of statistical arguments that might be considered inappropriate.

A judgment against the validity of inductive reasoning for generating scientific knowledge does not rule out its utility for other purposes. For example, the demonstrated utility of standard inductive Bayesian reasoning for some engineering applications is outside the scope of our current discussion. This amounts to the distinction Popper (1959) made between “common sense knowledge” and “scientific knowledge.”

## References

Calaprice, A. (2005), *The New Quotable Einstein*, Princeton, NJ: Princeton University Press.

Gelman, A., and Shalizi, C. R. (2013), “Philosophy and the Practice of Bayesian Statistics,” *British Journal of Mathematical and Statistical Psychology*, 66, 8–38.

Popper, K. (1963), *Conjectures and Refutations: The Growth of Scientific Knowledge*, New York: Routledge and Kegan Paul.

Popper, K. R. (1959), *The Logic of Scientific Discovery*, London: Hutchinson.

Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on *p*-Values: Context, Process, and Purpose,” *The American Statistician*, 70, 129–133.


Edward and Yaacov: Thank you so much for your commentary. You were exceptionally prescient in seeing right away (in your letter) that even the 2016 ASA statement had stacked the cards against frequentist testing with its last line about “other approaches”. As you can see, I include that letter in your post. In this way of viewing the 2016 ASA document, of course, the Wasserstein et al. (2019) claim that “it stopped just short” of calling for an end to significance and p-value thresholds is given some weight. But the pretense in the earlier document that “it contained nothing new” goes by the board.