I came across an interesting letter in response to the ASA’s Statement on p-values that I hadn’t seen before. It’s by Ionides, Giessing, Ritov and Page, and it’s very much worth reading. I make some comments below.

## Response to the ASA’s Statement on p-Values: Context, Process, and Purpose

Edward L. Ionides^{a}, Alexander Giessing^{a}, Yaacov Ritov^{a}, and Scott E. Page^{b}

^{a}Department of Statistics, University of Michigan, Ann Arbor, MI; ^{b}Departments of Complex Systems, Political Science and Economics, University of Michigan, Ann Arbor, MI

The ASA’s statement on p-values: context, process, and purpose (Wasserstein and Lazar 2016) makes several reasonable practical points on the use of p-values in empirical scientific inquiry. The statement then goes beyond this mandate, and in opposition to mainstream views on the foundations of scientific reasoning, to advocate that researchers should move away from the practice of frequentist statistical inference and deductive science. Mixed with the sensible advice on how to use p-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid p-values. They don’t tell you what you want to know.” We support the idea of an activist ASA that reminds the statistical community of the proper use of statistical tools. However, any tool that is as widely used as the p-value will also often be misused and misinterpreted. The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.

In particular, the ASA’s statement ends by suggesting that other approaches, such as Bayesian inference and Bayes factors, should be used to solve the problems of using and interpreting p-values. Many committed advocates of the Bayesian paradigm were involved in writing the ASA’s statement, so perhaps this conclusion should not surprise the alert reader. Other applied statisticians feel that adding priors to the model often does more to obfuscate the challenges of data analysis than to solve them. It is formally true that difficulties in carrying out frequentist inference can be avoided by following the Bayesian paradigm, since the challenges of properly assessing and interpreting the size and power for a statistical procedure disappear if one does not attempt to calculate them. However, avoiding frequentist inference is not a constructive approach to carrying out better frequentist inference.

On closer inspection, the key issue is a fundamental position of the ASA’s statement on the scientific method, related to but formally distinct from the differences between Bayesian and frequentist inference. Let us focus on a critical paragraph from the ASA’s statement: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes factors; and other approaches such as decision-theoretical modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.”

Some people may want to think about whether it makes scientific sense to “directly address whether the hypothesis is correct.” Some people may have already concluded that usually it does not, and be surprised that a statement on hypothesis testing that is at odds with mainstream scientific thought is apparently being advocated by the ASA leadership. Albert Einstein’s views on the scientific method are paraphrased by the assertion that, “No amount of experimentation can ever prove me right; a single experiment can prove me wrong” (Calaprice 2005). This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. In the words of Popper (1963), “It is easy to obtain confirmations, or verifications, for nearly every theory,” while, “Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability.” The ASA’s statement appears to be contradicting the scientific method described by Einstein and Popper. In case the interpretation of this paragraph is unclear, the position of the ASA’s statement is clarified in their Principle 2: “p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither.” Here, the ASA’s statement misleads through omission: a more accurate end of the paragraph would read, “The p-value is neither. Nor is any other statistical test used as part of a deductive argument.” It is implicit in the way the authors have stated this principle that they believe alternative scientific methods may be appropriate to assess more directly the truth of the null hypothesis. 
Many readers will infer the ASA to imply the inferiority of deductive frequentist methods for scientific reasoning. The ASA statement, in its current form, will therefore make it harder for scientists to defend a choice of frequentist statistical methods during peer review. Frequentist articles will become more difficult to publish, which will create a cascade of effects on data collection, research design, and even research agendas.

Gelman and Shalizi (2013) provided a relevant discussion of the distinction between deductive reasoning (based on deducing conclusions from a hypothesis and checking whether they can be falsified, permitting data to argue against a scientific hypothesis but not directly for it) and inductive reasoning (which permits generalization, and therefore allows data to provide direct evidence for the truth of a scientific hypothesis). It is held widely, though less than universally, that only deductive reasoning is appropriate for generating scientific knowledge. Usually, frequentist statistical analysis is associated with deductive reasoning and Bayesian analysis is associated with inductive reasoning. Gelman and Shalizi (2013) argued that it is possible to use Bayesian analysis to support deductive reasoning, though that is not currently the mainstream approach in the Bayesian community. Bayesian deductive reasoning may involve, for example, refusing to use Bayes factors to support scientific conclusions. The Bayesian deductive methodology proposed by Gelman and Shalizi (2013) is a close cousin to frequentist reasoning, and in particular emphasizes the use of Bayesian p-values.

The ASA probably did not intend to make a philosophical statement on the possibility of acquiring scientific knowledge by inductive reasoning. However, it ended up doing so, by making repeated assertions implying, directly and indirectly, the legitimacy and desirability of using data to directly assess the correctness of a hypothesis. This philosophical aspect of the ASA statement is far from irrelevant for statistical practice, since the ASA position encourages the use of statistical arguments that might be considered inappropriate.

A judgment against the validity of inductive reasoning for generating scientific knowledge does not rule out its utility for other purposes. For example, the demonstrated utility of standard inductive Bayesian reasoning for some engineering applications is outside the scope of our current discussion. This amounts to the distinction Popper (1959) made between “common sense knowledge” and “scientific knowledge.”

References

Calaprice, A. (2005), The New Quotable Einstein, Princeton, NJ: Princeton University Press.

Gelman, A., and Shalizi, C. R. (2013), “Philosophy and the Practice of Bayesian Statistics,” British Journal of Mathematical and Statistical Psychology, 66, 8–38.

Popper, K. (1963), Conjectures and Refutations: The Growth of Scientific Knowledge, New York: Routledge and Kegan Paul.

Popper, K. R. (1959), The Logic of Scientific Discovery, London: Hutchinson.

Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on p Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133.

2017, Vol. 71, No. 1

1. It’s both refreshing and worrisome to find there are other people out there even more concerned than I that “ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference”. A reason these authors give for saying this is the Guide’s implicit nod to Bayesian and “other” approaches (even though the “others” include frequentist confidence intervals). The Guide is rather vague here. As I indicate in my own commentary (‘Don’t throw out the error control baby with the bad statistics bathwater’), it would be illuminating for the ASA to critically assess these other tools; they should not be beyond scrutiny.

In my view, the strongest reason that a reader of the Guide would view it as recommending against frequentist methods is its lack of clarity regarding **principle 4**, which favors full reporting and transparency. To its credit, the ASA Guide could not be stronger in condemning cherry-picking, data-dredging, and multiple testing as invalidating p-values.

“Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. … Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis.”

Can a reader assume from this last sentence that the ASA Guide maintains that it’s bad science not to at least report, if not also adjust for, data-dredging and such? I hope so. Some think the Guide takes principle 4, thus construed, as referring only to frequentist (error statistical) assessments. Even if the quantitative measures, be they Bayes factors, posteriors, or likelihood ratios, are not directly altered by data-dredging and multiple testing, it does not follow that their inferences are free from the spurious results and lack of error probability control due to selection effects. (This was also a point made by Benjamini in his comment on the ASA Guide.) Note too that additional latitude enters with data-dependent choices of priors. Perhaps the ASA Guide needs to add an amendment as to whether it is claiming principle 4 also holds for other methods, or not. Admittedly, you are free not to care about error control.
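As a rough illustration of the selection effect that principle 4 warns about, the sketch below assumes a researcher runs 20 independent one-sided tests of true null hypotheses and reports only the smallest p-value. The function names, the choice of 20 tests, and the 0.05 threshold are assumptions made for illustration; none of them comes from the ASA Guide or the letter.

```python
import math
import random


def one_sided_p(z):
    """Upper-tail p-value for a standard normal test statistic z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def prob_some_significant(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests gives p < alpha."""
    return 1 - (1 - alpha) ** k


def simulate_selected_rate(k, alpha=0.05, trials=20000, seed=0):
    """Simulate reporting only the smallest of k p-values when every null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest = min(one_sided_p(rng.gauss(0, 1)) for _ in range(k))
        if smallest < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("analytic :", round(prob_some_significant(20), 4))
    print("simulated:", simulate_selected_rate(20))
```

Under these assumptions the nominal 0.05 no longer describes the reported result: the chance of having at least one "significant" finding to report is roughly 0.64, which is one way to see why the selected p-value is uninterpretable unless the selection is disclosed.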

2. I would wish to modify one piece of terminology. The authors of the letter use “induction” to mean what I call a *probabilism*. Here probability is used to quantify degrees of belief, confirmation, plausibility, or support, whether absolute or comparative. Viewing inductive (or “ampliative”) inference in terms of severe testing gets us away from that limitation. Conclusions of statistical tests go strictly beyond their premises, and are qualified by the error probing capacities of the tools. They are a form of scientific induction or learning from data.[o]

Popper allows that anyone who wants to define induction as the procedure of corroborating by severe testing is free to do so; and I do. (STINT* p. 87) The view is clearest in the philosopher C.S. Peirce: “Induction is the experimental testing of a theory” (Peirce, 5.145). Remember what R.A. Fisher said:

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. (Fisher 1935b, p. 54)

We get what I call “lift off”. Statistical testing, as Neyman puts it, is “the frequentists’ theory of induction”. A claim is *severely tested* when it is subjected to and passes a test that probably would have found it flawed if it is. The notion isn’t limited to formal testing, but holds as well for estimation, prediction, exploration and problem solving. You don’t have evidence for a claim if little, if anything, has been done to probe and rule out how it may be flawed.[i]
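To make the idea of a severity assessment a little more concrete, here is a minimal sketch under stated assumptions. It uses one simple formalization for a one-sided test about a normal mean with known standard deviation: the severity for the claim "mu > mu1" is taken to be the probability of a sample mean no larger than the one observed, were mu equal to mu1. The function names and the numbers (sigma = 10, n = 100, observed mean 152) are hypothetical and chosen only for illustration; the post itself does not give this formula.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def severity_mu_greater(mu1, xbar, sigma, n):
    """Severity for the claim mu > mu1, given observed sample mean xbar.

    Assumed formalization: P(sample mean <= xbar) computed under mu = mu1,
    for normal data with known sigma and sample size n.
    """
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    # Hypothetical data: sigma = 10, n = 100, observed mean 152.
    for mu1 in (150, 151, 152, 153):
        sev = severity_mu_greater(mu1, xbar=152, sigma=10, n=100)
        print(f"claim mu > {mu1}: severity {sev:.3f}")
```

With these hypothetical numbers the same data probe "mu > 150" well (severity about 0.98) but "mu > 153" poorly (about 0.16). The assessment attaches to each claim and depends on how capable the test was of revealing that the claim is false.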

Here are a few relevant quotes from the first part of SIST.

Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

Severity (strong): If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is an indication of, or evidence for, C. (p. 23)

A key question for us is the proper epistemic role for probability. It is standardly taken as providing a probabilism, as an assignment of degree of actual or rational belief in a claim, absolute or comparative. We reject this. We proffer an alternative theory: a severity assessment. An account of what is warranted and unwarranted to infer – a normative epistemology – is not a matter of using probability to assign rational beliefs, but to control and assess how well probed claims are. (p. 54)

What we want is an error statistical approach that controls and assesses a test’s stringency or severity. That’s not much of a label. For short, we call someone who embraces such an approach a severe tester. For now I will just venture that a severity scrutiny illuminates all statistical approaches currently on offer. (p. 55)

… A valuable idea to take from Popper is that probability in learning attaches to a method of conjecture and refutation, that is to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data, even leaving murky cases in the middle, and how to advance knowledge of detectable, while strictly unobservable, effects….

[I]nteresting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsification. How then can good science be all about falsifiability? The answer is that we can erect reliable rules for falsifying claims with severity. We corroborate their denials. If your statistical account denies we can reliably falsify interesting theories, it is irrelevant to real-world knowledge. (p. 81)

Statistical hypothesis tests are an excellent tool for inferring such genuine anomalies and corroborating what Popper calls “falsifying hypotheses” (general claims used to statistically falsify theories).[ii]

[o] We reject inductive enumeration, the straight rule, and formal probabilisms. We hold that this is no reason to be prevented from allowing that learning from corroboration is inductive, in the sense that the inference goes beyond the premises. It is qualified by a statement about the error probing capabilities of the method of interest. By contrast, a posterior probability assignment is generally a deductive assignment from given priors and likelihoods. The person who best understood “scientific induction” as severe testing is C.S. Peirce, writing in the late 19th century.

[i] You needn’t accept the severe testing philosophy in order to use it to peel back the layers of key stat wars, and understand why even experts may talk past each other.

[ii] Strictly speaking any method can be made to falsify claims with the addition of a falsification rule. However, the rule must still be shown to be reliable.

\**Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, 2018, CUP).

The authors might have questioned the very idea that a posterior probability tells you the probability a statistical hypothesis is correct, as opposed to a degree of belief in its correctness, someone’s betting rate, or, in the case of default/non-subjective Bayes, a convention conveying something we know not what, with multiple different ways of doing it. But most Bayesians these days don’t give posteriors, but rather comparative claims like Bayes Factors. That doesn’t speak to the “correctness” of the hypotheses compared at all. It might measure how much your belief ratio would change.

This blogpost has received more hits than any other post over the last 7 years of blogging. I investigated and found the hits are coming from Hacker News. https://news.ycombinator.com/item?id=18950298

But note how quiet it is around here.

Unfortunately, I can see the discussion over there is largely focussed on the wrong thing; namely, whether Popper is right. Of course we know that Popper’s view falls far short of his own goals. He doesn’t even give us the tools to take tests as having a high probability of finding flaws, if present, despite this being needed for his notion of corroboration. That doesn’t stop us from taking a number of his ideas as launching off places in STINT.

Nice discussion. Chris Wallace first pointed out to me what you note above (and what should have been obvious to me): although used in philosophy as a model of induction, the Bayesian shift from prior to posterior is by definition deductive. It is only when we go beyond that to *accept* or *reject* a theory that it becomes inductive. (He argued that MML and other model selection theories did so.) Regardless, whether one favors assigning probabilities to hypotheses (as I do) or eschews them, one should favor severe testing. My first tool is a likelihood ratio because it is only surprising predictions that should shift my belief. But a strong likelihood ratio is meaningless if I pick weak competitors. And that choice, alas, is outside the theory.

Exactly. But what good is it if the fundamental task for an account of statistical inference has gone missing from it? As I say in my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), such an account fails us in an “essential” way.