Notre Dame Philosophical Reviews is a leading forum for publishing reviews of books in philosophy. The philosopher of statistics, Prasanta Bandyopadhyay, published a review of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) (SIST) in this journal, and I very much appreciate his doing so. Here I excerpt from his review and respond to a cluster of related criticisms, in order to clear up some fundamental misunderstandings of my project. Here’s how he begins:
In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.
I regard this as one of the most important philosophy of science books written in the last 25 years. However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.
* * * * * * * * *
I will begin with three issues on which Mayo focuses:
- (i) Conflict about the foundation of statistical inference: Probabilism or Long-run Performance?
- (ii) Crisis in science: Which method is adequately general/flexible to be applicable to most problems?
- (iii) Replication crisis: Is scientific research reproducible?
Mayo holds that these issues are connected. Failure to recognize that connection leads to problems in statistical inference.
Probabilism, as Mayo describes it, is about accepting reasoned belief when certainty is not available. Error-statistics is concerned with understanding and controlling the probability of errors. This is a long-run performance criterion. Mayo is concerned with “probativeness” for the analysis of “particular statistical inference” (p. 14). She draws her inspiration concerning probativeness from severe testing and calls those who follow this kind of philosophy the “severe testers” (p. 9). This concept is the central idea of the book.
…. What should be done, according to the severe tester, is to take refuge in a meta-standard and evaluate each theory from that meta-theoretical standpoint. Philosophy will provide that higher ground to evaluate two contending statistical theories. In contrast to the statistical foundations offered by both probabilism and long-run performance accounts, severe testers advocate probativism, which does not recommend any statement to be warranted unless a fair amount of investigation has been carried out to probe ways in which the statement could be wrong.
Severe testers think their method is adequately general to capture this intuitively appealing requirement on any plausible account of evidence. That is, if a test were not able to find flaws with H even if H were incorrect, then a mere agreement of H with data X0 would provide poor evidence for H. This, according to the severe tester’s account, should be a minimal requirement on any account of evidence. This is how they address (ii).
Next consider (iii). According to the severe tester’s diagnosis, the replication crisis arises when there is selective reporting: the statistics are cherry-picked for x, i.e., looked at for significance where it is absent, multiple testing, and the like. Severe testers think their account alone can handle the replication crisis satisfactorily. That leaves the burden on them to show that other accounts, such as probabilism and long-run performance, are incapable of handling the crisis, or are inadequate compared to the severe tester’s account. One way probabilists (such as subjective Bayesians) seem to block problematic inferences resulting from the replication crisis is by assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block problematic inferences leading to the replication crisis. However, they insist that this procedure won’t be able to show what researchers have initially done wrong in producing the crisis in the first place. The nub of their criticism is that Bayesians don’t provide a convincing resolution of the replication crisis since they don’t explain where the researchers make their mistake.
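To see the mechanism he describes in numbers, here is a minimal sketch (the likelihoods and priors are invented for illustration, not drawn from SIST or the review): a sufficiently high prior on the null keeps its posterior high even when the data, taken alone, favor the alternative.

```python
# Illustrative only: invented likelihoods and priors, not figures from SIST
# or the review. A high prior P(H0) keeps the posterior P(H0|x) high even
# when the data x favor the alternative H1.

def posterior_h0(prior_h0, lik_h0, lik_h1):
    """Posterior P(H0 | x) in a simple two-hypothesis comparison."""
    marginal = prior_h0 * lik_h0 + (1 - prior_h0) * lik_h1
    return prior_h0 * lik_h0 / marginal

lik_h0, lik_h1 = 0.05, 0.30  # assumed: P(x | H0) = 0.05, P(x | H1) = 0.30
for prior in (0.5, 0.9, 0.95):
    p = posterior_h0(prior, lik_h0, lik_h1)
    print(f"P(H0) = {prior:.2f}  ->  P(H0 | x) = {p:.2f}")
# P(H0) = 0.50  ->  P(H0 | x) = 0.14
# P(H0) = 0.90  ->  P(H0 | x) = 0.60
# P(H0) = 0.95  ->  P(H0 | x) = 0.76
```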
I don’t think we can look to this procedure (“assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it”) to block problematic inferences. In some cases, your disbelief in H might be right on the money, but this is precisely what is unknown when undertaking research. An account must be able to directly register how biasing selection effects alter error-probing capacities if it is to call out the resulting bad inferences – or so I argue. Data-dredged hypotheses are often very believable; that’s what makes them so seductive. Moreover, it’s crucial for an account to be able to say that H is plausible but terribly tested by this particular study or test. I don’t say that inquirers are always in the context of severe testing, by the way. We’re not always truly trying to find things out; often, we’re just trying to make our case. That said, I never claim the severe testing account is the only way to avoid irreplication in statistics, nor do I suggest that the problem of replication is the sole problem for an account of statistical inference. Explaining and avoiding irreplication is a minimal problem an account should be capable of solving. This relates to Bandyopadhyay’s central objection below.
In some places, he attributes to me a position that is nearly the opposite of the one I argue for. After explaining, I consider what might have led him to this topsy-turvy allegation.
The problem with the long-run performance-based frequency approach, according to Mayo, is that it is easy to support a false hypothesis with these methods by selective reporting. The severe tester thinks both Fisher’s and Neyman and Pearson’s methods leave the door open for cherry-picking, significance seeking, and multiple-testing, thus generating the possibility of a replication crisis. Fisher’s and Neyman-Pearson’s methods make room for enabling the support of a preferred claim even though it is not warranted by evidence. This causes severe testers like Mayo to abandon the idea of adopting long-run performance as a sufficient condition for statistical inferences; it is merely a necessary condition for them.
No, it is the opposite. The error statistical assessments are highly valuable because they pick up on the effects of data dredging, multiple testing, optional stopping, and a host of other biasing selection effects. Biasing selection effects are blocked in error statistical accounts because they preclude control of error probabilities! It is precisely because such gambits render the error probability assessments invalid that error statistical accounts are able to require, with justification, predesignation and preregistration. That is the key message of SIST from the very start.
- SIST, p. 20: A key point too rarely appreciated: Statistical facts about P-values themselves demonstrate how data finagling can yield spurious significance. This is true for all error probabilities. That’s what a self-correcting inference account should do. … Scouring different subgroups and otherwise “trying and trying again” are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding – and that remains so even if you ditch P-values and never compute them.
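The arithmetic behind this point is easy to exhibit. Here is a minimal simulation sketch (mine, not from SIST): scour k independent subgroups at the nominal 0.05 level when every null is true, and the actual probability of at least one spurious “finding” is 1 − 0.95^k.

```python
# A minimal sketch (not from SIST): scouring k independent subgroups, each
# tested at nominal level 0.05 with every null true, blows up the actual
# probability of at least one spurious "significant" finding to 1 - 0.95**k.
import random

def scour_subgroups(k, alpha=0.05, trials=100_000):
    """Estimate P(at least one 'significant' result) when all k nulls are true."""
    hits = 0
    for _ in range(trials):
        # Under a true null, each p-value is uniform on [0, 1].
        if any(random.random() < alpha for _ in range(k)):
            hits += 1
    return hits / trials

for k in (1, 5, 20):
    print(f"k = {k:2d} subgroups: actual rate ~ {scour_subgroups(k):.2f} "
          f"(exact: {1 - 0.95 ** k:.2f})")
# k =  1: ~0.05;  k =  5: ~0.23;  k = 20: ~0.64
```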
Consider the dramatic opposition between Savage, on one side, and Fisher and N-P, on the other, regarding the Likelihood Principle and optional stopping:
- SIST, p. 46: The lesson about who is allowed to cheat depends on your statistical philosophy. Error statisticians require that the overall and not the “computed” significance level be reported. To them, cheating would be to report the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size.
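A sketch of the arithmetic (mine, for illustration) shows why the “computed” and overall significance levels come apart: test after every observation at the nominal two-sided 0.05 level, and stop at the first “significant” result.

```python
# A minimal sketch (not from SIST) of optional stopping: under a true null,
# test after each observation at the nominal two-sided 0.05 level and stop
# at the first "significant" z. The computed level is 0.05 at every look,
# but the overall probability of ever rejecting climbs far above 0.05.
import math
import random

def try_and_try_again(n_max, z_crit=1.96, trials=10_000):
    """Estimate P(reject at some n <= n_max) when H0 (mean 0, sd 1) is true."""
    rejections = 0
    for _ in range(trials):
        total, n = 0.0, 0
        while n < n_max:
            total += random.gauss(0.0, 1.0)  # H0 true: standard normal data
            n += 1
            if abs(total) / math.sqrt(n) > z_crit:
                rejections += 1
                break
    return rejections / trials

for n_max in (10, 100, 1000):
    print(f"n_max = {n_max:4d}: overall Type I error ~ {try_and_try_again(n_max):.2f}")
# Roughly 0.19, 0.37, 0.53 -- and it tends to 1 as n_max grows without bound.
```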
Bandyopadhyay seems to think that if I have criticisms of the long-run performance (or behavioristic) construal of error probabilities, it must be because I claim it leads to replication failure. That’s the only way I can explain his criticism above.
He is startled that I reject the long-run performance view he supposes I previously held.
This leads me to discuss the severe tester’s rejection of both probabilism and frequency-based long-run performance, especially the latter. It is understandable why Mayo finds fault with probabilists, since severe testers are no friends of Bayesians who take probability theory to be the only logic of uncertainty. So, the position is consistent with the severe tester’s account proposed in Mayo’s last two influential books (1996 and 2010). What is surprising is that her account rejects the long-run performance view and only takes the frequency-based probability as necessary for statistical inference.
But I’ve always rejected the long-run performance or “behavioristic” construal of error statistical methods – when it comes to using them for scientific inference. I’ve always rejected the supposition that the justification and rationale for error statistical methods is their ability to control the probabilities of erroneous inferences in a long-run series of applications. Others have rejected it as well, notably Birnbaum, Cox, and Giere. Their sense is that these tools serve inferential goals, but in a way that no one has quite been able to explain. What hasn’t been done, and what I only hinted at in earlier work, is to supply an alternative, inferential rationale for error statistics. The trick is to show when and why long-run error control supplies a measure of a method’s capability to identify mistakes. This capability assessment, in turn, supplies the inferential assessment, post-data: a measure of how well or poorly tested claims are.
My earlier work, Error and the Growth of Experimental Knowledge (EGEK), was directed at the uses of statistics for solving philosophical problems of evidence and inference. SIST, by contrast, is focused almost entirely on the philosophical problems of statistical practice. Moreover, I stick my neck out and try to tackle, from within the severe tester’s paradigm, essentially all of the examples around which there has been philosophical controversy. While I freely admit this represents a gutsy, if not radical, gambit, I actually find it perplexing that it hasn’t been done before. It seems to me that we convert information about (long-run) performance into information about well-testedness in ordinary, day-to-day reasoning. Take the informal example early on in the book.
- SIST, p. 14: Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. … Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. … But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. … No one would say: ‘I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.’ To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: H: I’ve gained weight. … This is the key – granted, with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short-run rationale, or even one relevant for the particular case at hand.
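To attach rough numbers to that “enormous capacity” (the figures below are invented for illustration, not from SIST): suppose each scale, working properly, would show a spurious gain of 4 or more pounds with probability at most 0.05, and that the scales err independently. Then:

```latex
% Illustrative figures only, not from SIST:
\[
\Pr\big(\text{all three scales read} \geq 4 \text{ lbs higher}\,;\,\text{no gain}\big)
\;\le\; (0.05)^{3} \;=\; 0.000125 .
\]
```

The three concordant readings, together with the unchanged EGEK weights, are what give the procedure its capability to have revealed an error, were there one.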
Let me now clarify the reason that satisfying a long-run performance requirement is only necessary, and not sufficient, for severity. Long-run performance could be satisfied while the error probabilities fail to reflect well-testedness in the case at hand. Go to the howlers and chestnuts of Excursion 3 Tour II:
- Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
- Basis for the joke: An N-P test bases error probabilities on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do.
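The arithmetic behind the joke, spelled out (a gloss on Cox 1958, not a quotation):

```python
# The joke's arithmetic, spelled out (a gloss on Cox 1958, not a quotation).
p_good, p_bad = 1.0, 0.5  # accuracies of the precise and imprecise scales
p_coin = 0.5              # a fair coin decides which scale is used

# Unconditional ("behavioral") performance, averaged over the coin flip:
print(p_coin * p_good + p_coin * p_bad)  # 0.75 -- "right 75% of the time"

# Conditional assessment, given the scale actually used this time:
print(p_good)  # 1.0 if the precise scale was used
print(p_bad)   # 0.5 if the lousy one was -- the error probability
               # relevant to the inference at hand
```

Conditioning on the instrument actually used is what severe testing requires; the unconditional 75% is true of the long-run procedure, but irrelevant to the case at hand.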
In short: I’m taking tools that are typically justified only by their control of the probability of erroneous inferences in the long run, and providing them with an inferential justification relevant for the case at hand. It’s only when long-run relative frequencies represent the method’s capability to discern mistaken interpretations of data that the performance and severe testing goals line up. Where the two sets of goals do not line up, severe testing takes precedence – at least when we’re trying to find things out. The book is an experiment in trying to do all of philosophy of statistics within the severe testing paradigm.
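For readers who want the schema in symbols, here is a sketch in illustrative notation for the familiar one-sided normal test (SIST gives the full treatment):

```latex
% A schematic sketch (my notation): in a one-sided normal test of
% H0: mu <= mu0 vs H1: mu > mu0, with test statistic d(X) and observed
% value d(x0), the severity with which the claim  mu > mu1  passes is
\[
\mathrm{SEV}(\mu > \mu_1) \;=\; \Pr\big(d(X) \le d(x_0)\,;\, \mu = \mu_1\big),
\]
% the probability of a result less accordant with the claim, computed under
% the boundary value mu = mu1 of the claim's denial. Only when the long-run
% error probabilities track this capability do performance and severity
% line up.
```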
There’s more to reply to in his review, but I want to focus just on this clarification, which should rectify his main criticism. For a discussion of the general points of severely testing theories, I direct the reader to extensive excerpts from SIST. His full review is here.
Bandyopadhyay attended my 1999 NEH Summer Seminar on Inductive-Experimental Inference. I’m glad that he has pursued philosophy of statistics through the years. I do wish he had sent me his review earlier, so that I could have clarified the small set of confusions that led him to some unintended places. NDPR might have given the author an opportunity to reply, lest readers come away with a distorted view of the book. I will shortly be resuming a discussion of SIST on this blog, picking up with Excursion 2.
Update March 4: Note that I wound up commenting further on the Review in the following comments:
 If you find an example that has been the subject of philosophical debate that is omitted from SIST, let me know. You will notice that all these examples are elementary, which is why I was able to cover them with minimal technical complexity. Some more exotic examples are in “chestnuts and howlers”.