Tom Sterkenburg, PhD

Postdoctoral Fellow

Munich Center for Mathematical Philosophy

LMU Munich

Munich, Germany

## Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The foundations of statistics is not a land of peace and quiet. “Tribal warfare” is perhaps putting it too strongly, but it is the case that for decades now various camps and subcamps have been exchanging heated arguments about the right statistical methodology. That these skirmishes are not just an academic exercise is clear from the widespread use of statistical methods, and from contemporary challenges that cry out for more secure foundations: the rise of big data, the replication crisis.

One often hears that classical, frequentist methods are to blame, since they lack a proper justification and are easily misused at that; so that it is all a matter of stepping up our efforts to spread the Bayesian philosophy. This not only ignores the various conflicting views *within* the Bayesian camp, but also gives too little credit to opposing philosophical perspectives. In particular, it does not do justice to the work of philosopher of statistics Deborah Mayo. Perhaps most famously in her Lakatos Award-winning *Error and the Growth of Experimental Knowledge* (1996), Mayo has been developing an account of statistical and scientific inference that builds on Popper’s falsificationist philosophy and frequentist statistics. She has now written a new book, with the stated goal of helping us get beyond the statistics wars.

This work is a genuine tour de force. Mayo weaves an extraordinary number of philosophical themes, technical discussions, and historical anecdotes into a lively and engaging exposition of what she calls the *error-statistical* philosophy. Like few other works in the area, Mayo instills in the reader an appreciation for both the interest and the significance of the topic of statistical methodology, and indeed for the importance of *philosophers* engaging with it.

That does not yet make the book an easy read. In fact, the downside of Mayo’s conversational style of presentation is that it can take a serious effort on the reader’s part to distill the argumentative structure and see how the various observations and explanations hang together. This, unfortunately, also somewhat limits its usefulness for those intended readers who are new to the topics discussed.

In the following I will summarize the book, and conclude with some general remarks. (Mayo organizes her book into “excursions” divided into “tours”—we are invited to imagine we are on a cruise—but below I will stick to chapters divided into parts.)

Chapter 1 serves as a warming-up. In the course of laying out the motivation for the book’s project, Mayo introduces *severity* as a requirement for evidence. On the *weak* version of the severity criterion, one does *not* have evidence for a claim *C* if the method used to arrive at evidence *x*, even if *x* agrees with *C*, had little capability of finding flaws with *C* even if they exist (Mayo also uses the acronym BENT: *bad evidence, no test*). On its *strong* version, if *C* passes a test that did have high capability of contradicting *C*, then the passing outcome *x* is evidence—or at least, an indication—for *C*. The double role for statistical inference is to identify BENT cases, where we actually have poor evidence; and, using strong severity, to mount positive arguments from coincidence.
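In the simplest textbook setting (a one-sided test on a normal mean with known variance), severity assessments can be computed explicitly. The sketch below follows the standard severity formula for that setting from Mayo's work; the particular numbers are my own illustrative choices, not an example from the book:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_mu_greater(x_bar, mu1, sigma, n):
    """Severity for the claim 'mu > mu1' after observing sample mean x_bar
    in a one-sided normal test with known sigma: the probability of a result
    that accords less well with the claim, computed under mu = mu1."""
    se = sigma / sqrt(n)
    return norm_cdf((x_bar - mu1) / se)

# With sigma = 1 and n = 100 (standard error 0.1), an observed mean of 0.4
# warrants the weak claim 'mu > 0.2' with high severity ...
print(round(severity_mu_greater(0.4, 0.2, 1.0, 100), 3))  # 0.977
# ... but not the stronger claim 'mu > 0.5'.
print(round(severity_mu_greater(0.4, 0.5, 1.0, 100), 3))  # 0.159
```

Run across a range of values of *mu1*, the same calculation traces out the kind of severity curve Mayo uses in her normal-testing illustrations.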

Thus if a statistical philosophy is to tell us what we seek to quantify using probability, then Mayo’s error-statistical philosophy says that this is “well-testedness” or *probativeness*. This she sets apart from *probabilism*, which sees probability as a way of quantifying plausibility of hypotheses (tenet of the Bayesian approach), but also from *performance*, where probability is a method’s long-run frequency of faulty inferences (the classical, frequentist approach). Mayo is careful, too, to set her philosophy apart from recent efforts to unify or bridge Bayesian and frequentist statistics, approaches that she chastises as “marriages of convenience” that simply look away from the underlying philosophical incongruities. There is here an ambiguity in the nature of Mayo’s project that remains unresolved throughout the book: is she indeed proposing a new perspective “to tell what is true about the different methods of statistics” (p. 28), the view-from-a-balloon that might finally get us beyond the statistics wars, or should we actually see her as joining the fray with a yet different competing account? What is certainly clear is that Mayo’s philosophy is much closer to the frequentist than the Bayesian school, so that an important application of the new perspective is to exhibit the flaws of the latter. In the second part of the chapter Mayo immediately gets down to business, revisiting a classic point of contention in the form of the likelihood principle.

In Chapter 2 the discussion shifts to Bayesian confirmation theory, in the context of traditional philosophy of science and the problem of induction. Mayo’s diagnosis is that the aim of confirmation theory is *merely* to try to spell out inductive method, having given up on actually providing justification for it; and in general, that philosophers of science now feel it is taboo to even try to make progress on this account. The latter assessment is not entirely fair, even if it is true that recent proposals addressing the problem of induction (notably those by John Norton and by Gerhard Schurz, who both abandon the idea of a single context-independent inductive method) are still far removed from actual scientific or statistical practice. More interesting than the familiar issues with confirmation theory Mayo lists in the first part of the chapter is therefore the positive account she defends in the second.

Here she discusses falsificationism and how the error-statistical account builds and improves on Popper’s ideas. We read about demarcation, Duhem’s problem, and novel predictions; but also about the replicability crisis in psychology and fallacies of significance tests. In the last section Mayo returns to the question that has been in the background all this time: what is the error-statistical answer to the problem of inductive inference? By then we have already been handed a number of clues: inferences to hypotheses are arguments from strong coincidence, that (unlike “inductive” but really still deductive probabilistic logics) provide genuine “lift-off”, and that (against Popperians) we are free to call warranted or justified. Mayo emphasises that the output of a statistical inference is not a belief; and it is undeniable that for the plausibility of an hypothesis severe testing is neither necessary (the problem of after-the-fact cooked-up hypotheses, Mayo points out, is exactly that they can be so plausible) nor sufficient (as illustrated by the base-rate fallacy). Nevertheless, the envisioned epistemic yield of a (warranted) inference remains agonizingly imprecise. For instance, we read that (sensibly enough) isolated significant results do not count; but when do results start counting, and how? Much is delegated to the dynamics of the overall inquiry, as further illustrated below.
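The base-rate point at the end of this passage can be made concrete with the familiar diagnostic-screening arithmetic (the numbers below are my own illustrative choices, not an example from the book): a hypothesis can pass a highly capable probe and yet remain improbable when its base rate is low enough.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: probability of the hypothesis given a positive result."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A test that detects the condition 95% of the time, with only a 5% false
# alarm rate, is a capable probe; yet at a base rate of 1 in 1000, a positive
# result still leaves the hypothesis quite improbable.
print(round(posterior_given_positive(0.001, 0.95, 0.95), 3))  # 0.019
```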

Chapter 3 goes deeper into severe testing: as employed in actual cases of scientific inference, and as instantiated in methods from classical statistics. Thus the first part starts with the 1919 Eddington experiment to test Einstein’s relativity theory, and continues with a discussion of Neyman–Pearson (N–P) tests. The latter are then accommodated into the error-statistical story, with the admonition that the severity rationale goes beyond the usual behavioural warrant of N–P testing as the guarantee of being rarely wrong in repeated application. Moreover, it is stressed, the statistical methods given by N–P as well as Fisherian tests represent “canonical pieces of statistical reasoning, in their naked form as it were” (p. 150). In a real scientific inquiry these are only part of the investigator’s reservoir of error-probabilistic tools “both formal and quasi-formal”, providing the parts that “are integrated in building up arguments from coincidence, informing background theory, self-correcting […], in an iterative movement” (p. 162).

In the next part of Chapter 3, Mayo defends the classical methods against an array of attacks launched from different directions. Apart from some old charges (or “howlers and chestnuts of statistical tests”), these include the accusations arising from the “family feud” between adherents of Fisher and Neyman–Pearson. Mayo argues that the purported different interpretational stances of the founders (Fisher’s more evidential outlook versus Neyman’s more behaviourist position) are a bad reason to preclude a unified view of both methodologies. In the third part, Mayo extends this discussion to incorporate confidence intervals, and the chapter concludes with another illustration of statistical testing in actual scientific inference, the 2012 discovery of the Higgs boson.

The different parts of Chapter 4 revolve around the theme of objectivity. First up is the “dirty hands argument”: the idea that since we can never be free of the influence of subjective choices, all statistical methods must be (equally) subjective. The mistake, Mayo says, is to assume that we are incapable of registering and managing these inevitable threats to objectivity. The subsequent dismissal of the Bayesian way of taking into account—or indeed embracing—subjectivity is followed, in the second part of the chapter, by a response to a series of Bayesian critiques of frequentist methods, and particularly the charge that, as compared to Bayesian posterior probabilities, *P* values overstate the evidence. The crux of Mayo’s reply is that “it’s erroneous to fault one statistical philosophy from the perspective of a philosophy with a different and incompatible conception of evidence or inference” (p. 265). This is certainly a fair point, but could just as well be turned against her own presentation of the error-statistical perspective as a meta-methodology. Of course, the lesson we are actually encouraged to draw is that an account of evidence in terms of severe testing is preferable to one in terms of plausibility. For this Mayo makes a strong case, in the next part, in connection with the need for tools to intercept various illegitimate research practices. The remainder of the chapter is devoted to some other important themes around frequentist methods: randomization, the trope that “all models are false”, and model validation.
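The “*P* values overstate the evidence” charge typically trades on comparisons like the following Jeffreys–Lindley-style calculation (a sketch under a standard conjugate-normal setup with a point null; the prior choices are illustrative assumptions of mine, not taken from the book):

```python
from math import exp, sqrt

def posterior_null(z, n, tau=1.0, sigma=1.0, prior_null=0.5):
    """Posterior probability of the point null H0: mu = 0 given a z-statistic,
    with a N(0, tau^2) prior on mu under the alternative (conjugate setup)."""
    r = n * tau**2 / sigma**2                             # prior/sampling variance ratio
    bf01 = sqrt(1 + r) * exp(-(z**2 / 2) * r / (1 + r))   # Bayes factor for H0 over H1
    return prior_null * bf01 / (prior_null * bf01 + (1 - prior_null))

# The same just-significant result (z = 1.96, two-sided P of about 0.05)
# makes the null *more* probable as the sample size grows.
for n in (10, 100, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

On these assumptions, the posterior probability of the null climbs from roughly 0.37 at n = 10 to over 0.8 at n = 1000, while the *P* value stays fixed at 0.05; this is the kind of divergence the Bayesian critics have in mind.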

Chapter 5 is a relatively technical chapter about the notion of a test’s *power*. Mayo addresses some purported misunderstandings around the use of power, and discusses the notion of *attained* or post-data power, combining elements of N–P and of Fisher, as part of her severity account. Later in the chapter we revisit the replication crisis, and in the last part we are given an entertaining “deconstruction” of the debates between N–P and Fisher. Finally, in Chapter 6, Mayo takes one last look at the probabilistic “foundations lost”, to clear the way for her parting proclamation of the new probative foundations. She discusses the retreat by theoreticians from full-blown subjective Bayesianism, the shaky grounds under objective or default Bayesianism, and attempts at unification (“schizophrenia”) or flat-out pragmatism. Saved till the end, fittingly, is the recent “falsificationist Bayesianism” that emerges from the writings of Andrew Gelman, who indeed adopts important elements of the error-statistical philosophy.
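For the known-variance normal test, the contrast between ordinary (pre-data) power and attained (post-data) power can be written out in a few lines. This is a sketch assuming the usual definition on which the observed statistic replaces the test's pre-set cutoff; the numbers are my own illustrative choices:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu1, mu0=0.0, sigma=1.0, n=25, z_alpha=1.645):
    """Pre-data power: probability that the one-sided test of H0: mu <= mu0
    rejects at cutoff z_alpha, computed under the alternative mu = mu1."""
    se = sigma / sqrt(n)
    return 1 - norm_cdf(z_alpha - (mu1 - mu0) / se)

def attained_power(mu1, z_obs, mu0=0.0, sigma=1.0, n=25):
    """Post-data ('attained') power: the same probability, but with the
    observed statistic z_obs in place of the pre-set cutoff."""
    se = sigma / sqrt(n)
    return 1 - norm_cdf(z_obs - (mu1 - mu0) / se)

# With n = 25 and sigma = 1 (standard error 0.2): the test's ordinary power
# against mu1 = 0.5, versus the attained power after observing z_obs = 1.7.
print(round(power(0.5), 2))                # 0.8
print(round(attained_power(0.5, 1.7), 2))  # 0.79
```

The two notions coincide in form but answer different questions: the first is fixed before the data by the cutoff, the second varies with the result actually obtained.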

It seems only a plausible if not warranted inductive inference that the statistics wars will rage on for a while; but what, towards an assessment of Mayo’s programme, should we be looking for in a foundational account of statistics? The philosophical attraction of the dominant Bayesian approach lies in its promise of a principled and unified account of rational inference. It appears to be too rigid, however, in suggesting a fully mechanical method of inference: after you fix your prior it is, on the standard conception, just a matter of conditionalizing. At the same time it appears to leave too much open, in allowing you to reconstruct any desired reasoning episode by suitable choice of model and prior. Mayo is very clear that her account resists the first: we are not looking for a purely formal account, a single method that can be mindlessly pursued. Still, the severity rationale is emphatically meant to be restrictive: to expose certain inferences as unwarranted. But the threat of too much flexibility is still lurking in how much is delegated to the messy context of the overall inquiry. If too much is left to context-dependent expert judgment, for instance, the account risks forfeiting its advertised capacity to help us hold the experts accountable for their inferences. This motivates the desire for a more precise philosophical conception, if possible, of what inferences count as warranted and how. What Mayo’s book should certainly convince us of is the value of seeking to develop her programme further, and for that reason alone the book is recommended reading for all philosophers—not least those of the Bayesian denomination—concerned with the foundations of statistics.

***

Sterkenburg, T. (2020). “Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars”, *Journal for General Philosophy of Science* 51: 507–510. (https://doi.org/10.1007/s10838-019-09486-2).


I came across Tom Sterkenburg’s review of my book entirely by accident (CUP doesn’t send reviews to authors), and I’m very glad to have done so. I find his review to be extremely insightful, clear and illuminating. I don’t really have any criticisms, but I’ll respond to a couple of points:

1. “There is here an ambiguity in the nature of Mayo’s project, that remains unresolved throughout the book: is she indeed proposing a new perspective ‘to tell what is true about the different methods of statistics’ (p. 28), the view-from-a-balloon that might finally get us beyond the statistics wars, or should we actually see her as joining the fray with a yet different competing account?”

I am doing both of these. The key thing is that even if you reject the competing account, you can use the severe testing perspective as a tool for excavation and for keeping us above some of the marshes and quicksand that are causing so much confusion in today’s philosophy of statistics debates.

2. “The crux of Mayo’s reply is that ‘it’s erroneous to fault one statistical philosophy from the perspective of a philosophy with a different and incompatible conception of evidence or inference’ (p. 265). This is certainly a fair point, but could just as well be turned against her own presentation of the error-statistical perspective as a meta-methodology. Of course, the lesson we are actually encouraged to draw is that an account of evidence in terms of severe testing is preferable to one in terms of plausibility.”

I avoid that charge by openly admitting that one is free to reject even the minimal principle of severity—but seeing how accounts violate severity principles should cause some scales to fall from eyes, and lead to a more honest appraisal of the price of rejecting the goal of controlling and assessing error probabilities. Here’s what I say on pp. 247-8:

“The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.”

I invite readers to comment, which will undoubtedly trigger further replies from me.

Thanks so much Tom!

Thank you for posting this! I’ll just say here again that I greatly enjoyed reading your book. It really does a tremendous job conveying that the foundations of statistics is an exciting and important topic, and that philosophy matters.


“If too much is left to context-dependent expert judgment, for instance, the account risks forfeiting its advertised capacity to help us hold the experts accountable for their inferences.”

I see where he is coming from, but to me it seems a misplaced philosophers’ dream to outmaneuver context-dependent expert judgement in scientific inquiry.

Exactly my thoughts when I read this comment. #science can be a messy business as issues are thrashed out and clarified. It can’t be reduced to a mechanical procedure. That’s one of the joys of the error statistical approach. It gives a way of approaching the issues which promises to clarify the issues at stake.

Thanks for these comments! I certainly don’t think philosophers should imagine dreaming up a statistical methodology that can do without context-dependent expert judgment, let alone some purely mechanical “universal” method. Much of my own work (in foundations of ML) is concerned with the necessarily restricted role for the mechanical component in inference (see, e.g., http://philsci-archive.pitt.edu/18505/).

My point was just the following. If a philosophy of statistical inference is to have a normative component in that it tells us what makes for a good inference (and so can help us hold experts accountable for their inferences—an important point in the book), then it cannot delegate too much to the particular context of inquiry (including expert judgment) precisely at those points where it specifies what makes for a good inference. I wrote this because I had some trouble seeing how this works out in a real scientific inquiry where (see quotes in the review) inferences by “canonical pieces of statistical reasoning” only form a small part of (and, as I understand this passage, must therefore also derive part of their justification from) the larger dynamics of “building up arguments from coincidence, informing background theory, self-correcting […], in an iterative movement”. Perhaps this tension is unavoidable and the point rather generic—but still, I think, something worth thinking about.

Tom: I’m not sure I see the tension. It seems that you answer your own question. The idea is that repertoires of errors (or error types) can be developed for various learning goals, and when absent, we may embark on procedures to build them. I think types of mistakes in drawing inferences from data are generalizable.

I’m very interested to check your link, thanks so much for it. Of course, my job in SIST is largely just getting us beyond today’s statistics wars. Only then can we see what kind of canonical models of inquiry need developing.

Christian: I agree with you, but there are two more things. (1) Providing the philosophy of statistics, with some canonical exemplars, isn’t to provide an applied statistical account, as noted in Souvenir Z (p. 436). With the philosophy of statistics, however, the applied researcher has a roadmap for fleshing it out, or pinpointing further work that’s needed. The second point is that (2) lacking a full-blown forward-looking account to apply does not mean we cannot hold the experts accountable (as Tom suggests), and by means of the severity account. Even being able to put the questions to the “experts”, based on what’s required to satisfy severity, suffices.

Another comment on Tom:

Tom writes:

“Mayo returns to the question that has been in the background all this time: what is the error-statistical answer to the problem of inductive inference? By then we have already been handed a number of clues: inferences to hypotheses are arguments from strong coincidence, that …provide genuine “lift-off”, and that (against Popperians) we are free to call warranted or justified. Mayo emphasises that the output of a statistical inference is not a belief; and it is undeniable that for the plausibility of an hypothesis severe testing is neither necessary (the problem of after-the-fact cooked-up hypotheses, Mayo points out, is exactly that they can be so plausible) nor sufficient (as illustrated by the base-rate fallacy).”

Focus on the last sentence. It is true that a plausible claim may be inseverely tested, as with some after-the-fact cooked-up hypotheses, though this seems a roundabout way to put it. But what about the second claim? It appears to say that a severely tested hypothesis need not be plausible, because of the base-rate fallacy? Of course I’m not sure how he is understanding plausible—does it mean probable, in some sense? I would agree that a well-tested claim may be improbable (on almost any sense of probability). The indictment would be trying to assess plausibility by probabilistic means.