*Notre Dame Philosophical Reviews* is a leading forum for publishing reviews of books in philosophy. The philosopher of statistics, Prasanta Bandyopadhyay, published a review of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP) (SIST) in this journal, and I very much appreciate his doing so. Here I excerpt from his review, and respond to a cluster of related criticisms in order to avoid some fundamental misunderstandings of my project. Here’s how he begins:

In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.

I regard this as one of the most important philosophy of science books written in the last 25 years. However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.

* * * * * * * * *

I will begin with three issues on which Mayo focuses:

1. Conflict about the foundation of statistical inference: Probabilism or Long-run Performance?
2. Crisis in science: Which method is adequately general/flexible to be applicable to most problems?
3. Replication crisis: Is scientific research reproducible?

Mayo holds that these issues are connected. Failure to recognize that connection leads to problems in statistical inference.

Probabilism, as Mayo describes it, is about accepting reasoned belief when certainty is not available. Error-statistics is concerned with understanding and controlling the probability of errors. This is a long-run performance criterion. Mayo is concerned with “probativeness” for the analysis of “particular statistical inference” (p. 14). She draws her inspiration concerning probativeness from severe testing and calls those who follow this kind of philosophy the “severe testers” (p. 9). This concept is the central idea of the book. … What should be done, according to the severe tester, is to take refuge in a meta-standard and evaluate each theory from that meta-theoretical standpoint. Philosophy will provide that higher ground to evaluate two contending statistical theories. In contrast to the statistical foundations offered by both probabilism and long-run performance accounts, severe testers advocate probativism, which does not recommend any statement to be warranted unless a fair amount of investigation has been carried out to probe ways in which the statement could be wrong.

Severe testers think their method is adequately general to capture this intuitively appealing requirement on any plausible account of evidence. That is, if a test were not able to find flaws with H even if H were incorrect, then a mere agreement of H with data x₀ would provide poor evidence for H. This, according to the severe tester’s account, should be a minimal requirement on any account of evidence. This is how they address (ii).

Next consider (iii). According to the severe tester’s diagnosis, the replication crisis arises when there is selective reporting: the statistics are cherry-picked for x, i.e., looked at for significance where it is absent, multiple testing, and the like. Severe testers think their account alone can handle the replication crisis satisfactorily. That leaves the burden on them to show that other accounts, such as probabilism and long-run performance, are incapable of handling the crisis, or are inadequate compared to the severe tester’s account. One way probabilists (such as subjective Bayesians) seem to block problematic inferences resulting from the replication crisis is by assigning a high subjective prior probability to the null hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block problematic inferences leading to the replication crisis. However, they insist that this procedure won’t be able to show what researchers have initially done wrong in producing the crisis in the first place. The nub of their criticism is that Bayesians don’t provide a convincing resolution of the replication crisis since they don’t explain where the researchers make their mistake.

I don’t think we can look to this procedure (“assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it”) to block problematic inferences. In some cases, your disbelief in H might be right on the money, but this is precisely what is *unknown* when undertaking research. An account must be able to directly register how biasing selection effects alter error probing capacities if it is to call out the resulting bad inferences–or so I argue. Data-dredged hypotheses are often very believable, that’s what makes them so seductive. Moreover, it’s crucial for an account to be able to say that H is plausible but terribly tested by this particular study or test. I don’t say that inquirers are always in the context of severe testing, by the way. We’re not always truly trying to find things out; often, we’re just trying to make our case. That said, I never claim the severe testing account is the only way to avoid irreplication in statistics, nor do I suggest that the problem of replication is the sole problem for an account of statistical inference. Explaining and avoiding irreplication is a *minimal* problem an account should be capable of solving. This relates to Bandyopadhyay’s central objection below.

In some places, he attributes to me a position that is nearly the opposite of what I argue. After explaining, I consider why he might have been led to his topsy-turvy allegation.

The problem with the long-run performance-based frequency approach, according to Mayo, is that it is easy to support a false hypothesis with these methods by selective reporting. The severe tester thinks both Fisher’s and Neyman and Pearson’s methods leave the door open for cherry-picking, significance seeking, and multiple-testing, thus generating the possibility of a replication crisis. Fisher’s and Neyman-Pearson’s methods make room for enabling the support of a preferred claim even though it is not warranted by evidence. This causes severe testers like Mayo to abandon the idea of adopting long-run performance as a sufficient condition for statistical inferences; it is merely a necessary condition for them.

No, it is the opposite. The error statistical assessments are highly valuable because they pick up on the effects of data dredging, multiple testing, optional stopping and a host of biasing selection effects. Biasing selection effects are blocked in error statistical accounts because they preclude control of error probabilities! It is precisely because they render the error probability assessments invalid that error statistical accounts are able to require–with justification– predesignation and preregistration. That is the key message of SIST from the very start.

- SIST, p. 20: A key point too rarely appreciated: Statistical facts about P-values themselves demonstrate how data finagling can yield spurious significance. This is true for all error probabilities. That’s what a self-correcting inference account should do. … Scouring different subgroups and otherwise “trying and trying again” are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding – and that remains so even if you ditch P-values and never compute them.
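The "scouring different subgroups" point can be illustrated with a small simulation (my own sketch, not an example from SIST; all numbers here are invented for illustration): if a researcher hunts through 20 independent null subgroups for a nominally significant result at the 0.05 level, the chance of at least one spurious "finding" is roughly 1 − 0.95²⁰ ≈ 0.64, not 0.05.

```python
# Illustrative simulation (not from SIST): scouring 20 null subgroups
# inflates the actual probability of a spurious "significant" result.
import math
import random

random.seed(1)
TRIALS, SUBGROUPS, N = 2000, 20, 30

def one_sided_p(xs):
    # Normal-theory p-value for testing mean > 0, with known sd = 1
    z = sum(xs) / math.sqrt(len(xs))
    return 0.5 * math.erfc(z / math.sqrt(2))

hits = 0
for _ in range(TRIALS):
    # Every subgroup is pure noise: the null is true across the board
    if any(one_sided_p([random.gauss(0, 1) for _ in range(N)]) < 0.05
           for _ in range(SUBGROUPS)):
        hits += 1

rate = hits / TRIALS
print(rate)  # close to 1 - 0.95**20, i.e. about 0.64, far above 0.05
```

The inflation does not depend on the P-value machinery itself: any selection over many null comparisons blows up the probability of an impressive-looking fluke.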

Consider the dramatic opposition between Savage, and Fisher and N-P regarding the Likelihood Principle and optional stopping:

- SIST, p. 46: The lesson about who is allowed to cheat depends on your statistical philosophy. Error statisticians require that the overall and not the “computed” significance level be reported. To them, cheating would be to report the significance level you got after trying and trying again in just *the same way* as if the test had a fixed sample size.
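The optional stopping version of "trying and trying again" can be sketched the same way (my own illustration; the numbers are arbitrary): test after every new observation and stop the moment the nominal p-value dips below 0.05.

```python
# Illustrative simulation (not from SIST): optional stopping. A nominal
# 0.05-level test applied after every observation, stopping at the first
# "significant" result, rejects a true null far more than 5% of the time.
import math
import random

random.seed(2)
TRIALS, MAX_N = 1000, 200

def nominal_p(total, n):
    # Two-sided normal-theory p-value with known sd = 1
    z = abs(total) / math.sqrt(n)
    return math.erfc(z / math.sqrt(2))

rejections = 0
for _ in range(TRIALS):
    total = 0.0
    for n in range(1, MAX_N + 1):
        total += random.gauss(0, 1)  # the null (mean 0) is true
        if nominal_p(total, n) < 0.05:
            rejections += 1
            break

rate = rejections / TRIALS
print(rate)  # well above the advertised 0.05
```

Reporting the final nominal p as if the sample size had been fixed in advance is precisely the "cheating" the quoted passage describes.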

Bandyopadhyay seems to think that if I have criticisms of the long-run performance (or behavioristic) construal of error probabilities, it must be because I claim it leads to replication failure. That’s the only way I can explain his criticism above.

He is startled that I’m rejecting the long-run performance view I previously held.

This leads me to discuss the severe tester’s rejection of both probabilism and frequency-based long-run performance, especially the latter. It is understandable why Mayo finds fault with probabilists, since they are no friends of Bayesians who take probability theory to be the *only* logic of uncertainty. So, the position is consistent with the severe tester’s account proposed in Mayo’s last two influential books (1996 and 2010). What is surprising is that her account rejects the long-run performance view and only takes the frequency-based probability as necessary for statistical inference.

But I’ve always rejected the long run performance or “behavioristic” construal of error statistical methods–when it comes to using them for scientific inference. I’ve always rejected the supposition that the justification and rationale for error statistical methods is their ability to control the probabilities of erroneous inferences in a long run series of applications. Others have rejected it as well, notably, Birnbaum, Cox, Giere. Their sense is that these tools are satisfying inferential goals but in a way that no one has been able to quite explain. What hasn’t been done, and what I only hinted at in earlier work, is to supply an alternative, inferential rationale for error statistics. The trick is to show when and why long run error control supplies a measure of a method’s *capability* to identify mistakes. This capability assessment, in turn, supplies a measure of how well or poorly tested claims are. So, the inferential assessment, post data, is in terms of how well or poorly tested claims are.

My earlier work, *Error and the Growth of Experimental Knowledge* (EGEK), was directed at the uses of statistics for solving philosophical problems of evidence and inference.[1] SIST, by contrast, is focussed almost entirely on the philosophical problems of statistical practice. Moreover, I stick my neck out, and try to tackle essentially all of the examples around which there has been philosophical controversy from the severe tester’s paradigm. While I freely admit this represents a gutsy, if not radical, gambit, I actually find it perplexing that it hasn’t been done before. It seems to me that we convert information about (long-run) performance into information about well-testedness in ordinary, day to day reasoning. Take the informal example early on in the book.

- SIST, p. 14: Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. … Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. … But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. … No one would say: ‘I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.’ To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: *H*: I’ve gained weight. … This is the key – granted with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.

Let me now clarify the reason that satisfying a long-run performance requirement is only necessary and not sufficient for severity. Long-run behavior could be satisfied while the error probabilities do not reflect well-testedness in the case at hand. Go to the howlers and chestnuts of Excursion 3 Tour II:

- Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.) *Basis for the joke:* An N-P test bases error probabilities on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do.
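The arithmetic behind the joke is easy to make explicit (a trivial sketch using the example's own numbers):

```python
# Cox's (1958) two-instruments example, in numbers. A fair coin picks
# between a scale that's always right and one that's right half the time.
p_precise, p_imprecise = 1.0, 0.5

# Unconditional long-run "performance", averaged over the coin flips:
overall = 0.5 * p_precise + 0.5 * p_imprecise
print(overall)  # 0.75 -- the "right 75% of the time" in the joke

# Conditional assessment: once we know the imprecise scale was used,
# the error probability relevant to THIS weighing is its own:
print(p_imprecise)  # 0.5 -- the repetitions that matter for the case at hand
```

The unconditional 75% is a perfectly correct long-run rate over coin flips; it is simply the wrong set of repetitions for assessing what this weighing warrants.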

In short: I’m taking the tools that are typically justified only because they control the probability of erroneous inferences in the long-run, and providing them with an inferential justification relevant for the case at hand. It’s only when long-run relative frequencies represent the method’s capability to discern mistaken interpretations of data that the performance and severe testing goals line up. Where the two sets of goals do not line up, severe testing takes precedence–at least when we’re trying to find things out. The book is an experiment in trying to do all of philosophy of statistics within the severe testing paradigm.

There’s more to reply to in his review, but I want to just focus on this clarification, which should rectify his main criticism. For a discussion of the general points of severely testing theories, I direct the reader to extensive excerpts from SIST. His full review is here.

__________________________________

Bandyopadhyay attended my NEH Summer Seminar in 1999 on Inductive-Experimental Inference. I’m glad that he has pursued philosophy of statistics through the years. I do wish he had sent me his review earlier so that I could clarify the small set of confusions that led him to some unintended places. NDPR might have given the author an opportunity to reply lest readers come away with a distorted view of the book. I will shortly be resuming a discussion of SIST on this blog, picking up with excursion 2.

Update March 4: Note that I wound up commenting further on the Review in the following comments:

[1] If you find an example that has been the subject of philosophical debate that is omitted from SIST, let me know. You will notice that all these examples are elementary, which is why I was able to cover them with minimal technical complexity. Some more exotic examples are in “chestnuts and howlers”.

“One way probabilists (such as subjective Bayesians) seem to block problematic inferences resulting from the replication crisis is by assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block problematic inferences leading to the replication crisis. ” I have never seen where a severe tester “grants” that assigning a high prior to a null is at all acceptable, much less blocks problematic inferences. I believe such a procedure is problematic in and of itself. Nor have I ever seen an example (not contrived) where this was shown to be a successful way to solve replication issues. What am I missing?

John: I agree with you. Nevertheless, I don’t doubt that in some cases people can correctly spot implausible theories, but there might be as many who regard them as plausible and can point to an entire published literature on the claimed association. Even in those cases where your beliefs are spot on, I suggest that what’s being disbelieved is that the given test successfully probed the claim of interest (and how it can be flawed). Most important are the problems in an appeal to degrees of belief (which may themselves be data dependent) to criticize hypotheses that arose through data-dredging, multiple testing, outcome-switching and the like. We want to distinguish the warrant for one and the same hypothesis, but with different data–in one case with stringent experimental controls, say, in another via data-dredging. The Bayesian might want to say that knowledge of the experiment alters the prior somehow, which would be non-standard.

I have another related question. When a study produces a posterior probability as a numerical result, what must a second study produce to be considered a failure to reproduce the finding? For example, if a study finds a posterior prob of 0.85 for a hypothesis, and then a second, independent study finds 0.59, is this a failure to reproduce?

John: It’s not clear that probabilists can or want to falsify. That would include not inferring a claim failed to be replicated. However, an account can be supplemented with rules for falsifying in general, and in particular, for inferring a finding fails to replicate. For example, if the hypothesis of interest goes down in probability by some amount, it might be said not to replicate. Or if the Bayes factor is < a given amount. One would need to set a threshold for declaring non-replication. However, the view that some are championing these days is to eschew all such thresholds. Maybe one would just infer the new result was "incompatible" to some extent with replicating the finding.

But that is the problem I have with the repeated unqualified suggestions that likelihood accounts or Bayesian approaches will somehow not lead to “replication” troubles. It appears to be based on the idea that if one produces only a likelihood ratio or posterior prob, with no threshold basis for a conclusion, then you will never be “wrong.”

Should I not read this statement that way?

“Since, according to the likelihood account, there is no role of the p-value in the likelihood framework, the replication crisis does not arise immediately.”

John: I really don’t know how to read that. Here he seems to assume that there’s no replication problem if there are no P-values, but I really think that’s too absurd to attribute to him. Anyway, that’s why I decided to discuss it in the comments.

The problem with “P-values” is that they are not P-values, not that they are P-values. They lack the frequentist properties that P-values would have. In particular, the probability that they are less than or equal to P is generally much greater than P, because the “probability” generally has little connection to the implied null hypothesis (which typically includes selection on many levels).

Frequentist interpretations are in terms of long-run behavior, because frequentist probability is defined in terms of long-run behavior. But the evidence provided by considering long-run (or ensemble) behavior applies to individual cases. That’s the whole point of measures of uncertainty such as P-values and confidence sets.

Philip:

Yes, I agree that the problem is with illicit P-values, e.g., where the probability that the P-value ≤ p is much greater than p (under the null hypothesis).

I do think that something needs to be said as to why the long-run (ensemble) behavior is relevant to the case at hand, because it can happen that it isn’t. There’s the two measuring instruments of different precisions (SIST 170-1). Here’s a link:

Click to access SIST-2-measuring-instruments_170-173b.pdf

Someone might raise Kadane’s example (SIST 166-7) which looks only at controlling the type 1 error.

Even where error probabilities succeed in measuring the method’s capability to unearth and avoid mistaken inferences, this needs to be brought out, because error statisticians are often put on the defensive as to why error control matters to the individual case.

Philip, I don’t think that you are being clear or realistic, because it depends on how you define a p-value. You seem to be defining it in a manner that makes it both fluid and nebulous. Your implied definition of a p-value as something that is uniformly distributed under the null is not standard. Instead, it is the probability of a test statistic as extreme or more extreme according to the statistical model when the null is true. Yes, that will lead to a uniform distribution, but only within that particular model when that particular null is true.

When you ask a p-value to be uniformly distributed when there are sampling shenanigans and multiple testing, you are really asking it to be uniformly distributed in several models for several different nulls at the same time. Not possible, even if it were desirable.

I’ve written about this at length in section 4.2 of this book chapter: https://arxiv.org/abs/1910.02042

Michael: Perhaps Philip will respond, but let me just note: He does not “ask a p-value to be uniformly distributed when there is sampling shenanigans and multiple testing”–he is saying that such gambits prevent the computed (or nominal) P-value from being an actual P-value. That’s why he says the real problem with P-values is when they aren’t P-values.

Mayo, in saying that “the computed (or nominal) P-value” is prevented from being “an actual P-value” you imply that there is (or should be) an actual P-value that differs from the nominal. There would be no argument for such a ‘corrected’ P-value beyond it having a uniform distribution.

I look forward to Philip’s response.

Michael:

This is an essential part of what enables P-values to do their job. Spoze instead, for example, you reported the P-value is .01, and yet so small a P-value would occur with high probability under Ho (rather than 1% of the time); then you lose the rationale for taking the small P-value as indicative of incompatibility with Ho. If the Lady Tasting Tea scores a very small P-value, but your test would result in such an impressive score even if she were guessing (perhaps by letting her try and try again), then the small P-value is a poor indication she’s not guessing.

If a random variable is not stochastically dominated by the uniform distribution when the null hypothesis is true, it is not a P-value. If it is stochastically dominated by the uniform under the null, it is a valid P-value. Talking about “extreme” values is imprecise and treats one way of constructing a test (the tail probability of a test statistic under the null) as if it were the only way. Conversely, if you start with a test statistic X whose distribution is known under the null, and you want the test to reject for large values of the test statistic (or its absolute value), you can map it into a P-value.

The only alternative construction of P values that seems to work is to define them in terms of a monotone family of (measurable) rejection regions (of which tail sets are a special case).
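This validity criterion can be checked numerically (my own sketch; the "best of 5 looks" dredging rule below is an invented illustration): a valid P-value p satisfies Pr(p ≤ α) ≤ α for every α under the null, while the minimum of several nominal P-values does not.

```python
# Illustrative check (my construction): a valid P-value is stochastically
# dominated by Uniform(0,1) under the null; a dredged "P-value" is not.
import math
import random

random.seed(3)
M, N, LOOKS = 5000, 25, 5

def p_value(xs):
    # Two-sided normal-theory p-value, known sd = 1: uniform under the null
    z = abs(sum(xs)) / math.sqrt(len(xs))
    return math.erfc(z / math.sqrt(2))

valid = [p_value([random.gauss(0, 1) for _ in range(N)]) for _ in range(M)]

# "Dredged" p: report the smallest of 5 independent looks at null data
dredged = [min(p_value([random.gauss(0, 1) for _ in range(N)])
               for _ in range(LOOKS)) for _ in range(M)]

for a in (0.01, 0.05, 0.10):
    frac_valid = sum(p <= a for p in valid) / M      # close to a
    frac_dredged = sum(p <= a for p in dredged) / M  # much larger than a
    print(a, round(frac_valid, 3), round(frac_dredged, 3))
```

The first column tracks α as dominance requires; the dredged column exceeds it at every threshold, which is exactly the sense in which the nominal number "is not a P-value".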

Philip and Mayo, neither of your responses addresses the issue that I raised because you both assume a singular statistical model with a singular null hypothesis.

The fundamental problem that leads to the difficulty with p-values (aside from widespread miseducation) is that they are asked to serve two masters: global error rate control; and evaluation of local evidence. You both seem to be requiring that the p-value serve the first master to the exclusion of the second.

Whenever there is multiple testing or a flexible sampling rule, there will be more than one relevant statistical model and null hypothesis. Consider the well-known XKCD jelly beans cartoon. A simple model with a null that references only the green jelly bean data yields a p-value that Mayo calls (derisively) ‘nominal’. It tells us about the local evidence concerning green jelly beans and acne. That p-value would sometimes be adjusted to take into account the fact that 20 other p-values had been calculated (for the 19 other colours of jelly beans and the ensemble of unsorted jelly beans). That adjustment makes the p-value that Mayo would call ‘actual’ and that Philip might accept as “stochastically dominated by the uniform distribution when the null hypothesis is true”. That adjusted p-value would allow control of global error rates, but is now not an index of the evidence against the null hypothesis referring only to green jelly beans, because it is conditioned on the total number of tests.

You may personally prefer to use the adjusted, unconditional p-value, but I argue that a scientist should be looking at both the local and the global.
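The local/global contrast here reduces to a couple of lines of arithmetic (the numbers are invented for illustration, not taken from the cartoon or from SIST):

```python
# Jelly-bean arithmetic (illustrative numbers). Suppose the green-bean
# test alone yields a nominal p of 0.03, but 20 colour tests were run.
nominal_p = 0.03   # "local" evidence about green beans considered alone
k = 20             # total number of tests actually performed

# A Bonferroni-style adjustment accounts for the search across colours:
adjusted_p = min(1.0, k * nominal_p)
print(adjusted_p)  # about 0.6: no longer "significant" once the
                   # multiplicity is counted, but no longer an index of
                   # the green-bean data alone
```

The nominal 0.03 and the adjusted 0.6 answer different questions, which is the nub of the disagreement over which number deserves to be called the P-value.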

Michael: I feel that you have entirely overlooked the main thing that SIST does: develop a new philosophy of evidence in which relevant probabilities are used to assess how well or poorly tested claims are. It just came up in this blogpost and in my additional comment here:

https://errorstatistics.com/2020/03/01/replying-to-a-review-of-statistical-inference-as-severe-testing-by-p-bandyopadhyay/comment-page-1/#comment-188687

It’s developed and applied throughout SIST. Please have a look.

You still adhere to the supposition that evidence is entirely locked in the likelihood ratio from which it follows that error control does not matter. The alarming thing is the number of people who don’t realize that this philosophical perspective lies hidden beneath some recent “reforms” to omit the use of thresholds, without which there are no tests and no falsification, even of the statistical variety. Nor does it help to say, well we’ll bring error probabilities in later when we’re concerned about decisions, because you mislead everyone in arguing that they don’t matter for evidence. Welcome to the world where data-dredging, multiple testing, outcome switching, and the like are all permissible and don’t alter one’s evidential assessment. Anyone is free to hold this view, but they should say up front that it means giving free rein to the major source of bad statistics–biasing selection effects.

Mayo, as usual we seem to be failing to communicate. I did not write about SIST, but took issue with how you and Philip Stark choose to define p-values. I do not adhere to the notion that the likelihood ratio (I have argued endlessly against singular likelihood ratios!) contains all of the relevant information for inference. Nor do I hold the position that error control does not matter, or that any such position follows from evidence being “entirely locked in the likelihood ratio”.

Instead, I argue that inferences need to be informed by a number of distinct types of information and that those things cannot be encapsulated by any single statistical output.

You imply that I am comfortable with a “world where data-dredging, multiple testing, outcome switching, and the like are all permissible and don’t alter one’s evidential assessment”, but I am not. However, I am equally uncomfortable with a world where it is illicit to data-dredge, test multiply, and explore alternative outcomes in a preliminary, hypothesis-generating study.

I propose that the data contain the evidence, but the way in which the data were obtained affects how that evidence should inform inferences, as should the nature and role of the intended inferences.

Michael:

I should have parsed things to reflect the special notion of evidence you (and many other likelihoodists) favor, wherein warranted evidence is very different from warranted inference. Inference might be a belief or a decision, and considerations beyond likelihoods can enter. This is what I consider misleading for ordinary discourse. It is why I argue in my “P-values on trial” paper that it is so misleading to come in and say, Harkonen is off the hook and is not guilty of a misleading interpretation of data on grounds that he data-dredged (to convert his large P-value into a small one). Please see the link to my paper in Harvard Data Science Review:

https://errorstatistics.com/2020/02/01/my-paper-p-values-on-trial-is-out-in-harvard-data-science-review/

He is just describing the “evidence”! And the “evidence” is, strictly speaking, what the likelihoodist finds in his post hoc subgroup. This wouldn’t be so problematic if his defenders went on to say “but of course we think his inference (about his drug) is terrible! We were just talking about evidence! Of course you must consider those features that I ignored in reporting on the evidence when you’re doing inference!”

If his defenders added such a caveat, it would not be so misleading. Then the FDA and others would understand that words are being used in a very different way, and they can retort: we’re doing inference, and in fact setting policy. In ordinary discourse and the discourse of government agencies, the special philosophical choice of the linguistic distinction you favor doesn’t operate. So it’s a bit of a sleight of hand if you don’t go on to add “warranted evidence of H, but a wholly unwarranted inference to H”.

I hope that readers appreciate my point as to why experts so often talk past each other–a major way to “get beyond” the statistics wars.

Regarding your point about exploratory inference, error statisticians do not say you shouldn’t explore the data when you’re exploring (to arrive at claims to test). But distinct data are typically needed for testing.

(If you read SIST you would also know that I show there are contexts where using the same data to explore and test can lead to well warranted inferences. Look up “explaining a known effect” in the SIST index.) The “same” data are remodelled to ask a different question, in those special cases, and error probs aren’t vitiated.

There’s quite a lot that is confused and confusing in Bandyopadhyay’s review. I’m not even sure it’s constructive to record it all, but here are a few points, grouped by topic. These three concern the likelihood ratio.

1. “Severe testers like Mayo treat likelihoodists as close relatives of Bayesian, so it isn’t surprising that she is equally critical of the likelihoodists’ treatment of evidence crucial to resolving the replication crisis. The law of likelihood states that the first hypothesis is less supported than the second one given the data when Pr(D|H1) < Pr(D|H2). The goal of the likelihoodists is to compare the strength of evidence between the two models given the data. *According to the likelihoodist, the false sense of the p-value being objective has generated the replication crisis, because the replication crisis rests on adjusting the p-value”.*

MAYO: The starred (*) last sentence isn't meaningful as it stands. A guess is that he means that likelihoodists criticize statistical significance tests for adjusting the P-value for multiple testing and other biasing selection effects. But this is precisely what enables the significance tester to avoid being fooled by randomness, whereas the likelihoodist is stuck with Royall's "trick deck" hypothesis.

2. “For a given statistical model, the p-value represents the probability that the statistical summary would be greater or equal to the observed results when the null hypothesis is true. In contrast, there is no room for such a maneuver within the likelihood account, since it depends on the law of likelihood, in which one hypothesis is being compared with another relative to data. Since, according to the likelihood account, there is no role of the p-value in the likelihood framework, the replication crisis does not arise immediately.”

MAYO: This says that the replication crisis does not arise, at least not immediately, if there's no role for the P-value! That makes no sense. The same data-dredged hypothesis can readily occur in a likelihood ratio (witness Royall's trick deck, Excursion 1 Tour II), but, unlike the significance tester, the likelihoodist cannot block it as illicit.
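Royall’s trick deck point can be put in two lines (my own rendering of the example, not code from SIST): after any card is drawn, the data-dredged hypothesis that the deck consists of 52 copies of that very card beats the “normal deck” hypothesis by a likelihood ratio of 52.

```python
# Royall's "trick deck": draw one card from a shuffled deck. The
# maximally likely hypothesis, dredged from the data, says the deck
# is 52 copies of the card actually drawn. The law of likelihood then
# favors it over the normal-deck hypothesis by a factor of 52, no
# matter which card turned up. (Illustration only, not SIST code.)
p_card_given_normal = 1 / 52  # Pr(observed card | normal deck)
p_card_given_trick = 1.0      # Pr(observed card | 52 copies of that card)
lr = p_card_given_trick / p_card_given_normal
print(lr)  # 52.0, favoring the data-dredged hypothesis
```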

3. “However, the severe tester contends that the very idea of the likelihoodist's comparative account of evidence is problematic. Mayo, however, citing George Barnard, argues that for a likelihoodist, "there always is such a rival hypothesis viz., things just had to turn out the way they actually did" (Barnard 1972, 129). It is correct that even in the presence of two competing models, there is a possibility of having a saturated model, where there are as many parameters as there are data. In this sense, Barnard's claim is correct. However, this holds for any statistical paradigm using statistics. So, it is unclear whether Barnard's comment exposes any such shortcoming”.

MAYO: No, it does not hold for any statistical paradigm; it is blocked in error statistics. That is why George Barnard, who made that remark, came to reject the Likelihood Principle.

Continuation of Mayo response to particular assertions in the rest of his review:

4. Under his list of “at least five interrelated strategies severe testers can implement to prevent any future replication crisis,” he writes: “The first is to control error-probabilities and apply procedures consistent with the philosophy of severe testers. …*The third is to fix the p-value by increasing sample size n, where n must be large. This makes cut-off shorter.*”

MAYO: The last two sentences, i.e., the claims between asterisks (*), make no sense. How would a future replication crisis be prevented by fixing the P-value, making the “cut-off shorter”? I’m afraid the next sentence does not help:

“Thus, the observed data x will be closer to the null hypothesis than other alternatives.”

Making the sample size larger makes it easier to reject the null hypothesis for smaller and smaller discrepancies. I discuss the relation between sample size, power and severity quite a lot in the book, but nowhere do I say that increasing the sample size prevents future replication crises. In fact, I discuss how increasing the sample size leads to the “large n problem”.
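To make the large n problem concrete, a small sketch (my own illustration, not from SIST): hold a trivially small discrepancy from the null fixed and watch the one-sided z-test P-value shrink as the sample size grows.

```python
# The "large n problem": with a fixed, trivially small observed
# discrepancy from H0: mu = 0 (xbar = 0.02, sigma = 1 known), the
# one-sided z-test P-value goes to zero as n grows, so ever-smaller
# discrepancies become "significant". (Illustration, not SIST code.)
import math

def p_value(xbar, n, mu0=0.0, sigma=1.0):
    z = (xbar - mu0) * math.sqrt(n) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under H0

for n in (100, 10_000, 1_000_000):
    print(n, round(p_value(0.02, n), 4))  # 0.4207, 0.0228, 0.0
```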

I’m skipping some points, or this will become too long.

Continuation of Mayo responses to the review:

5. MAYO: Lift-off vs. drag-down (and convergent vs. linked arguments). The reviewer enters into many thorny brambles on these topics, which would be impossible to pull apart here. Please see the book. The simple definitions of lift-off and drag-down are on p. 15:

Lift-off : An overall inference can be more reliable and precise than its premises individually.

Drag-down: An overall inference is only as reliable/precise as is its weakest premise.

6. “A continuing feature of the severe tester’s account is that it wants to build up the testing of a global theory by piecemeal testing (1996, p. 190). This book maintains that emphasis. …However, I argue that this bottom-up account, tied to making an overall inference, falls prey to the charge of violating the probability conjunction rule. The general statement of the rule is that the probability of an entailed proposition (e.g., just S1) cannot have less probability than the entailing proposition (e.g., S1 & S2).”

MAYO: This would be a concern if I were a probabilist trying to give a posterior probability to a theory by conjoining its parts (and having to worry about conjunctivitis). But I reject this. Science would be in a terrible way if our inferences became more and more uncertain as we added information. Fortunately, we have triangulation, arguments from coincidence, and self-correction. And it’s not just with high level theories. See the example of the argument from coincidence about my weight gain within this post.

7. Continuation of Mayo’s response to the reviewer.

Reviewer: “The interpretation of probability, according to severe testers, is not important. But, what is crucial for a severe tester is how probability ought to be used in inferences (p. 13).”

MAYO: But I never say the interpretation of probability isn’t important. Why would I spend hundreds of pages discussing different interpretations? In the passage the reviewer quotes, I do say “Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference.” Philosophers nearly always start out ASSUMING that statistical inference is just probability theory. Statistical inference, in this view, is a species of direct inference about the probabilities of events, such as “the probability that Peter is a Swede”. (The reviewer quotes Kyburg here, and this was largely true of his work, which was strongly frequentist.) Philosophy of statistics then becomes a question of which definition of probability to use. I take that to be an important mistake! (It is also why the philosophy of probability, confirmation theory, and formal epistemology are often distant from the philosophical foundations of statistical science.) It is against the assumption that the role of probability in statistical inference is a settled matter that I write the sentence the reviewer takes to show I don’t care about the interpretation of probability. I deliberately start SIST by questioning the presumed role of probability in statistical inference, which I claim is not a species of probability theory.

(Popper is one of the few philosophers of science to question the probabilist role for probability in testing, but he doesn’t go far enough. See Excursion 2.)

If we view statistical inference as severe testing, the role of probability is to assess and control the capabilities of methods to avoid mistaken interpretations of data. This gives us a tool that can be used to critically assess other uses of probability and other methods of statistics: Bayesian, likelihoodist, etc. We assess whether they violate the minimal requirement for evidence.

If you tell me you’re a frequentist, it doesn’t tell me much. It doesn’t tell me whether your account of statistical inference violates the severity principle.

(You will also find quite a large number of Bayesian interpretations of probability discussed in SIST in the “gallimaufry” of construals (e.g., p. 402).)

I don’t know if the reviewer missed the discussion of the overarching position of the book on p. 55 (the very end of Excursion 1):

New Role of Probability for Assessing What’ s Learned. A passage to locate our approach within current thinking is from Reid and Cox (2015):

Reid and Cox: “Statistical theory continues to focus on the interplay between the roles of probability as representing physical haphazard variability . . . and as encapsulating in some way, directly or indirectly, aspects of the uncertainty of knowledge, often referred to as epistemic. (p. 294)

We may avoid the need for a different version of probability by appeal to a notion of calibration, as measured by the behavior of a procedure under hypothetical repetition. That is, we study assessing uncertainty, as with other measuring devices, by assessing the performance of proposed methods under hypothetical repetition. Within this scheme of repetition, probability is defined as a hypothetical frequency.” (p. 295)

This is an ingenious idea. Our meta-level appraisal of methods proceeds this way too, but with one important difference. A key question for us is the proper epistemic role for probability. It is standardly taken as providing a probabilism, as an assignment of degree of actual or rational belief in a claim, absolute or comparative. We reject this. We proffer an alternative theory: a severity assessment.

An account of what is warranted and unwarranted to infer – a normative epistemology – is not a matter of using probability to assign rational beliefs, but to control and assess how well probed claims are.

End of Mayo’s comments on the reviewer

8. The Reviewer: “When people, including myself, criticize her for ignoring the frequency-based prior probabilities in making inferences, she responds by saying that we commit the “fallacy of instantiating probabilities.”

MAYO: Yes, he does commit this fallacy. Randomly selecting a hypothesis, say it turns out to be Einstein’s GTR, from an urn containing 30% true hypotheses, does not mean the frequentist probability of GTR is .3. It is akin to a “division” fallacy where the property of a group is attributed to its members (like having 1.5 children). But it’s false to go on to suppose, as the reviewer does, that I consider relative frequencies irrelevant. Maybe read Gerd Gigerenzer’s remark on the book jacket:

“Deborah Mayo argues forcefully for a frequentist position on statistical inference, and it is a pleasure to see how passionately she treats the various issues analyzed.” Gigerenzer

9. The reviewer: “Philosophers tend to distinguish different questions so that one question is not confounded with the other. Two such questions are “the belief question” (what I prefer to call “the confirmation question”) and “the evidence question,” based on an insight suggested first by Royall.”

MAYO: Philosophers may distinguish evidence, belief and action, but we should not distinguish them along the lines of Royall where evidence follows the Likelihood Principle, and we always have evidence of a “trick deck” by Royall’s own admission (p. 38). I explicitly discuss Royall early on in Excursion 1, Tour II (p. 33), but perhaps the reviewer missed it. You can read it here:

Click to access sist_ex1-tourii.pdf

By the way, the reviewer says I deny the Likelihoodist is a frequentist. Has he not seen the list of key features of the Likelihoodist account listed clearly in Souvenir B (p. 41)?

• The LR offers “a precise and objective numerical measure of the strength of statistical evidence” for one hypothesis over another; it is a frequentist account and does not use prior probabilities (Royall 2004, p. 123).

I think I’d better stop while I’m still taking an easy, breezy view of this review.

“Moreover, I stick my neck out, and try to tackle essentially all of the examples around which there have been philosophical controversy from the severe tester’s paradigm.”

An example of a controversy that you do not address is effect size estimation in the context of clinical trials with early stopping designs. I’m not talking about Bayes here — just within the frequentist school this is a troubling topic. Why? Well, in Section 5.4 of SIST (“Severity Interpretation of Tests: Severity Curves”) you tell us that by answering the question “What discrepancies [from the null hypothesis], if they existed, would very probably have led your method to show a more significant result than you found?” we may then “infer that, at best, the test can rule out increases of that extent.” The difficulty is that “more significant result” is not a uniquely defined notion in the context of these sorts of clinical trials. So how would one construct severity curves for such trials and thereby get a post-data assessment of which discrepancies from the null can be ruled out with severity? No one knows (or if they do, they aren’t saying).

Corey:

You say there are cases with adaptive clinical trials where “no one knows” the best way to assess significance or choose the best test statistic. I discuss the examples I’m aware of where this is the focus of philosophical discussion. A notable example is the famous chestnut of two measuring procedures with different precisions (linked to in my post). While error probability control alone may not suffice, I propose that the goal of assessing severity warrants sensible conditioning. My idea is that the severity assessment–which goes beyond the idea of a formal best test, and directs you to consider the mistaken inference of interest in the context–avoids counterintuitive assessments of the relevant error probabilities. Applying SEV correctly requires adapting it to the context (and the background “repertoire of errors”). There needn’t be a “best” analysis to have a “good” one, as Neyman points out. Other exemplars (examples of a general type that have been the subject of philosophical debate) are in SIST. Please link me to the discussion where your debate arises.
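For readers curious about the severity curves in this exchange, here is a minimal numerical sketch for the simple one-sided Normal test (the standard textbook case, not the adaptive designs at issue; my own illustration, not code from SIST):

```python
# Hedged sketch of a severity assessment for the one-sided Normal test
# H0: mu <= 0 vs H1: mu > 0 with sigma = 1 known. After observing xbar,
# SEV(mu <= mu1) = P(Xbar > xbar; mu = mu1): the probability the test
# would have yielded a MORE significant result had discrepancy mu1
# existed. (Textbook illustration only, not Mayo's own code.)
import math

def sev_mu_le(mu1, xbar, n, sigma=1.0):
    z = (xbar - mu1) * math.sqrt(n) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Xbar > xbar; mu1)

n, xbar = 100, 0.1  # observed z = 1.0: not significant at the 0.025 level
for mu1 in (0.1, 0.2, 0.3, 0.4):
    print(f"SEV(mu <= {mu1}) = {sev_mu_le(mu1, xbar, n):.3f}")
    # 0.500, 0.841, 0.977, 0.999: larger discrepancies are ruled out
    # with higher severity; mu1 at the observed mean only at 0.5.
```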

A related set of comments between me and Corey may be found on Gelman’s blog here:

https://statmodeling.stat.columbia.edu/2019/04/12/several-reviews-of-deborah-mayos-new-book-statistical-inference-as-severe-testing-how-to-get-beyond-the-statistics-wars/#comment-1017868