One does not have evidence for a claim if little if anything has been done to rule out ways the claim may be false. The claim may be said to “pass” the test, but it’s one that utterly lacks stringency or severity. On the basis of this very simple principle, I build a notion of evidence that applies to any error-prone inference. In this account, data x are evidence for a claim C only if (and only to the extent that) C has passed a severe test with x. How to apply this simple idea, however, and how to use it to solve central problems of induction and statistical inference requires careful consideration of how it is to be fleshed out. (See this post on strong vs weak severity.)
Consider a fairly egregious, yet all-too-familiar, example of a poorly tested claim to the effect that a given drug improves lung function in people with a given fatal lung disease. Say the CEO of the drug company, confronted with disappointing results from an RCT (they are no better than would be expected from background variability alone), orders his data analysts to “slice and dice” the data until they get some positive results. They might try and try again to find a benefit among various subgroups (e.g., males, females, employment history, etc.). Failing yet again, they might vary how “lung benefit” is measured, using different proxy variables. This way of proceeding has a high probability of issuing in a report of drug benefit H1 (in some subgroup or other), even if no benefit exists (i.e., even if the null or test hypothesis H0 is true). (For a real case, see my “p-values on trial” in Harvard Data Science Review.)
The method has a high error probability in relation to what it infers, H1. H1 passes a test with low or even minimal severity. The gambit leading to low severity here goes by a variety of names: multiple testing, significance seeking, data-dredging, subgroup analysis, outcome switching, data torturing, and others besides. Experimental design principles endorsed by hundreds of medical journals, best-practice statistical manuals, and replication researchers reflect the need to block cavalier attitudes towards inferring data-dredged hypotheses. A variety of ways to avoid, adjust, or otherwise compensate for “post data selection,” as some now call it, are well known.
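To make the high error probability of the dredging method concrete, here is a minimal simulation sketch. Everything specific in it is an assumption chosen for illustration and is not taken from any real trial: the subgroups are independent, each has 20 patients per arm, outcomes are Normal with known standard deviation 1, and a one-sided 0.05 cutoff is used. The function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def dredge_once(rng, n_subgroups, n_per_arm=20, alpha=0.05):
    """Search subgroups under H0 (no benefit); True if any looks 'significant'."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided cutoff
    se = (2 / n_per_arm) ** 0.5  # SE of a difference in means, sigma = 1 (assumed)
    for _ in range(n_subgroups):
        # Both arms are drawn from the same distribution, so H0 is true.
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        if (fmean(treated) - fmean(control)) / se > z_crit:
            return True  # the analysts stop and report a "benefit"
    return False


def dredging_error_rate(n_subgroups, n_trials=2000, seed=1):
    """Estimate Pr(report some benefit; H0) for the search-until-success method."""
    rng = random.Random(seed)
    hits = sum(dredge_once(rng, n_subgroups) for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    print("one prespecified subgroup:", dredging_error_rate(1))
    print("best of 20 subgroups:     ", dredging_error_rate(20))
    print("analytic 1 - 0.95**20:    ", 1 - 0.95 ** 20)  # about 0.64
```

Under these assumptions the single prespecified analysis errs about 5% of the time, while searching 20 independent subgroups reports a benefit roughly 64% of the time even though none exists. Real subgroups overlap and outcome switching adds further flexibility, so the figure is only indicative.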
Some central features of the severity assessment:
- The severity assessment attaches to the method of inferring a claim C with a given test T and data x. The resulting assessment for a given hypothesis H1 (in this case, low) remains even if H1 is known or believed to be true (plausible, probable, or the like). Perhaps there are other data out there, y, or a different type of test, T’, that provide a warrant for H1, but that doesn’t change the low severity afforded by x from test T. In other words, asserting H1 might be right, but if it’s based on the post-data multiple searching method, it is right for the wrong reason. The method, as I described it, fails to distinguish genuine effects from cases where mere random variation throws up an interesting pattern in the particular subgroup that the researchers seize on.
- It is incorrect to speak of the severity of a test, in and of itself. Severity, as used and developed by me and by Spanos, refers to an assessment of how well-tested a particular claim of interest is. (It is post-data.) It is analogous to Popper’s term “corroboration” (a claim is corroborated if it passes severely)–never mind that he never adequately cashed it out. The severity associated with C measures how well-corroborated C is, with the data x and the test T under consideration.
- In assessing the severity associated with a method, we have to consider how it behaves in general, with other possible outcomes–not just the one you happen to observe–and under various alternatives. That is, we consider the method’s error probabilities–its capabilities to avoid (or commit) erroneous interpretations of the data. Methods that use probability (in inference) to assess and control error probabilities I call error statistical accounts. My account of evidence is one of severe testing based on error statistics.
- It is rarely the hypotheses or claims themselves that determine the severity with which they pass tests. Hypotheses do pass poor tests when they happen to contain sufficiently vague terms, lending themselves to “just so” stories. An example from Popper is the concept of an “inferiority complex” in Adler’s psychological theory. Whatever behavior is observed, Popper charges, can be ‘explained’ as in sync with Adler (same for concepts in Freud). The theory may be logically falsifiable, but it is immunized from being found false. The theory is easily saved by ad hoc means, even if it’s false. The data-dredger can pull off the same stunt, but, as is more typical, the flexibility there is in the data and in hypothesis generation and analysis. On the flip side, theories with high content and “corroborative tendrils” that give them more chances of failing enjoy high severity provided that they pass a test that probably would have found flaws. (Sometimes philosophers talk of a large scale theory, paradigm, or research program that is understood to include overall testing methods as well as particular hypotheses.) [Updated 4/5 to include the flip side. For a discussion see SIST (2018) pp. 237-8.]
If someone is interested in appraising the value of our account of severity, and especially if they purport to refute it, they should be sure they are talking about an account with these essential features. Otherwise, their assessment will have no bearing on this account of severity.
Severe testing considers alternative hypotheses but is not a comparative account–there’s a big difference!
A comparative account of evidence merely reports that one hypothesis (model or claim) is favored over another in some sense: It might be said to be more likely, better supported, fit the data better or the like. Comparative accounts do not test, provide evidence for, or falsify hypotheses. They are limited to claiming one fits data better than another in some sense, even though the two do not exhaust the possibilities, and even though both might be quite lousy. The better of two poorly warranted hypotheses is still a poorly warranted hypothesis. (See Mayo 2018, Mayo and Spanos 2011.)
The classic example of a comparative account is based on the likelihood ratio of hypothesis H1 over H0, which compares the probability (or density) of x under H1, which we may write as Pr(x;H1), to the probability of x under H0, Pr(x;H0).
The likelihood ratio is Pr(x;H1)/Pr(x;H0).
With likelihoods, the data x are fixed while the hypotheses vary. Given the data x, it is easy to find a hypothesis H1 that perfectly agrees with the data, so that H1 is a better fit to the data than is hypothesis H0. However, as statistician George Barnard puts it, “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129). So the probability of finding some better fitting alternative or other is high (if not guaranteed) even when H0 correctly describes the data generation.
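A small sketch can illustrate how reliably a post hoc alternative beats a true H0 on likelihood alone. It is a restricted version of Barnard's point, under assumptions chosen for illustration: the data are Normal with known standard deviation 1, H0 says μ = 0 and is true, and the rival H1 is picked after seeing the data to be μ = the sample mean. The function names are hypothetical.

```python
import math
import random
from statistics import fmean


def log_likelihood(xs, mu, sigma=1.0):
    """Log of Pr(x; mu) for independent Normal(mu, sigma) observations."""
    const = -math.log(sigma * math.sqrt(2 * math.pi))
    return sum(const - 0.5 * ((x - mu) / sigma) ** 2 for x in xs)


def lr_best_fit_vs_null(xs, mu0=0.0):
    """Pr(x; H1) / Pr(x; H0), with H1: mu = sample mean chosen post hoc."""
    mu_hat = fmean(xs)
    return math.exp(log_likelihood(xs, mu_hat) - log_likelihood(xs, mu0))


def fraction_favoring_dredged_h1(n=25, n_trials=2000, seed=1):
    """How often the post hoc H1 'beats' H0 when H0 actually generated the data."""
    rng = random.Random(seed)
    favored = 0
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 is true
        if lr_best_fit_vs_null(xs) > 1.0:
            favored += 1
    return favored / n_trials


if __name__ == "__main__":
    print(fraction_favoring_dredged_h1())
```

Because the sample mean maximizes the Normal likelihood, the ratio is at least 1 for every data set, so the simulated fraction is essentially 1. The ratio itself carries no trace of the fact that H1 was chosen to fit the data.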
Suppose someone proposes that H1 passes a severe test so long as the data are more probable under H1 than under some H0. Such an account will fail to meet even minimal requirements of severe tests in the error statistical account. Since the data dredging and other biasing selection effects do not alter the likelihood ratio or the Bayes Factor, basing severity on such comparative accounts will be at odds with the one we intend. This does not seem to bother the authors of a recent paper, van Dongen, Sprenger and Wagenmakers (2022), hereafter, VSW (2022). They say straight out:
the Bayes factor only depends on the probability of the data in light of the two competing hypotheses. As Mayo emphasizes (e.g., Mayo and Kruse, 2001; Mayo, 2018), the Bayes factor is insensitive to variations [in] the sampling protocol that affect the error rates, i.e., optional stopping of the experiment. The Bayes factor only depends on the actually observed data, and not on whether they have been collected from an experiment with fixed or variable sample size, and so on. In other words, the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion. (VSW 2022)
Stopping at this point and acknowledging the difference in statistical philosophies would be my recommendation. We’re not always in a context of severe testing in our sense. But these authors desire (or appear to desire) an error-statistical severity omelet without breaking the error statistical eggs (to allude to a famous analogous quote by Savage).
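The effect of optional stopping on error rates, which the quoted passage grants the Bayes factor ignores, can be illustrated with a short simulation. This is a sketch under assumed settings, not anything taken from VSW's paper: observations are Normal with known standard deviation 1, H0: μ = 0 is true, and the experimenter checks a two-sided nominal 0.05 test after every observation, stopping at the first “significant” result or at a maximum sample size. The function names are hypothetical.

```python
import random


def rejects_with_optional_stopping(rng, n_max, z_crit=1.96):
    """Test after each new observation; stop as soon as |z| reaches the cutoff."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0, 1)  # H0 (mu = 0) is true
        if abs(total) / n ** 0.5 >= z_crit:
            return True
    return False


def rejects_with_fixed_n(rng, n, z_crit=1.96):
    """Single test at a sample size fixed in advance."""
    total = sum(rng.gauss(0, 1) for _ in range(n))
    return abs(total) / n ** 0.5 >= z_crit


def rejection_rate(rule, n, n_trials=2000, seed=1):
    """Estimate Pr(reject H0; H0) for the given stopping rule."""
    rng = random.Random(seed)
    return sum(rule(rng, n) for _ in range(n_trials)) / n_trials


if __name__ == "__main__":
    print("fixed n = 100:          ", rejection_rate(rejects_with_fixed_n, 100))
    print("try-and-try-again to 100:", rejection_rate(rejects_with_optional_stopping, 100))
```

With a fixed sample size the rejection rate stays near the nominal 0.05. With the try-and-try-again rule it is several times larger, and it keeps growing as the maximum sample size increases. The observed data at the stopping point, and hence any likelihood ratio computed from them, look the same either way.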
In the next paragraph they assure us that they too can capture severity, if not in my sense, then in a (subjective Bayesian) sense they find superior:
We agree with this observation [in the above quote], but we believe that the proper place for severity in statistical inference is the choice of the tested hypotheses (VSW 2022).
But the example they give that is supposed to convince me that I ought to define severity comparatively is not promising. According to them:
a stringent scrutiny of the claim C: “90% of all swans are white” requires only a single swan if the alternative claim is “all swans are black”.
But H1: 90% of swans are white, does not pass a stringent scrutiny by dint of finding a single white swan x, although x falsifies H0: all swans are black. (It doesn’t matter for my point how we label the two hypotheses.) While I don’t know the precise distribution of white and black swans (nor how the sample was collected, nor whether the hypotheses are specified post hoc), it would be silly to suppose that a single white swan is good evidence that 90% of the population of swans are white.
A more familiar example of the same form as theirs would be to take a single case where a treatment works as grounds to stringently pass a hypothesis H1: that it works in at least 90% of the population. For these authors, as I understand them, what does the work that enables the alleged stringent inference to H1 is setting H0 as a hypothesis that x falsifies. Of course these two hypotheses scarcely exhaust the space of hypotheses, but this is a standard move (and a standard problem) in comparativist accounts. To my ears, the example illustrates the problem with a comparative appraisal: Pr(x;H1) is surely greater than Pr(x;H0), which is 0, but H1 has not thereby been subjected to a scrutiny that it probably would have failed, if false.
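The contrast between the comparative verdict and the error probability in the swan example can be put in a few lines. This is only a sketch under assumptions that go beyond the text: the single swan is drawn at random from the population, and the “test” is read as passing H1 whenever that swan is white. The function names are hypothetical.

```python
import math


def likelihood_white_swan(theta):
    """Pr(one randomly sampled swan is white; a fraction theta of swans are white)."""
    return theta


def likelihood_ratio(theta1, theta0):
    """Pr(x; H1) / Pr(x; H0) for x = 'the sampled swan is white'."""
    num = likelihood_white_swan(theta1)
    den = likelihood_white_swan(theta0)
    return math.inf if den == 0 else num / den


def prob_h1_passes(true_theta):
    """Pr(the rule passes H1) when the true fraction of white swans is true_theta."""
    # Assumed rule: H1 passes whenever the single sampled swan is white.
    return likelihood_white_swan(true_theta)


if __name__ == "__main__":
    print(likelihood_ratio(0.9, 0.0))  # inf: x falsifies 'all swans are black'
    for theta in (0.5, 0.6, 0.7):      # worlds in which H1 (90% white) is false
        print(theta, prob_h1_passes(theta))
```

The likelihood ratio is unbounded in favor of H1, yet in a world where only half the swans are white, H1 would still pass half the time. A claim that would so often pass when false has not passed severely.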
In statistical significance tests, say, concerning the mean μ of a Normal distribution, H0: μ ≤ μ0 versus H1: μ > μ0, we have an alternative hypothesis, but it is not a comparative account. (We could equally well have H0: μ = μ0.) VSW question how such an alternative can pass with severity because it is composite (p. 6): H1: μ > μ0 includes a range of values, e.g., the mean survival is higher in the treated than in the control group. Here’s how it does: a small p-value can warrant H1 with severity because with high probability, 1 – p, we would have obtained a larger p-value were we in a world where H0 is adequate. It is rather the comparative appraisal of point hypotheses that cannot falsify a hypothesis.
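The 1 – p reasoning can be sketched numerically for the one-sided Normal test. The sketch assumes a known standard deviation and uses the usual error-statistical formula for the severity of the claim μ > μ1, namely Pr(X̄ ≤ observed mean; μ = μ1), as given in Mayo and Spanos (2011) and SIST. The numbers (μ0 = 150, σ = 10, n = 100, observed mean 152) are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def p_value(xbar, mu0, sigma, n):
    """One-sided p-value Pr(Xbar >= xbar; mu = mu0) for H0: mu <= mu0."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z)


def severity_mu_greater(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1: Pr(Xbar <= xbar; mu = mu1)."""
    se = sigma / n ** 0.5
    return NormalDist().cdf((xbar - mu1) / se)


if __name__ == "__main__":
    xbar, mu0, sigma, n = 152.0, 150.0, 10.0, 100  # standard error = 1
    print("p-value:", p_value(xbar, mu0, sigma, n))
    for mu1 in (150.0, 151.0, 152.0, 153.0):
        print("SEV(mu >", mu1, ") =", severity_mu_greater(xbar, mu1, sigma, n))
```

With these numbers the p-value is about 0.023, and the severity for μ > 150 equals 1 – p, about 0.977. The composite alternative is handled by asking about particular discrepancies: severity falls to about 0.84 for μ > 151, to 0.5 for μ > 152, and to about 0.16 for μ > 153. The same data warrant some of the claims within H1 well and others poorly.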
I will study the rest of VSW’s paper at a later date. The subjective Bayesian account is sufficiently flexible to redefine terms and goals so that the newly defined severity passes the test. But since the authors already conceded “the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion,” it’s hard to see how their view is being put to a severe test.
I may come back to this in a later post. For a detailed development of severe testing, see proofs of the first three excursions from SIST.
Share your constructive remarks in the comments.
 Merely blocking an inference to a claim that passes with low severity is what I call weak severity. A fuller, strong severity principle says: We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, were they to be present, and yet none or few are found, the passing result, x, is evidence for C.
 Optional stopping is another gambit that can wreck error probability guarantees, violating what Cox and Hinkley (1974) call weak repeated sampling. (For details, see SIST pp 44-5; Mayo and Kruse 2001 below).
 Some Bayesians object to Bayes factors for similar reasons. Gelman (2011) says: “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (p. 74) or a weighted average of them.
 I have finally pulled together the pieces from the page proofs of the first three “excursions” of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) [SIST]. Here they are, beginning with the Preface: Excursions 1-3 from SIST. I would have hoped that scholars discussing severity and Popper would have looked at what I say about Popper in Excursion 2 (especially Tour II). To depict Popper as endorsing the naive or dogmatic variants called out by Lakatos in 1970 is highly problematic (e.g., that old view of falsification by “basic statements”).
The best treatment of Bayes and Popper, I recalled when writing this, is in a book by the non-subjective Bayesian, Roger Rosenkrantz (1977), chapter 6. I looked it up today, and yes I think it is an excellent discussion that at least takes a reader up to Popper 1977. (updated on 4/6/22)
Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science, 23(2), 123–32.
Cox, D. R. & Hinkley, D. (1974). Theoretical Statistics. London: Chapman and Hall LTD.
Gelman, A. (2011). Induction and Deduction in Bayesian Data Analysis, Rationality, Markets and Morals 2:67-78.
Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge: Cambridge University Press.
Mayo, D. & Kruse, M. (2001). Principles of inference and their consequences. In D. Corfield and J. Williamson (eds.) Foundations of Bayesianism, pp. 381-403. The Netherlands: Kluwer Academic Publishers.
Mayo, D. & Spanos, A. (2011). Error Statistics.
Savage, L. J. (1961). The Foundations of Statistics Reconsidered. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1. Berkeley: University of California Press, 575–86.
van Dongen, N. N. N., Wagenmakers, E., & Sprenger, J. (2020, December 16). A Bayesian perspective on severity: Risky predictions and specific hypotheses. PsyArXiv preprints. (To appear in Psychonomic Bulletin and Review 2022.)
Dear Readers: Most of my 2018 book was written by first posting bits and pieces on this blog. The discussion was often extremely valuable and it saved me from making mistakes in the book. Then, in 2018, I posted finished chapters on this blog which you can search. Going back to the earlier posts, in a way, is the most valuable because of the discussions. Here I link to proofs of the first 3 excursions.
I’ve read through pg 10, and I want to finish it. But I think their water heater example fails.
I’ll start with this. They say that S-1 comes from a rejection of some H0. I don’t remember reading this. Maybe I’m wrong here. In the water heater example, I think that xbar = 151 meets S-1 for the claim mu > 150, since it’s “in the right direction.” It just fails for S-2. Another example is on pg 263 of SIST (Section 4.5).
Their example falls apart at the bottom of pg 9 and on pg 10:
“However, when normal temperatures are around 100 degrees, you observe a mean temperature of 152 with a standard error of 1, and an “full-on emergency for the ecosystem” is imminent when the temperature is 153, it is clear that counter measures are acutely required.”
If I observed a 52-sigma deviation, I would stop worrying about severity and seek shelter. I still say that we do not have well-tested evidence for mu > 153 (I’m not even sure that S-1 applies), but that shouldn’t stop us from taking counter measures. We have good evidence of a 50-sigma deviation, and that should be bad enough!