My BJPS paper: Severe Testing: Error Statistics versus Bayes Factor Tests


In my new paper, “Severe Testing: Error Statistics versus Bayes Factor Tests”, now out online at The British Journal for the Philosophy of Science, I “propose that commonly used Bayes factor tests be supplemented with a post-data severity concept in the frequentist error statistical sense”. But how? I invite your thoughts on this and any aspect of the paper.* (You can read it here.)

I’m pasting the abstract and the introduction below.

ABSTRACT: In the face of today’s statistical crisis of science, it is often recommended that statistical significance tests be replaced with Bayes factor tests. In this article, I examine this recommendation. Bayes factor tests, unlike statistical significance tests, only depend on the probability of the data under H0 and a competitor H1. They are insensitive to a method’s error probabilities such as significance levels, type 1 and type 2 errors, and confidence levels. It might be thought that if a method is insensitive to error probabilities, it escapes the inferential consequences of inflated error rates at the heart of obstacles to replication. I will argue that this is not the case, and that Bayes factor tests can accord strong evidence to a claim H, even though little has been done to rule out H’s flaws. There are two reasons: their insensitivity to biasing selection effects, and the fact that H and its competitor need not exhaust the space of relevant possibilities. I will show how this results in a disconnect between Bayes factor tests and the error control protocols that are being called for by replication reforms. To solve the problem, I propose that commonly used Bayes factor tests be supplemented with a post-data severity concept in the frequentist error statistical sense. The question is not whether ‘severity’ can be redefined Bayesianly—of course it can—the question is whether the resulting concept can address today’s concerns behind obstacles to replication. I will also respond to criticisms of the severity reformulation of statistical significance tests, and show how it enables avoiding fallacies of statistical tests.

1.    Introduction

A main source of handwringing in today’s ‘statistical crisis of science’ (Gelman and Loken [2014]) is that high-powered methods and researcher flexibility make it easy to find an impressive-looking effect in a particular study even though it is spurious. The data aligns with a hypothesized effect H, but the test H has passed fails to be stringent or severe. According to severe testers (Mayo [1996], [2018]; Mayo and Spanos [2006], [2011]; Mayo and Cox [2006]; Mayo and Hand [2022]):

Severity Requirement: Data x are evidence for a hypothesis H only to the extent that H passes a test that probably would have found evidence that H is false (or specifically flawed) just if it is.

This probability is the severity with which H has passed. Statistical significance tests are intended ‘as a first line of defense against being fooled by randomness’ (Benjamini [2016], p. 1). However, data dredging, multiple testing, and cherry picking can result in frequently, and erroneously, inferring that there is a genuine effect. The error probability associated with such an inference is high. Unsurprisingly, such data-dredged effects often disappear in attempted replications with stricter protocols, when the random variation goes a different way. Methods where inferential assessments require knowing the relevant error probabilities of the method producing x (for example, significance levels, type 1 and type 2 errors, confidence levels) may be called ‘error statistical methods’. While the replication crisis is fostering preregistration and other protocols to control error probabilities, some also take it as grounds to replace statistical significance tests with alternative methods, some of which are insensitive to error probabilities.[1]
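To make the inflation of error rates concrete, here is a minimal simulation sketch (my illustration, not from the paper; the choice of 20 tests, 30 observations per test, and the 0.05 level are assumptions): when many true null hypotheses are hunted through at the 0.05 level, the chance that at least one of them shows an impressive-looking effect is far higher than 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_studies = 10_000  # simulated "studies"
m_tests = 20        # hypotheses dredged per study (illustrative assumption)
n_obs = 30          # observations per test
alpha = 0.05

false_alarms = 0
for _ in range(n_studies):
    # All m_tests null hypotheses are true: the data are pure noise.
    data = rng.normal(loc=0.0, scale=1.0, size=(m_tests, n_obs))
    # One-sample t-test of mean = 0 for each of the m_tests outcomes.
    res = stats.ttest_1samp(data, popmean=0.0, axis=1)
    # Report the study as a "finding" if any single test reaches p < alpha.
    if (res.pvalue < alpha).any():
        false_alarms += 1

print(f"Per-test level: {alpha}")
print(f"P(at least one p < {alpha} among {m_tests} null tests) approx "
      f"{false_alarms / n_studies:.2f}")
```

With 20 independent tests the probability of at least one nominally significant result is roughly 1 − 0.95^20, about 0.64; this inflated error probability is exactly what the severity requirement is meant to flag.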

It might be thought that if a method is insensitive to error probabilities then it escapes inflated error rates due to biasing selection effects.[2] I will argue that this is not the case. Insensitivity to error probabilities has serious consequences when it comes to inferring genuine as opposed to spurious effects. My focus in examining this issue is an article advocating subjective Bayes factor tests by van Dongen, Sprenger, and Wagenmakers (VSW), two philosophers of science and a mathematical psychologist (van Dongen et al. [2023]). My arguments are relevant whenever Bayes factors are used as tests, as they typically are. The Bayes factor, B10, is the probability (or density) of observed data x under statistical hypothesis H1 as opposed to another, H0. It is the likelihood ratio: Pr(x; H1)/Pr(x; H0).[3] Following thresholds advocated by Jeffreys ([1961]), Bayes factor testers define test rules moving from a likelihood ratio of H1 and a competitor H0 to inferring weak, strong, or very strong evidence for H1 (or H0).
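For two simple (point) hypotheses the Bayes factor is just this likelihood ratio, and the test rule compares it with thresholds. The sketch below is my own illustration, not VSW’s example: the normal model, the point hypotheses µ0 = 0 and µ1 = 0.5, and the verbal labels (roughly in the spirit of Jeffreys’ scale) are all assumptions.

```python
import numpy as np
from scipy import stats

def bayes_factor_10(x, mu0=0.0, mu1=0.5, sigma=1.0):
    """B10 = Pr(x; H1) / Pr(x; H0) for two point hypotheses about a normal mean."""
    loglik1 = stats.norm.logpdf(x, loc=mu1, scale=sigma).sum()
    loglik0 = stats.norm.logpdf(x, loc=mu0, scale=sigma).sum()
    return np.exp(loglik1 - loglik0)

def verbal_label(b10):
    """Rough verbal labels along the lines of Jeffreys' (1961) scale (assumed cutoffs)."""
    if b10 < 1:
        return "evidence favours H0"
    if b10 < 3:
        return "weak evidence for H1"
    if b10 < 10:
        return "substantial evidence for H1"
    if b10 < 30:
        return "strong evidence for H1"
    return "very strong evidence for H1"

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=25)  # data generated under H1, for illustration
b10 = bayes_factor_10(x)
print(f"B10 = {b10:.1f}: {verbal_label(b10)}")
```

Notice that the calculation takes only the data and the two hypotheses: nothing about how the data or hypotheses were selected, how many other comparisons were tried, or when sampling stopped enters the reported B10.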

VSW are clear that the Bayes factor test is insensitive to error probabilities: ‘the Bayes factor only depends on the probability of the data in light of the two competing hypotheses. As Mayo emphasizes [. . . ] the Bayes factor is insensitive to variations [of] the sampling protocol that affect the error rates i.e., optional stopping of the experiment’ (van Dongen et al. [2023], p. 522). As a result, VSW acknowledge: ‘many Bayesians deny that severity should matter at all in inference. They refer to the Likelihood Principle [. . . ] According to this line of response, Popper, Mayo and other defenders of severe testing are just mistaken when they believe that severity should enter the (post-experimental) assessment of a theory’ ([2023], p. 517). The authors rightly observe that ‘much of the “statistics wars” [(Mayo [2018], p. xi)] between Bayesians and frequentists revolve around this controversy’ as to whether a method’s error probabilities should enter the (post-experimental) assessment of evidence (Cox [1958], [1977], [1978]; Edwards et al. [1963]; Berger and Wolpert [1988]; Royall [1997], [2000a], [2000b]; Lindley [2000]; Mayo [2018]). The controversy turns on a basic principle of evidence: the likelihood principle.[4] On the likelihood principle, once the data are in hand, all of the evidence (for a statistical hypothesis in a model) is contained in the ratio of likelihoods of hypotheses. With likelihoods, the data are fixed, the statistical hypotheses vary.

VSW’s standpoint creates a puzzle. While it adheres to the likelihood principle, it also ‘acknowledges Popper’s and Mayo’s argument that severity needs to be accounted for’ (van Dongen et al. [2023], p. 517). How can they account for severity while admitting ‘the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion’ (van Dongen et al. [2023], p. 522)? If they are to avoid inconsistency, I will argue, they must deny that severity enters into the post-experimental assessment of evidence. The question is not whether the term severity can be redefined Bayesianly, but whether the resulting concept will address today’s concerns behind obstacles to replication. VSW’s argument is notable because it purports to give a Bayesian redefinition of severity that can capture ‘the ideas of severe testing and error control’ while bypassing error probabilities (van Dongen et al. [2023], p. 517). I disagree.

I will argue that the Bayes factor test rule can accord strong evidence to a claim H, even though little has been done to rule out H’s flaws. There are two reasons: insensitivity to the loss of error control introduced by biasing selection effects, and the fact that H and its competitor need not encompass all relevant possibilities. This is especially problematic when Bayes factor tests are used as replacements for statistical significance tests—the main concern of this article. To mitigate this, I recommend Bayes factor tests be supplemented with a report of how severely H has passed, in the error statistical sense. Wagenmakers, a developer of Bayes factor software in psychology, might consider adding such an assessment. As practitioners accustomed to the error statistical guarantees of frequentist methods try out Bayesian tests, Bayesians are called on to reconcile their support for popular reforms focused on promoting error probability controls with their advocacy of methods that are insensitive to them.

The article is structured as follows: Section 2 sets out preliminary notions, namely, the (error statistical) severity requirement, and the crux of the rival notions of evidence at issue. Section 3 explains the key elements of statistical significance tests and my proposed severity reformulation of tests. Section 4 responds to VSW’s criticisms of the severity reformulation discussed in Section 3. Section 5 introduces Bayes factor tests, and shows the severity consequences of their insensitivity to error probabilities. Section 6 critically examines the viability of VSW’s Bayesian redefinition of severity.

Please share your thoughts, recommendations and questions in the comments.

*I blogged on their paper when it first came out here (“insevere tests of severe testing”) but eventually decided to write a paper.

NOTES:

[1] The literature is huge. Some multi-authored sources are (Wasserstein and Lazar [2016] and supplementary comments; Benjamin et al. [2018]; Lakens et al. [2018]; Wasserstein et al. [2019]; Benjamini et al. [2021]).

[2]  ‘Biasing Selection Effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed’ (Mayo [2018], p. 92).

[3] B10 is the factor by which a ratio of prior probabilities, Pr(H1)/Pr(H0), could be revised to obtain a ratio of posterior probabilities Pr(H1|x)/Pr(H0|x) (see note 28). The ‘;’ is generally used by frequentists, whereas Bayesians use ‘|’.
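In symbols, the relation this note describes is the odds form of Bayes’ theorem:

```latex
\frac{\Pr(H_1 \mid x)}{\Pr(H_0 \mid x)}
  \;=\;
  \underbrace{\frac{\Pr(x \mid H_1)}{\Pr(x \mid H_0)}}_{B_{10}}
  \times
  \frac{\Pr(H_1)}{\Pr(H_0)} .
```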

[4] The likelihood principle follows from inference by Bayes theorem. Savage ([1962], p. 17) states it: ‘According to Bayes’s Theorem, Pr(x|µ) [. . . ] constitutes the entire evidence of the experiment, that is it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that Pr(x|µ) and Pr(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the value of µ’ (notation altered).
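A standard textbook illustration of the proportionality Savage describes (my own sketch, not part of the paper): 9 heads observed with a coin of success probability θ, whether the number of tosses (12) was fixed in advance (binomial sampling) or tossing continued until 3 tails occurred (negative binomial sampling), yields likelihood functions that are constant multiples of one another in θ. On the likelihood principle the two outcomes say exactly the same thing about θ.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.05, 0.95, 19)  # grid of candidate values for the success probability

# Experiment 1: n = 12 tosses fixed in advance, x = 9 heads observed.
lik_binom = stats.binom.pmf(9, n=12, p=theta)

# Experiment 2: toss until r = 3 tails occur, y = 9 heads observed along the way.
# scipy's nbinom.pmf(k, n, p) is the probability of k failures before the n-th
# success with success probability p; setting p = 1 - theta makes scipy's
# "success" our tail, so k = 9 counts the 9 heads.
lik_nbinom = stats.nbinom.pmf(9, n=3, p=1 - theta)

# The two likelihood functions are proportional in theta: their ratio is constant.
ratio = lik_binom / lik_nbinom
print(np.allclose(ratio, ratio[0]))        # True
print(f"constant ratio approx {ratio[0]:.4f}")
```

An error statistical assessment, by contrast, distinguishes the two designs, since their sampling distributions, and hence their tail-area error probabilities, differ.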

Categories: Bayesian/frequentist, Likelihood Principle, multiple testing


4 thoughts on “My BJPS paper: Severe Testing: Error Statistics versus Bayes Factor Tests”

  1. Stan Young

    Deborah: The Bayes fanatics are cagey with their Bayes factor. It turns out there is essentially a linear relationship between p-values and Bayes Factors. Neither, on its own, deals with multiple testing and multiple modeling, MTMM. With either, you have to build MTMM into the process, before or after the fact. Ideally, they build MTMM into their analysis. Actually, they have to do it if, as they often claim, Bayesian analysis takes care of MTMM. IMO, they are not completely honest. Stan

    • Stan:

      Thanks so much for your comment. How do they build it into the process? Is it a pre-trial protocol limiting the multiplicity? Or trying to take account of it post-data? That would also require setting out the protocol for others to check. Can you tell us more?

  2. rkenett

    Thank you for this foundational paper.

    Following the reading of this interesting write-up, two comments come to mind that relate to past exchanges of emails and comments. At the risk of repeating myself, these are:

    1. There is a need to discuss verbal communication of findings. This goes beyond the formulation that “The data aligns with a hypothesized effect H, but the test H has passed fails to be stringent or severe”. In previous correspondence I suggested applying mapping methods and tools for this conceptual understanding.
    2. In verbal representations, the formulation of H is often directional. For example, H1: “If A increases, B decreases”. From this perspective, the sign-type (S-type) error applies, and we have H2: “If A increases, B increases”. This differs from the magnitude-type error considered in the paper. Gelman and Carlin propose assessing S-type error with empirical Bayes type methods. A translational medicine example is presented in Kenett and Rubinstein (2021), “Generalizing Research Findings for Enhanced Reproducibility: An Approach Based on Verbal Alternative Representations”, Scientometrics, 126, 5, pp. 4137-415. An open access draft is available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070
