I. A principled disagreement
The other day I was in a practice session (on Zoom) for a panel I'm on about how different approaches and philosophies (frequentist, Bayesian, machine learning) might explain "why we disagree" when interpreting clinical trial data. The focus is radiation oncology. An important point of disagreement between frequentists (error statisticians) and Bayesians concerns whether, and if so how, to modify inferences in the face of a variety of selection effects, multiple testing, and stopping for interim analysis. Such multiplicities directly alter the capabilities of methods to avoid erroneously interpreting data, so frequentist error probabilities are altered. By contrast, if an account conditions on the observed data, error probabilities drop out, and we get principles such as the stopping rule principle. My presentation included a quote from Bayarri and J. Berger (2004):
The stopping rule principle says that once the data have been obtained, the reasons for stopping experimentation should have no bearing on the evidence reported about unknown model parameters. This principle is automatically satisfied by Bayesian analysis, but is viewed as crazy by many frequentists.
…if there’s an “option of stopping the trial early should the data look convincing,” frequentists feel that it is then mandatory to adjust the allowed error probability (down) to account for the multiple analyses. (Bayarri and Berger 2004, 77)
Bayesians don’t share this feeling.
One of my co-panelists, Amit Chowdhry, a physician at the University of Rochester Medical Center, sent around a paper that shed light on current Bayesian thinking on the matter.
II. “Do we need to adjust for interim analyses in a Bayesian adaptive trial design?” (Ryan et al. 2020):
The authors do a good job of illuminating the disagreement and connecting it to recent calls to “abandon” or “retire” statistical significance. My discussion in this post refers only to this article, which is open access, and not to the case studies around which the panel revolves. (All block quotes in blue refer to this article.)
The Bayesian approach, which is conditional on the data observed, is consistent with the strong likelihood principle. The final analysis can ignore the results and actions taken during the interim analyses and focus on the data actually obtained when estimating the treatment effect … That is, inferential corrections, e.g., adjustments to posterior probabilities, are not required for multiple looks at the data and the posterior distribution for the parameter of interest can be updated at any time. This is appealing to clinicians who are often confused about why previous (or multiple concurrent in the case of multiple arms/outcomes) inspections of the trial data affect the interpretation of the final results in the frequentist framework where adjustments to p-values are usually required. The stopping rule in a Bayesian adaptive design does not play a direct role in a Bayesian analysis, unlike a frequentist analysis.
So it is no wonder there is disagreement. By conditioning on the data, and by being “consistent with the strong likelihood principle,” outcomes other than the one observed drop out. But error probabilities require looking at outcomes that could have occurred but didn’t. (There’s quite a lot on the strong likelihood principle, which I’ll just call the likelihood principle, on this blog. For a discussion of the likelihood principle and the related stopping rule principle, see my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), (CUP 2018), Excursion 1, Tour II).
Thus the “strict Bayesian,” as Ryan et al. (2020) call them, is in a quandary when using Bayesian analyses in clinical trials:
Decisions at analysis points are usually based on the posterior distribution of the treatment effect. However, there is some confusion as to whether control of type I error is required for Bayesian designs as this is a frequentist concept.
One can see why there would be confusion. Multiplicities–their focus is on cases where “multiplicities arise from performing interim analyses on accumulating data in an RCT”–do not call for adjustments to the Bayesian posterior probability of H, say that one treatment is superior to another. As they put it: “inferential corrections, e.g., adjustments to posterior probabilities, are not required for multiple looks at the data and the posterior distribution for the parameter of interest can be updated at any time”. That’s another way to state the stopping rule principle.
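To make the stopping rule principle concrete, here is a minimal sketch (my own illustration, not from Ryan et al.): with a Beta prior on a binomial success probability, the posterior depends only on the observed successes and failures, so an experimenter who fixed the sample size in advance and one who kept sampling until a given number of failures occurred report the identical posterior for the same data.

```python
from scipy import stats

# Same data for two experimenters: 9 successes and 3 failures.
# Experimenter A fixed n = 12 in advance; experimenter B sampled until the
# 3rd failure. Both stopping rules give a likelihood proportional to
# p^9 * (1 - p)^3, so conditioning on the data yields one and the same posterior.
successes, failures = 9, 3

a0, b0 = 1, 1  # Beta(1, 1) (uniform) prior -- an illustrative assumption
posterior = stats.beta(a0 + successes, b0 + failures)

print("posterior mean:", round(posterior.mean(), 3))       # identical for A and B
print("Pr(p > 0.5 | data):", round(posterior.sf(0.5), 3))  # identical for A and B

# A frequentist, by contrast, assigns the two designs different error
# probabilities (and p-values), because the relevant sample spaces differ.
```

The frequentist significance level attached to “at least 9 successes” differs across the two designs (the familiar binomial versus negative binomial example), which is exactly where the error statistician and the strict Bayesian part company.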
III. Arguments for and against adjusting for multiplicities
The authors say they will “discuss the arguments for and against adjusting for multiplicities in Bayesian trials with interim analyses.” A main argument in favor is that without adjustments, the associated Type I error probability can be high. The authors show
that the type I error was inflated in the Bayesian adaptive designs through incorporation of interim analyses that allowed early stopping for efficacy and without adjustments to account for multiplicity. An increase in the number of interim analyses that only allowed early stopping for futility decreased the type I error, but also decreased power.
The Type I error probability can exceed 0.5:
Bayesian interim monitoring may violate the weak repeated sampling principle [Cox and Hinkley 1974] which states that, “We should not follow procedures which for some possible parameter values would give, in hypothetical repetitions, misleading conclusions most of the time”.
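To see how the inflation arises, here is a purely illustrative simulation of my own (not from the article): test accumulating normal data at an unadjusted 0.05 level at each of several interim looks, stopping as soon as a look “looks convincing.” The overall probability of erroneously declaring an effect climbs well above 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def overall_type_I_error(n_looks=10, n_per_look=20, alpha=0.05, n_sims=20_000):
    """Simulate a one-sample z-test on accumulating N(0, 1) data (so H0 is true),
    tested at an unadjusted two-sided level alpha at every interim look."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        data = rng.normal(0.0, 1.0, n_looks * n_per_look)
        for look in range(1, n_looks + 1):
            x = data[: look * n_per_look]
            z = x.mean() * np.sqrt(len(x))   # sigma = 1 is known here
            if abs(z) > z_crit:              # the data "look convincing": stop early
                rejections += 1
                break
    return rejections / n_sims

print(overall_type_I_error())  # roughly 0.2 with ten unadjusted looks, not 0.05
```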
That would seem to be strong evidence indeed for either avoiding or adjusting for multiplicities. Thus, in regulatory practice, a Bayesian might not wish to wear a strict Bayesian hat:
Whilst the long-run frequency behaviour of sequential testing procedures is irrelevant from the strict Bayesian perspective, long-run properties have been established as being important in the clinical trial setting, particularly for late phase trials…. Confirmatory trials will have to persuade, amongst the normal sceptical scientific audience, the competent authorities for healthcare regulation… One of the core functions of health-care regulators is to prevent those that market interventions from making spurious claims of benefit. For this reason, adequate control of type I error is one of the perennial concerns when appraising the results of confirmatory clinical trials.
That’s the argument for. What’s the argument against adjusting for multiplicities in Bayesian trials? The argument against is that it conflicts with being a strict Bayesian:
The requirement of type I error control for Bayesian adaptive designs causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle, and creates a design that is inherently frequentist.
But compliance with the likelihood principle (LP) means giving up on error control, so why is that a philosophical advantage? Do “philosophical advantages” refer to adherence to an a priori conception of inference on intuitive, rather than empirical, grounds? (Yes, I realize they’re referring to a long-held tradition wherein only subjective Bayesianism was thought to have respectable philosophical foundations, but that will no longer wash; besides, non-subjective Bayesian tools violate the LP and the stopping rule principle.) But the authors deserve credit for pinpointing what is at stake: either give up strict Bayesianism or give up error probability control.
Aside: I’ve referred to three “principles,” which might seem confusing. Here’s a brief note to keep them straight.
IV. How Bayesian trials might adjust
Even though computing error probabilities isn’t Bayesian, one can compute the error probabilities associated with a rule, say, to report a posterior probability of .9 for the superiority of a treatment. For example, a trial might “stop for efficacy” if the posterior probability of H (superiority of the treatment) exceeds .9. One could also “stop for futility” if, say, Pr(H|data) < .1. Focus just on the former. Since either the treatment is superior or it is not, you might wonder what it means to assign H a prior and a posterior probability. From the example our group considers (two types of radiation treatments) and other examples, I take it that the prior probabilities come from a combination of elicited expert beliefs, empirical frequencies, and “non-subjective” or default priors (designed to make the data dominant in some sense). To obtain a kind of hybrid Bayesian-frequentist computation of a Type I error probability, they can ask: What’s the probability that the posterior probability of H would exceed .9, given that H is false?
I don’t know the details (I hope to learn more), but the idea is to simulate many thousands of trials and find the proportion that result in H getting a posterior probability of at least .9, assuming H is false. “This is achieved by determining how frequently the Bayesian design incorrectly declares a treatment to be effective or superior when it is assumed that there is truly no difference, for the given decision criteria/stopping boundaries”. You might see these as simulated (frequentist) probabilities of (Bayesian) posterior probabilities on H, with lots of moving parts. There are programs for computing this. (I presume all of the prior probability assignments to all of the parameters in the model are fixed at the start, pre-data.) Perhaps Stephen Senn, as an expert on statistics in clinical trials, can weigh in here.
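As a rough sketch of what such a simulation might look like (a two-arm binary-outcome trial with Beta priors; every name, prior, and threshold below is my own illustrative assumption, not taken from the case studies): generate many trials under “no true difference,” run the Bayesian interim analyses, and record how often the posterior probability of superiority crosses the efficacy threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def bayesian_design_type_I_error(p_true=0.3, looks=(50, 75, 100),
                                 efficacy_cut=0.9, a0=1, b0=1,
                                 n_sims=2_000, post_draws=2_000):
    """Frequentist Type I error of a Bayesian design: the proportion of trials,
    simulated with NO difference between arms, in which Pr(p_trt > p_ctl | data)
    exceeds the efficacy threshold at some interim or final analysis."""
    n_max = max(looks)
    false_wins = 0
    for _ in range(n_sims):
        trt = rng.binomial(1, p_true, n_max)   # H0: both arms share the same rate
        ctl = rng.binomial(1, p_true, n_max)
        for n in looks:
            post_t = stats.beta(a0 + trt[:n].sum(), b0 + n - trt[:n].sum())
            post_c = stats.beta(a0 + ctl[:n].sum(), b0 + n - ctl[:n].sum())
            pr_superior = np.mean(post_t.rvs(post_draws, random_state=rng) >
                                  post_c.rvs(post_draws, random_state=rng))
            if pr_superior > efficacy_cut:     # "stop for efficacy"
                false_wins += 1
                break
    return false_wins / n_sims

print(bayesian_design_type_I_error())  # likely well above 0.05 with these settings
```

Raising the efficacy threshold, or allowing only futility stopping at the interim looks, would change the number, which is the sense in which the design’s frequentist properties depend on the stopping rule even though the posterior itself does not.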
So the stopping rule directly affects only a frequentist analysis, not a strict Bayesian one. But it is possible to arrange things so that it indirectly alters the Bayesian analysis.
If, in designing the trial, it is found that the Type I error probability associated with a Bayesian posterior probability assignment is high, a Bayesian might require a higher posterior probability threshold before declaring evidence for H. Suppose ensuring a low Type I error probability would require a posterior probability of .95 in H. Even then, I think error statisticians would feel uncomfortable reporting Pr(H|data) = .95, because they would be unclear as to what the probability refers to. Either the treatment is superior or it is not (superiority typically being measured by some average benefit on some measure).
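Continuing the sketch above (illustrative numbers only), the calibration amounts to raising the efficacy threshold until the simulated Type I error falls to the desired level:

```python
# Sweep candidate posterior thresholds; keep the smallest one whose simulated
# Type I error is at or below 0.05 (reusing bayesian_design_type_I_error above).
for cut in (0.90, 0.95, 0.975, 0.99):
    t1 = bayesian_design_type_I_error(efficacy_cut=cut)
    print(f"threshold {cut}: simulated Type I error {t1:.3f}")
    if t1 <= 0.05:
        print(f"declare efficacy only when Pr(H | data) > {cut}")
        break
```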
In the view of Ryan et al. (2020), the Bayesian adjusts because doing so is required by funders and regulatory agencies, not because she might otherwise frequently be misleadingly declaring high posterior probability for H, when H is false.
One might have thought (hoped?) they would conclude that the Bayesian analysis may be missing something significant in denying the importance of error probabilities for inference (and I’m not distinguishing inference from evidence). Instead, they are prepared to grant that the data provide strong evidence for H, or at least for believing H (which is presumably what high posteriors supply), even if it’s shown the method would very often license such a high level of belief erroneously. By contrast, upon learning that H would frequently (e.g., with probability > 0.5) be accorded a high posterior probability, say .9, even if H is false, the error statistician would regard the .9 posterior as poor evidence for H. To her, the error probability of the method is not separable from the statistical inference: the inference to H includes the qualification of how well tested or corroborated H is.
So the error statistician might find it unsettling that, rather than take the disagreement as pointing to a lacuna in the Bayesian posterior assessment, by the end of the article Ryan et al. (2020) appear to take it as a reason to be spared from the requirement of error control. If we “simply report the posterior probability of benefit, then we could potentially avoid having to specify the type I error of a Bayesian design”.
Given the recent discussions to abandon significance testing [33, 34] it may be useful to move away from controlling type I error entirely in trial designs.
So here we have another rationale behind the push to “abandon” or “retire” statistical significance: stopping rules would not have to be taken account of in Bayesian clinical trials. I should point out that the most recent ASA Task Force on Statistical Significance and Replicability declares:
P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, … They are important tools that have advanced science through their proper application. Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability. (Benjamini et al. 2021)
(For links, see this post)
Ponder an interesting claim Ryan et al. (2020) make as to why “Type I errors are unlikely to be of interest to a strict Bayesian”:
Informative priors that favour an intervention represent a clash in paradigms between the role of the prior in a Bayesian analysis and type I error control in the frequentist framework (which requires an assumption of zero difference). Type I errors are unlikely to be of interest to strict Bayesians, particularly if there is evidence for a non-zero effect that is represented in an informative prior.
First, it should be noted that the assumption of zero difference is merely a hypothetical posit; it is used to draw out implications (e.g., of being in a world where H0 adequately explains the data generation process). It is analogous to an implicationary assumption in a (deductive) reductio argument leading to a falsification. But what about the second sentence? Suppose you had what you consider good evidence for a non-zero effect, leading to an informative prior that favors an intervention. Now suppose you find out that the evidence came from studies that all had high Type I error probabilities, that is, a high probability of inferring efficacy erroneously. Would you say these “Type I errors are unlikely to be of interest” to you, since you already believe in the effect?
V. What Type I error probabilities are not
I should just note that in two places the error statistical test is mischaracterized:
Type I errors calculate the probability of data conditional on some assumed fixed value of the parameter of interest (e.g., treatment effect = 0), which is unlikely to ever occur exactly.
This is not right. The probability of a Type I error is the probability of rejecting a test or null hypothesis H0 (e.g., inferring evidence of H, the superiority of a treatment) under the assumption that H0 is true. It is not the probability of data. (Or, with p-values, it is the probability that a test statistic d(X) exceeds the observed d(x), computed under a statistical hypothesis such as H0.) [Pr(the test rejects H0 at level α; H0) = α.]
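A tiny numerical check (an illustrative one-sided z-test with known variance, not anything from the article) makes the point concrete: the Type I error probability is the probability of the rejection event, computed under H0, and it comes out equal to the chosen α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n = 0.05, 30
z_crit = stats.norm.ppf(1 - alpha)   # one-sided test of H0: mu = 0 vs. mu > 0

# Generate many samples under H0 (mu = 0, sigma = 1) and count rejections.
samples = rng.normal(0.0, 1.0, size=(100_000, n))
z = samples.mean(axis=1) * np.sqrt(n)
print("Pr(test rejects H0; H0) =", np.mean(z > z_crit))   # close to alpha = 0.05
```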
So what, in your view, is the answer to the question in the title of this post? Share your comments and corrections.
I’ll indicate later drafts with (i), (ii), etc.
 It will be at the meeting of the American Society for Radiation Oncology [ASTRO] in October 2021. I may be the only participant from philosophy of science.
 For simplicity, I will just focus on the Type I error; but the article is very clear about Type II error probabilities.
 I know that some Bayesians, like J. Berger, will redefine the “error probability” associated with assigning a posterior probability on H as the posterior probability of not-H. This allows using error statistical terminology while still wearing a Bayesian hat. But I don’t see that usage in the FDA or other agency requirements. For a discussion, see Excursion 3 Tour II. Please correct me if I’m wrong–I have not looked into this, but it doesn’t appear that the FDA has changed a whole lot from a decade ago. (It has changed some.)
 It suffices to remember that the stopping rule principle and the likelihood principle belong together (reflecting the view that error probabilities are irrelevant for evidence), and that both are at odds with the weak repeated sampling principle (which tells you to reject methods with high error probabilities). For a simple discussion of all three, just look at 3 pages of SIST: 44-46: Excursion 1, Tour II.
 Of course, the evidence for H is always qualified by how capable the given test was in uncovering flaws in H.
 To see what Peter Armitage, an expert in sequential trials in medicine, said to Jimmy Savage about this way back in 1962, see p. 46 of SIST.
References (aside from those linked above)
Bayarri, M. J. and Berger, J. O. (2004). “The Interplay of Bayesian and Frequentist Analysis.” Statistical Science 19(1): 58-80.
Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.