**I. A principled disagreement**

The other day I was in a practice session (on Zoom) for a panel I’m on, about how different approaches and philosophies (Frequentist, Bayesian, machine learning) might explain “why we disagree” when interpreting clinical trial data. The focus is radiation oncology.[1] An important point of disagreement between frequentists (error statisticians) and Bayesians concerns whether, and if so how, to modify inferences in the face of a variety of selection effects, multiple testing, and stopping for interim analysis. Such multiplicities directly alter the capabilities of methods to avoid erroneously interpreting data, so the frequentist error probabilities are altered. By contrast, if an account conditions on the observed data, error probabilities drop out, and we get principles such as the *stopping rule principle.* My presentation included a quote from Bayarri and J. Berger (2004):

The stopping rule principle says that once the data have been obtained, the reasons for stopping experimentation should have no bearing on the evidence reported about unknown model parameters. This principle is automatically satisfied by Bayesian analysis, but is viewed as crazy by many frequentists. … if there’s an “option of stopping the trial early should the data look convincing, frequentists feel that it is then mandatory to adjust the allowed error probability (down) to account for the multiple analyses.” (Bayarri and Berger 2004, 77)

Bayesians don’t share this feeling.

One of my co-panelists, Amit Chowdhry, a physician at the University of Rochester Medical Center, sent around a paper that sheds light on current Bayesian thinking on the matter.

**II. “Do we need to adjust for interim analyses in a Bayesian adaptive trial design?” (Ryan et al. 2020):**

The authors do a good job of illuminating the disagreement and connect it to recent calls to “abandon” or “retire” statistical significance. My discussion in this post only refers to this article, which is open access, and not to the case studies around which the panel revolves. (All block quotes in blue refer to this article.)

The Bayesian approach, which is conditional on the data observed, is consistent with the strong likelihood principle. The final analysis can ignore the results and actions taken during the interim analyses and focus on the data actually obtained when estimating the treatment effect … That is,

inferential corrections, e.g., adjustments to posterior probabilities, are not required for multiple looks at the data and the posterior distribution for the parameter of interest can be updated at any time. This is appealing to clinicians who are often confused about why previous (or multiple concurrent in the case of multiple arms/outcomes) inspections of the trial data affect the interpretation of the final results in the frequentist framework where adjustments to p-values are usually required. The stopping rule in a Bayesian adaptive design does not play a direct role in a Bayesian analysis, unlike a frequentist analysis.

So it is no wonder there is disagreement. By conditioning on the data, and by being “consistent with the strong likelihood principle,” outcomes other than the one observed drop out. But error probabilities require looking at outcomes that could have occurred but didn’t. (There’s quite a lot on the strong likelihood principle, which I’ll just call the likelihood principle, on this blog. For a discussion of the likelihood principle and the related stopping rule principle, see my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST) (CUP 2018), Excursion 1, Tour II.)

Thus the “strict Bayesian,” as Ryan et al. (2020) call them, is in a quandary when using Bayesian analyses in clinical trials:

Decisions at analysis points are usually based on the posterior distribution of the treatment effect. However, there is some confusion as to whether control of type I error is required for Bayesian designs as this is a frequentist concept.

One can see why there would be confusion. Multiplicities–their focus is on cases where “multiplicities arise from performing interim analyses on accumulating data in an RCT”–do not call for adjustments to the Bayesian posterior probability of *H,* say that one treatment is superior to another. As they put it: “*inferential* corrections, e.g., adjustments to posterior probabilities, are not required for multiple looks at the data and the posterior distribution for the parameter of interest can be updated at any time”. That’s another way to state the stopping rule principle.


**III. Arguments for and against adjusting for multiplicities**

The authors say they will “discuss the arguments for and against adjusting for multiplicities in Bayesian trials with interim analyses.” A main argument in favor is that without adjustments, the associated Type I error probability can be high.[2] The authors show

that the type I error was inflated in the Bayesian adaptive designs through incorporation of interim analyses that allowed early stopping for efficacy and without adjustments to account for multiplicity. An increase in the number of interim analyses that only allowed early stopping for futility decreased the type I error, but also decreased power.

The Type I error probability can exceed 0.5:

Bayesian interim monitoring may violate the weak repeated sampling principle [Cox and Hinkley 1974] which states that,

“We should not follow procedures which for some possible parameter values would give, in hypothetical repetitions, misleading conclusions most of the time”.

That would seem to be strong evidence indeed for either avoiding or adjusting for multiplicities. Thus, in regulatory practice, a Bayesian might not wish to wear a strict Bayesian hat:

Whilst the long-run frequency behaviour of sequential testing procedures is irrelevant from the strict Bayesian perspective, long-run properties have been established as being important in the clinical trial setting, particularly for late phase trials…. Confirmatory trials will have to persuade, amongst the normal sceptical scientific audience, the competent authorities for healthcare regulation… One of the core functions of health-care regulators is to prevent those that market interventions from making spurious claims of benefit. For this reason, adequate control of type I error is one of the perennial concerns when appraising the results of confirmatory clinical trials.

That’s the argument for. What’s the argument against adjusting for multiplicities in Bayesian trials? The argument against is that it conflicts with being a strict Bayesian:

The requirement of type I error control for Bayesian adaptive designs causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle, and creates a design that is inherently frequentist.

But compliance with the likelihood principle (LP) means giving up on error control, so why is that a philosophical advantage? Do “philosophical advantages” refer to adherence to an a priori conception of inference on intuitive, rather than empirical, grounds? (Yes, I realize they’re referring to a long-held tradition wherein only subjective Bayesianism was thought to have respectable philosophical foundations, but that will no longer wash; besides, non-subjective Bayesian tools violate the LP and the stopping rule principle.) But the authors deserve credit for pinpointing what is at stake: either give up strict Bayesianism or give up error probability control.[3]

Aside: I’ve referred to three “principles,” which might seem confusing. Here’s a brief note to keep them straight.[4]

**IV. How Bayesian trials might adjust**

Even though computing error probabilities isn’t Bayesian, one can compute error probabilities associated with a rule, say, to report a posterior probability of .9 for the superiority of a treatment. For example, a trial might “stop for efficacy” if the posterior probability that *H* (superiority of treatment) is above 90%. One could also “stop for futility” if, say, Pr(*H*|data) < 0.1. Focus just on the former. Since either the treatment is superior or not, you might wonder what assigning *H* a prior and posterior means. From the example our group considers (two types of radiation treatments) and other examples, I take it that the prior probabilities come from a combination of elicited expert beliefs, empirical frequencies, and “non-subjective” or default priors (designed to make the data dominant in some sense). To obtain a kind of hybrid Bayesian-frequentist computation of a Type I error probability, they can ask: What’s the probability that the posterior probability of *H* would exceed .9, given that *H* is false?

I don’t know the details (I hope to learn more), but the idea is to simulate many thousands of trials and find the proportion that result in *H* getting a posterior probability of .9, assuming H is false. “This is achieved by determining how frequently the Bayesian design incorrectly declares a treatment to be effective or superior when it is assumed that there is truly no difference, for the given decision criteria/stopping boundaries”. You might see these as simulated (frequentist) probabilities of (Bayesian) posterior probabilities on *H* with lots of moving parts. There are programs for computing this. (I presume all of the prior probability assignments to all of the parameters in the model are obtained at the start, pre-data.) Perhaps Stephen Senn, as an expert on statistics in clinical trials, can weigh in here.
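To make the idea concrete, here is a minimal sketch of such a simulation. It is not the procedure used by Ryan et al. (2020), whose designs are more elaborate, and every modeling choice in it is a hypothetical simplification: each observation is a treatment difference drawn from N(delta, 1), the prior on delta is N(0, prior_sd²) with a large prior_sd (so close to flat), *H* is "delta > 0", the looks are equally spaced, and the trial stops for efficacy the first time Pr(*H*|data) exceeds the threshold. The function names and default values are invented for illustration.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_superior(total, n, prior_sd):
    """Pr(delta > 0 | data) when x_i ~ N(delta, 1) and delta ~ N(0, prior_sd^2).

    Conjugate normal update: posterior precision = prior precision + n.
    """
    precision = 1.0 / prior_sd ** 2 + n
    post_mean = total / precision
    post_sd = math.sqrt(1.0 / precision)
    return normal_cdf(post_mean / post_sd)


def declares_efficacy(rng, true_delta, n_max, n_looks, threshold, prior_sd):
    """Run one simulated trial; True if any look crosses the posterior threshold."""
    batch = n_max // n_looks  # observations added between looks (assumed equal)
    total, n = 0.0, 0
    for _look in range(n_looks):
        # The sum of `batch` N(delta, 1) draws is N(batch * delta, batch).
        total += rng.gauss(batch * true_delta, math.sqrt(batch))
        n += batch
        if posterior_prob_superior(total, n, prior_sd) > threshold:
            return True  # stop early (or at the final look) for efficacy
    return False


def simulated_rate(true_delta, n_looks, threshold=0.9, n_max=100,
                   prior_sd=10.0, n_sims=20000, seed=1):
    """Proportion of simulated trials that declare the treatment superior."""
    rng = random.Random(seed)
    hits = sum(
        declares_efficacy(rng, true_delta, n_max, n_looks, threshold, prior_sd)
        for _sim in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    # true_delta = 0.0 means H is false, so the rate is a simulated Type I error.
    for looks in (1, 5, 10):
        print(looks, "look(s):", simulated_rate(0.0, n_looks=looks))
```

Setting `true_delta = 0.0` gives the simulated Type I error probability; setting it to a clinically relevant value would give simulated power instead. Under the near-flat prior assumed here, a single final look yields a rate close to 1 minus the threshold (about 0.10 for a .9 cutoff), and adding efficacy looks can only raise it, since each extra look is another chance to cross the boundary.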

So the stopping rule only *directly* affects a frequentist, not a strict Bayesian analysis. But it’s possible to arrange it so as to *indirectly* alter the Bayesian analysis.

If, in designing the trial, it is found that the Type I error probability associated with a Bayesian posterior probability assignment is high, a Bayesian might require a higher posterior probability threshold before declaring evidence for *H*. Suppose ensuring a low Type I error probability would require a posterior probability of .95 in *H*. Even then, I think error statisticians would feel uncomfortable reporting Pr(*H*|data) = .95, because they would be unclear as to what the probability refers to. Either the treatment is superior or not (typically measured by some average benefit on some measure).[5]
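The indirect adjustment just described can be sketched as a search over candidate thresholds. This is an illustration under stated assumptions, not the calibration method of Ryan et al. (2020) or of any regulator: observations are N(delta, 1) treatment differences, the prior on delta is flat (so Pr(delta > 0 | data) equals Φ(sum/√n)), there are five equally spaced efficacy looks, and the target Type I error of 0.05 and the grid of thresholds are hypothetical choices.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def type1_rate(threshold, n_looks=5, batch=20, n_sims=20000, seed=7):
    """Share of null trials (true effect 0) that ever cross the posterior threshold.

    Assumes a flat prior, so Pr(delta > 0 | data) = Phi(sum / sqrt(n)).
    """
    rng = random.Random(seed)
    hits = 0
    for _sim in range(n_sims):
        total, n = 0.0, 0
        for _look in range(n_looks):
            # Sum of `batch` N(0, 1) observations under the null.
            total += rng.gauss(0.0, math.sqrt(batch))
            n += batch
            if normal_cdf(total / math.sqrt(n)) > threshold:
                hits += 1  # erroneous declaration of superiority
                break
    return hits / n_sims


def calibrate(target=0.05, grid=(0.90, 0.95, 0.975, 0.99, 0.995)):
    """Return the first (threshold, rate) on the grid whose simulated rate <= target."""
    for threshold in grid:
        rate = type1_rate(threshold)
        if rate <= target:
            return threshold, rate
    return None  # no candidate on the grid met the target


if __name__ == "__main__":
    print(calibrate())
```

The design choice worth noticing is that the threshold is picked pre-data by its frequentist operating characteristics, while the quantity reported at the end is still a posterior probability. That is exactly the hybrid the post is discussing.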

**V. Upshot**

In the view of Ryan et al. (2020), the Bayesian adjusts because doing so is required by funders and regulatory agencies, not because she might otherwise frequently be misleadingly declaring high posterior probability for *H*, when *H* is false.

One might have thought (hoped?) they would conclude that the Bayesian analysis may be missing something significant in denying the importance of error probabilities for inference (and I’m not distinguishing inference from evidence). Instead, they are prepared to grant that data provide strong evidence for *H*, or at least for believing *H* (which is presumably what high posteriors supply), even if it’s shown the method would very often license such a high level of belief erroneously. By contrast, on learning that *H* would frequently (e.g., with probability > 0.5) be accorded a high posterior probability, say .9, even if *H* is false, the error statistician would regard the .9 posterior as poor evidence for *H*. To her, the error probability of the method is not separable from the statistical inference: the inference to *H* includes the qualification of how well tested or corroborated *H* is.

So the error statistician might find it unsettling that, rather than take the disagreement as pointing to a lacuna in the Bayesian posterior assessment, by the end of the article Ryan et al. (2020) appear to take it as a reason to be spared from the requirement of error control. If we “simply report the posterior probability of benefit, then we could potentially avoid having to specify the type I error of a Bayesian design”.

Given the recent discussions to abandon significance testing [33, 34] it may be useful to move away from controlling type I error entirely in trial designs.[6]

So here we have another rationale behind the push to “abandon” or “retire” statistical significance: stopping rules would not have to be taken into account in Bayesian clinical trials.[7] I should point out that the most recent ASA Task Force on Statistical Significance and Replicability declares:

P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, … They are important tools that have advanced science through their proper application. Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability. (Benjamini et al. 2021)

(For links, see this post)

Ponder an interesting claim Ryan et al. (2020) make as to why “Type I errors are unlikely to be of interest to a strict Bayesian”:

Informative priors that favour an intervention represent a clash in paradigms between the role of the prior in a Bayesian analysis and type I error control in the frequentist framework (which requires an assumption of zero difference). Type I errors are unlikely to be of interest to strict Bayesians, particularly if there is evidence for a non-zero effect that is represented in an informative prior.

First, it should be noted that the assumption of zero difference is merely a hypothetical posit: it is used only to draw out implications (e.g., of being in a world where *H*_{0} adequately explains the data generation process). It is analogous to an implicationary assumption in a (deductive) reductio argument leading to a falsification. But what about the second sentence? Suppose you had what you consider good evidence for a non-zero effect, leading to an informative prior that favors an intervention. Now suppose you find out that the evidence came from studies that all had high Type I error probabilities, that is, a high probability of inferring efficacy erroneously. Would you say these “Type I errors are unlikely to be of interest” to you, since you already believe in the effect?


**VI. What Type I error probabilities are not**

I should just note that in two places the error statistical test is mischaracterized:

Type I errors calculate the probability of data conditional on some assumed fixed value of the parameter of interest (e.g., treatment effect = 0), which is unlikely to ever occur exactly.

This is not right. The probability of a Type I error is the probability of rejecting a test or null hypothesis *H*_{0} (e.g., inferring evidence of *H*: superiority of a treatment) under the assumption that *H*_{0} is true. It is not the probability of data. (Or, with p-values, there’s a computation of the probability that a test statistic d(X) exceeds the observed d(x), under a statistical hypothesis such as *H*_{0}.) [Pr(a test rejects *H*_{0} at level α; *H*_{0}) = α.]
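In symbols, and only as a restatement of the paragraph above: writing c_α for the cutoff of a level-α test (the symbol c_α is notation introduced here for illustration, and the equality assumes a continuous test statistic), the two quantities are

```latex
\alpha \;=\; \Pr\bigl(d(X) \ge c_\alpha;\ H_0\bigr),
\qquad
p(x) \;=\; \Pr\bigl(d(X) \ge d(x);\ H_0\bigr).
```

Both are probabilities of an event defined on the test statistic, computed under *H*_{0}. The semicolon marks that *H*_{0} is a hypothetical posit for the computation rather than an event being conditioned on, so no prior probability for *H*_{0} is needed.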

**So what, in your view, is the answer to the question in the title of this post?** Share your comments and corrections.

I’ll indicate later drafts with (i), (ii), etc.

[1] It will be at the meeting of the American Society for Radiation Oncology [ASTRO] in October 2021. I may be the only participant from philosophy of science.

[2] For simplicity, I will just focus on the Type I error; but the article is very clear about Type II error probabilities.

[3] I know that some Bayesians, like J. Berger, will redefine the “error probability” associated with assigning a posterior probability on *H* as the posterior probability of not-*H*. This allows using error statistical terminology while still wearing a Bayesian hat. But I don’t see that usage in the FDA or other agency requirements. For a discussion, see Excursion 3 Tour II. Please correct me if I’m wrong–I have not looked into this, but it doesn’t appear that the FDA has changed a whole lot from a decade ago. (It has changed some.)

[4] It suffices to remember that the stopping rule principle and the likelihood principle belong together (reflecting the view that error probabilities are irrelevant for evidence), and that both are at odds with the weak repeated sampling principle (which tells you to reject methods with high error probabilities). For a simple discussion of all three, just look at 3 pages of SIST: 44-46: Excursion 1, Tour II.

[5] Of course, the evidence for *H* is always qualified by how capable the given test was in uncovering flaws in *H.*

[6] They cite McShane et al. (2019) and Amrhein et al. (2019). For links to these two references see Ryan et al. (2020) or this paper of mine.

[7] To see what Peter Armitage, an expert in sequential trials in medicine, said to Jimmy Savage about this way back in 1962, see p. 46 of SIST.

**References** (aside from those linked above)

Bayarri, S. and Berger, J. (2004). “The Interplay of Bayesian and Frequentist Analysis.” *Statistical Science*, Vol. 19, No. 1 (Feb., 2004), pp. 58-80.

Mayo, D. *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018).


The prior of a strict Bayesian will indubitably be different if they find a correlation which seems interesting to them after looking at hundreds of different ones, than if they did a focussed experiment based on prior theory and prior experiments.

Secondly, the strict Bayesian has a theory which expresses their strictly own personal beliefs. If they publish a statistical analysis hoping to convince others, they had better arrange that their readers actually have the same prior beliefs as they do. Their scientific report will contain pages and pages describing and motivating their prior.

Richard:

Great to hear from you. Do you really think that if Bayesians “publish a statistical analysis hoping to convince others, they had better arrange that their readers actually have the same prior beliefs as they do”? Aren’t they supposed to be convincing people to “update their beliefs”?

Richard: On your first point, I assume that they would not be allowed to change the prior post data, but that it would need to be fully specified beforehand, at least in the context of a clinical trial. But your remark raises a very good point against the change that strict Bayesians now advocate, namely, not having to take account of prior looks. If the requirement to control error probabilities is dropped, the priors, and the endpoints, could also be allowed to change. I didn’t mention that in the post, and it’s important.

Richard: Appropriately precise.

I likely wasn’t here about a possible plan “B” https://statmodeling.stat.columbia.edu/2021/09/03/simulation-based-calibration-some-challenges-and-directions-for-future-research/#comment-2022601

Keith

(1) From the Bayesian point of view, I think it ultimately boils down to the reliability of the prior. Ryan et al. write: “Type I errors are unlikely to be of interest to strict Bayesians, particularly if there is evidence for a non-zero effect that is represented in an informative prior.” If this is the case and there’s early stopping because there’s a high posterior probability for a positive effect, this seems to imply that the information of the prior is still weighted highly, compared to a relatively small amount of observed data, meaning that the information from data alone will not be very reliable. This obviously requires the prior to be reliable because it will play a strong role in the overall result. The Bayesian can live with an elevated type I error if they have very good reasons to believe that the H0 will not be true anyway, and therefore the situation in which a type I error obtains will be very rare. This, however, seems to be risky given that in most Bayesian analyses I don’t see a very convincing motivation of the prior (I’m interested in whether Richard Gill has other experiences), and there are known issues such as overconfidence when specifying priors.

(2) As long as Bayesians believe that there is such a thing as a “true parameter” (and be it as a limiting value implied by their specified Bayesian beliefs), I don’t think they can object against analyses that give that true parameter some hypothetical value and investigate what happens then as “frequentist” and therefore incompatible with their philosophy.

(3) In the Ryan et al. paper there seems to be a distinction between a frequentist point null hypothesis (“treatment effect=0, which is unlikely to ever occur exactly”) and a Bayesian analysis that has a certain minimum value of a meaningful treatment effect (>1, say). This seems unfair. The frequentist can well use a H0 of effect <=1 in this case, and 1 does not need to be precisely true but would give a borderline error probability bounding what happens over the whole H0. (This can even be computed for a test with formal H0 of a zero effect.)

Christian:

Your comment makes several important points and I’m still working through them.

Under your (1), you say: “If this is the case and there’s early stopping because there’s a high posterior probability for a positive effect, this seems to imply that the information of the prior is still weighted highly, compared to a relatively small amount of observed data, meaning that the information from data alone will not be very reliable.” I’m wondering why you say this suggests there’s a “relatively small amount of observed data”. Is this because they’re alluding to a case where the Type I error probability is fairly high? That would essentially negate their point about not being interested in a high Type I error probability.

Your remark also brings up the fact that we don’t hear about the reliability of the prior.

You write: “the Bayesian can live with an elevated type I error if they have very good reasons to believe that the H0 will not be true anyway, and therefore the situation in which a type I error obtains will be very rare.”

Can you explain how the Type I error can be elevated to a fairly high probability and also be rare? And of course, even if the high posterior is frequent for trivial differences from H0, it would be worrisome. This connects with your points in (3):

(3) “… there seems to be a distinction between a frequentist point null hypothesis (‘treatment effect=0, which is unlikely to ever occur exactly’) and a Bayesian analysis that has a certain minimum value of a meaningful treatment effect (>1, say). This seems unfair. The frequentist can well use a H0 of effect <=1 …”

Right. I don’t know if this alludes to some FDA requirement. I hope this isn’t how they get their high posterior for the denial of H0.

I didn’t get into the issue of their power calculation, based as it is on the assumption of a clinically relevant alternative. Any thoughts about that?

Re (1) (a) “relatively small amount of observed data”: I was alluding to a situation in which the Bayesian decides to stop early, therefore not many observations, and of course the fewer observations, the larger the impact of the prior.

(b) “The Bayesian can live with an elevated type I error” – this was me getting my head around why the Bayesians think they can ignore the type I error probability; this should be mathematically in some way related to the posterior distribution, although of course not the same. But if the prior probability for the H0 is low, what happens in case the H0 is true indeed doesn’t have much connection to the posterior. (I’m not saying the Bayesian *should* ignore type I error probabilities, I was just explaining to myself how they at least could afford that if they stick to their own philosophy.)

Re (3): No, I was not alluding to some FDA requirement, I was just saying that the way frequentist error calculations were presented in the paper seemed unfair, because the frequentist is portrayed as having to rely on assuming that a point null is true, which isn’t necessary.

Christian:

Yes, you’re right (on point (1)(a)). That brings up the recommendation someplace in the article not to stop too soon.

You ask “What’s the argument against adjusting for multiplicities in Bayesian trials?”

The answers are complicated, but one straightforward argument is that adjustments rob the experiment of power to correctly discard a false null hypothesis. That may sound trite or trivial but it isn’t because there are many circumstances where a false positive error is of less concern than a false negative. Consider drug screening programs where mistakenly discarding a potential lead molecule early is a very costly missed opportunity whereas a false positive result (type I error) would naturally be corrected in the subsequent stages of the program at minimal expense. The same can be true of a clinical study where the study itself is not going to be definitive, and few clinical trials stand alone as the last word on the effectiveness of an intervention.

Another argument against adjustment for multiplicities is that those adjustments necessarily ignore or distort the actual evidence in the data. This argument is too long for a comment and relies on the clear distinction between local evidence and global error rates and so I will direct you to section 3 of this open access chapter: https://link.springer.com/chapter/10.1007/164_2019_286

Michael:

They do discuss stopping for futility, I only focussed on the issue of type I error control. As far as:

“Another argument against adjustment for multiplicities is that those adjustments necessarily ignore or distort the actual evidence in the data”–this, of course, is the key disagreement. For the error statistician, a change to the probability of erroneous interpretations of data IS a very big part of “the actual evidence in the data”. It’s this central disagreement, as to what counts as evidence, that should be at the forefront as the FDA and other agencies evaluate these Bayesian designs. The problem goes back to the fact that strict Bayesians and likelihoodists–and even not so strict variants– have never really understood the role of error properties in inference for error statistical testers. They see the error probabilities associated with the inference method as separable from the inference resulting from applying the method. They are not.

To be clear: the misunderstanding goes both ways. We don’t understand each other. Of course, you and I have had this discussion over much of the life of this blog. Savage went from at one time regarding the LP as patently wrong, to later viewing it as patently right (you know the quote), so maybe it’s possible to move from the latter to the former. However, if there really is a gestalt switch that separates us, then (following Kuhn) we can only grasp 1 of the 2 perspectives at any given time.

But, I’m wondering what you’d say to the Bayesian operating under the actual real-world regulatory rules.

My view is given in https://onlinelibrary.wiley.com/doi/10.1002/pst.1736 in which I argue that Bayesians should not seek “perfect calibration”, but rather be satisfied with being “well calibrated”. I could, I suppose, cite Voltaire’s aphorism “perfect is the enemy of the good” in support.

Mayo,

I am not sure that I know the answer to the question posited. But here is a bit of background on how the EMEA and the FDA have addressed multiplicity in clinical trials.

In 1998, the ICH issued its E9 guidance on statistical analyses for clinical trials. The discussion of statistical method is entirely in frequentist terms, with a brief caveat:

“This should not be taken to imply that other approaches are not appropriate: the use of Bayesian (see Glossary) and other approaches may be considered when the reasons for their use are clear and when the resulting conclusions are sufficiently robust.”

An addendum to the guidance in 2020 does not mention Bayesian statistical approaches at all.

In 2010, the US FDA issued a guidance on the use of Bayesian approaches in medical device trials:

https://www.fda.gov/regulatory-information/search-fda-guidance-documents/guidance-use-bayesian-statistics-medical-device-clinical-trials

This 2010 guidance addressed multiplicity as follows:

“Instead, a possible Bayesian approach to the subgroup problem mentioned above is to consider the subgroups as exchangeable, a priori, through the use of a hierarchical model. This modeling makes the Bayesian estimate of the device effect for a subgroup “borrow strength” from the other subgroups. If the observed device effect for a subgroup is large, then the Bayesian estimate is adjusted according to how well the other subgroups either support or cast doubt on this observation. For more discussion see Dixon and Simon (1991) and Pennello and Thompson (2008).

Bayesian adjustments to multiplicity can be acceptable to FDA, provided the analysis plan has been pre-specified and the operating characteristics of the analysis are adequate (see Section 4.8). Please consult FDA early on with regard to a statistical analysis plan that includes Bayesian adjustment for multiplicity.

Selected references on Bayesian adjustments for multiplicity are Scott and Berger (2006), Duncan and Dixon (1983), Berry (1988), Gonen et. al. (2003), Pennello (1997), and Lewis and Thayer (2004).”

Nathan

Nathan:

Thanks for your comment and all of the useful references. The paper I cite (Ryan et al. 2020) seems to be referring to a newer(?) FDA rule, at least for early phase trials. I would be very interested to know what you think of their description of how to compute Bayesian Type I/II errors. It’s a fairly short paper.

Mayo,

I will take a look. In the meanwhile, the link provided to the newer FDA guidance is broken. The referenced guidance is here:

https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adaptive-design-clinical-trials-drugs-and-biologics-guidance-industry

Nathan

In sequential analysis, that is the asymptote of interim analysis, one talks about power 1 tests. Somehow it is strange that neither sequential analysis nor power 1 tests were mentioned.

In the case of sequential analysis one refers to probability of false alarm (PFA) and conditional expected delay (CED). Setting a balance between them is optimal under several loss functions.

Procedures like Shiryayev-Roberts are frequentist in practice and Bayesian in interpretation.

https://onlinelibrary.wiley.com/doi/10.1002/qre.1436

Another missing element in the discussion is reference to S type errors.

“a high posterior probability for a positive effect,” as mentioned by Ritter is about that. If the analysis would focus on signed effect (S-type error) the results would be better generalized.

https://link.springer.com/article/10.1007%2Fs11192-021-03914-1

Would be nice to see the interesting discussion in this blog address these issues.

Should Bayesian Clinical Trialists Wear Error Statistical Hats?

If say 100 countries or states with similar populations each wished to evaluate drug efficacy (e.g. COVID vaccine or hydroxychloroquine) in well controlled medical trials, using an accepted Bayesian analysis based procedure, would those analyses all yield the same outcome?

That seems unlikely for a statistical procedure; random happenstance will introduce variation, ensuring some apparently different outcomes.

Now do Bayesians profess that all these outcomes are correct? Each one conditions on the data that arose in each of the countries.

Each country must decide whether to approve the drug or not.

If most countries approve a new drug, and begin showing positive health outcomes thereafter, while the remaining few do not approve and see greater levels of negative health outcomes, do Bayesians just say all countries made the correct decision because they used Bayesian methods?

Perhaps there is some Bayesian language for correct and incorrect decisions that I am unaware of, let me know if so.

There must be some sense of correct and incorrect decisions in the Bayesian realm, some sense of which Bayesian procedures will yield sensible decisions more often than other Bayesian procedures. Some Bayesian will need to explain this to me. I have a hard time believing that no Bayesian thinks about such things. I would call this error statistical thinking regarding a Bayesian analysis procedure and decisions based upon it.

I would be loath to base my medical decisions on a procedure with unknown operating characteristics, such as how often it does or does not produce sensible results.

I remain concerned about the efforts of some to wedge Bayesian procedures into FDA evaluations without a clear understanding of how often such procedures produce less than desirable outcomes. The cancer treatment decisions based on Bayesian methods peddled by the Duke University team including Anil Potti, Mike West, Joseph Nevins and others a decade ago represent one example of Bayesian procedures gone terribly wrong. Erroneous decision rates matter.

Beware the Bayesian Clinical Trialist not wearing an Error Statistical hat.

Dear Steven:

Thanks for your comment! What do you think of their attempt to compute Type I and II error probabilities by simulations based on considering a rule, such as, decide H: one treatment is superior to the other whenever the posterior probability of H given x reaches a given value?
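A simulation of this kind can be sketched as follows. It is not the authors' code and makes simplifying assumptions of its own: Gaussian outcomes with unit variance, a flat prior on the treatment difference, equally spaced looks, and the rule "declare superiority as soon as the posterior probability of a positive effect exceeds 0.975". The function name and all settings are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def type_one_error(n_looks, n_per_arm=100, threshold=0.975,
                   n_trials=10000, seed=0):
    """Estimate P(declare superiority at any look) when there is no true effect."""
    rng = np.random.default_rng(seed)
    # With a flat prior and known unit variance, P(effect > 0 | data) = Phi(z),
    # so the posterior exceeds `threshold` exactly when z exceeds this cut-off.
    z_cut = NormalDist().inv_cdf(threshold)
    # Equally spaced interim looks, the last one at the full sample size.
    looks = np.rint(np.linspace(n_per_arm / n_looks, n_per_arm, n_looks)).astype(int)
    treat = rng.normal(0.0, 1.0, size=(n_trials, n_per_arm))
    ctrl = rng.normal(0.0, 1.0, size=(n_trials, n_per_arm))
    stopped = np.zeros(n_trials, dtype=bool)
    for n in looks:
        diff = treat[:, :n].mean(axis=1) - ctrl[:, :n].mean(axis=1)
        z = diff / np.sqrt(2.0 / n)
        stopped |= z > z_cut  # a trial counts once it crosses at any look
    return float(stopped.mean())

rate_single = type_one_error(n_looks=1)
rate_five = type_one_error(n_looks=5)
print(f"one analysis: {rate_single:.3f}, five analyses: {rate_five:.3f}")
```

Under these assumptions the single-analysis rate sits near the nominal 0.025, while the unadjusted five-look rate is noticeably higher, which is the inflation the stopping boundaries would need to be adjusted for.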

You’re right that it does seem Bayesians would have to allow that H is true or false. They could say at most H is highly believable, but even saying this presupposes it has a truth value. I hope to hear from Bayesians on this.

I remember your excellent discussions of “the cancer treatment decisions based on Bayesian methods peddled by the Duke University team including Anil Potti, Mike West, Joseph Nevins and others a decade ago”. Has that method itself even been scrutinized? Or did everyone pretty much leave the blame with the flagrant problems with data recording, and their illicit cross validations?

Keith Baggerly and Kevin Coombes did a thorough review of the Duke methodology, and their materials thankfully are still available here:

https://bioinformatics.mdanderson.org/public-datasets/supplements/

https://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/

https://bioinformatics.mdanderson.org/Supplements/ReproRsch-Chemo/

In particular, the review in

https://bioinformatics.mdanderson.org/Supplements/ReproRsch-Ovary/

includes a discussion on pages 51-52 of

describing how the Duke group would split the data into “training” and “validation” sets, but actually start the model fitting procedure using both sets combined.

A model built on a particular data set will perform predictions on that same data set that look quite amazing!

Baggerly and Coombes summarize:

“We chose to not employ the above approach, because we are uneasy with step (c), which allows values in the test data to affect the coefficients of the predictive model.

Using the scores that we compute, we do not see a large story here.”

Baggerly and Coombes recapitulated the Duke Bayesian analysis using equivalent frequentist paradigms, e.g. frequentist logistic regression fitted to the training data alone instead of a Bayesian probit model fitted to both the training and validation sets (Baggerly and Coombes labeled the “validation” set as the “test” set – minor terminology issue).

A nice overview of the whole episode can be seen in the report from

FALL 2012 BIOTECHNOLOGY HEALTHCARE p17

at

Deborah: We discuss stopping rules in Bayesian Data Analysis. The short answer is that with sequential data collection, inclusion of data depends on the time that the data were collected, hence a Bayesian analysis should condition on time, for example by allowing the treatment effect to be time varying. This can make a difference in the analysis. This comes up on occasion and so I’ve blogged about it.

Richard: You write, “the strict Bayesian has a theory which expresses their strictly own personal beliefs.” The term “strict Bayesian” isn’t really defined so I guess it can mean whatever you want it to mean, but, in general, no, Bayesian statistics uses models and so does non-Bayesian statistics. You can call a Bayesian or a non-Bayesian model a “personal belief” if you want to, but there’s no reason it has to be. All those logistic regressions being done all over the world right now . . . I don’t think they represent the “personal beliefs” of the statisticians or data analysts. Logistic regression is a convenient model.

Andrew:

Thanks for your comment. I will check the link and reference to conditioning on time. I hope that others more familiar with this will comment as to how this handles the Type I error control discussed in this article.

On what they mean by “strict Bayesian”, I’m guessing the authors are referring to inference in terms of a posterior probability

“which is conditional on the data observed, is consistent with the strong likelihood principle. The final analysis can ignore the results and actions taken during the interim analyses and focus on the data actually obtained when estimating the treatment effect ”

“the long-run frequency behaviour of sequential testing procedures is irrelevant from the strict Bayesian perspective”.

I know you don’t (or didn’t) accept the strong LP, or even (I think) inference as a report of posterior probabilities (but you may have changed your position).

“If one wishes to demonstrate control of type I error in Bayesian adaptive designs that allow for early stopping for efficacy then adjustments to the stopping boundaries are usually required as the number of analyses increase. If the designs only allow for early stopping for futility then adjustments to the stopping boundaries may instead be required to ensure that power is maintained as the number of analyses increase. If one wishes to instead take a strict Bayesian view then type I errors could be ignored and the designs instead focus on the posterior probabilities of treatment effects of particular values.”

This is their conclusion, which I will paraphrase as: Yes, interim stops (especially for efficacy) will increase the errors beyond what one should expect, and we can make adjustments to calibrate our interpretations to meet expected errors. Bayesians have ways of adjusting for interim stops as do the frequentists. But a strict Bayesian would like to ignore the problems in error control and ask the rest of us to ignore it as well. If you do not peek in the closet you will not see the skeleton. The massive problem with this is that their simulations show very well why adjustment should be a hard requirement.

John: Exactly. Don’t look in the closet.