**Stephen Senn**

*Head of Competence Center for Methodology and Statistics (CCMS), Luxembourg Institute of Health*
*Twitter: @stephensenn*

**Automatic for the people? Not quite**

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote:

> Ioannidis hits 8 on the Richter scale: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173184 … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes, but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try to explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given in understanding the paper[1] in question by its two authors, Don van Ravenzwaaij and Ioannidis himself.

What van Ravenzwaaij and Ioannidis (R&I) have done is investigate the FDA’s famous *two trials rule* as a requirement for drug registration. To do this R&I simulated two-armed parallel group clinical trials according to the following combinations of scenarios (p4).

Thus, to sum up, our simulations varied along the following dimensions:

1. Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)

2. Number of total trials: 2, 3, 4, 5, and 20

3. Number of participants: 20, 50, 100, 500, and 1,000

The first setting defines the treatment effect in terms of the common within-group standard deviation, the second the total number of trials submitted to the FDA (exactly two of which are significant) and the third the number of patients per group.

They thus had 3 × 5 × 5 = 75 simulation settings in total. In each case the simulations were run until 500 cases arose for which two trials were significant. For each of these cases they calculated a one-sided Bayes factor and then proceeded to judge the FDA’s P-value-based rule according to what the Bayes factor indicated.
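As a minimal sketch (my own, not R&I's code), the selection step can be reproduced by simulating programmes until the required number with exactly two significant trials has accumulated; the function name, normal outcomes and two-sided t-tests are assumptions on my part:

```python
import numpy as np
from scipy import stats

def simulate_programmes(effect=0.2, n_total_trials=3, n_per_group=100,
                        n_cases=500, alpha=0.05, rng=None):
    """Collect `n_cases` simulated programmes in which exactly two of
    `n_total_trials` two-armed trials are significant at level `alpha`."""
    rng = np.random.default_rng(rng)
    kept = []
    while len(kept) < n_cases:
        pvals = []
        for _ in range(n_total_trials):
            treat = rng.normal(effect, 1.0, n_per_group)   # treatment arm
            ctrl = rng.normal(0.0, 1.0, n_per_group)       # control arm
            pvals.append(stats.ttest_ind(treat, ctrl).pvalue)
        if sum(p < alpha for p in pvals) == 2:   # keep 'two significant' cases
            kept.append(pvals)
    return kept
```

Each retained case could then be fed to a Bayes-factor calculation, which is where R&I's analysis picks up.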

In my opinion this is a hopeless mishmash of two systems: the first (frequentist) conditional on the hypotheses, and the second (Bayesian) conditional on the data. They cannot be mixed to any useful purpose in the way attempted, and the result is irrelevant not only as frequentist statistics but also as Bayesian statistics.

Before proceeding to discuss the inferential problems, however, I am going to level a further charge of irrelevance as regards the simulations. It is true that the ‘two trials rule’ is rather vague in that it is not clear how many trials one is allowed to run to get two significant ones. In my opinion it is reasonable to consider that the FDA might accept two out of three, but it is frankly incredible that they would accept two out of twenty unless there were further supporting evidence. For example, if two large trials were significant and 18 smaller ones were not individually significant but were significant as a set in a meta-analysis, one could imagine the programme passing. Even this scenario, however, is most unlikely, and I would be interested to know of any case of any sort in which the FDA has accepted a ‘two out of twenty’ registration.

Now let us turn to the mishmash. Let us look, first of all, at the set-up in frequentist terms. The simplest common case to take is the ‘two out of two’ significant scenario. Sponsors going into phase III will typically perform calculations to target at least 80% power *for the programme as a whole*. Thus 90% power for individual trials is a common standard, since the product of the powers is just over 80%. For the two effect sizes of 0.2 and 0.5 that R&I consider this would, according to nQuery®, yield 527 and 86 patients per arm respectively. The overall power of the programme would be 81% and the joint two-sided type I error rate would be 2 × (1/40)² = 1/800, reflecting the fact that each of two two-sided tests would have to be significant at the 5% level and in the right direction.
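The power and type I error arithmetic can be checked directly; a normal-approximation sample-size formula gives 526 and 85 per arm, a patient or so below the t-based figures quoted from nQuery:

```python
import math
from scipy.stats import norm

alpha, power_each = 0.05, 0.90
z_a = norm.ppf(1 - alpha / 2)   # 1.96, two-sided 5% critical value
z_b = norm.ppf(power_each)      # 1.2816

for delta in (0.2, 0.5):
    # patients per arm for a two-group comparison, normal approximation
    n = 2 * (z_a + z_b) ** 2 / delta ** 2
    print(f"effect size {delta}: about {math.ceil(n)} per arm")

print("programme power:", power_each ** 2)                  # just over 0.81
print("joint two-sided type I error:", 2 * (1 / 40) ** 2)   # = 1/800
```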

Now, of course, these are planned characteristics in advance of running a trial. In practice you will get a result and then, in the spirit of what R&I are attempting, it would be of interest to consider the least impressive result that would *just* give you registration. This, of course, is P=0.05 for each of the two trials. At this point, by the way, I note that a standard frequentist objection can be entered against the two-trials rule. If the designs of the two trials are identical, then, since they are of the same size, the sufficient statistic is simply the average of the two results. If the trials were conducted simultaneously there would be no reason not to use this. This leads to a critical region for a more powerful test based on the average result from the two, providing a 1/1600 type I error rate (one-sided), illustrated in the figure below as the region to the right of and above the blue diagonal line. The corresponding region for the two-trials rule is to the right of the vertical red line and above the horizontal one. The just ‘significant’ value for the two-trials rule has a standardised z-score of 1.96 × √2 = 2.77, whereas the rule based on the average from the two trials would require a z-score of 3.23. In other words, evidentially, the value according to the two-trials rule is less impressive[2].
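The two z-scores can be reproduced as follows, assuming the pooled test is calibrated to the two-trials rule's one-sided type I error of (1/40)² = 1/1600:

```python
from scipy.stats import norm

z_each = norm.ppf(1 - 0.025)            # 1.96: each trial just significant
z_average = z_each * 2 ** 0.5           # z for the average of the two, about 2.77
z_pooled_crit = norm.ppf(1 - 1 / 1600)  # critical value of the pooled test, about 3.23

print(round(z_average, 2), round(z_pooled_crit, 2))
```

Since 2.77 falls short of the pooled test's critical value, a result that just passes the two-trials rule would fail the more powerful test based on the average of the same size.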

Now, the Bayesian will argue that the frequentist is controlling the behaviour of the procedure if one of two possible realities out of a whole range applies, but has given no prior thought to their likely occurrence or, for that matter, to the occurrence of other values. If, for example, moderate effect sizes are very unlikely, but it is quite plausible that the treatment has no effect at all, and the trials are very large, then even though satisfying the two-trials rule would be *a priori* unlikely, *if* it was only minimally satisfied, it might actually imply that the null hypothesis was likely true.

A possible way for the Bayesian to assess the evidential value is to assume, just for argument’s sake, that the null hypothesis and the set of possible alternative hypotheses are equally likely *a priori* (the prior odds are one) and then calculate the posterior probability, and hence odds, given the observed data. The ratio of the posterior odds to the prior odds is known as the *Bayes factor*[3]. Citing a paper[4] by Rouder et al. describing this approach, R&I use the BayesFactor package created by Morey and Rouder to calculate the Bayes factor corresponding to every case of two significant trials they generate.

Actually it is not *the* Bayes factor but *a* Bayes factor. As Morey and Rouder make admirably clear in a subsequent paper[5], what the Bayes factor turns out to be depends very much on how the probability is smeared over the range of the alternative hypothesis. This can perhaps be understood by looking at the ratios of likelihoods (relative to the value under the null) when P=0.05 for each of the two trials as a function of the true (unknown) effect size for the sample sizes of 527 and 86 that would give 90% power for the values of the effect sizes (0.2 and 0.5) that R&I consider. The logs of these (chosen to make plotting easier) are given in the figure below. The blue curve corresponds to the smaller effect size used in planning (0.2) and hence the larger sample size (527) and the red curve corresponds to the larger effect size (0.5) and hence the smaller sample size (86). Given the large number of degrees of freedom available, the Normal distribution likelihoods have been used. The values of the statistic that would be just significant at the 5% level (0.1207 and 0.2989) for the two cases are given by the vertical dashed lines and, since these are the values that we assume observed in the two cases, each curve reaches its respective maximum at the relevant value.

Wherever a value on the curve is positive, the ratio of likelihoods is greater than one and the posited value of the effect size is supported against the null. Wherever it is negative, the ratio is less than one and the null is supported. Thus, whether the posited values of the treatment effect that make up the alternative are supported *as a set* depends on how you smear the prior probability. The Bayes factor is the ratio of the prior-weighted integral of the likelihoods under the alternative to the likelihood under the null; since the likelihood under the null is a constant, the conditional prior under the alternative is crucial. *There is no automatic solution and careful choice is necessary.*

So what are you supposed to do? Well, as a Bayesian you are supposed to choose a prior distribution that reflects what *you* believe. At this point, I want to make it quite clear that if you think you can do this you *should* do so and I don’t want to argue against that. However, it is really hard and it has serious consequences[6].

Suppose that the sample size of 527 has been used, corresponding to the blue curve. Then any value of the effect size greater than 0 and less than about 2 × 0.1207 = 0.2414 has more support than the null hypothesis itself, but any value greater than 0.2414 has less. How this pans out in your Bayes factor now depends on your prior distribution. If your prior maintains that all possible values of the effect size when the alternative hypothesis is true must be modest (say, never greater than 0.2414), then they are all supported and so is the set. On the other hand, if you think that, unless the null hypothesis is true, only values greater than 0.2414 are possible, then all such values are unsupported and so is the set. In general, the way the conditional prior smears the probability is crucial.
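The support calculation can be sketched numerically (normal likelihoods as in the figure; the helper name is mine). The log likelihood ratio against the null is positive exactly when the posited effect lies between zero and twice the observed difference:

```python
def log_lr_vs_null(delta, observed, n_per_arm, sd=1.0):
    """Log likelihood ratio for effect size `delta` against the null
    (delta = 0), given an observed mean difference from a two-arm trial
    with `n_per_arm` patients per group (normal approximation)."""
    var = 2 * sd ** 2 / n_per_arm          # variance of the difference
    return (2 * observed * delta - delta ** 2) / (2 * var)

obs = 0.1207                             # just significant with n = 527 per arm
print(log_lr_vs_null(0.12, obs, 527))    # positive: supported against the null
print(log_lr_vs_null(0.30, obs, 527))    # negative: the null is supported
```

The sign change at 2 × 0.1207 = 0.2414 is exactly the boundary discussed above.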

Be that as it may, I doubt that choosing ‘a Cauchy distribution with a width of r = √2/2’, as R&I did, would stand any serious scrutiny. Bear in mind that these are molecules that have passed a series of *in vitro* and *in vivo* pre-clinical screens, as well as phases I, IIa and IIb, before being put to the test in phase III. However, if R&I were serious about this, they would consider how well the distribution works as a prediction of what actually happens in phase III and examine some data.
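For what it is worth, a Bayes factor of the broad kind R&I compute can be approximated numerically; the sketch below uses a normal approximation to the likelihood and a one-sided half-Cauchy prior, not the BayesFactor package's exact t-based integral, and the function name is mine:

```python
from math import sqrt, pi
from scipy.integrate import quad
from scipy.stats import norm

def approx_bf10(z, n_per_arm, r=sqrt(2) / 2):
    """One-sided Bayes factor for H1: delta > 0 under a half-Cauchy(0, r)
    prior on the effect size, against H0: delta = 0 (normal approximation)."""
    se = sqrt(2 / n_per_arm)   # SE of the effect-size estimate
    def integrand(delta):
        prior = 2 / (pi * r * (1 + (delta / r) ** 2))  # half-Cauchy density
        return norm.pdf(z, loc=delta / se) * prior
    # split the range so the peak of the integrand is well resolved
    marginal = quad(integrand, 0, 1)[0] + quad(integrand, 1, 20)[0]
    return marginal / norm.pdf(z)

# Two large trials (527 per arm) each just significant (z = 1.96):
print(approx_bf10(1.96, 527))          # close to 1: weak evidence either way
print(approx_bf10(1.96, 527, r=0.1))   # a prior concentrated on small effects favours H1
```

Changing r moves the Bayes factor from near-indifference to support for the alternative, which is precisely the sensitivity to the smearing of the prior discussed above.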

Instead, they assume (as far as I can tell) that the Bayes factor they calculate in this way is some sort of automatic gold standard by which any other inferential statistic can and should be judged, whether or not the distribution on which the Bayes factor is based is reasonable. This is reflected in Richard Lehman’s tweet ‘Bayes factors consistently quantify strength of evidence’, which in fact needs to be rephrased: ‘Bayes factors coherently quantify strength of evidence for You *if* You have chosen coherent prior distributions to construct them.’ It’s a big *if*.

R&I then make a second mistake of simultaneously conditioning on a result and a hypothesis. Suppose their claim is correct that in each of the cases of two significant trials that they generate the FDA would register the drug without further consideration. Then, for the first two of the three cases ‘Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)’ the FDA has got it right and for the third it has got it wrong. By the same token, wherever any decision based on the Bayes factor would disagree with the FDA *it would be wrong in the first two cases and right in the third*. However, this is completely useless information. It can’t help us decide between the two approaches. If we want to use true posited values of the effect size, we have to consider all possible *outcomes* for the two-trials rule, not just the ones that indicate ‘register’. For the cases that indicate ‘register’, it is a foregone conclusion that we will have 100% success (in terms of decision-making) in the first two cases and 100% failure in the third. What we need to consider also is the situation *where it is not the case that two trials are significant*.

If, on the other hand, R&I wish to look at this in Bayesian terms, then they have also picked this up the wrong way. If they are committed to their adopted prior distribution, then once they have calculated the Bayes factor there is no more to be said, and if they simulate from the prior distribution they have adopted, then their decision-making will, as judged by the simulation, turn out to be truly excellent. If they are not committed to the prior distribution, then they are faced with the thorny puzzle that is Bayesian robustness. How far can the prior distribution from which one simulates be from the prior distribution one assumes for inference in order for the simulation to be a) a severe test but b) not totally irrelevant?

In short, the R&I paper, in contradistinction to Richard Lehman’s claim, tells us nothing about the reasonableness of the FDA’s rule. That would require an analysis of data. Automatic for the people? Not quite. To be Bayesian, ‘to thine own self be true’. However, as I have put it previously, this is very hard and ‘You may believe you are a Bayesian but you are probably wrong’[7].

# Acknowledgements

I am grateful to Don van Ravenzwaaij and John Ioannidis for helpful correspondence and to Andy Grieve for helpful comments. My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

# References

1. van Ravenzwaaij, D. and Ioannidis, J.P., *A simulation study of the strength of evidence in the recommendation of medications based on two trials with statistically significant results.* PLoS One, 2017. **12**(3): e0173184.
2. Senn, S.J., *Statistical Issues in Drug Development.* Statistics in Practice. 2007, Hoboken: Wiley. 498.
3. O’Hagan, A., *Bayes factors.* Significance, 2006(4): 184-186.
4. Rouder, J.N., et al., *Bayesian t tests for accepting and rejecting the null hypothesis.* Psychonomic Bulletin & Review, 2009. **16**(2): 225-237.
5. Morey, R.D., Romeijn, J.-W., and Rouder, J.N., *The philosophy of Bayes factors and the quantification of statistical evidence.* Journal of Mathematical Psychology, 2016. **72**: 6-18.
6. Grieve, A.P., *Discussion of Piegorsch and Gladen (1986).* Technometrics, 1987. **29**(4): 504-505.
7. Senn, S.J., *You may believe you are a Bayesian but you are probably wrong.* Rationality, Markets and Morals, 2011. **2**: 48-66.