The complicated and controversial world of bioequivalence
by Stephen Senn*
Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult.
If, as is often the case, both formulations are given in forms that are absorbed through the gut, whether as pills, oral solutions or suppositories, then so-called bioequivalence trials form an attractive option. The basic idea is that the concentration in the blood achieved by the new test formulation can be compared with that achieved by the licensed reference formulation. Equivalence of concentration in the blood plausibly implies equivalence at all possible effect sites and thus equality of all benefits and harms.
Typically, healthy volunteers are recruited and given the test formulation on one occasion and the reference formulation on another, the order being randomised. Regular blood samples are taken and the concentration-time curves summarised using simple statistics: for example, the area under the curve (AUC) is always used, the maximum concentration (Cmax) nearly always, and the time to reach the maximum (Tmax) very often. These statistics are then compared across formulations to show that they are similar.
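By way of illustration, here is a minimal sketch (in Python, with invented sampling times and concentrations) of how these summary statistics might be computed from a single volunteer's concentration-time curve, using the trapezoidal rule for the AUC:

```python
import numpy as np

# Hypothetical sampling times (hours) and blood concentrations (ng/mL)
# for one volunteer; the numbers are invented purely for illustration.
times = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0])
conc = np.array([0.0, 12.3, 25.1, 31.4, 22.0, 10.5, 4.8, 0.9])

# AUC by the trapezoidal rule over the sampling interval.
auc = np.trapz(conc, times)

# Cmax is the largest observed concentration; Tmax the time it occurred.
cmax = conc.max()
tmax = times[conc.argmax()]

print(f"AUC = {auc:.1f} ng.h/mL, Cmax = {cmax:.1f} ng/mL, Tmax = {tmax:.1f} h")
```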
In the rest of this post I shall ignore the problem that various summary measures are employed and assume that we are just considering AUC. There seems to be a general (but arbitrary) agreement that two formulations are equivalent if the true ratio of AUC under test and reference lies between 0.8 and 1.25. In that case (at least as regards the AUC requirement) the formulations are deemed bioequivalent. The true ratio, however, is a parameter, not a statistic, and so the task is to see what the data can show about the reasonableness of any claim regarding this unknown theoretical quantity.
It is here, however, that the statistical difficulties begin. A simple frequentist solution would appear to be to calculate a 95% confidence interval for the relative bioavailability and check that it lies within the limits of equivalence. Modelling is always done on the log scale and, since log(0.8) = -log(1.25), the limits for the log relative bioavailability of test and reference are (approximately) -0.22 to +0.22. However, there is more than one 95% confidence interval, and an early dispute in this field was whether a traditional confidence interval centred on the point estimate should be calculated, as Kirkwood proposed in 1981, or one centred on the middle of the range of equivalence, that is to say on 0 (on the log scale), as Westlake had earlier proposed in 1972.
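As a rough sketch of the conventional (Kirkwood-style) calculation, the following fragment computes a traditional 95% interval for the log relative bioavailability from hypothetical within-subject differences in log(AUC); it is a simple paired analysis that ignores period effects, and all names and numbers are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical log(AUC) values for 12 subjects under reference and test;
# a simple paired analysis ignoring period effects, for illustration only.
rng = np.random.default_rng(1)
log_auc_ref = rng.normal(loc=5.0, scale=0.30, size=12)
log_auc_test = log_auc_ref + rng.normal(loc=0.05, scale=0.15, size=12)

d = log_auc_test - log_auc_ref              # within-subject log-ratios
est, se, df = d.mean(), stats.sem(d), len(d) - 1
t_crit = stats.t.ppf(0.975, df)

lo, hi = est - t_crit * se, est + t_crit * se
limit = np.log(1.25)                        # = -log(0.8), about 0.223

print(f"95% CI for the log ratio: ({lo:.3f}, {hi:.3f})")
print(f"Back-transformed ratio:   ({np.exp(lo):.3f}, {np.exp(hi):.3f})")
print("Inside the limits of equivalence:", -limit < lo and hi < limit)
```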
As O’Quigley and Baudoin pointed out, the difference is, essentially, between deciding whether the ‘shortest’ confidence interval is included within the limits of equivalence or whether the fiducial probability that the true relative bioavailability lies within the limits is at least 95%. The latter is always the easier requirement to satisfy. To see why, consider the case where the point estimate is positive. In that case the lower conventional confidence limit would clearly never lie outside the limits of equivalence unless the upper one did. Thus, by extending the interval at the lower end and shortening it at the upper end in such a way as to maintain the 95% probability, one can make it easier to satisfy equivalence.
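To make the contrast concrete, here is a rough sketch, under a Normal approximation and with a made-up estimate and standard error, of a symmetric interval in Westlake's spirit alongside the conventional one; the numerical construction below (solve for k2 so that the interval is symmetric about 0 while keeping 95% coverage) is one way to realise the idea, not necessarily Westlake's own computation:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Made-up point estimate and standard error on the log scale.
est, se = 0.12, 0.06
limit = np.log(1.25)

# Conventional 95% interval, centred on the point estimate.
z = stats.norm.ppf(0.975)
conv = (est - z * se, est + z * se)

# A symmetric interval (-u, u) with u = est + k2*se requires
# k1 = k2 + 2*est/se and Phi(k1) + Phi(k2) = 1.95 for 95% coverage.
def coverage_gap(k2):
    k1 = k2 + 2 * est / se
    return stats.norm.cdf(k1) + stats.norm.cdf(k2) - 1.95

k2 = brentq(coverage_gap, -10.0, 10.0)
u = est + k2 * se

print(f"Conventional: ({conv[0]:.3f}, {conv[1]:.3f}), "
      f"inside limits: {-limit < conv[0] and conv[1] < limit}")
print(f"Symmetric:    ({-u:.3f}, {u:.3f}), inside limits: {u < limit}")
```

With these made-up numbers the symmetric interval lies inside the limits of equivalence while the conventional one does not, illustrating why the symmetric requirement is the easier one to satisfy.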
An alternative approach was taken by Schuirmann, who proposed to look at the matter in terms of two one-sided tests. Imagine that we have two regulators: a toxicity regulator and an efficacy regulator. The former defines as toxic any drug whose relative bioavailability is greater than 1.25 and the latter as ineffective any drug whose relative bioavailability is less than 0.8. Each is unconcerned by the other’s decision, and so no trading of alpha from one to the other can take place. It turns out that this requirement is satisfied operationally by accepting bioequivalence if the conventional 90% confidence limits are within the limits of equivalence. Opinions differ as to how logical this is. For example, the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided, but since it would never accept a treatment that was worse than placebo the regulator’s risk is 2.5%, not 5%. Why should it be lower for bioequivalence?
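A minimal sketch of Schuirmann's two one-sided tests, again on hypothetical within-subject log(AUC) differences, might look as follows (the tost helper and the data are invented for illustration); declaring bioequivalence when both one-sided p-values are below 5% is operationally the same as checking that the conventional 90% interval lies within the limits:

```python
import numpy as np
from scipy import stats

def tost(d, limit=np.log(1.25), alpha=0.05):
    """Schuirmann's two one-sided tests on within-subject log-ratios d."""
    est, se, df = d.mean(), stats.sem(d), len(d) - 1
    # H0 (efficacy regulator): true log-ratio <= -limit.
    p_low = 1 - stats.t.cdf((est + limit) / se, df)
    # H0 (toxicity regulator): true log-ratio >= +limit.
    p_high = stats.t.cdf((est - limit) / se, df)
    # Bioequivalent only if BOTH one-sided nulls are rejected at alpha.
    return max(p_low, p_high) < alpha

# Hypothetical within-subject differences in log(AUC), test minus reference.
rng = np.random.default_rng(2)
d = rng.normal(loc=0.03, scale=0.15, size=16)
print("Bioequivalent by TOST:", tost(d))
```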
Be that as it may, 90% confidence intervals are regularly used, but they have been criticised by a number of frequentists of a Neyman-Pearson persuasion (see, for example, R. Berger and Hsu). The argument goes as follows. If the trial is small enough, so that the standard error is large enough, the width of the confidence interval, however calculated, will exceed the width of the equivalence interval. Thus the type I error rate is zero. Various proposals have been made as to how to recover the missing type I error, but they all boil down to this: given a small enough trial, you could claim equivalence even though the point estimate was outside the limits of equivalence! Needless to say, nobody uses such tests in practice and they have been severely criticised from a theoretical point of view (see Perlman and Wu).
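A quick back-of-the-envelope check, under a Normal approximation, shows why the type I error rate vanishes in small trials:

```python
import numpy as np
from scipy import stats

# Under a Normal approximation the 90% interval has half-width 1.645*SE.
# Even an interval centred exactly at 0 cannot fit inside (-0.223, 0.223)
# once 1.645*SE exceeds log(1.25); the TOST procedure can then never
# declare equivalence, so its type I error rate is exactly zero.
z90 = stats.norm.ppf(0.95)
se_threshold = np.log(1.25) / z90
print(f"Equivalence is unattainable once SE > {se_threshold:.3f}")
```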
The above argument is based on Normal theory tests. Horrendous complications are introduced by using the t-test if one departs from classical confidence intervals.
And don’t get me started on equivalence when concentration in the blood is irrelevant but a pharmacodynamic outcome must be used instead!
So, what seems to be a simple problem turns out to be controversial and difficult. As I sometimes put it, ‘equivalence is different’.
Here there be tygers!
*Head, Methodology and Statistics Group
Competence Center for Methodology and Statistics (CCMS)
1. Senn, S.J., Statistical issues in bioequivalence. Statistics in Medicine, 2001. 20(17-18): p. 2785-2799.
2. Kirkwood, T.B.L., Bioequivalence testing – a need to rethink. Biometrics, 1981. 37: p. 589-591.
3. Westlake, W.J., Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Sciences, 1972. 61(8): p. 1340-1341.
4. O’Quigley, J. and C. Baudoin, General approaches to the problem of bioequivalence. The Statistician, 1988. 37: p. 51-58.
5. Schuirmann, D.J., A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 1987. 15(6): p. 657-680.
6. Berger, R.L. and J.C. Hsu, Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 1996. 11(4): p. 283-302.
7. Perlman, M.D. and L. Wu, The emperor’s new tests. Statistical Science, 1999. 14(4): p. 355-369.
References added by Editor for readers:
1. Senn, S.J., Falsificationism and clinical trials. Statistics in Medicine, 1991. 10: p. 1679-1692.
2. Senn, S.J., Inherent difficulties with active control equivalence studies. Statistics in Medicine, 1993. 12: p. 2367-2375.
3. Senn, S.J., Fisher’s game with the Devil. Statistics in Medicine, 1994. 13: p. 217-230.
Stephen: Thanks so much for your post. I am still unsure how you think we should do equivalence testing. What do you think of the union-intersection method of R. Berger?* (I assume that the question of concentration in the blood comes up after it’s already ascertained that the 2 consist of the same drug or whatever.) I have another question I’ll raise later on.
*Correction: This should be intersection-union test. See July. 31 post.
I don’t like the Berger and Hsu method I cited for reasons I have already given. I don’t accept that maximising power for a fixed type I error rate is a universally logical way to carry out statistical inference.