**Blood Simple?**

**The complicated and controversial world of bioequivalence** by Stephen Senn*

Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult[1].

If, as is often the case, both formulations are given in forms that are absorbed through the gut, whether as pills, oral solutions or suppositories, then so-called *bioequivalence trials* form an attractive option. The basic idea is that the concentration in the blood of the new *test* formulation can be compared to the *licensed* reference formulation. Equivalence of concentration in the blood plausibly implies equivalence in all possible effect sites and thus equality of all benefits and harms.

Typically, healthy volunteers are recruited and given the test formulation on one occasion and the reference formulation on another, the order being randomised. Regular blood samples are taken and the concentration-time curves summarised using simple statistics: for example, the area under the curve (AUC) is always used, the maximum concentration (Cmax) nearly always, and the time to reach the maximum (Tmax) very often. These statistics are then compared across formulations to show that they are similar.
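As a concrete illustration (an editor's sketch with invented sampling times and concentrations; real analyses use validated software and model-based estimates), the three summary statistics can be computed from a single volunteer's samples like this:

```python
# Hypothetical concentration-time samples from one volunteer:
# (time in hours, concentration in ng/mL). Illustrative values only,
# not from any real trial.
samples = [(0.0, 0.0), (0.5, 12.0), (1.0, 21.0), (2.0, 30.0),
           (4.0, 24.0), (8.0, 11.0), (12.0, 5.0), (24.0, 1.0)]

# AUC by the trapezoidal rule over the sampling times
auc = sum((t1 - t0) * (c0 + c1) / 2
          for (t0, c0), (t1, c1) in zip(samples, samples[1:]))

# Cmax and Tmax from the observed maximum
t_max, c_max = max(samples, key=lambda tc: tc[1])

print(f"AUC = {auc:.2f} ng·h/mL, Cmax = {c_max} ng/mL at Tmax = {t_max} h")
```

The trapezoidal rule is the standard non-compartmental way of estimating AUC from discrete sampling times.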

In the rest of this post I shall ignore the problem that various summary measures are employed and assume that we are just considering AUC. There seems to be a general (but arbitrary) agreement that two formulations are equivalent if the true ratio of AUC under test and reference lies between 0.8 and 1.25. In that case (at least as regards the AUC requirement) the formulations are deemed bioequivalent. The true ratio, however, is a parameter, not a statistic, and so the task is to see what the data can show about the reasonableness of any claim regarding this unknown theoretical quantity.

It is here, however, that the statistical difficulties begin. A simple frequentist solution would appear to be to calculate the 95% confidence interval for the relative bioavailability and check that it lies within the limits of equivalence. Modelling is always done on the log scale, and since log(0.8) = −log(1.25), the limits for the log relative bioavailability of test to reference are (approximately) −0.22 to +0.22. However, there is more than one 95% confidence interval, and an early dispute in this field was whether a traditional confidence interval centred on the point estimate should be calculated, as Kirkwood[2] proposed in 1981, or one centred on the middle of the range of equivalence, that is to say on 0 (on the log scale), as Westlake[3] had earlier proposed in 1972.
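The conventional check is easy to sketch (all numbers below are hypothetical; in practice the point estimate and its standard error come from the crossover analysis on the log scale):

```python
import math

# Sketch of the conventional 95% CI check on the log scale. The point
# estimate and standard error are made-up values standing in for the
# output of the crossover analysis.
point_est = 0.05          # estimated log(test/reference) ratio for AUC
se = 0.07                 # its standard error
z = 1.96                  # two-sided 95% normal quantile

lower, upper = point_est - z * se, point_est + z * se
limit = math.log(1.25)    # ≈ 0.223; the lower limit is log(0.8) = -limit

equivalent = (-limit < lower) and (upper < limit)
print(f"95% CI on log scale: ({lower:.3f}, {upper:.3f}); "
      f"inside ±{limit:.3f}? {equivalent}")
```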

As O’Quigley and Baudoin pointed out[4], the difference is, essentially, between deciding whether the ‘shortest’ confidence interval is included within the limits of equivalence or whether the fiducial probability that the true relative bioavailability lies within the limits is at least 95%. The latter is always the easier requirement to satisfy. To see why, consider the case where the point estimate is positive. In that case the lower conventional confidence limit would clearly never lie outside the limit of equivalence unless the upper one did. Thus, by lengthening the interval below and shortening it above in such a way as to maintain the 95% probability, one can make it easier to satisfy equivalence.
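To see the trade concretely, here is a sketch of Westlake's symmetric interval (the estimate, standard error, and bisection routine are the editor's illustration, not a validated implementation):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def westlake_half_width(d, se, conf=0.95):
    """Half-width c of Westlake's interval (-c, c), symmetric about zero,
    chosen so that P(-c < D < c) = conf when D ~ N(d, se^2). Coverage is
    increasing in c, so simple bisection suffices."""
    lo, hi = abs(d), abs(d) + 20 * se
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi((mid - d) / se) - phi((-mid - d) / se) < conf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d, se = 0.10, 0.07                       # hypothetical estimate and SE
c = westlake_half_width(d, se)
print(f"Westlake 95% interval: (-{c:.3f}, {c:.3f})")
print(f"Conventional 95% CI:   ({d - 1.96*se:.3f}, {d + 1.96*se:.3f})")
```

With these made-up numbers the conventional interval is roughly (−0.037, 0.237), breaching log 1.25 ≈ 0.223, while the Westlake interval is roughly (−0.215, 0.215): the upper limit has been pulled below the boundary at the cost of a much longer lower limit.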

An alternative approach was taken by Schuirmann[5], who proposed to look at the matter in terms of two one-sided tests. Imagine that we have two regulators: a toxicity and an efficacy regulator. The former defines as toxic any drug whose relative bioavailability is greater than 1.25 and the latter as ineffective any drug whose relative bioavailability is less than 0.8. Each is unconcerned by the other’s decision and so no trading of alpha from one to the other can take place. It turns out that this requirement is satisfied operationally by accepting bioequivalence if the conventional 90% confidence limits are within the limits of equivalence. Opinions differ as to how logical this is. For example, the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided, but since it would never accept a treatment that was worse than placebo, the regulator’s risk is 2.5%, not 5%. Why should it be lower for bioequivalence?
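Schuirmann's procedure can be sketched as follows (hypothetical estimate and standard error; a normal-theory approximation with the one-sided 5% quantile rounded):

```python
import math

def tost(point_est, se, limit=math.log(1.25), q=1.6449):
    """Schuirmann's two one-sided tests on the log scale (normal-theory
    sketch). Rejecting both one-sided nulls at 5% is operationally the
    same as the conventional 90% CI lying inside (-limit, +limit)."""
    t_ineff = (point_est + limit) / se  # tests H0: log ratio <= -limit (ineffective)
    t_toxic = (limit - point_est) / se  # tests H0: log ratio >= +limit (toxic)
    bioequivalent = t_ineff > q and t_toxic > q
    ci90 = (point_est - q * se, point_est + q * se)
    return bioequivalent, ci90

ok, (lo90, hi90) = tost(0.05, 0.07)   # hypothetical estimate and SE
print(f"bioequivalent: {ok}; 90% CI on log scale: ({lo90:.3f}, {hi90:.3f})")
```

Each regulator runs its own one-sided 5% test against its own boundary, which is why the operational interval is a 90% one, not 95%.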

Be that as it may, 90% confidence intervals are regularly used, but they have been criticised by a number of frequentists of a Neyman-Pearson persuasion (see, for example, R. Berger and Hsu[6]). The argument goes as follows. If the trial is small enough (so that the standard error is large enough), the width of the confidence interval, however calculated, will exceed the width of the equivalence interval. Thus the type I error rate is zero. Various proposals have been made as to how to recover the missing type I error, but they all boil down to this: given a small enough trial, you could claim equivalence even though the point estimate was outside the limits of equivalence! Needless to say, nobody uses such tests in practice, and they have been severely criticised from a theoretical point of view[7].
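The zero type I error phenomenon is easy to verify by simulation (illustrative numbers; the true log ratio is set at the boundary of the null):

```python
import math
import random

# All numbers are illustrative. With se = 0.20 the 90% CI half-width is
# 1.6449 * 0.20 ≈ 0.33, which exceeds log(1.25) ≈ 0.22: the interval is
# wider than the whole equivalence region and can never fit inside it.
random.seed(1)
limit, z = math.log(1.25), 1.6449
se = 0.20
true_log_ratio = limit       # null hypothesis: drug exactly at the boundary

declared = 0
for _ in range(100_000):
    est = random.gauss(true_log_ratio, se)
    if (est - z * se > -limit) and (est + z * se < limit):
        declared += 1

print(f"equivalence declared in {declared} of 100000 trials under the null")
```

The check inside the loop can never succeed here: the interval's full width, 2 × 1.6449 × 0.20 ≈ 0.66, exceeds the width of the equivalence region, 2 × log 1.25 ≈ 0.45, whatever the data say.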

The above argument is based on Normal theory tests. Horrendous complications are introduced by using the t-test if one departs from classical confidence intervals.

And don’t get me started on equivalence when concentration in the blood is irrelevant but a pharmacodynamic outcome must be used instead!

So, what seems to be a simple problem turns out to be controversial and difficult. As I sometimes put it ‘equivalence is different’.

Here there be tygers!

*Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**References**

1. Senn, S.J., *Statistical issues in bioequivalence.* Statistics in Medicine, 2001. **20**(17-18): p. 2785-2799.

2. Kirkwood, T.B.L., *Bioequivalence testing – a need to rethink.* Biometrics, 1981. **37**: p. 589-591.

3. Westlake, W.J., *Use of confidence intervals in analysis of comparative bioavailability trials.* Journal of Pharmaceutical Sciences, 1972. **61**(8): p. 1340-1341.

4. O’Quigley, J. and C. Baudoin, *General approaches to the problem of bioequivalence.* The Statistician, 1988. **37**: p. 51-58.

5. Schuirmann, D.J., *A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability.* Journal of Pharmacokinetics and Biopharmaceutics, 1987. **15**(6): p. 657-680.

6. Berger, R.L. and J.C. Hsu, *Bioequivalence trials, intersection-union tests and equivalence confidence sets.* Statistical Science, 1996. **11**(4): p. 283-302.

7. Perlman, M.D. and L. Wu, *The emperor’s new tests.* Statistical Science, 1999. **14**(4): p. 355-369.

References added by Editor for readers:

1. Senn SJ. Falsificationism and clinical trials [see comments]. Statistics in Medicine 1991; 10: 1679-1692.

2. Senn SJ. Inherent difficulties with active control equivalence studies. Statistics in Medicine 1993; 12: 2367-2375.

3. Senn SJ. Fisher’s game with the Devil. Statistics in Medicine 1994; 13: 217-230.

Stephen: Thanks so much for your post. I am still unsure how you think we should do equivalence testing. What do you think of the union-intersection method of R. Berger? (I assume that the question of concentration in the blood comes up after it’s already ascertained that the 2 consist of the same drug or whatever.) I have another question I’ll raise later on.

I don’t like the Berger and Hsu method I cited, for reasons I have already given. I don’t accept that maximising power for a fixed type I error rate is a universally logical way to carry out statistical inference.

Stephen: for reasons you’ve already given here or somewhere else? I’m just trying to figure out your position on this. I vaguely remember a blogpost where the union intersection method came up in passing, and I asked R. Berger what he thought, and he seemed to defer to you (roughly, “Senn is the one who would know”).

I mentioned to you once that I had written a short section of my book on bioequivalence testing, based on a criticism you make about methods of bioequivalence testing going against a Popperian spirit of tests–but in that case the null was Ho: bioequivalence holds. So, on the face of it, with my admittedly poor comprehension of this methodology, I thought you’d prefer to have Ho assert the denial of bioequivalence, as in the union-intersection method. Anyway, as I also mentioned, I ripped out the section, deciding I didn’t really understand it well enough. I don’t plan to restore it, but I would like to understand your position about how it should be done. Another reason I’m curious is that a couple of people, years ago, said that what I was recommending was similar, they thought, to what is done in bioequivalence testing (in the case of non-reject). Finally, your book (Stat Issues in Drug Dev*) says the bioequivalence issue is primarily philosophical rather than technical. That, together with the Popper mention (someplace else), perked up my ears. OK, so any further insight would be of great interest!

* Here it is, p.372.

The business of hypothesis testing when the alternative lies in a closed interval raises several technical mathematical difficulties. The late Gunther Mehring, a one-time colleague of mine, investigated this with great rigour (from a mathematical point of view) and identified work of Samuel Karlin’s as being key.

The problem is, however, that solving the (difficult) mathematical problem of finding an optimal NP test which has the correct size and maximum power leads to a solution that is so counter-intuitive under some circumstances that nobody would accept it.

The practical problem appears to be much simpler than the technical mathematical one, and I think that this is one of the cases where I would consider a Bayesian solution in principle. It would need to have, as Lindley proposes, a loss function (but not the one he suggests) and also a three-valued decision: accept, reject, collect more data. Odile Coudert had a look at this problem but her MSc should be regarded as interesting notes towards the solution of the problem. It’s a topic I should try to get back to.

In the meantime I favour Schuirmann’s two one-sided tests, although (in a way) having each at the 2.5% level rather than the 5% level would make more sense.

1. Mehring G. On optimal tests for general interval hypotheses. Communications in Statistics: Theory and Methods 1993; 22: 1257-1297.

Stephen: Is that the philosophical part? (whether to treat it as a decision problem?) Why it’s so different from the usual case of affirming equivalence of 2 means remains mysterious to me. If it’s so difficult, why doesn’t the brand name co. just give/sell their manufacturing process to the generic co.?

Affirming equivalence is fundamentally different from claiming a difference, in a way I tried to explain in three articles I wrote in the 1990s:

(these are the three references added by the editor above)

This difference is largely ‘philosophical’. A small example is that the role of blinding is quite different.

The technical difficulty relates to the fact that the 95% confidence intervals may be larger than the region of equivalence – implying zero size of a ‘test’ that proceeds using CIs. Hence the technical work of

1. Anderson S, Hauck WW. A New Procedure for Testing Equivalence in Comparative Bioavailability and Other Clinical-Trials. Communications in Statistics-Theory and Methods 1983; 12: 2663-2692.

2. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science 1996; 11: 283-302.

and

3. Brown LD, Hwang JTG, Munk A. An unbiased test for the bioequivalence problem. Annals of Statistics 1997; 25: 2345-2367.

as well as the important but neglected work of Mehring’s already cited: http://www.tandfonline.com/doi/abs/10.1080/03610929308831086

The price that innovators put on their invention is not one that generic manufacturers are willing to pay. Generic manufacturers are trying to cash in on a potential consumer surplus by finding a much easier route to development than full development once the patent permits them to copy (if they can!). Hence, bioequivalence studies. Occasionally companies will use the full development route but it’s very expensive. Astra did so in developing its form (Oxis) of formoterol, which was a Yamanouchi drug licensed to and developed by CIBA-Geigy (now Novartis) as Foradil, but this is rare.

The space for a Bayesian solution is there because this is a much more tightly structured problem than most. There really is only one thing that matters and that is relative bioavailability. All that prevents its being implementable is some sort of agreement on a loss-function (and a lot of technical work on backward induction!). Most attempts to be Bayesian in drug development would entail much greater complication.

Stephen: OK weekend reading (or in 2 cases rereading from long ago) will include the 3 Senn articles.

You seem to be saying it’s easy to be Bayesian when it comes to deciding on bioequivalence. Indeed, it operates as a policy decision. But are the priors fairly straightforward? You see, I don’t know how they arrive at the generic, and what they must show before even getting to the stage of testing for bioequivalence. Nor do I know what they’d have to do next, after showing bioequivalence (e.g., figure out the dose?). It isn’t far-fetched to imagine that drugs of a certain type, having already passed the test of being the “same substance” to begin with, would result in similar concentrations in the blood, etc. So I guess I can picture an empirical prior, only I’d worry about the generic compound of drug X being relevantly different from the universe of generics one imagines it was randomly selected from.

On the point: “The price that innovators put on their invention is not one that generic manufacturers are willing to pay.” Why not allow the brand-name company to get some cut of the profits in exchange for supplying the exact drug? I thought that actually happened at times. (My knowledge is limited to stocks in the biotech arena.)

I wonder if this whole issue will be influenced by a recent change of law (in the U.S.) regarding generic companies and the obligation to change labels. It used to be required that they keep exactly the same label as the brand-name company. Now they are supposed to alter it if, for example, previously unknown side effects crop up. Moreover, I believe they can now be sued by patients. I have a couple of posts on generics I can look up later. But I think it’s silly for a generic co. to start from scratch once the patent expires. Let the brand company keep the patent a bit longer, in exchange for selling it at a reasonable price.

The business of actually adapting the drug so that it is absorbed in the body in a useful way and at an appropriate concentration is sometimes known in the industry as ‘pharmaceutical development’ and is often much harder than many suppose. I think it would be fair to start with a very vague prior that the concentration ought to be similar, but one that would be completely dominated by the data. The loss-function aspect is more interesting, and this is the aspect of Lindley’s paper that stands out, although I don’t find his specific solution plausible.

It would be interesting to see what difference any change in the US legal framework would make. The FDA got very confused over the purpose of the legislation some years (more than 15) ago and started pursuing individual bioequivalence. This was quite illogical but a different story. See Senn S. In the blood: proposed new requirements for registering generic drugs. Lancet 1998; 352: 85-86.

Prior to the change in generic labeling, I posted this: http://errorstatistics.com/2012/03/22/generic-drugs-resistant-to-lawsuits/

Professor Senn:

I have a few questions about points in your post and comments; maybe you can explain.

“If the trial is small enough so that the standard error is large enough the width of the confidence interval, however calculated, will exceed the width of the equivalence interval. Thus the type I error rate is zero.” The type I error rate of a test associated with a 1 − α confidence interval, I thought, would be α. Why is it 0?

Another question: you have written “given a small enough trial you could claim equivalence even though the point estimate was outside the limits of equivalence!” Wouldn’t the point estimate x be within the confidence interval x ± e?

Wouldn’t a Neymanian assign non-equivalence to the null hypothesis, in that it would be worse to declare equivalence erroneously?

Thank you for an interesting post!

Have a look at the diagram here http://www.senns.demon.co.uk/Equivalence%20examples.jpg

This gives a number of possible cases and is based on a similar diagram in Chapter 15 of Statistical Issues in Drug Development. The middle vertical line gives the point of exact equivalence and the other two lines give the boundaries of what is acceptable in terms of bioequivalence. If you look at case C, you will see that the point estimate is exactly what the generic company would hope. However the confidence interval is too wide. Even though the point estimate is perfect, the regulator will not accept a claim of equivalence and no value of the point estimate (given the width of the CI) can produce such a claim. Hence the type I error rate for the conventional 1-alpha CI is zero.


To pick up on the second question, the way you have to restore the type I error rate is to allow a point estimate that is sufficiently ‘close’ to the point of exact equivalence to indicate equivalence, even though the conventional confidence limits fall outside the boundaries of equivalence. As the confidence intervals get wider and wider you have to get more and more relaxed as to the definition of ‘close’. Eventually ‘close enough’ is beyond the limit of equivalence!

Hence the general rejection of Neyman-Pearson type optimisation in the context of bioequivalence.

Stephen: What does it mean to speak of the Type 1 error rate of A confidence interval?

If you operate a decision rule by saying ‘reject H0 unless the confidence interval is in such and such a region’ then the decision rule has an error rate under H0. This is the type one error rate.
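That error rate can be estimated directly by simulation (an editor's sketch with made-up numbers; the true log ratio is placed at the boundary of the null, where the rate is largest):

```python
import math
import random

# Illustrative sketch: estimate the type I error rate of the rule
# 'declare equivalence iff the 90% CI lies inside (-log 1.25, +log 1.25)'
# at the edge of the null hypothesis of inequivalence.
random.seed(2)
limit, z = math.log(1.25), 1.6449
se = 0.05                      # small enough that the CI can fit inside
true_log_ratio = limit         # boundary of the null: just inequivalent

n, errors = 200_000, 0
for _ in range(n):
    est = random.gauss(true_log_ratio, se)
    if (est - z * se > -limit) and (est + z * se < limit):
        errors += 1

print(f"estimated type I error rate: {errors / n:.4f}")
```

With a moderate standard error the estimate comes out close to 5%, the level of each of Schuirmann's one-sided tests; with a large enough standard error (as discussed in the post) it drops to exactly zero.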

Stephen: I think there is/was some confusion in my mind as to whether the null was asserting equivalence or inequivalence.

In the context of bioequivalence the null hypothesis is that the drugs are inequivalent. The alternative hypothesis is encompassed by the region between the two lines of equivalence.

In the diagram I referenced http://www.senns.demon.co.uk/Equivalence%20examples.jpg

cases D, E and F would be accepted as equivalent by the regulator. Case F is interesting since the confidence limits are inside the limits of equivalence but do not include zero, so that on a conventional test the formulations are ‘significantly different’ but also ‘significantly the same’.

Nice way to put the rationale for Schuirmann’s “Imagine that we have two regulators: a toxicity and an efficacy regulator. The former defines as toxic any drug whose relative bioavailability is greater than 1.25 and the latter as ineffective any drug whose relative bioavailability is less than 0.8. Each is unconcerned by the other’s decision and so no trading of alpha from one to the other can take place.”

“consider a Bayesian solution in principle” is always a good idea (especially to allow sensible trading of alphas) and possibly the quickest route to a better frequency solution if the heavy mathematical analysis can eventually be worked through.

Nice Dashiell Hammett reference. Red Harvest is a hell of a read.

Corey: Yes, when Senn gave me this title I looked it up and found the movie, which sounds like the kind of story I’d like–never saw it.

FROM TWITTER:

Stephen John Senn @stephensenn: @ChristosArgyrop @learnfromerror But see http://onlinelibrary.wiley.com/doi/10.1002/sim.743/abstract for a criticism of Lindley’s proposal

The reference to Senn’s paper criticizing Lindley’s suggestion is here: http://errorstatistics.files.wordpress.com/2014/06/senn_statistical-issues-in-bioequivalence_statistics-in-medicine.pdf

I had a quick look at Senn’s “Falsificationism and Clinical Trials” (1991). http://errorstatistics.files.wordpress.com/2014/06/senn-falsificationism.pdf

As always, Senn’s work includes numerous intriguing philosophical points, and the Popperian spirit is largely in sync with my own views. When it comes to illuminating randomization and blindness, points on which I’m often fuzzy, there’s no one better than Senn.

Here’s the peculiarity I find in what Senn says about Popper here.

1. p. 1684: Consider an empirical counterexample to a universal claim T (e.g., experimental condition A always produces the same effects as condition B). Such a counterexample, says Senn, “is unscientific” (on Popperian grounds), but I don’t see how this can be Popper’s view. The empirical counterexample could be described in different ways (see point 2): basically a case or cases where condition A is observed not to produce the same effect as B. As a logical empiricist, Popper holds that the empirical basis for science rests on empirical claims. If they are ousted from science, there would be no conjecture and refutation, and no science.

2. The empirical counterexample to T may be (i) a singular observation claim e, or (ii) an observable (reproducible) effect E. The former might be: the blood concentrations in two groups of patients are x and y, and x differs from y. Such singular observations get their scientific status for Popper from his position that anyone who has learned the proper method could detect a flaw, and from the fact that there are interconnected techniques for checking mistakes about such observables. It too is fallible (so he’s not a naïve positivist), but easier to corroborate than statistical or general claims. So, viewed as a singular observation claim, e is scientific.

The latter (the empirical statistical claim (ii)) E, however, asserts a real or genuine effect.

Only claims like E are really of interest to science, Popper rightly notes, not singular observations like e. (Only simple generalizations like ‘All swans are white’ are falsifiable with singular observation statements; see EGEK, chapter 1.)

3. Senn claims E is not scientific, untestable, but I say this isn’t so. How can we corroborate an empirical statistical claim like E, e.g., that this is a genuine effect? (Remember, a claim C is corroborated for Popper by passing a severe test, one which would have, with high probability, failed C were C false. Setting aside that Popper never adequately cashed out severity, we have, we hope, given a workable notion.) Empirical or experimental effects are typically inferred by means of statistical rules rejecting claims such as: the effect is not genuine, or not reproducible, or the like. These are typical null hypotheses of statistical tests.

4. Senn’s empirical claim E is something like this:

E: A produced or can produce an effect that differs from B.

Whether viewed as a singular observation or, more plausibly, an observable effect, it qualifies as scientific for Popper*.

Real effect E would be “falsified” by finding the effect is not reproducible, that it disappears, we cannot bring it about at will, etc. (see EGEK chapter 1).

5. Notice I put quotes around “falsified”—a Lakatosian move that reminds us that except perhaps for the least interesting generalizations, e.g., all swans are white, hypotheses are not logically falsifiable. Rather there is a convention or “decision” that takes a claim to be “falsified” once the difference between observed and expected exceed a prespecified threshold—much as with significance tests.

I’ve said nothing about equivalence trials. According to Senn, “This then constitutes the basic problem of an equivalence trial. If the trial is successful the experimenter can provide no demonstration that it was competent.”

“By competence of a clinical trial I mean the ability of the trial to detect a difference in treatments if it exists.” (p. 1688).

I do not claim to comprehend the special features of bioequivalence testing of drugs, but I don’t get Senn’s logic here, and the bulk of this article strikes me as making logical points. Perhaps having clarified the business of the scientific status of an empirical falsification or observed counterexample to theory T, one is in a better position to analyze the rest.**

*There were some early caricatures of Popper that Lakatos called “dogmatic falsificationism”, and one could, if one wanted, characterize a singular observation statement as non-testable, but I think it’s a mistake to foist this on Popper. Admittedly David Miller holds some pretty strange views (and I notice Senn references him).

**Some of the most famous experiments in science are “null experiments”, e.g., Michelson-Morley, equivalence principles in gravitational physics. So while we can agree with Senn that “Affirming equivalence is fundamentally different from claiming a difference,” it’s not clear at all that affirming equivalence in the case of drugs is fundamentally different from warranting equivalence in other cases in science.

Popper is so often caricatured that I hope at least readers are clear that the empirical counterexample to a generalization is not ousted from the realm of science by Popper. Quite apart from the business of the best way to do bioequivalence, and though I can only guess what a severity assessment might produce, no condemnation of the scientific credentials of observed effects is warranted.

Moreover, it’s worth keeping in mind that we manage to warrant, with severity, the equivalence of the free-fall acceleration of the moon and earth, finding the Nordtvedt parameter to be extraordinarily close to 0. My point is that whatever problem the task has in the realm of bioequivalence, it isn’t a matter of logic or “philosophy”.

Null effects are actually quite powerful in physics, because they can show high competence to detect ultra-minuscule discrepancies (not the scientific term). Then the inference is a matter of setting discrepancy bounds, essentially in the form of confidence intervals, using knowledge of the detectors. Think of the brilliant Michelson-Morley experiments quite early on, ruling out the Newtonian ether, to everyone’s shock.