School Director & Professor
School of Mathematical & Natural Science
Arizona State University
First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.
Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized it. I do not recognize his explanation after “The argument goes as follows.” Senn says that our argument against the bioequivalence test defined by the 90% confidence interval is based on the fact that the Type I error rate for this test is zero. This is not true. The bioequivalence test in question, defined by the 90% confidence interval, has size exactly equal to α = .05. The Type I error probability is not zero. But this test is biased; on the boundary between the null and alternative hypotheses, the Type I error probability converges to zero as the variance goes to infinity. This biasedness allows other tests to be defined that also have size α but are uniformly more powerful than the test defined by the 90% confidence interval.
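As a rough numerical illustration of this bias, here is a minimal Python sketch. It rests on simplifying assumptions that are not in Berger and Hsu: the estimated difference in log means is treated as normal with a known standard error `se`, so a z-test stands in for the t-test, and the function name `tost_rejection_prob` is hypothetical. Under those assumptions the rejection probability at the boundary θ = .22 is close to α when `se` is small and falls to zero as `se` grows.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_rejection_prob(theta, se, delta=0.22, alpha=0.05):
    """Probability that the interval-based test declares equivalence.

    Assumption (not from the source): the estimate D is N(theta, se**2)
    with se known, so z critical values replace t critical values.
    The test rejects non-bioequivalence when
    -delta + z*se < D < delta - z*se.
    """
    z = STD_NORMAL.inv_cdf(1 - alpha)
    lower = -delta + z * se
    upper = delta - z * se
    if lower >= upper:
        # The rejection region is empty: the test can never reject.
        return 0.0
    return STD_NORMAL.cdf((upper - theta) / se) - STD_NORMAL.cdf((lower - theta) / se)


if __name__ == "__main__":
    # Rejection probability on the boundary theta = delta for growing se.
    for se in (0.01, 0.05, 0.10, 0.20):
        print(se, round(tost_rejection_prob(0.22, se), 4))
```

In this simplified setting the supremum of the boundary rejection probability is α, approached as the standard error shrinks, which is the sense in which the test has size α and yet is biased.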
The two main points in Berger and Hsu (1996) are these.
First, by considering the bioequivalence problem in the intersection-union test (IUT) framework, it is easy to define size α tests. The IUT method of test construction may be useful if the null hypothesis is conveniently expressed as a union of sets in the parameter space. In a bioequivalence problem the null hypothesis (asserting non-bioequivalence) is that the difference (as measured by the difference in log means) between the two drug formulations is either greater than or equal to .22 or less than or equal to -.22. Hence the null hypothesis is the union of two sets, the part where the parameter is greater than or equal to .22 and the part where the parameter is less than or equal to -.22. The intersection-union method considers two hypothesis tests, one of the null “greater than or equal to .22” versus the alternative “less than .22” and the other of the null “less than or equal to -.22” versus the alternative “greater than -.22.” The fundamental result about IUTs is that if each of these tests is carried out with a size-α test, and if the overall bioequivalence null is rejected if and only if each of these individual tests rejects its respective null, then the resulting overall test has size at most α. Unlike most other methods of combining tests, in which individual tests must have size less than α to ensure the overall test has size α, in the IUT method of combining tests size α tests are combined in a particular way to yield an overall test that has size α, also.
In the usual formulation of the bioequivalence problem, each of the two individual hypotheses is tested with a one-sided, size-α t-test. If each of these individual t-tests rejects its null, then bioequivalence is concluded. This has come to be called the Two One-Sided Test (TOST). The IUT method simply combines two one-sided t-tests into an overall test that has size α. This is much simpler than vague discussions about regulators not trading α, etc. That explanation makes no sense to me, because there is only one regulator (e.g., the FDA). Why appeal to two regulators?
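The intersection-union construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the estimate is treated as normal with a known standard error, so one-sided z-tests stand in for whatever size-α tests are actually used, and the names `one_sided_p_values`, `iut_reject` and `equivalence_test` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_p_values(estimate, se, delta=0.22):
    """P-values for the two component nulls (known-se z-tests, an assumption).

    First:  H0 'theta >= delta'  vs 'theta < delta'  (small estimates reject).
    Second: H0 'theta <= -delta' vs 'theta > -delta' (large estimates reject).
    """
    p_upper = STD_NORMAL.cdf((estimate - delta) / se)
    p_lower = 1.0 - STD_NORMAL.cdf((estimate + delta) / se)
    return p_upper, p_lower


def iut_reject(p_values, alpha=0.05):
    """Intersection-union rule: reject the union null only if every
    component test rejects at the SAME level alpha (no adjustment)."""
    return all(p <= alpha for p in p_values)


def equivalence_test(estimate, se, delta=0.22, alpha=0.05):
    """Declare equivalence when both one-sided tests reject."""
    return iut_reject(one_sided_p_values(estimate, se, delta), alpha)


if __name__ == "__main__":
    print(equivalence_test(0.0, 0.05))  # precise estimate near zero
    print(equivalence_test(0.2, 0.05))  # estimate close to the upper limit
    print(equivalence_test(0.0, 0.20))  # imprecise estimate
```

Note that `iut_reject` compares every p-value with the unadjusted α, which is the feature that distinguishes this construction from multiplicity-adjusted ways of combining tests.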
Furthermore, in the IUT framework it is not necessary for the two individual hypotheses to be tested using one-sided t-tests. By considering the configuration of the parameter space in a bioequivalence problem more carefully, it is easy to define other tests that are size-α for the two individual hypotheses. When these are combined using the IUT method into an overall size-α test, they can yield a test that is uniformly more powerful than the TOST. We give an example of such tests in Berger and Hsu. Thus the IUT method gives simple constructions of tests that are superior in power to the usual TOST.
The second main point of Berger and Hsu is this. Describing a size-α (e.g., α = .05) bioequivalence test using a 100(1 − 2α)% (e.g., 90%) confidence interval is confusing and misleading. As Brown, Casella, and Hwang (1995) said, it is only an “algebraic coincidence” that in one particular case there is a correspondence between a size-α bioequivalence test and a 100(1 − 2α)% confidence interval. In Berger and Hsu we point out several examples in which other authors have considered other equivalence type hypotheses and have assumed they could define a size-α test in terms of a 100(1 − 2α)% confidence set. In some cases the resulting tests are conservative, in other cases liberal. There is no general correspondence between α-level equivalence tests and 100(1 − 2α)% confidence sets. This description of one particular size-α equivalence test in terms of a 100(1 − 2α)% confidence interval is confusing and should be abandoned.
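The “algebraic coincidence” can be checked numerically in the simplest case. The sketch below assumes a normal estimate with known standard error (z-tests rather than t-tests) and uses hypothetical function names. It compares the decision of two one-sided size-α tests with the rule “the confidence interval lies inside the equivalence limits” over a grid of estimates and standard errors, once for a 90% interval and once for a 95% interval.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_rejects(estimate, se, delta=0.22, alpha=0.05):
    """Two one-sided z-tests, each at level alpha (known se is an assumption)."""
    p_upper = STD_NORMAL.cdf((estimate - delta) / se)
    p_lower = 1.0 - STD_NORMAL.cdf((estimate + delta) / se)
    return max(p_upper, p_lower) <= alpha


def interval_inside_limits(estimate, se, delta=0.22, level=0.90):
    """True when the two-sided interval at the given level lies in [-delta, delta]."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return -delta <= estimate - z * se and estimate + z * se <= delta


def agreement_on_grid(level):
    """Do the two rules give the same decision at every grid point?"""
    estimates = [i / 100.0 for i in range(-40, 41)]
    std_errors = [j / 100.0 for j in range(1, 21)]
    return all(
        tost_rejects(d, s) == interval_inside_limits(d, s, level=level)
        for d in estimates
        for s in std_errors
    )


if __name__ == "__main__":
    print("90% interval matches the .05 test:", agreement_on_grid(0.90))
    print("95% interval matches the .05 test:", agreement_on_grid(0.95))
```

Agreement in this one setting says nothing about other equivalence hypotheses, which is exactly the caution raised above: the 100(1 − 2α)% recipe is not a general principle.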
On another point, I would disagree with Senn’s characterization that Perlman and Wu (1999) criticized our new tests on theoretical grounds. Rather, I would call them intuitive grounds. They said it sounds crazy to decide in favor of equivalence when the point estimate is outside the equivalence limits (much as Senn said). The theory, as we presented it, is sound. The tests are size-α, and uniformly more powerful than the TOST, and less biased. But in our original paper we acknowledged that they are counterintuitive. We suggested modifications that could be made to eliminate the counterintuitivity but still increase the power over the TOST (another simple argument using the IUT method).
Finally, to correct a misstatement, in the extensive discussion following the original Senn post, there are several references to the “union-intersection method of R. Berger.” The method we used is the intersection-union method. In the union-intersection method individual tests are combined in a different way. In this method if individual size-α tests are used, then the overall test has size greater than α. The individual tests must have size less than α in order for the overall test to have size α. (This is the usual situation with many methods of combining tests.)
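A small calculation shows why the union-intersection rule needs smaller individual sizes. The sketch assumes, purely for illustration, that the k component test statistics are independent and that every component null is true at its boundary; neither assumption comes from the discussion above, and the function names are hypothetical.

```python
def union_intersection_size(per_test_alpha, k=2):
    """Chance that at least one of k independent size-alpha tests rejects
    when every component null is true (independence is an assumption)."""
    return 1.0 - (1.0 - per_test_alpha) ** k


def per_test_level(overall_alpha, k=2):
    """Per-test level that gives overall size alpha under the same
    independence assumption."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / k)


if __name__ == "__main__":
    # Two unadjusted .05 tests combined by 'reject if either rejects'.
    print(union_intersection_size(0.05, k=2))
    # Level each test would need so that the overall size is .05.
    print(per_test_level(0.05, k=2))
```

By contrast, the intersection-union rule rejects only when all component tests reject, so unadjusted size-α components already give an overall size of at most α.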
Berger, R.L., Hsu, J.C. (1996). Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets (with Discussion). Statistical Science, 11, 283-319.
Brown, L. D., Casella, G. and Hwang, J. T. G. (1995). Optimal confidence sets, bioequivalence, and the limaçon of Pascal. J. Amer. Statist. Assoc., 90, 880-889.
Perlman, M.D., Wu, L. (1999). The emperor’s new tests. Statistical Science, 14, 355-369.
Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s Error Statistics Blog (errorstatistics.com).
Comment on Roger Berger
I am interested and grateful to Dr Berger for taking the trouble to comment on my blogpost.
First, let me apologise to Dr Berger if I have misrepresented Berger and Hsu. The interested reader can do no better than look up the original publication. This also gives me the occasion to recommend two further articles that appeared at a very similar time to Berger and Hsu. The first is by my late friend and colleague Gunther Mehring and appeared shortly before Berger and Hsu. Gunther and I did not agree on philosophy of statistics, but we had many interesting discussions on the subject of bioequivalence during the period that we both worked for CIBA-Geigy, and what very little I know of the more technical aspects of general interval hypotheses is due to him. Also of interest is the paper by Brown, Hwang and Munk, which appeared a little after Berger and Hsu and has an interesting remark that I propose to discuss:
“We tried to find a fundamental argument for the assertion that a reasonable rejection region should not be unbounded by using a likelihood approach, a Bayesian approach, and so on. However, we did not succeed. Therefore we are not convinced it should not be unbounded.”(p 2348)
Although I do not find the tests proposed by the three sets of authors[1-3] an acceptable practical approach to bioequivalence, there is a sense in which I agree with Brown et al but also a sense in which I don’t.
I agree with them because there are cases in which, within a Bayesian decision-analytic framework, it is possible to claim equivalence even though the point estimate falls outside the limits of equivalence. A sufficient set of conditions is the following.
- It is strongly believed that were no evidence at all available the logical course of action would be to accept bioequivalence. That is to say if the only choices of actions were A: accept bioequivalence or B: reject bioequivalence the combination of prior belief and utilities would support A.
- However, at no or little cost, a very small bioequivalence study can be run.
- This is the only further information that can be obtained.
- Thus the initial situation is that of a three-valued decision outcome, A: accept bioequivalence, B: reject bioequivalence, C: run the small experiment.
- However, if the small experiment is run the only possible actions remaining will be A or B. There is no possibility of collecting yet further information.
- Despite the fact that the evidence from the small experiment has almost no chance of elevating B a posteriori to being a preferable decision to A, C is the preferred action, since the information from action C is almost free.
Under such circumstances it could be logical to run a small trial, and it could be logical, having run the trial, to accept decision A in preference to B even though the point estimate was outside the limits of equivalence. Basically, given such conditions, it would require an extremely inequivalent result to cause one to prefer B to A. A moderately inequivalent result would not suffice. However, the fact that the possibility, however remote, of coming to prefer B to A exists makes C a worthwhile choice initially.
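A toy calculation makes the point concrete. Everything in the sketch below is a hypothetical illustration rather than part of the argument above: a conjugate normal prior centred on zero with a small standard deviation stands in for the strong prior belief in equivalence, the small study is represented by a normal estimate with a large known standard error, and a simple posterior probability threshold stands in for a full specification of utilities.

```python
from math import sqrt
from statistics import NormalDist


def posterior_equivalence_prob(estimate, se, prior_sd, delta=0.22, prior_mean=0.0):
    """Posterior probability that |theta| < delta.

    Assumptions (illustrative only): theta ~ N(prior_mean, prior_sd**2)
    a priori, and the study estimate is N(theta, se**2) with se known.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / se ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    posterior = NormalDist(post_mean, sqrt(post_var))
    return posterior.cdf(delta) - posterior.cdf(-delta)


def preferred_action(estimate, se, prior_sd, threshold=0.5):
    """'A' (accept bioequivalence) if the posterior probability of equivalence
    reaches the threshold, else 'B'. The threshold is a hypothetical stand-in
    for the utilities."""
    prob = posterior_equivalence_prob(estimate, se, prior_sd)
    return "A" if prob >= threshold else "B"


if __name__ == "__main__":
    # Strong prior (sd 0.05) and a very small study (se 0.3).
    for estimate in (0.3, 1.0, 10.0):
        print(estimate,
              round(posterior_equivalence_prob(estimate, 0.3, 0.05), 4),
              preferred_action(estimate, 0.3, 0.05))
```

With these made-up numbers, a point estimate of 0.3, which is outside the limits of ±.22, still leaves A preferred, and only an absurdly large estimate tips the decision to B, which is in keeping with the remark that such conditions are hard to imagine in practice.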
So technically, at least as regards the Bayesian argument, I think that Brown et al are right. Practically, however, I can think of no realistic circumstances under which these conditions could be satisfied.
Dr Berger and I agree that the FDA’s position on type one error rates is somewhat inconsistent so it is, of course, always dangerous to cite regulatory doctrine as a defence of a claim that an approach is logical. Nevertheless, I note that I do not see any haste by the FDA to replace the current biased test with unbiased procedures. I think that they are far more likely to consider, Dr Berger’s appeal to simplicity notwithstanding, that they are, indeed, entitled here, as will have been the case with the innovator product, to be provided with separate demonstrations of efficacy and tolerability. Seen in this light Schuirmann’s TOST procedure is logical and consistent (apart from the choice of 5% level!).
My basic objection to unbiased tests of this sort[1-3], however, goes much deeper and here I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.) Thus my interpretation of NP is the reverse: by thinking in terms of likelihood one sometimes obtains a power bonus. If so, so much the better, but this is not the justification for likelihood, au contraire.
1. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science 1996; 11: 283-302.
2. Mehring G. On optimal tests for general interval hypotheses. Communications in Statistics: Theory and Methods 1993; 22: 1257-1297.
3. Brown LD, Hwang JTG, Munk A. An unbiased test for the bioequivalence problem. Annals of Statistics 1997; 25: 2345-2367.
4. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 1987; 15: 657-680.
5. Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s Error Statistics Blog (errorstatistics.com).
*Mayo remark on this exchange: Following Senn’s “Blood Simple” post on this blog, I asked Roger Berger for some clarification, and his post grew out of his responses. I’m very grateful to him for his replies and the post. Subsequently, I asked Senn for a comment to the R. Berger post (above), and I’m most appreciative to him for supplying one on short notice. With both these guest posts in hand, I now share them with you. I hope that this helps to decipher a conundrum that I, for one, have had about bio-equivalence tests. But I’m going to have to study these items much more carefully. I look forward to reader responses.
Just one quick comment on Senn’s remark:
“….I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.)”
My position on this, I hope, is clear in published work, but just to say one thing: I don’t think that power is “a justification for using likelihood as a basis for thinking about inference”. I agree with E. Pearson in his numbering the steps (fully quoted in this post):
“Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (E. Pearson 1966a, 173).
(Perhaps this is the evidence Senn has in mind.) Merely maximizing power, defined in the crude way we sometimes see (e.g., average power taken over mixtures, as in Cox’s and Birnbaum’s famous examples) can lead to faulty assessments of inferential warrant, but then, I never use pre-data power as an assessment of severity associated with inferences.
While power isn’t necessary “for using likelihood as a basis for thinking about inference” nor for using other distance measures (at Step 2), reports of observed likelihoods and comparative likelihoods are inadequate for inference and error probability control. Hence, Pearson’s Step 3.
Does the issue Senn raises on power really play an important role in his position on bioequivalence tests? I’m not sure. I look forward to hearing from readers.