School Director & Professor

School of Mathematical & Natural Science

Arizona State University

**Comment on S. Senn’s post: “Blood Simple? The complicated and controversial world of bioequivalence”** (*)

First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.

Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized it. I do not recognize his explanation after “The argument goes as follows.” Senn says that our argument against the bioequivalence test defined by the 90% confidence interval is based on the fact that the Type I error rate for this test is zero. This is not true. The bioequivalence test in question, defined by the 90% confidence interval, has size exactly equal to α = .05. The Type I error probability is not zero. But this test is biased; the Type I error probability converges to zero as the variance goes to infinity on the boundary between the null and alternative hypotheses. This biasedness allows other tests to be defined that have size α, also, but are uniformly more powerful than the test defined by the 90% confidence interval.
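The behaviour of the Type I error probability on the boundary can be sketched numerically. The snippet below is a minimal illustration under a simplifying assumption that is not in Berger and Hsu: the standard error `se` of the estimated difference is treated as known, so normal quantiles replace t quantiles. The function name `tost_rejection_prob` and its parameters are hypothetical. Under this simplification the rejection probability at the boundary value .22 is essentially α when `se` is small, shrinks as `se` grows, and is exactly zero once the rejection region becomes empty (with an estimated variance, as in the actual t-test setting, it converges to zero instead).

```python
from statistics import NormalDist

Z = NormalDist()

def tost_rejection_prob(theta, se, delta=0.22, alpha=0.05):
    """P(test declares bioequivalence) when the estimate D ~ N(theta, se^2).

    Known-variance simplification (an assumption): the test rejects the
    non-equivalence null when -delta + z*se < D < delta - z*se,
    where z is the upper-alpha standard normal quantile.
    """
    z = Z.inv_cdf(1 - alpha)
    lower, upper = -delta + z * se, delta - z * se
    if lower >= upper:  # empty rejection region: the test can never reject
        return 0.0
    return Z.cdf((upper - theta) / se) - Z.cdf((lower - theta) / se)

# Type I error probability on the boundary theta = .22 for growing se
for se in (0.01, 0.05, 0.10, 0.13, 0.20):
    print(se, round(tost_rejection_prob(0.22, se), 4))
```

The loop evaluates the same boundary point for increasingly noisy studies, which is the direction in which the bias described above shows up.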

The two main points in Berger and Hsu (1996) are these.

First, by considering the bioequivalence problem in the intersection-union test (IUT) framework, it is easy to define size α tests. The IUT method of test construction may be useful if the null hypothesis is conveniently expressed as a union of sets in the parameter space. In a bioequivalence problem the null hypothesis (asserting non-bioequivalence) is that the difference (as measured by the difference in log means) between the two drug formulations is either greater than or equal to .22 or less than or equal to -.22. Hence the null hypothesis is the union of two sets, the part where the parameter is greater than or equal to .22 and the part where the parameter is less than or equal to -.22. The intersection-union method considers two hypothesis tests, one of the null “greater than or equal to .22” versus the alternative “less than .22” and the other of the null “less than or equal to -.22” versus the alternative “greater than -.22.” The fundamental result about IUTs is that if each of these tests is carried out with a size-α test, and if the overall bioequivalence null is rejected if and only if each of these individual tests rejects its respective null, then the resulting overall test has size at most α. Unlike most other methods of combining tests, in which individual tests must have size less than α to ensure the overall test has size α, in the IUT method of combining tests size α tests are combined in a particular way to yield an overall test that has size α, also.
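The size bound described in this paragraph can be written out in a few lines. The notation below (θ for the difference in log means, X for the data, R_1 and R_2 for the two individual rejection regions) is introduced here for illustration and is not taken from the source.

```latex
% Notation (\theta, X, R_1, R_2) is an assumption made for this sketch.
\begin{align*}
H_0 &: \theta \ge .22 \ \text{ or } \ \theta \le -.22
      \qquad\text{versus}\qquad H_a : -.22 < \theta < .22, \\
H_{01} &: \theta \ge .22 \ \text{ (size-$\alpha$ test with rejection region } R_1\text{)}, \\
H_{02} &: \theta \le -.22 \ \text{ (size-$\alpha$ test with rejection region } R_2\text{)}.
\end{align*}
The IUT rejects $H_0$ exactly when $X \in R_1 \cap R_2$. For any $\theta \ge .22$,
\[
P_\theta(X \in R_1 \cap R_2) \;\le\; P_\theta(X \in R_1) \;\le\; \alpha ,
\]
and for any $\theta \le -.22$ the same bound holds with $R_2$ in place of $R_1$. Hence
\[
\sup_{\theta \in H_0} P_\theta(X \in R_1 \cap R_2) \;\le\; \alpha .
\]
```

No adjustment of the individual levels is needed because every null parameter value lies in at least one of the two sets, and the intersection can be no more probable than the rejection region for that set.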

In the usual formulation of the bioequivalence problem, each of the two individual hypotheses is tested with a one-sided, size-α t-test. If both of these individual t-tests reject their nulls, then bioequivalence is concluded. This has come to be called the Two One-Sided Test (TOST). The IUT method simply combines two one-sided t-tests into an overall test that has size α. This is much simpler than vague discussions about regulators not trading α, etc. That explanation makes no sense to me, because there is only one regulator (e.g., the FDA). Why appeal to two regulators?

Furthermore, in the IUT framework it is not necessary for the two individual hypotheses to be tested using one-sided t-tests. By considering the configuration of the parameter space in a bioequivalence problem more carefully, it is easy to define other tests that are size-α for the two individual hypotheses. When these are combined using the IUT method into an overall size-α test, they can yield a test that is uniformly more powerful than the TOST. We give an example of such tests in Berger and Hsu. Thus the IUT method gives simple constructions of tests that are superior in power to the usual TOST.

The second main point of Berger and Hsu is this. Describing a size-α (e.g., α = .05) bioequivalence test using a 100(1 − 2α)% (e.g., 90%) confidence interval is confusing and misleading. As Brown, Casella, and Hwang (1995) said, it is only an “algebraic coincidence” that in one particular case there is a correspondence between a size-α bioequivalence test and a 100(1 − 2α)% confidence interval. In Berger and Hsu we point out several examples in which other authors have considered other equivalence type hypotheses and have assumed they could define a size-α test in terms of a 100(1 − 2α)% confidence set. In some cases the resulting tests are conservative, in other cases liberal. *There is no* general correspondence between α-level equivalence tests and 100(1 − 2α)% confidence sets. This description of one particular size-α equivalence test in terms of a 100(1 − 2α)% confidence interval is confusing and should be abandoned.
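The one particular correspondence mentioned above can be checked directly. The sketch below is a minimal illustration, assuming a known standard error `se` so that normal quantiles stand in for the t quantiles of the actual TOST; the function names `tost_rejects` and `ci_inside_limits` and the sample inputs are hypothetical. It shows only that, in this one case, two one-sided size-α tests and the 100(1 − 2α)% interval give the same decision. It is not evidence of any general rule.

```python
from statistics import NormalDist

def tost_rejects(d, se, delta=0.22, alpha=0.05):
    """Two one-sided z-tests; declare bioequivalence only if both reject."""
    z = NormalDist().inv_cdf(1 - alpha)
    reject_upper = (d - delta) / se < -z  # rejects H0: theta >= delta
    reject_lower = (d + delta) / se > z   # rejects H0: theta <= -delta
    return reject_upper and reject_lower

def ci_inside_limits(d, se, delta=0.22, alpha=0.05):
    """Is the 100(1 - 2*alpha)% interval d +/- z*se inside (-delta, delta)?"""
    z = NormalDist().inv_cdf(1 - alpha)
    return -delta < d - z * se and d + z * se < delta

# (estimate, standard error) pairs chosen purely for illustration
cases = [(0.00, 0.05), (0.10, 0.05), (0.15, 0.05), (0.10, 0.10), (-0.20, 0.02)]
agree = all(tost_rejects(d, se) == ci_inside_limits(d, se) for d, se in cases)
print(agree)
```

Rearranging each one-sided inequality gives the matching interval endpoint condition, which is why the two functions agree here.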

On another point, I would disagree with Senn’s characterization that Perlman and Wu (1999) criticized our new tests on theoretical grounds. Rather, I would call them intuitive grounds. They said it sounds crazy to decide in favor of equivalence when the point estimate is outside the equivalence limits (much as Senn said). The theory, as we presented it, is sound. The tests are size-α, and uniformly more powerful than the TOST, and less biased. But in our original paper we acknowledged that they are counterintuitive. We suggested modifications that could be made to eliminate the counterintuitivity but still increase the power over the TOST (another simple argument using the IUT method).

Finally, to correct a misstatement, in the extensive discussion following the original Senn post, there are several references to the “union-intersection method of R. Berger.” The method we used is the intersection-union method. In the union-intersection method individual tests are combined in a different way. In this method if individual size-α tests are used, then the overall test has size greater than α. The individual tests must have size less than α in order for the overall test to have size α. (This is the usual situation with many methods of combining tests.)
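A small numerical contrast may help here. The sketch below assumes the simplest union-intersection style case, a point null θ = 0 tested with a single standard normal statistic, where the null is rejected if either one-sided test rejects; the function name `either_tail_size` is hypothetical. With each one-sided test at level α the overall size is 2α, so the individual levels must be reduced (to α/2 in this case) to keep the overall size at α.

```python
from statistics import NormalDist

Z = NormalDist()

def either_tail_size(alpha_each):
    """Size of the test that rejects H0: theta = 0 when EITHER one-sided
    z-test (upper tail or lower tail) rejects at level alpha_each."""
    z = Z.inv_cdf(1 - alpha_each)
    # The two tail events are disjoint, so their probabilities add.
    return (1 - Z.cdf(z)) + Z.cdf(-z)

print(round(either_tail_size(0.05), 4))   # each tail at .05
print(round(either_tail_size(0.025), 4))  # each tail at .025
```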

Berger, R.L., Hsu, J.C. (1996). Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets (with Discussion). *Statistical Science*, 11, 283-319.

Brown, L. D., Casella, G. and Hwang, J. T. G. (1995a). Optimal confidence sets, bioequivalence, and the limacon of Pascal. *J. Amer. Statist. Assoc.,* 90, 880-889.

Perlman, M.D., Wu, L. (1999). The emperor’s new tests. *Statistical Science,* 14, 355-369.

Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog (errorstatistics.com)*.

*********

**Stephen Senn**

Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**Comment on Roger Berger**

I am interested and grateful to Dr Berger for taking the trouble to comment on my blogpost.

First let me apologise to Dr Berger if I have misrepresented Berger and Hsu[1]. The interested reader can do no better than look up the original publication. This also gives me the occasion to recommend two further articles that appeared at a very similar time to Berger and Hsu. The first[2] is by my late friend and colleague Gunther Mehring and appeared shortly before Berger and Hsu. Gunther and I did not agree on philosophy of statistics but we had many interesting discussions on the subject of bioequivalence during the period that we both worked for CIBA-Geigy and what very little I know of the more technical aspects of general interval hypotheses is due to him. Also of interest is the paper by Brown, Hwang and Munk[3], which appeared a little after Berger and Hsu[1] and this has an interesting remark I propose to discuss:

“We tried to find a fundamental argument for the assertion that a reasonable rejection region should not be unbounded by using a likelihood approach, a Bayesian approach, and so on. However, we did not succeed. Therefore we are not convinced it should not be unbounded.”(p 2348)

Although I do not find the tests proposed by the three sets of authors[1-3] an acceptable practical approach to bioequivalence there is a sense in which I agree with Brown et al but also a sense in which I don’t.

I agree with them because it *is* possible to find cases in which within a Bayesian decision-analytic framework it is possible to claim equivalence even though the point estimate falls outside the limit of equivalence. A sufficient set of conditions is the following.

- It is strongly believed that were no evidence at all available the logical course of action would be to accept bioequivalence. That is to say *if* the only choices of actions were A: accept bioequivalence or B: reject bioequivalence the combination of prior belief and utilities would support A.
- However, at no or little cost, a very small bioequivalence study can be run.
- This is the only further information that can be obtained.
- Thus the initial situation is that of a three-valued decision outcome, A: accept bioequivalence, B: reject bioequivalence, C: run the small experiment.
- However, if the small experiment is run the only possible actions remaining will be A or B. There is no possibility of collecting yet further information.
- Despite the fact that the evidence from the small experiment has almost no chance of elevating *a posteriori* B to being a preferable decision to A, since the information from action C is almost free, C is the preferred action.

Under such circumstances it could be logical to run a small trial and it could be logical, having run the trial, to accept decision A in preference to B even though the point estimate were outside the limits of equivalence. Basically, given such conditions, it would require an *extremely* in-equivalent result to cause one to prefer B to A. A moderately in-equivalent result would not suffice. However, the fact that the possibility, however remote, of changing B for A exists makes C a worthwhile choice initially.

So technically, at least as regards the Bayesian argument, I think that Brown et al are right. Practically, however, I can think of no realistic circumstances under which these conditions could be satisfied.

Dr Berger and I agree that the FDA’s position on type one error rates is somewhat inconsistent so it is, of course, always dangerous to cite regulatory doctrine as a defence of a claim that an approach is logical. Nevertheless, I note that I do not see any haste by the FDA to replace the current biased test with unbiased procedures. I think that they are far more likely to consider, Dr Berger’s appeal to simplicity notwithstanding, that they are, indeed, entitled here, *as will have been the case with the innovator product*, to be provided with separate demonstrations of efficacy and tolerability. Seen in this light Schuirmann’s TOST procedure[4] is logical and consistent (apart from the choice of 5% level!).

My basic objection to unbiased tests of this sort[1-3], however, goes much deeper and here I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.) Thus my interpretation of NP is the reverse: by thinking in terms of likelihood one sometimes obtains a power bonus. If so, so much the better, but this is not the justification for likelihood, *au contraire*.

**References**

1. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. *Statistical Science* 1996; **11**: 283-302.
2. Mehring G. On optimal tests for general interval hypotheses. *Communications in Statistics: Theory and Methods* 1993; **22**: 1257-1297.
3. Brown LD, Hwang JTG, Munk A. An unbiased test for the bioequivalence problem. *Annals of Statistics* 1997; **25**: 2345-2367.
4. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *J Pharmacokinet Biopharm* 1987; **15**: 657-680.
5. Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog (errorstatistics.com)*.

^^^^^^^^^^^^^^^^^^^

(*) **Mayo remark on this exchange: Following Senn’s “Blood Simple” post on this blog, I asked Roger Berger for some clarification, and his post grew out of his responses. I’m very grateful to him for his replies and the post. Subsequently, I asked Senn for a comment to the R. Berger post (above), and I’m most appreciative to him for supplying one on short notice. With both these guest posts in hand, I now share them with you. I hope that this helps to decipher a conundrum that I, for one, have had about bio-equivalence tests. But I’m going to have to study these items much more carefully. I look forward to reader responses.**

*Just one quick comment on Senn’s remark:*

“….I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.)”

*My position on this, I hope, is clear in published work, but just to say one thing: I don’t think that power is “a justification for using likelihood as a basis for thinking about inference”. I agree with E. Pearson in his numbering the steps (fully quoted in this post)*

“Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (E. Pearson 1966a, 173).

http://errorstatistics.com/2013/08/13/blogging-e-s-pearsons-statistical-philosophy/

*(Perhaps this is the evidence Senn has in mind.) Merely maximizing power, defined in the crude way we sometimes see (e.g., average power taken over mixtures, as in Cox’s and Birnbaum’s famous examples) can lead to faulty assessments of inferential warrant, but then, I never use pre-data power as an assessment of severity associated with inferences.*

*While power isn’t necessary “for using likelihood as a basis for thinking about inference” nor for using other distance measures (at Step 2), reports of observed likelihoods and comparative likelihoods are inadequate for inference and error probability control. Hence, Pearson’s Step 3.*

Does the issue Senn raises on power really play an important role in his position on bioequivalence tests? I’m not sure. I look forward to hearing from readers.

I am very grateful to my guest posters for their interesting remarks. I remain somewhat puzzled about some of the problems. (Are they practical? philosophical? both?)

Senn says that it is possible to find cases in which within a Bayesian decision-analytic framework it is logical to accept bio-equivalence despite a point estimate from a small study falling outside the limit of equivalence, because it is not so strongly inequivalent as to warrant a switch to inferring non-equivalence, yet he can “think of no realistic circumstances under which these conditions could be satisfied”. But I’m not sure if this shows a problem with the approach favored by R. Berger (as I didn’t take him to be appealing to a Bayesian decision-theoretic justification). That said, I’d like to hear Berger’s take on this.

I don’t doubt the complexity of the issues Senn and Berger raise, although reading about generic drug companies (in my other life), I get the impression that bioequivalence techniques are quite successful in constraining variability.

http://www.statsols.com/fda-recommendations/

“The FDA’s bioequivalence guide has faced criticism for the strength of its statistical analysis guidelines. Under the agency’s standards, the API* profiles for brand-name drugs and their generic counterparts do not need to be identical; some minor variation is acceptable. In practice, however, high variability is virtually non-existent. Analysis of 2,070 bioequivalence studies from 1996 to 2007 revealed an average variability of 4.35% for maximum API concentration and 3.56% for time to maximum concentration.”

*active pharmaceutical ingredient (API)

As regards Pearson being less enthusiastic about power than Neyman, I was thinking in particular of Constance Reid’s biography ‘Neyman from Life’, Springer, 1982.

On P73 you will find

…Lehmann continues “there really can be no doubt about the fact that the basic ideas of the 1928 paper were communicated to Neyman by Pearson and that-fortunately!-from the beginning Neyman did not find the likelihood ratio as compelling as Pearson did…”

Of course, this is very much hearsay evidence and in any case not the same as saying that in the end Pearson did not find power as compelling as Neyman did. However, it does suggest that Pearson found the idea of likelihood initially more intuitive than power. That’s certainly my position.

Stephen: Pearson was against the purely behavioristic use of power but “power” was actually Pearson’s idea (he, you recall, being the “heretic”). He absolutely did emphasize the need for a sensible distance measure (Step 2), but the whole power rationale is lost without that. See the quote in this post after Step 3. http://errorstatistics.com/2013/08/13/blogging-e-s-pearsons-statistical-philosophy/

Your quote from Reid is cryptic, but these were issues Neyman and Pearson talk about very directly in their work and in letters.

That’s not the story as told by Reid. According to her account the idea came to Neyman in February 1930

P93 ‘From the beginning of their collaboration…Neyman had found the principle of likelihood not nearly so compelling as Pearson had

…In a letter at the beginning of February 1930, Neyman told Pearson “if we show that the frequency of accepting a false hypothesis is minimum when we use [likelihood] tests, I think it will be quite a thing.”

…The first real step in the solution of the problem of what today is called “the most powerful test” of a simple statistical hypothesis against a fixed simple alternative came suddenly and unexpectedly in a moment which Neyman has never forgotten’

Stephen: Of course Neyman wrote in his letter:

“if we show that the frequency of accepting a false hypothesis is minimum when we use [likelihood] tests, I think it will be quite a thing.”

But the term “power” was Pearson’s idea*, not that it really matters. Pearson completely agreed that it would “be quite a thing” to find most powerful tests, and he explicitly rejected the idea of stopping with likelihood ratios (at step 2). And Neyman was happy as a clam that Pearson’s LR corresponded to best tests, at least in certain cases. So I don’t think we’re disagreeing about anything in this regard.**

*I feel like we’re talking about where the “4 seasons” got the idea for their name, as in that recent Clint Eastwood movie, “Jersey Boys”, never mind.

**Not that I would weigh a biographer’s construal over first person accounts by Pearson, Neyman, Lehmann.

Stephen: I meant to note that the reason Neyman wasn’t keen on likelihood at first is that he viewed it as a veiled attempt to invoke priors (as noted by E. Pearson, Lehmann).

A question for Roger Berger: I was wondering if you could explain how it is that “Unlike most other methods of combining tests,…in the IUT method of combining tests size α tests are combined in a particular way to yield an overall test that has size α, also.”

Thanks so much.

This fact is Theorem 8.3.23 in Casella & Berger (2002). Statistical Inference, 2nd edition. Wadsworth, Pacific Grove, CA. It is proved earlier in Theorem 1 of Berger, RL (1982), “Multiparameter hypothesis testing and acceptance sampling,” Technometrics 24, 295-300. And this fact is reported (not proved) in an earlier abstract, Gleser, LJ (1973). “On a Theory of Intersection-Union Tests,” IMS Bulletin (Abstract), 2, 233.

Roger: First, welcome to the blog! (It’s only sent for approval the first time.)

After I wrote that question (which was a query put to me by an e-mail) I started to think about it, and gave a conjectured reply to myself below. I’d be glad to know what you think, and I will look at your references.

Roger: Maybe I do see why it holds: The null is a union (a disjunction for philosophers), e.g., blood concentration of the drug is either too low or too high. The alpha rejection regions for the too low and too high hypotheses are R’ and R”, respectively.

So, the rejection region R for the intersection-union test (IUT) is a subset of each of the individual rejection regions, so P(x is in R; Ho) ≤ P(x is in R'; Ho). So it too is no greater than alpha.

Is that the idea? Here the erroneous rejection can't be wrong in both ways, too high and too low, just as with bio-equivalence tests. So I see that. But what if the null were something like:

Ho: either low ability in philosophy or low in statistics

(say in acceptance sampling of job candidates)

A candidate must have both hypotheses rejected to be an acceptable candidate (who needs enough ability in phil and in stat).

Again, the rejection region R for the intersection is a subset of the individual rejection regions, but can the type 1 error still be no greater than alpha when he can fail in both ways (i.e., the test is wrong about both)?

Maybe it can.

Aug 3: I now do think it can.
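The job-candidate question can be checked with a small calculation. The sketch below is only an illustration under assumptions not made in the comment: each ability is assessed by a one-sided z-test with known standard error, "low ability" means at or below a cutoff of 0, and the two test statistics are independent. All names (`reject_prob`, `iut_accept_prob`, `null_cases`) are hypothetical. The bound itself does not need independence, since the joint rejection region is contained in each individual one; independence is used only to get a closed-form product.

```python
from statistics import NormalDist

Z = NormalDist()

def reject_prob(mu, cutoff=0.0, se=1.0, alpha=0.05):
    """P(one-sided z-test rejects H0: ability <= cutoff) when true ability is mu."""
    z = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z - (mu - cutoff) / se)

def iut_accept_prob(mu_phil, mu_stat):
    """P(both nulls rejected), assuming independent test statistics."""
    return reject_prob(mu_phil) * reject_prob(mu_stat)

# Null configurations: at least one ability is at or below the cutoff.
null_cases = {
    "low phil, superb stat": (0.0, 10.0),
    "both at cutoff": (0.0, 0.0),
    "poor phil, good stat": (-1.0, 3.0),
}
type_one = {name: iut_accept_prob(*mus) for name, mus in null_cases.items()}
for name, prob in type_one.items():
    print(name, round(prob, 4))
```

Failing in both ways makes wrongly accepting the candidate less likely, not more: under these assumptions the error probability is α squared when both abilities sit at the cutoff, and it approaches α only when one ability is at the cutoff and the other is far above it.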

For me the practical problem involves two hypotheses (sub-availability and super-availability) each of which has to be proved wrong to claim equivalence.

This is analogous to what happens to the innovator product. You have to prove efficacy (formally) using one outcome or, if many outcomes, in a way that requires control of the type I error rate, and you have to separately address safety. (Usually this is informal and involves many different outcomes.)

There is an interesting parallel to classical two-sided tests. Here one tests two hypotheses, either of which can be proved wrong. There, in my view, the ‘correct’ approach is to test each at 2.5%. This means that the tail probability for the tail that is actually used should be doubled to produce the P-value. Some do not like this but prefer to evaluate each tail separately and add them in the way that maximises power. See section 12.2.7 in Statistical Issues in Drug Development http://www.senns.demon.co.uk/c12.pdf

Stephen: I looked at those pages in your book. Why do you say on p. 187 that doubling the (.061) P-value yields .0123–is this supposed to be .123?

My confusion, at least one of many, is that in the ordinary two sided test the alternative is a union (disjunction) whereas in R. Berger’s IUT the alternative is an intersection (conjunction). So the type 1 error prob can still be alpha. Any thoughts on my comment/query? http://errorstatistics.com/2014/07/31/roger-berger-on-senns-blood-simple-with-a-response-by-s-senn-guest-posts/comment-page-1/#comment-89392

(Also I didn’t think he was computing a P-value, but using a pre-data cut-off. That might not matter.)

Thanks. It’s a typo (there are three kinds of statistician: those who can count and those who can’t). There is a difference as to whether an AND or OR logic is used but that’s not my point. The point is that in my view, two tests are used in each case (bioequivalence and non-directional alternative hypotheses).

Stephen: But look at what I wrote in my comment. The rejection region is what it is, an intersection of the two individual rejection regions. The probability of falling into it, under the null, is no greater than falling into an individual rejection region, under the null.

Yes there are two tests, but in the IUT* it is required that both lead to rejection of their respective nulls before you have a rejection of the null in the IUT. So there has to be a different treatment than with the ordinary 2-sided test where the null is an intersection, and you reject it so long as either is rejected. That is, the difference must show up.

(With UIT cases, by contrast Roger will add the two error rates, if I’m understanding his treatment.)

*I admit to only just getting an idea of these tests, and don’t plan to delve further into the knotty issues of bioequivalence (but this preliminary logical point strikes me as not so different from more ordinary approx. equivalence).

That’s not my point. My point is that in each case there are (potentially) two tests. The issue is whether you should simply establish the property of each of the two tests and make sure that each of these two is independently satisfactory or whether, in setting up one of them, you should anticipate what the other will do.

Look at it like this. If safety were the only issue the toxicity regulator would establish one particular region and require the test statistic to fall inside it to allow the product to be declared safe. This is the best region for that purpose that guarantees the type I error be no greater than 5%.

However, safety is not the only issue, so the sponsor says, “That’s unfair. There’s a part of the region you have assigned me where I will fail the test I have to satisfy to keep your colleague the efficacy regulator happy. Overall, my probability of registering the drug given that it is (just) unsatisfactory is less than 5%.”

The safety regulator replies “Where did you get the idea that you are entitled to this probability under all circumstances? If what you say is true you have designed an inadequate experiment. You’ll be coming to me next with no data at all and demanding that you be allowed to roll your icosahedral die.”

Who wins this argument? I think the regulator does. Apparently others disagree.

Note that in those cases where a pharmacokinetic bioequivalence argument is not possible (for example inhaled bronchodilators) it might be appropriate to show equivalence via non-inferiority in terms of two different pharmacodynamic measures – one for efficacy and one for safety. For example, one might consider nominating FEV1 for the former and QT prolongation for the latter. Here again the type I error rate will be plausibly less than 5% but nobody would adjust the individual tests in consequence.

Stephen: I’m guessing now that you’re alluding to a different variation on the test, other than the two one-sided tests, and I realize R. Berger recommends some more complex combination which I haven’t studied. Perhaps it involves juggling in a way depending on outcomes, and that’s your concern. Is that the idea? Anyway, I’ve gone to the end of my current understanding here, but maybe it will come up again. I’m not sure whether your objections refer to the methods currently in use or only certain recommended improvements. It’s kind of ironic that here, possibly, we’ve a methodology that seems to work fine in practice, but maybe not in theory?

No. I am saying that the situation is one of two one-sided tests but that each has to be evaluated separately. Each has to have a type I error rate of 5% and if that means that you have less chance of declaring equivalence, well, life’s tough if you design inadequate experiments. If I understand Roger Berger correctly he wants to replace this with one test for the closed region of equivalence. I don’t agree that this is the practical problem.

Stephen: Well the IUTs he describes in his post (which are also in Casella and Berger (2002)) do have two one sided tests each with alpha of .05, and the rejection region is the intersection. But I know nothing of other tests he’s developed, so I shouldn’t speculate.

Stephen seems to define the bioequivalence problem like this. There are two one-sided hypotheses to be tested (call them the safety hypothesis and the efficacy hypothesis). He wants each to be tested separately at level .05 (so we report accept or reject for each hypothesis). AND he seems to insist that each test be a one-sided t-test. With all these constraints there is not much left to say except that IUT theory says each t-test can be conducted at level .05 and the overall test which declares bioequivalence if and only if both tests reject will be a level .05 test, also. This is the usual TOST.

What Berger & Hsu (1996) did was say, “What if you don’t require the tests of the two one-sided hypotheses to be t-tests?” What if you use some other tests to test the two one-sided hypotheses. We defined other level .05 tests (as required by Stephen) for each of the two one-sided hypotheses. IUT theory says we can combine these two tests, and the overall test which declares bioequivalence if and only if both of these new tests reject will be a level .05 test, also. But the overall test we get by combining these two new tests is uniformly more powerful (uniformly more likely to declare bioequivalence) than the TOST based on the two t-tests. So with the new methodology you do get an accept/reject decision for each of the safety and efficacy hypotheses, but you also have higher power for the overall test.

This discussion reminds me of discussions I heard back in the 1970s when I was in grad school. Shrinkage estimation (aka, Stein estimation) was a hot topic. Recall, the point then was that in estimating a univariate normal mean, the sample mean is an admissible estimator in terms of mean squared error. But when estimating a vector of three or more normal means, other estimators (rather than using the sample mean in each coordinate) have uniformly smaller mean squared error. A better estimator could be found by looking at the problem as a whole rather than looking only at each coordinate individually. You still obtained estimates of each coordinate individually, but the overall estimate of the vector was better. The same thing holds here in the bioequivalence problem. You still have a test of each hypothesis individually, but when you combine the tests you have a better procedure overall.
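The shrinkage analogy can be illustrated with a short simulation. This is a minimal sketch, assuming one unit-variance normal observation per coordinate and the classical James-Stein form that shrinks toward the origin; the function name `simulate_risk`, the mean vector, the seed and the number of replications are all hypothetical choices made for illustration.

```python
import random

def simulate_risk(theta, reps=20000, seed=0):
    """Average total squared error of the coordinatewise estimator X and of
    the James-Stein estimator (1 - (p - 2) / ||X||^2) X, with X ~ N(theta, I)."""
    rng = random.Random(seed)
    p = len(theta)
    plain_loss = js_loss = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        norm_sq = sum(v * v for v in x)
        shrink = 1.0 - (p - 2) / norm_sq  # one shrinkage factor for all coordinates
        plain_loss += sum((v - t) ** 2 for v, t in zip(x, theta))
        js_loss += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))
    return plain_loss / reps, js_loss / reps

# Five means, chosen arbitrarily; the result concerns three or more coordinates.
plain_risk, js_risk = simulate_risk([1.0, -0.5, 0.3, 2.0, 0.0])
print(round(plain_risk, 3), round(js_risk, 3))
```

Each coordinate still receives its own estimate, but the shrinkage factor is computed from the whole vector, which is where the reduction in total squared error comes from.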

I do not think you can dismiss these new tests by saying, “life’s tough if you design inadequate experiments.” I think one role of the statistician is to help the client get the most possible information out of the data at hand, whether the sample size is large or small. Our new tests give the client more power for the overall test while still providing level .05 tests for each of the individual hypotheses. The only drawback is that the two tests of the one-sided hypotheses are not t-tests. If the regulator or the client insists that the two individual hypotheses have to be tested using t-tests, then there is nothing left to do but use the TOST. If this is the case then Berger and Hsu (1996) is still useful, because it points out that the TOST is a simple application of IUT theory. No confusing appeal to 90% confidence intervals to perform level .05 tests is needed.

Roger: Thanks so much for this latest comment, I very much look forward to hearing Senn’s response. I haven’t worked through all of Berger and Hsu (it’s amazing how many distinct approaches there are to this problem). It’s too far afield for me to get more than a vague general sense of the idea: by combining tests allowed to vary from a fixed alpha you can fulfill the type 1 error requirements and still get more overall power. Even that might be off. Still, I find the idea of an alternative hypothesis expressing practical equivalence very intriguing.

I think Roger and I have a position of mutual understanding and practical disagreement, so I don’t think that there is much to add.

As a purely theoretical exercise it would be interesting to check what practical differences (if any) there are of adopting any one of the three methods to which I referred (Mehring, Berger and Hsu, Brown et al) when the precision is low. I might try to have a look some time but suspect that Roger already knows.

Stephen: Yes, it would be good to see an example where they come out differently and intuitively one is better. At least we have these points here to possibly come back to at some point (possibly when the issue emerges in FDA policy).