School Director & Professor

School of Mathematical & Natural Science

Arizona State University

**Comment on S. Senn’s post: “Blood Simple? The complicated and controversial world of bioequivalence”**

First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.

Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized it. I do not recognize his explanation after “The argument goes as follows.” Senn says that our argument against the bioequivalence test defined by the 90% confidence interval is based on the fact that the Type I error rate for this test is zero. This is not true. The bioequivalence test in question, defined by the 90% confidence interval, has size exactly equal to α = .05. The Type I error probability is not zero. But this test is biased; the Type I error probability converges to zero as the variance goes to infinity on the boundary between the null and alternative hypotheses. This biasedness allows other tests to be defined that also have size α but are uniformly more powerful than the test defined by the 90% confidence interval.

The two main points in Berger and Hsu (1996) are these.

First, by considering the bioequivalence problem in the intersection-union test (IUT) framework, it is easy to define size-α tests. The IUT method of test construction may be useful if the null hypothesis is conveniently expressed as a union of sets in the parameter space. In a bioequivalence problem the null hypothesis (asserting non-bioequivalence) is that the difference (as measured by the difference in log means) between the two drug formulations is either greater than or equal to .22 or less than or equal to -.22. Hence the null hypothesis is the union of two sets, the part where the parameter is greater than or equal to .22 and the part where the parameter is less than or equal to -.22. The intersection-union method considers two hypothesis tests, one of the null “greater than or equal to .22” versus the alternative “less than .22” and the other of the null “less than or equal to -.22” versus the alternative “greater than -.22.” The fundamental result about IUTs is that if each of these tests is carried out with a size-α test, and if the overall bioequivalence null is rejected if and only if each of these individual tests rejects its respective null, then the resulting overall test has size at most α. Unlike most other methods of combining tests, in which individual tests must have size less than α to ensure the overall test has size α, in the IUT method size-α tests are combined in a particular way to yield an overall test that also has size α.

In the usual formulation of the bioequivalence problem, each of the two individual hypotheses is tested with a one-sided, size-α t-test. If each of these individual t-tests rejects its null, then bioequivalence is concluded. This has come to be called the Two One-Sided Tests (TOST) procedure. The IUT method simply combines two one-sided t-tests into an overall test that has size α. This is much simpler than vague discussions about regulators not trading α, etc. That explanation makes no sense to me, because there is only one regulator (e.g., the FDA). Why appeal to two regulators?
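The TOST recipe just described is short enough to sketch in code. The following is a minimal illustration, not Berger and Hsu's code; the `tost` helper, the ±.22 limits on the log scale, and the sample numbers are mine:

```python
from scipy import stats

def tost(diff, se, df, delta=0.22, alpha=0.05):
    """Two One-Sided Tests (TOST) for average bioequivalence.

    diff: estimated difference in log means; se: its standard error;
    df: degrees of freedom.  Non-bioequivalence is rejected iff BOTH
    one-sided size-alpha t-tests reject -- the intersection-union
    construction, so the overall test has size at most alpha.
    """
    # Test 1: H0: diff >= delta  vs  H1: diff < delta  (reject for small t)
    p_upper = stats.t.cdf((diff - delta) / se, df)
    # Test 2: H0: diff <= -delta vs  H1: diff > -delta  (reject for large t)
    p_lower = stats.t.sf((diff + delta) / se, df)
    return bool(p_upper < alpha and p_lower < alpha)

print(tost(0.05, se=0.06, df=30))  # True: both one-sided tests reject
print(tost(0.20, se=0.06, df=30))  # False: the upper test fails to reject
```

Note that each component test is run at level α = .05, yet the overall test still has size at most .05; no α-splitting is needed.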

Furthermore, in the IUT framework it is not necessary for the two individual hypotheses to be tested using one-sided t-tests. By considering the configuration of the parameter space in a bioequivalence problem more carefully, it is easy to define other tests that are size-α for the two individual hypotheses. When these are combined using the IUT method into an overall size-α test, they can yield a test that is uniformly more powerful than the TOST. We give an example of such tests in Berger and Hsu. Thus the IUT method gives simple constructions of tests that are superior in power to the usual TOST.

The second main point of Berger and Hsu is this. Describing a size-α (e.g., α = .05) bioequivalence test using a 100(1 − 2α)% (e.g., 90%) confidence interval is confusing and misleading. As Brown, Casella, and Hwang (1995) said, it is only an “algebraic coincidence” that in one particular case there is a correspondence between a size-α bioequivalence test and a 100(1 − 2α)% confidence interval. In Berger and Hsu we point out several examples in which other authors have considered other equivalence type hypotheses and have assumed they could define a size-α test in terms of a 100(1 − 2α)% confidence set. In some cases the resulting tests are conservative, in other cases liberal. *There is no* general correspondence between α-level equivalence tests and 100(1 − 2α)% confidence sets. This description of one particular size-α equivalence test in terms of a 100(1 − 2α)% confidence interval is confusing and should be abandoned.

On another point, I would disagree with Senn’s characterization that Perlman and Wu (1999) criticized our new tests on theoretical grounds. Rather, I would call them intuitive grounds. They said it sounds crazy to decide in favor of equivalence when the point estimate is outside the equivalence limits (much as Senn said). The theory, as we presented it, is sound. The tests are size-α, uniformly more powerful than the TOST, and less biased. But in our original paper we acknowledged that they are counterintuitive. We suggested modifications that could be made to eliminate the counterintuitive behavior while still increasing the power over the TOST (another simple argument using the IUT method).

Finally, to correct a misstatement, in the extensive discussion following the original Senn post, there are several references to the “union-intersection method of R. Berger.” The method we used is the intersection-union method. In the union-intersection method individual tests are combined in a different way. In this method if individual size-α tests are used, then the overall test has size greater than α. The individual tests must have size less than α in order for the overall test to have size α. (This is the usual situation with many methods of combining tests.)

Berger, R.L., Hsu, J.C. (1996). Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets (with Discussion). *Statistical Science*, 11, 283-319.

Brown, L. D., Casella, G. and Hwang, J. T. G. (1995a). Optimal confidence sets, bioequivalence, and the limacon of Pascal. *J. Amer. Statist. Assoc.,* 90, 880-889.

Perlman, M.D., Wu, L. (1999). The emperor’s new tests. *Statistical Science,* 14, 355-369.

Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog* (errorstatistics.com).

*********

**Stephen Senn**
Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**Comment on Roger Berger**

I am interested in Dr Berger’s comments and grateful to him for taking the trouble to respond to my blogpost.

First let me apologise to Dr Berger if I have misrepresented Berger and Hsu[1]. The interested reader can do no better than look up the original publication. This also gives me the occasion to recommend two further articles that appeared at a very similar time to Berger and Hsu. The first[2] is by my late friend and colleague Gunther Mehring and appeared shortly before Berger and Hsu. Gunther and I did not agree on philosophy of statistics, but we had many interesting discussions on the subject of bioequivalence during the period that we both worked for CIBA-Geigy, and what very little I know of the more technical aspects of general interval hypotheses is due to him. Also of interest is the paper by Brown, Hwang and Munk[3], which appeared a little after Berger and Hsu[1] and has an interesting remark I propose to discuss:

“We tried to find a fundamental argument for the assertion that a reasonable rejection region should not be unbounded by using a likelihood approach, a Bayesian approach, and so on. However, we did not succeed. Therefore we are not convinced it should not be unbounded.”(p 2348)

Although I do not find the tests proposed by the three sets of authors[1-3] an acceptable practical approach to bioequivalence, there is a sense in which I agree with Brown et al but also a sense in which I don’t.

I agree with them because it *is* possible to find cases in which, within a Bayesian decision-analytic framework, one can claim equivalence even though the point estimate falls outside the limits of equivalence. A sufficient set of conditions is the following.

- It is strongly believed that were no evidence at all available the logical course of action would be to accept bioequivalence. That is to say, *if* the only choices of action were A: accept bioequivalence or B: reject bioequivalence, the combination of prior belief and utilities would support A.
- However, at no or little cost, a very small bioequivalence study can be run.
- This is the only further information that can be obtained.
- Thus the initial situation is that of a three-valued decision outcome: A: accept bioequivalence, B: reject bioequivalence, C: run the small experiment.
- However, if the small experiment is run, the only possible actions remaining will be A or B. There is no possibility of collecting yet further information.
- Despite the fact that the evidence from the small experiment has almost no chance of elevating B *a posteriori* to being a preferable decision to A, since the information from action C is almost free, C is the preferred action.

Under such circumstances it could be logical to run a small trial and it could be logical, having run the trial, to accept decision A in preference to B even though the point estimate were outside the limits of equivalence. Basically, given such conditions, it would require an *extremely* in-equivalent result to cause one to prefer B to A. A moderately in-equivalent result would not suffice. However, the fact that the possibility, however remote, of changing B for A exists makes C a worthwhile choice initially.

So technically, at least as regards the Bayesian argument, I think that Brown et al are right. Practically, however, I can think of no realistic circumstances under which these conditions could be satisfied.
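These sufficient conditions can be given a toy numerical form. In the sketch below every number is invented: a normal prior tightly concentrated inside the equivalence limits (the strong belief in A) is combined with one very imprecise, nearly free study. The posterior then prefers A even when the point estimate lies well outside ±.22, and only a wildly in-equivalent result flips the decision:

```python
import math

def post_prob_equiv(estimate, prior_mean=0.0, prior_sd=0.05,
                    se=0.40, delta=0.22):
    """Posterior Pr(|true log-ratio| < delta), normal prior strongly
    concentrated inside the equivalence limits, normal likelihood from
    one cheap, very imprecise study.  All numbers invented."""
    prec = 1 / prior_sd**2 + 1 / se**2                      # posterior precision
    post_sd = math.sqrt(1 / prec)
    post_mean = (prior_mean / prior_sd**2 + estimate / se**2) / prec
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return (phi((delta - post_mean) / post_sd)
            - phi((-delta - post_mean) / post_sd))

# Point estimate outside the limits, yet decision A (accept equivalence)
# remains overwhelmingly preferred:
print(post_prob_equiv(0.30))   # close to 1
# Only an extremely in-equivalent result makes B preferable:
print(post_prob_equiv(20.0))   # well below 1/2
```

The imprecise likelihood barely moves the concentrated prior, which is exactly why the tiny study has almost no chance of changing the decision, while still being worth running when it costs nearly nothing.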

Dr Berger and I agree that the FDA’s position on type one error rates is somewhat inconsistent so it is, of course, always dangerous to cite regulatory doctrine as a defence of a claim that an approach is logical. Nevertheless, I note that I do not see any haste by the FDA to replace the current biased test with unbiased procedures. I think that they are far more likely to consider, Dr Berger’s appeal to simplicity notwithstanding, that they are, indeed, entitled here, *as will have been the case with the innovator product*, to be provided with separate demonstrations of efficacy and tolerability. Seen in this light Schuirmann’s TOST procedure[4] is logical and consistent (apart from the choice of 5% level!).

My basic objection to unbiased tests of this sort[1-3], however, goes much deeper and here I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.) Thus my interpretation of NP is the reverse: by thinking in terms of likelihood one sometimes obtains a power bonus. If so, so much the better, but this is not the justification for likelihood, *au contraire*.

**References**

1. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. *Statistical Science* 1996; **11**: 283-302.
2. Mehring G. On optimal tests for general interval hypotheses. *Communications in Statistics: Theory and Methods* 1993; **22**: 1257-1297.
3. Brown LD, Hwang JTG, Munk A. An unbiased test for the bioequivalence problem. *Annals of Statistics* 1997; **25**: 2345-2367.
4. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *J Pharmacokinet Biopharm* 1987; **15**: 657-680.
5. Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog* (errorstatistics.com).

^^^^^^^^^^^^^^^^^^^

**Mayo remark on this exchange:** Following Senn’s “Blood Simple” post on this blog, I asked Roger Berger for some clarification, and his post grew out of his responses. I’m very grateful to him for his replies and the post. Subsequently, I asked Senn for a comment to the R. Berger post (above), and I’m most appreciative to him for supplying one on short notice. With both these guest posts in hand, I now share them with you. I hope that this helps to decipher a conundrum that I, for one, have had about bio-equivalence tests. But I’m going to have to study these items much more carefully. I look forward to reader responses.

*Just one quick comment on Senn’s remark:*

“….I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.)”

*My position on this, I hope, is clear in published work, but just to say one thing: I don’t think that power is “a justification for using likelihood as a basis for thinking about inference”. I agree with E. Pearson in his numbering the steps (fully quoted in this post)*

Step 2. “We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (E. Pearson 1966a, 173).

http://errorstatistics.com/2013/08/13/blogging-e-s-pearsons-statistical-philosophy/

*(Perhaps this is the evidence Senn has in mind.) Merely maximizing power, defined in the crude way we sometimes see (e.g., average power taken over mixtures, as in Cox’s (and Birnbaum’s) famous examples) can lead to faulty assessments of inferential warrant, but then, I never use pre-data power as an assessment of severity associated with inferences.*

*While power isn’t necessary “for using likelihood as a basis for thinking about inference” nor for using other distance measures (at Step 2), reports of observed likelihoods and comparative likelihoods are inadequate for inference and error probability control. Hence, Pearson’s Step 3.*

Does the issue Senn raises on power really play an important role in his position on bioequivalence tests? I’m not sure. I look forward to hearing from readers.

Filed under: bioequivalence, frequentist/Bayesian, PhilPharma Tagged: R. Berger, S. Senn

**Stephen Senn**
Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**Responder despondency: myths of personalized medicine**

The road to drug development destruction is paved with good intentions. The 2013 FDA report, *Paving the Way for Personalized Medicine*, has an encouraging and enthusiastic foreword from Commissioner Hamburg and plenty of extremely interesting examples stretching back decades. Given what the report shows can be achieved on occasion, given the enthusiasm of the FDA and its commissioner, and given the amazing progress in genetics emerging from the labs, a golden future of personalized medicine surely awaits us. It would be churlish to spoil the party by sounding a note of caution, but I have never shirked being churlish and that is exactly what I am going to do.

Reading the report, alarm bells began to ring when I came across this chart (p17) describing the percentage of patients for whom drugs are ineffective. Actually, I tell a lie. The alarm bells were ringing as soon as I saw the title, but by the time I saw this chart the cacophony was deafening.

The question that immediately arose in my mind was ‘how do the FDA know this is true?’ Well, the Agency very helpfully tells you how they know this is true. They cite a publication, ‘Clinical application of pharmacogenetics’[1], as the source of the chart. Slightly surprisingly, the publication predates the FDA report by 12 years (pre-history in pharmacogenetic terms); however, sure enough, if you look up the cited paper you will find that the authors (Spear et al) state ‘We have analyzed the efficacy of major drugs in several important diseases based on published data, and the summary of the information is given in Table 1.’ This is Table 1:

Now, there are a few differences here to the FDA report but we have to give the Agency some credit. First of all they have decided to concentrate on those who don’t respond, so they have subtracted the response rates from 100. Second, they have obviously learned an important data presentation lesson: sorting by the alphabet is often inferior to sorting by importance. Unfortunately, they have ignored an important lesson that texts on graphical excellence impart: don’t clutter your presentation with chart junk[2]. However, in the words of Meatloaf, ‘Two out of three ain’t bad,’ so I have to give them some credit.

However, that’s not quite the end of the story. Note the superscripted 1 in the rubric of the source for the FDA claim. That’s rather important. This gives you the source of the information, which is the *Physician’s Desk Reference*, 54^{th} edition, 2000.

At this point of tracing back, I discovered what I knew already. What the FDA is quoting are zombie statistics. This is not to impugn the work of Spear et al. The paper makes interesting points. (I can’t even blame them for not citing one of my favourite papers[3], since it appeared in the same year.) They may well have worked diligently to collect the data they did but the trail runs cold here. The methodology is not given and the results can’t be checked. It may be true, it may be false but nobody, and that includes the FDA and its commissioner, knows.

But there is a further problem. There is a very obvious trap in using *observed* response rates to judge what percentage of patients respond (or don’t): all such measures are subject to within-patient variability. To take a field I have worked in, asthma: if you take (as the FDA has on occasion) a 15% increase in Forced Expiratory Volume in one second (FEV_{1}) above baseline as indicating a response, you will classify someone with a 14% value as a non-responder and someone with a 16% value as a responder, but measure them again and they could easily change places (see chapter 8 of *Statistical Issues in Drug Development*[4]). For a bronchodilator I worked on, mean bronchodilation at 12 hours was about 18%, so you simply needed to base your measurement of effect on a number of replicates if you wanted to increase the proportion of responders.
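The FEV_{1} point is easy to simulate. In the sketch below the within-patient standard deviation is invented for illustration; every patient has the identical true effect of 18%, yet the dichotomy at 15% manufactures "non-responders", and averaging replicate measurements raises the apparent responder proportion:

```python
import random

random.seed(1)
TRUE_EFFECT = 18.0   # mean % bronchodilation, identical for every patient
WITHIN_SD = 8.0      # within-patient variability (invented for illustration)
CUTOFF = 15.0        # % FEV1 increase counted as a "response"

def classified_responder(n_replicates):
    """Classify one patient from the mean of n replicate measurements."""
    total = sum(random.gauss(TRUE_EFFECT, WITHIN_SD)
                for _ in range(n_replicates))
    return total / n_replicates >= CUTOFF

for reps in (1, 4, 16):
    rate = sum(classified_responder(reps) for _ in range(100_000)) / 100_000
    print(f"{reps:>2} replicate(s): {rate:.0%} labelled responders")
```

Every simulated patient responds identically, yet with a single measurement roughly a third are labelled non-responders; with 16 replicates the responder proportion climbs above 90%. The "non-response" is measurement noise, not biology.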

There is a very obvious trap (or at least it ought to be obvious to all statisticians) in naively using reported response rates as an indicator of variation in true response[5]. This can be illustrated using the graph below. On the left hand side you see an ideal counterfactual experiment. Every patient can be treated under identical conditions with both treatments. In this thought experiment the difference that the treatment makes to each patient is constant. However, life does not afford us this possibility. If what we choose to do is run a parallel group trial we will have to randomly give the patient either placebo or the active treatment. The right hand panel shows us what we will see and is obtained by randomly erasing one of the two points for each patient on the left hand panel. It is now impossible to judge individual response: all that we can judge is the average.

Of course, I fixed things in the example so that response was constant, and it clearly might not be. But that is not the point. The point is that the diagram shows that by naively using raw outcomes we will overestimate the personal element of response. In fact, only repeated cross-over trials can reliably tease out individual response from other components of variation; in many indications these are not possible, and even where they are possible they are rarely run[6].

So to sum up, the reason the FDA ‘knows’ that 40% of asthmatic patients don’t respond to treatment is because a paper from 2001, with unspecified methodology, most probably failing to account for within patient variability, reports that the authors found this to be the case by studying the *Physician’s Desk Reference*.

This is nothing short of a scandal. I don’t blame the FDA. I blame me and my fellow statisticians. Why and how are we allowing our life scientist colleagues to get away with this nonsense? *They* genuinely believe it. *We* ought to know better.

**References**

1. Spear, B.B., M. Heath-Chiozzi, and J. Huff, *Clinical application of pharmacogenetics.* Trends in Molecular Medicine, 2001. **7**(5): p. 201-204.
2. Tufte, E.R., *The Visual Display of Quantitative Information*. 1983, Cheshire, Connecticut: Graphics Press.
3. Senn, S.J., *Individual Therapy: New Dawn or False Dawn.* Drug Information Journal, 2001. **35**(4): p. 1479-1494.
4. Senn, S.J., *Statistical Issues in Drug Development*. 2007, Hoboken: Wiley. 498.
5. Senn, S., *Individual response to treatment: is it a valid assumption?* BMJ, 2004. **329**(7472): p. 966-8.
6. Senn, S.J., *Three things every medical writer should know about statistics.* The Write Stuff, 2009. **18**(3): p. 159-162.

Filed under: evidence-based policy, Statistics, Stephen Senn

Since the comments to my previous post are getting too long, I’m reblogging it here to make more room. I say that the issue raised by J. Berger and Sellke (1987) and Casella and R. Berger (1987) concerns evaluating the evidence in relation to a given hypothesis (using error probabilities). Given the information that *this* hypothesis H* was randomly selected from an urn with 99% true hypothesis, we wouldn’t say this gives a great deal of evidence for the truth of H*, nor suppose that H* had thereby been well-tested. (H* might concern the existence of a standard model-like Higgs.) I think the issues about “science-wise error rates” and long-run performance in dichotomous, diagnostic screening should be taken up separately, but commentators can continue on this, if they wish (perhaps see this related post).

**0. July 20, 2014:** Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

**1. What you should ask…**

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H_{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior to the null?** (See Section 3.)

^^^^^^^^^^^^^^^

**2. J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0} .

“If n = 50…, one can classically ‘reject H_{0} at significance level p = .05,’ although Pr(H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).

If *n* = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null going from .5 to .82!
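Both figures can be reproduced under one standard prior choice in this setting: a point mass π_{0} = 1/2 on H_{0} and a N(μ_{0}, σ²) slab over the alternative. (That specific slab is my assumption for the sketch; Berger and Sellke consider several classes of priors.)

```python
import math

def posterior_null(z, n, pi0=0.5):
    """Pr(H0 | x) for the two-sided Normal test, with prior mass pi0 on
    the point null and a N(mu0, sigma^2) slab over the alternative
    (one standard choice; Berger and Sellke consider several priors).
    z = sqrt(n) * (xbar - mu0) / sigma is the usual test statistic."""
    # Bayes factor in favour of H0 against the normal-slab alternative
    bf01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (n + 1))
    return pi0 * bf01 / (pi0 * bf01 + 1 - pi0)

print(round(posterior_null(1.96, 50), 2))    # 0.52
print(round(posterior_null(1.96, 1000), 2))  # 0.82
```

Holding z fixed at 1.96 (P = .05 two-sided) while n grows, the factor √(1 + n) drives the posterior on H_{0} toward 1, which is the Jeffreys-Good-Lindley phenomenon in miniature.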

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}**, **the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0 }as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

^^^^^^^^^^^^^^^^^

**3. A Dialogue** (ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points out that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|x) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (Some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle. **

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become ‘the’ measure of evidence that we should aim to match?”**

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size: **

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+:

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0. Let σ = 1, so σ_{x̄} = (σ/√n).

*For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:*

m_{0} = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of form: * μ > 0 + γ.*

*Some benchmarks:*

** *** The lower bound of a 50% confidence interval is** **2(σ/√*n*). *So there’s quite lousy evidence that μ > *2

** ***The lower bound of the 93% confidence interval is** **.5(σ/√*n*). *So there’s decent evidence that μ > *.5

** ***For *n* = 100,* σ/√n* = .1 (*σ= 1); f*or *n* = 1600,* σ/√n* = .025

* *Therefore, a .025 stat sig result is fairly good evidence that μ > *.05, when

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.

^^^^^^^^^^

**6.**** “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

*7. July 20, 2014: *** There is a distinct issue here…**.That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection“.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

**References **(minimalist) A number of additional links are given in comments to my previous post

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p *values and evidence,” (with discussion). *J. Amer. Statist. Assoc. ***82: **112–139.

Cassella G. and Berger, R.. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc. ***82 **106–111, 123–139.

*Blog posts:*

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.

Filed under: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics ]]>

** 0. July 20, 2014: **Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

**1. What you should ask…**

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, which denies that the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H_{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior to the null?** (See Section 3.)

^^^^^^^^^^^^^^^

**2. J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0} .

“If n = 50…, one can classically ‘reject H_{0} at significance level p = .05,’ although Pr(H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113)

If *n* = 1000, a result statistically significant at the .05 level leads to a posterior on the null that goes from a prior of .5 to a posterior of .82!
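These numbers can be reproduced under one of the setups Berger and Sellke consider: a point mass π_{0} = 1/2 on H_{0}, with a N(μ_{0}, σ²) prior on μ under the alternative. A minimal sketch (the Bayes factor formula below follows from the two marginal likelihoods under that normal-prior choice; the function name is mine):

```python
from math import sqrt, exp

def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of H0: mu = mu0, given a z-score just
    significant at the two-sided .05 level, assuming a point mass
    prior_null on H0 and a N(mu0, sigma^2) prior on mu under H1
    (one of the priors considered in Berger and Sellke 1987)."""
    # Bayes factor for H0: ratio of the N(0, sigma^2/n) density to the
    # N(0, sigma^2 (1 + 1/n)) marginal density, evaluated at the data.
    bf = sqrt(1 + n) * exp(-(z**2 / 2) * (n / (n + 1)))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = bf * prior_odds
    return post_odds / (1 + post_odds)

print(round(posterior_null(1.96, 50), 2))    # 0.52, as quoted above
print(round(posterior_null(1.96, 1000), 2))  # 0.82
```

The same fixed z of 1.96 becomes stronger and stronger evidence *for* the null as n grows, which is exactly the conflict at issue.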

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}**, **the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0 }as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

^^^^^^^^^^^^^^^^^

**3. A Dialogue **(ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|x) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (Some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle. **

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again****?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?” **

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size: **

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0.

Let σ = 1, n = 25, so σ_{x̄} = (σ/√n).

*For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard deviation cut-off:*

m_{0} = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of the form: *μ > 0 + γ.*

*Some benchmarks:*

** ***The lower bound of a 50% confidence interval is 2(σ/√*n*). *So there’s quite lousy evidence that μ > 2(σ/√n).*

** ***The lower bound of the 93% confidence interval is .5(σ/√*n*). *So there’s decent evidence that μ > .5(σ/√n).*

** ***For *n* = 100, σ/√*n* = .1 (σ = 1); for *n* = 1600, σ/√*n* = .025.

* *Therefore, a .025 stat sig result is fairly good evidence that μ > .05 when *n* = 100; with *n* = 1600, it is decent evidence only that μ > .0125.

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
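The arithmetic behind these benchmarks is easy to verify. A minimal sketch (the function name is mine; z = 1.5 corresponds to the roughly-93% one-sided lower bound used above, z = 0 to the 50% bound):

```python
from math import sqrt

def lower_bound(n, sigma=1.0, z=1.5):
    """One-sided confidence lower bound for mu when the observed mean
    sits exactly at the 2-standard-deviation cutoff m0 = 2*sigma/sqrt(n).
    z = 1.5 gives roughly the 93% bound; z = 0 gives the 50% bound."""
    se = sigma / sqrt(n)
    m0 = 2 * se  # just significant at the .025 level (1-sided)
    return m0 - z * se

print(round(lower_bound(100), 4))        # 0.05: decent evidence mu > .05
print(round(lower_bound(1600), 4))       # 0.0125: only that mu > .0125
print(round(lower_bound(100, z=0), 4))   # 0.2: the lousy 50% bound, 2(sigma/sqrt(n))
```

As n grows with the P-value fixed, the warranted lower bound shrinks toward 0, which is the whole point of the benchmarks.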

^^^^^^^^^^

**6.**** “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

*7. July 20, 2014: *** There is a distinct issue here…** That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another familiar fallacy, such as the fallacy of transposing the conditional (see Section 0).

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection”.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

**References **(minimalist)

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 112–139.

Casella, G. and Berger, R. L. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.

*Blog posts:*

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.


Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

**“Higgs Analysis and Statistical Flukes: part 2″**

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H_{0}: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

*Error probabilities*

In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(**X**) > c_{α}; H_{0}) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(**X**) > 5 standard deviations; H_{0}) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, *p*_{0}. In general,

Pr(P < p_{0}; H_{0}) < p_{0}

and in particular,

(2) Pr(Test T yields P < .0000003; H_{0}) < .0000003.

For test T to yield a “worse fit” with *H*_{0 }(smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
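The tail areas in (1) and (2) follow directly from the standard normal distribution; a quick check (the erfc form is just one standard way to compute the one-sided upper tail; the function name is mine):

```python
from math import erfc, sqrt

def one_sided_p(sigmas):
    """One-sided upper-tail p-value for an excess of `sigmas`
    standard deviations under the background-only hypothesis."""
    return 0.5 * erfc(sigmas / sqrt(2))

print(one_sided_p(5))  # ~2.87e-07, i.e. at most the .0000003 in (1) and (2)
print(one_sided_p(7))  # ~1.3e-12, for the 7 sigma March 2013 results
```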

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of the test statistic *d*(**X**), or the P-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form is to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed, under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between *H*_{0} and the probabilities of outcomes is an intimate one: the various statistical nulls refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “*H*_{0} is true” is a shorthand for a very long statement that H_{0} gives an adequate description of the data generating procedure with respect to the question at hand.

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to *H*_{0}.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why bootstrap, and other types of, resampling works when one has a random sample from the process or population of interest.)

The *severity principle*, put more generally:

Data from a test T [ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.)* *In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.
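In the simple normal model above, the severity assessment for inferring a genuine (non-null) effect is a one-line computation: the probability that the test would *not* have produced so large an excess, were the background-only hypothesis adequate. A sketch (the function name is mine, and this is only the weakest, non-null inference, not the full Higgs claim):

```python
from math import erfc, sqrt

def severity_nonnull(observed_sigmas):
    """Severity for inferring a genuine (non-null) effect: the probability
    that the test would NOT have yielded so large an excess, were the
    background-only hypothesis H0 (mu = 0) an adequate description."""
    # 1 minus the one-sided upper-tail probability at the observed excess
    return 1 - 0.5 * erfc(observed_sigmas / sqrt(2))

print(severity_nonnull(5))  # ~0.9999997: SEV for the non-null is ~1
```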

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to *H*_{0}. Worse, (1 − the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to *H*_{0}, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just the hypotheses under test, but all possible rivals to the Standard Model itself.

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particle physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that H_{0}: background alone adequately describes the process.

H_{0} does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H_{0}”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < p_{0}}. Even when H_{0} is true, such “signal like” outcomes may occur. They are p_{0}-level flukes. Were such flukes generated even with moderate frequency under H_{0}, they would not be evidence against H_{0}. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from H_{0}.
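That p_{0}-level flukes occur with frequency at most p_{0} under the null is easy to check by simulation; a minimal sketch under the simple normal model (the sample count and seed are arbitrary choices of mine):

```python
import random
from math import erfc, sqrt

random.seed(1)
trials = 100_000
p0 = 0.05

# Draw test statistics under H0 (standard normal) and record how often the
# one-sided p-value falls below p0: it should happen about p0 of the time.
flukes = 0
for _ in range(trials):
    z = random.gauss(0.0, 1.0)
    p = 0.5 * erfc(z / sqrt(2))  # one-sided upper-tail p-value
    if p < p0:
        flukes += 1

print(flukes / trials)  # close to 0.05
```

With a 5-sigma threshold in place of .05, flukes would be so rare that regularly generating them points to a genuine discrepancy, which is the SEV reasoning above.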

I am repeating myself, I realize, in the hopes that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this, it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H_{0} as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that burgundy is new; black is old. If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2 http://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable:* “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, *Design of Experiments*, 1947, p. 14)

New Notes

[1] I plan to do some new work in this arena soon, so I’ll be glad to have comments.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323–357.

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis http://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

Filed under: Higgs, highly probable vs highly probed, P-values, Severity, Statistics

July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”)

Then, as details of the statistical analysis trickled down to the media, the P-value police (Wasserman, see (2)) came out in full force to examine if reports by journalists and scientists could in any way or stretch of the imagination be seen to have misinterpreted the sigma levels as posterior probability assignments to the various models and claims. The HEP (High Energy Physics) community had been painstaking in their communication of the results, but the P-bashers insisted on transforming the intended conditional….(I’ll come back to this.)

As for the HEP researchers, a central interest now is to explore any and all leads in the data that would point to physics beyond the Standard Model (BSM). The Higgs is just turning out to be too “perfectly plain vanilla,” and they’ve been unable to reject an SM null for years (3) (more on this later). So on this two-year anniversary, I’ll reblog a few of the Higgs posts, with some updated remarks—beginning with the first one below.

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i]. Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson. It is asked: “Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

*Bad science?* I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson. The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson. Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level. Five standard deviations, assuming normality, means a p-value of around 0.0000005. A number of questions spring to mind.

1. Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?

2. Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?

3. We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?

If anyone has any answers to these or related questions, I’d be interested to know and will be sure to pass them on to Dennis.

Regards,

Tony

—-

Professor A O’Hagan

Email: a.ohagan@sheffield.ac.uk

Department of Probability and Statistics

University of Sheffield
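An aside on the arithmetic in the letter: under normality, the tail area beyond five standard deviations is easy to check directly. The quick sketch below (my own, using only the Python standard library, not anything from the letter) gives about 2.9 × 10⁻⁷ one-sided and roughly 6 × 10⁻⁷ two-sided, consistent with the letter’s “around 0.0000005”:

```python
from math import erf, sqrt

def normal_sf(z):
    """Survival function of the standard Normal: 1 - Phi(z)."""
    return 0.5 * (1 - erf(z / sqrt(2)))

p_one_sided = normal_sf(5)       # tail area beyond +5 sigma
p_two_sided = 2 * normal_sf(5)   # both tails beyond |5 sigma|
print(f"one-sided p for 5 sigma: {p_one_sided:.2e}")
print(f"two-sided p for 5 sigma: {p_two_sided:.2e}")
```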

So given that the Higgs boson does not have such an extremely small prior probability, a proper Bayesian analysis would have enabled evidence of the Higgs long before attaining such an “extreme evidence requirement”. Why has no one tried to explain to these scientists how, with just a little Bayesian analysis, they might have been done ~~in~~ last year or years ago? I take it the Bayesian would also enjoy the simplicity and freedom of not having to adjust for “the Look Elsewhere Effect” (LEE) [ii].

Let’s see if there’s a serious follow-up.[iii]

[i] bringing it down from my “Msc Kvetching page” where I’d put it last night.

[ii] For a discussion of how the error statistical philosophy avoids the classic criticisms of significance tests, see Mayo & Spanos (2011) ERROR STATISTICS. Other articles may be found on the link to my publication page.

[iii] O’Hagan informed me of several replies to his letter at the following: http://bayesian.org/forums/news/3648

*****************************************************

(1) There’s scarce need for my “Rejected Posts” blog now that renegade thoughts can go on “twitter” (@learnfromerror), but I’ll keep it around for later.

(2) The Higgs Boson and the p-value Police: http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/

(3) The logic in this case is especially interesting. Each failure to reject a null of this type informs about the variant of BSM ruled out. (I’ll check with Robert Cousins that I’ve put this correctly. Update: He says that I have.) Here’s a link to Cousins’ recent paper on the Higgs and foundations of statistics: http://arxiv.org/abs/1310.3791.

Filed under: Bayesian/frequentist, fallacy of non-significance, Higgs, Lindley, Statistics Tagged: comedy, Dennis V. Lindley, Higgs boson, p-value vs posterior, particle physics, significance tests

**Winner of June 2014 Palindrome Contest: First Second-Time* Winner!**

*Her April win is here.

**Palindrome:**

**Parsec? I overfit omen as Elba sung “I err on! Oh, honor reign!” Usable, sane motif revoices rap.**

**The requirement:** A palindrome with Elba plus overfit. (The optional second word: “average” was not needed to win.)

**Bio:**

Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

**Statement**:

I’m thrilled to be a second-time winner of the palindrome contest and my love of book collecting overrides any guilty feelings I may have about winning twice! Here’s a fun picture of me in the midst of polygonal fracturing from my June escapades. Sadly, I don’t think I can work “polygonal” into a palindrome.

I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge, due to the extreme formal constraint of exact letter reversal–which is made even more fun in a contest like this where one has to include specific words in the palindrome. I also enjoy writing palindromes about specific themes (Poe’s Raven, Oedipus Rex, Verdi’s Aida) and I have plans to write a very long palindrome about Proust one of these days.

**Book Choice**:

*Dicing with Death: Chance, Risk and Health* (Stephen Senn 2003, Cambridge: Cambridge University Press)

Filed under: Announcement, Palindrome

The article in the Chronicle of Higher Education also gets credit for its title: “Replication Crisis in Psychology Research Turns Ugly and Odd”. I’ll likely write this in installments… (2nd, 3rd, 4th)

^^^^^^^^^^^^^^^

The Guardian article answers yes to the question “Do ‘hard’ sciences hold the solution…”:

Psychology is evolving faster than ever. For decades now, many areas in psychology have relied on what academics call “questionable research practices” – a comfortable euphemism for types of malpractice that distort science but which fall short of the blackest of frauds, fabricating data.

But now a new generation of psychologists is fed up with this game. Questionable research practices aren’t just being seen as questionable – they are being increasingly recognised for what they are: soft fraud. In fact, “soft” may be an understatement. What would your neighbours say if you told them you got published in a prestigious academic journal because you cherry-picked your results to tell a neat story? How would they feel if you admitted that you refused to share your data with other researchers out of fear they might use it to undermine your conclusions? Would your neighbours still see you as an honest scientist – a person whose research and salary deserves to be funded by their taxes?

For the first time in history, we are seeing a co-ordinated effort to make psychology more robust, repeatable, and transparent.

“Soft fraud”? (Is this like “white collar” fraud?) Is it possible that holding social psych up as a genuine replicable science is, ironically, creating soft frauds too readily?

*Or would it be all to the good if the result is to so label large portions of the (non-trivial) results of social psychology?*

The sentiment in the Guardian article is that the replication program in psych is just doing what is taken for granted in other sciences; it shows psych is maturing, it’s getting *better and better all the time* …so long as the replication movement continues. Yes? [0]

^^^^^^^^

It’s hard to entirely dismiss the concerns of the pushback, dubbed in some quarters as “Repligate”. Even in this contrarian mode, you might sympathize with “those who fear that psychology’s growing replication movement, which aims to challenge what some critics see as a tsunami of suspicious science, is more destructive than corrective” (e.g., Professor Wilson, at U Va) while at the same time rejecting their dismissal of the seriousness of the problem of false positives in psych. The problem *is* serious, but there may be built-in obstacles to fixing things by the current route. From the Chronicle:

Still, Mr. Wilson was polite. Daniel Gilbert, less so. Mr. Gilbert, a professor of psychology at Harvard University, … wrote that certain so-called replicators are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” (he later took back the word “little,” writing that he didn’t know the size of the researchers involved).

Wow. Let’s read a bit more:

Scrutiny From the Replicators

What got Mr. Gilbert so incensed was the treatment of Simone Schnall, a senior lecturer at the University of Cambridge, whose 2008 paper on cleanliness and morality was selected for replication in a special issue of the journal *Social Psychology*. … In one experiment, Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental….These studies fit into a relatively new field known as embodied cognition, which examines how one’s environment and body affect one’s feelings and thoughts. …

For instance, political extremists might literally be less capable of discerning shades of grey than political moderates—or so Matt Motyl thought until his results disappeared. Now he works actively in the replication movement.[1]

Links are here.

7/1: By the way, since Schnall’s research was testing “embodied cognition” why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?

^^^^^^^^^^

Another irony enters: some of the people working on the replication project in social psych are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure, to incentives to publish surprising and sexy studies, and to an overly flexible methodology opening the door to promiscuous QRPs (you know: Questionable Research Practices.) Call this the “rewards and flexibility” hypothesis. If the rewards/flex hypothesis is correct, as is quite plausible, then wouldn’t it follow that the same incentives are operative in the new psych replication movement? [2]

A skeptic of the movement in psychology could well ask: how can the replication be judged sounder than the original studies? When RCTs fail to replicate observational studies, the presumption is that the RCTs would have found the effect, were it genuine. That’s why it’s taken as an indictment of the observational study. But here, one could argue, it’s just another study, not obviously one that *corrects* the earlier. The question some have asked, “Who will replicate the replicators?” is not entirely without merit. Triangulation for purposes of correction, I say, is what’s really needed. [3]

Daniel Kahneman, who first called for the “daisy chain” (after the Stapel scandal), likely hadn’t anticipated the tsunami he was about to unleash.[4]

Daniel Kahneman, a Nobel Prize winner who has tried to serve as a sort of a peace broker, recently offered some rules of the road for replications, including keeping a record of the correspondence between the original researcher and the replicator, as was done in the Schnall case. Mr. Kahneman argues that such a procedure is important because there is “a lot of passion and a lot of ego in scientists’ lives, reputations matter, and feelings are easily bruised.”

That’s undoubtedly true, and taking glee in someone else’s apparent misstep is unseemly. Yet no amount of politeness is going to soften the revelation that a published, publicized finding is bogus. Feelings may very well get bruised, reputations tarnished, careers trashed. That’s a shame, but while being nice is important, so is being right.

Is the replication movement getting psych closer to “being right”? That is the question. What if inferences from priming studies and “embodied cognition” really *are* questionable? What if the hypothesized effects are incapable of being turned into replicable science?

^^^^^^^^^

The sentiment voiced in the Guardian bristles at the thought; there is pushback even to Kahneman’s apparently civil “rules of the road”:

For many psychologists, the reputational damage [from a failed replication]… is grave – so grave that they believe we should limit the freedom of researchers to pursue replications. In a recent open letter, Nobel laureate Daniel Kahneman called for a new rule in which replication attempts should be “prohibited” unless the researchers conducting the replication consult beforehand with the authors of the original work. Kahneman says, “Authors, whose work and reputation are at stake, should have the right to participate as advisers in the replication of their research.” Why? Because method sections published by psychology journals are generally too vague to provide a recipe that can be repeated by others. Kahneman argues that successfully reproducing original effects could depend on seemingly irrelevant factors – hidden secrets that only the original authors would know. “For example, experimental instructions are commonly paraphrased in the methods section, although their wording and even the font in which they are printed are known to be significant.”

“Hidden secrets”? This was a remark sure to enrage those who take psych measurements as (at least potentially) akin to measuring the Hubble constant:

If this doesn’t sound very scientific to you, you’re not alone. For many psychologists, Kahneman’s cure is worse than the disease. Dr Andrew Wilson from Leeds Metropolitan University points out that if the problem with replication in psychology is vague method sections then the logical solution – not surprisingly – is to publish detailed method sections. In a lively response to Kahneman, Wilson rejects the suggestion of new regulations: “If you can’t stand the replication heat, get out of the empirical kitchen because publishing your work means you think it’s ready for prime time, and if other people can’t make it work based on your published methods then that’s your problem and not theirs.”

Prime time for priming research in social psych?

Read the rest of the Guardian article. Second installment later on…maybe….

**What do readers think?**

^^^^^^^^^^^^^^

Naturally the issues that interest me the most are statistical-methodological. Some of the methodology and meta-methodology of the replication effort is apparently being developed hand-in-hand with the effort itself—that makes it all the more interesting, while also potentially risky.

The replicationist’s question of methodology, as I understand it, is alleged to be what we might call “purely statistical”. It is not: would the initial positive results warrant the psychological hypothesis, were the statistics unproblematic? The presumption from the start was that the answer to this question is yes. In the case of the controversial Schnall study, the question wasn’t: can the hypotheses about cleanliness and morality be well tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue. The question is limited to: do we get the statistically significant effect in a replication of the initial study, presumably one with high power to detect the effects at issue? So, for the moment, I too will retain that as the sole issue around which the replication attempts revolve.

Checking statistical assumptions is, of course, a part of the pure statistics question, since the P-value and other measures depend on assumptions being met at least approximately.

The replication team assigned to Schnall (U of Cambridge) reported results apparently inconsistent with the positive ones she had obtained. Schnall shares her experiences in “Further Thoughts on Replications, Ceiling Effects and Bullying” and “The Replication Authors’ Rejoinder”: http://www.psychol.cam.ac.uk/cece/blog

The replication authors responded to my commentary in a rejoinder. It is entitled “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” In it, they accuse me of “criticizing after the results are known,” or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of “increasing the credibility of published results” interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers. (Schnall)

Perhaps her criticisms are off the mark, and in no way discount the failed replication (I haven’t read them), but CARKing? Data and model checking are intended to take place post-data, so the post-data aspect of a critique scarcely renders it illicit. The statistical fraud-busting of a Smeesters or a Jens Förster was based entirely on post-data criticisms. So it would be *ironic* if, in the midst of defending efforts to promote scientific credentials, they inadvertently labeled post-data criticisms as questionable.

^^^^^^^^^^^^^^^^^^^^^^^^^^^

Uri Simonsohn [5] at “Data Colada” discusses, specifically, the objections raised by Simone Schnall (2nd installment), and the responses by the authors who failed to replicate her work: Brent Donnellan, Felix Cheung and David Johnson.

Simonsohn does not reject out of hand Schnall’s allegation that the lack of replication is explained away (e.g., by a “ceiling effect”). (In fact, he has elsewhere discussed a case that was rightfully absolved thereby [6].) Simonsohn provides statistical grounds for denying that a ceiling effect is to blame in Schnall’s case. However, he also agrees with Schnall in discounting the replicators’ response to the ceiling-effect charge, namely, simply lopping off the most extreme results.

In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.

I don’t think that’s right either. (Data Colada)

Since the replicators here bear the burden of proof, the statistical problems with their *ad hoc* retort to Schnall are grounds for concern, or should be.

http://datacolada.org/2014/06/04/23-ceiling-effects-and-replications/

What follows from this? What follows is that the analysis of the evidential import of failed replications in this field is an unsettled business. Despite the best of intentions of the new replicationists, there are grounds for questioning whether the meta-methodology is ready for the heavy burden being placed on it. I’m not saying that facets of the necessary methodology aren’t out there, but that the pieces haven’t been fully assembled ahead of time. Until they are, the basis for scrutinizing failed (and successful) replications will remain in flux.

^^^^^^^^^^

Final irony. If the replication researchers claim they haven’t caught on to any of the problems or paradoxes I have intimated for their enterprise, let me end with one more… No, I’ll save it for installment 4.

^^^^^^^^^^

Statistical significance testers in psychology (and other areas) often maintain there is no information, or no proper inference, to be obtained from statistically insignificant (negative) results. This, despite power analyst Jacob Cohen toiling amongst them for years. Maybe they’ve been misled by their own constructed animal, the so-called NHST (no need to look it up, if you don’t already know).

*The irony is that much replication analysis turns on interpreting non statistically significant results!*

One of my first blogposts talks about interpreting negative results, and I’ve been publishing on this for donkey’s years [7]. Here are some posts for your Saturday night reading:

http://errorstatistics.com/2011/11/09/neymans-nursery-2-power-and-severity-continuation-of-oct-22-post/

Some numerical examples:
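To give a flavor of the kind of numerical example involved, here is a minimal sketch (my own construction, along the lines of the severity assessments in Mayo and Spanos 2006) for interpreting a statistically insignificant result in a one-sided Normal test of H0: μ ≤ 0 vs H1: μ > 0 with σ known:

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_of_negative(xbar, mu1, sigma, n, mu0=0.0):
    """Severity for the claim 'mu <= mu1' after a non-significant result:
    the probability of observing a sample mean larger than the one actually
    observed, were mu equal to mu1."""
    se = sigma / sqrt(n)
    return phi((mu1 - xbar) / se)

# Illustration: n = 100, sigma = 10 (so SE = 1), observed xbar = 1.0
# (z = 1.0, not significant at the .025 level)
for mu1 in (1.0, 2.0, 3.0):
    print(f"SEV(mu <= {mu1}) = {severity_of_negative(1.0, mu1, 10, 100):.3f}")
```

With x̄ = 1.0 and SE = 1, the claim μ ≤ 2 passes with severity ≈ 0.84 and μ ≤ 3 with ≈ 0.98: the insignificant result warrants ruling out just those discrepancies the test had a high capability of detecting, which is exactly the information the “no inference from negative results” view throws away.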

^^^^^^

[0] Unsurprisingly, replicationistas in psych are finding well-known results from experimental psych to be replicable. Interestingly, similar results are found in experimental economics, dubbed “experimental exhibits”. Expereconomists recognize that rival interpretations of the exhibits are still open to debate.

[1] In Nuzzo’s article: “For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white”.

(Glory, I tell you!)

[2] Some of the results are now published in Social Psychology. Perhaps it was not such an exaggeration to suggest, in an earlier post, that “non-significant results are the new significant results”. At the time I didn’t know the details of the replication project; I was just reacting to graduate students presenting this as the basis for a philosophical position, when philosophers should have been performing a stringent methodological critique.

[3] By contrast, statistical fraudbusting and statistical forensics have some rigorous standards that are hard to evade, e.g., recently Jens Forster.

[4] In Kahneman’s initial call (Oct, 2012) “He suggested setting up a ‘daisy chain’ of replication, in which each lab would propose a priming study that another lab would attempt to replicate. Moreover, he wanted labs to select work they considered to be robust, and to have the lab that performed the original study help the replicating lab vet its procedure.”

[5] Simonsohn is always churning out the most intriguing and important statistical analyses in social psychology. The field needs more like him.

[6] For an excellent discussion of a case that *is* absolved from non-replication by appealing to the ceiling effect see http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/.

[7] e.g., Mayo 1985, 1988, to see how we talked about statistics in risk assessment philosophy back then.

Filed under: junk science, science communication, Statistical fraudbusting, Statistics

One of the world’s leading economists, INET Oxford’s Prof. Sir David Hendry received a unique award from the Economic and Social Research Council (ESRC)…

Commenting on the award, Torbjørn Hægeland, Director of Research at Statistics Norway, said: ‘Professor David Hendry’s contributions have exerted a great influence on the way we do practical econometric work. In particular, the automatic model selection programme, Autometrics, is used extensively to guide improved empirical modelling, especially when there are structural shifts, avoiding wasted time on incorrect formulations so our economists can focus on analysis and specification.’

You can read about the award here.

Sir David Hendry’s contribution to RMM Vol. 2, 2011: “Empirical Economic Model Discovery and Theory Evaluation”

**Abstract:**
*Economies are so high dimensional and non-constant that many features of models cannot be derived by prior reasoning, intrinsically involving empirical discovery and requiring theory evaluation. Despite important differences, discovery and evaluation in economics are similar to those of science. Fitting a pre-specified equation limits discovery, but automatic methods can formulate much more general initial models with many possible variables, long lag lengths and non-linearities, allowing for outliers, data contamination, and parameter shifts; then select congruent parsimonious-encompassing models even with more candidate variables than observations, while embedding the theory; finally rigorously evaluate selected models to ascertain their viability.*

http://www.rmm-journal.de/downloads/Article_Hendry.pdf
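The “general-to-specific” idea in the abstract can be given a flavor with a toy sketch: start from a deliberately over-general linear model and eliminate the least significant regressor until all retained ones are significant. This is my illustration only; Autometrics itself is far more elaborate (multi-path tree search, diagnostic testing, impulse-indicator saturation), none of which this captures:

```python
import numpy as np

def gets_select(X, y, names, t_crit=2.0):
    """Toy general-to-specific selection: starting from the general model with
    all candidate regressors, repeatedly drop the regressor with the smallest
    |t|-ratio until every retained regressor has |t| >= t_crit."""
    keep = list(range(X.shape[1]))
    while keep:
        Xk = X[:, keep]
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        dof = len(y) - len(keep)
        s2 = resid @ resid / dof                      # residual variance
        cov = s2 * np.linalg.inv(Xk.T @ Xk)           # OLS covariance matrix
        t = beta / np.sqrt(np.diag(cov))              # t-ratios
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_crit:
            break                                     # all retained terms significant
        keep.pop(worst)                               # eliminate weakest regressor
    return [names[i] for i in keep]

# Simulated example: five candidates, only x0 and x2 actually matter.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)
print(gets_select(X, y, ["x0", "x1", "x2", "x3", "x4"]))
```

The sketch deliberately omits what makes the real programme distinctive: it searches a single deletion path, so it can get stuck where Autometrics’ multi-path search and congruence (diagnostic) checks would not.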

His work grows out of a unique philosophical conception of the relationship between data and theory. [3]

*BOOKS*

(new) *Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics* (Arne Ryde Memorial Lectures)

*Hendry, D.F. and B. Nielsen (2007)*, Econometric Modeling: A Likelihood Approach. Princeton University Press.

*J. Campos, N.R. Ericsson and D.F. Hendry (2004)* **General to Specific Modelling**. Edward Elgar. Forthcoming.

*Clements, M.P. and D.F. Hendry (2002)*. **A Companion to Economic Forecasting**. Oxford: Blackwell Publishers. (ISBN 0631215697)

*Hendry, D.F. and N.R. Ericsson (2001)* **Understanding Economic Forecasts** Cambridge, Mass.: MIT Press.

*Doornik, J.A. and D.F. Hendry (2001)*. **Interactive Monte Carlo Experimentation in Econometrics Using PcNaive** London: Timberlake Consultants Press.

*Doornik, J.A. and D.F. Hendry (2001)*. **GiveWin: An Interface to Empirical Modelling** (2nd edition), London: Timberlake Consultants Press. (ISBN 0-9533394-3-2) (1st ed. 1996, 2nd ed. 1999)

*Hendry, D.F. and J.A. Doornik (2001)*. **Empirical Econometric Modelling Using PcGive** Volumes I, II and III London: Timberlake Consultants Press. (Vol I: 2nd ed. 1999, 1st ed. 1996; version 8: 1994, version 7: 1992) (Vol II: 2nd ed. 1999, 1st ed. 1997; version 8: 1994)

*D.F. Hendry and H-M. Krolzig (2001)*. **Automatic Econometric Model Selection** London: Timberlake Consultants Press.

*Hendry, D.F. (2001)* **Econometrics: Alchemy or Science?** 2nd Edition. Oxford: Oxford University Press. (ISBN 0-19-829354-2)

*W. A. Barnett, D. F. Hendry, S. Hylleberg, T. Teräsvirta, D. Tjøstheim, and A. Würtz (eds) (2000)*. **Nonlinear Econometric Modeling in Time Series**. Proceedings of the Eleventh International Symposium in Economic Theory Cambridge: Cambridge University Press.

*Clements, M.P. and D.F. Hendry (1999)*. **Forecasting Non-stationary Economic Time Series**. Cambridge, Mass.: MIT Press.

*Clements, M.P. and D.F. Hendry (1998)*. **Forecasting Economic Time Series**. Cambridge: Cambridge University Press. (ISBN 0-521-634806)

*Hendry, D.F. and M.S. Morgan (1995)*. **The Foundations of Econometric Analysis**. Cambridge: Cambridge University Press. (ISBN 0-521-38043-X)

Some general comments by Clark Glymour, in relation to Hendry’s paper, are below:

[1] Professor Hendry is Director of the Programme in Economic Modelling at the Institute for New Economic Thinking at the Oxford Martin School.

[2] Hendry, D. (2011) “Empirical Economic Model Discovery and Theory Evaluation”, in *Rationality, Markets and Morals*, Volume 2, Special Topic: *Statistical Science and Philosophy of Science,* (D. G. Mayo, A. Spanos & K. W. Staley (guest eds.)): 115-145.

[3] Hendry was Aris Spanos’ dissertation advisor at the LSE; their work has interconnected over the years.

Filed under: David Hendry, StatSci meets PhilSci Tagged: David Hendry

**May 2014**

(5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle

(5/3) You can only become coherent by ‘converting’ non-Bayesianly

(5/6) Winner of April Palindrome contest: Lori Wike

(5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)

(5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

(5/15) Scientism and Statisticism: a conference* (i)

(5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”

(5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop

(5/25) Blog Table of Contents: March and April 2014

(5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976

(5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

Filed under: blog contents, Metablog, Statistics

[The papers in this collection] give examples of problems which are well-suited to being tackled using such methods, but one must not lose sight of the merits of having multiple different strategies and tools in one’s inferential armory. (Hand [1])

…. But I have to ask, is the emphasis on ‘Bayesian’ necessary? That is, do we need further demonstrations aimed at promoting the merits of Bayesian methods? … The examples in this special issue were selected, firstly by the authors, who decided what to write about, and then, secondly, by the editors, in deciding the extent to which the articles conformed to their desiderata of being Bayesian success stories: that they ‘present actual data processing stories where a non-Bayesian solution would have failed or produced sub-optimal results.’ In a way I think this is unfortunate. I am certainly convinced of the power of Bayesian inference for tackling many problems, but the generality and power of the method is not really demonstrated by a collection specifically selected on the grounds that this approach works and others fail. To take just one example, choosing problems which would be difficult to attack using the Neyman-Pearson hypothesis testing strategy would not be a convincing demonstration of a weakness of that approach if those problems lay outside the class that that approach was designed to attack.

Hand goes on to make a philosophical assumption that might well be questioned by Bayesians:

One of the basic premises of science is that you must not select the data points which support your theory, discarding those which do not. In fact, on the contrary, one should test one’s theory by challenging it with tough problems or new observations. (This contrasts with political party rallies, where the candidates speak to a cheering audience of those who already support them.) So the fact that the articles in this collection provide wonderful stories illustrating the power of modern Bayesian methods is rather tarnished by the one-sidedness of the story.

This, of course, is the philosophical standpoint reflected in a severe or stringent testing philosophy, and it’s one that I heartily endorse. But it may be a mistake to assume it is universal: there’s an entirely distinct conception of confirmation as gathering data in order to support a position already held [2]. *I don’t mean this at all facetiously.* On the contrary, to suppose the editors of this issue share the testing conception is to implicitly suggest they are engaged in an exercise with questionable scientific standards (“tarnished by the one-sidedness of the story”). Recall my post on “who is allowed to cheat” and optional stopping with I.J. Good? It took some pondering for him to admit a different way of cashing out “allowed to cheat”. Likewise, wearing Bayesian glasses lets me take various Bayesian remarks as other than disingenuous. Hand goes on to offer a tantalizing suggestion:

Or perhaps, if one is going to have a collection of papers demonstrating the power of one particular inferential school, then, in the journalist spirit of balanced reporting, we should invite a series of similar issue containing articles which present actual data processing stories where a nonfrequentist / non-likelihood / non-[fill in your favourite school of inference] solution would have failed or produced sub-optimal results.

On the face of it, it sounds like a great idea! Sauce for the goose and all that…. David Hand is courageous for even suggesting it (deserving an *honorary mention*!), and he’d be an excellent editor of such an imaginary, parallel journal issue. [Share potential names. See [3]] But if X = “a frequentist” approach, it becomes clear, on further thought, that it actually wouldn’t make sense, and frequentists (or, as I prefer, error statisticians) wouldn’t wish to pursue such a thing. Besides, they wouldn’t be allowed – “frequentist” seems to be some kind of an “F” word in statistics these days – and anyway Bayesian accounts have the latitude to mimic any solution post hoc, if they so desire; if they didn’t concur with the solution, they’d merely deny the claims to superior performance (as sought by the editors of any such imaginary, parallel journal issue). [Yet, perhaps a good example of the kind of article that would work is Fraser’s quick and dirty confidence in a 2011 issue of the same journal.]

Christian Robert explains that the goal was for “a collection of six-page vignettes that describe real cases in which Bayesian analysis has been the only way to crack a really important problem.” Papers should address the question: “Why couldn’t it be solved by other means? What were the shortcomings of other statistical solutions?” I’m not sure what criteria the special editors employed to judge that Bayesian methods were required. According to one of the contributors (Stone) it means the problem required subjective priors. [See Note 4] (I’m a bit surprised at the choice of name for the special issue. Incidentally, the “big” refers to the bigness of the problem, not big data. Not sure about “stories”.)

Yet scientific methods are supposed to be interconnected, fostering both interchecking via multiple lines of evidence as well as building on diverse strategies. I just read of a promising new technique that would allow a blood test to detect infectious prions (as in mad cow disease) in living animals—a first. This will be both scrutinized and built upon by multiple approaches in current prion research. Seeing how the new prion test works, those using other methods will *want* to avail themselves of the new Mad Cow test. Saying Bayesianism is *required*, by contrast, doesn’t obviously suggest that non-Bayesians would wish to go there.

Aside: Robert begins his description of the special issue: “Bayesian statistics is now endemic in many areas of scientific, business and social research”, but does he really mean endemic? (See [5])

All in all, I think Hand gives a strong, generous, positive endorsement, interspersed with some caveats and hesitations:

When presented with fragmentary evidence, for example, one should proceed with caution. In such circumstances, the opportunity for undetected selection bias is considerable. Assumptions about the missing data mechanism may be untestable, perhaps even unnoticed. Data can be missing only in the context of a larger model, and one might not have any idea about what model might be suitable.

Caution is voiced by another discussant, A. H. Welsh:

Another reason a model may be difficult to fit is that it does not describe the data. Forcing it to “fit”, for example by switching to a Bayesian analysis, may not be the best response. It is difficult to check complicated models, particularly hierarchical models with latent variables, measurement error, missing data etc., but using an incorrect model may be a concern when the model proves difficult to fit.

Recall, in this connection, this post (on “When Bayesian Inference Shatters”.)

Do you know what would really have been impressive (in my judgement)? A special journal issue replete with articles identifying the most serious flaws, shortcomings, and problems in Bayesian applications; perhaps showing how non-Bayesian methods helped to pinpoint loopholes and improve solutions. Methodological progress is never so sure or so speedy as when subjected to severe criticism. I think people would stand up and really take notice to see Bayesians remove the rose-colored glasses for a bit. What do you think?

[Added 6/22: I see this is equivocal. I had meant that the criticism be self-criticism and that the Bayesians themselves would have vigorously brought out the problems. But mixing in constructive criticism from others would also be of value.]

Here’s some of the rest….

The editors emphasised that they were not looking for ‘argumentative rehashes of the Bayesian versus frequentist debate’. I can only commend them on that. On the other hand, times move on, ideas develop, and understanding deepens, so while ‘argumentative rehashes’ might not be desirable, re-examination from a more sophisticated perspective might be.

I couldn’t agree more as to the need for “a re-examination from a more sophisticated perspective”, and it’s a point very rarely articulated. I hear people quote Neyman and Pearson from like the first few months of exploring a brand new approach and overlook the 70 years of developments in the general frequentist, sampling or (as I prefer) error statistical domain of inference and modeling. ….

An interesting question, perhaps in part sociological, is why different scientific communities tend to favour different schools of inference. Astronomers favour Bayesian methods, particle physicists and psychologists seem to favour frequentist methods. Is there something about these different domains which makes them more amenable to attack by different approaches? In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world… …As an aside, there is also the question of what exactly is meant by ‘Bayesian’. Cox and Donnelly (2011, p144) remark that ‘the word Bayesian, however, became ever more widely used, sometimes representing a regression to the older usage of “flat” prior distributions supposedly representing initial ignorance, sometimes meaning models in which the parameters of interest are regarded as random variables and occasionally meaning little more than that the laws of probability are somewhere invoked.’

Yes that’s another thorny question that remains without a generally accepted answer. I’ve seen it used to simply mean the use of conditional probability anywhere, any time.

Turning to the papers themselves, the Bayesian approach to statistics, with its interpretation of parameters as random variables, has the merit of formulating everything in a consistent manner. Instead of trying to fit together objects of various different kinds, one merely has a single common type of brick to use, which certainly makes life easier.

What is this single brick? Managing to assess everything as a probability brick, when the quantities actually have very different references, isn’t obviously better than recognizing and reporting the differences, possibly synthesizing in some other way. I end with a remark by Welsh:

One motivation for doing a Bayesian analysis for this problem (and one that is commonly articulated) is that the event in question is unique so it is not meaningful to think about replications. This is not really convincing because hypothetical replications are hypothetical whether they are conceived of for an event that is extremely rare (and in the extreme happens once) or for events that occur frequently.

I concur with Welsh. The study of unique events and fixed hypotheses still involves general types of questions and theories under what I call a repertoire of background. [One might ask, if “the event in question is unique so it is not meaningful to think about replications,” then how does the methodology serve for replicable science?]

Please send any corrections to this draft (i).

**I invite comments, as always, and UPhils for guest blog posting (by July 15), if anyone is interested: error@vt.edu**

[1] The citations come from the Statistical Science posting of future articles (thus final corrected versions could differ), but I am also linking to the published discussion articles.

[2] As even Popper emphasized, even a certain degree of dogmatism has a role, to avoid rejecting a claim too soon. But this is intended to occur within an inquiry that is working hard to find flaws and weaknesses, else it falls far short of being scientific–*for Popper.*

[3] Fab frequentist “tales” (areas)?

[4] I never know whether requiring subjective priors means they required beliefs about weights of evidence, beliefs about frequencies, beliefs about beliefs, or something closer to Christian Robert’s idea that a prior “has nothing to do with ‘reality,’ it is a reference measure that is necessary for making probability statements” (2011, 317-18) in a comment on Don Fraser’s quick and dirty confidence paper.

[5] Endemic

- (of a disease or condition) regularly found among particular people or in a certain area: “areas where malaria is **endemic**”. Denoting an area in which a particular disease is regularly found.
- (of a plant or animal) native or restricted to a certain country or area: “a marsupial **endemic to** northeastern Australia”. Growing or existing in a certain place or region.

Filed under: Bayesian/frequentist, Honorary Mention, Statistics

Four ~~score~~ years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), where I’m visiting professor.[1] Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.[2]

*My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. It begins like this:*

**1. Comedy Hour at the Bayesian Retreat[3]**

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…

“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

or

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call ‘error statistics’, continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define ‘controlling long-run error’, it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.

Never mind that frequentists have responded to these criticisms, they keep popping up (verbatim) in every Bayesian and some non-Bayesian textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to “describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician” (Senn 2011, 59, this special topic of RMM). Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long-run) succeeds in satisfying error-statistical demands.

The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson ‘really thought’. I regard this as a shallow way to do foundations.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.

**2. Popperians Are to Frequentists as Carnapians Are to Bayesians**

Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:

In opposition to [the] inductivist attitude, I assert that C(*H*,*x*) must not be interpreted as the degree of corroboration of *H* by *x*, unless *x* reports the results of our sincere efforts to overthrow *H*. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that *x* must represent our total observational knowledge. (Popper 1959, 418, I replace ‘e’ with ‘x’)

In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the “corroboration” of hypothesis *H*. Popper chides the inductivist for making it too easy for agreements between data **x** and *H* to count as giving *H* a degree of confirmation.

Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory. (Popper 1994, 89)

(Note the similarity to Peirce in Mayo 2011, 87.)

**2.1 Severe Tests**

Popper did not mean to cash out ‘sincerity’ psychologically of course, but in some objective manner. Further, high corroboration must be ascertainable: ‘sincerely trying’ to find flaws will not suffice. Although Popper never adequately cashed out his intuition, there is clearly something right in this requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis to philosophically scrutinize different methods. (Mayo 2011, section 2.5, this special topic of RMM) Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis *H* if it is predetermined that, even if *H* is false, a way would be found to either obtain, or interpret, data as agreeing with (or ‘passing’) hypothesis *H*. Here is one of many ways to state this:

Severity Requirement (weakest): An agreement between data *x* and *H* fails to count as evidence for a hypothesis or claim *H* if the test would yield (with high probability) so good an agreement even if *H* is false.

Because such a test procedure had little or no ability to find flaws in *H*, finding none would scarcely count in *H*’s favor.
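A minimal numerical sketch of the weak severity requirement (my own illustration, not from the text; the function names and the 5% cutoff are assumptions) uses the one-sided Normal test T+ of H0: μ ≤ 0 discussed later in the paper. With a tiny sample, the test “passes” H0 (fails to reject) with high probability even when μ is well above 0, so passing H0 could scarcely count in its favor:

```python
import math

def Phi(z):
    """Standard Normal CDF, built from math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_pass_H0(delta, n, sigma=1.0):
    """Probability that the one-sided test T+ of H0: mu <= 0 'passes' H0
    (fails to reject at the 5% level) when the true mean is delta.
    A high value when delta > 0 means agreement with H0 is probable
    even though H0 is false: passing is then evidence of nothing."""
    z_alpha = 1.645  # approximate upper 5% point of N(0,1)
    cutoff = z_alpha * sigma / math.sqrt(n)  # test rejects when xbar > cutoff
    # Xbar ~ N(delta, sigma^2/n), so P(pass) = P(Xbar <= cutoff; mu = delta)
    return Phi((cutoff - delta) / (sigma / math.sqrt(n)))

# An insensitive test (n = 2) against a real discrepancy delta = 0.5:
print(prob_pass_H0(0.5, n=2))    # high (about 0.83): passing H0 fails the severity requirement
# A sensitive test (n = 100) against the same discrepancy:
print(prob_pass_H0(0.5, n=100))  # near 0: here, passing H0 would actually count in its favor
```

The first case is exactly the situation the requirement rules out: so good an agreement with H0 would occur with high probability even if H0 is false.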

*2.1.1 Example: Negative Pressure Tests on the Deep Horizon Rig*

Did the negative pressure readings provide ample evidence that:

H_{0}: leaking gases, if any, were within the bounds of safety (e.g., less than θ_{0})?

Not if the rig workers kept decreasing the pressure until H_{0} passed, rather than performing a more stringent test (e.g., a so-called ‘cement bond log’ using acoustics). Such a lowering of the hurdle for passing *H_{0}* made it too easy to pass *H_{0}* even if

H_{1}: the pressure build-up was in excess of θ_{0}.

That ‘the negative pressure readings were misinterpreted’ meant that it was incorrect to construe them as indicating H_{0}. If such negative readings would be expected, say, 80 percent of the time, even if *H_{1}* is true, then the test had little capacity to distinguish H_{0} from H_{1}, and passing H_{0} violates the severity requirement.

**2.2 Another Egregious Violation of the Severity Requirement**

Too readily interpreting data as agreeing with or fitting hypothesis *H* is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data **x** to succeed in corroborating *H* with severity, two things are required: (i) **x** must fit *H*, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with *H*, were *H* false. I have been focusing on (ii) but requirement (i) also falls directly out from error statistical demands. In general, for *H* to fit **x**, *H* would have to make **x** more probable than its denial. Coin tossing hypotheses say nothing about hypotheses on diabetes and so they fail the fit requirement. Note how this immediately scotches the second howler in the second opening example.

But note that we can appraise the severity credentials of other accounts by using whatever notion of ‘fit’ they permit. For example, if a Bayesian method assigns high posterior probability to *H* given data **x**, we can appraise how often it would do so even if *H* is false. That is a main reason I do not want to limit what can count as a purported measure of fit: we may wish to entertain different measures for purposes of criticism.
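The proposal to appraise “how often it would do so even if *H* is false” lends itself to a quick simulation. This is a hedged sketch under assumptions of my own choosing (two point hypotheses with equal priors, Normal data; all function names and numbers are hypothetical, not from the text):

```python
import math, random

def posterior_H0(xbar, n, mu0=0.0, mu1=0.5, sigma=1.0):
    """Posterior P(H0 | xbar) for point hypotheses H0: mu = mu0 vs
    H1: mu = mu1 with equal prior weight, where xbar is the mean of
    n N(mu, sigma^2) observations."""
    se = sigma / math.sqrt(n)
    l0 = math.exp(-0.5 * ((xbar - mu0) / se) ** 2)  # likelihood under H0
    l1 = math.exp(-0.5 * ((xbar - mu1) / se) ** 2)  # likelihood under H1
    return l0 / (l0 + l1)

def error_probe(n=5, reps=20000, threshold=0.5, seed=1):
    """Error-statistical scrutiny of the Bayesian 'fit' measure: how often
    does the posterior favour H0 (exceed the threshold) when the data are
    actually generated under H1 (mu = 0.5)?"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # sampling distribution of xbar under H1: N(0.5, sigma^2/n)
        xbar = rng.gauss(0.5, 1.0 / math.sqrt(n))
        if posterior_H0(xbar, n) > threshold:
            hits += 1
    return hits / reps

print(error_probe())
```

With n = 5 the posterior favours H0 in roughly 29% of samples generated under H1, an error probability that the posterior report by itself does not disclose. This is the sense in which the severity credentials of another account can be appraised using its own notion of fit.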

**2.3 The Rationale for Severity is to Find Things Out Reliably**

Although the severity requirement reflects a central intuition about evidence, I do not regard it as a primitive: it can be substantiated in terms of the goals of learning. To flout it would not merely permit being wrong with high probability—a long-run behavior rationale. In any particular case, little if anything will have been done to rule out the ways in which data and hypothesis can ‘agree’, even where the hypothesis is false. The burden of proof on anyone claiming to have evidence for *H* is to show that the claim is not guilty of at least an egregious lack of severity.

Although one can get considerable mileage even with the weak severity requirement, I would also accept the corresponding positive conception of evidence, which will comprise the full severity principle:

Severity Principle (full): Data *x* provide a good indication of or evidence for hypothesis *H* (only) to the extent that test *T* severely passes *H* with *x*.

Degree of corroboration is a useful shorthand for the degree of severity with which a claim passes, and may be used as long as the meaning remains clear.

**2.4 What Can Be Learned from Popper; What Can Popperians Be Taught?**

Interestingly, Popper often crops up as a philosopher to emulate—both by Bayesian and frequentist statisticians. As a philosopher, I am glad to have one of our own taken as useful, but feel I should point out that, despite having the right idea, Popperian logical computations never gave him an adequate way to implement his severity requirement, and I think I know why: Popper once wrote to me that he regretted never having learned mathematical statistics. Were he to have made the ‘error probability’ turn, today’s meeting ground between philosophy of science and statistics would likely look very different, at least for followers of Popper, the ‘critical rationalists’.

Consider, for example, Alan Musgrave (1999; 2006). Although he declares that “the critical rationalist owes us a theory of criticism” (2006, 323) this has yet to materialize. Instead, it seems that current-day critical rationalists retain the limitations that emasculated Popper. Notably, they deny that the method they recommend—either to accept or to prefer the hypothesis best-tested so far—is reliable. They are right: the best-tested so far may have been poorly probed. But critical rationalists maintain nevertheless that their account is ‘rational’. If asked why, their response is the same as Popper’s: ‘I know of nothing more rational’ than to accept the best-tested hypotheses. It sounds rational enough, but only if the best-tested hypothesis so far is itself well tested (see Mayo 2006; 2010b). So here we see one way in which a philosopher, using methods from statistics, could go back to philosophy and implement an incomplete idea.

On the other hand, statisticians who align themselves with Popper need to show that the methods they favor uphold falsificationist demands: that they are capable of finding claims false, to the extent that they are false; and retaining claims, just to the extent that they have passed severe scrutiny (of ways they can be false). Error probabilistic methods can serve these ends; but it is less clear that Bayesian methods are well-suited for such goals (or if they are, it is not clear they are properly ‘Bayesian’).

__________________

To read sections 3 and 4 see: SS & POS 2 or go to the RMM page, and scroll down to Mayo’s Sept 25 paper.

*Here is section 5:*

**5. The Error-Statistical Philosophy**

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).

**5.1 Error (Probability) Statistics**

*What is key on the statistics side* is that the probabilities refer to the distribution of a statistic *d*(**X**)—the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.

Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)

The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.

*What is key on the philosophical side* is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).

The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:

SEV(*T*, **x**, *H*) to abbreviate ‘the severity with which test *T* passes hypothesis *H* with data **x**’.

When the test and data are clear, I may just write SEV(*H*). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well testedness, and this is the key, I maintain, to avoiding counterintuitive inferences which are at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:

If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)

For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).
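A sketch of how such a data-specific severity assessment can be computed for the one-sided Normal test T+ described in section 5.3 (H0: μ ≤ μ0 with σ known). The formula SEV(μ > μ1) = P(X̄ ≤ observed x̄; μ = μ1) follows the standard severity treatments; the sample numbers below are my own illustration, not from the text:

```python
import math

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sev_greater(xbar, mu1, n, sigma=1.0):
    """Severity for the inference 'mu > mu1' after observing sample mean
    xbar in test T+ (H0: mu <= mu0, sigma known): the probability that the
    test would have produced a result agreeing less well with 'mu > mu1'
    were mu no greater than mu1, i.e. P(Xbar <= xbar; mu = mu1)."""
    se = sigma / math.sqrt(n)
    return Phi((xbar - mu1) / se)

# Observed xbar = 0.4 with n = 100, sigma = 1 (standard error 0.1), mu0 = 0:
for mu1 in (0.0, 0.2, 0.3, 0.4):
    print(f"SEV(mu > {mu1}) = {sev_greater(0.4, mu1, 100):.3f}")
```

Here the same data warrant some discrepancies from the null and not others: ‘μ > 0.2’ passes with severity about 0.98, while ‘μ > 0.4’ gets severity only 0.5 and so is poorly warranted. This is the sense in which the assessment is relative to the particular inference being entertained.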

**5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods**

What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.

But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.

Criticisms then follow readily, in the form of one or both:

- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.

The following examples will substantiate and flesh out these claims.

**5.3 Severity as a ‘Metastatistical’ Assessment**

In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider several possible inferences that might be considered of interest. In the example of test *T+* [this is a one-sided Normal test of H_{0}: μ ≤ μ_{0} against H_{1}: μ > μ_{0}, on p. 81], the data-specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data **x** rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure one can write:

SEV(μ < x̄_{0} + γσ_{x}) to abbreviate: the severity with which test *T+*, with result **x**_{0}, passes the hypothesis (μ < x̄_{0} + γσ_{x}), with σ_{x} abbreviating (σ/√n).

One might consider a series of benchmarks or upper severity bounds:

SEV(μ < x̄_{0} + 0σ_{x}) = .5

SEV(μ < x̄_{0} + .5σ_{x}) = .7

SEV(μ < x̄_{0} + 1σ_{x}) = .84

SEV(μ < x̄_{0} + 1.5σ_{x}) = .93

SEV(μ < x̄_{0} + 1.98σ_{x}) = .975

More generally, one might interpret nonstatistically significant results (i.e., *d*(**x**) ≤ *c*_{α}) in test *T+* as follows:

(μ ≤ x̄_{0} + γ_{ε}(σ/√n)) passes the test *T+* with severity (1 – ε),

for any P(*d*(**X**) > γ_{ε}) = ε.

It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.
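For the known-σ Normal case just described, the severity of the upper-bound claim μ < x̄ + γσ_x reduces to the standard Normal CDF evaluated at γ, so the benchmarks above can be checked directly. A minimal sketch (the function names are mine, not the book’s; the text’s figures are rounded):

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity_upper(gamma):
    """SEV(mu < xbar + gamma * sigma_x) for test T+ with sigma known.

    In this simple case the assessment reduces to Phi(gamma): the
    probability of a worse fit with the claim, were the claim false
    at the boundary mu = xbar + gamma * sigma_x.
    """
    return phi(gamma)

# Reproduce the benchmarks from the text (up to rounding):
for gamma in (0.0, 0.5, 1.0, 1.5, 1.98):
    print(f"SEV(mu < xbar + {gamma}*sigma_x) = {severity_upper(gamma):.3f}")
```

Note that the assessment depends only on γ, not on the particular observed mean, which is why a single table of benchmarks serves for any outcome of *T+* in the known-σ case.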

*5.3.1 Severity and Confidence Bounds in the Case of Test T+*

It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test *T+* is of form:

μ > X̄ − *c*_{α}(σ/√n)

The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test *T+* is:

μ > X̄ − 1.96(σ/√n)

We were only led to the upper bounds in the midst of a severity interpretation of negative results (see Mayo and Spanos 2006). [See also posts on this blog, e.g., on reforming the reformers.]

Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.
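The duality can be made concrete with a small numerical sketch (the numbers are illustrative assumptions, not from the text): the parameter values not rejected by *T+* at level α = .025 are exactly those above x̄ − 1.96σ_x, while the severity of the corresponding upper-bound claim μ < x̄ + 1.96σ_x takes the matching value .975.

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Illustrative numbers (assumed for this sketch, not from the text):
sigma, n, xbar = 2.0, 100, 0.41
sigma_x = sigma / sqrt(n)            # sigma_x = sigma / sqrt(n) = 0.2
c = 1.96                             # cutoff for alpha = .025

# One-sided 97.5% confidence bound corresponding to test T+:
lower = xbar - c * sigma_x           # the CI asserts mu > lower

# Severity for the upper-bound claim mu < xbar + c * sigma_x:
upper = xbar + c * sigma_x
sev = phi(c)

print(f"CI assertion:  mu > {lower:.3f}  (confidence level {phi(c):.3f})")
print(f"SEV(mu < {upper:.3f}) = {sev:.3f}")
```

The two numbers coincide (.975), but the claims run in opposite directions, which is the point of the passage above: the CI corresponding to *T+* asserts a lower bound, whereas the severity interpretation of an insignificant result licenses an upper bound.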

A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:

P(μ < X̄ + 2(σ/√n); μ) = .975,

observe mean **x̄**,

therefore, P(μ < x̄ + 2(σ/√n); μ) = .975.

While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.

*5.3.2 Severity versus Rubbing Off*

The severity construal is different from what I call the ‘rubbing-off construal’, which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method, not the inference. Nor will it do to merely replace an error probability associated with an inference to *H* with the phrase ‘degree of severity’ with which *H* has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer *H* (with severity).

The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.

To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:

*H*: μ < x̄_{0} + 0σ_{x}

on the basis of mean **x̄**_{0} from test *T+*.

*5.3.3 What’s Belief Got to Do with It?*

Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis *H* has passed a severe test *T* with **x **without also advocating (strong) belief in* H*. When SEV(*H*) is high there is no problem in saying that **x **warrants *H*, or if one likes, that **x **warrants believing *H*, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(*H*) is high is:

If SEV(*H*) is high, its denial is low, i.e., SEV(~*H*) is low.

But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.

After all, a test may poorly warrant both a hypothesis *H* and its denial, violating the probability calculus. That is, SEV(*H*) may be low because its denial was ruled out with severity, i.e., because SEV(~*H*) is high. But SEV(*H*) may also be low because the test is too imprecise to allow us to take the result as good evidence for *H*.

Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome **x** warrants *H*. That is what a severity assessment would provide. In this respect, a dyed-in-the-wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.

Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame not only at Bayesian retreats, but throughout the literature.

*5.3.4 Tacking Paradox Scotched*

In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetical-deductive accounts in philosophy. For example, tacking on an irrelevant conjunct to a well-confirmed hypothesis *H* seems magically to allow confirmation for some irrelevant conjuncts. Not so in a severity analysis. Suppose the severity for claim *H* (given test *T* and data **x**) is high: i.e., SEV(*T*, **x**, *H*) is high, whereas a claim *J* is not probed in the least by test *T*. Then the severity for the conjunction (*H* & *J*) is very low, if not minimal.

If SEV(Test *T*, data **x**, claim *H*) is high, but *J* is not probed in the least by the experimental test *T*, then SEV(*T*, **x**, (*H* & *J*)) is very low or minimal.

For example, consider:

*H*: GTR and *J*: Kuru is transmitted through funerary cannibalism,

and let data **x**_{0} be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction would get minimal severity from this particular data set.

A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘*H* is false’ is a specific denial of *H*). These must be relevant rivals to *H*—they must be at ‘the same level’ as *H*. For example, if *H* asks whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless, we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).

[1] co-organized with Aris Spanos.

[2] This was a special topic of the on-line journal, *Rationality, Markets and Morals (RMM)*, edited by Max Albert—also a conference participant. For more Saturday night reading, check out the page. Authors are: David Cox, Andrew Gelman, David F. Hendry, Deborah G. Mayo, Stephen Senn, Aris Spanos, Jan Sprenger, Larry Wasserman. Search this blog for a number of commentaries on most of these papers.

[3] Long-time blog readers will recognize this from the start of this blog. For some background, and a table of contents for the paper, see my Oct 17 post.

Filed under: Error Statistics, Philosophy of Statistics, Severity, Statistics, StatSci meets PhilSci

**Aris Spanos**

Wilson E. Schmidt Professor of Economics

*Department of Economics, Virginia Tech*

**Recurring controversies about P values and confidence intervals revisited**

*Ecology*, Volume 95, Issue 3 (March 2014): pp. 645-651

*INTRODUCTION*

The use, abuse, interpretations and reinterpretations of the notion of a *P* value has been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics.

The initial controversy between Fisher’s significance testing and the Neyman and Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher’s *post-data threshold* for the *P* value. Fisher adopted a falsificationist stance and viewed the *P* value as an indicator of disagreement (inconsistency, contradiction) between data *x*_{0} and the null hypothesis (*H*_{0}).

The primary aim of this paper is to revisit several charges, interpretations, and comparisons of the *P* value with other procedures as they relate to their primary aims and objectives, the nature of the questions posed to the data, and the nature of their underlying reasoning and the ensuing inferences. The idea is to shed light on some of these issues using the *error-statistical* perspective; see Mayo and Spanos (2011).

…..

Click to read all of A. Spanos on “Recurring controversies”.

……

*SUMMARY AND CONCLUSIONS*

The paper focused primarily on certain charges, claims, and interpretations of the *P* value as they relate to CIs and the AIC. It was argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the questions posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences.

In the case of the *P* value, the crucial issue is whether Fisher’s evidential interpretation of the *P* value as “indicating the strength of evidence against *H*_{0}” is appropriate. It is argued that, despite Fisher’s maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of a post-data severity evaluation, calls for taking into account the power of the test.

The error-statistical perspective brings out a key weakness of the *P* value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs; see Mayo and Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be “stylistic” (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.

Spanos, A. (2014) Recurring controversies about *P* values and confidence intervals revisited. *Ecology* 95(3): 645-651.

**Murtaugh_In defense of P values Murtaugh_Rejoinder**

** Burnham & Anderson_P values are only an index to evidence_ 20th- vs 21st-century statistical science**

Filed under: CIs and tests, Error Statistics, Fisher, P-values, power, Statistics

So what happened? Medical journals, the main vehicles for publishing clinical trials today, are after all the ‘gatekeepers of medical evidence’—as they are described in *Bad Pharma*, Ben Goldacre’s 2012 bestseller. …… The AllTrials campaign, launched two years ago on the back of Goldacre’s book, has attracted an extraordinary level of support. …

Professor Senn has long argued the AllTrials case, he insisted. ‘There’s no doubt that obtaining a license to market a drug should involve an obligation to share the results with interested parties,’ he said.

His point, however, was that this sharing should not involve medical journals. …There were several reasons, he said, as to why Bad *JAMA* and other journals were at least as much to blame as Bad Pharma for a lack of transparency in pharmaceutical research: the constant need of the medical press to make a sensational impact, ‘the vanity and ambitions of scientists,’ and the confusing restrictions of embargos—as well as the fact that, despite the evidence, it was clear that journals *do* favour ‘exciting’ research. Instead of journals, Professor Senn claimed, trials should be self-published either on the web or in some publicly searchable registry, such as the website Clinicaltrials.gov.

I wonder if this would have helped in the case of the Potti and Nevins Duke trials. I believe the NCI only discovered it was partially funding one of the trials by noticing it on the clinical trials website.

Between the medical journals and the regulators, Senn puts more trust in the latter.

[A]ccording to Professor Senn, it’s the regulators, virtually alone, that keep medicine safe. ‘Regulators may make mistakes, but they do a better job than the journals,’ he said. ‘Would you want to fly to New York with a big reputable airline like BA, which is heavily regulated? Or a plane built by Professor Smith and his colleagues from the local university?’

What do you think?

With so much to disagree on, speakers and audience members agreed that transparent clinical research is a complex goal, and should be addressed as such. Discussing the future is just the start of the process, pointed out Dr Groves. ‘Publication bias is not only down to publishers, it is also dependent on people submitting their results, including old data, whether it’s in a loft or on a floppy disk or filed away somewhere—so bring out your dead,’ she said. ‘We need to be able to make decisions on all the evidence. That means that observational studies should be regarded as being as important as randomised controlled trials. We know we’ve got to improve and there’s a long way to go. It’s an exciting time,’ she said.

*Bring out your dead?*

Filed under: PhilPharma, science communication, Statistics

by Stephen Senn*

Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult[1].

If, as is often the case, both formulations are given in forms that are absorbed through the gut, whether as pills, oral solutions or suppositories, then so-called *bioequivalence trials* form an attractive option. The basic idea is that the concentration in the blood of the new *test* formulation can be compared to that of the licensed *reference* formulation. Equivalence of concentration in the blood plausibly implies equivalence in all possible effect sites and thus equality of all benefits and harms.

Typically, healthy volunteers are recruited and given the test formulation on one occasion and the reference formulation on another, the order being randomised. Regular blood samples are taken and the concentration-time curves summarised using simple statistics: for example, the area under the curve (AUC) is always used, the concentration maximum (C_{max}) nearly always, and the time to reach the maximum (T_{max}) very often. These statistics are then compared across formulations to show that they are similar.
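The summary statistics just listed are simple to compute from the sampled concentrations; the AUC, for instance, is usually obtained by numerical integration. A minimal sketch with made-up numbers (the data and function name are my own illustration; real analyses use standard pharmacokinetic software and log-transformed summaries):

```python
# Hypothetical concentration-time samples (hours, ng/mL) for one volunteer.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0]
conc = [0.0, 12.0, 20.0, 16.0, 9.0, 3.0, 1.0]

def auc_trapezoid(t, c):
    """Area under the concentration-time curve by the trapezoidal rule."""
    return sum((t[i + 1] - t[i]) * (c[i + 1] + c[i]) / 2.0
               for i in range(len(t) - 1))

auc = auc_trapezoid(times, conc)   # area under the sampled curve
cmax = max(conc)                   # concentration maximum
tmax = times[conc.index(cmax)]     # time at which the maximum occurs

print(f"AUC = {auc}, Cmax = {cmax}, Tmax = {tmax}")
```

Each volunteer contributes one such triple per formulation; it is the ratios of these summaries across formulations that the trial then examines.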

In the rest of this post I shall ignore the problem that various summary measures are employed and assume that we are just considering AUC. There seems to be a general (but arbitrary) agreement that two formulations are equivalent if the true ratio of AUC under test and reference lies between 0.8 and 1.25. In that case (at least as regards the AUC requirement) the formulations are deemed bioequivalent. The true ratio, however, is a parameter, not a statistic, and so the task is to see what the data can show about the reasonableness of any claim regarding this unknown theoretical quantity.

It is here, however, that the statistical difficulties begin. A simple frequentist solution would appear to be to calculate the 95% confidence interval for the relative bioavailability and check that it lies within the limits of equivalence. Modelling is always done on the log-scale and, since log(0.8) = −log(1.25), the limits for the log relative bioavailability of test and reference are (approximately) −0.22 to +0.22. However, there is more than one 95% confidence interval, and an early dispute in this field was whether a traditional confidence interval centred on the point estimate should be calculated, as Kirkwood[2] proposed in 1981, or one centred on the middle of the range of equivalence, that is to say on 0 (on the log scale), as Westlake[3] had earlier proposed in 1972.

As O’Quigley and Baudoin pointed out[4], the difference is, essentially, between deciding whether the ‘shortest’ confidence interval is included within the limits of equivalence or whether the fiducial probability that the true relative bioavailability lies within the limits is at least 95%. The latter is always the easier requirement to satisfy. To see why, consider the case where the point estimate is positive. In that case the lower conventional confidence limit would clearly never lie outside the limits of equivalence unless the upper one did. Thus, by lengthening the interval below and shortening it above in such a way as to maintain the 95% probability, one can make it easier to satisfy equivalence.

An alternative approach was taken by Schuirmann[5], who proposed to look at the matter in terms of two one-sided tests. Imagine that we have two regulators: a toxicity regulator and an efficacy regulator. The former defines as toxic any drug whose relative bioavailability is greater than 1.25, and the latter as ineffective any drug whose relative bioavailability is less than 0.8. Each is unconcerned by the other’s decision and so no trading of alpha from one to the other can take place. It turns out that this requirement is satisfied operationally by accepting bioequivalence if the conventional 90% confidence limits lie within the limits of equivalence. Opinions differ as to how logical this is. For example, the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided, but since they would never accept a treatment that was worse than placebo, the regulator’s risk is 2.5%, not 5%. Why should it be lower for bioequivalence?
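Schuirmann’s two one-sided tests can be sketched as follows, using Normal theory with a known standard error for simplicity (in practice a t-distribution with the appropriate degrees of freedom is used; the function name and the illustrative numbers are mine):

```python
from math import erf, log, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def tost(est, se, alpha=0.05, lower=log(0.8), upper=log(1.25)):
    """Two one-sided tests for bioequivalence on the log scale.

    est: estimated log relative bioavailability (test vs reference)
    se:  its standard error (treated as known in this sketch)

    Non-equivalence is rejected iff BOTH one-sided nulls are rejected
    at level alpha -- operationally the same as requiring the
    conventional (1 - 2*alpha), i.e. 90%, confidence interval to lie
    within the limits of equivalence.
    """
    p_low = 1.0 - phi((est - lower) / se)   # H0: theta <= log(0.8)
    p_high = phi((est - upper) / se)        # H0: theta >= log(1.25)
    z90 = 1.6449                            # ~95th Normal percentile
    ci = (est - z90 * se, est + z90 * se)   # conventional 90% CI
    return max(p_low, p_high) < alpha, ci

# A precise trial near a ratio of 1 passes; an imprecise one cannot:
print(tost(est=0.05, se=0.06))   # equivalent: CI inside (-0.223, 0.223)
print(tost(est=0.00, se=0.20))   # not equivalent: CI wider than limits
```

Note that each regulator’s one-sided test is carried out at the full 5% level; the 90% interval is simply the operational by-product, which is exactly the point at issue in the question that closes the paragraph above.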

Be that as it may, 90% confidence intervals are regularly used, but they have been criticised by a number of frequentists of a Neyman-Pearson persuasion. (See, for example, R. Berger and Hsu[6].) The argument goes as follows. If the trial is small enough that the standard error is sufficiently large, the width of the confidence interval, however calculated, will exceed the width of the equivalence interval. Thus the Type I error rate is zero. Various proposals have been made as to how to recover the missing Type I error, but they all boil down to this: given a small enough trial you could claim equivalence even though the point estimate was outside the limits of equivalence! Needless to say, nobody uses such tests in practice and they have been severely criticised from a theoretical point of view[7].
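The phenomenon behind this dispute can be seen in a small Monte Carlo sketch (my own illustration, treating the standard error as known): at the boundary of the null (true log ratio = log 1.25), the probability of declaring equivalence is close to .05 when the standard error is small, but drops to exactly zero once the 90% interval is wider than the equivalence range.

```python
import random
from math import log

Z90 = 1.6449              # ~95th percentile of the standard Normal
LIM = log(1.25)           # equivalence limits on the log scale: (-LIM, LIM)

def declares_equivalence(est, se):
    """Conclude bioequivalence iff the 90% CI lies within (-LIM, LIM)."""
    return -LIM < est - Z90 * se and est + Z90 * se < LIM

def rejection_rate_at_boundary(se, reps=100_000, seed=1):
    """Monte Carlo Type I error at the boundary theta = log(1.25)."""
    rng = random.Random(seed)
    theta = LIM   # true log ratio sits exactly at the non-equivalence boundary
    hits = sum(declares_equivalence(rng.gauss(theta, se), se)
               for _ in range(reps))
    return hits / reps

for se in (0.05, 0.10, 0.20):
    print(se, rejection_rate_at_boundary(se))
```

For se = 0.20 the interval is necessarily wider than 2 × 0.223, so the rejection region is empty and the rate is exactly zero, which is the behaviour both this post and the Berger comment above are arguing about.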

The above argument is based on Normal theory tests. Horrendous complications are introduced by using the t-test if one departs from classical confidence intervals.

And don’t get me started on equivalence when concentration in the blood is irrelevant but a pharmacodynamic outcome must be used instead!

So, what seems to be a simple problem turns out to be controversial and difficult. As I sometimes put it ‘equivalence is different’.

Here there be tygers!

*Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**References**

1. Senn, S.J., *Statistical issues in bioequivalence.* Statistics in Medicine, 2001. **20**(17-18): p. 2785-2799.

2. Kirkwood, T.B.L., *Bioequivalence testing – a need to rethink.* Biometrics, 1981. **37**: p. 589-591.

3. Westlake, W.J., *Use of confidence intervals in analysis of comparative bioavailability trials.* Journal of Pharmaceutical Sciences, 1972. **61**(8): p. 1340-1341.

4. O’Quigley, J. and C. Baudoin, *General approaches to the problem of bioequivalence.* The Statistician, 1988. **37**: p. 51-58.

5. Schuirmann, D.J., *A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability.* J Pharmacokinet Biopharm, 1987. **15**(6): p. 657-80.

6. Berger, R.L. and J.C. Hsu, *Bioequivalence trials, intersection-union tests and equivalence confidence sets.* Statistical Science, 1996. **11**(4): p. 283-302.

7. Perlman, M.D. and L. Wu, *The emperor’s new tests.* Statistical Science, 1999. **14**(4): p. 355-369.

References added by Editor for readers:

1. Senn, S.J., *Falsificationism and clinical trials [see comments].* Statistics in Medicine, 1991. **10**: p. 1679-1692.

2. Senn, S.J., *Inherent difficulties with active control equivalence studies.* Statistics in Medicine, 1993. **12**: p. 2367-2375.

3. Senn, S.J., *Fisher’s game with the Devil.* Statistics in Medicine, 1994. **13**: p. 217-230.

Filed under: bioequivalence, confidence intervals and tests, PhilPharma, Statistics, Stephen Senn