Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.

**1. Some assertions from Fisher, N-P, and Bayesian camps**

Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

*a) From the Fisherian camp (Cox and Hinkley):*

“For given observations y we calculate t = t_{obs} = t(y), say, and the level of significance p_{obs} by

p_{obs} = Pr(T > t_{obs}; H_{0}).

….Hence p_{obs} is the probability that we would mistakenly declare there to be evidence against H_{0}, were we to regard the data under analysis as being just decisive against H_{0}.” (Cox and Hinkley 1974, 66).

Thus p_{obs} would be the Type I error probability associated with the test.
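To make the definition concrete, here is a minimal sketch (in Python, with made-up numbers, not an example from Cox and Hinkley) of computing p_{obs} for a one-sided test of a Normal mean with known standard deviation:

```python
import math

def std_normal_sf(z):
    """Upper-tail (survival) probability of the standard Normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical setup: test H0: mu <= mu0 against H1: mu > mu0,
# with known sigma and sample size n.
mu0, sigma, n = 0.0, 1.0, 25
xbar = 0.4  # illustrative observed sample mean

# Test statistic t_obs and observed significance level p_obs = Pr(T > t_obs; H0)
t_obs = (xbar - mu0) / (sigma / math.sqrt(n))
p_obs = std_normal_sf(t_obs)

print(f"t_obs = {t_obs:.2f}, p_obs = {p_obs:.4f}")  # t_obs = 2.00, p_obs ~ 0.023
```

Reporting p_{obs} ≈ .023, rather than only “reject at the .05 level,” is exactly what the next quotation recommends: it lets each reader apply the threshold of his or her choice.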

*b) From the Neyman-Pearson (N-P) camp (Lehmann and Romano):*

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

*c) Gibbons and Pratt:*

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).

**2. So what’s behind the “P values aren’t error probabilities” allegation?**

In their rejoinder to Hinkley, Berger and Sellke assert the following: “The use of the term ‘error rate’ suggests that the frequentist justifications, such as they are, for confidence intervals and fixed α-level hypothesis tests carry over to P values.”

They do not disagree with Cox and Hinkley’s assertion above, but they maintain that:

“This hypothetical error rate does not conform to the usual classical notion of ‘repeated-use’ error rate, since the P-value is determined only once in this sequence of tests. The frequentist justifications of significance tests and confidence intervals are in terms of how these procedures perform when used repeatedly.” (Berger and Sellke 1987, 136)

Keep in mind that Berger and Sellke are using “significance tests” to refer to Neyman-Pearson (N-P) tests in contrast to Fisherian P-value appraisals.

So their point appears to be simply that the P value, as intended by Fisher, is not justified by (or not intended to be justified by) a behavioral appeal to controlling long run error rates. It is assumed that those are the only, or the main, justifications available for N-P significance tests and confidence intervals (thus Type 1 and 2 error probabilities and confidence levels are genuine error probabilities). They do not entertain the idea that the P value, as the attained significance level, is important for N-P theorists, nor that “a p-value gives an idea of how strongly the data contradict the hypothesis” (Lehmann and Romano)—a construal we find early on in David Cox.

But let’s put that aside, as we pin down Berger and Sellke’s point. Here’s how we might construe them. They grant that the P-value is, mathematically, a frequentist error probability; it is the *justification* that they think differs from what they take to be the justification of Type 1 and 2 error probabilities in N-P statistics. They think N-P tests and confidence intervals get their justification in terms of (actual?) long run error rates, and P-values do not. To continue with their remarks:

“Can P values be justified on the basis of how they perform in repeated use? We doubt it. For one thing, how would one measure the performance of P values? With significance tests and confidence intervals, they are either right or wrong, so it is possible to talk about error rates. If one introduces a decision rule into the situation by saying that H_{0} is rejected when the P value < .05, then of course the classical error rate is .05.”[ii] (ibid.)
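Their parenthetical claim, and the addendum in note [ii], are easy to check by simulation. Here is a rough sketch (mine, not Berger and Sellke’s): when the null hypothesis is true the P value is uniformly distributed, so the rule “reject when P < .05” errs in 5% of repetitions, and the average P value among the rejections is about .025.

```python
import math, random

def std_normal_sf(z):
    return 0.5 * math.erfc(z / math.sqrt(2))

random.seed(1)
mu0, sigma, n = 0.0, 1.0, 25
reps = 200_000
rejected_p = []

for _ in range(reps):
    # Data generated with the null hypothesis true (mu = mu0)
    xbar = random.gauss(mu0, sigma / math.sqrt(n))
    p = std_normal_sf((xbar - mu0) / (sigma / math.sqrt(n)))
    if p < 0.05:
        rejected_p.append(p)

print("Type I error rate:", len(rejected_p) / reps)                        # close to 0.05
print("mean P value given rejection:", sum(rejected_p) / len(rejected_p))  # close to 0.025
```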

Thus, *P values are error probabilities*, but their intended justification (by Fisher?) was not a matter of a behavioristic appeal to low long-run error rates, but rather, something more inferential or evidential. We can actually strengthen their argument in a couple of ways. Firstly, we can remove the business of “actual” versus “hypothetical” repetitions, because the behavioristic justifications that they are trying to call out are also given in terms of hypotheticals. Moreover, behavioristic appeals to controlling error rates are not limited to “reject/do not reject”, but apply even where the inference is in terms of an inferred discrepancy or other test output.

The problem is that the inferential vs behavioristic distinction does not separate Fisherian P-values from confidence levels and Type 1 and 2 error probabilities. *All* of these are amenable to *both* types of interpretation! More to follow in installment #2.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Installment 2:** **Mirror Mirror on the Wall, Who’s the More Behavioral of them all?**

Granted, the founders did not make out intended inferential construals fully—though representatives from Fisherian and N-P camps took several steps. At the same time, members of both camps also can be found talking like acceptance samplers!

Berger and Sellke had said: “If one introduces a decision rule into the situation by saying that H_{0} is rejected when the P value < .05, then of course the classical error rate is .05.” Good. Then we can agree that it is mathematically an error probability. They simply don’t think it reflects the Fisherian ideal.

**3. Fisher as acceptance sampler.**

But it was Fisher, after all, who declared that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (DOE 15-16)

Or to quote from an earlier article of Fisher (1926):

…we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen.

The above is a more succinct version of essentially the same points Fisher makes in DOE.[iii]

No wonder Neyman could tell Fisher to look in the mirror (as it were): *“Pearson and I were only systematizing your practices for how to interpret data, using those nice charts you made. True, we introduced the alternative hypothesis (and the corresponding type 2 error), but that was only to give a rationale, and apparatus, for the kinds of tests you were using. You never had a problem with the Type 1 error probability, and your concern for how best to increase “sensitivity” was to be reflected in the power assessment. You had no objections—at least at first”.* See this post.

The dichotomous “up-down” spirit that Berger and Sellke suggest is foreign to Fisher is not foreign at all. Again from DOE:

Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretation. ….The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; …and on the other hand, results which show no significant discrepancy from this hypothesis. (DOE 15)

Even where Fisher is berating Neyman for introducing the Type 2 error–he had no problem with Type 1 errors, and both were fine in cases of estimation–Fisher falls into talk of actions, as Neyman points out (Neyman 1956, Triad).

“The worker’s real attitude in such a case might be, according to the circumstances:

(a) “the probable deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.” (Fisher 1955, Triad, p. 73)

Pearson responds (1955) that this is entirely the type of interpretation they imagined to be associated with the bare mathematics of the test. And Neyman made it clear early on (though I didn’t discover it at first) that he intended “accept” to serve merely as a shorthand for “do not reject”. See this recent post, which includes links to all three papers in the “triad” (by Fisher, Neyman, and Pearson).

Here’s George Barnard (1972):

“In fact Fisher referred approvingly to the concept of the power curve of a test procedure and although he wrote: ‘On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or ‘decide’ between two or more hypotheses, have greatly obscured their understanding’, he was careful to go on and add ‘when taken not as contingent possibilities but as elements essential to their logic’.” (129).

To see how Fisher links power to his own work early on, please check this post.

So we are back to the key question: what is the basis for Berger and Sellke (and others who follow similar lines of criticism) to allow error probabilities in the case of N-P significance tests and confidence intervals, and not in the case of P-values? It cannot be whether the method involves a rule for mapping outcomes to interpretations (be there two or three—the third might be N-P’s initial “remain undecided” or “get more data”), because we’ve just seen that to be true of Fisherian tests as well.

**4.** **Fixing the type 1 error probability**

But isn’t the issue that N-P tests fix the type 1 error probability in advance? Firstly, we must distinguish between fixing the P value threshold to be used in each application, and justifying tests solely by reference to a control of long run error (behavioral justification). So what about the first point of predesignating the threshold? Actually, this was more Fisher than N-P:

“Neyman and Pearson *followed Fisher’s* adoption of a fixed level,” Erich Lehmann tells us (Lehmann 1993, 1244). Lehmann is flummoxed by the accusation of fixed levels of significance since “[U]nlike Fisher, Neyman and Pearson (1933, p. 296) did not recommend a standard level but suggested that ‘how the balance [between the two kinds of error] should be struck must be left to the investigator.’” (ibid.) From their earliest papers, they stressed that the tests were to be “used with discretion and understanding” depending on the context. Pearson made it clear that he thought it “irresponsible”, in a matter of importance, not to distinguish rejections at the .025 and the .05 levels.[iv] (See this post.) And as we already saw, Lehmann (who developed N-P tests as decision rules) recommends reporting the attained P value.

In a famous passage,[v] Fisher (1956) raises the criticism—but without naming names:

A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection….However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.

It is assumed he is speaking of N-P, or at least Neyman, but I wonder…

Anyway, the point is that the mathematics admits of different interpretations and uses. The “P values are not error rates” argument really boils down to claiming that the justification for using P-values *inferentially* is not *merely* that if you repeatedly did this you’d rarely erroneously interpret results (that’s necessary but not sufficient for the inferential warrant). That, of course, is what I (and others) have been arguing for ages—but I’d extend this to N-P significance tests and confidence intervals, at least in contexts of scientific inference. See, for example, Mayo and Cox (2006/2010), Mayo and Spanos (2006). We couldn’t even express the task of how to construe error probabilities inferentially if we could only use the term “error probabilities” to mean something justified only by behavioristic long-runs.

**5. What about the Famous Blow-ups?**

What about the big disagreement between Neyman and Fisher (Pearson is generally left out of it)? Well, I think that as hostilities between Fisher and Neyman heated up, the former got more and more evidential (and even fiducial) and the latter more and more behavioral. Still, what has made a lasting impression on people, understandably, are Fisher’s accusations that Neyman (if not Pearson) converted his tests into acceptance sampling devices, more suitable for making money in the U.S. or Russian 5 year plans, than thoughtful inference. (But remember Pearson’s and Neyman’s responses.) Imagine what he might have said about today’s infatuation with converting P value assessments to dichotomous outputs to compute science-wise error rates: Neyman on steroids.[vi]

By the way, it couldn’t have been too obvious that N-P distorted his tests, since Fisher tells us in 1955 that it was only when Barnard brought it to his attention that “despite agreeing mathematically in very large part”, there is a distinct *philosophical position* emphasized at least by Neyman. So it took like 20 years to realize this? (Barnard also told me this in person, recounted in this theater production.)

Here’s an enlightening passage from Cox (2006):

Neyman and Pearson “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 Fisher was writing to Neyman encouragingly about this work, but relations soured, notably when Fisher greatly disapproved of a paper of Neyman’s on experimental design and no doubt partly because their being in the same building at University College London brought them too close to one another!” (195)

*Being in the same building, indeed!* Recall Fisher declaring that if Neyman teaches in the same building and doesn’t use his book, he would oppose him in all things. See this post for details on some of their anger management problems.

The point is that it is absurd to base conceptions of inferential methods on personality disputes rather than the mathematical properties of tests (and their associated interpretations). These two approaches are best seen as offering clusters of tests appropriate for different contexts within the large taxonomy of tests and estimation methods. We can agree that the radical behavioristic rationale for error rates is not the rationale intended by Fisher in using P-values. I would argue it was not the rationale intended by Pearson, nor, much of the time, by Neyman. Yet we should be beyond worrying about what the founders really thought. *It’s the methods, stupid.*

Readers should not have to go through this “he said/we said” history again. Enough! Nor should they be misled into thinking there’s a deep inconsistency which renders all standard treatments invalid (by dint of using both N-P and Fisherian tests).

So has pure analytic philosophy, by clarifying terms (along with a bit of history of statistics), solved the apparent disagreement with Berger and Sellke (1987) and others?

It’s gotten us somewhere, yet there’s a big problem that remains. TO BE CONTINUED ON A NEW POST

**REFERENCES:**

Barnard, G. (1972). “Review of ‘The Logic of Statistical Inference’ by I. Hacking” *Brit. J. Phil. Sci.* 23(2): 123-132.

Berger, J. O. and Sellke, T. (1987) “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* 82: 112-139.

Casella, G. and Berger, R. L. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* 82: 106-111, 123-139.

Cox, D. R. (2006) *Principles of Statistical Inference*. Cambridge: Cambridge University Press.

Cox, D. R. & Hinkley, D. V. (1974). *Theoretical Statistics*, London, Chapman & Hall.

Fisher, R. A. (1926). “The Arrangement of Field Experiments”, *J. of Ministry of Agriculture*, Vol. XXXIII, 503-513.

Fisher, R. A. (1947). *The Design of Experiments* (4^{th} ed.), New York: Hafner.

Fisher, R. A. (1955) “Statistical Methods and Scientific Induction,” *Journal of The Royal Statistical Society *(B) 17: 69-78.

Fisher, R.A. (1956). *Statistical Methods and Scientific Inference*, Hafner

Gibbons, J. & Pratt, J. W. (1975). “P-values: Interpretation and Methodology”, *The American Statistician* 29: 20-25.

Lehmann, E. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” *J. Amer. Statist. Assoc.*, 88(424):1242-1249.

Lehmann, E. L. and Romano, J. P. (2005). *Testing Statistical Hypotheses* (3^{rd} ed.), New York: Springer.

Mayo, D.G. and Cox, D. R. (2006/2010) “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal of Philosophy of Science*, 57: 323-357.

Neyman, J. (1977) “Frequentist Probability and Frequentist Statistics,” *Synthese* 36: 97-131.

Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher,” *Journal of the Royal Statistical Society* (B), 18:288-294.

Neyman, J. and Pearson, E.S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” *Philosophical Transactions of the Royal Society of London*, (A), 231, 289-337.

Pearson, E.S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” *Biometrika* 34 (1/2): 139-167.

Pearson, E. S. (1955). “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society, *(B), 17: 204-207.

____________

[i] With the usual inversions.

[ii] They add “but the expected P value given rejection is .025, an average understatement of the error rate by a factor of two.”

[iii] Neyman did put in a plug for developments in empirical Bayesian methods in his 1977 Synthese paper.

[iv] Pearson says,

“Were the action taken to be decided automatically by the side of the 5% level on which the observations fell, it is clear that [the precise p-value]…would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (1947, 192). See the post: http://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

[v] From the *Design of Experiments* (DOE):

The test of significance (13):

“It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20–the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent–then it would be useless for him to experiment with only 3 cups of tea…. It is usual and convenient for the experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. …we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (emphasis added)

On 46-7 Fisher clarifies something people often confuse: the force of the test is due not to the low probability of the event itself, but “rather to the fact, very near in this case, that the correctness of the assertion would entail an event of this low probability.”

[vi] It follows a paragraph criticizing Bayesians.

Filed under: frequentist/Bayesian, J. Berger, P-values, Statistics

Today is **Egon Pearson’s birthday: 11 August 1895-12 June, 1980.**

E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter to Birnbaum (1974), E. Pearson talks about N-P theory admitting of two interpretations, behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.)

When Erich Lehmann, in his review of my “Error and the Growth of Experimental Knowledge” (EGEK 1996), called Pearson “the hero of Mayo’s story,” it was because I found in E.S.P.’s work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of N-P statistics. Granted, these “evidential” attitudes and practices have never been explicitly codified to guide the interpretation of N-P tests. If they had been, I would not be on about providing an inferential philosophy all these years.[i] Nevertheless, “Pearson and Pearson” statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics, I suspect. One of the few sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”. It begins like this:

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler–or worse.

The original heresy, as we shall see, was a Pearson one!…

To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE [iii]

Pearson doesn’t mean it was he who endorsed the behavioristic model that Fisher is here attacking.[ii] The “original heresy” refers to the break from Fisher in the explicit introduction of alternative hypotheses (even if only directional). Without considering alternatives, Pearson and Neyman argued, significance tests are insufficiently constrained–for evidential purposes! However, this does *not* mean N-P tests give us merely a *comparativist* appraisal (as in a report of relative likelihoods!)

[i] Noteworthy leaders in this “evidential interpretation” are David Cox and Allan Birnbaum.

[ii] Fisher’s tirades against behavioral interpretations of “his” tests are almost entirely a reflection of his break with Neyman (after 1935) rather than any radical disagreement either in philosophy or method. Fisher could be even more behavioristic in practice (if not in theory) than Neyman, and Neyman could be even more evidential in practice (if not in theory) than Fisher. Contemporary writers love to harp on the so-called “inconsistent hybrid” combining Fisherian and N-P tests, but it’s largely a lot of hoopla growing out of their taking Fisher-Neyman personality feuds at face value. It’s time to dismiss these popular distractions: they are an obstacle to progress in statistical understanding. *The only thing that matters is what the methods are capable of doing! For more on this, see “it’s the methods, stupid!”*

[iii] See also Aris Spanos: “Egon Pearson’s Neglected Contributions to Statistics,” and my “E.S. Pearson’s statistical philosophy” from 2 years ago.

References:

*The “Triad” (3 short, key, papers)*:

- Fisher, R. A. (1955), “Statistical Methods and Scientific Induction,” *Journal of the Royal Statistical Society* (B) 17: 69-78.
- Neyman, J. (1956), “Note on an Article by Sir Ronald Fisher,” *Journal of the Royal Statistical Society*, Series B (Methodological), 18: 288-294.
- Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society* (B), 17: 204-207.

Also of relevance:

Erich Lehmann’s (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?“. *Journal of the American Statistical Association*, Vol. 88, No. 424: 1242-1249.

Mayo, D. (1996), “Why Pearson Rejected the Neyman-Pearson (Behavioristic) Philosophy and a Note on Objectivity in Statistics” (Chapter 11) in *Error and the Growth of Experimental Knowledge.* Chicago: University of Chicago Press.

Filed under: phil/history of stat, Philosophy of Statistics, Statistics Tagged: Egon Pearson, Statistical hypothesis testing

**Blog Contents: June and July 2014***

(6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)

(6/9) “The medical press must become irrelevant to publication of clinical trials.”

(6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”

(6/14) “Statistical Science and Philosophy of Science: where should they meet?”

(6/21) Big Bayes Stories? (draft ii)

(6/25) Blog Contents: May 2014

(6/28) Sir David Hendry Gets Lifetime Achievement Award

(6/30) Some ironies in the ‘replication crisis’ in social psychology (4^{th} and final installment)

**July 2014**

(7/7) Winner of June Palindrome Contest: Lori Wike

(7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)

(7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)

(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)

(7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?

(7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)

(7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

*Compiled by Nicole Jinn and Jean Miller

Filed under: blog contents

**Winner of July 2014 Contest:**

**Manan Shah**

**Palindrome: **

**Trap May Elba, Dr. of Fanatic. I fed naan, deli-oiled naan, deficit an affordable yam part.**

**The requirements: **

In addition to using Elba, a candidate for a winning palindrome must have used **fanatic**. An optional second word was: part. An acceptable palindrome with both words would best an acceptable palindrome with just fanatic.

**Bio:**

Manan Shah is a mathematician and owner of Think. Plan. Do. LLC. (www.ThinkPlanDoLLC.com). He also maintains the “Math Misery?” blog at www.mathmisery.com. He holds a PhD in Mathematics from Florida State University.

**Statement:**

Ever since I can remember, I saw words spelled out as they were spoken. Whenever I said something, I would see my words run across on a ticker tape in my mind. The same would happen when somebody else would speak, not necessarily to me. Since childhood I loved playing word games; I played a lot of Scrabble with my dad. Once I got a handle on programming, the world of endless word lists became open to me.

Palindromes have always fascinated me. Curiously, though, I have not tried to write a palindrome generation program that went more than two words deep, but perhaps now I will! I’m excited and flattered to have won the July contest. My smiling face can be found on Twitter under the handle @shahlock.

**Book Choice:**

I was looking through the books and couldn’t decide. :) They all look so good! Let’s go with *Principles of Applied Statistics* [D. R. Cox and C. A. Donnelly 2011, Cambridge: Cambridge University Press].

*Mayo*: Congratulations Manan Shah! I might note that he had an earlier interesting candidate this month: “Elba is aware — poor fanatic. I lost rap part. Solicit an afro? Opera, was I able.”

I’ve only seen a few palindromic programs– none that created palindromes with stipulated words. I’d be glad to hear of any that can! That might ruin my contest but I doubt it.

*The word for the August Palindrome contest is “correlate”, with optional second word “mood” due by midnight Sept 5.*

Filed under: Palindrome, Rejected Posts

Nate Silver gave his ASA Presidential talk to a packed audience (with questions tweeted[i]). Here are some quick thoughts—based on scribbled notes (from last night). Silver gave a list of 10 points that went something like this (turns out there were 11):

1. statistics are not just numbers

2. context is needed to interpret data

3. correlation is not causation

4. averages are the most useful tool

5. human intuitions about numbers tend to be flawed and biased

6. people misunderstand probability

7. we should be explicit about our biases and (in this sense) should be Bayesian?

8. complexity is not the same as not understanding

9. being in the in crowd gets in the way of objectivity

10. making predictions improves accountability

Just to comment on #7, I don’t know if this is a brand new philosophy of Bayesianism, but his position went like this: Journalists and others are incredibly biased, they view data through their prior conceptions, wishes, goals, and interests, and you cannot expect them to be self-critical enough to be aware of, let alone be willing to expose, their propensity toward spin, prejudice, etc. Silver said the reason he favors the Bayesian philosophy (yes he used the words “philosophy” and “epistemology”) is that people should be explicit about disclosing their biases. I have three queries: (1) If we concur that people are so inclined to see the world through their tunnel vision, what evidence is there that they are able/willing to be explicit about their biases? (2) If priors are to be understood as the way to be explicit about one’s biases, shouldn’t they be kept separate from the data rather than combined with them? (3) I don’t think this is how Bayesians view Bayesianism or priors—is it? Subjective Bayesians, I thought, view priors as representing prior or background information about the statistical question of interest; but Silver sees them as admissions of prejudice, bias or what have you. As a confession of bias, I’d be all for it—though I think people may be better at exposing other’s biases than their own. Only thing: I’d need an entirely distinct account of warranted inference from data.

This does possibly explain some inexplicable remarks in Silver’s book to the effect that R.A. Fisher denied, excluded, or overlooked human biases since he disapproved of adding subjective prior beliefs to data in scientific contexts. Is Silver just about to recognize/appreciate the genius of Fisher (and others) in developing techniques consciously designed to find things out despite knowledge gaps, variability, and human biases? Or not?

Share your comments and/or links to other blogs discussing his talk (which will surely be posted if it isn’t already). Fill in gaps if you were there—I was far away… (See also my previous post blogging the JSM).

For a follow-up post including an 11th bullet that I’d missed, see here.

[i] What was the point of this, aside from permitting questions to be cherry-picked? (It would have been fun to see ALL the queries tweeted.) The ones I heard were limited to: how can we make statistics more attractive, who is your favorite journalist, favorite baseball player, and so on. But I may have missed some; I left before the end.

Some reader comments on JSM 14 are here. Feel free to add comments here or there on either JSM.

Filed under: Statistics, StatSci meets PhilSci

*Jerzy Neyman: April 16, 1894-August 5, 1981.* This reblogs posts under “The Will to Understand Power” & “Neyman’s Nursery” here & here.

Way back when, although I’d never met him, I sent my doctoral dissertation, *Philosophy of Statistics*, to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.

Some ~~ten~~ 15 years ago, Giere decided to quit philosophy of statistics (while remaining in philosophy of science): I think it had to do with a certain form of statistical exile (in philosophy). He asked me if I wanted his papers—a mass of work on statistics and statistical foundations gathered over many years. Could I make a home for them? I said yes. Then came his caveat: there would be a lot of them.

As it happened, we were building a new house at the time, Thebes, and I designed a special room on the top floor that could house a dozen or so file cabinets. (I painted it pale rose, with white lacquered book shelves up to the ceiling.) Then, for more than 9 months (same as my son!), I waited . . . Several boxes finally arrived, containing hundreds of files—each meticulously labeled with titles and dates. More than that, the labels were hand-typed! I thought, If Ron knew what a slob I was, he likely would not have entrusted me with these treasures. *(Perhaps he knew of no one else who would actually want them!)*

I assumed that I knew most of the papers, certainly those by Neyman, Pearson, and Birnbaum, but the files also contained early drafts, pale mimeo versions of papers, and, best of all, hand-written comments Giere had exchanged with Birnbaum and others, before the work was all tidied-up. For a year or so, the papers received few visits. Then, in 2003, after a storm that killed our internet connection, I climbed the stairs to find an article of Birnbaum’s (more on this later).

I was flipping through some articles (that I assumed were in Neyman’s books and collected works) when I found one, then another, and then a third Neyman paper that would turn out to be dramatically at odds, philosophically—in ways large and small—from everything I had read by Neyman on Neyman and Pearson methods. (Aris Spanos and I came to refer to them as the “hidden Neyman papers,” below.) So what was so startling? Stay tuned . . .

***************

[NN2]

Let me pick up where I left off in “Neyman’s Nursery,” [built to house Giere's statistical papers-in-exile]. The main goal of the discussion is to get us to exercise correctly our “will to understand power”, if only little by little. One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp. 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+. …

H_{0}: µ ≤ µ_{0} against H_{1}: µ > µ_{0}.

*The test statistic* d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ_{0} iff d(x_{0}) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x_{0}) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > cα; µ = µ_{0} + δ)
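To make (1) concrete: in the Normal model of test T+ with σ known, when µ = µ_{0} + δ the statistic d(X) = √n(X̄ – µ_{0})/σ is Normal with mean δ√n/σ and unit variance, so (1) equals the probability that a standard Normal variable exceeds cα – δ√n/σ. With only a few observations (small n) this is barely above α for modest δ, which is just Neyman’s point about the slim chance of detection.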

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed. He sounds like a Cohen-style power analyst! Still, power is calculated relative to an outcome just making/missing the cutoff cα. This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results. It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x_{0}); µ = µ_{0} + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ_{0} + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
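Here is a small numerical sketch of that contrast (in Python, with invented numbers rather than those of Mayo and Spanos 2006): a clearly non-significant result from test T+ where the power (1) to detect δ is low, yet the severity (2) for inferring µ < µ_{0} + δ is high.

```python
import math

def std_normal_sf(z):
    """Upper-tail probability of the standard Normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical test T+ of H0: mu <= 0 vs H1: mu > 0, sigma = 1 known, n = 25, alpha = .05
mu0, sigma, n, c_alpha = 0.0, 1.0, 25, 1.645
delta = 0.2                              # discrepancy of interest
xbar = 0.0                               # observed mean: nowhere near significance
d_obs = (xbar - mu0) / (sigma / math.sqrt(n))
shift = delta * math.sqrt(n) / sigma     # mean of d(X) when mu = mu0 + delta

power    = std_normal_sf(c_alpha - shift)  # (1): P(d(X) > c_alpha; mu = mu0 + delta)
severity = std_normal_sf(d_obs - shift)    # (2): P(d(X) > d_obs;   mu = mu0 + delta)

print(f"power to detect delta:       {power:.2f}")     # about 0.26 -- low
print(f"severity for mu < mu0+delta: {severity:.2f}")  # about 0.84 -- much higher
```

With these made-up numbers a power analyst looking only at (1) would hesitate to take the negative result as evidence that µ < µ_{0} + δ, whereas the attained result d(x_{0}) = 0 yields a considerably higher assessment via (2).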

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before. We were thinking at first of calling it “attained power” but then came across what some have called “observed power” which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (Why anyone would want to do this and then apply power analytic reasoning is unclear. I’ll come back to this in my next post NN3.) Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256):

FEV:^{1} A moderate p-value is evidence of the absence of a discrepancy d from H_{0} only if there is a high probability the test would have given a worse fit with H_{0} (i.e., smaller p value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises—namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.^{2}

________________________________________

2 The full version of our frequentist principle of evidence FEV corresponds to the interpretation of a small p-value:

x is evidence of a discrepancy d from H_{0} iff, if H_{0} is a correct description of the mechanism generating x, then, with high probability a less discordant result would have occurred.

Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.

It didn’t have to be done this way, but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

NOTE: There are 5 Neyman’s Nursery posts total (NN1-NN5). Search this blog for the 3 others (all relating to power).

NN3:

http://errorstatistics.com/2011/11/12/neymans-nursery-nn-3-shpower-vs-power/

REFERENCES:

Cohen, J. (1988), *Statistical Power Analysis for the Behavioral Sciences*, 2^{nd} ed. Hillsdale, NJ: Erlbaum.

Mayo, D.G. and Cox, D. R. (2006) “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal of Philosophy of Science*, 57: 323-357.

Mayo, D. G. and Spanos, A. (2010). “Introduction and Background: Part I: Central Goals, Themes, and Questions; Part II The Error-Statistical Philosophy” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: CUP: 1-14, 15-27.

Neyman, J. (1955), “The Problem of Inductive Inference,” *Communications on Pure and Applied Mathematics*, VIII, 13-46.

Filed under: Neyman, phil/history of stat, power, Statistics Tagged: negative result, Neyman, power, severe testing

I’m not there. (Several people have asked, I guess because I blogged JSM13.) If you hear of talks (or anecdotes) of interest to error statistics.com, please comment here (or twitter: @learnfromerror)

Filed under: Announcement

**Roger L. Berger**

School Director & Professor

School of Mathematical & Natural Science

Arizona State University

**Comment on S. Senn’s post: “Blood Simple? The complicated and controversial world of bioequivalence” (*)**

First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.

Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized. I do not recognize his explanation after “The argument goes as follows.” Senn says that our argument against the bioequivalence test defined by the 90% confidence interval is based on the fact that the Type I error rate for this test is zero. This is not true. The bioequivalence test in question, defined by the 90% confidence interval, has size exactly equal to α = .05. The Type I error probability is not zero. But this test is biased; the Type I error probability converges to zero as the variance goes to infinity on the boundary between the null and alternative hypotheses. This biasedness allows other tests to be defined that have size α, also, but are uniformly more powerful than the test defined by the 90% confidence interval.

The two main points in Berger and Hsu (1996) are these.

First, by considering the bioequivalence problem in the intersection-union test (IUT) framework, it is easy to define size α tests. The IUT method of test construction may be useful if the null hypothesis is conveniently expressed as a union of sets in the parameter space. In a bioequivalence problem the null hypothesis (asserting non-bioequivalence) is that the difference (as measured by the difference in log means) between the two drug formulations is either greater than or equal to .22 or less than or equal to -.22. Hence the null hypothesis is the union of two sets, the part where the parameter is greater than or equal to .22 and the part where the parameter is less than or equal to -.22. The intersection-union method considers two hypothesis tests, one of the null “greater than or equal to .22” versus the alternative “less than .22” and the other of the null “less than or equal to -.22” versus the alternative “greater than -.22.” The fundamental result about IUT’s is that if each of these tests is carried out with a size-α test, and if the overall bioequivalence null is rejected if and only if each of these individual tests rejects its respective null, then the resulting overall test has size at most α. Unlike most other methods of combining tests, in which individual tests must have size less than α to ensure the overall test has size α, in the IUT method of combining tests size α tests are combined in a particular way to yield an overall test that has size α, also.

In the usual formulation of the bioequivalence problem, each of the two individual hypotheses is tested with a one-sided, size-α t-test. If both of these individual t-tests reject their respective nulls, then bioequivalence is concluded. This has come to be called the Two One-Sided Test (TOST). The IUT method simply combines two one-sided t-tests into an overall test that has size α. This is much simpler than vague discussions about regulators not trading α, etc. This explanation makes no sense to me, because there is only one regulator (e.g., the FDA). Why appeal to two regulators?
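A minimal sketch of the TOST just described (illustrative code of mine, not from Berger and Hsu), using the ±.22 equivalence margin on the log scale mentioned above:

```python
from scipy import stats  # assumes SciPy is available

def tost(d_hat, se, df, margin=0.22, alpha=0.05):
    """Two One-Sided Tests for average bioequivalence.

    d_hat  : estimated difference in log means (test minus reference)
    se     : standard error of d_hat
    df     : degrees of freedom
    margin : equivalence limit on the log scale

    The null (non-bioequivalence) is rejected iff BOTH one-sided size-alpha
    t-tests reject: d < +margin and d > -margin.
    """
    p_upper = stats.t.cdf((d_hat - margin) / se, df)  # tests H0: d >= +margin
    p_lower = stats.t.sf((d_hat + margin) / se, df)   # tests H0: d <= -margin
    return (p_upper < alpha) and (p_lower < alpha), p_upper, p_lower

# Hypothetical data: estimated log-mean difference 0.05, standard error 0.07, 22 df
print(tost(0.05, 0.07, 22))  # (True, ...): bioequivalence concluded
```

For this particular test, declaring bioequivalence at α = .05 coincides with the ordinary 90% confidence interval for the difference lying inside (–.22, .22), the “algebraic coincidence” whose generalization Berger and Hsu warn against.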

Furthermore, in the IUT framework it is not necessary for the two individual hypotheses to be tested using one-sided t-tests. By considering the configuration of the parameter space in a bioequivalence problem more carefully, it is easy to define other tests that are size-α for the two individual hypotheses. When these are combined using the IUT method into an overall size-α test, they can yield a test that is uniformly more powerful than the TOST. We give an example of such tests in Berger and Hsu. Thus the IUT method gives simple constructions of tests that are superior in power to the usual TOST.

The second main point of Berger and Hsu is this. Describing a size-α (e.g., α = .05) bioequivalence test using a 100(1 − 2α)% (e.g., 90%) confidence interval is confusing and misleading. As Brown, Casella, and Hwang (1995) said, it is only an “algebraic coincidence” that in one particular case there is a correspondence between a size-α bioequivalence test and a 100(1 − 2α)% confidence interval. In Berger and Hsu we point out several examples in which other authors have considered other equivalence type hypotheses and have assumed they could define a size-α test in terms of a 100(1 − 2α)% confidence set. In some cases the resulting tests are conservative, in other cases liberal. *There is no* general correspondence between α-level equivalence tests and 100(1 − 2α)% confidence sets. This description of one particular size-α equivalence test in terms of a 100(1 − 2α)% confidence interval is confusing and should be abandoned.

On another point, I would disagree with Senn’s characterization that Perlman and Wu (1999) criticized our new tests on theoretical grounds. Rather, I would call them intuitive grounds. They said it sounds crazy to decide in favor of equivalence when the point estimate is outside the equivalence limits (much as Senn said). The theory, as we presented it, is sound. The tests are size-α, and uniformly more powerful than the TOST, and less biased. But in our original paper we acknowledged that they are counterintuitive. We suggested modifications that could be made to eliminate the counterintuitivity but still increase the power over the TOST (another simple argument using the IUT method).

Finally, to correct a misstatement, in the extensive discussion following the original Senn post, there are several references to the “union-intersection method of R. Berger.” The method we used is the intersection-union method. In the union-intersection method individual tests are combined in a different way. In this method if individual size-α tests are used, then the overall test has size greater than α. The individual tests must have size less than α in order for the overall test to have size α. (This is the usual situation with many methods of combining tests.)

Berger, R.L., Hsu, J.C. (1996). Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets (with Discussion). *Statistical Science*, 11, 283-319.

Brown, L. D., Casella, G. and Hwang, J. T. G. (1995a). Optimal confidence sets, bioequivalence, and the limacon of Pascal. *J. Amer. Statist. Assoc.,* 90, 880-889.

Perlman, M.D., Wu, L. (1999). The emperor’s new tests. *Statistical Science,* 14, 355-369.

Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog (error statistics.com)*.

*********

**Stephen Senn**

Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**Comment on Roger Berger**

I am interested and grateful to Dr Berger for taking the trouble to comment on my blogpost.

First let me apologise to Dr Berger if I have misrepresented Berger and Hsu[1]. The interested reader can do no better than look up the original publication. This also gives me the occasion to recommend two further articles that appeared at a very similar time to Berger and Hsu. The first[2] is by my late friend and colleague Gunther Mehring and appeared shortly before Berger and Hsu. Gunther and I did not agree on philosophy of statistics but we had many interesting discussions on the subject of bioequivalence during the period that we both worked for CIBA-Geigy, and what very little I know of the more technical aspects of general interval hypotheses is due to him. Also of interest is the paper by Brown, Hwang and Munk[3], which appeared a little after Berger and Hsu[1], and this has an interesting remark I propose to discuss:

“We tried to find a fundamental argument for the assertion that a reasonable rejection region should not be unbounded by using a likelihood approach, a Bayesian approach, and so on. However, we did not succeed. Therefore we are not convinced it should not be unbounded.”(p 2348)

Although I do not find the tests proposed by the three sets of authors[1-3] an acceptable practical approach to bioequivalence, there is a sense in which I agree with Brown et al but also a sense in which I don’t.

I agree with them because it *is* possible to find cases in which, within a Bayesian decision-analytic framework, it is possible to claim equivalence even though the point estimate falls outside the limit of equivalence. A sufficient set of conditions is the following.

- It is strongly believed that were no evidence at all available the logical course of action would be to accept bioequivalence. That is to say, *if* the only choices of actions were A: accept bioequivalence or B: reject bioequivalence, the combination of prior belief and utilities would support A.
- However, at no or little cost, a very small bioequivalence study can be run.
- This is the only further information that can be obtained.
- Thus the initial situation is that of a three-valued decision outcome, A: accept bioequivalence, B: reject bioequivalence, C: run the small experiment.
- However, if the small experiment is run the only possible actions remaining will be A or B. There is no possibility of collecting yet further information.
- Despite the fact that the evidence from the small experiment has almost no chance of elevating B *a posteriori* to being a preferable decision to A, since the information from action C is almost free, C is the preferred action.

Under such circumstances it could be logical to run a small trial and it could be logical, having run the trial, to accept decision A in preference to B even though the point estimate were outside the limits of equivalence. Basically, given such conditions, it would require an *extremely* in-equivalent result to cause one to prefer B to A. A moderately in-equivalent result would not suffice. However, the fact that the possibility, however remote, of changing B for A exists makes C a worthwhile choice initially.

So technically, at least as regards the Bayesian argument, I think that Brown et al are right. Practically, however, I can think of no realistic circumstances under which these conditions could be satisfied.
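To make the structure of these conditions concrete, here is a minimal numerical sketch of the decision problem. Every number in it (the prior on the treatment difference, the ±0.2 equivalence limits, the utilities, and the tiny study’s standard error) is hypothetical and chosen purely to exhibit the pattern described above; none of it comes from the cited papers.

```python
from scipy.stats import norm

# Hypothetical set-up: theta is the true treatment difference; "equivalent" means |theta| <= 0.2.
limits = 0.2
prior_mean, prior_sd = 0.0, 0.15       # prior belief strongly favouring equivalence
study_sd = 0.5                         # a very small, very noisy bioequivalence study
u_accept_ok, u_accept_bad = 1.0, -2.0  # utility of accepting when right / wrong; rejecting pays 0

def p_equivalent(mean, sd):
    return norm.cdf(limits, mean, sd) - norm.cdf(-limits, mean, sd)

def accept_preferred(p_eq):
    # Expected utility of A (accept) versus B (reject, utility 0)
    return p_eq * u_accept_ok + (1 - p_eq) * u_accept_bad > 0

def posterior(theta_hat):
    w = prior_sd**2 / (prior_sd**2 + study_sd**2)   # normal-normal shrinkage weight
    post_sd = (prior_sd**2 * study_sd**2 / (prior_sd**2 + study_sd**2)) ** 0.5
    return w * theta_hat, post_sd

# Condition 1: with no data at all, A is already the preferred terminal action.
print("prior: accept?", accept_preferred(p_equivalent(prior_mean, prior_sd)))
# A moderately in-equivalent point estimate (outside the +/-0.2 limit) does not flip the decision:
print("theta_hat = 0.3: accept?", accept_preferred(p_equivalent(*posterior(0.3))))
# Only an extremely in-equivalent result would flip it to B:
print("theta_hat = 3.0: accept?", accept_preferred(p_equivalent(*posterior(3.0))))
# Because the tiny study is (almost) free and there is a remote chance it flips the decision,
# running it (action C) is weakly preferable to deciding immediately.
```

With these made-up numbers, a point estimate of 0.3, outside the equivalence limits, still leaves acceptance the better bet, whereas only an extremely in-equivalent estimate flips the decision to B; and since the study costs next to nothing, that remote chance of a flip is exactly what makes running it (action C) worthwhile at the outset.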

Dr Berger and I agree that the FDA’s position on type one error rates is somewhat inconsistent so it is, of course, always dangerous to cite regulatory doctrine as a defence of a claim that an approach is logical. Nevertheless, I note that I do not see any haste by the FDA to replace the current biased test with unbiased procedures. I think that they are far more likely to consider, Dr Berger’s appeal to simplicity notwithstanding, that they are, indeed, entitled here, *as will have been the case with the innovator product*, to be provided with separate demonstrations of efficacy and tolerability. Seen in this light Schuirmann’s TOST procedure[4] is logical and consistent (apart from the choice of 5% level!).

My basic objection to unbiased tests of this sort[1-3], however, goes much deeper and here I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.) Thus my interpretation of NP is the reverse: by thinking in terms of likelihood one sometimes obtains a power bonus. If so, so much the better, but this is not the justification for likelihood, *au contraire*.

**References**

- Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. *Statistical Science* 1996; **11**: 283-302.
- Mehring G. On optimal tests for general interval hypotheses. *Communications in Statistics: Theory and Methods* 1993; **22**: 1257-1297.
- Brown LD, Hwang JTG, Munk A. An unbiased test for the bioequivalence problem. *Annals of Statistics* 1997; **25**: 2345-2367.
- Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *J Pharmacokinet Biopharm* 1987; **15**: 657-680.
- Senn, S. (6/5/2014). Blood Simple? The complicated and controversial world of bioequivalence (guest post). Mayo’s *Error Statistics Blog* (errorstatistics.com).

^^^^^^^^^^^^^^^^^^^

**Mayo remark on this exchange:** *Following Senn’s “Blood Simple” post on this blog, I asked Roger Berger for some clarification, and his post grew out of his responses. I’m very grateful to him for his replies and the post. Subsequently, I asked Senn for a comment to the R. Berger post (above), and I’m most appreciative to him for supplying one on short notice. With both these guest posts in hand, I now share them with you. I hope that this helps to decipher a conundrum that I, for one, have had about bio-equivalence tests. But I’m going to have to study these items much more carefully. I look forward to reader responses.*

*Just one quick comment on Senn’s remark: *

“….I suspect that not only Dr Berger but also Deborah Mayo will disagree with me. The Neyman-Pearson lemma is generally taken as showing that a justification for using likelihood as a basis for thinking about inference can be provided (for some simple cases) in terms of power. I do not, however, regard power as a more fundamental concept. (I believe that there is some evidence that Pearson unlike Neyman hesitated on this.)”

*My position on this, I hope, is clear in published work, but just to say one thing: I don’t think that power is “a justification for using likelihood as a basis for thinking about inference”. I agree with E. Pearson in his numbering the steps (fully quoted in this post)*

Step 2. “We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (E. Pearson 1966a, 173).

http://errorstatistics.com/2013/08/13/blogging-e-s-pearsons-statistical-philosophy/

*(Perhaps this is the evidence Senn has in mind.) Merely maximizing power, defined in the crude way we sometimes see (e.g., average power taken over mixtures, as in Cox’s and Birnbaum’s famous examples) can lead to faulty assessments of inferential warrant, but then, I never use pre-data power as an assessment of severity associated with inferences.*

*While power isn’t necessary “for using likelihood as a basis for thinking about inference” nor for using other distance measures (at Step 2), reports of observed likelihoods and comparative likelihoods are inadequate for inference and error probability control. Hence, Pearson’s Step 3.*

Does the issue Senn raises on power really play an important role in his position on bioequivalence tests? I’m not sure. I look forward to hearing from readers.

Filed under: bioequivalence, frequentist/Bayesian, PhilPharma, Statistics Tagged: R. Berger, S. Senn

**Stephen Senn**
Head, Methodology and Statistics Group

Competence Center for Methodology and Statistics (CCMS)

Luxembourg

**Responder despondency: myths of personalized medicine**

The road to drug development destruction is paved with good intentions. The 2013 FDA report, *Paving the Way for Personalized Medicine*, has an encouraging and enthusiastic foreword from Commissioner Hamburg and plenty of extremely interesting examples stretching back decades. Given what the report shows can be achieved on occasion, given the enthusiasm of the FDA and its commissioner, given the amazing progress in genetics emerging from the labs, a golden future of personalized medicine surely awaits us. It would be churlish to spoil the party by sounding a note of caution, but I have never shirked being churlish and that is exactly what I am going to do.

Reading the report, alarm bells began to ring when I came across this chart (p. 17) describing the percentage of patients for whom drugs are ineffective. Actually, I tell a lie. The alarm bells were ringing as soon as I saw the title, but by the time I saw this chart the cacophony was deafening.

The question that immediately arose in my mind was ‘how do the FDA know this is true?’ Well, the Agency very helpfully tells you how they know this is true. They cite a publication, ‘Clinical application of pharmacogenetics’[1], as the source of the chart. Slightly surprisingly, the publication predates the FDA report by 12 years (this is pre-history in pharmacogenetic terms); however, sure enough, if you look up the cited paper you will find that the authors (Spear et al) state ‘We have analyzed the efficacy of major drugs in several important diseases based on published data, and the summary of the information is given in Table 1.’ This is Table 1:

Now, there are a few differences here to the FDA report but we have to give the Agency some credit. First of all they have decided to concentrate on those who don’t respond, so they have subtracted the response rates from 100. Second, they have obviously learned an important data presentation lesson: sorting by the alphabet is often inferior to sorting by importance. Unfortunately, they have ignored an important lesson that texts on graphical excellence impart: don’t clutter your presentation with chart junk[2]. However, in the words of Meatloaf, ‘Two out of three ain’t bad,’ so I have to give them some credit.

However, that’s not quite the end of the story. Note the superscripted 1 in the rubric of the source for the FDA claim. That’s rather important. This gives you the source of the information, which is the *Physician’s Desk Reference*, 54^{th} edition, 2000.

At this point of tracing back, I discovered what I knew already. What the FDA is quoting are zombie statistics. This is not to impugn the work of Spear et al. The paper makes interesting points. (I can’t even blame them for not citing one of my favourite papers[3], since it appeared in the same year.) They may well have worked diligently to collect the data they did but the trail runs cold here. The methodology is not given and the results can’t be checked. It may be true, it may be false but nobody, and that includes the FDA and its commissioner, knows.

But there is a further problem. There is a very obvious trap in using *observed* response rates to judge what percentage of patients respond (or don’t): all such measures are subject to within-patient variability. To take a field I have worked in, asthma: if you take (as the FDA has on occasion) a 15% increase in Forced Expiratory Volume in one second (FEV_{1}) above baseline as indicating a response, you will classify someone with a 14% value as a non-responder and someone with a 16% value as a responder, but measure them again and they could easily change places (see chapter 8 of Statistical Issues in Drug Development[4]). For a bronchodilator I worked on, mean bronchodilation at 12 hours was about 18%, so you simply needed to base your measurement of effect on a number of replicates if you wanted to increase the proportion of responders.
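A few lines of simulation show how large this misclassification can be; the 18% mean improvement, the 15% cut-off and the size of the within-patient noise below are illustrative values only, not data from any trial mentioned here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients = 10_000
true_effect = 0.18   # suppose every patient's true FEV1 improvement is 18% (illustrative)
within_sd = 0.06     # hypothetical within-patient measurement variability

visit1 = true_effect + rng.normal(0, within_sd, n_patients)
visit2 = true_effect + rng.normal(0, within_sd, n_patients)

responder1 = visit1 >= 0.15   # "responder" = at least a 15% improvement on that occasion
responder2 = visit2 >= 0.15

print("labelled responder at visit 1:", responder1.mean())
print("changed label between visits:", (responder1 != responder2).mean())
# Every patient has exactly the same true effect, yet a sizeable fraction of
# "non-responders" become "responders" (and vice versa) simply on remeasurement.
```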

There is a very obvious trap (or at least it ought to be obvious to all statisticians) in naively using reported response rates as an indicator of variation in true response[5]. This can be illustrated using the graph below. On the left-hand side you see an ideal counterfactual experiment: every patient can be treated under identical conditions with both treatments. In this thought experiment the difference that the treatment makes to each patient is constant. However, life does not afford us this possibility. If what we choose to do is run a parallel group trial, we will have to randomly give each patient either placebo or the active treatment. The right-hand panel shows what we will see and is obtained by randomly erasing one of the two points for each patient in the left-hand panel. It is now impossible to judge individual response: all that we can judge is the average.

Of course, I fixed things in the example so that the response was constant, and it clearly might not be. But that is not the point. The point is that the diagram shows that by naively using raw outcomes we will overestimate the personal element of response. In fact, only repeated cross-over trials can reliably tease out individual response from other components of variation, and in many indications these are not possible; even where they are possible they are rarely run[6].
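The erased-points thought experiment is easy to mimic with made-up numbers: in the sketch below the true treatment effect is exactly the same for every patient, yet the parallel-group data can only recover the average, and the spread seen within each arm is entirely between-patient variation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
baseline = rng.normal(2.0, 0.5, n)   # hypothetical between-patient differences in outcome
effect = 0.3                         # the same true treatment benefit for every patient

outcome_placebo = baseline           # counterfactual outcome on placebo
outcome_treated = baseline + effect  # counterfactual outcome on treatment

# Parallel-group trial: each patient contributes only one of the two counterfactual points.
on_treatment = rng.random(n) < 0.5
observed = np.where(on_treatment, outcome_treated, outcome_placebo)

print("between-arm mean difference:",
      round(observed[on_treatment].mean() - observed[~on_treatment].mean(), 3))
print("spread of outcomes within the treated arm:",
      round(observed[on_treatment].std(), 3))
# The average effect (about 0.3) is recoverable, but the spread within each arm
# reflects between-patient baseline differences, not variation in the treatment
# effect, which here is exactly constant for everyone.
```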

So to sum up, the reason the FDA ‘knows’ that 40% of asthmatic patients don’t respond to treatment is because a paper from 2001, with unspecified methodology, most probably failing to account for within patient variability, reports that the authors found this to be the case by studying the *Physician’s Desk Reference*.

This is nothing short of a scandal. I don’t blame the FDA. I blame me and my fellow statisticians. Why and how are we allowing our life scientist colleagues to get away with this nonsense? *They* genuinely believe it. *We* ought to know better.

**References**

- Spear, B.B., M. Heath-Chiozzi, and J. Huff, *Clinical application of pharmacogenetics.* Trends in Molecular Medicine, 2001. **7**(5): p. 201-204.
- Tufte, E.R., *The Visual Display of Quantitative Information*. 1983, Cheshire, Connecticut: Graphics Press.
- Senn, S.J., *Individual Therapy: New Dawn or False Dawn.* Drug Information Journal, 2001. **35**(4): p. 1479-1494.
- Senn, S.J., *Statistical Issues in Drug Development*. 2007, Hoboken: Wiley. 498.
- Senn, S., *Individual response to treatment: is it a valid assumption?* BMJ, 2004. **329**(7472): p. 966-8.
- Senn, S.J., *Three things every medical writer should know about statistics.* The Write Stuff, 2009. **18**(3): p. 159-162.

Filed under: evidence-based policy, Statistics, Stephen Senn

Since the comments to my previous post are getting too long, I’m reblogging it here to make more room. I say that the issue raised by J. Berger and Sellke (1987) and Casella and R. Berger (1987) concerns evaluating the evidence in relation to a given hypothesis (using error probabilities). Given the information that *this* hypothesis H* was randomly selected from an urn with 99% true hypotheses, we wouldn’t say this gives a great deal of evidence for the truth of H*, nor suppose that H* had thereby been well-tested. (H* might concern the existence of a standard model-like Higgs.) I think the issues about “science-wise error rates” and long-run performance in dichotomous, diagnostic screening should be taken up separately, but commentators can continue on this, if they wish (perhaps see this related post).

** 0. July 20, 2014: **Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

**1. What you should ask…**

Discussions of P-values in the Higgs discovery invariably recapitulate many of the familiar criticisms of P-values (some “howlers”, some not). When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, denying that the P-value aptly measures evidence, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

An honest answer might be:

“What I mean is that when I put a lump of prior probability π_{0} > 1/2 on a point null H_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H_{0}.”

Your reply might then be: *(a) P-values are not intended as posteriors in H _{0} and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. The report on discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated.*

You might toss in the query: *Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?*

If you wanted to go even further you might rightly ask: **And by the way, what warrants your lump of prior to the null?** (See Section 3.)

^^^^^^^^^^^^^^^

**2 . J. Berger and Sellke and Casella and R. Berger**

Of course it is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0} (Jeffreys-Good-Lindley paradox). I.J. Good (I don’t know if he was the first) recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. For some rules of thumb see Section 5.

The JGL result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two sided test of the Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0} .

“If *n* = 50…, one can classically ‘reject H_{0} at significance level p = .05,’ although Pr(H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).

If *n* = 1000, a result statistically significant at the .05 level leads to a posterior to the null going from .5 to .82!
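For readers who want to see where numbers like .52 and .82 come from, here is a minimal sketch of the calculation, assuming (as one conventional choice in this literature) that the alternative spreads its prior mass as μ ~ N(μ_{0}, σ²) with the remaining π_{0} = .5 on the point null; it illustrates the structure of the result rather than reproducing Berger and Sellke’s exact tables.

```python
from math import sqrt, exp

def posterior_null(z, n, pi0=0.5):
    # Bayes factor for H0: mu = mu0 against an alternative that spreads its
    # prior mass as mu ~ N(mu0, sigma^2), for a z-score from a sample of size n.
    bf01 = sqrt(n + 1) * exp(-0.5 * z**2 * n / (n + 1))
    post_odds = (pi0 / (1 - pi0)) * bf01
    return post_odds / (1 + post_odds)

# z = 1.96: just significant at the two-sided .05 level
for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
# n = 50   -> about 0.52
# n = 1000 -> about 0.82
```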

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence *for* it!

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to *H*_{0}, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. *A Dialogue*.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

^^^^^^^^^^^^^^^^^

**3. A Dialogue **(ending with a little curiosity in J. Berger and Sellke):

*So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.*

EPA Rep: We’ve conducted two studies (each with random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H_{0} is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H_{0}?

P-value denier: If I gave H_{0} a value lower than .5, then, if there’s evidence to reject H_{0}, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’?

*The last sentence is a direct quote from Berger and Sellke!*

“When giving numerical results, we will tend to present Pr(H_{0}|x) for π_{0} = 1/2. The choice of π_{0} = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (Some might argue that π_{0} should even be chosen larger than 1/2 since H_{0} is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π_{0} < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H_{0}, assigning prior probability .1 to H_{0}, and my conclusion is that H_{0} has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π_{0} should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke, 115)

*There’s something curious in assigning a high prior to the null H_{0}–thereby making it harder to reject (or find evidence against) H_{0}–and then justifying the assignment by saying it ensures that, if you do reject H_{0}, there will be a meaningful drop in the probability of H_{0}. What do you think of this?*

^^^^^^^^^^^^^^^^^^^^

**4. The real puzzle. **

I agree with J. Berger and Sellke that we should not “force agreement”. What’s puzzling to me is why it would be thought that an account that manages to evaluate how well or poorly tested hypotheses are–as significance tests can do–would want to measure up to an account that can only give a comparative assessment (be they likelihoods, odds ratios, or other) [ii]. From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke. Personally, I don’t see why an error statistician would wish to construe the P-value as how “believe worthy” or “bet worthy” statistical hypotheses are. Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers” (and never mind philosophies), but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “**what is the intended interpretation of the prior, again****?**” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the conventionalist Bayesians) is that the prior is undefined and is simply a way to compute a posterior. Never mind that they don’t agree on which to use. Your question should be: **“Please tell me: how does a posterior, based on an undefined prior used solely to compute a posterior, become “the” measure of evidence that we should aim to match?” **

^^^^^^^^^^^^^^^^

**5. (Crude) Benchmarks for taking into account sample size: **

Throwing out a few numbers may give sufficient warning to those inclined to misinterpret statistically significant differences at a given level but with varying sample sizes (please also search this blog [iii]). Using the familiar example of Normal testing with T+ :

H_{0}: μ ≤ 0 vs. H_{1}: μ > 0. Let σ = 1, n = 25, so σ_{x̄} = (σ/√n).

*For this exercise, fix the sample mean M to be just significant at the .025 level for a 1-sided test, and vary the sample size n. In one case, n = 100, in a second, n = 1600. So, for simplicity, using the 2-standard-deviation cut-off:*

m_{0} = 0 + 2(σ/√n).

With stat sig results from test T+, we worry about unwarranted inferences of the form: *μ > 0 + γ*.

*Some benchmarks:*

* The lower bound of a 50% confidence interval is 2(σ/√*n*). *So there’s quite lousy evidence that μ > 2(σ/√n).*

* The lower bound of the 93% confidence interval is .5(σ/√*n*). *So there’s decent evidence that μ > .5(σ/√n).*

* For *n* = 100, σ/√*n* = .1 (σ = 1); for *n* = 1600, σ/√*n* = .025.

* Therefore, a .025 stat sig result is fairly good evidence that μ > .05 when *n* = 100; for *n* = 1600 the same benchmark gives decent evidence only that μ > .0125.

You’re picking up smaller and smaller discrepancies as *n* increases, when P is kept fixed. Taking the indicated discrepancy into account avoids erroneous construals and scotches any “paradox”.
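A quick way to compute these benchmarks (a sketch only; the ~93% bound corresponds to the 1.5-standard-error cut-off used above):

```python
from math import sqrt

sigma = 1.0
for n in (100, 1600):
    se = sigma / sqrt(n)
    m = 2 * se              # observed mean just significant at ~.025 (1-sided, 2 SE cut-off)
    lower_50 = m - 0.0 * se # 50% lower confidence bound: the observed mean itself
    lower_93 = m - 1.5 * se # ~93% lower confidence bound (1.5 SE below the observed mean)
    print(f"n={n:>4}: SE={se:.3f}, m={m:.3f}, 50% bound={lower_50:.3f}, 93% bound={lower_93:.4f}")
# n= 100: the ~93% bound is .05   -> decent evidence that mu > .05
# n=1600: the ~93% bound is .0125 -> decent evidence only that mu > .0125
```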

^^^^^^^^^^

**6.**** “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” (Cousins, 2014)**

Robert Cousins, a HEP physicist willing to talk to philosophers and from whom I am learning about statistics in the Higgs discovery, illuminates the key issues, models and problems in his paper with that title. (The reference to Bernardo 2011 that I had in mind in Section 4 is cited on p. 26 of Cousins 2014).

^^^^^^^^^^^^^^^^^^^^^^^^^^

*7. July 20, 2014:* **There is a distinct issue here…** That “P-values overstate the evidence against the null” is often stated as an uncontroversial “given”. In calling it a “fallacy”, I was being provocative. However, in dubbing it a fallacy, some people assumed I was referring to one or another formal fallacy, such as transposing the conditional; that is not the issue here.

The problem with using a P-value to assess evidence against a given null hypothesis H_{0} is that it tends to be smaller, even much smaller, than an apparently plausible posterior assessment of H_{0}, given data *x* (especially as n increases). The mismatch is avoided with a suitably tiny P-value, and that’s why many recommend this tactic. [iv] Yet I say the correct answer to the question in my (new) title is: “fallacious”. It’s one of those criticisms that have not been thought through carefully, but rather repeated based on some well-known articles.

[i] We assume the P-values are “audited”, that they are not merely “nominal”, but are “actual” P-values. Selection effects, cherry-picking and other biases would alter the error probing capacity of the tests, and thus the purported P-value would fail the audit.

[ii] Note too that the comparative assessment will vary depending on the “catchall”.

[iii] See for example:

Section 6.1 “fallacies of rejection“.

Slide #8 of Spanos lecture in our seminar Phil 6334.

[iv] So we can also put aside for the moment the issue of P-values not being conditional probabilities to begin with. We can also (I hope) distinguish another related issue, which requires a distinct post: using ratios of frequentist error probabilities, e.g., type 1 errors and power, to form a kind of “likelihood ratio” in a screening computation.

**References** (minimalist). A number of additional links are given in comments to my previous post.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). *J. Amer. Statist. Assoc. ***82: **112–139.

Casella G. and Berger, R.. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc. ***82 **106–111, 123–139.

*Blog posts:*

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.

Filed under: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

**“Higgs Analysis and Statistical Flukes: part 2″**

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H_{0}: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

*Error probabilities*

In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > c_{α}; H_{0}) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H_{0}) ≤ .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, *p*_{0}. In general,

Pr(P < p_{0}; H_{0}) < p_{0}

and in particular,

(2) Pr(Test T yields P < .0000003; H_{0}) < .0000003.

For test T to yield a “worse fit” with *H*_{0 }(smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
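The numbers in (1) and (2) are just one-sided Normal tail areas, easily checked if one treats the test statistic as approximately standard Normal under H_{0}:

```python
from scipy.stats import norm

for sigmas in (5, 7):
    p = norm.sf(sigmas)   # one-sided upper tail area beyond the stated number of sigmas
    print(f"{sigmas} sigma: p ~ {p:.1e}")
# 5 sigma: p ~ 2.9e-07  (the .0000003 appearing in (1) and (2))
# 7 sigma: p ~ 1.3e-12
```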

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of the test statistic *d*(**X**), or of the P-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)

*An implicit principle of inference or evidence*

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed, under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between *H*_{0} and the probabilities of outcomes is an intimate one: the various statistical nulls live their lives to refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “*H*_{0} is true” is a shorthand for a very long statement about the type of data generating procedure at hand.

*Severity and the detachment of inferences*

The sampling distributions serve to give counterfactuals. In this case they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to *H*_{0}.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. *Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference*. (This is why bootstrap, and other types of, resampling works when one has a random sample from the process or population of interest.)

The *severity principle*, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually **detached** from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

*Qualifying claims by how well they have been probed*

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

*Telling what’s true about significance levels*

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to *H*_{0}. Worse, (1 – the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to *H*_{0}, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just the Standard Model Higgs hypothesis and its denial, but every possible rival that could explain the data, including rivals not yet even thought of.

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

*Those prohibited phrases*

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that Ho: background alone adequately describes the process.

Ho does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under Ho”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < po}. Even when Ho is true, such “signal like” outcomes may occur. They are po level flukes. Were such flukes generated even with moderate frequency under Ho, they would not be evidence against Ho. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from Ho.
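The "p0-level fluke" claim, that Pr(P < p0; H0) is no greater than p0, is easy to check by simulation at a computationally feasible threshold (using a plain one-sided Normal test as a stand-in for the actual ATLAS analysis):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments, n_obs = 200_000, 25
p0 = 0.001   # a computationally feasible "fluke" threshold (the 5 sigma one is ~3e-7)

# Background-only (H0) world: each experiment yields a one-sided p-value for its mean.
z = rng.normal(0.0, 1.0, (n_experiments, n_obs)).mean(axis=1) * np.sqrt(n_obs)
p_values = norm.sf(z)

print("fraction of background-only experiments with p <", p0, ":",
      (p_values < p0).mean())
# The fraction is close to p0 itself: p0-level "flukes" occur, but only that rarely.
```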

I am repeating myself, I realize, in the hopes that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain Ho as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

*Triggering, indicating, inferring*

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

I hope it is (more or less) clear that burgundy is new; black is old. If interested: *See statistical flukes (part 3)*

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2 http://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable: “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, Design of Experiments, 1947, 14).*

New Notes

[1] I plan to do some new work in this arena soon, so I’ll be glad to have comments.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

REFERENCES (from March, 2013 post):

ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” *Annals of Mathematical Statistics*, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” *Scandinavian Journal of Statistics*, 4: 49–70.

Mayo, D.G. (1996), *Error and the Growth of Experimental Knowledge*, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323–357.

___________

Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis http://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

Filed under: Higgs, highly probable vs highly probed, P-values, Severity, Statistics

July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”)

Then, as details of the statistical analysis trickled down to the media, the P-value police (Wasserman, see (2)) came out in full force to examine if reports by journalists and scientists could in any way or stretch of the imagination be seen to have misinterpreted the sigma levels as posterior probability assignments to the various models and claims. The HEP (High Energy Physics) community had been painstaking in their communication of the results, but the P-bashers insisted on transforming the intended conditional….(I’ll come back to this.)

As for the HEP researchers, a central interest now is to explore any and all leads in the data that would point to physics beyond the Standard Model (BSM). The Higgs is just turning out to be too “perfectly plain vanilla,” and they’ve been unable to reject an SM null for years (3) (more on this later). So on this two-year anniversary, I’ll reblog a few of the Higgs posts, with some updated remarks—beginning with the first one below.

I suppose[d] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i]. Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson. It is asked: “Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

*Bad science? * I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson. The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson. Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level. Five standard deviations, assuming normality, means a p-value of around 0.0000005. A number of questions spring to mind.

1. Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?

2. Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?

3. We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?

If anyone has any answers to these or related questions, I’d be interested to know and will be sure to pass them on to Dennis.

Regards,

Tony

—-

Professor A O’Hagan

Email: a.ohagan@sheffield.ac.uk

Department of Probability and Statistics

University of Sheffield
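As a quick check on the arithmetic in point 1 of the letter, here is a minimal sketch (my own, assuming a normal test statistic; HEP convention reports the one-sided tail area) of what 5 sigma corresponds to as a P-value:

```python
from scipy.stats import norm

z = 5.0
p_one_sided = norm.sf(z)        # upper-tail area beyond 5 sigma (the HEP convention)
p_two_sided = 2 * norm.sf(z)    # both tails

print(f"one-sided: {p_one_sided:.1e}")   # ~2.9e-07
print(f"two-sided: {p_two_sided:.1e}")   # ~5.7e-07
```

The one-sided figure is roughly 3 in 10 million; the letter’s 0.0000005 is closer to the two-sided tail.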

So given that the Higgs boson does not have such an extremely small prior probability, a proper Bayesian analysis would have enabled evidence of the Higgs long before attaining such an “extreme evidence requirement”. Why has no one tried to explain to these scientists how, with just a little Bayesian analysis, they might have been done ~~in~~ last year or years ago? I take it the Bayesian would also enjoy the simplicity and freedom of not having to adjust for the “Look Elsewhere Effect” (LEE [ii]).

Let’s see if there’s a serious follow-up.[iii]

[i] bringing it down from my “Msc Kvetching page” where I’d put it last night.

[ii] For a discussion of how the error statistical philosophy avoids the classic criticisms of significance tests, see Mayo & Spanos (2011) ERROR STATISTICS. Other articles may be found on the link to my publication page.

[iii] O’Hagan informed me of several replies to his letter at the following: http://bayesian.org/forums/news/3648

*****************************************************

(1) There’s scarce need for my “Rejected Posts” blog now that renegade thoughts can go on “twitter” (@learnfromerror), but I’ll keep it around for later.

(2) The Higgs Boson and the p-value Police: http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/

(3) The logic in this case is especially interesting. Each failure to reject nulls of this type informs about the variant of BSM ruled out. (I’ll check with Robert Cousins that I’ve put this correctly. Update: He says that I have.) Here’s a link to Cousins’ recent paper on the Higgs and foundations of statistics: http://arxiv.org/abs/1310.3791.

Filed under: Bayesian/frequentist, fallacy of non-significance, Higgs, Lindley, Statistics Tagged: comedy, Dennis V. Lindley, Higgs boson, p-value vs posterior, particle physics, significance tests

**Winner of June 2014 Palindrome Contest: First Second\*-Time Winner!**

\*Her April win is here.

**Palindrome:**

**Parsec? I overfit omen as Elba sung “I err on! Oh, honor reign!” Usable, sane motif revoices rap.**

**The requirement:** A palindrome with Elba plus overfit. (The optional second word: “average” was not needed to win.)

**Bio:**

Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

**Statement**:

I’m thrilled to be a second-time winner of the palindrome contest and my love of book collecting overrides any guilty feelings I may have about winning twice! Here’s a fun picture of me in the midst of polygonal fracturing from my June escapades. Sadly, I don’t think I can work “polygonal” into a palindrome.

I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge, due to the extreme formal constraint of exact letter reversal–which is made even more fun in a contest like this where one has to include specific words in the palindrome. I also enjoy writing palindromes about specific themes (Poe’s Raven, Oedipus Rex, Verdi’s Aida) and I have plans to write a very long palindrome about Proust one of these days.

**Book Choice**:

*Dicing with Death: Chance, Risk and Health* (Stephen Senn 2003, Cambridge: Cambridge University Press)

Filed under: Announcement, Palindrome

The article in the Chronicle of Higher Education also gets credit for its title: “Replication Crisis in Psychology Research Turns Ugly and Odd”. I’ll likely write this in installments…(2nd, 3rd, 4th)

^^^^^^^^^^^^^^^

The Guardian article answers yes to the question “Do ‘hard’ sciences hold the solution…”:

Psychology is evolving faster than ever. For decades now, many areas in psychology have relied on what academics call “questionable research practices” – a comfortable euphemism for types of malpractice that distort science but which fall short of the blackest of frauds, fabricating data.

But now a new generation of psychologists is fed up with this game. Questionable research practices aren’t just being seen as questionable – they are being increasingly recognised for what they are: soft fraud. In fact, “soft” may be an understatement. What would your neighbours say if you told them you got published in a prestigious academic journal because you cherry-picked your results to tell a neat story? How would they feel if you admitted that you refused to share your data with other researchers out of fear they might use it to undermine your conclusions? Would your neighbours still see you as an honest scientist – a person whose research and salary deserves to be funded by their taxes?

For the first time in history, we are seeing a co-ordinated effort to make psychology more robust, repeatable, and transparent.

“Soft fraud”? (Is this like “white collar” fraud?) Is it possible that holding social psych up as a genuine replicable science is, ironically, creating soft frauds too readily?

*Or would it be all to the good if the result is to so label large portions of the (non-trivial) results of social psychology?*

The sentiment in the Guardian article is that the replication program in psych is just doing what is taken for granted in other sciences; it shows psych is maturing, it’s getting *better and better all the time* …so long as the replication movement continues. Yes? [0]

^^^^^^^^

It’s hard to entirely dismiss the concerns of the pushback, dubbed in some quarters as “Repligate”. Even in this contrarian mode, you might sympathize with “those who fear that psychology’s growing replication movement, which aims to challenge what some critics see as a tsunami of suspicious science, is more destructive than corrective” (e.g., Professor Wilson, at U Va) while at the same time rejecting their dismissal of the seriousness of the problem of false positives in psych. The problem *is* serious, but there may be built-in obstacles to fixing things by the current route. From the Chronicle:

Still, Mr. Wilson was polite. Daniel Gilbert, less so. Mr. Gilbert, a professor of psychology at Harvard University, … wrote that certain so-called replicators are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” (he later took back the word “little,” writing that he didn’t know the size of the researchers involved).

Wow. Let’s read a bit more:

Scrutiny From the Replicators

What got Mr. Gilbert so incensed was the treatment of Simone Schnall, a senior lecturer at the University of Cambridge, whose 2008 paper on cleanliness and morality was selected for replication in a special issue of the journal *Social Psychology*. …In one experiment, Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental. …These studies fit into a relatively new field known as embodied cognition, which examines how one’s environment and body affect one’s feelings and thoughts. …

For instance, political extremists might literally be less capable of discerning shades of grey than political moderates—or so Matt Motyl thought until his results disappeared. Now he works actively in the replication movement.[1]

Links are here.

7/1: By the way, since Schnall’s research was testing “embodied cognition,” why wouldn’t they have subjects engage in actual cleansing activities rather than have them unscramble words about cleanliness?

^^^^^^^^^^

Another irony enters: some of the people working on the replication project in social psych are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure, to incentives to publish surprising and sexy studies, and to an overly flexible methodology opening the door to promiscuous QRPs (you know: Questionable Research Practices.) Call this the “rewards and flexibility” hypothesis. If the rewards/flex hypothesis is correct, as is quite plausible, then wouldn’t it follow that the same incentives are operative in the new psych replication movement? [2]

A skeptic of the movement in psychology could well ask how the replication can be judged sounder than the original studies. When RCTs fail to replicate observational studies, the presumption is that the RCTs would have found the effect, were it genuine. That’s why the failure is taken as an indictment of the observational study. But here, one could argue, the replication is just another study, not obviously one that *corrects* the earlier one. The question some have asked, “Who will replicate the replicators?”, is not entirely without merit. Triangulation for purposes of correction, I say, is what’s really needed. [3]

Daniel Kahneman, who first called for the “daisy chain” (after the Stapel scandal), likely hadn’t anticipated the tsunami he was about to unleash.[4]

Daniel Kahneman, a Nobel Prize winner who has tried to serve as a sort of a peace broker, recently offered some rules of the road for replications, including keeping a record of the correspondence between the original researcher and the replicator, as was done in the Schnall case. Mr. Kahneman argues that such a procedure is important because there is “a lot of passion and a lot of ego in scientists’ lives, reputations matter, and feelings are easily bruised.”

That’s undoubtedly true, and taking glee in someone else’s apparent misstep is unseemly. Yet no amount of politeness is going to soften the revelation that a published, publicized finding is bogus. Feelings may very well get bruised, reputations tarnished, careers trashed. That’s a shame, but while being nice is important, so is being right.

Is the replication movement getting psych closer to “being right”? That is the question. What if inferences from priming studies and “embodied cognition” really *are* questionable? What if the hypothesized effects are incapable of being turned into replicable science?

^^^^^^^^^

The sentiment voiced in the Guardian bristles at the thought; there is pushback even to Kahneman’s apparently civil “rules of the road”:

For many psychologists, the reputational damage [from a failed replication]… is grave – so grave that they believe we should limit the freedom of researchers to pursue replications. In a recent open letter, Nobel laureate Daniel Kahneman called for a new rule in which replication attempts should be “prohibited” unless the researchers conducting the replication consult beforehand with the authors of the original work. Kahneman says, “Authors, whose work and reputation are at stake, should have the right to participate as advisers in the replication of their research.” Why? Because method sections published by psychology journals are generally too vague to provide a recipe that can be repeated by others. Kahneman argues that successfully reproducing original effects could depend on seemingly irrelevant factors – hidden secrets that only the original authors would know. “For example, experimental instructions are commonly paraphrased in the methods section, although their wording and even the font in which they are printed are known to be significant.”

“Hidden secrets”? This was a remark sure to enrage those who take psych measurements as (at least potentially) akin to measuring the Hubble constant:

If this doesn’t sound very scientific to you, you’re not alone. For many psychologists, Kahnemann’s cure is worse than the disease. Dr Andrew Wilson from Leeds Metropolitan University points out that if the problem with replication in psychology is vague method sections then the logical solution – not surprisingly – is to publish detailed method sections. In a lively response to Kahnemann, Wilson rejects the suggestion of new regulations: “If you can’t stand the replication heat, get out of the empirical kitchen because publishing your work means you think it’s ready for prime time, and if other people can’t make it work based on your published methods then that’s your problem and not theirs.”

Prime time for priming research in social psych?

Read the rest of the Guardian article. Second installment later on…maybe….

**What do readers think?**

^^^^^^^^^^^^^^

Naturally the issues that interest me the most are statistical-methodological. Some of the methodology and meta-methodology of the replication effort is apparently being developed hand-in-hand with the effort itself—that makes it all the more interesting, while also potentially risky.

The replicationist’s question of methodology, as I understand it, is alleged to be what we might call “purely statistical”. It is not: would the initial positive results warrant the psychological hypothesis, were the statistics unproblematic? The presumption from the start was that the answer to this question is yes. In the case of the controversial Schnall study, the question wasn’t: can the hypotheses about cleanliness and morality be well tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if it’s run over? At least not directly. In other words, the statistical-substantive link was not at issue. The question is limited to: do we get the statistically significant effect in a replication of the initial study, presumably one with high power to detect the effects at issue. So, for the moment, I too will retain that as the sole issue around which the replication attempts revolve.
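To make “high power to detect the effects at issue” concrete, here is a minimal sketch (my own illustration with a hypothetical effect size, not the replication project’s actual protocol) of the usual normal-approximation sample-size calculation for a two-sample replication:

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.90):
    """Approximate per-group n for a two-sided, two-sample z-test
    to detect a standardized effect size d (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

# Hypothetical original effect of d = 0.5 (a made-up value for illustration)
print(round(n_per_group(0.5)))   # ~84 per group for 90% power at alpha = .05
```

The point is only that “high power” is a design choice made relative to a presumed effect size; nothing here speaks to the statistical-substantive link set aside above.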

Checking statistical assumptions is, of course, a part of the pure statistics question, since the P-value and other measures depend on assumptions being met at least approximately.

The replication team assigned to Schnall (U of Cambridge) reported results apparently inconsistent with the positive ones she had obtained. Schnall shares her experiences in “Further Thoughts on Replications, Ceiling Effects and Bullying” and “The Replication Authors’ Rejoinder”: http://www.psychol.cam.ac.uk/cece/blog

The replication authors responded to my commentary in a rejoinder. It is entitled “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” In it, they accuse me of “criticizing after the results are known,” or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of “increasing the credibility of published results” interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers. (Schnall)

Perhaps her criticisms are off the mark, and in no way discount the failed replication (I haven’t read them), but CARKing? Data and model checking are intended to take place post-data. So the post-data aspect of a critique scarcely renders it illicit. The statistical fraud-busting of a Smeesters or a Jens Forster was based on post-data criticisms. So it would be *ironic* if, in the midst of defending efforts to promote scientific credentials, they inadvertently labeled post-data criticisms as questionable.

^^^^^^^^^^^^^^^^^^^^^^^^^^^

Uri Simonsohn [5] at “Data Colada” discusses, specifically, the objections raised by Simone Schnall (2nd installment), and the responses by the authors who failed to replicate her work: Brent Donnellan, Felix Cheung and David Johnson.

Simonsohn does not reject out of hand Schnall’s allegation that the lack of replication can be explained away (e.g., by a “ceiling effect”). (In fact, he has elsewhere discussed a case that was rightfully absolved thereby [6].) Simonsohn provides statistical grounds for denying that a ceiling effect is to blame in Schnall’s case. However, he also agrees with Schnall in discounting the replicators’ response to the ceiling-effect charge, which was simply to lop off the most extreme results.

In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.

I don’t think that’s right either. (Data Colada)

Since the replicators here bear the burden of proof, the statistical problems with their *ad hoc* retort to Schnall are grounds for concern, or should be.

http://datacolada.org/2014/06/04/23-ceiling-effects-and-replications/
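To see why simply dropping observations at the ceiling is not an obvious fix, here is a toy simulation (my own illustration with made-up numbers, not Simonsohn’s analysis): a real latent difference on a capped scale is compressed by the ceiling, and discarding the capped cases attenuates it further rather than restoring it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent scores for two hypothetical groups on a 7-point scale;
# the group means and SDs are made up purely for illustration.
latent_a = rng.normal(5.5, 1.5, 100_000)   # "neutral" condition
latent_b = rng.normal(6.0, 1.5, 100_000)   # "treatment" condition
obs_a = np.minimum(latent_a, 7)            # responses capped at the scale maximum
obs_b = np.minimum(latent_b, 7)

print(latent_b.mean() - latent_a.mean())   # ~0.50: the true latent difference
print(obs_b.mean() - obs_a.mean())         # ~0.40: the ceiling compresses it
print(obs_b[obs_b < 7].mean() - obs_a[obs_a < 7].mean())   # ~0.29: dropping ceiling cases shrinks it further
```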

What follows from this? What follows is that the analysis of the evidential import of failed replications in this field is an unsettled business. Despite the best of intentions of the new replicationists, there are grounds for questioning whether the meta-methodology is ready for the heavy burden being placed on it. I’m not saying that facets of the necessary methodology aren’t out there, but that the pieces haven’t been fully assembled ahead of time. Until they are, the basis for scrutinizing failed (and successful) replications will remain in flux.

^^^^^^^^^^

Final irony. If the replication researchers claim they haven’t caught on to any of the problems or paradoxes I have intimated for their enterprise, let me end with one more… No, I’ll save it for installment 4.

^^^^^^^^^^

Statistical significance testers in psychology (and other areas) often maintain there is no information, or no proper inference, to be obtained from statistically insignificant (negative) results. This, despite power analyst Jacob Cohen toiling amongst them for years. Maybe they’ve been misled by their own constructed animal, the so-called NHST (no need to look it up, if you don’t already know).

*The irony is that much replication analysis turns on interpreting non statistically significant results!*

One of my first blogposts talks about interpreting negative results, and I’ve been publishing on this for donkey’s years [7]. Here are some posts for your Saturday night reading:

http://errorstatistics.com/2011/11/09/neymans-nursery-2-power-and-severity-continuation-of-oct-22-post/

Some numerical examples:
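Since the numerical examples from that post aren’t reproduced here, the following is a minimal sketch of the kind of calculation at issue (my own hypothetical numbers, not the post’s): for a one-sided test, a statistically insignificant result can still warrant claims of the form “the discrepancy from the null is less than such-and-such,” assessed by the probability that a larger observed difference would have occurred were that discrepancy present (power/severity reasoning).

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical one-sided test of H0: mu <= 0 vs H1: mu > 0,
# X ~ N(mu, sigma^2); all numbers are made up for illustration.
sigma, n = 2.0, 100
se = sigma / sqrt(n)             # standard error = 0.2
x_bar = 0.2                      # observed mean: z = 1.0, not statistically significant

def severity(mu1, x_obs=x_bar):
    """Severity for the claim 'mu < mu1' given the insignificant result:
    the probability of an even larger observed mean, were mu = mu1."""
    return norm.sf((x_obs - mu1) / se)

for mu1 in (0.2, 0.4, 0.6):
    print(mu1, round(severity(mu1), 3))   # 0.5, 0.841, 0.977
```

On this reading, the insignificant result gives fairly good grounds that any discrepancy from the null is less than 0.6, but no grounds at all that it is less than 0.2.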

^^^^^^

[0] Unsurprisingly, replicationistas in psych are finding well-known results from experimental psych to be replicable. Interestingly, similar results are found in experimental economics, dubbed “experimental exhibits”. Experimental economists recognize that rival interpretations of the exhibits are still open to debate.

[1] In Nuzzo’s article: “For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white”.

(Glory, I tell you!)

[2] Some of the results are now published in Social Psychology. Perhaps it was not such an exaggeration to suggest, in an earlier post, that “non-significant results are the new significant results”. At the time I didn’t know the details of the replication project; I was just reacting to graduate students presenting this as the basis for a philosophical position, when philosophers should have been performing a stringent methodological critique.

[3] By contrast, statistical fraudbusting and statistical forensics have some rigorous standards that are hard to evade, e.g., recently Jens Forster.

[4] In Kahneman’s initial call (Oct. 2012): “He suggested setting up a ‘daisy chain’ of replication, in which each lab would propose a priming study that another lab would attempt to replicate. Moreover, he wanted labs to select work they considered to be robust, and to have the lab that performed the original study help the replicating lab vet its procedure.”

[5] Simonsohn is always churning out the most intriguing and important statistical analyses in social psychology. The field needs more like him.

[6] For an excellent discussion of a case that *is* absolved from non-replication by appealing to the ceiling effect see http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/.

[7] e.g., Mayo 1985, 1988, to see how we talked about statistics in risk assessment philosophy back then.

Filed under: junk science, science communication, Statistical fraudbusting, Statistics