Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.

**1. Some assertions from Fisher, N-P, and Bayesian camps**

Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

*a) From the Fisherian camp (Cox and Hinkley):*

“For given observations y we calculate t = t_{obs} = t(y), say, and the level of significance p_{obs} by

p_{obs} = Pr(T > t_{obs}; H_{0}).

….Hence p_{obs} is the probability that we would mistakenly declare there to be evidence against H_{0}, were we to regard the data under analysis as being just decisive against H_{0}.” (Cox and Hinkley 1974, 66).

Thus p_{obs} would be the Type I error probability associated with the test.
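To make the “just decisive” reading concrete, here is a minimal sketch in Python. It assumes a one-sided test whose statistic T is standard normal under H_{0}; that choice, the observed value 1.96, and the function names are illustrative assumptions, not anything taken from Cox and Hinkley. The idea is simply that if the observed t_{obs} is treated as the cutoff, the relative frequency of T > t_{obs} in repetitions with H_{0} true should come out close to p_{obs}.

```python
import random
from statistics import NormalDist


def p_obs(t_obs: float) -> float:
    """Upper-tail P value Pr(T > t_obs; H0), assuming T ~ N(0, 1) under H0."""
    return 1.0 - NormalDist().cdf(t_obs)


def rejection_rate_under_h0(t_cut: float, n_rep: int = 200_000, seed: int = 1) -> float:
    """Relative frequency of T > t_cut in simulated repetitions with H0 true."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0.0, 1.0) > t_cut for _ in range(n_rep))
    return hits / n_rep


t_observed = 1.96  # hypothetical observed test statistic
p = p_obs(t_observed)
rate = rejection_rate_under_h0(t_observed)  # treat t_observed as "just decisive"
print(f"p_obs = {p:.4f}, simulated error rate = {rate:.4f}")
```

The two printed numbers should agree up to simulation noise, which is all the mathematical claim amounts to; whether that frequency is the right *justification* for using p_{obs} is the separate question taken up below.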

*b) From the Neyman-Pearson (N-P) camp (Lehmann and Romano):*

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

*c) Gibbons and Pratt:*

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).
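The “borderline level” description can be checked mechanically. The sketch below again assumes a one-sided z-test with a standard normal statistic under the null; the observed value 2.2 and the grid of levels are made up for illustration. It runs the fixed-level test at each level on the grid and confirms that the outcome is judged significant exactly at those levels that are at least as large as the P value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejects_at_level(t_obs: float, alpha: float) -> bool:
    """Fixed-level one-sided z-test: reject H0 when t_obs reaches the alpha cutoff."""
    cutoff = STD_NORMAL.inv_cdf(1.0 - alpha)
    return t_obs >= cutoff


t_observed = 2.2  # hypothetical observed statistic
p_value = 1.0 - STD_NORMAL.cdf(t_observed)

levels = [0.001, 0.005, 0.01, 0.025, 0.05, 0.10]
decisions = {alpha: rejects_at_level(t_observed, alpha) for alpha in levels}

# On this grid, the smallest level at which the outcome is significant
# is the first grid point lying at or above the P value.
smallest_rejecting_level = min(a for a, rejected in decisions.items() if rejected)
print(f"P value = {p_value:.4f}; smallest rejecting level on grid = {smallest_rejecting_level}")
```

With a finer grid the smallest rejecting level closes in on the P value itself, which is the sense in which the P value is the “level attained” by the sample.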

**2. So what’s behind the “P values aren’t error probabilities” allegation?**

In their rejoinder to Hinkley, Berger and Sellke assert the following: “The use of the term ‘error rate’ suggests that the frequentist justifications, such as they are, for confidence intervals and fixed α-level hypothesis tests carry over to P values.”

They do not disagree with Cox and Hinkley’s assertion above, but they maintain that:

“This hypothetical error rate does not conform to the usual classical notion of ‘repeated-use’ error rate, since the P-value is determined only once in this sequence of tests. The frequentist justifications of significance tests and confidence intervals are in terms of how these procedures perform when used repeatedly.” (Berger and Sellke 1987, 136)

Keep in mind that Berger and Sellke are using “significance tests” to refer to Neyman-Pearson (N-P) tests in contrast to Fisherian P-value appraisals.

So their point appears to be simply that the P value, as intended by Fisher, is not justified by (or not intended to be justified by) a behavioral appeal to controlling long run error rates. It is assumed that those are the only, or the main, justifications available for N-P significance tests and confidence intervals (thus type 1 and 2 error probabilities and confidence levels are genuine error probabilities). They do not entertain the idea that the P value, as the attained significance level, is important for N-P theorists, nor that “a p-value gives an idea of how strongly the data contradict the hypothesis” (Lehmann and Romano), a construal we find early on in David Cox.

But let’s put that aside, as we pin down Berger and Sellke’s point. Here’s how we might construe them. They grant that the P-value is, mathematically, a frequentist error probability; it is the *justification* that they think differs from what they take to be the justification of Type 1 and 2 errors in N-P statistics. They think N-P tests and confidence intervals get their justification in terms of (actual?) long run error rates, and P-values do not. To continue with their remarks:

“Can P values be justified on the basis of how they perform in repeated use? We doubt it. For one thing, how would one measure the performance of P values? With significance tests and confidence intervals, they are either right or wrong, so it is possible to talk about error rates. If one introduces a decision rule into the situation by saying that H_{0} is rejected when the P value < .05, then of course the classical error rate is .05.”[ii] (ibid.)
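Both numbers in this passage and its footnote can be reproduced with a short simulation. The sketch below is only an illustration under assumptions of my own choosing: a one-sided z-test with a continuous statistic, so that P values are uniformly distributed when H_{0} is true. It applies the rule “reject when the P value < .05” and reports the resulting error rate alongside the average P value among the rejections.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
rng = random.Random(2024)

# Simulate one-sided z-test P values in repetitions where H0 is true.
p_values = [1.0 - STD_NORMAL.cdf(rng.gauss(0.0, 1.0)) for _ in range(200_000)]

# Decision rule from the quotation: reject H0 when the P value is below .05.
rejected = [p for p in p_values if p < 0.05]

error_rate = len(rejected) / len(p_values)
mean_p_given_rejection = mean(rejected)

print(f"error rate of the rule: {error_rate:.4f}")
print(f"average P value given rejection: {mean_p_given_rejection:.4f}")
```

The first figure should sit near .05 and the second near .025, since a uniform variable restricted to (0, .05) averages .025. The simulation does not settle which of the two numbers deserves the name “error rate”; that is the interpretive issue under discussion.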

Thus, *P values are error probabilities*, but their intended justification (by Fisher?) was not a matter of a behavioristic appeal to low long-run error rates, but rather, something more inferential or evidential. We can actually strengthen their argument in a couple of ways. Firstly, we can remove the business of “actual” versus “hypothetical” repetitions, because the behavioristic justifications that they are trying to call out are also given in terms of hypotheticals. Moreover, behavioristic appeals to controlling error rates are not limited to “reject/do not reject”, but apply even where the inference is in terms of an inferred discrepancy or other test output.

The problem is that the inferential vs behavioristic distinction does not separate Fisherian P-values from confidence levels and type 1 and 2 error probabilities. *All* of these are amenable to *both* types of interpretation! More to follow in installment #2.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Installment 2:** **Mirror Mirror on the Wall, Who’s the More Behavioral of them all?**

Granted, the founders did not make out intended inferential construals fully—though representatives from Fisherian and N-P camps took several steps. At the same time, members of both camps also can be found talking like acceptance samplers!

Berger and Sellke had said: “If one introduces a decision rule into the situation by saying that H_{0} is rejected when the P value < .05, then of course the classical error rate is .05.” Good. Then we can agree that it is mathematically an error probability. They simply don’t think it reflects the Fisherian ideal.

**3. Fisher as acceptance sampler.**

But it was Fisher, after all, who declared that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (DOE 15-16)

Or to quote from an earlier article of Fisher (1926):

…we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment *rarely fails* to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen.

The above is a more succinct version of essentially the same points Fisher makes in DOE.[iii]

No wonder Neyman could tell Fisher to look in the mirror (as it were): *“Pearson and I were only systematizing your practices for how to interpret data, using those nice charts you made. True, we introduced the alternative hypothesis (and the corresponding type 2 error), but that was only to give a rationale, and apparatus, for the kinds of tests you were using. You never had a problem with the Type 1 error probability, and your concern for how best to increase “sensitivity” was to be reflected in the power assessment. You had no objections—at least at first”.* See this post.

The dichotomous “up-down” spirit that Berger and Sellke suggest is foreign to Fisher is not foreign at all. Again from DOE:

Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretation. ….The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; …and on the other hand, results which show no significant discrepancy from this hypothesis. (DOE 15)

Even where Fisher is berating Neyman for introducing the Type 2 error (he had no problem with type 1 errors, and both were fine in cases of estimation), Fisher falls into talk of actions, as Neyman points out (Neyman 1956, Triad).

“The worker’s real attitude in such a case might be, according to the circumstances:

(a) “the probable deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.” (Fisher 1955, Triad, p. 73)

Pearson responds (1955) that this is entirely the type of interpretation they imagined to be associated with the bare mathematics of the test. And Neyman made it clear early on (though I didn’t discover it at first) that he intended “accept” to serve merely as a shorthand for “do not reject”. See this recent post, which includes links to all three papers in the “triad” (by Fisher, Neyman, and Pearson).

“In fact Fisher referred approvingly to the concept of the power curve of a test procedure and although he wrote: ‘On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or ‘decide’ between two or more hypotheses, have greatly obscured their understanding’, he was careful to go on and add ‘when taken not as contingent possibilities but as elements essential to their logic’.” (129).

To see how Fisher links power to his own work early on, please check this post.

So we are back to the key question: what is the basis for Berger and Sellke (and others who follow similar lines of criticism) to allow error probabilities in the case of N-P significance tests and confidence intervals, and not in the case of P-values? It cannot be whether the method involves a rule for mapping outcomes to interpretations (be there two or three—the third might be N-P’s initial “remain undecided” or “get more data”), because we’ve just seen that to be true of Fisherian tests as well.

**4.** **Fixing the type 1 error probability**

But isn’t the issue that N-P tests fix the type 1 error probability in advance? Firstly, we must distinguish between fixing the P value threshold to be used in each application, and justifying tests solely by reference to a control of long run error (behavioral justification). So what about the first point of predesignating the threshold? Actually, this was more Fisher than N-P:

“Neyman and Pearson *followed Fisher’s* adoption of a fixed level,” Erich Lehmann tells us (Lehmann 1993, 1244). Lehmann is flummoxed by the accusation of fixed levels of significance since “[U]nlike Fisher, Neyman and Pearson (1933, p. 296) did not recommend a standard level but suggested that ‘how the balance [between the two kinds of error] should be struck must be left to the investigator.’” (ibid.) From their earliest papers, they stressed that the tests were to be “used with discretion and understanding” depending on the context. Pearson made it clear that he thought it “irresponsible”, in a matter of importance, to distinguish rejections at the .025 or .05 level.[iv] (See this post.) And as we already saw, Lehmann (who developed N-P tests as decision rules) recommends reporting the attained P value.

In a famous passage,[v] Fisher (1956) raises the criticism—but without naming names:

A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection….However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.
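The arithmetic behind Fisher’s “not more than 1%” can be spelled out in a few lines. This is a sketch on one reading of the passage, in which “such decisions” covers every occasion on which the habitual rule is applied; the function name and the shares of true hypotheses tried below are hypothetical. A true hypothesis is rejected with probability equal to the level, and rejecting a false hypothesis is never a mistake, so the long-run share of mistaken rejections is the level multiplied by the share of hypotheses that are true.

```python
def mistaken_rejection_rate(alpha: float, share_true_nulls: float) -> float:
    """Long-run share of all decisions that are mistaken rejections.

    True hypotheses are rejected with probability alpha; rejecting a false
    hypothesis contributes no mistaken rejection at all.
    """
    return share_true_nulls * alpha + (1.0 - share_true_nulls) * 0.0


alpha = 0.01  # the 1% habitual level in Fisher's passage
rates = {share: mistaken_rejection_rate(alpha, share) for share in (0.0, 0.25, 0.5, 1.0)}

for share, rate in rates.items():
    print(f"share of true hypotheses = {share:.2f} -> mistaken rejections = {rate:.4f}")
```

The bound of 1% is attained only when every hypothesis tested is true, which is why Fisher can say “not more than”.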

It is assumed he is speaking of N-P, or at least Neyman, but I wonder…

Anyway, the point is that the mathematics admits of different interpretations and uses. The “P values are not error rates” argument really boils down to claiming that the justification for using P-values *inferentially* is not *merely* that if you repeatedly did this you’d rarely erroneously interpret results (that’s necessary but not sufficient for the inferential warrant). That, of course, is what I (and others) have been arguing for ages, but I’d extend this to N-P significance tests and confidence intervals, at least in contexts of scientific inference. See, for example, Mayo and Cox (2006/2010), Mayo and Spanos (2006). We couldn’t even express the task of how to construe error probabilities inferentially if we could only use the term “error probabilities” to mean something justified only by behavioristic long-runs.

**5. What about the Famous Blow-ups?**

What about the big disagreement between Neyman and Fisher (Pearson is generally left out of it)? Well, I think that as hostilities between Fisher and Neyman heated up, the former got more and more evidential (and even fiducial) and the latter more and more behavioral. Still, what has made a lasting impression on people, understandably, are Fisher’s accusations that Neyman (if not Pearson) converted his tests into acceptance sampling devices, more suitable for making money in the U.S. or Russian 5 year plans, than thoughtful inference. (But remember Pearson’s and Neyman’s responses.) Imagine what he might have said about today’s infatuation with converting P value assessments to dichotomous outputs to compute science-wise error rates: Neyman on steroids.[vi]

By the way, it couldn’t have been too obvious that N-P distorted his tests, since Fisher tells us in 1955 that it was only when Barnard brought it to his attention that “despite agreeing mathematically in very large part”, there is a distinct *philosophical position* emphasized at least by Neyman. So it took like 20 years to realize this? (Barnard also told me this in person, recounted in this theater production.)

Here’s an enlightening passage from Cox (2006):

Neyman and Pearson “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 Fisher was writing to Neyman encouragingly about this work, but relations soured, notably when Fisher greatly disapproved of a paper of Neyman’s on experimental design and no doubt partly because their being in the same building at University College London brought them too close to one another!” (195)

*Being in the same building, indeed!* Recall Fisher declaring that if Neyman teaches in the same building and doesn’t use his book, he would oppose him in all things. See this post for details on some of their anger management problems.

The point is that it is absurd to base conceptions of inferential methods on personality disputes rather than the mathematical properties of tests (and their associated interpretations). These two approaches are best seen as offering clusters of tests appropriate for different contexts within the large taxonomy of tests and estimation methods. We can agree that the radical behavioristic rationale for error rates is not the rationale intended by Fisher in using P-values. I would argue it was not the rationale intended by Pearson, nor, much of the time, by Neyman. Yet we should be beyond worrying about what the founders really thought. *It’s the methods, stupid.*

Readers should not have to go through this “he said/we said” history again. Enough! Nor should they be misled into thinking there’s a deep inconsistency which renders all standard treatments invalid (by dint of using both N-P and Fisherian tests).

So has pure analytic philosophy, by clarifying terms (along with a bit of history of statistics), solved the apparent disagreement with Berger and Sellke (1987) and others?

It’s gotten us somewhere, yet there’s a big problem that remains. TO BE CONTINUED ON A NEW POST

**REFERENCES:**

Barnard, G. (1972). “Review of ‘The Logic of Statistical Inference’ by I. Hacking” *Brit. J. Phil. Sci.* 23(2): 123-132.

Berger, J. O. and Sellke, T. (1987) “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* 82: 112-139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* 82: 106-111, 123-139.

Cox, D. R. (2006) *Principles of Statistical Inference*. Cambridge: Cambridge University Press.

Cox, D. R. & Hinkley, D. V. (1974). *Theoretical Statistics*, London, Chapman & Hall.

Fisher, R. A. (1926). “The Arrangement of Field Experiments”, *J. of Ministry of Agriculture*, Vol. XXXIII, 503-513.

Fisher, R. A. (1947). *The Design of Experiments* (4th ed.). New York: Hafner.

Fisher, R. A. (1955). “Statistical Methods and Scientific Induction,” *Journal of the Royal Statistical Society* (B) 17: 69-78.

Fisher, R. A. (1956). *Statistical Methods and Scientific Inference*. Hafner.

Gibbons, J. & Pratt, J. W. (1975). “P-values: Interpretation and Methodology”, *The American Statistician* 29: 20-25.

Lehmann, E. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” *J. Amer. Statist. Assoc.*, 88(424):1242-1249.

Lehmann, E. L. and Romano, J. P. (2005). *Testing Statistical Hypotheses* (3rd ed.). New York: Springer.

Mayo, D. G. and Cox, D. R. (2006/2010). “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

Neyman, J. (1977) “Frequentist Probability and Frequentist Statistics,” *Synthese* 36: 97-131.

Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher,” *Journal of the Royal Statistical Society* (B), 18:288-294.

Neyman, J. and Pearson, E.S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” *Philosophical Transactions of the Royal Society of London*, (A), 231, 289-337.

Pearson, E. S. (1955). “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society* (B), 17: 204-207.

____________

[i] With the usual inversions.

[ii] They add “but the expected P value given rejection is .025, an average understatement of the error rate by a factor of two.”

[iii] Neyman did put in a plug for developments in empirical Bayesian methods in his 1977 Synthese paper.

[iv] Pearson says,

*Design of Experiments* (DOE):

The test of significance (13):

“It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20 – the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent – then it would be useless for him to experiment with only 3 cups of tea…. It is usual and convenient for the experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. …we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (emphasis added)

On 46-7 Fisher clarifies something people often confuse: it’s not the low probability of the event “rather to the fact, very near in this case, that the correctness of the assertion would entail an event of this low probability.”

[vi] It follows a paragraph criticizing Bayesians.

Is the B camp named for Berger, or for Bayes?

I have to think it’s named for Berger, because I don’t think most other Bayesians really care too much whether people think p-values are error probabilities — that has very little to do with the sort of things we’re concerned with.

So what camp do the dissenters Hubbard and Bayarri (http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf) and Kalbfleisch and Sprott (“…the significance level cannot be equated with the rejection frequency of a true null hypothesis”) belong to? I recognize Bayarri’s name from Bayesian papers I’ve read, and Valencia is, of course, a notorious hotbed of Bayesianity. But I can find no hint of Bayesian leanings in Hubbard’s publications, and Kalbfleisch and Sprott were straight-up Fisherians at the time they wrote the statement I quoted (which was well before Berger’s alleged misreading of Neyman).

Wow, not sure what happened to the Hubbard and Bayarri link. It should have been P Values are not Error Probabilities.

Corey: Thanks for linking Hubbard and Bayarri. It is a perfect exemplar, almost exactly, word for word the position I’m describing as held by Berger and Sellke, with the added flourishes of the “inconsistent hybrid” hysteria. I’m not saying there aren’t people who hold this view–I’m discussing it because people do, and it’s time to get to the bottom of it. It’s an issue of philosophy and interpretation (and a bit more). I’ll touch upon all the other points that this position involves in later installments.

As for your point about Bayesians really not caring how people view p-values, that’s fine, but they bring it up in the midst of criticism (as your example well shows); couldn’t open the ~~Kempthorne and Folks~~ Kalb and Sprott. They also keep coming back to it (Berger surely does) because it enables working without an explicit alternative. Berger (2003) attempts conditional error probabilities based on p-values.

Mayo: Yup, Hubbard and Bayarri are clearly following up on Berger and Sellke. But that cannot be said of Kalbfleisch and Sprott, since they were writing about this a decade before Berger, and they clearly viewed themselves as writing in the Fisherian tradition (in distinction to Berger’s Bayes-influenced point-of-view). I don’t have access to their full paper, just to the first couple of pages, so I can only read the preamble to their actual argument. What are your thoughts on that paper?

Corey: I might note that Pratt (1965) showed Berger and Sellke’s results back in ’65 (that’s why I used a quote from him that I had handy from my book). He teases Berger and Sellke about this in his comments. He also mentions the famous Edwards, Lindman, and Savage from ’63. (I don’t know if “teases” is the right word.) Many Fisherians (and N-Pers) oppose the behavioristic rationale; that doesn’t change the mathematical point.

The Kalbfleisch and Sprott paper is in Harper and Hooker, Vol. II, 1976, for people who have it. Yes, they describe Fisher’s point that the frequency of rejecting true hypotheses in repetitions “will not necessarily be indicative of the strength of the evidence” (265). Right. The observed p-value may be smaller than a predesignated fixed alpha.

Corey –

I would hardly lump Kalbfleisch and Sprott in with dissenters – reviewing this statement within the context of their paper is instructive.

Kalbfleisch and Sprott state early in this paper “A test of significance is a measurement procedure whose purpose is to evaluate the strength of the evidence provided by the data against an hypothesis. The observed significance level is an index of the compatibility or consistency of the data and the hypothesis. The smaller the observed significance level, the stronger the evidence provided by the data against the hypothesis.” They don’t label significance probabilities as “error probabilities” per se, but clearly regard them as a measure of strength of evidence. I suppose we could try to ask them under what scenarios their measure is equivalent to a probability measure in a strict measure-theoretical framework.

Pretty clearly in line with Error Statistical principles, as I read things, though even the above quote doesn’t cover all of their philosophical territory.

There’s plenty more sensible philosophical discussion in this paper – thanks for fixing that link to the reference.

“It is a gross oversimplification to regard a test of significance as a decision rule for accepting or rejecting an hypothesis. Any decision to ‘accept’ or ‘reject’ a scientific hypothesis will certainly depend upon more than the experimental evidence. A theory which is contradicted by the data may continue to be used if no satisfactory alternative is available, or if the nature of the departures from it are judged to be unimportant for a particular application.” A good example of this is Newtonian mechanics. We know it is wrong, Einstein showed this quite clearly, and data from many sources support Einstein’s theories over Newton’s. But F = ma is still good enough to get a rocket to the moon, so NASA uses it.

The quote that you gave appears later in their discussion, in reviewing issues with ‘the usual’ Neyman-Pearson formulation: “Thirdly, the significance level cannot be equated with the rejection frequency of a true null hypothesis. These objections will be discussed in the next three sections, and following this a more satisfactory formulation of tests of significance will be given and illustrated.” So that statement clearly does not refer to significance levels as a measure by which to evaluate the strength of evidence provided by the data, which Kalbfleisch and Sprott had previously discussed.

Rather, that quote refers to a particular phenomenon sometimes encountered in real situations, and of course easily and endlessly discussable in hypotheticals. Namely, as they discuss in Section 4, “However, as Fisher (1959, p. 93) pointed out, one must distinguish between the strength of the evidence, which is to be measured by the significance level, and the frequency with which evidence of a given strength will be obtained. The nature of the hypothesis and experiment may be such that evidence of even moderate strength is almost impossible to obtain. The frequency with which a true hypothesis would be rejected at level 0.05 in repetitions of the experiment may then be much less than 0.05. This point is illustrated in the following examples.”

I’ll spare the examples, anyone can obtain them thanks to the link you provided. But the point is that the context of the quote you provided was about some quirky multiple parameter complex hypothesis tests, corner cases of a philosophical discussion of error statistical principles. Arguably, some of their examples might not constitute “a properly designed experiment” in the sense that Fisher discussed in the 1926 article quoted by Mayo above, so this portion of their discussion should not be construed as dissent – rather it is part of a most interesting philosophical discourse on the nature of significance levels as a measure of strength of evidence of data for an hypothesis.
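For readers without access to the paper, the bare phenomenon (a true hypothesis rejected at level 0.05 less often than 0.05 of the time) already shows up in a much simpler setting than the ones described above. The sketch below is an illustration of my own and is not one of Kalbfleisch and Sprott’s examples: a one-sided test of θ = 0.5 based on X ~ Binomial(10, θ), where the discreteness of X means that few outcomes carry an observed significance level at or below 0.05.

```python
from math import comb


def upper_tail(n: int, k: int, theta: float = 0.5) -> float:
    """Pr(X >= k) for X ~ Binomial(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(k, n + 1))


n, level = 10, 0.05

# Observed significance level attached to each possible outcome x.
p_by_outcome = {x: upper_tail(n, x) for x in range(n + 1)}

# Outcomes that count as significant at the nominal level.
rejecting = [x for x, p in p_by_outcome.items() if p <= level]

# How often such outcomes occur when the hypothesis theta = 0.5 is true.
actual_frequency = upper_tail(n, min(rejecting))

print(f"outcomes significant at {level}: {rejecting}")
print(f"frequency of rejecting a true hypothesis: {actual_frequency:.4f}")
```

Here only 9 or 10 successes are significant at 0.05, and they occur with probability 11/1024, roughly 0.011, when the hypothesis is true. The observed significance level still measures how discordant each outcome is with the hypothesis; what falls short of 0.05 is the frequency with which evidence of that strength turns up.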

I was lucky enough to attend the University of Waterloo in 1981-1982 and have each of these interesting statisticians as teachers – their discussions of error statistical principles as reflected in the multiple contributions by Waterloo faculty in the book you cite suggest to me that they are no dissenters by any means.

The most telling part of Kalb and Sprott is the last line in the discussion. When asked by Lindley how to operationalize their conception of the role of P values as evidence:

Kalbfleisch:

“As in the N-P theory, results significant at the 5% level would occur at most one time in twenty ‘in the long run’ if the hypothesis were true. In selecting the reference set, one specifies the long run within which it is appropriate to evaluate the inference, and this need not correspond to all possible repetitions of the experiment.”

So it’s precisely my point: the issue is a matter of philosophical rationale, and how to identify relevant repetitions for using error probabilities as indicating discordance or inconsistency or the like. This is entirely in sync with the inferential use of N-P error statistical properties.

Admittedly, they, like Fisher, N-P, and many others, only hint at the epistemological principle that directs the choice, in relation to the question asked. They are at a loss, but can only say, negatively, that it can’t be merely a matter of long-run error rates. So true! It’s filling in this gap that I’ve tried to supply in my work. My point in this post is just to try to get us beyond the blockade formed by declarations that effectively say: We know from Fisher and Neyman feuds that the P value is a very different animal, mathematically, from an N-P error probability. That’s one of the first steps to lifting the brain out (I refer to the illustration in this post), declaring all this has been settled long ago….the wise men know this…and then moving on to whatever lesson someone wishes to prove thereby.

Mayo: I have real difficulty reconciling “the significance level cannot be equated with the rejection frequency of a true null hypothesis” with “results significant at the 5% level would occur at most one time in twenty ‘in the long run’ if the hypothesis were true”. The continuation of the latter quote suggests to me that Kalbfleisch has some kind of conditioning-on-ancillary-statistics concern regarding the relevant reference set. I do wonder what your thoughts are on ancillarity and frequentist conditional inference…

Corey: As the next sentence indicates, he’s saying it holds within the chosen reference set. As for me, the issue comes up in Cox and Mayo (2010), “objectivity & conditionality”. It was this discussion of conditionality that led me to look carefully at Birnbaum’s “result” and find the flaw. Finding it was easy, explaining it was hard, and then very hard. That doesn’t completely answer your question….

Steve McKinney: Not everyone can read the paper — some folks lack academic access. Like me, for instance.

By “dissenters”, I didn’t mean that they dissent from the idea that p-values are relevant for post-data inference. Clearly that’s not the case. I meant that they dissent from the idea that the relevance is due to a realized p-value being equal to some kind of post-data error probability. This seems like a reasonable reading of the quoted text; after all “cannot be equated with the rejection frequency” is a pretty strong assertion! (Unlike you, I could not read the examples, so I don’t know exactly what “corner cases” they had in mind, but I’ll bet they fall afoul of Pearson’s Step 2.) In contrast, Error Statistics as promulgated by Mayo and Spanos (if I understand it correctly) holds that a realized p-value is indeed a post-data error probability, and that this is a necessary (but not sufficient) condition for it to correspond to (one minus) the severity of the test that some claim has passed.

Well I have the paper here and it’s just exactly Fisher (with his example, as Steven points out.) Mayo I can get the Elba editors to scan this for Saturday night.

Corey

The Kalbfleisch and Sprott link just links back to this web page. Can you post the article title? I can’t determine what writing of theirs you’re referring to.

The title is “On Tests of Significance”.

Mayo, I guess I’m in the camp that doesn’t see how p-values can be interpreted as error probabilities (or, at least, I’m still not convinced), but I’m definitely *not* in the camp that claims that p-values exaggerate the evidence against the null. I’m very much a defender of p-values, at least in some cases (e.g., under randomization)… more and more, I suppose I find myself in the “Fisher” camp, as much as that distinction matters. I do agree with the “just decisive” interpretation of Cox and Hinkley, although I don’t personally find that interpretation all that useful. My problem is then to make the leap that “p_obs would be the Type I error probability” as you seem to. If you interpret p_obs this way, then it seems that you must also be able to interpret 1-p_obs (since we, of course, want our probabilities to be coherent). But here, I’m at a loss…

Mark: just reread the quotes. If I’d thought they would be problematic I would have quoted a few more, or few hundred more. Nor do Berger and Sellke deny this much, as you can see. 1 – p would be the probability of not rejecting the null, computed under the null. Next installment should come today.

Mark: I don’t know if you’re clear on the 1-p business, and it’s come up before I think (in your comments). To allude to one of Fisher’s construals, (1 – p) is the prob his experiment would tell him to ignore the result (curb his enthusiasm), when in fact chance alone was operating, or whatever the null asserts: Prob(T < t*;Ho) for cut-off t* corresponding to his chosen p-value, e.g., .05.
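For reference, the quantities being contrasted in this exchange can be written side by side. This is only a sketch in the notation of the Cox and Hinkley quote, assuming a continuous test statistic T; here α and t* stand for a level and cut-off fixed in advance, and t_obs for the observed value:

```latex
\begin{aligned}
\text{pre-data:}\quad  & \Pr(T > t^{*}; H_0) = \alpha,        & \Pr(T \le t^{*}; H_0)   &= 1 - \alpha, \\
\text{post-data:}\quad & p_{obs} = \Pr(T > t_{obs}; H_0),     & 1 - p_{obs}             &= \Pr(T \le t_{obs}; H_0).
\end{aligned}
```

On this reading, 1 – p_obs is the probability, computed under H_0, of a result no more discordant than the one observed. Whether that licenses calling p_obs an error probability is the point in dispute in this thread.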

Mayo, no, this is not the same thing. A “chosen cut off” corresponds to the typical interpretation of a (pre-data) type I error rate, which is pertinent to both Fisher and NP camps (although Fisher wouldn’t have called it this). We were talking about interpreting the observed p-value. My contention is that *if* we think of p_obs as something like an observed error rate, then 1-p_obs must have a similar interpretation, but here I’m stuck and I don’t think you (or the selected quotes) have addressed it adequately.

In fact, I’ve just finished reading Kempthorne’s (1976) “what’s the use” paper, in which he also heavily quotes C and H. I’m a huge fan of Kempthorne (even though you’ve commented in the past that he was a rather persnickety fellow), but I’m still stewing over this particular paper.

Just to be clear, I fully understand what a p-value represents with respect to the observed data and am completely comfortable with the mathematical definition. I also highly value the concept of significance testing in certain cases, particularly under randomisation, where I (and many others) see testing as much more relevant than estimation. I just still haven’t been persuaded that it’s useful, or even necessarily accurate, to think of a p-value as an error probability.

OK so there’s my breathless wrap up of this business. The continuation will deal with the erroneous evidential measures Fisher is subjected to in order to provide him with the evidential assessments the frequentist error probs apparently can’t provide. I had forgotten that funny paragraph in Cox about the problem of sharing a building.

Interesting post. I agree that there is too much of a hangup about the difference between Fisher and N-P regarding decisions and inferences and said as much in my comment on Goodman:

Senn, S. J. (2002). “A comment on replication, p-values and evidence S.N.Goodman, Statistics in Medicine 1992; 11:875-879.” Statistics in Medicine 21(16): 2437-2444.

After all, it was Fisher who introduced the habit of reverse tabulation. (That is to say, obtaining what in NP mode are called critical values from significance levels.)

I think that a more important difference is to do with the relevant subset for constructing the inferences. Fisher did not (at least in his later writings) accept that the probabilities were uniquely determined by repeated hypothetical generation of examples of the experiment run: there might be a conditioning on values actually seen, as in regression. I am less familiar with Neyman’s writings but I think this is a point for disagreement.

As regards power, Fisher’s view was that the null hypothesis is more primitive than the test statistic but the same is not true of alternative hypotheses. See https://errorstatistics.com/2014/02/21/stephen-senn-fishers-alternative-to-the-alternative/

Stephen: I’m prepared to, and am interested in, discussing PhilHistStat, but with a big [PHStat] in front. You don’t see Bayesians declaring a given method an inconsistent hybrid because of disagreements between, say, Savage, Good, and Lindley. As you say, the differences between Fisher and N-P are much smaller than those between conventionalist/default Bayesians and subjective Bayesians.

[PHStat]: Yes, Fisher says (in DOE) that “the same data may contradict the hypothesis in any of a number of different ways” (185) and the experimenter “is aware of what observational discrepancy it is which interests him”, thus there’s no need to set them out. This is a poor argument for why we needn’t try to be explicit about the alternative that “interests him”. That his primitive intuition tells him what he’s looking for fails to provide grounds for thinking we needn’t be explicit about a test’s capabilities to have found it. Is the experimenter allowed to develop alternatives in a data dependent fashion for Fisher?

Stephen: [PH Stat] I do agree with you about the issue of error probs being determined by repeated hypothetical generation of examples of the experiment as opposed to conditioning. Fisher appealed to principles of information–or something else, I don’t know. Basically N-P felt different contexts called for different “relevant” repetitions, real or hypothetical. Lehmann makes clear (e.g., in the “1 theory or 2” paper linked above) that an N-P-er is free to condition if he wants to.

Certainly J. Berger cannot equate frequentist with unconditional because he is keen to advocate a conditional frequentist account (e.g., in the 2003 paper). Of course Neyman “conditioned” on complete sufficient statistics to obtain “similar” tests, and as you say, “there might be a conditioning on values actually seen, as in regression”. In Mayo and Cox (2010) we do talk about some of the howlers of conditioning (notably the mixed test from Cox 1958). Which reminds me, my Birnbaum paper is to be out in actual physical form in 2 weeks!

Of course what I really care about is the upshot of all this. There are two. One concerns PhilHist of stat being used to present positions that are hard for people to question without actually delving into the gory details of the history. Few people would want to do that and they shouldn’t. But Berger and others present their position as if these are statistical facts intertwined with historical ones. They are actually interpretations based on philosophical positions about evidence, probability, frequentist statistics.

Now it is interesting that in Berger 2003, which I commented on in ASA, it’s as if he’s saying the opposite of what he’d said in the paper with Sellke.

“there has been great concern that the too-common misinterpretation of p-values as error probabilities very often results in considerable overstatement of the evidence against H0” (p. 3).

I thought that his earlier point was that if we interpret p-values evidentially (not as error probs) then we wind up with an overstatement of the evidence.

Ironically, in answering the charge that his methods relied on ‘repeated sampling from the same population’, Neyman turns around and shows the error probabilities can hold averaging over distinct populations. This is in “Frequentist Prob and Frequentist Stat” in the Birnbaum Synthese volume in 1977, and he gives a simulation (which of course involves hypothetical repetitions). So Berger in 2003 declares Neyman 1977 is the real Neyman–an empirical Bayesian of sorts. Berger develops a “frequentist principle” and he erects conditional p-values based on much the same computations as in the joint 1987 paper, and others besides. That is, type 1 and 2 error probs handily become posterior probabilities with spiked priors (he says they are frequentist). That leads to the second important thing, which would make its way into the continuation.

I have shared this post with Jim, and I hope we can get to the bottom of things off line at some point. (He was a guest poster once, you can search.)

Here’s Berger 2003, with comments.

http://www.phil.vt.edu/dmayo/personal_website/Berger Could Fisher Jeffreys and Neyman have agreed on testing with Commentary.pdf

Practicing scientist rather than philosopher of statistics here, so I may be out of my depth but here goes. I confess I’m puzzled by Berger and Sellke’s remarks. A P value is just the flip side of a confidence interval, and both are so intimately linked to N-P tests that I don’t fully understand how you can think confidence intervals and N-P tests justified but P values unjustified. I take it that their complaint is that a P value of 0.028 or whatever is a property of a particular test, and so it doesn’t make any sense to talk about its long-run error rate in a series of repeated trials? Well no, I suppose it doesn’t, not in the way they mean at any rate. But why would anyone think that it *would* make sense? Who has ever claimed that?
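The “flip side” relation mentioned here can be illustrated numerically. What follows is a minimal sketch, assuming a two-sided z-test of a normal mean with known sigma; the function names and the data (sample mean 0.44, n = 25, sigma = 1) are hypothetical, chosen only so that the p-value comes out near the 0.028 used in the comment above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_sided_p(xbar, mu0, sigma, n):
    """Two-sided p-value for H0: mu = mu0 in a z-test with known sigma."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def conf_interval(xbar, sigma, n, level):
    """Two-sided confidence interval for mu at the given confidence level."""
    half_width = STD_NORMAL.inv_cdf(1 - (1 - level) / 2) * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width


# Hypothetical data (an assumption for illustration, not from the thread).
xbar, sigma, n, mu0 = 0.44, 1.0, 25, 0.0
p = two_sided_p(xbar, mu0, sigma, n)
print(f"two-sided p-value: {p:.4f}")

# mu0 is excluded from the interval exactly when p is below 1 - level.
for level in (0.95, 0.99):
    lower, upper = conf_interval(xbar, sigma, n, level)
    print(f"{level:.2f} interval: ({lower:.3f}, {upper:.3f}); "
          f"contains mu0: {lower <= mu0 <= upper}")

# At confidence level 1 - p the interval's lower endpoint sits at mu0
# (up to floating-point error).
lower, upper = conf_interval(xbar, sigma, n, 1 - p)
print(f"lower endpoint at level 1 - p: {lower:.6f}")
```

Under these assumptions, the null value lies outside the 95% interval and inside the 99% interval, and it sits on the boundary of the interval whose confidence level is one minus the p-value.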

Hi Jeremy: Thanks for coming over. You’re exactly right in linking the P-value with the corresponding (1 – P) in the kinds of tests they are discussing. This is also what Fisher said when he embraced their approach as in sync with his own (links above).

The point is all about rationale. Fisher didn’t think a crude, long-run error rate control was the rationale for tests. But Pearson at least denied that was the rationale for N-P tests as well. And N, in practice, also showed an inferential rationale. The issue that they didn’t address adequately–and this is really the central point–is what exactly IS the inferential relevance of error probs to inference. (That of course is my goal, via severity.)

But I think the way to understand Berger and Sellke’s strategy (and many, many papers in similar spirit)–which they (Berger and Sellke) themselves indicate– is this: since they’ve denied the rationale for the P value is error control (or the like), they are forced to consider inferential justifications for the P-value. By assuming inferential MUST mean likelihood ratio/Bayesian ratio or Bayes posterior, they set out to show that there’s a mismatch between P values and these Bayesian measures. See the recent post on P-values overstating evidence. They don’t consider there’s any other way to construe P-values inferentially other than a Bayesian/likelihood one (they assume a default or conventional Bayes approach). As Senn has pointed out, the Bayesians radically disagree amongst themselves on this computation (the 2 cheers paper, linked in earlier post), and Casella and Berger (1987) show the problem is that Berger and Sellke are using unjustified spiked priors to show the mismatch.

The point is that the whole “P-values aren’t error probs” gambit is just a brief stopping point enabling them to argue, with a false dilemma, that the only other option is for the P value to be inferential, which must mean Bayes, and it fails at Bayes, so it fails period. And that is what they conclude. Steve Goodman (and many others) give the same argument.

I might note/admit that EGEK 1996 contains a chapter “Why Pearson Rejected the N-P (Behavioristic) Account of Testing…” that emphasizes the “Pearson-inferential/Neyman-behavioral” divide a bit more strongly than I would today. (It was never a matter of Fisher vs N-P, despite my describing it in this chapter in my reading.) Here’s the Chapter (11):

https://errorstatistics.files.wordpress.com/2014/08/egekchap11.pdf

After I met Spanos in 1999, who actually knows the history of statistics, I realized you had to go back further to the origins of the Fisher-Neyman break-up in 1935 to understand anything they were yelling about in the 1955-6 “triad”. Then, a few years later, there were the discoveries of the “hidden Neyman” papers (search this blog) which led me to think the “hard core” behavioristic line was somewhat of a caricature of Neyman as well. But clearly there was something to it.

I don’t know how widespread this practice is, but in particle physics it is routine to use the likelihood ratio as a test statistic, even when the point of the analysis is to produce a p-value, which is standard for discovery claims like the Higgs case that has been discussed previously on this blog. Since the likelihood ratio requires explicit definition of the alternative hypothesis, this looks like another way in which the distinction between Fisherian and N-P approaches is not so great in practice. I would be curious to know what statisticians think of this way of doing things.

Hi Kent: Good to hear from you. I’m prepared to say that determining the sampling distribution of the statistic, under the null, gives you a p-value–and never mind distinguishing F or N-P. David Cox, for example, will consider the P-value as a statistic and then one can compute its distribution under varying hypotheses. That, at any rate, is what allowed us to be in accord in Mayo and Cox (2006/10).

Now on the Higgs, which I will be going back to soon–for our upcoming schtick–, maybe you can tell me. I understand the 0 vs 1 comparison makes it appear as point vs point, but isn’t it really a matter of looking for discrepancies from the null in the direction of the alternative? (Cousins paper isn’t too far away, but there’s a lot of thunder, and I’m interested in your take.)

Going in the other direction, i.e., from N-P to F, I think the existence of best tests, freeing one from considering the alternative in computing the P-value (or type 1 error prob) makes it appear as if we’re not taking into account the alternative. But it’s already accommodated in the fact that it’s a best test (at least in the familiar examples).

Anyway, I know you asked for statistician input. I hope we get some.

If Berger and Sellke really doubt that “frequentist justifications, such as they are, for confidence intervals and fixed α-level hypothesis tests carry over to P values” then what justification do they accord to P values? What other one can there be?

They claim they have to resort to “evidential” construals. Although they say they try several, in fact they boil down to likelihood ratio/Bayes ratio or posteriors with spiked priors. Naturally they get disagreement between these measures and the P-value. When Casella and Berger (1987) point out that in one-sided testing they are actually “reconciled”, they say it’s just one more nail in the coffin for P values, but it makes no sense. All this was in the earlier post on P values overstating evidence.

https://errorstatistics.com/2014/07/14/the-p-values-overstate-the-evidence-against-the-null-fallacy/

You might think, well it’s different now. But it isn’t. I’d never be going back to 1987 if it weren’t that precisely the same arguments and computations are used in current articles. This is just one of the key sources.

Couple of things:

Sander Greenland points out in an e-mail that Fisher’s recommendation that non-significant results be ignored (quotes in my post) can be seen as essentially endorsing what we now call “publication bias”.

Anyone who wants an electronic copy (rather tattered) of Kalbfleisch and Sprott should write to me at error@vt.edu.

Matloff happens to be discussing a different set of turf battles having to do with statistics and machine learning. His blog is here:

http://matloff.wordpress.com/2014/08/26/statistics-losing-ground-to-cs-losing-image-among-students/

(1) “Thus p observed would be the Type I error probability associated with the test”

This is wrong; the p value is a random variable that will take different “observed” values from experiment to experiment. The Type I error is a fixed value, i.e., a constant, set up in advance.

(2) p values have frequentist meaning, not only because they are based on the sampling distributions, but more importantly because if the rejection rule is stated as “Reject when p < 0.05”, then overall a true Ho will be rejected in 5% of cases when all the nulls we test are true. However, if you do not state any kind of rejection rule (or significance limit) and perform a single experiment, and you obtain, say, p = 0.003, how can you say that your error rate is 0.003? This does not make sense at all, since your next experiment would certainly give a very different observed p! What would be the meaning of the (post hoc) statement: Type I error is 0.003?
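The frequency claim in point (2) can be checked with a small simulation. This is a minimal sketch, assuming a one-sided z-test of H0: mu = 0 with known sigma = 1 and every null true; the function names, sample size, number of experiments, and seed are illustrative choices, not anything from the thread.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_experiments=20000, n=25, seed=0):
    """Simulate one-sided z-test p-values when H0: mu = 0 is true (sigma = 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # standardized sample mean
        p_values.append(1 - std_normal.cdf(z))  # Pr(Z > z; H0)
    return p_values


def rejection_rate(p_values, alpha):
    """Proportion of simulated experiments with p below the threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


p_values = simulate_p_values()
for alpha in (0.05, 0.003):
    print(f"threshold {alpha}: proportion below = {rejection_rate(p_values, alpha):.4f}")
print(f"range of simulated p-values: {min(p_values):.4f} to {max(p_values):.4f}")
```

Under these assumptions the realized p-values vary widely from one run to the next, while the proportion falling below any fixed threshold stays close to that threshold. The simulation does not settle how that fact bears on a single observed p, which is the question being argued here.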

(3) p values do not overstate evidence compared with the Bayesian posteriors in the case of an absolutely sharp null hypothesis. The (often) huge discrepancies arise because Bayesians can only attack this problem by assigning positive probability mass to a single point that has Lebesgue measure zero. Therefore, it is the Bayesian who performs rituals and does mindless statistics, because his "God" Jeffreys advocated it!!!

(4) The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels.

Once again wrong, sorry Ms Jean Dickinson Gibbons and Mr Pratt.

The second (incredible) mistake here is that outcome observed would be judged significant at all levels SMALLER (not greater) than or equal to the P-value but not significant at any (GREATER, not smaller) levels.