Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.
1. Some assertions from Fisher, N-P, and Bayesian camps
Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)
a) From the Fisherian camp (Cox and Hinkley):
“For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by
p_obs = Pr(T > t_obs; H0).
….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).
Thus p_obs would be the Type I error probability associated with a test that treats t_obs as its rejection cutoff.
b) From the Neyman-Pearson (N-P) camp (Lehmann and Romano):
“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)
Very similar quotations are easily found, and are regarded as uncontroversial, even by Bayesians whose contributions formed the basis of Berger and Sellke’s argument that P values exaggerate the evidence against the null.
c) Gibbons and Pratt:
“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).
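The three quotations describe one and the same number from different angles. As a minimal sketch, assume a one-sided test whose statistic T is standard Normal under H0 (the Normal model, the observed value 2.1, and the function names are illustrative assumptions, not taken from any of the quoted texts). The P-value computed from the Cox and Hinkley definition then coincides with the “borderline level” of Lehmann and Romano and of Gibbons and Pratt:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed distribution of T under H0


def p_value(t_obs: float) -> float:
    """p_obs = Pr(T > t_obs; H0) for a one-sided test with T ~ N(0, 1) under H0."""
    return 1.0 - STD_NORMAL.cdf(t_obs)


def rejects_at(t_obs: float, alpha: float) -> bool:
    """Fixed-level test: reject H0 when T reaches the upper-alpha cutoff."""
    cutoff = STD_NORMAL.inv_cdf(1.0 - alpha)
    return t_obs >= cutoff


t_obs = 2.1  # hypothetical observed statistic
p_obs = p_value(t_obs)

print(f"p_obs = {p_obs:.4f}")
for alpha in (0.10, 0.05, 0.025, 0.01):
    print(f"alpha = {alpha:<5}  reject H0: {rejects_at(t_obs, alpha)}")
```

With these assumed numbers the P-value is about 0.018, so the outcome counts as significant at the .025 level but not at the .01 level: each reader can apply a level of their own choosing, which is the point Gibbons and Pratt make.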
2. So what’s behind the “P values aren’t error probabilities” allegation?
In their rejoinder to Hinkley, Berger and Sellke assert the following: “The use of the term ‘error rate’ suggests that the frequentist justifications, such as they are, for confidence intervals and fixed α-level hypothesis tests carry over to P values.”
They do not disagree with Cox and Hinkley’s assertion above, but they maintain that:
“This hypothetical error rate does not conform to the usual classical notion of ‘repeated-use’ error rate, since the P-value is determined only once in this sequence of tests. The frequentist justifications of significance tests and confidence intervals are in terms of how these procedures perform when used repeatedly.” (Berger and Sellke 1987, 136)
Keep in mind that Berger and Sellke are using “significance tests” to refer to Neyman-Pearson (N-P) tests in contrast to Fisherian P-value appraisals.
So their point appears to be simply that the P value, as intended by Fisher, is not justified by (or not intended to be justified by) a behavioral appeal to controlling long run error rates. It is assumed that those are the only, or the main, justifications available for N-P significance tests and confidence intervals (thus Type 1 and 2 error probabilities and confidence levels are genuine error probabilities). They do not entertain the idea that the P value, as the attained significance level, is important for N-P theorists, nor that “a p-value gives an idea of how strongly the data contradict the hypothesis” (Lehmann and Romano), a construal we find early on in David Cox.
But let’s put that aside, as we pin down Berger and Sellke’s point. Here’s how we might construe them. They grant that the P-value is, mathematically, a frequentist error probability; it is the justification that they think differs from what they take to be the justification of Type 1 and 2 error probabilities in N-P statistics. They think N-P tests and confidence intervals get their justification in terms of (actual?) long run error rates, and P-values do not. To continue with their remarks:
“Can P values be justified on the basis of how they perform in repeated use? We doubt it. For one thing, how would one measure the performance of P values? With significance tests and confidence intervals, they are either right or wrong, so it is possible to talk about error rates. If one introduces a decision rule into the situation by saying that H0 is rejected when the P value < .05, then of course the classical error rate is .05.”[ii] (ibid.)
Thus, P values are error probabilities, but their intended justification (by Fisher?) was not a matter of a behavioristic appeal to low long-run error rates, but rather, something more inferential or evidential. We can actually strengthen their argument in a couple of ways. Firstly, we can remove the business of “actual” versus “hypothetical” repetitions, because the behavioristic justifications that they are trying to call out are also given in terms of hypotheticals. Moreover, behavioristic appeals to controlling error rates are not limited to “reject/do not reject”, but apply even where the inference is in terms of an inferred discrepancy or other test output.
The problem is that the inferential vs behavioristic distinction does not separate Fisherian P-values from confidence levels and Type 1 and 2 error probabilities. All of these are amenable to both types of interpretation! More to follow in installment #2.
Installment 2: Mirror Mirror on the Wall, Who’s the More Behavioral of them all?
Granted, the founders did not fully work out the intended inferential construals, though representatives from Fisherian and N-P camps took several steps. At the same time, members of both camps can also be found talking like acceptance samplers!
Berger and Sellke had said: “If one introduces a decision rule into the situation by saying that H0 is rejected when the P value < .05, then of course the classical error rate is .05.” Good. Then we can agree that it is mathematically an error probability. They simply don’t think it reflects the Fisherian ideal.
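Both halves of Berger and Sellke’s remark, the .05 error rate of the rule and the .025 average P value among rejections that they add in note [ii], can be checked numerically. The following is only a sketch under stated assumptions: a one-sided test with T standard Normal under a true H0, 200,000 simulated repetitions, and a fixed seed. The function name is made up for the illustration.

```python
import random
from statistics import NormalDist, mean


def simulate_p_values(n_tests: int, seed: int = 0) -> list[float]:
    """Draw T ~ N(0, 1) under a true H0 and return the one-sided P-values."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    return [1.0 - std_normal.cdf(rng.gauss(0.0, 1.0)) for _ in range(n_tests)]


p_values = simulate_p_values(200_000)

# Decision rule from the quotation: reject H0 when the P value is below .05.
rejected = [p for p in p_values if p < 0.05]
error_rate = len(rejected) / len(p_values)
mean_p_given_rejection = mean(rejected)

print(f"share of true nulls rejected: {error_rate:.4f}")
print(f"average P value among rejections: {mean_p_given_rejection:.4f}")
```

Because the P value is uniformly distributed under H0 in this continuous setup, the rejection share should land near .05 and the average P value among rejections near .025, up to simulation noise. The simulation speaks only to the mathematics; it is silent on which justification, behavioristic or inferential, one attaches to the number.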
3. Fisher as acceptance sampler.
But it was Fisher, after all, who declared that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (DOE 15-16)
Or to quote from an earlier article of Fisher (1926):
…we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen.
The above is a more succinct version of essentially the same points Fisher makes in DOE.[iii]
No wonder Neyman could tell Fisher to look in the mirror (as it were): “Pearson and I were only systematizing your practices for how to interpret data, using those nice charts you made. True, we introduced the alternative hypothesis (and the corresponding type 2 error), but that was only to give a rationale, and apparatus, for the kinds of tests you were using. You never had a problem with the Type 1 error probability, and your concern for how best to increase “sensitivity” was to be reflected in the power assessment. You had no objections—at least at first”. See this post.
The dichotomous “up-down” spirit that Berger and Sellke suggest is foreign to Fisher is not foreign at all. Again from DOE:
Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretation. ….The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; …and on the other hand, results which show no significant discrepancy from this hypothesis. (DOE 15)
Even where Fisher is berating Neyman for introducing the Type 2 error (he had no problem with Type 1 errors, and both were fine in cases of estimation), Fisher falls into talk of actions, as Neyman points out (Neyman 1956, Triad).
“The worker’s real attitude in such a case might be, according to the circumstances:
(a) ‘the probable deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.’” (Fisher 1955, Triad, p. 73)
Pearson responds (1955) that this is entirely the type of interpretation they imagined to be associated with the bare mathematics of the test. And Neyman made it clear early on (though I didn’t discover it at first) that he intended “accept” to serve merely as a shorthand for “do not reject”. See this recent post, which includes links to all three papers in the “triad” (by Fisher, Neyman, and Pearson).
“In fact Fisher referred approvingly to the concept of the power curve of a test procedure and although he wrote: ‘On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or ‘decide’ between two or more hypotheses, have greatly obscured their understanding’, he was careful to go on and add ‘when taken not as contingent possibilities but as elements essential to their logic’.” (Barnard 1972, 129).
To see how Fisher links power to his own work early on, please check this post.
So we are back to the key question: what is the basis for Berger and Sellke (and others who follow similar lines of criticism) to allow error probabilities in the case of N-P significance tests and confidence intervals, and not in the case of P-values? It cannot be whether the method involves a rule for mapping outcomes to interpretations (be there two or three—the third might be N-P’s initial “remain undecided” or “get more data”), because we’ve just seen that to be true of Fisherian tests as well.
4. Fixing the type 1 error probability
But isn’t the issue that N-P tests fix the type 1 error probability in advance? Firstly, we must distinguish between fixing the P value threshold to be used in each application, and justifying tests solely by reference to a control of long run error (behavioral justification). So what about the first point of predesignating the threshold? Actually, this was more Fisher than N-P:
“Neyman and Pearson followed Fisher’s adoption of a fixed level,” Erich Lehmann tells us (Lehmann 1993, 1244). Lehmann is flummoxed by the accusation of fixed levels of significance since “[U]nlike Fisher, Neyman and Pearson (1933, p. 296) did not recommend a standard level but suggested that ‘how the balance [between the two kinds of error] should be struck must be left to the investigator.’” (ibid.) From their earliest papers, they stressed that the tests were to be “used with discretion and understanding” depending on the context. Pearson made it clear that he thought it “irresponsible”, in a matter of importance, to distinguish rejections at the .025 or .05 level.[iv] (See this post.) And as we already saw, Lehmann (who developed N-P tests as decision rules) recommends reporting the attained P value.
In a famous passage,[v] Fisher (1956) raises the criticism—but without naming names:
A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection….However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.
It is assumed he is speaking of N-P, or at least Neyman, but I wonder…
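Fisher’s arithmetic in that passage is easy to reproduce. Here is a rough sketch, assuming a one-sided test with T standard Normal when the hypothesis is correct; the share of correct hypotheses (one half) and the size of the shift when the hypothesis is incorrect (3 standard errors) are hypothetical values chosen only to make the example run.

```python
import random
from statistics import NormalDist

ALPHA = 0.01            # Fisher's "1% level"
SHARE_TRUE_NULLS = 0.5  # hypothetical: half of the tested hypotheses are correct
SHIFT_IF_FALSE = 3.0    # hypothetical: mean of T when the hypothesis is incorrect


def run_habitual_rejector(n_tests: int, seed: int = 1) -> tuple[float, float]:
    """Return (mistaken-rejection rate among correct hypotheses, rate over all tests)."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - ALPHA)
    true_nulls = 0
    mistaken = 0
    for _ in range(n_tests):
        h0_true = rng.random() < SHARE_TRUE_NULLS
        t = rng.gauss(0.0 if h0_true else SHIFT_IF_FALSE, 1.0)
        if h0_true:
            true_nulls += 1
            # A rejection can only be mistaken when the hypothesis is correct.
            if t >= cutoff:
                mistaken += 1
    return mistaken / true_nulls, mistaken / n_tests


rate_when_true, rate_overall = run_habitual_rejector(200_000)
print(f"mistaken rejections among correct hypotheses: {rate_when_true:.4f}")
print(f"mistaken rejections among all tests: {rate_overall:.4f}")
```

Under these assumptions the first rate should sit near 1% and the second below it, which is the “not more than 1%” of the quotation. Whether such a habitual rejector describes anyone’s actual practice is, of course, the question Fisher is raising.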
Anyway, the point is that the mathematics admits of different interpretations and uses. The “P values are not error rates” argument really boils down to claiming that the justification for using P-values inferentially is not merely that if you repeatedly did this you’d rarely erroneously interpret results (that’s necessary but not sufficient for the inferential warrant). That, of course, is what I (and others) have been arguing for ages—but I’d extend this to N-P significance tests and confidence intervals, at least in contexts of scientific inference. See, for example, Mayo and Cox (2006/2010), Mayo and Spanos (2006). We couldn’t even express the task of how to construe error probabilities inferentially if we could only use the term “error probabilities” to mean something justified only by behavioristic long-runs.
5. What about the Famous Blow-ups?
What about the big disagreement between Neyman and Fisher (Pearson is generally left out of it)? Well, I think that as hostilities between Fisher and Neyman heated up, the former got more and more evidential (and even fiducial) and the latter more and more behavioral. Still, what has made a lasting impression on people, understandably, are Fisher’s accusations that Neyman (if not Pearson) converted his tests into acceptance sampling devices, more suitable for making money in the U.S. or Russian 5 year plans, than thoughtful inference. (But remember Pearson’s and Neyman’s responses.) Imagine what he might have said about today’s infatuation with converting P value assessments to dichotomous outputs to compute science-wise error rates: Neyman on steroids.[vi]
By the way, it couldn’t have been too obvious that N-P distorted his tests, since Fisher tells us in 1955 that it was only when Barnard brought it to his attention that “despite agreeing mathematically in very large part”, there is a distinct philosophical position emphasized at least by Neyman. So it took like 20 years to realize this? (Barnard also told me this in person, recounted in this theater production.)
Here’s an enlightening passage from Cox (2006):
Neyman and Pearson “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 Fisher was writing to Neyman encouragingly about this work, but relations soured, notably when Fisher greatly disapproved of a paper of Neyman’s on experimental design and no doubt partly because their being in the same building at University College London brought them too close to one another!” (195)
Being in the same building, indeed! Recall Fisher declaring that if Neyman teaches in the same building and doesn’t use his book, he would oppose him in all things. See this post for details on some of their anger management problems.
The point is that it is absurd to base conceptions of inferential methods on personality disputes rather than the mathematical properties of tests (and their associated interpretations). These two approaches are best seen as offering clusters of tests appropriate for different contexts within the large taxonomy of tests and estimation methods. We can agree that the radical behavioristic rationale for error rates is not the rationale intended by Fisher in using P-values. I would argue it was not the rationale intended by Pearson, nor, much of the time, by Neyman. Yet we should be beyond worrying about what the founders really thought. It’s the methods, stupid.
Readers should not have to go through this “he said/we said” history again. Enough! Nor should they be misled into thinking there’s a deep inconsistency which renders all standard treatments invalid (by dint of using both N-P and Fisherian tests).
So has pure analytic philosophy, by clarifying terms (along with a bit of history of statistics), solved the apparent disagreement with Berger and Sellke (1987) and others?
It’s gotten us somewhere, yet there’s a big problem that remains. TO BE CONTINUED ON A NEW POST
Barnard, G. (1972). “Review of ‘The Logic of Statistical Inference’ by I. Hacking” Brit. J. Phil. Sci. 23(2): 123-132.
Berger, J. O. and Sellke, T. (1987) “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112-139.
Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.
Cox, D. R. (2006) Principles of Statistical Inference. Cambridge: Cambridge University Press.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics, London, Chapman & Hall.
Fisher, R. A. (1926). “The Arrangement of Field Experiments”, J. of Ministry of Agriculture, Vol. XXXIII, 503-513.
Fisher, R. A. (1947). The Design of Experiments (4th ed.). New York: Hafner.
Fisher, R. A. (1955). “Statistical Methods and Scientific Induction,” Journal of the Royal Statistical Society (B) 17: 69-78.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference. New York: Hafner.
Gibbons, J. & Pratt, J. W. (1975). “P-values: Interpretation and Methodology”, The American Statistician 29: 20-25.
Lehmann, E. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” J. Amer. Statist. Assoc., 88(424):1242-1249.
Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.), New York: Springer.
Mayo, D. G. and Cox, D. R. (2006/2010). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.
Neyman, J. (1977) “Frequentist Probability and Frequentist Statistics,” Synthese 36: 97-131.
Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher,” Journal of the Royal Statistical Society (B), 18:288-294.
Neyman, J. and Pearson, E.S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Philosophical Transactions of the Royal Society of London, (A), 231, 289-337.
Pearson, E. S. (1955). “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, (B), 17: 204-207.
[i] With the usual inversions.
[ii] They add “but the expected P value given rejection is .025, an average understatement of the error rate by a factor of two.”
[iii] Neyman did put in a plug for developments in empirical Bayesian methods in his 1977 Synthese paper.
[iv] Pearson says,
The test of significance (13):
“It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20–the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent –then it would be useless for him to experiment with only 3 cups of tea…. It is usual and convenient for the experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. …we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (emphasis added)
On 46-7 Fisher clarifies something people often confuse: it’s not the low probability of the event “rather to the fact, very near in this case, that the correctness of the assertion would entail an event of this low probability.”
[vi] It follows a paragraph criticizing Bayesians.