ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.
Neyman died on August 5, 1981. Here’s an unusual paper of his, “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I have been reading a fair amount by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance*, is largely a fable that has been steadily built up and accepted as gospel. This has justified ignoring Neyman-Pearson statistics (as resting solely on long-run performance and irrelevant to scientific inference) and turning to crude variations of significance tests that Fisher wouldn’t have countenanced for a moment (so-called NHSTs), lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses–notably going from a small P-value to claiming evidence for a substantive research hypothesis. The upshot is to reject all of frequentist statistics, even though P-values are a teeny tiny part. *This represents a change in my perception of Neyman’s philosophy since EGEK (Mayo 1996). I still say that, for our uses of method, it doesn’t matter what anybody thought, that “it’s the methods, stupid!” Anyway, I recommend, in this very short paper, the general comments and the example on home ownership. Here are two snippets:
The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.…
(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words “decision” or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)
To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.
Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my new book. Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:
I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).
The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)
The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
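Neyman’s advice to study the power function before reading anything into a non-rejection can be made concrete with a small numerical sketch. Everything specific here is an assumption chosen for illustration, not Neyman’s own example: a one-sided Normal test of H0: mu <= mu0 with known standard deviation, a discrepancy of interest of 0.2, and sample sizes of 10 and 400. The function name `power_one_sided_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu <= mu0 at mu = mu0 + delta.

    Assumes Normal data with known standard deviation sigma (an
    illustrative setup, not the test in Neyman's paper).
    """
    z = NormalDist()                       # standard Normal
    critical = z.inv_cdf(1 - alpha)        # reject when the z-statistic exceeds this
    shift = delta * sqrt(n) / sigma        # mean of the z-statistic under mu0 + delta
    return 1 - z.cdf(critical - shift)     # P(reject; mu = mu0 + delta)


# Same discrepancy of interest, two very different sample sizes.
low = power_one_sided_z(delta=0.2, sigma=1.0, n=10)
high = power_one_sided_z(delta=0.2, sigma=1.0, n=400)
print(f"n = 10:  power against a 0.2 discrepancy = {low:.3f}")
print(f"n = 400: power against a 0.2 discrepancy = {high:.3f}")
```

With the small sample the test would usually miss a discrepancy of this size, so a failure to reject says little about its absence. With the large sample the power exceeds the 0.95 that Neyman mentions, and the same failure to reject becomes informative.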
Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis H only to the extent that it passed a severe test–one with a high probability of having found flaws in H, if they existed. Neyman puts this in terms of having high power to reject H, if H is false and alternative H’ true, and high probability of finding no evidence against H if true, but it’s the same idea. (Their weakness is in being predesignated error probabilities, but severity fixes this.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.
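To illustrate how a severity assessment uses the observed result rather than a predesignated error probability, here is a minimal sketch for a non-rejection. It assumes the one-sided Normal test of H0: mu <= mu0 with known sigma, and takes the severity for the claim mu <= mu1 to be the probability of a sample mean larger than the one observed, computed under mu = mu1, which is how this case is treated in Mayo and Spanos (2006) as I read it. The function name `severity_mu_at_most` and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_at_most(xbar, mu1, sigma, n):
    """Severity for the claim mu <= mu1 after a non-rejection.

    Computed as P(sample mean > observed xbar; mu = mu1), assuming
    Normal data with known sigma (illustrative assumption).
    """
    standard_error = sigma / sqrt(n)
    return 1 - NormalDist().cdf((xbar - mu1) / standard_error)


# Hypothetical data: mu0 = 0, sigma = 1, n = 100, observed mean 0.1
# (z = 1.0, so the 0.05-level test does not reject H0).
for mu1 in (0.1, 0.2, 0.3):
    sev = severity_mu_at_most(xbar=0.1, mu1=mu1, sigma=1.0, n=100)
    print(f"SEV(mu <= {mu1}) = {sev:.3f}")
```

The same non-significant result warrants the claim that mu is at most 0.3 much better than the claim that mu is at most 0.1, so the assessment depends on the discrepancy in question and on the outcome actually observed.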
Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach, from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.
De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).
For related papers, see:
- Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
- Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.
That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There’s plenty of evidence, by the way, that Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”
And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5-year plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935, which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.
Fisher is the arch anti-Bayesian, whereas Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):
- Mayo, D. 1996. “Why Pearson rejected the Neyman-Pearson (behavioristic) philosophy and a note on objectivity in statistics.”
de Finetti, B. 1972. Probability, Induction and Statistics: The Art of Guessing. Wiley.
Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8), 737-751.