Neyman April 16, 1894 – August 5, 1981
My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):
A local acting group is putting on a short theater production based on a screenplay I wrote: “Les Miserables Citations” (“Those Miserable Quotes”). The “miserable” citations are those everyone loves to cite, from Neyman and Pearson’s early joint 1933 paper:
We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.
But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).
In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests; for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists: Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses” (1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory”.
Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!) There go les trois coups–the curtain’s about to open!
The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since it’s his birthday.)
Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.
Borel, however, considered…that the method described could be applied with success provided that the character, x, of the observed facts were properly chosen—were, in fact, a character which he terms ‘en quelque sorte remarquable’” (Neyman and Pearson 1933, p.141/290).
The stage fades to black, then a spotlight shines on Bertrand, stage right.
Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility. …
[He turns to the audience, shaking his head.]
The application of such calculations to questions of this kind is a delusion and an abuse.” (Bertrand, 1907, p. 166; Lehmann 1993, p. 963).
The stage fades to black, then a spotlight appears on Borel, stage left.
Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)
The question is whether one has the same reservations in the case in which one states that one of the angles of the triangle formed by three stars has “une valeur remarquable” [a striking or noteworthy value], and is for example equal to the angle of the equilateral triangle…. (ibid.)
Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).
The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking.)
Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)
“[W]e may consider some specified hypothesis, as that concerning the group of stars, and look for a method which we should hope to tell us, with regard to a particular group of stars, whether they form a system, or are grouped ‘by chance,’…their relative movements unrelated.” (ibid.)
“If this were what is required of ‘an efficient test’, we should agree with Bertrand in his pessimistic view. For however small be the probability that a particular grouping of a number of stars is due to ‘chance’, does this in itself provide any evidence of another ‘cause’ for this grouping but ‘chance’? …Indeed, if x is a continuous variable—as for example is the angular distance between two stars—then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point.” (Emphasis added; ibid. pp. 141-2; 290-1)
Fade to black, spot on narrator mid-stage:
Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data-dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time, then you need to specify the type of falsity of Ho, the distance measure or test statistic, beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman in 1977, over 40 years later, reflecting on the relevance of Borel’s position.
Fade to black. Spotlight on an older Neyman, stage right.
Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….
This question was the subject of a lively discussion by Borel and others. Borel was optimistic but insisted that: (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations (of some sort remarkable) (Neyman 1977, pp. 102-103).
It is these remarks of Borel that served as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses.”(ibid. p. 103)
Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993.)
Pearson: “I remember that you produced this quotation [from Borel] when we began to get our paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations “en quelque sorte remarquable”’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”
Fade to black. End Play
Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one act play into something like the expressionist play of Michael Frayn, Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.
As with my previous attempts at “statistical theatre of the absurd” (e.g., “Stat on a hot-tin roof”), there would be no attempt at all to popularize: only published quotes and closely remembered conversations would be included.
Deconstructions on the Meaning of the Play by Theater Critics
It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect. To cope with this fact, statistical method does one of two things: it appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by an appropriately low prior weight to H’. What does the latter approach do? It says we need to consider the problem as of a general type. A test is a general rule, taking us from a test statistic to some assertion about alternative hypotheses that express the non-chance effect. Such assertions may be in error, but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand, to the general type of mistake that could be made.
Isn’t this taken care of by Fisher’s requirement that Pr(P ≤ p0; Ho) = p0, that is, that the test rarely rejects the null if true? It may be, in practice, Neyman and Pearson thought, but only with certain conditions that were not explicitly codified by Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.
The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).
Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.
Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” (p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).
To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true….By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).
Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.
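Pearson and Chandra Sekar’s “more difficult question” can be made concrete with a small simulation. The sketch below is not from their paper. It assumes, purely for illustration, that each of the n criteria examined yields an independent standard normal test statistic when Ho is true, and that each is assessed by a two-sided P-value. The function names, the choice of 20 criteria and the 0.05 cut-off are all hypothetical.

```python
import random
from statistics import NormalDist


def familywise_rate(n_criteria: int, alpha: float) -> float:
    """Analytic Pr(smallest of n independent P-values <= alpha) under Ho."""
    return 1.0 - (1.0 - alpha) ** n_criteria


def simulate_min_p_rate(n_criteria: int, alpha: float,
                        trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often the most unfavourable criterion reaches P <= alpha.

    Assumption (illustrative only): under Ho each criterion is an
    independent standard normal z-statistic with a two-sided P-value.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        p_values = [
            2.0 * (1.0 - std_normal.cdf(abs(rng.gauss(0.0, 1.0))))
            for _ in range(n_criteria)
        ]
        # Post-data selection: report only the smallest P-value found.
        if min(p_values) <= alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, familywise_rate(n, 0.05), simulate_min_p_rate(n, 0.05))
```

Under these assumptions, the chance that the most unfavourable of 20 criteria reaches P ≤ 0.05 is 1 − 0.95^20, about 0.64, even though each criterion taken alone reaches it only 5% of the time. Criteria computed on the same sample are usually dependent, so the exact figure in a real case would differ.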
“The statistician who does not know in advance with which type of alternative to H0 he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).
In a famous result, Neyman (1952) demonstrates that by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence by considering those the test had high power to detect.
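The carpenter’s choice between an omnibus and a specialized tool, and the idea of discrepancies a test has high power to detect, can be illustrated with a textbook case. The sketch below is not Neyman’s 1952 example. It assumes normal observations with known standard deviation sigma and a null hypothesis that the mean is 0, and it compares the power of an upper-tailed z-test (the specialized tool) with that of a two-sided z-test (the omnibus tool). The function names and the numbers used are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided(delta: float, n: int,
                    sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Power of the upper-tailed z-test of Ho: mean = 0 when the true mean is delta."""
    shift = delta * sqrt(n) / sigma           # standardized discrepancy
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical - shift)


def power_two_sided(delta: float, n: int,
                    sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Power of the two-sided z-test of Ho: mean = 0 when the true mean is delta."""
    shift = delta * sqrt(n) / sigma
    critical = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(critical - shift)
    lower_tail = STD_NORMAL.cdf(-critical - shift)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for delta in (-0.5, 0.0, 0.5):
        print(delta, power_one_sided(delta, n=25), power_two_sided(delta, n=25))
```

With these illustrative numbers (n = 25, sigma = 1, a discrepancy of 0.5), the one-sided test has power of about 0.80 against about 0.71 for the two-sided test, but it has almost no power if the discrepancy lies in the other direction. At a discrepancy of 0 both return the significance level, 0.05.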
Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.
“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.
Nevertheless, prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it loses its error probability status. It doesn’t. I claim they are confusing prespecifying the question or hypothesis with fixing the P-value in advance, a confusion that stems from failing to identify the rationale behind conventions of tests. Nor is predesignation even essential; rather, it is an excellent way to promote valid error probabilities.
But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!
Enter those pivotal statistics called upon in Fisher’s Fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy”, and “Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.
 Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.
 We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)
References
Bertrand, J. 1888/1907. Calcul des Probabilités. Paris: Gauthier-Villars.
Borel, E. 1914. Le Hasard. Paris: Alcan.
Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.
Lehmann, E. L. 1993. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory”. Reprinted (2012) in J. Rojo (ed.), Selected Works of E. L. Lehmann, Springer US, Boston, MA, pp. 965-974.
Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2nd ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.
Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics”, Synthese 36(1): 97–131.
Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.
Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.
Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations”, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.
Reid, C. 1982. Neyman–from life