3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*
We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined:
Step 1. We must first specify the set of results . . .
Step 2. We then divide this set by a system of ordered boundaries . . . such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.
Step 3. We then, if possible, associate with each contour level the chance that, if H0 is true, a result will occur in random sampling lying beyond that level . . .
In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 preceded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)
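To make Pearson's three steps concrete, here is a minimal sketch in Python. It is an illustration only, not anything from Pearson's paper: it assumes a Normal model with known standard deviation, a one-sided test of H0: μ = μ0 against larger values of μ, and made-up data. The function names `distance` and `tail_chance` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step 1 (assumed here): the set of possible results is every sample of
# size n drawn from a Normal(mu, sigma^2) population with sigma known.

def distance(sample, mu0, sigma):
    """Step 2: order results by how far the sample mean lies above mu0,
    in standard-error units. Larger values count as greater discordance
    with H0 in the direction of the alternatives mu > mu0."""
    n = len(sample)
    return sqrt(n) * (mean(sample) - mu0) / sigma

def tail_chance(d_obs):
    """Step 3: the chance, if H0 is true, of a result lying beyond the
    contour d_obs. Under H0 the distance is standard Normal."""
    return 1.0 - NormalDist().cdf(d_obs)

# Hypothetical data, invented purely for illustration.
sample = [0.3, 1.1, -0.2, 0.9, 0.7, 1.4, 0.2, 0.8, 0.5]
d_obs = distance(sample, mu0=0.0, sigma=1.0)
p = tail_chance(d_obs)
print(f"distance = {d_obs:.2f}, chance beyond this contour under H0 = {p:.4f}")
```

Note that the probability in Step 3 cannot even be computed until `distance` has fixed the contour system, which is Pearson's reason for numbering the steps as he does.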
In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!
We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed:
Neyman found . . . [K.]Pearson himself surprisingly ignorant of modern mathematics. (The fact that Pearson did not understand the difference between independence and lack of correlation led to a misunderstanding that nearly terminated Neyman’s stay . . .) (Lehmann 1988, p. 2)
Thus, instead of spending his second fellowship year in London, Neyman goes to Paris where his wife Olga (“Lola”) is pursuing a career in art, and where he can attend lectures in mathematics by Lebesgue and Borel. “[W]ere it not for Egon Pearson [whom I had briefly met while in London], I would have probably drifted to my earlier passion for [pure mathematics]” (Neyman quoted in Lehmann 1988, p. 3).
What pulled him back to statistics was Egon Pearson’s letter in 1926. E. Pearson had been “suddenly smitten” with doubt about the justification of tests then in use, and he needed someone with a stronger mathematical background to pursue his concerns. Neyman had just returned from his fellowship years to a hectic and difficult life in Warsaw, working multiple jobs in applied statistics.
[H]is financial situation was always precarious. The bright spot in this difficult period was his work with the younger Pearson. Trying to find a unifying, logical basis which would lead systematically to the various statistical tests that had been proposed by Student and Fisher was a ‘big problem’ of the kind for which he had hoped . . . (ibid., p. 3)
Historical Sidelight. Except for short visits and holidays, their work proceeded by mail. When Pearson visited Neyman in 1929, he was shocked at the conditions in which Neyman and other academics lived and worked in Poland. Numerous letters from Neyman describe the precarious position in his statistics lab: “You may have heard that we have in Poland a terrific crisis in everything” (C. Reid 1998, p. 99). In 1932, “I simply cannot work; the crisis and the struggle for existence takes all my time and energy” (Lehmann 2011, p. 40). Yet he managed to produce quite a lot. While at the start the initiative for the joint work came from Pearson, it soon turned in the other direction, with Neyman leading the way.
By comparison, Egon Pearson’s greatest troubles at the time were personal: He had fallen in love “at first sight” with a woman engaged to his cousin George Sharpe, and she with him. She returned the ring the very next day, but Egon still gave his cousin two years to win her back (C. Reid 1998, p. 86). In 1929, buoyed by his work with Neyman, Egon finally declares his love and they are set to be married, but he lets himself be intimidated by his father, Karl, deciding “that I could not go against my family’s opinion that I had stolen my cousin’s fiancée . . . at any rate my courage failed” (ibid., p. 94). Whenever Pearson says he was “suddenly smitten” with doubts about the justification of tests while gazing on the fruit station that his cousin directed, I can’t help thinking he’s also referring to this woman (ibid., p. 60). He was lovelorn for years, but refused to tell Neyman what was bothering him.
Performance versus Severity Construals of Tests
“The work [of N-P] quite literally transformed mathematical statistics” (C. Reid 1998, p. 104). The idea that appraising statistical methods revolves around optimality (of some sort) goes viral. Some compared it “to the effect of the theory of relativity upon physics” (ibid.). Even when the optimal tests were absent, the optimal properties served as benchmarks against which the performance of methods could be gauged. They had established a new pattern for appraising methods, paving the way for Abraham Wald’s decision theory, and the seminal texts by Lehmann and others. The rigorous program overshadowed the more informal Fisherian tests. This came to irk Fisher. Famous feuds between Fisher and Neyman erupted as to whose paradigm would reign supreme. Those who sided with Fisher erected examples to show that tests could satisfy predesignated criteria and long-run error control while leading to counterintuitive tests in specific cases. That was Barnard’s point on the eclipse experiments (Section 3.1): no one would consider the class of repetitions as referring to the hoped-for 12 photos, when in fact only some smaller number were usable. We’ll meet up with other classic chestnuts as we proceed.
N-P tests began to be couched as formal mapping rules taking data into “reject H0” or “do not reject H0” so as to ensure the probabilities of erroneous rejection and erroneous acceptance are controlled at small values, independent of the true hypothesis and regardless of prior probabilities of parameters. Lost in this behavioristic formulation was how the test criteria naturally grew out of the requirements of probative tests, rather than good long-run performance. Pearson underscores this in his paper (1947) in the epigraph of Section 3.2: Step 2 comes before Step 3. You must first have a sensible distance measure. Since tests that pass muster on performance grounds can simultaneously serve as probative tests, the severe tester breaks out of the behavioristic prison. Neither Neyman nor Pearson, in their applied work, was wedded to it. Where performance and probativeness conflict, probativeness takes precedence. Two decades after Fisher allegedly threw Neyman’s wood models to the floor (Section 5.8), Pearson (1955) tells Fisher: “From the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’” (p. 206):
. . . it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story. (ibid., pp. 204–5)
In fact, the tests as developed by Neyman–Pearson began as an attempt to obtain tests that Fisher deemed intuitively plausible, and this goal is easily interpreted as that of computing and controlling the severity with which claims are inferred. Not only did Fisher reply encouragingly to Neyman’s letters during the development of their results, it was Fisher who first informed Neyman of the split of K. Pearson’s duties between himself and Egon, opening up the possibility of Neyman’s leaving his difficult life in Poland and gaining a position at University College in London. Guess what else? Fisher was a referee for the all-important N–P 1933 paper, and approved of it.
To Neyman it has always been a source of satisfaction and amusement that his and Egon’s fundamental paper was presented to the Royal Society by Karl Pearson, who was hostile and skeptical of its contents, and favorably refereed by the formidable Fisher, who was later to be highly critical of much of the Neyman–Pearson theory. (C. Reid 1998, p. 103)
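For readers who want to see the behavioristic formulation in miniature, the following Python sketch shows a test written as a mapping rule from data into “reject H0” or “do not reject H0”, along with its two error probabilities. It is only an illustration under assumptions of my own choosing: a one-sided test of a Normal mean with known standard deviation, a fixed chance of erroneous rejection (called `alpha` here; Pearson's ε), and hypothetical function names `cutoff`, `decide`, and `power`.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def cutoff(alpha):
    """Contour beyond which results are rejected, chosen so that the
    chance of erroneous rejection under H0 equals alpha."""
    return Z.inv_cdf(1.0 - alpha)

def decide(xbar, mu0, sigma, n, alpha):
    """The mapping rule: data in, 'reject H0' or 'do not reject H0' out."""
    d = sqrt(n) * (xbar - mu0) / sigma
    return "reject H0" if d >= cutoff(alpha) else "do not reject H0"

def power(mu1, mu0, sigma, n, alpha):
    """Chance that the rule rejects H0 when the true mean is mu1.
    One minus this value is the chance of erroneous acceptance at mu1."""
    shift = sqrt(n) * (mu1 - mu0) / sigma
    return 1.0 - Z.cdf(cutoff(alpha) - shift)

# Hypothetical numbers, chosen only to exercise the rule.
print(decide(xbar=0.4, mu0=0.0, sigma=1.0, n=25, alpha=0.05))
print(decide(xbar=0.2, mu0=0.0, sigma=1.0, n=25, alpha=0.05))
print(round(power(mu1=0.5, mu0=0.0, sigma=1.0, n=25, alpha=0.05), 3))
```

The rule itself says nothing about why the distance measure inside `decide` is a sensible one; that choice is Step 2 of the epigraph, and it has to be settled before any of these error probabilities can be computed.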
*…To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).
Excursion 3: Statistical Tests and Scientific Inference
Tour I Ingenious and Severe Tests
3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration
3.3 How to Do All N-P Tests Do (and more) While a Member of the Fisherian Tribe