Erich Lehmann’s 100th Birthday: Neyman-Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said, “I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Here are some excerpts (in blue) and my remarks (in black).

…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard …” …
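To make the point about quantile tables concrete, here is a minimal sketch in Python. It assumes a chi-square statistic with 1 degree of freedom and uses four upper-tail critical values of the kind printed in standard tables; the observed value 4.5 and the function names are hypothetical, chosen only for illustration. The table alone yields a range for P, yet it settles at once whether the result is significant at a given level.

```python
import math

# Upper-tail critical values of chi-square with 1 degree of freedom,
# as found in standard printed tables: {tail probability: critical value}.
CHI2_DF1_TABLE = {0.10: 2.706, 0.05: 3.841, 0.02: 5.412, 0.01: 6.635}


def p_value_bracket(stat, table=CHI2_DF1_TABLE):
    """Return (lower, upper) bounds on P that the table alone supports."""
    lower, upper = 0.0, 1.0
    for tail_prob, critical in table.items():
        if stat >= critical:
            upper = min(upper, tail_prob)  # P is at most this tail area
        else:
            lower = max(lower, tail_prob)  # P exceeds this tail area
    return lower, upper


def exact_p_df1(stat):
    """Exact upper-tail P for 1 df: P(Z^2 >= stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


observed = 4.5  # hypothetical observed chi-square value
low, high = p_value_bracket(observed)
significant_at_05 = observed >= CHI2_DF1_TABLE[0.05]

print(f"Table gives {low} < P <= {high}; significant at .05: {significant_at_05}")
print(f"Exact P (1 df only): {exact_p_df1(observed):.4f}")
```

The exact P is computable here only because the 1-df case reduces to the normal distribution; for other degrees of freedom a reader with nothing but the table is left with the bracket.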

Neyman and Pearson followed Fisher’s adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by “[Fisher’s] tables of 5 and 1% significance levels which lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take.” … [p. 1244]

It is interesting to note that unlike Fisher, Neyman and Pearson (1933a, p. 296) did not recommend a standard level but suggested that “how the balance [between the two kinds of error] should be struck must be left to the investigator.” … It is thus surprising that in SMSI Fisher (1973, pp. 44-45) criticized the NP use of a fixed conventional level. …

Responding to earlier versions of these and related objections by Fisher to the Neyman-Pearson formulation, Pearson (1955, p. 206) admitted that the terms “acceptance” and “rejection” were perhaps unfortunately chosen, but of his joint work with Neyman he said that “from the start we shared Professor Fisher’s view that in scientific inquiry, a statistical test is ‘a means of learning’ ” and “I would agree that some of our wording may have been chosen inadequately, but I do not think that our position in some respects was or is so very different from that which Professor Fisher himself has now reached.” [This is from Pearson’s portion of the “triad”: “Statistical Concepts in Their Relation to Reality”.]

 [A] central consideration of the Neyman-Pearson theory is that one must specify not only the hypothesis H but also the alternatives against which it is to be tested. In terms of the alternatives, one can then define the type II error (false acceptance) and the power of the test (the rejection probability as a function of the alternative). This idea is now fairly generally accepted for its importance in assessing the chance of detecting an effect (i.e., a departure from H) when it exists, determining the sample size required to raise this chance to an acceptable level, and providing a criterion on which to base the choice of an appropriate test…[p.1244].
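The roles of the alternative, the type II error, power and sample size can be sketched numerically. The example below is only an illustration under stated assumptions: a one-sided z-test of H: mu = 0 with known standard deviation, and a hypothetical alternative mu = 0.5. None of these numbers come from Lehmann’s paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(n, delta, sigma=1.0, alpha=0.05):
    """Rejection probability of the one-sided z-test of H: mu = 0 when mu = delta.

    The test rejects when sqrt(n) * mean / sigma exceeds the (1 - alpha) quantile.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_crit - delta * n ** 0.5 / sigma)


def required_sample_size(delta, target_power, sigma=1.0, alpha=0.05):
    """Smallest n whose power against mu = delta reaches target_power."""
    n = 1
    while power(n, delta, sigma, alpha) < target_power:
        n += 1
    return n


delta = 0.5  # hypothetical departure from H that we want to detect
n_needed = required_sample_size(delta, target_power=0.80)
type_ii_error = 1.0 - power(n_needed, delta)

print(f"n = {n_needed}, power = {power(n_needed, delta):.3f}, "
      f"type II error = {type_ii_error:.3f}")
```

At delta = 0 the power function returns the level alpha, which is why the alternative has to be specified before power or sample size can be discussed at all.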

The big difference Lehmann sees between Fisher and NP is not one we’re most likely to hear about these days: conditioning vs maximizing power. [The paper puts to one side the difference between Fisher’s fiducial and Neyman’s confidence intervals] [i]. To illustrate the conditioning issue, Lehmann alludes to the famous example of flipping a coin to determine if an experiment will use a highly precise or imprecise weighing machine (Cox 1958). Discussions may be found on this blog. [Search under Birnbaum or the SLP.] In this paper, Lehmann makes it clear that NP theory is free to condition (to attain the relevant error probabilities) even if an unconditional test could well be chosen if the context called for maximizing power. I find it interesting that Lehmann’s statistical philosophy (not just in his writing, but in many conversations we’ve had) emphasizes the difference between scientific contexts– where the concern is finding things out in the case at hand–and contexts where long-run optimization is of prime importance, while his own work contributed so much to developing the latter. Back to his paper:
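A rough numerical sketch may help fix ideas about the two-instrument example. It is not Cox’s or Lehmann’s own calculation: the standard deviations (1 and 10), the fair coin, and the crude rule “reject if X exceeds one common cutoff” are all assumptions made for illustration, and that rule is not the most powerful unconditional test. The sketch only shows that a test whose overall level is 0.05 can have very different error probabilities once we know which instrument was used.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMAS = {"precise": 1.0, "imprecise": 10.0}  # hypothetical standard deviations
ALPHA = 0.05


def tail(cutoff, sigma):
    """Type I error of 'reject if X > cutoff' when X ~ N(0, sigma^2)."""
    return 1.0 - STD_NORMAL.cdf(cutoff / sigma)


def overall_level(cutoff):
    """Level averaged over the fair coin flip that picks the instrument."""
    return 0.5 * sum(tail(cutoff, s) for s in SIGMAS.values())


# Bisection for the common cutoff whose averaged level equals ALPHA
# (overall_level is decreasing in the cutoff).
lo, hi = 0.0, 100.0
for _ in range(200):
    mid = (lo + hi) / 2.0
    if overall_level(mid) > ALPHA:
        lo = mid
    else:
        hi = mid
common_cutoff = (lo + hi) / 2.0

# Error probabilities of the common-cutoff rule given the instrument used.
conditional_levels = {name: tail(common_cutoff, s) for name, s in SIGMAS.items()}

# Conditional test: a separate cutoff per instrument, each at level ALPHA.
conditional_cutoffs = {
    name: s * STD_NORMAL.inv_cdf(1.0 - ALPHA) for name, s in SIGMAS.items()
}

print(f"common cutoff = {common_cutoff:.3f}, levels given instrument: {conditional_levels}")
print(f"conditional cutoffs: {conditional_cutoffs}")
```

Under these assumptions the common-cutoff rule almost never rejects with the precise instrument and rejects about 10% of the time with the imprecise one, averaging to 5%. The conditional test reports 5% in each case, which is the error probability relevant to the measurement actually made.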

Fisher was of course aware of the importance of power, as is clear from the following remarks (1947, p. 24): “With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself.”

…Under the heading “Multiplicity of Tests of the Same Hypothesis,” he devoted a section (sec. 61) to this topic. Here again, without using the term, he referred to alternatives when he wrote (Fisher 1947, p. 182) that “we may now observe that the same data may contradict the hypothesis in any of a number of different ways.” After illustrating how different tests would be appropriate for different alternatives, he continued (p. 185):

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation but has been the occasion of much theoretical discussion among statisticians….[The experimenter] is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs. He is, therefore, not usually concerned with the question: To what observational feature should a test of significance be applied?

The idea is “that there is no need for a theory of test choice, because an experienced experimenter knows what is the appropriate test” [ibid, p. 1245]. This reminds me of Senn’s post on Fisher’s alternative to the alternative. Maybe “an experienced experimenter” knows the appropriate test, but this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for choices that are made informally. Frankly, I can’t take entirely seriously any complaints Fisher levels against NP after 1935 and the break-up. Lehmann returns to the question “cut-offs or P-values?” at the end [p. 1247].

In “the reporting of the conclusions of the analysis. Should this consist merely of a statement of significance or nonsignificance at a given level, or should a p value be reported? …fortunately, this is a case where you can have your cake and eat it too. One should routinely report the p value and, where desired, combine this with a statement on significance at any stated level.” …

Both Neyman-Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice. [p. 1248] Neyman-Pearson, in their original formulation of 1933, recommended a balance between the two kinds of error (i.e., between level and power).

…To summarize, p values, fixed-level significance statements, conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.

The full paper is here. You can find the references within. It’s too bad Lehmann didn’t try to work out this unification; I try my own hand at this in my forthcoming book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018).

Happy Birthday Erich!

*I blogged most of this material in 2015.

[0]That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here). Aside from being a leading statistician, Erich had a (serious) literary bent.

Juliet Shaffer, Erich Lehmann, D. Mayo

In 1998 Lehmann wrote in support of the book’s receiving the Lakatos Prize, which it did. The picture on the right (of Erich and his wife Juliet Shaffer) was taken by Aris Spanos in 2003.

[i]  Lehmann notes [p. 1242] that “in certain important situations tests can be obtained by an approach also due to Fisher for which he used the term fiducial. Most comparisons of Fisher’s work on hypothesis testing with that of Neyman and Pearson … do not include a discussion of the fiducial argument, which most statisticians have found difficult to follow. Although Fisher himself viewed fiducial considerations to be a very important part of his statistical thinking, this topic can be split off from other aspects of his work, and here I shall consider neither the fiducial nor the Bayesian approach…”

Following Lehmann, I too largely sidestepped Fisher’s fiducial episode in discussing N-P and Fisher in the past, but I’ve recently come to the view that this seriously shortchanges one’s understanding of NP methods and the disagreements between N-P and Fisher. But I won’t get into that now.

(Selected) Books by Lehmann

  • Testing Statistical Hypotheses, 1959
  • Basic Concepts of Probability and Statistics, 1964, co-author J. L. Hodges
  • Elements of Finite Probability, 1965, co-author J. L. Hodges
  • Lehmann, Erich L.; With the special assistance of H. J. M. D’Abrera (2006). Nonparametrics: Statistical methods based on ranks (Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 2279708.
  • Theory of Point Estimation, 1983
  • Elements of Large-Sample Theory (1988). New York: Springer Verlag.
  • Reminiscences of a Statistician, 2007, ISBN 978-0-387-71596-4
  • Fisher, Neyman, and the Creation of Classical Statistics, 2011, ISBN 978-1-4419-9499-8 [published posthumously]

Articles (3 of very many)

Well-Known Results

Categories: Fisher, P-values, phil/history of stat


3 thoughts on “Erich Lehmann’s 100th Birthday: Neyman-Pearson vs Fisher on P-values”

  1. “The big difference Lehmann sees between Fisher and NP is not one we’re most likely to hear about these days: conditioning vs maximizing power.”

    I see this not as a difference between Fisher and NP theory — as you say, NP theory doesn’t forbid conditioning when appropriate — but as a difference between Fisher and *Neyman*. I seem to recall that Neyman suggested throwing away relevant data at random in pursuit of tractability in the two-sample t-test scenario with unequal sample sizes (before Welch’s well-accepted approximate solution was published), and that he suggested randomizing hypothesis test results to achieve a desired type I error rate exactly when dealing with discrete sampling distributions. (Please correct me if I’m wrong.) I think such solutions would only occur to a person who had truly given up on the idea that statistical data analysis should be aimed at learning true facts from data rather than ensuring reliable “inductive behaviour”.

    • Hi Corey! Long time, no comment. I think if you look at Neyman’s applications you will see someone engaged in statistical inference and not acceptance sampling or crude inductive behavior*. I can’t think of a place where he advocated randomized tests to get the type I error rate exactly, can you? Nobody seems aware of any place where Neyman took up examples like Cox (1958), and I’ve always been baffled that Lehmann didn’t discuss it before this paper.

      Of course, the mere description of inference as an action of some sort does not entail giving up on the idea of using data for statistical inference. Neyman was largely trying to dissociate his view (of the use of probability) from Bayesian probabilism and Fisherian fiducial inference.

  2. Christian Hennig

    I’d also be surprised if somebody came up with quotes from Neyman in which he really advocated things like these.
    I think that the randomisation of test results came up as a mathematical result: what does the NP-optimal test for a fixed precise level look like if the test distribution cannot achieve that precise level based on the data alone? Sadly many people tend to think that what’s mathematically optimal in some sense is necessarily recommended in practice, but I don’t think that Neyman ever actually advocated using this kind of randomisation in practice; particularly because all the relevant information (probability for rejection conditionally on the test statistic value) is available before any actual randomisation is done.
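For readers unfamiliar with the construction discussed in this thread, here is a minimal sketch of a randomised test for a discrete statistic. The binomial setting (n = 10, p0 = 0.5, level 0.05, upper-tailed) and the function name are hypothetical choices for illustration; nothing here is attributed to Neyman.

```python
from math import comb


def randomized_binomial_test(n, p0, alpha):
    """Upper-tailed test of H: p = p0 with size exactly alpha.

    Returns (k, gamma): reject if X > k; if X == k, reject with probability gamma.
    """
    pmf = [comb(n, x) * p0 ** x * (1 - p0) ** (n - x) for x in range(n + 1)]
    tail = 0.0  # P(X > k) under H, built up from the top
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            # Including all of X == k would overshoot alpha, so include a fraction.
            gamma = (alpha - tail) / pmf[k]
            return k, gamma
        tail += pmf[k]
    return -1, 0.0  # alpha >= 1: reject for every outcome


k, gamma = randomized_binomial_test(n=10, p0=0.5, alpha=0.05)
print(f"reject if X > {k}; if X == {k}, reject with probability {gamma:.4f}")
```

In this setting no non-randomised cutoff gives level 0.05 exactly: rejecting for X of 9 or more has size 11/1024, and rejecting for X of 8 or more has size 56/1024. The rejection probability at each value of X (1 above k, gamma at k, 0 below) is known before any auxiliary coin is tossed, which is the point made in the comment above.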
