Erich Lehmann: Neyman-Pearson & Fisher on P-values



Today is Erich Lehmann’s birthday (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D. 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.

I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.  He said he wondered if it might be of interest to him!  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” We haven’t discussed it before on this blog. Here are some excerpts (blue), and remarks (black).

Erich Lehmann 20 November 1917 – 12 September 2009


…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard …” …

Neyman and Pearson followed Fisher’s adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by “[Fisher’s] tables of 5 and 1% significance levels which lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take.” … [p. 1244]
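Lehmann’s point about Fisher’s tables, that a table of selected quantiles yields only a range for the p value yet is exactly suited to a fixed-level decision, can be illustrated with a small sketch. It is not taken from the paper. It assumes a chi-square statistic with 2 degrees of freedom (chosen only because its tail probability has the closed form exp(-x/2)), a hypothetical observed value of 6.5, and an invented set of tabulated levels.

```python
import math

# Upper-tail levels of the kind printed in a quantile table (illustrative choice).
LEVELS = [0.10, 0.05, 0.02, 0.01]


def chi2_sf_df2(x):
    """Exact upper-tail probability P(chi-square >= x) for 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def chi2_quantile_df2(p):
    """Critical value whose upper-tail probability is p (2 degrees of freedom)."""
    return -2.0 * math.log(p)


# The "table": one critical value per tabulated level.
TABLE = {p: chi2_quantile_df2(p) for p in LEVELS}


def bracket_p(x):
    """Return (lower, upper) bounds on the p value obtainable from the table alone.

    A bound of None means the table does not pin that side down.
    """
    exceeded = [p for p in LEVELS if x >= TABLE[p]]    # levels at which x is significant
    not_reached = [p for p in LEVELS if x < TABLE[p]]  # levels at which it is not
    upper = min(exceeded) if exceeded else None
    lower = max(not_reached) if not_reached else None
    return lower, upper


def significant_at(x, level):
    """Fixed-level decision read straight off the table."""
    return x >= TABLE[level]


observed = 6.5  # hypothetical observed chi-square value
lower, upper = bracket_p(observed)
print(f"table alone: {lower} < P <= {upper}")
print(f"exact P (needs the full distribution): {chi2_sf_df2(observed):.4f}")
print("significant at .05:", significant_at(observed, 0.05))
```

With these assumed numbers the table places P between .02 and .05, while the exact value is about .039. The decision at the .05 level needs nothing beyond the single tabulated critical value.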

It is interesting to note that unlike Fisher, Neyman and Pearson (1933a, p. 296) did not recommend a standard level but suggested that “how the balance [between the two kinds of error] should be struck must be left to the investigator.” … It is thus surprising that in SMSI Fisher (1973, pp. 44-45) criticized the NP use of a fixed conventional level. …

Responding to earlier versions of these and related objections by Fisher to the Neyman-Pearson formulation, Pearson (1955, p. 206) admitted that the terms “acceptance” and “rejection” were perhaps unfortunately chosen, but of his joint work with Neyman he said that “from the start we shared Professor Fisher’s view that in scientific inquiry, a statistical test is ‘a means of learning’ ” and “I would agree that some of our wording may have been chosen inadequately, but I do not think that our position in some respects was or is so very different from that which Professor Fisher himself has now reached.” [This is from Pearson’s portion of the “triad”: “Statistical Concepts in Their Relation to Reality”.]

[A] central consideration of the Neyman-Pearson theory is that one must specify not only the hypothesis H but also the alternatives against which it is to be tested. In terms of the alternatives, one can then define the type II error (false acceptance) and the power of the test (the rejection probability as a function of the alternative). This idea is now fairly generally accepted for its importance in assessing the chance of detecting an effect (i.e., a departure from H) when it exists, determining the sample size required to raise this chance to an acceptable level, and providing a criterion on which to base the choice of an appropriate test… [p. 1244].
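The quantities named in this passage (the type II error, the power as a function of the alternative, and the sample size needed to reach an acceptable power) can be made concrete with a short sketch. It assumes the simplest textbook case, which is not one Lehmann works through here: a one-sided test of H: mu = mu0 against mu > mu0 for a Normal mean with known sigma. The function names and all default values are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Probability that the one-sided z-test rejects H: mu = mu0 when the true mean is mu1."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)         # critical value on the z scale
    shift = (mu1 - mu0) * math.sqrt(n) / sigma        # departure from H in standard-error units
    return 1.0 - STD_NORMAL.cdf(z_alpha - shift)


def type_ii_error(mu1, **kwargs):
    """Probability of failing to reject H when the true mean is mu1."""
    return 1.0 - power(mu1, **kwargs)


def sample_size(delta, sigma=1.0, alpha=0.05, target_power=0.80):
    """Smallest n giving at least target_power against the alternative mu0 + delta."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# Power as a function of the alternative: it equals alpha at mu1 = mu0 and rises from there.
for mu1 in (0.0, 0.25, 0.5, 0.75):
    print(f"mu1 = {mu1:.2f}: power = {power(mu1):.3f}, type II error = {type_ii_error(mu1):.3f}")

print("n needed for power .80 against a departure of 0.5:", sample_size(0.5))
```

Under these assumptions the power at the null value equals the level, .05, and a departure of half a standard deviation calls for 25 observations to be detected with probability at least .80.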

The big difference Lehmann sees between Fisher and NP is not one we’re most likely to hear about these days: conditioning vs maximizing power. [The paper puts to one side the difference between Fisher’s fiducial and Neyman’s confidence intervals] [i]. To illustrate the conditioning issue, Lehmann alludes to the famous example of flipping a coin to determine if an experiment will use a highly precise or imprecise weighing machine (Cox 1958). Discussions may be found on this blog. [Search under Birnbaum or the SLP.] In this paper, Lehmann makes it clear that NP theory is free to condition (to attain the relevant error probabilities) even if an unconditional test could well be chosen if the context called for maximizing power. I find it interesting that Lehmann’s statistical philosophy (not just in his writing, but in many conversations we’ve had) emphasizes the difference between scientific contexts– where the concern is finding things out in the case at hand–and contexts where long-run optimization is of prime importance, while his own work contributed so much to developing the latter. Back to his paper:

Fisher was of course aware of the importance of power, as is clear from the following remarks (1947, p. 24): “With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself.”

…Under the heading “Multiplicity of Tests of the Same Hypothesis,” he devoted a section (sec. 61) to this topic. Here again, without using the term, he referred to alternatives when he wrote (Fisher 1947, p. 182) that “we may now observe that the same data may contradict the hypothesis in any of a number of different ways.” After illustrating how different tests would be appropriate for different alternatives, he continued (p. 185):

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation but has been the occasion of much theoretical discussion among statisticians….[The experimenter] is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs. He is, therefore, not usually concerned with the question: To what observational feature should a test of significance be applied?

The idea is “that there is no need for a theory of test choice, because an experienced experimenter knows what is the appropriate test” [ibid, p. 1245]. This reminds me of Senn’s post on Fisher’s alternative to the alternative. Maybe “an experienced experimenter” knows the appropriate test, but this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for different choices, made informally. Frankly, I can’t take entirely seriously any complaints Fisher levels against NP after 1935, and the break-up. Lehmann returns to the question “cut-offs or P-values?” at the end. [p. 1247]

In “the reporting of the conclusions of the analysis. Should this consist merely of a statement of significance or nonsignificance at a given level, or should a p value be reported? …fortunately, this is a case where you can have your cake and eat it too. One should routinely report the p value and, where desired, combine this with a statement on significance at any stated level. …”

Both Neyman-Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice. [p. 1248] Neyman-Pearson, in their original formulation of 1933, recommended a balance between the two kinds of error (i.e., between level and power).

…To summarize, p values, fixed-level significance statements, conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.
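The remark that the frame of reference “determines the meaning of the probability statements” can be seen numerically in the two-instrument example mentioned earlier. What follows is only a sketch under assumed numbers: a fair coin picks an instrument with standard deviation 1 or 10 (both values hypothetical), a single measurement X is taken, and H: mu = 0 is tested against mu > 0. Using the raw measurement X as the test statistic in the unconditional calculation is a simplification for illustration, not the most powerful unconditional test.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Hypothetical standard deviations of the two instruments.
SIGMAS = {"precise": 1.0, "imprecise": 10.0}


def conditional_p(x, instrument):
    """P(X >= x | instrument actually used), computed under H: mu = 0."""
    return 1.0 - STD_NORMAL.cdf(x / SIGMAS[instrument])


def unconditional_p(x):
    """P(X >= x) under H: mu = 0, averaging over the fair coin flip that picks the instrument."""
    return 0.5 * conditional_p(x, "precise") + 0.5 * conditional_p(x, "imprecise")


x_obs = 2.0  # hypothetical reading, known to come from the precise instrument
print(f"conditional on the precise instrument: p = {conditional_p(x_obs, 'precise'):.4f}")
print(f"averaged over both instruments:        p = {unconditional_p(x_obs):.4f}")
```

The same reading of 2.0 gives p of about .023 when the reference set is restricted to the instrument actually used, and about .22 when it includes measurements the imprecise instrument might have produced. Which of these is relevant is exactly the choice of framework for which Lehmann says clear principles are lacking.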

The full paper is here. You can find the references within.

Happy Birthday Erich!

[0] In 1998 Lehmann wrote in support of the book’s receiving the Lakatos Prize, which it did.

[i]  Lehmann notes [p. 1242] that “in certain important situations tests can be obtained by an approach also due to Fisher for which he used the term fiducial. Most comparisons of Fisher’s work on hypothesis testing with that of Neyman and Pearson … do not include a discussion of the fiducial argument, which most statisticians have found difficult to follow. Although Fisher himself viewed fiducial considerations to be a very important part of his statistical thinking, this topic can be split off from other aspects of his work, and here I shall consider neither the fiducial nor the Bayesian approach…”

I, too, largely sidestepped Fisher’s fiducial approach in discussing NP and Fisher in the past, but I’ve recently come to the view that this seriously shortchanges one’s understanding of NP methods, and the disagreements between N-P and Fisher. But I won’t get into that now. [1242]

(Selected) Books by Lehmann

  • Testing Statistical Hypotheses, 1959
  • Basic Concepts of Probability and Statistics, 1964, co-author J. L. Hodges
  • Elements of Finite Probability, 1965, co-author J. L. Hodges
  • Nonparametrics: Statistical Methods Based on Ranks, 1975 (revised 1988; reprinted 2006, Springer), with the special assistance of H. J. M. D’Abrera, ISBN 978-0-387-35212-1
  • Theory of Point Estimation, 1983
  • Elements of Large-Sample Theory (1988). New York: Springer Verlag.
  • Reminiscences of a Statistician, 2007, ISBN 978-0-387-71596-4
  • Fisher, Neyman, and the Creation of Classical Statistics, 2011, ISBN 978-1-4419-9499-8 [published posthumously]



4 thoughts on “Erich Lehmann: Neyman-Pearson & Fisher on P-values”

  1. Another fascinating post – the history and the personalities are so tangled.

    I’m interested that you suggest this quote from Fisher (1946) shows that he favoured “Neyman-Pearsonian” critical-value, yes-or-no decision-making (neither of us really likes my term “absolutist” for this):

    “If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.”

    This seems to me to suggest a framework with at least three (likely 4) interesting regions: p<0.02 is "strong indication", p>0.1 is "no reason to suspect", and p<0.05 is "real discrepancy". Implicitly, then, 0.05<p<0.1 is short of "real discrepancy" but better than "no reason to suspect", or, as some might say today, "nearly significant", while the 0.02<p<0.05 part of "real discrepancy" is not quite enough for "strong indication". Perhaps I'm reading too much into a short quote, but it seems a short walk from here to a completely continuous interpretation of the p-value – am I wrong?

    The more I learn about the history of all this, the more interesting it gets. Thanks!

    • SSS: Well I had your recent post in mind when writing this post on Erich, as well as something I’m writing on fiducial. Thanks for the comment.

    • Anoneuoid

      Later he introduces the phrase “probably significant”:

      “The value of P is between .02 and .05, so that sex difference in the classification by hair colours is probably significant as judged by this district alone.”

      “The χ2 test does not attempt to measure the degree of association, but as a test of significance it is independent of all additional hypotheses as to the nature of the association.”

  2. It is noteworthy that Lehmann writes:
    “Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice.”[p. 1248] As a matter of fact, Fisher spent a lot of time after 1935 renouncing his older positions–the ones Neyman tried so hard to capture in his tests and confidence intervals–because he was disgruntled with Neyman who a) wouldn’t use his book and b) dared to point out inconsistencies in his fiducial frequencies. So Fisher’s war with Neyman is largely a war with himself, and a refusal to admit his mistaken claims about fiducial probabilities.
    In a nutshell, take a specific .95 lower confidence interval bound for mu (sigma known or estimated) in a Normal distribution: CI lower (.95). This Fisher called the fiducial 5% limit for mu, and claimed the probability that mu < the CI lower (.95) = .05. This probability holds for the CI (or fiducial) estimatOR, but once you substitute the data and get a specific value for the lower bound, the probability no longer holds. At most you could claim, as Fisher himself does, that the aggregate of outputs of form: mu < CI lower(.95) has 5% false claims. But when Neyman described them this way,* Fisher said he was turning his methods into acceptance sampling devices. See the top half of p. 75 of Fisher’s contribution to what I call the “triad” (1955), Fisher-1955.pdf.

    This whole issue is complicated, but revelatory, and I will say more about it in a post at some point.
    *Neyman was only trying to offer a revised wording of Fisher’s 1930 paper on fiducial frequencies, to avoid falsehoods which he attributed to accidental "lapses of language" common in describing a new idea.
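The distinction drawn in the comment above, between a probability that holds for the estimator and what can be said of any one computed bound, can be checked with a small simulation. This is a sketch under assumptions that are not in the comment: sigma is taken as known, and the true mean, sample size, number of repetitions and random seed are all arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist, fmean


def lower_bound(sample, sigma, conf=0.95):
    """One-sided lower confidence bound for mu, with sigma known."""
    z = NormalDist().inv_cdf(conf)
    return fmean(sample) - z * sigma / math.sqrt(len(sample))


def false_claim_rate(mu=3.0, sigma=2.0, n=10, trials=20000, seed=0):
    """Fraction of repetitions in which the computed bound lies above the true mu."""
    rng = random.Random(seed)
    false_claims = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if mu < lower_bound(sample, sigma):
            false_claims += 1
    return false_claims / trials


# Over the aggregate of repetitions, about 5% of the bounds overshoot mu.
print(f"share of bounds lying above the true mu: {false_claim_rate():.3f}")

# For any single data set the bound is a fixed number: mu is below it or it is not.
one_sample = [random.Random(1).gauss(3.0, 2.0) for _ in range(10)]
bound = lower_bound(one_sample, 2.0)
print(f"one realized bound: {bound:.3f}; is mu = 3.0 below it? {3.0 < bound}")
```

The 5% figure describes the procedure over repeated samples. Once a particular bound has been computed, the statement "mu is below this number" is simply true or false.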
