Erich Lehmann’s 100 Birthday: Neyman Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said ” I wonder if it might be of interest to me!”  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Here are some excerpts (blue), and remarks (black)

…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard .. .” ….

Neyman and Pearson followed Fisher’s adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by “[Fisher’s] tables of 5 and 1% significance levels which lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take.” … [p. 1244]

It is interesting to note that unlike Fisher, Neyman and Pearson (1933a, p. 296) did not recommend a standard level but suggested that “how the balance [between the two kinds of error] should be struck must be left to the investigator.” … It is thus surprising that in SMSI Fisher (1973, p. 44-45) criticized the NP use of a fixed conventional level.” …

Responding to earlier versions of these and related objections by Fisher to the Neyman-Pearson formulation, Pearson (1955, p. 206) admitted that the terms “acceptance” and “rejection” were perhaps unfortunately chosen, but of his joint work with Neyman he said that “from the start we shared Professor Fisher’s view that in scientific inquiry, a statistical test is ‘a means of learning’ ” and “I would agree that some of our wording may have been chosen inadequately, but I do not think that our position in some respects was or is so very different from that which Professor Fisher himself has now reached.” [This is from Pearson’s portion of the”triad”: “Statistical Concepts in Their Relation to Reality’.] 

 [A] central consideration of the Neyman-Pearson theory is that one must specify not only the hypothesis H but also the alternatives against which it is to be tested. In terms of the alternatives, one can then define the type II error (false acceptance) and the power of the test (the rejection probability as a function of the alternative). This idea is now fairly generally accepted for its importance in assessing the chance of detecting an effect (i.e., a departure from H) when it exists, determining the sample size required to raise this chance to an acceptable level, and providing a criterion on which to base the choice of an appropriate test…[p.1244].

The big difference Lehmann sees between Fisher and NP is not one we’re most likely to hear about these days: conditioning vs maximizing power. [The paper puts to one side the difference between Fisher’s fiducial and Neyman’s confidence intervals] [i]. To illustrate the conditioning issue, Lehmann alludes to the famous example of flipping a coin to determine if an experiment will use a highly precise or imprecise weighing machine (Cox 1958). Discussions may be found on this blog. [Search under Birnbaum or the SLP.] In this paper, Lehmann makes it clear that NP theory is free to condition (to attain the relevant error probabilities) even if an unconditional test could well be chosen if the context called for maximizing power. I find it interesting that Lehmann’s statistical philosophy (not just in his writing, but in many conversations we’ve had) emphasizes the difference between scientific contexts– where the concern is finding things out in the case at hand–and contexts where long-run optimization is of prime importance, while his own work contributed so much to developing the latter. Back to his paper:

Fisher was of course aware of the importance of power, as is clear from the following remarks (1947, p. 24): “With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself.”

…Under the heading “Multiplicity of Tests of the Same Hypothesis,” he devoted a section (sec. 61) to this topic. Here again, without using the term, he referred to alternatives when he wrote (Fisher 1947, p. 182) that “we may now observe that the same data may contradict the hypothesis in any of a number of different ways.” After illustrating how different tests would be appropriate for different alternatives, he continued (p. 185):

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation but has been the occasion of much theoretical discussion among statisticians….[The experimenter] is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs. He is, therefore, not usually concerned with the question: To what observational feature should a test of significance be applied?

The idea is “that there is no need for a theory of test choice,because an experienced experimenter knows what is the appropriate test”. [ibid, p. 1245]. This reminds me of Senn’s post on Fisher’s alternative to the alternative. Maybe “an experienced experimenter” knows the appropriate test, but this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for different choices, made informally. Frankly, I can’t take entirely seriously any complaints Fisher levels against NP after 1935, and the break-up.  Lehmann returns to the question “cut-offs or P-values?” at the end. [1247]

In “the reporting of the conclusions of the analysis. Should this consist merely of a statement of significance or nonsignificance at a given level, or should a p value be reported? …fortunately, this is a case where you can have your cake and eat it too. One should routinely report the p value and, where desired, combine this with a statement on significance at any stated level. …

Both Neyman-Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice.[p. 1248] Neyman-Pearson, in their original formulation of 1933, recommended a balance between the two kinds of error (i.e., between level and power).

…To summarize, p values, fixed-level significance statements,conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.

The full paper is here. You can find the references within. It’s too bad Lehmann didn’t try to work out this unification; I try my own hand at this in my forthcoming book, Statistical Inference as Severe Testing How to Get Beyond the Statistics Wars (CUP 2018).

Happy Birthday Erich!

*I blogged most of this material in 2015.

[0]That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here). Aside from being a leading statistician, Erich had a (serious) literary bent.

Juliet Shafer, Erich Lehmann, D. Mayo

Juliet Shafer, Erich Lehmann, D. Mayo

In 1998 Lehmann wrote in support of the book’s receiving the Lakatos Prize which it did. The picture on the right (of Erich and his wife Julie Schaffer) was taken by Aris Spanos in 2003.

[i]  Lehmann notes [p. 1242] that “in certain important situations tests can be obtained by an approach also due to Fisher for which he used the term fiducial. Most comparisons of Fisher’s work on hypothesis testing with that of Neyman and Pearson … do not include a discussion of the fiducial argument, which most statisticians have found difficult to follow. Although Fisher himself viewed fiducial considerations to be a very important part of his statistical thinking, this topic can be split off from other aspects of his work, and here I shall consider neither the fiducial nor the Bayesian approach…”

Following Lehmann, I too, largely sidestepped Fisher’s fiducial episode  in discussing N-P and Fisher in the past, but I’ve recently come to the view that this seriously shortchanges one’s understanding of NP methods, and the disagreements between N-P and Fisher. But I won’t get into that now.

(Selected) Books by Lehmann)

  • Testing Statistical Hypotheses, 1959
  • Basic Concepts of Probability and Statistics, 1964, co-author J. L. Hodges
  • Elements of Finite Probability, 1965, co-author J. L. Hodges
  • Lehmann, Erich L.; With the special assistance of H. J. M. D’Abrera (2006). Nonparametrics: Statistical methods based on ranks (Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 2279708.
  • Theory of Point Estimation, 1983
  • Elements of Large-Sample Theory (1988). New York: Springer Verlag.
  • Reminiscences of a Statistician, 2007, ISBN 978-0-387-71596-4
  • Fisher, Neyman, and the Creation of Classical Statistics, 2011, ISBN 978-1-4419-9499-8 [published posthumously]

Articles (3 of very many)

Well-Known Results

Categories: Fisher, P-values, phil/history of stat | 2 Comments


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: November 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green 3- 4 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one (11/1/14 & 11/09/14 and 11/15/14 & 11/25/14 are grouped). The comments are worth checking out.


November 2014

  • 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
  • 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
  • 11/11 The Amazing Randi’s Million Dollar Challenge
  • 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
  • 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence


  • 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
  • 11/20 Erich Lehmann: Statistician and Poet
  • 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
  • 11/25 How likelihoodists exaggerate evidence from statistical tests

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30,2016, March 30,2017 -a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).












Categories: 3-year memory lane | 1 Comment

Yoav Benjamini, “In the world beyond p < .05: When & How to use P < .0499…"


These were Yoav Benjamini’s slides,”In the world beyond p<.05: When & How to use P<.0499…” from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).

Readers of this blog will recall one of the controversial cases of failed replication in psychology: Simone Schnall’s 2008 paper on cleanliness and morality. In an earlier post, I quote from the Chronicle of Higher Education:

…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.

My gripe has long been that the focus of replication research is with the “pure” statistics– do we get a low p-value or not?–largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case [1]. When the study failed to replicate there was a lot of heated debate about the “fidelity” of the replication.[2] Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):

For the severe tester, the onus is on researchers to show explicitly that they could put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity) overriding the initial claim to have a non-spurious effect.

“Every research proposal and paper should have a replicabiity-check component” Benjamini recommends.[3] Here are the full slides for your Saturday night perusal.

[1] I’ve yet to hear anyone explain why unscrambling soap-related words should be a good proxy for “situated cognition” of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken account of vitiates the nominal low p-value. It is easy rather than difficult to find at least one such computed low p-value by selection alone. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.

[2] For some related posts, with links to other blogs, check out:

Some ironies in the replication crisis in social psychology.

Repligate returns.

A new front in the statistics wars: peaceful negotiation in the face of so-called methodological terrorism

For statistical transparency: reveal multiplicity and/or falsify the test: remark on Gelman and colleagues

[3] The problem underscores the need for a statistical account to be directly altered by biasing selection effects, as is the p-value.

Categories: Error Statistics, P-values, replication research, selection effects | 22 Comments

Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then you will find the situation is reversed, and the recommended BF exaggerates the evidence!  (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

  1. Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of “Redefine Statistical Significance”
  2. Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and a primary co-author of a response to ‘Redefine statistical significance’ (under review).
  3. Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  4. Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  5. E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam, a co-author of the paper “Redefine Statistical Significance”

To tune in to the presentation and participate in the discussion after the talk, visit this site on the day of the talk. To register for the talk in advance, click here.

The paradox for those wishing to abandon significance tests on grounds that there’s “a replication crisis”–and I’m not alleging everyone under the “lower your p-value” umbrella are advancing this–is that lack of replication is effectively uncovered thanks to statistical significance tests. They are also the basis for fraud-busting, and adjustments for multiple testing and selection effects. Unlike Bayes Factors, they:

  • are directly affected by cherry-picking, data dredging and other biasing selection effects
  • are able to test statistical model assumptions, and may have their own assumptions vouchsafed by appropriate experimental design
  • block inferring a genuine effect when a method has low capability of having found it spurious.

In my view, the result of a significance test should be interpreted in terms of the discrepancies that are well or poorly indicated by the result. So we’d avoid the concern that leads some to recommend a .005 cut-off to begin with. But if this does become the standard for testing the existence of risks, I’d make “there’s an increased risk of at least r” the test hypothesis in a one-sided test, as Neyman recommends. Don’t give a gift to the risk producers. In the most problematic areas of social science, the real problems are (a) the questionable relevance of the “treatment” and “outcome” to what is purported to be measured, (b) cherry-picking, data-dependent endpoints, and a host of biasing selection effects, and (c) violated model assumptions. Lowering a p-value will do nothing to help with these problems; forgoing statistical tests of significance will do a lot to make them worse.

 *Added Oct 27. This is worth noting because in other Bayesian assessment, indeed, in assessments deemed more sensible and less biased in favor of the null hypothesis–the p-value scarcely differs from the posterior on Ho. This is discussed, for example, in Casella and R. Berger 1987. See links in [iii]. The two are reconciled with 1-sided tests, and insofar as the typical study states a predicted direction, that’s what they should be doing.

[i] Both “frequentist” and “sampling theory” are unhelpful names. Since the key feature is basing inference on error probabilities of methods, I abbreviate by error statistics. The error probabilities are based on the sampling distribution of the appropriate test statistic. A proper subset of error statistical contexts are those that utilize error probabilities to assess and control the severity by which a particular claim is tested.

[ii] See#4 of my recent talk on statistical skepticism “7 challenges and how to respond to them”

[iii]  Two related posts: p-values overstate the evidence against the null fallacy

How likelihoodists exaggerate evidence from statistical tests (search the blog for others)









Categories: Announcement, P-values, reforming the reformers, selection effects | 5 Comments

Deconstructing “A World Beyond P-values”

.A world beyond p-values?

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1] Continue reading

Categories: P-values, Philosophy of Statistics, reforming the reformers | 8 Comments

Statistical skepticism: How to use significance tests effectively: 7 challenges & how to respond to them

Here are my slides from the ASA Symposium on Statistical Inference : “A World Beyond p < .05”  in the session, “What are the best uses for P-values?”. (Aside from me,our session included Yoav Benjamini and David Robinson, with chair: Nalini Ravishanker.)


  • Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?
  • Why use an incompatible hybrid (of Fisher and N-P)?
  • Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions” and violates the likelihood principle (LP)? You should condition on the data.
  • Why use methods that overstate evidence against a null hypothesis?
  • Why do you use a method that presupposes the underlying statistical model?
  • Why use a measure that doesn’t report effect sizes?
  • Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?


Categories: P-values, spurious p values, statistical tests, Statistics | Leave a comment

New venues for the statistics wars

I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference : “A World Beyond p < .05”. Continue reading

Categories: Announcement, Bayesian/frequentist, P-values | 3 Comments

G.A. Barnard: The “catch-all” factor: probability vs likelihood


Barnard 23 Sept.1915 – 9 Aug.20

With continued acknowledgement of Barnard’s birthday on Friday, Sept.23, I reblog an exchange on catchall probabilities from the “The Savage Forum” (pp 79-84 Savage, 1962) with some new remarks.[i] 

 BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important. Continue reading

Categories: Barnard, highly probable vs highly probed, phil/history of stat | 6 Comments

George Barnard’s birthday: stopping rules, intentions

G.A. Barnard: 23 Sept.1915 – 9 Aug.2002

Today is George Barnard’s birthday. I met him in the 1980s and we corresponded off and on until 1999. Here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to issues often taken up on this blog: stopping rules and the likelihood principle. (It’s a slightly revised reblog of an earlier post.) I’ll post some other items related to Barnard this week, in honor of his birthday.

Happy Birthday George!

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances.  Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant. Continue reading

Categories: Likelihood Principle, Philosophy of Statistics | Tags: | 2 Comments

Revisiting Popper’s Demarcation of Science 2017

28 July 1902- 17 Sept. 1994

Karl Popper died on September 17 1994. One thing that gets revived in my new book (Statistical Inference as Severe Testing, 2018, CUP) is a Popperian demarcation of science vs pseudoscience Here’s a snippet from what I call a “live exhibit” (where the reader experiments with a subject) toward the end of a chapter on Popper:

Live Exhibit. Revisiting Popper’s Demarcation of Science: Here’s an experiment: Try shifting what Popper says about theories to a related claim about inquiries to find something out. To see what I have in mind, join me in watching a skit over the lunch break:

Physicist: “If mere logical falsifiability suffices for a theory to be scientific, then, we can’t properly oust astrology from the scientific pantheon. Plenty of nutty theories have been falsified, so by definition they’re scientific. Moreover, scientists aren’t always looking to subject well corroborated theories to “grave risk” of falsification.”

Fellow traveler: “I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking or should be asking is: When is an inquiry into a theory, or an appraisal of claim H unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering, sneaky face-saving devices, then the inquiry–the handling and use of data–is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. On the other hand, some areas have so much noise that you can’t pinpoint what’s to blame for failed predictions. This is another way that inquiries become bad science.”

She continues: Continue reading

Categories: Error Statistics, Popper, pseudoscience, science vs pseudoscience | Tags: | 10 Comments

Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept, 1839-19 April, 1914

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and appling randomization [1].) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [2]. I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013Happy (belated) Birthday Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT): Continue reading

Categories: Bayesian/frequentist, C.S. Peirce | 2 Comments

Professor Roberta Millstein, Distinguished Marjorie Grene speaker September 15



Virginia Tech Philosophy Department

2017 Distinguished Marjorie Grene Speaker


Professor Roberta L. Millstein

University of California, Davis

“Types of Experiments and Causal Process Tracing: What Happened on the Kaibab Plateau in the 1920s?”

September 15, 2017

320 Lavery Hall: 5:10-6:45pm



Continue reading

Categories: Announcement | 4 Comments

All She Wrote (so far): Error Statistics Philosophy: 6 years on

metablog old fashion typewriter

D.G. Mayo with her  blogging typewriter

Error Statistics Philosophy: Blog Contents (6 years) [i]
By: D. G. Mayo

Dear Reader: It’s hard to believe I’ve been blogging for six years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease.

Amazingly, this old typewriter not only still works; one of the whiz kids on Elba managed to bluetooth it to go directly from my typewriter onto the blog (I never got used to computer keyboards.) I still must travel to London to get replacement ribbons for this klunker.

Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging–I’ve had to cut back this past year (sorry)–but at least until the publication of my book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2018). After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and I will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo



September 2011

October 2011 Continue reading

Categories: blog contents, Metablog | Leave a comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Here’s one last entry in honor of Egon Pearson’s birthday: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years (6!), but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.) Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

11 August 1895 – 12 June 1980

Continuing with my Egon Pearson posts in honor of his birthday, I reblog a post by Aris Spanos:  Egon Pearson’s Neglected Contributions to Statistics“. 

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, phil/history of stat, Spanos, Testing Assumptions | 2 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

egon pearson

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including, my latest reflection on a historical anecdote regarding Egon and the woman he wanted marry, and surely would have, were it not for his father Karl!


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.  Continue reading

Categories: highly probable vs highly probed, phil/history of stat, Statistics | Tags: | Leave a comment

Thieme on the theme of lowering p-value thresholds (for Slate)


Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i]. Please share your comments.

Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.



Illustration by Slate

                 Illustration by Slate

Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005. Continue reading

Categories: P-values, reforming the reformers, spurious p values | 14 Comments

“A megateam of reproducibility-minded scientists” look to lowering the p-value


Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today: Continue reading

Categories: Error Statistics, highly probable vs highly probed, P-values, reforming the reformers | 55 Comments


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one. This month there are three such groups: 7/8 and 7/10; 7/14 and 7/23; 7/26 and 7/31.

July 2014

  • (7/7) Winner of June Palindrome Contest: Lori Wike
  • (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
  • (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
  • (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
  • (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
  • (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.







Categories: 3-year memory lane, Higgs, P-values | Leave a comment

On the current state of play in the crisis of replication in psychology: some heresies


The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the RR is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]). Continue reading

Categories: Error Statistics, preregistration, reforming the reformers, replication research | 9 Comments

Blog at