**Today is Allan Birnbaum’s Birthday.** Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in *Breakthroughs in Statistics* (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong Likelihood Principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, *properties of the sampling distribution of the test statistic vanish* (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interestingly, as seen in a 2018 post on Neyman, Neyman *did* discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.)
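To see concretely how sampling-distribution properties drop out under the LP, here is a minimal sketch of a standard textbook illustration (it is not taken from Birnbaum’s paper or from the slides). The numbers (9 heads and 3 tails, a null value of 0.5, an alternative of 0.75) and the function names are assumptions chosen for illustration. The same data arise under two stopping rules: toss 12 times, or toss until the 3rd tail. The likelihood ratio is identical in both cases, yet the p-values differ.

```python
from math import comb

def binomial_pvalue(heads, n, theta0=0.5):
    """P(X >= heads) when the number of tosses n is fixed in advance."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(heads, n + 1))

def negbinomial_pvalue(heads, tails, theta0=0.5):
    """P(X >= heads) when tossing continues until `tails` tails are seen."""
    # X >= heads exactly when the first heads + tails - 1 tosses
    # contain at most tails - 1 tails.
    m = heads + tails - 1
    return sum(comb(m, j) * (1 - theta0)**j * theta0**(m - j)
               for j in range(tails))

def likelihood_ratio(heads, tails, theta1, theta0):
    """Identical under both stopping rules: the counting constants cancel."""
    return (theta1 / theta0)**heads * ((1 - theta1) / (1 - theta0))**tails

heads, tails = 9, 3
p_fixed_n = binomial_pvalue(heads, heads + tails)     # 299/4096, about 0.073
p_stop_at_tails = negbinomial_pvalue(heads, tails)    # 67/2048, about 0.033
lr = likelihood_ratio(heads, tails, theta1=0.75, theta0=0.5)

print(f"p-value, n fixed at 12:        {p_fixed_n:.4f}")
print(f"p-value, stop at 3rd tail:     {p_stop_at_tails:.4f}")
print(f"likelihood ratio (both cases): {lr:.4f}")
```

An LP adherent treats the two experiments as evidentially equivalent because the likelihood ratio is the same; the error statistician reports different p-values because the sampling distributions differ.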

# phil/history of stat

## “Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday

## R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts in honor of his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and bear rereading.

*“Statistical Methods and Scientific Induction”*

*by Sir Ronald Fisher (1955)*

**SUMMARY**

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

- “Repeated sampling from the same population”,
- Errors of the “second kind”,
- “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.

**“Note on an Article by Sir Ronald Fisher”**

**by Jerzy Neyman (1956)**

**Summary**

(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation. (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible. (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values. The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight. (4) The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.

**“Statistical Concepts in Their Relation to Reality”**

**by E.S. Pearson (1955)**

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this *Journal* (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler–or worse. *The original heresy, as we shall see, was a Pearson one!…*

To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE

## R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

In recognition of R.A. Fisher’s birthday on February 17….

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper.

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

*Today is R.A. Fisher’s birthday. I’ll post some Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!*

“Two New Properties of Mathematical Likelihood”

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other.
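The link Fisher describes between sufficiency and the Neyman-Pearson result can be written out in symbols. What follows is only a sketch in modern notation, for simple hypotheses θ_0 and θ_1; the symbols f, g, h, T and k are my labels, not Fisher’s. If T is sufficient, the likelihood factorizes, and the factor free of the parameter cancels in the ratio (wherever h(x) > 0):

$$
f(x;\theta) = g\big(T(x);\theta\big)\,h(x)
\quad\Longrightarrow\quad
\frac{f(x;\theta_1)}{f(x;\theta_0)} = \frac{g\big(T(x);\theta_1\big)}{g\big(T(x);\theta_0\big)}.
$$

The most powerful test with aggregate frequency ε under H_{0} rejects when this ratio exceeds a constant k chosen to give that frequency, so the rejection region depends on the sample only through T(x).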

## Erich Lehmann’s 100th Birthday: Neyman Pearson vs Fisher on P-values

**Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009).** Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.

I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said “I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my *Error and the Growth of Experimental Knowledge* (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

## Egon Pearson’s Heresy

Here’s one last entry in honor of Egon Pearson’s birthday: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years (6!), but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.)

## A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

*Continuing with my Egon Pearson posts in honor of his birthday, I reblog a post by Aris Spanos: “Egon Pearson’s Neglected Contributions to Statistics”.*

**Egon Pearson** (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the *Neyman-Pearson (1933) theory of hypothesis testing*. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

**(i) specification**: the need to state explicitly the inductive premises of one’s inferences,

**(ii) robustness**: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

**(iii) Mis-Specification (M-S) testing**: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the **Student’s t** fame] and then **Fisher** (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as *the simple Normal model*: Continue reading

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including my latest reflection on a historical anecdote regarding Egon and the woman he wanted to marry, and surely would have, were it not for his father Karl!

**HAPPY BELATED BIRTHDAY EGON!**

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson died on this day in 1980. Aside from being co-developer of Neyman-Pearson statistics, Pearson was interested in philosophical aspects of statistical inference. A question he asked is this: Are methods with good error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. But how exactly does it work? It’s not just the frequentist error statistician who faces this question, but also some contemporary Bayesians who aver that the performance or calibration of their methods supplies an evidential (or inferential or epistemic) justification (e.g., Robert Kass 2011). The latter generally ties the reliability of the method that produces the particular inference C to degrees of belief in C. The inference takes the form of a probabilism, e.g., Pr(C|x), equated, presumably, to the reliability (or coverage probability) of the method. But why? The frequentist inference is C, which is qualified by the reliability of the method, but there’s no posterior assigned to C. Again, what’s the rationale? I think existing answers (from both tribes) come up short in non-trivial ways.

## R.A. Fisher: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based”

A final entry in a week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962). Fisher is among the very few thinkers I have come across to recognize this crucial difference between induction and deduction:

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. Statistical data are always erroneous, in greater or less degree. The study of inductive reasoning is the study of the embryology of knowledge, of the processes by means of which truth is extracted from its native ore in which it is infused with much error. (Fisher, “The Logic of Inductive Inference,” 1935, p. 54).

Reading or rereading this paper is very worthwhile. Some of the fascinating historical and statistical background may be found in a guest post by Aris Spanos: “R. A. Fisher: How an Outsider Revolutionized Statistics”.

## R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts in honor of his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. They are each very short.

*“Statistical Methods and Scientific Induction”*

*by Sir Ronald Fisher (1955)*

**SUMMARY**

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

- “Repeated sampling from the same population”,
- Errors of the “second kind”,
- “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.

The most noteworthy feature is Fisher’s position on Fiducial inference, typically downplayed. I’m placing a summary and link to Neyman’s response below–it’s that interesting.

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

*Today is R.A. Fisher’s birthday. I’ll post some different Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!*

“Two New Properties of Mathematical Likelihood”

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true.

## TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

The consequences of recent criticisms of statistical tests have breathed brand new life into some very old howlers, many of which have been discussed on this blog. What is not funny, though, is how standard notions such as frequentist error probabilities are being redefined in the process, and how we now have arguments built on equivocations. In fact, there are official guidebooks for the statistically perplexed giving inconsistent definitions to the same term (for just one of many examples, see this post). How much more perplexed will that leave us! Since it’s near the 5-year anniversary of this blog, let’s listen in to a new comedy hour mixing one from **3 years ago** with some add-ons.

*Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?*

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%–and typically over 50%–of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh!” Or, I’m crying and laughing at the same time!)

## History of statistics sleuths out there? “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”–No wait, it was apples, probably

Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is

“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…” –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”.

He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]

So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.

OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her: Continue reading

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post. I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

*Cases of Type A and Type B*

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

## A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)

Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 *Statistical Science* issue discussing Birnbaum’s result is here. Reference [5] links to the *Synthese* 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading!

## Return to the Comedy Hour: P-values vs posterior probabilities (1)

Some recent criticisms of statistical tests of significance have breathed brand new life into some very old howlers, many of which have been discussed on this blog. One variant that returns to the scene every decade, I think (for 50+ years?), takes a “disagreement on numbers” to show a problem with significance tests even from a “frequentist” perspective. Since it’s Saturday night, let’s listen in to one of the comedy hours from **3 years ago** (0) (new notes in red):

*Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?*

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%–and typically over 50%–of the null hypotheses are true! (1)

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah,…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist Significance Tester: But you assumed 50% of the null hypotheses are true, and computed P(H_{0}|x) (imagining P(H_{0}) = .5)–and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke….
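The tester’s retort can be checked with a small simulation. This is only a sketch under assumptions of my own, not the critic’s actual setup: one-observation z-tests, a single fixed alternative shifted by `alt_shift` standard errors, and a window of p-values between .04 and .06 standing in for “p-values of .05”. It separates the two numbers being conflated: the rate of erroneous rejection among true nulls, and the proportion of true nulls among results with p near .05, which moves with the assumed prior `prior_null`.

```python
import random
from statistics import NormalDist

def simulate(n_tests=200_000, prior_null=0.5, alt_shift=2.0, seed=1):
    """Return (type I error rate, share of true nulls among p in [.04, .06])."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    true_nulls = true_null_rejections = 0
    near_05 = near_05_true_null = 0
    for _ in range(n_tests):
        # Each hypothesis in the pool is a true null with probability prior_null.
        null_true = rng.random() < prior_null
        z = rng.gauss(0.0 if null_true else alt_shift, 1.0)
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if null_true:
            true_nulls += 1
            true_null_rejections += int(p <= 0.05)
        if 0.04 <= p <= 0.06:
            near_05 += 1
            near_05_true_null += int(null_true)
    return true_null_rejections / true_nulls, near_05_true_null / near_05

type1_half, share_half = simulate(prior_null=0.5)
type1_high, share_high = simulate(prior_null=0.9)

print(f"prior .5: erroneous rejection rate {type1_half:.3f}, "
      f"true nulls among p near .05: {share_half:.3f}")
print(f"prior .9: erroneous rejection rate {type1_high:.3f}, "
      f"true nulls among p near .05: {share_high:.3f}")
```

The erroneous rejection rate stays near .05 whatever the prior, while the share of true nulls among p-values near .05 changes with the prior and the assumed alternative. The two quantities answer different questions, so a disagreement between them is not a defect in the p-value.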

## Erich Lehmann: Neyman-Pearson & Fisher on P-values

**Today is Erich Lehmann’s birthday (20 November 1917 – 12 September 2009).** Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.

I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red. He said he wondered if it might be of interest to him! So he walked up to it…. It turned out to be my *Error and the Growth of Experimental Knowledge* (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” We haven’t discussed it before on this blog. Here are some excerpts (blue), and remarks (black).

…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard …” ….