# phil/history of stat

## A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

11 August 1895 – 12 June 1980

Continuing with my Egon Pearson posts in honor of his birthday, I reblog a post by Aris Spanos:  Egon Pearson’s Neglected Contributions to Statistics“.

Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∽ NIID(μ,σ²), k=1,2,…,n,…             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) =[√n(Xbar- μ)/s] ∽ St(n-1),  (2)

(b) v(X) =[(n-1)s²/σ²] ∽ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.”  (Lehmann, 1999)

He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply that was unfortunately lost, Fisher must have derived the sampling distribution of τ(X), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(X)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.”  (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures; t, chi-square and F tests.

In contrast, Egon Pearson shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s. This line of research for Pearson began with a review of Fisher’s 2nd edition of the 1925 book, published in Nature, and dated June 8th, 1929.  Pearson, after praising the book for its path breaking contributions, dared raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in Nature, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th, forcing him to provide a less acerbic reply, but instead of addressing the ‘robustness/sensitivity’ issue, he focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality, by broadening it – alluding to other possible departures from the ID assumption – and rendering it a hopeless task, by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

Egon Pearson recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on simulation, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ0(X)=|[√n(X-bar- μ0)/s]|, C1:={x: τ0(x) > cα},    (4)

for testing the hypotheses:

H0: μ = μ0 vs. H1: μ ≠ μ0,                                             (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).

Perhaps more importantly, Pearson (1930) proposed a test for the Normality assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.

After Pearson’s early publications on the ‘robustness/sensitivity’ problem Gosset realized that simulation alone was not effective enough to address the question of robustness, and called upon Fisher, who initially rejected Gosset’s call by saying ‘it was none of his business’, to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem with which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does not provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, focusing almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown f(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc. ; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

Xk ∽ U(a-μ,a+μ),   k=1,2,…,n,…        (6)

where f(x;a,μ)=(1/(2μ)), (a-μ) ≤ x ≤ (a+μ), μ > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer will be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6), is no longer the t-test, but the test defined by:

w(X)=|{(n-1)([X[1] +X[n]]-μ0)}/{[X[1]-X[n]]}|∽F(2,2(n-1)),   (7)

with a rejection region C1:={x: w(x) > cα},  where (X[1], X[n]) denote the smallest and the largest element in the ordered sample (X[1], X[2],…, X[n]), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance level, .05 and .045, and power at μ10+1, of .4 and .37, respectively. The conventional wisdom will call the t-test robust, but is it reliable (effective) when compared with the test in (7) whose significance level and power (at μ1) are say, .03 and .9, respectively?

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate model with a view to account for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

References

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” Biographical Memoirs of Fellows of the Royal Society, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” Biometrika, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” Metron, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” Journal of the Royal Statistical Society, 85: 597-612.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” Proceedings of the London Mathematical Society, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” Biometrika, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses”, Philosophical Transanctions of the Royal Society, A, 231: 289-337.

Lehmann, E. L. (1975) Nonparametrics: statistical methods based on ranks, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” Statistical Science, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, Nature, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” Biometrika, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” Biometrika, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” Biometrika, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” Biometrika, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” Biometrika, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” Biometrika, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” Biometrika, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” Biometrika, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” Biometrika, 6: 1-25.

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including, my latest reflection on a historical anecdote regarding Egon and the woman he wanted marry, and surely would have, were it not for his father Karl!

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….”(Pearson 1947, 171)

“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability. ..”(Ibid.,)

We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.

Three Steps in the Original Construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information  available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”.

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2.  However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning—that’s familiar enough—but secondly, for post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences are based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have/have not passed they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy for that matter would have countenanced.

Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.“ (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________

References:

Pearson, E. S. (1947), “The choice of Statistical Tests illustrated on the Interpretation of Data Classed in a 2×2 Table,Biometrika 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to RealityJournal of the Royal Statistical Society, Series B, (Methodological), 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” Biometrika 20(A): 175-240.

In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

E.S. Pearson died on this day in 1980. Aside from being co-developer of Neyman-Pearson statistics, Pearson was interested in philosophical aspects of statistical inference. A question he asked is this: Are methods with good error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. But how exactly does it work? It’s not just the frequentist error statistician who faces this question, but also some contemporary Bayesians who aver that the performance or calibration of their methods supplies an evidential (or inferential or epistemic) justification (e.g., Robert Kass 2011). The latter generally ties the reliability of the method that produces the particular inference C to degrees of belief in C. The inference takes the form of a probabilism, e.g., Pr(C|x), equated, presumably, to the reliability (or coverage probability) of the method. But why? The frequentist inference is C, which is qualified by the reliability of the method, but there’s no posterior assigned C. Again, what’s the rationale? I think existing answers (from both tribes) come up short in non-trivial ways.

I’ve recently become clear (or clearer) on a view I’ve been entertaining for a long time. There’s more than one goal in using probability, but when it comes to statistical inference in science, I say, the goal is not to infer highly probable claims (in the formal sense)* but claims which have been highly probed and have passed severe probes.  Even highly plausible claims can be poorly tested (and I require a bit more of a test than informal uses of the word.) The frequency properties of a method are relevant in those contexts where they provide assessments of a method’s capabilities and shortcomings in uncovering ways C may be wrong. Knowledge of the methods capabilities are used, in turn, to ascertain how well or severely C has been probed. C is warranted only to the extent that it survived a severe probe of ways it can be incorrect. There’s poor evidence for C when little has been done to rule out C’s flaws. The most important role of error probabilities is in blocking inferences to claims that have not passed severe tests, but also to falsify (statistically) claims whose denials pass severely. This view is in the spirit of E.S. Pearson, Peirce, and Popper–though none fully worked it out. That’s one of the things I do or try to in my latest work. Each supplied important hints. The following remarks of Pearson, earlier blogged here, contains some of his hints.

*Nor to give a comparative assessment of the probability of claims

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

## R.A Fisher: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based”

.

A final entry in a week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962). Fisher is among the very few thinkers I have come across to recognize this crucial difference between induction and deduction:

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigorous is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. Statistical data are always erroneous, in greater or less degree. The study of inductive reasoning is the study of the embryology of knowledge, of the processes by means of which truth is extracted from its native ore in which it is infused with much error. (Fisher, “The Logic of Inductive Inference,” 1935, p 54).

Reading/rereading this paper is very worthwhile for interested readers. Some of the fascinating historical/statistical background may be found in a guest post by Aris Spanos: “R.A.Fisher: How an Outsider Revolutionized Statistics”

Categories: Fisher, phil/history of stat | 30 Comments

## R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts in honor of his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between  Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. They are each very short.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)

SUMMARY

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of  acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating they fallacies they embody, are:

1. “Repeated sampling from the same population”,
2. Errors of the “second kind”,
3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

The most noteworthy feature is Fisher’s position on Fiducial inference, typically downplayed. I’m placing a summary and link to Neyman’s response below–it’s that interesting. Continue reading

Categories: fiducial probability, Fisher, Neyman, phil/history of stat | 6 Comments

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some different Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. Continue reading

Categories: Fisher, phil/history of stat, Statistics | | 2 Comments

## G.A. Barnard’s 101st Birthday: The Bayesian “catch-all” factor: probability vs likelihood

G. A. Barnard: 23 Sept 1915-30 July, 2002

Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from (of what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me.

♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠

BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important. Continue reading

## TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!) Continue reading

Categories: Bayesian/frequentist, Comedy, significance tests, Statistics | 9 Comments

## History of statistics sleuths out there? “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”–No wait, it was apples, probably

E.S.Pearson on a Gate, Mayo sketch

Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is

“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot… –E.S Pearson, “Statistical Concepts in Their Relation to Reality”.

He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]

So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.

OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her: Continue reading

Categories: E.S. Pearson, phil/history of stat, Statistics | 1 Comment

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post.  I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

## A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)

Allan Birnbaum: May 27, 1923- July 1, 1976

Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading! Continue reading

Comedy Hour

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!(1)

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah,…. I feel I’m back in high school: “So funny, I forgot to laugh!)

The frequentist tester should retort:

Frequentist Significance Tester: But you assumed 50% of the null hypotheses are true, and  computed P(H0|x) (imagining P(H0)= .5)—and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke…. Continue reading

Categories: Bayesian/frequentist, Comedy, PBP, significance tests, Statistics | 27 Comments

## Erich Lehmann: Neyman-Pearson & Fisher on P-values

lone book on table

Today is Erich Lehmann’s birthday (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.

I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.  He said he wondered if it might be of interest to him!  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

One of Lehmann’s more philosophical papers is Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” We haven’t discussed it before on this blog. Here are some excerpts (blue), and remarks (black)

Erich Lehmann 20 November 1917 – 12 September 2009

…A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result [p.1243].

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. … Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. … These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his [chi square] table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed [chi square], but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [chi square] indicate a real discrepancy.

Similarly, he also wrote (1935, p. 13) that “it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard .. .” …. Continue reading

Categories: Neyman, P-values, phil/history of stat, Statistics | | 4 Comments

## WHIPPING BOYS AND WITCH HUNTERS (ii)

.

At least as apt today as 3 years ago…HAPPY HALLOWEEN! Memory Lane with new comments in blue

In an earlier post I alleged that frequentist hypotheses tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways)—as well as for what really boils down to a field’s weaknesses in modeling, theorizing, experimentation, and data collection.  Checking the history of this term however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline.  It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance tests floggings, rather than a tool for a humbled self-improvement and commitment to avoiding flagrant rule violations, has tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting that in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis—especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing. Continue reading

## Statistical “reforms” without philosophy are blind (v update)

.

Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is becoming of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some are quite welcome, others are quite radical. Too rarely do the reformers bring out the philosophical presuppositions of the criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here, I tried to quickly jot down some queries. (Look for later installments and links.) What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values?

I. To get at philosophical underpinnings, the single most import question is this:

(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data? Continue reading

## G.A. Barnard: The “catch-all” factor: probability vs likelihood

G.A.Barnard 23 sept. 1915- 30 July 2002

From the “The Savage Forum” (pp 79-84 Savage, 1962)[i]

BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague­–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable. Continue reading

## George Barnard: 100th birthday: “We need more complexity” (and coherence) in statistical education

G.A. Barnard: 23 September, 1915 – 30 July, 2002

The answer to the question of my last post is George Barnard, and today is his 100th birthday*. The paragraphs stem from a 1981 conference in honor of his 65th birthday, published in his 1985 monograph: “A Coherent View of Statistical Inference” (Statistics, Technical Report Series, University of Waterloo). Happy Birthday George!

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Categories: Barnard, phil/history of stat, Statistics | 9 Comments

## (Part 3) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Last third of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society 41(2) 2005: 299-319

Part 2 is here.

8. Random sampling and the uniformity of nature

We are now at the point to address the final move in warranting Peirce’s SCT. The severity or trustworthiness assessment, on which the error correcting capacity depends, requires an appropriate link (qualitative or quantitative) between the data and the data generating phenomenon, e.g., a reliable calibration of a scale in a qualitative case, or a probabilistic connection between the data and the population in a quantitative case. Establishing such a link, however, is regarded as assuming observed regularities will persist, or making some “uniformity of nature” assumption—the bugbear of attempts to justify induction.

But Peirce contrasts his position with those favored by followers of Mill, and “almost all logicians” of his day, who “commonly teach that the inductive conclusion approximates to the truth because of the uniformity of nature” (2.775). Inductive inference, as Peirce conceives it (i.e., severe testing) does not use the uniformity of nature as a premise. Rather, the justification is sought in the manner of obtaining data. Justifying induction is a matter of showing that there exist methods with good error probabilities. For this it suffices that randomness be met only approximately, that inductive methods check their own assumptions, and that they can often detect and correct departures from randomness.

… It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required. (1.94)

## Statistics, the Spooky Science

.

I was reading this interview Of Erich Lehmann yesterday: “A Conversation with Erich L. Lehmann”

Lehmann: …I read over and over again that hypothesis testing is dead as a door nail, that nobody does hypothesis testing. I talk to Julie and she says that in the behaviorial sciences, hypothesis testing is what they do the most. All my statistical life, I have been interested in three different types of things: testing, point estimation, and confidence-interval estimation. There is not a year that somebody doesn’t tell me that two of them are total nonsense and only the third one makes sense. But which one they pick changes from year to year. [Laughs] (p.151)…..

DeGroot: …It has always amazed me about statistics that we argue among ourselves about which of our basic techniques are of practical value. It seems to me that in other areas one can argue about whether a methodology is going to prove to be useful, but people would agree whether a technique is useful in practice. But in statistics, as you say, some people believe that confidence intervals are the only procedures that make any sense on practical grounds, and others think they have no practical value whatsoever. I find it kind of spooky to be in such a field.

Lehmann: After a while you get used to it. If somebody attacks one of these, I just know that next year I’m going to get one who will be on the other side. (pp.151-2)

Emphasis is mine.

I’m reminded of this post.

Morris H. DeGroot, Statistical Science, 1986, Vol. 1, No.2, 243-258

Categories: phil/history of stat, Statistics | 1 Comment

## Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson (11 Aug, 1895-12 June, 1980). I reblog a relevant post from 2012.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand. Continue reading

Categories: 3-year memory lane, phil/history of stat | Tags: | 28 Comments