*Continuing with the discussion of E.S. Pearson:*

**Egon Pearson’s Neglected Contributions to Statistics**

by **Aris Spanos**

**Egon Pearson** (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the *Neyman-Pearson (1933) theory of hypothesis testing*. Occasionally, he is also credited with contributions to promoting statistical methods in industry and to the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

**(i) specification**: the need to state explicitly the inductive premises of one’s inferences,

**(ii) robustness**: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

**(iii) Mis-Specification (M-S) testing**: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite-sample inference procedures, initially by William Gosset (1908) [of **Student’s t** fame] and then by **Fisher** (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as *the simple Normal model*:

X_{k} ∽ NIID(μ,σ²), k=1,2,…,n,… (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(**X**) =[√n(Xbar- μ)/s] ∽ St(n-1), (2)

(b) *v*(**X**) =[(n-1)s²/σ²] ∽ χ²(n-1), (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.
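The finite-sample results (2)-(3) are easy to verify by simulation. A minimal Monte Carlo sketch (the values n = 20, μ = 5, σ = 2 and R = 20,000 replications are illustrative choices, not from the text): the empirical frequency with which each pivot exceeds its nominal 5% critical value should match the Student’s t and chi-square tail areas.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, R, mu, sigma = 20, 20000, 5.0, 2.0

X = rng.normal(mu, sigma, size=(R, n))      # R samples from NIID(mu, sigma^2)
xbar = X.mean(axis=1)
s = X.std(axis=1, ddof=1)                   # square root of the unbiased variance estimator

tau = np.sqrt(n) * (xbar - mu) / s          # pivot (2): distributed St(n-1)
v = (n - 1) * s**2 / sigma**2               # pivot (3): distributed chi^2(n-1)

# empirical tail frequencies should be close to the nominal 5%
c_t = stats.t.ppf(0.975, df=n - 1)
c_chi = stats.chi2.ppf(0.95, df=n - 1)
print(np.mean(np.abs(tau) > c_t))           # close to 0.05
print(np.mean(v > c_chi))                   # close to 0.05
```

Because (2)-(3) are exact pivots under the simple Normal model, the agreement holds at any sample size, not just asymptotically.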

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.” (Lehmann, 1999)

He went on to say that he had tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply, which has unfortunately been lost, Fisher must have derived the sampling distribution of τ(**X**) assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(**X**)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.” (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to work out the implications of non-Normality for the Normal-based inference procedures: the t, chi-square and F tests.

In contrast, **Egon Pearson** shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s.

This line of research for Pearson began with a review of the second (1928) edition of Fisher’s 1925 book, published in *Nature* and dated June 8th, 1929. After praising the book for its path-breaking contributions, Pearson dared to raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in *Nature*, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th, but instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it – alluding to other possible departures from the ID assumption – and rendering it hopeless by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

**Egon Pearson** recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on *simulation*, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ_{0}(**X**) = |[√n(Xbar - μ_{0})/s]|, C_{1}:={**x**: τ_{0}(**x**) > c_{α}}, (4)

for testing the hypotheses:

H_{0}: μ = μ_{0} vs. H_{1}: μ ≠ μ_{0}, (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).
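Pearson-style simulation probing is easy to reproduce today with pseudo-random numbers in software. A minimal sketch (the sample size n = 20, the number of replications R, and the two populations are illustrative choices): estimate the actual type I error of the t-test (4) under a symmetric non-Normal population and under a skewed one, and compare it with the nominal α = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, R, alpha = 20, 20000, 0.05
c = stats.t.ppf(1 - alpha / 2, df=n - 1)

def actual_size(draw, mu):
    """Fraction of simulated samples in which the two-sided t-test rejects a true H0: mean = mu."""
    X = draw((R, n))
    tau0 = np.sqrt(n) * np.abs(X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)
    return np.mean(tau0 > c)

# symmetric non-Normal population (uniform): actual size stays near the nominal 5%
size_sym = actual_size(lambda s: rng.uniform(-1.0, 1.0, size=s), mu=0.0)
# skewed population (log-Normal, with its true mean exp(0.5)): the discrepancy is larger
size_skew = actual_size(lambda s: rng.lognormal(0.0, 1.0, size=s), mu=np.exp(0.5))
print(size_sym, size_skew)
```

The pattern matches Pearson’s broad conclusion: symmetric departures leave the t-test largely intact, while skewed ones distort its error probabilities.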

Perhaps more importantly, Pearson (1930) proposed a *test for the Normality* assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.
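A version of this skewness-kurtosis M-S test is readily available in software: `scipy.stats.normaltest` implements the D’Agostino-Pearson omnibus statistic. A minimal sketch (the sample size of 500 and the log-Normal alternative are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

normal_data = rng.normal(0.0, 1.0, size=500)
skewed_data = rng.lognormal(0.0, 1.0, size=500)   # a markedly skewed population

# normaltest combines the sample skewness and kurtosis coefficients into the
# D'Agostino-Pearson omnibus K^2 statistic, asymptotically chi^2(2) under Normality
k2_norm, p_norm = stats.normaltest(normal_data)
k2_skew, p_skew = stats.normaltest(skewed_data)
print(p_norm, p_skew)   # the second p-value is essentially zero
```

A rejection here does not by itself identify the alternative; as argued above, its value lies in narrowing down which departures one needs to worry about before respecifying the model.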

After Pearson’s early publications on the ‘robustness/sensitivity’ problem, Gosset realized that *simulation* alone was not effective enough to address the question of robustness, and called upon Fisher – who had initially rejected Gosset’s call, saying ‘it was none of his business’ – to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem in which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does *not* provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, with attention focused almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown *f*(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc.; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

X_{k} ∽ U(μ-a, μ+a), k=1,2,…,n,… (6)

where *f*(x;μ,a) = 1/(2a), (μ-a) ≤ x ≤ (μ+a), a > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer will be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6), is no longer the t-test, but the test defined by:

w(**X**) = |{(n-1)([X_{[1]}+X_{[n]}]-2μ_{0})}/{[X_{[n]}-X_{[1]}]}| ∽ F(2,2(n-1)), (7)

with a rejection region C_{1}:={**x**: w(**x**) > c_{α}}, where X_{[1]} and X_{[n]} denote the smallest and largest elements of the ordered sample (X_{[1]}, X_{[2]},…, X_{[n]}), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant error probabilities for comparison are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance levels of .05 and .045, and power at μ_{1}=μ_{0}+1 of .4 and .37, respectively. Conventional wisdom would call the t-test robust, but is it reliable (effective) when compared with the test in (7), whose significance level and power (at μ_{1}) are, say, .03 and .9, respectively?
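This comparison can itself be sketched by simulation (the choices n = 20, a = √3, a mean shift of 0.5, and R = 20,000 replications are illustrative, not from the text). Because the midrange sum X_{[1]}+X_{[n]} estimates twice the center of a uniform distribution with an error of order 1/n rather than 1/√n, the midrange/range test holds its nominal size under uniformity while being far more powerful than the t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, R, alpha = 20, 20000, 0.05
a, mu0, mu1 = np.sqrt(3.0), 0.0, 0.5     # a = sqrt(3) makes the population variance 1

def t_reject(X, mu):
    """Two-sided t-test rejection frequency for H0: mean = mu."""
    tau = np.sqrt(n) * np.abs(X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)
    return np.mean(tau > stats.t.ppf(1 - alpha / 2, df=n - 1))

def w_reject(X, mu):
    """Midrange/range test: X_(1) + X_(n) estimates 2*mu under a uniform population."""
    lo, hi = X.min(axis=1), X.max(axis=1)
    w = np.abs((n - 1) * (lo + hi - 2 * mu) / (hi - lo))
    return np.mean(w > stats.f.ppf(1 - alpha, 2, 2 * (n - 1)))

X0 = rng.uniform(mu0 - a, mu0 + a, size=(R, n))   # H0 true
X1 = rng.uniform(mu1 - a, mu1 + a, size=(R, n))   # H0 false by a shift of 0.5

size_t, size_w = t_reject(X0, mu0), w_reject(X0, mu0)
power_t, power_w = t_reject(X1, mu0), w_reject(X1, mu0)
print(size_t, size_w)     # both close to the nominal 0.05
print(power_t, power_w)   # the midrange/range test dominates under uniformity
```

Both tests hold their nominal size under uniform data, so the t-test looks ‘robust’ in the generic sense; yet the power gap shows why comparing it only against its own corrected error probabilities misses the relevant benchmark.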

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate one, with a view to accounting for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

**References**

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” *Biographical Memoirs of Fellows of the Royal Society*, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” *Biometrika*, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika*, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” *Metron*, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society* A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” *Journal of the Royal Statistical Society*, 85: 597-612.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” *Proceedings of the London Mathematical Society*, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” *Biometrika*, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses,” *Philosophical Transactions of the Royal Society*, A, 231: 289-337.

Lehmann, E. L. (1975) *Nonparametrics: statistical methods based on ranks*, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” *Statistical Science*, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, *Nature*, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” *Biometrika*, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” *Biometrika*, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” *Biometrika*, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” *Biometrika*, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” *Biometrika*, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” *Biometrika*, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” *Biometrika*, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” *Biometrika*, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” *Journal of Econometrics*, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” *Biometrika*, 6: 1-25.

Aris, this is all very interesting. However, it seems to me that regarding robustness you have different priorities from myself, and probably from most people in the “robustness community” that mainly grew out of the work of Tukey, Huber and Hampel.

If we accept that we generally won’t hit a precisely correct model even after a long series of well chosen M-S tests, it seems to be of much less importance how a procedure behaves compared to an optimal one in a model that hasn’t been ruled out by tests (but which still won’t be precisely true) than how it behaves in general enough neighborhoods of models, or in some worst (but conceivable) cases. Error probabilities for non-parametric hypotheses (like “median=a”) as quality measurements seem much more stable to me than optimality statements, which generally rely on models being exactly true and are often not even meaningful for distributions that deviate so little from the nominal model that they cannot be distinguished from it with any finite number of points (but may, for example, not have Lebesgue densities).

Christian: you are right, I do have a somewhat different perspective on “robustness” than most statisticians/econometricians. I also began with the work of Tukey, Huber and Hampel, but after working with real data for more than 25 years I realized that the traditional view of robustness was not providing (for me) satisfactory answers to the real problems raised by statistical misspecification.

When robustness is viewed as ‘insensitivity’ to particular departures from the probabilistic assumptions of the statistical model in question, one needs two things:

(i) an appropriate measure of the ‘sensitivity’ of inference, and

(ii) the type of departures one should be worried about.

My view differs from the traditional ones because:

(a) I consider the discrepancy between actual and nominal error probabilities as providing the most appropriate way to measure ‘sensitivity’ (reliability), and

(b) the type of departures one should worry about can NOT be just ‘generic neighborhoods’ around the model in question.

In practice, statistical misspecification is almost never ‘local’, and thorough Mis-Specification (M-S) testing provides the best way to narrow down the departures one should worry about. Indeed, it turns out that the ‘sensitivity’ of inference depends crucially on the exact nature of the misspecification. For instance, knowing that one’s model does not account for the temporal dependence in the data will not do! The type of temporal dependence (Markov, exchangeable, etc.) makes a huge difference to the ‘sensitivity’. In light of that, when one goes the extra mile to get such information, it would be a waste not to respecify the original model and find the optimal procedure, instead of settling for ‘second best’: an inference procedure developed for the misspecified model that is merely ‘not too bad’ under the adequately specified one!
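That the exact nature of the dependence matters can be illustrated with a short simulation (a sketch with illustrative choices – n = 50 observations, R = 20,000 replications, AR(1) coefficient ρ = 0.5 – none of which come from the discussion above): the t-test holds its nominal 5% size under NIID data, but its actual size balloons under even moderate Markov dependence with the same Normal marginals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, R, rho = 50, 20000, 0.5
c = stats.t.ppf(0.975, df=n - 1)

def rejection_rate(X):
    """Two-sided t-test of the true mean H0: mu = 0 at nominal level 5%."""
    tau = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)
    return np.mean(np.abs(tau) > c)

# NIID Normal data: actual size agrees with the nominal 5%
size_iid = rejection_rate(rng.normal(size=(R, n)))

# Markov (AR(1)) dependence with the same N(0,1) marginal distribution
X = np.empty((R, n))
X[:, 0] = rng.normal(size=R)
for k in range(1, n):
    X[:, k] = rho * X[:, k - 1] + np.sqrt(1 - rho**2) * rng.normal(size=R)
size_ar = rejection_rate(X)

print(size_iid, size_ar)   # the second is several times the nominal level
```

Here the distribution and the marginal parameters are exactly right; only the Independence assumption fails, and the nominal error probabilities become fictitious.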

Lastly, I never subscribed to the argument that ‘all models are wrong’ and thus trying one’s best to learn as much from the data as possible is pretentious and misleading. Having modeled hundreds (literally) of actual economic data series for more than 25 years, I know better!

Christian: This time I’m afraid I don’t understand what you’re claiming. Maybe you can clarify. I took Spanos to be pointing out that departures from assumptions can invalidate purported claims of optimality, hence claims of robustness. Moreover, if one has identified departures, thereby showing a model to be statistically inadequate, he recommends respecifying the model to arrive at a statistically adequate model in relation to which error probabilistic assessments are relevant.

Christian, I agree with the points you make. With model-based inference, figuring out all the different mis-specifications that might matter to the analysis is hard enough, even when we pretend we know the true underlying model. Also, having power to detect all the different mis-specifications is not plausible, and even for mis-specifications we do have power to detect, the impact of re-specification on the overall analysis is very hard to learn about. So, while I see it’s not easy to apply in situations where few design-based assumptions (e.g. simple random sampling) are justified, the Huber-ite approach of designing methods that provide useful inference on a mildly-constrained set of well-defined parameters across a range of plausible distributions (i.e. smooth densities, no hugely heavy tails) seems a very attractive foundation.

Also very disappointed to see arguments raised, again, that “I’ve done it this way, successfully, for 25 years” and so it must be right. This could equally well be claimed by Kadane, or Huber, or Efron, or many others, all of whom would have different versions of “right”, among which we can find conflicts if we look hard enough. Showing that one general approach works well in one application area is an argument for it *possibly* being admissible, but that’s all.

Aris, we’re probably not going to agree regarding “all models are wrong”… anyway, I don’t see a contradiction between looking for gross violations of models that can be found by M-S tests and adjusting the models accordingly (which is what you advertise) and then still doing something that is locally robust because the thing with many local violations of models is that they cannot be detected. Which also implies that no amount of experience can tell anyone that nothing is going on in local neighborhoods.

Mayo: As far as you summarise Aris, I don’t disagree. What I disagree with is that he seems to say, more or less explicitly, that after having gone through a battery of M-S tests we can feel safe and don’t need robustness against still possible and undetected/undetectable misspecifications anymore.

Guest: I did not say that “I’ve done it this way, successfully, for 25 years, and so it must be right.” What I said is that after 25 years of modeling in the trenches I know better than to fall for ‘theoretical’ arguments as to what is and what is not possible in empirical modeling, when the primary aim is to learn as much as possible from the data about the phenomena of interest.