By Aris Spanos
One of R. A. Fisher’s (17 February 1890 – 29 July 1962) most remarkable, but least recognized, achievements was to initiate the recasting of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:
Mθ(x) = {f(x;θ), θ∈Θ}, x∈ℝⁿ, Θ⊂ℝᵐ, m < n, (1)
where the distribution of the sample f(x;θ) ‘encapsulates’ the probabilistic information in the statistical model.
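As an illustration of (1), here is one familiar instance, the simple Normal model. It is an assumed example chosen for concreteness and is not spelled out in the post; the symbols μ and σ² are introduced only for this sketch.

```latex
\mathcal{M}_{\theta}(\mathbf{x}):\quad X_k \sim \mathrm{NIID}(\mu,\sigma^{2}),\quad k=1,2,\ldots,n,
\qquad \theta:=(\mu,\sigma^{2})\in\Theta:=\mathbb{R}\times\mathbb{R}_{+},\quad m=2<n,

f(\mathbf{x};\theta)=\prod_{k=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}
\exp\!\left\{-\frac{(x_k-\mu)^{2}}{2\sigma^{2}}\right\},\qquad \mathbf{x}\in\mathbb{R}^{n}.
```

Here the product form of f(x;θ) reflects the independence and identical distribution assumptions, so the whole probabilistic structure of the model is carried by the distribution of the sample.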
Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments; implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand x0:=(x1,x2,…,xn). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.
Fisher was able to recast statistical inference by turning Karl Pearson’s approach, proceeding from data x0 in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespecified Mθ(x) (a ‘hypothetical infinite population’), and view x0 as a ‘typical’ realization thereof; see Spanos (1999).
In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material experiment into Mθ(x), and taming errors via probabilification, i.e. to define frequentist error probabilities in the context of a statistical model. These error probabilities are (a) deductively derived from the statistical model, and (b) provide a measure of the ‘effectiveness’ of the inference procedure: how often a certain method will give rise to correct inferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been refined and extended by Mayo’s error statistical approach (e.g., 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); see Mayo and Spanos (2011).
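To make the idea of a deductively derived error probability concrete, here is a minimal sketch. It assumes a simple Normal model with known σ = 1 and a one-sided z-test of H0: μ = 0, neither of which is specified in the post; the function name `rejection_rate` and all settings are hypothetical. The type I error probability is fixed by the model at α, and simulating from the model under the null only illustrates that the relative frequency of rejections settles near that value.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(mu_true, n=25, alpha=0.05, reps=20000, seed=1):
    """Relative frequency with which a one-sided z-test rejects H0: mu = 0.

    Assumes X_1, ..., X_n are NIID(mu_true, 1), with sigma = 1 taken as known
    (an illustrative assumption, not something stated in the post).
    """
    rng = random.Random(seed)
    c_alpha = NormalDist().inv_cdf(1 - alpha)  # rejection threshold
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu_true, 1.0) for _ in range(n)]
        z = (n ** 0.5) * fmean(sample)  # test statistic sqrt(n) * (xbar - 0) / 1
        if z > c_alpha:
            rejections += 1
    return rejections / reps
```

Calling `rejection_rate(0.0)` simulates under the null, so the result should be close to the nominal α = 0.05; calling it with a positive `mu_true` gives the relative frequency of rejection when the null is false.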
Frequentist statistical inference was largely in place by the late 1930s. Fisher, almost single-handedly, created the current theory of ‘optimal’ point estimation and formalized significance testing based on the p-value reasoning. In the early 1930s Neyman and Pearson (N-P) proposed an ‘optimal’ theory for hypothesis testing, by modifying/extending Fisher’s significance testing. By the late 1930s Neyman proposed an ‘optimal’ theory for interval estimation analogous to N-P testing. Despite these developments in frequentist statistics, its philosophical foundations concerned with the proper form of the underlying inductive reasoning were in a confused state. Fisher was arguing for ‘inductive inference’, spearheaded by his significance testing in conjunction with p-values and his fiducial probability for interval estimation. Neyman was arguing for ‘inductive behavior’ based on N-P testing and confidence interval estimation firmly grounded on pre-data error probabilities.
The last exchange between these pioneers took place in the mid-1950s (see Fisher, 1955; Neyman, 1956; Pearson, 1955) and left the philosophical foundations of the field in a state of confusion with many more questions than answers.
One of the key issues of disagreement was about the relevance of alternative hypotheses and the role of the pre-data error probabilities in frequentist testing, i.e. the irrelevance of Errors of the “second kind”, as Fisher (1955, p. 69) framed the issue. My take on this issue is that Fisher did understand the importance of alternative hypotheses and the power of the test by talking about its ‘sensitivity’:
“By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis.” (Fisher, 1935, p. 22)
If this is not the same as increasing the power of the test by increasing the sample size, I do not know what it is! What Fisher and many subsequent commentators did not appreciate enough was that Neyman and Pearson defined the relevant alternative hypotheses in a very specific way: to be the complement to the null relative to the prespecified statistical model Mθ(x):
H0: θ∈Θ0 vs. H1: θ∈Θ1, (2)
where Θ0 and Θ1 constitute a partition of the parameter space Θ. That rendered the evaluation of power possible and Fisher’s comment about type II errors:
“Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis.” simply misplaced.
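A minimal sketch of why the partition makes type II errors calculable, again under assumptions not stated in the post: a simple Normal model with known σ, and the one-sided hypotheses H0: μ ≤ μ0 vs. H1: μ > μ0, which partition the parameter space. The function names `power` and `type_ii_error` are hypothetical. Once the model and the partition are fixed, the power at any point of the alternative follows from the sampling distribution of the test statistic, and it grows with n, which is Fisher’s ‘sensitivity’.

```python
from statistics import NormalDist


def power(mu1, n, mu0=0.0, sigma=1.0, alpha=0.05):
    """Power of the one-sided z-test of H0: mu <= mu0 vs. H1: mu > mu0,
    evaluated at the value mu1, for an NIID Normal sample of size n
    with known sigma (illustrative assumptions)."""
    std_normal = NormalDist()
    c_alpha = std_normal.inv_cdf(1 - alpha)      # rejection threshold
    shift = (n ** 0.5) * (mu1 - mu0) / sigma     # mean of the statistic under mu1
    return 1 - std_normal.cdf(c_alpha - shift)


def type_ii_error(mu1, n, **kwargs):
    """Probability of failing to reject H0 when the true mean is mu1."""
    return 1 - power(mu1, n, **kwargs)
```

For a fixed discrepancy such as `mu1 = 0.2`, evaluating `power(0.2, n)` for n = 25, 100, 400 gives an increasing sequence, and at `mu1 = mu0` the function returns α. The type II error probability is thus a function of the discrepancy from the null within the model, not a single unknowable number.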
Let me finish with a quotation from Fisher (1935) that I find very insightful and as relevant today as it was then:
“In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence.” (pp. 25-26)
References
[1] Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.
[2] Fisher, R. A. (1935), The Design of Experiments, Oliver and Boyd, Edinburgh.
[3] Fisher, R. A. (1955), “Statistical methods and scientific induction,” Journal of the Royal Statistical Society, B, 17: 69-78.
[4] Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 151-196 in the Handbook of Philosophy of Science, vol. 7: Philosophy of Statistics, D. Gabbay, P. Thagard, and J. Woods (editors), Elsevier.
[5] Neyman, J. (1956), “Note on an Article by Sir Ronald Fisher,” Journal of the Royal Statistical Society, B, 18: 288-294.
[6] Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, B, 17: 204-207.
[7] Spanos, A. (1999), Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.
Thanks much for the guest post. I wonder whether your point about N-P alternatives always being exhaustive (together with the null) is one that Fisher never concurred with. But I do know of a Fisher remark, quite old, where he was actually striving to explain to Neyman that he’d wind up in the same place as they (with their N-P lemma), starting from the likelihood ratio. That may well be before he decided to disagree with their standpoint.
We know that Fisher’s initial [1930-1934] reaction to the Neyman-Pearson perspective was positive and encouraging; he was a referee for the 1933 paper with a recommendation to publish the paper. The first disagreement between Fisher and Neyman is dated in 1935, and had nothing to do with testing. It was about experimental design and agricultural experiments. It seems that after that first skirmish Fisher was unsympathetic to everything Neyman proposed!
In the 1955 paper it is clear that Fisher never accepted the N-P partition view of the null and alternative hypotheses. When he argues:
“The frequency of the second kind must depend not only on the frequency with which rival hypotheses are in fact true, but also greatly on how closely they resemble the null hypothesis. Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis.”
it is clear that Fisher invokes a notion of an alternative that includes, not just the complement to the null with respect to the statistical model in question, but “everything else”, including other statistical models.
I think that Fisher’s point of view is encapsulated in the letter to Chester Bliss I quoted in my recent blog and I sum it up like this: of the three entities (null hypothesis, test statistic and alternative hypothesis), the null hypothesis is the most primitive and the alternative hypothesis the least. Fisher’s point of view would be that you cannot make the alternative to be the justification for the test statistic. This solution seems natural to the mathematician but it just shows that (s)he should get out more. The NP reasoning from Fisher’s point of view is back to front. By noting that some tests rejected the null hypothesis and others did not you might eventually get a clue as to what sort of alternative hypothesis was reasonable.
I think it significant (the word seems apposite) that the 1935 dispute with Neyman to which Aris refers is really a dispute about hypotheses. In this case it is actually Neyman’s choice of null which is at fault. In this case it is impossible to believe that Fisher’s null could be false while Neyman’s null could be true but Neyman constructs a test to deal with this impossible case and then shows that Fisher’s test would not quite hold the error rate.
Stephen: Thanks for the comment. I will reread that letter, but I’m curious to know what you think of my most recent post from Fisher.
I think one of the things I find perplexing is why anyone would suppose that to ask a question you already have to know the likely answer. As Aris notes, there are two kinds of testing situations, one “within” the model and one “without”, but even though the latter permits the entire complement to the null, I guess I just cannot think of a case where one knows the null one wants to test and has utterly no clue about how to pose a question: how can this be false? I mean there’s no reason our statistical theory has to rule out cases where investigators are utterly at a loss to verbalize what they might want to ask or inquire about. But I wouldn’t blame that on the statistical account. On the contrary, a huge asset of the general error statistical framework is in NOT demanding an exhaustive listing of all possible rival hypotheses, plus priors assigned to them.
Along these lines, I also must confess not to have intuitions about which is more “primitive” (a test stat or an alternative), and since Fisher has said that such intuitions should be evident to all thinking minds, maybe it’s not as primitive an assignment as you are alleging.
(Think of the Matrixx drug company example in Schactman’s post: does drug Z increase the chance of side effect E in this population? or not? (Matrixx alleged the rate of E was about the same)–there’s your alternative. That’s easy…but now, what test statistic can I actually use to calculate p-values on the null alone? Not so obvious.)