More Fisher insights from A. Spanos, this from 2 years ago:

One of R. A. Fisher’s (17 February 1890 – 29 July 1962) most remarkable, but least recognized, achievements was to initiate the recasting of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

M_{θ}(**x**) = {f(**x**;θ), θ∈Θ}, **x**∈R^{n}, Θ⊂R^{m}, m < n, (1)

where the distribution of the sample f(**x**;θ) ‘encapsulates’ the probabilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments, implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand **x**_{0}:=(x_{1},x_{2},…,x_{n}). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pearson’s approach, proceeding from data **x**_{0} in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespecified M_{θ}(**x**) (a ‘hypothetical infinite population’), and view **x**_{0} as a ‘typical’ realization thereof; see Spanos (1999).

In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material experiment into M_{θ}(**x**), and taming errors via probabilification, i.e. to define frequentist error probabilities in the context of a statistical model. These error probabilities (a) are deductively derived from the statistical model, and (b) provide a measure of the ‘effectiveness’ of the inference procedure: how often a certain method will give rise to correct inferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been refined and extended by Mayo’s error statistical approach (EGEK 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); see Mayo and Spanos (2011).
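
To make the idea of error probabilities tied to a statistical model concrete, here is a minimal simulation sketch. It assumes a simple Normal IID model with known σ and a one-sided z-test of H_{0}: θ ≤ θ_{0}; the function name `rejection_rate` and all parameter values are illustrative choices, not anything taken from Fisher or from the post. The simulation only checks numerically what is derived deductively from the model: when θ = θ_{0}, the long-run rejection frequency should sit close to the nominal α.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(theta_true, theta0=0.0, sigma=1.0, n=20, alpha=0.05,
                   reps=20000, seed=0):
    """Estimate how often a one-sided z-test rejects H0: theta <= theta0.

    Assumed model (illustrative): X_1, ..., X_n are IID Normal(theta_true, sigma^2)
    with sigma known. The test rejects when sqrt(n) * (mean - theta0) / sigma
    exceeds the (1 - alpha) quantile of the standard Normal.
    """
    rng = random.Random(seed)
    c_alpha = NormalDist().inv_cdf(1 - alpha)  # rejection threshold
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(theta_true, sigma) for _ in range(n)]
        z = (n ** 0.5) * (fmean(sample) - theta0) / sigma
        if z > c_alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    # Data generated under the null boundary: estimates the type I error rate.
    print("rejection rate when theta = theta0:", rejection_rate(0.0))
    # Data generated under a departure from the null: estimates the power there.
    print("rejection rate when theta = 0.5:  ", rejection_rate(0.5))
```

The first figure estimates the type I error probability and the second the power at one particular alternative; both are properties of the test procedure under the assumed model, not of any single data set.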

Frequentist statistical inference was largely in place by the late 1930s. Fisher, almost single-handedly, created the current theory of ‘optimal’ point estimation and formalized significance testing based on p-value reasoning. In the early 1930s Neyman and Pearson (N-P) proposed an ‘optimal’ theory for hypothesis testing, by modifying/extending Fisher’s significance testing. By the late 1930s Neyman proposed an ‘optimal’ theory for interval estimation analogous to N-P testing. Despite these developments in frequentist statistics, its philosophical foundations concerned with the proper form of the underlying inductive reasoning were in a confused state. Fisher was arguing for ‘inductive inference’, spearheaded by his significance testing in conjunction with p-values and his fiducial probability for interval estimation. Neyman was arguing for ‘inductive behavior’ based on N-P testing and confidence interval estimation firmly grounded on pre-data error probabilities.

The last exchange between these pioneers took place in the mid-1950s (see Fisher, 1955; Neyman, 1956; Pearson, 1955) and left the philosophical foundations of the field in a state of confusion with many more questions than answers.

One of the key issues of disagreement was about the relevance of alternative hypotheses and the role of the pre-data error probabilities in frequentist testing, i.e. the irrelevance of Errors of the “second kind”, as Fisher (1955, p. 69) framed the issue. My take on this issue is that Fisher did understand the importance of alternative hypotheses and the power of the test by talking about its ‘sensitivity’:

“By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis.” (Fisher, 1935, p. 22)

If this is not the same as increasing the power of the test by increasing the sample size, I do not know what it is! What Fisher and many subsequent commentators did not appreciate enough was that Neyman and Pearson defined the relevant alternative hypotheses in a very specific way: to be the complement to the null relative to the prespecified statistical model M_{θ}(**x**):

H_{0}: θ∈Θ_{0} vs. H_{1}: θ∈Θ_{1}, (2)

where Θ_{0} and Θ_{1} constitute a partition of the parameter space Θ. That rendered the evaluation of power possible and Fisher’s comment about type II errors:

“Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis.” simply misplaced.
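
The point that power becomes calculable once the alternative is defined as the complement of the null within the model can be illustrated with a short sketch. It assumes, purely for illustration, a Normal IID model with known σ and the one-sided partition Θ_{0} = {θ ≤ θ_{0}}, Θ_{1} = {θ > θ_{0}}; under those assumptions the power of the z-test at any θ_{1} in Θ_{1} is 1 − Φ(c_{α} − √n(θ_{1} − θ_{0})/σ). The function name `power_one_sided_z` and the numerical values are hypothetical choices.

```python
from statistics import NormalDist


def power_one_sided_z(theta1, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power of the one-sided z-test of H0: theta <= theta0 vs H1: theta > theta0.

    Assumed model (illustrative): X_1, ..., X_n IID Normal(theta, sigma^2),
    sigma known. Returns the probability of rejecting H0 when theta = theta1.
    """
    std_normal = NormalDist()
    c_alpha = std_normal.inv_cdf(1 - alpha)          # rejection threshold
    shift = (n ** 0.5) * (theta1 - theta0) / sigma   # standardized discrepancy
    return 1 - std_normal.cdf(c_alpha - shift)


if __name__ == "__main__":
    # Fisher's 'sensitivity': the same small departure from the null becomes
    # easier to detect as the size of the experiment grows.
    for n in (10, 40, 160):
        print(f"n = {n:3d}, power at theta1 = 0.2: {power_one_sided_z(0.2, n=n):.3f}")
```

For a fixed discrepancy θ_{1} − θ_{0}, the computed power rises with n, which is one way to read Fisher’s remark about rendering an experiment ‘more sensitive’; the type II error probability at θ_{1} is one minus this value.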

Let me finish with a quotation from Fisher (1935) that I find very insightful and as relevant today as it was then:

“In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence.” (pp. 25-26)

**References**

[1] Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society A*, 222: 309-368.

[2] Fisher, R. A. (1935), *The Design of Experiments*, Oliver and Boyd, Edinburgh.

[3] Fisher, R. A. (1955), “Statistical methods and scientific induction,” *Journal of the Royal Statistical Society*, B, 17: 69-78.

[4] Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 151-196 in the *Handbook of Philosophy of Science*, vol. 7: *Philosophy of Statistics*, D. Gabbay, P. Thagard, and J. Woods (editors), Elsevier.

[5] Neyman, J. (1956), “Note on an Article by Sir Ronald Fisher,” *Journal of the Royal Statistical Society*, B, 18: 288-294.

[6] Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society*, B, 17: 204-207.

[7] Spanos, A. (1999), *Probability Theory and Statistical Inference: Econometric Modeling with Observational Data*, Cambridge University Press, Cambridge.

I find this extremely insightful. I wonder how much of the confusion Fisher tried to clear up underlies current-day confusions about the nature of parameters within statistical models. A couple of things on my own work:

(1) It’s good to see that I’ve moved from the somewhat timid reinterpretation of error statistics in EGEK (which now sounds to me to be overly “behavioristic”) to a conception that reflects the more philosophical chapters of the book. Since around 2004, I took the leap to viewing error probabilities as assessing the capability of the test at hand (to have uncovered the inferential mistake of relevance). This counterfactual construal is in sync with what some philosophers call a strong argument from coincidence. It depends on accepting the severity principle which seems to me to reflect the minimal requirement for evidence. A blogpost on this may be found here (Jan 2, 2013):

http://errorstatistics.com/2013/01/02/severity-as-a-metastatistical-assessment/

(2) I know it is dangerous to reopen the Fisherian Pandora’s box of fiducial measures, but sometimes I wonder… After all, “fiducial measures” in science are actually related to the giving of standard benchmarks for calibration. There are places where Fisher talked as if he was after a severity assessment, or at least sensitivity and precision.
