This year marks the 50th anniversary of R. A. Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage noted a problem in Fisher’s approach to testing. Describing Fisher’s aversion to power functions, Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976, and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including his views on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.
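Bliss’s point (b) can be illustrated numerically. For a binomial proportion p̂ the variance p(1−p)/n depends on the unknown p, whereas the variance of arcsin √p̂ is approximately 1/(4n) whatever the true p, which is the sense in which the transformation equalises the weight of information in each sample. A minimal simulation sketch (the sample size, seed, and choice of proportions are arbitrary assumptions of mine, not taken from the correspondence):

```python
import math
import random

random.seed(1)
n = 50          # binomial denominator per sample (arbitrary)
reps = 20000    # simulated samples per true proportion

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

results = {}
for p in (0.2, 0.5, 0.8):
    raw, ang = [], []
    for _ in range(reps):
        k = sum(random.random() < p for _ in range(n))
        phat = k / n
        raw.append(phat)
        ang.append(math.asin(math.sqrt(phat)))  # inverse sine transformation
    results[p] = (sample_var(raw), sample_var(ang))
    print(f"p={p}: var(phat)={results[p][0]:.5f}, "
          f"var(asin sqrt phat)={results[p][1]:.5f}, 1/(4n)={1 / (4 * n):.5f}")
```

On this sketch the variances of the raw proportions differ noticeably across the three values of p, while the variances of the transformed values all sit near 1/(4n) = 0.005.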

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990, p. 246)

It seems clear that by *hidden postulates* Fisher means *alternative hypotheses*, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than test statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest *power* gets you nowhere. This *power* depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If, in all circumstances in which the null hypothesis was false, you knew which alternative was true, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics which, when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses, and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that it can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data, I have theoretical grounds (I believe) for a prejudice against the risk-difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However, some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds-ratio scale than on the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example, where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the Neyman-Pearson.
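A concrete sketch of why the choice of scale matters (the numbers here are invented purely for illustration): if the treatment effect is a constant odds ratio across trials, the implied risk difference varies with the control-group risk, so a set of trials with varying background risk will look homogeneous on one scale and heterogeneous on the other.

```python
def odds(p):
    return p / (1 - p)

def treated_risk(p_control, odds_ratio):
    """Treatment-group risk implied by a control risk and a fixed odds ratio."""
    o = odds(p_control) * odds_ratio
    return o / (1 + o)

OR = 0.5  # hypothetical common treatment odds ratio across trials
for p_c in (0.5, 0.3, 0.1):  # made-up control risks for three trials
    p_t = treated_risk(p_c, OR)
    print(f"control risk {p_c:.1f}: treated risk {p_t:.3f}, "
          f"risk difference {p_t - p_c:+.3f}, "
          f"odds ratio {odds(p_t) / odds(p_c):.2f}")
```

The odds ratio is 0.5 in every row by construction, while the risk difference shrinks as the control risk falls: an effect that is one thing on one scale and several things on the other.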

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

**References:**

J. H. Bennett (ed.) (1990) *Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher*. Oxford: Oxford University Press.

L. J. Savage (1976) On rereading R. A. Fisher. *The Annals of Statistics*, 4, 441–500.


Thanks for this post. I have to say that I have never understood the big deal about specifying alternative answers to a question of interest. In all of inquiry, we have problems and questions of interest: is there a discrepancy of a given type? is the rotation speeding up? slowing down? are the trials independent or not? If you don’t have a clue what you want to know, or how you can be wrong about something (i.e., what it would mean for H to be false), then why are you doing the inquiry to begin with?

I just don’t see how formulating a denial of a claim requires knowing “more than the experiment was designed to find out”, or even that it requires knowing very much. (I don’t know if your reference to “every null” is what you think gives trouble, but so long as I know how to specify the denial in the case of interest it suffices). Cox has a good taxonomy of types of nulls, and these correspond to different questions and therefore different alternatives, or ways of being wrong about this one claim in this experiment. I have a feeling the obstacle comes from not seeing that one can profitably formulate several answers to a given question and evaluate the power for each (even though I prefer to assess severity).

My earlier comment referred to your (Senn’s) remarks two paragraphs before the last. But in the next-to-last paragraph, you allude to the use of background information in relation to discrepancies from nulls of interest, in order to explain why you might be interested in one or the other (risk difference vs odds ratio), in exactly the way I was trying to suggest. But why refer to the choice as between being “analytic” or being Fisherian? Is it not possible to explicate your reasoning in empirical terms, e.g., in relation to heterogeneity and relevant obstacles to finding out what you want to know, in the context at hand? I may be missing what you meant in using “analytic” here.

The analytic reason would be that the log-odds takes a value on the real line and is hence inherently suitable for regression modelling, since that framework cannot yield an impossible log-odds. On the other hand, the absolute risk lies somewhere on the scale 0 to 1, so a combination of absolute risk plus regression could cause a problem. (In practice it often would not.)
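A small numerical sketch of this analytic point (the coefficients are invented for illustration, not fitted to any data): a linear predictor on the risk scale can stray outside [0, 1], whereas a linear predictor on the log-odds scale, mapped back through the inverse logit, cannot.

```python
import math

def inv_logit(x):
    # Maps any real number back into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# Invented coefficients for a single covariate z.
a, b = 0.10, 0.12          # linear model on the risk scale
alpha, beta = -2.2, 0.45   # linear model on the log-odds scale

for z in (0, 5, 10):
    risk_scale = a + b * z                     # nothing stops this exceeding 1
    logit_scale = inv_logit(alpha + beta * z)  # always strictly inside (0, 1)
    print(f"z={z:2d}: risk-scale prediction {risk_scale:.2f}, "
          f"log-odds-scale prediction {logit_scale:.2f}")
```

At z = 10 the risk-scale predictor gives an impossible probability above 1, while the log-odds-scale predictor remains a valid probability however far the covariate is extrapolated.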

The empirical reason is that, in looking at sets of trials with varying background risk, results have more often been observed to be much more heterogeneous on the risk-difference scale than on the log-odds scale.

By the way, I am often amused to note that even statisticians seem to forget these simple facts about properties of bounded scales when re-scaling examination marks. The percentage scale is a bad scale to use. See http://www.senns.demon.co.uk/wprose.html#Mote
