As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 lead to a troubling issue that I will bring up in the comments today.
‘Fisher’s alternative to the alternative’
By: Stephen Senn
[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).
The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.
The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.
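Bliss's point (b) can be illustrated with the standard variance-stabilizing property of the inverse sine transformation. The following is a minimal sketch, not a reconstruction of Bliss's or Snedecor's actual data or analysis: the sample size `n = 200` and the proportions 0.1, 0.3 and 0.5 are assumptions chosen purely for illustration. It computes exact variances under a binomial model, showing that the variance of a raw percentage depends on the true proportion, whereas the variance of its inverse sine stays close to 1/(4n).

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_variance(n, p, transform):
    """Exact variance of transform(X / n) when X ~ Binomial(n, p)."""
    values = [transform(k / n) for k in range(n + 1)]
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, probs))
    return sum(w * (v - mean) ** 2 for v, w in zip(values, probs))

def identity(x):
    return x

def inverse_sine(x):
    """Inverse sine (arcsine square-root) transformation of a proportion."""
    return math.asin(math.sqrt(x))

n = 200                      # assumed sample size, for illustration only
proportions = (0.1, 0.3, 0.5)  # assumed true proportions

raw_vars = [exact_variance(n, p, identity) for p in proportions]
stabilized_vars = [exact_variance(n, p, inverse_sine) for p in proportions]

for p, raw, stab in zip(proportions, raw_vars, stabilized_vars):
    print(f"p={p:.1f}  var(raw)={raw:.6f}  var(inverse sine)={stab:.6f}")
print(f"1/(4n) = {1 / (4 * n):.6f}")
```

On the raw scale, samples with proportions near 0.5 carry less information per observation than samples near 0 or 1, so an unweighted analysis treats unequal precisions as equal. On the transformed scale the precisions are roughly equal, which is one way of reading the claim that the transformed analysis is "a sort of weighted analysis of percentages".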
Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:
A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?
B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990) (p246)
It seems clear that by hidden postulates Fisher means alternative hypotheses and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest power gets you nowhere. This power depends on the alternative hypothesis but how will you choose your alternative hypothesis? If you knew that under all circumstances in which the null hypothesis was true you would know which alternative was false you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics, which when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses and the latter cannot be made the justification of the former.
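The dependence of power on the alternative can be made concrete with a small simulation. This is only a sketch under stated assumptions: the null hypothesis (independent standard Normal observations), the two test statistics, the sample size, the seed and the two alternatives are all hypothetical choices for illustration, not anything taken from Fisher's letter. Both statistics give valid tests of the same null, with critical values calibrated by simulation under that null, yet which one rejects more often depends entirely on which alternative happens to be true.

```python
import math
import random

def mean_stat(xs):
    """Absolute standardized mean: sensitive to a shift in location."""
    return abs(sum(xs)) / math.sqrt(len(xs))

def sumsq_stat(xs):
    """Sum of squares: sensitive to an increase in spread."""
    return sum(x * x for x in xs)

STATS = {"mean": mean_stat, "sumsq": sumsq_stat}

def draw(rng, n, mu, sigma):
    return [rng.gauss(mu, sigma) for _ in range(n)]

def critical_values(rng, n, reps, alpha=0.05):
    """Monte Carlo upper-alpha cutoffs for each statistic under N(0, 1)."""
    sims = {name: [] for name in STATS}
    for _ in range(reps):
        xs = draw(rng, n, 0.0, 1.0)
        for name, stat in STATS.items():
            sims[name].append(stat(xs))
    cutoffs = {}
    for name, vals in sims.items():
        vals.sort()
        cutoffs[name] = vals[int((1 - alpha) * reps)]
    return cutoffs

def rejection_rates(rng, n, reps, cutoffs, mu, sigma):
    """Proportion of simulated samples in which each test rejects."""
    hits = {name: 0 for name in STATS}
    for _ in range(reps):
        xs = draw(rng, n, mu, sigma)
        for name, stat in STATS.items():
            if stat(xs) > cutoffs[name]:
                hits[name] += 1
    return {name: h / reps for name, h in hits.items()}

rng = random.Random(1938)  # arbitrary seed, fixed for reproducibility
n = 20                     # assumed sample size
cutoffs = critical_values(rng, n, reps=4000)

size_null = rejection_rates(rng, n, 2000, cutoffs, mu=0.0, sigma=1.0)
power_shift = rejection_rates(rng, n, 2000, cutoffs, mu=0.6, sigma=1.0)
power_scale = rejection_rates(rng, n, 2000, cutoffs, mu=0.0, sigma=1.5)

print("null true:      ", size_null)
print("shifted mean:   ", power_shift)
print("inflated spread:", power_scale)
```

Under the shifted-mean alternative the mean-based test rejects more often; under the inflated-spread alternative the ordering reverses. Neither statistic is "most powerful" until an alternative has been specified, which is the step Fisher regarded as a hidden postulate.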
I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
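The difference between the two scales is a matter of simple arithmetic, sketched below. The baseline risks and the common odds ratio of 0.5 are hypothetical numbers chosen for illustration; they are not taken from any of the empirical collections of meta-analyses mentioned above. The sketch shows only that a treatment effect that is constant on the odds ratio scale necessarily looks heterogeneous on the risk difference scale when baseline risks vary between trials.

```python
def risk_under_odds_ratio(baseline_risk, odds_ratio):
    """Treated-group risk implied by a baseline risk and an odds ratio."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds_ratio * baseline_odds
    return treated_odds / (1 + treated_odds)

baseline_risks = [0.05, 0.20, 0.50]  # hypothetical control-group risks
common_odds_ratio = 0.5              # hypothetical constant treatment effect

rows = []
for p0 in baseline_risks:
    p1 = risk_under_odds_ratio(p0, common_odds_ratio)
    rows.append((p0, p1, p1 - p0))
    print(f"control risk={p0:.2f}  treated risk={p1:.4f}  "
          f"risk difference={p1 - p0:+.4f}")
```

The converse also holds: a constant risk difference would imply odds ratios that vary with baseline risk, so the arithmetic alone cannot say which scale is the right one. That has to come from the mixture of argument and experience described above.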
Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.
J. H. Bennett (1990) Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher, Oxford: Oxford University Press.
L. J. Savage (1976) On rereading R. A. Fisher. The Annals of Statistics, 4(3), 441-500.
- JERZY NEYMAN: Note on an Article by Sir Ronald Fisher (errorstatistics.com)
- E.S. PEARSON: Statistical Concepts in Their Relation to Reality (errorstatistics.com)
- Fisher, Statistical Methods and Scientific Inference (errorstatistics.com)
The comment I’d said I would leave is essentially the one from this time last year, but I would like to ponder it some more, since it surprised me. Here it is:
I think what troubles me about the idea of forming a test statistic post data is that it would allow the kind of selection effects Fisher was against. If the null were, say, drug makes no difference, what’s to stop the test stat from being chosen to be “improvement in memory” or whatever accords with the data? I know that Cox emphasizes the need to adjust for selection effects, and clearly doesn’t imagine a Fisherian test would allow that.
Neyman shows that with only a null you can always reject, and complains as well that without considering “sensitivity” ahead of time, Fisher “takes it that there’s no effect” (or the like) when a non-significant result occurs, even when the test had low power (he discusses this in the section I linked to in my previous comment). I take it that most of the time the test stat, for Fisher, is to come from a good estimator of the parameter in question. But there’s still the matter of sensitivity. Cox requires at least an implicit alternative. So, I don’t really understand this alternative to the alternative, as far as current uses of tests go, even though I obviously see where Fisher was keen to reject the type 2 error in the way he imagined N-P had in mind. Barnard says that Fisher was happy with power and didn’t object even to a behavioristic justification of tests, so long as that wasn’t the only way they could be used. That seems right.