**As part of the week of recognizing R. A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012. (I will comment in the comments.)**

*‘Fisher’s alternative to the alternative’*

*By: Stephen Senn*

[2012 marked] the 50th anniversary of R. A. Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including his views on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990, p. 246)

It seems clear that by *hidden postulates* Fisher means *alternative hypotheses*, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than test statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest *power* gets you nowhere. This *power* depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If you knew, under all circumstances in which the null hypothesis was false, which alternative was true, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics which, when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses, and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s, but not an entirely fair one. The experience of any statistician rarely amounts to so much that it can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However, some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale than on the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example, where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards N-P.
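To see why the choice of scale matters, here is a minimal sketch with made-up numbers (not from any of the empirical studies just mentioned): three hypothetical trials share a common odds ratio of 2 but have different baseline risks, so the treatment effect is perfectly homogeneous on the odds ratio scale yet heterogeneous on the risk-difference scale.

```python
# Illustrative sketch, not Senn's actual analyses: three hypothetical trials
# sharing a common odds ratio of 2 but with different control-group risks.
# A treatment effect that is homogeneous on one scale (odds ratio)
# can look heterogeneous on another (risk difference).

def odds(p):
    return p / (1 - p)

def risk_from_odds(o):
    return o / (1 + o)

common_or = 2.0
baseline_risks = [0.1, 0.3, 0.5]  # hypothetical control-group risks

for p_c in baseline_risks:
    # treatment-group risk implied by the common odds ratio
    p_t = risk_from_odds(common_or * odds(p_c))
    rd = p_t - p_c
    print(f"control risk {p_c:.1f}: treatment risk {p_t:.3f}, "
          f"odds ratio {common_or:.1f}, risk difference {rd:.3f}")
```

The risk differences come out around 0.08, 0.16 and 0.17 respectively, so a meta-analysis on the risk-difference scale would register heterogeneity that simply is not there on the odds ratio scale.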

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

**References:**

J. H. Bennett (1990) *Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher*, Oxford: Oxford University Press.

L. J. Savage (1976) On rereading R. A. Fisher. *The Annals of Statistics*, 4, 441-500.


Before giving my comment on this, interested readers might look at pp. 289-291 from Neyman’s (1956) response to Fisher, in the paper linked to in my last post. Here it is: http://www.phil.vt.edu/dmayo/personal_website/Neyman-1956.pdf

I think what troubles me about the idea of forming a test statistic post-data is that it would allow the kind of selection effects Fisher was against. If the null were, say, “drug makes no difference,” what’s to stop the test stat from being chosen to be “improvement in memory,” or whatever accords with the data? I know that Cox emphasizes the need to adjust for selection effects, and clearly doesn’t imagine a Fisherian test would allow that.

Neyman shows that with only a null you can always reject, and complains as well that without considering “sensitivity” ahead of time, Fisher “takes it that there’s no effect” (or the like) when a non-significant result occurs, even when the test had low power (he discusses this in the section I linked to in my previous comment). I take it that most of the time the test stat, for Fisher, is to come from a good estimator of the parameter in question. But there’s still the matter of sensitivity. Cox requires at least an implicit alternative. So, I don’t really understand this alternative to the alternative, as far as current uses of tests go, even though I obviously see why Fisher was keen to reject the type 2 error in the form he imagined N-P had in mind. Barnard says that Fisher was happy with power and didn’t object even to a behavioristic justification of tests, so long as that wasn’t the only way they could be used. That seems right.
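Neyman’s “with only a null you can always reject” point is easy to demonstrate by simulation. The sketch below is my own construction, not Neyman’s example: several candidate endpoints are all pure noise, yet if the test statistic is chosen after seeing the data (report whichever endpoint looks best), a nominal 5% test rejects far more often than 5%.

```python
# A small simulation (my construction, not Neyman's own example) of the
# selection effect discussed above: under a true null, pick whichever of
# several candidate outcome measures looks most "significant" post-data.
# The nominal 5% test then rejects far more often than 5%.

import random
import math

random.seed(1)

n = 20           # observations per endpoint
endpoints = 5    # candidate test statistics to choose among post-data
sims = 4000
crit = 1.96      # nominal two-sided 5% critical value for a z-statistic

rejections = 0
for _ in range(sims):
    # every endpoint is pure noise: the null is true for all of them
    zs = [abs(sum(random.gauss(0, 1) for _ in range(n)) / math.sqrt(n))
          for _ in range(endpoints)]
    if max(zs) > crit:   # report only the most favourable endpoint
        rejections += 1

print(f"nominal level: 0.05, actual rejection rate: {rejections / sims:.3f}")
# with 5 independent endpoints the true rate is 1 - 0.95**5, about 0.23
```

With more candidate endpoints (or with test statistics tailored to the observed data) the actual rejection rate climbs towards 1, which is exactly why the selection effect has to be adjusted for.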

Thanks.

Mayo

Stephen:

You write, “what the practicing scientist wants to know is what is a good test in practice.” I think you have to be careful about giving the practicing scientist what he or she wants! It’s my impression that the practicing scientist wants certainty: thus, the result of an experiment is either “statistically significant” (the result is true) or “not significant” (the result is false). Or perhaps “marginally significant,” which is really great because then if you want the result to be true you can call it evidence in favor, and if you want the result to be false you can call it evidence against. This desire for certainty on the part of statistical consumers has historically been aided and abetted by the statistics profession.

Andrew: You’ve often said this and I’ve always been perplexed. Anyone doing a statistical test of a hypothesis knows that it’s a statistical test (as is the case in all but the most trivial tests in science). They know it’s fallible, and the error probabilities quantify the capabilities of the test, the level of inconsistency observed, the precision, etc. These are measures of how well the test has probed mistaken interpretations of data, and ruled them out–or not. So there’s no demand for certainty in applying a statistical test (or corresponding confidence interval).

The only way I can understand your point is to assume that not demanding certainty means being a probabilist: assigning a probability to a hypothesis inferred, or a comparative assessment as with Bayes factors. Is that what you mean? I deny you have to be a probabilist in order to embrace fallibility, and in fact I deny that putting probabilities over hypotheses is the best (or even a good) way to recognize the error-proneness of statistical inference.

Aside from that, I entirely agree that it’s unhelpful to talk about what “the scientist wants to know” or “what he ought to believe”, or the probability he ought to assign a theory. It makes it sound as if the frequentist error statisticians are withholding what the scientist wants. In fact, I deny a scientist wants posteriors or comparisons of posteriors given priors, in any way these may be obtained—however interpreted. Unless of course, one is merely using “probability” to mean something like how well tested or corroborated claims are (which won’t obey the probability calculus). She wants an informative theory that has been well-corroborated in some realms, but which also leaves interesting gaps, while providing tools for how to learn more, and how to relate the inquiry to understanding other problems.

Mayo, you are making an extraordinary claim with this “I deny a scientist wants posteriors or comparisons of posteriors given priors, in any way these may be obtained—however interpreted.” Interestingly, it is a claim amenable to empirical testing. Ask a few scientists if they would like to know the probability of a hypothesis given the data and I’m pretty confident that a few of them will say “Yes please!”. If you ask the many scientists who regularly use Bayesian methods, then most of them will answer in the affirmative. You must be saying something other than what your words seem to say.

Michael: You are just confirming Gelman’s point that it’s unhelpful to talk of what scientists want. In recent times, of course, many have been told what they really, really want. The word “ought” is central: wouldn’t you like to know what you ought to believe (e.g., what stock to buy)? Note my qualification: posteriors or Bayes factors as actually computed (except possibly in empirical Bayesian cases). Of course I’ve argued this for 20+ yrs; I assumed it was obvious to readers that I wouldn’t simply be claiming it. Check especially my general phil sci papers as in Error and Inference (2010). A posterior computation requires an exhaustive set of possible hypotheses or explanations of data. Scientists don’t set out an inquiry with these.

So tell me what form of obtainable posteriors you think capture a scientist’s goal in learning about and testing theories?

I’m afraid your empirical test wouldn’t be very reliable: watch what they do, not what they say–especially not when they’re prompted with a probabilist’s way of viewing inference. It’s no different in day to day learning.

Deborah:

You write, “Anyone doing a statistical test of a hypothesis knows that it’s a statistical test (as is the case in all but the most trivial tests in science). They know it’s fallible, and the error probabilities quantify the capabilities of the test, the level of inconsistency observed, the precision, etc.”

I disagree. I suggest you read papers in Psychological Science, PPNAS, etc., even the Journal of the American Statistical Association. It is standard reasoning that if something is “statistically significant” then it represents a real effect, and if something is “not statistically significant” then it can be treated as zero. This is, I’d say, the standard way that hypothesis tests are used, millions of times a year.

Yes. I agree that one has to be wary about giving scientists what they want out of statistics. I frequently stress that biostatisticians should give physicians what they need, and that this is often not the same as what they want.

I think that your criticism is a reasonable one and it is one that Student [1] himself made of Fisher’s attitude, writing:

‘I personally choose the method which is most likely to be profitable when designing the experiment rather than use Prof. Fisher’s system of a posteriori choice [2] which has always seemed to me to savour rather too much of “heads I win, tails you lose”.’ (p. 370)

Here Student is referring to the opinion of Fisher that, for a matched pairs design where the result was non-significant, one could always check to see whether the two-sample t-test might not give a significant result. This is a piece of “cake and eat it” advice of Fisher’s that I have always found rather strange.
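To make the “heads I win, tails you lose” scenario concrete, here is a toy numerical sketch with invented data (not Gosset’s field plots): the matched-pairs t-test on the within-pair differences misses the 5% level on 3 degrees of freedom, while the two-sample t-test on the very same numbers clears it on 6, because the pairing happens to be uninformative here and the paired test pays for it in degrees of freedom.

```python
# Toy illustration with made-up data (not Gosset's or Fisher's): the same
# numbers analysed two ways. The paired analysis misses 5% significance;
# the two-sample analysis of the identical data achieves it.

import math
import statistics

a = [10.0, 11.0, 10.5, 11.5]   # hypothetical control measurements
b = [13.5, 11.5, 13.5, 12.5]   # hypothetical treated partners
n = len(a)

# paired analysis: t on the within-pair differences, df = n - 1 = 3
diffs = [y - x for x, y in zip(a, b)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# two-sample analysis: pooled-variance t, df = 2n - 2 = 6
pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
t_two = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var * 2 / n)

print(f"paired t     = {t_paired:.3f}  (5% critical value on 3 df: 3.182)")
print(f"two-sample t = {t_two:.3f}  (5% critical value on 6 df: 2.447)")
```

Here the paired t is about 2.72, short of the 3.182 needed on 3 df, while the two-sample t is about 3.46, well past the 2.447 needed on 6 df; a Fisherian following the a posteriori advice would simply switch analyses and declare significance.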

However, that does not mean that Neyman is let off the hook, since he substitutes an a priori choice that could have been different had another statistician been involved, a choice which implies that a great deal of information is available when it is not.

Furthermore, your criticism ought to apply equally to any scheme in which models are examined for adequacy prior to carrying out a substantive test, a procedure that makes some of us uneasy, but which has its defenders. (See, for example, posts by Aris Spanos.)

One way of rescuing Fisher’s scheme from your criticism is to say that the choice for a current data-set should be based on analysis of previous ones. This, in my view, has a lot to recommend it. And I have frequently suggested to pharma companies that they should consider re-analysing old trials, not with the object of changing their views on previously studied treatments, but in order to learn how to analyse better in future. See[3] for a discussion.

References

[1] Gosset WS. Comparison between balanced and random arrangements of field plots. *Biometrika* 1938; 29:363–378.

[2] Fisher RA. *Statistical Methods for Research Workers*, §24.1 (5th ed.).

[3] Senn, S. (2008). Statisticians and evidence: mote and beam. *Pharmaceutical Statistics*, 7(3), 155-157.

Stephen:

Thanks so much for your comment. There’s too much I’d want to say. But first off, I had no idea that Fisher would ever entertain post-data designation of alternatives (did he really do that?) because N-P spend a long time talking about the need for an alternative, and don’t really mention that. They’re concerned with (a) the fact that one can find a way to have the same data reject or accept while still satisfying Fisher’s p-value requirement, and (b) the fact that considering the alternative makes it possible to ensure power; they complain that some of Fisher’s tests have low power, but that he takes nonsignificance as the data corroborating the null in some sense. (See Neyman’s 1956 reply, linked in my last post.)

So, I’m surprised you say that Fisher didn’t block data-dependent alternatives (or corresponding data-dependent choices of test statistic). What I mean is, although the formal test appears to license a data-dependent test statistic, I assumed that it was an unwritten rule that Fisher precluded your doing that, because to do that would vitiate the key argument: e.g., that a result as or more statistically significant would be rare under the null. To choose a data-dependent test statistic, I can imagine Fisher saying, would be to hide information. I will continue to assume that Fisherians do block such a move, because otherwise David Cox wouldn’t be so concerned with selection effects.

Now you raise a very different case: data dependencies in identifying the source of anomalies, whether in formally testing hypotheses, in testing model assumptions, or in entirely informal set-ups. I’ve written a lot on this over many years. There’s a difference between what might be called “explaining a known effect” and trying to infer there even is a genuine effect at all. The Fisherian test context we were talking about is an example of the latter. By contrast, explaining a deflection effect, or changes in scale of photos before and after an eclipse, are examples of the former. Is there a damaging change of scale, and if so, what caused it? Was it the heat of the sun? Or did Eddington get some of his jelly-donut on the plates? Reliable means to pinpoint blame for known effects or anomalies post-data may be available, even if no one thought of the way heat can warp telescope mirrors in advance. There is a set of assumptions required for usable plates (for safely applying a statistical method), just like a set of assumptions for a usable model, and these are predesignated. If you can show the mirror distortion (or jelly donut) precludes a reliable estimate of error, then the original data do not satisfy the assumptions needed to be usable. That’s why George Barnard emphasized, specifically discussing the eclipse tests, that a Neyman-style requirement of predesignating the number of usable data points can’t hold up in such cases. You can only use the unblurred star images, and this is only known post-data. (You can tell I’ve been writing about the eclipse tests of GTR.)

I’m blurring a few things, but I’m denying that all data dependent specifications invalidate the relevant error probabilities. It’s rather the opposite: to vouchsafe relevant error probabilities may require post-data specifics.

I noticed a recent article by Kadane “Beyond Hypothesis Testing” that says “Implementation of Fisher’s proposal requires that the stochastic model and the test statistic be chosen before the data are examined”. I’m not sure where he gets that.


One could turn the Neyman-Pearson game on its head and say that every test statistic implies an alternative. If the test rejects in case T>c, the implicit alternative against which the test is testing is all distributions for which P(T>c) is larger than for the H0. Usually when testing hypotheses we are interested in certain deviations from the H0 but not in all conceivable deviations. For example, if H0: N(a,sigma^2) with fixed a we may be interested in comparing locations (e.g., whether we find outcomes that are systematically larger or smaller than a) but we may not be interested in whether the distribution really has a normal shape, or at most to the extent to which it would otherwise be detrimental to making statements about location. Certainly for any data that human beings can observe, N(a,sigma^2) is violated because of discreteness, but we are not interested in this and there’s no need to test it.

Running a Gauss test or t-test compares locations, not distributional shapes or variances; in this sense these tests come with an implicit alternative, and that’s why we use them. This implicit alternative is bigger than the class of N(b,sigma^2) with b not equal to a, though (and therefore the issue is somewhat more complex than what is covered by Neyman-Pearson theory); all kinds of distributions with mean not too close to a are in it. Anyway, I don’t find any appeal in thinking about a test without taking into account the alternative (or rather set of alternatives) against which it is meant to test, and the (implicit) alternative against which it actually tests.
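The location-versus-shape distinction can be checked with a quick simulation (my own construction, not from the discussion above): a one-sample t-test of H0: mean = 0 rejects often under a location shift, but stays near its nominal level for a decidedly non-normal distribution whose mean is still 0.

```python
# A small simulation of the point above: a one-sample t-test of
# H0: mean = 0 "tests against" location shifts, not distributional shape.
# It rejects often when the mean moves, but stays near its nominal 5%
# level for a markedly non-normal distribution whose mean is exactly 0.

import random
import math
import statistics

random.seed(2)

def t_rejects(sample, crit=2.045):  # two-sided 5% critical value for 29 df
    n = len(sample)
    t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
    return abs(t) > crit

n, sims = 30, 2000

# location alternative: normal, but with mean shifted to 0.5
shifted = sum(t_rejects([random.gauss(0.5, 1) for _ in range(n)])
              for _ in range(sims)) / sims

# shape alternative: uniform on (-sqrt(3), sqrt(3)) -- not remotely normal,
# but its mean is exactly 0, so it is "null" as far as location goes
s3 = math.sqrt(3.0)
flat = sum(t_rejects([random.uniform(-s3, s3) for _ in range(n)])
           for _ in range(sims)) / sims

print(f"rejection rate, N(0.5, 1):           {shifted:.3f}")  # well above 0.05
print(f"rejection rate, uniform with mean 0: {flat:.3f}")     # near 0.05
```

The test has substantial power against the location shift but essentially none against the change of shape, which is exactly the sense in which its implicit alternative is about location.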

Christian: Sensible as always. I too “don’t find any appeal in thinking about a test without taking into account” at least an implicit alternative. However, even N-P tests as I see them (and it doesn’t matter if this is a broader N-P-F umbrella) assume the alternatives of interest will be specified, allowing the bigger (or even smaller) alternative you mention. For instance, the claims for which we assess severity are post-data. People sometimes allege that a test assumes every aspect of the statistical model is entirely “correct”, just to ask a simple question. As you suggest, so long as the inference we care about isn’t radically vitiated by the well-recognized idealization, it’s not under test for the moment.

But back to the Fisherian “alternative to the alternative” business: I think the supposition that the alternative needn’t be prespecified, even implicitly, has led to some of the fallacious uses of tests we see, especially when the alternative is taken to be a substantive research hypothesis.

I think the Fisherian standpoint is what led Cox to set out a taxonomy of selection effects for specifying the alternative post-data.