*By: Stephen Senn*

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990) (p246)

It seems clear that by *hidden postulates* Fisher means *alternative hypotheses* and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest *power* gets you nowhere. This *power* depends on the alternative hypothesis but how will you choose your alternative hypothesis? If you knew that under all circumstances in which the null hypothesis was true you would know which alternative was false you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics, which when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
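The analytic side of this preference can be seen in a small arithmetic sketch. It assumes, purely for illustration, a common odds ratio of 0.5 applied to three hypothetical control-group risks; the numbers and the helper `treated_risk` are not taken from any of the studies mentioned above.

```python
def treated_risk(control_risk, odds_ratio):
    """Risk in the treated group implied by a control risk and an odds ratio."""
    control_odds = control_risk / (1 - control_risk)
    treated_odds = control_odds * odds_ratio
    return treated_odds / (1 + treated_odds)


# Hypothetical control-group risks and a hypothetical common odds ratio.
control_risks = [0.05, 0.20, 0.50]
common_odds_ratio = 0.5

# Risk difference (treated minus control) implied at each control risk.
risk_differences = [treated_risk(p, common_odds_ratio) - p for p in control_risks]

for p, rd in zip(control_risks, risk_differences):
    print(f"control risk {p:.2f}: risk difference {rd:+.3f}")
```

With the odds ratio held constant, the implied risk difference shrinks towards zero as the control risk falls, so an effect that is homogeneous on one scale has to be heterogeneous on the other. Which scale is in fact the more stable is the empirical question, and the sketch does not settle it.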

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

**References:**

J. H. Bennett (1990) *Statistical Inference and Analysis: Selected Correspondence of R.A. Fisher*, Oxford: Oxford University Press.

L. J. Savage (1976) On rereading R. A. Fisher. *The Annals of Statistics*, 4(3), 441-500.


Stephen: I’ve always liked this post, especially as I never really saw Fisher as deliberately proposing an alternative to the alternative in this way. I’ve also always been somewhat confused about what you mean, or think Fisher means here (e.g., in that letter). Maybe it is Fisher who is vague. Sometimes Fisher talks as if the scientist knows what he/she is looking for, and the test statistic reflects this, so no reason to set out an explicit alternative. That makes sense, and to me is precisely akin to specifying a general alternative, even if only directional. But then there are these other hints that Fisher is saying, the scientist can’t know until the data are in. That’s quite different. So maybe you will clarify this after 2 years. thanks.

Stephen: And surely you will be able to answer the question about the picture I have up–where it is, and what it is!

Deborah, not sure what picture you are referring to. One (smoking) is of Fisher at Lake Junaluska, NC 1956. The other is of Fisher at his Millionaire calculator.

I refer to the stained glass window picture on the left of the blog!

What does it mean to specify a “general alternative” hypothesis?

vl: general, directional alternative–essentially what the test statistic affords.

I assume that the logic of hypothesis testing is something like this. 1) There is a universal set of all possible hypotheses (for the general field). 2) The null hypothesis is either (if simple) a single hypothesis or (if complex) a collection of hypotheses. Either way it is a subset of the universal set. 3) If the null hypothesis is false, as a matter of logic, the complement is true. 4) The way we decide about the truth of a null hypothesis is to carry out hypothesis testing.

Now, from Fisher’s point of view there is something very odd about the Neyman-Pearson approach. Immediately without any recourse to data the universal set is reduced. Consider, for example, H0: E(X)=0 versus X~N(delta,sigma^2), delta not equal to zero. Even allowing for the fact that in the t-testing approach sigma^2 is allowed to be anything and unknown it is clear that the null and alternative hypotheses between them do not cover all the possibilities. There is a much bigger range of hypotheses that are not covered by either. So how have NP managed to reduce the universal set to the business of comparing various subsets?

Fisher says, let’s start somewhere else. Let’s start with the null hypothesis. There will be a whole range of possible valid ways we could test the null. If the null is true they would (at the 5% level of significance) lead to rejection 5% of the time. However, if the null is false then some will lead to rejection with greater probability than others. NP say it is knowledge of the alternative that leads us to choose which statistic is more sensitive, but Fisher says NP are now being inconsistent. They choose between H0 and H1 on the basis of statistics, but they choose the statistic on the basis of a choice of H1 and they made the choice of H1 without statistics. This, he says, is back to front. Your choice of H1 is not a priori. It is based on your experience with similar tests on previous occasions.

In other words in the Fisher system statistics are always the basis of the choice of hypotheses, whereas in the NP system hypotheses are sometimes chosen on the basis of statistics and sometimes vice versa.
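The point that several valid tests of the same null can differ in sensitivity, depending on the kind of departure, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sample size of 25, the shift of 0.5, the normal and Cauchy shapes, the seed, and the choice of the sign test as the second statistic are all hypothetical, and none of it reconstructs a procedure from Fisher or from Neyman and Pearson.

```python
import math
import random

N = 25
T_CRIT = 2.0639  # upper 2.5% point of Student's t with N - 1 = 24 df


def t_test_rejects(sample):
    """Two-sided one-sample t-test of 'centre = 0' at the nominal 5% level."""
    mean = sum(sample) / N
    variance = sum((x - mean) ** 2 for x in sample) / (N - 1)
    return abs(mean / math.sqrt(variance / N)) > T_CRIT


def sign_test_cutoff(alpha):
    """Largest k with 2 * P(X <= k) <= alpha for X ~ Binomial(N, 1/2)."""
    tail, k = 0.0, -1
    while True:
        next_tail = tail + math.comb(N, k + 1) / 2 ** N
        if 2 * next_tail > alpha:
            return k
        tail, k = next_tail, k + 1


SIGN_CUTOFF = sign_test_cutoff(0.05)


def sign_test_rejects(sample):
    """Two-sided sign test: reject if too few or too many positive values."""
    positives = sum(x > 0 for x in sample)
    return positives <= SIGN_CUTOFF or positives >= N - SIGN_CUTOFF


def draw(rng, shape, shift):
    """One observation: standard normal or standard Cauchy noise plus a shift."""
    if shape == "normal":
        return shift + rng.gauss(0.0, 1.0)
    return shift + math.tan(math.pi * (rng.random() - 0.5))


def rejection_rates(shape, shift, reps=4000, seed=1):
    """Simulated rejection rates of both tests for one shape and shift."""
    rng = random.Random(seed)
    t_hits = sign_hits = 0
    for _ in range(reps):
        sample = [draw(rng, shape, shift) for _ in range(N)]
        t_hits += t_test_rejects(sample)
        sign_hits += sign_test_rejects(sample)
    return {"t": t_hits / reps, "sign": sign_hits / reps}


results = {
    (shape, shift): rejection_rates(shape, shift)
    for shape in ("normal", "cauchy")
    for shift in (0.0, 0.5)
}
for key, rates in results.items():
    print(key, rates)
```

With a shift of 0.0 both statistics should reject at roughly the nominal rate or below it. With a shift of 0.5 the ranking of the two depends on the shape of the distribution, which is the sense in which neither statistic is the more sensitive one without some further knowledge, whether that knowledge is called experience or an alternative hypothesis.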

As I said in my original post, I think that Fisher’s criticism is too extreme. Nevertheless, I do think his argument is worth considering very carefully.

Stephen: thanks for this, but I think I’m still unclear. Firstly, the choice of an alternative may only be the result of a question we wish to pose–say, in your example, we ask: is mu greater than 0 or not. Now you say “in the Fisher system statistics are always the basis of the choice of hypotheses.” So I’m Fisher and I choose the statistic, say (X-bar – 0), and if the observed difference is greater than the number of standard deviations required, I reject the null and infer mu is greater than 0. I don’t see the difference, unless you are suggesting the Fisherian doesn’t have to pin down the distance statistic before running the test, but I don’t think he’d go along with this. Don’t forget that N-P also emphasize that the choice of a sensible distance statistic comes first (This is Pearson’s “step 2”).

I don’t see the difference except that Fisher is leaving things vague or implicit. And the choice of departure needn’t be based on previous experience for either.

The letter by Fisher I quoted considered various possible ways of analysing binary data, for example using the probability scale or the arc-sine transformation. Note that the difference distances are not the same for the two and don’t just differ by a scale factor. You could justify one or the other in terms of an alternative hypothesis. However, Fisher would say that in practice this would just be an excuse. If you were regularly carrying out hypothesis tests you should just use whichever of the two had had the highest rejection rate in the past. Such experience might eventually lead you to surmise that one version of the alternative was more appropriate than the other but it would be statistic first, hypothesis second.
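The claim that the two distances do not differ merely by a scale factor can be checked with a few lines of arithmetic. The sketch below uses two hypothetical pairs of proportions (not data from the Bliss correspondence) that have the same difference on the probability scale.

```python
import math


def arcsine(p):
    """Arc-sine (angular) transformation of a proportion, in radians."""
    return math.asin(math.sqrt(p))


# Two hypothetical pairs of proportions with the same raw difference of 0.10.
pairs = [(0.05, 0.15), (0.45, 0.55)]

# Ratio of the arc-sine difference to the raw difference for each pair.
ratios = [(arcsine(b) - arcsine(a)) / (b - a) for a, b in pairs]

for (a, b), ratio in zip(pairs, ratios):
    print(f"{a:.2f} vs {b:.2f}: raw {b - a:.3f}, "
          f"arc-sine {arcsine(b) - arcsine(a):.3f}, ratio {ratio:.2f}")
```

If the two scales differed only by a constant factor the two ratios would be equal. They are not: the arc-sine scale stretches differences near 0 (and near 1) relative to differences near 0.5, so the two analyses can rank the same pair of experiments differently.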

I note also that it is a historical fact that nearly all the various tests used were originally invented without consideration of an alternative hypothesis.

I don’t have that book here (in NY) but I don’t find this reflected in Fisher’s statistical formulation of tests (Is it? Please correct me.) Maybe this was why N was keen to show you could always reject a null (if one looks for a test with maximal chance of rejecting).

It is in Fisher’s correspondence as edited by Bennett but it is easy enough to grasp, surely?

Q Biologist. “Why do you suggest using the arc-sine transformation?”

Neyman “because an alternative hypothesis I favour suggests it will be more powerful”

Fisher “because my experience suggests it is more sensitive”

Who has given the more reasonable reply?

Stephen: I never imagined this would be Neyman’s response. The computation of power would be analytic, but the relevance of the distance measure in relation to the (underlying) discrepancy of interest is distinct. I mean if one is picking up on the same underlying difference it doesn’t matter, but there are different kinds of differences. The test has to make sense, e.g., the further from the null in the respect queried, the higher the power. The interpretation of the rejection, or acceptance is relative to the distance measure chosen.

I’m not sure if there’s any disagreement or just a matter of putting things in different words. Experience might give the warrant for the model…

But Neyman made a great point of investigating most powerful tests, etc., and such investigations do rely on the alternative hypothesis.

Do you mean those tests depend on the type of departure? or specific value of the parameter? In any event when you say “rely” it could be taken to mean “depends on being true” whereas the point of the most powerful test is that it doesn’t matter which is true. Anyhow, one can ascertain severity/inseverity for various discrepancies.

Yes, the most powerful, or for that matter most sensitive, test depends on the nature of the departure. Very roughly this is given by the scale that is most nearly additive.

As Bahadur and Savage http://projecteuclid.org/euclid.aoms/1177728077 showed there are no tests that are most powerful for every departure. David Salsburg wrote a book on restricted Neyman tests which is partly inspired by this idea.

Sure, but where does this leave the two approaches to specifying tests?

This is very interesting and a central issue, I think.

Here is a thought. I often have the impression that the writings of pioneers such as Neyman and Fisher are interpreted in too dogmatic a manner. I’m not denying that one may be able to find some statements in their works that say more or less clearly “you should do things in this way and not in that way”, but I think that this rather confuses the real value of their work.

Take the NP theory of optimal testing. At the core of this work is a mathematical theorem which states a mathematical fact. “If the alternative is given…” – it doesn’t say that we always *should* proceed by defining an alternative before we choose a test statistic. It only says that this is the way to go *if we want to apply the optimality theorem*.

The whole machinery of evaluating the quality of tests using power is useful and informative whether or not we choose a test in this way. Even if we do it in what Stephen portrays as the Fisherian fashion, looking at power functions against various alternatives and asking against which direction of departure from the H0 a test is optimal and against which it is rather wanting is a very good thing to understand.

Personally I would always ask myself in which kind of direction of deviation from the H0 I’d be interested and for what reason. Various reasons are conceivable: One may be interested for practical reasons in specific deviations (without implying that one believes these are the true directions), one may have the kind of experience Fisher would like to rely on, or one may have a rather specific suspicion which alternative is true in a specific situation.

In any case I doubt that the scope of NP-optimality is wide enough to cover every possible alternative of interest (model assumptions such as distributional shape or independence may always be more or less slightly violated), so in real life the best we can do is to try to be good against most of the potentially huge set of alternatives against which we’d like to test, but we never will be uniformly optimal. So it is always an open question, to be negotiated in a situation-dependent way, whether NP-optimality against a certain specific alternative is the best choice overall, or rather, for example, something more robust and less optimal against any specific direction, or something that has looked good in a number of similar past applications. So nobody needs to stick to either Fisher or NP in general.

Christian: You are right. Anyone who reads N and P hears them repeatedly say that these are just tools with different properties and it’s up to the user to choose those that seem most relevant for the goal. N and P fought against the idea of a single “best” method as squelching science, so it’s ironic whenever they are construed as laying down a single best method.

I think I can agree with most of this Christian. As I said in my ‘You may believe you are a Bayesian but you are probably wrong’ paper I think that there is something to be said for all four major schools of inference.

However, I would like to make another point as regards Fisher and Neyman. I do not view power as fundamental. I think likelihood is the more fundamental concept. So I interpret the NP lemma in more or less the opposite way to most people. Power does not become the justification for using likelihood. Likelihood is fundamental and power is (sometimes) an incidental bonus. I think that this position is somewhat closer to Egon Pearson’s view of the lemma than Neyman’s.

Well, my view is that the distributions of the relevant statistic (including the P-value), being essential for error probabilities, are essential for N-P and also Fisher (and this is in sync with Fisher’s use of tail areas)*. If you only care about likelihoods, then you forgo control and assessment of error probabilities–we’re back to the likelihood principle. For example David Cox, whom I take to be an arch Fisherian, will look at the distribution of the P-value to assess precision and sensitivity. So Cox and I are able to agree on severity or whatever name one wants to give for the associated stringency measure. Stephen’s remark reminds me of the “new properties of math likelihood”.

https://errorstatistics.com/2014/02/20/r-a-fisher-two-new-properties-of-mathematical-likelihood/

But Fisher’s not talking mere likelihood there. I don’t know how Stephen sees it.

*I also grant that one needs a “sensible distance statistic”–but that too is in both Fisher and N-P (maybe Pearson stressed this more in his “step two”).

The relevant point is that Fisher considers that significance levels could be allowed to vary from case to case according to circumstance.

NP is sometimes presented as 1) you fix alpha and 2) you minimise beta, but actually what the NP lemma shows is that (for some cases) IF you want to fix alpha and then minimise beta, likelihood is the way to guide your choice of tail area. It does not say that you should want to fix alpha and minimise beta.

The crunch comes for cases when the sample space is discrete. Then you can actually improve beta for a given alpha by abandoning likelihood as your guide. It is at this point that you should start thinking whether power or likelihood is primary. This (1) is a practical example where I have argued against going down the power route.

Reference

1) Senn, S. (2007). “Drawbacks to noninteger scoring for ordered categorical data.” Biometrics 63(1): 296-298; discussion 298-299.

A proposal to improve trend tests by using noninteger scores is examined. It is concluded that despite improved power such tests are usually inferior to the simpler integer scored approach.
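A toy version of the discrete case may make the tension concrete. The sketch below is not the noninteger-scoring example from the Biometrics note; it assumes a single binomial count with hypothetical values n = 10, a null of p = 0.5 and an alternative of p = 0.7, and compares a rejection region that follows the likelihood ordering with one that adds further points simply because the nominal 5% level leaves room for them.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def region_probability(region, n, p):
    """Probability that the count falls in the given rejection region."""
    return sum(binom_pmf(k, n, p) for k in region)


n, alpha = 10, 0.05
p_null, p_alt = 0.5, 0.7  # hypothetical null and alternative

regions = {
    # Largest upper tail with size <= alpha; it follows the likelihood ordering.
    "likelihood": {9, 10},
    # Same tail plus the counts that support p_alt least, to use up spare alpha.
    "power-seeking": {0, 1, 9, 10},
}

sizes = {name: region_probability(r, n, p_null) for name, r in regions.items()}
powers = {name: region_probability(r, n, p_alt) for name, r in regions.items()}

for name in regions:
    print(f"{name}: size {sizes[name]:.4f}, power {powers[name]:.5f}")
```

Both regions respect the 5% level and the second has (very slightly) the greater power against p = 0.7, yet it rejects the null in favour of a larger p when no successes at all are observed while failing to reject when 8 out of 10 are. That is the sort of result one gets by letting power, rather than the likelihood ordering, decide what counts as extreme.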

Stephen: Sure, nobody likes randomized-at-the-boundary tests, and only an extreme behaviorist could advance them.

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

https://errorstatistics.com/2012/08/16/e-s-pearsons-statistical-philosophy/

I have never understood why the concept of power provokes such strong reactions, even in people who are happy to talk of high sensitivity to detect effects, and the like. To confess: power is my very favorite concept, even though I’d want to compute P(D > d; H’) rather than P(D > d*; H’) where d* is the fixed cut-off for rejection (in a size alpha test).
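The difference between the two probabilities can be sketched for the simplest case. The following assumes a one-sided test of mu = 0 on normal data with known sigma, with D taken to be the standardized sample mean; that reading of D, and every number used (alpha, n, sigma, the discrepancy H’ and the observed d), is a hypothetical choice for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_exceed(cutoff, mu_alt, sigma, n):
    """P(D > cutoff; mu = mu_alt), with D = sqrt(n) * sample_mean / sigma."""
    return 1 - STD_NORMAL.cdf(cutoff - sqrt(n) * mu_alt / sigma)


alpha, sigma, n = 0.025, 1.0, 25         # hypothetical design
mu_alt = 0.5                             # hypothetical discrepancy H'
d_star = STD_NORMAL.inv_cdf(1 - alpha)   # fixed cut-off of the size-alpha test
d_observed = 3.0                         # hypothetical observed value of D

power = prob_exceed(d_star, mu_alt, sigma, n)        # P(D > d*; H')
attained = prob_exceed(d_observed, mu_alt, sigma, n)  # P(D > d; H')

print(f"P(D > d*; H') = {power:.3f}")
print(f"P(D > d;  H') = {attained:.3f}")
```

The first quantity is fixed by the design before any data are seen; the second changes with the observed d, which is why the two can give quite different impressions of the same result.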

I didn’t say I disliked power. I was pointing out that power and likelihood very often march hand in hand but gave an example where you would clearly prefer maintaining a likelihood ordering to breaking it, even though that would permit you to increase power while holding the type I error constant.

So to sum up, I was saying rather ‘I won’t sacrifice likelihood to power but am delighted to accept it as an incidental bonus when it is on offer.’

Let me also reverse your statement. Why does the concept of likelihood provoke such strong reactions even in people who attach such importance to power?

By the way if you want an example of a criticism of some who have suggested sacrificing likelihood to power see

1. Perlman MD, Wu L. The emperor’s new tests. Statistical Science 1999; 14: 355-369.

http://projecteuclid.org/euclid.ss/1009212517

> likelihood is primary

Agree, but the likelihood needs to be defined and one choice of transformation may support additivity and another not.

I once defined this as a parameter transformation choice making parameter components common, arbitrarily different, or common in distribution (more commonly referred to as exchangeable or random effects parameters). Fisher seemed to dislike common in distribution parameters and Efron clearly argues they are necessarily Bayesian.

So does not “likelihood is primary” just start another debate?

Did not mean to be anonymous here. Keith O’Rourke

Keith, I am not sure that I understand this. Likelihood is invariant to parameter transformation, but of course if you assume that the error term kicks in once you have transformed the data that is another matter. On the other hand, if you are referring to nuisance parameters I agree. They are the Achilles heel of all the systems. It would be interesting to know if Deborah’s severity can survive where Fisher’s fiducial inference failed. I have given this no real thought but my gut feeling is ‘no’. (This is not, of course, a practical criticism since it may well be roughly OK in all cases that matter.)

Stephen, once defined the full dimensional likelihood is invariant to parameter transformations and still reduction in dimension over nuisance parameters is the Achilles heel of all the systems as you put it.

But I was primarily trying to point to the (choice of) definition of the likelihood as having or not having random parameters to allow for non-additivity as suggested by choice of transformation, e.g. non-additivity for the first component of (Pt – Pc, Pc) versus additivity for the first component of (OR, Pc), or vice versa, depending on the application. The specific distributional form for those random parameters is an especially nasty nuisance parameter. I suspect Fisher avoided specifying _likelihoods_ with random parameters?

Keith

I need to think about it some more but at the back of my mind is that Lee and Nelder have insisted that one has to be very careful in the specification of H-likelihood and of course, the context, as the name implies, is hierarchical.

Yes, random parameters to me being a synonym for hierarchical likelihood and H-likelihood being one way to define (Nelder?) or simply deal with (others, e.g. Cox?) the hierarchical likelihood, with some disagreements.

Sorry for being vague, but also I am primarily interested currently in Fisher’s views (before H-likelihood).

Keith

If we want to evaluate the quality of a test against all kinds of nonstandard (but potentially interesting) alternatives, power will help us and likelihood won’t.

For example, how good is the two sample t-test if the samples are indeed t-distributed? How good is a rank sum test in situations with all kinds of distributional shapes? Against what kind of alternatives are the various tests for normality good?

Christian: Good point, it’s the idea of looking at the properties of a procedure under various assumptions about the data generation. I’m not sure what Stephen wants to say about the priority of the likelihood, when it comes to the question you raise.