*In recognition of R.A. Fisher’s birthday tomorrow, I will post several entries on him. I find this (1934) paper intriguing: it was written immediately before the conflicts with Neyman and Pearson erupted. It represents essentially the last time he could take their work at face value, without the professional animosities that almost entirely caused, rather than being caused by, the apparent philosophical disagreements and name-calling everyone focuses on. Fisher links his tests, and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a very similar place (no pun intended) while starting from different origins. I quote just the most relevant portions; the full article is linked below. I’d blogged it earlier here. You may find some gems in it.*

**‘Two new Properties of Mathematical Likelihood’**

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H_{0} to that on the hypothesis H_{1} is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H_{0} bears to the likelihood of H_{1}, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number. In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T. For the test to be uniformly most powerful, moreover, these regions must be independent of θ, showing that the statistic must be of the special type distinguished as sufficient. Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)
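As a small illustration of the passage above (not taken from Fisher's paper), assume a sample x_1, …, x_n from a normal population with unknown mean θ and known unit variance, and compare H_{0}: θ = θ_0 with H_{1}: θ = θ_1, where θ_1 > θ_0. The normal model and the symbols θ_0, θ_1, k and c are assumptions made only for this sketch.

```latex
\frac{L(\theta_0)}{L(\theta_1)}
  = \exp\!\Big\{-\tfrac{1}{2}\sum_{i=1}^{n}\big[(x_i-\theta_0)^2-(x_i-\theta_1)^2\big]\Big\}
  = \exp\!\Big\{-n(\theta_1-\theta_0)\Big[\bar{x}-\tfrac{\theta_0+\theta_1}{2}\Big]\Big\},
\qquad
\frac{L(\theta_0)}{L(\theta_1)} < k
  \iff
\bar{x} > c .
```

The ratio depends on the sample only through the mean, which is both the sufficient statistic and the maximum likelihood estimate, and the contours (constant values of the sample mean) are the same for every θ_1 > θ_0. In this one-parameter case the same rejection region is therefore most powerful against each such alternative.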

You can read the full paper here.

Mayo – perhaps you could comment on the following passage in Fisher’s paper (relevant to my comments on the other recent posts):

“It is a matter of no great practical urgency, but of some theoretical importance, to consider the process of interpretation by which this loss can be recovered. Evidently, the simple and convenient method of relying on a single estimate will have to be abandoned. The loss of information has been traced to the fact that samples yielding the same estimate will have likelihood functions of different forms, and will therefore supply different amounts of information. When these functions are differentiable successive portions of the loss may be recovered by using as ancillary statistics, in addition to the maximum likelihood estimate, the second and higher differential coefficients at the maximum. In general we can only hope to recover the total loss, by taking into account the entire course of the likelihood function.”

Another related (albeit old) paper, quoting the key portion of the above, is: Efron and Hinkley (1978) ‘Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information’.

Omaclaren: This is one of those very mysterious passages and intriguing ideas I thought about a lot some years ago: the idea of recovering lost information through ancillaries. I know Cox worked on this. The idea is that there can be information about “precision” even with samples yielding the same estimates, and this may be recoverable via ancillaries. There are a number of standard examples (one in Cox and Mayo 2010) where you can see that one is in a recognizable subset with different error probabilities, or different amounts of information as measured by Fisher. I don’t think there was ever agreement about how to choose the best ancillary (if there’s more than one), but it’s been a while since I’ve studied the relevant papers.

Obviously I’m a bit of an outsider to this, but a naive reading to me would be an endorsement of some form of likelihood principle (perhaps not as sometimes/often interpreted?) – the totality of information from the sample is contained in the *full* likelihood function.

The important point to me is that one need not – and should not – just naively consider, say, the location of the maximum likelihood. One can also examine the ‘shape’ near this (or any other) estimate. So two log-likelihoods L1 and L2 may have L1′ = L2′ = 0 at the same location theta* (max likelihood estimate), but have very different second derivatives there (observed Fisher information or measures of ‘statistical curvature’ – see also Efron’s papers on this).

In fact standard methods of forming confidence intervals often use the observed Fisher information (minus the second derivative of the log-likelihood, −L″, evaluated at the maximum) to estimate the variance of the maximum likelihood estimate, right? See e.g. the Efron paper above and subsequent literature.
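A minimal sketch of this point, assuming a Cauchy location model with unit scale, chosen here only because its observed information varies from sample to sample. The two three-point samples and all function names are hypothetical, made up for illustration. Both samples give the same maximum likelihood estimate, yet their log-likelihoods have quite different curvature there.

```python
import math

def cauchy_loglik(theta, xs):
    """Log-likelihood for a Cauchy location model, unit scale (constants dropped)."""
    return -sum(math.log(1.0 + (x - theta) ** 2) for x in xs)

def observed_information(theta, xs):
    """Minus the second derivative of the log-likelihood, evaluated at theta."""
    return sum(
        2.0 * (1.0 - (x - theta) ** 2) / (1.0 + (x - theta) ** 2) ** 2
        for x in xs
    )

def grid_mle(xs, lo=-10.0, hi=10.0, steps=4001):
    """Crude maximum likelihood estimate by grid search (illustration only)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: cauchy_loglik(t, xs))

# Two hypothetical samples, both symmetric about 0, so both have MLE 0.
tight = [-0.5, 0.0, 0.5]
wide = [-3.0, 0.0, 3.0]

for name, xs in [("tight", tight), ("wide", wide)]:
    theta_hat = grid_mle(xs)
    info = observed_information(theta_hat, xs)
    # Usual large-sample approximation: se is about 1 / sqrt(observed information).
    se = 1.0 / math.sqrt(info)
    print(f"{name}: MLE = {theta_hat:.3f}, observed info = {info:.2f}, approx se = {se:.3f}")

# Expected Fisher information for this model is n/2, identical for both samples.
expected_info = len(tight) / 2.0
print(f"expected info (either sample) = {expected_info:.2f}")
```

Under these assumptions the observed information at the estimate comes out at about 3.92 for the tight sample and about 1.68 for the wide one, while the expected information is 1.5 for both. With only three observations the standard errors are crude, so this shows only the contrast between observed and expected information and is not a recommended interval procedure.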

This makes sense from a variety of perspectives: information about the variation of a function near a point is given by its Taylor series at that point. Fisher seems to be directly referring to this. Sure, the first derivative measures ‘relative support’ but the second derivative also supplies additional information about ‘variability/shape’. Here this information is about *the sample at hand based on the likelihood only*. If one is not satisfied with only local information then in general the full function can be inspected, as Fisher says. Edwards also seems to have followed this.

[Much more speculative, but perhaps in some sense a derivative may be interpreted as considering a (local) ‘counterfactual’? This is analogous to, say, a mechanical system (or other dynamical system) in which the location of equilibria is governed by the first derivative of the energy (a la Law of Likelihood) but the stability of equilibria by the second derivative (a la observed Fisher information). This can even be phrased in more counterfactual-like ‘variational’ or ‘virtual displacement’ terms in standard mechanics (or thermodynamics etc.) problems.

Perhaps ‘severity’ considerations are a call to examine these ‘higher order’ counterfactuals (derivatives/variations)? The relation of severity to a definite integral of the likelihood further suggests, extending the mechanical analogy too far perhaps, some sort of ‘action principle’ to go with the variational derivatives.]