In recognition of R.A. Fisher’s birthday today, I’ve decided to share some thoughts on a topic that has so far been absent from this blog: Fisher’s *fiducial probability*. **Happy Birthday Fisher.**

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D. R. Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.

So what’s *fiducial inference*? I follow Cox (2006), adapting for the case of the lower limit:

We take the simplest example,…the normal mean when the variance is known, but the considerations are fairly general. The lower limit, [with Z the standard Normal variate, and M the sample mean]:

M_{0} – z_{c}σ/√n

derived from the probability statement

Pr(μ > M – z_{c}σ/√n) = 1 – c

is a particular instance of a hypothetical long run of statements a proportion 1 – c of which will be true, assuming the model is sound. We can, at least in principle, make such a statement for each c and thereby generate a collection of statements, sometimes called a *confidence distribution*. (Cox 2006, p. 66).

For Fisher it was a *fiducial distribution*. Once M_{0} is observed, M_{0} – z_{c} σ/√n is what Fisher calls the *fiducial c per cent limit* for *μ.* Making such statements for different c’s yields his *fiducial distribution*.
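To make the arithmetic concrete, here is a minimal Python sketch (not anything Fisher or Cox wrote) that computes the limits M_{0} – z_{c} σ/√n for a handful of c’s. The values of M_{0}, σ and n, and the function name `fiducial_limit`, are hypothetical choices made only for illustration. Numerically, the collection of limits coincides with the quantiles of a Normal distribution centered at M_{0} with standard deviation σ/√n; how that collection should be interpreted is a separate matter, and is exactly what is in dispute.

```python
from statistics import NormalDist


def fiducial_limit(m0, sigma, n, c):
    """Return M_0 - z_c * sigma / sqrt(n), where z_c is the upper-c point
    of the standard Normal, i.e. Pr(Z > z_c) = c."""
    z_c = NormalDist().inv_cdf(1 - c)
    return m0 - z_c * sigma / n ** 0.5


# Hypothetical observed mean, known sigma, and sample size.
m0, sigma, n = 10.0, 2.0, 25
se = sigma / n ** 0.5

levels = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

# One limit per c: the "collection of statements" in the passage above.
limits = {c: fiducial_limit(m0, sigma, n, c) for c in levels}

# The same numbers, read off as quantiles of Normal(M_0, sigma / sqrt(n)).
normal_quantiles = {c: NormalDist(m0, se).inv_cdf(c) for c in levels}

for c in levels:
    print(f"c = {c:.2f}  limit = {limits[c]:.4f}  quantile = {normal_quantiles[c]:.4f}")
```

The agreement between the two columns is just algebra: the c-quantile of Normal(M_{0}, σ/√n) is M_{0} – z_{c} σ/√n. Nothing in the computation settles whether a probability may be attached to μ itself.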

In Fisher’s earliest paper on fiducial inference in 1930, he sets 1 – c as .95. Start from the significance test of *μ* (e.g., *μ* < *μ*_{0} vs. *μ* > *μ*_{0}) with significance level .05. He defines the *95 percent value of the sample mean M*, M_{.95}, such that in 95% of samples M < M_{.95}. In the Normal testing case, M_{.95} = *μ*_{0} + 1.65σ/√n. Notice M_{.95} is the cut-off for rejection in a .05 one-sided test T+ (of *μ* < *μ*_{0} vs. *μ* > *μ*_{0}).

We have a relationship between the statistic [M] and the parameter μ such that M_{.95} is the 95 per cent value corresponding to a given μ. This relationship implies the perfectly objective fact that in 5 per cent of samples M > M_{.95}. (Fisher 1930, p. 533; I use μ for his θ, M in place of T).

That is, Pr(M < μ + 1.65σ/√n) = .95.

The event M > M_{.95} occurs just in case *μ*_{0} < M − 1.65σ/√n.[i]

For a particular observed M_{0} , M_{0} − 1.65σ/√n is the *fiducial 5 per cent value* of μ.

We may know as soon as M is calculated what is the fiducial 5 per cent value of μ, *and that the true value of μ will be less than this value in just 5 per cent of trials.* This then is a definite probability statement about the unknown parameter μ which is true irrespective of any assumption as to its a priori distribution. (Fisher 1930, p. 533, emphasis is mine).

This seductively suggests that *μ* < *μ*_{.05} gets the probability .05! But we know we cannot say that Pr(*μ* < *μ*_{.05}) = .05.[ii]

However, Fisher’s claim that we obtain “a definite probability statement about the unknown parameter *μ*” can be interpreted in another way. There’s a kosher probabilistic statement about the pivot Z; it’s just not a probabilistic assignment to a parameter. Instead, a particular substitution is, to paraphrase Cox, “a particular instance of a hypothetical long run of statements 95% of which will be true.” After all, *Fisher was abundantly clear that the fiducial bound should not be regarded as an inverse inference to a posterior probability.* We could only obtain an inverse inference, Fisher explains, by considering *μ* to have been selected from a superpopulation of *μ*’s with known distribution. But then the inverse inference (posterior probability) would be a deductive inference and not properly inductive. Here, Fisher is quite clear, the move is *inductive*.

People are mistaken, Fisher says, when they try to find priors so that they would match the fiducial probability:

In reality the statements with which we are concerned differ materially in logical content from inverse probability statements, and it is to distinguish them from these that we speak of the distribution derived as a *fiducial frequency distribution*, and of the working limits, at any required level of significance, …. as the *fiducial limits* at this level. (Fisher 1936, p. 253).

So, what is being assigned the fiducial probability? It is, Fisher tells us, the “aggregate of all such statements…” Or, to put it another way, it’s the *method* of reaching claims to which the probability attaches. Because M and S (using the Student’s T pivot) or M alone (where σ is assumed known) are *sufficient statistics*, “we may infer, without any use of probabilities a priori, a frequency distribution for *μ* which shall correspond with the aggregate of all such statements … to the effect that the probability that *μ* is less than M – 1.65σ/√n is .05.” (Fisher 1936, p. 253)[iii]

Suppose you’re Neyman and Pearson aiming to clarify and justify Fisher’s methods.

“I see what’s going on,” we can imagine Neyman declaring. There’s a method for outputting statements that take the general form

*μ* > M – z_{c}σ/√n

Some would be in error, others not. The method outputs statements with a probability of 1 – c of being correct. The outputs are instances of a general form of statement, and the probability alludes to the relative frequency with which they would be correct, as given by the chosen significance or fiducial level c. Voila! “We may look at the purpose of tests from another viewpoint,” as Neyman and Pearson (1933) put it. Probability qualifies (and controls) the *performance* of a method.
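The performance reading can be checked by simulation. What follows is a rough Python sketch with made-up values for μ, σ and n and hypothetical function names (`lower_limit`, `coverage`); it assumes the simple Normal model with known σ used throughout this post. It repeatedly draws a sample, outputs the statement μ > M – z_{c}σ/√n, and records how often the statement is true.

```python
import random
from statistics import NormalDist


def lower_limit(sample_mean, sigma, n, c):
    """Lower limit M - z_c * sigma / sqrt(n), with Pr(Z > z_c) = c."""
    z_c = NormalDist().inv_cdf(1 - c)
    return sample_mean - z_c * sigma / n ** 0.5


def coverage(mu, sigma, n, c, trials, seed=0):
    """Fraction of simulated samples for which 'mu > lower limit' is true.

    mu is known here only because we are simulating; the method itself
    never uses it to form the limit.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        if mu > lower_limit(m, sigma, n, c):
            correct += 1
    return correct / trials


if __name__ == "__main__":
    # Hypothetical settings: true mean 10, known sigma 2, samples of size 25.
    rate = coverage(mu=10.0, sigma=2.0, n=25, c=0.05, trials=20000)
    print(f"proportion of true statements: {rate:.3f} (nominal 1 - c = 0.95)")
```

Under these assumptions the long-run proportion should sit close to 1 – c. Each individual output, with a particular observed M plugged in, is simply true or false; the 1 – c attaches to the method that generates the statements.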

There is leeway here for different *interpretations* and *justifications* of that probability, from actual to hypothetical performance, and from behavioristic to more evidential–I’m keen to develop the latter. But my main point here is that in struggling to extricate Fisher’s fiducial limits, without slipping into fallacy, Neyman and Pearson are led to the N-P performance construal. Is there an efficient way to test hypotheses based on probabilities? ask Neyman and Pearson in the opening of the 1933 paper.

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong (Neyman and Pearson 1933, pp. 141-2/290-1).

At the time, Neyman thought his development of confidence intervals (in 1930) was essentially the same as Fisher’s fiducial intervals. Fisher’s talk of assigning fiducial probability to a parameter, Neyman thought at first, was merely the result of accidental slips of language, altogether expected in explaining a new concept. There was evidence that Fisher accepted Neyman’s reading. When Neyman gave a paper in 1934 discussing confidence intervals, seeking to generalize fiducial limits, but making it clear that the term “confidence coefficient” is not synonymous to the term probability, Fisher didn’t object. In fact he bestowed high praise, saying Neyman “had every reason to be proud of the line of argument he had developed for its perfect clarity. The generalization was a wide and very handsome one,” the only problem being that there wasn’t a single unique confidence interval, as Fisher had wanted (for fiducial intervals).[iv] Slight hints of the two in a mutual admiration society are heard, with Fisher demurring that “Dr Neyman did him too much honor” in crediting him for the revolutionary insight of Student’s T pivot. Neyman responds that of course in calling it Student’s T he is crediting Student, but “this does not prevent me from recognizing and appreciating the work of Professor Fisher concerning the same distribution.” (Fisher comments on Neyman 1934, p. 137). For more on Neyman and Pearson being on Fisher’s side in these early years, see Spanos’s post.

So how does this relate to the current consensus view of Neyman-Pearson vs Fisher? Stay tuned.[v] In the meantime, *share your views.*

The next installment is here.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] (μ < M – z_{c} σ/√n) iff M > M_{(1 – c)}, that is, iff M > μ + z_{c} σ/√n.

[ii] In terms of the pivot Z, the inequality Z > z_{c} is equivalent to the inequality

μ < M – z_{c} σ/√n

“so that this last inequality must be satisfied with the same probability as the first.” But the fiducial value replaces M with M_{0} and then Fisher’s assertion

Pr(μ > M_{0} – z_{c} σ/√n) = 1 – c

no longer holds. (*Fallacy of probabilistic instantiation*.) In this connection, see my previous post on confidence intervals in polling.

[iii] If we take a number of samples of size n from the same or from different populations, and for each calculate the fiducial 5 percent value for μ, then in 5 per cent of cases the true value of μ will be less than the value we have found. There is no contradiction in the fact that this may differ from a posterior probability. “The fiducial probability is more general and, I think, more useful in practice, for in practice our samples will all give different values, and therefore both different fiducial distributions and different inverse probability distributions. Whereas, however, the fiducial values are expected to be different in every case, and our probability statements are relative to such variability, the inverse probability statement is absolute in form and really means something different for each different sample, unless the observed statistic actually happens to be exactly the same.” (Fisher 1930, p. 535)

[iv] Fisher restricts fiducial distributions to special cases where the statistics exhaust the information. He recognizes “The political principle that ‘Anything can be proved with statistics’ if you don’t make use of all the information. This is essential for fiducial inference”. (1936, p. 255). There are other restrictions to the approach as he developed it; many have extended it. There are a number of contemporary movements to revive fiducial and confidence distributions. For references, see the discussants on my likelihood principle paper.

[v] For background, search Fisher on this blog. Some of the material here is from my forthcoming book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP).*

Cox, D. R. (2006), *Principles of Statistical Inference*. Cambridge.

Fisher, R.A. (1930), “Inverse Probability,” *Mathematical Proceedings of the Cambridge Philosophical Society*, 26(4): 528-535.

Fisher, R.A. (1936), “Uncertain Inference,” *Proceedings of the American Academy of Arts and Sciences* 71: 248-258.

Lehmann, E. (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” *Journal of the American Statistical Association* 88 (424): 1242–1249.

Neyman, J. (1934), “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” *Early Statistical Papers of J. Neyman*: 98-141. [Originally published (1934) in *The Journal of the Royal Statistical Society* 97(4): 558-625.]

At university my first exposure to statistical thinking was delivered by Maurice Quenouille, who was then Professor of Statistics at Southampton University in the UK. In Quenouille’s STAT101 course we never got so far as to discuss advanced topics such as Fisher’s fiducial theory, and I doubt many of us at the time would have appreciated the distinction between it and Neyman’s confidence interval theory.

Whilst Quenouille will probably be remembered as the developer of the jackknife, he had wide interests, writing books on multivariate analysis, time series, design and analysis of experiments, short-cut methods and a short book on statistical inference. In this last book, first published in 1958*, Quenouille argued that the close connection between confidence intervals and hypothesis tests – the former indicating whether the observed data are compatible with given hypotheses – means that they are to be used in testing significance. On the other hand, fiducial intervals provide a range of hypotheses compatible with the observed data and “are therefore used in setting limits of estimation”.

This distinction is fine and is no more than the distinction between pre- and post-data procedures or, as Savage** put it, “a bold attempt to make the Bayesian omelet without breaking the Bayesian egg”.

* Maurice H Quenouille. Fundamentals of Statistical Reasoning. Charles Griffin and Company, London, 1958.

** LJ Savage. The Foundations of Statistics Reconsidered. In Proc 4th Berkeley Symposium 1961; 1:575-585.


Andy: I don’t get the two points you make, which are very different. That is, the parameter values in the confidence intervals are those non-rejectable values in the corresponding significance tests. So intervals can be, and often are, used as tests. The issue of assigning a fiducial probability to a parameter–the Bayesian omelette w/o breaking eggs–is just a mistake. But assigning the probability to the estimation method–in the sense that the method outputs statements with a probability of 1 – c of being correct–is not.

Andy: You might be interested to see what Fisher says connecting N-P tests and estimation: https://errorstatistics.com/2015/02/16/r-a-fisher-two-new-properties-of-mathematical-likelihood-just-before-breaking-up-with-n-p/

Even supposing that Neyman and Fisher had never had their dispute over Neyman’s course text, they would have disagreed (perhaps less acrimoniously in this counterfactual than in reality) about Fisher’s preferred solution to the Behrens-Fisher problem. Fisher’s fiducial probabilities in that setting are obtained after conditioning on ancillary statistics and thus do not have a constant coverage rate (i.e., they don’t satisfy Neyman’s definition of a confidence interval). Mayo, I believe you’d fall on the Neyman side of the fence in that case…?

Corey: Not sure where the course text disagreement enters this post, but sure they had different preferred ways of modeling various questions. Neither had firm principles that gave uniquely preferred answers in all cases, so what?

I’m mainly responding to

“Fisher’s talk of assigning fiducial probability to a parameter, Neyman thought at first, was merely the result of accidental slips of language, altogether expected in explaining a new concept. There was evidence that Fisher accepted Neyman’s reading. When Neyman gave a paper in 1934 discussing confidence intervals, seeking to generalize fiducial limits, but making it clear that the term “confidence coefficient” is not synonymous to the term probability, Fisher didn’t object. In fact he bestowed high praise… Slight hints of the two in a mutual admiration society are heard…”

Fisher was all about optimal use of information for inference, so he didn’t scruple to condition on ancillaries even though the resulting procedures did not have long-run sampling frequency guarantees (and if I understand correctly, said guarantees are a necessary condition for reliable inference procedures in your view). So even if they hadn’t had that clash over the textbook, they might have ended up not admiring each other’s approaches quite so much.

Corey: I don’t want to claim their fights were only over a textbook, so I can agree there. My story is unfolding in a few steps (at least it does in my book). At this step we have Fisher describing fiducial probability as applying to an aggregate of statements. It’s because he says it affords assurances about such aggregates that he uses the word probability. Various further requirements are needed (which he hoped N could help him remove), and no recognizable subsets.

Corey: look at pp 533-4 in the Fisher paper I cite: https://errorstatistics.files.wordpress.com/2016/02/fisher-1930-inverse-probability.pdf

Yes, at this point in the development of the fiducial approach, fiducial intervals look a lot like confidence intervals based on pivots; naturally the valid frequency statements that can be stated about such intervals were obvious to Fisher and did not pass without notice. But as soon as he started looking at problems with an interest parameter and a nuisance parameter for which an ancillary statistic exists, frequency statements went out the window.

(I’m still not clear on what *you* think is the right way to deal with recognizable subsets.)

Corey: Fisher never could have sent frequencies out the window because he was a frequentist til the end. Whether they entitled statements about “rational disinclination” or, for that matter, well-testedness, are separate questions, as is the question of whether performance is all that matters (it isn’t), and whether relevant frequencies may require conditioning (they do, but not everyone would call it that).

Please look at Lehmann’s discussion of the cases you’re on about (in the linked paper): where all sides evoke slightly ad hoc considerations. Only someone seeking a mechanical method, and supposing a single error probability was all important, could consider the slight differences as warranting greater weight than Lehmann gives them. (I’d forgotten he dealt with all of those examples here, so I thank the Elbians for linking it.)

On your last, surely you know what I say about the famous howlers, e.g., Cox and Mayo (2010), “Objectivity and Conditioning”. Again, I think Lehmann’s remarks are very appropriate*. He says when (average) power conflicts with conditioning, one needs to look at context, and in scientific cases would condition. Further, he sees p-values (he calls them significance probabilities!), alpha, power, & conditioning as open to unification by considering context & frame of reference. The fact that he didn’t work out just how to achieve this is a separate matter.

*even though in a way I blame him for the ultra decision-theoretic formulation he gives to tests in his famous textbook.

> alpha, power, & conditioning as open to unification by considering context & frame of

> reference … he didn’t work out just how to achieve this

Some argue that the unknown answer is the Holy Grail of statistical inference – http://arxiv.org/pdf/1510.08539v1.pdf

Figure 8 might be of interest.

Keith O’Rourke

Thanks for sending me this interesting paper—I think I heard him present a portion of this at the Boston colloquium for philsci in 2014.

https://errorstatistics.com/2014/01/29/boston-colloquium-for-philosophy-of-sciencerevisiting-the-foundations-of-statistics/

Interesting to spot the Cox 58, “two weighing machines” example, and lots of good attempts to array different views. There’s a lot in it that I’ll read carefully later. (As it happens I wrote to Meng yesterday, because he asked me at that conference how severity related to fiducial.)

One needs to distinguish, in the medical cases he discusses, wanting general theory, and wanting a diagnosis/prediction for an individual. It may well be that statistical inference will one day be avoidable in medicine by sufficient knowledge.

People ask about the appropriately narrow reference class for performance, but don’t ask what that performance information really tells you. Some kind of “rubbing off” construal is just assumed, which I think is too quick.

I only read this paper sketchily, but it strikes me as presenting a rather novel (to me) way of delineating statistical inference schools. “The difference between (subjective) Bayesian and Frequentist inference hinges on” the question of how narrow or individualized a reference class should be used. “While they are often thought to be two different methodologies, in fact they share the same logic, with the only difference being how they select the relevant subset of control problems”. (21) Is the shared logic that of “probabilism” or performance?

I don’t know if this position, and their delineation of views in a fascinating diagram on p. 24, follows from their focus on “direct inference” of an event, as in personalized medicine or diagnostic screening. Philosophers Reichenbach, Salmon, Kyburg and others debated the reference class problem, but all were frequentists, none were subjective Bayesians, though some favored more (or less) narrow reference classes as the basis for direct inference. Maybe it’s because the authors of the paper are allowing that the “prevalence” rates, or priors, have to be guessed. I’m curious as to what others, perhaps more familiar with this work, think?

“Fisher never could have sent frequencies out the window because he was a frequentist til the end.”

I didn’t say frequencies went out the window; I said frequency *statements* (by which I meant unconditional frequency guarantees) went out the window. More specifically, where Neyman’s “procedure such that we shall not be wrong more than alpha percent of the time” school of thought pretty much demands unconditional frequency guarantees, Fisher’s main concern was about what the information in the data indicated about the parameter of interest. For example, when considering association in contingency tables, Fisher asserts that the margins alone can’t tell us anything about association, so our inferences about association should be conditional on the observed marginal frequencies, as in Fisher’s exact test. Another example: in the Behrens-Fisher problem, the sample variances can’t tell us anything about the difference of the means, so Fisher’s fiducial inference for that difference is conditional on the observed values of the sample variances.

Corey: Or even more interesting perhaps – ratios from paired data (Fieller-Creasy) where Fisher flip-flopped on whether a fiducial approach was appropriate [DA Sprott Estimation of Ratios from Paired Data. 2001.]

What I find curious about this one is that if the data are strictly positive and you take a log transformation – a paired t-test (on the log scale) side steps the difficulties altogether.

Fisher “It is a noteworthy peculiarity of inductive inference that comparatively slight differences in the mathematical specification of a problem may have logically important effects on the inferences possible.”

Keith O’Rourke

Corey: As Lehmann points out, it is open to an N-P theorist to condition or not depending on context. And as I said earlier, in certain types of cases, neither set of principles avoids ad hockeries. It’s easy to critically evaluate why different types of questions/concerns yield different answers. More importantly, I still say that neither set of principles (Fisherian or N-P) affords a satisfactory account.

Corey: Now that the links are up, take a look at what Lehmann says on this: https://errorstatistics.files.wordpress.com/2014/08/lehm1993_1-theory-or-2.pdf

Aris Spanos reminds me that “recognizable subsets” were found even for Fisher’s favorite fiducial case, the Student’s t distribution. It was something like a year after he died. Whether their mere existence, as opposed to being ignorant of them, sufficed to preclude fiducial inference for Fisher, I can’t say. As I noted, there are developments these days resurrecting fiducial inference. Maybe it’s akin to one of the default Bayesian schools. Fortunately we don’t have to wrestle with these complexities. All I’m saying is that there’s a need to recognize when fiducial issues and statements are directly behind Fisher’s words and his debates; especially, we need to recognize how Fisher’s fiducial talk impinged on the early development of N-P methods. (Recall the quote from Cox at the outset about “Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals.”)