Just as you keep up your physical exercise during the pandemic (*sure*), you want to keep up with mental gymnastics too. With that goal in mind, and given we’re just a few days from the New Year (and given especially my promised presentation for January 7), here’s one of the two simple examples that will limber you up for the puzzle to ensue. It’s the famous weighing machine example from Sir David Cox (1958)[1]. It is one of the “chestnuts” in the museum exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, 2018). So block everything else out for a few minutes and consider 3 pages from SIST …

**Exhibit (vi): Two Measuring Instruments of Different Precisions. SIST (pp. 170-173).** *Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?*

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

*Basis for the joke:* An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E_{1} or E_{2}, to use in observing a Normally distributed random sample **Z** to make inferences about mean *θ*. E_{1} has variance of 1, while that of E_{2} is 10^{6}. Any randomizing device used to choose which instrument to use will do, so long as it is irrelevant to *θ*. This is called a mixture experiment. The full data would report both the result of the coin flip and the measurement made with that instrument. We can write the report as having two parts: First, which experiment was run and second the measurement: (E_{i}, **z**), i = 1 or 2.

In testing a null hypothesis such as *θ* = 0, the same **z** measurement would correspond to a much smaller *P*-value were it to have come from E_{1} rather than from E_{2}: denote them as *p*_{1}(**z**) and *p*_{2}(**z**), respectively. The overall significance level of the mixture: [*p*_{1}(**z**) + *p*_{2}(**z**)]/2, would give a misleading report of the precision of the actual experimental measurement. The claim is that N-P statistics would report the average *P*-value rather than the one corresponding to the scale you actually used! These are often called the unconditional and the conditional test, respectively. The claim is that the frequentist statistician must use the unconditional test.
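To see how far apart the conditional and unconditional reports can be, here is a minimal numerical sketch. It goes beyond the excerpt in several labeled ways: it assumes a single measurement z, a one-sided test of θ = 0 against θ > 0, and a hypothetical observed value z = 3. The function names are illustrative only.

```python
import math

SIGMA_1 = 1.0      # instrument E1: variance 1
SIGMA_2 = 1000.0   # instrument E2: variance 10**6


def upper_tail(x):
    """P(Z >= x) for a standard Normal Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def conditional_p(z, sigma):
    """One-sided P-value for theta = 0, given the instrument actually used."""
    return upper_tail(z / sigma)


def unconditional_p(z):
    """Average of the two conditional P-values (fair coin: weight 1/2 each)."""
    return 0.5 * (conditional_p(z, SIGMA_1) + conditional_p(z, SIGMA_2))


z = 3.0  # hypothetical observed measurement (an assumption, not from the text)
p1 = conditional_p(z, SIGMA_1)
p2 = conditional_p(z, SIGMA_2)
p_mix = unconditional_p(z)
print(f"p1(z) = {p1:.4f}, p2(z) = {p2:.4f}, average = {p_mix:.4f}")
```

Under these assumptions the same z is impressive evidence if it came from E_{1} (a P-value near 0.001), nearly uninformative if it came from E_{2} (near 0.5), and the averaged value (near 0.25) describes neither instrument.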

Suppose that we know we have observed a measurement from E_{2} with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

Once it is known which E_{i} has produced **z**, the *P*-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):

The point essentially is that the marginal distribution of a *P*-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

**Weak Conditionality Principle (WCP):** If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about *θ* are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)

**Is There a Catch?**

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

I just noticed (12/29) that the classic Berger and Wolpert The Likelihood Principle is on-line. Here’s their description of the Cox (1958) example:

Note to the Reader:

The LP was a main topic for the first few years of this blog (2011-2014). That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in *Statistical Science*.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).

An intermediate paper is Mayo (2013).

Some authors are claiming to have new and improved proofs of the LP. The only problem is that the new attempts reiterate the same premises that render the initial argument circular, only with greater gusto–or so I will argue. Once an argument is circular, it remains so. Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system).

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of SIST Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.

Stay tuned for more later in the week.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word (at least in that journal) in the rejoinder.

**References **(outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.

Birnbaum, A. (1975). *Comments on Paper by J. D. Kalbfleisch*. Biometrika, 62 (2), 262–264.

Cox, D. R. (1958), “Some problems connected with statistical inference”, *The Annals of Mathematical Statistics*, 29, 357-372.

Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle,” paper with discussion and Mayo rejoinder, *Statistical Science* 29(2), pp. 227-239, 261-266.

Dear Deborah,

I have been studying your 2014 article for some time, but I still do not understand it. I agree with your conclusion, that the SLP is probably wrong, but I have different reasons than you for believing so. In fact, I think Birnbaum’s proof, or Berger and Wolpert’s version thereof, is deductively valid. That proof, however, does not make the SLP true, because that depends on the truth of the SP and WCP.

My main problem with your argument against Birnbaum’s proof is Birnbaumization. Why is that required? Neither Birnbaum nor Berger and Wolpert use it and, as far as I can tell, they do not require it. But if it is not required, that is, if EB in your article can be replaced by Emix, and WCP is assumed, SLP follows.

I do not think Birnbaumization is required for SP to be true. Consider a simple experiment in which a coin with unknown bias is flipped four times and the sequence ‘HHTH’ is observed. Are you saying that statistical inference has to be based on the output “three heads” rather than the actual sequence? Or that a repeat of the experiment, with outcome “HTHH”, might lead to a different inference? Why can’t I ignore sufficiency and just proceed with the outcomes as is? Yes, sufficiency simplifies my calculations, but is it necessary to get valid inferences?

SP is used in the proofs of SLP to demonstrate inferential equivalence between two possible distinct outcomes in the same experiment (a mixture experiment in fact). That equivalence exists, if I interpret SP correctly, even without Birnbaumization.

Leendert:

I think it’s clear you’ll need to come to my talk. The info for doing so is on the announcement.

A valid argument can still be circular. In fact, circular arguments are valid:

A/therefore, A.

They do use Birnbaumization, even though they don’t call it that. Thanks for your interest.

You were lucky, by the way, because first-time commentators, as a rule, are required by WordPress to have their first comment moderated, but yours was not (of course I would have approved it).

Dear Deborah,

I will try to attend your lecture, although I am not quite sure yet how to do that. By Zoom?

I reread the relevant section in Berger and Wolpert and I do not find anything that looks like Birnbaumization. They claim that x* and y*, as observed in the raw mixture experiments, are inferentially equivalent because of SP. You claim that they are only equivalent in EB, the Birnbaumized version of the raw mixture experiment. I do not understand that. All statements of the SP, including yours, say that two possible outcomes of an experiment are inferentially equivalent if there is some sufficient statistic that maps those two outcomes to the same value. Nobody, as far as I know, claims that the equivalence can only hold in experiments that have already been reduced by a (minimal) sufficient statistic. Or am I misunderstanding the reach of the Sufficiency principle?

Leendert:

The link for joining the Forum is on the January 7 link, or go to:

I’m not sure I understand your question, but I think you may be confusing the statement of the LP with the steps used to try to derive it. Since you’re looking at Berger and Wolpert, note on p. 27 that the two members of the LP pair are to give the same value of the sufficient statistic T. One outcome disappears, but it doesn’t matter if one is within the mixed experiment E* whose sampling distribution averages over both. Two turns into one.