Don’t Birnbaumize that Experiment my Friend*

(A)  “It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin[i], or else to embrace the strong likelihood principle which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained.  This is a false dilemma … The ‘dilemma’ argument is therefore an illusion”. (Cox and Mayo 2010, p. 298)

The “illusion” stems from the sleight of hand I have been explaining in the Birnbaum argument—it starts with Birnbaumization.

(B) A reader wrote in that he awaits approval of my argument by either Sir David Cox or Christian Robert; I cannot vouch for Robert, unless he has revised his first impression in his October 6, 2011 blog (as I hope he has). For in that blog post Robert says:

“If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidate Birnbaum’s proof.”

I am taking sampling distributions into account because Birnbaum’s “proof” is supposed to be relevant for a sampling theorist!   If it is not relevant for a sampling theorist (my error statistician) then there is no breakthrough and there is no special interest in the result (given that Bayesians already have the LP, as do the likelihoodists).[ii] It is only because principles that are already part of the sampling theorist’s steady diet are alleged to entail the LP (in Birnbaum’s argument) that Savage declared that, once made aware of Birnbaum’s result, he doubted people would stop at the LP appetizer, but would instead go all the way to consuming the full Bayesian omelet!

Robert’s remark is just the tip of the iceberg that reveals a deep misunderstanding of sampling theory.  (Although I prefer error statistics, I will use sampling theory for this post.)   Even if Robert has corrected himself, as I very much hope he has, other readers may be under the same illusion. I had paused to clarify this point in my October 20, 2011 post.

(C) Likelihood Principle Violations

My Oct. 20 post was devoted to arguing that it is impossible to understand the whole issue without understanding how it is that frequentist sampling theory violates the LP.  That it does so is not a point of controversy, so far as I know:

As  Lindley (1971) stresses:

“.. sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley p. 436).

He means that, once the data are known, the sample space is irrelevant for appraisal.  (The LP already assumes the statistical model underlying the likelihood is given or not in question.)   Or, more recently, take Kadane (2011):

“Significance testing violates the Likelihood Principle, which states that, having observed the data, inference must rely only on what happened, and not on what might have happened but did not. The Bayesian methods explored in this book obey this principle” (Kadane, 439).

“Like their testing cousins, confidence intervals and sets violate the likelihood principle” (ibid. 441).

So it’s hard to see how Robert can really mean to say that sampling distribution considerations are irrelevant, when they are the heart and centerpiece of the Birnbaum argument. Far from being irrelevant, Birnbaum’s result is all about sampling distributions (even if addressed by someone who is not herself a sampling theorist!)
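
To make "LP violation pair" concrete, here is a minimal sketch using a standard textbook pair, which is not one of the examples discussed in this post: 9 heads and 3 tails, obtained either by tossing a fixed 12 times (E’) or by tossing until the 3rd tail (E”). The counts, the null value θ = 0.5, the one-sided alternative, and all function names are assumptions chosen for illustration. The two likelihoods are proportional in θ, yet the p-values differ because the sampling distributions differ.

```python
from math import comb

# Hypothetical illustration: 9 heads and 3 tails, testing theta = 0.5
# against theta > 0.5 (theta = probability of heads).
HEADS, TAILS = 9, 3


def lik_fixed_n(theta):
    """Likelihood under E': toss exactly HEADS + TAILS times (binomial)."""
    n = HEADS + TAILS
    return comb(n, HEADS) * theta**HEADS * (1 - theta)**TAILS


def lik_stop_at_tails(theta):
    """Likelihood under E'': toss until the TAILS-th tail (negative binomial)."""
    m = HEADS + TAILS - 1  # tosses before the final toss, which must be a tail
    return comb(m, TAILS - 1) * theta**HEADS * (1 - theta)**TAILS


def p_fixed_n():
    """P(at least HEADS heads in HEADS + TAILS tosses) when theta = 0.5."""
    n = HEADS + TAILS
    return sum(comb(n, k) for k in range(HEADS, n + 1)) / 2**n


def p_stop_at_tails():
    """P(at least HEADS heads before the TAILS-th tail) when theta = 0.5.

    Same event as: at most TAILS - 1 tails among the first
    HEADS + TAILS - 1 tosses.
    """
    m = HEADS + TAILS - 1
    return sum(comb(m, k) for k in range(TAILS)) / 2**m


if __name__ == "__main__":
    # The likelihood ratio is the same constant for every theta.
    for theta in (0.3, 0.5, 0.8):
        print(theta, lik_fixed_n(theta) / lik_stop_at_tails(theta))
    # The p-values are not the same.
    print(p_fixed_n(), p_stop_at_tails())
```

Under these assumptions the likelihood ratio is 4 for every θ, while the p-values are 299/4096 (about 0.073) for E’ and 67/2048 (about 0.033) for E”. A sampling theorist therefore reports different assessments for the two outcomes, which is exactly what the LP forbids.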

(D) Now to consider what Robert says in his post, with my remarks following:

Robert: “The core of Birnbaum’s proof is relatively simple: given two experiments E’ and E” about the same parameter θ with different sampling distributions and , such that there exists a pair of outcomes (y’, y”) from those experiments with proportional likelihoods, one considers the mixture experiment where E’ and E” are each chosen with probability ½.

Then it is possible to build a sufficient statistic T that is equal to the data (j,z), except when j=2 and z=y”, in which case T(j,z)=(1,y’).”

Mayo:  Put more informally, if y’ and y” is any LP violation pair (i.e., the two would yield different inferences/assessments of the evidence due to the difference in sampling distributions), then it is possible to “build” a statistic T for interpreting them such that y” (from E”) is always reported as y’ from E’.[iii] I called this Birnbaum’s statistic T-BB.[iv] It is possible, in short, to Birnbaumize the result (E’, y’) whenever there is an experiment E”, not performed, that could have resulted in y”, with a proportional likelihood (with the same parameter under investigation and the model assumptions granted).

Robert: “This statistic [T-BB] is sufficient”.

Mayo: Yes, T-BB is sufficient for an experiment that will report its inference based on the rules of Birnbaumization: The sampling distribution of T-BB is to be the convex combination of the sampling distributions of E’ and E” whenever confronted with an outcome that has an LP violation pair (for more details see posts from Dec. 6, 7, and references within).[v] Cox rightly questions even this first step, but I’m prepared to play along since the “proof” breaks down anyway.[vi]

It should be emphasized that in carrying out this Birnbaumization, one is not free from considering the accompanying sampling distribution (corresponding to the statistic T-BB just “built”): the Birnbaumization move depends on having a single sampling distribution (otherwise sufficiency would not apply)[vii].  
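
A minimal sketch of the two ingredients of Birnbaumization may help here. It assumes, as in the (p’ + p”)/2 report quoted in the comments, that the p-value is taken as the illustrative Ev. The function names `t_bb` and `p_bb`, the weight parameter `w`, and the numerical p-values are hypothetical and appear nowhere in Birnbaum or in Cox and Mayo.

```python
def t_bb(j, z, y1, y2):
    """Birnbaum-style statistic T: report (2, y2) as (1, y1); otherwise report (j, z)."""
    if (j, z) == (2, y2):
        return (1, y1)
    return (j, z)


def p_bb(p1, p2, w=0.5):
    """p-value from the convex combination of the two sampling distributions.

    p1, p2: tail areas computed in E' and E'' separately.
    w: probability that the (imagined) coin selects E'.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be a probability")
    return w * p1 + (1 - w) * p2


if __name__ == "__main__":
    # Both members of the LP pair are reported identically.
    print(t_bb(1, "y'", "y'", 'y"'), t_bb(2, 'y"', "y'", 'y"'))

    p1, p2 = 0.02, 0.06  # made-up p-values, for illustration only
    print(p_bb(p1, p2))          # fair-coin mixture
    print(p_bb(p1, p2, w=0.25))  # a different mixture gives a different report
```

The sketch shows two things under these assumptions: T-BB carries a single (merged) sampling distribution with it, and the report it yields moves with the mixing weight, which is the sense in which footnote [v] calls the construction ill-defined.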

While Robert switches our InfrE(z) notation (Cox and Mayo 2010) to Birnbaum’s Ev(E, z), I will go ahead and leave it as Ev.[ix] InfrE was deliberately designed to be clearer, easier to read, and less likely to hide the very equivocation that is overlooked in this example.

Robert observes:

Whether j = 1 or j = 2,  Ev(E-BB, (j, z)) = Ev(E-BB, T(j,z))

This corresponds to my premise (1):
(1) InfrE-BB(E’, y’) = InfrE-BB(E”, y”)

In the relevant case, y’ and y” form an LP violation pair, since only such pairs pose a threat to obeying the LP.  So we can focus just on those in this note. In Mayo 2010 I used the * to indicate an outcome is part of an LP violation pair.

(E)  Next Robert gives premise (2), though he switches the order: this corresponds to two applications of weak conditionality (WCP) [combining my 2a and 2b]:

(2) Whether j = 1 or j = 2, Ev(E-BB, (j, z)) = Ev(Ej, z)

The key issue concerns a quote from me (with Robert’s substitutions of Ev for Infr).  Note, by the way, that Robert is alluding to my chapter in Mayo 2010, not the short version that I posted on this blog on Dec. 6 and 7.

Robert: “Now, Mayo argues this is wrong because [it asserts that]:

 ‘[the mixed experiment E-BB] is appropriately identified with an inference from outcome yj based on the sampling distribution of Ej, which is clearly false'”.(p.310)
(continuing Robert’s quote of me):
“ ‘The sampling distribution to arrive at Ev(E-BB, (j, yj )) would be the convex combination averaged over the two ways that yj could have occurred.  This differs from the sampling distributions of both Ev(E’, y’) and Ev(E”, y”)’. This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point.” (Robert, Oct. 6, 2011 post, p.310)

Mayo: I am not at all rejecting the WCP. The passage Robert quotes merely states the obvious; namely, the assertion: the inference computed using the sampling distribution of E-BB is identical to the inference using the sampling distribution of E’ by itself (or E” by itself)—is false!  If we are playing Birnbaumization, then the appropriate sampling distribution is the convex combination. (In the section from which Robert is quoting, a reader will note, I have put Birnbaum’s argument in valid form.)

But wait a minute, just a few lines later it turns out Robert does not deny my claim!  He repeats it as obviously true ... but suddenly it has become irrelevant.

Robert: “Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for y’ and y” Not the [sampling?] distribution of this inference” (Robert, p. 310).

Mayo: What? This just makes no sense. There is no inference apart from the sampling distribution for a sampling theorist. One cannot assume there is somehow an inference apart from the sampling distribution. Sampling theory has simply not been understood.  Robert’s own rendition of the argument [my Premise 1], depends on a merged sampling distribution, thanks to Birnbaumization; it certainly does not ignore sampling distributions.  So I’m afraid I don’t know what Robert is talking about here.  (This same point arose in the discussion by Aris Spanos when Robert’s post first appeared.)

Robert will go on to deny there are any LP counterexamples, because they all turn on pointing up the difference in sampling distributions!  All I can do at this point is go back to where I began:  listen to Birnbaum, Kadane, Lindley, Savage and everyone else who has discussed the (uncontroversial) fact that error statistics violates the LP!  No one would be claiming sampling theory was incoherent were it not that it is prepared to reach different inferences from y’, y” despite their having proportional likelihoods (i.e., despite the conditions for the LP being met), and it does so solely because of a difference in sampling distributions.[viii]

Kadane, J. (2011), Principles of Uncertainty, CRC Press.
Mayo: 10/20/2011 Post: blogging-likelihood-principle-2
* The title is a distant analogue to that song “Don’t Bogart that chalk my friend, pass it on to me”.

[i] This refers to a mixture experiment where the outcome of a fair coin toss determines whether to use a highly precise or a highly imprecise instrument (Cox and Mayo 2010, pp. 295-6).
[ii] But whether Bayesians should care and even greet my critique with a sigh of relief (given that they are nowadays inclined to reject the LP), is a distinct issue.
[iii] If your outcome is not part of a pair that would be an LP violation, forget the imaginary mixture and just report it in the usual way with its regular sampling distribution.
[iv] Abbreviation (1,y’) is just another way to write (E’, y’)—that is, the coin flip outcome directs you to perform E’ and y’ is the resulting outcome.
[v] Note however that the “mixture” in this “Birnbaumization” could as well have had j = 1 with probability ¼ and j=2 with probability ¾, or any other assignments to the outcomes summing to 1—so it is still ill-defined. I don’t think there is any warrant for actually interpreting one’s actual data using the results of a Birnbaumization game. I’m playing along for purposes of showing the argument still fails at the next step.
[vi] I deliberately describe Birnbaumization so that it is possible to perform the experiment, even though it isn’t a genuine mixture experiment.
[vii] That is why sufficiency is considered the “weak likelihood principle”.
[viii] Along with satisfying the other stipulations of the antecedent to the strong LP.
[ix] In referring to an inference from y in a sampling theory experiment E by means of the abbreviation InfrE(y), we assume, for simplicity, that packed into E would be the probability model, parameters, and the sampling distribution corresponding to the inference in question. We prefer it because it underscores the need to consider the associated methodology and context. Birnbaum construes Ev(E, x) as “the evidence about the parameter arising from experiment E and result x” and allows it to range over the inference, conclusion or report, including p-values, confidence intervals and levels, posteriors. So our notation accomplishes the same, but with (hopefully) less chance of equivocations.

16 thoughts on “Don’t Birnbaumize that Experiment my Friend*”

  1. John Byrd

    I apologize for being simplistic, but I am hung up on fundamentals. Please help me with this basic question. If the LP does not hold, and the likelihood is supposed to contain “all information that the experiment has to offer” for a Bayesian inference, then how can I have any faith/confidence in the posterior probability? In other words, what tells me that the numeric value of the likelihood is measuring exactly what I am interested in? I thought the LP and Law of L gave me that grounding. It seems that having some principle is necessary before taking the singular value of the posterior seriously.

    • My understanding is that the LP holds for Bayesian inference. We error statisticians, by contrast, violate (happily!) the LP.
      But the current post was not about the advantages of violating or not violating the LP, it was just about my criticism of Birnbaum’s argument that certain frequentist principles lead even a frequentist to accept the LP.
      More specifically, it was just about whether Christian Robert’s rejection of my criticism (in his blog) holds up. I had promised to get back to that, and so I have.
      Without pretending there is a grand plan, there is a step by step cyclical movement to the blog—go back, or catch it on the next round (e.g., perhaps when Robert responds).

      • John Byrd

        I will watch with interest as the blog develops. All of this is thought-provoking.

  2. Eileen

    *The actual title to that song refers to “that joint” not “that chalk”. 🙂

  3. Finally getting to comment on those comments: when you mention “the sampling distributions of both Ev(E’, y’) and Ev(E”, y”)”, what do you mean? The sampling/frequentist distribution of the random quantities Ev(E’, y’) and Ev(E”, y”)? Or the sampling distribution that produced the y’ we observed and/or the y” that we observed?

    • I just noticed this! I am referring to the sampling distribution associated with the given experiment. I thought I corrected “of” to “associated with”. Birnbaum includes p-values as among all the various things that Ev can equal. He’s very clear that it is intended to be entirely general. So, to have an illustration, I chose the p-value.

  4. (continued) How do you define the p-value in the optional stopping experiment? What is your exact stopping rule? And what is your observation?

    • I imagined here, simply for illustration, that one stopped at 100. I am only taking the very example used hundreds of times, over and over, in discussions that refer to optional stopping as a dramatic example of how error statistics violates the SLP. (It is used as a criticism, of course, whereas we error statisticians think it’s correct to take it into account.) For calculations, see EGEK (from Armitage) and Mayo and Kruse. Of course ANY violation of the SLP will do for an illustration, and that there are SLP violations (in sampling theory statistics) is not in dispute (i.e., however you wish to compute p” in optional stopping, and it depends on the particular rule, it doesn’t matter so long as p’ is not equal to p”). Were there no SLP violations in frequentist sampling statistics, then frequentist statistics would obey the SLP!!!! (traveling in a taxi, sorry for haste!)

  5. Ok, so having clarified what you mean by “the sampling distributions of both Ev(E’, y’) and Ev(E”, y”)”, I can hopefully clarify my position and criticism: I agree that sampling distributions are important for drawing inferences, so this criticism of mine was not about the use of the sampling distribution at any point. I do need the sampling distribution to find the likelihood at the observed realisation. However, when re-reading your argument of page 312, I do not see further than a rejection of the conditionality principle. You write that it is “obvious”, that inferences computed using different sampling distributions cannot be identical, but this position simply cancels any applicability to the conditionality principle. If not, can you post an example where it applies?

  6. Christian: I can’t see why you’re stuck at the point of (wondering about what is) essentially a tautology: The problem stems from imagining that one is BOTH to condition and not to condition at the same time. One is not canceling conditioning in general.

    Let me take one of the analogies I give in my Dec. 25, 2011 post, despite the fact that I have found that, for some reason, analogies do not seem to enjoy their usual force* among statisticians, and despite my having just arrived in London with scant sleep:
    Think of
    ‘computing a p-value by conditioning’
    as analogous to
    ‘computing the amount of tax owed by a married person filing singly’.
    Birnbaumization is roughly akin to filing jointly: I don’t have to know which spouse to know what they owe in taxes, since it’s the same.

    EXAMPLE #3:
    Let me rearrange/flesh out premises a bit.

    0. Deborah and George are a married couple in the U.S.

    1. For any married couple (x,y) if filing federal taxes jointly (in the U.S.), then x and y have the same tax liability; namely, the amount in the “married filing jointly” column.

    If Deborah and George file jointly, then Tax$ owed by Deborah = Tax $ owed by George.

    2. If a married couple in the U.S. does not file jointly but each files separately, then each owes the amount in the “married, filing separately” column.

    2a. If Deborah files separately, then Tax$ owed by Deborah equals $d.

    2b. If George files separately, then Tax$ owed by George equals $g.

    Conclusion: d = g

    You may assume these premises are true**, e.g., that d and g are dollar numbers given in the respective “married filing separately” columns. Never mind deductions or the like.

    However, it is easy to see that the conclusion may be false. In fact, let us stipulate (which is true) that if they each file separately, d is NOT equal to g: their tax under “married filing separately” differs, since their individual incomes differ.

    Now since the premises are true and the conclusion is false the argument is invalid. The key terms “tax$ owed by Deborah”, and “Tax$ owed by George” shift in meaning in the argument, because in one case it refers to jointly filing, in another to filing singly. One is not denying there is such a thing as married filing singly simply because one insists one cannot compute taxes both ways at the same time and expect identities to go through!

    OK, you will say I’ve made it vastly more complicated: can you come to London, LSE Wed March 8 around noon? I hope to talk extremely informally with whoever shows up, about the Birnbaum argument and the SLP, and related issues. I’ll give you the place if you can make it.

    *The use of analogies to critique arguments was essentially my rule #1 for this blog. There was no rule #2 (yet).
    **This is akin to the version of the Birnbaum argument for the SLP where the premises are allowed to be true. But then the terms shift meanings, and so the argument is deductively invalid.

    A valid argument (of the pattern the Birnbaum argument WANTS to follow) is:
    1. A = B
    2. A = d
    3. B = g
    Therefore, d = g
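
    The shift in meaning can be sketched in a few lines of code. This is only an illustration of the equivocation in the tax example, under made-up assumptions: the dollar amounts, the table layout, and the function name `tax_owed` are all hypothetical. The point is that "tax owed by Deborah" is not one term A but a quantity that depends on the filing status under which it is computed.

```python
# Hypothetical dollar amounts, chosen only for illustration.
TAX_TABLE = {
    "joint": {"Deborah": 30_000, "George": 30_000},
    "separate": {"Deborah": 18_000, "George": 13_000},
}


def tax_owed(person, filing):
    """Tax owed by `person`, computed under the given filing status."""
    return TAX_TABLE[filing][person]


# Premise 1 (filing jointly): both spouses owe the same amount.
premise_1 = tax_owed("Deborah", "joint") == tax_owed("George", "joint")

# Premises 2a and 2b (filing separately): these define d and g.
d = tax_owed("Deborah", "separate")
g = tax_owed("George", "separate")

# The purported conclusion.
conclusion = d == g

# The "A" of premise 1 is not the "A" of premise 2a.
same_term = tax_owed("Deborah", "joint") == tax_owed("Deborah", "separate")

if __name__ == "__main__":
    print(premise_1, conclusion, same_term)
```

    With these assumed figures, premise 1 holds and premises 2a and 2b hold by definition, yet d differs from g. The valid pattern above would require the same A in premises 1 and 2, and `same_term` shows that it is not the same.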


  7. Hmmm… I am now even more confused!!! Alas, I cannot jump to London on Wednesday as I have a visitor coming, quite sorry to miss this opportunity!

  8. That’s unfortunate (that you’re more confused). As for opportunities to talk, another one may arise in May-June, so keep it in mind. I will write separately.

  9. Xplntn: I am confused by the fact that you use an analogy rather than a statistical example where you would agree with the conditionality principle…

  10. Rafael Stern

    Is the expression “sufficiency principle” being used in different ways?

    I) In the sufficiency principle is stated as:

    Ev(E^{0},(j,x)) = Ev(E^{0},T(j,x))


    II) In

    “In a BB- experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E”, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x’, has an “LP pair”, call it x”, in some other experiment E”. In that case, a BB-experiment stipulates that you are to report x’ as if you had determined whether to run E’ or E” by flipping a fair coin.


    I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).


    BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know which experiment the observed 2-standard deviation difference actually came from, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.”


    Let T(j,x) correspond to reporting only the 2-standard deviation difference, and (j,x) to reporting the 2-standard deviation difference and the experiment it came from. If I did not mess up in the interpretation, the first two paragraphs in II) state that:

    III) Ev(E^{0},T(1,x’)) = Ev(E^{0},T(2,x”))

    for any x’ and x” with proportional likelihoods under E^{1} and E^{2}.

    The second paragraph contradicts the definition of sufficiency given in I) (Ev(E^{0},(j,x)) = Ev(E^{0},T(j,x)))

    If one agreed with the equality in I), knowing or not knowing which experiment the 2-standard deviation came from would make no difference. Since it does, I assume the definition of sufficiency in II) is III) and not I).

  11. In other words:

    A) My interpretation of Sufficiency Principle for I): For any experiment E and sufficient statistic, T, the inference drawn from knowing x should be the same as from knowing T(x).

    B) II) disagrees with this definition from I): It is a sufficient statistic to report only the 2 standard deviation distance in the Birnbaum Experiment. Nevertheless, the conclusion is not the same as the one which would be obtained after reporting the experiment which was performed and the 2 standard deviation distance (in this case, one would measure the evidence taking into consideration the sampling distribution of the performed experiment, that is, one would condition on the coin flip which is an ancillary statistic).

    C) My interpretation of Sufficiency Principle for II): Once the correct sampling distribution is chosen (possibly, after conditioning on the relevant ancillary statistics) then, if T is sufficient, the inference drawn from a sample x is the same as the one from T(x).

    Is this right?
    Thanks, and sorry for possibly writing wrong interpretations for any one of the two positions.
