(A) “It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin[i], or else to embrace the strong likelihood principle which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma … The ‘dilemma’ argument is therefore an illusion”. (Cox and Mayo 2010, p. 298)
The “illusion” stems from the sleight of hand I have been explaining in the Birnbaum argument—it starts with Birnbaumization.
(B) A reader wrote in that he awaits approval of my argument by either Sir David Cox or Christian Robert; I cannot vouch for Robert, unless he has revised his first impression in his October 6, 2011 blog (as I hope he has). For in that blog post Robert says:
“If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidate Birnbaum’s proof.”
I am taking sampling distributions into account because Birnbaum’s “proof” is supposed to be relevant for a sampling theorist! If it is not relevant for a sampling theorist (my error statistician) then there is no breakthrough and there is no special interest in the result (given that Bayesians already have the LP, as do the likelihoodists).[ii] It is only because principles that are already part of the sampling theorist’s steady diet are alleged to entail the LP (in Birnbaum’s argument) that Savage declared that, once made aware of Birnbaum’s result, he doubted people would stop at the LP appetizer, but would instead go all the way to consuming the full Bayesian omelet!
Robert’s remark is just the tip of the iceberg that reveals a deep misunderstanding of sampling theory. (Although I prefer error statistics, I will use sampling theory for this post.) Even if Robert has corrected himself, as I very much hope he has, other readers may be under the same illusion. I had paused to clarify this point in my October 20, 2011 post.
(C) Likelihood Principle Violations
My Oct. 20 post was devoted to arguing that it is impossible to understand the whole issue without understanding how it is that frequentist sampling theory violates the LP. That it does so is not a point of controversy, so far as I know:
As Lindley (1971) stresses:
“.. sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley p. 436).
He means that, once the data are known, the sample space is irrelevant for appraisal. (The LP already assumes the statistical model underlying the likelihood is given or not in question.) Or, more recently, take Kadane 2011:
“Significance testing violates the Likelihood Principle, which states that, having observed the data, inference must rely only on what happened, and not on what might have happened but did not. The Bayesian methods explored in this book obey this principle” (Kadane, 439).
“Like their testing cousins, confidence intervals and sets violate the likelihood principle” (ibid. 441).
So it’s hard to see how Robert can really mean to say that sampling distribution considerations are irrelevant, when they are the heart and centerpiece of the Birnbaum argument. Far from being irrelevant, Birnbaum’s result is all about sampling distributions (even if addressed by someone who is not herself a sampling theorist!)
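To make an LP violation pair concrete, here is a minimal sketch using a standard textbook pair that does not appear in this post: 9 heads and 3 tails, obtained either from 12 fixed tosses (binomial) or by tossing until the 3rd tail (negative binomial). The specific numbers, the function names, and the one-sided test of θ = 0.5 against θ > 0.5 are all assumptions chosen for illustration. The sketch checks that the two likelihoods are proportional for every θ while the tail areas, which depend on the sample space, differ.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k heads in n fixed tosses), theta = P(heads)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def negbinom_pmf(k, r, theta):
    """P(k heads before the r-th tail), tossing until r tails appear."""
    return comb(k + r - 1, k) * theta**k * (1 - theta)**r

# Illustrative outcome (an assumption, not from the post): 9 heads, 3 tails.
# The likelihood ratio is the same constant at every theta (proportionality).
ratios = [binom_pmf(9, 12, t) / negbinom_pmf(9, 3, t) for t in (0.2, 0.5, 0.8)]

# One-sided tail areas under theta = 0.5: P(at least 9 heads) in each experiment.
p_binom = sum(binom_pmf(k, 12, 0.5) for k in range(9, 13))
p_negbinom = 1 - sum(negbinom_pmf(k, 3, 0.5) for k in range(9))

print("likelihood ratios:", [round(r, 6) for r in ratios])
print("binomial tail area:", round(p_binom, 4))
print("negative binomial tail area:", round(p_negbinom, 4))
```

The ratio is 4 at each θ, yet the tail areas come out at roughly 0.073 and 0.033. The likelihood function alone cannot account for that difference; the sample space does.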
(D) Now to consider what Robert says in his post, with my remarks following:
Robert: “The core of Birnbaum’s proof is relatively simple: given two experiments E’ and E” about the same parameter θ with different sampling distributions f¹ and f², such that there exists a pair of outcomes (y’, y”) from those experiments with proportional likelihoods, one considers the mixture experiment where E’ and E” are each chosen with probability ½.
Then it is possible to build a sufficient statistic T that is equal to the data (j,z), except when j=2 and z=y”, in which case T(j,z)=(1,y’).”
Mayo: Put more informally, if y’ and y” is any LP violation pair (i.e., the two would yield different inferences/assessments of the evidence due to the difference in sampling distributions), then it is possible to “build” a statistic T for interpreting them such that y” (from E”) is always reported as y’ from E’.[iii] I called this Birnbaum’s statistic T-BB.[iv] It is possible, in short, to Birnbaumize the result (E’, y’) whenever there is an experiment E”, not performed, that could have resulted in y”, with a proportional likelihood (with the same parameter under investigation and the model assumptions granted).
Robert: “This statistic [T-BB] is sufficient”.
Mayo: Yes, T-BB is sufficient for an experiment that will report its inference based on the rules of Birnbaumization: The sampling distribution of T-BB is to be the convex combination of the sampling distributions of E’ and E” whenever confronted with an outcome that has an LP violation pair (for more details see posts from Dec. 6, 7, and references within).[v] Cox rightly questions even this first step, but I’m prepared to play along since the “proof” breaks down anyway.[vi]
It should be emphasized that in carrying out this Birnbaumization, one is not free from considering the accompanying sampling distribution (corresponding to the statistic T-BB just “built”): the Birnbaumization move depends on having a single sampling distribution (otherwise sufficiency would not apply)[vii].
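The construction can be sketched in a few lines, under assumptions that are mine and not Robert’s or Birnbaum’s: E’ is taken to be 12 fixed tosses, E” is tossing until the 3rd tail, and y’ = y” = 9 heads (a textbook pair with proportional likelihoods). The names `t_bb`, `mixture_pmf` and the rest are hypothetical. The sketch shows two things: the probability that T-BB reports (1, y’) is a convex combination of the two experiments’ probabilities, and the chance that the report really came from E’ does not depend on θ, which is what sufficiency of T-BB amounts to here.

```python
from math import comb

def f1(k, theta):
    """E': probability of k heads in 12 fixed tosses (assumed example)."""
    return comb(12, k) * theta**k * (1 - theta)**(12 - k)

def f2(k, theta):
    """E'': probability of k heads before the 3rd tail (assumed example)."""
    return comb(k + 2, k) * theta**k * (1 - theta)**3

Y1, Y2 = 9, 9  # y' and y'': the assumed LP violation pair

def t_bb(j, z):
    """Birnbaum's statistic: report (2, y'') as (1, y'); otherwise report (j, z)."""
    return (1, Y1) if (j, z) == (2, Y2) else (j, z)

def mixture_pmf(j, z, theta):
    """Probability of (j, z) when a fair coin picks E' (j = 1) or E'' (j = 2)."""
    return 0.5 * (f1(z, theta) if j == 1 else f2(z, theta))

def t_bb_pmf_at_star(theta):
    """P(T-BB = (1, y')): a convex combination over both experiments."""
    return mixture_pmf(1, Y1, theta) + mixture_pmf(2, Y2, theta)

def prob_from_e1_given_star(theta):
    """P(data were (1, y') | T-BB = (1, y')); free of theta if T-BB is sufficient."""
    return mixture_pmf(1, Y1, theta) / t_bb_pmf_at_star(theta)

conditionals = [prob_from_e1_given_star(t) for t in (0.2, 0.5, 0.8)]
print([round(c, 6) for c in conditionals])
```

The conditional probability is 0.8 at every θ tried, so T-BB is sufficient within the mixture. But note what carries the result: `t_bb_pmf_at_star` is defined only through the merged sampling distribution.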
While Robert switches our InfrE(z) notation (Cox and Mayo 2010) to Birnbaum’s Ev(E, z), I will go ahead and leave it as Ev. InfrE was deliberately designed to be clearer, easier to read, and less likely to hide the very equivocation that is overlooked in this example.
Whether j = 1 or j = 2, Ev(E-BB, (j, z)) = Ev(E-BB, T(j,z))
This corresponds to my premise (1):
(1) InfrE-BB(E’, y’) = InfrE-BB(E”, y”)
In the relevant case, y’ and y” are LP violation pairs, since only those pose the threat to obeying the LP. So we can focus just on those in this note. In Mayo 2010 I used the * to indicate an outcome is part of an LP violation pair.
(E) Next Robert gives premise (2), though he switches the order: this corresponds to two applications of weak conditionality (WCP) [combining my 2a and 2b]:
(2) Whether j = 1 or j = 2, Ev(E-BB, (j, z)) = Ev(Ej, z)
The key issue concerns a quote from me (with Robert’s substitutions of Ev for Infr). Note, by the way, that Robert is alluding to my chapter in Mayo 2010, not the short version that I posted on this blog, Dec. 6, 7.
Robert: “Now, Mayo argues this is wrong because [it asserts that]:
‘[the mixed experiment E-BB] is appropriately identified with an inference from outcome yj based on the sampling distribution of Ej, which is clearly false'”.(p.310)
(continuing Robert’s quote of me):
“ ‘The sampling distribution to arrive at Ev(E-BB, (j, yj )) would be the convex combination averaged over the two ways that yj could have occurred. This differs from the sampling distributions of both Ev(E’, y’) and Ev(E”, y”)’. This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point.” (Robert, Oct. 6, 2011 post, p.310)
Mayo: I am not at all rejecting the WCP. The passage Robert quotes merely states the obvious; namely, that the following assertion is false: the inference computed using the sampling distribution of E-BB is identical to the inference using the sampling distribution of E’ by itself (or E” by itself). If we are playing Birnbaumization, then the appropriate sampling distribution is the convex combination. (In the section from which Robert is quoting, a reader will note, I have put Birnbaum’s argument in valid form.)
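A small numerical sketch may help show why the identification is false. It rests on assumptions of my choosing: E’ is 12 fixed tosses, E” is tossing until the 3rd tail, the outcome is 9 heads, the test is one-sided at θ = 0.5, and the tail area under Birnbaumization is taken to be the equally weighted average of the two component tail areas. That last step is one simple way to read “convex combination” for a reported number, not the only one.

```python
from math import comb

def tail_e1(k_obs, theta=0.5):
    """P(at least k_obs heads) in E': 12 fixed tosses (assumed example)."""
    return sum(comb(12, k) * theta**k * (1 - theta)**(12 - k)
               for k in range(k_obs, 13))

def tail_e2(k_obs, theta=0.5):
    """P(at least k_obs heads before the 3rd tail) in E'' (assumed example)."""
    below = sum(comb(k + 2, k) * theta**k * (1 - theta)**3
                for k in range(k_obs))
    return 1 - below

p_e1 = tail_e1(9)   # inference using the sampling distribution of E' alone
p_e2 = tail_e2(9)   # inference using the sampling distribution of E'' alone

# Assumption: under Birnbaumization the report averages over the two ways
# the outcome could have arisen, each chosen with probability 1/2.
p_bb = 0.5 * p_e1 + 0.5 * p_e2

print(round(p_e1, 4), round(p_e2, 4), round(p_bb, 4))
```

The averaged value lies strictly between the other two, so it equals neither the assessment from E’ alone nor the one from E” alone.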
But wait a minute, just a few lines later it turns out Robert does not deny my claim! He repeats it as obviously true ... but suddenly it has become irrelevant.
Robert: “Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for y’ and y” Not the [sampling?] distribution of this inference” (Robert, p. 310).
Mayo: What? This just makes no sense. There is no inference apart from the sampling distribution for a sampling theorist. One cannot assume there is somehow an inference apart from the sampling distribution. Sampling theory has simply not been understood. Robert’s own rendition of the argument [my Premise 1], depends on a merged sampling distribution, thanks to Birnbaumization; it certainly does not ignore sampling distributions. So I’m afraid I don’t know what Robert is talking about here. (This same point arose in the discussion by Aris Spanos when Robert’s post first appeared.)
Robert will go on to deny there are any LP counterexamples, because they all turn on pointing up the difference in sampling distributions! All I can do at this point is go back to where I began: listen to Birnbaum, Kadane, Lindley, Savage and everyone else who has discussed the (uncontroversial) fact that error statistics violates the LP! No one would be claiming sampling theory was incoherent were it not that it is prepared to reach different inferences from y’, y” despite their having proportional likelihoods (i.e., despite the conditions for the LP being met), and it does so solely because of a difference in sampling distributions.[viii]
Kadane, J. (2011), Principles of Uncertainty, CRC Press.
Mayo: 10/20/2011 Post: blogging-likelihood-principle-2
* The title is a distant analogue to that song “Don’t Bogart that chalk my friend, pass it on to me”.
I apologize for being simplistic, but I am hung up on fundamentals. Please help me with this basic question. If the LP does not hold, and the likelihood is supposed to contain “all information that the experiment has to offer” for a Bayesian inference, then how can I have any faith/confidence in the posterior probability? In other words, what tells me that the numeric value of the likelihood is measuring exactly what I am interested in? I thought the LP and Law of L gave me that grounding. It seems that having some principle is necessary before taking the singular value of the posterior seriously.