U-Phil: I would like to open up this post, together with Gandenberger’s (Oct. 30, 2012), to reader U-Phils, from December 6-19 (< 1000 words) for posting on this blog (please see # at bottom of post). Where Gandenberger claims, “Birnbaum’s proof is valid and his premises are intuitively compelling,” I have shown that if Birnbaum’s premises are interpreted so as to be true, the argument is invalid. If construed as formally valid, I argue, the premises contradict each other. Who is right? Gandenberger doesn’t wrestle with my critique of Birnbaum, but I invite you (and Greg!) to do so. I’m pasting a new summary of my argument below.
The main premises may be found on pp. 11-14. While these points are fairly straightforward (and do not require technical statistics), they offer an intriguing logical, statistical and linguistic puzzle. The following is an overview of my latest take on the Birnbaum argument. See also “Breaking Through the Breakthrough” posts: Dec. 6 and Dec 7, 2011.
Gandenberger also introduces something called the methodological likelihood principle. A related idea for a U-Phil is to ask: can one mount a sound, non-circular argument for that variant? And while one is at it, do his methodological variants of sufficiency and conditionality yield plausible principles?
Graduate students and others invited!
New Summary of Mayo Critique of Birnbaum’s Argument for the SLP
See also a draft of the full PAPER corresponding to this summary; a later and more satisfactory draft is here. Yet other links to the Strong Likelihood Principle (SLP): Mayo 2010; Cox & Mayo 2011 (appendix).
Please alert me to corrections; not all the symbols transferred so well.
1. (SLP): For any two experiments E’ and E” with different probability models f’, f” but with the same unknown parameter θ, if the likelihoods of outcomes x’* and x”* (from E’ and E” respectively) are proportional to each other, then x’* and x”* should have the identical evidential import for any inference concerning parameter θ.
SLP pairs. When the antecedent holds, x’* and x”* are said to have “the same likelihood function”, i.e., f’(x’*; θ) = cf”(x”*; θ) for all θ, c a positive constant. In such cases, we abbreviate by saying x’* and x”* are SLP pairs, and the asterisk * will be used to indicate this.
So we can abbreviate the SLP as follows:
SLP: for any two experiments, E’ and E”, if x’* and x”* are SLP pairs (from E’ and E” respectively) then
Infr E’(x’*) equiv Infr E”(x”*).
2.1 SLP Violation with Binomial, Negative Binomial
Example 1. Binomial vs. Negative Binomial. Consider independent Bernoulli trials, with the probability of success at each trial an unknown constant θ, but produced by different procedures, E’, E”. E’ is Binomial with a pre-assigned number n of Bernoulli trials, say 20, and R, the number of successes observed. In E” trials continue until a pre-assigned number r, say 6, of successes has occurred, with the number N of trials recorded. The sampling distribution of R is Binomial:
$f(R = r; \theta) = \binom{n}{r}\,\theta^{r}(1-\theta)^{n-r}$
while the sampling distribution of N is Negative Binomial:
$f(N = n; \theta) = \binom{n-1}{r-1}\,\theta^{r}(1-\theta)^{n-r}$
If two outcomes from E’ and E” respectively have the same number of successes r in the same number of trials n (and hence the same number of failures, n − r), then they have the “same” likelihood, in the sense that both are proportional to $\theta^{r}(1-\theta)^{n-r}$.
The two outcomes, x’* and x”* are SLP pairs. But the difference in the sampling distributions of the respective statistics, R and N, of E’ and E” respectively, entails a difference in p-values or confidence level assessments. Accordingly, their evidential appraisals differ for sampling distribution inference. Thus x’* and x”* are SLP pairs leading to an SLP violation.
An SLP violation with Binomial (E’) and Negative Binomial (E”):
(E’, r=6) and (E”, n=20) have proportional likelihoods
but InfrE’ (x’*= 6) is not equiv to Infr E”(x”*=20).
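To make Example 1 concrete, here is a minimal Python sketch. The null value θ = 0.5 and the one-sided alternative θ < 0.5 are assumptions chosen purely for illustration; the post itself does not fix a null hypothesis. The function names are likewise hypothetical. The sketch checks that the likelihood ratio is the same constant for every θ, and then computes the two p-values from the respective sampling distributions.

```python
from math import comb

def binomial_pmf(r, n, theta):
    """P(R = r) when the number of trials n is fixed in advance (E')."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

def negative_binomial_pmf(n, r, theta):
    """P(N = n) when sampling continues until the r-th success (E'')."""
    return comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)

R_OBS, N_OBS = 6, 20   # 6 successes in 20 trials, as in Example 1
THETA_0 = 0.5          # assumed null value, for illustration only

# The ratio of the two likelihoods is the same constant c for every theta.
likelihood_ratios = [
    binomial_pmf(R_OBS, N_OBS, t) / negative_binomial_pmf(N_OBS, R_OBS, t)
    for t in (0.1, 0.3, 0.5, 0.7)
]

# E': p-value = P(R <= 6; n = 20, theta_0).
p_binomial = sum(binomial_pmf(k, N_OBS, THETA_0) for k in range(R_OBS + 1))

# E'': p-value = P(N >= 20; r = 6, theta_0), i.e. the probability of
# at most r - 1 successes in the first n - 1 trials.
p_negative_binomial = sum(
    binomial_pmf(k, N_OBS - 1, THETA_0) for k in range(R_OBS)
)

print("likelihood ratios:", likelihood_ratios)
print("binomial p-value:", p_binomial)
print("negative binomial p-value:", p_negative_binomial)
```

Under these assumptions the binomial p-value is roughly 0.058 and the negative binomial p-value roughly 0.032, even though the likelihoods are proportional (c = 20/6).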
Loss of relevant information if the index is erased
In making inferences about θ on the basis of data x in sampling theory, relevant information would be lost if the report removed the index from E and reported:
Data x consisted of r successes in n Bernoulli trials, generated from either a Binomial experiment with n fixed at 20, or a negative binomial experiment with r fixed at 6—erasing the index indicating the actual source of data.
2.2 SLP violation with fixed normal testing and optional stopping: E’, E”
Example 2. Fixed vs. sequential sampling. Suppose X’ and X” are sets of independent observations from N(μ, σ²), with σ known, and p-values are to be calculated for the null hypothesis μ = 0. In E’ the sample size is fixed, whereas in E” the sampling rule is to continue sampling until the sample mean is at least 1.96σ/√n from 0. Suppose E” is first able to stop with n = 169 trials. Then x” has a likelihood proportional to that of a result that could have occurred from E’, where n was fixed in advance to be 169, and result x’ is 1.96σ/√n from 0. Although the corresponding p-values would be different, the two results would be inferentially equivalent according to the SLP. This application of the SLP to the case of optional stopping is often called the Stopping Rule Principle, SRP (Berger and Wolpert 1988).[i]
SLP violation with Fixed Normal Testing and Optional Stopping: E’, E”
(E’, 1.96σ/13) and (E”, n = 169) have proportional likelihoods
InfrE’ (1.96σ /13) is not equiv to Infr E”( n = 169).
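Why the p-values differ in Example 2 can be illustrated with a small simulation. This is only a sketch under stated assumptions: σ = 1, a two-sided reading of the stopping rule (the sample mean is at least 1.96σ/√n from 0), 2000 simulated sample paths, and a fixed seed are all illustrative choices, not part of the original example. With μ = 0, it estimates how often the sequential rule would have stopped by n = 169, and how often a fixed-n experiment with n = 169 reaches the same cut-off.

```python
import random
from math import sqrt

SIGMA = 1.0     # known standard deviation (hypothetical value)
N_MAX = 169     # the n at which E'' stops in Example 2
N_SIMS = 2000   # number of simulated sample paths (hypothetical)

def simulate_once(rng):
    """Simulate one sample path with mu = 0.

    Returns (crossed, significant_at_end):
      crossed            -- the sequential rule of E'' would have stopped
                            at some n <= N_MAX
      significant_at_end -- the fixed-n experiment E' with n = N_MAX
                            reaches the 1.96 sigma / sqrt(n) cut-off
    """
    total = 0.0
    crossed = False
    for n in range(1, N_MAX + 1):
        total += rng.gauss(0.0, SIGMA)
        if abs(total / n) >= 1.96 * SIGMA / sqrt(n):
            crossed = True
    significant_at_end = abs(total / N_MAX) >= 1.96 * SIGMA / sqrt(N_MAX)
    return crossed, significant_at_end

rng = random.Random(0)
results = [simulate_once(rng) for _ in range(N_SIMS)]
rate_optional = sum(c for c, _ in results) / N_SIMS
rate_fixed = sum(s for _, s in results) / N_SIMS

print("stopped by n = 169 under optional stopping:", rate_optional)
print("cut-off reached with n fixed at 169:", rate_fixed)
```

Under these assumptions the fixed-n rate stays near the nominal 0.05, while the optional-stopping rate is several times larger. That difference in error probabilities is what the sampling theorist takes into account and what the SLP says to ignore.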
(a) Sufficient Statistic: Let data x = (x1, x2, …, xn) be a realization of random variable X, following a distribution f. A statistic T(x) is a sufficient statistic if the following relation holds:
$f(x; \theta) = f_{T}(t; \theta)\, f_{X|T}(x \mid t)$
where $f_{X|T}$ does not depend on the unknown parameter θ.
(b) Sufficiency Principle (general): If random sample X, in experiment E, has probability density f(x; θ), and the assumptions of the model are valid, and T is minimal sufficient for θ, then if t(X’) = t(X”), then InfrE’(x’) = InfrE”(x”).
Since the sufficiency principle holds for different inference schools, any application must take into account the relevant method for inference under discussion (Cox and Mayo 2010).
(c) Sufficiency Principle applied in sampling theory: If a random variable X, in experiment E, arises from f(x;θ), and the assumptions of the model are valid, then all the information about θ contained in the data may be obtained from considering its minimal sufficient statistic t and the sampling distribution fT(t;θ) of experiment E.
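The factorization in (a) can be checked numerically in a simple case. The sketch below assumes n independent Bernoulli(θ) trials with T the number of successes; this model and the particular data vector are illustrative choices, not taken from the post. It shows that the conditional probability of the data given T = t is 1/C(n, t) whatever the value of θ.

```python
from math import comb

def joint_pmf(x, theta):
    """Probability of the full 0/1 sequence x under Bernoulli(theta) trials."""
    t = sum(x)
    return theta**t * (1 - theta)**(len(x) - t)

def pmf_of_t(t, n, theta):
    """Sampling distribution of T = number of successes in n trials."""
    return comb(n, t) * theta**t * (1 - theta)**(n - t)

def conditional_given_t(x, theta):
    """f(x | t) = f(x; theta) / f_T(t; theta)."""
    return joint_pmf(x, theta) / pmf_of_t(sum(x), len(x), theta)

x = (1, 0, 0, 1, 0)  # hypothetical data: 2 successes in 5 trials
conditionals = [conditional_given_t(x, theta) for theta in (0.2, 0.5, 0.8)]
print(conditionals)  # each value equals 1 / comb(5, 2), whatever theta is
```

Since the conditional term is free of θ, everything the data say about θ in this model enters through t and its sampling distribution, which is the content of (c).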
Weak Conditionality Principle (WCP): If a mixture experiment is performed, with components E’, E” determined by a randomizer (independent of the parameter of interest), then once (E’,x’) is known, inference should be based on E’ and its sampling distribution; not on the sampling distribution of the convex combination of E’ and E”.
4.1 Understanding the WCP
The WCP includes a prescription and a proscription for the proper evidential interpretation of x’, once it is known to have come from E’:
“The evidential meaning of any outcome (E’, x’) of any experiment E having a mixture structure is the same as the evidential meaning of the corresponding outcome x’ of the corresponding component experiment E’, ignoring otherwise the over-all structure of the original experiment.” (Birnbaum 1962, 279)
While the WCP seems obvious enough, it is actually rife with equivocal potential. To avoid this, we belabor here its three assertions.
- First, it applies once we know which component of the mixture has been observed, and what the outcome was (Ej, xj). (Birnbaum considers mixtures with just two components).
- Second, there is the prescription about evidential equivalence. Once it is known Ej has generated the data, given that our inference is about a parameter of Ej, inferences are appropriately drawn in terms of the sampling distribution in Ej –the experiment known to have been performed.
- Third, there is the proscription: In the case of informative inferences about a parameter of Ej, our inference should not be influenced by whether the decision to perform Ej was determined by a coin flip or fixed all along. Misleading informative inferences result from averaging over the convex combination of Ej and an experiment known not to have given rise to the data. The latter may be called the unconditional sampling distribution.
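The contrast between the conditional and the unconditional assessment can be sketched numerically with a two-instrument mixture. All numbers here are hypothetical: a precise instrument E’ with σ = 1, an imprecise instrument E” with σ = 10, a fair coin choosing between them, an observed measurement x’ = 2 known to come from E’, and a one-sided test of μ = 0 against μ > 0.

```python
from math import erfc, sqrt

def upper_tail_p(x, sigma):
    """P(X >= x) for X ~ N(0, sigma^2): one-sided p-value under mu = 0."""
    return 0.5 * erfc(x / (sigma * sqrt(2)))

SIGMA_PRECISE = 1.0     # E': precise instrument (hypothetical)
SIGMA_IMPRECISE = 10.0  # E'': imprecise instrument (hypothetical)
X_OBS = 2.0             # observed measurement, known to come from E'

# Conditional assessment: use the sampling distribution of E' alone.
p_conditional = upper_tail_p(X_OBS, SIGMA_PRECISE)

# Unconditional assessment: average over the fair-coin mixture of E' and E''.
p_unconditional = (0.5 * upper_tail_p(X_OBS, SIGMA_PRECISE)
                   + 0.5 * upper_tail_p(X_OBS, SIGMA_IMPRECISE))

print("conditional p-value:", p_conditional)
print("unconditional p-value:", p_unconditional)
```

With these numbers the conditional p-value is about 0.023 and the unconditional one about 0.22. The two assessments are far from equivalent, which is why a prescription to use the conditional one has any force.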
A second ambiguity. Casella and Berger (2002) write:
The [weak] Conditionality principle simply says that if one of two experiments is randomly chosen and the chosen experiment is done, yielding data x, the information about θ depends only on the experiment performed….The fact that this experiment was performed, rather than some other, has not increased, decreased, or changed knowledge of θ. (emphasis added, 293)
Casella and Berger’s intended meaning is the correct claim:
(i) Given it is known that measurement x’ is observed as a result of using tool E’, then it does not matter (and it need not be reported) whether or not E’ was chosen by a random toss (that might have resulted in using tool E”) or fixed all along.
Compare this to a false and unintended reading:
(ii) If some measurement x is observed, then it does not matter (and it need not be reported) if it came from a precise tool E’ or imprecise tool E”.
Claim (i), by contrast, may well be warranted, not on purely mathematical grounds, but as the most appropriate way to report the precision of the result attained, as when WCP applies.
The linguistic similarity of (i) and (ii) may explain the equivocation that vitiates the Birnbaum argument.
4.3 Is WCP an Equivalence? (you may wish to compare this to my earlier treatments, e.g., Mayo 2010)
A central question is whether WCP is a proper equivalence, holding in both directions (Evans et al. 1986, Durbin 1970). Weighing against viewing it as an equivalence is this: it makes no sense to say one should use the unconditional rather than the conditional assessment (once it is known which component of a mixture was performed), and at the same time maintain the unconditional and conditional assessments are evidentially equivalent. WCP prescribes conditioning on the experiment known to have produced the data, and not the other way around. It is only because these do not yield equivalent appraisals that the WCP may serve to avoid counterintuitive assessments (e.g., that would otherwise be permitted from those famous weighing machines). It is their inequivalence, in short, that gives Cox’s WCP its normative proscriptive force:
WCP proscription: Once (E’, x’) is known, InfrE’(x’) should be computed using, not the unconditional sampling distribution over E’ and E”, but rather, the sampling distribution of E’.
Yet there is an equivalence within the WCP, and so long as it is consistently interpreted, it raises no problems.[ii] This turns out to be the linchpin of disentangling the Birnbaum argument.
To hold WCP for a given context is to judge that the information that E’ was determined by a flip is a redundancy, equivalent to conjoining a tautology to the outcome (E’, x’):
- Knowing that (E’, x’) occurred,
- InfrE’(x’) equiv [InfrE’(x’) and (Either E’ was chosen by flipping, or E’ was fixed)]
where it is given that the flipping conjunct in no way alters the construal of (E’, x’). [iii]
Viewing the WCP as endorsing a genuine “two-way” equivalence requires viewing any known experimental result as equivalent, evidentially, to its being a component of a corresponding mixture, even though it is known that in fact E was not chosen by a mixture. While this may seem unsettling, no untoward evidential interpretations result so long as the proscriptive part of the WCP remains, and is not contradicted (say by allowing the imaginary mixture to influence the interpretation of the known “component”).
5. Birnbaum’s Argument
SLP: for any two experiments, E’ and E”, if x’* and x”* are SLP pairs (from E’ and E” respectively) then Infr E’(x’*) equiv Infr E”(x”*).
Begin with any case where the antecedent of the SLP holds. The task is to show the two ought to be deemed evidentially equivalent.
Suppose we have observed (E’, x’*) with an SLP pair (E”, x”*). Then view (E’, x’*) as having resulted from getting heads on the toss of a fair coin, where tails would have meant performing E” (any other irrelevant randomizer would do). This is sometimes called the “enlarged experiment”. Now construct the Birnbaum test statistic T-B defined in terms of the enlarged experiment:
T-B(Ej, xj*) = (E’, x’*), if j = 1 and x’ = x’*, or j = 2 and x” = x”*.
Else, report the outcome (Ej, xj ).
In words: in the case of a member of an SLP pair, statistic T-B has the effect of erasing the index j. Inference based on T-B is to be computed averaging over the performed and unperformed experiments E’ and E”. This is the unconditional formulation of the enlarged experiment. This gives premise one:
(1) For any (E’, x’*), the result of construing its evidential import in terms of the unconditional formulation is that:
InfrE-B(x’*) equiv InfrE-B(x”*)
The likelihood functions of (E’, x’*) and (E”, x”*) in the enlarged experiment are proportional for all θ, being 0.5f’(x’*; θ) and 0.5f”(x”*; θ).
However E’ and E” are different models of the experiment producing the two likelihoods, and the enlarged model associated with T-B is yet a third model of the experiment. The second premise now concerns the WCP:
(2) Once it is known that E’ produced the outcome x’*, compute the inference just as if it were known all along that E’ was going to be performed, i.e., one should use the conditional formulation, ignoring any mixture structure:
InfrE-B(x’*) equiv InfrE’(x’*)
More generally, once (xj*) is known to have come from Ej, j = 1 or 2, premise (2) is
InfrE-B(xj*) equiv InfrEj(xj*)
From premises (1) and (2) it is concluded, for any arbitrary SLP pair x’*, x”*,
InfrE’(x’*) equiv InfrE”(x”*)
The SLP is said to follow. This is an unsound argument.
A sound argument must be both deductively valid and have all true premises.
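To make the enlarged experiment and the statistic T-B concrete, here is a minimal sketch that reuses the SLP pair of Example 1 (x’* = 6 successes in the Binomial E’ with n = 20; x”* = 20 trials in the Negative Binomial E” with r = 6). The function names are hypothetical, and using Example 1 as the SLP pair is an illustrative choice. The sketch shows the index-erasing map and why T-B is sufficient in the fair-coin mixture: given the merged outcome, the probability that E’ was the experiment performed does not depend on θ.

```python
from math import comb

X1_STAR = 6    # x'*: successes observed in E' (Binomial, n = 20)
X2_STAR = 20   # x''*: trials needed in E'' (Negative Binomial, r = 6)

def f1(r, theta):
    """Binomial pmf for E' with n = 20 fixed."""
    return comb(20, r) * theta**r * (1 - theta)**(20 - r)

def f2(n, theta):
    """Negative binomial pmf for E'' with r = 6 fixed."""
    return comb(n - 1, 5) * theta**6 * (1 - theta)**(n - 6)

def t_b(j, x):
    """Birnbaum's statistic: erase the index j for members of the SLP pair."""
    if (j == 1 and x == X1_STAR) or (j == 2 and x == X2_STAR):
        return (1, X1_STAR)
    return (j, x)

def prob_merged(theta):
    """P(T-B = (E', x'*)) in the enlarged (fair-coin mixture) experiment."""
    return 0.5 * f1(X1_STAR, theta) + 0.5 * f2(X2_STAR, theta)

def prob_e1_given_merged(theta):
    """P(E' was performed | merged outcome); free of theta if T-B is sufficient."""
    return 0.5 * f1(X1_STAR, theta) / prob_merged(theta)

shares = [prob_e1_given_merged(theta) for theta in (0.1, 0.3, 0.5, 0.7)]
print(t_b(1, X1_STAR), t_b(2, X2_STAR))  # both map to the same report
print(shares)                            # the same value for every theta
```

The conditional share is 10/13 for every θ here, so within the unconditional formulation nothing about θ is lost by erasing the index. Whether inference ought to be computed within that formulation, once it is known which experiment was performed, is exactly what the two premises answer differently.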
Consider the truth of the two premises of Birnbaum’s argument. Premise one (InfrE-B(x’*) equiv InfrE-B(x”*)) is true provided that
InfrE-B(x’*) is the inference from (E’, x’) averaging over the unconditional sampling distribution of statistic T-B. In effect it reports just the likelihood of x*, which enters inference in terms of the convex combination of E’ and E”.
For premise two to be true
(i.e., InfrE-B(xj*) equiv InfrEj(xj*) for j = 1, 2)
InfrE-B(xj*) must refer to the inference from (Ej, xj*) modeled in terms of the sampling distribution of Ej alone. The experiment E-B on which inference is to be based has different meanings in each premise. The argument is invalid.
5.2 Second formulation: allowing true “if then” premises
We can formulate the argument so that both premises are true “if then” statements[iv] incorporating the stipulated sampling distributions:
As before, suppose an arbitrary member of an SLP pair (E’, E”) is observed, e.g.,
(E’, x’*) is observed. The question is as to its evidential import.
(1) If InfrE-B(x’*) is computed unconditionally, averaging over the sampling distributions of T-B, then
InfrE-B(x’*) equiv InfrE-B(x”*)
(2) If InfrE-B(Ej,xj*) is computed conditionally, using the sampling distribution of Ej, then
InfrE-B(xj*) equiv InfrEj(xj*) for j = 1, 2.
Construed as “if then” claims, the premises can both be true, but then we cannot validly infer the SLP:
InfrE’(x’*) equiv InfrE”(x”*)
We would need contradictory antecedents to hold.
The formal invalidity is proved by any SLP violation, since in that case, the premises are true and the conclusion is false. SLP violation pairs are readily available (e.g., Examples 1 and 2), and no contradiction results. In fact, we have demonstrated something stronger: whenever we deal with an SLP violation pair, the two “if then” premises, when true, yield a false conclusion.
REFERENCES: See Paper (or my latest version upcoming in Statistical Science).
[i] Applying the stopping rule principle requires stipulating that the stopping rule was uninformative for the inference, as in the above example.
[ii] Birnbaum himself is conflicted here. In his later, 1969 paper, Note 11, Birnbaum asserts, “The formulation of the conditionality concept as one of equivalence”, as in [WCP] was proposed by him in (1962) as the natural explication of the concept, notwithstanding the one-sided form to which applications of the concept had been restricted (substitution of simpler for less simple models of evidence). This proposal seems to have found general acceptance among those interested in the concept.
[iii] For that matter, as Birnbaum suggests (1969, 119), a “trivial but harmless” augmentation to any experiment might be to toss a fair coin and report heads or tails (where this was irrelevant to the original model). Given (E’, x’),
InfrE’(x’) equiv [InfrE’(x’) and either a coin was tossed or it was not].
He intends the move in applying the WCP to be just as innocuous as the report of an irrelevant coin toss.
[iv] I am deliberately avoiding the term “conditional” statement, since it is used with a very different sense throughout.
#: This will give graduate students at my 28 Nov., 2012 presentation of this paper, as part of the (PH500) seminar, London School of Economics, a chance to submit something. Inquiries: email@example.com
For some older examples of U-Phils, see an earlier post, and search this blog.
This is a very interesting post, but I have to say that I cannot follow it to the end. I do, however, have some observations about the examples. (Forgive me if they are consequences of naivety!)
To me examples 1 and 2 are equivalent. The conflicts that they illustrate between ‘frequentism’ and the likelihood principle come from p-values being calculated in a manner that ‘corrects’ for the sequential nature of sampling. That is obvious in the second example, but the first example purports to be about binomial vs. negative binomial experiments. However the negative binomial experiment is really a sequential sampling scheme. Thus the two examples are equivalent.
Now a question. Is it possible that there are frequentist p-values that are not in conflict with the likelihood principle? I suggest that the conflict is really between the likelihood principle and p-values that come from the Neyman-Pearsonian error-decision paradigm. The p-values ‘corrected’ for sequential sampling are calculated in a way that restores their linear relationship with type I errors. Such p-values relate to global error rates rather than the evidential worth of the data and so conflict with the likelihood principle is not only inevitable, but of no consequence. In contrast, p-values that are calculated assuming a fixed sample size for both of the examples seem to respect the conditionality principle and so, I assume, also respect the likelihood principle. Such p-values are indices of the evidential worth of the data.
Michael: since we’ve taken up the SLP a few times already on this blog, I was just going to give links, but decided to post a summary I sketched recently of a variant on my earlier discussion. If you read the linked paper, I guarantee you’ll make it to the end (this overview may be too sketchy). Sure the two examples are analogous, but people have different intuitions about them, and of course one is discrete, the other continuous. They are just illustrations. I’ll have to study the rest of what you wrote later on, thanks.