Birnbaum’s argument for the SLP involves some equivocations that are at once subtle and blatant. The subtlety makes it hard to translate into symbolic logic (I only partially translated it). Philosophers should have a field day with this, and I should be hearing more reports that it has suddenly hit them between the eyes like a ton of bricks, to use a mixture metaphor. Here are the key bricks. References can be found here, and background to the U-Phil here.

**Famous (mixture) weighing machine example and the WCP**

The main principle of evidence on which Birnbaum’s argument rests is the *weak conditionality principle* (WCP). This principle, Birnbaum notes, follows not from mathematics alone but from intuitively plausible views of “evidential meaning.” To understand the interpretation of the WCP that gives it its plausible ring, we consider its development in “what is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992).

*The basis for the WCP*

**Example 3.** *Two measuring instruments of different precisions.* We flip a fair coin to decide which of two instruments, E’ or E”, to use in observing a normally distributed random sample **X** to make inferences about mean θ. E’ has a known variance of 10^{−4}, while that of E” is known to be 10^{4}. The experiment is a mixture: E-mix. The fair coin or other randomizer may be characterized as observing an indicator statistic J, taking values 1 or 2 with probabilities .5, independent of the process under investigation. The full data indicates first the result of the coin toss, and then the measurement: (E^{j}, **x**^{j}).[i]

The sample space of E-mix with components E^{j}, j = 1, 2, consists of the union of

{(j, **x’**): j = 1, possible values of **X’**} and {(j, **x”**): j = 2, possible values of **X”**}.

In testing a null hypothesis such as θ = 0, the same **x** measurement would correspond to a much smaller p-value were it to have come from E′ than if it had come from E”: denote them as p′(**x**) and p′′(**x**), respectively. However, the overall significance level of the mixture, the convex combination of the p-values, [p′(**x**) + p′′(**x**)]/2, would give a misleading report of the precision or severity of the actual experimental measurement (see Cox and Mayo 2010, 296).
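To make the contrast concrete, here is a minimal numerical sketch, not taken from the cited papers. It assumes a two-sided z-test of the null hypothesis that the mean is 0, a single observation (n = 1), and a made-up measurement of 0.05. The function name `two_sided_p_value` and the variable names are hypothetical choices for illustration.

```python
import math


def two_sided_p_value(sample_mean, variance, n=1, null_mean=0.0):
    """Two-sided p-value for a normal mean with known variance (z-test)."""
    standard_error = math.sqrt(variance / n)
    z = abs(sample_mean - null_mean) / standard_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical observed measurement; any value could be used here.
x_bar = 0.05

# Variances of the two instruments as given in Example 3.
p_precise = two_sided_p_value(x_bar, variance=1e-4)    # as if from the precise instrument
p_imprecise = two_sided_p_value(x_bar, variance=1e4)   # as if from the imprecise instrument

# Unconditional report: convex combination with weights 1/2 and 1/2.
p_mixture_average = 0.5 * (p_precise + p_imprecise)

print(f"precise instrument:   {p_precise:.3g}")
print(f"imprecise instrument: {p_imprecise:.3g}")
print(f"mixture average:      {p_mixture_average:.3g}")
```

Under these assumptions the same number is overwhelming evidence against the null if it came from the precise instrument and no evidence at all if it came from the imprecise one. The average describes neither measurement.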

Suppose that we know we have observed a measurement from E” with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance] (Cox 1958, 361).

In effect, an individual unlucky enough to use the imprecise tool gains a more informative assessment because he might have been lucky enough to use the more precise tool! (Birnbaum 1962, 491; Cox and Mayo 2010, 296). Once it is known whether E′ or E′′ has produced **x**, the p-value or other inferential assessment should be made conditional on the experiment actually run.

*Weak Conditionality Principle (WCP):* If a mixture experiment is performed, with components E’, E” determined by a randomizer (independent of the parameter of interest), then once (E’, **x’**) is known, inference should be based on E’ and its sampling distribution, not on the sampling distribution of the convex combination of E’ and E”.

*Understanding the WCP*

The WCP includes a prescription and a proscription for the proper evidential interpretation of **x’**, once it is known to have come from E’:

The evidential meaning of any outcome (E’, **x’**) of any experiment E having a mixture structure is the same as: the evidential meaning of the corresponding outcome **x’** of the corresponding component experiment E’, *ignoring otherwise the over-all structure of the original experiment* E (Birnbaum 1962, 489; E_{h} and x_{h} replaced with E’ and x’ for consistency).

While the WCP seems obvious enough, it is actually rife with equivocal potential. To avoid this, we spell out its three assertions.

*First*, it applies once we know which component of the mixture has been observed, and what the outcome was: (E^{j}, **x**^{j}). (Birnbaum considers mixtures with just two components.)

*Second*, there is the prescription about evidential equivalence. Once it is known that E^{j} has generated the data, given that our inference is about a parameter of E^{j}, inferences are appropriately drawn in terms of the distribution in E^{j}, the experiment known to have been performed.

*Third*, there is the proscription. In the case of informative inferences about the parameter of E^{j}, our inference should not be influenced by whether the decision to perform E^{j} was determined by a coin flip or fixed all along. Misleading informative inferences might result from averaging over the convex combination of E^{j} and an experiment known not to have given rise to the data. The distribution of that convex combination may be called the unconditional (sampling) distribution. ….

*______________________________________________*

*One crucial equivocation:*

Casella and R. Berger (2002) write:

The [weak] Conditionality principle simply says that if one of two experiments is randomly chosen and the chosen experiment is done, yielding data **x**, the information about θ depends only on the experiment performed. . . . *The fact that this experiment was performed, rather than some other, has not increased, decreased, or changed knowledge of θ.* (p. 293, emphasis added)

I have emphasized the last line in order to underscore a possible equivocation. Casella and Berger’s intended meaning is the correct claim:

(i) Given that it is known that measurement **x**’ is observed as a result of using tool E’, then it does not matter (and it need not be reported) whether or not E’ was chosen by a random toss (that might have resulted in using tool E”) or had been fixed all along.

Of course we do not know what measurement would have resulted had the unperformed measuring tool been used.

Compare (i) to a false and unintended reading:

(ii) If some measurement **x** is observed, then it does not matter (and it need not be reported) whether it came from a precise tool E’ or imprecise tool E”.

The idea of detaching **x**, and reporting that “**x** came from somewhere I know not where,” will not do. For one thing, we need to know the experiment in order to compute the sampling inference. For another, E’ and E” may be like our weighing procedures with very different precisions. It is analogous to being given the likelihood of the result in Example 1 (here), withholding whether it came from a negative binomial or a binomial.

Claim (i), by contrast, may well be warranted, not on purely mathematical grounds, but as the most appropriate way to report the precision of the result attained, as when the WCP applies. The essential difference in claim (i) is that the outcome (E’, **x**’) is known, enabling its inferential import to be determined.

The linguistic similarity of (i) and (ii) may explain the equivocation that vitiates the Birnbaum argument.

Now go back and skim 3 short pages of notes here, pp 11-14, and it should hit you like a ton of bricks! If so, reward yourself with a double Elba Grease, else try again. Report your results in the comments.

I think we can work through this example in a much simpler way. Given the two measurement instruments with their known variances, we know how to combine data points taken with each: we weight the points inversely according to the variance of their respective measuring instrument. This is a principle that is easy to show mathematically without recourse to any WCP, etc. We all learn it early on in courses.

So if there are no points from Instrument 2, its properties don’t contribute to our calculation of the results, simply because the sum over those points gives a zero contribution (since there are no such points).

If we do know which points have been measured by each instrument, we just compute our variance-weighted sums.

If we don’t know which points have been measured by which instrument, there’s not a whole lot we can do to get a proper result, because we can’t establish the variance of the mixture.

Bingo. We’ve just established that claim (i) is the correct one, at least if we know which points were measured by which instrument: we need to know the instrument for each point, but not how it was selected. The matter is not esoteric or complicated.
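The variance-weighted combination described in this comment can be sketched as follows. This is a minimal illustration that assumes independent, unbiased measurements with known variances. The function name `inverse_variance_mean` and the sample values are hypothetical.

```python
def inverse_variance_mean(points):
    """Combine (value, variance) pairs by weighting each value by 1/variance.

    Returns the weighted estimate and the variance of that estimate.
    """
    weights = [1.0 / variance for _, variance in points]
    total_weight = sum(weights)
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, points))
    return weighted_sum / total_weight, 1.0 / total_weight


# Hypothetical data: two points from the precise instrument (variance 1e-4).
precise_only = [(0.02, 1e-4), (0.04, 1e-4)]

# The same points plus one wild reading from the imprecise instrument (variance 1e4).
with_imprecise = precise_only + [(50.0, 1e4)]

print(inverse_variance_mean(precise_only))
print(inverse_variance_mean(with_imprecise))
```

With no points from the second instrument, its variance never enters the calculation. When one imprecise point is present, its weight is so small that the estimate barely moves. In both cases the computation needs to know which instrument produced each point, and nothing about how that instrument was chosen.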

Tom: by the way, this case always does assume you know which instrument was used…and also that the flipping or other randomizer is irrelevant to the inference. You say you learn this in courses (stat?), but for some reason it is alleged (in this famous result) that frequentists would consider experiments not performed, because they talk about other outcomes that might have arisen (from this experiment) under repetitions.

Tom. Exactly! Averaging over the two is crazy! But Professor B is always looking to get someone to buy him a bottle of Elba Grease. Prof. B comes along and says, sure the data x’ came from E’, but let’s change the second experiment E” that you might have done but didn’t. He will bet you there is one case where you should not mind averaging over the two.

Choose an E” that could have given rise to an x” with the “same” (proportional) likelihood as E’ (over the parameter of interest). Suppose you flipped the coin to decide whether to perform either E’, Normal sampling with fixed sample size n (known sigma), or E” where sample size was determined by a stopping rule that stops when a k-standard-deviation difference is observed. It’s perfectly possible E” could have stopped at the same point as your fixed sample size experiment E’ (so fix k accordingly). Then the two outcomes (E’, x’*) and (E”, x”*) have the “same” likelihood. I put the * there to indicate that x’* and x”* have proportional likelihoods. We say x’* has a “star pair” in some experiment not performed, E”.

Now Prof B says, whenever your outcome has a star pair in some other experiment E”, then just report x* occurred and pretend you do not know it came from E’. To get your p-value or the like, use the convex combination over E’ and E” as if you had done the relevant mixture experiment. You didn’t really do one, so it’s a hypothetical mixture, but no matter. The likelihood for x* is a sufficient statistic for this hypothetical mixture experiment, so there’s no information lost (regarding the parametric inference) and it need not be reported which experiment x* came from.

Now the frequentist sampling theorist wants to distinguish the evidential import of (E’, x’*) from the evidential import of (E”, x”*). The outcome from the latter would have a higher type 1 error probability, or higher significance level, than the former, even though the two have proportional likelihoods over the parameter space. But it looks like he cannot distinguish the two after all, while holding to both sufficiency and the WCP. Much clearer in my paper, but since it’s Sat. night, just look at my Dec 31, 2012 blog: http://errorstatistics.com/2012/12/31/midnight-with-birnbaum-reblog/
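The difference in error probabilities between the fixed-sample experiment and the stopping-rule experiment can be checked with a rough simulation. This sketch rests on assumptions that are not in the discussion above: the null mean is 0, sigma is 1, k = 1.96, and the stopping-rule experiment is truncated at the same maximum sample size as the fixed one. All function names are hypothetical.

```python
import math
import random


def rejects_fixed(n, k, rng):
    """Fixed sample size: reject if the mean of n draws is k standard errors from 0."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total) / math.sqrt(n) >= k


def rejects_with_stopping(n_max, k, rng):
    """Stopping rule: check after every draw, stop and reject at the first k-SE difference."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) >= k:
            return True
    return False  # never crossed the boundary within n_max draws


def type1_rates(trials=2000, n=50, k=1.96, seed=0):
    """Estimate type 1 error rates for both designs, with data generated under the null."""
    rng = random.Random(seed)
    fixed = sum(rejects_fixed(n, k, rng) for _ in range(trials)) / trials
    stopping = sum(rejects_with_stopping(n, k, rng) for _ in range(trials)) / trials
    return fixed, stopping


fixed_rate, stopping_rate = type1_rates()
print(f"fixed n:       {fixed_rate:.3f}")
print(f"stopping rule: {stopping_rate:.3f}")
```

Under these assumptions the fixed-sample rate should sit near the nominal 5 percent, while the stopping-rule rate comes out several times larger. A given data set that ends at the same n has proportional likelihoods under either design, so the likelihood alone does not register this difference.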

“This principle, Birnbaum notes, follows not from mathematics alone but from intuitively plausible views of “evidential meaning.” ”

Well, “not from mathematics alone” is something of an understatement. Mathematics can investigate the consequences of such principles once they are mathematically formulated, but it doesn’t have to say anything at all about whether one should follow them. Actually the whole discussion in this posting, interesting and enlightening as it may be, is extra-mathematical.

I should have added: “it doesn’t have to say anything at all about whether one should follow them or how they are to be interpreted in reality.”

Fine. But Birnbaum is keen to demonstrate something based on what he regards as intuitive principles of evidence. And I adhere to these intuitions exactly. But see the equivocation in this post (relating to the Casella and Berger quote). Once we interpret his principles in the intuitively plausible way he intends, the “proof” flounders.

Well, or one needs to admit that the principles, as formulated by Birnbaum, are not enough to capture this intuition.

Actually, what is called “Sufficiency principle (in sampling theory)” in your recent (?) draft which I have found linked from this blog in the meantime (sorry, could have done that earlier), describes very well what is missing in Birnbaum’s original formulation. I agree with you in that the proof breaks down if the “sampling theory” one is used. With the original formulation, though, it’s rather the principle that breaks down (in terms of capturing the relevant intuition) than the proof.