I am going to post a FIRST draft (for a brief presentation next week in Madrid). [I thank David Cox for the idea!] I expect errors, and I will be very grateful for feedback! This is part I; part II will be posted tomorrow. These posts may disappear once I’ve replaced them with a corrected draft. I’ll then post the draft someplace.

If you wish to share queries/corrections please post as a comment or e-mail: error@vt.edu. (ignore Greek symbols that are not showing correctly, I await fixes by Elbians.) Thanks much!

**ONE: A Conversation between Sir David Cox and D. Mayo (June 2011)**

Toward the end of this exchange, the issue of the Likelihood Principle (LP)[1] arose:

*COX:* It is sometimes claimed that there are logical inconsistencies in frequentist theory, in particular surrounding the strong Likelihood Principle (LP). I know you have written about this; what is your view at the moment?

*MAYO:* What contradiction?

*COX:* Well, that frequentist theory does not obey the strong LP.

*MAYO:* The fact that the frequentist rejects the strong LP is no contradiction.

*COX:* Of course, but the alleged contradiction is that from frequentist principles (sufficiency, conditionality) you should accept the strong LP. The (argument for the) strong LP has always seemed to me totally unconvincing, but the argument is still considered one of the most powerful arguments against the frequentist theory.

*MAYO:* Do you think so?

*COX:* Yes, it’s a radical idea, if it were true.

*MAYO:* You’re not asking me to discuss where Birnbaum goes wrong (are you)?

*COX:* Where *did* Birnbaum go wrong?

*MAYO:* I am not sure it can be talked through readily, even though in one sense it is simple; so I relegate it to an appendix.

It turns out that the premises are inconsistent, so it is not surprising the result is an inconsistency.

The argument is unsound: it is impossible for the premises to all be true at the same time.

Alternatively, if one allows the premises to be true, the argument is not deductively valid. You can take your pick.

Thus arose the challenge to sketch the bare (not bear) bones of this complex business, even though I must direct you to appropriate details elsewhere.

**TWO: The Birnbaum result heralded as a breakthrough in statistics!**

*(indeed it would undo the fundamental feature of error statistics and will be explained):*

Savage:

Without any intent to speak with exaggeration it seems to me that this is really a historic occasion. This paper is a landmark in statistics … I myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great.

….I can’t stop without saying once more that this paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. (Savage 1962).

…people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability…

All error statistical notions, p-values, significance levels,…all violate the likelihood principle (ibid.)

The Birnbaum argument has long been treated, by Bayesians and likelihoodists at least, as a great breakthrough, a landmark, and a momentous event; I have no doubt that revealing the flaw in the alleged proof will not be greeted with anything like the same fanfare (Mayo 2010).

**THREE: (Frequentist) Error Statistical Methods**

Probability arises (in inference) to quantify how frequently methods are capable of discriminating between alternative hypotheses and how reliably they detect errors.

These probabilistic properties of inference procedures are *error frequencies* or *error probabilities*.

Formally: the probabilities refer to the sampling distribution of a statistic T(**x**).

Behavioristic rationale: to control the rate of erroneous inferences (or decisions).

Inferential or testing rationale: to control and appraise the probativeness or severity of tests for a given inference (about some aspect of a data-generating procedure, as modeled); a typical inference would be about the accordance (or discordance) of a model, as indicated by the data.

The general idea of appraising rules probabilistically is very Popperian (so should be familiar to philosophers of science)

In contrast to “probabilism”, the view that inferring a hypothesis *H* is warranted only by showing it is true or probably true, *we may assign probabilities to rules for testing (or estimating) H*.

Good fits between *H* and **x** are “too cheap to be worth having”; they only count if they result from serious attempts to refute *H*.

(I see error statistical methods as allowing us to make good on the Popperian idea, although his tools did not)

*Severity Principle (Weakest):* Data **x** do not provide good evidence for hypothesis *H* if **x** results from a test procedure with a very low probability or capacity of having uncovered the falsity of *H* (even if *H* is incorrect).

Such a test we would say is insufficiently stringent or severe.

Formal error statistical tools may be regarded as providing systematic ways to evaluate and promote this goal

**FOUR: Error Statistical Methods Violate the LP**

(by considering outcomes other than the one observed)

Critics of frequentist error statistics rightly accuse us of insisting on considering outcomes other than the one observed, because that is what is needed to assess probativeness.

A test statistic or distance measure T(**x**) may be regarded as a measure of fit; once we get its value we still want to know how often such a fit with *H* would occur even if *H* is false, i.e., we need the sampling distribution of T(**x**).

Likelihoods (likelihood ratios) yield measures of fit, but crucial information is given by the distribution of that fit measure: if so good a fit (between **x** and *H*) would very probably arise even if *H* were specifiably false, then the good fit is poor evidence for *H*.

Aspects of the data and hypothesis generation can alter the probing capacities of tests (e.g., double-counting, *ad hoc* adjustments, selection effects, hunting for significance), and error probabilities pick this up.

This immediately takes us to the core issue of the LP:

Those who do not accept the likelihood principle believe that the probabilities of sequences that might have occurred, but did not, somehow affect the import of the sequence that did occur (Edwards, Lindman, and Savage 1963, 238)

The error statistician is “guilty as charged!”:

The question of how often a given situation would arise is utterly irrelevant to the question how we should reason when it does arise. I don’t know how many times this simple fact will have to be pointed out before statisticians of “frequentist” persuasions will take note of it. (Jaynes 1976, 247)

What we wonder is how many times we will have to point out that to us, reasoning from the result that arose is crucially dependent on how often it would have arisen…..

Error statistical methods consider outcomes other than the one observed, but they do not say to average over any and all experiments, including ones not even performed!

One of the most common criticisms of frequentist error statistics assumes they do.

Cox had to construct a special principle to make this explicit.

**FIVE: Weak Conditionality (WCP): You should not get Credit (be blamed) for something you don’t deserve**

*A mixture experiment*: Toss a fair coin to determine whether to make 10 or 10,000 observations of *Y*, a normally distributed random variable with unknown mean µ.

For any given result **y**, one could report an overall p-value,

{p’(**y**) + p”(**y**)}/2,

the convex combination (here, the simple average) of the p-values for the two sample sizes.

*(WCP) Conditionality Principle (weak)*: If a mixture experiment (of the above type) is performed, then if it is known which experiment produced the data, inferences about µ *are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed*.

Once we know which tool or test generated the data y, given our inference is about some aspect of what generated y, it should not be influenced by whether a coin was tossed to decide which of two to perform.

If you made only 10 observations, it would be misleading to report this average as your p-value:

“It would mean that an individual fortunate in obtaining the use of a precise instrument sacrifices some of that information in order, in effect, to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available this makes no sense. Once it is known whether E’ or E” has been run, the p-value assessment should be made conditional on the experiment actually run.” (Cox and Mayo 2010 )
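The gap between the averaged and the conditional p-values in the mixture experiment can be sketched numerically. This is only an illustration under stated assumptions, not a calculation from the talk: σ = 1 is taken as known, the test is one-sided (H_{0}: µ = 0 vs. µ > 0), and the observed sample mean `ybar = 0.5` and the helper `p_value` are hypothetical.

```python
import math

def p_value(ybar, n, sigma=1.0):
    """One-sided p-value for H0: mu = 0 vs. mu > 0, given the mean ybar of n
    normal observations with known sigma (sigma = 1 is an assumption)."""
    z = ybar * math.sqrt(n) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail area of N(0, 1)

ybar = 0.5                         # hypothetical observed sample mean
p_small = p_value(ybar, 10)        # conditional p-value if E' (n = 10) was run
p_large = p_value(ybar, 10_000)    # conditional p-value if E'' (n = 10,000) was run
p_avg = (p_small + p_large) / 2    # the averaged report {p' + p''}/2

print(f"conditional on n = 10:      {p_small:.4f}")
print(f"conditional on n = 10,000:  {p_large:.4f}")
print(f"averaged over both:         {p_avg:.4f}")
```

With these made-up numbers, the investigator who actually made 10 observations would report a p-value above .05 when conditioning on the experiment performed, while the averaged report falls below .05 only because of the precise experiment that was never run.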

WCP is a *normative* epistemological claim about the appropriate manner of reaching an inference in the given context.

*Appealing to the severity assessment*: Maybe if all you cared about was low error rates in some long run, defined in some way or other, then you could average over experiments not performed, but low long-run error probabilities are necessary but not sufficient for satisfying severity.

The severity assessment reports on how good a job the test did in uncovering a mistaken claim regarding some aspect of the experiment that actually generated particular data x_{0}.

The WCP is entirely within the frequentist philosophy.

It does not lead to conditioning on the particular sample observed!

Here’s where the Birnbaum result enters—his argument is supposed to show that it does….

How can so innocent a principle as the WCP be claimed to force the error statistician to give up on error probability reports altogether?

**SIX: (Frequentist) Error Statistics Violates the LP— once again, more formally**

**Strong Likelihood Principle (LP).**

*It is a universal conditional claim:*

**If two data sets y’ and y” from experiments E’ and E” respectively, have likelihood functions which are functions of the same parameter(s) µ and are proportional to each other, then y’ and y” should lead to identical inferential conclusions about µ.**

For any two data sets **y’**, **y”**…

Whenever there is a pair of samples **y’**, **y”**…

*y’ is shorthand for (y’ was observed in experiment E’).*

*E’ and E” may have different probability models but with the same unknown parameter µ.*[ii]

**Examples of LP violations**: Fixed vs. Data-Dependent Stopping

E’ and E” might be Binomial sampling with n fixed, and Negative Binomial sampling, respectively.
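A small numerical sketch of this pair may help. The data (9 successes and 3 failures, with E” stopping at the 3rd failure) and the null value θ = 0.5 are illustrative assumptions, not taken from the text above. The two likelihoods are proportional, yet the one-sided p-values differ.

```python
from math import comb

heads, tails = 9, 3            # hypothetical data, chosen for illustration
n = heads + tails

def binom_lik(theta):
    # E': n = 12 trials fixed in advance, 9 successes observed
    return comb(n, heads) * theta**heads * (1 - theta)**tails

def negbin_lik(theta):
    # E'': sample until the 3rd failure, which arrives on trial 12
    return comb(n - 1, tails - 1) * theta**heads * (1 - theta)**tails

# The likelihood ratio is the same constant for every theta (proportionality).
ratios = [binom_lik(t) / negbin_lik(t) for t in (0.2, 0.5, 0.8)]

# One-sided p-values for H0: theta = 0.5 against theta > 0.5.
p_binom = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
# At least 9 successes before the 3rd failure
# <=> at most 2 failures among the first 11 trials.
p_negbin = sum(comb(n - 1, k) for k in range(tails)) / 2**(n - 1)

print("likelihood ratios:", ratios)
print(f"binomial p-value:          {p_binom:.4f}")
print(f"negative binomial p-value: {p_negbin:.4f}")
```

Here the p-values come out at 299/4096 (about .073) and 67/2048 (about .033), so at the .05 level the two sampling schemes give different verdicts on data with proportional likelihoods.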

I will focus on a more extreme example that is very often alluded to in showing the error statistician is guilty of LP violations: *fixed versus optional stopping*

E’ might be iid sampling from a Normal distribution N(µ, σ^{2}), σ known, with a fixed sample size n, and E” the corresponding experiment that uses this stopping rule:

**Keep sampling until H_{0} is rejected at the .05 level**

(Y-bar is the sample mean, *Y_{i}* ~ N(µ, σ^{2}), and we are testing *H*_{0}: µ = 0 vs. *H*_{1}: µ > 0),

i.e., keep sampling until |Y-bar| exceeds 1.96 σ/√n.

The likelihood principle emphasized in Bayesian statistics implies, … that the rules governing when data collection stops are irrelevant to data interpretation. (Edwards, Lindman, Savage 1963, p. 239).

This conflicts with error statistical theory:

We see that in calculating [the posterior], our inference about m, the only contribution of the data is through the likelihood function….In particular, if we have two pieces of data y’ and y” with [proportional] likelihood function ….the inferences about m from the two data sets should be the same. This is not usually true in the orthodox theory and its falsity in that theory is an example of its incoherence. (Lindley 1976, p. 36).

Frequentist inference about µ can take different forms, but since the argument is to be entirely general, and given the need for brevity here, it will be easiest to take a particular kind of inference, say forming a p-value.

*As Lindley rightly claims, there is an LP Violation in the Optional Stopping Experiment:* There is a difference in the corresponding p-values from E’ and E”, written as p’ and p”, respectively.

While p’ would be ∼ .05, p” would be much larger, ∼ .3. The error probability accumulates because of the optional stopping.
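How the error probability accumulates can be checked with a rough simulation. This is a sketch under assumptions, not the calculation behind the figures above: data are drawn with µ = 0 and σ = 1, the test is applied after every observation, sampling is capped at 100 observations, and the function name `stops_by` and the seed are arbitrary choices.

```python
import math
import random

def stops_by(max_n, rng, z_crit=1.96, sigma=1.0):
    """Draw observations from N(0, sigma^2), so H0: mu = 0 is true, and return
    True if |Y-bar| exceeds z_crit * sigma / sqrt(n) at some n <= max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        if abs(total / n) > z_crit * sigma / math.sqrt(n):
            return True   # nominally significant result reached: stop
    return False

rng = random.Random(0)    # fixed seed so the run is repeatable
trials = 5000
rate = sum(stops_by(100, rng) for _ in range(trials)) / trials

print(f"nominal level .05; estimated chance of rejecting a true H0 "
      f"within 100 observations: {rate:.3f}")
```

The estimated rate should come out several times larger than the nominal .05, and raising the cap on n pushes it higher still, which is the sense in which the stopping rule matters to the error statistician.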

Clearly p’ is not equal to p”, so the two outcomes are not evidentially equivalent:

**Infr_{E’}(y’) is not equal to Infr_{E”}(y”)** [for an error statistician]

Infr_{E}(y) abbreviates: the inference[2] based on outcome y from experiment E.

By contrast,

**Infr_{E’}(y’) is equal to Infr_{E”}(y”)** [for one who accepts the LP]

Instead of “is equal to” and “is not equal to”, it would be more accurate to write something like “should be treated as evidentially equivalent to” and “should not be treated as evidentially equivalent to”; the claims are based on one or another methodology or philosophy of inference (but I follow the more usual formulation).

Suppose you observed **y”** from our optional stopping experiment E” that stopped at n = 100.

**Infr_{E’}(y’) is equal to Infr_{E”}(y”)** [for one who accepts the LP],

where **y’** comes from the same experiment but with n fixed at 100.

Bayesians call this the *Stopping Rule Principle (SRP)*.

The SRP would imply, [in the Armitage example], that if the observation in [the case of optional stopping] happened to have *n*=*100*, then the evidentiary content of the data would be the same as if the data had arisen from the fixed sample size experiment (Berger and Wolpert 1988, 76).

Some frequentists argue, correctly I think, that the optional stopping example alone is enough to refute the strong likelihood principle (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though µ = 0.

It violates the principle that we should avoid misleading inferences with high or maximal probability (weak repeated sampling principle).

In our terminology, it permits an inference with minimal severity.

(The example can also be made out in terms of confidence intervals, where the rule ensures, with probability 1, that 0 is excluded from the interval. Berger and Wolpert grant that the frequentist probability that the interval excludes 0, even when µ = 0 is true, is 1; pp. 80-1.)[3]

See continuation: here.

*I want to thank Sir David Cox for numerous discussions and insights regarding these arguments, especially the clarification of the notion of sufficiency, for a frequentist sampling theorist.

[1] I will always mean the “strong” likelihood principle.

[2] In the context of error statistical inference, this is based on the particular statistic and sampling distribution specified by E.

[3] See EGEK, p. 355 for discussion.

[ii] We think this captures the generally agreed upon meaning of the LP although statements may be found that seem stronger. For example, in Pratt, Raiffa, and Schlaifer, 1995:

If, in a given situation, two random variables are observable, and if the value **x** of the first and the value **y** of the second give rise to the same likelihood function, then observing the value **x** of the first and observing the value **y** of the second are equivalent in the sense that *they should give the same inference, analysis, conclusion, decision, action, or anything else*. (Pratt, Raiffa, Schlaifer 1995, 542; emphasis added)

As a Bayesian, I find that your stopping rule example based upon the normal does give me food for thought. Like Berger and Wolpert I am uncomfortable with it, but I think there are some reasonable ways to “accept it”. It does indicate one (by no means the only one) unintuitive consequence of accepting a Bayesian view. I remain a Bayesian, because I feel more comfortable in explaining this unintuitive consequence than others, e.g., the one given by the binomial, negative binomial example that you refer to.

I actually wrote code to simulate this situation. You said that the program should always terminate; in my simulations I often found the running time so long that I had to terminate the simulation early. Is there a known distribution for the experiment length? Also, if there is interest I can post the code.

The first part about this is that even from a Bayesian view I don’t like the formulation of the problem i.e. of determining P(\mu>0|D). It is in my opinion a little bit better to work with P(\mu|D) and a lot better to work with P(X_{n+1}|D).

The typical outcome of the simulation is a posterior P(\mu|D) that is slightly to the positive side, and most of the time very narrow and near zero. In cases where the posterior is very narrow it is also very near zero.

This is still a bit uncomfortable, but a little less so: P(\mu|D) will indicate (in general) that \mu is positive and near zero. If you plot the predictive distribution P(X_{n+1}|D), it will in general be very close to the true “N(0,1)”; a close inspection will show that it is slightly positively biased. If you try to run the experiment for longer in an attempt to distort the predictive distribution you run into the following problem. If you sample for long enough the confidence interval will move eventually (further) to the left, but var[\mu|D] will also reduce and so the predictive distribution will continue to converge to the ‘truth’ regardless.

It seems that either accepting or rejecting the likelihood principle leads to uncomfortable consequences…

Your previous post on trivial intervals was enlightening from the point of view that you clearly feel able to ‘live with’ the fact that a 90% confidence interval may be obviously true or obviously false conditional on the data. I remain perplexed about the purpose of confidence intervals if it isn’t to indicate an interval which is likely to contain the parameter.

urls for two of the refs you give are:

http://www.phil.vt.edu/dmayo/personal_website/EGEKChap10.pdf

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?handle=euclid.lnms/1215466215&view=body&content-type=pdf_1

Results about the lengths of experiments when sampling to a foregone conclusion are in here;

http://www.springerlink.com/content/v74j385227q167k5/

… see the earlier commentary by Stone for the context.

Thank you guest & David. Of course, none of this is quite relevant for the argument purporting to show the LP; it’s only one of zillions of examples that show frequentists violate the LP. That is not in dispute. It is simply (a) more dramatic and (b) an example that Bayesians have used in their favor, or at least they used to.

Hi–I’m still trying to work through this–the general logic of the argument that is. It seems to me that what you are saying is something like: “Start with any violation of the LP, that is, a case where the antecedent of the LP holds, and the consequent does not hold, and show you get a contradiction.” So I am reading your optional stopping example where n=100 in both cases (preplanned trial and optional stopping) as showing how p-values appear to lead to a contradiction for frequentists (if we’re following the LP) because the resulting p-values (which function as evidential measures for frequentists) are different– p=.05 is not equal to p=.3, so a contradiction is derived. And supposedly this would show that frequentist p-values are wrong (as evidential measures at least) if you hold the LP. But isn’t your point that you don’t really get a contradiction, that Birnbaum thinks you get one but you don’t?

And if I am understanding your argument correctly, it seems like the reason there isn’t really a contradiction is because frequentists’ probabilities are all about how the data are generated– (ensuring severe or stringent tests, your weak severity principle, etc.)–and so the evidential statement reflects (and should reflect) that aspect of the experimental situation, i.e., the data generating process. And as that really is different in the two cases (how one got to n = 100), getting different evidential measures is no contradiction. Or am I totally missing the point here?

I also wonder…won’t most Bayesians say they never get a violation of the LP anyway?

Try working through the follow-up post that completes the argument.