*“Model Verification and the Likelihood Principle” by Samuel C. Fletcher*

Department of Logic & Philosophy of Science (PhD Student)

*University of California, Irvine*

I’d like to sketch an idea concerning the applicability of the Likelihood Principle (LP) to non-trivial statistical problems. What I mean by “non-trivial statistical problems” are those involving substantive modeling assumptions, where there could be any doubt that the probability model faithfully represents the mechanism generating the data. (Understanding exactly how scientific models represent phenomena is subtle and important, but it will not be my focus here. For more, see http://plato.stanford.edu/entries/models-science/.) In such cases, it is crucial for the modeler to verify, inasmuch as it is possible, the sufficient faithfulness of those assumptions.

But the techniques used to verify these statistical assumptions are themselves statistical. One can then ask: do techniques of model verification fall under the purview of the LP? That is: are such techniques a part of the inferential procedure constrained by the LP? I will argue the following:

(1) If they are—what I’ll call the *inferential view* of model verification—then there will be in general no inferential procedures that satisfy the LP.

(2) If they are not—what I’ll call the *non-inferential view*—then there are aspects of any evidential evaluation that inferential techniques bound by the LP do not capture.

If (1) and (2) hold, then it follows that the LP cannot be a constraint on any measure of evidence, for either no such measure can satisfy it by (1), or measures that do satisfy it cannot capture essential aspects of evidential bearing by (2). I want to emphasize that I am not arguing for either the inferential or non-inferential view of model verification. (Indeed, I suspect that whether one seems more plausible will be contextual.) Instead I want to point out that, whatever one’s views about the role of model verification in inference, the LP cannot, as many commentators have assumed, constrain the inferential procedures used in non-trivial statistical problems. In the remainder, I will flesh out some arguments for (1) and (2).

First, what does it mean for techniques of model verification to be a part of the inferential procedure constrained by the LP, as the inferential view holds? One way to understand such a view is to represent the probability model as an enormous mixture model f(x | α) = Σ_β I(α = β) f_β(x), where the index β of the indicator function *I* labels all conceivable models one might use for a given statistical problem. Procedures of model verification, then, are inferential in the sense that they select some α in the same way as other procedures of statistical inference select elements from the parameter space of f.
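To make the inferential view concrete, here is a toy sketch of my own (nothing so grand as the full mixture): just two candidate component models, with "verification" cast as inference on the index that selects among them. The data and model choices are illustrative assumptions, not anything from the original discussion.

```python
import numpy as np

# A deterministic stand-in for data from an exponential mechanism:
# midpoint quantiles of an exponential distribution with scale 2.
data = -2.0 * np.log(1.0 - (np.arange(1, 31) - 0.5) / 30.0)

# A tiny stand-in for the enormous mixture: two components, indexed by beta.
# Each component's parameters are fixed at their maximum-likelihood values.
def loglik_normal(x):
    mu, sigma = x.mean(), x.std()
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

def loglik_exponential(x):
    lam = 1.0 / x.mean()
    return np.sum(np.log(lam) - lam * x)

logliks = {"normal": loglik_normal(data), "exponential": loglik_exponential(data)}

# "Model verification" on the inferential view: selecting a value alpha of the
# model index, just as ordinary inference selects a value of a parameter.
alpha = max(logliks, key=logliks.get)
```

For data with this exponential shape, the exponential component wins the comparison, so inference on the index selects it, which is the sense in which verification is here treated as just more parametric inference.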

In general, this huge mixture will be vaguely defined, so it is hard to see how one could apply the LP to it. But even if one could, there would be no general techniques for model verification that conform to it. Essentially, the reason is that all such techniques seem to require the Fisherian logic of testing: one makes an assumption, from which it follows that certain statistics follow certain sampling distributions, which one constructs theoretically or estimates through simulation. To the degree that the data are improbable, one then has reason to reject said assumption. Because it is well known that inferential procedures depending on sampling distributions do not in general satisfy the LP, the same follows for these techniques.
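The Fisherian logic just described can be sketched in a few lines (the data and the choice of check statistic are my own illustrative assumptions): simulate the sampling distribution of a statistic under the assumed model and ask how improbable the observed value is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data; the assumption under test is that they are normal.
data = np.array([2.1, 0.4, 3.7, 1.9, 8.2, 0.8, 2.5, 1.1, 9.6, 1.7])

def skewness(x):
    """Sample skewness: a statistic sensitive to departures from normality."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

observed = skewness(data)

# The Fisherian step: under the assumed (normal) model, with parameters
# estimated from the data, simulate the sampling distribution of the statistic.
sims = np.array([
    skewness(rng.normal(data.mean(), data.std(), size=len(data)))
    for _ in range(10_000)
])

# To the degree the observed value is improbable under that distribution,
# one has reason to reject the assumption.
p_value = np.mean(np.abs(sims) >= abs(observed))
```

Note that the p-value is computed over the simulated unobserved outcomes, which is exactly why procedures of this kind depend on the sampling distribution and so do not in general satisfy the LP.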

Now, such tools are available in a Bayesian context (e.g., Ch. 6 of *Bayesian Data Analysis* (2004), by Gelman et al.), but they too use sampling distributions. Other methods commonly used in Bayesian model checking are essentially comparative, so while they may be useful in their own respects, they cannot suffice to check assumptions generally. For example, the fact that the Bayes factor for two models—the ratio of their marginal likelihoods—favors one over the other by 1,000 doesn’t say anything about whether the favored model could plausibly have generated the data. In other words, because comparative methods must work within a circumscribed class of statistical models, they cannot evaluate the statistical adequacy of that class itself.
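A toy numerical illustration of this point (my own construction, with made-up data): a Bayes factor can be enormous even though neither model could plausibly have generated the data.

```python
import numpy as np

# Made-up data clustered around 4 -- far from both candidate models below.
data = np.array([3.8, 4.1, 4.0, 3.9, 4.2])

def loglik_normal(x, mu):
    """Log-likelihood under a simple (point) hypothesis N(mu, 1)."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# The Bayes factor between N(1,1) and N(0,1) (point hypotheses, so the
# marginal likelihoods are just likelihoods) favors N(1,1) overwhelmingly...
log_bf = loglik_normal(data, 1.0) - loglik_normal(data, 0.0)
bayes_factor = np.exp(log_bf)  # exp(17.5), far beyond 1,000

# ...but the favored model is still grossly inadequate: the sample mean sits
# several standard errors from anything N(1,1) could plausibly produce.
z = (data.mean() - 1.0) / (1.0 / np.sqrt(len(data)))
```

The comparison only ranks the two models relative to each other; it is silent on the adequacy of the circumscribed class {N(0,1), N(1,1)} itself, which is the point at issue.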

What, then, is the non-inferential view of model verification? One way to understand this view is that it divides inferences for the primary parameters of interest from those used to test model assumptions. To a first approximation, this view structures statistical analysis as a two-step process, the first of which involves techniques of model verification to select a sufficiently statistically adequate model. The second step, in which the model is then subjected to inferential procedures for the parameters of scientific interest, is the only one for which the LP applies. (In practice, these steps may be repeated to sequentially test different assumptions, but I will leave that complication aside.)

But if techniques of model verification are bracketed from the inferential procedures constrained by the LP, then those procedures cannot take into account the outcomes of the former in assessing the evidence. Call the outcomes of techniques of model verification *assessments of reliability* of the model to which they are applied. Then measures of evidence that adhere to the LP cannot distinguish between two statistical models that have proportional likelihoods for a given set of data but are not equally reliable.

I take it to be uncontroversial that statisticians concerned with non-trivial statistical problems should care about the outcomes of their model checks—that is, I take it that they should be concerned with the reliability of their modeling assumptions. But any measure of evidence that satisfies the LP cannot take into account this sense of reliability, because the non-inferential conception of model verification brackets information about reliability from the assessment of evidence.

So under either the inferential or non-inferential conceptions of model verification, there are difficulties applying the LP. Under the non-inferential conception, however, proponents of the LP may hold out for a restricted version thereof, one that does not bind any notion of evidence whatsoever but instead does so for a more specialized and circumscribed notion. Perhaps this would go some way towards illuminating the controversial nature of the LP.

**********

*“Remarks on the Likelihood Principle” by Nicole Jinn*

Department of Philosophy (MA student)

*Virginia Tech*

*The general issue as to whether the Likelihood Principle[*] speaks against using sampling distributions for model validation is not yet settled. For this reason, I would like to make a few remarks that I hope will aid in clarifying the Likelihood Principle.*

**Statistical adequacy and model checking**

First, there has been *a lot* of confusion about what it means to adhere to the Likelihood Principle. To shed light on this confusion, consider one of Samuel Fletcher’s statements about restrictiveness: “… even if there were an instance where the Likelihood Principle could apply, the only available techniques for model verification are classical – that is, effectively based on sampling distributions – techniques that, in principle, do not satisfy the Likelihood Principle” (Fletcher 2012a, 8). Failure to satisfy the Likelihood Principle, i.e., violating it, can be roughly understood as considering outcomes other than the one observed. The Likelihood Principle is rightly violated *if* one cares about error probabilities.
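The standard textbook illustration of “considering outcomes other than the one observed” is the binomial/negative-binomial pair: the same record of 9 heads and 3 tails yields proportional likelihoods in the two experiments, yet different p-values. A quick sketch (my own, using the conventional numbers):

```python
from math import comb

theta = 0.5  # null hypothesis: fair coin

# Same record: 9 heads and 3 tails.
# Experiment E': binomial -- toss exactly 12 times.
# Experiment E'': negative binomial -- toss until the 3rd tail appears.
lik_binom = comb(12, 9) * theta**9 * (1 - theta)**3
lik_negbin = comb(11, 9) * theta**9 * (1 - theta)**3
# The likelihoods are proportional (their ratio is free of theta), so the LP
# says the two records have identical evidential import.

# Yet the sampling-theory p-values differ, because the two experiments have
# different sample spaces of unobserved outcomes:
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12   # Pr(9+ heads in 12)
p_negbin = 1 - sum(comb(k + 2, k) * theta**k * (1 - theta)**3
                   for k in range(9))  # Pr(9+ heads before the 3rd tail)
```

Here p_binom is about 0.073 while p_negbin is about 0.033, so a 5% significance tester treats the two records differently; caring about these error probabilities is precisely what makes the violation of the Likelihood Principle the right thing.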

The trouble is, if we have problems with methods that have statistical assumptions, as (Fletcher 2012a) does, we would have problems with all of them: we could not use significance tests or any methods for parameter estimation, because they all depend on the adequacy of the model. It’s very important to emphasize that statistical adequacy of a model requires *only* that the error probabilities computed from the model be *approximately* equal to the actual ones in using the appropriate statistical methodology. However, in (Fletcher 2012b), Fletcher asks what “actual error probabilities” are and whether we have (epistemological) access to them. The context I had in mind when defining statistical adequacy is this: “approximate” is juxtaposed with “exact”, meaning that even if one is *not* in a position to calculate the error probabilities exactly, one can attempt to approximate what they *would* be. An example of a non-approximate error probability can be found in (Mayo 2012, sec. 6.2).
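The contrast between computed and actual error probabilities can be simulated (a sketch of my own, not from the original posts): a nominal 5% test has roughly its computed type-I error rate when its i.i.d. assumption holds, and a very different actual rate when it fails.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, crit = 50, 4000, 1.96  # nominal two-sided 5% z-test

def rejects(x):
    """Reject H0: mean = 0, using a test that *assumes* i.i.d. observations."""
    z = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
    return abs(z) > crit

# Case 1: the i.i.d. assumption holds, so the computed (nominal) 5% error
# probability approximately matches the actual one.
iid_rate = np.mean([rejects(rng.normal(size=n)) for _ in range(reps)])

# Case 2: the assumption fails (positively autocorrelated AR(1) errors),
# and the actual type-I error rate is far above the computed 5%.
def ar1(n, rho=0.7):
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

ar_rate = np.mean([rejects(ar1(n)) for _ in range(reps)])
```

In the first case the rejection rate lands near 0.05; in the second it is several times larger, which is the sense in which a statistically inadequate model delivers computed error probabilities nowhere near the actual ones.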

Admittedly, even some textbooks are confused about the point at issue. Leading statisticians George Casella and Roger Berger state, “Most data analysts perform some sort of ‘model checking’ when analyzing a set of data. For example, it is common practice to examine residuals from a model … such a practice directly violates the Likelihood Principle” (Berger and Casella 2002, 295–296). Professor Deborah Mayo comments on this passage in an earlier blog post.

A better way to say what Casella and Berger mean is that the Likelihood Principle is inapplicable *if* you don’t know the underlying model. Fletcher even acknowledges that we *must* be comfortable with the model *before* considering the Likelihood Principle (Fletcher 2012b).

**Reliability and using error probabilities**

Second, a notion of reliability seems to violate the Likelihood Principle. “[W]e want our experiments to be reliable so that we can trust the evidence they produce. If this is one such reason, though, it does not make sense for one to care about the reliability of an experimental design but maintain that this reliability has no bearing on the evidence the experiment produces” (Fletcher 2012a, 5). Fletcher thinks, as error statisticians strongly advocate, that appraising evidence *cannot* occur without considering error probabilities, which requires considering the sampling distribution, which violates the Likelihood Principle. By contrast, the Likelihood Principle tells us that the sampling distribution is irrelevant to inference once the data are known. On the other hand, Fletcher in (Fletcher 2012b) warns that there may be more than one related notion of evidence, which raises the question of whether appraising evidence means assessing the reliability of a statistical method. Put another way, there does *not* seem to be a lucid notion of reliable evidence in terms of adherence (or not) to the Likelihood Principle.

Nonetheless, those who embrace the Likelihood Principle, such as Bayesians, *still allow* that other elements, e.g., priors and costs, are needed for a *full* inference or decision. Their position is that the evidential import of the data comes through the likelihood. A respectable number of researchers allow that the Likelihood Principle only goes as far as the information from the data *within* the model for the experiment, and therefore differences *could* occur in priors and utilities. But why would there be differences in priors if the hypothesis is the same? The Likelihood Principle applies in the context of two distinct models, yet the *same* inference is made in *both* models in terms of a *common* set of unknown parameter(s).

Furthermore, the notion of a “relevant difference in utilities or prior probabilities” (Gandenberger) is dubious because a unified theory of relevance still does not exist, in the sense of (Seidenfeld 1979, 219). After all, a sufficient statistic supposedly captures as much of the relevant information in the original data as possible. But exactly how do we make sense of what counts as relevant information? To shed light on establishing methodological variants of the sufficiency and conditionality principles, (Fisher 1922) might help us (re)consider the purpose of statistical methods. I couldn’t agree with Fletcher more in advocating the use of error probabilities, and that is really the ground for rejecting the (Evidential) Likelihood Principle. *That is the ground that needs to be emphasized.*

Strictly speaking, those who accept *any* version of the Likelihood Principle *could* allow model checking that uses methods that employ sampling distributions and error probabilities. Maybe they’re schizophrenic, but they do it. Fortunately, the (Evidential) Likelihood Principle does *not* follow from the principles thought to entail it, as has been demonstrated recently in (Mayo 2010) and the appendix to (Mayo and Cox 2011).

**References**

Berger, R. L., and G. Casella. 2002. *Statistical Inference*. 2nd ed. Duxbury Press.

Fisher, R. A. 1922. “On the Mathematical Foundations of Theoretical Statistics.” *Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character* 222 (594-604) (January 1): 309–368. doi:10.1098/rsta.1922.0009. http://rsta.royalsocietypublishing.org/content/222/594-604/309.

Fletcher, Samuel. 2012a. “Design and Verify”. Unpublished discussion. Virginia Tech Graduate Philosophy of Science Conference.

———. 2012b. “Design and Verify”. Presentation slides. Virginia Tech Graduate Philosophy of Science Conference.

Mayo, Deborah G. 2010. “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Mayo, D. G., and D. R. Cox. 2011. “Statistical Scientist Meets a Philosopher of Science: A Conversation with Sir David Cox.” *Rationality, Markets and Morals (RMM)* 2, Special Topic: Statistical Science and Philosophy of Science: 103–114.

Mayo, Deborah G. 2012. “Statistical Science Meets Philosophy of Science Part 2: Shallow Versus Deep Explorations.” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics* 3 (Special Topic: Statistical Science and Philosophy of Science) (September 26): 71–107.

Seidenfeld, T. 1979. *Philosophical Problems of Statistical Inference: Learning from R.A. Fisher*. Springer.

[*] Likelihood Principle (LP): For any two experiments E’ and E” with different probability models *f’*, *f”* but with the same unknown parameter θ, if the likelihoods of outcomes x’* and x”* (from E’ and E” respectively) are proportional to each other, then x’* and x”* should have identical evidential import for any inference concerning the parameter θ.

**Background to these U-Phils may be found here.**

An earlier exchange between Fletcher and Jinn took place at the Virginia Tech Philosophy graduate student conference, fall 2012.

I am happy to see posts here from fellow graduate students who are thinking hard about problems related to the likelihood principle. I need to think more about Sam’s argument, which seems to provide another reason to suspect that the likelihood principle, even if true, does not have the radical implications for statistical practice that it is often taken to have. I don’t see a problem with the notion of “a relevant difference in prior probabilities or utilities.” The fact that there is disagreement about what aspects of the data and sampling distribution are relevant concerns a separate issue. My main point was just that it’s not necessarily a violation of the likelihood principle for two people with different priors or utilities to reach different conclusions from data with the same likelihood function (contra http://normaldeviate.wordpress.com/2012/07/28/statistical-principles/). I agree with Nicole’s point that the key question for a statistical principle is not how well it accords with our intuitions but how well it serves the purposes of statistical methods.

Samuel and Nicole: Good arguments, as far as I’m concerned.

However, Samuel, I’d object against the term “model verification”, because models cannot be verified, only checked.

Greg: What would it mean to say that “the likelihood principle is true”? How could the truth of such principles be decided? (I ask this because you write “the likelihood principle, even if true”.)

Christian: I think it is pretty clear what it means for the LP to be true: relevant evidential import for parametric inference within model M is fully contained in the likelihood function. No mysterious “realism” about statistical principles is required to either hold or question the correctness of the LP for parametric inference, as defined.

“I think it is pretty clear what it means for the LP to be true: relevant evidential import for parametric inference within model M is fully contained in the likelihood function. No mysterious “realism” about statistical principles is required to either hold or question the correctness of the LP for parametric inference, as defined.”

But this seems to depend entirely on what one thinks is relevant. Is the sampling distribution relevant or not? That’s a matter of decision, isn’t it? (Of course there are arguments either way and one can exchange them but still…)

I’d always argue that I wouldn’t accept the likelihood principle as a general principle and I have my arguments, but that’s different from saying that it is false. (I’d only argue that it’s false to claim that it’s objectively true.;-)

The LP is a universal generalization about pairs of experiments and outcomes, so if there’s any counterexample in a methodology, then it is false in that methodology. And of course regardless of that, the “proof” fails to permit detaching the LP as a true conclusion.

Christian: Thanks — I don’t have much stock in one term over another, but to my ear they’re synonymous.

I want to thank both Fletcher and Jinn for complex and highly interesting U-Phils. First, a side remark on Fletcher’s claim that “To the degree that the data are improbable, one then has reason to reject said assumption”. All data are improbable under various hypotheses, so this would never suffice to declare genuine evidence against the null hypothesis. Improbable results are not, by themselves, genuine experimental effects, which is why Fisher insisted that an experimental effect is not demonstrated until one knows how to generate statistically significant results “at will”, so to speak.

As for the LP, which will always mean the strong LP (SLP) here, I think Fletcher raises an important point. However, in general, what is relevant will depend on the particular stage of inquiry and question being asked. To be fair to the strong likelihood principle, it depends on the overarching statistical model being adequate, or not in question for the moment. Therefore when the model itself is in question, the LP (in relation to the primary model) is simply inapplicable rather than false. Jinn’s qualification/correction of Casella and R. Berger is well taken. (I am surprised they made this slip.) Please note my post: failing to apply vs violating the LP:

https://errorstatistics.com/2012/08/31/failing-to-apply-vs-violating-the-likelihood-principle/

Now tests of statistical assumptions generally have their own assumptions, but they are not on the order of the assumptions of the primary model: one must keep track of what is under test, and what is not, in applying statistical techniques. We had a unit on misspecification tests starting with

https://errorstatistics.com/2012/02/22/2294/ .

Deborah: I’m equally thankful that you’ve offered this forum to discuss these interesting issues. With regard to your first (side) remark, I’m not sure to what you’re objecting, exactly. I had in mind something like the test for independence in a regression model, as you outline in 6.1.2 of your (2012) that NJ cited above. There you seem comfortable with the idea of evidence against the null.

With regard to your second remark, it seems to me that you’re advocating for the non-inferential view of model verification. I agree that under this view, the LP does not apply to techniques of model verification — and in particular, it cannot be sensitive to crucial issues of reliability (e.g., severity). Since I’m not arguing against (or for) this view, nor for the falsity of the LP, I’m not sure how I’m being unfair to it.

To really nail down the notion of “all conceivable models one might use for a given statistical problem”, I think it’s reasonable to define them as models that correspond to Turing-computable measures over infinite binary strings. The “Turing-computable” part captures the notion that scientists are interested in statistical models for which there exist effective procedures (that is, procedures humans can actually carry out) to calculate predictions of all sorts (e.g., error probabilities). The restriction to measures on sets of infinite binary strings is really no restriction at all — almost all statistical calculations actually done today are done using computers, and therefore amount to manipulations of scientific data represented as finite binary strings (never mind infinite ones). With that said, I take issue with the following passage from Fletcher’s post:

“In general, this huge mixture will be vaguely defined, so it is hard to see how one could apply the LP to it. But even if one could, there would be no general techniques for model verification that conform to it.”

The first sentence is false — the huge mixture in question can be defined quite precisely in rather the same sense that Chaitin’s construction can be defined precisely. The mixture is a Kolmogorov-complexity-weighted mixture of semi-computable semi-measures; it’s known as the Solomonoff universal prior. See for example Chapter 5 of Li and Vitányi, *An Introduction to Kolmogorov Complexity and Its Applications*, 3rd ed. (A noteworthy property of this prior is given in Corollary 5.2.1 in that book. In verbose English, it says that for any computable model, with probability 1, the posterior predictive distribution with respect to the Solomonoff universal prior converges to the model’s sampling distribution.)

The second sentence is true but vacuous — by construction, the mixture includes every model of possible interest to scientists, so no “model verification” is needed.

It’s important to emphasize that the reason the Solomonoff universal prior hasn’t made all scientists obsolete is that it (or rather, the version of it that is normalized to form a true probability measure) is uncomputable. It’s meant to provide a theoretical model of ideal, superhuman induction. (It seems plausible to me that this refutes Nicole Jinn’s claim that a unified theory of relevance in the sense of (Seidenfeld 1979, 219) still does not exist, but I can’t get a hold of the Seidenfeld text, so I can’t be sure.) Insofar as Solomonoff induction is considered ideal, it seems desirable to approximate it as closely as possible; granting that, the key question with respect to the LP would seem to be: in what circumstances, if any, does adhering to the LP help one approximate Solomonoff induction?

Corey: The sense of “vaguely defined” I intended is not one that precludes a mathematical proof of (unique) existence. (Indeed, I have taken such a proof for granted in assuming that the big mixture model is mathematically well-defined.) Rather, I meant to refer to how such a mixture model would be used in practice, where I don’t think anyone would give an effective procedure to enumerate all possible models and explicitly compute the big mixture likelihood function for the recorded data. Nevertheless, I think that you’re perhaps right to imply that “vague” may not be the best choice of words. What do you think would be better in this context?

With respect to your second comment (about vacuity), remember that the inferential view takes model verification to be inferential. Insofar as researchers aren’t actually going to work explicitly with the huge mixture model in practice, they will use model verification (inference to one or a few components of the mixture) to make their work tractable.

I might note that while I think it is surely correct to call model validation and misspecification testing “inferential”, it won’t always involve a proper “statistical inference”. Much like ascertaining that the sun spoilt some of the eclipse photos, in some contexts we are pinpointing the source of a known effect rather than generalizing. It’s a stretch to call such cases statistical generalization. This is of interest in its own right, I am finding, aside from the fact that the LP concerns parametric statistical inference within a model.

Sam: I’d write something like, “For inference adhering to the LP, the need for a practical, tractable procedure requires some restriction of the set of models under consideration. Any such a priori restriction seems questionable in that it may exclude the true data generating mechanism, so there is a need to verify it vis-à-vis the realized data.” Does that seem close to the notion you want to convey?

Corey: I think that’s a fair statement of what a proponent of the inferential view might say.

Nicole: I’m puzzled why you think I have problems with methods that have statistical assumptions. (I don’t.) Also I’m not sure why you take me to think that one “must be comfortable with the model before considering the Likelihood Principle.” Under the inferential view of model verification, for example, one applies the LP to a huge mixture of all conceivable models. Must one be “comfortable” with all conceivable models?

Sam: I don’t understand about applying the LP to a huge mixture of all conceivable models: again, the LP is always defined in terms of relevant information for parametric inference within a model M (the “comfortable” phrase was Casella and Berger’s).

Deborah: The huge mixture is a perfectly well-defined parametric model where β labels each conceivable model. Model verification is inference on this parameter, which is how, on the inferential view, the LP applies to model verification.

As arose in my helpful exchange with Corey above, despite being well-defined it is totally impractical. So while it is in principle possible to apply the LP to the huge mixture, no one is going to do it in full detail. (This is what I meant when I said, “this huge mixture will be vaguely defined, so it is hard to see how one could apply the LP to it.”) In practice, one might not use the full mixture but idealize to a reduced mixture model.

I also want to emphasize that nothing about my argument requires anyone to find the inferential view attractive. I included it because there are a remarkable number of statisticians (like Casella & Berger) who claim that certain techniques of model verification are incompatible with the LP. But this is just impossible under the non-inferential view, so I felt that charity demanded that I address it.

Sam: Ok, thank you for the clarification – I will keep that in mind from this point onwards (that you do not have problems with methods that have statistical assumptions).

You made reference to Casella & Berger in your rejoinder (i.e., the last part of your presentation slides at the Graduate Conference in November). From that reference, I *assumed* you had *no* serious problems/disagreements with the quote in your rejoinder, which is: “[I]t must be realized that before considering the Sufficiency Principle (or the Likelihood Principle), we must be comfortable with the model” (page 296).

For your last question, I, too, do *not* understand about applying the LP to a huge mixture of all conceivable models.