# Statistical “reforms” without philosophy are blind (v update)


Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is becoming of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some are quite welcome, others are quite radical. Too rarely do the reformers bring out the philosophical presuppositions of the criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here, I tried to quickly jot down some queries. (Look for later installments and links.) What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values?

I. To get at philosophical underpinnings, the single most important question is this:

(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data?

Three Roles For Probability: Degrees of Confirmation, Degrees of Long-run Error Rates, Degrees of Well-Testedness

A. Probabilism: To assign degrees of probability, confirmation, support or belief in hypotheses and other claims: absolute[a] (e.g., Bayesian posteriors, confirmation measures) or comparative (likelihood ratios, Bayes factors).

B. Performance (inductive behavior philosophy): To ensure long-run reliability of methods (e.g., Neyman-Pearson NP behavioristic construal; high throughput screening, false discovery rates).

C. Probativeness (falsification/corroboration philosophy): To determine the warrant of claims by assessing how stringently tested or severely probed they are. (Popper, Peirce, Mayo)

Error Probability Methods: In B and C, unlike A, probability attaches to the methods of testing or estimation. These “methodological probabilities” report on their ability to control the probability of erroneous interpretations of data.

The inferences (some call them “actions”) may take several forms: declare there is/is not evidence for a claim or a solution to a problem; infer there’s grounds to modify a model, etc. Since these inferences go beyond the data, they are inductive and thus, open to error. The methodological probabilities are also called error probabilities. They are defined in terms of the sampling distribution of a statistic.[b]

Some spin-off questions:

(2) Do criticisms of P-values assume probabilism?

We often hear: “There is nothing philosophical about our criticism of statistical significance tests. The problem is that a small P-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis that the observed difference is mere chance.” Really? P-values are not intended to be used this way; presupposing they ought to be so interpreted grows out of a specific conception of the role of probability in statistical inference. That conception is philosophical.

a. Probabilism says H is not warranted unless it’s true or probable (or increases probability).

b. Performance says H is not warranted unless it stems from a method with low long-run error.

c. Probativism says H is not warranted unless something (a fair amount) has been done to probe, and rule out, ways we can be wrong about H.

Remark. In order to show that a probabilist reform (in the form of posteriors) is adequate for error statistical goals, it must be shown that a high posterior probability in H corresponds to having done a good job ruling out the ways we can be mistaken about H. In this connection, please see Section IV.

It’s not clear how comparative reports (C is favorable relative to C’) reach inferences about evidence for C.

In this connection be sure to ask: Do advocates of posterior probabilities tell us whether their priors will be conventional (default, reference), and if so, which? Frequentist? Or subjective? (Or are they just technical strategies to estimate parameters, justified on error statistical grounds?)

• A very common criticism is that P-values exaggerate the evidence against the null: A statistically significant difference from H0 can correspond to large posteriors in H0. From the Bayesian perspective, it follows that P-values “exaggerate” the evidence; but the significance testers balk at the fact that the recommended priors result in highly significant results being construed as no evidence against the null, or even evidence for it! Nor will it do to simply match numbers.
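The clash between a small P-value and a large posterior on the null can be made concrete with a small calculation. What follows is only a sketch under assumptions that are not in the post: a normal mean with known σ, a point null μ = 0 given prior probability 0.5, and a N(0, τ²) prior on μ under the alternative. The function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def posterior_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of H0: mu = 0 under an assumed spike-and-slab prior.

    z is the observed statistic sqrt(n) * xbar / sigma. Under the alternative,
    mu is assumed to follow N(0, tau^2); this prior is an illustrative choice,
    not one taken from the post.
    """
    r = n * tau ** 2 / sigma ** 2
    # Bayes factor in favour of H0: ratio of the marginal densities of xbar
    # under H0 (variance sigma^2/n) and under H1 (variance tau^2 + sigma^2/n).
    bf01 = math.sqrt(1 + r) * math.exp(-0.5 * z ** 2 * r / (1 + r))
    posterior_odds = (prior_null / (1 - prior_null)) * bf01
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    z = 1.96  # just significant at the two-sided 0.05 level
    print(f"P-value: {two_sided_p(z):.3f}")
    for n in (10, 100, 1000):
        print(f"n = {n:4d}  posterior on H0 = {posterior_null(z, n):.2f}")
```

With these assumed priors, the same z = 1.96 gives a posterior on H0 of roughly 0.37 at n = 10 and roughly 0.82 at n = 1000. The numbers depend entirely on the spike at the null and the choice of τ, which is exactly what the significance tester questions.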

It’s crucial to be able to say, H is highly believable but poorly tested. Even if you’re a probabilist, you can allow for the distinct tasks of stringent testing and error probes (probativism).

Different philosophies of statistics are fine; but assuming one as grounds for criticizing another leads to question-begging and confusion.

(3) Are critics correctly representing tests?

• Do criticisms of P-values distinguish between simple (or “pure”) significance tests, and Neyman-Pearson (NP) tests and confidence intervals (within a model)? The most confused notion of all (often appropriated for unintended tasks) is that of power. (Search this blog for quite a lot on power.)
• Are criticisms just pointing up well-known fallacies of rejection and non-rejection that good practitioners know to avoid? (i) (E.g., confusing nominal P-values and actual P-values.) Do their criticisms relate to an abusive animal (NHST) that permits moving from statistical inference to a substantive research hypothesis (as we have seen, at least since Paul Meehl, in psychology)?
• Underlying many criticisms is the presupposition that error probabilities must be misinterpreted to be relevant. This follows from assuming that error probabilities are irrelevant to qualifying particular scientific inferences. In fact, error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, or confirmation. Looking at hypothetical long-runs serves to understand the properties of the methods for this inference.

Notice, the problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, barn hunting, etc. are not problems about long-runs. It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data.

Loads of background information enters informally at all stages of planning, collecting, modeling and interpreting data. (Please search “background information” on this blog.)

I link to some relevant papers, Mayo and Cox (2006), and Mayo and Spanos (2006).

II. Philosophers are especially skilled at pointing up paradoxes, inconsistencies and ironies [ii]

Critic: It’s too easy to satisfy standard significance thresholds.

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs…

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference!

It’s actually an asset of P-values that they are demonstrably altered by biasing selection effects (hunting, fishing, cherry picking, multiple testing, stopping rules, etc.). Likelihood ratios are not altered. This is formalized in the likelihood principle.
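To see the contrast in a toy case, here is a hedged sketch of "hunting": 20 true null effects are tested and only the most impressive one is reported. The setup (independent standard normal test statistics, a Šidák adjustment, and the likelihood ratio of the best-fitting mean against the null) is my own illustrative choice, and all function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def likelihood_ratio(z):
    """Likelihood of the best-fitting mean relative to the null, given z.

    It is a function of the reported z alone, so it is the same number
    whether z came from one pre-specified test or from a hunt through many.
    """
    return math.exp(z ** 2 / 2)


def hunting_rates(trials=5000, k=20, alpha=0.05, seed=7):
    """How often the selected result looks significant when all k nulls are true."""
    rng = random.Random(seed)
    nominal_hits = 0
    adjusted_hits = 0
    for _ in range(trials):
        zs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        z_best = max(zs, key=abs)               # report only the largest effect
        p_nominal = two_sided_p(z_best)         # ignores the hunting
        p_adjusted = 1 - (1 - p_nominal) ** k   # Sidak adjustment for k tests
        nominal_hits += p_nominal < alpha
        adjusted_hits += p_adjusted < alpha
    return nominal_hits / trials, adjusted_hits / trials


if __name__ == "__main__":
    nominal, adjusted = hunting_rates()
    print(f"nominal P < 0.05 after hunting:  {nominal:.2f}")
    print(f"adjusted P < 0.05 after hunting: {adjusted:.2f}")
    print(f"likelihood ratio at z = 1.96:    {likelihood_ratio(1.96):.2f}")
```

In this setup the nominal P-value falls below 0.05 far more often than 5% of the time, while the adjusted P-value registers the hunting and brings the rate back near 0.05. The likelihood ratio has no such handle: it depends only on the z that was reported.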

(4) Likelihood principle: Do critics assume inference must satisfy the likelihood principle—(all of the evidential import is in the likelihoods, given the model)? This is at odds with the use of error probabilities of methods.

• Probabilist reforms often recommend replacing tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or just lowering the P-value (so that the maximal likely alternative gets .95 posterior).

The problem is, the same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals. With one big difference: Your direct basis for criticism and possible adjustments has just vanished!
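The optional stopping point can be checked by simulation. This is a minimal sketch under assumptions of my own: normal data with known σ = 1, a true null μ = 0, and a flat prior on μ, so that the 95% HPD interval is x̄ ± 1.96/√n. The analyst looks after every batch of 10 observations, up to 20 looks, and stops as soon as the interval excludes 0. The function names are hypothetical.

```python
import math
import random

Z95 = 1.959964  # 97.5th percentile of the standard normal


def interval_excludes_zero(total, n):
    """True if xbar +/- Z95/sqrt(n) excludes 0 (sigma = 1 assumed known).

    Under a flat prior on mu this interval is also the 95% HPD interval.
    """
    xbar = total / n
    return abs(xbar) > Z95 / math.sqrt(n)


def stopping_rates(trials=2000, looks=20, batch=10, seed=1):
    """Rate at which the reported interval excludes the true value mu = 0."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(trials):
        total, n = 0.0, 0
        excluded_at_some_look = False
        excluded_now = False
        for _ in range(looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            excluded_now = interval_excludes_zero(total, n)
            excluded_at_some_look = excluded_at_some_look or excluded_now
        # Fixed design: one interval, computed at the final sample size.
        fixed_hits += excluded_now
        # Optional stopping: the analyst stops at the first exclusion, so the
        # reported interval excludes 0 whenever any look excluded it.
        peeking_hits += excluded_at_some_look
    return fixed_hits / trials, peeking_hits / trials


if __name__ == "__main__":
    fixed, peeking = stopping_rates()
    print(f"fixed sample size:  {fixed:.3f}")
    print(f"optional stopping:  {peeking:.3f}")
```

The interval reported at the stopping point is computed by the same formula either way, which is why nothing in the posterior itself signals that the try-and-try-again procedure was used. Only the error probability of the overall procedure picks it up.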

• All traditional probabilisms obey the likelihood principle; violating it, however, (as with conventional priors) doesn’t automatically yield good error control.

Some critics are admirably forthcoming about how the likelihood principle surrenders this basis–something entirely apt under the likelihoodist philosophy [iii]. Take epidemiologist Stephen Goodman:

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010).

On the error statistical philosophy, it has a lot to do with the data.

III. Conclusion so far: It’s not that I’m keen to defend many common uses of significance tests (at least without subsequent assessments of discrepancies indicated), it’s just that highly influential criticisms are based on serious misunderstandings of the nature and role of these methods; consequently so are many “reforms”.

How can you be clear the reforms are better if you might be mistaken about existing methods?

Some statisticians employ several different methods, even within a given inquiry, and so long as the correct interpretations are kept in mind, no difficulty results. In some cases, a vital means toward self-correction and triangulation comes about by examining the data from more than one perspective. For example, simple significance tests are often used in order to test statistical assumptions of models, which may then be modified or fed into subsequent inferences.[v] This reminds me:

(6) Is it consistent to criticize P-values for being based on statistical assumptions, while simple significance tests are the primary method for testing assumptions of statistical models? (Even some Bayesians will use them for this purpose.)

Quick jotting on a huge topic is never as succinct as intended. Send corrections, comments, and questions. I will likely update this. Here’s a promised update:

(IV) Zeroing in on a key point that the reformers leave unacceptably vague:

One of the major sources of hand-wringing is the charge that P-values are often viewed as posterior probabilities in the null or non-null hypotheses. (A) But what is the evidence of this? (B) And how shall we interpret a legitimate posterior ‘hypothesis probability’?

(A) It will be wondered how I can possibly challenge this. Don’t we hear people saying that when a null of “no effect” or “no increased risk” is rejected at the .05 level, or with P-value .05 or .01, this means there’s “probably an effect” or there’s “probably evidence of an increased risk” or some such thing?

Sure, but if you ask most people how they understand the .05 or .01, you’ll find they mean something more like a methodological probability than a hypothesis probability. They mean something more like:

• this inference was the outgrowth of a reliable procedure, e.g., one that erroneously infers an effect with probability .01.

or

• 95% or 99% of the time, a smaller observed difference would result if in fact the data were due to expected variability, as described under the null hypothesis.
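The second reading can be checked directly against the sampling distribution of the test statistic. The sketch below assumes a one-sided test of a normal mean with known σ = 1 and n = 25, and an observed statistic whose P-value is .01; these specifics and the function name are illustrative assumptions, not taken from any study discussed here.

```python
import math
import random
from statistics import NormalDist


def fraction_smaller_under_null(z_obs, n=25, trials=10000, seed=3):
    """Share of simulated null experiments whose statistic falls below z_obs."""
    rng = random.Random(seed)
    smaller = 0
    for _ in range(trials):
        # Data generated under the null: mean 0, known sigma = 1.
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = xbar * math.sqrt(n)
        smaller += z < z_obs
    return smaller / trials


if __name__ == "__main__":
    z_obs = NormalDist().inv_cdf(0.99)  # statistic with one-sided P-value .01
    print(f"observed z: {z_obs:.3f}")
    print(f"fraction of null experiments with a smaller difference: "
          f"{fraction_smaller_under_null(z_obs):.3f}")
```

The probability here attaches to the method: it says how often so large a difference would be generated were the null adequate. Nothing in the computation assigns a probability to the null hypothesis itself.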

Such “methodological probabilities” are akin to either the “performance” or “probativeness” readings above. They are akin to what many call a “confidence concept” or confidence distribution, or what Popper called corroboration. Don Fraser argues (“Quick and dirty confidence” paper, 2011) that this is the more fundamental notion of probability, and blames Lindley for arbitrarily deciding that whenever a posterior disagreed with a confidence distribution notion, only the former would count. Fraser claims that this was a mistake, but never mind. The important point is that no one has indicated why they’re so sure that the “misinterpreters” of the P-value don’t have the confidence or corroboration (or severe testing) notion in mind.

(B) How shall we interpret a legitimate posterior hypothesis probability?

As often as it’s said that the P-value is not the posterior probability that the null hypothesis is true, critics rarely go on to tell us what the posterior probability would mean, and whether and why it should be wanted. There is an implicit suggestion that there’s a better assessment of evidence out there (offered by a posterior hypothesis probability). What kind of prior? Conventional, subjective, frequentist (empirical Bayes)? Reformers will rarely tell us.

The most prevalent view of a posterior probability is in terms of betting. I don’t think the betting notion is terribly clear, but it seems to be the one people fall back on. So if Ann assigns the null hypothesis .05 posterior probability, it means she views betting on the truth of H0 as if she’s betting on an event with known probability of .05. She’d post odds accordingly, at least in a hypothetical bet. (If you think another notion works better, please comment.)

Is that really what Ann means when she takes a statistically significant result as evidence of a discrepancy from the null, or as evidence of a genuine risk, non-chance result, or the like?

Perhaps this could be put to empirical test. I’ll bet people would be surprised to find that Ann is more inclined to have something like methodological probability in mind, rather than betting probability.

An important point about English and technical notions:

In English, “a strong warrant for claim H” could well be described as saying H is probable or plausible. Being able to reliably bring about statistically significant effects may well warrant inferring genuine experimental effects. Therefore, using the ordinary English notion of “probable”, P-values (regularly produced) do make it “probable” that the effect is real. [I’m not advocating this usage, only suggesting it makes sense of common understandings.]

We must distinguish the ordinary English notions of probability and likelihood from the technical ones, but my point is that we shouldn’t assume that the English notion of “good evidence for” is captured by a formal posterior probability. Likewise if you ask people what they mean by assigning .95 to a .95 highest probability density (HPD) interval they will generally say something like, this method produces true estimates 95% of the time.

(V) 10/24/15, 10/27/15: The improbability or infrequency with which the pattern of observed differences is “due to chance” is thought to be a posterior probability

The other central reason that people suppose a P-value is misinterpreted as a posterior is a misunderstanding as to what is meant by reporting how infrequently such impressive patterns could be generated by expected chance variability. “Due to chance” is not the best term, but in the context of a strong argument for ruling out “flukes” it’s clear what is meant. Contrary to what many suppose, the null hypothesis does not assert the results are due to chance, but at most entails that the results are due to chance. When there’s a strong argument for inferring one has got hold of a genuine experimental effect, as when it’s been reliably produced by independent instruments with known calibrations, any attempt to state how improbable it is that all these procedures show the effect “by chance” simply does not do justice to the inference. It’s more like denying a Cartesian demon could be responsible for deliberately manipulating well-checked measuring devices just to trick me. In methodological falsification, of which statistical tests are examples, we infer the effect is genuine. That is the inference. (We may then set about to estimate its magnitude.) The inference to a “real (non-chance) effect” is qualified by the error probabilities of the test, but they are not assigned to the inference as a posterior would be. I’ve discussed this at length on this blog, notably in relation to the Higgs discovery and probable flukes. See for example here.

Notes:

i. R.A. Fisher was quite clear:

In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

[ii] Two examples: “Irony and Bad Faith: Deconstructing Bayesians”

“Some Ironies in the Replication Crisis in Social Psychology”

[v] David Cox gave an excellent “taxonomy” of tests, years ago.

—————–

Notes added after 10/18/15

[a] “absolute” vs comparative is a common way to distinguish “straight up” posteriors from comparative measures, but it’s not a very good term. What should we call it? Maclaren notes (in a comment) that Gelman doesn’t fit here and I agree, insofar as I understand his position. The Bayesian tester or Bayesian “falsificationist” may be better placed under the error statistician umbrella, and he calls himself that (e.g., in Gelman and Shalizi, I think it’s 2013). The inference is neither probabilistic updating nor a Bayes boost/Bayes factor measure.

[b] There may well be Bayesians who fall under the error statistical rubric (e.g., Gelman?). But the recommended reforms and reconciliations, to my knowledge, take the form of probabilisms. An eclecticist like Box, as I understand him, still falls under probabilism, insofar as that is the form of his “estimation”, even though he insists on using frequentist significance tests for developing a model. In fact, Box regards significance tests as the necessary engine for discovery. I thank Maclaren for his comment.

References:

Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.

Goodman, S. 1999. ‘Toward Evidence-Based Medical Statistics. 2: The Bayes Factor,’ Annals of Internal Medicine, 130(12): 1005-1013.

Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

### 193 thoughts on “Statistical “reforms” without philosophy are blind (v update)”

1. Mayo –
Under ‘Probabilism’ you list Bayesian posteriors as an ‘absolute’ measure of support for ‘hypotheses’.

In my view this is your central misunderstanding of people like me (and Andrew Gelman, on my personal reading of him). We have had this conversation many times to no success but here is my view (and, I believe, Gelman’s) again:

Continuous Bayesian posterior distributions are a relative measure of support for parameters within a mathematical model structure.

I hope you can have a “fair-minded engagement” with this view.

• Oliver: First of all, I don’t view Gelman as a probabilist. Second, I don’t know a better word to distinguish Bayesians who view inference as updating to a posterior from Bayes Factor types. The standard term is “absolute”. I’d like a better one. It’s true as well that both kinds are comparative to the alternatives in a model, as you say. But that’s not how I’m using the term “comparativist”. The distinction here concerns the form of inference: a posterior vs a Bayes “boost” or other comparative Bayes measure, and there are several. I place likelihoodists here as well.

• Since it is entirely analogous to the distinction between propositional and predicate logics, I would suggest distinguishing between ‘zeroth order’ Bayesians and ‘first-order’ (or higher, in some cases) Bayesians.

Working ‘within a model’, as Gelman is fond of saying, means using variables within a domain of discourse just as in the case of variables and predicates.

• (PS – aside from the more central issue of zeroth-order vs first-order Bayesians, a ‘prior’ is what you have before data, a posterior is what you have after data and a ‘boost’ is how the data gets you from one to the other.)

• omaclaren: Many Bayesians reject conditioning by Bayes rule as the way to get the boost, but anyway, as I already said, I fail to see the connection, except possibly that 2 place relations are relations.

• You need to distinguish between ‘boosts’ relating to predictive distributions and ‘boosts’ relating to parameters within predictive distributions.

Again, careful formulation in terms of sets, variables and quantifiers can do much to clarify ambiguities arising from a focus on treating hypotheses as simple propositions and ignoring the proper mathematical structure of the problem (in my view, anyway).

• omaclaren: I fail to see the analogy between propositional and predicate logic, much less that it is “entirely” analogous. Propositional logic is the logic of statements with no quantifiers, predicate logic is the logic of properties and relations, permitting quantification (which I happen to be teaching in 2 days). So you’ll have to explain.

• That’s my point…if you treat ‘hypotheses’ as simple propositions you’re doing something much different than treating them as parameters within a higher level structure.

Ironically it seems to me that failure to make these distinctions is a source of many philosophical confusions.

Any philosophical examples should state the sets/predicates/variables etc as that is the (minimal) language of mathematics and mathematical models.

• Michael Lew

omaclaren, I agree strongly with your suggestion that the treatment of hypotheses as something more than just specification of parameter values within a model is problematical. I suggest that we should resist dichotomisation of hypothesis space just as strongly as we resist the dichotomisation of P-values into significant and not significant.

• > we should resist dichotomisation of hypothesis space just as strongly as we resist the dichotomisation of P-values

Exactly. Or at least recognise that the assumptions we make on the geometry and/or topology of our parameter (hypothesis!) space (and data space and model space) can have a ‘significant’ impact on our results 😉

We need to make explicit our domain, codomain and mappings, and the ‘level(s)’ at which we are speaking. No more ‘Newton’s theory is true/false’ or Isaac is/isn’t college ready without making the mathematical structure of the problem explicit, please.

• Incidentally, I believe these issues regarding the nature of the parameter space used were essentially the source of Stephen Senn and David Colquhoun’s dispute about p-values and false discovery rates

• Omaclaren: I don’t see how. The issue turns on whether positive predictive values are relevant for evaluating the evidence for a given hypothesis. By the way, Colquhoun gives a definition for FDRs that differs from the original definition of that term.

• Senn gave a pretty clear comparison between the Bayesian formulation with a prior as a set of Ts and Fs – which is effectively how Colquhoun set up the problem though he considered his ‘prior’ to be ‘frequentist’ – with the Bayesian formulation in terms of a continuous prior of parameters and the corresponding p-value set up.

Not surprisingly the simplistic T/F case differs from the more continuous formulations. There is of course the issue of which best represents the current (bad) behaviour of scientists, distinct from what’s the best way to use the methods in principle.

• OM: This is what the error statistician does by a distinction of “levels”. That is how we avoid underdetermination and the Bayesian catchall factor. But surely you’d know all this being the proud owner of two copies of EGEK (having collected your second today). Right?

But it’s not enough to want a piece-meal approach, one has to have an account that lets you split things off so as to exhaust the answers to the question at each level. There is a tension between this “testing” goal and wanting a posterior probability or other probabilism.

• Sure, that’s the general idea of the error statistical approach. How about a detailed comparison with a hierarchical bayesian approach, for example?

• OM: You’re the one who should be giving such a comparison; I don’t work in that area. At least try delineating the aspects of the hierarchy that seem analogous to ours. There are two different “hierarchies” in the error statistical philo of science. One is the so-called hierarchy of models, though there’s no reason to use that term. That concerns the levels from data, stat model, to increasingly theoretical claims. This is very different from a set of “levels” of probabilities, as when we say Pr(T > t0; H0) = p0, and Pr(P < a; H0) = a. Yet it's the non-probabilistic levels that are most obviously related to the piecemeal delineation we were alluding to.

• omaclaren: I added notes (a) and (b) in response to your comment.

2. I’m next to jump in this picture, but I don’t. I turn around and tell everybody to go back, back, back, and think it through!

3. “It’s actually an asset of P-values that they are demonstrably altered by biasing selection effects (hunting, fishing, cherry picking, multiple testing, stopping rules, etc.). Likelihood ratios are not altered. This is formalized in the likelihood principle.”

Some kinds of selection processes alter likelihood ratios; others don’t. For example, in genome-wide association studies it’s common to apply a statistical significance filter to highlight SNPs of interest. The post-statistical-significance-filter likelihood function (let’s say, under a normal model) for the mean given the observed sample mean and variance is not proportional to that of the vanilla normal model.

• Corey: I can’t tell the form of inference in your example. The issue concerns selection effects dropping out in the likelihood ratio even if they alter the likelihood function.

• I’m saying that while some “biasing selection effects” don’t affect likelihood ratios, others do; hence the flat claim that “likelihood ratios are not altered” is false. Suppose my study design is: collect statistics on a large number of individual cases (e.g., SNPs in genome-wide association studies), present estimated effect sizes for only those cases where a statistical significance cut-off is reached. Presumably you would consider those effect size estimates as subject to a biasing selection effect, no? I’m saying that the likelihood function under this design is not proportional to the likelihood function under a no-selection design.

• Corey: But that’s an analysis that is based on significance levels being the first check for inclusion. Also, it’s not enough that the likelihood function be altered for the ratio to be altered. Maybe you should spell out the LR a bit more.

4. Mayo:

I do think NHST is the problem, and I think it’s a problem if p-values are used and also a problem if Bayes factors are used. It’s the trivial yet omnipresent error of rejecting null hypothesis A and taking this as evidence of preferred alternative B.

• Andrew: I totally agree, but regard it as an abuse of significance tests. I’ll write more later, traveling.

• Gelman and Mayo,

What is the problem with NHST? A null hypothesis is a scientific conjecture ‘S’ translated into statistical terms ‘H’. That is, S contains scientific statements while H contains statistical statements.

1. Statisticians and scientists, in general (or all of them?), take for granted some rules of inference (and their limitations) that connect the statements ‘H’ and ‘S’.

2. Statistical models are approximate mechanisms to explain some uncertain events. This fact has a huge impact in the relations between ‘H’ and ‘S’.

3. Many other mathematical theories could be implemented to verify the consistency of ‘S’. All methods will suffer from the translation problem.

Questions:

a0) Do you agree that science has conjectures?

a1) If yes to a0), let ‘S’ be a scientific conjecture. Do you think that science should verify the consistency of the conjecture ‘S’?

a2) If yes to a1), do you think that a mathematical method should be implemented to verify the scientific conjecture ‘S’?

a3) If yes to a2), how can this be done without formulating a null hypothesis ‘H’ related to ‘S’?

If you said no to any of items a0)–a2) above, please explain the reason.

• Alexandre:

The troubles are two: First, the mapping from substantive to statistical hypothesis is weak. A substantive hypothesis might be that fecundity in women affects political attitudes. This connects to many many many statistical hypotheses, including whatever happens to be tested in some paper, for example an interaction between time of the month and relationship status being predictive of vote intention. Second, in NHST the logic is to reject straw-man hypothesis A in order to justify belief in alternative hypothesis B. This is a logically incorrect step.

• Andrew,

Yes, but this criticism can be used against any mathematical theory, since mathematics is a precise language that has little to do with the natural language we use to define things around us.

It is a given that any natural-language statement will connect to many many many different mathematical concepts. How do you overcome this problem? Which mathematical model do you propose to treat the problem of testing scientific conjectures? Or do you think that science should stop making conjectures about nature?

• Alexandre: You’re being silly, and ignoring the huge quantity of material on this blog that discusses statistical tests and models and is lightyears away from skepticism of mathematical models.

• Mayo,

My question was addressed to Andrew. He said that a “substantive hypothesis” connects to many many statistical hypotheses, which is an intrinsic feature when you try to connect an imprecise language with a precise language. I said that this criticism can be applied to any mathematical concept that you invent to represent a “substantive hypothesis”.

Is it silly? Well, I don’t think so. Sorry 🙂

• Alexandre: It’s silly to suggest that criticizing fallacious uses of tests is unfounded given the inevitable gaps between data and substantive hypotheses, or given that models are only approximate. Gelman wasn’t merely talking of the logical underdetermination of any theory or model by data, but situations wherein the flexibility (of concepts, interpretation, modeling, etc) allows essentially any substantive hypotheses to be saved at will, or read into the data, and any number of other ‘bad science’ practices. The inferred hypothesis has been poorly or inseverely probed. Even if one comment doesn’t get at all the nuances, there are numerous posts on those issues that do.

• Alexandre: I think most people know that I’m certainly NOT against significance tests. I’m against the abuses of tests and against the abuses of abuses of tests–notably, often, as a pretext to ban significance tests in favor of whatever method the critic is keen to advance. But NHST is an acronym that, so far as I can tell, is advanced in psych and other social sciences, as a test PURPORTING to allow moving from statistical significance directly to a research hypothesis. Neither Fisherian tests nor NP tests would allow that. If psych researchers are going to use tests, they’d be much better off with NP tests which, after all, is where power is defined. (That Jacob Cohen was a power analyst and yet psych people use single null hypothesis tests is puzzling.) In NP tests, the alternative and null must EXHAUST the space of hypotheses. There’s no jump to a level beyond the statistical parameter space. They’d also get an interpretation of results that fail to reach statistical significance. (Severity would lead to exhaustion in interpreting Fisherian tests as well, even though one only needs a test statistic to get going.)

I’d never just say I’m against NHST without indicating what I mean (as I do in this post) lest it be construed as more grist for the mills of P-bashers.

99% of the handwringing about P-values that keeps so many people busy concerns abuses of statistics and misinterpretations of tests. For instance, I am quite surprised to hear savvy researchers imagine that one is entitled to go from statistical significance to a maximally likely alternative. Indeed, some “reforms” (e.g., by likelihoodists) endorse this. (Even keeping to parametric alternatives within a model this is problematic.) Some do use power, but again, in unintended ways. For example, I see reforms by J. Berger recommending an inference to an alternative against which a test has high power. A kind of likelihood ratio is formed by power/alpha. This gets tests backwards. I’ve written so many posts on power, it’s best to just recommend you check them out.

• Andrew: But no actual account of statistical tests licenses these fallacious moves. That’s why it’s a straw man to condemn statistical testing methodology on grounds of an abusive method. How much clearer could Fisher have been that stat sig does not directly warrant research claims? The severity requirement makes it clear: there are ways one can be wrong about substantive claim C that haven’t been probed at all, let alone ruled out, by dint of arriving at a stat sig result. And that’s even assuming it passes an “audit” which requires considering (a) statistical assumptions and (b) biasing selection effects.
The examples you flog would fail any audit immediately.
The only recommended reforms that speak to the actual problems have almost nothing to do with formal statistics and everything to do with experimental designs and bias. They should add to this the need to question if the toy experiments social psych people are so fond of can actually be used to learn about the phenomenon in question. I guarantee it would be relatively easy to falsify presuppositions about many “treatments”, from unscrambling soap words (embodying cleanliness), checking distances between points on paper (instilling closeness to family), to reading passages on determinism (increasing the proclivity to cheat), etc. etc. The wonder is that psych researchers don’t tear these experiments apart. Instead they establish “replication research projects” and are the new experts on good science. Now that’s a clever way to use the rewards system, isn’t it?

5. Christian Hennig

Thanks for this posting which I appreciate overall.
I’d like to add one thing that may be seen as slightly off-topic and I’m not sure whether it has consequences for the discussion but anyway:
You start with the “Three Roles For Probability” in “learning from data”. Probably you meant “roles” in a sense restricted to “rationales for inference”, but when I read “Roles For Probability” in the beginning, my first thought was:
Modeling – setting up formal mechanisms that could generate data that may (or may not) look like the data generated by the “world”. For all frequentists, all inference starts from there and is ultimately about such models. For many Bayesians this can be said as well, at least I see much of this in practice (and it should correspond well with Gelman’s interpretation of Bayes), although according to subjectivist and some brands of objectivist Bayesian foundations, probabilities do *not* model something in the world that can generate data, but are about degrees of belief/credibility all the way. But even following Bayesian foundations according to which probabilities don’t model data generating processes, still a key role of probability is formal modeling of a world view in which data arise in ways that are not pre-determined by the observer.

Chances are that I’m not telling you anything new here, but still I think it’s helpful to keep this in mind. Whatever P-values as well as prior and posterior probabilities can deliver, they deliver it on the basis of probability modeling, and there are certain aspects of the discussion in which this is all too easily forgotten. Null and alternative hypotheses are routinely interpreted as something much more general than what they are within the model. H0: “The intervention doesn’t make a difference”, or “it does”, are much bigger statements than if one would include all the model assumptions and say “assuming that the world behaves like our model otherwise”. P-values are computed not based on “The intervention doesn’t make a difference” but based on a rather specific data generation mechanism (or belief) etc.

Actually, this issue is mentioned already above by omaclaren, “working within a model”. In my understanding, this is important for all probability modeling and inference, frequentist, Bayesian or whatever. (One issue I have with many uses of Bayesian statistics in practice is that the sampling model and the parameter prior are often chosen in very different manners corresponding to different ways of thinking and “world views”, and I’m often not sure whether a clear and sound interpretation can be given to results of using them both as “the same kind of probability” in probability calculus.)

6. Christian: Thanks for your comment, haven’t heard from you in a while.
I think my new addition, section (IV) addresses your points. In any event, your comment led me to write it. I don’t think I meant “rationales” but, literally, how probability quantities enter into inferences. In other words, What are they doing there?
See if my add-on helps. I’ll study your comment some more later.

I entirely agree with your point about Bayesians mixing different notions of probability together, and find it striking that leading statisticians can be so casual about it––even in the midst of calling for greater clarity on key terms!

7. Given that most people here now accept the ‘within’ and ‘without/outside’ the model distinction, perhaps we could build on this point of agreement.

For example, I suggest giving an explicit example (or set of examples) to consider both the Bayesian/Likelihoodist and the Frequentist approaches, where the concepts lying ‘within’ and ‘without’ the (primary) model are made *explicit* in each case.

Say the example involves stopping rules. My first question for each approach is:

Is the stopping rule considered a ‘within the model’ concept or an ‘outside the model’ concept?

So for example, in one case the stopping rule might be explicitly modelled ‘within’ the primary model or might alternatively represent an ‘external’ concept which then requires a conditional independence statement about the ‘closure’ of the ‘boundaries’ of the model to hold in order for any ‘within the model’ estimation to be valid.

[I started jotting a few thoughts on these topics on my own (not a research) blog to avoid clogging up other people’s blogs. If anyone wants to, feel free to pop over – just click my name for the link. The first post tried to consider the idea of ‘model closure’, I also have a post on the ‘tacking paradox’ which mentions stopping rules. Of course, I don’t claim any of it is particularly original or deep.]

8. Mayo:

You ask if there is any evidence that people view p-values as posterior probabilities of the null and/or alternative, as is commonly claimed. Do you not buy the work done by Oakes, or Haller and Kraus, or Hoekstra et al (for CIs)? They all found quite clear evidence that researchers (in psych at least) were, in fact, interpreting p-values (and CIs) as all sorts of crazy things. Things that include the probability of H0, or H1, or the probability of replication, etc etc. Even some methodology instructors do this.

Hoekstra’s survey even explicitly asks if they endorse, “The probability that the true mean equals zero is less than 5%” when the CI excluded zero, and roughly half of all respondents endorsed it.

I’m sure you’ve seen it, but for other readers here is a link to Haller and Kraus, who also summarize Oakes’s earlier results. http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpretations%20of%20Significance.pdf

And Hoekstra et al.: HoekstraEtAlPBR.pdf

• Alex: Sure, but did you read the rest of my remarks? I’m going rather deeper. Calling a claim probable needn’t mean a posterior in ordinary English. Likelihood in ordinary English is similarly used in ways quite distinct from formal likelihood. My point is that people mean different things by “H is probably true” or the like. If you ask what someone means by the .95 of a .95 HPD interval, they may say, this method covers the true parameter 95% of the time.
That said, let’s grant that researchers find that people misinterpret P-values as posteriors. What interpretation do the researchers give? Typically they are cashed out in terms of betting, or maybe beliefs, depending on the type of prior. But such a probabilistic measure differs from both performance and well-testedness.
I also find Fraser’s point of interest: he thinks probability should be reserved for a “confidence concept”.
But really I was here just raising the point that too little attention is given to what’s meant by a posterior probability.
I also think that people mistake the meaning of “the probability the effect we’re producing is due to chance” as a posterior. You can look up the Higgs experiment for a post on this. The Higgs researchers just meant: Pr(tests would regularly produce > 5 sigma results; under the assumption they are due to background fluctuations) is exceedingly small. See an updated section (V) to this post.

• Michael Lew

Alex, you need to take the results of the surveys of Oakes, Haller & Kraus, and Hoekstra et al. with a grain of salt. They are set up in a way that maximises the chances of wrong responses by offering a set of questions about P-values (and CIs) that are somewhat different in direction from the way that ordinary users of P-values (and CIs) use them. Further, they offer a set of questions that are all false. Thus where users think that P-values are a sort of probability (which they are), the users can be expected to choose one of the offered types of probability as the best match for their hazy understanding. If the questions were written in a way that explored the users’ understandings of P-values as indices of evidence then the users would do rather better, in my experience.

It is notable that at least one of the ‘correct’ responses in the Hoekstra paper is contestable.

• Hi Michael, good to hear from you. Not surprised about their surveys, they are typically written in that manner. Do you have a link to them?

• Michael Lew

The Oakes (Michael, I think) one is in his excellent little stats textbook–one of very few that explicitly discuss the differences between Neyman-Pearsonian, Fisherian, Bayesian and Likelihoodist approaches. I don’t have a link, and I am not in my office to even look at the title of the book.

My own presentation and analysis of the questions: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3419900

Hoekstra confidence intervals: http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf

• Michael: Thanks so much for the links, especially your paper. It’s interesting to read yet one more of a million papers covering this ground–seriously, I find your paper very interesting. As you likely (or should I say probably?) know I’ve had many posts and papers discussing what’s wrong with the alleged “great big” war between Fisher and NP views. Most especially your depiction of NP as automatic accept or reject, the justification being global long run control of error rates over entirely different experiments, is wildly incorrect, while repeated 5 times a day. If people would pledge not to write one more paper repeating this but rather to read NP, they would become enlightened. Predesignated alpha and power were important to ensure tests capable of probing types and degrees of discrepancies of interest, and to avoid post-data hunting for one or another difference. Pre-data planning doesn’t mean no data dependent reports. Post-data, the NP theorists proposed reporting the P-value attained. Erich Lehmann called it the significance probability and other terms besides. Even if Wald and others couched tests in decision theory terms to derive optimality results, this wasn’t NP.
Fisher early on invented the cut-off, even recommending differences below .05 be ignored (which is in tension with one of the quotes of his in your paper, one that I like a lot). After Barnard told Fisher that N had converted his tests to acceptance sampling rules, Fisher took up this cudgel. The standard war story rarely mentions that Fisher was pushing fiducial probabilities which were ill defined. NP’s attempts to clarify the difference between error probabilities of procedures and fiducial confusions led them to emphasize the fact that in frequentist methods–tests and estimates–the inference is to a claim, e.g., regard the data as indicating a genuine discrepancy from the null, infer mu is greater than the lower bound of a CI. The error probability of the procedure qualifies how well the claim was tested, probed or corroborated. Read the “triad” of 1955 and 56 (on this blog). No one can say what Fisher meant by his fiducials, even though today’s confidence distribution people think they are carrying out something close (which they may well be). But these are not degrees of belief, support or prob of hypotheses but error probs (with either a performance or severity-like construal).
The “global” error probabilities over unrelated cases which you claim is standard for NP was not standard at all. It was N pointing out in 1977 that error probabilities do not require considering performing the same test over and over again. (I’m not denying N was interested in those results, but today’s focus on screening rates a la Ioannidis is N on steroids!)
There was a war between N and F but not between the methods, which are simply different tools, very closely connected mathematically, within the repertoire of error statistics. Depicting this personality war as if it’s a huge war within the methods themselves, as if philosophy determines the mathematics, is one of the main reasons for today’s confusion. As I’ve been saying, this should stop. It’s the methods, stupid!
(Readers can check a post with this title.)

• Michael Lew

Mayo, as you know, you and I read the 1933 paper of Neyman & Pearson differently. It is not sensible to try to argue further on that matter as we have both failed to convince the other. However, I will note that neither Neyman nor Fisher was consistent or clear in all of his respective writings, so there is plenty of room for disagreement on the details.

• Michael: It’s not a matter of a single, early paper (they hadn’t even introduced the word “power” yet), it’s (a) a matter of a slew of papers and applications after, and (b) most importantly, it’s wrongheaded to attach one possible, extreme, interpretation to the methodology itself, and to keep teaching it as if it’s impossible to do inference with NP tests. That’s just false. There’s a perfectly good philosophy of inference that goes with the methods, whether interpreted as “confidence distributions”, performance, or probativeness. It’s the one held by Popper, Peirce, and many others.

9. I always seem to get involved when the discussion has almost petered out. Christian Hennig said his contribution may be slightly off-topic but I find it very much on-topic. A P-value is one but not the only or necessarily the best measure of how well a specific model approximates the data. In fact there will be several P-values involved in measuring the degree of approximation, each corresponding to a different feature of the data; mean, standard deviation, Kuiper distance and outliers, to name but four for the normal model. They are well-defined and their interpretation makes no reference to any real world problem. As Hennig points out, the connection to the real world may lead to statements such as ‘the copper content of this sample of water is significantly less than the legal limit’, which are much bigger than saying that the model based on the legal limit is a poor approximation to the data, that is, that the P-value is very small. Suppose the P-value in question is based on the normal model with null hypothesis H_0: mu=1.97. How does the statistician decide on the normal model? One possibility is always to use it regardless of the data. A second is to look at the data accompanied by a waving of the hand and the statement ‘they look normal to me’. A third is to write a computer programme with input data and output at least 1 or 0 depending on whether a normal model is judged to be an adequate approximation. This has the huge advantage that it is open to criticism. The data I have include one observation, 1.7, which is ‘much’ smaller than all the others. One hand-waver may say normal, another may say normal apart from the outlier, which I hereby discard. The includer will report a P-value of 0.0502, the excluder one of 0.0064. The programmer is not much better off. If the output is 0, that is, a bad approximation due to one somewhat deviant observation, what now? A slight alteration of the required degree of approximation may bring the result 1.
So not surprisingly the P-value depends on what is usually regarded as the EDA phase of an analysis. Not only that, it can be sensitive to a few observations. Any offers of a programme for the standard i.i.d. normal model? Any suggestions of how to let the P-value depend explicitly on the EDA phase? Any suggestions on stabilizing the P-value?
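[Editorial illustration] The sensitivity of the P-value to a single observation can be sketched in a few lines. This is only a rough sketch under stated assumptions: the data below are hypothetical (they are not the copper data referred to in the comment, so the 0.0502 and 0.0064 figures are not reproduced), and the t-distribution is replaced by a normal approximation so that the example needs nothing beyond the Python standard library.

```python
import math
import statistics


def two_sided_p_value(sample, mu_0):
    """Two-sided P-value for H_0: mu = mu_0.

    Assumption: the t statistic is referred to the standard normal
    distribution rather than to the t-distribution with n-1 degrees of freedom.
    """
    n = len(sample)
    z = math.sqrt(n) * abs(statistics.mean(sample) - mu_0) / statistics.stdev(sample)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for the standard normal cdf Phi.
    return math.erfc(z / math.sqrt(2))


# Hypothetical measurements; the last value plays the role of the single
# observation that is 'much' smaller than all the others.
data = [2.05, 2.03, 2.06, 2.02, 2.04, 2.07, 2.01, 2.05, 2.03, 1.70]
mu_0 = 1.97

p_including = two_sided_p_value(data, mu_0)       # the "includer"
p_excluding = two_sided_p_value(data[:-1], mu_0)  # the "excluder"

print(f"including the deviant observation: {p_including:.4g}")
print(f"excluding the deviant observation: {p_excluding:.4g}")
```

With these made-up numbers the deviant observation inflates the standard deviation and pulls the mean toward the null value, so the includer and the excluder end up on opposite sides of any conventional cut-off.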

• lauriedavies2014: Well I hope it hasn’t petered out just yet. To recognize the roles of “several P-values involved in measuring the degree of approximation each corresponding to a different feature of the data; mean, standard deviation,…” is already to understand P-values, and related measures, in a manner that is more aligned with practice than what we often hear from critics many of whom talk as if P-values are useless unless maybe they match posterior probabilities. Imagine assigning posteriors to claims about all of the different features of the data of interest, as if they could be exhausted.

One thing about your claim that P-values “are well-defined and their interpretation makes no reference to any real world problem”. It does not follow from the fact that statistical hypotheses and claims differ from various substantive ones that the interpretations of statistical inferences make no reference to any real world problem. A central real world problem is constructing a model serviceable for some question of interest and/or creating a data generation method and question such that the models are serviceable to that end. Those are real world problems. It’s the rare case that we just come across a phenomenon well modelled statistically (or otherwise), but expecting to come across such things is not how humans operate. Much too passive. You speak of “legal limits” of copper content. By and large, we develop such limits hand in hand with our ability to measure a related quantity of interest reliably.

But I’m actually not sure any of this gets to the gist of your comment.

10. We think differently. I once sent a comment on stopping rules and likelihood and you asked whether it was on topic. The difference is between ‘behaving as if true’ and ‘behaving as an approximation’. Here are some thoughts on P-values without semantics; mapping models to the real world in a consistent manner is another problem. Let us take the easiest case of a normal model and H_0: mu=mu_0 and a sample x of size n. The standard definition of the P-value is 2(1-pt(sqrt(n)|mean(x)-mu_0|/sd(x),n-1)) but to make things simpler let us replace the t-distribution by the normal distribution. Suppose I tell you that the sample x is N(mu,sigma^2). How do you now define a P-value with this knowledge? A reasonable suggestion would seem to be 2(1-pnorm(sqrt(n)|mu-mu_0|/sigma)). I now ask which pairs (mu,sigma) are such that x may be well approximated by the N(mu,sigma^2) distribution. Each pair will give me a P-value 2(1-pnorm(sqrt(n)|mu-mu_0|/sigma)) and being conservative I take the largest one. The difference can be large. For my copper data the null hypothesis H_0: mu=2.1 has a standard P-value of 8.7e-4. My version gives a P-value of 0.24. For H_0: mu=2.15 the values are 2.5e-6 and 4.4e-4 respectively. I think about goodness-of-fit tests in the same manner. Suppose the model is a Poisson one. The standard way of performing a chi-squared test is to replace lambda by the mean and subtract one degree of freedom. This will not tell you which parameter values are consistent with the data. This can only be done by testing each parameter value separately. A further problem is the choice of the normal model. This has consequences for severity of testing. Roughly speaking the larger the Fisher information the more severe the testing. The Fisher information is not restricted by requiring a good fit say in the sense of the Kuiper distance. In other words you can introduce severity by an appropriate choice of model.
The problem can be overcome to some extent by regularizing, say by choosing a minimum Fisher information model. For a given finite variance this is a normal distribution. In other words when you choose the normal distribution you maximize the P-value. I accept this and to be consistent I go further and maximize over the parameter values. And yet a further problem is consistency of interpretation across parametric models. The log-normal model also provides an adequate model for the copper data. How do I map its parameters onto the copper content of the water? I suspect the answer is to replace parameters by functionals which automatically give a consistency of interpretation across models.
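[Editorial illustration] The idea of taking the largest P-value over all adequate (mu, sigma) pairs can be sketched as follows. Several things here are assumptions made purely for illustration and are not taken from the comment: the data are hypothetical, "adequate" is stood in for by a Kolmogorov distance below an arbitrary threshold of 0.3 (the comment does not say which adequacy criterion is used), and the search is a crude grid around the sample mean and standard deviation.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2)."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))


def p_value(mu, sigma, mu_0, n):
    """2 * (1 - Phi(sqrt(n) * |mu - mu_0| / sigma)), written via erfc."""
    z = math.sqrt(n) * abs(mu - mu_0) / sigma
    return math.erfc(z / math.sqrt(2))


def kolmogorov_distance(sample, mu, sigma):
    """Sup-distance between the empirical cdf and the N(mu, sigma^2) cdf."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - normal_cdf(x, mu, sigma),
            normal_cdf(x, mu, sigma) - i / n)
        for i, x in enumerate(xs)
    )


def largest_adequate_p_value(sample, mu_0, threshold, steps=60):
    """Largest P-value over candidate (mu, sigma) pairs judged adequate.

    Hypothetical adequacy rule: Kolmogorov distance <= threshold.
    """
    n = len(sample)
    m, s = statistics.mean(sample), statistics.stdev(sample)
    # Candidates: the plug-in pair plus a grid around it (an arbitrary choice).
    candidates = [(m, s)]
    for i in range(steps + 1):
        for j in range(1, steps + 1):
            candidates.append((m - 2 * s + 4 * s * i / steps, 3 * s * j / steps))
    adequate = [
        (mu, sigma) for mu, sigma in candidates
        if kolmogorov_distance(sample, mu, sigma) <= threshold
    ]
    return max((p_value(mu, sigma, mu_0, n) for mu, sigma in adequate), default=0.0)


# Hypothetical data, not the copper data from the comment.
data = [1.92, 2.01, 2.05, 1.98, 2.10, 2.03, 1.95, 2.07, 2.00, 2.04]
mu_0 = 2.1

standard = p_value(statistics.mean(data), statistics.stdev(data), mu_0, len(data))
largest = largest_adequate_p_value(data, mu_0, threshold=0.3)

print(f"plug-in P-value:                 {standard:.3g}")
print(f"largest P-value over adequate set: {largest:.3g}")
```

How far the two values differ depends entirely on how generous the adequacy rule is, which is one way of reading the comment's point that the P-value cannot be separated from the judgement of which models count as adequate.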

• Laurie: I invite readers more familiar with your work to translate your comment. As for “The difference is ‘behaving as if true’ and ‘behaving as an approximation’”, I would say the latter is operative, but I don’t like the behavioral language. It was introduced to distinguish the inference, qualified by error probabilities of method, from a degree of belief, support, or confirmation construal.

• Christian Hennig

My impression is that the comment cannot be “translated”; rather it would need to be expanded to explain quite a number of details in the background. Davies’s book “Data Analysis and Approximate Models” gives more detail but even this is quite compressed. I think that he has thought about these ideas and worked with them for ages, so they are very familiar to him, and he perhaps can’t imagine anymore how much explanation an “uninitiated” reader needs.

11. Georgi

A technical (SEO & SMM) comment here: please don’t change the URLs of your posts every time you make a change to the title (WordPress has this option since ages ago). This way you lose a lot of links and a lot of social mentions. I went back to a post on my wall to check on this thread only to see a 404 page not existing error. This is very bad user experience and bad for spreading your views.

• Georgi: I only did it one time in 4 years, and that was the “Stat reforms w/o philo are blind”. I did it because some people who were sent the old one wanted a notification of significant changes. I don’t know how else to notify them. Besides, I wasn’t happy with the earlier post.

• Georgi

I see. Well, I guess you can just post a separate two-liner post about the update or have everyone on an email subscription list or something. Or just have a list of revisions right at the top of the post so that people can easily see if there were major changes when they visit it (I’ve seen it done and it’s OK)?

• Georgi: OK, OK, I’ll never do it again, but I don’t know how to notify subscribers. Actually, they should come to the website sometime to properly see the pictures and color.

• Laurie: What do you mean by a true model? Assertions are true or false, and an assertion might be that “model M is true about the problem or data generation”, meaning, a la Tarski, that the data were generated as stated in M. But few people would view the statement about the model as exactly true in all respects, nor need they, any more than geometric models need to speak of actual triangles.

On the likelihood principle, it’s true that it assumes the adequacy of the model. I was starting from the argument as given. Those who have denied Birnbaum’s argument by declaring the model is never true, as Oscar Kempthorne, never get around to poking the real hole in the argument. From a philosophical/logical point of view, that’s a poor way to critique an argument. The strong way to critique an argument is to show, even granting the premises, the conclusion doesn’t follow soundly. That’s what I was doing.

13. Christian Hennig

Mayo, I think there is quite some explanation found here on this blog and also in your books and papers about how you do not assume the model to be literally “true” but “adequate” in a certain sense (and this adequacy can be tested by misspecification/goodness of fit tests). I hope I get you right here.
One issue about Laurie’s view is that several distributional families can be adequate for the same dataset, in the sense that they cannot be distinguished from the data by testing. But such different families that can be adequate for the same dataset can lead to quite different inferences such as p-values. I think that Laurie’s use of “true in a provisional sense” points to the fact that if you use, for example, a Gaussian model, potentially after having run misspecification tests and graphical diagnostics that confirmed the Gaussian model to be “adequate”, all further computations then are based on the Gaussian model alone, and the fact that there are still other families of distributions that would also be adequate, potentially yielding quite different inferences and p-values, is swept aside and no longer carried through the analysis.

• Christian: There are many general underdetermination allegations about science, I have found, that have to be cashed out in detail before accepted. To say there are always genuinely rival models or hypotheses that would have done just as well on my tests as did M, is to allege that we can never severely corroborate any model or hypothesis. I’ve never seen any argument in support of this kind of “egalitarianism” (as Laudan calls it) that holds up. If the focus is on statistical models serving as intermediate tools for arriving at more substantive hypotheses and theories, it’s far from clear that uniqueness is required to distinguish between genuine rivals. Since in my philosophy of science learning is “from the ground up” so to speak, severely testing one part or one claim is not threatened by claims at other levels of depth or generality. That said, it’s easy to agree that many hypotheses and models have been poorly probed at any given time.

14. Deborah, yes that is what I meant but your comparison with geometry misses the point with respect to likelihood. Problems in Euclidean geometry are well-posed in the mathematical sense so small errors only have small consequences. Choosing a likelihood is ill-posed and requires regularization as in Tikhonov regularization. The reason is very simple, it is the pathological discontinuity of the differential operator. In other words whatever your definition of adequacy there will be adequate models whose likelihoods are arbitrarily far apart and therefore the conclusions you draw will be completely different. My position is (i) if the likelihood principle is false then likelihood should be abandoned and (ii) if the likelihood principle is true then likelihood should be abandoned. In other words the whole discussion about the likelihood principle is irrelevant. I agree with you that simply pointing out that a model is never true does not mean in itself that likelihood should be abandoned. If likelihood were a continuous concept in terms of adequacy then it could possibly be saved. But it’s not and so it can’t be. This is why in the context of likelihood it is not possible to bridge the gap between truth and adequacy. One good ‘regularizer’ is ‘common sense’ but we should be able to do better than this. Christian Hennig has put it nicely in his comment.

• Laurie: I don’t know what you mean by “likelihood should be abandoned”. Do you mean we should never use likelihood functions because they are invariably based on statistical models that are at best adequate for the job and fail to provide replicas of all features of the world? We really couldn’t learn much from a representation that was literally true in all respects, and I deny the situation with statistical models is importantly different from applying other mathematical or abstract models. Problems don’t start out being “well posed”. We figure out how to either understand more and/or improve our measurement repertoires to triangulate from diverse probes, each rather lousy if used alone. It happens that we actually manage to get iid data (or data specifiably non-iid) so that the computations from the probability model are quite close to actual relative frequencies of interest. (iid is more important than the distribution anyway.) That’s why resampling and similar strategies work.

The problem you raise reminds me of the post on the dangers of small misspecifications in complex Bayesian modeling not too long ago: “When Bayesian Inference Shatters”.

But the authors argue that this problem is avoided in frequentist modeling and methods. I should ask them for an update (which they’ve done for a couple of years).
I don’t know what you mean by likelihood not being a “continuous concept in terms of adequacy”.

If you’re saying, no I’m just claiming the likelihood principle is unimportant, well it’s quite important to foundations of statistics, and the supposed demonstration of the LP (from frequentist principles) has been one of the predominant reasons for the nagging discomfort many have felt for 50 plus years that frequentist error statistical methods are taking into account something that is strictly irrelevant to evidence. I realize people have started to say it never mattered much. See this post.
https://errorstatistics.com/2011/12/22/the-3-stages-of-the-acceptance-of-novel-truths/

Fine. Should the Birnbaum result be removed from the “Breakthroughs in Statistics” volume? Savage and many others surely took it to be a momentous occasion! At the very least we ought to see the alleged proof removed from textbooks. My interest in the problem was equal parts foundations of statistics and logic. It was with my logician’s hat that I tackled it, not with any statistical arguments, really.
It’s interesting that, in the first few years of this blog, the LP arose significantly often. I’ve had my last and (hopefully) final word on the topic in Statistical Science.
https://errorstatistics.com/2014/09/06/statistical-science-the-likelihood-principle-issue-is-out/

That doesn’t mean I’ll stop meeting Birnbaum on New Year’s Eve.

• Mayo –
You could try starting here for some terminology:

Though they don’t seem to be the best wikipedia pages I’ve read…

• And differentiation (esp. numerical) is a notoriously ill-posed problem. Laurie is pointing out that this causes difficulties for the likelihood concept.

• Michael Lew

Laurie, I would not be so fast to advocate abandonment of likelihoods. Your point that likelihoods from different ‘adequate’ models can be arbitrarily far apart must also apply to any other statistical version of evidence, because all versions of statistical evidence come from models. The models might be distributional iid, or they might only involve notions of exchangeability, but they are all models.

(The likelihood principle is true, but its scope is a bit more restricted than most people seem to assume. I’ve written a paper about that but it has been rejected with some referee comments that seem to me to be mistaken. Oh well, you can read it for yourself: http://arxiv.org/abs/1507.08394)

15. Deborah, ‘ill-posed’ or if you prefer it ‘well-posed’ is a technical term within mathematics. In particular, if the inverse A^-1 of an invertible linear transformation A is unbounded, determining the inverse A^-1(F) for a given F is an ill-posed problem. One example is when A is the linear integration operator. Its inverse is the differentiation operator, which is unbounded. Thus determining a density is an ill-posed problem, which means that determining a likelihood is an ill-posed problem. Deconvolution is also an ill-posed problem. Ill-posed problems require regularization. Penalized maximum likelihood is a form of regularization. Sometimes the problem can be regularized by a Bayesian prior but not always. Common sense is an informal way of regularizing. Oliver has given some links which are more detailed and one of which mentions the possibility of regularization using a prior. You write ‘computations from the probability model are quite close to actual relative frequencies of interest’ but this simply does not regularize the problem. The Kuiper metric measures the largest discrepancy over all intervals between the relative frequency of observations in an interval and the probability of that interval under the model. It is perfectly possible for two different models to be equally good in the sense of the Kuiper metric, that is in reproducing the empirical frequencies of intervals, and yet have very different likelihoods. For an example see Table 1.2 in my book. The reason is clear. The probability of the interval (a,b] is F(b)-F(a) where F is the distribution function, that is the calculation of probabilities corresponding to frequencies is based on distribution functions. To get the density you have to differentiate F and this is ill-posed. I had some correspondence with Tim Sullivan about his joint paper and yes it is related to ill-posedness.
I have various reasons for not using likelihood, one of which is the pathological discontinuity; another, perhaps related, is that it is blind. The data are a series of zeros and ones, say coin-tossing. I calculate the likelihood based on the normal model. Can I read off from the likelihood that the model is nonsense? No I can’t. At the moment I am trying to model the S&P 500 data over about 90 years. I am interested in several stylized facts such as heavy tails, volatility clustering, slow decay of autocorrelations of absolute values, in all about 10 different features. I use a GARCH(1,1) model. Can I read off from the likelihood whether or not the model is capable of reproducing these features? No I can’t. To do this I must simulate under the model and then not surprisingly it doesn’t. What can I read off from the likelihood? Maybe you can tell me. An appeal to authorities, Birnbaum, Savage, is never a good argument. If I were to edit a volume or volumes entitled “Breakthroughs in Statistics” it would be different. I know at least one person in the present version who wouldn’t be in mine but I will not name names. I would probably alter the title to ‘Wrong Paths in Statistics’ and then include Birnbaum and Savage. The title in German is even better: ‘Irrwege in der Statistik’. On the positive side the statistician I feel closest to is Peter Huber. His book ‘Data Analysis: What Can be Learned from the Past 50 Years’ makes no mention of likelihood. You argued against the likelihood principle in that it doesn’t follow from Birnbaum’s premises; I argue against it, and more generally against likelihood, from a completely different direction.

• Laurie: I think we are confusing the mere use of the likelihood function with the likelihood principle (which entails no error probabilities). Then of course there’s likelihoodism and other variations as well. At least I was confused about what you were condemning, and it’s likely that I still am.

• Michael Lew

Laura, I cannot follow your arguments against likelihood but I’m pretty sure that it is inappropriate to point to a failing of likelihoods for one purpose (e.g. model validation) and extrapolate to say that likelihoods fail for all purposes. Do your ‘ill-posed’ issues have consequences for the use of likelihood ratios as measures of evidential favouring by the data of values of model parameters? If so, then I would be very pleased to read exactly how. If not, then I think that you need to back off a little.

• Christian Hennig

His (!) name, by the way, is Laurie, not Laura.

• Michael Lew

Whoops. My apologies, Laurie, I misread your name.

• john byrd

Misspecified. Are all of his points therefore incorrect?

16. Christian Hennig

Mayo wrote: “There are many general underdetermination allegations about science, I have found, that have to be cashed out in detail before accepted. To say there are always genuinely rival models or hypotheses that would have done just as well on my tests as did M, is to allege that we can never severely corroborate any model or hypothesis. I’ve never seen any argument in support of this kind of ‘egalitarianism’ (as Laudan calls it) that holds up. If the focus is on statistical models serving as intermediate tools for arriving at more substantive hypotheses and theories, it’s far from clear that uniqueness is required to distinguish between genuine rivals.”

I agree that uniqueness is not required. The issue is to what extent the rival models that cannot be ruled out lead to different conclusions regarding the research aim. This depends on details of the methodology. The discontinuity of the likelihood as mentioned by Laurie means that likelihood-based methods applied to very similar models (in the sense of producing very similar datasets) may lead to very different conclusions. This is not general egalitarianism. There are more robust methods that don’t have these problems, or at least not to the same extent.

Deborah and Christian: I am not confusing them in my mind but use them almost interchangeably in writing. My comments apply to likelihood however used. I will try and be more careful in future. I missed Deborah’s comment quoted by Christian but I agree with his response.

• Michael Lew

Laurie, it is the biggest failing of statistics that there is no acceptable form for ‘statistical evidence’, because it is inevitably evidence that we want to use for making scientific inferences.

I cannot pretend that I understand better now than before and I do not have your book. However, if the data are in [0,1] then the normal distribution would seem to be inappropriate. Any function that could be called a ‘comb’ function would also be inappropriate any time a normal model might be sensible. Thus I cannot see why it would be a problem that the estimates from an analysis based on a normal distribution differ from those based on a comb distribution.

It is the responsibility of the analyst to choose appropriate models, and I don’t understand why you expect a likelihood analysis to provide a reasonableness check on its own output, at least beyond the obvious silliness of an interval that extends well outside the support of the model.

Can you describe what constitutes a ‘pathological discontinuity’? I would have thought that the differential of a smoothly varying function would be continuous (but I’m not a mathematician), so can you say that some likelihood functions do not suffer from the discontinuity? Would it be OK to base inferences on them?

• Michael – some intuitive points that might help. It’s not differentiating the likelihood function that Laurie is worried about, it’s differentiating to get the likelihood function.

You might be familiar with modelling based on simple conservation laws and differential equations? In some sense the integral expression of the conservation laws are more general (really, invariance of certain integral expressions is more fundamental). Differential equations can only be obtained from the more general expressions under certain regularity (eg smoothness) conditions.

So you might say that using likelihood is more like starting from a differential equation than from a more general conservation or invariance statement, and requires more auxiliary assumptions to be valid. Hence you might argue it cannot be the most fundamental concept, rather a derived concept useful after additional ‘regularisation’.

The question here is – what is the best way to state the more general principle(s)? Personally I think we need a theory of statistics stated in similar terms to our best ‘physical’ theories and that this means using the language of geometry, topology and invariance.

18. An issue here is that deep questions recur across many different fields. The basic questions here touch on at least mathematics, physics and philosophy. Preferred resolutions and even the presentation of equivalent resolutions depend on background training and other (in-principle!) unimportant details. There is a lot of irony in this if you think about it.

I was considering adding other examples to the discussion but would probably just add to the babel. I do believe that there are some constructive points of agreement possible given enough ‘fair-mindedness’. It’s limitations on this latter quality that get in the way, and this in turn is constrained by time, training and temperament.

19. george

A possibly-naive comment: do the likelihood-based intervals based on the “comb” distribution have any justification in samples of this size? Almost all likelihood-based approaches are approximate, and the comb looks like the sort of example one would cook up to show that the approximations (i.e. approximate 95% coverage) don’t necessarily work well in realistic data settings.

20. Michael I agree that this is difficult for non-mathematicians but I can try and explain what pathological discontinuity in this context means. Take the two functions F(x)= x and G(x)=x+0.001*sin(1000000x). These two functions are close together, they never differ by more than 0.001. Now differentiate them. The first has derivative 1 the second 1+1000cos(1000000x). The derivative of the second function oscillates very rapidly between -999 and 1001. There is nothing particular about 0.001. I can replace it by any number say 10^-100 and the 1000000 by 10^200. In other words I can make G arbitrarily close to F and simultaneously their derivatives arbitrarily far apart. This is discontinuity. There is also nothing special about F(x)=x. I can do it for any differentiable function F. In other words differentiation is discontinuous everywhere which I call pathological discontinuity. In particular I can do it for distribution functions F and I can also make G a distribution function. If two distribution functions F and G are very close together then at the level of data you cannot discriminate between them but their densities and hence their likelihoods can be arbitrarily far apart.
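The sine example above can be checked numerically. The following is only an illustrative sketch: the evaluation grid and the helper names are assumptions made for the illustration, not part of the comment.

```python
import math

def F(x):
    return x

def G(x, eps=0.001, freq=1_000_000):
    # G stays within eps of F everywhere, because |sin| <= 1.
    return x + eps * math.sin(freq * x)

def dF(x):
    return 1.0

def dG(x, eps=0.001, freq=1_000_000):
    # Exact derivative of G: 1 + eps * freq * cos(freq * x).
    return 1.0 + eps * freq * math.cos(freq * x)

# Arbitrary grid on [0, 1]; the grid itself is an assumption of this sketch.
xs = [i / 9973 for i in range(9974)]

max_gap_functions = max(abs(G(x) - F(x)) for x in xs)
max_gap_derivatives = max(abs(dG(x) - dF(x)) for x in xs)

print(max_gap_functions)    # bounded by 0.001
print(max_gap_derivatives)  # the gap of 1000 is attained at x = 0
```

Shrinking `eps` while growing `freq` faster makes the functions closer and the derivatives further apart, which is the discontinuity being described.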

Michael and Oliver. This is my take on Oliver. Firstly probability measures are the basic objects which for the one dimensional case means that distribution functions are the basic objects. I agree with this. Now I have some data and the order is irrelevant for my analysis. This has to be justified and very often can be. For example the data come from different laboratories and it is irrelevant if the first observation is from the third laboratory or the twentieth. Secondly the laboratories are analysing the same sample of water using the same or similar methods. Based on this I decide on an i.i.d. model. I also want my results to be independent of the scale used for the observations. This leads me to a location-scale model F((x-mu)/sigma) and it remains to choose F. The precision of my result depends on the data but also on the model. A model with a large Fisher information will give a smaller interval than one with a small Fisher information. In other words you can import ‘precision’ through the model. In general I will not be able to justify any increase of precision through the model. This leads me to the model with the smallest Fisher information. Suppose I also want my model to have a finite variance so I minimize the Fisher information over all models F with variance 1. This latter value is just conventional. One can prove that the minimum Fisher information distribution F must have a density and that it is the normal distribution. So at least partially we arrive at a model derived from invariance, equivariance and regularization principles.
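The minimum Fisher information claim can be illustrated, though of course not proved, by comparing a few unit-variance location models. In this sketch the choice of comparison distributions (logistic and Laplace), the integration range and the function names are all assumptions made for the illustration.

```python
import math

def fisher_information(density, score, lo=-30.0, hi=30.0, cells=60_000):
    """Midpoint-rule approximation of the location Fisher information,
    the integral of score(x)^2 * density(x), where score = d/dx log density."""
    h = (hi - lo) / cells
    total = 0.0
    for i in range(cells):
        x = lo + (i + 0.5) * h
        total += score(x) ** 2 * density(x)
    return total * h

# Standard normal, variance 1.
def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_score(x):
    return -x

# Logistic with scale s chosen so that its variance s^2 * pi^2 / 3 equals 1.
s = math.sqrt(3) / math.pi

def logistic_pdf(x):
    return 1.0 / (4 * s * math.cosh(x / (2 * s)) ** 2)

def logistic_score(x):
    return -math.tanh(x / (2 * s)) / s

# Laplace with scale b chosen so that its variance 2 * b^2 equals 1.
b = 1 / math.sqrt(2)

def laplace_pdf(x):
    return math.exp(-abs(x) / b) / (2 * b)

def laplace_score(x):
    return -math.copysign(1.0, x) / b

info = {
    "normal": fisher_information(normal_pdf, normal_score),
    "logistic": fisher_information(logistic_pdf, logistic_score),
    "laplace": fisher_information(laplace_pdf, laplace_score),
}
print(info)  # the normal model has the smallest value of the three
```

Among these three the normal carries the least Fisher information, so it "imports" the least precision, which is the regularization argument being made.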

• Yes, that’s generally the sort of presentation I would favour. I might wonder whether precision is the most general concept to introduce and why, but that’s a minor quibble in the present context.

• Christian Hennig

Laurie, here’s a question leading a bit off the path regarding Mayo’s original posting, but something I’m very interested in.
You motivate the use of iid from “irrelevance” of the order of the data. But certainly there may be some relevance that you are not aware of from just knowing where the data come from and what they are, like influence of sunspots on the specific experiment or whatever.
I’d think, in principle, that the iid assumption could and probably should be subject to the same kind of discussion, namely how could conclusions be affected by looking at non-iid models that cannot be told apart from iid by the data? Do you agree? I wonder sometimes, from the way you discuss things, whether you’re keen to be more generous on the iid assumption than on distributional assumptions such as normality.
There may be a pragmatic reason behind it because one can’t solve all problems in one go and discussing distributional assumptions taking iid as given leads to some nice robustness theory whereas questioning all including iid may be hard. Obviously, questioning iid can open a Pandora’s box; for example one cannot exclude distributions such as the point mass on the data looking exactly as they look from which one cannot learn anything, so some non data-based argument is needed to exclude certain badly behaved distributions from consideration.
I know that there is a (modest as far as I think) amount of work on robustness against iid in the literature, but I wonder whether something more general can be said. Your book, as far as I can see, has Sec. 2.14 on independence, but this doesn’t reflect on the question what to do about non-iid models that fit the data, too, in case that iid is adequate (i.e. the issue of robustness/continuity of conclusions from iid).

• Michael Lew

Laurie, thank you for the explanation. I can see the issue now. However, I have to ask, does it matter for all uses of likelihood functions? It seems to me to be irrelevant to the use of a likelihood function as a depiction of the relative support by the data of parameter values within a model. Thus it would be irrelevant for many inferential purposes and so a suggestion that likelihood be abandoned would be a substantial over-reach.

21. George. No, it’s terrible. In more detail it fits the data just as well as the normal distribution in terms of a Kolmogorov goodness-of-fit but is better in terms of likelihood. This is not only the case for my data set but also for all the data sets given by Stephen Stigler in his article on robust statistics. It shows that fitting the data and having a high likelihood does in no way guarantee that the model is good. I take your point that covering parameter values in simulations is not the same as covering true values for real data, here true refers say to the value of the gravitational constant. This applies in particular to Stephen Stigler’s data even for the 95% intervals based on the normal distribution. This is an important topic and requires a longer discussion.

• george

Laurie: that wasn’t really my point.

You seem (as far as I can tell) to be arguing that, for likelihood-based inference, the practitioner instructed to pick a distribution resembling the data is in a bind; the data resemble many distributions, that lead to intervals of very different widths, and hence has essentially no idea what inference should be drawn.

But this omits an important step: the practitioner should also be considering how accurate the approximations used in that inference will be; for the comb distribution I suspect they are terrible, and this would mean it should not be considered – no approximate method is going to work well when we sail so close to its regularity conditions.

Taking away the Comb distribution from your book’s Table 1.2 makes the problem you describe much less compelling. Is there a way to make your argument without accuracy of approximations becoming an issue?

22. This post began as an outlet to express my thoughts prior to attending the recent “P-value Pow Wow” run by the ASA in DC as “an observer”. I met Michael Lew while I was there. This post pretty well covers all the main issues that I’d have shared if I wasn’t a mere observer. The meeting was not secret and Lew knows I’m mentioning it here, buried somewhere in comment #75 or so, even though I’m not to blog on it until public documents appear. Actually I have no inclination to blog on the Pow Wow itself, but I’ve had some good exchanges with participants since the meeting.

23. Michael, there are several reasons why likelihood is bad as a measure of ‘relative support’ whatever that means. The comb distribution is similar to the sin example I gave. Its density function oscillates quite rapidly but its distribution function is close to the normal. This means for example that a slight change in a parameter value can move you from a peak to a trough with a corresponding large change in the measure of support. The same applies to the data. A very slight change in the data values can lead to very large changes in the support for parameter values. This is discussed in my book. You can probably get a copy from the Stats. library in Melbourne and if not ring up Peter Hall and complain. If the data really were comb this is what you would want but they are not so you don’t. If you regularize by say choosing the distribution with the smoothest density which is adequate for the data then you can avoid this. In other words you have to regularize. The problem of regularization is never mentioned in discussions of likelihood. I suspect the reason is that in discussions for and against likelihood both sides operate in the ‘behave as if true’ mode. All discussants accept the model and then conduct arguments about likelihood, stopping times etc. within this model. The problem of regularization never enters. As I said in a reply to Deborah I regard this form of argument as an Irrweg, sorry about the German but it expresses my attitude more precisely and accurately than any English equivalent. Pages 260-268 of my book give several criticisms of likelihood some of which I mentioned in a reply to Deborah. Two main points are that likelihood reduces the measure of fit between a complex data set and a complex model to a single numerical value for every choice of parameters. In your pensive moments this must worry you.
One consequence of this, the second point, is that you cannot read off from the likelihood whether the model is or is not reproducing those aspects of the data you regard as important.

• Michael Lew

Laurie, I’m confident that your book is, indeed, in the library. However, I am currently on leave in Pennsylvania.

I see what you are concerned about better now, but I don’t share your concern. For example, you keep using the comb model as an example of where the likelihood function is dysfunctional, but then you wrote this: “If the data really were comb this is what you would want but they are not so you don’t.” My response is that if the data are not from a system for which the comb function is a reasonable representation then do not use the comb function. For likelihood functions I have only ever worked with distributions with a single mode, but in curve fitting I have used multimodal functions, and I can say that they are quite awkward because their parameters have interacting effects on the goodness of fit. That leads to parameter value-fit landscapes with lots of local minima. My guess is that multimodal distributions are naturally awkward for likelihood analysis as well. However, that is not a general criticism of likelihood functions as representations of evidence.

The likelihood function can only ever be used _within_ the model from which it is calculated (see my arxived paper for more on that http://arxiv.org/abs/1507.08394) and so the ‘behave as if true’ thing is unavoidable, but it is probably more accurate to say for most purposes that one has to ‘behave as if reasonable’ rather than ‘behave as if true’. If you choose a silly model then the likelihood function is a silly representation of the support by the data of parameter values, but equally reasonable models will usually yield equally reasonable depictions of the evidence. If the model performs badly for the purpose of a likelihood analysis, then choose a model with better properties. You cannot “read off the likelihood function” the reasonableness of the model because the likelihood function only has a meaning within the model. Use other tools, including common sense, for that task.

Discussions of the likelihood principle have to include the ‘behave as if true’ thing because the likelihood principle only applies within a statistical model. Discussions where the models are in question are not discussions about the validity and applicability of the likelihood principle.

I certainly do not worry in my “pensive moments” about the fact that the likelihood function summarises multidimensional information into fewer dimensions because I consider that to be a desirable property. If it did not do so then there would be no advantage in using a likelihood function in place of the actual data. Finally, I note that a likelihood function is not a single value, as it is a function, and there is no use of a single likelihood licensed by the likelihood principle. Instead, likelihoods within a model are compared by way of their ratios. The inferential use of likelihoods of isolated hypotheses is not something that the likelihood principle licenses.

24. George, I missed yours, you are missing mine. Let us denote the standard normal distribution function by F and the standard comb distribution function by G. A general method of generating normal and comb random variables is to take a uniform random variable U and use F^-1(U) and G^-1(U). Take U=0.1 for example, then qnorm(0.1)= -1.281552 and qcomb(0.1)=-1.315991. They differ by less than 0.035. If I gave you 30 i.i.d. normal and 30 i.i.d. comb random variables you would not be able to distinguish them unless of course I told you one is normal and one is comb, but for real data this is not possible. The point is that if one is an adequate approximation for a data set then so is the other, they are too close to be distinguished. The measure of approximation I use is the Kuiper distance or rather sqrt(n)*Kuiper distance and the values are 0.204 for the normal and 0.248 for the comb. Both of the values are perfectly acceptable as indicated by the P-values. There is no mention of regularity conditions at this point. I could take a distribution without a density for example and find one in terms of goodness-of-fit which is equally good. Look at the example I used to explain the problem to Michael using f(x)=x and g(x)=x+0.001*sin(1000000x). Tell me what your concept of approximation is. The one I use here is the Kuiper metric and in terms of this approximation the comb distribution is not terrible. What is terrible is basing the inference on the likelihood function. No statistician would use the comb distribution for inference purposes unless there were very good reasons for this. Instead the normal distribution would be used. This is a form of regularization. This is either informal, a sensible model, or formal, for example a minimum Fisher model. Any discussion of likelihood must respect this and take regularization seriously and ‘sensible’ is not serious.
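For readers who want to see what the Kuiper distance involves, here is a minimal sketch. It assumes a continuous model and uses a simulated standard normal sample with an arbitrary seed; the comb distribution is not reproduced, since its definition is not given in this thread, and the function name is hypothetical.

```python
import math
import random
from statistics import NormalDist

def kuiper_distance(sample, cdf):
    """Largest gap, over intervals (a, b], between the empirical frequency
    of the interval and its model probability cdf(b) - cdf(a).
    For a continuous model this equals D+ + D-."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus + d_minus

model = NormalDist(0.0, 1.0)

# Quantile transform F^-1(U) at U = 0.1, the qnorm(0.1) quoted in the comment.
q10 = model.inv_cdf(0.1)

rng = random.Random(2015)  # arbitrary seed, an assumption of this sketch
sample = [rng.gauss(0.0, 1.0) for _ in range(30)]
kd = kuiper_distance(sample, model.cdf)

print(round(q10, 6))        # -1.281552
print(math.sqrt(30) * kd)   # scaled distance, comparable to the sqrt(n)*Kuiper values above
```

Only distribution functions enter this calculation, never densities, which is why two models with wildly different densities can score almost identically.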

• george

Laurie

I do see (and have all along seen) your point about not being able to distinguish between Normal and Comb based on a small sample. I’m also not arguing over whether likelihood is terrible, or not. What I’m interested in is whether your Comb example, particularly the width of its confidence interval, will convince anyone there’s an issue in practice.

You will, I’m sure, agree that the coverage one gets from likelihood-based methods is only approximate – even when the likelihood is perfectly-specified. The Comb looks like the sort of example that one cooks up to show that such methods can behave very badly in finite samples. (Google “Radford Neal” and “inconsistent MLE” to see another, rather eloquently described)

You appear to point out that the inference obtained assuming one distribution (Normal) is very different from that assuming a second distribution (Comb) yet samples of the two are indistinguishable. But if the second distribution is something that would never be used for inference anyway, why should anyone care? As a piece of rhetoric it seems very weak. Showing an important difference between inference under two distributional assumptions which both lead to reasonable coverage would be much more compelling. Can you do this?

25. P-values. This started off as a blog on P-values and veered off at some point. I have always been somewhat unhappy about P-values whilst using them myself. I have always had some sympathy with the Bayesian point of view that they overestimate differences. Here are some further thoughts and I would appreciate comments. To make things simple we have a normal model, data and the null hypothesis H_0: mu=mu_0=0. I have data with n=30, mean(x)= 0.786 and sd(x)=1.07 and calculate the P-value 2(1-pt(sqrt(n)|mean(x)-mu_0|/sd(x),n-1)) with result 0.000379. Now I think to myself and staying within the Gaussian model what are the possible values of mu for the data x? In fact, as I think in terms of approximation which requires specifying a complete probability measure, I think what are the possible values of (mu,sigma) consistent with the data? To answer this I calculate a joint say 0.95 confidence region for (mu,sigma). This confidence region contains for example (mu,sigma)= (0.489,1.188). Now I can simulate the P-values for H_0 but now using the model (mu,sigma)=(0.489,1.188) which is, to repeat, consistent with the data. The first 10 values of such a simulation are 0.1109843, 0.2997034, 0.6154555, 0.006614542, 0.01594904, 0.09508757, 0.006960614, 2.43778e-05, 0.0593152, 0.02416382. As mean(x)= 0.786 and sd(x)=1.07 these simulated P-values are on average going to be much larger than 0.000379. Any thoughts?
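The calculation in this comment can be re-run along the following lines. This is a sketch under stated assumptions: it works from the rounded summary figures (n=30, mean 0.786, sd 1.07), so the P-value will only be close to the quoted 0.000379; the t tail is obtained by Simpson integration of the density rather than by R's pt; and the seed, the number of simulations and the function names are arbitrary choices, so the simulated values will not match the ten listed above.

```python
import math
import random

def t_density(t, df):
    # Density of Student's t with df degrees of freedom.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(t_stat, df, steps=400):
    """2 * (1 - pt(|t|, df)), computed as 1 - 2 * integral of the density
    over [0, |t|] using Simpson's rule (steps must be even)."""
    a = abs(t_stat)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def p_value_from_summary(n, mean, sd, mu0=0.0):
    t_stat = math.sqrt(n) * (mean - mu0) / sd
    return two_sided_p(t_stat, n - 1)

observed_p = p_value_from_summary(30, 0.786, 1.07)

rng = random.Random(1)  # arbitrary seed

def simulated_p(n=30, mu=0.489, sigma=1.188):
    # Draw a fresh sample from N(mu, sigma^2) and test H_0: mu = 0.
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return p_value_from_summary(n, m, sd)

sims = sorted(simulated_p() for _ in range(1000))
median_p = (sims[499] + sims[500]) / 2
share_larger = sum(p > observed_p for p in sims) / len(sims)

print(observed_p, median_p, share_larger)
```

`share_larger` is the fraction of simulated P-values that exceed the observed one, which puts a number on "on average going to be much larger".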

• Christian Hennig

My first (perhaps stupid) thought is “what’s wrong with this?” Can you elaborate a bit more on why you think this is a problem?

• Michael Lew

If you want to know what values of mu are reasonable in light of the data then you want to look at the likelihood function. It answers your question in the most direct manner.

26. Christian you are right but this is not the place to discuss it. You have mentioned this before and I intended to reply but … I’ll send you some thoughts via email.

27. Michael, that’s a good dogmatic statement to be countered by my own, namely, complete nonsense.

• Michael Lew

Dear Laurie

I’m sorry to hear your response. If the likelihood principle is true, or if it is nearly true, or if it is just a sensible normative principle then your response is, at best, self-referential.

This is the likelihood principle:
Two likelihood functions which are proportional to each other over the whole of their parameter spaces have the same evidential content. Thus, within the framework of a statistical model, all of the information provided by observed data concerning the relative merits of the possible values of a parameter of interest is contained in the likelihood function of that parameter based on the observed data.

The law of likelihood says that the degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.

Regards,
Michael Lew
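A standard textbook illustration of the two statements above, offered as a sketch rather than as anyone's argument in this thread: 3 successes in 12 Bernoulli trials, once with the number of trials fixed in advance (binomial) and once sampling until the third success (negative binomial). The numbers 3 and 12, the parameter values compared, and the function names are assumptions chosen for the example.

```python
import math

def binomial_likelihood(theta, successes=3, trials=12):
    # Number of trials fixed in advance; the number of successes is random.
    return (math.comb(trials, successes)
            * theta ** successes * (1 - theta) ** (trials - successes))

def negative_binomial_likelihood(theta, successes=3, trials=12):
    # Sample until the 3rd success; the number of trials is random.
    return (math.comb(trials - 1, successes - 1)
            * theta ** successes * (1 - theta) ** (trials - successes))

def likelihood_ratio(lik, theta_a, theta_b):
    # Law of likelihood: support for theta_a relative to theta_b.
    return lik(theta_a) / lik(theta_b)

lr_binomial = likelihood_ratio(binomial_likelihood, 0.25, 0.5)
lr_negative_binomial = likelihood_ratio(negative_binomial_likelihood, 0.25, 0.5)

# The two likelihood functions differ only by a constant factor in theta,
# so every likelihood ratio agrees between the two designs.
constant = binomial_likelihood(0.3) / negative_binomial_likelihood(0.3)

print(lr_binomial, lr_negative_binomial, constant)
```

Under the likelihood principle the two designs carry the same evidence about theta; an error statistician would object that the sampling distributions, and hence the error probabilities, differ.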

• Michael: They have the same evidential content with respect to questions of parametric inference (not, for example, for questions regarding the correct model). If Laurie opposes the LP as a principle for evidence within models, he must have another way of doing parametric inference. The error statistician, of course, does, in that evidential import requires considering error probabilities. Two data sets may have identical evidential import for a likelihoodist or Bayesian (or holder of the LP) while they constitute different evidence for an error statistician: good fit is not enough! Equally good fits may grow out of procedures with very different capabilities of having unearthed erroneous interpretations. The sampling distribution is altered and thus the evidential import of the data.

• Mayo –
Assume for now we have carried out some form of regularization, have a sufficiently smooth likelihood function etc etc. That is, ignore Laurie’s important (to me) points for now.

Now, clearly setting the first derivative of the likelihood function to zero allows one to find a (possibly local) maximum (or minimum).

My question: do you ascribe any meaning to (and hence potential use of) the second derivative of the likelihood function?

Clearly the sign provides qualitative information, right? How its magnitude?

What about higher derivatives?

• I meant how *about* its magnitude

• john byrd

Michael, it seems that you are always determined to double down on the efficacy of the likelihood principle and law of likelihood within the model, but do you not agree that the link of the imaginary world and the real world is through the error probabilities associated with the inferences? In other words, who cares what is indicated within the model if it cannot be argued to be applicable to real problems within some degree of error?

• John: Those who keep to relative likelihoods and formal Mathematical “fit” measures don’t really see the crucial role of error probabilities in reliable inference.

• Michael Lew

Mayo, that allegation may be convenient in a sort of ‘them and us’ way, but it is inaccurate. For example, I am a likelihoodlum and so by your terminology I use mathematical “fit”, but I am very concerned about error rates. However, error rates seem to me to be most relevant to questions of what to do or decide rather than the question of what the data say. In that context I like to see an explicit loss function lest concerns about type I errors trump all other considerations, as they frequently do in the standardly used ‘hypothesis test’ procedures.

• I just find it perplexing that a study whose error probing capacities are weakened by biasing selection effects (cherry picking, post-data subgroups, trying and trying again etc) is not deemed problematic and its inferences unwarranted for the case at hand. I simply cannot understand the point of a statistical inference method that overlooks what is essential for ampliative-inductive learning from data, relegating it to some kind of concern with costs, losses and deciding whose interests to employ. The concern is entirely with finding things out and doing so in the cases at hand. Here I agree with Fisher.
I’m not claiming likelihoodlums, as you so aptly call them, wouldn’t want an account focused on finding things out (as opposed to what is practical to do, according to so-and-so’s payoff). I really think it’s likely that the hoodlums just don’t get the essence of inferential learning from error, and truly believe error probs are relevant only for long-run error control.
The point of my new (forthcoming) book is to convey that essence–a philosophical feat.

• Michael Lew

Mayo, you write “I just find it perplexing that a study whose error probing capacities are weakened by biasing selection effects (cherry picking, post-data subgroups, trying and trying again etc) is not deemed problematic and its inferences unwarranted for the case at hand” but I find it perplexing that when you say “error” you appear to be entirely focussed on false positive errors. Cherry picking and post-data subgroups and repeated testing all bring advantages along with the disadvantage of increased risk of false positives.

The green jelly beans cartoon of xkcd (https://xkcd.com/882/) gives a nice illustration of some issues. The usual interpretation is that the multiple subgroup testing leads to a spurious significant finding that is misinterpreted as being important. That’s fine, and I agree with it. However, consider how we would like the results to come out if green jelly beans _did_ cause cancer and no other jelly bean did? (That is obviously biologically possible given that the colour comes from a dye.) We would expect to see little signal when jelly beans are tested as a mixture of colours and we would hope to see a signal when we test green jelly beans but no other colour. That is what the cartoon shows! To focus only on type I errors is to miss important results. To use a dichotomous P greater or less than 0.05 presentation of the results obscures the evidential information in the data. (Substitute any value you like for 0.05, although note that almost no-one uses any other value.)
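The multiplicity side of the cartoon can be put in numbers with a small sketch. It assumes 20 independent subgroup tests, each at the 0.05 level, with no real effect anywhere; the count of 20, the independence and the function name are assumptions for illustration only.

```python
def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance that at least one of n_tests independent tests of true nulls gives P < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Assumed setup: 20 colours, each tested once at the 0.05 level.
    print(round(prob_at_least_one_false_positive(20, 0.05), 4))
```

Under those assumptions a "significant" colour turns up in roughly two out of three such studies, which is the usual reading of the cartoon. The calculation says nothing about the other case raised above, where green really does have an effect.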

Complaining about the various types of P-hacking procedures and about the stopping rule independence of the likelihood function without considering the inferential process as a whole is just as damaging as ignoring those things. We should be determining the overall operational characteristics of procedures under the various circumstances that pertain. Not only should we, but I have done some of that, and I can report that interim data peeking improves the performance of a hypothesis test whenever there is some weight placed on the false negative error as well as the false positive error.
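A rough way to look at such operating characteristics is by simulation. The sketch below is not the analysis referred to above; it is a minimal illustration under assumed conditions: unit-variance Gaussian data, a two-sided z-test, four unadjusted looks, and invented names and defaults throughout.

```python
import random
from statistics import NormalDist


def rejection_rates(effect, looks=(10, 20, 30, 40), alpha=0.05, n_sim=2000, seed=1):
    """Return (rate with a single final test, rate with peeking at every look).

    Data are N(effect, 1) with known variance. Every look uses the same
    unadjusted two-sided z-test, and the final look is the fixed-sample test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n_max = max(looks)
    fixed = peek = 0
    for _ in range(n_sim):
        xs = [rng.gauss(effect, 1.0) for _ in range(n_max)]

        def significant(n):
            # |z| = |sum of first n observations| / sqrt(n) when sigma = 1
            return abs(sum(xs[:n])) / n ** 0.5 > z_crit

        fixed += significant(n_max)
        peek += any(significant(n) for n in looks)
    return fixed / n_sim, peek / n_sim


if __name__ == "__main__":
    print("no effect   (false positive rates):", rejection_rates(0.0))
    print("effect 0.3  (power):               ", rejection_rates(0.3))
```

Because the final look is one of the peeks, peeking can only add rejections: both the false positive rate and the power go up. Whether that trade is worthwhile depends on how the two kinds of error are weighted, which this sketch deliberately leaves open.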

• Michael: I find it hard to believe you think I ignore type 2 errors and power when my account is built on NP testing, and I’ve spent zillions of posts on power and power analysis. Cheating can easily go both ways. Severity arose largely to deal with statistically insignificant results. So do you not have EGEK(2006)? If you give me an address I’ll send you a copy.

• Michael Lew

Mayo, I agree that you have emphasised power frequently, and, as I have commented before, I prefer the severity account to the classical Neyman-Pearson account. However, I’m not sure that you have addressed my points at all. P-hacking _increases_ power**. When there really is an effect to detect, that increase in power will usually be a good thing. The ‘best’ real-world inferential procedure is the one that leads to a sufficiently correct* inference in minimal time and expense. Often that procedure will not have ‘good’ control of type I errors. I don’t think a reader of this blog will find those ideas in your writings.

* The meaning of sufficiently correct varies with the context of the experiment. Sometimes it is sufficient to correctly discard a false null hypothesis, but often it requires an adequately accurate assessment of the size of effect. For the former the increased power from P-hacking is good; for the latter the inflation of observed effect size that sometimes results is a bad thing.

** I am not in favour of the widespread employment of P-hacking without knowledge of its effects and without acknowledgement during inference. However, I am against the idea that all parts of P-hacking are universally a bad thing.

• Michael Lew

John, I agree that there needs to be a link between the statistical hypotheses concerning model parameters contained by the statistical model and the scientific hypotheses of interest. I do not agree that error rates are necessarily the best way to connect the two. I would prefer a more thoughtful connection, even if nebulous, that can come from separating what the data say from questions of what to believe and what to do or decide. Looking at the likelihood function as a picture of what the data say does not preclude one from making inferences on the basis of other things, including error rates, and it is the best way to see what the data say as far as I can see.

Laurie’s arguments seem weak to me because he illustrates them with a ridiculous comb distribution which would never be a contender if the alternative model, normal, is a contender (and vice versa). Model selection is not what the likelihood principle is about, and it seems silly to think that the likelihood function is useless for every purpose just because it does not encode anything about the validity of the model.

Discussion with you, Laurie, Mayo and others is difficult because you all seem intent on reading more into my words than I intend them to contain. The likelihood function tells us about the relative merits of the parameter values within a model in light of the data. That is what the data say. You can choose a different model and the data will naturally say something different. So what? Choose a different model and you get different error rates. So what? Choose your model carefully, and assess its usefulness as often as you like or need. When you want to make statistical inferences you must supply a model, and the model-dependency of the inferences is unavoidable. Thus model-based arguments against the use of likelihoods in scientific inferential processes apply to other statistical approaches just as much.

• Michael – even Aris Spanos, who I take to be an error statistician if anyone is, also carefully distinguishes between testing ‘within’ and testing ‘without’ a model. So yes, it is confusing to read some of Mayo and John’s objections to your points.

I do think that Laurie’s criticisms are more valid but agree that the comb example does not have especially great rhetorical strength (as george has also mentioned here).

• Omaclaren: Firstly, I don’t think you are grasping correctly the Spanos “within/without” distinction. But most importantly, just now, I would ask you to find any place where I criticize likelihoodism (and I’ve numerous posts on it on this blog) on grounds that it assumes a statistical model.

• Well, as these discussions tend to go, for my part I don’t think you’re grasping correctly many of my, Michael’s or Laurie’s distinctions.

One example appears to be present in your question – I am not saying that you criticise likelihoodism on the grounds it assumes a statistical model at all. I’m saying that you don’t seem to grasp that likelihoodists (I am not one, but I occasionally wear the hat when useful) claim there is a sense in that

*given a model (and regularity conditions – see regularization discussions), the likelihood principle holds*.

Here is one simple bayesian view (I am fine to try on different hats just as I might consider different models). Consider

(1) p(y|y0) = ∫p(y|θ,y0)p(θ|y0)dθ

Where y is future/predicted data, y0 is the observed data and θ is the parameter(s) of the model.

Assume we don’t doubt the model in light of observed data, and formalise this as

(2) p(y|θ,y0) = p(y|θ)

Then

p(y|y0) = ∫p(y|θ)p(θ|y0)dθ

p(θ|y0) prop. to p(y0|θ) p(θ)

Thus info about theta enters via p(y0|θ). What is your objection to this reasoning?
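A minimal numerical sketch of this reasoning, assuming a Bernoulli model for p(y|θ), a uniform prior on θ, and 7 successes in 10 observed trials (all three are assumptions chosen for illustration, as are the function and argument names):

```python
def predictive_prob_success(successes, trials, likelihood_scale=1.0, grid=20000):
    """Approximate p(y=1 | y0) = ∫ p(y=1 | θ) p(θ | y0) dθ under a uniform prior.

    The integral is evaluated on a midpoint grid. likelihood_scale multiplies
    p(y0 | θ) by a constant, to show that only the shape of the likelihood
    as a function of θ matters.
    """
    thetas = [(i + 0.5) / grid for i in range(grid)]
    failures = trials - successes
    # Unnormalised posterior: likelihood times a flat prior.
    weights = [likelihood_scale * t ** successes * (1 - t) ** failures for t in thetas]
    total = sum(weights)
    # p(y=1 | θ) = θ, averaged over the normalised posterior.
    return sum(t * w for t, w in zip(thetas, weights)) / total


if __name__ == "__main__":
    print(predictive_prob_success(7, 10))
    print(predictive_prob_success(7, 10, likelihood_scale=120.0))
```

Both calls agree with the closed form (7 + 1)/(10 + 2) for a uniform prior: a constant multiplying p(y0|θ) cancels in the normalisation, so θ is informed only through the likelihood's dependence on θ.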

I take the likelihoodist to be saying something like (2) implies the likelihood principle. That is, working ‘within’ a model means using the same model structure for the future predictions without modifying it based on data y0.

A likelihoodist could presumably doubt (2) in particular instances but this corresponds to misspecification.

PS If you doubt I grasp Spanos’ point perhaps you could enlighten me. Excuse any errors; I’m writing on my phone in a hospital bed (not too serious tho).

• You wrote: “I’m saying that you don’t seem to grasp that likelihoodists … claim there is a sense in that
*given a model … the likelihood principle holds*.
Here is one simple bayesian view” from which the LP falls out.

Michael, I assure you that I have written and thought quite a bit about the fact that “there is a sense” in which the likelihood contains all the evidential import when parametric inference is via likelihood ratios or Bayes theorem. How else could I have figured out what’s wrong with Birnbaum’s “proof” of the LP? And don’t forget that the “weak” LP holds for frequentist parametric inference within a model, reducing to sufficiency. The “strong” LP, and that’s the controversial one, makes reference to two distinct distributions (with a shared parameter of inference). Search SLP on this blog for a clear statement of the principle and quite a lot of discussion.
The confusion relates to a confusion over sufficiency in frequentist inference. See Mayo and Cox (2006): http://www.phil.vt.edu/dmayo/personal_website/(2006)%20MayoCox%20Freq%20stats%20as%20a%20theory%20of%20inductive%20%20inference.pdf

So how’d you wind up in the hospital? (or should I say “in hospital”, since you’re in England)?

• Mayo
OK good – we agree then that the weak likelihood principle is not so bad, strong is dubious.

Here is another simple example, this one showing how a Bayesian can violate the SLP. Consider

(1) p(y|y0,e) = ∫p(y|θ,y0,e)p(θ|y0,e)dθ

where e is a parameter indicating different experimental setups (say e’s domain contains two possible values).

Assume again we don’t doubt the model in light of observed data, but also don’t doubt it applies to both setups, ie for all e values. Formalise this by

(2) p(y|θ,y0,e) = p(y|θ)

Then

p(y|y0,e) = ∫p(y|θ)p(θ|y0,e)dθ

Now we need another closure assumption – we need to know how the prior p(θ|y0,e) depends on the experimental setup. It is perfectly possible for a Bayesian to have

p(θ|y0,e) != p(θ|y0)

where != indicates ‘not equal’, e.g. via

p(θ|y0,e) prop to p(y0|θ,e)p(θ|e) = p(y0|θ)p(θ|e) by (2).

So if p(θ|e) != p(θ) then we can violate the SLP, no?
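One concrete case where p(θ|e) differs between setups is a Jeffreys prior, which is derived from the sampling model and so changes with the design. This is an illustrative choice, not something specified above. The sketch assumes θ is the success probability, setup 1 fixes the number of trials (binomial), and setup 2 samples until a fixed number of failures (negative binomial); the data values are hypothetical.

```python
def posterior_mean_binomial_jeffreys(successes, failures):
    # Binomial design: the Jeffreys prior is Beta(1/2, 1/2),
    # so the posterior is Beta(successes + 1/2, failures + 1/2).
    a, b = successes + 0.5, failures + 0.5
    return a / (a + b)


def posterior_mean_negbinomial_jeffreys(successes, failures):
    # Sampling until a fixed number of failures: the Fisher information is
    # proportional to 1 / (θ (1 - θ)^2), so the (improper) Jeffreys prior is
    # θ^(-1/2) (1 - θ)^(-1) and the posterior is Beta(successes + 1/2, failures).
    a, b = successes + 0.5, float(failures)
    return a / (a + b)


if __name__ == "__main__":
    # Hypothetical data: 9 successes and 3 failures under either design.
    print(posterior_mean_binomial_jeffreys(9, 3))
    print(posterior_mean_negbinomial_jeffreys(9, 3))
```

The two likelihoods are proportional in θ, yet the posterior means differ (9.5/13 versus 9.5/12.5) because the prior depends on e. A Bayesian who fixes one prior regardless of the design would get identical posteriors.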

PS – hospital, long story. Short version is I have proteins in my blood and urine that I shouldn’t. Doing tests and having IV treatment. A bit bored so arguing on the internet about statistical principles.

Best,
Oliver

• Also note:

Suppose p(y|y0,e) = p(y|y0) ie we consider experiments with ‘equivalent’ evidence.

Then, using the above we have

p(y|y0,e) – p(y|y0) = ∫p(y|θ)[p(θ|y0,e) – p(θ|y0)]dθ = 0

So it seems that there is an orthogonality condition lurking in there. It is probably obvious to a better mathematician than me. I’ll need to find a pen and paper at some point to work it out, can’t do it on my phone/in my head at the moment.

• But something to do with the prior being freely variable with the experimental setup as long as this variation is orthogonal to your likelihood. Again, a closure/boundary constraint or ‘within/without’ distinction.

• Oliver: We do not agree that the weak LP is not so bad but the SLP is dubious. The weak is just sufficiency and correctly stated it is a tautology. The SLP negates the use of the sampling distribution for parametric inference. It’s not a matter of it being dubious. For some forms of parametric inference (e.g., Royall-style likelihoodlumism) it too is a tautology. For an error statistician, it leads to bankrupt inference. What’s unsound is Birnbaum’s alleged proof of the SLP.

Bored in hospital? Catch up on reading, e.g., Cox and Mayo (2010) “conditionality and objectivity” http://www.phil.vt.edu/dmayo/personal_website/ch%207%20cox%20&%20mayo.pdf

• dubious/ˈdjuːbɪəs/

1. hesitating or doubting.
2. not to be relied upon; suspect.

And
NZ slang for a tautology? Or at least not the source of the problems at hand.

Given these terms your response makes no sense to me.

Did you look at my derivations?

I’ve read that paper but will read again. I’m quite sympathetic to Mike Evans’ discussion of your paper on Birnbaum btw.

• > For some forms of parametric inference (e.g., Royall-style likelihoodlumism) it too is a tautology. For an error statistician, it leads to bankrupt inference. What’s unsound is Birnbaum’s alleged proof of the SLP.

Re-reading this part I see your point a bit better but it hardly seems very fair-minded. I presented a simple Bayesian argument (that may or may not be a good one – you could perhaps focus on criticising the math so we don’t get caught in verbal contortions) against the SLP.

It was not a direct critique of Birnbaum. So frequentists, Bayesians and others (I tend towards other) have reason to doubt the SLP. Even likelihoodists like Michael appear to possibly have issues with the SLP. Personally I think a prior is needed for proper closure of the likelihoodist approach (see also Evans). The prior is still a conditional distribution though. But we won’t get anywhere unless you engage me on the math and I engage you more on __?

• O: This is quite confused, we know the SLP follows from inference by Bayesian posteriors; maybe it’s the hospital food?

• Mayo: Did you read Mike Evans’ response to – and bayesian perspective on – your Birnbaum paper? Why would a Bayesian deny the SLP? Ask him.

• Michael Lew

I have been reading around likelihood for a long time and I am of the opinion that there are many distinct versions of the likelihood principle. I don’t mean that there are many ways to state the principle, but that there are many versions of the principle that have different meanings. Can we please specify exactly what the strong likelihood principle is? What is the weak? Here is my statement of the likelihood principle:

Likelihood principle:
Two likelihood functions which are proportional to each other over the whole of their parameter spaces have the same evidential content. Thus, within the framework of a statistical model, all of the information provided by observed data concerning the relative merits of the possible values of a parameter of interest is contained in the likelihood function of that parameter based on the observed data.

Is it strong or weak? Neither?

Note that my statement makes no comment about inferences, and it makes no comment about the comparison of likelihoods from different models. As far as I can tell, most of the resistance to the likelihood principle comes from the erroneous assumption that it says something about how one should make inferences.

• Michael – I think I generally understand your viewpoint and am not unsympathetic. I think issues of regularization are more important than you do, however.

Do you require any other properties of your likelihood function? If I were you I would modify your statement to at least

“two [smooth] likelihood functions…”

• Michael: You cannot fault me for questioning the value of likelihoodism on grounds that it assumes the model. That’s never my problem, so please don’t pin it on me.

• Michael Lew

Mayo, OK, you do not object to the model, but you do sometimes assume that I mean to imply that a likelihood function leads to a decision without regard to error rates.

• Michael: I’m always only talking about evidence or inference when discussing this issue. I fail to see why considerations that can destroy or vouchsafe an inference should be prohibited from being considered at the evidence or inference stage, and have to be saved “for later”. As I say in my book, “this is later”.

• John: Michael is confusing your reference to linking what’s done within “picturesque” statistical inferences and the real world, with linking statistical hypotheses and substantive theories and claims.

• Michael Lew

Picturesque? That’s not a word I usually associate with a statistical model. Can you expand?

• That’s Neyman’s term.

• john byrd

I did ask Michael the questions because I do read a lot into the insistence that all is well so long as it is assumed to be within the model. As discussed before, I am not comfortable saying that all of the evidence the data have to reveal is seen within the model, I suppose because I consider the manner in which data are gathered to be a quintessential component of “the evidence.” To me, the statistical analysis of whatever stripe is always just one part of a greater whole of evidence and must be evaluated within that greater context. What is missing in Michael’s discussions here, though perhaps not in his published papers, is how the statistical result is integrated into a greater whole. It seems obvious that no one cares what is valid within the model if the model is not shown to be useful and appropriate. Error probabilities play a role there.

• John: There’s some equivocation now in what you say, in relation to the discussion. I understood the model to be one thing and the relevant sampling distribution another. That is how one gets violations of the SLP. Do you see what I’m saying? Even with the same sufficient statistic, its sampling distribution can be different, and that’s what makes for likelihood principle violations.
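The standard textbook illustration of this point can be sketched as follows. It is an assumed example, not one taken from the discussion: 9 heads and 3 tails are observed, θ is the probability of heads, and the null θ = 0.5 is tested one-sidedly, either with the number of tosses fixed at 12 (binomial) or with tossing continued until the third tail (negative binomial).

```python
from math import comb


def binomial_likelihood(theta, heads=9, tails=3):
    n = heads + tails
    return comb(n, heads) * theta ** heads * (1 - theta) ** tails


def negbinomial_likelihood(theta, heads=9, tails=3):
    # Toss until the tails-th tail; the final toss is necessarily a tail.
    return comb(heads + tails - 1, heads) * theta ** heads * (1 - theta) ** tails


def binomial_p_value(heads=9, tails=3):
    # P(at least `heads` heads in heads + tails tosses) when theta = 0.5
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n


def negbinomial_p_value(heads=9, tails=3):
    # P(at least `heads` heads before the tails-th tail) when theta = 0.5,
    # i.e. at most tails - 1 tails among the first heads + tails - 1 tosses.
    m = heads + tails - 1
    return sum(comb(m, k) for k in range(tails)) / 2 ** m


if __name__ == "__main__":
    for theta in (0.3, 0.5, 0.7):
        print(theta, binomial_likelihood(theta) / negbinomial_likelihood(theta))
    print(binomial_p_value(), negbinomial_p_value())
```

The likelihood ratio is the same constant at every θ, so the two likelihood functions are proportional, while the one-sided P-values are 299/4096 (about 0.073) and 67/2048 (about 0.033). The sampling distribution differs although the likelihood does not, and that is the kind of difference the SLP declares irrelevant.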

• john byrd

Mayo: I understand that, but it seems that one can assume that away by isolating the model and its implications? I do not wish to do so, but if I understand the argument, some do focus on the model and make assumptions that negate such factors. They interpret the result first with the narrow view within the model, then consider extraneous factors as a separate step?

• John: If I understand you, you’re saying holders of the LP (by which we mean the strong LP) don’t consider the sampling distribution relevant to making parametric inferences, whereas the sampling theorist does.

• Anonymous

Michael:
“Laurie’s arguments seem weak to me because he illustrates them with a ridiculous comb distribution which would never be a contender if the alternative model, normal, is a contender (and vice versa).”
I wonder whether you could explain why this is so without using a reasoning that is very similar if not identical to Laurie’s. It’s exactly his point that the comb distribution is ridiculous, but that this cannot be seen from classical likelihood or even quality-of-fit arguments.

• Michael Lew

Anonymous, the criticism that Laurie provides using the comb function is that the likelihood function does not tell the analyst that his or her model is stupid. That is true, but no-one says otherwise, and the likelihood principle does not imply it. You should use whatever model validation assessments you like (including common sense) and when you are happy with the model then inspect the likelihood function to see the evidence in the data about the values of your parameter of interest. Given such a procedure, Laurie’s criticism is nothing more than a complaint about the fact that likelihoods do not provide an analytical function that they do not pretend to provide. That is a complaint that can be discounted with a simple ‘So what?’.

• Christian Hennig

Hi Michael, I had posted as “Anonymous” above because I used a computer on which my name was not saved, which I had ignored.
I see you’re still in discussion with Laurie himself about this, so I may not need to chip in, but perhaps I still do, asking in a more straightforward way than Laurie usually asks things, what is your principle/rationale that tells you that the comb distribution is stupid?

Actually, you could set up a model (or in Laurie’s terminology, a parametric family of models) that has one parameter for whether it’s Gaussian or comb, and another parameter for location (and perhaps a third one for variation). Using the likelihood, the best model you’d find for many datasets including Laurie’s copper data would be the comb (as long as you get the parameter about right, of course). Probably you’d find this model also stupid, but once more, for what reason?

• Michael Lew

Christian, I would imagine that the comb model that best fits the data has more parameters than the data points being fitted. Overfitting leads to silly outputs from all statistical methods, not just likelihoods. Model complexity as measured by the number of parameters (parameters in my sense, not Laurie’s personal sense of the word) is penalised in the widely used Akaike information criterion (AIC) method of model comparison.

Likelihood is used as a component in many model selection procedures and criteria such as AIC, BIC, DIC and others. However, it is not the sole component of any. As far as I know the issue of how to choose a model is not one that has been definitively solved in a general sense. What is clear, though, is that the fact that a likelihood function is insufficient to test the model from which it comes is not a reason to say that looking at likelihood functions is pointless, which is what Laurie proposed many comments ago.
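For reference, AIC is 2k − 2 ln L̂, where k is the number of estimated parameters and L̂ is the maximised likelihood. The sketch below computes it for a Gaussian model with mean and variance both estimated; the data values are made up, and the helper names are illustrative.

```python
from math import log, pi


def gaussian_max_loglik(xs):
    """Maximised Gaussian log-likelihood with mean and variance both estimated."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # maximum-likelihood variance
    return -0.5 * n * (log(2 * pi * var) + 1)


def aic(max_loglik, n_params):
    """Akaike information criterion: the likelihood enters, but so does the parameter count."""
    return 2 * n_params - 2 * max_loglik


if __name__ == "__main__":
    data = [2.1, 1.9, 2.4, 2.0, 1.7, 2.3]  # made-up observations
    print(aic(gaussian_max_loglik(data), n_params=2))
```

The likelihood is one ingredient and the penalty term is the other, which is the sense in which likelihood is a component of the criterion without being the whole of it.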

• “Christian, I would imagine that the comb model that best fits the data has more parameters than the data points being fitted.”
Not true. Note by the way that the number of parameters is a characteristic of a parametric family of models whereas the “best fitting model” is a single distribution (Laurie wrote something on this distinction, too) that doesn’t have a well defined “number of parameters”. You can always write it down as a member of a one-parameter-family (for example, if it was a Gaussian, you could fix the variance), which is what Laurie did in his book.

28. Christian, why I think what is a problem?

• Christian Hennig

“Christian, why I think what is a problem?”
The P-value posting.
“As mean(x)= 0.786 and sd(x)=1.07 these simulated P-values are on average going to be much larger than 0.000379.”

29. George, firstly you are correct: nobody would use the comb distribution in practice. The point I was making is very simple: EDA is not sufficient to decide on a model; you have to regularize. The problem is a mathematical one, namely the discontinuity of the differential operator, and the solution is a mathematical one, namely regularization. As I said in a previous post you can ‘regularize’ with common sense, which is what you are proposing. I am not satisfied with that answer because it does not make clear what is going on, namely in this particular case Fisher information. If I ask you why the comb distribution is not a good choice, you may say that the confidence interval is too small to be believed, but do you go further? I go further and say that it has a large Fisher information and therefore imports efficiency which cannot be justified. I go further and say that the normal model has a minimum Fisher information, that is, it is a worst case, and should therefore be used in preference. Here is a long Tukey quote from 1993:

B is for Blandness
Davies emphasizes the pathological discontinuity of estimate behavior as a function of assumed model. In any neighborhood of a simple model (location, or location- and-scale, say) there are arbitrarily many (similarly simple) models with arbitrarily precise potentialities for estimation – – potentialities that can only be realized by very different procedures that [sic] those that realize the potentialities of the original simple model.
There are always nearby models that “describe the data” – – and are arbitrarily easier to estimate from. There may or may not be models that “describe the data” and are harder to estimate from.

A slightly pessimistic (or cynical) view of the world is that we can only trust inferences from models that are (locally) hard to estimate. Nothing that I know about the practicalities of data analysis offers any evidence that this view is too pessimistic. What we can expect, therefore, is that we ought to find trustworthy and helpful – – not as _prescribing_ procedures, but as useful challenges to procedures, leading to useful illustrations of how procedures function – – those models that are (locally in the space of models) hard to estimate. Locally easy models are not to be trusted.

From one point of view, what we are putting forward is a two-part description of the meaning of the acronym

tinstaafl

well known to science-fiction readers, which stands for

THERE’S NO SUCH THING AS A FREE LUNCH,

the two parts being expressible in similar terms, but more closely relevant to our context as

NO ONE HAS EVER SHOWN THAT HE OR SHE HAD A FREE LUNCH
Here, of course, “FREE LUNCH” means “usefulness of a model that is locally easy to make inferences from”.

Any very specific characteristic of a distribution is a potential handle by which easier estimation is possible. So we want models that are as _bland_ as we know how to make them as our prime guides – – our prime illumination of the behavior of alternative procedures.

There is a sense in which the most useful interpretation of the Central Limit Theorem is not about how nearly Gaussian the distribution of estimates is likely to be, but rather as an indication of blandness of Gaussian models – – since they arise, in the limit, by the disappearance of any and every kind of distinctive behavior. As one of several, or even many, models useful in illuminating behavior of some procedure, then, Gaussian distributions probably do have a special role.

* blandness in finite samples *

One outcome of the Princeton Robustness Study (Andrews et al. 1972e) was the recognition of the non-blandness – – in small samples – – of Student’s $t$ with very few degrees of freedom. The peakedness of $t_1$ (and, to a lesser extent, of $t_2$) was found to be sufficient for an estimate of location, tuned to make use of this peakedness, to have 10’s of % more efficacy than is possible for a model with the same (seriously challenging) tail behavior but smaller peakedness (for example the slash or $h_1$ distribution – – the distribution of a centered Gaussian deviate divided by an independent rectangular [0,1] denominator.)

There is thus a finite sample sense in which $h_1$ is much blander than $t_1$ – – and a finite sample sense in which we can reasonably ask if $h_1$ is bland enough.

Whether there will ever be detailed and rigorous formulations of blandness – – for either floating (asymptotic) $n$ or finite $n$ – – that will be directly useful seems presently to be less than certain. But that does not excuse us for not asking about blandness in every situation where we are trying to understand the behavior of a chosen procedure against a background of well-selected alternate models.

• Blandness, entropy, information, smoothness, invariance, regularization, well-posedness, renormalization…I agree (if it is a fair approximation to your view) that somewhere among concepts like these are ideas that are probably more fundamental than

• Oops, pressed send…

…than many of the debates on topics like the likelihood principle. (I also think there are [and these may already have been stated by others] some more general ideas that imply something like the likelihood principle.)

I still wonder what the ‘best’ formulation/s is/are.

• george

Laurie: thanks. I do think regularizing is important, as is the development of tools that motivate it formally. But it’d be so much easier to interest others in this approach with motivating examples that are less contrived.

This far down in a discussion is probably not the place for them but I do hope you’ll be able to present them somewhere. Good luck, and thanks for all the responses.

• Hi Laurie, if you’re still lurking about, perhaps you could take a look at my discussion with Mayo a few comments above about the weak/strong likelihood principles. In particular the sketchy derivations I give leading to the ‘orthogonality’ part. Note I’m putting aside a number of regularity concerns there (for now).

As I mentioned above, I’m just lying about in a hospital bed feeling a bit bored so any math feedback would give me a nice distraction.

30. Unless I have missed it, Birnbaum never uses the word ‘true’ when talking about the likelihood principle. He uses the concept of ‘adequacy’ (top of page 274). I take it he had given some thought to this and uses ‘adequate’ precisely because he does not want to use the word ‘true’. Birnbaum never spells out what is meant by ‘adequate’, but it has always seemed to me that whatever definition is taken, it is not going to exclude silly models. Moreover, the sillier the model the stronger the demonstration: this model is completely ridiculous and it can’t even exclude it. Furthermore, I had the impression that there was a sort of continuity assumption in all discussions of likelihood. Even if the model is wrong, as long as it is close in some sense to the ‘truth’ it will not matter: all close models will give more or less the same results. The comb model shows that there is no continuity principle. In spite of this I see traces of it in the way Michael formulates his position.

What do you mean by ‘nearly’ in the statement ‘if the likelihood principle is nearly true’? One data set, lots of different adequate models, lots of different likelihoods each telling a different story and consequently the data also telling lots of different stories. Are these all equally valid or are some stories better than others? How do you make sense of this cacophony? One data set consists of zeros and ones the other is a Gaussian data set with the same mean and variance. The likelihoods are equal, so the degree to which the data support one parameter value relative to another is the same for both samples, in spite of the fact that the model is completely false for the one sample and correct for the other. I stand amazed.

Clearly there is no general definition, but this is no excuse for not giving a definition for simple models, for example and above all for the Gaussian model. Here is one such definition. The adequate values of (mu,sigma) are those for which the following inequalities hold.
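n<-length(x)  # sample size
# thresholds for the checks below; qkuip is not a base R function (presumably the commenter's own quantile function for the Kuiper statistic)
q1<-qnorm((1+beta)/2)
q21<-qchisq((1-beta)/2,n)
q22<-qchisq((1+beta)/2,n)
q3<-qnorm((1+beta^(1/n))/2)
q4<-qkuip(beta)
y<-sort((x-mu)/sig)  # standardised data, ordered for comparison with the empirical distribution
sqrt(n)*abs(mean(y))<q1  # behaviour of the mean
q21 < sum(y^2) & sum(y^2) < q22  # behaviour of the variance (two inequalities)
max(abs(y))<q3  # largest deviation from mu (outliers)
dk<-max((1:n)/n-pnorm(y))-min((1:n)/n-pnorm(y))  # Kuiper distance from the model
sqrt(n)*dk<q4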
For a data set x these check the behaviour of the mean, the behaviour of the variance, the maximum deviation from mu (outliers) and the Kuiper distance of the model from the empirical distribution. beta is so chosen that for a N(mu,sigma^2) sample of size n the parameters (mu,sigma) will be adequate with a prescribed probability, say 0.99. Several comments are in order. Firstly the region is not based solely on the sufficient statistics for the model. What does this say about the principle of sufficiency? Secondly there are five inequalities involved. These can be turned into P-values. Thus for any (mu,sigma) there are five different measurements of how well the N(mu,sigma^2) model fits the data. Some may fit well with respect to goodness-of-fit but badly with respect to outliers and vice versa. Michael and others think nothing is lost if this is replaced by one single number involving just the mean and variance. Indeed they seem to think something is gained. I stand amazed once again. Thirdly if there are some adequate parameter values but not very many and even these are on the limits of acceptability, does one now ignore all this and move to a likelihood stage where all parameters are allowed even those which have just been discarded? Fifthly if writing a scientific article does one report all the adequate parameter values or just the likelihood function? Sixthly it is now in the open and can be criticized. Finally, Michael, give us your concept of adequacy for the Gaussian model.

• Michael Lew

Laurie, I’ll answer your questions one by one, but I have to say that I think you are way off track.

“What do you mean by ‘nearly’ in the statement ‘if the likelihood principle is nearly true’?” I mean ‘adequate’ in a sense that is probably similar to Birnbaum’s use of adequate. If most of the evidence in the data relevant to the model parameters is contained in the likelihood function then the likelihood principle might be described as nearly true. I am also thinking about acting as if the likelihood principle applies to the various versions of likelihood functions, such as marginal likelihoods or empirical likelihoods, that are necessary for dealing with nuisance parameters.

“Are these all equally valid or are some stories better than others? How do you make sense of this cacophony?” No. Obviously I don’t think that. Don’t pretend that my words can be turned into the ravings of an idiot. The stories from the more appropriate models are better, just as will be the case for ALL statistical methods.

“One data set consists of zeros and ones the other is a Gaussian data set with the same mean and variance.” Only an idiot would choose a model featuring a Gaussian distribution for the data that is all ones and zeros. I am not an idiot and so I would not choose that model. Maybe a comb distribution… well no.

“I stand amazed.” I too am amazed at your ability to misunderstand what I write. Look up the principle of charity. It might help.

“What does this say about the principle of sufficiency?” I have no idea, because I do not understand what you have written.

“Michael and others think nothing is lost if this is replaced by one single number involving just the mean and variance.” You are probably wrong, but as I don’t understand the context of your statement I cannot be sure, and I could only speak for myself, not the imaginary others. I would very rarely advocate presenting the evidence in data as a single number. I much prefer a full likelihood function, as I’ve been writing in these comments for ages.

“does one now ignore all this and move to a likelihood stage where all parameters are allowed even those which have just been discarded?” I have no idea. What is a ‘likelihood stage’? A likelihood _function_ tells you the relative level of support offered by the data for parameter values within a model. Which parameters are you discarding? Are you using the likelihoods in a decision procedure? What criteria are you applying, and how do you include a loss function? Are you using likelihood functions to decide how many model parameters to use, or to decide which values of a parameter are supported by the data? I only approve of the latter. We may be arguing about different things, as the likelihood principle is relevant to the values of a parameter of interest and not to the inclusion of parameters in a model, but it sounds like you are talking about choosing models.

“Fifthly if writing a scientific article does one report all the adequate parameter values or just the likelihood function?” The full likelihood function is a richer display of what the data say than any single number summary can be. However, there are many circumstances where a P-value is an adequate summary. The analyst should regularly examine the full likelihood function and decide how best to communicate the results in the publications.

“Sixthly it is now in the open and can be criticized.” What is ‘it’? I don’t know what that sentence is intended to tell (or ask) me.

“Finally Michael give us your concept of adequacy for the Gaussian model.” A Gaussian model is adequate to the extent that it is a sufficiently realistic representation of the data generating system for the purposes of the analysis. It is generally not adequate for binary data, and would probably not be adequate for any circumstance where a comb function might be reasonably contemplated.

Finally I will note that Birnbaum was never enthusiastic about the likelihood principle, and apparently harboured doubts even after publishing his proof (Mayo will say faulty proof) that the likelihood principle is entailed by conditionality and sufficiency. Birnbaum preferred his own nebulously defined “confidence concept”. There is no need to exclude ‘silly’ models from the scope of the likelihood principle because it is entirely model-bound. It is the real-world consequences of application of the likelihood principle that require the model to be appropriate, not the principle itself. As I said, it seems to me that you are way off track in your thinking about the likelihood principle.

• Michael: On the last para only: Birnbaum was apoplectic that he seemed to have shown the likelihoods contained all the import for parametric inference because he thought the ability to control the probability of misleading interpretations was absolutely central. His confidence concept is clear enough, the only problem with it is that it’s still too behavioristic (for me). I’m just not sure why he thought the confidence concept wasn’t just what Neyman espoused.
Ronald Giere was here yesterday, driving through the area, and we talked about Birnbaum. He seemed to have a slightly but importantly different memory of what likely drove B to depression than in our many past discussions of Birnbaum.

31. Christian, that is not a problem. I just wanted to know what you thought about the argument.

32. Oliver I wish you a quick recovery, too much excitement last Saturday I take it. It’s now late here but will look at your comments and respond. I wanted to add Kolmogorov–Chaitin complexity and chaos to your list.

• Thanks 🙂 Never made it to Saturday but followed along as best I could.

I certainly have no objections to your additions. I’m fresher on chaos and related topics after having been teaching dynamical systems recently. More of a passing familiarity with KC but agree part of an obviously important and deep set of ideas.

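# fpval: grid search for the (mu,sigma) values of the N(mu,sigma^2) model that
# pass the four checks (mean, variance, outliers, Kuiper distance) at level alpha.
# Lines marked "assumed" were truncated in the posted text and are restored
# from the description of the checks in the comments.
fpval<-function(x,alpha,theta=NA,ngrid=100){
n<-length(x)
x<-sort(x)
mux<-mean(x)
sdx<-sd(x)
beta<-(3+alpha)/4
beta<-alpha
# quantiles for the mean, variance, outlier and Kuiper checks
q1<-qnorm((1+beta)/2)
q21<-qchisq((1-beta)/2,n)
q22<-qchisq((1+beta)/2,n)
q4<-qkuip(beta)
q3<-qnorm((1+beta^(1/n))/2)
# range of sigma values covered by the grid
sigl<-sqrt(sum((x-mux)^2)/q22)
sigu<-sqrt(sum((x-mux)^2)/(q21-q1^2))
theta<-double(2*(ngrid+1)^2)
dim(theta)<-c((ngrid+1)^2,2)
ic<-0
pvl<-0
j<-0
while(j<=ngrid){
sig<-sigl*exp(j*log(sigu/sigl)/ngrid)
i<-0
while(i<=ngrid){
mu<-mux+(2*i/ngrid-1)*sig*q1/sqrt(n)
ic1<-0
ic2<-0
ic3<-0
ic4<-0
# mean check (statistic assumed)
y<-sqrt(n)*abs(mux-mu)/sig
if(y>q1){ic1<-1}
if(ic1==0){
# variance check (statistic assumed)
sy2<-sum((x-mu)^2)/sig^2
if((sy2>q22)|(sy2<q21)){ic2<-1}
}
if(ic1+ic2==0){
# largest deviation from mu, i.e. outliers (statistic assumed)
y3<-max(abs(x-mu))/sig
if(y3>q3){ic3<-1}}
if(ic1+ic2+ic3==0){
# Kuiper distance between N(mu,sig^2) and the empirical distribution,
# scaled by sqrt(n) (statistic assumed)
fx<-pnorm((x-mu)/sig)
dk<-sqrt(n)*(max((1:n)/n-fx)+max(fx-(0:(n-1))/n))
if(dk>q4){ic4<-1}
}
if(ic1+ic2+ic3+ic4==0){
ic<-ic+1
theta[ic,1]<-mu
theta[ic,2]<-sig
# pvl<-max(pvl,2*(1-pnorm(-1.65+sqrt(n)*abs(mu-mu0)/sig)))
}
i<-i+1
}
j<-j+1
}
if(ic>=1){theta<-theta[1:ic,]}
else{theta<-NA}
list(theta)
}
#
# pkuip: asymptotic distribution function of sqrt(n) times the Kuiper distance
#
pkuip<-function(x){
if(x<=0){ss<-0}
else{
ss<-2*(4*x^2-1)*exp(-2*x^2)
eps<-1
m<-1
while(eps>10^(-4)){
m<-m+1
y<-m*x
eps<-2*(4*y^2-1)*exp(-2*y^2)
ss<-ss+eps
}
ss<-1-ss
}
return(ss)
}
#
# qkuip: alpha-quantile of pkuip by doubling and bisection
#
qkuip<-function(alpha){
x2<-1
ss<-pkuip(x2)
while(ss<alpha){
x2<-2*x2
ss<-pkuip(x2)
}
x1<-0
x3<-(x1+x2)/2
eps<-1
while(eps>10^(-3)){
q3<-pkuip(x3)
eps<-x2-x1
if(q3<alpha){
x1<-x3}
else{x2<-x3}
x3<-(x1+x2)/2
}
return(x3)
}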
#
You can copy it and run it under R. Parameter values not contained in the set theta (taking into account the discretization) are the discarded ones. For example for the copper data with alpha=0.9 the parameter (2.04,0.11) is a discarded one. No use of likelihood. No loss function and the criterion can simply be read off from the programme. As I said before it is public, everyone can read it, run it, criticize it. There are only two parameters involved, it is the normal model. The adequate ones are given by the output theta up to the fineness of the discretization.
´A Gaussian model is adequate to the extent that it is a sufficiently realistic representation of the data generating system for the purposes of the analysis'. Here we go again. The generating system for the copper data is at the level of quantum mechanics taking into account the matrix of other chemicals in the water etc. You do not have the slightest chance of knowing what the generating system is. The generating system for the coin throw is deterministic Newtonian mechanics in a chaotic setting. It is complicated. In what sense is the binomial distribution a realistic representation of this chaotic deterministic system?
´There is no need to exclude ‘silly’ models from the scope of the likelihood principle because it is entirely model-bound'. I fail to understand you. There is no need to exclude them but if I include them you complain. I am no wiser about what you or Birnbaum mean by adequate. Moreover you refuse to state it in any fashion which can be criticized but take recourse to adjectives such as ´silly' if you don't like a model. I suspect that this is a fundamental weakness of the likelihood approach. No wonder there are interminable arguments on likelihood even to your own ´The likelihood principle does not entail a ´sure thing', ´evil demon' or ´determinist' hypothesis'. ´Charity' is not relevant for a scientific discussion. I don't have to look up the word, my English is quite good. If anybody regarded my scientific work with charity I would feel insulted. What I do is to try and understand your scientific point of view. I probe it to understand it and very often the best probes are extreme ones, for example the comb distribution and the normal distribution for binomial data. They are not personal. I stand amazed ad infinitum.

• Re: the single number for each of two different models.

Is the likelihoodist helped if they only compare models related by a smooth transformation? Then they have access to higher derivatives and hence more numbers

• There is a sort of singular limit/bifurcation blocking smooth transformation of the comb structure into the normal structure, isn’t there?

• Michael Lew

Laurie, the ‘principle of charity’ is what I meant by “the principle of charity”, not ‘charity’. The principle of charity says that you should read someone’s words in a manner that assumes that they have tried to communicate something that is rational. I was not being flippant, and I still suggest that you look it up (https://en.wikipedia.org/wiki/Principle_of_charity).

In order to be ‘adequate’, a model should provide probabilities of observable values that are adequately close to the true probabilities of the real-world data generating processes. How close they need to be depends on context and purpose. Your implication that adequacy requires a match to quantum mechanics and chaos is amusing, but the principle of charity impels me to assume that you are joking. (E.T. Jaynes provides an amusing variant of the attitude in his posthumous book Probability Theory, but I don’t remember the page number.)

I agree with you that your usage of ‘model’ differs from mine, and I now see that the difference is probably causing your misunderstanding of both my writing and the likelihood principle. For example, you wrote “one [number] for each model, namely the likelihoods l(x,theta_1) (one number) and l(x,theta_2)”. If theta is the parameter vector (mu, sigma) for a Gaussian distribution, then theta_1 and theta_2 are different values of the parameter and the two likelihoods are two points on the likelihood function from _one_ model. The likelihood function that I would urge you to view is the function for all values of theta, not just theta_1 and theta_2. That function is not a single number summary. You also wrote “Thus N(0,1) and N(0,2) are two different models.” Well, for the purposes of application of the likelihood principle they are definitely not different models. You need to adjust or translate your particular usage of the word ‘model’ when considering the likelihood principle, and the practical utility of likelihood functions.

Your usage of ‘parameter’ is similarly different. You wrote “For example for the copper data with alpha=0.9 the parameter (2.04,0.11) is a discarded one.” I assume that you mean by that that the vector (mu=2.04, sigma=0.11) has been discounted as an acceptable setting for the Gaussian model. If that is the case then what you mean by “parameter” is what I would mean by “value of a parameter”. The parameter of your Gaussian model is the vector (mu, sigma) or, equivalently, the parameters are mu and sigma.

I suggest that you read all of my comments again using the principle of charity and assuming my own (conventional, I think) usage of ‘parameter’ and ‘model’. You will, I expect, find that there is little to disagree with in what I wrote.

• The issue of levels appears again. While dynamical systems are on my mind – compare fixed point stability to structural stability.

Structural stability refers to perturbations of the equation structure itself – ie perturbing ‘outside’ the model, fixed point stability refers to stability of a fixed point ‘within’ a model.

Quite analogous to within/without stability in statistics. And suggests ‘without’ perturbations can be analysed by smoothly embedding in a higher level model again, as I suggested above.

• PS the ‘charitable’ interpretation of Laurie’s quantum mechanics quip is that one needs to carefully think about *why* one can get away with neglecting phenomena above and below certain scales and *how to formalise* this.

Ie a ‘theory of approximate models’. I agree with Laurie that this is a deep and important question and that it has practical implications.

34. Oliver two more, perturbations and stability of analysis. I supported the Japanese.

• Absolutely. Singular perturbation theory but then we might be back to renormalization. I have two Japanese nieces – I did too.

35. Oliver asked me about the WLP which I had to read about to know what it is. Deborah writes ´ And don’t forget that the “weak” LP holds for frequentist parametric inference within a model, reducing to sufficiency’ so I take it to be equivalent to sufficiency. In the abstract of her paper she writes ´ … and so uncontroversial a principle as sufficiency’ so I take it to be uncontroversial. Here is an argument against the sufficiency principle. If the model is TRUE, that is the data are a sample whose distribution is correctly specified by the model for some parameter value, then there is no need for EDA. In such a case I suppose that it is reasonable to restrict attention to the sufficient statistics. I go no further than this because I have never thought about it. In the following I will use ´model’ in a sense different to the standard one in parametric statistics. A model is a single probability measure so that N(0,1) and N(0,2) are different models. What is usually called a model in statistics is a family of probability models indexed by a parameter. For each parameter theta the question is whether the model P_theta is an adequate approximation to the data. This means that ´typical’ data generated under the model ´look like’ the real data. As an example. If P_theta=N(mu,sigma^2) then for data X generated under this model sqrt(n)(mean(X)-mu)/sigma is a N(0,1) random variable and will typically lie between -2.5 and 2.5. I do the same for the variance, for the maximum deviation max_i|X_i-mu|/sigma and for the Kuiper distance between the model N(mu,sigma^2) and the empirical distribution P_n. The limits in each case can be determined by specifying a number alpha for typical, alpha=0.99 means that 99% of samples generated are typical in that all the inequalities are satisfied for 99% of the simulated samples. Note that all the inequalities involve only distribution functions and that the measure of distance the Kuiper metric is weak. 
More generally the topology is weak. Assuming that there are some suitable parameter values it is usual at this stage to do formal inference, that is all further calculations are done within the parametric model. I formulate this as saying that the statistician is behaving as if the parametric model were true. The WLP now requires the calculation of the likelihood for each parameter value. This means that the ´fit’ between the data and the model P_theta_0 is reduced to a single number namely l(x,theta_0) which involves only the sufficient statistics, namely in this case the mean and standard deviation of the data. It has been claimed that by doing this one attains a deeper understanding of how the parameter values are supported by the data. To me this is a complete mystery. The approximation region defined by the R programme above associates four numbers with each parameter value. One gives the size of the largest deviation in case one is interested in possible outliers, another gives the Kuiper distance of data from the model, another measures the variance of the sample against the variance of the model and one the size of mean against the mean of the model. On this basis the model is chosen. And now we have a free lunch allowing us to get much more information than we put in, namely the likelihood. I don’t believe it for a moment.
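The four checks described in this comment can be sketched for a single parameter value as follows. This is only an illustration under stated assumptions: the names `kuiper_cdf`, `kuiper_quantile` and `adequacy_stats` are made up here, each inequality is taken at level alpha directly, the Kuiper limit uses the standard asymptotic series, and the data are an idealised sample (normal quantiles) so that no random numbers are needed.

```r
# Asymptotic distribution function of sqrt(n) * Kuiper distance (series form).
kuiper_cdf <- function(x, terms = 100) {
  if (x <= 0) return(0)
  m <- seq_len(terms)
  1 - 2 * sum((4 * m^2 * x^2 - 1) * exp(-2 * m^2 * x^2))
}

# Quantile by root finding; the bracket c(0.5, 5) is an assumption that
# covers the usual levels such as 0.9 or 0.99.
kuiper_quantile <- function(p) {
  uniroot(function(x) kuiper_cdf(x) - p, c(0.5, 5), tol = 1e-8)$root
}

# Four statistics for one (mu, sigma), with the range of "typical" values.
adequacy_stats <- function(x, mu, sigma, alpha = 0.99) {
  n  <- length(x)
  xs <- sort(x)
  Fx <- pnorm(xs, mu, sigma)
  stat <- c(
    mean    = sqrt(n) * abs(mean(x) - mu) / sigma,
    var     = sum((x - mu)^2) / sigma^2,
    outlier = max(abs(x - mu)) / sigma,
    kuiper  = sqrt(n) * (max(seq_len(n) / n - Fx) + max(Fx - (seq_len(n) - 1) / n))
  )
  lower <- c(0, qchisq((1 - alpha) / 2, n), 0, 0)
  upper <- c(qnorm((1 + alpha) / 2), qchisq((1 + alpha) / 2, n),
             qnorm((1 + alpha^(1 / n)) / 2), kuiper_quantile(alpha))
  data.frame(stat = stat, lower = lower, upper = upper,
             ok = stat >= lower & stat <= upper, row.names = names(stat))
}

# Idealised N(0,1) sample of size 50 (assumption, chosen to avoid simulation).
x <- qnorm((seq_len(50) - 0.5) / 50)
res_ok  <- adequacy_stats(x, mu = 0, sigma = 1)   # should pass every check
res_bad <- adequacy_stats(x, mu = 1, sigma = 1)   # should fail at least the mean check
print(res_ok)
print(res_bad)
```

Each row shows one feature of how the data relate to that particular parameter value, which is the information that a single likelihood value does not separate out.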

• Could EDA be considered somewhat analogous to phase portrait analysis in dynamical systems? And issues of likelihood instability related to structural instability?

Another analogy: In a sense, the phase portrait is the more fundamental object of study in dynamical systems, despite it being ‘about’ the study of differential/difference equations

36. george

Laurie; the principle of charity is a technical term. You might consider an apology…

38. Oliver I take a very naive and simplistic approach. The R programme above calculates four statistics for each parameter value. These measure the size of the mean, variance, largest deviating x value (outliers) and Kuiper distance between the empirical measure and P_theta. Each of these statistics has a typical range of values for data generated under the model P_theta. So given theta I look at these four statistics and the ranges and I can read off certain features. This theta is satisfactory in all respects. This theta is satisfactory but on the limit of acceptable Kuiper distances. This theta is perfectly satisfactory apart from outliers. This theta fails in all respects. In my naive way of thinking I claim that these statistics give me information about how the data relate to this particular parameter value. Michael claims as far as I understand him that once I have some parameter values which are adequate I can throw away my four numbers and replace them by one single number dependent only on the mean and variance and that by stringing these together to form a function I am rewarded with greater insight about the relative strength with which parameter values are supported by the data. Moreover this function as a function of the parameter is the same for all x with given mean and variance. If the model were true I could believe this, it seems plausible. But if the model is not true I can make no sense of it. Calling a model adequate is no help. You suggest if I understand you correctly that you can perhaps get more information by looking at derivatives. However the objection that this relationship is in no way dependent on the shape of the data as long as the mean and variance are the same remains. As far as I can see I cannot read off any useful information about the relationship between data and model from them.
There may be such a transformation but you are way beyond my knowledge. I once wrote a paper on bifurcation with one set of physicists only to be told by another that it was no such thing. Since then I have been wary of using the word.
My use of the word perturbation is misleading as it has a definite meaning in mathematics which was not intended here. I simply meant any measure within epsilon of the given model in some weak metric. A perturbation of the form (1-epsilon)P+epsilon Q is probably more in the sense I think you mean.
Processes at quantum level do seem to produce effects at the macro level which can be modelled in a stochastic sense. This may seem to be a case of the central limit theorem of additive small effects. I am not satisfied with this. Bohmian mechanics is deterministic chaos and has the same predictive power as stochastic interpretations of quantum mechanics. I would like some reference to a chaotic system which can be approximated by i.i.d. random variables in the limit. Let me know if you find one. In the same vein but different: complex numbers in the sense of Kolmogorov pass all tests of independence (Martin-Löf). This goes in the right direction as complex numbers seem to have chaotic binary expansions but so does pi at least empirically and pi is not complex. I have been thinking of such things when having nothing else to do but the progress is zero.

Your comments on EDA again go beyond my knowledge of dynamic systems. I do not even know what phase portrait analysis is but there again you have the advantage. I have never given lectures on dynamical systems.

39. Michael, here is an example due to Andrew Gelman. The model is a mixture of normals 0.5N(mu_1,sigma_1^2)+0.5N(mu_2,sigma_2^2). At each value of mu_1=x_i for some x_i the likelihood tends to infinity when sigma_1 tends to zero. Similarly for each value of mu_2=x_i when sigma_2 tends to zero. There is nothing pathological about this example, the problem is well posed, no regularization needed. Do you conclude that such values of (mu_1,sigma_1) and (mu_2,sigma_2) are relatively speaking much better supported than all the others?

• Michael Lew

Better supported by what observed values?

• Laurie: This looks to be essentially the kind of counterexample to likelihood obtained by hypothesizing 0 variance (Barnard, Hacking).

• Michael Lew

Mayo, yes, it seems analogous to those examples to me as well. However, its ambiguity of parameters is novel to me.

40. Michael, ´The law of likelihood says that the degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.’
In the example the data are x_1,…,x_n generated under the model for some parameter value, for example mu_1=mu_2=0,sigma_1=sigma_2=1. Then the parameter values mu_1=x_1, sigma_1=10^-10, mu_2=0,sigma_2=1 are much better supported by the data relative to mu_1=0,sigma_1=1,mu_2=0,sigma_2=1, that is, the parameter values used to generate the data.

• Michael Lew

Laurie, you’ve left out some information. What are the data? Just saying “x_1,…,x_n” does not allow the generation of any likelihoods. How many data points are there? (If there are fewer than the number of parameters, 4, then the model is overfitting and all bets are off.)

It seems to me that for the data to support “mu_1=x_1, sigma_1=10^-10, mu_2=0,sigma_2=1” better than any other vector of parameter values the data would need to consist only of zeros and values very close to x_1. Is that what you have in mind? How would the model yield data like that if the true parameter vector is “mu_1=mu_2=0,sigma_1=sigma_2=1”?

41. Michael, here are data generated by R according to the model with mu_1=mu_2=0,sigma_1=sigma_2=1. In this particular case the model reduces to 0.5N(0,1)+0.5N(0,1)=N(0,1)
-2.03908440 -1.19551369 -0.85432527 0.74703609 0.93696809 0.81682768
-1.77763674 0.60089285 1.02502853 0.05336583

The model is true, n=10, there are four parameters. The loglikelihood for mu_1=mu_2=0,sigma_1=sigma_2=1 is -15.68677 but you can check the calculations. The loglikelihood for mu_1=-2.03908440, mu_2=0,sigma_1=10^-10,sigma_2=1 is approximately 2.486544. The ratio of the likelihoods in favour of mu_1=-2.03908440, mu_2=0,sigma_1=10^-10,sigma_2=1 is approximately exp(18.17) but you can check my calculations.
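Since the calculations are offered for checking, here is a minimal sketch that recomputes both log-likelihoods. It assumes the ten values listed above are the complete sample; the helper name `mix_loglik` is made up for illustration.

```r
# The ten data points quoted in the comment (assumed to be the full sample).
x <- c(-2.03908440, -1.19551369, -0.85432527, 0.74703609, 0.93696809,
       0.81682768, -1.77763674, 0.60089285, 1.02502853, 0.05336583)

# Log-likelihood of the 50-50 mixture 0.5*N(mu1, sig1^2) + 0.5*N(mu2, sig2^2).
mix_loglik <- function(x, mu1, sig1, mu2, sig2) {
  sum(log(0.5 * dnorm(x, mu1, sig1) + 0.5 * dnorm(x, mu2, sig2)))
}

ll_true  <- mix_loglik(x, 0, 1, 0, 1)          # values used to generate the data
ll_spike <- mix_loglik(x, x[1], 1e-10, 0, 1)   # spike placed on the first data point

print(c(ll_true = ll_true, ll_spike = ll_spike, log_ratio = ll_spike - ll_true))
```

The spike contributes a huge density at the one point it sits on, while the other nine points are still covered by the N(0,1) component at the cost of a factor 0.5 each.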

• Michael Lew

Interesting. How would it turn out if the data come from a model that was actually bimodal, a model that is not redundant in its parameters? The mu_1=mu_2, sigma_1=sigma_2 model is just an over-parameterised variant of a single normal model N(mu,sigma), so I guess the apparently poor performance of the likelihoods is a consequence of the poor choice of model. Is the model one that you would seriously use? Given how it performs, I guess I would have to call your model inadequate, and I would blame the redundancy of parameters for that inadequacy, but it is not something that I have explored before. Does the model meet the criterion of common sense?

42. Michael, the model in your sense is the parametric family 0.5N(mu_1,sigma_1^2)+0.5N(mu_2,sigma_2^2) with mu_1,mu_2 real numbers and sigma_1 and sigma_2 strictly positive real numbers just as the normal model in your sense is the parametric family N(mu,sigma^2) with mu a real number and sigma a positive real number. I now generate 10 data points, for example the ones I gave in my previous post, using mu_1=mu_2=0, sigma_1=sigma_2=1. There is no question of the model being true, we have it in writing from the very person who supplied the data that this is so and his honesty in this particular case is not in doubt. For any choice of the parameter values you can calculate the likelihood. As I understand you this is done in principle for all possible parameter values, none are excluded. I repeat your statement ´The law of likelihood says that the degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.’ I do that for the values used to generate the data mu_1=mu_2=0, sigma_1=sigma_2=1 and for the values mu_1=-2.03908440, mu_2=0,sigma_1=10^-10,sigma_2=1 for the data in my last post. The likelihood principle now tells me that the degree to which the data support these latter parameter values compared to the one actually used to generate the data is approximately exp(18.17). You are of course (almost) correct. If I had generated data using mu_1=-2.03908440, mu_2=0,sigma_1=10^-10,sigma_2=1 then about 50% of the data points would be very close to -2.03908440, the other 50% N(0,1), not 0 as you state. It seems to me that there is here a conflict between this latter claim and the likelihood principle. I could here have made mistakes in my calculations or in my application of the likelihood principle, let me know.

• Michael Lew

Laurie, I don’t see that there need be any mistakes, and I don’t see that the example provides a reason to suppose the likelihood principle is false. Instead, the example shows why it is important to inspect the whole likelihood function rather than focus on just the maximally likely parameter values. I have to acknowledge that inspection of a five dimensional likelihood function is a little difficult, and that my discussion of it is therefore going to be largely qualitative, but here is my considered take on it.

1. The likelihood function must be symmetrical because the labels of the pairs of parameters (mu_1, sigma_1) and (mu_2, sigma_2) can be substituted for each other. Thus there are two ‘best’ supported values of the parameters, not just the one that you presented. Thus the result is ambiguous, and that ambiguity is a consequence of the model having interchangeable parameters. Ambiguity in ‘what the data say’ should not be surprising, as that interchangeability of parameters leads to a direct analogy with the ambiguity of natural language testimony containing multiple identical pronouns. (It may be that the problem can be parameterised in a different manner to avoid the ambiguity. See my likelihood paper http://arxiv.org/abs/1507.08394 for examples of how alternative parameterisation of problems can yield useable likelihood functions.)

2. The likelihood function will have the general form of a lump centred on mu=0 and sigma=1. That lump is decorated with two very tall spikes and other lesser spikes and bumps corresponding to the mu parameters matching each of the observed values. The spikiness of the function might lead an analyst to treat the spikes with caution. When making inferences about the parameters from the likelihood function, an analyst may prefer parameter values lying within the main bulk of the function over isolated values with higher likelihoods at the periphery. Your example provides an extreme case where the highest likelihoods are at the extreme edges of the function (the edges in the sigma dimension).

3. The largest two spikes in the likelihood function, mu_1 and mu_2 = -2.039, are very narrow in their respective dimension, and are located at corresponding locations on the sigma dimensions corresponding to near zero values. If a zero value for sigma is in any way unexpected or implausible then those peaks can be discounted on that basis. A Bayesian prior could do that formally, but even in the absence of a prior it is reasonable to discount them in any way that is satisfactory to the analyst. Neither the likelihood principle nor the law of likelihood insists that the best supported parameter values be accepted. The likelihood function shows what the data say according to the model, but inferences should usually be informed by more than just what the data say. What to believe and what to decide or do are different questions from what the data say.

4. Your example is a very nice illustration of the importance of inspecting the whole likelihood function, and of the necessity of thoughtful treatment of evidence instead of an algorithmic approach.
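Two of the qualitative points in this list, the label symmetry and the narrowness of the spikes, can be checked numerically. A minimal sketch, assuming the ten data values posted earlier in the thread; `mix_loglik` is a made-up helper name and the shift of 1e-9 is an arbitrary choice equal to ten times sigma_1.

```r
# The ten data points posted earlier in the thread.
x <- c(-2.03908440, -1.19551369, -0.85432527, 0.74703609, 0.93696809,
       0.81682768, -1.77763674, 0.60089285, 1.02502853, 0.05336583)

# Log-likelihood of the 50-50 mixture 0.5*N(mu1, sig1^2) + 0.5*N(mu2, sig2^2).
mix_loglik <- function(x, mu1, sig1, mu2, sig2) {
  sum(log(0.5 * dnorm(x, mu1, sig1) + 0.5 * dnorm(x, mu2, sig2)))
}

# Point 1: swapping the labels of the two components changes nothing.
ll_a <- mix_loglik(x, x[1], 1e-10, 0, 1)
ll_b <- mix_loglik(x, 0, 1, x[1], 1e-10)

# Points 2 and 3: the spike is extremely narrow. Shifting mu_1 by 1e-9
# away from the data point makes the high likelihood disappear.
ll_off  <- mix_loglik(x, x[1] + 1e-9, 1e-10, 0, 1)
ll_bulk <- mix_loglik(x, 0, 1, 0, 1)   # centre of the main lump

print(c(ll_a = ll_a, ll_b = ll_b, ll_off = ll_off, ll_bulk = ll_bulk))
```

Off the spike the log-likelihood falls below the value at the centre of the lump, because the first component then covers no data point at all.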

43. Oliver, now I understand your remark. It seems similar to generating data under a model and comparing it with the real data. In complex situations this comparison is difficult and sometimes one has to generate not the data itself but something closely related to it. In my last post I missed the obvious example of chaos which is a deterministic random number generator.

• Yes exactly. The emphasis is on topological structure and multiple systems of differential/difference equations may generate equivalent structures.

44. Michael, there is nothing pathological about the model. No need to regularize or anything. Imagine you have data say on weights which come from a population with male and female but the data do not have M and F labels. It seems reasonable to model it by a 50-50 mixture of normals. This is the model. It doesn’t matter what the data are, whether bimodal or not. The same phenomenon occurs. If you include a mixture parameter pN(mu_1,sigma_1^2)+(1-p)N(mu_2,sigma_2^2) you increase the likelihood by letting p tend to 1 as well as sigma_1 tending to zero with mu_1=x_1. The best supported parameter values then correspond essentially to Dirac measures of mass 1 at the data points. In a sense this is the inverse of my 0-1 example you complained about. Now the data are N(0,1) and it is the likelihood principle which supports Dirac measures of mass 1 which is in a sense even worse. You get reasonable estimates of the true parameter by minimizing the Kuiper distance between the model and the empirical measure. This is not a conceptual problem, only a numerical one but it can be solved if you take care in choosing a grid of values for the parameters. It will always work. If the data are strongly bimodal you will get a reasonable estimate for p as well as for the parameters of the normal components. If the parameters are more or less equal all values of p will be accepted and the parameter values are more or less what you would get if you had modelled the data with a simple Gaussian model in the first place. There is no overparameterization.
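To illustrate the Kuiper distance criterion for the mixture model, here is a minimal sketch. It is not the grid search itself, only the distance for two fixed parameter values. The names `mix_cdf` and `kuiper_dist` are made up, and the data are an idealised N(0,1) sample (normal quantiles) chosen as an assumption so that no random numbers are needed.

```r
# Idealised N(0,1) sample of size 200: normal quantiles at (i - 0.5)/n.
n <- 200
x <- qnorm((seq_len(n) - 0.5) / n)

# Distribution function of 0.5*N(mu1, sig1^2) + 0.5*N(mu2, sig2^2).
mix_cdf <- function(q, mu1, sig1, mu2, sig2) {
  0.5 * pnorm(q, mu1, sig1) + 0.5 * pnorm(q, mu2, sig2)
}

# Kuiper distance between the empirical distribution of x and a model cdf.
kuiper_dist <- function(x, cdf, ...) {
  xs <- sort(x)
  n  <- length(xs)
  Fx <- cdf(xs, ...)
  max(seq_len(n) / n - Fx) + max(Fx - (seq_len(n) - 1) / n)
}

dk_plain <- kuiper_dist(x, mix_cdf, 0, 1, 0, 1)          # both components N(0,1)
dk_spike <- kuiper_dist(x, mix_cdf, x[1], 1e-10, 0, 1)   # spike on a data point

print(c(dk_plain = dk_plain, dk_spike = dk_spike))
```

The spike parameters put half of the model's mass on a single point, so the distance to the empirical distribution is about one half, whereas the plain N(0,1) parameters give a distance close to zero for this sample.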

• Michael Lew

Laurie, your example of a mixed male and female population with lost labels provides a perfect rationale for the analyst to ignore parameter values that feature sigma near zero. The maximal likelihood spikes would be excluded from the portion of the likelihood function being considered. That mixed population would also suggest that a single sigma might suffice for the whole model. Model validation procedures of your choice could be used to decide between modelling the data with three parameters, shared sigma, or the four that you used. The fact that true variances of males and females can be expected to differ a little is not a sufficient reason to use a model that allows them to differ by a large margin. Particularly so when there are so few data points.

45. Anonymous

Deborah, I am not sufficiently informed about likelihood to know what the example is you are referring to. The mixture model 0.5N(mu_1,sigma_1^2)+0.5N(mu_2,sigma_2^2) is in no way artificial. It is a well posed problem, no need for regularization or anything. There is no hypothesis of zero variance but of course it does rely on one of the variances going to zero and the corresponding mu being a data point so I suppose there is some connection.

46. Deborah, I do not know the example you are referring to but the mixture example does rely on one of the variances going to zero and the corresponding mu being a data point so I suppose there is some connection.

• Michael Lew

Laurie, Birnbaum’s variant of the zero variance “counter-example” is the example debunked in my likelihood paper http://arxiv.org/abs/1507.08394. The introduction mentions other variants.

47. Deborah, Michael, in that case the Birnbaum example has nothing to do with it.

• Michael Lew

Laurie, perhaps, but that depends on exactly what you mean by “it”. The Birnbaum example is logically the same as the Barnard and Hacking versions, even if it differs in some details. I agree with Mayo’s suggestion that there is a degree of similarity with your example.

48. Michael, I hope that you are in general more careful when describing a likelihood function than you have been in this case. You can avoid the ambiguity by taking mu_1<=mu_2 which is what I do in my programme. This is a minor matter of no consequence and has nothing to do with interpreting the likelihood function. Having done this there are n=10 spikes, one for each data value, not just the one you mention. Bayesian priors have difficulty with this example, more exactly they cannot solve it, see
Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review, 71(2), 369-382.
You are being very generous when you interpret your likelihood function. There are ten best supported values of the parameters and all of them yield models which are inconsistent with the data. To be exact these models may be consistent with the data as the sample size is very small, but for n=20 they are probably not. You refer to my models as silly (the comb which is at least consistent with the data) or, to transfer your adjective, idiotic when I apply the normal model to 0-1 data. What adjectives do you use to describe your best supported parameter values?
‘Your example is a very nice illustration of the importance of inspecting the whole likelihood function, and of the necessity of thoughtful treatment of evidence instead of an algorithmic approach.’

Alternatively we can use the algorithmic approach. For a sample of n=200 i.i.d. N(0,1) random variables the results are as follows. The parameter values (mu_1,sigma_1,mu_2,sigma_2)=(-0.386, 0.548, 0.607, 0.959) are best supported by the data, and the agreement is very close, with a Kuiper distance of dk=0.052. The 0.9 quantile for the Kuiper metric is 0.1146. The values (mu_1,sigma_1,mu_2,sigma_2)=(-2.44175,10^-6,0,1) are inconsistent with the data, with a Kuiper distance of 0.514. A second example of the algorithmic approach is the following. A sample of size 200 is generated with (mu_1,sigma_1,mu_2,sigma_2)=(0,1,1.5,0.1). The parameter values best supported by the data are (mu_1,sigma_1,mu_2,sigma_2)=(0.0468,1.108,1.496,0.0942) with a Kuiper distance of dk=0.049. The values (mu_1,sigma_1,mu_2,sigma_2)=(-2.129243,10^-6,1.5,0.1) are inconsistent with the data, with a Kuiper distance of 0.530.
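
As a rough illustration of the comparison described above, here is a minimal Python sketch. It assumes the classical Kuiper statistic (largest gap above plus largest gap below between the model and empirical distribution functions); the exact definition and the grid search used in the comment are not shown in the thread, so the function names, the seed and the sample are all hypothetical.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, mu1, sigma1, mu2, sigma2):
    """Distribution function of the 50-50 mixture of two normals."""
    return 0.5 * normal_cdf(x, mu1, sigma1) + 0.5 * normal_cdf(x, mu2, sigma2)

def kuiper_distance(data, cdf):
    """Classical Kuiper statistic between the empirical measure and a model cdf."""
    xs = sorted(data)
    n = len(xs)
    # Largest amount by which the empirical cdf exceeds the model cdf.
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # Largest amount by which the model cdf exceeds the empirical cdf.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus + d_minus

# Hypothetical data: 200 draws from N(0,1), as in the first example above.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

# A mixture that is just N(0,1) versus one with a spike of width 10^-6 at a data point.
plain = kuiper_distance(sample, lambda x: mixture_cdf(x, 0.0, 1.0, 0.0, 1.0))
spike = kuiper_distance(sample, lambda x: mixture_cdf(x, sample[0], 1e-6, 0.0, 1.0))
print(plain, spike)
```

The spiked model puts roughly half its mass at a single point, so its distance from the empirical measure stays near 0.5 however the sample falls, in line with the values 0.514 and 0.530 quoted above.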

For the record, a lot of thought and hard work went into the approximation approach I favour and the algorithms
that I and others have developed.

You write
‘If you want to know what values of mu are reasonable in light of the data then you want to look at the likelihood function. It answers your question in the most direct manner’
and
‘The law of likelihood says that the degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.’

I find these two statements in the light of the mixture example antagonistic. Apparently I can read off the reasonable values from the likelihood and this is somehow direct, which I interpret as meaning that I just look at the likelihood and nothing else. Equally I can read off the best supported parameter values from the likelihood. Unfortunately the reasonable values are not equal to the best supported values. I am at a loss. Michael, I simply cannot get into the way you think. My world is one of approximation, regularization, metrics, functionals, continuity, differentiability, stability and perturbations. Likelihood plays no role whatsoever. When I look at the results of my algorithmic approach and at your problems with the likelihood, from which one can read off nothing and which requires a distinction between ‘reasonable’ and ‘supported’, then I know which I favour. You wrote at one point “The likelihood function can only ever be used _within_ the model from which it is calculated …. and so the ‘behave as if true’ thing is unavoidable.” It is this ‘behave as if true’ which I reject and when one does this, as you point out, likelihood goes out of the window. My approach to statistics can be nicely summed up by quoting D.W.Müller of the Kiefer-Müller process:

… distanced rationality. By this we mean an attitude to the given,
which is not governed by any possible or imputed immanent laws but
which confronts it with draft constructs of the mind in the form of
models, hypotheses, working hypotheses, definitions, conclusions,
alternatives, analogies, so to speak from a distance, in the manner
of partial, provisional, approximate knowledge.

Note the use of ‘distanced’ which is always my attitude to data, exactly the opposite of asking what the data tell us.

• Michael Lew

Laurie, you seem to have trouble understanding my words. I wrote that the likelihood function has, in addition to the two main spikes, “other lesser spikes and bumps corresponding to the mu parameters matching each of the observed values”. Thus when you write “there are n=10 spikes, one for each data value” you are agreeing with me and so your words “not just the one [sic, two] you mention” are wrong. I’m done arguing with you.

49. Michael, a few final comments, I am not expecting a response. Whether there are one or ten is irrelevant. A simple reply along the lines ‘I know there are ten just didn’t mention them’ would have been quite satisfactory. I agree saying there is one does not mean there are not more so I apologize, I find it embarrassing, but it really is a very minor point. You make no mention of the more substantial comments. Differentiation is a dangerous operation, pathologically discontinuous. If you use it you should always be very, very careful. In this case the problem is well defined but you insist on differentiating and run into trouble. To get out of it you appeal to semantics: a mathematical operation causes, so to speak, artifacts and to explain them away you turn to semantics. You mention using a Bayesian prior to get rid of them. It doesn’t work. Andrew Gelman also has problems and no good solution. He suggests sampling from the posterior, generating data and comparing them with the real data. All the time there is a very simple solution, minimize the Kuiper distance. This is at the level of distribution functions, no derivatives and no artifacts to be explained away. When I mentioned the Gaussian likelihood for the 0-1 data you got very indignant. I sort of apologized but in a somewhat cryptic manner. In fact no apology was called for. It was you who introduced the word idiot referring to a person. I have used the word silly but only referring to distributions. So is it idiotic, now referring to an action, to consider the Gaussian likelihood when the data are 0-1? It seems sort of idiotic, again referring to the action, but on the other hand the Gaussian likelihood function is the same for all data sets with the same mean and variance. This suggests to me that there may be an interpretation which is not idiotic. Ignoring constants the loglikelihood is
L(x,mu,sigma) = -n*log(sigma) - n*s^2/(2*sigma^2) - n*(mean(x)-mu)^2/(2*sigma^2)
where s^2 is the data variance and mean(x) the data mean. Maximizing this over mu and sigma leads to mu=mean(x) and sigma^2=s^2 with loglikelihood -n*log(s)-n/2. Now consider D(mean(x),s,mu,sigma)=-L(x,mu,sigma)-n*log(s)-n/2, the maximized loglikelihood minus L. This function is always non-negative with a minimum of 0 when mu=mean(x) and sigma^2=s^2. It becomes a combined measure of discrepancy between the mean(x) and mu and between s and sigma. With this interpretation it is not idiotic to apply it to 0-1 data or indeed to any data set. You can argue about whether the discrepancy is a good one, its advantages or disadvantages, but idiotic it is not. These are the lines along which I was thinking, namely is it idiotic or not, and after giving the matter some thought I decided it was not. You can apply here the principle of charity and I was admonished for not doing so. But I was applying the principle of charity, I was looking for an interpretation of using the likelihood of the Gaussian model for 0-1 data which was not idiotic. Here it is. It was not possible for you to apply the principle to me because you cannot understand the way I think. A second example is the quantum mechanics episode. You applied the principle of charity and came to the conclusion that I was joking. I was thinking of determinism, randomness, chaos, complex numbers in the sense of Kolmogorov, Martin-Löf’s work, the binary expansion of pi etc. Oliver, although we had never discussed it, picked it up straight away.
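
The discrepancy reading of the Gaussian loglikelihood can be checked numerically. The following Python sketch takes D to be the gap between the maximized loglikelihood and L, that is D = -L - n*log(s) - n/2, with s^2 the variance computed with divisor n; the 0-1 data set and the function names are hypothetical and only serve as an illustration.

```python
import math

def gaussian_loglik(n, mean_x, s, mu, sigma):
    """Gaussian loglikelihood (constants dropped), written via the data mean and s."""
    return (-n * math.log(sigma)
            - n * s ** 2 / (2 * sigma ** 2)
            - n * (mean_x - mu) ** 2 / (2 * sigma ** 2))

def discrepancy(n, mean_x, s, mu, sigma):
    """Maximized loglikelihood minus the loglikelihood at (mu, sigma)."""
    return -gaussian_loglik(n, mean_x, s, mu, sigma) - n * math.log(s) - n / 2

# Hypothetical 0-1 data; only n, the mean and s enter the loglikelihood.
zero_one = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
n = len(zero_one)
mean_x = sum(zero_one) / n
s = math.sqrt(sum((x - mean_x) ** 2 for x in zero_one) / n)

# Zero (up to rounding) at the data mean and s, strictly positive elsewhere.
print(discrepancy(n, mean_x, s, mean_x, s))
print(discrepancy(n, mean_x, s, 0.5, 0.5))
print(discrepancy(n, mean_x, s, 0.0, 1.0))
```

Because the function depends on the data only through n, mean(x) and s, any other data set with the same mean and variance gives identical values, which is the point made in the comment.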

The principle of charity does not work. Our ways of thinking are so different that it is not applicable. Indeed we have nothing more to say to each other, not because of one against ten, which really is not a reason for closing a debate, but at a much deeper level. No doubt the arguments about likelihood, for or against will continue for many years to come with no progress or conclusion. Given the fuzziness of the whole concept it cannot be otherwise. It seems pointless to take part in that particular discussion.

50. Deborah, Michael, a comment on Birnbaum’s example. A single observation x is drawn from a parametric family of models P_(mu,sigma) with sigma \in {0,100}, P_(mu,0)=delta_mu and P_(mu,100)=U(mu-100,mu+100), delta_mu being the one point measure in mu, delta_mu({mu})=1. Birnbaum or somebody writes that the density is f(x,mu,0)=1 if x=mu and f(x,mu,0)=0 if x\ne mu when sigma=0, and f(x,mu,100)=1/200 if mu-100 < x < mu+100 and zero otherwise when sigma=100. How does one define a density? A measure pi on the Borel sets of R is called sigma-finite if R can be expressed as a countable union of Borel sets B_i with pi(B_i) < infinity. This is the case for Lebesgue measure and for the counting measure which are the standard ones in statistics. Given such a pi and a probability measure P the measure pi is said to dominate P if P(B)=0 for all Borel sets B with pi(B)=0. This is often written P << pi. The famous theorem of Radon-Nikodym states that if pi dominates P then there exists a function f such that P(B)=\int_B f dpi. The function f is called the Radon-Nikodym derivative or we can call it the density of P with respect to pi. If pi = Lebesgue measure, f is simply the density. This is all standard measure and integration theory. If you now have a parametric family P_theta then to form the likelihood you require that there exists a sigma-finite measure pi which dominates P_theta for all theta, P_theta << pi for all theta. In this case you will have densities f_theta for all theta and your likelihood. At this point the Birnbaum example falls apart. We require a dominating measure pi such that P_(mu,sigma) << pi for all mu and sigma. If sigma=0 this implies delta_mu << pi for all mu. However it is well known that there is no sigma-finite measure which dominates all Dirac measures delta_mu simultaneously. So in the Birnbaum example there are no densities and no likelihoods. Another excellent reason for dismissing likelihood, see my remarks on perturbations.
Can the example be saved by discretizing? When sigma=0 let us consider a finite number of possible mu values j/K for all integers j with -K^2<=j<=K^2 and K large. Let us write mu_j=j/K and put pi_1=sum_j P_(mu_j,0). Then pi_1 is sigma-finite and dominates all the P_(mu_j,0). On letting pi_2 be standard Lebesgue measure it is seen that pi=pi_1+pi_2 dominates P_(mu,sigma) for all mu and sigma and we have a density f(x,mu,sigma). What does it look like? It can be seen that f(x,mu,0)=1 if x=mu_j for some j and zero otherwise, and f(x,mu,100)=0 if x=mu_j for some j and 1/200 if mu-100 < x < mu+100 otherwise. Thus if you observe an x with x=mu_j you say sigma=0, otherwise you say sigma=100. Michael, this is why the Birnbaum example has nothing to do with the Gelman example. Once again, forget differentiating, it causes only trouble and there is not the slightest need for it.

‘Thus there are two ‘best’ supported values of the parameters, not just the one that you presented.’
And you blame me for interpreting that as ‘exactly two’ rather than the ‘at least two’ you would have preferred. The standard way of reading ‘there are two ways of …’ is that there are exactly two but maybe they do things differently in Australia, not that I have noticed. I just presented one, as you write, but there is not the slightest suggestion that this is the only one. If I had written however ‘there is one best …’ how would you have interpreted that?

• Michael Lew

Laurie, yes, of course I blame you for misreading what I wrote. (Not reading, I suspect.) Having a ‘best’ suit does not mean that you have only one suit. In the second numbered point of my comment I wrote specifically that there are “two very tall spikes and other lesser spikes”. If you still think, as your response implies, that I wrote that there are only two spikes, then your reading comprehension skills are very poor. In the third numbered point in that comment I started with “The largest two spikes in the likelihood function” which clearly implies that there are additional spikes, so even if your eye failed to catch all of the words in the relevant sentence in my point number 2, your mind should have realised that you had missed it in the very next point.

What are you, an adult or a child trolling from his bedroom? I will not be responding to any more of your comments, as I will not be reading them.

52. A mathematician, statistician and philosopher walk into a debate about whether people can agree on statistical foundations. Much disagreement ensues.

I enjoyed some of it tho’.

Laurie – it looked for a minute like you had set up a blog? If you do, I’ll certainly read.

Michael – I enjoyed Edwards’ book when I first read it and see you as in that tradition. I think that when working ‘within the likelihoodist model’ so to speak you take the generally correct approach. I have moved away from this approach however, probably due to similar background ‘culture’ and concerns as Laurie. I am less certain than Laurie that likelihood needs to be abandoned though – it may still be useful as a derived concept if proper consideration is given to regularity conditions. I haven’t decided, but will keep an eye on what people like you are doing.

Mayo – I enjoyed your first book a lot when I first read it and your writing certainly provokes me to think even when I disagree. I do find you a bit dogmatic and unwilling to entertain either Bayes-Freq points of agreement or deeper mathematical concerns. I look forward to reading your new book.

• Oliver: What is your evidence from my first book that I’m unwilling to entertain Bayes-frequentist points of agreement? I’m always glad to find Bayesians who are interested in controlling the probabilities of erroneous interpretations. I came to VT largely because of IJ Good who was a leader in the Bayes/non-bayes compromise. Are you familiar with it?
The ideas we bandy around here are rarely fleshed out with philosophical care, and you’re always in an overly big rush to pigeon-hole views, including your own. More care and depth should be sought. That said, I’ve studied phil, math, and stat and am always glad to learn of “deeper mathematical concerns”.

• Just my cumulative impression. Yes I’ve heard of and read Good. Interesting you feel I try to pigeonhole people including myself – I always thought my views were too inconsistent to be pigeonholed. Will keep an eye on it and try to do better. Agree more care and depth needed all around, including from myself.

• Om: Thank you. Hope they’re discharging you, if they haven’t already.

53. Michael, lim_{sigma_1 \rightarrow 0} l(x,x_1,sigma_1,0,1) = \infty, similarly lim_{sigma_1 \rightarrow 0} l(x,x_2,sigma_1,0,1) = \infty and so on to lim_{sigma_1 \rightarrow 0} l(x,x_10,sigma_1,0,1) = \infty. The likelihood tends to infinity at all ten points. There are no two largest peaks and lesser peaks. They all tend to infinity. Moreover this is for the choice mu_2=0, sigma_2=1. It holds for any values of mu_2 and sigma_2. There are infinitely many points where the likelihood tends to infinity.
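
The divergence can be seen numerically with a short Python sketch. It evaluates the loglikelihood of the 50-50 mixture with mu_1 fixed at a data point, mu_2=0, sigma_2=1, and sigma_1 shrinking; the ten data values below are hypothetical stand-ins for the sample discussed in the thread, and the function names are illustrative only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(data, mu1, sigma1, mu2, sigma2):
    """Loglikelihood of the 50-50 mixture of N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return sum(
        math.log(0.5 * normal_pdf(x, mu1, sigma1) + 0.5 * normal_pdf(x, mu2, sigma2))
        for x in data
    )

def spike_logliks(data, k, sigmas, mu2=0.0, sigma2=1.0):
    """Loglikelihood with mu_1 placed on the k-th data point, for each sigma_1 in sigmas."""
    return [mixture_loglik(data, data[k], s1, mu2, sigma2) for s1 in sigmas]

# Hypothetical sample of n=10 distinct values.
data = [-1.3, -0.8, -0.5, -0.1, 0.2, 0.4, 0.7, 1.1, 1.6, 2.0]
sigmas = [1e-2, 1e-4, 1e-6, 1e-8]

# The same unbounded growth appears whichever data point carries the spike.
for k in range(len(data)):
    print(data[k], [round(v, 2) for v in spike_logliks(data, k, sigmas)])
```

Each hundredfold reduction in sigma_1 adds about log(100) to the loglikelihood at every one of the ten points, so no spike is taller than the others in the limit.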

54. Deborah, Michael, possibly a contribution when nobody is listening but for the record. The bottom line of the discussion I gave above for the Birnbaum example is that it doesn’t have a likelihood to be discussed.

• If it doesn’t have one, then the LP doesn’t apply. I just noticed there are 189 comments in this post, which I’m sure is a record. I didn’t even think it could go that high.

55. Deborah, if you are still reading, someone somewhere mentioned similar examples to the Birnbaum one. Do you have references?

• Laurie Davies: do you mean counterexamples to likelihood? e.g., from Barnard, Hacking and others?

56. I meant Barnard’s example. I only know it from the paper Michael wrote on it and when reading it I noticed that there is no density and therefore no likelihood. I thought there may be further examples with this weakness, maybe Hacking’s example is one of them, but I am not acquainted with the literature.