There was a session at the Philosophy of Science Association meeting last week where two of the speakers, Greg Gandenberger and Jiji Zhang, had insightful things to say about the “Law of Likelihood” (LL)[i]. Recall from recent posts here and here that the (LL) regards data **x** as evidence supporting *H*_{1} over *H*_{0} iff

Pr(**x**; *H*_{1}) > Pr(**x**; *H*_{0}).

On many accounts, the likelihood ratio also measures the strength of that comparative evidence. (Royall 1997, p.3). [ii]

*H*_{0} and *H*_{1} are statistical hypotheses that assign probabilities to the random variable **X** taking value **x**. As I recall, the speakers limited *H*_{1} and *H*_{0} to simple statistical hypotheses (as Richard Royall generally does), already restricting the account to rather artificial cases, but I put that to one side. Remember, with likelihoods, the data **x** are fixed, the hypotheses vary.

**1. Maximally likely alternatives.** I didn’t really disagree with anything the speakers said. I welcomed their recognition that a central problem facing the (LL) is the ease of constructing maximally likely alternatives: so long as Pr(**x**; *H*_{0}) < 1, a maximally likely alternative *H*_{1} would be evidentially “favored”. There is no onus on the likelihoodist to predesignate the rival; you are free to search, hunt, post-designate and construct a best (or better) fitting rival. If you’re bothered by this, says Royall, then this just means the evidence disagrees with your prior beliefs.

After all, Royall famously distinguishes between evidence and belief (recall the evidence-belief-action distinction), and these problematic cases, he thinks, do not vitiate his account as an account of *evidence*. But I think they do! In fact, I think they render the (LL) utterly bankrupt *as an account of evidence*. Here are a few reasons. (Let me be clear that I am not pinning Royall’s defense on the speakers[iii], so much as saying it came up in the general discussion[iv].)

**2. Appealing to prior beliefs to avoid the problem of maximally likely alternatives.** Recall Royall’s treatment of maximally likely alternatives in the case of turning over the top card of a shuffled deck, and finding an ace of diamonds:

According to the law of likelihood, the hypothesis that the deck consists of 52 aces of diamonds (*H*_{1}) is better supported than the hypothesis that the deck is normal (*H*_{N}) [by the factor 52]…Some find this disturbing.

Not Royall.

Furthermore, it seems unfair; no matter what card is drawn, the law implies that the corresponding trick-deck hypothesis (52 cards just like the one drawn) is better supported than the normal-deck hypothesis. Thus even if the deck is normal we will always claim to have found strong evidence that it is not. (Royall 1997, pp. 13-14)

To Royall, it only shows a confusion between evidence and belief. If you’re not convinced the deck has 52 aces of diamonds “it does not mean that the observation is not strong evidence in favor of *H*_{1} versus *H*_{N}.” It just wasn’t strong enough to overcome your prior beliefs.
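The card example can be checked by brute force. The following is a minimal Python sketch, not anything taken from Royall’s text: the function names are hypothetical, and it assumes that the normal deck gives each of the 52 cards probability 1/52 and that the trick-deck rival is always constructed after the draw to match whatever card was observed.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def pr_normal(card):
    """Pr(card; H_N): top card of a normal, well-shuffled deck."""
    return Fraction(1, len(DECK))


def pr_trick(card, copied_card):
    """Pr(card; deck made of 52 copies of `copied_card`)."""
    return Fraction(1) if card == copied_card else Fraction(0)


def lr_constructed_rival(drawn):
    """Likelihood ratio of the post-designated trick deck over H_N.

    The rival is built to match the observed card (illustrative assumption).
    """
    return pr_trick(drawn, drawn) / pr_normal(drawn)


# Likelihood ratio obtained for every card that could have been drawn.
ratios = {card: lr_constructed_rival(card) for card in DECK}

# Probability, computed under H_N, that the search reports LR >= 52 against H_N.
pr_misleading = sum(pr_normal(card) for card in DECK if ratios[card] >= 52)

print(ratios[("A", "diamonds")])
print(pr_misleading)
```

Under these assumptions the constructed rival is favored by a factor of 52 whatever card turns up, so the probability of finding this “strong evidence” against a normal deck, when the deck is in fact normal, is 1.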

The relation to Bayesian inference, as Royall notes, is that the likelihood ratio “that the law [LL] uses to measure the strength of the evidence, is precisely the factor by which the observation **X** = **x** would change the probability ratio” Pr(*H*_{0})/Pr(*H*_{1}). (Royall 1997, p. 6). So, if you don’t think the maximally likely alternative is palatable, you can get around it by giving it a suitably low prior degree of probability. But the more likely hypothesis is still favored *on grounds of evidence*, according to this view. (*Do Bayesians agree?*)
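In symbols, and using the A/B notation of Royall’s statement in note [ii], this relation is the odds form of Bayes’s theorem. The worked case plugs in the card example; the labels *H*_{1} (trick deck) and *H*_{N} (normal deck) are Royall’s, while treating them as the only two candidates is a simplifying assumption made here for illustration.

```latex
\[
\frac{\Pr(A \mid x)}{\Pr(B \mid x)}
  \;=\; \frac{p_A(x)}{p_B(x)} \times \frac{\Pr(A)}{\Pr(B)}
\]

% Card example: A = H_1 (52 aces of diamonds), B = H_N (normal deck)
\[
\frac{\Pr(H_1 \mid x)}{\Pr(H_N \mid x)}
  \;=\; \frac{1}{1/52} \times \frac{\Pr(H_1)}{\Pr(H_N)}
  \;=\; 52 \times \frac{\Pr(H_1)}{\Pr(H_N)}
\]
```

So the posterior odds stay below 1 whenever the prior odds on the trick deck are below 1/52, even though the likelihood ratio, and hence the (LL)’s verdict on the evidence, is unchanged.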

When this “appeal to beliefs” solution came up in the discussion at this session, some suggested that you should simply refrain from proposing implausible maximally likely alternatives! I think this misses the crucial issues.

**3. What’s wrong with the “appeal to beliefs” solution to the (LL) problem:** First, there are many cases where we want to distinguish the warrant for one and the same hypothesis according to whether it was constructed post hoc to fit the data or predesignated. The “use constructed” hypothesis *H* could well be plausible, but we’d still want to distinguish the evidential credit *H* deserves in the two cases, and appealing to priors does not help.

Second, to suppose one can be saved from the unpleasant consequences of the (LL) by the *deus ex machina* of a prior is to misidentify what the problem really is, at least when there is a problem (not all data-dependent alternatives are problematic; see my double-counting papers, e.g., here). In the problem cases, the problem is due to the error probing capability of the overall testing procedure being diminished. You are not “sincerely trying”, as Popper puts it, to find flaws with claims, but instead you are happily finding evidence in favor of a well-fitting hypothesis that you deliberately construct, unless your intuitions tell you it is unbelievable. So now the task that was supposed to be performed by an account of statistical evidence is not being performed by it at all. It has to be performed by *you*, and you are the most likely one to follow your preconceived opinions and pet theories. You are the one in danger of confirmation bias. If your account of statistical evidence won’t supply tools to help you honestly criticize yourself (let alone allow the rest of us to fraud-bust your inference), then it comes up short in an *essential* way.

**4. The role of statistical philosophy in philosophy of science.** I recall having lunch with Royall when we first met (at an ecology conference around 1998) and trying to explain, “You see, in philosophy, we look to statistical accounts in order to address general problems about scientific evidence, inductive inference, and hypothesis testing. And one of the classic problems we wrestle with is that data *underdetermine* hypotheses; there are many hypotheses we can dream up to “fit” the data. We look to statistical philosophy to get insights into warranted inductive inference, to distinguish *ad hoc* hypotheses, confirmation biases, etc. We want to bring out the problem with that Texas “sharpshooter” who fires some shots into the side of a barn and then cleverly paints a target so that most of his hits are in the bull’s eye, and then takes this as evidence of his marksmanship. So, the problem with the (LL) is that it appears to license rather than condemn some of these pseudoscientific practices.”

His answer, as near as I can recall, was that he was doing statistics and didn’t know about these philosophical issues. Had it been current times, perhaps I could have been more effective in pointing up the “reproducibility crisis,” “big data,” and “fraud-busting”. Anyway, he wouldn’t relent, even on stopping rules.

But his general stance is one I often hear: We can take into account those tricky moves later on in our belief assignments. The (LL) just gives a measure of the evidence in the data. But this IS later on. Since these gambits can completely destroy your having any respectable evidence whatsoever, you can’t say “the evidence is fine, I’ll correct things with beliefs later on”.

Besides, the influence of the selection effects is not on the believability of *H* but rather on the capability of the test to have unearthed errors. Their influence is on the error probabilities of the test procedure, and yet the (LL) is conditional on the actual outcome.

**5. Why does the likelihoodist not appeal to error probabilities to solve his problem?** The answer is that he is convinced that such an appeal is necessarily limited to controlling erroneous actions in the long run. That is why Royall rejects it (claiming it is only relevant for “action”), and only a few of us here in exile have come around to mounting a serious challenge to this extreme behavioristic rationale for error statistical methods. Fisher, E. Pearson, and even Neyman some of the time, rejected such a crass behavioristic rationale, as have Birnbaum, Cox, Kempthorne and many other frequentists. (See this post on Pearson.)

Yet, I have just shown that the criticisms based on error probabilities have scarcely anything to do with the long run, but have everything to do with whether you have done a good job providing evidence for your favored hypothesis *right now*.

A likelihood ratio may be a criterion of relative fit, but it “is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [likelihoods] alone is not adequate to insure control of this error” (Pearson and Neyman, 1930, p. 106).

Pearson and Neyman should have been explicit as to how this error control is essential for a strong argument from coincidence *in the case at hand*.

Ironically, a great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, P-values, power, etc.) start with assuming “the law (LL)”, when in fact attention to the probativeness of tests by means of the relevant sampling distribution is just the cure the likelihoodist needs.

**6. Is it true that all attempts to say whether x is good or terrible evidence for H are utterly futile?** Royall says they are, that only comparing a fixed **x** to *H* versus some alternative *H’* can work.

“[T]he likelihood view is that observations [like **x** and **y**]…have no valid interpretation as evidence in relation to the single hypothesis *H*.” (Royall 2004, p. 149).

But we should disagree. We most certainly can say that **x** is quite lousy evidence for *H*, if nothing (or very little) has been done to find flaws in *H*, or if I constructed an *H* to agree swimmingly with **x**, but by means that make it extremely easy to achieve, even if *H* is false.

Finding a non-statistically significant difference on the tested factor, I find a subgroup or post-data endpoint that gives “nominal” statistical significance. Whether *H*_{1} was pre-designated or post-designated makes no difference to the likelihood ratio, and the prior given to *H*_{1} would be the same whether it was pre- or post-designated. The post-designated alternative might be highly plausible, but I would still want to say that selection effects, cherry-picking, and generally “trying and trying again” alter the stringency of the test. This altered capacity in the test’s picking up on sources of bias and unreliability has no home in the (LL) account of evidence. That is why I say it fails in an *essential* way, as an account of evidence.
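The effect of such hunting on error probabilities can be sketched numerically. The following Python simulation is an illustration under stated assumptions, not a reconstruction of any particular study: it assumes k independent subgroups, each yielding a standard normal z statistic when there is no real effect, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal reference distribution


def two_sided_p(z):
    """Nominal two-sided p-value for a z statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def smallest_p_after_hunting(n_subgroups, rng):
    """Search n_subgroups null subgroups and keep the best-looking p-value."""
    return min(two_sided_p(rng.gauss(0, 1)) for _ in range(n_subgroups))


def simulated_error_rate(n_subgroups, alpha=0.05, trials=20000, seed=1):
    """Relative frequency of 'nominal significance somewhere' with no real effect."""
    rng = random.Random(seed)
    hits = sum(
        smallest_p_after_hunting(n_subgroups, rng) <= alpha for _ in range(trials)
    )
    return hits / trials


def analytic_error_rate(n_subgroups, alpha=0.05):
    """Exact value under the independence assumption: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** n_subgroups


for k in (1, 5, 20):
    print(k, round(analytic_error_rate(k), 3), round(simulated_error_rate(k), 3))
```

With 20 subgroups the chance of reporting at least one nominally significant result when nothing is going on is about 0.64 rather than 0.05, although the likelihood ratio computed for the reported subgroup is the same as it would have been had that subgroup been predesignated.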

**7. So what does the Bayesian say about the (LL)?** I take it the Bayesian would deny that the comparative evidence account given by the (LL) is adequate. LRs are important, of course, but there are also prior probability assignments to hypotheses. Yet that would seem to get us right back to Royall’s problem that we have been discussing here.

In this connection, ponder [v].

**8. Background.** You may wish to review “Breaking the Law! (of likelihood) (A) and (B)”, and “Breaking the Royall Law of Likelihood (C)”. A relevant paper by Royall is here.

**NOTES**

[i] The PSA program is here: http://philsci.org/images/docs/PSA Program.pdf. Zhang and Gandenberger are both excellent young philosophers of science who engage with real statistical methods.

[ii] For a full statement of the [LL] according to Royall: “If hypothesis A implies that the probability that a random variable **X** takes the value **x** is *p*_{A}(**x**), while hypothesis B implies that the probability is *p*_{B}(**x**), then the observation **X** = **x** is evidence supporting A over B if and only if *p*_{A}(**x**) > *p*_{B}(**x**), and the likelihood ratio, *p*_{A}(**x**)/*p*_{B}(**x**), measures the strength of that evidence.” (Royall, 2004, p. 122) “This says simply that if an event is more probable under hypothesis A than hypothesis B, then the occurrence of that event is evidence supporting A over B––the hypothesis that did the better job of predicting the event is better supported by its occurrence.” Moreover, “the likelihood ratio, is the exact factor by which the probability ratio [ratio of priors in A and B] is changed.” (ibid. 123)

Aside from denying the underlined sentence, can a Bayesian violate the [LL]? In comments to this first post, it was argued that they can.

[iii] In fact, Gandenberger’s paper was about why he is not a “methodological likelihoodist” and Zhang was only dealing with a specific criticism of (LL) by Forster. [Gandenberger’s blog: http://gandenberger.org]

[iv] Granted, the speakers did not declare Royall’s way out of the problem leads to bankruptcy, as I would have wanted them to.

[v] I’m placing this here for possible input later on. Royall considers the familiar example where a positive diagnostic result is more probable under “disease” than “no disease”. If the prior probability for disease is sufficiently small, it can result in a low posterior for disease. For Royall, “to interpret the positive test result as evidence that the subject does not have the disease is never appropriate––it is simply and unequivocally wrong. Why is it wrong?” (2004, 122). Because it violates the (LL). This gets to the contrast between “Bayes boosts” and high posterior again. I take it the Bayesian response would be to agree, but still deny there is evidence for disease. Yes? [This is like our example of Isaac who passes many tests of high school readiness, so the LR in favor of his being ready exceeds 1. However, having been randomly selected from “Fewready” town, the posterior for his readiness is still low (despite its having increased).] Severity here seems to be in sync with the B-boosters, at least in direction of evidence.
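The arithmetic behind this contrast is easy to sketch. The numbers below (a prevalence of 1 in 1,000, sensitivity 0.95, false-positive rate 0.05) are hypothetical and are not taken from Royall; the function name is likewise illustrative.

```python
def positive_test_update(prior, sensitivity, false_positive_rate):
    """Return (likelihood ratio, posterior) for 'disease' after a positive result.

    All inputs are hypothetical illustration values, not figures from Royall.
    """
    # Pr(positive; disease) / Pr(positive; no disease)
    likelihood_ratio = sensitivity / false_positive_rate
    prior_odds = prior / (1 - prior)
    # Odds form of Bayes's theorem: posterior odds = LR x prior odds
    posterior_odds = likelihood_ratio * prior_odds
    posterior = posterior_odds / (1 + posterior_odds)
    return likelihood_ratio, posterior


prior = 0.001
lr, posterior = positive_test_update(prior, sensitivity=0.95, false_positive_rate=0.05)
print(f"LR = {lr:.1f}, prior = {prior:.3f}, posterior = {posterior:.4f}")
```

Here the likelihood ratio is 19 in favor of “disease”, and the posterior rises from 0.001 to roughly 0.019: a “Bayes boost”, yet still a low posterior.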

**REFERENCES**

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion & rejoinder). *Statistical Science* 29, no. 2, 227-266.

Mayo, D. G. (2004). “An Error-Statistical Philosophy of Evidence,” 79-118, in M. Taper and S. Lele (eds.), *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press.

Pearson, E. S. & Neyman, J. (1930). “On the Problem of Two Samples.” In J. Neyman and E. S. Pearson (1967), *Joint Statistical Papers*, 99-115. Cambridge: CUP.

Royall, R. (1997). *Statistical Evidence: A Likelihood Paradigm*. Chapman and Hall, CRC Press.

Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence,” 119-138; Rejoinder 145-151, in M. Taper and S. Lele (eds.), *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press.

Thanks for the post, Deborah! It would be nice to have a notion of evidence that both (1) captures compelling intuitions about what the evidential meaning of data depends on and (2) tells us what we’re objectively warranted in believing in light of our total evidence. In my view, we can’t have both in a single account: compelling intuitions lead to the Likelihood Principle and the Law of Likelihood, but (as problems like that of maximally likely hypotheses show) it is not possible to provide a good norm of commitment on the basis of likelihood functions alone.

One option is to go Bayesian: either “what one is objectively warranted in believing” doesn’t exist (the subjectivist view), or it exists but is given by the posterior probability distribution that results from updating a noninformative prior on the data (the objectivist view), rather than by the evidential meaning of the data.

A second option (perhaps) is to develop an account of evidence that does tell us what we are objectively warranted in believing in light of our total evidence but violates compelling intuitions about what the evidential meaning of data depends on. If such an account were developed, then my only objection to it would be the merely verbal quibble that it would be more perspicuously described as an account of objectively warranted belief rather than as an account of evidence.

A third option is to deny that compelling intuitions lead to the Likelihood Principle. I’ve given my reasons for thinking otherwise here: http://gandenberger.org/wp-content/uploads/2013/11/new_proof_of_lp_post.pdf

Greg:

“It would be nice to have a notion of evidence that both (1) captures compelling intuitions about what the evidential meaning of data depends on and (2) tells us what we’re objectively warranted in believing in light of our total evidence. In my view, we can’t have both in a single account:”

Your (2) is in danger of blurring the question as to what is warranted given all background information (big picture inference) and what the data x warrant wrt the question posed (local inference). I don’t think you intended that (or did you)? Therefore, to make the position clear one should stick to the considerations relevant for determining what’s warranted from the given data. Those considerations, I say, require information about the probative capacities of tests, thus sampling distributions, thus violations of the SLP (which is a most unintuitive principle, it seems to me, in that it countenances methods that lead to misleading inferences with high or maximal probability).

On the other hand, the assessment that “x poorly warrants H” is defeasible with other information. To have an example, a type of selection effect may always count against the warrant of an inference afforded by a type of test, but then, other background could permit one to show that the threat is taken care of by means of some other check, or some other background info–even though it is no thanks to the test. I am reminded of the “creative error-theorist” (9.4, p. 314 of EGEK).

Thanks for the reply. It’s not clear to me what distinction you have in mind when you talk about big picture inference vs. local inference. Is it the same as the distinction between inferring a high-level theory (such as general relativity) and a low-level phenomenon (such as the shift in apparent positions of stars as their light passes close to the sun), or do you have something else in mind?

Greg: No, here it’s the background info that changes. First of all, I want to reiterate, that the statistical account in which I am interested is not to tell us what to believe given the data x, but how well or poorly tested various claims are with given data x and given test T. This is my main point (not the big picture vs local point, just my main standpoint). Like the criticism of a prosecutor, I need to be able to distinguish believability (or even knowledge) of H from how good a job was done in checking and ruling out flaws in warranting H. It can and does happen that I believe or even know H, but reject a test as horrible evidence for H. Conversely, my disbelief in a claim H does not in itself show what’s wrong with a test purporting to provide evidence for H. Knowledge of past biases may make me suspicious, but that’s not enough.

Now on the local vs big picture. A test T, for me, can combine several tests, as can an associated severity assessment. Within this scrutiny, I would need to distinguish what this one data set warrants from combined tests or using other severely passed claims and models. For instance, failure to reject the null of model adequacy does not indicate adequacy if based just on x, but the big picture inference, based on other tests and everything known, may. So this is different from the level of generality issue.

I’m in the middle of traveling, so this is on the fly, but I wanted to respond to the main query.

Mayo, you have left out an essential component of the Law of Likelihood: the likelihoods in question have to be computed from the same statistical model. In fact, I think that you have never actually considered that restriction. Edwards makes it clear in his book Likelihood, writing “given a statistical model” in each definition, but others are not so clear.

The reason for the restriction is that the value of Pr(x|H) depends on the nature and properties of the statistical model within which it is being considered, so likelihoods are necessarily model-bound. That leads naturally to comparisons of likelihoods being similarly model-bound. Likelihoods do not by themselves allow model comparisons, hence the Akaike account of model comparisons.

If the likelihoods being compared using the Law of Likelihood are from the same model, their hypotheses will have the same number of parameters (the same dimensions, I like to think) and there can be no comparison of the sure-thing hypothesis, the “what happened had to happen” hypothesis, with other hypotheses within the same model. Any alternatives to that hypothesis would have different numbers of parameters and thus be from different models.

What keeps you from expanding the model to encompass any pair of hypotheses that ascribe a probability to a given datum?

Sorry, quantifier ambiguity. What I meant was, given any pair of hypotheses each of which ascribes a probability to a given datum, what keeps you from expanding your model to encompass both of them?

Can you expand the model in that way and still have fully defined probabilities for all possible parameter values? If not, then that is what stops such an expansion. The likelihoods require a fully defined statistical model. Once you have expanded the model then you have to compare likelihoods within that expanded model and any likelihoods from the simpler model are irrelevant.

Expand the model enough to include the sure thing hypothesis and that hypothesis is the entire model so that no comparisons are possible.

Michael: Even comparing parameters within a model, you can very easily be misled when following the LL. This is because sampling error can frequently lead to a value for x that favors a parameter value other than the true parameter value from the population that was sampled. Of course, you have seen this in your simulations, which reveal relationships between p-values and likelihoods. I know I keep returning to this issue, but it appears to me that Royall offers the interpretive rubric of not recognizing a difference in two hypotheses when the odds are less than 8:1 or 32:1, depending on your conservatism, because he has run into this. I suspect these are offered because of the very issue stated above. Users of likelihoods have no doubt encountered numerous instances in which a known true parameter value is not most favored by the sample values. This should be no surprise given that sampling error is reality. Is this not why Hacking let go of it? It is this sampling error problem that renders the likelihoods insufficient as sole representations of evidence. Neither likelihoods nor likelihoods coupled with priors allow one to rigorously take account of what sampling error has introduced into the experimental result, and most accounts that rely upon them explicitly deny that the processes that produced the numbers (with their inherent limitations and caveats, such as small sample size, or the sampling protocol) have to be considered when making inferences (other than to note that the evidence is contingent on the model). These methodological approaches that ignore sampling error cannot tell you how likely it is that you just made an erroneous inference. Thus, they should come with a warning label like a pack of cigarettes. Yes, you can say things like “under this model parameter value Y is best supported by the sample value x.” But this is a petty piece of the puzzle, and you must examine your whole experimental process to be able to infer that the true parameter value is not X, much less that it should be inferred to be Y.

John, you make some excellent points. However, I don’t think that I have yet made my position clear. The likelihood principle says that all of the evidence in the data relevant to parameter values in the model is contained in the likelihood function. That DOES NOT say that one has to make inferences only on the basis of the likelihood function. Priors can be important, and they are dealt with by Bayes. However, the probability of erroneous inference is obviously also important. I think of that as being the ‘reliability’ of the evidence. Standard Bayesian methods do not appear to have a mechanism to capture that aspect of the evidence as it is captured by the likelihood function. Frequentist approaches do have the potential to capture the reliability but conventionally they ignore much of the evidential meaning of the data. Mayo’s severity functions seem to have the potential to integrate the strength of evidence (as encoded by the likelihood function) with the reliability of the evidence, but I have not seen them calculated for any circumstance where there is any unreliability to capture.

There are many, many different statements that purport to be the likelihood principle, and many of them are substantially different in their implications from the version that I prefer (Edwards’s). I choose to think that those statements are erroneous rather than to think that the likelihood principle is stupid.

Now to your specific points.

The sampling error is entirely accounted for within the likelihood function if that sampling error matches the model that supplies the probabilities. Thus if there have been no procedures that might be described as “P-hacking”, the likelihood function is indeed all that is needed to know the full implications of the data with respect to the parameter values of interest. However, if there has been optional stopping or multiplicity of testing then they add some degree of unreliability to the evidence. That does not change the likelihood function, and does not change the strength of the evidence or change the parameter values favoured by the evidence, but it may change how one should make inferences about those parameter values.

I do not see the need to defend Royall’s arbitrary cutoffs for various descriptors ‘strong’ etc. as they are obviously just arbitrary guidelines for convenience. He does not propose them as a way to get around anything other than the desires of ordinary people for guidelines.

Yes, every user of likelihood will come across circumstances where the most favoured hypothesis is not the true hypothesis. However, they will only occasionally come across circumstances where the most favoured hypothesis is much more strongly favoured than the true hypothesis. The frequency of strongly supported parameters being far from the correct values of the parameter gets smaller as the evidence accumulates, but no-one ever said that the mode of a likelihood function is always at the true value of the parameter.

I do not know exactly why Hacking let go of likelihood, but he did write that the single observation problem described by Birnbaum was persuasive. (I make the claim that that alleged counter-example to the law of likelihood can be debunked, and I’ve started on a paper that does exactly that.)

Yes, I agree that neither likelihood functions nor likelihood functions with priors can capture the unreliability introduced into the evidence by methodological factors. Those factors need to be dealt with, but not by discarding the evidential meaning of the data. Severity seems to be a possible way forward. (I will point out the fact that sample size issues do seem to be fully accounted for in the likelihood function. It is P-hacking-like procedures that need to be accounted for externally.)

Yes, I agree also that a warning about any aspects of the experimental design that lead to unreliability should be made. It is not really a statistical or philosophical response to the problem, as it would obviously be part of any sufficient and truthful account of the experiment.

Michael: “The sampling error is entirely accounted for within the likelihood function if that sampling error matches the model that supplies the probabilities.” Michael, this seems naive to me. While I can see that the distribution used to derive likelihoods will become steeper or flatter based on sample size, this makes no difference if one adheres to the LL. Following the LL collapses all the likelihoods to essentially rank order, so it does not matter what the sample sizes are, right? If my sample mean is ’10’ then the best supported population mean is ’10,’ regardless of sample size. Yet, we know that ’10’ might be the result of sampling error and we know that following LL can frequently lead to wrong inferences.

Another counterpoint is that you like to cite the likelihood principle as authoritative but you are aware that it is not well defined. I think you need to state your definition to be clear, because my read is that most expositions of it do not agree with yours.

I think that frequency statistics do get directly at reliability by design, though perhaps no one test addresses all aspects of reliability. Conversely, Bayesian and likelihood approaches seem to ignore reliability by design (though exceptions no doubt exist, such as Gelman’s concerns for model checking, etc.).

Finally, I am not comfortable with the tendency to separate inference from evidence as though they are neighbors who may or may not share a fence. To me, there is no separation because inference follows from evidence, and evidence follows from procedure. Probabilities only relate to processes. I am confused by your points along these lines, but suspect the issue might be confusion from the likelihoodist end. Good scientists always keep one eye on what they observe, and the other eye on how they came to observe it. There is no place for ignoring the latter.

John, the statistical model used to generate likelihoods has to include the sampling distribution for the estimator of the parameter of interest, otherwise it would not be possible to calculate the probability of the observation given the parameter taking specified values.

The model does not include stopping rules (actually it can, but such inclusion makes no difference to the resulting likelihood function) or multiplicity of comparisons etc., but it certainly includes sample size. Increasing sample size does tend to make a likelihood function ‘steeper’, but that has more consequence than you seem to think: the steeper function has a relatively narrower peak and so parameter values around the most likely estimate become less favoured by the evidence as sample size increases. The likelihood function is useful for estimation, an important feature that is lost by discussing likelihood analyses in the frequentist terms of true/false with two nominated hypotheses.

Correction of what I think are mis-statements of the likelihood principle would take more space than I could sensibly use in a blog comment, but it is important. (Mayo, feel free to commission a guest blog post!) I suggest that you look at AWF Edwards’s book Likelihood (1972, 1992), which gives the clearest exposition of the principle and is still in print. (Edwards was a student of Fisher.) Royall’s book is excellent, but set up as a polemic which is unhelpful for readers who do not have an open mind.

The likelihood principle comes from consideration of what likelihood is and the consequences that follow if the likelihood function is accepted as a picture of the evidence in the data. It is a principle that tells us about evidence, not one that sets out how to make inferences. Statements that suggest that the likelihood principle outlaws consideration of aspects of the experiment other than the likelihood function are mistaken, as are statements that twist the principle into a warning against considerations of aspects of sampling distributions that do not affect the likelihood function. One may, and should, consider all relevant information when making inferences.

My calls for recognition of reliability of evidence as separate from strength of evidence and for inference to be a process step separate from assessment of evidence are attempts to make it possible to reconcile the likelihood principle and frequentist concerns that reliability be accommodated in scientific inference. (Not much luck so far!)

Michael: “the steeper function has a relatively narrower peak and so parameter values around the most likely estimate become less favoured by the evidence as sample size increases”: Of course, that is apparent, but if you follow the LL it makes no difference how steep the curve is, you will be led to select the sample estimate as the answer, without regard to how much confidence one should have in that conclusion. This will often be wrong and we know it in advance.

“The likelihood function is useful for estimation, an important feature that is lost by discussing likelihood analyses in the frequentist terms of true/false with two nominated hypotheses.” This criticism seems misplaced when the comparison is to frequentist confidence intervals, which I believe have the advantage of tempering the boldness of the estimates by considering sample size explicitly, and by providing a probability that is tied to the process. The true/false terms are pertinent to discussion of the LL, which you have not really discussed.

John, you’ve reached the core issue. Well done! The likelihood function does not tell you everything that you might wish to know in order to form an inference. It only tells you how much to prefer any hypothesised parameter value over any other within the same statistical model. You have to add consideration of issues relating to prior probabilities or reliability of the evidence using other statistical tools. That doesn’t make the law of likelihood not useful and nor does it make the likelihood principle false or in any way deficient. That’s just how it is. My bicycle is a good bicycle, but when I want to fly I have to board an aeroplane. That doesn’t make my bike a bad bike, and it doesn’t make it less useful when I want to go somewhere close enough to make the bicycle a sensible transport.

According to the likelihood principle, the likelihood function contains all of the evidence in the data relevant to choosing between the possible parameter values within the model, but there are other things that you might want to include in inference: the prior probabilities and previously obtained evidence, and the reliability of the evidence. Incorporate them using other statistical tools. The law of likelihood is a one-trick pony. It’s not sensible to blame that trick for the absence of things that it does not pretend to deliver. Not only that, but the suggestion that a conclusion based on a likelihood function “will often be wrong” is an extreme exaggeration, and is anyway a criticism that applies equally to any statistical method.

I think that you are not quite right in your statements about confidence intervals. There is a one-to-one correspondence between a frequentist likelihood function and a P-value, and a one-to-one correspondence between that P-value and a likelihood function. How does a confidence interval ‘consider’ sample size in a manner that a likelihood function does not? How do you ‘temper’ your response to an interval in a manner that is not applicable to a likelihood function?

Typo in my comment above: where I say “frequentist likelihood function” in the last paragraph, I obviously mean “frequentist confidence interval”. Damn.

Michael: Thanks for bearing with me on this. I see the relationships you point to, and have benefitted from the graphics that resulted from your simulations. The difference I am pointing at is what is entailed by the LP and LL as presented by nearly all expositions of the likelihoodist approach to inference. Combined, they instruct us to settle on a model, then choose the best answer from the sample as the conclusion because the “evidence” says so. Once you accept the model, you must choose the answer with greatest likelihood. There is no accommodation in this approach for uncertainty due to the process followed, once you commit to the process.

I know that you are not advocating this but many are; maybe most likelihoodists are. Significance testing is criticized precisely because it violates the LP AND LL. So there is my chief concern. Royall seems to have backed off of the extreme position when he suggested the 8:1 and 32:1 standards. This is not just methodological fancy, it is of fundamental relevance in real world application. Confidence intervals directly address the concern, whereas 8:1 as a standard is unclear and hard to explain to the consumer. Note that in forensic DNA testing, some advocate a similar scale against which to rate the strength of evidence of likelihood ratios. Clear as mud… Not explicitly related to anything you can grab onto… In some cases quite misleading, as when the match was found by fishing in large databases…

Can you offer a more comprehensive approach to using the likelihoods vs the more traditional freq methods?

Sorry, I’m not following. In the card example, you could have a model with fifty-two parameters each of which specifies the number of cards of a particular type that are in the deck. All of the “sure-thing” hypotheses are in that model, as is the standard-deck hypothesis, and all hypotheses in the model assign definite probabilities to all possible outcomes of a given draw. What’s wrong with that?

Greg, when you expand the model to allow the sure thing hypothesis (a deck of 52 fours of hearts, assuming the observed card was four of hearts) then you see that there are more than just two hypotheses. You also see that the model includes decks of cards that are very nearly as likely as the sure thing: 51 fours of hearts and one ace of spades; 51 fours of hearts and a two of spades; etc. Thus the standard presentation of the problem with just two hypotheses can be seen to be misleading in that the evidence does not uniquely favour the sure thing hypothesis over all others. The fact that the evidence supports 51 hypotheses to the degree 51/52 as strongly as the best supported hypothesis is notable but never mentioned. 51*51 hypotheses are supported to the degree 50/52, 51*51*51 to the degree 49/52 etc. The single datum is not in any sense definitive.
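The arithmetic behind these ratios can be sketched as follows. This is only an illustration under the assumption that a deck is described by how many copies of the observed card it contains and that the top card is drawn uniformly at random; the function name is made up for the example.

```python
from fractions import Fraction

DECK_SIZE = 52

def likelihood(copies_of_observed_card):
    """Pr(top card is the observed card) for a well-shuffled deck that
    contains this many copies of that card."""
    return Fraction(copies_of_observed_card, DECK_SIZE)

sure_thing = likelihood(52)   # every card is the observed card
standard = likelihood(1)      # ordinary deck, one copy

# Likelihood ratio of the sure-thing deck over the standard deck.
print(sure_thing / standard)

# Support for nearby decks relative to the sure-thing deck.
for copies in (52, 51, 50, 49, 1):
    print(copies, likelihood(copies) / sure_thing)
```

Listing the relative likelihoods this way is a crude stand-in for plotting the likelihood function: decks with 51, 50 or 49 copies sit only slightly below the maximum.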

(Please see my response to John Byrd for a little discussion of the role of arbitrary cutoffs for applying strength descriptors to particular ranges of likelihood ratios.)

I have often suggested that you plot the likelihood functions for the various problems. That is not just rhetorical advice. In plotting the likelihood function you will see the evidential meaning of the data much more clearly.

This trick deck problem leads to the idea that evidence may have yet another dimension (i.e. in addition to strength, as shown in the likelihood function, and reliability). That dimension is something like ‘mutability’. It relates to how easily the evidential favouring of the parameter values can be changed by the addition of an extra datum. In the card problem a second observation can entirely eliminate the support for the hypotheses most strongly favoured by the first datum. That aspect of the evidence does not seem to be captured by ‘strength’, by ‘reliability’ or by any Bayesian prior. It is an idea that I intend to explore in depth, but it is not at the top of my list at the moment.

Michael Lew: I know of no way to formalize the distinction you want to make between certain kinds of Sure Thing hypotheses and statistical models; it seems we’ll have to fall back on intuition, and the problem with that, of course, is that intuitions differ.

As to “mutability”: the entropy of the prior (or dispersion of the prior when the setting allows such) captures that just fine. Consider the toy example of a binomial parameter and a Beta prior; one reasonable measure of mutability is simply the inverse of the “effective prior sample size”, i.e., the sum of the prior’s hyperparameter values. The example readily generalizes.
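A minimal sketch of the toy example, assuming the standard conjugate Beta-binomial update; the function names and the two example priors are hypothetical and chosen only to show the contrast.

```python
def mutability(a, b):
    """Inverse of the effective prior sample size of a Beta(a, b) prior."""
    return 1.0 / (a + b)

def posterior_mean_shift(a, b, successes, failures):
    """Change in the mean of a Beta(a, b) prior after observing the data."""
    prior_mean = a / (a + b)
    posterior_mean = (a + successes) / (a + b + successes + failures)
    return posterior_mean - prior_mean

# Two hypothetical priors with the same mean (0.5) but different concentration.
for a, b in ((1, 1), (50, 50)):
    print((a, b), mutability(a, b), posterior_mean_shift(a, b, 1, 0))
```

A single observed success moves the diffuse prior far more than the concentrated one, which is the behaviour the inverse effective sample size is meant to summarise.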

Thanks Corey. Yes, there does seem to me to be a bit missing in the totality of the formalisations that we have in our kitbags. Of course my intuition may differ from yours.

I don’t see how a prior encodes the mutability that I have in mind, but maybe that shows that I’m missing something. Would your prior for the card problem differ between the situation where one card was drawn from the situation where several cards were drawn? If not, then how does the prior encode what I am thinking of as mutability of the data?

“Would your prior for the card problem differ between the situation where one card was drawn from the situation where several cards were drawn?”

Sure would. Consider a round of poker and two possible priors. The first prior places some weight on a fair deal and some on a cold deck; the second places all of its weight on a fair deal. The entropy of the first prior is larger because its support is a proper superset of the second prior’s. It is also more mutable in your sense — with just a little bit of evidence that there are, say, more than two monster hands at the table, the predictions that follow from the first prior will change drastically. In the case of the first prior, at the moment one player sees just their first card, there is not enough evidence to cause such a shift; but the second card can hint, the flop can hint more, the pattern of bets in the first round can be highly suspicious, and the turn can be downright incriminating.

Interesting. Your priors seem to interact with the observations in a manner that leads me to see them as different from the usual priors that are combined with evidence to produce posterior probability distributions. That conflicts with my idea that the evidence comes from the data and the prior comes from what is known (or supposed) that is not the data. How do you decide how to alter the priors? Is there any way to alter the priors in response to the data that does not impose a layer of ‘subjective’ evidence on the analysis? (I’m guessing that I still don’t get it.)

For a holder of the LL it can only be by disbelieving claims that the error statistician might simply consider poorly tested–perhaps quite believable. It also requires the user to arrange to have skeptical beliefs in a hypothesis when the skepticism is actually because of the methodology for its testing. They may come out in the same place, if you already had an error statistician’s ear to pick up on poor tests. But for a holder of the LL, this is not part of the evidence, only your beliefs.

Michael Lew: The situation I’m describing is not meant to be anything other than the usual Bayesian updating — I can’t see anything in the scenario I’m describing that doesn’t accord with “evidence comes from the data and the prior comes from what is known (or supposed) that is not the data”. I’m not sure how confusion crept in.

We can imagine the exact same sequence of cards and bets in two different scenarios: first, in a home game where the players don’t all know each other well; second, in a tournament with professional dealers and automatic card shufflers. The first prior might be appropriate in the first situation and the second prior in the second situation. The first prior places weight (perhaps not much, but not a negligible amount either) on the possibility of a certain kind of cheating; the second pretty much rules it out. The information that the prior encodes is entirely pre-data.

From the perspective of one of the players, both priors have pretty much the same prior predictive distribution for the identity of the first card. But if hints of cold decking start to accumulate, then the post-updating predictive distributions can diverge a great deal. Updating the second prior will result in predictive distributions that continue to reflect a fair shuffle even as card identities become known and bet information accumulates; updating the first prior will result in predictive distributions that favor deal results that would tend to result in unusually large bets.
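The divergence described here can be sketched with ordinary odds-form updating on two hypotheses, fair deal versus cold deck. Everything numeric below is an assumption for illustration: the prior weights and the likelihood ratios attached to each "hint" are invented, not taken from any real game.

```python
def update(prior_cold_deck, likelihood_ratio):
    """Return Pr(cold deck | evidence), where
    likelihood_ratio = Pr(evidence | cold deck) / Pr(evidence | fair deal)."""
    if prior_cold_deck in (0.0, 1.0):
        return prior_cold_deck  # a prior with all its weight on one hypothesis never moves
    odds = prior_cold_deck / (1.0 - prior_cold_deck) * likelihood_ratio
    return odds / (1.0 + odds)

# Hypothetical likelihood ratios: first card, second card, flop, betting, turn.
HINTS = [1.0, 1.5, 3.0, 5.0, 10.0]

def run(prior_cold_deck):
    """Apply the hints one at a time and record the probability after each."""
    p = prior_cold_deck
    trajectory = [p]
    for lr in HINTS:
        p = update(p, lr)
        trajectory.append(p)
    return trajectory

home_game = run(0.02)    # first prior: small but non-negligible weight on cheating
tournament = run(0.0)    # second prior: all weight on a fair deal
print(home_game)
print(tournament)
```

Both priors are fixed before any card is seen; only the home-game prior can respond to the accumulating hints, while the first card alone (likelihood ratio 1) moves neither.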

I gave the card example only to show Royall’s position on maximally likely alternatives, however constructed or chosen for testing. The relevant kind of hypothesis should be of the type I give in the post, where through selection effects, post-data subgroups, data-dependent endpoints etc., I’d be able to say that one and the same hypothesis has been tested differently (poorly). This is a distinction that Royall says we cannot make. We can only compare two different hypotheses to a fixed data set. I want to contrast the well-testedness of H (when selected in a given way) and the same H (without selection) and I can readily do so by considering the different error probabilities. The selection effect is invisible to the holder of the LL. Likewise for comparing x from fixed sample vs optional stopping. Royall is happy as a clam and jumps for joy that the LL can discern NO difference. He denies there is an evidential difference. The fact that a test of H that I’d condemn right now, thanks to my picking up on the altered sampling distribution and poor error probabilities, might prove to have inferred an H that is irreproducible and discovered wrong after more testing–so that even Royall might discover it was wrong–in no way saves the holder of the LL from a bankrupt account.

I take it that Michael’s notion of mutability has to do with the impact that possible future observations would have on one’s evidence, rather than with (or maybe in combination with?) how concentrated one’s prior is. Maybe a measure of expected information would do the trick: http://www.uv.es/~bernardo/1979AS.pdf

In any case, I don’t see what facts about the impact of possible future observations have to do with the evidential significance of past observations, although they might affect one’s choices about whether to collect further data before making a decision.

I agree that one would typically want to look at the entire likelihood function, but that doesn’t change what the LoL says about the comparison between the sure-thing hypothesis and the standard-deck hypothesis, nor do I see a principled way to prevent that comparison from being made.

Greg, I guess I’ll have to come out and say that I’m actually quite comfortable with the fact that the evidence from a single card is 52 times more strongly in favour of the sure thing hypothesis over the standard deck hypothesis. However, 52 times nothing is nothing. What inference are you going to draw in the knowledge that only one card was drawn?

The law of likelihood does not say that one has to make inference, or even how one should make inference. If you get a silly answer when equating the law of likelihood with a law of inference then that is a consequence of a poor approach to inference, not a faulty description of the evidential favouring of the limited data.

Agreed

Michael – Bias-variance? (See also regularisation and Corey’s discussion of priors?)

omaclaren, I don’t know what bias-variance refers to or why you mention it. Same for regularisation. What do you want me to get from Corey’s priors?

Michael – I’ve commented a bit more below but if these terms are unfamiliar then that probably won’t be any clearer either. Better to see e.g. the books on statistical learning by Hastie et al. [Chapter 2; available at http://statweb.stanford.edu/~tibs/ElemStatLearn/ ] and James et al. [Chapters 2 and 6; available at http://www-bcf.usc.edu/~gareth/ISL/getbook.html].

I believe you may be coming across the idea of the bias-variance tradeoff discussed in these books (see also e.g. Wasserman’s All of Statistics); regularization is one way to deal with this and it also often has a Bayesian interpretation (see e.g. 2.8.1 of Hastie et al.), where the prior gives you the regularization/penalty term.

Greg, another way to go is to note that Jack Good’s justification for the LP is only ‘in the limit’ as one gets enough evidence. In the card example one is certain before a draw is made that if the pack is normal the true hypothesis will not be the most likely. So as well as identifying the maximum likelihood hypothesis one needs – as Good does – to consider the ‘strength’ of the evidence.

As an example, it may be that it was more likely than not that Saddam had WMD but how strong was the evidence?

Deborah, how does my approach differ from yours? Any critical examples?

Dave: Good has a notion of “weight of evidence” but I don’t know if that’s what you’re alluding to. I myself, if you are asking, never consider the LR as more than a fit measure (thus relevant for the first condition for severity). Then one has to consider the probability of so good a fit (say, with H), under various discrepancies from H. In non-formal contexts, one does something analogous.

In your example, the question might be: had they done a good job ruling out ways they could be wrong about assuming WMDs?

Deborah, Have you written up your non-formal thoughts? For example, how would you reason about real coins? I suspect that our ideas are similar, but hope that yours are clearer than mine.

Thanks for the response, Dave. Do you have in mind something like “weight of evidence” in Keynes’s sense, which reflects the *quantity* of evidence rather than its valence? If so, how is the appeal to that notion supposed to work? If not, what do you have in mind? Good regards likelihood ratios as measuring “weight of evidence,” and likelihood ratios can be quite large in cases like Royall’s card example.

(This follows on from my previous comment.)

It is commonplace, for convenience I suppose, to write of pairs of particular hypotheses, such as H1 versus H2. However, focussing on nominated hypotheses disguises the role of the statistical model(s) and predisposes one to thinking that a useful account of evidence has to be able to compare simple with compound. That is a mistake.

For application of the Law of Likelihood, the hypotheses are nothing more than parameter values within the statistical model that yields the likelihoods. Otherwise the hypotheses would effectively be models and the problem becomes a model selection problem about which the Law of Likelihood is silent, and for which AIC or similar should be used.

The likelihoods can be had as a function of all of the possible values of the model parameters, and that function is the likelihood function. If you look at a likelihood function and map onto it a comparison of a simple (point) hypothesis with a compound hypothesis, either disconnected or connected multi-points, you will see that there is no real-world utility in such a comparison. The function shows which hypotheses the evidence favours relative to which others, and once you have that knowledge there is no reason to want to know if one particular hypothesis is better supported than an average of others. The likelihoods provide (as Fisher said, I think) an order of preference among the hypotheses. That would be order of preference among the possible values of model parameters.

Am I more likely to be taller than six foot, or shorter than six foot? You can answer that using probabilities after choosing a relevant population distribution. However, if you have evidence about my height from a measurement the relevant question becomes “how tall am I?” The question becomes specific and the model becomes one related to the measurement rather than the population that you might have chosen for the first question.

“I take it the Bayesian response would be to agree, but still deny there is evidence for disease. Yes?”

Speaking only for myself, I would deny that the totality of the available information (comprising the test result and the known prevalence) favors the claim that disease is present.

Corey: OK, but could there be a case where the totality of the evidence favors “no disease”? I don’t like the disease example, my question is simpler really: if H has been given a Bayes boost, but the priors result in not-H (or whatever rival being considered) being more probable than H, then would a Bayesian say there is evidence for not-H? Remember non-fan from the first installment on “Breaking the law of likelihood”? That person was arguing this, as I recall.

https://errorstatistics.com/2014/08/29/breaking-the-law-of-likelihood-only-way-to-keep-their-fit-measures-in-line-a/#comment-91761

Mayo: On my account, a prior weighing heavily to not-H is logically equivalent to — that is, implies and is implied by — the availability of (prior) evidence for not-H. It is not that the prior *is* evidence; rather the prior encodes some heretofore implicit evidence.

Michael’s likelihoodist approach seems perfectly reasonable, given enough data. I also dislike the emphasis on two hypotheses as opposed to something like considering the whole likelihood function.

I wonder if the difficulty with the single-observation problem lies in it being in some sense ill-posed/ill-conditioned. Under typical variations in single samples the most likely parameters would vary wildly, no? Hence if one actually does want to do the inference there must be heavy regularisation or strong priors to stabilise the problem, right? The degree to which the regularisation/prior info informs the inference relative to the pure data/likelihood probably gives some notion of the underdetermination etc. in the problem. Bias-variance tradeoffs similarly come to mind.

omaclaren, you are on to something here. The counter-examples to the law of likelihood all seem to rely on n=1. Perhaps the degrees of freedom being zero is a warning sign.

What is exactly meant by the term evidence? I see it used quite a lot, but have a sneaking suspicion there are a few different definitions in use at once.

— From the OED: “The available body of facts or information indicating whether a belief or proposition is true or valid.”

— In the Bayesian literature, it is often used as another name for the likelihood of data given a model, post marginalization.

— A collection of results from statistical tests.

— Something else… say using information theory perhaps?

West, there is no agreed meaning for “evidence” in statistics or philosophy of statistics beyond the ordinary lay language meaning. Unless you allow the likelihood principle to guide how you look at the idea of statistical evidence, that is.

I think that one of the important impediments to reconciliation of the various approaches to statistical inference (which, you should admit, are at least plausibly complementary) is the fact that those who, implicitly or explicitly, accept the likelihood principle can ‘lord it over’ the poor frequentists who have to scratch around trying to integrate ordinary notions of evidence into their thinking. (I have to say that Mayo’s severity goes a long way towards that end.)

The situation is quite the opposite! The most natural thing is to care about what other outcomes could have occurred, the overall error properties of tools, and the influences that selection effects have on the capability of methods to detect and control erroneous assertions. That’s why Birnbaum rejected it, despite his “proof”. Those left scratching around are the poor adherents of the SLP who cannot capture the most ordinary and mandatory intuitions about taking evidence seriously. It’s no fluke that buying the “simplicity and freedom” (Savage) of the SLP has cost us: cavalier attitudes toward cherry picking, post-hoc subgroups, multiple testing, optional stopping and the like.

Michael: Your position reads like rhetoric more than arguments supported by substance. I would add that it does not appear to me that your comment that the LL criticisms point only to N=1 has much of a basis in the literature. At least not what I have seen. Criticisms against LL easily follow from principle, and can be easily shown with any modest sample sizes. In light of how the SLP and LL are being used by many practitioners, it does not seem to be good practice to cling to the (petty) excuse that the “evidence” is only from the model and sample, and we need not burden our view of evidence in favor of an assertion with worries over other highly relevant factors (like other values that could have occurred, given the methods used…). It seems that we are misleading researchers to adopt naive methods of data analysis and interpretation by promoting the commonly presented versions of LL and SLP, but perhaps then offering backdoor ways to cut the losses created by using these likelihoodist principles (e.g. 32:1 is very strong evidence; or 1000:1 is very strong evidence). And, these backdoor ways have no clear justification.

John – could you provide some examples which don’t rely on having more parameters than data points (following Michael’s emphasis on models)?

Sure. Consider what happens when you have a population with mean of 20. You take a random sample of N=15 and calculate the sample mean. Then, take another sample and calculate the sample mean… repeat this many times. Each iteration of this game offers an opportunity to examine the likelihood distribution based on the sample. The LL and LP for any iteration direct you to choose the sample mean value as the parameter mean value. Yet, we all know that this process of estimating a population mean will very often give us a value different from the population mean because of sampling error. The values not observed are highly relevant when interpreting the sample statistics and must be considered as part of the evidence when making an inference, because these numbers mean nothing when separated from knowledge of the process that produced them.
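The repeated-sampling game can be sketched in a few lines. The population mean of 20 and N=15 come from the comment; the normal shape, the standard deviation of 5, the number of repetitions and the seed are assumptions added only to make the simulation runnable.

```python
import random
import statistics

def simulate_sample_means(pop_mean=20.0, pop_sd=5.0, n=15, reps=10_000, seed=0):
    """Draw `reps` random samples of size n from an assumed normal
    population and return the sample mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(reps)
    ]

means = simulate_sample_means()
print("average of sample means:", statistics.fmean(means))
print("spread of sample means: ", statistics.stdev(means))
print("share more than 1 unit from 20:",
      sum(abs(m - 20.0) > 1.0 for m in means) / len(means))
```

Each individual sample mean is the maximum-likelihood value for its own sample, yet across repetitions those values scatter around 20 with a spread of roughly the assumed standard deviation divided by the square root of 15.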

OK good, sure. I agree values not observed are relevant. That’s why I referred to bias-variance trade-offs (see refs above), regularization and priors. Each of these generally uses the likelihood as the ‘fit’ and then penalises for taking one sample too seriously. This seems to be how the likelihood is used by most in practice, if not in principle (reminds me of the joke about the theoretician who asks – ‘sure, it works in practice, but does it work in principle?!’).

In each case too, the smaller the sample the higher the expected difference in the next sample mean (i.e. the higher the sample-to-sample variance) and the more you should (if you want to obtain a stable estimate for future samples) impose additional bias (regularization/priors etc) based on knowledge/modelling/assumptions of past or future samples, if you want to stabilise the estimate*.

And in each case, the larger the sample the less you need to worry about sample-to-sample variability.

So again, it’s about the parameters vs sample size and, in pathological or difficult cases, doing the best you can.

*I should note that it’s ‘easy’ enough to report how much regularization/prior info contributes relative to the sample/likelihood. This would perhaps be helpful for the whole severity/how reliable is this actually question?
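One simple way to see this trade-off is the conjugate normal case, where the regularised estimate is a weighted average of the sample mean and a prior mean. This is a sketch under assumptions: known sigma, a normal prior, and made-up numbers throughout; the function name is hypothetical.

```python
def shrunk_mean(xbar, n, sigma, prior_mean, prior_sd):
    """Posterior mean for a normal mean with known sigma and a
    Normal(prior_mean, prior_sd**2) prior, plus the weight given to the data."""
    data_weight = n / (n + (sigma / prior_sd) ** 2)
    estimate = data_weight * xbar + (1.0 - data_weight) * prior_mean
    return estimate, data_weight

# Hypothetical numbers: sample mean 22, sigma 5, prior centred on 20 with sd 5.
for n in (1, 15, 100):
    estimate, data_weight = shrunk_mean(22.0, n, 5.0, 20.0, 5.0)
    print(n, round(estimate, 3), "prior contribution:", round(1.0 - data_weight, 3))
```

The reported prior contribution shrinks as n grows, which is one concrete form of stating how much of the estimate comes from the regularisation rather than from the sample.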

John, you are mistaken. You say that “The LL and LP […] direct you to choose the sample mean value as the parameter mean value”, but the law of likelihood simply tells you how to quantify the evidential support for one parameter value over another, and the likelihood principle tells you that the evidence is in the observation given the model. How do either or both of them “direct” you to choose anything? The likelihood principle is not a decision theory.

The likelihood functions that you would generate for each of your iterations do take the sampling distribution of the mean into account, as that is built into the model. Plot a likelihood function and you will see for yourself.

John, you are being harsh and unfair.

I said that counter-examples to the law of likelihood seem to rely on n=1, not all criticisms. The criticisms of the likelihood principle and the law of likelihood come from more than just the existence of alleged counter-examples.

When you say “in light of how the SLP and LL are being used by many practitioners”, do you not see that I may not agree with the common statements of SLP? Do you really think that there are many people expressly applying the law of likelihood? Certainly not in the basic biomedical fields that I play around in. Do you mean to imply that “P-hacking” and lax experimental practices are the fault of the law of likelihood? That sounds like a position that would be difficult to support.

I don’t see how you can take exception to the notion that “evidence” is in the observation. The observation has to contain the evidence as a consequence of the natural language meaning of evidence. The model is necessary to interpret or quantify the evidence because, well, how else? Other factors are indeed relevant and, often, essential for a proper interpretation of the evidence, but those factors are external to the evidence. I do not take the position that you ascribe to me wherein it is not necessary to consider those extra-evidential factors. No likelihoodlum needs to take such a position, although I do see that Royall and Edwards might do so.

I do not support the general utility of arbitrary levels of likelihood ratio for labels of evidential strength. That leads to a damaging tendency towards dichotomisation. Instead of looking at a ratio of the likelihoods for two hypothetical parameter values, I recommend looking at the whole likelihood function. After all, that is the whole evidence.

Michael: You might be giving me too much credit when you say I am harsh and unfair. I am being somewhat provocative because there seems to be confusion over definitions and criticisms. More on that shortly…

The place to look for a characterization of evidence is not the dictionary, it is to philosophy. Not that we agree. Of course, you should speak of evidence for or against a claim or the like–not just evidence.

More: from that ultimate authoritative source, Wikipedia: “Combining the likelihood principle with the law of likelihood yields the consequence that the parameter value which maximizes the likelihood function is the value which is most strongly supported by the evidence. This is the basis for the widely used method of maximum likelihood.” It appears to me that many practitioners take this to say that you must make your inference based on this reasoning. I know that I have been criticized for using p-values precisely because it violates the LP and LL combination. At the same time it is clear that many likelihoodists do not take the LL too seriously if they have extra add-ons to mitigate the foolishness of taking the LL to be some kind of final word in an analysis. So, why call it a Law? Seems overstated and oversold, and is likely confusing a lot of people.

As I say, the LL is bankrupt as an account of evidence. I haven’t seen any good arguments to the contrary. In fact, it’s not even an account of inference. H makes x less probable than does J. Where’s the inference?

So, my understanding is that LP and LL say that is the evidence and the evidence points to J. If you do not infer J, you have “broken the law.” But, it appears to me that the likelihood function based on the sample is not THE evidence, it is part of the evidence that can inform the inference. To present it as has been done so many times as ALL the evidence is very confused and confusing. I would go further to say that numbers can never be all of the evidence. The procedures used to generate the data are inseparable from the numbers generated and are part of the package we should call our evidence.

So, if J compares to H with a 7:1 likelihood ratio do we infer J? Well, it is less than 8:1 and a lot less than 32:1, so how do we use the LL? We probably take it with a big grain of salt. We are worried about being wrong about J, because we understand limitations of the process that led to the likelihood function. And other factors as well. So, the numbers cannot stand as THE evidence in my view.

John, you are badly mistaken. The likelihood principle and the law of likelihood tell you how to see and quantify the evidence in the data relevant to estimation of a parameter of interest (given a statistical model), but neither of them tells you how you have to make inferences! It is entirely “lawful” to make inferences on the basis of evidence and _any_ other information, loss functions, prior opinions, non-data-related aspects of the evidence (multiplicity of testing etc.), and anything else that you think important. Inferences should be made by sentient beings after thoughtful consideration. Evidence is important, but it is not the only thing that matters to inference, despite being the only thing subject to the law of likelihood.

I believe that much of the ill-feeling towards the law of likelihood comes from the misconception that you expressed.

Yes, you are no doubt right about that ill feeling being traceable to the issue I raised. But, you say it is a misconception– and for you it might be– but you know there are very critiques of classical significance testing, confidence procedures, etc that express the misconceptions as though they provide a definition refutation of statistics such as p-values. Another issue is how to use the word “evidence”, but I suppose this is semantics?

Should “very many critiques…” And, “..definitive refutation…”

Btw I totally agree that it seems silly to call it a law, in the same way it would be silly to have a ‘law of the sum of squared residuals’ or similar. Was just trying to point out that pretty much every school of thought has ways of avoiding overfitting. Clearly though there has been some controversy over the years which still muddies the waters so it’s helpful to make criticisms explicit.

“Every school” that has ways to avoid overfitting had better be able to justify those ways in terms of their effectiveness and the overarching goals of inference account. I have not seen that.

Do you mean that you haven’t seen these sort of justifications for methods of avoiding overfitting or that you don’t buy the justifications given?
