There was a session at the Philosophy of Science Association meeting last week where two of the speakers, Greg Gandenberger and Jiji Zhang, had insightful things to say about the “Law of Likelihood” (LL)[i]. Recall from recent posts here and here that the (LL) regards data x as evidence supporting H1 over H0 iff
Pr(x; H1) > Pr(x; H0).
On many accounts, the likelihood ratio also measures the strength of that comparative evidence (Royall 1997, p. 3). [ii]
H0 and H1 are statistical hypotheses that assign probabilities to the random variable X taking value x. As I recall, the speakers limited H1 and H0 to simple statistical hypotheses (as Richard Royall generally does), already restricting the account to rather artificial cases, but I put that to one side. Remember, with likelihoods, the data x are fixed, the hypotheses vary.
1. Maximally likely alternatives. I didn’t really disagree with anything the speakers said. I welcomed their recognition that a central problem facing the (LL) is the ease of constructing maximally likely alternatives: so long as Pr(x; H0) < 1, a maximally likely alternative H1 would be evidentially “favored”. There is no onus on the likelihoodist to predesignate the rival; you are free to search, hunt, post-designate and construct a best (or better) fitting rival. If you’re bothered by this, says Royall, then this just means the evidence disagrees with your prior beliefs.
After all, Royall famously distinguishes between evidence and belief (recall the evidence-belief-action distinction), and these problematic cases, he thinks, do not vitiate his account as an account of evidence. But I think they do! In fact, I think they render the (LL) utterly bankrupt as an account of evidence. Here are a few reasons. (Let me be clear that I am not pinning Royall’s defense on the speakers[iii], so much as saying it came up in the general discussion[iv].)
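To make the ease of constructing maximally likely alternatives concrete, here is a minimal Python sketch. The binomial setup (n = 20 trials, H0: p = 0.5) and the function names are assumptions introduced purely for illustration; they come from neither the session nor Royall. Whatever x turns up, the rival p = x/n, chosen after the fact, is never less likely than H0.

```python
from math import comb

def binom_lik(x, n, p):
    """Pr(x; p): probability of x successes in n Bernoulli(p) trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def max_likely_ratio(x, n, p0=0.5):
    """Likelihood ratio of the post-designated rival p1 = x/n against H0: p = p0."""
    p1 = x / n  # rival constructed after seeing the data, to fit it best
    return binom_lik(x, n, p1) / binom_lik(x, n, p0)

n = 20  # illustrative sample size (an assumption)
ratios = [max_likely_ratio(x, n) for x in range(n + 1)]

for x, r in enumerate(ratios):
    print(f"x = {x:2d}  LR(best-fitting rival vs H0) = {r:.3f}")
```

Under these assumptions only x = 10 yields a ratio of exactly 1; every other outcome “favors” some rival constructed to fit it.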
2. Appealing to prior beliefs to avoid the problem of maximally likely alternatives. Recall Royall’s treatment of maximally likely alternatives in the case of turning over the top card of a shuffled deck, and finding an ace of diamonds:
According to the law of likelihood, the hypothesis that the deck consists of 52 aces of diamonds (H1) is better supported than the hypothesis that the deck is normal (HN) [by the factor 52]…Some find this disturbing.
Furthermore, it seems unfair; no matter what card is drawn, the law implies that the corresponding trick-deck hypothesis (52 cards just like the one drawn) is better supported than the normal-deck hypothesis. Thus even if the deck is normal we will always claim to have found strong evidence that it is not. (Royall 1997, pp. 13-14)
To Royall, it only shows a confusion between evidence and belief. If you’re not convinced the deck has 52 aces of diamonds “it does not mean that the observation is not strong evidence in favor of H1 versus HN.” It just wasn’t strong enough to overcome your prior beliefs.
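The arithmetic behind Royall’s factor of 52 can be checked with a short Python sketch. The function names and the encoding of cards as (rank, suit) pairs are assumptions made for illustration. For each possible top card, it compares the normal-deck hypothesis with the trick-deck hypothesis built around that very card.

```python
from fractions import Fraction

SUITS = ["clubs", "diamonds", "hearts", "spades"]
RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]

def lik_normal(card):
    """Pr(top card = card; normal 52-card deck)."""
    return Fraction(1, 52)

def lik_trick(card, trick_card):
    """Pr(top card = card; deck made of 52 copies of trick_card)."""
    return Fraction(1) if card == trick_card else Fraction(0)

deck = [(rank, suit) for suit in SUITS for rank in RANKS]

# For each card that could be drawn, form the trick-deck rival that matches it
# and take the likelihood ratio against the normal deck.
ratios = {card: lik_trick(card, card) / lik_normal(card) for card in deck}

print(ratios[("ace", "diamonds")])
print(all(r == 52 for r in ratios.values()))
```

The ratio is 52 whichever card is drawn, which is the sense in which, even with a normal deck, the procedure always reports strong evidence against it.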
The relation to Bayesian inference, as Royall notes, is that the likelihood ratio “that the law [LL] uses to measure the strength of the evidence, is precisely the factor by which the observation X = x would change the probability ratio” Pr(H1)/Pr(H0) (Royall 1997, p. 6). So, if you don’t think the maximally likely alternative is palatable, you can get around it by giving it a suitably low prior degree of probability. But the more likely hypothesis is still favored on grounds of evidence, according to this view. (Do Bayesians agree?)
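A minimal sketch of that odds updating, in Python, taking the odds as Pr(H1)/Pr(H0) and reusing the factor of 52 from the card example. The prior odds of one in a million for the trick deck are a made-up number, chosen only to show how a low prior is supposed to neutralize the maximally likely alternative.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Posterior odds of H1 to H0 = likelihood ratio times prior odds."""
    return likelihood_ratio * prior_odds

lr = Fraction(52)                    # Pr(x; H1) / Pr(x; HN) in the card example
prior = Fraction(1, 1_000_000)       # hypothetical prior odds for the trick deck
post = posterior_odds(prior, lr)
post_prob = post / (1 + post)        # convert odds back to a probability

print(post, float(post_prob))
```

The odds for the trick deck have gone up by the factor 52, so on the (LL) the evidence still favors it, while the posterior probability stays tiny. The low prior changes the belief, not the verdict about the evidence.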
When this “appeal to beliefs” solution came up in the discussion at this session, some suggested that you should simply refrain from proposing implausible maximally likely alternatives! I think this misses the crucial issues.
3. What’s wrong with the “appeal to beliefs” solution to the (LL) problem: First, there are many cases where we want to distinguish the warrant for one and the same hypothesis according to whether it was constructed post hoc to fit the data or predesignated. The “use-constructed” hypothesis H could well be plausible, but we’d still want to distinguish the evidential credit H deserves in the two cases, and appealing to priors does not help.
Second, to suppose one can be saved from the unpleasant consequences of the (LL) by the deus ex machina of a prior is to misidentify what the problem really is, at least when there is a problem (not all data-dependent alternatives are problematic; see my double-counting papers, e.g., here). In the problem cases, the problem is due to the error-probing capability of the overall testing procedure being diminished. You are not “sincerely trying”, as Popper puts it, to find flaws with claims, but instead you are happily finding evidence in favor of a well-fitting hypothesis that you deliberately construct, unless your intuitions tell you it is unbelievable. So now the task that was supposed to be performed by an account of statistical evidence is not being performed by it at all. It has to be performed by you, and you are the most likely one to follow your preconceived opinions and pet theories. You are the one in danger of confirmation bias. If your account of statistical evidence won’t supply tools to help you honestly criticize yourself (let alone allow the rest of us to fraud-bust your inference), then it comes up short in an essential way.
4. The role of statistical philosophy in philosophy of science. I recall having lunch with Royall when we first met (at an ecology conference around 1998) and trying to explain, “You see, in philosophy, we look to statistical accounts in order to address general problems about scientific evidence, inductive inference, and hypothesis testing. And one of the classic problems we wrestle with is that data underdetermine hypotheses; there are many hypotheses we can dream up to “fit” the data. We look to statistical philosophy to get insights into warranted inductive inference, to distinguish ad hoc hypotheses, confirmation biases, etc. We want to bring out the problem with that Texas “sharpshooter” who fires some shots into the side of a barn and then cleverly paints a target so that most of his hits are in the bull’s eye, and then takes this as evidence of his marksmanship. So, the problem with the (LL) is that it appears to license rather than condemn some of these pseudoscientific practices.”
His answer, as near as I can recall, was that he was doing statistics and didn’t know about these philosophical issues. Had it been current times, perhaps I could have been more effective in pointing up the “reproducibility crisis,” “big data,” and “fraud-busting”. Anyway, he wouldn’t relent, even on stopping rules.
But his general stance is one I often hear: We can take into account those tricky moves later on in our belief assignments. The (LL) just gives a measure of the evidence in the data. But this IS later on. Since these gambits can completely destroy your having any respectable evidence whatsoever, you can’t say “the evidence is fine, I’ll correct things with beliefs later on”.
Besides, the influence of the selection effects is not on the believability of H but rather on the capability of the test to have unearthed errors. Their influence is on the error probabilities of the test procedure, and yet the (LL) is conditional on the actual outcome.
5. Why does the likelihoodist not appeal to error probabilities to solve his problem? The answer is that he is convinced that such an appeal is necessarily limited to controlling erroneous actions in the long run. That is why Royall rejects it (claiming it is only relevant for “action”), and only a few of us here in exile have come around to mounting a serious challenge to this extreme behavioristic rationale for error statistical methods. Fisher, E. Pearson, and even Neyman some of the time, rejected such a crass behavioristic rationale, as have Birnbaum, Cox, Kempthorne and many other frequentists. (See this post on Pearson.)
Yet, I have just shown that the criticisms based on error probabilities have scarcely anything to do with the long run, but have everything to do with whether you have done a good job providing evidence for your favored hypothesis right now.
A likelihood ratio may be a criterion of relative fit, but it “is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [likelihoods] alone is not adequate to insure control of this error” (Pearson and Neyman, 1930, p. 106).
Pearson and Neyman should have been explicit as to how this error control is essential for a strong argument from coincidence in the case at hand.
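Here is one way to see what the sampling distribution adds in the case at hand, as a Python sketch under stated assumptions. The binomial model with n = 20, the null H0: p = 0.5, the predesignated rival p = 0.75 and the cutoff of 8 for “strong” comparative evidence are all illustrative choices, not taken from Pearson and Neyman. The sketch computes, under H0, the probability of a likelihood ratio of at least 8 against H0, first with the rival fixed in advance and then with the rival fitted to the data.

```python
from math import comb

N, P0, THRESHOLD = 20, 0.5, 8.0  # illustrative assumptions

def lik(x, p):
    """Pr(x; p) for x successes in N Bernoulli(p) trials."""
    return comb(N, x) * p**x * (1 - p)**(N - x)

def prob_misleading(rival_for):
    """Pr(LR >= THRESHOLD; H0), where rival_for(x) is the rival's p given outcome x."""
    return sum(
        lik(x, P0)
        for x in range(N + 1)
        if lik(x, rival_for(x)) / lik(x, P0) >= THRESHOLD
    )

pre = prob_misleading(lambda x: 0.75)    # rival predesignated
post = prob_misleading(lambda x: x / N)  # rival constructed to fit x

print(f"predesignated rival:   {pre:.4f}")
print(f"post-designated rival: {post:.4f}")
```

With x = 15 the two rivals coincide and the likelihood ratio is identical, yet under these assumptions the post-designated procedure produces such a ratio against a true H0 twice as often (about 0.041 versus 0.021). That difference lives in the sampling distribution, not in the likelihood ratio for the observed x.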
Ironically, a great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, P-values, power, etc.) start by assuming “the law (LL)”, when in fact attention to the probativeness of tests by means of the relevant sampling distribution is just the cure the likelihoodist needs.
6. Is it true that all attempts to say whether x is good or terrible evidence for H are utterly futile? Royall says they are, that only comparing a fixed x to H versus some alternative H’ can work.
“[T]he likelihood view is that observations [like x and y]…have no valid interpretation as evidence in relation to the single hypothesis H.” (Royall 2004, p. 149).
But we should disagree. We most certainly can say that x is quite lousy evidence for H, if nothing (or very little) has been done to find flaws in H, or if I constructed an H to agree swimmingly with x, but by means that make it extremely easy to achieve, even if H is false.
Suppose that, finding no statistically significant difference on the tested factor, I hunt for a subgroup or post-data endpoint that gives “nominal” statistical significance. Whether H1 was pre-designated or post-designated makes no difference to the likelihood ratio, and the prior given to H1 would be the same whether it was pre- or post-designated. The post-designated alternative might be highly plausible, but I would still want to say that selection effects, cherry-picking, and generally “trying and trying again” alter the stringency of the test. This altered capacity in the test’s picking up on sources of bias and unreliability has no home in the (LL) account of evidence. That is why I say it fails in an essential way, as an account of evidence.
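How much “trying and trying again” can alter the stringency of a test is easy to sketch in Python. The assumptions are mine and deliberately simple: k = 20 subgroups examined, no real effect in any of them, tests that are independent, and a nominal cutoff of 0.05. Under a true null each p-value is uniform on [0, 1], so the simulation just draws uniform numbers.

```python
import random

def prob_some_nominal(k, alpha):
    """Pr(at least one of k independent null p-values falls below alpha)."""
    return 1 - (1 - alpha) ** k

def simulate(k, alpha, trials, seed=0):
    """Monte Carlo estimate of the same probability (seed fixed for repeatability)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Report "success" if the best-looking subgroup is nominally significant.
        if min(rng.random() for _ in range(k)) < alpha:
            hits += 1
    return hits / trials

analytic = prob_some_nominal(20, 0.05)
estimate = simulate(20, 0.05, trials=20_000)

print(f"analytic:  {analytic:.4f}")
print(f"simulated: {estimate:.4f}")
```

Under these assumptions the chance of finding at least one “nominally significant” subgroup when nothing is going on is about 0.64, not 0.05. The reported likelihood ratio for the selected subgroup is the same whether or not the hunt took place.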
7. So what does the Bayesian say about the (LL)? I take it the Bayesian would deny that the comparative evidence account given by the (LL) is adequate. LRs are important, of course, but there are also prior probability assignments to hypotheses. Yet that would seem to get us right back to Royall’s problem that we have been discussing here.
In this connection, ponder [v].
[ii] For a full statement of the [LL] according to Royall: “If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/pB(x), measures the strength of that evidence.” (Royall, 2004, p. 122)
“This says simply that if an event is more probable under hypothesis A than hypothesis B, then the occurrence of that event is evidence supporting A over B––the hypothesis that did the better job of predicting the event is better supported by its occurrence.” Moreover, “the likelihood ratio, is the exact factor by which the probability ratio [ratio of priors in A and B] is changed.” (ibid., p. 123)
Aside from denying the underlined sentence, can a Bayesian violate the [LL]? In comments to this first post, it was argued that they can.
[iii] In fact, Gandenberger’s paper was about why he is not a “methodological likelihoodist” and Zhang was only dealing with a specific criticism of (LL) by Forster. [Gandenberger’s blog: http://gandenberger.org]
[iv] Granted, the speakers did not declare Royall’s way out of the problem leads to bankruptcy, as I would have wanted them to.
[v] I’m placing this here for possible input later on. Royall considers the familiar example where a positive diagnostic result is more probable under “disease” than “no disease”. If the prior probability for disease is sufficiently small, it can result in a low posterior for disease. For Royall, “to interpret the positive test result as evidence that the subject does not have the disease is never appropriate––it is simply and unequivocally wrong. Why is it wrong?” (2004, 122). Because it violates the (LL). This gets to the contrast between “Bayes boosts” and high posterior again. I take it the Bayesian response would be to agree, but still deny there is evidence for disease. Yes? [This is like our example of Isaac who passes many tests of high school readiness, so the LR in favor of his being ready is greater than 1. However, having been randomly selected from “Fewready” town, the posterior for his readiness is still low (despite its having increased).] Severity here seems to be in sync with the B-boosters, at least in the direction of the evidence.
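The contrast between a “Bayes boost” and a high posterior in the diagnostic example can be put in numbers with a small Python sketch. The figures are hypothetical, not Royall’s: a test that is positive with probability 0.95 under “disease” and 0.05 under “no disease”, and a prior probability of disease of 0.001.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
p_pos_given_disease = Fraction(95, 100)
p_pos_given_healthy = Fraction(5, 100)
prior = Fraction(1, 1000)

# Likelihood ratio of "disease" to "no disease" for a positive result.
lr = p_pos_given_disease / p_pos_given_healthy

# Posterior probability of disease by Bayes' theorem.
posterior = (p_pos_given_disease * prior) / (
    p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
)

print(f"LR = {lr}, prior = {float(prior):.4f}, posterior = {float(posterior):.4f}")
```

With these numbers the likelihood ratio is 19 and the posterior rises from 0.001 to roughly 0.019: a boost in the direction of disease, with a posterior that is still low.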
Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion & rejoinder). Statistical Science 29, no. 2, 227-266.
Mayo, D. G. (2004). “An Error-Statistical Philosophy of Evidence,” 79-118, in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.
Pearson, E. S. & Neyman, J. (1930). “On the Problem of Two Samples,” 99-115, in J. Neyman and E. S. Pearson (1967), Joint Statistical Papers. Cambridge: CUP.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, CRC Press.
Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence,” 119-138; Rejoinder 145-151, in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.