How likelihoodists exaggerate evidence from statistical tests


I insist on point against point, no matter how much it hurts

Have you ever noticed that some leading advocates of a statistical account, say a testing account A, upon discovering account A is unable to handle a certain kind of important testing problem that a rival testing account, account B, has no trouble at all with, will mount an argument that being able to handle that kind of problem is actually a bad thing? In fact, they might argue that testing account B is not a “real” testing account because it can handle such a problem? You have? Sure you have, if you read this blog. But that’s only a subliminal point of this post.

I’ve had three posts recently on the Law of Likelihood (LL): Breaking the [LL] (a), (b), (c), and [LL] is bankrupt. Please read at least one of them for background. All deal with Royall’s comparative likelihoodist account, which some will say only a few people even use, but I promise you that these same points come up again and again in foundational criticisms from entirely other quarters.[i]

An example from Royall is typical: He makes it clear that an account based on the (LL) is unable to handle composite tests, even simple one-sided tests for which account B supplies uniformly most powerful (UMP) tests. He concludes, not that his test comes up short, but that any genuine test or ‘rule of rejection’ must have a point alternative!  Here’s the case (Royall, 1997, pp. 19-20):

[M]edical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope θ is considerably greater, perhaps 0.8 or even greater. To obtain evidence about θ, they carry out a study in which the new treatment is given to 17 subjects, and find that it is successful in nine.

Let me interject at this point that of all of Stephen Senn’s posts on this blog, my favorite is the one where he zeroes in on the proper way to think about the discrepancy we hope to find (the .8 in this example). (See note [ii])

A standard statistical analysis of their observations would use a Bernoulli (θ) statistical model and test the composite hypotheses H1: θ ≤ 0.2 versus H2: θ > 0.2. That analysis would show that H1 can be rejected in favor of H2 at any significance level greater than 0.003, a result that is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1. (Royall, ibid.)

Following Royall’s numbers, the observed success rate is:

m0 = 9/17 ≈ .53, exceeding the null value θ = 0.2 by roughly 3 standard errors, since σ/√17 = √(0.2 × 0.8)/√17 ≈ 0.1, yielding significance level ~.003.

So, the observed success rate m0 = .53, “is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1.” (ibid. p. 20) [For a link to an article by Royall, see the references.]
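Royall’s significance level is easy to verify with an exact binomial tail computation. Here is a sketch in Python (standard library only; the function name and rounding are mine):

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """Exact one-sided P-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Nine or more successes in 17 trials, under the null value theta = 0.2:
p_value = binom_tail(9, 17, 0.2)
print(round(p_value, 4))  # 0.0026, so H1 is rejected at any level above ~.003
```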

And indeed it is altogether warranted to regard the data as very strong evidence that θ > 0.2—which is precisely what H2 asserts (not fussing with his rather small sample size). In fact, m0 warrants inferring even larger discrepancies, but let’s first see where Royall has stopped in his tracks.[iii]

Royall claims he is unable to allow that m0 = .53 is evidence against the null in the one-sided test we are considering: H1: θ ≤ 0.2 versus H2: θ > 0.2.

He tells us why in the next paragraph (ibid., p. 20):

But because H1 contains some simple hypotheses that are better supported than some hypotheses in H2 (e.g., θ = 0.2 is better supported than θ = 0.9 by a likelihood ratio of LR = (0.2/0.9)^9 (0.8/0.1)^8 = 22.2), the law of likelihood does not allow the characterization of these observations as strong evidence for H2 over H1. (My emphasis; note I didn’t check his numbers since they hardly matter.)
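As it happens, Royall’s 22.2 checks out. The binomial coefficient cancels in the ratio, so only the point likelihoods matter (a sketch; the function name is my own):

```python
def likelihood_ratio(theta1, theta2, successes=9, failures=8):
    """L(theta1)/L(theta2) for binomial data; C(n, k) cancels in the ratio."""
    return (theta1 / theta2) ** successes * ((1 - theta1) / (1 - theta2)) ** failures

# theta = 0.2 vs theta = 0.9, for 9 successes in 17 trials:
lr = likelihood_ratio(0.2, 0.9)
print(round(lr, 1))  # approximately 22.2
```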

It appears that Royall views rejecting H1: θ ≤ 0.2 and inferring H2: θ > 0.2 as asserting every parameter point within H2 is more likely than every point in H1! (That strikes me as a highly idiosyncratic meaning.) Whereas, the significance tester just takes it to mean what it says:

to reject H1: θ ≤ 0.2 is to infer some positive discrepancy from .2.

We, who go further, either via severity assessments or confidence intervals, would give discrepancies that were reasonably warranted, as well as those that were tantamount to making great whales out of little guppies (fallacy of rejection)! Conversely, for any discrepancy of interest, we can tell you how well or poorly warranted it is by the data. (The confidence interval theorist would need to supplement the one-sided lower limit which is, strictly speaking, all she gets from the one-sided test. I put this to one side here.)

But Royall is blocked! He’s got to invoke point alternatives, and then give a comparative likelihood ratio (to a point null). Note, too, that the point-against-point comparison is always required (with a couple of exceptions, maybe) for Royall’s comparative likelihoodist; it’s not just in this example, where he imagines a far-away alternative point of .8. The ordinary significance test is clearly at a great advantage over the point-against-point hypotheses, given that the stated goal here is to probe discrepancies from the null. (See Senn’s point in note [ii] below.)

Not only is the law of likelihood unable to tackle simple one-sided tests, what it allows us to say is rather misleading:

What does it allow us to say? One statement that we can make is that the observations are only weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4). We can also say that they are rather strong evidence supporting θ = 0.5 over any of the values under H1: θ ≤ 0.2 (LR > 89), and at least moderately strong evidence for θ = 0.5 over any value θ > 0.8 (LR > 22). …Thus we can say that the observation of nine successes in 17 trials is rather strong evidence supporting success rates of about 0.5 over the rate 0.2 that is associated with the old treatment, and at least moderately strong evidence for the intermediate rates versus the rates of 0.8 or greater that we were hoping to achieve. (Royall 1997, p. 20, emphasis is mine)
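All three of the ratios Royall quotes follow from the same binomial likelihood, using the cancellation noted earlier (a sketch; the helper name is mine, values rounded):

```python
def lr(t1, t2, s=9, f=8):
    """L(t1)/L(t2) for s successes and f failures; binomial coefficient cancels."""
    return (t1 / t2) ** s * ((1 - t1) / (1 - t2)) ** f

print(round(lr(0.8, 0.2)))  # 4: "only weak evidence" for 0.8 over 0.2
print(round(lr(0.5, 0.2)))  # 89: 0.5 over the old rate 0.2
print(round(lr(0.5, 0.8)))  # 22: 0.5 over the hoped-for 0.8
```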

But this is scarcely “rather strong evidence supporting success rates of about 0.5” over the old treatment. What confidence level would you be using if you inferred that m0 is evidence that θ > 0.5? Approximately .5. (It’s the typical comparative likelihood move of favoring the claim that the population value equals the observed value. *See comments.)

Royall’s “weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4)” fails to convey that there is rather horrible warrant for inferring θ = 0.8, which is associated with something like a 99% error probability! (It’s outside the 2-standard-deviation confidence interval, is it not?)

We significance testers do find strong evidence for discrepancies in excess of .3 (~.97 severity or lower confidence level) and decent evidence of excesses of .4 (~.84 severity or lower confidence level).  And notice that all of these assertions are claims of evidence of positive discrepancies from the null H1: θ ≤ 0.2. In short, at best (if we are generous in our reading, and insist on confidence levels at least .5), Royall is rediscovering what the significance tester automatically says in rejecting the null with the data!
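For readers who want the arithmetic behind those benchmarks: a normal approximation to the binomial gives numbers close to (though not identical with) the severity/confidence levels quoted above. The function name `sev` is my label, not standard output of any package, and the exact binomial figures differ slightly:

```python
from math import erf, sqrt

def sev(theta, m0=9/17, n=17):
    """Approximate severity for inferring theta_true > theta, given observed
    rate m0: Phi((m0 - theta)/SE(theta)), a normal approximation to the binomial."""
    se = sqrt(theta * (1 - theta) / n)       # standard error evaluated at theta
    z = (m0 - theta) / se
    return 0.5 * (1 + erf(z / sqrt(2)))      # standard normal CDF

print(round(sev(0.3), 2))  # 0.98: strong warrant for theta > 0.3
print(round(sev(0.4), 2))  # 0.86: decent warrant for theta > 0.4
print(round(sev(0.5), 2))  # 0.6: weak warrant for theta > 0.5
```

The slight discrepancies from the ~.97, ~.84, and ~.5 quoted in the post reflect the crudeness of the normal approximation with n = 17, not a difference in the reasoning.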

His entire analysis is limited to giving a series of reports as to which parameter values the data are comparatively closer to. As I already argued, I regard such an account as bankrupt as an account of inference. It fails to control probabilities of misleading interpretations of data in general, and precludes comparing the warrant for a single H by two data sets x, y. In this post, my aim is different. It is Royall, and some fellow likelihoodists, who lodge the criticism because we significance testers operate with composite alternatives. My position is that dealing with composite alternatives is crucial, and that we succeed swimmingly, while Royall is barely treading water. He will allow much stronger evidence than is warranted in favor of members of H2. Ironically, an analogous move is advocated by those who read the riot act to P-values for exaggerating evidence against a null! [iv]

Elliott Sober, reporting on the Royall road of likelihoodism, remarks:

The fact that significance tests don’t contrast the null hypothesis with alternatives suffices to show that they do not provide a good rule for rejection. (Sober 2008, 56) 

But there is an alternative; it’s just not limited to a point, the highly artificial case we are rarely involved in testing. Perhaps point alternatives are more common in biology. I will assume here that Elliott Sober is mainly setting out some of Royall’s criticisms for the reader, rather than agreeing with them.

According to the law of likelihood, as Sober observes, whether the data are evidence against the null hypothesis depends on which point alternative hypothesis you consider. Does he really want to say that, so long as you can identify an alternative that is less likely given the data than is the null, the data are “evidence in favor of the null hypothesis, not evidence against it” (Sober, 56)? Is this a good thing? What about all the points in between? The significance test above exhausts the parameter space, as do all N-P tests.[v]


[i] I know because, remember, I’m writing a book that’s close to being done.

[ii] “It would be ludicrous to maintain that [the treatment] cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn 2008, p. 201)

[iii] Note: a rejection at the 2-standard deviation cut-off would be ~M* = .2 + 2(.1) = .4.

[iv] That is, they allow the low P-value to count as evidence for alternatives we would regard as unwarranted. But I’ll come back to that another time.

[v] In this connection, do we really want to say, about a null with teeny tiny likelihood, that there’s evidence for it, so long as there is a rival, miles away, in the other direction? (Do I feel the J-G-L Paradox coming on? Yes! It’s the next topic in Sober p.56)


Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall.

Senn, S. (2007), Statistical Issues in Drug Development, Wiley.

Sober, E. (2008). Evidence and Evolution. CUP.

Categories: law of likelihood, Richard Royall, Statistics


18 thoughts on “How likelihoodists exaggerate evidence from statistical tests”

  1. “It’s the typical likelihood move of inferring that the population value equals the observed value.”

    I imagine I need not even comment on this as Michael Lew is quite capable, but what the heck — I do love the sound of my own voice.

    Likelihoodists do not, to my knowledge, ever assert that the inference that the population value equals the observed value is justified. They might say that the observed value is the best supported possibility, but that’s not the same thing at all.

    • *Corey reminds me to emphasize again that for comparative likelihoodists there is not an inference per se, but merely a comparative claim. Yet I also had in mind many who go further than Royall, such as Hacking, when he was first a likelihoodist.

      Let me emphasize that the point of this post isn’t to repeat my problems with comparative likelihoodists, it’s to take up a criticism of significance tests that Royall and others have promoted, and many others have accepted. Significance testers do actually detach inferences. The issue is similar to what some “reformers” of significance tests recommend–and they too detach inferences.

  2. Michael Lew

    Mayo, I really think that you are mistaken. The likelihood function shows you the relative support provided by the data for any parameter values. How is that a bad thing?

    You seem to be thinking in terms of all or none responses to the evidence in a manner that does not utilise the graded support for various hypothesised parameter values that is displayed in a likelihood function. For example you write this: “According to the law of likelihood, whether the data are evidence against the null hypothesis depends on which point alternative hypothesis you consider. So long as you can identify an alternative that is less likely given the data than is the null, then the data are “evidence in favor of the null hypothesis, not evidence against it” (Sober, 56). Is this a good thing? What about all the points in between?”

    The answer to your questions are: yes, it is a good thing to characterise the evidential favouring of the data; and all of the points in between are included on the likelihood function and are therefore available for your consideration by simple inspection.

    I know that you meant those questions as rhetorical flourish, but the actual answers show that your complaints about the law of likelihood lack foundation. If the data favour the null hypothesised parameter value only weakly, then you should respond to that weak favouring in your inference. (Possibly by deferring inference until more evidence is available, or possibly by making a qualified inference that is subject to future updating.) If the data favour a parameter value different from the null, but only by a little bit, then you can respond to that in your inference.

    There is nothing in the law of likelihood that forces anyone to make silly inferences.

    • Michael: In this post I’m really only considering the criticisms Royall raises against significance tests. My comments will also strive to keep to that (plus we’ve been through the other business umpteen times.) Given his goals, this is all hunky dory! But notice where he lands up—at the very best, he must rediscover something like rejecting the null in favor of various points in the alternative space. This is no surprise, given for example likelihood ratio tests developed by Neyman and Pearson. But he is the one declaring they are giving inadequate tests.

    • john byrd

      Michael: Do p-values violate the LL?

      • Michael Lew

        John, it depends on what you mean by that. You can violate the LL using P-values and you can comply with the LL using P-values. P-values relate to the law of likelihood in roughly the same way as your shoes are related to laws about jaywalking.

        The conventional accounts of how P-values violate the LL involve two ideas. 1: The idea that because P-values are tail areas they involve unobserved results, and those unobserved results are irrelevant to evidence (according to accounts of the likelihood principle that I think are mistaken). 2: The idea that, because one set of data from a single experiment imply only one likelihood function (for a particular parameter of interest and within a particular statistical model) but can simultaneously yield several different P-values depending on how the experimenter’s intentions are accounted for and how the analysis is conditioned on the experimental design, P-values cannot be valid expressions of the evidence according to the LL. I discuss both of those ideas at some length in my multiply rejected paper ( Neither idea is sufficient to think that there is a fundamental conflict between P-values and the law of likelihood.

        However, the use of P-values generally does violate the law of likelihood if they are being used as a numerical measure of the strength of evidence. P-values relate to the evidence by way of the likelihood function that they point to or imply.

        • john byrd

          Michael: I understand your points. I think one source of confusion is how likelihoodists define evidence versus how error statisticians do.

          Back to Royall’s example above, is he not just trying to re-invent the confidence interval? Or, should he not just use a CI?

          • John: There is only comparative support, at least for Royall (Hacking spoke of rejection). Remember a confidence interval would enable you to make an inference about a single hypothesized parameter. Royall says this is not possible, or that no respectable account can possibly achieve this. You can only compare how much support x gives to H as opposed to J. I guess he can pretend to get a kind of likelihood confidence interval while at the same time vilifying significance tests and confidence intervals as irrelevant for evidence. Nor can he deal with “nuisance” parameters. I realize there are a lot of generalized accounts based on likelihoods that are not Royall’s comparative account, but his has really had an impact on philosophers. These other accounts are kind of striving to normalize likelihoods enough to serve as posteriors, or something. But many of them are error statistical whereas he has an overly simplistic position that leads him to reject all such appeals (as relevant for action and not evidence).

            But, once again, my focus here is on this type of a criticism of significance tests. Why demand that rejecting theta = .2 in favor of theta > .2 be precluded for his reasons: “every parameter point within H2 is more likely than every point in H1” (quoting my post). Is this a quantifier error?

            • Michael Lew

              Mayo, your case against Royall’s presentation would be greatly strengthened if you were to present the relevant severity curve for the data along with the relevant likelihood function.

          • Michael Lew

            John, definitions and semantics do seem to be an impediment to getting Mayo and me on the same page. As far as I can see the frequentists do not have a technical definition of the word ‘evidence’, and always use it in a natural language sense. Mayo may complain about this, but Neyman & Pearson (1933) explicitly say that they have devised their approach to statistics on the assumption that no particular experiment can provide valuable evidence about the truth of a particular hypothesis. That sounds very much like an assertion that ‘evidence’ does not have a technical definition within their framework.

            Together, the likelihood principle and law of likelihood provide a technical definition of evidence in a restricted sense. The restrictions are that it is only evidence relevant to a particular parameter within a fully specified statistical model, and it has to be noted that alternative parameterisations and models may be possible. It also should be borne in mind that the likelihood principle does not imply that one has to make inferences only on the basis of the evidence, a feature that matches well my own understanding of the role of ‘evidence’ of the natural language variety into everyday decisions.

            • Michael: You are badly misconstruing the meaning of that infamous quote from a very early paper by N-P, taken out of the context in which they use tests evidentially: I have explained this at length elsewhere. What’s the point in working on a blog to go BEYOND the age-old, superficial misunderstandings, if readers insist on repeating the age-old, superficial misunderstandings? I should stop and just concentrate on finishing my book. (I’m not saying this always happens.) I do give a specific, clear definition of evidence in terms of passing a severe test, and N-P error probabilities can (though they won’t always) serve to control and assess the severity with which various claims have passed. If people actually read N-P applications, their papers, books or even the responses to Fisher posted often on this blog, rather than merely recite one or two well-beaten phrases out of context, they might have understood the error statistical account of evidence long ago.

              I will repeat that “my case” in this post is not about Royall’s “presentation” but about his criticism of significance tests. You can add a few more confidence limits to the ones given here if you want a severity curve–I’ve shown a few benchmarks, which is all the severity theorist needs.

              • Michael Lew

                Mayo, I knew that you would be cross with me for raising that quote, but it seems to me that it is central to answering John’s question. I also knew that you have explained why you think that I misconstrue its meaning, but I disagree with you. I’ve studied your reasoning but I don’t accept it. I have read your papers, books and many of Neyman’s papers.

                I’ve read your blog and I’ve spent a lot of time thinking about these issues. You are mistaken if you think that I am some naive rube who has not done his homework. It is simply the case that I disagree with some of what you write. So far, you have failed to fully persuade me of your case, just as I have failed to persuade you.

                Severity curves provide an appropriate framework of evidence and severity is what you write about, but severity curves are not used by anyone else. Thus, even though you have a valid framework for evidence, Neyman & Pearson did not. Neither do the majority of frequentist statisticians who use accept/reject and not severity curves.

                Provide a severity curve and a likelihood function for each problem that you want to consider from the viewpoint of evidence.

                • Michael: Not cross with you. I appreciate your efforts to take these issues seriously. Just was rushing when I should not have even been on the blog!

            • john byrd

              Michael: I believe that “evidence” has a broader use by scientists, physicians, jurors, etc. making the narrow likelihoodist use a source of confusion for everyone else.
              Maybe a new term would be an improvement.

  3. Michael Lew

    John, evidence is not a simple substance that can be weighed in one dimension, so I think we simply need to pay attention to a few adjectival modifications of evidence. Likelihoods and severities indicate something to do with the strength of evidence; changes in error rates consequent to protocol features like P-hacking indicate something to do with the reliability of the evidence. The evidence points very precisely to a parameter value if the likelihood function is narrow or the severity curve is steep. The evidence is firm if the likelihood function or the severity curve would not be much changed by extra data.

    • john byrd

      Maybe to the likelihoodist you can separate evidence from reliability of evidence in your thinking about research results, but to me unreliable evidence is not evidence at all. What scientist would even call unreliable information evidence? Few, if any. This mess should be cleaned up, just as the LL needs a new name (not a Law). It appears to me these problems are a real source of confusion.

  4. newguest

    For the first time I understand why Royall opposes compound hypotheses, and it’s befuddling, because who ever thinks evidence favoring a positive difference from the null would mean ALL points in the alternative are more likely than the null! At the end of the day, all we get here is an ordering of the parameters according to point likelihoods given the data, and associated comparisons. No inference. Revelation. I get what the charge of “bankruptcy” is all about in a previous entry.

    • newguest: It is goofy and I’m surprised it hasn’t been outed earlier–I’m not saying all likelihoodists are Royallists. By the way, I’ve tried to be in touch with Royall to invite a response. Even before he retired from Hopkins he had a very elaborate, circuitous procedure put in place to make it difficult to send e-mails. Now he doesn’t seem to have an e-mail address.
