Have you ever noticed that some leading advocates of a statistical account, say a testing account A, upon discovering account A is unable to handle a certain kind of important testing problem that a rival testing account, account B, has no trouble at all with, will mount an argument that being able to handle that kind of problem is actually a bad thing? In fact, they might argue that testing account B is not a “real” testing account because it can handle such a problem? You have? Sure you have, if you read this blog. But that’s only a subliminal point of this post.
I’ve had three posts recently on the Law of Likelihood (LL): Breaking the [LL](a)(b), [c], and [LL] is bankrupt. Please read at least one of them for background. All deal with Royall’s comparative likelihoodist account, which some will say only a few people even use, but I promise you that these same points come up again and again in foundational criticisms from entirely other quarters.[i]
[M]edical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope θ is considerably greater, perhaps 0.8 or even greater. To obtain evidence about θ, they carry out a study in which the new treatment is given to 17 subjects, and find that it is successful in nine.
Let me interject at this point that of all of Stephen Senn’s posts on this blog, my favorite is the one where he zeroes in on the proper way to think about the discrepancy we hope to find (the .8 in this example). (See note [ii])
A standard statistical analysis of their observations would use a Bernouilli (θ) statistical model and test the composite hypotheses H1: θ ≤ 0.2 versus H2: θ > 0.2. That analysis would show that H1 can be rejected in favor of H2 at any significance level greater than 0.003, a result that is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1. (Royall, ibid.)
Following Royall’s numbers, the observed success rate is:
m0 = 9/17 = .53, exceeding H1: θ ≤ 0.2 by ~3 standard deviations, as σ / √17 ~ 0.1, yielding significance level ~.003.
So, the observed success rate m0 = .53, “is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1.” (ibid. p. 20) [For a link to an article by Royall, see the references.]
And indeed it is altogether warranted to regard the data as very strong evidence that θ > 0.2—which is precisely what H2 asserts (not fussing with his rather small sample size). In fact, m0 warrants inferring even larger discrepancies, but let’s first see where Royall has stopped in his tracks.[iii]
Royall claims he is unable to allow that m0 = .53 is evidence against the null in the one sided-test we are considering: H1: θ ≤ 0.2 versus H2: θ > 0.2.
He tells us why in the next paragraph (ibid., p. 20):
But because H1 contains some simple hypotheses that are better supported than some hypotheses in H2 (e.g.,θ = 0.2 is better supported than θ= 0.9 by a likelihood ratio of LR = (0.2/0.9)9(0.8/0.1)8 = 22.2), the law of likelihood does not allow the characterization of these observations as strong evidence for H2 over H1. (my emphasis; note I didn’t check his numbers since they hardly matter.)
It appears that Royall views rejecting H1: θ ≤ 0.2 and inferring H2: θ > 0.2 as asserting every parameter point within H2 is more likely than every point in H1! (That strikes me as a highly idiosyncratic meaning.) Whereas, the significance tester just takes it to mean what it says:
to reject H1: θ ≤ 0.2 is to infer some positive discrepancy from .2.
We, who go further, either via severity assessments or confidence intervals, would give discrepancies that were reasonably warranted, as well as those that were tantamount to making great whales out of little guppies (fallacy of rejection)! Conversely, for any discrepancy of interest, we can tell you how well or poorly warranted it is by the data. (The confidence interval theorist would need to supplement the one-sided lower limit which is, strictly speaking, all she gets from the one-sided test. I put this to one side here.)
But Royall is blocked! He’s got to invoke point alternatives, and then give a comparative likelihood ratio (to a point null). Note, too, the point against point requirement is always required (with a couple of exceptions, maybe) for Royall’s comparative likelihoodist; it’s not just in this example where he imagines a far away alternative point of .8. The ordinary significance test is clearly at a great advantage over the point against point hypotheses, given the stated goal here is to probe discrepancies from the null. (See Senn’s point in note [ii] below.)
Not only is the law of likelihood unable to tackle simple one-sided tests, what it allows us to say is rather misleading:
What does it allow us to say? One statement that we can make is that the observations are only weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4). We can also say that they are rather strong evidence supporting θ = 0.5 over any of the values under H1: θ ≤ 0.2 (LR > 89), and at least moderately strong evidence for θ = 0.5 over any value θ > 0.8 (LR) > 22). …Thus we can say that the observation of nine successes in 17 trials is rather strong evidence supporting success rates of about 0.5 over the rate 0.2 that is associated with the old treatment, and at least moderately strong evidence for the intermediate rates versus the rates of 0.8 or greater that we were hoping to achieve. (Royall 1997, p. 20, emphasis is mine)
But this is scarcely “rather strong evidence supporting success rates of about 0.5” over the old treatment. What confidence level would you be using if you inferred m0 is evidence that θ > 0.5? Approximately .5. (It’s the typical comparative likelihood move of favoring the claim that the population value equals the observed value. (*See comments.)
Royall”s “weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4)” fails to convey that there is rather horrible warrant for inferring θ = 0.8–associated with something like 99% error probability! (It’s outside the 2-standard deviation confidence interval, is it not?)
We significance testers do find strong evidence for discrepancies in excess of .3 (~.97 severity or lower confidence level) and decent evidence of excesses of .4 (~.84 severity or lower confidence level). And notice that all of these assertions are claims of evidence of positive discrepancies from the null H1: θ ≤ 0.2. In short, at best (if we are generous in our reading, and insist on confidence levels at least .5), Royall is rediscovering what the significance tester automatically says in rejecting the null with the data!
His entire analysis is limited to giving a series of reports as to which parameter values the data are comparatively closer to. As I already argued, I regard such an account as bankrupt as an account of inference. It fails to control probabilities of misleading interpretations of data in general, and precludes comparing the warrant for a single H by two data sets x, y. In this post, my aim is different. It is Royall, and some fellow likelihoodists, who lodge the criticism because we significance testers operate with composite alternatives. My position is that dealing with composite alternatives is crucial, and that we succeed swimmingly, while Royall is barely treading water. He will allow much stronger evidence than is warranted in favor of members of H2. Ironically, an analogous move is advocated by those who raise the riot act against P-values for exaggerating evidence against a null! [iv]
Elliott Sober, reporting on the Royall road of likelihoodism, remarks:
The fact that significance tests don’t contrast the null hypothesis with alternatives suffices to show that they do not provide a good rule for rejection. (Sober 2008, 56)
But there is an alternative, it’s just not limited to a point, the highly artificial case we rarely are involved in testing. Perhaps they are more common in biology. I will assume here that Elliott Sober is mainly setting out some of Royall’s criticisms for the reader, rather than agreeing with them.
According to the law of likelihood, as Sober observes, whether the data are evidence against the null hypothesis depends on which point alternative hypothesis you consider. Does he really want to say that so long as you can identify an alternative that is less likely given the data than is the null, then the data are “evidence in favor of the null hypothesis, not evidence against it” (Sober, 56). Is this a good thing? What about all the points in between? The significance test above exhausts the parameter space, as do all N-P tests.[v]
[i] I know because, remember, I’m writing a book that’s close to being done.
[ii] “It would be ludicrous to maintain that [the treatment] cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn 2008, p. 201)
[iii] Note: a rejection at the 2-standard deviation cut-off would be ~M* = .2 + 2(.1) = .4.
[iv] That is, they allow the low P-value to count as evidence for alternatives we would regard as unwarranted. But I’ll come back to that another time.
[v] In this connection, do we really want to say, about a null with teeny tiny likelihood, that there’s evidence for it, so long as there is a rival, miles away, in the other direction? (Do I feel the J-G-L Paradox coming on? Yes! It’s the next topic in Sober p.56)
Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.
Royall, R.(1997) Statistical Evidence, A Likelihood Paradigm. Chapman and Hall.
Senn, S. (2007), Statistical Issues in Drug Development, Wiley.
Sober, E. (2008). Evidence and Evolution. CUP.