1. An Assumed Law of Statistical Evidence (law of likelihood)
Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:
Data x are better evidence for hypothesis H1 than for H0 if x are more probable under H1 than under H0.
Ian Hacking (1965) called this the logic of support: x supports hypothesis H1 more than H0 if H1 is more likely, given x, than is H0:
Pr(x; H1) > Pr(x; H0).
[With likelihoods, the data x are fixed, the hypotheses vary.]*
x is evidence for H1 over H0 if the likelihood ratio LR(H1 over H0) is greater than 1.
It is given in other ways besides, but it’s the same general idea. (Some will take the LR as actually quantifying the support, others leave it qualitative.)
In terms of rejection:
“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)
2. Barnard (British Journal for the Philosophy of Science)
But this “law” will immediately be seen to fail on our minimal severity requirement. Hunting for an impressive fit, or trying and trying again, it’s easy to find a rival hypothesis H1 much better “supported” than H0 even when H0 is true. Or, as Barnard (1972) puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129). H0: the coin is fair, gets a small likelihood (.5)^k given k tosses of a coin, while H1: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”. Or, to use an example with P-values: a statistically significant difference, being improbable under the null H0, will afford high likelihood to any number of explanations that fit the data well.
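The coin-toss comparison above can be computed directly. A minimal sketch (the value k = 10 is an illustrative choice, not from the post): under the fair coin H0, any particular sequence of k tosses has probability (.5)^k, while Barnard’s tailored rival assigns the observed sequence probability 1.

```python
# Hedged sketch: k coin tosses, likelihoods computed for the
# observed sequence of outcomes (k = 10 chosen for illustration).
k = 10

# H0: the coin is fair -- each particular sequence of k tosses
# has probability (0.5)^k.
lik_H0 = 0.5 ** k

# Barnard's rival H1: "things just had to turn out as they did" --
# a hypothesis tailored to give probability 1 to each observed toss.
lik_H1 = 1.0

lr = lik_H1 / lik_H0
print(f"L(H0) = {lik_H0:.6f}, L(H1) = {lik_H1}, LR = {lr:.0f}")
# With k = 10, LR = 1024: the tailored H1 is far "better supported"
# even though H0 is true by construction.
```

No matter how the tosses come out, the tailored rival wins by a factor of 2^k; nothing in the likelihoods alone signals that the rival was constructed after seeing the data.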
3. Breaking the law (of likelihood) by going to the “second,” error statistical level:
How does it fail our severity requirement? First look at what the frequentist error statistician must always do to critique an inference: she must consider the capability of the inference method that purports to provide evidence for a claim. She goes to a higher level or metalevel, as it were. In this case, the likelihood ratio plays the role of the needed statistic d(X). To put it informally, she asks:
What’s the probability the method would yield an LR disfavoring H0 compared to some alternative H1 even if H0 is true?
What’s the probability of so small a likelihood for H0 compared to H1, even if H0 adequately describes the data generating procedure? As Pearson and Neyman put it:
“[I]n order to fix a limit between ‘small’ and ‘large’ values of LR we must know how often such values appear when we deal with a true hypothesis. That is to say we must have knowledge of the chance of obtaining [so small a likelihood ratio] in the case where the hypothesis tested [H0 ] is true” (Pearson and Neyman 1930, 106).
Looking at “how often such values appear” of course turns on the sampling distribution of the LR viewed as a statistic. That’s why frequentist error statistical accounts are called sampling theory accounts. This requires considering other values that could have occurred, not just the one you got.
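The Pearson–Neyman question is answerable by simulation. A hedged illustration (the point hypotheses H0: X ~ N(0,1) vs H1: X ~ N(1,1) and the sample size are my choices, not from the post): generate data under H0 and see how often the LR statistic nonetheless favors H1.

```python
import math
import random

random.seed(1)

# Hedged illustration: H0: X ~ N(0,1) vs H1: X ~ N(1,1), n = 10.
# We ask, with Pearson and Neyman, how often an LR disfavoring H0
# appears "when we deal with a true hypothesis".
n, reps, cutoff = 10, 100_000, 1.0   # cutoff: LR(H1/H0) > 1 favors H1

count = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]   # data generated under H0
    # For these two unit-variance normals the log LR simplifies
    # to sum(x) - 0.5*n, i.e., n*(xbar - 0.5).
    log_lr = sum(x) - 0.5 * n
    if math.exp(log_lr) > cutoff:
        count += 1

print(f"P(LR favors H1 | H0 true) ~= {count / reps:.3f}")
```

The simulated frequency (a bit under 6% here) is exactly the kind of information the sampling distribution supplies: it fixes the line between “small” and “large” values of the LR.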
But this breaks the law of likelihood and so is taboo for the likelihoodist! (Likewise for anyone holding the Likelihood Principle[i].)
Viewing the sampling distribution as taboo (once the data are given) is puzzling in the extreme[ii]. How can it be desirable to block out information about how the data were generated and the hypotheses specified? I fail to see how anyone can evaluate an inference from data x to a claim C without learning about the capabilities of the method, through the relevant sampling distribution. Readers of this blog know my favorite example to demonstrate the lack of error control if you look only at likelihoods: the case of optional stopping. (Keep sampling until you get a nominal P-value of .05 against a zero null hypothesis in two-sided Normal testing of the mean. You can be wrong with maximal probability.)
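The optional stopping gambit is easy to simulate. A minimal sketch (the cap of 1,000 draws and 500 replications are my illustrative choices): draw standard-normal observations with true mean 0 and stop the moment the nominal two-sided z-test hits p < .05.

```python
import math
import random

random.seed(2)

# Hedged sketch of optional stopping: sample N(0,1) observations
# (so the null mu = 0 is true) and stop as soon as the nominal
# two-sided z-test gives p < .05, up to a cap of n_max draws.
def stops_with_nominal_significance(n_max=1000):
    total, n = 0.0, 0
    while n < n_max:
        total += random.gauss(0, 1)
        n += 1
        z = abs(total) / math.sqrt(n)   # test statistic for H0: mu = 0
        if z > 1.96:                    # nominal two-sided p < .05
            return True
    return False

reps = 500
hits = sum(stops_with_nominal_significance() for _ in range(reps))
print(f"Proportion reaching nominal p < .05 under the null: {hits / reps:.2f}")
```

The proportion comes out far above the nominal .05, and it climbs toward 1 as the cap grows: the actual error probability is not controlled if the stopping rule is ignored, yet the likelihoods at the stopping point are unchanged by it.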
Just such examples, where the alternative is not a point value, led Barnard to abandon (or greatly restrict) the Likelihood Principle. Interestingly, in raising these criticisms of likelihood, Barnard is reviewing Ian Hacking’s 1965 book: The Logic of Statistical Inference. Only thing is, by the time of this 1972 review, Hacking had given it up as well! In fact, in the pages immediately following Barnard’s review of Hacking, Hacking reviews A.F. Edwards’ book Likelihood (1972), wherein he explains why he’s thrown his own likelihood rule of support overboard.
4. Hacking (also BJPS)
A classic case is the normal distribution and a single observation. Reluctantly we will grant Edwards that the observation x is the best supported estimate of the unknown mean. But the hypothesis about the variance, with highest likelihood, is the assumption that there is no variance, which strikes us as monstrous. … we must concede that as prior information we take for granted the variance is at least w. But even this will not do, for the best supported view on the variance is then that it is exactly w.
For a less artificial example, take the ‘tram-car’ or ‘tank’ problem. We capture enemy tanks at random and note the serial numbers on their engines. We know the serial numbers start at 0001. We capture a tank number 2176. How many tanks did the enemy make? On the likelihood analysis, the best supported guess is: 2176. Now one can defend this remarkable result by saying that it does not follow that we should estimate the actual number as 2176, only that comparing individual numbers, 2176 is better supported than any larger figure. My worry is deeper. Let us compare the relative likelihood of the two hypotheses, 2176 and 3000. Now pass to a situation where we are measuring, say, widths of a grating, in which error has a normal distribution with known variance; we can devise data and a pair of hypotheses about the mean which will have the same log-likelihood ratio. I have no inclination to say that the relative support in the tank case is ‘exactly the same as’ that in the normal distribution case, even though the likelihood ratios are the same. Hence even on those increasingly rare days when I will rank hypotheses in order of their likelihoods, I cannot take the actual log-likelihood number as an objective measure of anything. (Hacking 1972, 136-137).
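Hacking’s tank calculation can be checked in a few lines. A minimal sketch (the search bound of 10,000 is an arbitrary illustrative cap): with serial numbers uniform on 1..N and one tank observed with serial number m = 2176, the likelihood of N is 1/N for N ≥ m and 0 otherwise, so it is maximized at N = m.

```python
# Hedged sketch of the tank problem: serial numbers uniform on 1..N,
# one observed serial number m = 2176. Likelihood of N is 1/N for
# N >= m (the observation is one of N equally likely values), else 0.
m = 2176

def likelihood(N, m=m):
    return 1.0 / N if N >= m else 0.0

# The likelihood is strictly decreasing for N >= m, so the maximum
# is at N = m exactly (search capped at 10,000 for illustration).
best = max(range(m, 10_000), key=likelihood)
print(f"Maximum-likelihood estimate: {best}")   # 2176

lr = likelihood(2176) / likelihood(3000)
print(f"LR(2176 vs 3000) = {lr:.3f}")           # 3000/2176 ~= 1.379
```

The “remarkable result” is visible at once: 2176 is the best supported value, but only mildly better supported than 3000 (a ratio of about 1.38), the same ratio one could reproduce with suitable data in Hacking’s normal-measurement setup.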
Hacking appears even more concerned with the fact that likelihood ratios do not enjoy a stable evidential meaning or calibration than with the lack of error control in likelihoodist accounts. But Hacking was still assuming the latter must be cashed out in terms of long run error performance[iii] as opposed to stringency of test.
I say: a method that makes it easy to declare evidence against hypotheses erroneously gives an unwarranted inference each time; methods that fail to directly pick up on optional stopping, data dredging, cherry picking, multiple testing, or any of the other gambits that alter the capabilities of tests to avoid mistaken inferences are poor methods, but not because of their behavior in the long run. They license unwarranted or questionable inferences in each and every application. This is so, I aver, even if we happen to know, through other means, that their inferred claim C is correct.
5. Three ways likelihoods arise in inference. Aug. 31 note at end of para.
Likelihoods are fundamental in all statistical inference accounts. One might separate how they arise into three groups (acknowledging divisions within each):
(1) likelihoods only (pure likelihoodist)
(2) likelihoods + priors (Bayesian)
(3) likelihoods + error probabilities based on sampling distributions (error statistics, sampling theory)
Only the error statistician (3) requires breaking the likelihood law. [See note.] You can feed us fit measures from (1) and (2), and we will do the same thing: ask about the probability of so good (or poor) a fit between data and some claim C, even if C is false (true). The answer will be based on the sampling distribution of the relevant statistic, computed under the falsity of C (or discrepancies from what C asserts).[iv]
Aug 31 note:
If someone wanted to describe the addition of the priors under rubric (2) as tantamount to “breaking the likelihood law”, as opposed to merely requiring it to be supplemented, nothing whatever changes in the point of this post. (It would seem to introduce idiosyncrasies in the usual formulation, but these are not germane to my post.) My sentence, in fact, might well have been “Only the error statistician (3) requires breaking the likelihood law and the likelihood principle (by dint of requiring considerations of the sampling distribution to obtain the evidential import of the data).”
Installment (B): an ad hoc clarificatory note, prompted by comments from an anonymous fan
6. Of tests and comparative support measures
The statements of “the” law of likelihood, and likelihood support logics, are not all precisely identical. Some accounts are qualitative, merely indicating prima facie increased support; others will devise quantitative measures of support based on likelihoods. (There are at least 10 of them we covered in our recent seminar, maybe more.) Some will try out corresponding “tests”, others not. One needn’t have anything like a test or a “rejection rule” to be a likelihoodist. I mentioned the construal in terms of tests because it is in the sentence just before the one I quote from Barnard, and I wanted to be true to what he had just said about Hacking’s 1965 book.
Remember the topic of my post concerns criticisms of error statistical methods, and a principle (or “law”) of evidence used in those criticisms. (If you reject that principle, then presumably you wouldn’t use it to criticize error statistical methods, so we have no disagreement on this.) A clear rationale for connecting tests of hypotheses—be they Fisherian or N-P style—and logics of likelihood is to mount criticisms: to explain what’s wrong with those (Fisherian or N-P) tests, and how they may be cured of their problems.
Hacking lays out an impressive argument that all that is sensible in N-P likelihood ratio tests is captured by his conception of likelihood tests (the one he advanced back in 1965), while all the (apparently) counterintuitive parts are jettisoned. Now that I’ve access to my NYC library, I can quote the portion to which Barnard is alluding in his review of Hacking.
“Our theory of support leads directly to the theory of testing suggested in the last chapter [VI]. An hypothesis should be rejected if and only if there is some rival hypothesis much better supported than it is. Support has already been analysed in terms of ratios of likelihoods. But what shall serve as ‘much better supported’? For the present I leave this in abeyance, and speak merely of tests of different stringency. With each test will be associated a critical ratio. The greater the critical ratio, the more stringent the test. Roughly speaking hypothesis h will be rejected in favour of rival i at critical level alpha, just if the likelihood ratio of i to h exceeds alpha.” (Hacking 1965 p.89)
I don’t want to pursue this discussion of Hacking here. To repeat, my post concerns criticisms of error statistical methods. A foundational critique of a method of inference depends on holding another view or principle or method of inference. This post is an offshoot of the recent posts here and here (7/14/14 and 8/17/14).
Critiques in those posts are based on assuming that it is fair, reasonable, obvious or what have you, to criticize the way p-values arise in inference by means of a different view of inference. (I allude here to genuine or “audited” p-values, not mere nominal or computed p-values.) The p-value, it is reasoned, should be close to either a posterior probability (in the null hypothesis) or a likelihood ratio (or Bayes ratio). Ways to “fix” p-values are proposed to get them closer to these other measures. I don’t think there was anything controversial about this being the basic goal, not just of the particular papers we looked at, but mountains of papers that have been written and are being written this very moment.
I may continue with my intended follow-up (Part C).
*Note: I am not sure whether the powers that be are allowing us to say “data x is” nowadays; I read something about this, maybe it was by Pinker. Can somebody please ask Steven Pinker for me? Thanks.
[i] Please search this blog for quite a lot on the likelihood principle and the strong likelihood principle.
[ii] I would say this even if we knew the model was adequate. Likelihood principlers may regard using the sampling distribution to test the model as legitimate.
[iii]Perhaps he still is, I don’t mean to saddle him with my testing construal of error probabilities at all. (Some hints of a shift exists in his 1980 article in the Braithwaite volume.)
[iv] This delineation comes from Cox and Hinkley, but I don’t have it here.
Barnard, G. (1972). Review of ‘The Logic of Statistical Inference’ by I. Hacking, Brit. J. Phil.Sci., 23(2): 123-132.
Hacking, I. (1965). Logic of Statistical Inference. Cambridge: CUP.
Hacking, I. (1972). “Review of Likelihood. An Account of the Statistical Concept of Likelihood and Its Application to Scientific Inference by A. F. Edwards,” Brit. J. Phil.Sci., 23(2): 132-137.
Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite.” In D. H. Mellor (ed.), Science, belief and behavior: Essays in honor of R.B. Braithwaite. 141-160. Cambridge: CUP.
Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol. Sci., 73-96.