For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood, together with the related Likelihood Principle, is a big one.
Law of Likelihood (LL): Data x are better evidence for hypothesis H1 than for H0 if x is more probable under H1 than under H0: Pr(x; H1) > Pr(x; H0), that is, the likelihood ratio LR of H1 over H0 exceeds 1.
H0 and H1 are statistical hypotheses that assign probabilities to the values of the random variable X. A fixed value of X is written x0, but we often want to generalize about this value, in which case, following others, I use x. The likelihood of the hypothesis H, given data x, is the probability of observing x, under the assumption that H is true or adequate in some sense. Typically, the ratio of the likelihood of H1 over H0 also supplies the quantitative measure of comparative support. Note that when X is continuous, the probability is assigned to a small interval around x to avoid probability 0.
Does the Law of Likelihood Obey the Minimal Requirement for Severity?
Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. There are two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (x)–all winners. A hypothesis H to explain this is that their method always succeeds in picking winners. H entails x, so the likelihood of H given x is 1. Yet we wouldn’t say H is therefore highly probable, especially without reason to rule out that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.
Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as x0 = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis H0: θ = 0.5, given x0, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ; x0), because it’s always computed given data x0; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form Lik(θ) = θ^s(1 − θ)^f, 0 < θ < 1, where s is the number of successes and f the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or to any number in particular. Likelihoods do not obey the probability calculus.
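The arithmetic above is easy to check. Here is a minimal sketch in Python (the function name `lik` and the grid of θ values are my own illustration, not the book’s):

```python
# Hypothetical helper (not from SIST): Bernoulli likelihood for s successes
# and f failures in independent trials with success probability theta.
def lik(theta, s, f):
    # Lik(theta) = theta^s * (1 - theta)^f
    return theta ** s * (1 - theta) ** f

# Bristol-Roach's data x0 = <1,1,0>: s = 2 successes, f = 1 failure
print(lik(0.5, 2, 1))   # 1/8 = 0.125
print(lik(0.2, 2, 1))   # approximately 0.032

# Likelihoods over many values of theta need not sum to 1:
grid = [i / 100 for i in range(1, 100)]
print(sum(lik(t, 2, 1) for t in grid))  # roughly 8.3, nowhere near 1
```

The last line makes the closing point concrete: summing Lik(θ) over a grid of θ values gives a number with no probabilistic meaning, since likelihoods over rival hypotheses do not obey the probability calculus.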
The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis H0.
Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of x maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said that in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis H1 much better “supported” than H0 even when H0 is true. As George Barnard puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).
Note that for any outcome of n Bernoulli trials, the likelihood of H0: θ = 0.5 is (0.5)^n, so is quite small. The likelihood ratio (LR) of a best-supported alternative compared to H0 would be quite high. Since one could always erect such an alternative,
(*) Pr(LR in favor of H1 over H0; H0) = maximal.
Thus the LL permits BENT evidence. The severity for H1 is minimal, though the particular H1 is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses Gellerized, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes H0 maximally likely, we can find an H1 that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution. It’s not a very apt term, but nothing has arisen to replace it.

For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.
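A rough simulation makes (*) concrete (a sketch of my own, not from SIST). Let H1 be the post-designated Gellerized rival that sets θ = 1 on each observed success and θ = 0 on each observed failure, so the observed sequence has probability 1 under H1. Under H0: θ = 0.5, every sequence of n trials has probability (0.5)^n, so the LR in favor of the hunted-up H1 is 2^n on every repetition, even though H0 generated the data:

```python
import random

def gellerized_lr(outcomes):
    # Post-designated rival H1: theta_i = 1 on each observed success,
    # 0 on each observed failure, so Pr(outcomes; H1) = 1.
    lik_h1 = 1.0
    # Under H0: theta = 0.5, every sequence of n trials has probability (0.5)^n.
    lik_h0 = 0.5 ** len(outcomes)
    return lik_h1 / lik_h0

random.seed(1)
n, reps = 10, 1000
lrs = []
for _ in range(reps):
    x = [random.randint(0, 1) for _ in range(n)]  # data generated under H0
    lrs.append(gellerized_lr(x))

# The rule favors some H1 over H0 maximally on every repetition,
# even though H0 is true: Pr(LR in favor of H1 over H0; H0) = 1.
print(all(lr == 2 ** n for lr in lrs))  # True
```

Looking at the rule’s behavior over the sample space, rather than at the one observed likelihood comparison, is exactly the move the Likelihoodist declines to make.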
To continue reading Excursion 1 Tour II, go here.
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.
Jan 10, 2019 Excerpt from SIST is here, Jan 27 is here, and Feb 23 here.
Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.
March 5, 2019 Blurbs of all 16 Tours can be found here.
Your ‘kinkera’, where the probability of outcomes varies between trials, is not available among the parameter values within the statistical model for Bernoulli trials that you describe in the paragraph starting with “Suppose Bristol-Roach”. That means that a kinkera is not among the ‘hypotheses’ that a likelihoodist is free to designate as a rival to the parameter values of theta. Therefore your criticism is at least incomplete, and probably false, as I have argued on several occasions on this blog.
I have an arXiv paper that explores the issue in full: https://arxiv.org/abs/1507.08394