In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).
All of Excursion 1 Tour II is here. After this post, I’ll resume regular blogging for a while, so you can catch up to us. Several free (signed) copies of SIST will be given away on Twitter shortly.
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.
Law of Likelihood (LL): Data x are better evidence for hypothesis H1 than for H0 if x is more probable under H1 than under H0: Pr(x; H1) > Pr(x; H0); that is, the likelihood ratio (LR) of H1 over H0 exceeds 1.
H0 and H1 are statistical hypotheses that assign probabilities to the values of the random variable X. A fixed value of X is written x0, but we often want to generalize about this value, in which case, following others, I use x. The likelihood of the hypothesis H, given data x, is the probability of observing x, under the assumption that H is true or adequate in some sense. Typically, the ratio of the likelihood of H1 over H0 also supplies the quantitative measure of comparative support. Note that when X is continuous, the probability is assigned over a small interval around the observed x to avoid probability 0.
Does the Law of Likelihood Obey the Minimal Requirement for Severity?
Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. There are two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (x)–all winners. A hypothesis H to explain this is that their method always succeeds in picking winners. H entails x, so the likelihood of H given x is 1. Yet we wouldn’t say H is therefore highly probable, especially without reason to put to rest the possibility that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.
Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as x0 = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis H0: θ = 0.5, given x0, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ; x0), because it’s always computed given data x0; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form: Lik(θ) = θ^s(1 − θ)^f, 0 < θ < 1, where s is the number of successes and f the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or to any number in particular. Likelihoods do not obey the probability calculus.
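The two computations above, and the point that likelihoods need not sum or integrate to 1, can be checked with a few lines of code. This is a minimal sketch of my own; the function name `lik` and the crude Riemann sum are illustrative choices, not anything from the book.

```python
def lik(theta, successes, failures):
    """Likelihood of Bernoulli parameter theta given s successes and f failures:
    Lik(theta) = theta^s * (1 - theta)^f."""
    return theta**successes * (1 - theta)**failures

# Bristol-Roach data x0 = <1,1,0>: two successes, one failure
s, f = 2, 1
print(lik(0.5, s, f))  # 0.125, i.e., 1/8
print(lik(0.2, s, f))  # 0.032

# Likelihoods over theta do not integrate to 1: a midpoint Riemann sum
# of Lik(theta) over (0, 1) gives about 1/12, not 1.
n = 100_000
total = sum(lik((i + 0.5) / n, s, f) for i in range(n)) / n
print(total)
```

The last number approximates the integral of θ²(1 − θ) over (0, 1), which is 1/12 ≈ 0.083, underscoring that a likelihood function is not a probability distribution over θ.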
The Law of Likelihood (LL) will immediately be seen to fail our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis H0.
Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of x maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said that in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis H1 much better “supported” than H0 even when H0 is true. As George Barnard puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).
Note that for any outcome of n Bernoulli trials, the likelihood of H0: θ = 0.5 is (0.5)^n, which is quite small. The likelihood ratio (LR) of a best-supported alternative compared to H0 would be quite high. Since one could always erect such an alternative,
(*) Pr(LR in favor of H1 over H0; H0) = maximal.
Thus the LL permits BENT evidence. The severity for H1 is minimal, though the particular H1 is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses Gellerized, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
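The claim in (*) can be checked by simulation: generate data from a true H0: θ = 0.5, then “Gellerize” a rival that sets the per-trial success probability to 1 on the trials that succeeded and 0 on those that failed. That rival’s likelihood is 1 by construction, so the LR against H0 is 2^n on every single run. This is my own illustrative sketch; the function name and setup are assumptions, not the book’s code.

```python
import random

random.seed(1)

def gellerized_lr(n_trials):
    """Draw n Bernoulli(0.5) trials, then post-designate the maximally
    likely rival: theta_i = 1 on observed successes, 0 on failures.
    Its likelihood is 1; H0's is 0.5**n, so the LR is 2**n."""
    x = [random.random() < 0.5 for _ in range(n_trials)]
    lik_h1 = 1.0               # the hunted rival fits the data perfectly
    lik_h0 = 0.5 ** n_trials   # H0's likelihood for any outcome
    return lik_h1 / lik_h0

# Even though H0 is true on every run, the post-designated rival
# is always better "supported": LR = 2**10 = 1024 each time.
runs = [gellerized_lr(10) for _ in range(1000)]
print(all(lr > 1 for lr in runs))
```

The point is not that anyone would seriously entertain such a rival, but that a rule permitting it assigns maximal probability to favoring some false H1 over a true H0.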
What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes H0 maximally likely, we can find an H1 that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution. It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.
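To make “one level removed” concrete, we need not even simulate: for small n the sample space can be enumerated exactly. With n = 3 Bernoulli trials under H0: θ = 0.5, each of the 8 outcomes admits a Gellerized rival with likelihood 1, so the rule outputs an LR favoring some H1 with probability 1 under H0. A hypothetical sketch of mine, assuming the Gellerized-rival construction above:

```python
from itertools import product

theta0 = 0.5
n = 3

# The whole sample space of n Bernoulli trials: 2**n outcomes.
outcomes = list(product([0, 1], repeat=n))

# Sum Pr(x; H0) over every outcome x where the rule favors the rival.
prob_lr_favors_rival = 0.0
for x in outcomes:
    pr_x_under_h0 = theta0 ** n   # each outcome has probability (0.5)^3
    lik_h0 = theta0 ** n          # H0's likelihood for this outcome
    lik_h1 = 1.0                  # Gellerized rival matches x exactly
    if lik_h1 / lik_h0 > 1:
        prob_lr_favors_rival += pr_x_under_h0

print(prob_lr_favors_rival)  # 1.0 under H0 -- this is (*)
```

The loop over `outcomes` is precisely the move the Likelihoodist disallows: it evaluates the rule over the sample space, not just at the data actually observed.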
To continue reading Excursion 1 Tour II, go here.
This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).
Here’s link to all excerpts and mementos that I’ve posted (up to July 2019).
Mementos from Excursion I Tour II are here.
Blurbs of all 16 Tours can be found here.
Search topics of interest on this blog for the development of many of the ideas in SIST, and a rich sampling of comments from readers.