In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).

All of Excursion 1 Tour II is *here*. After this post, I’ll resume regular blogging for a while, so you can catch up to us. Several free (signed) copies of SIST will be given away on Twitter shortly.

**1.4 The Law of Likelihood and Error Statistics**

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.

*Law of Likelihood (LL):* Data **x** are better evidence for hypothesis *H*₁ than for *H*₀ if **x** is more probable under *H*₁ than under *H*₀: Pr(**x**; *H*₁) > Pr(**x**; *H*₀), that is, the likelihood ratio LR of *H*₁ over *H*₀ exceeds 1.

*H*₀ and *H*₁ are statistical hypotheses that assign probabilities to the values of the random variable *X*. A fixed value of *X* is written *x*₀, but we often want to generalize about this value, in which case, following others, I use **x**. The *likelihood of the hypothesis H*, given data **x**, is the probability of observing **x**, under the assumption that *H* is true or adequate in some sense. Typically, the ratio of the likelihood of *H*₁ over *H*₀ also supplies the quantitative measure of comparative support. Note that when *X* is continuous, the probability is assigned over a small interval around *x* to avoid probability 0.

**Does the Law of Likelihood Obey the Minimal Requirement for Severity?**

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. Two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (**x**) – all winners. A hypothesis *H* to explain this is that their method always succeeds in picking winners. *H* *entails* **x**, so the likelihood of *H* given **x** is 1. Yet we wouldn’t say *H* is therefore highly probable, especially without reason to put to rest that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.

Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as *x*₀ = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis *H*₀: θ = 0.5, given *x*₀, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ; *x*₀), because it’s always computed given data *x*₀; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form Lik(θ) = θ^*s*(1 − θ)^*f*, 0 < θ < 1, where *s* is the number of successes and *f* the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or any number in particular. Likelihoods do not obey the probability calculus.

The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis *H*₀.

Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of **x** maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said that in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis *H*₁ much better “supported” than *H*₀ even when *H*₀ is true. As George Barnard puts it, “there *always* is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).

Note that for any outcome of *n* Bernoulli trials, the likelihood of *H*₀: θ = 0.5 is (0.5)^*n*, so is quite small. The likelihood ratio (LR) of a best-supported alternative compared to *H*₀ would be quite high. Since one could always erect such an alternative,

(*) Pr(LR in favor of *H*₁ over *H*₀; *H*₀) = maximal.

*Thus the LL permits BENT evidence.* The severity for *H*₁ is minimal, though the particular *H*₁ is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses *Gellerized*, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
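A small enumeration makes (*) concrete for the three-trial Bernoulli case (this sketch is mine, not from the book): under *H*₀ every outcome has likelihood (0.5)³ = 0.125, while the data-dependent Gellerized rival assigns whatever was observed probability 1, so the LR favors the rival on every one of the 8 equally probable outcomes.

```python
from itertools import product

# H0: theta = 0.5 on each of three Bernoulli trials.
lik_H0 = 0.5 ** 3  # every 3-trial outcome has likelihood 0.125 under H0

favoring_rival = 0
for outcome in product([0, 1], repeat=3):
    # The Gellerized rival sets theta_i = 1 on the trials that succeeded and
    # 0 on the trials that failed, so it fits the observed outcome perfectly.
    lik_rival = 1.0
    if lik_rival / lik_H0 > 1:  # LR = 8 in favor of the rival, every time
        favoring_rival += 1

print(f"{favoring_rival} of 8 outcomes favor the data-dependent rival")
```

Since the rival wins no matter which outcome occurs, the probability in (*) is 1 even though *H*₀ is true by construction.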

What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes *H*₀ maximally likely, we can find an *H*₁ that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a *sampling distribution*. It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.

**To continue reading Excursion 1 Tour II, go here.**

__________

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Here’s a link to all excerpts and mementos that I’ve posted (up to July 2019).

Mementos from Excursion I Tour II are here.

Blurbs of all 16 Tours can be found here.

Search topics of interest on this blog for the development of many of the ideas in SIST, and a rich sampling of comments from readers.

As you know, I disagree strongly with your account here. You fail to appropriately consider the role of the statistical model that provides likelihoods and the fact that likelihoods of hypotheses are actually likelihoods of particular values for the parameter(s) of interest within the statistical model. (Yes, some people prefer to talk about rival statistical models where I talk about rival parameter values within a model. That makes no difference to the substance.)

Likelihood ratios only make sense as indices of evidence where the likelihoods belong to parameter values that are points along a single scale. That means that when Barnard writes “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did,” it only means that there is always a maximally supported value of the parameter given the data. In other words, the data (usually) support one value for the parameter more strongly than all others: so what? It only means that the likelihood function has a mode!

I have written a paper on this topic that explains the issue, and also the mistaken interpretations of a toy example by Birnbaum and Hacking. https://arxiv.org/abs/1507.08394

Michael: It is Royall who illustrates his own problem here with the “trick deck” example. For any card observed, the hypothesis that the entire deck is made up of just that type of card is more likely than the null of a normal deck. Things are only moderately less problematic if we restrict the alternative to alt values of a given parameter–as in the case of optional stopping.

The problem isn’t that on the Likelihood account “data support one value for the parameter more strongly than all others” (as you write), the problem is that it strongly supports an alt hyp even though it’s false. This violates (Cox’s) weak repeated sampling and shows the LR lacks error control (except for restricted cases of predesignated points).

As for likelihoods always being within models, that’s true, but it only presents deeper problems for the Likelihoodist: how do you use a comparative likelihood account to check a model? Almost everyone agrees that model checking violates the Likelihood Principle or renders it inapplicable.

Well, the deck of cards problem illustrates my point perfectly in a couple of ways!

Edwards and Royall both make use of an example where a single card is drawn from a deck and two hypotheses are considered, that the deck is normal and that the deck is made of 52 identical cards. The problem may be useful when re-examined in light of the dictum (my dictum) that one should put the hypotheses on the x-axis of a likelihood function. After a single observation, ace of diamonds, the x-axis becomes the number of aces of diamonds in the deck. The data do not tell anything about any other case, so the likelihood functions for any other cases are undefined.

When viewed in that light, the single observation is not in favour of the 52-aces-of-diamonds hypothesis by 52 to one, as it seems when only two hypotheses are considered, but 52 to 51 against the hypothesis of two aces of diamonds, 52 to 50 against three, etc. Thus the apparently strong support for the 52-aces hypothesis over the normal deck (i.e. one ace of diamonds) hypothesis is set into the proper context of other relevant hypotheses and, as should be the case with so sparse a dataset, the evidence is not convincing.
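Lew’s rescaling of the example can be sketched numerically (a toy illustration of his comment, assuming a uniform draw from the deck; the function name is mine): with *k* = number of aces of diamonds in the deck on the x-axis, the single draw gives Lik(*k*) = *k*/52.

```python
def lik(k):
    """Likelihood that the deck holds k aces of diamonds, given one was drawn."""
    return k / 52

# Against the normal deck (k = 1) the all-aces deck looks strongly supported:
print(lik(52) / lik(1))   # about 52

# ...but against its neighbors on the k-axis the support is negligible:
print(lik(52) / lik(51))  # 52/51, about 1.02
print(lik(52) / lik(50))  # 52/50, about 1.04
```

The 52-to-1 ratio is an artifact of comparing only the two endpoints of the scale; along the full likelihood function the extreme trick deck barely outperforms its neighbors.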

One might complain that this treatment changes the nature of the example by introducing hypotheses that are not called for in the original presentations, but I would counter by saying that the normal deck hypothesis is ‘natural’ and the alternative class of hypotheses, ‘trick’ decks, is not naturally restricted to the set of decks with 52 cards alike. Another potential problem is that as soon as another card is drawn that differs from the first, the class of relevant hypotheses becomes far more complicated.

Of course, another obvious response to your concern is to say that we should not worry that the evidence in the _least possible_ dataset appears to support a possibly wrong point parameter value over a possibly right point parameter value. After all, drawing just one more card will suffice to make the ‘problem’ go away no matter what the second card shows.
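The claim that one more card disarms the extreme hypothesis can be checked directly (again a toy sketch of mine, drawing twice without replacement): once a card other than the ace of diamonds appears, the 52-identical-cards deck has likelihood zero.

```python
def lik2(k):
    """Likelihood that the deck holds k aces of diamonds, after drawing first
    an ace of diamonds and then some other card (without replacement)."""
    return (k / 52) * ((52 - k) / 51)

print(lik2(52))  # 0.0 -- the 52-identical-cards deck is refuted outright
print(lik2(1))   # 1/52 -- the normal deck remains in play
```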

The disarming of this widely known counter-ish example by the collection of a little more data is generalisable. The single observation toy problem that Birnbaum claims to have been persuaded by is also disarmed by a second observation, as I discuss in the paper linked in my previous comment.

Does a method have to be well behaved when the number of observations is not more than the number of model parameters in order to be acceptable to you? If so then many methods would fail, probably including severity.

Michael: Again, the trick deck example is Royall’s – I was always surprised he trots it out – but the central problem remains, as does my criticism, even with your restriction. For the Likelihoodist the import of the evidence is limited to the likelihood ratio, and thus it ignores the stopping rule and, more generally, the “sampling plan”. Edwards, Lindman and Savage are happy to declare in 1963 that this restores the simplicity and freedom that had been lost with statistical significance and N-P tests. I just don’t see how current day “reformers” can still hold such a view in the face of today’s concerns with P-hacking and the “21 word solution” to avoid the ease of spurious significance. Yet Bayarri et al. (2016) tell us that “Bayes factors can be used in the complete absence of a sampling plan or in situations where the analyst does not know the sampling plan that was used” (2016, p. 100). If you don’t need to know it, you can’t use it to take account of any flexible choice points that alter error probabilities. But the same P-hacked hypothesis that can appear in a test can appear in a Bayes factor or likelihood ratio. That an account does not pick up on these gambits doesn’t make them go away. That is one of the chief messages of SIST. Ignoring optional stopping is another way that the probability of erroneous rejections balloons (SIST 43-4). Why does the ASA Guide include Principle 4, which says that you cannot interpret results without knowing the number of hypotheses tested, any post-data selection rules, stopping rules, etc.? I’m sorry, but I just find it baffling in the extreme that we can have constant calls to reveal selection effects (to avoid irreplication) while at the same time some people, including Royall, deny these considerations alter the evidential import of data. To pick up on them, for Royall, one might move to the context of belief to find low priors for data-dependent hypotheses.

“If so then many methods would fail, probably including severity.” It seems that any error statistical approach will not fail in the face of small samples precisely because it considers sample space (and the results we could have had) in the mix. Very small samples give inconclusive answers, exactly what is called for.
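The ballooning of erroneous rejections under optional stopping is easy to exhibit by simulation (my own sketch, not from SIST): sample from a standard normal so the null is true, apply a two-sided z-test at nominal level 0.05 after every observation, and stop at the first “significant” result or at a cap of 100 observations.

```python
import math
import random

def stops_with_rejection(max_n, rng):
    """Try-and-try-again: test after each draw until |z| > 1.96 or max_n is
    reached. The null (mean 0) is TRUE, so any rejection is erroneous."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / math.sqrt(n)) > 1.96:
            return True
    return False

rng = random.Random(2018)
sims = 2000
rate = sum(stops_with_rejection(100, rng) for _ in range(sims)) / sims
print(rate)  # far above the nominal 0.05
```

With the cap at 100 the erroneous rejection rate lands well above the nominal 5% (in the literature on repeated significance tests it is roughly 0.37 for n = 100), which is exactly the error-control failure that the LL, ignoring the stopping rule, cannot register.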

Mayo, your response is perplexing. You do not respond at all to the content of my comments and instead bring in issues that are irrelevant or at best tangential.

I did not claim that the trick deck example was yours. In fact I specifically noted that it was Edwards’s and then Royall’s.

It was the intention of my comments to explain that “the central problem”, as you call it, is not a real problem. If you wish to assert to the contrary that it is a real problem, and that it remains after my restriction, then you need to explain how and why, or perhaps suggest why my comments are mistaken. Otherwise it is a waste of our time to post any further comments.

You did not respond to my question regarding the negative evaluation of a method that misbehaves (allegedly: see my previous paragraph) when applied to a dataset of one single observation. You have not provided any example of likelihood ratios being seriously misleading when the dataset is larger than the number of parameters being fit by the model, and did not respond to the suggestion that a severity analysis will also perform badly with a single observation.

Stopping rules are not relevant to the single card problem as presented by Royall and Edwards or to my comments and so I do not know why your response goes off in that direction. Neither do I know why you mention Bayes factors. I do not defend Bayes factors and, in fact, I feel about them pretty much the same as you. I would add a further criticism that you will probably disagree with: Bayes factors are typically calculated for two points in parameter space that are predetermined, and that prevents the data from being able to direct our attention.

Michael,

So in the trick deck problem, the issue as I understand it is that even when the deck is a normal one, after drawing one card the likelihood ratio between fair deck and “fully” trick deck (or the whole likelihood function over the full set of possible trick decks, if you prefer) is *always* going to say that, in light of the available data, the evidence is to some degree against the fair deck (versus any other surviving possibility). If we accept that our methods should have *some* non-zero chance to detect errors when present, then this is a problem, since this method *always* says that what evidence is available is against the normal deck hypothesis – even when it’s true.

(Curiously, Bayes does not share this defect.)

Corey, given that we are talking about a model with 52 parameters (one for each type of card), the fact that a single datum will often or always point you in the wrong direction is neither surprising nor should it be disqualifying.

We always risk disastrous mistakes when we try to argue about statistical methods without explicit consideration of the statistical model(s). What model would you propose for the cards problem to make sure that the number of parameters is less than the number of data points with a single card drawn?

Michael: I realize you’re responding to Corey, but just to inform the general reader, the reason I don’t think the “trick deck” case is of great interest (except to get the feel of “Gellerized” hypotheses) is that violations of error control occur with examples that do not involve such an extreme data dependent alternative. Optional stopping (with a two-sided Normal test) is the classic example & is discussed in Excursion 1 Tour II linked in this post, with links to Cox and Hinkley and others.

Michael: I have valued your contributions on my blog to just this topic. I’m disappointed & perplexed now because our earlier rounds over 5 years or whatever actually led us to a much, much clearer place (than is apparent in your comments on this post). Rather than forfeit that progress (which is how I would feel if I tried to retrace the rounds now), I will just ask the reader to have a look at this Tour for themselves (it’s linked here in its entirety, in proof form). Perhaps, too, on some Saturday night, the reader might look at some of the “Sat night comedy” posts on this blog, such as “Who is allowed to cheat? I. J. Good and that after-dinner comedy hour”: https://errorstatistics.com/2014/04/05/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour-2/

Interested readers can also search this blog for law of likelihood, likelihood principle etc. to find much more–including valuable comments by M. Lew and others.

In rereading 1 year later, there are obviously things I would have put differently. I might have said more about the fact that “logic of induction” here refers to a purely syntactical account where the appraisal of the relationships between statements of evidence and hypotheses is a purely formal matter. It’s to be context-free, as with deductive logic. So when, for example, Hacking says there’s no such thing as an inductive logic, he means no purely formal set of context-free rules. I do not mean there is no “logic” in the reasoning associated with severe testing. It’s logical, but determining if H is severely tested by x in test T requires information about the background, the assumptions, the mistakes already ruled out, etc. I hope that the overall discussion of logical positivists and logics of induction/confirmation, novelty and Popper, Popper vs Carnap, solving the problem of induction now, etc. conveys this.

Mayo’s severe testing principles derive support from acceptance-sampling-type problems. Assume that a batch of 10000 (packaged) items must be assured of safety. For example, pathogens such as salmonella must be absent for the safety of the product. If we want 100% confidence, all the items in the batch must be tested. The percentage of the batch to be tested is about the same as the degree of confidence (probability) that the batch is pathogen free. See Wright (1990) and Benedict (1990).
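The rough equivalence of confidence and fraction tested can be seen with a back-of-envelope sketch (my simplification, assuming the worst case of exactly one contaminated item in the batch): a random sample of n from N misses that item with probability (N − n)/N, so an all-clean sample earns confidence of about n/N.

```python
N = 10_000  # batch size

def confidence(n, N):
    """Chance a random sample of n items catches a single bad item among N."""
    return n / N

for n in (100, 5_000, 9_000, 10_000):
    print(n, confidence(n, N))  # confidence tracks the fraction tested
```

Only at n = N (test everything) does the confidence reach 100%, matching the claim above.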

If p is the proportion of items that are nonconforming, discriminating between the hypotheses p = 0 vs p = 1 (a Haldane-type prior problem) creates issues. A stopping rule will work when p is large (or 1) but not when p = 0 or very small. Assume that a likelihood ratio is built to discriminate between p = 0.001 and p = 0.999, and the decision criterion is to reject the batch when a single tested item is nonconforming. We can plot the likelihood ratio against various sample sizes and see its ineffectiveness.

Benedict, J. P. (1990). Comment on Wright (1990). The American Statistician, 44, p.330.

Wright, T. (1990). When Zero Defectives Appear in a Sample: Upper Bounds on Confidence Coefficients of Upper Bounds. The American Statistician, 44, 40-41.

K.

There’s much I might say to your comment, but let me just mention the most concerning one. The central point of the severe testing philosophy is to deny that control of error probabilities matters solely because of a concern with the long-run performance of a method. The reason we object to the data dredger claiming to have good evidence of a genuine effect is that the dredged up hypothesis has not passed a severe test. The test scarcely can be said to have bent over backwards to avoid mistaking spurious effects as real. In scientific contexts, I argue, the role of error probabilities of methods is to quantify the capacity of a test to have found flaws in a claim, if present. Although data dredging and optional stopping would result in high error rates in the long run of uses, that is not the reason we would question the scientific standing of a resulting inference. It is rather that the claim has not been well-tested. This intuition should be reflected in one’s statistical account. At present it is not. Instead, probability arises either for the kind of performance (e.g., in acceptance sampling) that you mention, or to assign degrees (usually comparative) of belief, confirmation, or support.

I realize there are some other points in your and Michael’s comment, but this is so fundamental that I thought it worth clarifying right away.