An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data. Continue reading
law of likelihood
In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed). Continue reading
For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading
Here are the slides from my discussion of Nancy Reid today at BFF4: The Fourth Bayesian, Fiducial, and Frequentist Workshop: May 1-3, 2017 (hosted by Harvard University)
Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..
1. Take a one-sided Normal test T+: with n iid samples:
H0: µ ≤ 0 against H1: µ > 0
σ = 10, n = 100, σ/√n =σx= 1, α = .025.
So the test would reject H0 iff Z > c.025 =1.96. (1.96. is the “cut-off”.)
- Simple rules for alternatives against which T+ has high power:
- If we add σx (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
- If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect. This value, which we can write as µ.999 = 4.96
Let the observed outcome just reach the cut-off to reject the null,z0 = 1.96.
If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using
it would be 40. (.999/.025).
It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96). The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding
Pr(H0 |z0) = 1/ (1 + 40) = .024, so Pr(H1 |z0) = .976.
Such an inference is highly unwarranted and would almost always be wrong. Continue reading
Have you ever noticed that some leading advocates of a statistical account, say a testing account A, upon discovering account A is unable to handle a certain kind of important testing problem that a rival testing account, account B, has no trouble at all with, will mount an argument that being able to handle that kind of problem is actually a bad thing? In fact, they might argue that testing account B is not a “real” testing account because it can handle such a problem? You have? Sure you have, if you read this blog. But that’s only a subliminal point of this post.
I’ve had three posts recently on the Law of Likelihood (LL): Breaking the [LL](a)(b), [c], and [LL] is bankrupt. Please read at least one of them for background. All deal with Royall’s comparative likelihoodist account, which some will say only a few people even use, but I promise you that these same points come up again and again in foundational criticisms from entirely other quarters.[i]
[M]edical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope θ is considerably greater, perhaps 0.8 or even greater. To obtain evidence about θ, they carry out a study in which the new treatment is given to 17 subjects, and find that it is successful in nine.
Let me interject at this point that of all of Stephen Senn’s posts on this blog, my favorite is the one where he zeroes in on the proper way to think about the discrepancy we hope to find (the .8 in this example). (See note [ii]) Continue reading
There was a session at the Philosophy of Science Association meeting last week where two of the speakers, Greg Gandenberger and Jiji Zhang had insightful things to say about the “Law of Likelihood” (LL)[i]. Recall from recent posts here and here that the (LL) regards data x as evidence supporting H1 over H0 iff
Pr(x; H1) > Pr(x; H0).
On many accounts, the likelihood ratio also measures the strength of that comparative evidence. (Royall 1997, p.3). [ii]
H0 and H1 are statistical hypothesis that assign probabilities to the random variable X taking value x. As I recall, the speakers limited H1 and H0 to simple statistical hypotheses (as Richard Royall generally does)–already restricting the account to rather artificial cases, but I put that to one side. Remember, with likelihoods, the data x are fixed, the hypotheses vary.
1. Maximally likely alternatives. I didn’t really disagree with anything the speakers said. I welcomed their recognition that a central problem facing the (LL) is the ease of constructing maximally likely alternatives: so long as Pr(x; H0) < 1, a maximum likely alternative H1 would be evidentially “favored”. There is no onus on the likelihoodist to predesignate the rival, you are free to search, hunt, post-designate and construct a best (or better) fitting rival. If you’re bothered by this, says Royall, then this just means the evidence disagrees with your prior beliefs.
After all, Royall famously distinguishes between evidence and belief (recall the evidence-belief-action distinction), and these problematic cases, he thinks, do not vitiate his account as an account of evidence. But I think they do! In fact, I think they render the (LL) utterly bankrupt as an account of evidence. Here are a few reasons. (Let me be clear that I am not pinning Royall’s defense on the speakers[iii], so much as saying it came up in the general discussion[iv].) Continue reading
With this post, I finally get back to the promised sequel to “Breaking the Law! (of likelihood) (A) and (B)” from a few weeks ago. You might wish to read that one first.* A relevant paper by Royall is here.
Richard Royall is a statistician1 who has had a deep impact on recent philosophy of statistics by giving a neat proposal that appears to settle disagreements about statistical philosophy! He distinguishes three questions:
- What should I believe?
- How should I act?
- Is this data evidence of some claim? (or How should I interpret this body of observations as evidence?)
It all sounds quite sensible– at first–and, impressively, many statisticians and philosophers of different persuasions have bought into it. At least they appear willing to go this far with him on the 3 questions.
How is each question to be answered? According to Royall’s
commandments writings, what to believe is captured by Bayesian posteriors; how to act, by a behavioristic, N-P long-run performance. And what method answers the evidential question? A comparative likelihood approach. You may want to reject all of them (as I do),2 but just focus on the last.
Remember with likelihoods, the data x are fixed, the hypotheses vary. A great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p- values, power, etc.) start with “the law”. But I fail to see why we should obey it.
To begin with, a report of comparative likelihoods isn’t very useful: H might be less likely than H’, given x, but so what? What do I do with that information? It doesn’t tell me I have evidence against or for either.3 Recall, as well, Hacking’s points here about the variability in the meanings of a likelihood ratio across problems. Continue reading
1.An Assumed Law of Statistical Evidence (law of likelihood)
Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p- values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:
Data x are better evidence for hypothesis H1 than for H0 if x are more probable under H1 than under H0.
Ian Hacking (1965) called this the logic of support: x supports hypotheses H1 more than H0 if H1 is more likely, given x than is H0:
Pr(x; H1) > Pr(x; H0).
[With likelihoods, the data x are fixed, the hypotheses vary.]*
x is evidence for H1 over H0 if the likelihood ratio LR (H1 over H0 ) is greater than 1.
It is given in other ways besides, but it’s the same general idea. (Some will take the LR as actually quantifying the support, others leave it qualitative.)
In terms of rejection:
“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)
2. Barnard (British Journal of Philosophy of Science )
But this “law” will immediately be seen to fail on our minimal severity requirement. Hunting for an impressive fit, or trying and trying again, it’s easy to find a rival hypothesis H1 much better “supported” than H0 even when H0 is true. Or, as Barnard (1972) puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972 p. 129). H0: the coin is fair, gets a small likelihood (.5)k given k tosses of a coin, while H1: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”. Or, to use an example with P-values: a statistically significant difference, being improbable under the null H0 , will afford high likelihood to any number of explanations that fit the data well.
3.Breaking the law (of likelihood) by going to the “second,” error statistical level:
How does it fail our severity requirement? First look at what the frequentist error statistician must always do to critique an inference: she must consider the capability of the inference method that purports to provide evidence for a claim. She goes to a higher level or metalevel, as it were. In this case, the likelihood ratio plays the role of the needed statistic d(X). To put it informally, she asks:
What’s the probability the method would yield an LR disfavoring H0 compared to some alternative H1 even if H0 is true?