**1. An Assumed Law of Statistical Evidence (law of likelihood)**

Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:

Data **x** are better evidence for hypothesis *H*₁ than for *H*₀ if **x** are more probable under *H*₁ than under *H*₀.

Ian Hacking (1965) called this the *logic of support*: *x* supports hypothesis *H*₁ more than *H*₀ if *H*₁ is more **likely**, given **x**, than is *H*₀:

Pr(*x*; *H*₁) > Pr(*x*; *H*₀).

[With likelihoods, the data **x** are fixed, the hypotheses vary.]

Or, **x** is evidence for *H*₁ over *H*₀ if the **likelihood ratio** (**LR** of *H*₁ over *H*₀) is greater than 1.

It is given in other ways besides, but it’s the same general idea. (Some will take the LR as actually quantifying the support, others leave it qualitative.)
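To make the ratio concrete, here is a minimal sketch in Python (the coin setup and the particular numbers, 9 heads in 12 tosses with p = .5 versus p = .75, are my own illustration, not from the post): it evaluates Pr(*x*; *H*) at the same fixed data for each hypothesis and forms the LR.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(k heads in n tosses) when the probability of heads is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fixed data: 9 heads in 12 tosses (illustrative numbers, not from the post).
n, heads = 12, 9

# Two simple (point) hypotheses about the coin's probability of heads.
lik_H0 = binom_pmf(heads, n, 0.5)   # H0: the coin is fair
lik_H1 = binom_pmf(heads, n, 0.75)  # H1: the coin is biased toward heads

# With likelihoods the data x are fixed; only the hypothesis varies.
LR = lik_H1 / lik_H0
print(f"Pr(x; H1) = {lik_H1:.4f}, Pr(x; H0) = {lik_H0:.4f}, LR = {LR:.2f}")
# LR > 1 (about 4.8 here), so on this view x counts as evidence for H1 over H0.
```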

In terms of rejection:

“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)

**2. Barnard (British Journal for the Philosophy of Science)**

But this “law” will immediately be seen to fail on our minimal *severity requirement*. Hunting for an impressive fit, or trying and trying again, it’s easy to find a rival hypothesis *H*₁ much better “supported” than *H*₀ even when *H*₀ is true. Or, as Barnard (1972) puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).

*H*₀: the coin is fair, gets a small likelihood, (.5)^k, given k tosses of a coin, while *H*₁: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”. Or, to use an example with P-values: a statistically significant difference, being improbable under the null *H*₀, will afford high likelihood to any number of explanations that fit the data well.
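A minimal sketch of the arithmetic in the coin example above (the simulation and seed are my own illustration): whatever sequence of k tosses occurs, Barnard’s tailor-made rival assigns it likelihood 1, so the LR over the fair coin is 2^k no matter what the data are, even when the coin really is fair.

```python
import random

random.seed(1)  # illustrative seed

k = 20
# Simulate k tosses from a genuinely fair coin, so H0 is true.
tosses = ['H' if random.random() < 0.5 else 'T' for _ in range(k)]

# H0 (coin is fair): every particular sequence has likelihood (.5)^k.
lik_H0 = 0.5 ** k

# H1 (Barnard's rival): Pr(heads) = 1 exactly on the tosses that landed
# heads and Pr(tails) = 1 on the rest, so the observed sequence gets
# likelihood 1 -- whichever sequence occurred.
lik_H1 = 1.0

LR = lik_H1 / lik_H0  # = 2^k, independent of the data
print(f"Sequence: {''.join(tosses)}")
print(f"LR of the tailor-made rival over the fair coin: {LR:,.0f}")
# ~1,048,576 for k = 20: maximal 'support' for H1 even though H0 is true.
```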

_{0}**3.Breaking the law (of likelihood) by going to the “second,” error statistical level:**

How does it fail our severity requirement? First, look at what the frequentist error statistician must always do to critique an inference: she must consider the capability of the inference method that *purports* to provide evidence for a claim. She goes to a higher level or metalevel, as it were. In this case, the likelihood ratio plays the role of the needed statistic *d*(**X**). To put it informally, she asks:

What’s the probability the method would yield an LR disfavoring *H*₀ compared to some alternative *H*₁, even if *H*₀ is true?
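As a hedged illustration of that question (the Monte Carlo design and the cutoff for “much better supported” are my own assumptions, not from the post): simulating data under a true fair-coin *H*₀ and hunting up the tailor-made rival each time, the method yields an LR strongly disfavoring *H*₀ every single time.

```python
import random

random.seed(2)  # illustrative seed

k = 20
trials = 10_000
threshold = 8  # arbitrary cutoff for 'much better supported'

favors_H1 = 0
for _ in range(trials):
    # Generate the data with H0 (fair coin) true.
    tosses = [random.random() < 0.5 for _ in range(k)]
    lik_H0 = 0.5 ** k  # likelihood of the observed sequence under H0
    lik_H1 = 1.0       # tailor-made rival fits any observed sequence perfectly
    # Note the LR does not depend on which sequence occurred: it is 2^k.
    if lik_H1 / lik_H0 > threshold:
        favors_H1 += 1

# The severity question: Pr(method yields LR disfavoring H0 | H0 true).
print(f"Estimated Pr(LR > {threshold} favoring H1 | H0 true) = {favors_H1 / trials:.2f}")
# Prints 1.00: hunting for a best-fitting rival is guaranteed to 'disfavor' a true H0.
```

This is the sense in which the error statistician goes to a higher level: the probability is computed over the behavior of the method, not over the hypotheses themselves.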
