Here’s a quick note on something I often find in discussions of tests: treating “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…
1. Take a one-sided Normal test T+ with n iid samples:
H0: µ ≤ 0 against H1: µ > 0
σ = 10, n = 100, σ/√n = σx = 1, α = .025.
So the test rejects H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)
- Simple rules for alternatives against which T+ has high power:
- If we add σx (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
- If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~.999 power to detect. This value, which we can write as µ.999, is 4.96. (See the numerical sketch below.)
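For anyone who wants to check these benchmarks, here is a minimal numerical sketch in Python (using scipy’s Normal distribution; the variable names are mine):

```python
from scipy.stats import norm

sigma, n, alpha = 10, 100, 0.025
sigma_x = sigma / n**0.5                       # sigma/sqrt(n) = 1
cutoff = norm.ppf(1 - alpha) * sigma_x         # 1.96, the cut-off for rejection

def power(mu):
    """Pr(the sample mean exceeds the cut-off when the true mean is mu)."""
    return 1 - norm.cdf(cutoff, loc=mu, scale=sigma_x)

print(round(cutoff, 2))                        # 1.96
print(round(power(cutoff + 1 * sigma_x), 3))   # 0.841: power against mu = 2.96
print(round(power(cutoff + 3 * sigma_x), 3))   # 0.999: power against mu = 4.96
```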
Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.
If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

Pr(Z > 1.96; μ = 4.96) / Pr(Z > 1.96; μ = 0),

it would be 40 (.999/.025).
It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data z0 = 1.96 are even closer to 0 than to 4.96; the same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding
Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.
Such an inference is highly unwarranted and would almost always be wrong.
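For concreteness, here is that (fallacious) arithmetic spelled out, alongside the genuine likelihoods at the specific outcome z0 = 1.96. This is only a sketch of the calculation being criticized, using scipy:

```python
from scipy.stats import norm

sigma_x, cutoff, z0 = 1.0, 1.96, 1.96

# The move being criticized: use power and size as if they were likelihoods.
power_4_96 = 1 - norm.cdf(cutoff, loc=4.96, scale=sigma_x)   # ~ .999
size = 1 - norm.cdf(cutoff, loc=0.0, scale=sigma_x)          # ~ .025
bogus_lr = power_4_96 / size                                 # ~ 40
post_H0 = 1 / (1 + bogus_lr)                                 # ~ .024 with .5 priors
post_H1 = 1 - post_H0                                        # ~ .976

# The actual likelihood ratio at the specific outcome z0 = 1.96 points the
# other way: the density under mu = 0 exceeds the density under mu = 4.96.
actual_lr = norm.pdf(z0, loc=4.96, scale=sigma_x) / norm.pdf(z0, loc=0.0, scale=sigma_x)

print(round(bogus_lr, 1), round(post_H0, 3), round(post_H1, 3))  # ~40  0.024  0.976
print(round(actual_lr, 2))                                       # ~0.08, i.e. it favors mu = 0
```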
- How could people think it plausible to compute a comparative likelihood this way?
I have been thinking about this for a while because it’s ubiquitous throughout criticisms of error statistical testing, and it comes from a plausible comparativist likelihood position (which I do not hold), namely that data are better evidence for μ than for μ’ if μ is more likely than μ’ given the data. I’m guessing the reasoning goes as follows:
The probability is very high that z > 1.96 under the assumption that μ = 4.96.
The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.
We’ve observed z0 = 1.96 (so we’ve observed z > 1.96).
Therefore, μ = 4.96 makes the observation more probable than does μ = 0.
Therefore the outcome is (comparatively) better evidence for μ = 4.96 than for μ = 0.
But the “outcome” for a likelihood must be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this. Power against μ’ concerns the capacity of the test to have produced a larger difference, under μ’. (It refers to all of the outcomes that could have been generated.)
- That’s not at all how power works.
Power works in just the opposite way! If there’s a high probability that you would have observed a larger difference than you did, assuming the data came from a world where μ = μ’, then the data indicate you’re not in a world where μ is as high as μ’. In fact:
If Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!
Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
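Numerically (same hypothetical setup as above): if μ were really 4.96, a result as small as z0 = 1.96 would occur only about 0.1% of the time, so getting only z0 = 1.96 counts against μ being that large:

```python
from scipy.stats import norm

sigma_x, z0, mu_prime = 1.0, 1.96, 4.96

p_larger = 1 - norm.cdf(z0, loc=mu_prime, scale=sigma_x)   # Pr(Z > z0; mu = mu') ~ .999
p_this_small = norm.cdf(z0, loc=mu_prime, scale=sigma_x)   # Pr(Z <= z0; mu = mu') ~ .001

# With probability ~.999 we should have seen something larger than 1.96 if mu
# were 4.96; that we did not indicates mu < 4.96.
print(round(p_larger, 3), round(p_this_small, 4))          # 0.999 0.0013
```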
- Stephen Senn
Stephen Senn (2007, p. 201) has correctly said that the following is “nonsense”:
“[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect.”
Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. (See also Senn’s post here.)
Supposing that it is, is essentially to treat the test as if it were:
H0: μ ≤ 0 vs H1: μ ≥ 4.96
This, he says, is “ludicrous” as it:
“would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn, 2007, p. 201)
The same holds with H0:μ = 0 as null.
If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The lower .975 limit (one-sided, equivalently the lower limit of a two-sided .95 interval) would be 0, and the lower .95 one-sided limit would be about .3. So we would be warranted in inferring from z0:
μ > 0 or μ > .3.
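Those lower limits are easy to check (a sketch under the same assumptions, with the observed mean at the cut-off and σx = 1):

```python
from scipy.stats import norm

sigma_x, z0 = 1.0, 1.96

lower_975 = z0 - norm.ppf(0.975) * sigma_x   # ~ 0   -> warrants mu > 0
lower_95  = z0 - norm.ppf(0.95) * sigma_x    # ~ 0.3 -> warrants mu > .3
print(round(lower_975, 2), round(lower_95, 2))
```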
- What does the severe tester say?
In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.
Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate
μ > 4.96
would be wrong over 99% of the time!
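These severity benchmarks follow the same pattern: for the just-significant z0 = 1.96 (with σx = 1), SEV(μ > μ1) is computed as Pr(Z < z0; μ = μ1). A sketch:

```python
from scipy.stats import norm

sigma_x, z0 = 1.0, 1.96

def sev_mu_greater(mu1):
    """Severity for the claim mu > mu1, given the just-significant outcome z0."""
    return norm.cdf(z0, loc=mu1, scale=sigma_x)

print(round(sev_mu_greater(0.0), 3))    # 0.975: mu > 0 is well warranted
print(round(sev_mu_greater(0.96), 3))   # 0.841: mu > .96 fairly well warranted
print(round(sev_mu_greater(4.96), 4))   # 0.0013: mu > 4.96 would be wrong over 99% of the time
```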
Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power), but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.
- The (type 1, 2 error probability) trade-off vanishes
Notice what happens if we consider the “real type 1 error” to be Pr(H0|z0).
Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing type 2 error probability. So to identify the “type 1 error” with Pr(H0|z0) is to use language in a completely different way from the one in which power is defined, for there we must have a trade-off between the type 1 and type 2 error probabilities.
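To see the vanishing trade-off in numbers: with .5 priors and size and power used as likelihoods, the so-called posterior works out to Pr(H0|z0) = α/(α + power), which only falls as power rises (i.e., as the type 2 error probability falls), while α sits fixed at .025. A toy illustration:

```python
# Treating size and power as likelihoods with .5 priors (the move criticized above):
alpha = 0.025
for power in (0.50, 0.80, 0.90, 0.999):
    post_H0 = alpha / (alpha + power)      # the so-called "real type 1 error"
    print(power, round(post_H0, 4))
# Higher power (lower type 2 error) only drives this quantity down,
# while the actual type 1 error probability stays fixed at alpha = .025.
```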
The conclusion is that using size and power as likelihoods is a bad idea for anyone who wants to assess the comparative evidence by likelihoods. It’s true that the error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses). Criticisms often start out forming these ratios and then blaming the “tail areas” for exaggerating the evidence against. We don’t form those ratios. My point here, though, is that this gambit serves very badly for a Bayes ratio or likelihood assessment.
Likelihood is a “fit” measure, “power” is not. (Power is a “capacity” measure.)
Send any corrections, I was just scribbling this….
This is related to my “no headache power for Dierdre” post, and several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.
Senn, S. (2007). Statistical Issues in Drug Development. Wiley.
Deborah, I don’t think you’ll like my answer, but my first reaction to your post is that the problem is due to obsession with likelihood in the first place. As you said yourself, comparing likelihoods seems fishy, but I would go further and say that I believe too much is made of likelihood in general anyway. In addition to philosophical issues, likelihood can be very sensitive to distributional departures from the model, especially in the tails, the regions of interest. (Recall my related statement about the Higgs boson analysis.)
More generally, my problem with all of the analysis you describe is that it is all aimed at taking the decision making out of the hands of the domain expert. I’d much prefer that the statistics tell the analyst what plausible values μ may have, and then let the analyst make a decision (on policy or whatever) accordingly. As usual, I’m talking about confidence intervals, but in a different way than you seem to view them.