If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one so sensitive that burning toast sets it off. The answer: the insensitive detector's alarm going off is a good indication of the presence of (some) fire, while hearing the ultra-sensitive alarm go off is not.[i]
Yet I often hear people say things to the effect that:
if you get a result significant at a low p-value, say ~.03,
but the power of the test to detect alternative µ’ is also low, say .04 (i.e., POW(µ’) = .04), then “the result hasn’t done much to distinguish” the data from what would be obtained by chance alone.
–but wherever that reasoning comes from, it isn’t from statistical hypothesis testing, properly understood. It’s easy to see why.
We can use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:
H0: µ ≤ 0 against H1: µ > 0
Let σ = 1, n = 25, so (σ/ √n) = .2.
To avoid those annoying X-bars, I will use M for the sample mean instead. The Excel example has µ ≤ 12, but it’s even easier to use 0, and easy to switch over. Test T+ rejects H0 at the .025 level if M > 1.96(.2). Let’s make it the 2-standard-deviation cut-off:
Test T+ rejects H0 at ~ .025 level if M > 2(.2) = .4. So the cut-off M*= .4.
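A quick numerical check of this setup (a sketch in Python; the helper `norm_sf` for the standard Normal upper tail is my own, built from the error function, not part of the example):

```python
from math import erf, sqrt

def norm_sf(z):
    """Upper-tail probability P(Z > z) for a standard Normal."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)   # sigma/sqrt(n) = .2
m_star = 2 * se        # the 2-standard-deviation cut-off, .4

# Type I error probability of test T+ with this cut-off:
alpha = norm_sf((m_star - 0) / se)
print(round(alpha, 3))  # ~ .023, i.e. roughly the .025 level
```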
Now we need a µ’ such that POW(µ’) = low.
Power is always defined in terms of the cut-off for rejection, M*.
- I know the power against alternatives between 0 and cut-off M* will be less than .5.
- I’ll get really low power (.16) if µ’ exceeds 0 by only 1 (σ/√n) unit – which in this case is 1(.2) = .2. (That is, POW(.2) = .16.)
- I’ll get even lower power if µ’ exceeds 0 by only .25 (σ/√n) unit – which in this case is .25(.2) = .05.
I’m cutting corners with symbols wherever possible.
So what’s the power of T+ against .05? POW(.05) = ?
P(M > .4; µ = .05) = P(Z > (.4 – .05)/.2) = P(Z > (.35)(5)) = P(Z > 1.75) = .04
So POW(.05) = .04 – quite low.
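These power calculations can be replicated directly (again a sketch; `norm_sf` is a hand-rolled standard-Normal upper tail, and the defaults bake in this example's M* = .4 and σ/√n = .2):

```python
from math import erf, sqrt

def norm_sf(z):
    """Upper-tail probability P(Z > z) for a standard Normal."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def power(mu_prime, m_star=0.4, se=0.2):
    """POW(mu') = P(M > M*; mu = mu') for test T+."""
    return norm_sf((m_star - mu_prime) / se)

print(round(power(0.2), 2))   # .16
print(round(power(0.05), 2))  # .04
```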
[Whether this low chance of triggering when µ = .05 is just what we want is a separate issue.]
My claim is that if the test has triggered a rejection, say just at the cut-off M* (= .4), then there’s a good indication that µ > .05.
You can see this using lower confidence limits (LL) corresponding to test T+.
Find the .96 lower confidence limit (LL) corresponding to test T+, supposing the observed sample mean M = .4. (Never mind that we’d typically estimate σ).
µ > M – (1.75)(1/√25)
µ > M – (1.75)(.2)
µ > M – .35
Since we’re imagining M reaches the cut-off M*, we have the following one-sided lower .96 confidence limit.
µ > .4 – .35 = .05.
So µ > .05 is certainly warranted.
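The same lower-limit arithmetic can be sketched numerically (1.75 is the z-value whose upper tail is .04, so the one-sided confidence level comes out at about .96; `norm_cdf` is my own stdlib helper):

```python
from math import erf, sqrt

def norm_cdf(z):
    """P(Z <= z) for a standard Normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

m, se, z = 0.4, 0.2, 1.75      # observed mean at the cut-off, SE, z-value
lower_limit = m - z * se       # one-sided lower confidence limit
print(round(lower_limit, 2))   # .05
print(round(norm_cdf(z), 2))   # confidence level ~ .96
```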
(This is also given by severity reasoning.)
Here’s another example: What’s the power of T+ against .1? POW(.1) = ?
P(M > .4; µ = .1) = P(Z > (.4 – .1)/.2) = P(Z > (.3)(5)) = P(Z > 1.5) = .07
So POW(.1) = .07.
Correspondingly, µ = .1 is the lower limit of a one-sided confidence interval with confidence level of ______?
So the statistically significant result is a better indication that µ > .05 than µ > .1.
You can see the duality between CIs and tests, but I’ll come back to this. The main lesson is:
If a test’s power to detect µ’ is low, then a statistically significant result (i.e., a rejection of the null with a low p-value) is a good indication of a discrepancy at least as large as µ’ – that is, a good indication that µ > µ’.
[i] I assume the alarm system shares the obvious properties of good tests for detecting discrepancies; that’s the point of an analogy. In any event, I have delineated those points elsewhere.