Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?
(I) Listen to Jacob Cohen (1988) introduce Power Analysis
“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists.
What is really intended by the invalid affirmation of a null hypothesis is not that the population ES is literally zero, but rather that it is negligible, or trivial. This proposition may be validly asserted under certain circumstances. Consider the following: for a given hypothesis test, one defines a numerical value i (or iota) for the ES, where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 – b) is then set at a high value, so that b is relatively small. When, additionally, a is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible; this conclusion can be offered as significant at the b level specified. In much research, “no” effect (difference, correlation) functionally means one that is negligible; “proof” by statistical induction is probabilistic. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to a, the null hypothesis can be accepted in preference to that which holds that ES = i with risk equal to b. Since i is negligible, the conclusion that the population ES is not as large as i is equivalent to concluding that there is “no” (nontrivial) effect. This comes fairly close and is functionally equivalent to affirming the null hypothesis with a controlled error rate (b), which, as noted above, is what is actually intended when null hypotheses are incorrectly affirmed (J. Cohen 1988, p. 16).
Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time–something not always available. Even where a negligible i may be specified, the power to detect that i may be low and not high. Two important points can still be made:
- First, Cohen doesn’t instruct you to infer there’s no discrepancy from H0, merely that it’s “no more than i”.
- Second, even if your test doesn’t have high power to detect negligible i, you can infer the population discrepancy is less than whatever γ your test does have high power to detect (given nonsignificance).
Now to tell what’s true about Greenland’s concern that “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
(II) The first step is to understand the assertion, giving the most generous interpretation. It deals with nonsignificance, so our ears are perked for a fallacy of nonrejection or nonsignificance. We know that “high power” is an incomplete concept, so he clearly means high power against “the alternative”.
For a simple example of Greenland’s phenomenon, consider an example of the Normal test we’ve discussed a lot on this blog. Let T+: H0: µ ≤ 12 versus H1: µ > 12, σ = 2, n = 100. Test statistic Z is √100(M – 12)/2 where M is the sample mean. With α = .025, the cut-off for declaring .025 significance from M*.025 = 12+ 2(2)/√100 = 12.4 (rounding to 2 rather than 1.96 to use a simple Figure below).
[Note: The thick black vertical line in the Figure, which I haven’t gotten to yet, is going to be the observed mean, M0 = 12.35. It’s a bit lower than the cut-off at 12.4.]
Now a title like Greenland’s is supposed to signal some problem. What is it? The statistical part just boils down to noting that the observed mean M0 (e.g., 12.35) may fail to make it to the cut-off M* (here 12.4), and yet be closer to an alternative against which the test has high power (e.g., 12.6) than it is to the null value, here 12. This happens because the Type 2 error probability is allowed to be greater than the Type 1 error probability (here .025).
Abbreviate the alternative against which the test T+ has .84 power as, µ.84 , as I’ve often done. (See, for example, this post.) That is, the probability Test T+ rejects the null when µ = µ.84 = .84. i.e.,POW(T+, µ.84) = .84. One of our power short-cut rules tells us:
µ.84 = M* + 1σM = 12.4 + .2 = 12.6,
where σM: =σ/√100 = .2.
Note: the Type 2 error probability in relation to alternative µ = 12.6 is.16. This is the area to the left of 12.4 under the red curve above. Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β(12.6).
µ.84 exceeds the null value by 3σM: so any observed mean that exceeds 12 by more than 1.5σM but less than 2σM gives an example of Greenland’s phenomenon. [Note: the previous sentence corrects an earlier wording.] In T+ , values 12.3 < M0 <12 .4 do the job. Pick M0 = 12.35. That value is indicated by the black vertical line in the figure above.
Having established the phenomenon, your next question is: so what?
It would be problematic if power analysis took the insignificant result as evidence for μ = 12 (i.e., 0 discrepancy from the null). I’ve no doubt some try to construe it as such, and that Greenland has been put in the position of needing to correct them. This is the reverse of the “mountains out of molehills” fallacy. It’s making molehills out of mountains. It’s not uncommon when a nonsignificant observed risk increase is taken as evidence that risks are “negligible or nonexistent” or the like. The data are looked at through overly rosy glasses (or bottle). Power analysis enters to avoid taking no evidence of increased risk as evidence of no risk. Its reasoning only licenses μ < µ.84 where .84 was chosen for “high power”. From what we see in Cohen, he does not give a green light to the fallacious use of power analysis.
(III) Now for how the inference from power analysis is akin to significance testing (as Cohen observes). Let μ1−β be the alternative against which test T+ has high power, (1 – β). Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T-, we might call it. That is, T- tests H0: μ ≥ μ1−β versus H1: μ < μ1−β at the β level. The test rejects H0 (at level β) when M < μ0 – zβσM. Such a significant result would warrant inferring μ < μ1−β at significance level β. Using power analysis doesn’t require making this switcheroo, which might seem complicated. The point is that there’s really no new reasoning involved in power analysis, which is why the members of the Fisherian tribe manage it without even mentioning power.
EXAMPLE. Use μ.84 in test T+ (α = .025, n = 100, σM = .2) to create test T-. Test T+ has .84 power against μ.84 = 12 + 3σM = 12.6 (with our usual rounding). So, test T- is
H0: μ ≥ 12.6 versus H1: μ <12 .6
and a result is statistically significantly smaller than 12.6 at level .16 whenever sample mean M < 12.6 – 1σM = 12.4. To check, note (as when computing the Type 2 error probability of test T+) that
Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β. In test T-, this serves as the Type 1 error probability.
So ordinary power analysis follows the identical logic as significance testing. [i] Here’s a qualitative version of the logic of ordinary power analysis.
Ordinary Power Analysis: If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.[ii]
Or, another way to put this:
If data x are not statistically significantly different from H0, then x indicates that the underlying discrepancy (from H0) is no greater than γ, just to the extent that that the power to detect discrepancy γ is high,
[i] Neyman, we’ve seen, was an early power analyst. See, for example, this post.
[ii] Compare power analytic reasoning with severity reasoning from a negative or insignificant result.
POWER ANALYSIS: If Pr(d > cα; µ’) = high and the result is not significant, then it’s evidence µ < µ’
SEVERITY ANALYSIS: (for an insignificant result): If Pr(d > d0; µ’) = high and the result is not significant, then it’s evidence µ < µ.’
Severity replaces the pre-designated cut-off cα with the observed d0. Thus we obtain the same result remaining in the Fisherian tribe. We still abide by power analysis though, since if Pr(d > d0; µ’) = high then Pr(d > cα; µ’) = high, at least in a sensible test like T+. In other words, power analysis is conservative. It gives a sufficient but not a necessary condition for warranting bound: µ < µ’. But why view a miss as good as a mile? Power is too coarse.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. [Link to quote above: p. 16]
Greenland, S. 2012. ‘Nonsignificance Plus High Power Does Not Imiply Support for the Null Over the Alternative’, Annals of Epidemiology 22, pp. 364-8. Link to paper: Greenland (2012)