
In giving some informal remarks about power at a seminar a couple of weeks ago, I proposed that the tendency to turn the notion of power on its head might be avoided by imagining we need to define a test’s error probabilities in terms of its power alone. We can refer to the power against the null hypothesis, rather than alluding to a Type I error probability, for example. What do I mean by turning power on its head? I mean, at least here, supposing that a test provides poor evidence of discrepancies that it has low power to detect.
This grows out of the assumption that a statistically significant result only provides good evidence of discrepancies (from a null hypothesis) that the test has reasonably high power to detect. But these claims actually reverse what is the case about power and warranted (population) discrepancies. They turn power on its head.
To remind us, the goal of this statistical significance test is to assess the compatibility of data with a reference or null hypothesis: for example, to see whether the value of a test statistic D indicates a genuine positive (population) discrepancy from 0. The tester may go on to consider the evidence for various other positive discrepancies as well. For simplicity, consider testing H0: µ ≤ 0 vs H1: µ > 0 with known SE. I will use some numbers from a guest blog post by Stephen Senn discussing the interpretation of tests in clinical trials:
For simplicity, let the cut-off be 2 standard errors rather than 1.96. Write the cut-off for rejecting the null as D*, which in Senn’s example is .7; so we have SE ≈ .35. Computing the power of the test against different values of µ doesn’t require knowing the true value of µ; there is a power function. The test is falsificationist, and uses hypothetical reasoning. The power of this test against an alternative µ’ is the probability that D exceeds D* (.7), computed under the assumption that µ = µ’. Write this as POW(µ’).
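To fix ideas, here is a minimal sketch of that power function in Python (my own illustration, not anything from Senn’s post), assuming D is normally distributed about µ with the SE and cut-off above:

```python
# A sketch of the power function for H0: mu <= 0 vs H1: mu > 0,
# assuming D ~ N(mu, SE) with SE = 0.35 and cut-off D* = 0.7.
from scipy.stats import norm

SE = 0.35      # standard error of D (treated as known)
D_STAR = 0.7   # cut-off for rejecting the null (2 SE)

def power(mu_prime, se=SE, cutoff=D_STAR):
    """POW(mu') = Pr(D > D* ; mu = mu')."""
    return norm.sf(cutoff, loc=mu_prime, scale=se)
```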
Tests, particularly in clinical trials, are often specified to have a high probability, .8 or .9, of detecting a discrepancy from the null that “we would not like to miss”. To “miss” means that the test does not set off the “significance alarm”; that is, the result is statistically insignificant. Senn’s example stipulates that the population discrepancy we would really hate to miss is ∆ = 1. This means that were the population discrepancy ∆ = 1 or higher, we want there to be a high probability that the value of the sample D will exceed D*.
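With the numbers above, that requirement is (approximately) met, as a quick check shows:

```python
from scipy.stats import norm
# POW(1) with SE = 0.35 and D* = 0.7: Pr(D > 0.7 ; mu = 1) is about 0.80
print(round(norm.sf(0.7, loc=1.0, scale=0.35), 2))  # 0.8
```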
Note: I use the word “discrepancy” in alluding to population effect sizes and “differences” to refer to observed differences. I’m deliberately calling ∆ “the discrepancy we would really hate to miss” because “the discrepancy we would not like to miss” is often interpreted in a weaker manner than intended. In particular, it is often construed as the smallest discrepancy of interest. But this minimal discrepancy of interest would be smaller than ∆. [1] See also my commentary on Senn’s post.
Let’s now return to our test H0: µ ≤ 0 vs H1: µ > 0.
(1) The power at the null is α: POW(0) = .025 (more like .023, with our cut-off of 2 rather than 1.96).
Let’s assume for the moment that D just makes it to the cut-off D* for rejection. Then POW(0) is also equal to the significance level of the outcome. Here’s the logic of statistical significance tests put in terms of power, with D = D* (a numerical check follows (2)’):
(2) If D is just statistically significant, and its statistical significance level is low, then D indicates µ >0.
(2) is equivalent to (2)’:
(2)’ If POW(0) is low, then D* indicates µ >0.
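As a quick check of (1), (2), and (2)’ with the numbers above: when the observed D just reaches the cut-off, its significance level and the power at the null are one and the same calculation. A minimal sketch:

```python
from scipy.stats import norm

SE, D_STAR = 0.35, 0.7

def p_value(d_obs, se=SE):
    """One-sided P-value of an observed d, computed under mu = 0."""
    return norm.sf(d_obs, loc=0.0, scale=se)

def power(mu_prime, se=SE, cutoff=D_STAR):
    """POW(mu') = Pr(D > D* ; mu = mu')."""
    return norm.sf(cutoff, loc=mu_prime, scale=se)

print(round(power(0.0), 3))       # ~0.023
print(round(p_value(D_STAR), 3))  # ~0.023: the same number
```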
Of course, indications need to be supplemented by audits of assumptions, checks for biasing selection effects, and, ideally, replication. But we must first make out the intended logic of tests under the presumption that the assumptions hold approximately, and audit them separately.
(3) If it would be difficult for the test to generate a D as large as D* if µ = 0, and yet we observe D*, then D* indicates it was generated by a µ that exceeds 0.
The assertion in (3) holds not just for the null value but for positive discrepancies from 0 as well. Now a critic of tests might note: “But your test also has rather low power to detect positive discrepancies close to 0. For example:
POW(.5 SE) = .07.” [i.e., POW(.17) = .07.]
To which a tester would respond: Yes, and I can similarly infer that my D* indicates µ > .17. I reason as follows: were µ ≤ .17, then 93% of the time I’d get a smaller D than I did. That’s the logic of testing. Note too that the P-value, computed under µ = .17, is .07, and the lower confidence interval µ > .17 has confidence level .93. (These computations are collected in the sketch following this exchange.)
A critic might continue: “But your test also has rather low power to detect positive discrepancies of 1 SE.
POW(1 SE) = .16!” [i.e., POW(.35) = .16.]
To which a tester could respond: Yes, and I therefore have a weak indication that µ > .35. The P-value, computed under µ = .35, is .16, and the lower confidence interval µ > .35 has confidence level .84.
And she could go on to note: I clearly do not have evidence that µ exceeds those values against which the test has high power! Even inferring that my observing D* indicates µ > .7, on the grounds that POW(.7) = .5, would be wrong 50% of the time!
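Pulling the numbers from this exchange together (again a sketch, with the same SE = .35 and D* = .7): for each µ’, POW(µ’) is also the P-value of the just-significant D* computed under µ’, and 1 − POW(µ’) is the confidence level attached to inferring µ > µ’.

```python
from scipy.stats import norm

SE, D_STAR = 0.35, 0.7
for mu_prime in (0.0, 0.175, 0.35, 0.7, 1.0):
    pow_mu = norm.sf(D_STAR, loc=mu_prime, scale=SE)  # POW(mu') = P-value of D* under mu'
    print(f"mu' = {mu_prime:5.3f}:  POW = {pow_mu:.2f},  confidence in mu > mu' = {1 - pow_mu:.2f}")
```

The higher the power against µ’, the weaker the warrant that observing D* supplies for µ > µ’.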
I hope it is now clear why the claims at the outset turn power on its head, in relation to statistical significance tests. Senn would not say a statistically significant result is fairly good evidence that µ > 1, on the grounds that POW(1) = .8. Yet you will sometimes see medical researchers and spokespeople make literally this claim. What we can correctly say is:
(4) If it would be improbable for the test to generate a D as large as D* were µ ≤ µ0, and yet I observe D*, then D* is an indication it was generated by a µ that exceeds µ0.
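As a minimal sketch of the arithmetic behind (4) (an illustration only): for a chosen small “improbability” level, the values µ0 that an observed D* warrants exceeding are those below the corresponding one-sided lower confidence bound.

```python
from scipy.stats import norm

def warranted_lower_bound(d_star, se, alpha):
    """Largest mu0 such that Pr(D > d_star ; mu = mu0) <= alpha."""
    return d_star - norm.ppf(1 - alpha) * se

print(round(warranted_lower_bound(0.7, 0.35, 0.07), 2))  # ~0.18, roughly .5 SE
print(round(warranted_lower_bound(0.7, 0.35, 0.16), 2))  # ~0.35, i.e. 1 SE
```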
However, there is a different assertion that has a superficial resemblance to the ones I am pointing to as reversing power, and that other assertion can hold true. I discuss it in my next post. (I promise not to wait a month to write it!)
Share your questions and remarks in the comments to this post.
[1] Other construals: the minimum value of D we hope to observe, the smallest discrepancy we’d like to learn about, or still others. See this earlier Senn post.


