Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H0: µ ≤ 0 against H1: µ > 0.
I have heard some people say:
A. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). ◊See point on language in notes.
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.
I have heard other people say:
B. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.
Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?
Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post:
the probability of correctly rejecting the null
–which is both ambiguous and fails to specify the all-important conjectured alternative. [For handholding slides on power, please see this post.] That you compute power for several alternatives is not the slightest bit problematic; it’s precisely what you want to do in order to assess the test’s capability to detect discrepancies. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?
It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels), the dual for test T+.
DEFINITION: POW(T+, µ’) = Pr(Test T+ rejects H0; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous it doesn’t matter if we write > or ≥.) I’ll leave off the T+ and write POW(µ’).
In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.
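The equivalence of the two forms can be checked numerically. What follows is only a minimal sketch: the standard error of 1 and the cut-off of 2 are illustrative values assumed for the check, and the function names are hypothetical, not from any library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1

SE = 1.0       # assumed standard error sigma/sqrt(n) (illustrative value)
M_STAR = 2.0   # assumed cut-off for rejection (illustrative value)
MU_0 = 0.0     # boundary of the null H0: mu <= 0


def p_value(m):
    """One-sided P-value for an observed sample mean m, computed at mu = MU_0."""
    return 1.0 - STD_NORMAL.cdf((m - MU_0) / SE)


P_STAR = p_value(M_STAR)  # the P-value attained exactly at the cut-off


def rejects_by_mean(m):
    """Rejection rule stated in terms of the sample mean: M > M*."""
    return m > M_STAR


def rejects_by_p(m):
    """Rejection rule stated in terms of the P-value: P < p*."""
    return p_value(m) < P_STAR


# The two rules pick out the same sample means, so Pr(M > M*; mu') and
# Pr(P < p*; mu') are the same probability for every mu'.
grid = [i / 10 for i in range(-30, 61)]
agree = all(rejects_by_mean(m) == rejects_by_p(m) for m in grid)
print(agree)
```

The check is over a grid of possible sample means rather than by simulation, so nothing depends on a random seed.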
Let σ = 10, n = 100, so (σ/ √n) = 1. (Nice and simple!) Test T+ rejects H0 at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2.
Test T+ rejects H0 at ~ .025 level if M > 2.
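For readers who want to check the arithmetic, here is a rough sketch of the setup. The variable and function names are hypothetical choices for illustration; only σ = 10, n = 100, α = .025 and the rounded cut-off of 2 come from the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1

sigma, n = 10.0, 100
se = sigma / sqrt(n)                      # standard error, equal to 1 here

alpha = 0.025
z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # about 1.96
exact_cutoff = z_alpha * se               # cut-off for an exact .025-level test

rounded_cutoff = 2.0                      # the simplified cut-off M*
# Actual level of the test that rejects when M > 2: Pr(M > 2; mu = 0)
actual_level = 1 - STD_NORMAL.cdf(rounded_cutoff / se)


def power(mu_alt, cutoff=rounded_cutoff):
    """POW(mu') = Pr(M > M*; mu'), where M ~ Normal(mu', se)."""
    return 1 - STD_NORMAL.cdf((cutoff - mu_alt) / se)


print(round(exact_cutoff, 2), round(actual_level, 3))
```

Rounding the cut-off up to 2 makes the actual level slightly smaller than .025 (about .023), which is why the level is written with a "~".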
CASE 1: We need a µ’ such that POW(µ’) = low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:
1. POW(µ = 0) = α = .025.
Since the probability of M > 2, under the assumption that µ = 0, is low, the significant result indicates µ > 0. That is, since power against µ = 0 is low, the statistically significant result is a good indication that µ > 0.
Equivalently, 0 is the lower bound of a .975 confidence interval.
2. For a second example of low power that does not use the null: We get power of .04 if µ’ = M* – 1.75(σ/√n), which in this case is 2 – 1.75 = .25. That is, POW(.25) = .04.[ii]
Equivalently, µ > .25 is the lower confidence interval (CI) estimate at level .96 (this is the CI that is dual to the test T+).
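The low-power cases can be tabulated side by side with the matching confidence levels. This is a sketch under stated assumptions: the observed mean is taken to be exactly the rounded cut-off of 2 (the "just significant" case), and the helper names are hypothetical. Because the cut-off is 2 rather than 1.96, the figures for µ = 0 come out as roughly .023 and .977 instead of .025 and .975.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1
SE = 1.0        # sigma/sqrt(n) with sigma = 10, n = 100
M_STAR = 2.0    # rounded cut-off, also taken as the just-significant observed mean


def power(mu_alt):
    """POW(mu') = Pr(M > M*; mu')."""
    return 1 - STD_NORMAL.cdf((M_STAR - mu_alt) / SE)


def lower_bound_level(mu_alt, observed_mean=M_STAR):
    """Confidence level at which mu > mu_alt is the one-sided lower bound."""
    return STD_NORMAL.cdf((observed_mean - mu_alt) / SE)


# Alternatives from the null (0) up to the cut-off (2): power runs from
# about alpha up to .5, while the confidence level runs the other way.
for mu_alt in (0.0, 0.25, 2.0):
    print(mu_alt, round(power(mu_alt), 3), round(lower_bound_level(mu_alt), 3))
```

For a just-significant result the two columns always sum to 1, which is the sense in which low power against µ’ goes with a high confidence level for µ > µ’.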
CASE 2: We need a µ’ such that POW(µ’) = high. Using one of our power facts, POW(M* + 1(σ/ √n)) = .84.
3. That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So µ = 2 + 1 will work: POW(T+, µ = 3) = .84. See this post.
Should we say that the significant result is a good indication that µ > 3? No, the confidence level would be .16.
Pr(M > 2; µ = 3 ) = Pr(Z > -1) = .84. It would be terrible evidence for µ > 3!
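The high-power case can be checked the same way. Again this is only a sketch, assuming the observed mean sits exactly at the rounded cut-off of 2; the variable names are hypothetical, and the final line (the lower bound that the same result does license at level .84) is added here for contrast rather than taken from the discussion above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1
SE = 1.0        # sigma/sqrt(n) with sigma = 10, n = 100
M_STAR = 2.0    # rounded cut-off, taken as the just-significant observed mean

mu_alt = M_STAR + 1 * SE    # one standard error above the cut-off: mu' = 3

power_at_3 = 1 - STD_NORMAL.cdf((M_STAR - mu_alt) / SE)   # Pr(Z > -1)
level_for_3 = STD_NORMAL.cdf((M_STAR - mu_alt) / SE)      # Pr(Z < -1)

# For contrast: the lower bound the same result supports at confidence .84.
bound_at_84 = M_STAR - STD_NORMAL.inv_cdf(0.84) * SE

print(round(power_at_3, 2), round(level_for_3, 2), round(bound_at_84, 2))
```

So a confidence level of .84 goes with a bound of roughly µ > 1, not µ > 3; the claim µ > 3 gets only about .16.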
Blue curve is the null, red curve is one possible conjectured alternative: µ = 3. Green area is power, little turquoise area is α.
As Stephen Senn points out (in my favorite of his posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta Δ. Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84).
So the correct answer is B.
Does A hold true if we happen to know (based on previous severe tests) that µ < µ’? I’ll return to this.
◊Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data. Perhaps the strict definition should be employed unless one is clear on this. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.
A comment by Michael Lew brings out my presumption of a given α level, since that is the context in which this matter arises. (I actually have a further destination in mind, indicated in red, and it won’t be clear until a follow-up.) Back to Lew’s query: power is always defined for the “worst case” of just reaching the α level cut-off, which is why I prefer the data-dependent severity, or associated confidence interval. Lew’s reference to a likelihood analysis underscores the need to indicate how one is evaluating evidence. Although I indicated “practicing within the error statistical tribe”, perhaps that was too vague. Still, the interpretation in the case he gives is not very different (except that I wouldn’t favor x over µ’). I thank Lew for his graphic.
[i] I surmise, without claiming a scientific data base, that this fallacy has been increasing over the past few years. It was discussed way back when in Morrison and Henkel (1970). (A relevant post relates to a Jackie Mason comedy routine.) Research was even conducted to figure out how psychologists could be so wrong. Wherever I’ve seen it, it’s due to (explicitly or implicitly) transposing the conditional in a Bayesian use of power. For example, (1 – β)/α is treated as a kind of likelihood in a Bayesian computation. I say this is unwarranted, even for a Bayesian’s goal; see the 2/10/15 post below.
[ii] Pr(M > 2; µ = .25 ) = Pr(Z > 1.75) = .04.
OTHER RELEVANT POSTS ON POWER
- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- (3/17/14) Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliac and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)
- (02/10/15) What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
I haven’t yet encountered A or B, but I often encounter
C. If the test’s power to detect an alternative µ’ to µ is high, then a statistically nonsignificant x is good evidence for µ (or good evidence favoring µ over µ’).
But (as you know) it is easy to construct counterexamples in which P>0.05 yet by all common evidence measures such as likelihood ratios or P-value functions, the evidence favors µ’ over µ.
How would you describe these high-power fallacies in the “tribal” terms you used above?
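A counterexample of the kind mentioned under C can be sketched numerically. Every number here is an assumption chosen for illustration, not taken from the discussion: a one-sided test at the .05 level with standard error 1, µ = 0, a high-power alternative µ’ = 3, and an observed mean of 1.6 that falls just short of significance.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1

SE = 1.0          # assumed standard error (illustrative)
ALPHA = 0.05      # assumed one-sided significance level (illustrative)
MU_NULL = 0.0     # the value mu in claim C
MU_ALT = 3.0      # assumed high-power alternative mu' (illustrative)
OBSERVED = 1.6    # assumed observed sample mean, just short of the cut-off

cutoff = MU_NULL + STD_NORMAL.inv_cdf(1 - ALPHA) * SE       # about 1.645

# Power against mu': Pr(M > cutoff; mu = MU_ALT)
power_alt = 1 - STD_NORMAL.cdf((cutoff - MU_ALT) / SE)

# One-sided P-value of the observed mean, computed at mu = MU_NULL
p_val = 1 - STD_NORMAL.cdf((OBSERVED - MU_NULL) / SE)

# Likelihood ratio of mu' to mu at the observed mean
lik_alt = NormalDist(MU_ALT, SE).pdf(OBSERVED)
lik_null = NormalDist(MU_NULL, SE).pdf(OBSERVED)
likelihood_ratio = lik_alt / lik_null

print(round(power_alt, 2), round(p_val, 3), round(likelihood_ratio, 2))
```

Under these assumptions the power against µ’ is above .9 and the result is nonsignificant (P just over .05), yet the likelihood ratio is above 1, so by that measure the data favor µ’ over µ. This happens whenever the observed mean lies above the midpoint between µ and µ’ but below the cut-off.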