This is a modified reblog of an earlier post, since I keep seeing papers that confuse this.
Suppose you are reading about a result x that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H0: µ ≤ 0 against H1: µ > 0.
I have heard some people say:
A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’).* See point on language in notes.
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.
I have heard other people say:
B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.
Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?
(Note the qualification that would arise if you were only told the result was statistically significant at some level less than or equal to α rather than, as I intend, that it is just significant at level α, discussed in a comment due to Michael Lew here.)
Allow that the test assumptions are adequately met (though usually, this is what’s behind the problem). I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post: “the probability of correctly rejecting the null”, which is both ambiguous and fails to specify the all-important conjectured alternative. [For slides explaining power, please see this post.] That you compute power for several alternatives is not the slightest bit problematic; it’s precisely what you want to do in order to assess the test’s capability to detect discrepancies. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?
It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ or the like. They are not to point values! (Not even to the point µ = M0, the observed sample mean.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels), the dual for test T+.
DEFINITION: POW(T+, µ’) = POW(Test T+ rejects H0; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection at level α. (Since M is continuous, it doesn’t matter if we write > or ≥.) I’ll leave off the T+ and write POW(µ’).
In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.
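The two ways of writing the definition can be checked against each other in a few lines. What follows is a minimal sketch, not part of the original post, using Python's standard-library `statistics.NormalDist`. The function names `power` and `p_value` and the values used (standard error 1, cut-off 2, null value 0) are assumptions made for illustration.

```python
from statistics import NormalDist

def power(mu_alt, cutoff, se):
    """POW(mu') = Pr(M > M*; mu'), where M ~ Normal(mu', se)."""
    return 1.0 - NormalDist(mu_alt, se).cdf(cutoff)

def p_value(m_obs, se, mu0=0.0):
    """One-sided P-value Pr(M >= m_obs; mu = mu0) for test T+."""
    return 1.0 - NormalDist(mu0, se).cdf(m_obs)

# Illustrative values (assumptions for this sketch).
se, cutoff = 1.0, 2.0
p_star = p_value(cutoff, se)  # P-value threshold matching the cut-off M*

# "M > M*" and "P < p*" pick out the same sample means,
# so both forms of POW(mu') describe the same rejection event.
same_event = all((m > cutoff) == (p_value(m, se) < p_star)
                 for m in (0.5, 1.9, 2.1, 3.0))
print(same_event)
print(round(power(1.0, cutoff, se), 3))
```

Note that `power(0.0, cutoff, se)` coincides with `p_star`: the power against the null value is just the significance level.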
Let σ = 10, n = 100, so (σ/ √n) = 1. Test T+ rejects H0 at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2.
Test T+ rejects H0 at ~ .025 level if M > 2.
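For readers who want to see what rounding the cut-off from 1.96 to 2 does to the level, here is a small sketch (again an illustration added for convenience, using only the standard library; the variable names are assumptions).

```python
from statistics import NormalDist

sigma, n, alpha = 10.0, 100, 0.025
se = sigma / n ** 0.5                       # sigma / sqrt(n) = 1.0
z = NormalDist()                            # standard Normal

exact_cutoff = z.inv_cdf(1.0 - alpha) * se  # about 1.96
level_at_2 = 1.0 - z.cdf(2.0 / se)          # about 0.023

print(f"exact cut-off for alpha = {alpha}: {exact_cutoff:.3f}")
print(f"actual level when M* = 2: {level_at_2:.4f}")
```

So with M* = 2 the test is slightly more stringent than the nominal .025, which is why the level is written as ~ .025.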
CASE 1: We need a µ’ such that POW(µ’) = low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:
1. POW(µ = 0) = α = .025.
Since the probability of M > 2, under the assumption that µ = 0, is low, the just significant result indicates µ > 0. That is, since power against µ = 0 is low, the statistically significant result is a good indication that µ > 0.
Equivalently, 0 is the lower bound of a .975 confidence interval.
2. For a second example of low power that does not use the null: We get power of .04 if µ’ = M* – 1.75(σ/√n), which in this case is 2 – 1.75 = .25. That is, POW(.25) = .04.[ii]
Equivalently, µ > .25 is the one-sided lower confidence interval (CI) at level .96 (this is the CI that is dual to the test T+).
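The duality between low power and the lower confidence bound in examples 1 and 2 can be verified numerically. The sketch below is an illustration under the running example's values (σ/√n = 1, M* = 2, observed mean equal to M*); the helper names `power` and `lower_bound` are assumptions. Because the cut-off was rounded to 2, the level matching µ’ = 0 comes out near .977 rather than exactly .975.

```python
from statistics import NormalDist

SE, CUTOFF = 1.0, 2.0   # sigma/sqrt(n) and M* from the running example

def power(mu_alt):
    """POW(mu') = Pr(M > M*; mu')."""
    return 1.0 - NormalDist(mu_alt, SE).cdf(CUTOFF)

def lower_bound(m_obs, level):
    """One-sided lower confidence bound: mu > m_obs - z_level * SE."""
    return m_obs - NormalDist().inv_cdf(level) * SE

results = []
for mu_alt in (0.0, 0.25):
    pow_alt = power(mu_alt)
    # The observed mean is just significant, i.e. equal to the cut-off.
    bound = lower_bound(CUTOFF, 1.0 - pow_alt)
    results.append((mu_alt, pow_alt, bound))
    print(f"mu' = {mu_alt:.2f}  POW = {pow_alt:.3f}  "
          f"lower bound at level {1.0 - pow_alt:.3f}: {bound:.3f}")
```

In each row the lower bound at confidence level 1 – POW(µ’) lands back on µ’ itself, which is the sense in which the CI is dual to the test.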
CASE 2: We need a µ’ such that POW(µ’) = high. Using one of our power facts, POW(M* + 1(σ/ √n)) = .84.
3. That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So µ = 2 + 1 will work: POW(T+, µ = 3) = .84. See this post.
Should we say that the significant result is a good indication that µ > 3? No, there’s a high probability (.84) you’d have gotten a larger difference than you did, were µ > 3.
Pr(M > 2; µ = 3 ) = Pr(Z > -1) = .84. It would be terrible evidence for µ > 3!
[Figure: the blue curve is the null; the red curve is one possible conjectured alternative, µ = 3. The green area is the power; the little turquoise area is α.]
Note that the evidence our result affords µ > µ’ gets worse and worse as we pull µ further and further to the right, even though in so doing we’re increasing the power.
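To make this pattern concrete, one can tabulate POW(µ’) next to Pr(M ≤ M*; µ’) = 1 – POW(µ’), the probability of a result no larger than the just-significant one were µ = µ’. This is only a sketch under the running example's values (σ/√n = 1, M* = 2); the grid of alternatives is an arbitrary choice for illustration.

```python
from statistics import NormalDist

SE, CUTOFF = 1.0, 2.0   # sigma/sqrt(n) and M* from the running example

def power(mu_alt):
    """POW(mu') = Pr(M > M*; mu')."""
    return 1.0 - NormalDist(mu_alt, SE).cdf(CUTOFF)

rows = []
for mu_alt in (0.0, 0.25, 1.0, 2.0, 3.0, 4.0):
    pow_alt = power(mu_alt)
    # Probability of a result no larger than the one observed, were mu = mu'.
    no_larger = 1.0 - pow_alt
    rows.append((mu_alt, pow_alt, no_larger))
    print(f"mu' = {mu_alt:4.2f}  POW(mu') = {pow_alt:.3f}  "
          f"Pr(M <= M*; mu') = {no_larger:.3f}")
```

As µ’ moves to the right the power column climbs toward 1 while the second column shrinks toward 0: the higher the power against µ’, the more probable a larger difference than the one observed would have been, and the worse the warrant for µ > µ’.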
As Stephen Senn points out (in my favorite of his guest posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta Δ. Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84).
So the correct answer is B.
Does A hold true if we happen to know (based on previous severe tests) that µ < µ’?
No, but it does allow some legitimate ways to mount complaints based on a significant result with low power to detect a known discrepancy.
(1) It does mean that if M* (the cut-off for a result just statistically significant at level α) is used as an estimate of µ and we know that µ < M*, then M* is larger than µ. So your observed result, which we’re assuming is M*, “exaggerates” µ, were you to use it as an estimate, rather than using the lower limit of a confidence bound. This is not power analysis.
(2) However, knowing the test had little capability of detecting the effect sizes deemed true, it might raise the question of whether the researchers cheated, and this, I claim, is a plausible reason to suspect the result. If the study is in a field known to have lots of researcher flexibility, and the power to detect discrepancies in the ballpark known independently to be correct is low, then you might rightly suspect that they cheated. They reported only the one impressive result after trying and trying again, or tampered with the discretionary points of the study to achieve nominal significance. This is a different issue, and doesn’t change my answer. More generally, it’s because the answer is B that the only way to raise the criticism legitimately is to challenge the assumptions of the test.
Why do people raise the criticism illegitimately? In some circles it’s a direct result of trying to do a Bayesian computation and setting about to compute Pr(µ = µ’|M0 = M*) using POW(µ’)/α as a kind of likelihood ratio in favor of µ’. I say this is unwarranted, even for a Bayesian’s goal; see the 2/10/15 and 5/22/16 posts below. Notice that supposing the probability of a type I error goes down as power increases is at odds with the trade-off that we know holds between these error rates. So this immediately indicates a different use of terms.
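The trade-off between the error rates is easy to check numerically. In the sketch below (an illustration, not a computation from the cited posts), n and σ are held fixed as in the running example, the alternative is taken to be µ’ = 3, and the three cut-offs are hypothetical values chosen only to show the direction of the effect.

```python
from statistics import NormalDist

SE, MU_ALT = 1.0, 3.0   # sigma/sqrt(n) and the alternative mu' (illustrative)

def tail_prob(mu, cutoff):
    """Pr(M > cutoff; mu) with M ~ Normal(mu, SE)."""
    return 1.0 - NormalDist(mu, SE).cdf(cutoff)

tradeoff = []
for cutoff in (2.5, 2.0, 1.5):          # progressively lower hurdles
    alpha = tail_prob(0.0, cutoff)      # type I error probability
    pow_alt = tail_prob(MU_ALT, cutoff) # power against mu' = 3
    tradeoff.append((cutoff, alpha, pow_alt))
    print(f"M* = {cutoff:.1f}  alpha = {alpha:.4f}  POW(3) = {pow_alt:.4f}")
```

Lowering the cut-off raises the power and raises α along with it. For a fixed sample size, then, more power goes with a larger type I error probability, never a smaller one.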
*Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data, obviously. Perhaps the strict definition should be employed unless one is clear on this. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.
Since power, rather than the data-dependent computation of severity, is what’s used in the articles we’re considering, I take the result to be just significant at level α, which makes the two assessments equal.
[i] I surmise, without claiming a scientific database, that this fallacy has been increasing over the past few years. It was discussed way back when in Morrison and Henkel (1970). (A relevant post relates to a Jackie Mason comedy routine.) Research was even conducted to figure out how psychologists could be so wrong. Nowadays it’s because of the confusion introduced by computing a “positive predictive value” and transposing the ‘conditional’.
[ii] Pr(M > 2; µ = .25 ) = Pr(Z > 1.75) = .04.
OTHER RELEVANT POSTS ON POWER
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- (3/17/14) Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (1/3/15) No headache power (for Deirdre)
- (2/10/15) What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
- (5/22/16) Frequentstein: What’s wrong with taking (1 – β)/α, as a measure of evidence against the null?