A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. What’s needed is not a new statistical method, but new (and correct) interpretations of existing methods. One can begin with a companion to the rule in this recent post:
(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.
To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant. (As soon as it exceeds the cut-off, the rule has to be modified.)
Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H0: µ ≤ 0 against H1: µ > 0. Here’s the companion:
(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication that µ < µ’.)
That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that µ < µ’).
An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).
EXAMPLE. Let σ = 10, n = 100, so σ/√n = 1. Test T+ rejects H0 at the .025 level if M > 1.96(σ/√n) = 1.96. For simplicity, let the cut-off, M*, be 2. Let the observed mean M0 just reach the cut-off 2.
POWER: POW(T+,µ’) = POW(Test T+ rejects H0;µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]
The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*. Using one of our power facts, POW(T+, µ = M* + 1(σ/√n)) = .84.
That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.
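This power fact can be checked directly. Here is a minimal sketch in Python, using the standard library’s `NormalDist`; the helper name `power` is mine, not the post’s, and the cut-off and SE are the example’s values (M* = 2, σ/√n = 1):

```python
from statistics import NormalDist

# Test T+: reject H0: mu <= 0 when the sample mean M exceeds the cut-off M*.
# With sigma = 10 and n = 100, the standard error sigma/sqrt(n) = 1.

def power(mu_prime, cutoff=2.0, se=1.0):
    """POW(T+, mu') = Pr(M > cutoff; mu'), where M ~ N(mu', se)."""
    return 1 - NormalDist(mu=mu_prime, sigma=se).cdf(cutoff)

print(round(power(3.0), 2))   # 0.84 -- alternative one SE above the cut-off
print(round(power(2.0), 2))   # 0.5  -- alternative at the cut-off itself
print(round(power(0.0), 3))   # 0.023 -- at the null, power is (about) alpha
```

The three printed values trace the range described above: power is roughly α at the null, .5 at the cut-off, and .84 one standard error beyond it.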
By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result with probability .84. The upper .84 confidence limit is 3. The significant result is even better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
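The duality between these power assessments and one-sided upper confidence bounds can be sketched the same way. This is an illustrative helper (the name `upper_bound` is mine), assuming the example’s M0 = 2 and SE = 1:

```python
from statistics import NormalDist

# Upper 1-sided confidence bound dual to test T+:
# at level c, the bound is M0 + z_c * SE, with z_c the c-quantile of N(0,1).

def upper_bound(m0, level, se=1.0):
    """Upper confidence bound at the given level: M0 + z_level * SE."""
    z = NormalDist().inv_cdf(level)
    return m0 + z * se

print(round(upper_bound(2.0, 0.84), 1))   # 3.0 -- matches POW(T+, 3) = .84
print(round(upper_bound(2.0, 0.975), 1))  # 4.0 -- approx., z = 1.96
```

Note the agreement with the text: the .84 upper limit lands on the alternative against which the test has power .84.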
Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (Only, (2) should be custom-tailored to the outcome, not the cut-off.) In the case of statistical insignificance, (2) is essentially ordinary power analysis. (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, that’s a context-dependent consideration, often stipulated in regulatory statutes.
Rule (2) also provides a way to distinguish values within a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).
At present, power analysis is only used to interpret negative results–and there it is often confused with “retrospective power” (what I call shpower). Again, confidence bounds could be, but currently are not, used to this end (but rather the opposite [iii]).
Severity replaces M* in (2) with the actual result, be it significant or insignificant.
Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to custom tailor results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
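The custom-tailoring that severity performs can be sketched by swapping the fixed cut-off M* for the observed mean M0. This is a hedged illustration under the example’s assumptions (SE = 1; the helper name `severity` is mine):

```python
from statistics import NormalDist

# Severity for the claim mu < mu' after a rejection: replace the cut-off M*
# with the actual observed mean M0, as described in the post.

def severity(mu_prime, m0, se=1.0):
    """SEV(mu < mu') = Pr(M > M0; mu = mu'), where M ~ N(mu', se)."""
    return 1 - NormalDist(mu=mu_prime, sigma=se).cdf(m0)

# When M0 just reaches the cut-off (M0 = M* = 2), severity agrees with power:
print(round(severity(3.0, m0=2.0), 2))   # 0.84, as in the example
# A larger observed mean gives weaker grounds for the same claim mu < 3:
print(round(severity(3.0, m0=2.5), 2))   # 0.69
```

The second value shows the coarseness being corrected: a result further beyond the cut-off warrants a higher (not the same) upper bound.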
One more thing:
Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones. There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.
[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.
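A quick numeric check of this identity (illustrative only): with p* = .025, the event {P < p*} is exactly the event {M > M*}, since the one-sided P-value is a decreasing function of the observed mean.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal; SE = 1 as in the example

def p_value(m):
    """One-sided P-value of observed mean m under H0: mu = 0."""
    return 1 - Z.cdf(m)

p_star = 0.025
m_star = Z.inv_cdf(1 - p_star)          # the cut-off M* in SE units, approx. 1.96

# Means just above/below M* give P-values just below/above p*:
print(p_value(m_star + 0.01) < p_star)  # True
print(p_value(m_star - 0.01) > p_star)  # True
```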
[ii] It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0.) Most simply, you may consider that the inference is in terms of the one-sided upper confidence bound (for various confidence levels)–the dual for test T+.
[iii] That is, upper confidence bounds are viewed as “plausible” bounds, and as values for which the data provide positive evidence. Only once you get to an upper bound at confidence levels of around .6, .7, .8, etc. do you actually have evidence that µ < CI-upper. See this post.
[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.
OTHER RELEVANT POSTS ON POWER