A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant **x** is a *good* indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result *is just statistically significant* at a level α. (As soon as the observed difference exceeds the cut-off, the rule has to be modified.)

Rule (1) was stated in relation to a statistically significant result **x** (at level α) from a one-sided test T+ of the mean of a Normal distribution with *n* iid samples, and (for simplicity) known σ: *H*_{0}: µ ≤ 0 against *H*_{1}: µ > 0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant **x** is a *good* indication that µ < µ’.

(The higher the POW(T+,µ’) is, the better the indication that µ < µ’.) That is, if the test’s power to detect alternative µ’ is high, then the statistically significant **x** is a *good* indication (or good evidence) that the discrepancy from null is *not* as large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

**POWER**: POW(T+,µ’) = POW(Test T+ rejects *H*_{0}; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

**EXAMPLE**. Let σ = 10, *n* = 100, so (σ/√*n*) = 1. Test T+ rejects *H*_{0} at the .025 level if M > 1.96(1).

Find the power against µ = 2.8. To find Pr(M > 1.96; 2.8), get the standard Normal z = (1.96 – 2.8)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is .8. So POW(T+,2.8) = .8.
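The power calculation in this example can be sketched in a few lines of Python. The function name `power` and its parameter names are illustrative choices, not notation from the post. The sketch evaluates the alternative µ = 2.8, for which z = (1.96 – 2.8)/1 = -.84.

```python
from math import sqrt
from statistics import NormalDist

def power(mu_alt, cutoff, sigma, n):
    """Pr(M > cutoff; mu_alt), where the sample mean M ~ N(mu_alt, sigma^2 / n)."""
    se = sigma / sqrt(n)           # standard error of the mean
    z = (cutoff - mu_alt) / se     # standardized cut-off under mu_alt
    return 1.0 - NormalDist().cdf(z)   # area to the right of z

sigma, n = 10.0, 100
cutoff = 1.96 * sigma / sqrt(n)    # rejection cut-off M* at the .025 level
pow_28 = power(2.8, cutoff, sigma, n)
print(round(pow_28, 2))
```

Evaluating the same function at µ = 0 returns the significance level (about .025), and at µ equal to the cut-off it returns .5.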

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M_{0} just reach the cut-off 2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values. Power fact, POW(M* + 1(σ/√*n*)) = .84.

That is, adding one (σ/√*n*) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.
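To see how power moves as the alternative passes the cut-off, here is a small sketch that tabulates POW(T+,µ’) for a few alternatives, assuming the simplified cut-off M* = 2 and (σ/√*n*) = 1 from the example. The helper name `pow_at` is hypothetical.

```python
from statistics import NormalDist

SE = 1.0        # sigma / sqrt(n) with sigma = 10, n = 100
M_STAR = 2.0    # simplified cut-off used in the example

def pow_at(mu_alt, cutoff=M_STAR, se=SE):
    """POW(T+, mu_alt) = Pr(M > cutoff; mu_alt)."""
    return 1.0 - NormalDist(mu=mu_alt, sigma=se).cdf(cutoff)

# Alternatives below, at, and above the cut-off.
for mu_alt in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"mu' = {mu_alt:.0f}: POW = {pow_at(mu_alt):.3f}")
```

The values run from roughly .02 at the null, through .5 at the cut-off itself, to about .84 one standard-error unit above it.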

By (2), the (just) significant result **x** is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result with probability at least .84. The upper .84 confidence limit is 3. The significant result is much better evidence that µ < 4: the upper .975 confidence limit is 4 (approx.), etc.
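The upper confidence limits mentioned here can be reproduced with the usual one-sided formula, upper bound = M + z·(σ/√*n*). This is a minimal sketch assuming the observed mean sits at the cut-off, M = 2, with (σ/√*n*) = 1. The name `upper_bound` is illustrative.

```python
from statistics import NormalDist

def upper_bound(m_obs, se, level):
    """One-sided upper confidence bound: mu < m_obs + z_level * se."""
    return m_obs + NormalDist().inv_cdf(level) * se

m_obs, se = 2.0, 1.0
for level in (0.84, 0.975):
    print(level, round(upper_bound(m_obs, se, level), 2))
```

The .84 bound comes out just under 3 and the .975 bound just under 4, matching the approximate figures in the text.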

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical *in*significance, (2) is essentially ordinary *power analysis.* (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ’, it is not an indication they regard µ’ as plausible, likely, expected, probable or the like. Yet we often hear people say “if statistical testers set .8 power to detect µ = 2.8 (in test T+), they must regard µ = 2.8 as probable in some sense”. No, in no sense. Another thing you might hear is, “when *H*_{0}: µ ≤ 0 is rejected (at the .025 level), it’s reasonable to infer µ > 2.8”, or “testers are comfortable inferring µ ≥ 2.8”. No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound, if the confidence level is to exceed .5. For example, µ > .5 is a lower confidence bound at confidence level .93.
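A quick numerical check of the two figures in this note: the confidence level attached to a one-sided lower bound µ > µ’ given an observed mean M is Φ((M – µ’)/(σ/√*n*)). The function name `lower_bound_level` is hypothetical, and the sketch uses µ’ = 2.8, the alternative against which the test has power .8.

```python
from statistics import NormalDist

def lower_bound_level(m_obs, mu_lower, se):
    """Confidence level of the one-sided claim mu > mu_lower, given mean m_obs."""
    return NormalDist().cdf((m_obs - mu_lower) / se)

se = 1.0
level_half = lower_bound_level(2.0, 0.5, se)    # claim: mu > 0.5, with M = 2
level_28 = lower_bound_level(1.96, 2.8, se)     # claim: mu > 2.8, with M = 1.96
print(round(level_half, 2), round(level_28, 2))
```

The first level is about .93. The second is about .2, which is the sense in which inferring µ > 2.8 from a just-significant result would be wrong with probability of roughly .8.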

Rule (2) also provides a way to distinguish values *within* a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results, and there it is often called “retrospective power” (which is a fine term, but it’s often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

**Severity replaces M* in (2) with the actual result, be it significant or insignificant.**

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to *custom tailor* results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
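As a rough illustration of what tailoring to the observed data changes, the sketch below evaluates the claim µ < 3 for several observed means, using the form Pr(M > M_{0}; µ = µ’) in place of Pr(M > M*; µ’). That form is an assumption here, taken to mirror the severity definition for upper bounds quoted in the comments. The name `sev_less_than` is illustrative.

```python
from statistics import NormalDist

def sev_less_than(mu_prime, m_obs, se):
    """Assumed form: SEV(mu < mu_prime) = Pr(M > m_obs; mu = mu_prime)."""
    return 1.0 - NormalDist(mu=mu_prime, sigma=se).cdf(m_obs)

se = 1.0   # sigma / sqrt(n) from the running example
for m_obs in (2.0, 3.0, 4.0):
    print(m_obs, round(sev_less_than(3.0, m_obs, se), 2))
```

With the mean at the cut-off (M_{0} = 2) the value equals the power against µ’ = 3, about .84. It drops to .5 when M_{0} = 3 and to about .16 when M_{0} = 4, which is why the power-based rule has to be modified once the result exceeds the cut-off.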

*One more thing:*

**Applying (1) and (2) requires the error probabilities to be actual** (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. *If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones.* There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ’ = µ_{0} + δ, or µ ≤ µ’ = µ_{0} + δ or the like. They are *not* to point values! (Not even to the point µ = M_{0}.) Take a look at the alternative *H*_{1}: µ > 0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as “plausible” bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- (3/17/14) Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliac and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)

Wouldn’t this reasoning be essentially equivalent to the following? Probably I’m missing something…

Let σ=10, n=100, so (σ/√n) = 1. Let the observed mean M_0, a random variable distributed as N(µ,1), be 2.

It is decent evidence that µ<3, because M_0<3 and the one-sided p-value for H1:µ<3 against H0:µ=3 is 0.16.

It is even better evidence that µ<4, because the one-sided p-value for H1:µ<4 against H0:µ=4 is 0.023.
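The two p-values quoted in this comment can be checked with a short sketch, assuming M_0 = 2 and a standard error of 1. The function name `p_value_lower` is hypothetical.

```python
from statistics import NormalDist

def p_value_lower(m_obs, mu_null, se):
    """One-sided p-value for H0: mu = mu_null against H1: mu < mu_null."""
    return NormalDist(mu=mu_null, sigma=se).cdf(m_obs)

p3 = p_value_lower(2.0, 3.0, 1.0)   # H0: mu = 3
p4 = p_value_lower(2.0, 4.0, 1.0)   # H0: mu = 4
print(round(p3, 2), round(p4, 3))
```

These are the complements of the power values .84 and about .977 from the post, which is the numerical agreement the comment is pointing to.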

Carlos: It’s important to see the interpretation as an interpretation of the given test, and not change the test or the hypotheses, even though, obviously there are relationships to what might be inferred from related tests, but we’re asking a somewhat different question. The idea is to first have an indicated effect, and then rule out upper bounds.

The test either rejects or does not reject the hypothesis µ=0. If you try to interpret the result as evidence for or against µ<3, aren't you changing the hypothesis?

What question is being asked exactly? You say "(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’." But then you don't talk about "a statistically significant x" anymore, the question you answer is about the specific value of x sitting at the threshold of significance.

You say "By (2), the (just) significant result x is decent evidence that µ< 3". Would you also say that "By (2), the (very) significant result x=10 is decent evidence that µ< 3"?

I don't see how (2) is valid in general for "a α statistically significant x". For that, you would need to make some strong assumptions about the distribution of the "α statistically significant x".

Carlos: As I wrote: “let’s suppose that our result is just statistically significant. (As soon as it exceeds the cut-off the rule has to be modified). ”

and

“Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”.”

The rules are then

(1) If POW(T+,µ’) is low, then x=M* (the significance cutoff for T+) is a good indication that µ>µ’.

(2) If POW(T+,µ’) is high, then x=M* (the significance cutoff for T+) is a good indication that µ<µ’.

Given that POW(T+,µ’) is less than 50% if and only if M* is above µ’ (the lower the power, the higher the difference) and POW(T+,µ’) is more than 50% if and only if M* is below µ’ (the higher the power, the higher the difference), then the rules essentially say: x greater than µ’ is an indication that µ greater than µ’ and x less than µ’ is an indication that µ less than µ’. I agree.

I don’t understand the second remark. What test assumptions are sufficiently well met? Assuming that H0 is true and µ=0, then anything can be taken as an indication that µ is less than µ’.

Thanks for your replies. I still don’t understand the point of the exercise, but it’s probably because I don’t see why anyone would take the rejection of the null hypothesis as evidence for an arbitrary alternative in the first place. The test depends on the null hypothesis H0:µ=0, on the sampling distribution of the statistic, on the one-sided/two-sided distinction, and on the significance level α. It does not depend at all on the stated alternative hypothesis. The result of the test will be the same whether the alternative is H1: µ greater than 0, H1: µ=42, or H1: ”µ greater than 0 and the Moon is made of cheese”.

Carlos: I think you are missing the fact that in a N-P style hypothesis test, the alternative is the statistical complement of the null, and the test properties such as power would reflect that. Let me reference this paper:

2006Mayo_Spanos_severe_testing.pdf

I may come back to this tomorrow.

Can you develop rules that will let the investigator know when to prefer the estimate of the parameter of interest provided by the data over the pre-specified $\mu_0$ and $\mu’$?

Michael: it isn’t a matter of preference over the prespecified hypotheses, but interpreting the pre-specified hypotheses. More commonly, one tests, then estimates which is ok, but many of the problems we hear about concern interpreting tests. I was just talking about some of the ways to avoid fallacies of rejection. I do not advocate changing the hypotheses post data. There is, of course, some pedagogical value to seeing what the inference would have been with a different set of prespecified hypotheses.

I agree that the only alternative that makes sense is the statistical complement of the null. I don’t think that the test should be used to make any inference about other hypotheses. You seem to agree to some extent, because you say that to make an inference about the new hypothesis “H: the discrepancy (from µ is) less than γ” it’s better to ignore the cut-off point of the test and base your analysis on the observed data x0. (In what follows I will consider x0 to be the mean of the n measurements, to keep the notation simpler and avoid formatting issues. I will also use ≤ and ≥ instead of strict inequalities to avoid the less-than and greater-than signs.)

You introduce severity, in principle a function of T(α) and d(x0), which is in fact a function of just x0 (and µ1=µ0+γ). Your definition for a non-rejected test is

The severity with which the claim µ≤µ1 passes test T(α) with data x0

[1] SEV(µ≤µ1)=P(d(x)≥d(x0);µ≥µ1)={1}=P(x≥x0;µ≥µ1)={2}=P(x≥x0;µ=µ1)

{1} because d(x) is a monotonically increasing function of x. {2} because you say severity is evaluated at µ=µ1.

For a rejected test, the corresponding definition is

The severity with which test T(α) passes µ≥µ1 with data x0

[2] SEV(µ≥µ1)=P(d(x)≤d(x0);µ≤µ1)=P(x≤x0;µ≤µ1)=P(x≤x0;µ=µ1)

Not by coincidence, these results are identical to 1 – ( p-value when the hypothesis µ=µ1 is tested with data x0 and the alternative is [1] µ≤µ1 or [2] µ≥µ1 ).
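Written out for the Normal model with known σ, the identity claimed here takes one line per case. This is a sketch in the comment's own notation, with Φ the standard Normal cdf and x the sample mean, and with severity evaluated at the boundary µ=µ1 as in {2}:

```latex
\begin{align*}
\text{[1]}\quad \mathrm{SEV}(\mu \le \mu_1)
  &= P(\bar{x} \ge x_0;\ \mu = \mu_1)
   = 1 - \Phi\!\left(\frac{x_0 - \mu_1}{\sigma/\sqrt{n}}\right)
   = 1 - p_{\le}, \\
\text{[2]}\quad \mathrm{SEV}(\mu \ge \mu_1)
  &= P(\bar{x} \le x_0;\ \mu = \mu_1)
   = \Phi\!\left(\frac{x_0 - \mu_1}{\sigma/\sqrt{n}}\right)
   = 1 - p_{\ge},
\end{align*}
```

where p_≤ = P(x ≤ x0; µ=µ1) is the one-sided p-value for H0: µ=µ1 against the alternative µ≤µ1, and p_≥ = P(x ≥ x0; µ=µ1) is the one-sided p-value against the alternative µ≥µ1.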

You “emphasize that you are not advocating changing the original null and alternative hypothesis of the given test T(α); rather you are using the severe testing concept to evaluate which inferences are warranted, in this case of the form µ≤µ1”. As far as I can see, your severe testing concept is equivalent to changing the original null and alternative hypothesis and calculating the p-value given the data x0 for, in this case, H0:µ=µ1 and H1:µ≤µ1. The only dependency on the original test is that when rejected your alternative is µ≤µ1 and when not rejected µ≥µ1.

Maybe you have better examples where the severity interpretation of acceptance or rejection give results which are different from those obtained from a straightforward µ≤µ1 or µ≥µ1 hypothesis test?

Carlos: In a simple case like this, especially, there are lots of other computations that one could appeal to in order to get the same or similar numbers. What matters is the logic that would direct you to one or another computation and subsequent interpretation. There is often considerable disagreement about the logic because there is disagreement about the justification for these tools. If you are a likelihoodist or Bayesian you might apply a different logic and proclaim these computations are illogical. Even among those using tests we find disagreement or unclarity. For instance, there’s disagreement among some as to whether reaching an alpha level rejection when the power against mu’ is low is tantamount to exaggerating the magnitude of the discrepancy indicated (at least if you use the observed difference as an estimate of the magnitude). On this view, we should be more impressed with an alpha significant result from a test with very high power. At the same time it is standard to criticize a significant result if based on such huge sample sizes that the test is deemed pathologically powerful. The null is rejected, many argue, on the basis of results that allegedly “favor” the null. So which is correct? These, and several other criticisms, conflict with each other! These “metarules” for interpreting tests, I claim, direct the answers, but from what standpoint? It is the philosophical standpoint (I claim it is really the intended standpoint of error statistical tests) that directs the answer. But regardless of one’s philosophy, one ought to be able to see the conflict in the criticisms.

In the first case, when a rejection from a test with low power is criticized, what I think actually happens is this: a test in an area known to have low power against plausibly sized discrepancies nonetheless achieves significance. This, together with other evidence, may make people suspect the significant result is due to cherry-picking, multiple testing, p-hacking and various biasing selection effects. That suspicion is often warranted. However, questioning if the error probabilities are actual is different from reasoning based on error probabilities, when they are assumed to be approximately correct.

Thank you for your answer, I appreciate the time you took to reply. Unfortunately, I don’t see the logic in performing a test for µ=0 vs µ≥0 and trying to base on it any inference about µ≥µ1 or µ≤µ1. For that you have to do some convoluted calculations involving power to arrive at the same probabilities as the test for µ=µ1 (which seems the logical thing to do for inference about µ=µ1). I’m also unconvinced that in more complex cases the outcome of the severity approach will be different from a simple test as you suggest.

You might have your philosophical reasons to prefer your approach, but I think the subtlety will be lost on the people who believe that the rejection of a test is good evidence for µ≥µ1 if the power of the test against µ1 is high. That would imply that the evidence for µ≥(µ1+1) is better, because the power against (µ1+1) is higher, but for any reasonable definition of evidence this is not possible ( µ≥(µ1+1) implies µ≥µ1, so any evidence for the first is also evidence for the latter ). If this simple argument doesn’t convince them, I don’t think severity arguments will.

Carlos: It didn’t take much time to reply. As far as your last, let me just say that there are many more reasons for seeking an adequate statistical philosophy to clarify, scrutinize, and justify statistical inferences than this. But even wrt your simple example, the entailment condition you presume, sometimes called “the special consequence condition” by philosophers, does not generally hold in statistics. So you can’t just appeal to it. With Bayesian statistics, for example, we have statistical “affirming the consequent” legitimized (as a Bayes boost). In significance testing, we may have mu > mu’ allowed but not “either mu > mu’ or mu < mu’”. My point is simply that different statistical philosophies countenance different logical principles. See for example this post which garnered 75 comments and a follow-up post:

https://errorstatistics.com/2013/10/19/bayesian-confirmation-philosophy-and-the-tacking-paradox-in-i/ (something we currently don't have)

Something looks wrong to me here with (2). Copied and pasted (I hope the symbols come out well):

“(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’."

In the discussed example, the power of the test is high against $\mu'=3$. Let's say we observe x=4, which is significant. No way can this be an indication that $\mu<3$.

Christian: That’s why I wrote: “let’s suppose that our result is just statistically significant. (As soon as it exceeds the cut-off the rule has to be modified).”

OK, I see. I think “borderline insignificant” would have been the better choice of wording then, because in case it’s insignificant we’d like to know how close our evidence tells us we are to the null hypothesis (which is what you tell us here). In case it’s significant we rather wonder how far away from it we are at least, I’d think. The way it’s currently presented in this post seems confusing. Well, at least it confused me.

Christian: What you say about what we usually look for is correct. But I also think it’s useful to know what we’re not warranted in inferring, and even those discrepancies that we have evidence for ruling out. That’s the “don’t make mts out of molehills” part. So I want it to be a significant result. To make the rule in terms of power, I imagined it was just significant.