How to avoid making mountains out of molehills (using power and severity)

Posted on December 12, 2017 by Mayo

In preparation for a new post that takes up some of the recent battles on reforming or replacing p-values, I reblog an older post on power, one of the most misunderstood and abused notions in statistics. (I add a few “notes on howlers”.) The power of a test T in relation to a discrepancy from a test hypothesis H₀ is the probability T will lead to rejecting H₀ when that discrepancy is present. Power is sometimes misappropriated to mean something only distantly related to the probability a test leads to rejection; but I’m getting ahead of myself. This post is on a classic fallacy of rejection.

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. What’s needed is not a new statistical method but new (and correct) interpretations of existing methods. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant at a level α. (As soon as the observed difference exceeds the cut-off, the rule has to be modified.)

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H₀: µ ≤ 0 against H₁: µ > 0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is high, then the just statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

POWER: POW(T+,µ’) = POW(Test T+ rejects H₀;µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

EXAMPLE. Let σ = 10, n = 100, so (σ/√n) = 1. Test T+ rejects H₀ at the .025 level if M > 1.96(1).

Find the power against µ = 2.8. To find Pr(M > 1.96; 2.8), get the standard Normal z = (1.96 – 2.8)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is .8. So POW(T+,2.8) = .8.
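The arithmetic in this example can be checked with a few lines of Python. This is only a sketch under the post's assumptions (one-sided test T+, known σ = 10, n = 100, cut-off 1.96 standard errors above 0); the function name `power` and its parameter names are mine, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def power(mu_alt, sigma=10.0, n=100, cutoff_z=1.96, mu0=0.0):
    """Pr(M > M*; mu_alt) for the one-sided test T+ with known sigma."""
    se = sigma / sqrt(n)            # standard error of the sample mean
    m_star = mu0 + cutoff_z * se    # cut-off for rejection, M*
    z = (m_star - mu_alt) / se      # standardize the cut-off under mu_alt
    return 1.0 - NormalDist().cdf(z)  # area to the right of z


print(round(power(2.8), 2))   # 0.8
print(round(power(0.0), 3))   # 0.025
```

Evaluated at the null value 0, the same function returns the significance level, which is a quick sanity check that the cut-off was set correctly.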

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M₀ just reach the cut-off 2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values. Power fact: POW(M* + 1(σ/√n)) = .84.

That is, adding one (σ/√n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.
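To see how power moves as the alternative passes the cut-off, one can tabulate it over a few values of µ’. The sketch below assumes the simplified cut-off M* = 2 and σ/√n = 1 from the example; the constant and function names are illustrative only. Note that with M* rounded up to 2, the power at the null value is about .023 rather than exactly .025.

```python
from statistics import NormalDist

SE = 1.0       # sigma / sqrt(n) = 10 / sqrt(100)
M_STAR = 2.0   # simplified cut-off used in the example


def power(mu_alt, m_star=M_STAR, se=SE):
    """Pr(M > M*; mu_alt): probability the test rejects when mu = mu_alt."""
    return 1.0 - NormalDist().cdf((m_star - mu_alt) / se)


# Alternatives below, at, and above the cut-off
for mu_alt in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"mu' = {mu_alt:.0f}: POW = {power(mu_alt):.3f}")
```

The printed values pass through .5 exactly at µ’ = M* = 2 and reach roughly .84 at µ’ = 3, one standard error above the cut-off.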

By (2), the (just) significant result x is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result, with probability .84. The upper .84 confidence limit is 3. The significant result is much better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
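The upper confidence limits quoted here follow from adding the appropriate Normal quantile, in standard-error units, to the observed mean. A minimal sketch, assuming M₀ = 2 and σ/√n = 1 as in the example (the helper name `upper_bound` is hypothetical):

```python
from statistics import NormalDist


def upper_bound(m_obs, se, level):
    """One-sided upper confidence limit for mu at the given confidence level."""
    return m_obs + NormalDist().inv_cdf(level) * se


for level in (0.84, 0.975):
    print(level, round(upper_bound(2.0, 1.0, level), 2))
# 0.84 2.99
# 0.975 3.96
```

The limits 2.99 and 3.96 are the values rounded to 3 and 4 in the text.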

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical insignificance, (2) is essentially ordinary power analysis. (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ’, it is not an indication they regard µ’ as plausible, likely, expected, probable or the like. Yet we often hear people say “if statistical testers set .8 power to detect µ = 2.8 (in test T+), they must regard µ = 2.8 as probable in some sense”. No, in no sense. Another thing you might hear is, “when H₀: µ ≤ 0 is rejected (at the .025 level), it’s reasonable to infer µ > 2.8”, or “testers are comfortable inferring µ ≥ 2.8”. No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound if the confidence level is to be at least .5. For example, µ > .5 is a lower confidence bound at confidence level .93.
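The numbers behind this howler can be checked directly. The sketch below computes the confidence level attached to a one-sided claim µ > b after observing M = 2 with σ/√n = 1; the function name is my own. It is evaluated at b = .5 and at b = 2.8, the alternative against which the test has power .8.

```python
from statistics import NormalDist


def lower_bound_confidence(bound, m_obs=2.0, se=1.0):
    """Confidence level of the one-sided claim mu > bound, given observed mean m_obs."""
    return NormalDist().cdf((m_obs - bound) / se)


print(round(lower_bound_confidence(0.5), 2))   # 0.93
print(round(lower_bound_confidence(2.8), 2))   # 0.21
```

Inferring µ > 2.8 from a just-significant result therefore carries a confidence level of only about .2, which is the sense in which the inference would be wrong with probability of roughly .8.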

Rule (2) also provides a way to distinguish values within a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results, and there it is often called “retrospective power” (a fine term, but one often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

Severity replaces M* in (2) with the actual result, be it significant or insignificant.

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to custom tailor results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
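To make the contrast concrete, here is a rough sketch of the substitution just described: the same tail probability is computed once with the cut-off M* (power) and once with the observed mean in its place. The observed mean of 3 is a hypothetical value chosen for illustration, not a result from the post, and the function name is mine.

```python
from statistics import NormalDist


def prob_exceed(threshold, mu_alt, se=1.0):
    """Pr(M > threshold; mu_alt) for a Normal sample mean with standard error se."""
    return 1.0 - NormalDist().cdf((threshold - mu_alt) / se)


M_STAR = 2.0   # cut-off: what power uses
m_obs = 3.0    # hypothetical observed mean (assumption for illustration)

for mu_alt in (3.0, 4.0, 5.0):
    pow_ = prob_exceed(M_STAR, mu_alt)   # rule (2) stated with the cut-off
    sev_ = prob_exceed(m_obs, mu_alt)    # same rule with the actual result
    print(f"mu' = {mu_alt:.0f}: with cut-off = {pow_:.3f}, with observed mean = {sev_:.3f}")
```

With the observed mean at 3 rather than at the cut-off, the assessment of µ < 3 drops from about .84 to .5, so relying on power alone would overstate how well that upper bound is warranted.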

One more thing:

Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones. There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.
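The equivalence in this note can be sketched numerically: the event P < p* is the same event as M > M*, so power computed either way agrees. The code assumes the example's setup (µ₀ = 0, σ/√n = 1, p* = .025); the function names are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Normal


def p_value(m_obs, se=1.0, mu0=0.0):
    """One-sided P-value for test T+: Pr(M >= m_obs; mu0)."""
    return 1.0 - STD.cdf((m_obs - mu0) / se)


def power_via_p(mu_alt, p_star=0.025, se=1.0, mu0=0.0):
    """Pr(P < p*; mu_alt), computed via the mean whose P-value equals p*."""
    m_star = mu0 + STD.inv_cdf(1.0 - p_star) * se  # P < p* exactly when M > m_star
    return 1.0 - STD.cdf((m_star - mu_alt) / se)


print(round(p_value(1.96), 3))      # 0.025
print(round(power_via_p(2.8), 2))   # 0.8
```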

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ’ = µ₀ + δ, or µ ≤ µ’ = µ₀ + δ or the like. They are not to point values! (Not even to the point µ = M₀.) Take a look at the alternative H₁: µ > 0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as “plausible” bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

8 thoughts on “How to avoid making mountains out of molehills (using power and severity)”

December 12, 2017

Carlos Ungil

For the proposed example, I find the following alternative formulations can help to understand what’s going on:
(1) If µ’ is low (µ’<<2), then M>2 is a good indication that µ>µ’ (our estimate for µ is M>2>>µ’)
(2) If µ’ is high (µ’>>2), then M=2 is a good indication that µ<µ’ (our estimate for µ is M=2<<µ’)

December 15, 2017

Mayo

Carlos: I assume you mean to say if the power against mu’ is low, not if mu’ is low, etc. Please send me a revised version. Also the estimate part isn’t clear.


December 16, 2017

Carlos Ungil

I sent an amended version, but it stayed in the moderation queue… Here it is again:

(1) If µ’ is low (µ’<<2), then M>2 is a good indication that µ>µ’ (our estimate for µ is M>2>>µ’)
(2) If µ’ is high (µ’>>2), then M=2 is a good indication that µ<µ’ (our estimate for µ is M=2<<µ’)

In your example, µ’ is low <=> POW (T+,µ’) is low.

And by estimate I mean that the “best guess” / point estimate / MLE for µ is M. So if M>µ’ it is reasonable that we can take it as evidence for µ>µ’. And the same for M<µ’ being evidence for µ<µ’. I’m not trying to find a fault in what you wrote, but simply restating it in what seems a simpler form to me.


December 17, 2017

Mayo

Carlos: You want to say if the power to detect mu’ is low (and high); I don’t get the rest.


December 28, 2018

coreyyanofsky

I can’t recall if “discrepancies from the alternative” is the way you phrase it (it’s not the way I’d phrase it either); I think you’ll know that I’m asking about well and poorly warranted directional hypotheses of the form μ ≤ μ’ given that we have just barely accepted μ ≤ 0.

December 29, 2018

coreyyanofsky

Bit of a brain fart here: the test procedure has, by construction, a Type I error rate of 0.05, not 0.95.

December 29, 2018

Mayo

If you reject, SEV goes in the opposite direction of power. Also, error probs over various sub-rules do not give you the relevant assessment of probativeness for the case at hand, as I think you know. My new post has the famous weighing machine ex. The error statistician conditions. That said, try to reply to your own question.


December 29, 2018

coreyyanofsky

If you ask me, the answer to each of my questions is “yes” — all the same reasoning you go through to construct SEV would seem to apply here. But I’m reluctant to just go around saying “yup, this is the SEV function” without checking with you first.



© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
