# Power howlers return as criticisms of severity

Mayo bangs head

Suppose you are reading about a statistically significant result x that just reaches a threshold p-value α from a test T+ of the mean of a Normal distribution:

H0: µ ≤  0 against H1: µ >  0

with n iid samples, and (for simplicity) known σ.  The test “rejects” H0 at this level & infers evidence of a discrepancy in the direction of H1.
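To make the setup concrete, here is a minimal sketch of test T+ using only Python's standard library. The function names `cutoff` and `p_value`, and the values σ = 1, n = 25, α = .025, are illustrative assumptions rather than anything specified in the post.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def cutoff(mu0: float, sigma: float, n: int, alpha: float) -> float:
    """Sample mean that is just significant at level alpha in test T+."""
    return mu0 + STD_NORMAL.inv_cdf(1 - alpha) * sigma / sqrt(n)


def p_value(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """One-sided p-value Pr(M >= xbar; mu = mu0) for the sample mean M."""
    return 1 - STD_NORMAL.cdf((xbar - mu0) / (sigma / sqrt(n)))


# Illustrative values only (assumed, not from the post).
m_star = cutoff(mu0=0.0, sigma=1.0, n=25, alpha=0.025)
print(f"just-significant sample mean: {m_star:.3f}")
print(f"its p-value: {p_value(m_star, 0.0, 1.0, 25):.3f}")
```

A result "that just reaches" the threshold is one whose sample mean equals this cut-off, so its p-value equals α.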

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ). See point* on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the frequentist error statistical philosophy (within which power and associated tests are defined)? A big HINT is below.

*Allow that the test assumptions are adequately met, at least to start with.

I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this long drawn-out, correct, manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed.
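The dependence of power on the chosen alternative can be sketched as follows. This is a hedged illustration for the one-sided Normal test with known σ. The function name `power` and the default values (σ = 1, n = 25, α = .025) are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(mu_alt, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """POW(mu_alt): Pr(test T+ rejects H0; mu = mu_alt)."""
    se = sigma / sqrt(n)
    m_star = mu0 + STD_NORMAL.inv_cdf(1 - alpha) * se  # rejection cut-off
    return 1 - STD_NORMAL.cdf((m_star - mu_alt) / se)


# Same test, different alternatives, very different powers.
for mu_alt in (0.0, 0.1, 0.2, 0.4, 0.6):
    print(f"POW({mu_alt}) = {power(mu_alt):.3f}")
```

The same test has power equal to α at the null value and power .5 at the cut-off itself, which is why "the power of the test" is incomplete until an alternative is named.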

Because fallacious uses of power–power howlers as I call them–are so common, I make sure that the concept of severity is deliberately designed, not just to avoid them, but to make the basis for the fallacy so clear that no one will slip back into committing them. The claims of “good” and “poor” evidence get explicitly cashed out in terms of high/low severity accorded to the associated claims.

But here we are, 3.5 years since the publication of Statistical Significance as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), [SIST], and these fallacies persist. We even hear that (what I claim is) the fallacious interpretation “is now well established”. Worse, the fallacious interpretations are taken as knock down criticisms of (my notion of) severity (which instructs you as to the right way to interpret results)!

So I’m going to focus some blogposts on power howlers (some earlier ones are linked to below).

For the BIG HINT, I will draw from Excursion 4 Tour II of SIST, “Rejection Fallacies: Who’s Exaggerating What?” (pp. 239-240) [1]:

“How Could a Group of Psychologists be so Wrong? I’ll carry a single tome: Morrison and Henkel’s 1970 classic, The Significance Test Controversy. Some abuses of the proper interpretation of significance tests were deemed so surprising even back then that researchers in psychology conducted studies to try to understand how this could be. Notably, Rosenthal and Gaito (1963) discovered that statistical significance at a given level was often fallaciously taken as evidence of a greater discrepancy from the null the larger the sample size n.  In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size.

What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values. According to the theory, especially as this has been amplified by Neyman and Pearson (1933), the probability of rejecting the null hypothesis for any given deviation from null and p values increases as a function of the number of observations.  The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population…The question is, how could a group of psychologists be so wrong? (Bakan 1970, p. 241)

(Our convention is for “discrepancy” to refer to the parametric, not the observed, difference [or effect size]. Their use of “deviation” from the null alludes to our “discrepancy”.)

As statistician John Pratt, notes “the more powerful the test, the more a just significant result favors the null hypothesis” (1961, p. 166). Yet we still often hear: “The thesis implicit in the [Neyman-Pearson, NP] approach, [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases” (Howson and Urbach 1993, p. 209). In fact, the thesis implicit in the N-P approach, as Bakan remarks, is the opposite! The fallacy is akin to making mountains out of molehills according to severity (Section 3.2).

Mountains out of Molehills (MM) Fallacy (large n problem): The fallacy of taking a rejection of H0, just at level P, with larger sample size (higher power) as indicative of a greater discrepancy from H0 than with a smaller sample size.

Consider an analogy with two fire alarms: The first goes off with a sensor liable to pick up on burnt toast; the second is so insensitive, it doesn’t kick in until your house is fully ablaze. You’re in another state, but you get a signal when the alarm goes off. Which fire alarm indicates the greater extent of fire? Answer, the second, less sensitive one. When the sample size increases it alters what counts as a single sample. It is like increasing the sensitivity of your fire alarm.  It is true that a large enough sample size triggers the alarm with an observed mean that is quite “close” to the null hypothesis. But, if the test rings the alarm (i.e., rejects H0) even for tiny discrepancies from the null value, then the alarm is poor grounds for inferring larger discrepancies. Now this is an analogy, you may poke holes in it. For instance, a test must have a large enough sample to satisfy model assumptions. True, but our interpretive question can’t get started without taking the P-values as legitimate and not spurious.”
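The large-n form of the fallacy can be checked numerically. The sketch below assumes σ = 1, α = .025, and a fixed discrepancy of interest µ’ = 0.2 (all illustrative). It computes severity for the claim µ > µ’ as Pr(M ≤ M*; µ = µ’), where M* is the just-significant sample mean. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def severity_just_significant(mu_prime, n, mu0=0.0, sigma=1.0, alpha=0.025):
    """Return (M*, SEV(mu > mu_prime)) when the observed mean equals M*.

    Severity is computed as Pr(M <= M*; mu = mu_prime).
    """
    se = sigma / sqrt(n)
    m_star = mu0 + STD_NORMAL.inv_cdf(1 - alpha) * se
    return m_star, STD_NORMAL.cdf((m_star - mu_prime) / se)


# The p-value is the same (.025) in every row; only n changes.
for n in (25, 100, 400, 10_000):
    m_star, sev = severity_just_significant(mu_prime=0.2, n=n)
    print(f"n={n:>6}  M*={m_star:.4f}  SEV(mu > 0.2)={sev:.3f}")
```

As n grows, the just-significant mean moves closer to the null value and the warrant for inferring µ > 0.2 falls. This is the more sensitive alarm in the analogy.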

A link to the proofs of Excursion 4 Tour II. For another big hint see [2].

Now the high power against alternative µ’ can result from increasing the sample size (as in the above variation of the MM fallacy), or it can result from selecting the value of alternative µ’ to be sufficiently far from µ0.
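The second route to high power, moving µ’ away from µ0 with n held fixed, can be sketched in standard-error units. The code assumes α = .025 and a just-significant result, and the function name `power_and_severity` is an illustrative choice. In this setting the severity for µ > µ’ is exactly 1 minus POW(µ’).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_ALPHA = STD_NORMAL.inv_cdf(1 - 0.025)  # cut-off in standard-error units


def power_and_severity(delta_in_se):
    """For mu' = mu0 + delta_in_se * SE, return (POW(mu'), SEV(mu > mu')).

    Severity is evaluated for a result that is just significant (M = M*).
    """
    pow_ = 1 - STD_NORMAL.cdf(Z_ALPHA - delta_in_se)  # Pr(M > M*; mu')
    sev = STD_NORMAL.cdf(Z_ALPHA - delta_in_se)       # Pr(M <= M*; mu')
    return pow_, sev


for d in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0):
    pow_, sev = power_and_severity(d)
    print(f"mu' = mu0 + {d:.1f} SE   POW = {pow_:.3f}   SEV(mu > mu') = {sev:.3f}")
```

Reading down the rows, the alternatives against which the test has high power are the ones for which a just-significant result gives poor grounds for inferring µ > µ’.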

The paper discussed in my last post includes a criticism of severity that instantiates the second form of the MM fallacy. I will come back to this, and some other howlers (from other papers) later on. In this connection, see a question that might arise [3].

Share your constructive remarks in the comments.

Notes

*Point on language. “To detect alternative µ’” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data, obviously. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.

[1] It must be kept in mind that inferences are going to be in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0.)

[2] Big Hint: Ask yourself: What is the power of test T+ (H0: µ ≤ 0 against H1: µ > 0) against the null value µ = 0?

The answer is α! So, for example, if we set α = .025, then the power of the test at 0 is POW(µ = 0) = α = .025. Because the power against µ = 0 is low, the statistically significant result is good evidence that µ > 0.

[3] Does A hold true if we assume that we know (based on previous severe tests) that µ < µ’? I’ll return to this.

OTHER RELEVANT POSTS ON POWER (you’ll find more by searching “power” on this blog)

### 7 thoughts on “Power howlers return as criticisms of severity”

1. “A test must have a large enough sample to satisfy model assumptions”. This is a rather strange sentence. Most psychologists perform tests of normality in order to choose between the t-test and the Wilcoxon test. I would not include “the sample is large enough that the normal approximation may be used” as a model assumption. The model assumptions are the assumptions about having a random sample from a population of interest… As a consulting statistician I am asked all the time: is my sample large enough *in order to draw conclusions about the population*? People seem to believe something like “representative = ‘N > 100’ ”. I am afraid that the whole p-value controversy is part of the bigger problem that only a small part of the population has a mathematical mind; and fields like psychology attract those who don’t. For this reason I prefer frequentist inference to Bayesian. It is simpler. Less mathematically sophisticated. Less dangerous.

• Richard: I’m surprised you picked that sentence out. I said it because often, when someone gives an example of a test with too low power, the sample size is too small to even check an assumption of iid–and I was talking of such an example. They draw the wrong lesson about a significant result with low power–but you can’t even legitimately compute the power without the adequate statistical model. I think I have a mathematical mind and am not attracted to psych (but to logic and philosophy of science)–so I agree with that.

• In fact standard alternatives to normal distribution based inference such as nonparametric tests lose *more* power with low sample sizes, meaning that one could make the case that *particularly with low sample sizes* a restrictive model assumption helps (to some extent it makes up for not having much information in the data).

Of course you are also right that conditions for testing normality are not good then. However, as I have stated earlier (for example in my presentation in your series), the issue is not whether normality is really fulfilled (it isn’t anyway), but rather whether deviations from normality will mislead conclusions. At least the most problematic deviation (gross outliers) can be diagnosed with, let’s say, n>=4. (Problems with independence cannot be diagnosed, but this is hard to do even with large samples, unless there is a clear dependence pattern such as dependence within known groups.)

• Christian: Thanks for your comment. We’re getting beyond the key issue I wanted to raise, which is to compare what’s warranted when we grant the model. People are getting it backwards (according to me, N-P tests, and confidence intervals) on the question I ask at the outset, and they presume the power computation is legitimate.

2. Peter Chapman

I always find these types of discussions a bit odd, and entirely devoid of context.

A professional statistician working in an application area in which the experimental material is well understood should be able to construct a reasonable power curve prior to the study commencement. And the study should always have high power for the discrepancy of interest. So if, for example, we want to know whether the discrepancy is 10% or more, we should carry out a test with high power for detecting a 10% effect. If we can’t do this for some reason, such as lack of funds, we shouldn’t do the study.

When I first joined ICI in 1982 in the UK we were about to launch a new herbicide, Fusilade, that killed grass weeds in broad leaved crops. I counted 200 late development trials. We tested in different countries, on different weed species and in different climates and soil types. We compared different formulations and adjuvants. We tried different application rates and varied times and numbers of applications. After all of those trials we knew that the product worked. Psychologists should do the same amount of work on their hypotheses – ESPECIALLY IF THEY WANT TO INFLUENCE GOVERNMENT POLICY etc.

A further point. If you want to build a nuclear power station or an F35 all weather fighter you recruit people who know what they are doing. But when it comes to psychology, a Professor who has attended a couple of courses in statistics, thinks he is equipped to design, conduct and analyse studies that can have a profound influence on our lives. It’s bonkers.

3. I disagree.

I think the word ‘larger’ in this sentence is the main concern for me:

> “But, if the test rings the alarm (i.e., rejects H0) even for tiny discrepancies from the null value, then the alarm is poor grounds for inferring larger discrepancies.”

For me, a *barely* significant result in a powerful test is good evidence that there is a non-zero effect, and the effect is small (or maybe so tiny as to be boring), and we know the direction of the effect.

More broadly, let’s assume 5% is our alpha threshold. All other things being equal, if the power is extremely low (e.g. small sample size), then the probability of getting a significant result will be barely more than 5%. Therefore, a significant result is meaningless. A significant result tells us nothing about the direction or magnitude of the effect.

On the other hand, if the power is high, then a significant result is less likely to be a false positive. We can at least be confident about the direction of the effect. As we look in detail at the results and the properties of the estimator, we might conclude that the true effect is very small. Perhaps the true effect is shown to be so small that it’s not useful in the real world. But we still have some confidence that the effect is non-zero and we know the direction of the effect.

Maybe we should take a step back and reconsider “inferring larger discrepancies” and then ask: what conclusions we want to make from a planned experiment? Do we just want confirmation of ‘discrepancies’ (i.e. non-zero effect with confidence in the direction of the effect)? Or do we want evidence of ‘larger discrepancies’?

• Aaron:
Thanks for your comment. Let’s be clear on what I’m claiming, and to what the “larger” refers.
I claim, if the test would probably ring the alarm at level p (i.e., reject H0) even for a discrepancy from the null value of µ’, then the alarm at level p is poor grounds for inferring even larger discrepancies, i.e., poor grounds for inferring µ > µ’. We keep to the one-sided test of the mean, as in the post, for simplicity.

Since the word “effect” is ambiguous between the observable effect size and the population or parametric effect size, I always use “discrepancy” to refer to the latter. It’s true that a *barely* significant result (at small p) is good evidence that there is a non-zero discrepancy, but we don’t know if it should count as “small”, since that’s a context relevant issue. My point is independent of that.

Let me deal with just your initial remark. You say:
“More broadly, let’s assume 5% is our alpha threshold. All other things being equal, if the power is extremely low (e.g. small sample size), then the probability of getting a significant result will be barely more than 5%. Therefore, a significant result is meaningless. A significant result tells us nothing about the direction or magnitude of the effect.”

This is not true! If the POW(mu’) is barely more than some small alpha, the result will be evidence that mu > mu’ nearly as strong as it is for mu > mu0!

Let me show this to readers simply.
Background: Let M* be the sample mean that is just statistically significant at the .025 level in the one-sided (positive) test being considered. The power of the test at the null value, i.e., POW(mu = mu0) is .025. And the power of the test at mu = M* is .5.

Example: In the example on p.144 of SIST, linked to on the blogpost, the test is of mu less than or equal to 150 vs mu > 150, with SE= 1. The cut-off M* (at level .025) would be 152. POW(150) = .025 and POW(152) = .5. So the power is low against mu values in-between these two. If there are questions about this, they’d need to be cleared up first.

Now to make my point: Here’s an example of an alternative against which this test has low power, mu = 150.5. The power of the test against 150.5 is only .07. [Pr(M > 152; mu = 150.5) = Pr(Z > 1.5) = .07.] You say this is meaningless and doesn’t give evidence of direction or magnitude. Not so.

Spoze you’re testing mu is less than or equal to 150.5 vs mu > 150.5.
The p-value reached by our observation (M = M* = 152) would be .07. So it would be evidence against the null and in favor of mu > 150.5.
We are assuming the model assumptions hold, or this discussion doesn’t get started.
The severity for mu > 150.5 would be .93.
Once we settle this, we can move to your other points and their problems.

I hope there are no errors of computation.
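The arithmetic in this reply can be checked with a short script. It takes the cut-off M* = 152 and SE = 1 as stated in the comment. The function names are illustrative, and severity for mu > mu’ is computed as Pr(M ≤ 152; mu = mu’), which is an assumption consistent with the .93 figure quoted. Because 152 is exactly 2 SE above 150 (rather than 1.96), POW(150) comes out slightly below the rounded .025.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

SE = 1.0
M_STAR = 152.0  # cut-off as stated in the comment (2 SE above 150)


def power(mu_alt):
    """POW(mu_alt) = Pr(M > M*; mu = mu_alt)."""
    return 1 - STD_NORMAL.cdf((M_STAR - mu_alt) / SE)


def severity(mu_prime, observed=M_STAR):
    """SEV(mu > mu_prime) = Pr(M <= observed; mu = mu_prime)."""
    return STD_NORMAL.cdf((observed - mu_prime) / SE)


print(f"POW(150)         = {power(150.0):.3f}")
print(f"POW(150.5)       = {power(150.5):.3f}")
print(f"POW(152)         = {power(152.0):.3f}")
print(f"SEV(mu > 150.5)  = {severity(150.5):.3f}")
```

With these inputs the script agrees with the figures in the comment to two decimal places: power against 150.5 is about .07, and the severity for mu > 150.5 is about .93.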