
A Blizzard of Power Puzzles Replicates in Meta-Research


I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC—exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out—I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably—even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: A statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence that the improvement really is the impressive δ’. Often the high power of .8 is even used as a (posterior) probability that the improvement is δ’. [0] If these do not immediately strike you as fallacious, compare:

  • If the house is fully ablaze, then very probably the fire alarm goes off.
  • If the fire alarm goes off, then very probably the house is fully ablaze.

The first bullet is saying the fire alarm has high power to detect the house being fully ablaze. It does not mean the converse in the second bullet.
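To put made-up numbers on this, here is a quick Bayes calculation in Python (every number below is my own invention, chosen only for illustration). A highly “powerful” alarm can still leave Pr(house ablaze | alarm) tiny, because alarms also go off for lesser causes and full-blown blazes are rare:

```python
# Back-of-envelope Bayes check of the fire-alarm reversal.
# All three numbers are invented for illustration.
p_alarm_given_fire = 0.99      # the alarm's "power" to detect a blaze
p_alarm_given_no_fire = 0.05   # false-alarm rate (burnt toast, etc.)
p_fire = 0.001                 # full-blown blazes are rare

p_alarm = (p_alarm_given_fire * p_fire
           + p_alarm_given_no_fire * (1 - p_fire))
p_fire_given_alarm = p_alarm_given_fire * p_fire / p_alarm
print(f"Pr(ablaze | alarm) = {p_fire_given_alarm:.3f}")  # ~0.019, not 0.99
```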

Today’s meta-statistical researchers are keen to point up the consequences of using statistical significance tests, figuring out why they lead to the various replication crises in science, and how they may be more honestly viewed. Yet they too use statistical analyses, and these can reflect philosophical and conceptual standpoints that may replicate the same shortcomings that arise in classic criticisms of significance tests. A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests, but these misunderstandings tend to crop up today at the meta-level. When power enters this meta-research, often as a kind of probability of replication, unsurprisingly, the same confusions pop up at the meta-level. But I will not tackle any meta-research in this post. Instead, let’s go back to power howlers that arise in criticisms of tests. I had a blogpost long ago on Ziliak and McCloskey (2008) (Z & M) on power (from Oct. 2011), following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) = Pr(test rejects null at the .01 level | H’(δ) is true).

Say it is 0.85.
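For concreteness, here is a minimal sketch of computation (1) in Python. It assumes a one-sided one-sample z-test of zero effect with known σ = 1 (a toy setup of my own, not anyone’s actual trial), with δ and n chosen so the power comes out near 0.85:

```python
# Minimal sketch of (1): power of a one-sided z-test of H0: mu = 0
# against mu > 0, at the .01 level. All numbers are hypothetical.
from scipy.stats import norm

alpha, sigma, delta, n = 0.01, 1.0, 0.5, 46
z_crit = norm.ppf(1 - alpha)             # rejection cutoff, ~2.326

# Under mu = delta the test statistic is N(delta*sqrt(n)/sigma, 1), so
# power = Pr(d(X) > z_crit | delta) = Pr(Z > z_crit - delta*sqrt(n)/sigma).
# (Power against discrepancies larger than delta is higher still, so this
# is the minimum power over H'(delta).)
power = 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)
print(f"power to detect delta = {delta}: {power:.2f}")   # ~0.86
```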

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so. They are mistaking (1), which defines power, for a claim assigning a posterior probability of .85, either to some effect or specifically to H’(δ)! That is, (1) is being transformed into (1′):

(1′) Pr(H’(δ) is true | test rejects null at .01 level) = .85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct, i.e., the probability that H’(δ) is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with (1′).
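A toy simulation drives home that premise 1 is not (1′). Suppose, purely for illustration (an error statistician need not grant that such base rates exist), that discrepancies as large as δ are present in only 2% of the nulls we test, the test has power .85 against H’(δ), and the type 1 error rate is .01. The relative frequency of true H’(δ) among rejections then lands nowhere near .85:

```python
# Toy simulation: Pr(reject | H'(delta)) = .85 does not fix
# Pr(H'(delta) | reject). All inputs are hypothetical.
import random

random.seed(1)
alpha, power, prevalence = 0.01, 0.85, 0.02  # invented base rate of true effects
rejections = correct_rejections = 0

for _ in range(1_000_000):
    h_prime_true = random.random() < prevalence
    if random.random() < (power if h_prime_true else alpha):
        rejections += 1
        correct_rejections += h_prime_true

print(f"Pr(H' | reject) ~= {correct_rejections / rejections:.2f}")  # ~0.63, not 0.85
```

Dial the invented prevalence up or down and Pr(H’ | reject) moves anywhere from near 0 to near 1, while the power sits at .85 throughout; that is the sense in which premise 1 carries no posterior.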

 

High power as high hurdle. As Aris Spanos (2008) points out, “They have it backwards”. Their reasoning comes from thinking that the higher the power of the test from which statistical significance emerges, the higher the hurdle the result has gotten over. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) [i]
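Spanos’s last sentence is easy to check with the same toy z-test sketched earlier (α = .01, σ = 1, and a fixed true discrepancy δ = 0.3, all numbers of my own choosing): power rises monotonically with n, so large samples make rejection easy because power is high, not low.

```python
# Power of the toy one-sided z-test as a function of sample size n.
from scipy.stats import norm

alpha, sigma, delta = 0.01, 1.0, 0.3
z_crit = norm.ppf(1 - alpha)
for n in (10, 50, 100, 200, 500):
    power = 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)
    print(f"n = {n:3d}: power = {power:.3f}")
# power climbs steadily toward 1 as n grows; it never drifts toward 0
```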

Ziliak and McCloskey (2008) tell us: “It is the history of Fisher significance testing. One erects little significance hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133) This explains why they suppose high power translates into high hurdles, but it is the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than “power” would make this abundantly clear. We may coin: the high power = high hurdle (for rejecting the null) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. But this makes it easier to infer H1. To infer H1 with severity, H1 needs to be given a hard time.

Ponder the consequences their construal would have for the required trade-off between type 1 and type 2 error probabilities. (Use the comments to explain what happens.) For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]

What power howlers have you found? Share them in the comments and I’ll add them to my blizzard. 

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, 1(1): 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, University of Michigan Press.

[0] Some meta-researchers, having brilliantly generated a superpopulation of treatments (from which this one is taken as a random sample), and finding these probabilities don’t hold, take this to show p-values exaggerate effects. I’ll come back to that case, which is a bit different from the one in today’s post.

[i] When it comes to raising the power by increasing sample size, Z & M often make true claims, so it’s odd when there’s a switch, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious. 

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’. 
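In code, using a toy binomial test of my own devising (H0: p = .5 vs. p > .5, n = 100, rejecting when X ≥ 62, roughly the .01 level), the power against the alternative p = .65 is the sum of the probabilities of every outcome in the rejection region, not the probability of any single outcome:

```python
# Power as the probability of the whole rejection region under H'.
from scipy.stats import binom

n, cutoff, p_alt = 100, 62, 0.65
alpha = 1 - binom.cdf(cutoff - 1, n, 0.5)   # Pr(X >= 62 | p = .5), ~.010
# sum Pr(X = k | p = .65) over every k that would reject
power = sum(binom.pmf(k, n, p_alt) for k in range(cutoff, n + 1))
print(f"alpha ~= {alpha:.3f}, power against p = {p_alt}: {power:.3f}")  # ~0.77
```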

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails (or renders probable) data x will get a “B-boost” from x, unless its probability is already 1. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

