
I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC—exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out—I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably—even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: a statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence that the improvement is as impressive as δ’. Often the high power of .8 is even used as a (posterior) probability that the improvement is δ’. [0] If these do not immediately strike you as fallacious, compare:
- If the house is fully ablaze, then very probably the fire alarm goes off.
- If the fire alarm goes off, then very probably the house is fully ablaze.
The first bullet says the fire alarm has high power to detect the house being fully ablaze. It does not entail the converse claim in the second bullet.
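To see the gap numerically, here is a minimal sketch with made-up numbers (a sensitive alarm, occasional false alarms, and full-blown blazes rare); nothing turns on the particular values:

```python
# Hypothetical numbers, chosen only for illustration.
p_blaze = 0.001             # Pr(house fully ablaze)
p_alarm_given_blaze = 0.99  # Pr(alarm | blaze): a highly "powerful" alarm
p_alarm_given_none = 0.02   # Pr(alarm | no blaze): burnt toast, dying batteries

# Bayes' theorem for the converse probability:
p_alarm = p_alarm_given_blaze * p_blaze + p_alarm_given_none * (1 - p_blaze)
p_blaze_given_alarm = p_alarm_given_blaze * p_blaze / p_alarm
print(f"Pr(blaze | alarm) = {p_blaze_given_alarm:.3f}")  # ~0.047, despite Pr(alarm | blaze) = .99
```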
Today’s meta-statistical researchers are keen to point up the consequences of using statistical significance tests, figuring out why they lead to the various replication crises in science, and how they may be more honestly viewed. Yet they too use statistical analyses, and these can reflect philosophical and conceptual standpoints that may replicate the same shortcomings that arise in classic criticisms of significance tests. A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests, but these misunderstandings tend to crop up today at the meta-level. When power enters this meta-research, often as a kind of probability of replication, the same confusions unsurprisingly reappear. But I will not tackle any meta-research in this post. Instead, let’s go back to power howlers that arise in criticisms of tests. I had a blogpost long ago (from Oct. 2011) on Ziliak and McCloskey (2008) (Z & M) on power, following a review of their book by Aris Spanos (2008). They write:
“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”
So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.
And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.
Fine. Let this alternative be abbreviated H’(δ):
H’(δ): there is a positive (population) effect at least as large as δ.
Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).
(1) The power of the test to detect H’(δ) = Pr(test rejects null at the .01 level | H’(δ) is true).
Say it is 0.85.
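As a check on how such a number arises, here is a minimal sketch of the computation in (1) for a one-sided Normal test of zero effect; the n, σ, and δ are made up purely for illustration. (Power computed at the point δ is a lower bound for the rejection probability over θ ≥ δ, since power increases with the discrepancy.)

```python
# A minimal sketch of (1): one-sided z-test of H0: theta = 0 vs theta > 0,
# rejecting at the .01 level. n, sigma, delta are hypothetical, chosen so
# the power comes out close to .85.
from scipy.stats import norm

alpha, n, sigma, delta = 0.01, 100, 1.0, 0.34
z_crit = norm.ppf(1 - alpha)          # cut-off for the standardized sample mean
shift = delta * n**0.5 / sigma        # center of that mean when theta = delta
power = 1 - norm.cdf(z_crit - shift)  # Pr(reject at .01 level | theta = delta)
print(f"power ≈ {power:.2f}")         # ≈ 0.86 with these numbers
```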
According to Z & M:
“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)
But this is not so. They are mistaking (1), the definition of power, for a claim assigning a posterior probability of .85–either to some effect, or specifically to H’(δ)! That is, (1) is being transformed to (1′):
(1′) Pr(H’(δ) is true | test rejects null at .01 level) = .85!
(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:
1. Pr(test rejects the null | H’(δ) is true) = 0.85.
2. Test rejects the null hypothesis.
Therefore, the rejection is probably correct, i.e., the probability that H’ is true is 0.85.
Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.
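To see why the slide from 1 to 1′ fails, here is a toy simulation of my own devising (an error statistician would balk at assigning a prevalence to hypotheses, but the fallacy implicitly requires one): even with power near .85, the proportion of rejections in which H’(δ) is true depends on how often H’(δ) holds to begin with. All numbers are made up.

```python
# Toy simulation: power does not transfer to Pr(H' | reject).
import numpy as np

rng = np.random.default_rng(0)
n, sigma, delta = 100, 1.0, 0.34
z_crit = 2.326  # one-sided .01-level cut-off, i.e., norm.ppf(.99)

trials = 200_000
h1_true = rng.random(trials) < 0.01        # suppose H'(delta) holds in 1% of trials
theta = np.where(h1_true, delta, 0.0)
z = theta * np.sqrt(n) / sigma + rng.standard_normal(trials)
reject = z > z_crit

print(f"Pr(reject | H') ≈ {reject[h1_true].mean():.2f}")  # ≈ .86: the power
print(f"Pr(H' | reject) ≈ {h1_true[reject].mean():.2f}")  # ≈ .46 here, nowhere near .85
```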
High power as high hurdle. As Aris Spanos (2008) points out, “They have it backwards”. Their reasoning comes from thinking that the higher the power of the test from which statistical significance emerges, the higher the hurdle it has gotten over. Extracting from a Spanos comment on this blog in 2011:
“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) [i]
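Spanos’s last sentence is easy to verify numerically; a minimal sketch (the same kind of one-sided .01-level z-test as above, with made-up δ and σ):

```python
# Power of a one-sided .01-level z-test increases monotonically with n.
from scipy.stats import norm

alpha, sigma, delta = 0.01, 1.0, 0.2
z_crit = norm.ppf(1 - alpha)
for n in (25, 100, 400, 1600):
    power = 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)
    print(f"n = {n:4d}: power ≈ {power:.2f}")  # 0.09, 0.37, 0.95, 1.00
```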
Ziliak and McCloskey (2008) tell us: “It is the history of Fisher significance testing. One erects little “significance” hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133) This explains why they suppose high power translates into high hurdles, but it is the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than power would make this abundantly clear. We may coin: The high power = high hurdle (for rejecting the null) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. But this makes it easier to infer H1. To infer H1 with severity, H1 needs to be given a hard time.
Ponder the consequences their construal would have for the required trade-off between type 1 and type 2 error probabilities. (Use the comments to explain what happens.) For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]
What power howlers have you found? Share them in the comments and I’ll add them to my blizzard.
Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, volume 1, issue 1: 154-164.
Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.
[0] Some meta-researchers, having brilliantly generated a superpopulation of treatments (from which this one is taken as a random sample), and finding these probabilities don’t hold, take this to show p-values exaggerate effects. I’ll come back to that case, which is a bit different from the one in today’s post.
[i] When it comes to raising the power by increasing sample size, Z & M often make true claims, so it’s odd when there’s a switch, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious.
[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’.
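In symbols, for the illustrative case of a one-sided Normal test of θ = 0 vs θ > 0 with known σ and cut-off c_α for d(X) (my choice of example):

$$\mathrm{POW}(\delta) \;=\; \Pr\big(d(X) > c_\alpha \,;\, \theta = \delta\big) \;=\; 1 - \Phi\!\left(c_\alpha - \frac{\delta\sqrt{n}}{\sigma}\right),$$

which accumulates the probability of all outcomes in the rejection region under θ = δ, not the probability of any single data set x.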
[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails (or renders probable) data x will get a “B-boost” from x, unless its probability is already 1. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.
The next blizzard of power puzzles on this blog is here.



I think that Ziliak and McCloskey blaming or crediting (depending on point of view) R.A. Fisher with the concept of power is also wrong. Rightly or wrongly, power was not a concept he liked. See Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’ | Error Statistics Philosophy
Stephen:
True. Maybe Z & M think that’s why Fisher set his tests to have small power. It’s silly, of course, and puzzling that they can get the logic backwards. As for Fisher, he only didn’t like power because it was Neyman’s term, and was associated with the behavioristic interpretations of tests. Sensitivity was all important to Fisher, as you know, and it’s the same thing. I often think that using “sensitivity” might have resulted in less confusion. What do you think? Being powerful sounds desirable, so rejecting with greater and greater power is better and better–to some ears–but of course that’s backwards. For Bayesians this emerges from viewing high Pr(reject;H’) as akin to high likelihood for H’.
Thanks, Deborah. I prefer to accept Fisher’s reason as to why he preferred sensitivity to Savage’s account. I think that the difference is one of science (Fisher) versus mathematics (Savage & Neyman). Mathematics can be a formal game that starts with assumptions and moves to consequences. A prior distribution can be seen as an elaboration of assumptions.
Fisher’s expressed view to Bliss can be re-phrased as follows: a null hypothesis is more primitive than a statistic, but a statistic is more primitive than an alternative hypothesis.
As for likelihood, my view is that the practical meaning of the Neyman-Pearson lemma is misunderstood. Power is not the justification for likelihood but it can be a bonus. To prefer a test simply because it appears to be more powerful can be dangerous if that gain is not justified by likelihood. I gave such an example of trying to improve on a test here https://academic.oup.com/biometrics/article-abstract/63/1/296/7321711?redirectedFrom=PDF
Ironically, the test in question is (a modified form of) Fisher’s exact test.
Stephen:
Thanks for your comment and link to a paper I’m unfamiliar with. On the test stat vs alternative, I’d still say what I said in my 2017 reply: “I think what troubles me about the idea of forming a test statistic post data is that it would allow the kind of selection effects Fisher was against. If the null were, say, drug makes no difference, what’s to stop the test stat from being chosen to be “improvement in memory” or whatever accords with the data. I know that Cox emphasizes the need to adjust for selection effects, and clearly doesn’t imagine a Fisherian test would allow that.” Moreover, starting with a test statistic, doesn’t specifying the test rule indicate the implicit alternative? I don’t understand your remark: “he preferred sensitivity to Savage’s account”. I might not know enough about that correspondence. When I read the paper to which you linked, I’ll get your point about the danger of preferring a test simply because it appears more powerful. I wonder if it has to do with an issue that often crops up, wherein rejecting a null with higher power is thought to give stronger evidence against a null. It’s Z & M’s confusion about hurdles again. I might note that the worst misunderstanding people have regarding N-P’s lemma is that it shows it’s OK to have 2 non-exhaustive point hypotheses in general. More later.
Deborah:
I apologise for being unclear. I think that Fisher was recommending using experience with previous data-sets to choose the analysis for the current data-set. So the test is not formed post-data, or at least not post the data in the test.
Stephen: I’m going to need to check all the references you give, but I’ve apparently misunderstood (all this time) what you meant by Fisher’s alternative to the alternative. I was thinking of the classic criticism of Fisher for only setting out the null hypothesis, whereas N-P insist on specifying an alternative. You seem to be suggesting that Fisher recommended using the alternative that past experience taught you is most likely to be rejected, rather than the one you can show, analytically, would give the test optimal (or at least good) power. Is that right? But that still invokes an alternative, and we don’t see that in Fisherian tests with their single null. Now N-P set out their tests to rigorously exemplify what Fisher recommended on more informal grounds, so this idea was implicit in their tests. This is what Fisher wanted. But does Fisher explicitly require predesignating the alternative? Time to re-read Fisher.
Deborah, I don’t agree. It is clearly not necessary to invoke alternative hypotheses and a justification in terms of power to come up with a particular test, since the tests were all developed before power was even thought of. To give an analogy, you might have two weighing machines, one of which experience has taught you is more precise than the other. You might then speculate as to why this is so and then come up with some explanation after the fact. “The titanium in this one protected it from accidental warping, which, however, has affected the other one.” However, to say that is the reason you prefer the more accurate machine is rather silly, since your preference was established without this “knowledge”. Only if one is already wedded to the idea that the test has to be justified in the NP framework can one say that Fisher needs an alternative hypothesis.
So, to be clear, I flatly reject the idea that an alternative is invoked. That’s for the NP brigade but not for Fisherians.
Stephen:
Wait a minute. I don’t know how you’re disagreeing, as I was merely trying to formulate your position, coming to suspect I was wrong in what I took you to be saying (all these years). Now I think my earlier view was right, at least in the main. I do not say that it’s necessary to invoke alternative hypotheses and a justification in terms of power to come up with a particular test. (Of course even without the word power, as you say, there is sensitivity or precision or whatever.) So I am now back to supposing that Fisher’s view is that we just have a null hypothesis, and after we have the data (collected by some directive), we come to reason about why experience had shown one machine to be more probative (although you hadn’t had the why question answered before). Then post data you might infer “The titanium in this one protected it from accidental warping”–arriving at a causal claim that might be said to have passed severely (in my language) even though you had not set out your test as one of testing claims about the protectiveness offered by titanium. You discover the causal explanation post-data. Is this the idea?
I assume then a bit more is added to warrant this explanation rather than other post-data speculations that could also account for your preferring the more precise one? Is this right? I’m intrigued.
This is what Fisher wrote to Bliss on 6 October 1938 regarding Neyman and Pearson:
“…The absence of reference to experience seems to me a serious flaw in their work. Their method only leads to definite results when mathematical postulates are introduced, which could only be believed as a result of extensive experience. They do not, unfortunately, discuss the nature of these postulates. If they did so, they would see that in practice it would amount to no more than the experience that one test gave significant results more frequently than another. In general we may come to the conclusion that this is so, partly by direct trials on analogous material, partly from our general concepts as to how the observable effects arise. The introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up.” (My emphasis.)
See p. 246 of Bennett, J.H., Collected Papers of R.A. Fisher, Adelaide: University of Adelaide, 1971-1974.
So, there is no question, at least in this passage, of using the same data on which you will conduct the test to choose that test. My blog on the subject back in 2017 missed the fact that Fisher also mentioned “general concepts”.
So, to answer your question
Q “But does Fisher explicitly require predesignating the alternative?”
A “No. He regards such postulates as being a mistake but he does require predesignating the test”
Stephen:
I have found that disagreements between Fisher and Neyman are too bound up with matters of professional discord to take at face value. It’s not clear what’s involved in the predesignated test. If it’s just a statistical null hypothesis, then it seems researchers can readily invoke various substantive, scientific alternatives post data to explain statistical significance. This, unfortunately, gets to the problems leading to today’s “retire significance”. The formal p-value won’t control the relevant errors any more. Christian Hennig made a comment 9 years ago on March 3, when the question came up in discussing your guest post. I paste it here.
Christian Hennig
One could turn the Neyman-Pearson game on its head and say that every test statistic implies an alternative. If the test rejects in case T>c, the implicit alternative against which the test is testing is all distributions for which P(T>c) is larger than for the H0. Usually when testing hypotheses we are interested in certain deviations from the H0 but not in all conceivable deviations. For example, if H0: N(a,sigma^2) with fixed a we may be interested in comparing locations (e.g., whether we find outcomes that are systematically larger or smaller than a) but we may not be interested in whether the distribution really has a normal shape, or at most to the extent to which it would otherwise be detrimental to making statements about location. Certainly for any data that human beings can observe, N(a,sigma^2) is violated because of discreteness, but we are not interested in this and there’s no need to test it.
Running a Gauss-test or t-test compares locations, not distributional shapes or variances; in this sense these tests come with an implicit alternative and that’s why we use them. This implicit alternative is bigger than the class of N(b,sigma^2) with b not equal a though (and therefore the issue is somewhat more complex than what is covered by Neyman-Pearson theory); all kinds of distributions with mean not too close to a are in it. Anyway, I don’t find any appeal in thinking about a test without taking into account the alternative (or rather set of alternatives) against which it is meant to test, and the (implicit) alternative against which it actually tests.
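To make Hennig’s implicit-alternative point concrete, here is a toy simulation of my own (not his): a one-sample t-test has high power against a mean shift, its implicit alternative, but barely registers a purely shape-wise departure with the same mean. Numbers are made up.

```python
# Toy check: the t-test "tests against" location, not shape.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
reps, n = 2000, 50

def reject_rate(draw):
    """Fraction of two-sided .05-level rejections of H0: mean = 0."""
    return np.mean([ttest_1samp(draw(), popmean=0).pvalue < 0.05
                    for _ in range(reps)])

mean_shift = lambda: rng.normal(0.5, 1, n)      # the implicit alternative
skew_only  = lambda: rng.exponential(1, n) - 1  # mean 0, non-normal shape
print(f"mean shift: reject ≈ {reject_rate(mean_shift):.2f}")  # high power (~0.9)
print(f"skew only:  reject ≈ {reject_rate(skew_only):.2f}")   # near the .05 level
```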
Readers will be interested in the guest post Senn refers us to in his comment. Plus there are a series of comments by me, Andrew Gelman, Michael Lew, Christian Hennig, and Stephen Senn following that post! Feel free to share your thoughts on what we talked about back then.