
Power and Severity with nonsignificant results: more power puzzles?

The concept of a test’s power, originating in Neyman and Pearson’s early work, is by and large a pre-data concept, used for specifying a test (notably, determining a worthwhile sample size) and for choosing between tests. In some papers, however, Neyman lists a third goal for power: to interpret test results post-data, much in the spirit of what is often called “power analysis”. This is to determine the discrepancy from a null hypothesis that may be ruled out, given nonsignificant results. One example is in the paper “The Problem of Inductive Inference” (Neyman 1955). The reason I’m bringing this up is that it has direct bearing on some of today’s most puzzling (and problematic) post-data uses of power. Interestingly, in that 1955 paper, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman 1955, pp. 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples and known standard deviation; call it test T+.

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ0 iff d(x0) > cα where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present]. Says Neyman:

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1)  P(d(X) > cα; µ =  µ0 + δ)

This is rather different from the more behavioristic construal Neyman usually championed. In fact, Neyman sounds like a Cohen-style power analyst!

Still, in standard power analysis, power is calculated relative to an outcome just missing the cutoff cα. This is, in effect, the worst case of a negative (nonsignificant) result. If the actual outcome corresponds to a larger p-value (an even more negative result), it seems to me that this should be taken into account in interpreting the results. Do you agree? It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2)  P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
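To make the contrast concrete, here is a minimal numerical sketch (in Python, using scipy, with hypothetical numbers of my own rather than those in Mayo and Spanos 2006): the ordinary power (1), computed at the cutoff, can be low while the attained power (2), computed at the actual nonsignificant d(x0), is high.

```python
# Sketch only: hypothetical numbers illustrating (1) vs (2) for test T+.
from scipy.stats import norm

alpha = 0.025
c_alpha = norm.ppf(1 - alpha)    # cutoff for d(X), ~1.96
d_obs = -0.5                     # a hypothetical, clearly nonsignificant d(x0)
delta = 1.2                      # discrepancy of interest, in (sigma/sqrt(n)) units

# (1) POW(delta) = P(d(X) > c_alpha; mu = mu0 + delta)
pow_delta = norm.sf(c_alpha - delta)      # ~0.22: low

# (2) Attained power / severity for "mu < mu0 + delta" = P(d(X) > d_obs; mu = mu0 + delta)
sev = norm.sf(d_obs - delta)              # ~0.96: high

print(f"(1) power at delta:            {pow_delta:.2f}")
print(f"(2) attained power (severity): {sev:.2f}")
```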

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way, the way I’d defined it in Mayo (1996) and before. Note that this differs from what some have called “observed power” and I call “shpower” (see this post). Spanos and I called it the severity interpretation for acceptance (SIA); in SIST, it’s cashed out as SIN: the severity interpretation of negative results. With SIA and SIN, we consider the value of the observed statistic, rather than the cut-off for rejection or significance. I also call it attained power. This is a core concept that I claim testers should be using to interpret warranted discrepancies post-data.

The claim in (2) could also be made out by viewing the p-value as a random variable and calculating its distribution under various alternatives (Cox 2006, p. 25). This reasoning yields a core frequentist principle of evidence (FEV) in Mayo and Cox (2010, p. 256):

FEV: A moderate (i.e., non-small) p-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy δ to exist.[1]

It is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, with d(x) in excess of the cutoff, the opposite concern arises, namely, that the test may be too sensitive to warrant a claimed discrepancy. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.[2]
________________________________________

[1] The full version of our frequentist principle of evidence (FEV) corresponds to the interpretation of a small p-value:

x is evidence of a discrepancy δ from H0 iff, if H0 is a correct description of the mechanism generating x, then, with high probability, a less discordant result would have occurred.

[2] Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.

Severity did not have to be defined this way, but I decided it was best to have a concept or measure where high values are always good, by contrast with type 1 and type 2 error probabilities. However, it means SEV has to be computed relative to what is being inferred. This requires appropriately swapping in the particular claim H whose severity one wants to assess.

NOTE: This discussion was part of what I dubbed the Neyman’s Nursery posts (NN1-NN5). This was the second, NN2. Why I used that term is a long story, which you can learn about by searching this blog.

REFERENCES:

Cohen, J. (1992), “A Power Primer,” Psychological Bulletin, 112(1): 155-159.

Cox, D. R. (2006), Principles of Statistical Inference, Cambridge: Cambridge University Press.

Mayo, D. (1996), Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (eds.), Error and Inference, Cambridge: Cambridge University Press, pp. 247-275.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII: 13-46.

Neyman, J. (1957), “The Use of the Concept of Power in Agricultural Experimentation,” Journal of the Indian Society of Agricultural Statistics, IX: 9-17.


Continuing the blizzard of 26 power puzzles

 

The mayor of NYC offered $30 an hour to help shovel the ~30 inches of snow that fell last Sunday and Monday. From what I hear, it was a very effective program. Here’s a little power puzzle to very easily shovel through. [1]

Suppose you are reading about a result x  that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ). I am keeping symbols as simple as possible. *See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?

(Note the qualification that would arise if you were only told the result was statistically significant at some level less than or equal to α, rather than, as I intend, that it is just significant at level α; this is discussed in a comment due to Michael Lew here [0].)

Allow that the test assumptions are adequately met (though usually this is what’s behind the problem). I have often said on this blog, and I repeat: the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption that alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying the alternative against which it is computed. It will also get you into trouble if you define power as in the first premise of this post: the probability of correctly rejecting the null, which is both ambiguous and fails to specify the all-important conjectured alternative. That you compute power for several alternatives is not the slightest bit problematic; it’s what you want to do in order to assess the test’s capability to detect discrepancies. There’s a power function. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?

It must be kept in mind that inferences are going to be of the form µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values! (Not even to the point µ = M0, the observed mean.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels), the dual of test T+.

POWER: POW(T+, µ’) = POW(Test T+ rejects H0; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection at level α. (Since M is continuous, it doesn’t matter whether we write > or ≥.) I’m simplifying notation. Also, I’ll leave off the T+ and write POW(µ’).

In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

Let σ = 10, n = 100, so σ/√n = 1.  Test T+ rejects H0 at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2.

Test T+ rejects H0 at the ~.025 level if M > 2.
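Here is a small sketch of this setup (Python with scipy; the rounding of the cutoff to 2 follows the text and is the only assumption beyond σ = 10, n = 100):

```python
# Power function for test T+ with sigma = 10, n = 100 (so SE = 1), cutoff M* = 2.
from scipy.stats import norm

sigma, n = 10, 100
se = sigma / n ** 0.5            # = 1
m_star = 2                       # simplified cutoff; the exact .025 cutoff is 1.96 * se

def POW(mu_prime):
    """POW(mu') = Pr(M > M*; mu'), with M ~ N(mu', se)."""
    return norm.sf(m_star, loc=mu_prime, scale=se)

print(round(POW(0), 3))          # ~0.023, i.e., roughly the .025 level, given the rounded cutoff
```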

CASE 1:  We need a µ’ such that POW(µ’) is low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:

1. POW(µ = 0) = α = .025.

Since the probability of M > 2, under the assumption that µ = 0, is low, the just significant result indicates µ > 0.  That is, since the power against µ = 0 is low, the statistically significant result is a good indication that µ > 0.

Equivalently, 0 is the lower bound of a .975 confidence interval.

2. For a second example of low power that does not use the null: we get power of .04 if µ’ = M* – 1.75(σ/√n), which in this case is 2 – 1.75 = .25. That is, POW(.25) = .04. [ii]

Equivalently, µ > .25 is the lower confidence bound at level .96 (this is the CI that is dual to test T+).
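A quick check of these CASE 1 numbers, and of the confidence-interval duality, under the same assumptions (a sketch; the just-significant observed mean is taken to equal the cutoff, M = 2):

```python
# Verifying the CASE 1 numbers and the dual lower confidence bounds.
from scipy.stats import norm

se, m_star, m_obs = 1, 2, 2      # SE, cutoff, and a just-significant observed mean

print(round(norm.sf(m_star, loc=0,    scale=se), 3))   # POW(0)   ~0.023 (roughly alpha)
print(round(norm.sf(m_star, loc=0.25, scale=se), 3))   # POW(.25) ~0.040

# Lower confidence bounds dual to T+ for the observed M = 2:
print(round(m_obs - norm.ppf(0.975) * se, 2))          # ~0.04, i.e., roughly 0 (level .975)
print(round(m_obs - norm.ppf(0.96)  * se, 2))          # ~0.25 (level .96)
```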

CASE 2:  We need a µ’ such that POW(µ’) is high. POW(M* + 1(σ/√n)) = .84.

3. That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So POW(T+, µ = 3) = .84. 

Should we say that the significant result is a good indication that µ > 3?  No, there’s a high probability (.84) you’d have gotten a larger difference than you did, were µ > 3.  

Pr(M > 2;  µ = 3 ) = Pr(Z > -1) = .84. It would be terrible evidence for µ > 3!
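A check of the CASE 2 numbers, plus the corresponding severity for the claim µ > 3 given the just-significant M = 2 (a sketch under the same assumptions):

```python
# POW(3) and the severity for "mu > 3" given a just-significant M = 2.
from scipy.stats import norm

se, m_star, m_obs = 1, 2, 2

pow_3 = norm.sf(m_star, loc=3, scale=se)     # Pr(M > 2; mu = 3) = Pr(Z > -1) ~0.84
sev_3 = norm.cdf(m_obs, loc=3, scale=se)     # Pr(M <= 2; mu = 3) ~0.16: poor warrant for mu > 3

print(round(pow_3, 3), round(sev_3, 3))      # 0.841 0.159
```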

[Figure: Blue curve is the null (µ = 0); red curve is one possible conjectured alternative, µ = 3. The green area is the power; the small turquoise area is α.]

Note that the evidence our result affords µ > µ’ gets worse and worse as we drag µ’ further and further to the right, even though in so doing we’re increasing the power.

As Stephen Senn points out (in my favorite of his guest posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, Δ. But Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84), nor one that we believe obtains.
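The point can be seen at a glance by sweeping µ’ to the right (a sketch under the same assumptions; note that with a just-significant result, where the observed mean equals the cutoff, the severity for µ > µ’ is exactly 1 minus the power at µ’):

```python
# Power rises, but the severity for inferring mu > mu' (given just-significant M = 2) falls.
from scipy.stats import norm

se, m_obs, m_star = 1, 2, 2
for mu_prime in [0, 0.5, 1, 2, 3, 4]:
    pow_ = norm.sf(m_star, loc=mu_prime, scale=se)   # POW(mu')
    sev_ = norm.cdf(m_obs, loc=mu_prime, scale=se)   # SEV(mu > mu') = Pr(M <= 2; mu')
    print(f"mu' = {mu_prime:>3}:  POW = {pow_:.3f}   SEV(mu > mu') = {sev_:.3f}")
```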

So the correct answer is B.

Does A hold true if we happen to know (based on previous severe tests) that µ < µ’?

No, but it does allow some legitimate ways to mount complaints based on a significant result from a test with low power to detect a known discrepancy.

It does mean that if M* (the cut-off for a result just statistically significant at level α) is used as an estimate of µ, with no standard error given, then, since we know independently that µ < M*, the estimate M* is larger than µ. But that is a tautology, and it hides crucial information about the unreliability of the estimate. While it might be said that the observed result, which we’re assuming is M*, “exaggerates” µ if used as an estimate without reporting the SE, this more correctly describes a misuse of tests: a significance tester keen to estimate effect size upon finding significance would be directed to the lower limit of a confidence interval. Why is this highly unkosher for a significance tester? Despite knowing µ < M* (one of the givens of the example), µ is estimated as M*, as if there were no uncertainty.

If the study is in a field known to have lots of researcher flexibility, it might legitimately raise the question of whether the researchers cheated, reporting only the one impressive result after trying and trying again, or tampered with the discretionary points of the study to achieve nominal significance. This is a different issue, and doesn’t change my answer. More generally, it’s because the answer is B that the only way to raise the criticism legitimately is to challenge the assumptions of the test.

Why does the criticism arise illegitimately? There are a few reasons. In some circles it’s a direct result of trying to do a Bayesian computation, setting about to compute Pr(µ = µ’ | M0 = M*) using POW(µ’)/α as a kind of likelihood ratio in favor of µ’. Notice that supposing the probability of a type I error goes down as power increases is at odds with the trade-off that we know holds between these error rates. So this immediately indicates a different use of terms.
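For concreteness, here is a sketch of the computation being criticized (not endorsed), with hypothetical 1:1 prior odds thrown in purely for illustration. It treats power, the probability of the whole rejection region, as if it were the likelihood of the just-significant outcome actually observed, and its verdicts run in the opposite direction to the severity assessments above: the higher the power at µ’, the more “support” it awards to µ’.

```python
# The criticized move: POW(mu')/alpha used as a "likelihood ratio" for mu' vs the null.
alpha = 0.025
for pow_mu_prime in [0.04, 0.25, 0.84]:           # hypothetical powers at various mu'
    lr = pow_mu_prime / alpha                     # the purported likelihood ratio
    posterior_odds = lr * 1.0                     # hypothetical 1:1 prior odds
    post = posterior_odds / (1 + posterior_odds)  # the purported posterior for mu'
    print(f"POW(mu') = {pow_mu_prime:.2f}  ->  'LR' = {lr:5.1f},  'posterior' = {post:.2f}")
```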

[1] This is a modified reblog of an earlier post. 

*Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data, obviously. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.
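A quick Monte Carlo rendering of this point on language, under the same hypothetical test T+ (σ = 10, n = 100, cutoff M* = 2): the power to “detect µ’ = 3” is just the long-run frequency with which the significance alarm rings when the data are generated with µ = 3.

```python
# Simulated frequency of "ringing the significance alarm" when mu = mu' = 3.
import numpy as np

rng = np.random.default_rng(0)
sigma, n, m_star, mu_prime = 10, 100, 2.0, 3.0
trials = 20_000

means = rng.normal(loc=mu_prime, scale=sigma, size=(trials, n)).mean(axis=1)
print((means > m_star).mean())   # ~0.84, agreeing with the analytic POW(3) above
```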


A Blizzard of Power Puzzles Replicates in Meta-Research


I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC (exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out), I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably, even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: a statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence that the improvement is as large as δ’. Often the high power of .8 is even used as a (posterior) probability of the hypothesis that the improvement is δ’. [0] If these do not immediately strike you as fallacious, compare:

  • If the house is fully ablaze, then very probably the fire alarm goes off.
  • If the fire alarm goes off, then very probably the house is fully ablaze.

The first bullet says the fire alarm has high power to detect the house being fully ablaze. The converse, the second bullet, does not follow.
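A toy Bayes calculation, with made-up numbers, shows how far apart the two bullets can be: even a highly “powerful” alarm mostly goes off for other reasons if full-blown blazes are rare.

```python
# Hypothetical numbers only: transposing the conditional in the fire-alarm analogy.
p_blaze = 0.001                    # assumed base rate of a house fully ablaze
p_alarm_given_blaze = 0.95         # the alarm's "power"
p_alarm_given_no_blaze = 0.05      # false alarms (burnt toast, steam, etc.)

p_alarm = p_alarm_given_blaze * p_blaze + p_alarm_given_no_blaze * (1 - p_blaze)
p_blaze_given_alarm = p_alarm_given_blaze * p_blaze / p_alarm
print(round(p_blaze_given_alarm, 3))   # ~0.019: far from "very probably fully ablaze"
```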

Today’s meta-statistical researchers are keen to point up the consequences of using statistical significance tests, figuring out why they lead to the various replication crises in science, and how they may be more honestly viewed. Yet they too use statistical analyses, and these can reflect philosophical and conceptual standpoints that replicate the same shortcomings that arise in classic criticisms of significance tests. A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions in order to get beyond what I call the “chestnuts” and “howlers” of tests, but these misunderstandings tend to crop up today at the meta-level. When power enters this meta-research, often as a kind of probability of replication, the same confusions unsurprisingly pop up. But I will not tackle any meta-research in this post. Instead, let’s go back to power howlers that arise in criticisms of tests. I had a blogpost long ago (from Oct. 2011) on Ziliak and McCloskey (2008) (Z & M) on power, following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) = Pr(test rejects null at the .01 level| H’(δ) is true).

Say it is 0.85.

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so. They are mistaking (1), the definition of power, for a claim that a posterior probability of .85 is given, either to some effect or specifically to H’(δ)! That is, (1) is being transformed into (1′):

(1’) Pr(H’(δ) is true | test rejects null at the .01 level) = .85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct, e.g., the probability H’ is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.
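One way to see that (1′) simply does not follow from (1): even if one grants, for the sake of argument, a hypothetical prevalence of cases in which H’(δ) holds (a deliberately crude two-state sketch), the probability of H’(δ) given a rejection depends on that prevalence and on α, and need not be anywhere near .85.

```python
# Sketch: Pr(reject | H' true) = .85 does not fix Pr(H' true | reject).
alpha, power = 0.01, 0.85
for prevalence in [0.01, 0.1, 0.5]:      # hypothetical share of cases where H'(delta) holds
    p_reject = power * prevalence + alpha * (1 - prevalence)
    p_Hprime_given_reject = power * prevalence / p_reject
    print(f"prevalence {prevalence:.2f}:  Pr(H' | reject) = {p_Hprime_given_reject:.2f}")
    # prints ~0.46, ~0.90, ~0.99: the posterior tracks the hypothetical prior, not the power
```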

 

High power as high hurdle. As Aris Spanos (2008) points out, “They have it backwards”.  Their reasoning comes from thinking that the higher the power of the test from which statistical significance emerges, the higher the hurdle it has gotten over. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) [i]

Ziliak and McCloskey (2008) tell us: “It is the history of Fisher significance testing. One erects little significance hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133). This explains why they suppose high power translates into high hurdles, but it is the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than “power” would make this abundantly clear. We may coin: the high power = high hurdle (for rejecting the null) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. But this makes it easier to infer H1. To infer H1 with severity, H1 needs to be given a hard time.
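To see why high power goes with a low hurdle rather than a high one, consider how the raw cutoff and the power behave as the sample size grows (a sketch with σ = 10 and a hypothetical fixed discrepancy of interest δ = 1):

```python
# As n grows, the raw hurdle M* shrinks and the power at a fixed discrepancy grows:
# rejection becomes easier, not harder.
from scipy.stats import norm

sigma, delta = 10, 1.0
for n in [25, 100, 400, 1600]:
    se = sigma / n ** 0.5
    m_star = norm.ppf(0.975) * se            # raw cutoff ("hurdle") for ~.025 significance
    pow_delta = norm.sf(m_star, loc=delta, scale=se)
    print(f"n = {n:>4}:  hurdle M* = {m_star:.2f},  POW(delta = 1) = {pow_delta:.2f}")
```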

Ponder the consequences their construal would have for the required trade-off between type 1 and type 2 error probabilities. (Use the comments to explain what happens.) For a fuller discussion, see Excursion 5 Tour I of SIST (2018). [ii] [iii]

What power howlers have you found? Share them in the comments and I’ll add them to my blizzard. 

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, 1(1): 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, Ann Arbor: University of Michigan Press.

[0] Some meta-researchers, having brilliantly generated a superpopulation of treatments (from which this one is taken as a random sample), and finding these probabilities don’t hold, take this to show p-values exaggerate effects. I’ll come back to that case, which is a bit different from the one in today’s post.

[i] When it comes to raising the power by increasing sample size, Z & M often make true claims, so it’s odd when there’s a switch, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious. 

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis; it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’.

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails (or renders probable) data x will get a “B-boost” from x, unless its probability is already 1. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

The next installment of the blizzard of 26 power puzzles on this blog is here.

