
In giving some informal remarks about power at a seminar a couple of weeks ago, I proposed that the tendency to turn the notion of power on its head might be avoided by imagining we need to define a test’s error probabilities in terms of its power alone. We can refer to the power against the null hypothesis, rather than alluding to a type 1 error probability, for example. What do I mean by turning power on its head? I mean, at least here, supposing that a test provides poor evidence of discrepancies that the test has low power to detect.
This grows out of the assumption that a statistically significant result only provides good evidence of discrepancies (from a null hypothesis) that the test has reasonably high power to detect. But these claims actually reverse what is the case about power and warranted (population) discrepancies. They turn power on its head.
To remind us, the goal of this statistical significance test is to assess the compatibility of data with a reference or null hypothesis, such as to see if the value of test statistic D indicates a genuine positive (population) discrepancy from 0. The tester may go on to consider the evidence for various other positive discrepancies as well. For simplicity consider testing H0: µ ≤ 0 vs H1: µ >0 with known SE. I will use some numbers from a guest blog post by Stephen Senn discussing the interpretation of tests in clinical trials:
For simplicity, allow the cut-off to be 2, rather than 1.96. Write the cut-off for rejecting the null as D*, which in Senn’s example is .7. Since D* = 2 SE, we have SE ≈ .35. The power of the test against different values of µ doesn’t require knowing the true value of µ; there is a power function. The test is falsificationist, and uses hypothetical reasoning. The power of this test against µ’ is the probability D exceeds D* (.7) computed under the assumption that µ = µ’. Write this as POW(µ’).
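To make the power function concrete, here is a minimal sketch in Python. It assumes, as the cut-off of 2 SE suggests but the post does not state outright, that D is Normally distributed around µ’ with standard deviation SE. The names `power`, `SE` and `D_STAR` are mine, chosen for illustration.

```python
from statistics import NormalDist

SE = 0.35        # standard error implied by D* = 2 * SE = .7
D_STAR = 2 * SE  # cut-off for rejecting the null

def power(mu_prime, cutoff=D_STAR, se=SE):
    """Pr(D > cutoff) computed under mu = mu_prime.

    Assumption: D ~ Normal(mu_prime, se). This is a modelling choice
    made for this sketch, not something fixed by the post.
    """
    return 1.0 - NormalDist(mu=mu_prime, sigma=se).cdf(cutoff)

if __name__ == "__main__":
    # Null value, .5 SE, 1 SE, the cut-off itself, and Senn's Delta = 1
    for mu_prime in (0.0, 0.5 * SE, SE, D_STAR, 1.0):
        print(f"POW({mu_prime:.3f}) = {power(mu_prime):.3f}")
```

Under that Normal assumption the printed values should land close to the figures used in this post: roughly .023, .07, .16, .5 and .8.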
Tests, particularly in clinical trials, are often specified to have high probability, .8 or .9, of detecting a discrepancy from the null that “we would not like to miss”. To “miss” means the test does not set off the “significance alarm”, that is, the result is statistically insignificant. Senn’s example stipulates that the population discrepancy we would really hate to miss is ∆ = 1. This means that were the population ∆ = 1 or higher, then we want there to be a high probability that the value of the sample D will exceed D*.
Note: I use the word “discrepancy” in alluding to population effect sizes and “differences” to refer to observed difference. I’m deliberately calling ∆ “the discrepancy we would really hate to miss” because “the discrepancy we would not like to miss” is often interpreted in a weaker manner than intended. In particular, it is often construed as the smallest discrepancy of interest. But this minimal discrepancy of interest would be smaller than ∆. [1] See also my commentary on Senn’s post:
Let’s now return to the test H0: µ ≤ 0 vs H1: µ > 0.
(1) The power at the null is α. Note that POW(0) = α = .025 (more like .023, since the cut-off is 2 rather than 1.96).
Let’s assume for the moment that D just makes it to the cut-off D* for rejection. Then POW(0) is also equal to the significance level for the outcome. Here’s the logic of statistical significance tests using power, and D=D*:
(2) If D is just statistically significant, and its statistical significance level is low, then D indicates µ > 0.
(2) is equivalent to (2)’:
(2)’ If POW(0) is low, then D* indicates µ > 0.
Of course, indications need to be supplemented by audits of assumptions, checks of biasing selection effects, and ideally, replication. But we must first make out the intended logic of tests, under the presumption the assumptions hold approximately, and separately audit them.
(3) If it would be difficult for the test to generate a D as large as D* if µ = 0, and yet we observe D*, then it indicates it was generated by a µ that exceeds 0.
The assertion in (3) holds not just for the null but for discrepancies from 0. Now a critic of tests might note: “But your test also has rather low power to detect positive discrepancies close to 0. For example:
POW(.5 SE) = .07. [i.e., POW(.17) = .07.]”
To which a tester would respond: Yes, and I can similarly infer my D* indicates µ > .17. I reason as follows: were µ ≤ .17, then at least 93% of the time I’d get a smaller D than I did. That’s the logic of testing. Note too that the P-value for testing µ ≤ .17 is .07, and the lower confidence interval µ > .17 has confidence level .93.
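The link between POW(.17), the P-value, and the lower confidence bound can be checked numerically. The sketch below again assumes a Normal D with SE = .35 and an observed difference sitting exactly at D* = .7. The helper names `p_value` and `lower_bound` are hypothetical, introduced only for this illustration.

```python
from statistics import NormalDist

SE = 0.35
D_OBS = 0.7          # observed difference, assumed to equal the cut-off D*
Z = NormalDist()     # standard Normal, used for tail areas and quantiles

def p_value(mu0, d_obs=D_OBS, se=SE):
    """Pr(D >= d_obs; mu = mu0): the P-value for testing mu <= mu0."""
    return 1.0 - Z.cdf((d_obs - mu0) / se)

def lower_bound(conf_level, d_obs=D_OBS, se=SE):
    """One-sided lower confidence bound for mu at the given level."""
    return d_obs - Z.inv_cdf(conf_level) * se

if __name__ == "__main__":
    mu0 = 0.5 * SE   # the critic's discrepancy of .5 SE, about .17
    p = p_value(mu0)
    print(f"P-value for testing mu <= {mu0:.3f}: {p:.3f}")
    print(f"Lower bound at confidence level {1 - p:.3f}: {lower_bound(1 - p):.3f}")
```

When D = D*, the P-value against µ ≤ µ’ is the same number as POW(µ’), and inverting it recovers µ’ as the lower bound at level 1 − POW(µ’). That is the sense in which low power against .17 goes with a good indication that µ > .17.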
A critic might continue: “But your test also has rather low power to detect positive discrepancies of 1 SE.
POW(1SE) = .16! [i.e., POW(.35) = .16.]”
To which a tester could respond: Yes, and I therefore have a weak indication that µ > .35. The P-value for testing µ ≤ .35 is .16, and the lower confidence interval µ > .35 has confidence level .84.
And she could go on to note: I clearly do not have evidence that µ exceeds those values against which the test has high power! Even to infer, on grounds that POW(.7) = .5, that my observing D* indicates µ > .7 would be wrong 50% of the time!
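The “wrong 50% of the time” claim can also be checked by simulation rather than by formula. This is only a sketch: it assumes D is Normal with SE = .35, fixes µ at the boundary value µ’, and counts how often D reaches D*, which is how often the rule “infer µ > µ’ whenever D reaches D*” would err. The function name `error_rate`, the trial count and the seed are arbitrary choices of mine.

```python
import random

SE = 0.35
D_STAR = 0.7

def error_rate(mu_prime, n_trials=100_000, seed=0):
    """Simulated Pr(D >= D*; mu = mu_prime).

    This is the relative frequency with which inferring mu > mu_prime
    from D >= D* would be mistaken if mu were in fact equal to mu_prime.
    Assumption: D ~ Normal(mu_prime, SE).
    """
    rng = random.Random(seed)
    hits = sum(rng.gauss(mu_prime, SE) >= D_STAR for _ in range(n_trials))
    return hits / n_trials

if __name__ == "__main__":
    for mu_prime in (0.0, 0.175, 0.35, 0.7, 1.0):
        print(f"mu' = {mu_prime:.3f}: erroneous inference rate ~ {error_rate(mu_prime):.3f}")
```

The simulated rates track POW(µ’): small for µ’ near 0, about one half at µ’ = .7, and large at µ’ = 1. The higher the power against µ’, the worse the warrant for inferring µ > µ’ from a just-significant result.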
I hope it is now clear why the two claims stated at the outset turn power on its head, in relation to statistical significance tests. Senn would not say a statistically significant result is fairly good evidence that µ > 1, on the grounds that POW(1) = .8. Yet you will sometimes see medical researchers and spokespeople claim literally this. What we can correctly say is:
(4) If it would be improbable for the test to generate a D > D* were µ < µ0, and yet I observe D*, then D is an indication it was generated by a µ that exceeds µ0.
However, there is a different assertion that has a superficial resemblance to the ones I am pointing to as reversing power, and that other assertion can hold true. I discuss it in my next post. (I promise not to wait a month to write it!)
Share your questions and remarks in the comments to this post.
[1] Other construals: the minimum value of D we hope to observe, the smallest discrepancy we’d like to learn about, or still others. See this earlier Senn post.



This breakdown of how power is turned on its head regarding “missed alarms” hits on a critical issue: the structural limits of our data-generation frameworks often guarantee a failure of severity before a statistical test is even applied.
I have recently formalized this exact type of error through what I call the Sepsis Reporting Gap in clinical data architectures. Looking at empirical tracking data from high-profile acute physiological collapses (using recent public timelines like the Kyle Busch case as a reference point), the underlying electronic health record functions as a severely restricted σ-algebra (F_h).
When a reporting infrastructure forces multi-modal, continuous biological trajectories into a discrete terminal classification like an ICD-10 ‘sepsis’ code, it creates an observational equivalence class [P]_h. Within this restricted filtration, pathogen-driven metabolic collapse is mathematically indistinguishable from underlying iatrogenic-accelerated cellular or mitochondrial failure modes G.
In your terms, the system is structurally designed to “miss the alarm” on the latent variables G. The resulting data point isn’t an empirical deduction; it operates as an epistemic supplement dictated by the limitations of the reporting box. We cannot severely test causal hypotheses or machine learning models when the data-collection layer itself enforces an observational boundary that strips the test of its power.
I have published the full mathematical proof of this observational insufficiency and its implications for severe testing in health informatics here: https://trissimondsen.wordpress.com/2026/05/25/beyond-the-icd-10-code-a-structural-analysis-of-the-sepsis-reporting-gap/
I would be highly interested in how the error statistics framework views this type of pre-data structural unidentifiability, where the measurement mechanism itself pre-determines a “missed alarm.”
Tris:
I think I get your argument, but I’m not sure how you see it in relation to this post. Does my follow-up post help to make out the point I’m getting at here? The inadequacy you’re on about seems to get at a deep substantive problem: the test is too coarse and has no power to discriminate between more specific explanations.
I appreciate your time and your reply, Deborah. The connection to your post is about where the failure of severity actually occurs. Usually, in error statistics, we analyze the power of the statistical test applied to the data. My argument is that in these complex systems, the failure of severity is strictly upstream – hardcoded into the data-generation layer itself.
If the measurement mechanism (the ICD-10 coding) forces distinct biological realities (e.g., pathogen-driven vs. iatrogenic-accelerated collapse) into an observational equivalence class, it means the data output for both states is identical. In error statistical terms, the severity of any subsequent test to discriminate between these two realities is mathematically bounded at zero. The test didn’t just miss the alarm; the alarm wire was never connected to the sensor.
The danger – and where this intersects with institutional consensus – is that medical and AI systems take this coarse, zero-severity data and process it as if it were a Fully Specified Stochastic Process (FSSP). They treat the ‘Sepsis’ code as an empirical, severely-tested deduction, when it is actually an epistemic supplement forced by the limits of the reporting box. It creates a Spurious Stochastic Process (SSP).
Your follow-up post absolutely clarifies your point. My goal with the Observational Sufficiency Principle (OSP) is simply to push the strict requirements of “severe testing” all the way down to the foundational σ-algebra of the reporting mechanism. If the measurement box is too coarse to permit a severe test, we must halt and respect that underspecification, rather than blindly processing the data and pretending the test had power.