Do “underpowered” tests “exaggerate” population effects? (iv)


You will often hear that if you reach a just statistically significant result “and the discovery study is underpowered, the observed effects are expected to be inflated” (Ioannidis 2008, p. 64), or “exaggerated” (Gelman and Carlin 2014). This connects to what I’m referring to as the second set of concerns about statistical significance tests: power and magnitude errors. Here, the problem does not revolve around erroneously interpreting power as a posterior probability, as we saw in the fallacy in this post. But there are other points of conflict with the error statistical tester, and much that cries out for clarification, else you will misunderstand the consequences of some of today’s reforms.

(1) In one sense, the charge is unexceptional: If the discovery procedures in the examples these authors discuss involve flexible stopping rules, data dredging, and a host of other biasing selection effects, then finding statistical significance fails to give evidence of a genuine population effect. In those cases, an assertion about evidence of a genuine effect could be said to be “inflating”, but that’s because the error probability assessments, and thus the computation of power, fail to hold. That is why, as Fisher stressed, “we need, not an isolated record of statistical significance”, but must show it stands up to audits of the data and to replication. Granted, the sample size must be large enough to sustain the statistical model assumptions, and when not, we have grounds to suspect violations.

Let’s clarify further points:

(2) For starters it is incorrect to speak of tests being “underpowered” (tout court), because power is always defined in terms of a specific discrepancy from a test hypothesis or alternative null hypothesis. At most, they can mean that the test has low power to detect discrepancies of interest, or low power to detect a magnitude of (population) effect that is assumed known to be true. (The latter is what these critics tend to have in mind.) Take the type of example from discussions of the “mountains out of molehills” fallacy (in this blog and in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [SIST 2018], Excursion 5 Tour II), in testing the mean µ of a Normal distribution (with a random sample of size n): Test T+:

 H0: µ ≤ µ0 against H1: µ > µ0,

We can speak of a test T+ having low power to detect µ = µ’, (for µ’ a value in H1) while having high power to detect a larger discrepancy µ”. To remind us:

POW(µ’)–the power of the test to detect µ’ –is the probability the test  rejects H0, computed under the assumption that we are in a world where µ = µ’.

(I’ve often said that speaking of a test’s “sensitivity” would be less open to misconstrual, but it’s the same idea.) We want tests sufficiently powerful to detect discrepancies of interest, but once the data are in hand, a construal of the discrepancies warranted must take into account the sensitivity of the test that H0 has failed. (Is it like the fire alarm that goes off with burnt toast? Or one that only triggers when the house is ablaze?) 

If you want an alternative against which test T+ has super high power (~.98), choose µ’ = µ0 + 4 standard error units. But it would be unwarranted to take a just statistically significant result as grounds to infer a µ this large. (It would be wrong 98% of the time.) The high power is telling us that if µ were as large as µ’, then with high probability the test would reject H0: µ ≤ µ0 (and find an indication in the direction of alternative H1). It is an “if-then” claim.
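For readers who want the calculation behind the ~.98 figure, here is a sketch of the standard closed form for the power of T+. It assumes σ is known; the symbols z_α (the cutoff in standard error units) and Φ (the standard Normal CDF) are notation introduced for this sketch, and α = .025 is taken from the worked example later in the post.

```latex
\begin{align*}
\mathrm{POW}(\mu')
  &= \Pr\!\left(M > \mu_0 + z_\alpha\,\mathrm{SE};\ \mu = \mu'\right)
   = 1 - \Phi\!\left(z_\alpha - \frac{\mu' - \mu_0}{\mathrm{SE}}\right),
  \qquad \mathrm{SE} = \frac{\sigma}{\sqrt{n}}, \\[4pt]
\mathrm{POW}(\mu_0 + 4\,\mathrm{SE})
  &= 1 - \Phi(1.96 - 4) = \Phi(2.04) \approx .98
  \qquad (\alpha = .025,\ z_\alpha = 1.96), \\[4pt]
\mathrm{POW}(\mu_0)
  &= 1 - \Phi(z_\alpha) = \alpha .
\end{align*}
```

The last line is just the special case µ’ = µ0: the power against the null value itself equals the significance level.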

(3) To keep in mind the claim that I am making, I write it here:

Mayo: If POW(μ’) is high then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).

Of course we would stipulate values for “high” (e.g., over .5) and “low” (e.g., less than .2), but this suffices for now. Let me suggest an informal way to understand error statistical reasoning from low power against an alternative μ’: Because it is improbable to get as low a P-value as we did (or lower), were μ as small as μ’–i.e., because POW(μ’) is low–it is an indication we’re in a world where population mean μ is greater than μ’. This is exactly the reasoning that allows inferring μ > μ0 with a statistically significant result. And notice: the power of the test against μ0 is α!

(4) The observed effect M is not the estimate of the population effect µ. Rather, we would use the lower bound of a confidence interval with high confidence level (or corresponding high severity). Nor is it correct to say the estimate “has” the power, it’s the test that has it (in relation to various alternatives–forming a power function).

But if it is supposed we will estimate the population mean using the statistically significant effect size (as suggested in Ioannidis 2008 and Gelman and Carlin 2014), and it is further stipulated that this is known to be too high, then yes, you can say the estimate is too high. The observed mean “exaggerates” what you know on good evidence to be the correct mean. No one can disagree with that, although they measure the exaggeration by a ratio. This is not about analyzing results in terms of power (it is not “power analytic reasoning”). But no matter. See “From p. 359 SIST” below or these pages here.

(5) Specific Example of test T+. Let’s use an example from SIST (2018) of testing

µ ≤ 150 vs. µ > 150

with σ = 10, SE = σ/√n  = 1.  The critical value for α =.025 is z = 1.96. That is, we reject when the sample mean  M > 150 + 1.96(1). You observe a just statistically significant result. You reject the null hypothesis and infer µ >150. Gelman and Carlin write:

An unbiased estimate will have 50% power if the true effect is 2 standard errors away from 0, it will have 17% power if the true effect is 1 standard error away from 0, and it will have 10% power if the true effect is 0.65 standard errors away from 0 (ibid., p. 4).

These correspond to µ =152, µ =151, µ =150.65. It’s odd to talk of an estimate having power; what they mean is that the test T+ has a power of .5 to detect a discrepancy 2 standard errors away from 150, and so on. I deliberately use numbers to match theirs.

[At this point, I turn to extracts from pp. 359-361 of SIST.] The “unbiased estimate” here is the statistically significant M. [I’m using M for X.] To see we’d match their numbers, compute POW(µ =152), POW(µ =151), POW(µ =150.65)[i]:

(a) Pr(M > 151.96; µ = 152) = Pr(Z > -.04) = .51;
(b) Pr(M > 151.96; µ = 151) = Pr(Z > .96) = .17;
(c) Pr(M > 151.96; µ = 150.65) = Pr(Z > 1.31) = .1.
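The three power values can be checked numerically. The following is a minimal sketch, not code from SIST, using only Python's standard library; the names `MU0`, `SE`, `ALPHA`, `CUTOFF` and `power` are chosen here for illustration, and the cutoff uses 1.96 (more precisely, the .975 Normal quantile) rather than 2.

```python
from statistics import NormalDist

MU0 = 150.0    # boundary of the null hypothesis in the example
SE = 1.0       # sigma / sqrt(n) = 10 / sqrt(100)
ALPHA = 0.025  # one-sided significance level

Z = NormalDist()  # standard Normal distribution
CUTOFF = MU0 + Z.inv_cdf(1 - ALPHA) * SE  # reject H0 when M > CUTOFF (about 151.96)


def power(mu_alt: float) -> float:
    """POW(mu_alt) = Pr(M > CUTOFF; mu = mu_alt) for test T+."""
    return 1 - Z.cdf((CUTOFF - mu_alt) / SE)


if __name__ == "__main__":
    for mu_alt in (152.0, 151.0, 150.65):
        print(f"POW({mu_alt}) = {power(mu_alt):.3f}")
```

To three decimals the first value is about .516, which the post reports as .51; the other two agree with .17 and .1 after rounding.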

They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result exaggerates the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, “if the power is this high [.8], overestimation of the magnitude of the effect will be small” (Gelman and Carlin 2014, p. 3). [I inserted this para from SIST p. 359 in version (iii).] Note POW(152.85) = .8….

They appear to be saying that there’s better evidence for µ ≥152 than for µ ≥151 than for µ ≥150.65, since the power assessments go down. Nothing changes if we write >. Notice that in each case the SEV computations for µ ≥152, µ ≥151, µ ≥150.65 are the complements of the power values: .49, .83, .9. So the lower the power for µ’ the stronger the evidence for µ > µ’. Thus there’s disagreement with my assertion in (3). But let’s try to pursue their thinking.
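The complement relation can be sketched in a few lines. This assumes the usual severity computation for a rejection in test T+, SEV(µ > µ’) = Pr(M ≤ observed M; µ = µ’), evaluated at the just significant outcome M = 151.96; the function name `severity` and its defaults are illustrative, not taken from SIST.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution
SE = 1.0          # standard error in the example
M_OBS = 151.96    # just statistically significant at the .025 level


def severity(mu_prime: float, m_obs: float = M_OBS, se: float = SE) -> float:
    """SEV(mu > mu_prime) = Pr(M <= m_obs; mu = mu_prime)."""
    return Z.cdf((m_obs - mu_prime) / se)


def power(mu_prime: float, cutoff: float = M_OBS, se: float = SE) -> float:
    """POW(mu_prime) = Pr(M > cutoff; mu = mu_prime)."""
    return 1 - Z.cdf((cutoff - mu_prime) / se)


if __name__ == "__main__":
    for mu_prime in (152.0, 151.0, 150.65):
        print(f"mu' = {mu_prime}: POW = {power(mu_prime):.2f}, "
              f"SEV(mu > mu') = {severity(mu_prime):.2f}")
```

Because the observed mean sits exactly at the cutoff, severity and power sum to 1 for every µ’, which is the sense in which low power against µ’ goes with strong evidence for µ > µ’.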

Suppose we observe M = 152. Say we have excellent reason to think it’s too big. We’re rather sure the mean temperature is no more than ~150.25 or 150.5, judging from previous cooling accidents, or perhaps from the fact that we don’t see some drastic effects we’d expect from water that hot. Thus 152 is an overestimate. …Some remarks:

From point (4), the inferred estimate would not be 152 but rather the lower confidence bounds, say, µ > (152 – 2SE ), i.e., µ > 150 (for a .975 lower confidence bound). True, but suppose the lower bound at a reasonable confidence level is still at odds with what we assume is known. For example, a lower .93 bound is µ > 150.5. What then? Then we simply have a conflict between what these data indicate and assumed background knowledge. 
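The lower bounds quoted here follow from the one-sided rule µ > M − z·SE. Below is a small sketch assuming Normal data with known SE; `lower_bound` is a name introduced for illustration. It uses exact Normal quantiles, so the .975 bound comes out slightly above 150 (the post rounds the multiplier to 2).

```python
from statistics import NormalDist


def lower_bound(m_obs: float, se: float, level: float) -> float:
    """One-sided lower confidence bound: mu > m_obs - z_level * se."""
    z_level = NormalDist().inv_cdf(level)  # Normal quantile for the chosen level
    return m_obs - z_level * se


if __name__ == "__main__":
    for level in (0.975, 0.93):
        print(f"level {level}: mu > {lower_bound(152.0, 1.0, level):.2f}")
```

With M = 152 and SE = 1, the .93 bound is roughly 150.5, matching the figure in the text.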

Do Gelman and Carlin really want to say that the statistically significant M fails to warrant µ ≥ µ’ for any µ’ between 150 and 152 on grounds that the power in this range is low (going from .025 to .5)? If so, the result surely couldn’t warrant values larger than 152 (*). So it appears no values would be able to be inferred from the result.

[(*)Here’s a point of logic: If claim A (e.g., µ ≥ 152) entails claim B (e.g., µ ≥ 150.5), then in sensible inference accounts, claim B should be better warranted than claim A.]

A way to make sense of their view is to see them as saying the observed mean is so out of line with what’s known, that we suspect the assumptions of the test are questionable or invalid. Suppose you have considerable grounds for this suspicion: signs of cherry-picking, multiple testing, artificiality of experiments, publication bias and so forth, as are rife in both examples given in Gelman and Carlin’s paper [as in Ioannidis 2008]. You have grounds to question the result because you question the reported error probabilities. …The error statistical point in (3) still stands.

This returns us to point (1). One reasons, if the assumptions are met, and the error probabilities approximately correct, then the statistically significant result would indicate µ > 150.5, P-value .07, or severity level .93. But you happen to know (or so it is stipulated) that µ ≤ 150.5. Thus, that’s grounds to question whether the assumptions are met. You suspect it would fail an audit. In that case put the blame where it belongs.[ii]

Please use the comments for questions and remarks.


From p. 359 SIST:

They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result exaggerates the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, if the power is this high [.8], . . . overestimation of the magnitude of the effect will be small (p. 3).

[Added 5/4/22: Remember, to say the power against the (assumed) known discrepancy from the null is less than .5 just is to say that the observed M (which just reaches statistical significance) exceeds the true value. And to say the power against the (assumed) known discrepancy exceeds .5 just is to say it exceeds the observed M (so M is not exaggerating it). Also see note (i) from SIST.]

[i] There are slight differences depending on whether they are using 2 as the cut-off or 1.96, and from their using a two-sided test, but we hardly add anything for the negative direction: For (a), Pr( M < -2; µ =2) = Pr(Z < -4) ~ 0.

[ii] The point can also be made out by increasing power by dint of sample size. If n = 10,000, (σ/√n) = 0.1. Test T+(n=10,000) rejects H0 at the .025 level if M > 150.2. A 95% confidence interval is [150, 150.4]. With n = 100, the just .025 significant result corresponds to the interval [150, 154]. The latter is indicative of a larger discrepancy. Granted, sample size must be large enough for the statistical assumptions to pass an audit.
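The comparison in note [ii] can be reproduced with a short sketch. It assumes σ = 10 and α = .025 as in the example, and builds the two-sided 95% interval around an outcome sitting exactly at the rejection cutoff; `just_significant_summary` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

MU0, SIGMA, ALPHA = 150.0, 10.0, 0.025
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA)  # about 1.96


def just_significant_summary(n: int):
    """Return (SE, cutoff, 95% CI) for a result that just reaches the .025 level."""
    se = SIGMA / sqrt(n)
    cutoff = MU0 + Z_ALPHA * se                       # smallest M that rejects H0
    ci = (cutoff - Z_ALPHA * se, cutoff + Z_ALPHA * se)
    return se, cutoff, ci


if __name__ == "__main__":
    for n in (100, 10_000):
        se, cutoff, (lo, hi) = just_significant_summary(n)
        print(f"n = {n}: SE = {se:.2f}, reject if M > {cutoff:.2f}, "
              f"95% CI = [{lo:.2f}, {hi:.2f}]")
```

The intervals agree with the note up to rounding (using 1.96 rather than 2): the same significance level corresponds to a far narrower interval, hugging 150, in the larger sample.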




16 thoughts on “Do “underpowered” tests “exaggerate” population effects? (iv)”

  1. Nick Adams

    Nice! Calling power ‘sensitivity’ makes things clearer. Concordantly, the alpha-level is 1-specificity. Which then leads to a way to avoid the whole power mess. Rather than specifying an alpha-level or effect size of interest pre-data, post-data specify the observed effect size as the effect size of interest and define a ‘positive’ study as one where the effect size is greater than or equal to this. For a symmetric sampling distribution (like the Gaussian) the sensitivity is thus always 50%, and the specificity is 1 minus the one-sided p-value. From these we can calculate the likelihood ratio (LR) of mu>0 versus mu<0. (Without detailing why, it is LR=0.25/(p minus p-squared).)
    Such a likelihood ratio is independent of the sample size showing that inference from a particular p-value does not depend on the study 'power', which I think is your position.

    • Nick:
      I’m not sure of the supposed advantage of calling the observed effect the effect size of interest, nor of the likelihood analysis. I’d prefer to do a confidence interval or, even better, a severity assessment. Here I am just trying to address claims about ordinary power (whereas with severity, we take account of, in effect, “attained power”, Pr(d > d-observed) under various alternatives, for a test statistic d). With your LR, isn’t it restricted to 2 point hypotheses? We may end up in some similar places.
      The main problems in the examples discussed by Ioannidis, and Gelman and Carlin concern biasing selection effects, exploiting experimenter flexibility and violated assumptions. So you can’t even take the computed power as the actual power.

      • Nick Adams

        The idea is that rather than presenting an effect size and the uncertainty of its estimation (and maybe performing a significance test), one presents a likelihood ratio alone as the natural measure of the weight of evidence (see e.g. IJ Good).
        I am not talking about point hypotheses (a la Richard Royall), rather a dividing hypothesis (as Cox termed it) such as your mu<0. Of course the most useful dividing hypothesis would be mu<effect size of interest.

        • Nick:
          If you look at a few recent posts and especially Statistical Inference as Severe Testing (CUP, 2018), you’ll see why I’d deny comparative approaches in general, and LRs in particular, at least for error statistical testing goals. (There are other goals in inference, as I also make clear.) Don’t likelihoods of compound hypotheses require priors? Or are you suggesting reporting a series of LRs of points? In any event, one doesn’t have an error probability attached to the inferences, and many of the most concerning gambits of selection effects do not alter the likelihoods.

  2. Can we clarify what “just significant” means? Do you mean “barely significant”?

    I guess you mean a p-value between .04 and .05?

    I don’t mean to hardcode these thresholds. I’m just thinking of something that is significant, but is only barely (“just”) significant

    • Yes, I meant the result just makes it to whatever level is chosen as the cut-off for statistical significance.

  3. Erik

    The post asks: Do “underpowered” tests “exaggerate” population effects? I believe the answer is “yes”.

    In a recent paper with Eric Cator [1], we prove the following. Suppose b is an unbiased, normally distributed estimator of beta with standard error s. In other words, suppose b ~ N(beta,s).

    Define the signal-to-noise ratio as SNR=beta/s. For any beta, s>0 and c>0, define the “exaggeration ratio” (a.k.a. expected Type M error) as

    e(beta,s,c) = E( |b/beta| | beta, s, |b/s|>c ).

    If c=1.96 then we are conditioning on significance at the 5% level (two-sided). The exaggeration ratio

    • is greater than 1 for any beta, s>0 and c>0.
    • is increasing in c.
    • depends on beta and s only through the absolute value of the SNR.
    • is decreasing in the absolute value of the SNR.

    Since the power of the two-sided test of H0: beta=0 against the true effect is a strictly increasing function of the absolute value of the SNR, the exaggeration is also decreasing in the power against the true effect. So conditionally on significance (at any level), low power implies a large exaggeration ratio.

    [1] EW van Zwet and EA Cator “The significance filter, the winner’s curse and the need to shrink” Statistica Neerlandica (2021)

    • Erik:
      Thanks for your comment and the link to your joint paper.
      First, do you disagree with my claim *(Mayo)?
      One would not want to say that claiming mu is as large (or larger) than the observed M (152) would be to exaggerate mu, while allowing the same data (from the same test) to infer mu > 154, 155, etc and yet suppose this does not exaggerate (since there’s high power to detect the latter).

      Some of my reply here disappeared. I had noted that I assume you mean, similar to Gelman and Carlin, that: First, if the observed statistically significant M is known to exceed the true population mu, then M “exaggerates” mu, whereas, if the true mu is known to exceed the observed statistically significant M, then M does not exaggerate mu (since mu > M). Second, if M is taken as what’s allowed to be inferred about mu, i.e., that mu = M, then the estimate will exaggerate in the first case (since M exceeds the true mu), and M will not exaggerate in the second case. Third, mu values in excess of the statistically significant cut-off have power > .5. So if POW(mu’) > .5, and it’s assumed mu’ is known to be the true value of mu, then asserting mu = M will not be a (positive) exaggeration of mu.

      Do you disagree with Gelman and Carlin?
      Of course, we don’t typically know the true mu, and we typically do not estimate mu as the observed M. To infer mu > M would be to use a CI with confidence level .5.
      I also noted that confidence intervals warrant inferring mu > CI lower using a high confidence level 1 – c. But what’s the POW(CI-lower)? Answer: small, c for a one-sided interval corresponding to a 1-sided test.
      That’s all I remember from my first comment. Then I made the following Note:

      **Now the reason I wrote “just statistically significant at the given level”, e.g., .025, concerns another reading that I don’t think is meant by those claiming M exaggerates mu. It came up in a much earlier post. Suppose one compares the SAME outcome that reached the .025 level with n = 100 with that SAME OBSERVED EFFECT SIZE but with larger n, say n = 10,000. Call this test T+ #2. Now an SE is (σ/√n) = 10/100 = .1. The 2-standard error cut-off for rejection becomes 150.2. Then the p-value against a specified value of µ will be smaller with n = 10,000 than with n = 100. So that’s a way to make the assertion true that has nothing to do with estimating the population parameter with the observed difference. But I doubt that’s what they mean. I’ll explain why.
      Spoze we know µ =150.4 or around that small.
      In Test T+ #1, POW(150.4) = Pr(M > 152; µ = 150.4) = Pr(Z > 1.6) = .05, so it’s very low.
      Someone says but suppose n were large enough to make the power against µ = 150.4 high? Test T+ #2 will do the trick. With n = 10,000, POW(150.4) = Pr(M > 150.2; µ = 150.4) = Pr(Z > -2) = .98.
      In both cases, under this reading I am entertaining, the observed sample mean is the same, say 152. In test T+2, the observed mean M is 20 SE in excess of 150!
      (Wouldn’t one suspect a gross exaggeration then?) Anyway, I grant that the p-value against µ < 150.4 is much smaller (~0) in the more powerful test with n = 10,000 than in the test with n = 100 (p-value ~.05).
      Compare the p-values in testing µ < 150.4 with M = 152 from the two tests. The p-value from Test T+ #1 is ~.05, whereas the p-value from test T+ #2 is ~0, because 152 is 16 SE in excess of 150.4. Please inform me of errors, I’m doing this quickly, but the same result holds even if approximate. So, there is stronger evidence against µ < 150.4 from test T+ with the higher power (by imagining the same observed effect size, 152, came from a test with high power against 150.4). So that is why I stipulated that in comparing two tests, one should look at outcomes with the same statistical significance level. I’m rather sure that proponents of the claim are not saying, “Imagine the same M came from a test where M is 20 SE rather than 2 SE”, but Michael Lew once raised a similar case only in terms of likelihoods.

      • Erik:
        I should have linked to Excursion 5 Tour 1. pp. 326-7.

      • Erik

        You ask: First, do you disagree with my claim?

        You claim: If POW(μ’) is high then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).

        This seems fine, as I understand it. POW(mu’) being high would mean that mu’ is something like 3 standard errors larger than mu0. If the test is *just* significant, then our estimate of mu would be about 2 standard errors larger than mu0 (assuming alpha=0.025, one-sided). So in this case, mu’ is 1 standard error larger than our estimate of mu. If anything, there is reason to think mu’ is larger than mu.

        Conversely, POW(mu’) being low would mean that mu’ is something like 1 standard error larger than mu0. In that case, mu’ is 1 standard error smaller than our estimate of mu. So, then there is some reason to think mu’ is smaller than mu.

        • Erik:
          I’m glad that you agree with my claim.
          “POW(mu’) being low would mean that mu’ is something like 1 standard error larger than mu0. In that case, mu’ is 1 standard error smaller than our estimate of mu. So, then there is some reason to think mu’ is smaller than mu.”
          Yes, so the estimate or inference should be mu > mu’, where mu’ corresponds to a lower CI bound, at an appropriate level, not mu = M. I know you said a CI of .95 would be too conservative, but we can learn a lot from reporting levels like .84 or whatever. Somehow, the “estimate” or inference needs to take the SE into account.

    • Erik:
      I think I’ve explained in some detail how these claims can be understood. In my view, they are not a matter of retrospective power–i.e., using power post data to evaluate or arrive at inferences about warranted magnitudes upon getting a stat sig result.
      Since you concur wrt my main claim, the main thing is for people to understand it and avoid confusing it with something entirely different. That is important, as many people take Ioannidis’ (2008) “exaggeration” claim as denying my assertion. Of course, violated assumptions and selection effects need to be taken account of (another reason not to estimate without an SE, as Stephen Senn has urged).

      Now on the other point, we don’t typically know the true value of mu. You can say if mu were one of those against which the test has low power, then it would be an exaggeration to use M to estimate mu etc. etc. I know we’ve been through this.
      Tests do not have low power, but rather low power against certain alt parameter values. So in our example, the “low” power goes from alpha to .5 in considering POW(mu0), and POW(mu0 + 2SE), respectively. Using M to estimate mu would be an “exaggeration” of those mu values. So you might not think there’s an assumption about the true population effect size when you say things like “low power implies” but there is. At most there’s an “if then” claim, but it is not in conflict with my claim. I’m not saying the pop effect size isn’t known in the clinical trials you discuss. If Senn likes your adjustment of the use of M as an estimate, I’m sure it’s sennsible. What are its error probabilities?

  4. Comments that came up in an earlier post on the topic are relevant for different ways of viewing the problem. Here’s one:

    Telling What’s True About Power, if practicing within the error-statistical tribe

    • I tried to paste just the specific comments, but this is what was pasted. Scroll down the comments of that early post. Also find a relevant post by Stephen Senn by searching “delta force”.

  5. Deborah:

    Your example of µ ≤ 150 vs. µ > 150 does not connect well to any applied problem I’ve seen. So I’ll talk about applied problems that I have seen.

    In my article with Carlin that you mention in your post, we give an example of a researcher who estimated that the children of attractive parents were 8 percentage points more likely to be girls, compared to the children of unattractive parents. The standard error of this estimate was 3.3 percentage points. Given that only statistically significant comparisons would be published and publicized, any estimate from this study would have to be at least 6.6 percentage points. Some reading of the literature makes it clear that any real population difference would have to be less than 0.5 percentage points. Hence the selection process results in extreme overestimates of effect size.

    We also go through another example, of an article reporting a 17 percentage point difference in vote preference, comparing women at different times of the month. The standard error of this estimate was 8 percentage points. Again, under the rule by which only statistically significant results are published and publicized, any such estimate would have to be at least 16 percentage points.

    The selection rule leads to a positive bias in the estimate of the magnitude of the effect. The size of the bias depends on the true effect size so it can never be known, but in some cases such as these we know enough about the subject to know that the bias is huge. More generally, I’d like researchers to recognize that this procedure (publishing and publicizing results conditional on statistical significance) leads to a bias which in many well-known examples can be large. The fact that the size of the bias is unknown is not a reason for researchers to act as if it does not exist.

    • Andrew:
      So great to have a comment from you. I construe your claims in Gelman and Carlin in just the way I take you to recommend: “the selection rule leads to a positive bias in the estimate of the magnitude of the effect” using the observed effect size. Realistic examples can indicate right away something is amiss, so I agree they are more effective than toy examples. The short discussion in my book SIST, pp 359-361 ends “Recall the (2010) study purporting to show genetic signatures of longevity…. Researchers found the observed differences suspiciously large, and sure enough, once reanalyzed, the data were found to suffer from the confounding of batch effects. When results seem out of whack with what’s known, it’s grounds to suspect the assumptions [especially when there’s evidence of multiple-testing, data dredging, publication bias, etc.] That’s how I propose to view Gelman and Carlin’s argument; whether they concur is for them to decide”.
      Nevertheless my claim that a statistically significant result is a better indication that μ > μ’ when POW(μ’) is low, than when it is high is also true–provided the assumptions approximately hold. So both positions are reconciled.

      Here’s a link to Excursion 5 Tour II that includes those pages.

      My new post extends this discussion to a different article.
