You will often hear that if you reach a just statistically significant result “and the discovery study is underpowered, the observed effects are expected to be inflated” (Ioannidis 2008, p. 64), or “exaggerated” (Gelman and Carlin 2014). This connects to what I’m referring to as the second set of concerns about statistical significance tests: power and magnitude errors. Here, the problem does not revolve around erroneously interpreting power as a posterior probability, as we saw in the fallacy in this post. But there are other points of conflict with the error statistical tester, and much that cries out for clarification — else you will misunderstand the consequences of some of today’s reforms.
(1) In one sense, the charge is unexceptional: If the discovery procedures in the examples these authors discuss involve flexible stopping rules, data dredging, and a host of other biasing selection effects, then finding statistical significance fails to give evidence of a genuine population effect. In those cases, an assertion of evidence of a genuine effect could be said to be “inflated”, but that’s because the error probability assessments, and thus the computation of power, fail to hold. That is why, as Fisher stressed, “we need, not an isolated record” of statistical significance, but must show it stands up to audits of the data and to replication. Granted, the sample size must be large enough to sustain the statistical model assumptions, and when not, we have grounds to suspect violations.
Let’s clarify further points:
(2) For starters, it is incorrect to speak of tests being “underpowered” (tout court), because power is always defined in terms of a specific discrepancy from a test (null) hypothesis, i.e., a specific alternative. At most, critics can mean that the test has low power to detect discrepancies of interest, or low power to detect a magnitude of (population) effect that is assumed known to be true. (The latter is what these critics tend to have in mind.) Take the type of example from discussions of the “mountains out of molehills” fallacy (in this blog and in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [SIST 2018], Excursion 5 Tour II), in testing the mean µ of a Normal distribution (with a random sample of size n): Test T+:
H0: µ ≤ µ0 against H1: µ > µ0,
We can speak of a test T+ having low power to detect µ = µ’, (for µ’ a value in H1) while having high power to detect a larger discrepancy µ”. To remind us:
POW(µ’)–the power of the test to detect µ’ –is the probability the test rejects H0, computed under the assumption that we are in a world where µ = µ’.
(I’ve often said that speaking of a test’s “sensitivity” would be less open to misconstrual, but it’s the same idea.) We want tests sufficiently powerful to detect discrepancies of interest, but once the data are in hand, a construal of the discrepancies warranted must take into account the sensitivity of the test that H0 failed. (Is it like the fire alarm that goes off with burnt toast? Or one that only triggers when the house is ablaze?)
If you want an alternative against which test T+ has super high power (~.98), choose µ’ = µ0 + 4 standard error units. But it would be unwarranted to take a just statistically significant result as grounds to infer a µ this large. (It would be wrong 98% of the time.) The high power is telling us that if µ were as large as µ’, then with high probability the test would reject H0: µ ≤ µ0 (and find an indication in the direction of alternative H1). It is an “if-then” claim.
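A quick sketch of this computation, in Python (scipy assumed; the function name and generic values µ0 = 0, SE = 1 are mine, chosen for illustration):

```python
from scipy.stats import norm

def power(mu_prime, mu0=0.0, se=1.0, alpha=0.025):
    """POW(mu'): Pr(test T+ rejects H0; mu = mu'), i.e.,
    Pr(M > cutoff) where cutoff = mu0 + z_alpha * SE."""
    cutoff = mu0 + norm.ppf(1 - alpha) * se   # ~1.96 SE above mu0
    return 1 - norm.cdf((cutoff - mu_prime) / se)

# An alternative 4 SE above mu0 has very high power:
print(round(power(4.0), 2))   # 0.98
# But a just-significant M sits at only ~1.96 SE above mu0,
# so inferring mu is as large as mu0 + 4 SE would usually be wrong.
```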
(3) To keep in mind the claim that I am making, I write it here:
Mayo: If POW(μ’) is high then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).
Of course we would stipulate values for “high” (e.g., over .5) and “low” (e.g., less than .2), but this suffices for now. Let me suggest an informal way to understand error statistical reasoning from low power against an alternative μ’: Because it is improbable to get as low a P-value as we did (or lower), were μ as small as μ’–i.e., because POW(μ’) is low–it is an indication we’re in a world where population mean μ is greater than μ’. This is exactly the reasoning that allows inferring μ > μ0 with a statistically significant result. And notice: the power of the test against μ0 is α!
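These two remarks can be checked numerically. A minimal sketch (Python with scipy; I use µ0 = 150 and SE = 1, matching the example in (5) below):

```python
from scipy.stats import norm

mu0, se, alpha = 150.0, 1.0, 0.025
cutoff = mu0 + norm.ppf(1 - alpha) * se   # reject when M > ~151.96

def power(mu_prime):
    """POW(mu'): Pr(M > cutoff; mu = mu')."""
    return 1 - norm.cdf((cutoff - mu_prime) / se)

# The power of the test against mu0 itself is just alpha:
print(round(power(mu0), 3))   # 0.025

# For a just-significant M (= cutoff), severity for mu > mu' is the
# complement of POW(mu'): low power against mu' means high severity.
sev = norm.cdf((cutoff - 151.0) / se)
print(round(sev, 2))   # 0.83
```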
(4) The observed effect M is not the estimate of the population effect µ. Rather, we would use the lower bound of a confidence interval with a high confidence level (or correspondingly high severity). Nor is it correct to say the estimate “has” the power; it’s the test that has it (in relation to various alternatives–forming a power function).
But if it is supposed we will estimate the population mean using the statistically significant effect size (as suggested in Ioannidis 2008 and Gelman and Carlin 2014), and it is further stipulated that this is known to be too high, then yes, you can say the estimate is too high. The observed mean “exaggerates” what you know on good evidence to be the correct mean. No one can disagree with that, although they measure the exaggeration by a ratio. This is not about analyzing results in terms of power (it is not “power analytic reasoning”). But no matter. See “From p. 359 SIST” below or these pages here.
(5) Specific Example of test T+. Let’s use an example from SIST (2018) of testing
µ ≤ 150 vs. µ > 150
with σ = 10, SE = σ/√n = 1. The critical value for α =.025 is z = 1.96. That is, we reject when the sample mean M > 150 + 1.96(1). You observe a just statistically significant result. You reject the null hypothesis and infer µ >150. Gelman and Carlin write:
An unbiased estimate will have 50% power if the true effect is 2 standard errors away from 0, it will have 17% power if the true effect is 1 standard error away from 0, and it will have 10% power if the true effect is 0.65 standard errors away from 0 (ibid., p. 4).
These correspond to µ =152, µ =151, µ =150.65. It’s odd to talk of an estimate having power; what they mean is that the test T+ has a power of .5 to detect a discrepancy 2 standard errors away from 150, and so on. I deliberately use numbers to match theirs.
[At this point, I turn to extracts from pp. 359-361 of SIST.] The “unbiased estimate” here is the statistically significant M. [I’m using M for X̄.] To see that we match their numbers, compute POW(µ = 152), POW(µ = 151), POW(µ = 150.65)[i]:
(a) Pr(M > 151.96; µ = 152) = Pr(Z > −.04) = .51;
(b) Pr(M > 151.96; µ = 151)= Pr(Z >.96) = .17;
(c) Pr(M > 151.96; µ = 150.65)= Pr(Z >1.31) = .1.
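These three computations can be reproduced directly; a small sketch in Python (scipy assumed; small rounding differences from the 1.96 vs. 2 cutoff are expected, as note [i] discusses):

```python
from scipy.stats import norm

cutoff = 151.96   # the just statistically significant M (SE = 1)

def pow_at(mu):
    """Pr(M > 151.96; mu), with SE = 1."""
    return 1 - norm.cdf(cutoff - mu)

print(round(pow_at(152.0), 2))    # 0.52 with the exact cutoff (~.5)
print(round(pow_at(151.0), 2))    # 0.17
print(round(pow_at(150.65), 2))   # 0.1
```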
They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result “exaggerates” the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, “if the power is this high [.8], overestimation of the magnitude of the effect will be small” (Gelman and Carlin 2014, p. 3). [I inserted this para from SIST p. 359 in version (iii).] Note POW(152.85) = .8….
They appear to be saying that there’s better evidence for µ ≥ 152 than for µ ≥ 151 than for µ ≥ 150.65, since the power assessments go down. Nothing changes if we write >. Notice that in each case the SEV computations for µ ≥ 152, µ ≥ 151, µ ≥ 150.65 are the complements: .49, .83, .9. So the lower the power against µ’, the stronger the evidence for µ > µ’. Thus there’s disagreement with my assertion in (3). But let’s try to pursue their thinking.
Suppose we observe M = 152. Say we have excellent reason to think it’s too big. We’re rather sure the mean temperature is no more than ~150.25 or 150.5, judging from previous cooling accidents, or perhaps from the fact that we don’t see some drastic effects we’d expect from water that hot. Thus 152 is an overestimate. …Some remarks:
From point (4), the inferred estimate would not be 152 but rather a lower confidence bound, say, µ > (152 – 2SE), i.e., µ > 150 (for a .975 lower confidence bound). True, but suppose the lower bound at a reasonable confidence level is still at odds with what we assume is known. For example, a lower .93 bound is µ > 150.5. What then? Then we simply have a conflict between what these data indicate and assumed background knowledge.
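The bounds just cited can be checked with a short sketch (Python with scipy; the function name is mine, and I take the observed mean to be 152 with SE = 1, as in the example):

```python
from scipy.stats import norm

M, se = 152.0, 1.0   # observed sample mean and standard error

def lower_bound(conf_level):
    """One-sided lower confidence bound: mu > M - z * SE."""
    return M - norm.ppf(conf_level) * se

print(round(lower_bound(0.975), 2))   # 150.04 (~150)
print(round(lower_bound(0.93), 2))    # 150.52 (~150.5)
```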
Do Gelman and Carlin really want to say that the statistically significant M fails to warrant µ ≥ µ’ for any µ’ between 150 and 152 on grounds that the power in this range is low (going from .025 to .5)? If so, the result surely couldn’t warrant values larger than 152 (*). So it appears no values could be inferred from the result.
[(*)Here’s a point of logic: If claim A (e.g., µ ≥ 152) entails claim B (e.g., µ ≥ 150.5), then in sensible inference accounts, claim B should be better warranted than claim A.]
A way to make sense of their view is to see them as saying the observed mean is so out of line with what’s known that we suspect the assumptions of the test are questionable or invalid. Suppose you have considerable grounds for this suspicion: signs of cherry-picking, multiple testing, artificiality of experiments, publication bias and so forth — as are rife in both examples given in Gelman and Carlin’s paper [as in Ioannidis 2008]. You have grounds to question the result because you question the reported error probabilities. …The error statistical point in (3) still stands.
This returns us to point (1). One reasons: if the assumptions are met, and the error probabilities approximately correct, then the statistically significant result would indicate µ > 150.5 (P-value .07, or severity level .93). But you happen to know (or so it is stipulated) that µ ≤ 150.5. Thus, that’s grounds to question whether the assumptions are met. You suspect it would fail an audit. In that case, put the blame where it belongs.[ii]
Please use the comments for questions and remarks.
From p. 359 SIST:
They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result “exaggerates” the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, “if the power is this high [.8], . . . overestimation of the magnitude of the effect will be small” (p. 3).
[Added 5/4/22: Remember, to say the power against the (assumed) known discrepancy from the null is less than .5 just is to say that the observed M (which just reaches statistical significance) exceeds the true value. And to say the power against the (assumed) known discrepancy exceeds .5 just is to say that it exceeds the observed M (so M is not exaggerating it). Also see note (i) from SIST.]
[i] There are slight differences depending on whether they are using 2 as the cut-off or 1.96, and from their using a two-sided test, but we hardly add anything for the negative direction: For (a), Pr(M < −2; µ = 2) = Pr(Z < −4) ≈ 0.
[ii] The point can also be made out by increasing power by dint of sample size. If n = 10,000, (σ/√n) = 0.1. Test T+(n=10,000) rejects H0 at the .025 level if M > 150.2. A 95% confidence interval is [150, 150.4]. With n = 100, the just .025 significant result corresponds to the interval [150, 154]. The latter is indicative of a larger discrepancy. Granted, sample size must be large enough for the statistical assumptions to pass an audit.
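The intervals in note [ii] can be verified with a short sketch (Python with scipy; the function name is mine):

```python
from scipy.stats import norm

sigma = 10.0
z = norm.ppf(0.975)   # ~1.96

def just_significant_ci(n):
    """Two-sided 95% CI when M just reaches the .025 cutoff of T+."""
    se = sigma / n ** 0.5
    m = 150.0 + z * se          # the just-significant observed mean
    return (m - z * se, m + z * se)

lo, hi = just_significant_ci(10_000)
print(round(lo, 1), round(hi, 1))   # 150.0 150.4
lo, hi = just_significant_ci(100)
print(round(lo, 1), round(hi, 1))   # 150.0 153.9
```

The larger sample yields the tighter interval, indicating the smaller discrepancy, just as the footnote says.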