I said I’d reblog one of the 3-year “memory lane” posts marked in red, with a few new comments (in burgundy), from time to time. So let me comment on one referring to Ziliak and McCloskey on power (from Oct. 2011). I would think they’d want to correct some wrong statements, or explain their shifts in meaning. My hope is that, 3 years on, they’ll be ready to do so. By mixing some correct definitions with erroneous ones, they introduce more confusion into the discussion.
From my post 3 years ago: “The Will to Understand Power”: In this post, I will adhere precisely to the text, and offer no new interpretation of tests. Type 1 and 2 errors and power are just formal notions with formal definitions. But we need to get them right (especially if we are giving expert advice). You can hate the concepts; just define them correctly please. They write:
“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”
So far so good (keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference).
And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.
Let this alternative be abbreviated H’(δ):
H’(δ): there is a positive effect as large as δ.
Suppose the test rejects the null when it reaches a significance level of .01.
(1) The power of the test to detect H’(δ) =
P(test rejects null at .01 level; H’(δ) is true).
Say it is 0.85.
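To make the formal notion in (1) concrete, here is a quick numerical sketch (Python, standard library only) of a one-sided Normal test at the .01 level whose power against a given discrepancy δ comes out near 0.85. The particular σ, n, and δ are illustrative assumptions of mine, not values from Z & M:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF, via the error function (standard library only)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Illustrative one-sided z-test of H0: mu <= 0 vs H1: mu > 0
sigma, n = 1.0, 25          # assumed known sigma and sample size
se = sigma / sqrt(n)        # standard error of the sample mean M
z_alpha = 2.326             # approximate .01-level cut-off for (M - 0)/se

def power(delta):
    # (1): P(test rejects null at .01 level; true mean = delta)
    return 1.0 - phi(z_alpha - delta / se)

print(round(power(0.0), 2))   # -> 0.01: at the null, power equals the level
print(round(power(0.67), 2))  # -> 0.85: a discrepancy detected with power ~.85
```

Note that (1) is computed under a hypothesized value of the parameter; nothing in it conditions on the rejection having occurred.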
“If the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct”. (Z & M, 132-3).
But this is not so. Perhaps they are slipping into the cardinal error of mistaking (1) as a posterior probability:
(1’) P(H’(δ) is true| test rejects null at .01 level)!
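To see how far apart (1) and (1’) can be, here is a toy calculation (the two-point prior is purely illustrative, not anything proposed in the text): with a .01-level test of power 0.85, the posterior probability (1’) can come out below one half:

```python
# Toy two-point prior: mu = 0 (the null) with probability 0.99,
# or H'(delta) true with probability 0.01; purely illustrative numbers.
alpha = 0.01        # P(reject; null true), the test's size
pow_hprime = 0.85   # P(reject; H'(delta) true), the power, as in (1)
prior_null = 0.99   # assumed prior probability that the null is true

# Bayes' rule for (1'): P(H'(delta) is true | test rejects null)
posterior = ((1 - prior_null) * pow_hprime) / (
    (1 - prior_null) * pow_hprime + prior_null * alpha)
print(round(posterior, 2))  # -> 0.46, well below the power of 0.85
```

So even a high-powered rejection need not make H’(δ) “highly probably correct”; (1) and (1’) answer different questions.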
In dealing with these passages, I see why Spanos (2008) declared, “You have it backwards” in his review of their book. I had assumed (before reading it myself) that Z & M, as with many significance test critics, were pointing out that the fallacy is common, not that they were committing it. I was confident (and I’m still hopeful) that they will want to correct this and not give Prionvac grounds to claim evidence of a large increase in survival simply because their test had a high capability (power) to detect that increase, if it were in fact present.
Other problematic assertions: “If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.”
No, higher power = lower hurdle.
They keep saying Fisher is guilty of requiring very low hurdles (for rejection) because his tests have low power, but it’s high power that translates into low hurdles.
When Z & M are talking about people who are doing “power analysis”, and summarizing their points, they do OK, but the power analysts are worried about interpreting non-statistically significant results with tests that have low power to detect any but the grossest effects! The power analysts’ concern over low power is not in the case of rejection, which is Z & M’s concern. I honestly don’t think they realize they are caught in a confusion, and it’s been a severe disappointment not to have a correction.
From Aris Spanos, with a bit of softening: When [Ziliak and McCloskey] claim that:
“What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.” (Z & M, p. 152),
they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos comment)
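Spanos’s point that rejections come easily with large n precisely because power rises with n is easy to check numerically. A sketch (Python, standard library; the level, discrepancy, and sample sizes are arbitrary illustrative choices):

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(n, delta=0.2, sigma=1.0, z_alpha=1.96):
    # Power of a one-sided ~.025-level z-test against a fixed discrepancy delta
    se = sigma / sqrt(n)
    return 1.0 - phi(z_alpha - delta / se)

powers = [round(power(n), 2) for n in (25, 100, 400)]
print(powers)  # -> [0.17, 0.52, 0.98]: power climbs monotonically with n
```

Holding the level and the discrepancy fixed, quadrupling n shrinks the standard error by half, so the same δ sits ever more standard errors beyond the cut-off: high power, low hurdle.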
Now it would be one thing if they were claiming to do a Bayesian analysis, and if they made it clear they were illegitimately transposing claims about power to get a kind of posterior probability, problematic as this is. But that is not what they do. They are chastising significance tests, and mixing completely different meanings of power within the same chapter and even the same sentence. That is my main gripe.
I think they are mistaking
The power to reject a null in favor of alternative H’
as a posterior probability of H’, given a null has been rejected. But not all the time! As statistical reformers with an important responsibility to clarify these notions, I think they need to explain their transposition of power (in many places). Of course, having hugely high power against alternatives far away from the null is to be expected, and is typical. Do they really want to say that rejecting a null accords a high posterior probability to an alternative against which a Normal one-sided test has high power?
Use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with n iid samples, and known σ = 1:
H0: µ ≤ 0 against H1: µ > 0
Let σ = 1, n = 25, so (σ/ √n) = .2.
Test T+ rejects H0 at ~ .025 level if M > 2(.2) = .4. So the cut-off M*= .4.
Consider a value for µ that T+ has high power to detect: Let µ = M* + 3(σ/ √n) = .4 + 3(.2) = 1.0. POW(1.0) ~ 1.
Spoze Mo, the observed mean, just makes it to the cut-off .4. Would they want to say that observing Mo= .4 (the cut-off) is good evidence that µ is as great as 1? Surely not because were µ as great as 1, a higher M than .4 would have been observed (with prob ~1). The result would be to erroneously infer huge discrepancies.
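The arithmetic of the T+ example can be verified directly (Python, standard library only), using the test’s own numbers:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)    # sigma/sqrt(n) = 0.2
cutoff = 2 * se         # M* = 0.4, the ~.025-level cut-off for T+

def power(mu):
    # P(M > M*; true mean = mu)
    return 1.0 - phi((cutoff - mu) / se)

print(round(power(1.0), 3))                 # -> 0.999: POW(1.0) ~ 1
# Were mu as great as 1, P(M <= .4; mu = 1) = phi(-3):
print(round(phi((cutoff - 1.0) / se), 3))   # -> 0.001
```

An observed mean just at the cut-off would itself be wildly improbable were µ as large as 1, which is exactly why reading high power as licensing the inference to µ = 1 gets things backwards.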
Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, volume 1, issue 1: 154-164.
Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.