# To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)

I said I’d reblog one of the 3-year “memory lane” posts marked in red, with a few new comments (in burgundy), from time to time. So let me comment on one referring to Ziliak and McCloskey on power (from Oct. 2011). I would think they’d want to correct some wrong statements, or explain their shifts in meaning. My hope is that, 3 years on, they’ll be ready to do so. By mixing some correct definitions with erroneous ones, they introduce more confusion into the discussion.

From my post 3 years ago: “The Will to Understand Power”: In this post, I will adhere precisely to the text, and offer no new interpretation of tests. Type 1 and 2 errors and power are just formal notions with formal definitions.  But we need to get them right (especially if we are giving expert advice).  You can hate the concepts; just define them correctly please.  They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good (keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference).

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine.

Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive effect as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01.

(1) The power of the test to detect H’(δ) =

P(test rejects null at .01 level; H’(δ) is true).

Say it is 0.85.

“If the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct”. (Z & M, 132-3).

But this is not so.  Perhaps they are slipping into the cardinal error of mistaking (1) as a posterior probability:

(1’) P(H’(δ) is true| test rejects null at .01 level)!

In dealing with these passages, I see why Spanos (2008) declared, “You have it backwards” in his review of their book. I had assumed (before reading it myself) that Z & M, as with many significance test critics, were pointing out that the fallacy is common, not that they were committing it.  I was confident (but I’m still hopeful) that they will want to correct this and not give Prionvac grounds to claim evidence of a large increase in survival simply because their test had a high capability (power) to detect that increase, if it were in fact present.

Other problematic assertions: “If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.”

No, higher power = lower hurdle.

They keep saying Fisher is guilty of requiring very low hurdles (for rejection) because his tests have low power, but it’s high power that translates into low hurdles.
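The link between power and hurdles can be checked numerically. What follows is only a minimal sketch, assuming a one-sided Normal test of H0: µ ≤ 0 with known σ; the function names and the particular numbers (σ = 1, n = 25, an alternative of µ = 0.6) are my own illustrative assumptions and come from neither Z & M nor Spanos. With the sample size and the alternative held fixed, the only way to get more power is to pull the cut-off for rejection down, that is, to lower the hurdle.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def cutoff(alpha, sigma=1.0, n=25):
    """Smallest sample mean M* that rejects H0: mu <= 0 at level alpha."""
    return Z.inv_cdf(1 - alpha) * sigma / sqrt(n)

def power(mu, m_star, sigma=1.0, n=25):
    """P(M > M*; mu): probability the test rejects when the true mean is mu."""
    return 1 - Z.cdf((m_star - mu) / (sigma / sqrt(n)))

# Illustrative alternative (an assumption of this sketch): mu = 0.6
for alpha in (0.001, 0.01, 0.025, 0.05, 0.10):
    m_star = cutoff(alpha)
    print(f"alpha={alpha:<6} cut-off M*={m_star:.3f}  "
          f"power at mu=0.6: {power(0.6, m_star):.3f}")
```

Reading down the printed rows, the cut-off falls as the power rises: the more powerful test is the one that is easier, not harder, to pass.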

When Z & M are talking about people who are doing “power analysis”, and summarizing their points, they do OK, but the power analysts are worried about interpreting non-statistically significant results with tests that have low power to detect any but the grossest effects!  The power analysts’ concern over low power is not in the case of rejection, which is Z & M’s concern.  I honestly don’t think they realize they are caught in a confusion, and it’s been a severe disappointment not to have a correction.

From Aris Spanos, with a bit of softening: When [Ziliak and McCloskey] claim that:

“What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.” (Z & M, p. 152),
they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n! (Spanos comment)
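Spanos’s point about sample size can be illustrated with a small calculation. This is a hedged sketch rather than anything from his review: it assumes a one-sided Normal test of H0: µ ≤ 0 with known σ = 1 at the .025 level, and the helper `power_at` and the fixed small discrepancy µ = 0.1 are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power_at(mu, n, alpha=0.025, sigma=1.0):
    """Return (cut-off M*, power at mu) for the one-sided test with sample size n."""
    se = sigma / sqrt(n)
    m_star = Z.inv_cdf(1 - alpha) * se          # rejection cut-off for the sample mean
    return m_star, 1 - Z.cdf((m_star - mu) / se)

# Fixed small discrepancy mu = 0.1 (an illustrative assumption)
for n in (10, 25, 100, 400, 1600):
    m_star, pw = power_at(0.1, n)
    print(f"n={n:<5} cut-off M*={m_star:.3f}  power at mu=0.1: {pw:.3f}")
```

As n grows, the cut-off shrinks toward 0 and the power against the same small discrepancy climbs toward 1. Easy rejection with large n goes hand in hand with high power, not low power.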

Now it would be one thing if they were claiming to do a Bayesian analysis, and if they made it clear they were illegitimately transposing claims about power to get a kind of posterior probability–problematic as this is. But that is not what they do. They are chastising significance tests, and mixing completely different meanings of power within the same chapter and even the same sentence. That is my main gripe.

I think they are mistaking

The power to reject a null in favor of alternative H’

as a posterior probability of H’, given a null has been rejected. But not all the time! As statistical reformers with an important responsibility to clarify these notions, I think they need to explain their transposition of power (in many places). Of course, having hugely high power against alternatives far away from the null is to be expected, and is typical. Do they really want to say that rejecting a null accords a high posterior probability to an alternative against which a Normal one-sided test has high power?

Use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with n iid samples, and known σ = 1:

H0: µ ≤  0 against H1: µ >  0

Let σ = 1, n = 25, so (σ/ √n) = .2.

Test T+ rejects H0 at the ~.025 level if M > 2(.2) = .4. So the cut-off M* = .4.

Consider a value for µ that T+ has high power to detect: Let µ = M* + 3(σ/ √n) = .4 + 3(.2) = 1.0. POW(1.0) ~ 1.

Spoze Mo, the observed mean, just makes it to the cut-off .4. Would they want to say that observing Mo = .4 (the cut-off) is good evidence that µ is as great as 1? Surely not, because were µ as great as 1, a higher M than .4 would have been observed (with prob ~1).   The result would be to erroneously infer huge discrepancies.
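The numbers in this illustration are easy to verify. The sketch below simply recomputes them under the stated assumptions (Normal mean, σ = 1, n = 25, cut-off at 2 standard errors); the variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 1.0, 25
se = sigma / sqrt(n)          # standard error: 0.2
m_star = 2 * se               # cut-off M* = 0.4, as in the text
mu_alt = m_star + 3 * se      # the alternative mu = 1.0

Z = NormalDist()  # standard Normal distribution
alpha = 1 - Z.cdf(m_star / se)                  # P(M > 0.4; mu = 0)
pow_alt = 1 - Z.cdf((m_star - mu_alt) / se)     # P(M > 0.4; mu = 1.0)
p_this_small = Z.cdf((m_star - mu_alt) / se)    # P(M <= 0.4; mu = 1.0)

print(f"level of T+           = {alpha:.4f}")
print(f"POW(1.0)              = {pow_alt:.4f}")
print(f"P(M <= 0.4; mu = 1.0) = {p_this_small:.4f}")
```

The level comes out at about .023 and POW(1.0) at about .9987. The last line is the one that matters here: if µ really were 1, a sample mean as small as .4 would occur only about 0.13% of the time, so an observed mean that just reaches the cut-off is poor evidence for so large a discrepancy.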

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, volume 1, issue 1: 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.


### 6 thoughts on “To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)”

1. Deirdre McCloskey

Dear Professor Mayo,

I’ll try to work through all this, and if you are right that we have erred, we will say so in print immediately. Unconditionally.

But here’s the trouble politically (so to speak) with you (and Spanos). By focusing on our confusions about power (let us stipulate that you are correct; and, after all, who is not confused about power!), you convey the strong impression that we are also mistaken about what is our main point, namely, that null hypothesis significance testing without a loss function is silly, unscientific, misleading, dangerous.

Do you understand that this is the effect of your picking away at our failure to achieve this or that item of PhD-in-statistics levels of accuracy in our remarks? If that’s what you want to achieve—defending null hypothesis significance testing without a loss function at arbitrary levels used for all manner of scientific questions such as drug trials, then, fine, you’ve done it!

But I take it that you are a fine statistician who (therefore: Egon Pearson, Neyman, et alii down to Bill Kruskal and Arnold Zellner) is contemptuous of the procedures in “evidence-based medicine” for example, and wishes to reform them.

So here’s my proposal. We will declare in print that we messed up in our discussion of power (if we did: it gives me a headache to work through your logic, but I suppose it is correct). And in the same publication you come out strongly against the obviously crazy procedures that, after all, were the main target of our work.

Deal?

Sincerely,

Deirdre McCloskey

• Deirdre:
Thanks for your comment. Sorry it’s taken me a while to reply; I was away from my regular computer.
Correctly understanding power is not some mere technicality. When explicators or reformers of tests erroneously define key concepts it does grave and serious damage– on my loss function. The consequence of this particular mistake is to exactly reverse many crucial arguments. So let’s concentrate on getting this right for now. I have devoted several blogposts to power (readers can search) and I’d be very happy to help you work through the logic. Raising a test’s power is a bit like raising the sensitivity of a fire alarm so that little by little it goes off with mere burning toast. If this one point about power and hurdles can be cleared up, I will feel the blog has accomplished at least one positive thing.

• john byrd

I do not have any degrees in statistics but see these issues as fundamental. I read the Z&M book several years ago and still have it. I found the presentation of significance testing to be puzzling indeed. Like most, I do not doubt there are countless abuses of poor experimental design and statistical interpretations out there. But, the situation is not helped by inaccurate depictions of the proper use and interpretation of tools such as significance testing. What is dangerous is clearly misuse of tools and the concepts underlying them. That is in no way unique to significance testing.

• john byrd

Deirdre: I believe that Mayo has already fulfilled her end of the deal, as in papers such as Mayo and Cox, where she has given very clear expositions of wrong-headed uses of error statistics. Your main points as you like to call them can be better presented by reference to concepts such as the fallacy of acceptance and fallacy of rejection.

• john byrd

I would like to add that once you clarify the issues that have you confused, I think you will see that significance testing has no need for or connection to loss functions, and that these tests are based on sound principle as originally designed. Again, the fact that some use them incorrectly does not invalidate them as a method. If you want to make a valid argument against these tests, then give a clear presentation of the problems that reflects clear understanding of how the tests are supposed to be applied. Don’t ask us to trust you are right for the wrong reasons. That would be a cult-like following.