As a Nietzschean, I am fond of the statistical notion of power; yet it is often misunderstood by critics of significance testing. Consider the leaders of the reform movement in economics, Ziliak and McCloskey (Michigan, 2008).

In this post, I will adhere precisely to the text, and offer no new interpretation of tests. Type 1 and 2 errors and power are just formal notions with formal definitions. But we need to get them right (especially if we are giving expert advice). You can hate them; just define them correctly please. They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good.

And the power of a test to detect that such and such a positive effect d is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as d is present.

Fine.

Let this alternative be abbreviated H’(d):

H’(d): there is a positive effect as large as d.

Suppose the test rejects the null when it reaches a significance level of .01.

(1) The power of the test to detect H’(d) equals

P(test rejects null at .01 level; H’(d) is true).

For example, if the prion vaccine is so effective that it increases survival by as much as 2 years, then, let us allow, the probability of rejecting the null would be high. Say it is .85.
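For concreteness, the power in (1) can be computed exactly for a simple one-sided z-test. The effect size, sigma, and sample size below are illustrative stand-ins (nothing in the text fixes them); they are chosen so the power comes out near the .85 of the example:

```python
from statistics import NormalDist
from math import sqrt

# One-sided z-test of H0: mu = 0 against H'(d): mu = d, with sigma known.
alpha = 0.01   # significance level used for rejection
d = 0.5        # hypothetical positive effect (illustrative units)
sigma = 1.0    # assumed known standard deviation
n = 46         # sample size, chosen so power lands near .85

z_alpha = NormalDist().inv_cdf(1 - alpha)  # cutoff for the standardized statistic
# Power = P(test rejects null at .01 level ; H'(d) is true)
power = 1 - NormalDist().cdf(z_alpha - d * sqrt(n) / sigma)
print(round(power, 2))  # roughly .85 with these illustrative numbers
```

The point of the sketch is only that power is an ordinary error probability of the test, fixed once the test, n, and the alternative d are fixed.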

“If the power of a test is high, say, .85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct” (pp. 132-3).

But this is not so. Perhaps they are slipping into the cardinal error of mistaking (1) as

(1’) P(H’(d) is true; test rejects null at .01 level)!

In dealing with these passages, I see why Spanos (2008) declared, “You have it backwards” in his review of their book. I had assumed (before reading it myself) that Z & M, like many significance test critics, were pointing out that the fallacy is common, not that they were committing it. I am confident they will want to correct this and not give Prionvac grounds to claim evidence of a large increase in survival simply because their test had a high capability (power) to detect that increase, were it in fact present.
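The gap between (1) and (1′) can be made vivid with Bayes's theorem. The prior below is purely illustrative (nothing in the text supplies one), which is exactly the point: (1′) requires an input that the power alone does not determine:

```python
# (1)  power = P(test rejects null ; H'(d) is true)  -- an error probability
# (1') P(H'(d) is true | test rejects null)          -- a posterior; needs a prior
power = 0.85   # P(reject | H'(d) true), as in the example above
alpha = 0.01   # P(reject | null true)
prior = 0.01   # illustrative prior P(H'(d) true) -- an assumption, not from the text

posterior = power * prior / (power * prior + alpha * (1 - prior))
print(round(posterior, 3))  # about 0.462: nowhere near the power of 0.85
```

With a different illustrative prior the posterior moves accordingly, while the power (1) stays fixed; conflating the two is precisely the inversion at issue.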

What Z & M say about power analysts in this chapter is fine; but the power analysts are concerned with interpreting non-statistically significant results from tests that have low power to detect any but the grossest effects. Perhaps this is the source of their confusion.

[Aside: They do not say whether a rejection of the null is merely to infer the existence of a (positive) effect or to infer an effect of size d, so neither do I. They also do not say whether it is a one-sided or two-sided test; neither qualification affects the upshot of the flaw I am raising.]

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s *The Cult of Statistical Significance*, *Erasmus Journal for Philosophy and Economics*, volume 1, issue 1: 154-164.

Ziliak, S. and McCloskey, D. (2008), *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives*, University of Michigan Press.

After reading only the first few pages of the Ziliak and McCloskey book, it becomes clear to an informed reader that the authors exhibit an alarming lack of knowledge of the most basic concepts of frequentist testing, including the p-value, the type I and II error probabilities, the power of a test, and the role of the sample size n. Let me give two examples.

1. When they claim that:

“A good and sensible rejection of the null is, among other things, a rejection with high power.” (p. 133),

they exhibit a remarkable ignorance about the key issue of their book. High power for discrepancies close to the null is the main culprit behind confusing statistical with substantive significance. Such an oversensitive test will pick up a small discrepancy and give rise to statistical significance.

2. When they claim that:

“What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.” (p. 152),

they exhibit even more remarkable ignorance about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’, separated by ‘or’, contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult precisely because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and of optimal tests. Their ignorance is even more apparent in their second claim, which is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high, not low, power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!
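Both of Spanos's points (that an oversensitive, high-power test will flag substantively trivial discrepancies, and that the power of a consistent test increases monotonically with n) can be checked numerically for a one-sided z-test; the effect sizes and alpha below are illustrative:

```python
from statistics import NormalDist
from math import sqrt

def power(n, d, sigma=1.0, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = 0 against the alternative mu = d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_alpha - d * sqrt(n) / sigma)

# Power rises monotonically with n against a fixed alternative (d = 0.3):
powers = [round(power(n, d=0.3), 3) for n in (10, 50, 200)]
print(powers)  # strictly increasing in n

# With a huge n, even a substantively trivial discrepancy (d = 0.02) is
# detected almost surely, so statistical significance is nearly guaranteed:
print(power(100_000, d=0.02) > 0.999)
```

This is the sense in which easy rejection goes with high power via large n, never with low power.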


I hope readers will take seriously your detailed response to Z and M. In their response to you, it seems they try to pull the wool over the reader’s eyes:

In correcting their erroneous definition of a statistical concept, it is only obvious that it is to the correct statistical definition that one must refer. And yet their ploy is to dismiss such corrections as mere “technical smoke”!

“Spanos throws up a lot of technical smoke that has the effect of obscuring the plain fact that he agrees with us. (The mathematics in his piece is irrelevant to anything of importance. The reader may omit it.)” (166)

The “smoke and mirrors” charge is one that they, not Spanos, are guilty of. There is another bit of rhetorical flimflam here that warrants a post of its own: start from an uncontroversial (and trivial) point on which there is no disagreement, and infer “he agrees with us”, hoping the reader will be misled into assuming this holds as regards the crucial issue under challenge. Here’s a rule of thumb: when disputants give assurance that “the reader may omit” something, the reader can be sure that that something is altogether essential! The magician’s strategy, as everyone knows, is always that by clever misdirection you will not look where they don’t want you to look!

Ziliak, S. and McCloskey, D. (2008), “Science is judgment, not only calculation: a reply to Aris Spanos’s review of *The cult of statistical significance*”, *Erasmus Journal for Philosophy and Economics*, volume 1, issue 1: 165-170.

http://ejpe.org/pdf/1-1-br-2r….

I see a bit of value in deconstructing the criticisms in texts such as Z&M's. Today, I read a report in the Guardian stating that a Bayesian analysis had been the basis for overturning a criminal case in the UK (the judge tossed out the analysis). Some Bayesian proponents claimed to the journalist that this court decision could undermine justice (no less) because Bayesian models are the only way to get to the truth with forensic evidence. Yet the judge's concern, that the data fed into the Bayesian model could not be shown to be reliable, seems to have been entirely lost on the experts interviewed for the article. Are we to believe that Bayesian analysis of bogus data will ensure justice? (I would submit that presentation of accurate numbers with no statistics at all is more desirable.)

Yesterday, I read an article in Significance magazine written by Z that celebrated a US Supreme Court decision that reversed a decision that was heavily informed by an apparently poorly framed significance test. Obviously, these debates have import in current events outside academe.

The critiques of frequency approaches fixate on poor applications of frequency approaches more than on the underlying philosophy (judging from the examples offered in support of the criticisms). The narratives are quite rhetorical, and succeed in attracting followers (especially graduate students). Z's critique of the significance test in the court case is rhetorical more than technical, and treats the ill-conceived significance test he described as a reason why frequency approaches should be avoided. (Though we did not get a lot of details, it appears that the framing of the test failed to account for relevant variables. It might have been what Spanos writes about a lot: poor model specification.)

It is rare to non-existent to find any robust treatments concerning how the Bayesian is to deal with 1) inadequate samples (not just in size), 2) the selection of objective priors (assuming equal priors is not objective), or 3) the numerous caveats that must accompany the posterior probability (such as “we could not quantify all relevant information”). What is presented in court is primarily a very impressive number, typically built from a shaky foundation (I have never seen sensible priors). For the forensic applications, it appears this is a problem that will get worse until some tragic mistake forces a correction. I have seen problems in practice already due to the above caveats, and I see little effort to illuminate these issues or correct them.

Frequency models are more modest in their intent, which seems to be lost on the critics. Their limited nature often– though not always– makes them clearly interpretable and suitable for informing a larger, more complex analysis that may not be entirely statistical. I struggle with the coin example described from the Kadane text (Note: I ordered the book and will read it), because no one would ever address a real problem in such a pin-headed manner. It demonstrates nothing to me. So, I agree that some criticism of the critics is in order.

There is another hugely misleading interpretation of the Supreme Court case, which is beyond trivial. First, the case had to do with information that could conceivably adversely affect a stock price! Anyone who trades knows that if the FDA is sending letters and there are court cases about side effects, as in this case, then it is misleading to declare that everything is hunky dory, as this drug company did. The Supreme Court merely said that it would be wrong to rule out information as possibly relevant to stock prices SOLELY because no test had shown statistical significance. What is more, and this part surprised me, it made clear that no company has to say anything; a company is only required to say something if it had already said something such that not saying the further thing would possibly be misleading, and that misleadingness could possibly influence the stock price. So it is absurdly weak, and has really nothing to do with the business of correctly interpreting, or the value of using, significance tests. On the contrary, the Supreme Court granted that significance testing was an important tool; its point was only that lots of other information can affect a stock price, and shareholders should be informed, if the company had already addressed the issue at all. I will write more on this case later.

You do not “adhere precisely to the text” as you claim. On page 132 the book states that “If *on the other hand* the power of a test is high…”; the quotation is from a passage where Z&M say that high power is only “one element of a good rejection”. You are therefore suggesting Z&M correct something they didn’t actually say.

For more confusion about this passage (which I actually agree is not wonderful) see

http://kvams.wordpress.com/200…

There is no “only” in the sentence. In any event, the issue is misinterpreting power.

Quite right, but there are no quotes around “only” in my text. “Only” doesn’t even change the meaning; if something is described as one element of a system it is implicit that there are more.

The issue, for me, is the (mis)quotation of Z&M, out of context – an area where Z&M are no angels either. After a promise to “adhere precisely” it looks pretty sloppy.

I’m (still) looking for this blog to provide a constructive justification for frequentist statistics… a goal not achieved by taking potshots at Z&M, or Kadane, or other antagonists.

If you want a constructive justification for using frequentist statistics, then please have a look at some things I’ve written to that end over the years. How about Mayo, *Error and the Growth of Experimental Knowledge* (1996)? Its chapters can be found on the web pages linked to this page. Here on the blog, UNLIKE in my published work, I am deliberately taking up some howlers that have never been corrected. My published work on philosophy of statistics, by contrast, is directed not just at a correct interpretation of statistical methods, but at the development of an entire philosophy of experiment and inference that goes along with these tools. An adequate statistical account for the philosophy of science I develop MUST enable evaluation of a method’s error probabilities, and it must do so in such a way that it is relevant to appraising how well or severely tested various claims are. So, there is a whole story out there. Do your due diligence!!!! It’s up to you; I can’t force-feed it!

This might explain why I could never understand their claim that Fisher allowed low hurdles for rejection on account of having low-powered tests. In fact, it is very high-powered tests that result in low hurdles for rejecting the null.
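The commenter's point shows up directly in the rejection cutoff: for a one-sided z-test, the null is rejected when the sample mean exceeds z_alpha · sigma / √n, a hurdle that shrinks as n (and hence the power against any fixed alternative) grows. A minimal sketch, with illustrative numbers:

```python
from statistics import NormalDist
from math import sqrt

def cutoff(n, sigma=1.0, alpha=0.01):
    """Smallest sample mean that rejects H0: mu = 0 in a one-sided z-test."""
    return NormalDist().inv_cdf(1 - alpha) * sigma / sqrt(n)

small_n, large_n = cutoff(10), cutoff(1000)
print(round(small_n, 3), round(large_n, 3))  # the hurdle shrinks roughly 10-fold
```

The higher-powered (larger-n) test thus has the lower hurdle, the opposite of the Z&M reading.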

Thank you for pointing this out.