Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) Even if it is not entirely taboo, some regard power as irrelevant post-data[i]; the reason I’ve heard runs along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy (My analogy is imperfect but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

Mayo comment (in reply): A crucial disanalogy arises: You see the statue and you see the observed difference in a test, but even when the stat sig alarm goes off, you are not able to see the discrepancy that generated the observed difference or the alarm you hear. You don’t know that you’ve arrived (at the cause). The statistical inference problem is precisely to make that leap from the perceived alarm to some aspect of the underlying process that resulted in the alarm being triggered. Then it is of considerable relevance to exploit info on the capability of your test procedure to result in alarms going off (perhaps of different loudness), due to varying values of an aspect of the underlying process: µ′, µ″, µ‴, etc.

Using the loudness of the alarm you actually heard, rather than the minimal stat sig bell, would be analogous to using the p-value rather than the pre-data cut-off for rejection. But the logic is just the same.

While post-data power is scarcely taboo for a severe tester, severity always uses the actual outcome, with its level of statistical significance, whereas power is in terms of the fixed cut-off. Still, power provides (worst-case) pre-data guarantees. Now before you get any wrong ideas, I am not endorsing what some people call retrospective power, and I call “shpower”, which goes against severity logic and is misconceived.

We are reading the Fisher-Pearson-Neyman “triad” tomorrow in Phil6334. Even here (i.e., Neyman 1956), Neyman alludes to a post-data use of power. But, strangely enough, I only noticed this after discovering more blatant discussions in what Spanos and I call “Neyman’s hidden papers”. Here’s an excerpt from Neyman’s Nursery (part 2) [NN-2]:


One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+.

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ0 iff d(x0) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1)  P(d(X) > cα; µ =  µ0 + δ)
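For the Normal test T+ described above, (1) has a closed form, since d(X) is Normal with mean √n·δ/σ and unit variance when µ = µ0 + δ. Here is a minimal Python sketch of that calculation; the function name `power_t_plus` and the illustrative values (σ = 1, n = 25, α = 0.025) are my own assumptions, not numbers from Neyman's example.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal: the distribution of d(X) when mu = mu0


def power_t_plus(delta, sigma, n, alpha=0.025):
    """Formula (1): P(d(X) > c_alpha; mu = mu0 + delta) for test T+."""
    c_alpha = Z.inv_cdf(1 - alpha)    # cutoff for significance at level alpha
    shift = sqrt(n) * delta / sigma   # mean of d(X) when mu = mu0 + delta
    return 1 - Z.cdf(c_alpha - shift)


# Illustrative (hypothetical) settings: sigma = 1, n = 25, alpha = 0.025.
for delta in (0.0, 0.2, 0.4, 0.8):
    print(f"delta={delta:.1f}  power={power_t_plus(delta, sigma=1.0, n=25):.3f}")
```

At δ = 0 the power equals α, and it climbs toward 1 as δ grows. With these settings a discrepancy of 0.2 is detected only about 17% of the time, which is the kind of situation Neyman warns about.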

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed.  He sounds like a Cohen-style power analyst!  Still, power is calculated relative to an outcome just missing the cutoff  cα.  This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results.  It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2)  P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
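To see how (1) and (2) can come apart, here is a small sketch under assumed settings (σ = 1, n = 25, α = 0.025, δ = 0.3, and an observed d(x0) = 0). These are illustrative numbers of my own choosing, not the ones in Mayo and Spanos 2006, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def power_t_plus(delta, sigma, n, alpha=0.025):
    """Formula (1): P(d(X) > c_alpha; mu = mu0 + delta)."""
    c_alpha = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(c_alpha - sqrt(n) * delta / sigma)


def severity_nonsig(d_obs, delta, sigma, n):
    """Formula (2): P(d(X) > d(x0); mu = mu0 + delta), for inferring mu < mu0 + delta."""
    return 1 - Z.cdf(d_obs - sqrt(n) * delta / sigma)


# Assumed example: the observed mean sits exactly at mu0, so d(x0) = 0.
sigma, n, delta, d_obs = 1.0, 25, 0.3, 0.0
print(f"power (1):    {power_t_plus(delta, sigma, n):.3f}")
print(f"severity (2): {severity_nonsig(d_obs, delta, sigma, n):.3f}")
```

With these inputs the power against δ = 0.3 is only about 0.32, while the severity for µ < µ0 + 0.3 is about 0.93. The two coincide exactly when d(x0) equals the cutoff cα, which is the worst case of a negative result.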

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way, the way I’d defined it in Mayo (1996) and before. We were thinking at first of calling it “attained power” but then came across what some have called “observed power”, which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (I call this “shpower”.)
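A quick sketch shows why this is strange. Plugging the observed mean in for µ makes the shift of d(X) equal to d(x0) itself, so for test T+ the quantity depends on the data only through d(x0), that is, through the p-value. The function names and the α = 0.025 default below are my own illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def p_value(d_obs):
    """One-sided p-value for test T+: P(d(X) > d(x0); mu = mu0)."""
    return 1 - Z.cdf(d_obs)


def shpower(d_obs, alpha=0.025):
    """'Observed power': formula (1) with mu set equal to the observed mean."""
    c_alpha = Z.inv_cdf(1 - alpha)
    # Setting mu to the observed mean makes the mean of d(X) equal d_obs.
    return 1 - Z.cdf(c_alpha - d_obs)


for d_obs in (0.0, 1.0, 1.5, 1.9):  # all non-significant at alpha = 0.025
    print(f"d(x0)={d_obs:.1f}  p={p_value(d_obs):.3f}  shpower={shpower(d_obs):.3f}")
```

Under these assumptions every non-significant result yields a value below 0.5, and the value is exactly 0.5 when d(x0) sits at the cutoff. It simply re-expresses the p-value and says nothing about the discrepancies δ the test was capable of detecting.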

Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV:1 A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., smaller p value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises: namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough. …

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant. It didn’t have to be done this way (at first I didn’t do it this way), but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.
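The point about significant results can be sketched the same way. This excerpt does not write out the formula for that case, so I am assuming the form from the Mayo and Spanos (2006) account: for a significant outcome, the severity for inferring µ > µ0 + δ is P(d(X) ≤ d(x0); µ = µ0 + δ). The function name and the numbers (d(x0) = 2, σ = 1, δ = 0.2) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def severity_sig(d_obs, delta, sigma, n):
    """Assumed form: SEV(mu > mu0 + delta) = P(d(X) <= d(x0); mu = mu0 + delta)."""
    return Z.cdf(d_obs - sqrt(n) * delta / sigma)


# Same just-significant outcome d(x0) = 2.0, same delta, two sample sizes.
for n in (25, 400):
    sev = severity_sig(d_obs=2.0, delta=0.2, sigma=1.0, n=n)
    print(f"n={n:3d}  SEV(mu > mu0 + 0.2) = {sev:.3f}")
```

With n = 25 the severity for µ > µ0 + 0.2 is about 0.84; with n = 400, where the test has far more power against that discrepancy, the same d(x0) gives only about 0.02. Higher power here means poorer warrant for inferring the discrepancy, the reverse of the negative-result case.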

[i] To repeat: some may be thinking of an animal I call “shpower”.

[ii] I realize comments are informal and unpolished, but isn’t that the beauty of blogging?

NOTE: To read the full post go to [NN-2]. There are 5 Neyman’s Nursery posts (NN1-NN5). Search this blog for the others.


Cohen, J. (1992) A Power Primer.

Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97. (Reprinted in Mayo & Spanos 2010.)

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference, Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.

Categories: exchange with commentators, Neyman's Nursery, P-values, Phil6334, power, Stephen Senn


6 thoughts on “Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity”

  1. Deborah, I quote “Using the loudness of the alarm you actually heard, rather than the minimal stat sig bell, would be analogous to using the p-value rather than the pre-data cut-off for rejection. But the logic is just the same.” But you have to go further. You have to think that the P-value itself is a sufficient summary of loudness and most statisticians don’t.

    • Stephen: The same reasoning would apply to your chosen distance or test statistic. The main thing, my main point, is the relevance of considering the capability of the test you have used/designed–post data–, and the result might be that your test is lousy.

      I think the taboo against power (post data) stems from assuming it comes hand in hand with an extreme Neyman-style behavioristic approach. But look at Neyman sounding just like a Cohen power analyst, and almost like a severity assessor.

  2. e.Berk

    This Neyman isn’t the behaviorist depicted in Fisher at all, but did he connect with Cohen?

    • I doubt it, but actually do not know if Neyman connected with Cohen. It’s as if that literature is entirely separate.

  3. Sam Dickson

    I think the best use of power post-data is when you do not achieve significance. If you have designed an experiment to be significant if a minimally interesting effect is present and no significance is achieved, then we can be more confident that an interesting effect is not present.

    To extend Senn’s analogy, suppose we did not know how far it is from Glasgow to Ellis Island. If we choose a boat that we know can probably make it 1000 miles and the boat ends up sinking, then we can conclude that Glasgow and Ellis Island are probably more than 1000 miles apart.

    • Sam: Hadn’t seen this. Yes on your first para, but this is a sufficient not a necessary condition. The power can be low but the particular outcome so insignificantly different from the null that we can still argue that a given discrepancy from the null is absent. At least that’s the severity logic.
      On the second point, well I take it you mean something like:

      If the distance was < 1000 miles, then very probably it would not sink.
      It sank, so there's evidence the distance is at least 1000 miles.

      Note, this would not assign a probability to its being at least 1000 miles. But in reality the first premise would be very dubious, in the sense that there are many reasons it could sink other than distance.
