# Neyman’s Nursery

## Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy (My analogy is imperfect but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

## Comments on Wasserman’s “what is Bayesian/frequentist inference?”

What I like best about Wasserman’s blogpost (Normal Deviate) is his clear denial that merely using conditional probability makes the method Bayesian (even if one chooses to call the conditional probability theorem Bayes’s theorem, and even if one is using ‘Bayes’s’ nets). Else any use of probability theory is Bayesian, which trivializes the whole issue. Thus, the fact that conditional probability is used in an application with possibly good results is not evidence of (yet another) Bayesian success story [i].

But I do have serious concerns that in his understandable desire (1) to be even-handed (hammers and screwdrivers are for different purposes, both perfectly kosher tools), as well as (2) to give a succinct sum-up of methods, Wasserman may encourage misrepresenting positions. Speaking only for “frequentist” sampling theorists [ii], I would urge moving away from the recommended quick sum-up of “the goal” of frequentist inference: “Construct procedures with frequency guarantees”. If by this Wasserman means that the direct aim is to have tools with “good long run properties”, that rarely err in some long run series of applications, then I think it is misleading. In the context of scientific inference or learning, such a long-run goal, while necessary, is not at all sufficient; moreover, I claim that satisfying this goal is actually just a byproduct of deeper inferential goals (controlling and evaluating how severely given methods are capable of revealing/avoiding erroneous statistical interpretations of data in the case at hand). (So I deny that it is even the main goal to which frequentist methods direct themselves.) Even arch behaviorist Neyman used power post-data to ascertain how well corroborated various hypotheses were, never mind long-run repeated applications (see one of my Neyman’s Nursery posts).

## Neyman’s Nursery (NN5): Final Post

I want to complete the Neyman’s Nursery (NN) meanderings while we have some numbers before us, and while there is a particular example, test T+, on the table. Despite my warm and affectionate welcoming of the “power analytic” reasoning I unearthed in those “hidden Neyman” papers (see post from Oct. 22), reasoning that is, admittedly, largely lost in the standard decision-behavior model of tests, it still retains an unacceptable coarseness: power is always calculated relative to the cut-off point cα for rejecting H0. But rather than throw out the baby with the bathwater, we should keep the logic and take account of the actual value of the statistically insignificant result.

__________________________________

(For those just tuning in, power analytic reasoning aims to avoid the age-old fallacy of taking a statistically insignificant result as evidence of 0 discrepancy from the null hypothesis, by identifying discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferences of the form:  μ < μ0 + γ.

We are illustrating (as does Neyman) with a one-sided test T+, with μ0 = 0, and α = .025. Spoze σ = 1, n = 25, so the standard error σ/√n is .2 and the sample mean X is statistically significant only if it exceeds 1.96 × .2 = .392.

Power-analytic reasoning says (in relation to our test T+):

If X is statistically insignificant and the POW(T+, μ= μ1) is high, then X indicates, or warrants inferring (or whatever phrase you like) that  μ < μ1.)
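The cutoff and the power function for this setup can be checked numerically. What follows is only a minimal sketch under the stated assumptions (Normal sample mean, known σ = 1, n = 25, α = .025); the names `CUTOFF` and `power` are introduced here for illustration and come from no particular library.

```python
from math import sqrt
from statistics import NormalDist

# Test T+ as set up in the text: H0: mu <= 0, alpha = .025, sigma = 1, n = 25.
MU0, ALPHA, SIGMA, N = 0.0, 0.025, 1.0, 25
SE = SIGMA / sqrt(N)  # standard error of the sample mean (0.2)

# Cutoff for the sample mean: mu0 + z_alpha * SE.
CUTOFF = MU0 + NormalDist().inv_cdf(1 - ALPHA) * SE


def power(mu1: float) -> float:
    """POW(T+, mu = mu1): probability the sample mean exceeds the cutoff when mu = mu1."""
    return 1 - NormalDist(mu1, SE).cdf(CUTOFF)


if __name__ == "__main__":
    print(f"cutoff = {CUTOFF:.3f}")
    for mu1 in (0.2, 0.4, 0.6, 0.8):
        print(f"POW(T+, mu = {mu1}) = {power(mu1):.3f}")
```

Under these assumptions the cutoff comes out at about .392, and the power rises from roughly .17 at μ1 = .2 to roughly .98 at μ1 = .8, so an insignificant result would license μ < .8 by power-analytic reasoning far more readily than μ < .2.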

_______________________________

Suppose one had an insignificant result from test T+  and wanted to evaluate the inference:   μ < .4

(it doesn’t matter why just now, this is an illustration).

Since POW(T+, μ = .4) is hardly more than .5, Neyman would say “it was a little rash” to regard the observed mean as indicating μ < .4. He would say this regardless of the actual value of the statistically insignificant result. There’s no place in the power calculation, as defined, to take into account the actual observed value. (1)

That is why, although  high power to detect  μ as large as  μ1 is sufficient for regarding the data as good evidence that   μ < μ1 , it is too coarse to be a necessary condition.  Spoze, for example, that X = 0.

Were μ as large as .4, we would, with high probability (~.98), have observed a larger difference from the null than we did. Therefore, our data provide evidence that μ < .4. (2)
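To make the contrast with power concrete, here is a small sketch of the severity calculation for an inference of the form μ < μ1, again assuming σ = 1 and n = 25. The function name `severity_upper` is hypothetical, chosen only for this illustration.

```python
from math import sqrt
from statistics import NormalDist

# Same setup as test T+ in the text: sigma = 1, n = 25.
SIGMA, N = 1.0, 25
SE = SIGMA / sqrt(N)  # standard error of the sample mean (0.2)


def severity_upper(xbar: float, mu1: float) -> float:
    """Severity for inferring mu < mu1 from an observed mean xbar:
    the probability of a sample mean larger than xbar, were mu equal to mu1."""
    return 1 - NormalDist(mu1, SE).cdf(xbar)


if __name__ == "__main__":
    # Observed mean of 0: the case discussed in the text.
    print(f"SEV(mu < .4 | xbar = 0)   = {severity_upper(0.0, 0.4):.3f}")
    # An insignificant result just short of the .392 cutoff.
    print(f"SEV(mu < .4 | xbar = .39) = {severity_upper(0.39, 0.4):.3f}")
```

With an observed mean of 0 the value is about .98, as in the text; with an observed mean of .39, just short of the cutoff, it falls to about .52, essentially the power at μ = .4. The ordinary power calculation thus corresponds to the worst case among insignificant results.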

We might say that the severity associated with μ < .4 is high.  There are many ways to articulate the associated justification—I have done so at length earlier; and of course it is no different from “power analytic reasoning”.  Why consider a miss as good as a mile?

When I first introduced this idea in my Ph.D. dissertation, I assumed researchers already did this, in real life, since it introduces no new logic. But I’ve been surprised not to find it.

I was (and am) puzzled to discover under “observed power” the Shpower computation, which we have already considered and (hopefully) gotten past—at least for present purposes, namely, reasoning from insignificant results to inferences of the form: μ < μ0 + γ.

Granted, there are some computations which you might say lead to virtually the same results as SEV, e.g., certain confidence limits, but even so there are differences of interpretation. (3) Let me know if you think I am wrong; there may well be something out there I haven’t seen….
____________
(1) This does not mean the place to enter it is in the hypothesized value of μ under which the power is computed (as with Shpower). This is NOT power, and as we have seen in two posts, it is fallacious to equate it to power or to power analytic reasoning. Note that the Shpower associated with X = 0 is .025; that we are interested in μ < .4 does not enter.

(2) It doesn’t matter here if we use ≤ or < .

(3) For differences between SEV and confidence intervals, see Mayo 1996, Mayo and Spanos 2006, 2011.
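The Shpower figure in note (1) can be reproduced with a short sketch, assuming the same test T+ (μ0 = 0, σ = 1, n = 25, α = .025). The function name `shpower` is just a label for “the power formula evaluated at the observed mean”; it is not a library routine.

```python
from math import sqrt
from statistics import NormalDist

# Test T+ from the text: H0: mu <= 0, alpha = .025, sigma = 1, n = 25.
MU0, ALPHA, SIGMA, N = 0.0, 0.025, 1.0, 25
SE = SIGMA / sqrt(N)
CUTOFF = MU0 + NormalDist().inv_cdf(1 - ALPHA) * SE  # about .392


def shpower(xbar: float) -> float:
    """'Observed power': the ordinary power formula with mu set to the observed mean."""
    return 1 - NormalDist(xbar, SE).cdf(CUTOFF)


if __name__ == "__main__":
    for xbar in (0.0, 0.2, 0.39):
        print(f"xbar = {xbar}: shpower = {shpower(xbar):.3f}")
```

Note that the value of μ1 in the inference of interest (here .4) appears nowhere in the calculation, and that for any insignificant result (observed mean below the cutoff) the number returned is below .5 by construction.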

Categories: Neyman's Nursery, Statistics

## Logic Takes a Bit of a Hit!: (NN4) Continuing: Shpower ("observed" power) vs Power:

Logic takes a bit of a hit—student driver behind me.  Anyway, managed to get to JFK, and meant to explain a bit more clearly the first “shpower” post.
I’m not saying shpower is illegitimate in its own right, or that it could not have uses, only that finding that the logic of power analytic reasoning does not hold for shpower is no skin off the nose of power analytic reasoning.


## Neyman’s Nursery (NN3): SHPOWER vs POWER

 EGEK weighs 1 pound

Before leaving base again, I have a rule to check on weight gain since the start of my last trip. I put this off till the last minute, especially when, like this time, I know I’ve overeaten while traveling. The most accurate of the 4 scales I generally use (one is at my doctor’s) is actually in Neyman’s Nursery upstairs. To my surprise, none of these scales showed any discernible increase over when I left. At least one of the 4 scales would surely have registered a weight gain of 1 pound or more, had I gained it, and yet none of them do; that is an indication I’ve not gained a pound or more. I check that each scale reliably indicates 1 pound, because I know that is the weight of the book EGEK (you can even see this on the scale shown), and they each show exactly one pound when EGEK is weighed. Having evidence I’ve gained less than 1 pound, there are even fewer grounds for supposing I’ve gained as much as 5 pounds, right?


## Neyman’s Nursery (NN2): Power and Severity [Continuation of Oct. 22 Post]:

Let me pick up where I left off in “Neyman’s Nursery,” [built to house Giere’s statistical papers-in-exile]. The main goal of the discussion is to get us to exercise correctly our “will to understand power”, if only little by little. One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+. (Whether Greek symbols will appear where they should, I cannot say; it’s being worked on back at Elba).

H0: µ ≤ µ0 against H1: µ > µ0.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ0 iff d(x0) > cα, where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1)  P(d(X) > cα; µ =  µ0 + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed.  He sounds like a Cohen-style power analyst!  Still, power is calculated relative to an outcome just missing the cutoff  cα.  This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results.  It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2)  P(d(X) > d(x0); µ = µ0 + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
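For the Normal test T+ the two probabilities can be written out explicitly. The following is a sketch under the assumptions already stated (n iid Normal observations, known σ). Writing the standardized sample mean as $d(X) = \sqrt{n}(\bar{X} - \mu_0)/\sigma$ and using $\Phi$ for the standard Normal distribution function is notation introduced here for illustration. When $\mu = \mu_0 + \delta$, $d(X)$ is Normal with mean $\sqrt{n}\,\delta/\sigma$ and unit variance, so:

$$
\begin{aligned}
\text{(1)}\quad P\big(d(X) > c_\alpha;\ \mu = \mu_0 + \delta\big) &= 1 - \Phi\!\left(c_\alpha - \frac{\sqrt{n}\,\delta}{\sigma}\right),\\[4pt]
\text{(2)}\quad P\big(d(X) > d(x_0);\ \mu = \mu_0 + \delta\big) &= 1 - \Phi\!\left(d(x_0) - \frac{\sqrt{n}\,\delta}{\sigma}\right).
\end{aligned}
$$

Since a negative result means $d(x_0) \le c_\alpha$, (2) is never smaller than (1), and the two coincide only when the outcome sits exactly at the cutoff.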

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before.  We were thinking at first of calling it “attained power” but then came across what some have called “observed power” which is very different (and very strange).  Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean!  (Why  anyone would want to do this and then apply power analytic reasoning is unclear.  I’ll come back to this in my next post.)  Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256):

FEV: (1) A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., smaller p-value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises: namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough. (2)
________________________________________

The full version of our frequentist principle of evidence FEV corresponds to the interpretation of a small p-value:

x is evidence of a discrepancy d from H0 iff, if H0 is a correct description of the mechanism generating x, then, with high probability a less discordant result would have occurred.

Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.
It didn’t have to be done this way, but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

NOTE: There are 5 Neyman’s Nursery posts (NN1-NN5). NN3 is here. Search this blog for the others.

REFERENCES:

Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Cohen, J. (1992), “A Power Primer,” Psychological Bulletin, 112: 155-159.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (eds.) (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.


## The Will to Understand Power: Neyman’s Nursery (NN1)

Way back when, although I’d never met him, I sent my doctoral dissertation, Philosophy of Statistics, to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.
