It was from my Virginia Tech colleague I.J. Good (in statistics), who died five years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. *The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead*” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was *extremely* critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

*Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level: p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null, in either direction, so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.*

*Much laughter.*

*So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results **are** statistically significant.”*

*Howls of laughter.*

*But then the guy calls back with the bad news . . .*

*It turns out that, failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.*

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the *argument from intentions.* When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the *look-elsewhere effect* (LEE), which arose in the context of “bump hunting” in the Higgs results.

The *optional stopping effect* often appears in illustrations of how error statistics violates the Likelihood Principle LP, alluding to a two-sided test from a Normal distribution:

*X*_{i} ~ N(µ, σ), and we test *H*_{0}: µ = 0 vs. *H*_{1}: µ ≠ 0.

The stopping rule might take the form:

*Keep sampling until |m| ≥ 1.96(σ/√n)*,

with *m* the sample mean. When *n* is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle.” (Cox 1977, p. 54) since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
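To see how quickly the actual type 1 error probability outruns the nominal .05, here is a minimal simulation sketch of the stopping rule, with the null true throughout. The function names, the cap `max_n` on how long the experimenter is willing to keep sampling, the number of trials, and the seed are all my own illustrative assumptions; a truly open-ended rule has no cap.

```python
import math
import random


def stops_with_rejection(max_n, sigma=1.0, z=1.96, rng=random):
    """Sample X_i ~ N(0, sigma) one at a time, so the null mu = 0 is true.

    Return True if |m| >= z * sigma / sqrt(n) holds at some n <= max_n,
    i.e. the try-and-try-again experimenter gets to stop and reject.
    max_n is a hypothetical cap on the experimenter's patience.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        mean = total / n
        if abs(mean) >= z * sigma / math.sqrt(n):
            return True
    return False


def rejection_rate(max_n, trials=1000, seed=0):
    """Monte Carlo estimate of the actual probability of a nominal rejection."""
    rng = random.Random(seed)
    hits = sum(stops_with_rejection(max_n, rng=rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # max_n = 1 is the fixed-sample-size test; larger caps allow more tries.
    for max_n in (1, 10, 100):
        print(max_n, rejection_rate(max_n))
```

With `max_n = 1` the estimate should sit near .05; as the cap grows, the estimated rate climbs well above it, which is the optional stopping effect in miniature.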

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . *I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” *If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

*H* is *not* being put to a stringent test when a researcher allows trying and trying again until the data are far enough from *H*_{0 }to reject it in favor of *H*.

**Stopping Rule Principle**

The effect of the stopping rule *appears* evanescent, locked in someone’s head, if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of *n* data actually observed will be exactly the same as it would be had you planned to take exactly *n* observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the *irrelevance of the stopping rule *or the* Stopping Rule Principle *(SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011, “Error Statistics”.)

**A Funny Thing Happened at the Savage Forum[i]**

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a *non sequitur*.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (*x*_{1},…, *x*_{n}) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; that is, it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

**The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals**

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission, especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = *m* ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:

**REFERENCES**

Armitage, P. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger, J. O. and Wolpert, R. L. (1988), *The Likelihood Principle: A Review, Generalizations, and Statistical Implications*, 2^{nd} edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In *Philosophy, Science, and Method: Essays in Honor of Ernest Nagel*, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, *Scandinavian Journal of Statistics* 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), *Theoretical Statistics,* London: Chapman & Hall.

Edwards, W., H. Lindman, and L. Savage (1963), “Bayesian Statistical Inference for Psychological Research”, *Psychological Review* 70: 193-242.

Good, I. J. (1983), *Good Thinking: The Foundations of Probability and its Applications*, Minnesota.

Howson, C., and P. Urbach (1993[1989]), *Scientific Reasoning: The Bayesian Approach*, 2^{nd} ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] *Error and the Growth of Experimental Knowledge*, Chapter 10: “Why You Cannot Be Just a Little Bayesian”. Chicago.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) *Foundations of Bayesianism*. Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” *Philosophy of Science*, 80 (2013): 73-93.

## Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance.

While post-data power is scarcely taboo for a severe tester, severity always uses the actual outcome, with its level of statistical significance, whereas power is in terms of the fixed cut-off. Still, power provides (worst-case) pre-data guarantees. Now before you get any wrong ideas, I am not endorsing what some people call retrospective power, and I call “shpower”, which goes against severity logic and is misconceived.

We are reading the Fisher-Pearson-Neyman “triad” tomorrow in Phil6334. Even here (i.e., Neyman 1956), Neyman alludes to a post-data use of power. But, strangely enough, I only noticed this after discovering more blatant discussions in what Spanos and I call “Neyman’s hidden papers”. Here’s an excerpt from Neyman’s Nursery (part 2) [NN-2]:

_____________

One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

Neyman continues:

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+.

H_{0}: µ ≤ µ_{0} against H_{1}: µ > µ_{0}.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ_{0} iff d(x_{0}) > c_{α}, where c_{α} corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x_{0}) ≤ c_{α}, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > c_{α}; µ = µ_{0} + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed. He sounds like a Cohen-style power analyst! Still, power is calculated relative to an outcome just missing the cutoff c_{α}. This is, in effect, the worst case of a negative (non significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results. It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x_{0}); µ = µ_{0} + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ_{0} + δ.

Although (1) may be low, (2) may be high (for numbers, see Mayo and Spanos 2006).
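For readers who want to see (1) and (2) side by side, here is a rough sketch for test T+. It relies on the fact that, with known σ and n iid Normal samples, d(X) is Normal with mean √n δ/σ and unit variance when µ = µ_{0} + δ. The function names, the default α = .025, and the numbers in the usage lines are my own hypothetical choices, not taken from Neyman or from Mayo and Spanos 2006.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, standard deviation 1


def power(delta, sigma, n, alpha=0.025):
    """(1): P(d(X) > c_alpha; mu = mu0 + delta) for the one-sided test T+.

    alpha = 0.025 is an assumed default, giving a cutoff near 1.96.
    """
    c_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    shift = math.sqrt(n) * delta / sigma  # mean of d(X) under mu0 + delta
    return 1 - STD_NORMAL.cdf(c_alpha - shift)


def severity_of_upper_bound(d_obs, delta, sigma, n):
    """(2): P(d(X) > d(x0); mu = mu0 + delta), for inferring mu < mu0 + delta."""
    shift = math.sqrt(n) * delta / sigma
    return 1 - STD_NORMAL.cdf(d_obs - shift)


if __name__ == "__main__":
    # Hypothetical case: sigma = 1, n = 25, delta = 0.3, observed d(x0) = 0.
    print(power(0.3, 1.0, 25))                         # about 0.32
    print(severity_of_upper_bound(0.0, 0.3, 1.0, 25))  # about 0.93
```

In this made-up case the power against δ = 0.3 is low, yet the observed result sits far enough below the cutoff that (2) is high. When d(x_{0}) just misses the cutoff, the two coincide, which is the sense in which power is the worst case.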

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way, the way I’d defined it in Mayo (1996) and before. We were thinking at first of calling it “attained power” but then came across what some have called “observed power”, which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (I call this “shpower”.)
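To make the contrast concrete, here is a small sketch of “observed power” for the same one-sided Normal test. Plugging the observed mean in for µ makes the mean of d(X) equal to d(x_{0}), so the result depends on the observed statistic alone. The function name and the default α = .025 are my own assumptions for illustration.

```python
from statistics import NormalDist


def shpower(d_obs, alpha=0.025):
    """'Observed power': ordinary power with mu set equal to the observed mean.

    Under that substitution d(X) ~ N(d_obs, 1), so this is
    P(d(X) > c_alpha) = 1 - Phi(c_alpha - d_obs).
    alpha = 0.025 is an assumed default.
    """
    std_normal = NormalDist()
    c_alpha = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(c_alpha - d_obs)


if __name__ == "__main__":
    # Any non-significant result (d_obs below the cutoff) gives a value below .5.
    for d_obs in (0.0, 1.0, 1.9):
        print(d_obs, shpower(d_obs))
```

Because a non-significant d(x_{0}) always yields a value below .5, this quantity cannot play the role that the power against a fixed discrepancy δ plays in the reasoning above.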

Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV: A moderate p-value is evidence of the absence of a discrepancy d from H_{0} only if there is a high probability the test would have given a worse fit with H_{0} (i.e., smaller p value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises: namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough. …

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant. It didn’t have to be done this way (at first I didn’t), but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

[i] To repeat it again: some may be thinking of an animal I call “shpower”.

[ii] I realize comments are informal and unpolished, but isn’t that the beauty of blogging?

NOTE: To read the full post go to [NN-2]. There are 5 Neyman’s Nursery posts (NN1-NN5). Search this blog for the others.

REFERENCES:

Cohen, J. (1992), “A Power Primer”.

Cohen, J. (1988), *Statistical Power Analysis for the Behavioral Sciences*, 2^{nd} ed. Hillsdale, Erlbaum, NJ.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Spanos, A. (eds.) (2010), *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science*, CUP.

Mayo, D. G. and Spanos, A. (2011), “Error Statistics”.

Neyman, J. (1955), “The Problem of Inductive Inference,” *Communications on Pure and Applied Mathematics*, VIII, 13-46.