It was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.
“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]
This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.)
One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:
“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)
To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”
By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:
Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.
So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”
Howls of laughter.
But then the guy calls back with the bad news . . .
It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.
It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.
The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.
The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:
Xi ~ N(µ,σ) and we test H0: µ=0, vs. H1: µ≠0.
The stopping rule might take the form:
Keep sampling until |m| ≥ 1.96(σ/√n),
with m the sample mean. When n is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level differs from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
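To get a feel for the size of the effect, here is a minimal Python sketch (mine, not taken from any of the sources cited) that simulates the stopping rule above when the null is true. The function name rejection_rate, the cap max_n on the number of observations, the number of repetitions, and the seed are all assumptions made for illustration. The rule in the text is open-ended, so capping it can only understate the effect.

```python
import math
import random


def rejection_rate(max_n, reps, optional_stopping, sigma=1.0, z=1.96, seed=0):
    """Estimate how often H0: mu = 0 is rejected when H0 is true.

    optional_stopping=True : test after every observation and stop at the
                             first n with |m| >= z * sigma / sqrt(n).
    optional_stopping=False: test once, at the fixed sample size max_n.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rejections = 0
    for _ in range(reps):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, sigma)  # data generated under the null
            look_now = optional_stopping or n == max_n
            # |m| >= z*sigma/sqrt(n) is the same as |sum| >= z*sigma*sqrt(n)
            if look_now and abs(total) >= z * sigma * math.sqrt(n):
                rejections += 1
                break
    return rejections / reps


if __name__ == "__main__":
    fixed = rejection_rate(max_n=100, reps=2000, optional_stopping=False)
    optional = rejection_rate(max_n=100, reps=2000, optional_stopping=True)
    print("fixed n = 100:           ", fixed)
    print("stop as soon as |z|>=1.96:", optional)
```

With these settings the fixed-n rate should land near .05, while the try-and-try-again rate comes out several times larger and keeps growing as max_n is raised.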
Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:
I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])
H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.
Stopping Rule Principle
Picking up on the effect appears evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:
In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)
This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]
To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011, “Error Statistics”.)
A Funny Thing Happened at the Savage Forum[i]
While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:
The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)
But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:
A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)
It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.
We often hear things like:
[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)
But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; that is, it is “proper”.)
Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.
We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?
One last thing….
The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals
The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):
Keep sampling until the 95% confidence interval excludes 0.
Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission—especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):
µ = m ± 1.96(σ/√n)
But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
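The interval version of the rule can be sketched the same way. The short Python function below is my own illustration, not Berger and Wolpert's: it draws data with true mean 0 and keeps going until the interval m ± 1.96(σ/√n) excludes 0. The name sample_until_interval_excludes_zero, the cap max_n, and the seed are assumptions introduced for the example. A run that hits the cap without stopping returns None.

```python
import math
import random


def sample_until_interval_excludes_zero(max_n, sigma=1.0, z=1.96, seed=0):
    """Sample N(0, sigma) data until m +/- z*sigma/sqrt(n) excludes 0.

    Returns (n, lower, upper) at the stopping point, or None if the
    (assumed) cap max_n is reached first.
    """
    rng = random.Random(seed)
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)  # the true mean really is 0
        m = total / n
        half_width = z * sigma / math.sqrt(n)
        lower, upper = m - half_width, m + half_width
        if lower > 0.0 or upper < 0.0:
            return n, lower, upper
    return None


if __name__ == "__main__":
    for seed in range(5):
        print(seed, sample_until_interval_excludes_zero(max_n=500, seed=seed))
```

Every interval this procedure reports excludes the true value 0 by construction, which is the sense in which the nominal .95 attached to it is misleading from a frequentist viewpoint.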
[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.
[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.
[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”.
Armitage, P. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.
Berger, J. O. and Wolpert, R. L. (1988), The Likelihood Principle: A Review, Generalizations, and Statistical Implications, 2nd edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.
Birnbaum, A. (1969), “Concepts of Statistical Evidence”, in Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, S. Morgenbesser, P. Suppes, and M. White (eds.), New York: St. Martin’s Press, 112-43.
Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, Scandinavian Journal of Statistics 4, 49–70.
Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, London: Chapman & Hall.
Edwards, W., H. Lindman, and L. Savage (1963), “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70, 193-242.
Good, I. J. (1983), Good Thinking: The Foundations of Probability and its Applications, Minnesota.
Howson, C., and P. Urbach (1993), Scientific Reasoning: The Bayesian Approach, 2nd ed., La Salle: Open Court.
Mayo, D. (1996), [EGEK] Error and the Growth of Experimental Knowledge, Chapter 10: “Why You Cannot Be Just a Little Bayesian”, Chicago.
Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences”, in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism, Dordrecht: Kluwer Academic Publishers, 381-403.
Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.
Spanos, A. (2013), “Who Should Be Afraid of the Jeffreys-Lindley Paradox?”, Philosophy of Science 80, 73-93.
This whole set of arguments always leaves me puzzled. All the calculations of p value, confidence intervals, and the like assume that the sample population is an unbiased set as if randomly drawn from the parent distribution (so often a normal one). This is an essential assumption. But if the stopping point is conditional on the results, *that assumption is not true*. The sample population is *not* a random unbiased set drawn from the parent. In fact, the stopping criteria specifically provide that the sample will be biased rather than random.
So how could anyone possibly argue that p values, etc., can be calculated the same way as if the sample had been random and unbiased?
This is not complicated!
Tom: p-values are not calculated the same way. But suppose you only look at likelihood ratios, as in certain types of Bayesian and likelihood accounts. Then once the data are known, given that the difference between the fixed vs. optional stopping case is a constant factor, the difference cancels out. The analogous point is seen in Savage’s other example, comparing a Binomial and negative Binomial.
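The Binomial versus negative Binomial point can be checked directly. The Python sketch below is only an illustration under assumed numbers (s = 9 successes in n = 12 trials, comparing θ = 0.75 with θ = 0.5); the function names are hypothetical. Under the Binomial plan n is fixed in advance; under the negative Binomial plan sampling continues until the s-th success, which happens to arrive on trial n. The two likelihoods differ only by a factor that does not involve θ, so any likelihood ratio is the same under both plans.

```python
import math


def binomial_likelihood(theta, n, s):
    """P(s successes in n trials), with n fixed in advance."""
    return math.comb(n, s) * theta**s * (1 - theta) ** (n - s)


def negative_binomial_likelihood(theta, n, s):
    """P(the s-th success arrives on trial n), sampling until s successes."""
    return math.comb(n - 1, s - 1) * theta**s * (1 - theta) ** (n - s)


def likelihood_ratio(likelihood, theta1, theta0, n, s):
    """Ratio of the likelihood at theta1 to the likelihood at theta0."""
    return likelihood(theta1, n, s) / likelihood(theta0, n, s)


if __name__ == "__main__":
    n, s = 12, 9  # assumed data, for illustration only
    print(likelihood_ratio(binomial_likelihood, 0.75, 0.5, n, s))
    print(likelihood_ratio(negative_binomial_likelihood, 0.75, 0.5, n, s))
```

The two printed ratios agree because the combinatorial constants cancel. That cancellation is exactly what the error statistician refuses to accept as the whole story, since the sampling distributions, and hence the error probabilities, can still differ between the plans.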