Given it’s the first anniversary of this blog, which opened with the howlers in “Overheard at the comedy hour …” let’s listen in as a Bayesian holds forth on one of the most famous howlers of the lot: the mysterious role that psychological intentions are said to play in frequentist methods such as statistical significance tests. Here it is, essentially as I remember it (though shortened), in the comedy hour that unfolded at my dinner table at an academic conference:
Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.
So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”
Howls of laughter.
But then the guy calls back with the bad news . . .
It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.
It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.
The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.
The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:
Xi ~ N(µ,σ) and we test H0: µ=0, vs. H1: µ≠0.
The stopping rule might take the form:
Keep sampling until |m| ≥ 1.96(σ/√n),
with m the sample mean. When n is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level differs from, and is greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone, “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
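To see the inflation in numbers, here is a rough simulation sketch. It is my own illustration, not anything from Cox or Armitage, and it rests on assumptions I am adding: σ = 1 is known, the null µ = 0 is true, the experimenter checks after every single observation, and sampling is capped at a hypothetical `max_n` (a truly open-ended rule cannot be simulated). The function names are invented for the example.

```python
import math
import random


def stops_with_rejection(max_n, rng, z_crit=1.96, sigma=1.0):
    """Draw X_i ~ N(0, sigma) one at a time, so the null mu = 0 is true.

    Return True if |m| >= z_crit * sigma / sqrt(n) at some n <= max_n,
    i.e. the try-and-try-again rule stops with a nominally significant result.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        if abs(total / n) >= z_crit * sigma / math.sqrt(n):
            return True
    return False


def fixed_n_rejection(n, rng, z_crit=1.96, sigma=1.0):
    """Same test, but n is fixed in advance and we look only once."""
    total = sum(rng.gauss(0.0, sigma) for _ in range(n))
    return abs(total / n) >= z_crit * sigma / math.sqrt(n)


def estimate_rates(max_n=169, trials=4000, seed=1):
    """Estimate the relative frequency of rejecting a true null under each plan."""
    rng = random.Random(seed)
    optional = sum(stops_with_rejection(max_n, rng) for _ in range(trials)) / trials
    fixed = sum(fixed_n_rejection(max_n, rng) for _ in range(trials)) / trials
    return optional, fixed


if __name__ == "__main__":
    optional, fixed = estimate_rates()
    print(f"optional stopping, up to n=169: {optional:.3f}")
    print(f"fixed n=169:                    {fixed:.3f}")
```

Under these assumptions the fixed-n rate should hover near the nominal .05, while the optional-stopping rate comes out well above it, and it keeps climbing as the cap `max_n` is raised. Nothing about anyone's state of mind enters the computation; only the sampling procedure differs.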
Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:
I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])
H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.
Stopping Rule Principle
Picking up on the effect appears evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:
In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)
This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]
To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011 “Error Statistics”.)
A Funny Thing Happened at the Savage Forum[ii]
While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:
The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)
But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:
A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)
It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.
We often hear things like:
[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)
But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; that is, it is “proper”.)
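The point about the two plans having different sets of possible outcomes can be made concrete with a small sketch. It assumes σ = 1 is known and that the rule is checked after every observation; the helper names `first_stop` and `possible_under_optional_stopping` and the toy data are mine, purely for illustration.

```python
import math


def first_stop(xs, z_crit=1.96, sigma=1.0):
    """Return the first n at which |m| >= z_crit * sigma / sqrt(n), or None."""
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        if abs(total / n) >= z_crit * sigma / math.sqrt(n):
            return n
    return None


def possible_under_optional_stopping(xs):
    """A recorded sequence can be a completed outcome of the open-ended plan
    only if the rule fires for the first time at its last observation."""
    return first_stop(xs) == len(xs)


# Three sequences, each a perfectly possible outcome of a fixed n = 3 plan.
never_rejects = [0.1, 0.2, 0.3]   # the rule never fires within these 3 trials
stops_at_three = [0.1, 0.2, 3.5]  # the rule first fires at n = 3
stops_at_one = [2.5, 0.0, 0.0]    # the rule would already have fired at n = 1

for xs in (never_rejects, stops_at_three, stops_at_one):
    print(xs, first_stop(xs), possible_under_optional_stopping(xs))
```

Only the second sequence could be reported by an experimenter following the open-ended plan: the first is not a completed outcome at all (sampling would simply continue), and the third would have ended after one observation. The sample space really is different, which is why the error probabilities are.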
Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.
We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?
One last thing….
The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals
The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):
Keep sampling until the 95% confidence interval excludes 0.
Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission, especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):
µ = m ± 1.96(σ/√n)
But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
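A short sketch may help fix what is at stake in frequency terms. It assumes known σ = 1 and a flat prior on µ, under which the central 95% posterior interval is numerically m ± 1.96(σ/√n); the cap `max_n` and the function names are hypothetical additions of mine, and the true mean is set to 0 so that any interval excluding 0 is in error.

```python
import math
import random


def flat_prior_interval(xs, sigma=1.0, z=1.96):
    """Return m -/+ z * sigma / sqrt(n). With known sigma and a flat prior on mu,
    this is also the central 95% posterior interval for mu."""
    n = len(xs)
    m = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return m - half, m + half


def sample_until_interval_excludes_zero(rng, max_n=169, sigma=1.0):
    """Keep sampling (true mu = 0) until the interval excludes 0.

    Return the data at stopping, or None if the cap max_n is reached first.
    """
    xs = []
    while len(xs) < max_n:
        xs.append(rng.gauss(0.0, sigma))
        lo, hi = flat_prior_interval(xs, sigma)
        if lo > 0.0 or hi < 0.0:
            return xs
    return None


def fraction_misled(trials=1000, max_n=169, seed=2):
    """Relative frequency of runs ending with a .95 interval that excludes the true mu."""
    rng = random.Random(seed)
    stopped = [sample_until_interval_excludes_zero(rng, max_n) for _ in range(trials)]
    return sum(xs is not None for xs in stopped) / trials


if __name__ == "__main__":
    print(f"runs reporting an interval that excludes the true mu = 0: {fraction_misled():.3f}")
```

By construction, every run that stops reports an interval to which the flat-prior posterior assigns probability .95 and which nevertheless excludes the true value. Under these assumptions the relative frequency of such runs is far above .05 and grows with the cap. Whether that counts as being misled is exactly the question being put to the conditionalist.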
[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.
[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”.
Armitage, P. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.
Berger, J. O. and R. L. Wolpert (1988), The Likelihood Principle: A Review, Generalizations, and Statistical Implications, 2nd ed., Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.
Birnbaum, A. (1969), “Concepts of Statistical Evidence” In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.
Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, Scandinavian Journal of Statistics 4, 49–70.
Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, London: Chapman & Hall.
Edwards, W., H. Lindman, and L. Savage (1963), “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70, 193-242.
Howson, C., and P. Urbach (1993), Scientific Reasoning: The Bayesian Approach, 2nd ed., La Salle: Open Court.
Mayo, D. G. and M. Kruse (2001), “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers, 381-403.
Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.