I finally saw *The Imitation Game* about Alan Turing and code-breaking at Bletchley Park during WWII. This short clip of Joan Clarke, who was engaged to Turing, includes my late colleague I.J. Good at the end (he’s not second as the clip lists him). Good used to talk a great deal about Bletchley Park and his code-breaking feats while asleep there (see note[a]), but I never imagined Turing’s code-breaking machine (which, by the way, was called the Bombe and not Christopher as in the movie) was so clunky. The movie itself has two tiny scenes including Good. Below I reblog: “Who is Allowed to Cheat?”—one of the topics he and I debated over the years. Links to the full “Savage Forum” (1962) may be found at the end (creaky, but better than nothing.)

[a] “Some sensitive or important Enigma messages were enciphered twice, once in a special variation cipher and again in the normal cipher. …Good dreamed one night that the process had been reversed: normal cipher first, special cipher second. When he woke up he tried his theory on an unbroken message – and promptly broke it.” This, and further examples, may be found in this obituary.

[b] Pictures comparing the movie cast and the real people may be found here.

____________________

**Who is allowed to cheat? I.J. Good and that after dinner comedy hour….**

It was from my Virginia Tech colleague I.J. Good (in statistics), who died ~~five~~ six years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time.

The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135)[*]

This paper came from a conference where we both presented, and he was *extremely* critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in to the comedy hour that unfolded at my dinner table at an academic conference:

Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level (p-value .048).” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null, in either direction, so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results *are* statistically significant.”

Howls of laughter.

But then the guy calls back with the bad news . . .

It turns out that failing to score a sufficiently impressive effect after *n*′ trials, the experimenter went on to *n*″ trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the *argument from intentions.* When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the *look-elsewhere effect* (LEE), which arose in the context of “bump hunting” in the Higgs results.

The *optional stopping effect* often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

*X*ᵢ ~ N(µ, σ), and we test *H*₀: µ = 0 vs. *H*₁: µ ≠ 0.

The stopping rule might take the form:

*Keep sampling until |m| ≥ 1.96 σ/√n*,

with *m* the sample mean. When *n* is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
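The inflation of the type 1 error probability can be checked numerically. The following is only a minimal sketch under stated assumptions: σ is known and set to 1, the null µ = 0 is true, and sampling is capped at a hypothetical `max_n` (the rule in the text is open-ended; the cap is there only so the simulation terminates). The function names are illustrative, not from any of the cited papers.

```python
import math
import random


def stops_with_rejection(max_n, rng, z_crit=1.96, sigma=1.0):
    """Draw X_i ~ N(0, sigma) one at a time, so the null mu = 0 is TRUE.

    Return True if |m| >= z_crit * sigma / sqrt(n) for some n <= max_n,
    i.e. if try-and-try-again reaches a nominally significant result.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        mean = total / n
        if abs(mean) >= z_crit * sigma / math.sqrt(n):
            return True
    return False


def rejection_rate(max_n, trials=2000, seed=0):
    """Estimate the actual type 1 error probability when sampling is capped at max_n."""
    rng = random.Random(seed)
    hits = sum(stops_with_rejection(max_n, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # max_n = 1 is the fixed-sample test; larger caps allow more optional stopping.
    for max_n in (1, 10, 100, 1000):
        print(max_n, rejection_rate(max_n))
```

With `max_n = 1` the estimate should sit near the nominal .05; as the cap grows, the estimated rate climbs well above it, which is the sense in which the “computed” significance level is not the actual one.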

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . .

I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

*H* is *not* being put to a stringent test when a researcher allows trying and trying again until the data are far enough from *H*₀ to reject it in favor of *H*.

**Stopping Rule Principle**

Picking up on the effect *appears* evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of *n* data actually observed will be exactly the same as it would be had you planned to take exactly *n* observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the *irrelevance of the stopping rule* or the *Stopping Rule Principle* (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011, “Error Statistics”.)

**A Funny Thing Happened at the Savage Forum[i]**

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a *non sequitur*.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (*x*₁,…,*x*ₙ) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

**The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals**

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.
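A short worked step shows why this rule matches the earlier one. It assumes the interval is the usual *m* ± 1.96 σ/√n with known σ, which is how the interval is read here:

```latex
0 \notin \left[\, m - 1.96\,\frac{\sigma}{\sqrt{n}},\;\; m + 1.96\,\frac{\sigma}{\sqrt{n}} \,\right]
\quad\Longleftrightarrow\quad
|m| > 1.96\,\frac{\sigma}{\sqrt{n}}
```

Apart from the boundary case of exact equality, which has probability zero for continuous data, this is the same condition as *|m| ≥ 1.96 σ/√n*.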

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission, especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = *m* ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Many other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith. For my latest, and final (I hope) post on the (strong) likelihood principle, see the post with the link to my paper with discussion in Statistical Science.

**Link to complete discussion:**

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). *Statistical Science* 29 (2014), no. 2, 227-266.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:

- Savage Forum title page through page 20.pdf
- Savage Forum pages 21 through 35.pdf
- Savage Forum pages 36 through 52.pdf
- Savage Forum pages 53 through 55.pdf
- Savage Forum pages 56 through 70.pdf
- Savage Forum pages 71 through 77.pdf
- Savage Forum pages 78 through 103.pdf
- Savage Forum reference pages 104 through 112.pdf

**REFERENCES**

Armitage, P. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger, J. O. and Wolpert, R. L. (1988), *The Likelihood Principle: A Review, Generalizations, and Statistical Implications*, 2nd edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In *Philosophy, Science, and Method: Essays in Honor of Ernest Nagel*, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, *Scandinavian Journal of Statistics* 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), *Theoretical Statistics,* London: Chapman & Hall.

Edwards, W., H. Lindman, and L. Savage (1963), “Bayesian Statistical Inference for Psychological Research”, *Psychological Review* 70: 193-242.

Good, I. J. (1983), *Good Thinking: The Foundations of Probability and Its Applications*, Minneapolis: University of Minnesota Press.

Howson, C., and P. Urbach (1993[1989]), *Scientific Reasoning: The Bayesian Approach*, 2nd ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] *Error and the Growth of Experimental Knowledge*, Chapter 10: “Why You Cannot Be Just a Little Bayesian”, Chicago: University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences”, in D. Corfield and J. Williamson (eds.), *Foundations of Bayesianism*, Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” *Philosophy of Science*, 80 (2013): 73-93.

n=2. Rule 1: Stop when you have two 1s, 0101. Rule 2: Stop when the number of 1s equals the number of 0s, 0110. Rule 3: Stop when you have two 1s followed by two zeros, 1100.

n=3. Rule 1: Stop when you have three 1s, 101001. Rule 2: Stop when the number of 1s equals the number of 0s, 011001. Rule 3: Stop when you have three 1s followed by three zeros, 111000.

n=4. Rule 1: Stop when you have four 1s, 01001101. Rule 2: Stop when the number of 1s equals the number of 0s, 01101001. Rule 3: Stop when you have four 1s followed by four zeros, 11110000.

For which value of $n$ do you stop treating them as the same?

Never write a blog entry at 3am. Replace Rule 2 ‘Stop when the number of 1s equals the number of 0s’ by ‘Stop at 2n’.

Laurie: Did you intend this for here or for someplace else? I’m not sure of the upshot in relation to this post, and I never did follow-up on the discussion you were having with Corey on his blog, though I read it.

No, it was intended for this discussion. I understood it to be about the relevance of a stopping rule, that is, optional stopping. I gave three different stopping rules and three different data sets, the data sets having the property that the likelihoods are proportional (standard binomial model), that is, the same number of zeros and ones and the same sample size (unless I have miscounted). If one takes the irrelevance of the stopping rule seriously then all data sets contain the same information about the parameter p of the binomial distribution. This was what I meant by treating them as the same. So what I have formulated is a form of meta stopping problem. When do you stop analysing them in the same manner? You can of course answer that you would always analyse such data sets in the same manner irrespective of any stopping rule. The argument is just a reformulation of Chapter 11.1.5 of my book.
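The claim that the three n=2 data sets have the same likelihood can be checked with a minimal sketch. It assumes the standard i.i.d. Bernoulli(p) model mentioned in the comment; the function name `sequence_likelihood` and the grid of p values are illustrative choices, not taken from the book cited.

```python
def sequence_likelihood(bits, p):
    """Likelihood of an observed 0/1 sequence under i.i.d. Bernoulli(p) draws."""
    ones = bits.count("1")
    zeros = bits.count("0")
    return p ** ones * (1 - p) ** zeros


# The three n=2 data sets from the comment, one per stopping rule.
datasets = {"Rule 1": "0101", "Rule 2": "0110", "Rule 3": "1100"}

# A coarse grid of candidate values for the binomial parameter p.
grid = [i / 10 for i in range(1, 10)]

for rule, bits in datasets.items():
    print(rule, [round(sequence_likelihood(bits, p), 4) for p in grid])
```

Each data set has two 1s and two 0s, so the three printed rows coincide. On the stopping rule principle that settles the matter; whether the different stopping rules should nevertheless change the analysis is the question being pressed.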

Where does the (irrelevance of) stopping rule principle come from? A close reading of Berger & Wolpert (pp. 88-90) shows that they are claiming that the stopping rules are irrelevant when they are uninformative about the parameter of interest. They claim that examples where the stopping rules are informative are rare in practice, but that is far, far from the proof that stopping rules are irrelevant that I expected.

In 1967 Roberts wrote this: “The irrelevance of stopping rules under the likelihood principle is an empirical generalization that stopping rules are usually noninformative. It is not a logical necessity.” (JASA 62: 763-775, http://www.jstor.org/stable/2283670) When I read that on Monday it came as a surprise to me, and as you probably have noticed, I have been reading and thinking around likelihoods for a while.

It now seems to me that if informative stopping rules are incorporated into a likelihood function then the close relationship between severity and likelihood becomes even closer…

I studied this a lot some years ago (e.g., for the Mayo and Kruse paper). They regard informative stopping rules as highly contrived, like maybe you’re measuring the ferocity of lions and you stopped sampling because you were about to get eaten by a lion.

Here’s the paper; the bit about measuring the lions and running off is on p. 395.

I didn’t read through the posts entirely (just skimmed ’em). Really just came for the video, which brings me to my point: the video says “in order of appearance” but I’m pretty sure Dr Wylie and Dr Good should be interchanged. Good is the final speaker and not the second.