It was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time.

The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135)[*]

This paper came from a conference where we both presented, and he was *extremely* critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”

Howls of laughter.

But then the guy calls back with the bad news . . .

It turns out that, failing to score a sufficiently impressive effect after n′ trials, the experimenter went on to n″ trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the *argument from intentions.* When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the *look-elsewhere effect* (LEE), which arose in the context of “bump hunting” in the Higgs results.

The *optional stopping effect* often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

*X*_{i} ~ N(µ, σ), and we test

*H*_{0}: µ = 0, vs. *H*_{1}: µ ≠ 0.

The stopping rule might take the form:

*Keep sampling until* |*m*| ≥ 1.96 σ/√*n*,

with *m* the sample mean. When *n* is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone, “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows one readily to infer that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
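A quick simulation makes the inflation concrete. The sketch below is an illustration added here, not taken from any of the cited papers: it draws from N(0, 1), so the null is true, applies the stopping rule above at every sample size up to a truncation point, and estimates the probability of ever reaching nominal .05 significance.

```python
import numpy as np

rng = np.random.default_rng(0)

def stops(max_n, sigma=1.0):
    """One experiment with the null true (mu = 0): keep sampling until
    |m| >= 1.96*sigma/sqrt(n), giving up at max_n."""
    x = rng.normal(0.0, sigma, size=max_n)
    n = np.arange(1, max_n + 1)
    m = np.cumsum(x) / n                      # running sample mean
    return bool((np.abs(m) >= 1.96 * sigma / np.sqrt(n)).any())

# Estimated overall type 1 error rate for several truncation points;
# the nominal level at every individual look is .05.
rates = {N: np.mean([stops(N) for _ in range(4000)]) for N in (10, 100, 1000)}
for N, r in rates.items():
    print(N, round(r, 3))
```

The estimated rates climb well past .05 as the truncation point grows, in line with the law-of-the-iterated-logarithm point: with no truncation at all, rejection is guaranteed eventually.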

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . .

I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

*H* is *not* being put to a stringent test when a researcher allows trying and trying again until the data are far enough from *H*_{0} to reject it in favor of *H*.

**Stopping Rule Principle**

The effect of optional stopping *appears* evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of *n* data actually observed will be exactly the same as it would be had you planned to take exactly *n* observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the *irrelevance of the stopping rule* or the *Stopping Rule Principle* (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011 “Error Statistics“.)

**A Funny Thing Happened at the Savage Forum[ii]**

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a *non sequitur*.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x_{1}, …, x_{n}) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper.”)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

**The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals**

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission—especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = *m* ± 1.96(σ/√*n*)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
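The force of the concession can be seen in a small simulation of my own construction (σ known, flat prior assumed, truncating at n = 500 rather than sampling open-endedly). The true µ is 0, yet a sizable fraction of experiments stop; and, by construction, every interval reported at a stopping time excludes the true value, even though each is assigned .95 posterior probability.

```python
import numpy as np

rng = np.random.default_rng(1)

# Null is true: mu = 0, sigma known. Sample until the "95% interval"
# m ± 1.96*sigma/sqrt(n) excludes 0, giving up at max_n.
sigma, max_n, sims = 1.0, 500, 4000
stopped = 0
for _ in range(sims):
    x = rng.normal(0.0, sigma, size=max_n)
    n = np.arange(1, max_n + 1)
    m = np.cumsum(x) / n                      # running sample mean
    if (np.abs(m) >= 1.96 * sigma / np.sqrt(n)).any():
        stopped += 1      # the reported interval excludes the true mu = 0

# Fraction of experiments "honestly" persuaded that mu != 0:
print(stopped / sims)
```

So among the experiments that stop, the frequentist coverage of these “95%” intervals is zero, which is the sense in which the persistent experimenter “misleads” from a sampling-theory viewpoint.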

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:

- Savage Forum title page through page 20.pdf
- Savage Forum pages 21 through 35.pdf
- Savage Forum pages 36 through 52.pdf
- Savage Forum pages 53 through 55.pdf
- Savage Forum pages 56 through 70.pdf
- Savage Forum pages 71 through 77.pdf
- Savage Forum pages 78 through 103.pdf
- Savage Forum reference pages 104 through 112.pdf

**REFERENCES**

Armitage, P. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger J. O. and Wolpert, R. L. (1988), *The Likelihood Principle: **A Review, Generalizations, and Statistical Implications* 2^{nd} edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In *Philosophy, Science, and Method: Essays in Honor of Ernest Nagel*, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, *Scandinavian Journal of Statistics* 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), *Theoretical Statistics,* London: Chapman & Hall.

Edwards, W., Lindman, H., and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research”, *Psychological Review* 70: 193-242.

Good, I. J. (1983), *Good Thinking: The Foundations of Probability and Its Applications*, Minneapolis: University of Minnesota Press.

Howson, C., and P. Urbach (1993[1989]), *Scientific Reasoning: The Bayesian Approach*, 2^{nd} ed., La Salle: Open Court.

Mayo, D. (1996) [EGEK], *Error and the Growth of Experimental Knowledge*, Chapter 10: “Why You Cannot Be Just a Little Bayesian”, Chicago: University of Chicago Press.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences”, in D. Corfield and J. Williamson (eds.), *Foundations of Bayesianism*, Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. (2013), “Who Should Be Afraid of the Jeffreys-Lindley Paradox?”, *Philosophy of Science* 80: 73-93.

This whole set of arguments always leaves me puzzled. All the calculations of p value, confidence intervals, and the like assume that the sample population is an unbiased set as if randomly drawn from the parent distribution (so often a normal one). This is an essential assumption. But if the stopping point is conditional on the results, *that assumption is not true*. The sample population is *not* a random unbiased set drawn from the parent. In fact, the stopping criteria specifically provide that the sample will be biased rather than random.

So how could anyone possibly argue that p values, etc., can be calculated the same way as if the sample had been random and unbiased?

This is not complicated!

Tom: p-values are not calculated the same way. But suppose you only look at likelihood ratios, as in certain types of Bayesian and likelihood accounts. Then, once the data are known, the fixed-sample and optional-stopping cases differ only by a constant factor, which cancels out of the likelihood ratio. The analogous point is seen in Savage’s other example, comparing a binomial and a negative binomial.
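The cancellation is easy to exhibit for Savage’s binomial/negative-binomial pair. Suppose x = 3 successes were observed in n = 12 trials. If n was fixed in advance, the likelihood is the binomial C(n, x)θˣ(1−θ)ⁿ⁻ˣ; if instead sampling continued until the 3rd success, it is the negative-binomial C(n−1, x−1)θˣ(1−θ)ⁿ⁻ˣ. The kernels are identical, so the two likelihood functions differ by a constant in θ and every likelihood ratio between two values of θ is the same under either design. A few lines check this numerically (the particular n, x values are just an illustration):

```python
from math import comb

import numpy as np

n, x = 12, 3                              # 3 successes in 12 trials
theta = np.linspace(0.05, 0.95, 19)       # grid of parameter values

# n fixed in advance (binomial) vs. sample until the x-th success
# (negative binomial): same kernel theta^x * (1 - theta)^(n - x).
lik_binom = comb(n, x) * theta**x * (1 - theta) ** (n - x)
lik_negbin = comb(n - 1, x - 1) * theta**x * (1 - theta) ** (n - x)

ratio = lik_binom / lik_negbin
print(ratio[0], ratio[-1])   # constant in theta: C(12,3)/C(11,2) = 220/55 = 4
```

Since the factor 4 is the same at every θ, it drops out of any likelihood ratio or posterior computed with a fixed prior, which is exactly why the LP-holder deems the sampling plan irrelevant here.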

Bayesian or not, you have to know things about the distribution of the sample population in order to perform calculations and draw inferences. In the case of stopping when you like the result (or dislike it, for that matter), you have made sure that you do not have an unbiased sample. Any approach that does not take this into account will necessarily arrive at invalid conclusions, at least with regard to an unbiased sample.

After all, in most cases we are much more interested in some other sample to be drawn in the future, most likely one that is less biased and more representative. We are rarely interested in characterising a given biased sample just for its own sake. Were that the case, what you say would be correct – but for only that biased sample.

This situation is very analogous to overfitting a curve to a set of data. You may get a good fit in the sense of low residuals, but you’ve used too many parameters. Hmm, I wonder if there is an approach that is analogous to the jackknife or bootstrap in regression?

Tom: Well, maybe you’d be interested in a topic I have talked about too much on this blog, and have sworn off, for now: the alleged argument for the (strong) LP. I never thought its alleged proof was an issue (for me) until I was involved in writing something with David Cox in 2006 on objectivity and conditioning. I knew that conditioning on a relevant sampling distribution could not lead to excluding all sampling distributions. The matter remains unsettled, not in my mind, but in others’. If you’re interested:

http://www.phil.vt.edu/dmayo/personal_website/ch 7 mayo birnbaum proof.pdf

https://errorstatistics.com/2012/12/08/dont-birnbaumize-that-experiment-my-friend-updated-reblog/

and/or search this blog. Thanks.

Tom: True. If you ignore aspects that alter a relevant error statistical appraisal, your results won’t be warranted with the reported error probability characteristics. And as a result, you might think you have a genuine, repeatable effect and find it is not genuine at all.

Mayo:

I agree with your frustration with this. My resolution (as described in chapters 6 and 7 of BDA; chapters 6 and 8 of the forthcoming third edition) is to separate the step of inference-conditional-on-the-model (from which we get estimates, uncertainty intervals, and predictive distributions) from model checking (from which we get p-values and exploratory data analysis). The “likelihood principle” holds (for stopping rules that are conditional on observed data) for inference but not for checking. Also, even for inference, data-dependent stopping rules are relevant in that they affect the robustness of inferences to model assumptions.

Our uses of p-values and other error probabilistic quantities are influenced by stopping rules. If anything, it seems somewhat the reverse of what you say, if I understand it, because in model checking we use the data to arrive at and probe violated assumptions–violations that can alter the error probabilities of the primary model-based inference. Recall the m-s testing posts….

Mayo:

I think you misread. I wrote above (and in chapter 6 of BDA) that p-values are influenced by stopping rules, and that stopping rules are relevant for model checking, even in cases where the stopping depends only on observed data and thus does not affect the likelihood.

Well, you said that “The ‘likelihood principle’ holds (for stopping rules that are conditional on observed data) for inference”, whereas I want to say it doesn’t hold for inference, but rather is violated. Not to confuse things, it should be kept in mind that the LP is always dependent on the model being given or accepted (remember that remark “you have to be comfortable with the model” from Casella and R. Berger we once discussed; link is below). So rather than call aspects of model-checking LP violations, it is more correct to say that the LP is inapplicable without the given model.

Now maybe the confusion is that you are using p-values for a type of model-checking and the LP violations enter in using p-values for inference about a parameter within a model. Hopefully that helps. Thanks for the comment.

https://errorstatistics.com/2012/08/31/failing-to-apply-vs-violating-the-likelihood-principle/

Mayo:

Yes, what I’m saying is that (under appropriate conditions) the likelihood principle holds for Bayesian inference within a model, but it does not hold for p-values, which from a Bayesian perspective represent a form of model checking.

I think I mentioned in the comments to this post (perhaps too tersely; the whole comment thread is now oddly vanished) that once one admits that the sample space of the test is actually determined not only by the intentions of the experimenter but also by events beyond one’s knowledge and control, it seems impossible to know the actual probative capacity of the test. In a way, this is just the prosaic point that we need to make assumptions about the setup to draw conclusions. But in medical trials in particular, it seems to me that every error-statistical conclusion would be premised on an untestable assumption about unmeasured confounders that influence the probability of dropping out (or some other form of censoring like dying).

Whoops: “…every error-statistical conclusion would *need to* be premised…”

Corey: are you saying a whole comment post of yours to this post has vanished? I could look for it…or do you mean on another post?

Mayo: I think I mistook this post for this one — I distinctly recall offering my comment in response to “HAVE WE LEFT ANY OUT?” But my comment is not there either… I am confused.

Mayo, maybe you should have had one of those Elba Grease drinks before the comedy show started… I mean, to get into it.

Two, at least.

This Berger-Wolpert stopping example is interesting; it would be unfair to say that it is a straw man – it’s very interesting – but it is a bit self-serving to look only at this example and not the binomial/negative binomial or multi-meter examples as well, which I think in combination create a more genuine philosophical tension between the positions for and against the stopping rule principle. A reader of this alone would have trouble working out why somebody would find the stopping rule principle appealing.

I think the problem being described here at the outset is a little misguided. The effect size is important; attempting to show a very small difference with very large samples isn’t a good idea from either a Bayesian or frequentist perspective. The example is interesting because it illustrates that naive (and to my mind strange) Bayesian approaches are not immune to this problem. To echo the applause given in your previous post, you deserve credit for popularising this example.

If you think through this problem in terms of effect size, or preferably in a predictive framework, there is nothing concerning here from a Bayesian perspective. Although it is a nice problem to think through.

In response to Gelman:

In a truly Bayesian approach, model checking is a robustness check with respect to the probabilistic specification. This is compatible with the likelihood principle. Gelman’s model checking is robustness in a sampling-theory sense and as such violates the likelihood principle. I understand that it gives practitioners a way to “tell a story about the data”, and I don’t want to discourage that at all… but the philosophical argument for this is at best very piecemeal.

The thing that people overlook (but which Savage and others initially emphasized more) is that EVERY example of the use of an error probability is an LP (i.e., SLP) violation. It must be the case, for every such use, that there is what I’ve called an SLP violation pair for sampling theory. But the violation depends on the context: p-value testing, say, as opposed to testing point against point. I am not sure if looking at ratios is a way to generate the SLP violation for a given data set to be used with p-value testing. I’m not sure if anyone has shown this.

Multi-meter?

David:

The methods described in BDA all use the posterior distribution. You need not choose to use these methods but I don’t think it makes sense for you to say they are not “truly” Bayesian.

Mayo:

The stopping rule in the example you give is obviously unreasonable to a sampling theorist. In the more usual examples, sensitivity to seemingly reasonable stopping rules is demonstrated. So, yes, it is an interesting example – but looking at just this example distorts the story.

The fact that some Bayesians ridicule p-values by showing that they depend on the intentions of researchers (unobservable in the “comedy hour” examples, but often in reality controllable by proper protocols) seems to me to be another symptom of a misguided striving to *appear* objective in science, by which most people mean “independent of the observer and her/his intentions”. And the “problem” is made to disappear by using methods that just ignore such dependence on intentions, voilà! Of course the outcome of such methods still depends on the observer in all kinds of ways (although the Bayesian would probably say “only through the prior”), but they do a better job of hiding this.

I think we need methods that make transparent more honestly how results in science depend on researcher’s aims and intentions, not less (I give to the subjective Bayesians that at least regarding the prior, they acknowledge this, though it shouldn’t stop there).

Christian: aims, goals, context and the error of concern are things that should be made paramount. The trouble with saying that at least a subjective prior goes some way toward accomplishing this completely misses the crucial need, at least from our perspective, to capture and control the relevant error probabilities, without which one cannot assess how good a job the method has done in discriminating the error of interest. My point is that recognizing goals of some sort is not the same as recognizing scientifically relevant goals, or goals that go hand in hand with techniques that can satisfy them, e.g., by pinpointing what’s wrong with the kinds of techniques we saw in the Tilburg report. If optional stopping doesn’t matter, why should multiple testing, cherry-picking, p-value hacking (a new term used in that area), or post-data selection of subgroups? Your result is not statistically significant? Try some more data, or a different subgroup – it’s all just a matter of your intentions, locked in your head. Not for those who have a way to pick up on them, e.g., via a sampling distribution and error probabilities. It is noteworthy, I think, that the fraud detectors are outing these shenanigans with error statistical methods. Not so invisible after all.

https://errorstatistics.com/2013/04/01/flawed-science-and-stapel-priming-for-a-backlash/

Andrew:

Actually, I do try to consider posterior predictive checking in at least a crude form in my work.

I do this because lacking the tools and/or skills to do full Bayesian robustness checking it is (perhaps) the next best thing. I am uncomfortable with rejecting the usual foundations of statistics on the basis that it is difficult to do with current tools, I realise you see this differently…

If posterior predictive checking is “truly Bayesian” or not is I guess a semantic argument… but in general terms, I think I am only echoing a distinction that you yourself make.

The fact that the inference should depend on the stopping rule becomes clearer when one makes an (imperfect) legal analogy. The evidence presented by the prosecutor is not enough; we have to know how he obtained the evidence. For example, if he went out of his way to ignore exculpatory evidence, we want to know that.

normaldeviate: The optional-stopping-rule situation and the omitting-relevant-evidence situation aren’t equivalent. Suppose I have a six-sided die, possibly loaded, and I record data from some rolls, but I only record 5 and 6 outcomes; I even refuse to record the total number of rolls I made. You can easily check that the likelihood function of these data are not proportional to the likelihood function of the full data — it’s not even a function of the same set of parameters. Gelman’s book gives all the Bayesian math for these sorts of things.

Whoops: “… *is* not proportional”

(grumble grumble stupid brain grumble grumble)

I didn’t say they were equivalent.

normaldeviate: I didn’t say you did. My comment can reasonably be read as implying that you claimed they’re equivalent; fair enough. In likewise fashion, your comment might be read as implying they’re equivalent.

Corey: This requires knowing the model sufficiently to check. If things like stopping rules and multiple testing are considered irrelevant for inference, then there’s no onus to report them.

As I pointed out in a previous blog post, there are two very different ways of handling the evidence from sequential trials: 1. You are counting significant results. 2. You are weighing results by the amount of information that produced the degree of significance seen. In the latter case, trials that stopped early will get less weight. This means that in a long series of trials analysed sequentially without adjusting for the possibility of early stopping (a quite wrong thing to do according to conventional frequentist teaching), 1) the expected number of significant results would exceed the (falsely) claimed error probability under the null, yet 2) the meta-analysis of the results produced might nevertheless be perfectly reasonable. See https://errorstatistics.com/2012/09/29/stephen-senn-on-the-irrelevance-of-stopping-rules-in-meta-analysis/

Stephen: Thanks for reminding me of your post on this.

Stephen: Here’s an intuitive query, without really understanding the specifics of your example: when can counting significant results (way #1) from a batch of results, each given a small weight (way #2), yield a strong result? The various individually weak results, it would seem, would need to be fortifying each other in some way–which I take it assumes they are all probing the same hypothesized effect.

A related point from Westfall and Young (1993) (Resampling-based multiple testing):

“once meta-analysis is adopted, one completely loses the ability to say which particular results are significant.” (23).

Deborah, I am not sure that I understand your question. The point is that as soon as you use a sequential approach you don’t know when the trial will stop. There are then two ways of looking at any stopped trial. a) As a random instance of an identical family of such trials. Their defining characteristic is the rule that they all follow, and it is this rule and only this rule that defines the family. All such trials, whenever they stopped, are evidentially identical. b) As a trial that collected the given amount of information it collected. Other trials that followed the same protocol collected different amounts of information. They are not evidentially identical: bigger trials are more important than smaller ones, even though the bigger trials might have been smaller and vice versa.

What I was arguing in my post is that it depends on how you combine the results from trials as to whether the stopping rule matters or not. In meta-analysis you weight by size of trial. This corresponds implicitly to point of view b) and in this view the stopping rule does not matter.

Stephen: Reading it again, I’m not sure I really do understand (a), because of a possible equivocation in “whenever they stopped” in your “All such trials whenever they stopped are evidentially identical.” Does this mean: consider the point where it stopped, and compare to those that also stopped at this point?

So take a concrete example: a series of trials all run under the identical protocol that you may look after 300 patients and stop if the result looks good, or carry on to 600 if not. You now foresee that you will have a mixture of such trials. Do you say a) they were all run under the same protocol and so must be treated equally, or b) they must be split into two sets, a set of large trials and a set of small ones, with the former providing more evidence? In case b) the stopping rule does not matter. You do not need to adjust the point estimates because, although the set of stopped trials will overestimate the treatment effect, the set of continued trials will underestimate it. If the continued trials are given twice the weight of the stopped ones, these two biases cancel exactly, but if the continued trials are given the same weight as the stopped ones they do not. This all carries shades of Doob, martingales, the impossibility of winning gambling strategies, and the quit-while-you-are-ahead fallacy.
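Stephen’s cancellation claim is easy to check numerically. The sketch below is my own illustration, not from the comment: the interim “looks good” boundary (z > 1 after 300 patients) is an arbitrary assumption, and the true treatment effect is set to zero. Averaging the per-trial estimates with equal weights is biased upward; pooling by sample size, which implicitly gives the continued 600-patient trials twice the weight of the stopped ones, is not.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000   # a large family of trials run under the same protocol
est = []             # per-trial effect estimates (equal weighting)
sums, sizes = [], [] # running totals for size-weighted (meta-analytic) pooling

for _ in range(n_trials):
    first = rng.normal(0.0, 1.0, 300)          # true effect is 0 (null is true)
    if first.mean() * np.sqrt(300) > 1.0:      # interim look "looks good": stop at 300
        data = first
    else:                                      # otherwise carry on to 600 patients
        data = np.concatenate([first, rng.normal(0.0, 1.0, 300)])
    est.append(data.mean())
    sums.append(data.sum())
    sizes.append(len(data))

equal_weight = np.mean(est)                 # stopped and continued trials weighted equally
size_weight = np.sum(sums) / np.sum(sizes)  # continued (600-patient) trials get double weight
print(f"equal-weight mean of estimates: {equal_weight:+.4f}")  # noticeably above 0
print(f"size-weighted pooled estimate:  {size_weight:+.4f}")   # close to 0
```

The size-weighted estimate behaves well because E[S_N] = μ·E[N] for any stopping time (Wald’s identity), which is the martingale connection Stephen alludes to; equal weighting lets the overestimating stopped trials dominate.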

I’d be interested to know what people think of Savage’s reference to the “simple dichotomy” (simple against simple) on pp. 73-4 of Savage.

http://www.phil.vt.edu/dmayo/PhilStatistics/savage forum pages 71 through 77.pdf

But if one really is in the case where it is known the parameter is one of the two points, then optional stopping doesn’t show up in sampling theory either, according to Barnard, Cox, and others.

The link is broken. Here’s one that works.

Corey: Thanks. I wonder why the one in the footnotes is OK while the same link pasted in a comment is not.

Anyway, looking back at this I am reminded of something I’ve always puzzled over: Why does Savage say that in 1952 he regarded the “Stopping Rule Principle” (SRP) as “patently wrong,” just as now he thinks it’s “patently right”? What I mean is, what happened to whatever argument convinced him in 1952? And given his earlier self, should he be so very vehement in upholding it in 1961, or whenever this conference took place? Should he really be so adamant in denying Barnard’s principled basis for distinguishing two cases (one where it seems to hold and another where it does not)?

Mayo: The edit window for comments isn’t very smart — it’s raw text only. After a comment is submitted, it’s checked for html and URLs. The comment parser recognized some of the text as a URL and linkified it, but that text contains whitespace, so the parser generated a broken link.

I believe I’ve read that Savage started as a good ol’ fashioned frequentist, so in 1952 he was probably thinking of error probabilities and their sample space dependence. By 1961 he had long since formulated his axioms and realized that they implied Bayesian updates, the LP, and hence the SRP.

Corey: Sure, but it shows a feigned dogmatism….

“In fact, ignoring the stopping rule allows a high or maximal probability of error.”

Doesn’t a sequential stopping rule tend to decrease the false negative error rate at the same time as it increases the false positive rate? If that is true then discussion of stopping rules in the context of statistical and scientific inference needs to consider both types of error.

It is not hard to imagine situations where the rate of erroneous inference decreases with optional stopping.

Michael: Well, if the procedure is really bound to reject the null, then there’s no false negative. This differs from the case of a fixed simple alternative.

That’s just a cop out! The procedure is not ‘really’ bound to reject the null unless the experimenter really does have access to an infinite sample size and the time needed to evaluate an infinite sample. (Note that ‘really’ refers to any achievable reality, not some mathematician’s dream.)

If a real scientist is intending to make a sample of, say, 20 observations, then the fact that the procedure will certainly give a ‘significant’ result with an infinite sample size is not very important if the largest achievable sample size is, say, 100, for which the probability of achieving a significant result when the null is true is closer to 0.05 than to 1.

When the false negative rate declines faster than the false positive rate then an unadjusted sequential sampling scheme can be a better design for inferential support than a fixed sample size design, particularly if the null is often false.

Any discussion of these issues that omits consideration of cases where the null is false is misleading.
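Both error rates are easy to simulate. The following sketch is my own construction (ten equally spaced looks, a nominal two-sided 5% z-test at each look, null true, known unit variance); it illustrates Michael’s intermediate case: with a capped sample size, the overall false positive rate is well above the nominal 0.05 but nowhere near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 50_000
rejections = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, 100)      # null hypothesis true, max sample size 100
    for n in range(10, 101, 10):       # peek after every 10 observations
        if abs(x[:n].mean()) * np.sqrt(n) > 1.96:  # nominal 5% two-sided z-test
            rejections += 1            # stop and declare significance
            break

rate = rejections / n_sims
print("overall type I error with 10 looks:", rate)  # roughly 0.19: inflated, but far from 1
```

This matches the classical repeated-significance-testing results: each extra look inflates the overall alpha, but with a bounded sample the procedure is nowhere near “bound to reject.”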

Michael: The issue isn’t which of two or however many procedures to use–e.g., fixed or sequential of various sorts–but whether there is a rationale for considering their results evidentially inequivalent, or not. You seem to be saying we should consider their values for relevant error probabilities, and the error statistician agrees. If one cannot pick up on the difference, and denies there ought even to be one (in interpreting results), then the issue being raised with Good arises.

Mayo, you are either ignoring my point or dancing around it so broadly that I can’t see the connection. If a sample is increased in size then it gains more evidence. How that evidence should be interpreted may be an issue to discuss, but the fact that a larger sample contains more evidence seems beyond dispute.

The fact that Neyman and Pearson chose to fix alpha and optimise beta rather than to optimise some function of both alpha and beta seems to be a problem here. When I say look at both type I and type II errors, you seem only to hear the type I bit. Perhaps because you think evidence relates to alpha alone…

My previous comment was not intended to imply that the results of a fixed sample size experiment and a sequential experiment with a larger sample are evidentially equivalent, because they are not. (Read the first paragraph of this comment again.) Whether the results of a fixed sample size experiment and a sequential experiment that stops with the same sample (both size and values) are evidentially equivalent depends on whether the likelihood principle is true and relevant to the definition of evidence. I believe that it is both true and relevant. You don’t. You should not pretend that the failure of an inferential system to conform to your special approach to evidence represents a universal failure.

Michael: You are combining so many diverse issues that it is impossible to address them adequately here. All my papers can be found linked to this blog. Yes, large sample size is good–not the issue. Yes, some people like the idea of optimizing a function of alpha and beta–not the issue, but I do not endorse either N-P predesignated accounts or those. Moreover, the reason some prefer optimizing a function of alpha and beta rather than fixing alpha is taken care of in the severity assessment, which I do favor (e.g., Mayo and Spanos 2011). Your point about the LP is relevant, but simply declaring it true does not answer the intuitions against it in the example under discussion.

So I don’t think I’m ignoring or dancing (not even tap dancing), but trying to give brief reactions to comments, without taking up everything at once. Thanks.

Michael: How do you square thinking that the LP is true with thinking that different error probabilities resulting from different stopping rules matter at all?

The different error rates from different stopping rules matter mostly because some people care about them to the exclusion of other considerations. They don’t matter that much to me. In fact, it is my opinion that a focus on error rates hinders adequate scientific consideration of evidence and plausibility by scientists.

I’m curious. I was reading Rubin’s papers on the Bayesian perspective in experiments, and one thing that strikes me is that potential outcomes (i.e., what could have happened, but didn’t, if the treatment was given to experimental units, all else constant) are crucial for causal analysis. Then, my question is: does this violate the LP (or SLP, if you prefer)? And if so, what does it mean?

I’ll take a look at Gelman’s (and Rubin’s!) BDA to see if I can improve my understanding of this, but if any of you can provide any comments on these matters, it would be greatly appreciated.

Best,

Manoel

Manoel: Good question!

Can you clarify the following issue? I don’t understand how it fits into your argument.

The Bayesian would like to write down the likelihood function p(D|a,b,c,d…) where D is the data and a,b,c,d… are the unknown portions of the Bayesian’s model. And then write down the prior P(a,b,c,d…) and get a posterior on a,b,c,d…

If the Bayesian knows the data collection protocol, and hence the stopping criterion, then he should include n as one of the parameters, and should write down the correct likelihood which will depend on the stopping rule, and should write down priors which are informed by the stopping rule.

Are you saying that the appropriate likelihood function p(D|a,b,c,d…) is independent of both n and the stopping rule specification? I haven’t gone through a detailed example, but this seems unlikely to me.

The issue is whether the part of the probability that depends on the unknown parameter value varies according to the stopping rule. Consider, for example, that the probability of observing five successes and five failures in any order in ten trials is a very different probability from that of observing exactly five successes and then five failures. The point is, however, that the former is 10 choose 5 = 252 times larger than the latter, whatever the probability of an individual success. Thus, from the likelihood point of view, it makes no difference to any inference you should make about the probability of a success whether I tell you one or the other.
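A minimal check of this point: the binomial coefficient multiplies the probability by the same constant, 10 choose 5 = 252, whatever the value of p, so it cancels from every likelihood ratio.

```python
from math import comb, isclose

for p in (0.1, 0.3, 0.5, 0.9):
    exact_order = p**5 * (1 - p)**5        # five successes then five failures
    any_order = comb(10, 5) * exact_order  # five of each, in any order
    # the ratio is 10 choose 5 = 252 regardless of p, so likelihood ratios
    # (and hence LP-respecting inferences about p) are unchanged
    assert isclose(any_order / exact_order, 252)
print("ratio is 252 for every p")
```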

I still don’t see that this answers the question. Suppose you are flipping a coin and your stopping rule is to flip until you get 5 heads in a row. There are now potentially 3 parameters in my bayesian model: p, rho, and n, where rho is the correlation between successive flips and n is the actual number of flips required. If you tell me that you got 520 heads and 480 tails but you flipped 1000 times before you got 5 in a row, that will certainly inform me about serial correlation. Only if you specifically require serial correlation of zero will n be uninformative for the model. But your stopping rule is hugely sensitive to serial correlation. So if the argument is “for models that are insensitive to the stopping rule by assumption, the model will be insensitive to the stopping rule,” then I say yes, that’s true but uninteresting.

Stephen. I really like this way of putting it. It helps to illuminate an argument I’ve long been trying to make as regards the SLP (with mixed success).

Daniel: I think I dealt with this in my reply to Tom–the first comment.

Another way of putting this is that a “Bayesian” who claims that stopping rules are irrelevant is one who is committed to a restricted set of models, ie. one who has some kind of delta-function prior over aspects of the model that could be sensitive to stopping rules. It’s not very difficult to see that Bayesian inference in the presence of delta-function priors is not very interesting.

Daniel: That’s because…?

By the way, the LP does assume the correctness of the statistical model, but I realize that’s no longer a clear cut assumption for Bayesians.

Well, I give an example above: it takes us 1000 coin flips to get 5 in a row. For independent flips any given group of 5 is all heads with probability p^5; since we had p ~ 1/2, that’s 1/32, and 1000 flips contain 200 disjoint groups of 5. So the probability of going 1000 coin flips without 5 in a row under an independence assumption is very small; hence the likelihood for rho = 0 is very small. The likelihood function on rho is a strong function of n. So not all likelihoods are necessarily insensitive to stopping rules. The likelihood on p may very well be a function of rho when rho != 0 as well; I haven’t thought about it much, but the point is that only under certain models will the likelihood be insensitive to the stopping rule. A bayesian who asserts this is true is asserting that models where rho != 0 are impossible.
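A quick simulation firms up the arithmetic (my own sketch, assuming independent fair flips): the expected wait for 5 heads in a row is 62 flips, a standard result, so under rho = 0 a wait of 1000 flips essentially never happens.

```python
import random

random.seed(1)

def flips_until_run(run_len=5, p=0.5, cap=5000):
    """Flip an independent coin (P(heads)=p) until run_len heads in a row; capped."""
    streak = n = 0
    while streak < run_len and n < cap:
        n += 1
        streak = streak + 1 if random.random() < p else 0
    return n

waits = [flips_until_run() for _ in range(20_000)]
print("mean flips needed:", sum(waits) / len(waits))   # theory says 62
print("share of runs needing >= 1000 flips:",
      sum(w >= 1000 for w in waits) / len(waits))      # essentially zero
```

So observing n = 1000 under this stopping rule is strong evidence against the independence (rho = 0) model, exactly as the comment argues.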

In error statistics, as I understand it, the simple null hypothesis model is first assumed true, and then we calculate the probability of seeing data similar to the data we saw, and then we reject the null based on this p being small. This emphasizes that we are not interested in things that could be consistent with boring hypotheses.

In most Bayesian statistics, we assume a model that tries its best to explain the data accurately based on scientific knowledge and a reasonable set of assumptions. The models are potentially complicated and should include information about the data collection process if that process seems to be relevant to the outcome. We also generally allow that alternative explanations and hence alternative models might be valid as well. Often we try to expand a single model to incorporate a pretty wide class of explanations (this is Gelman’s “continuous model expansion”).

Often we work on problems where we KNOW FOR A FACT that the model is wrong; we just hope that the way in which it is wrong does not strongly affect what we really care about (ie. non-nuisance parameters).

In case it wasn’t clear, the assumption of the correct statistical model is so strong as to make the argument meaningless for many Bayesians. Bayesian statistics thrives on problems that are complicated and require complicated interactions, and in general will approximate or simplify many aspects of the problem in a way that is technically wrong but hopefully non-essential. One extremely simple example is the use of a normal distribution for something which is guaranteed to be positive. N(350,10^2) for the number of days that something happened has such a small probability of being negative that we ignore it, but we know it’s wrong because there is some probability of getting -1, a meaningless result.

We are free to ignore the fact that some likelihood functions are insensitive to some stopping rules, and simply substitute a likelihood that we believe better explains the process, and hence IS sensitive to the stopping rule. Or, if we have strong reasons to believe the process is properly modeled by a likelihood that is independent of the stopping rule, then in fact the stopping rule won’t matter (unless we’re wrong about the model, but that’s always a possibility).

One aspect of Error Statistics which is entirely counter to most of Bayesian statistics is that the model in question for the null hypothesis is *always right by definition*. ie. the null hypothesis is extremely well defined albeit boring.

Daniel: Someone might suspect you of being a plant, given that we were just listing, in relation to a paper for an upcoming conference here on Ontology and Methodology, that someone might (and some do) criticize statistical significance test reasoning as regarding a statistical null hypothesis as true. Thinking it too much of a howler, we were about to take it off the list, now it’s back. I don’t know where you have (mis)learned statistics, but error statistics is all about finding, testing, using models that at best approximate aspects of a phenomenon of interest. As a proponent of Bayesianism, on the other hand, I suppose you are not bothered by the need to enter into the analysis with a complete set of the possible hypotheses that could explain/predict a phenomenon of interest, plus prior degrees of belief or other prior probability assignments to all the hypotheses, models, catchalls, and the statistical assumptions underlying them?

On the matter of general scientific reasoning: I suppose that when physicists test and probe in the following way:

“assuming the Standard Model, what’s the rate of events expected?” that Daniel thinks they are assuming the Standard Model is “right by definition”?

Hypothetical reasoning is just that, hypothetical…

That said, your comment still doesn’t manage to get to the issue of distinguishing the evidential import of sequential vs fixed sampling for a given example.

I don’t claim that by definition the standard model is a correct explanation of reality, I only claim that by definition the standard model is a correct explanation of the predictions that the standard model makes.

When you say “suppose that the standard model is true, what’s the rate of events expected?” the first thing you do is not to sit down and observe the standard model in some experiments, then build a model of the standard model and hope that this model of the standard model is sufficiently close to the “true but unknown standard model”. You have the standard model itself!

So for the question of what the standard model predicts, you have an exact correspondence between the model you put in the computer and the model that we want to observe. Sure, there are some numerical approximation procedures and so forth, but these are just computational nuisances.

Testing a null hypothesis is basically about testing whether a particular well specified fixed model produces data that is consistent with observed data in a certain way.

In a Bayesian analysis, the purpose is usually different, make some assumptions about what would be reasonable to expect, and then discover what those assumptions mean about things we don’t know (namely parameter values that are consistent both with our assumptions and the data) as well as potentially to choose among several alternative types of model.

I don’t have any problem with the need, in theory, to provide some kind of comprehensive set of beliefs about all possible models. I just know that in practice I have to choose some range of models that is sufficiently broad that it encompasses everything that I want to consider at the moment based on my knowledge at the moment and that it won’t look anything like the full set of expressible models. One nice thing about probability measures is that they have finite measure, so we are free to throw away pretty much an infinity of alternatives.

To be clear, I don’t deny the validity of Error Statistics either. I just think it answers a very different kind of question than the ones that I usually pose for myself.

Also, Error Statistics has a sense of “surprise”, ie. p < some small value, which is interpreted as meaning that an alternative model is needed. Bayesian analysis has a similar thing: some kind of deviation between the prior and the posterior. If the posterior doesn't look like the prior, only more specific, then there is something to be reconsidered.

Daniel, I am struggling to reconcile your version of Bayesian reasoning with what is in most texts (except Gelman’s). I read you to say that sometimes sampling error should be modeled, but not always. Error statistics has sampling error as a central consideration, though not the sole one. As a scientist, I have not seen any published papers with Bayesian models that dealt with sampling error, though they might be out there. In fact, I have the perception that being able to ignore sampling error is what attracts many to the Bayesian way to begin with. It seems crisp and clean if you do not worry about sampling issues. But it is also junk if you do not deal with sampling issues. Can you clarify your view on this evidential import issue, and whether you do or do not account for sampling error?

Finally, I should say that either an experiment is a timeseries, in which things happen during data collection that affect the collection process, in which case we will in general need a timeseries-type model to explain it, or it isn’t, in which case we don’t.

It can be the case that a timeseries process does not need an explicit timeseries model to arrive at correct results, but this is a special case.

Any Bayesian who claims that the results of inference on any experiment CAN NOT depend on how many data points you collect, even though the collection procedure at time t depends on the data collected for t’ < t is saying something which is obviously false. They're not wrong because they're Bayesian, they're wrong because they're wrong.

The appropriate model for such data collection procedures is some kind of markov timeseries model. Only under very specific conditions will that model give the same results as a time-independent model.

Forgive me for oversimplifying, but I am not a statistician and hope take away some basic lessons from all this. So, I can see that being wrong is wrong, Bayesian or not, but you seem to have thrown the LP out the back door (and quietly). Is that your intent? I say this because you depart from the cavalier assumption that the model is adequate, and are willing to look at how the data were collected and the implications of that on expected outcomes. Do sampling considerations inform some error statistic in your results, such as a confidence measure?

I should tell you that I am basically a mixture of engineer, and applied mathematician, using Bayesian models when appropriate for problems in physical systems, biological systems, signal processing etc. I don’t do randomized controlled drug trials for example.

I think the LP makes sense *within the model context* so once one has a model, all the information about the parameters comes from that likelihood (and the prior). But to say that because you’ve chosen your model, no other model could possibly be a better model is just insane. So I wouldn’t say that I throw out the LP, I just say that it’s relevant to the problem of inference within a model context (after all, the likelihood isn’t even defined until after you specify the model!).

How to compare models is a problem that has been attacked by both Bayesian and Frequentist methods. The main Bayesian method that I know of is Bayes Factors and this is relatively interesting when the models are of a very different nature. Essentially you put a meta-model over the several types of data models and a meta prior over the relative belief in which model is likely to explain better. Many of the problems this approach has seem to come from people using non-informative type priors, something I just don’t ever do in my work.

When you are interested in a class of models which can be embedded into a larger class, typically one where some coefficients are potentially set to zero, there was just some discussion on Gelman’s blog about how “wacky priors can work well” or some such thing, where people were putting priors on coefficients that have mass either away from zero or tightly coupled to zero, but no mass at small nonzero values. The goal there is to select coefficients to set to zero so that you can do a full Bayesian analysis which simultaneously selects the relevant terms in the model and estimates the remaining parameters.
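For what it’s worth, the simplest version of the Bayes factor machinery also makes the thread’s stopping-rule point concrete. In this hypothetical sketch (data and hypotheses invented for illustration), 7 successes in 10 Bernoulli trials are weighed under two point hypotheses; the Bayes factor is identical whether n was fixed in advance (binomial likelihood) or sampling continued until the 7th success (negative binomial), because the design-dependent combinatorial factor cancels.

```python
from math import comb, isclose

k, n = 7, 10          # observed: 7 successes in 10 trials (hypothetical data)
p0, p1 = 0.5, 0.8     # two point hypotheses, "simple against simple"

def lik_binomial(p):          # design 1: n fixed in advance
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lik_negbinomial(p):       # design 2: flip until the k-th success
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

bf_fixed = lik_binomial(p1) / lik_binomial(p0)
bf_sequential = lik_negbinomial(p1) / lik_negbinomial(p0)
assert isclose(bf_fixed, bf_sequential)   # the comb(...) factors cancel
print("Bayes factor under either design:", bf_fixed)
```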

I think that the time-series aspect is a bit of a red herring. Nearly all randomised clinical trial modelling, whether frequentist or Bayesian, models the patients as independent observations. And whereas frequentist analyses take account of the stopping rule and Bayesian ones don’t, neither deals with any time series aspect.

When John Cook worked for a cancer center, a bunch of his blog posts came out of his trying to compute Bayesian inequalities so that decisions could be made about the future enrollment criteria for different arms of cancer treatment studies, so there is an example.

Also, the fact that people make some kind of approximations in practice does not mean that those approximations are always justified by proper reasoning. Plenty of people do bad analysis by many methods.

I don’t mean to imply that all RCTs using Bayesian methods that don’t use timeseries are “bad analysis,” but if the trial had some kind of aspect where at various points in the analysis decisions were made about enrollment or changes to treatment, and then at the end all patients were grouped into a single group and treated the same whether they were before or after the decision points, then that would be bad Bayesian analysis.

However, when these things do happen it seems much more likely that indicator variables for the groups are usually created, and the groups are given their own degrees of freedom in the model, so in that sense it is a “timeseries” analysis even if it isn’t some kind of fancy method.