# “Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday

27 May 1923 – 1 July 1976

Today is Allan Birnbaum’s birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the Strong Likelihood Principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).

Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or the “researcher’s intentions”) with “error probabilities” (or “the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account[2]. For error statisticians, this information reflects real and crucial properties of your inference procedure.
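To make “trying and trying again, stopping when the data look good” concrete, here is a minimal simulation sketch. It is my own illustration, not something from Birnbaum or the Statistical Science papers: the data are standard normal with a true null (mean 0), the nominal cutoff is the usual two-sided 1.96, and the cap of 100 observations, the 2000 simulated trials and the function names are all assumptions chosen for the example.

```python
import math
import random

Z_CRIT = 1.96  # nominal two-sided 5% cutoff (assumed for illustration)

def fixed_n_rejects(n, rng):
    """Collect exactly n observations under a true null, then test once."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / math.sqrt(n)) > Z_CRIT

def try_and_try_again_rejects(max_n, rng):
    """Test after every observation; stop as soon as the data 'look good'."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / math.sqrt(n)) > Z_CRIT:
            return True
    return False

rng = random.Random(0)  # fixed seed so the sketch is reproducible
trials = 2000

# Relative frequency of erroneously rejecting a true null under each rule.
fixed_rate = sum(fixed_n_rejects(100, rng) for _ in range(trials)) / trials
peeking_rate = sum(try_and_try_again_rejects(100, rng) for _ in range(trials)) / trials

print(f"fixed n = 100:                  {fixed_rate:.3f}")
print(f"stop when it looks good (<=100): {peeking_rate:.3f}")
```

The final data set has the same likelihood function whichever rule produced it, so an account that obeys the LP/SLP has no place to register the difference between the two rates. The difference is a property of the procedure, which is exactly what gets relabeled as “intentions”.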

Birnbaum struggled. Why? Because he regarded controlling the probability of misleading interpretations to be essential for scientific inference, and yet he seemed to have demonstrated that the LP/SLP followed from frequentist principles! That would mean error statistical principles entailed the denial of error probabilities! For many years this was assumed to be the case, and accounts that rejected error probabilities flourished. Frequentists often admitted their approach seemed to lack what Birnbaum called a “concept of evidence”–even those who suspected there was something pretty fishy about Birnbaum’s “proof”.  I have shown the flaw in Birnbaum’s alleged demonstration of the LP/SLP (most fully in the Statistical Science issue). (It only uses logic, really, yet philosophers of science do not seem interested in it.) [3]

The Statistical Science Issue: This is the first Birnbaum birthday that I can point to the Statistical Science issue being out. I’ve a hunch that Birnbaum would have liked my rejoinder to discussants (Statistical Science): Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu. For those unfamiliar with the argument, at the end of this entry are slides from an entirely informal talk as well as some links from this blog. Happy Birthday Birnbaum!

[1] The Weak LP concerns a single experiment; whereas, the strong LP concerns two (or more) experiments. The weak LP is essentially just the sufficiency principle.

[2] I will give \$50 for each of the first 30 distinct (fully cited and linked) published examples (with distinct authors) that readers find of criticisms of frequentist methods based on arguing against the relevance of “intentions”. Include as much of the cited material as needed for a reader to grasp the general argument. Entries must be posted as a comment to this post.*

[3] The argument still cries out for being translated into a symbolic logic of some sort.

Excerpts from my Rejoinder

I.  Introduction

……As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than of historical interest, these shifts provide a unique perspective on the current problem.

Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968 at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf) which takes from the Neyman-Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is Giere 1977. Before his untimely death (at 53), Birnbaum denies the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970, 1033). I return to this in part IV of this rejoinder……

IV Post-SLP foundations

Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).

The problem-area of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice. (Birnbaum 1969, 113)

Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter”(1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf) that goes beyond the formal theory.  What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):

(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for H2 against H1’ with small probability (α) when H1 is true, and with much larger probability (1 – β) when H2 is true. (1977, 24)

On the basis of (Conf), Birnbaum reinterprets statistical outputs from N-P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24-26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two hypothesis setting), the spirit of (Conf), that error probabilities qualify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call severity. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).
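As a rough numerical illustration of the two error probabilities that (Conf) refers to, consider a sketch of a one-sided test on a normal mean with known standard deviation. Everything specific here is an assumption made for the example and not Birnbaum’s: the hypotheses H1: μ = 0 and H2: μ = 1, σ = 1, n = 25, the cutoff of 0.4 on the sample mean, and the function name.

```python
import math
from statistics import NormalDist

def conf_error_probabilities(mu1, mu2, sigma, n, cutoff):
    """Rule (assumed): report 'strong evidence for H2 against H1'
    whenever the sample mean exceeds `cutoff`.

    Returns (alpha, power): the probability of that report when H1 is
    true, and the probability of that report when H2 is true.
    """
    se = sigma / math.sqrt(n)  # standard error of the sample mean
    alpha = 1.0 - NormalDist(mu1, se).cdf(cutoff)
    power = 1.0 - NormalDist(mu2, se).cdf(cutoff)
    return alpha, power

# Hypothetical numbers: the cutoff sits 2 standard errors above mu1
# and 3 standard errors below mu2.
alpha, power = conf_error_probabilities(mu1=0.0, mu2=1.0, sigma=1.0, n=25, cutoff=0.4)
print(f"alpha     = {alpha:.4f}")
print(f"1 - beta  = {power:.4f}")
```

With these made-up numbers the rule meets the (Conf) requirement: the “strong evidence for H2” report is rare under H1 and nearly certain under H2. These are pre-data error probabilities only; the extensions mentioned above are not captured by the sketch.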

Birnbaum’s philosophy evolved from seeking concepts of evidence in degree of support, belief, or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships—a paradigm philosophers of statistics have been slow to throw off.  The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227-239.

Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240-241.

Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242-246.

Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247-251.

Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252-253.

Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254-258.

Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259-260.

Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261-266.

Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x and y from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1( . ), f2( . ), then even though f1(x; θ) = c f2(y; θ) for all θ, outcomes x and y may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].

Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
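For readers who want to see an SLP violation in numbers, here is a small sketch using the familiar coin-tossing illustration (it is a standard textbook-style case, not an example taken from the paper). Assume 9 heads and 3 tails are observed, the null is θ = 0.5 against θ > 0.5, and the data arose either from E1, a fixed 12 tosses (binomial), or from E2, tossing until the 3rd tail (negative binomial). The counts, the function names and the choice of exact fractions are my assumptions for the example.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 9, 3  # assumed observed counts

def binomial_likelihood(theta):
    """E1: toss exactly HEADS + TAILS times."""
    n = HEADS + TAILS
    return comb(n, HEADS) * theta**HEADS * (1 - theta)**TAILS

def negbinomial_likelihood(theta):
    """E2: toss until the TAILS-th tail; the last toss is a tail."""
    m = HEADS + TAILS - 1
    return comb(m, HEADS) * theta**HEADS * (1 - theta)**TAILS

def binomial_p_value():
    """P(at least HEADS heads in HEADS + TAILS tosses) when theta = 1/2."""
    n = HEADS + TAILS
    return sum(Fraction(comb(n, k), 2**n) for k in range(HEADS, n + 1))

def negbinomial_p_value():
    """P(at least HEADS heads before the TAILS-th tail) when theta = 1/2,
    i.e. at most TAILS - 1 tails among the first HEADS + TAILS - 1 tosses."""
    m = HEADS + TAILS - 1
    return sum(Fraction(comb(m, t), 2**m) for t in range(TAILS))

# The likelihoods are proportional: the ratio is the same constant for every theta.
ratios = [binomial_likelihood(t) / negbinomial_likelihood(t)
          for t in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4))]
print("likelihood ratios:", ratios)

p1, p2 = binomial_p_value(), negbinomial_p_value()
print("binomial p-value:         ", p1, "=", float(p1))
print("negative binomial p-value:", p2, "=", float(p2))
```

Here f1(x; θ) = c f2(y; θ) with c = 4 for all θ, so the SLP says x and y have the same evidential import, yet the two sampling distributions give different p-values, one above 0.05 and one below. That difference is what the error statistician insists on retaining.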

Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.

[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”

Some previous posts on this topic can be found at the following links (and by searching this blog with key words):

UPhils and responses

[ii]

Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:

“Putting the Brakes on the Breakthrough, or
‘How I used simple logic to uncover a flaw in a controversial 50 year old ‘theorem’ in statistical foundations taken as a
‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”

Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In Breakthroughs in Statistics, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: Springer-Verlag.

*Judges reserve the right to decide if the example constitutes the relevant use of “intentions” (amid a foundations of statistics criticism) in a published article. Different subsets of authors can count for distinct entries. No more than 2 entries per person. This means we need your name.

### 48 thoughts on ““Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday”

1. e.berk

“A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support… For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis” (Howson and Urbach, 1989, p. 171; Scientific Reasoning: The Bayesian Approach, Open Court).
I don’t have a link to their book, but the quote is found in Mayo, “Error and the Growth of Experimental Knowledge” (EGEK 1996, p. 347): http://www.phil.vt.edu/dmayo/personal_website/EGEKChap10.pdf
I hope that’s good enough.

2. J.A.W

John K. Kruschke “What to believe: Bayesian methods for data analysis”

[Linked PDF: Kruschke2010TiCS.pdf]

“Friends do not let friends compute p values: The crucial problem with NHST is that the p-value is defined in terms of repeating the experiment, and what constitutes the experiment is determined by the experimenter’s intentions.”

• Yes, the fallacy here is one I bring out in EGEK (someplace*). Essentially all inference specifications are “determined by the experimenter’s intentions”(e.g., to study one thing rather than another, to use telescopes with such and such properties, sample humans, monkeys, etc) but it doesn’t follow that “taking account of those specifications” is merely to take account of “intentions”.

*Here it is, chapter 10, pp 346-7.

3. e.berk (1): Of course you’d know that one as a former student–and it’s a perfectly legitimate example of the kind of “argument from intentions” I had in mind. So it counts.

4. J.A.W

E-J Wagenmakers, Michael Lee, Tom Lodewyckx, Geoff Iverson “Bayesian Versus Frequentist Inference”

[Linked PDF: BayesFreqBook.pdf]

“Frequentist Inference Depends on the Intention With Which the Data Were Collected

Because p-values are calculated over the sample space, changes in the sample space can greatly affect the p-value. ….

What this simple example shows is that the intention of the researcher affects statistical inference – the data are consistent with both sampling plans, yet the p-value differs. Berger and Wolpert ([14, page 30-33]) discuss the resulting counterintuitive consequences through a story involving a naive scientist and a frequentist statistician.”

• Yes, mere consistency doesn’t suffice for the error statistician. The altered sample space is scarcely trivial, but corresponds also to an altered sufficient statistic and error probing capacity. Interestingly, nonsubjective (conventional) Bayesians (who I think make up the predominant group of Bayesians these days–even though there are different tribes, and they disagree with each other) are prepared to allow their priors to depend on the sample space. Subjective Bayesians (rightly) object that this is incoherent behavior. But J. Berger says it’s necessary for “objectivity”. Yet, ironically, this still doesn’t obviously bring them into error statistical territory (except possibly for “frequentist matching” priors). This remains an open question for other stripes of Bayesians.

5. J.A.W

William H. Jefferys, “Bayesian Analysis of Random Event Generator Data”
http://www.scientificexploration.org/journal/full/jse_04_full.pdf#page=153

“An important feature of Bayesian hypothesis testing is that the analysis is not affected by considerations such as stopping rules. The intentions of the investigator are simply irrelevant to a Bayesian. A properly formulated Bayesian hypothesis test will be relatively immune even to an obviously illegitimate stopping rule designed to “fool” the analysis, such as the strategy of sampling to a foregone conclusion: “Stop when the value of the test statistic exceeds a preassigned number k” (Berger, 1985, §7.7).”

• J. Berger does admit this, and even concedes this can lead to 95% HPD intervals always excluding the true value (e.g., Berger and Wolpert). However, I seem to recall some Bayesians giving a special definition to a case of “sampling to a foregone conclusion” such that the allowable cases (bad as they are) don’t fall under that umbrella. No time to check, but it’s somewhere in the book jointly authored by Kadane, Schervish, Seidenfeld. I believe those cases are akin to those where the alternative hypothesis is predesignated, but might have this wrong.

6. J.A.W

Colin Howson and Peter Urbach, “Bayesian reasoning in science”

[Linked PDF: Howson and Urbach 1991.pdf]

“So the degree of confidence we are invited to place in an estimate inevitably depends on the private plans of the experimenter, which is surely immensely counterintuitive”

7. J.A.W

Zoltan Dienes, “Bayesian Versus Orthodox Statistics: Which Side Are you on?”

[Linked PDF: Dienes 2011 Bayes.pdf]

“Surely, the subjective intentions concealed in the researcher’s mind are irrelevant when drawing inferences from data—what matters is just what the obtained data are.”

• J.A.W. Thanks much. Since I had stipulated (see *, the small print at the bottom of the post) only 2 per person and you’ve gone beyond this, I will let others have a shot at this (though I may post your others once the 30 examples are reached). Don’t forget to send me info for receiving the award.

• I hadn’t heard of this author before. Yes, it’s amazing how much staying power that logical positivist idea has: that there is a logical relationship between any given data and hypothesis, that the “data should speak for themselves”, and that you shouldn’t need to know how they were collected or how the hypotheses or test rule was formulated. Popper recognized long ago–a central reason for his rejecting logical positivism–that there was a conflict between this idea–that there’s “a logic of evidential relationship”–and our intuitions that ad hoc data, non-novel data, double counting data, etc. are pejorative. The question for the Popperians was: where do we place this extra information? Popper placed it under “novelty” or “severity” but never identified a clear notion, nor did his followers. Lakatos, also a follower, proposed they belong under “historical” or “discovery” considerations. So the idea that you couldn’t have an account of evidence and testing without historical considerations was born. But they never had a satisfactory notion of which parts of history matter. Still, the Lakatosian idea, that it is relevant to consider the history by which a hypothesis or theory was propped up by worm-eaten stays to avoid falsification, was in the right spirit.

8. “Fourth, p depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the evidential strength of observed data depends on things that did not happen and subjective intentions. …In contrast, the likelihood ratio escapes the above problems and is recommended as a tool for psychologists to represent the statistical evidence conveyed by obtained data relative to two hypotheses.” (p. 113)
Johansson T (2010) “Hail the impossible: p-values, evidence, and likelihood.” Scandinavian Journal of Psychology 52:113-125.

• Jean: I never heard of this person; it’s amazing how they all repeat the earlier slogans. I wonder what “first” through “third” were, and maybe there’s a fifth or a sixth. Can you please link to the source?

• Jean

Here are all 4 “problems” according to Johansson (p. 113, link below):
“There are four major problems with using p as a measure of evidence and these problems are often overlooked in the domain of psychology. First, p is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null. Second, p is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always relative in the sense of being evidence for or against a hypothesis relative to another hypothesis. Third, p designates probability of obtaining evidence (given the null), rather than strength of evidence. Fourth, p depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the evidential strength of observed data depends on things that did not happen and subjective intentions.” (Johansson T (2010) “Hail the impossible: p-values, evidence, and likelihood.” Scandinavian Journal of Psychology 52:113-125; emphasis added)

9. vl

“Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging” the terminology is too broad – many of these characteristics can be modeled as conditioning or selection effects which can be modeled probabilistically.

I see intention as a separate issue from biased selection bias. Intention relates to for example, what constitutes the space of comparisons to be incorporated into a multiple comparison adjustment which I have _never_ heard a defensible answer to. Once you adjust for multiple comparisons within an experiment, do you adjust for all the studies happening in the field? Do you adjust for all comparisons in all of science? pre-registration has similar issues with subjectivity.

These concepts do real harm – scientists are afraid to look at data because they don’t (instead of simply modeling the base rate). Furthermore the same multiple comparison adjustments that “seemed” to filter out bad results because they were calibrated for the base rate suddenly don’t work anymore once the power of the study increased.

As for the LP, as an applied practitioner I kind of shrug along with Larry Wasserman regarding foundations. What matters most to scientists and applied practitioners is risk (in a decision theoretic sense).

• VL: “I see intention as a separate issue from biased selection bias. ” I do too. As for when to adjust, see my defn of “biasing selection effects” in the slides for the previous post. Figuring out a precise adjustment is not essential—holding you accountable for reporting an unvalidated error probability is. We don’t give up on making important distinctions simply because there may be ambiguous cases, or because a prima facie concern may be defeasible in a particular case (which severity specifically allows).

10. David

“Thus, p values can only be computed once the sampling plan is fully known and specified in advance. In scientific practice, few people are keenly aware of their intentions, particularly with respect to what to do when the data turn out not to be significant after the first inspection. Still fewer people would adjust their p values on the basis of their intended sampling plan.”

Eric-Jan Wagenmakers, “A practical solution to the pervasive problems of p values” (Psychonomic Bulletin & Review 2007, 14 (5), 779-804)

• David: Thanks for the quote. I see that Wagenmakers was cited before, but since it’s in a different group of authors, it counts. However I need the link.

I want to make two comments, for the general reader (not directed at you, but the author): First, it is not true that the sampling plan must be specified in advance; sequential testing (to name just one example that comes up in relation to “intentions”) is relative to when it stops. Nor are error statisticians incapable of detecting violations in assumptions post data.

Second, as for “In scientific practice, few people are keenly aware of their intentions, particularly with respect to what to do when the data turn out not to be significant after the first inspection. Still fewer people would adjust their p values on the basis of their intended sampling plan,”

my answer is, well isn’t it just too bad that “fewer people would adjust their p-values”? We are free to deny their results are warranted, and to suggest to them that they are enabling, rather than helping to stem, biased selection effects. They are free not to indicate what they’ve done when the data turned out not to be nominally significant after the first inspection, and we are free to discount their results! Failure to be able to assess severity, even approximately, counts as low/poor severity.

• David e-mailed me the link to the Wagenmakers paper that he cited. I put it up in media and then linked, but you should be able to put links in comments directly. See if this works. https://errorstatistics.files.wordpress.com/2015/05/wagenmakers_2007_pvalueproblems.pdf

11. “To clarify the point, consider an example. A malicious experimenter conducts a sequential trial with a certain stopping rule, but the evidence against the null which she finds is not as strong as desired. In particular, the p-value is not significant enough to warrant rejection of the null and to publish the results (p ≈ 0.051). What can she do? The first option consists in outright fraud – she could fake some data (e.g. replace some observed failures by successes) and make the results significant in that way. While tempting, such a deception of the scientific community is risky and would be heavily punished if discovered. The career of our experimenter would be over once and for all. Therefore a second option looks more attractive: not to report the true stopping rule τ_1 (fixed sample size), but a modified stopping rule τ_2 under which the data D yield a p-value smaller than 0.05. The results are now “statistically significant” and get published. But clearly, as readers of a scientific journal, we want to be protected against such tricks. The crucial point is that the malicious experimenter did not manipulate the data: she was just insincere about her intentions when to terminate the experiment. Using fake data involves considerable risk: if continued replications fail to reproduce the results, our experimenter will lose all her reputation. By contrast, she can never be charged for insincerely reporting her intentions. The crucial point here is not the frequently uttered intuition that “intentions cannot matter for strength of evidence” (cf. p. 4), but rather that the scientific community is unable to control whether these intentions have been correctly reported. This inability to detect subjective distortion and manipulation of statistical evidence is a grave problem for frequentist methodology.”

–Sprenger, Jan (2009): Evidence and Experimental Design in Sequential Trials. Philosophy of Science 76: 637–649. http://philsci-archive.pitt.edu/4306/1/PSA2008.pdf

12. “If in a matched pairs trial engaging 6 subject pairs we were to obtain 5 preferences for the experimental treatment and only 1 for the control, the analysis of the results would depend on the design and therefore on the intentions of the investigators. Did they plan to enroll just 6 pairs and then stop to analyse the results? Or had they decided to stop after obtaining at least one preference for the control? The outcome of the analysis will be very different in the two cases. In particular, the trial result would be significant in the second case but not in the first, due to the fact that the set of ‘more extreme values’ used to compute the p-value is different in the two scenarios described.”

–Nardini, Cecilia (2012): Statistics in Clinical Trials: Out of Condition. PhD dissertation. https://air.unimi.it/retrieve/handle/2434/218889/269516/phd_unimi_R08436.pdf

13. I suppose the following doesn’t count (Bernardo and Smith 1994, Section 5.1, pp. 250-253):

> A parametric model for these data thus involves a probability density of the form p(n,x_n|h,theta), conditioning both on the stopping rule [h] (i.e. sampling mechanism) and on an underlying labelling parameter theta…The important question that now arises is the following: under what circumstances, if any, can we proceed to make inferences about theta…without conditioning on the actual form of h…

> …a notationally precise rendering of Bayes’ theorem…reveals that *knowledge of h might well affect the prior density!* It is for this reason that we use the term “likelihood non-informative” rather than just “non-informative” stopping rules. It cannot be emphasised too often that, although it is often convenient for expository reasons to focus at a given juncture on one or other of the “likelihood” and “prior” components of the model, our discussion in Chapter 4 makes clear their basic inseparability in coherent modelling and analysis…

• Another example that doesn’t count (Gelman et al. BDA3, Chapter 8 p. 198):

> A naive student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected. This misplaced appeal to the likelihood principle would assert that given (1) a fixed model (including the prior distribution) for the underlying data and (2) fixed observed values of the data, Bayesian inference is determined regardless of the design for the collection of the data. Under this view there would be no formal role for randomisation in either sample surveys or experiments. The essential flaw in the argument is that a complete definition of ‘the observed data’ should include information on how the observed values arose, and in many situations such information has a direct bearing on how these values should be interpreted. Formally then, the data analyst needs to incorporate the information describing the data collection process in the probability model used for analysis.

• Surely you know that randomization has long had a very uneasy and awkward life in Bayesian inference, especially for subjective Bayesians (see Wasserman, Kadane for 2 off the top of my head). It certainly isn’t required, and most importantly, doesn’t play the 3 roles it does in frequentist inference. The best argument is along the lines of enabling the posterior to be less sensitive to the prior. If I weren’t dashing, I’d look up references. I think Gelman-Bayesians should add an error statistical dimension to their accounts, rather than beating around the bush.

• vl

Randomization doesn’t have anything to do with sensitivity to the prior, it comes in w.r.t. conditioning and causal identifiability.

It’s certainly valid to do a bayesian analysis without randomization, but then what’s estimated doesn’t map to the target causal parameter. Randomization decorrelates the conditioning of the observations with any confounders.

Thus for Bayesians randomization vs. observational data is a relevant distinction in the way that choosing a scale vs. a thermometer for measurements is a relevant distinction for weighing a baby. In both cases, the measurement and object of interest need to be matched. I don’t see anything “awkward” in this relationship.

• Certain aspects of how the data arose may of course be relevant to affirming the model (for a non-subjective Bayesian). Trying to get this into a justification for randomization is a stretch, but the point is, how are these properties used in the inference account? How does such information actually have bearing in the formal account (not just that someone thinks it ought to)? In general, error probabilistic properties do not appear if inference is by Bayes theorem. One cannot just pronounce something, one needs a principled justification for where error probabilities of the method make a difference–if you think it should. Bayesians have been happy as clams that they do NOT deal in error probs because of their repeated sampling justification. Gelman is an error statistician, at least on Saturday, Monday, and Wednesday.

• Well I’m not one to say whether Gelman is an error statistician or not, and I’m sure there are many interesting questions concerning nailing down the formal philosophical stance of something like ‘Gelman-Bayes’ (Box-Rubin-Gelman-Jaynes-Bayes?). I do think it would be a stretch to include Gelman-Bayes under error statistics proper – beyond certain Popperian influences and desire to check models – due to the central role of regularization and (weakly?) informative priors and the eschewing of concepts like type I/II errors, power etc. But I’m sure it could be interesting to examine the formal basis of the approach while sticking as close as possible to how it operates in practice. Without a detailed account of the latter though, I’m not sure the former would be easy to work out.

Regarding the topic of the post more specifically – it is interesting how these ‘howlers’ are repeated so often, and worthwhile to call them out. Unfortunately I think this focus also takes away from either ‘side’ investing more effort in the more interesting arguments and approaches (such as Gelman-Bayes, or even Bernardo and Smith). Most of the arguments about the likelihood principle – while interesting – seem orthogonal to the perspective of something like Gelman-Bayes. It may be naive, but I find it helpful to think in terms of ‘your model for the data should represent (models for) all the relevant processes generating your data’ and that many issues to do with the likelihood principle either concern a misspecified or ill-defined model (where ‘model’ for a Bayesian includes the prior component) or a bias-variance trade-off for point estimates related to things like a paucity of data that could be addressed by e.g. inspecting the posterior (or likelihood) as a whole.

An interesting topic relevant for the error statistician, though, is how Bayesian inference behaves under misspecification in practice. How is it identified, fixed etc etc. There seems to be some quite interesting mathematical work on this topic. I went to a nice presentation by Peter Grünwald last week on this. I enjoyed his ‘geometry of misspecification’ perspective. Also related would be the work on ‘Bayesian brittleness’ you’ve mentioned before. In general though, these involve taking Bayesian inference seriously as an interesting modelling paradigm and investigating its performance under challenging conditions.

• While Gelman has been inspired by some of Jaynes’s writings, he’s not in the Cox-Jaynes school of thought on statistical foundations, so “Gelman-Jaynes-Bayes” isn’t really a thing.

• Fair enough re: Gelman-Jaynes-Bayes. I mainly meant that, while I appreciate and have learned a lot from Gelman’s approach, it is also misleading to imply he is somehow especially unique in his views – there’s actually a pretty common subset of Bayesians with similar views.

• Really? Who reject inference as updating probabilities by Bayes theorem, and deny subjective and conventional priors, and reject Bayes factors, but test models/priors using his form of predictive distributions (using sampling distributions)?

• George Box, for one, has inspired quite a few. I see Gelman as following in that line. Quite a few will cite Box’s 1980 ish work and ‘Box’s loop’ etc. The school of ‘within model: Bayes, out of model: check’.

Posterior predictive distributions are also pretty Bayesian creatures.

Even Bernardo and Smith present a predictive Bayes approach that is operationally pretty similar but with all the subjective terminology baggage.

PS This time I’m the one on a train and dashing!

• George Box advocated an eclectic approach combining frequentist “criticism” and Bayesian “estimation”, but the foundational picture isn’t very clear. His model checking work also inspired non-Bayesian model checking. But I was assuming you had in mind a much more pervasive approach, or is Bayesianism really that fragmented these days?

• Hmm well Gelman’s book has 6 or so authors and seems to sell pretty well so I imagine it’s not a minority view? But yes, I’m sure foundations haven’t kept pace with practice. Eclecticism seems popular among general users, but others have also reinterpreted and built on people like Box’s ideas to bring them more fully under the Bayes umbrella.

• vl

I don’t consider myself a gelman bayesian (I’m not even sure what that means), although I’m heavily influenced by his methods and writings on multilevel models. For example, for the life of me I can’t understand what he’s getting at with the “garden of forking paths” business beyond that _everyone_ can _always_ be accused of violating multiple comparisons and weakly powered studies are bad.

For me the issue comes down to my accepting that some aspect of the formal account of statistical interpretation is extra-statistical. Model mis-specification is one of those things. There are ways of having a probabilistic model over models in bayesian inference, but at some point in the hierarchy you do have to stop.

I do find the causal graph formalism of Judea Pearl and Clark Glymour useful in conceptualizing the conditioning of data. Thus I view randomization from this perspective of detaching a variable so that its variation is exogenous (as economists like to say) to a system. In my view I’m not trying to ‘stretch’ bayesianism to encompass this, I see it as neither a bayesian nor frequentist consideration, but in the realm of causality, conditioning, and model misspecification.

I think you said yourself in another thread that the issue of confounding was deeper than the statistical accounts? Randomization is closely related to confounding in this view.

• VL: I’ll respond once I’m finished traveling tonight/tomorrow. I actually like the garden of forking paths as a struggle to eke out an error statistical criticism.

• I don’t think there’s any question that the garden of forking paths is an error-statistical criticism. Gelman’s whole point is that frequentist methods do not have the advertised error probabilities when researchers decide which methods to apply after examining the data. But it seems to me that Spanos’s misspecification testing (as described in the ‘Intro to Misspecification Testing’ sequence of posts) is vulnerable to a garden of forking paths criticism. I’m thinking in particular of part three, in which he graphs the residuals after removing the time-trend and writes to the effect of, ‘to the trained eye, these data show Markov dependence’ and then does some testing that he leaves out for conciseness and declares in part four that, indeed, a model with Markov dependence is statistically adequate. I don’t think the example Spanos shows is particularly problematic; but what would be problematic is if the data are somewhat ambiguous such that ‘trained eyes’ might or might not perceive Markov dependence. What then could we conclude about the actual error probabilities of formal MS tests?

(Incidentally, part four of the MS sequence appears to have a fallacy of acceptance at the point where Spanos does a simultaneous test of the null hypothesis of zero coefficients for lagged predictors, gets a p-value of 0.823, and then explicitly declares that the null hypothesis is correct. Per your post ‘Anything Tests Can Do, CIs do better…?’ Spanos has skipped a step that he should at least mention: “To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.”)

• Since I’ve been away, I may be picking up in the middle of things.
That error statistical guarantees are directly violated by cherry-picking, data-dependent hypotheses, multiple testing, optional stopping was PROVED by frequentist founders ~80 years ago. I don’t see such a demonstration in Gelman’s forking paths, only an intuitive, informal description of how easy it would be to get statistically significant results often, rather than rarely. For probabilists who accept the likelihood principle, such concerns disappear and merely reflect “intentions”. Gelman should make his error statistical concerns explicit.

• I think m-s testing has some differences from “primary” hypothesis testing (Mayo and Spanos 2004, and much more in Spanos). One is generally not making a statistical inference (to a generalization) and needs only good error properties as regards the inference reached (about the model assumption). I also don’t think that regarding a model as adequate wrt a given assumption is the same as claiming it’s precisely met or some such thing (fallacy of acceptance point). The inference should be to something like, the model is adequate for reliably asking the primary statistical questions. There needn’t be uniqueness.

• Let me quote from the post in question (forgive the lack of subscripts): “A test (an F test) of joint significance of the coefficients of (x_t, x_{t-1}, x_{t-2}), yields F(3, 26) = .302[.823], which does not reject the hypothesis that they are all 0, indicating that the secret variable is uncorrelated with the population variable! That is, despite the earlier “impressive” t-ratios and excellent goodness-of-fit in the estimated equation (1) [see part 1] the secret variables is [sic] unrelated to y_t!”

Where is the identification of discrepancies from the null that can and cannot be ruled out? How is this not a pristine example of the fallacy of acceptance?

• (And just to be clear, the fallacy of acceptance point *is* about the primary statistical inference, not about inference to an adequate model. The sentence that precedes the quoted ones above is, “Having established the statistical adequacy of the estimated DLRM, we are then licensed in making ‘primary’ statistical inferences about the values of parameters in this model.”)

• What you’re missing is that “going where the data take you” can be to a very reliable place. The key is to distinguish types of data dependencies (what my work has been about for years). Once there, the adequate model is a platform for making the primary statistical inferences. But maybe I’m missing your gripe.

• Mayo: In the past, you have written:

“To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ=μ0, we wish to identify discrepancies that can and cannot be ruled out.”

And in regard to the primary statistical inference (i.e., MS testing is complete and this test is the whole point of the analysis), Spanos wrote:

“A test (an F test) of joint significance of the coefficients of (x_t, x_{t-1}, x_{t-2}), yields F(3, 26) = .302[.823], which does not reject the hypothesis that they are all 0, indicating that… the secret variables is [sic] unrelated to [the outcome of interest].”

And my question is: where in Spanos’s example analysis is the identification of discrepancies from the null that can and cannot be ruled out that Mayo has asserted is necessary to avoid the age-old fallacy of acceptance?

I cannot put it more plainly than that.

• @vl

I’m not sure if you’re responding to me or Mayo, but I (broadly) agree with all of your points (as usual, it seems).

14. “…Dr. Bloggs tosses a coin labelled a and b twelve times and the outcome is the string aaabaaaabaab, which contains three bs and nine as. What evidence do these data give that the coin is biased in favour of a?…

[The statistician assumes a binomial experiment and calculates a p-value above 0.05. Dr. Bloggs then reveals that he used the stopping rule ‘stop after three bs’, whereupon the statistician recomputes under the negative binomial distribution and obtains a p-value below 0.05.]

What do you think Dr. Bloggs should do?… At this point the audience divides in two. Half the audience intuitively feel that the stopping rule is irrelevant, and don’t need any convincing that the answer to exercise 37.1 (p.463) is ‘the inferences about [the coin’s bias] do not depend on the stopping rule’. The other half, perhaps on account of a thorough training in sampling theory, intuitively feel that Dr. Bloggs’s stopping rule, which stopped tossing the moment the third b appeared, may have biased the experiment somehow.

As a thought experiment, consider some onlookers who (in order to save money) are spying on Dr. Bloggs’s experiments: each time he tosses the coin, the spies update the values of r and n. The spies are eager to make inferences from the data as soon as each new result occurs. Should the spies’ beliefs about the bias of the coin depend on Dr. Bloggs’s intentions regarding the continuation of the experiment?”

David MacKay, Information Theory, Inference, and Learning Algorithms, pp 462-4.
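For readers who want to check the arithmetic behind the bracketed summary, here is a minimal sketch in Python. It assumes a one-sided test of a fair coin (P(a) = 0.5) against bias towards a, and the function names and the comparison value 0.75 are mine for illustration, not MacKay's. It computes the two p-values and also the likelihood ratio under each sampling model, which is where the two camps part ways.

```python
from math import comb, isclose

# Data from the quoted example: 12 tosses, 9 a's and 3 b's.
N_TOSSES, N_A, N_B = 12, 9, 3


def binomial_p_value(n_a, n_tosses, theta=0.5):
    """P(at least n_a a's) when the number of tosses is fixed in advance."""
    return sum(
        comb(n_tosses, k) * theta**k * (1 - theta) ** (n_tosses - k)
        for k in range(n_a, n_tosses + 1)
    )


def negative_binomial_p_value(n_b, n_tosses, theta=0.5):
    """P(the n_b-th b arrives on toss n_tosses or later).

    Equivalent to seeing at most n_b - 1 b's in the first n_tosses - 1 tosses.
    """
    m = n_tosses - 1
    return sum(
        comb(m, j) * (1 - theta) ** j * theta ** (m - j) for j in range(n_b)
    )


def binomial_likelihood(theta, n_a, n_b):
    return comb(n_a + n_b, n_a) * theta**n_a * (1 - theta) ** n_b


def negative_binomial_likelihood(theta, n_a, n_b):
    # The last toss must be a b; the other n_b - 1 b's fall anywhere before it.
    return comb(n_a + n_b - 1, n_b - 1) * theta**n_a * (1 - theta) ** n_b


p_binom = binomial_p_value(N_A, N_TOSSES)
p_negbin = negative_binomial_p_value(N_B, N_TOSSES)

# Likelihood ratio of an illustrative alternative (0.75) to the null (0.5).
lr_binom = binomial_likelihood(0.75, N_A, N_B) / binomial_likelihood(0.5, N_A, N_B)
lr_negbin = negative_binomial_likelihood(0.75, N_A, N_B) / negative_binomial_likelihood(0.5, N_A, N_B)

print(f"binomial p-value:          {p_binom:.4f}")
print(f"negative binomial p-value: {p_negbin:.4f}")
print(f"likelihood ratios equal:   {isclose(lr_binom, lr_negbin)}")
```

Under these assumptions the binomial p-value is 299/4096 (about 0.073) and the negative binomial p-value is 67/2048 (about 0.033), so they fall on opposite sides of 0.05. The likelihood ratio is the same under both models because the combinatorial constants cancel. That cancellation is exactly what the LP/SLP treats as decisive and what the error statistician says leaves out the sampling distribution.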

• The % of the audience who thinks one way or another is irrelevant. The difference between the Binomial and negative Binomial is generally trivial, but I am not moved by arguments that one could model what he did in a different way, and the audience can have different feelings. The sufficient statistics of these two differ, and different warranted inferences ensue, perhaps trivially different here. We care about principled statistical reasoning, and the fact there can be ambiguous cases (this isn’t one of them) does not mean I give up on the distinction.

15. Jean

Of course, it is always fun to go back to the classics (to which I was first introduced via chapter 10 of D. Mayo’s EGEK.)

“These truths [about the optional stopping effect] are usually misinterpreted to suggest that the data of such a persistent experimenter are worthless or at least need special interpretation…. The likelihood principle, however, affirms that the experimenter’s intention to persist does not change the import of his experience.” (Savage 1962, The Foundations of Statistical Inference: A Discussion, London: Methuen, p.18)

Savage continues to support this view in the discussion at the 1962 forum. After listening to Barnard, who at the forum recanted his previous position that optional stopping didn’t matter and now claimed that it does, Savage reminded him of his (Barnard’s) original position:

“The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside of his head.” (Savage 1962, The Foundations of Statistical Inference: A Discussion, London: Methuen, p.76)

These quotes can be found in the “Complete Savage Forum” at the bottom of this webpage from D. Mayo’s Philosophy of Statistics course: http://www.phil.vt.edu/dmayo/PhilStatistics/supplementary_articles.htm

• Jean: Thanks for this. I see no reason to bar ex-students from earning \$ for good quotes. There’s one thing you left out though, unless I missed it: Savage declared that he himself was uncomfortable with the “argument from intentions”. I see that on EGEK p. 347, I don’t have an exact quote…but I think I may have that slender volume even here in NYC, I’ll check. But I’m curious, and will almost surely never know the answer, as to why Savage was uncomfortable with the argument. I’m guessing it’s because of the point I had made, namely that almost all specification could be regarded as determined by one’s intentions. Or maybe even more strongly, the “intentions” thing was always a ruse: there are concrete reasons that those error probability considerations do not show up in the Bayesian inference (I’m not saying other ones might not show up, but that’s another story). Likewise there are concrete reasons that they make a difference to the error statistical inference. It’s not that we’re making inference turn on things that could have but didn’t happen, as they like to say. It’s rather that by considering things that could have but didn’t happen, we are able to ascertain something about the error probing capacities of the tool or method. I know that in my daily life I always scrutinize the particular result of a method by considering the method’s general probative capacities, and these sampling distributions (in the formal context of statistics) are analogous to what I need to know. But it has taken me a while to get really clear on this, I’m not saying it’s obvious at first. Certainly it’s more subtle than what you get on an E-R logic–but the logicists are wrong to think there’s a useful “logic” for ampliative inference.