# “Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday

27 May 1923-1 July 1976

Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interestingly, as seen in a 2018 post on Neyman, Neyman did discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.)

Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or the “researcher’s intentions”) with “error probabilities” (or the “method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account[2]. For error statisticians, this information reflects real and crucial properties of your inference procedure.
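A small simulation can make the point concrete. The sketch below is only an illustration under stated assumptions, none of which come from Birnbaum’s paper: normal data with known σ = 1, a true null μ = 0, a nominal two-sided cutoff of 1.96, and a “try and try again” rule that tests after every observation from n = 10 up to n = 100. The function names are hypothetical.

```python
import math
import random


def rejects_fixed_n(rng, n, z_crit=1.96):
    """Two-sided z-test of mu = 0 (sigma = 1 known) at a prespecified n."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / math.sqrt(n)) > z_crit


def rejects_with_optional_stopping(rng, n_min, n_max, z_crit=1.96):
    """Test after every observation from n_min on; stop at the first
    nominally significant look, or give up at n_max."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if n >= n_min and abs(total / math.sqrt(n)) > z_crit:
            return True
    return False


def simulate(n_sims=2000, n_min=10, n_max=100, seed=0):
    """Estimate how often each procedure rejects a TRUE null hypothesis."""
    rng = random.Random(seed)
    fixed = sum(rejects_fixed_n(rng, n_max) for _ in range(n_sims)) / n_sims
    optional = sum(
        rejects_with_optional_stopping(rng, n_min, n_max) for _ in range(n_sims)
    ) / n_sims
    return fixed, optional


if __name__ == "__main__":
    fixed, optional = simulate()
    print(f"fixed n = 100:          rejection rate under the null ~ {fixed:.3f}")
    print(f"stop when significant:  rejection rate under the null ~ {optional:.3f}")
```

Under these assumptions the fixed-n test rejects a true null at close to the nominal rate, while the stop-when-significant procedure rejects it far more often. Nothing about anyone’s mental state enters the code; the only thing that differs is the stopping rule, which is a property of the procedure.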

Birnbaum struggled. Why? Because he regarded controlling the probability of misleading interpretations to be essential for scientific inference, and yet he seemed to have demonstrated that the LP/SLP followed from frequentist principles! That would mean error statistical principles entailed the denial of error probabilities! For many years this was assumed to be the case, and accounts that rejected error probabilities flourished. Frequentists often admitted their approach seemed to lack what Birnbaum called a “concept of evidence”–even those who suspected there was something pretty fishy about Birnbaum’s “proof”.  I have shown the flaw in Birnbaum’s alleged demonstration of the LP/SLP (most fully in the Statistical Science issue). (It only uses logic, really, yet philosophers of science do not seem interested in it.) [3]

The Statistical Science Issue: This is the 4th Birnbaum birthday where I can point to the Statistical Science issue being out. But are textbooks making changes, or still calling this a theorem? I’ve a hunch that Birnbaum would have liked my rejoinder to discussants (Statistical Science): Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu. For those unfamiliar with the argument, at the end of this entry are slides from an entirely informal talk as well as some links from this blog. Happy Birthday Birnbaum!

[1] The Weak LP concerns a single experiment; whereas, the strong LP concerns two (or more) experiments. The weak LP is essentially just the sufficiency principle.

[2] I will give a free signed hard copy of my new “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (July 31, 2018) to each of the first 10 readers who send a fully cited and linked published example (with distinct authors, you may be one) of criticisms of frequentist methods based on arguing against the relevance of “intentions”. Include as much of the cited material as needed for a reader to grasp the general argument. Entries must be posted as a comment to this post (not Twitter), with a link to the article or portions of the article. A brief discussion of what you think of it should also be included. Judges on Elba have final say. [Write with questions.]*

[3] The argument still cries out for being translated into a symbolic logic of some sort.

Excerpts from my Rejoinder

I.  Introduction

……As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than of historical interest, these shifts provide a unique perspective on the current problem.

Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968 at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf) which takes from the Neyman-Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is Giere 1977. Before his untimely death (at 53), Birnbaum denies the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970, 1033). I return to this in part IV of this rejoinder……

IV Post-SLP foundations

Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).

The problem-area of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice. (Birnbaum 1969, 113)

Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter” (1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):

(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for H2 against H1’ with small probability (α) when H1 is true, and with much larger probability (1 – β) when H2 is true. (1977, 24)

On the basis of (Conf), Birnbaum reinterprets statistical outputs from N-P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24-26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two hypothesis setting), the spirit of (Conf), that error probabilities qualify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call severity. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).
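For readers who want to see the two probabilities in (Conf) side by side, here is a minimal sketch in Python. It is an illustration only, under assumptions that are not Birnbaum’s: a one-sided z-test on a normal mean with known σ, two point hypotheses H1: μ = μ1 and H2: μ = μ2 > μ1, and “finding strong evidence for H2 against H1” read simply as the sample mean exceeding a cutoff. The function name `error_probabilities` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def error_probabilities(mu1, mu2, sigma, n, alpha):
    """Return (cutoff, prob under H1, prob under H2) that the sample mean
    exceeds the cutoff, for a one-sided z-test with known sigma."""
    se = sigma / sqrt(n)                        # standard error of the mean
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-alpha standard normal quantile
    cutoff = mu1 + z_alpha * se
    prob_under_h1 = 1 - NormalDist(mu1, se).cdf(cutoff)   # alpha
    prob_under_h2 = 1 - NormalDist(mu2, se).cdf(cutoff)   # 1 - beta
    return cutoff, prob_under_h1, prob_under_h2


if __name__ == "__main__":
    # Illustrative numbers only: mu1 = 0, mu2 = 1, sigma = 1, n = 9, alpha = 0.05.
    cutoff, a, power = error_probabilities(0.0, 1.0, 1.0, 9, 0.05)
    print(f"cutoff = {cutoff:.3f}, alpha = {a:.3f}, 1 - beta = {power:.3f}")
```

With these illustrative numbers the first probability is small and the second is much larger, which is the pattern (Conf) requires. Both are computed before any data are seen, which is one reason the idea needs extending beyond pre-data error probabilities.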

Birnbaum’s philosophy evolved from seeking concepts of evidence in degree of support, belief, or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships—a paradigm philosophers of statistics have been slow to throw off.  The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227-239.

Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240-241.

Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242-246.

Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247-251.

Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252-253.

Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254-258.

Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259-260.

Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261-266.

Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x and y from experiments E1 and E2 (both with unknown parameter θ), have different probability models f1( . ), f2( . ), then even though f1(x; θ) = c f2(y; θ) for all θ, outcomes x and y may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP), entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].

Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
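A worked instance of an SLP violation may help. The sketch below, in Python, uses the numbers from the Wagenmakers et al. example quoted in the comments to this post: 17 questions, 13 correct, 4 errors, with the last answer an error. Two things are assumptions here because the quoted passage does not state them: the null value θ = 0.5 for the probability of a correct answer, and a two-sided p-value obtained by doubling the one-sided tail. Under those assumptions the calculation reproduces the .049 and .021 reported there. The function names are hypothetical.

```python
from math import comb

THETA_0 = 0.5            # assumed null value (not stated in the quoted passage)
CORRECT, ERRORS = 13, 4
N = CORRECT + ERRORS     # 17 questions in all


def binomial_likelihood(theta):
    """E1: ask exactly 17 questions; observe 13 correct."""
    return comb(N, CORRECT) * theta**CORRECT * (1 - theta)**ERRORS


def negative_binomial_likelihood(theta):
    """E2: ask until the 4th error; it arrives on question 17
    (3 errors somewhere in the first 16, then an error)."""
    return comb(N - 1, ERRORS - 1) * theta**CORRECT * (1 - theta)**ERRORS


def binomial_p_value():
    """Doubled tail: P(13 or more correct out of 17) under THETA_0."""
    tail = sum(
        comb(N, k) * THETA_0**k * (1 - THETA_0)**(N - k)
        for k in range(CORRECT, N + 1)
    )
    return 2 * tail


def negative_binomial_p_value():
    """Doubled tail: P(17 or more questions needed) under THETA_0,
    i.e. at most 3 errors among the first 16 questions."""
    tail = sum(
        comb(N - 1, j) * (1 - THETA_0)**j * THETA_0**(N - 1 - j)
        for j in range(ERRORS)
    )
    return 2 * tail


if __name__ == "__main__":
    for theta in (0.3, 0.5, 0.8):
        ratio = binomial_likelihood(theta) / negative_binomial_likelihood(theta)
        print(f"theta = {theta}: likelihood ratio c = {ratio:.2f}")
    print(f"binomial p-value:          {binomial_p_value():.3f}")
    print(f"negative binomial p-value: {negative_binomial_p_value():.3f}")
```

The likelihood ratio is the same constant c for every θ, so the SLP says the two outcomes carry identical evidence about θ. The p-values differ because the two experiments have different sampling distributions, not because of anything in the experimenter’s head.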

Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.

[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”

Some previous posts on this topic can be found at the following links (and by searching this blog with key words):

UPhils and responses

[ii]

Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:

“Putting the Brakes on the Breakthrough, or
‘How I used simple logic to uncover a flaw in a controversial 50 year old ‘theorem’ in statistical foundations taken as a
‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”

Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In Breakthroughs in Statistics, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: Springer-Verlag.

*Judges reserve the right to decide if the example constitutes the relevant use of “intentions” (amid a foundations of statistics criticism) in a published article. Different subsets of authors can count for distinct entries. No more than 2 entries per person. This means we need your name.

### 7 thoughts on ““Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday”

1. Hoping this is not a repost – not sure if my first post was received:

Paper:
Kruschke, J. (2011). Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison. Perspectives on psychological science, 3, 299-312.

Quote:
Unfortunately for NHST, the p value is ill-defined. The conventional NHST analysis assumes that the sample size N is fixed, and therefore repeating the experiment means generating simulated data based on the null value of the parameter over and over, with N = 47 each time. But the data do not tell us that the intention of the experimenter was to stop when N = 47. The data contain merely the information that z = 32 and N = 47 because we assume that the result of every trial is independent of other trials. The data collector may have intended to stop when the 32nd success was achieved, and it happened to take 47 trials to do that. In this case, the p value is computed by generating simulated data based on the null value of the parameter with z = 32 each time and with N varying from one sample to another. … There are many other stopping rules that could have generated the data… It is wrong to speak of “the” p value for a set of data, because any set of data has many different p values depending on the intent of the experimenter. According to NHST … we must know when the data collector intended to stop data collection, even though we also assume that the data are completely insulated from the researcher’s intention.

Opinion:
I always struggled a bit with this. On the one hand it seems obvious that we should care about intentions, and that intentions should matter for our inference. Hearing that a person found a significant result after looking at a single variable, declared before data collection, is much more impressive than finding the exact same significant result after looking at 200 other variables post-hoc. So here clearly intentions are important, and are needed for valid inference. On the other hand, some counter-examples make this sound downright silly. Imagine a researcher who has the intention to sample N = 40. However, the equipment breaks down after N = 20. Should the sampling distribution now be constructed as if N = 20 was the fixed N, or should the sampling distribution be constructed taking into account that there is a probability of the equipment breaking down and collecting fewer than the intended N = 40? So, while I think I tend to generally agree that intentions matter, there are some cases where it seems silly. It’s often exactly these cases that are presented in papers that try to argue against the use of intentions.

• I reject Kruschke’s remarks. It’s not that there are different p-values depending on “intentions” locked in someone’s head; rather, the test has different capabilities for error detection depending on what the tester actually did. Changing the stopping rule changes the sampling distribution.
As for the counterintuitive examples, such as taking account of the possibility the instrument broke down or the like, there is NO error statistical justification for doing so! The probative value of the test is not influenced in the least. Please check my blog and published papers for more on this.

2. Repost, previous one held in moderation for 10+ days.

Hello Dr.Mayo,

The examples for arguing against the relevance of intentions as a part of an argument against frequentist inference as a whole are:

#1 Wagenmakers EJ., Lee M., Lodewyckx T., Iverson G.J. (2008) “Bayesian Versus Frequentist Inference.” In: Hoijtink H., Klugkist I., Boelen P.A. (eds) Bayesian Evaluation of Informative Hypotheses. Statistics for Social and Behavioral Sciences. Springer, New York, NY
DOI https://doi.org/10.1007/978-0-387-09612-4_9 Print ISBN 978-0-387-09611-7 Online ISBN 978-0-387-09612-4

“2.3 Frequentist Inference Depends on the Intention With Which
the Data Were Collected

Because p-values are calculated over the sample space, changes in the sample space can greatly affect the p-value. For instance, assume that a participant answers a series of 17 test questions of equal difficulty; 13 answers are correct, 4 are incorrect, and the last question was answered incorrectly. Under the standard binomial sampling plan (i.e., “ask 17 questions”), the two-sided p-value is .049. The data are, however, also consistent with a negative binomial sampling plan (i.e., “keep on asking questions until the fourth error occurs”). Under this alternative sampling plan, the experiment could have been finished after four questions, or after a million. For this sampling plan, the p-value is .021.

What this simple example shows is that the intention of the researcher affects
statistical inference – the data are consistent with both sampling plans, yet the p-value differs. Berger and Wolpert ([14, page 30-33]) discuss the resulting counterintuitive consequences through a story involving a naive scientist and a frequentist statistician.

(the example is interesting as it deals with both researcher intent and external factors, such as a grant being extended or not, but I thought it is too long to include. The chapter is available for free online at http://www.ejwagenmakers.com/2008/BayesFreqBook.pdf )

I think this is a classical case that tries to portray relying on the sample space as absurd, as if it is somehow subjective, locked into the scientists’ mind and therefore cannot possibly be a legitimate consideration. It goes hand-in-hand with the argument that observed data is the only thing that matters, while leaving out that an integral part of what “data” is is the method through which the numbers were obtained.

#2 “An Introduction to Bayesian Hypothesis Testing for Management Research”, Sandra Andraszewicz, Benjamin Scheibehenne, Jörg Rieskamp, Raoul Grasman, Josine Verhagen, and Eric-Jan Wagenmakers, Journal of Management, Vol 41, Issue 2, pp. 521 – 543, December 10, 2014, https://doi.org/10.1177/0149206314560412

A host of “criticism” against p-values here, p-values depending on intent among them:

“Unfortunately, p values have a number of serious logical and statistical limitations (e.g.,
Wagenmakers, 2007). In particular, p values cannot quantify evidence in favor of a null
hypothesis (e.g., Gallistel, 2009; Rouder, Speckman, Sun, Morey, & Iverson, 2009), they
overstate the evidence against the null hypothesis (e.g., Berger & Delampady, 1987; Edwards,
Lindman, & Savage, 1963; Johnson, 2013; Sellke, Bayarri, & Berger, 2001), and they depend
on the sampling plan, that is, they depend on the intention with which the data were collected;
consequently, identical data may yield different p values (Berger & Wolpert, 1988;
Lindley, 1993; a concrete example is given below).

Bayesian hypothesis testing using Bayes factors provides a useful alternative to overcome
these problems (e.g., Jeffreys, 1961; Kass & Raftery, 1995). Bayes factors quantify the support
that the data provide for one hypothesis over another; thus, they allow researchers to
quantify evidence for any hypothesis (including the null) and monitor this evidence as the
data accumulate. In Bayesian inference, the intention with which the data are collected is
irrelevant (Rouder, 2014). As will be apparent later, inference using p values can differ dramatically
from inference using Bayes factors. Our main suggestion is that such differences
should be acknowledged rather than ignored.”

Then the authors go into more detail giving an example where p-values are inferior to Bayes factors due to issues related to reflecting researcher intent, in particular, in a continuous monitoring scenario:

“An additional advantage is that, in contrast to the p value, the Bayes factor is not affected
by the sampling plan, or the intention with which the data were collected”

[…]

“For Bayes factors, in contrast, the sampling plan is irrelevant to inference (as dictated by
the stopping rule principle; Berger & Wolpert, 1988; Rouder, 2014). This means that researchers
can monitor the evidence (i.e., the Bayes factor) as the data come in and terminate data
collection whenever they like, such as when the evidence is deemed sufficiently compelling
or when the researcher has run out of resources.”

Here we see the simple mistake of treating the fact that Bayes factors are not altered to accommodate the data known about the sampling procedure as a positive, while treating the need to alter the p-value calculation to accommodate that data as a negative.

Best regards,
Georgi

• The judges have accepted your submission. Please attach an appropriately sized Word document with your address, large enough to use as a label on an envelope. However, I do not know when the books will be sent from CUP, and it could be as much as a couple of weeks. Congratulations.

3. Steven McKinney

I will leave it to the good citizens of Elba to decide if this is a proper published example. It appears to be course notes for a course presented at a conference.

“Practical Bayesian Data Analysis from a Former Frequentist”
Frank E Harrell Jr
Division of Biostatistics and Epidemiology
Department of Health Evaluation Sciences
University of Virginia School of Medicine

MASTERING STATISTICAL ISSUES IN DRUG DEVELOPMENT
HENRY STEWART CONFERENCE STUDIES
15-16 MAY 2000

Pages 24-25 of document:

> “Much controversy about need for adjusting for sequential testing. Frequentist approach is complicated.”

Well we can’t have that. Heaven forfend that there are any complexities. By definition I suppose, Bayesian approaches are not complicated.

> “Example: 5 looks at data as trial proceeds. Looks had no effect, trial proceeded to end. Usual P = 0.04, need to adjust upwards for having looked”

How do looks have no effect? If looks have no effect, why do we look at all?

Of course looks have an effect. That’s precisely why many statisticians have worked on sequential methods over many years.

> “Two studies with identical experiments and data but with investigators with different intentions! one might claim ‘significance’, the other not (Berry10) Example: one investigator may treat an interim analysis as a final analysis, another may intend to wait.”

There is nothing wrong with two different investigators with differing intentions deriving differing conclusions from the same body of data. Analysis findings are context dependent.

> ” It gets worse — need to adjust ‘final’ point estimates for having done interim analyses”

I can understand adjusting final confidence interval endpoints for having done interim analyses, but I have yet to come across the scenario that a point estimate needed to be adjusted. I’m happy to be informed here of point estimate adjustment procedures that I have not yet heard of.

> “Freedman et al.36 give example where such adjustment yields 0.95 CI that includes 0.0 even for data indicating that study should be stopped at the first interim analysis”

> “As frequentist methods use intentions (e.g., stopping rule), they are not fully objective8. If the investigator died after reporting the data but before reporting the design of the experiment, it would be impossible to calculate a P–value or other standard measures of evidence.”

Of course any reasonable investigator reports the design of the experiment before collecting data. That’s why we have e.g. the clinicaltrials.gov site – so designs can be reported before the data is collected and the investigator dies. How this is not objective mystifies me.

> “Since P–values are probabilities of obtaining a result as or more extreme than the study’s result under repeated experimentation, frequentists interpret results by inferring ‘what would have occurred following results that were not observed at analyses that were never performed’ 29.”

Science is about studying repeated phenomena. We infer many conditions concerning results not observed at analyses never performed. We infer the results of a coin toss without observing all coins and all tosses of those coins. Of course, we could all be mightily surprised to find that after tomorrow, all coin tosses land heads up, and our old binomial coin toss examples, about fairness and 50/50 outcomes, no longer are of any use. If the sun comes up that is. But it hasn’t happened yet, and today the sun shines here in Vancouver, which is a bit odd but not impossible. These phenomena happen repeatedly, even though we have yet to observe them all, which is why frequentist methods have proven so useful in the scientific study of natural phenomena. So I intend to continue interpreting results using the useful tool of inferring what might occur following results that were not observed at analyses that were never performed, given results observed, when I perform analyses.

4. Thank you all for your submissions. The error statistical judges and I will study them right after I complete the final review of the revised version of my proofs. The deadline for your submissions is 1 year, so there’s no rush.

5. I’m ready to make good on my promise to give free books, now that they are available (or will be any day), but I should have been clearer about one thing. I deny it’s really a matter of intentions. The various selection effects are real things that actually alter the probative capacity of a test. Thus, in the view I hope to convince you of, it’s not a matter of intentions and it’s wrongheaded to describe it as such. So for future applicants, you might say a bit as to why framing the issue as one of intentions is problematic. That is why, in the “21 word solution” to nonreplication, Simmons, Nelson and Simonsohn require the researcher to stipulate their stopping plan at the outset.