Today is Allan Birnbaum’s birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” reprinted in Breakthroughs in Statistics (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP), also called the strong Likelihood Principle (SLP) to distinguish it from the weak LP. According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interestingly, as seen in a 2018 post on Neyman, Neyman did discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.)
Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or “the researcher’s intentions”) with “error probabilities” (or “the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account. For error statisticians, this information reflects real and crucial properties of your inference procedure.
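The error-statistical point in this paragraph can be made concrete with a small simulation. This is my own illustrative sketch (the test, the peeking schedule, and all the numbers are assumptions of the demo, not from the post): it checks how often an experimenter who “tries and tries again” on a fair coin, peeking every 10 flips and stopping as soon as p < 0.05, ends up declaring significance.

```python
import math
import random

def two_sided_p(successes, n, p0=0.5):
    """Normal-approximation two-sided p-value for a binomial proportion."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (successes / n - p0) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peek_and_stop(max_n=500, peek_every=10, alpha=0.05):
    """Flip a fair coin (so H0 is true), test after every `peek_every`
    flips, and stop with a 'significant' result the first time p < alpha."""
    successes = 0
    for i in range(1, max_n + 1):
        successes += random.random() < 0.5
        if i % peek_every == 0 and two_sided_p(successes, i) < alpha:
            return True   # a misleading rejection of a true H0
    return False

random.seed(1)
trials = 4000
rejections = sum(peek_and_stop() for _ in range(trials))
print(f"nominal level: 0.05, actual false-rejection rate: {rejections / trials:.3f}")
```

With up to 50 looks, the actual false-rejection rate comes out at several times the nominal 5%. The likelihood function at the moment of stopping, by contrast, carries no trace of the peeking schedule, which is just the information the error statistician insists on keeping.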
Birnbaum struggled. Why? Because he regarded controlling the probability of misleading interpretations to be essential for scientific inference, and yet he seemed to have demonstrated that the LP/SLP followed from frequentist principles! That would mean error statistical principles entailed the denial of error probabilities! For many years this was assumed to be the case, and accounts that rejected error probabilities flourished. Frequentists often admitted their approach seemed to lack what Birnbaum called a “concept of evidence”–even those who suspected there was something pretty fishy about Birnbaum’s “proof”. I have shown the flaw in Birnbaum’s alleged demonstration of the LP/SLP (most fully in the Statistical Science issue). (It only uses logic, really, yet philosophers of science do not seem interested in it.) 
The Statistical Science Issue: This is the 4th Birnbaum birthday where I can point to the Statistical Science issue being out. But are textbooks out there making changes, or are they still calling this a theorem? I’ve a hunch that Birnbaum would have liked my rejoinder to discussants (Statistical Science): Bjørnstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu. For those unfamiliar with the argument, at the end of this entry are slides from an entirely informal talk as well as some links from this blog. Happy Birthday, Birnbaum!
The Weak LP concerns a single experiment, whereas the strong LP concerns two (or more) experiments. The weak LP is essentially just the sufficiency principle.
I will give a free signed hard copy of my new “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (July 31, 2018) to each of the first 10 readers who send a fully cited and linked published example (with distinct authors; you may be one) of criticisms of frequentist methods based on arguing against the relevance of “intentions”. Include as much of the cited material as needed for a reader to grasp the general argument. Entries must be posted as a comment to this post (not Twitter), with a link to the article or portions of the article. A brief discussion of what you think of it should also be included. Judges on Elba have final say. [Write with questions.]*
The argument still cries out to be translated into a symbolic logic of some sort.
Excerpts from my Rejoinder
……As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than being of historical interest, these shifts provide a unique perspective on the current problem.
Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968, at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly, leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf), which takes from the Neyman-Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is given in Giere 1977. Before his untimely death (at 53), Birnbaum denied that the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970a, 1033). I return to this in part IV of this rejoinder……
IV Post-SLP foundations
Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).
The problem-area of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice. (Birnbaum 1969, 113)
Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter” (1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):
(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for H2 against H1’ with small probability (α) when H1 is true, and with much larger probability (1 – β) when H2 is true. (1977, 24)
On the basis of (Conf), Birnbaum reinterprets statistical outputs from N-P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24-26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two-hypothesis setting), the spirit of (Conf), that error probabilities qualify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call severity. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).
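One concrete reading of (Conf) can be sketched numerically. The hypotheses, cutoff, and sample size below are my own illustrative assumptions, not Birnbaum’s: for a test of H1: μ = 0 against H2: μ = 1 with known σ = 1 and n = 25, the two error probabilities that (Conf) says must qualify a report of “strong evidence for H2” are computed directly from the sampling distribution of the sample mean.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Illustrative two-point testing setup (my assumption, not from the text)
n, mu1, mu2, sigma = 25, 0.0, 1.0, 1.0
se = sigma / sqrt(n)              # standard error of the sample mean
cutoff = mu1 + 1.645 * se         # declare 'evidence for H2' when xbar > cutoff

alpha = 1 - norm_cdf((cutoff - mu1) / se)   # P(declare for H2 | H1 true)
power = 1 - norm_cdf((cutoff - mu2) / se)   # P(declare for H2 | H2 true) = 1 - beta

print(f"alpha = {alpha:.4f}, 1 - beta = {power:.4f}")
```

On (Conf)’s criterion, such a declaration counts as strong evidence for H2 against H1 only because α is small while 1 − β is large; shrink n and the very same declaration becomes weak or worthless evidence.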
Birnbaum’s philosophy evolved from seeking concepts of evidence in degree of support, belief, or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships—a paradigm philosophers of statistics have been slow to throw off. The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!
“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).
Link to complete discussion:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.
Links to individual papers:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227-239.
Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240-241.
Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242-246.
Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247-251.
Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252-253.
Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254-258.
Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259-260.
Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261-266.
Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x∗ and y∗ from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1( . ), f2( . ), then even though f1(x∗; θ) = cf2(y∗; θ) for all θ, outcomes x∗ and y∗ may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].
Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
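The SLP setup in the abstract can be seen numerically with the standard textbook pair of experiments (my own illustration, using the usual Lindley–Phillips numbers, not an example from the paper): E1 is binomial sampling with n = 12 fixed, observing x∗ = 9 successes; E2 is negative binomial sampling that stops at the 3rd failure, observing y∗ = 12 trials (again 9 successes). The likelihoods are then proportional with a constant c that does not depend on θ.

```python
from math import comb

def binom_lik(theta, n=12, x=9):
    """E1: probability of x successes in n fixed Bernoulli trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def negbinom_lik(theta, r=3, x=9):
    """E2: probability that the r-th failure arrives on trial x + r,
    i.e. x successes occur before the r-th failure."""
    return comb(x + r - 1, x) * theta**x * (1 - theta)**r

for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"theta = {theta}: f1/f2 = {binom_lik(theta) / negbinom_lik(theta):.4f}")
```

The ratio is comb(12, 9)/comb(11, 9) = 4.0 at every θ, so f1(x∗; θ) = cf2(y∗; θ) for all θ; yet, as the abstract notes, sampling theory may assign the two outcomes different implications for θ, which is precisely an SLP violation.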
Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.
[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”
Some previous posts on this topic can be found at the following links (and by searching this blog with key words):
- Midnight with Birnbaum (Happy New Year).
- New Version: On the Birnbaum argument for the SLP: Slides for my JSM talk.
- Don’t Birnbaumize that experiment my friend*–updated reblog.
- Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976 .
- LSE seminar
- A. Birnbaum: Statistical Methods in Scientific Inference
- ReBlogging the Likelihood Principle #2: Solitary Fishing: SLP Violations
- Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle.
UPhils and responses
- U-PHIL: Gandenberger & Hennig : Blogging Birnbaum’s Proof
- U-Phil: Mayo’s response to Hennig and Gandenberger
- Mark Chang (now) gets it right about circularity
- U-Phil: Ton o’ Bricks
- Blogging (flogging?) the SLP: Response to Reply- Xi’an Robert
- U-Phil: J. A. Miller: Blogging the SLP
- Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.
Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:
“Putting the Brakes on the Breakthrough, or
‘How I used simple logic to uncover a flaw in a controversial 50-year-old “theorem” in statistical foundations taken as a
“breakthrough” in favor of Bayesian vs frequentist error statistics’”
Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In Breakthroughs in Statistics, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: Springer-Verlag.
*Judges reserve the right to decide if the example constitutes the relevant use of “intentions” (amid a foundations of statistics criticism) in a published article. Different subsets of authors can count for distinct entries. No more than 2 entries per person. This means we need your name.
Hoping this is not a repost – not sure if my first post was received:
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.
Unfortunately for NHST, the p value is ill-defined. The conventional NHST analysis assumes that the sample size N is fixed, and therefore repeating the experiment means generating simulated data based on the null value of the parameter over and over, with N = 47 each time. But the data do not tell us that the intention of the experimenter was to stop when N = 47. The data contain merely the information that z = 32 and N = 47, because we assume that the result of every trial is independent of other trials. The data collector may have intended to stop when the 32nd success was achieved, and it happened to take 47 trials to do that. In this case, the p value is computed by generating simulated data based on the null value of the parameter with z = 32 each time and with N varying from one sample to another. … There are many other stopping rules that could have generated the data… It is wrong to speak of “the” p value for a set of data, because any set of data has many different p values depending on the intent of the experimenter. According to NHST … we must know when the data collector intended to stop data collection, even though we also assume that the data are completely insulated from the researcher’s intention.
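The contrast Kruschke describes is easy to compute. Below is my own sketch using the classic 9-successes-in-12-trials numbers rather than his z = 32, N = 47 (the larger case works the same way, just with bigger sums); H0 is θ = 0.5 and both p values are one-sided.

```python
from math import comb

def p_fixed_n(x, n, theta0=0.5):
    """Fixed-N intent: p = P(X >= x) under Binomial(n, theta0)."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(x, n + 1))

def p_stop_at_rth_failure(n, r, theta0=0.5):
    """Stop-at-r-th-failure intent: p = P(N >= n), i.e. at most r - 1
    failures occur in the first n - 1 trials."""
    return sum(comb(n - 1, j) * (1 - theta0)**j * theta0**(n - 1 - j)
               for j in range(r))

print(f"fixed-N p value:             {p_fixed_n(9, 12):.4f}")             # 299/4096, ~0.0730
print(f"stop-at-3rd-failure p value: {p_stop_at_rth_failure(12, 3):.4f}")  # 67/2048, ~0.0327
```

The same 9-successes-and-3-failures data cross the 0.05 line under one stopping intention but not the other. That is the “many p values for one data set” point of the quoted passage, and exactly the information the LP/SLP declares irrelevant.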
I always struggled a bit with this. On the one hand, it seems obvious that we should care about intentions, and that intentions should matter for our inference. Hearing that a person found a significant result on a single variable, declared before data collection, is much more impressive than finding the exact same significant result after looking at 200 other variables post hoc. So here, clearly, intentions are important and are needed for valid inference. On the other hand, some counter-examples make this sound downright silly. Imagine a researcher who intends to sample N = 40, but the equipment breaks down after N = 20. Should the sampling distribution now be constructed as if N = 20 were the fixed N, or should it be constructed taking into account the probability of the equipment breaking down and collecting fewer than the intended N = 40? So, while I tend to generally agree that intentions matter, there are some cases where it seems silly. It’s often exactly these cases that are presented in papers that try to argue against the use of intentions.