Just as in the past 7 years since I’ve been blogging, I revisit that spot in the road at 9p.m., just outside the Elbar Room, look to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wonder if the car will come for me this year, as I wait out in the cold, now that Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (STINT) is out. STINT doesn’t rehearse the argument from my Birnbaum article, but there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). 2018 was the 60th birthday of Cox’s “weighing machine” example, which was the basis of Birnbaum’s attempted proof. Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2019? Anyway, the cab is finally here…the rest is live. Happy New Year!
You know how in that Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Wolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve
2011 2012, 2013, 2014, 2015, 2016, 2017, 2018) and is taken back sixty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a number of 2018 updates.
ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be writing on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)
BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (STINT, 2018, CUP).
ERROR STATISTICIAN: You’ve read my new book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WLP. I don’t rehearse my argument there, but I first found it in 2006.[ii] Sorry,…I know it’s famous…
BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WLP and sufficiency (although less than S is needed).
ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.
BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!
ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.
BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.
ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.
BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:
(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)
ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.
BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.
ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.
BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB- experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.
(They fill their glasses again)
ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, optional stopping experiment E”, with an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?
BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.
ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n) the result is reported as x’, as if it came from E’ (fixed sample size), and as a result of this strange type of a mixture experiment.
BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100th trial). That’s how I sometimes formulate a BB-experiment.
ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report it’s likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.
BIRNBAUM: Well, but since the BB- experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP which frequentists accept. Yes? (The weak LP is the same as the sufficiency principle).
ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB- experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?
BIRNBAUM: I don’t think anyone has ever called it that.
ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?
BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2
Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).
ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?
BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB- experiment.
My this drink is sour!
ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.
BIRNBAUM: Perhaps you’re in want of a gene; never mind.
I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).
ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.
BIRNBAUM: Yes, the BB-experiment computes the P-value in an unconditional manner: it takes the convex combination over the 2 ways the result could have come about.
ERROR STATISTICAL PHILOSOPHER: this is just a matter of your definitions, it is an analytical or mathematical result, so long as we grant being within your BB experiment.
BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this, it is just a matter of mathematical equivalence.
By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.
ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t how I’m forced to infer the strong LP which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also we should come back to the “other cases” at some point….)
BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!
BIRNBAUM: So far all of this was step (1).
ERROR STATISTICAL PHILOSOPHER: : Oy, what is step 2?
BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.
This gives us premise (2a):
(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB- experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in STINT p. 44 (imagining the experimenter keeps taking 10 more).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?
BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.
(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then
x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.
BIRNBAUM: Yes. There was no need to repeat the whole spiel.
ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course,all of this assumes the model is correct or adequate to begin with.
BIRNBAUM: Yes, the SLP is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?
ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.
BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?
ERROR STATISTICAL PHILOSOPHER: Well the WCP is defined for actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas, you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP? I don’t know what the sample size will be ahead of time.
BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice
(1), (2a) and (2b) yield the strong LP!
Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).
ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).
BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?
(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)
ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:
Step 1 requires us to analyze results in accordance with a BB- experiment. If we do so, true enough we get:
premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):
That is because in either case, the p-value would be (p’ + p”)/2
Step 2 now insists that we should NOT calculate evidential import as if we were in a BB- experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:
premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.
premise (2b): outcome x’ (within in a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.
If (1) is true, then (2a) and (2b) must be false!
If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:
The average p-value (p’ + p”)/2 = p’ which is false.
Likewise if (1) is true, then (2b) is asserting:
the average p-value (p’ + p”)/2 = p” which is false
Alternatively, we can see what goes wrong by realizing:
If (2a) and (2b) are true, then premise (1) must be false.
In short your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB- experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.
I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).
BIRNBAUM: Yet some people still think it is a breakthrough (in favor of Bayesianism).
ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps, and lack of a clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science? The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence, going in 2 directions. Slides from a presentation may be found on this blogpost.
BIRNBAUM: Yes I have seen your 2014 paper, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations! I’ve also seen the slides on your blog.
ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I must get your answer to a question before you leave this year.
Sudden interruption by the waiter
WAITER: Who gets the tab?
BIRNBAUM: I do. To Elbar Grease! And to your new book SIST!
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease! To finally finishing SIST in 2018! Happy New Year!
ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962)paper, you seemed to agree with Pratt that WCP can’t do the job you intend.
BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)
ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?
WAITER: We’re closing now; shall I call you a taxicab?
ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?
MANAGER: We’re closing now; I’m sorry you must leave.
ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….
Large group of people bustle past.
Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)
Link to complete discussion:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder).Statistical Science 29 (2014), no. 2, 227-266.
[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as historical background papers may be found in my last blogpost.
[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.