Midnight With Birnbaum: Happy New Year 2026!

.

Anyone here remember that old Woody Allen movie, “Midnight in Paris,” where the main character (I forget who plays it, I saw it on a plane), a writer finishing a novel, steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf?  (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, ever since I began this blog in 2011, I imagine being picked up in a mysterious taxi at midnight on New Year’s Eve, and lo and behold, find myself in 1960s New York City, in the company of Allan Birnbaum, who is looking deeply contemplative, perhaps studying his 1962 paper…Birnbaum reveals some new and surprising twists this year! [i]

(The pic on the left is the only blurry image I have of the club I’m taken to.) It has been a decade since I published my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), which includes commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjørnstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. Not only does the (Strong) Likelihood Principle (LP or SLP) remain at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general, but a decade after my 2014 paper, it is more central than ever–even if it is often unrecognized.

OUR EXCHANGE:

ERROR STATISTICAL PHILOSOPHER: It’s wonderful to meet you, Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on the philosophical foundations of statistics. I happen to have published on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)

BIRNBAUM: Ultimately, you know, I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this; I’ve read it in your book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP).

ERROR STATISTICAL PHILOSOPHER: You’ve read my book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found the problem in 2006, when I was writing something on “conditioning” with David Cox. [ii] Sorry…I know it’s famous…

BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICAL PHILOSOPHER: Well, I show that no contradiction follows from holding WCP and S, while denying the LP.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!

ERROR STATISTICAL PHILOSOPHER:  It is a great drink, I must admit that: I love lemons.

BIRNBAUM: OK.  (A waiter brings a bottle, they each pour a glass and resume talking).  Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.

ERROR STATISTICAL PHILOSOPHER:  I really don’t mind paying for the bottle.

BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be a 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100.  Do you agree that:

(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)

ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively).  The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size.  For n = 100, data x’ yields p’= ~.05; while p”  is ~.3.  Clearly, p’ is not equal to p”, I don’t see how you can make them equal.

BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100.  You admit, do you not, that this outcome could have occurred as a result of a different experiment?  It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100. 

ERROR STATISTICAL PHILOSOPHER:  Well, that is not how x” was obtained, but ok, it could have occurred that way.

BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB-experiment, if the outcome from the experiment you actually performed has a proportional likelihood to an outcome in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.

(They fill their glasses again)

ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, from optional stopping experiment E”, has an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” had been made by flipping a fair coin (which decides between E’ and E”)?

BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have  an “LP pair” in the experiment you did not perform, E”.  Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.

ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n), the result is reported as x”, as if it came from E” (optional stopping), and as a result of this strange type of a mixture experiment.

BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100th trial).  That’s how I sometimes formulate a BB-experiment.

ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.

BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP which frequentists accept.  Yes?  (The weak LP is the same as the sufficiency principle).

ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB-experiment?  Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”.  How do I calculate the p-value within a Birnbaumized experiment?

BIRNBAUM: I don’t think anyone has ever called it that.

ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not.  So how do I calculate the p-value within a BB-experiment?

BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2

Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).
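
[A numerical aside, not part of the dialogue: below is a minimal Monte Carlo sketch of the two p-values and their “Birnbaumized” average. The peek-after-every-observation stopping scheme, the seed, and the exact 2-standard-deviation cutoff are illustrative assumptions, not taken from the post.]

```python
# Hedged sketch: p' from the fixed-n experiment, p'' from the optional stopping
# experiment (checking after each observation up to n = 100), and the BB average.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2026)
n, z_crit = 100, 2.0                      # sample size; "2-standard-deviation" cutoff

# p': fixed sample size experiment E', two-sided p-value of a 2-SD difference
p_fixed = 2 * norm.sf(z_crit)             # ~ 0.046

# p'': chance under the null of reaching a 2-SD difference at ANY k <= 100
x = rng.standard_normal((20_000, n))
z = np.abs(np.cumsum(x, axis=1) / np.sqrt(np.arange(1, n + 1)))
p_stop = np.mean((z >= z_crit).any(axis=1))   # roughly 0.3-0.4

print(f"p'  (fixed n=100)       ~ {p_fixed:.3f}")
print(f"p'' (optional stopping) ~ {p_stop:.3f}")
print(f"BB average (p'+p'')/2   ~ {(p_fixed + p_stop) / 2:.3f}")
```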

ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report  the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?

BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment.

My, this drink is sour!

ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.

BIRNBAUM: Perhaps you’re in want of a gene; never mind.

I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment.  If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to  x*).  This is premise (1).

ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.

BIRNBAUM: Yes, the BB-experiment computes the P-value in an unconditional manner: it takes the convex combination over the 2 ways the result could have come about. 

ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions; it is an analytical or mathematical result, so long as we grant being within your BB experiment.

BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to sufficiency, it is just a matter of mathematical equivalence.

By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair.  In other cases, where there is no LP pair, you just report things as usual.

ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0).  (Also we should come back to the “other cases” at some point….)

BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”

ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!

BIRNBAUM:  So far all of this was step (1).

ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?

BIRNBAUM:  STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2  as you were instructed to do in the BB experiment.

This gives us premise (2a):

(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB-experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in SIST p. 44 (imagining the experimenter keeps taking 10 more).

ERROR STATISTICAL PHILOSOPHER:  So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?

BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.

(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then

x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment); the report should instead use the sampling distribution of the fixed sample size test E’, yielding p’ (~.05).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.

BIRNBAUM: Yes.  There was no need to repeat the whole spiel.

ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.

BIRNBAUM: Yes, the LP (or SLP, to indicate it’s the strong LP) is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?

ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.

BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?

ERROR STATISTICAL PHILOSOPHER: Well, the WCP originally refers to actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP. I don’t know what the sample size will be ahead of time.

BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need.  Notice

(1), (2a) and (2b) yield the strong LP!

Outcome x” from E” (optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).

ERROR STATISTICAL PHILOSOPHER:  Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g.,  .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).

BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?

(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)

ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument.  The one I find most satisfactory is in Mayo (2014). But, given we’ve been partying, here’s a very simple one. What is required for STEP 1 to hold is the denial of what’s needed for STEP 2 to hold:

Step 1 requires us to analyze results in accordance with a BB-experiment.  If we do so, true enough, we get:

premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment).

That is because in either case, the p-value would be (p’ + p”)/2.

Step 2 now insists that we should NOT calculate evidential import as if we were in a BB-experiment.  Instead we should consider the experiment from which the data actually came, E’ or E”:

premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.

premise (2b): outcome x’ (in a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.

If (1) is true, then (2a) and (2b) must be false!

If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:

The average p-value (p’ + p”)/2 = p”, which is false.

Likewise, if (1) is true, then (2b) is asserting:

the average p-value (p’ + p”)/2 = p’, which is false.

Alternatively, we can see what goes wrong by realizing:

If (2a) and (2b) are true, then premise (1) must be false.

In short, your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB-experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.

I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid.  In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).

BIRNBAUM: Yet some people still think it is a breakthrough. I never agreed to go as far as Jimmy Savage wanted me to, namely, to be a Bayesian….

ERROR STATISTICAL PHILOSOPHER: I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science?  The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence, going in two directions. Slides from a presentation may be found on this blogpost.

BIRNBAUM: Yes, the “monster of the LP” arises from viewing WCP as an equivalence, instead of going in one direction (from mixtures to the known result).

ERROR STATISTICAL PHILOSOPHER: In my 2014 paper (unlike my earlier treatments) I too construe WCP as giving an “equivalence” but there is an equivocation that invalidates the purported move to the LP.

On the one hand, it’s true that if z is known (and known for example to have come from optional stopping), it’s irrelevant that it could have resulted from either fixed sample testing or optional stopping.

But it does not follow that if z is known, it’s irrelevant whether it resulted from fixed sample testing or optional stopping. It’s the slippery slide into this second statement–which surely sounds the same as the first–that makes your argument such a brain buster. (Mayo 2014)

BIRNBAUM: Yes I have seen your 2014 paper! Your Rejoinder to some of the critics is gutsy, to say the least. I’ve also seen the slides on your blog.

ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! I haven’t kept it up that much lately; blogs have fallen out of fashion.

BIRNBAUM: As has inferential statistics it seems–it’s all AI/ML. But I have to admit that CHAT GPT illuminates at least part of your argument as to why my reasoning was flawed.

ERROR STATISTICAL PHILOSOPHER: I never thought to check CHAT GPT on my paper, that’s amazing.

BIRNBAUM: Here is what I found on the Chatbot:

CHAT GPT

Birnbaum’s Argument and the Likelihood Principle

In his 1962 paper, Birnbaum argued that if frequentists accept two principles—sufficiency and weak conditionality—they are logically compelled to accept the likelihood principle. The likelihood principle states that all the evidence in data is contained in the likelihood function, meaning that the sampling distribution (and hence frequentist error probabilities) is irrelevant to evidential assessment….

Error Statistician’s Dilemma

If Birnbaum’s argument is correct, then frequentist methods (which rely on error probabilities) would be rendered irrelevant for assessing evidence. This would make it difficult for frequentists to defend their approach as coherent, particularly in the face of Bayesian methods that naturally adhere to the likelihood principle.

However, Deborah Mayo, in her 2014 work, critiques Birnbaum’s argument, exposing a logical flaw in his alleged proof. 

BIRNBAUM: The bot does not get your argument right. The whole experience  has encouraged me to write the first draft of a completely revised paper, reflecting a large advance in my thinking on this. It’s not quite ready to share….

ERROR STATISTICAL PHILOSOPHER: Wow! I’d love to read it…have you identified the problem? In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend. I just want to know, and won’t share your answer with anyone….

(She notices Birnbaum is holding a paper on long legal-sized yellow sheets filled with tiny hand-written comments, covering both sides.)

Sudden interruption by the waiter:

WAITER: Who gets the tab? 

BIRNBAUM: I do. To Elbar Grease!  To Severe Testing!
Happy New Year!

BIRNBAUM (looking wistful): Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)

ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question,…you did uncover the flaw in your argument, yes?

WAITER: We’re closing now; shall I call a Taxi?

BIRNBAUM: Yes, yes!

ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi? 

MANAGER: We’re closing now; I’m sorry you must leave.

ERROR STATISTICAL PHILOSOPHER: We’re leaving; I just need him to clarify his answer….

BIRNBAUM: I predict that 2026 will be the year that people will finally take seriously your paper from a decade ago (30 years from your Lakatos Prize)!

ERROR STATISTICAL PHILOSOPHER: I’ll drink to that!

Suddenly a large group of people bustle past the manager…it’s all chaos.

Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)


Link to complete discussion: 

Mayo, Deborah G. “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.


[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as classic background papers may be found in this blogpost. A link to slides and video of a very introductory presentation of my argument from the 2021 Phil Stat Forum is here.

January 7: “Putting the Brakes on the Breakthrough: On the Birnbaum Argument for the Strong Likelihood Principle” (D.Mayo)

[ii] In 2023 I wrote a paper on Cox’s statistical philosophy. Sadly, he died in 2022. (The first David R. Cox Foundations of Statistics Award, initially given by the ASA every other year, was awarded to Nancy Reid at the JSM 2023. The second went to Phil Dawid. The Award is now to be given yearly, thanks to the contributions of Friends of David Cox (on this blog!))

Categories: Birnbaum, CHAT GPT, Likelihood Principle, Sir David Cox | Leave a comment

For those who want to binge read the (Strong) Likelihood Principle in 2026

.

David Cox’s famous “weighing machine” example from my last post is thought to have caused “a subtle earthquake” in foundations of statistics. It’s been 11 years since I published my Statistical Science article on this, Mayo (2014), which includes several commentaries, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2026 is the year that people will return to it again. It’s at least touched upon in Roderick Little’s new book (pic below). This post gives some background, and collects the essential links that you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity.

What’s it all about? An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E1 and E2 with different probability models f1, f2, but with the same unknown parameter θ, if outcomes x* and y* (from E1 and E2 respectively) determine the same (i.e., proportional) likelihood function (f1(x*; θ) = cf2(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

Violation of SLP:

A violation of the SLP occurs whenever outcomes x* and y* from experiments E1 and E2, with different probability models f1, f2 but the same unknown parameter θ, satisfy f1(x*; θ) = cf2(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.

For an example of a SLP violation, E1 might be sampling from a Normal distribution with a fixed sample size n, and E2 the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).

The SLP tells us  (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E1, where n was fixed, say, at 100, and experiment E2 where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
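
A second, textbook illustration of an SLP violation (not the example above, and offered only as a hedged sketch with illustrative numbers) is the binomial versus negative binomial case: observing 3 successes in 12 Bernoulli trials yields a likelihood proportional to the one from sampling until the 3rd success arrives on trial 12, yet the one-sided p-values for testing θ = 0.5 against θ < 0.5 differ.

```python
# Hedged sketch of the classic binomial vs. negative binomial SLP violation.
# Both experiments give likelihoods proportional to theta^3 (1 - theta)^9,
# but the p-values for H0: theta = 0.5 vs. theta < 0.5 are not the same.
from scipy.stats import binom, nbinom

theta0, n, k = 0.5, 12, 3

# E1: binomial, n fixed at 12: p-value = P(at most 3 successes)
p_binomial = binom.cdf(k, n, theta0)            # ~ 0.073

# E2: negative binomial, sample until the 3rd success: p-value = P(12 or more trials),
# i.e. P(9 or more failures); scipy's nbinom counts failures before the k-th success
p_negbinom = nbinom.sf(n - k - 1, k, theta0)    # ~ 0.033

print(p_binomial, p_negbinom)   # proportional likelihoods, different p-values
```

One result misses the .05 cutoff while the other clears it, even though an adherent of the SLP must treat the two as evidentially identical.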

———————-

Now for the surprising part: In Cox’s weighing machine example, recall, a coin is flipped to decide which of two experiments to perform.  David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of the particular Ei. Nothing could be more obvious.     

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixture experiments, together with so uncontroversial a principle as sufficiency (SP)–although even that has been shown to be optional to the argument, strictly speaking. Were this true, it would preclude the use of sampling distributions. L. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports that [(WCP and SP) entails SLP], in fact data may violate the SLP while holding both the WCP and SP. Such cases also directly refute [WCP entails SLP].

Binge reading the Likelihood Principle.

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums–or if it comes up during 2026, I’ve pasted most of the early historical sources below. The argument is simple; showing what’s wrong with it took a long time.

My earliest treatment, via counterexample, is in Mayo (2010)–in an appendix to a paper I wrote with David Cox on objectivity and conditionality in frequentist inference.  But the treatment in the appendix doesn’t go far enough, so if you’re interested, it’s best to just check out Mayo (2014) in Statistical Science.[ii] An intermediate paper Mayo (2013) corresponds to a talk I presented at the JSM in 2013.

Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., “Don’t Birnbaumize that experiment my friend”).

This conundrum is relevant to the very notion of “evidence”, blithely taken for granted in both statistics and philosophy. [iii] There’s no statistics involved, just logic and language. My 2014 paper shows the logical problem, but I still think that it will take an astute philosopher of language to adequately classify the linguistic fallacy being committed.

To have a list for binging, I’ve grouped some key readings below.

Classic Birnbaum Papers:

  • Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.
  • Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962). “Discussion on Birnbaum’s On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 307-326.
  • Birnbaum, A. (1969). “Concepts of Statistical Evidence”. In Ernest Nagel, Sidney Morgenbesser, Patrick Suppes & Morton Gabriel White (eds.), Philosophy, Science, and Method. New York: St. Martin’s Press. pp. 112–143.
  • Birnbaum, A. (1970). Statistical Methods in Scientific Inference (letter to the editor). Nature 225, 1033.
  • Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, Journal of the American Statistical Association 67(340), 858-861.

Note to Reader: If you look at the (1962) “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

Durbin:

There’s also a good discussion in Cox and Hinkley 1974.

Evans, Fraser, and Monette:

Kalbfleisch:

My discussions (also noted above):

Continue reading

Categories: 11 years ago, Likelihood Principle | Leave a comment

67 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

2025-26 Cruise

.

We’re stopping to consider one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018). It is now 67 years since Cox gave his famous weighing machine example in Sir David Cox (1958)[1]. It will play a vital role in our discussion of the (strong) Likelihood Principle later this week. The excerpt is from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? 

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not.

As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958).  It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view.  Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E1 or E2, to use in observing a Normally distributed random sample Z to make inferences about mean θ. E1 has a variance of 1, while that of E2 is 10⁶.  Any randomizing device used to choose which instrument to use will do, so long as it is irrelevant to θ. This is called a mixture experiment. The full data would report both the result of the coin flip and the measurement made with that instrument. We can write the report as having two parts: First, which experiment was run and second the measurement: (Ei, z), i = 1 or 2.

In testing a null hypothesis such as θ = 0, the same z measurement would correspond to a much smaller P-value were it to have come from E1 rather than from E2: denote them as p1(z) and p2(z), respectively. The overall significance level of the mixture: [p1(z) + p2(z)]/2, would give a misleading report of the precision of the actual experimental measurement. The claim is that N-P statistics would report the average P-value rather than the one corresponding to the scale you actually used! These are often called the unconditional and the conditional test, respectively.  The claim is that the frequentist statistician must use the unconditional test.
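
To put numbers on the contrast, here is a minimal sketch (the measurement z = 2.0 is an illustrative assumption, not a value from the text) comparing the conditional p-values with the unconditional, averaged one:

```python
# Hedged sketch: conditional vs. unconditional (averaged) p-values in the
# two-instrument mixture, testing theta = 0 two-sidedly with a single measurement z.
from math import sqrt
from scipy.stats import norm

def two_sided_p(z, sigma):
    return 2 * norm.sf(abs(z) / sigma)

z = 2.0                                    # hypothetical measurement (an assumption)
p1 = two_sided_p(z, sigma=1.0)             # precise instrument E1 (variance 1): ~ 0.046
p2 = two_sided_p(z, sigma=sqrt(10**6))     # imprecise instrument E2 (variance 10^6): ~ 0.998

print(f"conditional on E1:   {p1:.3f}")
print(f"conditional on E2:   {p2:.3f}")
print(f"unconditional (avg): {(p1 + p2) / 2:.3f}")   # misleading for either instrument
```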

Suppose that we know we have observed a measurement from E2 with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

Once it is known which Ei  has produced z, the P-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):

The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

Weak Conditionality Principle (WCP): If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional.  In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann.  Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)

Is There a Catch?

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof”  were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,”  went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.”  Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

If you’re keen to follow our abbreviated cruise, write to Jean Miller (jemille6@vt.edu) and she will send you the final pages of the monthly readings.


Note to the Reader:

Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system). Yet you will find many statistics texts, and numerous discussion articles, that blithely repeat that the (strong) Likelihood Principle is a theorem, shown to follow if you accept the WCP, which frequentist error statisticians do.[2] I argue that Allan Birnbaum’s (1962) alleged proof is circular. So, in 2025, when you find a text that claims the LP is a theorem, provable from the WCP, please let me know.

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.
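
As a concrete, hedged sketch of why the posterior obeys the LP (the uniform prior and the particular counts are illustrative assumptions): with a Beta(1, 1) prior, 3 successes in 12 Bernoulli trials gives exactly the same posterior for θ whether the data came from a binomial experiment (n fixed) or a negative binomial one (sample to the 3rd success), because the two likelihoods are proportional and the constant cancels in Bayes’ theorem.

```python
# Proportional likelihoods => identical posteriors (a numerical check on a grid).
import numpy as np
from scipy.stats import binom, nbinom, beta

theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, 1, 1)                 # uniform Beta(1, 1) prior (an assumption)

lik_binomial = binom.pmf(3, 12, theta)        # n fixed at 12, 3 successes observed
lik_negbinom = nbinom.pmf(9, 3, theta)        # sample to the 3rd success, 9 failures observed

def posterior(lik):
    post = prior * lik
    return post / post.sum()                  # normalize on the grid

print(np.allclose(posterior(lik_binomial), posterior(lik_negbinom)))   # True
```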

The LP was a main topic for the first few years of this blog. That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in Statistical Science.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).

If you’re keen to try your hand at the arguments (Birnbaum’s or mine), you might start with a summary post (based on slides) here, or an intermediate paper Mayo (2013) that I presented at the JSM. It is not included in SIST. It’s a brainbuster, though, I warn you. There’s no real mathematics or statistics involved, it’s pure logic. But it’s very circuitous, which is why the supposed “proof” has stuck around as long as it has. I’ve always thought that clarifying it fully demanded the expertise of a philosopher of language, but I haven’t found one yet.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word in the rejoinder.

References (outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference“, Journal of the American Statistical Association 57(298), 269-306.

Birnbaum, A. (1975). Comments on Paper by J. D. Kalbfleisch. Biometrika, 62 (2), 262–264.

Cox, D. R. (1958), “Some problems connected with statistical inference“, The Annals of Mathematical Statistics, 29, 357-372.

Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in JSM Proceedings, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle”, with discussion and Mayo rejoinder, Statistical Science 29(2), 227-239; 261-266.

Categories: 2025 leisurely cruise, Birnbaum, Likelihood Principle | Leave a comment

(DEC #2) December Leisurely Tour Meeting 3: SIST Excursion 3 Tour III

2025-26 Cruise

We are now at the second stop on our December leisurely cruise through SIST: Excursion 3 Tour III. I am pasting the slides and video from this session during the LSE Research Seminars in 2020 (from which this cruise derives). (Remember it was early pandemic, and we weren’t so adept with zooming.)  The Higgs discussion clarifies (and defends) a somewhat controversial interpretation of p-values. (If you’re interested in the Higgs discovery, there’s a lot more on this blog you can find with the search.) I am not sure if I would include the section on “capability and severity” were I to write a second edition, though I would keep the duality of tests and CIs. My goal was to expose a fallacy that is even more common nowadays, but I would have placed a revised version later in the book. Share your remarks in the comments.


III. Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery: Continue reading

Categories: 2025 leisurely cruise, confidence intervals and tests | Leave a comment

December leisurely cruise “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)

2025-26 Cruise

Welcome to the December leisurely cruise:
Wherever we are sailing, assume that it’s warm, warm, warm (not like today in NYC). This is an overview of our first set of readings for December from my Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018): [SIST]–Excursion 3 Tour II. This leisurely cruise, participants know, is intended to take a whole month to cover one week of readings from my 2020 LSE Seminars, except for December and January which double up. 

What do you think of “3.6 Hocus-Pocus: P-values Are Not Error Probabilities, Are Not Even Frequentist”? This section refers to Jim Berger’s famous attempted unification of Jeffreys, Neyman and Fisher in 2003. The unification considers testing two simple hypotheses using a random sample from a Normal distribution, computing their two P-values, rejecting whichever gets the smaller P-value, and then computing its posterior probability, assuming each gets a prior of .5. This becomes what he calls the “Bayesian error probability” upon which he defines “the frequentist principle”. On Berger’s reading of an important paper* by Neyman (1977), Neyman criticized p-values for violating the frequentist principle (SIST p. 186). *The paper is “Frequentist Probability and Frequentist Statistics”. Remember that links to readings outside SIST are at the Captain’s biblio on the top left of the blog. Share your thoughts in the comments.
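
Here is a minimal sketch of the kind of calculation just described (the means, sample size, and observed value are illustrative assumptions, and the gloss is mine, not Berger’s own computation):

```python
# Two simple Normal hypotheses with equal .5 priors: compute a p-value under each,
# "reject" the one with the smaller p-value, then compute its posterior probability.
from math import sqrt
from scipy.stats import norm

mu0, mu1, sigma, n = 0.0, 1.0, 1.0, 10     # illustrative assumptions
xbar = 0.6                                 # hypothetical observed sample mean
se = sigma / sqrt(n)

p0 = norm.sf((xbar - mu0) / se)            # p-value under H0 (large xbar tells against H0)
p1 = norm.sf((mu1 - xbar) / se)            # p-value under H1 (small xbar tells against H1)
rejected = "H0" if p0 < p1 else "H1"

lik0 = norm.pdf(xbar, loc=mu0, scale=se)   # likelihood of xbar under each hypothesis
lik1 = norm.pdf(xbar, loc=mu1, scale=se)
post = {"H0": lik0 / (lik0 + lik1), "H1": lik1 / (lik0 + lik1)}

print(f"p0 = {p0:.3f}, p1 = {p1:.3f}; reject {rejected}")
print(f"posterior probability of the rejected hypothesis = {post[rejected]:.3f}")
```

In this toy setup the rejected null gets a p-value near .03 yet a posterior probability near .27, the sort of gap that drives Berger’s proposal.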

Some snapshots from Excursion 3 tour II.

Continue reading

Categories: 2025 leisurely cruise | Leave a comment

Modest replication probabilities of p-values–desirable, not regrettable: a note from Stephen Senn

.

You will often hear—especially in discussions about the “replication crisis”—that statistical significance tests exaggerate evidence. Significance testing, we hear, inflates effect sizes, inflates power, inflates the probability of a real effect, or inflates the probability of replication, and thereby misleads scientists.

If you look closely, you’ll find the charges are based on concepts and philosophical frameworks foreign to both Fisherian and Neyman–Pearson hypothesis testing. Nearly all have been discussed on this blog or in SIST (Mayo 2018), but new variations have cropped up. The emphasis that some are now placing on how biased selection effects invalidate error probabilities is welcome, but I say that the recommendations for reinterpreting quantities such as p-values and power introduce radical distortions of error statistical inferences. Before diving into the modern incarnations of the charges it’s worth recalling Stephen Senn’s response to Stephen Goodman’s attempt to convert p-values into replication probabilities nearly 20 years ago (“A Comment on Replication, P-values and Evidence,” Statistics in Medicine). I first blogged it in 2012, here. Below I am pasting some excerpts from Senn’s letter (but readers interested in the topic should look at all of it), because Senn’s clarity cuts straight through many of today’s misunderstandings. 
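
One piece of arithmetic worth keeping in mind (a hedged sketch, assuming one-sided z-tests and taking the true effect to equal the originally observed effect): an original result that is just significant at the .05 level has only about a 50% chance of reaching significance again in an identical replication.

```python
# Replication probability of a just-significant result, treating the observed
# standardized effect as the true one (one-sided z-test at alpha = .05).
import numpy as np
from scipy.stats import norm

z_alpha = norm.ppf(0.95)             # one-sided 5% cutoff, ~1.645
z_obs = z_alpha                      # original study just reaches p = .05

p_rep = norm.sf(z_alpha - z_obs)     # analytic replication probability = 0.5

rng = np.random.default_rng(1)
sims = rng.normal(loc=z_obs, scale=1.0, size=100_000)   # replication z-statistics
print(p_rep, np.mean(sims > z_alpha))                   # both ~ 0.5
```

As the title of this post indicates, Senn’s view is that such modest figures are to be expected, not regretted.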

.

Continue reading

Categories: 13 years ago, p-values exaggerate, replication research, S. Senn | Tags: , , , | 8 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

November Cruise

The example I use here to illustrate formal severity comes in for criticism in a paper, to which I reply in a 2025 BJPS paper linked to here. Use the comments for queries.

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident) 

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired.  It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean X̄ computed, each with a known standard deviation σ = 10.  When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²) then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1). Continue reading
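
As a hedged sketch of the calculations this sets up (the observed mean and the benchmark values below are illustrative assumptions, not numbers from the excerpt): with X̄ ~ N(μ, 1), one can test μ ≤ 150 against μ > 150 and then compute post-data severity assessments for claims of the form μ > μ1.

```python
# Severity sketch for the water plant setup: X-bar ~ N(mu, 1), since n = 100, sigma = 10.
from math import sqrt
from scipy.stats import norm

se = 10 / sqrt(100)                     # = 1, as derived in the excerpt
xbar = 152.0                            # hypothetical observed sample mean (an assumption)

p_value = norm.sf((xbar - 150) / se)    # ~ 0.023: evidence against mu <= 150
print(f"p-value for mu <= 150: {p_value:.3f}")

# SEV(mu > mu1): probability of a result less extreme than the one observed, under mu = mu1
for mu1 in (150.5, 151.0, 152.0, 153.0):
    sev = norm.cdf((xbar - mu1) / se)
    print(f"SEV(mu > {mu1}): {sev:.3f}")
```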

Categories: 2025 leisurely cruise, severe tests, severity function, water plant accident | Leave a comment

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: (3.2)

Neyman & Pearson

November Cruise: 3.2

This third stop of November’s leisurely cruise of SIST aligns well with my recent BJPS paper Severe Testing: Error Statistics vs Bayes Factor Tests.  In tomorrow’s zoom, 11 am New York time, we’ll have an overview of the topics in SIST so far, as well as a discussion of this paper. (If you don’t have a link, and want one, write to me at error@vt.edu).

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined: Continue reading

Categories: 2024 Leisurely Cruise, E.S. Pearson, Neyman, statistical tests | Leave a comment

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

November Cruise

This second excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments.

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Continue reading

Categories: 2025 leisurely cruise, SIST, Statistical Inference as Severe Testing | 2 Comments

November: The leisurely tour of SIST continues

2025 Cruise

We continue our leisurely tour of Statistical Inference as Severe Testing [SIST] (Mayo 2018, CUP) with Excursion 3. This is based on my 5 seminars at the London School of Economics in 2020; I include slides and video for those who are interested. (use the comments for questions) Continue reading

Categories: 2025 leisurely cruise, significance tests, Statistical Inference as Severe Testing | 1 Comment

Severity and Adversarial Collaborations (i)

.

In the 2025 November/December issue of American Scientist, a group of authors (Ceci, Clark, Jussim and Williams 2025) argue in “Teams of rivals” that “adversarial collaborations offer a rigorous way to resolve opposing scientific findings, inform key sociopolitical issues, and help repair trust in science”. With adversarial collaborations, a term coined by Daniel Kahneman (2003), teams of divergent scholars, interested in uncovering what is the case (rather than endlessly making their case) design appropriately stringent tests to understand–and perhaps even resolve–their disagreements. I am pleased to see that in describing such tests the authors allude to my notion of severe testing (Mayo 2018)*:

Severe testing is the related idea that the scientific community ought to accept a claim only after it surmounts rigorous tests designed to find its flaws, rather than tests optimally designed for confirmation. The strong motivation each side’s members will feel to severely test the other side’s predictions should inspire greater confidence in the collaboration’s eventual conclusions. (Ceci et al., 2025)

1. Why open science isn’t enough Continue reading

Categories: severity and adversarial collaborations | 5 Comments

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)

Third Stop

Readers: With this third stop we’ve covered Tour 1 of Excursion 1.  My slides from the first LSE meeting in 2020 which dealt with elements of Excursion 1 can be found at the end of this post. There’s also a video giving an overall intro to SIST, Excursion 1. It’s noteworthy to consider just how much things seem to have changed in just the past few years. Or have they? What would the view from the hot-air balloon look like now?  Share your thoughts in the comments.

ZOOM: I propose a zoom meeting for Sunday, November 16 at 11 am or Friday, November 21 at 11 am, New York time. (An equal # prefer Fri & Sun.) The link will be available to those who register/registered with Dr. Miller*.

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)

.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

Continue reading

Categories: 2025 leisurely cruise, Statistical Inference as Severe Testing | Leave a comment

The ASA Sir David R. Cox Foundations of Statistics Award is now annual

15 July 1924 – 18 January 2022

The Sir David R. Cox Foundations of Statistics Award will now be given annually by the American Statistical Association (ASA), thanks to generous contributions by “Friends” of David Cox, solicited on this blog!*

Nominations for the 2026 Sir David R. Cox Foundations of Statistics Award are due on November 1, 2025, and require the following:

  • Nomination letter
  • Candidate’s CV
  • Two letters of support, not to exceed two pages each

Continue reading

Categories: Sir David Cox, Sir David Cox Foundations of Statistics Award | Leave a comment

Excursion 1 Tour I (2nd Stop): Probabilism, Performance, and Probativeness (1.2)

.

Readers: Last year at this time I gave a Neyman seminar at Berkeley and posted on a panel discussion we had. There were lots of great questions, and follow-ups. Here’s a link.

“I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth”. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Error Statistics | Leave a comment

2025 (1) The leisurely cruise begins: Excerpt from Excursion 1 Tour I of Statistical Inference as Severe Testing (SIST)

Ship Statinfasst

Excerpt from Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

NOTE: The following is an excerpt from my book: Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP, 2018). For any new reflections or corrections, I will use the comments. The initial announcement is here (including how to join).

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing | Leave a comment

2025 Leisurely cruise through Statistical Inference as Severe Testing: First Announcement

Ship Statinfasst

We’re embarking on a leisurely cruise through the highlights of Statistical Inference as Severe Testing [SIST]: How to Get Beyond the Statistics Wars (CUP 2018) this fall (Oct-Jan), following the 5 seminars I led for a 2020 London School of Economics (LSE) Graduate Research Seminar. It had to be run online due to Covid (as were the workshops that followed). Unlike last fall, this time I will include some zoom meetings on the material, as well as new papers and topics of interest to attendees. In this relaxed (self-paced) journey, excursions that had been covered in a week will be spread out over a month [i] and I’ll be posting abbreviated excerpts on this blog. Look for the posts marked with the picture of ship StatInfAsSt. [ii]  Continue reading

Categories: 2024 Leisurely Cruise, Announcement | Leave a comment

My BJPS paper: Severe Testing: Error Statistics versus Bayes Factor Tests

.

In my new paper, “Severe Testing: Error Statistics versus Bayes Factor Tests”, now out online at The British Journal for the Philosophy of Science, I “propose that commonly used Bayes factor tests be supplemented with a post-data severity concept in the frequentist error statistical sense”. But how? I invite your thoughts on this and any aspect of the paper.* (You can read it here.)

I’m pasting down the abstract and the introduction. Continue reading

Categories: Bayesian/frequentist, Likelihood Principle, multiple testing | 4 Comments

Are We Listening? Part II of “Sennsible significance” Commentary on Senn’s Guest Post

.

This is Part II of my commentary on Stephen Senn’s guest post, Be Careful What You Wish For. In this follow-up, I take up two topics:

(1) A terminological point raised in the comments to Part I, and
(2) A broader concern about how a popular reform movement reinforces precisely the mistaken construal Senn warns against.

But first, a question—are we listening? Because what underlies what Senn is saying is subtle, and yet what’s at stake is quite important for today’s statistical controversies. It’s not just a matter of which of four common construals is most apt for the population effect we wish to have high power to detect.[1] As I hear Senn, he’s also flagging a misunderstanding that allows some statistical reformers to (wrongly) dictate what statistical significance testers “wish” for in the first place. Continue reading

Categories: clinical relevance, power, reforming the reformers, S. Senn | 5 Comments

“Sennsible significance” Commentary on Senn’s Guest Post (Part I)

.

Have the points in Stephen Senn’s guest post fully come across?  Responding to comments from diverse directions has given Senn a lot of work, for which I’m very grateful. But I say we should not leave off the topic just yet. I don’t think the core of Senn’s argument has gotten the attention it deserves. So, we’re not done yet.[0]

I will write my commentary in two parts, so please return for Part II. In Part I, I’ll attempt to give an overarching version of Senn’s warning (“Be careful what you wish for”) and  his main recommendation. He will tell me if he disagrees. All quotes are from his post. In Senn’s opening paragraph:

…Even if a hypothesis is rejected and the effect is assumed genuine, it does not mean it is important…many a distinguished commentator on clinical trials has confused the difference you would be happy to find with the difference you would not like to miss. The former is smaller than the latter. For reasons I have explained in this blog [reblogged here], you should use the latter for determining the sample size as part of a conventional power calculation.
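
To see numerically why the distinction matters, here is a small sketch using the conventional two-arm sample size formula (the standard deviation and the two candidate differences are illustrative assumptions, not Senn’s numbers):

```python
# Sample size per arm for comparing two means at two-sided alpha = .05 and 90% power.
# The required n is driven by the difference delta one would not like to miss.
from scipy.stats import norm

def n_per_arm(delta, sigma, alpha=0.05, power=0.9):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

sigma = 10.0
print(round(n_per_arm(delta=5.0, sigma=sigma)))   # difference not to be missed: ~84 per arm
print(round(n_per_arm(delta=8.0, sigma=sigma)))   # larger "happy to find" difference: ~33 per arm
```

Plugging in the larger difference one would merely be happy to find yields a much smaller n, and hence a trial with too little power to detect the difference one cannot afford to miss, which is exactly the confusion Senn warns against.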

Continue reading

Categories: clinical relevance, power, S. Senn | 6 Comments

Stephen Senn (guest post): “Relevant significance? Be careful what you wish for”

 

.

Stephen Senn

Consultant Statistician
Edinburgh

Relevant significance?

Be careful what you wish for

Despised and Rejected

Scarcely a good word can be had for statistical significance these days. We are admonished (as if we did not know) that just because a null hypothesis has been ‘rejected’ by some statistical test, it does not mean it is not true and thus it does not follow that significance implies a genuine effect of treatment. Continue reading

Categories: clinical relevance, power, S. Senn | 47 Comments
