Error Statistics Philosophy

Announcement: CFP Synthese Topical Collection: Severity and Learning from Error

Mayo — Wed, 24 Jun 2026 02:12:02 +0000

I hope that many readers of this blog will consider contributing to this!

ANNOUNCEMENT SEV26

Synthese Topical Collection CFP: Severity and Learning from Error

This Topical Collection examines how inquiry learns from error by focusing on a basic principle of evidence in science, statistics, medicine, law, epistemology, and day-to-day learning: a claim is not well-tested, known or epistemically warranted, if it is based on a method that makes it easy to accept, conclude or infer the claim, even if it is false. Such a claim may accord well with the data, but it has not passed a stringent or severe test. While this overarching intuition is widely shared, the problem of how to understand or satisfy it remains unsolved. C. S. Peirce emphasizes randomization and (what is now called) pre-designation to achieve self-correcting methods. Popper viewed severity in terms of satisfying novel predictive success and surviving stringent attempts at falsification. Deborah Mayo (1996, 2018) combines elements from Popper and Peirce with the use of error probabilities from statistical methods: proposed solutions to problems earn warrant by surviving probes that were capable of showing them wrong or inadequate. This Topical Collection takes “severity” to be a broad meta-level concept according to which a claim – whether a report of a perception, a prediction, a hypothesis, or part of a model – is assessed according to whether, and how readily, its errors and inadequacies would have been found, if present.

Several questions arise: What errors matter for a given aim? What would it take for a method to be capable of detecting them? How in actual practice can inquirers show they have engaged in responsible error probing when there are no formal probability models? Addressing questions like these is of urgent importance today as we face high-powered methods that make it easy to find impressive looking effects that are spurious and non-replicating, or to arrive at well-fitting models that do not predict well, do not replicate, or do not provide substantive scientific understanding. These questions arise in debates about methodological shifts in AI/ML, randomized clinical trials, legal evidence, climate modeling, statistical inference, and error-prone inference in general. We seek to bring these metascience debates into direct contact and to ask what is often left hidden: What errors are now being controlled, and which have quietly dropped out of view? By bringing together philosophers, statisticians, and scientists, we aim to develop a shared set of problems and tools with a forward-looking goal: to shape emerging practices, rather than merely react to them with retrospective commentary.

We welcome submissions on any topic that broadly relates to severity or learning from error. We invite contributions that develop, apply or challenge severity-based reasoning, or that develop alternative approaches, Bayesian, frequentist, machine-learning and other, which engage the same underlying concern: how inquiry learns from error, and how claims earn warrant by surviving probes that were capable of showing them wrong or inadequate. We encourage contributions that explore connections between concepts of severity in different fields. Notably, the concepts of sensitivity and safety in contemporary epistemology can be understood through the lens of severity, and both are redolent of stability in AI. We also welcome discussions of how contemporary manifestations of severity interrelate with the traditional notions of severity from Popper and Peirce, and how concepts of severity may help in tackling fundamental problems of induction, falsification, underdetermination, and realism in philosophy of science.

The collection is partly motivated by the thirtieth anniversary of Deborah Mayo’s (1996, Chicago) Error and the Growth of Experimental Knowledge (Lakatos Prize 1998) and the development of its account of severe testing.

Appropriate Topics for Submission include, among others:

Severity and philosophy of statistics

Do recent controversies about the uses of error probabilities in statistics (and metastatistics) present a challenge to severity-based reasoning?

Do the new fields of post-selection inference (in AI and other disciplines) allow for error control despite data-driven constructions? Or do they shift attention to different errors?
How does severity link to such notions as calibration, security, and stability, and to statistical techniques that promote them, such as robustness analyses and multiverse analyses?

Severity and philosophy of science

What does it mean for a method, or for science itself, to be self-correcting or error-correcting? Does it fit best with a pragmatist philosophy?
How does severe probing take place in the historical sciences, e.g., climate science, geology? Can claims be well probed without being replicable?
Rather than probing for falsity, how can we severely probe if a model is adequate for a purpose or problem of interest?

Severity and contemporary epistemology

Can a useful cross-cutting epistemology that links science, statistics, and applied epistemology be built around the concept of severity?
Do features of severity (e.g., auditing of assumptions) point to ways to avoid problems of sensitivity and safety in epistemology?
Does requiring severity explain why legal epistemology resists mere base-rates and “naked statistics”? Does it solve proof paradoxes in legal epistemology?

Tracking shifts in error control

How does AI/ML shift from modeling data-generating mechanisms in statistics to optimizing predictive performance in machine learning?
How do changing guidelines for RCTs shift trials from probing biological mechanisms to predicting average treatment effects over a population?
What are the social, epistemic, ethical, and political consequences of shifting regimens of error control?

The value of probing error

How can adversarial collaborations and stress-testing advance science?
How can error repertoires be built and effectively employed to facilitate severity in measurement and experiment?
How does learning from error enter outside science (e.g., in art, architecture and life drawing)?

Submissions via: https://www.editorialmanager.com/synt/default.aspx

Under the drop-down menu, select Severity and Learning from Error.

Submitted papers will undergo the usual Synthese review process.

For further information, please contact the guest editors:

mayod@vt.edu, wendyparker@vt.edu, D.Lakens@tue.nl, staleykw@gmail.com.

The deadline for submissions is the 15th of December, 2026 (with possible short extensions). Use the comments or write to me with your ideas and questions with the subject: SEV26. The website announcement is here: https://link.springer.com/collections/ebjdhfadcd

‘Low power’ and an all too standard error (continuation of “don’t turn power on its head”)

Mayo — Mon, 08 Jun 2026 03:20:40 +0000

“In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten….to provide a point estimate without also providing a standard error is, indeed, an all too standard error.”

Stephen Senn: “Error point: the importance of knowing how much you don’t know”

In my previous blogpost (“How not to turn power on its head”), I argued, in relation to a one-sided test of mean μ (e.g., H₀: µ ≤ 0 vs H₁: µ > 0 with known SE):

If POW(μ′) is high (e.g., over .5), then a just significant result is poor evidence that μ > μ′; while if POW(μ′) is low (e.g., less than .2), it is good evidence that μ > μ′ where μ′ is a value greater than 0 (provided assumptions for these claims hold approximately).

By a “just statistically significant result” I mean one that just makes it to the threshold for statistical significance, write it as M* (my last post used D*). The reasoning is essentially this: Because it’s very improbable to obtain as low a P-value as we did, were μ as small as μ′—that is, because POW(μ′) is low—the result indicates we are in a world where μ is greater than μ′. This is exactly the reasoning that allows us to infer μ > 0 with a statistically significant result. Indeed, the power of the test against μ₀ is α. It is supposed that the statistical assumptions needed for the error probabilities to apply hold adequately.

Why then do we often hear that low power is associated with “exaggerated” or “inflated” effects? As we reasoned in the previous post, low power against μ′ strengthens the inference that μ exceeds μ′. Can the same feature—low power—also be associated with overestimation? The answer is, yes it can, but only one of the claims corresponds to a correct application of statistical significance tests.

More specifically, the overestimation charge stems from supposing the observed result M* is taken as a (point) estimate of the population mean (i.e., estimating μ = M*, without providing the SE)–an unkosher (but not so uncommon) move–and then considering a value μ′ against which the test has low power. Since M* is the just-significant cutoff, clearly M* will exceed μ′ (at least in a good test). So if the true population mean takes a value against which the test has low power, and M* is taken as a point estimate of μ, the result will be to “overestimate” the population mean. While the true value is unknown, this if-then claim is correct. Likewise, if the power to detect the true μ is high, the observed M* will underestimate μ–if M* is used as a point estimate.

To clarify these points, it helps to contrast two different questions that are often run together:

1. Does the observed (just) statistically significant result M* warrant inferring μ > μ′ (when POW(μ′) is low)?
2. Does the observed (just) statistically significant result M* exceed μ′ (when POW(μ′) is low)?

The answer to both questions is yes. The very fact invoked to show that M* exceeds μ′—yielding a “yes” answer to #2–namely, that a result at least as large as M* would be improbable were μ = μ′—is precisely what warrants inferring that μ > μ′–yielding a “yes” answer to #1.

However common it may be to equate the observed statistically significant result M* with the population mean, that is not a warranted inference from a significance test. For one thing, significance test inferences are inequalities, not point claims or point estimates. A statistically significant result warrants inferring μ > μ₀ and, more generally, warrants inferring μ > μ′ for values μ′ against which the test has sufficiently low power–although it is not typically put that way. It would more typically be put in terms of the p-value reached in relation to a discrepancy from H₀. (We would get a p-value function over different discrepancies.) What is the p-value were we testing H₀: µ ≤ M*, and observed our just significant result M*? Answer: .5. Thus, to take M* as warranting µ > M*, would be to follow a method that is wrong 50% of the time. (See mountains out of molehill fallacy in SIST.)

There is, of course, a relation between tests and estimation. Rejecting H₀ (at level α) is equivalent to inferring that μ exceeds the corresponding lower confidence bound (at level 1 − α, for the 1-sided case). Obtaining this lower bound requires subtracting a number of SEs (e.g., 1.5, 1.65, 1.96, 2) from M*.
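As a rough numerical sketch of this duality between tests and estimation, the snippet below computes the one-sided lower bound obtained by subtracting k SEs from M*, together with its confidence level. The values SE = 0.35 and M* = 0.7 are borrowed from the example in the previous post; they, the Normal sampling distribution, and the helper names are assumptions made purely for illustration.

```python
from statistics import NormalDist

SE = 0.35      # assumed standard error (from the earlier post's example)
M_STAR = 0.7   # assumed just-significant result, i.e. 2 * SE above mu0 = 0

def lower_bound(k, m_obs=M_STAR, se=SE):
    """One-sided lower confidence bound: observed mean minus k standard errors."""
    return m_obs - k * se

def confidence_level(k):
    """Confidence level of the claim mu > m_obs - k*SE under a Normal model."""
    return NormalDist().cdf(k)

for k in (1.5, 1.65, 1.96, 2.0):
    print(f"k = {k:<4}: infer mu > {lower_bound(k):.3f} "
          f"at level {confidence_level(k):.3f}")

# k = 0 corresponds to claiming mu > M* itself, which only has level 0.5.
print(f"k = 0   : infer mu > {lower_bound(0):.3f} "
      f"at level {confidence_level(0):.3f}")
```

Subtracting no SEs at all (k = 0) amounts to inferring μ > M*, which is the case with confidence level .5 discussed above.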

Observe that POW(μ₀) = α and POW(M*) = .5. We can relate the consequence of μ′ moving farther below M*:

As μ′ moves farther below M*	Consequence
M* − μ′ increases	Greater overestimation if the observed M* is used to estimate μ
POW(μ′) decreases	The probability of obtaining M ≥ M* under μ = μ′ decreases
P-value for μ = μ′ decreases	Stronger evidence that μ > μ′

Thus, as power against μ′ decreases, the amount by which M* exceeds μ′ increases, but so too does the evidence that μ exceeds μ′. The very circumstance that yields greater overestimation when M* is used to estimate μ yields stronger evidence that μ exceeds μ′.
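The pattern in the table can be checked numerically. The following is a minimal sketch, assuming a Normal sampling distribution with SE = 0.35 and M* = 0.7 (the numbers from the previous post); the function name `table_row` and the chosen grid of μ′ values are hypothetical.

```python
from statistics import NormalDist

SE = 0.35      # assumed standard error
M_STAR = 0.7   # assumed just-significant cutoff (2 * SE above mu0 = 0)

def table_row(mu_prime, m_star=M_STAR, se=SE):
    """Return (gap, power, level) for a hypothesized value mu'."""
    gap = m_star - mu_prime                             # overestimation if M* is read as mu
    power = 1.0 - NormalDist(mu_prime, se).cdf(m_star)  # POW(mu') = P(M >= M*; mu = mu')
    level = 1.0 - power                                 # confidence level for mu > mu'
    return gap, power, level

print(" mu'    M*-mu'   POW(mu')  level for mu > mu'")
for mu_prime in (0.7, 0.525, 0.35, 0.175, 0.0):
    gap, power, level = table_row(mu_prime)
    print(f"{mu_prime:5.3f}  {gap:6.3f}   {power:7.3f}   {level:7.3f}")
```

Reading down the printed rows, the gap M* − μ′ grows while POW(μ′) shrinks, and the level attached to μ > μ′ rises with it.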

One final point. If a testing procedure is selectively reporting only statistically significant results, then the original error probabilities no longer apply–whether to the test or equivalent CI estimation.
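To illustrate how selective reporting can undermine the nominal error probabilities, here is a small sketch under assumed conditions: M is Normal with SE = 0.35, the cutoff is M* = 2 SE, and the reported lower bound is M − 2 SE. It compares the nominal confidence level with the coverage among only those results that reached significance. The setup and the function names are illustrative assumptions, not taken from the post.

```python
from statistics import NormalDist

SE = 0.35   # assumed standard error
K = 2.0     # cutoff and lower bound both use 2 SEs (mu0 = 0)

def nominal_level(k=K):
    """Coverage of the bound M - k*SE when every result is reported."""
    return NormalDist().cdf(k)

def conditional_coverage(mu, se=SE, k=K):
    """P(M - k*SE <= mu | M >= M*): coverage among significant results only."""
    m_star = k * se
    dist = NormalDist(mu, se)
    p_significant = 1.0 - dist.cdf(m_star)
    # The bound covers mu when M <= mu + k*SE; intersect with M >= M*.
    covered_and_significant = max(0.0, dist.cdf(mu + k * se) - dist.cdf(m_star))
    return covered_and_significant / p_significant

print(f"nominal level: {nominal_level():.3f}")
for mu in (0.0, 0.35, 1.0):
    print(f"true mu = {mu:4.2f}: coverage given significance = "
          f"{conditional_coverage(mu):.3f}")
```

In this setup, when the true μ is 0 every reported lower bound exceeds 0, so the coverage among reported results is zero rather than the nominal level.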

Share your queries and thoughts in the comments to this post.

For a related post see “Do underpowered tests exaggerate population effects?”

See also the discussion on pp. 359-361 of Mayo (2018, CUP): Statistical Inference as Severe Testing: How to get beyond the statistics wars? (SIST). The relevant excerpt can be found here.

How not to turn power on its head

Mayo — Tue, 12 May 2026 02:28:28 +0000

In giving some informal remarks about power at a seminar a couple of weeks ago, I proposed that the tendency to turn the notion of power on its head might be avoided by imagining we need to define a test’s error probabilities in terms of its power alone. We can refer to the power against the null hypothesis, rather than alluding to a type 1 error probability, for example. What do I mean by turning power on its head? I mean, at least here, supposing that a test provides poor evidence of discrepancies that the test has low power to detect.

This grows out of the assumption that a statistically significant result only provides good evidence of discrepancies (from a null hypothesis) that the test has reasonably high power to detect. But these claims actually reverse what is the case about power and warranted (population) discrepancies. They turn power on its head.

To remind us, the goal of this statistical significance test is to assess the compatibility of data with a reference or null hypothesis, such as to see if the value of test statistic D indicates a genuine positive (population) discrepancy from 0. The tester may go on to consider the evidence for various other positive discrepancies as well. For simplicity consider testing H₀: µ ≤ 0 vs H₁: µ > 0 with known SE. I will use some numbers from a guest blog post by Stephen Senn discussing the interpretation of tests in clinical trials:

For simplicity, allow the cut-off to be 2, rather than 1.96. Write the cut-off for rejecting the null as D*, which in Senn’s example is .7. So we have SE ≈ .35. The power of the test against different values of µ doesn’t require knowing the true value of µ; there is a power function. The test is falsificationist, and uses hypothetical reasoning. The power of this test against µ′ is the probability D exceeds D* (.7) computed under the assumption that µ = µ′. Write this as POW(µ′).
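A minimal sketch of this power function, assuming D is Normally distributed around µ with the known SE (the Normal model and the function name `power` are assumptions for illustration; D* = .7 and SE = .35 are the values given above):

```python
from statistics import NormalDist

SE = 0.35        # standard error implied by the cut-off: 0.7 = 2 * SE
D_STAR = 0.7     # cut-off for rejecting H0: mu <= 0

def power(mu_prime, cutoff=D_STAR, se=SE):
    """POW(mu') = P(D > D*; mu = mu'), with D ~ Normal(mu', se)."""
    return 1.0 - NormalDist(mu=mu_prime, sigma=se).cdf(cutoff)

# The power function is evaluated at hypothesized values of mu;
# no knowledge of the true mu is needed.
for mu_prime in (0.0, 0.175, 0.35, 0.7, 1.0):
    print(f"POW({mu_prime:.3f}) = {power(mu_prime):.3f}")
```

Evaluating the function at 0 gives the test's type 1 error probability, and at D* itself it gives .5.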

Tests, particularly in clinical trials, are often specified to have high probability, .8 or .9, of detecting a discrepancy from the null that “we would not like to miss”. To “miss” means the test does not set off the “significance alarm”, that is, the result is statistically insignificant. Senn’s example stipulates that the population discrepancy we would really hate to miss is ∆ = 1. This means that were the population ∆ = 1 or higher, then we want there to be a high probability that the value of the sample D will exceed D*.

Note: I use the word “discrepancy” in alluding to population effect sizes and “differences” to refer to observed difference. I’m deliberately calling ∆ “the discrepancy we would really hate to miss” because “the discrepancy we would not like to miss” is often interpreted in a weaker manner than intended. In particular, it is often construed as the smallest discrepancy of interest. But this minimal discrepancy of interest would be smaller than ∆. [1] See also my commentary on Senn’s post:

Let’s now turn to a test H₀: µ ≤ 0 vs H₁: µ > 0.

(1) The power at the null is α. Note that POW(0) = .025 (more like .023).

Let’s assume for the moment that D just makes it to the cut-off D* for rejection. Then POW(0) is also equal to the significance level for the outcome. Here’s the logic of statistical significance tests using power, and D=D*:

(2) If D is just statistically significant, and its statistical significance level is low, then D indicates µ > 0.

(2) is equivalent to (2)’:

(2)’ If POW(0) is low, then D* indicates µ > 0.

Of course, indications need to be supplemented by audits of assumptions, checks of biasing selection effects, and ideally, replication. But we must first make out the intended logic of tests, under the presumption the assumptions hold approximately, and separately audit them.

(3) If it would be difficult for the test to generate a D as large as D* if µ = 0, and yet we observe D*, then it indicates it was generated by a µ that exceeds 0.

The assertion in (3) holds not just for the null but for discrepancies from 0. Now a critic of tests might note: “But your test also has rather low power to detect positive discrepancies close to 0. For example:

POW(.5 SE) = .07. [i.e., POW(.17) = .07.]”

To which a tester would respond: Yes, and I can similarly infer my D* indicates µ > .17. I reason as follows: were µ ≤ .17, then 93% of the time I’d get a smaller D than I did. That’s the logic of testing. Note too that the P-value is .07, and the lower confidence interval µ > .17 has confidence level .93.

A critic might continue: “But your test also has rather low power to detect positive discrepancies of 1 SE.

POW(1SE) = .16! [i.e., POW(.35) = .16.]”

To which a tester could respond: Yes, and I therefore have a weak indication that µ > .35. The P-value is .16, and the lower confidence interval µ > .35 has confidence level .84.

And she could go on to note: I clearly do not have evidence that µ exceeds those values against which the test has high power! Even to infer, on grounds that POW(.7) = .5, that my observing D* indicates µ > .7 would be wrong 50% of the time!
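The numbers in this exchange can be reproduced with a short sketch. It assumes D is Normal with SE = .35 and that the observed result sits exactly at D* = .7; the function names are hypothetical. Note that .5 SE is .175, which the dialogue rounds to .17.

```python
from statistics import NormalDist

SE, D_STAR = 0.35, 0.7   # values from the post (cut-off taken as 2 SE)

def p_value_against(mu_prime, observed=D_STAR, se=SE):
    """P(D >= observed; mu = mu'): p-value for H0: mu <= mu' at the observed D."""
    return 1.0 - NormalDist(mu_prime, se).cdf(observed)

def lower_bound_level(mu_prime, observed=D_STAR, se=SE):
    """Confidence level attached to the one-sided claim mu > mu'."""
    return 1.0 - p_value_against(mu_prime, observed, se)

# With observed D = D*, the p-value against mu' coincides with POW(mu').
for label, mu_prime in ((".5 SE", 0.175), ("1 SE", 0.35), ("2 SE", 0.7)):
    print(f"mu' = {label:>5}: POW = p-value = {p_value_against(mu_prime):.2f}, "
          f"level for mu > mu' = {lower_bound_level(mu_prime):.2f}")
```

The lower the power against µ′, the higher the level attached to µ > µ′; at µ′ = D* the level has dropped to .5.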

I hope it is now clear why the bold phrases at the outset turn power on its head, in relation to statistical significance tests. Senn would not say a statistically significant result is fairly good evidence that µ > 1, on the grounds that POW(1) = .8. Yet you will sometimes see medical researchers and spokespeople claim literally this. What we can correctly say is:

(4) If it would be improbable for the test to generate a D > D* were µ < µ0, and yet I observe D*, then D is an indication it was generated by a µ that exceeds µ0.

However, there is a different assertion that has a superficial resemblance to the ones I am pointing to as reversing power, and that other assertion can hold true. I discuss it in my next post. (I promise not to wait a month to write it!)

Share your questions and remarks in the comments to this post.

[1] Other construals: the minimum value of D we hope to observe, the smallest discrepancy we’d like to learn about, or still others. See this earlier Senn post

Error and the Growth of Experimental Knowledge cover: 30 years ago

Mayo — Wed, 01 Apr 2026 19:56:34 +0000

30 years ago today, Chicago Press sent me a draft version of this cover for Error and the Growth of Experimental Knowledge for my approval (except the fuchsia and mustard in “ERROR” were switched). At first I thought it was so cartoony that it might be an April 1 joke! I had sent them a picture I drew (now in the preface), but they didn’t think that worked for a cover. They were right. It’s a fabulous cover!

To access EGEK.

Comments on “The ASA p-value statement 10 years on” (ii)

Mayo — Thu, 26 Mar 2026 21:14:39 +0000

Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Editor’s editorial in The American Statistician (TAS), the 2020 ASA (President’s) Task Force, and the various casualties of the related teeth pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here: 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting which gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Editor’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying relationships between reactions. Here’s how Matthews’ article begins:

Popularised in the 1920s by the hugely influential English statistician Ronald Fisher, p-values lie at the heart of “significance testing”, widely used by researchers to claim to have found something interesting lurking in data. Yet despite their ubiquity in research journals, p-values have also long been criticised as misunderstood, misleading and open to abuse. The problem lies in their definition. p-values typically give the chances of getting an effect at least as impressive as that seen, assuming it’s actually just a fluke. If these chances are sufficiently low – less than 0.05 is the traditional standard – the finding is then deemed “statistically significant”. For many researchers, this has been taken as implying that their finding is not a fluke, and worth taking seriously. But this overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid…

Wait a minute. According to Matthews, taking a small p-value as evidence the observed effect is not a fluke “overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid.” This overlooks the very nature of reductio (or indirect or falsificationist) proofs, say that there’s no smallest rational q: Assume q is the smallest rational. If so, q/2 would be a smaller rational. From this contradiction, infer there is no smallest rational number. It is a deductively valid argument. P-value reasoning is a statistical version of the reductio argument– providing a statistical contradiction to the fluke assumption, with an associated error probability. The small p-value tells us it’s very probable (1-p) that a smaller effect would have resulted, were it due to chance alone. Replicating the small p-value strengthens the contradiction further. [0] So can we please stop saying that assuming a claim C in a reductio argument precludes finding evidence to falsify C?

The assumption in the null hypothesis is just an “implicationary assumption” for purposes of drawing out the consequences of C. Overlooking falsificationist logic is at the heart of today’s confusion over p-value reasoning. If we could run an experiment in which the p-value critics magically became falsificationists for 1 day, I think the scales would fall from the eyes of a statistically significant proportion of them at least during that time.[2]

Admittedly, statistical significance tests are just a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). The simple Fisherian test that the 2016 Statement restricts itself to–there’s just the single null hypothesis without considering alternatives or power–is an even smaller part. But even they have important uses, especially in testing assumptions of statistical models or misspecification tests. In any event, their limited use is not grounds for misinterpreting their logic. Much less is it grounds to abandon or retire them.

Returning to Matthews:

“Finally, in 2021, the ASA issued [3] another statement, this time from a Presidential Task Force whose focus was not promoting the 2016 principles but addressing concerns” that an editorial in TAS–I’ll call it the ASA Executive Director editorial– “might be seen as official ASA policy.” Why the worry it might be seen as ASA policy? One reason is that one of the authors was the ASA Executive director Wasserstein. The second was that it sounded like a continuation of the 2016 ASA statement–which is ASA policy. According to the 2019 Executive Director’s editorial, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and they announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds is also verboten. “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (2019 Executive Director Editorial, p. 2).

Then ASA president Karen Kafadar (2019) wrote in an ASA Newsletter:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

So she appointed a Task Force in 2019. Its full (1 page) report is in The Annals of Applied Statistics, also on my blogpost.[4] The report (Benjamini et al. 2021) begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Among its main points:

“the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”…
P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.
They are important tools that have advanced science through their proper application. …(Benjamini et al. 2021)

According to Matthews:

“For those who saw improper use and misinterpretation as the key issue in the p-value debate, this seemed to miss the point.”

But defending the scientific value of a tool when an Executive Director’s editorial is calling for its abandonment is exactly to the point. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing falsification even of the statistical variety. What’s the point of requiring replication if at no point can you say an effect has failed to replicate?

Maybe the ASA should invite 10-year reflections, or maybe they’re out there and I haven’t seen them.

Please share your queries and thoughts in the comments.

References
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Some related posts (search this blog for others):

March 7, 2016: “Don’t throw out the error control baby with the bad statistics bathwater”
May 21, 2024: 5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)
June 20, 2021: At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
May 31, 2024: 2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest
June 17, 2019: The 2019 ASA executive editor’s guide to p-values: Don’t say what you don’t mean
June 4, 2024: 2-4 year review: commentaries on my editorial
May 15, 2022: 2-4 year review: commentaries on my editorial

My editorial: The statistics wars and intellectual conflicts of interest

[0] p-value. The significance test arises to test the conformity of the particular data under analysis with H₀ in some respect: To do this we find a function t = t(y) of the data, to be called the test statistic, such that

the larger the value of t the more inconsistent are the data with H₀;
the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H₀ is true.

…[We define the] p-value corresponding to any t as p = p(t) = P(T ≥ t; H₀). (Mayo and Cox 2006, p. 81)
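A minimal sketch of this definition, assuming a one-sided test of a Normal mean with known standard deviation, so that T is standard Normal when H₀ holds. The observations, µ₀, σ, and the helper names are all hypothetical and serve only to show the computation.

```python
from math import sqrt
from statistics import NormalDist, fmean

def t_stat(y, mu0, sigma):
    """t(y): standardized distance of the sample mean above mu0."""
    return sqrt(len(y)) * (fmean(y) - mu0) / sigma

def p_value(t):
    """p(t) = P(T >= t; H0), with T ~ Normal(0, 1) when H0 is true."""
    return 1.0 - NormalDist().cdf(t)

y = [0.9, 1.4, 0.2, 1.1, 0.7, 1.3, 0.4, 1.2, 0.8]   # hypothetical observations
t_obs = t_stat(y, mu0=0.5, sigma=1.0)
print(f"t = {t_obs:.3f}, p = {p_value(t_obs):.3f}")
```

Larger values of t, being more inconsistent with H₀, yield smaller p-values, as the first condition requires.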

[1] The 2016 ASA Statement’s six principles: 1. P-values can indicate how incompatible the data are with a specified statistical model. 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. 4. Proper inference requires full reporting and transparency. 5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[2] There are a few critics who are falsificationists, notably Andrew Gelman.

[3] The 2019 ASA [president’s] task force submitted its statement to the ASA in 2020, and for a long time its contents were shrouded in mystery. It eventually was published in 2021 in the Annals of Applied Statistics where Kafadar was editor in chief.

[4] The 2019 Task Force members: Linda Young (Co-Chair), Xuming He (Co-Chair), Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar (Ex-officio). (Kafadar 2020)

Power and Severity with nonsignificant results: more power puzzles? (ii)

Mayo — Sun, 15 Mar 2026 02:43:31 +0000

The concept of a test’s power, originating in Neyman-Pearson’s early work, by and large, is a pre-data concept for purposes of specifying a test (notably, determining worthwhile sample size), and choosing between tests. In some papers, however, Neyman lists a third goal for power: to interpret test results post data much in the spirit of what is often called “power analysis”. This is to determine the discrepancy from a null hypothesis that may be ruled out, given nonsignificant results. One example is in a paper “The Problem of Inductive Inference” (Neyman 1955)–already a surprising title for behaviorist Neyman. The reason I’m bringing this up is that it has direct bearing on some of today’s most puzzling (and problematic) post-data uses of power. Interestingly, in that 1955 paper, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H₀ is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H₀, then…. the attitude described is dangerous [for this n]…. [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H₀ cannot be reasonably considered as anything like a confirmation of H₀. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+.

H₀: µ ≤ µ₀ against H₁: µ > µ₀.

The test statistic d(X) is the standardized sample mean.

The test rule: Infer a (positive) discrepancy from µ₀ iff d(x₀) > cα where cα corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x₀) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present]. Says Neyman:

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > cα; µ = µ₀ + δ)
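To make (1) concrete, here is a minimal sketch of the power computation for test T+, assuming the standardized mean d(X) is Normal with unit variance and mean δ√n/σ when µ = µ₀ + δ. The function name `power_t_plus` and the numbers are my own illustrative assumptions, not Neyman's.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal

def power_t_plus(delta, sigma, n, alpha):
    """P(d(X) > c_alpha; mu = mu0 + delta) for the one-sided test T+."""
    c_alpha = Z.inv_cdf(1 - alpha)        # cutoff for the standardized mean
    shift = delta * n ** 0.5 / sigma      # mean of d(X) when mu = mu0 + delta
    return 1 - Z.cdf(c_alpha - shift)

# Hypothetical numbers: sigma = 1, alpha = 0.025, discrepancy delta = 0.2
for n in (10, 100, 400):
    print(n, round(power_t_plus(0.2, 1.0, n, 0.025), 3))
```

With few observations the power against this δ is "extremely slim", which is the situation Neyman describes; it climbs only as n grows.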

This is rather different than the more behavioristic construal Neyman usually championed. In fact, Neyman sounds like a Cohen-style power analyst!

Still, in standard power analysis, power is calculated relative to an outcome just missing the cutoff cα. This is, in effect, the worst case of a negative (non significant) result. If the actual outcome corresponds to a larger p-value (an even more negative result), it seems to me that should be taken into account in interpreting the results. Do you agree? It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x₀); µ = µ₀ + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ₀ + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
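A small sketch of how (1) and (2) can come apart, under the same Normal model. The observed statistic and the discrepancy (given in standard-error units) are hypothetical values chosen for illustration; they are not the numbers in Mayo and Spanos 2006.

```python
from statistics import NormalDist

Z = NormalDist()

def power(delta_std, alpha):
    """(1): P(d(X) > c_alpha; mu = mu0 + delta), delta in standard-error units."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - delta_std)

def sev_upper(d_obs, delta_std):
    """(2): P(d(X) > d(x0); mu = mu0 + delta), severity for mu < mu0 + delta."""
    return 1 - Z.cdf(d_obs - delta_std)

alpha = 0.025
d_obs = -0.5      # hypothetical: a clearly non-significant outcome
delta_std = 1.0   # hypothetical discrepancy of one standard error

print("power (1):   ", round(power(delta_std, alpha), 3))
print("severity (2):", round(sev_upper(d_obs, delta_std), 3))
```

The only difference between the two functions is that (2) replaces the cutoff with the observed d(x₀); when d(x₀) sits exactly at the cutoff they coincide.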

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–as defined in Mayo (1996). Note that the observed outcome enters in d(x₀), not in the discrepancy under which the probability of d(X) > d(x₀) is computed. It’s really just the p-value corresponding to µ = µ₀ + δ. So, this differs from what some have called “observed power” and I call “shpower” (see this post). Spanos and I called it the severity interpretation for acceptance (SIA); in SIST, it’s also called attained power, and is cashed out in SIN: the severity interpretation of negative results. With SIA and SIN, we consider the value of the observed statistic, rather than the cut-off for rejection or significance. [i] This is a core concept that I claim testers should be using to interpret warranted discrepancies post-data.

The claim in (2) could also be made out viewing the p-value as a random variable, calculating its distribution for various alternatives (Cox 2006, 25). This reasoning yields a basic frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV:¹ A moderate (i.e., non-small) p-value is evidence of the absence of a discrepancy δ from H₀ only if there is a high probability the test would have given a worse fit with H₀ (i.e., smaller p value) were a discrepancy δ to exist.

It is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, with d(x) in excess of the cutoff, the opposite concern arises—namely, the test may be too sensitive to warrant a claimed discrepancy. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough, relative to a claim of interest.²
________________________________________

The full version of the Cox-Mayo frequentist principle of evidence FEV is:

x is evidence of a discrepancy d from H₀ iff, if H₀ is a correct description of the mechanism generating x, then, with high probability a less discordant result would have occurred.

Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as statistical significance test reasoning.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.

Severity did not have to be defined this way, but I felt it was desirable to have a concept or measure that was always good–by contrast to type 1 and 2 errors. However, it means SEV has to be computed relative to what is being inferred. This requires appropriately swapping out the claim H for which one wants to assess SEV.

[i] Cox famously (in Cox and Hinkley) said that power was irrelevant post data. But he agreed that attained power was relevant for interpreting nonsignificant results.

NOTE: This discussion was part of what I dubbed Neyman’s Nursery posts (NN1-NN5). This was the second, NN2. Why I used that term is a long story, but if you’re curious, you can learn about it by searching this blog.

REFERENCES:

Cohen, J. (1992) A Power Primer.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2011), pp. 247-275.

Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, VIII, 13-46.

Neyman, J. (1957), “The Use of the Concept of Power in Agricultural Experimentation,” Journal of the Indian Society of Agricultural Statistics, IX, 9-17.

Continuing the blizzard of 26 power puzzles

Mayo — Wed, 04 Mar 2026 03:42:03 +0000

The mayor of NYC offered $30 an hour to help shovel the ~ 30 inches of snow that fell last Sunday and Monday. From what I hear, it was a very effective program. Here’s a little power puzzle to very easily shovel through [1]

Suppose you are reading about a result x that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H₀: µ ≤ 0 against H₁: µ > 0. I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’. (i.e., there’s poor evidence that µ > µ’ ). I am keeping symbols as simple as possible. *See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?

(Note the qualification that would arise if you were only told the result was statistically significant at some level less than or equal to α rather than, as I intend, that it is just significant at level α, discussed in a comment due to Michael Lew here [0])

Allow that the test assumptions are adequately met (though usually, this is what’s behind the problem). I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise of this post: the probability of correctly rejecting the null–which is both ambiguous and fails to specify the all-important conjectured alternative. That you compute power for several alternatives is not the slightest bit problematic; it’s what you want to do in order to assess the test’s capability to detect discrepancies. There’s a power function. If you knew the true parameter value, why would you be running an inquiry to make statistical inferences about it?

It must be kept in mind that inferences are going to be in the form of µ > µ’ =µ₀+ δ, or µ < µ’ =µ₀+ δ or the like. They are not to point values! (Not even to the point µ =M₀.) Most simply, you may consider that the inference is in terms of the one-sided lower confidence bound (for various confidence levels)–the dual for test T+.

POWER: POW(T+,µ’) = Pr(Test T+ rejects H₀; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection at level α. (Since it’s continuous it doesn’t matter if we write > or ≥). I’m simplifying notation. Also, I’ll leave off the T+ and write POW(µ’).

In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

Let σ = 10, n = 100, so (σ/ √n) = 1. Test T+ rejects H₀ at the .025 level if M > 1.96(1). For simplicity, let the cut-off, M*, be 2.

Test T+ rejects H₀ at ~ .025 level if M > 2.
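The two ways of writing power, Pr(M > M*; µ’) and Pr(P < p*; µ’), pick out the same event. A rough simulation sketch can check this for the test just described; the alternative µ’ = 3, the number of draws, and the seed are my own assumptions, and p* is taken to be the P-value of a result exactly at M* = 2 (about .023 rather than exactly .025, because the cut-off was rounded).

```python
import random
from statistics import NormalDist

Z = NormalDist()
SE, M_STAR = 1.0, 2.0              # sigma/sqrt(n) = 10/sqrt(100); rounded cutoff
p_star = 1 - Z.cdf(M_STAR / SE)    # P-value of a result exactly at the cutoff

def simulate(mu_alt, draws=20000, seed=0):
    """Fraction of simulated sample means that reject, counted two ways."""
    rng = random.Random(seed)
    by_mean = by_pvalue = 0
    for _ in range(draws):
        m = rng.gauss(mu_alt, SE)      # sample mean M generated under mu = mu_alt
        p = 1 - Z.cdf(m / SE)          # its one-sided P-value computed under mu = 0
        by_mean += m > M_STAR
        by_pvalue += p < p_star
    return by_mean / draws, by_pvalue / draws

print(round(p_star, 4))
print(simulate(3.0))
```

Both relative frequencies estimate the same quantity, POW(3), so they should agree with each other and sit near the exact value computed from the Normal distribution.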

CASE 1: We need a µ’ such that POW(µ’) = low. The power against alternatives between the null and the cut-off M* will range from α to .5. Consider the power against the null:

1. POW(µ= 0) = α = .025.

Since the probability of M > 2, under the assumption that µ= 0, is low, the just significant result indicates µ > 0. That is, since power against µ= 0 is low, the statistically significant result is a good indication that µ > 0.

Equivalently, 0 is the lower bound of a .975 confidence interval.

2. For a second example of low power that does not use the null: we get power of .04 if µ’ = M* – 1.75(σ/√n), which in this case is 2 – 1.75 = .25. That is, POW(.25) = .04.[ii]

Equivalently, µ >.25 is the lower confidence interval (CI) at level .96 (this is the CI that is dual to the test T+.)

CASE 2: We need a µ’ such that POW(µ’) = high. POW(M* + 1(σ/ √n)) = .84.

3. That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So POW(T+, µ= 3) = .84.

Should we say that the significant result is a good indication that µ > 3? No, there’s a high probability (.84) you’d have gotten a larger difference than you did, were µ > 3.

Pr(M > 2; µ = 3 ) = Pr(Z > -1) = .84. It would be terrible evidence for µ > 3!
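The power values quoted in Cases 1 and 2 can be checked directly. This is a minimal sketch using the rounded cut-off M* = 2 and σ/√n = 1 from the example; the function name `pow_t_plus` is just a label I am introducing.

```python
from statistics import NormalDist

Z = NormalDist()
SE = 10 / 100 ** 0.5    # sigma / sqrt(n) = 1
M_STAR = 2.0            # rounded cutoff used in the text

def pow_t_plus(mu_alt):
    """POW(mu') = Pr(M > M*; mu = mu')."""
    return 1 - Z.cdf((M_STAR - mu_alt) / SE)

for mu_alt in (0.0, 0.25, 2.0, 3.0):
    print(mu_alt, round(pow_t_plus(mu_alt), 3))
```

Because the cut-off is rounded up from 1.96 to 2, the power against µ = 0 comes out near .023 rather than exactly .025; the values for .25 and 3 match the .04 and .84 in the text.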

Blue curve is the null, red curve is one possible conjectured alternative: µ= 3. Green area is power, little turquoise area is α.

Note that the evidence our result affords µ > µ’ gets worse and worse as we drag µ further and further to the right, even though in so doing we’re increasing the power.
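To see the pattern across alternatives, one can tabulate power next to the probability of a result no larger than the one observed, Pr(M ≤ 2; µ’), which is how I would gauge how well µ > µ’ has been probed by the just significant result. A sketch, with `sev_greater` as my own label for that probability:

```python
from statistics import NormalDist

Z = NormalDist()
SE, M_OBS = 1.0, 2.0   # standard error and the just-significant observed mean

def pow_t_plus(mu_alt):
    """Pr(M > M*; mu'), where the cutoff M* equals the observed mean here."""
    return 1 - Z.cdf((M_OBS - mu_alt) / SE)

def sev_greater(mu_alt):
    """Pr(M <= M_OBS; mu = mu'): how well 'mu > mu'' has been probed."""
    return Z.cdf((M_OBS - mu_alt) / SE)

for mu_alt in (0.0, 0.25, 1.0, 2.0, 3.0, 4.0):
    print(f"mu'={mu_alt:<5} POW={pow_t_plus(mu_alt):.3f}  "
          f"SEV(mu > mu')={sev_greater(mu_alt):.3f}")
```

For a result exactly at the cut-off the two columns sum to 1, so the alternatives against which the test has high power are exactly the ones for which the inference µ > µ’ is poorly warranted.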

As Stephen Senn points out (in my favorite of his guest posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta Δ. Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84), nor one that we believe obtains.

So the correct answer is B.

Does A hold true if we happen to know (based on previous severe tests) that µ <µ’?

No, but it does allow some legitimate ways to mount complaints based on a significant result from a test with low power to detect a known discrepancy.

It does mean that if M* (the cut-off for a result just statistically significant at level α) is used as an estimate of µ, with no standard error given, although we know independently that µ < M*, then M* is larger than µ. This is a tautology, and crucial information about the unreliability of the estimate is hidden. It might be said that your observed result, which we’re assuming is M*, “exaggerates” µ if you were to use it as an estimate without reporting the SE. But that more correctly describes a misuse of tests, which would direct you to use the lower limit of a confidence interval if you were keen to estimate effect size once finding significance. Why is it highly unkosher for a significance tester? Despite knowing µ < M* (one of the givens of the example), µ is estimated as M*, as if there’s no uncertainty.

If the study is in a field known to have lots of researcher flexibility, it might legitimately raise the question of whether the researchers cheated, reporting only the one impressive result after trying and trying again, or tampered with the discretionary points of the study to achieve nominal significance. This is a different issue, and doesn’t change my answer. More generally, it’s because the answer is B that the only way to raise the criticism legitimately is to challenge the assumptions of the test.

Why does the criticism arise illegitimately? There are a few reasons. In some circles it’s a direct result of trying to do a Bayesian computation and setting about to compute Pr(µ = µ’|Mo = M*) using POW(µ’)/α as a kind of likelihood ratio in favor of µ’. Notice that supposing the probability of a type I error goes down as power increases is at odds with the trade-off that we know holds between these error rates. So this immediately indicates a different use of terms.

[1] This is a modified reblog of an earlier post.

*Point on language: “to detect alternative µ'” means, “produce a statistically significant result when µ = µ’.” It does not mean we infer µ’. Nor do we know the underlying µ’ after we see the data, obviously. The power of the test to detect µ’ just refers to the probability the test would produce a result that rings the significance alarm, if the data were generated from a world or experiment where µ = µ’.

A Blizzard of Power Puzzles Replicate in Meta-Research

Mayo — Mon, 02 Mar 2026 03:17:54 +0000

I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC (exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out), I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably, even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: A statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence of its impressive improvement δ’. Often the high power of .8 is even used as a (posterior) probability of the hypothesis of improvement being δ’. [0] If these do not immediately strike you as fallacious, compare:

If the house is fully ablaze, then very probably the fire alarm goes off.
If the fire alarm goes off, then very probably the house is fully ablaze.

The first bullet is saying the fire alarm has high power to detect the house being fully ablaze. It does not mean the converse in the second bullet.
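For the fire alarm, where frequencies of fires and false alarms make sense, a toy calculation shows how far apart the two bullets can be. All three input numbers are invented for illustration; nothing here suggests assigning such probabilities to statistical hypotheses.

```python
def prob_ablaze_given_alarm(p_alarm_if_ablaze, p_alarm_if_not, base_rate):
    """Pr(ablaze | alarm) by Bayes' theorem, for hypothetical inputs."""
    joint_ablaze = p_alarm_if_ablaze * base_rate
    joint_not = p_alarm_if_not * (1 - base_rate)
    return joint_ablaze / (joint_ablaze + joint_not)

# Hypothetical: the alarm almost always sounds in a blaze (0.99), sometimes
# sounds otherwise (0.05, e.g., burnt toast), and blazes are rare (0.001).
print(round(prob_ablaze_given_alarm(0.99, 0.05, 0.001), 3))
```

The first bullet fixes only the 0.99; the second depends on everything else as well.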

Today’s meta-statistical researchers are keen to point up the consequences of using statistical significance tests, figuring out why they lead to the various replication crises in science, and how they may be more honestly viewed. Yet they too use statistical analyses, and these can reflect philosophical and conceptual standpoints that may replicate the same shortcomings that arise in classic criticisms of significance tests. A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests, but these misunderstandings tend to crop up today at the meta-level. When power enters this meta-research, often as a kind of probability of replication, unsurprisingly, the same confusions pop up at the meta-level. But I will not tackle any meta-research in this post. Instead, let’s go back to power howlers that arise in criticisms of tests. I had a blogpost long ago on Ziliak and McCloskey (2008) (Z & M) on power (from Oct. 2011), following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) = Pr(test rejects null at the .01 level| H’(δ) is true).

Say it is 0.85.

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so. They are mistaking (1), defining power, as giving a posterior probability of .85–either to some effect, or specifically to H’(δ)! That is, (1) is being transformed to (1′):

(1’) Pr(H’(δ) is true| test rejects null at .01 level)=.85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct, e.g., the probability H’ is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.

High power as high hurdle. As Aris Spanos (2008) points out, “They have it backwards”. Their reasoning comes from thinking that the higher the power of the test from which statistical significance emerges, the higher the hurdle it has gotten over. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) [i]

Ziliak and McCloskey (2008) tell us: “It is the history of Fisher significance testing. One erects little “significance” hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133) This explains why they suppose high power translates into high hurdles, but it is the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than power would make this abundantly clear. We may coin: The high power = high hurdle (for rejecting the null) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. But this makes it easier to infer H₁. To infer H₁ with severity, H₁ needs to be given a hard time.
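Both halves of the point, that a higher hurdle lowers power and that a larger sample raises it, can be seen in a short sketch for a one-sided Normal test with known σ. The discrepancy, σ, sample sizes and significance levels below are hypothetical choices of mine.

```python
from statistics import NormalDist

Z = NormalDist()

def power(delta, sigma, n, alpha):
    """Pr(reject H0 at level alpha; mu = mu0 + delta) for a one-sided Normal test."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - delta * n ** 0.5 / sigma)

DELTA, SIGMA = 0.2, 1.0   # hypothetical discrepancy and standard deviation

# Raising the hurdle (smaller alpha) with n fixed lowers the power.
by_alpha = [power(DELTA, SIGMA, 100, a) for a in (0.05, 0.01, 0.001)]

# Increasing n with alpha fixed raises the power.
by_n = [power(DELTA, SIGMA, n, 0.01) for n in (25, 100, 400)]

print([round(p, 3) for p in by_alpha])
print([round(p, 3) for p in by_n])
```

The first list decreases and the second increases, which is the sense in which easy rejections with large n stem from high, not low, power.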

Ponder the consequences their construal would have for the required trade-off between type 1 and type 2 error probabilities. (Use the comments to explain what happens.) For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]

What power howlers have you found? Share them in the comments and I’ll add them to my blizzard.

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, volume 1, issue 1: 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.

[0] Some meta-researchers, having brilliantly generated a superpopulation of treatments (from which this one is taken as a random sample), and finding these probabilities don’t hold, take this to show p-values exaggerate effects. I’ll come back to that case, which is a bit different from the one in today’s post.

[i] When it comes to raising the power by increasing sample size, Z & M often make true claims, so it’s odd when there’s a switch, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious.

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’.

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails (or renders probable) data x will get a “B-boost” from x, unless its probability is already 1. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

The next blizzard 26 of power puzzles on this blog is here

Leisurely Cruise February 2026: power, shpower, positive predictive value

Mayo — Thu, 12 Feb 2026 23:03:07 +0000

2025-26 Cruise

2025-6 Leisurely Cruise

The following is the February stop of our leisurely cruise (meeting 6 from my 2020 Seminar at the LSE). There was a guest speaker, Professor David Hand. Slides and videos are below. Ship StatInfasSt may head back to port or continue for an additional stop or two, if there is interest. Although I often say on this blog that the classical notion of power, as defined by Neyman and Pearson, is one of the most misunderstood notions in stat foundations, I did not know, in writing SIST, just how ingrained those misconceptions would become. I’ll write more on this in my next post. (The following is from SIST pp. 354-356; the pages are provided below.)

Shpower and Retrospective Power Analysis

It’s unusual to hear books condemn an approach in a hush-hush sort of way without explaining what’s so bad about it. This is the case with something called post hoc power analysis, practiced by some who live on the outskirts of Power Peninsula. Psst, don’t go there. We hear “there’s a sinister side to statistical power, … I’m referring to post hoc power” (Cumming 2012, pp. 340-1), also called observed power and retrospective (retro) power. I will be calling it shpower analysis. It distorts the logic of ordinary power analysis (from insignificant results). The “post hoc” part comes in because it’s based on the observed results. The trouble is that ordinary power analysis is also post-data. The criticisms are often wrongly taken to reject both.

Shpower evaluates power with respect to the hypothesis that the population effect size (discrepancy) equals the observed effect size, for example, that the parameter μ equals the observed mean. In T+ this would be to set μ = x. Conveniently, their examples use variations on test T+. We may define:

The Shpower of test T+: Pr(X > x_α; μ = x)

The thinking, presumably, is that, since we don’t know the value of μ, we might use the observed x to estimate it, and then compute power in the usual way, except substituting the observed value. But a moment’s thought shows the problem – at least for the purpose of using power analysis to interpret insignificant results. Why?

Since alternative μ is set equal to the observed x, and x is given as statistically insignificant, we know we are in Case 1 from Section 5.1: the power can never exceed 0.5. In other words, since x < x_α, the shpower = POW(T+, μ = x) < 0.5. But power analytic reasoning is all about finding an alternative against which the test has high capability to have rung the significance bell, were that the true parameter value – high power. Shpower is always “slim” (to echo Neyman) against such alternatives. Unsurprisingly, then, shpower analytical reasoning has been roundly criticized in the literature. But the critics think they’re maligning power analytic reasoning.
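A quick sketch of why shpower is capped at 0.5 for insignificant results. The cut-off of 2 and standard error of 1 are illustrative assumptions, as are the observed means.

```python
from statistics import NormalDist

Z = NormalDist()

def shpower(m_obs, m_star=2.0, se=1.0):
    """Pr(M > m_star; mu = m_obs): power evaluated at the observed mean."""
    return 1 - Z.cdf((m_star - m_obs) / se)

# Hypothetical insignificant observed means, up to the cutoff itself.
for m_obs in (-1.0, 0.0, 1.0, 1.9, 2.0):
    print(m_obs, round(shpower(m_obs), 3))
```

Setting the alternative equal to an observed mean below the cut-off puts the cut-off in the upper half of the assumed sampling distribution, so the computed probability cannot exceed one half; it only reaches 0.5 when the observed mean hits the cut-off.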

Now we know the severe tester insists on using attained power Pr(d(X) > d(x₀); μ’) to evaluate severity, but when addressing the criticisms of power analysis, we have to stick to ordinary power:[1]

Ordinary power: POW(μ’): Pr(d(X) > c_α; μ’)
Shpower (aka post hoc or retro power): Pr(d(X) > c_α; μ = x)

An article by Hoenig and Heisey (2001) (“The Abuse of Power”) calls power analysis abusive. Is it? Aris Spanos and I say no (in a 2002 note on them), but the journal declined to publish it [because the deadline for comments had passed]. Since then their slips have spread like kudzu through the literature.

Howlers of Shpower Analysis

Hoenig and Heisey notice that within the class of insignificant results, the more significant the observed x is, the higher the “observed power” against μ = x, until it reaches 0.5 (when x reaches x_α and becomes significant). “That’s backwards!” they howl. It is backwards if “observed power” is defined as shpower. Because, if you were to regard higher shpower as indicating better evidence for the null, you’d be saying the more statistically significant the observed difference (between x and μ₀), the more the evidence of the absence of a discrepancy from the null hypothesis μ₀. That would contradict the logic of tests.

Two fallacies are being committed here. The first we dealt with in discussing Greenland: namely, supposing that a negative result, with high power against μ₁, is evidence for the null rather than merely evidence that μ < μ₁. The more serious fallacy is that their “observed power” is shpower. Neither Cohen nor Neyman defines power analysis this way. It is concluded that power analysis is paradoxical and inconsistent with P-value reasoning. You should really only conclude that shpower analytic reasoning is paradoxical. If you’ve redefined a concept and find that a principle that held with the original concept is contradicted, you should suspect your redefinition. It might have other uses, but there is no warrant to discredit the original notion.

The shpower computation is asking: What’s the probability of getting X > x_α under μ = x? We still have that the larger the power (against μ = x), the better x indicates that μ < x – as in ordinary power analysis – it’s just that the indication is never more than 0.5. Other papers and even instructional manuals (Ellis 2010) assume shpower as what retrospective power analysis must mean, and ridicule it because “a nonsignificant result will almost always be associated with low statistical power” (p. 60). Not so. I’m afraid that observed power and retrospective power are all used in the literature to mean shpower. What about my use of severity? Severity will replace the cutoff for rejection with the observed value of the test statistic (i.e., Pr(d(X) > d(x₀); μ₁)), but not the parameter value μ. You might say, we don’t know the value of μ₁. True, but that doesn’t stop us from forming power or severity curves and interpreting results accordingly. Let’s leave shpower and consider criticisms of ordinary power analysis. Again, pointing to Hoenig and Heisey’s article (2001) is ubiquitous.
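The three quantities are easy to tell apart once they are computed side by side. A sketch with hypothetical numbers (cut-off 2, standard error 1, observed mean 1); the function names are mine.

```python
from statistics import NormalDist

Z = NormalDist()
SE, M_STAR, M_OBS = 1.0, 2.0, 1.0   # hypothetical: observed mean 1 misses cutoff 2

def ordinary_power(mu1):
    """Pr(M > M*; mu1): the cutoff is used, mu1 is a conjectured alternative."""
    return 1 - Z.cdf((M_STAR - mu1) / SE)

def severity_less_than(mu1):
    """Pr(M > m_obs; mu1): the observed mean replaces the cutoff, for mu < mu1."""
    return 1 - Z.cdf((M_OBS - mu1) / SE)

shpower = ordinary_power(M_OBS)     # power evaluated at mu = observed mean

print("shpower:", round(shpower, 3))
for mu1 in (1.0, 2.0, 3.0, 4.0):
    print(mu1, round(ordinary_power(mu1), 3), round(severity_less_than(mu1), 3))
```

Shpower is a single number tied to the observed mean, whereas ordinary power and severity are curves over conjectured values μ₁; for an insignificant result the severity curve lies at or above the power curve because the observed mean is below the cut-off.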

[1] In deciphering existing discussions on ordinary power analysis, we can suppose that d(x₀) happens to be exactly at the cut-off for rejection, in discussing significant results; and just misses the cut-off for discussions on insignificant results in test T+. Then att-power for μ₁ equals ordinary power for μ₁ .

Reading:

SIST Excursion 5 Tour I (pp. 323-332; 338-344; 346-352), Tour II (pp. 353-6; 361-370), and Farewell Keepsake pp. 436-444

Recommended (if time) What Ever Happened to Bayesian Foundations (Excursion 6 Tour I)

Mayo Memos for Meeting 6:

-Souvenirs Meeting 6: W: The Severity Interpretation of Negative Results (SIN) for Test T+; X: Power and Severity Analysis; Z: Understanding Tribal Warfare

-Selected blogposts on Power

05/08/17: How to tell what’s true about power if you’re practicing within the error-statistical tribe
12/12/17: How to avoid making mountains out of molehills (using power and severity)

There is also a guest speaker: Professor David Hand:
“Trustworthiness of Statistical Analysis”

_______________________________________________________________________________________________________

Slides & Video Links for Meeting 6:

Slides: Mayo 2nd Draft slides for 25 June (not beautiful)

Video of Meeting #6: (Viewing Videos in full screen mode helps with buffering issues.)
VIDEO LINK: https://wp.me/abBgTB-mZ

VIDEO LINK to David Hand’s Presentation: https://wp.me/abBgTB-mS
David Hand’s recorded Powerpoint slides: https://wp.me/abBgTB-n4
AUDIO LINK to David Hand’s Presentation & Discussion: https://wp.me/abBgTB-nm

Another link is here.

Please share your thoughts and queries in the comments.

Severe testing of deep learning models of cognition (ii)

Mayo — Thu, 29 Jan 2026 17:31:08 +0000

Butterfly brain

From time to time I hear of an application of the severe testing philosophy in intriguing ways in fields I know very little about. An example is a recent article by cognitive psychologist Jeffrey Bowers and colleagues (2023): “On the importance of severely testing deep learning models of cognition” (abstract below). Because deep neural networks (DNNs)–advanced machine learning models–seem to recognize images of objects at a similar or even better rate than humans, many researchers suppose DNNs learn to recognize objects in a way similar to humans. However, Bowers and colleagues argue that, on closer inspection, the evidence is remarkably weak, and “in order to address this problem, we argue that the philosophy of severe testing is needed”.

The problem is this. Deep learning models, after all, consist of millions of (largely uninterpretable) parameters. Without understanding how the black box model moves from inputs to outputs, it’s easy to see how observed correlations can arise even where the DNN output is due to a variety of factors other than a mechanism similar to that of the human visual system. From the standpoint of severe testing, this is a familiar mistake. For data to provide evidence for a claim, it does not suffice that the claim agrees with data; the method must have been capable of revealing the claim to be false, (just) if it is. Here the type of claim of interest is that a given algorithmic model uses features or mechanisms similar to those humans use to categorize images.[1] The problem isn’t the engineering one of getting more accurate algorithmic models; the problem is inferring claim C: DNNs mimic human cognition in some sense (they focus on vision), even though C has not been well probed.

“Contrary to the model comparison approach that is popular in deep learning applications to cognitive/neural modeling it will be argued that the mere advantage of one model over the other in predicting domain-relevant data is wholly insufficient even as the weakest evidentiary standard”

—in particular, “weak severity”.[2] Bowers et al. argue that many celebrated demonstrations of human–DNN similarity fail even this minimal standard of evidence: nothing has been done that would have found C false, even if it is. While the authors grant that “many questions arise as we attempt to unfold what severity requirements mean in practice. … current testing does not even come close to any reasonable severity requirement”. [Pursuing their questions about applying severity deserves a separate discussion.]

While the experiments are artificial, as with all experiments, they do seem to replicate known features of human vision such as identifying objects by shape rather than texture, and a human sensitivity to relations between parts. Although similar patterns may be found in DNNs, once researchers scratch a bit below the surface using genuinely probative tests, the agreement collapses. One example concerns a disagreement in how DNNs vs humans perceive relations between parts of objects, such as one part being above the other. An experiment goes something like this (I do not claim any expertise): Humans and DNNs are trained on labeled examples and must infer what matters to the classification. In fact the classification depends on how the parts are arranged, but no rule is given. When shown a new object whose parts stand in the same relations as before, humans typically regard it as belonging to the same category. Not so for objects with the same parts but with those relations altered—e.g., what was above is now below–at least for humans.

The DNN appears to do just as well until the relationship, but not the parts, is swapped (e.g., what was above is now below). Keeping the parts the same, in other words, while changing the relation, the model does not change its classification. Humans do. The DNN, it appears, is tracking features (the parts) that suffice for prediction during training, rather than the relational structure that humans infer as explaining the classification. For more examples and discussion (in The Behavioral and Brain Sciences), see Bowers et al. (2022).
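
A toy illustration may help show why agreement on training-like cases is not a severe test here. Everything in it is hypothetical: the two “classifiers” are simple lookup rules devised for this sketch, stand-ins rather than DNNs or human subjects, and the shapes and category labels are invented. Each object is a pair (part above, part below).

```python
# Each object is (part_above, part_below); labels are invented categories.
training = [
    (("circle", "square"), "A"),
    (("triangle", "star"), "B"),
]

def parts_only_rule(examples):
    """Hypothetical learner that tracks which parts occur, ignoring order."""
    table = {frozenset(obj): label for obj, label in examples}
    return lambda obj: table.get(frozenset(obj), "other")

def relational_rule(examples):
    """Hypothetical learner that tracks which part is above which."""
    table = {obj: label for obj, label in examples}
    return lambda obj: table.get(obj, "other")

parts_model = parts_only_rule(training)
relation_model = relational_rule(training)

# Both rules agree with every training label.
train_agreement = all(
    parts_model(obj) == label == relation_model(obj) for obj, label in training
)

# Probe: same parts as category A, but the relation is swapped.
probe = ("square", "circle")
parts_answer = parts_model(probe)        # still says "A"
relation_answer = relation_model(probe)  # no longer says "A"
print(train_agreement, parts_answer, relation_answer)
```

Both rules fit the training examples perfectly, so the training data cannot tell them apart. Only the probe that swaps the relation while keeping the parts had the capacity to reveal the difference.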

They ask: “Why is there so little severe testing in this domain? We argue that part of the problem lies with the peer-review system that incentivizes researchers to carry out research designed to highlight DNN-human similarities and minimize differences.” As a result, researchers are incentivized—by peer review, publication practices, and the culture of the field—to design experiments that show agreement rather than ones that seriously risk falsifying claims of DNN-human similarities.

“Indeed, reviewers and editors often claim that “negative results” — i.e., results that falsify strong claims of similarity between humans and DNNs — are not enough and that “solutions” — i.e., models that report DNN-human similarities – are needed for publishing in the top venues…”

We should not equate the use of “negative results” in this context with common “null” results in statistical significance tests. Commonly, null results in statistical significance tests merely fail to provide evidence for an effect (or indicate upper bounds to the effect size, based on a proper use of power). In the human/DNN studies, by contrast, the sense of “negative” is closer to a falsification, or at least a serious undermining, of the proposed claim C that the DNNs rely on the same or similar mechanisms as humans, in regard to a certain process. As they put it, ‘negative results’ are “results that falsify strong claims of similarity between humans and DNNs”:

“the main objective of researchers comparing DNNs to humans is to better understand the brain through DNNs. If apparent DNN-human similarities are mediated by qualitatively different systems, then the claim that DNNs are good models of brains is simply wrong.”

The relational experiments show the two are tracking different things. As such, I recommend they call them “falsifying results” rather than ‘negative’, since we are generally entitled to say something much weaker in the case of “null” results in standard statistical tests (with a no effect null). Even “fixing” the DNN to match the human output does not restore the claim that the two systems are tracking the same things–so far as I understand what’s going on here. (We assume there wasn’t some underlying flaw with the apparently falsifying experiments.)
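
For contrast, here is a rough sketch of what a statistically insignificant (“null”) result does license in a standard test. The setup is assumed for illustration: a one-sided test of a normal mean (H₀: μ ≤ 0 vs. H₁: μ > 0) with known σ, and made-up numbers. The function computes how severely the claim μ ≤ μ₁ has passed, namely the probability of observing a larger sample mean than the one obtained, were μ equal to μ₁.

```python
from math import sqrt
from statistics import NormalDist

def severity_upper_bound(xbar, mu1, sigma, n):
    """Severity for the claim mu <= mu1 after a non-significant result.

    Computed as P(sample mean > xbar; mu = mu1) in a one-sided
    normal test with known sigma (an illustrative assumption).
    """
    se = sigma / sqrt(n)
    return 1.0 - NormalDist().cdf((xbar - mu1) / se)

# Made-up numbers: sigma = 1, n = 100, observed mean 0.1 (one standard error).
sev_small = severity_upper_bound(xbar=0.1, mu1=0.1, sigma=1.0, n=100)
sev_large = severity_upper_bound(xbar=0.1, mu1=0.3, sigma=1.0, n=100)
print(f"SEV(mu <= 0.1) = {sev_small:.3f}")
print(f"SEV(mu <= 0.3) = {sev_large:.3f}")
```

With these numbers, the upper bound μ ≤ 0.3 passes with high severity, while μ ≤ 0.1 does not (severity .5). That is an upper bound on an effect size, something much weaker than falsifying a claim of similarity.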

Of most interest is that the authors stress a constructive upshot, “that following the principles of severe testing is likely to steer empirical deep learning approaches to brain and cognitive science onto a more constructive direction”. Encouragingly the most updated benchmark tests seem to bear this out, but as an outsider, I can only speculate. [I will write to the authors and report on their sense of a shift in the field.] Whether this shift will become widespread remains to be seen, but it marks a welcome and interesting move toward more severe and genuinely informative testing in these experiments.

[1] The claim, of course, could be something else, such as, DNNs are useful for understanding the relationships between DNNs and human cognition.

[2] Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

Abstract of Bowers et al. (2023) Researchers studying the correspondences between Deep Neural Networks (DNNs) and humans often give little consideration to severe testing when drawing conclusions from empirical findings, and this is impeding progress in building better models of minds. We first detail what we mean by severe testing and highlight how this is especially important when working with opaque models with many free parameters that may solve a given task in multiple different ways. Second, we provide multiple examples of researchers making strong claims regarding DNN-human similarities without engaging in severe testing of their hypotheses. Third, we consider why severe testing is undervalued. We provide evidence that part of the fault lies with the review process. There is now a widespread appreciation in many areas of science that a bias for publishing positive results (among other practices) is leading to a credibility crisis, but there seems less awareness of the problem here

Reference

Bowers, J. S., Malhotra, G., Dujmović, M., Llera Montero, M., Tsvetkov, C., Biscione, V., Puebla, G., Adolfi, F., Hummel, J. E., Heaton, R. F., Evans, B. D., Mitchell, J., & Blything, R. (2023). Deep problems with neural network models of human vision, commentary & response. The Behavioral and Brain Sciences, 46, Article e385. https://doi.org/10.1017/S0140525X22002813

(JAN #2) Leisurely cruise January 2026: Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

Mayo — Sat, 17 Jan 2026 00:32:02 +0000

2025-26 Cruise


Our second stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour II which you can read here. The criticism that P-values exaggerate the evidence takes a number of forms. Here I consider the best known. The bottom line is that one should not suppose that quantities measuring different things ought to be equal. At the bottom you will see links to posts discussing this issue, each with a large number of comments. The comments from readers are of interest! We will have a zoom meeting Fri Jan 23 11AM ET on these last two posts.* If you want to join us, contact us.

getting beyond…

Excerpt from Excursion 4 Tour II*

4.4 Do P-Values Exaggerate the Evidence?

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π₀ of 1/2 on a point null H₀ (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H₀.

More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.
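
To make the first answer concrete, here is a minimal sketch of the kind of calculation behind it. The setup is an assumption, not something fixed by the passage above: a two-sided test of a normal mean with known σ, prior weight 1/2 on the point null μ = 0, and, under the alternative, a normal prior on μ centered at 0 with standard deviation τ = σ. The function names are illustrative only.

```python
import math

def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def posterior_null(z, n, prior_null=0.5, tau_over_sigma=1.0):
    """Posterior probability of the point null mu = 0.

    Assumes (for illustration) mu ~ N(0, tau^2) under the alternative.
    """
    r = n * tau_over_sigma ** 2  # the ratio n * tau^2 / sigma^2
    # Bayes factor of the null against the alternative
    bf01 = math.sqrt(1 + r) * math.exp(-0.5 * z ** 2 * r / (1 + r))
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf01
    return post_odds / (1 + post_odds)

z = 1.96
p_value = two_sided_p(z)
post_100 = posterior_null(z, n=100)
print(f"P-value: {p_value:.3f}")
print(f"Posterior on H0 (n = 100): {post_100:.3f}")
```

With a result just significant at the .05 level (z = 1.96) and n = 100, the posterior on the null under these assumptions comes out near .6, far above the P-value of .05. Change the prior under the alternative and the posterior changes with it, which is part of the point: the two numbers are measuring different things.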

You might react by observing that: (a) P-values are not intended as posteriors in H₀ (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, H₀. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a P-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:

. . . [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone. (Senn 2001b, p. 202)

To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the P-value) is one of the probabilist ones?

All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.

Getting Beyond “I’m Rubber and You’re Glue”. The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.

To visit the core arguments, we travel to 1987 to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through quicksand of Excursion 3, Tour II, are about to pay large dividends.

*I decided to put off the Birnbaum zoom until Feb. (date TBA). Jan 23 will cover this and the last post. 11-12:30 (OK to arrive late/leave early).

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Readers can find blogposts that trace out the discussion of this topic, as I was developing it, along with comments. The following 2 are central:

(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised) 71 comments

(7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious? 39 comments

Where you are in the journey.

(JAN #1) Leisurely Cruise January 2026: Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

Mayo — Thu, 08 Jan 2026 21:19:05 +0000

2025-26 Cruise

Our first stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour I which you can read here. I hope that this will give you the chutzpah to push back in 2026, if you hear that objectivity in science is just a myth. This leisurely tour may be a bit more leisurely than I intended, but this is philosophy, so slow blogging is best. (Plus, we’ve had some poor sailing weather). Please use the comments to share thoughts.

Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276) [i]

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many will have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

The Key Is Getting Pushback! While knowledge gaps leave plenty of room for biases, arbitrariness, and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. We get pushback! This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. Explicit attention needs to be paid to communicating results to set the stage for others to check, debate, and extend the inferences reached. Which conclusions are likely to stand up? Where do the weakest parts remain? Don’t let anyone say you can’t hold them to an objective account.

Excursion 2, Tour II led us from a Popperian tribe to a workable demarcation for scientific inquiry. That will serve as our guide now for scrutinizing the myth of the myth of objectivity. First, good sciences put claims to the test of refutation, and must be able to embark on an inquiry to pin down the sources of any apparent effects. Second, refuted claims aren’t held on to in the face of anomalies and failed replications; they are treated as refuted in further work (at least provisionally); well-corroborated claims are used to build on theory or method: science is not just stamp collecting. The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity. In statistical design, day-to-day tricks of the trade to combat bias are consciously amplified and made systematic. It is not because of a “disinterested stance” that we invent such methods; it is that we, quite competitively and self-interestedly, want our theories to succeed in the market place of ideas.

Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Successful scrutiny is very different from success at grants, getting publications and honors. That is why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data and for resisting the urge to cut corners are being adopted in some fields. Fortunately, for me, our travels will bypass lands of policy recommendations, where I have no special expertise. I will stop at the perimeters of scrutiny of methods which at least provide us citizen scientists armor against being misled. Still, if the allure of carrots has grown stronger than the sticks, we need stronger sticks.

Problems of objectivity in statistical inference are deeply intertwined with a jungle of philosophical problems, in particular with questions about what objectivity demands, and disagreements about “objective versus subjective” probability. On to the jungle!

[i] Mayo and Cox (2010), “Objectivity and Conditionality in Frequentist Inference”, is the paper that led me to the critical analysis of Birnbaum on the Likelihood Principle. How could I write on “conditionality” if it leads to renouncing error probabilities? I asked David Cox. We agreed that it did not.

*From Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP)

To see where you are in the book, check the full Itinerary here.
If you want to follow us, write to jemille6@vt.edu, for a clean copy of the readings.

Midnight With Birnbaum: Happy New Year 2026!

Mayo — Thu, 01 Jan 2026 04:59:36 +0000

Anyone here remember that old Woody Allen movie, “Midnight in Paris,” where the main character (I forget who plays it, I saw it on a plane), a writer finishing a novel, steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, ever since I began this blog in 2011, I imagine being picked up in a mysterious taxi at midnight on New Year’s Eve, and lo and behold, find myself in the 1960s New York City, in the company of Allan Birnbaum who is looking deeply contemplative, perhaps studying his 1962 paper…Birnbaum reveals some new and surprising twists this year! [i]

(The pic on the left is the only blurry image I have of the club I’m taken to.) It has been a decade since I published my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), which includes commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. Not only does the (Strong) Likelihood Principle (LP or SLP) remain at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general, but a decade after my 2014 paper, it is more central than ever–even if it is often unrecognized.

OUR EXCHANGE:

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to have published on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (STINT, 2018, CUP).

ERROR STATISTICIAN: You’ve read my book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found the problem in 2006, when I was writing something on “conditioning” with David Cox. [ii] Sorry,…I know it’s famous…

BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!

ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.

BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.

ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.

BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:

(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)

ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.
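
[Readers who want to check the gap between p’ and p” can simulate it. The sketch below makes assumptions the dialogue does not spell out: σ is known and equal to 1, the experimenter checks after every single observation, and sampling stops at the first 2-standard deviation difference or at n = 100, whichever comes first. The estimate depends on the seed and on how often one looks, so treat the output as a ballpark figure.]

```python
import math
import random

def fixed_sample_p(z):
    """Two-sided p-value for a z-standard deviation difference, fixed n."""
    return math.erfc(abs(z) / math.sqrt(2))

def optional_stopping_p(max_n=100, threshold=2.0, trials=5000, seed=1):
    """Monte Carlo estimate of the probability, under the null (mu = 0),
    of reaching a |z| >= threshold difference at some n <= max_n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)       # one more observation
            if abs(total) / math.sqrt(n) >= threshold:
                hits += 1                      # stopped with "significance"
                break
    return hits / trials

p_fixed = fixed_sample_p(2.0)
p_optional = optional_stopping_p()
print(f"fixed sample size:  {p_fixed:.3f}")
print(f"optional stopping:  {p_optional:.3f}")
```

Under these assumptions the chance of reaching a nominally significant 2-standard deviation difference by n = 100, with no real effect, is several times the fixed sample size value.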

BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.

ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.

BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB-experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.

(They fill their glasses again)

ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, optional stopping experiment E”, has an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?

BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.

ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n) the result is reported as x’, as if it came from E’ (fixed sample size), and as a result of this strange type of a mixture experiment.

BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n = 100) or E” (optional stopping, which happens to stop at the 100th trial). That’s how I sometimes formulate a BB-experiment.

ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.

BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP which frequentists accept. Yes? (The weak LP is the same as the sufficiency principle).

ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB-experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?

BIRNBAUM: I don’t think anyone has ever called it that.

ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?

BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2

Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).

ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?

BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment.

My this drink is sour!

ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.

BIRNBAUM: Perhaps you’re in want of a gene; never mind.

I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).

ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.

BIRNBAUM: Yes, the BB-experiment computes the P-value in an unconditional manner: it takes the convex combination over the 2 ways the result could have come about.
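
[The arithmetic of the BB-experiment, using the approximate p-values from the dialogue (.05 and .3) and the fair coin. The function name is invented for illustration.]

```python
def bb_p_value(p_fixed, p_optional, coin=0.5):
    """Unconditional p-value in the imagined mixture: a convex combination
    of the two p-values, weighted by the coin flip."""
    return coin * p_fixed + (1 - coin) * p_optional

p_prime = 0.05         # fixed sample size experiment E'
p_double_prime = 0.30  # optional stopping experiment E''

p_bb = bb_p_value(p_prime, p_double_prime)
print(f"BB-experiment report: {p_bb:.3f}")
print(f"larger than p':   {p_bb > p_prime}")
print(f"smaller than p'': {p_bb < p_double_prime}")
```

The averaged report sits strictly between the two: it overstates the p-value for the fixed sample size result and understates it for the optional stopping result, which is the complaint just voiced.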

ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions, it is an analytical or mathematical result, so long as we grant being within your BB experiment.

BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to sufficiency, it is just a matter of mathematical equivalence.

By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.

ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also we should come back to the “other cases” at some point….)

BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”

ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!

BIRNBAUM: So far all of this was step (1).

ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?

BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.

This gives us premise (2a):

(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB- experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in STINT p. 44 (imagining the experimenter keeps taking 10 more).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?

BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.

(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then

x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.

BIRNBAUM: Yes. There was no need to repeat the whole spiel.

ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.

BIRNBAUM: Yes, the LP (or SLP, to indicate it’s the strong LP) is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?

ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.

BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?

ERROR STATISTICAL PHILOSOPHER: Well the WCP originally refers to actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas, you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP? I don’t know what the sample size will be ahead of time.

BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice

(1), (2a) and (2b) yield the strong LP!

Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).

ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).

BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?

(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)

ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. The one I find most satisfactory is in Mayo (2014). But, given we’ve been partying, here’s a very simple one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:

Step 1 requires us to analyze results in accordance with a BB- experiment. If we do so, true enough we get:

premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):

That is because in either case, the p-value would be (p’ + p”)/2

Step 2 now insists that we should NOT calculate evidential import as if we were in a BB- experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:

premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.

premise (2b): outcome x’ (within a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.

If (1) is true, then (2a) and (2b) must be false!

If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:

The average p-value (p’ + p”)/2 = p” which is false.

Likewise if (1) is true, then (2b) is asserting:

the average p-value (p’ + p”)/2 = p’ which is false

Alternatively, we can see what goes wrong by realizing:

If (2a) and (2b) are true, then premise (1) must be false.

In short your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB- experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.

I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).
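(An aside for readers who want to check the arithmetic. The following is a minimal Python sketch, not part of the original exchange. It takes p’ = .05 and p” = .37 from the dialogue; the function names `bb_report` and `premises_hold_together` are hypothetical, chosen only for illustration. It shows that, holding the BB stipulation fixed, the averaged report can equal both conditional reports only when p’ = p”, that is, only when there is no LP violation to begin with.)

```python
import math

def bb_report(p_fixed, p_optional):
    """P-value reported within a BB-experiment: the average of p' and p''."""
    return (p_fixed + p_optional) / 2

def premises_hold_together(p_fixed, p_optional):
    """Check whether, keeping the BB stipulation of premise (1) fixed,
    premises (2a) and (2b) can both be true."""
    p_bb = bb_report(p_fixed, p_optional)
    holds_2a = math.isclose(p_bb, p_optional)  # (2a): x'' should be reported as p''
    holds_2b = math.isclose(p_bb, p_fixed)     # (2b): x' should be reported as p'
    return holds_2a and holds_2b

# Values from the dialogue: p' = .05 (fixed sample size), p'' = .37 (optional stopping)
print(premises_hold_together(0.05, 0.37))
# With no LP violation (p' = p''), the premises are jointly satisfiable
print(premises_hold_together(0.05, 0.05))
```

The first call corresponds to premise (0), an LP violation; the second is the degenerate case in which there was nothing to prove.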

BIRNBAUM: Yet some people still think it is a breakthrough. I never agreed to go as far as Jimmy Savage wanted me to, namely, to be a Bayesian….

ERROR STATISTICAL PHILOSOPHER: I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science? The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence, going in 2 directions. Slides from a presentation may be found on this blogpost.

BIRNBAUM: Yes, the “monster of the LP” arises from viewing WCP as an equivalence, instead of going in one direction (from mixtures to the known result).

ERROR STATISTICAL PHILOSOPHER: In my 2014 paper (unlike my earlier treatments) I too construe WCP as giving an “equivalence” but there is an equivocation that invalidates the purported move to the LP.

On the one hand, it’s true that if z is known (and known for example to have come from optional stopping), it’s irrelevant that it could have resulted from either fixed sample testing or optional stopping.

But it does not follow that if z is known, it’s irrelevant whether it resulted from fixed sample testing or optional stopping. It’s the slippery slide into this second statement–which surely sounds the same as the first–that makes your argument such a brain buster. (Mayo 2014)

BIRNBAUM: Yes I have seen your 2014 paper! Your Rejoinder to some of the critics is gutsy, to say the least. I’ve also seen the slides on your blog.

ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! I haven’t kept it up that much lately; blogs have fallen out of fashion.

BIRNBAUM: As has inferential statistics it seems–it’s all AI/ML. But I have to admit that CHAT GPT illuminates at least part of your argument as to why my reasoning was flawed.

ERROR STATISTICAL PHILOSOPHER: I never thought to check CHAT GPT on my paper, that’s amazing.

BIRNBAUM: Here is what I found on the Chatbot:

CHAT GPT

Birnbaum’s Argument and the Likelihood Principle

In his 1962 paper, Birnbaum argued that if frequentists accept two principles—sufficiency and weak conditionality—they are logically compelled to accept the likelihood principle. The likelihood principle states that all the evidence in data is contained in the likelihood function, meaning that the sampling distribution (and hence frequentist error probabilities) is irrelevant to evidential assessment….

Error Statistician’s Dilemma

If Birnbaum’s argument is correct, then frequentist methods (which rely on error probabilities) would be rendered irrelevant for assessing evidence. This would make it difficult for frequentists to defend their approach as coherent, particularly in the face of Bayesian methods that naturally adhere to the likelihood principle.

However, Deborah Mayo, in her 2014 work, critiques Birnbaum’s argument, exposing a logical flaw in his alleged proof.

BIRNBAUM: The bot does not get your argument right. The whole experience has encouraged me to write the first draft of a completely revised paper, reflecting a large advance in my thinking on this. It’s not quite ready to share….

ERROR STATISTICAL PHILOSOPHER: Wow! I’d love to read it…have you identified the problem? In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend. I just want to know, and won’t share your answer with anyone….

(She notices Birnbaum is holding a paper on long legal-sized yellow sheets filled with tiny hand-written comments, covering both sides.)

Sudden interruption by the waiter:

WAITER: Who gets the tab?

BIRNBAUM: I do. To Elbar Grease! To Severe Testing!
Happy New Year!

BIRNBAUM (looking wistful): Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)

ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question,…you did uncover the flaw in your argument, yes?

WAITER: We’re closing now; shall I call a Taxi?

BIRNBAUM: Yes, yes!

ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?

MANAGER: We’re closing now; I’m sorry you must leave.

ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….

BIRNBAUM: I predict that 2026 will be the year that people will finally take seriously your paper from a decade ago (30 years from your Lakatos Prize)!

ERROR STATISTICAL PHILOSOPHER: I’ll drink to that!

Suddenly a large group of people bustle past the manager…it’s all chaos.

Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)

Link to complete discussion:

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.

[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as classic background papers may be found in this blogpost. A link to slides and video of a very introductory presentation of my argument from the 2021 Phil Stat Forum is here.

January 7: “Putting the Brakes on the Breakthrough: On the Birnbaum Argument for the Strong Likelihood Principle” (D.Mayo)

[ii] In 2023 I wrote a paper on Cox’s statistical philosophy. Sadly he died in 2022. (The first David R. Cox Foundations of Statistics Prize, currently given by the ASA on even-numbered years, was awarded to Nancy Reid at the JSM 2023. The second went to Phil Dawid. The Award is now to be given yearly, thanks to the contributions of Friends of David Cox (on this blog!))

For those who want to binge read the (Strong) Likelihood Principle in 2025

Mayo — Wed, 31 Dec 2025 04:56:21 +0000

David Cox’s famous “weighing machine” example from my last post is thought to have caused “a subtle earthquake” in foundations of statistics. It’s been 11 years since I published my Statistical Science article on this, Mayo (2014), which includes several commentators, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2026 is the year that people will return to it again. It’s at least touched upon in Roderick Little’s new book (pic below). This post gives some background, and collects the essential links that you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity.

What’s it all about? An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E₁ and E₂ with different probability models f₁, f₂, but with the same unknown parameter θ, if outcomes x* and y* (from E₁ and E₂ respectively) determine the same (i.e., proportional) likelihood function (f₁(x*; θ) = cf₂(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

Violation of SLP:

A violation occurs whenever outcomes x* and y* come from experiments E₁ and E₂ with different probability models f₁, f₂ but the same unknown parameter θ, and f₁(x*; θ) = cf₂(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.

For an example of an SLP violation, E₁ might be sampling from a Normal distribution with a fixed sample size n, and E₂ the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).

The SLP tells us (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E₁, where n was fixed, say, at 100, and experiment E₂ where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
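To see the difference in error probabilities concretely, here is a minimal simulation sketch in Python. It is an illustration under stated assumptions, not a computation taken from any of the papers: the null θ = 0 is true, σ = 1 is known, the 2-standard-deviation cutoff is taken to be 1.96, and the optional stopping rule is capped at `max_n` = 100 observations (the rule described above has no cap). The function name and its parameters are hypothetical.

```python
import math
import random

def simulate_rejection_rates(max_n=100, cutoff=1.96, reps=5000, seed=0):
    """Under H0: theta = 0 with known sigma = 1, estimate how often a nominal
    2-sided z-test rejects (a) with fixed sample size n = max_n (E1) and
    (b) under optional stopping that checks after every observation (E2)."""
    rng = random.Random(seed)
    fixed_rejections = 0
    optional_rejections = 0
    for _ in range(reps):
        running_sum = 0.0
        ever_crossed = False
        z = 0.0
        for k in range(1, max_n + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(k)  # standardized distance from theta = 0
            if abs(z) >= cutoff:
                ever_crossed = True  # E2 would have stopped and rejected here
        if abs(z) >= cutoff:  # E1 looks only once, at n = max_n
            fixed_rejections += 1
        if ever_crossed:
            optional_rejections += 1
    return fixed_rejections / reps, optional_rejections / reps

fixed_rate, optional_rate = simulate_rejection_rates()
print(f"fixed n: {fixed_rate:.3f}   optional stopping: {optional_rate:.3f}")
```

Under these assumptions the fixed-sample test rejects a true null about 5% of the time, while the try-and-try-again rule rejects far more often. The likelihoods are proportional in the two cases, but the sampling distributions are not the same, and that is the source of the SLP violation.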

———————-

Now for the surprising part: In Cox’s weighing machine example, recall, a coin is flipped to decide which of two experiments to perform. David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which E_i produced the measurement, the assessment should be in terms of the properties of the particular E_i. Nothing could be more obvious.

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixture experiments, and so uncontroversial a principle as sufficiency (SP)–although even that has been shown to be optional to the argument, strictly speaking. Were this true, it would preclude the use of sampling distributions. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports that [(WCP and SP) entails SLP], in fact data may violate the SLP while holding both the WCP and SP. Such cases also directly refute [WCP entails SLP].

Binge reading the Likelihood Principle.

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums–or if it comes up during 2025, I’ve pasted most of the early historical sources below. The argument is simple; showing what’s wrong with it took a long time.

My earliest treatment, via counterexample, is in Mayo (2010)–in an appendix to a paper I wrote with David Cox on objectivity and conditionality in frequentist inference. But the treatment in the appendix doesn’t go far enough, so if you’re interested, it’s best to just check out Mayo (2014) in Statistical Science.[ii] An intermediate paper Mayo (2013) corresponds to a talk I presented at the JSM in 2013.

Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend).

This conundrum is relevant to the very notion of “evidence”, blithely taken for granted in both statistics and philosophy. [iii] There’s no statistics involved, just logic and language. My 2014 paper shows the logical problem, but I still think that it will take an astute philosopher of language to adequately classify the linguistic fallacy being committed.

To have a list for binging, I’ve grouped some key readings below.

Classic Birnbaum Papers:

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.
Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962), “Discussion on Birnbaum’s On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 307-326.
Birnbaum, A. (1969), “Concepts of Statistical Evidence”. In Ernest Nagel, Sidney Morgenbesser, Patrick Suppes & Morton Gabriel White (eds.), Philosophy, Science, and Method. New York: St. Martin’s Press, pp. 112–143.
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference” (letter to the editor), Nature 225, 1033.
Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, Journal of the American Statistical Association, 67(340), 858-861.

Note to Reader: If you look at the (1962) “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

Durbin:

Durbin, J. (1970), “On Birnbaum’s Theorem on the Relation Between Sufficiency, Conditionality and Likelihood”, Journal of the American Statistical Association, Vol. 65, No. 329 (Mar., 1970), pp. 395-398.
Savage, L. J., (1970), “Comments on a Weakened Principle of Conditionality”, Journal of the American Statistical Association, Vol. 65, No. 329 (Mar., 1970), pp. 399-401.
Birnbaum, A. (1970), “On Durbin’s Modified Principle of Conditionality”, Journal of the American Statistical Association, Vol. 65, No. 329 (Mar., 1970), pp. 402-403.

There’s also a good discussion in Cox and Hinkley 1974.

Evans, Fraser, and Monette:

Evans, M., Fraser, D.A., and Monette, G., (1986), “On Principles and Arguments to Likelihood.” The Canadian Journal of Statistics 14: 181-199.

Kalbfleisch:

Kalbfleisch, J. D. (1975), “Sufficiency and Conditionality”, Biometrika, Vol. 62, No. 2 (Aug., 1975), pp. 251-259.
Barnard, G. A., (1975), “Comments on Paper by J. D. Kalbfleisch”, Biometrika, Vol. 62, No. 2 (Aug., 1975), pp. 260-261.
Barndorff-Nielsen, O. (1975), “Comments on Paper by J. D. Kalbfleisch”, Biometrika, Vol. 62, No. 2 (Aug., 1975), pp. 261-262.
Birnbaum, A. (1975), “Comments on Paper by J. D. Kalbfleisch”, Biometrika, Vol. 62, No. 2 (Aug., 1975), pp. 262-264.
Kalbfleisch, J. D. (1975), “Reply to Comments”, Biometrika, Vol. 62, No. 2 (Aug., 1975), p. 268.

My discussions (also noted above):

Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in JSM Proceedings: 440-453.
Mayo, D. G. (2014). Mayo paper: “On the Birnbaum Argument for the Strong Likelihood Principle,” Paper with discussion and Mayo rejoinder: Statistical Science 29(2) pp. 227-239, 261-266.

[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).

[ii] The link Mayo (2014) includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.

[iii] In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome z from experiment E”. He writes it: Ev(E,z).

In my formulation of the argument, I introduce a new symbol ⇒ to represent a function from a given experiment-outcome pair, (E,z) to a generic inference implication. It (hopefully) lets us be clearer than does Ev.

(E,z) ⇒ Infr_E(z) is to be read “the inference implication from outcome z in experiment E” (according to whatever inference type/school is being discussed).
If E is within error statistics, for example, it is necessary to know the relevant sampling distribution associated with a statistic. If it is within a Bayesian account, a relevant prior would be needed.

[iv] I’ve blogged these links in the past; please let me know if any links are broken.

67 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

Mayo — Mon, 29 Dec 2025 23:52:56 +0000

2025-26 Cruise

We’re stopping to consider one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018). It is now 67 years since Cox gave his famous weighing machine example in Sir David Cox (1958)[1]. It will play a vital role in our discussion of the (strong) Likelihood Principle later this week. The excerpt is from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not.

As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E₁ or E₂, to use in observing a Normally distributed random sample Z to make inferences about mean θ. E₁ has variance of 1, while that of E₂ is 10⁶. Any randomizing device used to choose which instrument to use will do, so long as it is irrelevant to θ. This is called a mixture experiment. The full data would report both the result of the coin flip and the measurement made with that instrument. We can write the report as having two parts: First, which experiment was run and second the measurement: (E_i, z), i = 1 or 2.

In testing a null hypothesis such as θ = 0, the same z measurement would correspond to a much smaller P-value were it to have come from E₁ rather than from E₂: denote them as p₁(z) and p₂(z), respectively. The overall significance level of the mixture: [p₁(z) + p₂(z)]/2, would give a misleading report of the precision of the actual experimental measurement. The claim is that N-P statistics would report the average P-value rather than the one corresponding to the scale you actually used! These are often called the unconditional and the conditional test, respectively. The claim is that the frequentist statistician must use the unconditional test.
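A small numerical sketch may make the gap between the conditional and unconditional reports vivid. It is only an illustration under assumptions that are not in the excerpt: z is treated as a single measurement, the test of θ = 0 is 2-sided, and the observed value z = 2 is hypothetical. The function name `two_sided_p_value` is likewise made up for this example.

```python
from statistics import NormalDist

def two_sided_p_value(z, sigma):
    """Two-sided P-value for H0: theta = 0, given one measurement z ~ N(theta, sigma^2)."""
    standardized = abs(z) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(standardized))

SIGMA_1 = 1.0      # E1: variance 1
SIGMA_2 = 1000.0   # E2: variance 10^6, so standard deviation 10^3

z_observed = 2.0   # hypothetical measurement, chosen for illustration

p1 = two_sided_p_value(z_observed, SIGMA_1)  # conditional on E1 having been used
p2 = two_sided_p_value(z_observed, SIGMA_2)  # conditional on E2 having been used
p_mix = (p1 + p2) / 2                        # unconditional (averaged) report

print(f"p1(z) = {p1:.4f}, p2(z) = {p2:.4f}, average = {p_mix:.4f}")
```

With these numbers p₁(z) is a little under .05 while p₂(z) is close to 1, so the averaged value describes neither instrument: it understates the evidence from the precise scale and overstates it for the imprecise one.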

Suppose that we know we have observed a measurement from E₂ with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

Once it is known which E_i has produced z, the P-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):

The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

Weak Conditionality Principle (WCP): If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)

Is There a Catch?

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

If you’re keen to follow our abbreviated cruise, write to Jean Miller (jemille6@vt.edu) and she will send you the final pages of the monthly readings.

Note to the Reader:

Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system). Yet you will find many statistics texts, and numerous discussion articles, that blithely repeat that the (strong) Likelihood Principle is a theorem, shown to follow if you accept the weak conditionality principle (WCP), which frequentist error statisticians do.[2] I argue that Allan Birnbaum’s (1962) alleged proof is circular. So, in 2025, when you find a text that claims the LP is a theorem, provable from the WCP, please let me know.

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.
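To make the contrast concrete, here is a minimal sketch using the standard textbook pair of a binomial and a negative binomial experiment. The data (9 successes, 3 failures), the flat prior, and the function names are assumptions chosen for illustration; they are not taken from the excerpt. The two likelihoods are proportional, so a Bayesian posterior cannot tell the experiments apart, while the error probabilities differ because the sampling distributions differ.

```python
from math import comb

SUCCESSES, FAILURES = 9, 3   # hypothetical data, for illustration only
N = SUCCESSES + FAILURES

def binomial_likelihood(theta):
    """Experiment 1: n = 12 trials fixed in advance, 9 successes observed."""
    return comb(N, SUCCESSES) * theta**SUCCESSES * (1 - theta)**FAILURES

def negative_binomial_likelihood(theta):
    """Experiment 2: sample until the 3rd failure, 9 successes observed first."""
    return comb(N - 1, SUCCESSES) * theta**SUCCESSES * (1 - theta)**FAILURES

def grid_posterior(likelihood, grid):
    """Normalized posterior over a grid of theta values under a flat prior."""
    weights = [likelihood(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(1, 100)]
post_binom = grid_posterior(binomial_likelihood, grid)
post_negbin = grid_posterior(negative_binomial_likelihood, grid)

# The constant factor cancels on normalizing, so the posteriors coincide.
max_posterior_gap = max(abs(a - b) for a, b in zip(post_binom, post_negbin))

# One-sided p-values for theta = 0.5 against theta > 0.5.
# Binomial: P(at least 9 successes in 12 trials).
p_binomial = sum(comb(N, k) for k in range(SUCCESSES, N + 1)) / 2**N
# Negative binomial: at least 9 successes before the 3rd failure, which is
# the same event as at most 2 failures in the first 11 trials.
p_negative_binomial = sum(comb(N - 1, j) for j in range(FAILURES)) / 2**(N - 1)

print(max_posterior_gap)              # effectively zero
print(round(p_binomial, 3))           # 0.073
print(round(p_negative_binomial, 3))  # 0.033
```

Under these assumptions the posterior is the same in both experiments, as the LP requires, while the p-values fall on opposite sides of 0.05. That is the sense in which accepting the LP would make error probabilities irrelevant once the data are in hand.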

The LP was a main topic for the first few years of this blog. That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in Statistical Science.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).

If you’re keen to try your hand at the arguments (Birnbaum’s or mine), you might start with a summary post (based on slides) here, or an intermediate paper Mayo (2013) that I presented at the JSM. It is not included in SIST. It’s a brainbuster, though, I warn you. There’s no real mathematics or statistics involved, it’s pure logic. But it’s very circuitous, which is why the supposed “proof” has stuck around as long as it has. I’ve always thought that clarifying it fully demanded the expertise of a philosopher of language, but I haven’t found one yet.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad, some of whom are very unhappy with me. But I’m given the final word in the rejoinder.

References (outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57(298), 269-306.

Birnbaum, A. (1975). Comments on Paper by J. D. Kalbfleisch. Biometrika, 62 (2), 262–264.

Cox, D. R. (1958), “Some problems connected with statistical inference”, The Annals of Mathematical Statistics, 29, 357-372.

Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in JSM Proceedings, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.

Mayo, D. G. (2014), “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion and Mayo rejoinder), Statistical Science 29(2), 227-239, 261-266.
