A Blizzard of Power Puzzles Replicates in Meta-Research


I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC—exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out—I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably—even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: A statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence of that impressive improvement δ’. Often the high power of .8 is even used as a (posterior) probability of the hypothesis that the improvement is δ’. [0] If these do not immediately strike you as fallacious, compare:

  • If the house is fully ablaze, then very probably the fire alarm goes off.
  • If the fire alarm goes off, then very probably the house is fully ablaze.

The first bullet is saying the fire alarm has high power to detect the house being fully ablaze. It does not mean the converse in the second bullet.

Today’s meta-statistical researchers are keen to point up the consequences of using statistical significance tests, figuring out why they lead to the various replication crises in science, and how they may be more honestly viewed. Yet they too use statistical analyses, and these can reflect philosophical and conceptual standpoints that may replicate the same shortcomings that arise in classic criticisms of significance tests. A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests, but these misunderstandings tend to crop up today at the meta-level. When power enters this meta-research, often as a kind of probability of replication, unsurprisingly, the same confusions pop up. But I will not tackle any meta-research in this post. Instead, let’s go back to power howlers that arise in criticisms of tests. I had a blogpost long ago (from Oct. 2011) on Ziliak and McCloskey (2008) (Z & M) on power, following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) = Pr(test rejects null at the .01 level | H’(δ) is true).

Say it is 0.85.
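To make (1) concrete, here is a minimal sketch in Python (the σ, n, and δ are my illustrative choices, picked so that the power comes out near 0.85; none of this is from Z & M):

```python
# Power of a one-sided z-test: (1) Pr(test rejects the null at the
# .01 level | H'(delta) is true), for test T+: H0: mu <= mu0 vs mu > mu0.
from scipy.stats import norm

alpha = 0.01
sigma, n = 10, 100              # illustrative known sd and sample size
se = sigma / n**0.5             # standard error of the sample mean
z_crit = norm.ppf(1 - alpha)    # reject when z >= 2.326

def power(delta):
    """Pr(test rejects), when the true discrepancy from the null is delta."""
    return 1 - norm.cdf(z_crit - delta / se)

print(round(power(3.36), 2))    # ~0.85, the value used in the text
```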

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so. They are mistaking (1), which defines power, for a statement assigning a posterior probability of .85–either to some effect, or specifically to H’(δ)! That is, (1) is being transformed into (1′):

(1′) Pr(H’(δ) is true | test rejects null at the .01 level) = .85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct; e.g., the probability that H’(δ) is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.
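To see why the conclusion doesn’t follow, here is a little sketch (the priors are my own illustrative assumptions; Z & M supply nothing of the sort, which is exactly the problem): even granting premises 1 and 2, a posterior for H’(δ) requires an extra input that power alone cannot provide.

```python
# Made-up priors: power .85 plus a rejection does not fix Pr(H' | reject).
# For simplicity, treat the world as just H0 vs H'(delta).
power, alpha = 0.85, 0.01              # Pr(reject | H'), Pr(reject | H0)
for prior in (0.5, 0.1, 0.01):         # assumed prior Pr(H') -- an extra input
    post = power * prior / (power * prior + alpha * (1 - prior))
    print(f"prior {prior:5.2f}: Pr(H' | reject) = {post:.2f}")
# prints ~0.99, 0.90, 0.46 -- none of these is delivered by (1) alone
```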

 

High power as high hurdle. As Aris Spanos (2008) points out, “They have it backwards”.  Their reasoning comes from thinking that the higher the power of the test from which statistical significance emerges, the higher the hurdle it has gotten over. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) [i]
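Spanos’s point is easy to check numerically. A minimal sketch (my illustrative numbers):

```python
# Power of a one-sided z-test against a FIXED alternative rises
# monotonically with sample size n (illustrative numbers).
from scipy.stats import norm

alpha, sigma, delta = 0.01, 10, 2      # fixed discrepancy delta = 2
z_crit = norm.ppf(1 - alpha)
for n in (25, 50, 100, 200, 400):
    se = sigma / n**0.5
    print(n, round(1 - norm.cdf(z_crit - delta / se), 3))
# power climbs from ~0.09 (n=25) to ~0.95 (n=400): rejections get easier
# as n grows because power is HIGH, not low
```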

Ziliak and McCloskey (2008) tell us: “It is the history of Fisher significance testing. One erects little significance hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133) This explains why they suppose high power translates into high hurdles, but it is the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using sensitivity rather than power would make this abundantly clear. We may coin: The high power = high hurdle (for rejecting the null) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. But this makes it easier to infer H1. To infer H1 with severity, H1 needs to be given a hard time.

Ponder the consequences their construal would have for the required trade-off between type 1 and type 2 error probabilities. (Use the comments to explain what happens.) For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]

What power howlers have you found? Share them in the comments and I’ll add them to my blizzard. 

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, 1(1): 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.

[0] Some meta-researchers, having brilliantly generated a superpopulation of treatments (from which this one is taken as a random sample), and finding these probabilities don’t hold, take this to show p-values exaggerate effects. I’ll come back to that case, which is a bit different from the one in today’s post.

[i] When it comes to raising the power by increasing sample size, Z & M often make true claims, so it’s odd when there’s a switch, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious. 

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’. 

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails (or renders probable) data x will get a “B-boost” from x, unless its probability is already 1. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

Categories: blizzard of 26, power, SIST, statistical significance tests | 1 Comment

Leisurely Cruise February 2026: power, shpower, positive predictive value

2025-26 Leisurely Cruise

The following is the February stop of our leisurely cruise (meeting 6 from my 2020 Seminar at the LSE). There was a guest speaker, Professor David Hand. Slides and videos are below. Ship StatInfasSt may head back to port or continue for an additional stop or two, if there is interest. Although I often say on this blog that the classical notion of power, as defined by Neyman and Pearson, is one of the most misunderstood notions in stat foundations, I did not know, in writing SIST, just how ingrained those misconceptions would become. I’ll write more on this in my next post. (The following is from SIST pp. 354-356; the pages are provided below.)

Shpower and Retrospective Power Analysis

It’s unusual to hear books condemn an approach in a hush-hush sort of way without explaining what’s so bad about it. This is the case with something called post hoc power analysis, practiced by some who live on the outskirts of Power Peninsula. Psst, don’t go there. We hear “there’s a sinister side to statistical power, … I’m referring to post hoc power” (Cumming 2012, pp. 340-1), also called observed power and retrospective (retro) power. I will be calling it shpower analysis. It distorts the logic of ordinary power analysis (from insignificant results). The “post hoc” part comes in because it’s based on the observed results. The trouble is that ordinary power analysis is also post-data. The criticisms are often wrongly taken to reject both.

Shpower evaluates power with respect to the hypothesis that the population effect size (discrepancy) equals the observed effect size, for example, that the parameter μ equals the observed mean x̄. In T+ this would be to set μ = x̄. Conveniently, their examples use variations on test T+. We may define:

The Shpower of test T+: Pr(X̄ > x̄α; μ = x̄)

The thinking, presumably, is that, since we don’t know the value of μ, we might use the observed x̄ to estimate it, and then compute power in the usual way, except substituting the observed value. But a moment’s thought shows the problem – at least for the purpose of using power analysis to interpret insignificant results. Why?

Since alternative μ is set equal to the observed x̄, and x̄ is given as statistically insignificant, we know we are in Case 1 from Section 5.1: the power can never exceed 0.5. In other words, since x̄ < x̄α, the shpower = POW(T+, μ = x̄) < 0.5. But power analytic reasoning is all about finding an alternative against which the test has high capability to have rung the significance bell, were that the true parameter value – high power. Shpower is always “slim” (to echo Neyman) against such alternatives. Unsurprisingly, then, shpower analytic reasoning has been roundly criticized in the literature. But the critics think they’re maligning power analytic reasoning.

Now we know the severe tester insists on using attained power Pr(d(X) > d(x0); μ’) to evaluate severity, but when addressing the criticisms of power analysis, we have to stick to ordinary power:[1]

Ordinary power: POW(μ’) = Pr(d(X) > cα; μ’)
Shpower (aka post hoc or retro power): Pr(d(X) > cα; μ = x̄)
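For concreteness, here is a minimal sketch of the contrast (my illustrative numbers, not from SIST):

```python
# Ordinary power vs shpower for test T+ (reject when x-bar > x_crit).
from scipy.stats import norm

mu0, sigma, n, alpha = 0, 1, 25, 0.025
se = sigma / n**0.5
x_crit = mu0 + norm.ppf(1 - alpha) * se    # cutoff for rejection

def pow_at(mu):                            # Pr(X-bar > x_crit; mu)
    return 1 - norm.cdf((x_crit - mu) / se)

x_obs = 0.3                                # insignificant result: x_obs < x_crit
print("ordinary power at mu1 = 0.6:", round(pow_at(0.6), 2))   # ~0.85: high
print("shpower (mu set to x_obs):  ", round(pow_at(x_obs), 2)) # ~0.32: < 0.5
```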

An article by Hoenig and Heisey (2001) (“The Abuse of Power”) calls power analysis abusive. Is it? Aris Spanos and I say no (in a 2002 note on them), but the journal declined to publish it [because the deadline for comments had passed]. Since then their slips have spread like kudzu through the literature.

Howlers of Shpower Analysis

Hoenig and Heisey notice that within the class of insignificant results, the more significant the observed x̄ is, the higher the “observed power” against μ = x̄, until it reaches 0.5 (when x̄ reaches x̄α and becomes significant). “That’s backwards!” they howl. It is backwards if “observed power” is defined as shpower: if you were to regard higher shpower as indicating better evidence for the null, you’d be saying that the more statistically significant the observed difference (between x̄ and μ0), the more the evidence of the absence of a discrepancy from the null hypothesis μ0. That would contradict the logic of tests.

Two fallacies are being committed here. The first we dealt with in discussing Greenland: namely, supposing that a negative result, with high power against μ1, is evidence for the null rather than merely evidence that μ < μ1. The more serious fallacy is that their “observed power” is shpower. Neither Cohen nor Neyman defines power analysis this way. It is concluded that power analysis is paradoxical and inconsistent with p-value reasoning. You should really only conclude that shpower analytic reasoning is paradoxical. If you’ve redefined a concept and find that a principle that held with the original concept is contradicted, you should suspect your redefinition. It might have other uses, but there is no warrant to discredit the original notion.

The shpower computation is asking: what’s the probability of getting X̄ > x̄α under μ = x̄? We still have that the larger the power (against μ = x̄), the better x̄ indicates that μ < x̄ – as in ordinary power analysis – it’s just that the indication is never more than 0.5. Other papers and even instructional manuals (Ellis 2010) assume shpower is what retrospective power analysis must mean, and ridicule it because “a nonsignificant result will almost always be associated with low statistical power” (p. 60). Not so. I’m afraid that observed power and retrospective power are both used in the literature to mean shpower. What about my use of severity? Severity will replace the cutoff for rejection with the observed value of the test statistic (i.e., Pr(d(X) > d(x0); μ1)), but not the parameter value μ. You might say, we don’t know the value of μ1. True, but that doesn’t stop us from forming power or severity curves and interpreting results accordingly. Let’s leave shpower and consider criticisms of ordinary power analysis. Again, pointing to Hoenig and Heisey’s article (2001) is ubiquitous.
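A quick sketch of the “backwards” pattern they notice (again, my illustrative numbers):

```python
# Within insignificant results, shpower rises toward 0.5 as the observed
# mean x_obs approaches the cutoff x_crit -- but never exceeds it.
from scipy.stats import norm

mu0, sigma, n, alpha = 0, 1, 25, 0.025
se = sigma / n**0.5
x_crit = mu0 + norm.ppf(1 - alpha) * se

for x_obs in (0.0, 0.1, 0.2, 0.3, x_crit):
    shpower = 1 - norm.cdf((x_crit - x_obs) / se)
    print(round(x_obs, 2), round(shpower, 3))
# 0.025, 0.072, 0.169, 0.323, 0.5 -- capped at 0.5 below the cutoff
```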

[1] In deciphering existing discussions of ordinary power analysis, we can suppose that d(x0) happens to be exactly at the cut-off for rejection when discussing significant results, and just misses the cut-off in discussions of insignificant results in test T+. Then att-power for μ1 equals ordinary power for μ1.

Reading:

SIST Excursion 5 Tour I (pp. 323-332; 338-344; 346-352), Tour II (pp. 353-6; 361-370), and Farewell Keepsake (pp. 436-444)

Recommended (if time) What Ever Happened to Bayesian Foundations (Excursion 6 Tour I)


Mayo Memos for Meeting 6:

-Souvenirs Meeting 6: W: The Severity Interpretation of Negative Results (SIN) for Test T+; X: Power and Severity Analysis; Z: Understanding Tribal Warfare

-Selected blogposts on Power

  • 05/08/17: How to tell what’s true about power if you’re practicing within the error-statistical tribe
  • 12/12/17: How to avoid making mountains out of molehills (using power and severity)

There was also a guest speaker, Professor David Hand:
      “Trustworthiness of Statistical Analysis”

_______________________________________________________________________________________________________

Slides & Video Links for Meeting 6:

Slides: Mayo 2nd Draft slides for 25 June (not beautiful)

Video of Meeting #6: (Viewing Videos in full screen mode helps with buffering issues.)
VIDEO LINK:
https://wp.me/abBgTB-mZ

VIDEO LINK to David Hand’s Presentation: https://wp.me/abBgTB-mS
David Hand’s recorded Powerpoint slides: https://wp.me/abBgTB-n4
AUDIO LINK to David Hand’s Presentation & Discussion: https://wp.me/abBgTB-nm

Another link is here.

Please share your thoughts and queries in the comments.

Categories: 2025-2026 Leisurely Cruise, power | Leave a comment

Severe testing of deep learning models of cognition (ii)


From time to time I hear of an application of the severe testing philosophy in intriguing ways in fields I know very little about. An example is a recent article by cognitive psychologist Jeffrey Bowers and colleagues (2023): “On the importance of severely testing deep learning models of cognition” (abstract below). Because deep neural networks (DNNs)–advanced machine learning models–seem to recognize images of objects at a rate similar to, or even better than, humans, many researchers suppose DNNs learn to recognize objects in a way similar to humans. However, Bowers and colleagues argue that, on closer inspection, the evidence is remarkably weak, and “in order to address this problem, we argue that the philosophy of severe testing is needed”.

The problem is this. Deep learning models, after all, consist of millions of (largely uninterpretable) parameters. Without understanding how the black-box model moves from inputs to outputs, it’s easy to see how observed correlations can occur even where the DNN output is due to a variety of factors other than using a mechanism similar to the human visual system. From the standpoint of severe testing, this is a familiar mistake. For data to provide evidence for a claim, it does not suffice that the claim agrees with the data; the method must have been capable of revealing the claim to be false, (just) if it is. Here the type of claim of interest is that a given algorithmic model uses similar features or mechanisms as humans to categorize images.[1] The problem isn’t the engineering one of getting more accurate algorithmic models; the problem is inferring claim C: DNNs mimic human cognition in some sense (they focus on vision), even though C has not been well probed.

“Contrary to the model comparison approach that is popular in deep learning applications to cognitive/neural modeling it will be argued that the mere advantage of one model over the other in predicting domain-relevant data is wholly insufficient even as the weakest evidentiary standard”

—in particular, “weak severity”.[2] Bowers et al. argue that many celebrated demonstrations of human–DNN similarity fail even this minimal standard of evidence: nothing has been done that would have found C false, even if it is. The authors grant that “many questions arise as we attempt to unfold what severity requirements mean in practice. … current testing does not even come close to any reasonable severity requirement”. [Pursuing their questions about applying severity deserves a separate discussion.]

While the experiments are artificial, as with all experiments, they do seem to replicate known features of human vision, such as identifying objects by shape rather than texture, and a human sensitivity to relations between parts. Although similar patterns may be found in DNNs, once researchers scratch a bit below the surface using genuinely probative tests, the agreement collapses. One example concerns a disagreement in how DNNs vs humans perceive relations between parts of objects, such as one part being above the other. An experiment goes something like this (I do not claim any expertise): Humans and DNNs are trained on labeled examples and must infer what matters to the classification. In fact the classification depends on how the parts are arranged, but no rule is given. When shown a new object whose parts stand in the same relations as before, humans typically regard it as belonging to the same category. Not so, at least for humans, when the same parts appear with those relations altered (e.g., what was above is now below).

The DNN appears to do just as well until the relations, but not the parts, are swapped (e.g., what was above is now below). Keeping the parts the same, in other words, while changing the relation, the model does not change its classification. Humans do. The DNN, it appears, is tracking features (the parts) that suffice for prediction during training, rather than the relational structure that humans infer as explaining the classification. For more examples and discussion (in The Behavioral and Brain Sciences), see Bowers et al. (2022).

They ask: “Why is there so little severe testing in this domain? We argue that part of the problem lies with the peer-review system that incentivizes researchers to carry out research designed to highlight DNN-human similarities and minimize differences.” As a result, researchers are incentivized—by peer review, publication practices, and the culture of the field—to design experiments that show agreement rather than ones that seriously risk falsifying claims of DNN-human similarities.

“Indeed, reviewers and editors often claim that “negative results” — i.e., results that falsify strong claims of similarity between humans and DNNs — are not enough and that “solutions” — i.e., models that report DNN-human similarities – are needed for publishing in the top venues…”

We should not equate the use of “negative results” in this context with common “null” results in statistical significance tests. Commonly, null results in statistical significance tests merely fail to provide evidence for an effect (or indicate upper bounds to the effect size, based on a proper use of power). In the human/DNN studies, by contrast, the sense of “negative” is closer to a falsification, or at least a serious undermining, of the proposed claim C that the DNNs rely on the same or similar mechanisms as humans, in regard to a certain process. As they put it, “negative results” are “results that falsify strong claims of similarity between humans and DNNs”:

“the main objective of researchers comparing DNNs to humans is to better understand the brain through DNNs. If apparent DNN-human similarities are mediated by qualitatively different systems, then the claim that DNNs are good models of brains is simply wrong.”

The relational experiments show the two are tracking different things. As such, I recommend they call them “falsifying results” rather than “negative” ones, since we are generally entitled to say something much weaker in the case of “null” results in standard statistical tests (with a no-effect null). Even “fixing” the DNN to match the human output does not restore the claim that the two systems are tracking the same things–so far as I understand what’s going on here. (We assume there wasn’t some underlying flaw with the apparently falsifying experiments.)

Of most interest is that the authors stress a constructive upshot: “that following the principles of severe testing is likely to steer empirical deep learning approaches to brain and cognitive science onto a more constructive direction”. Encouragingly, the most updated benchmark tests seem to bear this out, but as an outsider, I can only speculate. [I will write to the authors and report on their sense of a shift in the field.] Whether this shift will become widespread remains to be seen, but it marks a welcome and interesting move toward more severe and genuinely informative testing in these experiments.

[1] The claim, of course, could be something else, such as: DNNs are useful for understanding the relationships between DNNs and human cognition.

[2] Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

Abstract of Bowers et al. (2023): Researchers studying the correspondences between Deep Neural Networks (DNNs) and humans often give little consideration to severe testing when drawing conclusions from empirical findings, and this is impeding progress in building better models of minds. We first detail what we mean by severe testing and highlight how this is especially important when working with opaque models with many free parameters that may solve a given task in multiple different ways. Second, we provide multiple examples of researchers making strong claims regarding DNN-human similarities without engaging in severe testing of their hypotheses. Third, we consider why severe testing is undervalued. We provide evidence that part of the fault lies with the review process. There is now a widespread appreciation in many areas of science that a bias for publishing positive results (among other practices) is leading to a credibility crisis, but there seems less awareness of the problem here.

Reference

Bowers, J. S., Malhotra, G., Dujmović, M., Llera Montero, M., Tsvetkov, C., Biscione, V., Puebla, G., Adolfi, F., Hummel, J. E., Heaton, R. F., Evans, B. D., Mitchell, J., & Blything, R. (2023). Deep problems with neural network models of human vision, commentary & response. The Behavioral and Brain Sciences, 46, Article e385. https://doi.org/10.1017/S0140525X22002813

Categories: severity and deep learning models | 5 Comments

(JAN #2) Leisurely cruise January 2026: Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

2025-26 Cruise

Our second stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour II, which you can read here. This criticism of statistical significance tests takes a number of forms; here I consider the best known. The bottom line is that one should not suppose that quantities measuring different things ought to be equal. At the bottom you will see links to posts discussing this issue, each with a large number of comments. The comments from readers are of interest! We will have a zoom meeting Fri Jan 23 11AM ET on these last two posts.* If you want to join us, contact us.

getting beyond…

Excerpt from Excursion 4 Tour II*

4.4 Do P-Values Exaggerate the Evidence? Continue reading

Categories: 2026 Leisurely Cruise, frequentist/Bayesian, P-values | Leave a comment

(JAN #1) Leisurely Cruise January 2026: Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

2025-26 Cruise

Our first stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour I which you can read here. I hope that this will give you the chutzpah to push back in 2026, if you hear that objectivity in science is just a myth. This leisurely tour may be a bit more leisurely than I intended, but this is philosophy, so slow blogging is best. (Plus, we’ve had some poor sailing weather). Please use the comments to share thoughts.


Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276) [i]

Continue reading

Categories: 2026 Leisurely Cruise, objectivity, Statistical Inference as Severe Testing | Leave a comment

Midnight With Birnbaum: Happy New Year 2026!


Anyone here remember that old Woody Allen movie, “Midnight in Paris,” where the main character (I forget who plays him; I saw it on a plane), a writer finishing a novel, steps into a cab that mysteriously picks him up at midnight and transports him back in time, where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval, and he comes back each night in the same mysterious cab…Well, ever since I began this blog in 2011, I imagine being picked up in a mysterious taxi at midnight on New Year’s Eve, and lo and behold, find myself in 1960s New York City, in the company of Allan Birnbaum, who is looking deeply contemplative, perhaps studying his 1962 paper…Birnbaum reveals some new and surprising twists this year! [i]

(The pic on the left is the only blurry image I have of the club I’m taken to.) It has been a decade since I published my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), which includes commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. Not only does the (Strong) Likelihood Principle (LP or SLP) remain at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general, but a decade after my 2014 paper, it is more central than ever–even if it is often unrecognized.

OUR EXCHANGE:

ERROR STATISTICIAN: It’s wonderful to meet you, Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to have published on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!) Continue reading

Categories: Birnbaum, CHAT GPT, Likelihood Principle, Sir David Cox | Leave a comment

For those who want to binge read the (Strong) Likelihood Principle in 2025


David Cox’s famous “weighing machine” example from my last post is thought to have caused “a subtle earthquake” in foundations of statistics. It’s been 11 years since I published my Statistical Science article on this, Mayo (2014), which includes several commentaries, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2026 is the year that people will return to it again. It’s at least touched upon in Roderick Little’s new book (pic below). This post gives some background, and collects the essential links that you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity. Continue reading

Categories: 11 years ago, Likelihood Principle | Leave a comment

67 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

2025-26 Cruise


We’re stopping to consider one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018). It is now 67 years since Sir David Cox gave his famous weighing machine example (Cox 1958).[1] It will play a vital role in our discussion of the (strong) Likelihood Principle later this week. The excerpt is from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? 

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. Continue reading

Categories: 2025 leisurely cruise, Birnbaum, Likelihood Principle | Leave a comment

(DEC #2) December Leisurely Tour Meeting 3: SIST Excursion 3 Tour III

2025-26 Cruise

We are now at the second stop on our December leisurely cruise through SIST: Excursion 3 Tour III. I am pasting the slides and video from this session during the LSE Research Seminars in 2020 (from which this cruise derives). (Remember it was early pandemic, and we weren’t so adept with zooming.) The Higgs discussion clarifies (and defends) a somewhat controversial interpretation of p-values. (If you’re interested in the Higgs discovery, there’s a lot more on this blog you can find with the search.) I am not sure if I would include the section on “capability and severity” were I to write a second edition, though I would keep the duality of tests and CIs. My goal was to expose a fallacy that is even more common nowadays, but I would have placed a revised version later in the book. Share your remarks in the comments.


III. Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery: Continue reading

Categories: 2025 leisurely cruise, confidence intervals and tests | Leave a comment

December leisurely cruise “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)

2025-26 Cruise

Welcome to the December leisurely cruise:
Wherever we are sailing, assume that it’s warm, warm, warm (not like today in NYC). This is an overview of our first set of readings for December from my Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018) [SIST]: Excursion 3 Tour II. This leisurely cruise, participants know, is intended to take a whole month to cover one week of readings from my 2020 LSE Seminars, except for December and January, which double up.

What do you think of “3.6 Hocus-Pocus: P-values Are Not Error Probabilities, Are Not Even Frequentist”? This section refers to Jim Berger’s famous attempted unification of Jeffreys, Neyman and Fisher in 2003. The unification considers testing two simple hypotheses using a random sample from a Normal distribution, computing their two P-values, rejecting whichever gets the smaller P-value, and then computing its posterior probability, assuming each hypothesis gets a prior of .5. This becomes what he calls the “Bayesian error probability”, upon which he defines “the frequentist principle”. On Berger’s reading of an important paper* by Neyman (1977), Neyman criticized p-values for violating the frequentist principle (SIST p. 186). *The paper is “Frequentist Probability and Frequentist Statistics”. Remember that links to readings outside SIST are at the Captain’s biblio on the top left of the blog. Share your thoughts in the comments.
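For concreteness, here is my rough reconstruction of the kind of computation involved (illustrative numbers; this is not Berger’s own code):

```python
# Two simple Normal hypotheses, equal .5 priors: compute both P-values,
# reject whichever hypothesis gets the smaller one, then compute the
# posterior probabilities (my reconstruction of the setup described above).
from scipy.stats import norm

mu0, mu1, sigma, n = 0, 1, 1, 10       # assumed values for illustration
se = sigma / n**0.5
xbar = 0.8                              # a hypothetical observed sample mean

p0 = 1 - norm.cdf((xbar - mu0) / se)    # P-value taking H0 as the null
p1 = norm.cdf((xbar - mu1) / se)        # P-value taking H1 as the null
rejected = "H0" if p0 < p1 else "H1"    # reject the smaller P-value

like0 = norm.pdf(xbar, mu0, se)         # likelihoods under each hypothesis
like1 = norm.pdf(xbar, mu1, se)
post0 = like0 / (like0 + like1)         # Pr(H0 | data) with .5 priors
print(rejected, "rejected; Pr(H0 | data) =", round(post0, 3))
```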

Some snapshots from Excursion 3 Tour II.

Continue reading

Categories: 2025 leisurely cruise | Leave a comment

Modest replication probabilities of p-values–desirable, not regrettable: a note from Stephen Senn


You will often hear—especially in discussions about the “replication crisis”—that statistical significance tests exaggerate evidence. Significance testing, we hear, inflates effect sizes, inflates power, inflates the probability of a real effect, or inflates the probability of replication, and thereby misleads scientists.

If you look closely, you’ll find the charges are based on concepts and philosophical frameworks foreign to both Fisherian and Neyman–Pearson hypothesis testing. Nearly all have been discussed on this blog or in SIST (Mayo 2018), but new variations have cropped up. The emphasis that some are now placing on how biased selection effects invalidate error probabilities is welcome, but I say that the recommendations for reinterpreting quantities such as p-values and power introduce radical distortions of error statistical inferences. Before diving into the modern incarnations of the charges it’s worth recalling Stephen Senn’s response to Stephen Goodman’s attempt to convert p-values into replication probabilities nearly 20 years ago (“A Comment on Replication, P-values and Evidence,” Statistics in Medicine). I first blogged it in 2012, here. Below I am pasting some excerpts from Senn’s letter (but readers interested in the topic should look at all of it), because Senn’s clarity cuts straight through many of today’s misunderstandings. 


Continue reading

Categories: 13 years ago, p-values exaggerate, replication research, S. Senn | 8 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

November Cruise

The example I use here to illustrate formal severity comes in for criticism in a paper to which I reply in a 2025 BJPS paper linked to here. Use the comments for queries.

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident) 

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean x̄ computed, each with a known standard deviation σ = 10. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of x̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²), then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1). Continue reading
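As a quick check of the arithmetic (the observed mean of 152 below is my illustrative number, not part of the excerpt):

```python
# With sigma = 10 and n = 100, the sample mean has standard error 1,
# so X-bar ~ N(150, 1) when the cooling system works as intended.
from scipy.stats import norm

sigma, n, mu0 = 10, 100, 150
se = sigma / n**0.5                      # 10/sqrt(100) = 1
p = 1 - norm.cdf((152 - mu0) / se)       # Pr(X-bar >= 152; mu = 150)
print(se, round(p, 3))                   # 1.0 0.023
```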

Categories: 2025 leisurely cruise, severe tests, severity function, water plant accident | Leave a comment

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: (3.2)

Neyman & Pearson

November Cruise: 3.2

This, the third of November’s stops on the leisurely cruise of SIST, aligns well with my recent BJPS paper Severe Testing: Error Statistics vs Bayes Factor Tests. In tomorrow’s zoom, 11 am New York time, we’ll have an overview of the topics in SIST so far, as well as a discussion of this paper. (If you don’t have a link, and want one, write to me at error@vt.edu.)

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined: Continue reading

Categories: 2025 leisurely cruise, E.S. Pearson, Neyman, statistical tests | Leave a comment

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

November Cruise

This second excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments.

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Continue reading

Categories: 2025 leisurely cruise, SIST, Statistical Inference as Severe Testing | 2 Comments

November: The leisurely tour of SIST continues

2025 Cruise

We continue our leisurely tour of Statistical Inference as Severe Testing [SIST] (Mayo 2018, CUP) with Excursion 3. This is based on my five seminars at the London School of Economics in 2020; I include slides and video for those who are interested. (Use the comments for questions.) Continue reading

Categories: 2025 leisurely cruise, significance tests, Statistical Inference as Severe Testing | 1 Comment

Severity and Adversarial Collaborations (i)


In the 2025 November/December issue of American Scientist, a group of authors (Ceci, Clark, Jussim and Williams 2025) argue in “Teams of rivals” that “adversarial collaborations offer a rigorous way to resolve opposing scientific findings, inform key sociopolitical issues, and help repair trust in science”. With adversarial collaborations, a term coined by Daniel Kahneman (2003), teams of divergent scholars, interested in uncovering what is the case (rather than endlessly making their case), design appropriately stringent tests to understand–and perhaps even resolve–their disagreements. I am pleased to see that in describing such tests the authors allude to my notion of severe testing (Mayo 2018)*:

Severe testing is the related idea that the scientific community ought to accept a claim only after it surmounts rigorous tests designed to find its flaws, rather than tests optimally designed for confirmation. The strong motivation each side’s members will feel to severely test the other side’s predictions should inspire greater confidence in the collaboration’s eventual conclusions. (Ceci et al., 2025)

1. Why open science isn’t enough Continue reading

Categories: severity and adversarial collaborations | 5 Comments

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)

Third Stop

Readers: With this third stop we’ve covered Tour I of Excursion 1. My slides from the first LSE meeting in 2020, which dealt with elements of Excursion 1, can be found at the end of this post. There’s also a video giving an overall intro to SIST, Excursion 1. It’s noteworthy to consider just how much things seem to have changed in the past few years. Or have they? What would the view from the hot-air balloon look like now? Share your thoughts in the comments.

ZOOM: I propose a zoom meeting for Sunday, November 16 at 11 am or Friday, November 21 at 11 am, New York time. (An equal # prefer Fri & Sun.) The link will be available to those who register/registered with Dr. Miller*.

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)


How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

Continue reading

Categories: 2025 leisurely cruise, Statistical Inference as Severe Testing | Leave a comment

The ASA Sir David R. Cox Foundations of Statistics Award is now annual

15 July 1924 – 18 January 2022

The Sir David R. Cox Foundations of Statistics Award will now be given annually by the American Statistical Association (ASA), thanks to generous contributions by “Friends” of David Cox, solicited on this blog!*

Nominations for the 2026 Sir David R. Cox Foundations of Statistics Award are due on November 1, 2025, and require the following:

  • Nomination letter
  • Candidate’s CV
  • Two letters of support, not to exceed two pages each

Continue reading

Categories: Sir David Cox, Sir David Cox Foundations of Statistics Award | Leave a comment

Excursion 1 Tour I (2nd Stop): Probabilism, Performance, and Probativeness (1.2)


Readers: Last year at this time I gave a Neyman seminar at Berkeley and posted on a panel discussion we had. There were lots of great questions, and follow-ups. Here’s a link.

“I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth”. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Error Statistics | Leave a comment

2025 (1) The leisurely cruise begins: Excerpt from Excursion 1 Tour I of Statistical Inference as Severe Testing (SIST)

Ship Statinfasst

Excerpt from Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

NOTE: The following is an excerpt from my book: Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP, 2018). For any new reflections or corrections, I will use the comments. The initial announcement is here (including how to join).

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing | Leave a comment
