(JAN #2) Leisurely cruise January 2026: Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

2025-26 Cruise

Our second stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour II, which you can read here. This criticism of statistical significance tests takes a number of forms. Here I consider the best known. The bottom line is that one should not suppose that quantities measuring different things ought to be equal. At the bottom you will see links to posts discussing this issue, each with a large number of comments. The comments from readers are of interest! We will have a zoom meeting Fri Jan 23 11AM ET on these last two posts.* If you want to join us, contact us.

getting beyond…

Excerpt from Excursion 4 Tour II*

4.4 Do P-Values Exaggerate the Evidence?

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.

More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.
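To see what drives the numbers, here is a minimal sketch of the kind of calculation behind the criticism (my own illustration, not from SIST): a two-sided z-test of H0: μ = 0 with σ = 1, a lump prior of 1/2 on H0, and, under H1, a standard Normal prior on μ. All of these choices are assumptions made for illustration.

```python
import math

def normal_pdf(x, var):
    """Density at x of a mean-zero Normal with variance var."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_null(z, n, pi0=0.5):
    """P(H0 | data) for H0: mu = 0 vs H1: mu ~ N(0, 1), sigma = 1,
    after observing a sample mean z standard errors above 0."""
    xbar = z / math.sqrt(n)
    f0 = normal_pdf(xbar, 1.0 / n)        # marginal density of xbar under H0
    f1 = normal_pdf(xbar, 1.0 / n + 1.0)  # marginal density of xbar under H1
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)

# z = 1.96 gives a two-sided P-value of about 0.05 at every n,
# yet the posterior lump on H0 grows with n:
for n in (10, 100, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

With z = 1.96 (two-sided P-value of about 0.05), the posterior on H0 comes out roughly 0.37, 0.60, and 0.82 for n = 10, 100, 1000: the gap between P-value and posterior widens with n, which is just the Jeffreys-Lindley effect.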

You might react by observing that: (a) P-values are not intended as posteriors in H0 (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, H0. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a P-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:

. . . [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone. (Senn 2001b, p. 202)

To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the P-value) is one of the probabilist ones?

All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.

Getting Beyond “I’m Rubber and You’re Glue”. The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.

To visit the core arguments, we travel to 1987 to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through the quicksand of Excursion 3, Tour II, are about to pay large dividends.


*I decided to put off the Birnbaum zoom until Feb. (date TBA). Jan 23 will cover this and the last post. 11-12:30 (OK to arrive late/leave early).

Readers can find blogposts that trace out the discussion of this topic, as I was developing it, along with comments. The following two are central:

(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised) 71 comments

(7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious? 39 comments

Where you are in the journey.

Categories: 2026 Leisurely Cruise, frequentist/Bayesian, P-values | Leave a comment

(JAN #1) Leisurely Cruise January 2026: Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

2025-26 Cruise

Our first stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour I which you can read here. I hope that this will give you the chutzpah to push back in 2026, if you hear that objectivity in science is just a myth. This leisurely tour may be a bit more leisurely than I intended, but this is philosophy, so slow blogging is best. (Plus, we’ve had some poor sailing weather). Please use the comments to share thoughts.

Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276) [i]

Continue reading

Categories: 2026 Leisurely Cruise, objectivity, Statistical Inference as Severe Testing | Leave a comment

Midnight With Birnbaum: Happy New Year 2026!

Anyone here remember that old Woody Allen movie, “Midnight in Paris,” where the main character (I forget who plays it, I saw it on a plane), a writer finishing a novel, steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, ever since I began this blog in 2011, I imagine being picked up in a mysterious taxi at midnight on New Year’s Eve, and lo and behold, find myself in 1960s New York City, in the company of Allan Birnbaum who is looking deeply contemplative, perhaps studying his 1962 paper…Birnbaum reveals some new and surprising twists this year! [i]

(The pic on the left is the only blurry image I have of the club I’m taken to.) It has been a decade since I published my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), which includes commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. Not only does the (Strong) Likelihood Principle (LP or SLP) remain at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general, but a decade after my 2014 paper, it is more central than ever–even if it is often unrecognized.

OUR EXCHANGE:

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics.  I happen to have published on your famous argument about the likelihood principle (LP).  (whispers: I can’t believe this!) Continue reading

Categories: Birnbaum, CHAT GPT, Likelihood Principle, Sir David Cox | Leave a comment

For those who want to binge read the (Strong) Likelihood Principle in 2025

David Cox’s famous “weighing machine” example from my last post is thought to have caused “a subtle earthquake” in foundations of statistics. It’s been 11 years since I published my Statistical Science article on this, Mayo (2014), which includes several commentaries, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2026 is the year that people will return to it again. It’s at least touched upon in Roderick Little’s new book (pic below). This post gives some background, and collects the essential links that you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity. Continue reading

Categories: 11 years ago, Likelihood Principle | Leave a comment

67 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

2025-26 Cruise

We’re stopping to consider one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018). It is now 67 years since Cox gave his famous weighing machine example in Sir David Cox (1958)[1]. It will play a vital role in our discussion of the (strong) Likelihood Principle later this week. The excerpt is from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? 

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. Continue reading
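The unconditional arithmetic in the joke is just 0.5 × 1 + 0.5 × 0.5 = 0.75, and a quick simulation (mine, not from SIST) shows why reporting that number misleads once you know which scale was actually used:

```python
import random

random.seed(1)

def weigh_once():
    """Toss a fair coin to pick a scale; return (scale used, was it right?)."""
    if random.random() < 0.5:
        return "precise", True                # this scale is always right
    return "lousy", random.random() < 0.5     # this one is right half the time

trials = [weigh_once() for _ in range(100_000)]
overall = sum(ok for _, ok in trials) / len(trials)
lousy = [ok for scale, ok in trials if scale == "lousy"]
print(f"unconditional accuracy:         {overall:.3f}")
print(f"accuracy given the lousy scale: {sum(lousy) / len(lousy):.3f}")
```

The first number hovers near 0.75, the second near 0.50. Once it is known that the lousy scale was used, the 75% figure is the wrong error probability to report: the conditionality point the example is meant to raise.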

Categories: 2025 leisurely cruise, Birnbaum, Likelihood Principle | Leave a comment

(DEC #2) December Leisurely Tour Meeting 3: SIST Excursion 3 Tour III

2025-26 Cruise

We are now at the second stop on our December leisurely cruise through SIST: Excursion 3 Tour III. I am pasting the slides and video from this session during the LSE Research Seminars in 2020 (from which this cruise derives). (Remember it was early pandemic, and we weren’t so adept with zooming.) The Higgs discussion clarifies (and defends) a somewhat controversial interpretation of p-values. (If you’re interested in the Higgs discovery, there’s a lot more on this blog you can find with the search.) I am not sure if I would include the section on “capability and severity” were I to write a second edition, though I would keep the duality of tests and CIs. My goal was to expose a fallacy that is even more common nowadays, but I would have placed a revised version later in the book. Share your remarks in the comments.

III. Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery: Continue reading

Categories: 2025 leisurely cruise, confidence intervals and tests | Leave a comment

December leisurely cruise “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)

2025-26 Cruise

Welcome to the December leisurely cruise:
Wherever we are sailing, assume that it’s warm, warm, warm (not like today in NYC). This is an overview of our first set of readings for December from my Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018): [SIST]–Excursion 3 Tour II. This leisurely cruise, participants know, is intended to take a whole month to cover one week of readings from my 2020 LSE Seminars, except for December and January which double up. 

What do you think of “3.6 Hocus-Pocus: P-Values Are Not Error Probabilities, Are Not Even Frequentist”? This section refers to Jim Berger’s famous attempted unification of Jeffreys, Neyman and Fisher in 2003. The unification considers testing two simple hypotheses using a random sample from a Normal distribution, computing their two P-values, rejecting whichever gets the smaller P-value, and then computing its posterior probability, assuming each gets a prior of .5. This becomes what he calls the “Bayesian error probability” upon which he defines “the frequentist principle”. On Berger’s reading of an important paper* by Neyman (1977), Neyman criticized p-values for violating the frequentist principle (SIST p. 186). *The paper is “Frequentist Probability and Frequentist Statistics”. Remember that links to readings outside SIST are at the Captain’s biblio on the top left of the blog. Share your thoughts in the comments.
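To make the construction concrete, here is my own stripped-down sketch of that kind of computation (hypothetical numbers, not Berger's actual 2003 example): two simple hypotheses about a Normal mean with priors of .5 each, reject whichever has the smaller P-value, and report the rejected hypothesis's posterior probability as the "Bayesian error probability".

```python
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def report(x, mu0=-1.0, mu1=1.0):
    """Observe one x ~ N(mu, 1). H0: mu = mu0 vs H1: mu = mu1, priors .5 each.
    Reject whichever hypothesis has the smaller P-value and return it
    together with its posterior probability."""
    p0 = 1 - phi(x - mu0)                 # P-value against H0 (large x tells against mu0)
    p1 = phi(x - mu1)                     # P-value against H1 (small x tells against mu1)
    f0 = math.exp(-0.5 * (x - mu0) ** 2)  # Normal likelihoods; constants cancel
    f1 = math.exp(-0.5 * (x - mu1) ** 2)
    post0 = f0 / (f0 + f1)                # posterior of H0 under equal priors
    if p0 < p1:
        return "reject H0", post0         # posterior left on the rejected H0
    return "reject H1", 1 - post0         # posterior left on the rejected H1

print(report(1.5))
```

At x = 1.5 the P-value against H0 is about 0.006, yet the posterior left on the rejected H0 is about 0.047, several times the P-value: the sort of divergence the "unification" turns on.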

Some snapshots from Excursion 3 tour II.

Continue reading

Categories: 2025 leisurely cruise | Leave a comment

Modest replication probabilities of p-values–desirable, not regrettable: a note from Stephen Senn

You will often hear—especially in discussions about the “replication crisis”—that statistical significance tests exaggerate evidence. Significance testing, we hear, inflates effect sizes, inflates power, inflates the probability of a real effect, or inflates the probability of replication, and thereby misleads scientists.

If you look closely, you’ll find the charges are based on concepts and philosophical frameworks foreign to both Fisherian and Neyman–Pearson hypothesis testing. Nearly all have been discussed on this blog or in SIST (Mayo 2018), but new variations have cropped up. The emphasis that some are now placing on how biased selection effects invalidate error probabilities is welcome, but I say that the recommendations for reinterpreting quantities such as p-values and power introduce radical distortions of error statistical inferences. Before diving into the modern incarnations of the charges it’s worth recalling Stephen Senn’s response to Stephen Goodman’s attempt to convert p-values into replication probabilities nearly 20 years ago (“A Comment on Replication, P-values and Evidence,” Statistics in Medicine). I first blogged it in 2012, here. Below I am pasting some excerpts from Senn’s letter (but readers interested in the topic should look at all of it), because Senn’s clarity cuts straight through many of today’s misunderstandings. 
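The 50% figure at the heart of the Senn-Goodman exchange can be reconstructed in a few lines (my own sketch, assuming one-sided z-tests with known variance): if the true standardized effect is taken at face value to equal an original z of exactly 1.96, a same-sized replication's z-statistic is distributed N(1.96, 1), so it clears 1.96 only half the time. Modest, but, as Senn argues, not regrettable.

```python
import math
import random

random.seed(7)

Z = 1.96  # one-sided critical value, matching a one-sided P-value of ~0.025

# Treat the original study's observed standardized effect (z exactly at the
# threshold) as the true effect; a replication's z-statistic is then N(Z, 1).
analytic = 1 - 0.5 * (1 + math.erf((Z - Z) / math.sqrt(2)))  # P(N(Z,1) > Z) = 0.5

reps = [random.gauss(Z, 1) > Z for _ in range(100_000)]
print(analytic, sum(reps) / len(reps))
```

Both numbers sit at one half: the analytic value exactly, the simulated frequency to within sampling error.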

Continue reading

Categories: 13 years ago, p-values exaggerate, replication research, S. Senn | Tags: , , , | 8 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

November Cruise

The example I use here to illustrate formal severity comes in for criticism in a paper to which I reply in a 2025 BJPS paper linked to here. Use the comments for queries.

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident) 

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired.  It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean X̄ computed, each with a known standard deviation σ = 10.  When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²), then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1). Continue reading
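You can check the deduction numerically (a sketch of mine, not from SIST; the observed mean of 152 used below is a hypothetical, not part of the excerpt):

```python
import math

mu0, sigma, n = 150.0, 10.0, 100
se = sigma / math.sqrt(n)  # standard error of the sample mean: 10/10 = 1

def p_value(xbar):
    """P(sample mean >= xbar) when the cooling system works (mu = 150)."""
    z = (xbar - mu0) / se
    return 0.5 * math.erfc(z / math.sqrt(2))

print(se)                        # 1.0
print(round(p_value(152.0), 4))  # a mean 2 SEs above 150 -> p of about 0.023
```

So an observed mean of 152 would sit 2 standard errors above 150, an outcome expected only about 2% of the time were the cooling system working as intended.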

Categories: 2025 leisurely cruise, severe tests, severity function, water plant accident | Leave a comment

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: (3.2)

Neyman & Pearson

November Cruise: 3.2

This third of November’s stops in the leisurely cruise of SIST aligns well with my recent BJPS paper Severe Testing: Error Statistics vs Bayes Factor Tests.  In tomorrow’s zoom, 11 am New York time, we’ll have an overview of the topics in SIST so far, as well as a discussion of this paper. (If you don’t have a link, and want one, write to me at error@vt.edu). 

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined: Continue reading

Categories: 2024 Leisurely Cruise, E.S. Pearson, Neyman, statistical tests | Leave a comment

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

November Cruise

This second excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments.

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Continue reading

Categories: 2025 leisurely cruise, SIST, Statistical Inference as Severe Testing | 2 Comments

November: The leisurely tour of SIST continues

2025 Cruise

We continue our leisurely tour of Statistical Inference as Severe Testing [SIST] (Mayo 2018, CUP) with Excursion 3. This is based on my 5 seminars at the London School of Economics in 2020; I include slides and video for those who are interested. (use the comments for questions) Continue reading

Categories: 2025 leisurely cruise, significance tests, Statistical Inference as Severe Testing | 1 Comment

Severity and Adversarial Collaborations (i)

In the 2025 November/December issue of American Scientist, a group of authors (Ceci, Clark, Jussim and Williams 2025) argue in “Teams of rivals” that “adversarial collaborations offer a rigorous way to resolve opposing scientific findings, inform key sociopolitical issues, and help repair trust in science”. With adversarial collaborations, a term coined by Daniel Kahneman (2003), teams of divergent scholars, interested in uncovering what is the case (rather than endlessly making their case) design appropriately stringent tests to understand–and perhaps even resolve–their disagreements. I am pleased to see that in describing such tests the authors allude to my notion of severe testing (Mayo 2018)*:

Severe testing is the related idea that the scientific community ought to accept a claim only after it surmounts rigorous tests designed to find its flaws, rather than tests optimally designed for confirmation. The strong motivation each side’s members will feel to severely test the other side’s predictions should inspire greater confidence in the collaboration’s eventual conclusions. (Ceci et al., 2025)

1. Why open science isn’t enough Continue reading

Categories: severity and adversarial collaborations | 5 Comments

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)

Third Stop

Readers: With this third stop we’ve covered Tour 1 of Excursion 1.  My slides from the first LSE meeting in 2020 which dealt with elements of Excursion 1 can be found at the end of this post. There’s also a video giving an overall intro to SIST, Excursion 1. It’s noteworthy to consider just how much things seem to have changed in just the past few years. Or have they? What would the view from the hot-air balloon look like now?  Share your thoughts in the comments.

ZOOM: I propose a zoom meeting for Sunday, November 16 at 11 am or Friday, November 21 at 11 am, New York time. (An equal # prefer Fri & Sun.) The link will be available to those who register/registered with Dr. Miller*.

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

Continue reading

Categories: 2025 leisurely cruise, Statistical Inference as Severe Testing | Leave a comment

The ASA Sir David R. Cox Foundations of Statistics Award is now annual

15 July 1924 – 18 January 2022

The Sir David R. Cox Foundations of Statistics Award will now be given annually by the American Statistical Association (ASA), thanks to generous contributions by “Friends” of David Cox, solicited on this blog!*

Nominations for the 2026 Sir David R. Cox Foundations of Statistics Award are due on November 1, 2025, and require the following:

  • Nomination letter
  • Candidate’s CV
  • Two letters of support, not to exceed two pages each

Continue reading

Categories: Sir David Cox, Sir David Cox Foundations of Statistics Award | Leave a comment

Excursion 1 Tour I (2nd Stop): Probabilism, Performance, and Probativeness (1.2)

Readers: Last year at this time I gave a Neyman seminar at Berkeley and posted on a panel discussion we had. There were lots of great questions, and follow-ups. Here’s a link.

“I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth”. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Error Statistics | Leave a comment

2025 (1) The leisurely cruise begins: Excerpt from Excursion 1 Tour I of Statistical Inference as Severe Testing (SIST)

Ship Statinfasst

Excerpt from Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

NOTE: The following is an excerpt from my book: Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP, 2018). For any new reflections or corrections, I will use the comments. The initial announcement is here (including how to join).

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing | Leave a comment

2025 Leisurely cruise through Statistical Inference as Severe Testing: First Announcement

Ship Statinfasst

We’re embarking on a leisurely cruise through the highlights of Statistical Inference as Severe Testing [SIST]: How to Get Beyond the Statistics Wars (CUP 2018) this fall (Oct-Jan), following the 5 seminars I led for a 2020 London School of Economics (LSE) Graduate Research Seminar. It had to be run online due to Covid (as were the workshops that followed). Unlike last fall, this time I will include some zoom meetings on the material, as well as new papers and topics of interest to attendees. In this relaxed (self-paced) journey, excursions that had been covered in a week will be spread out over a month [i] and I’ll be posting abbreviated excerpts on this blog. Look for the posts marked with the picture of ship StatInfAsSt. [ii]  Continue reading

Categories: 2024 Leisurely Cruise, Announcement | Leave a comment

My BJPS paper: Severe Testing: Error Statistics versus Bayes Factor Tests

In my new paper, “Severe Testing: Error Statistics versus Bayes Factor Tests”, now out online at The British Journal for the Philosophy of Science, I “propose that commonly used Bayes factor tests be supplemented with a post-data severity concept in the frequentist error statistical sense”. But how? I invite your thoughts on this and any aspect of the paper.* (You can read it here.)

I’m pasting down the abstract and the introduction. Continue reading

Categories: Bayesian/frequentist, Likelihood Principle, multiple testing | 4 Comments

Are We Listening? Part II of “Sennsible significance” Commentary on Senn’s Guest Post

This is Part II of my commentary on Stephen Senn’s guest post, Be Careful What You Wish For. In this follow-up, I take up two topics:

(1) A terminological point raised in the comments to Part I, and
(2) A broader concern about how a popular reform movement reinforces precisely the mistaken construal Senn warns against.

But first, a question—are we listening? Because what underlies what Senn is saying is subtle, and yet what’s at stake is quite important for today’s statistical controversies. It’s not just a matter of which of four common construals is most apt for the population effect we wish to have high power to detect.[1] As I hear Senn, he’s also flagging a misunderstanding that allows some statistical reformers to (wrongly) dictate what statistical significance testers “wish” for in the first place. Continue reading

Categories: clinical relevance, power, reforming the reformers, S. Senn | 5 Comments
