Author Archives: Mayo

Posted on May 11, 2026 by Mayo

In giving some informal remarks about power at a seminar a couple of weeks ago, I proposed that the tendency to turn the notion of power on its head might be avoided by imagining that we need to define a test’s error probabilities in terms of its power alone. We can refer to the power against the null hypothesis, rather than alluding to a type 1 error probability, for example. What do I mean by turning power on its head? I mean, at least here, supposing that a test provides poor evidence of discrepancies that the test has low power to detect.

This grows out of the assumption that a statistically significant result only provides good evidence of discrepancies (from a null hypothesis) that the test has reasonably high power to detect. But these claims actually reverse what is the case about power and warranted (population) discrepancies. They turn power on its head.

To remind us, the goal of this statistical significance test is to assess the compatibility of data with a reference or null hypothesis, such as to see if the value of the test statistic D indicates a genuine positive (population) discrepancy from 0. The tester may go on to consider the evidence for various other positive discrepancies as well. For simplicity, consider testing H₀: µ ≤ 0 vs H₁: µ > 0 with known SE. I will use some numbers from a guest blog post by Stephen Senn discussing the interpretation of tests in clinical trials:

For simplicity, allow the cut-off to be 2 SE, rather than 1.96 SE. Write the cut-off for rejecting the null as D*, which in Senn’s example is .7. So we have SE ≈ .35*. The power of the test against different values of µ doesn’t require knowing the true value of µ; there is a power function. The test is falsificationist, and uses hypothetical reasoning. The power of this test against µ’ is the probability D exceeds D* (.7) computed under the assumption that µ = µ’. Write this as POW(µ’).
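The power function just described can be sketched in a few lines. This is only an illustration under stated assumptions: D is taken to be Normally distributed around µ with SE exactly .35, and the names `power`, `SE` and `D_STAR` are mine, not Senn’s.

```python
from statistics import NormalDist

SE = 0.35      # standard error of D (assumed to be exactly .35 for illustration)
D_STAR = 0.7   # cut-off for rejecting the null, i.e. 2 * SE

def power(mu_prime, cutoff=D_STAR, se=SE):
    """POW(mu') = Pr(D > D*; mu = mu'), assuming D ~ Normal(mu', se)."""
    return 1.0 - NormalDist(mu=mu_prime, sigma=se).cdf(cutoff)

if __name__ == "__main__":
    # Discrepancies of 0, .5 SE, 1 SE, 2 SE, and Senn's Delta = 1
    for mu in (0.0, 0.175, 0.35, 0.7, 1.0):
        print(f"POW({mu:.3f}) = {power(mu):.3f}")
```

Nothing in the computation requires knowing the true µ: each call simply asks how often D would exceed D* were µ equal to the hypothesized µ’.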

Tests, particularly in clinical trials, are often specified to have high probability, .8 or .9, of detecting a discrepancy from the null that “we would not like to miss”. To “miss” means the test does not set off the “significance alarm”, that is, the result is statistically insignificant. Senn’s example stipulates that the population discrepancy we would really hate to miss is ∆ = 1. This means that were the population ∆ = 1 or higher, then we want there to be a high probability that the value of the sample D will exceed D*.

Note: I use the word “discrepancy” in alluding to population effect sizes and “difference” to refer to observed differences. I’m deliberately calling ∆ “the discrepancy we would really hate to miss” because “the discrepancy we would not like to miss” is often interpreted in a weaker manner than intended. In particular, it is often construed as the smallest discrepancy of interest. But this minimal discrepancy of interest would be smaller than ∆. [1] See also my commentary on Senn’s post:

Let’s now turn to a test H₀: µ ≤ 0 vs H₁: µ > 0.

(1) The power at the null is α. Note that POW(0) = .025 (more like .023)

Let’s assume for the moment that D just makes it to the cut-off D* for rejection. Then POW(0) is also equal to the significance level for the outcome. Here’s the logic of statistical significance tests using power, and D=D*:

(2) If D is just statistically significant, and its statistical significance level is low, then D indicates µ >0.

(2) is equivalent to (2)’:

(2)’ If POW(0) is low, then D* indicates µ >0.

Of course, indications need to be supplemented by audits of assumptions, checks of biasing selection effects, and ideally, replication. But we must first make out the intended logic of tests, under the presumption that the assumptions hold approximately, and separately audit them.

(3) If it would be difficult for the test to generate a D as large as D* if µ = 0, and yet we observe D*, then it indicates it was generated by a µ that exceeds 0.

The assertion in (3) holds not just for the null but for discrepancies from 0. Now a critic of tests might note: “But your test also has rather low power to detect positive discrepancies close to 0. For example:

POW(.5 SE) = .07. [i.e., POW(.17) = .07.]”

To which a tester would respond: Yes, and I can similarly infer my D* indicates µ > .17. I reason as follows: were µ ≤ .17, then 93% of the time I’d get a smaller D than I did. That’s the logic of testing. Note too that the P-value is .07, and the lower confidence interval µ > .17 has confidence level .93.

A critic might continue: “But your test also has rather low power to detect positive discrepancies of 1 SE.

POW(1 SE) = .16! [i.e., POW(.35) = .16.]”

To which a tester could respond: Yes, and I therefore have a weak indication that µ > .35. The P-value is .16, and the lower confidence interval µ > .35 has confidence level .84.
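The link between low power and the lower confidence bounds in these replies can be checked numerically. The following is a rough sketch, assuming D ~ Normal(µ, SE) with SE = .35 and an observed D equal to the cut-off .7; the function names are hypothetical.

```python
from statistics import NormalDist

SE = 0.35
D_OBS = 0.7   # observed difference, taken to sit exactly at the cut-off D*

def confidence_for(mu_prime, d_obs=D_OBS, se=SE):
    """Confidence level of the claim mu > mu_prime: Pr(D < d_obs; mu = mu_prime)."""
    return NormalDist(mu=mu_prime, sigma=se).cdf(d_obs)

def lower_bound(confidence, d_obs=D_OBS, se=SE):
    """One-sided lower confidence bound at the given level: d_obs - z * se."""
    z = NormalDist().inv_cdf(confidence)
    return d_obs - z * se

if __name__ == "__main__":
    for mu in (0.175, 0.35):   # .5 SE and 1 SE
        level = confidence_for(mu)
        print(f"mu > {mu}: confidence {level:.2f}, P-value {1 - level:.2f}")
```

With D at the cut-off, the confidence level for µ > µ’ is 1 − POW(µ’), which is why the lower the power against µ’, the stronger the indication that µ exceeds µ’.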

And she could go on to note: I clearly do not have evidence that µ exceeds those values against which the test has high power! Even to infer, on grounds that POW(.7) = .5, that my observing D* indicates µ > .7 would be wrong 50% of the time!
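A small simulation makes the “wrong 50% of the time” remark concrete. It is a sketch only, assuming D ~ Normal(µ, .35); the function name, trial count and seed are arbitrary choices of mine.

```python
import random

SE = 0.35
D_STAR = 0.7

def rejection_rate(mu, trials=100_000, seed=1):
    """Fraction of simulated D values reaching the cut-off when the true mean is mu."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(mu, SE) >= D_STAR for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    # A rule that infers "mu > .7" whenever D reaches D* errs, when mu = .7,
    # exactly as often as the test rejects there.
    print("error rate of inferring mu > .7 when mu = .7:", rejection_rate(0.7))
    print("error rate of inferring mu > 0  when mu = 0: ", rejection_rate(0.0))
```

The first rate comes out near one half and the second near .023, so the same observed D* licenses µ > 0 but not µ > .7.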

I hope it is now clear why the bold phrases at the outset turn power on its head, in relation to statistical significance tests. Senn would not say a statistically significant result is fairly good evidence that µ > 1, on the grounds that POW(1) = .8. Yet you will sometimes see medical researchers and spokespeople claim literally this. What we can correctly say is:

(4) If it would be improbable for the test to generate a D > D* were µ < µ0, and yet I observe D*, then D is an indication it was generated by a µ that exceeds µ0.

However, there is a different assertion that has a superficial resemblance to the ones I am pointing to as reversing power, and that other assertion can hold true. I discuss it in my next post. (I promise not to wait a month to write it!)

Share your questions and remarks in the comments to this post.

[1] Other construals: the minimum value of D we hope to observe, the smallest discrepancy we’d like to learn about, or still others. See this earlier Senn post

Categories: power | Leave a comment

Error and the Growth of Experimental Knowledge cover: 30 years ago

Posted on April 1, 2026 by Mayo

30 years ago today, Chicago Press sent me a draft version of this cover for Error and the Growth of Experimental Knowledge for my approval (except the fuchsia and mustard in “ERROR” were switched). At first I thought it was so cartoony that it might be an April 1 joke! I had sent them a picture I drew (now in the preface), but they didn’t think that worked for a cover. They were right. It’s a fabulous cover!

To access EGEK.

Categories: Error and the Growth of Experimental Knowledge | 4 Comments

Comments on “The ASA p-value statement 10 years on” (ii)

Posted on March 26, 2026 by Mayo

Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Editor’s editorial in The American Statistician (TAS), the 2020 ASA (President’s) Task Force, and the various casualties of the related teeth pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here, 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting which gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Editor’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying relationships between reactions. Here’s how Matthews’ article begins: Continue reading →

Categories: abandon statistical significance, ASA Task Force on Significance and Replicability, P-values, significance tests, stat wars and their casualties | 26 Comments

Power and Severity with nonsignificant results: more power puzzles? (ii)

Posted on March 14, 2026 by Mayo

The concept of a test’s power, originating in Neyman-Pearson’s early work, by and large, is a pre-data concept for purposes of specifying a test (notably, determining worthwhile sample size), and choosing between tests. In some papers, however, Neyman lists a third goal for power: to interpret test results post data much in the spirit of what is often called “power analysis”. This is to determine the discrepancy from a null hypothesis that may be ruled out, given nonsignificant results. One example is in a paper “The Problem of Inductive Inference” (Neyman 1955)–already a surprising title for behaviorist Neyman. The reason I’m bringing this up is that it has direct bearing on some of today’s most puzzling (and problematic) post-data uses of power. Interestingly, in that 1955 paper, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H₀ is true of the particular data set? (Neyman, pp 40-41).

Neyman continues: Continue reading →

Categories: Neyman's Nursery, power analysis | Tags: negative result, Neyman, power, severe testing | Leave a comment

Continuing the blizzard of 26 power puzzles

Posted on March 3, 2026 by Mayo

The mayor of NYC offered $30 an hour to help shovel the ~ 30 inches of snow that fell last Sunday and Monday. From what I hear, it was a very effective program. Here’s a little power puzzle to very easily shovel through [1]

Suppose you are reading about a result x that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H₀: µ ≤ 0 against H₁: µ > 0. I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’. (i.e., there’s poor evidence that µ > µ’ ). I am keeping symbols as simple as possible. *See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined? Continue reading →

Categories: blizzard of 26 power puzzles, power, reforming the reformers | 1 Comment

A Blizzard of Power Puzzles Replicate in Meta-Research

Posted on March 1, 2026 by Mayo

I often say that the most misunderstood concept in error statistics is power. One week ago, stuck in the blizzard of 2026 in NYC —exciting, if also a bit unnerving, with airports closed for two and a half days and no certainty of when I might fly out—I began collecting the many power howlers I’ve discussed in the past, because some of them are being replicated in today’s meta-research about replication failure! Apparently, mistakes about statistical concepts replicate quite reliably—even when statistically significant effects do not. Others I find in medical reports of clinical trials of treatments I’m trying to evaluate in real life! Here’s one variant: A statistically significant result in a clinical trial with fairly high (e.g., .8) power to detect an impressive improvement δ’ is taken as good evidence of its impressive improvement δ’. Often the high power of .8 is even used as a (posterior) probability of the hypothesis of improvement being δ’. [0] If these do not immediately strike you as fallacious, compare:

If the house is fully ablaze, then very probably the fire alarm goes off.
If the fire alarm goes off, then very probably the house is fully ablaze.

The first bullet is saying the fire alarm has high power to detect the house being fully ablaze. It does not mean the converse in the second bullet. Continue reading →
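To see how far apart the two fire-alarm statements can be, here is a toy calculation. All three input numbers are invented purely for illustration (they are not from the post), and the sketch concerns ordinary events like fires and alarms, not probabilities assigned to statistical hypotheses.

```python
def prob_ablaze_given_alarm(p_ablaze, p_alarm_if_ablaze, p_alarm_if_not):
    """Pr(ablaze | alarm) from Pr(alarm | ablaze), via the law of total probability."""
    p_alarm = p_ablaze * p_alarm_if_ablaze + (1 - p_ablaze) * p_alarm_if_not
    return p_ablaze * p_alarm_if_ablaze / p_alarm

if __name__ == "__main__":
    # Hypothetical values: the alarm nearly always sounds in a full blaze (.99),
    # but also sounds on 5% of ordinary nights (burnt toast), and full blazes are rare.
    reverse = prob_ablaze_given_alarm(
        p_ablaze=0.001, p_alarm_if_ablaze=0.99, p_alarm_if_not=0.05
    )
    print(f"Pr(alarm | ablaze) = 0.99, yet Pr(ablaze | alarm) = {reverse:.3f}")
```

Under these made-up numbers the alarm has very high “power” to detect a blaze, while an alarm going off only rarely signals one.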

Categories: blizzard of 26, power, SIST, statistical significance tests | Tags: misunderstanding power, power analysts, Ziliac & McCloskey | 11 Comments

Leisurely Cruise February 2026: power, shpower, positive predictive value

Posted on February 12, 2026 by Mayo

2025-6 Leisurely Cruise

The following is the February stop of our leisurely cruise (meeting 6 from my 2020 Seminar at the LSE). There was a guest speaker, Professor David Hand. Slides and videos are below. Ship StatInfasSt may head back to port or continue for an additional stop or two, if there is interest. Although I often say on this blog that the classical notion of power, as defined by Neyman and Pearson, is one of the most misunderstood notions in stat foundations, I did not know, in writing SIST, just how ingrained those misconceptions would become. I’ll write more on this in my next post. (The following is from SIST pp. 354-356; the pages are provided below.)

Shpower and Retrospective Power Analysis

It’s unusual to hear books condemn an approach in a hush-hush sort of way without explaining what’s so bad about it. This is the case with something called post hoc power analysis, practiced by some who live on the outskirts of Power Peninsula. Psst, don’t go there. We hear “there’s a sinister side to statistical power, … I’m referring to post hoc power” (Cumming 2012, pp. 340-1), also called observed power and retrospective (retro) power. I will be calling it shpower analysis. It distorts the logic of ordinary power analysis (from insignificant results). The “post hoc” part comes in because it’s based on the observed results. The trouble is that ordinary power analysis is also post-data. The criticisms are often wrongly taken to reject both. Continue reading →

Categories: 2025-2026 Leisurely Cruise, power | Leave a comment

Severe testing of deep learning models of cognition (ii)

Posted on January 29, 2026 by Mayo

From time to time I hear of an application of the severe testing philosophy in intriguing ways in fields I know very little about. An example is a recent article by cognitive psychologist Jeffrey Bowers and colleagues (2023): “On the importance of severely testing deep learning models of cognition” (abstract below). Because deep neural networks (DNNs)–advanced machine learning models–seem to recognize images of objects at a similar or even better rate than humans, many researchers suppose DNNs learn to recognize objects in a way similar to humans. However, Bowers and colleagues argue that, on closer inspection, the evidence is remarkably weak, and “in order to address this problem, we argue that the philosophy of severe testing is needed”.

The problem is this. Deep learning models, after all, consist of millions of (largely uninterpretable) parameters. Without understanding how the black box model moves from inputs to outputs, it’s easy to see why observed correlations can easily occur even where the DNN output is due to a variety of factors other than using a similar mechanism as the human visual system. From the standpoint of severe testing, this is a familiar mistake. For data to provide evidence for a claim, it does not suffice that the claim agrees with data, the method must have been capable of revealing the claim to be false, (just) if it is. Here the type of claim of interest is that a given algorithmic model uses similar features or mechanisms as humans to categorize images.[1] The problem isn’t the engineering one of getting more accurate algorithmic models, the problem is inferring claim C: DNNs mimic human cognition in some sense (they focus on vision), even though C has not been well probed. Continue reading →

Categories: severity and deep learning models | 5 Comments

(JAN #2) Leisurely cruise January 2026: Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

Posted on January 16, 2026 by Mayo

2025-26 Cruise

Our second stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour II which you can read here. This criticism of statistical significance tests takes a number of forms. Here I consider the best known. The bottom line is that one should not suppose that quantities measuring different things ought to be equal. At the bottom you will see links to posts discussing this issue, each with a large number of comments. The comments from readers are of interest! We will have a zoom meeting Fri Jan 23 11AM ET on these last two posts.* If you want to join us, contact us.

getting beyond…

Excerpt from Excursion 4 Tour II*

4.4 Do P-Values Exaggerate the Evidence? Continue reading →

Categories: 2026 Leisurely Cruise, frequentist/Bayesian, P-values | Leave a comment

(JAN #1) Leisurely Cruise January 2026: Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

Posted on January 8, 2026 by Mayo

2025-26 Cruise

Our first stop in 2026 on the leisurely tour of SIST is Excursion 4 Tour I which you can read here. I hope that this will give you the chutzpah to push back in 2026, if you hear that objectivity in science is just a myth. This leisurely tour may be a bit more leisurely than I intended, but this is philosophy, so slow blogging is best. (Plus, we’ve had some poor sailing weather). Please use the comments to share thoughts.

Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276) [i]

Continue reading →

Categories: 2026 Leisurely Cruise, objectivity, Statistical Inference as Severe Testing | Leave a comment

Midnight With Birnbaum: Happy New Year 2026!

Posted on December 31, 2025 by Mayo

Anyone here remember that old Woody Allen movie, “Midnight in Paris,” where the main character (I forget who plays it, I saw it on a plane), a writer finishing a novel, steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, ever since I began this blog in 2011, I imagine being picked up in a mysterious taxi at midnight on New Year’s Eve, and lo and behold, find myself in the 1960s New York City, in the company of Allan Birnbaum who is looking deeply contemplative, perhaps studying his 1962 paper…Birnbaum reveals some new and surprising twists this year! [i]

(The pic on the left is the only blurry image I have of the club I’m taken to.) It has been a decade since I published my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), which includes commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. Not only does the (Strong) Likelihood Principle (LP or SLP) remain at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general, but a decade after my 2014 paper, it is more central than ever–even if it is often unrecognized.

OUR EXCHANGE:

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to have published on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!) Continue reading →

Categories: Birnbaum, CHAT GPT, Likelihood Principle, Sir David Cox | Leave a comment

For those who want to binge read the (Strong) Likelihood Principle in 2025

Posted on December 30, 2025 by Mayo

David Cox’s famous “weighing machine” example from my last post is thought to have caused “a subtle earthquake” in foundations of statistics. It’s been 11 years since I published my Statistical Science article on this, Mayo (2014), which includes several commentators, but the issue is still mired in controversy. It’s generally dismissed as an annoying, mind-bending puzzle on which those in statistical foundations tend to hold absurdly strong opinions. Mostly it has been ignored. Yet I sense that 2026 is the year that people will return to it again. It’s at least touched upon in Roderick Little’s new book (pic below). This post gives some background, and collects the essential links that you would need if you want to delve into it. Many readers know that each year I return to the issue on New Year’s Eve…. But that’s tomorrow.

By the way, this is not part of our leisurely tour of SIST. In fact, the argument is not even in SIST, although the SLP (or LP) arises a lot. But if you want to go off the beaten track with me to the SLP conundrum, here’s your opportunity. Continue reading →

Categories: 11 years ago, Likelihood Principle | Leave a comment

67 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

Posted on December 29, 2025 by Mayo

2025-26 Cruise

We’re stopping to consider one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018). It is now 67 years since Cox gave his famous weighing machine example in Sir David Cox (1958)[1]. It will play a vital role in our discussion of the (strong) Likelihood Principle later this week. The excerpt is from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. Continue reading →
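The arithmetic behind the punchline is short enough to spell out. This is just a sketch of the numbers in the joke, using exact fractions; the variable names are my own.

```python
from fractions import Fraction

P_PICK_GOOD = Fraction(1, 2)   # a fair coin chooses which scale to use
RIGHT_GOOD = Fraction(1)       # the scale that's right 100% of the time
RIGHT_POOR = Fraction(1, 2)    # the scale that's right only half the time

# Averaging over the coin flip, i.e. over the scale she might have used but didn't
unconditional = P_PICK_GOOD * RIGHT_GOOD + (1 - P_PICK_GOOD) * RIGHT_POOR

# Conditioning on the scale she knows she actually used
conditional_on_poor = RIGHT_POOR

if __name__ == "__main__":
    print("averaged over both scales:", unconditional)
    print("given the lousy scale was used:", conditional_on_poor)
```

The 75% figure is a correct average over the coin flip; the joke turns on reporting it for a measurement known to have come from the half-right scale.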

Categories: 2025 leisurely cruise, Birnbaum, Likelihood Principle | Leave a comment

(DEC #2) December Leisurely Tour Meeting 3: SIST Excursion 3 Tour III

Posted on December 19, 2025 by Mayo

2025-26 Cruise

We are now at the second stop on our December leisurely cruise through SIST: Excursion 3 Tour III. I am pasting the slides and video from this session during the LSE Research Seminars in 2020 (from which this cruise derives). (Remember it was early pandemic, and we weren’t so adept with zooming.) The Higgs discussion clarifies (and defends) a somewhat controversial interpretation of p-values. (If you’re interested in the Higgs discovery, there’s a lot more on this blog you can find with the search.) I am not sure if I would include the section on “capability and severity” were I to write a second edition, though I would keep the duality of tests and CIs. My goal was to expose a fallacy that is even more common nowadays, but I would have placed a revised version later in the book. Share your remarks in the comments.

III. Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery: Continue reading →

Categories: 2025 leisurely cruise, confidence intervals and tests | Leave a comment

December leisurely cruise “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)

Posted on December 14, 2025 by Mayo

2025-26 Cruise

Welcome to the December leisurely cruise:
Wherever we are sailing, assume that it’s warm, warm, warm (not like today in NYC). This is an overview of our first set of readings for December from my Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018): [SIST]–Excursion 3 Tour II. This leisurely cruise, participants know, is intended to take a whole month to cover one week of readings from my 2020 LSE Seminars, except for December and January which double up.

What do you think of “3.6 Hocus-Pocus: P-values Are Not Error probabilities, Are Not Even Frequentist”? This section refers to Jim Berger’s famous attempted unification of Jeffreys, Neyman and Fisher in 2003. The unification considers testing 2 simple hypotheses using a random sample from a Normal distribution, computing their two P-values, rejecting whichever gets a smaller P-value, and then computing its posterior probability, assuming each gets a prior of .5. This becomes what he calls the “Bayesian error probability” upon which he defines “the frequentist principle”. On Berger’s reading of an important paper* by Neyman (1977), Neyman criticized p-values for violating the frequentist principle (SIST p. 186). *The paper is “Frequentist Probability and Frequentist Statistics”. Remember that links to readings outside SIST are at the Captain’s biblio on the top left of the blog. Share your thoughts in the comments.

Some snapshots from Excursion 3 tour II.

Continue reading →

Categories: 2025 leisurely cruise | Leave a comment

Modest replication probabilities of p-values–desirable, not regrettable: a note from Stephen Senn

Posted on December 3, 2025 by Mayo

You will often hear—especially in discussions about the “replication crisis”—that statistical significance tests exaggerate evidence. Significance testing, we hear, inflates effect sizes, inflates power, inflates the probability of a real effect, or inflates the probability of replication, and thereby misleads scientists.

If you look closely, you’ll find the charges are based on concepts and philosophical frameworks foreign to both Fisherian and Neyman–Pearson hypothesis testing. Nearly all have been discussed on this blog or in SIST (Mayo 2018), but new variations have cropped up. The emphasis that some are now placing on how biased selection effects invalidate error probabilities is welcome, but I say that the recommendations for reinterpreting quantities such as p-values and power introduce radical distortions of error statistical inferences. Before diving into the modern incarnations of the charges it’s worth recalling Stephen Senn’s response to Stephen Goodman’s attempt to convert p-values into replication probabilities nearly 20 years ago (“A Comment on Replication, P-values and Evidence,” Statistics in Medicine). I first blogged it in 2012, here. Below I am pasting some excerpts from Senn’s letter (but readers interested in the topic should look at all of it), because Senn’s clarity cuts straight through many of today’s misunderstandings.

Continue reading →

Categories: 13 years ago, p-values exaggerate, replication research, S. Senn | Tags: Evidence-based medicine, p-value vs posterior, significance tests, Stephen Senn | 8 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

Posted on November 28, 2025 by Mayo

November Cruise

The example I use here to illustrate formal severity comes in for criticism in a paper to which I reply in a 2025 BJPS paper linked to here. Use the comments for queries.

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean x̄ computed, each with a known standard deviation σ = 10. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²) then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1). Continue reading →
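The deduction that the sample mean has standard deviation σ/√n = 1 can be checked by brute force. The sketch below assumes independent Normal measurements with the stated μ = 150 and σ = 10; the simulation size and seed are arbitrary choices of mine.

```python
import math
import random
import statistics

MU, SIGMA, N = 150.0, 10.0, 100

# Deduced standard deviation of the sample mean: sigma / sqrt(n)
se_mean = SIGMA / math.sqrt(N)

def simulated_sd_of_means(reps=2000, seed=0):
    """Draw `reps` samples of N measurements each and return the SD of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(N))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

if __name__ == "__main__":
    print("deduced SD of the sample mean:  ", se_mean)
    print("simulated SD of the sample mean:", round(simulated_sd_of_means(), 3))
```

Individual measurements vary by about 10 degrees, but the simulated 100-fold means vary by only about 1 degree, as deduced.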

Categories: 2025 leisurely cruise, severe tests, severity function, water plant accident | Leave a comment

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: (3.2)

Posted on November 20, 2025 by Mayo

Neyman & Pearson

November Cruise: 3.2

This third of November’s stops in the leisurely cruise of SIST aligns well with my recent BJPS paper Severe Testing: Error Statistics vs Bayes Factor Tests. In tomorrow’s zoom, 11 am New York time, we’ll have an overview of the topics in SIST so far, as well as a discussion of this paper. (If you don’t have a link, and want one, write to me at error@vt.edu).

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H₀ in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H₀ which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined: Continue reading →

Categories: 2024 Leisurely Cruise, E.S. Pearson, Neyman, statistical tests | Leave a comment

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

Posted on November 13, 2025 by Mayo

November Cruise

This second excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments.

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Continue reading →

Categories: 2025 leisurely cruise, SIST, Statistical Inference as Severe Testing | 2 Comments

November: The leisurely tour of SIST continues

Posted on November 7, 2025 by Mayo

2025 Cruise

We continue our leisurely tour of Statistical Inference as Severe Testing [SIST] (Mayo 2018, CUP) with Excursion 3. This is based on my 5 seminars at the London School of Economics in 2020; I include slides and video for those who are interested. (use the comments for questions) Continue reading →

Categories: 2025 leisurely cruise, significance tests, Statistical Inference as Severe Testing | 1 Comment

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2025. All Rights Reserved.