Monthly Archives: November 2025

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

November Cruise

The example I use here to illustrate formal severity comes in for criticism in a paper to which I reply in a 2025 BJPS paper linked to here. Use the comments for queries.

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident) 

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times, each with a known standard deviation σ = 10, and the sample mean X̄ computed. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²), then X̄ is also Normal with mean μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1).
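Not part of SIST, but for readers who like to check such derivations numerically, here is a minimal Python sketch (the σ, sample size, and seed are just the values assumed above) that simulates many 100-fold samples and confirms that the standard deviation of X̄ is close to σ/√n = 1:

```python
import numpy as np

rng = np.random.default_rng(123)     # arbitrary seed, for reproducibility
mu, sigma, n = 150.0, 10.0, 100      # the values assumed in the example

# 50,000 simulated rounds of 100 measurements each, and each round's sample mean
xbars = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)

print(xbars.mean())   # close to 150: the sampling distribution is centered at mu
print(xbars.std())    # close to 1: the standard error sigma/sqrt(n) = 10/10
```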

It is the distribution of X̄ that is the relevant sampling distribution here. Because it’s a large random sample, the sampling distribution of X̄ is Normal or approximately so, thanks to the Central Limit Theorem. Note that the mean of the sampling distribution of X̄ is the same as the underlying mean: both are μ. The frequency link was created by randomly selecting the sample, and we assume for the moment it was successful. Suppose they are testing:

H0: μ ≤ 150 vs. H1: μ > 150.

The test rule for α = 0.025 is:

Reject H0 iff X̄ > 150 + cα σ/√100 = 150 + 1.96(1) = 151.96,
since cα = 1.96.

For simplicity, let’s go to the 2-standard error cut-off for rejection:

Reject H0 (infer there’s an indication that μ > 150) iff X̄ ≥ 152.

The test statistic d(X) is a standard Normal variable: Z = √100(X̄ – 150)/10 = X̄ – 150, which, for x̄ = 152, is 2. The area to the right of 2 under the standard Normal is around 0.025.
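Again purely as a numerical check (my own sketch, not anything from the book), the cut-off, the observed test statistic, and the corresponding P-value can be computed in a few lines of Python using scipy:

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 150.0, 10.0, 100, 0.025
se = sigma / sqrt(n)                       # standard error of the sample mean = 1

cutoff = mu0 + norm.ppf(1 - alpha) * se    # 150 + 1.96(1) = 151.96
z = (152 - mu0) / se                       # d(x0) = 2 at the observed mean 152
p_value = norm.sf(z)                       # area to the right of 2, about 0.023

print(round(cutoff, 2), z, round(p_value, 3))
```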

Now we begin to move beyond the strict N-P interpretation. Say x̄ is just significant at the 0.025 level (x̄ = 152). What warrants taking the data as indicating μ > 150 is not that they’d rarely be wrong in repeated trials on cooling systems by acting this way, even though that’s true. There’s a good indication that it’s not in compliance right now. Why? The severity rationale: were the mean temperature no higher than 150, then over 97% of the time their method would have resulted in a lower mean temperature than observed. Were it clearly in the safe zone, say μ = 149 degrees, a lower observed mean would be even more probable. Thus, x̄ = 152 indicates some positive discrepancy from H0 (though we don’t consider it rejected by a single outcome). They’re going to take another round of measurements before acting. In the context of a policy action, to which this indication might lead, some type of loss function would be introduced. Here we’re just considering the evidence, based on these measurements; all for illustrative purposes.

Severity Function:

I will abbreviate “the severity with which claim C passes test T with data x“:

SEV(test T, outcome x, claim C).

Reject/Do Not Reject: will be interpreted inferentially, in this case as an indication or evidence of the presence or absence of discrepancies of interest.

Let us suppose we are interested in assessing the severity of C: μ > 153. I imagine this would be a full-on emergency for the ecosystem!

Reject H0. Suppose the observed mean is x̄ = 152, just at the cut-off for rejecting H0:

d(x0) = √100(152 – 150)/10 = 2.

The data reject H0 at level 0.025. We want to compute

SEV(T, x̄ = 152, C: μ > 153).

We may say: “the data accord with C: μ > 153,” that is, severity condition (S-1) is satisfied; but severity requires there to be at least a reasonable probability of a worse fit with C if C is false (S-2). Here, “worse fit with C” means x̄ ≤ 152 (i.e., d(x0) ≤ 2). Given it’s continuous, as with all the following examples, < or ≤ give the same result. The context indicates which is more useful. This probability must be high for C to pass severely; if it’s low, it’s BENT.

We need Pr(X̄ ≤ 152; μ > 153 is false). To say μ > 153 is false is to say μ ≤ 153. So we want Pr(X̄ ≤ 152; μ ≤ 153). But we need only evaluate severity at the point μ = 153, because this probability is even greater for μ < 153:

Pr(X̄ ≤ 152; μ = 153) = Pr(Z ≤ –1) = 0.16.

Here, Z = √100(152 – 153)/10 = –1. Thus SEV(T, x̄ = 152, C: μ > 153) = 0.16. Very low. Our minimal severity principle blocks μ > 153 because it’s fairly probable (84% of the time) that the test would yield an even larger mean temperature than we got, if the water samples came from a body of water whose mean temperature is 153. Table 3.1 gives the severity values associated with different claims, given x̄ = 152. Call tests of this form T+.

In each case, we are making inferences of the form μ > μ1 = 150 + γ, for different values of γ. To merely infer μ > 150, the severity is 0.97, since Pr(X̄ ≤ 152; μ = 150) = Pr(Z ≤ 2) = 0.97. While the data give an indication of non-compliance, μ > 150, to infer C: μ > 153 would be making mountains out of molehills. In this case, the observed difference just hit the cut-off for rejection. N-P tests leave things at that coarse level in computing power and the probability of a Type II error, but severity will take into account the actual outcome. Table 3.2 gives the severity values associated with different claims, given x̄ = 153.
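For readers who want to reproduce tables in this spirit themselves, here is a small Python sketch of the calculation SEV(μ > μ1) = Pr(X̄ ≤ x̄0; μ = μ1). It is my own illustration, not code from SIST; the values of γ shown are merely examples, and the printed 0.977 and 0.159 are what the text rounds to 0.97 and 0.16:

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 150.0, 10.0, 100
se = sigma / sqrt(n)   # standard error of the sample mean = 1

def sev_greater(xbar, mu1):
    """Severity for the claim mu > mu1, given observed mean xbar:
    Pr(Xbar <= xbar) computed at mu = mu1, the worst case over all
    the ways 'mu > mu1' could be false (i.e., over mu <= mu1)."""
    return norm.cdf((xbar - mu1) / se)

for xbar in (152, 153):
    for gamma in (0, 1, 2, 3):             # illustrative discrepancies from 150
        mu1 = mu0 + gamma
        print(f"xbar = {xbar}: SEV(mu > {mu1}) = {sev_greater(xbar, mu1):.3f}")
```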

If “the major criticism of the Neyman-Pearson frequentist approach” is that it fails to provide “error probabilities fully varying with the data,” as J. Berger alleges (2003, p. 6), then we’ve answered the major criticism.

Non-rejection. Now suppose x̄ = 151, so the test does not reject H0. The standard formulation of N-P (as well as Fisherian) tests stops there. But we want to be alert to a fallacious interpretation of a “negative” result: inferring there’s no positive discrepancy from μ = 150. No (statistical) evidence of non-compliance isn’t evidence of compliance; here’s why. We have (S-1): the data “accord with” H0, but what if the test had little capacity to have alerted us to discrepancies from 150? The alert comes by way of “a worse fit” with H0, namely a mean x̄ > 151*. Condition (S-2) requires us to consider Pr(X̄ > 151; μ = 150), which is only 0.16. To get this, standardize X̄ to obtain a standard Normal variate: Z = √100(151 – 150)/10 = 1; and Pr(X̄ > 151; μ = 150) = 0.16. Thus, SEV(T+, x̄ = 151, C: μ ≤ 150) is low: 0.16. Table 3.3 gives the severity values associated with different inferences of the form μ ≤ μ1 = 150 + γ, given x̄ = 151.

Can they at least say that x̄ = 151 is a good indication that μ ≤ 150.5? No: SEV(T+, x̄ = 151, C: μ ≤ 150.5) ≅ 0.3 [Z = (151 – 150.5)/1 = 0.5]. But x̄ = 151 is a good indication that μ ≤ 152 and that μ ≤ 153 (with severity values of 0.84 and 0.97, respectively).
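The non-rejection case is the mirror image: SEV(μ ≤ μ1) = Pr(X̄ > x̄0; μ = μ1), evaluated at the boundary point μ1 since this probability only grows for larger μ. Again a sketch of my own, with the μ1 values discussed above (the text rounds 0.977 to 0.97):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 150.0, 10.0, 100
se = sigma / sqrt(n)   # standard error of the sample mean = 1

def sev_leq(xbar, mu1):
    """Severity for the claim mu <= mu1, given observed mean xbar:
    Pr(Xbar > xbar) computed at mu = mu1, the boundary of the ways
    'mu <= mu1' could be false (the probability grows for mu > mu1)."""
    return norm.sf((xbar - mu1) / se)

xbar = 151
for mu1 in (150, 150.5, 152, 153):         # the claims considered above
    print(f"SEV(mu <= {mu1}) = {sev_leq(xbar, mu1):.3f}")
```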

You might say, assessing severity is no different from what we would do with a judicious use of existing error probabilities. That’s what the severe tester says. Formally speaking, it may be seen merely as a good rule of thumb to avoid fallacious interpretations. What’s new is the statistical philosophy behind it. We no longer seek either probabilism or performance, but rather the use of relevant error probabilities to assess and control severity.⁵

⁵ Initial developments of the severity idea were Mayo (1983, 1988, 1991, 1996). In Mayo and Spanos (2006, 2011), it was developed much further.

***

*NOTE: There is a typo in the book here; it has “–” rather than “>”.

You can find the beginning of this section (3.2), the development of N-P tests, in this post.

To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Where you are in the journey:

Excursion 3: Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests (p. 119)

3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test (p. 121)

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration (p. 131)

YOU ARE HERE: Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

3.3 How to Do All N-P Tests Do (and more) While a Member of the Fisherian Tribe (p. 146)

  • All excerpts and mementos (until June 2020) are here.
  • The full Itinerary (Table of Contents) is here.
Categories: 2025 leisurely cruise, severe tests, severity function, water plant accident

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: (3.2)

Neyman & Pearson

November Cruise: 3.2

This, the third of November’s stops in the leisurely cruise of SIST, aligns well with my recent BJPS paper Severe Testing: Error Statistics vs Bayes Factor Tests. In tomorrow’s zoom, 11 am New York time, we’ll have an overview of the topics in SIST so far, as well as a discussion of this paper. (If you don’t have a link and want one, write to me at error@vt.edu.)

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined: Continue reading

Categories: 2024 Leisurely Cruise, E.S. Pearson, Neyman, statistical tests

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

November Cruise

This second excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments.

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Continue reading

Categories: 2025 leisurely cruise, SIST, Statistical Inference as Severe Testing

November: The leisurely tour of SIST continues

2025 Cruise

We continue our leisurely tour of Statistical Inference as Severe Testing [SIST] (Mayo 2018, CUP) with Excursion 3. This is based on my five seminars at the London School of Economics in 2020; I include slides and video for those who are interested. (Use the comments for questions.) Continue reading

Categories: 2025 leisurely cruise, significance tests, Statistical Inference as Severe Testing

Severity and Adversarial Collaborations (i)


In the 2025 November/December issue of American Scientist, a group of authors (Ceci, Clark, Jussim and Williams 2025) argue in “Teams of rivals” that “adversarial collaborations offer a rigorous way to resolve opposing scientific findings, inform key sociopolitical issues, and help repair trust in science”. With adversarial collaborations, a term coined by Daniel Kahneman (2003), teams of divergent scholars, interested in uncovering what is the case (rather than endlessly making their case), design appropriately stringent tests to understand, and perhaps even resolve, their disagreements. I am pleased to see that in describing such tests the authors allude to my notion of severe testing (Mayo 2018)*:

Severe testing is the related idea that the scientific community ought to accept a claim only after it surmounts rigorous tests designed to find its flaws, rather than tests optimally designed for confirmation. The strong motivation each side’s members will feel to severely test the other side’s predictions should inspire greater confidence in the collaboration’s eventual conclusions. (Ceci et al., 2025)

1. Why open science isn’t enough Continue reading

Categories: severity and adversarial collaborations
