Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)
There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean x computed, each with a known standard deviation σ = 10. When the cooling system is effective, each measurement is like observing X ~ N(150, 102). Because of this variability, we expect different 100-fold water samples to lead to different values of X, but we can deduce its distribution. If each X ~N(μ = 150, 102) then X is also Normal with μ = 150, but the standard deviation of X is only σ/√n = 10/√100 = 1. So X ~ N(μ = 150, 1).
It is the distribution of X that is the relevant sampling distribution here. Because it’s a large random sample, the sampling distribution of X is Normal or approximately so, thanks to the Central Limit Theorem. Note the mean of the sampling distribution of X is the same as the underlying mean, both are μ. The frequency link was created by randomly selecting the sample, and we assume for the moment it was successful. Suppose they are testing:
H0: μ ≤ 150 vs. H1: μ > 150.
The test rule for α = 0.025 is:
Reject H0: iff X > 150 + cασ/√100 = 150 + 1.96(1)=151.96,
since cα = 1.96.
For simplicity, let’s go to the 2-standard error cut-off for rejection:
Reject H0 (infer there’s an indication that μ > 150) iff X ≥ 152.
The test statistic d(x) is a standard Normal variable: Z = √100( X – 150)/10 = X – 150 which, for x = 152 is 2. The area to the right of 2 under the standard Normal is around 0.025.
Now we begin to move beyond the strict N-P interpretation. Say x is just significant at the 0.025 level (x = 152). What warrants taking the data as indicating μ > 150 is not that they’d rarely be wrong in repeated trials on cooling systems by acting this way–even though that’s true. There’s a good indication that it’s not in compliance right now. Why? The severity rationale: Were the mean temperature no higher than 150, then over 97% of the time their method would have resulted in a lower mean temperature than observed. Were it clearly in the safe zone, say μ = 149 degrees, a lower observed mean would be even more probable. Thus, x = 152 indicates some positive discrepancy from H0 (though we don’t consider it rejected by a single outcome). They’re going to take another round of measurements before acting. In the context of a policy action, to which this indication might lead, some type of loss function would be introduced. We’re just considering the evidence, based on these measurements; all for illustrative purposes.
I will abbreviate “the severity with which claim C passes test T with data x“:
SEV(test T, outcome x, claim C).
Reject/Do Not Reject: will be interpreted inferentially, in this case as an indication or evidence of the presence or absence of discrepancies of interest.
Let us suppose we are interested in assessing the severity of C: μ > 153. I imagine this would be a full-on emergency for the ecosystem!
Reject H0. Suppose the observed mean is x = 152, just at the cut-off for rejecting H0:
d(x0) = √100(152 – 150)/10 = 2.
The data reject H0 at level 0.025. We want to compute
SEV(T, x = 152, C: μ > 153).
We may say: “the data accord with C: μ > 153,” that is, severity condition (S-1) is satisfied; but severity requires there to be at least a reasonable probability of a worse fit with C if C is false (S-2). Here, “worse fit with C” means x ≤ 152 (i.e., d(x0) ≤ 2). Given it’s continuous, as with all the following examples, < or ≤ give the same result. The context indicates which is more useful. This probability must be high for C to pass severely; if it’s low, it’s BENT.
We need Pr(X ≤ 152; μ > 153 is false). To say μ > 153 is false is to say μ ≤ 153. So we want Pr(X ≤ 152; μ ≤ 153). But we need only evaluate severity at the point μ = 153, because this probability is even greater for μ < 153:
Pr(X ≤ 152; μ = 153) = Pr(Z ≤ -1) = 0.16.
Here, Z = √100(152 – 153)/10 = -1. Thus SEV(T, x = 152, C: μ > 153) = 0.16. Very low. Our minimal severity principle blocks μ > 153 because it’s fairly probable (84% of the time) that the test would yield an even larger mean temperature than we got, if the water samples came from a body of water whose mean temperature is 153. Table 3.1 gives the severity values associated with different claims, given x = 152. Call tests of this form T+
In each case, we are making inferences of form: μ > μ1 = 150 + γ, for different values of γ. To merely infer μ > 150 , the severity is 0.97 since Pr(X ≤ 152; μ = 150) = Pr(Z ≤ 2) = 0.97. While the data give an indication of non-compliance, μ > 150, to infer C: μ > 153 would be making mountains out of molehills. In this case, the observed difference just hit the cut-off for rejection. N-P tests leave things at that coarse level in computing power and the probability of a Type II error, but severity will take into account the actual outcome. Table 3.2 gives the severity values associated with different claims, given x = 153.
If “the major criticism of the Neyman-Pearson frequentist approach” is that it fails to provide “error probabilities fully varying with the data” as J. Berger alleges, (2003, p.6) then, we’ve answered the major criticism.
Non-rejection. Now suppose x = 151, so the test does not reject H0. The standard formulation of N-P (as well as Fisherian) tests stops there. But we want to be alert to a fallacious interpretation of a “negative” result: inferring there’s no positive discrepancy from μ = 150. No (statistical) evidence of non-compliance isn’t evidence of compliance, here’s why. We have (S-1): the data “accord with” H0, but what if the test had little capacity to have alerted us to discrepancies from 150? The alert comes by way of “a worse fit” with H0–namely, a mean x > 151*. Condition (S-2) requires us to consider Pr(X > 151; μ = 150), which is only 0.16. To get this, standardize X to obtain a standard Normal variate: Z = √100(151 – 150)/10 = 1; and Pr(X > 151; μ = 150) = 0.16. Thus, SEV(T+, x = 151, C: μ ≤ 150) = low(0.16). Table 3.3 gives the severity values associated with different inferences of form: μ ≤ μ1= 150 + γ, given x = 151.
Can they at least say that x = 151 is a good indication that μ ≤ 150.5? No, SEV(T+, x = 151, C: μ ≤ 150.5) ≅ 0.3, [Z = 151 – 150.5 = 0.5]. But x = 151 is a good indication that μ ≤ 152 and μ ≤ 153 (with severity indications of 0.84 and 0.97, respectively).
You might say, assessing severity is no different from what we would do with a judicious use of existing error probabilities. That’s what the severe tester says. Formally speaking, it may be seen merely as a good rule of thumb to avoid fallacious interpretations. What’s new is the statistical philosophy behind it. We no longer seek either probabilism or performance, but rather using relevant error probabilities to assess and control severity.5
5Initial developments of the severity idea were Mayo (1983, 1988, 1991, 1996). In Mayo and Spanos (2006, 2011), it was developed much further.
NOTE: I will set out some quiz examples of severity in the next week for practice.
*There is a typo in the book here, it has “-” rather than “>”
You can find the beginning of this section (3.2), the development of N-P tests, in this post.
To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).
Excursion 3: Statistical Tests and Scientific Inference
Tour I Ingenious and Severe Tests 119
3.1 Statistical Inference and Sexy Science: The 1919
Eclipse Test 121
YOU exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)
3.3 How to Do All N-P Tests Do (and more) While
a Member of the Fisherian Tribe 146