Monthly Archives: December 2018

Memento & Quiz (on SEV): Excursion 3, Tour I


As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*


We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1) which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions–type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR.

Key terms (incomplete; please send me yours)

GTR, eclipse test, ether effect, corona effect, PPN framework, statistical test ingredients, Anglo-Polish collaboration, Lambda criterion; Type I error, Type II error, power, P-value, unbiased tests, consistent tests, uniformly most powerful (UMP); severity interpretation of tests, severity function, water plant accident; sufficient statistic; frequentist principle of evidence FEV; sensitivity achieved [same as attained power (att power)], Cox’s taxonomy (embedded, nested, dividing, testing assumptions), Nordtvedt effect, equivalence principle (strong and weak)

Semi-Severe Severity Quiz, based on the example in Exhibit (i) of Excursion 3

  1. Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, and n = 100, observed x̄ = 152 (i.e., d = 2), find the severity associated with μ > 150.5.

i.e., SEV₁₀₀(μ > 150.5) = ________

  2. Compute 3 or more of the severity assessments for Table 3.2, with x̄ = 153.
  3. Comparing n = 100 with n = 10,000: Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, change the sample size so that n = 10,000.

The 2SE rejection rule would now be: reject (i.e., “infer evidence against H0”) whenever X̄ > _____.

Assume x̄ just reaches this 2SE cut-off. (added previous sentence, Dec 10, I thought it was clear.) What’s the severity associated with inferring μ > 150.5 now?

i.e., SEV₁₀,₀₀₀(μ > 150.5) = ____

Compare with SEV₁₀₀(μ > 150.5).

4. NEW. I realized I needed to include a “negative” result. Assume x̄ = 151.5. Keeping to the same test with n = 100, find SEV₁₀₀(μ ≤ 152).

5. If you’re following the original schedule, you’ll have read Tour II of Excursion 3, so here’s an easy question: Why does Souvenir M tell you to “relax”?

6. Extra Credit: supply some key terms from this Tour that I left out in the above list.

*The reference is to Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.



First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

Excursion 3 Exhibit (i)

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times, each with a known standard deviation σ = 10, and the sample mean x̄ was computed. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²) then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1).
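A quick simulation can make the claim about the sampling distribution of the sample mean concrete. The sketch below is only illustrative: the number of replications and the seed are assumptions, not part of the exhibit, and it uses only Python's standard library.

```python
import random
import statistics

MU, SIGMA, N = 150.0, 10.0, 100   # values taken from the exhibit
REPLICATIONS = 2000               # assumed number of simulated 100-fold samples

def sample_mean(rng, mu=MU, sigma=SIGMA, n=N):
    """Mean of n independent N(mu, sigma^2) measurements."""
    return statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))

rng = random.Random(2018)  # fixed (arbitrary) seed so the run is repeatable
means = [sample_mean(rng) for _ in range(REPLICATIONS)]

theoretical_se = SIGMA / N ** 0.5          # sigma / sqrt(n)
simulated_mean = statistics.fmean(means)   # should land near 150
simulated_se = statistics.stdev(means)     # should land near the theoretical SE

print(f"theoretical SE of the sample mean: {theoretical_se:.3f}")
print(f"simulated mean of sample means:    {simulated_mean:.3f}")
print(f"simulated SD of sample means:      {simulated_se:.3f}")
```

The simulated spread of the 100-fold means should come out close to 1, a tenth of the spread of the individual measurements.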

It is the distribution of X̄ that is the relevant sampling distribution here. Because it’s a large random sample, the sampling distribution of X̄ is Normal or approximately so, thanks to the Central Limit Theorem. Note the mean of the sampling distribution of X̄ is the same as the underlying mean: both are μ. The frequency link was created by randomly selecting the sample, and we assume for the moment it was successful. Suppose they are testing:

H0: μ ≤ 150 vs. H1: μ > 150.

The test rule for α =  0.025 is:

Reject H0 iff X̄ > 150 + cα(σ/√100) = 150 + 1.96(1) = 151.96,
since cα = 1.96.

For simplicity, let’s go to the 2-standard error cut-off for rejection:

Reject H0 (infer there’s an indication that μ > 150) iff X̄ ≥ 152.

The test statistic d(X) is a standard Normal variable: Z = √100(X̄ – 150)/10 = X̄ – 150, which, for x̄ = 152, is 2. The area to the right of 2 under the standard Normal is around 0.025.
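The cut-off and the tail area above can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from Python's standard library; the variable and function names are my own choices for illustration.

```python
from statistics import NormalDist

MU0, SIGMA, N = 150.0, 10.0, 100   # null boundary, known SD, sample size
ALPHA = 0.025

se = SIGMA / N ** 0.5                        # standard error of the sample mean
c_alpha = NormalDist().inv_cdf(1 - ALPHA)    # upper-tail critical value, about 1.96
cutoff = MU0 + c_alpha * se                  # exact alpha = 0.025 cut-off, about 151.96

def d(xbar, mu0=MU0, se=se):
    """Standardized distance of the observed mean from mu0."""
    return (xbar - mu0) / se

z = d(152.0)                                 # observed mean at the 2SE cut-off
p_value = 1 - NormalDist().cdf(z)            # area to the right of z

print(f"c_alpha = {c_alpha:.3f}, cut-off = {cutoff:.2f}")
print(f"d(152) = {z:.1f}, P-value = {p_value:.4f}")
```

The area to the right of 2 is closer to 0.023 than to 0.025, which is why the 2-standard-error rule is described as "around" the 0.025 level.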

Now we begin to move beyond the strict N-P interpretation. Say x̄ is just significant at the 0.025 level (x̄ = 152). What warrants taking the data as indicating μ > 150 is not that they’d rarely be wrong in repeated trials on cooling systems by acting this way–even though that’s true. There’s a good indication that it’s not in compliance right now. Why? The severity rationale: Were the mean temperature no higher than 150, then over 97% of the time their method would have resulted in a lower mean temperature than observed. Were it clearly in the safe zone, say μ = 149 degrees, a lower observed mean would be even more probable. Thus, x̄ = 152 indicates some positive discrepancy from H0 (though we don’t consider it rejected by a single outcome). They’re going to take another round of measurements before acting. In the context of a policy action, to which this indication might lead, some type of loss function would be introduced. We’re just considering the evidence, based on these measurements; all for illustrative purposes.

Severity Function:

I will abbreviate “the severity with which claim C passes test T with data x” as:

SEV(test T, outcome x, claim C).

Reject/Do Not Reject: will be interpreted inferentially, in this case as an indication or evidence of the presence or absence of discrepancies of interest.

Let us suppose we are interested in assessing the severity of C: μ > 153. I imagine this would be a full-on emergency for the ecosystem!

Reject H0. Suppose the observed mean is x̄ = 152, just at the cut-off for rejecting H0:

d(x0) = √100(152 – 150)/10 = 2.

The data reject H0 at level 0.025. We want to compute

SEV(T, x̄ = 152, C: μ > 153).

We may say: “the data accord with C: μ > 153,” that is, severity condition (S-1) is satisfied; but severity requires there to be at least a reasonable probability of a worse fit with C if C is false (S-2). Here, “worse fit with C” means x̄ ≤ 152 (i.e., d(x0) ≤ 2). Given it’s continuous, as with all the following examples, < or ≤ give the same result. The context indicates which is more useful. This probability must be high for C to pass severely; if it’s low, it’s BENT.

We need Pr(X̄ ≤ 152; μ > 153 is false). To say μ > 153 is false is to say μ ≤ 153. So we want Pr(X̄ ≤ 152; μ ≤ 153). But we need only evaluate severity at the point μ = 153, because this probability is even greater for μ < 153:

Pr(X̄ ≤ 152; μ = 153) = Pr(Z ≤ -1) = 0.16.

Here, Z = √100(152 – 153)/10 = -1. Thus SEV(T, x̄ = 152, C: μ > 153) = 0.16. Very low. Our minimal severity principle blocks μ > 153 because it’s fairly probable (84% of the time) that the test would yield an even larger mean temperature than we got, if the water samples came from a body of water whose mean temperature is 153. Table 3.1 gives the severity values associated with different claims, given x̄ = 152. Call tests of this form T+.
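The severity computation for a claim of the form μ > μ1 after a rejection can be sketched in a few lines. The function name `sev_greater` is hypothetical, introduced only for this illustration; the formula is the one worked through above, Pr(X̄ ≤ x̄; μ = μ1).

```python
from statistics import NormalDist

def sev_greater(xbar, mu1, sigma=10.0, n=100):
    """Severity for the claim mu > mu1 in test T+.

    Evaluated at the boundary point mu = mu1:
    Pr(sample mean <= observed xbar; mu = mu1).
    """
    se = sigma / n ** 0.5              # standard error of the sample mean
    z = (xbar - mu1) / se              # standardize the observed mean under mu1
    return NormalDist().cdf(z)

sev_153 = sev_greater(152.0, 153.0)    # claim: mu > 153
sev_150 = sev_greater(152.0, 150.0)    # claim: mu > 150

print(f"SEV(mu > 153) = {sev_153:.2f}")
print(f"SEV(mu > 150) = {sev_150:.2f}")
```

With x̄ = 152, the first value is about 0.16 and the second is just under 0.98, which the text rounds to 0.97.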

In each case, we are making inferences of form: μ > μ1 = 150 + γ, for different values of γ. To merely infer μ > 150, the severity is 0.97 since Pr(X̄ ≤ 152; μ = 150) = Pr(Z ≤ 2) = 0.97. While the data give an indication of non-compliance, μ > 150, to infer C: μ > 153 would be making mountains out of molehills. In this case, the observed difference just hit the cut-off for rejection. N-P tests leave things at that coarse level in computing power and the probability of a Type II error, but severity will take into account the actual outcome. Table 3.2 gives the severity values associated with different claims, given x̄ = 153.
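To see how severity varies with the actual outcome, one can tabulate it for x̄ = 152 and x̄ = 153 side by side. The grid of γ values below is an assumption chosen for illustration and need not match the rows of Tables 3.1 and 3.2; `sev_greater` is a hypothetical helper name.

```python
from statistics import NormalDist

SIGMA, N = 10.0, 100          # values from the exhibit
SE = SIGMA / N ** 0.5         # standard error of the sample mean

def sev_greater(xbar, mu1, se=SE):
    """SEV(mu > mu1) for test T+: Pr(sample mean <= xbar; mu = mu1)."""
    return NormalDist().cdf((xbar - mu1) / se)

# Hypothetical grid of discrepancies gamma, giving mu1 = 150 + gamma.
gammas = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]

rows = []
for gamma in gammas:
    mu1 = 150.0 + gamma
    rows.append((mu1, sev_greater(152.0, mu1), sev_greater(153.0, mu1)))

print("claim        SEV (xbar=152)  SEV (xbar=153)")
for mu1, sev_152, sev_153 in rows:
    print(f"mu > {mu1:<7} {sev_152:>14.3f}  {sev_153:>14.3f}")
```

For every claim in the grid, the larger observed mean yields the higher severity, while both outcomes lead to the same coarse verdict of "reject H0".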

If “the major criticism of the Neyman-Pearson frequentist approach” is that it fails to provide “error probabilities fully varying with the data,” as J. Berger alleges (2003, p. 6), then we’ve answered the major criticism.

Non-rejection. Now suppose x̄ = 151, so the test does not reject H0. The standard formulation of N-P (as well as Fisherian) tests stops there. But we want to be alert to a fallacious interpretation of a “negative” result: inferring there’s no positive discrepancy from μ = 150. No (statistical) evidence of non-compliance isn’t evidence of compliance; here’s why. We have (S-1): the data “accord with” H0, but what if the test had little capacity to have alerted us to discrepancies from 150? The alert comes by way of “a worse fit” with H0–namely, a mean x̄ > 151*. Condition (S-2) requires us to consider Pr(X̄ > 151; μ = 150), which is only 0.16. To get this, standardize X̄ to obtain a standard Normal variate: Z = √100(151 – 150)/10 = 1; and Pr(X̄ > 151; μ = 150) = 0.16. Thus, SEV(T+, x̄ = 151, C: μ ≤ 150) = low (0.16). Table 3.3 gives the severity values associated with different inferences of form: μ ≤ μ1 = 150 + γ, given x̄ = 151.

Can they at least say that x̄ = 151 is a good indication that μ ≤ 150.5? No, SEV(T+, x̄ = 151, C: μ ≤ 150.5) ≅ 0.3, [Z = 151 – 150.5 = 0.5]. But x̄ = 151 is a good indication that μ ≤ 152 and μ ≤ 153 (with severity indications of 0.84 and 0.97, respectively).
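The non-rejection case runs the same computation in the other direction: for a claim μ ≤ μ1, a "worse fit" is a larger observed mean, so severity is Pr(X̄ > x̄; μ = μ1). A minimal sketch follows, with `sev_leq` as a hypothetical helper name.

```python
from statistics import NormalDist

def sev_leq(xbar, mu1, sigma=10.0, n=100):
    """Severity for the claim mu <= mu1 in test T+.

    Evaluated at the boundary point mu = mu1:
    Pr(sample mean > observed xbar; mu = mu1).
    """
    se = sigma / n ** 0.5              # standard error of the sample mean
    z = (xbar - mu1) / se              # standardize the observed mean under mu1
    return 1 - NormalDist().cdf(z)

xbar = 151.0                           # observed mean that fails to reject H0
results = {mu1: sev_leq(xbar, mu1) for mu1 in (150.0, 150.5, 152.0, 153.0)}

for mu1, sev in results.items():
    print(f"SEV(mu <= {mu1}) = {sev:.2f}")
```

The four values come out near 0.16, 0.31, 0.84 and 0.98, in line with the figures quoted in the text (which cites the last as 0.97).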

You might say, assessing severity is no different from what we would do with a judicious use of existing error probabilities. That’s what the severe tester says. Formally speaking, it may be seen merely as a good rule of thumb to avoid fallacious interpretations. What’s new is the statistical philosophy behind it. We no longer seek either probabilism or performance, but rather use relevant error probabilities to assess and control severity.⁵

⁵ Initial developments of the severity idea were Mayo (1983, 1988, 1991, 1996). In Mayo and Spanos (2006, 2011), it was developed much further.


NOTE: I will set out some quiz examples of severity in the next week for practice.

*There is a typo in the book here; it has “-” rather than “>”.

You can find the beginning of this section (3.2), the development of N-P tests, in this post.

To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Where you are in the journey:

Excursion 3: Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration

YOU exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

3.3 How to Do All N-P Tests Do (and more) While a Member of the Fisherian Tribe

  • All excerpts and mementos (until Nov. 30, 2018) are here.
  • The full Itinerary (Table of Contents) is here.

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)

Neyman & Pearson

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, H0 in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to H0 which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined:

Step 1. We must first specify the set of results . . .

Step 2. We then divide this set by a system of ordered boundaries . . .such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level the chance that, if H0 is true, a result will occur in random sampling lying beyond that level . . .

In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 preceded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)

In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!

We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed: . . .
