5. The Error-Statistical Philosophy
I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either the Neyman-Pearson or the Fisherian paradigm. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.
Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that the term ‘classical’ is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).
5.1 Error (Probability) Statistics
What is key on the statistics side is that the probabilities refer to the distribution of a statistic d(X)—the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.
Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)
The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.
What is key on the philosophical side is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).
The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:
SEV(T, x, H) to abbreviate ‘the severity with which test T passes hypothesis H with data x’.
When the test and data are clear, I may just write SEV(H). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well testedness, and this is the key, I maintain, to avoiding counterintuitive inferences which are at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:
If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)
For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).
5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods
What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.
But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.
Criticisms then follow readily, in the form of one or both:
- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.
I have proposed an alternative philosophy that replaces these tenets with different ones:
- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.
The following examples will substantiate and flesh out these claims.
5.3 Severity as a ‘Metastatistical’ Assessment
In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider several possible inferences that might be considered of interest. In the example of test T+ [this is a one-sided Normal test of H0: μ ≤ μ0 against H1: μ > μ0, on p. 81], the data-specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data x rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure one can write:
SEV(μ < x0 + γσx) to abbreviate: the severity with which test T+, with observed mean x0, passes the hypothesis
(μ < x0 + γσx), with σx abbreviating (σ/√n).
One might consider a series of benchmarks or upper severity bounds:
SEV(μ < x0 + 0σx) = .5
SEV(μ < x0 + .5σx) = .7
SEV(μ < x0 + 1σx) = .84
SEV(μ < x0 + 1.5σx) = .93
SEV(μ < x0 + 1.96σx) = .975
More generally, one might interpret nonstatistically significant results (i.e., d(x) ≤ cα) in test T+ above in severity terms:
(μ ≤ x0 + γε(σ/√n)) passes the test T+ with severity (1 – ε),
for any γε such that P(d(X) > γε) = ε.
It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.
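The benchmarks above can be checked numerically. The following is a minimal sketch, assuming the known-σ Normal setting of test T+; the function name `severity_upper` is hypothetical and not from the original text. It uses the fact that the claim μ < x0 + γσx passes with severity equal to the probability of a sample mean larger than x0 were μ equal to x0 + γσx, which for a Normal sampling distribution is Φ(γ).

```python
from statistics import NormalDist

def severity_upper(gamma: float) -> float:
    """SEV(mu < x0 + gamma * sigma_x) for test T+ with sigma known.

    With mu1 = x0 + gamma * sigma_x, the severity is
    P(Xbar > x0; mu = mu1) = P(Z > -gamma) = Phi(gamma),
    where Z is standard Normal. (Hypothetical helper, for illustration.)
    """
    return NormalDist().cdf(gamma)

# Benchmarks: discrepancies measured in units of sigma_x = sigma / sqrt(n).
benchmarks = {g: severity_upper(g) for g in (0.0, 0.5, 1.0, 1.5, 1.96)}
for g, sev in benchmarks.items():
    print(f"SEV(mu < x0 + {g} * sigma_x) = {sev:.3f}")
```

Note that the value for .5σx comes out at roughly .69, which the list above rounds to .7, and that the severity depends only on γ, not on the particular x0, σ, or n.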
5.3.1 Severity and Confidence Bounds in the Case of Test T+
It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test T+ is of form:
μ > X̄ − cα(σ/√n)
The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test T+ is:
μ > X̄ − 1.96(σ/√n)
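The duality between test T+ and the one-sided lower confidence bound can be sketched as follows. This is an illustration under stated assumptions: σ is known, the numbers (observed mean 0.4, σ = 1, n = 25) are hypothetical, and the helper names `lower_bound` and `t_plus_rejects` are invented for this example.

```python
from math import sqrt
from statistics import NormalDist

def lower_bound(xbar: float, sigma: float, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) lower confidence bound: xbar - c_alpha * sigma / sqrt(n)."""
    c_alpha = NormalDist().inv_cdf(1 - alpha)
    return xbar - c_alpha * sigma / sqrt(n)

def t_plus_rejects(mu0: float, xbar: float, sigma: float, n: int, alpha: float) -> bool:
    """Test T+ rejects H0: mu <= mu0 when d(x) = (xbar - mu0) / (sigma / sqrt(n)) > c_alpha."""
    d = (xbar - mu0) / (sigma / sqrt(n))
    return d > NormalDist().inv_cdf(1 - alpha)

# Hypothetical data, chosen only for illustration.
xbar, sigma, n, alpha = 0.4, 1.0, 25, 0.025
lb = lower_bound(xbar, sigma, n, alpha)
print(f"97.5 percent lower bound: mu > {lb:.3f}")

# A null value mu0 lies inside the interval exactly when T+ does not reject it.
for mu0 in (-0.2, 0.0, 0.1, 0.4):
    rejected = t_plus_rejects(mu0, xbar, sigma, n, alpha)
    print(f"mu0 = {mu0:+.1f}: rejected = {rejected}, inside CI = {mu0 > lb}")
```

For each value of μ0 tried, being rejected by T+ and lying inside the interval are complementary, which is the duality the text describes.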
Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.
A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:
P(μ < (X̄ + 2(σ/√n)); μ) = .975,
observe mean x̄,
therefore, P(μ < (x̄ + 2(σ/√n)); μ) = .975.
While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.
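A small simulation may help separate the two readings. It is only a sketch: the true mean, σ, n, the seed, and the function names are all hypothetical choices made for illustration. The probability statement describes how often the estimator X̄ + 2(σ/√n) exceeds μ over repeated samples, whereas once a particular mean is observed, the claim about μ is simply true or false.

```python
import random
from math import sqrt

def upper_bound_coverage(mu, sigma, n, trials=20000, seed=0):
    """Fraction of repeated samples in which mu < Xbar + 2 * sigma / sqrt(n)."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = rng.gauss(mu, se)  # draw from the sampling distribution of the mean
        hits += mu < xbar + 2 * se
    return hits / trials

def claim_holds(mu, xbar_observed, sigma, n):
    """For a fixed observed mean, the claim mu < xbar + 2 * sigma / sqrt(n) is true or false."""
    return mu < xbar_observed + 2 * sigma / sqrt(n)

coverage = upper_bound_coverage(mu=0.0, sigma=1.0, n=25)
print(f"long-run coverage of the procedure: {coverage:.3f}")

# Two hypothetical observed means with the same (assumed) true mu = 0:
print(claim_holds(0.0, 0.5, 1.0, 25))   # True
print(claim_holds(0.0, -0.5, 1.0, 25))  # False
```

The simulated coverage is close to Φ(2), about .977 (the text's .975 corresponds to the multiplier 1.96). The error probability attaches to the procedure; neither of the two instantiated claims has probability .975.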
5.3.2 Severity versus Rubbing Off
The severity construal is different from what I call the ‘rubbing off construal’ which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method not the inference. Nor will it do to merely replace an error probability associated with an inference to H with the phrase ‘degree of severity’ with which H has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer H (with severity).
The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.
To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:
H: μ < (x0 + 0σx)
on the basis of mean x0 from test T+. H passes with low (.5) severity because it is easy, i.e., probable, to have obtained a result that agrees with H as well as this one, even if this claim is false about the underlying data generation procedure. Equivalently, if one were calculating the confidence level associated with the one-sided upper confidence limit μ < x0, it would have level .5. Without setting a fixed level, one may apply the severity assessment at a number of benchmarks, to infer which discrepancies are, and which are not, warranted by the particular data set. Knowing what fails to be warranted with severity becomes at least as important as knowing what is: it points in the direction of what may be tried next and of how to improve inquiries.
5.3.3 What’s Belief Got to Do with It?
Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis H has passed a severe test T with x without also advocating (strong) belief in H. When SEV(H) is high there is no problem in saying that x warrants H, or if one likes, that x warrants believing H, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(H) is high is:
If SEV(H) is high, the severity for its denial is low, i.e., SEV(~H) is low.
But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.
After all, a test may poorly warrant both a hypothesis H and its denial, violating the probability calculus. That is, SEV(H) may be low because its denial was ruled out with severity, i.e., because SEV(~H) is high. But SEV(H) may also be low because the test is too imprecise to allow us to take the result as good evidence for H.
Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome x warrants H. That is what a severity assessment would provide. In this respect, a dyed-in-the-wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.
Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame not only at Bayesian retreats, but throughout the literature.
5.3.4 Tacking Paradox Scotched
In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetico-deductive accounts in philosophy. For example, tacking on an irrelevant conjunct to a well-confirmed hypothesis H seems magically to allow confirmation for some irrelevant conjuncts. Not so in a severity analysis. Suppose the severity for claim H (given test T and data x) is high: i.e., SEV(T, x, H) is high, whereas a claim J is not probed in the least by test T. Then the severity for the conjunction (H & J) is very low, if not minimal.
If SEV(T, x, H) is high, but J is not probed in the least by the experimental test T, then SEV(T, x, (H & J)) is very low or minimal.
For example, consider:
H: GTR and J: Kuru is transmitted through funerary cannibalism,
and let data x0 be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction gets minimal severity from this particular data set. Note that we distinguish x severely passing H, and H being severely passed on all evidence in science at a time.
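To see why tacking is a problem on a probabilistic, incremental notion of confirmation, consider a toy calculation. It is a sketch only: all probabilities are hypothetical numbers chosen for illustration, J is assumed to be independent of both H and the data x, and the calculation shows the Bayesian boost in probability, not a severity assessment.

```python
from fractions import Fraction as F

# Hypothetical priors and likelihoods (assumptions, not values from the text).
p_H = F(1, 2)             # prior for H
p_J = F(3, 10)            # prior for the irrelevant conjunct J
p_x_given_H = F(9, 10)    # x is probable if H is true
p_x_given_notH = F(1, 10)

# Total probability of the data x.
p_x = p_x_given_H * p_H + p_x_given_notH * (1 - p_H)

# Posterior of H by Bayes' theorem.
post_H = p_x_given_H * p_H / p_x

# J is irrelevant: P(x | H & J) = P(x | H), and J is independent of H.
prior_HJ = p_H * p_J
post_HJ = p_x_given_H * prior_HJ / p_x

# x carries no information about J on its own.
post_J = p_J

print(f"P(H) = {p_H}, P(H | x) = {post_H}")
print(f"P(H & J) = {prior_HJ}, P(H & J | x) = {post_HJ}")
print(f"P(J) = {p_J}, P(J | x) = {post_J}")
```

In this toy case the probability of the conjunction rises from 3/20 to 27/100 even though x does nothing to probe J, whose own probability is unchanged. On the severity account, by contrast, the conjunction gets minimal severity because the test could not have uncovered the falsity of J.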
A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘H is false’ is a specific denial of H). These must be relevant rivals to H—they must be at ‘the same level’ as H. For example, if H is asking about whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).
*To read sections 3 and 4, please see the RMM page, and scroll down to Mayo’s Sept 25 paper.
(All references can also be found in the link.)