Here is section 5 of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. Sections 1 and 2 are in my last post.*
5. The Error-Statistical Philosophy
I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.
Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).
5.1 Error (Probability) Statistics
What is key on the statistics side is that the probabilities refer to the distribution of a statistic d(X)—the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.
Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)
The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.
What is key on the philosophical side is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).
The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:
SEV(T, x, H) to abbreviate ‘the severity with which test T passes hypothesis H with data x’.
When the test and data are clear, I may just write SEV(H). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well-testedness, and this is the key, I maintain, to avoiding counterintuitive inferences which are at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:
If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)
For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).
5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods
What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.
But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.
Criticisms then follow readily, in the form of one or both:
- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.
I have proposed an alternative philosophy that replaces these tenets with different ones:
- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.
The following examples will substantiate and flesh out these claims.
5.3 Severity as a ‘Metastatistical’ Assessment
In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider the various inferences that might be of interest. In the example of test T+ [this is a one-sided Normal test of H0: μ ≤ μ0 against H1: μ > μ0, on p. 81], the data-specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data x rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure one can write:
SEV(μ < x̄0 + γσx) to abbreviate: the severity with which test T+ with result x̄0 passes the hypothesis
(μ < x̄0 + γσx), with σx abbreviating (σ/√n).
One might consider a series of benchmarks or upper severity bounds:
SEV(μ < x̄0 + 0σx) = .5
SEV(μ < x̄0 + .5σx) = .7
SEV(μ < x̄0 + 1σx) = .84
SEV(μ < x̄0 + 1.5σx) = .93
SEV(μ < x̄0 + 1.96σx) = .975
More generally, one might interpret nonstatistically significant results (i.e., d(x) ≤ cα) in test T+ above in severity terms:
(μ ≤ x̄0 + γε(σ/√n)) passes test T+ with severity (1 − ε),
for any γε such that P(d(X) > γε) = ε.
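A minimal computational sketch of these benchmarks (not part of the paper; it assumes the known-σ setup of test T+ and uses SciPy’s Normal CDF): for a statistically insignificant result, the severity of inferring μ < x̄0 + γσx reduces to Φ(γ).

```python
# Minimal sketch of the severity benchmarks for test T+ (H0: mu <= mu0 vs. H1: mu > mu0),
# Normal data with sigma known. For a nonsignificant result with observed mean xbar0,
# SEV(mu < xbar0 + gamma*sigma_x) = P(Xbar > xbar0; mu = xbar0 + gamma*sigma_x) = Phi(gamma).
from scipy.stats import norm

def severity_upper_bound(gamma: float) -> float:
    """Severity of inferring mu < xbar0 + gamma*sigma_x from a nonsignificant result of T+."""
    return norm.cdf(gamma)

for gamma in (0.0, 0.5, 1.0, 1.5, 1.96):
    print(f"SEV(mu < xbar0 + {gamma}*sigma_x) = {severity_upper_bound(gamma):.3f}")
# Prints .500, .691, .841, .933, .975 -- the benchmarks above, up to rounding.
```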
It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.
5.3.1 Severity and Confidence Bounds in the Case of Test T+
It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test T+ is of form:
μ > X̄ − cα(σ/√n)
The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test T+ is:
μ > X̄ − 1.96(σ/√n)
We were only led to the upper bounds in the midst of a severity interpretation of negative results (see Mayo and Spanos 2006). [See also posts on this blog, e.g., on reforming the reformers.]
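To see the duality numerically, here is a small illustrative sketch; the numbers (σ, n, x̄0, and the implicit null value μ0 = 0) are hypothetical and chosen only for concreteness, with σ again treated as known.

```python
# Illustrative sketch only (hypothetical numbers): duality between test T+, its
# one-sided 97.5% confidence bound, and the severity-.975 upper bound for a
# statistically insignificant result. Assumes sigma known.
from math import sqrt

sigma, n = 1.0, 100          # hypothetical specification (null value mu0 = 0)
xbar0 = 0.1                  # hypothetical observed mean, insignificant at alpha = .025
sigma_x = sigma / sqrt(n)
c_alpha = 1.96               # cutoff corresponding to alpha = .025

ci_lower = xbar0 - c_alpha * sigma_x   # CI dual to T+: mu > ci_lower (level .975)
sev_upper = xbar0 + c_alpha * sigma_x  # mu < sev_upper passes with severity .975

print(f"97.5% one-sided CI from the duality with T+:  mu > {ci_lower:.3f}")
print(f"Upper bound warranted with severity .975:     mu < {sev_upper:.3f}")
```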
Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.
A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:
P(μ < X̄ + 2(σ/√n); μ) = .975,
observe mean x̄0,
therefore, P(μ < x̄0 + 2(σ/√n); μ) = .975.
While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.
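A short simulation sketch (with hypothetical values) makes the point concrete: the .975 (about .977 with the rounded cutoff of 2) is a property of the procedure over repeated samples, whereas once a particular x̄0 is fixed, the claim μ < x̄0 + 2(σ/√n) is simply true or false.

```python
# Simulation sketch (hypothetical values) of why the instantiation is fallacious:
# the probability attaches to the procedure, not to the fixed post-data claim.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, n = 0.0, 1.0, 25          # hypothetical data-generating values
sigma_x = sigma / np.sqrt(n)

xbars = rng.normal(mu_true, sigma_x, size=100_000)   # repeated sample means
coverage = np.mean(mu_true < xbars + 2 * sigma_x)
print(f"Pre-data coverage of the rule: {coverage:.3f}")   # about .977 with the '2' cutoff

xbar0 = xbars[0]                           # one particular observed mean
print("Post-data, the fixed claim is simply:", bool(mu_true < xbar0 + 2 * sigma_x))
```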
5.3.2 Severity versus Rubbing Off
The severity construal is different from what I call the ‘rubbing off construal’ which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method not the inference. Nor will it do to merely replace an error probability associated with an inference to H with the phrase ‘degree of severity’ with which H has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer H (with severity).
The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.
To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:
H: μ < x̄0 + 0σx
on the basis of mean x̄0 from test T+. H passes with low (.5) severity because it is easy, i.e., probable, to have obtained a result that agrees with H as well as this one, even if this claim is false about the underlying data generation procedure. Equivalently, if one were calculating the confidence level associated with the one-sided upper confidence limit μ < x̄0, it would have level .5. Without setting a fixed level, one may apply the severity assessment at a number of benchmarks, to infer which discrepancies are, and which are not, warranted by the particular data set. Knowing what fails to be warranted with severity becomes at least as important as knowing what is: it points in the direction of what may be tried next and of how to improve inquiries.
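For readers who want the calculation behind the .5 written out (this is the γ = 0 case of the benchmarks above, with σ known):

```latex
% Severity of H: mu < xbar_0 (the gamma = 0 case), given the insignificant result xbar_0,
% evaluated at the point of H's denial closest to H, namely mu = xbar_0:
\mathrm{SEV}(\mu < \bar{x}_0)
  = P\big(\bar{X} > \bar{x}_0 \,;\, \mu = \bar{x}_0\big)
  = P(Z > 0)
  = .5
```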
5.3.3 What’s Belief Got to Do with It?
Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis H has passed a severe test T with x without also advocating (strong) belief in H. When SEV(H) is high there is no problem in saying that x warrants H, or if one likes, that x warrants believing H, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(H) is high is:
If SEV(H) is high, its denial is low, i.e., SEV(~H) is low.
But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.
After all, a test may poorly warrant both a hypothesis H and its denial, violating the probability calculus. That is, SEV(H) may be low because its denial was ruled out with severity, i.e., because SEV(~H) is high. But SEV(H) may also be low because the test is too imprecise to allow us to take the result as good evidence for H.
Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome x warrants H. That is what a severity assessment would provide. In this respect, a dyed-in-the-wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.
Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame not only at Bayesian retreats, but throughout the literature.
5.3.4 Tacking Paradox Scotched
In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetical-deductive accounts in philosophy. For example, tacking on an irrelevant conjunct to a well-confirmed hypothesis H seems magically to allow confirmation for some irrelevant conjuncts. Not so in a severity analysis. Suppose the severity for claim H (given test T and data x) is high: i.e., SEV(T, x, H) is high, whereas a claim J is not probed in the least by test T. Then the severity for the conjunction (H & J) is very low, if not minimal.
If SEV(Test T, data x, claim H) is high, but J is not probed in the least by the experimental test T, then SEV (T, x, (H & J)) = very low or minimal.
For example, consider:
H: GTR and J: Kuru is transmitted through funerary cannibalism,
and let data x0 be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction gets minimal severity from this particular data set. Note that we distinguish between x severely passing H and H being severely passed on all the evidence in science at a given time.
A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘H is false’ is a specific denial of H). These must be relevant rivals to H—they must be at ‘the same level’ as H. For example, if H asks whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).
*To read sections 3 and 4, please see the RMM page, and scroll down to Mayo’s Sept 25 paper.
(All references can also be found in the link.)
Please notify me of symbol errors in the post. In transferring the paper to WordPress, many symbols were scrambled and had to be reentered. I think the Elbians were in too much of a rush to go out to the Elbar Room this evening to check whether the symbols they entered matched the paper. I have already found and fixed several that were reentered incorrectly.
Dr. Mayo,
I like your philosophical foundation for statistics (science?) and would like to see more examples of its application. In your blog and papers the only application seems to be the “iid Normal means” one above. What are the best references for examples of difficult real world problems solved by this approach?
Dear Dr. Guest: Thanks for your comment.
The overall philosophy of science is interested in solving the “real world” philosophical problems of evidence and inference (problems of underdetermination, induction, Duhemian problems, etc.). With respect to problems of statistics, decades of debate have centered around these basic examples.
This is understandable in that if one cannot be clear on them, one has little basis for directing solutions for more complex problems (and different schools have different views of which are complex).
Aside from these relatively unproblematic examples, there is a small cluster of cases that come up time and time again in ridiculing frequentist methods: e.g., the mixture of instruments with different precisions, so-called vacuous confidence intervals, two-sided tests where p-values disagree with posteriors, the Welch example, etc. They are all rather simple examples, and we have dealt with all of them.
At our June 2010 conference (from which a short sketch of ideas in this paper arose) Bernardo began by saying that “reference” or default Bayesians disagreed on the most basic examples, and suggested that O-Bayesian testing was very difficult and in flux. So there is good reason to get clear on any and all of the examples that continue to figure prominently in these debates within and between statistical schools, and those taken as grounds to reject frequentist methods. The general severity idea applies to entirely non-statistical settings (Mayo, D. (2010), “Error, Severe Testing, and the Growth of Theoretical Knowledge”) and to all N-P tests (I reference Aris Spanos). I think it would be a huge accomplishment if we could get past those same howlers which are so readily taken to end the discussion of frequentist foundations.
Philosophers of science who deal with probability (despite an early history of mingling with statistical developments in the 70s and 80s) generally restrict themselves to “hypotheses” such as “Peter is a Swede”, “the next trial will produce heads” and do not even get to statistical distributions. It is common to teach probability theory and decision theory in philosophy, but not statistics. Although I think philosophers should be involved, I suspect that further extensions in statistics are more likely to come from statisticians who care about foundations, and who have gotten beyond the howlers.
Having said all this, give me an idea of the kind of problem you have in mind.
On the issue pertaining to the generality of the post-data severity evaluation, I have illustrated its usefulness in a variety of statistical contexts beyond the testing of the mean in an NIID model. For example, in the papers cited below, the severity evaluation is illustrated in the context of linear regression, ANOVA, and the simple Bernoulli model, respectively:
Spanos, A. (2006), “Revisiting the omitted variables argument: substantive vs. statistical adequacy,” Journal of Economic Methodology, 13: 179–218.
Spanos, A. (2010), “The Discovery of Argon: A Case for Learning from Data?” Philosophy of Science, 77: 359–380.
Spanos, A. (2010), “Is Frequentist Testing Vulnerable to the Base-Rate Fallacy?” Philosophy of Science, 77: 565–583.
Thank you both. Spanos’s references are the kind of thing I had in mind.