score years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?” at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), where I’m a visiting professor. Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.
My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. It begins like this:
1. Comedy Hour at the Bayesian Retreat
Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…
“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”
“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.
If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call ‘error statistics’, continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.
First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define ‘controlling long-run error’, it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.
Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in every Bayesian and some non-Bayesian textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to “describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician” (Senn 2011, 59, this special topic of RMM). Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long-run) succeeds in satisfying error-statistical demands.
The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson ‘really thought’. I regard this as a shallow way to do foundations.
Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.
2. Popperians Are to Frequentists as Carnapians Are to Bayesians
Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:
In opposition to [the] inductivist attitude, I assert that C(H,x) must not be interpreted as the degree of corroboration of H by x, unless x reports the results of our sincere efforts to overthrow H. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that x must represent our total observational knowledge. (Popper 1959, 418, I replace ‘e’ with ‘x’)
In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the “corroboration” of hypothesis H. Popper chides the inductivist for making it too easy for agreements between data x and H to count as giving H a degree of confirmation.
Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory. (Popper 1994, 89)
(Note the similarity to Peirce in Mayo 2011, 87.)
2.1 Severe Tests
Popper did not mean to cash out ‘sincerity’ psychologically of course, but in some objective manner. Further, high corroboration must be ascertainable: ‘sincerely trying’ to find flaws will not suffice. Although Popper never adequately cashed out his intuition, there is clearly something right in this requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis to philosophically scrutinize different methods (Mayo 2011, section 2.5, this special topic of RMM). Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis H if it is predetermined that, even if H is false, a way would be found to either obtain, or interpret, data as agreeing with (or ‘passing’) hypothesis H. Here is one of many ways to state this:
Severity Requirement (weakest): An agreement between data x and H fails to count as evidence for a hypothesis or claim H if the test would yield (with high probability) so good an agreement even if H is false.
Because such a test procedure had little or no ability to find flaws in H, finding none would scarcely count in H’s favor.
2.1.1 Example: Negative Pressure Tests on the Deepwater Horizon Rig
Did the negative pressure readings provide ample evidence that:
H0: leaking gases, if any, were within the bounds of safety (e.g., less than θ0)?
Not if the rig workers kept decreasing the pressure until H0 passed, rather than performing a more stringent test (e.g., a so-called ‘cement bond log’ using acoustics). Such a lowering of the hurdle for passing H0 made it too easy to pass H0 even if it was false, i.e., even if in fact:
H1: the pressure build-up was in excess of θ0.
That ‘the negative pressure readings were misinterpreted’ meant that it was incorrect to construe them as indicating H0. If such negative readings would be expected, say, 80 percent of the time, even if H1 is true, then H0 might be said to have passed a test with only .2 severity. Using Popper’s nifty abbreviation, it could be said to have low corroboration, .2. So the error probability associated with the inference to H0 would be .8, which is clearly high. This is not a posterior probability, but it does just what we want it to do.
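The arithmetic here can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the .33 single-try pass rate is a made-up figure, and treating repeated tries as independent is an assumption that the rig example does not itself supply.

```python
def severity_of_pass(prob_pass_if_false: float) -> float:
    """Severity of a passing result: 1 - P(so good an agreement; hypothesis false)."""
    if not 0.0 <= prob_pass_if_false <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - prob_pass_if_false


def prob_some_pass(p_single: float, attempts: int) -> float:
    """P(at least one passing reading in `attempts` tries), assuming independent tries."""
    return 1.0 - (1.0 - p_single) ** attempts


# Figure from the text: negative readings expected 80% of the time even if H1 is true.
print(round(severity_of_pass(0.80), 2))

# Hypothetical: retrying until a pass inflates the chance of passing H0 when H1 is true,
# and the severity of the eventual pass shrinks accordingly.
for attempts in (1, 2, 4, 8):
    p = prob_some_pass(0.33, attempts)  # 0.33 is an assumed single-try pass rate
    print(attempts, round(p, 3), round(severity_of_pass(p), 3))
```

The point of the second loop is only qualitative: the more chances the procedure has to pass H0 when it is false, the less a pass is worth.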
2.2 Another Egregious Violation of the Severity Requirement
Too readily interpreting data as agreeing with or fitting hypothesis H is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data x to succeed in corroborating H with severity, two things are required: (i) x must fit H, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with H, were H false. I have been focusing on (ii) but requirement (i) also falls directly out from error statistical demands. In general, for H to fit x, H would have to make x more probable than its denial. Coin tossing hypotheses say nothing about hypotheses on diabetes and so they fail the fit requirement. Note how this immediately scotches the howler in the second opening example.
But note that we can appraise the severity credentials of other accounts by using whatever notion of ‘fit’ they permit. For example, if a Bayesian method assigns high posterior probability to H given data x, we can appraise how often it would do so even if H is false. That is a main reason I do not want to limit what can count as a purported measure of fit: we may wish to entertain different measures for purposes of criticism.
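One way to picture such an appraisal is a small simulation. Everything specific below is an assumption made for illustration and is not in the paper: a Normal mean with known σ, a point hypothesis H: μ = 0 against a single point alternative, a prior of .9 on H, and a cutoff of .9 for "high" posterior.

```python
import math
import random


def posterior_h(xbar, n, sigma, mu_alt, prior_h):
    """Posterior probability of H: mu = 0 against the point alternative mu = mu_alt."""
    se2 = sigma ** 2 / n
    log_lik_h = -(xbar - 0.0) ** 2 / (2 * se2)
    log_lik_alt = -(xbar - mu_alt) ** 2 / (2 * se2)
    # Posterior odds = prior odds times likelihood ratio (worked in logs).
    log_odds = math.log(prior_h / (1 - prior_h)) + log_lik_h - log_lik_alt
    return 1.0 / (1.0 + math.exp(-log_odds))


def rate_high_posterior_when_false(n, sigma, mu_alt, prior_h, cutoff,
                                   sims=20000, seed=1):
    """Relative frequency with which P(H | x) >= cutoff when data come from mu_alt."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    hits = 0
    for _ in range(sims):
        xbar = rng.gauss(mu_alt, se)  # sample mean generated with H false
        if posterior_h(xbar, n, sigma, mu_alt, prior_h) >= cutoff:
            hits += 1
    return hits / sims


# All numbers are hypothetical choices for the sketch.
rate = rate_high_posterior_when_false(n=10, sigma=1.0, mu_alt=0.2,
                                      prior_h=0.9, cutoff=0.9)
print(round(rate, 3))
```

Whatever rate such a simulation reports, it is an error probability of the Bayesian procedure in this setup, which is the kind of quantity a severity appraisal asks about.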
2.3 The Rationale for Severity is to Find Things Out Reliably
Although the severity requirement reflects a central intuition about evidence, I do not regard it as a primitive: it can be substantiated in terms of the goals of learning. To flout it would not merely permit being wrong with high probability—a long-run behavior rationale. In any particular case, little if anything will have been done to rule out the ways in which data and hypothesis can ‘agree’, even where the hypothesis is false. The burden of proof on anyone claiming to have evidence for H is to show that the claim is not guilty of at least an egregious lack of severity.
Although one can get considerable mileage even with the weak severity requirement, I would also accept the corresponding positive conception of evidence, which will comprise the full severity principle:
Severity Principle (full): Data x provide a good indication of or evidence for hypothesis H (only) to the extent that test T severely passes H with x.
Degree of corroboration is a useful shorthand for the degree of severity with which a claim passes, and may be used as long as the meaning remains clear.
2.4 What Can Be Learned from Popper; What Can Popperians Be Taught?
Interestingly, Popper often crops up as a philosopher to emulate—both by Bayesian and frequentist statisticians. As a philosopher, I am glad to have one of our own taken as useful, but feel I should point out that, despite having the right idea, Popperian logical computations never gave him an adequate way to implement his severity requirement, and I think I know why: Popper once wrote to me that he regretted never having learned mathematical statistics. Were he to have made the ‘error probability’ turn, today’s meeting ground between philosophy of science and statistics would likely look very different, at least for followers of Popper, the ‘critical rationalists’.
Consider, for example, Alan Musgrave (1999; 2006). Although he declares that “the critical rationalist owes us a theory of criticism” (2006, 323) this has yet to materialize. Instead, it seems that current-day critical rationalists retain the limitations that emasculated Popper. Notably, they deny that the method they recommend—either to accept or to prefer the hypothesis best-tested so far—is reliable. They are right: the best-tested so far may have been poorly probed. But critical rationalists maintain nevertheless that their account is ‘rational’. If asked why, their response is the same as Popper’s: ‘I know of nothing more rational’ than to accept the best-tested hypotheses. It sounds rational enough, but only if the best-tested hypothesis so far is itself well tested (see Mayo 2006; 2010b). So here we see one way in which a philosopher, using methods from statistics, could go back to philosophy and implement an incomplete idea.
On the other hand, statisticians who align themselves with Popper need to show that the methods they favor uphold falsificationist demands: that they are capable of finding claims false, to the extent that they are false; and retaining claims, just to the extent that they have passed severe scrutiny (of ways they can be false). Error probabilistic methods can serve these ends; but it is less clear that Bayesian methods are well-suited for such goals (or if they are, it is not clear they are properly ‘Bayesian’).
Here is section 5:
5. The Error-Statistical Philosophy
I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.
Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).
5.1 Error (Probability) Statistics
What is key on the statistics side is that the probabilities refer to the distribution of a statistic d(X)—the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.
Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)
The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.
What is key on the philosophical side is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).
The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:
SEV(T, x, H) to abbreviate ‘the severity with which test T passes hypothesis H with data x’.
When the test and data are clear, I may just write SEV(H). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well testedness, and this is the key, I maintain, to avoiding counterintuitive inferences which are at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:
If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)
For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).
5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods
What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.
But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.
Criticisms then follow readily, in the form of one or both:
- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.
I have proposed an alternative philosophy that replaces these tenets with different ones:
- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.
The following examples will substantiate and flesh out these claims.
5.3 Severity as a ‘Metastatistical’ Assessment
In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider several possible inferences that might be considered of interest. In the example of test T+ [this is a one-sided Normal test of H0: μ≤μ0 against H1: μ>μ0, on p. 81], the data specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data x rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure one can write:
SEV(μ < x̄0 + γσx) to abbreviate: The severity with which a test T+ with a result x passes the hypothesis:
(μ < x̄0 + γσx), with σx abbreviating (σ/√n)
One might consider a series of benchmarks or upper severity bounds:
SEV(μ < x̄0 + 0σx) = .5
SEV(μ < x̄0 + .5σx) = .7
SEV(μ < x̄0 + 1σx) = .84
SEV(μ < x̄0 + 1.5σx) = .93
SEV(μ < x̄0 + 1.96σx) = .975
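Benchmarks of this kind can be reproduced with a few lines of Python, on the assumption (as in test T+ with σ known) that the severity for μ < x̄0 + γσx is the probability of a sample mean larger than the one observed when μ sits exactly at that bound, which reduces to the standard Normal CDF at γ. The function names are mine.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sev_upper(gamma: float) -> float:
    """SEV(mu < xbar0 + gamma*sigma_x) = P(Xbar > xbar0; mu = xbar0 + gamma*sigma_x)."""
    return std_normal_cdf(gamma)


for gamma in (0.0, 0.5, 1.0, 1.5, 1.96):
    print(f"SEV(mu < xbar0 + {gamma}*sigma_x) = {sev_upper(gamma):.3f}")
```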
More generally, one might interpret nonstatistically significant results (i.e., d(x) ≤ c_α) in test T+ above in severity terms:
(μ ≤ x̄0 + γ_ε(σ/√n)) passes the test T+ with severity (1 – ε),
for any ε, where γ_ε is the cutoff with P(d(X) > γ_ε) = ε.
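Read the other way round, one can fix the severity (1 – ε) and ask which upper bound passes at that level. A minimal sketch, assuming d(X) is standard Normal at the boundary (σ known); the data values x̄0 = 0.4, σ = 2 and n = 100 are hypothetical, and the function name is mine.

```python
import math
from statistics import NormalDist


def upper_bound_with_severity(xbar0: float, sigma: float, n: int, eps: float) -> float:
    """Bound b such that (mu <= b) passes T+ with severity 1 - eps."""
    gamma_eps = NormalDist().inv_cdf(1.0 - eps)  # P(d(X) > gamma_eps) = eps
    return xbar0 + gamma_eps * sigma / math.sqrt(n)


# Hypothetical observed mean, known sigma, and sample size.
xbar0, sigma, n = 0.4, 2.0, 100
for eps in (0.5, 0.16, 0.025):
    bound = upper_bound_with_severity(xbar0, sigma, n, eps)
    print(f"severity {1 - eps:.3f}: mu <= {bound:.3f}")
```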
It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.
5.3.1 Severity and Confidence Bounds in the Case of Test T+
It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test T+ is of form:
μ > X̄ − c_α(σ/√n)
The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test T+ is:
μ > X̄ − 1.96(σ/√n)
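The duality can be checked numerically. In this sketch the data values are hypothetical and the function names are mine; it assumes the σ-known Normal setup of test T+, and confirms that T+ rejects a null value μ0 exactly when μ0 falls below the lower confidence bound.

```python
import math
from statistics import NormalDist


def lower_bound(xbar: float, sigma: float, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) lower confidence bound: mu > xbar - c_alpha*sigma/sqrt(n)."""
    c_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return xbar - c_alpha * sigma / math.sqrt(n)


def t_plus_rejects(mu0: float, xbar: float, sigma: float, n: int, alpha: float) -> bool:
    """Test T+ of H0: mu <= mu0 rejects when d(x) exceeds the cutoff c_alpha."""
    d = (xbar - mu0) / (sigma / math.sqrt(n))
    return d > NormalDist().inv_cdf(1.0 - alpha)


# Hypothetical observed mean, known sigma, sample size, and level.
xbar, sigma, n, alpha = 0.4, 2.0, 100, 0.025
lb = lower_bound(xbar, sigma, n, alpha)
print(f"97.5% lower bound: mu > {lb:.3f}")
for mu0 in (lb - 0.1, lb + 0.1):
    print(f"mu0 = {mu0:.3f}: rejected by T+? {t_plus_rejects(mu0, xbar, sigma, n, alpha)}")
```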
We were only led to the upper bounds in the midst of a severity interpretation of negative results (see Mayo and Spanos 2006). [See also posts on this blog, e.g., on reforming the reformers.]
Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.
A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:
P(μ < (X̄ + 2(σ/√n)); μ) = .975,
observe mean x̄,
therefore, P(μ < (x̄ + 2(σ/√n)); μ) = .975.
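A simulation makes the contrast concrete. The setup below is hypothetical (μ = 1, σ = 2, n = 100, and an observed mean of 0.5), and μ is known only because we are simulating. The probability statement describes how the estimator behaves over repeated samples; once a particular mean is plugged in, the claim about μ is simply true or false.

```python
import math
import random


def coverage_of_upper_bound(mu: float, sigma: float, n: int,
                            sims: int = 20000, seed: int = 7) -> float:
    """Relative frequency with which mu < Xbar + 2*sigma/sqrt(n) over repeated samples."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    covered = 0
    for _ in range(sims):
        xbar = rng.gauss(mu, se)  # a fresh sample mean each time
        if mu < xbar + 2 * se:
            covered += 1
    return covered / sims


def instantiated_claim_holds(mu: float, xbar_obs: float, sigma: float, n: int) -> bool:
    """With a fixed observed mean, 'mu < xbar + 2*sigma/sqrt(n)' has no probability left in it."""
    return mu < xbar_obs + 2 * sigma / math.sqrt(n)


mu, sigma, n = 1.0, 2.0, 100  # hypothetical values
cov = coverage_of_upper_bound(mu, sigma, n)
print(f"long-run coverage of the estimator: {cov:.3f}")
print("claim for observed mean 0.5:", instantiated_claim_holds(mu, 0.5, sigma, n))
```

The exact long-run figure for a multiplier of 2 is slightly above .975; the text's .975 corresponds to the 1.96 cutoff.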
While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.
5.3.2 Severity versus Rubbing Off
The severity construal is different from what I call the ‘rubbing off construal’ which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method not the inference. Nor will it do to merely replace an error probability associated with an inference to H with the phrase ‘degree of severity’ with which H has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer H (with severity).
The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.
To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:
H: μ < (x̄0 + 0σx)
on the basis of mean x̄0 from test T+. H passes with low (.5) severity because it is easy, i.e., probable, to have obtained a result that agrees with H as well as this one, even if this claim is false about the underlying data generation procedure. Equivalently, if one were calculating the confidence level associated with the one-sided upper confidence limit μ < x̄, it would have level .5. Without setting a fixed level, one may apply the severity assessment at a number of benchmarks, to infer which discrepancies are, and which are not, warranted by the particular data set. Knowing what fails to be warranted with severity becomes at least as important as knowing what is: it points in the direction of what may be tried next and of how to improve inquiries.
5.3.3 What’s Belief Got to Do with It?
Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis H has passed a severe test T with x without also advocating (strong) belief in H. When SEV(H) is high there is no problem in saying that x warrants H, or if one likes, that x warrants believing H, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(H) is high is:
If SEV(H) is high, the severity of its denial is low, i.e., SEV(~H) is low.
But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.
After all, a test may poorly warrant both a hypothesis H and its denial, violating the probability calculus. That is, SEV(H) may be low because its denial was ruled out with severity, i.e., because SEV(~H) is high. But SEV(H) may also be low because the test is too imprecise to allow us to take the result as good evidence for H.
Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome x warrants H. That is what a severity assessment would provide. In this respect, a dyed-in-the-wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.
Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame not only at Bayesian retreats, but throughout the literature.
5.3.4 Tacking Paradox Scotched
In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetico-deductive accounts in philosophy. For example, tacking on an irrelevant conjunct to a well-confirmed hypothesis H seems magically to allow confirmation for some irrelevant conjuncts. Not so in a severity analysis. Suppose the severity for claim H (given test T and data x) is high: i.e., SEV(T, x, H) is high, whereas a claim J is not probed in the least by test T. Then the severity for the conjunction (H & J) is very low, if not minimal.
If SEV(T, x, H) is high, but J is not probed in the least by the experimental test T, then SEV(T, x, (H & J)) = very low or minimal.
For example, consider:
H: GTR and J: Kuru is transmitted through funerary cannibalism,
and let data x0 be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction gets minimal severity from this particular data set. Note that we distinguish x severely passing H, and H being severely passed on all evidence in science at a time.
A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘H is false’ is a specific denial of H). These must be relevant rivals to H: they must be at ‘the same level’ as H. For example, if H is asking about whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).
The conference was co-organized with Aris Spanos.
This was a special topic of the on-line journal, Rationality, Markets and Morals (RMM), edited by Max Albert, also a conference participant. For more Saturday night reading, check out the page. Authors are: David Cox, Andrew Gelman, David F. Hendry, Deborah G. Mayo, Stephen Senn, Aris Spanos, Jan Sprenger, Larry Wasserman. Search this blog for a number of commentaries on most of these papers.
Long-time blog readers will recognize this from the start of this blog. For some background, and a table of contents for the paper, see my Oct 17 post.
While I realize these stories are strawman statistical arguments made to mock a certain perspective, I can’t help but be confused anyway. Trying to determine whether the measurements from a seemingly malfunctioning instrument can be salvaged is an important and interesting research problem. It comes up all the time. So after a moment’s reflection, the first joke just seems dumb, irrespective of one’s statistical perspective.