Here are the first two sections of my new paper: “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” SS & POS 2. (Alternatively, go to the RMM page and scroll down to the Sept 26, 2012 entry.)
1. Comedy Hour at the Bayesian Retreat[i]
Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…
“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”
“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.
If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call ‘error statistics’, continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.
First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define ‘controlling long-run error’, it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.
Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in every Bayesian and some non-Bayesian textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to “describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician” (Senn 2011, 59, this special topic of RMM). Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long-run) succeeds in satisfying error-statistical demands.
The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson ‘really thought’. I regard this as a shallow way to do foundations.
Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.
2. Popperians Are to Frequentists as Carnapians Are to Bayesians
Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:
In opposition to [the] inductivist attitude, I assert that C(H,x) must not be interpreted as the degree of corroboration of H by x, unless x reports the results of our sincere efforts to overthrow H. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that x must represent our total observational knowledge. (Popper 1959, 418, I replace ‘e’ with ‘x’)
In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the “corroboration” of hypothesis H. Popper chides the inductivist for making it too easy for agreements between data x and H to count as giving H a degree of confirmation.
Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory. (Popper 1994, 89)
(Note the similarity to Peirce in Mayo 2011, 87.)
2.1 Severe Tests
Popper did not mean to cash out ‘sincerity’ psychologically, of course, but in some objective manner. Further, high corroboration must be ascertainable: ‘sincerely trying’ to find flaws will not suffice. Although Popper never adequately cashed out his intuition, there is clearly something right in this requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis to philosophically scrutinize different methods (Mayo 2011, section 2.5, this special topic of RMM). Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis H if it is predetermined that, even if H is false, a way would be found to either obtain, or interpret, data as agreeing with (or ‘passing’) hypothesis H. Here is one of many ways to state this:
Severity Requirement (weakest): An agreement between data x and H fails to count as evidence for a hypothesis or claim H if the test would yield (with high probability) so good an agreement even if H is false.
Because such a test procedure had little or no ability to find flaws in H, finding none would scarcely count in H’s favor.
2.1.1 Example: Negative Pressure Tests on the Deepwater Horizon Rig
Did the negative pressure readings provide ample evidence that:
H0: leaking gases, if any, were within the bounds of safety (e.g., less than θ0)?
Not if the rig workers kept decreasing the pressure until H0 passed, rather than performing a more stringent test (e.g., a so-called ‘cement bond log’ using acoustics). Such a lowering of the hurdle for passing H0 made it too easy to pass H0 even if it was false, i.e., even if in fact:
H1: the pressure build-up was in excess of θ0.
That ‘the negative pressure readings were misinterpreted’ meant that it was incorrect to construe them as indicating H0. If such negative readings would be expected, say, 80 percent of the time, even if H1 is true, then H0 might be said to have passed a test with only .2 severity. Using Popper’s nifty abbreviation, it could be said to have low corroboration, .2. So the error probability associated with the inference to H0 would be .8, which is clearly high. This is not a posterior probability, but it does just what we want it to do.
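The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, not a general severity calculator: the .8 figure comes from the text, while the function names, the single-reading pass rate of .2, the seven retries, and the independence of the retries are all hypothetical assumptions introduced here to show how repeating a test until it passes can inflate the probability of passing a false H0.

```python
def severity_of_pass(p_pass_if_false: float) -> float:
    """Severity of a passing result: 1 - P(test passes H0; H0 is false)."""
    if not 0.0 <= p_pass_if_false <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - p_pass_if_false


def pass_prob_with_retries(p_single: float, tries: int) -> float:
    """P(at least one pass in `tries` attempts), assuming independent
    attempts that each pass with probability p_single."""
    return 1.0 - (1.0 - p_single) ** tries


# Figure from the text: passing readings expected 80% of the time even if H1 is true.
sev = severity_of_pass(0.8)

# Hypothetical numbers (not from the text): one reading passes a false H0
# 20% of the time, and the test is repeated up to 7 times until it passes.
p_retry = pass_prob_with_retries(0.2, 7)
sev_retry = severity_of_pass(p_retry)

print(f"severity from the text's figure: {sev:.2f}")
print(f"P(pass; H0 false) with retries:  {p_retry:.2f}")
print(f"severity with retries:           {sev_retry:.2f}")
```

Under these assumed numbers, a reading that would rarely pass a false H0 on a single attempt passes it most of the time once retries are allowed, which is the sense in which lowering the hurdle destroys severity.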
2.2 Another Egregious Violation of the Severity Requirement
Too readily interpreting data as agreeing with or fitting hypothesis H is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data x to succeed in corroborating H with severity, two things are required: (i) x must fit H, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with H, were H false. I have been focusing on (ii), but requirement (i) also falls directly out of error-statistical demands. In general, for H to fit x, H would have to make x more probable than its denial does. Coin tossing hypotheses say nothing about hypotheses on diabetes and so they fail the fit requirement. Note how this immediately scotches the howler in the second opening example.
But note that we can appraise the severity credentials of other accounts by using whatever notion of ‘fit’ they permit. For example, if a Bayesian method assigns high posterior probability to H given data x, we can appraise how often it would do so even if H is false. That is a main reason I do not want to limit what can count as a purported measure of fit: we may wish to entertain different measures for purposes of criticism.
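One way to make this kind of appraisal concrete is a small simulation. The sketch below is an assumption-laden illustration rather than anything given in the paper: it supposes normal data with known unit variance, takes H to be the claim that the mean is positive, uses a conjugate normal prior, and treats a posterior above .95 as "high". All function names, priors, sample sizes, and thresholds are hypothetical. It estimates how often the Bayesian method would assign high posterior probability to H when H is in fact false (true mean 0).

```python
import random
from statistics import NormalDist


def posterior_prob_positive(xbar, n, prior_mean, prior_sd):
    """P(mu > 0 | data) for N(mu, 1) data under a N(prior_mean, prior_sd^2) prior."""
    precision = 1.0 / prior_sd**2 + n            # posterior precision
    post_mean = (prior_mean / prior_sd**2 + n * xbar) / precision
    post_sd = precision ** -0.5
    return 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)


def rate_of_high_posterior(true_mu, n, prior_mean, prior_sd,
                           threshold=0.95, n_sims=20000, seed=1):
    """Relative frequency with which the posterior for H exceeds `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # The mean of n iid N(true_mu, 1) draws is distributed N(true_mu, 1/n).
        xbar = rng.gauss(true_mu, n ** -0.5)
        if posterior_prob_positive(xbar, n, prior_mean, prior_sd) > threshold:
            hits += 1
    return hits / n_sims


# H is false in both runs (true mean is 0); only the prior differs.
diffuse_rate = rate_of_high_posterior(0.0, n=10, prior_mean=0.0, prior_sd=100.0)
optimistic_rate = rate_of_high_posterior(0.0, n=10, prior_mean=1.0, prior_sd=0.2)

print(f"diffuse prior:    {diffuse_rate:.3f}")
print(f"optimistic prior: {optimistic_rate:.3f}")
```

In this toy setup the nearly flat prior yields a high posterior for the false H only rarely, while the concentrated optimistic prior yields it almost always. The point is only that the error-statistical question can be put to a method whatever measure of fit it uses, not that these particular priors are representative.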
2.3 The Rationale for Severity is to Find Things Out Reliably
Although the severity requirement reflects a central intuition about evidence, I do not regard it as a primitive: it can be substantiated in terms of the goals of learning. To flout it would not merely permit being wrong with high probability—a long-run behavior rationale. In any particular case, little if anything will have been done to rule out the ways in which data and hypothesis can ‘agree’, even where the hypothesis is false. The burden of proof on anyone claiming to have evidence for H is to show that the claim is not guilty of at least an egregious lack of severity.
Although one can get considerable mileage even with the weak severity requirement, I would also accept the corresponding positive conception of evidence, which will comprise the full severity principle:
Severity Principle (full): Data x provide a good indication of or evidence for hypothesis H (only) to the extent that test T severely passes H with x.
Degree of corroboration is a useful shorthand for the degree of severity with which a claim passes, and may be used as long as the meaning remains clear.
2.4 What Can Be Learned from Popper; What Can Popperians Be Taught?
Interestingly, Popper often crops up as a philosopher to emulate, by both Bayesian and frequentist statisticians. As a philosopher, I am glad to have one of our own taken as useful, but feel I should point out that, despite having the right idea, Popperian logical computations never gave him an adequate way to implement his severity requirement, and I think I know why: Popper once wrote to me that he regretted never having learned mathematical statistics. Were he to have made the ‘error probability’ turn, today’s meeting ground between philosophy of science and statistics would likely look very different, at least for followers of Popper, the ‘critical rationalists’.
Consider, for example, Alan Musgrave (1999; 2006). Although he declares that “the critical rationalist owes us a theory of criticism” (2006, 323), this has yet to materialize. Instead, it seems that current-day critical rationalists retain the limitations that emasculated Popper. Notably, they deny that the method they recommend, either to accept or to prefer the hypothesis best-tested so far, is reliable. They are right: the best-tested so far may have been poorly probed. But critical rationalists maintain nevertheless that their account is ‘rational’. If asked why, their response is the same as Popper’s: ‘I know of nothing more rational’ than to accept the best-tested hypotheses. It sounds rational enough, but only if the best-tested hypothesis so far is itself well tested (see Mayo 2006; 2010b). So here we see one way in which a philosopher, using methods from statistics, could go back to philosophy and implement an incomplete idea.
On the other hand, statisticians who align themselves with Popper need to show that the methods they favor uphold falsificationist demands: that they are capable of finding claims false, to the extent that they are false; and retaining claims, just to the extent that they have passed severe scrutiny (of ways they can be false). Error probabilistic methods can serve these ends; but it is less clear that Bayesian methods are well-suited for such goals (or if they are, it is not clear they are properly ‘Bayesian’).
TO READ MORE, SEE SECTIONS 1-4 (pp. 71-83) IN SS & POS 2.
(All references can also be found in the link above.)
[i] Long-time blog readers will recognize this from the start of this blog. For some background, and a table of contents for the paper, see my Oct 17 post.
This is really interesting; I’ve had some struggles reconciling my previous understanding of Popper (which, even though I have strong Bayesian inclinations, made lots of sense in a field where reproducible experiments on crops or beer ingredients are possible) to a lot of what we do now which involves observational studies, simulation and checking predictive validity of models in some special cases only. Thanks for the work around Popper and “severity of testing”; I appreciate the insight.