The following is my commentary on a paper by Gelman and Shalizi, forthcoming (some time in 2013) in the British Journal of Mathematical and Statistical Psychology* (submitted February 14, 2012).
“The Error Statistical Philosophy and the Practice of Bayesian Statistics: Comments on A. Gelman and C. Shalizi: Philosophy and the Practice of Bayesian Statistics”**
Deborah G. Mayo
I am pleased to have the opportunity to comment on this interesting and provocative paper. I shall begin by citing three points at which the authors happily depart from existing work on statistical foundations.
First, there is the authors’ recognition that methodology is ineluctably bound up with philosophy. If nothing else, “strictures derived from philosophy can inhibit research progress” (p. 4). They note, for example, the reluctance of some Bayesians to test their models because of their belief that “Bayesian models were by definition subjective,” or perhaps because checking involves non-Bayesian methods (4, n4).
Second, they recognize that Bayesian methods need a new foundation. Although the subjective Bayesian philosophy, “strongly influenced by Savage (1954), is widespread and influential in the philosophy of science (especially in the form of Bayesian confirmation theory),” and while many practitioners perceive the “rising use of Bayesian methods in applied statistical work” (2) as supporting this Bayesian philosophy, the authors flatly declare that “most of the standard philosophy of Bayes is wrong” (2 n2). Despite their qualification that “a statistical method can be useful even if its philosophical justification is in error”, their stance will rightly challenge many a Bayesian.
This will be especially so when one has reached their third thesis, which seeks a new foundation that uses non-Bayesian ideas. Although the authors at first profess that their “perspective is not new”, but rather follows many other statisticians who emphasize “the value of Bayesian inference as an approach for obtaining statistical methods with good frequency properties” (3), they go on to announce they are “going beyond the evaluation of Bayesian methods based on their frequency properties as recommended by Rubin (1984), Wasserman (2006), among others, to emphasize the learning that comes from the discovery of systematic differences between model and data” (15). Moreover, they suggest that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist orientation.[i] Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” (2), which might be seen as using modern statistics to implement the Popperian criteria for severe tests.
In Popperian spirit, let me stick my neck out and conjecture that the authors are correct. This is not the place to detail the error-statistical account, but I will illustrate some of its themes where they pertain to the present paper (see Mayo and Spanos 2011).
The idea that non-Bayesian ideas might afford a foundation for the many strands of Bayesianism is not as preposterous as it first seems. Supplying a foundation requires that we step back from formal methods themselves. That is what the error statistical philosophy attempts to provide for such well-known (“sampling theory”) tools as significance tests and confidence interval methods. But the idea of severe testing is sufficiently general to apply to any other methods on offer. On the face of it, any inference, whether to the adequacy of a model (for a given purpose), or to a posterior probability, can be said to be warranted just to the extent that the inference has withstood severe testing.
If the authors are right, several novel pathways for situating current work suddenly open up. But that is for another time. Here, I will point up some places where error statistical methods might yield tools to promote their ends, but also others where they will hold up large warning signs! In so doing I will often refer to the “philosophical coda” in the last several pages of their paper. Leaving to one side quibbles about some of the philosophical positions they mention, their “coda” contains many important philosophical insights that should be applied throughout.
2. Testing in Their Data-Analysis Cycle
The authors claim their statistical analysis is used “not for computing the posterior probability that any particular model was true—we never actually did that” (8), but rather “to fit rich enough models” and upon discerning that aspects of the model “did not fit our data” (8), to build a more complex, better fitting, model; which in turn called for alteration when faced with new data.
This cycle, they rightly note, involves a “non-Bayesian checking of Bayesian models” (11), but they should not describe it as purely deductive: it is not. Nor should they wish to hold to that old distorted view of a Popperian test as “the rule of deduction which says that if p implies q, and q is false, then p must be false” (with p and q the hypothesis and the data, respectively) (22). Having thrown off one oversimplified picture, they should avoid slipping into another. As Popper well knew, any observable predictions are derived only with the help of various auxiliary claims A1, …, An. Confronted with anomalous data, one may at most infer that either H or one of the auxiliaries is to blame: Duhem’s problem. While Duhem’s problem is mentioned in the philosophical coda (p. 23), the authors should be raising Duhemian concerns explicitly throughout.
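Duhem’s problem can be put schematically. The following is only a sketch: H and the auxiliaries A1, …, An are as in the text, while q (for the predicted observation) is borrowed from the rule of deduction quoted above.

```latex
\[
(H \land A_1 \land \dots \land A_n) \rightarrow q, \qquad \neg q
\;\;\vdash\;\; \neg H \lor \neg A_1 \lor \dots \lor \neg A_n
\]
```

The premises license only the disjunction on the right; singling out H as the culprit requires further argument that the auxiliaries hold.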
To infer evidence of a genuine anomaly is to make an inductive inference to the existence of a reproducible effect: Popper called it a falsifying hypothesis. Although falsification rules must be probabilistic in some sense, it is not enough to regard the anomaly as genuine simply because the outcome is highly improbable under a hypothesized model. Individual outcomes described in detail may easily have very small probabilities without being genuine anomalies.
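A toy calculation may make the last point concrete. The sketch below assumes a fair-coin model with 20 independent flips; the model, the choice of the number of heads as test statistic, and the function names are illustrative assumptions of this example, not anything taken from Gelman and Shalizi's paper. Every fully specified sequence has the same tiny probability, so improbability of the detailed outcome cannot by itself mark an anomaly, whereas the tail area of a relevant statistic separates an unremarkable result from a surprising one.

```python
from math import comb


def sequence_probability(n_flips: int) -> float:
    """Probability of any one fully specified sequence under a fair coin."""
    return 0.5 ** n_flips


def two_sided_tail(heads: int, n_flips: int) -> float:
    """P(number of heads is at least as far from n/2 as observed), fair coin.

    Hypothetical helper: the 'number of heads' statistic is an assumption
    chosen for illustration.
    """
    observed_distance = abs(heads - n_flips / 2)
    favourable = sum(
        comb(n_flips, k)
        for k in range(n_flips + 1)
        if abs(k - n_flips / 2) >= observed_distance
    )
    return favourable / 2 ** n_flips


if __name__ == "__main__":
    n = 20
    # Identical for every sequence, whether it has 10 heads or 19.
    print("P(specific sequence):", sequence_probability(n))
    # 10 heads out of 20: nothing surprising about the statistic.
    print("tail area, 10 heads:", two_sided_tail(10, n))
    # 19 heads out of 20: a very small tail area.
    print("tail area, 19 heads:", two_sided_tail(19, n))
```

A sequence with 10 heads and one with 19 heads are equally improbable as individual outcomes; only the second is improbable in a way that indicates a discrepancy from the model. Even then, as the text stresses, a small tail area is at most a first step toward inferring a reproducible effect.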
Alluding to Mayo and Cox (2006, 2010), the authors suggest that any account that moves from data to hypotheses might be called a theory of inductive inference in our sense. Not at all. The requirements for reliable or severe tests must be met. Our point was to show that sampling theory methods, contrary to what has been supposed, satisfy these requirements, so long as they are suitably interpreted. Severity assignments are not posterior probabilities, but they do involve induction. Since the authors concur with the idea of “a model being severely tested if it passed a probe which had a high probability of detecting an error if it is present” (15), it will be up to them to show they can satisfy this. …
*I want to thank the British Journal of Mathematical and Statistical Psychology for allowing me to make public the manuscript draft.
**I gratefully acknowledge the insights of Aris Spanos on misspecification testing, and his very useful comments on earlier drafts of this paper.
Gelman, A., and C. Shalizi. 2012. “Philosophy and the Practice of Bayesian Statistics (with discussion)”. British Journal of Mathematical and Statistical Psychology (BJMSP). Article first published online: 24 Feb 2012.
Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
Mayo, D. G. 2013 (forthcoming). “Comments on A. Gelman and C. Shalizi: Philosophy and the Practice of Bayesian Statistics” (Uncorrected Draft). British Journal of Mathematical and Statistical Psychology.
Mayo, D., and D. Cox. 2006. “Frequentist Statistics as a Theory of Inductive Inference”. In Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, 77–97. Vol. 49, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS). Reprinted in D. Mayo and A. Spanos, 2010: 247–275.
Mayo, D., and A. Spanos. 2011. “Error Statistics”. In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster. Handbook of the Philosophy of Science. Oxford: Elsevier.
Wasserman, L. 2006. “Frequentist Bayes is Objective”. Bayesian Analysis 1(3):451-456. URL http://ba.stat.cmu.edu/journal/2006/vol01/issue03/wasserman.pdf.
[i] I refer to these methods as “error statistical” because of their focus on using sampling distributions to control and assess error probabilities. In contexts of scientific inference, error probabilities are used to evaluate severity and inseverity. The single concept of severity applies to both the usual rejections and non-rejections, but the severity, which is data-dependent, is only in the same direction as power in the case of non-rejections.