Comment on Gelman’s “Induction and Deduction in Bayesian Data Analysis” (RMM)
Dr. Christian Hennig (Senior Lecturer, Department of Statistical Science, University College London)
I have read quite a bit of what Andrew Gelman has written in recent years, including some of his blog. One thing that I find particularly refreshing and important about his approach is that he contrasts the Bayesian and frequentist philosophical conceptions honestly with what happens in the practice of data analysis, which often cannot (or does better not to) proceed according to an inflexible dogmatic book of rules.
I also like the emphasis on the fact that all models are wrong. I personally believe that a good philosophy of statistics should consistently take into account that models are tools for thinking rather than things that can “match” reality, and that in the vast majority of cases we know clearly that they are wrong (all continuous models are wrong because all observed data are discrete, for a start).
There is, however, one issue on which I find his approach unsatisfactory (or at least not well enough explained), and on which both frequentism and subjective Bayesianism seem superior to me.
Note that for me the terms “frequentist” and “subjective Bayes” point to interpretations of probability, and not to specific methods of inference. The frequentist one refers to the idea that there is an underlying data generating process that repeatedly throws out data and would approximate the assumed distribution if one could only repeat it infinitely often. The subjective Bayesian one is about quantifying belief in a rational way; following de Finetti, it would in fact be about belief in observable future outcomes of experiments, and not in the truth of models. Priors over model parameters, according to de Finetti, are only technical devices to deal with belief distributions for future outcomes, and should not be interpreted in their own right. In order to adopt such interpretations, it isn’t necessary to believe that the world (or a perfectly rational human brain) really behave in such a way. It is enough to find some use in thinking about the world (or about rationality) as if they did.
Despite common historical roots, I think that these two interpretations of probability are separate. If mathematical probability is used to model a data generating process, it does not model rational handling of beliefs at the same time.
There are further interpretations of probability apart from these two. Gelman emphasizes that he isn’t keen on using either of them consistently, but he doesn’t explain what his alternative is. I think that a statistician doesn’t need to be either a consistent frequentist or a consistent Bayesian all the time; I don’t see anything wrong with doing a Bayesian analysis of voting behaviour and then doing a frequentist analysis of some physical data connected to the general theory of relativity. I think that it is also perfectly fine to use Bayesian methodology with a frequentist interpretation of probability, at least as long as the final interpretation is consistent with this (which means that posterior probabilities can only be interpreted as probabilities if the prior can be given a proper “physical” meaning, but a posterior mode can be used as an estimator with hopefully good frequentist properties in any case).
However, I think that any single analysis that uses and interprets probabilities can only make sense if it is clear what is meant by “probability” in that particular situation. So I think that it’s quite a serious omission that Gelman doesn’t tell us his interpretation (he may do that elsewhere, though). I don’t think that it is straightforward to find one. He seems to interpret data models given a parameter mainly in a frequentist way (otherwise he couldn’t test them against the data). But at least in the present paper he doesn’t give any frequentist meaning to the prior distributions. If he doesn’t want to do that, I don’t see how the use of Bayes’s theorem can be justified: if the parametric model is frequentist whereas the prior distribution is epistemic, there are two interpretationally incompatible probabilities, and I don’t think it is justified to use them in a formalism as if they were the same kind of thing.
In a presentation I attended, Gelman actually made an attempt to motivate priors using ideas that I’d qualify as frequentist, namely using distributions “over all kinds of problems a data analyst may encounter that somehow look like the one just analysed” (he may not be happy with this explanation, so he is invited to correct me). This is a nice idea, but it would lead into a discussion of which data analytic problem can, and for which reason, be interpreted as a “repetition” in the frequentist sense of which other problem, and I can’t see a satisfactory answer to that (surely it depends on the researchers’ modelling decisions whether their particular studies qualify or not?). Even if there were one, I don’t think that a proper empirical basis for figuring out appropriate prior distributions for a given problem currently exists (one could air this as a research programme, though).
So I think that the key issue Gelman needs to address is: “Explain what the modeled probabilities mean!” By the way, using a clear interpretation consistently can explain the reluctance of a subjective Bayesian to check the model against the data.
I add a note on posterior predictive checking. In a standard Bayesian analysis, the idea is that all the data at hand come from the same parameter; in terms of data generating mechanisms: first the prior distribution throws out a single parameter value, and then the parametric distribution with this parameter value generates all the data. This implies that the effective sample size for checking the prior distribution is actually smaller than one! (It would be one if we knew the parameter generated by the prior, because there is just one, but in fact we can only know it imprecisely through the data.) Of course, sometimes a distribution can be rejected based on a sample size of one or smaller (e.g., a Normal(0,1) distribution after observing a single value above 100, even imprecisely), but surely the power of such a procedure can only be very weak, and it is strictly impossible to diagnose, for example, variation features of the prior.
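The weak-power point can be illustrated with a small calculation. What follows is only a sketch under assumptions that are not part of the comment above: a hypothetical toy model in which a single parameter is drawn from a Normal prior with mean 0 and standard deviation tau, all n observations are Normal with that parameter as mean and unit variance, and the prior is "checked" by a two-sided test on the sample mean at roughly the 5% level. The function names `normal_cdf` and `prior_check_power` are made up for this illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prior_check_power(tau_true: float, tau_assumed: float, n: int, z: float = 1.96) -> float:
    """Probability that a two-sided check on the sample mean rejects the assumed prior.

    Hypothetical toy model (an assumption for illustration only):
        theta ~ Normal(0, tau_true^2)         one draw from the prior
        y_1..y_n | theta ~ Normal(theta, 1)   all data share that one theta
    Marginally, the sample mean is Normal(0, tau^2 + 1/n), so the check
    rejects when |mean| > z * sqrt(tau_assumed^2 + 1/n).
    """
    sd_assumed = math.sqrt(tau_assumed**2 + 1.0 / n)
    sd_true = math.sqrt(tau_true**2 + 1.0 / n)
    threshold = z * sd_assumed
    return 2.0 * (1.0 - normal_cdf(threshold / sd_true))


if __name__ == "__main__":
    # Assumed prior sd is 1; the "true" prior sd is either 3 or 0.5.
    for n in (10, 1_000, 1_000_000):
        wider = prior_check_power(tau_true=3.0, tau_assumed=1.0, n=n)
        narrower = prior_check_power(tau_true=0.5, tau_assumed=1.0, n=n)
        print(f"n={n:>9}: power vs wider prior={wider:.3f}, vs narrower prior={narrower:.5f}")
```

In this toy setting the power against a prior three times as wide as the assumed one stays near one half however large n becomes, and the power against a narrower prior is far below the nominal level, because more data only sharpen knowledge of the one parameter value that was drawn and never add a second draw from the prior.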