Continuing with our discussion of contributions to the special topic, Statistical Science and Philosophy of Science in Rationality, Markets and Morals (RMM),* I am pleased to post some comments on Andrew Gelman’s paper “Induction and Deduction in Bayesian Data Analysis”. (More comments to follow—as always, feel free to comment.)

Note: March 9, 2012: Gelman has responded to some of our comments on his blog today: http://andrewgelman.com/2012/03/coming-to-agreement-on-philosophy-of-statistics/

## D. Mayo

For now, I will limit my own comments to two: First, a fairly uncontroversial point, while Gelman writes that “Popper has argued (convincingly, in my opinion) that scientific inference is not inductive but deductive,” a main point of my series (Part 1, 2, 3) of “No-Pain” philosophy was that “deductive” falsification involves inductively inferring a “falsifying hypothesis”.

More importantly, and more challengingly, Gelman claims the view he recommends “corresponds closely to the error-statistics idea of Mayo (1996)”. Now the idea that non-Bayesian ideas might afford a foundation for strands of Bayesianism is not as implausible as it may seem. On the face of it, any inference to a claim, whether to the adequacy of a model (for a given purpose), or even to a posterior probability, can be said to be warranted just to the extent that the claim has withstood a severe test (i.e., a test that would, at least with reasonable probability, have discerned a flaw with the claim, were it false). The question is: How well do Gelman’s methods for inferring statistical models satisfy severity criteria? (I’m not sufficiently familiar with his intended applications to say.)
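The severity idea in the parenthesis can be made concrete with a toy calculation. The following sketch is an illustration only, not Mayo’s formal severity assessment; the sample size, discrepancy, and test level are all hypothetical choices made for the example:

```python
import numpy as np
from scipy import stats

# Sketch of the severity intuition: a claim passes a severe test only if,
# were it false, the test would probably have signalled the flaw.
# Take a one-sided 5% z-test of H0: mu <= 0 with n = 25 observations of
# X ~ N(mu, 1), and ask how probably it detects a discrepancy mu = 0.5.
n = 25
crit = stats.norm.ppf(0.95) / np.sqrt(n)   # rejection threshold for the sample mean
# Probability the test rejects when the true mean is actually 0.5:
detect = 1 - stats.norm.cdf(crit, loc=0.5, scale=1 / np.sqrt(n))
print(f"P(reject | mu = 0.5) = {detect:.3f}")  # ~0.80: the test probes mu = 0.5 with reasonable severity
```

A high detection probability is what licenses saying the claim “mu is not as large as 0.5” has been probed severely when the test fails to reject.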

## Stephen Senn

**Competence Center for Methodology and Statistics, CRP-Santé, Strassen, Luxembourg**

I am in sympathy with much of what Andrew Gelman has to say. This is not surprising since I know that he takes data-analysis seriously and that is always a recommendation. There is no contradiction in being Bayesian and liking falsificationism. Not only Popper but also De Finetti was a falsificationist. They differed, of course, in terms of their interpretation of probability: whereas Popper stressed the growth of knowledge through the refutation of hypotheses, for De Finetti it was the falsificationist modification of *belief* that mattered.

However, it is not the testing of hypotheses or models *per se* that is the potential problem here. The problem is how it serves the main purpose of data-analysis. Whether (s)he is frequentist or Bayesian, the testing of auxiliary hypotheses (for example, concerning the nature of the model) as a feature of fitting data is something that should make the statistician uncomfortable. To the extent that such a procedure of fitting, testing and re-fitting can be made automatic, it inevitably defines (to take the frequentist point of view) a way in which the sample space leads to a given conclusion. Thus, what seems like several stages is in fact just one (a point analogous to the one that Gelman makes about model averaging). The danger is that one behaves as if the activity of assumption-checking were innocent.

A notorious example can be given. From the early 1960s to the late 1980s a standard frequentist approach to analysing AB/BA cross-over designs was to start by using the sum of the responses over both periods to test for carry-over at the 10% level. If the null hypothesis of no carry-over was not rejected, one proceeded to perform the within-patient test one always intended (a slight modification of the matched pairs t); if it was rejected, one performed a two-sample t-test on the first-period values only (which could not be affected by carry-over). It was a Bayesian statistician, Peter Freeman, who demonstrated convincingly in 1989[1] that this procedure had very bad properties. Although either of the two tests of the treatment effect (chosen depending on the result of testing for carry-over) maintained the nominal significance level if chosen unconditionally, the procedure as a whole had a type I error rate in excess of the claimed level. In fact, one can show that if a notional 5% level is used then either the two-stage procedure is irrelevant (one performs the within-patient test) or the type I error rate is between 25% and 50%[2, 3]!
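A quick simulation makes the inflation visible. This is a sketch under simplifying normal-model assumptions (equal variances, a shared patient effect inducing within-patient correlation), not Freeman’s exact calculation; the sample size and correlation are hypothetical choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_two_stage(n=20, rho=0.95, sims=20_000, nominal=0.05):
    """Fraction of null datasets on which the two-stage AB/BA procedure
    declares a treatment effect at the nominal level."""
    rejections = 0
    for _ in range(sims):
        # Complete null: no treatment, period, or carry-over effects.
        # A shared patient effect gives within-patient correlation rho.
        def group():
            b = rng.normal(size=(n, 1)) * np.sqrt(rho)      # patient effect
            e = rng.normal(size=(n, 2)) * np.sqrt(1 - rho)  # measurement error
            return b + e                                    # columns = periods 1, 2
        ab, ba = group(), group()
        # Stage 1: carry-over test at the 10% level on per-patient sums.
        _, p_carry = stats.ttest_ind(ab.sum(axis=1), ba.sum(axis=1))
        if p_carry > 0.10:
            # Intended within-patient test: two-sample t on period differences.
            _, p = stats.ttest_ind(ab[:, 0] - ab[:, 1], ba[:, 0] - ba[:, 1])
        else:
            # Fallback: two-sample t on the first-period values only.
            _, p = stats.ttest_ind(ab[:, 0], ba[:, 0])
        rejections += p < nominal
    return rejections / sims

rate = simulate_two_stage()
print(f"overall type I error: {rate:.3f}")  # clearly exceeds the nominal 0.05
```

The inflation arises because the carry-over test (on sums) and the first-period test are positively correlated, so the fallback test is disproportionately likely to reject exactly when it is used.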

It seems to me that a Bayesian problem of assumption checking is as follows. One defines a model and establishes prior distributions for parameters. These prior distributions can frequently be expressed in terms of ‘pseudo data’ that they bring to the problem. To an extent that is defined by the model pseudo-data and data are exchangeable so that, under the Bayesian formalism and given what model plus prior is supposed to express about your belief, it is a matter of complete indifference once the data are in whether they are pretty much as predicted by the prior distribution or completely at odds with it. Suppose, however, that you do allow yourself the possibility of changing some aspect of the prior probabilities of possible data patterns, which probabilities are defined by the combination of substantive model and prior distributions on parameters, as a result of actual data patterns obtained. (This is *not* a question of Bayesian updating the prior distribution to become the posterior distribution but of changing the prior distribution itself.) What then are the consequences?
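The “pseudo-data” reading of a prior is easiest to see in a conjugate example. The following is a minimal illustration of the point, with arbitrary illustrative counts; Senn’s argument does not depend on this particular model:

```python
from scipy import stats

# A Beta(a, b) prior for a binomial success probability acts like having
# already observed a successes and b failures: "pseudo-data".
a, b = 3, 7                    # prior pseudo-counts (illustrative values)
successes, failures = 12, 8    # observed data (illustrative values)

# Under the Bayesian formalism, pseudo-data and data enter symmetrically:
posterior = stats.beta(a + successes, b + failures)
print(posterior.mean())  # (3 + 12) / (3 + 12 + 7 + 8) = 0.5
```

The symmetry is Senn’s point: once the prior is fixed, the formalism is indifferent to whether the real data agree with the pseudo-data or clash with them, so revising the prior after seeing the data steps outside the formalism.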

It is analogous to claiming *force majeure* in the operation of a contract. It is dangerous for any applied statistician to willingly forgo the ability to check models and revise analysis plans in consequence[4]. (Note, however, that in drug development, where this is an important issue of trust, regulatory guidance discourages this.[5]) However, the problem is not just that the model may be changed if the observed data are at odds with their prediction but that the model will be retained with an inappropriate degree of certainty (one that never really applied, because change was allowed) if the data happen to agree with the model.

I don’t think this is necessarily more of a problem for Bayesians than for frequentists. We all cheat in this way but nevertheless it is worrying. It means that the statistical intervals that we issue, whether confidence intervals or credible intervals, ought to be wider than we claim they are. Our statements of uncertainty are far from including all uncertainties. This is, perhaps, realistic, but it is also not entirely satisfactory.

References

1. Freeman, P., *The performance of the two-stage analysis of two-treatment, two-period cross-over trials.* Statistics in Medicine, 1989. **8**: p. 1421-1432.

2. Senn, S.J., *The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t.*, in *Liber Amicorum Roel van Strik*, B. Hansen and M. de Ridder, Editors. 1996, Erasmus University: Rotterdam. p. 93-100.

3. Senn, S.J., *The case for cross-over trials in phase III [letter; comment].* Statistics in Medicine, 1997. **16**(17): p. 2021-2.

4. Spanos, A., *Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation.* Rationality, Markets and Morals, 2011. **2**: p. 146-178.

5. International Conference on Harmonisation, *Statistical principles for clinical trials (ICH E9).* Statistics in Medicine, 1999. **18**: p. 1905-1942.

_______________________________________

## Larry Wasserman

**Professor in the Department of Statistics and in the Machine Learning Department, Carnegie Mellon**

I agree with Andrew Gelman’s main point, namely, that the usual division of statistics into the simplistic categories of “Frequentist” and “Bayesian” is misleading. When analyzing data, there are many things I would do differently than Andrew. Nonetheless, I agree with his general idea of analyzing data without living in a philosophical straitjacket.

Here’s how I see it. The idealized forms of Bayesian and frequentist statistics are sterile and unrealistic. The pure Bayesian envisions perfectly rational agents with well-defined subjective distributions on everything. This is absurd. The pure frequentist imagines that data are an i.i.d. sample from some distribution P. This is also absurd. Data are just a bunch of numbers (or images, text, etc.).

A pragmatic Bayesian will temporarily embrace the Bayesian viewpoint as a way to frame their analysis. But they are willing to step outside the framework, challenge their models and priors, and use practical tools like goodness-of-fit tests. They will ask about the frequentist behavior of their Bayesian procedures. This is what Gelman argues for and I wish more Bayesians would adopt this view. The more philosophically strict Bayesians are painting themselves into a corner. They end up rejecting goodness-of-fit tests, randomization, permutation tests, distribution-free methods and many other useful tools. I once saw a Bayesian give a talk in which he apologized profusely for using cross-validation! When you are apologizing for using one of the most useful practical tools in statistics, you need to re-examine your principles.

Similarly, frequentists need to be open to challenging their world view. In particular, frequentists should be open to using Bayesian thinking when it is appropriate.

I could disagree with Andrew on a few small points. For example, I am not a fan of posterior predictive checks. And I think his model-centered way of approaching statistics biases him against some great distribution-free methods. (But maybe I’m wrong about that.) Also, I don’t agree with his statement that classical nonparametric methods rely on unrealistic assumptions.

Let me digress a little and give an example of a powerful distribution-free method. (This is based on recent work by Jing Lei, Jamie Robins and me; see http://arxiv.org/abs/1111.1418.) We observe n data points Y1, … , Yn. The goal is to predict a new observation Y. Building on ideas from Vladimir Vovk and his colleagues, we constructed a prediction set C with the following properties: if the data are drawn from some distribution P, then C contains Y with probability at least 1 – alpha, no matter what the distribution P is. Furthermore, if P happens to be smooth, then C is guaranteed to be as small as possible. Something like this is very useful and makes almost no assumptions. But it would seem to be out of reach for even the pragmatic Bayesian, at least as far as I can tell. So I wish the pragmatic Bayesian framework could be broadened to include ideas like this.
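The flavor of such distribution-free guarantees can be conveyed with a bare-bones split-conformal sketch. This is a simplified illustration of the conformal idea, not the Lei–Robins–Wasserman construction (which also achieves near-optimal size); the score function and parameters here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def conformal_interval(y, alpha=0.1):
    """Split-conformal prediction interval for a new draw from the same
    unknown distribution: finite-sample coverage >= 1 - alpha for ANY P."""
    n = len(y)
    fit, cal = y[: n // 2], y[n // 2:]
    center = np.median(fit)            # any point estimate from the first half works
    scores = np.abs(cal - center)      # conformity scores on the held-out half
    m = len(cal)
    k = int(np.ceil((m + 1) * (1 - alpha)))  # conformal quantile index
    if k > m:                                # too little data: the whole line
        return -np.inf, np.inf
    q = np.sort(scores)[k - 1]
    return center - q, center + q

# Empirical coverage check on a skewed, decidedly non-normal distribution.
trials, hits = 2000, 0
for _ in range(trials):
    y = rng.exponential(size=200)
    lo, hi = conformal_interval(y, alpha=0.1)
    hits += lo <= rng.exponential() <= hi
print(f"empirical coverage: {hits / trials:.3f}")  # ~0.90, despite assuming nothing about P
```

The guarantee comes purely from exchangeability: the new observation’s score is equally likely to fall at any rank among the calibration scores, so no distributional assumption is needed.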

But overall, I see Andrew’s main message as being this: philosophers need to recognize that statistics, as practiced by good statisticians, is rarely at the philosophically pure extremes one finds in discussions of foundations. I agree with this message.

________________________

*Note: The papers in this special topic volume initially grew out of presentations at the conference, “Statistical Science and Philosophy of Science: Where Do/Should They Meet?” at the London School of Economics, June 2010, with contributions from other statisticians added subsequently.

My own personal approach to Senn’s *force majeure* problem is founded on Bayesian nonparametrics. I use a model space as flexible as possible; upon seeing some data that doesn’t appear to need the added flexibility, I’ll replace it with a restricted model by *force majeure*. In doing so, I postulate/imagine/hope that I am saving computation time by skipping a more correct analysis with a point mass on an “effectively” parametric model within the nonparametric setting (as in this paper).