To resume sharing some notes I scribbled down on the contributions to our Philosophy of Science Association symposium on Philosophy of Statistics (Nov. 4, 2016), I’m up to Gelman. Comments on Gigerenzer and Glymour are here and here. Gelman didn’t use slides but gave a very thoughtful, extemporaneous presentation on his conception of “falsificationist Bayesianism”, its relation to current foundational issues, as well as to error statistical testing. My comments follow his abstract.

*Confirmationist and Falsificationist Paradigms in Statistical Practice*

Andrew Gelman

There is a divide in statistics between classical frequentist and Bayesian methods. Classical hypothesis testing is generally taken to follow a falsificationist, Popperian philosophy in which research hypotheses are put to the test and rejected when data do not accord with predictions. Bayesian inference is generally taken to follow a confirmationist philosophy in which data are used to update the probabilities of different hypotheses. We disagree with this conventional Bayesian-frequentist contrast: We argue that classical null hypothesis significance testing is actually used in a confirmationist sense and in fact does not do what it purports to do; and we argue that Bayesian inference cannot in general supply reasonable probabilities of models being true. The standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify, the rejection of which is then taken as evidence in favor of A. Research projects are framed as quests for confirmation of a theory, and once confirmation is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

Instead, we recommend a falsificationist Bayesian approach in which models are altered and rejected based on data. The conventional Bayesian confirmation view blinds many Bayesians to the benefits of predictive model checking. The view is that any Bayesian model necessarily represents a subjective prior distribution and as such could never be tested. It is not only Bayesians who avoid model checking. Quantitative researchers in political science, economics, and sociology regularly fit elaborate models without even the thought of checking their fit. We can perform a Bayesian test by first assuming the model is true, then obtaining the posterior distribution, and then determining the distribution of the test statistic in hypothetical replicated data drawn under the fitted model. A posterior distribution is not the final end, but is part of the derived prediction for testing. In practice, we implement this sort of check via simulation.
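The simulation Gelman describes can be sketched in a few lines. This is a minimal illustration of my own devising, not code from the talk: a toy conjugate normal model (known variance, normal prior on the mean), where we draw the parameter from its posterior, simulate replicated datasets, and compare a test statistic on the replications to its observed value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative "observed" data: 50 draws from some process.
y = rng.normal(0.0, 1.0, size=50)
n = len(y)

# Toy model: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 10^2).
# Conjugacy gives the posterior for mu in closed form.
sigma2, tau2 = 1.0, 100.0
post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
post_mean = post_var * (y.sum() / sigma2)

def T(data):
    # Test statistic: the largest absolute observation -- a feature
    # the model is NOT guaranteed to fit automatically.
    return np.max(np.abs(data))

# 1. Draw mu from the posterior; 2. simulate replicated data y_rep
#    under the fitted model; 3. compare T(y_rep) with T(y).
S = 5000
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=S)
t_rep = np.array([T(rng.normal(mu, 1.0, size=n)) for mu in mu_draws])

# Posterior predictive p-value: share of replications with a test
# statistic at least as extreme as observed. Values near 0 or 1
# flag an aspect of the data the model fails to capture.
ppp = np.mean(t_rep >= T(y))
print(f"observed T = {T(y):.2f}, posterior predictive p = {ppp:.3f}")
```

The choice of test statistic is the substantive step: a statistic the model fits automatically (such as the sample mean here) would tell you little, which is exactly the "using the data twice" point taken up below.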

Posterior predictive checks are disliked by some Bayesians because of their low power arising from their allegedly “using the data twice”. This is not a problem for us: low power on a given check simply indicates a dimension of the data that is virtually automatically fit by the model. What can statistics learn from philosophy? Falsification and the notion of scientific revolutions can make us willing to check our model fit and to vigorously investigate anomalies rather than treat prediction as the only goal of statistics. What can the philosophy of science learn from statistical practice? The success of inference using elaborate models, full of assumptions that are certainly wrong, demonstrates the power of deductive inference, and posterior predictive checking demonstrates that ideas of falsification and error statistics can be applied in a fully Bayesian environment with informative likelihoods and prior distributions.

**Mayo Comments:**

(a) I welcome Gelman’s arguments against all Bayesian probabilisms, and am intrigued with Gelman and Shalizi’s (2013) ‘meeting of the minds’ (which I regard as a kind of error statistical Bayesianism) [1]. As I say in my concluding remark on their paper:

The authors have provided a radical and important challenge to the foundations of current Bayesian statistics, in a way that reflects current practice. Their paper points to interesting new research problems for advancing what is essentially a dramatic paradigm change in Bayesian foundations. …I hope that [it]…will motivate Bayesian epistemologists in philosophy to take note of foundational problems in Bayesian practice, and that it will inspire philosophically-minded frequentist error statisticians to help craft a new foundation for using statistical tools – one that will afford a series of error probes that, taken together, enable stringent or severe testing.

I’ve been trying to understand the workings of the approach well enough to illuminate its philosophical foundations–more on that in a later post [2].

(b) Going back to my symposium chicken-scratching, I wrote: “Gelman says p-values aren’t falsificationist, but confirmationist–[he’s referring to] that abusive animal” whereby a statistically significant result is taken as evidence in favor of a research claim *H* taken to entail the observed effect. This is also how Glymour characterized confirmatory research in his talk (see the slide I discuss). In one of my own slides from the PSA, I describe p-value reasoning, given an apt test statistic T:

From a single statistically significant result, you can’t go directly to inferring a genuine discrepancy from, or falsification of, the test hypothesis; but you can once you’ve shown that a significant result rarely fails to be brought about (as Fisher required). The next stages may lead to a revised model or hypothesis being warranted with severity; later still, a falsification of a research claim may be well-corroborated. Once the statistical (relativistic) light-bending effect was vouchsafed (by means of statistically rejecting Newtonian null hypotheses), it falsified the Newtonian prediction (of 0, or half the Einstein deflection) and, together with other statistical inferences, led to passing the Einstein effect severely. The large randomized, controlled trials of Hormone Replacement Therapy in 2002 revealed statistically significant increased risks of heart disease. They falsified, first, the nulls of the RCTs, and second, the widely accepted claim (from observational studies) that HRT helps prevent heart disease. I’m skimming details, but the gist is clear. *How else is Gelman’s own statistical falsification program supposed to work?* Posterior predictive p-values follow essentially the same error statistical testing reasoning.
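The bare computation behind this p-value reasoning can be sketched as follows. The setup is a generic one-sample test of a zero-mean null with known unit variance, using the standardized sample mean as the apt test statistic T; the numbers are purely illustrative and have nothing to do with the HRT trials or the eclipse data.

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample whose mean we test against the null mu0 = 0,
# assuming each observation has known standard deviation sigma = 1.
y = rng.normal(0.3, 1.0, size=100)
mu0, sigma = 0.0, 1.0
n = len(y)

# Apt test statistic T: the standardized sample mean, which under
# the null is distributed as a standard normal.
t_obs = (y.mean() - mu0) / (sigma / math.sqrt(n))

# One-sided p-value: the probability, under the null, of a T at
# least as large as the one observed. A small p indicates that an
# effect this large would rarely be brought about were the null true.
p = 0.5 * math.erfc(t_obs / math.sqrt(2.0))
print(f"T = {t_obs:.2f}, one-sided p = {p:.4f}")
```

A single small p licenses only the first step; Fisher’s demand that the effect “rarely fails” to appear corresponds to requiring such results across repeated, independent tests before inferring a genuine discrepancy.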

*Share your thoughts.*

[1] Another relevant, short, and clear paper is Gelman’s (2011) “Induction and Deduction in Bayesian Data Analysis”.

[2] You can search this blog for quite a lot on Gelman and our exchanges.

**REFERENCES**

Fisher, R. A. 1947. *The Design of Experiments* (4th ed.). Edinburgh: Oliver and Boyd.

Gelman, A. 2011. ‘Induction and Deduction in Bayesian Data Analysis’, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science*, Mayo, D., Spanos, A. and Staley, K. (eds.), pp. 67-78. Cambridge: Cambridge University Press.

Gelman, A. and Shalizi, C. 2013. ‘Philosophy and the Practice of Bayesian Statistics’ and ‘Rejoinder’, *British Journal of Mathematical and Statistical Psychology* 66(1): 8–38; 76–80.

Mayo, D. G. 2013. ‘Comments on A. Gelman and C. Shalizi: “Philosophy and the Practice of Bayesian Statistics”’ (with discussion), *British Journal of Mathematical and Statistical Psychology* 66(1): 5–64.