Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS)
“Dawid’s Selection Paradox”
You can protest, of course, that Dawid’s Selection Paradox is no such thing, but then those who believe in the inexorable triumph of logic will deny that anything is a paradox. In a challenging paper published nearly 20 years ago (Dawid 1994), Philip Dawid drew attention to a ‘paradox’ of Bayesian inference. To describe it, I can do no better than to cite the abstract of the paper, which is available from Project Euclid here: http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?
When the inference to be made is selected after looking at the data, the classical statistical approach demands — as seems intuitively sensible — that allowance be made for the bias thus introduced. From a Bayesian viewpoint, however, no such adjustment is required, even when the Bayesian inference closely mimics the unadjusted classical one. In this paper we examine more closely this seeming inadequacy of the Bayesian approach. In particular, it is argued that conjugate priors for multivariate problems typically embody an unreasonable determinism property, at variance with the above intuition.
I consider this to be an important paper not only for Bayesians but also for frequentists, yet according to Google Scholar it has been cited only 14 times as of 15 November 2013. In fact, I wrote a paper about it in The American Statistician a few years back (Senn 2008) and have also referred to it in a previous blogpost (12 May 2012). That I think it is important and neglected is excuse enough to write about it again.
Philip Dawid is not responsible for my interpretation of his paradox, but the way that I understand it can be explained by considering what it means to have a prior distribution. As a reminder, if you are going to be 100% Bayesian, which is to say that all you will do by way of inference is turn a prior into a posterior distribution using the likelihood and the operation of Bayes’ theorem, then your prior distribution has to satisfy two conditions. First, it must be what you would use to bet now (that is to say, at the moment it is established), and second, no amount of subsequent data will change your prior qua prior. It will, of course, be updated by Bayes’ theorem to form a posterior distribution once further data are obtained, but that is another matter. The relevant time here is your observation time, not the time when the data were collected, so that data that were available in principle but only came to your attention after you established your prior distribution count as further data.
Now suppose that you are going to make an inference about a population mean, θ, using a random sample from the population, and that you choose the standard conjugate prior distribution. In that case you will use a Normal distribution with parameters μ and σ² that are known (to you). If σ² is large compared to the random variation you might expect for the means in your sample, then the prior distribution is fairly uninformative, and if it is small, then fairly informative; but being uninformative is not in itself a virtue. Being not informative enough runs the risk that your prior distribution is not one you would wish to use to bet now, and being too informative runs the risk that your prior distribution is one you might be tempted to change given further information. In either case your prior distribution will be wrong. Thus the task is to be neither too informative nor not informative enough.
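For concreteness, here is the standard Normal–Normal conjugate update this set-up implies. The within-sample variance τ² and sample size n are notation I am adding for the illustration; they are not symbols from the original discussion.

```latex
% Conjugate Normal-Normal update: prior, likelihood for the sample mean, posterior
\theta \sim N(\mu, \sigma^2), \qquad
\bar{x} \mid \theta \sim N\!\left(\theta, \tau^2/n\right)
\;\Longrightarrow\;
\theta \mid \bar{x} \sim N\!\left(
  \frac{\mu/\sigma^2 + n\bar{x}/\tau^2}{1/\sigma^2 + n/\tau^2},
  \;\frac{1}{1/\sigma^2 + n/\tau^2}
\right)
```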
However, as I pointed out in my previous blogpost on this topic, a conjugate prior distribution is perfectly informative about itself. Consider the accompanying figure. Here I have simulated a number of true population means (that is to say, possible values of θ) from the prior distribution that governs the probability with which they arise. (I have chosen a Normal distribution with mean μ = 0 and variance σ² = 25, but any values would do to illustrate the point.) I have simulated 10, 100, 1000 and 10,000 values from this distribution; in each case the simulated values are shown along the X axis and an empirical density estimate is shown as a black curve. For each of the four cases the red curve shows the parent distribution from which the values were simulated.
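For readers who want to reproduce something like the figure, here is a minimal sketch of my own (not the code behind the original figure), assuming numpy, scipy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
mu, sigma = 0.0, 5.0                      # prior mean 0, variance 25
grid = np.linspace(-20, 20, 400)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, n in zip(axes.flat, (10, 100, 1000, 10000)):
    theta = rng.normal(mu, sigma, size=n)                    # n true means drawn from the prior
    ax.plot(grid, gaussian_kde(theta)(grid), "k-", label="empirical density")
    ax.plot(grid, norm.pdf(grid, mu, sigma), "r-", label="prior N(0, 25)")
    ax.plot(theta, np.zeros_like(theta), "|", color="grey")  # rug of simulated means
    ax.set_title(f"{n} simulated true means")
    ax.legend()
plt.tight_layout()
plt.show()
```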
If you look at the empirical density curves you will see that for the samples of 10 and 100 means these do not estimate the distribution well, and one could argue that even the sample of 1000 does not do an excellent job, although it provides a fair fit. Thus to have a prior distribution is equivalent to having seen rather a lot of information of a certain sort. Note that what the simulation has provided is a sample of true means, not sample means, since it is sampled from the prior distribution. Thus the information is equivalent to having seen a very great number of extremely large samples, each from a population chosen at random from a hyper-population.
Thus one can see that, if this inferential set-up is assumed to apply, the ‘paradox’ becomes explicable. Suppose that you have run a number of experiments measuring (say) ten treatments in vitro. A sample mean is estimated for each of the ten treatments based on (say) 20 replicates. You now choose the largest observed mean and make an inference about the corresponding treatment. Why should the fact that this is the largest of ten have any influence on your inference? After all, you have a prior distribution that provides information on thousands of treatments measured using enormous sample sizes. The information provided by the other nine treatments you studied now is paltry by comparison.
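One can check this numerically. The sketch below is my own construction, not Dawid’s or Senn’s, and the within-treatment variance tau2 = 100 is an arbitrary illustrative choice. It draws ten true treatment means from the conjugate prior, selects the largest observed mean, and computes the usual conjugate posterior interval with no selection adjustment; because the means really are drawn from the prior, the 95% credible interval still covers the selected true mean about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 0.0, 25.0      # conjugate prior N(mu, sigma2) for every treatment mean
tau2 = 100.0                # assumed known within-treatment variance (illustrative)
se2 = tau2 / 20             # variance of each sample mean of 20 replicates

reps, covered = 20000, 0
for _ in range(reps):
    theta = rng.normal(mu, np.sqrt(sigma2), size=10)     # ten true treatment means
    xbar = rng.normal(theta, np.sqrt(se2))               # one observed mean per treatment
    k = int(np.argmax(xbar))                             # pick the largest observed mean
    # conjugate posterior for the selected treatment, ignoring the selection
    post_var = 1.0 / (1.0 / sigma2 + 1.0 / se2)
    post_mean = post_var * (mu / sigma2 + xbar[k] / se2)
    half = 1.96 * np.sqrt(post_var)
    covered += (post_mean - half <= theta[k] <= post_mean + half)
print(covered / reps)       # close to 0.95 despite selecting the maximum
```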
But there is a sting in the tail. If I tell you that I took 200 replicates from one treatment, randomly split these into ten samples of 20, and that the mean I am using to make an inference about that treatment is the highest of the ten, then this is something that, as a Bayesian, you must take account of. In the case of ten different treatments, given the conjugate prior, the distribution of one set of 20 values is conditionally independent of all the other sets: there is no possible leakage of information from one to another. This is not so when all ten sets are sampled from the same treatment, since they then share a common mean.
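The contrast shows up if the same calculation is rerun with a single treatment whose replicates are split into ten means (again my own illustrative sketch, with the same arbitrary variances as before). The naive conjugate posterior built from the largest split mean now under-covers badly, because the ten means share one θ and their maximum is systematically too high:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 0.0, 25.0      # same conjugate prior as before
tau2 = 100.0                # assumed known replicate variance (illustrative)
se2 = tau2 / 20             # variance of a mean of 20 replicates

reps, covered = 20000, 0
for _ in range(reps):
    theta = rng.normal(mu, np.sqrt(sigma2))              # ONE true treatment mean
    xbar = rng.normal(theta, np.sqrt(se2), size=10)      # ten split means sharing that mean
    x_max = xbar.max()                                   # report only the largest split mean
    # naive posterior that pretends x_max is an ordinary mean of 20 replicates
    post_var = 1.0 / (1.0 / sigma2 + 1.0 / se2)
    post_mean = post_var * (mu / sigma2 + x_max / se2)
    half = 1.96 * np.sqrt(post_var)
    covered += (post_mean - half <= theta <= post_mean + half)
print(covered / reps)       # well below 0.95: here the selection must be modelled
```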
This suggests that one way partially to restore the frequentist intuition to the Bayesian analysis is to use a hierarchical modelling set-up, with a hyper-prior distribution placed on the parameters of the prior distribution. See my paper for further discussion (Senn 2008).
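Schematically, and with hyper-parameters m_0 and v_0 that are merely illustrative (the details in Senn 2008 differ), such a set-up might look like:

```latex
% Hierarchical set-up: the prior mean itself gets a (hyper-)prior
\bar{x}_i \mid \theta_i \sim N(\theta_i, \tau^2/n), \qquad
\theta_i \mid \mu \sim N(\mu, \sigma^2), \quad i = 1, \dots, 10,
\qquad \mu \sim N(m_0, v_0)
```

Because μ is now estimated from all ten treatments rather than fixed in advance, the other nine means regain some influence on the inference for the selected one.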
Note that I am not claiming here that the frequentist approach is right and the Bayesian wrong. I am making a point that I have made elsewhere, namely that thinking about each is useful (Senn 2011).
Dawid, A. P. (1994). Selection paradoxes of Bayesian inference. In Multivariate Analysis and Its Applications (T. W. Anderson, K. T. Fang and I. Olkin, eds.), IMS Lecture Notes–Monograph Series, Vol. 24.
Senn, S. (2008). A note concerning a selection ‘paradox’ of Dawid’s. The American Statistician 62(3): 206–210.
Senn, S. J. (2011). You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals 2: 48–66.
Stephen: Thanks so much for the guest post. I’ll study it more tomorrow.
When you say “you have a prior distribution that provides information on 1000s of treatments measured using enormous sample sizes” you seem to be assuming some kind of empirical frequentist prior. Where would the priors on the priors on the priors ever stop?