*Stephen Senn*

Head of the Methodology and Statistics Group,

Competence Center for Methodology and Statistics (CCMS), Luxembourg

This paradox is clearly inspired by and in a sense is just another form of Philip Dawid’s selection paradox[1]. See my paper in *The American Statistician* for a discussion of this[2]. However, I rather like this concrete example of it.

Imagine that you are about to carry out a Bayesian analysis of a new treatment for rheumatism. However, just to avoid various complications, I am going to assume that you are looking at a potential side-effect of the treatment, taking the effect on diastolic blood pressure (DBP) as the example of a side-effect one might look at.

Now, to be truly Bayesian I think that you ought to have a look at a long list of previous treatments for rheumatism but time is short and this is not always so easy. So instead you argue like this.

- I know from the results of the WHO Monica project that the standard deviation of DBP is about 11 mmHg in a general population.
- I have no prior opinion as to whether anti-rheumatics as a class have a beneficial or harmful effect on DBP
- I think that large effects on DBP, whether harmful or beneficial, are rather improbable for a drug designed to treat rheumatism.
- I believe the data are approximately Normal
- I am going to use a conjugate prior for the effect of treatment with mean 0 and standard deviation 4 mmHg. This makes very large beneficial or harmful effects unlikely but still allows reasonable play for the data. It means that the prior variance is 16 mmHg^{2}, compared with a data variance I am expecting to be about 120 mmHg^{2}. Consequently, as soon as I have treated 8 subjects the variance of the data mean (about 15 mmHg^{2}) should be smaller than the prior variance, and from that point on I will actually be weighting the data more than the prior. This seems reasonable to me.
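The arithmetic behind that crossover can be sketched as follows (my illustration, using only the numbers above): in a conjugate Normal-Normal model the posterior mean weights the data by their precision, so the data start to dominate once the variance of the data mean drops below the prior variance.

```python
# Normal-Normal conjugate update: when do the data outweigh the prior?
# Prior on the treatment effect: N(0, 16); single-observation variance ~120 mmHg^2.

PRIOR_VAR = 16.0   # prior variance for the mean treatment effect, mmHg^2
DATA_VAR = 120.0   # variance of a single DBP observation, mmHg^2

def data_weight(n):
    """Weight given to the data mean in the posterior mean after n subjects."""
    prior_precision = 1.0 / PRIOR_VAR
    data_precision = n / DATA_VAR   # precision of the mean of n observations
    return data_precision / (prior_precision + data_precision)

for n in (4, 7, 8, 16):
    mean_var = DATA_VAR / n         # variance of the data mean
    print(f"n={n:2d}: var(data mean)={mean_var:6.2f} mmHg^2, "
          f"data weight={data_weight(n):.3f}")
# At n = 8 the variance of the data mean (15 mmHg^2) first drops below the
# prior variance (16 mmHg^2), so the data weight first exceeds one half.
```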

You can choose different figures if you want but here I am attempting to apply a standard Bayesian analysis in a reasonably honest manner.

A few days after you have decided on your prior distribution, a colleague announces very excitedly that the collaborative data-summary project they have been working on, jointly sponsored by the FDA and the EMA and involving a huge data-collection exercise going back many years into the archives of dozens of sponsors, has now concluded. They are now in a position to make a statement about the distribution of true effects on DBP of anti-rheumatics (and indeed of a host of other drugs), whether or not they came to market. (In doing this, of course, they have avoided the naïve error of using the observed variation amongst drugs, since different drugs will have been measured with different precision and none of them with infinite precision.) Would you like to make some use of this?

You now make a disturbing discovery. In the framework you set up you can make *no use* of this. This is because it is only (potentially) relevant to your prior probability distribution and, although this prior distribution is not very informative about the new drug, it is 100% informative about itself. There was no prior distribution for your prior distribution. As soon as you assert that mean effects over all drugs, past, possible and future, are given by N(0, 16 mmHg^{2}), you can give the probability of the true effect of a random drug falling between any limits you like. Imagine the task of establishing this empirically: you would need dozens, if not hundreds or even thousands, of true effects of various pharmaceuticals, using either a histogram or some sort of density-estimation approach, to get anywhere near what your Normal distribution asserts.
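To make the contrast vivid, here is a small sketch (mine, not part of the original argument): the N(0, 16) prior pins down any such probability exactly, whereas an empirical version would need a very large sample of *true* drug effects, which nobody has, to approximate it.

```python
# The prior claims exact knowledge of the distribution of true effects.
# Compare its exact probability statement with what empirical estimation
# from samples of TRUE effects (simulated here) could deliver.
from statistics import NormalDist
import math
import random

GAMMA = 16.0  # prior variance for the true effect, mmHg^2 (sd = 4)
prior = NormalDist(mu=0.0, sigma=math.sqrt(GAMMA))

def prior_prob(a, b):
    """P(a < true effect < b) under the N(0, 16) prior -- asserted exactly."""
    return prior.cdf(b) - prior.cdf(a)

exact = prior_prob(-2, 2)  # probability the true effect lies within +/- 2 mmHg

# Estimating the same quantity empirically needs many true effects:
random.seed(1)
for n in (10, 100, 10000):
    draws = [random.gauss(0.0, math.sqrt(GAMMA)) for _ in range(n)]
    est = sum(-2 < d < 2 for d in draws) / n
    print(f"{n:5d} true effects: empirical {est:.3f} vs exact {exact:.3f}")
```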

The problem is thus that prior distributions that are fairly uninformative about the parameter we are trying to estimate are infinitely informative about themselves. To deal with that requires a higher level of the hierarchy and, as Jack Good was wont to point out, dealing with this honestly is harder than many suppose[3].

## References

1. Dawid AP. Selection paradoxes of Bayesian inference. In *Multivariate Analysis and its Applications*, Anderson TW, Fang K-T, Olkin I (eds), 1994.

2. Senn S. A note concerning a selection “Paradox” of Dawid’s. *American Statistician* 2008; **62**: 206-210.

3. Good IJ. *Good Thinking: The Foundations of Probability and its Applications.* University of Minnesota Press: Minneapolis, 1983.

_____________________________________

Stephen: thanks for this. I’m not familiar with this paradox, and not quite catching the meaning of the prior being 100% informative about itself.

Consider first the distribution of measured blood pressure ‘effects’ for individual patients. I am claiming that these have mean mu (where mu is unknown) with variance of approximately 120 mmHg^{2}. (If mu is zero the drug will have no effect on blood pressure, on average at least.) What can I say about mu? I can’t say what mu is for sure. If I knew, there would be nothing to learn about this drug. I am going to use the data to learn, but I am also going to use some prior ‘information’. As prior information I am going to use a distribution of presumed effects of drugs: I regard mu as a random realisation from this population of effects of drugs. This ‘prior’ distribution has mean tau = 0 and variance gamma = 16 mmHg^{2}. This variance expresses my prior uncertainty about mu. But what expresses my uncertainty about tau, or, for that matter, gamma? Nothing. Tau just is, and so is gamma. I can’t learn about them.

Stephen: OK but if the prior distribution is uninformative about the given parameter we are trying to estimate, then why are priors deemed important background information by Bayesians? Would this new info override the initial conjugate prior?

Deborah, the prior distribution is not uninformative about the new drug. If I have got my calculations right (and I easily make mistakes) it has information equivalent to having seen 8 patients on the new drug. On the other hand it has information equivalent to having seen an infinity of other drugs, and that is the problem.
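The two equivalences can be sketched in a back-of-envelope way (my illustration, using the variances from the post): a Normal prior is worth an equivalent number of patients given by the ratio of the observation variance to the prior variance, while the fixed hyperparameters carry zero variance, i.e. an infinite equivalent sample of other drugs.

```python
# Equivalent sample sizes implied by the prior (back-of-envelope check).
DATA_VAR = 120.0   # variance of one DBP observation, mmHg^2
PRIOR_VAR = 16.0   # prior variance for the new drug's effect, mmHg^2

# About the NEW drug, the prior is worth n0 = sigma^2 / gamma patients:
n0 = DATA_VAR / PRIOR_VAR
print(f"prior ~ information from {n0:.1f} patients on the new drug")
# -> 7.5, i.e. roughly 8 patients, matching the crossover in the post

# About the POPULATION of drug effects (tau = 0, gamma = 16), the prior
# admits no uncertainty at all: zero variance on tau and gamma means an
# infinite equivalent number of other drugs observed -- the problem.
```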

Stephen:

Does our hierarchical prior distribution in our toxicology model here work for you?

The German word is “jein” (yes and no). To my inexpert eye this looks like a good approach (I would have to run it past Nick Holford, to whom I defer in all matters PK, to get a subject-matter opinion). Based on a quick look, however, it does not quite deal with the problem. You have hyper-parameters based on the literature. This means, I think, that given more or better literature you would simply change them. That being so, my interpretation is that your inferences are (very weakly, no doubt) conditional on accepting these as ‘true’ parameters. However, if you would be prepared to change them they are not certain, and just changing them is not Bayesian, so you need another level of the hierarchy.

However, that does not mean that the analysis is not good, clever and useful.

Stephen: Can you say a little more as how general you think this problem is for Bayesians? It would seem common. You say the move to merely change them is “not Bayesian” because…. Please explain a bit more.

To be truly Bayesian implies the following (I think). There is only Bayes’ theorem and Bayesian updating. There is nothing else. Even logical deduction can be seen as a special case with likelihoods that are zero. You can’t just replace a prior distribution; you have to update it. This means that in my example there has to be a prior on the prior, so that you can update it in the light of new information. It would go too far, however, to say that this is a problem for Bayesian practice: you could always make the defence that it works to an order of magnitude, and that a frequentist complaining about this would be a case of the pot calling the kettle black. It may be a problem for Bayesian theory, however.
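As a sketch of what “a prior on the prior” could look like (my illustration; the hyperprior variance and the meta-analysis numbers below are entirely invented), one could give tau its own Normal hyperprior and then update it conjugately when something like the collaborative project reports, rather than replacing it by fiat:

```python
# Sketch: instead of fixing tau = 0, give tau a Normal hyperprior so that
# meta-information about the population of drug effects can update it by
# Bayes' theorem. All numbers here are illustrative, not from the post.

def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate Normal update: return posterior (mean, variance)."""
    w = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + w * (obs - prior_mean)
    post_var = prior_var * obs_var / (prior_var + obs_var)
    return post_mean, post_var

# Hypothetical hyperprior on tau, the mean of the population of effects:
tau_mean, tau_var = 0.0, 9.0

# Suppose the collaborative project reports an estimate of tau of 1.0 mmHg
# with sampling variance 0.25 (hypothetical numbers):
tau_mean, tau_var = normal_update(tau_mean, tau_var, 1.0, 0.25)
print(f"updated hyperprior on tau: mean {tau_mean:.3f}, var {tau_var:.3f}")

# The effective prior for a new drug's effect mu now has mean tau_mean and
# variance gamma + tau_var: it changed by updating, not by replacement.
```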

On a related point, I have been enjoying reading Daniel Kahneman’s recent book *Thinking, Fast and Slow*, and I am struck by the extent to which psychologists have shown that practical inference is very different from theory.

Stephen. But it seems the Bayesians are dropping Bayesian updating in droves. In this connection, I’d be glad to hear what you think of my recent “U-Phil” post:

http://errorstatistics.com/2012/04/15/3376/

(Williamson responded in http://errorstatistics.com/2012/04/23/u-phil-john-williamson-deconstructing-dutch-books/)

I’m planning to write a “page” of where the contemporary positions seem to be.

Stephen: Isn’t this just another instance of the fact that models are models and not the truth (in the Bayesian case this means: the truth about the “degree of uncertainty existing in a certain situation”)? Of course it violates Bayesian principles to replace the prior, but one could still say: “I got it wrong at first, and using the new information I hope to do a better job.” (As an excuse, it may be too complicated to design a prior that accommodates all kinds of new information coming in, and not just the data one is about to collect, so getting it right before some unexpected and relevant news arrives is not really an option.)

I don’t think that this implies that the principles should be abandoned. People should just know that the principles are guidelines not truths, and they should be used with caution. The principles may still enable you to do a good job with the data and the new better prior (despite the fact that this won’t be perfect either).

To some extent this could be seen as the Bayesian version of a misspecification test (taking into account that Bayesian probability is about epistemic uncertainty and not about modelling the data). If information comes in that changes my mind in ways not accommodated by the prior, I reject the prior (as a model for my state of mind) and choose a new one that takes this information into account.

Mayo may not like this parallel but she may appreciate it a little bit more if I say that frequentist misspecification tests are much better understood and therefore probably more reliable than this.

If what they’re doing (in changing priors) fits under the error statistical umbrella fine, but it is misleading to declare it’s all within Bayesian updating, as I see it. We’ve always advertised a hodge-podge of methods for planning and asking a series of local (prospective) questions that nevertheless fit together in an inquiry (yielding more reliable from less reliable bits*), thanks to learning from error and capitalizing on knowledge of error, and error control. The Bayesian philosophy has tended to advertise itself as a grand sum-up of inference, rationality, belief change or the like. That was one of the key things that drove N & P up a wall (one could likely trace out a political story here, but never mind). *But Fisher and Pearson are the ones who got this point.

Christian, I agree with much of what you say, and I think you will see that I have been quite careful in the various criticisms I have made to say that I don’t think they amount to implying (still less proving) that Bayesian analyses are necessarily bad. However, there are two points I consider important. 1. There is an extreme form of Bayesianism that claims to be a theory of everything. I think it is legitimate to see ‘paradoxes’ of this sort as casting doubt on the validity of that assertion. 2. More intriguingly, it raises the issue as to what is going on. What exactly is happening when people tear up their models and start again, and which theory of statistics justifies this?

two mostly unrelated comments…

I think checking whether the model you used is a marginal of a reasonable model over a larger number of observations is a potentially powerful way to understand, by thought alone, whether your model is appropriate. This logic is in fact used in the de Finetti representation, which only models exchangeable sequences that can be marginals of probability specifications over arbitrarily large exchangeable sequences. Apparently in this example a hierarchical model is needed to deal with a more complex partial exchangeability relationship. I guess this form of model checking by thought is also available to non-Bayesians in a limited form.

I think the “conditioning is always optimal” mantra is appropriate for engineers and computer scientists when they build intelligent systems, from robots to communication channels… but I don’t think it is always helpful to scientists… for the reason that I don’t think a probabilistic specification over the whole world that they care about is reasonable… approximate, tentative specification of fully or partially specified probabilities may be useful… If a scientist wants to change the specification, I think they should just do it, without trying to find a way to justify it with Bayes’ rule, for the reason that Hennig said.