Today is George Barnard’s birthday. In honor of this, I have typed in an exchange between Barnard, Savage, and others on an important issue we’d never gotten around to discussing explicitly: likelihood vs. probability. Please share your thoughts.
The exchange is from pp. 79-84 of (what I call) “The Savage Forum” (Savage, 1962).[i]
♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠
BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.
SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.
Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …
On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.
Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.
BARNARD: Professor Savage says in effect, ‘add at the bottom of list H1, H2,…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’? We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.
LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.
BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.
LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.
BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.
LINDLEY: I do not care what it is as long as it is not one.
BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and another in another, because they represent conditional probabilities leaving aside possibly different events.
LINDLEY: All probabilities are conditional.
BARNARD: I agree.
LINDLEY: If there are only conditional ones, what is the point at issue?
PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.
BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.
LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.
BARNARD: Only if you knew that the condition was true, but you do not.
GOOD: Make a conditional bet.
BARNARD: You can make a conditional bet, but that is not what we are aiming at.
WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.
BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H1 against H2, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H1 as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.
You can read the rest of pages 78-103 of the Savage Forum here.
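[A rough way to put the disputed normalization in symbols (my gloss, not from the Forum itself): suppose the hypotheses actually entertained, $H_1,\dots,H_k$, share prior probability $1-\varepsilon$, and the catchall $H_c$ (‘something else’) gets $\varepsilon$. Then Bayes’s theorem gives

$$P(H_i \mid x) \;=\; \frac{P(x \mid H_i)\,P(H_i)}{\sum_{j=1}^{k} P(x \mid H_j)\,P(H_j) \;+\; P(x \mid H_c)\,\varepsilon}.$$

Barnard’s complaint is that $P(x \mid H_c)$, the probability of the data given ‘something else’, is not well defined, so the denominator, and with it the advertised normalization, is not well defined either.]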
HAPPY BIRTHDAY GEORGE!
References
[i] Savage, L. J., et al. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
*A few other Barnard links on this blog:
Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example
Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
Mayo, Barnard, Background Information/Intentions https://errorstatistics.com/2012/09/19/barnard-background-infointentions/
Links to a scan of the entire Savage forum may be found at: https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/
Pearson’s point is surely the key. There seems to be a common view ‘in the wild’ that it is rational to ‘forget about’ the possibility of ‘something else’. This is clearly efficient, but – as in the run-up to the financial crisis – is sometimes unsafe.
Dave: Yes, and I’m glad I included it; I nearly stopped typing before that….
David Marsay: You stole my comment. 😦
In my blog I argue that pragmatism is sometimes unsafe, and here is an example. Perhaps we should always ‘expect’ to be disproven in the long run, so that a priori probabilities only make sense in the short run, bearing in mind that the short run can sometimes be shorter than usual.
Dave: I thought they were supposed to get wiped out in the long run. I took Barnard’s point to be that they’re problematic right now, because you can manipulate them to wind up with all different answers depending on your handling of the Bayesian catchall factor. Granted, pure likelihoodism can only get around this by being limited to comparative claims, not necessarily even exhaustive ones.
I take Barnard to support the view that in practice the Bayesian approach is never genuinely exhaustive, and that its main drawback is that it is too often treated as if it is.
Thus in 2006 people were saying that the probability of a crash was zero, when what they should have said was that they couldn’t conceive of any way that a crash could occur. It seems to me that the likelihood approach is philosophically more honest, as you only ever get ratios between hypotheses that you have actually considered. But one can also use a notation like Jack Good’s P(A|B:C) to acknowledge that the probability is dependent on something, and not absolute.
The other issues on your blog matter too, but I think often the main thing is to understand what is properly meant by a ‘probability’.
The situation is actually worse than what was said. If we’re talking about any specific parametric model (let’s say for continuous data, to make it more obvious), one could say that the subjective probability that the model holds exactly is zero, so whatever posterior probability is conditional on the parametric model is conditional on an event with zero probability and can therefore be anything.
Laurie Davies also made the point that if it is enough that such a model holds “approximately”, prior probabilities should actually not sum or integrate to one but to something larger, because if N(0,1) is a good approximation, N(10^{-9},1) will be a good approximation as well.
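One way to make that precise (the event notation $A_\theta$ is my own shorthand, not Davies’): let $A_\theta$ be the event that the model $N(\theta,1)$ is an adequate approximation to the data-generating process. For nearby values of $\theta$ these events largely overlap rather than exclude one another, so for a collection $\theta_1,\dots,\theta_m$ nothing forces $\sum_j P(A_{\theta_j}) \le 1$: probabilities of approximate truth need not normalize the way probabilities of exact truth do.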
The frequentists of course have problems, too, connected with not really believing that their parametric models are true, but at least they don’t need to specify a probability for any specific model being true.
(Actually, for pure de Finetti Bayesians the above problem is no worse than it is for frequentists, because for de Finetti there is no such thing as a “true model/hypothesis” anyway and it is all only about prediction of observable future events; so approximation neither means that we condition on probability-zero events nor that probabilities should add up to something larger than one; rather, it only means that predictive posterior probabilities are just approximations.)
Laurie Davies’ work seems interesting. I suppose I view likelihood as an often convenient but non-fundamental concept and his approach appears consistent with this.
Mayo: I don’t take the basic ideas of your approach to fundamentally require likelihood either* (a recent post emphasised this?), right? Though all of the examples I’ve seen use it to some degree. What happens as we transition to the informal version of the severity principle? Must ‘well-probed’ be expressed using only probability or likelihood concepts? Why not other measures?
*There appears to be a bigger difference in emphasis on ‘data representation’ and ‘hypothesis testing’ between the two approaches, although Spanos perhaps bridges this gap more, while sticking to a very parametric approach.
Omaclaren: You ask about what happens when we transition to the informal notion of severity. That topic came up today because, given my work on stat foundations, I guess some think it applies only there. But in fact you are right. I introduced SEV as an informal notion and, even in statistics, as a “metastatistical” notion. The strongest arguments from coincidence are outside of formal statistics. I often say that we appeal to statistics only when the uncertainties and/or variabilities are too great to deal with by strong, informal arguments from coincidence. The origin of SEV came from trying to distinguish when non-novel evidence and use-constructed hypotheses are not ad hoc but quite warranted. The distinction turns on severity. I assumed statistics had fairly rigorous ways to distinguish when double counting and selection effects matter—and often it does, but definitely not in general. So, for example, in the Mayo and Cox (2006/2010) paper, one of the goals was to employ the informal notion of severity to distinguish between cases where selection effects do and where they do not diminish the severity.
On likelihood, I see it as an example of a “fit” measure or a support measure. In addition to H’ being better “supported” than H” in this sense, I’d require that the procedure very probably would support H’ less well, if H’ is false. The problem is only with deeming comparative likelihoods as any kind of account of warranted evidence. For one thing, you can never compare the support for the same H by two pieces of data x and x’ (at least according to the leading likelihoodist Royall). But it’s crucial to me to be able to say that even if H gets the same likelihood, the error probability can be horrible in one case (e.g., pejorative optional stopping). In that case, H is poorly tested. Of course this violates the LP, which demands ignoring error probabilities once the data are in hand. Increasingly, default Bayesians find they cannot live within the LP because, for starters, considering the sampling distribution is crucial for model checking.
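(A minimal simulation sketch of the optional stopping point, with illustrative numbers of my own choosing: sample standard normal data under H0: mu = 0, peek at the z statistic after every observation, and stop as soon as |z| > 1.96 or at n_max = 500. The long-run probability of “rejecting” H0 this way is far above the nominal 5%, even though the stopping rule drops out of the final likelihood function.)

```python
import numpy as np

def optional_stopping_rejects(n_max=500, reps=5000, seed=0):
    """Proportion of repetitions in which 'try and try again' sampling ever crosses |z| > 1.96."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n_max)                              # data generated under H0: mu = 0
        z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))     # z statistic after each observation
        if np.any(np.abs(z) > 1.96):                            # stop (and 'reject') at the first crossing
            rejections += 1
    return rejections / reps

print(optional_stopping_rejects())   # far above the nominal 0.05 of a fixed-n test
```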
For another thing, what good is it to know that H’ renders x more probable than H”? x can be quite poor for both.
You’re likely referring to the post in which Hacking was making Barnard’s point: the likelihood ratios don’t mean the same thing in different contexts. The numbers can be 5, 50, 5000, and they don’t correspond to the same amounts of intuitive evidential strength. And that is apart from the problem raised here regarding the choice of the “catchall” hypothesis.
Mayo:
Thanks for your response. A few more thoughts and questions follow.
The way I see mathematical modellers use Bayesian methods in practice typically involves two elements.
Firstly, modellers frequently want to see some sort of ‘map’ of model fit in parameter space: where are the regions of good fit, where are the bad, etc. This can be done either via posteriors or likelihoods, but Bayesian methods are much more frequently associated with this sort of visualisation among casual users. This is to their credit, I think. ‘Frequentists’, or whatever you want to call them, are associated with point estimates and null hypothesis tests, not exploration of complex parameter spaces.
Secondly, modellers often need to impose additional constraints on their model fit measures to emphasise solution types they are looking for. This can be done via likelihood penalisation/regularisation – e.g. adding derivative or complexity penalties to the fit – or via priors. There are certainly arguments over which approach is better, more flexible, works in different situations etc, but the general concept is essentially the same: penalise some model features to remove solutions of this sort. Again this is quite natural to modellers, who typically have to look for solutions to complex equations which have particular characteristics (e.g. a ‘travelling wave solution’ to a differential equation), rather than being able to obtain general solutions. The parameter estimation or inverse problem for difficult equations is typically ill-posed and so must be stabilised by introducing additional bias.
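(To illustrate the ‘essentially the same’ claim with a toy example of my own: the familiar correspondence between an L2/ridge penalty and a Gaussian prior. Minimizing ||y − Xb||² + λ||b||² gives the same estimate as the posterior mode of b under a N(0, (σ²/λ)I) prior with Gaussian noise of variance σ². A minimal numerical check, with made-up data:)

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam = 40, 5, 1.0, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=sigma, size=n)

# Penalised least squares (ridge) estimate: argmin ||y - Xb||^2 + lam * ||b||^2
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode under b ~ N(0, (sigma^2/lam) I) and y | b ~ N(Xb, sigma^2 I)
post_cov = np.linalg.inv(X.T @ X / sigma**2 + (lam / sigma**2) * np.eye(p))
b_map = post_cov @ (X.T @ y) / sigma**2

print(np.allclose(b_ridge, b_map))   # True: the penalty and the prior give the same point estimate
```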
I think the first point might raise some questions regarding your counterfactual severity account and maybe Gelman’s comments below.
For example, there might be a number of regions in parameter space that fit well and others that fit worse. Which regions/subsets do you compare to which? I take it your account partitions hypothesis space into null H0 and alternative H1 sets. Are these comparisons supposed to be ‘local’ in parameter space – e.g. nearby values – or ‘global’ over the whole space? How would I know whether a model has a bad fit because of local competitors in parameter space, global competitors or whether they lie on a particular curve in parameter space etc? Sure I could test all possible subsets against their complements, but a picture or analysis based more directly on the topology or geometry of the fit in parameter space seems a reasonable alternative. This is how likelihoodist and Bayesian approaches are often used in practice, right?
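(A minimal sketch of the kind of ‘map’ I have in mind, using made-up data and a toy normal model with unknown mean and standard deviation: evaluate the log-likelihood on a grid over (mu, sigma) and plot the surface. Regions of good and bad fit, and any multimodality, then show up directly.)

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=30)   # made-up data

# Log-likelihood of a N(mu, sigma) model evaluated over a grid in parameter space
mu_grid = np.linspace(-1, 3, 200)
sigma_grid = np.linspace(0.5, 4, 200)
MU, SIGMA = np.meshgrid(mu_grid, sigma_grid)
loglik = stats.norm.logpdf(data[:, None, None], loc=MU, scale=SIGMA).sum(axis=0)

plt.contourf(MU, SIGMA, loglik - loglik.max(), levels=30)
plt.colorbar(label="log-likelihood (relative to maximum)")
plt.xlabel("mu")
plt.ylabel("sigma")
plt.title("Map of model fit in parameter space")
plt.show()
```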
Would you disagree that lumping competitors into H0 and H1 seems to miss these features? And doesn’t observing e.g. a bimodal posterior tell you something possibly interesting? I suppose your account is intended to be piecemeal – which is an emphasis I like in a general sense – but couldn’t this quickly raise combinatorial issues for your approach if applied naively to problems with large/complex parameter spaces?
I ask these questions having general sympathies for your philosophical account but competing instincts from a modelling point of view.
O: No time to take up all of these issues, but Barnard’s point is the quite true one that the Bayesian requires an exhaustive setting out of hypotheses, plus priors on them. The frequentist error statistical account does not. That is why people like George Box, who are prepared to be Bayesian ONCE EVERYTHING IS IN PLACE, insist that we can’t possibly make that be the whole of science, because it is essential for science to have a space for new ideas. For the same reason Box denies Bayesian methods can be used to test assumptions.
The leading person in exploratory data analysis is Tukey–a frequentist. Why don’t you read the reasons he gives for rejecting the Bayesian approach for EDA? Bottom line: when it comes to exploration, testing assumptions, and model building, as opposed to neat and tidy Bayesian updating (assuming one even wanted to do that), the more appropriate methods are frequentist. It is precisely because actually getting the pieces in place for the Bayesian account is so arduous that even staunch Bayesians like Jaynes opt for default priors. But there’s scarce agreement as to which to use or what they mean.
OK, fair enough, I realise I took this on a bit of a tangent. I suppose it’s that I don’t think anyone here, including Gelman, buys the ‘catchall’ defense of Bayesian philosophy. I think your/Barnard’s/etc. argument is a good one for why Bayesian updating is (at least) not sufficient; however, Gelman (and others) have developed different perspectives that are not really subject to these concerns in the same way. Your conversation below seems to suggest that you two are talking past each other a bit.
Part of my speculations concern why even ‘non-Bayesian’ folk (e.g. Matloff, Wasserman in the blogosphere, and many others in the real world from what I’ve seen) don’t like a ‘testing-based’ account of their approach. It doesn’t seem to capture what they do in practice. Also, the widespread use of regularisation-type methods and their relation to priors is interesting. Do these have any philosophical import or are they just practical tools?
But yes, sorry, this is all orthogonal to the ‘catchall’ issue.
They don’t like stupid, unthinking misuses of “tests” of the sort that have been lampooned for ages. But the logic of stat sig tests is importantly correct–even if not so many people seem to get it these days. I don’t think it’s consistent to favor CIs and reject tests, plus you need to supplement CIs with testing considerations to avoid blatant fallacies. I think many of the folks you read are committing “Dale Carnegie fallacies” left and right, and, finally, any inference can be seen in terms of a test, and I favor that locution because something must have been done to rule out errors. It boils down to the goal of statistical inference. In my view, that goal is finding things out in a self-correcting manner, not reporting comparative measures of support, belief, or what have you. In fact, it’s mandatory, as I see it, to be able to distinguish a plausible or even a true hypothesis from a well-tested one: true but badly tested has to be an explicit standpoint. I think you’re going to have to delve more deeply into these matters, and with an independent, unintimidated mindset. Truly understanding the methods sets one free from howlers. Else, wait for the book! Sorry to be quick.
Readers can search this blog for these topics.
Honestly, I don’t think this is a fair and accurate account of what various people are saying about how they do statistics and why. Sure there are many simplistic fallacies around but there are many interesting new thoughts too.
I do look forward to reading your book though.
omaclaren: I wrote my comment on your remark quite quickly, sorry, and I know your remark had a lot more in it that I couldn’t take up. I think the frequentist modelers do quite elaborate modeling (my colleague Aris Spanos, Larry Wasserman, David Hendry) which is distinct from testing hypotheses, even though all these approaches are built on lots and lots and lots of tests (e.g., to figure out variables, misspecifications, etc.). From the RMM issue:
Click to access Article_Hendry.pdf
Click to access Article_Wasserman.pdf
Click to access Article_Spanos.pdf
So I think you’re mixing in lots of different things. Model specification differs from model selection differs from formal testing differs from exploratory data analysis, etc. And the frequentists certainly use Bayes’s theorem so long as there’s conditioning, and there are both legit frequentist priors and approximate priors that can serve to estimate parameters. It’s silly to spoze some tools are off limits if they promote the overall learning goal. Testing model assumptions can be graphical, but there’s a corresponding significance test argument. By the way, just because an account has penalties does not mean that account can warrant decent error probabilities in the least. Getting a good “fit” doesn’t mean you’ve captured systematic regularities for prediction. Nor does it show they avoid misspecifications. My work is on statistical inference, admittedly, not modeling in its own right. Here are some items; links are on my publication page on the left:
Mayo, D. G. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics (Handbook of Philosophy of Science, Volume 7).
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.
omaclaren: I am very curious about your comments
“see some sort of ‘map’ of model fit in parameter space”
“Bayesian methods are much more frequently associated with this sort of visualisation ” and “Gelman’s comments … there might be a number of regions in parameter space that fit well”
My sense is few Bayesians plot anything in the parameter space (unless there is just trivially a single parameter and the Bayesian triplot is given) and Gelman seems to have a strong preference for plotting in the data space (comparing aspects of the predicted data with the actual data) though there is an example in the parameter space by him on page 395 of Data Analysis Using Regression and Multilevel/Hierarchical Models.
I have been arguing for more plotting in the parameter space, for instance trying to discern the separate impacts of the prior and of individual observations or groups of observations, but it’s a hard sell. http://andrewgelman.com/2011/05/14/missed_friday_t/
Do you really mean a ‘map’ of model fit in parameter space rather than data space and, if so, do you have any examples?
Thanks
For a detailed look at the relationship between Tukey, Jaynes and EDA see here:
Click to access cchirp.pdf
Jaynes shows that the periodogram (an intuitive device advocated by Tukey for EDA in this problem) is an approximation to a very specialized Bayesian solution. By carrying out the full and general Bayesian solution, Jaynes and his students were able to revolutionize Nuclear Magnetic Resonance Imaging (the improvement over previous frequentist methods was so great it was initially thought to be a hoax before becoming the standard).
Jaynes admits that the periodogram, being simple and easy to implement, is usable and handy for EDA. However, in this case you can use the Bayesian methods in the paper to derive a vastly improved periodogram which is usable in more situations and is no harder to implement or use for EDA (this is not in the paper, but is a trivial exercise if you read the paper). I’ve been using this improved periodogram for 15 years with great success.
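(For readers who haven’t met it: a minimal sketch of the standard periodogram itself, via the FFT. The Bayesian refinement described above is not reproduced here, and the sinusoid frequency and noise level below are arbitrary illustrative choices.)

```python
import numpy as np

def periodogram(x, dt=1.0):
    """Standard (Schuster) periodogram of an evenly sampled real signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()                            # remove the mean so the zero-frequency term doesn't dominate
    power = np.abs(np.fft.rfft(x)) ** 2 / n     # |sum_t x_t exp(-2*pi*i*f*t)|^2 / n
    freqs = np.fft.rfftfreq(n, d=dt)            # frequencies from 0 up to the Nyquist frequency
    return freqs, power

# Usage: a noisy sinusoid at 0.1 cycles/sample should give a peak near f = 0.1
rng = np.random.default_rng(0)
t = np.arange(512)
x = np.sin(2 * np.pi * 0.1 * t) + rng.normal(scale=0.5, size=t.size)
freqs, power = periodogram(x)
print("peak frequency:", freqs[np.argmax(power)])
```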
Finally, Jaynes never mentions “default priors” in any of his work. He’s the opposite of a “default prior” advocate and spent most of his time finding methods to get priors which accurately reflect given states of information.
In this paper, he gives strong and objective reasons for the prior chosen. The reasons are given at the bottom of page 6 to the top of page 7 and in appendix B. The reasoning is more realistic (it assumes information known in practice) and significantly more general/applicable than what’s used by the typical statistician to justify their distributions.
All you need to do to understand it is to come to grips with the notion that probability distributions aren’t frequency distributions.
Setting ‘setting the record straight’ straight. Just on the point of “default”: maxent and other such “non-subjective” (noncommittal?) priors are generally put under the default or reference heading; I don’t care what umbrella term is used. Granted, there’s little if any agreement as to which of these to use or why.
I’m posting the e-mail exchange I had with Andrew Gelman on this post last night–of course it’s in reverse order:
On Sep 24, 2014, at 4:28 PM, Andrew Gelman wrote:
I get a posterior but it is conditional on the model.
——
On Sep 24, 2014, at 4:05 AM, Deborah Mayo wrote:
If not, then you’re not doing what Bayesians typically do: get posteriors. Mayo
——
On Sep 23, 2014, at 11:53 PM, Andrew Gelman wrote:
I haven’t read in detail; to me, the Bayesian inference is conditional on the model. I respect that some calculations can be made using the catchall idea, and I respect that, especially in decision analysis, the catchall could be a useful concept, but in general I don’t like discrete model choice or discrete model averaging of any sort. So I am not particularly interested in any framing regarding “enumerating hypotheses.” This does not look like the science or engineering that I do, nor does it look like the applied statistics that I do. To me, it’s an idealized problem, and idealized problems can be interesting but we must take care not to take them too seriously.
In your thread, Hennig writes, “The frequentists of course have problems, too, connected with not really believing that their parametric modes are true, but at least they don’t need to specify a probability for any specific model being true.” I don’t specify a probability for any specific model being true either!
——
On Sep 23, 2014, at 11:24 PM, Deborah Mayo wrote:
I take it that this is Barnard’s point, if I understand you (the catchall being all the other possible hypotheses that we haven’t thought of). And yet the catchall is needed to get the probs to sum to one. So maybe you are agreeing with Barnard on this one point at least?
——
On Sep 23, 2014, at 11:10 PM, Andrew Gelman wrote:
I don’t think catchall works because there’s no real way to specify the data model for the catchall.
——-
On Sep 23, 2014, at 10:55 PM, Deborah Mayo wrote:
Not sure what you mean by “works”. Bottom line is that a prob has to be exhaustive, and we usually only have H1, H2, etc., and thus need to posit “and everything else”–the catchall. (Everyone says only leave a little prob for the catchall, but how little?) Depending on how you cut up your hypothesis space, you get different answers for the prior. When a new hyp enters, something that used to have 0 prob has to now get a prob. Barnard’s point is that they (the Bayesians) haven’t done so well in getting a unified quantitative number that means the same thing in all cases. That has been a big criticism of likelihoodists, so Barnard was saying that they haven’t escaped the problem.
——–
On Sep 23, 2014, at 9:50 PM, Andrew Gelman wrote:
I don’t think the catchall factor really works. To me the catchall is an example of Savage saying something that sounds sage but which doesn’t really work. Regarding likelihood and prior, I will repeat what I always say which is that in practice all parts of the model are approximations, and the importance of the approximations depends on context. In some contexts, the prior is not really important and a likelihood-based analysis is fine; in other contexts, a lot of prior information is available and a likelihood-based analysis is both too weak (in ignoring the prior information) and too strong (in being sensitive to particular aspects of the assumed likelihood model).
——
On Sep 23, 2014, at 9:47 PM, Deborah Mayo wrote:
Andrew: Well this was about Barnard and the problem of assigning priors in the face of the catchall factor. It related to Barnard’s preference for sticking with likelihoods.
On Sep 23, 2014, at 9:45 PM, Andrew Gelman wrote:
I’ve never found Savage’s writing or philosophy to be at all compelling. So many people were impressed with him that I assume he must have been very brilliant in person but, to me, his ideas did not age well.
I think Andrew might feel he is not that worried about one day going from having 0 prob on something to non-zero prob: we (all statisticians?) do this all the time when we change our models, or do anything resembling model selection.
But is it a move authorized by Bayesian updating, or something outside that framework?
Oh, outside, for sure. It is difficult to justify living as a “pure” Bayesian, but I don’t think it means that one cannot identify as “Bayesian” in nature.
Click to access degroot_a-conversation-with-george-a-barnard.pdf
Here’s an interview of Barnard by DeGroot. Barnard uses the term “experimental probability” for the use of probability in science. So I think I may use that term, which would fit with experimental knowledge.
Excellent interview filled with surprising, important insights.
Very readable and worth reading. In most cases where I have seen the Bayesian approach being misused it is where an analyst has assumed a uniform prior or some simplistic set of possibilities without engaging adequately with those with qualitative experience with the subject domain. Senior figures used to ask searching questions about assumptions, but the Bayesian approach – in some quarters – has become a dogma in which certain assumptions are said to be ‘scientific’ and those who challenge them are made to look foolish or otherwise side-lined.
In this sense, I think – contra Barnard – that the ‘meaning’ of an unconditional probability is something that people now need to understand better. But maybe it is enough to emphasise, as Barnard does, the conditionality?
Andrew Gelman: Not sure you’re reading this. Anyway, I realised already that your use and interpretation of priors is much more flexible than the way priors are explained in the standard literature. However, I’m still at times confused about what the probabilities encoded in the prior actually mean in your account.
Are you saying that it still is belief in the true parameter conditional on the model being approximately true? Or is it rather, as de Finetti has it, just a technical tool in order to have a nice formalisation of predictive distributions for observable future events?
I’d think that it is often something else, and I haven’t yet seen a convincing account of what it is (I remember that you referred me to your book at some point when I raised a similar concern, but what I found there wasn’t totally clear to me).
I also think that in order to give a convincing account of it, one needs, to some extent, to move away from the idea that this is about truth and belief, and accept that it is rather some kind of tuning that may have more to do with the aims of data analysis than with some kind of underlying truth (be it even true prediction accuracy).
@christian From his writings he seems to implicitly have in mind a calibrated model as an ideal. Whenever he says such-and-such is a “bad” prior, the rationale seems to be that the posterior predictive probabilities are inconsistent with data by a frequentist definition of probability.
However, I think he’s a bit loose with this criterion, in that even if you don’t do a detailed analysis of model calibration it can still be useful as a rough approximation. For example, with beta-binomial examples, he’ll often just throw in a uniform (0,1) prior without making a strong claim as to the coverage properties of the posterior distribution.
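(A minimal sketch of the kind of rough calibration check I mean, with made-up settings: simulate binomial counts at a known p, form the Beta(1 + y, 1 + n - y) posterior implied by the uniform prior, and record how often the central 95% credible interval covers the true p.)

```python
import numpy as np
from scipy import stats

def coverage_uniform_prior(p_true, n=50, reps=20000, level=0.95, seed=0):
    """Frequentist coverage of the central credible interval from a Beta(1, 1) prior."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(n, p_true, size=reps)     # simulated binomial counts
    a, b = 1 + y, 1 + n - y                    # posterior is Beta(1 + y, 1 + n - y)
    lo = stats.beta.ppf((1 - level) / 2, a, b)
    hi = stats.beta.ppf(1 - (1 - level) / 2, a, b)
    return np.mean((lo <= p_true) & (p_true <= hi))

for p in (0.05, 0.3, 0.5):
    print(p, coverage_uniform_prior(p))        # how close is coverage to the nominal 95%?
```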