My post “What’s wrong with taking (1 – β)/α as a likelihood ratio comparing H0 and H1?” gave rise to a set of comments that were mostly off topic but interesting in their own right. Since the thread had grown too long to follow, I have put what appears to be the last group of comments here, starting with Matloff’s query. Please feel free to continue the discussion here; we may want to come back to the topic. Feb 17: Please note one additional voice at the end. (Check back to that post if you want to see the history.)

I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.

Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c, ∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in which the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method — that the effect of the prior washes out as the sample size n goes to infinity?
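Matloff’s scenario is easy to simulate (a sketch with illustrative numbers, not anyone’s actual analysis: cutoff c = 3, true μ = 2, normal data, and a flat prior truncated to (c, ∞)):

```python
import numpy as np

rng = np.random.default_rng(0)
c, true_mu, sigma, n = 3.0, 2.0, 1.0, 10_000   # true mu lies BELOW the cutoff c
xbar = rng.normal(true_mu, sigma, size=n).mean()

# Grid posterior under a flat prior truncated to (c, infinity)
grid = np.linspace(0.0, 8.0, 4001)
prior = np.where(grid > c, 1.0, 0.0)            # zero mass at or below c
loglik = -0.5 * n * (grid - xbar) ** 2 / sigma**2
masked = np.where(prior > 0, loglik, -np.inf)   # prior zeros stay zero
post = np.exp(masked - masked.max())
post /= post.sum()

print(post[grid <= c].sum())   # 0.0: no amount of data moves mass below c
print(grid[np.argmax(post)])   # posterior mode pinned just above c, far from true mu
```

However large n grows, the posterior support remains inside (c, ∞), so the “prior washes out” slogan fails here: the posterior piles up against the cutoff instead of converging to the true μ.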

(to Matloff),

The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right.

The longer answer goes like this. Statisticians can get inferences and their associated uncertainties from probability distributions. If those inferences are true to within those uncertainties, we say the distribution is ‘good’. Statisticians typically do this with posteriors, good posteriors being those that give us interval estimates that jibe with reality. Obviously, though, it can be done for any distribution, no matter what its type or purpose.

Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.

Given the prior with support on (c, infty) we’d infer that “the true mu is greater than c”. If the true mu is less than c, then the prior is ‘bad’ and shouldn’t be used. Using it is equivalent to making a false assumption, no different from “assume 3 is an even number”.

**Alan**

The moral of the story, Matloff, is that your prior should only say “mu is greater than c” if your prior information guarantees it. If the prior information about mu isn’t strong enough to guarantee it with certainty, you should choose a prior which reflects that and has larger support than (c, infty).

**rasmusab**

Well, using a (c,∞) prior makes a model that “considers” values less than c impossible, and is useful when you don’t have the time or the need to come up with something more nuanced. But if it seems that the (c,∞) prior is not doing a good job (or if you learn new information), there is nothing stopping you from changing the prior (as you can change other assumptions in the model). So you could say, “All priors are false, but some are useful”.

Of course, if you want to you can put some other prior on μ where you reserve a tiny bit of probability on μ less than 3 and in that model you would have the property that “the effect of the prior washes out as the sample size n goes to infinity”.

Thanks for the thoughtful comments, Alan and rasmusab. But I think you agree, then, with my point: One of the most famous defenses offered by Bayesians for their methods — that the influence of the prior gradually washes out (“Our answers won’t be much different from those of the frequentists”) — fails in a broad category of situations. The Bayesian philosophy is not quite as advertised.

The other point I’d make in response to your comments (which I’ve mentioned before here and in Andrew Gelman’s blog) is that frequentist methods are robust to bad assumptions, in the sense that one can verify the assumptions via the data (if you have enough of it). By contrast, one can’t do that for a (subjective) prior, by definition, because one is working with only one realization of the parameter θ.

**Alan**

Matloff, I’ve never heard anyone claim that if a prior assigns zero probability to the true value of mu, the posterior will settle on the true mu given enough data. Since elementary algebra shows the support of the posterior is a subset of the support of the prior, the claim is trivially false, and I doubt anyone ever said it was true.

John Byrd, there is no *“validated by estimating error probabilities that will result from applications of it”* being done. The prior and posterior describe an uncertainty range for a single mu. There are no frequencies to calibrate to. Separately, if x_i = mu + e_i and the measuring instrument gives errors ~N(0,10) as in the post, it’s possible to get a CI entirely below the cutoff. This will happen some small percentage of the time randomly. If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set (the intersection of the CI and the interval greater than the cutoff). Is that answer acceptable to you? Mayo seems to indicate it is, and that I’m “stamping [my] feet” over it.

Mayo, for P(mu|A) to do its job it has to faithfully reflect what A says about mu. If it doesn’t, the distribution is “wrong”. If A says “it’s possible mu is less than c” but P(mu|A) says “mu must be bigger than c”, then the distribution is bad. P(mu|A) is contradicting what A has to say about mu. That’s the philosophical origin of the ‘test’, and it in no way requires some extra Bayesian ingredient.

Even if it did, in what sense could this secretly be “Error Statistical” when it involves assigning probabilities to hypotheses and uses distributions which aren’t frequency distributions in any way? (This is not a rhetorical question. If everything else is ignored, please answer this one.)

**john byrd**

From Alan: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.”

I understand that a Bayesian model — like any model — can be validated by estimating error probabilities that will result from applications of it. That is a good thing and a saving grace. But consider this need for validation in the context of the toy example of the couch measurement, and it becomes very clear why Mark’s answer was correct, and why my suggestion to stick to the CI, because a laser transit has its own error, makes practical sense for scientists trying to solve problems. If you get a CI with the most likely values of mu below 3, you will likely end up having to revise your prior following attempts to validate…

It seems very improbable to me that you can follow the protocol of validating a Bayesian model against real data and end up sharply divergent from the CI in a case like this. If you gain an advantage through validation in that you obtain more data, then the CI can also be narrowed with the additional data. Two paths to the same end point?

**rasmusab**

I actually believe most people that do Bayesian data analysis (those you call Bayesians) actually use convenience priors (such as default priors, or reference priors). And I think that’s fine, as long as you know that you are using a convenience prior. Just like most people use convenience models (like linear regression), it’s quick and easy and hopefully works ok most of the time.

It’s a different case if you were to choose a convenience prior and then stick to it whatever happens. That would be like sticking to linear regression without ever questioning the model assumptions. And that would be questionable.

A useful way of thinking about priors is just as “part of the model”. Just like the assumption of linearity is part of the model, and has to be justified, the priors are also part of the model and have to be justified. But sometimes you use linear regression because you have no better option, and sometimes you use convenience priors because you haven’t figured out something better.

What I meant with the Rubin/Jaynes approach was a very pragmatic approach to Bayesian data analysis, like the one described here, for example: http://projecteuclid.org/euclid.aos/1176346785

**Matloff:**

I’m replying to rasmusab, who had replied to me.

You and I of course agree on the conditions under which the Bayesians’ famous “the prior eventually washes out” claim fails. But my point was that the Bayesians don’t put an asterisk on that famous slogan, which is why I said the Bayesian approach is not quite as advertised. That’s a really big deal to me.

And more importantly, we’re not talking about some rare case here. On the contrary, the excellent book *Bayesian Ideas and Data Analysis*, one of whose authors is my former colleague Wes Johnson (a really smart guy and a leading Bayesian), is chock full of examples of priors that assume bound(s) on θ.

The examples in that book — and in every other book I know of on the Bayesian method — show that many, indeed most, Bayesians set up priors exactly in the way you believe the vast majority don’t: their priors are chosen, as you say, because “it feels right.” Of course, they also often choose “convenient” priors because they lead to nice posterior distributions, making the priors even more questionable.

I’m not familiar with the Rubin/Jaynes approach. A quick Web search seems to indicate it is aimed at performing “what if?” analyses. I have no problem with that at all (provided, as always, that the ultimate consumers of the analyses are aware of the nature of what was done).

**john byrd**

Alan: It appears that you employ circular reasoning. The prior is to be corrected through “experience”, unless it is to be taken as a certainty before application? That makes no sense. This is what I call the self-licking ice cream cone approach to Bayesian philosophy: establish a prior, take it as meaningful, sell it to others unless the model does not work. If the model performs poorly, change the prior, call it prior information anyway, then repeat the process.

You say: “If we know from other evidence that mu is guaranteed to be greater than the cutoff, then ‘truncation’ will imply the true mu is in the empty set.” So, you say we must accept the prior as more important than the data. And also: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.” It appears the latter approach, testing to correct the prior, is the more reasonable. The latter approach would correct the prior to avoid the empty set.

So, you are faced with a scenario where IF you are willing to allow that your prior is subject to revision when faced with reality, then your Bayesian model will gravitate to the CI solution. Or, you can simply not test it. But then it becomes religion not science.

And, it appears to me that validating a model by comparing its predictions to reality, to measure its performance, is precisely seeking to minimize error probabilities. Seems obvious to me. I am puzzled that you do not think so.

John: You bring out a good point: they have to assume something like the single mu that is responsible for the current data itself having been randomly selected from a population of mus. That’s a sample of size 1. We wouldn’t reject a statistical hypothesis on the basis of a sample of size one. So, it’s not clear they can be seen as getting error probabilities, which require a sampling distribution. We’re never just interested in fitting this case, the error probabilities are used to assess the overall capacity of the method to have resulted in erroneous interpretations of data.

And of course, there’s the problem of distinguishing between violated assumptions, like iid, and a violated prior. I note this in my remarks on Gelman and Shalizi’s paper.

But rasmusab, you are ignoring the key point: One can use the data to assess the propriety of frequentist models, such as linearity of a regression function, but one can NOT do that in the (subjective) Bayesian case. In Bayesian settings, since one has only a single realization of θ at hand, one can’t estimate the distribution of θ to verify the prior.

All this changes in the empirical Bayes case. Then there is a real distribution for θ, and one’s model for that distribution can be verified as in any other frequentist method — because it IS a frequentist method. For instance, one can use Fisher’s linear discriminant analysis (or, for that matter, logistic regression) without raising an eyebrow, even though it is an empirical Bayes method.

I skimmed through the first few pages of the Rubin paper (thanks for the interesting link), and immediately noticed that his very first example, on law school grades, uses an empirical Bayes approach, not a subjective one, which makes it frequentist.

Feb 17 addition:

I had grabbed the last handful of comments (excluding most of mine) but didn’t mean to exclude anyone who made remarks on the new topic (of truncation), so here was Mark’s remark to Alan’s initial concern about truncation:

Alan, let me get this straight. Your example involves a case where there’s a hard physical constraint on the mean being greater than 3, but no such physical constraint on individual observations? The only possible way to get a CI that lies almost entirely below the cutoff is to have the vast majority of values lying below the cutoff. What’s a Bayesian to do in this case, stamp his feet and say “no, no, no, the mean must be constrained to be greater than 3, so I’ll put the vast majority of my weight on my prior” (that is, acknowledge that the data are noisy and so essentially throw them out)? I’d love to see a Bayesian analysis where a) there is a physical constraint on the mean being greater than 3, b) almost all of the data are sufficiently lower than the cutoff *such that the standard frequentist CI was almost entirely below the cutoff*, and c) the final inference was not based almost exclusively on the prior. If your answer is that your final inference in this case would be essentially the prior, then I frankly don’t see anything less absurd in your approach than claiming that (3, 3.00001) is a reasonable CI. It’s the same argument, as far as I’m concerned; they’re equally concocted.

Now, if there truly is a physical cutoff, such that both the mean and realized values are required to be above this cutoff, then there is a very simple frequentist approach to incorporate this background information. Do a transformation like log(X-3). No need to truncate, your entire CI will be in the required range.
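Mark’s transformation trick can be sketched in a few lines (hypothetical measurements; a large-sample z interval is used for simplicity, and the back-transformed interval is for a typical value on the original scale rather than the mean):

```python
import math

# Hypothetical couch-style measurements, all above the physical cutoff c = 3
x = [3.8, 4.1, 3.5, 4.6, 3.9, 4.2, 3.7, 4.0]
c = 3.0

y = [math.log(v - c) for v in x]        # work on the transformed scale log(X - 3)
n = len(y)
ybar = sum(y) / n
s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
half = 1.96 * s / math.sqrt(n)          # large-sample z interval, for illustration

# Back-transform: both endpoints are necessarily above the cutoff
lo, hi = c + math.exp(ybar - half), c + math.exp(ybar + half)
print(lo, hi)
```

Since exp(·) is always positive, the back-transformed interval lies entirely above 3 by construction, with no truncation step needed.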

Whoa! Someone said I was right, and I missed it? Thanks John Byrd!

But I also said you were right.

Yes you did, thank you. But, as you said, it’s pretty hard to keep up with 84 comments. But, of course, I didn’t make the cut onto this continued post… ah, well.

It was solely in terms of time, and your smart response was early, and dealt with earlier remarks. If I included all, it would be just as unmanageable, and if I selected those I liked, well….

Write me a guest post any time.

Mark: I entered your comment, realizing it was on the new topic, and I believe everyone else touching on it is included. (I had been only looking at time of comment before.)

I also notice that John and I were making largely the same point, regarding the impossibility of assessing validity of subjective priors.

And you and I were also in sync on the point that there’s only one mu. I really worry when people like rasmusab say we can always change our priors just as we change the model. Well, first of all, we don’t just swap a model the data don’t fit for one with which the data do fit. There are lots of models one could tailor to fit the data, and it’s a highly unreliable method to just “fix” a model to fit your data. My point is that I suspect some of this cavalier attitude toward changing the prior reflects a very worrisome, lax attitude toward respecifying a statistical model in the face of misfits. What’s to stop them from changing the prior to avoid unwanted misfits with a model? The whole idea that the prior represented prior information must also be discounted. Senn has discussed this in various places.

Deborah, here you have hit upon what I consider the very essence of the difference between the Bayesians and the frequentists, from what I’ve observed, especially in recent years. The Bayesians have this exploratory frame of mind, along the lines of “I feel such and such, and let’s see where it and the data take me.” The frequentists, by contrast, have this quaint notion of wanting to be objective and scientific, and want to come up with a principled analysis.

So, while frequentists react in horror when Bayesians change priors midstream, the Bayesians are baffled by all the fuss.

Matloff: I actually disagree, but in a sense that needs qualification. They may view it this way, but it’s quite the opposite. Frequentist error statistical methods are piecemeal methods for developing knowledge and building theories, and are all about what you do “on your way” to creating cumulative knowledge. Even where there is model checking, there is a piece that needs to be warranted. If all of the pieces, or even several of them, can be moving at once, then you are building on sand, or rather quicksand. It is the Bayesian who comes closer to a grandiose scheme of inference, where a great deal has been done to get all the hypotheses and all of their priors, and to think of all or many of the results that could obtain in order to very carefully choose the prior, and only after all of this is done does the great Bayesian inference occur. In other words, they need a ton to get going, whereas frequentists can jump in and out of small questions, with approximate models, in order to jump out again. Because Bayesian inference presents itself as such a complete story, any nips and tucks and changing of this, that, or the other thing as it goes is actually what precludes a reliability guarantee for the inference at hand. Perhaps one might say it’s allowing data-dependent exploratory moves, as if sketches for a painting, but using the result as a polished portrait. But it’s not clear what it represents. You can’t retrace how it got there the way that valid error probabilities force. Sorry if this is a bit metaphorical. Snow day here.

It lately has been fashionable to present the Bayesian approach as the “consistent” (in the sense of “coherent”) one, apparently related to your phrase “grandiose scheme of inference.” I’ve never understood that, and have suspected it’s just a buzzword. Maybe some people here can illuminate it for me.

But I just don’t think the “quicksand” metaphor is right, because again I think the Bayesians have different uses of that “cumulative knowledge” than you do. And because of that, they can afford to be more lax.

matloff: the presentations that I’ve seen that talk about coherence are usually pretty clear about the technical meaning of the term — it’s the Dutch book argument. A Dutch book is a set of bets on offer such that a bettor who can choose which side of each bet she wants can find some combination of bets that result in certain gain. Suppose one is required to set odds for every member of the sigma-algebra of some sample space. If the resulting book is a Dutch book, the odds are said to be “incoherent”. De Finetti proved that one can avoid creating a Dutch book iff the odds correspond to some probability measure.

The way this gives rise to what Mayo calls a grandiose scheme of inference is that strictly speaking, a Bayesian agent has no way to expand its sample space in the face of the unexpected, so every possible event must be considered at the outset.
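Corey’s Dutch book can be put in miniature numbers (illustrative, not from any source here): an agent who announces “fair prices” for $1 tickets on heads and on tails that sum to more than 1 is exploitable no matter how the coin lands.

```python
# An incoherent agent prices a $1 ticket on heads at 60 cents AND a $1 ticket
# on tails at 60 cents; the prices sum to 1.20 > 1, so no probability measure
# corresponds to them. (Numbers are purely illustrative.)
price_heads, price_tails = 0.60, 0.60

collected = price_heads + price_tails   # opponent sells her both tickets: $1.20 in
paid_out = 1.00                         # exactly one ticket wins, costing $1.00

sure_profit = collected - paid_out      # positive whichever way the coin lands
print(f"guaranteed profit: ${sure_profit:.2f}")
```

De Finetti’s result is exactly that no such combination of bets yields a sure gain if and only if the announced prices form a probability measure.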

Corey: On first para: Interested readers can look up “going Dutch” on this blog for a few problems with Dutch book arguments.

On the second para: to Alan — this may clarify something you were doubting back on the first post. Of course they try to have a hold-out catchall factor.

Let’s not mix up statisticians doing Bayesian data analysis with Bayesian agents! Unless you are a psychologist subscribing to the Bayesian brain hypothesis there is nothing making me into a Bayesian agent just because I use Bayesian tools to make inferences.

That’s one thing I like a lot with Jaynes’s description of Bayesian data analysis: he introduces early on his Bayesian robot (he would probably call it an agent today) as the tool for doing inference. This perspective makes many things much less mysterious, for example, changing the priors on your model. If you were the agent, what would it mean to change your own priors? Crazy! If you are using a Bayesian agent as a tool to make inference, then it’s not so strange: you are not changing *your* prior, you are changing the prior in the agent (or, as you would probably say it: of the model). So your Bayesian agent/model does not need to “have a hold-out catchall factor”, as long as you, as a responsible analyst, are aware that the posterior depends both on the data *and* the model (and perhaps throw in a couple of posterior predictive checks, for good measure).

Rasmusab: The Bayesian most certainly requires leaving out a catchall factor if he is ever to introduce a new hypothesis beyond what he’s already assigned priors to. Please see this post, “G.A. Barnard: The Bayesian Catchall Factor”: https://errorstatistics.com/2014/09/23/g-a-barnard-the-bayesian-catch-all-factor-probability-vs-likelihood/

In your previous post, you spoke of assigning 1/6 to the probability of a die outcome as if that were a prior. It is not. I don’t mean to be critical, but you are saying many things that show a lack of clarity about the Bayesian approach. Maybe others can suggest a down-to-basics source.

I’ve tried to show some clarity in my answer to the original question below

Thanks, Corey, very interesting. I’m not moved by the decision-theoretic arguments (“Every good rule is a Bayes rule for some prior,” I seem to recall), but now I see why some would find it appealing.

One may be able to redescribe what person X is doing under a lot of descriptions and it doesn’t follow that that’s what X IS doing or saying, nor that that’s what gives a rationale to what X is doing and saying.

It’s an old argument but a very poor one.

matloff: That’s a different theorem (due to Wald), albeit there are some similarities in spirit.

No need to worry. I don’t say that you should change the model if it doesn’t *fit* (though it might be a good idea to check why it doesn’t); I say that you should change the model if it doesn’t *work*, where “work” is defined by what you want the model to do for you. For example, if you are interested in prediction, a model that doesn’t work is a model that makes bad predictions.

When changing a prior you have to make the same kind of considerations as when changing the rest of the model. For example, if a linear assumption is not working well, then you might consider adding a quadratic term; however, you need to be aware of what you are doing, in order not to overfit. It’s the same way with changing a prior; indeed, going from a linear model to a quadratic model can be viewed as going from a strong prior on zero for the quadratic term to a wider (vaguer) prior on that term.

What I say about not fitting holds as well for “not doing what you want it to do”, be it prediction or whatever. My point is, and remains, that more is required to properly change the model, although each case differs.

Rasmusab: So, I read your comment to be in line with the idea that a Bayesian model must be validated against reality (real data) before being taken seriously? It is hard to imagine that a model that is not validated can do any work for us, except perhaps in what-if simulations.

I’m not 100% sure what you’re getting at. I would believe all kinds of statistical models need to be based either on data, on prior information, or, perhaps the best of both worlds, on both. If you have a lot of good prior information, then you don’t need that much data to build useful models. For example, I could build a fairly useful model of Yahtzee (the dice-throwing part, not the human decision-making part) building on my knowledge of how dice tend to behave. This would constitute a useful model, one that I would trust even though it has not been validated against data, and it is not really a what-if type of model.

Rasmusab: How does one determine whether a prior is a “good” one , and what constitutes “a lot of good information” a priori, if it does not entail having demonstrated conformity with real data from known sources? As Matloff has indicated, some argue that you do not need to worry about such things because the data will overwhelm the prior and lead to convergence with an accurate solution. Do you think so? If not, then are you in the camp of Gelman and others who espouse the need to use testing to ensure that the priors are appropriate?

‘How does one determine whether a prior is a “good” one […]?’

That’s an interesting question! Some thoughts:

* Trivially, a prior is better if it puts comparatively more probability mass on or around the “true” parameter value, and the best prior is a prior with all the probability mass on that value. But that’s not super helpful, as we often don’t know the truth.

* The best prior is the prior that includes all information we have regarding a parameter, no more, no less. Of course, this is not something that can be completely verified, just as you can’t completely verify normality, linearity, etc. But one can *argue* that a prior is pretty good. One example of a prior that I believe is pretty close to perfect is this: say you model the outcome of a single toss of a die; a prior for the outcome would put 1/6 probability on each outcome. This is pretty close to a perfect prior, I would argue, as it is hard to see how I could have any better information regarding this specific die roll. One could think up scenarios where the die is loaded and where I could have figured that out, but as priors go, this is as close to optimal as I can think of.

* A reasonable prior is the best you can do given the limited amount of time and resources you have. Also, if you spend many weeks developing the perfect prior for a problem, when you could have been out doubling your data set, you are clearly wasting time… 🙂

Why call it a “prior” to put 1/6 prob on the outcome of a die toss? That’s what’s given by an ordinary model of the process; it’s not an assignment of the probability that the probability is 1/6.

I have a feeling some people are confusing probabilities of outcomes under a model with prob to the model itself.

A prior is a probability distribution over some possible outcomes/states/parameter space that has not yet been informed by some data at hand and that represents the uncertainty regarding the actual outcome/state/parameter. A posterior is a distribution that has been informed by data, but this posterior can then be used as a prior when more data becomes available, so whether you call something a prior or a posterior depends on your perspective.

My assignment of probability over the possible outcomes of a single, specific throw of a die constitutes a prior. The distribution of 1/6 probability per outcome has *not* been informed by any data, only my prior knowledge about how dice tend to behave. In fact, my friend Maria has already thrown the die, but she is hiding it under a cup, and won’t tell me what it shows. She *does* tell me that it is an odd number. Using this piece of *data* I can update the prior, resulting in the posterior of 1/3 probability on 1, 3 and 5.

A reason why it feels strange to think of this as a prior might be that dice and coins are such common examples in statistics, and that we can be so sure that this specific prior is the best we can possibly do. Note that I could also put a prior on the *relative frequency* (often also called the probability in frequentist statistics) of the different outcomes of the die in repeated throws, but that’s a different thing. All of this written above does not really make sense, as with much of Bayesian statistics, if you think of probability as a long-run frequency, but it makes sense if you see probability as a way of representing uncertainty.

Oh my.
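rasmusab’s update (1/6 prior on each face, “it’s odd” as data, 1/3 posterior on the odd faces) is a one-step Bayes calculation; a sketch for illustration:

```python
from fractions import Fraction

# Prior: 1/6 on each face of the die, before Maria reveals anything
prior = {face: Fraction(1, 6) for face in range(1, 7)}

# Data: Maria reports the hidden face is odd.
# Likelihood of "odd" is 1 for odd faces, 0 for even faces.
lik = {face: int(face % 2 == 1) for face in prior}

# Bayes: multiply prior by likelihood, then renormalize
unnorm = {f: prior[f] * lik[f] for f in prior}
z = sum(unnorm.values())
posterior = {f: unnorm[f] / z for f in prior}

assert posterior[1] == posterior[3] == posterior[5] == Fraction(1, 3)
assert posterior[2] == posterior[4] == posterior[6] == 0
```

Note the point Matloff and Alan press elsewhere in the thread: faces that the prior (here, the likelihood step) assigns zero probability can never regain mass under updating.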

Rasmusab: Does your calculation impress Maria, or it is only for your use before she reveals the true state of the die, or what? Does the equal prior you have chosen to use (1/6) contribute in any way to the work you wish to do? If so, how? I think we might be close to understanding how you see these problems…

john byrd: “Does the equal prior you have chosen to use (1/6) contribute in any way to the work you wish to do? If so, how? I think we might be close to understanding how you see these problems…”

This was a follow-up comment on Mayo’s comment questioning whether the die prior actually was a prior. I just tried to take the most simple example I could come up with. That it is not a very interesting problem in itself and that Maria might not be so impressed is another matter.

Mayo: “Oh my.”

?

Matloff:

I don’t see that the convergence argument ever had much power because, for starters, it requires non-extreme priors and all kinds of model assumptions. But the main reason is that it’s not how we ever critically evaluate a scientific claim or method: in the vast long run, if we kept collecting data over and over and over again, people who didn’t disagree too much will converge. Any halfway decent account could say that. But we criticize a statistical inference NOW. We don’t say that if Potti had kept going, maybe in 50 years he’d have gotten a good predictive model for personalized cancer treatment. We identify the flaws with this inference and this study.

Further, I recall citing Henry Kyburg in EGEK as having shown that posteriors can remain specifiably far apart even with non-extreme priors.

Well, all of large-sample theory is based on convergence, so it is pretty much a given that one operates under the framework of what would happen if n were large. In order to make that work, one must develop a feel, via simulation or whatever, for “how large is large?” Not an easy question, especially in problems in which more than one quantity is going to infinity (or to 0) at the same time, but I think most statisticians are comfortable with it, for good reason in my view.

As to the Kyburg statement, I would guess that you are incorrectly recalling the details. What he probably said applied to fixed n.

Yes, of course fixed n, at any point you still have the disagreement. In real life, that’s what matters, but even there, as I say, it’s quite out of sync with what’s needed to adjudicate disagreements, or embark on an inquiry to identify the source of rival inferences.

matloff: Charles Geyer has a paper on the question of “how large is large” — a paper which deserves to be better known, apparently.

The upshot is that you just look at the realized log-likelihood function, and if the peak is well-approximated by a quadratic, then you’re in Asymptopia. But as Geyer notes, “When I tell typical statisticians that the asymptotics of maximum likelihood have nothing to do with the law of large numbers (LLN) or the central limit theorem (CLT) but rather with how close the log likelihood is to quadratic, I usually get blank stares.”
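Geyer’s diagnostic can be illustrated with a binomial log-likelihood (a sketch; evaluating the gap one standard error from the MLE is my own choice of yardstick, not Geyer’s):

```python
import math

def loglik(p, k, n):
    """Binomial log-likelihood for k successes in n trials (constants dropped)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def gap_at_one_se(k, n):
    """|log-likelihood - quadratic approximation| one standard error above the MLE."""
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # The quadratic (Taylor) approximation around the MLE drops exactly 1/2
    # at a point one standard error away.
    quad = loglik(p_hat, k, n) - 0.5
    return abs(loglik(p_hat + se, k, n) - quad)

print(gap_at_one_se(7, 10))      # visibly non-quadratic at n = 10
print(gap_at_one_se(700, 1000))  # nearly quadratic at n = 1000: Asymptopia
```

The same observed proportion (0.7) gives a log-likelihood peak that is far better approximated by a parabola at n = 1000 than at n = 10, which is exactly the “are we in Asymptopia?” check.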

I’ve heard of that “you’re in Asymptopia” paper. I haven’t read it, but I should, because it sounds very clever and insightful, even though it wouldn’t be of direct use to me since I try to stay away from likelihood functions.

Actually, reading a statistics book, e.g. Bernardo and Smith, would show that rasmusab is right.

Referring to the probability of an unknown die roll is not only legitimate; from a foundational point of view it is the only type of probability assertion that is. Parameters and distributions over parameters are only required to produce probability specifications over exchangeable sequences of observations. Beyond that, parameters and distributions over them are an ‘indulgence’.

Doing decision theory on the parameter space (rather than the predictive distribution) is one of the most common mistakes in statistics.

I agree (I think) re: statistical/probability model foundations – I’ve found the Bernardo et al view quite illuminating. However, the ‘indulgence’ of parameters etc. is of course what allows one to bring substantive theory into the model and hence ‘understanding’ and generalisation to new circumstances. Funnily enough this seems very close to Spanos’ view, except he emphasises iid instead of exchangeability to connect ‘purely statistical’ models to ‘substantive’ models.

Well maybe if you look at ordinary non-Bayesian texts you’ll find the usual probability distributions of random variables do the trick for assigning probs to outcomes.

David: If any pre-data assignment of probs to outcomes is a prior, then having changed radically the meaning, you’ll need a new term to discuss what the rest of us have been talking about. It would follow that randomization is a prior too, I suppose. And you can’t go fixing your priors using data and still call them priors, or pretend you used Bayes’ theorem, when you needed to go outside that school in order to employ a kind of pure significance test. Time’s got nothing to do with it.

We error statisticians use both pre-data and post-data error probabilities and do not have probabilities depend on irrelevant time considerations.

AFAICT only your first sentence is responding to me.

I agree terminology is difficult and non-standard. The first 150 pages or so of Bernardo and Smith cover the probability of observations only. It is only after that point that the standard setup of probabilities of parameters appears. The point is that the probability of a parameter is only a special case and needs to be interpreted through the probability distribution over observables with the parameter marginalized out. Frank Lad’s book is also good.

That is, the full probabilistic specification is

\[

p(y_1,\ldots,y_{N+1}) = \int \prod_{n=1}^{N+1} p(y_n \mid \theta)\, p(\theta)\, d\theta

\]

You condition this in order to obtain the predictive distribution

\[

p(y_{N+1} \mid y_1,\ldots,y_N) = p(y_1,\ldots,y_{N+1}) / p(y_1,\ldots,y_N)

\]

Equivalently,

\[

p(y_{N+1} \mid y_1,\ldots,y_N) = \int p(y_{N+1} \mid \theta)\, p(\theta \mid y_1,\ldots,y_N)\, d\theta

\]

where p(\theta|y_1,…,y_N) is the “posterior” in the standard sense of the word, but this is just a convenient method for computing the above.
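That computational point can be illustrated with a conjugate Beta-Bernoulli model, where both routes to the predictive distribution have closed forms (a minimal sketch of my own; the hyperparameters and data are made up):

```python
# Sketch: conditioning the joint p(y_1,...,y_{N+1}) and integrating
# against the posterior p(theta | y_1,...,y_N) give the same predictive.
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_joint(ys, a, b):
    # log p(y_1,...,y_n) = log B(a + successes, b + failures) - log B(a, b)
    s = sum(ys)
    return log_beta(a + s, b + len(ys) - s) - log_beta(a, b)

a, b = 2.0, 3.0       # Beta prior hyperparameters (assumed)
y = [1, 0, 1, 1, 0]   # observed y_1,...,y_N (made up)

# Route 1: p(y_{N+1}=1 | y_1..y_N) = p(y_1..y_N, 1) / p(y_1..y_N)
route1 = exp(log_joint(y + [1], a, b) - log_joint(y, a, b))

# Route 2: integrate theta against the posterior; for the conjugate
# model this integral is just the posterior mean (a + s) / (a + b + N)
route2 = (a + sum(y)) / (a + b + len(y))

print(route1, route2)  # identical up to floating point
```

Route 2 is the “posterior” calculation in the standard sense; the point is that it is just a convenient device for computing the conditional of the joint specification.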

The point being that the full probabilistic specification corresponds to :

\[

p(y_1,\ldots,y_{N+1})

\]

which is to be done by investigating decision preferences. (I agree not by looking at the data!)

Attacks on prior specification seem to obsess with considering only marginals of the full specification either:

\[

p(\theta)

\]

or perhaps

\[

p(y_1) = \int p(y_1 \mid \theta)\, p(\theta)\, d\theta

\]

Of course the marginals are all the same, i.e. $p(y_1) = p(y_2)$, etc.

Yet the full prior specification corresponds to $p(y_1,\ldots,y_{N+1})$, and most of the interesting content in the probabilistic specification is not present in either of the above marginals.

FWIW many applied statisticians probably don’t know this, so if you don’t you are in good company. Similarly, most books are not as pedantic about foundations as Bernardo and Smith or Lad, but for a philosopher of science, being pedantic about all of this is more important, I think…

This is essentially the same view I was trying to present further below, though I tried to emphasise the parameteric representation and points of similarity with other approaches. Bernardo and Smith and other articles by Bernardo also helped me a lot to understand this view. However, though some of the finer points are perhaps hidden, it is essentially the view in BDA and so it *should* be known far more widely, whether pure/applied or philosophical.

matloff: (et al) You seem to be either ignoring or trivializing a large literature by claiming that “one of the most cherished (by Bayesians) features … that the effect of the prior washes out as the sample size n goes to infinity”.

As alan indicates, the mathematics of when cumulative data does and does not dominate a prior is well-studied – go back to Doob (1949), Bernstein & von Mises, Le Cam or, more simply, Lindley coining the term “Cromwell’s Rule”. I invite you to look up these names if you like to check, but it is well-known that the feature you claim is touted as “one of the most famous defenses” is not universally true. Furthermore any half-decent training in Bayesian statistics will make clear it is not universally true, and statisticians with any half-decent training don’t go round claiming it to be universally true, in “defenses” or any other form of argument.
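Matloff’s truncated-support scenario is easy to simulate, and it illustrates Cromwell’s Rule directly: once the prior puts zero mass below c, no amount of data can move the posterior below c (a toy sketch of my own; the prior, true value, seed, and grid are all assumptions):

```python
# Sketch: a prior supported only on (c, inf) cannot "wash out"
# when the true mu < c -- the posterior inherits the prior's zeros.
import numpy as np

rng = np.random.default_rng(1)
c, true_mu, sigma = 0.0, -1.0, 1.0   # analyst is "sure" mu > c; wrong

means = []
for n in [10, 1000, 100000]:
    y = rng.normal(true_mu, sigma, size=n)
    grid = np.linspace(-5, 5, 20001)
    # flat prior on (c, 5), zero mass below c (any positive shape would do)
    log_prior = np.where(grid > c, 0.0, -np.inf)
    # Gaussian log-likelihood as a function of mu (constants dropped)
    log_lik = -0.5 * n * (grid - y.mean()) ** 2 / sigma**2
    log_post = log_prior + log_lik
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    means.append(float((grid * post).sum()))
    print(n, means[-1])  # posterior mean stays pinned above c for every n
```

The posterior concentrates ever harder just above c, but it can never cross it, which is exactly the behavior the “washes out” slogan glosses over.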

That your straw man has generated quite so much comment here is staggering.

Also, your claim that “one can use the data to assess the propriety of frequentist models” requires similar caveats. Just because you make an assumption is no indication that it can be usefully checked with any power (or precision, if you’re not interested in testing). Further long discussions in the statistical literature have shown that one can’t always test one’s way to knowing the true pattern of missingness, for example, nor can one always know the impact of unmeasured confounders. Yet in some fields you can’t open a journal issue without papers that use exclusively-frequentist analyses with such assumptions baked-in, so please don’t tell me it doesn’t happen. And even when one can usefully test assumptions, calibrating the properties of the resulting frequentist inference – the version that passed all the tests – is, in practice, at best subtle and often difficult-to-impossible.

The “prior washes out” slogan is too common to dismiss my comments as setting up a straw man. I’ve seen many Bayesians, some of them prominent researchers, make the statement in public, the private awareness of Bayesians to the contrary notwithstanding. To me, it’s a truth-in-advertising issue.

I like your missing-data example very much, George, a timely remark since I’m currently working on some methodology for that area. I suspect that the various kinds of missingness (assuming that one of them is true for a given setting, which may not be the case) actually can be distinguished by examining the data (albeit typically needing very large samples to do so). In other words, I don’t think we have an identifiability problem there. By contrast, as several of us here have pointed out, checking a subjective Bayesian prior is simply not possible, due to having just one realization of θ.

matloff: “I’ve seen many Bayesians … make the statement in public”

Who? Where? In what context? With any form of caveat (in which case the statement may well be correct) or without caveat? (in which case it is flat wrong). If you won’t provide details there isn’t much argument to have, sorry.

Re missingness, again a decent discussion requires details; yes under some conditions this can be checked but not always – and, unhappily, that includes realistic scenarios encountered in practice. Examples are in recent work by Mohan and Pearl, but the literature goes back a lot further.

I would consider the common praise of Bayesian updating as a way of leaving behind an unfortunate choice of prior as implying a washing out.

Just wanted to make a general comment about Bayesian “testing” of probability distributions. This applies to any distribution P(B|A) regardless of what type it is (prior, sampling, posterior).

For Bayesians, P(B|A) can be “tested” in two senses; one physical, one mathematical. First, we can test or verify whether A is true. This is done physically. To do so you have to go out into the real world and see if A is true. Second, we can verify that P(B|A) faithfully reflects what A has to say about B. Given a precise enough definition of “faithfully”, this is purely a mathematical question. It can be done from the comfort of your favorite armchair.

Frequentists reject this common sense view and think P(B|A) can only be tested in one way. Namely, in conditions when A holds, B is seen with frequencies approximating P(B|A).

Despite a superficial conflict, these two viewpoints are not contradictory. Rather, the former is merely more general. Under special circumstances, the former Bayesian viewpoint reduces to the latter frequentist one.

Alan: “Second, we can verify that P(B|A) faithfully reflects what A has to say about B. Given a precise enough definition of ‘faithfully’ this is purely a mathematical question. It can be done from the comfort of your favorite armchair.” This seems a strange statement. Are you saying that you can verify without data? That seems to be implied. If you do use data to verify, is it typically sample data, or do you rely upon the entire population in the armchair exercise?

To me this thread exhibits a fair bit of ‘I don’t understand it so it must be wrong’ all around. I certainly don’t understand all the issues and am far, far from an expert myself, but have a couple of comments that could possibly be useful to some.

As mentioned in a comment above, the Box/Rubin/Gelman (and probably Jaynes – I’ve read a bit but not enough to compare) style of Bayesian modelling includes ideas of model checking, for both priors and posteriors. Let’s call it the BRG approach for now, though I can’t claim to be able to accurately represent it here. How good a job these checks do is I suppose still open to debate/further investigation (?) but that the ideas exist and are not really discussed here is what I wanted to mention.

Also, I should add as a side point that these approaches try to include ideas/lessons from ‘frequentist/error probability’ style approaches, which might lead to the idea that they are ‘not really Bayesian’; however, I don’t think that this should obscure that they are really trying to formulate their own versions of these ideas within their own (Bayesian) framework, and that these formulations could in fact be more general than – or more specific than, or not directly comparable to – the related ideas from which they borrow.

A particular point of difference, emphasised repeatedly by some of the Bayes-oriented commenters here, is that they take all uncertainty to be modelled by probability distributions (which could e.g. be motivated by information-theoretic ideas) regardless of the existence of an empirical distribution or the direct accessibility of some quantity. However I think it is more important to look at how this approach is used (h/t Peirce), than focus all the debate on it directly.

Most importantly, using non-empirical prior probability distributions *does not stop any of the typical checking approaches* like those of BRG, from being used.

This gets at the point that quite a few above seem to endorse – ‘we only have one value of theta at hand/observed (?) and so can’t check its prior’. While I can see the intuitive appeal of this position, I don’t think that this is correct from the perspective of BRG or related approaches. I’ll try to illustrate my interpretation below (BRG obviously bear no responsibility for my mistakes).

With the long preamble above, now some sketchy equations to illustrate! I’m not sure whether latex works here so I won’t use it.

The most important equation, coming before any Bayes, seems to me to be

p(y) = int{p(y | theta) p(theta)}dtheta

This defines the probabilistic modelling approach as based on a *predictive integral representation* – the prediction for any future observations y is based on integrating the product of a ‘sampling distribution’ p(y | theta) (I’m not sure if this terminology will be confusing for frequentists, but seems standard in the Bayesian literature) and a ‘prior distribution’ p(theta). This is sometimes called the ‘prior predictive distribution’.

All/most checks (according to the BRG approach) are based on using *predictive equations* of this general form to compare against the *data*. Obviously, if we subsequently obtain information on the ‘true’ parameter theta then one can empirically test the prior directly, *but this is not the only/main way of model checking!*

Instead, and speaking intuitively, one wants to know whether y0 (the data!) is a representative sample from p(y) (the predictive distribution!). One can define Bayesian p-values etc. to test this. In a sense, I don’t see how this is any different from a null hypothesis test, except now the ‘null’ is defined by a prior distribution rather than a fixed value. Formally at least, one could take p(theta) = delta(theta - theta_0) and obtain the usual point null to test.

One important difference, however, is that the goal is not typically to ‘fail’ this null hypothesis test – it simply defines a criterion for an acceptable prior distribution, i.e. it must, along with the sampling model, give a predictive distribution of which the observations are ‘representative’. The next step is to try to refine the prior to get a *better predictive model*, typically by holding the sampling model fixed and updating (‘estimating’) the parameters via Bayes’ theorem.
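A prior predictive check of the kind described above can be sketched as follows (my own toy example, not from BRG; the model, prior, seed, and data are all invented). A tiny tail probability signals that y0 is not ‘representative’ of p(y) and the prior should be rethought:

```python
# Sketch: prior predictive check via a Bayesian p-value on T(y) = mean(y).
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.normal(5.0, 1.0, size=30)   # hypothetical observed data

def draw_prior_predictive_means(n, size):
    # assumed prior: theta ~ N(0, 2^2); sampling model: y_i ~ N(theta, 1),
    # so the sample mean of a replicate of size n is theta + N(0, 1/n) noise
    theta = rng.normal(0.0, 2.0, size=size)
    return theta + rng.normal(0.0, 1.0 / np.sqrt(n), size=size)

t_obs = y0.mean()                                   # test statistic on the data
t_rep = draw_prior_predictive_means(len(y0), 5000)  # under p(y)
p_value = float(np.mean(t_rep >= t_obs))            # prior predictive p-value
print(p_value)  # small: y0 is not "representative" of p(y); revise the prior
```

Note that nothing here requires multiple realizations of theta: the check runs entirely through the predictive distribution over observables.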

For example, after observing y0, one uses Bayes’ theorem in the usual way to get p(theta | y0) = p(y0 | theta) p(theta) / p(y0), and can arrive at a ‘posterior predictive distribution’. Again this is in integral form over a ‘prior’ – but this has now been updated to a ‘posterior’ – giving

p(y) = int{p(y | theta) p(theta | y0)} dtheta

where the ‘sampling distribution’ p(y | theta) is assumed to be unchanged by observing y0, i.e. p(y | theta, y0) = p(y | theta). This model can also be ‘tested’ for consistency with the data e.g. via a ‘null’ style test, where the posterior is now the ‘null’. If this model can reasonably be considered a ‘generator’ of the observed data then, *assuming the sampling model is correct* we have some confidence in the posterior.
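A posterior predictive check along these lines might look like the following sketch (my own invented example; here the sampling model understates the data’s dispersion, so a statistic the posterior mean cannot absorb, the sample standard deviation, flags the misspecification):

```python
# Sketch: posterior predictive check with T(y) = sample standard deviation.
import numpy as np

rng = np.random.default_rng(3)
y0 = rng.normal(5.0, 3.0, size=50)   # data more dispersed than the model allows

# Assumed model: y_i ~ N(theta, 1), prior theta ~ N(0, 10^2)
# (conjugate normal-normal update with known variance s2)
n, s2 = len(y0), 1.0
prior_mu, prior_var = 0.0, 100.0
post_var = 1.0 / (1.0 / prior_var + n / s2)
post_mu = post_var * (prior_mu / prior_var + y0.sum() / s2)

# Draw replicated datasets from the posterior predictive and compare T
t_obs = y0.std()
t_rep = []
for _ in range(2000):
    theta = rng.normal(post_mu, np.sqrt(post_var))
    y_rep = rng.normal(theta, np.sqrt(s2), size=n)
    t_rep.append(y_rep.std())
p_value = float(np.mean(np.array(t_rep) >= t_obs))
print(p_value)  # near 0: the sampling model's variance is misspecified
```

The posterior is the ‘null’ here, and the failure points at the sampling model rather than the prior, illustrating the “assuming the sampling model is correct” caveat above.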

That to me is the general approach of BRG style Bayes, and includes many nice elements of both styles of statistics. It would seem to me to be helpful for both ‘sides’ to try to focus more on these common elements in the shared goals of model criticism and estimation (to follow Box) – initial model criticism (specification), model estimation conditional on a model and post-estimation model criticism (checking).

I obviously should have written p(y | y0) for the posterior predictive distribution, but you get the point…once all the updating has been done all the info is contained in the ‘new prior’ and the ‘sampling model’ and so one can drop the explicit conditioning on y0 in the predictive model.

This I understand more or less, because there is an effort to produce a model with empirical validity. I do not understand the armchair statement, nor the notion that a prior requiring a mean > 3, combined with a CI below 3, is a “gotcha” for the frequentist. It is more aptly a gotcha for the prior. If you work to obtain a prior with no contradiction to reality, you ought to be able to develop a useful model for certain purposes, like prediction.

One thing I should mention, relating to my comment above re: Spanos’ approach, is that once the model is ‘statistically well specified’ – i.e. the predictive distribution captures the data/passes the tests – then one can better trust the parameter estimation. Here, though, the parameter estimation is represented in terms of a probability distribution p(theta | y0) instead of a confidence interval etc. So basically, the *full* ‘error statistical approach’ as outlined by both Spanos and Mayo has an analogy in the BRG approach (but the latter seems to me to be perhaps better suited for handling the estimation phase in complex models).

Also, re: Mayo’s concerns about complex models, I believe this is where hierarchical modelling as championed by eg Gelman would be crucial.

why? how does it work? when did I express concerns about complex models?

Re: complex models – I thought you’ve often emphasised the need to ‘break down’ investigations in a piecemeal manner, and that the Bayesian approach is often too ‘all at once’ to get much insight/separate out the parts of the model. Hierarchical modelling (which doesn’t strictly have to be Bayesian – though the Bayesian approach is convenient for estimation) does this through conditional independence/exchangeability assumptions for the various parts of the model.

Gelman’s book(s) discuss(es) applied examples in detail.

See also maybe: Bernardo ‘The concept of exchangeability’, which briefly mentions hierarchical models at the end, here: http://www.uv.es/~bernardo/Exchangeability.pdf

This latter reference illustrates the points about defining the probability of sequences and the use of exchangeability to induce a parametric model. Again, as mentioned further above, I think that this is in fact not *that* far from some of Spanos’ ideas, though relying on exchangeability rather than iid.

In fact, closer than I thought: Spanos (p. 541 in his 1999 book) states “the De Finetti [exchangeability] representation theorem (reinterpreted) can be used to operationalize the specification problem of statistical inference: the choice of the appropriate Statistical model in view of the observations”.

So the Bayesian approach based on constructing statistically-adequate predictive distributions (including prior checking) seems to correspond to Spanos’ phase of misspecification testing for statistical adequacy.

[The estimation phases then of course differ with Bayes’ theorem used for the estimation in the Bayesian approach c.f. confidence intervals/hypothesis testing. But now the ‘assumptions can’t be checked’ objection to Bayes seems to not apply so strongly – Bayes’ theorem is only really used in the estimation phase not the specification phase.]

Omaclaren: Spanos’ whole point is to use the probabilistic reduction to get beyond iid, not retain it even in exchangeability form.

Yes I see that I was unfairly assuming Spanos emphasised iid as I had only seen examples of this sort on this blog.

The key point, though, is the striking analogy between the predictive distribution-based Bayesian approach and the PR/Error Statistical approach. The former seems to have influenced the latter (via De Finetti at least) and the latter the former (via Box at least, it seems).

This is good, no? And seems to shed more light (to me anyway) on how priors enter the modelling and can be subject to (some) testing.

???

Is this ??? in reference to my comments? Is something specific unclear? Do you follow and/or agree with my sketch of the prior/posterior predictive distribution-based approach?

I just saw this, and will definitely take a closer look …