My post “What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?” gave rise to a set of comments that were mostly off topic but interesting in their own right. Being too long to follow, I put what appears to be the last group of comments here, starting with Matloff’s query. Please feel free to continue the discussion here; we may want to come back to the topic. Feb 17: Please note one additional voice at the end. (Check back to that post if you want to see the history)
I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.
Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c,∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in whcih the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method — that the effect of the prior washes out as the sample size n goes to infinity?
The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof of by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right.
The longer answer goes like this. Statisticians can get inferences and their associated uncertainties from probability distributions. If those inferences are true to within those uncertainties, we say the distribution is ‘good’. Statisticians typically do this with posteriors. Good posteriors being those that give us interval estimates that jive with reality. Obviously though it can be done for any distribution no matter what it’s type or purpose.
Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.
Given the prior with support on (c, infty) we’d infer that “the true mu is greater than c”. If the true mu is less than c, then the prior is ‘bad’ and shouldn’t’ be used. Using it is equivalent to making a false assumption no different than “assume 3 is an even number”,
The moral of the story Matloff is that your prior should only say “mu is greater than c” if your prior information guarantees it. If the prior information about mu isn’t strong enough to guarantee it with certainty you should choose a prior which reflects that and has a larger support than (c, infty)
Well using a (c,∞) prior makes a model that “considers” values less than c impossible and is useful when you don’t have time or need to coming up with something more nuanced. But if it seems that the (c,∞) is not doing a good job (or if you learn new information) there is nothing stopping you from changing the prior (as you can change other assumptions in the model). So you could say, “All priors all false, but some are usefull”.
Of course, if you want to you can put some other prior on μ where you reserve a tiny bit of probability on μ less than 3 and in that model you would have the property that “the effect of the prior washes out as the sample size n goes to infinity”.
Thanks for the thoughtful comments, Alan and rasmusab. But I think you agree, then, with my point: One of the most famous defenses offered by Bayesians for their methods — that the influence of the prior gradually washes out (“Our answers won’t be much different from those of the frequentists”) — fails in a broad category of situations. The Bayesian philosophy is not quite as advertised.
The other point I’d make in response to your comments (which I’ve mentioned before here and in Andrew Gelman’s blog) is that frequentist methods are robust to bad assumptions, in the sense that one can verify the assumptions via the data (if you have enough of it). By contrast, one can’t do that for a (subjective) prior, by definition, because one is working with only one realization of the parameter θ.
Matloff, I’ve never heard anyone claim that if a prior assigns zero probability to the true value of mu that the posterior will settle on the true mu given enough data. Since elementary algebra shows the support of the posterior is a subset of the support of the prior, the claim is trivially false, and I doubt anyone ever did say it was true.
John Byrd, there is no “validated by estimating error probabilities that will result from applications of it” being done. The prior and posterior describe an uncertainty range for a single mu. There are no frequencies to calibrate to. Separately, if x_i = mu+e_i and the measuring instrument gives errors ~N(0,10) as in the post, it’s possible to get a CI entirely below the cuttoff. This will happen some small percentage of the time randomly. If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set (the intersection of the CI and the interval greater than the cutoff). Is that answer acceptable to you? Mayo seems to indicate it is, and that I’m “stamping [my] feet” over it.
Mayo, for P(mu|A) to do it’s job it has to faithfully reflect what A says about mu. If it doesn’t the distribution is “wrong”. If A says “it’s possible mu is less than c” but P(mu|A) says “mu must be bigger than c” then the distribution is bad. P(mu|A) is contradicting what A has to say about mu. That’s the philosophical origin of the ‘test’ and it in no way requires some extra Bayesian ingredient.
Even if it did, in what sense could this secretly be “Error Statistical” when it involves assigning probabilities to hypothesis and uses distributions which aren’t frequency distributions in any way? (this is not a rhetorical question. If everything else is ignored, please answer this one)
From Alan: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.”
I understand that a Bayesian model– like any model– can be validated by estimating error probabilities that will result from applications of it. That is a good thing and a saving grace. But, consider this need for validation in the context of the toy example of the couch measurement, and it becomes very clear why Mark’s answer was correct, and my suggestion to stick to the CI because a laser transit has its own error makes practical sense for scientists trying to solve problems. If you get a CI with most likely values of mu below 3, you will likely end up having to revise your prior following attempts to validate…
It seems very improbable to me that you can follow the protocol of validating a Bayesian model against real data and end sharply divergent from the CI in a case like that. If you gain advantage by validation in that you obtain more data, then the CI can also be narrowed with the additional data. Two paths to the same end point?
I actually believe most people that do Bayesian data analysis (those you call Bayesians) actually use convenience priors (such as default priors, or reference priors). And I think that’s fine, as long as you know that you are using a convenience prior. Just like most people use convenience models (like linear regression), it’s quick and easy and hopefully works ok most of the time.
It’s’ a different case if you were to chose a convenience prior and then stick to it whatever happens. That would be like sticking to linear regression without ever questioning the model assumptions. And that would be questionable.
A useful way of thinking about priors is just as “part of the model”. Just like the assumption of linearity is part of the model, and has to be justified, the priors are also part of the model and have to be justified. But sometimes use use linear regression because you have no better option and sometimes you use convenience priors because you haven’t figured out something better.
What I meant with the Rubin/Jaynes approach was a very pragmatic approach to Bayesian data analysis, like the one described here, for example:http://projecteuclid.org/euclid.aos/1176346785
I’m replying to rasmusab, who had replied to me.
You and I of course agree on the conditions under which the Bayesians’ famous “the prior eventually washes out” claim fails. But my point was that the Bayesians don’t put an asterisk on that famous slogan, which is why I said the Bayesian approach is not quite as advertised. That’s a really big deal to me.
And more importantly, we’re not talking about some rare case here. On the contrary, the excellent book Bayesian Ideas and Data Analysis, one of whose authors is my former colleague Wes Johnson (a really smart guy and a leading Bayesian), is chock full of examples of priors that assume bound(s) on θ.
The examples in that book — and in every other book I know of on the Bayesian method — show that many, indeed most, Bayesians set up priors exactly in the way you believe that the vast majority don’t: Their priors are chosen, as you say, because “it feels right.” Of course, they also often choose “convenient” priors because they lead to nice posterior distributions, making the priors even more questionable.
I’m not familiar with the Rubin/Jaynes approach. A quick Web search seems to indicate it is aimed at performing “What if?” analyses. I have no problem with that at all (providing, as always, that the ultimate consumers of the analyses are aware of the nature of what was done).
Alan: It appears that you employ circular reasoning. The prior is to be corrected through “experience” unless it is to be taken as a certainty before application? Makes no sense. This is what I call the self-licking ice cream cone approach to Bayesian philosophy. Establish a prior, take it as meaningful, sell it to others unless the model does not work. If the model performs poorly, change the prior, call it prior information anyway, then repeat process.
You say: ” If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set .”. So, you say we must accept the prior as more important than the data. And also:“Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.” It appears the latter approach of testing to correct the prior is most reasonable. The latter approach would correct the prior to avoid the empty set.
So, you are faced with a scenario where IF you are willing to allow that your prior is subject to revision when faced with reality, then your Bayesian model will gravitate to the CI solution. Or, you can simply not test it. But then it becomes religion not science.
And, it appears to me that validating a model by comparing its predictions to reality to measure its performance is precisely seeking to minimize error probabilities. Seems obvious to me. I am puzzled that you do nor think so.
John: You bring out a good point: they have to assume something like the single mu that is responsible for the current data itself having been randomly selected from a population of mus. That’s a sample of size 1. We wouldn’t reject a statistical hypothesis on the basis of a sample of size one. So, it’s not clear they can be seen as getting error probabilities, which require a sampling distribution. We’re never just interested in fitting this case, the error probabilities are used to assess the overall capacity of the method to have resulted in erroneous interpretations of data.
And of course, there’s the problem of distinguishing between violated assumptions, like iid, and a violated prior. I note this in my remarks on Gelman and Shalizi’s paper.
But rasmuab, you are ignoring the key point: One can use the data to assess the propriety of frequentist models, as linearity of a regression function, but one can NOT do that in the (subjective) Bayesian case. In Bayesian settings, since one has only a single realization of θ at hand, one can’t estimate the distribution of θ to verify the prior.
All this changes in the empirical Bayes case. Then there is a real distribution for θ , and one’s model for that distribution can be verified as in any other frequentist method — because it IS a frequentist method. For instance, Fisher’s linear discriminant analysis (or for that matter logistic regression) without raising an eyebrow, even though it is an empirical Bayes method.
I skimmed through the first few pages of the Rubin paper (thanks for the interesting link), and immediately noticed that his very first example, on law school grades uses an empirical Bayes approach, not a subjective one, which makes it frequentist.
Feb 17 addition:
I had grabbed the last handful of comments (excluding most of mine) but didn’t mean to exclude anyone who made remarks on the new topic (of truncation), so here was Mark’s remark to Alan’s initial concern about truncation:
Alan, let me get this straight. Your example involves a case where there’s a hard physical constraint on the mean being greater than 3, but no such physical constraint on individual observations? The only possible way to get a CI that lies almost entirely below the cutoff is to have the vast majority of values lying below the cutoff. What’s a Bayesian to do in this case, stamp his feet and say “no, no, no, the mean must be constrained to be greater than 3, so I’ll put the vast majority of my weight on my prior” (that is, acknowledge that the data are noisy and so essentially throw them out)? I’d love to see a Bayesian analysis where a) there is a physical constraint on the mean being greater than 3, b) almost all of the data are sufficiently lower than the cutoff *such that the standard frequentist CI was almost entirely below the cutoff*, and c) the final inference was not based almost exclusively on the prior. If your answer is that your final inference in this case would be essentially the prior, then I frankly don’t see anything less absurd in your approach than claiming that (3, 3.00001) is a reasonable CI. It’s the same argument, as far as I’m concerned, they’re equally concocted.
Now, if there truly is a physical cutoff, such that both the mean and realized values are required to be above this cutoff, then there is a very simple frequentist approach to incorporate this background information. Do a transformation like log(X-3). No need to truncate, your entire CI will be in the required range.
Whoa! Someone said I was right, and I missed it? Thanks John Byrd!