I’m extremely grateful to Drs. Owhadi and Scovel for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it. If readers want to ponder the paper awhile and send me comments for guest posts or “U-PHILS*” (by OCT 15), let me know. Feel free to comment as usual in the meantime.

—————————————-

**Houman Owhadi**

Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,

California Institute of Technology, USA

**“When Bayesian Inference Shatters: A plain Jane explanation”**

This is an attempt at a “plain Jane” presentation of the results discussed in the recent arxiv paper “When Bayesian Inference Shatters” located at http://arxiv.org/abs/1308.6306 with the following abstract:

“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”

Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data.

A good way to understand this is through a simple example: Assume that you want to estimate the mean E_{μ^{†}}[X] of some random variable X with respect to some unknown distribution μ^{†} on the interval [0,1] based on the observation of n i.i.d. samples, given to finite resolution δ, from the unknown distribution μ^{†}. The Bayesian answer to that problem is to assume that μ^{†} is the realization of some random measure distributed according to some prior π (i.e. μ ~ π) and then compute the posterior value of the mean by conditioning on the data. Now to specify the prior π you need to specify the distribution of all the moments of μ (i.e. the distribution of the infinite-dimensional vector (E_{μ}[X], E_{μ}[X^{2}], E_{μ}[X^{3}],…)). So a natural way to assess the sensitivity of the Bayesian answer with respect to the choice of prior is to specify the distribution ℚ of only a large, but finite, number of moments of μ (i.e. to specify the distribution of (E_{μ}[X], E_{μ}[X^{2}],…, E_{μ}[X^{k}]), where k can be arbitrarily large). This defines a class of priors Π and our results show that no matter how large k is, no matter how large the number of samples n is, for any ℚ that has a positive density with respect to the uniform distribution on the first k moments, if you observe the data at a fine enough resolution, then the minimum and maximum of the posterior value of the mean over the class of priors Π are 0 and 1.
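In symbols, the example above can be restated roughly as follows. This is only a sketch: the data symbol $d$, the moment map $\Psi_k$ and the set $\mathcal{M}([0,1])$ of probability measures on $[0,1]$ are notation introduced here for illustration, and the precise statement is the one in the paper.

$$\Psi_k(\mu) = \big(E_{\mu}[X], E_{\mu}[X^{2}], \ldots, E_{\mu}[X^{k}]\big), \qquad \Pi = \big\{ \pi \text{ prior on } \mathcal{M}([0,1]) \;:\; \Psi_k(\mu) \sim \mathbb{Q} \text{ when } \mu \sim \pi \big\}$$

$$\inf_{\pi \in \Pi} E_{\mu \sim \pi}\big[ E_{\mu}[X] \,\big|\, d \big] = 0, \qquad \sup_{\pi \in \Pi} E_{\mu \sim \pi}\big[ E_{\mu}[X] \,\big|\, d \big] = 1 \qquad \text{(for } \delta \text{ small enough)}$$

Here $d$ stands for the $n$ samples observed at resolution $\delta$. The two extremes $0$ and $1$ are simply the smallest and largest values the mean of a distribution on $[0,1]$ can take, which is what "the deterministic range of the quantity of interest" refers to.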

It is important to note that these brittleness theorems concern the whole of the posterior distribution and not just its expected value, since by simply raising the quantity of interest to an arbitrary power, we obtain brittleness with respect to all higher-order moments. Moreover, since the quantity of interest may be any (measurable) function of the data-generating distribution, in the example above, we would get the same brittleness results if instead of estimating the mean we estimate the median, some other quantile, or the probability of an event of interest.

Instead of moments or other finite-dimensional features, we can also consider perturbations of the model quantified by a metric on the space of probability measures. In this case, our results show that, for any perturbation level α > 0, no matter the size of the data, for any parametric Bayesian model, there is another Bayesian model that is at distance at most α from the first one in the Prokhorov or Total Variation (TV) metric leading to diametrically opposite conclusions in the sense described above.
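The mechanism behind this kind of statement can be illustrated with a toy numerical sketch. It is not the construction used in the paper: the two sampling densities, the two-point prior, the helper names (`bin_probs`, `perturb`, `posterior_mean`) and the perturbation itself are assumptions chosen for illustration. The idea is that a model can be changed only on the tiny bins where the data were observed, which costs almost nothing in TV distance but changes the likelihood of the data by many orders of magnitude.

```python
import numpy as np

def bin_probs(m):
    """Bin probabilities for two toy sampling models on [0,1] cut into m bins.

    Model A has density 2(1-x) (mean 1/3); model B has density 2x (mean 2/3).
    These densities are illustrative assumptions, not taken from the paper.
    """
    k = np.arange(m, dtype=float)
    p_b = (2.0 * k + 1.0) / m**2   # integral of 2x over bin k
    p_a = 2.0 / m - p_b            # integral of 2(1-x) over bin k
    return p_a, p_b

def perturb(p, observed_bins, shrink):
    """Shrink the mass of the observed bins by `shrink` and spread the
    removed mass evenly over all other bins. Returns (new probs, TV distance)."""
    idx = np.unique(observed_bins)
    q = p.copy()
    removed = (1.0 - shrink) * q[idx].sum()
    q[idx] *= shrink
    others = np.ones(p.size, dtype=bool)
    others[idx] = False
    q[others] += removed / others.sum()
    tv = 0.5 * np.abs(p - q).sum()
    return q, tv

def posterior_mean(p_a, p_b, observed_bins):
    """Posterior value of the mean under a 50/50 prior on models A and B,
    conditioning on which bin each sample fell into."""
    m = p_a.size
    mid = (np.arange(m) + 0.5) / m                 # bin midpoints
    ll_a = np.log(p_a[observed_bins]).sum()        # log-likelihood under A
    ll_b = np.log(p_b[observed_bins]).sum()        # log-likelihood under B
    w_a = np.exp(ll_a - np.logaddexp(ll_a, ll_b))  # posterior weight of A
    return w_a * (p_a * mid).sum() + (1.0 - w_a) * (p_b * mid).sum()

m = 1_000_000                                      # resolution delta = 1/m
rng = np.random.default_rng(0)
data_bins = np.floor(rng.uniform(size=20) * m).astype(int)  # n = 20 samples

p_a, p_b = bin_probs(m)
q_a, tv_a = perturb(p_a, data_bins, shrink=1e-6)   # slightly perturbed model A
q_b, tv_b = perturb(p_b, data_bins, shrink=1e-6)   # slightly perturbed model B

base = posterior_mean(p_a, p_b, data_bins)
low = posterior_mean(p_a, q_b, data_bins)    # B nearly ruled out by the data
high = posterior_mean(q_a, p_b, data_bins)   # A nearly ruled out by the data
print(f"TV perturbations: {tv_a:.2e}, {tv_b:.2e}")
print(f"posterior mean: base={base:.4f}, low={low:.4f}, high={high:.4f}")
```

In this toy setting the two perturbed models each lie within TV distance of about 4/m of the original, yet the posterior value of the mean is pushed to one or the other end of the range spanned by the two models (near 1/3 or near 2/3). The theorems in the paper are much stronger, since the class of perturbations there is rich enough to reach the whole deterministic range [0,1].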

Since, as noted by G. E. P. Box, for complex systems, all models are mis-specified, these brittleness results suggest that, in the absence of a rigorous accuracy/performance analysis, the guarantee on the accuracy of Bayesian Inference for complex systems is similar to what one gets by arbitrarily picking a point between the minimum and maximum value of the quantity of interest.

Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. Moreover, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model, which could be a very strong assumption if you are trying to certify the safety of a critical system. Indeed, when performing Bayesian analysis on function spaces e.g. for studying PDE solutions as is now increasingly popular, results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular, and hence at KL distance infinity from one another. Observe also that if the distance in TV between two Bayesian models is smaller than, say 10^{-9}, then those models are nearly indistinguishable (you will not be able to test the difference between them without a ridiculously large amount of data). So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.
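The remark about near-indistinguishability can be made quantitative with a standard bound. This is a sketch in notation not used in the post: $P$ and $Q$ denote two sampling distributions, $d_{TV}(P,Q) = \sup_A |P(A) - Q(A)|$, and $n$ is the number of i.i.d. samples.

$$d_{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) \le n \, d_{TV}(P, Q), \qquad \inf_{\text{tests}} \big(\text{type I error} + \text{type II error}\big) = 1 - d_{TV}\big(P^{\otimes n}, Q^{\otimes n}\big) \ge 1 - n \, d_{TV}(P, Q)$$

So if $d_{TV}(P,Q) \le 10^{-9}$, no test can bring the sum of its two error probabilities below $1/2$ unless $n > 5 \times 10^{8}$.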

To summarize, our understanding of these results is as follows: the current robust Bayesian framework (sensitivity analysis posterior to the observation of the data) leads to brittleness under finite information or local mis-specification. This situation might be remedied by computing robustness estimates prior to the observation of the data with respect to the worst-case scenario of what the data-generating distribution could be. We are currently working on this while pursuing the goal of developing a mathematical framework that can reduce the task of developing optimal statistical estimators and models to an algorithm.

We do not think that this is the end of Bayesian Inference, but we hope that these results will help stimulate the discussion on the necessity to couple posterior estimates with rigorous performance/accuracy analysis. Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy. In other words, as it currently stands, although Bayesian Inference offers an easy and powerful “tool” to build/design statistical estimators, it does not say anything (by itself) about the statistical accuracy of these estimators if the model is not exact.

The main examples that come to mind are those where the accuracy of posteriors given by Bayesian Inference cannot be tested by repeating experiments, or through the availability of additional samples (climate modeling, catastrophic risk assessment, etc.). The main consequence is that for such systems risk may be severely (dangerously, ruinously) underestimated.

Houman and Clint

*“U-PHILS” = “U-Philosophize”. Info and exemplars at the link.

This is all a bit hard to follow for the lay person. But, I think I am seeing a point made which relates to the naivete of blind reliance on the strong likelihood principle in Bayesian inference. That is, the need for post-hoc evaluation of the posteriors seems to torpedo the notion that the data and model and the prior tell us what we need to know. Am I off base?

I hope this does shatter naïve Bayesianism. But maybe something can be reconstructed from the pieces. Contemporary Bayesianism often seems to claim a lot, and to be fragile. Maybe if it claimed less it could be robust.

I note some related work and ideas at my blog,

http://djmarsay.wordpress.com/bibliography/rationality-and-uncertainty/probability/criticism-of-mathematical-probability/owhadi-eas-when-bayesian-inference-shatters/

I will give it some more thought. Regards.

Houman, Clint and Tim: Thanks so much for sending this intriguing guest post, and so soon! I was about to post a new “Saturday night comedy at the Bayesian retreat” when this came through, so I went with this instead! Earlier comments (by Corey and Christian Hennig) when I was running this by last week may be found here:

https://errorstatistics.com/2013/09/07/first-blog-did-you-hear-the-one-about-the-frequentist-and-frequentists-in-exile/

I’m wondering how this compares to a non-Bayesian treatment; I’ve always suspected Bayesian success stories depended on non-Bayesian feedback loops. But is there actually a danger of observing the data at a fine enough resolution? Are you spared if you don’t? (Naive philosopher’s questions)

It’s nice to see you writing directly on this blog!

I wonder whether the following statement is a bit misleading: “It is important to note that these brittleness theorems concern the whole of the posterior distribution and not just its expected value, since by simply raising the quantity of interest to an arbitrary power, we obtain brittleness with respect to all higher-order moments.” It is true that the discontinuity of the expectation as a functional of distributions extends to all higher-order moments, but still I’d distinguish this from what I’d call “the whole of the posterior distribution”, because a number of things that people could do with the posterior don’t seem to be affected by this, such as the posterior median or credibility intervals as long as they are not centered about the expectation. I’d think that the fact that moments of a distribution don’t imply information about the probability of sets as long as the sets don’t depend on moments (or the probability is one) makes such a theorem possible in the first place. Or am I wrong?

“We do not think that this is the end of Bayesian Inference”

I expect it’ll suffer the same fate as that fragile flower mechanics did after it was similarly “shattered”:

Erk! Sorry about the bad link. The “(Look Inside –> Introduction)” bit at the end isn’t meant to be a part of it.

The assumption that an adopted prior distribution is continuous (and therefore has an infinite number of moments) does not make sense to me when applied to practical applications.

The proof of instability might be useful in situations where the idea of Bayesian estimation is being extended beyond its usefulness.

If the intent is simply to prove that guessing a prior distribution could run into trouble in certain cases, that does not sound like an unreasonable point.

I love that bit, “We do not think that this is the end of Bayesian Inference.”

What a relief! I was worried there for a minute.

Yes, but it may be intended as an ironic understatement. There is still a little red wine in that shattered glass (depicted on this post) but you’d cut yourself badly if you tried to drink from it.

Plain Jane

Hi everyone,

Thanks for your posts and comments.

Here are a few thoughts:

John: Thanks for your comment. It basically depends on whether your model is well-specified or not. What we have shown is that if your model is misspecified (in TV or Prokhorov metrics) then you will not find any guarantee of accuracy on the posterior estimates within the strict framework of Bayesian Inference.

Indeed (after observing the data) you can show that there exists a nearly indistinguishable model (nearly indistinguishable to testing based on discrete data) that produces diametrically opposite results.

Dave: Thanks for your blog. I definitely agree with your comment “Often, the non-probabilistic uncertainty has dominated the overall uncertainty”. This is precisely why we have developed the Optimal Uncertainty Quantification (OUQ) framework (SIREV, http://arxiv.org/abs/1009.0679) where you look at worst and best case scenarios with respect to what the underlying probability distributions and response functions could be. We found those brittleness results while incorporating priors into the OUQ framework. Our current work (in progress) concerns the incorporation of sample data into the OUQ framework and scientific computation of optimal estimators.

Deborah: Thanks for giving us the opportunity to interact via this blog. The danger, if we can call it that, is not in the observation of data but in the conditioning on events of very small probability. More precisely, the improvement in the resolution of our observations allows us to condition on events of smaller and smaller probabilities, and by doing that we are implicitly making stronger and stronger assumptions about the accuracy of the prior.

Behind this instability you have that of the conditional expectation when you try to interpret it as a derivative.

The first brittleness paper also contains an example where, by adding Gaussian noise to the measurements, you decrease the range of posterior values as you increase that noise. This appears paradoxical at first, but what really happens is that as you increase the noise you lose the information contained in the samples and your posterior values converge towards prior values.

Concerning the question of being spared if you don’t observe the data at a fine enough resolution, for the reason given above the answer is no if it is based on an analysis posterior to the observation of the data. For an analysis prior to the observation of the data the question remains open, but I can already tell you that if you observe the data with infinite resolution then a prior analysis will also show brittleness if you do not rule out the possibility that your prior and the true prior are singular (a variant of the Borel-Kolmogorov paradox will rear its head).

The non-Bayesian treatment is what we are currently working on: it is based on an analysis prior to observing the data and we consider worst case scenarios with respect to what the data generating distribution could be. Our purpose is to compute optimal statistical estimators and from that point of view you could see Bayesian models and priors as optimization variables (that would be the non-Bayesian loop). Those optimal estimators may be Bayesian, frequentist or something else based on the available information (this will be the subject of a sequel work).

Christian: Thanks for your comment. In theorems 1 and 2, the quantity of interest ($\Phi$) could be any (integrable, measurable) function of the data generating distribution, so it could be its median (or another quantile), or credibility intervals (or the probability of any event of interest). So those brittleness results concern the whole posterior distribution, not just the mean.

Paul: Thanks for your comment. Bayesian Inference is still a nice/powerful way to construct statistical estimators, the controversy starts when it is used without any analysis on its accuracy. And this could be the non-Bayesian feedback loop that Deborah was referring to.

Andrew: Well, it had to be said explicitly given some feedback we have received.

Houman

Houman, Clint and Tim: I just want to express my thanks for taking the time to write specific, clear and useful responses to our comments.

This is very interesting.

Question: are your results related to these papers:

Bahadur, Raghu Raj and Savage, Leonard J. (1956).

The nonexistence of certain statistical procedures in nonparametric problems.

The Annals of Mathematical Statistics, 27, p. 1115-1122.

Donoho, David L. (1988).

One-sided inference about functionals of a density.

The Annals of Statistics, 16, p. 1390-1420.

–Larry Wasserman

I have many questions about this result:

– what are the relations between this theorem and the many posterior consistency theorems of Bayesian nonparametrics?

– I have an intuition that the result depends on the existence of measurable sets where the model density is really really small but non-zero and the perturbed model density is strictly zero, thereby generating tiny differences in total variation but infinite KL distance. Is this an accurate intuition?

– does the result hold for computable measures?

– what are the implications of the result for what Larry Wasserman calls presistency, i.e., predictive consistency?

“Thanks for your comment. In theorems 1 and 2, the quantity of interest ($\Phi$) could be any (integrable, measurable) function of the data generating distribution, so it could be its median (or another quantile), or credibility intervals (or the probability of any event of interest). So those brittleness results concern the whole posterior distribution, not just the mean.”

Just to say that, after looking at the original longer paper “Bayesian Brittleness: Why no Bayesian model is ‘good enough’” and getting more familiar with the notation, I now accept that this is true, and my previous comments about the result only applying to the posterior expectation were wrong.

Corey: “I have an intuition that the result depends on the existence of measurable sets where the model density is really really small but non-zero and the perturbed model density is strictly zero, thereby generating tiny differences in total variation but infinite KL distance. Is this an accurate intuition?”

My intuition now is that the trick is that you need to manipulate the probability of a small region around the observed data (which according to the assumption goes to zero if the region is small enough), making it arbitrarily small under some parameter values and keep it constant nonzero under others (that would otherwise look “bad”). Which is probably about what you’re saying.

Hopefully the authors will confirm or correct it…

I now also realise how one could criticise the use of Prokhorov/TV here. This allows one to choose weird-looking distributions as sampling models that specifically exclude certain potentially observable data sets for some parameter values but not for others, regardless of the parameters for which these data sets lie in high-density regions in models with smooth densities (in the Prokhorov neighborhoods of which, as we know, we can find all kinds of weird stuff). Such distributions are certainly counterintuitive, despite not being strictly distinguishable by observations from the sampling models we would usually use.

So, referring to our discussion before, the discontinuity of densities (or more generally the discontinuity of “size-standardised” limit probabilities of small sets) is actually more important to this result than that of the expectation.

Dave and Deborah: The following comment on Dave’s blog is quite interesting:

“My own approach to this problem has been from the application end. Many attempts to apply Bayesian analysis to complex systems have resulted in conclusions that have later been seen to be ridiculous, or which even if correct have not commanded the confidence of decision-makers.”

We would be very much interested if you had good references for this statement (from the application area point of view). Numerous people (from the application side) told us the exact same thing (this and the fact that they also observe extreme brittleness of posterior conclusions upon slight changes in priors or upon repeating the experiments) but people tend to not publish negative results in the application side so it is not easy to find these references.

The overall picture ends up being quite biased (if we can use that term), with positive results being published and negative results ignored (Deborah: we are aware of the papers by Stephen Senn and yours on “cultivating Senn’s ability”, which we have found very interesting).

Dratman: Thanks for your comment. The brittleness theorems suggest that even if your guess of the prior distribution is good (in TV or Prokhorov metrics) then you may run into trouble by conditioning if you remain within the strict framework of Robust Bayesian analysis posterior to the observation of the data. It is possible that this situation might be remedied by a (non Bayesian) analysis prior to the observation of the data.

Larry: First thanks a lot for providing these references which are indeed related. The core mechanism allowing us to prove those brittleness theorems is similar to the one discovered by Bahadur-Savage and Donoho. We also observe that our brittleness results are not caused by tail properties but by the process of Bayesian conditioning (henceforth we are dealing with spaces of measures over measures and had to develop the required reduction calculus to obtain those brittleness theorems). For this reason the results of Bahadur-Savage and Donoho do not apply if the class of distributions share a common compact support (see also Romano, 2004, On Non-parametric Testing, the Uniform Behaviour of the t-test, and Related Problems) whereas our brittleness theorems still apply for such compact supports (for the interval $[0,1]$ for instance). Those results by Bahadur-Savage and Donoho are also quite relevant to our sequel work (in terms of what can be done when one tries to find optimal estimators based on analysis prior to the observation of the data).

We also realize that what we have shown is that the process of conditioning on data at fine enough resolution is sensitive (as defined in your 1988 “Sensitive parameters” paper, modulo a small technicality) with respect to the underlying distributions (under TV and Prokhorov). Thanks again, we really appreciate those pointers.

Corey: Thanks for your questions. Your first question concerns our sequel work, the short answer would be that convergence errors can be bounded by robustness estimates “prior” to the observation of the data.

Your intuition is accurate but it is not strictly zero in the perturbed model. The basic idea is that two measures of probability can be very close in TV norm but have very different probabilities of observing the data (if the resolution is fine enough).

The results do hold for computable measures (you get brittleness at finite resolution). The notion of presistency concerns our sequel work (analysis prior to the observation of the data) where you have to introduce the notion of near optimal statistical estimators.

Christian: Thanks for your comments. Your intuition is correct. The troubles caused by Prokhorov/TV (with robustness estimates posterior to the observation of the data) may be avoided by going through an analysis prior to the observation of the data (by looking at statistical errors in the worst-case scenario with respect to what the data-generating distribution could be), but you have to exit the strict framework of Bayesian Inference to do that (and start using frequentist types of error estimates; this concerns our sequel work).

Houman and Clint

Thanks for the clarification

Houman is referring to my comments on Senn’s paper: “You may believe you are a Bayesian but you are probably wrong”

There was a “U-Phil” (links are on my March 11, 2012 post https://errorstatistics.com/2012/03/11/2724/)

I nearly forgot that we also published the follow-ups: “How Can We Cultivate Senn’s Ability, Comment on Stephen Senn, ‘You May Believe You are a Bayesian But You’re Probably Wrong’” and Senn’s, “Names and Games, A Reply to Deborah G. Mayo,” under the Discussion Section of Rationality, Markets, and Morals (Special Topic: “Statistical Science and Philosophy of Science: Where Do/Should They Meet?”). http://www.rmm-journal.de/downloads/Comment_on_Senn.pdf

Houman, My only reference is, like you, to go and talk to those who have worked at the bleeding edge. I have tried publishing myself and am still looking for references. Any suggestions, anyone?

Like Andrew Gelman, I see the main problems as arising in those areas where we lack adequate prior models. It seems to me that we ought to condition our conclusions on our models, and that if Bayesians (of whatever stripe) were more explicit about their assumptions their conclusions would be both more credible and useful. But there is also scope for developing more general methods.

Deborah: Thanks a lot for the references!

Dave: Thanks for your reply. Besides the references provided by Deborah above, the reference that we are aware of is another interesting and related discussion in chapter 15 of “Robust Statistics” by Huber and Ronchetti.

One anecdotal comment that we got from people who have worked on the safety assessment of Deepwater Horizon is that 5 different Bayesian studies led to 5 remarkably distinct conclusions.

Houman

Houman: I’m not sure whether you are referring to pre-spill measurements (e.g., of blowout preventer capacity) or post-spill measures. I’m guessing the latter. As I understand it, they were desperately trying to measure the spill rate; they sought a wide variety of techniques because no one really knew how to measure deep water spills of this type. The company started out with an absurdly low spill rate. Eventually a number of novel techniques were invented. Side issue: this blog happens to be related to the spill because it began as a forum to discuss papers growing out of my June 2010 conference at the LSE “Statistical Science Meets Philosophy of Science”*, and it was during the conference planning in April 2010 that the Macondo well exploded (and my Diamond Offshore crashed).

http://rejectedpostsofdmayo.com/2013/01/22/philstock-beyond-luck-or-method/:

Having gotten the irrelevancies out of the way, I’d be very interested to know more about the examples you mention.

*https://errorstatistics.com/2012/10/17/rmm-8-new-mayo-paper-statsci-and-philsci-part-2-shallow-vs-deep-explorations/

Deborah: Thanks for the clarification, the link and the reference (“Statistical Science and Philosophy of Science Part 2: Shallow versus Deep Explorations”). Yes it is post-spill and the comment we got is that the confidence intervals for total leak, as obtained by 5-6 different analyses, had empty intersection. The other example that struck me concerned materials in extreme environments (the analysis was done by a mathematician who reported her surprise at the sensitivity of posterior conclusions).

Unfortunately our sources are anecdotal so I have no precise references to give you.

Houman

Normal Deviate linked to our discussion, and also announced an on-line conference being planned on the future of statistics:

http://normaldeviate.wordpress.com/2013/09/17/two-announcements/

Regarding a non-Bayesian feedback loop, is not this device always required when moving between “small worlds” (e.g. parameterizations, models, etc)? This issue has vexed and perplexed me since I was a young student. The Bayesian algorithm seems overly rigid; specifying an entire joint distribution that includes all possibilities, and simply moving around within this space, cannot possibly capture the scientific process, either descriptively or normatively (I still think the DOB approach can be useful).

Assuming I understand the meaning of non-Bayesian feedback, here are some interesting readings on the topic.

I think this idea was touched on by Tony O’Hagan when interviewing Dennis Lindley. Lindley doesn’t like it when O’Hagan says something like “anything outside the small world is given P()=0”. With zero probabilities, I think Bayesian updating is precluded. At the 1959 Joint Statistics Seminar of Birkbeck and Imperial Colleges (see Savage’s “Foundations of Statistical Inference”), Lindley tries to avoid the issue by arguing that epsilon is allocated to “something else”, but obviously the likelihood for this SE is ill-defined (and common sense posteriors likely don’t follow from priors and likelihoods; also see comments on Good’s “reverse inference”).

This topic is also discussed by G. Larry Bretthorst in “Bayesian Spectrum Analysis and Parameter Estimation” page 55-57. Bretthorst argues that odds ratios for specified hypotheses remain unchanged by the SE hypothesis. It seems that this is the Lindley argument in disguise. (I’ve seen a point similar to Bretthorst’s argued against in the literature on Bayesian epistemology; alas, my memory fails on specifics).

I have also seen this point discussed in an open peer commentary in Behavioral and Brain Sciences (doi: 10.1017/S0140525X09000284). The main article was about models of cognition that use Bayes’ Theorem, and the commentary was by Keith Stenning (Human Communications researcher) and Michiel van Lambalgen (Logic/Cog Sci researcher). Stenning and van Lambalgen discuss the trouble with updating on zero probability events, and how there “are no rationality principles governing changes in probabilities when enlarging the probability space”. They also address what I’ll call the epsilon argument of Lindley, concluding it is ultimately unsatisfactory due to its computational complexity. Interestingly, humans tend to do this “non-Bayesian” updating quite easily (it’s common sense, really!).

Good seems to get around this idea by inferring priors from posteriors (discussed at 1959 Seminar and Good Thinking). Since coherence is the only thing that BT requires, one can infer the priors that are implied by some posteriors. Although in a purely mathematical sense this preserves coherence, it seems deeply problematic.

David H. Wolpert talks about related issues in “Reconciling Bayesian and Non-Bayesian Analysis” (available online), which I highly recommend in addition to his other work on Bayesian stats. Wolpert has significantly influenced my thinking, particularly with his “extended Bayesian framework”.

In summary, this “non-Bayesian” belief updating seems absolutely to capture scientific, and even common-sense, inference. It seems that, strictly speaking, Bayesian DOBs and Bayesian learning cannot give a complete account of normative or descriptive belief updating (scientific or common sense).

Biz: Thanks so much for your interesting comments. I’m not familiar with Wolpert’s non-Bayesian belief updating (I’ll look it up). My own view is that finding things out is not well captured by “belief updating” altogether, unless one just feels like attaching “I believe that” to every claim or inference. But then that claim must similarly have “I believe that” in front of it, and so on and so forth. Even if one does wish to speak this way, I don’t think formal probability captures the evidential moves at all. One of the reasons is the one you mention: there’s an “open universe” in science.

You refer to the 1959 Joint Statistics Seminar (which I dub “the Savage Forum” on this blog). Yes, Barnard hammered Lindley and Savage on this, arguing that given the relativity to choice of the Bayesian catchall factor, P(x|~H), there was no benefit over likelihoods. Moreover, the idea of leaving a small probability for “~H” would be very distant from science (see Nelder on this blog). Then again, I deny that we’re after highly probable hypotheses and theories in the first place. That would keep us too low to the observations.

I don’t understand the point you make regarding Bretthorst, and will be interested to check the BBS journal. The updating and downdating business (in Good and others) I regard as a disaster: you can adjust your prior after the data so as to get the posterior you want. True, I think Good imagined you’d contemplate zillions of possibilities in advance and then set your prior before-hand, but it’s unrealistic, and sure to change after the data.

Your doubts about the normative role of Bayesian conditioning are echoed by many, along with rejecting Dutch book arguments, etc. But then what’s left of Bayesian foundations?

Many of these topics are discussed on this blog, so if you’re interested, try the search. Thanks again.

Thanks, Mayo; I think I’ll be a regular reader from now on.

Your blog, along with Larry’s (and Efron’s recent work), kind of led to a crisis in my own philosophy; and, well... there’s been lots of presumably non-Bayesian belief updating for me in the last few weeks. But I think I’ve come out wiser, and addressed some of the issues that have been bothering me for the last few years.

Thanks for the resources!

Biz: If I can help nudge someone to this type of crisis–prying those baked-on scales loose–then I feel my efforts are worthwhile. Most of the time I don’t feel anyone is really listening. So thanks.

I came back to your post to reflect. I appreciate your comments (particularly on how science is done, and on Good; I totally agree), and realized I should clarify on Bretthorst.

Bretthorst goes into some depth about the “Something Else not yet thought of”=SE hypothesis (or model). BT says:

P(Hi|D) = [P(Hi)*P(D|Hi)]/P(D)

Adding SE affects P(Hi) and P(D). Bretthorst, like Lindley, handles P(Hi) by setting sum P(Hi) = a, so that P(SE) = 1-a. He says P(D) is not so easy to deal with, since P(D|SE) is indeterminate. He attempts a workaround, noting “the relative probabilities of the specified models are still well defined, because the indeterminates cancel out”, so:

P(Hi|D)/P(Hj|D) = P(Hi)/P(Hj) * P(D|Hi)/P(D|Hj)

The P(D) cancels out of this ratio. That means the relative prior and posterior probabilities between all Hi, excluding SE, are the same whether SE is included or not. Next, Bretthorst argues that: “while it is not wrong to introduce an unspecified “Something Else” into a probability calculation, no useful purpose is served by it, and we shall not do so here”. Obviously, the DOB+small worlds is essential here. Also, it is assumed that specifying P(SE) doesn’t impact P(Hi)/P(Hj), although this might not be true (writing jogged my memory; I’ve read this criticism).
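
As a quick numerical check of the cancellation of P(D), here is a minimal Python sketch. The priors, the likelihoods, the value of a, and the trial values for P(D|SE) are all made-up numbers chosen for illustration, not anything taken from Bretthorst, and the function name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem on a finite hypothesis set: P(H|D) = P(H) * P(D|H) / P(D)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(D), summed over every listed hypothesis
    return {h: joint[h] / evidence for h in joint}

# Assumed (illustrative) likelihoods P(D|Hi) for the two specified models.
likelihoods = {"H1": 0.30, "H2": 0.10}

# Small world: only H1 and H2, with equal priors.
small_world = posterior({"H1": 0.5, "H2": 0.5}, likelihoods)
odds_without_se = small_world["H1"] / small_world["H2"]

# Enlarged world: sum of P(Hi) = a, so P(SE) = 1 - a.
a = 0.9
results = []
for lik_se in (0.01, 0.2, 0.9):  # P(D|SE) is indeterminate, so try several values
    post = posterior(
        {"H1": 0.5 * a, "H2": 0.5 * a, "SE": 1 - a},
        {**likelihoods, "SE": lik_se},
    )
    results.append((lik_se, post["H1"], post["H1"] / post["H2"]))

print("odds H1:H2 without SE:", odds_without_se)
for lik_se, p_h1, odds in results:
    print(f"P(D|SE)={lik_se}: P(H1|D)={p_h1:.4f}, odds H1:H2={odds:.4f}")
```

Under these assumed numbers the H1:H2 odds are the same for every trial value of P(D|SE), while the absolute posterior of H1 moves around. That is the sense in which P(D|SE) is indeterminate but drops out of relative comparisons. The sketch keeps the prior ratio P(H1)/P(H2) fixed when SE is added, which is exactly the assumption questioned at the end of the paragraph above.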

In my view, this is Lindley in disguise. First, he goes with the “epsilon argument”. Then, he argues that the SE hypothesis should be ignored since relative probabilities are not affected by it, which seems to be the same thing as saying that all probabilities are conditional on a small world. But it leaves unanswered the question of how to deal with the specification of SE after its nature becomes apparent, due to thought or evidence.

Biz: (I corrected your post).

Coincidentally, I wrote a second post on Barnard last week (for his birthday) that I didn’t put up (because the Gelman talk came up), and it entirely focused on his bringing up the SE (something else)–or Bayesian catchall–at the 1959/1962 Savage Forum. I’ll put it up at another time. But if people want to see the pages, I do have the entire forum scanned by a very rustic, out-in-Elba woods scanner at the bottom of the following post:

https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/

The relevant section is ~pp 80-82 or so.

These old things should be reprinted. (Barnard and D. Cox were editors, by the way.) I thank A. Spanos for finding me a physical copy.

Mayo: Thanks! I’d love to hear your thoughts on the Bayesian catchall. From what I gather, I will find lots of interesting (and free!) reading material on your blog.

PS I’m spoiled by the second largest academic library system in the US; it’s knowing what to read that’s hardest for me! References in your blog should help; thanks.

Biz: The Barnard post from last week is here:

https://errorstatistics.com/2013/09/23/barnards-birthday-background-likelihood-principle-intentions/

(I just realized you were commenting on a post prior to that one).

Thanks for your interest.

Isn’t what Owhadi et al. are saying that, regardless of the modeling paradigm (Bayesian or not), in some (pathological, albeit common) situations we can never be sure of our models when built and tested on finite datasets? And some other yet to be defined meta-level inference is necessary to reason beyond the model space upon which we chose to impose a prior, or to consider in a non-Bayesian approach? If so, this should cap the hyperbole in the claims of Bayesians as to the universality of the robustness of Bayesian analysis, especially as a model for general scientific inquiry. But it doesn’t mean there are any known non-Bayesian methods we can turn to that offer us any greater reassurance about our models. So what shall we do about it? Is discovering this “what” one of the lures of UQ research?

Apollo: Thanks for your comment. You can never be sure that your model (Bayesian or not) is exact when tested against a finite data set (if the space of possible outcomes is not finite) but you can still bound the statistical error of your model (based on a finite data set if you exit the strict Bayesian framework).

What we have shown is that if your model/updating rule is Bayesian then closeness to the (true) data generating distribution (measured in TV/Prokhorov metrics or the number of accurate generalized moments) is not sufficient to guarantee accuracy based on a posterior robustness analysis (the issue remains open for the robustness analysis prior to the observation of the data but for that you have to exit the strict framework of Bayesian Inference). In fact you could use the reduction theorems of arXiv:1304.6772 to show that even if, in addition to closeness (as described above), your model accurately captures the (true) probability of observing the data up to a multiplicative constant, you still get brittleness as that constant grows large/deviates from one.

I would definitely agree with “discovering optimal models/estimators (given the information at hand)” being one of the “objectives” of UQ research: i.e. you want to be able to process available information in an optimal way and come up with the sharpest (rigorous) interval of confidence on the quantity of interest. These optimal models may be non-Bayesian or characterized by non-Bayesian updating rules (depending on available information).

Observe that this program is challenging in two major ways:

(1) Finding an optimal statistical estimator/model remains to be formulated as a well posed problem when information on the system of interest is incomplete and comes in the form of a complex combination of sample data, partial knowledge of constitutive relations and a limited description of the distribution of input random variables.

(2) The space of admissible scenarios along with the space of relevant information, assumptions, and/or beliefs, tend to be infinite dimensional, whereas calculus on a computer is necessarily discrete and finite.

Houman

Houman:

Glad to have you back–thanks for the new comment. You mention again the advantage of an essentially non-Bayesian analysis prior to the observation of the data. Since a pre-data perspective is what allows error statistical considerations to enter more generally, I’m guessing there’s a connection to what’s going on here too, even though this is a vague and crude suspicion. So here’s a vague question (to which I’d be very glad to have a vague, intuitive answer): how does the pre-data analysis manage to “take into account” the worst case in the complex/restricted knowledge situation you describe?

Deborah: Thanks for your question. To take into account the worst case in the complex/restricted knowledge situation that I describe, the pre-data analysis would have to bound the solution of an optimization problem. The statistical error of any model/estimator can be obtained by integration if the measure corresponding to the data generating distribution is known. If that measure is unknown then one can still bound that error (prior to the observation of the data) by bounding its maximum (sup) with respect to all admissible measures that could correspond to the data generating distribution (concentration of measure inequalities allow us to do that for frequentist estimators but they are not optimal).
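
To make the sup-over-admissible-measures idea concrete, here is a small Python sketch under strong simplifying assumptions that are purely illustrative and not taken from the paper: the estimator is the sample mean of n i.i.d. draws, the admissible set is just the Bernoulli(p) family with p on a grid, and the "statistical error" is the probability that the sample mean misses the true mean by at least t. It compares that worst case with Hoeffding's inequality, a standard concentration bound valid for every distribution supported on [0,1]. All function names are hypothetical.

```python
import math

def exact_error_prob(n, p, t):
    """P(|sample mean - p| >= t) for n i.i.d. Bernoulli(p) draws, by direct summation."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) >= t:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def worst_case_error_prob(n, t, grid_size=201):
    """Maximum of the exact error over a grid of p values (approximates the sup from below)."""
    ps = [i / (grid_size - 1) for i in range(grid_size)]
    return max(exact_error_prob(n, p, t) for p in ps)

def hoeffding_bound(n, t):
    """Hoeffding's two-sided bound 2*exp(-2*n*t^2) for variables in [0, 1], capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t**2))

# Assumed illustrative sample size and tolerance.
n, t = 50, 0.2
worst = worst_case_error_prob(n, t)
bound = hoeffding_bound(n, t)
print(f"worst-case error over the Bernoulli grid: {worst:.5f}")
print(f"Hoeffding bound:                         {bound:.5f}")
```

Because Hoeffding's inequality has to cover every distribution on [0,1], it can only sit at or above the worst case over the smaller Bernoulli family. In this toy setting that gap is one way to see why such bounds are valid but not optimal. Computing the exact sup is easy here only because the admissible set was reduced to a one-parameter family.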

The computation of optimal bounds on statistical errors (for the pre-data analysis described above) is difficult because the optimization variables are “measures and functions” or “measures over spaces of measures and functions” (depending on the nature of the information).

arXiv:1304.6772 provides reduction theorems for such (apparently computationally intractable) optimization problems (in addition to addressing measurability issues).

Houman

Houman: Hundreds of people are flocking to this post today, so your work must be getting extra-special attention this week. Too bad nobody is leaving a comment. Keep me posted.

Any idea where from?

Dave: not sure, it indicates “reddit”. It’s died down now.

Pingback: On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post) | Error Statistics Philosophy

Model criticism is the whole business of statistics and it is no bad thing to remind ourselves that, whatever our tastes in inference, results are sensitive to assumptions. Some of those assumptions are stated, some unstated, some quantifiable, some not (Jeffrey’s lime jelly bean). I would take to task both Bayesians and frequentists for paying insufficient attention to exchangeability, or whatever other prediction-enabling condition you adopt. Looking at Daniel Kahneman’s work on learning from data, we need “An environment that is sufficiently regular to be predictable.” We never spend enough time on that, or on ensuring that it is maintained once the modelling is “finished”. When did you ever see any piece of modelling work where there was a proper investigation of the exchangeability of the residuals?