Professor of Applied and Computational Mathematics and Control and Dynamical Systems,
Computing + Mathematical Sciences
California Institute of Technology, USA
“On the Brittleness of Bayesian Inference: An Update”
This is an update on the results discussed in http://arxiv.org/abs/1308.6306 (“On the Brittleness of Bayesian Inference”) and a high-level presentation of the more recent paper “Qualitative Robustness in Bayesian Inference”, available at http://arxiv.org/abs/1411.3984.
In http://arxiv.org/abs/1304.6772 we looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework, the data is fixed, and one computes optimal bounds on (i.e., the sensitivity of) posterior values with respect to variations of the prior in a given class of priors. It is already well established that when the class of priors is finite-dimensional, one obtains robustness. What we observe is that, under general conditions, when the class of priors is finite-codimensional, the optimal bounds on posterior values are as large as possible, no matter the number of data points.
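As a toy illustration of this mechanism (our own minimal sketch, not the construction used in the paper): take a boxcar likelihood of half-width d on the unit interval and the finite-codimensional class of priors on [0,1] with mean 1/2. A two-point prior eps·δ_t + (1−eps)·δ_b that satisfies the mean constraint can force the posterior mean to any point t within d of the observed datum, while giving the data arbitrarily small probability:

```python
# Hypothetical sketch: model X | theta ~ Uniform[theta - d, theta + d],
# prior class = all priors on [0, 1] with mean 1/2 (one linear constraint).
# If only the atom at t lies within d of the observed x, the posterior
# collapses onto t no matter how small eps is, while the prior probability
# of observing the data is proportional to eps.

def posterior_mean(x, t, b, eps, d):
    """Posterior mean under the two-point prior eps*delta_t + (1-eps)*delta_b."""
    like = lambda th: 1.0 / (2 * d) if abs(x - th) <= d else 0.0  # boxcar likelihood
    num = eps * like(t) * t + (1 - eps) * like(b) * b
    den = eps * like(t) + (1 - eps) * like(b)
    return num / den

x, d = 0.30, 0.10                        # observed datum and measurement resolution
for eps in (1e-2, 1e-6, 1e-12):
    for t in (0.21, 0.30, 0.39):         # any target inside [x - d, x + d]
        b = (0.5 - eps * t) / (1 - eps)  # second atom chosen to satisfy the mean constraint
        assert abs(x - b) > d            # b lies outside the likelihood window
        print(eps, t, posterior_mean(x, t, b, eps, d))  # posterior mean equals t exactly
```

Every value of t in [x − d, x + d] is attained exactly, so the optimal bounds on the posterior mean span the entire likelihood window, and shrinking eps (and hence the probability of the data under the worst prior) does not tighten them at all.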
Our motivation for specifying a finite-codimensional class of priors is to examine what classical Bayesian Sensitivity Analysis would conclude under finite information. The best way to understand this notion of “brittleness under finite information” is through the simple example already given in https://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/ and recalled in Example 1. The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (see Example 2 for an illustration of this phenomenon). This data dependence of worst priors is inherent to the classical framework, and the resulting brittleness under finite information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference [6].
Although these worst priors do depend on the data, “look nasty”, and make the probability of observing the data very small, they are not “isolated pathologies” but directions of instability (of Bayesian conditioning), and their number increases with the number of data points. Example 3 illustrates this point by placing a uniform constraint on the probability of observing the data in the model class. This example also suggests that learning and robustness are, to some degree, antagonistic properties: a strong constraint on the probability of the data makes the method robust but learning impossible, and, as the constraint is relaxed, learning becomes possible but posterior values become brittle.
Since “brittleness under finite information” appears to be inherent to classical Bayesian Sensitivity Analysis (in which worst priors are computed given the specific value of the data), one may ask whether robustness could be established under finite information by exiting the strict framework of Robust Bayesian Inference and computing the sensitivity of posterior conclusions independently of the specific value of the data. To investigate this question, we have, in http://arxiv.org/abs/1411.3984, generalized Hampel’s and Cuevas’ notion of qualitative robustness to Bayesian inference, quantifying the sensitivity of the distribution of the posterior distribution with respect to perturbations of the prior and of the data generating distribution, in the limit as the number of data points grows to infinity. Note that, contrary to classical Bayesian Sensitivity Analysis, in the qualitative formulation the data is not fixed; posterior values are therefore analyzed as dynamical systems randomized through the distribution of the data. To express finite information, we have used the total variation, Prokhorov, and Ky Fan metrics to quantify perturbations and sensitivities.
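For reference, these three metrics have the following standard textbook forms (quoted here for the reader’s convenience, not from the paper itself), for probability measures $\mu, \nu$ and, in the Ky Fan case, random variables $X, Y$ taking values in a metric space $(E, d)$:

```latex
% Total variation distance
d_{\mathrm{TV}}(\mu,\nu) \;=\; \sup_{A \in \mathcal{B}} \lvert \mu(A) - \nu(A) \rvert

% Prokhorov metric (A^{\epsilon} denotes the open \epsilon-neighborhood of A)
d_{\mathrm{P}}(\mu,\nu) \;=\; \inf\bigl\{\epsilon > 0 \,:\, \mu(A) \le \nu(A^{\epsilon}) + \epsilon
  \ \text{for all Borel sets } A \bigr\}

% Ky Fan metric between random variables
\alpha(X,Y) \;=\; \inf\bigl\{\epsilon > 0 \,:\, \mathbb{P}\bigl(d(X,Y) > \epsilon\bigr) \le \epsilon \bigr\}
```

The total variation metric is the strongest of the three; the Prokhorov metric metrizes weak convergence, and the Ky Fan metric metrizes convergence in probability.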
Since this notion of qualitative robustness is established in the limit as the number of data points grows to infinity, it is natural to expect that the notion of consistency (i.e., the property that posterior distributions converge towards the data generating distribution) will play an important role. Although consistency is primarily a frequentist notion, it is also equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close predictive distributions; it therefore matters to Bayesians as well. Fortunately, not only are there mild conditions which guarantee consistency, but the Bernstein-von Mises theorem goes further in providing mild conditions under which the posterior is asymptotically normal. The most famous of these results are due to Doob [1], Le Cam and Schwartz [4], and Schwartz [5, Thm. 6.1]. Moreover, the assumptions needed for this consistency are so mild that one can be led to the conclusion that the prior does not really matter once there is enough data. For example, we quote Edwards, Lindman and Savage [3]:
“Frequently, the data so completely control your posterior opinion that there is no practical need to attend to the details of your prior opinion.”
To some, the consistency results appeared to generate more confidence than perhaps they should. We quote A. W. F. Edwards [2, p. 60]:
“It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this ‘defence’ the better.”
In http://arxiv.org/abs/1411.3984 we have shown that the Edwards defence is essentially what produces the lack of qualitative robustness in Bayesian inference. In particular, the assumptions required for consistency (e.g., the assumption that the prior has Kullback-Leibler support at the parameter value generating the data) are such that arbitrarily small local perturbations of the prior distribution (near the data generating distribution) result in consistency or non-consistency, and therefore have large impacts on the asymptotic behavior of posterior distributions. See Example 4 and Example 5 for simple illustrations of this phenomenon, where the core mechanism generating the lack of qualitative robustness is derived from the nature of both the assumptions and the assertions of consistency results. These mechanisms are different from, and complementary to, those discovered by Hampel and developed by Cuevas. They suggest that consistency and robustness are, to some degree, antagonistic requirements (a careful selection of the prior is important if both properties, or their approximations, are to be achieved), and they also indicate that misspecification generates a lack of qualitative robustness (see Example 6 and Example 7).
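A minimal numerical sketch of this perturbation mechanism (our own toy example with an assumed Bernoulli model and grid prior, not one of the paper’s examples): removing a thin sliver of prior mass around the data generating parameter is a small total-variation perturbation of the prior (roughly 2% here), yet it destroys consistency entirely, since the posterior can never recover mass where the prior has none:

```python
import numpy as np

theta_star = 0.5                       # data generating Bernoulli parameter
grid = np.linspace(0.001, 0.999, 2001)

# Prior A: uniform weights.  Prior B: identical except a small "hole" of zero
# mass around theta_star -- a total-variation perturbation of roughly 2%.
prior_a = np.ones_like(grid)
prior_b = np.where(np.abs(grid - theta_star) < 0.01, 0.0, 1.0)
prior_a /= prior_a.sum()
prior_b /= prior_b.sum()

# A balanced Bernoulli sample, for definiteness: k successes in n trials.
k, n = 10000, 20000
loglike = k * np.log(grid) + (n - k) * np.log(1 - grid)
loglike -= loglike.max()               # stabilize before exponentiating

def post_mass_near_truth(prior, radius=0.01):
    """Posterior mass within `radius` of the data generating parameter."""
    post = prior * np.exp(loglike)
    post /= post.sum()
    return post[np.abs(grid - theta_star) < radius].sum()

print(post_mass_near_truth(prior_a))   # close to 1: the posterior concentrates at theta_star
print(post_mass_near_truth(prior_b))   # exactly 0.0: the hole makes consistency impossible
```

No amount of additional data repairs the perturbed prior, which is the sense in which the conditions for consistency are violently unstable under small local perturbations near the data generating distribution.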
In conclusion, the exploration of Bayesian inference in a continuous world has revealed both positive and negative results. However, positive results regarding the classical or qualitative robustness of Bayesian inference under finite information have yet to be obtained. To that end, observe that Example 3 suggests that there may be a missing stability condition for Bayesian inference in a continuous world under finite information, akin to the CFL condition for the stability of a discrete numerical scheme used to approximate a continuous PDE. Although numerical schemes that do not satisfy the CFL condition may look grossly inadequate, the existence of such perverse examples certainly does not negate the necessity of a stability condition. Similarly, although one may, as in Example 2, exhibit grossly perverse worst priors, the existence of such priors does not invalidate the need for a study of stability conditions for using Bayesian inference under finite information. Example 3 suggests that, in the framework of Bayesian Sensitivity Analysis under finite information, such a stability condition would strongly depend on how well the probability of the data is known or constrained in the model class, in addition to the class of priors and the resolution of our measurements.
Moreover, this question will increase in importance as Bayesian methods grow in popularity, owing to the availability of computational methodologies and environments for computing posteriors, such as Markov chain Monte Carlo (MCMC) simulations. Indeed, when posterior distributions are approximated using such methods, the robustness analysis naturally includes not only quantifying sensitivities with respect to the data generating distribution and the choice of prior, but also analyzing the convergence and stability of the computational method. This is particularly true in Bayesian updating, where Bayes’ rule is applied iteratively and computed approximate posterior distributions are then treated as prior distributions. The singular, and apparently antagonistic, relationship between qualitative robustness and consistency discussed here suggests that the metrics used to analyze convergence and qualitative robustness should be chosen with care, and not independently of each other.
To close, although we have recently stumbled over negative results, we look forward to the discovery and development of positive results regarding the robustness of Bayesian inference under finite information. We would like to thank Deborah for inviting us to post this update.
Houman and Clint
Note: Their earlier post is reblogged here.
[1] J. L. Doob. Application of the theory of martingales. In Le Calcul des Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, pages 23–27. Centre National de la Recherche Scientifique, Paris, 1949.
[2] A. W. F. Edwards. Likelihood. Johns Hopkins University Press, Baltimore, expanded edition, 1992.
[3] W. Edwards, H. Lindman, and L. J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70(3): 193, 1963.
[4] L. Le Cam and L. Schwartz. A necessary and sufficient condition for the existence of consistent estimates. The Annals of Mathematical Statistics, pages 140–150, 1960.
[5] L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4: 10–26, 1965.
[6] L. Wasserman and T. Seidenfeld. The dilation phenomenon in robust Bayesian inference. J. Statist. Plann. Inference, 40: 345–356, 1994.
Dear Houman and Clint:
Thank you so much for this update. I’m afraid that I’d need a “plain Jane” tutorial this time as well, though I understand if that’s not feasible. I’ll try to learn from the discussion. I was wondering, do the following points from your earlier post still hold wrt this update?
“This situation might be remedied by computing robustness estimates prior to the observation of the data with respect to the worst case scenario of what the data generating distribution could be. ……
Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy.”
The answer is no with respect to computing robustness estimates prior to the observation of the data (i.e., you can still get non-robustness under finite information). The moral of the story appears to be that you have to give up on something if you want robustness, and that something could be convergence/consistency (this could be done by coarsening/reducing the problem, which could impose a limit on the complexity of the system/resolution of measurements beyond which your predictions become non-robust).
The “non-Bayesian feedback loop” appears to have been suggested by several authors as a way to validate Bayesian predictions. In the qualitative robustness framework the degree of non-robustness corresponds to the degree of misspecification and frequentist estimates could be used to estimate that degree of misspecification.
Houman & Clint
A brief comment, based upon a very quick read.
First, a question: are you aware of the fundamental theorem of probability/prevision? It seems related (although it is for discrete spaces only); see de Finetti, Lad, or Walley.
Second: a direct implication seems to be that using moments to specify the prior doesn’t really work… but other methods could be used.
For your Example 1, you could discretize the space between 0 and 1 and then specify intervals of probability for all possible events.
The simplest possible case would be to divide the interval into 2, then use the count in the first region as a non-sufficient statistic and use Polya distributions to specify lower and upper probabilities on all possible events. It seems this sort of specification would not “shatter”.
Thanks for your comment and for the pointer. Concerning possible connections with the FTP, note that here we are dealing with measures over measures, and (contrary to the bounds on prior values) the bounds on posterior values do not converge towards each other as you increase the number of (possibly linear) constraints on your prior.
Concerning your suggestion of discretizing the space: yes, you would get robustness if the discretization level is coarse enough (but below a non-zero/non-asymptotic level of discretization you will get brittleness). Note that by discretizing the space you are giving up some degree of convergence/consistency.
Houman & Clint
In looking up some of Owhadi and Scovel’s work, I came across an interview of them here.
The interview doesn’t really illuminate the brittleness business, but it provides glimpses of a philosophy of science sort. So I take the liberty of going possibly off topic, weaving some remarks (prefaced by “Mayo”) into snippets from that interview. “Paradigm change” is that one phrase of Kuhn’s that has made it into the common lexicon, and Owhadi and Scovel have in mind a possibly radical paradigm change in science. Here’s what they say…
Owhadi: … – 200 years ago, if I were to ask you to solve a partial differential equation, you wouldn’t use a computer, you would probably use your brain. You would probably not come up with a quantitative estimate of the solution, but only qualitative estimates.
Now, if I ask you the same question today, you will not use your brain to solve the partial differential equation – you will use a computer. But you will still use your brain to program the computer….
Now today if I asked you to find a statistical estimator, or to find the best possible climate model, or to find a test that will tell me if some data that I’m observing is corrupted or not – you’re not going to use a computer … You’re going to use your brain and guess work.
What we want to do here is basically turn this guesswork into an algorithm that we’ll be able to implement on a high performance cluster.
If you look at the mathematics behind this problem of turning the process of scientific discovery into an algorithm, you can basically translate it into an optimization problem….
Mayo: What happens to creativity and paradigm change living in this computerized reality? How do you break out of the paradigm your super computer creates?
Owhadi: .. Part of this program and this paradigm shift that we’re talking about is actually coming up with formulation of what it means to be an optimal solution to these things. That’s actually a big part of the program. It’s not just, I know what I want to compute and I need more computing power. It’s actually, what do we want to compute and what does it mean to be the best.
Mayo: Now there’s nothing so radical about wanting to turn scientific “discovery” into an algorithm, people have talked of doing this for donkey’s years…. But can you imagine being locked into the scientific framework thought to be optimal by a group of elites at a given time (with their desires, biases, politics implicit in the choices?) Brittleness indeed.
Owhadi: …. You need to have everybody communicating so you’re formulating not only what the objectives that you’re interested in but you formulate what pieces of information you have good confidence about, and those establish a quantitative set of realistic paradigms. ..
Mayo: Sounds dangerous. Who is “everybody”? Who would decide the facts, values, goals in which to have ‘confidence”? How will it be possible to entirely challenge the reigning “model of reality” embedded in their super-computer? You’d have to employ the reigning rules for doing so, and we know from the historical record that would be a disaster, that all rules, methods, goals and values have had to be shattered (if not all simultaneously)—but what if we’re stuck in Owhadi and Scovel’s super model of reality, wherein “creativity” and “guesswork” is relative to the optimization choices at time t?
Owhadi: ….. What this technology that we are developing will allow us to do – this is basically a long term vision – is give a computer the ability to develop a model of reality…
Mayo: I hope they will leave a few anarchists, rabble-rousers, exiles, and creative geniuses outside the paradigm that by historical accident was deemed optimal by those in power at time t. Or perhaps several alternative computerized realities–enough to foment revolutions.
This is off topic but here is a brief answer.
What we have in mind is a form of calculus that would enable the algorithmic implementation of a generalization of Wald’s decision theory framework. In that generalization optimal models are found as (and the process of discovery is facilitated via) optimal strategies of non-cooperative (min max) information games. Is there some evidence that this algorithmization of the process of scientific discovery could be done in practice? We think that the answer to that question is positive, see http://arxiv.org/abs/1304.7046 for an example in pure mathematics (based on the form of calculus mentioned here) and http://arxiv.org/abs/1406.6668 for an example in applied mathematics (based on the non-cooperative information game formulation).
Houman & Clint
Below is the youtube link to the interview that Prof Mayo mentioned in her comment above: http://youtu.be/QEaZE3bCHkw
This is all very interesting. Thanks for posting this here. A discussion of the Mayo remarks on the interview may deserve a separate posting.
Anyway, a question regarding the work on Qualitative Robustness in Bayesian Inference.
Might a different condition than the uniform probability one used in Example 3 reconcile some kind of robustness with some kind of consistency?
What about something related to smoothness or unimodality of the densities in the class of priors, while still allowing them to be close to zero somewhere? This would not allow the “sudden drops” around potentially observed data that seem to cause trouble, but would still make the class sensitive enough to have the posterior follow the data. Also, such a condition could be seen as “natural”, at least in setups where there is no reason to rule out a priori something that is surrounded by something very likely.
(I kind of expect that you tell me that and why this is not possible… I’m asking this for clarification, not because I would have thought enough about this to really believe that it could be a solution.)
Thanks for your comment. Example 3 concerns the classical Robustness framework.
In the Qualitative Robustness (QR) framework (Examples 4–7) we look at the sensitivity of the distribution of posterior distributions; in that (QR) setting the data is not fixed but randomized through the data generating distribution. The mechanism causing non-robustness in the QR setting is not based on priors having “sudden drops” around the data but on priors that put an arbitrarily small amount of mass around the data generating distribution (or the data generating parameter). This small amount of mass is sufficient to generate consistency or non-consistency, and it therefore has a large impact on the distribution of posterior distributions. The problem has to do with the fact that the conditions generating consistency are local (in the space containing the data generating distribution, or in the parameter space), whereas robustness is akin to a global property.
Houman & Clint
Dear Houman & Clint, thanks, I got that. Now, looking at Examples 4 and 5, aren’t you basically saying that if the true \theta is \theta^*, you create a robustness problem by giving a ball around \theta^* prior probability zero, in which case consistency can no longer hold?
This doesn’t seem very remarkable to me, to be honest. If you can’t rule out \theta^* being the true parameter, why should you be interested in a prior that gives a ball around \theta^* probability zero – be it close or not to the prior you want that doesn’t have this property?
Robustness is interesting for figuring out whether small changes in distributions – **so small that you can hardly detect them given the amount of data available**, so that you can never be sure that it’s not the “bad” distribution at work instead of the “good” one you’re assuming – make big differences in inference.
But the thing with the prior is that you can be sure that you don’t want a prior (and it can’t be the true prior in any sense) that puts probability zero around \theta^* if \theta^* is a serious contender for being the true parameter.
So does your formally correct result actually have any worrying implications for Bayesians? They could always say: “OK, there are priors that get us into a mess, even in a neighbourhood of the ones we love, but we will avoid them because they look silly”. (Whereas a frequentist cannot “avoid” distributions she doesn’t like.)
Thanks for your comment. There are several aspects to your questions. Roughly:
If your parameter space \Theta is compact and the model well-specified (the data is generated from a parameter in that space), then, indeed, you should choose a prior satisfying Cromwell’s rule (putting mass in the neighborhood of all parameters) and your prior will be qualitatively robust (and the degree of robustness will be a function of how much mass you put in each neighborhood).
If \Theta is non-compact (unbounded), then your prior can’t be qualitatively robust (because, no matter how small \epsilon is, you can always find a neighborhood in the parameter space having mass smaller than \epsilon).
If \Theta is compact and your model is misspecified, then, even if your prior is nice and smooth, these results suggest it is not qualitatively robust (with a degree of non-robustness corresponding to the degree of misspecification; your prior doesn’t need to look silly to be non-qualitatively robust).
Now, in the classical robustness framework, the extremizers of posterior values do indeed “look silly”, but these are only directions of instability (your prior doesn’t need to “look silly” to be sensitive to these instabilities).
In the qualitative case, we agree that (the simple) Example 4 does not appear “very remarkable”, but it is true nevertheless, and it poses a serious challenge to proving robustness in the TV metric or any weaker metric, such as those used in the convergence of MCMC.
In which situations should we care? Well, imagine, for example, that you are using a sophisticated numerical Bayesian model to predict the climate, where Bayes’ rule is applied iteratively and posterior values become prior values for the next iteration.
How do you make sure that your predictions are robust, not only with respect to the choice of prior but also with respect to numerical instabilities arising in the iterative application of Bayes’ rule? The non-robustness results discussed here suggest that, unless your prior is chosen very carefully and you have very tight control on numerical instabilities/errors/approximations at each step of the iteration, your final predictions will be very unstable.
Note that oftentimes these posterior distributions (which are then treated as prior distributions) are only approximated (e.g., via MCMC methods). How do you go about ensuring the stability of your method in such situations? (The brittleness results discussed here suggest that having strong convergence of your MCMC method in TV would not be enough to ensure stability.)
At a higher level, these results appear to suggest that robust inference (in a continuous world, under finite information) should be done with reduced/coarse models rather than highly sophisticated/complex models (with the level of “coarseness/reduction” depending on the available “finite information”).
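The iterative-updating worry above can be sketched numerically (our own hypothetical toy example, with an assumed Bernoulli model on a parameter grid, not a construction from the papers): iterate Bayes’ rule over batches of data, treating each computed posterior as the next prior, and inject a one-time MCMC-like approximation error — a failure to put mass on a tiny neighborhood of the true parameter. The error is small in total variation when injected, but it can never be repaired by later batches, and the distance to the exact run grows instead of washing out:

```python
import numpy as np

grid = np.linspace(0.001, 0.999, 2001)
theta_star = 0.5
hole = np.abs(grid - theta_star) < 0.002   # tiny region the approximation misses

def bayes_update(prior, k, n):
    """One application of Bayes' rule for k successes in n Bernoulli trials."""
    post = prior * grid**k * (1 - grid)**(n - k)
    return post / post.sum()

def tv(p, q):
    """Total variation distance between two grid distributions."""
    return 0.5 * np.abs(p - q).sum()

exact = np.full_like(grid, 1.0 / grid.size)
exact = bayes_update(exact, 100, 200)      # first batch (balanced, for definiteness)
approx = np.where(hole, 0.0, exact)        # inject the approximation error once
approx /= approx.sum()
tv_injected = tv(exact, approx)

for _ in range(199):                       # 199 further batches, both runs updated exactly
    exact = bayes_update(exact, 100, 200)
    approx = bayes_update(approx, 100, 200)

print(tv_injected)        # small when injected (~0.05 here)
print(tv(exact, approx))  # roughly an order of magnitude larger after iteration
```

As the data accumulate and the exact posterior concentrates inside the missed neighborhood, the TV distance between the two runs keeps growing towards 1, which is the sense in which strong TV convergence of the sampler at each step would not, by itself, guarantee stability of the iterated scheme.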
Houman & Clint