Professor of Applied and Computational Mathematics and Control and Dynamical Systems,
Computing + Mathematical Sciences
California Institute of Technology, USA
“On the Brittleness of Bayesian Inference: An Update”
This is an update on the results discussed in http://arxiv.org/abs/1308.6306 (“On the Brittleness of Bayesian Inference”) and a high-level presentation of the more recent paper “Qualitative Robustness in Bayesian Inference”, available at http://arxiv.org/abs/1411.3984.
In http://arxiv.org/abs/1304.6772 we looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework the data is fixed, and one computes optimal bounds on (i.e. the sensitivity of) posterior values with respect to variations of the prior within a given class of priors. It is already well established that when the class of priors is finite-dimensional, one obtains robustness. What we observe is that, under general conditions, when the class of priors is finite co-dimensional, the optimal bounds on posterior values are as large as possible, no matter the number of data points.
Our motivation for specifying a finite co-dimensional class of priors is to examine what classical Bayesian Sensitivity Analysis would conclude under finite information. The best way to understand this notion of “brittleness under finite information” is through the simple example already given in https://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/ and recalled in Example 1. The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (see Example 2 for an illustration of this phenomenon). This data dependence of worst priors is inherent to the classical framework, and the resulting brittleness under finite information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds may become less precise after conditioning) observed in classical robust Bayesian inference [6].
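This mechanism can be sketched in a small, self-contained computation. The toy model below is our own illustrative construction, not the one from the papers: we estimate the mean of an unknown distribution on [0, 1] from n samples observed at finite resolution r, and perturb the prior by a tiny mass placed on a "nasty" measure whose atoms sit at the observed values, so that its likelihood does not shrink as the resolution is refined while the evidence of every smooth model does. All names and numbers are assumptions made for illustration.

```python
import math

n, r = 10, 1e-3              # number of data points, measurement resolution
data = [0.1] * n             # observed values; the "true" mean is near 0.1

# Base class: mu_theta = Uniform[theta, theta + 0.2] on a coarse grid of theta.
# Each observation interval of width r has likelihood r / 0.2, so the evidence
# of every base model scales like (r / 0.2)**n and vanishes as r -> 0.
thetas = [0.0, 0.05, 0.1]

def base_loglik(theta):
    if all(theta <= x <= theta + 0.2 for x in data):
        return n * math.log(r / 0.2)
    return -math.inf

def base_mean(theta):
    return theta + 0.1       # mean of Uniform[theta, theta + 0.2]

# "Nasty" perturbation: a single measure mu_star putting mass t on atoms at
# the observed values (its likelihood therefore does NOT shrink with r) and
# mass 1 - t at the extreme point 1.
t = 0.5
mustar_loglik = n * math.log(t / n)          # each datum hits its atom w.p. t/n
mustar_mean = (t / n) * sum(data) + (1 - t) * 1.0

eps = 1e-6                                   # tiny prior mass on mu_star
logw = [math.log((1 - eps) / len(thetas)) + base_loglik(th) for th in thetas]
logw.append(math.log(eps) + mustar_loglik)
means = [base_mean(th) for th in thetas] + [mustar_mean]

m = max(logw)
w = [math.exp(l - m) for l in logw]
post_mean = sum(wi * mi for wi, mi in zip(w, means)) / sum(w)

# Without the perturbation the posterior mean is 0.15; the eps-sized
# perturbation drags it to roughly 0.55, and finer resolution r allows
# perturbations with even more extreme means to dominate.
print(round(post_mean, 3))
```

Note how the posterior jump occurs even though the perturbing measure assigns the data a probability of only (t/n)^n: what matters is that this probability is large relative to the evidence of the unperturbed model class.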
Although these worst priors depend on the data, “look nasty”, and make the probability of observing the data very small, they are not isolated pathologies: they are directions of instability of Bayesian conditioning, and their number increases with the number of data points. Example 3 illustrates this point by placing a uniform constraint on the probability of observing the data in the model class. This example also suggests that learning and robustness are, to some degree, antagonistic properties: a strong constraint on the probability of the data makes the method robust but learning impossible, and, as the constraint is relaxed, learning becomes possible but posterior values become brittle.
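The trade-off can be made concrete with a simple bound (our own simplified illustration, not the statement of Example 3): if every model in the class assigns the observed data a probability within a factor c of every other model's, then Bayes' rule can reweight prior mass by at most that factor.

```python
# Range of the posterior probability of a set with prior mass prior_prob,
# when likelihoods across the model class differ by at most a factor c.
def posterior_prob_bounds(prior_prob, c):
    lo = prior_prob / (prior_prob + c * (1 - prior_prob))
    hi = c * prior_prob / (c * prior_prob + (1 - prior_prob))
    return lo, hi

# c = 1: posterior = prior (fully robust, but nothing can be learned).
# c large: learning becomes possible, and so do large posterior swings.
for c in (1.0, 2.0, 100.0):
    lo, hi = posterior_prob_bounds(0.5, c)
    print(c, round(lo, 3), round(hi, 3))
```

The design choice here is deliberate: the same constant c that permits the posterior to move away from the prior (learning) also permits it to be moved by an adversarial reweighting (brittleness).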
Since “brittleness under finite information” appears to be inherent to classical Bayesian Sensitivity Analysis (in which worst priors are computed given the specific value of the data), one may ask whether robustness could be established under finite information by exiting this strict framework and computing the sensitivity of posterior conclusions independently of the specific value of the data. To investigate this question, we have, in http://arxiv.org/abs/1411.3984, generalized Hampel's and Cuevas's notion of qualitative robustness to Bayesian inference, quantifying the sensitivity of the distribution of the posterior distribution with respect to perturbations of the prior and of the data-generating distribution, in the limit as the number of data points grows towards infinity. Note that, contrary to classical Bayesian Sensitivity Analysis, in this qualitative formulation the data is not fixed: posterior values are analyzed as dynamical systems randomized through the distribution of the data. To express finite information, we have used the total variation, Prokhorov, and Ky Fan metrics to quantify perturbations and sensitivities.
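To make "small perturbation" concrete in the first of these metrics, here is a minimal computation (our own, with made-up discrete measures): an eps-contamination of a prior moves it by at most eps in total variation, so the perturbations under consideration can be made arbitrarily small.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions,
    represented as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

eps = 0.01
prior = {"a": 0.5, "b": 0.5}
# eps-contamination: move mass eps onto a new point "nasty".
perturbed = {"a": 0.5 * (1 - eps), "b": 0.5 * (1 - eps), "nasty": eps}
print(round(tv_distance(prior, perturbed), 6))
```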
Since this notion of qualitative robustness is established in the limit as the number of data points grows towards infinity, it is natural to expect that the notion of consistency (i.e. the property that posterior distributions converge towards the data-generating distribution) will play an important role. Although consistency is primarily a frequentist notion, it is also equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close predictive distributions; it therefore matters to Bayesians as well. Fortunately, not only are there mild conditions that guarantee consistency, but the Bernstein-von Mises theorem goes further in providing mild conditions under which the posterior is asymptotically normal. The most famous of these consistency results are those of Doob [1], Le Cam and Schwartz [4], and Schwartz [5, Thm. 6.1]. Moreover, the assumptions needed for consistency are so mild that one can be led to the conclusion that the prior does not really matter once there is enough data. For example, we quote Edwards, Lindman, and Savage [3]:
“Frequently, the data so completely control your posterior opinion that there is no practical need to attend to the details of your prior opinion.”
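This point is easy to reproduce in a conjugate toy example (our own illustration, with made-up numbers): two Bayesians with quite different Beta priors on a coin's bias report nearly identical posterior means after a thousand flips.

```python
# Beta(a, b) prior + Bernoulli likelihood -> Beta(a + heads, b + tails),
# so the posterior mean is available in closed form.
def beta_posterior_mean(a, b, heads, tails):
    return (a + heads) / (a + heads + b + tails)

heads, tails = 620, 380    # 1000 flips
means = [beta_posterior_mean(a, b, heads, tails) for a, b in [(1, 1), (10, 2)]]
print([round(m, 3) for m in means])   # both close to the empirical frequency 0.62
```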
To some, the consistency results appeared to generate more confidence than perhaps they should. We quote A. W. F. Edwards [2, p. 60]:
“It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this ‘defence’ the better.”
In http://arxiv.org/abs/1411.3984 we have shown that the Edwards defence is essentially what produces the lack of qualitative robustness in Bayesian inference. In particular, the assumptions required for consistency (e.g. the assumption that the prior has Kullback-Leibler support at the parameter value generating the data) are such that arbitrarily small local perturbations of the prior distribution (near the data-generating distribution) can produce consistency or non-consistency, and therefore have large impacts on the asymptotic behavior of posterior distributions. See Example 4 and Example 5 for simple illustrations of this phenomenon, where the core mechanism generating the lack of qualitative robustness is derived from the nature of both the assumptions and the assertions of consistency results. These mechanisms are different from, and complementary to, those discovered by Hampel and developed by Cuevas. They suggest that consistency and robustness are, to some degree, antagonistic requirements (a careful selection of the prior is important if both properties, or approximations thereof, are to be achieved), and they also indicate that misspecification generates a lack of qualitative robustness (see Example 6 and Example 7).
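A minimal, hypothetical illustration of this sensitivity (our own toy model, not the examples from the paper): removing a small neighborhood of the data-generating parameter from the prior's support, a perturbation whose prior mass can be made as small as one likes, prevents the posterior from ever concentrating at the truth. For brevity we summarize the posterior under a flat grid prior by its mode.

```python
import math

grid = [i / 100 for i in range(1, 100)]       # candidate coin biases 0.01 .. 0.99
true_p, n = 0.62, 10_000
heads = round(true_p * n)

def log_lik(p):
    return heads * math.log(p) + (n - heads) * math.log(1 - p)

def map_estimate(support):
    # Under a flat prior on `support`, the posterior mode is the
    # likelihood maximizer restricted to that support.
    return max(support, key=log_lik)

full = grid
# Perturbed prior: delete the mass near the truth (here 6% of the prior mass
# for visibility; the hole, and hence the perturbation, can be made smaller).
holed = [p for p in grid if not (0.60 <= p <= 0.65)]

print(map_estimate(full), map_estimate(holed))
```

With the full-support prior the posterior mode sits at the data-generating value 0.62; with the holed prior it is pinned at the edge of the deleted neighborhood, no matter how much data arrives, so consistency at 0.62 is lost while the two priors remain close.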
In conclusion, the exploration of Bayesian inference in a continuous world has revealed both positive and negative results. However, positive results on the classical or qualitative robustness of Bayesian inference under finite information have yet to be obtained. To that end, observe that Example 3 suggests that there may be a missing stability condition for Bayesian inference in a continuous world under finite information, akin to the CFL condition for the stability of a discrete numerical scheme used to approximate a continuous PDE. Although numerical schemes that do not satisfy the CFL condition may look grossly inadequate, the existence of such perverse examples certainly does not dismiss the necessity of a stability condition. Similarly, although one may, as in Example 2, exhibit grossly perverse worst priors, the existence of such priors does not invalidate the need for a study of stability conditions for using Bayesian inference under finite information. Example 3 suggests that, in the framework of Bayesian Sensitivity Analysis under finite information, such a stability condition would strongly depend on how well the probability of the data is known or constrained in the model class, in addition to the class of priors and the resolution of our measurements.
Moreover, this question will only grow in importance as Bayesian methods gain popularity owing to the availability of computational methodologies and environments for computing posteriors, such as Markov chain Monte Carlo (MCMC) simulations. Indeed, when posterior distributions are approximated with such methods, the robustness analysis naturally includes not only the quantification of sensitivities with respect to the data-generating distribution and the choice of prior, but also the analysis of the convergence and stability of the computational method. This is particularly true in Bayesian updating, where Bayes' rule is applied iteratively and computed approximate posterior distributions are then treated as prior distributions. The singular, and apparently antagonistic, relationship between qualitative robustness and consistency discussed here suggests that the metrics used to analyze convergence and qualitative robustness should be chosen with care, and not independently of each other.
To close, although we have recently stumbled upon negative results, we look forward to the discovery and development of positive results on the robustness of Bayesian inference under finite information. We would like to thank Deborah for inviting us to post this update.
Houman and Clint
Note: Their earlier post is reblogged here.
[1] J. L. Doob. Application of the theory of martingales. In Le Calcul des Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, pages 23–27. Centre National de la Recherche Scientifique, Paris, 1949.
[2] A. W. F. Edwards. Likelihood. Johns Hopkins University Press, Baltimore, expanded edition, 1992.
[3] W. Edwards, H. Lindman, and L. J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70(3): 193, 1963.
[4] L. Le Cam and L. Schwartz. A necessary and sufficient condition for the existence of consistent estimates. The Annals of Mathematical Statistics, pages 140–150, 1960.
[5] L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4: 10–26, 1965.
[6] L. Wasserman and T. Seidenfeld. The dilation phenomenon in robust Bayesian inference. J. Statist. Plann. Inference, 40: 345–356, 1994.